What Is an Evaluator?
A scoring function in an LLM evaluation pipeline that takes input, output, and optional context, and returns a score, label, and reason.
In LLM evaluation, an evaluator is the scoring function: the smallest reusable unit that turns a model’s output into a number plus a reason. It accepts an input prompt, the model’s response, and (optionally) retrieved context or a reference answer. It returns a structured result: a numeric score, a categorical label like pass/fail, and the reasoning behind the verdict. Evaluators are how you compose an eval suite: stack a Groundedness check, a JSONValidation check, a ToolSelectionAccuracy check, and a CustomEvaluation rubric, and you have a working quality gate. They are to evals what unit tests are to backend code: atomic, composable, versioned, and re-runnable.
In 2026 the evaluator surface has expanded well past single-turn QA. Frontier production systems run trajectory evaluators on multi-step agents, tool-call evaluators on MCP interactions, voice evaluators on STT/TTS pipelines, and policy evaluators on every limited-risk and high-risk decision under the EU AI Act. The class library that ships with fi.evals reflects that: 50+ evaluator classes covering eval, RAG, agent, compliance, voice, security, and custom-policy use cases.
Why evaluators matter in production LLM and agent systems
Without explicit evaluators, “quality” lives in someone’s head. The first product manager who runs the demo decides if it works; the next 10,000 users find out for themselves. Evaluators externalize that judgment as code, which means it survives turnover, gets versioned with the rest of the system, and runs on every release. In 2026 production stacks where frontier models are commodities and the moat is in the eval program, the maturity of your evaluator library is the single best predictor of release reliability.
Concrete failure modes that a missing evaluator hides:
- A JSON-output prompt that breaks for emoji-containing inputs, caught only after a downstream parser starts throwing. JSONValidation would have flagged it on day one.
- An agent that loops on the same tool, eating $40 of inference per request. TrajectoryScore and ToolSelectionAccuracy would have spiked.
- A RAG pipeline whose answer quality silently degraded after a chunking change. Groundedness and Faithfulness would have shown the regression on the next eval run.
- A hiring assistant whose refusal rate dropped 12% for female-coded names after a prompt rewrite. NoGenderBias would have caught it before deploy.
The pain hits ML engineers first (debugging quality regressions one trace at a time), then SREs (paged at 3am for a cost spike with no eval breadcrumb), then compliance (asked to attest to model behavior with no measurement record). In multi-step 2026 agent stacks, a single user request fans out into 5-15 LLM calls; one evaluator per step (planner-quality, tool-selection, response-faithfulness, refusal-policy) is what keeps trajectory failures from being mysteries. The Ragas faithfulness evaluator alone, for example, only inspects the final answer; it cannot tell you which retrieval step lost the relevant chunk or which tool call wrote stale state. That is why a real eval stack ships dozens of evaluator classes, not one. Compared with DeepEval and Promptfoo, which center on offline judge-LLM checks, our integration runs the same evaluator class both as a CI gate and as a production trace signal, so the offline score and the online score are produced by the same code path.
Three shapes of evaluator
The 50+ classes in fi.evals collapse into three implementation shapes. Knowing which shape an evaluator uses tells you its latency, cost, and reliability tradeoffs.
| Shape | Examples | Latency | Cost | Best for |
|---|---|---|---|---|
| Programmatic | JSONValidation, RegexMatch, ExactMatch, SQLInjectionDetector, ToolCallSchema | <5ms | Free | Schema / format / safety checks |
| Embedding-based | EmbeddingSimilarity, ContextRelevance (local), ChunkAttribution | 10-50ms | Embedding tokens only | Semantic similarity, retrieval quality |
| Judge-model-based | Groundedness, Faithfulness, AnswerRelevancy, TaskCompletion, TrajectoryScore, IsCompliant, CustomEvaluation, BiasDetection | 200-2000ms | Judge LLM call cost | Subjective quality, rubric checks, policy |
Production rule: programmatic evaluators run on every request, embedding-based evaluators on a large sample, and judge-model-based evaluators on a controlled sample (5-20% of traffic) plus 100% of release-gating regression evals. Mixing the three so the cheap ones gate the expensive ones is the cost pattern that scales.
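A minimal sketch of that tiering rule, for illustration only. The functions `check_json`, `judge_score`, and `evaluate_request` are hypothetical stand-ins, not fi.evals APIs, and the 10% sample rate is an assumed value inside the 5-20% range quoted above.

```python
import json
import random

JUDGE_SAMPLE_RATE = 0.10  # assumed value inside the 5-20% range


def check_json(output: str) -> bool:
    """Programmatic check: cheap enough to run on every request."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def judge_score(output: str) -> float:
    """Hypothetical placeholder for a judge-LLM call; returns a fixed score."""
    return 1.0


def evaluate_request(output: str, rng: random.Random, release_gate: bool = False) -> dict:
    """Always run the cheap check; run the judge only if it passes and is sampled."""
    result = {"json_valid": check_json(output), "judge": None}
    if not result["json_valid"]:
        return result  # the cheap failure gates the expensive evaluator
    if release_gate or rng.random() < JUDGE_SAMPLE_RATE:
        result["judge"] = judge_score(output)
    return result


rng = random.Random(0)
results = [evaluate_request('{"ok": true}', rng) for _ in range(1000)]
judged = sum(r["judge"] is not None for r in results)
print(f"judge ran on {judged} of {len(results)} requests")
```

Passing `release_gate=True` forces the judge to run on every row, which corresponds to the 100% coverage used for release-gating regression evals.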
How FutureAGI ships evaluators
FutureAGI’s approach is to ship evaluators as importable Python classes through fi.evals, plus a hosted equivalent for no-code teams. There are 50+ built-in evaluators across the three shapes above and seven domain families: eval (Groundedness, AnswerRelevancy, Faithfulness, GroundTruthMatch), RAG (ContextRelevance, ContextPrecision, ContextRecall, ChunkAttribution), agent (ToolSelectionAccuracy, TaskCompletion, TrajectoryScore), compliance (BiasDetection, NoGenderBias, IsCompliant, DataPrivacyCompliance), safety (Toxicity, PII, PromptInjection, Sexist), voice (ASRAccuracy, LipSync, TurnTakingQuality), and security (SQLInjectionDetector, JSONValidation). The CustomEvaluation class lets engineers encode a product-specific rubric as a judge check that runs on the same trace path as the built-ins.
Every evaluator implements the same contract: evaluate(input, output, context=None, expected=None, ...) returns a Result with score, label, reason, plus per-evaluator extras (Groundedness returns supported and unsupported claim lists; TrajectoryScore returns step-level annotations; BiasDetection returns detected categories). You can chain them: register multiple evaluators against a single Dataset.add_evaluation() call, and FutureAGI runs them in parallel and stores results columnwise.
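To make the shared contract concrete, here is a minimal sketch under stated assumptions. `Result`, `Evaluator`, `ExactMatchSketch`, and `run_all` are hypothetical names written for this example; they mirror the shape described above but are not the fi.evals implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class Result:
    score: float
    label: str
    reason: str
    extras: dict[str, Any] = field(default_factory=dict)  # per-evaluator extras


class Evaluator(Protocol):
    """Structural type: anything with a matching evaluate() method qualifies."""

    def evaluate(self, input: str, output: str,
                 context: Optional[str] = None,
                 expected: Optional[str] = None) -> Result: ...


class ExactMatchSketch:
    """Programmatic evaluator: passes when output equals the expected answer."""

    def evaluate(self, input, output, context=None, expected=None):
        if expected is None:
            return Result(0.0, "fail", "no expected answer supplied")
        matched = output.strip() == expected.strip()
        return Result(
            score=1.0 if matched else 0.0,
            label="pass" if matched else "fail",
            reason="output matches expected" if matched else "output differs from expected",
            extras={"expected_length": len(expected)},
        )


def run_all(evaluators, **row):
    """Run every evaluator on the same row; key results by evaluator class name."""
    return {type(e).__name__: e.evaluate(**row) for e in evaluators}


columns = run_all([ExactMatchSketch()], input="2+2?", output="4 ", expected="4")
print(columns["ExactMatchSketch"])
```

Keying results by class name is one simple way to get the column-per-evaluator layout; a real pipeline would also need to handle two instances of the same class with different configurations.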
A typical setup in 2026: an engineer instruments their app with traceAI integrations (traceAI-langchain, traceAI-openai-agents, traceAI-llamaindex, or a direct OpenTelemetry instrumentation), samples production into a Dataset, attaches four evaluators (Faithfulness, AnswerRelevancy, ToolSelectionAccuracy, and a CustomEvaluation for tone and policy), and wires the aggregate into the evaluate release gate. When the score drops below threshold on a new prompt version, Agent Command Center flips that route to model-fallback while the team investigates.
A concrete example: tool-using agent evaluator stack
A B2B SaaS team ships an agent that uses 6 internal tools (billing lookup, account modification, refund issuance, plan change, support ticket creation, and policy lookup), running on Claude Sonnet 4.6 through Agent Command Center. The evaluator stack:
- JSONValidation on every tool call schema (programmatic, <5ms).
- ToolSelectionAccuracy on which tool fires for a given intent (judge-LLM, sampled at 10%).
- TaskCompletion on whether the user’s original ask was resolved (judge-LLM, full coverage on the regression dataset, sampled at 5% online).
- TrajectoryScore on whether the multi-step path matched expectations (judge-LLM, full coverage on regression).
- Groundedness on policy-lookup answers (judge-LLM, full coverage on regression).
- IsCompliant with the team’s policy rubric (judge-LLM, full coverage on regression and gated escalation).
- PromptInjection plus PII on incoming user text (programmatic, every request).
Online cost lands at ~$0.008 per traced trajectory because expensive evaluators are sampled. Offline regression cost lands at ~$0.40 per dataset row because every evaluator runs against the full row. The release gate consumes the offline TaskCompletion mean and the per-cohort IsCompliant minimum; production routing consumes PromptInjection and JSONValidation flags. Same evaluator classes, different sampling rates, one unified score history.
Online vs offline evaluators
A senior engineer should know which evaluators run where. Offline (against a dataset) is the regression-eval surface: every release reruns the same evaluator stack against the pinned dataset version. Online (against live traces) is the production-monitoring surface: a sample of real traffic runs through the same evaluators, with cohort segmentation, so data drift and behavior changes surface fast. The principle that pays back: keep the same evaluator class on both sides. Different evaluators offline vs online means your release gate is grading something different from what your production dashboard is grading, and the two will silently diverge. FutureAGI’s fi.evals classes are designed for both modes; the only thing that changes between offline and online is the data the evaluator sees.
Evaluator stacks worth copying
A few production-grade evaluator combinations we see consistently in mature 2026 stacks:
- Single-turn QA: AnswerRelevancy plus Faithfulness plus a CustomEvaluation for tone, gated by JSONValidation for structured outputs.
- RAG: add ContextRelevance, ContextPrecision, ContextRecall, and ChunkAttribution to localize whether a failure came from retrieval, ranking, or generation.
- Agent / tool-using: ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, plus ToolCallSchema for the structural side of tool calls.
- Compliance-sensitive: BiasDetection, NoGenderBias, NoRacialBias, DataPrivacyCompliance, and IsCompliant with your policy rubric.
- Voice: ASRAccuracy, LipSync, TurnTakingQuality, plus the standard QA stack on the transcript.
- Safety: Toxicity, PII, PromptInjection, plus ProtectFlash for low-latency runtime filtering.
None of these stacks is exotic; they are the table-stakes combinations a senior 2026 engineer should be running before treating an eval program as complete.
Evaluator versioning and drift
Evaluators themselves drift. A Groundedness evaluator that uses a judge LLM is only as stable as that judge model. When the judge upgrades from Claude Sonnet 4.6 to 4.7, the score distribution shifts. The 2026 best practice is to pin the judge model inside the evaluator instance, version the rubric, and rerun the evaluator against a calibration set on every judge bump. We’ve found that teams that skip judge-model pinning lose 3-5 weeks per year to “the score moved but the system didn’t” investigations. Compare with how Promptfoo manages judge selection through the run config: it works, but the calibration discipline still lives with the engineer.
A practical versioning convention: every CustomEvaluation instance carries a semver-style rubric version, the pinned judge model snapshot, and a calibration kappa from the last reviewed run. When any of those three changes, the evaluator is treated as a new evaluator for release-gate purposes: old scores cannot be compared to new ones until the calibration is refreshed. This is the discipline that turns judge-LLM evaluation from “a number we trust because we ran it” into “a number we trust because we know its measurement error.”
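One way to encode that convention is a small immutable record compared by value. This is a sketch only: `EvaluatorVersion`, `scores_comparable`, and the judge snapshot strings are hypothetical names invented for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorVersion:
    rubric_version: str       # semver-style rubric version, e.g. "1.2.0"
    judge_snapshot: str       # pinned judge model identifier (hypothetical values below)
    calibration_kappa: float  # kappa from the last reviewed calibration run


def scores_comparable(old: EvaluatorVersion, new: EvaluatorVersion) -> bool:
    """Scores are comparable only when rubric, judge, and calibration are all unchanged."""
    return old == new


v1 = EvaluatorVersion("1.2.0", "judge-model-2026-01", 0.78)
v2 = EvaluatorVersion("1.2.0", "judge-model-2026-04", 0.78)

print(scores_comparable(v1, v1))  # True: nothing changed
print(scores_comparable(v1, v2))  # False: judge snapshot changed
```

Using a frozen dataclass means a version cannot be mutated in place; any change forces a new object, which matches the "treated as a new evaluator" rule.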
How to measure evaluator quality
Evaluators themselves need quality control. A useful 2026 evaluator-QA stack covers six signals:
- Calibration: agreement-with-humans (Cohen’s kappa or simple percent agreement) for judge-based evaluators. Aim ≥0.7 before trusting them in a release gate. Anything below 0.5 is a noise generator.
- Latency: judge-model evaluators add 200ms-2s per trace. Use fi.evals programmatic and embedding-based evaluators for hot-path use cases; gate judge-based evaluators behind sampling.
- Cost: llm.token_count.prompt × judge-model price × eval volume. Plot evaluator cost as a line item; at scale, judge-LLM eval cost can match or exceed the production LLM cost.
- Eval-fail-rate-by-cohort: the headline dashboard signal, that is, what percent of evaluated traces fail per evaluator, sliced by user cohort, locale, plan, or model route.
- Reason coherence: spot-check the reason field for hallucinated justifications. A high score with a wrong reason is a calibration bug; track reason-quality manually on a sample.
- Judge-model parity: when the same evaluator runs with different judge models, scores should correlate above 0.85. Below that, the judge choice is dominating the verdict.
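The calibration signal in the list above can be computed with a few lines of standard-library Python. The sketch below implements Cohen's kappa for two label lists; the `human` and `judge` labels are made-up illustrative data, not measurements.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: 10 rows rated by a human and by a judge evaluator.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")  # kappa = 0.583
```

Here the raters agree on 8 of 10 rows (80%), but kappa is only about 0.58 once chance agreement is removed, which would fall short of the 0.7 bar suggested above. That gap is why percent agreement alone can flatter a judge.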
Minimal Python:
from fi.evals import Groundedness, JSONValidation, ToolSelectionAccuracy, CustomEvaluation

# order_schema, q, a, docs, and trace are placeholders supplied by the caller.
evaluators = [
    Groundedness(),
    JSONValidation(schema=order_schema),
    ToolSelectionAccuracy(),
    CustomEvaluation(rubric="answer must cite source URL"),
]

# Run each evaluator on the same row and print a one-line summary.
for e in evaluators:
    result = e.evaluate(input=q, output=a, context=docs, trajectory=trace)
    print(e.__class__.__name__, result.score, result.label, result.reason[:120])
The healthy state: every evaluator has a documented calibration kappa, a pinned judge model where applicable, a measured cost-per-eval, and a per-cohort fail-rate dashboard. Anything missing turns the evaluator into a black-box voter, which is exactly what evaluators were supposed to replace.
Common mistakes
- Stacking 15 evaluators with no aggregation strategy. You end up with 15 dashboards and no decision rule. Define a release-gate aggregate (weighted mean, all-must-pass, or worst-of-N) and a per-evaluator threshold.
- Using EmbeddingSimilarity as a faithfulness proxy. Semantic closeness does not equal factual support. Use Groundedness or Faithfulness for correctness; EmbeddingSimilarity for retrieval and cache decisions.
- Skipping the reason field in code review. Score-only reviews miss judge hallucinations; the reason is where bad rubrics show their seams.
- One evaluator for all task types. A summarization task and a tool-call task need different evaluators; do not force a single rubric across them.
- No regression cohort. Evaluators that run only on new traces miss model-version regressions; always rerun against a frozen golden dataset.
- Self-judging with the same model family. Using GPT-5.1 to judge GPT-5.1 output inflates scores by 5-15 points on subjective rubrics. Pin the judge to a different family or use a reference-based metric.
- Treating every evaluator as a judge-LLM call. Programmatic and embedding-based evaluators are usually 50-200x cheaper and more reliable. Use a judge only where rubric subjectivity demands it.
- No calibration set. Without a labeled calibration set, judge-LLM evaluators silently drift as the judge model upgrades. Maintain 50-200 human-labeled rows per evaluator and rerun on every judge bump.
- Letting CustomEvaluation rubrics rot. Custom rubrics are written once and edited rarely. Run them through human review quarterly to confirm they still match product intent.
- Mixing offline and online evaluator versions. If your release gate uses evaluator v3 but production dashboards still show v2 scores, your two safety nets are measuring different things.
- Treating JSONValidation as optional. Programmatic schema checks are the cheapest, most reliable evaluator class. Every structured output should have one; they catch the failures judge-LLM evaluators are too expensive to run on every request.
- Ignoring evaluator-cost dashboards. A judge-LLM evaluator firing on 100% of production traffic can easily 2-3x your inference bill. Track eval cost as a line item from day one.
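The three release-gate aggregates named in the first item of this list (weighted mean, all-must-pass, worst-of-N) are each a few lines of code. The sketch below uses hypothetical helper names and made-up scores, weights, and thresholds purely for illustration.

```python
def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of evaluator scores; weights need not sum to 1."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def all_must_pass(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Gate passes only if every evaluator meets its own threshold."""
    return all(scores[name] >= thresholds[name] for name in scores)


def worst_of_n(scores: dict[str, float]) -> float:
    """Gate on the weakest evaluator score."""
    return min(scores.values())


# Illustrative numbers only, not measured results.
scores = {"Faithfulness": 0.91, "AnswerRelevancy": 0.84, "JSONValidation": 1.0}
weights = {"Faithfulness": 2.0, "AnswerRelevancy": 1.0, "JSONValidation": 1.0}
thresholds = {"Faithfulness": 0.85, "AnswerRelevancy": 0.80, "JSONValidation": 1.0}

print(round(weighted_mean(scores, weights), 3))  # 0.915
print(all_must_pass(scores, thresholds))         # True
print(worst_of_n(scores))                        # 0.84
```

Weighted mean tolerates one weak evaluator if the others are strong, while all-must-pass and worst-of-N do not; which one fits depends on whether any single evaluator failure should block a release.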
Evaluators in the broader 2026 eval program
Anchoring at least a slice of every evaluator class to a public benchmark keeps the library honest. HaluEval (35K Q&A; GPT-4 ~16.4% hallucination rate), TruthfulQA (817 Q; frontier 60-80%), RAGTruth (18K labeled chunks), τ-bench (top 2026 systems at 65-75% pass@1), and BFCL v3 (Berkeley function calling) are all small enough to fit inside a regression dataset while giving stable baselines across judge-model upgrades.
A standalone evaluator is a useful tool; an evaluator program is a moat. The teams shipping reliably in 2026 treat their evaluator library as a versioned product asset, with owners, calibration history, and a deprecation policy. New evaluators land through review: does this evaluator measure something the existing library misses, or is it noise? Old evaluators retire when their failure cohort stops producing meaningful traffic or when the underlying obligation changes. The library grows linearly with product complexity, not exponentially with launches. Compared with Arize Phoenix and LangSmith, both of which expose evaluator catalogs but leave the program governance to the engineer, our evaluate surface treats the library as a first-class versioned object with explicit ownership and deprecation flow.
Frequently Asked Questions
What is an evaluator?
An evaluator is a scoring function that takes an LLM input/output (and optional context) and returns a score, label, and reason. It is the atomic unit of an eval pipeline and can be programmatic, embedding-based, or judge-model-driven.
How is an evaluator different from a metric?
A metric is the number an evaluator returns; the evaluator is the function that produces it. One evaluator can return multiple metrics (e.g. score plus latency plus token count) and many evaluators can share the same metric definition.
How do you measure with an evaluator?
FutureAGI's fi.evals package ships 50+ evaluator classes (Groundedness, AnswerRelevancy, JSONValidation, CustomEvaluation, etc.). Each exposes an evaluate() method returning a structured Result with score, label, and reason.