How is an evaluator different from a metric?

A metric is the number an evaluator returns; the evaluator is the function that produces it. One evaluator can return multiple metrics (e.g. score plus latency plus token count) and many evaluators can share the same metric definition.
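
To make the distinction concrete, here is a minimal sketch of one evaluator call that returns several metrics at once. The function name `overlap_evaluator` and the token-overlap scoring rule are illustrative assumptions, not part of any FutureAGI API.

```python
import time


def overlap_evaluator(output: str, reference: str) -> dict:
    """Hypothetical toy evaluator: one call, several metrics."""
    start = time.perf_counter()
    out_tokens = output.lower().split()
    ref_tokens = set(reference.lower().split())
    # Score: fraction of output tokens that also appear in the reference.
    matched = sum(1 for token in out_tokens if token in ref_tokens)
    score = matched / len(out_tokens) if out_tokens else 0.0
    latency_ms = (time.perf_counter() - start) * 1000
    # The evaluator is the function; each key below is a metric it emits.
    return {
        "score": score,
        "latency_ms": latency_ms,
        "token_count": len(out_tokens),
    }


metrics = overlap_evaluator("Paris is the capital", "The capital of France is Paris")
print(metrics)
```

The same `score` definition could be emitted by a different evaluator (for example one that normalizes text differently), which is the sense in which several evaluators can share one metric.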

How do you measure with an evaluator?

FutureAGI's fi.evals package ships 50+ evaluator classes (Groundedness, AnswerRelevancy, JSONValidation, CustomEvaluation, etc.). Each exposes an evaluate() method returning a structured Result with score, label, and reason.

What Is an Evaluator? Definition & FutureAGI Guide (2026)

What is an evaluator?

An evaluator is a scoring function that takes an LLM input/output (and optional context) and returns a score, label, and reason. It is the atomic unit of an eval pipeline and can be programmatic, embedding-based, or judge-model-driven.

What Is an Evaluator?

In LLM evaluation, an evaluator is the scoring function — the smallest reusable unit that turns a model’s output into a number plus a reason. It accepts an input prompt, the model’s response, and (optionally) retrieved context or a reference answer. It returns a structured result: a numeric score, a categorical label like pass/fail, and the reasoning behind the verdict. Evaluators are how you compose an eval suite: stack a Groundedness check, a JSONValidation check, and a custom rubric judge, and you have a working quality gate. They are to evals what unit tests are to backend code.
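
The unit-test analogy can be sketched in a few lines. Everything below (`EvalResult`, `json_validity`, `max_length`, `quality_gate`) is a hypothetical stand-in written for illustration; it only mirrors the score/label/reason shape described above and is not the fi.evals implementation.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalResult:
    score: float   # numeric score
    label: str     # categorical verdict, e.g. "pass" / "fail"
    reason: str    # explanation behind the verdict


def json_validity(prompt: str, output: str, context: Optional[str] = None) -> EvalResult:
    """Programmatic check: does the output parse as JSON?"""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return EvalResult(0.0, "fail", f"invalid JSON: {exc.msg}")
    return EvalResult(1.0, "pass", "output parses as JSON")


def max_length(prompt: str, output: str, context: Optional[str] = None) -> EvalResult:
    """Programmatic check: is the output within an assumed 200-character limit?"""
    limit = 200
    ok = len(output) <= limit
    return EvalResult(float(ok), "pass" if ok else "fail", f"{len(output)} chars (limit {limit})")


Evaluator = Callable[[str, str, Optional[str]], EvalResult]


def quality_gate(evaluators: List[Evaluator], prompt: str, output: str,
                 context: Optional[str] = None) -> bool:
    """The gate passes only if every stacked evaluator passes."""
    results = [evaluate(prompt, output, context) for evaluate in evaluators]
    return all(result.label == "pass" for result in results)


checks = [json_validity, max_length]
gate_ok = quality_gate(checks, "Return the order as JSON", '{"order_id": 1042}')
gate_bad = quality_gate(checks, "Return the order as JSON", "order 1042")
print(gate_ok, gate_bad)
```

Requiring every check to pass is the simplest possible gate; a weighted aggregate is a common alternative when some evaluators matter more than others.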

Why Evaluators Matter in Production LLM and Agent Systems

Without explicit evaluators, “quality” lives in someone’s head. The first product manager who runs the demo decides if it works; the next 10,000 users find out for themselves. Evaluators externalize that judgment as code, which means it survives turnover, gets versioned with the rest of the system, and runs on every release.

A missing evaluator hides concrete failure modes. A JSON-output prompt breaks on emoji-containing inputs and is caught only after a downstream parser starts throwing; JSONValidation would have flagged it on day one. An agent loops on the same tool, eating $40 of inference per request; StepEfficiency would have spiked. A RAG pipeline’s answer quality silently degrades after a chunking change; Groundedness would have shown the regression on the next eval run.
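
As an illustration of the tool-loop case, the sketch below scores a trajectory by how many of its tool calls are distinct. It is a hypothetical heuristic written for this example (the function name, the `max_repeats` threshold, and the scoring rule are all assumptions) and is not how FutureAGI's StepEfficiency is implemented.

```python
from collections import Counter
from typing import Dict, List, Tuple


def step_efficiency(tool_calls: List[Tuple[str, str]], max_repeats: int = 2) -> Dict[str, object]:
    """tool_calls is a list of (tool_name, serialized_arguments) pairs."""
    if not tool_calls:
        return {"score": 1.0, "label": "pass", "reason": "no tool calls"}
    counts = Counter(tool_calls)
    # Score: share of calls that are distinct; repeated identical calls lower it.
    score = len(counts) / len(tool_calls)
    (worst_tool, _), worst_count = counts.most_common(1)[0]
    label = "fail" if worst_count > max_repeats else "pass"
    reason = f"{worst_tool} called {worst_count}x with identical arguments"
    return {"score": score, "label": label, "reason": reason}


# An agent that issues the same search four times before answering.
trajectory = [("search", '{"q": "refund policy"}')] * 4 + [("answer", "{}")]
looped = step_efficiency(trajectory)
print(looped)
```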

The pain hits ML engineers first (debugging quality regressions one trace at a time), then SREs (paged at 3am for a cost spike with no eval breadcrumb), then compliance (asked to attest to model behavior with no measurement record). In multi-step 2026 agent stacks, a single user request fans out into five to fifteen LLM calls; one evaluator per step (planner-quality, tool-selection, response-faithfulness) is what keeps trajectory failures from being mysteries. The Ragas faithfulness evaluator alone, for example, inspects only the final answer, so it cannot tell you which retrieval step lost the relevant chunk. That is why a real eval stack ships dozens of evaluator classes, not one.

How FutureAGI Handles Evaluators

FutureAGI’s approach is to ship evaluators as importable Python classes through fi.evals, plus a hosted equivalent for no-code teams. There are 50+ built-in evaluators across four shapes: local-metric classes (e.g. Groundedness, EmbeddingSimilarity, JSONValidation) that run in-process; cloud-template evaluators (e.g. PromptInjection, ContextRelevance) that run as managed judge models; framework-eval classes (CoherenceEval, ReasoningQualityEval) tuned for agent trajectories; and security-detector classes (SQLInjectionDetector, PIIDetection) for compliance use cases.

Every evaluator implements the same contract: evaluate(input, output, context=None, expected=None) returns a Result with score, label, reason, plus per-evaluator extras (e.g. Groundedness returns supported and unsupported claim lists). You can chain them: register multiple evaluators against a single Dataset.add_evaluation() call, and FutureAGI runs them in parallel and stores results columnwise.
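
The shared contract and the parallel, columnwise run can be sketched in plain Python. This is an assumption-laden illustration: `Result`, `Evaluator`, `ExactMatch`, `ContextMention`, and `run_columnwise` are hypothetical names defined here, and the thread pool only stands in for whatever scheduling FutureAGI actually uses. The `input` parameter name mirrors the contract quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass
class Result:
    score: float
    label: str
    reason: str


class Evaluator(Protocol):
    def evaluate(self, input: str, output: str,
                 context: Optional[str] = None,
                 expected: Optional[str] = None) -> Result: ...


class ExactMatch:
    """Passes when the output equals the expected answer."""
    def evaluate(self, input, output, context=None, expected=None) -> Result:
        ok = expected is not None and output.strip() == expected.strip()
        return Result(float(ok), "pass" if ok else "fail",
                      "matches expected" if ok else "differs from expected")


class ContextMention:
    """Passes when the output string appears verbatim in the context."""
    def evaluate(self, input, output, context=None, expected=None) -> Result:
        ok = bool(context) and output.strip().lower() in context.lower()
        return Result(float(ok), "pass" if ok else "fail",
                      "found in context" if ok else "not found in context")


def run_columnwise(evaluators: Dict[str, Evaluator], rows: List[dict]) -> Dict[str, List[Result]]:
    """Run each evaluator over all rows in its own thread; one column per evaluator."""
    def run_one(evaluator: Evaluator) -> List[Result]:
        return [evaluator.evaluate(**row) for row in rows]

    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_one, ev) for name, ev in evaluators.items()}
        return {name: future.result() for name, future in futures.items()}


rows = [
    {"input": "Capital of France?", "output": "Paris",
     "context": "Paris is the capital of France.", "expected": "Paris"},
    {"input": "Capital of Spain?", "output": "Lisbon",
     "context": "Madrid is the capital of Spain.", "expected": "Madrid"},
]
columns = run_columnwise({"exact_match": ExactMatch(), "context_mention": ContextMention()}, rows)
for name, results in columns.items():
    print(name, [r.label for r in results])
```

Because every evaluator honors the same signature, adding a new check is a one-line change to the dictionary rather than a change to the runner.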

A typical setup: an engineer instruments their app with traceAI-langchain, samples production into a Dataset, attaches three evaluators (Faithfulness, AnswerRelevancy, CustomEvaluation for tone), and wires AggregatedMetric to combine them into one release-gating score. When the score drops below threshold on a new prompt version, the Agent Command Center’s post-guardrail flips that route to model-fallback while the team investigates.
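
The gating step in that setup amounts to combining per-evaluator scores and comparing the result to a threshold. The sketch below uses a weighted mean; the weights, the 0.8 threshold, and the helper names `aggregate` and `route_for` are assumptions chosen for illustration, not the behavior of AggregatedMetric or the Agent Command Center.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of evaluator scores (hypothetical aggregation rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def route_for(score: float, threshold: float = 0.8) -> str:
    """Keep the primary route while the gate holds; otherwise fall back."""
    return "primary" if score >= threshold else "model-fallback"


weights = {"faithfulness": 0.5, "answer_relevancy": 0.3, "tone": 0.2}

# Mean evaluator scores for the current prompt version and a new candidate.
baseline = aggregate({"faithfulness": 0.9, "answer_relevancy": 0.9, "tone": 0.8}, weights)
candidate = aggregate({"faithfulness": 0.6, "answer_relevancy": 0.9, "tone": 0.8}, weights)

print(round(baseline, 2), route_for(baseline))
print(round(candidate, 2), route_for(candidate))
```

Weighting faithfulness most heavily reflects one possible priority for a RAG product; a team gating on tone or safety would choose different weights.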

How to Measure or Detect Issues With Evaluators

Evaluators themselves need quality control. Track:

Calibration: agreement with human labels (Cohen’s kappa) for judge-based evaluators. Aim for κ ≥ 0.7 before trusting them in CI.
Latency: judge-model evaluators add 200ms–2s per trace. Use fi.evals local-metric evaluators (no LLM call) for hot-path use cases.
Cost: llm.token_count.prompt × judge-model price per token × eval volume. Plot evaluator cost as a line item.
Eval-fail-rate-by-cohort: the headline dashboard signal — what % of evaluated traces fail per evaluator, sliced by user cohort.
Reason coherence: spot-check the reason field for hallucinated justifications.
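
Three of the signals above reduce to short calculations. The sketch below computes Cohen's kappa from paired judge and human labels, estimates judge cost from the formula above, and slices fail rate by cohort. The sample labels, token count, and per-1k-token price are made-up illustration values, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def cohens_kappa(judge: List[str], human: List[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    categories = judge_freq.keys() | human_freq.keys()
    # Chance agreement: product of each rater's marginal frequency per category.
    expected = sum(judge_freq[c] * human_freq[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def judge_cost(prompt_tokens_per_eval: float, price_per_1k_tokens: float, eval_volume: int) -> float:
    """Prompt tokens x judge-model price x eval volume."""
    return prompt_tokens_per_eval / 1000 * price_per_1k_tokens * eval_volume


def fail_rate_by_cohort(rows: List[Tuple[str, str]]) -> Dict[str, float]:
    """rows are (cohort, label) pairs for a single evaluator."""
    totals, fails = Counter(), Counter()
    for cohort, label in rows:
        totals[cohort] += 1
        fails[cohort] += int(label == "fail")
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass"]
kappa = cohens_kappa(judge_labels, human_labels)
cost = judge_cost(prompt_tokens_per_eval=1500, price_per_1k_tokens=0.002, eval_volume=100_000)
rates = fail_rate_by_cohort([("free", "fail"), ("free", "pass"), ("pro", "pass"), ("pro", "pass")])

print(round(kappa, 3), cost, rates)
```

With one disagreement in ten labels, kappa here lands a little above the 0.7 bar, which shows how quickly chance correction pulls the number below raw agreement (0.9).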

Minimal Python:

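from fi.evals import Groundedness, JSONValidation

# Illustrative placeholder values (assumed shapes); replace with your own data.
q = "Where is order 1042?"
a = '{"order_id": 1042, "status": "shipped"}'
docs = ["Order 1042 shipped on 3 March."]
order_schema = {"type": "object", "required": ["order_id", "status"]}

evaluators = [Groundedness(), JSONValidation(schema=order_schema)]
for e in evaluators:
    result = e.evaluate(input=q, output=a, context=docs)
    print(e.__class__.__name__, result.score, result.reason)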

Common Mistakes

Stacking 15 evaluators with no aggregation strategy. You end up with 15 dashboards and no decision rule. Use AggregatedMetric and pick a single release-gate threshold.
Using EmbeddingSimilarity as a faithfulness proxy. Semantic closeness ≠ factual support. Use Groundedness or Faithfulness instead.
Skipping the reason field in code review. Score-only reviews miss judge hallucinations; the reason is where bad rubrics show their seams.
One evaluator for all task types. A summarization task and a tool-call task need different evaluators; do not force a single rubric.
No regression cohort. Evaluators that run only on new traces miss model-version regressions; always re-run against a frozen golden set.
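
The similarity-versus-support mistake above is easy to reproduce. The sketch below uses bag-of-words cosine similarity as a simple stand-in for an embedding model (an assumption made so the example runs without dependencies); learned embeddings often show the same weakness on negated sentences, though the exact numbers differ.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


context = "the refund was issued on monday"
answer = "the refund was not issued on monday"  # contradicts the context

similarity = cosine_similarity(context, answer)
print(round(similarity, 3))
```

The answer contradicts the context yet scores as a near match, which is why a claim-level check such as Groundedness or Faithfulness is the better fit for factual support.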