Evaluation

What Is a Metric (in AI Evaluation)?

In AI evaluation, a metric is a quantitative function that scores a model output against an input, a reference, or a context — producing a number, a label, or a pass/fail signal that engineering teams aggregate across a dataset. Metrics range from deterministic checks (regex, JSON-schema match), to embedding-based similarities, to reference-based n-gram scores like BLEU and ROUGE, to judge-model rubrics like groundedness and answer-relevancy. A metric on its own is a number; what makes it useful is wiring it to a threshold, a cohort, and a regression gate so changes in the score actually block bad releases.
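
As a mental model, the sketch below treats a metric as a plain scoring function and wires it to a threshold and a release gate as the paragraph above describes. The function, threshold, and fail-rate budget are illustrative placeholders, not part of any specific library.

from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float   # the raw quantity
    label: str     # e.g. "match" / "mismatch"
    passed: bool   # score compared against a configured threshold

def exact_match_metric(output: str, reference: str, threshold: float = 1.0) -> MetricResult:
    # Deterministic check: 1.0 if the output equals the reference exactly
    score = 1.0 if output.strip() == reference.strip() else 0.0
    return MetricResult(score, "match" if score else "mismatch", score >= threshold)

def release_gate(results: list[MetricResult], max_fail_rate: float = 0.05) -> bool:
    # Block the release when too many rows fall below the metric threshold
    fail_rate = sum(not r.passed for r in results) / len(results)
    return fail_rate <= max_fail_rate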

Why It Matters in Production LLM and Agent Systems

A metric is the contract between the model and the system that ships it. Without metrics, “the model got better” is a vibe. With metrics, it is a 6% improvement on Groundedness against a 1,200-row golden set and a 1.2% regression on AnswerRelevancy for the long-tail cohort, and the team can decide on data, not feeling. Treat that contract carelessly and silent failure modes accumulate every release.

The pain shows up across roles. The ML engineer ships a prompt change that improves median quality but blows up the 99th-percentile failure rate, because the headline metric was a mean. The product manager looks at a dashboard with one global score and cannot tell whether the regression is on enterprise queries or free-tier chat. The platform engineer sees a model swap pass eval in CI but melt down in production because no metric covered tool-call accuracy, the actual failure mode.

For 2026-era agent stacks, single-metric thinking gets worse. A trajectory has step-level metrics (StepEfficiency, ToolSelectionAccuracy), trajectory-level metrics (TrajectoryScore, GoalProgress), and end-state metrics (TaskCompletion). Collapsing them into one number throws away the diagnostic signal you need to know whether the agent failed at planning, retrieval, tool selection, or response generation.
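
One way to keep that signal is to carry the levels side by side and only combine them at reporting time. The field names below are illustrative, not a FutureAGI schema.

from dataclasses import dataclass, field

@dataclass
class AgentEvalReport:
    # Step-level: one score per tool call or reasoning step
    step_efficiency: list[float] = field(default_factory=list)
    tool_selection_accuracy: list[float] = field(default_factory=list)
    # Trajectory-level: one score for the whole plan/execution path
    trajectory_score: float = 0.0
    goal_progress: float = 0.0
    # End-state: did the task actually get done
    task_completion: bool = False

    def as_vector(self) -> dict:
        # Report a vector of metrics, not one collapsed scalar
        steps = self.step_efficiency or [0.0]
        return {
            "mean_step_efficiency": sum(steps) / len(steps),
            "trajectory_score": self.trajectory_score,
            "goal_progress": self.goal_progress,
            "task_completion": self.task_completion,
        }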

How FutureAGI Handles Metrics

FutureAGI’s approach is to make the metric a first-class object with a known shape: input, output, optional context, returned score, returned label, returned reason. Every evaluator in fi.evals — over 50 of them — implements that contract. Groundedness, AnswerRelevancy, JSONValidation, ToolSelectionAccuracy, BLEUScore, EmbeddingSimilarity, HallucinationScore, and ASRAccuracy are all callable with the same signature, so a team can compose, cache, and gate them uniformly.
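
A sketch of what that shared contract buys in practice: composing several evaluators in one loop. The evaluator names come from the fi.evals list above, the evaluate signature mirrors the snippet shown later in this entry, and the example inputs are placeholders.

from fi.evals import Groundedness, AnswerRelevancy, JSONValidation

# Placeholder inputs; in practice these come from a dataset row or a live trace
query = "Which regions support data residency?"
retrieved_docs = ["Data residency is available in the EU and US regions."]
answer = '{"regions": ["EU", "US"]}'

# Same input/output/context contract for every evaluator, so they compose in a loop
results = {}
for evaluator in [Groundedness(), AnswerRelevancy(), JSONValidation()]:
    r = evaluator.evaluate(input=query, output=answer, context=retrieved_docs)
    results[type(evaluator).__name__] = (r.score, r.label, r.reason)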

In practice, a team running an agentic-RAG service builds a Dataset, calls Dataset.add_evaluation with a list of metrics — Groundedness, ContextRelevance, StepEfficiency — and gets a versioned per-metric score per row. AggregatedMetric then collapses the per-row scores into a configurable composite (weighted mean, min, threshold-pass) for a single ship/no-ship gate. For domain-specific quality questions, a CustomEvaluation wraps a rubric-based LLM-as-a-judge prompt as a callable metric, with the same input/output/context shape and the same downstream wiring. In production, traceAI feeds live spans into the same metric set so offline gates and online dashboards use the same source of truth. Metric-fail-rate-by-cohort and per-metric drift charts are surfaced in the FutureAGI evaluation dashboard as the headline reliability view.
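
A sketch of that workflow using the class and method names the paragraph mentions. The constructor arguments, row schema, and import path are assumptions made for illustration; check the fi.evals documentation for the exact signatures.

from fi.evals import Dataset, Groundedness, ContextRelevance, StepEfficiency, AggregatedMetric

# Assumed row schema: each row carries the input, the agent's answer, and retrieved context
rows = [
    {"input": "Summarise the Q3 incident report", "output": "...", "context": ["..."]},
]
dataset = Dataset(rows)  # assumed constructor

# Attach the metric set; each row gets a versioned per-metric score
dataset.add_evaluation([Groundedness(), ContextRelevance(), StepEfficiency()])

# Collapse per-row scores into a single ship/no-ship gate
gate = AggregatedMetric(
    metrics=[Groundedness(), ContextRelevance(), StepEfficiency()],
    aggregation="mean",  # or min / threshold-pass, per the composite you configure
)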

How to Measure or Detect It

A metric is itself a measurement, but the metric layer needs its own hygiene checks:

  • Per-metric distribution — track the histogram, not just the mean; bimodal scores hide failure cohorts.
  • Per-cohort metric breakdown — split by route, model variant, user segment to surface regressions invisible in the global mean.
  • Aggregated metric — fi.evals.AggregatedMetric combines several metrics into a single gate; lets you express “groundedness ≥ 0.8 AND injection ≤ 0.05”.
  • Metric-on-metric stability — Pearson correlation between metric versions across releases; drift means the metric itself changed.
  • Threshold breach rate — fraction of evaluated rows below the configured metric-threshold; the canonical alarm (see the sketch after this list).
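
The cohort breakdown and breach-rate checks reduce to a few lines of bookkeeping over per-row scores. The sketch below uses plain Python with an illustrative threshold and cohort labels.

from collections import defaultdict

# Per-row metric scores tagged with a cohort (route, model variant, user segment, ...)
rows = [
    {"cohort": "enterprise", "groundedness": 0.91},
    {"cohort": "free_tier", "groundedness": 0.62},
    {"cohort": "free_tier", "groundedness": 0.78},
]
THRESHOLD = 0.80  # illustrative metric-threshold

breaches, totals = defaultdict(int), defaultdict(int)
for row in rows:
    totals[row["cohort"]] += 1
    if row["groundedness"] < THRESHOLD:
        breaches[row["cohort"]] += 1

# Threshold breach rate per cohort: the canonical alarm, split so a regression
# on one segment is not hidden by the global mean.
for cohort in totals:
    print(cohort, breaches[cohort] / totals[cohort])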

Minimal Python:

from fi.evals import Groundedness, AggregatedMetric

# Example inputs; in practice these come from your dataset rows or live traces
query = "What is the refund window for annual plans?"
retrieved_docs = ["Annual plans can be refunded within 30 days of purchase."]
answer = "Annual plans are refundable for 30 days after purchase."

g = Groundedness()
combined = AggregatedMetric(metrics=[g], aggregation="mean")

# Same input/output/context contract as every other fi.evals evaluator
result = combined.evaluate(
    input=query, output=answer, context=retrieved_docs
)
print(result.score, result.label, result.reason)

Common Mistakes

  • One metric, one number. A single global metric hides the failure cohorts that actually matter; report a vector, not a scalar.
  • No threshold attached. A metric without a configured threshold and a downstream alert is a vanity dashboard.
  • Reference-based metrics on open-ended tasks. BLEU and exact-match are useless for chat; use judge-model rubrics or embedding similarity.
  • Letting the judge model be the same model under evaluation. Self-evaluation inflates scores; pin the judge to a different family.
  • Skipping metric versioning. A change to the rubric prompt rewrites every historical score; version the metric and tag releases.

Frequently Asked Questions

What is a metric in AI evaluation?

A metric is a quantitative function that scores a model output against an input, reference, or context. It returns a number, label, or pass/fail signal that teams aggregate across a dataset.

How is a metric different from an evaluator?

An evaluator is the runnable component — a class or callable. A metric is the quantity it returns. In FutureAGI's `fi.evals`, an evaluator like `Groundedness` is wired to the metric score it emits.

How do you choose the right metric?

Match metric to task. Reference-based for canonical answers (exact-match, BLEU), reference-free for open-ended generation (judge-model rubrics, embedding similarity), and structural for typed outputs (JSONValidation, SchemaCompliance).