Evaluation

What Is an Evaluation Metric?

The numerical or categorical signal returned by an LLM evaluator, used to quantify output quality, faithfulness, or task fit.

An evaluation metric is the output of an evaluator: the score, label, or rating that quantifies how well an LLM did on one specific dimension. Metrics fall into three buckets. Reference-based metrics compare output to a gold answer (ExactMatch, BLEUScore, ROUGEScore). Reference-free metrics score the output on its own (AnswerRelevancy, Coherence). Context-grounded metrics use retrieved documents as the reference (Groundedness, Faithfulness, ContextRelevance). Choosing the right one is the first technical decision in any eval design — the wrong metric will rank a worse model higher.
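
The three buckets differ in what the scorer is allowed to see. A toy plain-Python sketch of one metric per bucket (illustrative scoring logic, not the fi.evals implementations):

def exact_match(output: str, reference: str) -> float:
    # Reference-based: compare against a gold answer.
    return float(output.strip().lower() == reference.strip().lower())

def length_sanity(output: str) -> float:
    # Reference-free: judge the output on its own. A toy heuristic here;
    # real reference-free metrics use an LLM judge or a trained scorer.
    return 1.0 if 10 <= len(output.split()) <= 300 else 0.0

def grounded_fraction(output: str, context: list[str]) -> float:
    # Context-grounded: retrieved documents act as the reference.
    # Toy check: fraction of output sentences found verbatim in the context.
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    joined = " ".join(context).lower()
    return sum(s.lower() in joined for s in sentences) / len(sentences)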

Why Evaluation Metrics Matter in Production

The wrong metric is worse than no metric, because it gives false confidence. A team that ships on BLEUScore for an open-ended Q&A bot will rank a model that parrots memorized phrasings above one that gives fresh, correct answers: BLEU rewards n-gram overlap, not truth. A team that ships on EmbeddingSimilarity for faithfulness will rank a smoothly worded hallucination above an awkward but cited correct answer. A team that uses a single global score for a multi-tenant agent will miss that one cohort is failing 12% of the time while the global average looks healthy.
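
A toy sketch of the BLEU failure, using the sacrebleu package with illustrative strings:

import sacrebleu  # pip install sacrebleu

reference = ["Refunds are accepted within 30 days of delivery."]

# High n-gram overlap with the gold answer, but factually wrong.
parrot = "Refunds are accepted within 90 days of delivery."
# Factually right, almost no n-gram overlap.
fresh = "You have a month after the package arrives to send it back."

print(sacrebleu.sentence_bleu(parrot, reference).score)  # high
print(sacrebleu.sentence_bleu(fresh, reference).score)   # near zero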

The pain shows up downstream. Product managers sign off on a release because “eval scores improved” while users churn. ML engineers chase phantom regressions because the metric drifted, not the model. Compliance can’t answer “is this safer?” because nobody picked a safety metric in the first place.

For 2026-era agentic stacks, single-number metrics are especially dangerous. An agent has a final answer, a trajectory, and several tool calls — three different surfaces, each needing its own metric. TaskCompletion for the outcome, StepEfficiency for the trajectory, ToolSelectionAccuracy for tool calls. Lump them together with AggregatedMetric for headline reporting, but never throw away the per-dimension breakdown. Comparable open-source frameworks (Ragas, TruLens) ship metric catalogs but leave aggregation up to you; the bug-prone part is exactly there.
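
A hedged sketch of the per-surface setup, assuming TaskCompletion, StepEfficiency, and ToolSelectionAccuracy import from fi.evals like the evaluators shown later in this section:

from fi.evals import (
    AggregatedMetric,
    StepEfficiency,
    TaskCompletion,
    ToolSelectionAccuracy,
)

# One metric per surface: outcome, trajectory, tool calls.
headline = AggregatedMetric(
    evaluators=[TaskCompletion(), StepEfficiency(), ToolSelectionAccuracy()],
    weights=[0.5, 0.2, 0.3],  # illustrative weights; tune to your agent
)
# Report the aggregate score as the headline number, but keep the
# per-dimension sub-scores for debugging rather than discarding them.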

How FutureAGI Handles Evaluation Metrics

FutureAGI’s approach is a curated metric catalog plus a composition layer. The fi.evals package ships 50+ pre-built evaluators that each return one or more metrics: AnswerRelevancy returns a 0–1 relevance score; RAGScore returns a composite plus four sub-metrics (faithfulness, relevance, recall, noise sensitivity); JSONValidation returns a boolean plus a list of validation errors. Each metric is calibrated against human-annotated reference sets so you don’t tune blindly.
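
A sketch of the two return shapes described above; the call signatures and result field names here are assumptions patterned on the example later in this section:

from fi.evals import JSONValidation, RAGScore

model_json = '{"plan": "Enterprise"'  # malformed on purpose

# Boolean metric plus structured detail.
json_result = JSONValidation().evaluate(output=model_json)
print(json_result.score, json_result.errors)  # assumed field names

# Composite metric plus four sub-metrics.
rag_result = RAGScore().evaluate(
    input="Which plans include SSO?",
    output="SSO is available on the Enterprise plan.",
    context=["Enterprise plan features: SSO, audit logs, SLA."],
)
print(rag_result.score, rag_result.sub_scores)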

The composition layer is AggregatedMetric, which combines multiple metric evaluators into a single weighted score. You declare which metrics participate, what weight each carries, and a fail threshold. The aggregate becomes your release gate; the individual metrics stay visible for debugging.

A real example: a SaaS team building a customer-support agent on traceAI-openai-agents runs four metrics per response — IsHelpful, IsPolite, Groundedness against the help-doc retrieval, and a CustomEvaluation that grades brand-voice compliance. They aggregate into a single “support quality” score weighted 0.4 / 0.1 / 0.3 / 0.2. When the score drops on a new model version, the dashboard breaks down which sub-metric fell — usually Groundedness, when the retriever index is stale. The fix is targeted, not a generic prompt rewrite.
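
The same setup sketched in code, assuming IsHelpful, IsPolite, and CustomEvaluation follow the catalog’s evaluator pattern (the CustomEvaluation constructor argument is hypothetical):

from fi.evals import (
    AggregatedMetric,
    CustomEvaluation,
    Groundedness,
    IsHelpful,
    IsPolite,
)

# Hypothetical constructor argument; adapt to however your custom grader is defined.
brand_voice = CustomEvaluation(
    instructions="Grade 0-1 for compliance with the brand-voice guide.",
)

support_quality = AggregatedMetric(
    evaluators=[IsHelpful(), IsPolite(), Groundedness(), brand_voice],
    weights=[0.4, 0.1, 0.3, 0.2],  # weights from the example above
)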

How to Measure or Detect Issues With a Metric

Metrics need their own quality assurance. Track:

  • Score distribution per metric: is it spread or stuck at a ceiling? Stuck means the metric isn’t discriminating.
  • fi.evals.AggregatedMetric weight stability: when sub-metric correlations change, your aggregate gets noisy. Re-tune quarterly.
  • Human-agreement for judge-based metrics: Cohen’s kappa or simple accuracy on a 50-trace held-out set (see the sketch after this list).
  • Cost per metric: judge-model metrics carry token cost; programmatic metrics don’t. Optimize the mix.
  • Drift across releases: plot metric mean and p90 over the last 30 days to catch slow regressions.
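
A minimal sketch of the human-agreement and drift checks with standard libraries (synthetic scores stand in for real trace data):

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Human-agreement: judge labels vs. human labels on a held-out set.
human = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])  # 1 = pass, 0 = fail
judge = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])
print("kappa:", round(cohen_kappa_score(human, judge), 3))

# Drift: compare mean and p90 of a metric across two release windows.
last_release = np.random.default_rng(0).beta(8, 2, size=200)  # stand-in scores
this_release = np.random.default_rng(1).beta(7, 3, size=200)
for name, scores in [("last", last_release), ("this", this_release)]:
    print(name, "mean:", round(scores.mean(), 3), "p90:", round(np.percentile(scores, 90), 3))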

Minimal Python:

from fi.evals import AnswerRelevancy, Groundedness, AggregatedMetric

# Inputs for a single trace: user question, model answer, retrieved documents.
q = "What is the refund window?"
a = "Refunds are accepted within 30 days of delivery."
docs = ["Returns policy: refunds are accepted within 30 days of delivery."]

# Weighted composite: Groundedness carries more weight than relevance here.
agg = AggregatedMetric(
    evaluators=[AnswerRelevancy(), Groundedness()],
    weights=[0.4, 0.6],
)
result = agg.evaluate(input=q, output=a, context=docs)
print(result.score, result.sub_scores)

Common Mistakes

  • Using BLEU or ROUGE for open-ended chat. Both reward n-gram overlap; both miss factual correctness. Use a judge metric.
  • Aggregating without weights. A naive mean lets a noisy sub-metric dominate the gate signal.
  • Picking the metric after seeing the model output. Metrics chosen post-hoc to confirm a release decision are biased by definition.
  • Ignoring metric variance. A single number on 20 traces is noise; report mean ± 95% CI on at least 100 traces (see the CI sketch after this list).
  • Comparing absolute scores across model families. A judge model’s AnswerRelevancy on GPT-4o is not directly comparable to AnswerRelevancy on Llama 3 — calibrate per-model first.
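
A quick sketch of the variance check, using a normal-approximation 95% confidence interval on synthetic trace scores:

import numpy as np

def mean_ci(scores: np.ndarray, z: float = 1.96) -> tuple[float, float]:
    # Normal-approximation 95% CI on the mean metric score.
    half = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean() - half, scores.mean() + half

scores = np.random.default_rng(42).beta(8, 2, size=150)  # stand-in trace scores
low, high = mean_ci(scores)
print(f"mean={scores.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")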

Frequently Asked Questions

What is an evaluation metric?

An evaluation metric is the score an LLM evaluator returns — a number, a label, or a structured rating — used to quantify output quality on a specific dimension like faithfulness, helpfulness, or schema compliance.

How is an evaluation metric different from a benchmark?

A benchmark is a fixed dataset and metric pair (e.g. MMLU, HumanEval) used to compare models. An evaluation metric is the scoring function itself; the same metric can be used in many benchmarks or in production evals on your own data.

How do you choose an evaluation metric?

Match the metric to the failure mode. For RAG, use Groundedness or Faithfulness; for structured outputs, JSONValidation; for open-ended chat, judge-based AnswerRelevancy. FutureAGI's fi.evals catalog organizes them by task family.