
LLM Evaluation Metrics: Everything You Need in 2026

There aren't 50 LLM eval metrics. There are three primitive families and eight rubrics that matter in production. The opinionated 2026 reference, with the CI gate and the cascade that make per-trace eval affordable.

llm-evaluation llm-metrics rag-evaluation guardrails llm-as-judge agent-evaluation 2026

There are not 50 LLM evaluation metrics. There are three primitive families and a handful of named rubrics that decide whether a system ships. Most vendor catalogs are noise on top of that — a permutation of the same primitives applied to a slightly different failure mode, given a new name, charged for separately. The point of this reference is to flatten the catalog. Once you know the primitives and the rubrics that earn their slot, the rest is selection.

TL;DR: the shape of the catalog

| Layer | What it is | Count that matters |
| --- | --- | --- |
| Primitive families | The mechanism the metric uses to score | 3 |
| Rubrics that ship in production | The named jobs the rubric does | 8 |
| Everything else | Permutations of the above | Skip until a real failure demands one |

Three primitives. Eight rubrics. One cascade that makes per-trace evaluation affordable. The rest of this post is the working definition of each.

The three primitive families

Every LLM eval metric is built from one of three primitives or a stack of them. Learn the three and the catalog becomes a lookup table.

Deterministic. A function with no model in the loop. Parse the response into JSON, validate against a schema. Run a regex for a refusal phrasing. Look up cited chunk IDs in the retrieval context. Match a tool call against an expected signature. Deterministic checks are sub-10 ms, free, and never drift. They are also the wrong tool for “is this helpful.” Use them for closed-form questions where the answer is provably right or wrong against a rule.
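As a minimal sketch of what such checks look like, the three functions below cover schema validation, refusal phrasing, and citation lookup using only the standard library. The refusal phrasings and the `[chunk:ID]` citation syntax are assumptions made for illustration, not part of any SDK.

```python
import json
import re

# Hypothetical refusal phrasings; a real pack would be mined from your own logs.
REFUSAL_RE = re.compile(
    r"\b(?:i can(?:'t|not) help with|i(?:'m| am) unable to)\b", re.IGNORECASE
)
# Hypothetical citation syntax: the response cites chunks as [chunk:ID].
CITATION_RE = re.compile(r"\[chunk:([\w-]+)\]")


def is_valid_json_object(response: str, required_keys: set) -> bool:
    """True when the response parses as a JSON object with every required key."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def is_refusal(response: str) -> bool:
    """True when the response contains one of the known refusal phrasings."""
    return REFUSAL_RE.search(response) is not None


def uncited_chunk_ids(response: str, retrieved_ids: set) -> set:
    """Chunk IDs cited in the response that are absent from the retrieval context."""
    return set(CITATION_RE.findall(response)) - retrieved_ids
```

Each function answers a closed-form question with a rule, which is why none of them needs a model in the loop.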

Embedding-based. Score the candidate by its similarity to a reference. The family spans a range of representations: BLEU and ROUGE count n-gram overlap, METEOR adds stem and synonym matches, and BERTScore compares tokens in contextual-embedding space. Output is a similarity score that tolerates more paraphrase as you climb the stack. Useful when you have a clean gold answer, or as a feature for clustering failing traces. Confidently wrong answers that share vocabulary still score high: these metrics score “looks similar,” not “is correct.” See What is BLEU, ROUGE, and BERTScore? for the deep dive.
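The failure mode is easy to reproduce. The sketch below is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (whitespace tokenization, no stemming), not the official implementation; the example sentences are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared unigrams between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the refund window is 30 days from delivery"
wrong = "the refund window is 90 days from delivery"
paraphrase = "customers may return items within a month of receiving them"

print(unigram_f1(wrong, reference))       # 0.875
print(unigram_f1(paraphrase, reference))  # 0.0
```

The factually wrong answer scores 0.875 because it shares seven of eight tokens with the reference, while the correct paraphrase scores zero.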

LLM-as-judge. A capable model reads the rubric, reads the candidate response, reads the context, returns a score. G-Eval formalized the pattern in 2023. Pairwise variants scaled it to ship decisions across millions of comparisons. The judge is the only general-purpose primitive for rubrics that require reasoning — faithfulness, refusal calibration, role adherence, helpfulness. It is also the most expensive primitive and the one most prone to bias. See LLM-as-judge best practices for the calibration pattern.
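The pattern reduces to a prompt, a model call, and a parser. The sketch below assumes a 1-5 scale and a `SCORE:` reply format, both chosen for illustration; `call_model` is a hypothetical callable standing in for whichever client you use, and `fake_model` is a stub so the example runs offline.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading a response against a rubric.
Rubric: {rubric}
Context: {context}
Response: {response}
Reply with a line of the form SCORE: <integer 1-5> followed by a one-sentence reason."""


def judge_score(
    rubric: str, context: str, response: str, call_model: Callable[[str], str]
) -> Optional[float]:
    """Ask the judge for a 1-5 score and normalize it to the 0-1 range."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, context=context, response=response)
    reply = call_model(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        return None  # unparseable reply: surface it rather than guessing a score
    return (int(match.group(1)) - 1) / 4


def fake_model(prompt: str) -> str:
    # Stub judge so the sketch runs without network access.
    return "SCORE: 4\nThe answer is mostly supported by the context."


score = judge_score(
    rubric="The response only asserts claims supported by the context.",
    context="Refunds are accepted within 30 days of delivery.",
    response="You can get a refund within 30 days.",
    call_model=fake_model,
)
print(score)  # 0.75
```

Returning `None` on an unparseable reply keeps judge failures visible instead of folding them into the score distribution.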

Classifier-backed metrics (LlamaGuard 3, Qwen3Guard, Granite Guardian, WildGuard, ShieldGemma) are a special case of the judge primitive. The model is small (600M-8B) and fine-tuned for one taxonomy, so it runs in 50-200 ms at fractions of a cent. Treat them as the cheap judge, not as a fourth primitive. Future AGI ships nine open-weight variants behind a single Guardrails interface, plus its in-house TURING_FLASH.

The skill is matching the question to the cheapest primitive that answers it honestly. Running a frontier judge on a binary toxicity decision that a 4B Gemma adapter answers in 65 milliseconds is the most common anti-pattern in the audits we run. Right answer, wrong tool.

The eight rubrics that matter in production

Across hundreds of production deployments, the same rubrics earn their slot. Pick from these first. Add specialized rubrics only when a real failure cluster forces you.

1. Groundedness. Does the answer only assert claims supported by the retrieved context? The RAG and QA core. Failure mode: hallucination against perfectly good retrieval. Implementation: judge primitive against the context. Future AGI ships it as Groundedness.

2. Refusal. Did the system refuse what it should refuse, and not refuse what it should answer? Two-sided calibration. Over-refusal of benign requests is as much a failure as under-refusal of harmful ones. Implementation: judge primitive against the policy. Future AGI ships it as AnswerRefusal.
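Two-sided calibration means reporting two rates, not one. A minimal sketch, assuming each labeled example carries a `should_refuse` ground-truth flag and a `did_refuse` observation (both names are illustrative):

```python
def refusal_rates(records: list) -> tuple:
    """records: (should_refuse, did_refuse) boolean pairs.

    Returns (over_refusal_rate, under_refusal_rate).
    """
    benign = [did for should, did in records if not should]
    harmful = [did for should, did in records if should]
    # Over-refusal: benign requests that were refused anyway.
    over = sum(benign) / len(benign) if benign else 0.0
    # Under-refusal: harmful requests that were answered.
    under = sum(not did for did in harmful) / len(harmful) if harmful else 0.0
    return over, under


records = [
    (False, False), (False, True), (False, False), (False, False),  # benign
    (True, True), (True, False),                                    # harmful
]
print(refusal_rates(records))  # (0.25, 0.5)
```

A single blended accuracy number would hide which side is failing; gating each rate separately keeps both failure modes visible.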

3. Factual Accuracy. The check against ground truth when you have it. When you don’t, the judge scores against an authoritative rubric. Distinct from Groundedness: Groundedness checks “did the model stick to the retrieved context,” Factual Accuracy checks “is the claim actually true in the world.” Future AGI ships it as FactualAccuracy.

4. Toxicity. Multi-axis toxicity scoring across insult, threat, identity attack, sexual, and violent content. Classifier-backed primitive (LlamaGuard 3, ShieldGemma, Turing Flash). Both a CI rubric and a runtime guardrail. Future AGI ships it as Toxicity.

5. PII / Data Privacy Compliance. Detect personal identifiers in inputs and outputs, optionally redact, log the policy decision. Mix of deterministic (regex packs for SSN, credit card, phone) and classifier (named-entity recognition for free-form PII). Future AGI ships it as DataPrivacyCompliance.
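The deterministic half can be sketched with two patterns and a checksum. The patterns below are simplified for illustration (US-style SSN, 13-19 digit card numbers validated with the Luhn check) and are not the SDK's regex packs; free-form PII such as names still needs the classifier layer.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out digit runs that merely look like card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_pii(text: str) -> dict:
    """Return the SSN-shaped and Luhn-valid card-shaped strings found in text."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return {"ssn": SSN_RE.findall(text), "credit_card": cards}


def redact(text: str) -> str:
    """Replace detected identifiers with placeholders, leaving other digits alone."""
    text = SSN_RE.sub("[SSN]", text)
    return CARD_RE.sub(
        lambda m: "[CARD]" if luhn_valid(m.group()) else m.group(), text
    )


sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 4111111111111112."
print(redact(sample))  # SSN [SSN], card [CARD], order 4111111111111112.
```

The order number survives redaction because it fails the checksum, which is the kind of false positive a bare digit-count regex would produce.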

6. Tone. Did the assistant stay in its declared persona, register, and brand voice? Judge primitive against a tone rubric. Most product owners under-spec this one and find out from a customer-support escalation. Pin the rubric to brand-voice exemplars rather than asking the judge for “professional tone” in the abstract.

7. Latency. A correct answer at 30 seconds is a product failure. Track p50 and p95 per route, and gate p95 against an absolute ceiling in CI. Hard metric, not a graph in an ops dashboard.

8. Cost. A correct answer that costs $0.50 per request is a unit-economics failure. Track per-request cost and cost per resolved task. Gate cost per resolved task as a CI metric. Production owners ship faster when latency and cost block the CI gate the same way Groundedness does.
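Both gates are a few lines of arithmetic. The sketch below uses a nearest-rank percentile and hypothetical ceilings (5 s for p95, $0.05 per resolved task); the function names and the sample numbers are illustrative.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_resolved_task(costs_usd: list, resolved: list) -> float:
    """Total spend divided by the number of tasks that actually resolved."""
    n_resolved = sum(resolved)
    return sum(costs_usd) / n_resolved if n_resolved else float("inf")


def latency_cost_failures(latencies_ms, costs_usd, resolved,
                          p95_ceiling_ms, cost_ceiling_usd) -> list:
    """Return human-readable gate failures; an empty list means the gate passes."""
    failures = []
    p95 = percentile(latencies_ms, 95)
    if p95 > p95_ceiling_ms:
        failures.append(f"p95 latency {p95} ms exceeds ceiling {p95_ceiling_ms} ms")
    cost = cost_per_resolved_task(costs_usd, resolved)
    if cost > cost_ceiling_usd:
        failures.append(f"cost per resolved task ${cost:.3f} exceeds ${cost_ceiling_usd}")
    return failures


latencies = [900] * 18 + [12000, 31000]   # two slow outliers out of twenty
costs = [0.02] * 20
resolved = [True] * 16 + [False] * 4      # 16 of 20 tasks resolved

failures = latency_cost_failures(latencies, costs, resolved, 5000, 0.05)
```

Here p50 is a healthy 900 ms while p95 is 12 s, so only the p95 gate fires. Dividing by resolved tasks rather than requests is what makes unresolved traffic show up as cost.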

That is the list. Eight rubrics, three primitives. Everything else in a vendor’s marketing taxonomy — and you will see lists of 30, 50, 100 — is a permutation. ChunkAttribution is Groundedness scored at chunk granularity. IsHarmfulAdvice is Toxicity scored against a domain-specific policy. LLMFunctionCalling is Refusal-plus-Task-Completion scored against a tool signature. Useful permutations exist; the point is that you start with the eight, not the eighty.

The rubrics that show up in specific shapes

A few rubrics earn their slot only on specific application shapes. They are not in the core eight, but they are non-negotiable when the shape demands them.

RAG-specific. ContextRelevance (do retrieved chunks address the question?), ContextAdherence (does the answer only use what was retrieved?), Completeness (does the answer cover what the chunks support?), ChunkAttribution, ChunkUtilization. The RAG evaluation deep dive covers the per-stage pattern.

Agent-specific. LLMFunctionCalling (right tool, right arguments, right order), TaskCompletion (did the agent reach the user’s intended end state across all tool calls), PromptInjection (did a tool output rewrite the system instructions?). Agent metrics score the trace, not the response — they need the full span tree, which is why agent eval and tracing are the same system in 2026.
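"Right tool, right arguments, right order" can be checked deterministically when the expected call sequence is known. A minimal sketch, assuming each tool call has been extracted from the trace as a `(tool_name, arguments)` pair; the tool names are hypothetical.

```python
def tool_call_errors(actual: list, expected: list) -> list:
    """Compare two ordered lists of (tool_name, arguments_dict) pairs."""
    errors = []
    if len(actual) != len(expected):
        errors.append(f"expected {len(expected)} calls, got {len(actual)}")
    for step, ((a_name, a_args), (e_name, e_args)) in enumerate(zip(actual, expected)):
        if a_name != e_name:
            errors.append(f"step {step}: called {a_name}, expected {e_name}")
        elif a_args != e_args:
            errors.append(f"step {step}: wrong arguments for {a_name}")
    return errors


expected = [
    ("lookup_order", {"order_id": "A17"}),
    ("issue_refund", {"order_id": "A17", "amount": 20}),
]
wrong_order = [expected[1], expected[0]]
wrong_args = [expected[0], ("issue_refund", {"order_id": "A17", "amount": 200})]

print(tool_call_errors(wrong_args, expected))
# ['step 1: wrong arguments for issue_refund']
```

Strict sequence matching only suits flows with one valid path; agents with several acceptable plans need the judge-backed TaskCompletion rubric on top.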

Conversation-specific. Conversation Completeness (did the dialogue reach the expected end state?), Knowledge Retention (did the agent carry facts across turns?), Outcome Label (resolved, filed, booked, refunded?). Per-turn evals on multi-turn agents produce false confidence; add conversation-level rubrics. See Multi-turn LLM Evaluation.

Format-specific. IsJson, ContainsValidLink, IsEmail, length bounds, schema match. Deterministic primitive. Microseconds. Never wrong.

The point is the catalog is short. Eight always. Three to five more by shape. Then stop.

Pick by application shape

Different shapes ship different defaults.

| Shape | Core rubrics (CI gate every PR) | Add when |
| --- | --- | --- |
| Chat assistant | Refusal, Tone, TaskCompletion + Toxicity, PII + Latency, Cost | Multi-turn → ConversationCompleteness |
| RAG / KB / support | Groundedness, ContextRelevance, Completeness + Toxicity, PII + Latency, Cost | Have ground truth → FactualAccuracy |
| Tool-using agent | LLMFunctionCalling, TaskCompletion, AnswerRefusal + PromptInjection + Latency, Cost | Touches external state → permission-scope check |
| Code generation | Pass-at-1 on fixtures, static analysis, security patterns + Latency, Cost | Security-sensitive → secret-scanning rubric |
| Safety-critical (health, finance, legal) | Domain harm rubric + full Scanner pack as inline guardrails + Refusal | Regulated → domain-specific compliance template |

Every shape adds the safety triad (Toxicity, PII, PromptInjection) and the latency-cost pair. The mistake is starting from the catalog and picking by name. Start from the failure modes that hurt your users and pick by job.

Six well-calibrated rubrics beat twenty noisy ones.

The CI gate composition

A metric you run once is a slide in a deck. A metric you run on every PR is a quality gate.

The gate is four parts:

  1. Versioned dataset. 50-100 examples per route, sampled from production, biased toward the hardest 10 percent. Refreshed weekly by promoting failing production traces.
  2. Pinned rubrics. A fixed set tied to the application shape, with (judge_model_id, rubric_version, prompt_template_hash) pinned. A vendor swap on the judge is a deliberate eval-suite migration, not a config change.
  3. Per-route thresholds. Fail the PR if any rubric drops more than two points (0.02 on a 0-1 scale) from the trailing 7-day baseline, or falls below an agreed absolute floor. Reasonable starting floors to tune: 0.75 on Groundedness, 0.85 on TaskCompletion.
  4. Cascade-aware runner. The runner that executes the rubric uses the cascade — deterministic first, classifier next, judge last — so the CI bill is proportional to ambiguity, not to dataset size.
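Parts 2 and 3 of the gate can be sketched without any SDK. The code below is an illustration, not the `fi run` assertion engine: `rubric_pin` and `gate_failures` are hypothetical helpers, and the two-point drop is read as 0.02 on a 0-1 scale.

```python
import hashlib


def rubric_pin(judge_model_id: str, rubric_version: str, prompt_template: str) -> str:
    """Fingerprint of the pinned triple; any change yields a different pin."""
    template_hash = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]
    return f"{judge_model_id}|{rubric_version}|{template_hash}"


def gate_failures(current: dict, baseline: dict, floors: dict,
                  max_drop: float = 0.02) -> list:
    """Check each rubric against its trailing baseline and its absolute floor."""
    failures = []
    for rubric, score in current.items():
        drop = baseline.get(rubric, score) - score
        if drop > max_drop:
            failures.append(f"{rubric}: dropped {drop:.2f} from baseline")
        floor = floors.get(rubric)
        if floor is not None and score < floor:
            failures.append(f"{rubric}: {score:.2f} is below floor {floor:.2f}")
    return failures


current = {"Groundedness": 0.81, "TaskCompletion": 0.84}
baseline = {"Groundedness": 0.85, "TaskCompletion": 0.85}   # trailing 7-day means
floors = {"Groundedness": 0.75, "TaskCompletion": 0.85}

failures = gate_failures(current, baseline, floors)
for line in failures:
    print(line)
# In CI, finish with a non-zero exit when failures is non-empty,
# for example: sys.exit(1 if failures else 0)
```

In this run Groundedness fails on the baseline diff while still clearing its floor, and TaskCompletion fails on the floor while staying inside the allowed drop. Storing the pin next to each baseline makes a judge or template change show up as a suite migration rather than as a mysterious score shift.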

The ai-evaluation SDK ships a CLI (fi run) with an assertion engine that exits non-zero when scores drop below threshold. Wire it into GitHub Actions, GitLab CI, or your build system.

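from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, AnswerRefusal, FactualAccuracy,
    Toxicity, DataPrivacyCompliance, TestCase,
)

# Supply real credentials here, for example from your CI secrets.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# The versioned dataset for this route: (query, response, context) triples.
# The single row below is illustrative; load your own 50-100 examples.
dataset = [
    (
        "What is the refund window?",
        "Refunds are accepted within 30 days of delivery.",
        "Policy: refunds are accepted within 30 days of delivery.",
    ),
]

results = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),
        AnswerRefusal(augment=True),
        FactualAccuracy(augment=True),
        Toxicity(augment=True),
        DataPrivacyCompliance(),
    ],
    # One TestCase per dataset row.
    inputs=[TestCase(query=q, response=r, context=ctx)
            for q, r, ctx in dataset],
)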

The CI gate is a regression suite, not a leaderboard. The score that matters is the diff against last week’s baseline, not the absolute number on a model card. The full buildout is in Build LLM evaluation framework from scratch.

The hybrid cascade that makes per-trace eval affordable

No single primitive wins. Deterministic misses everything semantic. Embedding misses everything that needs reasoning. Judge is too expensive to run on every production trace. The production answer is a cascade.

Future AGI’s augment=True flag on every EvalTemplate runs three layers in order, cheapest first:

  1. Deterministic layer. Regex, scanner, schema match. Roughly 70-80 percent of traces resolve here.
  2. Classifier-backed layer. One of 10 guardrail backends (LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B, TURING_FLASH). Another 10-20 percent resolve here.
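  3. LLM-as-judge layer. Rubric-based judge with a pinned model. The remaining 5-15 percent reach here.

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, Toxicity, TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# One production trace: the user query, the model response, and the
# retrieved context. The values are illustrative.
q = "What is the refund window?"
r = "Refunds are accepted within 30 days of delivery."
ctx = "Policy: refunds are accepted within 30 days of delivery."

result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),  # cascade enabled per template
        Toxicity(augment=True),
    ],
    inputs=[TestCase(query=q, response=r, context=ctx)],
)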

Average per-eval cost lands one to two orders of magnitude below pure LLM-as-judge. That is what makes scoring every production span affordable, instead of sampling 1 percent and hoping the missed 99 percent stay clean.

The cascade is the operational answer to the “LLM eval is too expensive” complaint. The whole stack does not have to be cheap; the average has to be cheap, and the cascade gets the average there. See Deterministic vs LLM-judge evals for the deeper treatment.
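The control flow of a cascade is small enough to sketch in full. This is a generic illustration of the pattern, not the implementation behind `augment=True`: the blocklist, the stub classifier, the confidence band (at most 0.1 or at least 0.9), and the stub judge are all assumptions made so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A layer returns a toxicity score in [0, 1], or None when it cannot decide.
Layer = Callable[[str], Optional[float]]


@dataclass
class Verdict:
    score: float
    layer: str  # which layer resolved the trace


def run_cascade(text: str, layers: list) -> Verdict:
    """Try each (name, layer) pair in order and stop at the first confident answer."""
    for name, layer in layers:
        score = layer(text)
        if score is not None:
            return Verdict(score, name)
    raise ValueError("the final layer must always return a score")


BLOCKLIST = {"idiot", "moron"}  # toy rule set


def deterministic_layer(text: str) -> Optional[float]:
    if not text.strip():
        return 0.0
    return 1.0 if set(text.lower().split()) & BLOCKLIST else None


def classifier_layer(text: str) -> Optional[float]:
    # Stand-in for a small guard model; a real one returns a learned probability.
    probability = 0.5 if "?" in text else 0.05
    confident = probability <= 0.1 or probability >= 0.9
    return probability if confident else None


def judge_layer(text: str) -> Optional[float]:
    return 0.2  # stand-in for a rubric-based LLM judge call


LAYERS = [
    ("deterministic", deterministic_layer),
    ("classifier", classifier_layer),
    ("judge", judge_layer),
]

for trace in ["you idiot", "thanks for the help", "is that sarcasm?"]:
    print(trace, "->", run_cascade(trace, LAYERS).layer)
```

Recording which layer resolved each trace is worth keeping in the real system too: the share that reaches the judge is the number that drives the bill.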

Operating envelope: cost and latency by primitive

| Example | Primitive | p50 latency | Cost / 1k calls |
| --- | --- | --- | --- |
| RegexScanner | Deterministic | <2 ms | $0 |
| JailbreakScanner | Deterministic | <10 ms | $0 |
| BLEU / ROUGE | Embedding (n-gram) | <1 ms | $0 |
| BERTScore | Embedding (contextual) | ~50 ms | <$0.01 |
| SHIELDGEMMA_2B | Classifier | ~50 ms | $0.02 |
| LLAMAGUARD_3_8B | Classifier | ~150 ms | $0.05 |
| TURING_FLASH | Classifier | ~80 ms | $0.03 |
| Groundedness(augment=True) | Cascade | ~200 ms avg | $0.10 avg |
| CustomLLMJudge (Sonnet 4.5) | Judge | ~4 s | ~$5 |
| TaskCompletion | Judge (agent trace) | ~3 s | ~$3 |

Numbers are illustrative; exact values depend on prompt length, region, and model selection. The shape matters more than the exact numbers: deterministic and classifier are roughly 100x cheaper than judge, which is why the cascade is the right default.
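The blended cost is a short calculation: each layer bills every trace that reaches it, and only unresolved traces move on. The sketch below plugs in assumed inputs (a free deterministic layer resolving 75 percent, a $0.03 classifier resolving 15 percent, a $5 judge taking the remaining 10 percent); swap in rates measured on your own traffic.

```python
def blended_cost_per_1k(layers: list) -> float:
    """layers: ordered (cost_per_1k_calls, fraction_of_traces_resolved_here) pairs."""
    reaching = 1.0  # fraction of all traces that reach the current layer
    total = 0.0
    for cost, resolved in layers:
        total += cost * reaching   # every trace that arrives is billed
        reaching -= resolved       # only the unresolved ones continue
    return total


cascade = [
    (0.00, 0.75),  # deterministic
    (0.03, 0.15),  # classifier
    (5.00, 0.10),  # judge
]
judge_only = [(5.00, 1.00)]

cascade_cost = blended_cost_per_1k(cascade)
print(round(cascade_cost, 4))                                # 0.5075
print(round(blended_cost_per_1k(judge_only) / cascade_cost, 1))  # 9.9
```

Under these assumptions the cascade is about ten times cheaper than judging everything, and almost all of the remaining spend is the judge share, so that is the fraction to measure and drive down.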

Common mistakes

  • Over-rely on BLEU and ROUGE. They reward surface overlap, not meaning. Use them as a cheap signal layer, not the primary quality metric.
  • Run only LLM-judge. The eval bill grows faster than the inference bill. Cascade through deterministic and classifier first.
  • Pick the highest-scoring judge. A judge chosen because it hands out high scores is marketing, not measurement. Pick the one that agrees with human labels.
  • Score the response, ignore the trace. Agent failures often live in the tool call or retrieval step. Use agent-specific rubrics and capture the trace.
  • Skip calibration. Every classifier and judge needs periodic agreement checks against human labels. Without calibration you have a number, not a metric.
  • Frozen dataset. A static dataset stops being a regression suite the moment production drifts. Promote failing traces weekly.
  • Treat safety as a separate stack. Safety rubrics belong in the same Evaluator call as quality rubrics. One pipeline, multiple rubrics, one cascade.
  • Latency and cost as graphs, not gates. A correct answer at 30 seconds is a product failure. Gate it in CI.
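The calibration check above can be as simple as chance-corrected agreement between the judge and a human-labeled sample. The sketch below computes Cohen's kappa by hand; the labels are made up, and the acceptance threshold is yours to set.

```python
from collections import Counter


def cohens_kappa(judge_labels: list, human_labels: list) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    labels = set(judge_freq) | set(human_freq)
    # Probability that both raters pick the same label independently.
    expected = sum(judge_freq[l] * human_freq[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

print(round(cohens_kappa(judge, human), 3))  # 0.524
```

Raw agreement here is 80 percent, but kappa is only about 0.52 because both raters say "pass" most of the time. That gap is why raw agreement flatters a judge on imbalanced data.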

How Future AGI ships every family

Future AGI ships the eval stack as a package. Start with the SDK for code-defined custom evals. Graduate to the Platform when you want self-improving evaluators authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0). from fi.evals import Evaluator, Protect, Guardrails. 60+ EvalTemplate classes covering every rubric above: Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, AnswerRefusal, TaskCompletion, LLMFunctionCalling, Toxicity, PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice, NoHarmfulTherapeuticGuidance, format checks (IsJson, ContainsValidLink, IsEmail), and a long tail for tone, summarization, multi-modal, and translation. 13 guardrail backends (9 open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API: OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY). 8 sub-10 ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Four distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL plus AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. Multi-modal CustomLLMJudge via LiteLLM. augment=True cascade across all three primitives.
  • Future AGI Platform (hosted Agent Command Center). Self-improving evaluators tuned by thumbs up/down or relabel feedback from production. In-product authoring agent that turns natural-language descriptions into rubrics, grading prompts, and reference examples. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 — the reason weekly full-dataset reruns are the default, not a budget conversation.
  • Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failing traces into named issues. A Claude Sonnet 4.5 Judge agent on Bedrock writes the RCA, evidence quotes, and an immediate_fix. Fixes feed back into the platform’s self-improving evaluators so your eval suite ages with your product. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel), and C#. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY). 14 span kinds (Phoenix has 8) including A2A_CLIENT and A2A_SERVER. 62 server-side EvalTag rubrics wire to span attributes for zero added inline latency.
  • agent-opt. Six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed and resumable, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) with teacher-inferred few-shot templates and shared EarlyStoppingConfig. The unified Evaluator interface lets you optimize against heuristics, judge scores, or any of the 60+ FAGI rubrics.

The hosted Agent Command Center is the gateway runtime: 17 MB Go binary, six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets for 20+ providers, six routing strategies plus circuit breaker plus shadow/mirror/race modes, six exact and four semantic cache backends, five-level hierarchical budgets (org/team/user/key/tag), and inline guardrails on every request via GuardrailProtectWrapper. Response headers expose per-request signal: x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-routing-strategy, x-agentcc-guardrail-triggered. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.

Ready to score your own traffic? Install ai-evaluation, drop Groundedness(augment=True) plus AnswerRefusal(augment=True) against your last fifty production traces this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. Three primitives, eight rubrics, one cascade — the rest is selection.

Frequently asked questions

How many LLM evaluation metrics actually matter in 2026?
Three primitive families and eight named rubrics. The primitives are deterministic (regex, schema, scanners: sub-10 ms, free), embedding-based (BLEU, ROUGE, BERTScore: sub-millisecond for n-gram overlap, tens of milliseconds for BERTScore, reference-anchored), and LLM-judge (G-Eval, rubric-based scoring: seconds, semantic). The rubrics that ship on most production systems are Groundedness, Refusal, Factual Accuracy, Toxicity, PII, Tone, Latency, and Cost. Anything else on a vendor's marketing page is a permutation of these primitives applied to a specialized failure mode.
When should you use BLEU or ROUGE versus an LLM-judge?
Use BLEU and ROUGE when the task has a clean reference and surface overlap is a meaningful proxy — summarization with gold summaries, translation with gold translations. They cost nothing, run in microseconds, and stay stable across reruns. They fail the moment the model rephrases a correct answer. LLM-judge handles paraphrase because it scores meaning, not n-grams. The right setup runs the cheap layer first, escalates to judge only on the borderline cases.
What rubrics should a RAG system actually ship with?
Three named rubrics carry most of the weight: Groundedness, Context Adherence, and Context Relevance. Add Completeness for long-form answers and Factual Accuracy when you have ground truth. Chunk Attribution and Chunk Utilization are diagnostic — they tell you whether retrieval is over-fetching or under-fetching, useful when groundedness drops and you need to debug. Future AGI ships all seven as named EvalTemplate classes; pick three, ship them, add the rest only when a real failure mode demands them.
What is the augment=True cascade and why does it matter?
It is the hybrid pattern that makes per-trace evaluation economically viable. The Future AGI Evaluator runs a deterministic check first. If the deterministic check is confident, it returns. If not, it cascades to a classifier-backed metric. If the classifier is uncertain, it cascades to an LLM-as-judge call. The result: roughly 85 to 95 percent of traces resolve at the deterministic and classifier layers, frontier-model cost only fires on the genuinely ambiguous 5 to 15 percent, and average per-eval cost lands one to two orders of magnitude below pure LLM-judge.
What is the CI gate composition for an LLM application?
Four to six rubrics, pinned versions, per-route thresholds, baseline diff. The composition is application-shaped. RAG: Groundedness, Context Relevance, Completeness, plus the safety triad. Agent: Function Calling, Task Completion, Refusal, plus the safety triad. Chat: Refusal, Tone, Task Completion. Every shape adds Toxicity, PII, and a latency-cost pair. Pin (judge_model_id, rubric_version, prompt_template_hash). Fail the PR when any rubric drops more than two points from the trailing 7-day baseline or falls below an agreed floor.
Why score Latency and Cost as metrics, not as ops telemetry?
A correct answer at 30 seconds is a product failure. A correct answer that costs $0.50 is a unit-economics failure. Both belong in the eval rubric, not in a separate ops dashboard. Treat p50 and p95 latency per route as gates, not graphs. Treat cost per resolved task as a metric, not a finance report. Production owners ship faster when latency and cost block the CI gate the same way Groundedness does.
How does Future AGI ship every metric family?
An eval stack package, not a single tool. The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes across every family, 13 guardrail backends, 8 sub-10 ms Scanners, four distributed runners, and the augment=True cascade. The Future AGI Platform layers self-improving rubrics tuned by thumbs feedback, an in-product authoring agent for natural-language rubrics, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces into named issues and feeds the fixes back into the self-improving evaluators. traceAI (Apache 2.0) carries the same rubric as a span-attached EvalTag across 50+ AI surfaces in Python, TypeScript, Java, and C#.