LLM Evaluation Metrics: Everything You Need in 2026
There aren't 50 LLM eval metrics. There are three primitive families and eight rubrics that matter in production. The opinionated 2026 reference, with the CI gate and the cascade that make per-trace eval affordable.
There are not 50 LLM evaluation metrics. There are three primitive families and a handful of named rubrics that decide whether a system ships. Most vendor catalogs are noise on top of that — a permutation of the same primitives applied to a slightly different failure mode, given a new name, charged for separately. The point of this reference is to flatten the catalog. Once you know the primitives and the rubrics that earn their slot, the rest is selection.
TL;DR: the shape of the catalog
| Layer | What it is | Count that matters |
|---|---|---|
| Primitive families | The mechanism the metric uses to score | 3 |
| Rubrics that ship in production | The named jobs the rubric does | 8 |
| Everything else | Permutations of the above | Skip until a real failure demands one |
Three primitives. Eight rubrics. One cascade that makes per-trace evaluation affordable. The rest of this post is the working definition of each.
The three primitive families
Every LLM eval metric is built from one of three primitives or a stack of them. Learn the three and the catalog becomes a lookup table.
Deterministic. A function with no model in the loop. Parse the response into JSON, validate against a schema. Run a regex for a refusal phrasing. Look up cited chunk IDs in the retrieval context. Match a tool call against an expected signature. Deterministic checks are sub-10 ms, free, and never drift. They are also the wrong tool for “is this helpful.” Use them for closed-form questions where the answer is provably right or wrong against a rule.
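A minimal sketch of three of the checks named above (JSON parse plus required keys, a refusal regex, and a cited-chunk lookup), using only the standard library. The function names and the refusal phrasings are illustrative assumptions, not part of any SDK.

```python
import json
import re

# Hypothetical refusal phrasings; replace with the wording your system actually emits.
REFUSAL_RE = re.compile(
    r"\b(?:i can(?:'|no)t help with|i(?: a|')m unable to)\b", re.IGNORECASE
)

def is_valid_json_object(text, required_keys):
    """True if text parses as a JSON object that contains every required key."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)

def looks_like_refusal(text):
    """True if the response contains one of the known refusal phrasings."""
    return REFUSAL_RE.search(text) is not None

def citations_resolve(cited_ids, retrieved_ids):
    """True if every cited chunk ID is present in the retrieval context."""
    return set(cited_ids).issubset(retrieved_ids)
```

Each check is a pure function of its inputs, which is why the same input always produces the same verdict and there is nothing to recalibrate.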
Embedding-based. Compare the candidate against a reference and score similarity. BLEU and ROUGE do it at the n-gram level by counting surface overlap. METEOR adds stem and synonym matches. BERTScore projects both texts into contextual embeddings and matches tokens by vector similarity. Output is a similarity score that tolerates more paraphrase as you climb the stack. Useful when you have a clean gold answer or as a feature for clustering failing traces. Confidently wrong answers that share vocabulary still score high: these metrics score “looks similar,” not “is correct.” See What is BLEU, ROUGE, and BERTScore? for the deep dive.
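To make the "looks similar, not is correct" failure concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch, not the official implementation: it tokenizes on whitespace and does no stemming. The example sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over shared unigrams, counting each token at most as often as it appears in both."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the drug is safe for children under five"
wrong = "the drug is not safe for children under five"            # opposite meaning
paraphrase = "kids younger than five can take this medicine safely"  # same meaning

wrong_score = unigram_f1(wrong, reference)            # about 0.94
paraphrase_score = unigram_f1(paraphrase, reference)  # about 0.12
```

The answer that contradicts the reference scores far higher than the faithful paraphrase, which is the reason overlap metrics work as a cheap signal layer and not as the primary quality metric.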
LLM-as-judge. A capable model reads the rubric, reads the candidate response, reads the context, returns a score. G-Eval formalized the pattern in 2023. Pairwise variants scaled it to ship decisions across millions of comparisons. The judge is the only general-purpose primitive for rubrics that require reasoning — faithfulness, refusal calibration, role adherence, helpfulness. It is also the most expensive primitive and the one most prone to bias. See LLM-as-judge best practices for the calibration pattern.
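The judge pattern reduces to a prompt template, a model call, and a strict parser. The sketch below assumes a 1-5 scale and a `SCORE:` output convention, both chosen for illustration. `call_model` is a hypothetical callable you would back with your own provider client, and `fake_model` is a stub so the example runs offline.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = """You are grading a response against a rubric.
Rubric: {rubric}
Context: {context}
Response: {response}
Reply with a line of the form 'SCORE: <integer 1-5>' followed by one sentence of reasoning."""

def judge(rubric: str, context: str, response: str,
          call_model: Callable[[str], str]) -> Optional[float]:
    """Return a score normalized to [0, 1], or None if the judge output cannot be parsed."""
    raw = call_model(JUDGE_PROMPT.format(rubric=rubric, context=context, response=response))
    match = re.search(r"SCORE:\s*([1-5])\b", raw)
    if match is None:
        return None  # treat unparseable output as "no verdict", never as a pass
    return (int(match.group(1)) - 1) / 4

def fake_model(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "SCORE: 4\nThe answer is supported by the context."

score = judge("Answer only from the context.", "Paris is in France.",
              "Paris is in France.", fake_model)
```

Returning `None` on a parse failure keeps malformed judge output from silently counting as a score.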
Classifier-backed metrics — LlamaGuard 3, Qwen3Guard, Granite Guardian, WildGuard, ShieldGemma — are a special case of the judge primitive. The model is small (600M-8B) and fine-tuned for one taxonomy, so it runs in 50-200 ms at fractions of a cent. Treat them as the cheap judge, not as a fourth primitive. Future AGI ships ten of them behind a single Guardrails interface, plus its in-house TURING_FLASH.
The skill is matching the question to the cheapest primitive that answers it honestly. Running a frontier judge on a binary toxicity decision that a 4B Gemma adapter answers in 65 milliseconds is the most common mistake we see in the audits we run. The answer comes out right; the tool is wrong.
The eight rubrics that matter in production
Across hundreds of production deployments, the same rubrics earn their slot. Pick from these first. Add specialized rubrics only when a real failure cluster forces you.
1. Groundedness. Does the answer only assert claims supported by the retrieved context? The RAG and QA core. Failure mode: hallucination against perfectly good retrieval. Implementation: judge primitive against the context. Future AGI ships it as Groundedness.
2. Refusal. Did the system refuse what it should refuse, and not refuse what it should answer? Two-sided calibration. Over-refusal of benign requests is as much a failure as under-refusal of harmful ones. Implementation: judge primitive against the policy. Future AGI ships it as AnswerRefusal.
3. Factual Accuracy. The final ground-truth check when you have one. When you don’t, the judge scores against an authoritative rubric. Distinct from Groundedness — Groundedness checks “did the model stick to the retrieved context,” Factual Accuracy checks “is the claim actually true in the world.” Future AGI ships it as FactualAccuracy.
4. Toxicity. Multi-axis toxicity scoring across insult, threat, identity attack, sexual, and violent content. Classifier-backed primitive (LlamaGuard 3, ShieldGemma, Turing Flash). Both a CI rubric and a runtime guardrail. Future AGI ships it as Toxicity.
5. PII / Data Privacy Compliance. Detect personal identifiers in inputs and outputs, optionally redact, log the policy decision. Mix of deterministic (regex packs for SSN, credit card, phone) and classifier (named-entity recognition for free-form PII). Future AGI ships it as DataPrivacyCompliance.
6. Tone. Did the assistant stay in its declared persona, register, and brand voice? Judge primitive against a tone rubric. Most product owners under-spec this one and find out from a customer-support escalation. Pin the rubric to brand-voice exemplars rather than asking the judge for “professional tone” in the abstract.
7. Latency. A correct answer at 30 seconds is a product failure. Track p50 and p95 per route, gate p95 against an absolute ceiling in CI. Hard metric, not a graph in an ops dashboard.
8. Cost. A correct answer that costs $0.50 per request is a unit-economics failure. Track per-request cost and cost per resolved task. Gate cost per resolved task as a CI metric. Production owners ship faster when latency and cost block the CI gate the same way Groundedness does.
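Rubrics 7 and 8 are plain arithmetic, which makes them easy to gate. A minimal sketch, assuming a nearest-rank percentile and ceilings you choose per route; the function names, the sample numbers, and the ceilings are all illustrative.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def cost_per_resolved_task(costs_usd, resolved):
    """Total spend divided by the number of requests that resolved the task."""
    resolved_count = sum(resolved)
    if resolved_count == 0:
        return float("inf")
    return sum(costs_usd) / resolved_count

def latency_cost_gate(latencies_ms, costs_usd, resolved, p95_ceiling_ms, cost_ceiling_usd):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    p95 = percentile(latencies_ms, 95)
    if p95 > p95_ceiling_ms:
        failures.append(f"p95 latency {p95} ms exceeds ceiling {p95_ceiling_ms} ms")
    cprt = cost_per_resolved_task(costs_usd, resolved)
    if cprt > cost_ceiling_usd:
        failures.append(f"cost per resolved task ${cprt:.3f} exceeds ceiling ${cost_ceiling_usd:.3f}")
    return failures

# Invented sample: 20 requests on one route, 16 of which resolved the task.
latencies_ms = [800] * 18 + [4000, 31000]
costs_usd = [0.02] * 20
resolved = [True] * 16 + [False] * 4
failures = latency_cost_gate(latencies_ms, costs_usd, resolved,
                             p95_ceiling_ms=3000, cost_ceiling_usd=0.05)
```

In this sample the median is healthy and cost passes, but the p95 breaches the ceiling, so the gate fails. That is the case a dashboard average hides.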
That is the list. Eight rubrics, three primitives. Everything else in a vendor’s marketing taxonomy — and you will see lists of 30, 50, 100 — is a permutation. ChunkAttribution is Groundedness scored at chunk granularity. IsHarmfulAdvice is Toxicity scored against a domain-specific policy. LLMFunctionCalling is Refusal-plus-Task-Completion scored against a tool signature. Useful permutations exist; the point is that you start with the eight, not the eighty.
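The deterministic half of rubric 5 can be sketched as a small regex pack with redaction. The patterns below are simplified assumptions (US-style SSN and phone formats, a Luhn checksum to cut false positives on card numbers) and are nowhere near a complete PII detector. Free-form PII such as names and addresses still needs the classifier layer.

```python
import re

# Simplified, US-centric patterns for illustration only. Order matters: card runs last.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}

def luhn_ok(number):
    """Luhn checksum over the digits in the string, ignoring separators."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(text):
    """Return (redacted_text, sorted list of PII labels found)."""
    found = set()
    for label, pattern in PII_PATTERNS.items():
        def replace(match, label=label):
            if label == "card" and not luhn_ok(match.group()):
                return match.group()  # digit run that fails the checksum: leave it alone
            found.add(label)
            return f"[{label.upper()}]"
        text = pattern.sub(replace, text)
    return text, sorted(found)

clean, labels = redact("SSN 123-45-6789, card 4111 1111 1111 1111, call 555-867-5309.")
```

The returned label list is the piece to log as the policy decision; the redacted text is what moves downstream.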
The rubrics that show up in specific shapes
A few rubrics earn their slot only on specific application shapes. They are not in the core eight, but they are non-negotiable when the shape demands them.
RAG-specific. ContextRelevance (do retrieved chunks address the question?), ContextAdherence (does the answer only use what was retrieved?), Completeness (does the answer cover what the chunks support?), ChunkAttribution, ChunkUtilization. The RAG evaluation deep dive covers the per-stage pattern.
Agent-specific. LLMFunctionCalling (right tool, right arguments, right order), TaskCompletion (did the agent reach the user’s intended end state across all tool calls), PromptInjection (did a tool output rewrite the system instructions?). Agent metrics score the trace, not the response — they need the full span tree, which is why agent eval and tracing are the same system in 2026.
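The "right tool, right arguments, right order" check is deterministic once the trace is captured. A minimal sketch, assuming the trace has already been flattened into an ordered list of tool calls; `ToolCall`, `check_tool_calls`, and the tool names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def check_tool_calls(actual, expected):
    """Compare an agent's tool calls with the expected sequence.

    Returns a list of mismatch messages; an empty list means right tool,
    right arguments, right order.
    """
    actual_names = [call.name for call in actual]
    expected_names = [call.name for call in expected]
    if actual_names != expected_names:
        return [f"tool sequence: expected {expected_names}, got {actual_names}"]
    problems = []
    for step, (got, want) in enumerate(zip(actual, expected)):
        for key, value in want.args.items():
            if got.args.get(key) != value:
                problems.append(
                    f"step {step} ({want.name}): argument {key!r} "
                    f"expected {value!r}, got {got.args.get(key)!r}"
                )
    return problems

expected = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 20}),
]
wrong_amount = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 200}),
]
problems = check_tool_calls(wrong_amount, expected)
```

Only the arguments listed in the expectation are compared, so extra arguments the agent passes are tolerated. Tighten that if your tools are strict.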
Conversation-specific. Conversation Completeness (did the dialogue reach the expected end state?), Knowledge Retention (did the agent carry facts across turns?), Outcome Label (resolved, filed, booked, refunded?). Per-turn evals on multi-turn agents produce false confidence; add conversation-level rubrics. See Multi-turn LLM Evaluation.
Format-specific. IsJson, ContainsValidLink, IsEmail, length bounds, schema match. Deterministic primitive. Sub-millisecond to a few milliseconds. Same input, same verdict, every run.
The point is the catalog is short. Eight always. Three to five more by shape. Then stop.
Pick by application shape
Different shapes ship different defaults.
| Shape | Core rubrics (CI gate every PR) | Add when |
|---|---|---|
| Chat assistant | Refusal, Tone, TaskCompletion + Toxicity, PII + Latency, Cost | Multi-turn → ConversationCompleteness |
| RAG / KB / support | Groundedness, ContextRelevance, Completeness + Toxicity, PII + Latency, Cost | Have ground truth → FactualAccuracy |
| Tool-using agent | LLMFunctionCalling, TaskCompletion, AnswerRefusal + PromptInjection + Latency, Cost | Touches external state → permission-scope check |
| Code generation | Pass-at-1 on fixtures, static analysis, security patterns + Latency, Cost | Security-sensitive → secret-scanning rubric |
| Safety-critical (health, finance, legal) | Domain harm rubric + full Scanner pack as inline guardrails + Refusal | Regulated → domain-specific compliance template |
Every shape adds the safety triad (Toxicity, PII, PromptInjection) and the latency-cost pair. The mistake is starting from the catalog and picking by name. Start from the failure modes that hurt your users and pick by job.
Six well-calibrated rubrics beat twenty noisy ones.
The CI gate composition
A metric you run once is a slide in a deck. A metric you run on every PR is a quality gate.
The gate is four parts:
- Versioned dataset. 50-100 examples per route, sampled from production, biased toward the hardest 10 percent. Refreshed weekly by promoting failing production traces.
- Pinned rubrics. A fixed set tied to the application shape, with `(judge_model_id, rubric_version, prompt_template_hash)` pinned. A vendor swap on the judge is a deliberate eval-suite migration, not a config change.
- Per-route thresholds. Fail the PR if any rubric drops more than two points from the trailing 7-day baseline, or falls below an agreed absolute floor. 0.75 on Groundedness, 0.85 on TaskCompletion are reasonable starting defaults to tune.
- Cascade-aware runner. The runner that executes the rubric uses the cascade — deterministic first, classifier next, judge last — so the CI bill is proportional to ambiguity, not to dataset size.
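Pinning the judge triple can be as simple as a frozen record compared at the start of every run. This is a sketch of the idea, not how any particular SDK stores it; `JudgePin`, `hash_template`, `assert_pinned`, and the model and version strings are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgePin:
    judge_model_id: str
    rubric_version: str
    prompt_template_hash: str

def hash_template(template: str) -> str:
    """Short, stable fingerprint of the grading prompt text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]

def assert_pinned(current: JudgePin, pinned: JudgePin) -> None:
    """Refuse to run the suite if any part of the judge configuration drifted."""
    if current != pinned:
        raise RuntimeError(f"judge configuration drifted: {current} != {pinned}")

TEMPLATE = "Rubric: {rubric}\nContext: {context}\nResponse: {response}"
pinned = JudgePin("judge-model-v1", "groundedness-v3", hash_template(TEMPLATE))
assert_pinned(JudgePin("judge-model-v1", "groundedness-v3", hash_template(TEMPLATE)), pinned)
```

With the pin checked in next to the dataset, a judge swap shows up as a reviewed diff instead of a silent score shift.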
The ai-evaluation SDK ships a CLI (fi run) with an assertion engine that exits non-zero when scores drop below threshold. Wire it into GitHub Actions, GitLab CI, or your build system.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, AnswerRefusal, FactualAccuracy,
    Toxicity, DataPrivacyCompliance, TestCase,
)

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# `dataset` is an iterable of (query, response, context) tuples.
results = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),
        AnswerRefusal(augment=True),
        FactualAccuracy(augment=True),
        Toxicity(augment=True),
        DataPrivacyCompliance(),
    ],
    inputs=[TestCase(query=q, response=r, context=ctx)
            for q, r, ctx in dataset],
)
```
The CI gate is a regression suite, not a leaderboard. The score that matters is the diff against last week’s baseline, not the absolute number on a model card. The full buildout is in Build LLM evaluation framework from scratch.
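The per-route threshold rule can be sketched in a few lines. This assumes scores on a 0-1 scale and reads "two points" as 0.02, which is an interpretation, not something the rule above states. The function name and the sample scores are illustrative.

```python
def gate_failures(current, baseline, floors, max_drop=0.02):
    """Each argument maps rubric name -> score on a 0-1 scale.

    A rubric fails if it dropped more than max_drop from the baseline
    or sits below its absolute floor. Returns a list of failure messages.
    """
    failures = []
    for rubric, score in current.items():
        base = baseline.get(rubric)
        if base is not None and base - score > max_drop:
            failures.append(f"{rubric}: dropped {base - score:.3f} from baseline {base:.2f}")
        floor = floors.get(rubric)
        if floor is not None and score < floor:
            failures.append(f"{rubric}: {score:.2f} is below floor {floor:.2f}")
    return failures

# Invented scores for one route.
current = {"Groundedness": 0.80, "TaskCompletion": 0.84}
baseline = {"Groundedness": 0.81, "TaskCompletion": 0.90}  # trailing 7-day averages
floors = {"Groundedness": 0.75, "TaskCompletion": 0.85}
failures = gate_failures(current, baseline, floors)
```

In CI you would end the script with `sys.exit(1 if failures else 0)` so that a non-empty list blocks the PR.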
The hybrid cascade that makes per-trace eval affordable
No single primitive wins. Deterministic misses everything semantic. Embedding misses everything that needs reasoning. Judge is too expensive to run on every production trace. The production answer is a cascade.
Future AGI’s augment=True flag on every EvalTemplate runs three layers in order, cheapest first:
- Deterministic layer. Regex, scanner, schema match. Roughly 70-80 percent of traces resolve here.
- Classifier-backed layer. One of 10 guardrail backends (`LLAMAGUARD_3_8B`, `QWEN3GUARD_8B/4B/0.6B`, `GRANITE_GUARDIAN_8B/5B`, `WILDGUARD_7B`, `SHIELDGEMMA_2B`, `TURING_FLASH`). Another 10-20 percent resolve here.
- LLM-as-judge layer. Rubric-based judge with a pinned model. The remaining 5-15 percent reach here.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, Toxicity, TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# q, r, ctx: one trace's query, response, and retrieved context.
result = evaluator.evaluate(
    eval_templates=[
        Groundedness(augment=True),
        Toxicity(augment=True),
    ],
    inputs=[TestCase(query=q, response=r, context=ctx)],
)
```
Average per-eval cost lands one to two orders of magnitude below pure LLM-as-judge. That is what makes scoring every production span affordable, instead of sampling 1 percent and hoping the missed 99 percent stay clean.
The cascade is the operational answer to the “LLM eval is too expensive” complaint. The whole stack does not have to be cheap; the average has to be cheap, and the cascade gets the average there. See Deterministic vs LLM-judge evals for the deeper treatment.
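Stripped of any vendor, the cascade is a loop over layers that each either decide or escalate. The sketch below shows that control flow only. It is not how `augment=True` is implemented, and every layer here is a toy stand-in: a hypothetical blocklist, an uppercase-ratio "classifier", and a stub judge.

```python
from typing import Callable, List, Optional, Tuple

# A layer returns True (pass), False (fail), or None ("not sure, escalate").
Layer = Callable[[str], Optional[bool]]

def run_cascade(text: str, layers: List[Tuple[str, Layer]]):
    """Try layers cheapest-first; return (verdict, name of the layer that decided)."""
    for name, layer in layers:
        verdict = layer(text)
        if verdict is not None:
            return verdict, name
    return None, "unresolved"

BLOCKLIST = ("idiot", "moron")  # hypothetical toy word list

def deterministic_layer(text):
    return False if any(word in text.lower() for word in BLOCKLIST) else None

def toy_score(text):
    # Stand-in for a guard model's toxicity probability: share of uppercase letters.
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def classifier_layer(text):
    score = toy_score(text)
    if score >= 0.8:
        return False
    if score <= 0.2:
        return True
    return None  # ambiguous band: escalate to the judge

def judge_layer(text):
    # Stand-in for an LLM judge; a real one would call a pinned model.
    return "genius" not in text.lower()

LAYERS = [("deterministic", deterministic_layer),
          ("classifier", classifier_layer),
          ("judge", judge_layer)]
```

Only inputs that land in the classifier's ambiguous band ever reach the expensive layer, which is the property that keeps the average cost low.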
Operating envelope: cost and latency by primitive
| Example | Primitive | p50 latency | Cost / 1k calls |
|---|---|---|---|
| `RegexScanner` | Deterministic | <2 ms | $0 |
| `JailbreakScanner` | Deterministic | <10 ms | $0 |
| BLEU / ROUGE | Embedding (n-gram) | <1 ms | $0 |
| BERTScore | Embedding (contextual) | ~50 ms | <$0.01 |
| `SHIELDGEMMA_2B` | Classifier | ~50 ms | $0.02 |
| `LLAMAGUARD_3_8B` | Classifier | ~150 ms | $0.05 |
| `TURING_FLASH` | Classifier | ~80 ms | $0.03 |
| `Groundedness(augment=True)` | Cascade | ~200 ms avg | $0.10 avg |
| `CustomLLMJudge` (Sonnet 4.5) | Judge | ~4 s | ~$5 |
| `TaskCompletion` | Judge (agent trace) | ~3 s | ~$3 |
Numbers are illustrative; exact values depend on prompt length, region, and model selection. The shape matters more than the exact numbers: deterministic and classifier are roughly 100x cheaper than judge, which is why the cascade is the right default.
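The blended cost of a cascade is a weighted sum: each layer's cost counts only for the share of traces that reach it. The worked example below plugs in illustrative inputs (75 / 15 / 10 percent resolution and $0, $0.05, $5 per 1k calls). Those splits are assumptions for the arithmetic, not measurements.

```python
def blended_cost_per_1k(layers):
    """layers: list of (fraction_resolved_here, cost_per_1k_calls), cheapest first.

    A trace pays for every layer it passes through, so each layer's cost is
    weighted by the share of traces that reach it.
    """
    reaching = 1.0
    total = 0.0
    for fraction_resolved, cost_per_1k in layers:
        total += reaching * cost_per_1k
        reaching -= fraction_resolved
    return total

layers = [
    (0.75, 0.00),  # deterministic
    (0.15, 0.05),  # classifier
    (0.10, 5.00),  # LLM judge
]
cascade_cost = blended_cost_per_1k(layers)  # 0.25 * 0.05 + 0.10 * 5.00, about $0.51
judge_only_cost = 5.00
savings_ratio = judge_only_cost / cascade_cost  # a little under 10x
```

The judge term dominates the sum, so the share of traces that escalate to the judge is the number to watch: halving it to 5 percent roughly doubles the savings.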
Common mistakes
- Over-rely on BLEU and ROUGE. They reward surface overlap, not meaning. Use them as a cheap signal layer, not the primary quality metric.
- Run only LLM-judge. The bill grows faster than the inference bill. Cascade through deterministic and classifier first.
- Pick the highest-scoring judge. A judge that hands out high scores is marketing, not measurement. Pick the one that agrees with human labels.
- Score the response, ignore the trace. Agent failures often live in the tool call or retrieval step. Use agent-specific rubrics and capture the trace.
- Skip calibration. Every classifier and judge needs periodic agreement checks against human labels. Without calibration you have a number, not a metric.
- Frozen dataset. A static dataset stops being a regression suite the moment production drifts. Promote failing traces weekly.
- Treat safety as a separate stack. Safety rubrics belong in the same Evaluator call as quality rubrics. One pipeline, multiple rubrics, one cascade.
- Latency and cost as graphs, not gates. A correct answer at 30 seconds is a product failure. Gate it in CI.
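The calibration check from the list above comes down to measuring agreement between judge labels and human labels on the same examples. One common choice is Cohen's kappa, which corrects raw agreement for chance. It is swapped in here as an example statistic, not something the list prescribes, and the labels are invented.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]
kappa = cohens_kappa(judge, human)  # raw agreement 0.80, kappa about 0.58
```

Raw agreement of 80 percent looks comfortable. The chance-corrected figure is noticeably lower, which is the gap that periodic calibration is meant to surface.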
How Future AGI ships every family
Future AGI ships the eval stack as a package. Start with the SDK for code-defined custom evals. Graduate to the Platform when you want self-improving evaluators authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0). `from fi.evals import Evaluator, Protect, Guardrails`. 60+ `EvalTemplate` classes covering every rubric above: `Groundedness`, `ContextAdherence`, `ContextRelevance`, `Completeness`, `ChunkAttribution`, `ChunkUtilization`, `FactualAccuracy`, `AnswerRefusal`, `TaskCompletion`, `LLMFunctionCalling`, `Toxicity`, `PromptInjection`, `DataPrivacyCompliance`, `IsHarmfulAdvice`, `NoHarmfulTherapeuticGuidance`, format checks (`IsJson`, `ContainsValidLink`, `IsEmail`), and a long tail for tone, summarization, multi-modal, and translation. 13 guardrail backends (9 open-weight: `LLAMAGUARD_3_8B/1B`, `QWEN3GUARD_8B/4B/0.6B`, `GRANITE_GUARDIAN_8B/5B`, `WILDGUARD_7B`, `SHIELDGEMMA_2B`; 4 API: `OPENAI_MODERATION`, `AZURE_CONTENT_SAFETY`, `TURING_FLASH`, `TURING_SAFETY`). 8 sub-10 ms `Scanners` (`JailbreakScanner`, `CodeInjectionScanner`, `SecretsScanner`, `MaliciousURLScanner`, `InvisibleCharScanner`, `LanguageScanner`, `TopicRestrictionScanner`, `RegexScanner`). Four distributed runners (Celery, Ray, Temporal, Kubernetes). `RailType.INPUT/OUTPUT/RETRIEVAL` plus `AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED`. Multi-modal `CustomLLMJudge` via LiteLLM. `augment=True` cascade across all three primitives.
- Future AGI Platform (hosted Agent Command Center). Self-improving evaluators tuned by thumbs up/down or relabel feedback from production. In-product authoring agent that turns natural-language descriptions into rubrics, grading prompts, and reference examples. Classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is why weekly full-dataset reruns are the default, not a budget conversation.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failing traces into named issues. A Claude Sonnet 4.5 Judge agent on Bedrock writes the RCA, evidence quotes, and an `immediate_fix`. Fixes feed back into the platform’s self-improving evaluators so your eval suite ages with your product. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
- traceAI (Apache 2.0). 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel), and C#. Pluggable semantic conventions at `register()` time (`FI`, `OTEL_GENAI`, `OPENINFERENCE`, `OPENLLMETRY`). 14 span kinds (Phoenix has 8) including `A2A_CLIENT` and `A2A_SERVER`. 62 server-side `EvalTag` rubrics wire to span attributes for zero added inline latency.
- agent-opt. Six optimizers (`RandomSearchOptimizer`, `BayesianSearchOptimizer` (Optuna-backed and resumable), `MetaPromptOptimizer`, `ProTeGi`, `GEPAOptimizer`, `PromptWizardOptimizer`) with teacher-inferred few-shot templates and shared `EarlyStoppingConfig`. The unified `Evaluator` interface lets you optimize against heuristics, judge scores, or any of the 60+ FAGI rubrics.
The hosted Agent Command Center is the gateway runtime: 17 MB Go binary, six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets for 20+ providers, six routing strategies plus circuit breaker plus shadow/mirror/race modes, six exact and four semantic cache backends, five-level hierarchical budgets (org/team/user/key/tag), and inline guardrails on every request via GuardrailProtectWrapper. Response headers expose per-request signal: x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-routing-strategy, x-agentcc-guardrail-triggered. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
Ready to score your own traffic? Install ai-evaluation, drop Groundedness(augment=True) plus AnswerRefusal(augment=True) against your last fifty production traces this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. Three primitives, eight rubrics, one cascade — the rest is selection.
Related reading
- The 2026 LLM Evaluation Playbook
- Evaluating LLM Systems: Metrics and Benchmarks (2026)
- What is LLM Evaluation?
- Deterministic LLM Evaluation Metrics
- What is BLEU, ROUGE, and BERTScore?
- G-Eval: A Definitive Guide (2026)
- LLM-as-Judge Best Practices (2026)
- RAG Evaluation Metrics Deep Dive (2026)
- Build LLM Evaluation Framework From Scratch (2026)
- LLM Evaluation Architecture (2026)