The LLM Evaluation Glossary (2026 Definitions)
A practitioner's dictionary for LLM evaluation in 2026: the 30 most-confused terms, what each means, when it appears, and the adjacent terms it gets mixed up with.
Table of Contents
LLM evaluation has too many words for too few ideas. The same concept ships under three names, the same name ships across three concepts, and most rubric debates are vocabulary debates wearing a quality costume. This is the practitioner’s dictionary: 30 of the most-confused terms, one canonical reading each, the adjacent terms they get mixed up with, and the Future AGI primitive that implements each.
TL;DR
Thirty terms. Two disambiguation tables up front, one cheat sheet at the end. Read Faithfulness, Evaluator vs Guardrail, and Span vs Trace vs Session first; those three resolve most cross-team confusion. Companion reads: the metrics reference, the 2026 playbook, and the build-from-scratch guide.
Same word, different camp
| Term | Reading A | Reading B |
|---|---|---|
| Groundedness | Claim has a source span (FAGI, Ragas) | Claim is true vs external truth |
| Faithfulness | Claims supported by retrieval context | Claims preserve source meaning (translation lineage) |
| Drift | Behaviour change without code change | A single regression on a held-out set |
| Relevance | Did the answer address the question | Did the chunks address the question |
| Calibration | Confidence matches accuracy | Judge agrees with humans |
When two vendors argue about whether a system is “grounded,” check which reading each is using before checking the numbers.
Different words, same thing
| Term cluster | Canonical reading | Common synonyms |
|---|---|---|
| Faithfulness | Claims supported by retrieval context | RAG-truth, context faithfulness, source attribution |
| Factuality | Claim is true vs external truth | Factual accuracy, correctness, veracity |
| LLM-as-judge | Capable LLM scores against rubric | Auto-eval, model-graded, AI-graded |
| Refusal | System declined a request | Abstention, soft-refusal, abdication |
| Atomic claim decomposition | Split response into claims, check each | FActScore, claim-level scoring, sub-claim eval |
A to D
Adequacy
Translation-era metric for meaning preservation. Modern stacks collapse it into faithfulness when the source is a retrieval context, and into task completion when the source is a user request. FAGI surface: no direct template; use Groundedness or Completeness.
Answer Relevance vs Context Relevance
Two rubrics that get collapsed into “relevance” and cause exactly the wrong RAG fix. Answer relevance asks whether the response addresses the user’s question. Context relevance asks whether the retrieved chunks address the question. A pipeline can score high on answer relevance and low on context relevance when the model bluffs without retrieval, and vice versa when retrieval is right but the model wanders. FAGI surface: AnswerRelevance and ContextRelevance in ai-evaluation.
Atomic Claim Decomposition
The technique behind any stable faithfulness rubric. The response is split into discrete claims, each is checked against the retrieval context, the score is the proportion supported. Decomposition is what holds the metric stable across phrasing. FActScore formalised it; production stacks bake it into the judge prompt. FAGI surface: FactualAccuracy and Groundedness run decomposition by default.
AUC-ROC / F1 / Precision / Recall
Classical classification metrics that survive because most guardrails and classifier-backed evals are still classification problems. Calibrate thresholds against precision and recall, not accuracy — class imbalance breaks accuracy. FAGI surface: ThresholdCalibrator in the Platform.
BERTScore
Neural metric comparing candidate and reference text by token-level contextual embedding similarity. Better than BLEU at paraphrase, worse than LLM-as-judge at reasoning. Cheap signal during dataset curation and a sanity check against judge drift. Confused with: BLEURT, MoverScore. FAGI surface: usable inside CustomLLMJudge when a reference exists.
BLEU / ROUGE / METEOR / chrF
The n-gram-overlap family from MT and summarisation. Useful for tasks with a fixed reference, useless for open-ended generation. They measure surface overlap, not semantic similarity. FAGI surface: reference metrics, not primary FAGI templates. See What is BLEU, ROUGE, BERTScore?
Calibration
The property that stated confidence matches observed accuracy. Measured with Brier score, Expected Calibration Error, and reliability diagrams. Confused with: judge calibration (matching judge to humans) and threshold calibration (tuning an operating point). Three different jobs. FAGI surface: ThresholdCalibrator; see the judge-bias post.
Chain-of-Thought
The model reasons step by step before answering. Two failure modes for eval: chain wrong but answer right, chain right but answer wrong. Score the chain and the answer separately. Confused with: ReAct (reasoning plus tool use), chain-of-self-correction. FAGI surface: LLMFunctionCalling and TaskCompletion score reasoning chains; traceAI tags them with the CHAIN span kind.
Citation Validity vs Citation Integrity
Two adjacent rubrics that look the same and are not. Validity asks whether the cited span exists in the retrieval context — a cheap exact-match check. Integrity asks whether the span actually supports the claim — an LLM-as-judge check. Validity catches lazy hallucinations; integrity catches motivated misattribution. FAGI surface: ChunkAttribution for validity, ChunkUtilization for coverage, integrity via CustomLLMJudge.
Drift vs Regression vs Decay
Three failure shapes that vendors paper over with one word. Regression — single delta against a baseline, after a code or prompt change. Drift — continuous behaviour change without code change, often from a silent provider update. Decay — gradual quality loss from dataset staleness. CI gates catch regressions, continuous re-scoring catches drift, weekly refresh catches decay. FAGI surface: CI gate via fi run; drift detection in the Platform.
E to L
Evaluator vs Guardrail
The most consequential vocabulary mix-up in 2026. An evaluator scores after the fact and feeds metrics into dashboards, CI gates, and the dataset. A guardrail runs inline, blocks or rewrites the output, and adds latency to every request. The same rubric can run in both modes. PromptInjection as evaluator tells you what percentage of traces tried an injection; as guardrail it blocks the request. Teams confuse them in two directions — building an inline judge that takes seconds, or post-hoc-checking something that needed to be blocked. FAGI surface: Evaluator(...).evaluate(...) for scoring; Protect and Guardrails for inline enforcement.
Eval Cascade (augment=True)
Cheap deterministic evals run first; the LLM-as-judge runs only on cases that need semantic scoring. Roughly 70 to 90 percent of traces resolve at the deterministic or classifier layer; the judge fires on the 10 to 30 percent that need reasoning. Average per-eval cost lands one to two orders of magnitude below pure judge. FAGI surface: Template(augment=True) flag on every EvalTemplate. See deterministic vs LLM-judge evals.
EvalTemplate / Rubric / Judge / Evaluator
Four words for four layers of the same thing. The rubric is the contract: what to measure, on what scale. The judge is the implementation: an LLM, a classifier, a regex, or a stack. The evaluator is the runtime that takes (input, response, context) and emits (score, rationale). EvalTemplate is the FAGI noun for a pre-built rubric. Teams that say “we have evals” without distinguishing the layers usually have a notebook, not a system. FAGI surface: EvalTemplate, CustomLLMJudge, Evaluator.
Faithfulness vs Groundedness vs Factuality vs RAG-Truth
The single biggest source of rubric confusion.
- Faithfulness — response asserts only claims supported by the retrieval context. Headline RAG metric.
- Groundedness — every claim links to a source span. Deterministically checkable via citation match. A faithful response can be ungrounded (stuck to context but did not cite); a grounded response can be unfaithful (cited the wrong span).
- Factuality / Factual Accuracy — the claim is true in the world, independent of retrieval. Needs external truth or a knowledge base.
- RAG-truth — vendor coinage; almost always means faithfulness.
A correct RAG pipeline scores all three because they fail in different directions. FAGI surface: Groundedness and ContextAdherence cover the first two; FactualAccuracy covers factuality; ChunkAttribution is the deterministic citation check.
Golden Set vs Hold-Out vs Validation
Three dataset roles that get pooled into “test set” and break the same way. Golden set — small, human-labelled, version-locked. Hold-out — never used for tuning; measures generalisation. Validation — used during development to compare configurations. Pooling them is how rubrics look good in dev and fail in production. FAGI surface: the Platform manages all three with version pins.
G-Eval / Pairwise / Arena-Style
Three judging modes. G-Eval scores one response on a rubric directly — the workhorse. Pairwise compares two responses and picks the better — more reliable for fuzzy rubrics like helpfulness. Arena-style aggregates many pairwise judgements into an Elo score. Pairwise wins where absolute scoring is unreliable; G-Eval wins where the rubric has a clear unit. FAGI surface: all three through CustomLLMJudge. See the G-Eval guide.
Hallucination
Not one failure mode; four. Factual — untrue about the world. Faithfulness — not supported by retrieval, even if true. Closed-domain — invented inside a constrained task (a citation, a tool argument). Open-domain — invented in free-form generation. Plus an adversarial fifth: confident wrong answer when the question is unanswerable. Treating them as one rubric is why hallucination dashboards do not move. FAGI surface: FactualAccuracy, Groundedness, ChunkAttribution, AnswerRefusal. See hallucination deep dive.
Judge Bias
Six patterns. Position (first answer in pairwise scores higher), length (longer scores higher regardless of quality), verbosity (wordier reasoning scores higher), self-bias (judge prefers its own model family), sycophancy (judge agrees with whichever side the prompt frames as correct), prior leakage (rubric examples bleed into scoring). Each can shift a score by 5 to 20 points. FAGI surface: position and length tracked automatically in pairwise mode; see the judge-bias post.
LLM-as-Judge
An LLM scores another LLM’s output against a rubric. Cheap to set up, expensive to run, subject to judge bias. Calibrate against a golden set, pin (judge_model_id, rubric_version, prompt_template_hash), run a Cohen’s kappa check against human labels above 0.6 before trusting it. FAGI surface: every EvalTemplate is implementable as LLM-as-judge; CustomLLMJudge is the wrapper. See LLM-as-judge best practices.
M to R
MRR / NDCG / Recall@K
Three retrieval metrics that travel into RAG evaluation. MRR — average inverse rank of the first relevant chunk. NDCG — graded relevance score that discounts lower-ranked results. Recall@K — proportion of relevant chunks in the top K. All three need labelled relevance; without labels, fall back to LLM-as-judge on ContextRelevance. FAGI surface: ContextRelevance, ContextRecall, ChunkUtilization. See the RAG metrics deep dive.
OpenInference vs OTel GenAI vs OPENLLMETRY
Three semantic conventions for AI tracing that get treated as competing standards when they actually coexist. OpenInference — Arize-led, rich attributes for LLM calls, retrievals, agents, evaluators. OTel GenAI — the OpenTelemetry working group’s, minimal, broadly compatible. OPENLLMETRY — Traceloop’s, focused on LLM provider calls. FAGI surface: traceAI ingests OpenInference natively and translates OTel GenAI plus OPENLLMETRY; pluggable at register() time.
PII Detection vs Data Privacy Compliance
Two rubrics that get collapsed and should not. PII detection is the technical job — find personally identifiable strings across 18 entity types. Data privacy compliance is the policy job — given a PII detection, decide whether to redact, log, deny, or escalate based on jurisdiction, consent, and route. Detection is necessary; compliance is what regulators ask for. FAGI surface: DataPrivacyCompliance template plus PII scanners. See PII detection deep dive.
Prompt Injection: Direct vs Indirect
Input data carries instructions that override the system prompt, split by entry vector. Direct — user message contains the override. Indirect — retrieved document, tool result, or external page contains it. Direct is caught on input rails; indirect needs retrieval rails. Confused with: jailbreak (related but distinct — aims at safety bypass rather than instruction override). FAGI surface: PromptInjection template plus JailbreakScanner. See the prompt-injection defence guide.
Refusal
The system declines a request. Two-sided rubric: over-refusal of benign requests is as much a failure as under-refusal of harmful ones. Optimising one direction alone produces theatre. Confused with: abstention, soft-refusal, abdication — all synonyms. FAGI surface: AnswerRefusal; pair with Toxicity and IsHarmfulAdvice for the safety triad.
Rubric
The unit of LLM evaluation. Specifies what to measure, on what scale, with what reasoning. Not a single metric: a rubric is the contract between eval authors and the runtime. Rubrics drift, need versioning, and should be reviewed in PR. The biggest gap in most eval programs is treating “we have evals” as the same thing as “we have rubrics.” Confused with: metric (the score), prompt (one part of the rubric). FAGI surface: every EvalTemplate is a rubric; the Platform’s authoring agent turns natural-language descriptions into rubrics.
RLHF / DPO / RLAIF
Three preference-tuning techniques. RLHF trains a reward model on human preferences and tunes the policy against it. RLAIF replaces the human with an LLM judge. DPO skips the reward model and optimises directly on preference pairs. All three consume preference data, which is what an eval stack with thumbs feedback collects. FAGI surface: preference data from Platform feedback feeds into self-improving evaluators.
S to Z
Span vs Trace vs Session
Three nesting levels that teams flatten into “trace” and then lose the ability to ask conversation-level questions. A span is one operation — an LLM call, a retrieval, a tool invocation. A trace is the tree of spans for one request. A session groups multiple traces that share a conversation. Eval scores attach to spans, rubrics roll up to traces, conversation rubrics roll up to sessions — what makes per-turn, per-request, and per-conversation metrics work in one system. FAGI surface: traceAI emits OTel-native span and trace IDs plus a session attribute.
Span Kinds
The OpenInference taxonomy of span types. Fourteen in active use: LLM, RETRIEVER, CHAIN, AGENT, TOOL, EMBEDDING, RERANKER, GUARDRAIL, EVALUATOR, A2A_CLIENT, A2A_SERVER, plus three modality kinds. Routes spans to the right evaluator and makes trace-tree views legible. Confused with: OTel’s broader span “kind” (server/client/internal), which is orthogonal. FAGI surface: EvalSpanKind auto-set by traceAI across 50+ AI surfaces.
Shadow / Mirror / Race / Canary
Four rollout modes that teams treat as one. Shadow — run the new variant alongside the old without serving it; compare offline. Mirror — same as shadow but the response is computed and discarded, so latency tail is measured. Race — send both, return whichever responds first. Canary — route a small percentage of real traffic to the new variant. Confused with: A/B testing (a measurement design, not a rollout mode). FAGI surface: all four supported by the Agent Command Center gateway. See agent rollout strategies.
Sycophancy
The model agrees with the user even when the user is wrong. Measured as the rate at which the model flips its answer when the user pushes back without new evidence. Distinct from helpfulness, which can produce the same surface behaviour for the right reasons. Sycophancy is also a judge-bias category. FAGI surface: measurable through CustomLLMJudge with paired prompts.
TTFT vs Inter-Token vs Total Latency
Three latency metrics that get collapsed into “latency.” TTFT — wall-clock from request to first streamed token. Inter-token — average gap between successive tokens. Total — request to last token. Mean total latency hides streaming behaviour and the long tail; surface all three. FAGI surface: x-agentcc-latency-ms gateway header captures total; span attributes capture TTFT and inter-token.
The cheat sheet
| Confused pair | The cheap distinction |
|---|---|
| Faithfulness vs Factuality | Vs the context; vs the world |
| Groundedness vs Faithfulness | Claim has a source; claim is supported by source |
| Evaluator vs Guardrail | Scores after; blocks during |
| Span vs Trace vs Session | Operation; request; conversation |
| Regression vs Drift vs Decay | Code-change delta; silent provider change; dataset staleness |
| Answer vs Context Relevance | Answer addresses question; chunks address question |
| Citation Validity vs Integrity | Span exists; span supports the claim |
| Direct vs Indirect Injection | User message; retrieved or tool content |
| PII Detection vs Compliance | Find the string; decide the policy |
| TTFT vs Total Latency | Perceived; measured |
If a vocabulary argument lasts more than five minutes, it is probably one of these.
How the glossary maps to the FAGI stack
Three surfaces. The ai-evaluation SDK ships 60+ EvalTemplate classes plus Evaluator, Protect, Guardrails, CustomLLMJudge, 13 guardrail backends, 8 sub-10 ms Scanner classes, and four distributed runners. The Future AGI Platform hosts self-improving evaluators, an in-product authoring agent that turns natural-language descriptions into rubrics, and an Error Feed that runs HDBSCAN soft clustering plus a Sonnet 4.5 Judge writing immediate_fix back into the evaluators. traceAI is the OTel-compatible tracing layer with 50+ AI surface instrumentations across Python, TypeScript, Java, and C#.
Honest framing: vendor terminology drifts. If a term reads differently in another stack, that is a vocabulary gap, not a contradiction. The metrics reference, the playbook, and the build-from-scratch guide are the next reads.
Related reading
Frequently asked questions
Why does LLM evaluation need a disambiguation glossary in 2026?
Faithfulness, groundedness, factuality, RAG-truth — are these the same thing?
What is the difference between an evaluator, a rubric, and a judge?
What is the difference between an evaluator and a guardrail?
What is the most common term confusion in production today?
How does Future AGI's stack map to the terms in this glossary?
The pillar playbook for LLM evaluation in 2026: dataset, metrics, judge, CI gate, production observation, and the closed loop from failing trace back to regression test.
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.
Evaluating DSPy pipelines in 2026: why the compile metric isn't your production rubric, and how to eval the Signature instead of the program.