Guides

The LLM Evaluation Glossary (2026 Definitions)

A practitioner's dictionary for LLM evaluation in 2026: the 30 most-confused terms, what each means, and the adjacent terms it conflates with.

March 9, 2026

Updated May 20, 2026

13 min read

llm-evaluation glossary definitions rag agent-evaluation llm-observability 2026

Table of Contents

LLM evaluation has too many words for too few ideas. The same concept ships under three names, the same name ships across three concepts, and most rubric debates are vocabulary debates wearing a quality costume. This is the practitioner’s dictionary: 30 of the most-confused terms, one canonical reading each, the adjacent terms they get mixed up with, and the Future AGI primitive that implements each.

TL;DR

Thirty terms. Two disambiguation tables up front, one cheat sheet at the end. Read Faithfulness, Evaluator vs Guardrail, and Span vs Trace vs Session first; those three resolve most cross-team confusion. Companion reads: the metrics reference, the 2026 playbook, and the build-from-scratch guide.

Same word, different camp

Term	Reading A	Reading B
Groundedness	Claim has a source span (FAGI, Ragas)	Claim is true vs external truth
Faithfulness	Claims supported by retrieval context	Claims preserve source meaning (translation lineage)
Drift	Behaviour change without code change	A single regression on a held-out set
Relevance	Did the answer address the question	Did the chunks address the question
Calibration	Confidence matches accuracy	Judge agrees with humans

When two vendors argue about whether a system is “grounded,” check which reading each is using before checking the numbers.

Different words, same thing

Term cluster	Canonical reading	Common synonyms
Faithfulness	Claims supported by retrieval context	RAG-truth, context faithfulness, source attribution
Factuality	Claim is true vs external truth	Factual accuracy, correctness, veracity
LLM-as-judge	Capable LLM scores against rubric	Auto-eval, model-graded, AI-graded
Refusal	System declined a request	Abstention, soft-refusal, abdication
Atomic claim decomposition	Split response into claims, check each	FActScore, claim-level scoring, sub-claim eval

A to D

Adequacy

Translation-era metric for meaning preservation. Modern stacks collapse it into faithfulness when the source is a retrieval context, and into task completion when the source is a user request. FAGI surface: no direct template; use Groundedness or Completeness.

Answer Relevance vs Context Relevance

Two rubrics that get collapsed into “relevance” and cause exactly the wrong RAG fix. Answer relevance asks whether the response addresses the user’s question. Context relevance asks whether the retrieved chunks address the question. A pipeline can score high on answer relevance and low on context relevance when the model bluffs without retrieval, and vice versa when retrieval is right but the model wanders. FAGI surface: AnswerRelevance and ContextRelevance in ai-evaluation.

Atomic Claim Decomposition

The technique behind any stable faithfulness rubric. The response is split into discrete claims, each is checked against the retrieval context, the score is the proportion supported. Decomposition is what holds the metric stable across phrasing. FActScore formalised it; production stacks bake it into the judge prompt. FAGI surface: FactualAccuracy and Groundedness run decomposition by default.

AUC-ROC / F1 / Precision / Recall

Classical classification metrics that survive because most guardrails and classifier-backed evals are still classification problems. Calibrate thresholds against precision and recall, not accuracy — class imbalance breaks accuracy. FAGI surface: ThresholdCalibrator in the Platform.

BERTScore

Neural metric comparing candidate and reference text by token-level contextual embedding similarity. Better than BLEU at paraphrase, worse than LLM-as-judge at reasoning. Cheap signal during dataset curation and a sanity check against judge drift. Confused with: BLEURT, MoverScore. FAGI surface: usable inside CustomLLMJudge when a reference exists.

BLEU / ROUGE / METEOR / chrF

The n-gram-overlap family from MT and summarisation. Useful for tasks with a fixed reference, useless for open-ended generation. They measure surface overlap, not semantic similarity. FAGI surface: reference metrics, not primary FAGI templates. See What is BLEU, ROUGE, BERTScore?

Calibration

The property that stated confidence matches observed accuracy. Measured with Brier score, Expected Calibration Error, and reliability diagrams. Confused with: judge calibration (matching judge to humans) and threshold calibration (tuning an operating point). Three different jobs. FAGI surface: ThresholdCalibrator; see the judge-bias post.

Chain-of-Thought

The model reasons step by step before answering. Two failure modes for eval: chain wrong but answer right, chain right but answer wrong. Score the chain and the answer separately. Confused with: ReAct (reasoning plus tool use), chain-of-self-correction. FAGI surface: LLMFunctionCalling and TaskCompletion score reasoning chains; traceAI tags them with the CHAIN span kind.

Citation Validity vs Citation Integrity

Two adjacent rubrics that look the same and are not. Validity asks whether the cited span exists in the retrieval context — a cheap exact-match check. Integrity asks whether the span actually supports the claim — an LLM-as-judge check. Validity catches lazy hallucinations; integrity catches motivated misattribution. FAGI surface: ChunkAttribution for validity, ChunkUtilization for coverage, integrity via CustomLLMJudge.

Drift vs Regression vs Decay

Three failure shapes that vendors paper over with one word. Regression — single delta against a baseline, after a code or prompt change. Drift — continuous behaviour change without code change, often from a silent provider update. Decay — gradual quality loss from dataset staleness. CI gates catch regressions, continuous re-scoring catches drift, weekly refresh catches decay. FAGI surface: CI gate via fi run; drift detection in the Platform.

E to L

Evaluator vs Guardrail

The most consequential vocabulary mix-up in 2026. An evaluator scores after the fact and feeds metrics into dashboards, CI gates, and the dataset. A guardrail runs inline, blocks or rewrites the output, and adds latency to every request. The same rubric can run in both modes. PromptInjection as evaluator tells you what percentage of traces tried an injection; as guardrail it blocks the request. Teams confuse them in two directions — building an inline judge that takes seconds, or post-hoc-checking something that needed to be blocked. FAGI surface: Evaluator(...).evaluate(...) for scoring; Protect and Guardrails for inline enforcement.

Eval Cascade (`augment=True`)

Cheap deterministic evals run first; the LLM-as-judge runs only on cases that need semantic scoring. Roughly 70 to 90 percent of traces resolve at the deterministic or classifier layer; the judge fires on the 10 to 30 percent that need reasoning. Average per-eval cost lands one to two orders of magnitude below pure judge. FAGI surface: Template(augment=True) flag on every EvalTemplate. See deterministic vs LLM-judge evals.

EvalTemplate / Rubric / Judge / Evaluator

Four words for four layers of the same thing. The rubric is the contract: what to measure, on what scale. The judge is the implementation: an LLM, a classifier, a regex, or a stack. The evaluator is the runtime that takes (input, response, context) and emits (score, rationale). EvalTemplate is the FAGI noun for a pre-built rubric. Teams that say “we have evals” without distinguishing the layers usually have a notebook, not a system. FAGI surface: EvalTemplate, CustomLLMJudge, Evaluator.

Faithfulness vs Groundedness vs Factuality vs RAG-Truth

The single biggest source of rubric confusion.

Faithfulness — response asserts only claims supported by the retrieval context. Headline RAG metric.
Groundedness — every claim links to a source span. Deterministically checkable via citation match. A faithful response can be ungrounded (stuck to context but did not cite); a grounded response can be unfaithful (cited the wrong span).
Factuality / Factual Accuracy — the claim is true in the world, independent of retrieval. Needs external truth or a knowledge base.
RAG-truth — vendor coinage; almost always means faithfulness.

A correct RAG pipeline scores all three because they fail in different directions. FAGI surface: Groundedness and ContextAdherence cover the first two; FactualAccuracy covers factuality; ChunkAttribution is the deterministic citation check.

Golden Set vs Hold-Out vs Validation

Three dataset roles that get pooled into “test set” and break the same way. Golden set — small, human-labelled, version-locked. Hold-out — never used for tuning; measures generalisation. Validation — used during development to compare configurations. Pooling them is how rubrics look good in dev and fail in production. FAGI surface: the Platform manages all three with version pins.

G-Eval / Pairwise / Arena-Style

Three judging modes. G-Eval scores one response on a rubric directly — the workhorse. Pairwise compares two responses and picks the better — more reliable for fuzzy rubrics like helpfulness. Arena-style aggregates many pairwise judgements into an Elo score. Pairwise wins where absolute scoring is unreliable; G-Eval wins where the rubric has a clear unit. FAGI surface: all three through CustomLLMJudge. See the G-Eval guide.

Hallucination

Not one failure mode; four. Factual — untrue about the world. Faithfulness — not supported by retrieval, even if true. Closed-domain — invented inside a constrained task (a citation, a tool argument). Open-domain — invented in free-form generation. Plus an adversarial fifth: confident wrong answer when the question is unanswerable. Treating them as one rubric is why hallucination dashboards do not move. FAGI surface: FactualAccuracy, Groundedness, ChunkAttribution, AnswerRefusal. See hallucination deep dive.

Judge Bias

Six patterns. Position (first answer in pairwise scores higher), length (longer scores higher regardless of quality), verbosity (wordier reasoning scores higher), self-bias (judge prefers its own model family), sycophancy (judge agrees with whichever side the prompt frames as correct), prior leakage (rubric examples bleed into scoring). Each can shift a score by 5 to 20 points. FAGI surface: position and length tracked automatically in pairwise mode; see the judge-bias post.

LLM-as-Judge

An LLM scores another LLM’s output against a rubric. Cheap to set up, expensive to run, subject to judge bias. Calibrate against a golden set, pin (judge_model_id, rubric_version, prompt_template_hash), run a Cohen’s kappa check against human labels above 0.6 before trusting it. FAGI surface: every EvalTemplate is implementable as LLM-as-judge; CustomLLMJudge is the wrapper. See LLM-as-judge best practices.

M to R

MRR / NDCG / Recall@K

Three retrieval metrics that travel into RAG evaluation. MRR — average inverse rank of the first relevant chunk. NDCG — graded relevance score that discounts lower-ranked results. Recall@K — proportion of relevant chunks in the top K. All three need labelled relevance; without labels, fall back to LLM-as-judge on ContextRelevance. FAGI surface: ContextRelevance, ContextRecall, ChunkUtilization. See the RAG metrics deep dive.

OpenInference vs OTel GenAI vs OPENLLMETRY

Three semantic conventions for AI tracing that get treated as competing standards when they actually coexist. OpenInference — Arize-led, rich attributes for LLM calls, retrievals, agents, evaluators. OTel GenAI — the OpenTelemetry working group’s, minimal, broadly compatible. OPENLLMETRY — Traceloop’s, focused on LLM provider calls. FAGI surface: traceAI ingests OpenInference natively and translates OTel GenAI plus OPENLLMETRY; pluggable at register() time.

PII Detection vs Data Privacy Compliance

Two rubrics that get collapsed and should not. PII detection is the technical job — find personally identifiable strings across 18 entity types. Data privacy compliance is the policy job — given a PII detection, decide whether to redact, log, deny, or escalate based on jurisdiction, consent, and route. Detection is necessary; compliance is what regulators ask for. FAGI surface: DataPrivacyCompliance template plus PII scanners. See PII detection deep dive.

Prompt Injection: Direct vs Indirect

Input data carries instructions that override the system prompt, split by entry vector. Direct — user message contains the override. Indirect — retrieved document, tool result, or external page contains it. Direct is caught on input rails; indirect needs retrieval rails. Confused with: jailbreak (related but distinct — aims at safety bypass rather than instruction override). FAGI surface: PromptInjection template plus JailbreakScanner. See the prompt-injection defence guide.

Refusal

The system declines a request. Two-sided rubric: over-refusal of benign requests is as much a failure as under-refusal of harmful ones. Optimising one direction alone produces theatre. Confused with: abstention, soft-refusal, abdication — all synonyms. FAGI surface: AnswerRefusal; pair with Toxicity and IsHarmfulAdvice for the safety triad.

Rubric

The unit of LLM evaluation. Specifies what to measure, on what scale, with what reasoning. Not a single metric: a rubric is the contract between eval authors and the runtime. Rubrics drift, need versioning, and should be reviewed in PR. The biggest gap in most eval programs is treating “we have evals” as the same thing as “we have rubrics.” Confused with: metric (the score), prompt (one part of the rubric). FAGI surface: every EvalTemplate is a rubric; the Platform’s authoring agent turns natural-language descriptions into rubrics.

RLHF / DPO / RLAIF

Three preference-tuning techniques. RLHF trains a reward model on human preferences and tunes the policy against it. RLAIF replaces the human with an LLM judge. DPO skips the reward model and optimises directly on preference pairs. All three consume preference data, which is what an eval stack with thumbs feedback collects. FAGI surface: preference data from Platform feedback feeds into self-improving evaluators.

S to Z

Span vs Trace vs Session

Three nesting levels that teams flatten into “trace” and then lose the ability to ask conversation-level questions. A span is one operation — an LLM call, a retrieval, a tool invocation. A trace is the tree of spans for one request. A session groups multiple traces that share a conversation. Eval scores attach to spans, rubrics roll up to traces, conversation rubrics roll up to sessions — what makes per-turn, per-request, and per-conversation metrics work in one system. FAGI surface: traceAI emits OTel-native span and trace IDs plus a session attribute.

Span Kinds

The OpenInference taxonomy of span types. Fourteen in active use: LLM, RETRIEVER, CHAIN, AGENT, TOOL, EMBEDDING, RERANKER, GUARDRAIL, EVALUATOR, A2A_CLIENT, A2A_SERVER, plus three modality kinds. Routes spans to the right evaluator and makes trace-tree views legible. Confused with: OTel’s broader span “kind” (server/client/internal), which is orthogonal. FAGI surface: EvalSpanKind auto-set by traceAI across 50+ AI surfaces.

Shadow / Mirror / Race / Canary

Four rollout modes that teams treat as one. Shadow — run the new variant alongside the old without serving it; compare offline. Mirror — same as shadow but the response is computed and discarded, so latency tail is measured. Race — send both, return whichever responds first. Canary — route a small percentage of real traffic to the new variant. Confused with: A/B testing (a measurement design, not a rollout mode). FAGI surface: all four supported by the Agent Command Center gateway. See agent rollout strategies.

Sycophancy

The model agrees with the user even when the user is wrong. Measured as the rate at which the model flips its answer when the user pushes back without new evidence. Distinct from helpfulness, which can produce the same surface behaviour for the right reasons. Sycophancy is also a judge-bias category. FAGI surface: measurable through CustomLLMJudge with paired prompts.

TTFT vs Inter-Token vs Total Latency

Three latency metrics that get collapsed into “latency.” TTFT — wall-clock from request to first streamed token. Inter-token — average gap between successive tokens. Total — request to last token. Mean total latency hides streaming behaviour and the long tail; surface all three. FAGI surface: x-agentcc-latency-ms gateway header captures total; span attributes capture TTFT and inter-token.

The cheat sheet

Confused pair	The cheap distinction
Faithfulness vs Factuality	Vs the context; vs the world
Groundedness vs Faithfulness	Claim has a source; claim is supported by source
Evaluator vs Guardrail	Scores after; blocks during
Span vs Trace vs Session	Operation; request; conversation
Regression vs Drift vs Decay	Code-change delta; silent provider change; dataset staleness
Answer vs Context Relevance	Answer addresses question; chunks address question
Citation Validity vs Integrity	Span exists; span supports the claim
Direct vs Indirect Injection	User message; retrieved or tool content
PII Detection vs Compliance	Find the string; decide the policy
TTFT vs Total Latency	Perceived; measured

If a vocabulary argument lasts more than five minutes, it is probably one of these.

How the glossary maps to the FAGI stack

Three surfaces. The ai-evaluation SDK ships 60+ EvalTemplate classes plus Evaluator, Protect, Guardrails, CustomLLMJudge, 13 guardrail backends, 8 sub-10 ms Scanner classes, and four distributed runners. The Future AGI Platform hosts self-improving evaluators, an in-product authoring agent that turns natural-language descriptions into rubrics, and an Error Feed that runs HDBSCAN soft clustering plus a Sonnet 4.5 Judge writing immediate_fix back into the evaluators. traceAI is the OTel-compatible tracing layer with 50+ AI surface instrumentations across Python, TypeScript, Java, and C#.

Honest framing: vendor terminology drifts. If a term reads differently in another stack, that is a vocabulary gap, not a contradiction. The metrics reference, the playbook, and the build-from-scratch guide are the next reads.

Frequently asked questions

Why does LLM evaluation need a disambiguation glossary in 2026?

Because the same word means different things across vendors and the same concept ships under three names. Groundedness in one tool checks whether claims are supported by the retrieval context; in another it checks whether claims are supported by external truth. Faithfulness collapses into adequacy in some stacks and splits into three rubrics in others. Hallucination has at least four taxonomies in active use. Most rubric arguments are vocabulary arguments. The disambiguation glossary fixes one canonical reading per term, names the adjacent terms it gets confused with, and points to the Future AGI primitive that implements it.

Faithfulness, groundedness, factuality, RAG-truth — are these the same thing?

No. Faithfulness is whether the response asserts only claims supported by the retrieval context. Groundedness is whether every claim links to a retrievable source span. Factuality is whether the claim is true in the world, independent of retrieval. RAG-truth is a vendor coinage that usually means faithfulness. A response can be faithful and false (it cited a wrong document and stuck to it) or factual and ungrounded (it knew the answer without citing). Production stacks need all three because they fail in different directions.

What is the difference between an evaluator, a rubric, and a judge?

Three layers of the same thing. The rubric is the contract — what to measure, on what scale, with what reasoning. The judge is the implementation — an LLM, a classifier, a regex, or a stack. The evaluator is the runtime that takes inputs, applies the rubric through the judge, and emits score plus rationale. Future AGI exposes them as EvalTemplate (rubric), CustomLLMJudge (judge), and Evaluator (runtime).

What is the difference between an evaluator and a guardrail?

An evaluator scores after the fact and feeds metrics into dashboards, CI gates, and the dataset. A guardrail runs inline, blocks or rewrites the output, and adds latency to every request. The same rubric can run in both modes. PromptInjection as an evaluator tells you what percentage of traces tried an injection; as a guardrail it blocks the request before the model sees it. Future AGI ships both surfaces through ai-evaluation: Evaluator for scoring; Protect and Guardrails for inline enforcement with 13 guardrail backends and 8 sub-10 ms Scanners.

What is the most common term confusion in production today?

Five recurring mix-ups. Faithfulness vs groundedness vs factuality. Answer relevance vs context relevance. Evaluator vs guardrail. Span vs trace vs session. Drift vs regression vs decay. Most CI gate failures and most rubric disputes trace back to one of these five. Fix the vocabulary first, then the rubric.

How does Future AGI's stack map to the terms in this glossary?

Three surfaces. The ai-evaluation SDK ships 60+ EvalTemplate classes, 13 guardrail backends, 8 Scanners, and four distributed runners. The Future AGI Platform is the hosted Agent Command Center with self-improving evaluators, an in-product authoring agent for unlimited custom rubrics, and HDBSCAN soft clustering plus a Sonnet 4.5 Judge Error Feed that writes immediate_fix back into the evaluators. traceAI is the OpenTelemetry-compatible tracing layer with 50+ AI surface instrumentations across Python, TypeScript, Java (Spring Boot starter, LangChain4j), and C#. Every term in this glossary anchors to a concrete primitive in one of those three.

View all

Guides

The 2026 LLM Evaluation Playbook

The pillar playbook for LLM evaluation in 2026: dataset, metrics, judge, CI gate, production observation, closed loop from failing trace to regression.

Rishav Hada · Apr 12, 2026

10 min

Guides

Evaluating AWS Bedrock Agents in 2026

Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.

Rishav Hada · May 19, 2026

11 min

Guides

LLM Eval Golden Set Design: A 2026 Engineering Guide

Build a four-bucket golden set (production sample, adversarial, edge cases, failure replays) so a CI eval gate actually proves something about production.

NVJK Kartik · May 16, 2026

12 min

TL;DR

Same word, different camp

Different words, same thing

A to D

Adequacy

Answer Relevance vs Context Relevance

Atomic Claim Decomposition

AUC-ROC / F1 / Precision / Recall

BERTScore

BLEU / ROUGE / METEOR / chrF

Calibration

Chain-of-Thought

Citation Validity vs Citation Integrity

Drift vs Regression vs Decay

E to L

Evaluator vs Guardrail

Eval Cascade (augment=True)

EvalTemplate / Rubric / Judge / Evaluator

Faithfulness vs Groundedness vs Factuality vs RAG-Truth

Golden Set vs Hold-Out vs Validation

G-Eval / Pairwise / Arena-Style

Hallucination

Judge Bias

LLM-as-Judge

M to R

MRR / NDCG / Recall@K

OpenInference vs OTel GenAI vs OPENLLMETRY

PII Detection vs Data Privacy Compliance

Prompt Injection: Direct vs Indirect

Refusal

Rubric

RLHF / DPO / RLAIF

S to Z

Span vs Trace vs Session

Span Kinds

Shadow / Mirror / Race / Canary

Sycophancy

TTFT vs Inter-Token vs Total Latency

The cheat sheet

How the glossary maps to the FAGI stack

Related reading

Frequently asked questions

Eval Cascade (`augment=True`)