A Gentle Introduction to LLM Evaluation (2026)

Learn LLM evaluation from the inside out: the three primitives (deterministic, embedding, judge), offline vs online, the starter workflow for production.

March 25, 2026

A unit test asks one question: did the function return four. An LLM eval asks a harder one: did the model say something that means four, in a form a user can use, without claiming the moon orbits Mars on the way there. That single shift, from exact match to rubric scoring, is why the first eval suite you build feels different from anything in your test directory.

The good news is the underlying mental model is small. There are three primitives. Two axes. One starter workflow. If you learn those three things first, the rest of the eval canon (metrics, judges, gates, dashboards) falls into place without you having to memorize a forty-row comparison table.

The opinion this post earns: LLM evaluation isn’t a metric, it’s a feedback loop. Pick the primitive that matches the question. Pick the cadence that matches the risk. Everything else is refinement.

TL;DR: the gentle map

You need to answer	Reach for	What it costs
Valid JSON, schema match, tool-call success	Deterministic check	Microseconds, zero API cost
Looks similar to a known good answer	Embedding metric (BERTScore, cosine)	Milliseconds, embedding cost only
Helpful, faithful, on-tone, refusing correctly	LLM-as-a-judge	Hundreds of ms, judge token cost
Toxicity, PII, prompt injection, bias	Fine-tuned classifier	Sub-10ms, no LLM call

Three primitives, three jobs, plus a fine-tuned classifier for sharp safety targets such as toxicity and PII. The mistake most beginners make is reaching for a judge on a question a parser already answers, or reaching for an embedding metric on a question that needs reasoning. Don’t.

Why “did it work” is the wrong question

When you ship a deterministic function, “did it work” is binary. The test asserts add(2, 2) == 4. Pass or fail. No middle.

LLMs broke that. A model asked for the capital of France can answer “Paris,” “Paris, the capital of France,” “It’s Paris,” or “The capital of France is Paris (located in Europe).” Four responses. All correct. No two share a token sequence. Exact match fails on three. ROUGE scores them inconsistently. None of these are bugs. They are the consequence of generating natural language instead of returning a value.

The mental shift: stop asking “did it work” and start asking “does this output satisfy the rubric.” The rubric is a definition of correctness that tolerates surface variation. The output gets scored against the definition, not compared to a fixed string. A rubric for capital-city answers might say: score 1 if the response names Paris as the capital, 0 if it names any other city, 0 if it refuses. That handles all four valid responses and rejects the wrong ones.
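To make the contrast concrete, here is a minimal sketch of the capital-city rubric next to an exact-match check. The refusal phrasings and the list of distractor cities are assumptions made up for illustration; a real rubric would need a broader definition of both.

```python
import re

RESPONSES = [
    "Paris",
    "Paris, the capital of France",
    "It's Paris",
    "The capital of France is Paris (located in Europe).",
]

# Hypothetical refusal phrasings and distractor cities, for illustration only.
REFUSAL = re.compile(r"\b(i can't|i cannot|i'm unable|i am unable)\b", re.IGNORECASE)
OTHER_CITIES = ("lyon", "marseille", "london", "berlin")


def exact_match(response: str, expected: str = "Paris") -> int:
    """Unit-test style check: 1 only for a character-for-character match."""
    return int(response == expected)


def capital_rubric(response: str) -> int:
    """Score 1 if the response names Paris and no other listed city, else 0."""
    text = response.lower()
    if REFUSAL.search(text):
        return 0
    if any(city in text for city in OTHER_CITIES):
        return 0
    return int(re.search(r"\bparis\b", text) is not None)


exact_scores = [exact_match(r) for r in RESPONSES]
rubric_scores = [capital_rubric(r) for r in RESPONSES]
print(exact_scores)   # [1, 0, 0, 0]
print(rubric_scores)  # [1, 1, 1, 1]
```

Exact match passes one of the four valid answers. The rubric passes all four and still scores a wrong city or a refusal as 0.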

This is the move that beginners underestimate. The rubric is the contract. The score is the signal. The dataset is what keeps the signal honest over time.

The three primitives, in plain English

Every eval you’ll ever run is one of three primitives or a stack of them. Learn the three, and the metric catalog becomes a lookup table instead of a maze.

Deterministic. A function with no model in the loop. Parse the response into JSON, check it against a schema. Run a regex for refusal phrasings. Look up cited chunk IDs in the retrieval context. Match the tool call against an expected signature. Deterministic checks are microsecond-fast, free, and never drift. They are also the wrong tool for “is this helpful.” Use them for closed-form questions where the answer is provably right or wrong against a rule.
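A sketch of three deterministic checks using only the Python standard library. The field names (answer, citations), the chunk ID format, and the refusal phrasings are hypothetical; swap in your own schema and patterns.

```python
import json
import re

# Hypothetical schema and refusal phrasings, for illustration only.
REQUIRED_FIELDS = {"answer": str, "citations": list}
REFUSAL_PATTERN = re.compile(r"\b(i can(?:'|no)t|i am unable|i'm unable)\b", re.IGNORECASE)


def check_schema(raw: str) -> bool:
    """True if raw parses as a JSON object with the required typed fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(k), t) for k, t in REQUIRED_FIELDS.items())


def check_citations(raw: str, retrieved_ids: set[str]) -> bool:
    """True if every cited chunk ID exists in the retrieval context."""
    if not check_schema(raw):
        return False
    return set(json.loads(raw)["citations"]) <= retrieved_ids


def is_refusal(text: str) -> bool:
    """True if the text matches one of the known refusal phrasings."""
    return REFUSAL_PATTERN.search(text) is not None


good = '{"answer": "Paris", "citations": ["chunk-1"]}'
bad = '{"answer": "Paris", "citations": ["chunk-9"]}'
print(check_schema(good), check_citations(good, {"chunk-1", "chunk-2"}))
print(check_citations(bad, {"chunk-1", "chunk-2"}))
print(is_refusal("I cannot help with that."))
```

None of these functions call a model, so they can run on every trace and every PR at negligible cost.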

Embedding-based. Project candidate and reference into a vector space and measure distance. BERTScore (Zhang et al. 2020) does this at token level. Cosine similarity does it at sentence level. The output is a similarity score that tolerates paraphrase: “Paris is the capital” and “The capital is Paris” land close in vector space and score high. Embedding metrics need a clean reference and they score “looks similar,” not “is correct.” A confidently wrong answer that uses the right vocabulary scores high. Use them when you have a gold answer and want a fast similarity floor, or as a feature for clustering failing traces.
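The similarity arithmetic can be sketched without a model. The toy_embed function below is a bag-of-words stand-in, not a real embedding model and not BERTScore; it exists only so the cosine computation runs end to end. A real setup would replace it with vectors from an embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


reference = toy_embed("Paris is the capital")
paraphrase = toy_embed("The capital is Paris")
wrong = toy_embed("Lyon is the capital")

print(cosine(reference, paraphrase))  # 1.0
print(cosine(reference, wrong))       # 0.75
```

The wrong answer still scores 0.75 because it shares three of four words with the reference. That is the "looks similar, not is correct" failure in miniature.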

LLM-as-a-judge. A capable model reads the rubric, reads the candidate response, reads the context if there is one, and returns a score. G-Eval (Liu et al. 2023) formalized the pattern with chain-of-thought and a form-filling output schema. Pairwise variants (MT-Bench, Chatbot Arena) scaled it to ship decisions across millions of comparisons. The judge is the only general-purpose tool for rubrics that require reasoning: helpfulness, faithfulness, refusal calibration, role adherence. It is also the most expensive primitive and the one most prone to bias.
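A minimal sketch of the judge pattern, assuming the model call is injected as a plain function. The prompt wording, the call_model parameter, and the JSON verdict shape are all hypothetical; fake_model is a stub so the example runs without an API key.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading a response against a rubric.
Rubric: {rubric}
Context: {context}
Response: {response}
Think step by step, then reply with JSON: {{"reasoning": "...", "score": 0 or 1}}"""


def judge(rubric: str, response: str, context: str, call_model: Callable[[str], str]) -> dict:
    """Build the judge prompt, call the model, and parse a scored verdict."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, context=context, response=response)
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # An unparseable verdict is recorded as an error, not as a pass or a fail.
        return {"score": None, "reasoning": raw, "error": "unparseable verdict"}
    return {"score": score, "reasoning": verdict.get("reasoning", ""), "error": None}


def fake_model(prompt: str) -> str:
    """Stub standing in for a real judge model call."""
    return '{"reasoning": "The response names Paris, matching the context.", "score": 1}'


result = judge(
    rubric="Score 1 if every claim is supported by the context, else 0.",
    response="The capital of France is Paris.",
    context="Paris is the capital and largest city of France.",
    call_model=fake_model,
)
print(result["score"], result["reasoning"])
```

Keeping the model call behind a function argument makes the judge testable offline and makes a judge-model swap an explicit change.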

The skill of an eval engineer is matching the question to the cheapest primitive that answers it honestly. A pattern we see in nearly every audit: a $0.04-per-call frontier judge running on a binary toxicity decision a 4B Gemma adapter answers in 65 milliseconds. Wrong tool. Right answer for the wrong reason.

The two axes: offline vs online, pointwise vs pairwise

Once you have a primitive picked, two axes decide where and how it runs.

Offline vs online. Offline eval runs the rubric against a versioned dataset before deploy. The CI gate is the canonical surface. Online eval runs the same rubric against live traces in production, sampled uniformly or by failure signal. Offline catches regressions you knew to look for. Online catches drift and the rare paths the dataset doesn’t cover. Both use the same rubric. The diff between offline pass and online drop is itself a quality signal worth tracking.

Pointwise vs pairwise. Pointwise scoring rates a single response against an absolute rubric: “score this answer from 0 to 1.” Pairwise scoring asks the judge to compare two responses and pick the better one. Pointwise is the right mode for absolute SLO gates and per-axis regression diagnosis. Pairwise is the right mode for ship decisions where “which is better” matters more than “what’s the absolute number.” Most teams run both, in different places. Pointwise rubric scores power the CI gate. Arena-style pairwise powers the launch decision.
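Pairwise judging can be sketched as follows. Running each comparison in both orders is one common way to reduce position bias; it is an addition here, not something the workflow above prescribes. The call_judge signature and the two stub judges are hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: takes (prompt, first, second)
# and returns the string "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise(prompt: str, a: str, b: str, call_judge: Judge) -> str:
    """Compare a and b in both orders; return "a", "b", or "tie"."""
    forward = call_judge(prompt, a, b)   # a shown first
    backward = call_judge(prompt, b, a)  # b shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position, so treat as no preference


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Stub judge with a verbosity preference and no position bias."""
    return "first" if len(first) >= len(second) else "second"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


question = "What is the capital of France?"
print(pairwise(question, "Paris", "It is Paris, in Europe.", longer_wins))   # b
print(pairwise(question, "Paris", "It is Paris, in Europe.", always_first))  # tie
```

The position-biased stub produces a tie instead of a false win, which is the point of the swap. The verbosity-biased stub still picks the longer answer, so the swap does not remove every judge bias.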

The reason these axes matter for a beginner: you can pick a single primitive and still get the deployment wrong by running it in the wrong place at the wrong cadence. A judge on the inline user-facing path will blow your p95 latency budget. A deterministic schema check running quarterly instead of per-PR catches nothing.

The starter workflow: dataset, rubric, judge, gate

Here is the smallest working setup that earns its keep. Four steps, in order.

1. Build the dataset. Sample fifty to one hundred traces per route from the last seven days of production. Bias toward the hardest cases. Add an expected_behavior field per trace, even if it’s just a sentence. Version the dataset like you version a prompt: tag releases, freeze the set for active CI gates, review additions in PR. A test set invented at launch reflects the test author’s assumptions, not user behavior; that’s why this step starts with production data.

2. Pick four rubrics. One per failure axis. Groundedness for RAG correctness (every claim supported by retrieval context). Refusal calibration for safety (covers both over-refusal and under-refusal). Factual accuracy for the is-this-true axis. One custom rubric tied to your specific product outcome. The custom rubric is where your real signal lives; everything else is borrowed.

3. Wire the judge. Deterministic where possible (schema, citation existence, refusal regex). Classifier where a sharp target exists (toxicity, PII). LLM-as-a-judge where the rubric requires reasoning. Pin the judge model and rubric version as a single contract. Cache verdicts keyed on the contract. The eval is the tuple (judge_model_id, rubric_version, prompt_template_hash), and a vendor swap is a deliberate eval-suite migration, not a config change.

4. Set the gate. Run the rubrics against the dataset on every PR. Fail the gate if any rubric drops more than two points from the trailing 7-day baseline, or falls below an agreed absolute floor (0.75 for faithfulness, 0.85 for task completion are reasonable defaults you’ll tune). The gate produces an artifact: rubric scores per dataset entry with diffs against baseline. Engineers reviewing the PR drill into failing examples and decide if the regression is real or a noisy judge.
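Steps 3 and 4 can be sketched together: a verdict cache key built from the contract tuple, and a gate that checks both the baseline drop and the absolute floor. The class and function names are hypothetical, and MAX_DROP assumes "two points" means 0.02 on a 0-to-1 scale; adjust it if your scores use a different scale.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalContract:
    judge_model_id: str
    rubric_version: str
    prompt_template_hash: str

    def cache_key(self, example_id: str, output: str) -> str:
        """Key a cached verdict on the contract plus the scored output."""
        parts = (self.judge_model_id, self.rubric_version,
                 self.prompt_template_hash, example_id, output)
        return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


# Floors from the text; MAX_DROP is an assumed reading of "two points".
FLOORS = {"faithfulness": 0.75, "task_completion": 0.85}
MAX_DROP = 0.02


def gate(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for rubric, score in current.items():
        drop = baseline.get(rubric, score) - score
        if drop > MAX_DROP:
            failures.append(f"{rubric}: dropped {drop:.3f} from baseline")
        floor = FLOORS.get(rubric)
        if floor is not None and score < floor:
            failures.append(f"{rubric}: {score:.2f} is below floor {floor:.2f}")
    return failures


contract = EvalContract("judge-model-v1", "groundedness-v3", "ab12cd")
swapped = EvalContract("judge-model-v2", "groundedness-v3", "ab12cd")
# A different judge model yields a different key, so old verdicts are not reused.
print(contract.cache_key("ex-1", "Paris") == swapped.cache_key("ex-1", "Paris"))

baseline = {"faithfulness": 0.90, "task_completion": 0.91}
current = {"faithfulness": 0.86, "task_completion": 0.90}
for failure in gate(current, baseline):
    print(failure)
```

Because the judge model ID is part of the key, swapping vendors invalidates every cached verdict, which matches the idea that a vendor swap is a migration and not a config change.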

That’s the whole flow. Five lines of code in the ai-evaluation SDK get you started.

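from fi.evals import Evaluator
from fi.evals.templates import Groundedness

# Placeholder values for illustration; in practice these come from one production trace.
question = "What is the capital of France?"
answer = "The capital of France is Paris."
retrieved_chunks = ["Paris is the capital and largest city of France."]

# Replace the elided keys with your own credentials.
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
# Score one (input, output, context) triple against the Groundedness template.
result = ev.evaluate(
    eval_templates=[Groundedness()],
    inputs=[{"input": question, "output": answer, "context": retrieved_chunks}],
)
# Print the score and the evaluator's stated reason for the first result.
print(result.eval_results[0].output, result.eval_results[0].reason)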

Swap Groundedness for any of 60+ EvalTemplate classes including ContextAdherence, FactualAccuracy, AnswerRefusal, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling. Pass an array of inputs to score a batch in one call. When you outgrow code-defined rubrics, swap to a CustomLLMJudge with your own grading_criteria text.

What to measure first (and what to skip)

The DeepEval-style metric tables have forty rows. You don’t need forty rubrics. You need four well-calibrated ones tied to the four axes that decide whether your product works.

Groundedness. For RAG, the rubric that pays back fastest. Score 1 if every claim in the response is supported by the retrieval context, 0 otherwise. Cheap variants check claim-by-claim entailment with a small NLI model (DeBERTa). Richer variants run an LLM judge. The RAG evaluation primer covers the full retrieval-plus-generation rubric set.

Refusal calibration. Both directions count. Over-refusal (the model refuses a benign request) is as much a failure as under-refusal (the model answers a harmful one). A small classifier handles sharp safety targets; a judge handles the gray zone where context decides.

Factual accuracy. Did the model state something true, against either a retrieved source or a known fact. For RAG, this overlaps with groundedness. For free-form generation, it stands alone. The cheap version uses an embedding metric against a gold answer. The richer version uses a judge with retrieval-grounded reasoning.

One custom rubric. Yours. Did the contract review flag the indemnity clause. Did the support bot escalate at turn three when the user mentioned legal action. Did the code-gen agent produce code that compiles and passes a unit test. This rubric is the one only your team can write.

What to skip on day one: per-token perplexity dashboards, response-style metrics that don’t correlate with user complaints, BLEU on anything that isn’t machine translation, and any rubric you can’t explain to a non-ML stakeholder in one sentence. Most starter suites are too generic and stop being maintained within a quarter; four well-calibrated rubrics beat fifteen noisy ones.

The hand-off to production: traces, observability, Error Feed

Offline eval gates regressions. Production eval catches drift. Both need to live in the same place as the trace, or no one will look at either of them.

The pattern that works: attach eval scores to the OpenTelemetry span so the eval result lives next to the trace that produced it. traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time mean your traces aren’t locked to one vendor’s format. Built-in evals wire to span attributes via EvalTag so the score, the input, and the trace timeline are in the same view.

Alarm on rolling-mean drift: per-route, per-rubric, per-prompt-version. A two-to-five point sustained drop over fifteen to sixty minutes is the right detection threshold for most products. Triage failing traces into an annotation queue. A human or a clusterer decides whether the failure is a bug, a rubric problem, or expected.
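A rolling-mean alarm can be sketched in a few lines. This version counts consecutive full windows of scores, not wall-clock minutes, so mapping it to a fifteen-to-sixty-minute window depends on your traffic rate. The class name, parameters, and sample numbers are hypothetical.

```python
from collections import deque
from statistics import mean


class DriftAlarm:
    """Fire when the rolling mean stays below baseline - max_drop for `sustain` checks."""

    def __init__(self, baseline: float, max_drop: float, window: int, sustain: int):
        self.threshold = baseline - max_drop
        self.scores = deque(maxlen=window)
        self.sustain = sustain
        self.breaches = 0

    def observe(self, score: float) -> bool:
        """Record one score; return True while the alarm condition holds."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a full window yet
        if mean(self.scores) < self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # recovery resets the streak
        return self.breaches >= self.sustain


# One alarm per (route, rubric, prompt_version) key.
alarms = {}
key = ("support", "groundedness", "v7")
alarms[key] = DriftAlarm(baseline=0.90, max_drop=0.03, window=4, sustain=3)

stream = [0.91, 0.90, 0.89, 0.90, 0.84, 0.83, 0.82, 0.81, 0.80]
fired_at = [i for i, s in enumerate(stream) if alarms[key].observe(s)]
print(fired_at)  # [7, 8]
```

Requiring several consecutive breaches keeps a single noisy judge verdict from paging anyone.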

Closing the loop is the move that compounds. The simplest version is a weekly job: pull 100 random production traces, score with the current rubrics, promote anything below threshold into the offline dataset with a quick human sanity check. The next PR has to clear the new entries. Error Feed automates this inside the Future AGI eval stack: HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failures into named issues; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) writes the immediate fix. Those fixes feed back into the Platform’s self-improving evaluators so the rubric ages with the product rather than against it.
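The simple weekly job can be sketched as two functions: one samples and scores traces, the other promotes the ones a human approved. The trace shape (trace_id, stub_score), the function names, and the threshold are hypothetical, and score_fn stands in for whatever rubric you already run. This is a sketch of the manual loop, not of how Error Feed works internally.

```python
import random


def weekly_promotion(traces, score_fn, threshold, sample_size=100, seed=None):
    """Sample traces, score them, and return the ones that need human review."""
    rng = random.Random(seed)
    sample = rng.sample(traces, min(sample_size, len(traces)))
    candidates = []
    for trace in sample:
        score = score_fn(trace)
        if score < threshold:
            candidates.append({**trace, "score": score, "status": "needs_review"})
    return candidates


def promote(dataset, candidates, approved_ids):
    """Append human-approved candidates that are not already in the dataset."""
    existing = {row["trace_id"] for row in dataset}
    for candidate in candidates:
        trace_id = candidate["trace_id"]
        if trace_id in approved_ids and trace_id not in existing:
            dataset.append({**candidate, "status": "promoted"})
            existing.add(trace_id)
    return dataset


# Ten fake traces with precomputed scores standing in for real rubric output.
traces = [{"trace_id": f"t{i}", "stub_score": i / 10} for i in range(10)]
candidates = weekly_promotion(traces, lambda t: t["stub_score"], threshold=0.3, seed=0)
dataset = promote([], candidates, approved_ids={"t0", "t2"})
print(sorted(row["trace_id"] for row in dataset))  # ['t0', 't2']
```

The human sanity check lives in approved_ids: a trace below threshold is only a candidate until someone confirms it is a real failure and not a rubric problem.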

Common beginner mistakes

Stopping at offline. Offline pass is necessary, not sufficient. Real users find what the test author didn’t think of.
Inventing the test set. A whiteboard test set reflects your assumptions, not user behavior. Sample from production from day one.
Treating the judge as ground truth. The judge is a model with its own biases (position, verbosity, self-preference, calibration drift). Sample 50 traces a quarter and human-label them. Track judge-human Cohen’s kappa as its own metric.
No baseline. A score without a trailing baseline is a number with no context. Compare to last week, not to a frozen reference.
Too many rubrics. Four well-calibrated rubrics beat fifteen noisy ones. Cut anything that doesn’t correlate with user complaints.
Eval lives in a different tool than the trace. When the score, the trace, and the failure live in three places, no one reads any of them. Attach scores to the span.
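Judge-human agreement from the list above can be tracked with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch in plain Python follows; the ten labels are made-up sample data.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge_verdicts = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_verdicts = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(judge_verdicts, human_verdicts)
print(round(kappa, 3))  # 0.524
```

Here the raters agree on 8 of 10 traces, but because both say 1 about 70% of the time, chance agreement is 0.58 and kappa lands at roughly 0.52, well below the raw 0.80.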

When you’ve outgrown “gentle”

Three signs you’re ready for something bigger.

You have more than four routes and one rubric set no longer fits everywhere. Per-route configs become worth the operational cost.
You’re running LLM-as-a-judge on more than 10K examples a week and the bill grows faster than the inference bill. Classifier-backed evals start paying for themselves.
You have a backlog of failing traces and no time to triage them by hand. Auto-clustering becomes the difference between learning from production and drowning in it.

Until those signs appear, the four-step workflow above plus a CI gate plus a weekly loop is the whole job.

How Future AGI fits a beginner’s stack

The eval surface ships as a package you light up in order. Start with the SDK. Graduate to the Platform when you want rubrics that improve themselves rather than stay frozen.

ai-evaluation SDK (Apache 2.0). Code-first. 60+ EvalTemplate classes. 13 guardrail backends including 9 open-weight (Llama Guard 3 8B/1B, Qwen3-Guard 8B/4B/0.6B, Granite Guardian 8B/5B, WildGuard 7B, ShieldGemma 2B) and 4 API (OpenAI Moderation, Azure Content Safety, Turing Flash, Turing Safety). 8 sub-10ms local Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge via LiteLLM. Start here.
Future AGI Platform. Self-improving evaluators tuned by thumbs feedback. An in-product authoring agent that turns natural-language descriptions into rubrics. Classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Graduate here when you want the rubric to improve itself rather than decay.
Error Feed (inside the eval stack). HDBSCAN soft-clustering plus a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser, 90% prompt-cache) writes the immediate fix per named issue. Fixes feed back into the self-improving evaluators. Linear integration today; Slack, GitHub, Jira, and PagerDuty on the roadmap.

traceAI (Apache 2.0) is the tracing layer that joins the eval and the trace: 50+ AI surfaces across Python, TypeScript, Java, and C#. The hosted Agent Command Center is SOC 2 Type II, HIPAA, GDPR, and CCPA certified (ISO/IEC 27001 in active audit) and routes evals across 20+ providers with shadow, mirror, and race modes.

Ready to run your first eval against your own workload? Install ai-evaluation, drop a Groundedness rubric against your last fifty production traces this afternoon, and wire the same rubric as an EvalTag on live spans via traceAI tomorrow. The same rubric in both places is what turns an LLM eval from a notebook experiment into a feedback loop that holds for two years.

Frequently asked questions

What is LLM evaluation, in one sentence?

LLM evaluation is the practice of scoring a model's outputs against a rubric so you can tell whether a change made the system better, worse, or the same before users find out. Unlike unit tests, evals don't compare to a single correct answer. They compare to a definition of correctness that handles paraphrase, partial credit, and open-ended generation. The rubric is the contract. The score is the signal. The dataset is what keeps the signal honest over time. If you treat eval as a one-time setup task, the rubric ages out of usefulness within a quarter. If you treat it as a feedback loop that ingests production failures weekly, the suite gets sharper every month.

What are the three primitives every beginner should learn first?

Deterministic checks, embedding-based metrics, and LLM-as-a-judge. Deterministic is your CI floor: JSON schema, regex, exact match, tool-call success. Cheap, fast, never drifts. Embedding metrics (BERTScore, cosine similarity) score 'looks similar to a reference' and are useful when you have a clean gold answer or want to cluster failing traces. LLM-as-a-judge is the only general-purpose tool for subjective rubrics like helpfulness, faithfulness, refusal calibration. Each primitive answers a different question. The skill of the practitioner is reaching for the cheapest tool that gives the right answer.

Do I need LLM-as-a-judge to start?

No. Start with deterministic checks: schema validation, citation existence, tool-call arguments, refusal regex. They catch close to half of real failures and cost nothing to run on every trace. Layer in a judge once you need to score faithfulness, task completion, or anything that requires reasoning over the candidate. The mistake most teams make is running a frontier judge on a binary toxicity decision a 4B classifier would answer in 65 milliseconds. Save the judge call for the rubric that genuinely needs reasoning.

Offline eval versus online eval — what's the difference?

Offline eval runs the rubric against a versioned dataset to catch regressions before deploy. Online eval (production observation) runs the same rubric against live traces to catch drift, new failure modes, and rare paths the dataset doesn't cover. Both use the same rubric. Offline gates the PR. Online watches the trace. The two together close the loop: regressions caught offline never ship, drift caught online promotes back into the offline set, and the dataset ratchets stronger over time. Skipping online is how teams ship clean evals and still get paged when real users find the failure mode no one wrote a test for.

What's a starter rubric set for a brand-new project?

Four rubrics, one per axis. Groundedness for RAG correctness. Refusal calibration for safety (both over-refusal and under-refusal). Factual accuracy for the answer-is-true axis. One custom rubric tied to the specific outcome your product cares about (did the contract review flag the indemnity clause, did the support bot escalate at the right turn, did the code-gen agent produce code that compiles). Four good rubrics beat fifteen noisy ones. The custom rubric is where your product signal lives. Everything else is borrowed.

How big should my first eval dataset be?

Fifty to one hundred examples per route. Bias the set toward the hardest cases you've seen so far, not the happy path. Pull from production traces wherever possible rather than inventing inputs at a whiteboard; a test set written at launch reflects the author's assumptions, not user behavior. Grow the dataset weekly by promoting failing production traces into the offline set with rubric labels. Beyond a few hundred examples per route, sampling becomes a bigger lever than dataset size for judge cost reasons. Quality, coverage of failure modes, and refresh cadence matter more than raw count.

What does Future AGI ship for someone just starting out?

An eval stack you light up in order. The ai-evaluation SDK (Apache 2.0) is the five-line code surface: 60+ EvalTemplate classes covering RAG, safety, agents, and structured output. 13 guardrail backends including 9 open-weight (Llama Guard, Qwen3-Guard, Granite Guardian, WildGuard, ShieldGemma). 8 sub-10ms local Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes). traceAI joins evals to traces across 50+ AI surfaces in Python, TypeScript, Java, and C#. When you outgrow code-defined rubrics, the Future AGI Platform layers self-improving evaluators tuned by thumbs feedback, an in-product authoring agent for custom rubrics, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed closes the loop by clustering failing traces into named issues.
