A Gentle Introduction to LLM Evaluation (2026)
Learn LLM evaluation from the inside out: the three primitives (deterministic, embedding, judge), offline vs online, and the starter workflow that holds up in production.
A unit test asks one question: did the function return four. An LLM eval asks a harder one: did the model say something that means four, in a form a user can use, without claiming the moon orbits Mars on the way there. That single shift, from exact match to rubric scoring, is why the first eval suite you build feels different from anything in your test directory.
The good news is the underlying mental model is small. There are three primitives. Two axes. One starter workflow. If you learn those three things first, the rest of the eval canon (metrics, judges, gates, dashboards) falls into place without you having to memorize a forty-row comparison table.
The opinion this post earns: LLM evaluation isn’t a metric, it’s a feedback loop. Pick the primitive that matches the question. Pick the cadence that matches the risk. Everything else is refinement.
TL;DR: the gentle map
| You need to answer | Reach for | What it costs |
|---|---|---|
| Valid JSON, schema match, tool-call success | Deterministic check | Microseconds, zero API cost |
| Looks similar to a known good answer | Embedding metric (BERTScore, cosine) | Milliseconds, embedding cost only |
| Helpful, faithful, on-tone, refusing correctly | LLM-as-a-judge | Hundreds of ms, judge token cost |
| Toxicity, PII, prompt injection, bias | Fine-tuned classifier | Sub-10ms, no LLM call |
Three primitives, three jobs, plus a fine-tuned classifier as a fast path when the target is sharp (toxicity, PII). The most common beginner mistake is reaching for a judge on a question a parser already answers, or reaching for an embedding metric on a question that needs reasoning. Don’t.
Why “did it work” is the wrong question
When you ship a deterministic function, “did it work” is binary. The test asserts add(2, 2) == 4. Pass or fail. No middle.
LLMs broke that. A model asked for the capital of France can answer “Paris,” “Paris, the capital of France,” “It’s Paris,” or “The capital of France is Paris (located in Europe).” Four responses. All correct. No two share a token sequence. Exact match fails on three. ROUGE scores them inconsistently. None of these are bugs. They are the consequence of generating natural language instead of returning a value.
The mental shift: stop asking “did it work” and start asking “does this output satisfy the rubric.” The rubric is a definition of correctness that tolerates surface variation. The output gets scored against the definition, not compared to a fixed string. A rubric for capital-city answers might say: score 1 if the response names Paris as the capital, 0 if it names any other city, 0 if it refuses. That handles all four valid responses and rejects the wrong ones.
This is the move that beginners underestimate. The rubric is the contract. The score is the signal. The dataset is what keeps the signal honest over time.
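The capital-city rubric above is small enough to write as a plain function. This is a minimal sketch, not a production scorer: the refusal phrasings and the list of wrong cities are illustrative assumptions, and a real rubric for open-ended answers would usually need a judge.

```python
import re

# Hypothetical refusal phrasings; a real suite would tune these per product.
REFUSAL_PATTERN = re.compile(
    r"\b(i can't|i cannot|i'm unable|i am unable)\b", re.IGNORECASE
)

# Illustrative wrong answers the rubric should reject (not exhaustive).
OTHER_CITIES = ("lyon", "marseille", "london", "berlin", "madrid")


def score_capital_of_france(response: str) -> int:
    """Score 1 if the response names Paris, 0 for another city or a refusal."""
    text = response.lower()
    if REFUSAL_PATTERN.search(text):
        return 0  # refusals score 0 under this rubric
    if any(city in text for city in OTHER_CITIES):
        return 0  # naming any other city scores 0
    return 1 if re.search(r"\bparis\b", text) else 0


valid_responses = [
    "Paris",
    "Paris, the capital of France",
    "It's Paris",
    "The capital of France is Paris (located in Europe).",
]
scores = [score_capital_of_france(r) for r in valid_responses]
```

All four surface forms get the same score because the function checks the definition of correctness, not a fixed string.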
The three primitives, in plain English
Every eval you’ll ever run is one of three primitives or a stack of them. Learn the three, and the metric catalog becomes a lookup table instead of a maze.
Deterministic. A function with no model in the loop. Parse the response into JSON, check it against a schema. Run a regex for refusal phrasings. Look up cited chunk IDs in the retrieval context. Match the tool call against an expected signature. Deterministic checks are microsecond-fast, free, and never drift. They are also the wrong tool for “is this helpful.” Use them for closed-form questions where the answer is provably right or wrong against a rule.
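A minimal sketch of three deterministic checks using only the standard library. The schema (`answer`, `citations`), the refusal phrasings, and the function names are hypothetical; swap in whatever your route actually returns.

```python
import json
import re

# Hypothetical response schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}

# Hypothetical refusal phrasings for a regex-based refusal check.
REFUSAL_RE = re.compile(
    r"\b(i can't help|i cannot help|i'm not able to)\b", re.IGNORECASE
)


def check_schema(raw: str) -> bool:
    """True if raw parses as a JSON object with the required typed fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(k), t) for k, t in REQUIRED_FIELDS.items())


def check_citations(raw: str, retrieved_ids: set) -> bool:
    """True if every cited chunk ID exists in the retrieval context."""
    if not check_schema(raw):
        return False
    return all(cid in retrieved_ids for cid in json.loads(raw)["citations"])


def is_refusal(text: str) -> bool:
    """True if the text matches one of the known refusal phrasings."""
    return bool(REFUSAL_RE.search(text))
```

None of these call a model, so they can run on every response without a latency or cost budget.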
Embedding-based. Project candidate and reference into a vector space and measure distance. BERTScore (Zhang et al. 2020) does this at token level. Cosine similarity does it at sentence level. The output is a similarity score that tolerates paraphrase: “Paris is the capital” and “The capital is Paris” land close in vector space and score high. Embedding metrics need a clean reference and they score “looks similar,” not “is correct.” A confidently wrong answer that uses the right vocabulary scores high. Use them when you have a gold answer and want a fast similarity floor, or as a feature for clustering failing traces.
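To make the "looks similar, not is correct" point concrete, here is a toy sketch of cosine similarity. It uses bag-of-words counts as a stand-in for a real embedding model, which is an assumption made purely to keep the example self-contained; the arithmetic is the same once you plug in real vectors.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': token counts. A real setup would call an embedding model."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = bag_of_words("Paris is the capital of France")
paraphrase = bag_of_words("The capital of France is Paris")
wrong = bag_of_words("Lyon is the capital of France")

paraphrase_score = cosine(reference, paraphrase)
wrong_score = cosine(reference, wrong)
```

The paraphrase scores at the top of the range, but the wrong answer also scores well above 0.8 because it shares five of six tokens with the reference. That is the failure mode to keep in mind before gating on similarity alone.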
LLM-as-a-judge. A capable model reads the rubric, reads the candidate response, reads the context if there is one, and returns a score. G-Eval (Liu et al. 2023) formalized the pattern with chain-of-thought and a form-filling output schema. Pairwise variants (MT-Bench, Chatbot Arena) scaled it to ship decisions across millions of comparisons. The judge is the only general-purpose tool for rubrics that require reasoning: helpfulness, faithfulness, refusal calibration, role adherence. It is also the most expensive primitive and the one most prone to bias.
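A rough sketch of the judge pattern, assuming a binary rubric and a JSON verdict. The prompt wording, the `run_judge` name, and the `call_model` callable are all hypothetical; `call_model` stands in for whichever client you use to reach the judge model.

```python
import json
from typing import Callable

# Hypothetical judge prompt; real rubrics are usually longer and versioned.
JUDGE_TEMPLATE = """You are grading a response against a rubric.
Rubric: {rubric}
Context: {context}
Response: {response}
Reply with JSON only: {{"score": 0 or 1, "reason": "<one sentence>"}}"""


def run_judge(
    call_model: Callable[[str], str],
    rubric: str,
    response: str,
    context: str = "",
) -> dict:
    """Build the judge prompt, call the model, and parse a binary verdict."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, context=context or "(none)", response=response
    )
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
        reason = str(verdict.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        # Treat malformed output as "no verdict" instead of guessing a score.
        return {"score": None, "reason": "unparseable judge output"}
    if score not in (0, 1):
        return {"score": None, "reason": "score outside rubric range"}
    return {"score": score, "reason": reason}
```

Returning `None` for unparseable output keeps judge failures visible as their own category instead of silently counting them as passes or fails.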
The skill of an eval engineer is matching the question to the cheapest primitive that answers it honestly. A pattern we see in nearly every audit: a $0.04-per-call frontier judge running on a binary toxicity decision a 4B Gemma adapter answers in 65 milliseconds. Wrong tool. Right answer for the wrong reason.
The two axes: offline vs online, pointwise vs pairwise
Once you have a primitive picked, two axes decide where and how it runs.
Offline vs online. Offline eval runs the rubric against a versioned dataset before deploy. The CI gate is the canonical surface. Online eval runs the same rubric against live traces in production, sampled uniformly or by failure signal. Offline catches regressions you knew to look for. Online catches drift and the rare paths the dataset doesn’t cover. Both use the same rubric. The gap between the offline score and the online score is itself a quality signal worth tracking.
Pointwise vs pairwise. Pointwise scoring rates a single response against an absolute rubric: “score this answer from 0 to 1.” Pairwise scoring asks the judge to compare two responses and pick the better one. Pointwise is the right primitive for absolute SLO gates and per-axis regression diagnosis. Pairwise is the right primitive for ship decisions where “which is better” matters more than “what’s the absolute number.” Most teams run both, in different places. Rubric scores power the CI gate. Arena-style pairwise powers the launch decision.
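One practical wrinkle with pairwise scoring is that judges can favor whichever response appears first. A common mitigation, sketched here under the assumption of a judge callable that returns "A" or "B", is to ask twice with the order swapped and only accept a winner when both runs agree. The `pairwise_verdict` name and the callable signature are hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: (response_in_slot_a, response_in_slot_b) -> "A" or "B".
PairwiseJudge = Callable[[str, str], str]


def pairwise_verdict(judge: PairwiseJudge, first: str, second: str) -> str:
    """Compare two responses in both orders; return 'first', 'second', or 'tie'."""
    forward = judge(first, second)   # `first` is shown in slot A
    backward = judge(second, first)  # `first` is shown in slot B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # The judge picked the same slot both times, so position decided the outcome.
    return "tie"
```

This doubles the judge cost per comparison, which is one more reason pairwise tends to live at the launch decision and not in the per-PR gate.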
The reason these axes matter for a beginner: you can pick a single primitive and still get the deployment wrong by running it in the wrong place at the wrong cadence. A judge on the inline user-facing path will blow your p95 latency budget. A deterministic schema check running quarterly instead of per-PR catches nothing.
The starter workflow: dataset, rubric, judge, gate
Here is the smallest working setup that earns its keep. Four steps, in order.
1. Build the dataset. Sample fifty to one hundred traces per route from the last seven days of production. Bias toward the hardest cases. Add an expected_behavior field per trace, even if it’s just a sentence. Version the dataset like you version a prompt: tag releases, freeze the set for active CI gates, review additions in PR. A test set invented at launch reflects the test author’s assumptions, not user behavior; that’s why this step starts with production data.
2. Pick four rubrics. One per failure axis. Groundedness for RAG correctness (every claim supported by retrieval context). Refusal calibration for safety (covers both over-refusal and under-refusal). Factual accuracy for the is-this-true axis. One custom rubric tied to your specific product outcome. The custom rubric is where your real signal lives; everything else is borrowed.
3. Wire the judge. Deterministic where possible (schema, citation existence, refusal regex). Classifier where a sharp target exists (toxicity, PII). LLM-as-a-judge where the rubric requires reasoning. Pin the judge model and rubric version as a single contract. Cache verdicts keyed on the contract. The eval is the tuple (judge_model_id, rubric_version, prompt_template_hash), and a vendor swap is a deliberate eval-suite migration, not a config change.
4. Set the gate. Run the rubrics against the dataset on every PR. Fail the gate if any rubric drops more than two points (0.02 on the 0-to-1 scale the floors here use) from the trailing 7-day baseline, or falls below an agreed absolute floor (0.75 for faithfulness and 0.85 for task completion are reasonable defaults you’ll tune). The gate produces an artifact: rubric scores per dataset entry with diffs against baseline. Engineers reviewing the PR drill into failing examples and decide whether the regression is real or the judge is noisy.
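The gate in step 4 can be sketched in a few lines. This assumes scores on a 0-to-1 scale, so "two points" is read as 0.02; the `check_gate` and `GateResult` names are hypothetical, and the floors are the starting defaults quoted above.

```python
from dataclasses import dataclass

MAX_DROP = 0.02  # "two points", assuming a 0-to-1 score scale
FLOORS = {"faithfulness": 0.75, "task_completion": 0.85}  # defaults to tune


@dataclass
class GateResult:
    rubric: str
    score: float
    baseline: float
    passed: bool
    reason: str


def check_gate(scores: dict, baselines: dict) -> list:
    """Compare each rubric's mean score to its trailing baseline and floor."""
    results = []
    for rubric, score in scores.items():
        baseline = baselines[rubric]
        floor = FLOORS.get(rubric, 0.0)
        # Round the difference so float noise does not trip the threshold.
        drop = round(baseline - score, 6)
        if drop > MAX_DROP:
            results.append(GateResult(rubric, score, baseline, False, "drop vs baseline"))
        elif score < floor:
            results.append(GateResult(rubric, score, baseline, False, "below floor"))
        else:
            results.append(GateResult(rubric, score, baseline, True, "ok"))
    return results


def gate_passes(results: list) -> bool:
    """The PR clears the gate only if every rubric passed."""
    return all(r.passed for r in results)
```

In CI, the list of `GateResult` rows is the artifact reviewers read; a failing row points them at the rubric to drill into.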
That’s the whole flow. A few lines of code in the ai-evaluation SDK get you started.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness

# question, answer, and retrieved_chunks are assumed to come from one of your traces.
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[Groundedness()],
    inputs=[{"input": question, "output": answer, "context": retrieved_chunks}],
)
# Print the verdict and the explanation for the first input.
print(result.eval_results[0].output, result.eval_results[0].reason)
```
Swap Groundedness for any of 60+ EvalTemplate classes including ContextAdherence, FactualAccuracy, AnswerRefusal, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling. Pass an array of inputs to score a batch in one call. When you outgrow code-defined rubrics, swap to a CustomLLMJudge with your own grading_criteria text.
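The "pin the contract, cache on the contract" idea from step 3 is independent of any SDK. A minimal sketch, with hypothetical `EvalContract` and `VerdictCache` names and an in-memory dict standing in for a real cache; examples are assumed to be JSON-serializable dicts.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalContract:
    """The tuple that defines an eval: change any field and verdicts are new."""
    judge_model_id: str
    rubric_version: str
    prompt_template_hash: str


def template_hash(template: str) -> str:
    """Short, stable hash of the judge prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


def cache_key(contract: EvalContract, example: dict) -> str:
    """Key a verdict on the contract plus the example being scored."""
    payload = json.dumps(
        {"contract": asdict(contract), "example": example}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VerdictCache:
    """In-memory cache; a real deployment would back this with a store."""

    def __init__(self) -> None:
        self._store = {}

    def get_or_score(self, contract: EvalContract, example: dict, score_fn: Callable):
        key = cache_key(contract, example)
        if key not in self._store:
            self._store[key] = score_fn(example)  # only pay for new verdicts
        return self._store[key]
```

Because the judge model ID is part of the key, swapping vendors invalidates every cached verdict at once, which is exactly the "deliberate migration" behavior described above.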
What to measure first (and what to skip)
The DeepEval-style metric tables have forty rows. You don’t need forty rubrics. You need four well-calibrated ones tied to the four axes that decide whether your product works.
Groundedness. For RAG, the rubric that pays back fastest. Score 1 if every claim in the response is supported by the retrieval context, 0 otherwise. Cheap variants check claim-by-claim entailment with a small NLI model (DeBERTa). Richer variants run an LLM judge.
Refusal calibration. Both directions count. Over-refusal (the model refuses a benign request) is as much a failure as under-refusal (the model answers a harmful one). A small classifier handles sharp safety targets; a judge handles the gray zone where context decides.
Factual accuracy. Did the model state something true, against either a retrieved source or a known fact. For RAG, this overlaps with groundedness. For free-form generation, it stands alone. The cheap version uses an embedding metric against a gold answer. The richer version uses a judge with retrieval-grounded reasoning.
One custom rubric. Yours. Did the contract review flag the indemnity clause. Did the support bot escalate at turn three when the user mentioned legal action. Did the code-gen agent produce code that compiles and passes a unit test. This rubric is the one only your team can write.
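Since refusal calibration counts failures in both directions, it helps to report two rates instead of one accuracy number. A small sketch, assuming each labeled example is a `(should_refuse, did_refuse)` pair; the function name and the sample data are made up for illustration.

```python
def refusal_rates(examples):
    """Compute over- and under-refusal rates from (should_refuse, did_refuse) pairs."""
    benign = [did for should, did in examples if not should]
    harmful = [did for should, did in examples if should]
    # Over-refusal: the model refused a request it should have answered.
    over = sum(benign) / len(benign) if benign else 0.0
    # Under-refusal: the model answered a request it should have refused.
    under = sum(not did for did in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal_rate": over, "under_refusal_rate": under}


# Illustrative labels only: four benign requests, two harmful ones.
labeled = [
    (False, False),
    (False, True),   # benign request refused: over-refusal
    (False, False),
    (False, False),
    (True, True),
    (True, False),   # harmful request answered: under-refusal
]
rates = refusal_rates(labeled)
```

Tracking the two rates separately keeps a fix for one direction from quietly regressing the other.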
What to skip on day one: per-token perplexity dashboards, response-style metrics that don’t correlate with user complaints, BLEU on anything that isn’t machine translation, and any rubric you can’t explain to a non-ML stakeholder in one sentence. Most starter suites are too generic and stop being maintained within a quarter; four well-calibrated rubrics beat fifteen noisy ones.
The hand-off to production: traces, observability, Error Feed
Offline eval gates regressions. Production eval catches drift. Both need to live in the same place as the trace, or no one will look at any of them.
The pattern that works: attach eval scores to the OpenTelemetry span so the eval result lives next to the trace that produced it. traceAI (Apache 2.0) ships 50+ AI surfaces across Python (46 packages), TypeScript (39), Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time mean your traces aren’t locked to one vendor’s format. Built-in evals wire to span attributes via EvalTag so the score, the input, and the trace timeline are in the same view.
Alarm on rolling-mean drift: per-route, per-rubric, per-prompt-version. A two-to-five point sustained drop over fifteen to sixty minutes is the right detection threshold for most products. Triage failing traces into an annotation queue. A human or a clusterer decides whether the failure is a bug, a rubric problem, or expected.
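A rolling-mean drift alarm can be sketched with a fixed-size window. Two simplifications to flag: this version windows by count of scored traces instead of by minutes, and it assumes 0-to-1 scores so a two-point drop is 0.02. The `DriftAlarm` name is hypothetical.

```python
from collections import deque


class DriftAlarm:
    """Fires when the rolling mean falls more than max_drop below the baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.02):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def observe(self, score: float) -> bool:
        """Record one online eval score; return True if the alarm should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alarming
        rolling_mean = sum(self.scores) / len(self.scores)
        return self.baseline - rolling_mean > self.max_drop
```

To alarm per route, per rubric, and per prompt version, keep one `DriftAlarm` per `(route, rubric, prompt_version)` key in a dict.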
Closing the loop is the move that compounds. The simplest version is a weekly job: pull 100 random production traces, score with the current rubrics, promote anything below threshold into the offline dataset with a quick human sanity check. The next PR has to clear the new entries. Error Feed automates this inside the Future AGI eval stack: HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failures into named issues; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) writes the immediate fix. Those fixes feed back into the Platform’s self-improving evaluators so the rubric ages with the product rather than against it.
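The simple weekly job described above fits in one function. This is a sketch under assumptions: traces are dicts, `score_fn` wraps whichever rubric you already run, and the function name is hypothetical. The returned candidates still need the human sanity check before they enter the dataset.

```python
import random


def weekly_promotion_candidates(traces, score_fn, threshold, sample_size=100, seed=None):
    """Sample production traces and return those scoring below the threshold."""
    rng = random.Random(seed)  # a seed makes the weekly sample reproducible
    sample = rng.sample(traces, min(sample_size, len(traces)))
    return [trace for trace in sample if score_fn(trace) < threshold]


# Illustrative traces with precomputed scores standing in for a rubric call.
traces = [{"id": i, "score": i / 10} for i in range(10)]
candidates = weekly_promotion_candidates(
    traces, score_fn=lambda t: t["score"], threshold=0.3, seed=0
)
candidate_ids = sorted(t["id"] for t in candidates)
```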
Common beginner mistakes
- Stopping at offline. Offline pass is necessary, not sufficient. Real users find what the test author didn’t think of.
- Inventing the test set. A whiteboard test set reflects your assumptions, not user behavior. Sample from production from day one.
- Treating the judge as ground truth. The judge is a model with its own biases (position, verbosity, self-preference, calibration drift). Sample 50 traces a quarter and human-label them. Track judge-human Cohen’s kappa as its own metric.
- No baseline. A score without a trailing baseline is a number with no context. Compare to last week, not to a frozen reference.
- Too many rubrics. Four well-calibrated rubrics beat fifteen noisy ones. Cut anything that doesn’t correlate with user complaints.
- Eval lives in a different tool than the trace. When the score, the trace, and the failure live in three places, no one reads any of them. Attach scores to the span.
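For the judge-versus-human check in the list above, Cohen’s kappa is short enough to compute by hand. A minimal sketch for two label sequences of equal length; the function name is ours, and libraries such as scikit-learn ship an equivalent if you already depend on them.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Expected agreement if both raters labeled at random with their own label rates.
    labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks human labels closely; a kappa near 0 means its agreement is no better than chance, whatever the raw agreement rate looks like.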
When you’ve outgrown “gentle”
Three signs you’re ready for something bigger.
- You have more than four routes and one rubric set no longer fits everywhere. Per-route configs become worth the operational cost.
- You’re running LLM-as-a-judge on more than 10K examples a week and the bill grows faster than the inference bill. Classifier-backed evals start paying for themselves.
- You have a backlog of failing traces and no time to triage them by hand. Auto-clustering becomes the difference between learning from production and drowning in it.
Until those signs appear, the four-step workflow above plus a CI gate plus a weekly loop is the whole job.
How Future AGI fits a beginner’s stack
The eval surface ships as a package you light up in order. Start with the SDK. Graduate to the Platform when you want rubrics that improve themselves rather than stay frozen.
- ai-evaluation SDK (Apache 2.0). Code-first. 60+ EvalTemplate classes. 13 guardrail backends including 9 open-weight (Llama Guard 3 8B/1B, Qwen3-Guard 8B/4B/0.6B, Granite Guardian 8B/5B, WildGuard 7B, ShieldGemma 2B) and 4 API (OpenAI Moderation, Azure Content Safety, Turing Flash, Turing Safety). 8 sub-10ms local Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge via LiteLLM. Start here.
- Future AGI Platform. Self-improving evaluators tuned by thumbs feedback. An in-product authoring agent that turns natural-language descriptions into rubrics. Classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Graduate here when you want the rubric to improve itself rather than decay.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering plus a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser, 90% prompt-cache) writes the immediate fix per named issue. Fixes feed back into the self-improving evaluators. Linear integration today; Slack, GitHub, Jira, and PagerDuty on the roadmap.
traceAI (Apache 2.0) is the tracing layer that joins the eval and the trace: 50+ AI surfaces across Python, TypeScript, Java, and C#. The hosted Agent Command Center is SOC 2 Type II, HIPAA, GDPR, and CCPA certified (ISO/IEC 27001 in active audit) and routes evals across 20+ providers with shadow, mirror, and race modes.
Ready to run your first eval against your own workload? Install ai-evaluation, drop a Groundedness rubric against your last fifty production traces this afternoon, and wire the same rubric as an EvalTag on live spans via traceAI tomorrow. The same rubric in both places is what turns an LLM eval from a notebook experiment into a feedback loop that holds for two years.