The 2026 LLM Evaluation Playbook

The pillar playbook for LLM evaluation in 2026: dataset, metrics, judge, CI gate, production observation, and the closed loop from failing trace back to regression test.

The LLM evaluation question used to be “did the output look right.” In 2026 it’s a six-layer engineering problem: dataset, metrics, judge, CI gate, production observation, and the loop that closes between them. The teams shipping reliable agents in 2026 treat eval as core infrastructure, not a post-launch checklist. This playbook is the working pattern from the deployments we’ve watched ship and stay shipped.

TL;DR: the six layers

Layer | Job | Failure if missing
Dataset | Inputs with expected behavior | Eval scores nothing useful
Metrics | Rubric definitions | Pass means whatever you want
Judge | Scoring engine | Inconsistent rubric application
CI gate | Block bad PRs | Regressions ship
Production observation | Score live traces | Drift invisible until users complain
Closed loop | Failures back into dataset | Same bugs ship twice

If you only build three: dataset + judge + CI gate. The rest are how you keep the playbook honest over time.

Layer 1: the dataset

A versioned set of inputs paired with expected behavior. The shape depends on the task:

  • Single-turn QA. (input, expected_output, retrieval_context?, metadata).
  • Tool-using agent. (input, expected_tool_calls, expected_final_state, retrieval_context?, metadata).
  • Conversational agent. (system_prompt, turns, expected_outcome, persona, scenario, metadata).
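
As a rough illustration, the three record shapes above could be written as plain dataclasses. The class names and the example values are hypothetical; only the field names come from the shapes listed above, and this is not the schema of any particular SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class SingleTurnExample:
    """One single-turn QA record."""
    input: str
    expected_output: str
    retrieval_context: Optional[list[str]] = None  # optional, matching the "?" in the shape
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ToolAgentExample:
    """One tool-using agent record."""
    input: str
    expected_tool_calls: list[dict[str, Any]]
    expected_final_state: dict[str, Any]
    retrieval_context: Optional[list[str]] = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ConversationExample:
    """One multi-turn conversation record."""
    system_prompt: str
    turns: list[dict[str, str]]
    expected_outcome: str
    persona: str
    scenario: str
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical example values, for illustration only.
example = SingleTurnExample(
    input="What is the refund window?",
    expected_output="30 days from delivery.",
    retrieval_context=["Refunds are accepted within 30 days of delivery."],
    metadata={"route": "support-bot", "source": "prod-trace"},
)
```

Keeping route and source in metadata is one way to support per-route CI scoping later; other layouts work just as well.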

Three rules that decide whether the dataset earns its keep:

  1. Sampled from production, not invented. A test set written by the test author at launch reflects the test author’s assumptions, not user behavior. Pull representative traces from prod and annotate them.
  2. Weighted toward failures. Most of the eval signal comes from the hardest 10% of inputs. Skew the dataset toward edge cases and rare intents; keep some happy-path coverage for sanity.
  3. Versioned and timestamped. Treat the dataset like prompt versions: tag releases, freeze datasets for active CI gates, and review additions in PR. Drift in the dataset can move scores measurably; tracking is the only way to know.
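
One lightweight way to make rule 3 checkable is to fingerprint the dataset contents and store the fingerprint next to every eval run. The sketch below assumes examples are JSON-serializable dicts; the function name and the 12-character truncation are arbitrary choices, not part of any tool named in this playbook.

```python
import hashlib
import json


def dataset_fingerprint(examples: list[dict]) -> str:
    """Return a short, order-independent hash of the dataset contents."""
    # Serialize each example canonically, then sort so row order does not matter.
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical dataset versions.
v1 = [{"input": "hi", "expected_output": "hello"}]
v2 = v1 + [{"input": "refund?", "expected_output": "30 days"}]

# A changed fingerprint means scores from the two runs are not directly comparable.
changed = dataset_fingerprint(v1) != dataset_fingerprint(v2)
```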

Start at 50-100 examples per route. Grow weekly by promoting failing production traces. Beyond a few hundred, judge cost becomes the dominant constraint and sampling beats more examples. The synthetic-test-data approach covers ways to scale dataset construction without losing signal.

Layer 2: the metrics

Pick a small set of metrics that map to failure modes you actually have. Three families:

Deterministic metrics — cheap, fast, exact:

  • Exact match / structured-output schema. Did the response parse into the expected JSON shape?
  • Tool-call success. Did the agent call the tools you expected with the arguments you expected?
  • Citation validity. Does every cited span actually exist in the retrieval context?
  • Length and format bounds. Within the allowed token budget? Conforming to format constraints?
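
A minimal sketch of three of these deterministic checks, using only the standard library. The function names are hypothetical, citation validity is simplified to verbatim substring matching, and word count stands in for a real tokenizer.

```python
import json


def matches_schema(response_text: str, required: dict[str, type]) -> bool:
    """True if the response parses as a JSON object with the required typed keys."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(isinstance(parsed.get(key), typ) for key, typ in required.items())


def citations_valid(cited_spans: list[str], retrieval_context: list[str]) -> bool:
    """True if every cited span appears verbatim in at least one context chunk."""
    return all(
        any(span in chunk for chunk in retrieval_context) for span in cited_spans
    )


def within_length(response_text: str, max_words: int) -> bool:
    """Crude length bound: whitespace-separated words as a proxy for tokens."""
    return len(response_text.split()) <= max_words
```

Checks like these are cheap enough to run on every trace, which is why they sit in front of the judge.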

Semantic metrics (LLM-as-judge) — expensive, slow, harder to fool:

  • Faithfulness. Does the response only assert what the retrieval context supports?
  • Groundedness. Are claims linked to source spans?
  • Context precision and recall. Did retrieval surface the right chunks?
  • Task completion. Did the agent fulfill the user’s request?
  • Role adherence. Did the assistant stay in its declared persona?

Conversation and outcome metrics — for multi-turn agents:

  • Conversation completeness. Did the dialogue reach the expected end state?
  • Knowledge retention. Did the agent carry facts across turns?
  • Outcome label. Did the user accomplish their goal (resolved, filed, booked, refunded)?

Most teams write five evals once and never touch them, then ship breaking changes for months because no one updates them. Future AGI’s eval stack is built so the rubric improves itself from production feedback instead of decaying. The ai-evaluation SDK (Apache 2.0) is the code-first surface for these rubric families: you call Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]) with EvalTemplate classes such as Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, TaskCompletion, and EvaluateFunctionCalling, and augment=True cascades from cheap heuristics into LLM-as-judge. The Future AGI Platform hosts the in-product authoring agent (describe a rubric in natural language and the agent generates the rubric, grading prompt, and reference examples) and the self-improving evaluators that retune from production thumbs up/down feedback.

Layer 3: the judge

The judge is the LLM or classifier that turns a rubric prompt plus a candidate response into a numeric score. Three rules:

  1. Your eval bill grows faster than your inference bill once you start LLM-as-judge at scale. Run small classifiers on every trace and reserve frontier models for adjudication. Future AGI’s Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what makes weekly full-dataset reruns the default rather than a budget conversation.
  2. Pin the judge model and rubric version. A floating judge model produces drifting scores. The judge version is part of the eval contract.
  3. Calibrate against human labels. Sample 50-100 traces per quarter, label by hand, compare to the judge. Track inter-rater reliability between judge and human as its own quality metric.
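
For rule 3, two common ways to quantify judge-human agreement are raw agreement rate and Cohen's kappa, which corrects for chance agreement. The sketch below assumes both raters assign one categorical label per trace; the function names are illustrative.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of traces where judge and human assigned the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge) | set(human)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four traces.
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_labels, human_labels)  # 0.5 for these labels
```

Tracking kappa per rubric over time makes it visible when a rubric or judge change quietly reduces agreement with human reviewers.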

Bad rubrics are the most common eval bug we see. Symptoms: high variance across reruns, judge-human disagreement above 20%, or rubric scores that don’t correlate with user satisfaction. The fix is usually rubric clarification, not a fancier judge.

Layer 4: the CI gate

The gate runs the rubrics against the versioned dataset on every PR. Three knobs:

  • Per-rubric thresholds. Fail the PR if any rubric drops more than 2 points from the baseline, or if absolute score falls below an agreed floor (0.75 for faithfulness, 0.85 for task completion are reasonable defaults to tune).
  • Per-route scoping. A PR touching the support-bot prompt doesn’t need to rerun the full eval suite for the sales agent. Scope CI runs to the affected routes.
  • Baseline tracking. Compare against the trailing 7-day rolling baseline, not a frozen number. Models drift; the baseline drifts with them; the gate catches regressions relative to the moving truth.

The gate produces an artifact: rubric scores per dataset entry, with diffs against the baseline. Engineers reviewing the PR can drill into failing examples and decide if the regression is real or a noisy judge.
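
A minimal sketch of the threshold logic, assuming scores live on a 0-1 scale so that "2 points" becomes a 0.02 drop. That scale, the function and class names, and the example numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    rubric: str
    score: float
    baseline: float
    passed: bool
    reason: str


def check_gate(
    scores: dict[str, float],
    baselines: dict[str, float],
    floors: dict[str, float],
    max_drop: float = 0.02,  # assumption: "2 points" on a 0-1 scale
) -> list[GateResult]:
    """Compare each rubric score against its absolute floor and its baseline."""
    results = []
    for rubric, score in scores.items():
        baseline = baselines[rubric]
        floor = floors.get(rubric, 0.0)  # rubrics without a floor only get the drop check
        if score < floor:
            results.append(GateResult(rubric, score, baseline, False, f"below floor {floor}"))
        elif baseline - score > max_drop:
            results.append(GateResult(rubric, score, baseline, False, "dropped past baseline"))
        else:
            results.append(GateResult(rubric, score, baseline, True, "ok"))
    return results


# Hypothetical PR run: one floor failure, one baseline regression, one pass.
results = check_gate(
    scores={"faithfulness": 0.74, "task_completion": 0.86, "role_adherence": 0.91},
    baselines={"faithfulness": 0.76, "task_completion": 0.90, "role_adherence": 0.92},
    floors={"faithfulness": 0.75, "task_completion": 0.85},
)
gate_passed = all(r.passed for r in results)
```

In a CI job, a failing gate_passed would translate into a non-zero exit code, and the results list is the raw material for the per-entry artifact.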

Layer 5: production observation

Offline eval gates regressions; production eval catches drift. The pattern that works:

  • Sample production traffic (uniformly or by failure signal) and score with the same rubric used in CI.
  • Attach scores to the OTel span so the eval result lives next to the trace. traceAI (Apache 2.0) ships 50+ AI surfaces across Python / TypeScript / Java / C# — 46 Python packages, 39 TypeScript, 24 Java modules (Spring AI, Spring Boot starter, LangChain4j, Semantic Kernel), 1 C#. Pluggable semantic conventions at register() time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest cleanly into Phoenix or Traceloop without re-instrumenting. 62 built-in evals wire to span attributes via EvalTag for zero added latency; 14 span kinds (Phoenix has 8) include A2A_CLIENT / A2A_SERVER; LangGraph topology surfaces node_count, conditional edges, and state diffs.
  • Alarm on rolling-mean drift. Per-route, per-rubric, per-prompt-version. A 2-5 point sustained drop over 15-60 minutes is the right detection threshold for most products.
  • Triage failing traces to an annotation queue. A human (or an automated clusterer) decides whether the failure is a bug, a rubric problem, or expected.
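
The rolling-mean alarm above can be sketched with a fixed-size window. For simplicity this version counts scored traces instead of minutes, which is an assumption; a production version would window by timestamp and keep one alarm per route, rubric, and prompt version. The class name and the 0.03 threshold are illustrative.

```python
from collections import deque


class DriftAlarm:
    """Fire when the rolling mean of recent scores drops too far below baseline."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.03):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores: deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one score; return True if the full-window mean has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet for a stable mean
        rolling_mean = sum(self.scores) / len(self.scores)
        return self.baseline - rolling_mean > self.max_drop


# Hypothetical score stream: stable at first, then a sustained drop.
alarm = DriftAlarm(baseline=0.90, window=4, max_drop=0.03)
stream = [0.90, 0.91, 0.89, 0.90, 0.84, 0.83, 0.85, 0.84]
fired = [alarm.observe(s) for s in stream]
```

Requiring a full window before firing is what keeps a single bad trace from paging anyone.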

Production observation is where the rubric meets the user. Drift between offline pass and online drop is a quality signal of its own; track it.

Layer 6: the closed loop

The loop is what makes the playbook compound. Without it, each incident produces a one-off fix and the team writes the same regression twice.

Two automation patterns:

  • Auto-cluster failures inside the eval stack. Error Feed is part of Future AGI’s eval stack — it clusters every trace failure into a named issue using HDBSCAN soft-clustering over ClickHouse-stored embeddings (production v2; noise points with prob >= 0.4 get reassigned to the highest-probability cluster). A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools: read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, submit_summary; Haiku Chauffeur sub-agent summarises spans over 3000 chars; 90% prompt-cache hit ratio) writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5). Those fixes feed back into the platform’s self-improving evaluators — that’s the mechanism that closes the loop without you re-authoring the evaluators yourself.
  • Promote to dataset. From each named issue, the on-call engineer (or a scheduled job) promotes representative traces into the offline eval set with rubric labels. The next PR that touches the offending path has to clear the new entries.
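
The promotion step can be sketched as a small function that copies a few unseen failing traces into the offline set. The trace fields (trace_id, input, corrected_output), the issue name, and the cap of three per issue are hypothetical; the expected output is assumed to be a human-reviewed label, not the failing response.

```python
def promote_traces(
    dataset: list[dict],
    failing_traces: list[dict],
    issue_name: str,
    max_per_issue: int = 3,
) -> list[dict]:
    """Append up to max_per_issue unseen failing traces; return the added entries."""
    seen_inputs = {entry["input"] for entry in dataset}
    added = []
    for trace in failing_traces:
        if len(added) >= max_per_issue:
            break
        if trace["input"] in seen_inputs:
            continue  # skip inputs the dataset already covers
        entry = {
            "input": trace["input"],
            "expected_output": trace["corrected_output"],  # human-reviewed label
            "metadata": {
                "source": "production",
                "issue": issue_name,
                "trace_id": trace["trace_id"],
            },
        }
        dataset.append(entry)
        seen_inputs.add(trace["input"])
        added.append(entry)
    return added


# Hypothetical dataset and failing traces from one named issue.
dataset = [{"input": "a", "expected_output": "x", "metadata": {}}]
traces = [
    {"trace_id": "t1", "input": "a", "corrected_output": "x"},
    {"trace_id": "t2", "input": "b", "corrected_output": "y"},
    {"trace_id": "t3", "input": "b", "corrected_output": "y"},
]
added = promote_traces(dataset, traces, issue_name="refund-policy-hallucination")
```

Capping promotions per issue keeps one noisy cluster from dominating the regression set.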

The dataset ratchets stronger over time. The CI gate catches more regressions every quarter. The closed loop is what separates teams whose eval scores trend down from teams whose eval scores trend up.

Common mistakes

  • Mistake: stop at offline. Offline pass is a necessary, not sufficient, condition. Real users find what the test author didn’t think of.
  • Mistake: too many rubrics. Ten well-calibrated rubrics beat thirty noisy ones. Cut rubrics that don’t correlate with user complaints.
  • Mistake: judge-as-marketing. Don’t pick the judge model that produces the highest scores; pick the one that agrees with human labels.
  • Mistake: hand-written test set never updates. A static dataset stops being a regression suite the moment production drifts past it.
  • Mistake: per-turn scoring on multi-turn products. Per-turn evals on a conversational agent produce false confidence. Add conversation-level metrics.
  • Mistake: ignore tool calls. Response-only scoring misses agent failures whose root cause is a bad tool call or a bad retrieval. Score the trace itself, including the tool calls and the retrieval, alongside the response.
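
To make the last mistake concrete, here is a small sketch in which the response looks fine but the tool call is wrong. The trace layout, the field names, and the exact-match comparison are assumptions for illustration; real agents may need order-insensitive or partial argument matching.

```python
def tool_calls_match(expected: list[dict], actual: list[dict]) -> bool:
    """True if the agent made the expected calls, in order, with the same arguments."""
    if len(expected) != len(actual):
        return False
    return all(
        e["name"] == a["name"] and e["arguments"] == a["arguments"]
        for e, a in zip(expected, actual)
    )


def score_trace(trace: dict, expected: dict) -> dict:
    """Score the response and the tool calls separately so neither hides the other."""
    return {
        "response": expected["answer_contains"].lower() in trace["response"].lower(),
        "tool_calls": tool_calls_match(expected["tool_calls"], trace["tool_calls"]),
    }


# Hypothetical trace: the reply sounds right, but the refund went to the wrong order.
trace = {
    "response": "Your refund has been issued.",
    "tool_calls": [{"name": "issue_refund", "arguments": {"order_id": "A-2"}}],
}
expected = {
    "answer_contains": "refund",
    "tool_calls": [{"name": "issue_refund", "arguments": {"order_id": "A-1"}}],
}
scores = score_trace(trace, expected)
```

Response-only scoring would pass this trace; scoring the tool calls is what surfaces the failure.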

Three deliberate tradeoffs

  • The playbook adds operational surface. Six layers is more moving parts than a single pytest directory. The payoff is reliability that compounds: the dataset gets sharper, the judge stays calibrated, the CI gate catches more, and the rate of production regressions trends down quarter over quarter. New deployments can ship with traceAI plus ai-evaluation alone and turn on the gateway, optimizer, and error feed when traffic justifies them.
  • Self-improving rubrics need calibration of their own. Rubrics that learn from production traces can drift in unexpected directions. Pin a human-labeled hold-out set; alarm when the judge disagrees with the hold-out by more than the inter-rater baseline.
  • LLM-as-judge cost grows with traffic. Continuous online scoring on every trace is expensive at scale. Sample by failure signal, cache deterministic substrings, and reserve frontier models for adjudication. Future AGI’s Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2 for the scale-economics tradeoff.
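
Sampling by failure signal, as in the third tradeoff, can be sketched as: always score traces that carry a failure signal, and hash-sample the rest at a low base rate. The signal field names (thumbs_down, tool_error, guardrail_triggered) and the 5% default are hypothetical.

```python
import hashlib


def should_score(trace: dict, base_rate: float = 0.05) -> bool:
    """Always score traces with a failure signal; hash-sample the rest at base_rate."""
    if trace.get("thumbs_down") or trace.get("tool_error") or trace.get("guardrail_triggered"):
        return True
    # Hashing the trace ID makes the decision deterministic and repeatable.
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < base_rate * 10_000
```

Hashing the trace ID instead of calling a random generator means the same trace gets the same decision on every replay, which helps when debugging why a trace was or was not scored.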

How Future AGI wires the full playbook

Future AGI ships the eval stack as a package. Start with the SDK for custom code-defined evals. Graduate to the Platform when you want self-improving evals authored by an in-product agent.

  • SDK (ai-evaluation, Apache 2.0): import with from fi.evals import Evaluator, Protect, then call Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]). 60+ EvalTemplate classes including Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, TaskCompletion, EvaluateFunctionCalling. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API including OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY). 8 sub-10ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Four distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL + AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. Multi-modal CustomLLMJudge via LiteLLM. AutoEval pipelines from natural-language description.
  • Platform (cloud / hosted Agent Command Center): self-improving evaluators (thumbs up/down or relabel retunes the rubric — richer than the SDK’s few-shot retrieval). In-product agent authors unlimited custom evaluators from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Error Feed (inside the eval stack): HDBSCAN clustering over ClickHouse embeddings + Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) writes the fix; those fixes feed back into the platform’s self-improving evaluators so your eval suite evolves with your product. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
  • CI gate: ai-evaluation plugs into pytest, GitHub Actions, GitLab CI, or any test runner.
  • Production observation: same rubrics run as span-attached scorers against live traces via traceAI (Apache 2.0) — 50+ AI surfaces across Python (46 packages) / TypeScript (39) / Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel) / C#; pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY); 14 span kinds; 62 built-in evals via EvalTag.
  • agent-opt: six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer); teacher-inferred few-shot templates; resumable Optuna studies; unified Evaluator over heuristics / LLM-judge / 60+ FAGI rubrics; EarlyStoppingConfig shared across all six. Eval-driven today; trace-stream ingestion (traceAI → dataset connector) is the active roadmap item.

The Agent Command Center is the hosted runtime: a 17 MB Go binary, six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends (20+ providers total), 6 routing strategies + circuit breaker + shadow/mirror/race modes, 6 exact and 4 semantic cache backends (defaults: similarity 0.85, 256 dims, 50k LRU), 5-level hierarchical budgets (org/team/user/key/tag), MCP + A2A + realtime WebSocket + Assistants + threads + runs + vector stores + batch + files + video + OCR + rerank + responses + async + scheduled, native Anthropic /v1/messages and Gemini /v1beta, response headers x-agentcc-cost / x-agentcc-latency-ms / x-agentcc-model-used / x-agentcc-fallback-used / x-agentcc-routing-strategy / x-agentcc-cache / x-agentcc-guardrail-triggered. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace, multi-region hosted.

Frequently asked questions

What does an LLM evaluation playbook cover in 2026?
Six layers. (1) Dataset: a versioned set of inputs with expected behavior, refreshed weekly from production. (2) Metrics: rubric definitions covering faithfulness, groundedness, task completion, tool-use correctness, conversation completeness. (3) Judge: the LLM or classifier that scores responses against the rubric, calibrated against human labels. (4) CI gate: rubric-on-dataset run on every PR with thresholds. (5) Production observation: same rubrics applied to live traces as span-attached scores. (6) Closed loop: failing traces auto-cluster into named issues and promote back into the dataset.
What's different between 2024 and 2026 LLM evaluation?
Four shifts. First, per-turn metrics gave way to conversation-level and outcome metrics, because per-turn scoring on multi-turn agents produces false confidence. Second, static benchmarks gave way to weekly-refreshed datasets sampled from production, because real users find what test authors didn't think of. Third, eval moved from a separate dashboard to span-attached scores on the OTel trace tree, so the eval, the trace, and the failure live in the same place. Fourth, rubrics now learn from production feedback rather than getting authored once and forgotten, and failing traces promote back into the dataset automatically rather than waiting for a quarterly review pass.
How big should the eval dataset be?
Start at 50-100 examples per route, weighted toward representative user behavior plus the hardest 10% of failures observed so far. Grow weekly by promoting failing production traces. Beyond ~500 examples per route, sampling becomes a bigger lever than dataset size for judge cost reasons. Quality, coverage of failure modes, and refresh cadence matter more than raw count.
LLM-as-judge versus deterministic metrics — which?
Both, layered. Deterministic metrics (exact match, structured-output schema, tool-call success, citation validity, length and format bounds) catch about half of real-world failures and are cheap to run on every trace. LLM-as-judge catches the rest (faithfulness, role adherence, helpfulness, task completion) and is much more expensive. The right setup runs deterministic checks first, falls back to LLM-as-judge on cases that need semantic scoring, and reserves frontier models for adjudication when a small classifier and a large judge disagree. The ai-evaluation SDK's augment=True flag encodes this cascade directly.
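The layering in this answer can be sketched as a three-tier function. This is a generic illustration with hypothetical callables standing in for the classifier, the judge, and the frontier adjudicator; it is not the implementation behind the SDK's augment=True flag.

```python
from typing import Callable

Check = Callable[[str], bool]


def cascade_score(
    response: str,
    checks: dict[str, Check],
    classifier: Check,
    judge: Check,
    adjudicator: Check,
) -> dict:
    """Run cheap checks first, then classifier and judge, adjudicating on disagreement."""
    # Tier 1: deterministic checks; any failure short-circuits the expensive tiers.
    for name, check in checks.items():
        if not check(response):
            return {"passed": False, "decided_by": f"deterministic:{name}"}
    # Tier 2: the small classifier and the LLM judge each give a verdict.
    small_verdict = classifier(response)
    judge_verdict = judge(response)
    if small_verdict == judge_verdict:
        return {"passed": judge_verdict, "decided_by": "judge"}
    # Tier 3: the frontier model is called only when the two disagree.
    return {"passed": adjudicator(response), "decided_by": "adjudicator"}


# Stub scorers for illustration; real ones would call models.
checks = {"non_empty": lambda r: bool(r.strip())}
result = cascade_score(
    "The refund window is 30 days.",
    checks,
    classifier=lambda r: True,
    judge=lambda r: False,
    adjudicator=lambda r: True,
)
```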
What's the right CI gate threshold?
Set thresholds per rubric, per route, calibrated against the current production baseline. A reasonable starting point: fail the PR if any rubric drops more than 2 points from baseline on the regression set, or if absolute score falls below an agreed floor (often 0.75 for faithfulness, 0.85 for task completion, varies by domain). Tune as the dataset matures.
How does production observation fit with offline eval?
Same rubric, two places. The CI gate runs the rubric against a versioned dataset to catch regressions before deploy. Production observation runs the same rubric against live traces (sampled uniformly or by failure signal) to catch drift, new failure modes, and rare paths the dataset doesn't cover. Span-attached scores keep both in the same trace tree, so a regression on the CI dataset and a drift in production are visible against the same OTel attributes. Alarm on a 2-5 point sustained drop over 15-60 minutes, per-route and per-rubric, and triage failing traces into an annotation queue.
What does Future AGI ship for the full playbook?
An eval stack package + a tracing layer + a hosted runtime. The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes, 13 guardrail backends including 9 open-weight, 8 sub-10ms Scanners, and four distributed runners. The Future AGI Platform is the deeper surface: self-improving evaluators tuned by thumbs up/down feedback, an in-product authoring agent that turns natural-language descriptions into rubrics, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed is part of this eval stack: HDBSCAN clustering plus a Sonnet 4.5 Judge writes the fix that feeds back into the self-improving evaluators. traceAI (Apache 2.0) ships 50+ AI surfaces across Python / TypeScript / Java / C# with a Spring Boot starter. agent-opt ships six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard). Agent Command Center is the SOC 2 Type II / HIPAA / GDPR / CCPA certified hosted runtime.