The 2026 LLM Evaluation Playbook
The pillar playbook for LLM evaluation in 2026: dataset, metrics, judge, CI gate, production observation, and the closed loop from failing trace back to regression test.
Table of Contents
The LLM evaluation question used to be “did the output look right.” In 2026 it’s a six-layer engineering problem: dataset, metrics, judge, CI gate, production observation, and the loop that closes between them. The teams shipping reliable agents in 2026 treat eval as core infrastructure, not a post-launch checklist. This playbook is the working pattern from the deployments we’ve watched ship and stay shipped.
TL;DR: the six layers
| Layer | Job | Failure if missing |
|---|---|---|
| Dataset | Inputs with expected behavior | Eval scores nothing useful |
| Metrics | Rubric definitions | Pass means whatever you want |
| Judge | Scoring engine | Inconsistent rubric application |
| CI gate | Block bad PRs | Regressions ship |
| Production observation | Score live traces | Drift invisible until users complain |
| Closed loop | Failures back into dataset | Same bugs ship twice |
If you only build three: dataset + judge + CI gate. The rest are how you keep the playbook honest over time.
Layer 1: the dataset
A versioned set of inputs paired with expected behavior. The shape depends on the task:
- Single-turn QA.
(input, expected_output, retrieval_context?, metadata). - Tool-using agent.
(input, expected_tool_calls, expected_final_state, retrieval_context?, metadata). - Conversational agent.
(system_prompt, turns, expected_outcome, persona, scenario, metadata).
Three rules that decide whether the dataset earns its keep:
- Sampled from production, not invented. A test set written by the test author at launch reflects the test author’s assumptions, not user behavior. Pull representative traces from prod and annotate them.
- Weighted toward failures. Most of the eval signal comes from the hardest 10% of inputs. Skew the dataset toward edge cases and rare intents; keep some happy-path coverage for sanity.
- Versioned and timestamped. Treat the dataset like prompt versions: tag releases, freeze datasets for active CI gates, and review additions in PR. Drift in the dataset can move scores measurably; tracking is the only way to know.
Start at 50-100 examples per route. Grow weekly by promoting failing production traces. Beyond a few hundred, judge cost becomes the dominant constraint and sampling beats more examples. The synthetic-test-data approach covers ways to scale dataset construction without losing signal.
Layer 2: the metrics
Pick a small set of metrics that map to failure modes you actually have. Three families:
Deterministic metrics — cheap, fast, exact:
- Exact match / structured-output schema. Did the response parse into the expected JSON shape?
- Tool-call success. Did the agent call the tools you expected with the arguments you expected?
- Citation validity. Does every cited span actually exist in the retrieval context?
- Length and format bounds. Within the allowed token budget? Conforming to format constraints?
Semantic metrics (LLM-as-judge) — expensive, slow, harder to fool:
- Faithfulness. Does the response only assert what the retrieval context supports?
- Groundedness. Are claims linked to source spans?
- Context precision and recall. Did retrieval surface the right chunks?
- Task completion. Did the agent fulfill the user’s request?
- Role adherence. Did the assistant stay in its declared persona?
Conversation and outcome metrics — for multi-turn agents:
- Conversation completeness. Did the dialogue reach the expected end state?
- Knowledge retention. Did the agent carry facts across turns?
- Outcome label. Did the user accomplish their goal (resolved, filed, booked, refunded)?
Most teams write five evals once and never touch them, then ship breaking changes for months because no one updates them. Future AGI’s eval stack is built so the rubric improves itself from production feedback instead of decaying. The ai-evaluation SDK (Apache 2.0) is the code-first surface for these rubric families — real Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]) API across EvalTemplate classes like Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, TaskCompletion, EvaluateFunctionCalling, with augment=True cascading from cheap heuristics into LLM-as-judge. The Future AGI Platform is where the in-product authoring agent lives (describe a rubric in natural language and the agent generates rubric + grading prompt + reference examples) and where self-improving evaluators retune from production thumbs up/down feedback.
Layer 3: the judge
The judge is the LLM or classifier that turns a rubric prompt plus a candidate response into a numeric score. Three rules:
- Your eval bill grows faster than your inference bill once you start LLM-as-judge at scale. Run small classifiers on every trace and reserve frontier models for adjudication. Future AGI’s Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2, which is what makes weekly full-dataset reruns the default rather than a budget conversation.
- Pin the judge model and rubric version. A floating judge model produces drifting scores. The judge version is part of the eval contract.
- Calibrate against human labels. Sample 50-100 traces per quarter, label by hand, compare to the judge. Track inter-rater reliability between judge and human as its own quality metric.
Bad rubrics are the most common eval bug we see. Symptoms: high variance across reruns, judge-human disagreement above 20%, or rubric scores that don’t correlate with user satisfaction. The fix is usually rubric clarification, not a fancier judge.
Layer 4: the CI gate
The gate runs the rubrics against the versioned dataset on every PR. Three knobs:
- Per-rubric thresholds. Fail the PR if any rubric drops more than 2 points from the baseline, or if absolute score falls below an agreed floor (0.75 for faithfulness, 0.85 for task completion are reasonable defaults to tune).
- Per-route scoping. A PR touching the support-bot prompt doesn’t need to rerun the full eval suite for the sales agent. Scope CI runs to the affected routes.
- Baseline tracking. Compare against the trailing 7-day rolling baseline, not a frozen number. Models drift; the baseline drifts with them; the gate catches regressions relative to the moving truth.
The gate produces an artifact: rubric scores per dataset entry, with diffs against the baseline. Engineers reviewing the PR can drill into failing examples and decide if the regression is real or a noisy judge.
Layer 5: production observation
Offline eval gates regressions; production eval catches drift. The pattern that works:
- Sample production traffic (uniformly or by failure signal) and score with the same rubric used in CI.
- Attach scores to the OTel span so the eval result lives next to the trace. traceAI (Apache 2.0) ships 50+ AI surfaces across Python / TypeScript / Java / C# — 46 Python packages, 39 TypeScript, 24 Java modules (Spring AI, Spring Boot starter, LangChain4j, Semantic Kernel), 1 C#. Pluggable semantic conventions at
register()time (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY) ingest cleanly into Phoenix or Traceloop without re-instrumenting. 62 built-in evals wire to span attributes viaEvalTagfor zero added latency; 14 span kinds (Phoenix has 8) includeA2A_CLIENT/A2A_SERVER; LangGraph topology surfaces node_count, conditional edges, and state diffs. - Alarm on rolling-mean drift. Per-route, per-rubric, per-prompt-version. A 2-5 point sustained drop over 15-60 minutes is the right detection threshold for most products.
- Triage failing traces to an annotation queue. A human (or an automated clusterer) decides whether the failure is a bug, a rubric problem, or expected.
Production observation is where the rubric meets the user. Drift between offline pass and online drop is a quality signal of its own; track it.
Layer 6: the closed loop
The loop is what makes the playbook compound. Without it, each incident produces a one-off fix and the team writes the same regression twice.
Two automation patterns:
- Auto-cluster failures inside the eval stack. Error Feed is part of Future AGI’s eval stack — it clusters every trace failure into a named issue using HDBSCAN soft-clustering over ClickHouse-stored embeddings (production v2; noise points with
prob >= 0.4get reassigned to the highest-probability cluster). A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span-tools:read_span,get_children,get_spans_by_typeacross 11 observation types,search_spans,submit_finding,submit_scores,submit_summary; Haiku Chauffeur sub-agent summarises spans over 3000 chars; 90% prompt-cache hit ratio) writes the RCA, evidence quotes, animmediate_fix, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5). Those fixes feed back into the platform’s self-improving evaluators — that’s the mechanism that closes the loop without you re-authoring the evaluators yourself. - Promote to dataset. From each named issue, the on-call engineer (or a scheduled job) promotes representative traces into the offline eval set with rubric labels. The next PR that touches the offending path has to clear the new entries.
The dataset ratchets stronger over time. The CI gate catches more regressions every quarter. The closed loop is what separates teams whose eval scores trend down from teams whose eval scores trend up.
Common mistakes
- Mistake: stop at offline. Offline pass is a necessary, not sufficient, condition. Real users find what the test author didn’t think of.
- Mistake: too many rubrics. Ten well-calibrated rubrics beat thirty noisy ones. Cut rubrics that don’t correlate with user complaints.
- Mistake: judge-as-marketing. Don’t pick the judge model that produces the highest scores; pick the one that agrees with human labels.
- Mistake: hand-written test set never updates. A static dataset stops being a regression suite the moment production drifts past it.
- Mistake: per-turn scoring on multi-turn products. Per-turn evals on a conversational agent produce false confidence. Add conversation-level metrics.
- Mistake: ignore tool calls. Response-only scoring misses agent failures whose root cause is a bad tool call or a bad retrieval. Score the trace itself, including the tool calls and the retrieval, alongside the response.
Three deliberate tradeoffs
- The playbook adds operational surface. Six layers is more moving parts than a single
pytestdirectory. The payoff is reliability that compounds; the dataset gets sharper, the judge stays calibrated, the CI gate catches more, the production rate of regressions trends down quarter over quarter. New deployments can ship with traceAI plus ai-evaluation alone and turn on the gateway, optimizer, and error feed when traffic justifies them. - Self-improving rubrics need calibration of their own. Rubrics that learn from production traces can drift in unexpected directions. Pin a human-labeled hold-out set; alarm when the judge disagrees with the hold-out by more than the inter-rater baseline.
- LLM-as-judge cost grows with traffic. Continuous online scoring on every trace is expensive at scale. Sample by failure signal, cache deterministic substrings, and reserve frontier models for adjudication. Future AGI’s Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2 for the scale-economics tradeoff.
How Future AGI wires the full playbook
Future AGI ships the eval stack as a package. Start with the SDK for custom code-defined evals. Graduate to the Platform when you want self-improving evals authored by an in-product agent.
- SDK (ai-evaluation, Apache 2.0): real
from fi.evals import Evaluator, Protect+Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(eval_templates=[Template()], inputs=[TestCase(...)]). 60+EvalTemplateclasses includingGroundedness,ContextAdherence,FactualAccuracy,Toxicity,PromptInjection,DataPrivacyCompliance,AnswerRefusal,IsHarmfulAdvice,TaskCompletion,EvaluateFunctionCalling. 13 guardrail backends (9 open-weight including LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API including OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY). 8 sub-10ms Scanners (JailbreakScanner,CodeInjectionScanner,SecretsScanner,MaliciousURLScanner,InvisibleCharScanner,LanguageScanner,TopicRestrictionScanner,RegexScanner). Four distributed runners (Celery, Ray, Temporal, Kubernetes).RailType.INPUT/OUTPUT/RETRIEVAL+AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. Multi-modalCustomLLMJudgevia LiteLLM.AutoEvalpipelines from natural-language description. - Platform (cloud / hosted Agent Command Center): self-improving evaluators (thumbs up/down or relabel retunes the rubric — richer than the SDK’s few-shot retrieval). In-product agent authors unlimited custom evaluators from natural-language descriptions. Classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Error Feed (inside the eval stack): HDBSCAN clustering over ClickHouse embeddings + Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools) writes the fix; those fixes feed back into the platform’s self-improving evaluators so your eval suite ages with your product. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.
- CI gate:
ai-evaluationplugs into pytest, GitHub Actions, GitLab CI, or any test runner. - Production observation: same rubrics run as span-attached scorers against live traces via traceAI (Apache 2.0) — 50+ AI surfaces across Python (46 packages) / TypeScript (39) / Java (24 modules including Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel) / C#; pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY); 14 span kinds; 62 built-in evals via
EvalTag. - agent-opt: six optimizers (
RandomSearchOptimizer,BayesianSearchOptimizerOptuna-backed,MetaPromptOptimizer,ProTeGi,GEPAOptimizer,PromptWizardOptimizer); teacher-inferred few-shot templates; resumable Optuna studies; unifiedEvaluatorover heuristics / LLM-judge / 60+ FAGI rubrics;EarlyStoppingConfigshared across all six. Eval-driven today; trace-stream ingestion (traceAI → datasetconnector) is the active roadmap item.
The Agent Command Center is the hosted runtime: a 17 MB Go binary, six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends (20+ providers total), 6 routing strategies + circuit breaker + shadow/mirror/race modes, 6 exact and 4 semantic cache backends (defaults: similarity 0.85, 256 dims, 50k LRU), 5-level hierarchical budgets (org/team/user/key/tag), MCP + A2A + realtime WebSocket + Assistants + threads + runs + vector stores + batch + files + video + OCR + rerank + responses + async + scheduled, native Anthropic /v1/messages and Gemini /v1beta, response headers x-agentcc-cost / x-agentcc-latency-ms / x-agentcc-model-used / x-agentcc-fallback-used / x-agentcc-routing-strategy / x-agentcc-cache / x-agentcc-guardrail-triggered. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, AWS Marketplace, multi-region hosted.
Related reading
Frequently asked questions
What does an LLM evaluation playbook cover in 2026?
What's different between 2024 and 2026 LLM evaluation?
How big should the eval dataset be?
LLM-as-judge versus deterministic metrics — which?
What's the right CI gate threshold?
How does production observation fit with offline eval?
What does Future AGI ship for the full playbook?
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.
Long-context support is marketing. Long-context fidelity is what you eval: NIAH at every position, lost-in-middle on your docs, attention-budget cost.
Celery, Ray, Temporal, and Kubernetes optimise for different things. Pick by your bottleneck, not by what's fashionable. The 2026 engineering decision guide.