15 Common LLM Evaluation Mistakes Teams Make in 2026

The 15 LLM evaluation mistakes the Future AGI team sees in customer engagements, each with a vignette and the concrete primitive that prevents it.

The team has good intentions. They ship the agent, watch latency dashboards, and tell themselves they will add eval next sprint. Then a customer escalation lands at 11pm on a Tuesday, the trace looks fine in isolation, and there’s no rubric, no regression set, and no way to tell whether the model regressed yesterday or six weeks ago. This post is the field-notes version of that pattern. Fifteen mistakes the Future AGI team sees in customer engagements, each with a story-shaped vignette and the concrete primitive that prevents it.

The companion posts say what to do. The 2026 LLM evaluation playbook covers the six layers, the 21-item checklist covers the design and ship discipline, and the agent-passes-evals-fails-production post covers the offline-to-online gap. This one says what trips teams up. Both matter, and the failure-mode framing tends to stick with engineers in a way the abstract advice does not.

TL;DR: the 15 mistakes

| # | Mistake | One-step fix |
|---|---------|--------------|
| 1 | “We don’t need eval yet” | Wire one Groundedness rubric and one PR gate today |
| 2 | Single-metric eval | Stack five metric families minimum |
| 3 | No annotator agreement | Measure Cohen’s kappa before promoting any rubric |
| 4 | LLM-judge on every trace | Cascade through Scanners then classifier then judge |
| 5 | No augment=True cascade | One flag flip, 70 to 90 percent judge cost cut |
| 6 | Custom rubrics from scratch | Start from the 60-plus pre-built templates |
| 7 | Skipping the Scanner pre-filter | Eight sub-10 ms checks free at the door |
| 8 | No production-trace mining | Wire Error Feed clusters into the weekly review |
| 9 | No PR-gate eval | Block merge on rubric delta over five percent |
| 10 | Subjective eyeball review | Replace with rubric-on-dataset runs |
| 11 | No per-tenant chargeback | Set the five-level hierarchical budget |
| 12 | Trust self-reported confidence | Calibrate against held-out labels |
| 13 | Position bias in pairwise eval | Randomize order, average across orderings |
| 14 | Cache without freshness eval | Pair semantic cache with a freshness rubric |
| 15 | No incident-response playbook | Wire Error Feed to Linear, write the runbook |

The meta-mistake at the bottom is treating eval as a checkbox instead of load-bearing infrastructure. The teams that win build eval discipline like SRE or DevOps discipline, not like a one-time audit. The next 15 sections are the specific failures, with a fix for each.

Mistake 1: “We don’t need eval yet”

Every team learns this lesson the same way. The agent ships, traffic ramps, and a customer support escalation lands in Slack with a screenshot of the model confidently citing a fact that does not exist in the source documents. Someone asks “when did this start happening” and the room goes quiet. No regression set, no rubric, no historical scores. The team spends two days reading raw traces, finds nothing systematic, and ships a one-line prompt patch that may or may not fix the underlying issue. Three weeks later, a similar escalation lands and the same conversation happens.

The fix is one rubric on one route on day one. Wire Groundedness from the ai-evaluation SDK, run it on a 50-example regression set built from synthetic data plus a handful of real production traces, and block the next PR if the score drops more than five percent. The total work is half a day. The Future AGI ai-evaluation SDK ships free under Apache 2.0 and the templates import directly from fi.evals. See the build LLM evaluation framework from scratch walkthrough for the first-day setup.

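import os

from fi.evals import Evaluator, TestCase
from fi.evals.templates import Groundedness

# Credentials come from the environment; these variable names are assumptions.
API_KEY, SECRET = os.environ["FI_API_KEY"], os.environ["FI_SECRET_KEY"]
# Placeholder example: one query, the model's response, and the retrieved context.
q, r, ctx = "What is the refund window?", "30 days.", "Refunds are accepted within 30 days."

evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET)
# Score the response for groundedness against the supplied context.
result = evaluator.evaluate(
    eval_templates=[Groundedness()],
    inputs=[TestCase(query=q, response=r, context=ctx)],
)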

Mistake 2: single-metric eval

A team picks Faithfulness because it sounds important, runs it weekly, watches the score sit at 0.82, and ships every release because nothing trips the gate. Three months later they discover the agent is generating perfectly faithful answers to the wrong questions, or refusing 11 percent of valid requests, or calling tools in the wrong order. The single metric was scoring one slice of the failure surface and the other slices were invisible.

The fix is metric families, not one metric. Pick at least one from each of the five families: statistical, classifier-backed, LLM-judge, deterministic, and safety. For a RAG system that looks like Groundedness plus ContextRelevance plus Completeness plus AnswerRefusal plus PromptInjection. For a tool-using agent it looks like TaskCompletion plus LLMFunctionCalling plus a deterministic schema check plus Toxicity plus a jailbreak Scanner. The Future AGI ai-evaluation SDK exposes all five families from one package so the stacking is one import and one config block away. See deterministic vs LLM judge evals for the trade-offs across families.
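
A minimal sketch of what a stacked gate could look like, assuming each metric has already been scored and normalized to [0, 1]. The `MetricResult` type, the family labels, and `release_gate` are hypothetical names for illustration and are not part of the ai-evaluation SDK. The point is only that the gate fails when a family is missing, not just when a score is low.

```python
from dataclasses import dataclass

# Family labels are illustrative; map your own metrics onto them.
REQUIRED_FAMILIES = {"statistical", "classifier", "llm_judge", "deterministic", "safety"}


@dataclass(frozen=True)
class MetricResult:
    name: str         # metric name, e.g. "Groundedness"
    family: str       # one of REQUIRED_FAMILIES
    score: float      # normalized to [0, 1], higher is better
    threshold: float  # minimum acceptable score


def release_gate(results):
    """Return (passed, problems) for a list of MetricResult."""
    problems = []
    covered = {r.family for r in results}
    # A family with no metric is a blind spot, so it counts as a failure.
    for family in sorted(REQUIRED_FAMILIES - covered):
        problems.append(f"no metric from family: {family}")
    for r in results:
        if r.score < r.threshold:
            problems.append(f"{r.name} scored {r.score:.2f}, below {r.threshold:.2f}")
    return (not problems, problems)


# A single healthy-looking metric still fails the gate: four families are unscored.
passed, problems = release_gate([MetricResult("Faithfulness", "llm_judge", 0.82, 0.80)])
```

Under this sketch the 0.82 Faithfulness score from the vignette no longer ships a release on its own, because the gate reports the four families nobody is measuring.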

Mistake 3: no annotator agreement

The team labels 200 examples for a new Helpfulness rubric, ships the rubric, watches scores look reasonable, and three weeks later cannot reproduce any of their offline numbers in production. The investigation traces back to the labels themselves. Two annotators marking the same example with different labels means the rubric is under-specified, the threshold is wrong, or the task is genuinely ambiguous, and you cannot tell which until you measure inter-rater agreement.

The fix is Cohen’s kappa on a 50 to 100 example calibration set before the rubric ever enters a CI gate. Two annotators independently label, compute kappa, target above 0.7, and tighten the rubric prompt until the threshold is met. Below 0.5 means the rubric is genuinely ambiguous and needs to be split. The Future AGI Platform exposes annotation queues that surface label disagreement directly to the rubric author. The pattern pairs with the LLM-as-judge best practices guide on calibration cycles.
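
For reference, Cohen's kappa for two annotators can be computed in a few lines of standard-library Python. This is a sketch of the textbook formula (observed agreement corrected for chance agreement), with a hypothetical function name and made-up labels; it is not the Platform's implementation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of examples where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels: the annotators disagree on one of ten examples.
annotator_1 = ["pass"] * 6 + ["fail"] * 4
annotator_2 = ["pass"] * 5 + ["fail"] * 5
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here observed agreement is 0.9 and chance agreement is 0.5, so kappa comes out at about 0.8, which clears the 0.7 target.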

Mistake 4: LLM-judge on every trace

The team wires a judge-backed rubric on every production trace, watches the latency dashboard, and is confused that the agent feels slow. Then the bill arrives and the conversation gets tense. A judge-on-everything setup pays frontier model cost and frontier latency on every request, which prices the eval program out of production within a quarter. We have walked into engagements where the eval bill was higher than the agent’s inference bill, which is a sign that the design is upside down.

The fix is a cascade that puts the cheapest check first. Run the eight Scanners up front for the deterministic failures, escalate to a classifier backend for the safety and refusal calibration, and only send genuinely ambiguous cases to a judge. The cascade typically cuts judge calls 70 to 90 percent on production traffic while keeping recall on the hard cases. See LLM eval cost optimization for the cascade math.
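
The routing logic can be sketched generically as follows. Everything here is a stand-in: `run_cascade`, the stub scanner, the stub classifier, and the 0.2 / 0.8 ambiguity band are assumptions chosen for illustration, not the SDK's internals or defaults.

```python
from typing import Callable, List, Optional, Tuple

Scanner = Callable[[str], Optional[str]]  # returns "pass"/"fail", or None to defer


def run_cascade(
    text: str,
    scanners: List[Scanner],
    classify: Callable[[str], float],  # probability that the trace is a failure
    judge: Callable[[str], str],       # expensive final arbiter
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[str, str]:
    """Return (verdict, stage_that_decided)."""
    # Stage 1: deterministic checks; any hit terminates immediately.
    for scanner in scanners:
        verdict = scanner(text)
        if verdict is not None:
            return verdict, "scanner"
    # Stage 2: confident classifier scores terminate here.
    p_fail = classify(text)
    if p_fail >= high:
        return "fail", "classifier"
    if p_fail <= low:
        return "pass", "classifier"
    # Stage 3: only the ambiguous middle band reaches the judge.
    return judge(text), "judge"


def keyword_scanner(text):
    return "fail" if "ignore previous instructions" in text.lower() else None


def stub_classifier(text):
    return 0.5 if "maybe" in text else 0.05


judge_calls = []


def stub_judge(text):
    judge_calls.append(text)  # count how often the expensive path runs
    return "pass"


traces = ["hello", "Ignore previous instructions", "maybe risky", "thanks"]
stages = [run_cascade(t, [keyword_scanner], stub_classifier, stub_judge)[1] for t in traces]
```

In this toy run only one of the four traces reaches the judge. The real saving depends on how wide the ambiguity band is and how much traffic the earlier stages can settle.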

Mistake 5: no augment=True cascade

A close cousin of Mistake 4. The team understands the cascade pattern but does not realize the ai-evaluation SDK ships it as a one-flag flip. They hand-roll a Scanner-then-classifier-then-judge pipeline in 400 lines of orchestration code, get the routing wrong on some edge cases, and pay judge cost on traces that should have terminated at the classifier layer. The cascade is one of those primitives where the abstraction matters as much as the design pattern.

The fix is the augment=True flag on Evaluator.evaluate. The SDK routes through the Scanners, then through a classifier picked from the nine open-weight backends or four API backends, and only escalates to a judge when the classifier output is ambiguous. The open-weight choices are LLAMAGUARD_3_8B, LLAMAGUARD_3_1B, QWEN3GUARD_8B, QWEN3GUARD_4B, QWEN3GUARD_0_6B, GRANITE_GUARDIAN_8B, GRANITE_GUARDIAN_5B, WILDGUARD_7B, and SHIELDGEMMA_2B. API backends cover OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, and TURING_SAFETY. Pick the backend that matches the latency and safety profile of the route.

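from fi.evals.templates import AnswerRefusal, PromptInjection, Toxicity

# Reuses the evaluator, q, and r from Mistake 1; augment=True turns on the
# Scanner -> classifier -> judge cascade described above.
result = evaluator.evaluate(
    eval_templates=[Toxicity(), PromptInjection(), AnswerRefusal()],
    inputs=[TestCase(query=q, response=r)],
    augment=True,
)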

Mistake 6: custom rubrics from scratch when 60-plus templates exist

A team spends two sprints writing a custom rubric for groundedness, gets the prompt 80 percent right, and ships a noisier version of Groundedness that already exists in the SDK. The custom-judge surface is there for the long tail of genuinely novel failure modes. For everything else, the pre-built template is calibrated, threshold-tuned, and ships with a Scanner fallback. Reinventing it loses two sprints of engineering and produces a weaker result.

The fix is to start from the 60-plus pre-built templates and only write custom when no template matches. The shapes that ship today cover most production failure modes: Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, NoHarmfulTherapeuticGuidance, TaskCompletion, and LLMFunctionCalling. See custom LLM eval metrics best practices for the decision tree on when custom is the right move.

Mistake 7: skipping the Scanner pre-filter

The team treats every eval as a judge call. Then the bill question lands. Sub-10 ms deterministic checks are the cheapest, fastest, most reliable layer of an eval stack, and skipping them sends jailbreak attempts, code-injection payloads, leaked secrets, and invisible-character attacks through the judge layer for no reason. The judge sees a trace that any deterministic pattern matcher would catch in microseconds and bills a frontier model call for the privilege.

The fix is the eight Scanners as the first eval layer. JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, and RegexScanner cover the deterministic failure surface. They run sub-10 ms, ship free under Apache 2.0, and require no judge calls. Wire them before the classifier and judge layers in the cascade so the deterministic failures terminate before they touch the expensive paths. See the open-source ai-evaluation library guide for the Scanner reference.
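
To make "deterministic check" concrete, here is a sketch of two pattern-based checks in plain Python. It is not the SDK's Scanner code: the `scan` function, the character list, and the single secret pattern (the common `AKIA` prefix format for AWS access key IDs) are illustrative assumptions, and a production scanner would cover far more cases.

```python
import re

# Zero-width and byte-order-mark characters often used to hide payloads.
INVISIBLE_CHARS = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# AWS access key IDs follow a fixed prefix-plus-16-characters shape.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan(text):
    """Return the names of the deterministic checks that fired."""
    findings = []
    if INVISIBLE_CHARS.search(text):
        findings.append("invisible_char")
    if AWS_ACCESS_KEY_ID.search(text):
        findings.append("secret")
    return findings


clean = scan("What is your refund policy?")
hidden = scan("What is your re\u200bfund policy?")
leaked = scan("my key is AKIAIOSFODNN7EXAMPLE thanks")
```

Checks of this shape need no model call, which is why they belong in front of the classifier and judge layers.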

Mistake 8: no production-trace mining

The team builds a golden set in week one, runs it on every PR for six months, and watches scores slowly stop predicting production behavior. The golden set is calibrated to the world it was built in. The product moves, users move, the model moves, and the golden set silently goes out of distribution. Without a feedback loop, the regression set scores 0.91 and production scores 0.62, and nobody understands the gap.

The fix is mining the production trace stream weekly. The Future AGI Error Feed runs HDBSCAN soft-clustering over traces in ClickHouse, surfaces clusters of similar failures, and a Sonnet 4.5 Judge writes a candidate immediate_fix per cluster. Promote the failing traces into the regression set, retune the rubric thresholds against the new data, and the offline-to-online gap closes within two refresh cycles. See LLM observability best practices for the trace-mining setup.
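
The promotion step can be sketched as below, assuming clustering has already happened upstream and each failing trace carries a cluster id. The field names (`query`, `cluster_id`, `source`), the per-cluster cap, and the lowercase-and-strip deduplication are hypothetical choices, not the Error Feed schema.

```python
from collections import defaultdict


def promote_failures(regression_set, failing_traces, per_cluster=3):
    """Return a new regression set with up to per_cluster examples added per cluster."""
    # Normalize queries so near-identical text is not promoted twice.
    seen = {ex["query"].strip().lower() for ex in regression_set}
    taken = defaultdict(int)
    promoted = []
    for trace in failing_traces:
        key = trace["query"].strip().lower()
        cluster = trace["cluster_id"]
        if key in seen or taken[cluster] >= per_cluster:
            continue
        seen.add(key)
        taken[cluster] += 1
        promoted.append({"query": trace["query"], "source": f"cluster-{cluster}"})
    return regression_set + promoted


regression_set = [{"query": "How do I reset my password?", "source": "seed"}]
failing_traces = [
    {"query": "how do i reset my password? ", "cluster_id": 1},  # already covered
    {"query": "Cancel my plan", "cluster_id": 1},
    {"query": "Cancel subscription", "cluster_id": 1},           # over the cap
    {"query": "Refund status", "cluster_id": 2},
]
refreshed = promote_failures(regression_set, failing_traces, per_cluster=1)
```

Capping per cluster keeps one noisy failure mode from dominating the regression set. The right cap depends on how large the set is allowed to grow each week.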

Mistake 9: no PR-gate eval

The team runs eval weekly, manually, on a notebook nobody owns. Regressions ship Monday morning, get caught by the Friday review, and a hotfix lands the following Tuesday. Eight days of bad behavior in production for every regression. The signal exists but the gate that uses it does not, so the eval program is technically running and operationally useless.

The fix is a PR gate that runs the regression set on every pull request and blocks the merge if any rubric drops more than five percent against the trunk baseline. Surface the per-rubric delta in the PR check so the reviewer can see which rubric moved. Pair with shadow routing in production for changes that pass CI but need real-traffic validation. See CI/CD LLM eval GitHub Actions for the working recipe and CI/CD for AI agents best practices for the broader pattern.
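
The comparison at the heart of the gate is small. The sketch below assumes "five percent" means a relative drop against the trunk score and that both runs produce a dict of rubric name to mean score; the function names and scores are made up for illustration.

```python
def rubric_regressions(baseline, candidate, max_drop=0.05):
    """Return {rubric: relative_drop} for rubrics that fell by more than max_drop."""
    failures = {}
    for rubric, base in baseline.items():
        if rubric not in candidate:
            failures[rubric] = 1.0  # a rubric that stopped reporting counts as a regression
            continue
        if base <= 0:
            continue  # nothing to regress from
        drop = (base - candidate[rubric]) / base
        if drop > max_drop:
            failures[rubric] = drop
    return failures


def gate_exit_code(baseline, candidate, max_drop=0.05):
    """Print the per-rubric delta and return a process exit code for CI."""
    failures = rubric_regressions(baseline, candidate, max_drop)
    for rubric, drop in sorted(failures.items()):
        print(f"FAIL {rubric}: dropped {drop:.1%} vs trunk")
    return 1 if failures else 0


trunk = {"Groundedness": 0.90, "Completeness": 0.80}
pull_request = {"Groundedness": 0.84, "Completeness": 0.79}
failures = rubric_regressions(trunk, pull_request)
```

A CI step would pass the result of `gate_exit_code` to `sys.exit` so that a non-zero code blocks the merge.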

Mistake 10: subjective eyeball review

The team reviews traces in a Notion doc on Friday afternoons. Five engineers stare at 30 responses each, vote on quality, and update the prompt based on the consensus. The pattern works for the first two weeks and collapses as the product scales. By trace 100, the engineers are tired, by trace 500 the consensus is noise, and by trace 1,000 nobody opens the doc.

The fix is rubric-on-dataset runs that replace eyeball review with measurable, reproducible scores. Eyeball review still has a place for the long-tail edge cases where the rubric does not fire cleanly, but it is the last 5 percent of the eval surface, not the first 95. The Future AGI Platform exposes annotation queues that pull only the ambiguous traces for human review, which is the right shape: humans on the hardest cases, rubrics on the rest. See the agentic AI evaluation guide for the human-in-the-loop pattern.

Mistake 11: no per-tenant chargeback

The team runs a multi-tenant LLM app. The total bill arrives, finance asks which customer is driving the spike, and the answer is a 40-minute trip through traces with grep. Without per-tenant attribution, finance cannot tell which team or customer is driving the bill, budget conversations stall, and the cost number sits unowned. Cost without attribution is the worst kind of cost.

The fix is the Agent Command Center five-level hierarchical budget structure covering org, team, user, key, and tag. Every gateway call rolls up correctly through the hierarchy, the x-prism-cost header on the response makes the spend visible at the call site, and the chargeback dashboard surfaces per-tenant burn rates and projections. See AI agent cost optimization observability for the budget enforcement patterns and LLM cost tracking best practices for the broader cost discipline.
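
As an illustration of what "rolls up through the hierarchy" means, the sketch below sums cost at every level of an org > team > user > key > tag path. It assumes the per-call cost has already been parsed into a float and that the five levels nest in that order; the record shape and `roll_up` are hypothetical, not the gateway's data model.

```python
from collections import defaultdict

LEVELS = ("org", "team", "user", "key", "tag")


def roll_up(calls):
    """Sum cost at every prefix of the org > team > user > key > tag path."""
    totals = defaultdict(float)
    for call in calls:
        path = tuple(call[level] for level in LEVELS)
        # Credit the call to its org, its org/team, and so on down to the tag.
        for depth in range(1, len(LEVELS) + 1):
            totals[path[:depth]] += call["cost_usd"]
    return dict(totals)


calls = [
    {"org": "acme", "team": "support", "user": "u1", "key": "k1", "tag": "chat", "cost_usd": 0.25},
    {"org": "acme", "team": "support", "user": "u2", "key": "k2", "tag": "chat", "cost_usd": 0.50},
    {"org": "acme", "team": "search", "user": "u3", "key": "k3", "tag": "rag", "cost_usd": 1.00},
]
totals = roll_up(calls)
```

With totals keyed by path, the "which customer is driving the spike" question becomes a lookup instead of a grep through traces.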

Mistake 12: trust self-reported confidence

The team writes a rubric that asks the model “how confident are you in this answer,” uses the response as a routing signal, and discovers six weeks later that the model returns 0.95 confidence on the answers that are most wrong. LLMs are systematically over-confident on the failures that matter most, especially on confident-sounding hallucinations. The self-reported number is correlated with the wrong thing.

The fix is calibration against held-out labels. Treat any confidence signal as a feature to be calibrated, not as a ground-truth score. Run the model’s confidence against held-out human labels, fit a calibration curve, and only use the calibrated score for routing decisions. Or skip the self-reported confidence entirely and route on the classifier-backed score from a Guard model, which is trained on labeled examples and does not depend on the underlying model’s self-knowledge. See evaluating LLM judge bias mitigation for the calibration math.
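
One simple way to fit such a curve is histogram binning: bucket the raw confidences, then replace each bucket with the accuracy actually observed on held-out labels. The sketch below uses that technique with hypothetical function names and toy data; other calibration methods would serve the same purpose.

```python
def fit_binned_calibration(confidences, correct, n_bins=10):
    """Return per-bin observed accuracy (None for bins with no data)."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        hits[b] += int(ok)
        counts[b] += 1
    return [h / c if c else None for h, c in zip(hits, counts)]


def calibrate(conf, table):
    """Map a raw confidence to the accuracy observed in its bin."""
    b = min(int(conf * len(table)), len(table) - 1)
    # Fall back to the raw value when the bin had no held-out examples.
    return table[b] if table[b] is not None else conf


# Toy held-out set: the model is wrong on three of four "0.95" answers.
raw_confidence = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
human_label_correct = [True, False, False, False, True, True]
table = fit_binned_calibration(raw_confidence, human_label_correct)
routing_score = calibrate(0.95, table)
```

In this toy data the self-reported 0.95 maps to an observed accuracy of 0.25, which is the number a router should act on.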

Mistake 13: position bias in pairwise eval

The team runs A/B prompt evaluation as a pairwise judge: “given prompt A response and prompt B response, which is better.” Six weeks of results consistently favor whichever option appears first in the prompt, even when the underlying preference distribution is roughly 50/50. The judge has a position bias and the eval is reporting a winner that does not exist in the data.

The fix is to randomize order on every comparison and average across both orderings, or skip pairwise framing entirely and score each response independently against the rubric. The independent-scoring approach is usually the right move because it scales better than pairwise and removes the position-bias surface entirely. The Future AGI Platform exposes both surfaces; the choice depends on whether you are asking “is response X above the threshold” (independent scoring) or “is prompt A better than prompt B in the abstract” (pairwise). See A/B testing LLM prompts best practices for the full bias-mitigation guide.
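
A sketch of the both-orderings approach, assuming a judge callable that takes the two responses in presentation order and answers "first" or "second". The `debiased_preference` name and both stub judges are hypothetical. Because each pair is scored in both orders, no separate randomization step is needed here.

```python
def debiased_preference(response_a, response_b, judge):
    """Return A's win rate in [0, 1], averaged over both presentation orders."""
    a_wins = 0
    if judge(response_a, response_b) == "first":   # A shown first
        a_wins += 1
    if judge(response_b, response_a) == "second":  # A shown second
        a_wins += 1
    return a_wins / 2


def always_first(first, second):
    """A judge with pure position bias."""
    return "first"


def prefers_longer(first, second):
    """A judge with a real, order-independent preference."""
    return "first" if len(first) >= len(second) else "second"


biased_score = debiased_preference("answer A", "answer B", always_first)
real_score = debiased_preference("a much longer answer", "short", prefers_longer)
```

A purely position-biased judge collapses to 0.5, a tie, while a judge with a genuine preference keeps its signal in both orders.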

Mistake 14: cache without freshness eval

The team wires semantic caching on the gateway to cut cost, watches the cache hit rate climb to 40 percent, and ships. Three months later a customer complaint surfaces a stale answer that was correct at cache-write time and wrong at cache-read time. The cache is serving truth past its useful life and nobody is checking. Caching without a freshness rubric trades cost savings for silent staleness.

The fix is pairing the semantic cache with a freshness rubric that scores the cached response against the current ground truth on a sample of cache hits. The Agent Command Center exposes six exact-match cache backends and four semantic cache backends with cross-tenant tag namespacing, and the freshness rubric runs on a sampled subset of hits to catch the cases where the underlying answer has moved. The x-prism-cache-hit header on the response makes the cache decision visible at the call site so the freshness sampler can pull the right traces. See AI gateway best practices for the cache-tier breakdown.
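
The sampling loop can be sketched as follows. It assumes cache hits are available as records with a query and the cached response, and it uses exact string comparison against a freshly fetched answer as a crude stand-in for a real freshness rubric. The function and field names are hypothetical.

```python
import random


def stale_rate(cache_hits, fetch_current, sample_rate=0.1, seed=0):
    """Return (fraction of sampled hits that are stale, sample size)."""
    rng = random.Random(seed)  # seeded so a given audit run is reproducible
    sampled = [hit for hit in cache_hits if rng.random() < sample_rate]
    if not sampled:
        return None, 0
    stale = sum(
        1 for hit in sampled
        if fetch_current(hit["query"]) != hit["cached_response"]
    )
    return stale / len(sampled), len(sampled)


cache_hits = [
    {"query": "price", "cached_response": "$10"},  # price has since changed
    {"query": "hours", "cached_response": "9-5"},
]
current_answers = {"price": "$12", "hours": "9-5"}
rate, sample_size = stale_rate(cache_hits, current_answers.get, sample_rate=1.0)
```

Tracking this rate over time gives a signal for when cache entries need a shorter lifetime or an invalidation rule.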

Mistake 15: no incident-response playbook

A Sev 1 lands at 2am. The on-call engineer has no runbook, no clustering of similar incidents, no triage matrix, and no idea whether this is a model regression, a prompt change, a data drift, or a tool-call failure. The next four hours are improvised from first principles. The fix lands at 6am, the post-mortem is written at noon, and a similar incident lands two weeks later with the same improvisation.

The fix is the four-step playbook: detect, triage, mitigate, back-test. Detection lands through Error Feed clusters on the live trace stream. Triage attaches per-cluster severity from the four-dimensional trace scoring. Mitigation routes through the gateway via guardrail tuning, model fallback, or prompt rollback. Back-test promotes the failing traces into the regression set so the same bug does not ship twice. Error Feed wires to Linear today via OAuth so the incident lands as a triaged issue in the team queue automatically. See LLM incident response playbook for the full runbook.

The meta-mistake

All 15 mistakes share a shape. The team treats eval as a checkbox: do it once, ship the rubric, move on. The teams that ship reliable LLM systems in 2026 build eval discipline like SRE or DevOps discipline, not like a one-time audit. That means versioned regression sets, PR-gated rubrics, production trace mining, weekly refresh cadences, on-call rotations for eval failures, and an incident playbook that lands when something breaks.

The cost of building this discipline up front is one week of focused work. The cost of not building it is six months of incidents, escalations, and 11pm Slack messages. The math is not subtle. The teams that build the discipline early ship faster after month two because the regression gate catches what would have been a Sev 1 and the rubric tells them which slice of the failure surface moved. The teams that skip it ship slower after month two because every release is a hand-rolled risk assessment.

The antidote pattern: one week to prevent six months

A working sequence for a team starting from zero.

Day one: install the ai-evaluation SDK, pick the three or four pre-built templates that match the route (Mistake 6), wire augment=True so the cascade routes through Scanners then classifier then judge (Mistakes 4 and 5), and run a 50-example regression set on the current trunk.

Day two: add the eight Scanners as the first eval layer (Mistake 7), measure Cohen’s kappa on a 50-example calibration set with two annotators (Mistake 3), and tune the rubric thresholds.

Day three: wire a PR-gate GitHub Action that runs the regression set on every PR and blocks the merge on a five percent drop (Mistake 9). Replace any eyeball review processes with rubric-on-dataset runs (Mistake 10).

Day four: set the five-level hierarchical budget structure in the Agent Command Center so finance can see per-tenant burn (Mistake 11). Add the x-prism-cost header to the trace metadata so cost lands on every trace.

Day five: wire Error Feed into Linear via OAuth (Mistake 15). Pair the semantic cache with a freshness rubric on sampled hits (Mistake 14). Add a calibrated confidence score that uses a held-out label set, not self-reported confidence (Mistake 12).

The result is a working eval program that catches regressions before they ship, attributes cost correctly, and lands incidents as triaged issues automatically. The remaining work is the weekly refresh cadence and the long tail of custom rubrics, both of which scale linearly with the product.

Honest framing: what ships today versus the roadmap

Eval-driven prompt optimization on a regression set ships today via the agent-opt library. Six optimizers are in the box: RandomSearchOptimizer for baselines, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot and resumable runs), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. Pair any optimizer with EarlyStoppingConfig so the run terminates when the rubric plateaus instead of burning judge budget on marginal gains. The pattern: a failing rubric on a regression set becomes an optimization target, the optimizer searches the prompt space, and the winning prompt ships.
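
The plateau idea behind early stopping can be shown with a generic patience rule. This sketch is not the `EarlyStoppingConfig` API; `should_stop`, `patience`, and `min_delta` are hypothetical names for a standard technique.

```python
def should_stop(scores, patience=3, min_delta=0.01):
    """True when the last `patience` rounds did not beat the earlier best by min_delta."""
    if len(scores) <= patience:
        return False  # not enough history to call a plateau
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta


plateaued = should_stop([0.60, 0.70, 0.75, 0.752, 0.751, 0.753])
still_improving = should_stop([0.60, 0.70, 0.80, 0.85])
```

The first run gained less than 0.01 over three rounds and stops. The second is still climbing and continues.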

The direct connector that streams trace failures from traceAI into an agent-opt dataset automatically is on the roadmap, not in the box. The reason to call this out: pretending the closed loop is one click hides the operational work of building and curating the dataset that feeds the optimizer, which is where the actual eval program lives. Build the dataset first; the optimizer is the part that bolts on once the dataset earns its keep.

Error Feed wires to Linear today via OAuth. Slack, GitHub, Jira, and PagerDuty connectors are on the roadmap, not in the box. The Linear integration covers the on-call queue use case for engineering teams that have standardized on Linear. Other integrations are in active development. The honest framing matters because vendor pitch decks tend to claim every integration; the reality is that Linear ships today and the rest are coming.

See the agent-opt webinar recap and optimizing LLM experimentation best practices for the optimization pattern in depth.

Putting it together

The 15 mistakes are not exotic. Most teams hit most of them in the first six months, usually in roughly this order: shipping without eval (Mistake 1), single metric (Mistake 2), judge-on-everything (Mistake 4), no PR gate (Mistake 9), eyeball review (Mistake 10), no chargeback (Mistake 11), no incident playbook (Mistake 15). The fixes are straightforward and the SDK ships the primitives that prevent each one.

The Future AGI ai-evaluation SDK covers Mistakes 2 through 7 directly. The Agent Command Center covers Mistake 11 (chargeback) and Mistake 14 (cache with freshness). Error Feed covers Mistake 8 (trace mining) and Mistake 15 (incident response). The agent-opt library covers the optimization loop once the dataset is curated. The practice is portable across vendors; the surfaces save the months. For the wider pattern this post plugs into, see the 2026 LLM evaluation playbook, the 21-item best practices checklist, and agent observability vs evaluation vs benchmarking.

Frequently asked questions

What is the single most expensive LLM evaluation mistake teams make?
Sending every trace through an LLM judge without a cascade. A naive judge-on-everything setup pays frontier model cost and frontier latency on every request, which prices the program out of production within a quarter. The fix is the augment=True cascade in the Future AGI ai-evaluation SDK, which routes each trace through eight sub-10 ms Scanners first, then a classifier backend like LLAMAGUARD_3_8B or QWEN3GUARD_4B, and only escalates genuinely ambiguous cases to a judge. Teams that wire the cascade on day one cut judge calls 70 to 90 percent and keep recall on the hard cases.
Why is annotator agreement load-bearing for LLM evaluation?
Without Cohen's kappa or a similar inter-rater agreement check, your golden labels are noise. Two annotators marking the same example with different labels means the rubric is under-specified, the threshold is wrong, or the task is genuinely ambiguous, and you cannot tell which until you measure agreement. Target kappa above 0.7 on a 50 to 100 example calibration set before the rubric goes into a CI gate. The Future AGI Platform exposes annotation queues that surface disagreement directly, so the rubric author can tighten the spec before promoting the rubric to the regression set.
What's wrong with custom rubrics from scratch?
Most teams who build rubrics from scratch end up with weaker, noisier versions of rubrics that already ship in the ai-evaluation SDK. The 60-plus EvalTemplate classes including Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, NoHarmfulTherapeuticGuidance, TaskCompletion, and LLMFunctionCalling cover the common shapes. Start with the pre-built template, tune the threshold against your data, and only write a fully custom rubric when none of the templates match the failure mode. The custom-judge surface is there for the long tail, not the head.
How do I prevent regressions from shipping silently?
Wire a PR gate that runs the regression set on every pull request and blocks the merge if any rubric drops more than five percent against the trunk baseline. Surface the per-rubric delta in the PR check. Without a gate, regressions ship between weekly review cadences and show up as user complaints two weeks later. The pattern pairs with shadow routing in production for changes that pass CI but need real-traffic validation before full rollout. See the GitHub Actions recipe in the CI/CD eval guide.
Why is per-tenant chargeback important for LLM spend?
Without per-tenant attribution, finance cannot tell which team or customer is driving the LLM bill, and budget conversations stall. The Agent Command Center exposes a five-level hierarchical budget structure covering org, team, user, key, and tag, so every gateway call rolls up correctly. The x-prism-cost header on every response makes the spend visible at the call site. Cost without attribution is a number that nobody owns, which is the worst kind of cost.
How does position bias affect pairwise LLM judges?
An LLM judge presented with a pairwise comparison tends to prefer whichever option appears first, often by a measurable margin. Without randomization, this position bias systematically skews the judge in favor of one side and the eval reports a winner that does not exist in the underlying preference distribution. The fix is to randomize order on every comparison and average across both orderings, or to score each response independently against the rubric and avoid pairwise framing entirely. The Future AGI Platform exposes both surfaces; pick the framing that matches the question you are asking.
What does an incident-response playbook for LLM systems look like?
A working playbook covers detection, triage, mitigation, root cause, and back-test. Detection runs through Error Feed clustering on the trace stream. Triage attaches per-cluster severity from the four-dimensional trace scoring. Mitigation routes through the gateway via guardrail tuning or model fallback. Root cause lands as an Error Feed cluster with an immediate_fix written by a Sonnet 4.5 Judge. Back-test promotes the failing traces into the regression set so the same bug does not ship twice. Error Feed wires to Linear today via OAuth; the rest of the integrations are on the roadmap.