
LLM Evaluation Architecture in 2026: The Three-Tier Stack That Scales

LLM evaluation architecture in 2026: heuristics on every span, distilled judges on a sample, humans on the gold-set. The three-tier stack that scales without breaking the budget.

Cover image: a wireframe three-tier pyramid on a black starfield, tiers labeled HUMAN (top), JUDGES (middle), HEURISTICS (base, with a soft halo), beside the headline LLM EVAL ARCHITECTURE.

A team running an LLM judge on every production span receives a surprise from finance: the judge bill is $87K this month, double the production model bill. The team runs the math: 30M production spans times $0.003 per judge call is roughly $90K. The fix is not “stop scoring.” The fix is the architecture: heuristics on every span, classifiers on every span where the rubric applies, distilled judges on a 10% sample plus 100% of errors, frontier judges on canary cohorts, humans on a gold-set. The result is similar practical coverage for many workloads at much lower judge spend.

This is what evaluation architecture is for. Single-tier evaluation works in a notebook. At production scale, the cost-coverage trade-off forces a layered design. This piece walks through the three-tier pyramid (with two refinements that make it five tiers in practice), the offline-online split, and the calibration discipline that holds the whole stack together.

TL;DR: Three tiers, two surfaces, one calibration loop

The standard architecture is a pyramid. The base is heuristics and schema validators that run on every span, microseconds per call, deterministic. The middle is distilled judges that run on a sample plus small classifiers that run where the rubric applies, both calibrated against a human gold-set. The top is human review on a small gold-set, the calibration anchor for the middle tier. The same evaluator stack runs against two surfaces: offline against a versioned test set on every CI run; online against a sample of production traces. Drift detection on rolling-mean rubric scores closes the loop.

If you only read one paragraph: cost-coverage is the binding constraint. Single-tier evaluation is unaffordable at scale or has gaps in coverage. The pyramid is what fits both.

Why evaluation architecture matters in 2026

Three forces converged.

First, judge cost is real. A frontier judge at $0.005 per call across 30M spans is $150K per month. Sampling reduces the cost but trades away coverage. Distilled judges (Galileo Luna, Future AGI Turing-flash, OpenAI o-mini-class judges) cut per-call cost 10-50x but still need to be combined with cheaper tiers to scale.

Second, regressions hide in different layers. A schema regression (the new prompt occasionally emits malformed JSON) is invisible to a judge that parses past structural noise. A tone regression is invisible to a schema validator. A class of failure visible only at human-judgment level (legal correctness on regulated workflows) is invisible to any algorithmic evaluator. Coverage requires multiple tiers.

Third, online and offline are different surfaces with the same evaluators. Offline CI gates catch regressions before deployment. Online drift detection catches drift after deployment. Both consume the same evaluator stack. Building the stack once and running it in both contexts cuts the maintenance cost roughly in half compared to maintaining two parallel systems.

The OTel GenAI semantic conventions are emerging as the cross-provider span attribute standard (still in Development status); tools like traceAI handle the instrumentation. The architecture sits on top of this layer, consuming spans and emitting scores.

The three-tier pyramid (refined to five in practice)

The pyramid has three conceptual tiers. In production it expands to five for operational reasons.

Figure: the three-tier evaluation pyramid. Top (smallest): HUMAN, 0.5% sampled, ground truth. Middle: JUDGES, 20% sampled, calibrated. Bottom (widest): HEURISTICS, 100% scored, deterministic. Latency budget and cost per call both decrease from top to bottom.

Tier 1: Heuristics and schema validators (every span)

The base of the pyramid. Heuristics and schema validators run on every output, inline on the request path, cost is microseconds per call.

Heuristics: regex denylists for banned phrases, length limits, exact-match against known correct outputs, presence-of-required-tokens, format conformance.

Schema validators: JSON parseability, Pydantic model validation, required field checks, type checks, enum bounds.

These catch a class of failure (structural, format, banned content) that downstream tiers miss because they parse past the structural noise. Failures here also short-circuit downstream tiers; no point sending a malformed JSON output to a judge.
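
A minimal sketch of what a Tier 1 check can look like, assuming a hypothetical SupportReply output schema; the field names, length limit, and regex denylist are illustrative, not a prescribed set.

```python
import re
from pydantic import BaseModel, ValidationError, field_validator

# Hypothetical output schema for illustration; field names are assumptions.
class SupportReply(BaseModel):
    answer: str
    category: str  # in a real system, constrained to a fixed enum
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence out of range")
        return v

BANNED = [re.compile(p, re.IGNORECASE)
          for p in (r"\bas an ai language model\b", r"\blorem ipsum\b")]
MAX_CHARS = 4000

def tier1_check(raw_output: str) -> dict:
    """Deterministic Tier 1 checks: schema validity, length, banned phrases.
    A failure here should short-circuit the downstream tiers."""
    checks = {
        "length_ok": len(raw_output) <= MAX_CHARS,
        "banned_phrase": any(p.search(raw_output) for p in BANNED),
        "schema_valid": True,
    }
    try:
        SupportReply.model_validate_json(raw_output)
    except ValidationError:
        checks["schema_valid"] = False
    checks["passed"] = checks["schema_valid"] and checks["length_ok"] and not checks["banned_phrase"]
    return checks
```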

Tier 2: Classifiers (every span where rubric applies)

Small fine-tuned models that handle fixed-label rubrics. Intent classification, refusal calibration, toxicity, PII detection, sentiment.

Classifiers run sub-millisecond, cost is fractions of a cent per call. They handle rubrics that fit a fixed label space and for which labeled training data exists.

The trade-off: classifiers require a labeled training set upfront. The cost is paid once at training time, then cheap per inference. They generalize within their label space and not beyond.
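
A sketch of a Tier 2 call using the Hugging Face pipeline API; the specific model and the truncation length are illustrative choices, and a production deployment would batch requests inside the evaluator runner.

```python
from transformers import pipeline

# Any small fine-tuned text classifier works here; this toxicity model is an
# illustrative choice, not a recommendation.
toxicity = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def tier2_scores(text: str) -> dict:
    """Return per-label probabilities for a fixed-label rubric (toxicity here)."""
    results = toxicity([text[:2000]])[0]  # one list of {label, score} per input
    return {r["label"]: round(r["score"], 4) for r in results}
```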

Tier 3: Distilled judges (sampled subset)

Small distilled judge models tuned for open-ended rubrics. Galileo Luna and Future AGI’s turing_flash are production-scale examples of distilled judge models. Latency and price depend on workload, region, and credit pricing; consult vendor docs for current numbers.

Distilled judges handle groundedness, helpfulness, refusal calibration, tool-call appropriateness, and rubrics where there is no fixed label space. Sample rate is 5-20% of production traffic, plus 100% of traces flagged by upstream tiers (Tier 1 or Tier 2 failures, errors, top-1% cost or latency).

This is the workhorse tier. It catches the largest set of regressions at sustainable cost.
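
The sampling rule is where the cost-coverage trade-off gets encoded. A minimal sketch, assuming traces arrive as dicts whose flag fields (tier1_failed, cost_p99_outlier, and so on) are hypothetical names set by upstream tiers:

```python
import hashlib

SAMPLE_RATE = 0.10  # uniform judge sample; tune against the cost budget

def should_judge(trace: dict) -> bool:
    """Send 100% of flagged traces and a deterministic ~10% of the rest to the judge."""
    if trace.get("tier1_failed") or trace.get("tier2_flagged") or trace.get("error"):
        return True
    if trace.get("latency_p99_outlier") or trace.get("cost_p99_outlier"):
        return True
    # Hash-based sampling keeps the decision stable across replays of the same trace.
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < SAMPLE_RATE
```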

Tier 4: Frontier judges (sparingly)

Frontier model judges (current-generation OpenAI, Anthropic Claude, and Google Gemini families) used in three contexts:

  • Calibration runs against the gold-set. Compare distilled-judge and frontier-judge scores against human labels.
  • Canary cohorts. Score the canary cohort with a frontier judge for higher fidelity before promoting to general traffic.
  • Audit and incident response. When a regression is suspected, replay the affected traces through a frontier judge for a higher-confidence verdict.

Frontier judges are expensive (10-50x distilled cost) and slow (1-2s per call). They are not a production-scale tier; they are a sanity-check tier.

Tier 5: Human gold-set (calibration anchor)

200-500 hand-labeled traces per workload, refreshed quarterly. Each trace is labeled by 2-3 humans against the rubric; inter-annotator agreement (Cohen’s kappa or Krippendorff’s alpha) is tracked.

The gold-set anchors the judge. The calibration check: run the distilled judge over the gold-set, compute agreement against humans, trigger an alert if agreement drops below threshold (kappa under 0.6 is a common bar). Recalibrate by adjusting the rubric prompt, swapping the judge model, or retraining the classifier.
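
A minimal sketch of that check with scikit-learn, assuming pass/fail labels encoded as 0/1; the 0.6 bar comes from the paragraph above, and the print statement stands in for a real alert.

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.6  # common bar cited above; teams set their own

def calibration_check(human_labels: list[int], judge_labels: list[int]) -> dict:
    """Compare judge verdicts against human gold-set labels (0 = fail, 1 = pass)."""
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {"kappa": round(kappa, 3), "calibrated": kappa >= KAPPA_THRESHOLD}

# Example: agreement on a 10-trace slice of the gold-set.
result = calibration_check([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                           [1, 1, 0, 1, 1, 1, 1, 0, 1, 0])
if not result["calibrated"]:
    print(f"kappa={result['kappa']}: trigger recalibration")  # placeholder for a real alert
```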

The gold-set also covers the rubrics no algorithm captures: legal correctness, medical safety, brand voice nuance.

Offline and online: the two surfaces

The evaluator stack runs against two surfaces.

Offline: CI gates against a versioned test set

The CI integration runs the eval suite on every PR that touches prompts, model config, retrieval, or tool definitions. The shape:

  1. Test set: versioned, hashed, dated. 500-1,500 prompts per workload, stratified across difficulty, intent, and risk tier.
  2. Run heuristics, schema, classifiers, judges in sequence.
  3. Compute per-rubric pass-rate vector.
  4. Compare against incumbent. Block merge on regression beyond threshold.

This is the pre-deployment quality gate. See Eval-Driven Development for the full CI workflow.
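
Step 4 is the part teams most often hand-roll. A sketch, assuming per-rubric pass-rate dicts for the candidate and the incumbent; the 2-point regression threshold is illustrative, not a recommendation.

```python
import sys

REGRESSION_THRESHOLD = 0.02  # max allowed per-rubric pass-rate drop (illustrative)

def gate(candidate: dict[str, float], incumbent: dict[str, float]) -> list[str]:
    """Return the rubrics where the candidate regresses beyond the threshold."""
    return [rubric for rubric, base in incumbent.items()
            if base - candidate.get(rubric, 0.0) > REGRESSION_THRESHOLD]

regressions = gate(
    candidate={"groundedness": 0.90, "schema_valid": 0.99, "refusal": 0.88},
    incumbent={"groundedness": 0.93, "schema_valid": 0.99, "refusal": 0.87},
)
if regressions:
    print(f"Blocking merge: regression on {regressions}")
    sys.exit(1)  # non-zero exit fails the CI job
```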

Online: drift detection on production spans

The same evaluator stack runs against sampled production traces. Each span carries the score vector as attributes. The drift detector watches rolling-mean rubric scores per route, per prompt version, per user cohort. Rolling-mean drops of 2-5% per rubric typically warrant investigation; 5%+ warrants a page.

Online evaluators run async; they do not block the request path. They do increase the cost of every scored span; sample rate is the lever for cost control.
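
A sketch of the rolling-mean comparison, assuming scores land in a pandas DataFrame with timestamp, rubric, and score columns; the window sizes are illustrative, and the 2% and 5% bands mirror the thresholds above.

```python
import pandas as pd

def drift_report(scores: pd.DataFrame, baseline_days: int = 14, window_days: int = 1) -> pd.DataFrame:
    """Compare the last day's mean score per rubric against a trailing baseline mean."""
    cutoff = scores["timestamp"].max() - pd.Timedelta(days=window_days)
    recent = scores[scores["timestamp"] > cutoff]
    baseline = scores[(scores["timestamp"] <= cutoff) &
                      (scores["timestamp"] > cutoff - pd.Timedelta(days=baseline_days))]
    report = pd.DataFrame({
        "baseline_mean": baseline.groupby("rubric")["score"].mean(),
        "recent_mean": recent.groupby("rubric")["score"].mean(),
    })
    report["rel_drop"] = 1 - report["recent_mean"] / report["baseline_mean"]
    # 2-5% drop: investigate; 5%+: page (bands from the text above).
    report["action"] = pd.cut(report["rel_drop"],
                              bins=[-float("inf"), 0.02, 0.05, float("inf")],
                              labels=["ok", "investigate", "page"])
    return report
```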

See LLM Tracing Best Practices for how scores attach to spans and Production LLM Monitoring Checklist for how drift alerts integrate with the broader monitoring stack.

How the calibration loop holds the stack together

Calibration is what keeps the architecture honest over time. Without it, judges drift, classifiers go stale, and dead heuristics pile up.

Monthly judge calibration

Run the distilled judge over the gold-set. Compute Cohen’s kappa against human labels. If kappa drops below 0.6 (or whichever threshold the team set), trigger recalibration:

  • Inspect the disagreement cases.
  • Tighten the rubric prompt.
  • Swap the judge model if the rubric is fine but the model has drifted.
  • Retrain a classifier with fresh labels.

Quarterly gold-set refresh

Pull 100-200 fresh production traces, label them, and replace the stalest gold-set entries. The gold-set should reflect the current workload distribution, not the launch distribution.

Annual taxonomy review

Are the rubrics still the right rubrics? Have new failure modes appeared that the existing rubric set does not cover? Add rubrics, retire stale ones, update the test set composition.

Production-failure mining (continuous)

Failed traces from production (errors, low scores, escalations) become eval candidates. Cluster them, label the cluster, promote representative traces into the test set. See What is Error Analysis for LLMs?.
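
One way to do the clustering step, assuming each failed trace carries a short free-text failure summary; the embedding model and cluster count are arbitrary illustrative choices, and the labeling happens after this.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failure_summaries: list[str], n_clusters: int = 8) -> dict[int, list[str]]:
    """Group similar failures so a human can label each cluster once, then
    promote a few representative traces per cluster into the test set."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    embeddings = model.encode(failure_summaries)
    k = min(n_clusters, len(failure_summaries))
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for label, summary in zip(labels, failure_summaries):
        clusters.setdefault(int(label), []).append(summary)
    return clusters
```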

Tools that fit each tier

Tier 1 (heuristics, schema): Pydantic, Zod, DeepEval built-ins, custom Python or TypeScript.

Tier 2 (classifiers): Hugging Face Transformers, OpenAI fine-tuning, custom encoders. Galileo Luna and Future AGI Turing-flash also serve some classifier-like rubrics.

Tier 3 (distilled judges): Galileo Luna, Future AGI Turing-flash, smaller OpenAI judge models, Anthropic Haiku as a judge. The distillation gap to frontier is small for most rubrics.

Tier 4 (frontier judges): Latest frontier OpenAI, Anthropic, and Google models. Used sparingly; expensive.

Tier 5 (human gold-set): Argilla, Label Studio, Future AGI annotation UI, SuperAnnotate, in-house labeling tools.

CI integration: DeepEval, Promptfoo, Future AGI SDK and CI integration, LangSmith Evals, Braintrust.

Online evaluator orchestration: Future AGI, Phoenix, LangSmith, Galileo, Braintrust.

The choice depends on which platform your traces already live on, where the prompt registry sits, and which CI you use. The architecture is platform-agnostic; the tools are interchangeable.

Common mistakes when designing evaluation architecture

  • Single-tier evaluation. A judge on every span is unaffordable; a heuristic on every span has gaps.
  • No calibration loop. Judges drift; without monthly calibration, the scores you trust today are wrong by next quarter.
  • No human gold-set. No anchor to detect drift against.
  • Same model judging itself. GPT-4 judging GPT-4 over-rewards GPT-4. Use a different family.
  • Inline judges on the request path. Adds judge latency to user-perceived latency. Run judges async on a sample.
  • Ignoring upstream-flag plumbing. Tier 1 failures should short-circuit downstream tiers. Otherwise the judge runs on traces it cannot meaningfully score.
  • No per-rubric tracking. A single overall score hides which rubric is regressing.
  • Test set frozen at launch. The workload moves on; the test set that does not is a museum.
  • Mining failures without curation. Production failures are noisy; not every failure is an eval candidate. Cluster, label, then promote.
  • Treating cost as a free variable. Judge cost can rival production cost. Budget per-tier sample rates against the cost target.

What changed in evaluation architecture in 2026

Date | Event | Why it matters
2026 | Distilled judges hit production scale | Tier 3 became affordable to run at higher sample rates
2026 | Span-attached eval scores increasingly implemented by observability platforms alongside OTel GenAI traces (conventions still in Development) | Online drift detection moved from custom glue to off-the-shelf
2026 | Production-failure mining tools matured | The continuous loop from failure to eval candidate became operational
2026 | Per-tenant rubric overrides became table stakes for B2B | Eval frameworks shipped tenant-scoped rubric configuration

How to actually build LLM evaluation architecture in 2026

  1. Build Tier 1 first. Heuristics and schema. These are cheap, fast, and catch a class of failure no other tier catches.
  2. Add Tier 5 (human gold-set) early. 200 traces, hand-labeled. This is the calibration anchor everything else depends on.
  3. Add Tier 3 (distilled judge). Pick a distilled judge model, calibrate it against the gold-set, verify kappa above threshold.
  4. Wire the offline CI gate. Run Tiers 1, 2, 3 against the test set on every PR. Block merges on regression.
  5. Wire the online evaluator. Run Tiers 1, 2, 3 on a sample of production traces; tag scores as span attributes (a minimal span-attribute sketch follows this list); build per-rubric per-version dashboards.
  6. Add drift detection. Rolling means per route, per version, per cohort; page on degradation.
  7. Schedule calibration cadence. Monthly judge recalibration; quarterly gold-set refresh.
  8. Add Tier 4 (frontier judges) for canary cohorts and incident response. Reserve for high-fidelity audits.
  9. Add Tier 2 (classifiers) for fixed-label rubrics that do not need a judge. Cheaper than judges where the label space is fixed.
  10. Continuous failure mining. Production failures cluster into eval candidates; promote, label, re-test.
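
For step 5, a minimal sketch of attaching scores as span attributes with the OpenTelemetry Python SDK; the eval.* attribute names are illustrative, not part of the OTel GenAI semantic conventions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("eval-runner")

def attach_scores(target_trace_id: str, scores: dict[str, float], prompt_version: str) -> None:
    """Emit eval scores as attributes on an async evaluator span.
    Runs off the request path; the scored production trace is referenced by id."""
    with tracer.start_as_current_span("llm.evaluation") as span:
        span.set_attribute("eval.target_trace_id", target_trace_id)
        span.set_attribute("eval.prompt_version", prompt_version)
        for rubric, score in scores.items():
            span.set_attribute(f"eval.score.{rubric}", score)

attach_scores("4bf92f3577b34da6a3ce929d0e0e4736",
              {"groundedness": 0.82, "helpfulness": 0.91},
              prompt_version="support-v14")
```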

Sources

Related: What is an LLM Evaluator?, Eval-Driven Development, LLM-as-Judge Best Practices, Production LLM Monitoring Checklist

Frequently asked questions

What is LLM evaluation architecture?
LLM evaluation architecture is the layered design that combines multiple evaluator types across offline and online surfaces to deliver coverage without unbounded cost. The standard pattern is a three-tier pyramid: heuristics and schema validators on every span (cheap, fast, deterministic), distilled judges on a sampled subset (medium cost, calibrated, broader coverage), human review on a small gold-set (expensive but high-fidelity calibration anchor). The architecture is the same for offline (CI gates) and online (production drift detection).
Why a pyramid and not a flat single-evaluator stack?
Because evaluators have different cost-coverage profiles. A single LLM judge running on every span produces good coverage but unbounded cost. A single heuristic running on every span is cheap but misses open-ended quality regressions. The pyramid lets each tier do what it is good at: heuristics filter the easy failures, judges score the hard cases, humans calibrate everything. The aggregate is better coverage at sustainable cost than any single tier alone.
What goes in the bottom tier (heuristics and schema)?
Deterministic checks that run on every output. Length limits, banned-phrase regex, JSON schema validation, Pydantic model checks, exact-match against expected outputs, presence-of-required-fields, format conformance. These run inline on the request path, cost is microseconds per call, and they catch a class of failure (malformed structured output, length violations, banned content) that judges often miss because the judge parses past structural noise.
What goes in the middle tier (judges and classifiers)?
Calibrated scorers that handle open-ended rubrics. Distilled small judges (Galileo Luna, Future AGI Turing-flash) for groundedness, refusal calibration, helpfulness, tool-call appropriateness. Small fine-tuned classifiers for fixed-label rubrics (intent, toxicity, sentiment). These run async or on a sampled subset of traffic. Judges run at 5-20% sample rate plus 100% on errors and outliers. Classifiers run on every span when the rubric is in scope.
What goes in the top tier (human)?
A small calibration set, refreshed quarterly. 200-500 hand-labeled traces per workload covering the full rubric space. The gold-set anchors the judge: when the judge agrees with humans on the gold-set above threshold (Cohen's kappa over 0.6 is a common bar), the judge is calibrated. When agreement drops, the rubric or the judge needs revision. Humans also handle subjective rubrics no algorithm captures (legal correctness, brand voice nuance).
How do offline and online evaluation differ in this architecture?
Offline (CI gates): the evaluator stack runs against a versioned test set on every PR, blocks merges on rubric regression. Online (production drift): the evaluator stack runs against a sample of production traces, scores attach as span attributes, drift detection alerts on rolling-mean degradation. Same evaluator types, different surfaces. The offline suite tests new prompts before they ship; the online evaluators detect drift after they ship.
How do I size the sample rate for the middle tier?
Two constraints. Cost: judge cost per span times traffic times sample rate must fit the budget. Coverage: the sample must be large enough to detect drift before it grows large. A reasonable starting point is 5-20% uniform sample plus 100% of traces with errors, schema violations, or upstream-tier flags. Reduce sample rate when costs grow; increase when drift detection is too slow. Target judge cost below 10-15% of production LLM cost.
What does an LLM evaluation architecture cost in operational complexity?
At minimum: a heuristic and schema library, an online evaluator runner, a versioned test set, a CI integration, a backend that stores per-span scores, drift detection on rolling means, and a human-annotation surface for the gold-set. The harder cost is calibration discipline: refreshing the gold-set quarterly, recalibrating judges monthly, retiring stale heuristics, and feeding labeled production failures back into the test set. Tools that automate the calibration loop save the most engineering time.