External Evaluation Pipelines for LLM Apps in 2026
External eval pipelines need four properties: async, idempotent, observable, recoverable. The working blueprint with FAGI distributed runners and real code.
Table of Contents
The eval that catches a regression at 3 am is almost never the one that ran in CI. CI ran twelve hours earlier against a frozen dataset. The regression sits inside production traffic the dataset doesn’t cover, waiting for a worker pool that scores live traces, attaches results to the trace tree, and pages on-call when the rolling mean drops. That worker pool is what this post is about.
Inline eval runs in the request path on a cheap rubric with a sub-100 ms budget. External eval is everywhere else: scoring sampled production traffic, replaying adversarial corpora overnight, watching for drift, feeding the failure-fix loop. It runs out-of-process, on its own worker pool, against its own queue, with its own observability. The two surfaces share rubric definitions and almost nothing else.
The opinion this post earns: external eval pipelines are async, idempotent, observable, and recoverable. The four properties. A pipeline missing any of them loses data, double-scores, hides failures, or can’t restart cleanly. Most setups in production are async and partially observable; they skip idempotency until the judge bill doubles after a Kafka rebalance, skip recoverability until a worker-pool restart drops 18 minutes of traces. The four are what makes external eval ship.
This guide is the working blueprint: how external eval differs from inline, the four properties per failure mode, the queue + worker + result store + observability architecture, the FAGI distributed-runner adapters (Celery / Ray / Temporal / Kubernetes), and where the Platform closes the loop. Code shaped against the ai-evaluation SDK and traceAI.
TL;DR: the four properties
| Property | Failure if missing | Where it lives |
|---|---|---|
| Async | Eval blocks the user; latency spikes on every score | Queue between app and worker; non-blocking trace emission |
| Idempotent | Same trace scored twice; judge bill doubles; dashboards split | Dedup key on the queue; pre-call store check; upsert writes |
| Observable | Missing scores invisible; debugging by guessing | Per-job metadata (trace_id, rubric_version, judge_model, status, timestamps) plus stage-level metrics |
| Recoverable | Worker restart drops minutes of traces; manual reconciliation | Durable queue, checkpointed offsets, dead-letter handling, replay tooling |
Most pipelines nail one or two. Shipping all four is the difference between “we have async eval” and “external eval is core infrastructure.”
Why external eval is a different job from inline
Inline eval (the LLM guardrails surface) is one rubric, one trace, one verdict, in the request path. Hard latency, cheap judge, fail-closed on safety-critical rubrics. Hard problems: picking the rubric stack and not blowing the latency budget. Codebase: the app itself.
External eval is unbounded trace streams, dozens of rubrics, frontier-judge ensembles, drift detection across thousands of traces per route. No inline latency, expensive judges allowed, every trace eventually scored, no double-billing, scores attached back to the trace tree. Hard problems: queue mechanics, worker reliability, idempotency under retry, observability of the pipeline itself. Codebase: a separate service.
Three failure modes show up only in external eval:
- The double-score. A worker pulls a trace, calls the judge, crashes before writing the result. The queue redelivers, a second worker scores the same trace, the judge bill doubles, the result store has two rows. Dashboards split. Engineers argue which number is real. Fix: idempotency keys end to end, not retry logic.
- The silent gap. A worker pool restart drops in-flight messages. Traces from 02:14 to 02:32 never get scored. The rolling mean smooths over the hole; three weeks later an audit finds it. Fix: durable queue offsets and explicit replay tooling, not “we hope Kafka does the right thing.”
- The runaway judge. A new rubric times out on 30 percent of traces. The worker retries each one three times. The judge bill triples overnight. There’s no dashboard for “judge cost per rubric per route” because pipeline observability was an afterthought. Fix: per-stage metrics on the eval pipeline itself, treated like any other production service.
CI eval doesn’t see any of these. They’re distributed-systems failures, and external eval is a distributed system. Treat it like one.
The four properties in detail
Async
Async means two things. First, the application never waits for the evaluator; it emits a trace and the request returns. Second, the trace emission itself is non-blocking; if the queue is down, the app degrades to log-and-forget, not stall.
In code this looks like a thin emit on a queue rather than a synchronous score call:
# inside the app, NOT inside the worker
from fi_instrumentation import register
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType
# Register at startup. traceAI writes to OTel; eval tags carry the rubric spec.
register(
project_type=ProjectType.OBSERVE,
project_name="checkout-agent",
eval_tags=[
EvalTag(
eval_name=EvalName.GROUNDEDNESS,
type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.LLM,
config={},
mapping={"context": "retrieval.documents", "output": "output.value"},
),
],
)
# The app keeps doing what apps do. traceAI buffers spans and ships them async.
# An external worker pool reads from the trace stream and runs the eval.
The shape that matters: the app calls register() once at startup and the rest of the hot path is unchanged. Spans land on the trace exporter, get batched, get shipped. The evaluator lives elsewhere. The observability vs evaluation vs benchmarking post covers where the two surfaces fit alongside each other.
Async also means picking the back-pressure mode: drop on full queue for cheap-classifier lanes that score 100 percent of traffic, persist-locally-ship-later for the LLM-judge lane that can tolerate minutes of delay but not data loss. Block-with-timeout is the third option and the wrong choice on any user-facing path. The decision belongs in the trace exporter config, not the worker pool.
Idempotent
Idempotency is the property nobody puts in the v1 design and everybody adds in v2 after the judge bill doubles. The cleanest place to enforce it is at the score key:
score_key = (trace_id, rubric_template_version, judge_model_version)
Three layers cooperate:
- Queue dedup. Messages carry the score key as a deduplication header. Kafka exactly-once semantics, NATS JetStream’s dedup window, or a Redis SET gate all work; pick what the stack already runs.
- Pre-call check. Before the worker calls the judge, it checks the result store for the score key. A row with status
succeededtriggers a skip-and-exit; a row with statuspendingorfailedbacks off or claims the work depending on retry policy. - Upsert writes. The result-store write is
INSERT ... ON CONFLICT (trace_id, rubric, version) DO UPDATE, notINSERT. Two workers that race past the pre-call check land one row, not two.
# inside the worker
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance, FactualAccuracy
from fi.testcases import TestCase
RUBRICS = {
"groundedness": (Groundedness(), "v3"),
"context_relevance": (ContextRelevance(), "v2"),
"factual_accuracy": (FactualAccuracy(), "v3"),
}
JUDGE_MODEL = "claude-sonnet-4-5-20250930"
def score_trace(trace, results_db):
cases_to_run = []
for rubric_name, (template, rubric_version) in RUBRICS.items():
key = (trace.id, rubric_name, rubric_version, JUDGE_MODEL)
if results_db.get(key, status="succeeded"):
continue # idempotent skip
cases_to_run.append((rubric_name, template, key))
if not cases_to_run:
return # nothing to do; previous run scored this trace
evaluator = Evaluator() # FI_API_KEY / FI_SECRET_KEY from env
test_case = TestCase(input=trace.input, output=trace.output, context=trace.context)
results = evaluator.evaluate(
eval_templates=[t for _, t, _ in cases_to_run],
inputs=[test_case],
)
for (rubric_name, _, key), result in zip(cases_to_run, results.eval_results):
results_db.upsert(key=key, score=result.metrics[0].value, status="succeeded")
The pattern reads boring on purpose. Boring is what idempotency looks like in production. The cost is one read against the result store before each judge call; a Redis cache makes it microseconds. The savings are the judge bill not doubling on the next Kafka rebalance.
Observable
Observable means the pipeline answers three questions without anyone SSH-ing into a worker:
- Where is trace X? Pending, in-flight, scored, failed, dead-lettered. With timestamps.
- Why is the rolling mean for rubric Y dropping? Per-rubric error rate, judge-model error rate, sample-mix shift across routes, percentile shifts on the score itself.
- What does the pipeline cost? Judge calls and cost per rubric per route, total spend per day, deviation from forecast.
Three layers carry the metadata:
# every score-store row carries the run metadata
row = {
"trace_id": "tr_01HQB8...",
"rubric": "groundedness",
"rubric_version": "v3",
"judge_model": "claude-sonnet-4-5-20250930",
"score": 0.86,
"status": "succeeded",
"started_at": "2026-05-19T02:14:31.119Z",
"finished_at": "2026-05-19T02:14:33.812Z",
"queue_lag_ms": 412,
"judge_latency_ms": 2693,
"judge_cost_usd": 0.041,
"worker_id": "eval-worker-7",
"queue_partition": 12,
}
Roll those rows into a columnar store (ClickHouse, BigQuery, Snowflake) and a dashboard answers the three questions above with SQL. Per-stage Prometheus metrics on the worker pool (queue depth, in-flight jobs, judge-call duration, error rate by model) catch operational failures; the columnar store catches quality and cost failures.
Span-attached scoring is the other observability axis. EvalTag writes the score back to the OTel trace asynchronously, so on-call doesn’t join two systems to read a failing trace:
# config-level; the worker writes to the same OTel exporter the app uses
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName
eval_tags = [
EvalTag(
eval_name=EvalName.GROUNDEDNESS,
type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.LLM,
config={},
mapping={"context": "retrieval.documents", "output": "output.value"},
),
EvalTag(
eval_name=EvalName.CONTEXT_RELEVANCE,
type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.RETRIEVER,
config={},
mapping={"input": "input.value", "context": "retrieval.documents"},
),
]
A failing trace surfaces with its rubric scores next to its latency, chunk IDs, and tool calls. traceAI ships 50+ AI surfaces across Python, TypeScript, Java, and C# with pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY).
Recoverable
Recoverable means a worker pool restart picks up where it left off and a deploy rolls forward without an operator opening a runbook. Three habits cover most of the surface:
- Durable offsets. Kafka consumer offsets, NATS JetStream state, Temporal workflow state. Commit the offset after the result-store write succeeds, not before. A crash between read and write replays the message; idempotency catches the double.
- Dead-letter handling. Messages that fail more than N retries land in a DLQ with the failure reason, rubric, judge model, and original payload. Engineers drain on a cadence; a depth spike fires a page.
- Replay tooling. A CLI that takes a time range or trace-id list and re-enqueues the matching traces. Without it, every gap turns into a custom script.
# replay CLI shape (Click or Typer; trim to taste)
import click
from datetime import datetime
from your_pipeline import enqueue_trace, trace_store
@click.command()
@click.option("--from", "from_ts", required=True, type=click.DateTime())
@click.option("--to", "to_ts", required=True, type=click.DateTime())
@click.option("--route", default=None)
@click.option("--rubric", default=None, multiple=True)
def replay(from_ts: datetime, to_ts: datetime, route: str | None, rubric: tuple[str, ...]):
"""Re-enqueue traces in [from, to) for re-evaluation. Idempotent at the queue."""
for trace in trace_store.iter_range(from_ts, to_ts, route=route):
enqueue_trace(trace.id, rubrics=list(rubric) or None)
click.echo(f"Re-enqueued traces from {from_ts} to {to_ts}")
if __name__ == "__main__":
replay()
The replay tool exists because eventually it has to. Build it on day one with one route and one rubric; extend as the pipeline matures. Otherwise the first incident that needs a replay becomes an outage post-mortem about the missing runbook.
Temporal makes recoverability cheap because it ships durable workflow state out of the box; a workflow that fails mid-sweep resumes from the last activity. For long-running red-team replays or multi-step composite evals, Temporal earns its operational cost.
The architecture
The pieces and the wires between them, end to end:
| Stage | Component | What it owns |
|---|---|---|
| 1. Trace emission | App + traceAI | Non-blocking span export; sampling decision; semantic conventions |
| 2. Trace stream | Kafka / NATS / Redis Streams / hosted queue | Durable buffer; dedup window; partitioning by trace_id |
| 3. Worker pool | Celery / Ray / Temporal / K8s Jobs + ai-evaluation | Idempotent score; judge calls; retries; backoff |
| 4. Result store | OTel trace (hot) + columnar store (warm) | Span-attached scores; aggregate rows; upsert semantics |
| 5. Observability | Prometheus + dashboard + the columnar store | Per-stage metrics; missing-score detection; cost dashboards |
| 6. Loop | Error Feed + self-improving evaluators | Cluster failures; write immediate_fix; retune rubric thresholds |
Three workload patterns layer on the same architecture:
- Sampled live scoring. 5-10 percent of production traffic routed to the eval queue. LLM-judge ensemble per trace. Drift detection on rolling means per route per rubric.
- Batch sweep. Cron, Temporal schedule, or k8s CronJob replays the previous 24 hours at higher coverage than live sampling. Catches what live sampling smoothed over.
- Replay sweep. Adversarial corpora, golden datasets, and red-team prompt suites injected into the same queue with a tagged source so dashboards filter them out of production drift signal. Same workers, same idempotency, same observability.
The gateway shapes the live-scoring lane. The Agent Command Center supports three routing modes for external-eval traffic:
- Shadow. Primary version serves the user; a parallel call to a candidate version is mirrored to the eval queue. Compares candidate quality before promotion.
- Mirror. The eval lane gets a copy of production traffic at a configured sample rate; the user never sees the eval call. The pattern for sampled live scoring.
- Race. First non-error response wins to the user; the slower one feeds the eval queue. Useful when judges are slow and the latency budget is tight.
Gateway headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, x-prism-routing-strategy, x-prism-guardrail-triggered) ride on the trace, so every score-store row joins quality to cost without a separate stitch. That join closes the cost-versus-quality loop.
Pick a distributed runner
The ai-evaluation SDK ships four runner adapters. The choice is operational, not capability:
- Celery. Shortest on-ramp for Python shops on Redis or RabbitMQ. Mature ecosystem, well-understood failure modes. Right default for teams that already operate Celery.
- Ray. Wins on compute-heavy multi-judge ensembles where intra-job parallelism matters. Actor model fits “evaluator that holds a model in memory and serves scores fast.” Cost: a Ray cluster.
- Temporal. Right backbone for long-running workflows that need durable retries and replay: red-team sweeps with hour-long activities, multi-step composite evals, scheduled batch sweeps that survive a worker restart mid-run. Cost: the Temporal service.
- Kubernetes Jobs and CronJobs. Native pod-based fan-out, no extra runtime. Right for teams already running k8s. Cost: YAML.
A Celery-backed worker for sampled live scoring:
# eval_worker.py
from celery import Celery
from fi.evals import Evaluator
from fi.evals.templates import (
Groundedness, ContextRelevance, FactualAccuracy, TaskCompletion,
)
from fi.testcases import TestCase
app = Celery("eval", broker="redis://redis:6379/0")
evaluator = Evaluator(max_workers=8)
@app.task(bind=True, max_retries=3, default_retry_delay=30, acks_late=True)
def score_trace(self, trace_payload: dict, score_keys: list[tuple]) -> dict:
# Idempotency check: drop any score_keys already in the result store.
pending = [k for k in score_keys if not results_db.exists(k, status="succeeded")]
if not pending:
return {"status": "skipped", "reason": "already_scored"}
try:
test_case = TestCase(**trace_payload)
result = evaluator.evaluate(
eval_templates=[rubric_for(k) for k in pending],
inputs=[test_case],
)
for key, metric in zip(pending, result.eval_results[0].metrics):
results_db.upsert(key=key, score=metric.value, status="succeeded")
return {"status": "scored", "n_rubrics": len(pending)}
except Exception as exc:
for key in pending:
results_db.upsert(key=key, status="failed", error=str(exc))
raise self.retry(exc=exc)
The shape that matters: acks_late=True so the worker only acknowledges after the write commits; max_retries=3 with backoff; an idempotency check before the judge call; an upsert on the result store; explicit failure rows so observability sees failures instead of silence.
Swap the score_trace body into a Ray actor, a Temporal activity, or a k8s Job entrypoint. The eval logic doesn’t change; the runner does.
A worked example: the live-scoring lane
Three files. App emission, eval-service worker, dashboard query. Enough to ship something that holds all four properties.
# 1) In the app: emit traces. traceAI handles the queue exporter.
from fi_instrumentation import register
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType
register(
project_type=ProjectType.OBSERVE,
project_name="support-rag",
eval_tags=[
EvalTag(eval_name=EvalName.GROUNDEDNESS, type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.LLM, config={},
mapping={"context": "retrieval.documents", "output": "output.value"}),
EvalTag(eval_name=EvalName.CONTEXT_RELEVANCE, type=EvalTagType.OBSERVATION_SPAN,
value=EvalSpanKind.RETRIEVER, config={},
mapping={"input": "input.value", "context": "retrieval.documents"}),
],
)
# App keeps doing app things. No inline eval, no blocking judge calls.
# 2) In the eval service: Celery worker, idempotent, observable, recoverable.
import os, time
from celery import Celery
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance
from fi.testcases import TestCase
app = Celery("eval", broker=os.environ["EVAL_QUEUE_URL"])
evaluator = Evaluator(max_workers=8)
RUBRICS = {"groundedness": Groundedness(), "context_relevance": ContextRelevance()}
RUBRIC_VERSION = "v3"
JUDGE_MODEL = os.environ["JUDGE_MODEL"] # pinned
@app.task(bind=True, max_retries=3, default_retry_delay=15, acks_late=True)
def score(self, trace: dict):
keys = [(trace["id"], r, RUBRIC_VERSION, JUDGE_MODEL) for r in RUBRICS]
pending = [(name, RUBRICS[name], k) for name, k in zip(RUBRICS, keys)
if not results_db.exists(k, status="succeeded")]
if not pending:
metrics.skipped.inc()
return
started = time.time()
try:
result = evaluator.evaluate(
eval_templates=[t for _, t, _ in pending],
inputs=[TestCase(input=trace["input"], output=trace["output"],
context=trace["context"])],
)
for (name, _, key), metric in zip(pending, result.eval_results[0].metrics):
results_db.upsert(key=key, score=metric.value, status="succeeded",
judge_latency_ms=int((time.time() - started) * 1000))
metrics.scored.inc(len(pending))
except Exception as exc:
for _, _, key in pending:
results_db.upsert(key=key, status="failed", error=str(exc))
metrics.failed.inc(len(pending))
raise self.retry(exc=exc)
-- 3) In the columnar store: drift detection on rolling means per route per rubric.
WITH per_minute AS (
SELECT route, rubric,
toStartOfMinute(finished_at) AS bucket,
avg(score) AS mean_score,
count() AS n
FROM eval_scores
WHERE finished_at >= now() - INTERVAL 6 HOUR
AND status = 'succeeded'
GROUP BY route, rubric, bucket
)
SELECT route, rubric, bucket, mean_score,
avg(mean_score) OVER w AS rolling_mean_60m,
mean_score - avg(mean_score) OVER w AS delta
FROM per_minute
WINDOW w AS (PARTITION BY route, rubric ORDER BY bucket
ROWS BETWEEN 60 PRECEDING AND 1 PRECEDING)
ORDER BY route, rubric, bucket;
That triple is the working spine. App emits. Worker scores idempotently. Dashboard reads the rolling mean and pages on sustained drift. Recoverability: Celery acks_late plus durable Redis. Observability: explicit failure rows plus Prometheus counters. Idempotency: score-key dedup plus upsert.
The Future AGI external-eval surface
The ai-evaluation SDK and the Future AGI Platform together cover the external-eval stack without gluing five vendors together.
- ai-evaluation SDK (Apache 2.0) —
Evaluator(...).evaluate(eval_templates=[...], inputs=[TestCase(...)])as the core entry point. 60+EvalTemplateclasses coveringGroundedness,ContextAdherence,ContextRelevance,Completeness,ChunkAttribution,ChunkUtilization,FactualAccuracy,Toxicity,PromptInjection,DataPrivacyCompliance,AnswerRefusal,IsHarmfulAdvice,TaskCompletion,LLMFunctionCalling, and more. Four distributed runner adapters (Celery, Ray, Temporal, Kubernetes). Eight sub-10 msScannerclasses for inline work that complements external pipelines. - Span-attached scoring via
EvalTag. External jobs write scores back to the OTel trace tree async, so eval and trace live in the same place when on-call needs to walk a regression. TheEvalTag/EvalSpanKind/EvalNamesurface is configured atregister()time. - Prebuilt domain pipelines via
AutoEvalPipeline. Templates forcustomer_support,rag_system,code_assistant,content_moderation,agent_workflow,healthcare, andfinancial. Each is a curated rubric set, dataset shape, and recommended judge config; a team starting from zero ships a working pipeline in an afternoon, not a week. - Composite orchestration via
EvalTemplateManager.create_composite,submit_composite,execute_composite,upload_ground_truth, andrun_playgroundcover multi-step eval workflows. Chain retrieval into generation, gate the next step on the first step’s score, run playground variants against ground truth. Temporal is the right runner here because the workflow is the unit. - traceAI (Apache 2.0) — 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds including a first-class
RETRIEVER; 62 built-in evals wire server-side viaEvalTagwith zero inline latency. - Agent Command Center gateway. Shadow, mirror, and race modes feed a configured fraction of production traffic to the eval queue without affecting user-facing latency. Per-response headers (
x-prism-cost,x-prism-latency-ms,x-prism-model-used,x-prism-fallback-used,x-prism-routing-strategy,x-prism-guardrail-triggered) enrich every score with operational context. 17 MB Go binary, self-hosts in your VPC, 20+ providers via six native adapters, SOC 2 Type II, HIPAA, GDPR, and CCPA certified. - Error Feed inside the eval stack. HDBSCAN soft-clustering groups failures into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) reads each cluster and writes the RCA, evidence quotes, an
immediate_fix, and a four-dimensional score. Fixes feed the Platform’s self-improving evaluators. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap. - agent-opt (Apache 2.0) — six optimizers (
RandomSearchOptimizer,BayesianSearchOptimizer,MetaPromptOptimizer,ProTeGi,GEPAOptimizer,PromptWizardOptimizer) consume a curated dataset and a scoring function and iterate prompts against the regression set. The trace-stream-to-agent-opt connector that auto-builds an optimizer dataset from failing clusters is on the roadmap; eval-driven optimization on curated regression sets ships today.
The pieces are independent. Drop ai-evaluation plus a Celery worker into your stack this afternoon; bring traceAI, Error Feed, the gateway, and the Platform online as the pipeline matures.
Ready to ship an external eval pipeline that doesn’t lose data? Run pip install ai-evaluation, register traceAI in the app, point a Celery worker at the trace queue, set FI_API_KEY and FI_SECRET_KEY. Your next production regression has a place to land.
Common anti-patterns
- Async without idempotency. The pipeline is non-blocking and the judge bill doubles after the first Kafka rebalance. Fix: three-layer idempotency (queue dedup, pre-call check, upsert write), not retry tuning.
- Idempotency without observability. The pipeline doesn’t double-score; it also doesn’t tell you when 18 percent of traces silently fail and never retry. Fix: explicit failure rows plus per-stage metrics, treated as production telemetry.
- Observability without recoverability. A dashboard shows the gap from a worker restart; nobody can fill it. Fix: durable offsets plus a replay CLI built on day one.
- Synchronous span-attached scoring. Someone calls the judge inline “for convenience” and the request budget blows up. Fix: the queue, every time. Inline rubrics belong to the guardrails surface.
- One queue for everything. Cheap classifier scoring and frontier-judge ensembles share a queue; the slow lane head-of-lines the fast lane. Fix: two queues with different SLAs, or priority partitioning.
- Pinned judge that never gets pinned. “We use the latest Sonnet” is not a judge version. Fix: put the judge model and version in the score key and refuse to score with an unpinned judge.
- Result store as a write-only log. Every score appends a row; reads are slow; idempotency leans on application logic. Fix: upsert semantics with a unique constraint on
(trace_id, rubric, version).
Where this gets you a quarter in
A team running the four-property external eval stack for a quarter has these properties:
- Every production trace has a score on every rubric within minutes of landing, with no inline latency to the user.
- A worker pool restart drops zero traces; replay covers the rare gap.
- The judge bill is tracked per rubric per route per day, with deviation alerts when a new rubric eats more than its budget.
- Drift fires a page within an observation window calibrated to the rubric’s noise floor, not hours after complaints.
- Failure clusters from the Error Feed turn into Linear issues with a candidate
immediate_fixalready attached. - The same rubric definition runs in inline guardrails, in nightly batch sweeps, in red-team replays, and in the live-scoring lane. Engineers debug regressions against one trace tree, not five surfaces.
External eval is a distributed system. The four properties are how to keep it from behaving like an undergraduate one. Teams shipping reliable LLM apps in 2026 treat the pipeline like core infrastructure, the way they treat their primary message bus. That is the bar. The blueprint above is how to clear it.
Related reading
- How to Evaluate RAG Applications in CI/CD Pipelines (2026)
- The 2026 LLM Evaluation Playbook
- CI/CD for AI Agents Best Practices (2026)
- Agent Observability vs Evaluation vs Benchmarking (2026)
- Best AI Gateways for Canary Model Rollouts (2026)
- Best AI Drift Detection Tools (2026)
- Red-Teaming LLMs Step by Step (2026)
- The Self-Improving AI Agent Pipeline
Frequently asked questions
What is an external evaluation pipeline for an LLM app?
How is external eval different from inline CI eval?
What are the four properties an external eval pipeline needs?
What does the architecture look like end to end?
Which distributed runner should I pick?
Where does idempotency actually live in the pipeline?
How does Future AGI ship external eval?
Celery, Ray, Temporal, and Kubernetes optimise for different things. Pick by your bottleneck, not by what's fashionable. The 2026 engineering decision guide.
Temporal turns agent workflows into replayable state machines. The eval that matches: per-activity correctness, workflow outcome, retry budget enforcement, signal-handler correctness. Replay is the superpower.
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.