Guides

External Evaluation Pipelines for LLM Apps in 2026

External eval pipelines need four properties: async, idempotent, observable, recoverable. The working blueprint with FAGI distributed runners and real code.

·
Updated
·
14 min read
llm-evaluation external-eval ai-gateway agent-evaluation distributed-systems celery temporal 2026
Editorial cover image for External Evaluation Pipelines for LLM Apps in 2026
Table of Contents

The eval that catches a regression at 3 am is almost never the one that ran in CI. CI ran twelve hours earlier against a frozen dataset. The regression sits inside production traffic the dataset doesn’t cover, waiting for a worker pool that scores live traces, attaches results to the trace tree, and pages on-call when the rolling mean drops. That worker pool is what this post is about.

Inline eval runs in the request path on a cheap rubric with a sub-100 ms budget. External eval is everywhere else: scoring sampled production traffic, replaying adversarial corpora overnight, watching for drift, feeding the failure-fix loop. It runs out-of-process, on its own worker pool, against its own queue, with its own observability. The two surfaces share rubric definitions and almost nothing else.

The opinion this post earns: external eval pipelines are async, idempotent, observable, and recoverable. The four properties. A pipeline missing any of them loses data, double-scores, hides failures, or can’t restart cleanly. Most setups in production are async and partially observable; they skip idempotency until the judge bill doubles after a Kafka rebalance, skip recoverability until a worker-pool restart drops 18 minutes of traces. The four are what makes external eval ship.

This guide is the working blueprint: how external eval differs from inline, the four properties per failure mode, the queue + worker + result store + observability architecture, the FAGI distributed-runner adapters (Celery / Ray / Temporal / Kubernetes), and where the Platform closes the loop. Code shaped against the ai-evaluation SDK and traceAI.

TL;DR: the four properties

PropertyFailure if missingWhere it lives
AsyncEval blocks the user; latency spikes on every scoreQueue between app and worker; non-blocking trace emission
IdempotentSame trace scored twice; judge bill doubles; dashboards splitDedup key on the queue; pre-call store check; upsert writes
ObservableMissing scores invisible; debugging by guessingPer-job metadata (trace_id, rubric_version, judge_model, status, timestamps) plus stage-level metrics
RecoverableWorker restart drops minutes of traces; manual reconciliationDurable queue, checkpointed offsets, dead-letter handling, replay tooling

Most pipelines nail one or two. Shipping all four is the difference between “we have async eval” and “external eval is core infrastructure.”

Why external eval is a different job from inline

Inline eval (the LLM guardrails surface) is one rubric, one trace, one verdict, in the request path. Hard latency, cheap judge, fail-closed on safety-critical rubrics. Hard problems: picking the rubric stack and not blowing the latency budget. Codebase: the app itself.

External eval is unbounded trace streams, dozens of rubrics, frontier-judge ensembles, drift detection across thousands of traces per route. No inline latency, expensive judges allowed, every trace eventually scored, no double-billing, scores attached back to the trace tree. Hard problems: queue mechanics, worker reliability, idempotency under retry, observability of the pipeline itself. Codebase: a separate service.

Three failure modes show up only in external eval:

  • The double-score. A worker pulls a trace, calls the judge, crashes before writing the result. The queue redelivers, a second worker scores the same trace, the judge bill doubles, the result store has two rows. Dashboards split. Engineers argue which number is real. Fix: idempotency keys end to end, not retry logic.
  • The silent gap. A worker pool restart drops in-flight messages. Traces from 02:14 to 02:32 never get scored. The rolling mean smooths over the hole; three weeks later an audit finds it. Fix: durable queue offsets and explicit replay tooling, not “we hope Kafka does the right thing.”
  • The runaway judge. A new rubric times out on 30 percent of traces. The worker retries each one three times. The judge bill triples overnight. There’s no dashboard for “judge cost per rubric per route” because pipeline observability was an afterthought. Fix: per-stage metrics on the eval pipeline itself, treated like any other production service.

CI eval doesn’t see any of these. They’re distributed-systems failures, and external eval is a distributed system. Treat it like one.

The four properties in detail

Async

Async means two things. First, the application never waits for the evaluator; it emits a trace and the request returns. Second, the trace emission itself is non-blocking; if the queue is down, the app degrades to log-and-forget, not stall.

In code this looks like a thin emit on a queue rather than a synchronous score call:

# inside the app, NOT inside the worker
from fi_instrumentation import register
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType

# Register at startup. traceAI writes to OTel; eval tags carry the rubric spec.
register(
    project_type=ProjectType.OBSERVE,
    project_name="checkout-agent",
    eval_tags=[
        EvalTag(
            eval_name=EvalName.GROUNDEDNESS,
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={},
            mapping={"context": "retrieval.documents", "output": "output.value"},
        ),
    ],
)
# The app keeps doing what apps do. traceAI buffers spans and ships them async.
# An external worker pool reads from the trace stream and runs the eval.

The shape that matters: the app calls register() once at startup and the rest of the hot path is unchanged. Spans land on the trace exporter, get batched, get shipped. The evaluator lives elsewhere. The observability vs evaluation vs benchmarking post covers where the two surfaces fit alongside each other.

Async also means picking the back-pressure mode: drop on full queue for cheap-classifier lanes that score 100 percent of traffic, persist-locally-ship-later for the LLM-judge lane that can tolerate minutes of delay but not data loss. Block-with-timeout is the third option and the wrong choice on any user-facing path. The decision belongs in the trace exporter config, not the worker pool.

Idempotent

Idempotency is the property nobody puts in the v1 design and everybody adds in v2 after the judge bill doubles. The cleanest place to enforce it is at the score key:

score_key = (trace_id, rubric_template_version, judge_model_version)

Three layers cooperate:

  1. Queue dedup. Messages carry the score key as a deduplication header. Kafka exactly-once semantics, NATS JetStream’s dedup window, or a Redis SET gate all work; pick what the stack already runs.
  2. Pre-call check. Before the worker calls the judge, it checks the result store for the score key. A row with status succeeded triggers a skip-and-exit; a row with status pending or failed backs off or claims the work depending on retry policy.
  3. Upsert writes. The result-store write is INSERT ... ON CONFLICT (trace_id, rubric, version) DO UPDATE, not INSERT. Two workers that race past the pre-call check land one row, not two.
# inside the worker
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance, FactualAccuracy
from fi.testcases import TestCase

RUBRICS = {
    "groundedness": (Groundedness(), "v3"),
    "context_relevance": (ContextRelevance(), "v2"),
    "factual_accuracy": (FactualAccuracy(), "v3"),
}
JUDGE_MODEL = "claude-sonnet-4-5-20250930"

def score_trace(trace, results_db):
    cases_to_run = []
    for rubric_name, (template, rubric_version) in RUBRICS.items():
        key = (trace.id, rubric_name, rubric_version, JUDGE_MODEL)
        if results_db.get(key, status="succeeded"):
            continue  # idempotent skip
        cases_to_run.append((rubric_name, template, key))

    if not cases_to_run:
        return  # nothing to do; previous run scored this trace

    evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
    test_case = TestCase(input=trace.input, output=trace.output, context=trace.context)
    results = evaluator.evaluate(
        eval_templates=[t for _, t, _ in cases_to_run],
        inputs=[test_case],
    )

    for (rubric_name, _, key), result in zip(cases_to_run, results.eval_results):
        results_db.upsert(key=key, score=result.metrics[0].value, status="succeeded")

The pattern reads boring on purpose. Boring is what idempotency looks like in production. The cost is one read against the result store before each judge call; a Redis cache makes it microseconds. The savings are the judge bill not doubling on the next Kafka rebalance.

Observable

Observable means the pipeline answers three questions without anyone SSH-ing into a worker:

  • Where is trace X? Pending, in-flight, scored, failed, dead-lettered. With timestamps.
  • Why is the rolling mean for rubric Y dropping? Per-rubric error rate, judge-model error rate, sample-mix shift across routes, percentile shifts on the score itself.
  • What does the pipeline cost? Judge calls and cost per rubric per route, total spend per day, deviation from forecast.

Three layers carry the metadata:

# every score-store row carries the run metadata
row = {
    "trace_id": "tr_01HQB8...",
    "rubric": "groundedness",
    "rubric_version": "v3",
    "judge_model": "claude-sonnet-4-5-20250930",
    "score": 0.86,
    "status": "succeeded",
    "started_at": "2026-05-19T02:14:31.119Z",
    "finished_at": "2026-05-19T02:14:33.812Z",
    "queue_lag_ms": 412,
    "judge_latency_ms": 2693,
    "judge_cost_usd": 0.041,
    "worker_id": "eval-worker-7",
    "queue_partition": 12,
}

Roll those rows into a columnar store (ClickHouse, BigQuery, Snowflake) and a dashboard answers the three questions above with SQL. Per-stage Prometheus metrics on the worker pool (queue depth, in-flight jobs, judge-call duration, error rate by model) catch operational failures; the columnar store catches quality and cost failures.

Span-attached scoring is the other observability axis. EvalTag writes the score back to the OTel trace asynchronously, so on-call doesn’t join two systems to read a failing trace:

# config-level; the worker writes to the same OTel exporter the app uses
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName

eval_tags = [
    EvalTag(
        eval_name=EvalName.GROUNDEDNESS,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        config={},
        mapping={"context": "retrieval.documents", "output": "output.value"},
    ),
    EvalTag(
        eval_name=EvalName.CONTEXT_RELEVANCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.RETRIEVER,
        config={},
        mapping={"input": "input.value", "context": "retrieval.documents"},
    ),
]

A failing trace surfaces with its rubric scores next to its latency, chunk IDs, and tool calls. traceAI ships 50+ AI surfaces across Python, TypeScript, Java, and C# with pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY).

Recoverable

Recoverable means a worker pool restart picks up where it left off and a deploy rolls forward without an operator opening a runbook. Three habits cover most of the surface:

  • Durable offsets. Kafka consumer offsets, NATS JetStream state, Temporal workflow state. Commit the offset after the result-store write succeeds, not before. A crash between read and write replays the message; idempotency catches the double.
  • Dead-letter handling. Messages that fail more than N retries land in a DLQ with the failure reason, rubric, judge model, and original payload. Engineers drain on a cadence; a depth spike fires a page.
  • Replay tooling. A CLI that takes a time range or trace-id list and re-enqueues the matching traces. Without it, every gap turns into a custom script.
# replay CLI shape (Click or Typer; trim to taste)
import click
from datetime import datetime
from your_pipeline import enqueue_trace, trace_store

@click.command()
@click.option("--from", "from_ts", required=True, type=click.DateTime())
@click.option("--to", "to_ts", required=True, type=click.DateTime())
@click.option("--route", default=None)
@click.option("--rubric", default=None, multiple=True)
def replay(from_ts: datetime, to_ts: datetime, route: str | None, rubric: tuple[str, ...]):
    """Re-enqueue traces in [from, to) for re-evaluation. Idempotent at the queue."""
    for trace in trace_store.iter_range(from_ts, to_ts, route=route):
        enqueue_trace(trace.id, rubrics=list(rubric) or None)
    click.echo(f"Re-enqueued traces from {from_ts} to {to_ts}")

if __name__ == "__main__":
    replay()

The replay tool exists because eventually it has to. Build it on day one with one route and one rubric; extend as the pipeline matures. Otherwise the first incident that needs a replay becomes an outage post-mortem about the missing runbook.

Temporal makes recoverability cheap because it ships durable workflow state out of the box; a workflow that fails mid-sweep resumes from the last activity. For long-running red-team replays or multi-step composite evals, Temporal earns its operational cost.

The architecture

The pieces and the wires between them, end to end:

StageComponentWhat it owns
1. Trace emissionApp + traceAINon-blocking span export; sampling decision; semantic conventions
2. Trace streamKafka / NATS / Redis Streams / hosted queueDurable buffer; dedup window; partitioning by trace_id
3. Worker poolCelery / Ray / Temporal / K8s Jobs + ai-evaluationIdempotent score; judge calls; retries; backoff
4. Result storeOTel trace (hot) + columnar store (warm)Span-attached scores; aggregate rows; upsert semantics
5. ObservabilityPrometheus + dashboard + the columnar storePer-stage metrics; missing-score detection; cost dashboards
6. LoopError Feed + self-improving evaluatorsCluster failures; write immediate_fix; retune rubric thresholds

Three workload patterns layer on the same architecture:

  • Sampled live scoring. 5-10 percent of production traffic routed to the eval queue. LLM-judge ensemble per trace. Drift detection on rolling means per route per rubric.
  • Batch sweep. Cron, Temporal schedule, or k8s CronJob replays the previous 24 hours at higher coverage than live sampling. Catches what live sampling smoothed over.
  • Replay sweep. Adversarial corpora, golden datasets, and red-team prompt suites injected into the same queue with a tagged source so dashboards filter them out of production drift signal. Same workers, same idempotency, same observability.

The gateway shapes the live-scoring lane. The Agent Command Center supports three routing modes for external-eval traffic:

  • Shadow. Primary version serves the user; a parallel call to a candidate version is mirrored to the eval queue. Compares candidate quality before promotion.
  • Mirror. The eval lane gets a copy of production traffic at a configured sample rate; the user never sees the eval call. The pattern for sampled live scoring.
  • Race. First non-error response wins to the user; the slower one feeds the eval queue. Useful when judges are slow and the latency budget is tight.

Gateway headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, x-prism-routing-strategy, x-prism-guardrail-triggered) ride on the trace, so every score-store row joins quality to cost without a separate stitch. That join closes the cost-versus-quality loop.

Pick a distributed runner

The ai-evaluation SDK ships four runner adapters. The choice is operational, not capability:

  • Celery. Shortest on-ramp for Python shops on Redis or RabbitMQ. Mature ecosystem, well-understood failure modes. Right default for teams that already operate Celery.
  • Ray. Wins on compute-heavy multi-judge ensembles where intra-job parallelism matters. Actor model fits “evaluator that holds a model in memory and serves scores fast.” Cost: a Ray cluster.
  • Temporal. Right backbone for long-running workflows that need durable retries and replay: red-team sweeps with hour-long activities, multi-step composite evals, scheduled batch sweeps that survive a worker restart mid-run. Cost: the Temporal service.
  • Kubernetes Jobs and CronJobs. Native pod-based fan-out, no extra runtime. Right for teams already running k8s. Cost: YAML.

A Celery-backed worker for sampled live scoring:

# eval_worker.py
from celery import Celery
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextRelevance, FactualAccuracy, TaskCompletion,
)
from fi.testcases import TestCase

app = Celery("eval", broker="redis://redis:6379/0")
evaluator = Evaluator(max_workers=8)

@app.task(bind=True, max_retries=3, default_retry_delay=30, acks_late=True)
def score_trace(self, trace_payload: dict, score_keys: list[tuple]) -> dict:
    # Idempotency check: drop any score_keys already in the result store.
    pending = [k for k in score_keys if not results_db.exists(k, status="succeeded")]
    if not pending:
        return {"status": "skipped", "reason": "already_scored"}

    try:
        test_case = TestCase(**trace_payload)
        result = evaluator.evaluate(
            eval_templates=[rubric_for(k) for k in pending],
            inputs=[test_case],
        )
        for key, metric in zip(pending, result.eval_results[0].metrics):
            results_db.upsert(key=key, score=metric.value, status="succeeded")
        return {"status": "scored", "n_rubrics": len(pending)}
    except Exception as exc:
        for key in pending:
            results_db.upsert(key=key, status="failed", error=str(exc))
        raise self.retry(exc=exc)

The shape that matters: acks_late=True so the worker only acknowledges after the write commits; max_retries=3 with backoff; an idempotency check before the judge call; an upsert on the result store; explicit failure rows so observability sees failures instead of silence.

Swap the score_trace body into a Ray actor, a Temporal activity, or a k8s Job entrypoint. The eval logic doesn’t change; the runner does.

A worked example: the live-scoring lane

Three files. App emission, eval-service worker, dashboard query. Enough to ship something that holds all four properties.

# 1) In the app: emit traces. traceAI handles the queue exporter.
from fi_instrumentation import register
from fi_instrumentation.fi_types import EvalTag, EvalTagType, EvalSpanKind, EvalName, ProjectType

register(
    project_type=ProjectType.OBSERVE,
    project_name="support-rag",
    eval_tags=[
        EvalTag(eval_name=EvalName.GROUNDEDNESS, type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.LLM, config={},
                mapping={"context": "retrieval.documents", "output": "output.value"}),
        EvalTag(eval_name=EvalName.CONTEXT_RELEVANCE, type=EvalTagType.OBSERVATION_SPAN,
                value=EvalSpanKind.RETRIEVER, config={},
                mapping={"input": "input.value", "context": "retrieval.documents"}),
    ],
)
# App keeps doing app things. No inline eval, no blocking judge calls.
# 2) In the eval service: Celery worker, idempotent, observable, recoverable.
import os, time
from celery import Celery
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance
from fi.testcases import TestCase

app = Celery("eval", broker=os.environ["EVAL_QUEUE_URL"])
evaluator = Evaluator(max_workers=8)
RUBRICS = {"groundedness": Groundedness(), "context_relevance": ContextRelevance()}
RUBRIC_VERSION = "v3"
JUDGE_MODEL = os.environ["JUDGE_MODEL"]  # pinned

@app.task(bind=True, max_retries=3, default_retry_delay=15, acks_late=True)
def score(self, trace: dict):
    keys = [(trace["id"], r, RUBRIC_VERSION, JUDGE_MODEL) for r in RUBRICS]
    pending = [(name, RUBRICS[name], k) for name, k in zip(RUBRICS, keys)
               if not results_db.exists(k, status="succeeded")]
    if not pending:
        metrics.skipped.inc()
        return

    started = time.time()
    try:
        result = evaluator.evaluate(
            eval_templates=[t for _, t, _ in pending],
            inputs=[TestCase(input=trace["input"], output=trace["output"],
                             context=trace["context"])],
        )
        for (name, _, key), metric in zip(pending, result.eval_results[0].metrics):
            results_db.upsert(key=key, score=metric.value, status="succeeded",
                              judge_latency_ms=int((time.time() - started) * 1000))
        metrics.scored.inc(len(pending))
    except Exception as exc:
        for _, _, key in pending:
            results_db.upsert(key=key, status="failed", error=str(exc))
        metrics.failed.inc(len(pending))
        raise self.retry(exc=exc)
-- 3) In the columnar store: drift detection on rolling means per route per rubric.
WITH per_minute AS (
  SELECT route, rubric,
         toStartOfMinute(finished_at) AS bucket,
         avg(score) AS mean_score,
         count() AS n
  FROM eval_scores
  WHERE finished_at >= now() - INTERVAL 6 HOUR
    AND status = 'succeeded'
  GROUP BY route, rubric, bucket
)
SELECT route, rubric, bucket, mean_score,
       avg(mean_score) OVER w AS rolling_mean_60m,
       mean_score - avg(mean_score) OVER w AS delta
FROM per_minute
WINDOW w AS (PARTITION BY route, rubric ORDER BY bucket
             ROWS BETWEEN 60 PRECEDING AND 1 PRECEDING)
ORDER BY route, rubric, bucket;

That triple is the working spine. App emits. Worker scores idempotently. Dashboard reads the rolling mean and pages on sustained drift. Recoverability: Celery acks_late plus durable Redis. Observability: explicit failure rows plus Prometheus counters. Idempotency: score-key dedup plus upsert.

The Future AGI external-eval surface

The ai-evaluation SDK and the Future AGI Platform together cover the external-eval stack without gluing five vendors together.

  • ai-evaluation SDK (Apache 2.0)Evaluator(...).evaluate(eval_templates=[...], inputs=[TestCase(...)]) as the core entry point. 60+ EvalTemplate classes covering Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, TaskCompletion, LLMFunctionCalling, and more. Four distributed runner adapters (Celery, Ray, Temporal, Kubernetes). Eight sub-10 ms Scanner classes for inline work that complements external pipelines.
  • Span-attached scoring via EvalTag. External jobs write scores back to the OTel trace tree async, so eval and trace live in the same place when on-call needs to walk a regression. The EvalTag / EvalSpanKind / EvalName surface is configured at register() time.
  • Prebuilt domain pipelines via AutoEvalPipeline. Templates for customer_support, rag_system, code_assistant, content_moderation, agent_workflow, healthcare, and financial. Each is a curated rubric set, dataset shape, and recommended judge config; a team starting from zero ships a working pipeline in an afternoon, not a week.
  • Composite orchestration via EvalTemplateManager. create_composite, submit_composite, execute_composite, upload_ground_truth, and run_playground cover multi-step eval workflows. Chain retrieval into generation, gate the next step on the first step’s score, run playground variants against ground truth. Temporal is the right runner here because the workflow is the unit.
  • traceAI (Apache 2.0) — 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions (FI / OTEL_GENAI / OPENINFERENCE / OPENLLMETRY). 14 span kinds including a first-class RETRIEVER; 62 built-in evals wire server-side via EvalTag with zero inline latency.
  • Agent Command Center gateway. Shadow, mirror, and race modes feed a configured fraction of production traffic to the eval queue without affecting user-facing latency. Per-response headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, x-prism-routing-strategy, x-prism-guardrail-triggered) enrich every score with operational context. 17 MB Go binary, self-hosts in your VPC, 20+ providers via six native adapters, SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
  • Error Feed inside the eval stack. HDBSCAN soft-clustering groups failures into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, 90 percent prompt-cache hit ratio) reads each cluster and writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. Fixes feed the Platform’s self-improving evaluators. Linear integration ships today; Slack, GitHub, Jira, and PagerDuty land on the roadmap.
  • agent-opt (Apache 2.0) — six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume a curated dataset and a scoring function and iterate prompts against the regression set. The trace-stream-to-agent-opt connector that auto-builds an optimizer dataset from failing clusters is on the roadmap; eval-driven optimization on curated regression sets ships today.

The pieces are independent. Drop ai-evaluation plus a Celery worker into your stack this afternoon; bring traceAI, Error Feed, the gateway, and the Platform online as the pipeline matures.

Ready to ship an external eval pipeline that doesn’t lose data? Run pip install ai-evaluation, register traceAI in the app, point a Celery worker at the trace queue, set FI_API_KEY and FI_SECRET_KEY. Your next production regression has a place to land.

Common anti-patterns

  • Async without idempotency. The pipeline is non-blocking and the judge bill doubles after the first Kafka rebalance. Fix: three-layer idempotency (queue dedup, pre-call check, upsert write), not retry tuning.
  • Idempotency without observability. The pipeline doesn’t double-score; it also doesn’t tell you when 18 percent of traces silently fail and never retry. Fix: explicit failure rows plus per-stage metrics, treated as production telemetry.
  • Observability without recoverability. A dashboard shows the gap from a worker restart; nobody can fill it. Fix: durable offsets plus a replay CLI built on day one.
  • Synchronous span-attached scoring. Someone calls the judge inline “for convenience” and the request budget blows up. Fix: the queue, every time. Inline rubrics belong to the guardrails surface.
  • One queue for everything. Cheap classifier scoring and frontier-judge ensembles share a queue; the slow lane head-of-lines the fast lane. Fix: two queues with different SLAs, or priority partitioning.
  • Pinned judge that never gets pinned. “We use the latest Sonnet” is not a judge version. Fix: put the judge model and version in the score key and refuse to score with an unpinned judge.
  • Result store as a write-only log. Every score appends a row; reads are slow; idempotency leans on application logic. Fix: upsert semantics with a unique constraint on (trace_id, rubric, version).

Where this gets you a quarter in

A team running the four-property external eval stack for a quarter has these properties:

  • Every production trace has a score on every rubric within minutes of landing, with no inline latency to the user.
  • A worker pool restart drops zero traces; replay covers the rare gap.
  • The judge bill is tracked per rubric per route per day, with deviation alerts when a new rubric eats more than its budget.
  • Drift fires a page within an observation window calibrated to the rubric’s noise floor, not hours after complaints.
  • Failure clusters from the Error Feed turn into Linear issues with a candidate immediate_fix already attached.
  • The same rubric definition runs in inline guardrails, in nightly batch sweeps, in red-team replays, and in the live-scoring lane. Engineers debug regressions against one trace tree, not five surfaces.

External eval is a distributed system. The four properties are how to keep it from behaving like an undergraduate one. Teams shipping reliable LLM apps in 2026 treat the pipeline like core infrastructure, the way they treat their primary message bus. That is the bar. The blueprint above is how to clear it.

Frequently asked questions

What is an external evaluation pipeline for an LLM app?
An external evaluation pipeline runs eval work outside the LLM application process. The app emits a trace; a separate worker pool consumes the trace, runs judges or classifiers against it, writes scores to a result store, and attaches those scores back to the trace. The user never waits for the evaluator. This is the opposite of inline eval, where the judge runs in the request path. External pipelines are how teams afford frontier-judge ensembles, full-trace conversation rubrics, drift detection across thousands of traces, and red-team replays without paying for them on every user request.
How is external eval different from inline CI eval?
Inline CI eval runs at build time on a versioned dataset; the eval is part of the test suite, the verdict gates the merge, the dataset doesn't move during the run. External eval runs against live or near-live production traffic on an ongoing basis; the trace stream is unbounded, traces arrive out of order, workers crash, results need to land idempotently. CI eval's hard problem is statistical significance on a fixed dataset. External eval's hard problem is shipping a worker pool that survives the kinds of failures distributed systems actually have. Different jobs, different failure modes, different toolkits.
What are the four properties an external eval pipeline needs?
Async, idempotent, observable, recoverable. Async means the eval worker never blocks the application's request path; the app emits a trace and moves on. Idempotent means the same trace can be scored twice without producing two rows or double-charging the judge bill; a retry after a worker crash lands cleanly. Observable means every job carries enough metadata (trace_id, rubric_version, judge_model, started_at, finished_at, status) to debug why a score is missing or stale. Recoverable means a worker pool restart picks up the queue where it left off, with no data loss and no manual reconciliation. Pipelines that miss any of the four either lose data, double-score, hide failures, or can't restart cleanly.
What does the architecture look like end to end?
Four pieces. The application instruments traces with traceAI and writes to an event queue (Kafka, NATS, Redis Streams, or a hosted equivalent). A worker pool consumes the queue, runs an Evaluator with the right rubric stack against each trace, and writes scores to a result store. The result store is two layers: hot scores live on the OTel trace via EvalTag for debugging; aggregated scores live in a columnar store (ClickHouse, BigQuery, Snowflake) for dashboards and drift detection. An observability layer (Prometheus or the platform's built-in telemetry) tracks queue depth, worker health, score-write latency, and judge-error rate. The four properties live across the queue (idempotency keys, dead-letter handling), the worker pool (graceful restart, checkpoints), the result store (upsert semantics), and the observability layer (per-stage metrics).
Which distributed runner should I pick?
Celery for teams already on Redis or RabbitMQ with a Python worker culture; the on-ramp is shortest. Ray for compute-heavy multi-judge ensembles that need the actor model and intra-job parallelism; the cost is a Ray cluster to operate. Temporal for long-running workflows that need durable state, retries, and replay (red-team sweeps, multi-step eval workflows that take hours); the cost is the Temporal service. Kubernetes Jobs and CronJobs for teams already running k8s who want pod-level fan-out with no extra runtime; the cost is k8s YAML. The ai-evaluation SDK ships adapters for all four so the runner is a config choice, not a rewrite.
Where does idempotency actually live in the pipeline?
Three places. The queue carries a deduplication key (trace_id + rubric_version + judge_model) so a redelivered message doesn't trigger a second judge call. The worker checks the result store before calling the judge; if a score for that key exists, the worker skips the call and exits clean. The result-store write is an upsert on (trace_id, rubric, version), not an insert, so a retried write replaces rather than duplicates. Pipelines that skip any of the three eventually double-charge the judge bill and write conflicting rows; the second one is worse because dashboards start showing two different numbers for the same trace.
How does Future AGI ship external eval?
The ai-evaluation SDK exposes Evaluator(...).evaluate(eval_templates=[...], inputs=[TestCase(...)]) as the core entry point with 60+ EvalTemplate classes and four distributed runner adapters (Celery, Ray, Temporal, Kubernetes). EvalTag plus traceAI write scores back to the OTel trace asynchronously, so a result lands on the span without inline latency. AutoEvalPipeline ships prebuilt domain pipelines (customer_support, rag_system, code_assistant, content_moderation, agent_workflow, healthcare, financial) for teams who don't want to assemble a rubric stack from scratch. EvalTemplateManager exposes create_composite, submit_composite, execute_composite, upload_ground_truth, and run_playground for multi-step workflows. Error Feed clusters failures with HDBSCAN; a Sonnet 4.5 Judge writes the immediate_fix; the Platform's self-improving evaluators retune from accumulated production feedback. The Agent Command Center gateway adds shadow, mirror, and race modes so a sampled fraction of production traffic can feed the pipeline without affecting user-facing latency.
Related Articles
View all