Distributed eval runners for LLM regression at scale: Celery, Ray, Temporal, Kubernetes

Celery, Ray, Temporal, and Kubernetes optimise for different things. Pick by your bottleneck, not by what's fashionable. The 2026 engineering decision guide.


Sequential eval is the silent reason most teams ship LLM features without regression coverage. A 10,000-case suite at 500 ms per evaluation takes about 1.4 hours on a single worker, and that’s before the LLM-judge round-trip. Fifty workers bring the same suite under two minutes. Add a four-language coverage matrix and the sequential number turns into a working day; the parallel number stays in minutes. The choice between “we ship every PR” and “we ship every Tuesday” almost always comes down to whether eval runs distributed or not.
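
The arithmetic behind those figures is worth keeping as a back-of-envelope check. A minimal sketch, assuming perfectly even sharding and no scheduling or judge overhead; the function name is illustrative and not part of any SDK:

```python
import math

def wall_time_seconds(cases: int, seconds_per_case: float, workers: int = 1) -> float:
    """Idealised wall time: each worker scores ceil(cases / workers) cases in sequence."""
    if cases < 0 or workers < 1:
        raise ValueError("cases must be >= 0 and workers must be >= 1")
    return math.ceil(cases / workers) * seconds_per_case

sequential = wall_time_seconds(10_000, 0.5)            # one worker
parallel = wall_time_seconds(10_000, 0.5, workers=50)  # fifty workers
print(f"{sequential / 3600:.1f} h sequential, {parallel:.0f} s on 50 workers")
```

Real suites land above the idealised number because of broker latency, stragglers, and judge rate limits, so treat it as a floor.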

The opinion this post earns: the four common runners (Celery, Ray, Temporal, Kubernetes) each optimise for a different thing. Celery for throughput. Ray for ML burst. Temporal for durability. Kubernetes for elasticity. Pick by your bottleneck, not by what’s fashionable. Most teams pick Celery because that’s what they already run, and that is usually the right call. Introducing a new orchestrator only for eval is the cleanest way to double your operational tax forever.

This is the decision document. Four runners, when each one wins, the minimal wiring, and how to keep cost accounting honest when judge calls fan out across hundreds of workers.

TL;DR: pick the runner that matches your bottleneck

| Runner | Optimises for | Best when | Wiring cost | Watch out for |
|---|---|---|---|---|
| Celery | Throughput | Python services on Redis or RabbitMQ already | Hours to days | Visibility timeout on long judge calls |
| Ray | ML burst | Compute-heavy, multi-modal, in-worker models | Days | Head-node failure mid-suite |
| Temporal | Durability | Audit-grade replay, hours-to-days suites, regulated workloads | Days to a week | Per-template activity timeout tuning |
| Kubernetes | Elasticity | Bursty workloads, polyglot workers, cloud-native | Days | Pod startup latency on small jobs |

The Future AGI ai-evaluation SDK keeps the eval definitions runner-agnostic so the runner becomes a deployment detail, not a rewrite. The decision is reversible.

Why one runner doesn’t fit

Eval volume in 2026 is not what it was in 2024. A serious regression suite for a single agent route now sits at 500 to 5,000 cases. Multiply by routes, language coverage, rubric count, and the per-tenant variants enterprise customers ask for, and the matrix you actually want to gate CI on is well into five figures.

Three forces drive this:

  1. Dataset growth from production. The 2026 LLM evaluation playbook is dataset-refresh-weekly, so the suite grows on a schedule. New failures land in the dataset every Monday.
  2. Multi-rubric per case. A single case now scores against groundedness, faithfulness, toxicity, prompt injection, completeness, and a domain rubric. Six judge calls per input where 2024 had one.
  3. Multi-modal coverage. Image, audio, and video inputs land in the same suite as text. A single image-grounded eval can be 10x the compute of a text-only judge.

The combined effect: a suite that ran in five minutes last quarter runs in 45 this quarter. By the next planning cycle it runs overnight, and overnight means it stops running per-PR. The fix is to fan out, not to trim the suite.

But fan-out is not a single problem. A bursty overnight batch has different failure modes from a continuous live-scoring lane. A multi-modal eval that holds a vision model in worker memory has different needs from a sub-100 ms classifier sweep. A regulated workflow that must replay deterministically has different needs from a throwaway PR check. There is no single optimal runner because there is no single workload. That’s the thesis.

For the broader economics see LLM eval cost optimisation and LLM cost-tracking best practices.

Celery: throughput-first

Celery is the right call when your bottleneck is raw cases per second and your stack already runs Celery for other workloads. The broker is Redis or RabbitMQ. The cognitive overhead is the lowest of the four because most Python engineers have seen it. Throughput scales linearly with worker count until the broker becomes the bottleneck, which for typical eval payloads sits around a few thousand tasks per second per Redis broker.

Use it for:

  • Suites in the hundreds-to-low-thousands of cases
  • Eval that runs in the same Python process family as your application code
  • CI gates where the runner only spins up during the pipeline

Minimal wiring with the SDK:

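from celery import Celery
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, FactualAccuracy, Toxicity
from fi.testcases import TestCase

# A result backend is required for group/chord fan-out to collect return values.
# The backend URL is an assumption; point it at whatever store you already run.
app = Celery("evals", broker="redis://broker:6379/0", backend="redis://broker:6379/1")
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# acks_late: acknowledge only after the task finishes, so a crashed worker's case is redelivered.
@app.task(acks_late=True)
def score_case(payload: dict) -> dict:
    case = TestCase(**payload)
    result = evaluator.evaluate(
        eval_templates=[Groundedness(), FactualAccuracy(), Toxicity()],
        inputs=[case],
    )
    return result.to_dict()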

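Fan-out is the standard Celery pattern: a group or chord over the case list, with results merged by the callback. Watch-out: on the Redis transport, a task that is not acknowledged within the visibility timeout is redelivered to another worker, so with late acknowledgement a long-running judge call can end up scored twice. Tune the visibility_timeout transport option (set through broker_transport_options) together with acks_late, and prefer short tasks (one case per task) over batched tasks.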
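
Because a redelivered task can score the same case twice, the merge step should be idempotent. A minimal sketch of a callback-side merge, assuming each result dict carries a `case_id` key (an assumption for illustration; the SDK's `to_dict()` output may be shaped differently):

```python
from typing import Iterable

def merge_results(results: Iterable[dict]) -> dict[str, dict]:
    """Keep one result per case_id; the first delivery wins, later duplicates are dropped."""
    merged: dict[str, dict] = {}
    for row in results:
        merged.setdefault(row["case_id"], row)
    return merged

rows = [
    {"case_id": "c1", "score": 0.9},
    {"case_id": "c2", "score": 0.4},
    {"case_id": "c1", "score": 0.9},  # redelivered duplicate of c1
]
merged = merge_results(rows)
print(f"{len(rows)} deliveries, {len(merged)} unique cases")
```

First-delivery-wins is one reasonable policy. If the judge is non-deterministic you may prefer to keep the latest result or to flag disagreements instead.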

Celery is also the right first runner for teams without one. The on-ramp is an afternoon; the cost ceiling shows up only at multi-modal or audit-replay workloads, both of which warrant a different runner anyway.

Ray: ML burst

Ray wins when the per-case work is heavy: image, audio, or video inputs, multi-step judge chains, or scoring that needs vectorised numpy in the worker. The native dataframe story (Ray Data) is also a real reason to choose it when eval inputs already live in a Parquet or Arrow pipeline. Ray’s resource model lets you ask for num_gpus=1 on the scoring task itself, which matters when the judge is a local guardrail backend you run in-process.

Use it for:

  • Multi-modal eval where the judge ingests image, audio, or video frames
  • ML-engineering teams that already operate Ray Train or Ray Serve
  • Suites that benefit from in-memory shuffles between scoring stages

Minimal wiring:

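import ray
from fi.evals import Evaluator
from fi.evals.templates import ContextAdherence, ChunkAttribution
from fi.testcases import TestCase

ray.init()  # starts a local Ray instance unless an existing cluster is configured
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

@ray.remote(num_cpus=1)
def score_case(payload):
    # The evaluator is captured from the driver and shipped to the worker with the task.
    case = TestCase(**payload)
    return evaluator.evaluate(
        eval_templates=[ContextAdherence(), ChunkAttribution()],
        inputs=[case],
    ).to_dict()

# load_cases() is your own dataset loader; it is not defined here.
futures = [score_case.remote(c) for c in load_cases()]
results = ray.get(futures)  # blocks until every case has been scored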

When the judge is LLAMAGUARD_3_8B, QWEN3GUARD_8B, or another local backend that loads into worker memory, an actor with num_gpus=1 keeps the model resident across cases instead of cold-loading per task. That’s Ray earning its operational cost. For backend selection patterns see open-source LLM eval libraries.

Watch-out: head-node restarts mid-suite lose in-flight state. For long suites prefer Ray Job submission with a persistent dashboard, and checkpoint results to durable storage between batches.
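
Checkpointing can be as simple as an append-only results file that a restarted driver reads before resubmitting work. A minimal sketch, assuming each case carries a `case_id` key and using a local JSONL file as a stand-in for durable storage; the function and key names are hypothetical:

```python
import json
from pathlib import Path
from typing import Callable, Iterable

def run_with_checkpoint(
    cases: Iterable[dict],
    score: Callable[[dict], dict],
    checkpoint: Path,
) -> list[dict]:
    """Score cases, appending each result to a JSONL file and skipping cases already recorded."""
    done: dict[str, dict] = {}
    if checkpoint.exists():
        with checkpoint.open() as fh:
            for line in fh:
                if line.strip():
                    row = json.loads(line)
                    done[row["case_id"]] = row
    with checkpoint.open("a") as fh:
        for case in cases:
            if case["case_id"] in done:
                continue  # already scored before the restart
            row = {"case_id": case["case_id"], **score(case)}
            fh.write(json.dumps(row) + "\n")
            fh.flush()  # push each row to the OS before moving on
            done[row["case_id"]] = row
    return list(done.values())
```

In a Ray driver the `score` callable would wrap the remote call and its `ray.get`, and the same idea applies per batch instead of per case when the write cost matters.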

Temporal: durable

Temporal is the right call when eval matters enough that mid-run failures need to recover, not restart. The per-activity retry semantics, the replay-correct execution history, and workflow-level versioning make it the strongest choice for regulated workloads where the eval audit trail is part of the compliance story.

Use it for:

  • Suites that run for hours to days (overnight per-tenant regression batches)
  • Finance, healthcare, and government workloads where audit replay matters
  • Eval pipelines with heterogeneous activities (fetch dataset, score, attach span, write report) that each need their own retry policy

Minimal wiring with the Python SDK:

import asyncio
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from fi.evals import Evaluator
from fi.evals.templates import TaskCompletion, LLMFunctionCalling
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

@activity.defn
async def score_case(payload: dict) -> dict:
    case = TestCase(**payload)
    # evaluate() is a blocking call, so run it in a thread to keep the worker's event loop free.
    result = await asyncio.to_thread(
        evaluator.evaluate,
        eval_templates=[TaskCompletion(), LLMFunctionCalling()],
        inputs=[case],
    )
    return result.to_dict()

@workflow.defn
class EvalSuite:
    @workflow.run
    async def run(self, cases: list[dict]) -> list[dict]:
        # One activity per case; asyncio.gather waits for all of them and keeps input order.
        return await asyncio.gather(*[
            workflow.execute_activity(
                score_case, c,
                start_to_close_timeout=timedelta(minutes=5),
                # RetryPolicy lives in temporalio.common, not temporalio.workflow.
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
            for c in cases
        ])

The win is what happens when a worker dies on case 4,217. The activity is retried on another worker, the workflow replays from its recorded history without re-running the cases that already completed, and the final report covers every case as if nothing had failed. That is genuinely hard to build by hand on Celery or Ray.

Watch-out: activity timeouts have to be tuned per template family. A CustomLLMJudge with a stronger model can take 30 seconds; a stateless RegexScanner returns in milliseconds. A single blanket timeout is either too short for the expensive judge, which gets killed mid-call, or so long that a hung cheap activity goes undetected for minutes.
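
One way to keep that tuning in a single place is a lookup keyed by template family. A minimal sketch; the family names and budgets below are hypothetical placeholders, so measure your own tail latencies before adopting any of them:

```python
from datetime import timedelta

# Hypothetical budgets per template family; these are not SDK defaults.
TIMEOUTS = {
    "scanner": timedelta(seconds=5),       # millisecond-scale checks such as RegexScanner
    "classifier": timedelta(seconds=30),   # guardrail backends
    "llm_judge": timedelta(minutes=5),     # CustomLLMJudge with a stronger model
}
DEFAULT_TIMEOUT = timedelta(minutes=1)

def timeout_for(family: str) -> timedelta:
    """Return the activity timeout for a template family, falling back to a default."""
    return TIMEOUTS.get(family, DEFAULT_TIMEOUT)

print(timeout_for("scanner"), timeout_for("llm_judge"))
```

The returned value would be passed as `start_to_close_timeout` when the workflow schedules the activity for that family, which implies one activity type (or at least one call site) per family.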

Kubernetes Jobs: elastic

Kubernetes is the right call when the workload varies dramatically across the day, the worker pool needs to scale to zero between batches, and the worker images are polyglot. Job and CronJob primitives give you native scheduling, the cluster autoscaler adds nodes for bursts (HPA scales Deployments, not Jobs), and the container can be Python, TypeScript, or Java, which matters when the eval surface spans multiple SDK languages.

Use it for:

  • Bursty workloads (light during business hours, heavy overnight)
  • Polyglot worker pools (Python eval workers + TypeScript trace processors)
  • Platform-led organisations where eval is one tenant of a shared cluster

Minimal Job spec:

apiVersion: batch/v1
kind: Job
metadata:
  name: eval-suite-regression
spec:
  # Indexed mode is what injects JOB_COMPLETION_INDEX into each pod.
  completionMode: Indexed
  parallelism: 50
  # One completion per shard index. Larger shards (fewer completions) amortise pod start-up.
  completions: 10000
  template:
    spec:
      containers:
        - name: worker
          image: registry/fagi-eval-worker:2026.05
          env:
            - name: FI_API_KEY
              valueFrom: { secretKeyRef: { name: fagi, key: api-key } }
            - name: FI_SECRET_KEY
              valueFrom: { secretKeyRef: { name: fagi, key: secret-key } }
          command: ["python", "-m", "eval_worker", "--shard", "$(JOB_COMPLETION_INDEX)"]
      restartPolicy: OnFailure
The container code looks like the Celery worker without the broker: pull a shard from durable storage, run the Evaluator over it, write results back. The cluster handles the rest. For longer recurring suites use CronJob with concurrencyPolicy: Forbid to avoid pile-ups.
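
The shard-selection step inside the container can be sketched in a few lines. This assumes the worker knows the total case count and shard count up front; the figure of 200 shards and the function name are illustrative, not taken from the Job spec:

```python
import os

def shard_bounds(index: int, total_shards: int, n_cases: int) -> tuple[int, int]:
    """Return the half-open [start, end) slice for one shard, spreading any remainder over the first shards."""
    if not 0 <= index < total_shards:
        raise ValueError("shard index out of range")
    base, extra = divmod(n_cases, total_shards)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end

# Kubernetes sets JOB_COMPLETION_INDEX for Indexed Jobs; default to 0 for local runs.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
start, end = shard_bounds(index, total_shards=200, n_cases=10_000)
print(f"shard {index} scores cases [{start}, {end})")
```

Deriving the slice from the index alone keeps pods stateless: a pod that restarts under `restartPolicy: OnFailure` recomputes the same bounds and rewrites the same results.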

Watch-out: pod startup latency. For small ad-hoc evals (a single PR check on 50 cases) per-pod cold start can dominate. Keep a small Celery or local pool for the small jobs and reserve Kubernetes for the large batch and overnight work.

The comparison matrix

| Dimension | Celery | Ray | Temporal | Kubernetes |
|---|---|---|---|---|
| Primary optimisation | Throughput | ML burst | Durability | Elasticity |
| On-ramp time | Hours | Days | Days to a week | Days |
| GPU support | External | Native (num_gpus) | Via activity | Via node selector |
| Mid-suite failure recovery | Manual | Checkpoint | Automatic (replay) | Manual / re-run Job |
| Audit replay | No | No | Yes | No |
| Cold start | None | Cluster init | None | Per pod |
| Polyglot workers | Python-first | Python-first | Multi-language SDK | Any image |
| Best at scale | 10K-100K cases | Compute-heavy 1K-100K | Hours-to-days suites | Bursty 100-1M cases |
| Operational tax | Lowest if Redis exists | Cluster lifecycle | Temporal service | Cluster, but shared |
| Worst at | Long judge calls | Head-node failure | Short tasks | Small ad-hoc jobs |

The pattern: each runner is the right answer for its column and the wrong answer for the others. Forcing one across all workloads is how teams end up with idle Ray clusters during business hours or Celery brokers gasping under multi-modal payloads.

A healthier shape for organisations with multiple workloads is one runner per workload class. A small Celery pool for PR checks, Kubernetes Jobs for overnight batches, Temporal for the regulated tenant. The ai-evaluation SDK keeps the eval definitions identical across all of them, so the operational complexity stays in deployment and never leaks into the rubric layer.

For the broader architecture see LLM evaluation architecture and the external evaluation pipelines deep-dive on async/idempotent/observable/recoverable pipeline properties.

FAGI ai-evaluation supports all four

The Evaluator API does not care which runner is wiring tasks underneath. The shape stays the same:

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, FactualAccuracy,
    Toxicity, PromptInjection, DataPrivacyCompliance,
    AnswerRefusal, IsHarmfulAdvice, TaskCompletion, LLMFunctionCalling,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
templates = [
    Groundedness(), ContextAdherence(), FactualAccuracy(),
    Toxicity(), PromptInjection(), DataPrivacyCompliance(),
    AnswerRefusal(), IsHarmfulAdvice(), TaskCompletion(), LLMFunctionCalling(),
]
inputs = [TestCase(...) for _ in dataset]
result = evaluator.evaluate(eval_templates=templates, inputs=inputs)

That same call works whether the wiring underneath is Celery, Ray, Temporal, or Kubernetes. The templates parallelise per input. The 13 guardrail backends (LLAMAGUARD_3_8B, LLAMAGUARD_3_1B, QWEN3GUARD_8B, QWEN3GUARD_4B, QWEN3GUARD_0.6B, GRANITE_GUARDIAN_8B, GRANITE_GUARDIAN_5B, WILDGUARD_7B, SHIELDGEMMA_2B, OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY) are stateless, so they fan out without coordination between workers. The 8 Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) return in single-digit milliseconds and shard linearly.

AutoEvalPipeline adds backend-aware sharding, so a suite that mixes a heavy CustomLLMJudge with a cheap RegexScanner doesn’t pin both to the same worker pool. EarlyStopPolicy works across all four runners, so once a confidence threshold is hit the suite terminates instead of grinding through the remaining 7,000 cases.

The decision rule the SDK enables: start sequentially in a notebook, fan out on Celery when the suite crosses a few hundred cases, migrate to Temporal when audit replay matters, and the eval definitions never change. The runner becomes a deployment lever.

Cost telemetry: route judge calls through the gateway

A distributed runner without per-eval cost telemetry is a budget grenade. The fix is to route every LLM judge call through the Agent Command Center gateway at https://gateway.futureagi.com/v1. Every response carries headers your worker can capture:

  • x-prism-cost: exact spend for this judge call
  • x-prism-latency-ms: wall time including any retry
  • x-prism-model-used: the actual model the gateway routed to
  • x-prism-fallback-used: whether a fallback fired
  • x-prism-routing-strategy: which strategy made the choice
  • x-prism-guardrail-triggered: whether a guardrail blocked or modified the call

The worker tags each result row with these values. Aggregate by suite, rubric, or runner and you have the economics that show where a TURING_FLASH classifier (sub-100 ms, pennies per call) earns its place versus where a stronger judge is worth the spend. Without this telemetry the runner choice is uninformed; with it you can A/B the same suite across two backends and pick the one with the better dollar-per-rubric.
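
Tagging and aggregating can be sketched with the standard library alone. The header names come from the list above; the value formats (a plain decimal for cost, "true"/"false" for the fallback flag) and the row shape are assumptions, so check them against real gateway responses:

```python
from collections import defaultdict
from typing import Mapping

def telemetry_from_headers(headers: Mapping[str, str]) -> dict:
    """Pull the gateway's x-prism-* headers into a flat dict for the result row."""
    h = {k.lower(): v for k, v in headers.items()}  # HTTP header names are case-insensitive
    return {
        "cost": float(h.get("x-prism-cost", "0")),            # assumed: plain decimal string
        "latency_ms": float(h.get("x-prism-latency-ms", "0")),
        "model_used": h.get("x-prism-model-used"),
        "fallback_used": h.get("x-prism-fallback-used", "false").lower() == "true",
    }

def cost_by(rows: list[dict], key: str) -> dict[str, float]:
    """Sum the cost column grouped by any row attribute, such as rubric, suite, or runner."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost"]
    return dict(totals)

rows = [
    {"rubric": "Groundedness", **telemetry_from_headers({"x-prism-cost": "0.002", "x-prism-latency-ms": "850"})},
    {"rubric": "Toxicity", **telemetry_from_headers({"X-Prism-Cost": "0.0005", "x-prism-latency-ms": "90"})},
    {"rubric": "Groundedness", **telemetry_from_headers({"x-prism-cost": "0.003", "x-prism-latency-ms": "910"})},
]
print(cost_by(rows, "rubric"))
```

Swapping the grouping key from `rubric` to a `runner` or `suite` column gives the other two cuts mentioned above without changing the capture code.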

The Agent Command Center gateway is a 17 MB Go binary, self-hosts in your VPC, supports 100+ providers via six native adapters, and is SOC 2 Type II, HIPAA, GDPR, and CCPA certified. For the gateway architecture in depth see AI gateway governance, cost, and provider flexibility.

Anti-patterns to retire

Sequential eval that should be parallel. This is where most teams start and most stay. The signal is “we’ll add eval to CI when it’s fast enough.” It is fast enough; you just need a runner. Even Celery on a single broker fixes 80% of the pain in an afternoon.

One runner for every workload. Ray for a 50-case PR check is over-engineering and the cluster cost runs during idle hours. Celery for frame-by-frame video grounding is under-powered and the broker becomes the bottleneck. The healthier pattern is a small pool (Celery or local) for PR checks and a heavier pool (Ray, Temporal, or Kubernetes) for nightly batches.

No per-runner cost telemetry. If you can’t tell whether a regression suite cost $4 or $400, you can’t optimise the runner choice and you can’t justify a migration. Capture x-prism-cost on every judge call from day one. The pattern composes with broader AI agent cost optimisation and observability work.

Custom orchestrator instead of one of the four. You will reinvent half of Temporal and learn why the other half exists. Adopt; don’t rebuild.

Closing the loop: Error Feed clusters distributed-eval failures

Distributed eval fails in patterns. A worker pool drops on a specific classifier backend. A Temporal activity timeout fires on long-context inputs. A Ray task OOMs when image inputs exceed a size threshold. Spotting these by hand is the kind of work nobody does until the third or fourth incident.

Error Feed runs HDBSCAN soft-clustering over failing eval runs and surfaces clusters as named issues. A Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur for long spans, ~90% prompt-cache hit) writes the concrete immediate_fix per cluster (bump the activity timeout from 5 minutes to 10, shard WILDGUARD_7B workers across two pools, raise the Ray worker memory ceiling to 16 GiB) and pushes that into the Future AGI Platform’s self-improving evaluators. Linear is the only integration today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The honest framing on the broader loop: eval-driven optimisation on judge prompts ships today through agent-opt’s six optimisers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer), all with EarlyStoppingConfig. The trace-stream-to-agent-opt connector that would let you optimise straight from production traces is on the roadmap, not shipped. We name the gap because over-claiming the loop is one of the easier ways to lose trust with engineering buyers.

For more on the closed loop in practice see agent observability vs evaluation vs benchmarking and automated prompt improvement.

Where Future AGI fits

The four-runner choice is operational. The layer above and below it is where the platform earns its place.

  • ai-evaluation SDK (Apache 2.0). One Evaluator API across all four runners. 60+ EvalTemplate classes, 13 guardrail backends, 8 sub-10 ms Scanner classes. AutoEvalPipeline and EarlyStopPolicy so the suite shape stays the same as the runner changes. Drop it into the runner you already operate this afternoon.
  • Agent Command Center gateway. 17 MB Go binary, self-hosts in your VPC, 100+ providers. Tags every judge call with x-prism-cost, x-prism-latency-ms, x-prism-model-used, and routing metadata so the runner-by-runner cost analysis takes a SQL query, not a spreadsheet.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, and C#. Span-attached scoring via EvalTag writes results back to the OTel trace so eval and trace live in the same place when on-call walks a regression.
  • Error Feed inside the eval stack. HDBSCAN clustering plus a Sonnet 4.5 Judge agent that writes the immediate_fix for each cluster. Failures from any of the four runners flow through the same pipeline; the fix candidates feed self-improving evaluators on the Future AGI Platform.
  • agent-opt (Apache 2.0). Six optimisers for the judge-prompt and rubric-tuning loop, with EarlyStoppingConfig. Eval-driven optimisation ships today; the trace-stream connector is roadmap.

The pieces are independent. Pick the runner that fits your bottleneck. Wire the SDK into one worker. Route the judge calls through the gateway. The runner is the lever; the rest compounds on top of it.

What to do this week

If your eval is still sequential, this is the action list:

  1. Pick the runner that matches your bottleneck and your existing infra. Don’t add a new one for eval alone.
  2. Wire the Evaluator into one worker. Confirm a single case scores end to end.
  3. Fan out the smallest suite you have (a single rubric on 100 cases). Measure wall time before and after.
  4. Route judge calls through the gateway. Capture x-prism-cost and x-prism-latency-ms on every result row.
  5. Pick one anti-pattern from above to retire by end of quarter.
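
For step 3, a local thread pool is enough to get a first before-and-after number before any broker is involved. A minimal sketch in which `score_case` is a stand-in that sleeps for 10 ms; replace it with your own evaluator call. Threads are a reasonable fit here because judge calls are I/O-bound:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def score_case(case: dict) -> dict:
    """Stand-in for a judge call; replace the sleep with your evaluator."""
    time.sleep(0.01)
    return {"case_id": case["case_id"], "passed": True}

def timed(fn):
    """Run fn() and return its result together with the elapsed wall time in seconds."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start

cases = [{"case_id": i} for i in range(100)]
sequential, t_seq = timed(lambda: [score_case(c) for c in cases])
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel, t_par = timed(lambda: list(pool.map(score_case, cases)))

print(f"sequential {t_seq:.2f}s, 10 threads {t_par:.2f}s")
```

`pool.map` returns results in input order, so the two result lists can be compared directly to confirm that fanning out changed the wall time and nothing else.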

The goal is not a perfect distributed eval system. It’s an eval suite that runs in minutes instead of hours, costs what you expect, and tells you what failed when something fails. The runner is the lever that gets you there.

Frequently asked questions

Which distributed runner should I pick for LLM eval at scale?
Pick by your bottleneck. Celery wins when throughput is the goal and Redis or RabbitMQ is already in production; the wiring takes hours. Ray wins when per-case work is compute-heavy, multi-modal, or holds models in worker memory. Temporal wins when eval matters enough that a mid-run failure has to recover, not restart, and the audit trail is part of the compliance story. Kubernetes wins when the workload is bursty, the worker images are polyglot, and the platform team already runs the cluster. Most teams pick Celery because that's what they already operate, and the right move is rarely to introduce a new orchestrator just for eval.
Why does LLM evaluation need a distributed runner at all?
Because eval volume scales with the product, not the team. A 10,000-case regression suite at 500 ms per case is about 1.4 hours on a single worker, and that's the cheap math before the LLM-judge round-trip. Fifty workers bring it back to under two minutes, which is the difference between gating every PR and gating once a week. Add a multi-language matrix or per-tenant rubrics and the workload multiplies again. Sequential eval is the silent reason most teams ship LLM features without regression coverage.
How does Future AGI's ai-evaluation SDK abstract the runner?
The Evaluator API is identical across backends. You construct one client with Evaluator(fi_api_key=..., fi_secret_key=...) and call .evaluate(eval_templates=[...], inputs=[...]) the same way whether fan-out is on Celery, Ray, Temporal, or Kubernetes. The runner is a deployment concern, not an API concern. You can start sequentially during prototyping, switch to Celery when the suite grows past a few hundred cases, and migrate to Temporal when audit replay matters, without rewriting your eval definitions.
Does the runner choice affect which evals I can run?
No. Every template in the ai-evaluation SDK parallelises per input. Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, TaskCompletion, LLMFunctionCalling, CustomLLMJudge, and the rest of the 60+ catalog fan out identically. The 13 guardrail backends and 8 Scanners are stateless. AutoEvalPipeline adds backend-aware sharding, and EarlyStopPolicy works across all four runners so a 10,000-case run can terminate once a confidence threshold is met.
How do I do cost accounting when eval workers call LLM judges?
Route every judge call through the Agent Command Center gateway at https://gateway.futureagi.com/v1. Each response carries x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, and x-prism-routing-strategy headers, so the worker tags its result row with the exact judge cost and latency. Aggregate by suite, by rubric, or by runner and you have the per-eval economics that show where a cheap classifier earns its keep and where a stronger judge is worth the spend. Without this telemetry the runner choice is uninformed and the bill is a surprise.
What are the most common distributed-eval anti-patterns?
Three keep appearing. Sequential eval that should be parallel, which is where most teams start and most stay until a release breaks. One runner for every workload, like Ray for a 50-case PR check (over-engineered, idle cluster cost) or Celery for frame-by-frame video grounding (under-powered, broker bottleneck). No per-runner cost telemetry, so without x-prism-cost on every judge call you can't compare backends or justify a migration. A quieter fourth: building a custom orchestrator instead of adopting one that already has retries, scheduling, and observability.
How does Error Feed help when distributed eval starts failing in clusters?
Error Feed runs HDBSCAN soft-clustering over failing eval runs and surfaces patterns like 'worker pool drops on classifier backend Y' or 'Temporal activity timeout on long-context inputs in cluster 3'. A Sonnet 4.5 Judge agent writes a concrete immediate_fix per cluster (bump the activity timeout, shard the classifier across two pools, raise the worker memory ceiling) and pushes that into the Future AGI Platform's self-improving evaluators. Linear is the only integration today; broader integrations are on the roadmap. The honest framing is eval-driven optimisation today, full trace-to-optimisation connector on roadmap.