
Evaluating LLM Batch Inference in 2026: Deferred-Execution, Drift, and Completion-Rate Eval

Batch APIs cut LLM cost ~50%, but break the eval loop. The working pattern for deferred execution, batch-vs-sync drift, and failed-row recovery.


Batch LLM inference saves about 50 percent on cost and breaks the eval loop. OpenAI’s Batch API, Anthropic Batches, and Bedrock batch inference all trade real-time latency for a discount and a 24-hour SLA. The economics work. The eval pattern most teams brought over from synchronous traffic does not. A batch returns hours after the prompt was submitted, with some rows completed, some silently retried, some content-filtered, and a quality distribution that can drift inside the same job ID. The CI gate that fires in 30 seconds on a synchronous deploy has nothing to gate against here.

This post is for ML engineers running OpenAI, Anthropic, or Bedrock batch jobs and wondering why the eval suite that worked on sync traffic misses every batch-shaped failure. The position it argues: batch inference saves about 50 percent on cost but introduces three eval gotchas. Deferred execution kills the feedback loop, batch-vs-sync quality drift is real, and failed-row recovery is your job, not the provider’s. The eval that matches batch is async-friendly, completion-rate-aware, and drift-aware against the same model called synchronously.

The methodology is code-defined against the ai-evaluation SDK, instrumented with traceAI, and includes the gateway pattern for cost attribution when batch sits next to sync traffic.

Why batch needs a different eval pattern

A synchronous LLM call is one trace, one response, latency in seconds. A batch job inverts every property of that contract.

The SLA window is 24 hours, not seconds. OpenAI Batch and Anthropic Batches promise completion inside 24 hours; Bedrock batch inference is similar. A batch submitted at midnight can return at 3 a.m. or at 11 p.m., 23 hours later. Your eval can’t fire at submission time because the prompt hasn’t been answered yet.

Failure modes are partial, not all-or-nothing. A synchronous call returns or errors. A batch ships a .jsonl file where some rows succeeded, some failed with provider errors, some hit content filters, some timed out, and some were silently retried with subtly different formatting. The aggregate failure rate hides which rows failed and what they had in common.

Per-row attribution disappears. A 100k-row batch is one job ID. When the result file lands and the downstream pipeline reports a 12 percent quality regression, the on-call has no way to reconstruct which 12 percent without span-level per-row attribution.

Quality drifts inside the window. A 100k-row batch runs for hours. The provider can rotate snapshots, capacity pools, or backend routing inside that window. JSON-mode strictness varies between snapshots. The first 30k rows can land at 99 percent schema conformance and the last 30k at 86 percent, and the aggregate still looks fine.
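A quick arithmetic sketch shows how the aggregate can mask that drop. The first-30k and last-30k rates come from the paragraph above; the 99 percent rate for the middle 40k rows is an assumption made only for illustration.

```python
# Segment sizes and schema-conformance rates for a 100k-row batch.
# The middle segment's rate is an assumed value, not a measurement.
segments = [
    (30_000, 0.99),  # first 30k rows (from the text)
    (40_000, 0.99),  # middle 40k rows (assumed)
    (30_000, 0.86),  # last 30k rows (from the text)
]

total_rows = sum(n for n, _ in segments)
conformant = sum(n * rate for n, rate in segments)
aggregate = conformant / total_rows

print(f"aggregate conformance: {aggregate:.1%}")
print(f"worst segment:         {min(rate for _, rate in segments):.1%}")
```

Under that assumption the batch reports roughly 95 percent overall while the last 30k rows sit at 86 percent, which is why a batch-wide number alone is not a safe gate.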

Real-time eval covers none of these. Batch eval has to. For the broader pattern batch eval inherits from, see the 2026 LLM evaluation playbook and the external evaluation pipelines walkthrough.

Gotcha 1: the deferred-execution feedback loop

The standard inner loop reads: submit a prompt, score the response, block the release if the score drops. That loop assumes the response comes back in seconds. Batch breaks it. The submission goes out, the eval has nothing to score for hours, the next batch is already queued by the time scoring finishes, and the “gate” lands on yesterday’s prompt against today’s model.

The fix is two loops, not one.

The inner loop runs synchronously on a small canary. Sample 1-2 percent of the batch input, call the same model through the sync API, score the response with the standard rubric set, and gate the bulk batch submission on that score. This is the release gate.

The outer loop runs against the batch result file when it lands. Score every completed row with the same rubric, plus the batch-specific judges below. This is the next-batch gate, not this-batch. The artifact is a saved evaluation dataset, a versioned rubric, and a trend line of completion rate, per-row quality, and unit cost across batches.
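As a sketch of the outer loop's next-batch gate, the snippet below compares the latest batch against the trailing mean of earlier batches on the three trend-line metrics named above. The `BatchSummary` type, the `next_batch_gate` function, and the tolerance values are hypothetical; they are not part of any SDK.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class BatchSummary:
    job_id: str
    completion_rate: float   # completed rows / submitted rows
    mean_quality: float      # mean rubric score over completed rows, 0..1
    unit_cost: float         # cost per completed row, in dollars

def next_batch_gate(history, latest, max_drop=0.02, max_cost_rise=0.10):
    """Return the reasons the next submission should be held (empty = go)."""
    if not history:
        return []  # nothing to compare against yet
    reasons = []
    base_completion = mean(b.completion_rate for b in history)
    base_quality = mean(b.mean_quality for b in history)
    base_cost = mean(b.unit_cost for b in history)
    if base_completion - latest.completion_rate > max_drop:
        reasons.append("completion_rate dropped")
    if base_quality - latest.mean_quality > max_drop:
        reasons.append("mean_quality dropped")
    if latest.unit_cost > base_cost * (1 + max_cost_rise):
        reasons.append("unit_cost rose")
    return reasons

history = [
    BatchSummary("job-1", 0.97, 0.91, 0.0050),
    BatchSummary("job-2", 0.98, 0.92, 0.0051),
]
latest = BatchSummary("job-3", 0.88, 0.92, 0.0058)
print(next_batch_gate(history, latest))
```

Keeping the gate as a list of reasons rather than a single boolean makes it easier to see whether a hold came from completion, quality, or cost.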

import random

import openai
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, TaskCompletion
from fi.testcases import TestCase

ev = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env
client = openai.OpenAI()

def sync_canary(rows, n=50, seed=0):
    # Random sample rather than the first n rows, so the canary is not
    # biased toward whatever sits at the top of the input file.
    sample = random.Random(seed).sample(rows, min(n, len(rows)))
    cases = []
    for row in sample:
        prompt = row["body"]["messages"][-1]["content"]
        resp = client.chat.completions.create(
            model=row["body"]["model"],
            messages=row["body"]["messages"],
        )
        cases.append(TestCase(
            input=prompt,
            output=resp.choices[0].message.content,
            # "metadata" is a pipeline-side field, not part of the batch
            # request line, so tolerate rows that do not carry it.
            context=row.get("metadata", {}).get("retrieval_context", ""),
        ))
    return ev.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), TaskCompletion()],
        inputs=cases,
    )

# Gate the bulk batch on canary scores.
# input_rows: the parsed lines of the batch input .jsonl file.
canary = sync_canary(input_rows, n=50)
if canary.aggregate_score < 0.85:
    raise RuntimeError("Canary failed; do not submit bulk batch.")

The decoupled loop is the only pattern that survives the SLA. Don’t try to make the bulk batch eval gate the bulk batch release. It can’t.

Gotcha 2: batch-vs-sync quality drift

The provider documentation says the Batch API uses the same model. Production logs disagree. The same model snapshot called through the batch path can lose 2-5 points on the same rubric against the sync path, and the gap moves with provider releases. Plausible reasons: the batch path routes through a relaxed capacity pool, JSON-mode strictness is dialed differently for high-throughput batched calls, or the snapshot under the same alias rotates inside the 24-hour window. The exact mechanism is opaque to customers; the drift is not.

The audit is a paired comparison. Take 200-500 rows from production, send each through both the sync API and the Batch API to the same model alias, and score both responses with the same rubric set. Anything above a 2-point gap on any rubric is drift worth alarming on.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

batch_sync_drift = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "batch_sync_drift",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Compare two responses to the same prompt: one from the synchronous "
            "API and one from the Batch API. Score on grounding, schema "
            "conformance, tool-call validity, and refusal calibration. "
            "Return 0.0 if the batch response is materially worse, 0.5 if "
            "equivalent, 1.0 if the batch response is materially better."
        ),
    },
)

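def drift_audit(rows, sync_fn, batch_results):
    # Batch output order is not guaranteed to match input order,
    # so pair rows on custom_id instead of zipping by position.
    batch_by_id = {r["custom_id"]: r for r in batch_results}
    scores = []
    for row in rows:
        batch_row = batch_by_id.get(row["custom_id"])
        if not batch_row or not batch_row.get("response"):
            continue  # failed rows belong to the completion track
        sync_out = sync_fn(row)
        batch_out = batch_row["response"]["body"]["choices"][0]["message"]["content"]
        out = batch_sync_drift.compute_one(CustomInput(
            question=row["body"]["messages"][-1]["content"],
            answer_a=sync_out, answer_b=batch_out,
        ))["output"]
        scores.append(out)
    if not scores:
        raise ValueError("No completed pairs to compare.")
    return sum(scores) / len(scores)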

A drift score below 0.45 across 200+ pairs is a clear regression; the batch path is materially worse than sync on your workload. A score between 0.45 and 0.50 is a soft drift to monitor. Run the audit weekly and treat the trend line as a leading indicator on provider snapshot rotation.
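The thresholds above can be turned into a small triage helper. This is a sketch: `drift_band` and `rubric_gaps` are hypothetical names, the 200-pair minimum and the 0.45/0.50 bands come from the text, and the per-rubric scores are assumed to be on a 0-100 scale so the 2-point gap applies directly.

```python
from statistics import mean

def drift_band(mean_score, n_pairs, min_pairs=200):
    """Map the mean judge score (0.5 = parity) to the bands in the text."""
    if n_pairs < min_pairs:
        return "insufficient_sample"
    if mean_score < 0.45:
        return "regression"
    if mean_score < 0.50:
        return "soft_drift"
    return "ok"

def rubric_gaps(sync_scores, batch_scores, max_gap=2.0):
    """Return rubrics where batch trails sync by more than max_gap points.

    Both arguments map rubric name -> list of per-row scores on a 0-100 scale.
    """
    flagged = {}
    for rubric, sync_vals in sync_scores.items():
        gap = mean(sync_vals) - mean(batch_scores[rubric])
        if gap > max_gap:
            flagged[rubric] = round(gap, 2)
    return flagged

sync = {"groundedness": [92, 94, 90], "schema": [99, 98, 100]}
batch = {"groundedness": [91, 93, 90], "schema": [95, 94, 96]}
print(drift_band(0.47, n_pairs=250))
print(rubric_gaps(sync, batch))
```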

The arena-judge pattern under the hood is the same one detailed in the LLM arena-as-a-judge guide; the difference is the two answers come from the same model through two paths, not two models through one path. For batch-shaped failures on structured output specifically, the evaluating LLM structured output modes guide goes deeper.

Gotcha 3: completion-rate-aware eval and failed-row recovery

A batch that completes 88 percent of rows is not a quality regression. It is a different shape of failure, and conflating the two is how teams ship a 95-percent-quality job at 88-percent completion and tell finance everything is fine.

Score completion and quality on separate tracks.

The completion track classifies every row by status. Read the result file row-by-row, attach batch.row_status to the trace span, and bucket every row into one of: completed, provider_error, content_filter, timeout, retry_exhausted, silently_retried. The OpenAI Batch API returns either a response block or an error block per row; Anthropic returns a per-request status; Bedrock writes per-row markers in the result file.

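def classify_status(row):
    error = row.get("error")
    if error:
        # The code can be null in the result file; normalize before matching.
        code = (error.get("code") or "").lower()
        if "content" in code:
            return "content_filter"
        if "timeout" in code:
            return "timeout"
        return "provider_error"
    if not row.get("response"):
        return "retry_exhausted"
    # "retried" is assumed here to be a flag set by your own pipeline
    # or gateway; it is not a field the provider result file carries.
    if row.get("retried"):
        return "silently_retried"
    return "completed"

statuses = [classify_status(r) for r in rows]
completion_rate = statuses.count("completed") / len(rows) if rows else 0.0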

The quality track scores only completed and silently_retried rows. Treat the retried bucket separately: providers can serve retried rows from a different snapshot with subtly different formatting, and the output distribution shifts. Compare per-rubric scores on completed vs silently_retried to surface that case.
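One way to make that comparison concrete is to group scored rows by status and diff the per-rubric means. The row shape below (a `status` field plus a `scores` dict) is an assumption about how your pipeline stores results, and `retried_vs_completed` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def retried_vs_completed(scored_rows):
    """Per-rubric mean(completed) - mean(silently_retried).

    Each row is assumed to look like
    {"status": "completed", "scores": {"groundedness": 0.9, ...}}.
    """
    by_status = defaultdict(lambda: defaultdict(list))
    for row in scored_rows:
        if row["status"] in ("completed", "silently_retried"):
            for rubric, score in row["scores"].items():
                by_status[row["status"]][rubric].append(score)

    deltas = {}
    for rubric, completed in by_status["completed"].items():
        retried = by_status["silently_retried"].get(rubric)
        if retried:  # skip rubrics with no retried rows to compare
            deltas[rubric] = mean(completed) - mean(retried)
    return deltas

rows = [
    {"status": "completed", "scores": {"schema": 1.0}},
    {"status": "completed", "scores": {"schema": 1.0}},
    {"status": "silently_retried", "scores": {"schema": 0.5}},
    {"status": "provider_error", "scores": {}},
]
print(retried_vs_completed(rows))
```

A positive delta on a rubric means retried rows score worse than first-pass rows on it.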

Failed-row recovery is the engineering job downstream. Three strategies, each correct for some workloads:

  • Drop and proceed for tolerant pipelines where 2-3 percent loss is acceptable.
  • Retry on the sync API and pay the markup when the downstream deadline is tight.
  • Halt and require manual intervention for regulated pipelines where dropped rows are a violation.
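The three strategies can be expressed as an explicit policy so the choice is reviewable rather than buried in a retry loop. A minimal sketch, in which the `Recovery` enum, `plan_recovery`, and the 3 percent loss budget are all illustrative assumptions:

```python
from enum import Enum

class Recovery(Enum):
    DROP = "drop"            # tolerant pipelines
    SYNC_RETRY = "sync"      # tight downstream deadline
    HALT = "halt"            # regulated pipelines

def plan_recovery(statuses, policy, max_loss=0.03):
    """Decide what to do with failed rows given per-row status labels.

    Returns (action, failed_indices). action is "proceed", "retry_sync",
    or "halt".
    """
    failed = [i for i, s in enumerate(statuses)
              if s not in ("completed", "silently_retried")]
    if not failed:
        return "proceed", []
    if policy is Recovery.HALT:
        return "halt", failed
    if policy is Recovery.SYNC_RETRY:
        return "retry_sync", failed
    # DROP: proceed only while the loss stays inside the tolerated budget.
    loss = len(failed) / len(statuses)
    return ("proceed" if loss <= max_loss else "halt"), failed

statuses = ["completed"] * 97 + ["timeout", "provider_error", "content_filter"]
print(plan_recovery(statuses, Recovery.DROP))
print(plan_recovery(statuses, Recovery.SYNC_RETRY)[0])
```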

The eval doesn’t pick the strategy; it surfaces whether the strategy is working. Cluster the failed rows with HDBSCAN on prompt embeddings to find the shared shape (every Japanese-text row, every prompt above 50k tokens, every content-filter-adjacent topic). The shape is what to fix. The agent passes evals but fails in production guide covers the broader cluster-then-fix pattern.
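Before reaching for embeddings, a cheap first pass is to bucket failed rows by coarse prompt features and compare failure rates per bucket. The sketch below is a stand-in for, not an implementation of, the HDBSCAN clustering described above; the feature definitions (a 4-characters-per-token length estimate, a non-ASCII check) and the function names are assumptions.

```python
from collections import Counter

def prompt_shape(prompt, long_tokens=50_000):
    """Coarse, hypothetical shape label for a prompt."""
    approx_tokens = len(prompt) / 4          # rough heuristic, not a tokenizer
    length = "long" if approx_tokens > long_tokens else "short"
    script = "non_ascii" if any(ord(c) > 127 for c in prompt) else "ascii"
    return f"{length}/{script}"

def failure_rate_by_shape(rows):
    """rows: iterable of (prompt, failed: bool). Returns shape -> failure rate."""
    totals, failures = Counter(), Counter()
    for prompt, failed in rows:
        shape = prompt_shape(prompt)
        totals[shape] += 1
        failures[shape] += int(failed)
    return {shape: failures[shape] / totals[shape] for shape in totals}

rows = [
    ("Summarize this invoice.", False),
    ("Classify the ticket.", False),
    ("この文章を要約してください。", True),
    ("請將這段文字分類。", True),
    ("Translate to French.", True),
]
print(failure_rate_by_shape(rows))
```

A bucket whose failure rate sits far above the batch-wide rate is a candidate shape to inspect before running the heavier clustering pass.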

The unit cost number finance actually cares about:

def cost_per_completed_row(batch_cost, sync_retry_cost, completed_validated):
    return (batch_cost + sync_retry_cost) / completed_validated

Failed rows still consume retries. The honest cost-per-useful-output is higher than the batch invoice. The unit economics chart that excludes retry cost lies.
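A worked example with made-up numbers shows the gap between the invoice figure and the honest one. All dollar amounts and row counts below are hypothetical.

```python
def cost_per_completed_row(batch_cost, sync_retry_cost, completed_validated):
    if completed_validated <= 0:
        raise ValueError("no completed rows to amortize cost over")
    return (batch_cost + sync_retry_cost) / completed_validated

submitted = 100_000
completed_validated = 94_000   # rows that finished and passed validation
batch_cost = 500.00            # batch invoice, dollars
sync_retry_cost = 60.00        # failed rows re-run on the sync API

naive = batch_cost / submitted
honest = cost_per_completed_row(batch_cost, sync_retry_cost, completed_validated)

print(f"invoice view: ${naive:.5f} per submitted row")
print(f"honest view:  ${honest:.5f} per completed row")
print(f"understatement: {honest / naive - 1:.1%}")
```

With these inputs the invoice view understates the per-row cost by close to a fifth, purely from retry spend and the smaller denominator.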

Mid-batch drift: the segment-level view

Long batches drift mid-run. A 100k-row job running for 14 hours hits different provider snapshots, capacity pools, and prompt-distribution slices over its lifetime. The aggregate hides the drift; the segment view exposes it.

Segment the result file into chunks of 1k or 5k rows in submission order. Compute per-segment conformance to the canonical schema. Alarm when a segment falls more than 5 points below the batch-wide rate.

schema_drift = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "schema_drift_across_batch",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Given the canonical output schema, a row's parsed response, and "
            "the field-level expectations, return 1.0 if conformant, 0.0 if "
            "not. Identify which schema field broke."
        ),
    },
)

def per_segment_drift(rows, segment_size=5000):
    # rows must be sorted into submission order first; the result file
    # is not guaranteed to preserve it.
    # schema_passes(row) -> bool is your own conformance predicate.
    passed = [schema_passes(r) for r in rows]  # evaluate each row once
    if not passed:
        return
    overall = sum(passed) / len(passed)
    for i in range(0, len(passed), segment_size):
        seg = passed[i:i + segment_size]
        seg_rate = sum(seg) / len(seg)
        if overall - seg_rate > 0.05:
            print(f"DRIFT segment {i // segment_size}: {seg_rate:.2%} vs batch {overall:.2%}")
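The `schema_passes` predicate used above is not defined in this post. A minimal local version might look like the following, assuming OpenAI-shaped result rows whose message content should be a JSON object with a fixed set of required keys. The `REQUIRED_FIELDS` mapping is a placeholder for your own schema.

```python
import json

# Placeholder schema: field name -> expected Python type after json.loads.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def schema_passes(row, required=REQUIRED_FIELDS):
    """True if the row's response content parses as JSON and matches the schema."""
    try:
        content = row["response"]["body"]["choices"][0]["message"]["content"]
        parsed = json.loads(content)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        return False  # missing response, empty choices, or malformed JSON
    if not isinstance(parsed, dict):
        return False
    return all(
        key in parsed and isinstance(parsed[key], expected)
        for key, expected in required.items()
    )

def make_row(content):
    return {"response": {"body": {"choices": [{"message": {"content": content}}]}}}

print(schema_passes(make_row('{"label": "spam", "confidence": 0.93}')))
print(schema_passes(make_row('{"label": "spam", "confidence": 0.93')))  # lost closing brace
print(schema_passes({"response": None}))
```

A deterministic check like this is cheap enough to run on every row; the LLM judge can then be reserved for rows where the field-level diagnosis matters.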

Three drift patterns recur. JSON-mode strictness loosens as the provider rotates capacity, so a late segment loses closing braces on a third of rows. Prompt-distribution shift hits a category the prompt wasn’t tuned for (long inputs, non-Latin scripts, edge cases), so a contiguous 5k segment underperforms. Retry-induced drift: silently retried rows reroute through a different code path with subtly different formatting and contaminate the segment they land in. The shape of the divergence tells you which pattern is in play.

Instrumenting batch with traceAI

Production telemetry on a batch job is one parent span and one child span per row.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, SpanKind
from opentelemetry import trace

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="batch-rag-ingest",
)
tracer = trace.get_tracer("batch_eval")

with tracer.start_as_current_span(
    "batch_job",
    attributes={
        "fi.span.kind": SpanKind.CHAIN.value,
        "batch.job_id": job_id,
        "batch.provider": "openai",
        "batch.row_count": len(result_rows),
    },
):
    for i, row in enumerate(result_rows):
        # In the OpenAI result file the resolved model sits on the
        # response body, not on the top level of the row.
        body = (row.get("response") or {}).get("body") or {}
        with tracer.start_as_current_span(
            "batch_row",
            attributes={
                "fi.span.kind": SpanKind.LLM.value,
                "batch.row_id": row["custom_id"],
                "batch.row_status": classify_status(row),
                "batch.segment": i // 5000,
                "llm.model_name": body.get("model", "unknown"),
            },
        ):
            pass  # rubric scores attach as span events

The trace tree lets you audit a single row in a 100k-row job by filtering on batch.row_id or batch.row_status. A partial-batch failure now has the row, the prompt, the response, and the rubric score on one screen. Auto-instrumentors handle the sync canary path without code changes; the batch retrieval loop is the only manual span you write. PII redaction runs on attributes before export. The traceAI OpenTelemetry walkthrough covers the full surface across 50+ AI integrations.

The gateway pattern for batch cost attribution

When batch sits next to sync traffic, cost attribution gets messy. The provider invoice mixes the two. The wrapper-reported cost drifts against the invoice on cached prompt tokens and inter-release price changes. The honest unit-economics number requires a per-call cost reading from the same code path that handled the call.

Agent Command Center routes batch submissions to OpenAI, Anthropic, or Bedrock through the same OpenAI-compatible base URL as sync traffic, and returns telemetry headers on every response.

from openai import OpenAI

client = OpenAI(
    api_key="sk-agentcc-...",
    base_url="https://gateway.futureagi.com/v1",
)

# Same SDK call shape. The parsed Batch object does not expose HTTP
# headers, so go through with_raw_response to read them.
raw = client.batches.with_raw_response.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
batch = raw.parse()  # the usual Batch object
print(raw.headers.get("x-agentcc-cost"))           # discounted batch unit cost
print(raw.headers.get("x-agentcc-latency-ms"))     # batch wall time
print(raw.headers.get("x-agentcc-model-used"))     # resolved provider model
print(raw.headers.get("x-agentcc-fallback-used"))  # true if gateway re-fired failed rows over sync

The gateway aggregates per-row headers into a per-batch summary. Finance gets the canonical discounted-cost number; engineering gets the per-row breakdown for drift attribution. The gateway ships as a single Go binary under Apache 2.0, with 100+ providers, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, OTel/Prometheus observability, and MCP + A2A protocol support. The README benchmark reports ~29k req/s and a P99 of 21 ms with guardrails on, on a t3.xlarge. See the enterprise LLM gateway cost-tracking walkthrough and best LLM cost tracking tools (2026).
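A per-batch summary of that kind could be assembled along these lines. The record shape is an assumption: each row is taken to carry the values of the `x-agentcc-cost` and `x-agentcc-fallback-used` headers as strings, which is not a documented format, and `summarize_batch` is a hypothetical helper.

```python
def summarize_batch(row_headers):
    """Aggregate per-row telemetry headers into one per-batch summary.

    row_headers: list of dicts keyed by header name, values as strings.
    """
    batch_cost = fallback_cost = 0.0
    fallback_rows = 0
    for headers in row_headers:
        cost = float(headers.get("x-agentcc-cost", "0"))
        if headers.get("x-agentcc-fallback-used", "false").lower() == "true":
            fallback_cost += cost
            fallback_rows += 1
        else:
            batch_cost += cost
    return {
        "rows": len(row_headers),
        "fallback_rows": fallback_rows,
        "batch_cost": round(batch_cost, 6),
        "fallback_cost": round(fallback_cost, 6),
        "total_cost": round(batch_cost + fallback_cost, 6),
    }

rows = [
    {"x-agentcc-cost": "0.0010", "x-agentcc-fallback-used": "false"},
    {"x-agentcc-cost": "0.0010", "x-agentcc-fallback-used": "false"},
    {"x-agentcc-cost": "0.0020", "x-agentcc-fallback-used": "true"},
]
print(summarize_batch(rows))
```

Splitting batch spend from sync-fallback spend keeps the discounted number and the recovery markup visible as separate lines.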

Sync vs Batch: the honest tradeoff

| Axis | Synchronous API | Batch API |
| --- | --- | --- |
| Latency | Seconds | Up to 24 hours |
| Price (vs sync baseline) | 1.0x | ~0.5x on OpenAI; similar on Anthropic, Bedrock |
| Failure shape | All-or-nothing per call | Partial across rows |
| Eval loop fit | Block-on-fail at submit time | Two loops: sync canary + post-hoc batch |
| Per-row attribution | Native (one trace per call) | Manual span emission required |
| Cost attribution | Per-call invoice line | Per-batch aggregate (split via gateway headers) |
| Quality drift risk | Snapshot rotation between releases | Snapshot rotation inside 24-hour window |
| Right call when | Latency < hours; per-request gating | 50k+ rows/day; offline pipeline; cost-bound |

The Batch API wins on cost when volume is high and latency tolerance is hours. The synchronous API wins when the workload needs real-time gating or rows per day are small enough that operational overhead (polling, recovery, drift audit) outweighs the discount. Many production pipelines run both: sync for the canary and the recovery path, batch for the bulk.
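The tradeoff reduces to a break-even check: the discount has to cover the sync canary, the sync retries, and the operational overhead. A rough sketch with hypothetical prices, rates, and overhead; only the ~0.5x batch multiplier comes from the table above.

```python
def batch_net_savings(rows, sync_cost_per_row, batch_multiplier=0.5,
                      failure_rate=0.03, canary_fraction=0.02,
                      overhead_per_run=0.0):
    """Dollars saved by running `rows` through batch instead of sync.

    Failed rows are assumed to be billed once on the batch path and
    retried on sync at full price; the canary runs on sync at full price.
    """
    all_sync = rows * sync_cost_per_row
    batch_bulk = rows * sync_cost_per_row * batch_multiplier
    sync_retries = rows * failure_rate * sync_cost_per_row
    canary = rows * canary_fraction * sync_cost_per_row
    return all_sync - (batch_bulk + sync_retries + canary + overhead_per_run)

# Hypothetical: $0.01 per row on sync, $40 of engineering/ops time per run.
for rows in (5_000, 1_000_000):
    saved = batch_net_savings(rows, 0.01, overhead_per_run=40.0)
    print(f"{rows:>9,} rows: net savings ${saved:,.2f}")
```

With these inputs the 5k-row job loses money to overhead while the 1M-row job keeps most of the discount, which is the volume effect the paragraph above describes.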

How Future AGI ships batch evaluation

The eval stack as a package. Start with the SDK and the async submission primitive for code-defined batch eval. Layer in traceAI for per-row attribution. Move to the gateway when cost and governance need one place to live.

  • ai-evaluation SDK (Apache 2.0). 50+ EvalTemplate classes (Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, DataPrivacyCompliance). CustomLLMJudge covers the batch-specific judges (SchemaDriftAcrossBatch, CostPerCompletedRow, BatchSyncDrift, PartialBatchResilience). .submit() returns an Execution handle immediately and .get_execution(id) polls non-blocking, so scoring 100k rows doesn’t block the eval driver. Local heuristic metrics (regex, JSON schema, BLEU, ROUGE, semantic similarity) run offline at sub-second latency for high-volume schema checks.
  • traceAI. OpenInference spans across 50+ AI surfaces in Python, TypeScript, and Java. Auto-instrumentation for OpenAI, Anthropic, LangChain, Gemini, Groq, Portkey, with PII redaction at the attribute layer. The batch pattern is one parent CHAIN span and one child LLM span per row, with batch.job_id, batch.row_id, batch.row_status, and batch.segment as attributes for filtering and aggregation.
  • Agent Command Center. Single Go binary, Apache 2.0, 100+ providers with native shape preservation, 18+ built-in guardrail scanners plus 15 third-party adapters, exact and semantic caching, MCP and A2A protocol support. Per-call telemetry headers (x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used) collapse cost-per-completed-row from a reconciliation script to a header read. Sync canary and bulk batch route through the same base URL.
  • Future AGI Platform. Self-improving evaluators retune from production feedback at lower per-eval cost than Galileo Luna-2, which makes weekly full-batch rescores budgetable. Error Feed clusters failing rows with HDBSCAN over ClickHouse-stored embeddings; a Sonnet 4.5 Judge agent writes the immediate_fix per cluster (“JSON-mode strictness loosened in segment 18 onwards”, “every Japanese-text row fell into one failure cluster”, “silently retried rows came back through a different snapshot”). The fix text feeds back into the next batch’s submission shape. Linear is the only Error Feed external integration today; Slack, GitHub, Jira, PagerDuty are roadmap.

Drop ai-evaluation with async .submit() into the sync canary and the post-hoc batch eval this afternoon. Add the traceAI parent-and-child span pattern when jobs cross 10k rows. Move to the gateway when cost attribution, fallback recovery, and per-call governance need one place to live across batch and sync.

Ready to evaluate your first batch workload? Run pip install ai-evaluation traceai-openai, scaffold a sync canary on 50 rows, wrap the retrieval loop in a CHAIN span, and score completed rows with Groundedness, TaskCompletion, and BatchSyncDrift. A workload that survives all three gotchas is worth the discount; the rest is a regression hiding in an invoice.

Frequently asked questions

Why does batch LLM inference need a different evaluation pattern?
A batch job inverts every property a real-time eval setup is built on. Latency moves from seconds to a 24-hour SLA, so you can't gate a release on the result. Failure is partial rather than all-or-nothing, so a job completes 88 percent and the on-call has to find which 12 percent broke. Quality drifts mid-batch because providers rotate snapshots over a multi-hour window. And per-row attribution disappears into one job ID. Real-time eval treats every call as its own trace and triggers immediately. Batch eval has to score the row, score the segment of the batch the row belongs to, and reconcile the result against the same model called synchronously, all hours after the prompt was submitted. The eval pattern that matches batch is async-friendly, completion-rate-aware, and drift-aware against the sync baseline.
What is the deferred-execution feedback loop problem?
The OpenAI Batch API, Anthropic Batches, and Bedrock batch inference all defer completion up to 24 hours. The eval can't run until the batch returns, so the standard CI gate (submit, score, block-on-fail) doesn't fit. You submit at midnight, results land at 4 a.m., scoring finishes at 6 a.m., and by then the next batch is already in flight on yesterday's prompt. The fix is to decouple the eval loop from release gating. Use a paired sync canary on a 1-2 percent sample to score the prompt before submitting the bulk job, treat the full-batch eval as a next-day verification artifact, and alarm on drift between the sync canary score and the post-hoc batch score. The eval no longer gates the submission; it gates the next submission.
How do I detect batch-vs-sync quality drift?
Call the same model synchronously and through the Batch API on a 200-500 row paired sample drawn from production. Both paths nominally use the same model snapshot. In practice the batch path can route through a different capacity pool, a relaxed JSON-mode profile, or a snapshot that rotated inside the 24-hour window. Score both responses with the same rubric set (Groundedness, TaskCompletion, schema-conformance) and compare. A 2+ point gap on any rubric is drift. The most common patterns: batch loses 3-5 points on structured-output strictness, batch loses 1-2 points on long-context grounding, and batch trails sync on tool-call argument validity. Run the paired audit weekly because the drift signal moves with provider releases.
How does completion-rate-aware eval work?
A batch returns a JSONL file with three row classes: completed, failed (provider error, content filter, retry exhausted, timeout), and silently retried. Quality eval has to be scoped to completed rows so the rubric doesn't compare nulls. Completion eval has to be its own track that classifies the failure mode per row, segments the failure distribution by prompt shape (long context, non-Latin script, schema edge case), and surfaces the cost-per-completed-row, which counts retry spend. A job at 88 percent completion and 95 percent quality on the 88 percent is a different beast than a job at 99 percent completion and 80 percent quality. Treat completion and quality as independent axes; aggregate them only at the end for the unit economics chart.
When does the Batch API actually save money?
Above roughly 50-100k rows per day per workload. OpenAI ships Batch at 50 percent off both input and output tokens against the synchronous price; Anthropic and Bedrock have similar provider-specific discounts. The headline math is clean. The honest math subtracts retried-row cost (failed rows still bill on retry), the cost of running a sync paired canary for drift detection, and the engineering hours spent on the polling-and-recovery code path. For a 1M-row classification pipeline the batch path saves real money; for a 5k-row weekly job the operational overhead often outweighs the discount. Cost-per-completed-row is the metric finance signs off on, not the provider invoice.
How do I instrument a batch job with traceAI?
Emit one parent span with fi.span.kind=CHAIN for the batch job and one child span with fi.span.kind=LLM per row. Attach batch.job_id, batch.row_id, batch.row_status, batch.segment, and batch.provider as span attributes. The parent span carries aggregate cost and wall-time; the child spans carry per-row prompt, response, rubric scores, and the failure mode if the row didn't complete. The OpenAI and Anthropic instrumentors emit standard LLM spans on the synchronous canary path, and you wrap the batch retrieval loop in a manual span so retrieved rows ingest as child spans. The result is a trace tree where any row in a 100k-row job is filterable, and a partial-batch failure has the row, the prompt, the response, and the rubric score on one screen.
What does Future AGI ship for batch evaluation today?
The eval stack as a package. The ai-evaluation SDK (Apache 2.0) ships 50+ EvalTemplate classes (Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, DataPrivacyCompliance) plus CustomLLMJudge for the batch-specific judges (SchemaDriftAcrossBatch, CostPerCompletedRow, BatchSyncDrift) and async submission via .submit() and .get_execution() so scoring 100k rows doesn't block. The traceAI instrumentor emits OpenInference spans across 50+ AI surfaces in Python, TypeScript, and Java with PII redaction. Agent Command Center routes batch submissions to OpenAI, Anthropic, or Bedrock through a single Go binary (Apache 2.0, 100+ providers, 18+ built-in guardrail scanners) and returns per-call telemetry headers including x-agentcc-cost and x-agentcc-fallback-used, so cost-per-completed-row is a header read rather than a reconciliation script. The Future AGI Platform's self-improving evaluators retune from production failure feedback at lower per-eval cost than Galileo Luna-2.