Infrastructure

What Is Batch Processing (ML)?

Bounded, high-throughput handling of inputs together; for LLM serving, grouping concurrent prompts through a serving engine, often via continuous batching.

Batch processing is the pattern of handling a bounded set of inputs together to maximize throughput per unit of compute. In ML it covers nightly feature jobs, offline model scoring, bulk evaluation, and warehouse-style training-data preparation. For LLM serving specifically, batch processing means grouping concurrent prompts so a single GPU forward pass handles many requests at once — most often using continuous batching, which mixes new requests into in-flight batches to keep the device busy. Unlike streaming, batch processing is bounded and periodic, and it is the right tool for scheduled evaluation runs and offline scoring of FutureAGI Datasets.
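
To make the serving-side idea concrete, here is a minimal sketch using vLLM's offline API rather than any FutureAGI interface; the model name, prompt list, and sampling settings are placeholders. Submitting many prompts in one call lets the engine's continuous-batching scheduler fill the GPU on its own:

from vllm import LLM, SamplingParams

# Placeholder model; any local HF-compatible checkpoint behaves the same way.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [f"Summarize ticket {i}: ..." for i in range(512)]

# One call, many prompts: the engine continuously batches these requests
# onto the GPU instead of running them one at a time.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])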

Why It Matters in Production LLM/Agent Systems

Batch processing is where compute becomes affordable. A naive serving setup that processes one prompt at a time leaves the GPU 70% idle; a properly batched setup runs at 80–95% utilization and cuts cost per token by 3–5x. The two common failure modes are over-aggressive batching (a long prompt in the batch stalls everyone behind it, blowing out time-to-first-token, TTFT) and under-batching (the engine never fills, GPUs sit idle, cost stays high).
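
That cost claim is easy to sanity-check with back-of-the-envelope arithmetic; the dollar figure and peak token rate below are illustrative assumptions, not benchmarks:

# Illustrative cost-per-token arithmetic (all numbers are assumptions).
GPU_HOURLY_COST = 2.50          # $/GPU-hour, placeholder
PEAK_TOKENS_PER_SEC = 4_000     # tokens/s the GPU could emit when fully batched

def cost_per_million_tokens(utilization: float) -> float:
    effective_tps = PEAK_TOKENS_PER_SEC * utilization
    tokens_per_hour = effective_tps * 3600
    return GPU_HOURLY_COST / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(0.30))  # under-batched: ~$0.58 per million tokens
print(cost_per_million_tokens(0.90))  # well-batched: ~$0.19 per million tokens

Under these assumptions the well-batched setup is exactly 3x cheaper per token; larger gaps in utilization push the ratio toward the top of the 3–5x range.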

The pain spreads across roles. Finance sees inference cost dominate the AI line item when batching is misconfigured. Platform engineers fight tradeoffs between GPU utilization and per-request latency. Data engineers run nightly batch jobs that overlap training jobs and create resource contention. Compliance teams need predictable nightly batch runs to score the previous day’s traces against Groundedness and PromptInjection thresholds — a missed batch is an audit gap.

In 2026 LLM stacks the boundary between streaming and batching is fuzzy: continuous batching keeps streams active for individual users while batching across users for GPU efficiency. The right framing is not “stream or batch” but “what cadence and what budget” — interactive chat is streaming-shaped; eval suites, regression tests, dataset rescoring, and synthetic-data generation are batch-shaped.

How FutureAGI Handles Batch Processing

Batch processing has no single FutureAGI anchor: it is a serving and scheduling pattern, not an individual evaluator. FutureAGI's approach is to make batch jobs first-class on the evaluation side: any Dataset can be scored in bulk through Dataset.add_evaluation, and any production cohort can be replayed through a scheduled batch.

A real workflow looks like this. A team’s nightly Airflow DAG samples 5,000 traces from the previous day’s traceAI spans, materializes them into a Dataset, and calls Dataset.add_evaluation with Groundedness, ContextRelevance, TaskCompletion, and PromptInjection. The job runs on a single batched LLM-as-judge endpoint with routing policy: cost-optimized — Haiku for short responses, Sonnet for long ones — and writes scores back per row. Cohort regressions are flagged the next morning.
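
A hedged sketch of that nightly job as an Airflow DAG follows; sample_yesterday_traces, build_dataset, and run_bulk_evaluation are hypothetical stand-ins for trace export, Dataset creation, and the bulk evaluator pattern shown further down, not FutureAGI APIs:

from datetime import datetime
from airflow.decorators import dag, task

def sample_yesterday_traces(limit):
    """Hypothetical: export yesterday's traceAI spans from storage."""
    raise NotImplementedError

def build_dataset(traces):
    """Hypothetical: materialize the sampled traces into a Dataset."""
    raise NotImplementedError

def run_bulk_evaluation(dataset):
    """Hypothetical: apply the bulk evaluator pattern shown below."""
    raise NotImplementedError

@dag(schedule="0 2 * * *", start_date=datetime(2026, 1, 1), catchup=False)
def nightly_trace_eval():
    @task
    def score_yesterday() -> None:
        # Sample, materialize, score: Groundedness, ContextRelevance,
        # TaskCompletion, and PromptInjection scores are written back per row.
        dataset = build_dataset(sample_yesterday_traces(limit=5_000))
        run_bulk_evaluation(dataset)
    score_yesterday()

nightly_trace_eval()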

A separate weekly batch runs the regression cohort against any candidate prompt or model from open pull requests, returning eval-gate verdicts to CI. For serving-side batching, the traceAI-vllm instrumentation records llm.token_count.prompt, queue time, and effective batch size so engineers can see whether continuous batching is filling the device. Unlike a Spark batch job that mainly tracks task state, FutureAGI keeps dataset version, evaluator score, prompt version, and trace in one record so the batch run is a real reliability artifact, not just a row of green ticks.
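
For the eval-gate side, here is a minimal sketch of the CI step that consumes the weekly batch's verdicts and fails the build when a cohort score drops; the threshold values, score names, and hard-coded results are illustrative assumptions:

import sys

# Illustrative floors; real gates would come from the team's release policy.
THRESHOLDS = {"groundedness": 0.85, "task_completion": 0.90}

def gate(cohort_scores: dict) -> int:
    failures = [name for name, floor in THRESHOLDS.items()
                if cohort_scores.get(name, 0.0) < floor]
    if failures:
        print(f"eval gate failed: {failures}")
        return 1
    print("eval gate passed")
    return 0

if __name__ == "__main__":
    # In practice these scores come from the weekly regression batch;
    # hard-coded here only to keep the sketch runnable.
    sys.exit(gate({"groundedness": 0.88, "task_completion": 0.93}))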

How to Measure or Detect It

Measure batch processing as both a throughput and a quality boundary (a metric-computation sketch follows the list):

  • Throughput (requests per second): the headline batch metric; rising throughput at constant quality is the goal.
  • GPU utilization: percentage of cycles doing model work; healthy continuous batching sits in the 80–95% range.
  • Effective batch size distribution: histogram of batch fill levels; a heavy left tail (many partly filled batches) means under-batching.
  • Per-request p50 / p99 latency under batch: confirms throughput gains did not destroy per-user experience.
  • Job success and rerun rate: percentage of scheduled batches that complete without manual rerun.
  • Evaluator coverage in the batch: number of evaluator classes attached per row; below 2 is usually too thin.
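
A small sketch of how several of these metrics fall out of per-request records; the field names, sample values, and the fill threshold are illustrative, not a traceAI schema:

from statistics import median, quantiles

# Illustrative per-request records; in practice these come from traceAI
# spans or serving-engine logs, and there would be far more of them.
records = [
    {"latency_ms": 420, "batch_size": 14},
    {"latency_ms": 380, "batch_size": 16},
    {"latency_ms": 910, "batch_size": 3},
    {"latency_ms": 450, "batch_size": 15},
]

latencies = sorted(r["latency_ms"] for r in records)
p50 = median(latencies)
p99 = quantiles(latencies, n=100)[98]        # 99th percentile

fills = [r["batch_size"] for r in records]
under_batched = sum(f < 8 for f in fills) / len(fills)   # threshold is illustrative

print(f"p50={p50}ms  p99={p99:.0f}ms  under-batched share={under_batched:.0%}")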

Bulk evaluator pattern over a Dataset:

from fi.evals import Groundedness, TaskCompletion

# Instantiate evaluators once, then score every row in the batched Dataset.
groundedness = Groundedness()
task_completion = TaskCompletion()

for row in dataset.rows:
    g = groundedness.evaluate(response=row.resp, context=row.ctx)
    t = task_completion.evaluate(input=row.task, response=row.resp)
    # Write both scores back to the row so the batch run stays queryable.
    dataset.add_evaluation(row.id, {"groundedness": g.score, "task": t.score})

Common Mistakes

  • Treating batch as the opposite of streaming: most modern LLM engines do continuous batching with streaming output simultaneously.
  • Sizing batches without latency guardrails: pushing batch size up keeps raising throughput well past the point where per-user TTFT becomes unacceptable, so throughput alone is a misleading target.
  • Running unscored batches: a nightly job that processes data without attaching evaluator scores is a silent operation, not a reliability artifact.
  • Mixing offline and online batches on the same hardware: nightly eval runs starve the serving fleet during the morning peak.
  • No re-run idempotency: a batch that double-writes a Dataset row corrupts the next eval cohort (a sketch follows this list).
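
One way to avoid the double-write problem in the last bullet is to key every write on a deterministic run ID so a rerun overwrites itself; a minimal sketch with an in-memory store standing in for the Dataset, not a FutureAGI API:

import hashlib

def run_key(dataset_version: str, run_date: str) -> str:
    """Deterministic key: rerunning the same batch produces the same key."""
    return hashlib.sha256(f"{dataset_version}:{run_date}".encode()).hexdigest()[:16]

def write_scores(store: dict, row_id: str, key: str, scores: dict) -> None:
    # Upsert keyed by (row, run): reruns replace rather than duplicate.
    store[(row_id, key)] = scores

store = {}
key = run_key("dataset-v12", "2026-03-01")
write_scores(store, "row-42", key, {"groundedness": 0.91})
write_scores(store, "row-42", key, {"groundedness": 0.91})   # rerun: no duplicate row
assert len(store) == 1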

Frequently Asked Questions

What is batch processing in ML?

Batch processing groups a bounded set of inputs and processes them together for high throughput. In ML it covers nightly feature jobs, offline scoring, and bulk evaluation. For LLM serving, it means grouping concurrent prompts through an engine, usually via continuous batching.

How is batch processing different from streaming processing?

Batch is bounded, periodic, high-throughput; streaming is unbounded, low-latency, per-event. For LLMs, batch groups requests for GPU efficiency while streaming sends tokens to the client. The same engine usually does both.

How do you measure batch processing?

Track throughput (requests per second), GPU utilization, effective batch size distribution, p50 and p99 per-request latency, and the FutureAGI evaluator scores attached when batch jobs score Datasets via Dataset.add_evaluation.