What Are Concurrent Calls?
The number of simultaneous in-flight requests an LLM or voice-AI fleet is handling at any moment, used for capacity planning, autoscaling, and routing.
Concurrent calls are the number of simultaneous, in-flight requests an LLM or voice-AI fleet is handling at any moment. The metric matters at two layers. At the model-serving layer — vLLM, Ollama, NVIDIA NIM, hosted APIs — concurrency caps protect the engine from overload, and concurrency drives batching, KV-cache pressure, and tail latency. At the voice-agent layer — LiveKit, Pipecat — concurrency is the fleet-saturation signal: how many calls a replica handles before audio quality degrades. FutureAGI tracks concurrent calls as a span attribute and surfaces fleet-saturation effects in the observability dashboard alongside time-to-first-token, time-to-first-audio, audio quality, and ASR accuracy.
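At the serving layer, the cap is usually a semaphore-style limit on in-flight requests. A minimal sketch of the idea, with an assumed cap value and a stub standing in for the real inference call:

import asyncio

MAX_CONCURRENT = 64  # assumed per-replica cap; real engines expose this as config
inflight = asyncio.Semaphore(MAX_CONCURRENT)

async def call_llm(prompt: str) -> str:
    # Stand-in for the real inference call (vLLM, Ollama, NIM, hosted API).
    await asyncio.sleep(0.5)
    return "response"

async def handle_request(prompt: str) -> str:
    # Requests beyond the cap queue here, so the in-flight count never exceeds it.
    async with inflight:
        return await call_llm(prompt)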
Why Concurrent Calls Matter in Production LLM and Voice Systems
Concurrency is the lever that links cost to user-visible quality. Push it too low and you over-provision; push it too high and tail latency rises, audio glitches appear, and ASR accuracy drops. The cost-quality curve is non-linear: a modest increase in concurrency past a saturation point can swing p99 latency by 3-5x.
Pain shows up across roles. An SRE is paged on a “voice-agent quality drop” alarm and finds the underlying cause is a spike in concurrent sessions per replica, not a model regression. A finance lead sees inference cost rise 30% to hold the same user-visible latency after a launch boosts traffic. A product reviewer hears that voice-agent calls during peak hours sound choppy or repeat greetings, a concurrency-induced symptom rather than a model bug. A platform engineer cannot answer “how many concurrent sessions can a single replica safely handle?” because no benchmark was run against the production audio path.
For 2026 voice-AI fleets the question has intensified. Voice agents do not tolerate cold starts the way text agents do: every cold start is a pause the user notices. Concurrency must be tuned together with autoscaling so that quality holds during traffic spikes. Multi-agent text systems face a softer version of the same problem: high concurrency raises p99, which raises retry rate, which raises overall latency in a feedback loop.
How FutureAGI Handles Concurrent Calls
FutureAGI does not run the inference engine; vLLM, Ollama, NVIDIA NIM, or a hosted API owns model execution. FutureAGI treats concurrent calls as a causal load variable inside traces, not as a standalone infrastructure counter. Every request flows through a traceAI-* integration such as traceAI-livekit and emits OpenTelemetry spans tagged with session ID, agent name, replica ID, and agent.trajectory.step. Unlike a vLLM throughput counter or a Grafana CPU chart on its own, this view connects concurrency-per-replica to latency p99, time-to-first-audio, ASRAccuracy, AudioQualityEvaluator, and ConversationResolution, so a saturation event is visible across signal types.
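A sketch of what those span attributes look like with the plain OpenTelemetry API; traceAI integrations set them automatically, and the concurrency attribute key here is an assumption:

from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

def handle_call(session_id: str, replica_id: str, sessions_on_replica: int) -> None:
    with tracer.start_as_current_span("voice.call") as span:
        # Attributes named above; traceAI-livekit emits these for you.
        span.set_attribute("session.id", session_id)
        span.set_attribute("agent.name", "support-agent")  # illustrative name
        span.set_attribute("replica.id", replica_id)
        span.set_attribute("agent.trajectory.step", 1)
        # Concurrency recorded as a causal load variable (assumed key name).
        span.set_attribute("replica.concurrent_sessions", sessions_on_replica)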
A real workflow: a voice-agent fleet running on traceAI-livekit operates at 100 concurrent sessions per replica during the day. During an evening campaign, concurrency rises to 380 per replica. Time-to-first-audio p99 jumps from 380ms to 1.2s. AudioQualityEvaluator flags a 14% rise in glitch-rate and ASRAccuracy falls from 0.94 to 0.86. The team uses the simulate SDK’s LiveKitEngine to replay the load profile against a higher-replica autoscaling policy and confirm that quality holds at 200 concurrent sessions per replica with the new replica count. The new policy ships through the Agent Command Center routing layer.
For text LLM workloads, the same pattern applies: concurrent-calls-per-route is graphed against llm.token_count.prompt, p99 latency, and evaluator scores. When concurrency rises and quality drops, the routing-policy fallback ladder (cost-optimized → least-latency → strong-model) absorbs the shift.
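A minimal sketch of a fallback ladder; the thresholds and route-selection logic here are illustrative, not the Agent Command Center API:

LADDER = ["cost-optimized", "least-latency", "strong-model"]

def pick_route(concurrency: int, p99_ms: float, eval_score: float,
               knee: int = 200, p99_slo_ms: float = 1200.0,
               quality_floor: float = 0.9) -> str:
    # Walk down the ladder as load and quality pressure rise (illustrative logic).
    if concurrency < knee and p99_ms < p99_slo_ms:
        return LADDER[0]  # healthy: cheapest route wins
    if eval_score >= quality_floor:
        return LADDER[1]  # latency pressure only: route for speed
    return LADDER[2]      # quality slipping: fall back to the strong model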
How to Measure or Detect Concurrent Calls
Useful signals when monitoring concurrent calls:
- Concurrent-sessions-per-replica (dashboard signal): the canonical fleet-saturation metric for voice systems.
- Concurrent-requests-per-route (dashboard signal): the text-LLM equivalent; pair with model and tenant.
- ASRAccuracy: drops first under saturation in voice fleets; the leading-edge quality signal.
- AudioQualityEvaluator: flags glitch-rate, dropouts, and codec artifacts that spike with concurrency.
- Time-to-first-audio p99: paired with concurrency for capacity planning; the saturation knee is visible on the graph.
- ConversationResolution: the AI-side end-state metric that captures “did saturation actually break the call.”
Measure the knee, not just peak load. Run a ramp test with a fixed traffic mix, mark the first point where p99 or evaluator score breaks threshold, and store that as a route-level SLO. Re-run the test after model, TTS, ASR, or replica-size changes.
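A sketch of the knee-finding step on ramp-test output; the SLO value and sample data below are illustrative:

import numpy as np

P99_SLO_MS = 800.0  # assumed route-level SLO

def find_knee(ramp: dict[int, list[float]]) -> int | None:
    # ramp maps concurrency level -> latency samples (ms) collected at that level.
    for concurrency in sorted(ramp):
        if np.percentile(ramp[concurrency], 99) > P99_SLO_MS:
            return concurrency  # first level that breaks the SLO: the knee
    return None  # SLO held across the whole ramp

samples = {100: [320.0, 350.0], 200: [380.0, 420.0], 300: [900.0, 1300.0]}
print(find_knee(samples))  # -> 300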
Minimal Python:
from fi.evals import ASRAccuracy, AudioQualityEvaluator

ground_truth = "expected transcript of the call"  # reference text for this session

# Score one recorded session against the reference transcript.
asr = ASRAccuracy().evaluate(
    audio_path="/sessions/abc.wav",
    reference_text=ground_truth,
)
# Check the same recording for glitches, dropouts, and codec artifacts.
audio = AudioQualityEvaluator().evaluate(
    audio_path="/sessions/abc.wav",
)
print(asr.score, audio.score)
Common mistakes
- Targeting 100% concurrency. Sustained 100% utilization removes the headroom that absorbs provider jitter, retries, and replica restarts; capacity plans need a p95 operating band.
- Treating concurrency as a single number. Per-replica, per-route, per-tenant, and per-region views isolate different failures; one global average hides hot shards during regional launches.
- Ignoring cold-start cost. Aggressive scale-down creates the next pause after a traffic dip; tune scale-down with time-to-first-audio, p99 latency, and active call duration.
- No quality cross-check. A higher concurrency target is not safe unless ASRAccuracy, AudioQualityEvaluator, and latency stay inside the launch threshold for the same traffic cohort; see the gate sketch after this list.
- Hand-tuning thresholds in production. Replay representative call profiles with LiveKitEngine, then ship the new limit through routing policy or autoscaling config instead of editing production limits by hand.
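The gate from the quality-cross-check bullet might look like this; the threshold values are assumptions, not shipped defaults:

LAUNCH_THRESHOLDS = {"asr_accuracy": 0.92, "audio_quality": 0.90, "p99_ms": 800.0}

def concurrency_raise_is_safe(asr_score: float, audio_score: float, p99_ms: float) -> bool:
    # All three signals must hold for the same traffic cohort before raising the cap.
    return (asr_score >= LAUNCH_THRESHOLDS["asr_accuracy"]
            and audio_score >= LAUNCH_THRESHOLDS["audio_quality"]
            and p99_ms <= LAUNCH_THRESHOLDS["p99_ms"])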
Frequently Asked Questions
What are concurrent calls?
Concurrent calls are the number of simultaneous, in-flight requests an LLM or voice-AI fleet is handling at any moment. The metric drives capacity planning, autoscaling, and routing decisions across model and voice-agent serving layers.
How do concurrent calls differ from requests-per-second?
Requests-per-second is throughput — how many requests start in a second. Concurrent calls is the in-flight count. A 10 RPS workload with 4-second average latency runs 40 concurrent calls; the two metrics are related but distinct.
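The relationship is Little's Law: the in-flight count equals the arrival rate times the average time each request stays in flight.

def concurrent_calls(rps: float, avg_latency_s: float) -> float:
    # Little's Law: L = lambda * W.
    return rps * avg_latency_s

print(concurrent_calls(10, 4.0))  # -> 40.0, the example above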
How does FutureAGI measure concurrent-call effects?
FutureAGI captures concurrency as a span attribute via traceAI integrations and surfaces saturation effects through latency p99, AudioQualityEvaluator, and ASRAccuracy in the observability dashboard.