What Is Agent Utilization?
The share of an agent's total scheduled time spent actively handling contacts; the AI equivalent is fleet handled-time divided by total replica-hours.
Agent utilization is a contact-center workforce-management metric: the percentage of an agent’s scheduled time spent actively handling contacts. The standard formula is handle time divided by total scheduled time, multiplied by 100, where scheduled time includes breaks, training, meetings, and unavailable time. It differs from occupancy because occupancy uses only available time as the denominator. In FutureAGI-style AI-agent operations, the equivalent is fleet utilization: active session-time divided by replica-hours, then checked against latency and quality.
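The two formulas differ only in the denominator. A minimal sketch with illustrative numbers (not drawn from any particular deployment):

```python
def utilization(handle_hours: float, scheduled_hours: float) -> float:
    """Handle time over total scheduled time (breaks, training, meetings included)."""
    return 100.0 * handle_hours / scheduled_hours

def occupancy(handle_hours: float, available_hours: float) -> float:
    """Handle time over available time only (breaks and training excluded)."""
    return 100.0 * handle_hours / available_hours

# An agent scheduled for 8h, available for 6.5h, handling contacts for 5h:
print(utilization(5.0, 8.0))  # 62.5
print(occupancy(5.0, 6.5))    # ~76.9
```

The same handle time always yields a lower utilization than occupancy, which is why the two metrics carry different target thresholds.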
Why Agent Utilization Matters in Production LLM and Agent Systems
Utilization is the metric that ties staffing cost to productive work. For human reps, sustained low utilization (<55%) signals over-staffing or schedule inefficiency; sustained high utilization (>75%) signals under-staffing and rising attrition risk. WFM platforms use it to validate the schedule against the demand forecast.
For AI-agent fleets, the equivalent question is whether the fleet you are paying for is doing work. A voice-agent fleet with 80% utilization is busy; one at 25% is over-provisioned. But low utilization is not always waste — some idle headroom is the buffer that holds latency at SLA when traffic shifts. Unlike Erlang C staffing models, which estimate staffing from arrival rate, handle time, and target service level, AI-fleet utilization is measured from actual session spans and replica-hours. The right target depends on session-time variance and acceptable cold-start risk, both of which are observable properties on the AI fleet side that have no human equivalent.
Different roles see different views. A platform engineer plans autoscaling targets against utilization curves. An SRE alerts on utilization plus latency together — high utilization with latency creep means the fleet is saturating. A finance partner uses utilization to compute the cost-per-handled-call ratio and to forecast spend. A product reviewer rarely cares about utilization directly but feels its consequences when over-provisioning eats budget that could fund features.
In 2026, voice-AI fleets routinely autoscale on utilization plus a quality floor — for example, scale up when utilization exceeds 70% OR when AudioQualityEvaluator p95 dips below threshold. That joint-policy approach is the AI equivalent of WFM’s “shrinkage-adjusted” staffing math.
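A minimal sketch of such a joint policy. The data shape and threshold names are assumptions for illustration; the 70% target and quality floor mirror the example values in this article, not FutureAGI defaults:

```python
from dataclasses import dataclass

@dataclass
class FleetSnapshot:
    utilization: float        # fraction of replica-hours spent in active sessions
    audio_quality_p95: float  # p95 audio-quality score over sampled sessions

def should_scale_up(s: FleetSnapshot,
                    util_target: float = 0.70,
                    quality_floor: float = 0.85) -> bool:
    # Scale up when the fleet is busy OR when quality dips below the floor,
    # since saturation often shows up in quality before raw utilization.
    return s.utilization > util_target or s.audio_quality_p95 < quality_floor

print(should_scale_up(FleetSnapshot(0.65, 0.80)))  # True: quality floor breached
print(should_scale_up(FleetSnapshot(0.60, 0.90)))  # False: healthy headroom
```

The OR makes the quality floor a hard trigger: a fleet can breach SLA on quality while raw utilization still looks comfortable.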
How FutureAGI Handles Agent Utilization
FutureAGI does not measure human-rep utilization — that is a WFM platform’s job. The AI equivalent — voice-fleet utilization — is derived from traceAI session lifecycle spans. The traceAI-livekit and traceAI-pipecat integrations emit OpenTelemetry events for replica registration, session accept, session active, session end, and replica drain. Sum the active-session-time per replica per hour and divide by replica-hours; that is fleet utilization.
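Sketching that arithmetic over session lifecycle events. The tuple shape below is illustrative, not the exact traceAI span schema:

```python
from collections import defaultdict

# Each tuple: (replica_id, session_start_s, session_end_s) within a 1-hour window.
sessions = [
    ("replica-a", 0, 1800),
    ("replica-a", 2000, 3200),
    ("replica-b", 0, 900),
]
replica_hours = 2.0  # two replicas registered for the full hour

# Sum active-session time per replica...
active_s = defaultdict(float)
for replica, start, end in sessions:
    active_s[replica] += end - start

# ...then divide total active time by total replica-hours.
fleet_utilization = sum(active_s.values()) / (replica_hours * 3600)
print(f"{fleet_utilization:.0%}")
```

In production the per-replica sums in `active_s` are worth keeping: they are what lets you slice utilization by replica, region, or model version later.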
FutureAGI’s approach is to treat utilization as a capacity signal only when quality stays inside threshold. A replica running at 95% utilization that produces 18% glitch-rate audio is overloaded, not productive. AudioQualityEvaluator, ASRAccuracy, and ConversationResolution run continuously against sampled sessions, and the dashboard pairs utilization with quality scores so the autoscaler can target a quality-adjusted utilization band rather than raw busy-time.
Concrete example: a voice-AI fleet on LiveKit reports 78% raw utilization. FutureAGI’s AudioQualityEvaluator shows p95 audio score is 0.71 — below the 0.85 floor. Slicing by replica reveals 22% of replicas account for 80% of low-quality sessions. Those replicas are running on a degraded inference-engine version. The autoscaler target stays at 70% utilization, but the deploy gate gets a new check: replica health includes audio-quality output, not just process health. After fix, raw utilization stays similar but quality-adjusted utilization rises from 55% to 68%.
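One way to derive the quality-adjusted figure in that example. The weighting scheme here is an assumption; the article only states that utilization is paired with quality scores:

```python
def quality_adjusted_utilization(raw_utilization: float,
                                 quality_pass_rate: float) -> float:
    """Count only busy time whose sampled sessions meet the quality floor."""
    return raw_utilization * quality_pass_rate

# Before the fix: 78% raw utilization, but only ~70% of sessions pass the floor.
print(quality_adjusted_utilization(0.78, 0.70))  # ~0.55
# After the fix: similar raw utilization, but ~87% of sessions pass.
print(quality_adjusted_utilization(0.78, 0.87))  # ~0.68
```

Under this weighting, the fleet in the example gets more productive without getting busier, which is exactly what the raw number fails to show.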
For multi-agent text fleets, the same pattern applies via traceAI spans on traceAI-openai-agents, traceAI-langgraph, and traceAI-crewai: sum active-trajectory time per replica per hour to derive fleet utilization, pair with TaskCompletion for the quality-adjusted view.
How to Measure Agent Utilization
Utilization is best computed from session lifecycle events:
- fleet utilization (dashboard signal): sum(active-session-time) / sum(replica-hours); the headline AI-fleet KPI.
- quality-adjusted utilization (FutureAGI dashboard): utilization weighted by AudioQualityEvaluator or TaskCompletion p95 — the metric the autoscaler should actually target.
- ConversationResolution: scores whether handled time was productive — pairs with utilization to compute cost-per-resolved-session.
- AudioQualityEvaluator: floor-check that, paired with utilization, keeps the autoscaler from overloading the fleet.
- ASRAccuracy: scores transcript quality under load — degrades when fleet utilization saturates audio infra.
- agent.trajectory.step (OTel attribute): tagged with replica ID and active session ID for slicing utilization per region or per model.
Use rolling 5- or 15-minute windows rather than daily averages. Utilization spikes are short; smoothing hides overload that appears minutes later as p99 latency, transcript loss, or manual escalation alerts.
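A sketch of that rolling-window view in plain Python; the sampling scheme (periodic active/total replica counts) is an assumption about how the fleet is polled:

```python
from collections import deque

class RollingUtilization:
    """Fleet utilization over the last `window_s` seconds of samples."""

    def __init__(self, window_s: int = 300):  # 5-minute window
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, active_replicas, total_replicas)

    def add(self, ts: int, active: int, total: int) -> None:
        self.samples.append((ts, active, total))
        # Evict samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= ts - self.window_s:
            self.samples.popleft()

    def value(self) -> float:
        active = sum(a for _, a, _ in self.samples)
        total = sum(t for _, _, t in self.samples)
        return active / total if total else 0.0

roll = RollingUtilization(window_s=300)
for ts in range(0, 600, 60):        # ten 1-minute samples
    busy = 9 if ts >= 300 else 4    # a demand spike in the second half
    roll.add(ts, active=busy, total=10)
print(f"{roll.value():.0%}")        # the spike dominates the 5-minute window
```

A daily average over the same samples would report 65% and hide the spike entirely; the 5-minute window surfaces the overload while there is still time to scale.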
```python
from fi.evals import ConversationResolution, AudioQualityEvaluator

# Score whether the handled session actually resolved the contact.
res = ConversationResolution().evaluate(
    conversation=transcript,
)

# Floor-check audio quality for the same session.
audio = AudioQualityEvaluator().evaluate(
    audio_path="/sessions/abc.wav",
)

print(res.score, audio.score)
```
Common mistakes
- Confusing utilization with occupancy. Different denominators, different thresholds. For AI fleets, this hides warmup and drain time, producing targets that miss SLA.
- Targeting raw utilization without a quality floor. A fleet at 95% utilization producing degraded audio is overloaded; require AudioQualityEvaluator or ASRAccuracy gates.
- Ignoring cold-start in the utilization curve. New replicas inflate the denominator before they serve traffic; chart warmup lag separately during demand spikes.
- Using one target across heterogeneous traffic. Long-call and short-call routes have different optimal bands; segment by route, model, and region instead of global averages.
- Reading WFM utilization onto AI fleets one-to-one. Humans fail through burnout; AI fleets fail through cold starts, queueing, degraded media, and retries under heavy load.
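For the cold-start pitfall above, a sketch of excluding warming replicas from the denominator. The replica lifecycle states here are hypothetical names, not a specific platform's schema:

```python
# Replicas observed in a 15-minute window.
replicas = [
    {"id": "r1", "state": "active",  "active_s": 800},
    {"id": "r2", "state": "active",  "active_s": 300},
    {"id": "r3", "state": "warming", "active_s": 0},  # cold-starting, no traffic yet
]
window_s = 900

serving = [r for r in replicas if r["state"] != "warming"]

# Raw utilization counts warming replicas in the denominator...
raw = sum(r["active_s"] for r in replicas) / (len(replicas) * window_s)
# ...serving utilization excludes them and reports warmup lag separately.
serving_util = sum(r["active_s"] for r in serving) / (len(serving) * window_s)

print(f"raw={raw:.0%} serving={serving_util:.0%}")
```

During a demand spike the gap between the two numbers is itself a useful chart: it is the warmup lag the cold-start bullet tells you to track.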
Frequently Asked Questions
What is agent utilization?
Agent utilization is a contact-center metric: the share of an agent's total scheduled time spent actively handling contacts. It is calculated as handle time divided by total scheduled time, where scheduled time includes breaks and training.
How is agent utilization different from agent occupancy?
Utilization uses total scheduled time as the denominator (including break and training time). Occupancy uses only available time as the denominator (excluding breaks). Utilization is always lower than occupancy for the same handle time.
Does FutureAGI calculate agent utilization?
FutureAGI does not compute human-rep utilization — that is a workforce-management tool's job. The AI equivalent — voice-agent fleet utilization — is derived from traceAI session lifecycle spans on LiveKit and Pipecat integrations.