What Is P90 Latency?
The 90th-percentile latency value: 90% of requests are at or below it, and the slowest 10% take longer.
P90 latency is the 90th-percentile response time for a request population: 90% of requests finish at or below that value, and the slowest 10% take longer. It is an observability metric for production LLM and agent systems because it exposes tail latency without letting rare p99 outliers dominate. In FutureAGI traces, teams compute p90 over LLM, tool, retriever, gateway, or full-request spans from fields such as gen_ai.client.operation.duration plus model and route attributes.
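The percentile definition above can be made concrete with a few lines of code. This is a minimal sketch using the nearest-rank method; the `percentile` helper and the sample latencies are illustrative, not part of any FutureAGI API:

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample value such that
    at least q% of all samples are at or below it."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 100 hypothetical request latencies in ms: 1, 2, ..., 100
latencies_ms = list(range(1, 101))
p90 = percentile(latencies_ms, 90)
print(p90)  # 90 -- 90% of requests finish at or below 90 ms
```

With this method the result is always an observed sample, which keeps the number easy to explain: it is a real request's latency, not an interpolated value.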
Why P90 Latency Matters in Production LLM and Agent Systems
P90 latency catches the slow path that enough users actually feel. If a support copilot has a 900 ms p50 and a 7-second p90, half the traffic looks fine while one in ten sessions feels stalled. That pattern leads to repeated submits, manual escalations, tool retries, and timeout chains. It can also hide quality fallout: users abandon before the answer arrives, so product dashboards show low completion rather than an obvious model error.
The pain lands in different places. SREs see p90 and p99 latency spikes, retry storms, queue growth, and provider timeout errors. Developers see long spans for retrieval, tool calls, guardrails, or prompt-heavy model steps. Product teams see lower task completion for cohorts that hit slow routes. Compliance teams worry when users bypass an approved workflow because the AI path is too slow for regulated decisions.
P90 is especially useful for 2026-era agent systems because a single user turn can fan out into planner calls, retrievers, tools, policy checks, and a final response. A p50 view says the happy path is healthy. P90 shows whether common multi-step paths are degrading. If p90 rises only for traces with fi.span.kind="TOOL" or for one model route, the fix is targeted instead of a broad model swap.
How FutureAGI Handles P90 Latency
FutureAGI handles p90 latency as a trace aggregation problem, not a single SDK timer. In a LangChain support agent instrumented with the traceAI-langchain integration, every request becomes an OpenTelemetry trace with spans for the chain, retriever, LLM call, tool execution, guardrail, and gateway hop. Each span records duration, fi.span.kind, gen_ai.request.model, token counts, route metadata, and streaming fields such as gen_ai.server.time_to_first_token.
A real incident might start with this alert: “p90 full-request latency for checkout-agent route A rose from 2.1s to 5.8s over 20 minutes.” The engineer opens slow FutureAGI traces, groups duration by fi.span.kind, and sees the LLM span stayed flat while a tax-calculation tool moved from 300 ms to 3.4s. The next action is not to change the model. It is to cap tool timeout, add a cached fallback for tax lookup, and run a regression eval to confirm TaskCompletion did not drop.
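The triage step in that incident, grouping span durations by fi.span.kind to find the slow component, can be sketched as follows. The span records, the duration_ms field, and the p90 helper are hypothetical illustrations of the idea, not the FutureAGI query API:

```python
import math
from collections import defaultdict

def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]

# Hypothetical span records exported from slow traces (durations in ms)
spans = [
    {"fi.span.kind": "LLM", "duration_ms": 820},
    {"fi.span.kind": "LLM", "duration_ms": 900},
    {"fi.span.kind": "TOOL", "duration_ms": 300},
    {"fi.span.kind": "TOOL", "duration_ms": 3400},
    {"fi.span.kind": "RETRIEVER", "duration_ms": 120},
]

# Group durations by span kind, then compare p90 per component
by_kind = defaultdict(list)
for span in spans:
    by_kind[span["fi.span.kind"]].append(span["duration_ms"])

for kind, durations in sorted(by_kind.items()):
    print(kind, p90(durations))
```

In this toy data the LLM p90 stays under a second while the TOOL p90 is 3.4 s, which is the shape of evidence that points at the tax-calculation tool rather than the model.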
FutureAGI’s approach is to keep the percentile explainable. Unlike Datadog HTTP APM, which can show that an endpoint is slow but not why one agent path is slow, FutureAGI slices p90 by traceAI-openai model spans, traceAI-langchain chain spans, tenant tier, prompt version, tool name, and Agent Command Center route. Teams can alert on p90 by route, then apply a routing policy, such as least-latency routing or model fallback, only when the trace evidence points to the provider path.
How to Measure or Detect P90 Latency
Measure p90 on the population that matches the user promise. A global p90 across batch jobs, voice turns, and checkout agents is usually misleading. Use these signals:
- End-to-end span duration: aggregate gen_ai.client.operation.duration on the root request or agent-turn span.
- Streaming responsiveness: compute p90 for gen_ai.server.time_to_first_token when the user sees streamed tokens.
- Span decomposition: group p90 by fi.span.kind to separate LLM, retriever, tool, guardrail, and gateway time.
- Cohort dimensions: slice by gen_ai.request.model, route, region, tenant tier, prompt version, and tool name.
- Token pressure: compare p90 with gen_ai.usage.input_tokens; long prompts often increase prefill time before any token appears.
- Dashboard signals: p90 and p99 latency, timeout rate, retry count, token-cost-per-trace, and eval-fail-rate-by-cohort.
- User proxies: repeated-submit events, abandoned sessions, thumbs-down rate, and escalation rate after slow turns.
```python
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

provider = register(project_name="support-agent")
LangChainInstrumentor().instrument(tracer_provider=provider)

# Query p90 over gen_ai.client.operation.duration by route/model.
```
For alerts, compare rolling p90 with a baseline for the same route and hour. A 35% jump on one route is more actionable than a flat global threshold.
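The baseline comparison can be sketched in a few lines. The latency samples are invented, the p90 helper is a nearest-rank illustration, and the 35% threshold comes from the guidance above:

```python
import math

def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]

# Hypothetical samples for one route: same hour last week vs. the current window (ms)
baseline_ms = [2100, 1900, 2200, 2000, 2300]
current_ms = [5800, 5200, 6100, 5500, 5900]

baseline_p90 = p90(baseline_ms)
current_p90 = p90(current_ms)
jump = (current_p90 - baseline_p90) / baseline_p90

# Alert only when rolling p90 exceeds the route's own baseline by 35%
if jump > 0.35:
    print(f"p90 regression on route: {baseline_p90} ms -> {current_p90} ms")
```

Because the threshold is relative to the route's own baseline, a naturally slow route does not fire constantly and a naturally fast route still alerts on a meaningful regression.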
Common Mistakes
- Treating p90 as the only latency SLO. Pair it with p50 for typical behavior and p99 for extreme tail risk.
- Mixing unlike traffic. Voice turns, batch evals, admin jobs, and customer-facing chat need separate p90 budgets.
- Averaging percentiles across shards. Recompute p90 from raw events or histogram buckets; averaging p90 values gives false precision.
- Ignoring failed and timed-out requests. Excluding timeouts makes p90 look better while users experience the worst path.
- Stopping at the root span. A high request-level p90 tells you there is pain; span-level p90 tells you which component caused it.
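The percentile-averaging pitfall from the list above is easy to demonstrate with two synthetic shards; the shard data and p90 helper are illustrative:

```python
import math

def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]

shard_a = [100] * 10              # uniformly fast shard: p90 = 100 ms
shard_b = [100] * 5 + [2000] * 5  # shard with a heavy tail: p90 = 2000 ms

avg_of_p90s = (p90(shard_a) + p90(shard_b)) / 2  # misleading: 1050 ms
true_p90 = p90(shard_a + shard_b)                # recomputed from raw events: 2000 ms
print(avg_of_p90s, true_p90)
```

Averaging the per-shard values reports 1050 ms, while the p90 of the merged population is 2000 ms, which is why percentiles must be recomputed from raw events or mergeable histogram buckets.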
Frequently Asked Questions
What is P90 latency?
P90 latency is the 90th-percentile response time: 90% of requests finish at or below that value, while 10% are slower. It is a tail-latency metric for production LLM and agent traces.
How is P90 latency different from P99 latency?
P90 shows the slow edge of normal user experience, while P99 isolates the rarest and most expensive tail. Teams usually alert on p90 for broad regression and inspect p99 for extreme outliers.
How do you measure P90 latency?
Instrument with FutureAGI traceAI and aggregate span fields such as gen_ai.client.operation.duration or gen_ai.server.time_to_first_token by route, model, tenant, and fi.span.kind.