What Is a Quantile?
A cut point that divides an ordered probability distribution into equal-sized groups, used to summarise the tail of a metric (e.g. p99 latency).
A quantile is a cut point that divides an ordered distribution into equal-probability groups. The 0.5 quantile (the median) is the value below which half the observations sit. The 0.99 quantile (p99) is the value below which 99% of observations sit. In LLM-application monitoring, quantiles are how you describe the tail of latency, cost, token usage, and eval-fail rate — anywhere the average is misleading because the distribution is skewed. A trace pipeline that looks fine at the mean and broken at p99 is a trace pipeline that ships a broken experience to the worst-served fraction of users.
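The relationship between the mean, the median, and p99 on a skewed distribution can be sketched in a few lines. This is an illustrative simulation with made-up log-normal latencies, not real traffic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal latencies: a heavy right tail, like typical LLM response times.
latencies_ms = rng.lognormal(mean=5.3, sigma=0.6, size=100_000)

p50, p99 = np.quantile(latencies_ms, [0.50, 0.99])
mean = latencies_ms.mean()
# In a right-skewed distribution the mean sits above the median
# but far below p99 -- the mean hides the tail.
print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
```

Note the ordering p50 < mean < p99: the mean already understates what the slowest 1% of requests experience.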
Why It Matters in Production LLM and Agent Systems
LLM workloads have heavy tails by default. A streaming response that averages 200ms can have a p99 above 4 seconds because of cold-start tokenization, retry logic, retriever timeouts, or a tool call that occasionally hits a slow upstream. An agent that averages $0.04 per request can have a p99 above $1.20 because roughly once in 100 requests the agent loops 30 times before stopping. Both shapes are invisible if you only chart the mean.
The pain falls on whichever role owns the tail. SREs page on p99 latency thresholds. Finance owns the p95 cost per trace because that is the bill they cannot predict. Product managers field complaints from the 1% of users who got the 4-second response and are louder than the 99% who got the fast one. Compliance leads who own a regulated SLA — “median response within 2 seconds” or “no request exceeds 10 seconds” — care about quantiles because that is how the SLA is written.
In 2026-era agentic stacks, quantiles compound across steps. A 5-step agent where each step has a clean p50 but a dirty p99 produces a request-level latency that is far worse than 5×p50 — the tails align and the worst-of-N becomes the user-facing number. Tracking quantiles step-by-step is the only way to see this.
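The compounding effect above can be demonstrated with a simulation. The step latencies here are hypothetical (a ~100ms fast path with a 1% chance of a 2s slow path per step); the point is the shape, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n_requests, n_steps = 100_000, 5
# Each step: usually ~100ms, but 1% of calls hit a 2s slow path.
fast = rng.normal(100, 10, size=(n_requests, n_steps))
slow = rng.random((n_requests, n_steps)) < 0.01
step_ms = np.where(slow, 2000.0, fast)

per_step_p50 = np.quantile(step_ms, 0.50)   # clean: ~100ms per step
request_ms = step_ms.sum(axis=1)            # end-to-end latency per request
p50_e2e, p99_e2e = np.quantile(request_ms, [0.50, 0.99])
# 5 x p50 is ~500ms, but with 5 chances to hit the slow path,
# ~5% of requests contain at least one 2s step, so request-level p99 blows up.
print(f"5*step_p50={5*per_step_p50:.0f}ms  request_p50={p50_e2e:.0f}ms  request_p99={p99_e2e:.0f}ms")
```

With a 1% per-step slow path, the chance a 5-step request contains at least one slow step is about 1 − 0.99⁵ ≈ 4.9%, which is why the request-level p99 lands above 2 seconds while 5×p50 stays near 500ms.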
How FutureAGI Handles Quantiles
FutureAGI doesn’t reinvent statistics — we surface quantiles where they matter for LLM systems. Every span ingested through traceAI carries latency, llm.token_count.prompt, llm.token_count.completion, and cost attributes; FutureAGI’s dashboards compute p50, p90, and p99 over those time series, sliced by route, model, prompt version, or eval-fail-rate cohort. The quantile view is what reveals that a model swap from gpt-4o to gpt-4o-mini cut mean cost in half but tripled p99 cost because the smaller model loops more often on a specific intent.
Concretely: an engineering team running an agent on traceAI-openai-agents charts p99 end-to-end latency by route. They see the refund_intent route sitting at a p50 of 1.2s and a p99 of 9.8s. The trace view points to one tool — a CRM lookup — that hits the slow path 3% of the time. They add a model fallback route in the Agent Command Center and the p99 drops to 3.1s. None of this would have surfaced from the mean. FutureAGI’s role is making the tail visible at every layer of the stack — span, trace, evaluator, cost — so the tail-owning team gets the signal first.
How to Measure or Detect It
Quantile-based signals you actually use:
- p50, p90, p99 latency: per-span and end-to-end; alert on p99, not mean.
- p95 cost per trace (llm.token_count.prompt + llm.token_count.completion): the bill-predictability metric.
- p99 token count per trace: surfaces runaway-context bugs before they become a runaway-cost incident.
- p90 eval-fail-rate-by-cohort: alerts on the 10% of cohorts where evals are worst rather than the global average.
- p99 by route, by model, by prompt version: slicing dimension matters more than the quantile itself.
```python
# OTel attribute names you'll filter on
PROMPT_TOKENS = "llm.token_count.prompt"
COMPLETION_TOKENS = "llm.token_count.completion"
LATENCY = "latency_ms"

# In FutureAGI dashboards: percentile(LATENCY, 0.99) by route, model
```
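Outside the dashboard, the same percentile-by-route slicing can be sketched over exported span records with pandas. The field names mirror the attributes above; the records themselves are made up for illustration:

```python
import pandas as pd

# Hypothetical exported span records (route + latency attribute).
spans = pd.DataFrame([
    {"route": "refund_intent", "latency_ms": 1200},
    {"route": "refund_intent", "latency_ms": 9800},
    {"route": "faq_intent",    "latency_ms": 300},
    {"route": "faq_intent",    "latency_ms": 450},
])

# Equivalent of percentile(latency_ms, 0.99) by route.
p99_by_route = spans.groupby("route")["latency_ms"].quantile(0.99)
print(p99_by_route)
```

The per-route view is what separates a healthy aggregate from a route-level tail problem: here refund_intent owns the tail while faq_intent is clean.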
Common Mistakes
- Charting only the mean. It hides every tail problem; LLM workloads have heavy tails by default.
- Using p99 as a single global number. p99 sliced by route and by model is what tells you where the tail lives; an aggregate p99 hides the bug.
- Setting alerts on p50. P50 moves slowly and misses regressions; alert on p95 or p99.
- Ignoring quantile compounding across agent steps. A 5-step trace with clean step-level p99s can have a terrible request-level p99 — measure end-to-end too.
- Using small windows to compute p99. A p99 over 50 samples is noise; aggregate over enough traffic that the quantile is stable.
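The last point can be made concrete with a simulation: draw small and large windows from the same distribution (the same made-up log-normal latencies as above) and compare how much the estimated p99 jumps between runs:

```python
import numpy as np

rng = np.random.default_rng(2)

def p99_spread(window, trials=500):
    """Std dev of the p99 estimate across repeated windows of a given size."""
    estimates = [np.quantile(rng.lognormal(5.3, 0.6, size=window), 0.99)
                 for _ in range(trials)]
    return float(np.std(estimates))

small, large = p99_spread(50), p99_spread(5000)
# The 50-sample p99 wanders far more run to run than the 5000-sample p99.
print(f"std of p99 estimate: n=50 -> {small:.0f}ms, n=5000 -> {large:.0f}ms")
```

A p99 over 50 samples is effectively the max of the window, so any alert wired to it will flap; widen the window until the estimate stabilizes before alerting.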
Frequently Asked Questions
What is a quantile?
A quantile is a cut point that divides an ordered distribution into equal-probability groups. The median is the 0.5 quantile; p99 latency is the 0.99 quantile of latency.
Why use p99 latency instead of mean latency?
Mean latency is dominated by the bulk of fast requests and hides the slow tail. P99 surfaces the experience of the slowest 1% of users, which is usually what your worst customers feel.
How do you track quantiles in FutureAGI?
FutureAGI computes p50, p90, and p99 over OTel-traced spans for latency, token usage, and cost. Slice by route, model, or cohort to see which traffic owns the tail.