LLM Inference in 2026: How Prompts Become Responses, and How to Cut Latency and Cost
How LLM inference works in 2026: tokenization, KV cache, decoding, latency targets (TTFT under 500ms), cost math, and 7 optimizations that move the needle.
LLM Inference, Explained in One Paragraph
Large language model inference is what happens after training is done. A pretrained model takes your input prompt, tokenizes it, runs a single parallel pass called prefill to produce the first response token, then loops one token at a time through the decode phase until it hits a stop condition. Every production concern about LLMs (latency, cost, GPU utilization, queue depth, streaming UX) is an inference concern. This guide walks through the mechanics, the metrics that matter in 2026, and the seven optimizations that actually move throughput and cost numbers in production.
TL;DR: LLM Inference at a Glance in 2026
| Question | 2026 answer |
|---|---|
| What are the two phases? | Prefill (parallel over input prompt, sets TTFT) and decode (one token at a time, sets tokens-per-second). |
| What metric matters most? | TTFT under 500 ms for chat, sustained tokens-per-second per user, and GPU memory headroom. |
| Top throughput lever? | KV cache reuse with PagedAttention plus continuous batching (vLLM default since 2023, refined through 2026). |
| Top cost lever? | Routing easy queries to smaller open-weight models behind a BYOK gateway. 60 to 80 percent typical savings. |
| Quantization in 2026? | FP8 on Hopper and Blackwell is the new default. INT8 with AWQ or SmoothQuant for Ampere. |
| Speculative decoding payoff? | 1.5x to 3x throughput when the draft model accepts roughly 70 percent of proposed tokens. |
| How to track inference quality? | Pair vLLM metrics with application traces via traceAI and faithfulness or factuality evaluators from Future AGI’s evaluate API. |

Image 1: Prefill processes the entire prompt in one parallel pass; the decode phase then generates one token per step, reusing the KV cache.
How LLM Inference Works: Prefill, Decode, and the KV Cache
Tokenization Converts Text Into Integer IDs
Tokenization splits the input string into subword pieces using a learned vocabulary (BPE via OpenAI's tiktoken for the GPT family, SentencePiece for Llama and Gemma). Each piece maps to an integer ID. A 1,000-word English prompt is roughly 1,300 tokens for GPT-5 and roughly 1,250 for Claude. Tokenizers are model-specific, which is why prompt token counts vary across providers even for identical text.
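For a concrete sense of the mapping, here is a minimal sketch using OpenAI's tiktoken library. The encoding name is illustrative of recent OpenAI models; other families ship their own vocabularies, so the counts will differ.

```python
import tiktoken  # pip install tiktoken

# o200k_base is the encoding used by recent OpenAI models; Llama, Gemma, and
# Claude use their own tokenizers, so counts for the same text will differ.
enc = tiktoken.get_encoding("o200k_base")

prompt = "Large language model inference is what happens after training is done."
token_ids = enc.encode(prompt)

print(len(prompt.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:8])          # the integer IDs the model actually sees
print(enc.decode(token_ids))  # lossless round-trip back to text
```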
Prefill Runs the Whole Prompt Through the Network in One Pass
The prefill phase computes every prompt token’s hidden state in parallel through every transformer layer. The output is the next-token distribution at the final position, which is sampled to produce the first response token. Prefill is the most compute-heavy step per request: the linear layers scale with prompt length, and the full self-attention term scales with its square. This is why TTFT grows with prompt length and why prefix caching is so valuable for repeated long prefixes.
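A rough back-of-envelope shows why TTFT tracks prompt length. The figures below (a dense 70B model, roughly H100-class peak throughput) are illustrative assumptions, not measurements:

```python
# Back-of-envelope prefill cost for a dense 70B model (illustrative numbers).
n_params   = 70e9        # model parameters
n_layers   = 80          # transformer layers
d_model    = 8192        # hidden size
n_tokens   = 4096        # prompt length
peak_flops = 1e15        # ~1 PFLOP/s, roughly one H100 at BF16 (assumed, ideal)

linear_flops = 2 * n_params * n_tokens               # matmuls over all positions
attn_flops   = 4 * n_layers * d_model * n_tokens**2  # QK^T and attn*V, all layers

total = linear_flops + attn_flops
print(f"prefill ~{total / 1e12:.0f} TFLOPs, "
      f"compute-bound floor ~{total / peak_flops * 1000:.0f} ms on one GPU")
# Tensor parallelism across 4 to 8 GPUs is what brings this under a 500 ms TTFT target.
```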
Decode Loops One Token at a Time, Reusing the KV Cache
Decode computes attention only for the newest token, using the cached key and value tensors from all prior tokens. Each decode step is one transformer forward pass over a single position. Throughput in tokens-per-second is bounded by GPU memory bandwidth, not FLOPs, because decode is memory-bound. This is why batching multiple users’ decode steps together is so effective: you amortize one weight read across many tokens.
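The KV cache is also why GPU memory headroom matters: every prompted or generated token adds key and value tensors for every layer. A sketch of the arithmetic for a Llama-3-70B-style configuration (grouped-query attention, FP16 cache; exact figures vary by model):

```python
# KV cache footprint per sequence (Llama-3-70B-style config, FP16 cache).
n_layers    = 80
n_kv_heads  = 8      # grouped-query attention: far fewer KV heads than query heads
head_dim    = 128
dtype_bytes = 2      # FP16/BF16; an FP8 cache would halve this
seq_len     = 8192

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # 2 = K and V
per_sequence_gb = bytes_per_token * seq_len / 1e9

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"{per_sequence_gb:.1f} GB for an {seq_len}-token sequence")
```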
Decoding Strategies: Greedy, Sampling, and Beam Search
- Greedy decoding picks the argmax token every step. Fastest, deterministic, prone to repetition on long generations.
- Top-p (nucleus) sampling picks from the smallest set of tokens whose cumulative probability exceeds p (commonly 0.9). Combined with a temperature between 0.3 and 0.7, it is the production default for chat and content; a minimal sampling sketch follows this list.
- Beam search maintains k partial sequences and prunes to the top-k after each step. Used for translation and summarization, rarely for chat.
- Constrained decoding (Outlines, XGrammar, Guidance) restricts the next-token distribution to tokens consistent with a JSON schema or regex. Required for reliable tool-calling.
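The top-p bullet above translates directly into a few lines of NumPy. This is a minimal sketch of nucleus sampling over a raw logit vector; production servers implement the same idea fused into GPU kernels:

```python
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability exceeds top_p, after temperature scaling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                          # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]                                    # the nucleus

    kept_probs = probs[kept] / probs[kept].sum()             # renormalize and sample
    return int(np.random.choice(kept, p=kept_probs))

# Greedy decoding, by contrast, is simply: int(np.argmax(logits))
```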
LLM Inference Metrics That Matter in 2026
| Metric | What it measures | Healthy 2026 target |
|---|---|---|
| TTFT (time-to-first-token) | Prefill latency plus queue time | Under 500 ms for prompts up to 4K tokens |
| TPOT (time-per-output-token) | Per-token decode latency | 30 to 60 ms for 7B to 70B models on H100 |
| Tokens-per-second per user | End-user perceived stream rate | 30 to 80 t/s feels conversational |
| Aggregate throughput | Total tokens-per-second across a server | 5K to 20K t/s on a single 8x H100 node with vLLM |
| GPU memory utilization | Fraction of HBM used by weights plus KV cache | 80 to 90 percent for max batch density |
| Goodput | Throughput at requests meeting an SLO | The right north star for capacity planning |
For deeper benchmarks, see the Anyscale LLM serving benchmarks and the vLLM performance dashboard.
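Goodput is worth making concrete, since it is the number capacity plans should optimize. A minimal sketch, assuming each completed request is logged with its TTFT and output token count (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    ttft_ms: float       # time-to-first-token for this request
    output_tokens: int   # tokens generated for this request

def goodput(logs: list[RequestLog], wall_seconds: float, ttft_slo_ms: float = 500) -> float:
    """Tokens per second, counting only requests that met the TTFT SLO."""
    good_tokens = sum(r.output_tokens for r in logs if r.ttft_ms <= ttft_slo_ms)
    return good_tokens / wall_seconds

# A server can post high raw throughput while goodput collapses because
# queued requests blow past the 500 ms TTFT target.
```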
Seven Optimizations That Move the Needle
1. KV Cache Reuse With PagedAttention
PagedAttention, introduced in vLLM and described in the original paper, borrows virtual memory paging from operating systems and applies it to the KV cache. It eliminates the fragmentation that wasted up to 60 percent of cache memory in early serving frameworks and is the default for most modern stacks (vLLM, TensorRT-LLM, SGLang).
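There is nothing to switch on: PagedAttention is how vLLM allocates the KV cache by default, in fixed-size blocks carved out of a reserved slice of HBM. A minimal offline sketch; the model name and values are illustrative starting points:

```python
from vllm import LLM, SamplingParams

# PagedAttention is the default KV cache manager; the knobs below only tune it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    gpu_memory_utilization=0.90,  # fraction of HBM reserved for weights + KV blocks
    block_size=16,                # tokens per KV cache block (the "page" size)
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```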
2. Continuous Batching
Static batching waits for every sequence in a batch to finish before accepting new requests. Continuous batching swaps a finished sequence out and a new one in immediately. The original Orca paper from OSDI 2022 demonstrated the technique; vLLM and TGI ship it as the default. Expect 2x to 4x effective throughput compared to static batching on mixed chat workloads.
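Continuous batching is likewise on by default in vLLM; the scheduler limits below bound how much work is admitted per step. A hedged sketch with illustrative values, not recommendations:

```python
from vllm import LLM

# The scheduler re-fills the batch every step as sequences finish; these
# limits cap batch density so long prefills cannot starve decode traffic.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    max_num_seqs=256,              # max sequences scheduled in one step
    max_num_batched_tokens=8192,   # max tokens processed in one step
)
```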
3. FP8 and INT8 Quantization
FP8 weight and activation quantization on Hopper and Blackwell GPUs roughly doubles throughput at small quality cost. INT8 with SmoothQuant or AWQ is the equivalent for Ampere (A100). For weights-only INT4, GPTQ and AWQ are the established methods. Always measure quality regression with your evaluation suite before shipping a quantized model.
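In vLLM, precision is a constructor argument, and the KV cache can be quantized independently of the weights. A sketch assuming a Hopper-class GPU; the model name is illustrative, and the eval-suite rerun mentioned above still applies:

```python
from vllm import LLM

# FP8 weights and activations on Hopper/Blackwell; use quantization="awq" or a
# SmoothQuant-prepared checkpoint for Ampere-class GPUs instead.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
    quantization="fp8",      # online FP8 weight/activation quantization
    kv_cache_dtype="fp8",    # optionally shrink the KV cache as well
    tensor_parallel_size=4,  # a 70B checkpoint still spans multiple GPUs
)
```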
4. Speculative Decoding
A small draft model generates k candidate tokens cheaply; the large target model verifies all k in a single forward pass. Accepted tokens are committed; at the first rejection, the remaining draft tokens are discarded and the target model’s own sample is used instead. Real-world acceptance rates for EAGLE-3 and Medusa-2 sit around 70 percent on common chat data, translating to 1.5x to 3x decode throughput. See the EAGLE-3 paper and the vLLM spec decode guide.
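The acceptance rate maps to throughput through a simple expectation. Assuming each drafted token is accepted independently with probability alpha, the expected number of tokens committed per target-model forward pass is (1 - alpha^(k+1)) / (1 - alpha), which is where the 1.5x to 3x range comes from once draft overhead is subtracted. A quick check:

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass when the draft proposes
    k tokens and each is accepted i.i.d. with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}: {expected_tokens_per_verify(alpha, k=4):.2f} tokens/verify")
# alpha=0.7 with k=4 gives ~2.8 tokens per target pass, before draft-model overhead.
```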
5. Prefix Caching for Shared Context
Most agent workloads reuse a long system prompt or RAG context across turns. Prefix caching stores the KV cache for that shared prefix and reuses it on subsequent requests, dropping prefill cost on cache hits to roughly zero. SGLang’s RadixAttention and vLLM’s automatic prefix cache both implement this.
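In vLLM the feature is a single flag; the win comes from keeping the shared prefix byte-identical across requests. A minimal sketch with an illustrative model and system prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,  # reuse KV blocks for byte-identical prefixes
)

# A long system prompt shared by every request (placeholder text).
SYSTEM = "You are a support agent. Follow the policy below.\n<several thousand tokens of policy>"
params = SamplingParams(temperature=0.3, max_tokens=128)

# The second call hits the cached prefix, so its prefill cost is near zero.
llm.generate([SYSTEM + "\nUser: How do I reset my password?"], params)
llm.generate([SYSTEM + "\nUser: Where is my invoice?"], params)
```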
6. Tensor and Pipeline Parallelism for Large Models
When a single model copy no longer fits in one GPU’s HBM, the model itself must be sharded. Tensor parallelism splits weight matrices across GPUs within a server (NVLink-bound). Pipeline parallelism splits transformer layers across servers (network-bound). Megatron-LM, DeepSpeed-Inference, and vLLM’s distributed mode all implement these patterns. For 405B Llama 4 inference on commodity hardware, expect to combine both.
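In vLLM’s distributed mode, both degrees are constructor arguments. The sketch below assumes two 8-GPU nodes and an illustrative large checkpoint:

```python
from vllm import LLM

# Tensor parallelism shards each weight matrix across the 8 GPUs inside a node
# (NVLink traffic); pipeline parallelism splits layers across the 2 nodes
# (network traffic). 8 x 2 = 16 GPUs hold one model copy.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative large checkpoint
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
```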
7. Model Routing and Cascading
The cheapest token is the one you never run through GPT-5. A router classifies incoming prompts (length, complexity, tool requirements) and sends easy ones to Llama 4 8B or Mistral Small 3 and hard ones to GPT-5 or Claude Opus 4.7. Implementations include RouteLLM and similar production routers. Routing is best built behind a BYOK gateway so you can switch models without touching application code; see our guide to the best LLM gateways in 2026.
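A router can start as a few lines of heuristics before graduating to a trained classifier like RouteLLM. The sketch below is illustrative throughout: the thresholds, model strings, and the idea of keying on length and tool use are assumptions to adapt to your own traffic.

```python
from litellm import completion

CHEAP_MODEL = "ollama/llama3.1"     # illustrative self-hosted open-weight model
STRONG_MODEL = "gpt-5-2025-08-07"   # illustrative frontier model

def route(prompt: str, needs_tools: bool) -> str:
    """Toy heuristic router: short, tool-free prompts go to the cheap model."""
    if needs_tools or len(prompt) > 4000:
        return STRONG_MODEL
    return CHEAP_MODEL

def routed_completion(prompt: str, needs_tools: bool = False) -> str:
    model = route(prompt, needs_tools)
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response["choices"][0]["message"]["content"]
```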
What Inference Looks Like in Production Code
Below is a thin Python sketch using LiteLLM as the model client and Future AGI’s traceAI plus evaluate APIs for observability and quality. Real production code adds retries, timeouts, and SLO checks.
```python
import os

from fi_instrumentation import register, FITracer
from fi.evals import evaluate
from litellm import completion

# FI_API_KEY and FI_SECRET_KEY must be exported in the environment before
# register() runs; they authenticate traceAI and the evaluate API.
assert "FI_API_KEY" in os.environ and "FI_SECRET_KEY" in os.environ

# 1. Register the application with traceAI for end-to-end traces.
tracer_provider = register(
    project_name="chat-inference",
    project_type="experiment",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))


@tracer.chain
def answer(question: str, context: str) -> str:
    response = completion(
        model="gpt-5-2025-08-07",
        messages=[
            {"role": "system", "content": "Answer using the provided context only."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=False,
    )
    output = response["choices"][0]["message"]["content"]

    # 2. Score the answer for faithfulness against the provided context.
    result = evaluate(
        eval_templates="faithfulness",
        inputs={"output": output, "context": context},
        model_name="turing_flash",
    )
    if result.eval_results[0].metrics[0].value < 0.7:
        # Low faithfulness: log, alert, or route to a stronger model.
        # Production teams swap in a fallback model call here.
        return output
    return output
```
A few notes on the snippet:
- `register` and `FITracer` come from traceAI, Future AGI’s open-source OpenTelemetry-compatible instrumentation under Apache 2.0.
- `evaluate` is the cloud evaluator entry point in ai-evaluation (Apache 2.0). The `turing_flash` model returns in roughly 1 to 2 seconds; `turing_small` in 2 to 3 seconds; `turing_large` in 3 to 5 seconds. See the cloud evals reference.
- Traces and evaluator scores flow into the Agent Command Center at /platform/monitor/command-center, which is the dashboard for production inference observability.
Pairing Serving Metrics With Quality Evaluation
Inference observability has two layers and most teams build both.
Layer 1: serving metrics. vLLM, TGI, and TensorRT-LLM all export Prometheus metrics: TTFT, TPOT, queue depth, GPU memory, and so on. Wire these into Grafana for capacity planning and SLO dashboards. The vLLM metrics doc lists the supported series.
Layer 2: application-level traces with quality scores. Serving metrics tell you the model returned a token in 50 ms; they do not tell you whether the token was right. Future AGI’s evaluate API runs faithfulness, factuality, and task-specific judges on every traced completion, and the results sit next to the latency in the Agent Command Center. The same trace surfaces in load-testing dashboards (see our LLM load testing guide) so a regression in p95 latency caused by switching to a quantized model is paired with the corresponding quality delta.
For a deeper dive, see our LLM evaluation tools roundup for 2026 and the LLM cost optimization playbook. If you are scoring prompts, our prompt best-practices guide covers the upstream concerns that affect token count and TTFT.
Common Pitfalls When Optimizing Inference
- Quantizing without a quality test set. FP8 and INT8 can look free on standard benchmarks and still regress on hard production tasks. Always rerun your evaluator suite after a precision change.
- Reading aggregate throughput without goodput. A server can advertise 20K tokens-per-second total while p95 TTFT for individual users blows past 3 seconds. Track goodput, not raw throughput.
- Caching the wrong prefix. Prefix cache hits only help if the prefix is byte-identical. Insertion of a per-user timestamp or a session ID at the top of a system prompt destroys the cache hit rate.
- Speculative decoding with a poorly aligned draft. If the draft model and target model disagree on more than roughly 50 percent of tokens, speculative decoding can be slower than vanilla decode. Run an acceptance-rate audit before rolling it out.
- Skipping load tests. Continuous batching only earns its keep at realistic concurrency. Run a representative load profile (see our load testing tool guide) before changing batch settings in production.
Where Future AGI Fits in the Inference Stack
Future AGI sits at the application layer, on top of whichever serving stack you use (vLLM, TensorRT-LLM, SGLang, or a managed API). It provides three things inference teams rely on in 2026:
- traceAI for application-level OpenTelemetry traces, instrumenting LiteLLM, the OpenAI SDK, LangChain, LlamaIndex, OpenAI Agents, and MCP servers under Apache 2.0.
- The evaluate API for online quality scoring of streamed completions with deterministic and LLM-judge evaluators. Scores attach to traces and feed dashboards.
- The Agent Command Center at /platform/monitor/command-center for production dashboards, BYOK gateway routing, and the Protect guardrail layer for output filtering.
If you are switching models, quantizing, or rolling out speculative decoding, traceAI and the evaluator suite give you the before-and-after data to ship safely. Read the getting-started docs or jump into the GitHub repo for the Apache 2.0 source.
Frequently asked questions
What is LLM inference?
Inference is everything that happens after training: the frozen model tokenizes your prompt, runs prefill to produce the first token, then decodes one token at a time until a stop condition.
What is the difference between training and inference?
Training updates model weights against a large corpus; inference runs the fixed weights against live prompts. Latency, cost, GPU utilization, and streaming UX are all inference concerns.
What is TTFT and why does it matter?
Time-to-first-token is prefill latency plus queue time. It sets how responsive a chat feels; under 500 ms for prompts up to 4K tokens is the 2026 target.
How does KV caching speed up LLM inference?
The cache stores the key and value tensors for all prior tokens, so each decode step computes attention only for the newest token instead of recomputing the whole sequence.
What decoding strategy should I use for production?
Top-p sampling with p around 0.9 and temperature between 0.3 and 0.7 is the default for chat and content; use constrained decoding for JSON and tool calls, and beam search mainly for translation and summarization.
How do I measure inference quality, not just speed?
Pair serving metrics (TTFT, TPOT, throughput) with application-level traces and evaluator scores such as faithfulness and factuality from the evaluate API.
What is the cheapest way to run inference at scale?
Route easy queries to smaller open-weight models behind a BYOK gateway, reuse KV and prefix caches, batch continuously, and quantize to FP8 or INT8 once quality is verified.
How does inference observability fit alongside model serving?
The serving stack exports Prometheus metrics for capacity planning; traceAI adds application-level traces with quality scores on top, and both land in the Agent Command Center.