LLM Inference in 2026: How Prompts Become Responses, and How to Cut Latency and Cost
How LLM inference works in 2026: tokenization, KV cache, decoding, latency targets (TTFT under 500ms), cost math, and 7 optimizations that move the needle.
LLM Inference, Explained in One Paragraph
Large language model inference is what happens after training is done. A pretrained model takes your input prompt, tokenizes it, runs a single parallel pass called prefill to produce the first response token, then loops one token at a time through the decode phase until it hits a stop condition. Every production concern about LLMs (latency, cost, GPU utilization, queue depth, streaming UX) is an inference concern. This guide walks through the mechanics, the metrics that matter in 2026, and the seven optimizations that actually move throughput and cost numbers in production.
TL;DR: LLM Inference at a Glance in 2026
| Question | 2026 answer |
|---|---|
| What are the two phases? | Prefill (parallel over input prompt, sets TTFT) and decode (one token at a time, sets tokens-per-second). |
| What metric matters most? | TTFT under 500 ms for chat, sustained tokens-per-second per user, and GPU memory headroom. |
| Top throughput lever? | KV cache reuse with PagedAttention plus continuous batching (vLLM default since 2023, refined through 2026). |
| Top cost lever? | Routing easy queries to smaller open-weight models behind a BYOK gateway. 60 to 80 percent typical savings. |
| Quantization in 2026? | FP8 on Hopper and Blackwell is the new default. INT8 with AWQ or SmoothQuant for Ampere. |
| Speculative decoding payoff? | 1.5x to 3x throughput when the draft model accepts roughly 70 percent of proposed tokens. |
| How to track inference quality? | Pair vLLM metrics with application traces via traceAI and faithfulness or factuality evaluators from Future AGI’s evaluate API. |

Image 1: Prefill processes the entire prompt in one parallel pass; the decode phase then generates one token per step, reusing the KV cache.
How LLM Inference Works: Prefill, Decode, and the KV Cache
Tokenization Converts Text Into Integer IDs
Tokenization splits the input string into subword pieces using a learned vocabulary (BPE via OpenAI's tiktoken for the GPT family, SentencePiece for Llama and Gemma). Each piece maps to an integer ID. A 1,000-word English prompt is roughly 1,300 tokens for GPT-5 and roughly 1,250 for Claude. Tokenizers are model-specific, which is why prompt token counts vary across providers even for identical text.
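For a concrete sense of the mapping, here is a minimal sketch using OpenAI's tiktoken library. The encoding name is illustrative of recent OpenAI models; other families ship their own vocabularies, so the counts will differ.

```python
import tiktoken  # pip install tiktoken

# o200k_base is the encoding used by recent OpenAI models; Llama, Gemma, and
# Claude use their own tokenizers, so counts for the same text will differ.
enc = tiktoken.get_encoding("o200k_base")

prompt = "Large language model inference is what happens after training is done."
token_ids = enc.encode(prompt)

print(len(prompt.split()), "words ->", len(token_ids), "tokens")
print(token_ids[:8])          # the integer IDs the model actually sees
print(enc.decode(token_ids))  # lossless round-trip back to text
```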
Prefill Runs the Whole Prompt Through the Network in One Pass
The prefill phase computes every prompt token’s hidden state in parallel through every transformer layer. The output is the next-token distribution at the final position, which is sampled to produce the first response token. Prefill is the most compute-heavy step per request: the linear layers scale with prompt length, and the full self-attention term scales with its square. This is why TTFT grows with prompt length and why prefix caching is so valuable for repeated long prefixes.
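A rough back-of-envelope shows why TTFT tracks prompt length. The figures below (a dense 70B model, roughly H100-class peak throughput) are illustrative assumptions, not measurements:

```python
# Back-of-envelope prefill cost for a dense 70B model (illustrative numbers).
n_params   = 70e9        # model parameters
n_layers   = 80          # transformer layers
d_model    = 8192        # hidden size
n_tokens   = 4096        # prompt length
peak_flops = 1e15        # ~1 PFLOP/s, roughly one H100 at BF16 (assumed, ideal)

linear_flops = 2 * n_params * n_tokens               # matmuls over all positions
attn_flops   = 4 * n_layers * d_model * n_tokens**2  # QK^T and attn*V, all layers

total = linear_flops + attn_flops
print(f"prefill ~{total / 1e12:.0f} TFLOPs, "
      f"compute-bound floor ~{total / peak_flops * 1000:.0f} ms on one GPU")
# Tensor parallelism across 4 to 8 GPUs is what brings this under a 500 ms TTFT target.
```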
Decode Loops One Token at a Time, Reusing the KV Cache
Decode computes attention only for the newest token, using the cached key and value tensors from all prior tokens. Each decode step is one transformer forward pass over a single position. Throughput in tokens-per-second is bounded by GPU memory bandwidth, not FLOPs, because decode is memory-bound. This is why batching multiple users’ decode steps together is so effective: you amortize one weight read across many tokens.
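The KV cache is also why GPU memory headroom matters: every prompted or generated token adds key and value tensors for every layer. A sketch of the arithmetic for a Llama-3-70B-style configuration (grouped-query attention, FP16 cache; exact figures vary by model):

```python
# KV cache footprint per sequence (Llama-3-70B-style config, FP16 cache).
n_layers    = 80
n_kv_heads  = 8      # grouped-query attention: far fewer KV heads than query heads
head_dim    = 128
dtype_bytes = 2      # FP16/BF16; an FP8 cache would halve this
seq_len     = 8192

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # 2 = K and V
per_sequence_gb = bytes_per_token * seq_len / 1e9

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"{per_sequence_gb:.1f} GB for an {seq_len}-token sequence")
```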
Decoding Strategies: Greedy, Sampling, and Beam Search
- Greedy decoding picks the argmax token every step. Fastest, deterministic, prone to repetition on long generations.
- Top-p (nucleus) sampling picks from the smallest set of tokens whose cumulative probability exceeds p (commonly 0.9). Combined with a temperature between 0.3 and 0.7, it is the production default for chat and content; a minimal sampling sketch follows this list.
- Beam search maintains k partial sequences and prunes to the top-k after each step. Used for translation and summarization, rarely for chat.
- Constrained decoding (Outlines, XGrammar, Guidance) restricts the next-token distribution to tokens consistent with a JSON schema or regex. Required for reliable tool-calling.
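The top-p bullet above translates directly into a few lines of NumPy. This is a minimal sketch of nucleus sampling over a raw logit vector; production servers implement the same idea fused into GPU kernels:

```python
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability exceeds top_p, after temperature scaling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                          # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]                                    # the nucleus

    kept_probs = probs[kept] / probs[kept].sum()             # renormalize and sample
    return int(np.random.choice(kept, p=kept_probs))

# Greedy decoding, by contrast, is simply: int(np.argmax(logits))
```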
LLM Inference Metrics That Matter in 2026
| Metric | What it measures | Healthy 2026 target |
|---|---|---|
| TTFT (time-to-first-token) | Prefill latency plus queue time | Under 500 ms for prompts up to 4K tokens |
| TPOT (time-per-output-token) | Per-token decode latency | 30 to 60 ms for 7B to 70B models on H100 |
| Tokens-per-second per user | End-user perceived stream rate | 30 to 80 t/s feels conversational |
| Aggregate throughput | Total tokens-per-second across a server | 5K to 20K t/s on a single 8x H100 node with vLLM |
| GPU memory utilization | Fraction of HBM used by weights plus KV cache | 80 to 90 percent for max batch density |
| Goodput | Throughput at requests meeting an SLO | The right north star for capacity planning |
For deeper benchmarks, see the Anyscale LLM serving benchmarks and the vLLM performance dashboard.
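Goodput is worth making concrete, since it is the number capacity plans should optimize. A minimal sketch, assuming each completed request is logged with its TTFT and output token count (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    ttft_ms: float       # time-to-first-token for this request
    output_tokens: int   # tokens generated for this request

def goodput(logs: list[RequestLog], wall_seconds: float, ttft_slo_ms: float = 500) -> float:
    """Tokens per second, counting only requests that met the TTFT SLO."""
    good_tokens = sum(r.output_tokens for r in logs if r.ttft_ms <= ttft_slo_ms)
    return good_tokens / wall_seconds

# A server can post high raw throughput while goodput collapses because
# queued requests blow past the 500 ms TTFT target.
```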
Seven Optimizations That Move the Needle
1. KV Cache Reuse With PagedAttention
PagedAttention, introduced in vLLM and described in the original paper, borrows virtual memory paging from operating systems and applies it to the KV cache. It eliminates the fragmentation that wasted up to 60 percent of cache memory in early serving frameworks and is the default for most modern stacks (vLLM, TensorRT-LLM, SGLang).
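There is nothing to switch on: PagedAttention is how vLLM allocates the KV cache by default, in fixed-size blocks carved out of a reserved slice of HBM. A minimal offline sketch; the model name and values are illustrative starting points:

```python
from vllm import LLM, SamplingParams

# PagedAttention is the default KV cache manager; the knobs below only tune it.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    gpu_memory_utilization=0.90,  # fraction of HBM reserved for weights + KV blocks
    block_size=16,                # tokens per KV cache block (the "page" size)
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```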
2. Continuous Batching
Static batching waits for every sequence in a batch to finish before accepting new requests. Continuous batching swaps a finished sequence out and a new one in immediately. The original Orca paper from OSDI 2022 demonstrated the technique; vLLM and TGI ship it as the default. Expect 2x to 4x effective throughput compared to static batching on mixed chat workloads.
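Continuous batching is likewise on by default in vLLM; the scheduler limits below bound how much work is admitted per step. A hedged sketch with illustrative values, not recommendations:

```python
from vllm import LLM

# The scheduler re-fills the batch every step as sequences finish; these
# limits cap batch density so long prefills cannot starve decode traffic.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    max_num_seqs=256,              # max sequences scheduled in one step
    max_num_batched_tokens=8192,   # max tokens processed in one step
)
```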
3. FP8 and INT8 Quantization
FP8 weight and activation quantization on Hopper and Blackwell GPUs roughly doubles throughput at small quality cost. INT8 with SmoothQuant or AWQ is the equivalent for Ampere (A100). For weights-only INT4, GPTQ and AWQ are the established methods. Always measure quality regression with your evaluation suite before shipping a quantized model.
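In vLLM, precision is a constructor argument, and the KV cache can be quantized independently of the weights. A sketch assuming a Hopper-class GPU; the model name is illustrative, and the eval-suite rerun mentioned above still applies:

```python
from vllm import LLM

# FP8 weights and activations on Hopper/Blackwell; use quantization="awq" or a
# SmoothQuant-prepared checkpoint for Ampere-class GPUs instead.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
    quantization="fp8",      # online FP8 weight/activation quantization
    kv_cache_dtype="fp8",    # optionally shrink the KV cache as well
    tensor_parallel_size=4,  # a 70B checkpoint still spans multiple GPUs
)
```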
4. Speculative Decoding
A small draft model generates k candidate tokens cheaply; the large target model verifies all k in a single forward pass. Accepted tokens are committed; at the first rejection, the remaining draft tokens are discarded and the target model’s own sample is used instead. Real-world acceptance rates for EAGLE-3 and Medusa-2 sit around 70 percent on common chat data, translating to 1.5x to 3x decode throughput. See the EAGLE-3 paper and the vLLM spec decode guide.
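The acceptance rate maps to throughput through a simple expectation. Assuming each drafted token is accepted independently with probability alpha, the expected number of tokens committed per target-model forward pass is (1 - alpha^(k+1)) / (1 - alpha), which is where the 1.5x to 3x range comes from once draft overhead is subtracted. A quick check:

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass when the draft proposes
    k tokens and each is accepted i.i.d. with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}: {expected_tokens_per_verify(alpha, k=4):.2f} tokens/verify")
# alpha=0.7 with k=4 gives ~2.8 tokens per target pass, before draft-model overhead.
```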
5. Prefix Caching for Shared Context
Most agent workloads reuse a long system prompt or RAG context across turns. Prefix caching stores the KV cache for that shared prefix and reuses it on subsequent requests, dropping prefill cost on cache hits to roughly zero. SGLang’s RadixAttention and vLLM’s automatic prefix cache both implement this.
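In vLLM the feature is a single flag; the win comes from keeping the shared prefix byte-identical across requests. A minimal sketch with an illustrative model and system prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,  # reuse KV blocks for byte-identical prefixes
)

# A long system prompt shared by every request (placeholder text).
SYSTEM = "You are a support agent. Follow the policy below.\n<several thousand tokens of policy>"
params = SamplingParams(temperature=0.3, max_tokens=128)

# The second call hits the cached prefix, so its prefill cost is near zero.
llm.generate([SYSTEM + "\nUser: How do I reset my password?"], params)
llm.generate([SYSTEM + "\nUser: Where is my invoice?"], params)
```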
6. Tensor and Pipeline Parallelism for Large Models
When a single model copy no longer fits in one GPU’s HBM, the model itself must be sharded. Tensor parallelism splits weight matrices across GPUs within a server (NVLink-bound). Pipeline parallelism splits transformer layers across servers (network-bound). Megatron-LM, DeepSpeed-Inference, and vLLM’s distributed mode all implement these patterns. For 405B Llama 4 inference on commodity hardware, expect to combine both.
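In vLLM’s distributed mode, both degrees are constructor arguments. The sketch below assumes two 8-GPU nodes and an illustrative large checkpoint:

```python
from vllm import LLM

# Tensor parallelism shards each weight matrix across the 8 GPUs inside a node
# (NVLink traffic); pipeline parallelism splits layers across the 2 nodes
# (network traffic). 8 x 2 = 16 GPUs hold one model copy.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative large checkpoint
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
```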
7. Model Routing and Cascading
The cheapest token is the one you never run through GPT-5. A router classifies incoming prompts (length, complexity, tool requirements) and sends easy ones to Llama 4 8B or Mistral Small 3 and hard ones to GPT-5 or Claude Opus 4.7. Implementations include RouteLLM and similar production routers. Routing is best built behind a BYOK gateway so you can switch models without touching application code; see our guide to the best LLM gateways in 2026.
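A router can start as a few lines of heuristics before graduating to a trained classifier like RouteLLM. The sketch below is illustrative throughout: the thresholds, model strings, and the idea of keying on length and tool use are assumptions to adapt to your own traffic.

```python
from litellm import completion

CHEAP_MODEL = "ollama/llama3.1"     # illustrative self-hosted open-weight model
STRONG_MODEL = "gpt-5-2025-08-07"   # illustrative frontier model

def route(prompt: str, needs_tools: bool) -> str:
    """Toy heuristic router: short, tool-free prompts go to the cheap model."""
    if needs_tools or len(prompt) > 4000:
        return STRONG_MODEL
    return CHEAP_MODEL

def routed_completion(prompt: str, needs_tools: bool = False) -> str:
    model = route(prompt, needs_tools)
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response["choices"][0]["message"]["content"]
```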
What Inference Looks Like in Production Code
Below is a thin Python sketch using LiteLLM as the model client and Future AGI’s traceAI plus evaluate APIs for observability and quality. Real production code adds retries, timeouts, and SLO checks.
```python
import os

from fi_instrumentation import register, FITracer
from fi.evals import evaluate
from litellm import completion

# FI_API_KEY and FI_SECRET_KEY must be exported in the environment before
# register() runs; they authenticate traceAI and the evaluate API.
assert "FI_API_KEY" in os.environ and "FI_SECRET_KEY" in os.environ

# 1. Register the application with traceAI for end-to-end traces.
tracer_provider = register(
    project_name="chat-inference",
    project_type="experiment",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))


@tracer.chain
def answer(question: str, context: str) -> str:
    response = completion(
        model="gpt-5-2025-08-07",
        messages=[
            {"role": "system", "content": "Answer using the provided context only."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=False,
    )
    output = response["choices"][0]["message"]["content"]

    # 2. Score the answer for faithfulness against the provided context.
    result = evaluate(
        eval_templates="faithfulness",
        inputs={"output": output, "context": context},
        model_name="turing_flash",
    )
    if result.eval_results[0].metrics[0].value < 0.7:
        # Low faithfulness: log, alert, or route to a stronger model.
        # Production teams swap in a fallback model call here.
        return output
    return output
```
A few notes on the snippet:
- `register` and `FITracer` come from traceAI, Future AGI’s open-source OpenTelemetry-compatible instrumentation under Apache 2.0.
- `evaluate` is the cloud evaluator entry point in ai-evaluation (Apache 2.0). The `turing_flash` model returns in roughly 1 to 2 seconds; `turing_small` in 2 to 3 seconds; `turing_large` in 3 to 5 seconds. See the cloud evals reference.
- Traces and evaluator scores flow into the Agent Command Center at /platform/monitor/command-center, which is the dashboard for production inference observability.
Pairing Serving Metrics With Quality Evaluation
Inference observability has two layers and most teams build both.
Layer 1: serving metrics. vLLM, TGI, and TensorRT-LLM all export Prometheus metrics: TTFT, TPOT, queue depth, GPU memory, and so on. Wire these into Grafana for capacity planning and SLO dashboards. The vLLM metrics doc lists the supported series.
Layer 2: application-level traces with quality scores. Serving metrics tell you the model returned a token in 50 ms; they do not tell you whether the token was right. Future AGI’s evaluate API runs faithfulness, factuality, and task-specific judges on every traced completion, and the results sit next to the latency in the Agent Command Center. The same trace surfaces in load-testing dashboards (see our LLM load testing guide) so a regression in p95 latency caused by switching to a quantized model is paired with the corresponding quality delta.
For a deeper dive, see our LLM evaluation tools roundup for 2026 and the LLM cost optimization playbook. If you are scoring prompts, our prompt best-practices guide covers the upstream concerns that affect token count and TTFT.
Common Pitfalls When Optimizing Inference
- Quantizing without a quality test set. FP8 and INT8 can look free on standard benchmarks and still regress on hard production tasks. Always rerun your evaluator suite after a precision change.
- Reading aggregate throughput without goodput. A server can advertise 20K tokens-per-second total while p95 TTFT for individual users blows past 3 seconds. Track goodput, not raw throughput.
- Caching the wrong prefix. Prefix cache hits only help if the prefix is byte-identical. Insertion of a per-user timestamp or a session ID at the top of a system prompt destroys the cache hit rate.
- Speculative decoding with a poorly aligned draft. If the draft model and target model disagree on more than roughly 50 percent of tokens, speculative decoding can be slower than vanilla decode. Run an acceptance-rate audit before rolling it out.
- Skipping load tests. Continuous batching only earns its keep at realistic concurrency. Run a representative load profile (see our load testing tool guide) before changing batch settings in production.
Where Future AGI Fits in the Inference Stack
Future AGI sits at the application layer, on top of whichever serving stack you use (vLLM, TensorRT-LLM, SGLang, or a managed API). It provides three things inference teams rely on in 2026:
- traceAI for application-level OpenTelemetry traces, instrumenting LiteLLM, the OpenAI SDK, LangChain, LlamaIndex, OpenAI Agents, and MCP servers under Apache 2.0.
- The evaluate API for online quality scoring of streamed completions with deterministic and LLM-judge evaluators. Scores attach to traces and feed dashboards.
- The Agent Command Center at /platform/monitor/command-center for production dashboards, BYOK gateway routing, and the Protect guardrail layer for output filtering.
If you are switching models, quantizing, or rolling out speculative decoding, traceAI and the evaluator suite give you the before-and-after data to ship safely. Read the getting-started docs or jump into the GitHub repo for the Apache 2.0 source.
Frequently asked questions
What is LLM inference?
Inference is everything that happens after training: the frozen model tokenizes your prompt, runs prefill to produce the first token, then decodes one token at a time until a stop condition.
What is the difference between training and inference?
Training updates model weights against a large corpus; inference runs the fixed weights against live prompts. Latency, cost, GPU utilization, and streaming UX are all inference concerns.
What is TTFT and why does it matter?
Time-to-first-token is prefill latency plus queue time. It sets how responsive a chat feels; under 500 ms for prompts up to 4K tokens is the 2026 target.
How does KV caching speed up LLM inference?
The cache stores the key and value tensors for all prior tokens, so each decode step computes attention only for the newest token instead of recomputing the whole sequence.
What decoding strategy should I use for production?
Top-p sampling with p around 0.9 and temperature between 0.3 and 0.7 is the default for chat and content; use constrained decoding for JSON and tool calls, and beam search mainly for translation and summarization.
How do I measure inference quality, not just speed?
Pair serving metrics (TTFT, TPOT, throughput) with application-level traces and evaluator scores such as faithfulness and factuality from the evaluate API.
What is the cheapest way to run inference at scale?
Route easy queries to smaller open-weight models behind a BYOK gateway, reuse KV and prefix caches, batch continuously, and quantize to FP8 or INT8 once quality is verified.
How does inference observability fit alongside model serving?
The serving stack exports Prometheus metrics for capacity planning; traceAI adds application-level traces with quality scores on top, and both land in the Agent Command Center.