Evaluating vLLM Self-Hosted LLMs in 2026: Catching the Serving Failures the Model Eval Misses
How to evaluate a vLLM self-hosted LLM in 2026: catch continuous-batching jitter, KV-cache eviction, and AWQ/GPTQ/FP8 drift before prod.
The launch is a 70B Llama 3.3 served on vLLM, quantized to AWQ-INT4 to fit a three-H100 footprint. The offline eval ran on the FP16 weights in a notebook and scored 0.87 on Groundedness, 0.91 on TaskCompletion, 0.93 on EvaluateFunctionCalling. The gateway lights up Monday morning. By Wednesday the on-call thread reads: p99 latency triples on the 9am burst, JSON-mode adherence drops to 0.79 on the support agent’s structured outputs, and the second tool call in three-step chains starts dropping arguments about a third of the time. The model scored clean in the notebook; the serving stack that shipped was never scored at all.
This is the failure shape every team running vLLM in production hits. The model eval scored the weights. The serving stack (PagedAttention, AWQ kernels, continuous batching, KV-cache eviction) was never on the eval path.
The opinion this post earns: vLLM eval is two problems, model quality and serving quality, and most teams eval the model and ship the serving. The serving has its own failure modes. Continuous-batching jitter that hides inside aggregate p99. KV-cache eviction that breaks coherence past the prefix-cache budget. AWQ and GPTQ quantization drift that survives free-form generation and dies on JSON and tool calls. Each one is invisible to a single-tenant notebook benchmark and visible in production within a week. The eval that catches them runs against the artifact you ship, on the hardware path you ship it on, with the request mix it will actually see.
This guide is the working playbook for evaluating a vLLM self-hosted LLM end to end in 2026. It is shaped against the ai-evaluation SDK, the traceAI OpenAI-compatible instrumentor, and the Agent Command Center gateway with vLLM as a backend. The model-eval side reuses the same templates you would run against Claude or GPT-5; the serving-eval side is what this post is about.
TL;DR: the model-vs-serving eval split
| Side | What it scores | Hardware path | Misses if you skip it |
|---|---|---|---|
| Model quality | Groundedness, TaskCompletion, EvaluateFunctionCalling on golden set | FP16 in any environment | Nothing if your serving never quantizes or batches |
| Quantization regression | Per-axis delta of AWQ / GPTQ / FP8 vs FP16 | The exact quantized variant you ship | Silent JSON and tool-call cliffs |
| Tail quality | Scores binned by output-token position and prompt length | Production hardware path | Long-output coherence drops and long-context grounding cliffs |
| Continuous-batching jitter | Per-token decode latency variance under burst | Production scheduler with concurrent tenants | p99 spikes that aggregate p99 hides |
| KV-cache eviction | Quality cliff past the prefix-cache budget | Production memory pressure | Coherence and tool-arg drops mid-generation |
| Per-tenant fairness | Small-tenant p99 under shared continuous batching | Multi-tenant burst test | Customer-specific SLO breaches |
Ship only when the model side passes and the five serving-side checks pass on the exact artifact you serve. Model-side green plus serving-side untested is the quality cliff dressed as a launch.
Why model eval and serving eval are two problems
When you call OpenAI or Anthropic, you evaluate one thing: the prompt and the application logic. The vendor owns the weights, the inference kernels, the scheduler, the quantization, and the SLA. Your eval covers content quality and that is enough.
A vLLM stack inverts every assumption.
The model card is yours to mutate. Swap a Llama 3.3 70B base for a Qwen 3 32B in an afternoon. Fine-tune either of them. Quantize a 70B down to AWQ-INT4 to fit three GPUs instead of eight. Each one of these is a different model that needs a fresh baseline.
The inference stack is yours to operate. PagedAttention pages KV blocks. Continuous batching interleaves prefill and decode tokens across tenants in the same forward pass. AWQ, GPTQ, FP8, and BNB kernels run different math at different precision. Each behavior is a place where quality regresses or latency drifts in ways the model eval cannot see.
The two surfaces fail differently. Model quality regressions are usually distribution shifts: the new fine-tune cliffs on a slice you didn’t sample. Serving quality regressions are mostly tail behavior. p50 stays flat, p99 doubles, JSON adherence drops on structured outputs but not on chat, the small tenant’s burst gets starved. Aggregate eval scores miss them. Per-bucket scores find them.
The rest of this post walks the five serving-side checks that catch what the model eval misses. Each is a single failure mode with a code-level test. For broader background on the engine itself see What is vLLM in 2026 and the vLLM self-hosted inference alternatives comparison.
Check 1: quantization regression on the variant you actually ship
vLLM supports FP16, FP8, AWQ-INT4, GPTQ-INT4, BNB, and HQQ across the 2026 model catalog. The pattern across them is consistent. Free-form generation stays within a few points of FP16. Structured outputs fall off a cliff.
A typical finding on 70B Llama 3.3 served on H100s, scored on a 200-case mixed golden set: FP16 baseline Groundedness 0.87, AWQ-INT4 0.83, GPTQ-INT4 0.82, FP8 0.86. TaskCompletion holds across the four. The interesting numbers are EvaluateFunctionCalling on multi-tool chains (FP16 0.91, AWQ-INT4 0.78, GPTQ-INT4 0.79, FP8 0.88) and JSON-mode strict-schema adherence (FP16 0.94, AWQ-INT4 0.83, GPTQ-INT4 0.81, FP8 0.92). Numbers will vary by your prompts and your model; the shape will not. INT4 breaks the structured cases, FP8 mostly holds, FP16 is the ceiling.
The eval pattern that catches this is a per-axis delta against FP16 on the same hardware path, scored with EvaluateFunctionCalling, a JSON-schema validator as a heuristic gate, and a CustomLLMJudge rubric that scores semantic preservation case by case.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, EvaluateFunctionCalling,
)
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

quant_delta_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "quantization_quality_delta",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Compare FP16_ANSWER and QUANTIZED_ANSWER for the same input. "
            "Return 1.0 if semantically identical and structurally equivalent. "
            "Return 0.0 if the quantized answer breaks a JSON field, "
            "drops a tool argument, or changes a critical claim. "
            "Penalize structured-output drift more than free-form drift."
        ),
    },
)


def quant_regression(samples, fp16_fn, quant_fn):
    # Generate each quantized answer once so the templates and the judge
    # score the same text (sampling would otherwise give two different outputs).
    quant_outputs = [quant_fn(ex.input) for ex in samples]

    # Per-axis scores for the quantized variant.
    results = evaluator.evaluate(
        eval_templates=[
            Groundedness(),
            TaskCompletion(),
            EvaluateFunctionCalling(),
        ],
        inputs=[
            TestCase(
                input=ex.input,
                output=quant_out,
                context=getattr(ex, "context", ""),
                expected_output=getattr(ex, "gold", None),
            )
            for ex, quant_out in zip(samples, quant_outputs)
        ],
    )

    # Case-by-case semantic delta between the FP16 and quantized answers.
    deltas = []
    for ex, quant_out in zip(samples, quant_outputs):
        delta = quant_delta_judge.compute_one(CustomInput(
            question=ex.input, answer_a=fp16_fn(ex.input), answer_b=quant_out,
        ))["output"]
        deltas.append(delta)

    return {
        "per_axis": results.eval_results,
        # Guard against an empty sample set.
        "semantic_delta_mean": sum(deltas) / len(deltas) if deltas else None,
    }
```
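The JSON-schema validator named as the heuristic gate is not part of the listing above. A minimal stdlib-only sketch of that gate could look like the following. It checks only that the output parses and that required fields carry the expected types, which is weaker than full JSON Schema validation. `REQUIRED_FIELDS`, `adheres`, and `json_adherence` are hypothetical names, and the field list is an invented support-agent schema.

```python
import json

# Hypothetical schema for a support-agent reply: field name -> expected Python type.
REQUIRED_FIELDS = {"intent": str, "priority": int, "tags": list}


def adheres(raw, required=REQUIRED_FIELDS):
    """Return True if raw parses as a JSON object with every required field and type."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False  # truncated or non-string output fails the gate
    if not isinstance(obj, dict):
        return False
    return all(
        key in obj and isinstance(obj[key], typ) for key, typ in required.items()
    )


def json_adherence(outputs):
    """Fraction of outputs that pass the structural check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(adheres(o) for o in outputs) / len(outputs)
```

Running `json_adherence` over the FP16 outputs and the quantized outputs for the same prompts gives the strict-schema adherence pair to compare, without a judge call.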
Decide the floor before the run. A 2-point drop on EvaluateFunctionCalling on chat workloads is acceptable. The same drop on a tool-using agent is a release blocker. Quantize for chat; pay for FP16 or FP8 on agents. The eval is what tells you which case you are in.
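One way to pin the floor down before the run is to encode it as data and let a small function return the blocking axes. The sketch below assumes per-axis scores have already been reduced to plain floats. The `FLOORS` values are illustrative rather than recommendations, and `release_blockers` is a hypothetical helper, not part of the ai-evaluation SDK.

```python
# Hypothetical floors: maximum tolerated drop from FP16, per workload and axis.
FLOORS = {
    "chat": {"Groundedness": 0.05, "EvaluateFunctionCalling": 0.02},
    "agent": {"Groundedness": 0.05, "EvaluateFunctionCalling": 0.01},
}


def release_blockers(fp16, quant, workload, floors=FLOORS):
    """Return (axis, drop) pairs whose drop from FP16 exceeds the workload's floor."""
    blockers = []
    for axis, max_drop in floors[workload].items():
        # Round to avoid float noise such as 0.020000000000000018.
        drop = round(fp16[axis] - quant[axis], 4)
        if drop > max_drop:
            blockers.append((axis, drop))
    return blockers


fp16 = {"Groundedness": 0.87, "EvaluateFunctionCalling": 0.91}
two_point_drop = {"Groundedness": 0.86, "EvaluateFunctionCalling": 0.89}

print(release_blockers(fp16, two_point_drop, "chat"))   # no blockers
print(release_blockers(fp16, two_point_drop, "agent"))  # function calling blocks
```

The same two-point drop passes the chat floor and blocks the agent release, which is the distinction the paragraph above draws.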
Check 2: tail quality, not aggregate quality
Aggregate scores lie about long outputs and long prompts. A model that scores 0.87 on Groundedness across a mixed bucket can score 0.93 on the first 500 output tokens and 0.71 past 2000 (same model, same prompts, same eval) because coherence and grounding degrade with output position when KV-cache pressure forces recomputation or when the decoder’s attention sink drifts.
The fix is to bin every score by two axes: prompt-length bucket (0-8k, 8k-32k, 32k-64k, 64k+) and output-position bucket (first 500 tokens, 500-2000, 2000+). Report per-bucket scores. The interesting bucket is the one that falls off a cliff relative to the rest.
```python
def bucketed_score(samples, model_fn, rubric):
    by_bucket = {}
    for ex in samples:
        prompt_tokens = ex.prompt_tokens
        prompt_bucket = (
            "0-8k" if prompt_tokens < 8000
            else "8k-32k" if prompt_tokens < 32000
            else "32k-64k" if prompt_tokens < 64000
            else "64k+"
        )
        # Slice by position, not by character. A whitespace split is a rough
        # stand-in for token positions; swap in the serving tokenizer for
        # exact boundaries.
        words = model_fn(ex.input).split()
        for output_bucket, tok_slice in (
            ("first_500", words[:500]),
            ("500-2000", words[500:2000]),
            ("2000+", words[2000:]),
        ):
            if not tok_slice:
                continue  # output ended before this bucket
            key = (prompt_bucket, output_bucket)
            by_bucket.setdefault(key, []).append(
                evaluator.evaluate(
                    eval_templates=[rubric],
                    inputs=[TestCase(
                        input=ex.input,
                        output=" ".join(tok_slice),
                        context=ex.context,
                    )],
                ).eval_results[0]
            )
    return by_bucket
```
A typical curve on a 70B Llama 3.3 served at AWQ-INT4: the 0-8k prompt bucket scores 0.88 / 0.86 / 0.84 across the three output buckets; the 64k+ prompt bucket scores 0.82 / 0.74 / 0.61. The 64k+ × 2000+ cell marks the edge of the operating envelope. Past it, you route to a different model or you summarize the prompt before passing it in.
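Finding the cliff cell can be automated once each bucket holds numeric scores. The sketch below assumes the scores have been extracted to floats and flags any cell whose mean sits more than a chosen margin below the best cell. `cliff_cells` and the 0.15 margin are hypothetical. The grid reuses the two rows quoted above.

```python
from statistics import mean


def cliff_cells(by_bucket, max_drop=0.15):
    """Return cells whose mean score is more than max_drop below the best cell."""
    means = {cell: mean(scores) for cell, scores in by_bucket.items() if scores}
    if not means:
        return []
    best = max(means.values())
    # Round the gap so float noise does not tip a borderline cell.
    return sorted(cell for cell, m in means.items() if round(best - m, 4) > max_drop)


grid = {
    ("0-8k", "first_500"): [0.88],
    ("0-8k", "500-2000"): [0.86],
    ("0-8k", "2000+"): [0.84],
    ("64k+", "first_500"): [0.82],
    ("64k+", "500-2000"): [0.74],
    ("64k+", "2000+"): [0.61],
}
print(cliff_cells(grid))
```

With this margin only the 64k+ × 2000+ cell is flagged. A tighter margin would also flag the 64k+ × 500-2000 cell, so the margin is a policy decision to make per workload.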
Check 3: continuous-batching jitter under burst
vLLM’s continuous batching is throughput-optimal and per-request volatile. The scheduler packs prefill tokens from a new request into the same forward pass as decode tokens from in-flight requests, so every in-flight decode step slows down by the cost of that prefill. Single-tenant load tests do not surface this. The pattern only appears when a prefill burst from one tenant collides with the decode tail of another.
Detect it from spans, not from synthetic load tests. Compute per-token decode latency for every completed call (completion_duration_ms / completion_token_count), slide a rolling p99 window over 5-minute buckets, and alert on a 2x spike against the 24-hour trailing baseline.
```python
from fi_instrumentation import using_attributes

# current_batch_size and prefill_token_count come from your serving layer;
# client is the OpenAI-compatible client pointed at vLLM.
with using_attributes({
    "llm.system": "vllm",
    "llm.vllm.engine_version": "0.6.4",
    "llm.vllm.quantization": "awq",
    "llm.vllm.batch_size": current_batch_size,
    "llm.vllm.prefill_in_batch": prefill_token_count,
}):
    response = client.chat.completions.create(...)

# In your analytics layer:
# per_token_latency = span.completion_duration_ms / span.token_count.completion
# rolling_p99 = percentile(per_token_latency, 0.99, window="5m")
# alert if rolling_p99 > 2.0 * trailing_baseline_p99_24h
```
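The analytics-layer comments can be turned into a small detector. The sketch below is one possible reading, not a reference implementation: it assumes each call has been exported as a `(timestamp_seconds, completion_duration_ms, completion_tokens)` tuple, uses a nearest-rank percentile, and compares each 5-minute window against the 24 hours that precede it. `jitter_alerts` and the tuple layout are hypothetical.

```python
import math
from collections import defaultdict

WINDOW_S = 5 * 60          # 5-minute buckets
BASELINE_S = 24 * 60 * 60  # 24-hour trailing baseline


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def jitter_alerts(calls, factor=2.0):
    """Return start times of windows whose per-token p99 exceeds factor x baseline."""
    # Per-token decode latency; skip calls that produced no completion tokens.
    samples = sorted((ts, dur / toks) for ts, dur, toks in calls if toks > 0)

    windows = defaultdict(list)
    for ts, latency in samples:
        windows[int(ts // WINDOW_S) * WINDOW_S].append(latency)

    alerts = []
    for start in sorted(windows):
        baseline = [lat for ts, lat in samples if start - BASELINE_S <= ts < start]
        if not baseline:
            continue  # no history yet, nothing to compare against
        if p99(windows[start]) > factor * p99(baseline):
            alerts.append(start)
    return alerts


# One hour at 20 ms/token, then a window at 90 ms/token.
steady = [(t, 2000, 100) for t in range(0, 3600, 10)]
burst = [(3600 + 10 * j, 9000, 100) for j in range(30)]
print(jitter_alerts(steady + burst))
```

The baseline scan is quadratic in the worst case, which is fine for a sketch; a production detector would keep a running sketch of the trailing distribution instead.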
Spike causes cluster by span attribute. Prefill-burst spikes co-occur with prefill_in_batch > 4000. KV-eviction spikes co-occur with cache-hit-ratio drops. Scheduler-starve spikes co-occur with low-priority-tenant requests behind a high-priority queue. Tag each spike with its reason, route by cluster, and the fix is configuration: cap concurrent prefills, raise the prefill budget, or assign priority lanes by tenant. The LLM observability self-hosting guide walks the rolling-window detector pattern in more depth.
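The three co-occurrence rules can be sketched as a tagging function over span attributes. The 4000-token prefill threshold comes from the text above; the 0.55 cache-hit floor, the `tenant.priority` and `queue_wait_ms` attribute names, and the 500 ms wait threshold are assumptions made for illustration and would need to match whatever your spans actually carry.

```python
from collections import Counter


def spike_reason(attrs, cache_hit_floor=0.55, queue_wait_ms=500):
    """Assign one reason tag to a latency-spike span from its attributes."""
    if attrs.get("llm.vllm.prefill_in_batch", 0) > 4000:
        return "prefill_burst"
    if attrs.get("llm.vllm.cache_hit_ratio", 1.0) < cache_hit_floor:
        return "kv_eviction"
    # Hypothetical attributes: tenant priority lane and time spent queued.
    if attrs.get("tenant.priority") == "low" and attrs.get("queue_wait_ms", 0) > queue_wait_ms:
        return "scheduler_starve"
    return "unclassified"


spikes = [
    {"llm.vllm.prefill_in_batch": 6000, "llm.vllm.cache_hit_ratio": 0.9},
    {"llm.vllm.prefill_in_batch": 5200},
    {"llm.vllm.cache_hit_ratio": 0.4},
    {"tenant.priority": "low", "queue_wait_ms": 900},
    {},
]
clusters = Counter(spike_reason(s) for s in spikes)
print(clusters.most_common())
```

The rule order encodes a priority: a span that matches several rules is counted under the first one, so put the cause you can fix most cheaply at the top.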
Check 4: KV-cache eviction patterns
PagedAttention is the reason 200k context is even feasible to serve. It is also the reason coherence drops in the middle of a long output you did not expect to fail. When GPU memory pressure rises, vLLM preempts requests, frees their KV blocks, and recomputes those blocks when the request is rescheduled. The compute cost is latency. The quality cost is the part teams miss.
When a long-context request loses prefix-cache hits mid-generation, the model behaves as if it never saw the early tokens. Tool-call sequences drop arguments seen only in the system prompt. RAG answers cite details from the wrong document. Reasoning chains lose the constraint stated up top. The aggregate Groundedness number stays flat; the per-output-position curve cliffs around the eviction point.
The detection is the bucketed score from check 2, refined with one more axis: the llm.vllm.cache_hit_ratio attribute on the span. Join scores to cache-hit-ratio buckets and the cliff appears cleanly.
```python
def eviction_curve(spans, rubric):
    by_cache_bucket = {}
    for span in spans:
        hit_ratio = span.attributes["llm.vllm.cache_hit_ratio"]
        cache_bucket = (
            "high" if hit_ratio > 0.85
            else "medium" if hit_ratio > 0.55
            else "low"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_cache_bucket.setdefault(cache_bucket, []).append(score)
    return by_cache_bucket
```
A typical finding: cache-hit-ratio > 0.85 scores Groundedness 0.88; the 0.55-0.85 bucket scores 0.82; the < 0.55 bucket scores 0.69. The fix is not eval. The fix is configuration: raise gpu_memory_utilization, drop max concurrent long-context requests, or shard the model across more GPUs. Eval is what tells you the configuration is wrong before the customer does.
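The three configuration levers map onto vLLM engine arguments: `gpu_memory_utilization`, `max_num_seqs`, and `tensor_parallel_size`, with `enable_prefix_caching` controlling the prefix cache itself. The sketch below only builds the keyword arguments so it runs without a GPU. The values and the model name are placeholders rather than tuned settings, and `long_context_engine_args` is a hypothetical helper. Note that `max_num_seqs` caps all concurrent sequences, not only long-context ones.

```python
def long_context_engine_args(model, gpus):
    """Build keyword arguments for vllm.LLM with more headroom for KV blocks."""
    return {
        "model": model,
        "quantization": "awq",
        "tensor_parallel_size": gpus,    # shard across more GPUs
        "gpu_memory_utilization": 0.92,  # illustrative: give the KV cache more memory
        "max_num_seqs": 32,              # illustrative: fewer concurrent sequences
        "enable_prefix_caching": True,
    }


# Placeholder model name; substitute the artifact you actually serve.
engine_args = long_context_engine_args("your-org/llama-3.3-70b-awq", gpus=4)
print(engine_args)

# On a GPU host the dict would be passed straight through:
#   from vllm import LLM
#   llm = LLM(**engine_args)
```

Re-run the cache-hit-ratio curve after each change; the configuration is right when the low bucket empties out, not when the aggregate score moves.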
Check 5: per-tenant fairness, not aggregate fairness
Continuous batching is globally throughput-optimal and per-tenant unfair. A tenant sending five requests per minute can see p99 triple when a tenant sending five hundred requests per minute starts a burst of long prefills. The aggregate p99 stays inside SLO. The small tenant’s p99 does not.
Run this check deliberately, before launch. Generate synthetic traffic from two tenants at different rates. Measure per-tenant p99 separately. Confirm the small tenant’s worst case stays inside the SLO. vLLM exposes priority and prefill-budget knobs in the scheduler; tune them against this test, not against the aggregate.
A CustomLLMJudge rubric over the trace tree catches drift after the fact, but the load test catches the configuration mistake before ship. Tag every span with tenant_id, compute per-tenant p99 in the rolling window, alert when a tenant’s p99 exceeds 1.5x the aggregate. That alert is the fairness SLO.
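The fairness alert reduces to a per-tenant percentile compared against the aggregate. A minimal sketch, assuming spans have been flattened to `(tenant_id, latency_ms)` pairs for one rolling window; `fairness_breaches` is a hypothetical name and the percentile is nearest-rank.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def fairness_breaches(spans, ratio=1.5):
    """Return {tenant_id: p99} for tenants whose p99 exceeds ratio x aggregate p99."""
    by_tenant = defaultdict(list)
    for tenant, latency_ms in spans:
        by_tenant[tenant].append(latency_ms)
    if not by_tenant:
        return {}
    aggregate = p99([lat for lats in by_tenant.values() for lat in lats])
    tenant_p99 = {tenant: p99(lats) for tenant, lats in by_tenant.items()}
    return {t: v for t, v in tenant_p99.items() if v > ratio * aggregate}


# Tenant A sends 500 requests at 100 ms; tenant B sends 5, one of them starved.
window = [("A", 100)] * 500 + [("B", 100)] * 4 + [("B", 900)]
print(fairness_breaches(window))
```

In this window the aggregate p99 is still 100 ms because tenant B contributes under 1% of the traffic, while tenant B's own p99 is 900 ms. That is the case the aggregate SLO cannot see.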
Wiring vLLM to traceAI in five lines
vLLM exposes an OpenAI-compatible endpoint, so the traceAI OpenAI instrumentor works as-is once base_url points at your vLLM server.
```python
import os

from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="vllm-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

# Point the standard OpenAI client at the vLLM server or the gateway in front of it.
client = OpenAI(
    base_url="https://your-vllm-or-gateway.example.com/v1",
    api_key=os.environ["GATEWAY_KEY"],
)
```
Every chat completion now emits fi.span.kind=LLM spans with llm.model_name, llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total. Add llm.system=vllm plus llm.vllm.engine_version, llm.vllm.quantization, llm.vllm.batch_size, and llm.vllm.cache_hit_ratio as using_attributes context to separate vLLM traffic from API-backed traffic and to feed the five serving-quality checks above. Those nine attributes, plus llm.vllm.prefill_in_batch for check 3 and tenant_id for check 5, are all the bucketed and rolling-window detectors need.
Pipe spans through the Agent Command Center when you want quantization-variant canary, shadow, and race modes without app-code changes. The gateway returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, and x-agentcc-routing-strategy on every call, so the cost-per-million-tokens and the rolling p99 ride on the same trace as the eval scores. Deploy it BYOC in the same Kubernetes cluster as vLLM and the extra network hop adds negligible latency.
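Reading the four gateway headers into typed fields is a small parsing step. The sketch below assumes the cost and latency headers carry plain numeric strings, which the text does not specify, and `routing_metadata` is a hypothetical helper. With the openai Python client, the raw headers are reachable through the `with_raw_response` accessor, as noted in the trailing comment.

```python
def routing_metadata(headers):
    """Extract the x-agentcc-* headers from a header mapping into a typed dict."""

    def as_float(name):
        raw = headers.get(name)
        try:
            return float(raw) if raw is not None else None
        except ValueError:
            return None  # header present but not numeric (format is an assumption)

    return {
        "cost": as_float("x-agentcc-cost"),
        "latency_ms": as_float("x-agentcc-latency-ms"),
        "model_used": headers.get("x-agentcc-model-used"),
        "routing_strategy": headers.get("x-agentcc-routing-strategy"),
    }


# Invented sample values, for illustration only.
sample = {
    "x-agentcc-cost": "0.00042",
    "x-agentcc-latency-ms": "187",
    "x-agentcc-model-used": "llama-3.3-70b-awq",
    "x-agentcc-routing-strategy": "canary",
}
print(routing_metadata(sample))

# Against a live gateway:
#   raw = client.chat.completions.with_raw_response.create(...)
#   meta = routing_metadata(raw.headers)
#   completion = raw.parse()
```

Attaching the parsed dict to the span as attributes keeps cost, routing decision, and eval score joinable on one trace ID.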
How Future AGI ships vLLM eval
Future AGI ships the eval stack as a package. Start with the SDK and traceAI for code-defined gates. Graduate to the Platform when self-improving rubrics and per-cluster routing become the bottleneck.
- ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, PromptInjection, and DataPrivacyCompliance. CustomLLMJudge is the primitive for quantization-delta scoring and tail-quality bucketing. Local heuristic metrics (regex, JSON schema, BLEU, ROUGE) run offline at sub-second latency, which matters when you score every quantization variant against the same 500-case set on each model push.
- traceAI. 50+ AI surfaces across Python, TypeScript, Java, and C#. The OpenAI-compatible instrumentor covers vLLM out of the box. Every span carries llm.model_name and token counts; add the vLLM-specific attributes as using_attributes context and the bucketing and rolling-window detectors run against the same span tree.
- Agent Command Center. Single Go binary, Apache 2.0, 100+ providers including vLLM. Shadow, mirror, and race modes for quantization-variant rollouts; eval-gated canary rollback as the default; benchmarked at ~29k req/s, P99 ≤ 21 ms with guardrails on, on t3.xlarge. Returns the cost, latency, model, and routing-strategy headers on every call.
- agent-opt. Six optimizers (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for closing the residual quality gap on a quantized model with prompt tuning. If AWQ-INT4 loses 3 points on EvaluateFunctionCalling, PROTEGI’s gradient pass often recovers 2 of them on the same model before you trade up to FP8.
- Future AGI Platform. Self-improving evaluators that retune from production traces; in-product authoring agent that writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces with HDBSCAN and a Sonnet 4.5 Judge writes the immediate_fix per cluster. Typical clusters on a vLLM stack: “AWQ-INT4 drops 4 points on JSON-mode strict outputs, fix: route JSON-mode traffic to FP8 variant,” “Tenant B’s p99 doubles when Tenant A’s 9am burst hits, fix: raise prefill budget on Tenant B’s priority lane,” “long-context loses grounding past 64k, fix: route past 64k to summarization preprocessor.” Each immediate_fix flows back into the routing policy at the gateway and into the next eval run.
Drop ai-evaluation and traceAI into the gate this afternoon. Add the gateway when canary rollout becomes the workflow. Turn the Platform on when per-cluster routing is the bottleneck.
Ready to evaluate your first vLLM stack? Run pip install ai-evaluation traceai-openai, scaffold the quantization-delta and bucketed-tail-quality gates against your golden set, point your OpenAI client at https://gateway.futureagi.com/v1 with vLLM upstream for shadow traffic, and alarm the rolling p99 and cache-hit-ratio detectors on the canary cohort. The quantized variant that survives all five serving checks is the one worth shipping; everything else is a regression the notebook didn’t show you.
Related reading
- What is vLLM in 2026
- vLLM Self-Hosted Inference Alternatives (2026)
- LLM Observability Self-Hosting Guide (2026)
- Evaluating Cheap Frontier Models (2026)
- LLM Eval Shadow Traffic and Canary Patterns (2026)
- Best Self-Hosted AI Gateways (2026)
- AI Agent Cost Optimization and Observability (2026)
- Self-Hosted LLM Observability (2026)
Frequently asked questions
Why is evaluating a vLLM self-hosted LLM different from evaluating an API-backed model?
What is continuous-batching jitter and how do I catch it?
How do KV-cache eviction patterns hurt quality on vLLM?
Does AWQ, GPTQ, or FP8 quantization actually degrade quality enough to matter?
How do I instrument vLLM with traceAI?
What does Future AGI ship for vLLM evaluation specifically?
What is the worst anti-pattern for vLLM eval in 2026?