
Evaluating vLLM Self-Hosted LLMs in 2026: Catching the Serving Failures the Model Eval Misses

How to evaluate a vLLM self-hosted LLM in 2026: catch continuous-batching jitter, KV-cache eviction, and AWQ/GPTQ/FP8 drift before prod.


The launch is a 70B Llama 3.3 served on vLLM at AWQ-INT4 to fit a three-H100 footprint. The offline eval ran on the FP16 weights in a notebook and scored 0.87 on Groundedness, 0.91 on TaskCompletion, 0.93 on EvaluateFunctionCalling. The gateway lights up Monday morning. By Wednesday the on-call thread reads: p99 latency triples on the 9am burst, JSON-mode adherence drops to 0.79 on the support agent’s structured outputs, and the second tool call in three-step chains starts dropping arguments about a third of the time. The model that scored clean in the notebook shipped behind a serving stack the notebook never measured.

This is the failure shape every team running vLLM in production hits. The model eval scored the weights. The serving stack (PagedAttention, AWQ kernels, continuous batching, KV-cache eviction) was never on the eval path.

The opinion this post earns: vLLM eval is two problems, model quality and serving quality, and most teams eval the model and ship the serving. The serving has its own failure modes. Continuous-batching jitter that hides inside aggregate p99. KV-cache eviction that breaks coherence past the prefix-cache budget. AWQ and GPTQ quantization drift that survives free-form generation and dies on JSON and tool calls. Each one is invisible to a single-tenant notebook benchmark and visible in production within a week. The eval that catches them runs against the artifact you ship, on the hardware path you ship it on, with the request mix it will actually see.

This guide is the working playbook for evaluating a vLLM self-hosted LLM end to end in 2026. It is shaped against the ai-evaluation SDK, the traceAI OpenAI-compatible instrumentor, and the Agent Command Center gateway with vLLM as a backend. The model-eval side reuses the same templates you would run against Claude or GPT-5; the serving-eval side is what this post is about.

TL;DR: the model-vs-serving eval split

| Side | What it scores | Hardware path | Misses if you skip it |
| --- | --- | --- | --- |
| Model quality | Groundedness, TaskCompletion, EvaluateFunctionCalling on golden set | FP16 in any environment | Nothing if your serving never quantizes or batches |
| Quantization regression | Per-axis delta of AWQ / GPTQ / FP8 vs FP16 | The exact quantized variant you ship | Silent JSON and tool-call cliffs |
| Tail quality | Scores binned by output-token position and prompt length | Production hardware path | Long-output coherence drops and long-context grounding cliffs |
| Continuous-batching jitter | Per-token decode latency variance under burst | Production scheduler with concurrent tenants | p99 spikes that aggregate p99 hides |
| KV-cache eviction | Quality cliff past the prefix-cache budget | Production memory pressure | Coherence and tool-arg drops mid-generation |
| Per-tenant fairness | Small-tenant p99 under shared continuous batching | Multi-tenant burst test | Customer-specific SLO breaches |

Ship only when the model side passes and the five serving-side checks pass on the exact artifact you serve. Model-side green plus serving-side untested is the quality cliff dressed as a launch.

Why model eval and serving eval are two problems

When you call OpenAI or Anthropic, you evaluate one surface: your prompts and application logic. The vendor owns the weights, the inference kernels, the scheduler, the quantization, and the SLA. Your eval covers content quality and that is enough.

A vLLM stack inverts every assumption.

The model card is yours to mutate. Swap a Llama 3.3 70B base for a Qwen 3 32B in an afternoon. Fine-tune either of them. Quantize a 70B down to AWQ-INT4 to fit three GPUs instead of eight. Each one of these is a different model that needs a fresh baseline.

The inference stack is yours to operate. PagedAttention pages KV blocks. Continuous batching interleaves prefill and decode tokens across tenants in the same forward pass. AWQ, GPTQ, FP8, and BNB kernels run different math at different precision. Each behavior is a place where quality regresses or latency drifts in ways the model eval cannot see.

The two surfaces fail differently. Model quality regressions are usually distribution shifts: the new fine-tune cliffs on a slice you didn’t sample. Serving quality regressions are mostly tail behavior. p50 stays flat, p99 doubles, JSON adherence drops on structured outputs but not on chat, the small tenant’s burst gets starved. Aggregate eval scores miss them. Per-bucket scores find them.

The rest of this post walks the five serving-side checks that catch what the model eval misses. Each is a single failure mode with a code-level test. For broader background on the engine itself see What is vLLM in 2026 and the vLLM self-hosted inference alternatives comparison.

Check 1: quantization regression on the variant you actually ship

vLLM supports FP16, FP8, AWQ-INT4, GPTQ-INT4, BNB, and HQQ across the 2026 model catalog. The pattern across them is consistent. Free-form generation stays within a few points of FP16. Structured outputs cliff.

A typical finding on 70B Llama 3.3 served on H100s, scored on a 200-case mixed golden set: FP16 baseline Groundedness 0.87, AWQ-INT4 0.83, GPTQ-INT4 0.82, FP8 0.86. TaskCompletion holds across the four. The interesting numbers are EvaluateFunctionCalling on multi-tool chains (FP16 0.91, AWQ-INT4 0.78, GPTQ-INT4 0.79, FP8 0.88) and JSON-mode strict-schema adherence (FP16 0.94, AWQ-INT4 0.83, GPTQ-INT4 0.81, FP8 0.92). Numbers will vary by your prompts and your model; the shape will not. INT4 breaks the structured cases, FP8 mostly holds, FP16 is the ceiling.

The eval pattern that catches this is a per-axis delta against FP16 on the same hardware path, scored with EvaluateFunctionCalling, a JSON-schema validator as a heuristic gate, and a CustomLLMJudge rubric that scores semantic preservation case by case.

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, EvaluateFunctionCalling,
)
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

quant_delta_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "quantization_quality_delta",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": (
            "Compare FP16_ANSWER and QUANTIZED_ANSWER for the same input. "
            "Return 1.0 if semantically identical and structurally equivalent. "
            "Return 0.0 if the quantized answer breaks a JSON field, "
            "drops a tool argument, or changes a critical claim. "
            "Penalize structured-output drift more than free-form drift."
        ),
    },
)

def quant_regression(samples, fp16_fn, quant_fn):
    # Generate each answer once and reuse it for both the template scores and
    # the judge; calling quant_fn twice can score two different sampled outputs.
    fp16_outputs = [fp16_fn(ex.input) for ex in samples]
    quant_outputs = [quant_fn(ex.input) for ex in samples]

    results = evaluator.evaluate(
        eval_templates=[
            Groundedness(),
            TaskCompletion(),
            EvaluateFunctionCalling(),
        ],
        inputs=[
            TestCase(
                input=ex.input,
                output=quant,
                context=getattr(ex, "context", ""),
                expected_output=getattr(ex, "gold", None),
            )
            for ex, quant in zip(samples, quant_outputs)
        ],
    )
    deltas = [
        quant_delta_judge.compute_one(CustomInput(
            question=ex.input, answer_a=fp16, answer_b=quant,
        ))["output"]
        for ex, fp16, quant in zip(samples, fp16_outputs, quant_outputs)
    ]
    return {
        "per_axis": results.eval_results,
        # Guard against an empty sample set.
        "semantic_delta_mean": sum(deltas) / len(deltas) if deltas else None,
    }
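The JSON-schema validator named above as a heuristic gate can be sketched with the standard library alone. This is a minimal illustration, not the SDK's validator: `TICKET_SCHEMA`, `adheres`, and `adherence_rate` are hypothetical names, and the schema format (field name mapped to a Python type) is an assumption chosen to keep the example dependency-free.

```python
import json

# Hypothetical schema for illustration: required field name -> expected Python type.
TICKET_SCHEMA = {"ticket_id": str, "priority": int, "tags": list}


def adheres(raw: str, schema: dict) -> bool:
    """True if raw parses as a JSON object with exactly the schema's keys and types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(schema):
        return False
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    return all(
        isinstance(obj[key], typ) and not (typ is int and isinstance(obj[key], bool))
        for key, typ in schema.items()
    )


def adherence_rate(outputs: list[str], schema: dict) -> float:
    """Fraction of outputs that pass the strict check; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(adheres(out, schema) for out in outputs) / len(outputs)
```

Running `adherence_rate` over the FP16 outputs and the quantized outputs for the same prompts gives the strict-schema delta directly, with no judge call in the loop.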

Decide the floor before the run. A 2-point drop on EvaluateFunctionCalling on chat workloads is acceptable. The same drop on a tool-using agent is a release blocker. Quantize for chat; pay for FP16 or FP8 on agents. The eval is what tells you which case you are in.
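One way to pin the floor down before the run is a small gate over per-axis means. The sketch below assumes you have already reduced each variant's results to one mean score per axis; the `MAX_DROP` values, axis names, and the `gate` function are hypothetical and should be replaced with thresholds your team agrees on.

```python
# Hypothetical floors: maximum tolerated drop versus FP16, per workload and axis.
MAX_DROP = {
    "chat": {"groundedness": 0.05, "function_calling": 0.02, "json_adherence": 0.05},
    "agent": {"groundedness": 0.05, "function_calling": 0.01, "json_adherence": 0.01},
}


def gate(fp16: dict, quant: dict, workload: str) -> list[str]:
    """Return the axes whose drop versus FP16 exceeds the workload's floor."""
    failures = []
    for axis, max_drop in MAX_DROP[workload].items():
        drop = fp16[axis] - quant[axis]
        if drop > max_drop + 1e-9:  # small tolerance for float rounding
            failures.append(axis)
    return failures


fp16_scores = {"groundedness": 0.87, "function_calling": 0.91, "json_adherence": 0.94}
quant_scores = {"groundedness": 0.86, "function_calling": 0.89, "json_adherence": 0.94}

print(gate(fp16_scores, quant_scores, "chat"))   # no axis breaches the chat floors
print(gate(fp16_scores, quant_scores, "agent"))  # function_calling breaches the agent floor
```

The same 2-point drop passes under the chat floors and blocks under the agent floors, which is the distinction the paragraph above draws.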

Check 2: tail quality, not aggregate quality

Aggregate scores lie about long outputs and long prompts. A model that scores 0.87 on Groundedness across a mixed bucket can score 0.93 on the first 500 output tokens and 0.71 past 2000 (same model, same prompts, same eval) because coherence and grounding degrade with output position when KV-cache pressure forces recomputation or when the decoder’s attention sink drifts.

The fix is to bin every score by two axes: prompt-length bucket (0-8k, 8k-32k, 32k-64k, 64k+) and output-position bucket (first 500 tokens, 500-2000, 2000+). Report per-bucket scores. The interesting bucket is the one that falls off a cliff relative to the rest.

def bucketed_score(samples, model_fn, rubric):
    by_bucket = {}
    for ex in samples:
        prompt_tokens = ex.prompt_tokens
        prompt_bucket = (
            "0-8k" if prompt_tokens < 8000
            else "8k-32k" if prompt_tokens < 32000
            else "32k-64k" if prompt_tokens < 64000
            else "64k+"
        )
        # Slice by token position, not by character. Whitespace splitting is a
        # rough proxy; use the serving model's tokenizer for exact boundaries.
        tokens = model_fn(ex.input).split()
        for output_bucket, tok_slice in (
            ("first_500", tokens[:500]),
            ("500-2000", tokens[500:2000]),
            ("2000+", tokens[2000:]),
        ):
            if not tok_slice:
                continue
            key = (prompt_bucket, output_bucket)
            by_bucket.setdefault(key, []).append(
                evaluator.evaluate(
                    eval_templates=[rubric],
                    inputs=[TestCase(input=ex.input, output=" ".join(tok_slice), context=ex.context)],
                ).eval_results[0]
            )
    return by_bucket

A typical curve on a 70B Llama 3.3 served at AWQ-INT4: the 0-8k prompt bucket scores 0.88 / 0.86 / 0.84 across the three output buckets; the 64k+ prompt bucket scores 0.82 / 0.74 / 0.61. The 64k+ × 2000+ cell marks the edge of the operating envelope. Past it, you route to a different model or you summarize the prompt before passing it in.
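Spotting the cliff cell can be automated once the bucketed results are reduced to numeric scores. The sketch below is one possible heuristic, not part of the SDK: it assumes `by_bucket` maps each `(prompt_bucket, output_bucket)` key to a list of floats, and `find_cliffs`, `min_drop`, and `min_n` are hypothetical names and defaults.

```python
from statistics import mean, median


def find_cliffs(by_bucket: dict, min_drop: float = 0.10, min_n: int = 5) -> dict:
    """Flag buckets whose mean score sits at least min_drop below the median bucket mean.

    Buckets with fewer than min_n samples are ignored as too noisy to judge.
    """
    means = {key: mean(scores) for key, scores in by_bucket.items() if len(scores) >= min_n}
    if not means:
        return {}
    baseline = median(means.values())
    return {
        key: round(baseline - m, 4)
        for key, m in means.items()
        if baseline - m >= min_drop
    }


# Illustrative input built from the curve quoted above (five samples per cell).
example = {
    ("0-8k", "first_500"): [0.88] * 5,
    ("0-8k", "500-2000"): [0.86] * 5,
    ("0-8k", "2000+"): [0.84] * 5,
    ("64k+", "first_500"): [0.82] * 5,
    ("64k+", "500-2000"): [0.74] * 5,
    ("64k+", "2000+"): [0.61] * 5,
}
cliffs = find_cliffs(example)
```

Comparing against the median bucket rather than the best bucket keeps one unusually strong cell from flagging everything else; whether that is the right baseline depends on how many cells you track.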

Check 3: continuous-batching jitter under burst

vLLM’s continuous batching is throughput-optimal and per-request volatile. The scheduler packs prefill tokens from a new request into the same forward pass as decode tokens from in-flight requests. The decode tokens slow by the prefill tax. Single-tenant load tests do not surface this. The pattern only appears when prefill bursts from one tenant collide with decode tail from another.

Detect it from spans, not from synthetic load tests. Compute per-token decode latency for every completed call (completion_duration_ms / completion_token_count), slide a rolling p99 window over 5-minute buckets, and alert on a 2x spike against the 24-hour trailing baseline.

from fi_instrumentation import using_attributes

# current_batch_size and prefill_token_count are placeholders: populate them
# from your vLLM metrics endpoint. client is the OpenAI client pointed at vLLM.
with using_attributes({
    "llm.system": "vllm",
    "llm.vllm.engine_version": "0.6.4",
    "llm.vllm.quantization": "awq",
    "llm.vllm.batch_size": current_batch_size,
    "llm.vllm.prefill_in_batch": prefill_token_count,
}):
    response = client.chat.completions.create(...)

# In your analytics layer:
# per_token_latency = span.completion_duration_ms / span.token_count.completion
# rolling_p99 = percentile(per_token_latency, 0.99, window="5m")
# alert if rolling_p99 > 2.0 * trailing_baseline_p99_24h

Spikes cluster by cause. Prefill-burst spikes co-occur with prefill_in_batch > 4000. KV-eviction spikes co-occur with cache-hit-ratio drops. Scheduler-starve spikes co-occur with low-priority-tenant requests behind a high-priority queue. Tag the span attribute, route the cluster, and the fix is configuration: cap concurrent prefills, raise the prefill budget, or assign priority lanes by tenant. The LLM observability self-hosting guide walks the rolling-window detector pattern in more depth.
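The rolling-window detector described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not a traceAI API: spans are assumed to arrive as `(end_ts_seconds, completion_duration_ms, completion_tokens)` tuples exported from your trace store, and `p99` and `jitter_alerts` are hypothetical helper names.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def jitter_alerts(spans, window_s=300, baseline_s=86_400, factor=2.0):
    """Return start timestamps of windows whose per-token p99 exceeds
    factor x the p99 of all samples in the preceding baseline_s seconds."""
    windows = defaultdict(list)
    for ts, duration_ms, tokens in spans:
        if tokens > 0:  # skip empty completions to avoid dividing by zero
            windows[int(ts // window_s) * window_s].append(duration_ms / tokens)

    alerts = []
    for start in sorted(windows):
        trailing = [
            latency
            for other_start, latencies in windows.items()
            if start - baseline_s <= other_start < start
            for latency in latencies
        ]
        # A window with no trailing data cannot be compared, so it never alerts.
        if trailing and p99(windows[start]) > factor * p99(trailing):
            alerts.append(start)
    return alerts


# Ten quiet 5-minute windows at 20 ms/token, then one window with a single
# 100 ms/token outlier.
quiet = [(w * 300 + i, 2000.0, 100) for w in range(10) for i in range(20)]
burst = [(3000 + i, 2000.0, 100) for i in range(19)] + [(3019, 10000.0, 100)]
print(jitter_alerts(quiet + burst))
```

The trailing baseline is recomputed from scratch for each window here, which is fine for an offline pass; a production detector would maintain it incrementally.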

Check 4: KV-cache eviction patterns

PagedAttention is the reason 200k context is even feasible to serve. It is also the reason coherence drops in the middle of a long output you did not expect to fail. When GPU memory pressure rises, vLLM evicts cached KV blocks for older requests, then recomputes them on the next decode step. The compute cost is latency. The quality cost is the part teams miss.

When a long-context request loses prefix-cache hits mid-generation, the model behaves as if it never saw the early tokens. Tool-call sequences drop arguments seen only in the system prompt. RAG answers cite details from the wrong document. Reasoning chains lose the constraint stated up top. The aggregate Groundedness number stays flat; the per-output-position curve cliffs around the eviction point.

The detection is the bucketed score from check 2, refined with one more axis: the llm.vllm.cache_hit_ratio attribute on the span. Join scores to cache-hit-ratio buckets and the cliff appears cleanly.

def eviction_curve(spans, rubric):
    by_cache_bucket = {}
    for span in spans:
        cache_bucket = (
            "high" if span.attributes["llm.vllm.cache_hit_ratio"] > 0.85
            else "medium" if span.attributes["llm.vllm.cache_hit_ratio"] > 0.55
            else "low"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_cache_bucket.setdefault(cache_bucket, []).append(score)
    return by_cache_bucket

A typical finding: cache-hit-ratio > 0.85 scores Groundedness 0.88; the 0.55-0.85 bucket scores 0.82; the < 0.55 bucket scores 0.69. The fix is not eval. The fix is configuration: raise gpu_memory_utilization, drop max concurrent long-context requests, or shard the model across more GPUs. Eval is what tells you the configuration is wrong before the customer does.

Check 5: per-tenant fairness, not aggregate fairness

Continuous batching is globally throughput-optimal and per-tenant unfair. A tenant sending five requests per minute can see p99 triple when a tenant sending five hundred requests per minute starts a burst of long prefills. The aggregate p99 stays inside SLO. The small tenant’s p99 does not.

Run this check deliberately, before ship. Generate synthetic traffic from two tenants at different rates. Measure per-tenant p99 separately. Confirm the small tenant’s worst case stays inside the SLO. vLLM exposes priority and prefill-budget knobs in the scheduler; tune them against this test, not against the aggregate.

A CustomLLMJudge rubric over the trace tree catches drift after the fact, but the load test catches the configuration mistake before ship. Tag every span with tenant_id, compute per-tenant p99 in the rolling window, alert when a tenant’s p99 exceeds 1.5x the aggregate. That alert is the fairness SLO.
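The fairness alert reduces to comparing each tenant's p99 against the aggregate p99. The sketch below assumes spans have been flattened to `(tenant_id, latency_ms)` pairs for one rolling window; `fairness_breaches` and its `min_n` cutoff are hypothetical, and the 1.5x ratio is the one named above.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


def fairness_breaches(spans, ratio=1.5, min_n=20):
    """Return {tenant_id: p99} for tenants whose p99 exceeds ratio x the aggregate p99.

    Tenants with fewer than min_n requests in the window are skipped as too noisy.
    """
    by_tenant = defaultdict(list)
    for tenant_id, latency_ms in spans:
        by_tenant[tenant_id].append(latency_ms)
    if not by_tenant:
        return {}
    aggregate = p99([x for latencies in by_tenant.values() for x in latencies])
    breaches = {}
    for tenant_id, latencies in by_tenant.items():
        if len(latencies) < min_n:
            continue
        tenant_p99 = p99(latencies)
        if tenant_p99 > ratio * aggregate:
            breaches[tenant_id] = tenant_p99
    return breaches


# A large tenant dominates the aggregate; the small tenant's slow tail is hidden by it.
window = (
    [("big", 100.0)] * 1000
    + [("small", 100.0)] * 15
    + [("small", 400.0)] * 5
)
print(fairness_breaches(window))
```

In this example the aggregate p99 stays at the large tenant's latency while the small tenant's p99 is four times higher, which is the case the aggregate SLO never shows.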

Wiring vLLM to traceAI in five lines

vLLM exposes an OpenAI-compatible endpoint, so the traceAI OpenAI instrumentor works as-is once base_url points at your vLLM server.

import os

from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="vllm-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = OpenAI(
    base_url="https://your-vllm-or-gateway.example.com/v1",
    api_key=os.environ["GATEWAY_KEY"],
)

Every chat completion now emits fi.span.kind=LLM spans with llm.model_name, llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total. Add llm.system=vllm plus llm.vllm.engine_version, llm.vllm.quantization, llm.vllm.batch_size, and llm.vllm.cache_hit_ratio as using_attributes context to separate vLLM traffic from API-backed traffic and to feed the serving-quality checks above. Those nine attributes are all the bucketed and rolling-window detectors need.

Pipe spans through the Agent Command Center when you want quantization-variant canary, shadow, and race modes without app-code changes. The gateway returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, and x-agentcc-routing-strategy on every call, so the cost-per-million-tokens and the rolling p99 ride on the same trace as the eval scores. Deploy it BYOC in the same Kubernetes cluster as vLLM and the extra network hop adds negligible latency.

How Future AGI ships vLLM eval

Future AGI ships the eval stack as a package. Start with the SDK and traceAI for code-defined gates. Graduate to the Platform when self-improving rubrics and per-cluster routing become the bottleneck.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, PromptInjection, and DataPrivacyCompliance. CustomLLMJudge is the primitive for quantization-delta scoring and tail-quality bucketing. Local heuristic metrics (regex, JSON schema, BLEU, ROUGE) run offline at sub-second latency, which matters when you score every quantization variant against the same 500-case set on each model push.
  • traceAI. 50+ AI surfaces across Python, TypeScript, Java, and C#. The OpenAI-compatible instrumentor covers vLLM out of the box. Every span carries llm.model_name and token counts; add the vLLM-specific attributes as using_attributes context and the bucketing and rolling-window detectors run against the same span tree.
  • Agent Command Center. Single Go binary, Apache 2.0, 100+ providers including vLLM. Shadow, mirror, and race modes for quantization-variant rollouts; eval-gated canary rollback as the default; benchmarked at ~29k req/s, P99 ≤ 21 ms with guardrails on, on t3.xlarge. Returns the cost, latency, model, and routing-strategy headers on every call.
  • agent-opt. Six optimizers (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for closing the residual quality gap on a quantized model with prompt tuning. If AWQ-INT4 loses 3 points on EvaluateFunctionCalling, PROTEGI’s gradient pass often recovers 2 of them on the same model before you trade up to FP8.
  • Future AGI Platform. Self-improving evaluators that retune from production traces; in-product authoring agent that writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces with HDBSCAN and a Sonnet 4.5 Judge writes the immediate_fix per cluster. Typical clusters on a vLLM stack: “AWQ-INT4 drops 4 points on JSON-mode strict outputs, fix: route JSON-mode traffic to FP8 variant,” “Tenant B’s p99 doubles when Tenant A’s 9am burst hits, fix: raise prefill budget on Tenant B’s priority lane,” “long-context loses grounding past 64k, fix: route past 64k to summarization preprocessor.” Each immediate_fix flows back into the routing policy at the gateway and into the next eval run.

Drop ai-evaluation and traceAI into the gate this afternoon. Add the gateway when canary rollout becomes the workflow. Turn the Platform on when per-cluster routing is the bottleneck.

Ready to evaluate your first vLLM stack? Run pip install ai-evaluation traceai-openai, scaffold the quantization-delta and bucketed-tail-quality gates against your golden set, point your OpenAI client at https://gateway.futureagi.com/v1 with vLLM upstream for shadow traffic, and alarm the rolling p99 and cache-hit-ratio detectors on the canary cohort. The quantized variant that survives all five serving checks is the one worth shipping; everything else is a regression the notebook didn’t show you.

Frequently asked questions

Why is evaluating a vLLM self-hosted LLM different from evaluating an API-backed model?
Two reasons that compose. The model card is mutable: you swap base weights, you fine-tune, you quantize, so the eval has to re-baseline every artifact. The serving stack is yours: continuous batching, KV-cache eviction, prefill scheduling, and quantization kernels all live inside your binary, not behind a vendor SLA. The result is that vLLM eval is two problems, not one. Model quality eval scores the weights. Serving quality eval scores what happens when those weights meet PagedAttention, AWQ, FP8, and 200 concurrent tenants. Most teams eval the model and ship the serving, which is how p99 latency, JSON-mode adherence, and tail correctness regress silently between releases.
What is continuous-batching jitter and how do I catch it?
Continuous batching is vLLM's scheduling technique that holds GPU utilization above 80% by interleaving prefill and decode tokens from different requests inside the same forward pass. The side effect is jitter: a request's per-token latency depends on what else landed in the batch that scheduler tick, so p50 stays flat while p99 spikes whenever a long-prefill request joins a batch of short decodes. Catch it by computing per-token decode latency from spans (completion duration divided by completion token count), sliding a rolling p99 window over it, and alerting on a 2x spike against the 24-hour trailing baseline. A single-tenant load test will not surface this. You need real production trace data and a window-based detector.
How do KV-cache eviction patterns hurt quality on vLLM?
vLLM's PagedAttention lets you serve 128k and 200k context windows by paging KV blocks in and out of GPU memory. When GPU memory pressure rises, the scheduler evicts cached blocks and recomputes them on the next decode step. The compute cost shows up as latency. The quality cost shows up when a long-context request loses prefix-cache hits mid-generation and the model behaves as if it never saw the early tokens. Symptoms are coherence drops in the second half of long outputs, retrieval-grounded responses that mention details from the prompt incorrectly, and tool-use sequences that drop arguments seen in the system prompt. Detect by binning Groundedness by output-token position and watching for a score cliff past the recomputation point.
Does AWQ, GPTQ, or FP8 quantization actually degrade quality enough to matter?
Yes, and the degradation is shaped like a structured-output failure, not a free-form generation failure. Free-form summarization scores stay within one point across FP16, AWQ-INT4, GPTQ-INT4, and FP8 on a 70B model. JSON-mode strict outputs drop 3 to 8 points on INT4 variants. Tool-call argument correctness drops 4 to 10 points on three-or-more-tool composition. FP8 sits between FP16 and INT4 on both axes, needs roughly half the weight memory of FP16, and is the default for teams that want quantization without the structured-output cliff. The eval that catches this is a per-axis delta against FP16 on the same hardware path, scored with EvaluateFunctionCalling and a JSON-schema validator.
How do I instrument vLLM with traceAI?
vLLM ships an OpenAI-compatible endpoint, so the OpenAI instrumentor in traceAI works with base_url pointed at your vLLM server. Call register(project_type=ProjectType.OBSERVE, project_name=...) once, then OpenAIInstrumentor().instrument(tracer_provider=trace_provider), and every chat completion emits an fi.span.kind=LLM span with llm.model_name, prompt and completion token counts, and latency. Tag spans with llm.system=vllm to separate them from API-backed traffic. For deeper signal, add llm.vllm.engine_version, llm.vllm.quantization, llm.vllm.batch_size, and llm.vllm.cache_hit_ratio as custom attributes from your vLLM metrics endpoint. Those four attributes plus the standard token counts give you everything the serving-quality eval needs.
What does Future AGI ship for vLLM evaluation specifically?
The eval stack as a package. The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, and CustomLLMJudge for the model-quality side. traceAI captures vLLM spans through the OpenAI-compatible instrumentor across Python, TypeScript, Java, and C#. The Agent Command Center routes to vLLM as a backend, returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used per call, and runs shadow and canary modes for quantization-variant rollouts. The Future AGI Platform's self-improving evaluators retune from production traces at lower per-eval cost than Galileo Luna-2, and Error Feed clusters per-tenant tail-latency failures so you can route the small-tenant burst case off the batched scheduler before it pages.
What is the worst anti-pattern for vLLM eval in 2026?
Evaluating the FP16 base model in a notebook and shipping the AWQ-INT4 quantized variant behind continuous batching with PagedAttention enabled. Three independent failure modes get hidden in one rollout: quantization drift on JSON and tool calls, KV-cache eviction quality cliff past the prefix-cache budget, and batching jitter that spikes p99 when a long prefill joins short decodes. The eval that caught nothing in the notebook will catch nothing in production until a customer files a ticket about a broken tool sequence three weeks in. Eval the artifact you ship, on the hardware path you ship it on, with the request mix it will actually see.