Evaluating Anyscale Ray Serve LLM Apps in 2026: Catching the Cluster-Level Failures Model Eval Misses
How to evaluate an Anyscale Ray Serve LLM in 2026: catch autoscaling lag, replica skew, and tail-quality cliffs the model eval never sees.
The launch is a 70B Llama 3.3 served on Ray Serve across an Anyscale cluster, two-to-eight replicas autoscaled on A100s, OpenAI-compatible endpoint, Groundedness 0.87 and TaskCompletion 0.91 in the offline notebook on the FP16 weights. The cluster goes live Monday. By Wednesday the on-call thread reads: p99 latency spikes from 1.8s to 5.4s on the 9am burst, JSON-mode adherence drops to 0.78 during scale-up, tool-call argument correctness drops 9 points on the third call of multi-step chains, and one customer’s small-tenant traffic sees p99 doubled even when the cluster aggregate is inside SLO. The model that scored clean in the notebook ships a cluster that scores nothing the notebook measured.
This is the failure shape every team running Ray Serve in production hits. The model eval scored the weights. The cluster (actor router, autoscaler, PagedAttention under burst, KV-cache sharing across replicas, per-tenant scheduling) was never on the eval path.
The opinion this post earns: Ray Serve eval is two problems, model quality and cluster-level serving, and most teams eval the model and ship the cluster. The cluster has its own failure modes. Autoscaling lag that spikes tail latency and tail quality on every burst. Replica skew that hides inside aggregate scores. KV-cache eviction under burst that breaks coherence past the prefix-cache budget. Traffic-pattern drift between the dev replay and prod’s actual burst-vs-steady mix. Each one is invisible to a single-replica offline benchmark and visible in production within a week. The eval that catches them runs load-aware, replica-aware, and p99-tail-quality-aware against the cluster shape you ship.
This guide is the working playbook for evaluating a Ray Serve LLM stack end to end in 2026, shaped against the ai-evaluation SDK, the traceAI OpenAI-compatible instrumentor, and the Agent Command Center gateway with Ray Serve as a backend. The model-eval side reuses the same templates you’d run against Claude or GPT-5; the cluster-eval side is what this post is about.
TL;DR: the model-vs-cluster eval split
| Side | What it scores | Where it runs | Misses if you skip it |
|---|---|---|---|
| Model quality | Groundedness, TaskCompletion, EvaluateFunctionCalling on golden set | Single replica at idle | Nothing if your serving never scales or batches |
| Replica skew | Per-replica score variance across the fleet | Pinned per-replica calls via Ray Serve routes | One bad node hidden in aggregate |
| Autoscaling lag tail quality | Score and latency binned by queue-depth bucket | Live burst against the running scaler | p99 quality cliffs during scale-up |
| KV-cache eviction under burst | Quality cliff past the prefix-cache budget under memory pressure | Real burst against PagedAttention | Mid-generation coherence drops |
| Traffic-pattern drift | Quality binned by burst-vs-steady mix | Production traffic replay, not synthetic uniform | Dev passes, prod cliffs |
| Per-tenant fairness | Small-tenant p99 under shared continuous batching | Multi-tenant burst at production rate | Customer-specific SLO breaches |
Ship only when the model side passes and the five cluster-level checks pass on the artifact you actually serve, on the cluster shape you actually run. Model-side green plus cluster-side untested is the quality cliff dressed as a launch.
Why model eval and cluster eval are two problems
When you call OpenAI or Anthropic, you evaluate one thing: the prompt and the application logic. The vendor owns the weights, the inference kernels, the scheduler, the scaler, and the SLA. Your eval covers content quality and that is enough.
A Ray Serve stack inverts every assumption.
The replica set is yours to mutate. You configure min_replicas and max_replicas, set target_ongoing_requests, pick the accelerator type, decide whether to run two big replicas or eight small ones. Each shape is a different runtime that needs its own baseline. The same model behind two A100-80G replicas does not behave the same as the same model behind eight L4 replicas.
The scheduler is yours to operate. Ray Serve’s actor router fans every request to one of N replicas. The autoscaler adds replicas when ongoing requests pass a threshold and removes them when they fall below. Continuous batching inside each replica interleaves prefill and decode tokens. PagedAttention shares KV blocks across requests. Each behavior is a place where quality regresses or latency drifts in ways the model eval cannot see.
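To make the scaler's arithmetic concrete, here is a minimal sketch of a target-tracking rule of the kind described above. It is a simplification under stated assumptions, not Ray Serve's actual implementation: `desired_replicas` is an illustrative name, the per-replica target of 16 is an example value, and a real autoscaler also applies smoothing and upscale/downscale delays that this ignores.

```python
import math


def desired_replicas(total_ongoing: int, target_ongoing_requests: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified target-tracking rule (an assumption, not Ray Serve's code).

    Ask for enough replicas that each carries at most
    `target_ongoing_requests`, clamped to [min_replicas, max_replicas].
    """
    if target_ongoing_requests <= 0:
        raise ValueError("target_ongoing_requests must be positive")
    wanted = math.ceil(total_ongoing / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, wanted))


# A 2-to-8 replica shape with an example target of 16 ongoing requests each.
for load in (10, 40, 100, 400):
    print(load, desired_replicas(load, 16, min_replicas=2, max_replicas=8))
```

The last case is the interesting one: once the clamp hits `max_replicas`, extra load can only go to the queue, which is where the tail behavior discussed below comes from.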
The two surfaces fail differently. Model quality regressions are usually distribution shifts: the new fine-tune cliffs on a slice you didn’t sample. Cluster quality regressions are mostly tail behavior. p50 stays flat, p99 doubles, JSON adherence drops during scale-up but not at steady state, the small tenant’s burst gets starved, long-context grounding cliffs past the cache budget. Aggregate eval scores miss them. Per-bucket scores find them.
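A short worked example shows why the aggregate hides the tail. The bucket scores are the Groundedness figures this post quotes for the queue-depth buckets; the traffic shares are invented purely for illustration, so treat the output as a sketch of the effect rather than a measurement.

```python
# Bucket scores: Groundedness figures quoted for the queue-depth buckets.
# Traffic shares: made up for illustration only.
bucket_scores = {"idle": 0.88, "warm": 0.86, "saturating": 0.79, "scaling": 0.71}
traffic_share = {"idle": 0.60, "warm": 0.30, "saturating": 0.07, "scaling": 0.03}


def aggregate_score(scores: dict, shares: dict) -> float:
    """Traffic-weighted mean across buckets."""
    total = sum(shares.values())
    return sum(scores[b] * shares[b] for b in scores) / total


agg = aggregate_score(bucket_scores, traffic_share)
worst = min(bucket_scores, key=bucket_scores.get)
print(f"aggregate={agg:.3f}, worst bucket={worst} at {bucket_scores[worst]:.2f}")
```

With these assumed shares the aggregate lands near 0.86 while the scaling bucket sits at 0.71: a 15-point cliff that moves the headline number by about two points.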
The rest of this post walks the five cluster-level checks that catch what the model eval misses. Each is a single failure mode with a code-level test. For broader background on the engine itself, see Evaluating vLLM self-hosted LLMs for the single-box equivalent and the LLM observability self-hosting guide for the detector patterns.
Check 1: replica skew on the fleet you actually ship
A Ray Serve LLM deployment is N replicas behind an actor router. The router’s job is throughput, not parity. Any one replica can silently drift: a partial weight load on a flaky node, a config divergence after a rolling restart, a GPU running at lower clocks because the chassis hit a thermal limit. The replica still passes the Ray Serve health check, still accepts traffic, still serves 1/N of your requests. The aggregate eval score absorbs it.
The eval is direct. Replicate the same golden set across every replica, score each response per replica, compute variance.
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, EvaluateFunctionCalling,
)
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# openai_client is an OpenAI client pointed at the Ray Serve endpoint,
# the same kind of client the traceAI wiring section builds.

def call_pinned(replica_id, prompt):
    # Assumes a route prefix or a header-based router in front of the
    # deployment that pins the call to one replica.
    return openai_client.chat.completions.create(
        model="anyscale/llama-3-70b",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"x-ray-serve-replica-id": replica_id},
    ).choices[0].message.content

def replica_skew(golden_set, replica_ids):
    # Send every golden example to every replica.
    per_replica = {rid: [] for rid in replica_ids}
    for ex in golden_set:
        for rid in replica_ids:
            per_replica[rid].append(call_pinned(rid, ex.input))
    # Score each replica's outputs separately so one bad node cannot hide.
    scores = {rid: evaluator.evaluate(
        eval_templates=[Groundedness(), TaskCompletion(), EvaluateFunctionCalling()],
        inputs=[TestCase(input=ex.input, output=out, context=ex.context)
                for ex, out in zip(golden_set, per_replica[rid])],
    ).eval_results for rid in replica_ids}
    return scores
A healthy fleet scores within 1 to 2 points across replicas on every sub-rubric. A drifted replica usually drops 5 to 10 points on a structured sub-rubric (JSON-mode adherence, tool-call argument correctness) before it shows on free-form. Run this every weekday on a 100-case set. The first time a replica’s EvaluateFunctionCalling cliffs against the cohort, the cluster has told you which node to drain before any customer notices.
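The "compute variance" step can be sketched as a small outlier check. This is a minimal version under assumptions: `flag_skewed_replicas` is an illustrative name, scores are taken to be plain floats in [0, 1] for one sub-rubric (extracting them from the SDK's result objects is not shown), and the 0.05 threshold mirrors the 5-point drop mentioned above.

```python
from statistics import mean, median


def flag_skewed_replicas(scores_by_replica: dict, max_drop: float = 0.05) -> dict:
    """Return {replica_id: drop} for replicas trailing the cohort median.

    `scores_by_replica` maps a replica id to a list of numeric scores for
    one sub-rubric. That input shape is an assumption of this sketch.
    """
    means = {rid: mean(vals) for rid, vals in scores_by_replica.items() if vals}
    if not means:
        return {}
    cohort = median(means.values())
    return {rid: round(cohort - m, 4)
            for rid, m in means.items() if cohort - m > max_drop}


example = {
    "replica-1": [0.90, 0.92, 0.91],
    "replica-2": [0.91, 0.89, 0.90],
    "replica-3": [0.82, 0.80, 0.84],  # the drifted node
}
print(flag_skewed_replicas(example))
```

Comparing against the cohort median rather than the mean keeps a single bad replica from dragging down the baseline it is judged against.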
For deeper rubric design on the judge side, the LLM-as-judge platforms guide covers the CustomLLMJudge pattern in depth.
Check 2: autoscaling-lag tail quality, not aggregate quality
Ray Serve’s autoscaler reacts to ongoing-request pressure on the existing replica set. It adds replicas when ongoing requests cross target_ongoing_requests, removes them when they fall below. The reaction is not instant. When a burst arrives and the existing fleet saturates, the scaler hesitates while requests pile up in the queue, then begins to add replicas, which themselves take seconds to load model weights and accept traffic. p99 latency triples for 30 to 60 seconds.
The cost teams miss is that tail quality cliffs in lockstep. Requests held in the queue lose batching efficiency, decode latency stretches per token, and tail outputs degrade before any new replica is online. The aggregate Groundedness number stays flat across the burst. The per-queue-depth bucket cliffs above 80% saturation.
The detection stamps every request with two attributes, cluster.queue_depth and cluster.replica_count at the moment the request landed, then bins each score by their ratio: queued requests per replica.
from fi_instrumentation import using_attributes

# current_replica_count() and current_queue_depth() are helpers you supply,
# reading the live values from Ray Serve metrics.
with using_attributes({
    "llm.system": "anyscale",
    "ray.serve.deployment_name": "llama-70b-prod",
    "cluster.replica_count": current_replica_count(),
    "cluster.queue_depth": current_queue_depth(),
}):
    response = openai_client.chat.completions.create(...)

def autoscaling_lag_curve(spans, rubric):
    by_bucket = {}
    for span in spans:
        qd = span.attributes["cluster.queue_depth"]
        rc = span.attributes["cluster.replica_count"]
        # Queued requests per replica; max() guards a zero replica count.
        saturation = qd / max(rc, 1)
        bucket = (
            "idle" if saturation < 2
            else "warm" if saturation < 8
            else "saturating" if saturation < 16
            else "scaling"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_bucket.setdefault(bucket, []).append(score)
    return by_bucket
A typical curve on a 70B Llama served on a 2-to-8 replica deployment: the idle bucket scores Groundedness 0.88; warm scores 0.86; saturating scores 0.79; scaling scores 0.71. The scaling bucket is the operating envelope during burst. The fix is configuration, not eval: lower target_ongoing_requests so the scaler reacts earlier, pre-warm replicas on a schedule that matches the burst pattern, or route burst traffic to a hosted fallback while the cluster catches up. Eval is what tells you the scaler config is wrong before the morning standup does.
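Turning the bucketed curve into a pass/fail gate takes only a few lines. The sketch below assumes scores have already been reduced to plain floats per bucket; `bucket_gate`, the idle baseline, and the 0.05 tolerance are illustrative choices, not part of the SDK.

```python
from statistics import mean


def bucket_gate(by_bucket: dict, baseline: str = "idle", max_drop: float = 0.05):
    """Compare each bucket's mean score to the baseline bucket's mean.

    Returns (passed, report), where report maps bucket -> (mean, drop).
    Assumes scores are plain floats in [0, 1].
    """
    means = {b: mean(v) for b, v in by_bucket.items() if v}
    if baseline not in means:
        raise ValueError(f"no scores in baseline bucket {baseline!r}")
    base = means[baseline]
    report = {b: (round(m, 3), round(base - m, 3)) for b, m in means.items()}
    passed = all(drop <= max_drop for _, drop in report.values())
    return passed, report


# Fed with the typical curve quoted above.
passed, report = bucket_gate({
    "idle": [0.88, 0.88],
    "warm": [0.86, 0.86],
    "saturating": [0.79, 0.79],
    "scaling": [0.71, 0.71],
})
print(passed, report)
```

On the curve above the gate fails: the saturating and scaling buckets drop 9 and 17 points against idle, both well past a 5-point tolerance.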
Check 3: KV-cache eviction under burst
PagedAttention is the reason a Ray Serve LLM deployment can serve 128k or 200k context windows at all. It is also the reason coherence drops in the middle of a long output you did not expect to fail. When a burst arrives and the cluster spikes from 40% to 90% memory utilization in 20 seconds, the scheduler evicts cached KV blocks for older requests, then recomputes them on the next decode step. The compute cost shows up as latency. The quality cost is the part teams miss.
When a long-context request loses prefix-cache hits mid-generation, the model behaves as if it never saw the early tokens. Tool-call sequences drop arguments seen only in the system prompt. RAG answers cite details from the wrong document. Reasoning chains lose constraints stated up top. The aggregate Groundedness number stays flat across the burst. The per-output-position curve cliffs around the eviction point, and the per-cache-hit-ratio bucket cliffs underneath it.
The detection joins eval scores to the llm.vllm.cache_hit_ratio attribute on the span. Pulled from the Ray Serve LLM metrics endpoint and stamped on the trace context, it tells you which bucket the request landed in.
def eviction_curve(spans, rubric):
    by_cache_bucket = {}
    for span in spans:
        ratio = span.attributes["llm.vllm.cache_hit_ratio"]
        cache_bucket = (
            "high" if ratio > 0.85
            else "medium" if ratio > 0.55
            else "low"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_cache_bucket.setdefault(cache_bucket, []).append(score)
    return by_cache_bucket
A typical finding: cache-hit-ratio > 0.85 scores Groundedness 0.88; the 0.55-0.85 bucket scores 0.81; the < 0.55 bucket scores 0.68. The fix is configuration: raise gpu_memory_utilization on the deployment, drop the max concurrent long-context requests per replica, route 64k+ contexts to a different deployment with more headroom, or summarize the prompt before passing it in. Eval is what tells you the configuration is wrong before the customer does.
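The cache-hit ratio stamped on each span has to be computed from something. A minimal sketch, assuming the metrics endpoint exports cumulative hit and lookup counters: the key names `prefix_cache_hits` and `prefix_cache_queries` are placeholders, not confirmed metric names, so map them to whatever your deployment actually exports. The thresholds in `cache_bucket` repeat the ones used in `eviction_curve`.

```python
def cache_hit_ratio(prev: dict, curr: dict,
                    hits_key: str = "prefix_cache_hits",
                    queries_key: str = "prefix_cache_queries"):
    """Hit ratio over the interval between two scrapes of cumulative counters.

    Key names are placeholders (assumptions). Returns None when no lookups
    happened in the interval or a counter reset is detected.
    """
    d_hits = curr[hits_key] - prev[hits_key]
    d_queries = curr[queries_key] - prev[queries_key]
    if d_queries <= 0 or d_hits < 0:
        return None
    return min(d_hits / d_queries, 1.0)


def cache_bucket(ratio):
    """Same thresholds as eviction_curve; None maps to 'unknown'."""
    if ratio is None:
        return "unknown"
    return "high" if ratio > 0.85 else "medium" if ratio > 0.55 else "low"


before = {"prefix_cache_hits": 9_000, "prefix_cache_queries": 10_000}
after = {"prefix_cache_hits": 9_400, "prefix_cache_queries": 11_000}
ratio = cache_hit_ratio(before, after)
print(ratio, cache_bucket(ratio))
```

Working from deltas rather than lifetime totals is the design choice that matters here. In this example the lifetime ratio (9,400 / 11,000, about 0.85) would still land in the high bucket, while the interval that contained the burst sits at 0.4.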
Check 4: traffic-pattern drift between dev and prod
The fourth failure is the one that hides longest. The dev eval ran on a uniform synthetic load. Prod’s actual mix is bursty: morning spikes from one tenant, sustained mid-day from a second, long-context batches from a third overnight. The cluster behaves differently under each mix. Quality differs across each mix. The dev run that scored clean against uniform load scores nothing the prod mix actually produces.
The check is to bucket every production span by the traffic pattern around it, then run eval per bucket.
TARGET_ONGOING_REQUESTS = 16  # assumed per-replica target behind the threshold below

def traffic_pattern_label(span, window_spans):
    # Queue depths sampled over the 60s window around this span
    depths = [s.attributes["cluster.queue_depth"] for s in window_spans]
    if not depths:
        return "steady"  # empty window: nothing to compare, avoid dividing by zero
    mean_depth = sum(depths) / len(depths)
    peak_depth = max(depths)
    if peak_depth > 2.5 * mean_depth:
        return "burst"
    capacity = span.attributes["cluster.replica_count"] * TARGET_ONGOING_REQUESTS
    if mean_depth > 0.7 * capacity:
        return "sustained"
    return "steady"

def pattern_drift(spans, rubric):
    by_pattern = {}
    for span in spans:
        # get_window_spans is a helper you supply: spans within 60s of the timestamp
        window = get_window_spans(span.timestamp, 60)
        pattern = traffic_pattern_label(span, window)
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_pattern.setdefault(pattern, []).append(score)
    return by_pattern
A typical finding on a multi-tenant 70B Llama deployment: steady scores 0.87, sustained scores 0.84, burst scores 0.74. The burst bucket is the one that exists in prod and never in the dev replay. Re-run the dev eval under a replayed burst pattern, not a uniform mix. If you can’t replay, route a small slice of real traffic to a shadow Ray Serve deployment through the Agent Command Center and score against the shadow’s spans instead. The pattern that broke the cluster in prod is the pattern the next eval run must contain.
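Replaying a burst pattern mostly means preserving inter-arrival gaps instead of flattening them. The sketch below contrasts the two schedules; `replay_offsets` and `uniform_offsets` are illustrative helpers, and the recorded timestamps are made-up sample data.

```python
def replay_offsets(recorded_timestamps, speedup: float = 1.0):
    """Turn recorded arrival times (seconds) into send offsets from t=0.

    Inter-arrival gaps are preserved (scaled by `speedup`), so bursts in
    the recording stay bursts in the replay.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ts = sorted(recorded_timestamps)
    if not ts:
        return []
    return [(t - ts[0]) / speedup for t in ts]


def uniform_offsets(n: int, duration_s: float):
    """The synthetic baseline: n requests spread evenly over duration_s."""
    return [i * duration_s / n for i in range(n)]


recorded = [100.0, 100.1, 100.2, 100.3, 130.0, 160.0]  # burst, then a trickle
print(replay_offsets(recorded))
print(uniform_offsets(len(recorded), 60.0))
```

Both schedules send six requests in a minute. Only the first puts four of them inside the same 300 ms, which is the condition the scaler and the batcher actually have to survive.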
Check 5: per-tenant fairness under shared scheduling
Continuous batching is globally throughput-optimal and per-tenant unfair. A tenant sending five requests per minute can see p99 latency and p99 quality cliff when a tenant sending five hundred requests per minute starts a burst of long prefills. The aggregate p99 stays inside SLO. The small tenant’s does not.
Run this check on purpose. Generate synthetic traffic from two tenants at different rates. Tag every span with tenant_id. Compute per-tenant p99 latency and per-tenant rubric scores. Confirm each tenant’s worst case stays inside the SLO.
def per_tenant_fairness(spans, rubric):
    by_tenant = {}
    for span in spans:
        tid = span.attributes["tenant_id"]
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_tenant.setdefault(tid, []).append({
            "score": score,
            "latency_ms": span.duration_ms,
        })
    return by_tenant

# alert when a tenant's tail score (its worst 1%) falls more than 5 points below the aggregate
# alert when a tenant's p99 latency exceeds 1.5x aggregate p99
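The latency alert described in those comments can be sketched as follows. It assumes `by_tenant` has the shape `per_tenant_fairness` returns, with a numeric `latency_ms` on every entry; `percentile` and `latency_breaches` are illustrative helpers using a nearest-rank percentile, and the sample latencies are invented.

```python
import math


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breaches(by_tenant: dict, ratio: float = 1.5) -> dict:
    """Tenants whose p99 latency exceeds `ratio` x the aggregate p99."""
    everything = [r["latency_ms"] for rows in by_tenant.values() for r in rows]
    limit = ratio * percentile(everything, 99)
    breaches = {}
    for tid, rows in by_tenant.items():
        p99 = percentile([r["latency_ms"] for r in rows], 99)
        if p99 > limit:
            breaches[tid] = p99
    return breaches


# Invented sample: a large tenant with 500 requests, a small one with 5.
sample = {
    "large": [{"latency_ms": 1000 + i} for i in range(500)],
    "small": [{"latency_ms": ms} for ms in (3800, 3900, 4000, 4100, 4200)],
}
print(latency_breaches(sample))
```

The small tenant contributes under 1% of requests, so the aggregate p99 never sees it. Only the per-tenant percentile surfaces the breach.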
Ray Serve exposes priority and replica-targeting knobs (route_prefix-based fan-out, application-level routing through DeploymentHandle). Tune them against this test, not against aggregate p99. If small-tenant traffic always trails large-tenant bursts in the same queue, isolate it on a dedicated deployment with its own min_replicas floor.
Wiring Ray Serve to traceAI in five lines
Ray Serve LLM exposes an OpenAI-compatible endpoint, so the traceAI OpenAI instrumentor works as-is once base_url points at your Anyscale service URL or the gateway in front of it.
from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI
import os

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="anyscale-llama-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key=os.environ["GATEWAY_KEY"],
)
Every chat completion now emits fi.span.kind=LLM spans with llm.model_name, llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total. Add llm.system=anyscale, ray.serve.replica_id, ray.serve.deployment_name, cluster.replica_count, cluster.queue_depth, llm.vllm.cache_hit_ratio, and tenant_id as using_attributes context, sourced from Ray Serve metrics and response middleware. Those seven attributes are everything the bucketed and rolling-window detectors need to slice tail quality by replica, by queue depth, by cache pressure, by traffic pattern, and by tenant.
Pipe spans through the Agent Command Center when you want autoscaler-policy canary, shadow, and race modes without app-code changes. The gateway returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, and x-agentcc-routing-strategy on every call, so the cost-per-million-tokens and the rolling p99 ride on the same trace as the eval scores. Deploy it BYOC inside the Anyscale workspace and the extra hop stays in-cluster.
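Reading those headers into a record you can attach to a trace might look like the sketch below. The header names are the ones listed above; that cost and latency arrive as numeric strings is an assumption, and `gateway_metadata` is an illustrative helper rather than part of any SDK.

```python
def gateway_metadata(headers) -> dict:
    """Pull the gateway's routing headers out of a response-header mapping.

    Value formats (numeric cost and latency) are assumed. Missing or
    unparsable values become None instead of raising.
    """
    lowered = {k.lower(): v for k, v in dict(headers).items()}

    def as_float(name):
        try:
            return float(lowered[name])
        except (KeyError, TypeError, ValueError):
            return None

    return {
        "cost": as_float("x-agentcc-cost"),
        "latency_ms": as_float("x-agentcc-latency-ms"),
        "model_used": lowered.get("x-agentcc-model-used"),
        "routing_strategy": lowered.get("x-agentcc-routing-strategy"),
    }


print(gateway_metadata({
    "X-AgentCC-Cost": "0.00042",
    "x-agentcc-latency-ms": "183",
    "x-agentcc-model-used": "llama-70b-prod",
}))
```

With the OpenAI Python SDK, one way to get at the headers is `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` alongside the parsed completion.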
The ai-evaluation SDK also ships a Ray distributed runner, so the eval workload itself runs as Ray actors on the same cluster.
from fi.evals.runners import RayRunner
runner = RayRunner(address="anyscale://my-cluster", num_workers=8)
runner.run(evaluator=evaluator, eval_templates=[...], inputs=[...])
When Ray is already the inference runtime, running eval on a separate Celery or Kubernetes pool doubles your infrastructure surface for no upside. Use Ray for both.
How Future AGI ships Ray Serve eval
Future AGI ships the eval stack as a package. Start with the SDK and traceAI for code-defined gates. Graduate to the Platform when self-improving rubrics and per-cluster routing become the bottleneck.
- ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, PromptInjection, and DataPrivacyCompliance. CustomLLMJudge is the primitive for replica-parity scoring and tail-quality bucketing. Four distributed runners (Celery, Ray, Temporal, Kubernetes); the Ray runner schedules eval actors on the same Anyscale cluster as inference. Local heuristic metrics (regex, JSON schema, BLEU, ROUGE) run offline at sub-second latency, which matters when you score every replica on the same 100-case set every weekday.
- traceAI. 50+ AI surfaces across Python, TypeScript, Java, and C#. The OpenAI-compatible instrumentor covers Ray Serve out of the box. Every span carries llm.model_name and token counts; add the Ray-specific attributes as using_attributes context and the bucketed detectors run against the same span tree.
- Agent Command Center. Single Go binary, Apache 2.0, 100+ providers including Anyscale as an OpenAI-compatible upstream. Shadow, mirror, and race modes for autoscaler-policy and replica-shape rollouts; eval-gated canary rollback as the default; benchmarked at ~29k req/s, P99 ≤ 21 ms with guardrails on, on t3.xlarge. Returns the cost, latency, model, and routing-strategy headers on every call. BYOC inside the Anyscale workspace keeps the hop in-cluster.
- agent-opt. Six optimizers (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for closing the residual quality gap that survives the cluster checks. If the burst bucket loses 4 points on EvaluateFunctionCalling, PROTEGI’s gradient pass often recovers 2 of them on the system prompt before you change the scaler config.
- Future AGI Platform. Self-improving evaluators that retune from production traces; in-product authoring agent that writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces with HDBSCAN soft-clustering, and a Sonnet 4.5 Judge writes the immediate_fix per cluster. Typical clusters on a Ray Serve stack: “Replica 7 drops 6 points on JSON-mode adherence (config drift, drain and reload),” “Tail quality cliffs in the scaling queue-depth bucket on the 9am burst, fix: lower target_ongoing_requests from 16 to 10 or pre-warm two replicas at 8:50,” “Long-context past 64k loses grounding when cache-hit-ratio falls below 0.55, fix: route 64k+ to a dedicated deployment.” Each immediate_fix flows back into the routing policy at the gateway and into the next eval run. Linear OAuth ships today; Slack, GitHub, Jira, PagerDuty are roadmap.
Drop ai-evaluation and traceAI into the gate this afternoon. Add the gateway when autoscaler-policy canary becomes the workflow. Turn the Platform on when per-cluster routing is the bottleneck.
Ready to evaluate your first Ray Serve cluster? Run pip install ai-evaluation traceai-openai, scaffold the replica-skew and queue-depth-bucketed gates against your golden set, point your OpenAI client at https://gateway.futureagi.com/v1 with Anyscale upstream for shadow traffic, and alarm the rolling p99 and cache-hit-ratio detectors on the canary cohort. The cluster that survives all five checks is the one worth shipping; everything else is a regression the notebook didn’t show you.
Related reading
- Evaluating vLLM Self-Hosted LLMs (2026)
- LLM Observability Self-Hosting Guide (2026)
- LLM Eval Shadow Traffic and Canary Patterns (2026)
- Best Self-Hosted AI Gateways (2026)
- AI Agent Cost Optimization and Observability (2026)
- Agent Passes Evals, Fails Production (2026)
- Best LLM-as-Judge Platforms (2026)
Frequently asked questions
Why is evaluating a Ray Serve LLM different from evaluating an API model or a single vLLM box?
What is autoscaling-lag tail quality and how do I catch it?
How does replica skew show up and how do I detect it?
What is KV-cache eviction under burst and why does it matter on Ray Serve?
How do I instrument Ray Serve LLM with traceAI?
What does Future AGI ship for Ray Serve evaluation specifically?
What is the worst anti-pattern for Ray Serve LLM eval in 2026?