Evaluating Anyscale Ray Serve LLM Apps in 2026: Catching the Cluster-Level Failures Model Eval Misses

How to evaluate an Anyscale Ray Serve LLM in 2026: catch autoscaling lag, replica skew, and tail-quality cliffs the model eval never sees.

March 15, 2026

Updated May 20, 2026

12 min read

anyscale ray-serve llm-evaluation autoscaling replica-skew tail-quality 2026

The launch is a 70B Llama 3.3 served on Ray Serve across an Anyscale cluster, two-to-eight replicas autoscaled on A100s, OpenAI-compatible endpoint, Groundedness 0.87 and TaskCompletion 0.91 in the offline notebook on the FP16 weights. The cluster goes live Monday. By Wednesday the on-call thread reads: p99 latency spikes from 1.8s to 5.4s on the 9am burst, JSON-mode adherence drops to 0.78 during scale-up, tool-call argument correctness drops 9 points on the third call of multi-step chains, and one customer’s small-tenant traffic sees p99 doubled even when the cluster aggregate is inside SLO. The model scored clean in the notebook. The cluster it shipped on fails in ways the notebook never measured.

This is the failure shape every team running Ray Serve in production hits. The model eval scored the weights. The cluster (actor router, autoscaler, PagedAttention under burst, KV-cache sharing across replicas, per-tenant scheduling) was never on the eval path.

The opinion this post earns: Ray Serve eval is two problems, model quality and cluster-level serving, and most teams eval the model and ship the cluster. The cluster has its own failure modes. Autoscaling lag that spikes tail latency and tail quality on every burst. Replica skew that hides inside aggregate scores. KV-cache eviction under burst that breaks coherence past the prefix-cache budget. Traffic-pattern drift between the dev replay and prod’s actual burst-vs-steady mix. Each one is invisible to a single-replica offline benchmark and visible in production within a week. The eval that catches them runs load-aware, replica-aware, and p99-tail-quality-aware against the cluster shape you ship.

This guide is the working playbook for evaluating a Ray Serve LLM stack end to end in 2026, shaped against the ai-evaluation SDK, the traceAI OpenAI-compatible instrumentor, and the Agent Command Center gateway with Ray Serve as a backend. The model-eval side reuses the same templates you’d run against Claude or GPT-5; the cluster-eval side is what this post is about.

TL;DR: the model-vs-cluster eval split

Side	What it scores	Where it runs	Misses if you skip it
Model quality	Groundedness, TaskCompletion, EvaluateFunctionCalling on golden set	Single replica at idle	Nothing if your serving never scales or batches
Replica skew	Per-replica score variance across the fleet	Pinned per-replica calls via Ray Serve routes	One bad node hidden in aggregate
Autoscaling lag tail quality	Score and latency binned by queue-depth bucket	Live burst against the running scaler	p99 quality cliffs during scale-up
KV-cache eviction under burst	Quality cliff past the prefix-cache budget under memory pressure	Real burst against PagedAttention	Mid-generation coherence drops
Traffic-pattern drift	Quality binned by burst-vs-steady mix	Production traffic replay, not synthetic uniform	Dev passes, prod cliffs
Per-tenant fairness	Small-tenant p99 under shared continuous batching	Multi-tenant burst at production rate	Customer-specific SLO breaches

Ship only when the model side passes and the five cluster-level checks pass on the artifact you actually serve, on the cluster shape you actually run. Model-side green plus cluster-side untested is the quality cliff dressed as a launch.

Why model eval and cluster eval are two problems

When you call OpenAI or Anthropic, you evaluate one thing: the prompt and the application logic. The vendor owns the weights, the inference kernels, the scheduler, the scaler, and the SLA. Your eval covers content quality and that is enough.

A Ray Serve stack inverts every assumption.

The replica set is yours to mutate. You configure min_replicas and max_replicas, set target_ongoing_requests, pick the accelerator type, decide whether to run two big replicas or eight small ones. Each shape is a different runtime that needs its own baseline. The same model behind two A100-80G replicas does not behave the same as the same model behind eight L4 replicas.
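
To make "each shape needs its own baseline" concrete, here is a minimal sketch that keys stored baselines by cluster shape. The min_replicas, max_replicas, and target_ongoing_requests names are the knobs mentioned above. The shape names, accelerator labels, concrete values, and the baseline_key helper are assumptions made up for illustration.

```python
# Two hypothetical deployment shapes for the same model. The keys inside
# "autoscaling_config" are the autoscaling knobs named in the text; shape
# names, accelerator labels, and values are illustrative only.
SHAPES = {
    "two-big": {
        "accelerator": "A100-80G",
        "autoscaling_config": {
            "min_replicas": 2,
            "max_replicas": 2,
            "target_ongoing_requests": 16,
        },
    },
    "eight-small": {
        "accelerator": "L4",
        "autoscaling_config": {
            "min_replicas": 2,
            "max_replicas": 8,
            "target_ongoing_requests": 4,
        },
    },
}


def baseline_key(shape_name: str) -> str:
    """Build a stable key so eval baselines are stored per cluster shape."""
    shape = SHAPES[shape_name]
    cfg = shape["autoscaling_config"]
    return (
        f"{shape['accelerator']}"
        f":r{cfg['min_replicas']}-{cfg['max_replicas']}"
        f":t{cfg['target_ongoing_requests']}"
    )


for name in SHAPES:
    print(name, "->", baseline_key(name))
```

Storing scores under a key like this keeps a baseline from one shape from being silently compared with a run on another. In a real deployment the inner dict would typically be passed as the autoscaling_config of the Ray Serve deployment.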

The scheduler is yours to operate. Ray Serve’s actor router fans every request to one of N replicas. The autoscaler adds replicas when ongoing requests pass a threshold and removes them when they fall below. Continuous batching inside each replica interleaves prefill and decode tokens. PagedAttention shares KV blocks across requests. Each behavior is a place where quality regresses or latency drifts in ways the model eval cannot see.

The two surfaces fail differently. Model quality regressions are usually distribution shifts: the new fine-tune cliffs on a slice you didn’t sample. Cluster quality regressions are mostly tail behavior. p50 stays flat, p99 doubles, JSON adherence drops during scale-up but not at steady state, the small tenant’s burst gets starved, long-context grounding cliffs past the cache budget. Aggregate eval scores miss them. Per-bucket scores find them.

The rest of this post walks the five cluster-level checks that catch what the model eval misses. Each is a single failure mode with a code-level test. For broader background on the engine itself, see Evaluating vLLM self-hosted LLMs for the single-box equivalent and the LLM observability self-hosting guide for the detector patterns.

Check 1: replica skew on the fleet you actually ship

A Ray Serve LLM deployment is N replicas behind an actor router. The router’s job is throughput, not parity. Any one replica can silently drift: a partial weight load on a flaky node, a config divergence after a rolling restart, a GPU running at lower clocks because the chassis hit a thermal limit. The replica still passes the Ray Serve health check, still accepts traffic, still serves 1/N of your requests. The aggregate eval score absorbs it.
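
A back-of-the-envelope calculation shows how much the aggregate hides. The sketch below assumes the router spreads load evenly and that each replica's quality can be reduced to one number on a 0-to-1 scale. Both are simplifications, and the scores are illustrative.

```python
def aggregate_score(healthy: float, drifted: float, n_replicas: int) -> float:
    """Mean score when exactly one of n_replicas has drifted and load is spread evenly."""
    return (healthy * (n_replicas - 1) + drifted) / n_replicas


healthy, drifted = 0.88, 0.80  # illustrative: an 8-point drop on one replica
for n in (2, 4, 8):
    agg = aggregate_score(healthy, drifted, n)
    print(f"N={n}: aggregate={agg:.3f}, visible drop={healthy - agg:.3f}")
```

With eight replicas, an 8-point drop on one replica moves the aggregate by about 1 point, which is easy to mistake for ordinary run-to-run noise.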

The eval is direct. Replicate the same golden set across every replica, score each response per replica, compute variance.

import os

from openai import OpenAI
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, TaskCompletion, EvaluateFunctionCalling,
)
# Judge imports for custom sub-rubrics; not used in this minimal version.
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

# OpenAI-compatible client pointed at the Ray Serve endpoint.
# ANYSCALE_BASE_URL and ANYSCALE_API_KEY are placeholder env var names.
openai_client = OpenAI(
    base_url=os.environ["ANYSCALE_BASE_URL"],
    api_key=os.environ["ANYSCALE_API_KEY"],
)

def call_pinned(replica_id, prompt):
    # Assumes a Ray Serve route prefix or header-based router that pins to
    # one replica; the header name below is illustrative.
    return openai_client.chat.completions.create(
        model="anyscale/llama-3-70b",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"x-ray-serve-replica-id": replica_id},
    ).choices[0].message.content

def replica_skew(golden_set, replica_ids):
    # Send every golden example to every replica.
    per_replica = {rid: [] for rid in replica_ids}
    for ex in golden_set:
        for rid in replica_ids:
            per_replica[rid].append(call_pinned(rid, ex.input))
    # Score each replica's outputs separately so results never blend.
    scores = {}
    for rid in replica_ids:
        result = evaluator.evaluate(
            eval_templates=[Groundedness(), TaskCompletion(), EvaluateFunctionCalling()],
            inputs=[TestCase(input=ex.input, output=out, context=ex.context)
                    for ex, out in zip(golden_set, per_replica[rid])],
        )
        scores[rid] = result.eval_results
    return scores

A healthy fleet scores within 1 to 2 points across replicas on every sub-rubric. A drifted replica usually drops 5 to 10 points on a structured sub-rubric (JSON-mode adherence, tool-call argument correctness) before it shows on free-form. Run this every weekday on a 100-case set. The first time a replica’s EvaluateFunctionCalling cliffs against the cohort, the cluster has told you which node to drain before any customer notices.
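
The variance step can be sketched as a comparison against the fleet median. This is a minimal illustration: it assumes each eval result has already been reduced to a float on a 0-to-1 scale, so a 5-point drop is 0.05. The flag_drifted_replicas helper and the sample data are hypothetical.

```python
from statistics import mean, median


def flag_drifted_replicas(scores_by_replica, max_gap=0.05):
    """Return replicas whose mean score on any rubric trails the fleet median by more than max_gap.

    scores_by_replica: {replica_id: {rubric_name: [float, ...]}}
    """
    rubrics = {r for per in scores_by_replica.values() for r in per}
    flagged = {}
    for rubric in sorted(rubrics):
        means = {
            rid: mean(per[rubric])
            for rid, per in scores_by_replica.items()
            if per.get(rubric)
        }
        if len(means) < 2:
            continue  # nothing to compare against
        fleet_median = median(means.values())
        for rid, m in means.items():
            gap = fleet_median - m
            if gap > max_gap:
                flagged.setdefault(rid, {})[rubric] = round(gap, 4)
    return flagged


# Illustrative data: r3 has drifted on function calling only.
sample = {
    "r0": {"groundedness": [0.88, 0.87], "function_calling": [0.90, 0.92]},
    "r1": {"groundedness": [0.87, 0.88], "function_calling": [0.91, 0.89]},
    "r2": {"groundedness": [0.88, 0.88], "function_calling": [0.90, 0.90]},
    "r3": {"groundedness": [0.87, 0.87], "function_calling": [0.82, 0.80]},
}
print(flag_drifted_replicas(sample))
```

The median is used as the reference instead of the mean so that the drifted replica does not pull the baseline toward itself. On small fleets that matters more than the exact threshold.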

For deeper rubric design on the judge side, the LLM-as-judge platforms guide covers the CustomLLMJudge pattern in depth.

Check 2: autoscaling-lag tail quality, not aggregate quality

Ray Serve’s autoscaler reacts to ongoing-request pressure on the existing replica set. It adds replicas when ongoing requests cross target_ongoing_requests, removes them when they fall below. The reaction is not instant. When a burst arrives and the existing fleet saturates, the scaler hesitates while requests pile up in the queue, then begins to add replicas, which themselves take seconds to load model weights and accept traffic. p99 latency triples for 30 to 60 seconds.

The cost teams miss is that tail quality cliffs in lockstep. Requests held in the queue lose batching efficiency, decode latency stretches per token, and tail outputs degrade before any new replica is online. The aggregate Groundedness number stays flat across the burst. The per-queue-depth bucket cliffs above 80% saturation.

The detection bins every score by two axes: cluster.queue_depth at the moment the request landed, and cluster.replica_count at that same moment.

from fi_instrumentation import using_attributes

# current_replica_count() and current_queue_depth() are helpers you supply,
# for example thin wrappers over Ray Serve metrics. evaluator, TestCase, and
# openai_client are the objects defined in Check 1.
with using_attributes({
    "llm.system": "anyscale",
    "ray.serve.deployment_name": "llama-70b-prod",
    "cluster.replica_count": current_replica_count(),
    "cluster.queue_depth": current_queue_depth(),
}):
    response = openai_client.chat.completions.create(...)

def autoscaling_lag_curve(spans, rubric):
    # Group per-span eval scores by how saturated the fleet was at request time.
    by_bucket = {}
    for span in spans:
        qd = span.attributes["cluster.queue_depth"]
        rc = span.attributes["cluster.replica_count"]
        # Queued requests per replica; max() guards against a zero replica count.
        saturation = qd / max(rc, 1)
        bucket = (
            "idle" if saturation < 2
            else "warm" if saturation < 8
            else "saturating" if saturation < 16
            else "scaling"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_bucket.setdefault(bucket, []).append(score)
    return by_bucket

A typical curve on a 70B Llama served on a 2-to-8 replica deployment: the idle bucket scores Groundedness 0.88; warm scores 0.86; saturating scores 0.79; scaling scores 0.71. The scaling bucket is the operating envelope during burst. The fix is configuration, not eval: lower target_ongoing_requests so the scaler reacts earlier, pre-warm replicas on a schedule that matches the burst pattern, or route burst traffic to a hosted fallback while the cluster catches up. Eval is what tells you the scaler config is wrong before the morning standup does.
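
Turning the bucketed scores into a gate can be as simple as comparing each bucket against the idle baseline. The sketch below assumes the scores in each bucket have been reduced to floats. The bucket_report helper and the 0.05 threshold are hypothetical, and the sample values reuse the illustrative curve above.

```python
from statistics import mean

BUCKET_ORDER = ["idle", "warm", "saturating", "scaling"]


def bucket_report(by_bucket, baseline="idle", max_drop=0.05):
    """Summarize mean score per bucket and mark buckets that fall too far below the baseline bucket."""
    if not by_bucket.get(baseline):
        raise ValueError(f"no scores in baseline bucket {baseline!r}")
    base = mean(by_bucket[baseline])
    report = []
    for bucket in BUCKET_ORDER:
        scores = by_bucket.get(bucket, [])
        if not scores:
            continue  # bucket never observed in this run
        m = mean(scores)
        report.append({
            "bucket": bucket,
            "n": len(scores),
            "mean": round(m, 3),
            "drop": round(base - m, 3),
            "breach": (base - m) > max_drop,
        })
    return report


# One score per bucket keeps the example short; real runs need many per bucket.
curve = {"idle": [0.88], "warm": [0.86], "saturating": [0.79], "scaling": [0.71]}
for row in bucket_report(curve):
    print(row)
```

Reporting n alongside the mean matters here: the scaling bucket is usually the smallest, so a breach on a handful of spans deserves a rerun before it blocks a release.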

Check 3: KV-cache eviction under burst

PagedAttention is the reason a Ray Serve LLM deployment can serve 128k or 200k context windows at all. It is also the reason coherence drops in the middle of a long output you did not expect to fail. When a burst arrives and the cluster spikes from 40% to 90% memory utilization in 20 seconds, the scheduler evicts cached KV blocks for older requests, then recomputes them on the next decode step. The compute cost shows up as latency. The quality cost is the part teams miss.

When a long-context request loses prefix-cache hits mid-generation, the model behaves as if it never saw the early tokens. Tool-call sequences drop arguments seen only in the system prompt. RAG answers cite details from the wrong document. Reasoning chains lose constraints stated up top. The aggregate Groundedness number stays flat across the burst. The per-output-position curve cliffs around the eviction point, and the per-cache-hit-ratio bucket cliffs underneath it.

The detection joins eval scores to the llm.vllm.cache_hit_ratio attribute on the span. Pulled from the Ray Serve LLM metrics endpoint and stamped on the trace context, it tells you which bucket the request landed in.

def eviction_curve(spans, rubric):
    by_cache_bucket = {}
    for span in spans:
        ratio = span.attributes["llm.vllm.cache_hit_ratio"]
        cache_bucket = (
            "high" if ratio > 0.85
            else "medium" if ratio > 0.55
            else "low"
        )
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_cache_bucket.setdefault(cache_bucket, []).append(score)
    return by_cache_bucket

A typical finding: cache-hit-ratio > 0.85 scores Groundedness 0.88; the 0.55-0.85 bucket scores 0.81; the < 0.55 bucket scores 0.68. The fix is configuration: raise gpu_memory_utilization on the deployment, drop the max concurrent long-context requests per replica, route 64k+ contexts to a different deployment with more headroom, or summarize the prompt before passing it in. Eval is what tells you the configuration is wrong before the customer does.
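
One detail worth sketching is how the ratio gets computed before it is stamped on a span. If the metrics endpoint exposes cumulative counters, a lifetime ratio will smooth over exactly the burst you care about, so a windowed delta between two scrapes is usually the more useful value. The "hits" and "queries" keys below are placeholders for whatever your engine exports, and windowed_hit_ratio is a hypothetical helper. The bucket thresholds are copied from eviction_curve.

```python
def windowed_hit_ratio(prev, curr):
    """Cache-hit ratio over the interval between two cumulative counter snapshots.

    prev / curr: dicts with cumulative "hits" and "queries" counts (placeholder
    key names). Returns None when the interval saw no cache queries or a
    counter reset made a delta negative.
    """
    d_hits = curr["hits"] - prev["hits"]
    d_queries = curr["queries"] - prev["queries"]
    if d_queries <= 0 or d_hits < 0:
        return None
    return min(d_hits / d_queries, 1.0)


def cache_bucket(ratio):
    """Same thresholds as eviction_curve above."""
    if ratio > 0.85:
        return "high"
    if ratio > 0.55:
        return "medium"
    return "low"


# Illustrative scrapes taken before and after a burst.
prev = {"hits": 9000, "queries": 10000}
curr = {"hits": 9400, "queries": 11000}

lifetime = curr["hits"] / curr["queries"]
window = windowed_hit_ratio(prev, curr)
print(f"lifetime={lifetime:.3f} -> {cache_bucket(lifetime)}")
print(f"window={window:.3f} -> {cache_bucket(window)}")
```

In this example the lifetime ratio still lands in the high bucket while the windowed ratio lands in low, which is why the windowed value is the one worth attaching to the request.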

Check 4: traffic-pattern drift between dev and prod

The fourth failure is the one that hides longest. The dev eval ran on a uniform synthetic load. Prod’s actual mix is bursty: morning spikes from one tenant, sustained mid-day from a second, long-context batches from a third overnight. The cluster behaves differently under each mix, and quality differs with it. A dev run that scored clean against uniform load says nothing about the outputs the prod mix actually produces.

The check is to bucket every production span by the traffic pattern around it, then run eval per bucket.

def traffic_pattern_label(span, window_spans):
    # Classify the 60s window around this span by its queue-depth profile.
    depths = [s.attributes["cluster.queue_depth"] for s in window_spans]
    if not depths:
        return "steady"  # empty window: no evidence of load
    mean_depth = sum(depths) / len(depths)
    peak_depth = max(depths)
    if peak_depth > 2.5 * mean_depth:
        return "burst"
    # 16 is the same per-replica depth that marks the scaling bucket in Check 2.
    if mean_depth > 0.7 * span.attributes["cluster.replica_count"] * 16:
        return "sustained"
    return "steady"

def pattern_drift(spans, rubric):
    # get_window_spans(timestamp, seconds) is a helper you supply that returns
    # the spans recorded within that many seconds of the timestamp.
    by_pattern = {}
    for span in spans:
        window = get_window_spans(span.timestamp, 60)
        pattern = traffic_pattern_label(span, window)
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_pattern.setdefault(pattern, []).append(score)
    return by_pattern

A typical finding on a multi-tenant 70B Llama deployment: steady scores 0.87, sustained scores 0.84, burst scores 0.74. The burst bucket is the one that exists in prod and never in the dev replay. Re-run the dev eval under a replayed burst pattern, not a uniform mix. If you can’t replay, route a small slice of real traffic to a shadow Ray Serve deployment through the Agent Command Center and score against the shadow’s spans instead. The pattern that broke the cluster in prod is the pattern the next eval run must contain.
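
Replaying a burst mostly comes down to preserving the recorded inter-arrival gaps instead of spacing requests evenly. The sketch below is one way to derive a send schedule from recorded timestamps and compare its peak against a uniform schedule. All three helpers and the sample timestamps are hypothetical.

```python
def replay_offsets(recorded_ts, speedup=1.0):
    """Turn recorded request timestamps (seconds) into send offsets that preserve the burst shape."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(recorded_ts)
    if not ordered:
        return []
    start = ordered[0]
    return [(ts - start) / speedup for ts in ordered]


def uniform_offsets(n, duration):
    """The evenly spaced schedule a synthetic load test would use, for comparison."""
    if n <= 1:
        return [0.0] * n
    step = duration / (n - 1)
    return [i * step for i in range(n)]


def peak_per_window(offsets, window=1.0):
    """Largest number of requests that start inside any fixed window of `window` seconds."""
    counts = {}
    for off in offsets:
        key = int(off // window)
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values(), default=0)


# Illustrative trace: two short bursts and a few isolated requests over 20 seconds.
recorded = [100.0, 100.1, 100.2, 100.3, 100.4, 105.0, 110.0, 110.1, 110.2, 120.0]

replayed = replay_offsets(recorded)
uniform = uniform_offsets(len(recorded), duration=20.0)
print("peak req/s, replayed:", peak_per_window(replayed))
print("peak req/s, uniform: ", peak_per_window(uniform))
```

Both schedules send the same ten requests over the same twenty seconds, but only the replayed one reproduces the five-in-one-second spike that the queue actually sees.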

Check 5: per-tenant fairness under shared scheduling

Continuous batching is globally throughput-optimal and per-tenant unfair. A tenant sending five requests per minute can see p99 latency and p99 quality cliff when a tenant sending five hundred requests per minute starts a burst of long prefills. The aggregate p99 stays inside SLO. The small tenant’s does not.

Run this check deliberately instead of waiting for production to surface it. Generate synthetic traffic from two tenants at different rates. Tag every span with tenant_id. Compute per-tenant p99 latency and per-tenant rubric scores. Confirm each tenant’s worst case stays inside the SLO.

def per_tenant_fairness(spans, rubric):
    by_tenant = {}
    for span in spans:
        tid = span.attributes["tenant_id"]
        score = evaluator.evaluate(
            eval_templates=[rubric],
            inputs=[TestCase(input=span.input, output=span.output, context=span.context)],
        ).eval_results[0]
        by_tenant.setdefault(tid, []).append({
            "score": score,
            "latency_ms": span.duration_ms,
        })
    return by_tenant

# alert when a tenant's p99 score drops below aggregate - 5pt
# alert when a tenant's p99 latency exceeds 1.5x aggregate p99
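
The two alert rules in the comments above can be sketched as follows. This assumes each "score" has been reduced to a float on a 0-to-1 scale, so 5 points is 0.05, and it reads the score rule as tenant mean versus aggregate mean. Swap in a low percentile if you want a strict tail comparison. The percentile and fairness_alerts helpers and the sample data are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fairness_alerts(by_tenant, latency_factor=1.5, score_margin=0.05):
    """Compare each tenant's p99 latency and mean score against the aggregate."""
    all_rows = [row for rows in by_tenant.values() for row in rows]
    agg_p99 = percentile([r["latency_ms"] for r in all_rows], 99)
    agg_score = mean(r["score"] for r in all_rows)
    alerts = []
    for tid, rows in sorted(by_tenant.items()):
        p99 = percentile([r["latency_ms"] for r in rows], 99)
        score = mean(r["score"] for r in rows)
        if p99 > latency_factor * agg_p99:
            alerts.append((tid, "latency", p99))
        if score < agg_score - score_margin:
            alerts.append((tid, "score", round(score, 3)))
    return alerts


# Illustrative data: a large tenant dominates the aggregate, a small one suffers.
by_tenant = {
    "big": [{"score": 0.88, "latency_ms": 1800} for _ in range(495)],
    "small": [{"score": 0.78, "latency_ms": 4000} for _ in range(5)],
}
print(fairness_alerts(by_tenant))
```

In this example the aggregate p99 is set entirely by the large tenant, so only a per-tenant comparison surfaces the small tenant's breach.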

Ray Serve exposes priority and replica-targeting knobs (route_prefix-based fan-out, application-level routing through DeploymentHandle). Tune them against this test, not against aggregate p99. If small-tenant traffic always trails large-tenant bursts in the same queue, isolate it on a dedicated deployment with its own min_replicas floor.

Wiring Ray Serve to traceAI in five lines

Ray Serve LLM exposes an OpenAI-compatible endpoint, so the traceAI OpenAI instrumentor works as-is once base_url points at your Anyscale service URL or the gateway in front of it.

from fi_instrumentation import register, ProjectType
from traceai_openai import OpenAIInstrumentor
from openai import OpenAI
import os

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="anyscale-llama-prod",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key=os.environ["GATEWAY_KEY"],
)

Every chat completion now emits fi.span.kind=LLM spans with llm.model_name, llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total. Add llm.system=anyscale, ray.serve.replica_id, ray.serve.deployment_name, cluster.replica_count, cluster.queue_depth, and llm.vllm.cache_hit_ratio as using_attributes context, sourced from Ray Serve metrics and response middleware. Those six attributes, plus the tenant_id tag from Check 5, are everything the bucketed and rolling-window detectors need to slice tail quality by replica, by queue depth, by cache pressure, by traffic pattern, and by tenant.

Pipe spans through the Agent Command Center when you want autoscaler-policy canary, shadow, and race modes without app-code changes. The gateway returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, and x-agentcc-routing-strategy on every call, so the cost-per-million-tokens and the rolling p99 ride on the same trace as the eval scores. Deploy it BYOC inside the Anyscale workspace and the network hop adds nothing.

The ai-evaluation SDK also ships a Ray distributed runner, so the eval workload itself runs as Ray actors on the same cluster.

from fi.evals.runners import RayRunner

runner = RayRunner(address="anyscale://my-cluster", num_workers=8)
runner.run(evaluator=evaluator, eval_templates=[...], inputs=[...])

When Ray is already the inference runtime, running eval on a separate Celery or Kubernetes pool doubles your infrastructure surface for no upside. Use Ray for both.

How Future AGI ships Ray Serve eval

Future AGI ships the eval stack as a package. Start with the SDK and traceAI for code-defined gates. Graduate to the Platform when self-improving rubrics and per-cluster routing become the bottleneck.

ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes covering Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, PromptInjection, and DataPrivacyCompliance. CustomLLMJudge is the primitive for replica-parity scoring and tail-quality bucketing. Four distributed runners (Celery, Ray, Temporal, Kubernetes); the Ray runner schedules eval actors on the same Anyscale cluster as inference. Local heuristic metrics (regex, JSON schema, BLEU, ROUGE) run offline at sub-second latency, which matters when you score every replica on the same 100-case set every weekday.
traceAI. 50+ AI surfaces across Python, TypeScript, Java, and C#. The OpenAI-compatible instrumentor covers Ray Serve out of the box. Every span carries llm.model_name and token counts; add the Ray-specific attributes as using_attributes context and the bucketed detectors run against the same span tree.
Agent Command Center. Single Go binary, Apache 2.0, 100+ providers including Anyscale as an OpenAI-compatible upstream. Shadow, mirror, and race modes for autoscaler-policy and replica-shape rollouts; eval-gated canary rollback as the default; benchmarked at ~29k req/s, P99 ≤ 21 ms with guardrails on, on t3.xlarge. Returns the cost, latency, model, and routing-strategy headers on every call. BYOC inside the Anyscale workspace keeps the hop in-cluster.
agent-opt. Six optimizers (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) for closing the residual quality gap that survives the cluster checks. If the burst bucket loses 4 points on EvaluateFunctionCalling, PROTEGI’s gradient pass often recovers 2 of them on the system prompt before you change the scaler config.
Future AGI Platform. Self-improving evaluators that retune from production traces; in-product authoring agent that writes rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces with HDBSCAN soft-clustering, and a Sonnet 4.5 Judge writes the immediate_fix per cluster. Typical clusters on a Ray Serve stack: “Replica 7 drops 6 points on JSON-mode adherence (config drift, drain and reload),” “Tail quality cliffs in the scaling queue-depth bucket on the 9am burst, fix: lower target_ongoing_requests from 16 to 10 or pre-warm two replicas at 8:50,” “Long-context past 64k loses grounding when cache-hit-ratio falls below 0.55, fix: route 64k+ to a dedicated deployment.” Each immediate_fix flows back into the routing policy at the gateway and into the next eval run. Linear OAuth ships today; Slack, GitHub, Jira, PagerDuty are roadmap.

Drop ai-evaluation and traceAI into the gate this afternoon. Add the gateway when autoscaler-policy canary becomes the workflow. Turn the Platform on when per-cluster routing is the bottleneck.

Ready to evaluate your first Ray Serve cluster? Run pip install ai-evaluation traceai-openai, scaffold the replica-skew and queue-depth-bucketed gates against your golden set, point your OpenAI client at https://gateway.futureagi.com/v1 with Anyscale upstream for shadow traffic, and alarm the rolling p99 and cache-hit-ratio detectors on the canary cohort. The cluster that survives all five checks is the one worth shipping; everything else is a regression the notebook didn’t show you.

Frequently asked questions

Why is evaluating a Ray Serve LLM different from evaluating an API model or a single vLLM box?

Ray Serve fans every request across N replicas behind an actor router, and that replica set is elastic. Three things change. The replica set is your real surface area, so a single drifted replica (partial weight load, stale config, thermally throttled GPU) hides inside healthy aggregates because it only serves 1/N of traffic. The cluster is dynamic: the autoscaler adds and removes replicas under load, and a cold-started replica is not the same animal as a warm one. Traffic patterns dominate quality: continuous batching and KV-cache sharing behave one way at idle and a different way at peak, so quiet-cluster eval lies about peak-cluster reality. The result is that Ray Serve eval is two problems, not one. Model quality eval scores the weights. Cluster-level eval scores what happens when those weights meet the router, the scheduler, the scaler, and 200 concurrent tenants.

What is autoscaling-lag tail quality and how do I catch it?

Ray Serve's autoscaler reacts to ongoing-request pressure on the existing replica set, so when a burst arrives the scaler hesitates while in-flight requests pile up. Two failures stack. p99 latency triples for 30 to 60 seconds as warm replicas saturate. Quality also dips because the requests held in the queue lose batching efficiency and tail outputs degrade before any new replica accepts traffic. Catch it by tagging every span with cluster.replica_count and cluster.queue_depth from Ray Serve metrics, computing per-token decode latency from spans, and binning quality scores by the queue-depth bucket the request landed in. The interesting bucket is the one above 80% saturation; that is where tail quality cliffs in lockstep with tail latency. A single-tenant load test will not surface this. You need real burst traffic against a live scaler.

How does replica skew show up and how do I detect it?

A Ray Serve replica can drift independently from its siblings. The model loaded from a partial download on a flaky node, a config divergence after a rolling restart, a GPU running at lower clocks because thermal limits kicked in. The replica still passes the health check and still serves traffic, so the aggregate eval score absorbs it because the bad replica is only 1/N of the cohort. Detection runs the same golden set against every replica in turn (pin requests via Ray Serve route prefixes or a header-based router), scores each per-replica response, and computes variance across the set. A healthy fleet scores within 1 to 2 points across replicas. A drifted replica usually drops 5 to 10 points on a sub-rubric like JSON-mode adherence or tool-call argument correctness, which is the signal that lands days before a customer files a ticket.

What is KV-cache eviction under burst and why does it matter on Ray Serve?

Ray Serve LLM deployments use PagedAttention to serve 128k and 200k context windows by paging KV blocks in and out of GPU memory. When a burst arrives and the cluster spikes from 40% to 90% memory utilization in 20 seconds, the scheduler evicts cached blocks for older requests and recomputes them on the next decode step. Latency rises. The hidden cost is quality. When a long-context request loses prefix-cache hits mid-generation, the model behaves as if it never saw the early tokens: tool-call arguments from the system prompt drop, RAG answers cite the wrong document, reasoning chains lose constraints stated up top. The aggregate Groundedness number stays flat. The per-output-position curve cliffs around the eviction point. Detect by binning scores against the llm.vllm.cache_hit_ratio attribute on the span and watching where the cliff lands.

How do I instrument Ray Serve LLM with traceAI?

Ray Serve LLM exposes an OpenAI-compatible endpoint, so the traceAI OpenAI instrumentor works with base_url pointed at your Anyscale service URL. Call register(project_type=ProjectType.OBSERVE, project_name=...) once, then OpenAIInstrumentor().instrument(tracer_provider=trace_provider), and every chat completion emits an fi.span.kind=LLM span with llm.model_name and token counts. Tag spans with llm.system=anyscale to separate Ray Serve traffic from API-backed traffic. Add ray.serve.replica_id, ray.serve.deployment_name, cluster.replica_count, cluster.queue_depth, and llm.vllm.cache_hit_ratio as custom attributes from Ray Serve metrics and response middleware. Those five attributes plus the standard token counts give the bucketed and rolling-window detectors everything they need to slice tail quality by replica, by queue depth, and by cache pressure.

What does Future AGI ship for Ray Serve evaluation specifically?

The eval stack as a package. The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes including Groundedness, ContextAdherence, TaskCompletion, EvaluateFunctionCalling, and CustomLLMJudge for the model-quality side, and ships a Ray distributed runner so the eval workload runs as Ray actors on the same Anyscale cluster that serves the LLM. traceAI captures Ray Serve spans through the OpenAI-compatible instrumentor across Python, TypeScript, Java, and C#. The Agent Command Center routes Anyscale as an OpenAI-compatible upstream, returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used per call, supports shadow and canary modes for autoscaler-policy rollouts, and runs BYOC inside the Anyscale workspace so the gateway hop adds nothing. The Future AGI Platform's self-improving evaluators retune from production traces at lower per-eval cost than Galileo Luna-2, and Error Feed clusters replica-skew and burst-quality failures with HDBSCAN soft-clustering and a Sonnet 4.5 Judge that writes an immediate_fix per cluster.

What is the worst anti-pattern for Ray Serve LLM eval in 2026?

Running the offline eval on a single replica at idle, watching it pass, and shipping the deployment behind an N-replica autoscaling Ray Serve setup. Three failure modes stack. The drifted replica hides inside the aggregate because it serves 1/N of traffic. Autoscaler lag spikes tail latency and tail quality on every burst because warm replicas saturate before new ones accept traffic. KV-cache eviction under burst cliffs long-context quality past the prefix-cache budget. The eval that caught nothing on the single replica at idle catches nothing in production until a customer files a ticket about a broken tool sequence on the morning burst three weeks in. Eval the cluster shape you actually ship: N replicas, with the scaler running, under the burst pattern production traffic actually hits.
