AI Agent Reliability Metrics in 2026: Six SLOs, Not One Score
AI agent reliability is six metrics, not one composite: task completion, tool-call success, recovery rate, p99 latency, guardrail trip rate, and a trace-grounded score, each wired as an SLO.
An aggregate agent_score is the metric that hides the bug. It says the agent is at 0.85 today, 0.83 yesterday, dashboard green. It does not say that tool-call success sits at 0.97, argument extraction at 0.62, and every refund over $1,000 is going out with the wrong tax line. Agents are not single-output functions; you cannot SLO them with a single number any more than you can SLO an API on a composite of latency and error rate. AI agent reliability is six metrics, not one composite: task completion, tool-call success, recovery rate, p99 latency, guardrail trip rate, and a 4-dimensional trace-grounded score. Each maps to a layer that can break in isolation, each has a different fix, each earns its own SLO with its own error budget.
This guide walks the six SLOs, the trace-grounded score that makes the eval layer SRE-shaped, the CI gate that beats aggregate scoring, the Error Feed loop that turns production failures into regression tests, and where the Future AGI stack lands inside that pattern.
TL;DR: six SLOs, six error budgets
| Metric | What it measures | Working baseline (chat agent) | SLO type |
|---|---|---|---|
| Task completion rate | Trajectory delivered the user goal end-to-end | ≥ 90% | Availability |
| Tool-call success | Right tool, schema-valid args, payload used | ≥ 95% | Availability (component) |
| Recovery rate | Recovered from a transient tool failure | ≥ 70% | Reliability |
| p99 latency | Tail latency users actually feel | ≤ 30s chat, ≤ 5min batch | Latency |
| Guardrail trip rate | Scanner blocks at the gateway layer | 1-5% on production traffic | Policy |
| Trace-grounded score (4-D) | Factual grounding, privacy/safety, instruction adherence, plan execution | ≥ 4.0 of 5 on each axis | Rubric |
Five of the six map onto classical SRE SLO discipline. The sixth is the rubric layer that catches correct-looking outputs from broken trajectories. Wire all six to per-metric error budgets, alert on burn rate, postmortem when the budget breaks.
Why aggregate scoring hides the bug
The bisection problem is why aggregate scores fail. When a production agent regresses, the on-call engineer has to figure out which of six (or eight, or twelve) layers moved: did the planner pick the wrong tool, did the call carry a bad argument, did retrieval surface a stale chunk, did the model flip a number in the payload, did a safety filter turn aggressive, did the third-party endpoint start returning 429s.
An aggregate agent_score of 0.83 tells you none of this. The aggregate is fine for an exec slide; it is the wrong shape for an SLO. API teams figured this out a decade ago: you run a latency SLO, an availability SLO, and an error budget per metric, because the alert tells you what to fix.
Same with agents. Six metrics, six error budgets, six alerts. When the tool-call success SLO burns, you look at the tool registry. When the recovery SLO burns, you look at the retry policy. When the trace-grounded score on factual grounding drops, you look at retrieval. One bisect, not three days.
The six reliability SLOs
1. Task completion rate (availability)
End-to-end success on the user goal, scored on the full trajectory rather than the final turn. This is your availability SLO; it sits at the top of the stack the same way HTTP success rate sits at the top of an API SLO. A trajectory completes when the user goal is satisfied, the agent stops cleanly, and no upstream layer (tool, guardrail, retry budget) had to short-circuit the run.
The rubric is TaskCompletion against AgentTrajectoryInput (cloud eval_id=99). For multi-turn conversations, layer in ConversationCoherence and ConversationResolution.
Working baseline. Chat-style agents over open tool registries land around 88-92% in production after the first month. Below 80% means the planner or the tool registry is broken. Use pass^k (k independent rollouts, fraction that pass all k) for the consistency cut: when the mean stays flat but pass^8 slides, the planner is regressing under nondeterminism.
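The consistency cut can be sketched in a few lines. This is a minimal illustration, assuming each task has been rolled out at least k times and each rollout is recorded as a pass/fail boolean; the function names and the data layout are hypothetical, not part of any SDK.

```python
def mean_pass_rate(rollouts: dict[str, list[bool]]) -> float:
    """Fraction of all individual rollouts that passed (the usual mean)."""
    flat = [ok for runs in rollouts.values() for ok in runs]
    return sum(flat) / len(flat)


def pass_hat_k(rollouts: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks whose first k rollouts ALL passed (pass^k)."""
    for task, runs in rollouts.items():
        if len(runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} rollouts")
    passed_all = sum(all(runs[:k]) for runs in rollouts.values())
    return passed_all / len(rollouts)


# Hypothetical data: one task is flaky on a single rollout out of eight.
example = {
    "refund": [True] * 8,
    "booking": [True] * 7 + [False],
}
mean_score = mean_pass_rate(example)   # stays high
consistency = pass_hat_k(example, 8)   # drops sharply
```

One flaky rollout barely moves the mean but halves pass^8 in this toy set, which is the gap the consistency cut is meant to expose.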
2. Tool-call success
The layer that breaks most often and the one aggregate scoring hides hardest. Tool-call success is a stack of three sub-checks, all of which have to pass:
- Tool selection. Right tool from the registry, or correctly call none on an irrelevance case (greeting, clarification, in-model factual question). Without an irrelevance bucket, the regression where a new prompt makes the model bolder about calling `search` on every input is invisible.
- Argument extraction. Schema-valid arguments (Pydantic, JSON Schema, gate CI deterministically) plus semantic correctness (`departure_date="2026-01-01"` validates and is wrong if the user said “next Friday”).
- Result utilization. Did the agent use the tool payload or paraphrase it with a number flipped, substitute model knowledge, or drift off the payload by turn 3.
The SDK ships the stack: LLMFunctionCalling (cloud), deterministic function_name_match / parameter_validation / function_call_accuracy (sub-millisecond, local), and Groundedness with the context slot pointed at the tool return payload. The four-step contract for evaluating LLM tool use walks each sub-check in depth.
Working baseline. ≥ 95% schema validity, ≥ 90% semantic correctness, ≥ 90% result utilization. Aggregate this only into the dashboard summary; never gate on the aggregate in CI.
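To make the "never gate on the aggregate" point concrete, here is a rough sketch of the two deterministic sub-checks (selection and arguments) reported separately. It does not use the SDK's evaluators; `ToolCall`, `check_tool_call`, and the type-map schema are hypothetical stand-ins, and result utilization is left out because it needs a grounding judge rather than a deterministic comparison.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


def check_tool_call(
    expected: Optional[ToolCall],
    actual: Optional[ToolCall],
    required_types: dict[str, type],
) -> dict[str, bool]:
    """Score one case; expected=None marks an irrelevance case (no tool call)."""
    if expected is None or actual is None:
        # Irrelevance bucket: the only passing outcome is "no call on both sides".
        ok = expected is None and actual is None
        return {"selection": ok, "schema": ok, "semantics": ok}
    return {
        "selection": expected.name == actual.name,
        # Schema validity: every required argument is present with the right type.
        "schema": all(
            isinstance(actual.args.get(key), typ)
            for key, typ in required_types.items()
        ),
        # Semantic correctness: the values match the gold arguments.
        "semantics": expected.args == actual.args,
    }


# Hypothetical case: the date validates as a string but is not the gold date.
gold = ToolCall("search_flights", {"departure_date": "2026-05-22"})
got = ToolCall("search_flights", {"departure_date": "2026-01-01"})
report = check_tool_call(gold, got, {"departure_date": str})
```

Keeping the three booleans apart means a schema-valid, semantically wrong call shows up as its own failure instead of being averaged into a healthy selection score.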
3. Recovery rate
Real tools fail. APIs time out, return 429s, return malformed JSON, return empty results. Recovery rate is the fraction of trajectories with at least one transient tool failure where the agent eventually delivered the user goal. It protects against the agent fabricating success when the upstream call was broken.
Build a stratified test set by replaying production traces with synthetic tool failures injected: one bucket per tool, one row per error code (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. ActionSafety and TrajectoryScore cover the deterministic side; a CustomLLMJudge on the trajectory handles whether the agent communicated the failure clearly, stopped at a sensible retry cap (3-6 attempts), and routed to a fallback or escalation.
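The stratified set and the per-bucket gate can be sketched as follows. This is an illustration only: the fault labels mirror the list above, while the function names, the tuple layout of the outcomes, and the 0.70 floor applied per bucket are assumptions.

```python
from collections import defaultdict
from itertools import product

# One row per error code, plus the two degraded-result cases.
FAULTS = ["400", "401", "403", "404", "408", "429", "5xx",
          "empty_result", "partial_result"]


def build_failure_matrix(tools: list[str]) -> list[dict[str, str]]:
    """One test row per (tool, injected fault) pair."""
    return [{"tool": tool, "fault": fault} for tool, fault in product(tools, FAULTS)]


def recovery_by_bucket(
    outcomes: list[tuple[str, str, bool]],
) -> dict[tuple[str, str], float]:
    """Recovery rate per (tool, fault) bucket from (tool, fault, recovered) rows."""
    totals: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for tool, fault, recovered in outcomes:
        bucket = totals[(tool, fault)]
        bucket[0] += int(recovered)
        bucket[1] += 1
    return {key: rec / n for key, (rec, n) in totals.items()}


def failing_buckets(
    rates: dict[tuple[str, str], float], floor: float = 0.70
) -> list[tuple[str, str]]:
    """Buckets whose recovery rate falls below the gate."""
    return sorted(key for key, rate in rates.items() if rate < floor)
```

Gating on `failing_buckets` rather than the overall rate keeps a tool that never recovers from 429s from hiding behind tools that always do.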
Working baseline. ≥ 70% recovery on transient errors. 100% recovery on persistent errors usually means the agent is masking a real failure with fabricated success; treat any recovery rate above 95% as suspect.
4. p99 latency
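The latency SLO from API SRE, unchanged. p99 (not p50, not average) because the worst 1% of requests dominate user perception, and the tail on multi-step agents is heavy. Even when each of 8 tool calls stays inside its latency budget 95% of the time, at least one call runs slow in roughly a third of trajectories (1 - 0.95^8 ≈ 0.34, assuming independent calls), so the trajectory-level tail is far heavier than any per-call number suggests.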
OTel span durations are the data. Group by route, persona, model variant, and tool for cohort analysis. The most expensive failure mode is the long-tail trace where one stalled tool call held the response budget for 90 seconds; the average looks fine, p99 is the alert.
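As a rough sketch of the cohort cut, the snippet below computes a nearest-rank p99 per group from exported span records. The record shape (`route`, `duration_s`) is a hypothetical export format, not an OTel or traceAI schema.

```python
from collections import defaultdict


def p99(durations: list[float]) -> float:
    """Nearest-rank 99th percentile: smallest value covering 99% of samples."""
    if not durations:
        raise ValueError("no durations")
    ordered = sorted(durations)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def p99_by_cohort(spans: list[dict], key: str = "route") -> dict[str, float]:
    """Group span durations by a cohort attribute and take p99 per group."""
    groups: dict[str, list[float]] = defaultdict(list)
    for span in spans:
        groups[span[key]].append(span["duration_s"])
    return {cohort: p99(values) for cohort, values in groups.items()}


# Hypothetical traffic: 197 fast traces and 3 stalled at 90 seconds.
samples = [2.0] * 197 + [90.0] * 3
average = sum(samples) / len(samples)
tail = p99(samples)
```

In this toy sample the average stays in single digits while the p99 lands on the stalled calls, which is why the alert belongs on the tail.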
Working baseline. ≤ 30 seconds for chat agents, ≤ 5 minutes for batch, ≤ 2 seconds for autocomplete. Voice-agent budgets are tighter (≤ 800 ms p95 round-trip; the voice latency playbook walks the breakdown). A high-completion, high-p99 agent is broken even when the dashboard shows 92% success.
5. Guardrail trip rate
The policy SLO. The rate at which gateway-layer guardrail scanners block requests or responses on production traffic. Useful in two directions: a sudden spike means an upstream prompt change made the scanner aggressive (or a model swap pushed the response distribution into the blocked zone); a sudden drop means a scanner regressed or rules thinned. Either direction is an alert.
The Future AGI Agent Command Center (Apache 2.0, single Go binary) ships 18+ built-in scanners (PII detection, prompt injection, content moderation, secret detection, hallucination detection, topic restriction, tool permissions, MCP security, and the rest) plus 15 third-party adapters. It is benchmarked at ~29k req/s with a P99 of 21 ms, guardrails enabled, on a t3.xlarge.
Working baseline. 1-5% trip rate on adversarial-mix production workloads, ≤ 1% on cleaner internal traffic. The actionable signal is the delta: a stable rate at 3% is normal; a jump to 11% after a model swap is a real alert.
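A delta-based alert might look like the sketch below. The 2x ratio in either direction is an assumed threshold chosen for illustration, and the function names are hypothetical; tune the ratio against your own baseline variance.

```python
from typing import Optional


def trip_rate(blocked: int, total: int) -> float:
    """Share of requests or responses blocked by guardrail scanners."""
    return blocked / total if total else 0.0


def trip_rate_alert(
    baseline: float, current: float, max_ratio: float = 2.0
) -> Optional[str]:
    """Return 'spike', 'drop', or None by comparing current rate to baseline."""
    if baseline <= 0:
        # No meaningful baseline: any trips at all are worth a look.
        return "spike" if current > 0 else None
    ratio = current / baseline
    if ratio >= max_ratio:
        return "spike"  # scanner turned aggressive, or outputs shifted
    if ratio <= 1 / max_ratio:
        return "drop"   # scanner regressed or rules thinned
    return None
```

With these assumed thresholds, a stable 3% against a 3% baseline stays quiet, while a move from 3% to 11% returns `"spike"`.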
6. Trace-grounded score (4-D rubric)
The five metrics above are SRE-shaped. The trace-grounded score is the rubric SLO that catches the failures only the AI layer makes: correct-looking output from a broken trajectory, hallucinated citations, refusal regressions, plan loops that look like steps. Four axes, scored 1 to 5 by the same judge on every trace:
- Factual grounding. Stayed anchored in retrieved or tool context, or confabulated. Catches result-utilization failures and retrieval drift.
- Privacy and safety. Leaked PII, crossed a tenant boundary, complied with a jailbreak. Catches refusal regressions and permission failures.
- Instruction adherence. Obeyed the system prompt and refused what should have been refused. Catches prompt drift directly.
- Optimal plan execution. Picked the right tool, in the right order, without redundant calls, retries, or unreachable branches. Catches tool-selection and plan-coherence regressions.
The same judge runs against the offline dataset in CI and against live spans in production. Same vocabulary, same calibration set, same threshold. When the composite drops, the four axes tell you which layer regressed.
Working baseline. ≥ 4.0 on each axis. Below 3.5 on any axis for more than a 5% volume slice is an alert.
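The per-axis alert rule can be sketched as below, assuming each scored trace is a dict of axis name to a 1-5 score. The axis keys and the function name are hypothetical labels for the four axes above.

```python
AXES = (
    "factual_grounding",
    "privacy_safety",
    "instruction_adherence",
    "plan_execution",
)


def axes_in_alert(
    traces: list[dict[str, float]],
    floor: float = 3.5,
    max_share: float = 0.05,
) -> list[str]:
    """Axes where more than max_share of traces score below the floor."""
    alerts = []
    for axis in AXES:
        low = sum(1 for trace in traces if trace[axis] < floor)
        if low / len(traces) > max_share:
            alerts.append(axis)
    return alerts


# Hypothetical slice: 6 of 100 traces regress on instruction adherence only.
good = {axis: 4.5 for axis in AXES}
bad = dict(good, instruction_adherence=3.0)
window = [good] * 94 + [bad] * 6
alerting = axes_in_alert(window)
mean_adherence = sum(t["instruction_adherence"] for t in window) / len(window)
```

The mean on the regressed axis still clears 4.0 in this example, so only the volume-slice rule catches it.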
Mapping the six SLOs to API SLO discipline
The discipline you already run on a payment API is the discipline that runs on an agent. The metrics differ; the SLO mechanics do not.
| Agent SLO | API SLO analog | Error-budget shape |
|---|---|---|
| Task completion rate | HTTP success rate | 1 - target_completion |
| p99 latency | p99 latency | Burn when above threshold |
| Recovery rate | Retry-success rate | 1 - target_recovery on failure subset |
| Tool-call success | (no clean analog) | Composite of selection, args, utilization |
| Guardrail trip rate | WAF block rate | Delta-based, not absolute |
| Trace-grounded score | (AI-specific) | Per-axis burn on rubric thresholds |
Tool-call success and trace-grounded score have no clean API analog: no classical SLO covers the layer where the agent decides which downstream endpoint to call, and no classical service produces correct-looking output from a structurally broken trajectory.
Define an error budget per metric. Alert when the budget burns at 2x the sustainable rate (fast burn) or at 1x sustained over the SLO window (slow burn). Postmortem when the budget breaks. Same pattern as the Google SRE error-budget playbook, with agent-specific metrics.
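A worked sketch of the arithmetic, assuming burn rate is defined the usual SRE way: observed failure rate divided by the budget (1 - target). The function names are hypothetical, and the 2x and 1x thresholds are read from the alerting rule above.

```python
def burn_rate(failures: int, total: int, target: float) -> float:
    """Observed failure rate as a multiple of the error budget (1 - target)."""
    if total == 0:
        return 0.0
    budget = 1.0 - target
    return (failures / total) / budget


def classify_burn(short_window_rate: float, slo_window_rate: float) -> str:
    """Fast burn on the short window, slow burn on the full SLO window."""
    if short_window_rate >= 2.0:
        return "fast_burn"
    if slo_window_rate >= 1.0:
        return "slow_burn"
    return "ok"


# Task completion SLO at 90%: the budget is 10% failed trajectories.
last_hour = burn_rate(failures=25, total=100, target=0.90)     # about 2.5x
full_window = burn_rate(failures=80, total=1000, target=0.90)  # about 0.8x
status = classify_burn(last_hour, full_window)
```

The same two functions apply to recovery rate by passing only the trajectories that hit a transient failure as `total`.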
The CI gate: per-metric thresholds, not aggregate
Aggregate gating is the most common failure mode in agent CI. A CI gate on agent_score >= 0.85 passes when tool-call success is 0.97 and argument extraction is 0.62, then production lights up with bad calls inside well-picked tools. The fix is per-metric assertions, calibrated against historical pass rates:
# config.yaml for `fi run`
assertions:
- "task_completion.score >= 0.90 for at_least 90% of cases"
- "tool_selection_f1.score >= 0.95 for at_least 95% of cases"
- "argument_validation.score >= 0.95 for at_least 95% of cases"
- "argument_semantics.score >= 0.85 for at_least 85% of cases"
- "result_groundedness.score >= 0.90 for at_least 90% of cases"
- "recovery_score.score >= 0.70 for at_least 80% of cases"
- "trace_score.factual_grounding >= 4.0 for at_least 90% of cases"
- "trace_score.instruction_adherence >= 4.0 for at_least 90% of cases"
When the gate fails, the failing assertion name is the root cause. The Future AGI fi CLI ships per-eval assertions natively; the build-from-scratch walkthrough covers the fixture wiring. Per-metric gating is noisier than aggregate gating (a PR that bumps argument semantics by 2 points while dropping factual grounding by 1 point flips from green to yellow), and that is the right behavior: the engineer reads the diff and decides instead of rolling forward on a composite that summed it away.
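To show why the failing assertion name localizes the regression, here is a toy evaluator for the "score >= X for at_least Y% of cases" pattern. It is not the fi CLI's implementation; the tuple format and function name are assumptions made for the illustration.

```python
# (metric name, minimum score per case, minimum fraction of cases that must pass)
Assertion = tuple[str, float, float]


def failing_assertions(
    scores: dict[str, list[float]], assertions: list[Assertion]
) -> list[str]:
    """Names of metrics where too few cases reach the per-case threshold."""
    failed = []
    for metric, min_score, min_fraction in assertions:
        cases = scores[metric]
        passing = sum(1 for score in cases if score >= min_score)
        if passing / len(cases) < min_fraction:
            failed.append(metric)
    return failed


# Hypothetical run: selection is healthy, argument semantics is not.
run = {
    "tool_selection_f1": [0.97] * 10,
    "argument_semantics": [0.62] * 10,
}
gate = [
    ("tool_selection_f1", 0.95, 0.95),
    ("argument_semantics", 0.85, 0.85),
]
broken = failing_assertions(run, gate)
```

The result names only the argument-semantics check, so the engineer starts at argument extraction instead of bisecting the whole trajectory.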
Production observability and the Error Feed loop
Six SLOs in CI is necessary, not sufficient. The eval set is a snapshot; production is a river. Score the live trace stream with the same rubrics and you get a regression signal the offline set cannot have. The agent passes evals and fails in production playbook walks the six drift modes that age every eval set the day it ships.
traceAI (Apache 2.0) captures the trajectory across 14 span kinds (TOOL, RETRIEVER, AGENT, GUARDRAIL, EVALUATOR, A2A_CLIENT, A2A_SERVER, and the rest) and 50+ AI surfaces in Python, TypeScript, Java, and C#. EvalTag wires rubrics to spans at zero added inference latency.
Error Feed is the loop closer. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. Each cluster fires a Claude Sonnet 4.5 Judge agent for a 30-turn investigation across 8 span-tools. Per cluster, the Judge emits a 5-category 30-subtype taxonomy, the 4-D trace-grounded score, and an immediate_fix naming the change to ship today. The fix feeds the Platform’s self-improving evaluators; the on-call engineer promotes representative traces into the offline set; the next PR has to clear them. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap. Every incident becomes a regression test the team never has to write again.
Common reliability-metric mistakes
- Single composite score as the dashboard. Hides which layer broke. Six SLOs with six error budgets is the working pattern.
- Latency average instead of p99. Average hides tail-latency disasters. Users feel the worst 1%.
- Tool-call success without an irrelevance bucket. The over-call regression where the agent calls `search` on every input is invisible without cases where the gold answer is no tool call.
- Recovery rate at 100% as a green signal. Usually means the agent is fabricating success on persistent errors. Treat anything above 95% as suspect.
- Trace-grounded score without per-axis thresholds. The composite drops, the engineer reads the average, the regression on instruction adherence stays buried under stable factual grounding.
- Guardrail trip rate measured as an absolute number. The signal is the delta; a stable 3% is fine, a jump to 11% after a model swap is the alert.
- No cohort stratification. Persona cohorts (frustrated_customer, hostile_attacker, edge_case) catch failures that average out.
- Eval and trace in different tools. Span-attached scores let regressions localize to the bad span. Cross-referencing two dashboards under pressure is how on-call rotations burn out.
How Future AGI ships the six-metric SLO stack
The eval stack ships as a package, not a single product. Start with the SDK for code-defined per-metric scoring. Graduate to the Platform when the loop needs self-improving rubrics. Add the Agent Command Center for guardrail trip rate at the gateway.
ai-evaluation (Apache 2.0) covers all six SLOs with 70+ EvalTemplate classes: TaskCompletion against AgentTrajectoryInput, LLMFunctionCalling plus deterministic function_name_match / parameter_validation / function_call_accuracy, Groundedness and ContextAdherence, AnswerRefusal, ConversationCoherence, plus 11 CustomerAgent* templates. Seven AgentTrajectoryInput metrics, 13 guardrail backends, four distributed runners (Celery, Ray, Temporal, Kubernetes). The fi CLI ships per-assertion CI gates natively. traceAI (Apache 2.0) carries the same rubrics as span-attached scores across 50+ AI surfaces and 14 span kinds.
The Future AGI Platform layers self-improving evaluators, in-product agent-authored rubrics, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed is the clustering and what-to-fix layer inside the stack.
Agent Command Center (Apache 2.0, single Go binary) powers the guardrail-trip-rate SLO: 100+ providers, 18+ built-in scanners, 15 third-party adapters, MCP and A2A protocol support, per-scanner block counts on /-/metrics. Self-hostable or via gateway.futureagi.com/v1. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.
Ready to wire the six SLOs? Wire TaskCompletion, LLMFunctionCalling, Groundedness, and the 4-D TrajectoryScore into a pytest fixture via the ai-evaluation SDK, then attach the same templates as EvalTag scorers when production traces ask questions the CI gate missed. Same rubric in both places is the diff between an offline pass that ships and a 3 am page.
Related reading
- The Definitive Guide to AI Agent Evaluation (2026)
- Your Agent Passes Evals and Fails in Production (2026)
- Evaluating Tool-Calling Agents (2026)
- Best AI Agent Observability Tools (2026)
- Best AI Agent Reliability Solutions (2026)
- Build an LLM Evaluation Framework From Scratch (2026)
- LLM Incident Response Playbook (2026)