
AI Agent Reliability Metrics in 2026: Six SLOs, Not One Score

AI agent reliability is six metrics, not one composite: task completion, tool-call success, recovery rate, p99 latency, guardrail trip rate, and a trace-grounded score, each wired as its own SLO.


An aggregate agent_score is the metric that hides the bug. It says the agent is at 0.85 today, 0.83 yesterday, dashboard green. It does not say that tool selection sits at 0.97, argument extraction at 0.62, and every refund over $1,000 is going out with the wrong tax line. Agents are not single-output functions; you cannot SLO them with a single number any more than you can SLO an API on a composite of latency and error rate. AI agent reliability is six metrics, not one composite: task completion, tool-call success, recovery rate, p99 latency, guardrail trip rate, and a 4-dimensional trace-grounded score. Each maps to a layer that can break in isolation, each has a different fix, and each earns its own SLO with its own error budget.

This guide walks the six SLOs, the trace-grounded score that makes the eval layer SRE-shaped, the CI gate that beats aggregate scoring, the Error Feed loop that turns production failures into regression tests, and where the Future AGI stack lands inside that pattern.

TL;DR: six SLOs, six error budgets

Metric | What it measures | Working baseline (chat agent) | SLO type
Task completion rate | Trajectory delivered the user goal end-to-end | ≥ 90% | Availability
Tool-call success | Right tool, schema-valid args, payload used | ≥ 95% | Availability (component)
Recovery rate | Recovered from a transient tool failure | ≥ 70% | Reliability
p99 latency | Tail latency users actually feel | ≤ 30s chat, ≤ 5min batch | Latency
Guardrail trip rate | Scanner blocks at the gateway layer | 1-5% on production traffic | Policy
Trace-grounded score (4-D) | Factual grounding, privacy/safety, instruction adherence, plan execution | ≥ 4.0 of 5 on each axis | Rubric

Five of the six map onto classical SRE SLO discipline. The sixth is the rubric layer that catches correct-looking outputs from broken trajectories. Wire all six to per-metric error budgets, alert on burn rate, postmortem when the budget breaks.

Why aggregate scoring hides the bug

The bisection problem is why aggregate scores fail. When a production agent regresses, the on-call engineer has to figure out which of six (or eight, or twelve) layers moved: did the planner pick the wrong tool, did the call carry a bad argument, did retrieval surface a stale chunk, did the model flip a number in the payload, did a safety filter turn aggressive, did the third-party endpoint start returning 429s.

An aggregate agent_score of 0.83 tells you none of this. The aggregate is fine for an exec slide; it is the wrong shape for an SLO. API teams figured this out a decade ago: you run a latency SLO, an availability SLO, and an error budget per metric, because the alert tells you what to fix.

Same with agents. Six metrics, six error budgets, six alerts. When the tool-call success SLO burns, you look at the tool registry. When the recovery SLO burns, you look at the retry policy. When the trace-grounded score on factual grounding drops, you look at retrieval. One bisect, not three days.

The six reliability SLOs

1. Task completion rate (availability)

End-to-end success on the user goal, scored on the full trajectory rather than the final turn. This is your availability SLO; it sits at the top of the stack the same way HTTP success rate sits at the top of an API SLO. A trajectory completes when the user goal is satisfied, the agent stops cleanly, and no upstream layer (tool, guardrail, retry budget) had to short-circuit the run.

The rubric is TaskCompletion against AgentTrajectoryInput (cloud eval_id=99). For multi-turn conversations, layer in ConversationCoherence and ConversationResolution.

Working baseline. Chat-style agents over open tool registries land around 88-92% in production after the first month. Below 80% means the planner or the tool registry is broken. Use pass^k (k independent rollouts, fraction that pass all k) for the consistency cut: when the mean stays flat but pass^8 slides, the planner is regressing under nondeterminism.
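The pass^k cut can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: the function names and the rollout data are hypothetical, and pass^k is computed empirically as the fraction of tasks whose k rollouts all passed.

```python
from typing import Sequence


def pass_hat_k(rollouts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks whose first k rollouts all passed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    eligible = [r for r in rollouts if len(r) >= k]
    if not eligible:
        raise ValueError("no task has k rollouts")
    return sum(all(r[:k]) for r in eligible) / len(eligible)


def mean_pass_rate(rollouts: Sequence[Sequence[bool]]) -> float:
    """Plain per-rollout pass rate across every task."""
    flat = [passed for r in rollouts for passed in r]
    return sum(flat) / len(flat)


# Hypothetical data: 4 tasks, 8 rollouts each.
# "stable": three tasks always pass, one always fails.
stable = [[True] * 8, [True] * 8, [True] * 8, [False] * 8]
# "flaky": every task fails 2 of its 8 rollouts.
flaky = [[True] * 6 + [False] * 2 for _ in range(4)]

for name, run in (("stable", stable), ("flaky", flaky)):
    print(name, mean_pass_rate(run), pass_hat_k(run, 8))
```

Both runs have the same mean pass rate (0.75), but pass^8 is 0.75 for the stable run and 0.0 for the flaky one, which is the signal the mean cannot show.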

2. Tool-call success

The layer that breaks most often and the one aggregate scoring hides hardest. Tool-call success is a stack of three sub-checks, all of which have to pass:

  • Tool selection. Right tool from the registry, or correctly call none on an irrelevance case (greeting, clarification, in-model factual question). Without an irrelevance bucket, the regression where a new prompt makes the model bolder about calling search on every input is invisible.
  • Argument extraction. Schema-valid arguments (checked with Pydantic or JSON Schema, which gate CI deterministically) plus semantic correctness (departure_date="2026-01-01" validates and is still wrong if the user said “next Friday”).
  • Result utilization. Did the agent use the tool payload or paraphrase it with a number flipped, substitute model knowledge, or drift off the payload by turn 3.

The SDK ships the stack: LLMFunctionCalling (cloud), deterministic function_name_match / parameter_validation / function_call_accuracy (sub-millisecond, local), and Groundedness with the context slot pointed at the tool return payload. The four-step contract for evaluating LLM tool use walks each sub-check in depth.

Working baseline. ≥ 95% schema validity, ≥ 90% semantic correctness, ≥ 90% result utilization. Aggregate this only into the dashboard summary; never gate on the aggregate in CI.
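To make the separate sub-check rates concrete, here is a standard-library sketch of the first two layers (selection and arguments), including an irrelevance bucket. It does not use the SDK evaluators named above; the registry, the ToolCase fields, and the test cases are all hypothetical. Result utilization is left out because it needs a judge over the tool payload.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical registry: tool name -> required argument names and types.
REGISTRY = {
    "search_flights": {"origin": str, "destination": str, "departure_date": str},
}


@dataclass
class ToolCase:
    expected_tool: Optional[str]   # None marks an irrelevance case (no call expected)
    called_tool: Optional[str]     # None means the agent made no tool call
    args: dict = field(default_factory=dict)
    expected_args: dict = field(default_factory=dict)


def selection_ok(c: ToolCase) -> bool:
    return c.called_tool == c.expected_tool


def schema_ok(c: ToolCase) -> bool:
    if c.called_tool is None:
        return True  # nothing to validate when no call was made
    schema = REGISTRY.get(c.called_tool)
    if schema is None:
        return False
    return set(c.args) == set(schema) and all(
        isinstance(c.args[name], typ) for name, typ in schema.items()
    )


def semantics_ok(c: ToolCase) -> bool:
    # Arguments only count as semantically right inside the right tool.
    return selection_ok(c) and c.args == c.expected_args


def rates(cases: List[ToolCase]) -> Dict[str, float]:
    n = len(cases)
    return {
        "selection": sum(map(selection_ok, cases)) / n,
        "schema": sum(map(schema_ok, cases)) / n,
        "semantics": sum(map(semantics_ok, cases)) / n,
    }


good = {"origin": "SFO", "destination": "JFK", "departure_date": "2026-01-09"}
stale = {**good, "departure_date": "2026-01-01"}  # schema-valid, wrong date
cases = [
    ToolCase("search_flights", "search_flights", good, good),
    ToolCase("search_flights", "search_flights", stale, good),
    ToolCase(None, "search_flights"),   # over-call on an irrelevance case
    ToolCase(None, None),               # correctly called nothing
]
print(rates(cases))
```

Reporting the three rates separately is the point: the stale-date case passes selection and schema and only shows up in the semantics rate.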

3. Recovery rate

Real tools fail. APIs time out, return 429s, return malformed JSON, return empty results. Recovery rate is the fraction of trajectories with at least one transient tool failure where the agent eventually delivered the user goal. It protects against the agent fabricating success when the upstream call was broken.

Build a stratified test set by replaying production traces with synthetic tool failures injected: one bucket per tool, one row per error code (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. ActionSafety and TrajectoryScore cover the deterministic side; a CustomLLMJudge on the trajectory handles whether the agent communicated the failure clearly, stopped at a sensible retry cap (3-6 attempts), and routed to a fallback or escalation.

Working baseline. ≥ 70% recovery on transient errors. 100% recovery on persistent errors usually means the agent is masking a real failure with fabricated success; treat any line above 95% as suspect.
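The denominator is what makes this metric easy to get wrong, so a small sketch may help. It assumes a hypothetical Trajectory record carrying the status codes of its tool calls, and it treats 408, 429, and common 5xx codes as transient; both are illustrative choices, not part of any SDK.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: which status codes count as transient for this sketch.
TRANSIENT_CODES = {408, 429, 500, 502, 503, 504}


@dataclass
class Trajectory:
    tool_status_codes: List[int]
    goal_delivered: bool


def recovery_rate(trajectories: List[Trajectory]) -> Optional[float]:
    """Goal-delivery rate over trajectories that hit at least one transient failure."""
    affected = [
        t for t in trajectories
        if any(code in TRANSIENT_CODES for code in t.tool_status_codes)
    ]
    if not affected:
        return None  # no failures observed, so the metric is undefined
    return sum(t.goal_delivered for t in affected) / len(affected)


def is_suspicious(rate: Optional[float], ceiling: float = 0.95) -> bool:
    """Flag rates above the ceiling for manual review."""
    return rate is not None and rate > ceiling


runs = [
    Trajectory([200, 200], True),        # no failure: excluded from the denominator
    Trajectory([429, 200], True),
    Trajectory([503, 503, 200], True),
    Trajectory([408, 408, 408], False),
    Trajectory([200, 500, 200], True),
]
rate = recovery_rate(runs)
print(rate, is_suspicious(rate))
```

Trajectories with no failure are excluded on purpose; including them would inflate the rate toward task completion and hide the retry policy's real behavior.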

4. p99 latency

The latency SLO from API SRE, unchanged. Track p99 (not p50, not the average) because the worst 1% of requests dominate user perception, and the tail on multi-step agents is heavy. Tail risk compounds across steps: if each of 8 tool calls independently stays within its latency budget 95% of the time, only about 66% of trajectories (0.95^8) finish with every call inside budget.

OTel span durations are the data. Group by route, persona, model variant, and tool for cohort analysis. The most expensive failure mode is the long-tail trace where one stalled tool call held the response budget for 90 seconds; the average looks fine, p99 is the alert.

Working baseline. ≤ 30 seconds for chat agents, ≤ 5 minutes for batch, ≤ 2 seconds for autocomplete. Voice-agent budgets are tighter (≤ 800 ms p95 round-trip; the voice latency playbook walks the breakdown). A high-completion, high-p99 agent is broken even when the dashboard shows 92% success.
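A short sketch of why the average misleads, using a nearest-rank percentile over trace durations grouped by cohort. The route names and durations are invented for illustration; in practice the durations would come from OTel span data.

```python
import math
from statistics import mean
from typing import Dict, List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_cohort(durations: Dict[str, List[float]]) -> Dict[str, float]:
    return {cohort: percentile(vals, 99) for cohort, vals in durations.items()}


# Hypothetical trace durations in seconds, keyed by route.
durations = {
    # 98 fast traces plus 2 where one stalled tool call held the response.
    "refund": [2.0] * 98 + [90.0] * 2,
    "faq": [1.5] * 100,
}

print("refund mean:", mean(durations["refund"]))
print("refund p50:", percentile(durations["refund"], 50))
print("p99 by route:", p99_by_cohort(durations))
```

The refund route averages under 4 seconds and has a 2-second median, yet its p99 is 90 seconds, well past a 30-second chat budget.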

5. Guardrail trip rate

The policy SLO. The rate at which gateway-layer guardrail scanners block requests or responses on production traffic. Useful in two directions: a sudden spike means an upstream prompt change made the scanner aggressive (or a model swap pushed the response distribution into the blocked zone); a sudden drop means a scanner regressed or rules thinned. Either direction is an alert.

The Future AGI Agent Command Center (Apache 2.0, single Go binary) ships 18+ built-in scanners (PII detection, prompt injection, content moderation, secret detection, hallucination detection, topic restriction, tool permissions, MCP security, and the rest) plus 15 third-party adapters. Benchmarked at ~29k req/s, P99 21 ms with guardrails on t3.xlarge.

Working baseline. 1-5% trip rate on adversarial-mix production workloads, ≤ 1% on cleaner internal traffic. The actionable signal is the delta: a stable rate at 3% is normal; a jump to 11% after a model swap is a real alert.
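A delta-based alert can be sketched as a ratio against a trailing baseline. The 2x ratio below is an assumed threshold chosen for illustration, not a documented default; tune it against your own traffic.

```python
def trip_rate(blocked: int, total: int) -> float:
    """Share of requests or responses blocked by guardrail scanners."""
    if total <= 0:
        raise ValueError("total must be positive")
    return blocked / total


def trip_rate_alert(baseline: float, current: float, max_ratio: float = 2.0) -> bool:
    """Alert on a spike or a drop relative to the baseline trip rate.

    max_ratio is an assumption for this sketch: alert when the rate at least
    doubles or at least halves.
    """
    if baseline <= 0:
        return current > 0
    ratio = current / baseline
    return ratio >= max_ratio or ratio <= 1 / max_ratio


baseline = trip_rate(300, 10_000)  # stable 3% trip rate
print(trip_rate_alert(baseline, trip_rate(310, 10_000)))    # small wobble
print(trip_rate_alert(baseline, trip_rate(1_100, 10_000)))  # jump to 11%
print(trip_rate_alert(baseline, trip_rate(50, 10_000)))     # drop to 0.5%
```

Checking both directions matters: the drop case is the one an absolute "alert above X%" rule never fires on.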

6. Trace-grounded score (4-D rubric)

The five metrics above are SRE-shaped. The trace-grounded score is the rubric SLO that catches the failures only the AI layer makes: correct-looking output from a broken trajectory, hallucinated citations, refusal regressions, plan loops that look like steps. Four axes, scored 1 to 5 by the same judge on every trace:

  • Factual grounding. Stayed anchored in retrieved or tool context, or confabulated. Catches result-utilization failures and retrieval drift.
  • Privacy and safety. Leaked PII, crossed a tenant boundary, complied with a jailbreak. Catches refusal regressions and permission failures.
  • Instruction adherence. Obeyed the system prompt and refused what should have been refused. Catches prompt drift directly.
  • Optimal plan execution. Picked the right tool, in the right order, without redundant calls, retries, or unreachable branches. Catches tool-selection and plan-coherence regressions.

The same judge runs against the offline dataset in CI and against live spans in production. Same vocabulary, same calibration set, same threshold. When the composite drops, the four axes tell you which layer regressed.

Working baseline. ≥ 4.0 on each axis. A score below 3.5 on any single axis across more than 5% of traces is an alert.
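The per-axis alert can be sketched as follows. The axis keys and sample scores are hypothetical, and the rule is read here as "more than 5% of traces score below 3.5 on a given axis"; adjust if your volume slicing differs.

```python
from statistics import mean
from typing import Dict, List

AXES = (
    "factual_grounding",
    "privacy_safety",
    "instruction_adherence",
    "plan_execution",
)


def axis_alerts(
    traces: List[Dict[str, float]], floor: float = 3.5, max_share: float = 0.05
) -> Dict[str, float]:
    """Return each axis whose share of traces scoring below `floor` exceeds `max_share`."""
    alerts = {}
    for axis in AXES:
        share = sum(1 for t in traces if t[axis] < floor) / len(traces)
        if share > max_share:
            alerts[axis] = share
    return alerts


def composite(traces: List[Dict[str, float]]) -> float:
    """Mean over all traces of the per-trace mean across the four axes."""
    return mean(mean(t[a] for a in AXES) for t in traces)


# Hypothetical judge output: 90 healthy traces, 10 with weak instruction adherence.
healthy = {axis: 4.5 for axis in AXES}
drifted = {**healthy, "instruction_adherence": 3.0}
traces = [healthy] * 90 + [drifted] * 10

print("composite:", composite(traces))
print("alerts:", axis_alerts(traces))
```

The composite stays comfortably above 4.0 while the instruction-adherence axis alerts, which is why the thresholds sit on the axes and not on the average.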

Mapping the six SLOs to API SLO discipline

The discipline you already run on a payment API is the discipline that runs on an agent. The metrics differ; the SLO mechanics do not.

Agent SLO | API SLO analog | Error-budget shape
Task completion rate | HTTP success rate | 1 - target_completion
p99 latency | p99 latency | Burn when above threshold
Recovery rate | Retry-success rate | 1 - target_recovery on failure subset
Tool-call success | (no clean analog) | Composite of selection, args, utilization
Guardrail trip rate | WAF block rate | Delta-based, not absolute
Trace-grounded score | (AI-specific) | Per-axis burn on rubric thresholds

Tool-call success and trace-grounded score have no clean API analog: no classical SLO covers the layer where the agent decides which downstream endpoint to call, and no classical service produces correct-looking output from a structurally broken trajectory.

Define an error budget per metric. Alert when the burn rate reaches 2x the budgeted rate (fast burn) or stays at 1x sustained over the SLO window (slow burn). Postmortem when the budget breaks. It is the same pattern as the Google SRE error-budget playbook, applied to agent-specific metrics.
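For a rate-style SLO such as task completion, the arithmetic looks like this. It is a minimal sketch: burn rate is taken as the observed failure rate divided by the error budget (1 - target), and the function names and window choices are illustrative.

```python
def error_budget(target: float) -> float:
    """Allowed failure rate for an SLO target, e.g. 0.90 -> 0.10."""
    return 1.0 - target


def burn_rate(failures: int, total: int, target: float) -> float:
    """Observed failure rate as a multiple of the error budget."""
    return (failures / total) / error_budget(target)


def classify(short_window_burn: float, slo_window_burn: float) -> str:
    """Fast burn at 2x in a short window; slow burn at 1x over the SLO window."""
    if short_window_burn >= 2.0:
        return "fast_burn"
    if slo_window_burn >= 1.0:
        return "slow_burn"
    return "ok"


# Task completion SLO of 90%: the budget is a 10% failure rate.
last_hour = burn_rate(failures=25, total=100, target=0.90)      # about 2.5x
last_30_days = burn_rate(failures=800, total=10_000, target=0.90)  # about 0.8x
print(round(last_hour, 2), round(last_30_days, 2), classify(last_hour, last_30_days))
```

The same three functions apply per metric; only the target and the definition of a "failure" change from one SLO to the next.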

The CI gate: per-metric thresholds, not aggregate

Aggregate gating is the most common failure mode in agent CI. A CI gate on agent_score >= 0.85 passes when tool selection is 0.97 and argument extraction is 0.62, then production lights up with bad calls inside well-picked tools. The fix is per-metric assertions, calibrated against historical pass rates:

# config.yaml for `fi run`
assertions:
  - "task_completion.score >= 0.90 for at_least 90% of cases"
  - "tool_selection_f1.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.95 for at_least 95% of cases"
  - "argument_semantics.score >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score >= 0.90 for at_least 90% of cases"
  - "recovery_score.score >= 0.70 for at_least 80% of cases"
  - "trace_score.factual_grounding >= 4.0 for at_least 90% of cases"
  - "trace_score.instruction_adherence >= 4.0 for at_least 90% of cases"

When the gate fails, the failing assertion name is the root cause. The Future AGI fi CLI ships per-eval assertions natively; the build-from-scratch walkthrough covers the fixture wiring. Per-metric gating is noisier than aggregate gating (a PR that bumps argument semantics by 2 points while dropping factual grounding by 1 point flips from green to yellow), and that is the right behavior: the engineer reads the diff and decides instead of rolling forward on a composite that summed it away.
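The difference between the two gates can be shown with a toy evaluator. This is not the fi CLI: it is a plain-Python sketch that reads each assertion as "score >= threshold for at least a given share of cases", with hypothetical per-case scores built around the 0.97 and 0.62 figures quoted above.

```python
from statistics import mean
from typing import Dict, List, Tuple

# metric -> (minimum per-case score, minimum share of cases that must meet it)
GATES: Dict[str, Tuple[float, float]] = {
    "task_completion": (0.90, 0.90),
    "tool_selection_f1": (0.95, 0.95),
    "argument_semantics": (0.85, 0.85),
    "result_groundedness": (0.90, 0.90),
}


def failing_assertions(
    scores: Dict[str, List[float]], gates: Dict[str, Tuple[float, float]]
) -> List[str]:
    """Names of metrics where too few cases reach the per-case threshold."""
    failures = []
    for metric, (min_score, min_share) in gates.items():
        values = scores[metric]
        share = sum(v >= min_score for v in values) / len(values)
        if share < min_share:
            failures.append(metric)
    return failures


def aggregate_score(scores: Dict[str, List[float]]) -> float:
    """The single composite: mean of the per-metric means."""
    return mean(mean(values) for values in scores.values())


def binary_scores(passed: int, total: int = 100) -> List[float]:
    return [1.0] * passed + [0.0] * (total - passed)


# Hypothetical suite of 100 cases per metric.
scores = {
    "task_completion": binary_scores(92),
    "tool_selection_f1": binary_scores(97),
    "argument_semantics": binary_scores(62),
    "result_groundedness": binary_scores(93),
}

print("aggregate gate passes:", aggregate_score(scores) >= 0.85)
print("per-metric failures:", failing_assertions(scores, GATES))
```

The aggregate gate passes at roughly 0.86 while the per-metric gate fails on argument_semantics alone, so the failing name points straight at the layer to fix.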

Production observability and the Error Feed loop

Gating on six SLOs in CI is necessary, not sufficient. The eval set is a snapshot; production is a river. Score the live trace stream with the same rubrics and you get a regression signal the offline set cannot have. The "agent passes evals and fails in production" playbook walks the six drift modes that age every eval set the day it ships.

traceAI (Apache 2.0) captures the trajectory across 14 span kinds (TOOL, RETRIEVER, AGENT, GUARDRAIL, EVALUATOR, A2A_CLIENT, A2A_SERVER, and the rest) and 50+ AI surfaces in Python, TypeScript, Java, and C#. EvalTag wires rubrics to spans at zero added inference latency.

Error Feed is the loop closer. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. Each cluster fires a Claude Sonnet 4.5 Judge agent for a 30-turn investigation across 8 span-tools. Per cluster, the Judge emits a 5-category 30-subtype taxonomy, the 4-D trace-grounded score, and an immediate_fix naming the change to ship today. The fix feeds the Platform’s self-improving evaluators; the on-call engineer promotes representative traces into the offline set; the next PR has to clear them. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap. Every incident becomes a regression test the team never has to write again.

Common reliability-metric mistakes

  • Single composite score as the dashboard. Hides which layer broke. Six SLOs with six error budgets is the working pattern.
  • Latency average instead of p99. Average hides tail-latency disasters. Users feel the worst 1%.
  • Tool-call success without an irrelevance bucket. The over-call regression where the agent calls search on every input is invisible without cases where the gold answer is no tool call.
  • Recovery rate at 100% as a green signal. Usually means the agent is fabricating success on persistent errors. Treat anything above 95% as suspect.
  • Trace-grounded score without per-axis thresholds. The composite drops, the engineer reads the average, the regression on instruction adherence stays buried under stable factual grounding.
  • Guardrail trip rate measured as an absolute number. The signal is the delta; a stable 3% is fine, a jump to 11% after a model swap is the alert.
  • No cohort stratification. Persona cohorts (frustrated_customer, hostile_attacker, edge_case) catch failures that average out.
  • Eval and trace in different tools. Span-attached scores let regressions localize to the bad span. Cross-referencing two dashboards under pressure is how on-call rotations burn out.

How Future AGI ships the six-metric SLO stack

The eval stack ships as a package, not a single product. Start with the SDK for code-defined per-metric scoring. Graduate to the Platform when the loop needs self-improving rubrics. Add the Agent Command Center for guardrail trip rate at the gateway.

ai-evaluation (Apache 2.0) covers all six SLOs with 70+ EvalTemplate classes: TaskCompletion against AgentTrajectoryInput, LLMFunctionCalling plus deterministic function_name_match / parameter_validation / function_call_accuracy, Groundedness and ContextAdherence, AnswerRefusal, ConversationCoherence, plus 11 CustomerAgent* templates. Seven AgentTrajectoryInput metrics, 13 guardrail backends, four distributed runners (Celery, Ray, Temporal, Kubernetes). The fi CLI ships per-assertion CI gates natively. traceAI (Apache 2.0) carries the same rubrics as span-attached scores across 50+ AI surfaces and 14 span kinds.

The Future AGI Platform layers self-improving evaluators, in-product agent-authored rubrics, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed is the clustering and what-to-fix layer inside the stack.

Agent Command Center (Apache 2.0, single Go binary) powers the guardrail-trip-rate SLO: 100+ providers, 18+ built-in scanners, 15 third-party adapters, MCP and A2A protocol support, per-scanner block counts on /-/metrics. Self-hostable or via gateway.futureagi.com/v1. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.

Ready to wire the six SLOs? Wire TaskCompletion, LLMFunctionCalling, Groundedness, and the 4-D TrajectoryScore into a pytest fixture via the ai-evaluation SDK, then attach the same templates as EvalTag scorers when production traces ask questions the CI gate missed. Running the same rubric in both places is the difference between an offline pass that ships and a 3 am page.

Frequently asked questions

Why split agent reliability into six metrics instead of one composite score?
Because an aggregate agent_score hides which layer broke. A composite 0.85 can mean tool selection at 0.97 masking argument extraction at 0.62, or grounded responses at 0.98 covering a tail-latency disaster. When the dashboard alerts, the on-call engineer needs to know which of plan, tool, recovery, latency, guardrail, or trace-grounded score regressed, because each one has a different fix (rubric edit, schema validator, retry budget, model swap, scanner tune, retrieval patch). Six metrics with per-metric thresholds make the bisection one step. One aggregate makes it three days. The composite is fine for an exec slide; it is the wrong shape for an SLO.
What are the six agent reliability metrics?
Task completion rate (did the trajectory deliver the user goal end-to-end), tool-call success (right tool, schema-valid arguments, payload actually used), recovery rate (does the agent retry, fall back, or escalate on tool failure), p99 latency (the tail users feel, not the median), guardrail trip rate (rate at which the gateway scanners block requests or responses), and the 4-dimensional trace-grounded score (factual grounding, privacy and safety, instruction adherence, optimal plan execution, scored 1 to 5 each). Five of the six map to classical SRE SLO discipline; the trace-grounded score is the AI-specific axis that ties the trajectory to the rubric.
How does the 4-dimensional trace-grounded score work?
Four axes, scored 1 to 5 by the same judge on every trace. Factual grounding: did the agent stay anchored in retrieved or tool context, or confabulate. Privacy and safety: did the agent leak PII, cross a tenant boundary, comply with a jailbreak. Instruction adherence: did the agent obey the system prompt and refuse what should have been refused. Optimal plan execution: did the agent pick the right tool, in the right order, without redundant calls or loops. The composite is the trace score; the four axes are the diagnosis when the composite drops. The same judge runs against the offline dataset in CI and against live spans in production, so the vocabulary is identical across surfaces.
How do agent SLOs map to classical API SLOs?
Task completion is your availability SLO (analogous to HTTP success rate). p99 latency is your latency SLO unchanged. Recovery rate is the dual of error rate, scoped to recoverable failures. Tool-call success is a new SLO with no clean API analog because there is no API layer that decides which downstream endpoint to call. Guardrail trip rate is the policy SLO. Trace-grounded score is the rubric SLO that protects against the model layer producing correct-looking output from a broken trajectory. Define an error budget per metric, alert when burn rate exceeds the budget, run a postmortem when the budget breaks. The discipline is the same as API SLOs; the metrics are agent-specific.
Why does a CI gate on per-metric thresholds beat a gate on an aggregate threshold?
Because the aggregate hides the layer that broke. A CI gate on agent_score >= 0.85 passes when tool selection is 0.97 and argument extraction is 0.62, then production lights up with bad calls inside well-picked tools. A gate with six assertions (one per dimension) fails the bad PR on the argument line, the engineer reads the failure name, and the fix lands in the right file. The Future AGI fi CLI ships per-eval assertions natively, and the four distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics on a 200-case suite outgrow a single-process budget. One aggregate is a probability; six per-metric thresholds are a diagnostic.
How does Error Feed turn production failures into the next regression test?
Error Feed sits inside the eval stack as the cluster-and-fix layer. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. Each cluster fires a Claude Sonnet 4.5 Judge agent for a 30-turn investigation across 8 span-tools (read_span, get_children, get_spans_by_type, search_spans, submit_finding, submit_scores, submit_summary, plus a Haiku Chauffeur for spans over 3000 characters). Prompt-cache hit ratio sits around 90 percent. Per cluster, the Judge emits a 5-category 30-subtype taxonomy, the 4-D trace-grounded score, and an immediate_fix string. The fix feeds the Platform's self-improving evaluators; the on-call engineer promotes representative traces into the offline set; the next PR has to clear them. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
How does Future AGI ship the six-metric SLO stack?
Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) is the code-first surface with 70+ EvalTemplate classes (TaskCompletion, LLMFunctionCalling, AgentTrajectory metrics, ConversationCoherence, Groundedness, AnswerRefusal, and the rest), 13 guardrail backends, four distributed runners, and a fi CLI with native per-assertion gates. traceAI (Apache 2.0) carries the same rubrics as span-attached scores on live traffic across 50+ AI surfaces in Python, TypeScript, Java, and C# with 14 span kinds. The Future AGI Platform layers self-improving evaluators and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed clusters, scores, and writes the fix. Agent Command Center (Apache 2.0, single Go binary) provides the gateway with 18+ built-in guardrail scanners and 15 third-party adapters for the guardrail-trip-rate SLO, benchmarked at ~29k req/s and P99 21 ms with guardrails on t3.xlarge.