Agent Metrics Frameworks in 2026: Three Approaches, One Decision
Three agent metric frameworks own 2026: trajectory-first, task-completion-first, output-quality-first. Pick by your bug surface, not vendor pitch.
A coding-assistant agent ships a planner upgrade. Final-answer pass-rate moves from 71 to 74 percent. The team celebrates the 3-point lift. Two weeks later, the on-call gets paged: cost has risen 38 percent. Investigation: the planner emits 1.6x as many tool calls per task; tool-call accuracy is unchanged; pass-rate is up because the extra calls recover edge cases; cost-per-success is roughly 32 percent worse at constant task volume (1.38 × 71/74 ≈ 1.32). The team’s eval program had one framework out of three. Three frameworks own agent metrics in 2026, and the choice is not religion. Pick by what your bug surface looks like: trajectory bugs need trajectory metrics, output bugs need output metrics, and public benchmarks tell you nothing about either.
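The arithmetic behind that incident is short enough to check directly. This is a minimal sketch, assuming cost is normalized to 1.0 per task before the upgrade and that task volume is unchanged; the function name and the normalization are illustrative, and only the ratios come from the numbers above.

```python
def cost_per_success(cost_per_task: float, pass_rate: float) -> float:
    """Average spend needed to obtain one successful task."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


# Assumption: baseline cost normalized to 1.0 per task (hypothetical unit).
before = cost_per_success(cost_per_task=1.00, pass_rate=0.71)
# Cost up 38 percent, pass-rate up 3 points.
after = cost_per_success(cost_per_task=1.38, pass_rate=0.74)

change = after / before - 1
print(f"cost-per-success change: {change:+.1%}")
```

A pass-rate dashboard alone shows the 3-point lift; tracking this ratio next to it is what would have surfaced the regression before the page.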
TL;DR: the three frameworks and what they catch
| Framework | What it scores | When to use it | Where it falls short |
|---|---|---|---|
| Trajectory-first | The full ordered trace: tool selection, arguments, plan, recovery, task completion | Multi-step agents, tool-heavy workflows, anything past 3 tool calls per task | Misses style and tone drift on the final response |
| Task-completion-first benchmarks | Black-box success on public datasets (BFCL, tau-bench, ToolBench) | Model selection, capability floor, vendor comparison | Tells you nothing about your registry, your schemas, your error codes |
| Output-quality-first judges | The final response against a rubric (G-Eval, Luna-2, Groundedness) | One-shot agents, RAG QA, summarization, content generation | A clean response can come from a broken trajectory |
If you only read one row: most production programs run all three at different cadences. Trajectory rubrics in CI. Output-quality judges on live spans. Public benchmarks at model-selection time. One bug class each.
Why one framework is never enough
Agents are not functions. A 12-span agent run with the right final answer can have 8 wrong tool calls, 3 unnecessary retries, and a 6x cost overrun. Final-answer accuracy alone scores it as a success. The engineering reality is the agent is broken.
The math agrees. End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95 percent per-step agent over eight steps lands near 66 percent. A 99 percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is the default math of compound error, and it is why teams ship agents that pass per-turn eval and tank in production.
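The compound-error figures can be reproduced in a few lines. This sketch assumes steps fail independently, which is a simplification: real agent steps are often correlated, so treat the output as a rough guide rather than a prediction.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Success probability of a chain of independent steps."""
    return per_step ** steps


for p in (0.95, 0.99):
    rate = end_to_end_success(p, 8)
    print(f"{p:.2f} per step over 8 steps -> {rate:.1%}")
```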
Each framework sees a different class of bug. Output-quality judges miss tool-argument failures. Trajectory rubrics miss style and tone drift. Public benchmarks miss everything specific to your tools. Running one in isolation guarantees a blind spot somewhere.

Framework 1: trajectory-first (the six-dimension rubric)
Trajectory-first frameworks score the full ordered sequence: system prompt, user input, agent reasoning, tool calls with arguments and returns, retrieval results, intermediate LLM calls, final response. The trace is the unit of evaluation. Score the trajectory or you are grading luck.
The six dimensions that map cleanly onto trace shape:
- Tool selection. Did the agent pick the right tool, or correctly call none? Three failure modes: wrong tool, no tool when one was needed, fabricated tool. The piece most posts drop is the irrelevance bucket: cases where the gold answer is no tool call (greeting, clarification, in-model factual question). Without it, you cannot detect the regression where a new prompt makes the model bolder about calling `search` on every input.
- Argument extraction. Right tool with wrong arguments is the most common production agent bug. Three buckets: schema mismatch (Pydantic catches this), semantic mismatch (`departure_date="2026-01-01"` validates and is wrong if the user said “next Friday”), edge-case handling (null on optional fields, timezone on date fields, unicode in identifiers).
- Result utilization. The tool returned. The agent has the payload. Did the agent use it, or substitute model knowledge? `get_account_balance` returns `{"balance_cents": 12_400}` and the model “knows” the standard $200 minimum, so it replies “your balance is above the $200 threshold.” The tool result was never read.
- Error recovery. Real tools fail. APIs time out, return 429s, return malformed JSON. The agent’s behavior is a separate axis from happy-path. Did it retry with corrected arguments on a 400, or send the same broken string again? Did it stop at a sensible retry cap?
- Plan coherence. For multi-step agents: no loops, no dead-ends, right depth. A two-step task takes roughly two steps. A ten-step task takes roughly ten. Sub-tree explosion is a regression.
- Task completion. End-to-end success on the user goal, scored on the full trajectory rather than the final turn alone. Add a consistency slice: pick 30 hard cases, run them k times, the fraction that succeed on all k is your `pass^k`.
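To make the tool-selection dimension concrete, here is one way the failure modes and the irrelevance bucket could be counted. It is a sketch under stated assumptions: the registry contents, the bucket names, and the convention that `None` means "no tool call" are all hypothetical, not taken from any particular eval library.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical tool registry for illustration.
REGISTRY = {"search", "get_account_balance", "book_flight"}


def classify(gold: Optional[str], predicted: Optional[str]) -> str:
    """Bucket one tool-selection decision. None means 'no tool call'."""
    if predicted is not None and predicted not in REGISTRY:
        return "fabricated_tool"
    if gold is None:
        # Irrelevance bucket: the gold answer is to call nothing.
        return "correct_abstain" if predicted is None else "overcall"
    if predicted is None:
        return "missed_tool"
    return "correct" if predicted == gold else "wrong_tool"


def tool_selection_report(cases: Iterable[Tuple[Optional[str], Optional[str]]]) -> Counter:
    """Count buckets over (gold, predicted) pairs."""
    return Counter(classify(gold, pred) for gold, pred in cases)


cases = [
    ("search", "search"),               # correct
    ("get_account_balance", "search"),  # wrong tool
    ("book_flight", None),              # tool needed, none called
    (None, None),                       # greeting: correctly no call
    (None, "search"),                   # greeting: overcall
    ("search", "web_lookup"),           # tool not in the registry
]
report = tool_selection_report(cases)
print(dict(report))
```

Keeping `overcall` as its own bucket is the design choice that matters: without gold cases labeled `None`, that count is always zero and the bolder-prompt regression stays invisible.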
Trajectory-first frameworks need spans. Without OpenTelemetry on every tool call and sub-agent dispatch, the data the rubric needs is not there. LangSmith covers this for LangChain-shaped traces. Phoenix and OpenInference cover the open-weights surface. Future AGI’s traceAI (Apache 2.0) ships 14 span kinds across 50+ AI surfaces in four languages. For the depth on the rubrics, see The Definitive Guide to AI Agent Evaluation.
Framework 2: task-completion-first (the public benchmarks)
Task-completion-first frameworks treat the agent as a black box. Inputs in, outputs out, score against a public dataset. Three benchmarks anchor the floor in 2026:
BFCL (Berkeley Function Calling Leaderboard) breaks tool calling into three tracks: AST correctness (the call parses), executable correctness (the call actually runs on a real endpoint), and an irrelevance-detection bucket (the model correctly does not call a tool). A model that aces AST and tanks irrelevance overcalls on your registry. A model that aces AST and tanks executable generates plausible non-running calls.
tau-bench runs multi-turn agents in airline and retail with an LLM-simulated user, a domain policy, and tool access. The headline metric is pass^k across k independent rollouts. Even strong models land below 25 percent at pass^8 on retail. Multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric is the cleanest exposure of that fact.
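The gap between a single-run pass rate and a consistency score is easy to see on a toy result set. The sketch below uses a plain all-of-k count per case; the case IDs and outcomes are made up for illustration, and published benchmark numbers may be computed with a different estimator.

```python
def pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Fraction of cases whose first k rollouts all succeeded."""
    if not results:
        raise ValueError("no cases")
    wins = 0
    for case_id, rollouts in results.items():
        if len(rollouts) < k:
            raise ValueError(f"{case_id} has fewer than {k} rollouts")
        wins += all(rollouts[:k])
    return wins / len(results)


def mean_pass_rate(results: dict[str, list[bool]]) -> float:
    """Average success over every individual rollout."""
    flat = [ok for rollouts in results.values() for ok in rollouts]
    return sum(flat) / len(flat)


# Hypothetical outcomes: four cases, four independent rollouts each.
results = {
    "refund_01": [True, True, True, True],
    "refund_02": [True, False, True, True],
    "exchange_01": [True, True, True, False],
    "cancel_01": [False, False, False, False],
}

print(f"mean pass rate: {mean_pass_rate(results):.3f}")
print(f"pass^4:         {pass_hat_k(results, 4):.3f}")
```

On this toy data the average rollout succeeds 62.5 percent of the time, yet only one case in four succeeds on every rollout, which is the number a user who repeats the task actually experiences.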
ToolBench tests across thousands of real APIs with a focus on instruction-following and tool composition. Use it when API breadth matters more than depth on any single tool.
The honest framing: public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, your argument schemas, your error codes, or your business policy. Treat them as model-selection signals. Use them when you are picking between GPT-5, Claude Opus 4.7, and Gemini for a new build. Do not gate releases on them. The private eval set, stratified by your tools and your error codes, is the one that gates production. For the rubric depth on this point, see Evaluating Tool-Calling Agents in 2026.
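One way to build a private set that is stratified by tool and error code is to sample a fixed number of cases per (tool, error code) pair from logged runs. This is a sketch only: the `tool` and `error_code` field names and the per-stratum quota are assumptions about how your logs are shaped.

```python
import random
from collections import defaultdict


def stratified_sample(cases: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Pick up to per_stratum cases for every (tool, error_code) pair."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    strata = defaultdict(list)
    for case in cases:
        strata[(case["tool"], case["error_code"])].append(case)
    sample = []
    for key in sorted(strata, key=str):  # stable order; error_code may be None
        bucket = strata[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


# Hypothetical log: many happy-path searches, a few rare failures.
logged = (
    [{"id": i, "tool": "search", "error_code": None} for i in range(50)]
    + [{"id": 100 + i, "tool": "search", "error_code": "429"} for i in range(3)]
    + [{"id": 200, "tool": "book_flight", "error_code": "400"}]
)
eval_set = stratified_sample(logged, per_stratum=2)
print(len(eval_set))
```

Sampling uniformly from the same log would mostly return happy-path searches; the per-stratum quota is what keeps the rare 429 and 400 cases in the gate.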
Framework 3: output-quality-first (LLM-judge stacks)
Output-quality-first frameworks score the final response against a rubric, ignoring the trajectory. G-Eval popularized the form-filling pattern. Galileo Luna-2 distilled it for cost. Braintrust ships an experimentation-first surface for output evals. Future AGI’s Groundedness, Hallucination, Tone, Factual Accuracy, and 50+ other EvalTemplate classes cover the same axis with the Turing judge family behind them.
The common rubric shapes:
- Groundedness. Does each claim in the response trace to a chunk in the retrieved context (for RAG) or the tool result payload (for tool-using agents)?
- Hallucination. Are there claims that cannot be verified against any source?
- Faithfulness. Does the response stay anchored to the input intent, or drift into adjacent topics?
- Format compliance. Does the response match the schema (JSON valid, fields present, types correct)?
- Tone, persona match, style. Does the response sound the way the brand or persona expects?
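Of the rubric shapes above, format compliance is the one that needs no judge model at all. A minimal standard-library sketch follows; the schema and field names are hypothetical, and a production check would more likely use a schema library such as Pydantic.

```python
import json

# Hypothetical response schema: field name -> required Python type.
SCHEMA = {"order_id": str, "refund_cents": int, "approved": bool}


def format_compliance(raw: str, schema: dict = SCHEMA) -> list[str]:
    """Return a list of violations; an empty list means the response complies."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in schema.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif type(payload[field]) is not expected:
            # Exact type match, so a JSON boolean does not pass as an int.
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems


good = '{"order_id": "A1", "refund_cents": 1200, "approved": true}'
bad = '{"order_id": "A1", "refund_cents": "1200"}'
print(format_compliance(good))
print(format_compliance(bad))
```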
Output-quality judges fit best when the workload is mostly one-shot. RAG question-answering, summarization, content generation, and single-step Q&A bots all carry the value in the response. The trajectory is shallow or absent. For these workloads, trajectory metrics over-engineer the problem.
The failure mode: applying output-quality judges to multi-step agents. A coding-assistant agent that fixed the right file, ran the wrong tests, and produced a green response will score well on output-quality and badly on trajectory. The bug surface is in the trajectory; the framework cannot see it.
Pricing matters here more than on the other surfaces, because output-quality judges run on every production response, not just CI traces. Galileo Luna-2’s flat per-token pricing is one anchor; Future AGI’s Platform classifier-cascade runs the same rubrics at lower per-eval cost than Galileo Luna-2. Hand-rolling GPT-4 as a judge across millions of monthly traces can cost more than the model that produced them. The distilled-judge layer is what makes online scoring economical.
The decision: which framework for which job
| Your workload | Bug surface | Primary framework | Secondary |
|---|---|---|---|
| RAG QA, summarization, content gen | Hallucination, format, tone | Output-quality (Groundedness, Hallucination) | Trajectory only if retrieval depth > 2 |
| Customer-support agent (3-5 turns, 2-4 tools) | Tool selection, recovery, conversation flow | Trajectory (six dimensions) | Output quality on the final turn |
| Coding agent (planner + tools + 8+ steps) | Plan coherence, argument bugs, compound error | Trajectory (with consistency slice) | Public benchmarks at model selection |
| Voice agent (multi-turn, persona-driven) | Tone, persona match, refusal calibration | Output quality + persona-driven simulation | Trajectory on the tool-using turns |
| Browsing/computer-use agent | Plan coherence, action safety, recovery | Trajectory (with action-safety rubric) | None of the public benchmarks fit cleanly yet |
| Anything where you are picking a model | Capability floor | Public benchmarks (BFCL, tau-bench) | Trajectory and output quality once in build |
The decision is rarely “framework A or framework B.” It is which framework is primary, which is secondary, and which has no role. Trajectory-first wins as the primary surface for any agent with three or more tool calls, because that is where compound error lives. Output-quality wins when the response carries the value and the trajectory is thin. Public benchmarks win at model-selection time and lose at release-gate time. The framework matches the bug surface; the bug surface does not match the framework.
The hybrid pattern: all three, but each at the right cadence
Mature programs run all three frameworks. Each at a different cadence, against a different surface, gating a different decision.
Trajectory rubrics in CI. Wire six assertions in the CI fixture, one per dimension, with thresholds calibrated against historical pass rates. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection. The aggregate ships the bug; the per-dimension gate catches it. The Future AGI fi CLI ships per-eval assertions natively; LangSmith, Braintrust, and Phoenix all expose a comparable shape. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics across a 200-case suite outgrow a single-process budget. For the depth on this layer, see CI/CD for AI Agents Best Practices.
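The aggregate-versus-dimension point can be shown with plain assertions. This is a sketch, not any vendor's CLI: the threshold values are hypothetical placeholders that you would calibrate against your own historical pass rates, and the dimension keys simply mirror the six-dimension rubric.

```python
# Hypothetical thresholds; calibrate each against historical pass rates.
THRESHOLDS = {
    "tool_selection": 0.90,
    "argument_extraction": 0.85,
    "result_utilization": 0.85,
    "error_recovery": 0.80,
    "plan_coherence": 0.85,
    "task_completion": 0.80,
}


def failing_dimensions(scores: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Return every dimension whose score is below its own threshold."""
    missing = set(thresholds) - set(scores)
    if missing:
        raise KeyError(f"no score reported for: {sorted(missing)}")
    return {dim: scores[dim] for dim in thresholds if scores[dim] < thresholds[dim]}


scores = {
    "tool_selection": 0.97,
    "argument_extraction": 0.62,
    "result_utilization": 0.88,
    "error_recovery": 0.86,
    "plan_coherence": 0.90,
    "task_completion": 0.87,
}

aggregate = sum(scores.values()) / len(scores)
failures = failing_dimensions(scores)
print(f"aggregate: {aggregate:.2f}")
print(f"failing:   {failures}")
```

In a CI job the last step would be `assert not failures`, so the build fails on the argument-extraction score even though the aggregate still reads 0.85.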
Output-quality judges on live spans. Same rubrics, different surface. Score the production trace stream with cheap distilled judges (Future AGI turing_flash runs guardrail screening at 50 to 70 ms p95; Galileo Luna-2 is the alternative). The offline set was frozen before users found the failure mode. Online scoring is the regression signal the offline set cannot have.
Public benchmarks at model-selection time. When you swap GPT-5 for Claude Opus 4.7, run BFCL and tau-bench on the candidate. They tell you whether the new model can hold the floor on tool calling and consistency. Once the model is picked, the public benchmarks have done their job. Stop reading them every Monday morning.
The aggregation layer is where the three meet. Cost-per-success, recovery rate, and planner-depth ratio compute from trace and eval data regardless of which framework emitted the score. Wire all three into the same dashboard with per-intent and per-cohort filters. Headline aggregates hide every regression that lives in one slice.
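A per-intent rollup of those three numbers might look like the sketch below. The metric definitions are assumptions made for illustration: recovery rate is taken as the share of runs that hit at least one tool error and still succeeded, and planner-depth ratio as actual steps divided by expected steps. The `Run` record and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    intent: str
    success: bool
    cost_usd: float
    steps: int
    expected_steps: int
    tool_errors: int


def slice_metrics(runs: list[Run]) -> dict:
    """Compute the three aggregation metrics separately for each intent."""
    by_intent = defaultdict(list)
    for run in runs:
        by_intent[run.intent].append(run)
    out = {}
    for intent, group in by_intent.items():
        successes = sum(r.success for r in group)
        errored = [r for r in group if r.tool_errors > 0]
        out[intent] = {
            "cost_per_success": (
                sum(r.cost_usd for r in group) / successes if successes else float("inf")
            ),
            # None when no run in the slice hit a tool error.
            "recovery_rate": (
                sum(r.success for r in errored) / len(errored) if errored else None
            ),
            "planner_depth_ratio": (
                sum(r.steps for r in group) / sum(r.expected_steps for r in group)
            ),
        }
    return out


runs = [
    Run("refund", True, 0.10, 4, 4, 0),
    Run("refund", True, 0.10, 4, 4, 0),
    Run("rebook", True, 0.30, 12, 6, 1),
    Run("rebook", False, 0.30, 12, 6, 2),
]
metrics = slice_metrics(runs)
for intent, values in metrics.items():
    print(intent, values)
```

Averaged across all four runs the picture looks tolerable; split by intent, the `rebook` slice shows double the expected depth and six times the cost per success.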
Common mistakes when wiring a metrics framework
- Picking the framework by vendor, not by bug surface. “We use LangSmith” is a tool, not a metric program. The question is which framework matches your trajectory depth, not which UI you prefer.
- One framework forever. Workloads evolve. The RAG bot that grew a planner is now a trajectory problem. Audit quarterly.
- No irrelevance bucket on tool selection. The over-call regression is invisible without cases where the gold answer is no tool.
- Output-quality judges on multi-step agents. Clean response, broken trajectory. The framework cannot see the bug.
- Trajectory rubrics on a one-shot RAG bot. Over-engineering. The bug surface is in the response, not the trace.
- Mocked tools, no error-recovery coverage. Happy-path eval at 0.95. Production 429 storm at 0.30.
- Treating public benchmarks as the gate. BFCL says the model can call tools. It says nothing about your registry.
- Eval and trace in different tools with no join. No on-call engineer cross-references two dashboards under pressure. Attach scores to the span.
Recent framework updates
| Date | Event | Why it matters |
|---|---|---|
| 2024 | Trajectory-level eval became standard in major platforms | Output-only eval acknowledged as insufficient for multi-step agents |
| 2025 | DeepEval shipped Task Completion, Tool Correctness, Argument Correctness, Step Efficiency, Plan Adherence, Plan Quality as first-party agent metrics | Open-source eval frameworks formalized the trajectory + outcome split |
| 2025 | OTel GenAI semantic conventions stabilized gen_ai.* for tool, retrieval, and agent spans | Trace-derived trajectory metrics became cross-vendor portable |
| 2025 | BFCL v3 added the irrelevance bucket | Public benchmarks caught up to the trajectory framing |
| 2026 | Distilled judges reached production scale | Online output-quality scoring became cost-feasible at every-response volume |
| 2026 | tau-bench pass^k became the standard consistency signal | Nondeterminism cost of multi-step agents made visible |
How Future AGI ships all three frameworks
Future AGI is the production-grade agent eval platform that ships all three frameworks on one Apache 2.0 self-hostable plane. The pattern is the eval-stack package, not a single product. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.
ai-evaluation SDK (Apache 2.0) covers all three surfaces. Seven AgentTrajectoryInput metrics (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) for trajectory-first scoring. 50+ EvalTemplate classes (Groundedness, ContextAdherence, ChunkAttribution, Hallucination, Tone, Factual Accuracy, 11 CustomerAgent* templates) for output-quality scoring. Deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) at sub-millisecond latency for the public-benchmark side of the surface.
traceAI (Apache 2.0) handles the OTel span layer that all three frameworks need. 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean spans flow into whatever collector you already run. Eval scores attach to spans via EvalTag; the collector runs evals server-side at zero inline latency.
Future AGI Platform. Self-improving evaluators tuned by feedback. In-product agent-authored custom rubrics. Classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack: HDBSCAN clusters failing trajectories, a Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy, the 4-D trace score, and an immediate_fix naming the change to ship today. The fix feeds the self-improving evaluators; the cluster becomes a candidate dataset entry; the next PR has to clear it.
The Agent Command Center renders trajectory, task-completion, and output-quality scores as first-class panels with per-intent and per-cohort filters, 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching, MCP and A2A protocol support, and OTel observability on the same plane. The eval-stack package closes the loop without three vendors.
Honest framing on alternatives. LangSmith is the cleanest pick if you live in LangChain and want trajectory scoring on LangChain-shaped traces with minimal setup. Galileo Luna-2 leads on flat per-token output-quality pricing at high-volume online scoring. Braintrust is the cleanest experimentation tool for output-quality evals if you do not need a tracing or gateway story. DeepEval is the best open-source code-first surface if your team prefers running everything as a Python library. Future AGI fits when the trajectory, output-quality, and gateway surfaces need to share a runtime. For the broader landscape, see Best AI Agent Observability Tools in 2026.
Sources
- traceAI GitHub repo
- ai-evaluation GitHub repo
- OpenInference GitHub repo
- OpenTelemetry GenAI semantic conventions
- BFCL leaderboard
- tau-bench paper
- DeepEval agent metrics docs
- Future AGI Agent Command Center
Related reading
Read next: The Definitive Guide to AI Agent Evaluation, AI Agent Reliability Metrics in 2026, Evaluating Tool-Calling Agents in 2026, Best AI Agent Observability Tools in 2026