Agent Metrics Frameworks in 2026: Three Approaches, One Decision

Three agent metric frameworks own 2026: trajectory-first, task-completion-first, output-quality-first. Pick by your bug surface, not vendor pitch.

A coding-assistant agent ships a planner upgrade. Final-answer pass-rate moves from 71 to 74 percent. The team celebrates the 3-point lift. Two weeks later, the on-call gets paged: cost has risen 38 percent. Investigation: the planner emits 1.6x as many tool calls per task; tool-call accuracy is unchanged; pass-rate is up because the extra calls recover edge cases; cost-per-success is dramatically worse. The team’s eval program had one framework out of three. Three frameworks own agent metrics in 2026, and the choice is not religion. Pick by what your bug surface looks like: trajectory bugs need trajectory metrics, output bugs need output metrics, and public benchmarks tell you nothing about either.
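The cost-per-success arithmetic behind that story can be sketched in a few lines. This is a minimal illustration using the numbers above; it assumes the 38 percent cost rise applies per task at constant task volume, and the helper name `cost_per_success` is hypothetical.

```python
# Numbers from the scenario above. Cost is normalized so the old planner = 1.0.
# Assumption: the 38 percent rise is per-task spend at unchanged task volume.
old_pass_rate, new_pass_rate = 0.71, 0.74
old_cost_per_task, new_cost_per_task = 1.00, 1.38


def cost_per_success(cost_per_task: float, pass_rate: float) -> float:
    """Average spend needed to obtain one successful task."""
    return cost_per_task / pass_rate


old_cps = cost_per_success(old_cost_per_task, old_pass_rate)
new_cps = cost_per_success(new_cost_per_task, new_pass_rate)
change = new_cps / old_cps - 1.0

print(f"cost-per-success change: {change:+.1%}")  # cost-per-success change: +32.4%
```

Under those assumptions, a 3-point pass-rate lift buys back only a sliver of the extra spend: each success costs roughly a third more than before.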

TL;DR: the three frameworks and what they catch

| Framework | What it scores | When to use it | Where it falls short |
| --- | --- | --- | --- |
| Trajectory-first | The full ordered trace: tool selection, arguments, plan, recovery, task completion | Multi-step agents, tool-heavy workflows, anything past 3 tool calls per task | Misses style and tone drift on the final response |
| Task-completion-first benchmarks | Black-box success on public datasets (BFCL, tau-bench, ToolBench) | Model selection, capability floor, vendor comparison | Tells you nothing about your registry, your schemas, your error codes |
| Output-quality-first judges | The final response against a rubric (G-Eval, Luna-2, Groundedness) | One-shot agents, RAG QA, summarization, content generation | A clean response can come from a broken trajectory |

If you only read one row: most production programs run all three at different cadences. Trajectory rubrics in CI. Output-quality judges on live spans. Public benchmarks at model-selection time. One bug class each.

Why one framework is never enough

Agents are not functions. A 12-span agent run with the right final answer can have 8 wrong tool calls, 3 unnecessary retries, and a 6x cost overrun. Final-answer accuracy alone scores it as a success. The engineering reality is that the agent is broken.

The math agrees. If steps fail independently, end-to-end success on a k-step agent is roughly the product of per-step success rates. A 95 percent per-step agent over eight steps lands near 66 percent. A 99 percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is the default math of compound error, and it is why teams ship agents that pass per-turn eval and tank production.
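The compounding is easy to check. The sketch below assumes every step succeeds independently with the same probability, which real agents only approximate; the function name is hypothetical.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


for per_step in (0.95, 0.99):
    print(f"{per_step:.2f} -> {end_to_end_success(per_step, 8):.3f}")
# 0.95 -> 0.663
# 0.99 -> 0.923
```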

Each framework sees a different class of bug. Output-quality judges miss tool-argument failures. Trajectory rubrics miss style and tone drift. Public benchmarks miss everything specific to your tools. Running one in isolation guarantees a blind spot somewhere.

Figure: Four buckets of agent metrics (outcome, trajectory, cost, recovery), each with three example metrics.

Framework 1: trajectory-first (the six-dimension rubric)

Trajectory-first frameworks score the full ordered sequence: system prompt, user input, agent reasoning, tool calls with arguments and returns, retrieval results, intermediate LLM calls, final response. The trace is the unit of evaluation. Score the trajectory or you are grading luck.

The six dimensions that map cleanly onto trace shape:

  • Tool selection. Did the agent pick the right tool, or correctly call none? Three failure modes: wrong tool, no tool when one was needed, fabricated tool. The piece most posts drop is the irrelevance bucket: cases where the gold answer is no tool call (greeting, clarification, in-model factual question). Without it, you cannot detect the regression where a new prompt makes the model bolder about calling search on every input.
  • Argument extraction. Right tool with wrong arguments is the most common production agent bug. Three buckets: schema mismatch (Pydantic catches this), semantic mismatch (departure_date="2026-01-01" validates and is wrong if the user said “next Friday”), edge-case handling (null on optional fields, timezone on date fields, unicode in identifiers).
  • Result utilization. The tool returned. The agent has the payload. Did the agent use it, or substitute model knowledge? get_account_balance returns {"balance_cents": 12_400} and the model “knows” the standard $200 minimum, so it replies “your balance is above the $200 threshold.” The tool result was never read.
  • Error recovery. Real tools fail. APIs time out, return 429s, return malformed JSON. The agent’s behavior is a separate axis from happy-path. Did it retry with corrected arguments on a 400, or send the same broken string again? Did it stop at a sensible retry cap?
  • Plan coherence. For multi-step agents: no loops, no dead-ends, right depth. A two-step task takes roughly two steps. A ten-step task takes roughly ten. Sub-tree explosion is a regression.
  • Task completion. End-to-end success on the user goal, scored on the full trajectory rather than the final turn alone. Add a consistency slice: pick 30 hard cases, run them k times, the fraction that succeed on all k is your pass^k.
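The irrelevance bucket from the tool-selection dimension can be sketched as a separate slice of the score. This is an illustrative scorer, not any vendor's API: the `Case` type, the field names, and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    expected_tool: Optional[str]  # None means the gold answer is "no tool call"
    called_tool: Optional[str]    # None means the agent called nothing


def tool_selection_report(cases: list[Case]) -> dict[str, float]:
    """Score the relevant and irrelevance buckets separately."""
    relevant = [c for c in cases if c.expected_tool is not None]
    irrelevant = [c for c in cases if c.expected_tool is None]

    def accuracy(bucket: list[Case]) -> float:
        if not bucket:
            return float("nan")  # an empty bucket has no score
        hits = sum(c.called_tool == c.expected_tool for c in bucket)
        return hits / len(bucket)

    return {
        "relevant_accuracy": accuracy(relevant),
        "irrelevance_accuracy": accuracy(irrelevant),
    }


cases = [
    Case("search", "search"),
    Case("get_weather", "get_weather"),
    Case("book_flight", "search"),   # wrong tool
    Case(None, None),                # greeting: correctly no call
    Case(None, "search"),            # over-call on a clarification
]
report = tool_selection_report(cases)
print(report)
```

Reporting the two buckets separately is the point: a prompt change that makes the model call `search` on everything can leave the relevant bucket flat while the irrelevance bucket collapses.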

Trajectory-first frameworks need spans. Without OpenTelemetry on every tool call and sub-agent dispatch, the data the rubric needs is not there. LangSmith covers this for LangChain-shaped traces. Phoenix and OpenInference cover the open-source surface. Future AGI’s traceAI (Apache 2.0) ships 14 span kinds across 50+ AI surfaces in four languages. For the depth on the rubrics, see The Definitive Guide to AI Agent Evaluation.

Framework 2: task-completion-first (the public benchmarks)

Task-completion-first frameworks treat the agent as a black box. Inputs in, outputs out, score against a public dataset. Three benchmarks anchor the floor in 2026:

BFCL (Berkeley Function Calling Leaderboard) breaks tool calling into three tracks: AST correctness (the call parses), executable correctness (the call actually runs on a real endpoint), and an irrelevance-detection bucket (the model correctly does not call a tool). A model that aces AST and tanks irrelevance overcalls on your registry. A model that aces AST and tanks executable generates plausible non-running calls.
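To make the AST idea concrete, here is a minimal sketch that parses a generated call string with Python's standard `ast` module and compares the function name and keyword arguments to a gold call. It illustrates the general technique only and is not BFCL's harness; the helper names and the `get_weather` tool are hypothetical, and positional arguments are deliberately out of scope.

```python
import ast


def parse_call(source: str):
    """Return (function_name, kwargs) for a keyword-only call, else None."""
    try:
        node = ast.parse(source, mode="eval").body
    except SyntaxError:
        return None  # the call does not even parse
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return None
    if node.args:  # positional arguments are not handled in this sketch
        return None
    try:
        # literal_eval rejects anything that is not a plain literal value
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError):
        return None
    return node.func.id, kwargs


def ast_match(source: str, expected_name: str, expected_kwargs: dict) -> bool:
    """True when the generated call has the gold name and gold arguments."""
    return parse_call(source) == (expected_name, expected_kwargs)


gold = ("get_weather", {"city": "Paris", "unit": "celsius"})
print(ast_match('get_weather(city="Paris", unit="celsius")', *gold))
print(ast_match('get_weather(city="Paris"', *gold))    # unbalanced parenthesis
print(ast_match('get_wether(city="Paris")', *gold))    # wrong function name
```

A check like this says nothing about whether the call would run against a live endpoint, which is the gap the executable track covers.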

tau-bench runs multi-turn agents in airline and retail with an LLM-simulated user, a domain policy, and tool access. The headline metric is pass^k across k independent rollouts. Even strong models land below 25 percent at pass^8 on retail. Multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric is the cleanest exposure of that fact.
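A pass^k number can be estimated from n rollouts per task with a combinatorial estimator: for a task with c successes out of n, the chance that k rollouts drawn without replacement all succeed is C(c, k) / C(n, k). The sketch below assumes that estimator; whether it matches any benchmark's official harness exactly is an assumption, and the function name and sample data are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), with c successes in n rollouts."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: five tasks, eight rollouts each.
successes = [8, 8, 7, 4, 0]
for k in (1, 4, 8):
    print(k, round(pass_hat_k(successes, n=8, k=k), 3))
```

With k equal to n the estimator reduces to the plain definition: the fraction of tasks that succeeded on every rollout. In this sample the average success rate is 0.675 at k=1 but only two of five tasks survive all eight rollouts.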

ToolBench tests across thousands of real APIs with a focus on instruction-following and tool composition. Use it when API breadth matters more than depth on any single tool.

The honest framing: public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, your argument schemas, your error codes, or your business policy. Treat them as model-selection signals. Use them when you are picking between GPT-5, Claude Opus 4.7, and Gemini for a new build. Do not gate releases on them. The private eval set, stratified by your tools and your error codes, is the one that gates production. For the rubric depth on this point, see Evaluating Tool-Calling Agents in 2026.

Framework 3: output-quality-first (LLM-judge stacks)

Output-quality-first frameworks score the final response against a rubric, ignoring the trajectory. G-Eval popularized the form-filling pattern. Galileo Luna-2 distilled it for cost. Braintrust ships an experimentation-first surface for output evals. Future AGI’s Groundedness, Hallucination, Tone, Factual Accuracy, and 50+ other EvalTemplate classes cover the same axis with the Turing judge family behind them.

The common rubric shapes:

  • Groundedness. Does each claim in the response trace to a chunk in the retrieved context (for RAG) or the tool result payload (for tool-using agents)?
  • Hallucination. Are there claims that cannot be verified against any source?
  • Faithfulness. Does the response stay anchored to the input intent, or drift into adjacent topics?
  • Format compliance. Does the response match the schema (JSON valid, fields present, types correct)?
  • Tone, persona match, style. Does the response sound the way the brand or persona expects?
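Of the rubric shapes above, format compliance is the one that needs no judge at all. A minimal deterministic sketch using only the standard library follows; the schema in `REQUIRED_FIELDS` is a hypothetical example, not a schema from any product.

```python
import json

# Hypothetical response schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}


def format_compliance(raw: str, required: dict = REQUIRED_FIELDS) -> list[str]:
    """Return a list of problems; an empty list means the response complies."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


good = '{"answer": "42", "sources": ["doc-1"], "confidence": 0.9}'
bad = '{"answer": "42", "confidence": "high"}'
print(format_compliance(good))
print(format_compliance(bad))
```

Running the cheap deterministic checks first keeps judge calls for the dimensions that actually need a model, such as groundedness and tone.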

Output-quality judges fit best when the workload is mostly one-shot. RAG question-answering, summarization, content generation, and single-step Q&A bots all carry the value in the response. The trajectory is shallow or absent. For these workloads, trajectory metrics over-engineer the problem.

The failure mode: applying output-quality judges to multi-step agents. A coding-assistant agent that fixed the right file, ran the wrong tests, and produced a green response will score well on output-quality and badly on trajectory. The bug surface is in the trajectory; the framework cannot see it.

Pricing matters here more than the other surfaces, because output-quality judges run on every production response, not just CI traces. Galileo Luna-2’s flat per-token pricing is one anchor; Future AGI’s Platform classifier-cascade runs the same rubrics at lower per-eval cost than Galileo Luna-2. Hand-rolling GPT-4 as a judge across millions of monthly traces costs more than the model that produced them. The distilled-judge layer is what makes online scoring economical.

The decision: which framework for which job

| Your workload | Bug surface | Primary framework | Secondary |
| --- | --- | --- | --- |
| RAG QA, summarization, content gen | Hallucination, format, tone | Output-quality (Groundedness, Hallucination) | Trajectory only if retrieval depth > 2 |
| Customer-support agent (3-5 turns, 2-4 tools) | Tool selection, recovery, conversation flow | Trajectory (six dimensions) | Output quality on the final turn |
| Coding agent (planner + tools + 8+ steps) | Plan coherence, argument bugs, compound error | Trajectory (with consistency slice) | Public benchmarks at model selection |
| Voice agent (multi-turn, persona-driven) | Tone, persona match, refusal calibration | Output quality + persona-driven simulation | Trajectory on the tool-using turns |
| Browsing/computer-use agent | Plan coherence, action safety, recovery | Trajectory (with action-safety rubric) | None of the public benchmarks fit cleanly yet |
| Anything where you are picking a model | Capability floor | Public benchmarks (BFCL, tau-bench) | Trajectory and output quality once in build |

The decision is rarely “framework A or framework B.” It is which framework is primary, which is secondary, and which has no role. Trajectory-first wins as the primary surface for any agent with three or more tool calls, because that is where compound error lives. Output-quality wins when the response carries the value and the trajectory is thin. Public benchmarks win at model-selection time and lose at release-gate time. The framework matches the bug surface; the bug surface does not match the framework.
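As a rough rule of thumb, the primary-framework choice in that paragraph reduces to two questions. The sketch below encodes only that paragraph; the function name and the returned labels are hypothetical, and a real program would also pick a secondary framework from the table.

```python
def primary_framework(tool_calls_per_task: int, selecting_model: bool = False) -> str:
    """Pick the primary framework from trajectory depth and current decision."""
    if selecting_model:
        return "public-benchmarks"  # capability floor at model-selection time
    if tool_calls_per_task >= 3:
        return "trajectory"         # compound error lives here
    return "output-quality"         # the response carries the value


print(primary_framework(tool_calls_per_task=8))
print(primary_framework(tool_calls_per_task=1))
print(primary_framework(tool_calls_per_task=8, selecting_model=True))
```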

The hybrid pattern: all three, but each at the right cadence

Mature programs run all three frameworks. Each at a different cadence, against a different surface, gating a different decision.

Trajectory rubrics in CI. Wire six assertions in the CI fixture, one per dimension, with thresholds calibrated against historical pass rates. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection. The aggregate ships the bug; the per-dimension gate catches it. The Future AGI fi CLI ships per-eval assertions natively; LangSmith, Braintrust, and Phoenix all expose a comparable shape. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics across a 200-case suite outgrow a single-process budget. For the depth on this layer, see CI/CD for AI Agents Best Practices.
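The aggregate-versus-per-dimension point can be shown with plain Python, independent of any vendor CLI. The scores below are made up to reproduce the 0.85 aggregate hiding a 0.62, and the uniform 0.80 threshold is a placeholder for thresholds calibrated on your own history.

```python
# Hypothetical per-dimension scores from one CI run.
SCORES = {
    "tool_selection": 0.97,
    "argument_extraction": 0.62,
    "result_utilization": 0.88,
    "error_recovery": 0.86,
    "plan_coherence": 0.90,
    "task_completion": 0.87,
}
# Placeholder thresholds; calibrate each against historical pass rates.
THRESHOLDS = {dimension: 0.80 for dimension in SCORES}


def failing_dimensions(scores: dict, thresholds: dict) -> list[str]:
    """Return every dimension whose score is below its own threshold."""
    return [name for name, score in scores.items() if score < thresholds[name]]


aggregate = sum(SCORES.values()) / len(SCORES)
failures = failing_dimensions(SCORES, THRESHOLDS)

print(f"aggregate: {aggregate:.2f}")          # aggregate: 0.85
print(f"aggregate gate passes: {aggregate >= 0.80}")
print(f"per-dimension failures: {failures}")  # per-dimension failures: ['argument_extraction']
```

In a CI fixture, each entry in `failures` would be its own assertion so the failing dimension shows up by name in the build log.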

Output-quality judges on live spans. Same rubrics, different surface. Score the production trace stream with cheap distilled judges (Future AGI turing_flash runs guardrail screening at 50 to 70 ms p95; Galileo Luna-2 is the alternative). The offline set was frozen before users found the failure mode. Online scoring is the regression signal the offline set cannot have.

Public benchmarks at model-selection time. When you swap GPT-5 for Claude Opus 4.7, run BFCL and tau-bench on the candidate. They tell you whether the new model can hold the floor on tool calling and consistency. Once the model is picked, the public benchmarks have done their job. Stop reading them every Monday morning.

The aggregation layer is where the three meet. Cost-per-success, recovery rate, and planner-depth ratio compute from trace and eval data regardless of which framework emitted the score. Wire all three into the same dashboard with per-intent and per-cohort filters. Headline aggregates hide every regression that lives in one slice.
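A minimal sketch of the per-intent slicing, assuming trace records have already been exported as dictionaries. The record fields, intents, and numbers are hypothetical, and planner-depth ratio is left out because its exact definition is not given here.

```python
from collections import defaultdict

# Hypothetical exported trace records.
TRACES = [
    {"intent": "refund", "success": True, "cost_usd": 0.04, "tool_errors": 1, "recovered": 1},
    {"intent": "refund", "success": True, "cost_usd": 0.05, "tool_errors": 0, "recovered": 0},
    {"intent": "rebooking", "success": False, "cost_usd": 0.30, "tool_errors": 2, "recovered": 0},
    {"intent": "rebooking", "success": True, "cost_usd": 0.26, "tool_errors": 2, "recovered": 1},
]


def slice_metrics(traces: list[dict]) -> dict[str, dict]:
    """Compute cost-per-success and recovery rate for each intent."""
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0, "errors": 0, "recovered": 0})
    for trace in traces:
        bucket = totals[trace["intent"]]
        bucket["cost"] += trace["cost_usd"]
        bucket["successes"] += int(trace["success"])
        bucket["errors"] += trace["tool_errors"]
        bucket["recovered"] += trace["recovered"]

    report = {}
    for intent, bucket in totals.items():
        report[intent] = {
            # Infinite cost-per-success flags a slice with zero successes.
            "cost_per_success": (
                bucket["cost"] / bucket["successes"] if bucket["successes"] else float("inf")
            ),
            # None means the slice saw no tool errors to recover from.
            "recovery_rate": (
                bucket["recovered"] / bucket["errors"] if bucket["errors"] else None
            ),
        }
    return report


report = slice_metrics(TRACES)
for intent, metrics in report.items():
    print(intent, metrics)
```

In this made-up sample the `rebooking` slice costs more than ten times as much per success as `refund` and recovers from a quarter of its tool errors, which a single blended number would blur.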

Common mistakes when wiring a metrics framework

  • Picking the framework by vendor, not by bug surface. “We use LangSmith” is a tool, not a metric program. The question is which framework matches your trajectory depth, not which UI you prefer.
  • One framework forever. Workloads evolve. The RAG bot that grew a planner is now a trajectory problem. Audit quarterly.
  • No irrelevance bucket on tool selection. The over-call regression is invisible without cases where the gold answer is no tool.
  • Output-quality judges on multi-step agents. Clean response, broken trajectory. The framework cannot see the bug.
  • Trajectory rubrics on a one-shot RAG bot. Over-engineering. The bug surface is in the response, not the trace.
  • Mocked tools, no error-recovery coverage. Happy-path eval at 0.95. Production 429 storm at 0.30.
  • Treating public benchmarks as the gate. BFCL says the model can call tools. It says nothing about your registry.
  • Eval and trace in different tools with no join. No on-call engineer cross-references two dashboards under pressure. Attach scores to the span.
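The mocked-tools mistake has a cheap fix: make the mock fail on purpose. The sketch below is a hypothetical fake tool that raises a 429-style error a fixed number of times, plus a simple retry loop standing in for the agent under test. All names are illustrative, and backoff sleeps are omitted to keep the example fast.

```python
class RateLimited(Exception):
    """Stands in for an HTTP 429 from a real tool."""


class FlakyTool:
    """Fake tool that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success: int):
        self.failures_left = failures_before_success
        self.calls = 0

    def __call__(self, query: str) -> dict:
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RateLimited("429 Too Many Requests")
        return {"query": query, "status": "ok"}


def call_with_retry(tool, query: str, max_attempts: int = 3):
    """Retry on rate limits and give up (return None) at the retry cap."""
    for _ in range(max_attempts):
        try:
            return tool(query)
        except RateLimited:
            continue
    return None


recovering = FlakyTool(failures_before_success=2)
storm = FlakyTool(failures_before_success=5)

print(call_with_retry(recovering, "order 123"), recovering.calls)
print(call_with_retry(storm, "order 123"), storm.calls)
```

Two eval cases fall out of this: one where the correct behavior is to recover within the cap, and one where the correct behavior is to stop at the cap and surface the failure.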

Recent framework updates

| Date | Event | Why it matters |
| --- | --- | --- |
| 2024 | Trajectory-level eval became standard in major platforms | Output-only eval acknowledged as insufficient for multi-step agents |
| 2025 | DeepEval shipped Task Completion, Tool Correctness, Argument Correctness, Step Efficiency, Plan Adherence, Plan Quality as first-party agent metrics | Open-source eval frameworks formalized the trajectory + outcome split |
| 2025 | OTel GenAI semantic conventions stabilized gen_ai.* for tool, retrieval, and agent spans | Trace-derived trajectory metrics became cross-vendor portable |
| 2025 | BFCL v3 added the irrelevance bucket | Public benchmarks caught up to the trajectory framing |
| 2026 | Distilled judges reached production scale | Online output-quality scoring became cost-feasible at every-response volume |
| 2026 | tau-bench pass^k became the standard consistency signal | Nondeterminism cost of multi-step agents made visible |

How Future AGI ships all three frameworks

Future AGI is the production-grade agent eval platform that ships all three frameworks on one Apache 2.0 self-hostable plane. The pattern is the eval-stack package, not a single product. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.

ai-evaluation SDK (Apache 2.0) covers all three surfaces. Seven AgentTrajectoryInput metrics (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) for trajectory-first scoring. 50+ EvalTemplate classes (Groundedness, ContextAdherence, ChunkAttribution, Hallucination, Tone, Factual Accuracy, 11 CustomerAgent* templates) for output-quality scoring. Deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) at sub-millisecond latency for the public-benchmark side of the surface.

traceAI (Apache 2.0) handles the OTel span layer that all three frameworks need. 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean spans flow into whatever collector you already run. Eval scores attach to spans via EvalTag; the collector runs evals server-side at zero inline latency.

Future AGI Platform. Self-improving evaluators tuned by feedback. In-product agent-authored custom rubrics. Classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the eval stack: HDBSCAN clusters failing trajectories, a Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy, the 4-D trace score, and an immediate_fix naming the change to ship today. The fix feeds the self-improving evaluators; the cluster becomes a candidate dataset entry; the next PR has to clear it.

The Agent Command Center renders trajectory, task-completion, and output-quality scores as first-class panels with per-intent and per-cohort filters, 100+ providers, 18+ built-in guardrail scanners, exact and semantic caching, MCP and A2A protocol support, and OTel observability on the same plane. The eval-stack package closes the loop without three vendors.

Honest framing on alternatives. LangSmith is the cleanest pick if you live in LangChain and want trajectory scoring on LangChain-shaped traces with minimal setup. Galileo Luna-2 leads on flat per-token output-quality pricing at high-volume online scoring. Braintrust is the cleanest experimentation tool for output-quality evals if you do not need a tracing or gateway story. DeepEval is the best open-source code-first surface if your team prefers running everything as a Python library. Future AGI fits when the trajectory, output-quality, and gateway surfaces need to share a runtime. For the broader landscape, see Best AI Agent Observability Tools in 2026.

Read next: The Definitive Guide to AI Agent Evaluation, AI Agent Reliability Metrics in 2026, Evaluating Tool-Calling Agents in 2026, Best AI Agent Observability Tools in 2026

Frequently asked questions

What are the three agent metric frameworks teams use in 2026?
Trajectory-first frameworks score the full ordered sequence (tool selection, arguments, result use, plan, recovery, task completion). Future AGI's six-dimension rubric is the cleanest example, and DeepEval, LangSmith, and Phoenix all expose variants. Task-completion-first frameworks treat the agent as a black box and score success on public benchmarks: BFCL for function calling, tau-bench for multi-turn retail and airline, ToolBench for API breadth. Output-quality-first frameworks score the final response with an LLM judge (G-Eval, Galileo Luna-2, Braintrust's eval suite). None of the three is wrong. Each catches a different class of bug. Mature programs run all three at different cadences, not one in isolation.
When should I pick a trajectory-first framework?
When tool calls, retrievals, or multi-step plans drive the failure mode. If the response looks right but the agent picked the wrong tool, passed bad arguments, or looped three times before answering, trajectory metrics are the only ones that surface it. The math is unforgiving: a 95 percent per-step agent over eight steps lands near 66 percent end-to-end. Per-step rubrics gate the regression before it compounds. Trajectory-first frameworks need OpenTelemetry spans on every tool call, retrieval, and sub-agent dispatch; without traces, you are scoring the response with extra words.
When are public benchmarks like BFCL and tau-bench enough?
When you are picking a model, not gating a release. BFCL tells you whether GPT-5 or Claude Opus 4.7 generates syntactically valid calls on a generic registry; tau-bench tells you whether the model stays consistent across eight rollouts of a retail conversation. Neither tells you anything about your tool registry, argument schemas, error codes, or business policy. Use them as model-selection signals. Then build a private eval set stratified by your tools, your arguments, and your error codes. The private set is the one that gates production. Public benchmarks anchor the floor.
When does output-quality-first scoring fit the job?
When the agent is mostly one-shot, the response carries the value, and the tool layer is shallow. RAG question-answering, summarization, content generation, and single-step Q&A bots are the classic cases. G-Eval, Galileo Luna-2, Braintrust, and Future AGI's Groundedness and Hallucination templates all score the final response against a rubric. For these workloads, trajectory metrics over-engineer the problem. For multi-step agents with five or more tool calls, the same approach fails: a clean response can come from a broken trajectory.
Why do most teams need all three frameworks?
Because each framework sees a different bug surface. Output-quality judges miss tool-argument bugs (the response paraphrases the wrong tool result). Trajectory rubrics miss style and tone drift (the trajectory is clean but the response is wooden). Public benchmarks miss everything that is specific to your domain. The pattern that ships: trajectory metrics in CI (gate every PR), output-quality judges on live spans (catch drift in production), public benchmarks at model-selection time (anchor the floor). One cadence each, one bug class each, one source of truth per surface.
How does Future AGI's framework compare to LangSmith, Galileo, and Braintrust?
Future AGI ships all three frameworks on one Apache 2.0 plane. The ai-evaluation SDK exposes seven trajectory metrics plus 50+ EvalTemplate classes for output quality, traceAI handles the OTel span layer across Python, TypeScript, Java, and C#, and the platform classifier-cascade runs Groundedness and Hallucination scores at lower per-eval cost than Galileo Luna-2. LangSmith is strongest as a LangChain-native trajectory tool. Galileo's Luna-2 leads on flat per-token output-quality pricing. Braintrust is a clean experimentation tool for output quality with a thinner trajectory story. The decision is rarely about which framework is best in isolation; it is about which one matches your workload's bug surface.
How do I instrument trajectory metrics if I am already running LangSmith or Phoenix?
Wire OpenTelemetry once, route to multiple backends. traceAI is OTel-native and exports spans through the standard collector, so the same trace stream can fan out to Future AGI, Phoenix, LangSmith, or any OTLP receiver. Attach evaluator scores to spans via EvalTag (or the vendor's equivalent) so the eval and the trace share a primary key. Avoid the failure mode of scoring in one tool and tracing in another with no join, because no on-call engineer cross-references two dashboards under pressure. The unit of work is the trace; everything else hangs off it.