Research

Agent Evaluation Frameworks in 2026: 6 Picks Compared for Trajectory-First Teams

Six agent eval frameworks for trajectory-first teams 2026: LangSmith, Future AGI, Braintrust, DeepEval, Phoenix, OpenAI Evals, honest tradeoffs.

·
Updated
·
16 min read
agent-evaluation llm-evaluation agent-observability langsmith deepeval braintrust phoenix 2026
Editorial cover image on a pure black starfield background with faint white grid. Bold all-caps white headline AGENT EVALUATION FRAMEWORKS 2026 fills the left half. The right half shows a wireframe constellation of seven small node-clusters connected by faint lines, with a soft white halo glow on the central cluster, drawn in pure white outlines.
Table of Contents

Most “agent eval” frameworks are LLM eval frameworks with a trajectory bolted on. Pick the one designed trajectory-first. A single input-output pair has one rubric. A multi-step agent has a planner that broke the task into substeps, a tool selector that picked seven tools, a retriever that returned stale chunks twice, and a final answer that may or may not have satisfied the goal. Aggregate “did it work” tells you nothing about which layer regressed last Tuesday. This guide compares six frameworks for that job in 2026: where each fits, where it falls short, what it costs, how to choose.

TL;DR: best agent evaluation framework per use case

Use caseBest pickWhy (one phrase)License
LangChain or LangGraph runtimeLangSmithTrajectory semantics native to the agent frameworkClosed (MIT SDK)
Trajectory-first, framework-neutral, OSSFuture AGISpan-attached evals + simulation + gateway in one stackApache 2.0
Polished hosted experiment UIBraintrustBest dataset diff and experiment surface, hosted onlyClosed
Pytest-style component evalDeepEvalLargest open metric library; CI is the system of recordApache 2.0
OTel-first agent traces under open standardsArize PhoenixOpenInference auto-instrumentation across 14+ frameworksElastic License 2.0
OpenAI-only stack, YAML config is enoughOpenAI EvalsFloor option; single-vendor, CI-onlyMIT

If you only read one row: pick LangSmith if you live in LangGraph, Future AGI when the runtime is mixed and trajectory is the unit, DeepEval when CI is the system of record.

What “agent evaluation” actually has to score

Six dimensions. If a framework cannot score these independently, treat it as LLM eval, not agent eval.

  • Tool selection. Right tool, or correctly no tool. F1 plus an explicit irrelevance bucket so “called search when it shouldn’t have” is a graded failure.
  • Argument extraction. Schema-valid and semantically correct. departure_date="next Friday" schema-validates and fails the user.
  • Result utilization. Did the agent use the tool payload, or substitute model knowledge. Tool returns {"amount_cents": 4500}; agent says “your refund of $54.00 is processing.” Off by an order of magnitude.
  • Error recovery. On 4xx, timeout, or empty result, did the agent retry, fall back, or escalate. Per-call rubrics never see this; the trajectory metric does.
  • Plan coherence. Loop-free, dead-end-free, right depth.
  • Task completion. End-to-end on the user’s goal. A 95-percent per-step agent over eight steps lands near 66 percent end-to-end. That compound-error math is why teams ship agents that pass per-turn eval and tank production.

Score the six separately and the diagnostic vocabulary collapses from “the agent failed” to “argument extractor regressed on date strings on the flight-booking path.” One bisect instead of three days. The frameworks below ship varying coverage across these dimensions. Verify per metric, per framework.

The 2026 landscape at a glance

The split that matters is not OSS-vs-closed; it is trajectory-first versus LLM-eval-with-traces.

FrameworkLicenseTrajectory-first?Self-hostableStrongest at
LangSmithClosed (MIT SDK)Yes, inside LangChainEnterprise tierNative LangGraph eval
Future AGIApache 2.0Yes, framework-neutralYes (OSS plane)Trajectory + span + gateway loop
BraintrustClosedPartialNoHosted experiment UI
DeepEvalApache 2.0Partial (via @observe)Yes (SDK)CI-native pytest eval
PhoenixELv2Partial (OTel traces)YesOpenInference instrumentation
OpenAI EvalsMITNoYes (run-it-yourself)Single-call YAML eval

1. LangSmith: best when your runtime is LangChain or LangGraph

Closed platform. Open-source SDKs. Cloud, hybrid, and self-hosted Enterprise.

Quick take. LangSmith is the lowest-friction pick if every agent run is already a LangGraph execution. Native trajectory tracing, evaluators, datasets, prompt management, deployment, and Fleet workflows live on one surface. The framework knows what a Runnable is and what the planner’s state diff means; you do not infer it from spans.

Ideal for. Teams that debug in the LangChain mental model and want trajectory semantics next to deployment, prompts, and the agent IDE. Strong for support copilots, retrieval-heavy agents, and small-to-medium agent fleets where LangGraph is the system of record.

Key strengths.

  • Trajectory eval reads LangGraph runs natively, not from generic spans.
  • Prompts, datasets, deployment, Fleet, and Studio sit on one surface.
  • Per-node evaluator wiring scores the planner, the tool node, and the post-tool LLM call separately without manual span surgery.
  • Mature dataset versioning, threaded comments, hosted human review queue.

Honest limitations.

  • Strongly opinionated toward LangChain. OTel ingest exists, but non-LangChain agents lose the native trajectory semantics. Custom agents, LiteLLM, and direct provider SDKs sit awkwardly on the surface.
  • Per-seat pricing makes cross-functional access expensive.
  • Closed platform. SDK is MIT; the platform is not.
  • Metric library is shallower than DeepEval’s; you write more custom evaluators than expected.

Pricing intel. Developer is $0 per seat per month with 5K base traces and 1 Fleet agent. Plus is $39 per seat per month with 10K base traces, unlimited Fleet agents, 500 Fleet runs, one dev-sized deployment, up to 3 workspaces. Base traces $2.50 per 1,000 after included usage. Enterprise custom, cloud or hybrid or self-hosted.

Expert verdict. The default for LangChain shops. Picking anything else when your runtime is LangGraph means rebuilding the trajectory layer the framework already gives you. Picking LangSmith when your runtime is not LangChain is paying for semantics you cannot use.

2. Future AGI: best for trajectory-first eval across mixed runtimes

Open source (Apache 2.0). Self-hostable. Hosted cloud option.

Quick take. Future AGI is built trajectory-first, not LLM-eval-with-a-trajectory-bolted-on. The AgentTrajectoryInput type ingests the full step list, the available tools, and the expected goal as one object. Seven trajectory metrics (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) operate on it directly. The same rubrics attach to live spans as EvalTag scorers. Simulation, span-attached online eval, prompt optimization, and gateway-level guardrails run on one plane.

Ideal for. Teams whose runtime is mixed (some LangGraph, some custom, some direct provider SDKs) and who want trajectory eval, tool-call scoring, simulated personas, and span-attached production scores in one OSS deployment. Strong fit for RAG agents, voice agents, support automation, and internal copilots.

Key strengths.

  • AgentTrajectoryInput is a first-class type. The trajectory metric takes the full step list as input, not the final string.
  • traceAI auto-instruments 50+ AI surfaces across Python, TypeScript, Java, and C# (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Mastra, Vercel AI SDK, Spring AI, LangChain4j). Multi-agent dispatch renders as a tree, not a flat span list.
  • Eval stack ships as a package: ai-evaluation SDK (Apache 2.0); the Future AGI Platform with self-improving evaluators at lower per-eval cost than Galileo Luna-2; Error Feed (HDBSCAN + Sonnet 4.5 Judge) writes a 5-category 30-subtype taxonomy and an immediate_fix.
  • Agent Command Center on the same plane: 100+ providers, 18+ built-in guardrail scanners, per-virtual-key AllowedTools / DeniedTools, MCP security.
  • Six prompt optimizers (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard) consume failing trajectories as labelled training data.
  • SOC 2 Type II, HIPAA BAA, GDPR, CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

Honest limitations.

  • More moving parts than a pytest-style library. If you want a CLI runner first and a trace UI later, DeepEval bootstraps faster.
  • The hosted cloud is younger than Braintrust’s experiment UI; teams that buy for the dashboard prefer Braintrust on first impression.
  • Falcon AI (the in-product copilot for skill authoring) is Enterprise Edition only. The OSS self-host story does not include it.

Pricing intel. Free for early teams; usage-based billing at scale. SOC 2, HIPAA BAA, SAML SSO + SCIM, and dedicated CSM available as add-ons. Pricing.

Expert verdict. The strongest pick when the runtime is mixed and the trajectory is the unit. One self-hostable plane covers trace, eval, simulate, optimize, and gateway. Picking it for a LangChain-only shop works, but you pay for trajectory semantics LangSmith already gives you natively.

Future AGI four-panel dark product showcase mapped to agent eval surfaces. Top-left: Trajectory traces with tool-call accuracy column and trajectory length metric. Top-right: Simulation runs with persona library and per-persona pass rate. Bottom-left: Span-attached eval scores attached to a multi-step agent trace. Bottom-right: Optimizer pipeline turning failing traces into prompt revisions with CI gate enforcement.

3. Braintrust: best hosted experiment UI

Closed hosted SaaS. No self-host. Cloud only.

Quick take. Braintrust’s distinguishing capability is the surface. Dataset diffing, experiment cards side by side, prompt and dataset versioning, and a fast hosted human review workflow are the cleanest in this category. Eval primitives are competent; teams pick Braintrust because the UI does not get in the way of running 40 experiments in an afternoon.

Ideal for. Teams that buy for the dashboard. Cross-functional eval workflows where PMs, prompt engineers, and applied scientists read the same experiment results. Strong fit when “the platform is the system of record” matters more than self-hosting.

Key strengths.

  • Hosted experiment UI is the polished frontier of this category.
  • Dataset diffing across experiments is best-in-class for prompt regression review.
  • Native LLM-as-judge runner with built-in rubric authoring.
  • Solid OTel ingest; not LangChain-locked.
  • Strong prompt and dataset versioning with branch-and-merge semantics.

Honest limitations.

  • Closed hosted SaaS. No self-host. Regulated workloads with data-residency constraints push through procurement.
  • Agent-trajectory layer is less first-class than LangSmith or Future AGI. You score per-span more than per-trajectory.
  • Per-seat economics scale awkwardly when compliance and PM teams want read access.
  • Tool-call and multi-agent trace rendering trail LangSmith and Phoenix in the deep-debug case.

Pricing intel. Free tier with 1K traces per month. Pro from $249 per month with 100K traces, 5 seats, full experiment surface. Enterprise custom with SOC 2, SSO, dedicated support. Per-trace overages metered.

Expert verdict. The right pick when the UI is the product. Teams that need a hosted experiment surface PMs can use without an SDK should pick Braintrust. Teams that need self-host, OSS, or a first-class trajectory type should look elsewhere.

4. DeepEval: best for pytest-style component eval

Open source (Apache 2.0). Confident AI is the hosted cloud.

Quick take. DeepEval is the open-source LLM eval framework with the largest publicly documented metric library. Pytest ergonomics: write assert metric.score > 0.7, run deepeval test run, get a regression suite that fits your existing CI. For agents, component-level eval via @observe makes each trajectory step a unit test.

Ideal for. AI/ML teams that already write pytest suites and want eval as a CI job, not a separate dashboard. Strong when CI is the system of record and you want broad open metric coverage with zero vendor lock.

Key strengths.

  • Largest open metric library: AnswerRelevancy, GEval (research-backed custom-criteria scorer), Faithfulness, ContextualPrecision, Bias, Toxicity, Hallucination, ConversationCompleteness, plus 30+ others.
  • 14 safety vulnerability scanners shipped in the library.
  • pytest-native ergonomics. Fits existing CI without a new runner.
  • @observe component-level eval works with any tracing backend.
  • Confident AI hosted cloud adds a dashboard if you want one later.

Honest limitations.

  • Trace UI is less polished than purpose-built agent platforms. Flame-graph trajectory replay across sub-agents is thin.
  • Agent-trajectory metrics are component-stitched, not first-class. There is no equivalent of AgentTrajectoryInput that takes the full step list as one object.
  • Python-only. TS / Java / C# teams build their own bridges.
  • Multi-agent supervisor-worker rendering is weaker than LangSmith or Future AGI.

Pricing intel. Framework free and Apache 2.0. Confident AI Free is $0/month with 2 seats, 1 project, 5 test runs per week. Starter $19.99+/seat/month with full regression suite plus 1 GB-month traces. Premium $49.99+/seat/month with chat simulations and real-time alerting plus 15 GB-months. Team and Enterprise custom (75 GB-months, HIPAA/SOC 2, on-prem).

Expert verdict. The right pick when eval lives in CI and the metric library is the product. Pair it with Phoenix when you need a polished trajectory replay UI on top.

5. Arize Phoenix: best for OTel-native agent traces

Source available (Elastic License 2.0). Self-hostable. Phoenix Cloud and Arize AX paths exist.

Quick take. Phoenix is the right pick when open standards are the binding constraint. It ships agent trace rendering, eval scores attached to spans, datasets, experiments, and prompt iteration without buying the full Arize AX product. OpenInference instrumentation across 14+ agent frameworks is the deepest in the category by raw coverage.

Ideal for. Platform-engineering teams that care about OpenTelemetry and OpenInference, already use Arize for ML observability, or want a self-hosted lab with a path into the broader Arize platform when procurement asks.

Key strengths.

  • OpenTelemetry and OpenInference first-class. OTLP ingest, standard span attributes, no proprietary lock.
  • OpenInference auto-instrumentation for LangChain, LlamaIndex, DSPy, Mastra, Vercel AI SDK, OpenAI, Bedrock, Anthropic, CrewAI, OpenAI Agents, AutoGen, Pydantic AI across Python, plus 13 JS packages and 4 Java packages.
  • phoenix-evals ships prebuilt and custom scorers as a separate package; clean separation between tracing and eval.
  • Self-hostable for free; you manage span volume and retention.
  • Arize AX upgrade path when procurement requires a closed contract.

Honest limitations.

  • Elastic License 2.0, not OSI-open. Call it source available if your legal team uses OSI definitions.
  • Agent-trajectory metrics are partial. The workbench is strong; the per-trajectory rubric library is shallower than what Future AGI ships natively.
  • AX is a separate closed product with a different SKU.
  • Self-host requires you to operate the data plane (ClickHouse, retention, scaling).

Pricing intel. Phoenix is free and self-hosted. Arize AX Free: 25K spans/month, 1 GB ingestion, 15-day retention. AX Pro $50/month with 50K spans, 30-day retention, email support. AX Enterprise custom with SOC 2, HIPAA, dedicated support, self-hosting.

Expert verdict. The strongest open-standards pick. Teams that want OTel-native agent tracing under one self-hostable lab should pick Phoenix. For first-class trajectory metrics, pair phoenix-evals with a heavier eval framework.

6. OpenAI Evals: best as the floor for OpenAI-only stacks

Open source (MIT). YAML-defined. Run-it-yourself.

Quick take. OpenAI’s official eval library. YAML-defined evals run against OpenAI completions; deterministic and model-graded checks both supported. The floor option for OpenAI-first teams who want a single-vendor eval surface and do not need a dashboard, trajectory metrics, or multi-agent rendering.

Ideal for. Single-vendor OpenAI stacks where the workload is single-call completions, CI is the only surface that matters, and the team can wire YAML configs and small Python harnesses themselves.

Key strengths.

  • Free, MIT, run wherever your CI runs.
  • Tight OpenAI API integration; model-graded checks use OpenAI judges by default with one config line.
  • YAML eval definitions are easy to review, diff, and store in Git alongside the prompt.
  • Strong floor for “we want to start scoring something on every PR.”

Honest limitations.

  • Not an agent framework. No trajectory type, no multi-agent rendering, no span ingest, no UI.
  • Strongly OpenAI-centric. Anthropic, Gemini, Bedrock, or open-weight judges are bridge work.
  • Multi-step agent eval support is community-contributed and thin.
  • No hosted dashboard, dataset versioning, prompt management, or human review queue.

Pricing intel. Free, MIT. You pay only the OpenAI judge tokens. Source: github.com/openai/evals.

Expert verdict. The right pick when the runtime is single-vendor OpenAI and the team wants the smallest possible eval surface. Production agent fleets need more than YAML. Treat OpenAI Evals as the floor for one-vendor labs, not the production tool.

τ-bench and BFCL: public benchmarks, not frameworks

These get confused for frameworks in vendor pages. They are not.

τ-bench evaluates multi-turn agents in airline and retail environments with an LLM-simulated user. The headline metric is pass^k across k independent rollouts. Even strong frontier models land below 25 percent at pass^8 on retail.

BFCL (Berkeley Function Calling Leaderboard) evaluates function calling across an AST track, an executable track, and an irrelevance bucket. A model that aces AST and tanks irrelevance overcalls on your registry.

Use them as model-selection signals. They tell you nothing about your tool registry, argument schemas, error codes, or business policy — the same gap that makes public benchmarks mislead on production behavior. The private eval set gates production.

Decision framework: choose X if…

  • LangSmith if your runtime is LangChain or LangGraph. Buying signal: every agent run is already a LangGraph execution.
  • Future AGI if your runtime is mixed and trajectory is the unit. Buying signal: your team runs three or four tools (traces, offline tests, gateway, guardrails) and still cannot reproduce production failures before release.
  • Braintrust if the UI is the product. Buying signal: PMs and prompt engineers need to read the same experiment results without an SDK.
  • DeepEval if CI is the system of record. Buying signal: you want assert metric.score > 0.7 in pytest, no separate dashboard.
  • Phoenix if open standards are the binding constraint. Buying signal: your platform team cares about OpenTelemetry and OpenInference, not vendor SDKs.
  • OpenAI Evals if the stack is single-vendor and small. Buying signal: a YAML in Git is the eval surface you want. Avoid for production agent fleets.

Common mistakes when picking an agent eval framework

  • Scoring final answer only. A correct-looking response can come from a 12-step trajectory that should have been 4 steps. Tool-call accuracy, argument correctness, and result utilization are first-class metrics.
  • Treating LLM eval and agent eval as the same problem. AnswerRelevancy on the final response misses tool-call mistakes. Tool-selection accuracy on individual steps misses goal completion. You need both layers.
  • Ignoring online scoring cost. Three judges per step on a 10-step trace fires 30 judge calls per request. Use a distilled small judge or sample by failure signal.
  • Mismatched framework and runtime. LangSmith on a non-LangChain runtime loses native semantics. DeepEval on a multi-agent supervisor pattern works but the trace UI is thin. OpenAI Evals on anything non-OpenAI is bridge work. Pick by where your runtime already lives.
  • No CI gating, or aggregate-only thresholds. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection. Wire per-dimension assertions from week one.
  • Conflating offline eval and online scoring. Offline catches regressions before release; online catches drift after. Different rubrics, different sample sizes, different cost budgets. Two workflows, not one.

How to actually evaluate this for production

Three steps before you commit to any of the six.

  1. Run a domain reproduction. Export a representative slice of real agent traces (failures, long-tail prompts, tool calls, retrieval misses, hand-labeled outcomes). Instrument each candidate with your OTel payload shape and your judge model. Do not accept a demo dataset.
  2. Measure trajectory eval cost. Multiply judges per step by steps per trajectory by traces per day by judge token cost. If the result is more than 10 percent of your LLM bill, switch to a distilled small judge or sample by failure signal.
  3. Test multi-agent rendering. Take a real run with branching and supervisor-worker dispatch. Verify the trace tree renders the actual graph, not a flat span list.

How Future AGI ships the trajectory-first stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined per-layer scoring; graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.

  • ai-evaluation SDK (Apache 2.0): 50+ EvalTemplate classes; deterministic function-call metrics; seven AgentTrajectoryInput metrics; fi run CLI with CI assertions.
  • Future AGI Platform: self-improving evaluators tuned by thumbs feedback; in-product authoring agent for custom rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • traceAI (Apache 2.0): 50+ AI surfaces across 4 languages; 14 span kinds with first-class TOOL, AGENT, RETRIEVER, GUARDRAIL; EvalTag attaches rubrics to spans at zero inference latency.
  • Error Feed: HDBSCAN clusters tool-call failures; a Sonnet 4.5 Judge agent writes a 5-category 30-subtype taxonomy and an immediate_fix.
  • Agent Command Center: 100+ providers; 18+ guardrail scanners; per-virtual-key AllowedTools / DeniedTools; SOC 2 Type II, HIPAA BAA, GDPR, CCPA per futureagi.com/trust.

Most teams comparing agent eval frameworks end up running three or four tools in production: trajectories, offline tests, gateway, guardrails. Future AGI fits when the loop closes without stitching: the rubric in CI is the rubric on live spans, the dataset that gates pre-prod is the one that grows from production failures, and the gateway that meters tokens enforces the tool registry the evaluator scores against.

Sources

Frequently asked questions

What makes an agent evaluation framework different from an LLM evaluation framework?
An LLM evaluation framework scores a single input-output pair. An agent evaluation framework scores a trajectory: the ordered sequence of plans, tool selections, tool arguments, tool returns, error recoveries, and final responses. The unit of evaluation is the trace, not the response. Most frameworks marketed as 'agent eval' are LLM eval libraries with a trajectory bolted on. The ones worth using treat the trajectory as a first-class object, ship metrics that ingest the full step list and the available tools as input, and render multi-agent dispatch as a tree rather than a flat span list. If you cannot score tool selection, argument extraction, result utilization, and error recovery independently, you are running LLM eval on a trace and calling it agent eval.
Which agent evaluation framework should I pick in 2026?
It depends on your runtime and what already lives in your stack. Pick LangSmith if every agent run is already a LangChain or LangGraph execution and you want trajectory semantics native to the framework. Pick Future AGI when you want trajectory-first scoring, span-attached evals, simulation, and gateway-level guardrails in one Apache 2.0 stack across mixed runtimes. Pick Braintrust if your team buys for the UI and you want a polished hosted experiment surface. Pick DeepEval when CI is your system of record and you want pytest-style assertions with the largest open metric library. Pick Phoenix for OpenTelemetry-first agent traces under open standards. Pick OpenAI Evals if the workload is mostly OpenAI and a YAML config is enough. Choose by where your runtime already lives; pick by what the trajectory looks like, not the marketing page.
Do I need a dedicated agent evaluation framework, or can I use a general LLM evaluation tool?
If your agents are linear chains with one model call and one tool, generic LLM eval is enough. If they branch, loop, call sub-agents, or carry stateful memory across turns, you need trajectory metrics, tool-call accuracy scorers, and session-level outcome rubrics. AnswerRelevancy on the final response misses the eight retries, three retrieval misses, and one wrong tool call that produced the right-looking answer. Verify that trajectory eval is first-class before committing: that the framework ingests the full step list, scores selection plus arguments plus result utilization plus recovery separately, and renders multi-agent dispatch as a tree. If it cannot do those three things, treat it as LLM eval, not agent eval.
What metrics actually matter for agent evaluation?
Six dimensions. Tool selection: did the agent pick the right tool, or correctly call none. Argument extraction: are the arguments schema-valid and semantically correct. Result utilization: did the agent use the tool payload or substitute model knowledge. Error recovery: did the agent retry, fall back, or escalate when the tool failed. Plan coherence: is the trajectory shape efficient, loop-free, and within reasonable depth. Task completion: did the trajectory deliver the user goal end-to-end. Score the six separately. Aggregate task-completion alone hides which dimension regressed. Most production frameworks ship 20 to 50 prebuilt scorers across these categories; verify they match your tool registry, your error codes, and your business policy before importing.
How do public benchmarks like tau-bench and BFCL relate to these frameworks?
Public benchmarks anchor the floor; your private eval set gates production. BFCL (Berkeley Function Calling Leaderboard) breaks tool-calling into AST correctness, executable correctness, and an irrelevance bucket on a public registry. tau-bench runs multi-turn agents in airline and retail with an LLM-simulated user and reports pass^k across k independent rollouts. Even strong frontier models land below 25 percent at pass^8 on retail. Use them as model-selection signals. They tell you nothing about your tool registry, argument schemas, error codes, or business policy, which are the layers that actually break in production. Build a private set stratified by tool, by argument-edge-case bucket, and by error code; promote failing production traces into it weekly.
How much does agent evaluation cost in production?
Two cost lines: judge model tokens and platform fees. A trajectory eval that fires three judges per agent step on a 10-step trace fires 30 judge calls per request. At five dollars per million input tokens for a small judge model, with average 2K-token traces, that is roughly thirty cents per trace. At 10K traces per day, that is three thousand dollars per day, or roughly ninety thousand dollars per month in judge tokens alone. Sample wisely. Small distilled judges (Future AGI's Turing family, Galileo's Luna) exist precisely to cut this cost, but the offline distillation step is not free engineering time. Cost-aware teams run the deterministic layers everywhere, sample failure-flagged traces for LLM-judge layers, and reserve the pass^k consistency slice for release candidates.
Can these frameworks handle multi-agent or supervisor-worker patterns?
Yes, with caveats. LangSmith, Future AGI, Braintrust, and Phoenix render multi-agent traces as trees with sub-agent dispatches, state diffs, and per-agent eval scores. DeepEval supports component-level eval that you compose across agents manually. OpenAI Evals is single-call YAML; multi-agent support is community-contributed and thin. Test the rendering with one of your real multi-agent runs before committing. A flat span list on a supervisor-worker run makes debugging painful no matter how good the metric library is. The tree topology is the feature.
Related Articles
View all