
What Is Multi-Mode Agent Evaluation?

The practice of scoring agents at end-to-end, trajectory, step, and turn levels so outcome, path, tool-call, and conversation failures stay visible.


Multi-mode agent evaluation is the practice of scoring an AI agent at four resolutions: end-to-end goal completion, trajectory-level path quality, step-level tool-call correctness, and turn-level conversation handling. The term belongs to agent evaluation, not single-turn LLM grading, because each score is tied to a production trace. FutureAGI uses the modes together so a passing final answer cannot hide bad tool calls, wasted steps, or a conversation turn that lost the user’s intent.

Why Multi-Mode Agent Evaluation Matters in Production Agent Systems

A single number cannot tell you why an agent fails. End-to-end evaluation alone is too coarse: a 70% TaskCompletion score does not separate “wrong tool selection at step 2” from “missing tool entirely” from “good steps but wrong final summary.” Step-level alone is too narrow: the agent can pick the right tool every time and still solve the wrong problem because the planner picked the wrong subgoal at step 0.

The pain is specific to agentic systems. A backend engineer chasing a regression after a model upgrade needs to know which step changed. An SRE looking at runaway cost needs to see which trajectory bloated. A product reviewer comparing two prompt variants needs trajectory-level diffs, not just final-answer diffs. A compliance auditor needs step-level safety scores so a single unsafe tool call cannot hide inside an aggregate.

In 2026-era stacks — OpenAI Agents SDK, LangGraph, CrewAI, AutoGen, Google ADK — traces are routinely 15–40 spans deep. Compared with LangSmith trace review or a single AgentBench score, multi-mode evaluation separates outcome, path, step, and turn failures. The teams shipping reliable agents run all four modes against the same trace and dashboard the deltas: end-to-end shifted, but did the trajectory shorten or lengthen? Did step-level scores drop on the planner step or the tool step? That is the conversation that produces fixes.

How FutureAGI Handles Multi-Mode Agent Evaluation

FutureAGI’s approach is to run all four modes off the same trace. The traceAI integrations for openai-agents, langgraph, crewai, autogen, and google-adk emit OpenTelemetry spans for every agent step, tagging each span with the agent.trajectory.step index and the active agent’s name. From that trace, FutureAGI’s evaluators slot into each resolution, as sketched below.
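
As a rough sketch of what the evaluators consume, spans carrying agent.trajectory.step can be reassembled into an ordered trajectory. The span layout below is illustrative, not the exact traceAI schema:

# A minimal sketch, assuming spans arrive as dicts tagged the way the
# traceAI integrations describe; every field here is illustrative.
spans = [
    {"trace_id": "t1", "attributes": {"agent.trajectory.step": 1, "agent.name": "executor"}},
    {"trace_id": "t1", "attributes": {"agent.trajectory.step": 0, "agent.name": "planner"}},
    {"trace_id": "t1", "attributes": {"agent.trajectory.step": 2, "agent.name": "executor"}},
]

# Sort by the canonical step attribute so every mode scores the same ordering.
trajectory = sorted(spans, key=lambda s: s["attributes"]["agent.trajectory.step"])
for span in trajectory:
    print(span["attributes"]["agent.trajectory.step"], span["attributes"]["agent.name"])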

  • End-to-end: TaskCompletion returns a 0–1 score plus a reason for whether the user’s original goal was reached; pair it with GoalProgress for partial credit.
  • Trajectory-level: TrajectoryScore aggregates step scores into a single trajectory rating; StepEfficiency measures wasted steps; ReasoningQuality grades the reasoning path against the observations.
  • Step-level: ToolSelectionAccuracy scores each tool call against the state at that step; FunctionCallAccuracy checks parameter correctness; ActionSafety flags unsafe actions.
  • Turn-level: ConversationCoherence and IsHelpful score each conversation turn for context retention and forward progress.

Concrete example: a coding-assistant agent on LangGraph shows TaskCompletion at 68% after a model swap. End-to-end alone says “regression.” Adding step-level ToolSelectionAccuracy reveals the regression is concentrated on one planning step: the new model picks read_file when it should pick grep. Trajectory-level StepEfficiency confirms the average trajectory grew by 4 steps. The fix — adding two few-shot examples for the planner — is targeted because the multi-mode evaluation localized it. Single-mode evaluation would have prompted a model rollback instead.
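
The delta analysis in this example reduces to comparing per-step scores between the two model versions. A minimal sketch, with hypothetical step names and scores standing in for real ToolSelectionAccuracy output:

# Hypothetical per-step ToolSelectionAccuracy scores for the same
# trajectory under the old and new model; real scores would come from
# running the evaluator over production traces.
baseline  = {"plan": 0.96, "grep": 0.94, "read_file": 0.91, "summarize": 0.93}
candidate = {"plan": 0.95, "grep": 0.61, "read_file": 0.90, "summarize": 0.92}

# The step with the largest negative delta is the regression site.
deltas = {step: candidate[step] - baseline[step] for step in baseline}
worst_step, worst_delta = min(deltas.items(), key=lambda kv: kv[1])
print(f"largest regression: {worst_step} ({worst_delta:+.2f})")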

How to Measure Multi-Mode Agent Evaluation

Run modes in layers and dashboard the deltas:

  • End-to-end: TaskCompletion, GoalProgress — headline outcome metrics.
  • Trajectory-level: TrajectoryScore, StepEfficiency, ReasoningQuality — path quality and reasoning.
  • Step-level: ToolSelectionAccuracy, FunctionCallAccuracy, ActionSafety — per-tool-call correctness.
  • Turn-level: ConversationCoherence, IsHelpful, AnswerRelevancy — per-conversation-turn quality.
  • Mode-delta dashboard (FutureAGI dashboard): shows the per-mode score change after each prompt or model change; the mode that shifts the most points to the regression site.
  • agent.trajectory.step (OTel attribute): the canonical span attribute that lets every mode roll up to the same trace.

A minimal layered run with the evaluators above, assuming goal holds the user’s original request and trace_spans the ordered spans of one trajectory:

from fi.evals import TaskCompletion, ToolSelectionAccuracy, StepEfficiency

# Score the same trace at three resolutions so the deltas are comparable.
t = TaskCompletion().evaluate(input=goal, trajectory=trace_spans)         # end-to-end
s = ToolSelectionAccuracy().evaluate(input=goal, trajectory=trace_spans)  # step-level
e = StepEfficiency().evaluate(input=goal, trajectory=trace_spans)         # trajectory-level
print(t.score, s.score, e.score)

Common Mistakes

  • Running only end-to-end evaluation. A single score cannot localize failures; you cannot debug what you cannot see.
  • Treating modes as alternatives. They are layers, not options — production agents need all four resolutions over time.
  • Using turn-level metrics on single-shot agents. ConversationCoherence on a one-call task reports noise; pick modes that match the interaction surface.
  • Comparing modes across different traces. Always run the four modes on the same trajectory so the dashboard deltas mean something.
  • Ignoring per-agent slicing. In multi-agent flows, a global step-score average hides which sub-agent regressed; slice by agent name, as in the sketch after this list.
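
A minimal sketch of that per-agent slicing, assuming step-level scores are tagged with the active agent’s name (the scores here are hypothetical):

from collections import defaultdict
from statistics import mean

# Hypothetical (agent, step-score) pairs from one multi-agent trajectory.
step_scores = [
    ("planner", 0.95), ("planner", 0.58), ("executor", 0.97),
    ("executor", 0.96), ("critic", 0.94),
]

# Group by sub-agent so a single regressed agent cannot hide in the global mean.
by_agent = defaultdict(list)
for agent, score in step_scores:
    by_agent[agent].append(score)

for agent, scores in sorted(by_agent.items()):
    print(f"{agent}: mean={mean(scores):.2f} over {len(scores)} steps")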

Frequently Asked Questions

What is multi-mode agent evaluation?

Multi-mode agent evaluation scores an agent at four levels: end-to-end goal completion, trajectory-level path quality, step-level tool-call accuracy, and turn-level conversation quality.

How is multi-mode agent evaluation different from regular LLM evaluation?

Regular LLM evaluation often grades one input-output pair. Multi-mode agent evaluation grades the full trajectory of LLM calls, tool calls, and handoffs so failures can be localized.

Which mode should I run first?

Start with end-to-end TaskCompletion to know if the agent works at all. Then add step-level ToolSelectionAccuracy and trajectory-level StepEfficiency to localize where the failures cluster.