LLM Agent Evaluation: The Complete Guide (2026)
An agent eval guide built around the closed loop: offline eval, CI gate, production trace eval, Error Feed, and optimization. Treat eval as a static endpoint and your agent drifts; treat it as a loop and it compounds.
Most agent eval guides give you a list of metrics and a bag of templates and stop there. The honest one tells you what to do when the metrics move. An agent eval guide is incomplete without the closed-loop pattern: offline eval feeds the CI gate, the CI gate feeds production trace eval, trace eval feeds Error Feed, Error Feed feeds optimization, optimization ships back through CI. Treat the eval as the static endpoint and your agent drifts. Treat it as a loop and the agent compounds. This guide is the loop, end to end.
TL;DR: the loop is the differentiator
The dimensions tell you what to measure. The loop tells you what to do when the measurement moves.
- Offline eval scores a versioned trajectory dataset on the six dimensions with code-defined rubrics.
- The CI gate asserts per-dimension thresholds on every PR and exits non-zero on the failing axis, not the aggregate.
- Production trace eval attaches the same rubric as a score on live OpenTelemetry spans, server-side, post-export, with no inline latency.
- Error Feed clusters failing traces, writes a 4-D trace score and an immediate_fix, and promotes representatives into the dataset.
- Optimization runs agent-opt against the expanded dataset with the same rubric the CI gate uses. Winners ship through CI.
Five stages, one vocabulary. Pull any stage and the loop opens; the agent drifts within weeks.
The trace is the unit, the loop is the system
An LLM eval is a function from (input, output) to a score. An agent eval is a function from trajectory to a score, where the trajectory is the ordered sequence of system prompt, user input, agent reasoning, tool calls (name plus arguments plus return value), retrieval results, intermediate LLM calls, final response, and outcome metadata. The trace is the trajectory. That part is settled in the definitive guide and we are not relitigating it here.
What that guide leaves on the table: scoring the trajectory once does not ship a reliable agent. Every release ages your eval set the day it lands. New intents, new prompt revisions, new tool schemas, new retrieval indexes, new traffic shapes. A static offline pass is a necessary condition; it is never sufficient. The architecture that closes the gap runs the same rubric in CI and on live spans, clusters the live failures, ratchets the offline set off what production already broke, and feeds the optimizer the expanded set. The dimensions are the diagnostic. The loop is the system.
The six trajectory dimensions (recap)
The six dimensions of trajectory eval are tool selection, argument extraction, result utilization, error recovery, plan coherence, and task completion. The walkthrough is in the definitive guide; the SDK mapping is below.
| Dimension | Rubric in ai-evaluation |
|---|---|
| Tool selection | LLMFunctionCalling (cloud) + function_name_match (local) |
| Argument extraction | parameter_validation (schema, local) + CustomLLMJudge (semantic) |
| Result utilization | Groundedness with tool payload as context |
| Error recovery | ActionSafety + CustomLLMJudge over the trajectory |
| Plan coherence | StepEfficiency, TrajectoryScore, GoalProgress |
| Task completion | TaskCompletion (eval_id 99) on AgentTrajectoryInput |
The 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, scored 1 to 5 each by the same judge) is the composite the Error Feed Judge writes on every failing trace. The same vocabulary travels from offline rubric to live cluster to dataset entry to optimization objective. That is the property the loop relies on.
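To make the four axes concrete, here is a minimal sketch of a record holding them. The class name `TraceScore` and its methods are hypothetical, and the composite is assumed to be an unweighted mean, since this guide does not state how the Judge combines the axes.

```python
from dataclasses import astuple, dataclass
from statistics import mean


@dataclass(frozen=True)
class TraceScore:
    """Hypothetical container for the four axes, each scored 1 to 5."""

    factual_grounding: int
    privacy_and_safety: int
    instruction_adherence: int
    optimal_plan_execution: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-to-5 scale described in the text.
        for name, value in self.__dict__.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    def composite(self) -> float:
        # ASSUMPTION: unweighted mean. The real composite formula is not
        # given in this guide.
        return float(mean(astuple(self)))

    def weakest_axis(self) -> str:
        # The lowest-scoring axis is the first place to look when diagnosing.
        return min(self.__dict__, key=self.__dict__.get)


score = TraceScore(
    factual_grounding=4,
    privacy_and_safety=5,
    instruction_adherence=2,
    optimal_plan_execution=3,
)
```

The composite ranks traces against each other; the weakest axis says which kind of fix to reach for.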
Stage 1: offline eval on a versioned trajectory dataset
The dataset is a JSONL of trajectories, not (input, output) pairs. Each row carries the user goal, the system prompt version, the expected tool sequence (when one exists), the realised tool calls with arguments and returns, the intermediate reasoning, the final response, and the outcome label. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly.
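One possible layout for a trajectory row, sketched with standard-library dataclasses. The field names, the `edge_case_bucket` label, and the `stratum` helper are assumptions made for illustration; they are not the SDK's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    returned: Any = None


@dataclass
class TrajectoryRow:
    """Hypothetical row shape; one row becomes one JSONL line."""

    user_goal: str
    system_prompt_version: str
    tool_calls: list                 # realised calls, in execution order
    final_response: str
    outcome: str                     # e.g. "resolved" or "failed"
    expected_tool_sequence: Optional[list] = None  # None when no canonical path exists
    reasoning: list = field(default_factory=list)
    edge_case_bucket: str = "none"
    error_code: Optional[str] = None

    def to_jsonl(self) -> str:
        # asdict() recurses into the nested ToolCall dataclasses.
        return json.dumps(asdict(self))

    def stratum(self) -> tuple:
        # Key used to stratify by tool, argument edge case, and error code.
        first_tool = self.tool_calls[0].name if self.tool_calls else None
        return (first_tool, self.edge_case_bucket, self.error_code)


row = TrajectoryRow(
    user_goal="Where is my order?",
    system_prompt_version="v3",
    tool_calls=[ToolCall("lookup_order", {"order_id": None}, {"error": "missing id"})],
    final_response="Could you share your order number?",
    outcome="resolved",
    expected_tool_sequence=["lookup_order"],
    edge_case_bucket="missing_order_id",
)
line = row.to_jsonl()
```

Counting rows per `stratum()` value before each release is a cheap way to see which tool or edge case the set under-represents.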
from fi.evals import Evaluator, EvaluateFunctionCalling, TaskCompletion, AnswerRefusal

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# user_request, agent_final_response, expected_tools, and actual_tools
# are placeholders for the fields of one trajectory row.
results = evaluator.evaluate(
    eval_templates=[
        EvaluateFunctionCalling(),  # eval_id 98: tool selection + args
        TaskCompletion(),           # eval_id 99: end-to-end goal
        AnswerRefusal(),            # eval_id 88: calibrated refusal
    ],
    inputs=[
        {
            "input": user_request,
            "output": agent_final_response,
            "expected_tool_calls": expected_tools,
            "tool_calls": actual_tools,
        },
    ],
    model_name="turing_flash",
)
turing_flash is Future AGI’s classifier-backed judge served from app.futureagi.com; the cost economics let you re-score the full agent regression set on every PR instead of begging for budget. The local agent-trajectory metrics (tool_selection_accuracy, task_completion, step_efficiency, trajectory_score, goal_progress, action_safety, reasoning_quality) are sub-millisecond and free; flip augment=True to cascade local-then-judge and halve your judge spend at scale.
For axes without a built-in template (memory consistency, plan-coherence-for-your-specific-graph, domain refusal calibration) use CustomLLMJudge and pin it with few_shot_examples so the next reviewer correction lands as in-context examples. The full pattern is in our tool-calling agent walkthrough.
Stage 2: the CI gate with per-dimension assertions
Per-dimension, always. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. The fi CLI runs a YAML fixture with one assertion per dimension and exits non-zero on the failing axis:
# fi-eval.yaml
project: support-agent
dataset: regression/agent-trajectories.jsonl
evaluations:
  - name: tool_selection
    template: EvaluateFunctionCalling
    model: turing_flash
  - name: argument_validation
    metric: parameter_validation
    engine: local
  - name: result_grounding
    template: Groundedness
    model: turing_flash
  - name: recovery
    metric: action_safety
    engine: local
  - name: plan_coherence
    metric: trajectory_score
    engine: local
  - name: task_completion
    template: TaskCompletion
    model: turing_flash
assertions:
  - "tool_selection.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "result_grounding.score >= 0.90 for at_least 90% of cases"
  - "recovery.score >= 0.80 for at_least 85% of cases"
  - "plan_coherence.score >= 0.85 for at_least 90% of cases"
  - "task_completion.score >= 0.85 for at_least 90% of cases"
When the gate fails, the failing assertion name is the root cause. One bisect instead of three days of dashboard archaeology. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle suites where six rubrics across a few hundred cases outgrow a single-process budget. The GitHub Actions recipe sits in the SDK’s python/examples/ci-cd/ directory; the broader CI/CD pattern is in CI/CD for AI agents.
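The logic behind a per-dimension gate fits in a few lines of plain Python. This is a sketch of the idea, not the fi CLI's implementation; `pass_rate`, `failing_dimensions`, and the sample scores are all made up for illustration.

```python
from statistics import mean


def pass_rate(scores: list, min_score: float) -> float:
    """Fraction of cases whose score meets the per-case threshold."""
    return sum(s >= min_score for s in scores) / len(scores)


def failing_dimensions(scores_by_dim: dict, rules: dict) -> list:
    """rules maps dimension -> (min_score, min_fraction_of_cases)."""
    return [
        dim
        for dim, (min_score, min_fraction) in rules.items()
        if pass_rate(scores_by_dim[dim], min_score) < min_fraction
    ]


# Illustrative scores: tool selection is healthy, argument extraction is not.
scores_by_dim = {
    "tool_selection": [0.97] * 10,
    "argument_extraction": [0.95] * 4 + [0.40] * 6,
}
rules = {
    "tool_selection": (0.95, 0.95),       # score >= 0.95 for at least 95% of cases
    "argument_extraction": (0.90, 0.90),  # score >= 0.90 for at least 90% of cases
}

# The blended mean looks milder than the weak axis really is.
aggregate = mean(s for scores in scores_by_dim.values() for s in scores)
failures = failing_dimensions(scores_by_dim, rules)
exit_code = 1 if failures else 0  # a CI wrapper would pass this to sys.exit()
```

Here the blended mean sits near 0.80, while argument extraction alone averages about 0.62 and passes the per-case threshold on only 4 of 10 cases. The gate names that axis instead of reporting a single softened number.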
Stage 3: production trace eval with EvalTag
The CI gate is the floor. The span-attached score is the river. The eval set was frozen the day it shipped; production has been moving since the next commit. Score live traces with the same rubric and you get a regression signal the offline set cannot have.
traceAI (Apache 2.0) ships 14 span kinds — TOOL, CHAIN, LLM, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN — across 50+ AI surfaces in Python, TypeScript, Java, and C#. The A2A_CLIENT and A2A_SERVER kinds propagate gen_ai.a2a.propagated_trace_id so a distributed multi-agent call shows up as one trace. The langchain instrumentor’s _langgraph submodule emits langgraph.graph.node_count, langgraph.node.name/type, langgraph.node.is_entry/is_end, conditional edges, and per-node state diffs so plan-coherence rubrics can score graph topology directly. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean spans flow into whatever OTel collector you already run.
EvalTag wires rubric to span at zero added inference latency. Declare it once at register(); the collector runs the eval server-side, post-export, and writes gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation back as span attributes:
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices,
)

trace_provider = register(
    project_name="support-agent",
    project_type=ProjectType.OBSERVE,
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.TOOL,
            eval_name=EvalName.EVALUATE_FUNCTION_CALLING,
            model=ModelChoices.TURING_FLASH,
            mapping={
                "input": "input.value",
                "output": "output.value",
                "tool_calls": "gen_ai.tool.call.arguments",
            },
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.AGENT,
            eval_name=EvalName.TASK_COMPLETION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)
The trace tree and the score live on the same span. The failing eval and the failing call show up in one place. No engineer cross-references two dashboards under pressure. The deep dive on the trace plumbing is in traceAI plus OpenTelemetry for LLM tracing.
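Conceptually, the mapping in each EvalTag pairs a rubric input name with a span attribute key. The sketch below shows that lookup in isolation, assuming span attributes arrive as a flat dictionary; `resolve_mapping` is a hypothetical helper and does not describe how the collector actually resolves mappings.

```python
def resolve_mapping(mapping: dict, span_attributes: dict) -> tuple:
    """Return (rubric_inputs, missing_attribute_keys) for one span."""
    resolved, missing = {}, []
    for rubric_field, attribute_key in mapping.items():
        if attribute_key in span_attributes:
            resolved[rubric_field] = span_attributes[attribute_key]
        else:
            # Surface the gap instead of silently scoring a partial input.
            missing.append(attribute_key)
    return resolved, missing


mapping = {
    "input": "input.value",
    "output": "output.value",
    "tool_calls": "gen_ai.tool.call.arguments",
}
# Illustrative span that never recorded its tool-call arguments.
span_attributes = {
    "input.value": "Cancel order 1182",
    "output.value": "Order 1182 is cancelled.",
}
rubric_inputs, missing = resolve_mapping(mapping, span_attributes)
```

A non-empty `missing` list usually points at an instrumentation gap rather than an agent failure, so it is worth tracking separately from low scores.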
Stage 4: Error Feed clusters, scores, and writes the fix
A production agent at modest scale produces hundreds of failing traces a day. Treating each as its own alert is operational malpractice. Stack-trace clusterers group by exception fingerprint; agents fail without exceptions. Tool misuse, ungrounded summary, step disorder, goal drift, redundant steps, hallucinated state — Sentry-style clustering cannot see any of these.
Error Feed sits inside the eval stack. Failing traces flow into ClickHouse with span embeddings. HDBSCAN soft-clustering groups them into named issues at prob >= 0.4 so noise points stay recoverable. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 (Bedrock) for a 30-turn investigation across 8 span-tools — read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, submit_summary, plus a Claude Haiku Chauffeur summariser for spans over 3000 characters. Prompt cache hit ratio sits around 90 percent.
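The effect of the prob >= 0.4 cut-off can be sketched without a clustering library. The membership probabilities below are hard-coded stand-ins for what a soft-clustering pass would produce, and `Membership` and `group_by_threshold` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple

MIN_PROB = 0.4  # threshold quoted in the text


class Membership(NamedTuple):
    trace_id: str
    cluster_id: int      # best-matching cluster for this trace
    probability: float   # soft membership strength in [0, 1]


def group_by_threshold(memberships: list, min_prob: float = MIN_PROB) -> tuple:
    """Split traces into confident cluster members and recoverable noise."""
    clusters = defaultdict(list)
    noise = []
    for m in memberships:
        if m.probability >= min_prob:
            clusters[m.cluster_id].append(m.trace_id)
        else:
            # Keep the candidate cluster so the trace can be re-examined
            # once more failures of the same shape arrive.
            noise.append(m)
    return dict(clusters), noise


memberships = [
    Membership("t1", 0, 0.91),
    Membership("t2", 0, 0.44),
    Membership("t3", 1, 0.72),
    Membership("t4", 1, 0.18),
]
clusters, noise = group_by_threshold(memberships)
```

Because the low-probability trace keeps its candidate cluster, it stays attached to a likely issue instead of being dropped as an outlier.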
Per cluster the Judge emits three artifacts:
- A 5-category, 30-subtype taxonomy classification with dedicated agent-failure nodes (Hallucinated Content, Ungrounded Summary, Tool Misuse, Wrong Tool, Invalid Params, Goal Drift, Step Disorder, Redundant Steps, Missing CoT, Missing ReAct Planning, Lack of Self-Correction).
- The 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each). Composite is the cluster score; axes are the diagnosis.
- An immediate_fix string naming the change to ship today: rubric edit, prompt patch, tool-call guard, retrieval-filter tweak.
Cluster IDs are stable across re-clusterings so the dashboard links the same issue across runs. Linear ships today via OAuth one-click; Slack, GitHub, Jira, and PagerDuty are on the development surface. The fix feeds the Platform’s self-improving evaluators (few-shot injection via FeedbackStore, threshold calibration via ThresholdCalibrator) and the cluster becomes a candidate dataset entry. The full failure-mode taxonomy is in AI agent failure modes 2026.
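The promote step that turns a cluster representative into a dataset entry can be as small as an idempotent append to the JSONL file. This is a sketch under assumptions: `promote` is a hypothetical helper, and rows are assumed to carry a unique `trace_id` field.

```python
import json
from pathlib import Path


def promote(representatives: list, dataset_path: str, id_key: str = "trace_id") -> int:
    """Append rows not already in the dataset; return how many were added."""
    path = Path(dataset_path)
    seen = set()
    if path.exists():
        with path.open() as fh:
            seen = {json.loads(line)[id_key] for line in fh if line.strip()}

    added = 0
    with path.open("a") as fh:
        for row in representatives:
            if row[id_key] in seen:
                continue  # already promoted in an earlier run
            fh.write(json.dumps(row) + "\n")
            seen.add(row[id_key])
            added += 1
    return added
```

Deduplicating on the trace ID makes the weekly run safe to repeat, so a stable cluster that resurfaces does not inflate its own weight in the regression set.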
Stage 5: optimization on the expanded dataset
Failing trajectories surface a prompt regression more often than a model regression. agent-opt (Apache 2.0, pip install agent-opt) ships six optimizers — RandomSearchOptimizer, BayesianSearchOptimizer (Optuna TPE, teacher-inferred few-shot, resumable studies), MetaPromptOptimizer, ProTeGi (textual-gradient beam search), GEPAOptimizer, PromptWizardOptimizer — and a shared EarlyStoppingConfig. You point one at the regression set plus a metric; it iterates and returns the best prompt.
from fi.opt.optimizers import BayesianSearchOptimizer
from fi.opt.base import Evaluator
from fi.opt.generators import LiteLLMGenerator

optimizer = BayesianSearchOptimizer(
    min_examples=2,
    max_examples=8,
    n_trials=20,
    inference_model_name="gpt-4o-mini",
    teacher_model_name="gpt-5",
    storage="sqlite:///optuna.db",          # persisted so the study can resume
    study_name="support-agent-prompt-v3",
)

# mapper, regression_set, and current_system_prompt are placeholders
# defined elsewhere in your project.
result = optimizer.optimize(
    evaluator=Evaluator(eval_template="task_completion", eval_model_name="turing_flash"),
    data_mapper=mapper,
    dataset=regression_set,
    initial_prompts=[current_system_prompt],
)
The winning candidate ships through the same CI gate that scored the originals. Same rubric, same vocabulary, same thresholds. The agent-opt webinar walks the live workflow; automated optimization for agents covers the longer-form pattern.
Honest framing. The signal source today is the offline dataset; the direct trace-stream-to-agent-opt connector that closes the loop end-to-end without the dataset round-trip is on the active roadmap, not shipped. Teams running the loop weekly through the promote step are doing the right thing in the meantime. Pretending the direct connector ships when it does not is the kind of vendor claim that costs trust the first time an engineer reads the code.
The eval stack ships as one package
The five stages are not five products you buy. Future AGI ships the eval stack as one package and the loop closes inside it.
ai-evaluation (Apache 2.0). 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, and 11 CustomerAgent* templates. Seven AgentTrajectoryInput metrics. Deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) at sub-millisecond cost. 13 guardrail backends (9 open-weight). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge. fi CLI with native CI assertions.
traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds with the agent-aware A2A_CLIENT, A2A_SERVER, GUARDRAIL, EVALUATOR, VECTOR_DB (Phoenix ships 8, Langfuse 5). LangGraphInstrumentor surfaces graph topology. Pluggable semantic conventions at register(). EvalTag wires rubric to span at zero added inference latency.
Future AGI Platform. Self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer.
agent-opt. Six optimizers consuming Evaluator scores as the objective. Shared EarlyStoppingConfig. Eval-driven optimization ships today; direct trace-stream ingestion is on the active roadmap.
The Future AGI Agent Command Center is the hosted runtime that ties this together for the gateway side: 6 native provider adapters plus 13 OpenAI-compatible presets, 6 routing strategies, 6 exact and 4 semantic cache backends, 5-level hierarchical budgets, native Anthropic /v1/messages and Gemini /v1beta. It includes RBAC and is SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
Three deliberate tradeoffs
- Closing the loop costs operational surface. Span-attached scores plus auto-clustering plus a promote workflow is more parts than pytest evals/. Payoff: a regression suite that ratchets stronger every week. New deployments can ship with traceAI plus ai-evaluation alone, build a few weeks of trace history, and turn on Error Feed plus agent-opt when the baseline stabilises.
- Self-improving evaluators need their own calibration. Rubrics that pull few-shots from a feedback store can drift in unintended directions. Pin a human-labelled hold-out, alarm when the judge disagrees with it by more than the inter-rater baseline, re-audit quarterly.
- The end-to-end auto-loop is partial. Eval-driven optimization ships today; the direct trace-stream-to-agent-opt connector is roadmap. The promote-back step is manual on purpose because the cluster-to-dataset judgement is still load-bearing. Honesty here is cheaper than the cleanup.
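The calibration alarm from the second tradeoff reduces to comparing two disagreement rates. A minimal sketch, assuming categorical labels and a known inter-rater disagreement figure; the function names and sample labels are illustrative.

```python
def disagreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of hold-out cases where the judge and the human label differ."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    mismatches = sum(j != h for j, h in zip(judge_labels, human_labels))
    return mismatches / len(human_labels)


def judge_has_drifted(
    judge_labels: list, human_labels: list, inter_rater_disagreement: float
) -> bool:
    # Alarm only when the judge is worse than two humans are with each other.
    return disagreement_rate(judge_labels, human_labels) > inter_rater_disagreement


# Illustrative hold-out: four cases, humans disagree with each other 25% of the time.
human = ["pass", "fail", "fail", "fail"]
judge = ["pass", "fail", "pass", "pass"]
drifted = judge_has_drifted(judge, human, inter_rater_disagreement=0.25)
```

Using the inter-rater figure as the ceiling keeps the alarm from firing on cases where the human labels themselves are contested.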
Common closed-loop mistakes
- Eval set frozen for a quarter. The set ages off the product within weeks. Promote failing traces into the offline set weekly.
- One aggregate CI threshold. Hides which dimension regressed. Per-dimension assertions or the gate is theatre.
- Eval on a dashboard, trace in OTel, no join. The failing eval has no replay. Attach scores to spans via EvalTag.
- Stack-trace clustering on agent traces. Misses tool misuse, ungrounded summaries, step disorder. Cluster on span embeddings, not exception fingerprints.
- Optimizer pointed at a stale set. agent-opt against the same dataset Error Feed has not touched for six weeks is shadowboxing. Run the promote step first.
- Per-turn evals on conversational agents. Conversation-level rubrics (ConversationCoherence, ConversationResolution) catch failure modes per-turn evals miss.
Why pick Future AGI for the closed loop
- Same rubric in CI and on live spans. ai-evaluation runs in pytest; EvalTag runs the same template on OTel spans server-side at zero added latency.
- Agent-aware tracing other vendors do not match. 14 span kinds, A2A trace propagation, LangGraph topology, 50+ surfaces in Python, TypeScript, Java, C#.
- Error Feed sees what stack-trace clusterers cannot. Tool Misuse, Goal Drift, Lack of Self-Correction, Hallucinated Content as named clusters with immediate_fix strings the engineer ships today.
- The loop closes inside one package. Eval, trace, cluster, optimize ship as ai-evaluation plus traceAI plus Platform plus agent-opt, not as four vendors stapled together.
The trace-stream-to-agent-opt connector is the last roadmap mile; the loop runs weekly through the promote step in the meantime, and the same rubric vocabulary travels all the way through.
Related reading
- The Definitive Guide to AI Agent Evaluation (2026)
- Your AI Agent Passes Evals But Still Fails in Production
- Evaluating Tool-Calling Agents (2026)
- The 2026 LLM Evaluation Playbook
- CI/CD for AI Agents: Best Practices (2026)
- Automated Optimization for Agents (2026)
- AI Agent Failure Modes (2026)
- Trace and Debug Multi-Agent Systems
- traceAI + OpenTelemetry for LLM Tracing