LLM Agent Evaluation: The Complete Guide (2026)

An agent eval guide built around the closed loop: offline eval, CI gate, production trace eval, Error Feed, and optimization. Treat eval as a static endpoint and your agent drifts; treat it as a loop and it compounds.

Most agent eval guides give you a list of metrics and a bag of templates and stop there. The honest one tells you what to do when the metrics move. An agent eval guide is incomplete without the closed-loop pattern: offline eval feeds the CI gate, the CI gate feeds production trace eval, trace eval feeds Error Feed, Error Feed feeds optimization, optimization ships back through CI. Treat the eval as the static endpoint and your agent drifts. Treat it as a loop and the agent compounds. This guide is the loop, end to end.

TL;DR: the loop is the differentiator

The dimensions tell you what to measure. The loop tells you what to do when the measurement moves.

  • Offline eval scores a versioned trajectory dataset on the six dimensions with code-defined rubrics.
  • The CI gate asserts per-dimension thresholds on every PR and exits non-zero on the failing axis, not the aggregate.
  • Production trace eval attaches the same rubric as a score on live OpenTelemetry spans, server-side, post-export, with no inline latency.
  • Error Feed clusters failing traces, writes a 4-D trace score and an immediate_fix, and promotes representatives into the dataset.
  • Optimization runs agent-opt against the expanded dataset with the same rubric the CI gate uses. Winners ship through CI.

Five stages, one vocabulary. Pull any stage and the loop opens; the agent drifts within weeks.

The trace is the unit, the loop is the system

An LLM eval is a function from (input, output) to a score. An agent eval is a function from trajectory to a score, where the trajectory is the ordered sequence of system prompt, user input, agent reasoning, tool calls (name plus arguments plus return value), retrieval results, intermediate LLM calls, final response, and outcome metadata. The trace is the trajectory. That part is settled in the definitive guide and we are not relitigating it here.

What that guide leaves on the table: scoring the trajectory once does not ship a reliable agent. Every release ages your eval set the day it lands. New intents, new prompt revisions, new tool schemas, new retrieval indexes, new traffic shapes. A static offline pass is a necessary condition; it is never sufficient. The architecture that closes the gap runs the same rubric in CI and on live spans, clusters the live failures, ratchets the offline set off what production already broke, and feeds the optimizer the expanded set. The dimensions are the diagnostic. The loop is the system.

The six trajectory dimensions (recap)

The six dimensions of trajectory eval are tool selection, argument extraction, result utilization, error recovery, plan coherence, and task completion. The walkthrough is in the definitive guide; the SDK mapping is below.

Dimension | Rubric in ai-evaluation
Tool selection | LLMFunctionCalling (cloud) + function_name_match (local)
Argument extraction | parameter_validation (schema, local) + CustomLLMJudge (semantic)
Result utilization | Groundedness with tool payload as context
Error recovery | ActionSafety + CustomLLMJudge over the trajectory
Plan coherence | StepEfficiency, TrajectoryScore, GoalProgress
Task completion | TaskCompletion (eval_id 99) on AgentTrajectoryInput

The 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each scored 1 to 5 by the same judge) is the composite the Error Feed Judge writes on every failing trace. It is distinct from the TrajectoryScore metric in the table above. The same vocabulary travels from offline rubric to live cluster to dataset entry to optimization objective. That is the property the loop relies on.

Stage 1: offline eval on a versioned trajectory dataset

The dataset is a JSONL of trajectories, not (input, output) pairs. Each row carries the user goal, the system prompt version, the expected tool sequence (when one exists), the realised tool calls with arguments and returns, the intermediate reasoning, the final response, and the outcome label. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly.
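To make the row shape concrete, here is a minimal sketch of one way to model a trajectory row, serialise it as JSONL, and count rows per stratum. The class names, field names, and the `stratum` key are illustrative assumptions, not the schema the ai-evaluation SDK expects.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    returned: Any = None
    error_code: Optional[str] = None  # None when the call succeeded


@dataclass
class TrajectoryRow:
    user_goal: str
    system_prompt_version: str
    tool_calls: List[ToolCall]
    final_response: str
    outcome: str  # e.g. "pass" or "fail"
    expected_tools: Optional[List[str]] = None  # absent when no fixed sequence exists
    reasoning: List[str] = field(default_factory=list)
    edge_case_bucket: str = "none"


def to_jsonl(rows: List[TrajectoryRow]) -> str:
    """Serialise one trajectory per line."""
    return "\n".join(json.dumps(asdict(row), sort_keys=True) for row in rows)


def stratum(row: TrajectoryRow) -> tuple:
    """Hypothetical stratification key: (first tool, edge-case bucket, first error code)."""
    first_tool = row.tool_calls[0].name if row.tool_calls else "no_tool"
    errors = [call.error_code for call in row.tool_calls if call.error_code]
    return (first_tool, row.edge_case_bucket, errors[0] if errors else "ok")


def strata_counts(rows: List[TrajectoryRow]) -> Counter:
    """Rows per stratum; thin strata are candidates for the next promotion."""
    return Counter(stratum(row) for row in rows)


rows = [
    TrajectoryRow(
        user_goal="Refund order 1182",
        system_prompt_version="v3",
        tool_calls=[ToolCall("lookup_order", {"order_id": "1182"}, {"status": "shipped"})],
        final_response="Refund issued.",
        outcome="pass",
        expected_tools=["lookup_order"],
    ),
    TrajectoryRow(
        user_goal="Refund an order with no id",
        system_prompt_version="v3",
        tool_calls=[ToolCall("lookup_order", {"order_id": ""}, None, "INVALID_ARGUMENT")],
        final_response="I could not find that order.",
        outcome="fail",
        edge_case_bucket="empty_string",
    ),
]
```

Counting by stratum before each weekly promotion shows which tool, edge-case, or error-code combinations the set still under-represents.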

from fi.evals import Evaluator, EvaluateFunctionCalling, TaskCompletion, AnswerRefusal

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

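# user_request, agent_final_response, expected_tools, and actual_tools are
# placeholders for the fields of one trajectory row from the dataset.
results = evaluator.evaluate(
    eval_templates=[
        EvaluateFunctionCalling(),  # eval_id 98: tool selection + args
        TaskCompletion(),           # eval_id 99: end-to-end goal
        AnswerRefusal(),            # eval_id 88: calibrated refusal
    ],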
    inputs=[
        {
            "input": user_request,
            "output": agent_final_response,
            "expected_tool_calls": expected_tools,
            "tool_calls": actual_tools,
        },
    ],
    model_name="turing_flash",
)

turing_flash is Future AGI's classifier-backed judge served from app.futureagi.com; its cost economics let you re-score the full agent regression set on every PR instead of begging for budget. The local agent-trajectory metrics (tool_selection_accuracy, task_completion, step_efficiency, trajectory_score, goal_progress, action_safety, reasoning_quality) are sub-millisecond and free; set augment=True to cascade local-then-judge and halve your judge spend at scale.
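The local-then-judge cascade can be pictured with a small sketch. How augment=True decides internally is not described in this guide, so the uncertainty band, the function names, and the stand-in judge below are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def cascade_score(
    case: dict,
    local_metric: Callable[[dict], float],
    judge: Callable[[dict], float],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[float, str]:
    """Return (score, source). The judge runs only inside the uncertain band."""
    local = local_metric(case)
    if local <= low or local >= high:
        return local, "local"  # confident pass or confident fail: no judge call
    return judge(case), "judge"


def tool_name_overlap(case: dict) -> float:
    """Cheap deterministic metric: fraction of expected tools that were called."""
    expected = case["expected_tools"]
    if not expected:
        return 1.0
    called = set(case["called_tools"])
    return sum(1 for tool in expected if tool in called) / len(expected)


judge_calls: List[str] = []


def fake_judge(case: dict) -> float:
    """Stand-in for an LLM judge; it only records that it was invoked."""
    judge_calls.append(case["id"])
    return 0.5


cases = [
    {"id": "a", "expected_tools": ["search", "refund"], "called_tools": ["search", "refund"]},
    {"id": "b", "expected_tools": ["search", "refund"], "called_tools": ["search"]},
    {"id": "c", "expected_tools": ["search", "refund"], "called_tools": []},
]
results = [cascade_score(case, tool_name_overlap, fake_judge) for case in cases]
```

Only case "b" lands in the uncertain band, so the judge is called once for three cases. The saving in practice depends on how many cases the local metric can settle on its own.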

For axes without a built-in template (memory consistency, plan-coherence-for-your-specific-graph, domain refusal calibration) use CustomLLMJudge and pin it with few_shot_examples so the next reviewer correction lands as in-context examples. The full pattern is in our tool-calling agent walkthrough.

Stage 2: the CI gate with per-dimension assertions

Per-dimension, always. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. The fi CLI runs a YAML fixture with one assertion per dimension and exits non-zero on the failing axis:

# fi-eval.yaml
project: support-agent
dataset: regression/agent-trajectories.jsonl
evaluations:
  - name: tool_selection
    template: EvaluateFunctionCalling
    model: turing_flash
  - name: argument_validation
    metric: parameter_validation
    engine: local
  - name: result_grounding
    template: Groundedness
    model: turing_flash
  - name: recovery
    metric: action_safety
    engine: local
  - name: plan_coherence
    metric: trajectory_score
    engine: local
  - name: task_completion
    template: TaskCompletion
    model: turing_flash
assertions:
  - "tool_selection.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "result_grounding.score >= 0.90 for at_least 90% of cases"
  - "recovery.score >= 0.80 for at_least 85% of cases"
  - "plan_coherence.score >= 0.85 for at_least 90% of cases"
  - "task_completion.score >= 0.85 for at_least 90% of cases"

When the gate fails, the failing assertion name is the root cause. One bisect instead of three days of dashboard archaeology. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle suites where six rubrics across a few hundred cases outgrow a single-process budget. The GitHub Actions recipe sits in the SDK’s python/examples/ci-cd/ directory; the broader CI/CD pattern is in CI/CD for AI agents.
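The arithmetic behind the per-dimension argument can be checked with a short sketch. The thresholds mirror the assertions in the YAML fixture above; the gate logic and the score lists are illustrative assumptions, not the fi CLI's implementation.

```python
from statistics import mean
from typing import Dict, List, Tuple

# dimension -> (score threshold, minimum fraction of cases that must reach it)
GATES: Dict[str, Tuple[float, float]] = {
    "tool_selection": (0.95, 0.95),
    "argument_validation": (0.90, 0.90),
    "result_grounding": (0.90, 0.90),
    "recovery": (0.80, 0.85),
    "plan_coherence": (0.85, 0.90),
    "task_completion": (0.85, 0.90),
}


def failing_dimensions(scores: Dict[str, List[float]]) -> List[str]:
    """Names of the dimensions whose pass fraction falls below the gate."""
    failures = []
    for name, (threshold, min_fraction) in GATES.items():
        cases = scores[name]
        pass_fraction = sum(s >= threshold for s in cases) / len(cases)
        if pass_fraction < min_fraction:
            failures.append(name)
    return failures


def exit_code(scores: Dict[str, List[float]]) -> int:
    """Non-zero when any dimension fails; a CI wrapper could pass this to sys.exit."""
    return 1 if failing_dimensions(scores) else 0


# Made-up scores for 20 cases per dimension.
scores = {
    "tool_selection": [0.97] * 20,
    "argument_validation": [0.95] * 12 + [0.125] * 8,  # mean 0.62
    "result_grounding": [0.90] * 20,
    "recovery": [0.85] * 20,
    "plan_coherence": [0.88] * 20,
    "task_completion": [0.88] * 20,
}

# The aggregate a single-threshold gate would see: about 0.85.
aggregate = mean(mean(values) for values in scores.values())
```

With these numbers the aggregate rounds to 0.85 and looks healthy, while `failing_dimensions(scores)` names argument_validation as the only failing axis.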

Stage 3: production trace eval with EvalTag

The CI gate is the floor. The span-attached score is the river. The eval set was frozen the day it shipped; production has been moving since the next commit. Score live traces with the same rubric and you get a regression signal the offline set cannot have.

traceAI (Apache 2.0) ships 14 span kinds — TOOL, CHAIN, LLM, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN — across 50+ AI surfaces in Python, TypeScript, Java, and C#. The A2A_CLIENT and A2A_SERVER kinds propagate gen_ai.a2a.propagated_trace_id so a distributed multi-agent call shows up as one trace. The langchain instrumentor’s _langgraph submodule emits langgraph.graph.node_count, langgraph.node.name/type, langgraph.node.is_entry/is_end, conditional edges, and per-node state diffs so plan-coherence rubrics can score graph topology directly. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean spans flow into whatever OTel collector you already run.

EvalTag wires rubric to span at zero added inference latency. Declare it once at register(); the collector runs the eval server-side, post-export, and writes gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation back as span attributes:

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType, EvalTag, EvalTagType, EvalSpanKind, EvalName, ModelChoices,
)

trace_provider = register(
    project_name="support-agent",
    project_type=ProjectType.OBSERVE,
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.TOOL,
            eval_name=EvalName.EVALUATE_FUNCTION_CALLING,
            model=ModelChoices.TURING_FLASH,
            mapping={
                "input": "input.value",
                "output": "output.value",
                "tool_calls": "gen_ai.tool.call.arguments",
            },
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.AGENT,
            eval_name=EvalName.TASK_COMPLETION,
            model=ModelChoices.TURING_LARGE,
            mapping={"input": "input.value", "output": "output.value"},
        ),
    ],
)

The trace tree and the score live on the same span. The failing eval and the failing call show up in one place. No engineer cross-references two dashboards under pressure. The deep dive on the trace plumbing is in traceAI plus OpenTelemetry for LLM tracing.
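Because the score is a span attribute, finding the failing call is a filter over the spans themselves. The sketch below assumes exported spans are available as plain dictionaries and that the score is a float between 0 and 1; both are assumptions for illustration, and only the three attribute names come from the text above.

```python
from typing import Dict, List

SCORE = "gen_ai.evaluation.score.value"
LABEL = "gen_ai.evaluation.score.label"
WHY = "gen_ai.evaluation.explanation"


def failing_spans(spans: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Spans whose attached eval score is below the threshold, worst first."""
    scored = [span for span in spans if SCORE in span["attributes"]]
    failing = [span for span in scored if span["attributes"][SCORE] < threshold]
    return sorted(failing, key=lambda span: span["attributes"][SCORE])


# Hypothetical exported spans from one trace.
spans = [
    {
        "name": "tool.lookup_order",
        "kind": "TOOL",
        "attributes": {SCORE: 0.2, LABEL: "fail", WHY: "order_id argument was empty"},
    },
    {
        "name": "tool.issue_refund",
        "kind": "TOOL",
        "attributes": {SCORE: 0.9, LABEL: "pass", WHY: "arguments match the request"},
    },
    {"name": "llm.plan", "kind": "LLM", "attributes": {}},  # no eval attached
]

worst = failing_spans(spans)
```

The result carries the span name, the score, and the explanation together, which is the join that two separate dashboards cannot give you.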

Stage 4: Error Feed clusters, scores, and writes the fix

A production agent at modest scale produces hundreds of failing traces a day. Treating each as its own alert is operational malpractice. Stack-trace clusterers group by exception fingerprint; agents fail without exceptions. Tool misuse, ungrounded summary, step disorder, goal drift, redundant steps, hallucinated state — Sentry-style clustering cannot see any of these.

Error Feed sits inside the eval stack. Failing traces flow into ClickHouse with span embeddings. HDBSCAN soft-clustering groups them into named issues at prob >= 0.4 so noise points stay recoverable. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 (Bedrock) for a 30-turn investigation across 8 span-tools — read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, submit_summary, plus a Claude Haiku Chauffeur summariser for spans over 3000 characters. Prompt cache hit ratio sits around 90 percent.
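The prob >= 0.4 cut can be illustrated without a clustering library. The sketch below assumes a soft-clustering step has already produced one membership probability per candidate cluster for each trace; the function name and the data are hypothetical, and this is not Error Feed's code.

```python
from typing import Dict, List

NOISE = -1


def assign_soft(memberships: Dict[str, List[float]], min_prob: float = 0.4) -> Dict[str, int]:
    """Label each trace with its most probable cluster, or NOISE below min_prob."""
    labels = {}
    for trace_id, probs in memberships.items():
        if not probs:
            labels[trace_id] = NOISE
            continue
        best = max(range(len(probs)), key=probs.__getitem__)
        labels[trace_id] = best if probs[best] >= min_prob else NOISE
    return labels


# Hypothetical membership vectors: one probability per candidate cluster.
memberships = {
    "trace-a": [0.91, 0.03],
    "trace-b": [0.10, 0.55],
    "trace-c": [0.31, 0.22],  # ambiguous: below 0.4 for every cluster
}

labels = assign_soft(memberships)

# The probabilities are kept, so a noise point can be revisited later
# with a looser cut instead of being lost.
relaxed = assign_soft(memberships, min_prob=0.3)
```

Keeping the membership vectors next to the labels is what makes noise points recoverable: trace-c is noise at 0.4 but joins cluster 0 at 0.3.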

Per cluster the Judge emits three artifacts:

  • A 5-category, 30-subtype taxonomy classification with dedicated agent-failure nodes (Hallucinated Content, Ungrounded Summary, Tool Misuse, Wrong Tool, Invalid Params, Goal Drift, Step Disorder, Redundant Steps, Missing CoT, Missing ReAct Planning, Lack of Self-Correction).
  • The 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each). Composite is the cluster score; axes are the diagnosis.
  • An immediate_fix string naming the change to ship today: rubric edit, prompt patch, tool-call guard, retrieval-filter tweak.
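As a data structure, the three artifacts might look like the sketch below. The class names are hypothetical, and treating the composite as an unweighted mean of the four axes is an assumption; the guide does not say how the composite is computed.

```python
from dataclasses import dataclass
from statistics import mean

AXES = (
    "factual_grounding",
    "privacy_and_safety",
    "instruction_adherence",
    "optimal_plan_execution",
)


@dataclass(frozen=True)
class TraceScore:
    factual_grounding: int
    privacy_and_safety: int
    instruction_adherence: int
    optimal_plan_execution: int

    def __post_init__(self) -> None:
        # Each axis is scored 1 to 5.
        for axis in AXES:
            value = getattr(self, axis)
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be in 1..5, got {value}")

    @property
    def composite(self) -> float:
        # Assumption: unweighted mean of the four axes.
        return mean(getattr(self, axis) for axis in AXES)

    @property
    def weakest_axis(self) -> str:
        """The axis to read first when diagnosing the cluster."""
        return min(AXES, key=lambda axis: getattr(self, axis))


@dataclass(frozen=True)
class ClusterFinding:
    subtype: str  # one of the taxonomy subtypes, e.g. "Ungrounded Summary"
    score: TraceScore
    immediate_fix: str


finding = ClusterFinding(
    subtype="Ungrounded Summary",
    score=TraceScore(2, 5, 4, 3),
    immediate_fix="Require the summary to quote the tool payload it relies on.",
)
```

Here the composite is 3.5, and the weakest axis, factual_grounding, points at where the fix should land.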

Cluster IDs are stable across re-clusterings, so the dashboard links the same issue across runs. The Linear integration ships today via one-click OAuth; Slack, GitHub, Jira, and PagerDuty integrations are in development. The fix feeds the Platform's self-improving evaluators (few-shot injection via FeedbackStore, threshold calibration via ThresholdCalibrator), and the cluster becomes a candidate dataset entry. The full failure-mode taxonomy is in AI agent failure modes 2026.
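The promote step is easy to script once clusters exist. The sketch below assumes each failing trace carries a cluster label and a membership probability, and it picks the highest-probability member as the representative. That selection rule and every name here are assumptions; in practice a human still reviews the candidates.

```python
from typing import Dict, List, Set


def pick_representatives(traces: List[Dict], per_cluster: int = 1) -> List[Dict]:
    """Highest-probability members of each cluster; noise (cluster -1) is skipped."""
    by_cluster: Dict[int, List[Dict]] = {}
    for trace in traces:
        if trace["cluster"] == -1:
            continue
        by_cluster.setdefault(trace["cluster"], []).append(trace)
    representatives = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster], key=lambda t: t["prob"], reverse=True)
        representatives.extend(members[:per_cluster])
    return representatives


def ids_to_promote(dataset_ids: Set[str], representatives: List[Dict]) -> List[str]:
    """Trace ids to append to the offline set, skipping ones already present."""
    return [r["trace_id"] for r in representatives if r["trace_id"] not in dataset_ids]


# Hypothetical clustered failures from one week of production.
failing = [
    {"trace_id": "t1", "cluster": 0, "prob": 0.62},
    {"trace_id": "t2", "cluster": 0, "prob": 0.93},
    {"trace_id": "t3", "cluster": 1, "prob": 0.71},
    {"trace_id": "t4", "cluster": -1, "prob": 0.12},
]
already_in_dataset = {"t3"}

new_ids = ids_to_promote(already_in_dataset, pick_representatives(failing))
```

One representative per cluster keeps the offline set from filling up with near-duplicates of the same failure.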

Stage 5: optimization on the expanded dataset

Failing trajectories surface a prompt regression more often than a model regression. agent-opt (Apache 2.0, pip install agent-opt) ships six optimizers — RandomSearchOptimizer, BayesianSearchOptimizer (Optuna TPE, teacher-inferred few-shot, resumable studies), MetaPromptOptimizer, ProTeGi (textual-gradient beam search), GEPAOptimizer, PromptWizardOptimizer — and a shared EarlyStoppingConfig. You point one at the regression set plus a metric; it iterates and returns the best prompt.

from fi.opt.optimizers import BayesianSearchOptimizer
from fi.opt.base import Evaluator
from fi.opt.generators import LiteLLMGenerator

optimizer = BayesianSearchOptimizer(
    min_examples=2,
    max_examples=8,
    n_trials=20,
    inference_model_name="gpt-4o-mini",
    teacher_model_name="gpt-5",
    storage="sqlite:///optuna.db",
    study_name="support-agent-prompt-v3",
)
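# mapper, regression_set, and current_system_prompt are placeholders defined
# elsewhere: the data mapper, the expanded regression dataset, and the
# current system prompt.
result = optimizer.optimize(
    evaluator=Evaluator(eval_template="task_completion", eval_model_name="turing_flash"),
    data_mapper=mapper,
    dataset=regression_set,
    initial_prompts=[current_system_prompt],
)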

The winning candidate ships through the same CI gate that scored the originals. Same rubric, same vocabulary, same thresholds. The agent-opt webinar walks the live workflow; automated optimization for agents covers the longer-form pattern.
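The iterate-and-return-best shape is the same across optimizers. The sketch below is a toy random search with patience-based early stopping, written against the standard library only. It is not agent-opt's implementation, and the mutation list and metric are made up for illustration.

```python
import random
from typing import Callable, Sequence, Tuple


def random_search(
    initial_prompt: str,
    mutations: Sequence[str],
    metric: Callable[[str], float],
    n_trials: int = 20,
    patience: int = 5,
    seed: int = 0,
) -> Tuple[str, float]:
    """Try random combinations of extra instructions and keep the best-scoring prompt."""
    best_prompt, best_score = initial_prompt, metric(initial_prompt)
    if not mutations:
        return best_prompt, best_score
    rng = random.Random(seed)
    stale = 0
    for _ in range(n_trials):
        k = rng.randint(1, len(mutations))
        candidate = initial_prompt + "\n" + "\n".join(rng.sample(list(mutations), k))
        score = metric(candidate)
        if score > best_score:
            best_prompt, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:  # early stop: no improvement for `patience` trials
                break
    return best_prompt, best_score


REQUIRED = ("confirm the order id", "cite the tool result")


def toy_metric(prompt: str) -> float:
    """Stand-in for an eval score: fraction of required instructions present."""
    return sum(phrase in prompt for phrase in REQUIRED) / len(REQUIRED)


initial = "You are a support agent."
mutations = [
    "Always confirm the order id before acting.",
    "Always cite the tool result you relied on.",
    "Be concise.",
]
best_prompt, best_score = random_search(initial, mutations, toy_metric)
```

In the real loop the metric would be the same rubric the CI gate asserts on, evaluated over the expanded regression set rather than a string check.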

Honest framing. The signal source today is the offline dataset; the direct trace-stream-to-agent-opt connector that closes the loop end-to-end without the dataset round-trip is on the active roadmap, not shipped. Teams running the loop weekly through the promote step are doing the right thing in the meantime. Pretending the direct connector ships when it does not is the kind of vendor claim that costs trust the first time an engineer reads the code.

The eval stack ships as one package

The five stages are not five products you buy. Future AGI ships the eval stack as one package and the loop closes inside it.

ai-evaluation (Apache 2.0). 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, and 11 CustomerAgent* templates. Seven AgentTrajectoryInput metrics. Deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) at sub-millisecond cost. 13 guardrail backends (9 open-weight). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge. fi CLI with native CI assertions.

traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds with the agent-aware A2A_CLIENT, A2A_SERVER, GUARDRAIL, EVALUATOR, VECTOR_DB (Phoenix ships 8, Langfuse 5). LangGraphInstrumentor surfaces graph topology. Pluggable semantic conventions at register(). EvalTag wires rubric to span at zero inference latency.

Future AGI Platform. Self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer.

agent-opt. Six optimizers consuming Evaluator scores as the objective. Shared EarlyStoppingConfig. Eval-driven optimization ships today; direct trace-stream ingestion is on the active roadmap.

The Future AGI Agent Command Center is the hosted runtime that ties this together for the gateway side: 6 native provider adapters plus 13 OpenAI-compatible presets, 6 routing strategies, 6 exact and 4 semantic cache backends, 5-level hierarchical budgets, native Anthropic /v1/messages and Gemini /v1beta. It includes RBAC and is SOC 2 Type II, HIPAA, GDPR, and CCPA certified.

Three deliberate tradeoffs

  • Closing the loop costs operational surface. Span-attached scores plus auto-clustering plus a promote workflow is more parts than pytest evals/. Payoff: a regression suite that ratchets stronger every week. New deployments can ship with traceAI plus ai-evaluation alone, build a few weeks of trace history, and turn on Error Feed plus agent-opt when the baseline stabilises.
  • Self-improving evaluators need their own calibration. Rubrics that pull few-shots from a feedback store can drift in unintended directions. Pin a human-labelled hold-out, alarm when the judge disagrees with it by more than the inter-rater baseline, re-audit quarterly.
  • The end-to-end auto-loop is partial. Eval-driven optimization ships today; the direct trace-stream-to-agent-opt connector is roadmap. The promote-back step is manual on purpose because the cluster-to-dataset judgement is still load-bearing. Honesty here is cheaper than the cleanup.
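The hold-out check for self-improving evaluators reduces to comparing two disagreement rates. The sketch below assumes binary labels from the judge and from two human raters on the same hold-out cases; the function names and the data are hypothetical.

```python
from typing import Sequence


def disagreement_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of cases where two label sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def judge_drift_alarm(
    judge: Sequence[int],
    human: Sequence[int],
    second_rater: Sequence[int],
    margin: float = 0.0,
) -> bool:
    """True when the judge disagrees with the hold-out more than humans disagree with each other."""
    baseline = disagreement_rate(human, second_rater)  # inter-rater baseline
    return disagreement_rate(judge, human) > baseline + margin


# Hypothetical hold-out of ten cases: 1 = pass, 0 = fail.
human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
second_rater = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]    # differs on 1 of 10
judge_ok = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]        # differs on 1 of 10
judge_drifted = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1]   # differs on 3 of 10
```

With these labels the judge that matches the inter-rater baseline does not trip the alarm, while the one that disagrees on three cases does.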

Common closed-loop mistakes

  • Eval set frozen for a quarter. The set ages off the product within weeks. Promote failing traces into the offline set weekly.
  • One aggregate CI threshold. Hides which dimension regressed. Per-dimension assertions or the gate is theatre.
  • Eval on a dashboard, trace in OTel, no join. The failing eval has no replay. Attach scores to spans via EvalTag.
  • Stack-trace clustering on agent traces. Misses tool misuse, ungrounded summaries, step disorder. Cluster on span embeddings, not exception fingerprints.
  • Optimizer pointed at a stale set. agent-opt against the same dataset Error Feed has not touched for six weeks is shadowboxing. Run the promote step first.
  • Per-turn evals on conversational agents. Conversation-level rubrics (ConversationCoherence, ConversationResolution) catch failure modes per-turn evals miss.

Why pick Future AGI for the closed loop

  • Same rubric in CI and on live spans. ai-evaluation runs in pytest; EvalTag runs the same template on OTel spans server-side at zero added latency.
  • Agent-aware tracing other vendors do not match. 14 span kinds, A2A trace propagation, LangGraph topology, 50+ surfaces in Python, TypeScript, Java, C#.
  • Error Feed sees what stack-trace clusterers cannot. Tool Misuse, Goal Drift, Missing Self-Correction, Hallucinated Content as named clusters with immediate_fix strings the engineer ships today.
  • The loop closes inside one package. Eval, trace, cluster, optimize ship as ai-evaluation plus traceAI plus Platform plus agent-opt, not as four vendors stapled together.

The trace-stream-to-agent-opt connector is the last roadmap mile; the loop runs weekly through the promote step in the meantime, and the same rubric vocabulary travels all the way through.

Frequently asked questions

What does 'closed-loop agent evaluation' actually mean?
A closed loop has five stages and they pass scores between each other without a human carrying a CSV. Offline eval grades a versioned trajectory dataset under a code-defined rubric. The CI gate runs the same rubric on every PR and exits non-zero on per-dimension regressions, not on one aggregate score. Production trace eval scores live OpenTelemetry spans with the same rubric, attached as gen_ai.evaluation.score on the span. Error Feed clusters failing traces with HDBSCAN, runs a Claude Sonnet 4.5 Judge over 8 span-tools, writes a 4-D trace score and an immediate_fix. Optimization (agent-opt) consumes the dataset Error Feed just expanded and searches the prompt space against the same rubric. Same vocabulary in all five stages. The agent eval is no longer a static endpoint; it is a system that ratchets stronger every week.
Why is closed-loop eval the differentiator and not the six trajectory dimensions?
Most teams that get the six dimensions right still ship agents that regress. The dimensions tell you what to measure. The loop tells you what to do when the measurement moves. Without the loop, the offline set freezes in March, production drifts through June, the CI gate stays green on cases nobody runs anymore, and the on-call engineer reads traces by hand at 3am because the eval and the trace live in different dashboards. The six dimensions are necessary. The loop is what compounds. Treat one as a substitute for the other and you have a fast metric for a slow problem.
What are the six trajectory dimensions and how do they map to the loop?
Tool selection, argument extraction, result utilization, error recovery, plan coherence, task completion. Each is scored by a code-defined template in the ai-evaluation SDK (LLMFunctionCalling, parameter_validation, Groundedness with the tool payload as context, ActionSafety, StepEfficiency, TaskCompletion) and the same template runs on offline data in CI and on live spans in production via EvalTag. The 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution) is the composite the Error Feed Judge writes back on every failing trace, so the same vocabulary travels from offline rubric to live cluster to dataset entry to optimization objective.
How does the CI gate fit into the loop and what should it assert?
Per-dimension assertions, never one aggregate. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. The fi CLI runs a YAML fixture with six assertions, one per dimension, plus thresholds calibrated against historical pass rates. fi run exits non-zero on the first failing assertion; the failing assertion name is the root cause. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle suites where six rubrics across a few hundred cases outgrow a single-process budget. Cascading augment=True keeps cost survivable: local heuristics catch the easy regressions, the LLM judge runs only on cases the heuristic flags as uncertain.
How does production trace eval work without adding latency?
EvalTag at register() time. You declare which rubric runs against which span kind with which model and which attribute mapping. The collector runs the eval server-side, post-export, and writes gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, and gen_ai.evaluation.explanation back as span attributes. The user's request never waits on the eval. The trace tree and the score live on the same span, so the failing eval and the failing call show up in one place. Sample by failure signal, not uniformly: score every span flagged by a cheap classifier first, escalate to frontier judges only on the cases that smell expensive.
What does Error Feed do that stack-trace clustering does not?
Stack-trace clusterers group by exception fingerprint. Agents fail without exceptions: tool misuse, ungrounded summary, step disorder, goal drift, redundant steps, hallucinated state. Error Feed clusters by HDBSCAN soft-clustering over ClickHouse-stored embeddings of category plus root cause plus recommendation. A Claude Sonnet 4.5 Judge on Bedrock investigates each cluster with 8 span-tools (read_span, get_children, get_spans_by_type across 11 observation types, search_spans, submit_finding, submit_scores, submit_summary) for up to 30 turns at 90 percent prompt-cache hit. Output per cluster is a 5-category 30-subtype taxonomy classification, the 4-D trace score, and an immediate_fix string the on-call engineer ships today.
Where does agent-opt fit in the loop and what is honest about today versus roadmap?
agent-opt is the optimization stage. Six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume an Evaluator score as the objective and search the prompt space. Today the signal source is the offline dataset: Error Feed clusters failing traces, the on-call engineer promotes representatives into the dataset, agent-opt runs against the expanded set with the same rubric the CI gate uses. The direct trace-stream-to-agent-opt connector that closes the loop end-to-end without the dataset round-trip is on the active roadmap, not shipped. Teams that want continuous optimization run the loop weekly through the promote step in the meantime.