Agents

What Is an AI Agent?

An LLM-powered system that plans, calls tools, observes results, and pursues a goal across multiple steps.

An AI agent is a software system that uses a large language model to plan actions, call tools, observe results, and keep working toward a goal across multiple steps. It belongs to the agent systems family, not plain prompt engineering, because its production surface is a trajectory: planner spans, model calls, tool calls, memory reads, handoffs, and stop conditions. In FutureAGI, traceAI instrumentation exposes that trajectory so evaluators can score whether the agent chose and completed the right work. The May 2026 short version: if your system has no loop, no tool use, no memory, and no stop condition, it’s a chatbot with marketing copy, not an agent.
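The four ingredients named above (loop, tool use, memory, stop condition) can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `plan` is a hard-coded stand-in for the LLM planner, and `lookup_order`, `AgentState`, and `run_agent` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    goal: str
    memory: list = field(default_factory=list)  # observations collected so far
    steps: int = 0


def lookup_order(arg: str) -> str:
    """Hypothetical tool: returns a canned observation."""
    return f"order {arg}: shipped"


TOOLS: dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}


def plan(state: AgentState) -> tuple[str, str]:
    """Stand-in for the LLM planner: pick the next (action, argument)."""
    if not state.memory:
        return ("lookup_order", "A-17")
    return ("finish", state.memory[-1])


def run_agent(goal: str, max_steps: int = 5) -> str:
    state = AgentState(goal=goal)
    while state.steps < max_steps:          # stop condition 1: step budget
        action, arg = plan(state)           # plan
        state.steps += 1
        if action == "finish":              # stop condition 2: planner is done
            return arg
        observation = TOOLS[action](arg)    # call the tool
        state.memory.append(observation)    # observe and remember
    return "stopped: step budget exhausted"


print(run_agent("Where is order A-17?"))
```

Remove the `while` loop, the `TOOLS` table, the `memory` list, and the step budget, and what remains is a single model call, which is the chatbot case.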

Why AI agents matter in production LLM and agent systems

An ignored agent boundary turns small model errors into operational incidents. A wrong planner step selects issue_refund before check_policy; a stale memory read gives the billing agent a prior subscription tier; a missing stop condition creates an infinite loop that burns tokens until a timeout kills the request. The user sees a confident action, not the chain of small mistakes behind it. This is why 2026 frontier model cards report τ-bench, SWE-Bench Verified, and GAIA scores: those benchmarks surface trajectory failures that single-answer scores miss.

Developers feel it as debugging ambiguity. The final answer is wrong, but logs show ten model calls, four tools, and two retries, and the trajectory has no canonical shape, so there is no obvious place to start. SREs see p99 latency and token cost spike when one tool starts throttling; without trajectory-level agent observability, they cannot tell whether the spike is the model, the tool, or the planner deciding to retry. Compliance teams ask why an agent sent an email, changed a record, or exposed PII without a review step, and the audit log has to point at a specific step, not a request blob. Product teams see support tickets that say “the agent did the wrong thing” without a reproducible prompt, because the same input plus a different observation creates a different trajectory next time.

This matters more in 2026 because common stacks mix the OpenAI Agents SDK, LangGraph, CrewAI, MCP servers, RAG, and gateway routing in one request. A single-response eval cannot explain that system. The useful evidence is the trajectory: which step chose the tool, which observation changed the plan, which model variant ran, and where the goal stopped making progress. The benchmarks that matter (τ-bench retail/airline at 55-70% for frontier models, SWE-Bench Verified at 70-78%, GAIA Level 3 still under 50%, OSWorld under 40%) all penalize bad trajectories, not bad single answers.

The second 2026 shift is protocol-level. MCP (Model Context Protocol) standardizes how an agent talks to tools across processes; A2A (Agent-to-Agent Protocol) standardizes how agents delegate to other agents. Production agents now cross service boundaries on every interesting workflow, which means the trajectory has to propagate across services through OpenTelemetry context, or you lose the trace at the network hop. An “AI agent” in 2026 is rarely one process; it is a distributed system that happens to use LLMs as its decision layer.

How FutureAGI handles AI agents

FutureAGI’s approach is to treat an AI agent as a traceable trajectory and an evaluable object. With traceAI integrations for openai-agents, langchain, langgraph, crewai, autogen, pydantic-ai, google-adk, strands, mastra, and mcp, each planner turn, tool call, handoff, memory read, and final response becomes an OpenTelemetry span. The shared attribute agent.trajectory.step lets an engineer filter the trace by step number or node name; gen_ai.tool.name, gen_ai.request.model, gen_ai.usage.input_tokens, and fi.span.kind show what changed at runtime.
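To make the attribute-based filtering concrete, here is a minimal sketch that treats exported spans as plain dictionaries. The attribute keys are the ones named above; the span shape, the `"LLM"` and `"TOOL"` values for fi.span.kind, the model name, and the helper functions are assumptions made for illustration, not the traceAI export format or API.

```python
from collections import defaultdict

# Hypothetical exported spans for one run; only the attribute keys come from the text.
spans = [
    {"name": "plan", "attributes": {"agent.trajectory.step": 1, "fi.span.kind": "LLM",
                                    "gen_ai.request.model": "planner-model",
                                    "gen_ai.usage.input_tokens": 820}},
    {"name": "vendor_lookup", "attributes": {"agent.trajectory.step": 2, "fi.span.kind": "TOOL",
                                             "gen_ai.tool.name": "vendor_lookup"}},
    {"name": "plan", "attributes": {"agent.trajectory.step": 3, "fi.span.kind": "LLM",
                                    "gen_ai.request.model": "planner-model",
                                    "gen_ai.usage.input_tokens": 1310}},
    {"name": "contract_search", "attributes": {"agent.trajectory.step": 4, "fi.span.kind": "TOOL",
                                               "gen_ai.tool.name": "contract_search"}},
]


def filter_spans(spans, wanted):
    """Return spans whose attributes contain every key/value pair in `wanted`."""
    return [s for s in spans
            if all(s["attributes"].get(k) == v for k, v in wanted.items())]


def tool_sequence(spans):
    """Tool names in the order the agent called them, sorted by trajectory step."""
    tool_spans = [s for s in spans if "gen_ai.tool.name" in s["attributes"]]
    tool_spans.sort(key=lambda s: s["attributes"]["agent.trajectory.step"])
    return [s["attributes"]["gen_ai.tool.name"] for s in tool_spans]


def input_tokens_by_model(spans):
    """Sum input tokens per model across all model-call spans."""
    totals = defaultdict(int)
    for s in spans:
        attrs = s["attributes"]
        if "gen_ai.request.model" in attrs:
            totals[attrs["gen_ai.request.model"]] += attrs.get("gen_ai.usage.input_tokens", 0)
    return dict(totals)


print(tool_sequence(spans))
print(input_tokens_by_model(spans))
print(filter_spans(spans, {"agent.trajectory.step": 2}))
```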

The evaluation layer attaches scores to the same run. TaskCompletion checks whether the original goal was met. ToolSelectionAccuracy checks whether a tool choice matched the state at that step. TrajectoryScore, GoalProgress, and StepEfficiency separate “failed completely” from “made progress with too many steps.” Pair them with Faithfulness and Groundedness on retrieval steps, PromptInjection and ProtectFlash on untrusted tool outputs, and ActionSafety on the action surface.

A real example: a procurement team builds an agent on the OpenAI Agents SDK that handles “find the approved vendor, compare contract terms, and draft a purchase request.” FutureAGI captures the run with the traceAI-openai-agents integration. After a model upgrade from GPT-5.0 to GPT-5.1, TaskCompletion drops from 0.86 to 0.74 on sampled traces. Unlike a trace-only workflow such as LangSmith without attached evaluator scores, the FutureAGI view shows that ToolSelectionAccuracy fell specifically on the vendor-lookup step: the new model is calling the contract-search tool before the vendor-lookup tool. The engineer updates the tool description, adds a regression eval for that step, wires a pre-guardrail in Agent Command Center to short-circuit out-of-order tool calls, and alerts if eval-fail-rate-by-cohort rises above 5% on enterprise tenants. The fix ships behind a release gate in a single afternoon, not a sprint.
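The idea behind short-circuiting out-of-order tool calls can be sketched as a prerequisite check that runs before each tool call. This is an illustrative sketch only: the tool names mirror the procurement example, while `PREREQUISITES` and `check_tool_order` are hypothetical and do not represent the Agent Command Center configuration or API.

```python
# Hypothetical policy: each tool lists the tools that must already have run.
PREREQUISITES = {
    "contract_search": {"vendor_lookup"},
    "draft_purchase_request": {"vendor_lookup", "contract_search"},
}


def check_tool_order(called_so_far, next_tool):
    """Return (allowed, reason) for the next tool call given the calls made so far."""
    missing = PREREQUISITES.get(next_tool, set()) - set(called_so_far)
    if missing:
        return False, f"blocked {next_tool}: run {sorted(missing)} first"
    return True, "ok"


# The regressed trajectory from the example: contract search before vendor lookup.
print(check_tool_order([], "contract_search"))
# The intended order passes.
print(check_tool_order(["vendor_lookup"], "contract_search"))
```

A guard like this fails closed on the ordering bug, so the regression shows up as a blocked call in the trace instead of a wrong purchase request.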

In our 2026 evals across agent stacks, the largest source of “model upgrade made things worse” surprises is not model quality; it’s that the new model has a different default tool-selection bias that the eval suite never measured. The fix is structural: per-step ToolSelectionAccuracy plus golden-dataset replay before every model swap, gated in CI through FutureAGI. Compare this to Arize, where agent eval is bolted on top of an ML observability product, and Braintrust, which centers on offline grading without a runtime gateway. FutureAGI keeps trace, eval, golden dataset, and gateway audit on the same trajectory object, which is what makes the diagnosis fast.
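A per-tool diff between a baseline replay and a candidate replay is simple to reason about. The sketch below assumes each replayed step has been reduced to an (expected tool, chosen tool) pair. That input format, the exact-match scoring rule, and the function names are assumptions for illustration; this is not how the ToolSelectionAccuracy evaluator computes its score.

```python
from collections import defaultdict


def accuracy_by_tool(steps):
    """steps: iterable of (expected_tool, chosen_tool) pairs, one per tool-call step."""
    hits, totals = defaultdict(int), defaultdict(int)
    for expected, chosen in steps:
        totals[expected] += 1
        hits[expected] += int(expected == chosen)
    return {tool: hits[tool] / totals[tool] for tool in totals}


def regressions(baseline, candidate, tolerance=0.05):
    """Tools whose accuracy dropped by more than `tolerance`: {tool: (before, after)}."""
    before, after = accuracy_by_tool(baseline), accuracy_by_tool(candidate)
    return {tool: (before[tool], after.get(tool, 0.0))
            for tool in before
            if before[tool] - after.get(tool, 0.0) > tolerance}


# Toy golden-dataset replay: the candidate model picks contract_search
# on half of the steps where vendor_lookup was expected.
baseline = [("vendor_lookup", "vendor_lookup")] * 4 + [("contract_search", "contract_search")] * 4
candidate = ([("vendor_lookup", "vendor_lookup")] * 2
             + [("vendor_lookup", "contract_search")] * 2
             + [("contract_search", "contract_search")] * 4)

print(regressions(baseline, candidate))
```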

How to measure an AI agent

Measure the agent, not the final message alone. Single-turn QA-style scoring is what made 2022-era agent eval misleading; in 2026 the question is whether the trajectory closed the task without breaking state. The table maps the common production agent shapes to the FutureAGI surfaces that score them.

| Agent shape (2026) | What can go wrong | FutureAGI evaluators | Tracing target |
| --- | --- | --- | --- |
| Single-LLM ReAct loop (planner-only) | Infinite loop, no stop, wrong tool | TaskCompletion, ToolSelectionAccuracy, StepEfficiency | agent.trajectory.step, fi.span.kind |
| LangGraph workflow agent | Per-node regression, branch misroute | TaskCompletion, TrajectoryScore per node | traceAI-langgraph spans |
| OpenAI Agents SDK with handoffs | Handoff to wrong sub-agent | ToolSelectionAccuracy, TaskCompletion at child trace | traceAI-openai-agents |
| CrewAI multi-agent crew | Crew member skips peer handoff | TrajectoryScore, GoalProgress, CustomEvaluation | traceAI-crewai |
| MCP-mediated tool agent | Server returns hostile or stale content | ProtectFlash, PII, PromptInjection on tool output | traceAI-mcp |
| Voice agent (LiveKit / Pipecat) | ASR drift, barge-in failure, tone drift | ASRAccuracy, ConversationCoherence, Tone | traceAI-livekit, traceAI-pipecat |
| Code agent (Aider / SWE-style) | Edit fails hidden tests, infinite reasoning | TaskCompletion, CustomEvaluation against test suite | traceAI-langchain + custom |
| Computer-use agent (OSWorld-style) | Wrong app action, irreversible click | ActionSafety, TaskCompletion | OSWorld harness + traceAI |
| Research / browsing agent (GAIA-style) | Hallucinated source, stale answer | Faithfulness, Groundedness, ContextRelevance | retriever spans, fi.span.kind=RETRIEVER |
| A2A delegated sub-agent | Sub-agent ignores task contract | TaskCompletion at child trace, CustomEvaluation | A2A task id, parent trace id |

The signals to wire on every agent:

  • TaskCompletion: returns a score and reason for whether the agent completed the user’s original goal; pair with a per-cohort threshold in the release gate.
  • ToolSelectionAccuracy: scores whether each tool choice matched the state, available tools, and requested outcome; it is the single highest-signal score for tool-using agents.
  • TrajectoryScore: rolls step-level evidence into a single run-level score for triage dashboards.
  • StepEfficiency: flags trajectories that took too many steps to reach the goal; useful for catching budget regressions.
  • GoalProgress: gradient view of progress at each step; useful when end-to-end is binary.
  • ActionSafety: returns dangerous-action and sensitive-leak findings on the trajectory, before the action fires.
  • agent.trajectory.step: the span attribute to filter planner, tool, memory, handoff, and final-answer steps.
  • eval-fail-rate-by-cohort: tracks failing traces by model, route, tenant, framework, or release version.
  • User-feedback proxy: thumbs-down rate and escalation rate catch failures that offline evals did not cover.
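
A minimal example attaches three of these evaluators to one captured run:

from fi.evals import TaskCompletion, ToolSelectionAccuracy, TrajectoryScore

# user_goal: the original user request (str).
# trace_spans: the spans captured by traceAI for one agent run.
# Both are assumed to be defined earlier in your pipeline.
task = TaskCompletion()
tool = ToolSelectionAccuracy()
trajectory = TrajectoryScore()

# Score the same run three ways: goal met, tool choices, overall trajectory.
task_result = task.evaluate(input=user_goal, trajectory=trace_spans)
tool_result = tool.evaluate(input=user_goal, trajectory=trace_spans)
traj_result = trajectory.evaluate(input=user_goal, trajectory=trace_spans)
print(task_result.score, tool_result.score, traj_result.score)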

For cohort-filtered regression over a Dataset (the gate that catches per-tenant or per-intent breakage hidden inside a green aggregate), pair the evaluators with a saved cohort:

from fi.evals import TaskCompletion, ToolSelectionAccuracy, TrajectoryScore
from fi.datasets import Dataset

cohort = Dataset.load("agent_golden_v7")  # 600 production trajectories
report = cohort.run_eval(
    evaluators=[TaskCompletion(), ToolSelectionAccuracy(), TrajectoryScore()],
    group_by=["tenant_tier", "intent"],
    baseline_run_id="release_2026_05_08",
)
report.assert_no_regression(metric="task_completion", tolerance=0.02)

The eval should run on every release candidate against a golden dataset of 200-1000 trajectories sampled from production, with TrajectoryScore aggregated by cohort and ToolSelectionAccuracy aggregated by tool. A release gate that only checks aggregate TaskCompletion hides cohort regressions; per-cohort thresholds are what catch a model upgrade that broke “enterprise SSO refund” without moving the global score.
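A small worked example shows how an aggregate gate can stay green while one cohort regresses badly. The cohort names, sizes, and scores below are made-up numbers chosen for the arithmetic, and the helper functions are hypothetical; the 0.02 tolerance matches the one used in the release-gate snippet above.

```python
# Hypothetical per-cohort TaskCompletion means for a baseline and a candidate release.
cohorts = {
    "smb_general":           {"n": 900, "baseline": 0.86, "candidate": 0.87},
    "enterprise_sso_refund": {"n": 100, "baseline": 0.84, "candidate": 0.70},
}


def weighted_mean(cohorts, key):
    """Aggregate score across cohorts, weighted by trajectory count."""
    total = sum(c["n"] for c in cohorts.values())
    return sum(c["n"] * c[key] for c in cohorts.values()) / total


def failing_cohorts(cohorts, tolerance=0.02):
    """Cohorts whose score dropped by more than `tolerance`."""
    return sorted(name for name, c in cohorts.items()
                  if c["baseline"] - c["candidate"] > tolerance)


aggregate_drop = weighted_mean(cohorts, "baseline") - weighted_mean(cohorts, "candidate")
aggregate_gate_passes = aggregate_drop <= 0.02

# Aggregate: (900*0.86 + 100*0.84)/1000 = 0.858 vs (900*0.87 + 100*0.70)/1000 = 0.853
print(f"aggregate drop: {aggregate_drop:.3f}, aggregate gate passes: {aggregate_gate_passes}")
print("cohorts failing the per-cohort gate:", failing_cohorts(cohorts))
```

The aggregate moves by about half a point because the large cohort improved slightly, while the small enterprise cohort lost 14 points. Only the per-cohort check blocks the release.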

Choosing an agent framework in 2026

The framework picture in 2026 looks roughly like this: the OpenAI Agents SDK is the leanest abstraction and a sensible default for OpenAI-only stacks; LangGraph is the most observable for multi-step workflows; CrewAI is fastest to prototype crews but hardest to debug; AutoGen suits structured group-chat agents with critic roles; Pydantic AI gives you typed nodes and is the best fit for strict schema enforcement; Google ADK and Strands are gaining ground for Vertex-native agents; Claude Agent SDK is the right pick for Anthropic-only stacks. None of them gets you out of trajectory evaluation; they only change how the trajectory is shaped and traced. FutureAGI ships traceAI integrations for all of them, so the eval contract stays constant across framework swaps.

The one anti-pattern to avoid in 2026: rolling a custom agent framework “for full control” when your team is under five people and you have not yet shipped a single agent. The frameworks above all encode hard-won lessons about stop conditions, retry budgets, handoff semantics, and trace propagation; reimplementing those takes weeks, and the result usually has worse StepEfficiency than an off-the-shelf framework. Use a real framework, instrument it, and spend the time on the prompt, the tool surface, and the evaluator suite; those are where the wins live.

The 2026 agent stack: what we see in production traces

Across FutureAGI’s customer base in 2026, the most common production agent stack is some variation of: LangGraph or OpenAI Agents SDK at the orchestration layer; Claude Opus 4.7, GPT-5.x, or Gemini 3.x as the planner; a smaller frontier model (Claude Sonnet 4.6, GPT-5-mini) for cheap intermediate steps via Agent Command Center routing; an MCP server fleet for tool surfaces; a vector store (pgvector, Qdrant, or LanceDB) for retrieval; LiveKit or Pipecat for voice agents. The stack varies, but the failure modes converge: tool-output hallucination, per-cohort regression after model swap, prompt-injection via retrieved content, runaway cost from retry loops, and silent loss of trace context across MCP hops.

The interesting 2026 shift is that “model quality” matters less than “trajectory shape.” Two teams running GPT-5.1 on the same task can have wildly different TaskCompletion numbers if one team has cleaner tool descriptions and stricter pre-guardrail chains. We’ve found that the highest-impact production change is rarely a model upgrade; it’s tightening the tool surface, adding ProtectFlash to untrusted content, and gating release on per-cohort ToolSelectionAccuracy. That’s why the FutureAGI surface centers on traces, tools, and guardrails, not on chasing every new frontier model release.

MCP, A2A, and the network-boundary problem

The protocol shift matters for measurement. When a tool is an MCP server, the agent’s local span ends at the network call and a new span starts at the server. Without traceparent propagation, FutureAGI sees two disconnected traces and TaskCompletion cannot aggregate. The fix is the traceAI-mcp integration, which propagates context across the MCP boundary so the parent trajectory keeps a single trace id. The same logic applies to A2A: a delegated sub-agent is a child trajectory, and traceAI-a2a propagates the task id so the parent agent’s TaskCompletion can roll up the child’s TrajectoryScore. Skipping this is the most common cause of “the agent looks fine in logs but evals are noisy” in 2026: the eval is averaging unrelated traces.
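What traceparent propagation buys you can be shown with the W3C Trace Context header format (`version-traceid-parentid-flags`) and nothing but the standard library. This is a sketch of the mechanism, not the traceAI-mcp integration: `new_traceparent`, `parse_traceparent`, and `start_server_span` are hypothetical helpers, and a real deployment would rely on an OpenTelemetry propagator instead of hand-rolled parsing.

```python
import re
import secrets

# W3C Trace Context, version 00: 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Client side: header for an outgoing tool call (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is missing or invalid."""
    match = TRACEPARENT_RE.match(header)
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:  # all-zero ids are invalid
        return None
    return trace_id, parent_id, flags


def start_server_span(headers: dict) -> dict:
    """Server side: continue the caller's trace if context arrived, else start a new one."""
    parsed = parse_traceparent(headers.get("traceparent", ""))
    if parsed is None:
        # Context was lost at the hop: this span lands in a disconnected trace.
        return {"trace_id": secrets.token_hex(16), "parent_span_id": None,
                "span_id": secrets.token_hex(8)}
    trace_id, parent_id, _flags = parsed
    return {"trace_id": trace_id, "parent_span_id": parent_id,
            "span_id": secrets.token_hex(8)}


header = new_traceparent()
linked = start_server_span({"traceparent": header})
orphan = start_server_span({})
print(linked["trace_id"] == header.split("-")[1])  # same trace as the caller
print(orphan["parent_span_id"])                    # no parent: a separate trace
```

The `orphan` case is the failure described above: the server span exists, but no evaluator can join it to the parent trajectory.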

Common mistakes

  • Calling any chatbot an agent. If no loop, tool boundary, memory, or stop condition exists, it is a chatbot with marketing copy. The distinction matters for capacity planning, security review, and eval choice.
  • Evaluating only final text. A correct answer can hide a dangerous tool call, leaked context, or unapproved action in the middle of the trajectory. Always pair end-to-end TaskCompletion with per-step ToolSelectionAccuracy and ActionSafety.
  • Missing stop conditions. Max-iteration caps, budget caps, and loop detection are required before users can trigger long-running agent work. StepEfficiency plus a hard hop budget catches both budget overruns and bouncing-between-tools regressions.
  • Treating tools as trusted truth. Tool outputs can be stale, malformed, or injected; score external content with Faithfulness and PromptInjection before feeding it back to the planner. This is the single most common 2026 failure mode for RAG-backed agents.
  • Aggregating all failures. One “agent failed” metric hides whether planning, retrieval, memory, handoff, or tool selection caused the regression. Slice by agent.trajectory.step and by cohort.
  • Confusing autonomy with permission. An agent can decide the next step, but sensitive actions still need policy gates, audit logs, and approval paths. Wire pre-guardrail and post-guardrail in Agent Command Center on every write action.
  • Skipping golden-dataset replay before model swaps. A new model with a different tool-selection bias regresses silently on aggregate metrics. Replay a 500-trajectory golden dataset through the new model and diff ToolSelectionAccuracy per tool before shipping.
  • Letting agent memory grow unbounded. Long-running agents accumulate stale state that contaminates the planner; checkpoint, summarize, and evict on every workflow boundary.
  • No release gate on TaskCompletion. A green CI that runs evals without blocking the deploy is reporting, not evaluation. Gate the deploy on per-cohort thresholds plus a hard “no regression > 2 points on safety-critical cohorts” rule.
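The stop conditions listed above (a max-iteration cap, a budget cap, and loop detection) fit in one small guard object that the agent loop consults before every tool call. This is a minimal sketch under stated assumptions: `StopGuard` and its thresholds are hypothetical, and loop detection here means only "the same tool with the same arguments was requested too many times."

```python
from collections import Counter
from typing import Optional


class StopGuard:
    """Tracks steps, tokens, and repeated calls for one agent run."""

    def __init__(self, max_steps: int = 8, max_tokens: int = 20_000, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.steps = 0
        self.tokens = 0
        self.seen = Counter()

    def check(self, tool: str, args: str, tokens_used: int) -> Optional[str]:
        """Record one planned call; return a stop reason, or None to continue."""
        self.steps += 1
        self.tokens += tokens_used
        self.seen[(tool, args)] += 1
        if self.steps > self.max_steps:
            return "max_steps"
        if self.tokens > self.max_tokens:
            return "token_budget"
        if self.seen[(tool, args)] > self.max_repeats:
            return "loop_detected"
        return None


# An agent that keeps retrying the same search is stopped on the third identical call.
guard = StopGuard(max_steps=10, max_tokens=5_000, max_repeats=2)
reasons = [guard.check("search_docs", "refund policy", 400) for _ in range(3)]
print(reasons)
```

Returning a named reason instead of raising lets the caller record why the run stopped on the final span, which keeps budget overruns and tool-bouncing loops distinguishable in triage.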

Frequently Asked Questions

What is an AI agent?

An AI agent is an LLM-driven system that plans, calls tools, observes results, and works across multiple steps to complete a goal, rather than answering one prompt once.

How is an AI agent different from a chatbot?

A chatbot usually responds to each turn. An AI agent owns a loop: it decides the next step, invokes tools, reads observations, updates state, and stops when a goal or guard condition is reached.

How do you measure an AI agent?

Use traceAI spans tagged with agent.trajectory.step, then attach FutureAGI evaluators such as TaskCompletion, ToolSelectionAccuracy, and TrajectoryScore to the same run.