What Are Strands Agents?
Model-driven AI agents built with the open-source Strands Agents SDK, combining a model, tools, prompt, and an autonomous agent loop.
Strands Agents are model-driven AI agents built with the open-source Strands Agents SDK, where a model, tools, and prompt run inside an autonomous agent loop. They belong to the agent framework family because reliability depends on the runtime path, not only the final model answer. In production traces, Strands behavior appears as planner steps, tool calls, MCP interactions, context decisions, and OpenTelemetry spans. FutureAGI observes Strands through traceAI:strands and scores runs with agent evaluators.
Why It Matters in Production LLM and Agent Systems
Strands Agents move control from hand-written workflow code into the model-driven loop. That is useful when the model can plan, choose tools, reflect, and stop correctly. It is risky when the same loop chooses a destructive tool too early, repeats a retrieval tool until cost spikes, or hides a failed MCP call behind a confident final answer. A bad Strands run rarely looks like one malformed completion. It looks like a long trajectory with extra reasoning steps, mismatched tool arguments, missing context, retries, and an answer that arrived after the wrong work.
Developers feel this first because local examples can pass while production data exposes new tool combinations. SREs see token-cost-per-trace, p99 latency, tool-timeout rate, and retry counts jump after adding a tool or changing the model. Product teams see inconsistent task completion by user cohort. Compliance teams ask why an agent touched a CRM record, invoked a shell tool, or skipped human approval.
This matters more in 2026 agentic pipelines because Strands Agents can connect models, custom functions, MCP servers, AWS services, memory stores, and multi-agent patterns. Unlike LangGraph, where many transitions are explicit graph edges, Strands often asks the model to decide the next action. The reliability question is therefore not “did the model answer?” It is “did the agent choose the right path, use the right tool, stop at the right time, and leave enough trace evidence to prove it?”
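The runtime-control difference can be illustrated with a minimal sketch of a model-driven agent loop, where the model (not an explicit graph) decides the next action at every step. All names here (`agent_loop`, the decision dict shape) are illustrative assumptions, not the Strands SDK API.

```python
# Minimal sketch of a model-driven agent loop. The `model` callable returns
# either a tool request or a final answer; the dict shapes are assumptions
# for illustration, not the Strands SDK interface.

def agent_loop(model, tools, prompt, max_steps=5):
    """Run the model until it emits a final answer or hits the step cap."""
    history = [{"role": "user", "content": prompt}]
    for step in range(max_steps):
        decision = model(history)          # the model chooses the next action
        if decision["type"] == "final":
            return decision["content"], step + 1
        tool = tools[decision["tool"]]     # model-chosen tool, not a graph edge
        result = tool(**decision["args"])
        history.append({"role": "tool", "tool": decision["tool"], "content": result})
    return None, max_steps                 # loop exhausted without an answer
```

Every branch here is a model decision, which is why evaluation has to cover the trajectory, not just the final return value.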
How FutureAGI Handles Strands Agents
FutureAGI’s approach is to treat a Strands run as an evaluable trajectory, not a black-box SDK call. The specific FAGI surface for this glossary entry is traceAI:strands, the traceAI integration slug for Strands. When a Strands agent handles a support request, the trace can capture the user prompt, model call, tool selection, tool result, MCP server interaction, memory or context update, and final response under one trace context. The useful fields include agent.trajectory.step, tool name, tool status, model provider, llm.token_count.prompt, latency, and error metadata when the integration emits them.
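As a hedged sketch of how those fields can be consumed downstream, the helper below summarizes a trajectory from a list of exported span dicts. The attribute names follow the ones mentioned above (`agent.trajectory.step`, `llm.token_count.prompt`); the dict shape itself is an assumption for illustration, not the exact traceAI:strands export format.

```python
# Illustrative sketch: summarize one Strands run from exported span dicts.
# The span shape ({"attributes": {...}, "status": ...}) is an assumption.

def summarize_trajectory(spans):
    """Order spans by trajectory step and pull out the useful fields."""
    steps = sorted(spans, key=lambda s: s["attributes"].get("agent.trajectory.step", 0))
    return {
        "num_steps": len(steps),
        "tools": [s["attributes"]["tool.name"]
                  for s in steps if "tool.name" in s["attributes"]],
        "prompt_tokens": sum(s["attributes"].get("llm.token_count.prompt", 0)
                             for s in steps),
        "errors": [s["attributes"].get("error.message")
                   for s in steps if s.get("status") == "ERROR"],
    }
```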
Example: a billing agent built with Strands receives “refund the duplicate charge and explain it.” It can retrieve policy, inspect payments, call a refund tool, and draft the response. FutureAGI records each step through traceAI:strands, then evaluates the same run with ToolSelectionAccuracy for tool choice, TaskCompletion for the end outcome, TrajectoryScore for the full path, and StepEfficiency for unnecessary work. If the model calls refund_charge before policy lookup, the failure is visible as an early risky tool call, not only as a final-answer defect.
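The early-risky-tool-call pattern in this example can also be caught with a simple custom check over the ordered tool-call names. The tool names (`refund_charge`, `lookup_policy`) and the list-of-names input are assumptions for illustration.

```python
# Sketch of a trajectory ordering check: flag runs where a risky tool
# fires before the required policy lookup. Tool names are placeholders.

def risky_call_before_policy(tool_calls, risky="refund_charge",
                             required="lookup_policy"):
    """Return True if the risky tool ran before the required lookup."""
    for name in tool_calls:
        if name == required:
            return False          # policy was checked first: safe ordering
        if name == risky:
            return True           # risky action fired before the lookup
    return False                  # neither ordering violation occurred
```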
The engineer’s next action is concrete: add the failed trace to a regression dataset, set an eval-fail-rate threshold for Strands traces, and block release when a model or tool change lowers the score. If the failure is safety-sensitive, the fix may be a stricter tool schema, a pre-guardrail in Agent Command Center, a human approval step, or a narrower MCP tool registry.
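A release gate on eval-fail-rate can be as small as the sketch below, assuming each regression-dataset run produces a score in 0..1 and a run below `pass_score` counts as a failure. The threshold values are placeholders, not FutureAGI defaults.

```python
# Sketch of a CI release gate: block the release when the fraction of
# failing regression runs exceeds a threshold. Thresholds are illustrative.

def release_blocked(scores, pass_score=0.8, max_fail_rate=0.05):
    """Return True when the eval fail rate is over the allowed limit."""
    failures = sum(1 for s in scores if s < pass_score)
    fail_rate = failures / len(scores) if scores else 0.0
    return fail_rate > max_fail_rate
```

In practice this runs after a model or tool change, over the traces previously added to the regression dataset.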
How to Measure or Detect It
Measure Strands Agents by scoring the final outcome and the decisions inside the loop.
- ToolSelectionAccuracy: checks whether the agent selected the expected tool for the user intent.
- TaskCompletion: checks whether the requested task was completed by the end of the run.
- TrajectoryScore: scores the full agent path across planning, tool use, errors, and stopping.
- StepEfficiency: flags extra or repeated steps that increase latency and cost.
- Trace signals: count agent.trajectory.step, tool error rate, retry count, p99 latency, llm.token_count.prompt, token-cost-per-trace, and MCP failure rate.
- User proxies: escalation rate, manual-review rate, reopened-ticket rate, and thumbs-down rate by Strands agent version.
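A few of these trace signals can be computed directly from per-span records, as in the sketch below. The record shape (dicts with `kind`, `status`, `latency_ms`, and `retries` fields) is an assumption for illustration, not a traceAI:strands schema.

```python
# Sketch: derive tool error rate, retry count, and p99 latency from
# per-span records. The record field names are illustrative assumptions.

def trace_signals(spans):
    tool_spans = [s for s in spans if s.get("kind") == "tool"]
    errors = sum(1 for s in tool_spans if s.get("status") == "ERROR")
    latencies = sorted(s["latency_ms"] for s in spans)
    # Nearest-rank p99 index, clamped to the last element.
    idx = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "tool_error_rate": errors / len(tool_spans) if tool_spans else 0.0,
        "retry_count": sum(s.get("retries", 0) for s in spans),
        "p99_latency_ms": latencies[idx] if latencies else 0.0,
    }
```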
Minimal Python:

```python
from fi.evals import ToolSelectionAccuracy, TrajectoryScore

tool_eval = ToolSelectionAccuracy()
path_eval = TrajectoryScore()

# trace_spans: the spans captured for one Strands run via traceAI:strands
tool_result = tool_eval.evaluate(trajectory=trace_spans, expected_tool="refund_charge")
path_result = path_eval.evaluate(trajectory=trace_spans)
print(tool_result.score, path_result.score)
```
Common Mistakes
- Treating Strands as only a simpler SDK. The model-driven loop changes the failure mode from code-path bugs to trajectory bugs.
- Evaluating only the final answer. A correct response can hide a premature tool call, skipped policy lookup, or unauthorized action.
- Adding too many tools at once. Large tool registries increase wrong-tool risk unless retrieval, schemas, and permissions are measured.
- Ignoring MCP boundary failures. Timeouts, stale schemas, and missing tool results need trace fields, not only app logs.
- Comparing Strands to LangGraph by syntax. The real difference is runtime control: model-chosen steps versus explicit state transitions.
Frequently Asked Questions
What are Strands Agents?
Strands Agents are model-driven AI agents built with the Strands Agents SDK, where a model, tools, and prompt run inside an autonomous agent loop. FutureAGI traces them with traceAI:strands and scores tool choice, task completion, and trajectory quality.
How are Strands Agents different from LangGraph?
Strands Agents rely more on the model to choose steps and tools at runtime. LangGraph usually asks engineers to define an explicit state graph, which gives more control but a different trace and failure shape.
How do you measure Strands Agents?
Measure Strands Agents with traceAI:strands spans such as agent.trajectory.step, tool-call status, token cost, and latency. FutureAGI can score runs with ToolSelectionAccuracy, TaskCompletion, TrajectoryScore, StepEfficiency, and ReasoningQuality.