
What Is the ReAct Pattern (Reason + Act)?

An agent control pattern that interleaves reasoning thoughts and acting steps in a loop until the model emits a final answer.

What Is the ReAct Pattern?

ReAct is the canonical agent control pattern that interleaves Reasoning and Acting in a loop. Introduced by Yao et al. in 2022, it formalised what most agents now do: at each step the model writes an explicit thought (the reasoning), emits an action (typically a tool call), observes the result, and continues. The pattern produces a chain of (thought, action, observation) triplets, captured in production as an agent trajectory. ReAct underpins the default agent loops in LangChain, the OpenAI Agents SDK, the Claude Agent SDK, LangGraph, CrewAI, Pydantic AI, Google’s ADK, and most other agent frameworks in use in 2026. In a FutureAGI trace, each triplet appears as three correlated spans.
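The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the frameworks named: `Step`, `TOOLS`, `scripted_model`, and `run_react` are hypothetical names, and the model is a scripted stand-in rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    thought: str
    action: Optional[str]        # tool name, or None on the final step
    action_input: Optional[str]
    observation: Optional[str]


# Hypothetical tool registry: tool name -> callable taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"order {order_id}: delivered",
}


def scripted_model(question: str, history: list[Step]) -> dict:
    """Stand-in for an LLM call: returns either an action or a final answer."""
    if not history:
        return {"thought": "I need the order status first.",
                "action": "lookup_order", "action_input": "A17"}
    return {"thought": "The observation answers the question.",
            "final_answer": history[-1].observation}


def run_react(question: str, model=scripted_model, max_iterations: int = 5):
    """Run thought -> action -> observation until a final answer or the cap."""
    trajectory: list[Step] = []
    for _ in range(max_iterations):
        out = model(question, trajectory)
        if "final_answer" in out:
            trajectory.append(Step(out["thought"], None, None, None))
            return out["final_answer"], trajectory
        observation = TOOLS[out["action"]](out["action_input"])
        trajectory.append(
            Step(out["thought"], out["action"], out["action_input"], observation)
        )
    return None, trajectory  # iteration cap hit without a final answer


answer, trajectory = run_react("Where is order A17?")
```

The returned `trajectory` is the list of triplets; the `max_iterations` cap is what keeps a confused agent from looping indefinitely.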

The 2026 update: with reasoning-native frontier models (GPT-5.x, Claude Opus 4.7, Gemini 3.x), the line between “ReAct thought” and “model’s internal reasoning” is blurring. Modern ReAct stacks often surface only the action and observation, with reasoning emitted by the model’s native thinking mode and captured separately. The pattern is the same; the implementation has gotten leaner.

Why ReAct matters in production LLM and agent systems

ReAct is popular because it makes reasoning legible. The model says what it’s thinking before it acts, which means every wrong action has a recorded justification. That sounds minor; it isn’t. The thought trace is the difference between debugging “the agent picked the wrong tool” and debugging “the agent picked the wrong tool because it misread the user’s intent as a returns query when it was an exchange query.” The first fix is a tool-description tweak; the second is a system-prompt edit.

Each role gets value differently. A backend engineer reads thoughts to localise where the agent’s understanding diverged from the user’s intent. A product reviewer audits thoughts to catch agents that say one thing and do another. An SRE uses thought-token cost as a proxy for agent complexity: verbose ReAct agents are expensive ones. A QA engineer turns canonical thoughts into regression evals: “the agent should always reason about eligibility before issuing a refund.”

The flip side: thoughts cost tokens. A ReAct agent generates 2–4x the tokens of a no-thought agent for the same output. In 2026, teams increasingly mix patterns: ReAct for high-stakes paths, plan-and-execute for predictable workflows, no-thought tool calls for low-stakes lookups, and agent-as-judge for verification. The goal is to put thought tokens where they earn their cost.
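As a rough worked example of that tradeoff, the sketch below applies the 2–4x multiplier to a monthly output-token bill. The traffic volume, tokens per trace, and price per million tokens are made-up figures chosen only to keep the arithmetic easy to follow, and `monthly_output_cost` is a hypothetical helper.

```python
def monthly_output_cost(traces_per_month: int,
                        base_tokens_per_trace: int,
                        thought_multiplier: float,
                        usd_per_million_tokens: float) -> float:
    """Output-token spend for one month, in USD."""
    tokens = traces_per_month * base_tokens_per_trace * thought_multiplier
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 100k traces/month, 400 output tokens without thoughts,
# priced at $10 per million output tokens.
baseline = monthly_output_cost(100_000, 400, 1.0, 10.0)    # no-thought agent
react_low = monthly_output_cost(100_000, 400, 2.0, 10.0)   # 2x thought overhead
react_high = monthly_output_cost(100_000, 400, 4.0, 10.0)  # 4x thought overhead
```

Under these assumed numbers the no-thought baseline costs $400 a month and the ReAct agent between $800 and $1,600, which is the gap a per-path pattern choice is trying to spend deliberately.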

The 2026 agentic AI ecosystem also pulled ReAct into multi-agent territory. With A2A (Agent-to-Agent Protocol) and MCP now production protocols, a ReAct agent is rarely alone. The planner ReAct agent calls specialist sub-agents via A2A; the tool surface is an MCP server registry. The trajectory now spans agents, not just steps within one agent. That shifts the eval question: instead of “did this agent solve the task?”, it becomes “did the agent ensemble solve the task, and which agent’s step caused the failure?”

ReAct is also where agent benchmarks like τ-bench, SWE-Bench Verified, GAIA, and OSWorld live. These benchmarks score trajectory completion, tool-selection accuracy, and step efficiency, the exact signals a production ReAct loop exposes. A team that wires ReasoningQuality and ToolSelectionAccuracy against τ-bench retail is running the same eval format frontier labs publish in their model cards. That is not coincidence; it is the convergence of agent eval methodology.

How ReAct interacts with 2026 reasoning models

The biggest 2026 shift is that frontier models now reason natively. GPT-5.x has a reasoning mode that emits internal thoughts (visible or hidden depending on API surface); Claude Opus 4.7 has thinking mode; Gemini 3.x has deep-think mode. The interaction with ReAct is non-trivial:

  • Naive overlap: a ReAct prompt that says “first think step by step, then act” on top of a reasoning model produces double reasoning: once internally, once in the visible thought span. Cost doubles, and quality often drops because the visible thought constrains the internal one.
  • Right pattern: let the model reason natively, and use ReAct as the action-and-observation loop. The “thought” span becomes the model’s reasoning trace (where the API exposes it), and the “action” span is the structured tool call.
  • Where it still helps: for non-reasoning models (Llama 4 base, Mistral Large, smaller open-weight models), explicit ReAct scaffolding still wins. The thought prompt acts as a CoT replacement.
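One way to apply the three bullets above is to choose the scaffolding per model rather than per framework. The sketch below is an assumption-laden illustration: the prompt wording, the `native_reasoning` flag, and `build_system_prompt` are hypothetical, and real deployments would read the capability from their own model configuration.

```python
# Explicit thought scaffold for models without native reasoning.
REACT_SCAFFOLD = (
    "Use this format on every turn:\n"
    "Thought: reason about what to do next\n"
    "Action: the tool to call\n"
    "Action Input: the tool argument\n"
)

# Action-only instructions for reasoning-native models, which think on their own.
ACTION_ONLY = "Call a tool when you need information; otherwise answer directly.\n"


def build_system_prompt(task_instructions: str, native_reasoning: bool) -> str:
    """Attach the explicit thought scaffold only when the model lacks native reasoning."""
    scaffold = ACTION_ONLY if native_reasoning else REACT_SCAFFOLD
    return task_instructions.rstrip() + "\n\n" + scaffold


frontier_prompt = build_system_prompt("You are a refund assistant.", native_reasoning=True)
open_weight_prompt = build_system_prompt("You are a refund assistant.", native_reasoning=False)
```

Keeping the switch in one function avoids the double-reasoning case where a "think step by step" instruction is left in a prompt that is later pointed at a reasoning model.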

We have shipped both patterns and recommend reasoning-native ReAct for frontier-tier deployments and explicit-thought ReAct for cost-optimised tier-2 deployments. The choice is a cost/quality tradeoff that the eval cohort will reveal.

How FutureAGI handles ReAct

FutureAGI’s approach is to evaluate thoughts and actions as separate first-class spans. The traceAI integrations (traceAI-langchain, traceAI-langgraph, traceAI-openai-agents, traceAI-claude-agent-sdk, traceAI-crewai, traceAI-pydantic-ai, traceAI-google-adk, traceAI-haystack, traceAI-agno) wrap the ReAct loop so each thought becomes an LLM span, each action becomes a tool span, and the observation is the tool span’s result. All three carry agent.trajectory.step and an iteration index, which makes per-triplet evaluation trivial.
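The span layout described above can be pictured with plain data classes. This is not the traceAI API: `Span`, `triplet_to_spans`, and `spans_for_step` are hypothetical, and storing the iteration index as the value of `agent.trajectory.step` is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str     # "LLM" for thoughts, "TOOL" for actions
    name: str
    output: str   # thought text, or the tool result (the observation)
    attributes: dict = field(default_factory=dict)


def triplet_to_spans(iteration: int, thought: str, tool: str, observation: str) -> list[Span]:
    """Map one (thought, action, observation) triplet to an LLM span and a tool span."""
    shared = {"agent.trajectory.step": iteration}  # value format is an assumption
    return [
        Span("LLM", "thought", thought, dict(shared)),
        Span("TOOL", tool, observation, dict(shared)),
    ]


def spans_for_step(spans: list[Span], iteration: int, kind: str) -> list[Span]:
    """Slice a trace by step and span kind, e.g. only the action of step 1."""
    return [
        s for s in spans
        if s.attributes.get("agent.trajectory.step") == iteration and s.kind == kind
    ]


spans = (
    triplet_to_spans(0, "check policy first", "check_policy", "eligible")
    + triplet_to_spans(1, "issue the refund", "issue_refund", "refund issued")
)
```

Because every span carries the same step key, an evaluator can pull the thought and the action of any iteration independently, which is what makes scoring them separately straightforward.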

Four evaluators cover the ReAct surface:

  • ReasoningQuality (local-metric) and the framework-eval ReasoningQualityEval score the logical validity of each thought given the prior observations: does the reasoning actually justify the next action?
  • ToolSelectionAccuracy scores whether the action that followed the thought was the correct tool.
  • TaskCompletion grades the trajectory end-to-end against the user’s goal.
  • TrajectoryScore aggregates step efficiency, tool selection, reasoning, and goal progress into one number; StepEfficiency and GoalProgress are the components when you need surgical diagnosis.

Used together they tell you whether a ReAct failure is “the thought was wrong” (a reasoning bug, often a model swap or prompt regression) or “the thought was right but the action contradicted it” (a tool-spec or registry mismatch).
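That triage rule is simple enough to write down. The sketch below assumes both scores are on a 0–1 scale and uses an arbitrary 0.8 threshold; `diagnose_step` is a hypothetical helper, not part of any evaluator library.

```python
def diagnose_step(reasoning_quality: float,
                  tool_selection_accuracy: float,
                  threshold: float = 0.8) -> str:
    """Classify a ReAct failure from the two per-step scores (threshold is an assumption)."""
    thought_ok = reasoning_quality >= threshold
    action_ok = tool_selection_accuracy >= threshold
    if thought_ok and action_ok:
        return "healthy"
    if not thought_ok:
        # The thought itself is wrong: look at model swaps or prompt regressions.
        return "reasoning bug"
    # The thought was sound but the action contradicted it.
    return "tool-spec or registry mismatch"


label = diagnose_step(reasoning_quality=0.91, tool_selection_accuracy=0.74)
```

With a sound thought score and a weak tool score, `label` comes out as a tool-spec or registry mismatch, pointing the fix at tool descriptions rather than the prompt.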

Concretely: a refund agent built on LangChain’s ReAct executor with traceAI-langchain runs against a Scenario simulation. After a model swap from Claude Sonnet 4.5 to Claude Sonnet 4.6, ReasoningQuality stays at 0.91, but ToolSelectionAccuracy drops from 0.88 to 0.74. The trace shows the model still reasons correctly (“user wants refund, I should check the policy first”) but then calls issue_refund directly instead of check_policy. The fix is in the tool descriptions, not the prompt: the new model is more aggressive about action and less attentive to ordering hints. The team adds a regression eval that pins the policy-check call before any refund issuance, then re-runs the same Scenario to confirm the fix.
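The ordering constraint in that regression eval can be expressed as a small check over the tool-call sequence of a trajectory. This is a minimal sketch assuming the sequence is available as a list of tool names; `policy_checked_before_refund` is a hypothetical helper, and only the tool names `check_policy` and `issue_refund` come from the example above.

```python
def policy_checked_before_refund(tool_calls: list[str]) -> bool:
    """Return True if every issue_refund call is preceded by a check_policy call."""
    policy_seen = False
    for name in tool_calls:
        if name == "check_policy":
            policy_seen = True
        elif name == "issue_refund" and not policy_seen:
            return False  # refund issued before any policy check
    return True


good = policy_checked_before_refund(["check_policy", "issue_refund"])
bad = policy_checked_before_refund(["issue_refund", "check_policy"])
```

A trajectory with no refund at all passes the check, since the rule only constrains what must happen before a refund is issued.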

Unlike LangSmith’s trajectory view, which shows you the trace but does not score it, FutureAGI scores every (thought, action, observation) triplet and writes the scores back as span attributes, so agent.trajectory.step.reasoning_quality and agent.trajectory.step.tool_selection_accuracy become queryable signals in the dashboard. That is the difference between observability and evaluable observability.

Simulating ReAct trajectories before they hit production

The simulate-sdk gives ReAct development a second feedback loop. The pattern: define a Persona per user archetype (frustrated customer, edge-case tester, multilingual user), bundle them into a Scenario, and run the agent against the scenario with CloudEngine. The simulation produces a TestReport with one row per persona × trajectory. Evaluators run against each trajectory: ReasoningQuality per step, ToolSelectionAccuracy per action, TaskCompletion overall, plus product-specific custom evaluations.

For voice ReAct agents, swap CloudEngine for LiveKitEngine and add ASRAccuracy, AudioQualityEvaluator, and TTSAccuracy to the eval stack. The same trajectory-scoring discipline applies; the modality boundary is just additional spans.

This is the difference between “we tested the agent on 10 demo prompts” and “we ran 10,000 simulated trajectories with cohort-level cuts.” Frontier-lab agent teams have been doing this since 2024; in 2026 it became the default for any team shipping an agent to enterprise customers.

ReAct vs other agent control patterns (May 2026)

| Pattern | Loop shape | Token cost | Best for | Primary evaluators |
|---|---|---|---|---|
| ReAct | thought → action → obs → thought | High | Open-ended, exploratory tasks | ReasoningQuality, ToolSelectionAccuracy |
| Plan-and-Execute | plan once → execute steps | Medium | Predictable workflows | TaskCompletion, StepEfficiency |
| Reflexion | ReAct + self-critique loop | Very high | Self-improving agents | ReasoningQuality, TaskCompletion |
| Self-Ask | sub-question decomposition | Medium | Multi-hop QA | MultiHopReasoning, Faithfulness |
| Direct tool-call | model → tool → answer | Low | Low-stakes lookups | ToolSelectionAccuracy |
| Tree-of-Thoughts | branch-and-prune reasoning | Very high | Hard reasoning problems | ReasoningQuality, GoalProgress |
| Native reasoning + ReAct | model’s CoT + tool calls | Medium-high | 2026 frontier setups | ReasoningQuality, TaskCompletion |
| ReWOO | plan all tool calls upfront, then execute | Medium | Parallel-safe workflows | TaskCompletion, StepEfficiency |
| Plan-Execute-Replan | plan, execute, re-plan on failure | High | Long-horizon tasks | TrajectoryScore, GoalProgress |

When NOT to use ReAct

ReAct is a default, not a universal answer. In 2026 we recommend skipping ReAct entirely in three cases:

  • Single deterministic tool calls: a function-calling pattern with no reasoning needed. ReAct here is pure overhead.
  • Workflows where the plan is known upfront: use plan-and-execute or DAG-based orchestration (LangGraph) and skip the per-step thought. The plan IS the reasoning.
  • Latency-critical voice agents: emitting a thought before each action adds 200-600 ms; for voice, that destroys the conversation feel. Use direct tool calls with ActionSafetyEval as the guardrail.

The right framing: ReAct is the right pattern when you cannot pre-compute the plan and the cost of a wrong action exceeds the cost of explicit reasoning. Otherwise, simpler patterns ship faster and cost less.
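The selection rule in this section can be written as a short decision function. It is a sketch under assumptions: the boolean inputs and the two cost figures are things a team would have to estimate for itself, and `choose_pattern` is a hypothetical helper.

```python
def choose_pattern(single_deterministic_call: bool,
                   plan_known_upfront: bool,
                   latency_critical: bool,
                   wrong_action_cost: float,
                   reasoning_cost: float) -> str:
    """Pick an agent control pattern following the three skip cases and the framing above."""
    if single_deterministic_call or latency_critical:
        return "direct tool call"
    if plan_known_upfront:
        return "plan-and-execute"
    # The plan cannot be pre-computed: use ReAct only if mistakes cost more than thoughts.
    if wrong_action_cost > reasoning_cost:
        return "ReAct"
    return "direct tool call"


refund_path = choose_pattern(False, False, False, wrong_action_cost=50.0, reasoning_cost=0.02)
voice_path = choose_pattern(False, False, True, wrong_action_cost=50.0, reasoning_cost=0.02)
```

The two costs only need to be in the same unit (for example, dollars per request); the function compares them and never interprets the scale.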

How to measure or detect ReAct quality

ReAct’s three-part structure gives you three independent signals; measure each:

  • ReasoningQuality (local-metric) and ReasoningQualityEval (framework-eval): score logical validity of each thought given prior observations.
  • ToolSelectionAccuracy: scores whether the action following each thought was the correct tool.
  • TaskCompletion: end-to-end success rate on the trajectory.
  • TrajectoryScore: aggregated per-trajectory score across reasoning, tool selection, and progress.
  • StepEfficiency: measures whether the agent reached the goal in a reasonable number of steps.
  • GoalProgress: per-step progress signal toward the user’s stated goal.
  • ActionSafety: catches actions that are valid but unsafe (e.g., refund issued without policy check).
  • Thought-action consistency (custom): % of triplets where the action follows logically from the thought.
  • Thought-token cost (dashboard signal): total reasoning tokens per trace; spikes flag verbose loops.
  • agent.trajectory.step (OTel attribute): per-iteration tag; combine with span kind to slice thought vs. action spans.
  • Max-iteration breach rate: % of traces that hit the iteration cap without completing, a leading indicator of agent loop bugs.

Minimal Python:

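from fi.evals import ReasoningQuality, ToolSelectionAccuracy, TrajectoryScore

# user_query, react_spans, expected, and user_goal are placeholders for your
# own data: the user's request, the captured ReAct spans, the expected tools,
# and the goal the trajectory is graded against.

# Score the logical validity of the thoughts in the trajectory.
reasoning = ReasoningQuality().evaluate(
    input=user_query,
    trajectory=react_spans,
)
# Score whether the actions used the expected tools.
tool_acc = ToolSelectionAccuracy().evaluate(
    trajectory=react_spans,
    expected_tools=expected,
)
# Aggregate score for the trajectory against the goal.
traj = TrajectoryScore().evaluate(
    trajectory=react_spans,
    goal=user_goal,
)
print(reasoning.score, tool_acc.score, traj.score)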
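The dashboard-style signals in the list above (max-iteration breach rate and thought-token cost) need no evaluator library and can be computed directly from trace summaries. The sketch below assumes each trace has already been reduced to an iteration count, a completion flag, and a thought-token total; `Trace` and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    iterations: int      # number of ReAct iterations the agent ran
    completed: bool      # whether the agent emitted a final answer
    thought_tokens: int  # total reasoning tokens across the trace


def max_iteration_breach_rate(traces: list[Trace], cap: int) -> float:
    """Fraction of traces that hit the iteration cap without completing."""
    if not traces:
        return 0.0
    breaches = sum(1 for t in traces if t.iterations >= cap and not t.completed)
    return breaches / len(traces)


def mean_thought_tokens(traces: list[Trace]) -> float:
    """Average reasoning tokens per trace; spikes suggest verbose loops."""
    if not traces:
        return 0.0
    return sum(t.thought_tokens for t in traces) / len(traces)


traces = [
    Trace(3, True, 200),
    Trace(8, False, 900),  # hit the cap of 8 without finishing
    Trace(8, True, 700),   # reached the cap but did finish
    Trace(2, True, 200),
]
breach_rate = max_iteration_breach_rate(traces, cap=8)
avg_thought_tokens = mean_thought_tokens(traces)
```

Counting only traces that both reached the cap and failed to complete keeps long-but-successful runs out of the breach rate.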

Common mistakes (May 2026 edition)

  • Letting thoughts hide in the prompt. If thoughts aren’t captured as their own span, you lose the legibility that makes ReAct worth using; instrument them explicitly via traceAI.
  • Evaluating actions only. A wrong action with a reasonable thought is a tool-spec bug; a right action with a broken thought is a model bug. Distinguish them with ReasoningQuality vs ToolSelectionAccuracy.
  • No max-iteration cap. ReAct loops without bounds are runaway-cost candidates; cap turn count and alert on max-iteration breach rate.
  • Treating thought tokens as free. Verbose thoughts can double inference cost; track thought-token cost per trace.
  • Pinning to ReAct everywhere. Plan-and-execute or direct tool calls beat ReAct on predictable workflows; match the pattern to the task.
  • Double-counting reasoning with native reasoning models. GPT-5.x reasoning mode already emits CoT internally; adding “let’s think step by step” in the ReAct prompt wastes tokens and sometimes confuses the model.
  • Skipping prompt injection checks on tool outputs. A ReAct agent that fetches the web is one indirect-injection payload away from compromise. Wire PromptInjection or ProtectFlash on every tool.output before it re-enters the planner.
  • No regression eval on the planner prompt. The planner is the most consequential prompt: one regression there reshapes every trajectory. Pin a golden dataset of planner inputs and score with ReasoningQuality.
  • Ignoring agent loop detection. ReAct agents that get confused often re-call the same tool in a circle. Score loop-rate as a first-class signal.
  • Treating τ-bench / SWE-Bench Verified as the only ground truth. Public agent benchmarks are useful tier-filters, but your customer-support trajectories are not τ-bench retail. Pair public scores with a domain Scenario suite.
  • No regression set for tool-spec changes. A tool description edit is a behavioral change as significant as a prompt edit. Run the ReAct regression eval whenever the tool registry changes.
  • Ignoring agent handoff boundaries. Multi-agent ReAct stacks have handoffs that lose the agent memory trail. Score handoff fidelity with Faithfulness on the handoff message.
  • Treating planner-only patterns and ReAct as drop-in replacements. They are not. Plan-and-execute is brittle under unexpected tool failures; ReAct degrades gracefully. Match the pattern to the failure tolerance, not to the framework default.
  • Scoring only the last action. A trajectory that gets the right answer through a wasteful path is worse than a fast direct path. StepEfficiency penalises waste; aggregate it into TrajectoryScore.
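Several of the mistakes above come down to missing signals that are cheap to compute. As one example, the loop-rate signal can be approximated by flagging trajectories that repeat an identical tool call. This is a heuristic sketch, not a FutureAGI evaluator: `has_tool_loop`, `loop_rate`, and the repeat threshold of 3 are assumptions.

```python
from collections import Counter


def has_tool_loop(calls: list[tuple[str, str]], max_repeats: int = 3) -> bool:
    """Flag a trajectory where the same (tool, arguments) call occurs max_repeats times or more."""
    counts = Counter(calls)
    return any(n >= max_repeats for n in counts.values())


def loop_rate(trajectories: list[list[tuple[str, str]]], max_repeats: int = 3) -> float:
    """Fraction of trajectories flagged as looping."""
    if not trajectories:
        return 0.0
    flagged = sum(1 for calls in trajectories if has_tool_loop(calls, max_repeats))
    return flagged / len(trajectories)


# Each call is (tool_name, serialized_arguments).
looping = [("search", "q=refund"), ("search", "q=refund"), ("search", "q=refund")]
healthy = [("search", "q=refund"), ("check_policy", "id=1"), ("issue_refund", "id=1")]
rate = loop_rate([looping, healthy])
```

Comparing arguments as well as tool names avoids flagging agents that legitimately call the same tool several times with different inputs.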

2026 frontier-lab agent eval methodology, applied locally

Frontier-lab agent evals (Anthropic’s τ-bench results, OpenAI’s SWE-Bench Verified release notes, Google’s GAIA results, DeepMind’s OSWorld scores) converged on the same methodology between 2024 and 2026:

  1. Define the task as a closed-loop trajectory with a measurable success criterion (not just a chat output).
  2. Pin the environment: simulated user, simulated database state, simulated browser.
  3. Score the trajectory at multiple layers: final answer, tool-selection correctness, reasoning quality, step count, cost.
  4. Run at scale: hundreds to thousands of trajectories per release candidate.
  5. Slice by cohort: task type, difficulty, modality, persona.
  6. Compare to baseline: the previous shipped model’s trajectory on the same rows.

This is exactly the pattern FutureAGI exposes through Dataset + Scenario + the evaluator suite. The frontier labs do not have access to magical eval tooling; they have access to disciplined eval methodology. The same methodology is now available to any team that wires the surfaces together.
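Steps 5 and 6 of the methodology (slice by cohort, compare to baseline) reduce to a small aggregation once each trajectory has a cohort label and a pass/fail outcome. The sketch below uses plain Python rather than the Dataset or Scenario APIs; the function names, the sample rows, and the 0.02 tolerance are all hypothetical.

```python
from collections import defaultdict


def success_rate_by_cohort(rows):
    """rows: iterable of (cohort, succeeded) pairs -> {cohort: success rate}."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [successes, count]
    for cohort, succeeded in rows:
        totals[cohort][0] += int(succeeded)
        totals[cohort][1] += 1
    return {cohort: wins / count for cohort, (wins, count) in totals.items()}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Cohorts where the candidate trails the baseline by more than the tolerance."""
    return {
        cohort: candidate[cohort] - baseline[cohort]
        for cohort in baseline
        if cohort in candidate and baseline[cohort] - candidate[cohort] > tolerance
    }


baseline_rows = ([("refund", True)] * 9 + [("refund", False)]
                 + [("exchange", True)] * 8 + [("exchange", False)] * 2)
candidate_rows = ([("refund", True)] * 7 + [("refund", False)] * 3
                  + [("exchange", True)] * 8 + [("exchange", False)] * 2)

baseline_rates = success_rate_by_cohort(baseline_rows)
candidate_rates = success_rate_by_cohort(candidate_rows)
regressed = regressions(baseline_rates, candidate_rates)
```

In this made-up data the overall success rate moves only from 85% to 75%, while the cohort cut shows the whole drop sits in the refund cohort, which is the kind of localisation an aggregate score hides.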

Frequently Asked Questions

What is the ReAct pattern?

ReAct is an agent control pattern that interleaves reasoning thoughts and acting steps (usually tool calls) in a loop, with each action's observation feeding into the next thought, until the model emits a final answer.

How is ReAct different from chain-of-thought?

Chain-of-thought is reasoning only: the model writes its thoughts but does not act. ReAct extends it by interleaving acting steps, so the agent can call tools, observe results, and update its reasoning. CoT is a single mental pass; ReAct is a loop.

How do you measure a ReAct agent?

FutureAGI uses ReasoningQuality on each thought, ToolSelectionAccuracy on each action, and TaskCompletion on the trajectory; per-step spans expose every triplet for debugging.