Agent Architecture Patterns in 2026: The Five Named Shapes

Five agent architecture patterns in 2026: ReAct, plan-then-execute, supervisor and workers, graph, event-driven. Four-axis tradeoff per pattern.

February 10, 2025

Updated May 20, 2026

13 min read

agent-architecture react-pattern plan-execute supervisor-worker langgraph event-driven-agents 2026

A refund agent built on a ReAct loop handles 87 percent of customer requests in 1.4 seconds across three tool calls. The same agent burns through 14 seconds and 19 tool calls on the next 11 percent because the model gets stuck comparing two refund policies in circles. The remaining 2 percent never finish; the loop hits the token budget. Swap the same logic into a plan-then-execute shape and the loop disappears, because the planner emits “check policy, look up order, calculate refund, escalate if over $500” up front and the executor runs each step once. Now 95 percent finish in 2.1 seconds and 5 percent escalate cleanly. The architecture change saved more latency and tokens than any prompt tuning would.

Agent architecture is a four-axis tradeoff: latency, recovery, observability, complexity. Every named pattern picks three of the four and pays the tax on the fourth. This guide names the five patterns that carry production agents in 2026, the axes each one wins on, and the failure mode each one buys with its tradeoff.

TL;DR: The four-axis tradeoff in one table

Pattern	Latency	Recovery	Observability	Complexity tax
ReAct loop	Slow on hard tasks	Step-by-step, local	High (flat trace)	Low
Plan-then-execute	Fast when plan holds	Brittle on intermediate failure	High (plan + steps)	Medium
Supervisor and workers	Medium, parallelisable	Per-worker, coordinated	Medium (cross-agent)	High
Graph (LangGraph-style)	Medium, node-bounded	Excellent, edge-conditional	Excellent (tree)	High
Event-driven	Excellent (parallel)	Strong, bus-managed	Hard (cross-service)	Highest

If you read one row: most production agents in 2026 are graph or supervisor at the outer layer with ReAct loops at the leaves, and event-driven only appears when throughput beats trace clarity as the binding constraint.

The four axes, defined

The four axes are what the patterns are actually trading off. Score the workload against each before you pick.

Latency budget. ReAct adds a model call per Thought-Action cycle; plan-then-execute spends one extra round on the planner; graph spends a model call per node transition; event-driven hides latency behind queue depth.

Recovery cost. When step five fails, can the agent retry step five, or does the trajectory restart from step one? ReAct recovers locally. Plan-then-execute restarts from the planner when a result invalidates the plan. Graph recovers per edge if the topology models failure paths explicitly. Event-driven recovers at the bus through dead-letter queues and retry policies.

Observability needs. ReAct gives one clean flat trace per request. Graph gives one clean tree. Event-driven scatters one request across queues, services, and consumer groups, and the trace only reassembles if every producer propagates OTel context on the message header.

Complexity tolerance. ReAct is a few hundred lines and a step budget. Plan-then-execute adds a planner prompt and an executor loop. Supervisor adds a dispatcher plus per-worker prompts, tools, and rubrics. Graph adds a typed state graph, node definitions, edge predicates, and a runtime. Event-driven adds a message bus, schema registry, and a tracing discipline that has to hold across every service.

Pick the three axes that matter for the workload. The fourth is the tax.

Pattern 1: ReAct loop

ReAct is the reason-act loop from the 2022 paper “ReAct: Synergizing Reasoning and Acting in Language Models”. One LLM, one prompt that asks it to emit Thought, Action, Observation, Thought, and so on until Final Answer. The runtime parses the Action, calls the tool, returns the Observation, and re-prompts. The loop ends on Final Answer or a step budget.
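The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Action: tool[argument]` text protocol, the `react_loop` function, and the scripted stand-in for the model are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical text protocol: "Action: tool_name[argument]" and "Final Answer: ...".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.S)


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 12,
) -> Tuple[str, str]:
    """Run Thought/Action/Observation cycles; return (answer, termination_reason)."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript)  # one model call per cycle
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), "final_answer"
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: no parseable Action line\n"
            continue
        name, arg = action.groups()
        try:
            observation = tools[name](arg)
        except Exception as exc:  # unknown tool or tool failure is fed back to the model
            observation = f"error: {exc!r}"
        transcript += f"Observation: {observation}\n"
    return "", "step_budget"


# Scripted stand-in for a model, so the sketch runs without a provider.
replies = iter([
    "Thought: I need the order total.\nAction: lookup_order[A17]",
    "Thought: I have what I need.\nFinal Answer: refund $20",
])
answer, reason = react_loop(
    lambda prompt: next(replies),
    {"lookup_order": lambda order_id: f"order {order_id} total $20"},
    "Refund order A17",
)
```

Returning the termination reason alongside the answer keeps the step-budget exit distinguishable from a real final answer, which is the signal worth tracking per task.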

Wins on. Latency for simple tasks, recovery, observability, complexity. The trace is the conversation history. Every Thought is a span event; every Action is a tool span. Recovery is local: a tool fails on step four, the model reads the error in the next Observation and adjusts. Implementation fits in a few hundred lines.

Loses on. Latency for hard tasks. The loop is unbounded by default. The model can retry the same tool with the same arguments, reason in circles, or hit the step budget without finishing. Context grows linearly with steps because every prior Thought and Observation appends to it, and since each call re-sends that context, cumulative input-token cost grows roughly quadratically with step count (before any prompt caching).
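A rough worked example of the token growth: each call re-sends the whole transcript, so the per-call prompt grows by a roughly fixed amount per step while the total across the loop grows with the square of the step count. The prompt sizes below (500 base tokens, 300 tokens per Thought and Observation pair) are illustrative assumptions, and the sketch ignores prompt caching and output tokens.

```python
def cumulative_input_tokens(steps: int, base_prompt: int, per_step: int) -> int:
    """Total prompt tokens sent across a loop that re-sends the full history each call."""
    # Call k (0-indexed) sends the base prompt plus k earlier Thought/Observation pairs.
    return sum(base_prompt + k * per_step for k in range(steps))


short_run = cumulative_input_tokens(steps=3, base_prompt=500, per_step=300)
long_run = cumulative_input_tokens(steps=19, base_prompt=500, per_step=300)
```

Under these assumed sizes the 3-step trajectory sends 2,400 input tokens and the 19-step trajectory sends 60,800, so about six times the steps costs about 25 times the input tokens.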

Use when. The task resolves in one to five tool calls, the next step depends on the previous result, and you want the simplest agent that uses tools. ReAct is also the right pick for the leaves of a graph or supervisor.

Instrument. Track thought-action ratio, step count per task, and termination reason (final answer versus step budget versus error). traceAI auto-instruments OpenAI Agents, LangChain, CrewAI, and AutoGen so the spans appear without manual code.

Pattern 2: Plan-then-execute

A planner LLM reads the task and emits an ordered list of sub-steps with arguments. An executor (another LLM, a tool dispatcher, or a sub-agent) iterates the plan, runs each step, and returns the final answer.
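As a minimal sketch of the two-phase shape, assuming a planner that returns a list of named steps and tools that read earlier results from a shared context: `Step`, `plan_then_execute`, `stub_planner`, and the tool names are hypothetical, and a real planner would be a model call rather than a hard-coded list.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


Context = Dict[str, Any]


def plan_then_execute(
    task: str,
    planner: Callable[[str], List[Step]],
    tools: Dict[str, Callable[..., Any]],
) -> Tuple[List[Step], Context]:
    plan = planner(task)  # one planner call, before any tool runs
    context: Context = {"task": task}
    for step in plan:  # each step runs exactly once, in plan order
        context[step.tool] = tools[step.tool](context, **step.args)
    return plan, context


def stub_planner(task: str) -> List[Step]:
    # Stand-in for the planner LLM; the plan is data you can log or reject.
    return [
        Step("look_up_order", {"order_id": "A17"}),
        Step("calculate_refund"),
        Step("escalate_if_over", {"limit": 500}),
    ]


refund_tools = {
    "look_up_order": lambda ctx, order_id: {"order_id": order_id, "total": 620.0},
    "calculate_refund": lambda ctx: ctx["look_up_order"]["total"],
    "escalate_if_over": lambda ctx, limit: ctx["calculate_refund"] > limit,
}
plan, context = plan_then_execute("Refund order A17", stub_planner, refund_tools)
```

Because the plan is returned as plain data, it can be logged or reviewed before execution; the sketch deliberately has no replan path, which is the recovery gap the pattern pays for.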

Wins on. Observability for tasks that decompose cleanly. The plan is a single artifact you can log, audit, edit, or reject before any tool runs. Execution is deterministic once the plan is fixed; replay is straightforward. The trace shows a planner span followed by N child step spans, so failure attribution is immediate.

Loses on. Recovery. When step three returns a result that should change steps four through seven, the planner has to re-plan, and the half-executed trajectory is hard to splice back in. Real production tasks have plan-execute-replan cycles, which erodes the determinism that made the pattern attractive.

Use when. The task is decomposable in advance (form-filling, report generation, multi-step CRUD workflows), latency is not the binding constraint, and you want the plan as an inspectable artifact for compliance or human review.

Instrument. Plan span carries the structured plan as an attribute. Each execution step is a child span with parent_span_id pointing at the plan. Two metrics that matter: plan adherence and plan quality. The Groundedness evaluator from the ai-evaluation SDK scores execution steps against the plan as context; a CustomLLMJudge scores plan quality against the user goal.

Pattern 3: Supervisor and workers

A supervisor agent receives the user request, decides which specialised worker to dispatch to, and integrates the worker’s output. Workers have their own prompts, tool registries, and eval rubrics. Refund worker, escalation worker, FAQ worker. The supervisor can dispatch in sequence or in parallel.
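A small sketch of parallel dispatch, assuming workers are async callables and the supervisor's routing decision is a plain function. The keyword router stands in for the supervisor's model call, and all names here (`supervise`, `keyword_router`, the two workers) are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Worker = Callable[[str], Awaitable[str]]


async def supervise(
    request: str,
    route: Callable[[str], List[str]],
    workers: Dict[str, Worker],
) -> Dict[str, str]:
    names = route(request)  # dispatch decision
    # Independent sub-tasks run concurrently; results come back in dispatch order.
    outputs = await asyncio.gather(*(workers[name](request) for name in names))
    return dict(zip(names, outputs))


async def refund_worker(request: str) -> str:
    await asyncio.sleep(0.01)  # stands in for the worker's own model and tool calls
    return "refund: eligible"


async def faq_worker(request: str) -> str:
    await asyncio.sleep(0.01)
    return "faq: 30-day policy"


def keyword_router(request: str) -> List[str]:
    picked = [name for name in ("refund", "faq") if name in request.lower()]
    return picked or ["faq"]


result = asyncio.run(
    supervise(
        "Refund status and FAQ link?",
        keyword_router,
        {"refund": refund_worker, "faq": faq_worker},
    )
)
```

The integration step is left out on purpose; in practice the supervisor would merge the per-worker outputs into one reply, and that merge is a second place to evaluate.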

Wins on. Latency through parallelism, observability per domain, recovery per worker. Specialisation keeps the supervisor prompt small. Parallel dispatch on independent sub-tasks is the obvious win. Each worker has its own rubric, so eval failures attribute to the worker that regressed.

Loses on. Complexity. Coordination overhead is real. Latency stacks (supervisor reasoning plus worker reasoning) when dispatches are serial. Eval is a two-layer problem: per-worker output rubrics plus a dispatch-accuracy rubric on the supervisor. A supervisor that routes 95 percent correctly but 5 percent to the wrong worker hides production bugs behind a green aggregate.
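To make the green-aggregate problem concrete, here is a sketch that breaks dispatch accuracy down by the worker that should have been chosen. The labelled records are invented for illustration, and `dispatch_accuracy` is a hypothetical helper, not part of any SDK.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def dispatch_accuracy(records: List[Tuple[str, str]]) -> Dict[str, float]:
    """Per-worker accuracy from (expected_worker, actual_worker) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for expected, actual in records:
        totals[expected] += 1
        hits[expected] += int(expected == actual)
    return {worker: hits[worker] / totals[worker] for worker in totals}


# Invented sample: every escalation request is misrouted to the FAQ worker.
records = (
    [("faq", "faq")] * 90
    + [("refund", "refund")] * 5
    + [("escalation", "faq")] * 5
)
aggregate = sum(expected == actual for expected, actual in records) / len(records)
per_worker = dispatch_accuracy(records)
```

In this sample the aggregate is 95 percent while the escalation worker is never reached, which is the kind of regression a single dispatch score hides.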

Use when. Sub-tasks have genuinely distinct domains with different prompts, tools, or compliance rules. The single-agent prompt has gotten too large or too slow.

Instrument. Supervisor span is the parent. Worker spans are children. Track dispatch accuracy and integration correctness. traceAI’s A2A_CLIENT and A2A_SERVER span kinds capture agent-to-agent relationships across the boundary, so multi-agent traces stay readable as a tree.

Pattern 4: Graph (LangGraph-style)

The graph pattern makes the architecture an explicit state machine. Nodes are agent steps, tool calls, or sub-agents. Edges are transitions, sometimes conditional. You declare the topology upfront and the runtime walks it. LangGraph is the dominant 2026 example: add_node("call_model", agent_node), add_conditional_edges("call_model", route_decision), compile() to a runnable.
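The idea of topology as data can be shown without any framework. The sketch below is plain Python and is not the LangGraph API: `walk`, `END`, and the node functions are hypothetical stand-ins, with a dict of nodes playing the role of add_node and a dict of routing functions playing the role of add_conditional_edges.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
END = "__end__"  # sentinel meaning "stop walking"


def walk(
    nodes: Dict[str, Callable[[State], State]],
    edges: Dict[str, Callable[[State], str]],
    entry: str,
    state: State,
    max_transitions: int = 20,
) -> Tuple[State, List[str]]:
    path: List[str] = []
    current = entry
    while current != END:
        if len(path) >= max_transitions:
            raise RuntimeError(f"transition budget exceeded after {path}")
        state = nodes[current](state)    # run the node
        path.append(current)
        current = edges[current](state)  # conditional edge picks the next node
    return state, path


def call_model(state: State) -> State:
    # Stand-in for a model call: ask for a tool first, then answer.
    if "tool_result" in state:
        return {**state, "tool_call": None, "answer": f"refund ${state['tool_result']}"}
    return {**state, "tool_call": "lookup_order"}


def call_tool(state: State) -> State:
    return {**state, "tool_result": 20}


nodes = {"call_model": call_model, "call_tool": call_tool}
edges = {
    "call_model": lambda s: "call_tool" if s["tool_call"] else END,
    "call_tool": lambda s: "call_model",
}
final_state, path = walk(nodes, edges, "call_model", {"task": "Refund order A17"})
```

The returned path is the list of nodes visited in order, which is the raw material for a topology-shaped trace; a real runtime would also persist the state at each transition so a partial walk can be replayed.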

Wins on. Observability and recovery. The graph is a value, not a control flow. That changes debugging. Every transition is a parent-child relationship in the trace, and the trace renders as the actual topology. When the agent breaks, the failing node is at the top of the tree. Edges can route to fallback nodes, retry nodes, or human-review nodes. State is a typed dict, so partial state is replayable.

Loses on. Complexity and upfront latency. You commit to the topology before the agent runs, which means anticipating control flow ReAct would discover at runtime. Adding a branch is a code change, not a prompt change. Per-node LLM calls stack: a five-node graph spends five model round-trips even when a ReAct loop might have finished in three.

Use when. The team is willing to invest in a state-machine model in exchange for explicit recovery paths and tree-shaped traces. Compliance workflows (KYC, fraud review, multi-step claims) are the natural fit because the topology is the audit trail.

Instrument. traceAI ships a LangGraphInstrumentor that surfaces node_count, conditional-edge topology, and per-node latency directly on the trace. Render the trace as the tree the graph already is, not as a flat list, or you have given up the main reason to use the pattern.

Pattern 5: Event-driven

Agents subscribe to event types on a message bus and publish results back as new events. One agent publishes CustomerEscalated; a routing agent and a CRM-update agent both consume it and publish their own events. There is no single supervisor and no global plan. The architecture is the choreography of subscriptions.
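The choreography can be sketched with an in-memory bus. This is a single-process toy under stated assumptions: `Bus` is a hypothetical class standing in for Kafka, NATS, or a queue service, and `TicketRouted` and `CrmUpdated` are invented event names (only CustomerEscalated comes from the text above).

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple


class Bus:
    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[["Bus", dict], None]]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.delivered: List[str] = []

    def subscribe(self, event_type: str, handler: Callable[["Bus", dict], None]) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.queue.append((event_type, payload))

    def drain(self, max_events: int = 100) -> None:
        processed = 0
        while self.queue and processed < max_events:
            event_type, payload = self.queue.popleft()
            self.delivered.append(event_type)
            for handler in self.subscribers.get(event_type, []):
                handler(self, payload)  # handlers may publish follow-up events
            processed += 1


def routing_agent(bus: Bus, payload: dict) -> None:
    bus.publish("TicketRouted", {"ticket": payload["ticket"], "queue": "tier2"})


def crm_agent(bus: Bus, payload: dict) -> None:
    bus.publish("CrmUpdated", {"ticket": payload["ticket"]})


bus = Bus()
bus.subscribe("CustomerEscalated", routing_agent)
bus.subscribe("CustomerEscalated", crm_agent)
bus.publish("CustomerEscalated", {"ticket": "T-1"})
bus.drain()
```

Neither agent knows the other exists; the only coupling is the event type, which is what lets each consumer scale and ship independently on a real bus.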

Wins on. Latency through parallelism, recovery at the bus, horizontal scale. Agents are decoupled by topic, so each scales independently. The bus handles retry, dead-letter, and idempotency through standard primitives (Kafka consumer groups, NATS JetStream, Redis Streams, SQS with DLQs). One slow consumer doesn’t stall the producer.

Loses on. Observability and complexity. A single user request becomes 40 spans across 12 services and four message hops. The trace only reassembles if every producer propagates W3C trace context on the message header and every consumer attaches its span to the propagated context. Get propagation wrong on one hop and the trace splits in two.

Use when. Throughput beats per-request trace clarity. High-volume customer-service automation, real-time enrichment pipelines, back-office workflows where each agent’s work is independently valuable. Reach for it when ReAct or graph patterns can’t sustain the request rate, not before.

Instrument. Propagate trace context as a traceparent header on every message. Use OTel semantic conventions for messaging spans (messaging.system, messaging.destination.name, messaging.operation.type; older convention releases named the last two messaging.destination and messaging.operation). Event-driven without disciplined context propagation is event-driven without observability.
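A stdlib-only sketch of what propagation means at the header level, following the W3C traceparent layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags). In a real system the OpenTelemetry propagator would do this; `inject` and `continue_trace` are hypothetical helpers written for illustration, and the sketch skips details such as rejecting all-zero ids.

```python
import re
import secrets
from typing import Dict

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    # Fresh trace id and span id, sampled flag set.
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def inject(headers: Dict[str, str], traceparent: str) -> Dict[str, str]:
    """Producer side: copy the current context onto the outgoing message headers."""
    return {**headers, "traceparent": traceparent}


def continue_trace(headers: Dict[str, str]) -> str:
    """Consumer side: keep the trace id, mint a new span id for the consumer span."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return new_traceparent()  # missing or malformed header: the trace splits here
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


producer = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message_headers = inject({"content-type": "application/json"}, producer)
consumer = continue_trace(message_headers)
orphan = continue_trace({"content-type": "application/json"})
```

The `orphan` case is the failure described above: one hop that drops the header starts a new trace id, and the request shows up as two unrelated traces.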

The decision framework: which pattern wins which job

Score the workload on four axes, then read the table.

If your binding constraint is…	And you can afford to pay…	Pick
Iteration speed, simple loops	Latency on hard tasks	ReAct loop
Inspectable plan before execution	Replan complexity	Plan-then-execute
Per-domain specialisation	Coordination overhead	Supervisor and workers
Explicit recovery, tree traces	Upfront topology design	Graph (LangGraph-style)
Throughput and parallel scale	Cross-service trace stitching	Event-driven

Two rules that fall out of this table.

Use hybrids for real workloads. Pure patterns are rare in production. The most common 2026 shape is a graph or supervisor at the outer layer with ReAct loops at the leaves. A second common hybrid is plan-then-execute with ReAct execution: the planner generates a sketch, ReAct handles each step, the planner re-runs when the sketch goes stale.

Don’t pick by what’s trending. LangGraph stars don’t make graph the right pattern for a three-tool support agent. A message bus with five Kafka topics is not the right pattern for a personal research assistant. Score the workload first.

Common mistakes when choosing

Defaulting to ReAct. ReAct is the most flexible loop, but it is neither the cheapest option nor the most observable one for work that is shaped like a graph. For tasks that need one to three tool calls, the native tool-calling loop in the provider SDK is simpler.
Going graph too early. Graph buys explicit topology at the cost of state-machine maintenance. If the workflow fits in a one-page ReAct loop, the graph is overhead.
Going event-driven for observability reasons. Event-driven costs observability, it doesn’t buy it. Reach for it when throughput is the binding constraint, not before.
Skipping the plan on multi-step tasks. Tasks with five or more deterministic steps benefit from a plan-then-execute layer. Pure ReAct tends to loop on long tasks.
No step budget. ReAct and tool-augmented loops need a hard cap. Without it, edge cases burn through context windows.
Flat trace rendering. A flat span list for a graph or supervisor agent is a debugging nightmare. Render the trace as the actual tree.
One aggregate eval score across patterns. ReAct trajectory eval is not supervisor dispatch eval. Per-pattern rubrics are non-negotiable.
Coupling the loop to the framework. A LangChain ReAct loop is hard to port to LangGraph. Keep the agent logic separable from the framework primitives.

Production hardening, regardless of pattern

Five practices that turn an experimental agent into a production one.

Step budget per pattern. ReAct: 12-15 steps max. Plan-then-execute: plan length plus 50 percent buffer. Supervisor: 5 dispatches max per call. Graph: per-node retry cap plus a wall-clock budget on the walk. Event-driven: per-consumer retry cap with dead-letter routing.
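These budgets are easiest to enforce when they live in one place. A minimal sketch, assuming the upper end of the ReAct range and rounding the 50 percent plan buffer up; `Budgets`, `plan_step_budget`, and `within_budget` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    react_max_steps: int = 15          # upper end of the 12-15 range
    supervisor_max_dispatches: int = 5


def plan_step_budget(plan_length: int) -> int:
    """Plan length plus a 50 percent buffer, rounded up (the rounding is an assumption)."""
    return plan_length + math.ceil(plan_length * 0.5)


def within_budget(steps_taken: int, budget: int) -> bool:
    return steps_taken < budget


budgets = Budgets()
four_step_plan = plan_step_budget(4)
five_step_plan = plan_step_budget(5)
```

With these choices a four-step plan gets six executions and a five-step plan gets eight before the run is cut off.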

Per-pattern eval gates. Score tool selection and argument correctness for every pattern; the four-layer eval stack for tool-calling agents covers the rubric design. Add dispatch accuracy for supervisor; node-level transitions for graph; event ordering for event-driven. The ai-evaluation SDK ships seven AgentTrajectoryInput metrics (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) plus deterministic function_name_match, parameter_validation, and function_call_accuracy for sub-millisecond local checks.

Trajectory replay in CI. Replay 200 to 500 known production traces against the candidate. Compare per-step rubric scores against incumbent. A model swap that changes trajectory shape without moving aggregate accuracy is a yellow flag.
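One way to encode the yellow-flag rule as a CI gate is sketched below. The trace record shape (`passed`, `steps`), the `replay_gate` function, and both tolerances are assumptions chosen for illustration; mean step count is used here as a crude proxy for trajectory shape.

```python
from typing import Dict, List, Union

Trace = Dict[str, Union[bool, int]]


def _mean(traces: List[Trace], key: str) -> float:
    return sum(t[key] for t in traces) / len(traces)


def replay_gate(
    incumbent: List[Trace],
    candidate: List[Trace],
    accuracy_tol: float = 0.01,
    shape_tol: float = 0.10,
) -> str:
    """Return 'red', 'yellow', or 'green' for a candidate replayed on known traces."""
    acc_old, acc_new = _mean(incumbent, "passed"), _mean(candidate, "passed")
    steps_old, steps_new = _mean(incumbent, "steps"), _mean(candidate, "steps")
    if acc_new < acc_old - accuracy_tol:
        return "red"     # accuracy regressed
    if abs(steps_new - steps_old) / steps_old > shape_tol:
        return "yellow"  # same accuracy, different trajectory shape
    return "green"


incumbent = [{"passed": True, "steps": 3}] * 4
same_shape = [{"passed": True, "steps": 3}] * 4
longer_paths = [{"passed": True, "steps": 5}] * 4
regressed = [{"passed": True, "steps": 3}] * 3 + [{"passed": False, "steps": 3}]

verdicts = [
    replay_gate(incumbent, same_shape),
    replay_gate(incumbent, longer_paths),
    replay_gate(incumbent, regressed),
]
```

A fuller gate would compare per-step rubric scores rather than step counts alone, but the three-way verdict is the part that maps onto a CI status.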

Observability discipline. OTel spans for every agent call, tool call, dispatch, node transition, event. State diffs as span events. Tree-rendered traces, not flat lists. For event-driven, traceparent propagation on every message header is the price of admission.

Human-in-the-loop checkpoints. For high-risk tasks, pause and require approval. The pattern is a span with status PENDING_HUMAN that the agent can resume on approval. LangGraph models this as a native interrupt; ReAct loops have to bolt it on.
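For a loop that has to bolt this on, the checkpoint can be as small as a status field and a resume function. The sketch below is an assumption-laden illustration: `Checkpoint`, `propose`, and `resume` are hypothetical, the other status names are invented around PENDING_HUMAN, and the 500 threshold echoes the refund example at the top of this guide.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "RUNNING"
    PENDING_HUMAN = "PENDING_HUMAN"
    DONE = "DONE"
    REJECTED = "REJECTED"


@dataclass
class Checkpoint:
    action: str
    amount: float
    status: Status = Status.RUNNING


def propose(action: str, amount: float, approval_threshold: float = 500.0) -> Checkpoint:
    """Pause high-risk actions for approval; let low-risk ones complete."""
    checkpoint = Checkpoint(action, amount)
    checkpoint.status = Status.PENDING_HUMAN if amount > approval_threshold else Status.DONE
    return checkpoint


def resume(checkpoint: Checkpoint, approved: bool) -> Checkpoint:
    if checkpoint.status is not Status.PENDING_HUMAN:
        raise ValueError("only a paused checkpoint can be resumed")
    checkpoint.status = Status.DONE if approved else Status.REJECTED
    return checkpoint


small = propose("refund", 120.0)
large = propose("refund", 620.0)
paused_status = large.status
large = resume(large, approved=True)
```

In production the checkpoint would be persisted with the agent state so the run can resume after the process that paused it has exited.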

How Future AGI fits across all five patterns

The eval and observability stack stays the same shape across all five patterns. The pattern decides the trace topology; the platform reads any topology.

traceAI (Apache 2.0) ships 14 span kinds (AGENT, TOOL, RETRIEVER, LLM, CHAIN, RERANKER, EMBEDDING, GUARDRAIL, EVALUATOR, VECTOR_DB, CONVERSATION, A2A_CLIENT, A2A_SERVER, UNKNOWN) across 50+ AI surfaces in Python, TypeScript, Java, and C#. Auto-instrumentation covers LangChain, LangGraph (via LangGraphInstrumentor), CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, DSPy, Smolagents, BeeAI, and Strands. For event-driven systems, traceAI honours W3C traceparent propagation so a producer span in service A attaches to a consumer span in service B without manual stitching.

ai-evaluation SDK (Apache 2.0) covers per-pattern rubrics with 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, ConversationCoherence, ConversationResolution, Groundedness, ContextAdherence, and 11 CustomerAgent* templates for vertical-specific failure modes. Eval scores attach to spans via EvalTag; the collector runs evals server-side post-export at zero inline latency.

Future AGI Platform adds self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing traces, fires a Sonnet 4.5 Judge across 8 span-tools, and emits a 5-category 30-subtype taxonomy plus a 4-D trace score plus an immediate_fix per cluster.

Agent Command Center fronts 100+ providers as a single Go binary (Apache 2.0). 18+ built-in guardrail scanners plus 15 third-party adapters run at the gateway. Verified benchmark: ~29k req/s with P99 ≤ 21 ms with guardrails on, on t3.xlarge. The gateway sits between your agent and the providers regardless of which pattern the agent uses.

Ready to instrument your first multi-pattern agent? Wire traceAI into the framework you already use, attach TaskCompletion and TrajectoryScore as EvalTag scorers on the agent span, and let the same rubrics gate CI and surface live regressions.

Frequently asked questions

What are the five agent architecture patterns in 2026?

Five named shapes carry most production agents in 2026. ReAct is the reason-act loop where one LLM alternates thoughts and tool calls until it emits a final answer. Plan-then-execute splits planning and execution: a planner enumerates the steps upfront, an executor runs them. Supervisor and workers routes a request through a dispatcher to specialised sub-agents with their own tools and rubrics. Graph (LangGraph-style) makes the agent an explicit state machine with named nodes and edges, so the topology is visible before runtime. Event-driven uses a message bus where agents subscribe to event types and publish results back, scaling horizontally but trading trace clarity for throughput. Most production stacks are hybrids, with a graph or supervisor outer layer and ReAct loops at the leaves.

How do I pick the right agent architecture pattern?

Score the workload on four axes and the pattern picks itself. Latency budget: how many seconds can a user wait? ReAct adds a model call per step and graph adds one per node; supervisor and event-driven can hide latency behind parallel work. Recovery cost: when a tool fails, can the agent retry locally or must the whole trajectory restart? Plan-then-execute is brittle on intermediate failure; ReAct and graph recover step by step. Observability needs: do you need to debug a single user trace, or aggregate failure across millions? ReAct and graph give clean parent-child traces; event-driven scatters context across queues. Complexity tolerance: how much state-machine code is the team willing to maintain? ReAct is a few hundred lines, graph is a few thousand, event-driven is a platform. Pick three axes to optimise and accept the fourth as the tax.

What is the ReAct pattern and when does it fail?

ReAct is the reason-act loop introduced in the 2022 paper 'ReAct: Synergizing Reasoning and Acting in Language Models'. One LLM emits Thought, Action, Observation, Thought, and so on, until it produces Final Answer. The runtime parses the Action line, calls the tool, returns the Observation, and re-prompts. ReAct is the simplest agent that uses tools and still the foundation of most 2026 production agents at the leaves. It fails in three ways: unbounded loop length when the model gets stuck retrying the same tool with the same arguments; token blowup on long trajectories because every Thought and Observation appends to context; and poor recoverability when a downstream step requires re-planning, because there is no plan to re-plan from. A step budget, no-progress detection, and a retry-with-different-tool rule mitigate but do not eliminate these failure modes.

When should I use plan-then-execute over ReAct?

Use plan-then-execute when the task decomposes cleanly into sub-steps known in advance, latency is not the binding constraint, and you want a plan you can inspect, log, or edit before any tool runs. The planner reads the task, emits an ordered list of steps with arguments, and a separate executor runs each step. Strengths: the plan is a single artifact you can audit; execution is deterministic once the plan is fixed; the trace shows planner output as a distinct span before any side effect. Weakness: plans go stale when intermediate results change what should happen next. Real production tasks have plan-execute-replan cycles, which adds a control loop on top of the simple two-phase shape. Use ReAct when the next step genuinely depends on the previous result and no planner can enumerate the branches upfront.

What is a LangGraph-style graph agent, and why is it more debuggable?

Graph agents make the architecture an explicit state machine. Nodes are agent steps or tools; edges are transitions, sometimes conditional. LangGraph is the dominant 2026 example: you declare nodes (call_model, call_tool, check_response) and edges (call_model to call_tool when there is a tool call, otherwise to END). The graph is a value, not a control flow. That is what makes it debuggable. Every transition becomes a parent-child relationship in the trace. The traceAI LangGraphInstrumentor surfaces node_count and conditional-edge topology so the trace renders as the actual graph, not a flat span list. The cost is upfront design: you must commit to the topology before the agent runs. The win is that when something breaks, the failing node is visible at the top of the trace tree.

When is event-driven the right pattern?

Event-driven is the right pattern when you need horizontal scale and parallel execution more than you need a clean single-request trace. Agents subscribe to event types on a message bus (Kafka, NATS, Redis Streams, a serverless queue). One agent publishes a CustomerEscalated event; a routing agent and a CRM-update agent both consume it and publish their own events. There is no single supervisor and no global plan; the architecture is the choreography. Strengths: trivial horizontal scale, natural retry semantics at the bus layer, agents can ship independently. Weaknesses: tracing requires correlating spans across queues with explicit trace context propagation; a single user request becomes 40 spans across 12 services and four message hops; debugging is hard without disciplined OTel context propagation through every producer and consumer. Reach for it when throughput matters more than per-request observability, not before.

How do I instrument multi-pattern agents with traceAI?

Three rules. First, every agent dispatch and every tool call is a span with a parent that reflects the actual call hierarchy, not the order of emission. Second, state transitions are span events on the agent span, carrying the input state, output state, and the decision that produced the transition. Third, render the trace as the tree, not the flat list, because a flat list buries the loop and the dispatch decisions. traceAI ships 14 span kinds (AGENT, TOOL, RETRIEVER, LLM, CHAIN, RERANKER, EMBEDDING, GUARDRAIL, EVALUATOR, VECTOR_DB, CONVERSATION, A2A_CLIENT, A2A_SERVER, UNKNOWN) across 50+ AI surfaces in Python, TypeScript, Java, and C#. The LangGraphInstrumentor exposes graph topology directly. For event-driven systems, propagate trace context as a header on every message so consumers attach their spans to the producer trace.
