What Is a Thunk?
A zero-argument function or closure that wraps a deferred computation, evaluated only when explicitly forced.
A thunk is a zero-argument function or closure that wraps a deferred computation: the wrapped expression is not evaluated until the thunk is explicitly called. The pattern originates in compiler theory and lazy-functional languages like Haskell, where every expression is implicitly thunked. It now shows up across modern stacks — Redux middleware (redux-thunk), JavaScript and Python closures, generator-based async patterns, lazy-evaluated frameworks like Dask, and the deferred LLM-call primitives inside agent runtimes. In a FutureAGI trace, a thunk is invisible; the LLM call it forces is the span you see.
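A minimal Python sketch of the pattern; expensive_computation and make_thunk are illustrative names, not part of any framework:
def expensive_computation(x):
    print("evaluating")              # the side effect shows when evaluation really happens
    return x * x

def make_thunk(x):
    # Wrap the call in a zero-argument closure; nothing runs yet.
    return lambda: expensive_computation(x)

pending = make_thunk(12)   # nothing printed: the computation is deferred
result = pending()         # forcing the thunk evaluates it now and prints "evaluating"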
Why It Matters in Production LLM and Agent Systems
Thunks are the implementation pattern behind much of agent-runtime laziness. An agent that decides to call a tool emits a thunk that, when forced, executes the tool — letting the planner separate “decide what to do” from “actually do it.” Streaming LLMs are thunk-like: each token is produced lazily as the consumer demands. Async retrieval is thunk-like: the embedding lookup is deferred until a downstream step needs the result. When these patterns work, they make agents efficient and parallelizable. When they break, the bugs are some of the hardest to debug.
Application engineers feel this when an agent appears to call a tool but the tool span never fires — the thunk was created but never forced. SREs feel it when async tool calls leak across requests because the closure captured the wrong context. Platform teams feel it when retry logic re-forces an already-evaluated thunk and produces duplicate side effects (a duplicate billing call, a duplicate ticket creation). None of these are LLM bugs; they are runtime-control-flow bugs that look like LLM bugs in the trace.
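One mitigation for the retry case is to make forcing idempotent at the thunk level: cache the result the first time the thunk is forced so a re-force returns the cached value instead of repeating the side effect. A minimal sketch; create_ticket is an illustrative stand-in for any side-effecting call:
import functools

def create_ticket(summary):
    print(f"ticket created: {summary}")   # the side effect we must not duplicate
    return {"id": 42, "summary": summary}

def make_memoized_thunk(summary):
    # functools.cache makes a re-forced thunk return the first result instead of re-running.
    @functools.cache
    def thunk():
        return create_ticket(summary)
    return thunk

force = make_memoized_thunk("payment webhook failing")
first = force()    # executes the side effect once
retry = force()    # returns the cached result; no duplicate ticket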
For 2026 agent stacks the pattern matters because LangGraph, OpenAI Agents SDK, CrewAI, and AutoGen all use thunk-like deferred-execution primitives internally. Understanding the pattern is what separates “the agent loop hung” from “the planner emitted three thunks and the runtime forced one.” FutureAGI’s role is to surface the actual side effects in the trajectory, regardless of the runtime’s internal control-flow model.
How FutureAGI Handles Thunks
FutureAGI does not implement thunks — they live in the framework you use (LangGraph state machines, OpenAI Agents SDK handoffs, CrewAI task delegation, JavaScript runtime closures). FutureAGI is the evaluation and observability layer above the runtime. Where thunks become visible in FutureAGI is at the trace boundary: when a thunk is forced, the runtime emits a span (LLM call, tool call, retrieval), traceAI captures it, and agent.trajectory.step plus agent.tool.name attach to the span.
A real workflow: a code-review agent on LangGraph uses lazy tool execution — the planner emits a list of file-read thunks, the runtime forces them in parallel, and the LLM consumes the merged context. Each forced thunk produces a tool span; FutureAGI’s ToolSelectionAccuracy scores whether the right files were chosen, StepEfficiency flags wasted forces, and eval-fail-rate-by-cohort surfaces regressions when a runtime upgrade changes thunk-forcing semantics. When a LangGraph upgrade introduces a bug where some thunks are forced twice, FutureAGI sees it as duplicate spans on the same trace — the engineer correlates the duplication with the runtime version and rolls back.
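A sketch of what lazy, parallel forcing can look like at the runtime level. The file paths, the read_file tool name, and the hand-rolled OpenTelemetry spans are illustrative; in practice traceAI instrumentation emits the spans, but the shape is the same: each forced thunk becomes one tool span carrying agent.trajectory.step and agent.tool.name.
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import trace

tracer = trace.get_tracer("code-review-agent")

def make_read_thunk(path, step):
    # The planner emits these without executing them.
    def thunk():
        # Forcing the thunk is what produces the tool span you see in the trace.
        with tracer.start_as_current_span("tool.read_file") as span:
            span.set_attribute("agent.trajectory.step", step)
            span.set_attribute("agent.tool.name", "read_file")
            with open(path) as f:
                return f.read()
    return thunk

paths = ["a.py", "b.py", "c.py"]                   # placeholder files under review
thunks = [make_read_thunk(p, i) for i, p in enumerate(paths)]
with ThreadPoolExecutor() as pool:                 # force independent thunks in parallel
    merged_context = list(pool.map(lambda t: t(), thunks))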
For frameworks where thunks are user-controlled (custom Python agents, Redux-store-backed agent UIs), the same pattern applies: evaluate the outputs, watch the trajectory, attribute regressions to runtime changes.
How to Measure or Detect It
Thunks themselves are not measurable; their forced effects are:
- agent.trajectory.step (OTel attribute): the canonical span attribute on every forced thunk in an agent run; filter dashboards by it.
- Span-duplication rate: count of identical tool calls within a single trajectory; non-zero usually means a runtime bug.
- ToolSelectionAccuracy: returns whether each forced tool thunk picked the right tool given the state.
- StepEfficiency: returns the fraction of forced steps that contributed to the final goal; flags wasted thunks.
- Trajectory parallelism rate: how many thunks fire concurrently versus serially; a runtime-tuning signal for cost and latency.
Minimal Python:
from fi.evals import ToolSelectionAccuracy, StepEfficiency

# user_goal is the task given to the agent; trace_spans is the list of spans
# collected for one trajectory (for example, exported via traceAI instrumentation).
tool_acc = ToolSelectionAccuracy()
efficiency = StepEfficiency()

acc_result = tool_acc.evaluate(input=user_goal, trajectory=trace_spans)
eff_result = efficiency.evaluate(input=user_goal, trajectory=trace_spans)
print(acc_result.score, eff_result.score)
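Span-duplication rate can be computed directly from the collected spans. A minimal sketch, assuming each span is a dict with a kind field and an attributes dict; the exact key names here are illustrative, not a fixed traceAI schema:
from collections import Counter

def duplication_rate(spans):
    # Identical (tool, input) pairs inside one trajectory usually mean
    # the same thunk was forced more than once.
    keys = [
        (s["attributes"].get("agent.tool.name"), s["attributes"].get("input"))
        for s in spans
        if s.get("kind") == "tool"
    ]
    counts = Counter(keys)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(keys) if keys else 0.0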
Common Mistakes
- Treating a thunk as a side-effect-free wrapper. Forcing a thunk usually triggers I/O — a tool call, an API request, a billing event — make idempotency explicit.
- Capturing per-request state in a long-lived closure. The thunk holds the wrong context when it eventually fires; agent runtimes have leaked customer data this way.
- Re-forcing on retry without deduplication. Tool-call retries can re-execute side effects; pair retries with idempotency keys (see the sketch after this list).
- Treating LangGraph or Agents SDK control flow as opaque. Spend a day reading how the runtime forces thunks; it will save weeks of incident debugging later.
- Ignoring trajectory parallelism. Sequential forcing of independent thunks is a free 3x latency improvement most teams miss.
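The retry pairing from the third item above, as a minimal sketch; call_tool_api and the idempotency_key parameter are illustrative stand-ins for whatever your tool endpoint actually accepts:
import uuid

def force_with_retries(call_tool_api, tool_args, max_attempts=3):
    # One idempotency key per logical tool call, reused across retries,
    # so the downstream service can deduplicate re-forced side effects.
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call_tool_api(**tool_args, idempotency_key=idempotency_key)
        except Exception as exc:           # narrow this to transient errors in practice
            last_error = exc
    raise last_error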
Frequently Asked Questions
What is a thunk?
A thunk is a zero-argument function or closure that wraps a deferred computation. The wrapped expression is not evaluated until the thunk is explicitly called, enabling lazy evaluation and deferred side effects.
Where do thunks appear in AI engineering?
In agent runtimes thunks appear as deferred LLM calls, lazy retrieval, async tool invocations, and middleware patterns like redux-thunk. Most modern agent frameworks use thunk-like primitives under the hood.
How does FutureAGI relate to thunks?
FutureAGI does not implement thunks. It evaluates the outputs they produce — LLM responses, tool-call results, retrieval chunks — via fi.evals evaluators tied to traceAI spans on agent trajectories.