How is AWM different from agent memory?

Agent memory stores facts, conversation state, preferences, or retrieved context. AWM is narrower: it stores procedural workflows such as step order, tool choices, and success conditions.

How do you measure Agent Workflow Memory?

FutureAGI measures AWM through KnowledgeBase recall traces, ContextRelevance for the recalled workflow, TaskCompletion for the final task, and agent.trajectory.step for step-level attribution.

Agent Workflow Memory (AWM): FutureAGI Guide (2026)

Q: What is Agent Workflow Memory (AWM)?

Agent Workflow Memory (AWM) is an agent-memory method that turns prior action trajectories into reusable workflows and recalls the right workflow during later multi-step tasks.

What is Agent Workflow Memory (AWM)?

Agent Workflow Memory (AWM) is an agent-memory method that stores reusable workflows induced from past action trajectories and retrieves them to guide later multi-step tasks. It belongs to the agent family because the memory changes planning, tool selection, and step order during a production run. In FutureAGI, AWM shows up as fi.kb.KnowledgeBase recall, agent.trajectory.step spans, and eval signals showing whether the recalled workflow improved task completion or caused a bad detour.

Why Agent Workflow Memory matters in production agent systems

The main failure mode is procedural drift: a workflow learned for one task gets applied to a task that only looks similar. An agent may remember a workflow for “update billing address” and apply it to “change invoice recipient,” even though one path needs a CRM write and the other needs a finance approval. The run looks plausible until it calls the wrong tool, skips a policy check, or repeats a browser-navigation routine that no longer matches the site.

Developers feel this as hard-to-reproduce regressions: the same prompt works for one user because AWM recalled a useful workflow, then fails for another because a stale routine was selected. SREs see rising p99 latency, retry loops, tool-timeout clusters, and a step count that grows while task completion falls. Product teams hear “the agent almost got it, then went sideways.” Compliance teams care because reusable workflows often encode approval rules, data-access order, and write boundaries.

AWM matters more in 2026 multi-step pipelines than in single-turn LLM calls because it changes the control flow. A recalled workflow is not just extra context; it can bias the planner toward a known sequence of actions. That is powerful when the workflow captures a reusable routine, and dangerous when it is stale, over-specific, or learned from a failed trajectory. The monitoring target is therefore not “did memory return something?” The target is “did this recalled workflow improve the next trajectory?”

How FutureAGI handles Agent Workflow Memory

FutureAGI’s approach is to separate workflow recall from answer generation so both can be evaluated. A team can store induced workflows in fi.kb.KnowledgeBase as compact procedure records: goal, preconditions, tool sequence, stop condition, failure notes, and freshness metadata. When an agent starts a task, the runtime retrieves candidate workflows from that knowledge base, attaches the selected workflow ID to the trace, and emits agent.trajectory.step spans as the agent follows or rejects each step.

For example, a marketplace support agent learns a reusable “verify shipment claim” routine: retrieve order, check carrier event, compare promised delivery date, ask for confirmation if the carrier state is ambiguous, then create a ticket or issue credit. With traceAI-langchain, FutureAGI captures the knowledge-base retrieval, the planner step, each tool call, and the final response. ContextRelevance scores whether the recalled workflow matches the new user goal. Groundedness checks that the final answer is supported by retrieved order and policy context. TaskCompletion and StepEfficiency tell the engineer whether the workflow helped the agent finish the task safely and in fewer steps.
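The procedure record described above (goal, preconditions, tool sequence, stop condition, failure notes, freshness metadata) could be modeled roughly as follows. This is a minimal sketch, not a FutureAGI schema: the `WorkflowRecord` class, its field names, the tool names, the precondition, and the date are all hypothetical and chosen only to mirror the shipment-claim routine.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class WorkflowRecord:
    """Hypothetical shape for one induced workflow (not a FutureAGI type)."""
    workflow_id: str
    goal: str
    preconditions: List[str]
    tool_sequence: List[str]
    stop_condition: str
    failure_notes: List[str] = field(default_factory=list)
    last_verified: Optional[date] = None  # freshness metadata


# Illustrative values only; tool names are placeholders for real tools.
verify_shipment_claim = WorkflowRecord(
    workflow_id="wf-verify-shipment-claim",
    goal="Verify a customer's shipment claim",
    preconditions=["order ID is known"],
    tool_sequence=[
        "retrieve_order",
        "check_carrier_event",
        "compare_promised_delivery_date",
        "ask_for_confirmation_if_ambiguous",
        "create_ticket_or_issue_credit",
    ],
    stop_condition="ticket created or credit issued",
    last_verified=date(2026, 1, 15),
)

print(verify_shipment_claim.workflow_id, len(verify_shipment_claim.tool_sequence))
```

Keeping the record compact and structured makes it easier to attach the workflow ID to a trace and to compare the stored tool sequence with the steps the agent actually took.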

Unlike a raw LangSmith trace review, this setup measures the workflow at the point where it affects the next run. If the “verify shipment claim” workflow starts being recalled for tax-invoice disputes, the engineer can alert on low ContextRelevance, quarantine that workflow from fi.kb.KnowledgeBase, and add the failed trace to a regression dataset before the bad routine spreads.
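The alert-and-quarantine decision can be sketched in plain Python. The example assumes relevance scores in the range 0 to 1 have already been collected per recall event; the function name, the threshold, the minimum sample size, and the workflow IDs are hypothetical, and the actual removal from the knowledge base is left out because it depends on your storage layer.

```python
from collections import defaultdict
from statistics import mean

RELEVANCE_FLOOR = 0.5  # assumed threshold; tune on your own data
MIN_RECALLS = 3        # assumed minimum sample size before acting


def workflows_to_quarantine(recalls):
    """Return workflow IDs whose mean recall relevance falls below the floor.

    `recalls` is an iterable of (workflow_id, relevance_score) pairs,
    one pair per recall event.
    """
    scores = defaultdict(list)
    for workflow_id, score in recalls:
        scores[workflow_id].append(score)
    return sorted(
        workflow_id
        for workflow_id, values in scores.items()
        if len(values) >= MIN_RECALLS and mean(values) < RELEVANCE_FLOOR
    )


# Illustrative recall events: the shipment workflow keeps getting recalled
# for goals it does not fit, so its relevance scores are low.
recalls = [
    ("wf-verify-shipment-claim", 0.2),
    ("wf-verify-shipment-claim", 0.3),
    ("wf-verify-shipment-claim", 0.1),
    ("wf-update-billing-address", 0.9),
    ("wf-update-billing-address", 0.8),
    ("wf-update-billing-address", 0.85),
]

flagged = workflows_to_quarantine(recalls)
print(flagged)
```

Requiring a minimum number of recalls before quarantining avoids pulling a workflow because of one noisy score.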

How to measure or detect Agent Workflow Memory

Measure AWM as a retrieval decision and as a trajectory intervention:

Workflow-recall relevance: ContextRelevance returns a score for whether the recalled workflow fits the current goal and user state.
Grounded workflow use: Groundedness checks whether the final response stays supported by the evidence loaded during the workflow.
Task-completion lift: compare TaskCompletion with AWM enabled versus disabled on the same regression cohort.
Step efficiency: use StepEfficiency, step count, loop rate, and tool-timeout rate to catch workflows that add unnecessary actions.
Trace attribution: filter by agent.trajectory.step, workflow ID, gen_ai.evaluation.score.value, and eval-fail-rate-by-cohort.
User proxy: escalation rate, reopened-ticket rate, and thumbs-down rate by recalled workflow family.
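The task-completion lift in the list above reduces to a paired comparison on one cohort. The sketch below assumes each list holds one TaskCompletion score per regression case, in the same order for both arms; the function name and the scores are hypothetical.

```python
from statistics import mean


def completion_lift(with_awm, without_awm):
    """Mean completion score with AWM enabled minus the mean with it disabled."""
    if len(with_awm) != len(without_awm):
        raise ValueError("score the same regression cohort in both arms")
    return mean(with_awm) - mean(without_awm)


# Illustrative scores for four regression cases (1.0 = completed).
with_awm = [1.0, 1.0, 0.0, 1.0]
without_awm = [1.0, 0.0, 0.0, 1.0]

lift = completion_lift(with_awm, without_awm)
print(f"task-completion lift: {lift:+.2f}")
```

A positive lift alone is not enough to keep a workflow; read it together with step count, loop rate, and tool-timeout rate so that a workflow that completes tasks through extra or risky steps is still caught.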

Minimal Python:

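from fi.evals import ContextRelevance, TaskCompletion

# Placeholder inputs; in practice these come from the agent run and its trace.
current_goal = "Verify a shipment claim for a late order"
recalled_workflow = "retrieve order; check carrier event; compare promised delivery date"
final_state = "Ticket created for the late order"

# Does the recalled workflow fit the current goal?
recall = ContextRelevance().evaluate(
    input=current_goal,
    context=recalled_workflow,
)
# Did the agent finish the task?
done = TaskCompletion().evaluate(input=current_goal, output=final_state)
print(recall.score, done.score)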

Common mistakes

Most AWM bugs come from treating workflows as trusted memory instead of learned hypotheses.

Storing raw traces as workflows. A trace records what happened; a workflow should abstract the reusable routine and remove incidental UI noise.
Recalling by keyword only. Lexical matches miss task intent, required permissions, and tool preconditions; score semantic relevance before following a routine.
Never expiring workflows. Product flows, policies, and tool schemas change; stale procedures need freshness checks and re-verification.
Measuring only final success. A task can finish while taking unsafe or unnecessary extra steps; pair TaskCompletion with StepEfficiency and action review.
Letting failed runs write memory automatically. Bad trajectories should enter review or regression data, not become the agent’s next playbook.
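Two of the mistakes above, never expiring workflows and letting failed runs write memory, can be guarded with small checks before recall and before storage. This is a minimal sketch under stated assumptions: the 90-day window, the function names, and the destination labels are hypothetical, not part of any FutureAGI API.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed freshness window; set per workflow family


def is_stale(last_verified: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """A workflow is stale when it has not been re-verified within max_age."""
    return today - last_verified > max_age


def route_trajectory(succeeded: bool) -> str:
    """Send successful runs to workflow induction and failed runs to regression data."""
    return "workflow_candidate" if succeeded else "regression_dataset"


today = date(2026, 6, 1)
print(is_stale(date(2026, 1, 1), today))  # verified 151 days ago
print(is_stale(date(2026, 5, 1), today))  # verified 31 days ago
print(route_trajectory(succeeded=False))
```

A stale workflow does not have to be deleted; it can be held out of recall until it passes re-verification on a current regression set.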