How is agentic memory different from agent memory?

Agent memory is the broad category of short-term, session, and long-term state. Agentic memory is a specific design where the agent organizes memory through structured notes, links, updates, and retrieval decisions.

How do you measure agentic memory?

FutureAGI measures it with the ContextRelevance, ContextRecall, Groundedness, and ToolSelectionAccuracy evaluators, plus trace fields such as agent.trajectory.step. Alongside those scores, track eval-fail-rate-by-cohort and the memory-write-conflict rate.

Agentic Memory (A-MEM): FutureAGI Definition (2026)

Q: What is agentic memory (A-MEM)?

Agentic memory (A-MEM) is an agent-memory architecture where an AI agent actively writes, links, retrieves, and updates memories. FutureAGI evaluates it through memory relevance, recall, grounding, and trace-level agent steps.

What Is Agentic Memory (A-MEM)?

Agentic memory (A-MEM) is an agent-memory pattern where an AI agent actively organizes experience: it writes structured memory notes, links related notes, retrieves relevant memories, and updates older memories when new context changes them. It is an agent reliability concept that shows up inside knowledge-base reads, memory-write spans, and multi-step agent trajectories. FutureAGI evaluates A-MEM by checking whether retrieved memories are relevant, grounded, complete, and chosen at the right agent step.

Why Agentic Memory Matters in Production LLM and Agent Systems

Memory bugs usually look like model bugs until you inspect the trace. A support agent answers from a stale policy note, a research agent retrieves three related notes but misses the one constraint that changes the answer, or a coding agent writes a wrong project fact into long-term memory and repeats it for a week. The user sees inconsistency. The engineer sees a trace that “succeeded” while carrying the wrong context.

A-MEM matters because it changes memory from storage into control flow. Each new memory can become a structured note with a description, keywords, tags, and links to older notes. That is useful, but it creates new failure modes: stale-context reuse, memory-write conflicts, over-linking, memory injection, and context overflow from too many linked notes. Unlike a plain ChromaDB vector store that retrieves nearest chunks, A-MEM asks the agent to decide what to link, update, and recall.
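A structured note and its links can be sketched in a few lines of Python. This is a minimal illustration, not a FutureAGI or A-MEM reference implementation: the MemoryNote schema, the link_note helper, and the keyword-overlap rule are all hypothetical. In a real A-MEM system the agent itself decides what to link; the overlap rule here only stands in for that decision, and the max_links cap shows one possible guard against over-linking.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    """One structured memory note (hypothetical schema)."""
    note_id: str
    description: str
    keywords: set[str] = field(default_factory=set)
    tags: set[str] = field(default_factory=set)
    links: set[str] = field(default_factory=set)  # IDs of related notes


def link_note(new: MemoryNote, store: dict[str, MemoryNote],
              min_shared: int = 2, max_links: int = 3) -> None:
    """Link `new` to stored notes that share enough keywords, then store it.

    Keyword overlap is a stand-in for the agent's linking decision.
    """
    scored = []
    for other in store.values():
        shared = len(new.keywords & other.keywords)
        if shared >= min_shared:
            scored.append((shared, other.note_id))
    # Strongest overlap first; the note ID breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    for _, other_id in scored[:max_links]:
        new.links.add(other_id)
        store[other_id].links.add(new.note_id)  # keep links bidirectional
    store[new.note_id] = new


store: dict[str, MemoryNote] = {}
link_note(MemoryNote("n1", "Acme renewal is due in March",
                     {"acme", "renewal", "contract"}), store)
link_note(MemoryNote("n2", "Acme asked for a renewal discount",
                     {"acme", "renewal", "discount"}), store)
link_note(MemoryNote("n3", "Globex reported a login bug",
                     {"globex", "login", "bug"}), store)
```

Here n1 and n2 end up linked to each other because they share two keywords, while n3 stays unlinked.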

Developers feel the pain as nondeterministic regressions across long conversations. SREs see p99 latency rise when memory recall fans out across many links. Product teams hear “it forgot” and “it remembered the wrong thing” from the same account. Compliance teams ask whether the agent can purge user-specific memories and prove which notes influenced a decision.

In 2026 multi-step agent pipelines, memory often sits between tools, RAG, MCP connections, and handoffs. One bad memory write can contaminate future retrieval, steer the agent toward the wrong tool, and produce a confident answer that appears grounded but is anchored to stale state.

How FutureAGI Handles Agentic Memory (A-MEM)

FutureAGI’s approach is to treat agentic memory as an evaluated knowledge workflow, not a background cache. The concrete FutureAGI surface is fi.kb.KnowledgeBase: engineers create or update a knowledge base, attach uploaded files, and then evaluate how an agent reads from that source during a trajectory.

Example: a customer-success agent stores account facts, contract terms, and past escalation summaries in a long-term memory layer backed by a FutureAGI knowledge base. The traceAI-langchain integration records the agent step, model call, and retrieval span; the team adds agent.trajectory.step, llm.token_count.prompt, knowledge-base ID, retrieved note IDs, and note age as span attributes. The trace now shows whether a bad renewal answer came from planning, retrieval, memory evolution, or final generation.
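The span attributes in this example could be assembled roughly as follows. This sketch uses only the standard library and returns a plain dict. The keys agent.trajectory.step and llm.token_count.prompt come from the text above; the memory.* key names and the memory_span_attributes helper are hypothetical, not a documented traceAI schema. With an OpenTelemetry-style span, each pair would typically be attached with span.set_attribute(key, value).

```python
from datetime import datetime, timezone


def memory_span_attributes(step: int, prompt_tokens: int, kb_id: str,
                           notes: list[dict], now: datetime) -> dict:
    """Build flat span attributes for one memory-retrieval span."""
    ages = [(now - note["written_at"]).days for note in notes]
    return {
        "agent.trajectory.step": step,
        "llm.token_count.prompt": prompt_tokens,
        # The keys below are hypothetical names chosen for this sketch.
        "memory.kb_id": kb_id,
        "memory.retrieved_note_ids": [note["id"] for note in notes],
        "memory.max_note_age_days": max(ages, default=0),
    }


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
notes = [
    {"id": "n1", "written_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": "n2", "written_at": datetime(2025, 11, 2, tzinfo=timezone.utc)},
]
attrs = memory_span_attributes(step=3, prompt_tokens=1850,
                               kb_id="kb-renewals", notes=notes, now=now)
```

Recording the oldest note age per span makes it possible to separate a stale-memory failure from a planning or generation failure when reading the trace.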

Evaluation then runs at three levels: retrieval relevance, retrieval recall, and answer grounding. ContextRelevance checks whether retrieved notes match the current user goal. ContextRecall checks whether expected memories were missed. Groundedness checks whether the final answer stays tied to the supplied memory and knowledge-base context. If the agent can also choose between a knowledge-base lookup, CRM API, or web search, ToolSelectionAccuracy checks that the memory route was the right tool.
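For intuition about what recall and relevance capture, here is a simplified set-based proxy over labeled note IDs. It is not how the FutureAGI evaluators are implemented, which this page does not describe; the recall_and_precision function and the note IDs are illustrative assumptions.

```python
def recall_and_precision(retrieved_ids, expected_ids):
    """Set-based proxy: recall over expected notes, precision over retrieved notes."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = retrieved & expected
    # With nothing expected, nothing can be missed, so recall defaults to 1.0.
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# The agent retrieved four notes but missed n4, the one constraint that matters.
recall, precision = recall_and_precision(
    retrieved_ids=["n1", "n2", "n7", "n9"],
    expected_ids=["n1", "n2", "n4"],
)
```

In this case recall is 2/3 because n4 was missed, and precision is 0.5 because two of the four retrieved notes were not expected.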

The engineer sets thresholds, for example ContextRecall >= 0.85 on renewal cases and memory-write-conflict rate below 1%. Failed traces become a regression dataset. The fix may be a namespace rule, a stricter write policy, or a fallback to human review when a high-impact memory is stale.
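A threshold gate like this can be sketched as a small function over per-trace records. The record fields (context_recall, memory_writes, write_conflicts) and the gate function are hypothetical; only the two threshold values come from the example above.

```python
RECALL_THRESHOLD = 0.85      # ContextRecall >= 0.85, from the example above
CONFLICT_RATE_LIMIT = 0.01   # memory-write-conflict rate below 1%


def gate(traces):
    """Check both thresholds and collect failing traces for regression testing."""
    failed = [t for t in traces if t["context_recall"] < RECALL_THRESHOLD]
    writes = sum(t["memory_writes"] for t in traces)
    conflicts = sum(t["write_conflicts"] for t in traces)
    conflict_rate = conflicts / writes if writes else 0.0
    return {
        "passed": not failed and conflict_rate < CONFLICT_RATE_LIMIT,
        "conflict_rate": conflict_rate,
        "regression_dataset": failed,
    }


traces = [
    {"id": "t1", "context_recall": 0.92, "memory_writes": 120, "write_conflicts": 1},
    {"id": "t2", "context_recall": 0.70, "memory_writes": 80, "write_conflicts": 0},
]
report = gate(traces)
```

Here the conflict rate is within the limit, but trace t2 falls below the recall threshold, so the gate fails and t2 lands in the regression dataset.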

How to Measure or Detect Agentic Memory Quality

Measure A-MEM by separating retrieval, evolution, and downstream answer quality:

ContextRelevance: scores whether retrieved memory notes are relevant to the current task.
ContextRecall: checks whether required memories appear in the retrieved set.
Groundedness: scores whether the answer is supported by retrieved memory and knowledge-base context.
ToolSelectionAccuracy: checks whether the agent chose the memory or knowledge-base route when it should.
Trace signals: repeated agent.trajectory.step, rising llm.token_count.prompt, old note age, high top-k fanout, and memory-write-conflict rate.
User proxies: repeated corrections, reopened tickets, thumbs-down rate, and escalation rate on long-session cohorts.

Minimal Python:

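from fi.evals import ContextRelevance, Groundedness

# Placeholder inputs for illustration; in practice these come from the agent trace.
# Adjust the types to whatever the evaluators expect in your SDK version.
user_goal = "When does the Acme contract renew?"
retrieved_memory_notes = ["The Acme contract renews on 1 March."]
agent_answer = "The Acme contract renews on 1 March."

# Are the retrieved memory notes relevant to the current goal?
memory_result = ContextRelevance().evaluate(
    input=user_goal,
    context=retrieved_memory_notes,
)
# Is the final answer supported by the retrieved notes?
answer_result = Groundedness().evaluate(
    response=agent_answer,
    context=retrieved_memory_notes,
)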
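The trace signals listed above can also be checked with plain rules, without calling an evaluator. The sketch below assumes each span is a dict, that agent.trajectory.step holds a step label, and that note age and retrieved note IDs are stored under the hypothetical keys memory.max_note_age_days and memory.retrieved_note_ids. The default limits are arbitrary placeholders, not recommended values.

```python
from collections import Counter


def memory_trace_flags(spans, max_note_age_days=90, max_fanout=8,
                       max_step_repeats=2):
    """Return the set of memory-related warning flags raised by one trace."""
    flags = set()
    step_counts = Counter(s["agent.trajectory.step"] for s in spans)
    if any(count > max_step_repeats for count in step_counts.values()):
        flags.add("repeated_step")
    tokens = [s["llm.token_count.prompt"] for s in spans]
    # Flag only a strictly rising prompt size across consecutive spans.
    if len(tokens) >= 2 and all(b > a for a, b in zip(tokens, tokens[1:])):
        flags.add("rising_prompt_tokens")
    if any(s.get("memory.max_note_age_days", 0) > max_note_age_days for s in spans):
        flags.add("stale_note")
    if any(len(s.get("memory.retrieved_note_ids", [])) > max_fanout for s in spans):
        flags.add("high_fanout")
    return flags


spans = [
    {"agent.trajectory.step": "retrieve_memory", "llm.token_count.prompt": 1200,
     "memory.max_note_age_days": 12, "memory.retrieved_note_ids": ["n1", "n2"]},
    {"agent.trajectory.step": "retrieve_memory", "llm.token_count.prompt": 2100,
     "memory.max_note_age_days": 140, "memory.retrieved_note_ids": ["n1", "n2", "n3"]},
    {"agent.trajectory.step": "retrieve_memory", "llm.token_count.prompt": 3900,
     "memory.max_note_age_days": 12, "memory.retrieved_note_ids": ["n1", "n2"]},
]
flags = memory_trace_flags(spans)
```

This trace raises three flags (repeated step, rising prompt tokens, stale note) but not high fanout, since no span retrieves more than eight notes.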

Common Mistakes

Most A-MEM failures come from treating memory evolution as harmless metadata rather than production state.

Writing before the outcome is known. Store durable memories only after the agent step succeeds and the fact is verified.
Optimizing top-k alone. A-MEM quality depends on links, note updates, namespaces, and recall, not only vector similarity.
No contradiction check on updates. Evolving an old note can overwrite a valid fact with a temporary exception.
Mixing user memory and global knowledge. Keep user-specific facts separate from product policy or shared documentation.
Skipping deletion design. Long-term memory needs TTLs, purge paths, and audit evidence for regulated workflows.
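Several of these mistakes can be guarded against at the write path. The toy store below is a sketch under stated assumptions, not a FutureAGI API: the MemoryStore class, its method names, and the policy of rejecting any conflicting value are all hypothetical. It writes only verified facts, keeps user and global facts in separate namespaces, holds back contradicting updates instead of overwriting, and purges entries past a TTL while appending to an audit log.

```python
from datetime import datetime, timedelta, timezone


class MemoryStore:
    """Toy in-memory store keyed by (namespace, fact key)."""

    def __init__(self, ttl: timedelta):
        self.ttl = ttl
        self.facts = {}      # (namespace, key) -> (value, written_at)
        self.audit_log = []  # append-only record of writes, conflicts, purges

    def write(self, namespace, key, value, now, verified):
        """Store a fact only if it is verified and does not contradict an existing one."""
        if not verified:
            self.audit_log.append(("skipped_unverified", namespace, key))
            return False
        existing = self.facts.get((namespace, key))
        if existing is not None and existing[0] != value:
            # Do not overwrite silently; leave the conflict for review.
            self.audit_log.append(("conflict", namespace, key))
            return False
        self.facts[(namespace, key)] = (value, now)
        self.audit_log.append(("write", namespace, key))
        return True

    def purge_expired(self, now):
        """Delete facts older than the TTL and record each purge."""
        expired = [k for k, (_, ts) in self.facts.items() if now - ts > self.ttl]
        for k in expired:
            del self.facts[k]
            self.audit_log.append(("purge", *k))
        return len(expired)


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
mem = MemoryStore(ttl=timedelta(days=30))
ok = mem.write("user:42", "plan", "enterprise", t0, verified=True)
conflict = mem.write("user:42", "plan", "trial", t0, verified=True)
unverified = mem.write("global", "refund_window_days", 30, t0, verified=False)
purged = mem.purge_expired(t0 + timedelta(days=31))
```

Rejecting every conflicting write is the strictest possible policy. A production system might instead version the note or route the conflict to human review, as long as the decision is recorded.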