How is agent memory different from a context window?

The context window is a hard token limit on a single LLM call. Agent memory is the broader system that decides what to load into that window from external stores such as vector databases, key-value stores, and knowledge graphs.

How do you measure agent memory quality?

FutureAGI evaluates memory reads with ContextRelevance and ContextRecall, and traces every memory operation as a span so you can audit which memories were loaded and why.

Agent Memory: Definition & FutureAGI Guide (2026)

What Is Agent Memory?

Agent memory is the persistent state an AI agent carries across steps, sessions, or user interactions. It is an agent-system capability, not just a longer prompt: the agent decides what to keep in short-term context, what to retain for the current session, and what to recall from long-term stores such as vector databases or knowledge graphs. In a FutureAGI production trace, agent memory appears as read and write spans against memory backends, where teams evaluate relevance, recall, freshness, and conflicts.
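The three tiers described above can be sketched as a small data structure. This is a minimal illustration, not a FutureAGI or LangGraph API: the class name AgentMemory, its methods, and the dict standing in for a vector database are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical three-tier memory; all names are illustrative only."""
    max_short_term: int = 4
    short_term: deque = field(default_factory=deque)  # recent turns kept in context
    session: dict = field(default_factory=dict)       # facts for the current session
    long_term: dict = field(default_factory=dict)     # stand-in for a vector DB or graph

    def observe(self, turn: str) -> None:
        # Keep only the most recent turns in short-term context.
        self.short_term.append(turn)
        while len(self.short_term) > self.max_short_term:
            self.short_term.popleft()

    def remember(self, key: str, value: str, durable: bool = False) -> None:
        # Every fact lands in session memory; durable facts are also written long-term.
        self.session[key] = value
        if durable:
            self.long_term[key] = value

    def end_session(self) -> None:
        # Short-term and session state are dropped; long-term state survives.
        self.short_term.clear()
        self.session.clear()

    def recall(self, key: str):
        # Prefer the session value, then fall back to long-term storage.
        return self.session.get(key, self.long_term.get(key))


agent_memory = AgentMemory(max_short_term=2)
for turn in ["hi", "I'm vegetarian", "book dinner"]:
    agent_memory.observe(turn)
agent_memory.remember("diet", "vegetarian", durable=True)
agent_memory.remember("draft_city", "Lisbon")
agent_memory.end_session()
```

After end_session, only the fact written with durable=True can still be recalled, which is the distinction between session and long-term memory that the definition draws.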

Why agent memory matters in production LLM and agent systems

Memory is where many “agentic” products quietly degrade. The first few turns work because everything fits in context. By turn 12 the conversation buffer is too long, the agent starts forgetting earlier facts, and the user repeats themselves. Or the long-term vector store retrieves stale entries because no one set TTLs. Or the agent writes contradictory facts about the same user across sessions and the next session’s behavior is incoherent.

Each role sees a different shape. A backend engineer fights context-overflow errors when the conversation buffer exceeds the model’s window. A product manager hears “it forgot what I told it” complaints. A compliance lead is asked whether the agent stores PII in long-term memory and how it is purged. An SRE watches latency climb as memory recall fans out into unbounded vector queries.

In 2026, agent-memory frameworks are maturing — LangGraph’s checkpointer and MemorySaver, OpenAI Agents SDK’s session memory, agentic RAG patterns with hybrid retrieval, dedicated stores like Letta and Mem0. The engineering challenge is no longer whether to add memory but how to evaluate that the right memory was loaded at the right moment, that writes did not corrupt prior facts, and that staleness does not silently poison new reasoning.

How FutureAGI handles agent memory

FutureAGI’s approach is to instrument memory as a first-class span and evaluate it like retrieval. The traceAI-langgraph integration captures LangGraph checkpointer reads and writes; traceAI-langchain captures conversation-buffer accesses; and the vector-store integrations — traceAI-pinecone, traceAI-qdrant, traceAI-weaviate, traceAI-chromadb, traceAI-milvus, traceAI-pgvector, traceAI-lancedb, traceAI-mongodb-vector, traceAI-redis-vector — capture every long-term memory query as an OpenTelemetry span. Each span carries the query, the retrieved IDs, and the recency metadata.
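To make the span contents concrete, here is a rough sketch of a memory read recorded with its query, retrieved IDs, and recency metadata. It deliberately uses only the standard library: memory_span, RECORDED_SPANS, and the word-overlap lookup are hypothetical stand-ins, not the traceAI or OpenTelemetry API.

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # stand-in for a span exporter


@contextmanager
def memory_span(kind: str, query: str):
    """Record one memory operation as a span-like dict (not the real OTel API)."""
    span = {"kind": kind, "query": query, "attributes": {}}
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        RECORDED_SPANS.append(span)


def read_memory(store: dict, query: str) -> list:
    # Toy lookup: return entries whose text shares a word with the query.
    words = set(query.lower().split())
    with memory_span("memory.read", query) as span:
        hits = [(mid, m) for mid, m in store.items()
                if words & set(m["text"].lower().split())]
        span["attributes"]["retrieved_ids"] = [mid for mid, _ in hits]
        span["attributes"]["created_at"] = [m["created_at"] for _, m in hits]
        return [m["text"] for _, m in hits]


store = {
    "m1": {"text": "user prefers vegetarian food", "created_at": "2026-01-03"},
    "m2": {"text": "user lives in Lisbon", "created_at": "2026-01-10"},
}
results = read_memory(store, "vegetarian dinner ideas")
```

Recording the retrieved IDs and timestamps on the span itself is what lets a later evaluator judge relevance and freshness without re-running the query.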

Evaluation runs the same way you’d evaluate any retrieval. ContextRelevance scores whether the loaded memories were on-topic for the agent’s current step. ContextRecall scores whether all the memories the agent should have loaded actually were. Unlike LangGraph MemorySaver, which persists checkpoints without judging whether recalled state helped the next step, FutureAGI attaches eval scores to the memory span. A custom evaluator can check freshness: any memory older than X days is flagged as “stale unless re-verified.” On the write side, evaluators can check that long-term writes are deduplicated and that they do not contradict existing facts.
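The freshness and write-side checks could look something like the sketch below. The function names flag_stale and check_write, the field names created_at and verified_at, and the exact-match conflict rule are assumptions for illustration; a real evaluator would likely compare facts semantically rather than by string equality.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def flag_stale(memories, max_age_days, now=None):
    """Return ids of memories outside the freshness window and not re-verified."""
    now = now or datetime.now(UTC)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for m in memories:
        # A re-verification resets the clock; otherwise use the creation time.
        last_checked = m.get("verified_at") or m["created_at"]
        if last_checked < cutoff:
            stale.append(m["id"])
    return stale


def check_write(existing, entity, attribute, value):
    """Classify a proposed long-term write as 'new', 'duplicate', or 'conflict'."""
    current = existing.get((entity, attribute))
    if current is None:
        return "new"
    return "duplicate" if current == value else "conflict"


now = datetime(2026, 3, 1, tzinfo=UTC)
memories = [
    {"id": "a", "created_at": datetime(2026, 2, 25, tzinfo=UTC)},
    {"id": "b", "created_at": datetime(2025, 12, 1, tzinfo=UTC)},
    {"id": "c", "created_at": datetime(2025, 12, 1, tzinfo=UTC),
     "verified_at": datetime(2026, 2, 20, tzinfo=UTC)},
]
stale_ids = flag_stale(memories, max_age_days=30, now=now)

facts = {("user-1", "diet"): "vegetarian"}
outcome = check_write(facts, "user-1", "diet", "vegan")
```

Here the old but re-verified entry "c" stays fresh, while the write that changes the stored diet is surfaced as a conflict instead of silently overwriting the earlier fact.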

Concretely: a personal-assistant agent built on LangGraph stores user preferences in a Mem0-style long-term memory. After two weeks of users complaining “it forgot my dietary preferences,” FutureAGI traces show that the recall query was using the user’s current utterance as the query string, missing preferences stored under semantically distinct phrasing. ContextRecall averaged 41%. The team adds an explicit “preferences” namespace queried on every turn; recall jumps to 89%. Without the memory spans and a recall evaluator, the bug presents as a vague “the agent is dumb.”
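The fix in this example, a dedicated namespace that is loaded on every turn, can be sketched as follows. NamespacedMemory and load_context are hypothetical names, and semantic search is faked with word overlap so the snippet runs without a vector store; it is not the Mem0 or LangGraph API.

```python
class NamespacedMemory:
    """Hypothetical store; semantic search is faked with word overlap."""

    def __init__(self):
        self._items = {}  # namespace -> list of text entries

    def write(self, namespace, text):
        self._items.setdefault(namespace, []).append(text)

    def search(self, namespace, query):
        # Return entries in the namespace that share a word with the query.
        words = set(query.lower().split())
        return [t for t in self._items.get(namespace, [])
                if words & set(t.lower().split())]

    def all(self, namespace):
        return list(self._items.get(namespace, []))


def load_context(memory, utterance):
    # Preferences are loaded on every turn regardless of the utterance;
    # other memories are still retrieved by similarity to the utterance.
    return memory.all("preferences") + memory.search("general", utterance)


prefs_memory = NamespacedMemory()
prefs_memory.write("preferences", "no meat or fish")
prefs_memory.write("general", "dinner with Sam went well last month")

utterance = "find me a dinner place"
query_only = prefs_memory.search("preferences", utterance)
context = load_context(prefs_memory, utterance)
```

Querying by the utterance alone returns nothing from the preferences namespace because the stored phrasing shares no terms with the request, which mirrors the recall failure in the trace; loading the namespace unconditionally removes that dependency on phrasing.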

How to measure or detect agent memory

Treat memory like any retrieval surface — measure relevance, recall, and freshness:

ContextRelevance: returns 0–1 for whether retrieved memories are on-topic for the current step.
ContextRecall: returns 0–1 for whether all required memories were retrieved given a known set.
TaskCompletion: end-to-end check; memory failures often surface as TaskCompletion regressions on multi-turn cohorts.
memory-staleness signal (custom): % of retrieved memories older than your freshness window.
memory-write-conflict rate (dashboard signal): % of writes that contradict existing facts about the same entity.
agent.trajectory.step (OTel attribute): combined with span kind = memory.read or memory.write, lets you isolate memory operations.
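For intuition, the recall, staleness, and write-conflict signals above reduce to simple ratios when you have labeled data. The sketch below is an ID-overlap approximation with hypothetical function names; it is not how FutureAGI computes ContextRecall, which is not specified here.

```python
def recall_against_known_set(retrieved_ids, required_ids):
    """Fraction of required memories that were actually retrieved (0 to 1)."""
    required = set(required_ids)
    if not required:
        return 1.0  # nothing was required, so nothing was missed
    return len(required & set(retrieved_ids)) / len(required)


def staleness_rate(ages_in_days, freshness_window_days):
    """Fraction of retrieved memories older than the freshness window."""
    if not ages_in_days:
        return 0.0
    stale = sum(1 for age in ages_in_days if age > freshness_window_days)
    return stale / len(ages_in_days)


def write_conflict_rate(write_outcomes):
    """Fraction of writes labeled 'conflict' by an upstream write check."""
    if not write_outcomes:
        return 0.0
    return write_outcomes.count("conflict") / len(write_outcomes)


recall = recall_against_known_set(["m1", "m4"], ["m1", "m2", "m3", "m4"])
stale = staleness_rate([2, 45, 90, 10], freshness_window_days=30)
conflicts = write_conflict_rate(["new", "conflict", "duplicate", "new"])
```

Tracking these per cohort (for example, multi-turn sessions only) makes it easier to tie a TaskCompletion regression back to a specific memory failure.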

Minimal Python:

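from fi.evals import ContextRelevance

# Illustrative placeholder inputs: the current user turn and the memories loaded for it.
current_turn = "Book a dinner spot for Friday."
loaded_memories = ["User is vegetarian.", "User prefers places within walking distance."]

# Score whether the loaded memories are on-topic for the current step (0–1).
relevance = ContextRelevance().evaluate(
    input=current_turn,
    context=loaded_memories,
)
print(relevance.score, relevance.reason)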

Common mistakes

Treating memory as one undifferentiated bucket. Short-term, session, and long-term have different read patterns, write semantics, and TTLs — design and evaluate them separately.
No freshness eval. A stale long-term memory poisons future reasoning silently; flag and re-verify entries past their freshness window.
Storing every turn in long-term memory. Most turns don’t matter; selective writes prevent noisy retrieval downstream.
Skipping recall evals. Relevance alone misses the case where the agent retrieved on-topic but incomplete memory; pair with ContextRecall.
PII in long-term memory with no purge story. Compliance reviews catch this late; bake redaction and TTL into the write path.
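Several of these mistakes can be addressed in the write path itself. The following is a minimal sketch, assuming a dict-backed store and hypothetical helpers (redact, write_long_term, purge_expired); the regular expressions are deliberately simple and would not be sufficient for a real compliance review.

```python
import re
import time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    # Replace obvious email addresses and phone numbers before storage.
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def write_long_term(store, key, text, worth_keeping, ttl_seconds, now=None):
    """Write only selected facts, redacted, with an expiry timestamp."""
    if not worth_keeping:
        return False  # selective writes: most turns are not stored
    now = time.time() if now is None else now
    store[key] = {"text": redact(text), "expires_at": now + ttl_seconds}
    return True


def purge_expired(store, now=None):
    """Delete entries past their TTL and return the purged keys."""
    now = time.time() if now is None else now
    expired = [k for k, v in store.items() if v["expires_at"] <= now]
    for k in expired:
        del store[k]
    return expired


long_term = {}
kept = write_long_term(long_term, "contact",
                       "Reach me at ana@example.com or +1 415 555 0100",
                       worth_keeping=True, ttl_seconds=60, now=1000)
skipped = write_long_term(long_term, "chitchat", "nice weather",
                          worth_keeping=False, ttl_seconds=60, now=1000)
stored_text = long_term["contact"]["text"]
purged = purge_expired(long_term, now=1061)
```

Putting redaction and expiry in the write function means every long-term entry has a purge story by construction, rather than relying on a later cleanup job.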