What Is Conversation Memory?
Conversation memory is the store an AI system uses to remember earlier turns of a dialogue and apply them to the next one. It comes in three common shapes: a short-term sliding window of recent messages, a summarised long-term store that compresses dropped turns into a paragraph, and an embedding-retrieval store that pulls semantically relevant past turns on demand. In agent stacks it is the bridge between the model’s hard context window and the actual conversation. FutureAGI does not implement the memory; it evaluates the agent behavior the memory produces, so engineers can tune the policy with measurable signal.
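The short-term shape is the simplest to picture: a fixed-size window of recent messages where the oldest turn falls off as new ones arrive. A minimal sketch, assuming nothing beyond the standard library (the class name is illustrative, not from any framework):

```python
from collections import deque

class SlidingWindowMemory:
    """Short-term tier: remember only the most recent `max_turns` messages."""
    def __init__(self, max_turns=6):
        self.turns = deque(maxlen=max_turns)  # old turns fall off automatically

    def add(self, role, text):
        self.turns.append((role, text))

    def context(self):
        """Messages to pack into the next prompt."""
        return list(self.turns)

mem = SlidingWindowMemory(max_turns=3)
for i in range(5):
    mem.add("user", f"turn {i}")
print(mem.context())  # the earliest two turns have been dropped
```

The summary and retrieval shapes differ only in what happens to the dropped turns: a summariser compresses them, a retrieval store embeds them for later lookup.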
Why Conversation Memory Matters in Production LLM and Agent Systems
A chat product with no conversation memory loses continuity immediately: every turn is effectively amnesiac, the user re-types preferences, and the agent feels broken by turn three. A chat product with naive memory has the opposite problem: the buffer grows without bounds, latency rises turn over turn, cost-per-session climbs, and the provider may truncate silently until the model starts hallucinating continuity.
The pain is uneven. A backend engineer ships a ConversationBufferMemory from LangChain with a 4k-token cap and watches users complain that the agent forgets their travel dates. A product manager investigates “the assistant feels inconsistent” and finds the embedding-retrieval memory is pulling the wrong turn from a prior session. An SRE on-call sees p99 latency double turn over turn because the buffer never trims. A compliance reviewer asks how the agent handles PII written in a previous turn and discovers no one has tested it.
In 2026 agent stacks built on LangGraph, OpenAI Agents SDK, or CrewAI, conversation memory is no longer a one-line constructor. It is a tiered design: short-term buffer, on-overflow summariser, long-term vector store, and a tool-result cache. Each tier has its own failure modes, and production reliability depends on evaluating the agent end-to-end against multi-turn scenarios.
How FutureAGI Evaluates Conversation Memory
FutureAGI’s approach is to evaluate the dialogue the memory produces, not the memory implementation itself:
- Trace: every agent turn emits an OTel span through traceAI langchain or openai-agents integrations, carrying llm.token_count.prompt and the actual prompt content. The span tree shows which memory fragment was packed into each turn.
- Score: ConversationCoherence runs across the session and detects whether dropped or stale turns broke continuity. Groundedness flags responses that hallucinate references to forgotten state. MultiHopReasoning checks whether the agent correctly stitched together information across turns when it should have.
- Aggregate: Dataset.add_evaluation stores per-session scores so you can compare buffer-only, summary-on-overflow, and retrieval policies under controlled conditions.
Compared with raw LangChain memory inspection, FutureAGI ties the prompt view to per-session scores, so the team sees whether remembered content improved the answer.
Concretely: a travel-booking agent runs three memory policies in shadow — sliding-window-only, summarisation-on-overflow, and embedding-retrieval. The team simulates 800 multi-turn bookings through simulate-sdk Personas, runs ConversationCoherence and TaskCompletion, and measures cost per session. Embedding retrieval wins on coherence but costs 1.6× more per session; summarisation matches coherence at 1.1× cost. The team picks summarisation. Without FutureAGI’s per-session scoring, that decision would be a hunch; with it, the trade-off is a number on a dashboard.
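The decision rule behind that pick reduces to a few lines. Only the cost multipliers (1.0×, 1.1×, 1.6×) come from the example above; the coherence scores here are illustrative placeholders:

```python
# Hypothetical per-session aggregates from the shadow run; only the cost
# multipliers mirror the example - the coherence scores are made up.
policies = {
    "sliding_window": {"coherence": 0.74, "cost_x": 1.0},
    "summarisation":  {"coherence": 0.86, "cost_x": 1.1},
    "embedding":      {"coherence": 0.87, "cost_x": 1.6},
}

def cheapest_coherent(policies, min_coherence=0.85):
    """Pick the lowest-cost policy that clears the coherence bar."""
    ok = {name: p for name, p in policies.items() if p["coherence"] >= min_coherence}
    return min(ok, key=lambda name: ok[name]["cost_x"])

print(cheapest_coherent(policies))  # summarisation
```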
How to Measure or Detect Conversation Memory Quality
Conversation memory quality is a multi-turn signal — single-turn metrics will not reveal the failure modes:
- ConversationCoherence: per-session score; drops when memory loses information the model needed.
- Groundedness: response-level score that flags hallucinated references to forgotten turns.
- MultiHopReasoning: scores whether the agent correctly chained information across turns.
- llm.token_count.prompt (OTel attribute): tracks how much memory was packed each turn; sudden cliffs indicate truncation.
- Recall@turn: a custom evaluator that scores whether the agent remembers a fact stated K turns ago.
- Cost-per-session (dashboard signal): unbounded memory inflates cost; bounded memory with summarisation should keep it flat.
# Score a finished session: ConversationCoherence runs over the full
# turn list; Groundedness is applied to each response the same way.
from fi.evals import ConversationCoherence, Groundedness
coh = ConversationCoherence()
gnd = Groundedness()
result = coh.evaluate(conversation=session.turns)
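Recall@turn has no off-the-shelf evaluator, so teams write their own. A minimal sketch, with a naive substring check standing in for an LLM judge (all names here are hypothetical):

```python
def recall_at_turn(turns, fact, checked_at):
    """1.0 if the assistant reply at index `checked_at` still reflects `fact`
    stated earlier in the session; 0.0 otherwise. A production version
    would swap the substring check for an LLM judge."""
    role, text = turns[checked_at]
    if role != "assistant":
        raise ValueError("checked_at must index an assistant turn")
    return 1.0 if fact.lower() in text.lower() else 0.0

turns = [
    ("user", "I'm flying out on March 14."),
    ("assistant", "Got it - departing March 14."),
    ("user", "Any hotel near the airport?"),
    ("assistant", "For your March 14 departure, try one by the terminal."),
]
print(recall_at_turn(turns, "March 14", checked_at=3))  # 1.0
```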
Common mistakes
- Picking buffer size by token count alone. Tools that emit long JSON crowd out user turns; size by what you must keep, not by what fits.
- Using one memory tier for everything. Short-term buffers handle continuity; long-term retrieval handles user preferences. Conflating them produces stale-context bugs.
- Letting the summariser drift. Summary quality decays with model swaps; evaluate the summariser as its own LLM call with IsGoodSummary or SummaryQuality.
- Embedding-retrieval without freshness ranking. Retrieving an old turn that is now wrong is worse than not retrieving at all.
- No multi-turn regression eval. Memory bugs are silent in single-turn evals; you need scenarios of 8–20 turns to find them.
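One way to build those 8–20-turn scenarios is to plant a fact early, pad the middle with distractors, and probe for the fact at the end, so a memory regression surfaces as a failed probe. A hypothetical generator:

```python
import random

def make_memory_scenario(n_turns=12, seed=0):
    """Build a user-turn script: fact at turn 0, distractor questions
    in between, a probe at the end. Returns the script plus the expected
    answer so an evaluator can score the agent's final reply."""
    rng = random.Random(seed)
    fact = f"booking reference X{rng.randint(1000, 9999)}"
    script = [f"My {fact}."]
    script += [f"Unrelated question {i} about the itinerary." for i in range(n_turns - 2)]
    script.append("What was my booking reference?")
    return script, fact

script, expected = make_memory_scenario(n_turns=10)
print(len(script), expected)
```

Running the same seeded scripts against each candidate memory policy turns the regression eval into a repeatable comparison rather than a spot check.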
Frequently Asked Questions
What is conversation memory?
Conversation memory is the store an AI system uses to remember earlier turns of a dialogue, ranging from a sliding window of messages to summarised history and embedding-retrieved turns.
How is conversation memory different from agent memory?
Conversation memory specifically holds prior turns of dialogue. Agent memory is broader — it includes tool results, world state, scratchpads, and long-term semantic stores in addition to dialogue.
How do you evaluate conversation memory?
FutureAGI evaluates conversation memory through ConversationCoherence and Groundedness over multi-turn sessions, plus traceAI inspection of which prior turns ended up in each prompt.