Evaluating Agent Memory Systems in 2026: Four Dimensions Most Reports Miss
Evaluating agent memory is four problems, not one: recall, freshness, contradiction handling, forgetting. A 2026 framework for Mem0, Zep, Letta, LangMem.
A support agent quotes the user’s address from a March conversation. The user updated it in June. The store has both writes; the March one ranks higher because its embedding sits closer to the query. Recall is technically correct, and the package still went to the wrong street. The postmortem says “memory worked.” That’s the failure mode inside most public memory eval reports: recall is the only dimension scored.
If your memory eval reports one number, you’re grading a quarter of the system. The opinion this post earns: agent memory eval is four problems, not one. Recall (retrieved the right fact). Freshness (used the LATEST fact). Contradiction handling (resolved conflicting facts). Forgetting (deleted what should be gone). Score the four independently. This guide walks each dimension, the rubric that catches it, how LoCoMo fits as the public floor, the adversarial set on top, how to instrument memory spans, and how the Future AGI eval stack wires the rubrics into CI and live traces against Mem0, Zep, Letta, LangMem, and homegrown stores.
TL;DR: the four-dimension framework
| Dimension | What you measure | Deterministic check | LLM-judge rubric |
|---|---|---|---|
| 1. Recall | Right fact surfaced for the turn | top-k presence, hit-rate by fact id | Groundedness, ContextRelevance, ChunkAttribution |
| 2. Freshness | LATEST version used when fact has updates | surfaced fact has the newest valid_at among writes with the same key | MemoryFreshness CustomLLMJudge |
| 3. Contradiction handling | Right version wins when stored facts disagree | conflict-policy ran, decision logged | ContradictionResolution CustomLLMJudge |
| 4. Forgetting | Retracted, expired, scope-ended facts gone | recall returns empty on tombstoned fact id | ForgettingCompliance CustomLLMJudge + AnswerRefusal |
Non-negotiables: per-dimension scoring, an adversarial set that explicitly retracts and updates facts, a privacy-boundary slice (memory crossing tenants is a P0, not a regression), and trace-attached scores so the same rubric runs in CI and on live traffic.
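Per-dimension scoring in its simplest form is a report that refuses to average across dimensions. A minimal sketch, assuming scores are already normalized to 0-1; the dimension names and thresholds are illustrative placeholders, not values from any framework:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pass thresholds on a 0-1 scale; tune them per domain.
THRESHOLDS = {"recall": 0.90, "freshness": 0.90,
              "contradiction": 0.85, "forgetting": 0.99}

def per_dimension_report(results):
    """results: iterable of (dimension, score) pairs with scores in [0, 1]."""
    buckets = defaultdict(list)
    for dimension, score in results:
        buckets[dimension].append(score)
    report = {}
    for dimension, threshold in THRESHOLDS.items():
        scores = buckets.get(dimension, [])
        avg = mean(scores) if scores else None
        # A dimension with no cases fails: unscored is not the same as passing.
        report[dimension] = {"mean": avg, "n": len(scores),
                             "passed": avg is not None and avg >= threshold}
    return report

example = [("recall", 1.0), ("recall", 0.9), ("freshness", 0.5), ("forgetting", 1.0)]
report = per_dimension_report(example)
```

In this example recall and forgetting pass, freshness fails on its own score, and contradiction fails because nothing tested it.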
Why one-metric memory eval misses production failures
Single-session eval misses the failure modes that matter for memory. Four properties make memory different.
Memory crosses sessions. A bad write in March affects a turn in September. The blast radius of one wrong fact is unbounded until someone reads the store or the entry decays.
Memory has three modes that fail differently. Episodic (past conversations), semantic (facts about the user or account), procedural (learned workflows). Each has different write triggers, recall queries, and acceptable error rates. Treating them as one bucket produces a “recall accuracy” number that doesn’t predict anything.
Memory writes are privacy events. A logging system that captures an API key is a bug. A memory system that captures an API key is the same bug plus a guarantee that the key resurfaces in a future prompt.
Memory drift accumulates silently. Recalled memory looks plausible right up until it isn’t. Without freshness and forgetting rubrics, a six-month-old price keeps surfacing as authoritative until a customer notices.
Mem0, Zep, Letta, and LangMem each ship sensible defaults tuned for the median case, not your retention policy or contradiction rules. Eval is how you find where the defaults break.
Dimension 1: recall (the dimension everyone measures)
Pull the chunks the memory system returned, compare against the gold set of facts a correct recall would have surfaced, score top-k hit-rate plus precision. Instrument the recall call as a retriever span:
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, SpanAttributes, FiSpanKindValues
from opentelemetry import trace

register(project_type=ProjectType.OBSERVE, project_name="agent-memory-eval")
tracer = trace.get_tracer(__name__)

def recall_with_span(user_id: str, query: str):
    # mem_store is your memory client, defined elsewhere.
    with tracer.start_as_current_span("memory.recall") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, FiSpanKindValues.RETRIEVER.value)
        span.set_attribute("memory.system", "mem0")
        span.set_attribute("memory.scope", "user")
        span.set_attribute("memory.operation", "recall")
        span.set_attribute("memory.query", query)
        results = mem_store.search(user_id=user_id, query=query, k=5)
        span.set_attribute("memory.fact_count", len(results))
        return results
```
With recall on the trace tree the eval is shaped like RAG eval. Score Groundedness (does the response only assert what the recalled memory supports), ContextRelevance (were the chunks relevant), and ChunkAttribution (which chunk drove the response). The agentic RAG playbook covers the same metric family; memory recall is RAG eval with a temporal axis bolted on.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance, ChunkAttribution
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
scores = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextRelevance(), ChunkAttribution()],
    inputs=[
        TestCase(query=turn.user_text, response=turn.agent_text,
                 context="\n".join(turn.recalled_chunks))
        for turn in golden_session.turns if turn.required_recall
    ],
)
```
Recall above 0.95 looks great in a blog post and tells you nothing about whether the system is safe in production. Recall is the floor. The next three dimensions are where postmortem bugs hide.
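The deterministic half of the recall check needs no judge at all. A minimal sketch, assuming each recalled chunk carries a stable fact id and the golden set lists the ids a correct recall should surface; the function names are illustrative:

```python
def hit_rate_at_k(returned_ids, gold_ids, k=5):
    """Fraction of gold fact ids that appear in the top-k returned ids."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must be non-empty")
    return len(set(returned_ids[:k]) & gold) / len(gold)

def precision_at_k(returned_ids, gold_ids, k=5):
    """Fraction of the top-k returned ids that are gold facts."""
    top_k = returned_ids[:k]
    if not top_k:
        return 0.0
    gold = set(gold_ids)
    return sum(1 for fact_id in top_k if fact_id in gold) / len(top_k)

returned = ["f1", "f9", "f3", "f7"]   # ids in ranked order from the store
gold = ["f1", "f3", "f4"]             # ids the turn needed
hit_rate = hit_rate_at_k(returned, gold)
precision = precision_at_k(returned, gold)
```

Here two of the three gold facts were surfaced and two of the four returned chunks were relevant, so the two numbers diverge. That divergence is why they are reported separately.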
Dimension 2: freshness (did you use the LATEST fact)
Freshness is the dimension most frameworks claim to handle and most teams never explicitly score. The eval needs three things: an adversarial set with deliberate updates to the same fact key, a deterministic gate that the conflict-policy hook actually ran, and an LLM-judge rubric on whether the agent’s response reflects the latest version.
The deterministic gate is the cheapest signal. If two writes have the same canonical key and recall returned both, the freshness layer either resolved one as winner or punted to the agent. Either path needs a span attribute so the eval can score it:
```python
# tracer and SpanAttributes come from the recall snippet above;
# mem_store and now() stand in for your memory client and clock.
def write_with_freshness_span(user_id: str, key: str, value: str):
    with tracer.start_as_current_span("memory.write") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, FiSpanKindValues.TOOL.value)
        span.set_attribute("memory.system", "mem0")
        span.set_attribute("memory.operation", "write")
        span.set_attribute("memory.fact_key", key)
        ts = now()  # one timestamp for both the supersede marks and the new write
        prior = mem_store.find_by_key(user_id=user_id, key=key)
        span.set_attribute("memory.supersedes_count", len(prior))
        for p in prior:
            mem_store.mark_superseded(p.id, by=key, at=ts)
        return mem_store.write(user_id=user_id, key=key, value=value, valid_at=ts)
```
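On the recall side, the matching deterministic gate can be sketched as a check over what recall returned. This assumes each recalled fact exposes its canonical key, a valid_at timestamp, and a superseded flag; the RecalledFact shape is hypothetical, not a Mem0 or Zep type:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RecalledFact:
    fact_id: str
    key: str
    value: str
    valid_at: datetime
    superseded: bool = False

def freshness_violations(recalled):
    """Return keys where recall did not surface exactly one live, newest write."""
    by_key = {}
    for fact in recalled:
        by_key.setdefault(fact.key, []).append(fact)
    violations = []
    for key, facts in by_key.items():
        live = [f for f in facts if not f.superseded]
        newest = max(facts, key=lambda f: f.valid_at)
        # Fail if nothing resolved the update (several live writes), if every
        # write is superseded, or if the live write is not the newest one.
        if len(live) != 1 or live[0].fact_id != newest.fact_id:
            violations.append(key)
    return violations

march = RecalledFact("a1", "address", "12 Oak St", datetime(2026, 3, 2))
june = RecalledFact("a2", "address", "90 Elm Ave", datetime(2026, 6, 14))
unresolved = freshness_violations([march, june])
```

Two live writes under the same key is exactly the address failure from the opening: the update landed, and nothing marked the March write as superseded.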
The freshness rubric is a CustomLLMJudge: 5 if the response reflects the latest valid fact, 3 if it surfaces both with “I have two on file” hedging, 1 if it asserts the older version as current.
```python
from fi.evals.templates import CustomLLMJudge

MemoryFreshness = CustomLLMJudge(
    name="MemoryFreshness",
    rubric=("Score 1-5 whether the agent used the LATEST valid version "
            "of a fact updated in memory. 5: latest, older not surfaced. "
            "3: both surfaced with hedging. 1: older asserted as current."),
    grading_prompt_template="...",
)
```
Frameworks differ on freshness primitives. Mem0 ships per-fact timestamps and configurable TTLs. Zep ships bi-temporal modeling with valid_at and recorded_at on every graph edge, the strongest out-of-the-box freshness story in the open category. Letta ships memory blocks the agent can self-edit, which moves freshness into agent-authored policy and needs a stricter rubric. LangMem leans on the surrounding LangGraph store. None matches your domain by default; the eval is how you tune them.
Dimension 3: contradiction handling (when stored facts disagree)
Contradiction handling overlaps with freshness and isn’t the same. Freshness asks “which version is newer.” Contradiction handling asks “which version is right” when newer isn’t right. A typo today shouldn’t overwrite a correct preference from last month. An update from an unverified channel shouldn’t override one from a verified source.
This is where every framework’s default is wrong for somebody. Build the test with multi-turn scenarios where you know the ground truth and the correct resolution:
- Turn 1: “I live in Berlin.” (verified channel: account profile)
- Turn 4: “Actually we moved to Munich last month.” (chat, unverified)
- Turn 9: Ask the agent where the user lives.
A correct system under an unverified-source policy returns “I have Berlin from your profile and Munich from chat; can you confirm.” A correct system under newer-wins returns Munich and flags Berlin as superseded. A broken system returns both, returns Berlin because it ranks higher, or returns Munich but still surfaces Berlin in unrelated queries.
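The two policies in that scenario can be written down as a small resolver, which also gives the deterministic gate something to log. A minimal sketch, assuming each claim records its value, whether its channel was verified, and when it became valid; the policy names and the Claim and Resolution types are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class Claim:
    value: str
    verified: bool
    valid_at: datetime

@dataclass(frozen=True)
class Resolution:
    winner: Optional[Claim]     # None means "ask the user to confirm"
    superseded: tuple           # claims that lost to the winner
    needs_confirmation: bool

def resolve(claims, policy):
    ordered = sorted(claims, key=lambda c: c.valid_at)
    newest, older = ordered[-1], tuple(ordered[:-1])
    if policy == "newer_wins":
        return Resolution(newest, older, False)
    if policy == "verified_required":
        # An unverified update may not silently override an older verified claim.
        if not newest.verified and any(c.verified for c in older):
            return Resolution(None, (), True)
        return Resolution(newest, older, False)
    raise ValueError(f"unknown policy: {policy}")

berlin = Claim("Berlin", verified=True, valid_at=datetime(2026, 1, 10))
munich = Claim("Munich", verified=False, valid_at=datetime(2026, 2, 20))
newer = resolve([berlin, munich], "newer_wins")
strict = resolve([berlin, munich], "verified_required")
```

Under newer_wins the resolver returns Munich and lists Berlin as superseded. Under verified_required it returns no winner and asks for confirmation, which is the behavior the judge rubric should reward for that policy.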
Score with a CustomLLMJudge that takes the scenario, the documented policy, and the agent’s response:
```python
ContradictionResolution = CustomLLMJudge(
    name="ContradictionResolution",
    rubric=("Score 1-5 whether the system handled the conflict per policy. "
            "5: applied resolution rule, only winning fact surfaced. "
            "3: both facts surfaced without resolution. "
            "1: wrong fact asserted, or both stored as parallel truths."),
    grading_prompt_template="...",
)
```
Run the same shape for consolidation: score whether five turns mentioning the same preference end up as one stable record with a confidence score, or five entries the recall layer has to dedupe at query time. Mem0 merges aggressively; Zep’s graph merges entities while keeping relationships; Letta lets the agent decide. The right answer depends on your domain. The wrong answer is “we never tested it.”
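The consolidation check has a cheap deterministic form: after the session ends, count live records per canonical key. A minimal sketch, assuming the store can export (canonical_key, record_id) pairs for live records; that export format is hypothetical:

```python
def unconsolidated_keys(live_records):
    """live_records: iterable of (canonical_key, record_id) pairs.
    Returns {key: count} for keys that still have more than one live record."""
    ids_by_key = {}
    for key, record_id in live_records:
        ids_by_key.setdefault(key, set()).add(record_id)
    return {key: len(ids) for key, ids in ids_by_key.items() if len(ids) > 1}

# Five turns mentioned the same dietary preference; the store kept all five.
five_mentions = [("diet", f"r{i}") for i in range(5)] + [("city", "r9")]
duplicates = unconsolidated_keys(five_mentions)
```

An empty result means every key resolved to one record. A non-empty one tells you which keys the recall layer has to dedupe at query time.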
Dimension 4: forgetting (did what should be gone, go)
Forgetting is the silent killer and the dimension teams skip longest. A user retracts consent: does the fact leave the store, or sit there with a deleted=true flag that recall still walks over. A user closes an account: is the scoped memory purged, or does it leak into a fresh signup with the same email. A six-month-old price quote: suppressed, qualified, or surfaced as current.
Build the forgetting test set with explicit lifecycle events:
| Event type | Setup | Test |
|---|---|---|
| Explicit retraction | Write fact, later turn says “forget what I said about X” | Recall returns empty for that fact key |
| TTL expiry | Write fact with 90-day TTL, advance simulated time past expiry | Recall returns empty or marked-stale |
| Scope end | Write session-scoped fact, end session | Recall in a new session returns empty |
| Consent withdrawal | Write data, call delete-my-data, recall | Recall returns empty for that user globally |
| Account closure | Write tenant-scoped facts, close tenant | Recall under any user in tenant returns empty |
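The lifecycle events in the table can be exercised against a toy store before pointing the same assertions at a real backend. A minimal sketch with a simulated clock; ToyMemoryStore and its methods are hypothetical stand-ins, not the Mem0, Zep, Letta, or LangMem API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class StoredFact:
    value: str
    written_at: datetime
    ttl: Optional[timedelta] = None
    tombstoned: bool = False

class ToyMemoryStore:
    def __init__(self):
        self._facts = {}  # (user_id, key) -> StoredFact

    def write(self, user_id, key, value, at, ttl=None):
        self._facts[(user_id, key)] = StoredFact(value, at, ttl)

    def retract(self, user_id, key):
        fact = self._facts.get((user_id, key))
        if fact is not None:
            fact.tombstoned = True  # keep a tombstone, hide it from recall

    def delete_user(self, user_id):
        # Consent withdrawal: hard-delete everything scoped to the user.
        for uid, key in list(self._facts):
            if uid == user_id:
                del self._facts[(uid, key)]

    def recall(self, user_id, key, now):
        fact = self._facts.get((user_id, key))
        if fact is None or fact.tombstoned:
            return None
        if fact.ttl is not None and now >= fact.written_at + fact.ttl:
            return None  # expired facts are filtered on the recall path
        return fact.value

t0 = datetime(2026, 1, 1)
store = ToyMemoryStore()
store.write("u1", "price_quote", "$40/mo", at=t0, ttl=timedelta(days=90))
store.write("u1", "team", "Liverpool", at=t0)
store.retract("u1", "team")
```

The point of the toy is the shape of the assertions: every forgetting event ends in a recall that returns nothing, checked with the clock advanced where the event depends on time.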
Score forgetting with a deterministic gate (was the tombstone written, did the recall path filter on it) plus a CustomLLMJudge on the response:
```python
ForgettingCompliance = CustomLLMJudge(
    name="ForgettingCompliance",
    rubric=("Score 1-5 whether the agent omits a fact that was retracted, "
            "expired, scope-ended, or consent-withdrawn. "
            "5: agent has no knowledge of the removed fact. "
            "3: surfaces with 'this was retracted' framing. "
            "1: surfaces the removed fact as current and authoritative."),
    grading_prompt_template="...",
)
```
Pair forgetting with AnswerRefusal on the cross-tenant probe: if User A’s fact about Liverpool surfaces when User B asks for their favorite team, the right response is “I don’t know,” not “Liverpool.” Decay is the same shape with a clock skew: write at T-180 days, recall at T-0 with the underlying truth changed. The failure is the system surfacing the stale value as current.
Building the test set: LoCoMo as the floor, adversarial on top
LoCoMo (Long-Conversation Memory) is the public floor in 2026: a set of very long multi-session conversations, each averaging roughly 300 turns and 9K tokens across up to 35 sessions, with question-answer pairs covering single-hop, multi-hop, temporal, and open-domain memory. Use it the way teams use BFCL for tool calling: a model-selection signal and a sanity check on the backend. LoCoMo tests recall and temporal reasoning well; it doesn’t test your tenant scoping, retention policy, contradiction rules, or retraction flow. Don’t gate production on it alone.
The adversarial set is where the four dimensions become measurable. 200 to 500 multi-session scenarios deliberately built to trigger the failures one-metric eval misses:
| Slice | Setup | Dimension |
|---|---|---|
| Plain recall | Write fact, later session queries it | Recall |
| Explicit update | Write fact A, later turn updates to B, later turn asks | Freshness |
| Verified-vs-unverified | Profile says X, chat says Y, later turn asks | Contradiction handling |
| Newer-but-wrong | Old turn writes correct fact, new turn writes typo, asks | Contradiction handling |
| Retraction | Write fact, later turn explicitly retracts, asks | Forgetting |
| TTL expiry | Write with short TTL, advance clock past expiry | Forgetting |
| Cross-tenant probe | User A writes preference, User B asks | Forgetting + privacy boundary |
| Decay | Write at T-180d, ask at T-0 with truth changed | Forgetting + freshness |
Stratify so every framework-specific failure mode has at least 30 cases (Mem0’s consolidation default, Zep’s bi-temporal edge resolution, Letta’s self-edit, LangMem’s LangGraph store scoping). Promote failing production scenarios into the set weekly, the same pattern the LLM evaluation playbook uses for general eval datasets.
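The 30-case floor is easy to enforce mechanically. A minimal sketch, assuming each scenario is a dict with a "tag" field naming its slice or framework-specific failure mode; the schema and tag names are illustrative:

```python
from collections import Counter

MIN_CASES = 30  # the per-failure-mode floor from the text above

def undercovered(scenarios, required_tags, minimum=MIN_CASES):
    """Return {tag: count} for every required tag with fewer than `minimum` cases."""
    counts = Counter(s["tag"] for s in scenarios)
    return {tag: counts.get(tag, 0) for tag in required_tags
            if counts.get(tag, 0) < minimum}

scenarios = ([{"tag": "retraction"}] * 30) + ([{"tag": "decay"}] * 5)
gaps = undercovered(scenarios, ["retraction", "decay", "ttl_expiry"])
```

Run this as a CI pre-check so a thin slice fails the build before its noisy score gets read as a signal.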
The cross-tenant slice deserves its own callout. A leak is a P0, not a regression. Run it per-deploy and on a sample of live traffic. The structural fix lives at the gateway (per-tenant namespacing in Agent Command Center); the eval tests that the fix held.
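The check itself can be very small, which is why it is cheap to run on every deploy. A minimal sketch, assuming each recalled fact carries the tenant id it was written under; the field names are hypothetical:

```python
def cross_tenant_leaks(requesting_tenant, recalled):
    """Return ids of recalled facts that do not belong to the requesting tenant.
    A fact with no tenant_id counts as a leak, since unscoped memory is unsafe."""
    return [fact["fact_id"] for fact in recalled
            if fact.get("tenant_id") != requesting_tenant]

recalled_for_b = [
    {"fact_id": "f1", "tenant_id": "tenant-b", "value": "prefers email"},
    {"fact_id": "f2", "tenant_id": "tenant-a", "value": "supports Liverpool"},
    {"fact_id": "f3", "value": "unscoped note"},
]
leaks = cross_tenant_leaks("tenant-b", recalled_for_b)
```

Any non-empty result should page someone instead of lowering a score.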
Production observability: memory spans as first-class steps
Trace-only observability and eval-only checkpoints both miss things. The pattern that works is to attach to production spans the same rubric that runs against the adversarial set in CI. Three attributes on every memory span are the floor:
| Attribute | Values | Why it matters |
|---|---|---|
| memory.system | mem0, letta, zep, langmem, custom | Per-framework regression on the same trace |
| memory.scope | user, session, tenant, global | Privacy-boundary eval depends on this |
| memory.operation | write, recall, consolidate, forget | Per-dimension routing of eval scores |
Plus per-operation: memory.fact_count, memory.recall_score, and on writes memory.fact_age_days, memory.supersedes_count, memory.valid_at. The same Groundedness rubric that scores the CI set attaches to live recall spans via EvalTag; evals run server-side post-export at zero inline latency.
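Since the three floor attributes drive eval routing, it helps to lint them before scoring. A minimal sketch that validates a span's attribute dict against the allowed values in the table above; the function is illustrative and not part of traceAI:

```python
ALLOWED = {
    "memory.system": {"mem0", "letta", "zep", "langmem", "custom"},
    "memory.scope": {"user", "session", "tenant", "global"},
    "memory.operation": {"write", "recall", "consolidate", "forget"},
}

def span_attribute_errors(attributes):
    """Return a list of problems with the floor attributes; empty means valid."""
    errors = []
    for name, allowed in ALLOWED.items():
        if name not in attributes:
            errors.append(f"missing {name}")
        elif attributes[name] not in allowed:
            errors.append(f"invalid {name}={attributes[name]!r}")
    return errors

good = {"memory.system": "zep", "memory.scope": "tenant", "memory.operation": "recall"}
bad = {"memory.system": "mem0", "memory.operation": "read"}
good_errors = span_attribute_errors(good)
bad_errors = span_attribute_errors(bad)
```

A span that fails this lint cannot be routed to a dimension, so treat it as an instrumentation bug, not a missing score.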
The trace gives the diagnostic. When the eval fires “low freshness” on a Tuesday spike, the trace shows whether the supersede write ran, whether memory.supersedes_count was zero (the new write didn’t see the old one), whether valid_at was set, and which framework version was in the request path. That collapses the diagnostic from “memory broke” to “the consolidation pass on the 2026-04 Mem0 build stopped marking supersedes on address updates.” One bisect instead of three days.
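That diagnostic walk can be encoded as a first-pass triage over the write span's attributes. A minimal sketch that follows the same order of questions; the return strings and the assumption that the write span is available as a plain dict are illustrative:

```python
def triage_low_freshness(write_span):
    """write_span: attribute dict of the memory.write span for the failing
    fact key, or None if no such span exists in the trace."""
    if write_span is None:
        return "supersede write never ran"
    if write_span.get("memory.supersedes_count", 0) == 0:
        return "new write did not see the prior fact"
    if "memory.valid_at" not in write_span:
        return "valid_at not set on write"
    return "write path looks correct; inspect recall ranking"

diagnosis = triage_low_freshness(
    {"memory.operation": "write", "memory.supersedes_count": 0,
     "memory.valid_at": "2026-04-14T09:30:00Z"}
)
```

The output is a starting label for the cluster, not a root cause. It narrows the bisect to the write path or the recall path.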
The Error Feed runs HDBSCAN soft-clustering over failing memory traces and a Sonnet 4.5 Judge agent writes an immediate_fix per cluster. Typical clusters:
- “agent quotes superseded address on shipping_intent queries” → freshness rule on address
- “user A preference leaks into user B retrieval” → tenant-namespacing bug at the gateway
- “retracted fact surfaces under unrelated queries” → tombstone not honored in recall
- “recall p95 above 400 ms on tenants with > 5k facts” → consolidation bloat regression
Linear is the only Error Feed integration today. Slack, GitHub, Jira, and PagerDuty are roadmap, not shipped.
How Future AGI ships the memory eval stack
Future AGI ships the memory eval stack as a package, the same shape as the tool-calling stack one layer up. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the rubric loop needs self-improving evaluators and classifier-backed cost economics.
- ai-evaluation SDK (Apache 2.0): Groundedness, ContextRelevance, ChunkAttribution, ChunkUtilization for recall; FactualAccuracy for freshness against a current-truth source; DataPrivacyCompliance for PII checks on write and recall; AnswerRefusal for the cross-tenant probe; CustomLLMJudge for MemoryFreshness, ContradictionResolution, and ForgettingCompliance. Local for deterministic checks, cloud for LLM-judge rubrics.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback so the freshness rubric stays calibrated as retention policy evolves; in-product authoring agent for new memory rubrics; classifier-backed cascade at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): RETRIEVER and TOOL span kinds with the memory.* attribute set; EvalTag wires rubrics to live spans at zero inference latency. 50+ AI surfaces across 4 languages.
- Error Feed: HDBSCAN soft-clusters failing memory traces; Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy plus an immediate_fix per cluster. 4-dim trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution).
- Agent Command Center: per-virtual-key tag namespacing scopes memory at the gateway boundary; 18+ built-in guardrail scanners including PII and secret detection on the write path. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.
Compliance is load-bearing for memory specifically. A memory system stores user-volunteered facts, and the audit trail for retraction, deletion, and access has to map to GDPR Article 17 and the equivalent CCPA flow. Certification is what makes the eval defensible in a vendor questionnaire.
Three honest tradeoffs
- Per-dimension scoring costs more than a single conflated number. Four rubrics per scenario, not one. Payoff: when CI fails, the failing dimension is the root cause. Ship the deterministic gates first and turn on the LLM-judge rubrics once trace volume justifies it.
- The adversarial set takes weeks to build. 200 to 500 multi-session scenarios with deliberate updates, retractions, conflicts, and scope events isn’t a weekend. Seed from production failures and grow weekly; gating CI on LoCoMo alone isn’t enough.
- The freshness rubric is noisier than recall. Time-aware judgments are harder for LLM judges; pin a small human-labelled calibration set per domain and re-tune monthly. Calibration is the cost; the signal is worth it.
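The calibration step in the last tradeoff comes down to comparing judge scores against human labels on the same cases. A minimal sketch, assuming both use the 1-5 scale from the rubrics; which agreement level counts as calibrated is a per-domain decision this sketch does not make:

```python
from statistics import mean

def calibration_stats(judge_scores, human_scores):
    """Compare LLM-judge scores with human labels for the same cases."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "exact_agreement": mean(1.0 if d == 0 else 0.0 for d in diffs),
        "within_one": mean(1.0 if d <= 1 else 0.0 for d in diffs),
        "mean_abs_error": mean(diffs),
    }

judge = [5, 3, 1, 5]
human = [5, 5, 1, 4]
stats = calibration_stats(judge, human)
```

Track these per dimension. If the freshness judge drifts while the recall judge holds, that separation is the reason to keep the rubrics apart.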
Ready to evaluate your memory system? Wire Groundedness + ContextRelevance for recall, MemoryFreshness for freshness, ContradictionResolution for conflicts, and ForgettingCompliance + AnswerRefusal for the cross-tenant probe into a pytest fixture against the ai-evaluation SDK. Attach the same templates as EvalTag scorers on memory.* spans via traceAI when production traces start asking questions the CI gate missed.