Evaluating Agent Memory Systems in 2026: Four Dimensions Most Reports Miss

Evaluating agent memory is four problems, not one: recall, freshness, contradiction handling, forgetting. A 2026 framework for Mem0, Zep, Letta, LangMem.


A support agent quotes the user’s address from a March conversation. The user updated it in June. The store has both writes; the March one ranks higher because its embedding scores closer to the query. Recall is technically correct, and the package still went to the wrong street. The postmortem says “memory worked.” That’s the failure mode inside most public memory eval reports: recall is the only dimension scored.

If your memory eval reports one number, you’re grading a quarter of the system. The claim this post defends: agent memory eval is four problems, not one. Recall (retrieved the right fact). Freshness (used the LATEST fact). Contradiction handling (resolved conflicting facts). Forgetting (deleted what should be gone). Score the four independently. This guide walks through each dimension, the rubric that catches it, how LoCoMo fits as the public floor, the adversarial set on top, how to instrument memory spans, and how the Future AGI eval stack wires the rubrics into CI and live traces against Mem0, Zep, Letta, LangMem, and homegrown stores.

TL;DR: the four-dimension framework

| Dimension | What you measure | Deterministic check | LLM-judge rubric |
| --- | --- | --- | --- |
| 1. Recall | Right fact surfaced for the turn | top-k presence, hit-rate by fact id | Groundedness, ContextRelevance, ChunkAttribution |
| 2. Freshness | LATEST version used when fact has updates | valid_at newer than older write with same key | MemoryFreshness CustomLLMJudge |
| 3. Contradiction handling | Right version wins when stored facts disagree | conflict-policy ran, decision logged | ContradictionResolution CustomLLMJudge |
| 4. Forgetting | Retracted, expired, scope-ended facts gone | recall returns empty on tombstoned fact id | ForgettingCompliance CustomLLMJudge + AnswerRefusal |

Non-negotiables: per-dimension scoring, an adversarial set that explicitly retracts and updates facts, a privacy-boundary slice (memory crossing tenants is a P0, not a regression), and trace-attached scores so the same rubric runs in CI and on live traffic.

Why one-metric memory eval misses production failures

Single-session eval misses the failure modes that matter for memory. Four properties make memory different.

Memory crosses sessions. A bad write in March affects a turn in September. The blast radius of one wrong fact is unbounded until someone reads the store or the entry decays.

Memory has three modes that fail differently. Episodic (past conversations), semantic (facts about the user or account), procedural (learned workflows). Each has different write triggers, recall queries, and acceptable error rates. Treating them as one bucket produces a “recall accuracy” number that doesn’t predict anything.

Memory writes are privacy events. A logging system that captures an API key is a bug. A memory system that captures an API key is the same bug plus a guarantee that the key resurfaces in a future prompt.

Memory drift accumulates silently. Recalled memory looks plausible right up until it isn’t. Without freshness and forgetting rubrics, a six-month-old price keeps surfacing as authoritative until a customer notices.

Mem0, Zep, Letta, and LangMem each ship sensible defaults tuned for the median case, not your retention policy or contradiction rules. Eval is how you find where the defaults break.

Dimension 1: recall (the dimension everyone measures)

Pull the chunks the memory system returned, compare against the gold set of facts a correct recall would have surfaced, score top-k hit-rate plus precision. Instrument the recall call as a retriever span:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, SpanAttributes, FiSpanKindValues
from opentelemetry import trace

register(project_type=ProjectType.OBSERVE, project_name="agent-memory-eval")
tracer = trace.get_tracer(__name__)

def recall_with_span(user_id: str, query: str):
    # mem_store stands in for your memory client; adapt search() to its real signature.
    with tracer.start_as_current_span("memory.recall") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, FiSpanKindValues.RETRIEVER.value)
        span.set_attribute("memory.system", "mem0")
        span.set_attribute("memory.scope", "user")
        span.set_attribute("memory.operation", "recall")
        span.set_attribute("memory.query", query)
        results = mem_store.search(user_id=user_id, query=query, k=5)
        # Record how many facts came back so empty recalls are visible on the trace.
        span.set_attribute("memory.fact_count", len(results))
        return results

With recall on the trace tree the eval is shaped like RAG eval. Score Groundedness (does the response only assert what the recalled memory supports), ContextRelevance (were the chunks relevant), and ChunkAttribution (which chunk drove the response). The agentic RAG playbook covers the same metric family; memory recall is RAG eval with a temporal axis bolted on.

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextRelevance, ChunkAttribution
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
scores = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextRelevance(), ChunkAttribution()],
    inputs=[
        TestCase(query=turn.user_text, response=turn.agent_text,
                 context="\n".join(turn.recalled_chunks))
        for turn in golden_session.turns if turn.required_recall
    ],
)
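
The deterministic half of the recall check (top-k hit-rate plus precision against a gold set of fact ids) needs no judge at all. The following is a minimal sketch under stated assumptions: the function name `recall_metrics`, the returned dictionary keys, and the example fact ids are all hypothetical, and it presumes your store returns stable fact ids that can be compared against a gold set.

```python
from typing import Sequence, Set


def recall_metrics(returned_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> dict:
    """Deterministic recall checks over fact ids (no LLM judge involved)."""
    top_k = list(returned_ids)[:k]
    hits = [fid for fid in top_k if fid in gold_ids]
    return {
        # Did at least one gold fact appear in the top k?
        "hit": bool(hits),
        # Share of gold facts that were surfaced.
        "recall_at_k": len(set(hits)) / len(gold_ids) if gold_ids else 1.0,
        # Share of surfaced facts that were wanted.
        "precision_at_k": len(hits) / len(top_k) if top_k else 0.0,
    }


# Hypothetical turn: four facts recalled, two of them in the gold set.
metrics = recall_metrics(
    returned_ids=["f_addr_june", "f_pet", "f_plan", "f_addr_march"],
    gold_ids={"f_addr_june", "f_plan"},
)
```

Running this gate before the LLM-judge rubrics keeps the cheap failures (nothing relevant recalled at all) out of the judge's budget.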

Recall above 0.95 looks great in a blog post and tells you nothing about whether the system is safe in production. Recall is the floor. The next three dimensions are where postmortem bugs hide.

Dimension 2: freshness (did you use the LATEST fact)

Freshness is the dimension most frameworks claim to handle and most teams never explicitly score. The eval needs three things: an adversarial set with deliberate updates to the same fact key, a deterministic gate that the conflict-policy hook actually ran, and an LLM-judge rubric on whether the agent’s response reflects the latest version.

The deterministic gate is the cheapest signal. If two writes have the same canonical key and recall returned both, the freshness layer either resolved one as winner or punted to the agent. Either path needs a span attribute so the eval can score it:

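from datetime import datetime, timezone

def write_with_freshness_span(user_id: str, key: str, value: str):
    # mem_store stands in for your memory client; its methods are illustrative.
    with tracer.start_as_current_span("memory.write") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, FiSpanKindValues.TOOL.value)
        span.set_attribute("memory.system", "mem0")
        span.set_attribute("memory.scope", "user")
        span.set_attribute("memory.operation", "write")
        span.set_attribute("memory.fact_key", key)
        # One timestamp for the supersede marks and the new write, so they agree.
        ts = datetime.now(timezone.utc)
        span.set_attribute("memory.valid_at", ts.isoformat())
        prior = mem_store.find_by_key(user_id=user_id, key=key)
        span.set_attribute("memory.supersedes_count", len(prior))
        for p in prior:
            mem_store.mark_superseded(p.id, by=key, at=ts)
        return mem_store.write(user_id=user_id, key=key, value=value, valid_at=ts)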
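
On the read side, the deterministic freshness check from the TL;DR table (the fact the agent used must be the newest write for its key) can be sketched without any backend. This is an illustration under assumptions: `FactVersion`, `latest_version`, and `freshness_gate` are hypothetical names, and it presumes each recalled record exposes a canonical key, a `valid_at` timestamp, and a superseded flag.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class FactVersion:
    fact_id: str
    key: str              # canonical fact key, e.g. "shipping_address"
    value: str
    valid_at: datetime
    superseded: bool = False


def latest_version(facts: List[FactVersion], key: str) -> Optional[FactVersion]:
    """Newest non-superseded write for a key, or None if there is none."""
    live = [f for f in facts if f.key == key and not f.superseded]
    return max(live, key=lambda f: f.valid_at, default=None)


def freshness_gate(recalled: List[FactVersion], used_fact_id: str, key: str) -> bool:
    """Pass only if the fact the agent used is the newest live version."""
    newest = latest_version(recalled, key)
    return newest is not None and newest.fact_id == used_fact_id


# The opening scenario: a March address and a June update, both recalled.
march = FactVersion("f1", "shipping_address", "12 Old St",
                    datetime(2026, 3, 4, tzinfo=timezone.utc))
june = FactVersion("f2", "shipping_address", "9 New Ave",
                   datetime(2026, 6, 10, tzinfo=timezone.utc))
passed = freshness_gate([march, june], used_fact_id="f2", key="shipping_address")
failed = freshness_gate([march, june], used_fact_id="f1", key="shipping_address")
```

The gate deliberately ignores ranking scores: a higher-ranked older write still fails it.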

The freshness rubric is a CustomLLMJudge: 5 if the response reflects the latest valid fact, 3 if it surfaces both with “I have two on file” hedging, 1 if it asserts the older version as current.

from fi.evals.templates import CustomLLMJudge

MemoryFreshness = CustomLLMJudge(
    name="MemoryFreshness",
    rubric=("Score 1-5 whether the agent used the LATEST valid version "
            "of a fact updated in memory. 5: latest, older not surfaced. "
            "3: both surfaced with hedging. 1: older asserted as current."),
    grading_prompt_template="...",
)

Frameworks differ on freshness primitives. Mem0 ships per-fact timestamps and configurable TTLs. Zep ships bi-temporal modeling with valid_at and recorded_at on every graph edge, the strongest out-of-the-box freshness story in the open category. Letta ships memory blocks the agent can self-edit, which moves freshness into agent-authored policy and needs a stricter rubric. LangMem leans on the surrounding LangGraph store. None matches your domain by default; the eval is how you tune them.

Dimension 3: contradiction handling (when stored facts disagree)

Contradiction handling overlaps with freshness and isn’t the same. Freshness asks “which version is newer.” Contradiction handling asks “which version is right” when newer isn’t right. A typo today shouldn’t overwrite a correct preference from last month. An update from an unverified channel shouldn’t override one from a verified source.

This is where every framework’s default is wrong for somebody. Build the test with multi-turn scenarios where you know the ground truth and the correct resolution:

  • Turn 1: “I live in Berlin.” (verified channel: account profile)
  • Turn 4: “Actually we moved to Munich last month.” (chat, unverified)
  • Turn 9: Ask the agent where the user lives.

A correct system under an unverified-source policy returns “I have Berlin from your profile and Munich from chat; can you confirm.” A correct system under newer-wins returns Munich and flags Berlin as superseded. A broken system returns both, returns Berlin because it ranks higher, or returns Munich but still surfaces Berlin in unrelated queries.
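
The two policies in the Berlin and Munich scenario can be written down as a small resolver, which also gives the eval a deterministic expected outcome per scenario. The sketch below is one possible reading, not any framework's behavior: `Claim`, `resolve`, and the policy names `newer_wins` and `verified_source` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Claim:
    value: str
    turn: int        # later turn means newer
    verified: bool   # True when the claim came from a verified channel


def resolve(claims: List[Claim], policy: str) -> dict:
    """Return the winning value, or a confirmation request, under a named policy."""
    if policy not in ("newer_wins", "verified_source"):
        raise ValueError(f"unknown policy: {policy}")
    newest = max(claims, key=lambda c: c.turn)
    if policy == "verified_source" and not newest.verified:
        # An unverified newer claim must not silently override a verified one.
        conflicting = [c for c in claims if c.verified and c.value != newest.value]
        if conflicting:
            candidates = [c.value for c in conflicting] + [newest.value]
            return {"action": "confirm", "candidates": candidates}
    superseded = [c.value for c in claims if c is not newest]
    return {"action": "answer", "value": newest.value, "superseded": superseded}


claims = [Claim("Berlin", turn=1, verified=True),
          Claim("Munich", turn=4, verified=False)]
newer = resolve(claims, "newer_wins")
cautious = resolve(claims, "verified_source")
```

Logging the returned `action` on the memory span is one way to satisfy the "conflict-policy ran, decision logged" check.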

Score with a CustomLLMJudge that takes the scenario, the documented policy, and the agent’s response:

ContradictionResolution = CustomLLMJudge(
    name="ContradictionResolution",
    rubric=("Score 1-5 whether the system handled the conflict per policy. "
            "5: applied resolution rule, only winning fact surfaced. "
            "3: both facts surfaced without resolution. "
            "1: wrong fact asserted, or both stored as parallel truths."),
    grading_prompt_template="...",
)

Run the same shape for consolidation: score whether five turns mentioning the same preference end up as one stable record with a confidence score, or five entries the recall layer has to dedupe at query time. Mem0 merges aggressively; Zep’s graph merges entities while keeping relationships; Letta lets the agent decide. The right answer depends on your domain. The wrong answer is “we never tested it.”

Dimension 4: forgetting (did what should be gone, go)

Forgetting is the silent killer and the dimension teams skip longest. A user retracts consent: does the fact leave the store, or sit there with a deleted=true flag that recall still walks over? A user closes an account: is the scoped memory purged, or does it leak into a fresh signup with the same email? A six-month-old price quote: is it suppressed, qualified, or surfaced as current?

Build the forgetting test set with explicit lifecycle events:

| Event type | Setup | Test |
| --- | --- | --- |
| Explicit retraction | Write fact, later turn says “forget what I said about X” | Recall returns empty for that fact key |
| TTL expiry | Write fact with 90-day TTL, advance simulated time past expiry | Recall returns empty or marked-stale |
| Scope end | Write session-scoped fact, end session | Recall in a new session returns empty |
| Consent withdrawal | Write data, call delete-my-data, recall | Recall returns empty for that user globally |
| Account closure | Write tenant-scoped facts, close tenant | Recall under any user in tenant returns empty |

Score forgetting with a deterministic gate (was the tombstone written, did the recall path filter on it) plus a CustomLLMJudge on the response:

ForgettingCompliance = CustomLLMJudge(
    name="ForgettingCompliance",
    rubric=("Score 1-5 whether the agent omits a fact that was retracted, "
            "expired, scope-ended, or consent-withdrawn. "
            "5: agent has no knowledge of the removed fact. "
            "3: surfaces with 'this was retracted' framing. "
            "1: surfaces the removed fact as current and authoritative."),
    grading_prompt_template="...",
)
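
The deterministic gate for forgetting (tombstone written, recall path filters on it, TTL honored under a simulated clock) can be exercised against a toy store before pointing the same assertions at a real backend. The sketch below is an assumption-heavy stand-in: `StoredFact` and `ToyStore` are hypothetical and do not mirror any vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class StoredFact:
    key: str
    value: str
    written_at: datetime
    ttl: Optional[timedelta] = None
    tombstoned: bool = False


class ToyStore:
    """In-memory stand-in for a memory backend."""

    def __init__(self) -> None:
        self._facts: Dict[str, StoredFact] = {}

    def write(self, fact_id: str, fact: StoredFact) -> None:
        self._facts[fact_id] = fact

    def forget(self, fact_id: str) -> None:
        # Retraction writes a tombstone rather than relying on ranking to hide the fact.
        self._facts[fact_id].tombstoned = True

    def recall(self, key: str, now: datetime) -> List[StoredFact]:
        live = []
        for fact in self._facts.values():
            expired = fact.ttl is not None and now >= fact.written_at + fact.ttl
            if fact.key == key and not fact.tombstoned and not expired:
                live.append(fact)
        return live


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
store = ToyStore()
store.write("q1", StoredFact("price_quote", "$40/mo", t0, ttl=timedelta(days=90)))
store.write("a1", StoredFact("address", "12 Old St", t0))
store.forget("a1")  # explicit retraction

before_expiry = store.recall("price_quote", now=t0 + timedelta(days=30))
after_expiry = store.recall("price_quote", now=t0 + timedelta(days=91))
retracted = store.recall("address", now=t0 + timedelta(days=1))
```

Passing `now` explicitly is what lets the TTL-expiry row of the table run in CI without waiting 90 days.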

Pair forgetting with AnswerRefusal on the cross-tenant probe: if User A’s fact about Liverpool surfaces when User B asks for their favorite team, the right response is “I don’t know,” not “Liverpool.” Decay is the same shape with a clock skew: write at T-180 days, recall at T-0 with the underlying truth changed. The failure is the system surfacing the stale value as current.

Building the test set: LoCoMo as the floor, adversarial on top

LoCoMo (Long-Conversation Memory) is the public floor in 2026: roughly 600 multi-session dialogs averaging 26 sessions per conversation, with question-answer pairs covering single-hop, multi-hop, temporal, and open-domain memory. Use it the way teams use BFCL for tool calling: a model-selection signal and a sanity check on the backend. LoCoMo tests recall and temporal reasoning well; it doesn’t test your tenant scoping, retention policy, contradiction rules, or retraction flow. Don’t gate production on it alone.

The adversarial set is where the four dimensions become measurable. 200 to 500 multi-session scenarios deliberately built to trigger the failures one-metric eval misses:

| Slice | Setup | Dimension |
| --- | --- | --- |
| Plain recall | Write fact, later session queries it | Recall |
| Explicit update | Write fact A, later turn updates to B, later turn asks | Freshness |
| Verified-vs-unverified | Profile says X, chat says Y, later turn asks | Contradiction handling |
| Newer-but-wrong | Old turn writes correct fact, new turn writes typo, asks | Contradiction handling |
| Retraction | Write fact, later turn explicitly retracts, asks | Forgetting |
| TTL expiry | Write with short TTL, advance clock past expiry | Forgetting |
| Cross-tenant probe | User A writes preference, User B asks | Forgetting + privacy boundary |
| Decay | Write at T-180d, ask at T-0 with truth changed | Forgetting + freshness |

Stratify so every framework-specific failure mode has at least 30 cases (Mem0’s consolidation default, Zep’s bi-temporal edge resolution, Letta’s self-edit, LangMem’s LangGraph store scoping). Promote failing production scenarios into the set weekly, the same pattern the LLM evaluation playbook uses for general eval datasets.
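
A coverage check keeps the stratification honest as the set grows. The sketch below applies the 30-case minimum to the slice names from the table above, which is an assumption (the text sets the minimum per framework-specific failure mode); the slice identifiers, the `slice` field, and the function name `undersized` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical identifiers for the slices in the adversarial-set table.
SLICES = [
    "plain_recall", "explicit_update", "verified_vs_unverified", "newer_but_wrong",
    "retraction", "ttl_expiry", "cross_tenant_probe", "decay",
]


def undersized(scenarios: Iterable[dict], minimum: int = 30) -> Dict[str, int]:
    """Map each slice with fewer than `minimum` scenarios to its current count."""
    counts = Counter(s["slice"] for s in scenarios)
    return {name: counts[name] for name in SLICES if counts[name] < minimum}


# Toy set: one slice fully covered, one partially covered, the rest empty.
scenarios = [{"slice": "plain_recall"}] * 30 + [{"slice": "retraction"}] * 12
gaps = undersized(scenarios)
```

Failing the build when `gaps` is non-empty is one way to stop a slice from quietly thinning out as scenarios are retired.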

The cross-tenant slice deserves its own callout. A leak is a P0, not a regression. Run it per-deploy and on a sample of live traffic. The structural fix lives at the gateway (per-tenant namespacing in Agent Command Center); the eval tests that the fix held.
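
The cross-tenant probe reduces to one assertion: nothing in a recall result may belong to anyone but the requester. A minimal sketch, assuming every stored record carries its owning tenant and user; `NamespacedStore` and `leaked` are hypothetical names, not part of any gateway or memory SDK.

```python
from typing import Dict, List, Tuple


class NamespacedStore:
    """Toy store that keys every fact by (tenant_id, user_id)."""

    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], List[dict]] = {}

    def write(self, tenant_id: str, user_id: str, key: str, value: str) -> None:
        record = {"tenant_id": tenant_id, "user_id": user_id, "key": key, "value": value}
        self._facts.setdefault((tenant_id, user_id), []).append(record)

    def recall(self, tenant_id: str, user_id: str, key: str) -> List[dict]:
        # Only the requester's own namespace is ever read.
        own = self._facts.get((tenant_id, user_id), [])
        return [r for r in own if r["key"] == key]


def leaked(results: List[dict], tenant_id: str, user_id: str) -> List[dict]:
    """Records in a recall result that do not belong to the requester."""
    return [r for r in results if (r["tenant_id"], r["user_id"]) != (tenant_id, user_id)]


store = NamespacedStore()
store.write("acme", "user_a", "favorite_team", "Liverpool")
probe = store.recall("acme", "user_b", "favorite_team")   # User B asks
probe_leaks = leaked(probe, "acme", "user_b")

# What the detector reports if a buggy recall path returned User A's record.
buggy_result = store.recall("acme", "user_a", "favorite_team")
buggy_leaks = leaked(buggy_result, "acme", "user_b")
```

The `leaked` check is the part worth running on sampled live traffic, since it only needs the ownership fields on the recalled records.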

Production observability: memory spans as first-class steps

Trace-only observability and eval-only checkpoints both miss things. The pattern that works is to attach the same rubric to spans in production that runs against the adversarial set in CI. Three attributes on every memory span are the floor:

| Attribute | Values | Why it matters |
| --- | --- | --- |
| memory.system | mem0, letta, zep, langmem, custom | Per-framework regression on the same trace |
| memory.scope | user, session, tenant, global | Privacy-boundary eval depends on this |
| memory.operation | write, recall, consolidate, forget | Per-dimension routing of eval scores |

Plus per-operation: memory.fact_count, memory.recall_score, and on writes memory.fact_age_days, memory.supersedes_count, memory.valid_at. The same Groundedness rubric that scores the CI set attaches to live recall spans via EvalTag; evals run server-side post-export at zero inline latency.
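
Because the eval routing depends on these attributes, it can help to lint them before spans are exported. The sketch below validates a plain attribute dictionary against the three floor attributes and their listed values; `ALLOWED` and `attribute_problems` are hypothetical names, and how you obtain the dictionary from your tracer is left as an assumption.

```python
from typing import Dict, List

# Allowed values copied from the attribute table above.
ALLOWED = {
    "memory.system": {"mem0", "letta", "zep", "langmem", "custom"},
    "memory.scope": {"user", "session", "tenant", "global"},
    "memory.operation": {"write", "recall", "consolidate", "forget"},
}


def attribute_problems(attrs: Dict[str, object]) -> List[str]:
    """List missing or unexpected floor attributes on one memory span."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name not in attrs:
            problems.append(f"missing {name}")
        elif attrs[name] not in allowed:
            problems.append(f"unexpected {name}={attrs[name]!r}")
    return problems


ok = attribute_problems(
    {"memory.system": "mem0", "memory.scope": "user", "memory.operation": "recall"}
)
bad = attribute_problems({"memory.system": "mem0", "memory.operation": "read"})
```

A span that fails this lint would otherwise be silently skipped by per-dimension routing, which looks the same as a dimension with no failures.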

The trace gives the diagnostic. When the eval fires “low freshness” on a Tuesday spike, the trace shows whether the supersede write ran, whether memory.supersedes_count was zero (the new write didn’t see the old one), whether valid_at was set, and which framework version was in the request path. That collapses the diagnostic from “memory broke” to “the consolidation pass on the 2026-04 Mem0 build stopped marking supersedes on address updates.” One bisect instead of three days.

The Error Feed runs HDBSCAN soft-clustering over failing memory traces and a Sonnet 4.5 Judge agent writes an immediate_fix per cluster. Typical clusters:

  • “agent quotes superseded address on shipping_intent queries” → freshness rule on address
  • “user A preference leaks into user B retrieval” → tenant-namespacing bug at the gateway
  • “retracted fact surfaces under unrelated queries” → tombstone not honored in recall
  • “recall p95 above 400 ms on tenants with > 5k facts” → consolidation bloat regression

Linear is the only Error Feed integration today. Slack, GitHub, Jira, and PagerDuty are roadmap, not shipped.

How Future AGI ships the memory eval stack

Future AGI ships the memory eval stack as a package, the same shape as the tool-calling stack one layer up. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the rubric loop needs self-improving evaluators and classifier-backed cost economics.

  • ai-evaluation SDK (Apache 2.0): Groundedness, ContextRelevance, ChunkAttribution, ChunkUtilization for recall; FactualAccuracy for freshness against a current-truth source; DataPrivacyCompliance for PII checks on write and recall; AnswerRefusal for the cross-tenant probe; CustomLLMJudge for MemoryFreshness, ContradictionResolution, and ForgettingCompliance. Local for deterministic checks, cloud for LLM-judge rubrics.
  • Future AGI Platform: self-improving evaluators tuned by thumbs feedback so the freshness rubric stays calibrated as retention policy evolves; in-product authoring agent for new memory rubrics; classifier-backed cascade at lower per-eval cost than Galileo Luna-2.
  • traceAI (Apache 2.0): RETRIEVER and TOOL span kinds with the memory.* attribute set; EvalTag wires rubrics to live spans at zero inference latency. 50+ AI surfaces across 4 languages.
  • Error Feed: HDBSCAN soft-clusters failing memory traces; Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy plus an immediate_fix per cluster. 4-dim trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution).
  • Agent Command Center: per-virtual-key tag namespacing scopes memory at the gateway boundary; 18+ built-in guardrail scanners including PII and secret detection on the write path. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

Compliance is load-bearing for memory specifically. A memory system stores user-volunteered facts, and the audit trail for retraction, deletion, and access has to map to GDPR Article 17 and the equivalent CCPA flow. Certification is what makes the eval defensible in a vendor questionnaire.

Three honest tradeoffs

  • Per-dimension scoring costs more than a single conflated number. Four rubrics per scenario, not one. Payoff: when CI fails, the failing dimension is the root cause. Ship the deterministic gates first and turn on the LLM-judge rubrics once trace volume justifies it.
  • The adversarial set takes weeks to build. 200 to 500 multi-session scenarios with deliberate updates, retractions, conflicts, and scope events isn’t a weekend. Seed from production failures and grow weekly; gating CI on LoCoMo alone isn’t enough.
  • The freshness rubric is noisier than recall. Time-aware judgments are harder for LLM judges; pin a small human-labelled calibration set per domain and re-tune monthly. Calibration is the cost; the signal is worth it.
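
For the calibration set in the last tradeoff, two numbers are usually enough to see whether a judge has drifted from the human labels: exact agreement and mean absolute error on the 1-5 scale. A minimal sketch, with a hypothetical function name and made-up example scores; which thresholds trigger a re-tune is left to you.

```python
from typing import Dict, Sequence


def calibration_report(judge: Sequence[int], human: Sequence[int]) -> Dict[str, float]:
    """Compare judge scores with human labels for the same scenarios."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge, human)]
    return {
        # Share of scenarios where the judge matched the human label exactly.
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        # Average distance between judge and human on the 1-5 scale.
        "mean_abs_error": sum(diffs) / len(diffs),
    }


# Made-up scores for four freshness scenarios.
report = calibration_report(judge=[5, 3, 1, 5], human=[5, 5, 1, 3])
```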

Ready to evaluate your memory system? Wire Groundedness + ContextRelevance for recall, MemoryFreshness for freshness, ContradictionResolution for conflicts, ForgettingCompliance for forgetting, and AnswerRefusal for the cross-tenant probe into a pytest fixture against the ai-evaluation SDK. Attach the same templates as EvalTag scorers on memory.* spans via traceAI when production traces start asking questions the CI gate missed.

Frequently asked questions

What are the four dimensions of agent memory evaluation?
Memory eval is four problems stacked, not one. Recall: did the agent retrieve the right fact for this turn. Freshness: did the agent use the LATEST fact when the same fact has been updated. Contradiction handling: when two stored facts disagree, did the agent resolve the conflict correctly (newer wins, more authoritative source wins, or surface both with confidence). Forgetting: did the agent delete or suppress facts that were explicitly retracted, expired, or scope-bound to a session that ended. Most public memory reports grade recall and stop. The production failures live in the other three. Score the four independently with per-dimension rubrics, not a single conflated 'memory accuracy' number.
Why isn't recall accuracy enough for memory eval?
Recall measures whether the right fact was retrieved. It does not measure whether the retrieved fact is still true (freshness), whether the agent picked the correct version when two facts disagree (contradiction handling), or whether facts that should be gone are actually gone (forgetting). A memory system that scores 95 percent on recall can still ship the failure where a user updates their address in June and the agent quotes the March address in August because the original write still has higher similarity. Recall is necessary; it's not sufficient. The post explains the rubric for each of the other three and how to score them against LoCoMo plus a custom adversarial set.
What is the LoCoMo benchmark and how do you use it?
LoCoMo (Long-Conversation Memory) is the public floor for long-horizon memory eval: roughly 600 multi-session conversations with question-answer pairs targeting single-hop, multi-hop, temporal, and open-domain memory across 26 average sessions per dialog. Use it the way you use BFCL for tool calling: a model-selection signal and a sanity check on the underlying memory backend. It does not test your tenant scoping, your retention policy, or your contradiction-resolution rules. Stratify a private adversarial set on top with deliberate retractions, deliberate updates, deliberate conflicts, and deliberate session-scoped facts that should not survive. Gate CI on the private set.
How do you instrument memory operations as OpenTelemetry spans?
Wrap every memory read as a span with `fi.span.kind=RETRIEVER` and every memory write as a span with `fi.span.kind=TOOL`. Attach custom attributes: `memory.system` (mem0, letta, zep, langmem, memgpt, custom), `memory.scope` (user, session, tenant, global), `memory.operation` (write, recall, consolidate, forget), `memory.fact_count`, `memory.recall_score`, and on writes `memory.fact_age_days` so freshness can be scored downstream. With traceAI, the spans land in the same trace tree as LLM calls and tool calls, which makes memory a first-class step in the trace rather than a hidden side effect of the framework.
Which Future AGI evaluators apply to memory?
From the ai-evaluation SDK: Groundedness scores whether the response only asserts what the recalled memory supports; ContextAdherence scores whether the response stays within recalled context; FactualAccuracy scores whether a recalled fact still matches reality; DataPrivacyCompliance scores PII leakage at write and read; AnswerRefusal scores whether the agent refused a query it should not have been able to answer. On top of those, build CustomLLMJudge rubrics for the four named dimensions: MemoryRecallAccuracy, MemoryFreshness, ContradictionResolution, ForgettingCompliance. Attach the same rubrics to live traces via EvalTag so the production policy and the regression rubric stay in sync.
How do Mem0, Zep, Letta, LangMem, and MemGPT differ on these four dimensions?
Mem0 ships timestamped facts with configurable TTLs (freshness and forgetting are explicit primitives). Zep ships temporal knowledge graphs with bi-temporal modeling, which means valid-at and recorded-at sit on every edge (strongest out-of-the-box freshness story). Letta (productized successor to the MemGPT paper) ships hierarchical memory blocks the agent can self-edit, which is powerful but moves contradiction handling into agent-authored policy. LangMem ships memory primitives inside LangChain and leans on the surrounding graph for scope. MemGPT itself now redirects to Letta and is historical context. None of these defaults will match your bank's retention policy, your healthcare consent model, or your support-agent retraction flow. The eval is how you find out where the defaults break.
What are the common memory-eval anti-patterns?
Five recur in postmortems. One, only testing within a session, when cross-session is where real memory bugs live. Two, scoring a single conflated 'memory accuracy' and missing freshness, contradiction, and forgetting. Three, no privacy boundary eval, so user-to-user leaks ship silently. Four, no decay or retention eval, so a six-month-old price keeps surfacing as authoritative. Five, evaluating memory by reading the store directly instead of grading what the agent says against what the recall returned. The agent step is where most failures actually surface.