Security

What Is Cross-Session Leak?

An AI security failure where one user's private context or output appears in another user's session.

Cross-session leak is an AI security failure where data from one user, tenant, or conversation appears in another session’s prompt, memory, tool result, cache hit, or model response. It is a privacy and isolation failure in production traces, eval pipelines, agent memory, and gateways, especially when long-lived context, shared semantic caches, or reused retrieval filters blur session boundaries. FutureAGI detects it with the PII evaluator, trace session identifiers, and guardrails before leaked data reaches the user.

Why cross-session leak matters in production LLM/agent systems

Cross-session leak turns an LLM app into a data-isolation incident. The visible failure is simple: a user sees another customer’s name, account summary, support ticket, conversation history, or generated answer. The root cause is usually harder to find because the leak may pass through retrieval, memory, caching, tool output, or a reused trace fixture before the final response.

Two failure modes dominate production incidents. Tenant bleed-through happens when retrieval filters, vector metadata, or application caches fail to enforce the user or tenant boundary. Memory contamination happens when an agent stores one session’s facts and later treats them as reusable context for a different user. Both are worse than ordinary hallucination because the output may be accurate, private, and unauthorized.
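The tenant bleed-through case above can be sketched as a retrieval wrapper that both filters and re-verifies tenant ownership. This is a minimal illustration, not a specific vendor API: the `store.search` call, its `filter` parameter, and the hit shape are assumptions.

```python
# Hypothetical tenant-scoped retrieval wrapper. `store.search` and the
# hit/metadata shape are illustrative assumptions, not a real SDK call.
def tenant_scoped_search(store, query, active_tenant, k=5):
    """Query the vector store, then re-verify tenant ownership per hit."""
    hits = store.search(query, filter={"tenant_id": active_tenant}, k=k)
    # Defense in depth: never trust the metadata filter alone; drop any
    # chunk whose own metadata names a different tenant.
    return [h for h in hits if h["metadata"].get("tenant_id") == active_tenant]
```

The second filtering pass is what catches bleed-through when the store-side filter is misconfigured or the metadata was indexed under the wrong tenant.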

Developers feel this as a confusing trace: the model appears to answer correctly, but the source document, memory entry, or cache key belongs to someone else. SREs see a privacy spike without a matching latency or error-rate spike. Compliance teams need proof of which session supplied the data, which route served it, and whether logs preserved the exposed value. End users lose trust immediately because the product proves it can mix identities.

This matters more in 2026-era agentic pipelines because session state now lives in more places: agent memory, MCP tool responses, browser state, semantic caches, vector stores, workflow retries, and human review queues. Single-turn chat tests rarely cover those boundaries.

How FutureAGI handles cross-session leak

FutureAGI handles cross-session leak as a boundary-and-evidence problem. The concrete detection surface is the PII evaluator: it runs on model outputs, tool outputs, retrieved context, and memory reads to flag personal or sensitive data that should not appear in the current session. The same trace should carry session evidence such as user id, tenant id, conversation id, route, cache key, retrieval filter, source document id, and memory record id.
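The session-evidence fields listed above can be enforced with a simple completeness check at instrumentation time. The field names follow the paragraph above; the helper itself is an illustrative sketch, not part of any SDK.

```python
# Evidence fields a leak investigation needs on every flagged span.
# The set mirrors the list in the text; the helper is hypothetical.
REQUIRED_SESSION_EVIDENCE = {
    "user_id", "tenant_id", "conversation_id", "route",
    "cache_key", "retrieval_filter", "source_document_id", "memory_record_id",
}

def missing_evidence(span_attributes: dict) -> set:
    """Return the evidence fields absent from a span's attributes."""
    return REQUIRED_SESSION_EVIDENCE - span_attributes.keys()
```

Running this in CI against sampled traces turns "logs session identifiers inconsistently" from a post-incident discovery into a build-time failure.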

A real workflow looks like this. A support agent is instrumented with traceAI-langchain and routed through Agent Command Center. The agent answers a billing question for user B, but the trace includes a retrieved chunk containing user A’s email and order number. FutureAGI runs the PII evaluator on the candidate response and the retrieved chunk, then joins the result with the current session.id and source metadata. If the evaluator flags PII whose source tenant differs from the active tenant, a pre-guardrail blocks the response and routes to a fallback that asks the user to retry or opens an internal incident.
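The pre-guardrail decision in that workflow reduces to a tenant comparison over flagged spans. The sketch below assumes a simplified span shape (`pii_flagged`, `source_tenant`); the real join would use the trace metadata named above.

```python
# Sketch of the pre-guardrail decision: block when the PII evaluator flags
# data whose source tenant differs from the active session's tenant.
# The span dict shape is an assumption for illustration.
def guardrail_decision(active_tenant, flagged_spans):
    """Return (allow, reason) for a candidate response."""
    for span in flagged_spans:
        src = span.get("source_tenant")
        if span.get("pii_flagged") and src is not None and src != active_tenant:
            return False, f"cross-session PII: source tenant {src!r} != {active_tenant!r}"
    return True, "no cross-tenant PII detected"
```

Note the check fires on the source tenant of the data, not merely on the presence of PII: PII belonging to the active tenant is authorized, so blocking on PII alone would over-trigger.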

FutureAGI’s approach is to test the isolation boundary, not only the final sentence. Unlike Ragas faithfulness checks, which ask whether an answer matches provided context, cross-session leak detection must ask whether that context was allowed to enter this session at all. Engineers then add the failed trace to a regression dataset, fix the retrieval filter or cache partition, and set a release threshold of zero cross-session PII failures.

How to measure or detect cross-session leak

Measure cross-session leak by combining content evaluation with identity evidence:

  • PII evaluator - flags personal or sensitive data in outputs, tool results, retrieved chunks, and memory reads.
  • Boundary comparison - compare active session.id, user id, tenant id, source tenant, cache namespace, and retrieval filter on every flagged span.
  • Trace signal - inspect tool.output, memory read spans, retrieved chunk ids, cache-hit metadata, and agent.trajectory.step before the exposed answer.
  • Dashboard signal - track cross-session PII fail rate, tenant-mismatch rate, cache-hit leak rate, and eval-fail-rate-by-cohort.
  • User-feedback proxy - monitor reports containing “not my account,” “wrong customer,” “another user’s data,” and privacy escalations.
A minimal usage sketch, assuming the `fi.evals` PII evaluator exposes this `evaluate(input=..., output=...)` call; check the SDK reference for the exact class name and signature:

```python
from fi.evals import PII

# Evaluate a candidate response in the context of the active session.
# A flagged email that belongs to a different tenant is the signal to
# join with source metadata and session identifiers.
result = PII().evaluate(
    input="Active session tenant: acme",
    output="Your invoice for beta@example.com is overdue."
)
print(result.score, result.reason)
```

Common mistakes

  • Sharing one semantic cache namespace across tenants and assuming prompt similarity is enough isolation. Partition cache keys by tenant, route, policy, and data class.
  • Testing only final responses for PII while ignoring retrieved chunks, memory reads, and tool outputs. The leak source often appears one span earlier.
  • Treating redaction as the full fix. Redaction limits exposure, but the system still retrieved data from the wrong session boundary.
  • Replaying production traces in evals without scrubbing or partitioning fixtures. A test dataset can reintroduce the same private context into another run.
  • Logging session identifiers inconsistently across app, gateway, and vector store layers. Without aligned ids, incident review becomes guesswork instead of evidence.
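The first mistake above, a shared cache namespace, has a mechanical fix: build tenant, route, policy version, and data class into every cache key so a hit can never cross a tenant boundary. The key layout and hashing scheme below are illustrative assumptions.

```python
# Sketch of partitioned semantic-cache keys. Any prompt similarity match
# can only occur inside one tenant/route/policy/data-class partition.
import hashlib

def cache_key(tenant, route, policy_version, data_class, prompt):
    prompt_digest = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return f"{tenant}:{route}:{policy_version}:{data_class}:{prompt_digest}"
```

Because the tenant is a key prefix rather than a similarity signal, an identical prompt from two tenants produces two distinct cache entries by construction.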

Frequently Asked Questions

What is cross-session leak?

Cross-session leak is an AI security failure where one user's, tenant's, or conversation's private data appears in another session. It usually points to weak isolation across memory, retrieval, cache, tool output, or trace handling.

How is cross-session leak different from PII leak?

A PII leak describes the sensitive data exposed. A cross-session leak describes the isolation failure that moved that data from one session boundary into another.

How do you measure cross-session leak?

Use FutureAGI's PII evaluator on outputs and tool results, then join eval failures with session, user, tenant, cache, and retrieval identifiers. Track cross-session PII fail rate by route and release cohort.