What Is Context Entity Recall?

A RAG retrieval metric that scores whether required named entities are present in the retrieved context before generation.

Context entity recall is a RAG retrieval metric that measures whether the entities needed for a correct answer were present in the retrieved context. Instead of asking only whether supporting sentences were retrieved, it checks named-entity coverage for people, products, locations, dates, policies, account IDs, and other factual anchors. In FutureAGI, it shows up through ContextEntityRecall in eval pipelines and RAG traces when a retriever returns plausible context but misses the specific entity the answer depends on.
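
At its core, the metric is recall over required entities: extract the entities a correct answer needs, then score the fraction that appear anywhere in the retrieved context. The sketch below illustrates the idea with a deliberately naive capitalized-phrase extractor; toy_entities and the scoring formula are illustrative assumptions, not FutureAGI's internals.

import re

def toy_entities(text: str) -> set[str]:
    # Naive stand-in for real entity extraction: capitalized spans.
    return set(re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", text))

def entity_recall(reference: str, contexts: list[str]) -> float:
    required = toy_entities(reference)
    retrieved = toy_entities(" ".join(contexts))
    return len(required & retrieved) / len(required) if required else 1.0

# "ACME Plus" and "EU" are required but never retrieved, so recall is 0.0.
print(entity_recall(
    "ACME Plus covers audit-log export for EU teams.",
    ["ACME Basic includes audit logs for US teams."],
))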

Why It Matters in Production LLM and Agent Systems

Entity misses create wrong-by-omission failures. The model may answer fluently, stay grounded in the chunks it received, and still omit the customer, contract, medication, product tier, jurisdiction, or date that makes the answer correct. A retriever that returns “ACME Basic refund policy” when the question asks about “ACME Plus EU refund policy” looks relevant at a glance, but the answer is unusable.

Developers encounter this as retrieval bugs that generic relevance checks fail to reproduce. SREs see repeated escalations for a small set of accounts or regions. Compliance teams see answers that cite approved policy text while leaving out the regulated entity the policy applies to. End users experience it as a system that sounds knowledgeable but cannot distinguish their actual case from a nearby one.

Common symptoms include high context relevance with low task success, thumbs-down clusters tied to product names or geographies, answer traces where the retrieved chunks contain the right topic but not the right entity, and regression rows that fail only after metadata filters or synonym maps change.

For multi-step RAG agents in 2026, entity recall is also a control signal. If a planner depends on a missing company name or account ID, every downstream tool call can be technically valid and still aimed at the wrong target.

How FutureAGI Handles Context Entity Recall

FutureAGI’s approach is to treat entity coverage as a first-class retrieval completeness signal, not a cosmetic string check. The specific anchor is fi.evals.ContextEntityRecall, a local metric from the evaluation stack. It is designed for RAG eval rows where the query, retrieved contexts, and a reference answer or entity-bearing ground truth are available, and the score to watch is its required-entity coverage.

A typical FutureAGI workflow starts with a golden dataset of production questions. A team using traceAI-langchain with a Pinecone retriever logs the query and retrieved context for each RAG span, then runs ContextEntityRecall alongside ContextRecall, ContextPrecision, and ChunkAttribution. One failing row asks, “Does ACME Plus cover EU audit-log export?” The retriever returns chunks about ACME Basic, EU retention rules, and generic audit logs, but misses the “ACME Plus” contract page. Sentence-level recall may partially pass; entity recall isolates the missing product tier.
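
A hedged sketch of that golden-dataset pass, reusing the evaluate call from the minimal example later in this section; golden_rows is a hypothetical logged dataset, and the assumption that ContextRecall shares the same interface is mine, not documented here:

from fi.evals import ContextEntityRecall, ContextRecall

# Hypothetical golden rows logged from production RAG spans.
golden_rows = [{
    "query": "Does ACME Plus cover EU audit-log export?",
    "contexts": ["ACME Basic includes audit logs for US teams."],
    "reference": "ACME Plus covers audit-log export for EU teams.",
}]

entity_metric = ContextEntityRecall()
sentence_metric = ContextRecall()

for row in golden_rows:
    # A passing sentence-level score with a failing entity score
    # points at a missing factual anchor, not a missing topic.
    print(row["query"],
          entity_metric.evaluate([row]),
          sentence_metric.evaluate([row]))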

The engineer then checks whether the entity is absent from the index, filtered out by metadata, split across chunks, or ranked below top-k. The fix might be alias normalization, a product-tier metadata filter, a reranker change, or a new regression case. Unlike a standalone Ragas notebook run that can leave failing entity cases detached from production evidence, FutureAGI keeps the score next to the trace or dataset row, so the missing entity is debuggable.
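
That four-way triage can be mechanized once both the corpus chunks and the retrieved top-k are visible. A hedged sketch, where locate_entity_miss, all_chunks, and retrieved are hypothetical stand-ins for your index contents and retriever output:

def locate_entity_miss(entity: str, all_chunks: list[str], retrieved: list[str]) -> str:
    # Classify where a required entity was lost along the retrieval path.
    needle = entity.lower()
    if not any(needle in c.lower() for c in all_chunks):
        return "absent from index: ingestion or alias problem"
    if any(needle in c.lower() for c in retrieved):
        return "retrieved: the miss is downstream of retrieval"
    return "indexed but not retrieved: check filters, ranking, and top-k"

print(locate_entity_miss(
    "ACME Plus",
    ["ACME Plus contract page ...", "ACME Basic refund policy ..."],
    ["ACME Basic refund policy ..."],
))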

How to Measure or Detect It

Use context entity recall when you have reference answers, gold evidence, or annotated entity requirements. Measure these signals together:

  • fi.evals.ContextEntityRecall — returns a recall-style score for entity-level retrieval completeness.
  • fi.evals.ContextRecall — catches broader sentence-level coverage gaps that are not entity-specific.
  • traceAI-langchain or traceAI-llamaindex retrieval spans — provide the query and retrieved contexts needed for evaluation.
  • Eval fail rate by index version — shows whether a new chunker, embedding model, metadata filter, or reranker created entity misses (a minimal sketch follows this list).
  • Escalation rate by entity cohort — useful when failures concentrate around product tiers, regions, account types, or regulated terms.
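
A minimal sketch of the fail-rate-by-index-version signal, assuming eval rows already tagged with the index version that served them; eval_rows and THRESHOLD are hypothetical stand-ins:

from collections import defaultdict

# Hypothetical scored rows; in practice these come from your eval runs.
eval_rows = [
    {"index_version": "v12", "entity_recall": 1.0},
    {"index_version": "v13", "entity_recall": 0.5},
    {"index_version": "v13", "entity_recall": 0.0},
]

THRESHOLD = 0.8  # assumed pass bar; tune per entity type

fails, totals = defaultdict(int), defaultdict(int)
for row in eval_rows:
    totals[row["index_version"]] += 1
    fails[row["index_version"]] += row["entity_recall"] < THRESHOLD

for version in sorted(totals):
    print(version, f"{fails[version] / totals[version]:.0%} fail rate")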

Minimal Python:

from fi.evals import ContextEntityRecall

# One eval row: the reference names the required entities
# ("ACME Plus", "EU"); the retrieved context contains neither,
# so the score should read as a retrieval miss.
metric = ContextEntityRecall()
result = metric.evaluate([{
    "query": "Does ACME Plus cover EU audit-log export?",
    "contexts": ["ACME Basic includes audit logs for US teams."],
    "reference": "ACME Plus covers audit-log export for EU teams."
}])
print(result)

Read low scores as retrieval evidence, not generation evidence. The model cannot answer with an entity it never received.

Common Mistakes

  • Treating entity recall as answer correctness. It only reports whether required entities were retrieved; the answer can still misuse or contradict them.
  • Using raw string matching for aliases. “IBM,” “International Business Machines,” and internal account aliases need normalization before scoring (see the sketch after this list).
  • Aggregating across entity types. Product IDs, medication names, jurisdictions, and broad product families have different risk; threshold them separately.
  • Running it on unversioned references. Entity requirements drift when policies and indexes change; version references with the corpus snapshot.
  • Fixing misses by raising top-k only. More chunks can hide the issue with noise; inspect chunking, filters, aliases, and reranking.
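
For the alias bullet, a minimal normalization sketch; ALIASES is a hypothetical team-maintained map, not a FutureAGI feature:

# Map every alias to one canonical form before comparing entity sets.
ALIASES = {
    "international business machines": "IBM",
    "big blue": "IBM",
}

def canonicalize(entity: str) -> str:
    return ALIASES.get(entity.strip().lower(), entity.strip())

# "IBM" and "International Business Machines" now count as one entity.
assert canonicalize("International Business Machines") == canonicalize("IBM")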

Frequently Asked Questions

What is context entity recall?

Context entity recall measures whether retrieved RAG context contains the entities needed for a correct answer. FutureAGI exposes it as `ContextEntityRecall` for entity-level retrieval completeness.

How is context entity recall different from context recall?

Context recall checks whether the reference answer's supporting information is present. Context entity recall focuses specifically on named entities, so it catches misses for products, people, locations, dates, IDs, and policy names.

How do you measure context entity recall?

Use FutureAGI's `fi.evals.ContextEntityRecall` on a RAG eval row with query, retrieved contexts, and a reference answer or entity-bearing ground truth. Track entity miss rate by index version and retrieval configuration.