What Is Named Entity Recognition (NER)?

Named entity recognition (NER) is the NLP task of locating mentions of real-world entities in text and assigning each a type label — person, organization, location, date, monetary amount, product, or a domain-specific category like drug or gene. Modern NER uses transformer encoders (BERT-class, RoBERTa, DeBERTa) or LLM prompting; classical pipelines used CRFs and feature engineering. In 2026 LLM stacks, NER feeds RAG retrieval, knowledge-graph construction, PII redaction, structured-output evaluation, and audit logging — anywhere an LLM needs to anchor on something the user actually mentioned.
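
For a concrete feel, here is a minimal sketch using spaCy's default English pipeline (discussed below as a baseline). The sentence is an invented example, and the en_core_web_sm model must be downloaded separately with python -m spacy download en_core_web_sm:

import spacy

# Load spaCy's small default English pipeline (a baseline, not domain-tuned).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sarah Patel joined Acme Corp in Berlin in March 2024 for $2 million.")
for ent in doc.ents:
    # Each entity carries a surface span and a type label.
    print(ent.text, ent.label_)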

Why It Matters in Production LLM and Agent Systems

NER is the upstream step that decides what the rest of an LLM stack can see. If the NER model misses a drug name, the RAG retriever has no anchor and returns irrelevant chunks. If it confuses two organizations with similar names, the agent’s tool calls route to the wrong account. If it has low recall on minority-language names, PII redaction fails and personal data leaks into logs. Most “the agent gave the wrong answer” incidents trace back to an entity that was never identified or was mistyped.

The pain is felt across the stack. ML engineers see RAG quality regressions that turn out to be NER recall drops in disguise. Compliance teams see PII leakage because the NER model has 0.94 F1 on English names but 0.62 on names in Hindi script. Product managers see citation rates fall when entities the user explicitly mentioned are missing from the answer because NER did not surface them to the retriever.

In 2026 agentic stacks, NER sits at three layers: input parsing, retrieval keying, and output redaction. Each layer needs its own NER model or its own evaluation, and a single global F1 number does not give you the resolution to find the bug. Per-entity-type, per-language, and per-cohort recall is the metric that matters. spaCy's default English pipeline is a sensible baseline, but it is not domain-tuned; production teams need NER models finetuned on their corpus and a regression eval that gates every model swap.
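
A sketch of the per-type breakdown, assuming gold and predicted BIO tag sequences are available. seqeval is one common choice for span-level NER scoring, and the DRUG type here is illustrative:

from seqeval.metrics import classification_report

# Gold and predicted BIO tags, one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-DRUG"], ["B-ORG", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],      ["B-ORG", "O", "O"]]

# Per-entity-type precision/recall/F1; a global average would hide
# the missed DRUG span entirely.
print(classification_report(y_true, y_pred))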

How FutureAGI Handles NER

FutureAGI does not train NER models — frameworks like spaCy, HuggingFace Transformers, and stanza do that. We evaluate the consequences of NER inside an LLM stack. ContextEntityRecall measures whether the entities a user mentioned are present in the retrieved context, which is the most direct production signal that NER + retrieval are working together. The PII evaluator runs across inputs and outputs to flag entity types that map to personal data, with redaction integrated into Agent Command Center pre-guardrail and post-guardrail stages.

Concretely: an enterprise RAG agent on traceAI-langchain parses incoming queries with a HuggingFace NER model to extract entities, then routes those entities into the retriever. FutureAGI’s ContextEntityRecall runs on every response and surfaces entity-level retrieval misses. After a model upgrade from bert-base-NER to a domain-finetuned variant, drug-name recall jumps from 0.71 to 0.89, and downstream Faithfulness improves by 6 points — direct evidence the NER change caused the retrieval improvement. On the PII side, the team runs PII as a post-guardrail and audits flagged entities weekly to refine the redactor’s allowlist for false positives.
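
A sketch of that input-parsing step: the HuggingFace pipeline call is real, dslim/bert-base-NER is the public checkpoint commonly referred to as bert-base-NER, and entity_keyed_query is a hypothetical helper standing in for the team's retriever routing:

from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entity spans.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def entity_keyed_query(question: str) -> str:
    # Extract entity surface forms to anchor the retriever on what the
    # user actually mentioned; fall back to the raw question if none found.
    entities = [e["word"] for e in ner(question)]
    return " ".join(entities) if entities else question

print(entity_keyed_query("What did Acme Corp file with the SEC in Berlin?"))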

How to Measure or Detect It

NER quality should be measured upstream of the LLM, then validated downstream:

  • NER F1 per entity type: precision/recall/F1 broken down per type — global F1 hides minority-class failures.
  • ContextEntityRecall: returns the fraction of question entities present in retrieved context; the production-side signal.
  • PII: detects PII-mapped entities; returns category labels and spans.
  • NER recall by language and script: track per-locale to catch multilingual gaps.
  • Citation-coverage (dashboard): the percentage of answer entities that have at least one cited source.

Minimal Python:

from fi.evals import ContextEntityRecall, PII

# Instantiate the two evaluators described above.
cer = ContextEntityRecall()
pii = PII()

# Placeholders: in production these come from your retriever and your traces.
retrieved_chunks = ["Lisinopril: typical starting dose is 10 mg once daily."]
user_message = "What's the dosage for Lisinopril? I'm Sarah Patel."

# Entity-level retrieval check: are the question's entities in the context?
cer_result = cer.evaluate(
    input="What's the dosage for Lisinopril?",
    context=retrieved_chunks
)
# PII scan over the raw user input.
pii_result = pii.evaluate(input=user_message)
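
To make the metric concrete, here is a naive reference implementation of the entity-recall idea. This is a sketch only: the hosted ContextEntityRecall evaluator is the production path, and its matching is more robust than the crude substring lookup used here:

def naive_entity_recall(question_entities, context_chunks):
    # Fraction of question entities that appear somewhere in the retrieved
    # context; case-insensitive substring match as a crude stand-in.
    if not question_entities:
        return 1.0
    context = " ".join(context_chunks).lower()
    hits = [e for e in question_entities if e.lower() in context]
    return len(hits) / len(question_entities)

print(naive_entity_recall(["Lisinopril"], retrieved_chunks))  # 1.0 for the chunk above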

Common Mistakes

  • Reporting global F1 only. A 0.91 global F1 can hide a 0.62 recall on the entity type that matters.
  • English-only NER on multilingual traffic. Recall collapses on non-English scripts; use multilingual NER.
  • Skipping coreference resolution. “She,” “the CEO,” and “Patel” can refer to the same person — NER alone does not resolve this.
  • Ignoring domain mismatch. General-purpose NER misses drug names, gene symbols, and legal citations; finetune on your domain.
  • No drift monitoring. New product names, new competitors, and new public figures appear constantly; recall on novel entities decays fast.

Frequently Asked Questions

What is named entity recognition (NER)?

NER is the NLP task of locating entity mentions in text and assigning each a type — person, organization, location, date, money, product — performed with transformer encoders, LLM prompting, or classical CRF-based pipelines.

How is NER different from entity linking?

NER identifies that a mention is a person; entity linking maps that mention to a canonical record (e.g., 'Sarah Patel' → a Wikidata QID). NER is the upstream step; linking is downstream.

How is NER evaluated in production?

FutureAGI runs `ContextEntityRecall` to score whether retrieved context covers the entities a query actually mentioned, and `PII` to flag entity types that map to personal data — both run on every traced RAG response.