Evaluation

What Is Normalized Discounted Cumulative Gain (NDCG)?

A ranking metric that scores graded relevance while discounting useful results that appear lower in the ranked list.

Normalized Discounted Cumulative Gain (NDCG) is an LLM-evaluation metric for ranked outputs, especially RAG retrieval and reranking, that rewards highly relevant items appearing near the top. It discounts lower ranks, compares the observed ranking with the ideal ranking, and returns a normalized 0-1 score. In production traces and eval pipelines, FutureAGI uses eval:NDCG through fi.evals.NDCG to catch cases where a retriever found useful evidence but buried it too late for the model to use.
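
For intuition, here is a minimal, generic sketch of the computation on a three-item ranking, using the standard log2 discount and the raw graded label as gain; it is illustrative only and is not the internal implementation behind fi.evals.NDCG.

import math

def dcg(relevances, k):
    # Discounted cumulative gain: each graded label is divided by log2(rank + 1),
    # with ranks counted from 1, so rank 1 receives no discount.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    # Normalize by the DCG of the ideally sorted list so the score lands in 0-1.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded labels in retrieved order: the strongest chunk (label 3) sits at rank 3.
print(round(ndcg([1.0, 0.0, 3.0], k=3), 3))  # 0.689, penalized for burying the best item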

Why NDCG Matters in Production LLM and Agent Systems

NDCG matters because a RAG system can retrieve correct evidence and still fail if it ranks that evidence after distractors. The visible failure is not “no context found”; it is a grounded-looking answer built from the wrong top chunk. Support agents quote outdated policy text, coding assistants prefer a broad tutorial over the exact API reference, and research agents summarize a weak source because the strongest source arrived at rank 8.

The pain is shared across retrieval engineers, product owners, and SREs. Engineers see answer-quality regressions after changing embeddings, hybrid-search weights, or a reranker. Product sees lower task completion on long-tail queries. SREs see higher token spend because teams increase top-k to hide poor ordering. In logs, the symptoms look like high recall with low answer quality, reranker-version skew, higher thumbs-down rate on query cohorts with many near-duplicate documents, and traces where retrieval.documents contains good evidence outside the first few positions.

Agentic systems make this worse. A planner may choose tools, issue follow-up searches, and synthesize from multiple ranked lists. If each step buries the best evidence, the final answer compounds small ranking errors into a confident hallucination. Unlike mean reciprocal rank, which only asks how quickly the first relevant item appears, NDCG keeps the whole graded ranking honest.
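
As a toy illustration of that difference (a generic sketch, not a FutureAGI API): the two rankings below have identical reciprocal rank because a relevant item sits at position 1 in both, yet NDCG scores the second one lower because the strongest evidence is buried at the last position.

import math

def ndcg(rels, k):
    dcg = lambda rs: sum(r / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

def mrr(rels):
    # Reciprocal rank of the first item with any relevance at all.
    return next((1.0 / (i + 1) for i, r in enumerate(rels) if r > 0), 0.0)

a = [2.0, 3.0, 0.0]  # strongest evidence at rank 2
b = [2.0, 0.0, 3.0]  # strongest evidence buried at rank 3

print(mrr(a), mrr(b))                              # 1.0 and 1.0: MRR cannot tell them apart
print(round(ndcg(a, 3), 2), round(ndcg(b, 3), 2))  # 0.91 vs 0.82: NDCG penalizes the buried item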

How FutureAGI Handles NDCG

FutureAGI’s approach is to treat NDCG as a retrieval-stage regression signal, not a vanity dashboard metric. The anchor eval:NDCG maps to fi.evals.NDCG, the local metric listed in the FutureAGI evaluation inventory for Normalized Discounted Cumulative Gain over ranked retrieval. It takes the ordered contexts and their relevance scores, applies the DCG discount, computes the ideal DCG from the sorted labels, and writes a 0-1 result plus DCG/IDCG detail.

A real workflow: a docs-chat team instruments a LangChain RAG app with traceAI-langchain. Each retriever span records retrieval.documents in ranked order; an annotation job adds graded relevance labels from 0 to 3 for a golden dataset. The team runs NDCG(config={"k": 5}) before promoting a new cross-encoder reranker. When NDCG@5 drops from 0.82 to 0.61 on API-reference queries, FutureAGI flags a regression even though Recall@K is unchanged. The engineer opens the failing traces, sees exact-title documents pushed below broad guides, and adjusts the reranker feature weight.
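
A condensed version of that pre-promotion check might look like the sketch below. The queries, contexts, and graded labels are invented for illustration, and the assumption that each eval result exposes its score via .output follows the minimal example later on this page.

from fi.evals import NDCG

# A tiny illustrative golden set: contexts appear in the order the candidate
# reranker returned them, with annotator-graded labels (0-3) aligned by position.
golden_cases = [
    {
        "query": "how do I paginate the list-users endpoint",
        "contexts": ["pagination tutorial", "list-users API reference", "SDK changelog"],
        "relevance_scores": [1.0, 3.0, 0.0],
    },
    {
        "query": "rate limits for the search API",
        "contexts": ["search API reference", "general FAQ", "billing docs"],
        "relevance_scores": [3.0, 1.0, 0.0],
    },
]

metric = NDCG(config={"k": 5})
results = metric.evaluate(golden_cases)

# Assumes each result exposes a numeric 0-1 score via .output, as in the
# minimal example further down this page.
scores = [r.output for r in results.eval_results]
print(f"mean NDCG@5 on the API-reference cohort: {sum(scores) / len(scores):.2f}")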

In production, the same score becomes an alert: if p25 NDCG@5 drops below 0.70 for a cohort, trigger a rollback or route those requests to the previous retriever. Pairing NDCG with ContextPrecision and ContextRecall keeps the decision grounded: NDCG checks ordering with graded labels; precision and recall explain whether the issue is noisy retrieval or missing coverage.
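
The alerting arithmetic itself is small. The sketch below assumes per-trace NDCG@5 scores have already been computed and grouped by cohort; the numbers are invented, and only the percentile check is the point.

import statistics

# Hypothetical per-trace NDCG@5 scores grouped by query cohort; in practice these
# would come from the eval results attached to production traces.
cohort_scores = {
    "navigational": [0.91, 0.88, 0.95, 0.84],
    "troubleshooting": [0.74, 0.52, 0.66, 0.61],
}

THRESHOLD = 0.70  # the p25 alert level described above

for cohort, scores in cohort_scores.items():
    # quantiles(n=4) returns the quartiles; index 0 is the 25th percentile.
    p25 = statistics.quantiles(scores, n=4)[0]
    if p25 < THRESHOLD:
        print(f"ALERT: p25 NDCG@5 for '{cohort}' is {p25:.2f}; consider rollback")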

How to Measure or Detect NDCG

NDCG is measurable once each retrieved item has a graded relevance score. The usual score is NDCG@K, where K is the number of top results the model actually sees. Track:

  • fi.evals.NDCG - returns a 0-1 score plus DCG and IDCG details for ranked retrieval.
  • retrieval.documents - the ordered trace field that should match the contexts passed into the model.
  • Graded relevance labels - human or judge labels such as 0, 1, 2, and 3, not only relevant or irrelevant.
  • p25 NDCG@K by retriever version - the dashboard signal that catches tail regressions before mean answer accuracy moves.
  • Thumbs-down or escalation rate - a user-feedback proxy for low NDCG on high-choice queries.

Minimal Python:

from fi.evals import NDCG

# Score the top 5 positions; contexts are listed in retrieved order and
# relevance_scores are the graded labels aligned with each context.
metric = NDCG(config={"k": 5})
result = metric.evaluate([{
    "query": "refund policy for annual plans",
    "contexts": ["exact policy", "pricing FAQ", "old terms"],
    "relevance_scores": [3.0, 1.0, 0.0],
}])
print(result.eval_results[0].output)  # normalized 0-1 score for this case

Common Mistakes

  • Reporting NDCG without K. NDCG@3 and NDCG@10 answer different production questions because the model may only read the first few chunks.
  • Scoring the raw vector-search list. If a reranker changes the order before generation, score the final list the model receives, as in the sketch after this list.
  • Using binary labels for graded decisions. Exact evidence, partial evidence, and loosely related context should not all receive the same gain.
  • Treating high NDCG as answer faithfulness. NDCG checks retrieval order; pair it with Groundedness or Faithfulness for generated claims.
  • Averaging across every query. Separate navigational, troubleshooting, and exploratory cohorts; a healthy mean can hide broken long-tail retrieval.
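
A small guard against the first two mistakes, following the same call pattern as the minimal example above (the contexts and graded labels are invented): score the post-rerank list that actually reaches the model, and report the K that matches what the model reads.

from fi.evals import NDCG

# Score the post-rerank order that goes into the prompt, not the raw
# vector-search order. Labels here are illustrative graded annotations.
case = {
    "query": "refund policy for annual plans",
    "contexts": ["pricing FAQ", "old terms", "exact policy", "blog post", "changelog"],
    "relevance_scores": [1.0, 0.0, 3.0, 0.0, 0.0],
}

# Different K answer different questions: @3 reflects what a model that reads
# only the first three chunks sees; @5 scores the full shortlist.
for k in (3, 5):
    metric = NDCG(config={"k": k})
    result = metric.evaluate([case])
    print(f"NDCG@{k}:", result.eval_results[0].output)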

Frequently Asked Questions

What is NDCG in LLM evaluation?

NDCG is a 0-1 ranking metric for retrieval and reranking. It rewards highly relevant items at top ranks and discounts useful items that appear lower in the list.

How is NDCG different from mean reciprocal rank?

Mean reciprocal rank focuses on the first relevant result. NDCG evaluates the whole ranked list and supports graded relevance, so exact, partial, and weak matches can contribute differently.

How do you measure NDCG?

FutureAGI measures it with `fi.evals.NDCG`, using ranked contexts and relevance scores from the eval dataset or trace. Teams usually track NDCG@K by retriever or reranker version.