What Is Normalized Discounted Cumulative Gain?

Normalized Discounted Cumulative Gain (NDCG) is a ranking quality metric that scores an ordered list of results against an ideal ordering, weighting hits at the top of the list more heavily than hits further down. The “discounted” part is a logarithmic position penalty; the “normalized” part divides by the score of the perfect ranking so the result is bounded in [0, 1]. NDCG handles graded relevance — items can be “highly relevant”, “somewhat relevant”, or “not relevant” — instead of binary relevance, which makes it the default ranking metric for retrieval, search, and RAG reranker evaluation.
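
Concretely, for a ranked list with graded relevance rel_i at position i, the common exponential-gain formulation (many IR libraries also offer a linear variant that uses rel_i directly as the gain) is:

$$
\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}
$$

where IDCG@K is the DCG@K of the ideal ordering, i.e. the same candidates sorted by descending gold relevance. A perfect ranking therefore scores exactly 1, and the log denominator is what makes a hit at position 1 worth three times a hit at position 7.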

Why It Matters in Production LLM and Agent Systems

In a RAG pipeline, the reranker’s NDCG directly translates into answer quality. If the most relevant chunk is consistently at position 4 instead of position 1, the LLM still sees it but spends more attention budget on lower-quality context first. Faithfulness drops. Hallucination rate rises. Token cost grows because the team raises top-K to compensate. Single-number retrieval-recall metrics miss this entirely — recall@10 can stay flat while NDCG@10 sinks because position quality is degrading even if the same documents are retrieved.
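
To make that concrete, here is a small plain-Python sketch (independent of any evaluator library) scoring two orderings of the same retrieved set: recall@10 is identical because the documents are identical, but NDCG@10 is not.

import math

def dcg_at_k(rels, k):
    # Exponential-gain DCG; positions are 1-based, so position i is
    # discounted by log2(i + 1).
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded gold relevance of the same 10 retrieved chunks, two orderings.
good_order = [3, 2, 2, 1, 0, 0, 0, 0, 0, 0]  # best chunk at position 1
bad_order  = [0, 0, 0, 3, 2, 2, 1, 0, 0, 0]  # best chunk at position 4

print(round(ndcg_at_k(good_order, 10), 3))  # 1.0
print(round(ndcg_at_k(bad_order, 10), 3))   # 0.515, despite identical recall@10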

The pain is shared. RAG engineers see faithfulness drop after a reranker change without an obvious cause. Backend engineers see token cost climb because the team bumped top-K from 5 to 10 to “compensate.” Product owners see answer quality regress in narrow query cohorts. SREs see latency rise because larger candidate lists are now passing through reranking.

In 2026, agentic-RAG systems make NDCG even more important. Multiple retrieval calls per turn mean ranking degradation compounds across steps. A reranker swap that drops NDCG by 4 points may not break a single-turn RAG demo but will quietly destroy a 5-step agent workflow that depends on the right chunk being at position 1 every time.
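
A rough back-of-the-envelope makes the point (illustrative numbers, assuming retrieval steps are independent and treating "right chunk at position 1" as the per-step success event):

# Hypothetical per-step hit rates before and after a reranker swap.
p_before, p_after = 0.95, 0.88  # illustrative, not measured values
steps = 5
print(round(p_before**steps, 2))  # 0.77: chain success before the swap
print(round(p_after**steps, 2))   # 0.53: same chain after a modest per-step drop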

How FutureAGI Handles NDCG

FutureAGI ships NDCG as a first-class evaluator in fi.evals, alongside MRR, PrecisionAtK, and RecallAtK. The pattern: instrument the retriever and reranker via traceAI-langchain, traceAI-llamaindex, or traceAI-pinecone (depending on your stack); each retrieval span records the top-K candidate IDs and the chosen ordering. Build a Dataset of queries with graded gold labels per candidate; run Dataset.add_evaluation with NDCG. The dashboard surfaces eval-fail-rate-by-cohort keyed on retrieval source, query type, or reranker version — so a reranker swap shows up as an NDCG@10 delta on the affected cohorts before it shows up as a downstream Faithfulness regression.
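
A minimal sketch of the batch pattern. It reuses only the per-query evaluate signature shown in the snippet below; the row layout and field names are illustrative assumptions, and the plain loop stands in for the Dataset.add_evaluation wiring described above (check the FutureAGI docs for the current API).

from fi.evals import NDCG

# Gold set: one row per query with the reranker's output order and graded
# gold labels. Field names here are illustrative assumptions.
gold_rows = [
    {"ranked_list": ["doc_12", "doc_4", "doc_31"], "relevance_scores": [2, 0, 1]},
    {"ranked_list": ["doc_9", "doc_2", "doc_44"], "relevance_scores": [0, 2, 1]},
]

ndcg = NDCG()
scores = [
    ndcg.evaluate(
        ranked_list=row["ranked_list"],
        relevance_scores=row["relevance_scores"],
        k=10,
    ).score
    for row in gold_rows
]
print(sum(scores) / len(scores))  # mean NDCG@10 across the gold set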

A real example: a RAG team upgrading from a cross-encoder reranker to a smaller LLM-as-judge reranker for cost reasons. Offline, MRR stays flat at 0.74. NDCG@10 drops from 0.81 to 0.72. The downstream Faithfulness evaluator drops a corresponding 5 points two days later in production. The team rolls back the reranker, dashboards both metrics together, and ships the next iteration only after NDCG@10 clears the previous baseline. Agent Command Center holds the new reranker behind a shadow-deployment route until the threshold is met. Without NDCG, the team would have seen a “mystery faithfulness regression” and chased the wrong layer.

How to Measure or Detect It

NDCG yields a single number per query, but it pairs best with companion metrics:

  • NDCG (FutureAGI evaluator): returns a 0–1 score per query for the ranked list against graded gold labels.
  • MRR: complements NDCG; tells you how fast the first relevant hit appears.
  • PrecisionAtK and RecallAtK: position-blind sanity checks; if recall is fine but NDCG drops, ranking is the issue.
  • per-cohort NDCG dashboard: keyed on retrieval source, query intent, and reranker version.
  • NDCG@K vs. K curve: how quickly NDCG plateaus; sharp plateaus mean rerankers help, slow plateaus mean retrieval is doing the heavy lifting (sketched in code after this list).
  • NDCG / Faithfulness correlation: track them jointly; an NDCG drop without a faithfulness drop suggests you can lower top-K and save tokens.
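
For the NDCG@K-vs-K curve, a small plain-Python sketch (placeholder relevance data; the helper mirrors the one in the earlier example):

import math

def ndcg_at_k(rels, k):
    # Exponential-gain NDCG over the top-k positions (1-based discount).
    dcg = lambda rs: sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Graded gold relevance of one query's ranked list (placeholder data).
rels = [2, 3, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for k in (5, 10, 20):
    print(f"NDCG@{k}: {ndcg_at_k(rels, k):.3f}")

# A curve that is flat past K=5 means the reranker already front-loads the
# relevant chunks; a curve still climbing at K=20 means relevant chunks sit
# deep in the list and a larger top-K is doing the heavy lifting.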

Minimal Python:

from fi.evals import NDCG

# Ranked candidate IDs exactly as the retriever/reranker returned them
# (placeholder values for illustration).
retrieved_ids = ["doc_12", "doc_4", "doc_31", "doc_9", "doc_2"]

# Graded gold relevance aligned with retrieved_ids, e.g. 2 = highly
# relevant, 1 = somewhat relevant, 0 = not relevant.
graded_gold_scores = [2, 0, 1, 0, 2]

ndcg = NDCG()
result = ndcg.evaluate(
    ranked_list=retrieved_ids,
    relevance_scores=graded_gold_scores,
    k=10,
)
print(result.score)  # 0-1; 1.0 means the ranking matches the ideal ordering

Common Mistakes

  • Treating relevance as binary. NDCG’s main advantage is graded relevance; binary labels throw that away, leaving little more than a position-discounted PrecisionAtK.
  • Reporting NDCG at one K only. NDCG@5 vs. NDCG@10 vs. NDCG@20 tell different stories; pick K to match how many chunks actually feed the LLM.
  • Ignoring the gold-set size. A small gold set produces high NDCG variance; a 200-query gold set is the practical floor for stable comparisons.
  • Confusing NDCG with MRR. Unlike MRR, which only sees the first relevant result, NDCG penalizes late hits too — they are different signals.
  • Skipping NDCG when the LLM “fixes it.” A strong LLM can recover from bad ranking, but the cost (tokens, latency, hallucination risk) is real; track NDCG anyway.

Frequently Asked Questions

What is NDCG?

Normalized Discounted Cumulative Gain (NDCG) is a ranking metric that scores an ordered list against the ideal ordering, weighting earlier positions more heavily and producing a 0–1 score.

How is NDCG different from MRR?

MRR (Mean Reciprocal Rank) only cares about the position of the first relevant hit. NDCG considers every relevant item in the list and weights all of them by position, so it captures the quality of the full ranking.
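
A quick plain-Python illustration (not the FutureAGI evaluators): two rankings whose first relevant hit is at the same position, so MRR cannot tell them apart, while NDCG can.

import math

def mrr(rels):
    # Reciprocal rank of the first relevant item, 0.0 if none.
    return next((1 / (i + 1) for i, r in enumerate(rels) if r > 0), 0.0)

def ndcg(rels):
    dcg = lambda rs: sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rs))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

a = [3, 2, 2, 0, 0]  # later hits packed right behind the first
b = [3, 0, 0, 2, 2]  # same first hit, later hits pushed down

print(mrr(a), mrr(b))                        # 1.0 1.0
print(round(ndcg(a), 3), round(ndcg(b), 3))  # 1.0 vs 0.91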

How do you compute NDCG in production?

Use FutureAGI's NDCG evaluator, which takes a ranked list and graded relevance scores and returns a value between 0 and 1. Pair it with PrecisionAtK and MRR for a full retrieval-quality picture.