What Is Learning to Rank?

Supervised ML methods that train models to order a list of items by relevance to a query, used in search, recommendation, and RAG reranking.

Learning to rank (LTR) is the supervised machine-learning task of training a model to order candidate items by relevance to a query. It powers search, recommender systems, and the reranker stage of modern RAG. Approaches split into three families: pointwise methods (predict a per-item score, then sort), pairwise methods (learn from “A more relevant than B” comparisons, e.g. RankNet, RankSVM), and listwise methods (optimise the entire ordering against a target metric, e.g. LambdaMART, ListNet). In LLM stacks, LTR sits between retrieval and generation: vector search returns top-N, the LTR model reorders them, and only top-K reach the LLM.
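The three families differ mainly in their training signal. A minimal, self-contained sketch with toy scores (hypothetical values, not the original papers' implementations): pointwise sorts by a per-item score, pairwise models the probability that one item outranks another from the score difference (RankNet-style), and listwise judges the whole ordering against graded labels.

```python
import math

# Toy per-item relevance scores produced by some model (hypothetical values)
scores = {"doc_a": 2.1, "doc_b": 0.4, "doc_c": 1.3}

# Pointwise: score each item independently, then sort by score
pointwise_ranking = sorted(scores, key=scores.get, reverse=True)
print(pointwise_ranking)  # ['doc_a', 'doc_c', 'doc_b']

# Pairwise (RankNet-style): P(A ranked above B) from the score difference
def prob_a_over_b(score_a: float, score_b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

p = prob_a_over_b(scores["doc_a"], scores["doc_b"])
print(round(p, 3))  # 0.846

# Listwise: evaluate the whole ordering against graded labels with DCG
labels = {"doc_a": 3, "doc_b": 0, "doc_c": 1}  # graded relevance
def dcg(ranking):
    return sum((2 ** labels[d] - 1) / math.log2(i + 2) for i, d in enumerate(ranking))
print(round(dcg(pointwise_ranking), 3))  # 7.631
```

Real pairwise and listwise trainers backpropagate through these quantities; the sketch only shows what each family measures.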

Why It Matters in Production LLM and Agent Systems

The LLM only sees the top of the retrieved list. If the relevant chunk is at position 47, recall is wasted. The reranker — a learning-to-rank model — is what turns “the document was in the index” into “the model actually saw it.” In modern RAG, swapping the LTR component often moves quality more than swapping the embedding model or the LLM.

The pain shows up across roles. An ML engineer ships a faster vector retriever, sees ContextRecall improve and Faithfulness drop because ranking got noisier. A product lead benchmarks against a competitor and finds the gap is the reranker. A platform engineer pays for a heavyweight cross-encoder on every query even though 40% of queries already have the right answer at position 1; they could skip the reranker on high-confidence retrieval.

In 2026 agent stacks, LTR shows up beyond retrieval. Agents rank tool candidates before calling, planners rank action options, and multi-agent systems rank handoff destinations. Each is a ranking problem with its own labels and evaluation. Without per-step evaluation, the wrong tool or the wrong handoff goes unnoticed until users complain.

How FutureAGI Handles Learning-to-Rank Evaluation

FutureAGI does not train rankers — it evaluates them. The relevant evaluators are Ranking, NDCG, MRR, PrecisionAtK, and RecallAtK. Together they score whether the ranker ordered the candidates correctly against ground-truth labels and whether top-K coverage is sufficient.

A concrete workflow: a RAG team versions a labelled Dataset of 1,500 queries, each with up to 20 candidate chunks and human-graded relevance. They run a candidate bake-off — BGE-reranker-large vs Cohere rerank-v3 vs a domain-fine-tuned cross-encoder — and call Dataset.add_evaluation(NDCG), Dataset.add_evaluation(MRR), and Dataset.add_evaluation(PrecisionAtK) for each. The dashboard slices results by latency p99 and quality. The domain fine-tune wins on relevance but adds 80ms p99; BGE wins on the latency-critical route. The Agent Command Center routes by a conditional rule that picks the heavier reranker when retrieval confidence is below a threshold.
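The comparison at the heart of that bake-off can be illustrated without the SDK. A self-contained sketch (candidate names, labels, and orderings are hypothetical toy data) that scores two rerankers' orderings for one query with NDCG@K computed by hand:

```python
import math

def ndcg_at_k(ranking, labels, k):
    """NDCG@K: DCG of the ranking's top-K, normalised by the ideal DCG."""
    def dcg(order):
        return sum((2 ** labels.get(d, 0) - 1) / math.log2(i + 2)
                   for i, d in enumerate(order[:k]))
    ideal = sorted(labels, key=labels.get, reverse=True)
    return dcg(ranking) / dcg(ideal)

# Human-graded relevance for one query's candidate chunks (hypothetical)
labels = {"c1": 3, "c2": 2, "c3": 1, "c4": 0}

# Two candidate rerankers' orderings for the same query
reranker_a = ["c1", "c3", "c2", "c4"]
reranker_b = ["c2", "c1", "c3", "c4"]

for name, order in [("A", reranker_a), ("B", reranker_b)]:
    print(name, round(ndcg_at_k(order, labels, k=3), 3))  # A 0.972, B 0.843
```

In practice the team aggregates per-query scores like these across the whole labelled Dataset and weighs the quality delta against latency p99.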

For online evaluation, traceAI-langchain and traceAI-llamaindex emit retrieval and rerank as separate spans. Sampling 5% of production traces with ContextPrecision and ContextRelevance surfaces eval-fail-rate-by-cohort regressions before users notice — for example, a long-query cohort that drops 6 points after a reranker model swap.
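One common way to get a stable 5% sample is to hash the trace id so a given trace is always in or out, which keeps cohort comparisons consistent across reruns. This is a generic sketch of that pattern, not a traceAI feature:

```python
import hashlib

def sample_trace(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically keep ~`rate` of traces by bucketing a hash of the id."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

kept = [tid for tid in (f"trace-{i}" for i in range(10_000)) if sample_trace(tid)]
print(len(kept))  # close to 500 of 10,000
```

The sampled traces are then the population on which ContextPrecision and ContextRelevance run, sliced by cohort.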

How to Measure or Detect It

Learning-to-rank quality combines list-level and downstream signals:

  • NDCG — Normalized Discounted Cumulative Gain; the canonical listwise metric.
  • MRR — Mean Reciprocal Rank; how quickly the first relevant item appears.
  • PrecisionAtK — fraction of top-K items that are relevant.
  • RecallAtK — fraction of relevant items found in top-K.
  • Ranking — FutureAGI’s general-purpose ranking evaluator.
  • ContextPrecision / ContextRelevance — downstream confirmation that ranking helped.
  • Reranker latency p99 — operational metric to weigh against quality lift.
A minimal example using the evaluators named above:

from fi.evals import NDCG, MRR, PrecisionAtK

ndcg = NDCG()
mrr = MRR()
pk = PrecisionAtK()

# Ranker output, best first, plus the human-labelled relevant documents
retrieved = ["doc_b", "doc_a", "doc_c", "doc_d"]
ground_truth = ["doc_a", "doc_b"]

print(ndcg.evaluate(retrieved=retrieved, expected=ground_truth))
print(mrr.evaluate(retrieved=retrieved, expected=ground_truth))
print(pk.evaluate(retrieved=retrieved, expected=ground_truth, k=2))
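For intuition, the same three metrics can be computed by hand on that toy list (a from-scratch sketch, not the SDK's implementation):

```python
import math

retrieved = ["doc_b", "doc_a", "doc_c", "doc_d"]
ground_truth = ["doc_a", "doc_b"]  # binary relevance

# MRR: reciprocal rank of the first relevant item (doc_b at position 1)
mrr = next(1 / (i + 1) for i, d in enumerate(retrieved) if d in ground_truth)
print(mrr)  # 1.0

# Precision@2: relevant fraction of the top 2
p_at_2 = sum(d in ground_truth for d in retrieved[:2]) / 2
print(p_at_2)  # 1.0

# NDCG: binary gains discounted by log2 of position, normalised by the ideal
def dcg(order):
    return sum((d in ground_truth) / math.log2(i + 2) for i, d in enumerate(order))
ndcg = dcg(retrieved) / dcg(ground_truth)
print(round(ndcg, 3))  # 1.0
```

All three come out at 1.0 here because both relevant documents occupy the top two slots, which is an ideal ordering under binary labels; push doc_a down to position 3 and every score drops.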

Common Mistakes

  • Optimising NDCG over the full candidate list when only top-3 reach the LLM. Use NDCG@K with K matching context-window usage.
  • Training rerankers on click-through data without dedup. Power users dominate; debias by query session.
  • Skipping the downstream eval. Better ranking does not always mean better answers; pair with Faithfulness and AnswerRelevancy.
  • Letting reranker latency leak into p99. Cross-encoder rerankers can add 100-200ms; route or skip on high-confidence retrieval.
  • Static eval cohort. Production query distributions shift; refresh or sample continuously.
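The routing fix from the latency bullet can be sketched as a simple conditional. Thresholds and route names below are hypothetical placeholders:

```python
def choose_reranker(retrieval_confidence: float,
                    high: float = 0.9, low: float = 0.5) -> str:
    """Route by retrieval confidence: skip the reranker when retrieval is
    already confident, reserve the heavy cross-encoder for hard queries."""
    if retrieval_confidence >= high:
        return "skip"            # top-1 almost certainly right; save 100-200ms
    if retrieval_confidence >= low:
        return "bge-reranker"    # fast reranker on the latency-critical route
    return "cross-encoder"       # heavy model where the quality lift matters

print(choose_reranker(0.95))  # skip
print(choose_reranker(0.7))   # bge-reranker
print(choose_reranker(0.3))   # cross-encoder
```

The same shape works as an Agent Command Center conditional rule: the confidence signal picks the route, and the eval metrics above tell you where to set the thresholds.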

Frequently Asked Questions

What is learning to rank?

Learning to rank (LTR) is the supervised ML task of training a model to order items by relevance to a query. It powers search engines, recommender systems, and the reranker stage of RAG.

What are the main learning-to-rank approaches?

Pointwise (predict per-item relevance independently), pairwise (learn from item-pair comparisons like RankNet), and listwise (optimise the ordering directly with LambdaMART or ListNet against NDCG).

How do you measure learning-to-rank quality in production?

FutureAGI exposes Ranking, NDCG, MRR, PrecisionAtK, and RecallAtK evaluators. Pair them with downstream Faithfulness and AnswerRelevancy to confirm better ranking translated into better LLM answers.