What Is a Reranker?

A second-pass scoring model, typically a cross-encoder, that re-orders the top-k passages from a first-stage retriever to lift top-1 relevance precision.

A reranker is a second-pass scoring model that re-orders the top-k passages returned by a first-stage retriever (usually vector search) so that the most relevant ones sit on top before generation. Most production rerankers are cross-encoders: they take the query and a candidate passage as a joint input and output a single relevance score. They are slower per pair than a bi-encoder embedding lookup, but far more precise because they can attend across query and passage tokens together. The pattern is retrieve-then-rerank: fetch the top 20 with cheap vector search, rerank to the top 3 with the cross-encoder, and generate from those 3. Rerankers do not replace vector search; they boost its precision.
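
The retrieve-then-rerank pattern can be sketched in a few lines. This is a toy illustration, not a production setup: `embed`, `first_stage`, `rerank`, and `toy_pair_score` are hypothetical names, the bag-of-words "embedding" stands in for a real bi-encoder, and the bigram-overlap scorer stands in for a real cross-encoder. What carries over is the shape: a cheap first pass that encodes query and passage independently, then a pair scorer that sees both together.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a bi-encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def first_stage(query: str, corpus: List[str], k: int) -> List[str]:
    """Cheap retrieval: query and passages are encoded independently, then compared."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: fraction of query bigrams found in the passage."""
    q_tok, p_tok = query.lower().split(), passage.lower().split()
    q_bi, p_bi = set(zip(q_tok, q_tok[1:])), set(zip(p_tok, p_tok[1:]))
    return len(q_bi & p_bi) / len(q_bi) if q_bi else 0.0


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], top_n: int) -> List[Tuple[str, float]]:
    """Second pass: score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(p, score_pair(query, p)) for p in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


corpus = [
    "monthly plans have a refund window of 14 days",
    "the refund window for annual plans is 30 days",
    "annual plans renew automatically every year",
    "contact support to change your billing address",
    "plans for annual maintenance window",
]
query = "refund window for annual plans"

candidates = first_stage(query, corpus, k=4)                 # wide, cheap pass
top = rerank(query, candidates, toy_pair_score, top_n=3)     # narrow, precise pass
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```

In this toy corpus the first stage ranks the "maintenance window" passage first because it shares the most words with the query, while the pair scorer moves the passage that actually answers the question to the top. Swapping `toy_pair_score` for a real cross-encoder call would leave the rest of the pipeline unchanged.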

Why It Matters in Production LLM and Agent Systems

Vector search has structural precision limits. Embedding models are bi-encoders: query and document are encoded independently, then compared with a single dot product or cosine similarity. That is fast but loses query-specific signal. The result is decent recall (the right doc is usually in the top 20) but mediocre top-1 precision (the right doc is often not ranked first). A reranker closes that gap by attending across query and document tokens jointly, surfacing the best match in the candidate pool.

The pain shows up across roles. ML engineers see RAG hallucinations that turn out to be ranking errors: the right chunk was retrieved but not selected. Retrieval engineers see eval-fail-rates flatten on long-tail queries because dense-only retrieval cannot distinguish near-duplicates. Product managers see “the bot quotes the wrong section” tickets when two policy clauses are similarly worded. Cost engineers see token waste from over-fetching k=20 to compensate for poor top-1.

In 2026, rerankers have become standard in any RAG system that prioritises answer quality over absolute latency. Cohere Rerank, Voyage Rerank, and Jina Reranker are the dominant hosted options; cross-encoders fine-tuned on domain pairs are common for high-stakes verticals. Newer LLM-as-reranker patterns use a small LLM to score query-passage pairs, trading more latency for better calibration. The decision is not whether to rerank; it is which reranker to use.

How FutureAGI Handles Rerankers

FutureAGI’s approach is to score the rerank step independently in eval and to expose it as a primitive in the gateway. On the eval side, fi.evals.Ranking scores whether the top-1 post-rerank result is actually the best of the candidate set, which is the canonical reranker-quality metric. fi.evals.ContextRelevance confirms the post-rerank context is useful for the query, and ChunkAttribution confirms the downstream LLM actually used it. Run these on a golden set before any reranker change, and continuously on sampled production traces to catch drift.

On the gateway side, the Agent Command Center exposes a rerank SDK resource so teams can route reranker calls through a centralised endpoint with caching, retries, and provider-fallback. Switching from Cohere Rerank to Voyage is a config change, not a code change, and traces capture which provider served each request.
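
To make the "config change, not a code change" idea concrete, here is a minimal sketch of provider fallback. It is not the Agent Command Center SDK: `RerankRouter`, `provider_a`, and `provider_b` are hypothetical names, and the providers are local stubs rather than real Cohere or Voyage clients. The sketch assumes each provider is a callable that returns one score per passage.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A provider is any callable that maps (query, passages) to one score per passage.
RerankFn = Callable[[str, Sequence[str]], List[float]]


class RerankRouter:
    """Hypothetical router: tries providers in configured order, falls back on error."""

    def __init__(self, providers: Dict[str, RerankFn], order: List[str]) -> None:
        self.providers = providers
        self.order = order  # the "config": reorder this list to switch providers

    def rerank(self, query: str, passages: Sequence[str]) -> Tuple[str, List[float]]:
        errors: List[str] = []
        for name in self.order:
            try:
                scores = self.providers[name](query, passages)
            except Exception as exc:  # a real gateway would catch narrower error types
                errors.append(f"{name}: {exc}")
                continue
            return name, scores  # report which provider served the request
        raise RuntimeError("all rerank providers failed: " + "; ".join(errors))


def provider_a(query: str, passages: Sequence[str]) -> List[float]:
    """Stub that simulates an outage."""
    raise TimeoutError("simulated outage")


def provider_b(query: str, passages: Sequence[str]) -> List[float]:
    """Stub scorer: number of query words that appear in each passage."""
    q = set(query.lower().split())
    return [float(len(q & set(p.lower().split()))) for p in passages]


router = RerankRouter(
    {"provider_a": provider_a, "provider_b": provider_b},
    order=["provider_a", "provider_b"],
)
served_by, scores = router.rerank(
    "refund window", ["refund window is 30 days", "billing address"]
)
print(served_by, scores)
```

Returning the provider name alongside the scores is what lets a trace record which provider served each request.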

A typical FutureAGI workflow: a team adds a Cohere Rerank step between their Pinecone vector search and their LLM, expecting a precision lift. They run the same golden dataset before and after the change, score it with Ranking, ContextRelevance, and end-to-end RAGScore, and read the deltas. Ranking improves top-1 by 18 points; ContextRelevance lifts by a comparable amount; RAGScore lifts 9 points. The added latency is 80 ms at p99, which fits the budget. They ship. That is reranker decision-making with empirical signals, not vendor benchmarks.
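
The before/after comparison in that workflow reduces to simple arithmetic. The sketch below is an assumption-laden illustration: `deltas` and `should_ship` are hypothetical helpers, the absolute scores are made-up placeholders chosen only so the deltas match the example above (18, a comparable 17, and 9 points), and the 5-point minimum lift and 100 ms budget are arbitrary thresholds.

```python
from typing import Dict


def deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-metric change in points (0-100 scale) between two eval runs."""
    return {name: after[name] - before[name] for name in before}


def should_ship(quality_deltas: Dict[str, int], added_latency_ms: float,
                min_lift_points: float, latency_budget_ms: float) -> bool:
    """Ship only if every metric lifts enough and the latency cost fits the budget."""
    lifts_ok = all(d >= min_lift_points for d in quality_deltas.values())
    return lifts_ok and added_latency_ms <= latency_budget_ms


# Placeholder absolute scores; only the deltas mirror the example in the text.
before = {"Ranking": 60, "ContextRelevance": 62, "RAGScore": 70}
after = {"Ranking": 78, "ContextRelevance": 79, "RAGScore": 79}

quality_deltas = deltas(before, after)
decision = should_ship(quality_deltas, added_latency_ms=80,
                       min_lift_points=5, latency_budget_ms=100)
print(quality_deltas, decision)
```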

How to Measure or Detect It

Reranker quality is measurable on every request:

  • fi.evals.Ranking: scores whether top-1 is the best of the candidates; the canonical reranker-quality metric.
  • fi.evals.ContextRelevance: a 0–1 score for whether the post-rerank context matches query intent.
  • MRR (Mean Reciprocal Rank): the mean of 1/rank of the first correct hit across a labelled set; sensitive to top-k ordering quality.
  • NDCG@k (normalized discounted cumulative gain): measures graded relevance across the top k of the ranked list.
  • Reranker-span latency: p99 on the rerank span, captured automatically by traceAI integrations.
  • Top-1 to top-5 quality lift: compare RAGScore with rerank disabled vs enabled; this is the cost-benefit signal.
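
A minimal Ranking call:

from fi.evals import Ranking

# Score how well the candidate passages serve the query.
result = Ranking().evaluate(
    input="What is the solar system?",  # the user query
    context=[  # candidate passages to judge against the query
        "The solar system consists of the Sun and celestial objects bound to it.",
        "Our solar system formed 4.6 billion years ago.",
    ],
)
# Print the numeric score and the evaluator's explanation.
print(result.score, result.reason)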
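
MRR and NDCG@k from the list above are easy to compute offline on a labelled set. The following is a minimal sketch under stated assumptions: the function names are illustrative, NDCG uses the linear-gain form, and the ideal ordering is taken from the same retrieved list (a common simplification when the full set of relevant documents is not labelled).

```python
import math
from typing import List, Sequence


def reciprocal_rank(relevant_flags: Sequence[bool]) -> float:
    """1/rank of the first relevant result in one ranked list, or 0.0 if none."""
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mrr(ranked_lists: List[Sequence[bool]]) -> float:
    """Mean Reciprocal Rank over all queries in the labelled set."""
    return sum(reciprocal_rank(flags) for flags in ranked_lists) / len(ranked_lists)


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: graded relevance discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Two queries: first correct hit at rank 2, then at rank 1.
print(mrr([[False, True, False], [True, False]]))
# Graded relevance labels (higher is better) in the order the reranker returned them.
print(ndcg_at_k([3, 2, 1], k=3), ndcg_at_k([1, 2, 3], k=3))
```

Running the same labelled queries through both functions with reranking disabled and enabled gives the offline counterpart of the Ranking evaluator's signal.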

Reranker landscape in 2026

In our 2026 evals, the reranker market has consolidated around a few options that consistently lift top-1 precision on RAG. The table below summarises how we think about the choices:

Reranker                                        | Hosting                | 2026 fit
Cohere Rerank v3+                               | Managed                | Strong default; multilingual coverage
Voyage Rerank 2                                 | Managed                | High precision on technical corpora
Jina Reranker v2                                | Managed or self-hosted | Cost-sensitive deployments
BGE Reranker v2                                 | Self-hosted            | Open-source, fine-tuneable
ColBERT v2 / ColBERTv2.5                        | Self-hosted            | Late-interaction precision
LLM-as-reranker (Claude Sonnet 4.6, GPT-5-mini) | Managed                | Best precision, highest latency
Cross-encoder (BERT-style, fine-tuned)          | Self-hosted            | Domain-specific verticals

Frontier 2026 models (Claude Opus 4.7, GPT-5.1, Gemini 3 Pro) can serve as rerankers themselves through structured prompts, which is the most accurate option but adds 200-500 ms p99 to the rerank span. On RAGTruth’s 18K labeled chunks and RAGBench, hybrid retrieval plus a Cohere or Voyage reranker typically lifts top-1 precision 15-25 points over dense-only retrieval, the single largest gain available in a RAG stack short of a model upgrade. The trade-off against latency is real. Unlike a single vendor benchmark, FutureAGI scores Ranking, ContextRelevance, ContextPrecision, and end-to-end RAGScore against the same golden cohort with reranking enabled and disabled, so the lift is measured on your corpus, not the provider’s. The recommended 2026 default is hybrid retrieval at k=30-50, then a Cohere or Voyage reranker down to the top 5, with Ranking and Groundedness wired as release gates.
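
The text does not say how the hybrid candidate pool is built. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as FutureAGI's method. `reciprocal_rank_fusion` and the document ids are illustrative, and `rrf_k=60` is the smoothing constant conventionally used with RRF (it is unrelated to the candidate count k in the text).

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], rrf_k: int = 60) -> List[str]:
    """Fuse several ranked id lists: each list adds 1 / (rrf_k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["d1", "d2", "d3"]    # ids from vector search, best first
keyword_hits = ["d3", "d1", "d4"]  # ids from keyword search, best first

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
candidate_pool = fused[:50]  # hand this pool to the reranker, which keeps the top 5
print(fused)
```

Documents that both retrievers rank highly rise to the top of the pool, which gives the reranker a stronger candidate set than either retriever alone.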

Common Mistakes

  • Treating the reranker as a replacement for vector search. It is a second pass on top of recall. Without a first-stage retriever feeding candidates, there is nothing to rerank.
  • Reranking too few candidates. A reranker on top-3 has nothing to work with; feed it top-20 to top-50 and let it pick top-3.
  • Skipping the eval delta before/after. Adding a reranker without measuring its quality lift is paying latency tax for an unproven gain.
  • Using a reranker model trained on a different domain. Generic reranker models drop sharply on legal, medical, and code corpora; fine-tune or pick a domain-specific reranker.
  • Caching rerank scores by exact input. Rerankers are deterministic per (query, candidate), but query phrasing varies, so exact-match keys rarely hit; cache at the embedding-similarity layer, not the rerank layer.
  • Skipping cohort cuts on reranker eval. A reranker that lifts global top-1 by 18 points can still regress on Spanish-locale queries; slice Ranking, ContextPrecision, and Groundedness by tenant, locale, and intent before promoting.
  • Ignoring traceAI rerank-span attributes. Without gen_ai.tool.name, provider id, top-n in/out, and rerank latency on the span, an incident becomes guesswork. The evaluator cannot help if the span is opaque.
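
The cohort-cut mistake is easy to demonstrate. The sketch below uses made-up per-query results (1 means top-1 was correct, 0 means it was not) and hypothetical helper names; it shows a global lift that hides a regression in one locale.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def lift_by_cohort(rows: List[dict], cohort_key: str) -> Dict[str, float]:
    """Average (after - before) top-1 correctness per cohort value."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        buckets[row[cohort_key]].append(row["after"] - row["before"])
    return {cohort: mean(diffs) for cohort, diffs in buckets.items()}


def regressed_cohorts(lifts: Dict[str, float]) -> List[str]:
    """Cohorts whose average lift is negative."""
    return sorted(cohort for cohort, lift in lifts.items() if lift < 0)


# Illustrative rows only: one row per golden-set query.
rows = [
    {"locale": "en", "before": 0, "after": 1},
    {"locale": "en", "before": 0, "after": 1},
    {"locale": "en", "before": 1, "after": 1},
    {"locale": "es", "before": 1, "after": 0},
    {"locale": "es", "before": 1, "after": 1},
]

global_lift = mean(row["after"] - row["before"] for row in rows)
per_locale = lift_by_cohort(rows, "locale")
print(global_lift, per_locale, regressed_cohorts(per_locale))
```

The same grouping works for tenant or intent by changing `cohort_key`, provided those fields are recorded on each evaluated row.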

Frequently Asked Questions

What is a reranker?

A reranker is a second-pass scoring model that re-orders the top-k passages from a first-stage retriever to put the most relevant ones on top. Most production rerankers are cross-encoders that score query and passage jointly for higher precision than vector similarity alone.

How is a reranker different from vector search?

Vector search is a first-stage retriever that fetches the top-k by embedding similarity; it is fast but imprecise at top-1. A reranker is a second-pass scorer (typically a cross-encoder) that re-orders that top-k. Rerankers do not replace vector search; they boost its top-1 precision.

How do you measure reranker quality?

FutureAGI's fi.evals.Ranking evaluator scores whether the top-1 result is the best of the candidate set, while fi.evals.ContextRelevance confirms the post-rerank context is actually useful. Pair them with MRR and NDCG@k on a labelled set for offline evaluation.