Guides

Evaluating Cohere Rerank in RAG (2026)

Reranking helps when recall is high but precision is low. It hurts when recall is low. The eval triangle (NDCG@k, recall delta, latency) tells you which.

·
Updated
·
11 min read
cohere-rerank rag llm-evaluation reranker traceAI 2026
Editorial cover image for Evaluating Cohere Rerank RAG Pipelines in 2026
Table of Contents

You added Cohere Rerank 3.5 on top of your dense retriever last month. The cost line went up. The latency p99 went up. The answer quality looks the same. Maybe the reranker is earning its slot. Maybe it’s rearranging chairs on a candidate set the base retriever already nailed in the top-3. Maybe it’s quietly demoting the correct doc on short queries and you never measured it.

The opinion this post earns: reranking helps when retrieval recall is high but precision is low. It hurts when recall is low. No reranker rescues a chunk that never made the candidate list. No reranker improves a top-3 that’s already correct. The way to know which case you’re in is the eval triangle, scored on the same queries with the reranker on and off: NDCG@k_post_rerank for ranking quality, Recall@k_pre_rerank versus Recall@k_post_rerank for recall preservation, and latency p95/p99 for the cost in user-perceived time. Without these three you’re guessing whether the rerank cost is paying for itself.

This guide is the eval methodology. The recall-precision tradeoff, the eval triangle, an honest comparison of Cohere Rerank 3.5 against Voyage rerank-2.5 and Mixedbread mxbai-rerank-large-v2, per-domain decisions, production patterns including semantic caching, and traceAI instrumentation that keeps the verdict honest after deploy. For the broader RAG eval landscape, the RAG evaluation metrics deep dive covers the foundational definitions; this post is the reranker-specific layer on top.

TL;DR: when reranking helps, when it hurts

SignalReranking helpsReranking hurts
recall@k_pre_rerank> 0.90 with large k (50-100)< 0.85; the right chunk isn’t even in the candidate set
recall@3 of base retriever< 0.70; right doc is past rank 3 often> 0.90; the base is already nailing it
query distributionEnglish, medium-length, well-formed questions< 5-token keyword queries; multilingual against English-trained reranker; identifier-heavy code
chunk sizeCoherent units (clause, paragraph, function)Fixed-size splits cutting clauses across boundaries
latency budget> 800 ms total response< 800 ms (real-time voice, sub-second chat)

Two non-negotiables across every reranker decision. Measure base retriever recall@50 before adding the reranker (recall is upstream; precision is downstream; the reranker is precision-only). Score NDCG@k_post_rerank plus the recall delta plus latency p99 as a triangle, not a single number.

The recall-precision tradeoff at the heart of reranking

A reranker is a precision tool, not a recall tool. The retriever decides what’s in the candidate set; the reranker decides the order inside it. This is the entire shape of when reranking helps.

Imagine four scenarios on a 1000-document corpus, top_k_pre_rerank=50, top_k_post_rerank=5:

  • Recall high, precision low. Retriever returns the supporting chunk at rank 23 of 50. Without rerank, the LLM sees top-5 (none of which contain the answer) and hallucinates. With rerank, the supporting chunk moves to rank 2; the LLM grounds correctly. Uplift is real. Reranker earns its slot.
  • Recall high, precision high. Retriever returns the supporting chunk at rank 2 of 50. Top-5 already contains it. Reranker moves it to rank 1. Marginal NDCG gain; the answer is identical. Reranker is decorative. Cost is paid for movement the LLM didn’t need.
  • Recall low. Retriever doesn’t surface the supporting chunk in top-50. The right chunk is at rank 200 in the dense index or absent entirely (chunking error, embedding-domain mismatch, missing document). Reranker has nothing to work with. No reranker rescues this. Fix the retriever, not the precision layer.
  • Recall medium, query distribution-shifted. Retriever returns the supporting chunk at rank 5. Query is two tokens of Hindi against an English-Hindi corpus. Cross-encoder over-weights English token overlap and demotes the Hindi-supporting chunk to rank 30. Reranker actively hurts.

The eval question is not “does the reranker improve answer quality?” The eval question is “for which slice of my queries does the reranker improve answer quality, by how much, at what latency cost, and where does it degrade?”

The eval triangle: NDCG@k, recall delta, latency cost

Most teams score reranker pipelines with one number (faithfulness on the final answer). That number changes for six reasons (retriever shifted, reranker shifted, chunker shifted, prompt shifted, model shifted, distribution shifted), and one number can’t tell you which.

Three rubrics, scored as a triangle, separate the reranker’s contribution from everything else:

NDCG@k_post_rerank. Normalised Discounted Cumulative Gain at the post-rerank top-k. Measures whether the reranker put the most relevant chunks in the highest ranks. Compute per query against a labelled relevance grade (binary or graded 0-3); aggregate per domain stratum. NDCG@10 is the standard; NDCG@5 matters more if your prompt only consumes top-5. A reranker that lifts NDCG@10 by 0.08 on contracts and drops it by 0.04 on multilingual is making a tradeoff you have to see per route.

Recall@k_pre_rerank versus Recall@k_post_rerank. The recall the retriever earned, versus the recall the reranker preserved. With k_pre_rerank=50 and k_post_rerank=5, a reranker that drops recall from 0.94 to 0.78 destroyed sixteen points of recall to win precision on the top-5. Sometimes that’s the right trade (precision matters more than coverage on factual QA). Sometimes it’s not (multi-hop questions need the second supporting chunk you just dropped). The number tells you.

Latency p95 and p99 added by the rerank hop. Per-call, not aggregate. Cohere Rerank 3.5 adds roughly 80 to 150 ms p50 on chunks under 2k tokens; 200 ms-plus p99 on chunks past 3k tokens. A reranker that doubles your end-to-end p99 is a different product than one that adds 5 percent to p50. Measure both.

Compute all three on the same golden set, with the reranker on and off, stratified by domain. The triangle reads:

  • NDCG up, recall preserved, latency reasonable: ship it.
  • NDCG up, recall dropped, latency reasonable: tune top_n (try k_post_rerank=10 instead of 5) or revisit the precision-recall preference per route.
  • NDCG flat, latency added: the reranker is decorative on this slice. Scope it out.
  • NDCG up, latency p99 blown: cache scores aggressively, downsample candidate-set size, or move to a faster reranker variant.

Cohere Rerank 3.5 vs Voyage vs Mixedbread (honest tradeoffs)

Three rerankers dominate production RAG in 2026. The decision is rarely about absolute quality on a leaderboard. It’s about cost model, latency tail, language coverage, and whether you need self-hostable weights.

RerankerBest forLanguagesMax chunkLatency p50Cost modelSelf-host
Cohere Rerank 3.5Multilingual production; broadest track record100+4096 tokens80-150 msPer-1k documentsNo (API only)
Voyage rerank-2.5Domain-tuned (code, finance, legal); strong EnglishEnglish-dominant8192 tokens90-180 msPer-1k documentsNo (API only)
Voyage rerank-2.5-liteLatency-sensitive EnglishEnglish-dominant8192 tokens40-80 msPer-1k documents (cheaper)No (API only)
Mixedbread mxbai-rerank-large-v2Self-hostable English; on-prem and air-gappedEnglish plus EU langs32k tokens60-200 ms (GPU-dependent)Amortised infraYes (open weights)

On most English RAG corpora the three are within 1 to 3 NDCG@10 points of each other. The decision matrix:

  • Multilingual or non-English production. Cohere Rerank 3.5. Voyage and Mixedbread degrade materially outside English on most public benchmarks; Cohere’s 100-plus language training holds quality on Spanish, French, German, Japanese, Mandarin, Arabic, Hindi, and Portuguese in our measurements.
  • English with strict latency budget (< 600 ms total). Voyage rerank-2.5-lite. Cuts the rerank hop roughly in half versus Cohere with a small quality cost on out-of-domain queries.
  • English code corpus or finance/legal in-domain. Voyage rerank-2.5 (domain variants). The fine-tuning on code identifiers and legal terminology shows up as 2 to 4 NDCG@10 points over generic rerankers on the corresponding stratum.
  • On-prem, air-gapped, or sensitive-data deployment. Mixedbread mxbai-rerank-large-v2. Open weights mean the model runs in your VPC. Pair with a single A10 or L4 GPU and the per-call cost moves from per-1k-documents to amortised infrastructure.
  • You don’t know yet. Cohere Rerank 3.5. Strongest default, broadest documentation, lowest activation cost. Switch later when the eval triangle says one of the others wins your specific workload.

The best rerankers for RAG 2026 post covers the broader landscape including the open-source long-tail (BGE-reranker-v2, Jina-reranker-v2). This post is about whether Cohere specifically earns its slot in your pipeline.

Per-domain decisions: where reranking wins and loses

The same reranker behaves differently across document types. Read the stratified breakdown, not the aggregate.

  • Legal and medical: high uplift, watch precision on identifiers. Contracts, regulations, clinical notes. Recall is typically high (the right clause is in the top-50); precision is typically low (similarly-worded clauses cluster). Reranker uplift on NDCG@5 of 0.10 to 0.15 is common. Pair with clause-level chunking and the gain compounds. Watch identifier queries (ICD codes, statute numbers) where token-level matching can lose to a cross-encoder’s semantic preference.
  • Long-form prose: moderate uplift if chunking is semantic. Research papers, transcripts, books. Reranker uplift on NDCG@10 of 0.05 to 0.08 typical. Fixed-size chunking neutralises the gain (the reranker scores ambiguous half-passages); semantic or section-level chunking preserves it.
  • Code and code-mixed: low uplift on cross-encoders, high on token-level matching. Source code, code with comments, mixed languages. Cohere Rerank 3.5 underperforms on identifier-heavy queries (function names, variable names) because cross-encoders over-weight semantic similarity. Late-interaction retrieval (ColBERT-style) at the retrieval layer often wins outright over dense-plus-cross-encoder-rerank for these. The chunking strategies post covers the late-interaction case in depth.
  • Multilingual: stratify or skip. Score per language. English, Spanish, French, German, Mandarin tend to hold quality across all three rerankers. Hindi, Arabic, Bengali, Swahili can lose 5 to 10 NDCG points on Voyage and Mixedbread; Cohere holds better but still loses points on under-five-token queries. A reranker that gains ten points on average and loses twenty on Hindi is not a win.
  • Marketing and product FAQs: skip. Uniform short paragraphs, queries are mostly lookup, recall@3 from a half-decent dense retriever sits near 0.95. Reranker adds latency and cost without changing the answer.

Route by document type before reranking, not after. A single reranker applied uniformly is the source of most reranker-eval failures we see in production.

Production patterns: instrumenting and caching

The CI gate catches regressions you can think of. Production catches everything else. Same rubrics, two places.

traceAI (Apache 2.0) ships a CohereInstrumentor that wires every rerank, embed, and chat invocation into the OpenTelemetry trace tree with one call:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_cohere import CohereInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="cohere-rerank-prod",
)
CohereInstrumentor().instrument(tracer_provider=trace_provider)

Every rerank call emits a span with fi.span.kind=RERANKER. The span carries cohere.rerank.model, cohere.rerank.top_n, cohere.rerank.documents_count, per-document relevance scores, and end-to-end latency. The before-and-after rank lists land as span attributes, so you can see which doc moved up and which got demoted. This is what lets a RerankerUplift rubric score the actual delta rather than a generic faithfulness number, and what makes a cohere.rerank.model change auditable when Cohere ships a new version.

Semantic caching of reranker scores is the single highest-ROI production pattern most teams haven’t implemented. Two cache layers:

  • Exact cache. Key on (query_normalised, sorted_doc_ids, reranker_model_version). Hit rate is high on lookup workloads (product search, FAQ, support) and low on long-tail conversational queries. Hits return scores in under 5 ms versus 80 to 150 ms uncached. Cost saving is roughly the hit rate times the per-call cost.
  • Semantic cache. Key on embedding similarity to past queries above a threshold (0.95 cosine works for English question-answer corpora; lower for code where identifier overlap dominates). Reuses scores for paraphrased queries. Hit rate on semantic cache typically lands 20 to 40 percent on conversational RAG workloads.

Combined, exact-plus-semantic caching cuts rerank cost by 30 to 60 percent on most production RAG workloads we’ve measured with customers. Cache invalidation matters: bust the cache on reranker model version change (the span attribute is the audit trail), bust on document edit, and set TTL by route based on how fresh the underlying corpus is (24 hours for product docs, 7 days for legal corpora that rarely change).

The Agent Command Center ships both cache types (in-memory or Redis exact, plus Qdrant or Pinecone semantic) as gateway primitives with x-agentcc-cache-force-refresh, x-agentcc-cache-ttl, and x-agentcc-cache-namespace headers. Run Cohere through the gateway and the same OpenAI-compatible drop-in handles caching, observability, and cost telemetry without rewriting the SDK calls.

How Future AGI ships reranker evaluation

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator, then evaluator.evaluate(eval_templates=[...], inputs=[TestCase(...)]). Six RAG-specific EvalTemplate classes (Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization) plus 50+ total; CustomLLMJudge for RerankerUplift and domain rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Future AGI Platform: self-improving evaluators tuned by thumbs up/down feedback; in-product authoring agent writes reranker-eval rubrics from natural-language descriptions.
  • traceAI (Apache 2.0): CohereInstrumentor plus 50+ AI surfaces across Python, TypeScript, Java; 14 span kinds with a first-class RERANKER that carries model, top_n, document count, and per-document scores.
  • Error Feed (inside the eval stack): HDBSCAN clustering plus Sonnet 4.5 Judge writes the immediate_fix per cluster. Common reranker clusters: “rerank demotes correct doc on under-five-token queries,” “rerank latency tail above p99 on 3k-plus token chunks,” “rerank quality drops on Hindi-English pairs.”
  • Agent Command Center: OpenAI-compatible gateway as a single Go binary (Apache 2.0). 100+ providers including Cohere, Voyage, and Mixedbread routing. 18+ built-in guardrail scanners plus 15 third-party adapters. Exact and semantic caching at the gateway layer. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit.
from fi.evals import Evaluator
from fi.evals.templates import (
    ContextRelevance, ChunkAttribution, ChunkUtilization,
    Groundedness, CustomLLMJudge,
)
from fi.testcases import TestCase

evaluator = Evaluator()  # FI_API_KEY / FI_SECRET_KEY from env

reranker_uplift = CustomLLMJudge(
    name="reranker_uplift",
    rubric="""
    Score whether answer_with_rerank is closer to expected_answer
    than answer_no_rerank on (query). 1.0 if meaningfully better,
    0.5 if equivalent, 0.0 if worse. Name the specific quality
    difference (factual coverage, citation accuracy, relevance).
    """,
    judge_model="gpt-4.1",
)

def score_rerank_uplift(golden_set, retriever, reranker, synthesizer):
    rows = []
    for ex in golden_set:
        base = retriever(ex.query, k=50)
        reranked = reranker(ex.query, base, top_n=5)
        ans_off = synthesizer(ex.query, base[:5])
        ans_on  = synthesizer(ex.query, reranked)
        tc = TestCase(
            query=ex.query,
            answer_no_rerank=ans_off,
            answer_with_rerank=ans_on,
            expected_response=ex.expected,
            context=[c.text for c in reranked],
        )
        result = evaluator.evaluate(
            eval_templates=[
                ContextRelevance(), ChunkAttribution(),
                ChunkUtilization(), Groundedness(),
                reranker_uplift,
            ],
            inputs=[tc],
        )
        rows.append((ex.domain, result))
    return rows

Run the same rubric in CI before deploy and as span-attached scorers on live traces via traceAI after. Sample 5 to 10 percent of production traffic for LLM-as-judge rubrics; deterministic rubrics like citation validity run on 100 percent. Alarm on a 2 to 5 point drop in rolling-mean per rubric per route over 30 to 90 minutes.

Ready to score your reranker uplift? Wire ContextRelevance, ChunkAttribution, ChunkUtilization, and a RerankerUplift CustomLLMJudge into a pytest fixture this afternoon against the ai-evaluation SDK. Stratify the golden set by domain and language. Add the CohereInstrumentor when production traces start asking questions the CI gate missed.

Frequently asked questions

When does reranking actually help a RAG pipeline?
When retrieval recall is high but precision is low. The reranker can only re-order what the retriever already surfaced. If the supporting chunk is at rank 47 of 50, a cross-encoder pushes it to rank 3 and the answer gets grounded; the uplift is real and worth the 50 to 200 ms hop. If the supporting chunk never made the candidate list, no reranker rescues it. Measure recall@50 first. If recall@50 is below 0.85 on your golden set, fix the retriever (better embeddings, hybrid sparse-plus-dense, larger candidate window) before paying for a reranker. The pipeline where reranking wins biggest is k_pre_rerank=50 to 100 candidates, k_post_rerank=3 to 5 final, on a corpus where the dense retriever's top-3 misses the right chunk 30 to 40 percent of the time.
When does reranking hurt?
Three cases. First, when recall is low and you're paying latency to re-rank garbage. Second, when the base retriever is already good enough (recall@3 above 0.9). Adding a reranker pays cost and latency for movement inside an already-correct top-3. Third, when query distribution mismatches the reranker's training. Cross-encoders trained on English question-answer pairs degrade on short keyword queries, multilingual pairs, and code identifiers. A Spanish-language insurance query against a Spanish-English corpus with the default Cohere Rerank 3.5 model can be worse than no reranker. The way to know is the recall-delta plus NDCG@k_post_rerank measured on a domain golden set, not a vendor leaderboard.
What is the eval triangle for a reranker in RAG?
Three numbers, scored on the same query set, with the reranker on and off. NDCG@k_post_rerank measures whether the reranker put the most relevant chunks in the highest ranks. Recall@k_pre_rerank versus Recall@k_post_rerank measures whether the reranker preserved (or destroyed) the recall the retriever earned. Latency p95 and p99 added by the rerank hop measures the cost in user-perceived response time. Without all three you're guessing whether the rerank cost is paying for itself. NDCG alone can climb while latency blows the budget. Recall delta alone can hide a precision win that justifies the hop. Latency alone is meaningless without quality movement to weigh it against.
How does Cohere Rerank 3.5 compare to Voyage rerank-2.5 and Mixedbread mxbai-rerank-large-v2 in 2026?
Honest tradeoffs. Cohere Rerank 3.5 is the strongest default for English plus 100-plus languages, 4096-token chunks, and the broadest production track record; the cost line is per-1k-documents and adds 80 to 150 ms p50 on typical chunks. Voyage rerank-2.5 (and rerank-2.5-lite) ships tighter domain variants for code, finance, and legal corpora; lite cuts latency roughly in half at a small quality cost on out-of-domain queries. Mixedbread mxbai-rerank-large-v2 is the strongest open-weight option; deploy it self-hosted on a single GPU and the cost line moves from per-call to amortised infra. On most English RAG corpora the three are within 1 to 3 NDCG@10 points of each other. The decision is cost model, latency tail, and whether you need self-hostable weights, not absolute quality.
How does Future AGI score reranker uplift?
Three rubrics on a stratified golden set, the reranker run twice (on and off), the delta read per stratum. ContextRelevance scores whether the reranked top-k stayed on-topic. ChunkAttribution scores which reranked chunks the generator used. ChunkUtilization scores how much of the reranked context the answer covered. A CustomLLMJudge rubric called RerankerUplift takes (query, answer_no_rerank, answer_with_rerank, expected) and scores the delta directly. traceAI auto-instrumentation tags every rerank span with model, top_n, candidate count, and per-document scores via fi.span.kind=RERANKER, so production traces stay comparable across model-version changes. Error Feed clusters failing traces with HDBSCAN and a Sonnet 4.5 Judge writes the immediate_fix per cluster: 'rerank demotes correct doc on under-five-token queries,' 'rerank latency tail above p99 on 3k-plus token chunks,' 'rerank quality drops on Hindi-English pairs.'
Should I cache reranker scores in production?
Yes for any reranker call where the query plus candidate set repeats. Two patterns. Exact cache: keyed on (query, sorted candidate IDs, reranker model version), hit rate is high on lookup workloads (product search, FAQ) and low on long-tail conversational queries. Semantic cache: keyed on embedding similarity to past queries above a threshold (0.95 cosine works for English question-answer corpora; lower for code), reuses scores for paraphrased queries. Semantic caching of reranker scores cuts cost by 30 to 60 percent on most production RAG workloads and trims p50 latency to under 5 ms on cache hits. Cache invalidation: bust on reranker model version change (the cohere.rerank.model span attribute is the audit trail), bust on document edit, and TTL by route based on how fresh the underlying corpus is.
Should chunk size influence the reranker decision?
Yes, and most teams underestimate this. Cohere Rerank 3.5 supports 4096-token chunks; anything past that gets truncated and the signal in the tail is invisible to the reranker. Voyage and Mixedbread have their own context limits. The interaction matters: pair clause-level chunking (small, semantically coherent) with a reranker and the reranker has discrete units to score; pair fixed-size 1024 chunking with a reranker and a single chunk often contains two half-clauses, which gives the reranker an ambiguous unit to score. The chunking-evaluation post covers the splitter decision; the eval triangle here is what tells you whether the reranker recovered the recall the splitter cost you.
Related Articles
View all
Evaluating Embedding Models in 2026
Guides

MTEB Recall@10 does not transfer to your domain. 500 labeled query-passage pairs from your traffic decide which embedding wins. Without that, you are picking by leaderboard theater.

NVJK Kartik
NVJK Kartik ·
12 min