Reranking in RAG: Cross-Encoders and When They Are Worth the Latency
A reranker can rescue a mediocre retriever or waste 200ms on an already-correct answer. Knowing which is the whole skill.
A reranker is the cheapest big win in RAG, right up until it is a pure waste of latency. The difference is entirely about how good your retriever already is. This post is about how reranking works, what it actually costs, and how to tell which side of that line you are on before you ship it.
This follows on from chunking strategies for RAG. Chunking decides what can be retrieved; reranking decides what survives to the model.
What a reranker actually does
Your first-stage retriever (dense embeddings, usually) is a bi-encoder. It embeds the query and every document separately, ahead of time, and compares vectors. That is why it is fast: the document vectors are precomputed. The downside is the query and document never actually meet. The model never reads them together.
A cross-encoder does. It takes the query and one candidate document as a single input and outputs a relevance score. Because it reads both at once with full attention, it catches things a bi-encoder cannot: negation, qualifiers, "X but not Y", subtle topic mismatches. The cost is that there is no precomputing. You run the model fresh for every query-document pair at request time.
So the standard pattern is two stages. Retrieve a generous candidate set cheaply (say top 50 with dense search), then rerank those 50 with the cross-encoder and keep the top 5.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def retrieve_and_rerank(query, candidates, k=5):
    # Score each (query, document) pair jointly with the cross-encoder.
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.predict(pairs)
    # Highest score first, then keep the top k documents.
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

What it costs
Reranking is not free, and the cost that usually matters is latency rather than money. As a rough guide, reranking adds somewhere from 50ms to 400ms depending on the model, the number of candidates, and how long they are. Cohere Rerank 3.5, for example, adds roughly 80 to 150ms at p50 on chunks under 2k tokens, climbing past 200ms at p99 on longer chunks. A self-hosted BGE reranker on a GPU can be faster or slower depending on your batch size and hardware.
The latency scales with how many candidates you feed it and how long each one is. Reranking 50 short chunks is cheap. Reranking 200 long ones is not.
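One way to put numbers on this for your own stack is to time the reranker at a few candidate counts and lengths. The harness below is a minimal sketch: `rerank_fn` is a placeholder for whatever callable wraps your reranker, and `measure_latency_ms` is an illustrative name, not part of any library.

```python
import statistics
import time


def measure_latency_ms(rerank_fn, query, candidates, runs=20):
    """Time rerank_fn(query, candidates) over several runs.

    rerank_fn is a hypothetical callable wrapping your reranker.
    Returns (p50, p99) in milliseconds.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
    rank = (99 * len(samples) + 99) // 100
    p99 = samples[rank - 1]
    return p50, p99
```

Running it once with 50 short chunks and once with 200 long ones should show how quickly the cost grows on your hardware. With a small number of runs the p99 is just the slowest sample, so treat it as indicative only.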
API rerankers can spike under load. If you have a latency SLA, set a timeout and a fallback: if the reranker does not respond in time, return the pre-rerank order and move on. A circuit breaker here saves you from one slow dependency taking down your p99.
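The timeout-and-fallback part can be sketched with the standard library alone. This is one possible shape, not a full circuit breaker: `rerank_fn` is a hypothetical callable that returns the candidates in reranked order, and the 250ms budget is an assumed value you would replace with your own SLA.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool so a slow rerank call does not block the request thread.
_pool = ThreadPoolExecutor(max_workers=4)


def rerank_with_fallback(rerank_fn, query, candidates, k=5, timeout_s=0.25):
    """Return the top k reranked candidates, or the pre-rerank order on failure.

    rerank_fn(query, candidates) is a hypothetical callable returning the
    candidates in reranked order. timeout_s is an assumed latency budget.
    """
    future = _pool.submit(rerank_fn, query, candidates)
    try:
        return future.result(timeout=timeout_s)[:k]
    except FutureTimeout:
        # cancel() only helps if the call has not started; a running call
        # finishes in the background and its result is discarded.
        future.cancel()
        return candidates[:k]
    except Exception:
        # Any reranker error also falls back to the first-stage order.
        return candidates[:k]
```

A real circuit breaker would also count recent failures and skip the reranker entirely for a cooldown period, which this sketch leaves out.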
The decision: when it is worth it
This is the part people get wrong. The value of a reranker depends almost entirely on your base retriever's recall.
Think about it in two failure modes. If your retriever's recall is low, the right document often is not even in your top 50, so you are paying latency to reorder garbage. The reranker can only promote what retrieval already found. Fix retrieval first.
If your retriever is already excellent (recall@3 above 0.9), the right answer is almost always in your top 3 already. The reranker is now shuffling cards that are all correct. You pay the latency and the answer does not change.
The sweet spot is the middle: a retriever with good recall@50 but mediocre ordering. The right documents are in your candidate set but buried at rank 20 or 30. That is exactly what a cross-encoder fixes, and the quality jump can be large.
| Base retriever | Reranker verdict |
|---|---|
| Low recall@50 (right doc often missing) | Skip it, fix retrieval first |
| recall@50 high, ordering poor | Use it, this is where it shines |
| recall@3 already above 0.9 | Skip it, no room to improve |
Concretely: marketing pages and product FAQs, where queries are simple lookups against uniform short paragraphs, often hit recall@3 near 0.95 from a decent dense retriever alone. A reranker there adds latency and cost for no answer change. A messy technical knowledge base with overlapping documents is the opposite case, and reranking earns its keep.
How to actually decide for your system
Do not guess. Measure recall at two depths on a real eval set.

```python
def recall_at_k(eval_set, retrieve, k):
    # eval_set: list of (query, gold_doc_id) pairs.
    hits = sum(
        gold in [d.id for d in retrieve(q, k=k)]
        for q, gold in eval_set
    )
    return hits / len(eval_set)

base_recall_50 = recall_at_k(eval_set, dense_retrieve, k=50)
base_recall_3 = recall_at_k(eval_set, dense_retrieve, k=3)
```

Read it like this. If recall@50 is low, your ceiling is low and a reranker cannot save you. If recall@50 is high but recall@3 is much lower, you have a reordering problem and a reranker is the fix. If recall@3 is already high, leave it alone.
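Those reading rules can be written down as a small helper. Treat the thresholds as assumptions: `high_top=0.9` follows the recall@3 figure used in this post, while `low_ceiling=0.7` and `min_gap=0.1` are illustrative cutoffs for "low" and "much lower" that you would tune for your own system.

```python
def reranker_verdict(recall_50, recall_3, low_ceiling=0.7, high_top=0.9, min_gap=0.1):
    """Map recall@50 and recall@3 to a rough verdict.

    low_ceiling and min_gap are assumed cutoffs, not values from the post.
    """
    if recall_50 < low_ceiling:
        # The right document is often missing from the candidate set.
        return "skip: fix retrieval first"
    if recall_3 > high_top:
        # The top 3 is already almost always correct.
        return "skip: no room to improve"
    if recall_50 - recall_3 >= min_gap:
        # Found but buried: the case a cross-encoder is meant for.
        return "use: reordering problem"
    return "borderline: measure with the reranker in place"


print(reranker_verdict(0.95, 0.60))
```

The gap `recall_50 - recall_3` is also the most a reranker over 50 candidates could add to recall@3, which makes it a useful number to look at on its own.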
Then prove it: measure recall@3 again with the reranker in place and compare to the cost. If recall@3 jumps from 0.6 to 0.9 for 120ms, ship it. If it moves from 0.92 to 0.94, do not.
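A before-and-after comparison could look like the sketch below. It restates `recall_at_k` so the block runs on its own. `with_rerank` and `rerank_gain` are illustrative helpers, and `rerank_fn` is assumed to take a query and a candidate list and return the candidates in reranked order.

```python
def recall_at_k(eval_set, retrieve, k):
    """Fraction of queries whose gold id appears in the top k results."""
    hits = sum(gold in [d.id for d in retrieve(q, k=k)] for q, gold in eval_set)
    return hits / len(eval_set)


def with_rerank(retrieve, rerank_fn, depth=50):
    """Wrap a first-stage retriever: fetch `depth` candidates, rerank, cut to k."""
    def reranked(query, k):
        candidates = retrieve(query, k=depth)
        return rerank_fn(query, candidates)[:k]
    return reranked


def rerank_gain(eval_set, retrieve, rerank_fn, k=3, depth=50):
    """Return (recall@k before, recall@k after, difference)."""
    before = recall_at_k(eval_set, retrieve, k)
    after = recall_at_k(eval_set, with_rerank(retrieve, rerank_fn, depth), k)
    return before, after, after - before
```

Pair the gain with the measured latency of the reranker on the same candidates so the trade is visible in one place.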
Hosted or self-hosted
Cohere Rerank is the easy button: a good model, no infrastructure, pay per call. Pick it when you are early, when you do not want to run a model, or when Cohere does well on your eval set. The tradeoffs are per-call cost and less predictable tail latency under bursty load.
A self-hosted cross-encoder like bge-reranker-v2-m3 gives you control over latency and cost and keeps data in your stack, at the price of running and scaling a GPU service. If reranking is on every request and you have the ops capacity, self-hosting usually wins on cost and tail latency.
Wrapping up
Reranking is a two-stage idea: retrieve wide and cheap, then re-score with a model that reads query and document together. It is one of the highest-leverage upgrades in RAG, but only when your retriever has good recall and bad ordering. Measure recall@50 and recall@3 first, then let the numbers tell you whether to add it.
Next in the series: embeddings explained for engineers, since the quality of your first-stage retriever starts with the embedding model you pick.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.