Production RAG, Part 2: Measuring Retrieval Quality
You can't improve what you don't measure, an evaluation harness for RAG
In Part 1 we fixed chunking. But how do you know it worked? "The answers look better" is not an engineering answer. This part builds a small evaluation harness so every change to your RAG pipeline is backed by a number.
Evaluate retrieval and generation separately. If retrieval fails, no amount of prompt tuning fixes the answer, and you'll waste days in the wrong layer.
Two Layers, Two Questions#
A RAG system has two failure surfaces:
- Retrieval: did we fetch the chunks that contain the answer?
- Generation: given those chunks, did the model answer faithfully?
Measure them independently. Most teams that "can't get RAG to work" have a retrieval problem masquerading as a model problem.
Build a Golden Set#
You need questions with known-relevant chunks. Fifty good examples beat five thousand noisy ones. Write them by hand, or bootstrap with an LLM and review.
golden = [
{
"question": "What chunk overlap should I start with?",
"relevant_chunk_ids": ["chunking#overlap"],
},
{
"question": "How do I measure retrieval quality?",
"relevant_chunk_ids": ["eval#metrics", "eval#golden-set"],
},
]Retrieval Metrics#
Two metrics get you most of the way: hit rate (did a relevant chunk appear in the top-k?) and MRR (how high did the first relevant chunk rank?).
def evaluate_retrieval(golden, retrieve, k=5):
hits, reciprocal_ranks = 0, []
for ex in golden:
retrieved = [c.id for c in retrieve(ex["question"], k=k)]
relevant = set(ex["relevant_chunk_ids"])
rank = next(
(i + 1 for i, cid in enumerate(retrieved) if cid in relevant),
None,
)
if rank:
hits += 1
reciprocal_ranks.append(1 / rank)
else:
reciprocal_ranks.append(0.0)
return {
"hit_rate@k": hits / len(golden),
"mrr@k": sum(reciprocal_ranks) / len(golden),
}Run it before and after a change. If hit_rate@5 goes from 0.72 to 0.81 when you switch to heading-aware chunking, you have evidence, not a hunch.
Generation Metrics: Faithfulness#
The dangerous failure is a fluent answer that isn't supported by the retrieved context, a hallucination wearing a suit. Use an LLM-as-judge to score faithfulness: every claim in the answer must be grounded in the context.
JUDGE_PROMPT = """You are grading a RAG answer for FAITHFULNESS.
Context:
{context}
Answer:
{answer}
Does every factual claim in the answer follow from the context?
Reply with a JSON object: {{"faithful": true|false, "unsupported": ["..."]}}"""Pin the judge model and prompt, and keep them in version control. If your evaluator drifts, your metrics become meaningless, you can no longer compare last week's score to this week's.
Close the Loop#
| Metric | What it catches | Target to aim for |
|---|---|---|
hit_rate@5 | Missing-context failures | > 0.85 |
mrr@5 | Relevant chunk ranked too low | > 0.7 |
| Faithfulness | Hallucinated claims | > 0.95 |
Wire these into CI. Treat a regression in hit_rate@5 like a failing test, because it is one. Once retrieval is measured, the rest of the system stops being guesswork, you tune chunk size, embeddings, and re-rankers against a scoreboard instead of intuition.
This wraps the Production RAG series. The throughline: RAG quality is an engineering discipline, not a prompt. Measure the layers, fix the foundation, and the model will do its job.
Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.