Production RAG, Part 2: Measuring Retrieval Quality, Folarin Akinloye

In Part 1 we fixed chunking. But how do you know it worked? "The answers look better" is not an engineering answer. This part builds a small evaluation harness so every change to your RAG pipeline is backed by a number.

Important

Evaluate retrieval and generation separately. If retrieval fails, no amount of prompt tuning fixes the answer, and you'll waste days in the wrong layer.

Two Layers, Two Questions

A RAG system has two failure surfaces:

Retrieval: did we fetch the chunks that contain the answer?
Generation: given those chunks, did the model answer faithfully?

Measure them independently. Most teams that "can't get RAG to work" have a retrieval problem masquerading as a model problem.

Build a Golden Set

You need questions paired with the chunks known to answer them. Fifty good examples beat five thousand noisy ones. Write them by hand, or bootstrap them with an LLM and review each one.

golden = [
    {
        "question": "What chunk overlap should I start with?",
        "relevant_chunk_ids": ["chunking#overlap"],
    },
    {
        "question": "How do I measure retrieval quality?",
        "relevant_chunk_ids": ["eval#metrics", "eval#golden-set"],
    },
]
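A golden set goes stale when chunk IDs change, for example after a re-chunk. The sketch below is one way to sanity-check it, not part of any library. It assumes you can list the chunk IDs currently in your index. The `validate_golden` helper and its `known_chunk_ids` argument are hypothetical names.

```python
def validate_golden(golden, known_chunk_ids):
    """Return a list of human-readable problems found in the golden set.

    `known_chunk_ids` is assumed to be the set of chunk IDs currently in
    your index; how you obtain it depends on your vector store.
    """
    problems = []
    seen_questions = set()
    for i, ex in enumerate(golden):
        question = ex.get("question", "").strip()
        relevant = ex.get("relevant_chunk_ids", [])
        if not question:
            problems.append(f"example {i}: empty question")
        elif question in seen_questions:
            problems.append(f"example {i}: duplicate question")
        seen_questions.add(question)
        if not relevant:
            problems.append(f"example {i}: no relevant chunk ids")
        # An ID missing from the index usually means the chunking changed.
        for cid in relevant:
            if cid not in known_chunk_ids:
                problems.append(f"example {i}: unknown chunk id {cid!r}")
    return problems
```

Running this before every evaluation keeps a missing chunk ID from being misread as a retrieval regression.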

Retrieval Metrics

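Two metrics get you most of the way: hit rate (did a relevant chunk appear in the top-k?) and MRR, or mean reciprocal rank (how high did the first relevant chunk rank?). A question whose first relevant chunk lands at rank 1 contributes 1.0 to MRR, rank 2 contributes 0.5, and a miss contributes 0.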

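def evaluate_retrieval(golden, retrieve, k=5):
    """Compute hit rate and MRR at k over a golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    hits, reciprocal_ranks = 0, []
    for ex in golden:
        retrieved = [c.id for c in retrieve(ex["question"], k=k)]
        relevant = set(ex["relevant_chunk_ids"])
        # 1-based rank of the first relevant chunk, or None on a miss.
        rank = next(
            (i + 1 for i, cid in enumerate(retrieved) if cid in relevant),
            None,
        )
        if rank is not None:
            hits += 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0.0)
    # Name the keys after the actual k, e.g. "hit_rate@5" and "mrr@5".
    return {
        f"hit_rate@{k}": hits / len(golden),
        f"mrr@{k}": sum(reciprocal_ranks) / len(golden),
    }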

Run it before and after a change. If hit_rate@5 goes from 0.72 to 0.81 when you switch to heading-aware chunking, you have evidence, not a hunch.
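An aggregate score tells you that something moved. A per-question diff tells you what moved. The sketch below compares two retrievers question by question. It assumes both follow the same `retrieve(question, k=...)` convention and return objects with an `.id` attribute. `Chunk`, `first_relevant_rank`, `diff_runs`, and the two toy retrievers are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in for whatever object your retriever returns; only `.id` is used.
Chunk = namedtuple("Chunk", ["id"])


def first_relevant_rank(retrieved_ids, relevant_ids):
    """1-based rank of the first relevant chunk, or None if none was retrieved."""
    for position, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return position
    return None


def diff_runs(golden, retrieve_before, retrieve_after, k=5):
    """List questions whose first-relevant rank changed between two retrievers."""
    changes = []
    for ex in golden:
        relevant = set(ex["relevant_chunk_ids"])
        before = first_relevant_rank(
            [c.id for c in retrieve_before(ex["question"], k=k)], relevant
        )
        after = first_relevant_rank(
            [c.id for c in retrieve_after(ex["question"], k=k)], relevant
        )
        if before != after:
            changes.append(
                {"question": ex["question"], "before": before, "after": after}
            )
    return changes


# Toy retrievers with hard-coded results, purely for illustration.
def old_retriever(question, k=5):
    return [Chunk("intro#welcome"), Chunk("chunking#size")][:k]


def new_retriever(question, k=5):
    return [Chunk("intro#welcome"), Chunk("chunking#overlap")][:k]


golden = [
    {
        "question": "What chunk overlap should I start with?",
        "relevant_chunk_ids": ["chunking#overlap"],
    },
]

print(diff_runs(golden, old_retriever, new_retriever))
```

With these toy retrievers, the one golden question moves from a miss (`None`) to rank 2. That is the kind of detail a single averaged number hides.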

Generation Metrics: Faithfulness

The dangerous failure is a fluent answer that the retrieved context does not support: a hallucination wearing a suit. Use an LLM-as-judge to score faithfulness, where an answer counts as faithful only if every factual claim in it is grounded in the context.

JUDGE_PROMPT = """You are grading a RAG answer for FAITHFULNESS.
Context:
{context}

Answer:
{answer}

Does every factual claim in the answer follow from the context?
Reply with a JSON object: {{"faithful": true|false, "unsupported": ["..."]}}"""
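The judge's reply still has to become a number. The sketch below parses the JSON verdict and computes the fraction of faithful answers. It assumes a hypothetical `judge(context, answer)` callable that formats the prompt, calls your model, and returns the raw text reply. Counting unparseable replies as unfaithful is a conservative choice made here, not something the prompt guarantees.

```python
import json


def parse_verdict(raw_reply):
    """Parse the judge's JSON reply; treat anything malformed as unfaithful."""
    try:
        verdict = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {"faithful": False, "unsupported": ["<unparseable judge reply>"]}
    if not isinstance(verdict, dict) or not isinstance(verdict.get("faithful"), bool):
        return {"faithful": False, "unsupported": ["<malformed judge reply>"]}
    unsupported = verdict.get("unsupported") or []
    if not isinstance(unsupported, list):
        unsupported = [str(unsupported)]
    return {"faithful": verdict["faithful"], "unsupported": unsupported}


def faithfulness_rate(samples, judge):
    """Return (fraction judged faithful, list of parsed verdicts).

    `samples` is a list of {"context": ..., "answer": ...} dicts.
    `judge(context, answer)` is a hypothetical callable returning the
    judge model's raw text reply.
    """
    if not samples:
        raise ValueError("no samples to grade")
    verdicts = [parse_verdict(judge(s["context"], s["answer"])) for s in samples]
    faithful = sum(1 for v in verdicts if v["faithful"])
    return faithful / len(verdicts), verdicts
```

Keeping the per-sample verdicts alongside the rate lets you read the `unsupported` claims when the score drops.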

Warning

Pin the judge model and prompt, and keep them in version control. If your evaluator drifts, your metrics become meaningless: you can no longer compare last week's score to this week's.
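One way to make evaluator drift visible is to store a fingerprint of the judge configuration next to every score. The sketch below is illustrative only. `judge_fingerprint` is a hypothetical helper, and the fields it hashes (model name, prompt template, temperature) are assumptions about what defines your judge.

```python
import hashlib
import json


def judge_fingerprint(model_name, prompt_template, temperature=0.0):
    """Short, stable hash of the settings that define the evaluator."""
    payload = json.dumps(
        {
            "model": model_name,
            "prompt": prompt_template,
            "temperature": temperature,
        },
        sort_keys=True,  # stable key order keeps the hash reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Compare two faithfulness scores only when their fingerprints match. A changed fingerprint means the evaluator changed, so the history starts over.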

Close the Loop

Metric	What it catches	Target to aim for
`hit_rate@5`	Missing-context failures	> 0.85
`mrr@5`	Relevant chunk ranked too low	> 0.7
Faithfulness	Hallucinated claims	> 0.95

Wire these into CI. Treat a regression in hit_rate@5 like a failing test, because it is one. Once retrieval is measured, the rest of the system stops being guesswork: you tune chunk size, embeddings, and re-rankers against a scoreboard instead of intuition.
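A CI gate can be as small as a threshold check. The sketch below uses the targets from the table as defaults. `check_thresholds` and the metric key names are hypothetical, and the example numbers are made up for illustration.

```python
# Targets from the table above; tune them for your own corpus.
THRESHOLDS = {"hit_rate@5": 0.85, "mrr@5": 0.7, "faithfulness": 0.95}


def check_thresholds(metrics, thresholds=THRESHOLDS):
    """Return failure messages for metrics that are missing or not above target."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value <= minimum:
            failures.append(f"{name}: {value:.3f} is not above {minimum}")
    return failures


# Hypothetical numbers; in CI these would come from your evaluation run.
example_metrics = {"hit_rate@5": 0.81, "mrr@5": 0.74, "faithfulness": 0.97}
print(check_thresholds(example_metrics))
```

In a CI job, exit with a non-zero status whenever the returned list is non-empty so the build fails the same way a unit test would.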

This wraps the Production RAG series. The throughline: RAG quality is an engineering discipline, not a prompt. Measure the layers, fix the foundation, and the model will do its job.
