Generating a Synthetic Dataset for RAG Retrieval, Folarin Akinloye

Retrieval is the part of RAG that quietly decides everything. If the retriever hands the model the wrong passages, the best prompt in the world cannot save the answer. And retrievers that work well on English news often fall apart the moment you point them at Czech legal text, Indian tax rules, or any other narrow, low-resource domain. The usual fix is to fine-tune a retriever on your domain, but that needs labeled query-document pairs, which you do not have.

Here is the move that gets you out of that hole: use a large model to generate the training data, then train a small retriever on it. It works, it is cheap, and there is real research behind it.

Why retrieval breaks in your domain

A retriever's job is to look at a query and find the documents that are relevant to it. Off-the-shelf embedding models learned "relevant" from broad, mostly-English web data. Two things go wrong when you leave that comfort zone.

Language. Retrieval quality drops in languages that were thin in the training data. The model has a weaker sense of which words mean similar things.

Domain intent. "Relevant" is not universal. For an argument-retrieval task you might want supporting arguments; for another you want counter-arguments. Same query, same document, opposite relevance, depending on what the task actually means by relevant. A general retriever has no idea which one you want.

You could fix this with a labeled dataset of query-document pairs in your domain. But labeling thousands of those is a month of work and real money. That is the wall this technique walks around.

The idea: distill a big model into a small retriever

Instead of labeling by hand, prompt a large model (ChatGPT or GPT-4 class) to generate the queries for you. You already have documents: that is your corpus. What you lack is queries that map to them. So you flip the usual direction: for each document, ask the big model to write a query that this document would answer.

Do that across your corpus and you have query-document pairs, which is exactly the training signal a retriever needs. You are distilling the large model's understanding of relevance into a small, cheap encoder. Training the encoder costs compute up front, but then inference is cheap and fast, and it is tuned to your domain.
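The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `build_training_pairs` and `fake_generate_query` are hypothetical names, and the fake generator stands in for the real large-model call so the sketch runs offline.

```python
from typing import Callable, Iterable


def build_training_pairs(
    corpus: Iterable[str],
    generate_query: Callable[[str], str],
) -> list[dict[str, str]]:
    """Pair every document with one synthetic query written for it."""
    pairs = []
    for document in corpus:
        query = generate_query(document).strip()
        if query:  # skip documents for which the generator returned nothing
            pairs.append({"query": query, "document": document})
    return pairs


# Hypothetical stand-in for the large-model call (assumption, not a real API).
def fake_generate_query(document: str) -> str:
    return "what does this say about " + document.split()[0].lower() + "?"


corpus = [
    "Dividends are taxed in the year they are paid.",
    "Leases over ten years must be registered.",
]
pairs = build_training_pairs(corpus, fake_generate_query)
```

In practice you would swap `fake_generate_query` for a function that sends the few-shot prompt to your model of choice; the resulting list of pairs is what the retriever is trained on.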

This is the Promptagator approach from Dai et al., 2022. Their striking result: with only 8 manually labeled examples plus a corpus of unlabeled documents, a retriever trained on synthetic queries reached near state-of-the-art retrieval quality. Synthetic data closed a gap that otherwise needed thousands of hand-labels.

How the prompt is built

You write a short task description and give a handful of labeled examples, then a new document for which the model should generate a query. Here is the counter-argument example from the guide:

Task: Identify a counter-argument for the given argument.
 
Argument #1: {passage X1}
A concise counter-argument query related to argument #1: {query Y1}
 
Argument #2: {passage X2}
A concise counter-argument query related to argument #2: {query Y2}
 
... (a few more examples) ...
 
Argument #N: {a new document from your corpus}
A concise counter-argument query related to argument #N:

The model reads the pattern from your examples and produces a query for the new document. Only that last document and its generated query go into the training set. Written more generally, each generation step is: the task instruction, plus a few example (document, query) pairs, plus the new document, and out comes a query.
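As a rough sketch, the template above can be assembled with plain string formatting. The function name `build_prompt` and its parameters are illustrative assumptions; the wording of each line follows the counter-argument template shown earlier.

```python
def build_prompt(task, examples, new_document, label="counter-argument"):
    """Assemble a few-shot prompt: instruction, example pairs, then the new document.

    `examples` is a list of (passage, query) tuples written by hand.
    """
    lines = [f"Task: {task}", ""]
    for i, (passage, query) in enumerate(examples, start=1):
        lines.append(f"Argument #{i}: {passage}")
        lines.append(f"A concise {label} query related to argument #{i}: {query}")
        lines.append("")
    # The new document gets the next number and an empty slot for the model to fill.
    n = len(examples) + 1
    lines.append(f"Argument #{n}: {new_document}")
    lines.append(f"A concise {label} query related to argument #{n}:")
    return "\n".join(lines)


prompt = build_prompt(
    task="Identify a counter-argument for the given argument.",
    examples=[("passage X1", "query Y1"), ("passage X2", "query Y2")],
    new_document="a new document from your corpus",
)
```

The prompt deliberately ends on the unfinished last line, so whatever the model writes next is the query for the new document.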

Tip

Prepare more examples than you use per prompt. The guide suggests writing around 20 good ones and randomly sampling 2 to 8 into each prompt. This adds diversity to your generated queries without much extra annotation work, since you reuse the same pool.
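One way to do the sampling, assuming the pool is a simple list of (passage, query) tuples. `sample_examples` is a hypothetical helper, and the placeholder pool below is made up for illustration.

```python
import random


def sample_examples(pool, rng, min_k=2, max_k=8):
    """Draw between min_k and max_k distinct examples from the pool."""
    upper = min(max_k, len(pool))
    lower = min(min_k, upper)
    k = rng.randint(lower, upper)  # inclusive on both ends
    return rng.sample(pool, k)     # sampling without replacement


# Placeholder pool of 20 hand-written examples (illustrative data only).
pool = [(f"passage {i}", f"query {i}") for i in range(20)]
rng = random.Random(0)  # seeded so a generation run can be reproduced
shots = sample_examples(pool, rng)
```

Calling `sample_examples` once per document gives every prompt a different mix of examples drawn from the same pool.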

The examples do a lot of heavy lifting. They should be representative, correctly formatted, and specific about the details you care about: how long the query should be, what tone, what "relevant" means for your task. Sloppy examples produce sloppy synthetic data, and that flows straight into a worse retriever. This is worth being fussy about.

The cost, with real numbers

This is where it gets persuasive. Take a prompt with instructions and a few examples at roughly 700 tokens, generating about 25 tokens per query, run across a 50,000-document corpus. Using GPT-3.5 Turbo pricing from the guide:

50,000 * (700 * 0.001 * $0.0015 + 25 * 0.001 * $0.002) ≈ $55

About fifty-five dollars to generate training data for a 50,000-document corpus. The 50,000 figure is not random either: Dai et al. found that is roughly the amount of manually labeled data you would otherwise need to match the quality you get from the synthetic set. So the comparison is stark. Fifty-five dollars and a couple of days of generation and training, versus a month of labeling and labor costs well past a thousand dollars.
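The same arithmetic as a small function, in case you want to plug in your own corpus size or prices. The function and parameter names are illustrative; the prices are the per-1,000-token figures used in the formula above and will differ for other models.

```python
def generation_cost_usd(
    num_documents,
    prompt_tokens,
    output_tokens,
    input_price_per_1k,
    output_price_per_1k,
):
    """Estimate total generation cost, assuming one query per document."""
    per_call = (
        prompt_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return num_documents * per_call


# Numbers from the worked example: 50,000 documents, ~700 prompt tokens,
# ~25 generated tokens, $0.0015 and $0.002 per 1,000 tokens.
cost = generation_cost_usd(50_000, 700, 25, 0.0015, 0.002)
print(f"${cost:.2f}")
```

With these inputs the estimate comes out at about $55, matching the figure above. Note that the prompt tokens dominate the bill, so trimming the examples matters more than shortening the queries.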

For a specialized retriever (not general English news, but something like domain-specific legal or medical retrieval), that trade is usually a clear win. You can even generate 2 to 4 queries per document to expand the set further.

Fitting it into a real RAG pipeline

The synthetic dataset is a means to an end: a retriever that actually surfaces the right documents in your domain. Once you have trained it, it slots into the same pipeline as any other retriever, and the rest of your RAG stack is unchanged.

A few things to keep straight. Watch generation quality, because a query that does not really match its document is a mislabeled training pair, and enough of those degrade the retriever. Spot-check a sample by hand. Keep your embedding and chunking choices consistent between generation and serving; if you train on one chunking scheme and serve another, you have moved the goalposts. If embeddings or chunking are fuzzy for you, "Embeddings Explained for Engineers" and "Chunking Strategies for RAG" cover the groundwork.
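A spot-check can be as simple as drawing a random sample of pairs, reading them, and recording how many are bad. The sketch below assumes pairs are stored as dictionaries; `sample_for_review` and `mismatch_rate` are hypothetical helpers, and the verdicts are made-up review results.

```python
import random


def sample_for_review(pairs, sample_size, seed=0):
    """Pick a reproducible random subset of pairs for manual inspection."""
    rng = random.Random(seed)
    k = min(sample_size, len(pairs))
    return rng.sample(pairs, k)


def mismatch_rate(verdicts):
    """Share of reviewed pairs judged a mismatch.

    `verdicts` is a list of booleans: True means the query fits its document.
    """
    if not verdicts:
        raise ValueError("no reviewed pairs")
    return verdicts.count(False) / len(verdicts)


# Illustrative data: 500 synthetic pairs, 50 pulled for review.
pairs = [{"query": f"query {i}", "document": f"doc {i}"} for i in range(500)]
to_review = sample_for_review(pairs, sample_size=50)

# Made-up verdicts from a reviewer for four of the sampled pairs.
rate = mismatch_rate([True, True, False, True])
```

What mismatch rate is acceptable is a judgment call for your task; the point is to have a number rather than a hunch before you start training.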

And measure the payoff. The whole reason to do this is better retrieval, so score it: retrieval hit rate before and after, then end-to-end answer quality. "Evaluating RAG: faithfulness, context relevance, and answer quality" lays out the metrics. If the synthetic-trained retriever does not move those numbers, you have learned that cheaply too.
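Hit rate is easy to compute yourself. A minimal sketch, assuming you hold out a small set of queries with known relevant documents and can get a ranked list of document IDs from each retriever; the function name and the toy rankings below are illustrative, not real results.

```python
def hit_rate_at_k(ranked_results, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k.

    ranked_results: query id -> list of document ids, best first.
    relevant_ids:   query id -> set of document ids judged relevant.
    """
    if not relevant_ids:
        raise ValueError("no evaluation queries")
    hits = 0
    for query_id, relevant in relevant_ids.items():
        top_k = ranked_results.get(query_id, [])[:k]
        if any(doc_id in relevant for doc_id in top_k):
            hits += 1
    return hits / len(relevant_ids)


# Toy data for two held-out queries (made up for illustration).
relevant = {"q1": {"d1"}, "q2": {"d7"}}
baseline = {"q1": ["d3", "d1", "d9"], "q2": ["d2", "d4", "d5"]}
finetuned = {"q1": ["d1", "d3", "d9"], "q2": ["d4", "d7", "d2"]}

before = hit_rate_at_k(baseline, relevant, k=3)
after = hit_rate_at_k(finetuned, relevant, k=3)
```

Score both retrievers on the same held-out queries and the same `k`, ideally with queries written by people rather than generated, so the comparison is not flattered by the synthetic data's own style.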

This is the more rigorous cousin of the general technique in "Generating Synthetic Data with Prompts". Same core idea, prompting a model to manufacture labeled data, but aimed at the specific, high-value problem of making retrieval work where an off-the-shelf model gives up. For fifty-five dollars, it is one of the better trades in the RAG toolkit.
