Generating Synthetic Data with Prompts
Turn a model into a data factory for tests, evals, and cold-start training, without spending a month labeling
There is a familiar dead spot at the start of a machine learning project: you have an idea, but no data to test it with. The old answer was to spend weeks collecting and labeling before you could even find out if the idea was any good. Now you can prompt a model to hand you a few hundred labeled examples in a minute and start testing today.
Synthetic data is genuinely useful for tests, evaluation sets, and cold-starting a project. It also has sharp edges that will burn you if you treat it as free real data. This post covers both.
The simplest version
At its most basic, you ask the model for labeled examples in a format you specify. Here is the sentiment-classification version straight from the Prompt Engineering Guide:
    Produce 10 exemplars for sentiment analysis. Examples are categorized as either
    positive or negative. Produce 2 negative examples and 8 positive examples.
    Use this format for the examples:
    Q: <sentence>
    A: <sentiment>

You get ten clean, formatted examples back, split the way you asked. That is the whole trick, and it is more useful than it looks. Notice two things about that prompt: it fixes the format (so the output is parseable) and it controls the distribution (2 negative, 8 positive). Those two levers, format and distribution, are what turn "the model made up some text" into "I have a usable dataset".
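Because the format is fixed, turning the reply into records takes only a few lines. Here is a minimal sketch, assuming the reply arrives as a plain string. The `raw_output` text is made up for illustration, and `parse_examples` is a hypothetical helper, not part of any library.

```python
import re
from collections import Counter

# Hypothetical model reply in the Q:/A: format the prompt asked for.
raw_output = """Q: I loved every minute of this film.
A: positive
Q: The battery died after two days.
A: negative
Q: Great value for the price.
A: Positive
"""

# One "Q: ..." line immediately followed by one "A: ..." line.
PAIR = re.compile(
    r"^Q:[ \t]*(?P<text>.+?)[ \t]*\nA:[ \t]*(?P<label>\w+)[ \t]*$",
    re.MULTILINE,
)

def parse_examples(raw: str) -> list[dict]:
    """Turn Q:/A: text into {"text", "label"} records with lowercased labels."""
    return [
        {"text": m.group("text"), "label": m.group("label").lower()}
        for m in PAIR.finditer(raw)
    ]

examples = parse_examples(raw_output)
counts = Counter(ex["label"] for ex in examples)
print(counts)  # compare this against the split you requested
```

Counting the labels after parsing is the cheap way to confirm the model actually honored the distribution you asked for.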
Where it actually earns its keep
I reach for synthetic data in three situations.
Test fixtures. When I need realistic-looking inputs to exercise a pipeline (a hundred varied support messages, a batch of product reviews, some messy user queries), generating them is faster than hunting for real ones and safer than using real customer data.
Evaluation sets. To measure whether a prompt or a retrieval change helped, I need labeled cases to score against. Synthetic examples give me a starting eval set immediately, which I can then refine and extend with real cases as they arrive. Any measurement beats none, and this gets you measuring on day one.
Cold-start training. When there is not enough real labeled data to train a smaller model, synthetic examples can seed it. This is the whole premise behind distilling a big model's behavior into a small one, and it is especially valuable in domains and languages where labeled data barely exists.
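For the evaluation-set case, the scoring loop can be very small. The sketch below assumes a sentiment task and uses two stand-in classifiers (`baseline` and `keyword_rule`, both hypothetical) where you would normally wrap a prompt and a model call. The four eval examples are invented for illustration.

```python
from typing import Callable

# Hypothetical synthetic eval set: (text, expected label) pairs.
EVAL_SET = [
    ("Absolutely love it, works perfectly.", "positive"),
    ("Broke within a week, total waste.", "negative"),
    ("Fast delivery and great quality.", "positive"),
    ("Terrible support, never buying again.", "negative"),
]

def accuracy(predict: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval cases where the prediction matches the expected label."""
    correct = sum(1 for text, label in eval_set if predict(text) == label)
    return correct / len(eval_set)

# Stand-in for "the current prompt": always answers positive.
def baseline(text: str) -> str:
    return "positive"

# Stand-in for "the changed prompt": a crude keyword rule.
def keyword_rule(text: str) -> str:
    negative_words = ("broke", "waste", "terrible", "never")
    lowered = text.lower()
    return "negative" if any(w in lowered for w in negative_words) else "positive"

print(f"baseline:     {accuracy(baseline, EVAL_SET):.2f}")
print(f"keyword_rule: {accuracy(keyword_rule, EVAL_SET):.2f}")
```

The point is the shape, not the classifiers: a fixed labeled set plus one scoring function lets you compare two versions of anything on the same cases.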
Controlling what you get
Naive generation gives you bland, repetitive output. The model settles into a groove and produces fifty variations of the same three sentences. Getting good synthetic data is mostly about fighting that tendency. Three things help.
Specify the distribution explicitly, the way the example above asks for 2 negative and 8 positive. Do not let the model decide the balance, because it will drift toward whatever is most typical.
Ask for variety along real dimensions. Instead of "generate 20 customer complaints", say "generate 20 customer complaints varying in tone (angry, confused, polite), length (one line to a paragraph), and product area (billing, shipping, quality)". Naming the axes of variation is the single biggest lever on diversity.
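One way to make the axes explicit is to sample combinations in code and render a prompt per combination, rather than hoping a single prompt covers them all. This is a sketch under assumptions: `build_prompts` is a hypothetical helper, the prompt wording is mine, and the middle length value ("a few sentences") is my guess at a point between "one line" and "a paragraph".

```python
import itertools
import random

# Axes from the complaint example; the values for "length" are partly assumed.
AXES = {
    "tone": ["angry", "confused", "polite"],
    "length": ["one line", "a few sentences", "a full paragraph"],
    "product area": ["billing", "shipping", "quality"],
}

def build_prompts(axes: dict[str, list[str]], n: int, seed: int = 0) -> list[str]:
    """Sample up to n distinct axis combinations and render one prompt for each."""
    combos = list(itertools.product(*axes.values()))
    rng = random.Random(seed)  # seeded so a run can be reproduced
    chosen = rng.sample(combos, k=min(n, len(combos)))
    prompts = []
    for combo in chosen:
        spec = ", ".join(f"{name}: {value}" for name, value in zip(axes, combo))
        prompts.append(f"Write one customer complaint with these properties. {spec}.")
    return prompts

for p in build_prompts(AXES, n=5):
    print(p)
```

Sampling without replacement guarantees no two prompts ask for the same combination, which puts a floor under diversity before the model writes a word.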
Pin the format so the output parses without hand-cleaning. JSON is the reliable choice.
    from openai import OpenAI
    import json

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = """Generate 8 customer support messages as a JSON array.
    Each item: {"text": <the message>, "category": <one of: billing, shipping, technical>,
    "urgency": <low|medium|high>}.
    Vary the tone, length, and category. Make them read like real people wrote them,
    with typos and informal phrasing where natural. Return only the JSON array."""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature here, we WANT variety
    )

    # json.loads raises if the model wraps the array in prose or code fences
    examples = json.loads(resp.choices[0].message.content)
    for ex in examples:
        print(ex["category"], "|", ex["urgency"], "|", ex["text"][:60])

This is one of the rare times you want temperature high, not low. For factual answering you turn it down to stay accurate. For data generation you turn it up, because diversity is the whole point. Just do not crank it so high the labels stop matching the text.
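A pinned format only helps if the output actually follows it, so it is worth validating each record before it enters the dataset. The sketch below assumes the same three fields and allowed values as the prompt above. `validate_examples` is a hypothetical helper and the `sample` string is invented.

```python
import json

# Allowed values mirror the prompt; tuples keep the membership test safe
# even if the model returns an unexpected type such as a list.
ALLOWED_CATEGORIES = ("billing", "shipping", "technical")
ALLOWED_URGENCY = ("low", "medium", "high")

def validate_examples(raw: str) -> tuple[list[dict], list[str]]:
    """Parse a JSON array and split it into valid records and error messages."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as err:
        return [], [f"not valid JSON: {err}"]
    if not isinstance(items, list):
        return [], ["top-level value is not a JSON array"]

    valid, errors = [], []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {i}: not an object")
        elif not isinstance(item.get("text"), str) or not item["text"].strip():
            errors.append(f"item {i}: missing or empty text")
        elif item.get("category") not in ALLOWED_CATEGORIES:
            errors.append(f"item {i}: bad category {item.get('category')!r}")
        elif item.get("urgency") not in ALLOWED_URGENCY:
            errors.append(f"item {i}: bad urgency {item.get('urgency')!r}")
        else:
            valid.append(item)
    return valid, errors

sample = (
    '[{"text": "wheres my refund??", "category": "billing", "urgency": "high"},'
    ' {"text": "app crashes", "category": "bug", "urgency": "low"}]'
)
valid, errors = validate_examples(sample)
print(len(valid), "valid;", errors)
```

This only checks structure. Whether the category actually matches the text is a separate question that needs a human look.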
The failure modes, stated plainly
Synthetic data is not real data, and pretending otherwise is how people get burned.
It carries the model's biases. If you generate training data from a model, whatever skews the model has come along for the ride, and now they are baked into your dataset. Balance and check it, do not assume it is neutral. (This is the same exemplar-bias problem I wrote about in bias in prompting, just at dataset scale.)
It repeats itself. Even with prompting for variety, generated sets clump. The model has favorite phrasings and returns to them. Left unchecked you get a dataset that looks bigger than it is, because half of it is near-duplicates. Deduplicate, and measure actual diversity rather than trusting it.
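A rough way to catch near-duplicates is word-overlap (Jaccard) similarity with a greedy filter. This is a sketch, not a recommendation of a specific method: the `dedupe` helper is hypothetical, the 0.7 threshold is an assumption you would tune on your own data, and the pairwise comparison is quadratic, so it suits hundreds of examples rather than millions.

```python
import re

def tokens(text: str) -> frozenset[str]:
    """Lowercased set of word tokens, ignoring punctuation."""
    return frozenset(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(texts: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a text only if it is below the threshold against everything kept so far."""
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for text in texts:
        t = tokens(text)
        if all(jaccard(t, k) < threshold for k in kept_tokens):
            kept.append(text)
            kept_tokens.append(t)
    return kept

texts = [
    "My order never arrived and nobody is answering my emails.",
    "My order never arrived and no one is answering my emails!",
    "I was charged twice for the same subscription.",
]
unique = dedupe(texts)
print(f"{len(unique)} of {len(texts)} kept")
```

The ratio of kept to generated examples is a crude but honest diversity number: if it is far below 1, the dataset is smaller than it looks.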
Wrong labels sneak in. The model sometimes generates a "positive" example that reads negative. If you train or evaluate on mislabeled data, your numbers lie. Spot-check a sample by hand. Always.
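To make the spot-check systematic, you can draw a small random sample from each label, review it by hand, and record how often you disagree. The sketch below assumes each record has a `label` field, and that reviewers add a `human_label` field. Both helper names and the example data are hypothetical.

```python
import random
from collections import defaultdict

def review_sample(examples: list[dict], per_label: int, seed: int = 0) -> list[dict]:
    """Pick up to per_label random examples from each label for manual review."""
    by_label: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    rng = random.Random(seed)  # seeded so the same sample can be re-drawn
    sample: list[dict] = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, k=min(per_label, len(group))))
    return sample

def label_error_rate(reviewed: list[dict]) -> float:
    """Share of reviewed examples where the human label differs from the generated one."""
    wrong = sum(1 for ex in reviewed if ex["human_label"] != ex["label"])
    return wrong / len(reviewed)

# Invented dataset: 8 positive and 2 negative examples.
dataset = [{"text": f"example {i}", "label": "positive"} for i in range(8)]
dataset += [{"text": f"example {i}", "label": "negative"} for i in range(8, 10)]

to_review = review_sample(dataset, per_label=2)
print(len(to_review), "examples to check by hand")
```

Sampling per label rather than across the whole set keeps the rare class from being skipped, which matters when the split is as lopsided as 8 to 2.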
It does not know your real distribution. Synthetic data reflects what the model thinks your domain looks like, which may not match what your users actually send. Treat it as a scaffold you replace with real data over time, not a permanent substitute.
The honest bottom line
Synthetic data is a great way to start and a poor place to stop. Use it to get moving on day one: test fixtures, a first eval set, a cold-start training batch. Control the distribution, push for real variety, pin the format, and always spot-check the labels. Then, as real data arrives, fold it in and lean on the synthetic set less.
For the specific and higher-stakes case of generating data to train a retriever, there is a whole method with real cost numbers and a published technique behind it. I broke that down separately in generating a synthetic dataset for RAG. And if you are generating examples to use as few-shot prompts rather than training data, the mechanics of picking good ones are in zero-shot vs few-shot prompting.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.