Dataset Diversity: Fixing Repetitive Synthetic Generations
Why your generated dataset all sounds the same, and the seeding trick that fixes it
Here is the trap with synthetic data: you write one good generation prompt, run it 10,000 times, and get 10,000 versions of the same three examples. Turning up temperature does not save you. The fix is to inject randomness into the prompt itself, and it is one of the highest-leverage tricks I know for anyone training or fine-tuning on generated data.
I covered the basic pipeline in generating synthetic data with prompts and the RAG-specific version in generating a synthetic dataset for RAG. This post is about the failure mode that hits you after those pipelines work: everything the model produces starts to rhyme.
Why temperature does not buy you diversity
Temperature reshapes the probability distribution over the next token. It does not change what the model considers a typical answer to your prompt. Ask a model to "write a short story for a child" a thousand times at temperature 1.0 and you will get a thousand stories about a little girl or boy who learns a gentle lesson, because that is the center of mass for that request. The samples differ in wording, not in substance.
The same thing happens with more serious tasks. Generate classification examples for "customer complaint" and you get endless variations of a late delivery. Generate coding exercises and you get FizzBuzz in a hundred costumes. Your downstream model then overfits to that narrow slice and falls over on real inputs.
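To make the "center of mass" point concrete, here is a toy sketch. It treats four whole story premises as if they were single tokens, which is a simplification, and the scores are made up for illustration. It shows that dividing the scores by a temperature changes how peaked the distribution is but not which options sit on top.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after scaling by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical premises and scores; none of these numbers come from a real model.
premises = [
    "girl learns to share",
    "boy finds a puppy",
    "ancient thunder spirit",
    "robot decorates a cave",
]
logits = [5.0, 4.5, 0.5, 0.0]

for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    ranked = sorted(zip(premises, probs), key=lambda pair: -pair[1])
    print(t, [(name, round(p, 3)) for name, p in ranked])
```

In this toy setup the two "typical" premises keep almost all of the probability mass at every temperature shown, which is the behaviour the section describes.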
The fix: seed the prompt with random entities
The cleanest demonstration of the fix comes from the TinyStories work by Eldan and Li (2023), which the Prompt Engineering Guide walks through in its dataset diversity chapter. The goal was a dataset of children's stories covering a young child's full vocabulary. Their approach:
- Build a vocabulary of about 1,500 basic words, split into nouns, verbs, and adjectives.
- On every generation, randomly pick one word from each list and require the story to use all three.
- Also maintain a list of story features (dialogue, a plot twist, a bad ending, a moral) and randomly require a couple of those too.
The prompt template looks like this:
```python
import random

# Tiny illustrative pools; the real vocabulary has about 1,500 words.
verbs = ["decorate", "jump", "share"]
nouns = ["thunder", "ball", "tree"]
adjectives = ["ancient", "happy", "tiny"]
features = ["dialogue", "a plot twist", "a bad ending", "a moral"]
story_features = ", ".join(random.sample(features, 2))  # two distinct features
prompt = f"""Write a short story (3-5 paragraphs) which only uses very simple
words that a 3 year old child would likely understand. The story should use
the verb "{random.choice(verbs)}", the noun "{random.choice(nouns)}" and the
adjective "{random.choice(adjectives)}". The story should have the following
features: {story_features}.
Remember to only use simple words!"""
```

Force "decorate", "thunder", and "ancient" into one story and you get something no amount of temperature would have produced. The model follows the constraints precisely, and because the constraints change every call, the outputs stop collapsing onto the same few templates. The randomness lives in your code, where you control it, not in the sampler, where you do not.
The general recipe:
- Identify which parameters or entities should vary across samples in your dataset.
- Generate or hand-write a pool of values for each one.
- Randomly fill the prompt template on every call. Set temperature a bit above default, but nowhere near max.
- Train your local model on the results.
One of your seeded entities can be the class label itself. For sentiment classification, inject "positive" or "negative" into the prompt and you get labeled data for free, no separate annotation pass.
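Here is a minimal sketch of that recipe with the label as one of the seeded entities. The pools, the template wording, and the `make_seeded_prompt` helper are all hypothetical names I made up for illustration, and the actual model call is left out.

```python
import random

# Hypothetical pools; real ones would be much larger.
PRODUCTS = ["wireless headphones", "standing desk", "espresso machine"]
TOPICS = ["delivery time", "billing", "setup instructions", "build quality"]
TONES = ["formal", "frustrated", "sarcastic", "brief"]
LABELS = ["positive", "negative"]

TEMPLATE = (
    "Write a {label} customer review of a {product}. "
    "Focus on {topic} and use a {tone} tone."
)

def make_seeded_prompt(rng):
    """Fill the template with random entities; return the prompt and its label."""
    seeds = {
        "label": rng.choice(LABELS),
        "product": rng.choice(PRODUCTS),
        "topic": rng.choice(TOPICS),
        "tone": rng.choice(TONES),
    }
    return TEMPLATE.format(**seeds), seeds["label"]

rng = random.Random(0)  # fixed seed so the batch can be regenerated
batch = [make_seeded_prompt(rng) for _ in range(5)]
for prompt, label in batch:
    print(label, "|", prompt)
```

Because the label is chosen before the prompt is sent, each generated text arrives already paired with its class.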
Level two: hierarchical generation
You can push this further by making the LLM generate some of the seed material itself. First ask for a story summary and a single sentence that must appear in the story (seeded with your random words). Then feed that intermediate output into the final generation prompt:
```text
Summary: {summary generated in step one}
Features: {features from the initial prompt}
Sentence: {sentence generated in step one}
Words: {words from the initial prompt}
Story:
```

Now the diversity compounds: random words produce varied summaries, and varied summaries produce even more varied stories. This two-stage structure also gives you labels for free. If the initial prompt asked for a dialogue and a plot twist, you know exactly which features each sample contains, which is exactly what you need to train a classifier over those properties.
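The two-stage flow can be sketched roughly like this. The `fake_llm` function is a stand-in I invented so the example runs without a model, the stage-one prompt wording is my own assumption, and splitting stage one into two calls is a choice made here for clarity.

```python
import random

def fake_llm(prompt):
    """Stand-in for a real model call; it only echoes part of the prompt."""
    return f"<output for: {prompt[:40]}...>"

def generate_story(words, features, llm):
    """Run stage one (summary and sentence), then stage two (the story)."""
    word_list = ", ".join(words)
    # Stage one: intermediate material seeded with the random words.
    summary = llm(f"Write a short story summary that uses the words {word_list}.")
    sentence = llm(f"Write one sentence for a story that uses the words {word_list}.")
    # Stage two: the final prompt combines the seeds with the stage-one outputs.
    final_prompt = (
        f"Summary: {summary}\n"
        f"Features: {', '.join(features)}\n"
        f"Sentence: {sentence}\n"
        f"Words: {word_list}\n"
        "Story:"
    )
    story = llm(final_prompt)
    # Keep the seeds next to the story so they can serve as labels later.
    return {"story": story, "features": features, "words": words}

rng = random.Random(1)
all_features = ["dialogue", "a plot twist", "a bad ending", "a moral"]
record = generate_story(
    words=["decorate", "thunder", "ancient"],
    features=rng.sample(all_features, 2),
    llm=fake_llm,
)
print(record["features"], record["story"])
```

Swapping `fake_llm` for a real client call is the only change needed to run this against a model, since `generate_story` only assumes a function that maps a prompt string to a text string.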
Does this actually work? Phi-1 says yes
The obvious question is whether synthetic data built this way trains models that hold up. Gunasekar et al. (2023), the "Textbooks Are All You Need" paper, is the strongest early evidence. They generated textbook-quality Python teaching material with GPT-3.5, deliberately diversifying by constraining topic and target audience, and used it to train Phi-1, a 1.3B parameter model that rivaled models roughly ten times its size on HumanEval.
The target audience constraint is worth stealing. A first-year undergrad, a bootcamp student, and a PhD candidate all explain the same concept differently, so rotating the audience in your prompt is a cheap second axis of diversity:
```text
Write an extract from a Computer Science textbook for a 1st-year bachelor.
The coding language is Python 3.6.
This is an extract from the middle of the following topic: Singular matrices.
The extract starts with a high-level overview of the topic. Then, it presents
an example and describes the solution in natural language. After that, it
provides 1-2 code snippets, following the example. Each snippet has no more
than 10 rows. There should be no text after code snippets.
```

They generated about 1B tokens this way. At 2023 prices that was roughly $2,000 of generation for a pretraining-scale augmentation. For a fine-tune you need a small fraction of that, and generation costs have dropped a lot since.
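Rotating the audience can be as simple as crossing an audience pool with a topic pool. This is a sketch under my own assumptions, not the paper's pipeline: the pools are hypothetical, and the template is shortened to the two lines that vary.

```python
import itertools
import random

# Hypothetical pools; only "a 1st-year bachelor" and "Singular matrices"
# appear in the prompt above, the rest are made up for illustration.
AUDIENCES = ["a 1st-year bachelor", "a bootcamp student", "a PhD candidate"]
TOPICS = ["Singular matrices", "Binary search", "Hash tables"]

TEMPLATE = (
    "Write an extract from a Computer Science textbook for {audience}.\n"
    "This is an extract from the middle of the following topic: {topic}."
)

# Cross the two pools so every audience sees every topic exactly once.
combos = list(itertools.product(AUDIENCES, TOPICS))
rng = random.Random(42)
rng.shuffle(combos)  # shuffle so generation order is not grouped by audience

prompts = [TEMPLATE.format(audience=a, topic=t) for a, t in combos]
for p in prompts[:3]:
    print(p, end="\n\n")
```

Crossing the pools guarantees coverage of every combination, where random sampling only covers them on average. For large pools the full product gets big quickly, so sampling from it may be the more practical option.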
Where I would actually use this in 2026
Three years on, the technique matters most in these situations:
| Situation | Why seeding helps |
|---|---|
| Data cannot leave your environment (legal, health, finance) | You must train a local model, and real labeled data is scarce |
| Niche domain or non-English language | Off-the-shelf models are weakest exactly where your data is thinnest |
| Training small, cheap classifiers | A seeded synthetic set plus a small model often beats calling a frontier model per request |
| Building eval sets | Diverse synthetic cases stress-test your pipeline better than temperature-sampled ones |
The failure mode to watch: your entity pools become the new ceiling on diversity. If all your seeded topics come from one brainstorm session, the dataset inherits that session's blind spots. I generate the pools with an LLM, prune them by hand, and keep growing them as I find gaps.
If you are deciding whether to fine-tune on synthetic data at all, I wrote up the decision framework in fine-tuning vs RAG vs prompting. And if you are generating data, measure the diversity: embed a sample of outputs and look at the pairwise cosine similarity. If the mean is creeping toward the high 0.8s, your dataset is rhyming, and it is time to widen the pools.
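The diversity check can be sketched in a few lines of plain Python. The embeddings here are toy 3-dimensional vectors I made up; in practice they would come from whatever embedding model you already use, and the helper names are my own.

```python
import itertools
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over every unordered pair of embeddings."""
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Toy vectors standing in for real embeddings of generated samples.
rhyming = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print("rhyming:", round(mean_pairwise_similarity(rhyming), 3))
print("varied:", round(mean_pairwise_similarity(varied), 3))
```

The number of pairs grows quadratically with the sample size, so it makes sense to run this on a random sample of outputs instead of the full dataset.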

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.