Bias in Prompting: Exemplar Distribution and Order Effects, Folarin Akinloye

There are two kinds of bias people talk about with LLMs. One is the big societal kind baked into the training data, which prompting alone will not fix. The other is smaller, sneakier, and entirely your fault: bias you introduce through how you write your prompt. The examples you pick, how many of each, and what order you put them in can all nudge the model toward an answer before it has really looked at the input.

This post is about that second kind, because it is the one you control and the one that quietly wrecks classification tasks.

Two levers that skew few-shot prompts

When you do few-shot prompting, you hand the model a few labeled examples and then the real input. Those examples are not neutral. Two properties of them can bias the output: how the labels are distributed, and what order they come in.

Neither is obvious, and both are easy to test. Let me walk through each with the sentiment-classification setup from the Prompt Engineering Guide, because it makes the effect visible.

Distribution of examples

The question: if most of your examples share one label, does the model drift toward that label?

Sometimes it does not. Take a prompt skewed toward positive examples and a clearly negative final input:

Q: I just got the best news ever!
A: Positive
 
Q: We just got a raise at work!
A: Positive
 
... (several more positive examples) ...
 
Q: The weather outside is so gloomy.
A: Negative
 
Q: I just got some terrible news.
A: Negative
 
Q: That left a sour taste.
A:

The model answers "Negative", correctly, even though positive examples outnumber negative ones. For a task the model knows well, like English sentiment, its prior knowledge is strong enough to resist a skewed example set.

But push on a harder, more ambiguous input and the skew starts to bite. Feed it something genuinely unclear, like "I feel something", after a set of mostly negative examples, and it says "Negative". Flip the set to mostly positive examples, ask the exact same ambiguous sentence, and it says "Positive". The input did not change. The distribution of your examples decided the answer.

That is the tell. When the model has strong priors, it ignores your skew. When it is uncertain, it leans on whatever your examples over-represent. The advice that falls out: balance your labels. Roughly equal numbers per class, so the example set is not silently voting.
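Here is a minimal sketch of one way to enforce that balance, assuming your examples are (text, label) tuples like the ones in this post. The helper name balance_examples and the choice to downsample to the rarest label are my own assumptions for illustration, not something from the Prompt Engineering Guide. You could just as well add examples to the smaller class instead.

```python
import random
from collections import Counter, defaultdict


def balance_examples(examples, seed=0):
    """Downsample every label to the size of the rarest label."""
    rng = random.Random(seed)  # seeded so the same build picks the same examples
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    per_class = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, per_class))
    return balanced


# A set skewed 4-to-2 toward Positive.
skewed = [
    ("I just got the best news ever!", "Positive"),
    ("We just got a raise at work!", "Positive"),
    ("I had a great day today!", "Positive"),
    ("The food here is delicious!", "Positive"),
    ("The weather outside is so gloomy.", "Negative"),
    ("I just got some terrible news.", "Negative"),
]

balanced = balance_examples(skewed)
label_counts = Counter(label for _, label in balanced)
print(label_counts)  # two examples per label
```

Note that the result comes back grouped by label, so it still needs its order mixed before it goes into a prompt.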

Order of examples

Same idea, different lever. Even with a balanced set, the order can bias things, and the effect gets worse when the labels are also skewed. If you stack all the positive examples first and all the negative ones last, the model can pick up on the sequence itself as a signal rather than treating each example independently.

The fix is boring and effective: shuffle. Randomly interleave the labels instead of grouping them. Do not put all of one class together. It costs you nothing and removes a whole category of accidental bias.

import random
 
examples = [
    ("The food here is delicious!", "Positive"),
    ("I'm so tired of this coursework.", "Negative"),
    ("I had a great day today!", "Positive"),
    ("The service here is terrible.", "Negative"),
    ("I'm so blessed to have such an amazing family.", "Positive"),
    ("This meal tastes awful.", "Negative"),
]
 
random.shuffle(examples)  # break up any positional pattern
 
prompt = "\n\n".join(f"Q: {text}\nA: {label}" for text, label in examples)
prompt += "\n\nQ: I feel something.\nA:"

Shuffling on every build, or at least ensuring a mixed order, keeps the model from reading structure into your example list that you never intended to put there.
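A plain shuffle can still land on a grouped order by chance, especially with only a handful of examples. If you want the "at least ensuring a mixed order" guarantee, one option is to reshuffle until no label repeats too many times in a row. This is a sketch under my own assumptions: the names longest_run and shuffle_mixed, the max_run threshold of 2, and the retry limit are all hypothetical choices, not a standard recipe.

```python
import random


def longest_run(labels):
    """Length of the longest streak of identical consecutive labels."""
    longest = current = 0
    previous = object()  # sentinel that equals no real label
    for label in labels:
        current = current + 1 if label == previous else 1
        longest = max(longest, current)
        previous = label
    return longest


def shuffle_mixed(examples, max_run=2, seed=0, attempts=1000):
    """Reshuffle until no label appears more than max_run times in a row."""
    rng = random.Random(seed)
    candidate = list(examples)  # copy so the caller's list is left alone
    for _ in range(attempts):
        rng.shuffle(candidate)
        if longest_run([label for _, label in candidate]) <= max_run:
            return candidate
    raise ValueError("no ordering met max_run; raise max_run or rebalance")


examples = [
    ("The food here is delicious!", "Positive"),
    ("I'm so tired of this coursework.", "Negative"),
    ("I had a great day today!", "Positive"),
    ("The service here is terrible.", "Negative"),
    ("I'm so blessed to have such an amazing family.", "Positive"),
    ("This meal tastes awful.", "Negative"),
]

mixed = shuffle_mixed(examples)
for text, label in mixed:
    print(label, "|", text)
```

The fixed seed keeps a given build reproducible. Drop it, or vary it per build, if you would rather average the order effect out across runs. On a heavily skewed set the constraint may be impossible to meet, which is why the function raises instead of looping forever.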

How to actually test for it

Do not trust your gut on whether a prompt is biased. Measure it. The test is simple and you should run it on any classification prompt before shipping.

Take a set of ambiguous inputs, the borderline cases where the model could reasonably go either way. Run them through several versions of your prompt: balanced examples, skewed toward each label, and a few different orderings. If the answer on the same ambiguous input flips depending on your example set, you have found bias, and the amount it flips is how much bias you have.

# Pseudocode for the test loop: prompt_variants, build_prompt, and classify are placeholders you supply
ambiguous_inputs = ["I feel something.", "It was fine, I guess.", "Not the worst."]
 
for variant_name, example_set in prompt_variants.items():
    for text in ambiguous_inputs:
        label = classify(build_prompt(example_set, text))
        print(variant_name, "|", text, "->", label)
# If the same input gets different labels across variants, that gap is your bias.

This is really just evaluation applied to your prompt design. A tiny labeled set of hard cases and a loop over prompt variants tells you more than any amount of staring at the wording.
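To put a number on "the amount it flips", here is a runnable sketch of that loop. Everything in it is an assumption for illustration: the two skewed variants, the flip_rate metric (the share of inputs that get more than one distinct label across variants), and above all classify, which is a toy stand-in that simply echoes the most common example label so the script runs without a model. Swap it for a real call to whatever model you use.

```python
from collections import Counter

POSITIVE = [
    ("I just got the best news ever!", "Positive"),
    ("We just got a raise at work!", "Positive"),
    ("I had a great day today!", "Positive"),
]
NEGATIVE = [
    ("The weather outside is so gloomy.", "Negative"),
    ("I just got some terrible news.", "Negative"),
    ("The service here is terrible.", "Negative"),
]

# Hypothetical variants: one skewed toward each label.
prompt_variants = {
    "mostly_positive": POSITIVE + NEGATIVE[:1],
    "mostly_negative": POSITIVE[:1] + NEGATIVE,
}


def build_prompt(example_set, text):
    """Format the examples as Q/A pairs and append the unanswered input."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in example_set)
    return f"{shots}\n\nQ: {text}\nA:"


def classify(prompt):
    """Toy stand-in for a model call: echoes the most common example label."""
    labels = [line[3:] for line in prompt.splitlines() if line.startswith("A: ")]
    return Counter(labels).most_common(1)[0][0]


def flip_rate(variants, inputs, classify_fn):
    """Fraction of inputs that receive more than one distinct label across variants."""
    flipped = 0
    for text in inputs:
        seen = {classify_fn(build_prompt(ex, text)) for ex in variants.values()}
        if len(seen) > 1:
            flipped += 1
    return flipped / len(inputs)


ambiguous_inputs = ["I feel something.", "It was fine, I guess.", "Not the worst."]
rate = flip_rate(prompt_variants, ambiguous_inputs, classify)
print(f"flip rate: {rate:.0%}")
```

With the toy classifier every input flips by construction, so the printed number only means something once classify talks to a real model. Adding a balanced variant and a few alternative orderings to prompt_variants extends the same loop to cover both levers.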

What prompting can and cannot fix

Balancing and shuffling fix the bias you introduce. They do not touch the deeper bias in the model itself: the stereotypes and skews learned from training data. That kind can produce harmful output on the wrong task, and no amount of example-shuffling removes it. For that you need heavier machinery: content moderation, filtering, sometimes fine-tuning, and a clear sense of where your application could cause harm if the model's priors leak through.

So treat this as two jobs. First, do not add bias yourself: balance the labels, mix the order, and test on ambiguous inputs. That part is squarely in your control and takes minutes. Second, know that the model brings its own biases to the table, and prompting is not the tool for those.

If you are building few-shot prompts at all, the mechanics of choosing and formatting examples matter a lot, and I covered those in zero-shot vs few-shot prompting and the anatomy of a good prompt. Balancing and shuffling are just the reliability layer on top of getting the examples right.

