Zero-Shot vs Few-Shot Prompting (When Examples Help), Folarin Akinloye

Examples are often the most effective context you can add to a prompt. Show the model two or three input-output pairs and it will usually pattern-match what you want better than a paragraph of rules could. This is few-shot prompting, and it is the first thing I reach for when a plain instruction falls short on format or labeling. But it has limits, and knowing where it stops working saves you from throwing examples at a problem that needs something else. This is part four of Prompt Engineering, Properly, following The anatomy of a good prompt.

Zero-shot: just ask

Zero-shot means no examples in the prompt. You instruct the model directly and rely on what it already learned during training. Modern instruction-tuned models are good at this for common tasks, because instruction tuning and reinforcement learning from human feedback (RLHF) trained them to follow instructions without demonstrations.

Prompt:

Classify the text into neutral, negative, or positive.

Text: I think the vacation is okay.
Sentiment:

Output:

Neutral

No examples, correct answer. The model already understands "sentiment". For a well-known task with an unambiguous output, zero-shot is the right starting point: it is the cheapest prompt, the shortest, and often all you need. Always try it first. Reach for examples only when it falls short.
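If you assemble prompts in code, the zero-shot case is just an instruction plus the input. Here is a minimal sketch; `build_zero_shot_prompt` is a name I made up for illustration, and the call that sends the prompt to a model is deliberately left out because it depends on your provider.

```python
def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Build a classification prompt with an instruction and no examples.

    Hypothetical helper for illustration; assumes at least two labels.
    """
    # Render the labels as "neutral, negative, or positive".
    label_list = ", ".join(labels[:-1]) + f", or {labels[-1]}"
    return (
        f"Classify the text into {label_list}.\n\n"
        f"Text: {text}\n"
        "Sentiment:"
    )


prompt = build_zero_shot_prompt(
    "I think the vacation is okay.",
    ["neutral", "negative", "positive"],
)
print(prompt)
```

The prompt ends on the open `Sentiment:` slot, so the most natural continuation for the model is a single label.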

Few-shot: show, don't tell

When zero-shot misses, add demonstrations. This is in-context learning: the examples condition the model on exactly the behavior you want for the next input. The classic illustration, from Brown et al. (2020), is teaching the model to use a made-up word:

Prompt:

A "whatpu" is a small, furry animal native to Tanzania. An example of a
sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.

To do a "farduddle" means to jump up and down really fast. An example of
a sentence that uses the word farduddle is:

Output:

When we won the game, we all started to farduddle in celebration.

One example was enough for the model to pick up the pattern. For harder tasks you scale up: 3-shot, 5-shot, more. The examples pin down three things at once: the label set, the output format, and how to handle the tricky cases.
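For a labeling task, the same idea can be sketched as a small prompt builder. This is an illustration under my own assumptions, not a fixed recipe: `build_few_shot_prompt` is a hypothetical helper, the `Text:` / `Sentiment:` layout is borrowed from the zero-shot prompt earlier, and the example sentences are placeholders.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Render an instruction, labeled demonstrations, and an open query.

    Hypothetical helper: every demonstration uses the same two-line layout,
    and the query repeats that layout with the label slot left empty.
    """
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f"Text: {text}\nSentiment: {label}")
    # Same layout as the demonstrations, but the model fills in the label.
    blocks.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(blocks)


instruction = "Classify the text into neutral, negative, or positive."
examples = [
    ("This is awesome!", "positive"),
    ("This is bad!", "negative"),
    ("I think the vacation is okay.", "neutral"),
]
few_shot_prompt = build_few_shot_prompt(instruction, examples, "What a horrible show!")
print(few_shot_prompt)
```

The three demonstrations show the full label set and the output format at once, which is most of what the examples are there to pin down.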

The counterintuitive part: format matters more than correctness

Here is the result that changes how you think about examples. Research on what makes demonstrations work (Min et al., 2022) found that the format and the label space matter more than whether each individual label is correct. In their tests, even examples with randomly assigned labels beat having no examples at all, as long as the format was consistent and the labels came from the real set of possible labels.

Prompt:

This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //

Output:

Negative

The labels above are scrambled, yet the model still classifies the new input correctly. The lesson is not "use wrong labels". It is about what the examples really teach: the shape of the task (here are inputs, here are the kinds of labels, here is the format) more than a lookup table of right answers. So when you build few-shot examples, get the format and the label set right first. That is where most of the lift comes from.
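If you want to check this on your own task, one way is to build two prompt variants that differ only in their labels and compare accuracy between them. The sketch below only constructs the prompts; `with_random_labels` and `render` are hypothetical helpers of mine, and the correctly labeled examples and the seed are arbitrary choices for illustration.

```python
import random


def with_random_labels(examples, label_set, seed=0):
    """Keep each input text but replace its label with a random one.

    Labels are still drawn from the real label set, so only their
    correctness changes, not the label space or the format.
    """
    rng = random.Random(seed)
    return [(text, rng.choice(label_set)) for text, _ in examples]


def render(examples, query):
    """Render demonstrations as 'text // label' with an open final slot."""
    lines = [f"{text} // {label}" for text, label in examples]
    lines.append(f"{query} //")
    return "\n".join(lines)


label_set = ["Positive", "Negative"]
gold = [
    ("This is awesome!", "Positive"),
    ("This is bad!", "Negative"),
    ("Wow that movie was rad!", "Positive"),
]
scrambled = with_random_labels(gold, label_set)

gold_prompt = render(gold, "What a horrible show!")
scrambled_prompt = render(scrambled, "What a horrible show!")
print(scrambled_prompt)
```

Run both prompts over a labeled test set with whatever model client you use. If the gap between them is small, the format and label space are carrying most of the weight on your task too.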

Tip

Spend your examples on the ambiguous cases, not the obvious ones. The model already nails "this is amazing" as positive. Use your few slots to show it the edge cases it tends to slip on: sarcasm, mixed sentiment, the input that maps to "neutral" rather than "positive". Examples are scarce context; aim them at the failures.

How many, and how to pick them

A few practical rules I follow:

Start with the fewest that work. Try zero-shot, then one example, then add more only while accuracy keeps improving. Examples cost tokens and latency, so do not pay for ten when three do the job.
Keep the format identical across every example. Consistency is doing real work, as the random-label result shows. One stray format and you weaken the pattern.
Cover the label set. If there are three possible labels, show all three, so the model knows the full space it is choosing from.
Match the distribution. If "neutral" is common in your real data, your examples should reflect that, not give every label equal weight.
Target the hard cases. As above, the examples that move accuracy are the confusable ones.
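The first and third rules can be sketched as a small loop. Everything here is an assumption for illustration: `choose_shot_count` and `missing_labels` are hypothetical helpers, `evaluate` stands in for whatever runs your prompt against a labeled dev set and returns accuracy, and the accuracy numbers in the stub are invented placeholders, not measurements.

```python
def missing_labels(examples, label_set):
    """Return the labels that no example demonstrates yet."""
    seen = {label for _, label in examples}
    return [label for label in label_set if label not in seen]


def choose_shot_count(pool, evaluate, max_shots=10):
    """Add examples one at a time and stop at the first non-improvement.

    `evaluate(examples)` is assumed to return dev-set accuracy for a
    prompt built from those examples. Stopping greedily is a simplification.
    """
    best_k, best_acc = 0, evaluate([])  # zero-shot baseline
    for k in range(1, min(max_shots, len(pool)) + 1):
        acc = evaluate(pool[:k])
        if acc <= best_acc:
            break
        best_k, best_acc = k, acc
    return best_k, best_acc


# Candidate examples, hard cases first (labels are illustrative choices).
pool = [
    ("Great, another delayed flight.", "negative"),               # sarcasm
    ("The food was good but the service was slow.", "neutral"),   # mixed
    ("Honestly better than I expected.", "positive"),
    ("It was fine.", "neutral"),
    ("This is amazing!", "positive"),
]

# Stub evaluator with made-up accuracies, indexed by number of examples.
stub_accuracy = [0.70, 0.80, 0.85, 0.90, 0.90, 0.89]


def evaluate(examples):
    return stub_accuracy[len(examples)]


k, acc = choose_shot_count(pool, evaluate)
print(k, acc)                                                       # 3 0.9
print(missing_labels(pool[:k], ["neutral", "negative", "positive"]))  # []
```

In practice accuracy is noisy, so you may want to require a minimum gain or average over a few runs before deciding that another example is worth its tokens.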

Where few-shot breaks down

Few-shot is not a universal fix. On tasks that need real multi-step reasoning, piling on examples often does not help. The Prompt Engineering Guide shows this with an arithmetic problem: deciding whether the odd numbers in a list sum to an even number. In its example, zero-shot gets it wrong, and adding several input-output examples (just the list and the True/False answer) gets it wrong too:

Prompt:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.
... (several more examples) ...
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:

Output:

The answer is True. (wrong)

More examples did not help because the examples show the destination, not the route. The task needs intermediate steps (find the odd numbers, add them, check parity), and plain few-shot demonstrations skip straight to the answer, so the model never sees the reasoning it is supposed to do.
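To make the hidden route concrete, here are the three steps written out as code for the two lists shown above. The function name `odd_sum_steps` is mine; the point is only to show the intermediate values that an answer-only demonstration leaves out.

```python
def odd_sum_steps(numbers):
    """Return the odd numbers, their sum, and whether that sum is even."""
    odds = [n for n in numbers if n % 2 == 1]  # step 1: find the odd numbers
    total = sum(odds)                          # step 2: add them
    return odds, total, total % 2 == 0         # step 3: check parity


for group in ([4, 8, 9, 15, 12, 2, 1], [15, 32, 5, 13, 82, 7, 1]):
    odds, total, is_even = odd_sum_steps(group)
    print(f"odd numbers {odds} sum to {total}, so the answer is {is_even}")

# odd numbers [9, 15, 1] sum to 25, so the answer is False
# odd numbers [15, 5, 13, 7, 1] sum to 41, so the answer is False
```

For the final list the odd numbers sum to 41, so the correct answer is False, and none of that working appears anywhere in the answer-only examples.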

The fix is to put the reasoning into the examples, which is exactly chain-of-thought prompting, the next part of this series. When zero-shot and few-shot both fall short on a reasoning task, that is your signal to reach for it. And if prompting still does not get you there, that is when you start thinking about fine-tuning or a more advanced technique.

Wrapping up

Try zero-shot first; it is cheapest and often enough. When it misses on format or labeling, add a few examples, keep their format identical, cover the label set, and aim them at the hard cases. When the task needs multi-step reasoning, examples alone will not carry it, and you want chain-of-thought instead.

Next in the series: Chain-of-Thought prompting, the technique for exactly the reasoning tasks where few-shot stalls. The previous part is The anatomy of a good prompt.

Note

Sources: the Prompt Engineering Guide (Zero-shot Prompting and Few-shot Prompting pages); Brown et al., 2020; Min et al., 2022.

