Active-Prompt: Stop Guessing Which Few-Shot Examples to Annotate
Use the model's own uncertainty to pick which questions are worth a human-written reasoning chain
Few-shot chain-of-thought has a quiet weakness: the examples are hand-picked, and usually by vibes. Someone writes four or five reasoned examples, drops them in the prompt, and hopes they represent the task. Active-Prompt is the answer to a sharper question: which questions are actually worth the effort of writing a careful reasoning chain for? Its answer is elegant. Let the model tell you where it is confused, and annotate those.
This is a standalone deep-dive in my prompting-techniques thread. If you have not read zero-shot vs few-shot prompting and Chain-of-Thought prompting yet, start there. Active-Prompt sits directly on top of both.
The problem with fixed exemplars
Standard few-shot CoT relies on a fixed set of human-annotated examples. You write them once and reuse them for everything in that task. Diao et al. (2023) point at the obvious flaw: those examples might not be the most effective ones for the task, and they are almost never chosen with any principle. You are spending scarce human annotation effort, arguably the most expensive thing in the whole pipeline, on questions you picked by feel.
Some questions are easy and the model nails them zero-shot. Writing a careful reasoning chain for those is wasted work. Other questions are exactly where the model wobbles, and a good annotated example there would help a lot. The trick is knowing which is which before you spend the effort.
The core idea: annotate where the model disagrees with itself
Active-Prompt borrows from active learning. Instead of picking examples up front, it measures the model's uncertainty on a pool of candidate questions and annotates only the uncertain ones.
Here is the loop from the paper:
- Query the model on training questions. For each question, sample k answers (with or without a few CoT examples to start). Because sampling is stochastic, you get a spread of answers per question.
- Compute an uncertainty score. The paper uses disagreement: how much do the k answers for a question differ from each other? High disagreement means the model is unsure.
- Select the most uncertain questions for human annotation. These are the ones where a careful reasoning chain will do the most good.
- A human writes CoT annotations for just those selected questions.
- Use the new annotated exemplars to prompt the model on the rest of the task.
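The sampling, scoring, and selection steps above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `select_for_annotation` and `fake_sampler` are hypothetical names, the sampler stands in for real model calls at a non-zero temperature, and the score is simply the number of distinct answers divided by the number of samples.

```python
from typing import Callable


def select_for_annotation(
    questions: list[str],
    sample_answers: Callable[[str, int], list[str]],
    k: int = 5,
    n: int = 2,
) -> list[tuple[str, float]]:
    """Return the n questions whose k sampled answers disagree the most."""
    scored = []
    for question in questions:
        answers = sample_answers(question, k)
        # Assumed score: distinct answers over samples taken.
        score = len(set(answers)) / len(answers)
        scored.append((question, score))
    # Highest score first; ties keep their original pool order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical stand-in for a model call: canned answers per question.
CANNED = {
    "q_easy": ["42", "42", "42", "42", "42"],
    "q_medium": ["7", "7", "8", "7", "7"],
    "q_hard": ["3", "9", "12", "3", "5"],
}


def fake_sampler(question: str, k: int) -> list[str]:
    return CANNED[question][:k]


selected = select_for_annotation(list(CANNED), fake_sampler, k=5, n=2)
print(selected)
```

With these canned answers, `q_hard` ranks first and `q_easy` never reaches the annotator. In a real run, `fake_sampler` would be replaced by whatever client you use to sample k completions and extract the final answer from each.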
The whole thing rests on one assumption that turns out to hold up well: when a model gives you five different answers to the same question, that is a strong signal it does not have a reliable way to solve it, and that is exactly where your annotation budget should go.
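The last step of the loop, turning the human-written chains into a few-shot prompt, is plain string assembly. The sketch below assumes a simple "Q: / A:" layout and a hypothetical `build_prompt` helper; the exemplar text is made up for illustration, and your own prompt format may differ.

```python
def build_prompt(exemplars: list[tuple[str, str, str]], question: str) -> str:
    """Join (question, reasoning chain, answer) exemplars and append the new question."""
    blocks = [
        f"Q: {q}\nA: {chain} The answer is {answer}."
        for q, chain, answer in exemplars
    ]
    # Leave the final answer slot open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar a human might write for a high-disagreement question.
exemplars = [
    (
        "A shop sells pens at 3 for $2. How much do 12 pens cost?",
        "12 pens is 4 groups of 3. Each group costs $2, so 4 * 2 = 8.",
        "$8",
    ),
]

prompt = build_prompt(
    exemplars, "A train travels 60 km in 45 minutes. What is its speed in km/h?"
)
print(prompt)
```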
Uncertainty, concretely
Disagreement is the metric the paper leans on, and it is easy to reason about. Ask the same question k times at a non-zero temperature and count how many distinct answers come back.
```python
from collections import Counter


def disagreement(answers: list[str]) -> float:
    """Normalized count of distinct answers among k samples. 0 = full agreement, 1 = every answer differs."""
    if len(answers) < 2:
        return 0.0  # a single sample cannot disagree with itself
    unique = len(Counter(answers))
    return (unique - 1) / (len(answers) - 1)  # normalized to [0, 1]


# The model is confident here: annotate something else.
disagreement(["42", "42", "42", "42", "42"])  # -> 0.0

# The model is all over the place: prime candidate for a human CoT example.
disagreement(["42", "17", "42", "9", "58"])  # -> 0.75
```

You rank your candidate pool by this score and annotate from the top. If you have seen self-consistency, the sampling machinery here is the same. Self-consistency samples many chains and votes to get a better answer. Active-Prompt samples many answers and reads the disagreement as a signal for where to spend human effort. Same tool, different purpose.
Disagreement is not the only option. You can also use answer entropy or, for numeric answers, the variance of the sampled answers, both of which the paper also considers. Disagreement is just the cheapest thing that works, and it needs no extra model calls beyond the samples you already took.
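Answer entropy can be read off the same samples. The sketch below is one plausible way to compute it, using a hypothetical `answer_entropy` helper and natural-log units. It shows a case where entropy is more informative than a distinct-answer count: both sample sets contain two distinct answers, but the near-even split is the more uncertain one.

```python
import math
from collections import Counter


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    k = len(answers)
    counts = Counter(answers)
    # p * log(1/p) summed over distinct answers; 0.0 when all samples agree.
    return sum((c / k) * math.log(k / c) for c in counts.values())


lopsided = ["42", "42", "42", "42", "17"]  # 4-1 split
near_even = ["42", "42", "17", "17", "17"]  # 2-3 split

print(answer_entropy(lopsided), answer_entropy(near_even))
```

A distinct-answer count scores both lists identically, while entropy ranks the 2-3 split as more uncertain than the 4-1 split.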
Why this matters beyond the paper
You will probably never implement Active-Prompt exactly. But the mindset is one of the highest-leverage things in applied prompting, and almost nobody does it deliberately.
Think about how most teams build a few-shot prompt. They eyeball the task, write some examples, ship it. Active-Prompt says: your intuition about which cases are hard is unreliable, so measure it. The model's own uncertainty is a free, honest map of where it struggles. Follow the map.
That reframes example selection as an engineering problem with a feedback loop instead of a one-time guess:
- Run your candidate questions through the model a handful of times each.
- Find the ones where it can't make up its mind.
- Those are your prompt examples, your eval hard cases, and your fine-tuning candidates, all at once.
It connects cleanly to evaluation work too. The high-disagreement set is a ready-made hard slice for your eval harness. If a change to your prompt or model calms the disagreement on those questions, that is real signal, not a vanity metric.
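Tracking that hard slice can be as simple as comparing average disagreement before and after a change. The sketch below assumes you have stored the sampled answers per question for each run; `slice_disagreement` and the sample data are hypothetical.

```python
def distinct_ratio(answers: list[str]) -> float:
    """Distinct answers over samples taken for one question."""
    return len(set(answers)) / len(answers)


def slice_disagreement(samples: dict[str, list[str]]) -> float:
    """Mean per-question disagreement across a slice of questions."""
    return sum(distinct_ratio(a) for a in samples.values()) / len(samples)


# Made-up samples for the same two hard questions under two prompt versions.
before = {
    "q1": ["3", "9", "12", "3", "5"],
    "q2": ["a", "b", "a", "c", "b"],
}
after = {
    "q1": ["3", "3", "3", "3", "5"],
    "q2": ["a", "a", "a", "a", "a"],
}

print(slice_disagreement(before), slice_disagreement(after))
```

A drop in disagreement means the model has become more consistent, not necessarily more correct, so it is worth reading this number alongside accuracy on the same slice.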
Where it fits, and where it does not
Active-Prompt shines when:
- You have a real pool of task questions to sample from.
- Human annotation is expensive enough that you want to spend it wisely.
- The task benefits from CoT in the first place (multi-step reasoning, math, structured decisions).
It is overkill when:
- You are on a reasoning model that does its own thinking. As I covered in prompting reasoning models, piling on manual CoT examples there can backfire, so uncertainty-guided CoT annotation is solving a problem you may not have.
- The task is simple enough that any reasonable examples work.
- You cannot sample multiple answers cheaply (though even k=3 or k=5 is usually plenty).
The takeaway
Active-Prompt is less a technique you install and more a habit worth stealing: don't guess which examples matter, let the model's uncertainty tell you. Sample a few answers per question, look at the disagreement, and spend your annotation effort where the model is genuinely confused. Even a rough version of this beats hand-picking examples by feel, and it doubles as a way to find the hard cases your eval suite should be watching.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.