Self-Consistency: Sampling Your Way to Better Answers
Run the same reasoning prompt several times, then take the majority answer
A single chain of thought can take one wrong turn early and confidently carry the mistake all the way to a wrong answer. Self-consistency fixes that with a simple idea: ask the same question several times, let the model reason its way to an answer each time, and take the answer that comes up most often. A wrong path tends to be wrong in its own random way, while the correct path keeps arriving at the same place. This is part six of Prompt Engineering, Properly, and it builds directly on chain-of-thought.
Why one reasoning chain is fragile
Chain-of-thought, as covered in the last part, normally uses greedy decoding: at each step the model picks the most likely next token, so you get exactly one reasoning path and it runs with it. That is efficient, but it stakes the whole answer on one trajectory. If the model makes an arithmetic slip or picks a wrong interpretation in step two, every later step inherits the error, and you get a fluent, confident, wrong result.
Self-consistency (Wang et al., 2022) replaces that single greedy pass with sampling. Instead of one path, you generate several diverse reasoning paths for the same prompt, then select the final answer that the most paths agree on. Wang et al. report that it improves on greedy chain-of-thought across arithmetic and commonsense reasoning benchmarks.
Watching it work
Take this question, which trips up a single pass:
When I was 6 my sister was half my age. Now I'm 70. How old is my sister?
A single attempt often answers 35, doing the tempting-but-wrong "half of 70". The right reasoning is: at 6, the sister was 3 (half of 6), a gap of 3 years that never changes, so at 70 she is 67.
Now run the same chain-of-thought prompt several times with sampling on, and you get a spread of reasoning paths:
Path 1: When I was 6 she was 3. The gap is 3. Now I'm 70, so she's 67. -> 67
Path 2: She was half my age, so 3. 70 - 3 = 67. -> 67
Path 3: She was half my age, so now she's 70 / 2 = 35. -> 35
Two paths land on 67, one on 35. Majority vote picks 67, the correct answer. The single bad path got outvoted. That is the whole mechanism: diversity in the reasoning, agreement on the answer.
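The vote itself is only a few lines of Python. A minimal sketch, using the three final answers from the paths above (the name path_answers is just for illustration):

```python
from collections import Counter

# Final answers extracted from the three sampled paths above.
path_answers = ["67", "67", "35"]

# Count how often each answer appears and keep the most frequent one.
votes = Counter(path_answers)
winner, count = votes.most_common(1)[0]

print(f"{winner} wins with {count} of {len(path_answers)} votes")
# prints: 67 wins with 2 of 3 votes
```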
How to implement it
The recipe is short:
- Write a chain-of-thought prompt (few-shot CoT with worked examples is the classic setup).
- Call the model N times with a non-zero temperature so the paths differ. N is often 5 to 40 depending on how much accuracy you need.
- Extract the final answer from each response.
- Return the most common answer.

from collections import Counter

def self_consistent_answer(client, prompt, n=10, temperature=0.7):
    """Sample n reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(n):
        # temperature > 0 so each call can take a different reasoning path.
        resp = client.responses.create(
            model="...", input=prompt, temperature=temperature,
        )
        # extract_final_answer is a helper you supply to parse the final answer.
        answers.append(extract_final_answer(resp.output_text))
    # Majority vote over the extracted final answers.
    return Counter(answers).most_common(1)[0][0]

Two details matter. First, temperature has to be above zero, or every sample comes back (near enough) identical and you have just paid N times for one answer. This is the one place you deliberately want the randomness I warned about in the settings post: diversity across paths is the point. Second, you need a reliable extract_final_answer, which is why pinning the answer format ("give the final answer on the last line as: ANSWER: X") pays off here. Voting only works if you can compare the final answers cleanly.
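One way to write that extractor, as a sketch: it assumes the prompt asks for a final line of the form ANSWER: X, as suggested above. The regex and the light normalization (trimming whitespace, dropping a trailing period, lowercasing) are my assumptions rather than part of the method, so adjust them to your own answer format.

```python
import re
from typing import Optional

# Matches a line like "ANSWER: 67" anywhere in the response (case-insensitive).
_ANSWER_RE = re.compile(r"^[ \t]*ANSWER:[ \t]*(.+?)[ \t]*$", re.IGNORECASE | re.MULTILINE)

def extract_final_answer(text: str) -> Optional[str]:
    """Return the normalized final answer, or None if no ANSWER line is found."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    # Use the last match, in case the model writes an ANSWER line mid-reasoning.
    answer = matches[-1].strip().rstrip(".")
    # Lowercase so that "Yes" and "yes" count as the same vote.
    return answer.lower()
```

Responses with no parsable answer come back as None. It is worth filtering those out before counting, so that parse failures cannot win the vote.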
Self-consistency only works when the final answer is something you can compare and count: a number, a label, a yes/no. For open-ended generation (an essay, a summary) there is no clean "majority answer" to take, so this technique does not apply. Save it for tasks with a checkable result.
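Because the answers are countable, the vote can also report how strongly the samples agree. A sketch, where the helper name majority_vote and the decision to drop unparsable (None) answers are my assumptions; treat the agreement share as a rough signal to log or threshold on, not a calibrated probability.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

def majority_vote(answers: Sequence[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return (most common answer, share of valid samples that agree with it)."""
    # Drop samples where no final answer could be parsed.
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    # On a tie, most_common keeps the answer that was encountered first.
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)
```

A low share (say, a bare plurality across many different answers) is a hint that the question may need a better prompt or a human look rather than more samples.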
The cost, plainly
Self-consistency multiplies your inference cost by N. Ten samples means roughly ten times the tokens and ten times the spend for that call, plus the latency of the slowest sample if you run them in parallel (or the sum if you run them in series). That is a real tax, so this is not something you turn on everywhere.
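To get the parallel latency profile described above, the samples can be fanned out with a thread pool. A sketch, assuming a hypothetical sample_once(prompt) callable that makes one model call and returns the response text; the fake_sample_once stub stands in for a real client so the example runs on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def sample_in_parallel(sample_once: Callable[[str], str], prompt: str, n: int = 10) -> List[str]:
    """Run n independent samples concurrently and return their response texts."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each task is one full model call; wall-clock time tracks the slowest one.
        futures = [pool.submit(sample_once, prompt) for _ in range(n)]
        return [f.result() for f in futures]

# Hypothetical stand-in for a real model call, used only to make this runnable.
def fake_sample_once(prompt: str) -> str:
    time.sleep(0.1)  # Simulate network and generation latency.
    return "She was 3, so the gap is 3.\nANSWER: 67"

start = time.perf_counter()
responses = sample_in_parallel(fake_sample_once, "How old is my sister?", n=5)
elapsed = time.perf_counter() - start
print(f"{len(responses)} samples in {elapsed:.2f}s")
```

Parallelism only helps latency. The token bill is still N calls' worth, and a real provider's rate limits may cap how many calls you can have in flight at once.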
Use it where a wrong answer is expensive and the task is a hard, verifiable reasoning problem: a financial calculation, a logic-heavy classification, a step the rest of a pipeline depends on. Skip it for easy questions (where a single pass is already right) and for anything latency-critical. A reasonable pattern is to reserve it for the small set of high-stakes calls rather than applying it across the board. The cost framing in Cutting LLM cost and latency applies directly: spend the extra samples only where the accuracy is worth it.
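That "reserve it for high-stakes calls" pattern can be as simple as a flag on the call site. A sketch, assuming a hypothetical ask(prompt, temperature) callable that runs one chain-of-thought pass and returns the already-extracted final answer; the name route_answer and the 0.7 default are illustrative choices, not recommendations.

```python
from collections import Counter
from typing import Callable

def route_answer(
    ask: Callable[[str, float], str],
    prompt: str,
    high_stakes: bool,
    n: int = 10,
    temperature: float = 0.7,
) -> str:
    """One greedy pass by default; n sampled passes plus a vote when it matters."""
    if not high_stakes:
        # Cheap path: a single call at temperature 0.
        return ask(prompt, 0.0)
    # Expensive path: n sampled calls, then a majority vote.
    answers = [ask(prompt, temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Keeping the decision in one function makes the extra spend easy to audit: every caller that passes high_stakes=True is a place where you chose to pay N times.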
Self-consistency is the simplest member of a family of "sample and select" methods. Tree of Thoughts, later in this series, generalizes it: instead of sampling whole reasoning chains and voting at the end, it explores and prunes multiple reasoning branches as it goes.
Wrapping up
When chain-of-thought gets a reasoning problem right most of the time but not reliably, self-consistency turns "most of the time" into "reliably" by sampling several paths and voting. The price is N times the cost, so aim it at the hard, high-stakes, checkable problems and leave it off everywhere else. Make sure your temperature is non-zero and your final-answer format is easy to parse, or the vote falls apart.
Next in the series: Generated-knowledge prompting, where instead of voting on reasoning, we have the model surface relevant facts before it answers. The previous part is Chain-of-Thought prompting.
Source: the Prompt Engineering Guide, Self-Consistency; Wang et al. 2022.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.