Chain-of-Thought Prompting
Standard and zero-shot CoT, when step-by-step reasoning actually helps, and what it costs
In the previous part we hit a wall: a reasoning problem that neither zero-shot nor few-shot could solve, no matter how many examples we added. Chain-of-thought is what gets through that wall. The idea is simple: have the model lay out its reasoning steps before it commits to an answer. On problems with multiple steps, that one change can take a prompt from reliably wrong to reliably right. This is part five of Prompt Engineering, Properly.
Why reasoning steps help
A model generates one token at a time, and each token it writes becomes part of the context for the next. When you force it to answer immediately, it has to compute a multi-step result in a single shot, with no room to work. When you let it write the steps first, those intermediate tokens become scratch space: the model literally reads its own partial work to produce the next part. It is the difference between doing a math problem in your head and doing it on paper.
That is why chain-of-thought helps on arithmetic, logic, and any multi-step decision, and why it does little for a simple lookup. The win comes from giving the model room to work through something that genuinely has steps.
Standard CoT: show the reasoning in your examples
Standard chain-of-thought (Wei et al., 2022) is few-shot prompting where the examples include the reasoning, not just the answer. Recall the odd-numbers problem from the last part, which plain few-shot prompting could not solve. Add the working to each example and the result flips:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:

Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

The only thing that changed from the failing few-shot prompt is that each example now shows the steps (which numbers are odd, what they sum to) before the True/False. That demonstration of the route, not just the destination, is what the model needed. Often a single worked example is enough.
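If you assemble prompts like this in code, it helps to keep the reasoning as its own field so it cannot be dropped by accident. Here is a minimal sketch of one way to do that; the `WorkedExample` class and `build_cot_prompt` function are names I made up for illustration, not part of any library, and the layout simply mirrors the prompt above.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    question: str
    reasoning: str  # the intermediate steps shown to the model
    answer: str


def build_cot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Format the worked examples, then the new question with an open 'A:'."""
    blocks = [
        f"{ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    blocks.append(f"{question}\nA:")
    return "\n".join(blocks)


TEMPLATE = "The odd numbers in this group add up to an even number: {}."

examples = [
    WorkedExample(
        TEMPLATE.format("4, 8, 9, 15, 12, 2, 1"),
        "Adding all the odd numbers (9, 15, 1) gives 25.",
        "False",
    ),
    WorkedExample(
        TEMPLATE.format("17, 10, 19, 4, 8, 12, 24"),
        "Adding all the odd numbers (17, 19) gives 36.",
        "True",
    ),
]

prompt = build_cot_prompt(examples, TEMPLATE.format("15, 32, 5, 13, 82, 7, 1"))
print(prompt)
```

The prompt ends on a bare `A:` so the model continues in the same pattern: reasoning first, then the verdict.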
Zero-shot CoT: the five-word fix
You do not always need examples. Zero-shot CoT (Kojima et al., 2022) is almost embarrassingly simple: add "Let's think step by step" to the prompt. That phrase alone triggers the model to lay out its reasoning.
Without it:
I went to the market and bought 10 apples. I gave 2 apples to the
neighbor and 2 to the repairman. I then bought 5 more apples and ate 1.
How many apples did I remain with?

11 apples (wrong)

With it:
... How many apples did I remain with?
Let's think step by step.

You started with 10 apples. You gave away 2 and 2, so 6 left. You bought 5 more, so 11. You ate 1, so you remain with 10 apples. (correct)

Same question, one extra line, correct answer. Zero-shot CoT is especially handy when you do not have good examples to hand. It is the first thing to try on any reasoning task: it costs almost nothing to add.
If you want a parseable answer after the reasoning, pin the final line: "Let's think step by step, then give the final answer on the last line as: ANSWER: X". You get the accuracy boost from the reasoning and a clean field to extract, instead of having to parse the whole explanation.
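A rough sketch of both halves, appending the instruction and pulling the answer back out, might look like this. The function names are hypothetical, and `sample_completion` is a hand-written stand-in for a model response, not real output. The parser assumes the model obeyed the format and returns `None` when it did not, so the caller can decide whether to retry.

```python
import re
from typing import Optional

COT_SUFFIX = (
    "Let's think step by step, then give the final answer "
    "on the last line as: ANSWER: X"
)


def with_zero_shot_cot(question: str) -> str:
    """Append the reasoning trigger and the pinned answer format."""
    return f"{question}\n{COT_SUFFIX}"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after 'ANSWER:' on the last non-empty line, else None."""
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"ANSWER:\s*(.+)", lines[-1])
    return match.group(1).strip() if match else None


# Hand-written stand-in for a model response, for illustration only.
sample_completion = (
    "You started with 10 apples. You gave away 2 and 2, so 6 left.\n"
    "You bought 5 more, so 11. You ate 1, so 10 remain.\n"
    "ANSWER: 10"
)

print(with_zero_shot_cot("How many apples did I remain with?"))
print(extract_answer(sample_completion))  # prints: 10
```

Checking only the last non-empty line is deliberate: the reasoning itself may mention intermediate numbers, and you do not want to pick one of those up by mistake.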
Auto-CoT, briefly
Hand-writing reasoning examples is work, and the examples you pick might be weak. Auto-CoT (Zhang et al., 2022) automates it: cluster your questions for diversity, then use zero-shot CoT ("Let's think step by step") to generate a reasoning chain for one representative question per cluster. The generated chains become your demonstrations. The diversity step matters because it keeps any single bad generated chain from dominating. You probably will not hand-build this for a one-off prompt, but it is worth knowing that the demonstrations themselves can be generated rather than written.
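To make the shape of that pipeline concrete, here is a toy sketch. It is not the procedure from the paper: in place of real clustering it uses a greedy farthest-point pick over word-overlap (Jaccard) distance, which is a crude stand-in that only serves to select dissimilar questions. The `generate` callable is a hypothetical placeholder for whatever model call you use, and `fake_generate` exists only so the example runs without one.

```python
from typing import Callable

TRIGGER = "Let's think step by step."


def word_set(text: str) -> frozenset[str]:
    return frozenset(text.lower().split())


def jaccard_distance(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pick_diverse(questions: list[str], k: int) -> list[str]:
    """Greedy farthest-point selection: a crude stand-in for clustering."""
    if not questions or k <= 0:
        return []
    sets = [word_set(q) for q in questions]
    chosen = [0]  # start from the first question
    while len(chosen) < min(k, len(questions)):
        # Pick the question whose nearest already-chosen neighbour is farthest.
        best = max(
            (i for i in range(len(questions)) if i not in chosen),
            key=lambda i: min(jaccard_distance(sets[i], sets[j]) for j in chosen),
        )
        chosen.append(best)
    return [questions[i] for i in chosen]


def build_demonstrations(
    questions: list[str], k: int, generate: Callable[[str], str]
) -> list[str]:
    """Generate one zero-shot CoT chain per selected question."""
    demos = []
    for q in pick_diverse(questions, k):
        chain = generate(f"Q: {q}\nA: {TRIGGER}")
        demos.append(f"Q: {q}\nA: {TRIGGER} {chain}")
    return demos


def fake_generate(prompt: str) -> str:
    # Placeholder so the sketch runs offline; a real model call goes here.
    return "<model-written reasoning would go here>"


questions = [
    "How many apples are left after giving away 2 of 10 apples?",
    "How many apples are left after giving away 3 of 12 apples?",
    "Is the sum of 9, 15 and 1 an even number?",
]

for demo in build_demonstrations(questions, k=2, generate=fake_generate):
    print(demo, end="\n\n")
```

With `k=2`, the two near-identical apple questions do not both get picked, which is the point of the diversity step. In a real setup you would also want to filter out generated chains that are obviously broken before using them as demonstrations.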
The cost, and when to skip it
Chain-of-thought is not free, and it is not always the right call.
It costs tokens and latency. The model now writes a paragraph of reasoning before the answer, which you pay for and wait for. On a high-volume or latency-sensitive path, that adds up. I covered managing that tradeoff in Cutting LLM cost and latency.
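A back-of-the-envelope calculation makes "adds up" tangible. Every number below is made up for illustration, including the price, so plug in your own traffic and your provider's actual rates. The function name is hypothetical, and it covers only the extra output tokens, not the added latency.

```python
def extra_cot_cost(
    requests_per_day: int,
    reasoning_tokens: int,
    price_per_million_output_tokens: float,
) -> float:
    """Daily cost of the additional reasoning tokens only."""
    extra_tokens = requests_per_day * reasoning_tokens
    return extra_tokens / 1_000_000 * price_per_million_output_tokens


# Illustrative numbers only: 100k requests/day, ~150 reasoning tokens each,
# and a hypothetical price of 10.00 per million output tokens.
print(extra_cot_cost(100_000, 150, 10.0))  # prints: 150.0
```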
It is overkill for simple tasks. For a classification or a lookup with no real steps, forcing visible reasoning just adds cost and can even introduce errors by overthinking something the model already had right. Use it where correctness on a multi-step problem matters more than speed.
And there is a 2026 wrinkle worth stating plainly. Modern reasoning models (the o1-style ones) already do this kind of step-by-step thinking internally, so manually appending "let's think step by step" does less than it used to, and can sometimes get in the way. How prompting changes for those models is its own topic, and a later part of this series covers it. On standard, non-reasoning models, explicit chain-of-thought is still very much worth it.
Chain-of-thought also underpins more advanced techniques later in this series: self-consistency samples many reasoning chains and votes, and Tree of Thoughts explores several reasoning paths at once. CoT is the building block they extend.
Wrapping up
When a task has real steps, let the model show its work. Use standard CoT (reasoning inside your examples) when you have good demonstrations, and zero-shot CoT ("Let's think step by step") when you do not. Skip it for simple lookups, watch the token and latency cost, and remember that the newest reasoning models already do this on their own.
Next in the series: Self-consistency, which takes chain-of-thought further by sampling several reasoning paths and voting on the answer. The previous part is Zero-shot vs few-shot prompting.
Source: the Prompt Engineering Guide, Chain-of-Thought Prompting; Wei et al. 2022, Kojima et al. 2022, Zhang et al. 2022.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.