Prompting Reasoning Models (o3, Claude thinking, Gemini 2.5): What Changes, Folarin Akinloye

Everything in this series so far has been about teaching the model how to think: show it worked examples, ask it to reason step by step, chain prompts together. Reasoning models change the deal. They already do the thinking. If you keep prompting them the old way, you often make them worse.

This is part 14 of Prompt Engineering, Properly. If you have been following along from Chain-of-Thought through optimizing prompts, treat this post as a correction layer: here is what to unlearn when the model on the other end is a reasoning model.

What a reasoning model actually is#

A reasoning LLM (sometimes "large reasoning model", or LRM) is a model trained to do native chain-of-thought before it answers. It spends extra tokens thinking privately, then gives you a response. o3 and o4-mini, Gemini 2.5 Pro, and Claude with thinking turned on are the ones most people reach for.

The mental model that helps: a standard chat model is a next-token predictor that will happily answer in one pass. A reasoning model is trained to plan, deduce, and check itself before it commits. You are no longer the one supplying the reasoning scaffold. The model brought its own.

That single fact is why so much of the advice from earlier in this series flips.

The big one: stop writing manual chain-of-thought#

For a plain chat model, "let's think step by step" and a few reasoned exemplars are gold. For a reasoning model, spelling out the steps can actively hurt.

There is now research pointing at this directly. A 2025 paper, "The Pitfalls of Reasoning for Instruction Following in LLMs", found that forcing explicit chain-of-thought on reasoning models can degrade how well they follow your actual instructions. The model spends its reasoning budget elaborating on your steps instead of doing the task the way you asked.

So the default changes:

Chat model: tell it how to think. Give it the steps.
Reasoning model: tell it what you want. Trust it to find the steps.

Warning

"Think step by step" is not a free addition on a reasoning model. If instruction-following matters (format, constraints, structure), manual CoT can cost you accuracy on exactly the part you care about.

If you do hit a case where the model reasons badly, the paper's suggested fixes are worth knowing: a few carefully chosen few-shot examples, letting the model self-reflect and revise, or letting it decide for itself whether a problem needs heavy reasoning at all.

Be explicit, not procedural#

Dropping step-by-step instructions does not mean writing vague prompts. It means changing what you are explicit about.

You still give clear, direct instructions: the goal, the constraints, the output format, anything the model would otherwise have to guess. What you drop is the internal procedure. Think of it as the difference between briefing a senior engineer and micromanaging a junior one. You tell the senior engineer the outcome and the constraints. You do not tell them which functions to write first.

A weak reasoning-model prompt:

Let's solve this step by step. First, identify the variables.
Then set up the equation. Then solve for x. Then check your work.
Question: ...

A stronger one:

Solve for x. Show the final equation you used and the numeric answer.
If the problem is underspecified, say so instead of guessing.
Question: ...

The second prompt is explicit about the output (final equation, numeric answer) and about behavior (flag underspecification), but it leaves the reasoning to the model.

Dial thinking effort like a real parameter#

Reasoning quality scales with thinking time, which is just compute spent before answering. This is "inference-time scaling" or "test-time compute", and most reasoning models expose it as a knob:

Effort	Use it when
`low`	Simple tasks, cost- and latency-sensitive paths, high call volume
`medium`	Default balance of accuracy and speed
`high`	Hard reasoning: tricky math, multi-step planning, gnarly debugging

More thinking usually means better answers, but also more tokens, higher cost, and higher latency. Do not default everything to high. Start low and climb only when the output tells you to.

Hybrid models: start with thinking off#

A lot of current models are hybrid: one model that can run in normal mode or switch reasoning on. Claude is the obvious example. The workflow that has worked well for me:

Run in standard mode, thinking off. Look at the answer. A manual CoT prompt is fine here, because in this mode it behaves like a chat model.
If the answers are shallow or wrong, and the task genuinely needs deeper analysis, turn thinking on at low effort.
Still not enough? Move to medium, then high.
If the problem is format or style rather than depth, reach for a few-shot example instead of more thinking.

The point is to treat reasoning as something you earn your way into, not the starting position. Reasoning is slower and pricier, so you want evidence that the task needs it.

Structure inputs and outputs#

This part carries over from normal prompting, and it matters even more when the model is doing a lot of internal work. Wrap your inputs in delimiters so the model can tell instructions from data. Ask for structured output when you are wiring the model into an app.

One field note worth repeating: most reasoning models handle both JSON and XML well, and XML is a solid default for structuring generated content unless you specifically need JSON. Also, the output format tends to mirror your prompt format. If you write the prompt in Markdown, models like Claude 4 lean toward Markdown output. So format the prompt the way you want the answer to look.

Where reasoning models earn their cost#

You do not need reasoning everywhere. Apply separation of concerns: reach for a reasoning model on the reasoning-heavy parts of your system and use cheaper models for the rest. The patterns where they clearly pay off:

Planning inside agents. A reasoning model is good at breaking a fuzzy task into a plan before an orchestrator runs it. This is the backbone of deep-research-style agents.
Agentic RAG. Routing complex queries, deciding which source or tool to hit, and reasoning over messy knowledge bases. If you are building RAG, this connects straight to context engineering for agents.
LLM-as-a-judge. Evaluating and critiquing other outputs, then feeding that critique back into a meta-prompt to improve the base prompt. This pairs naturally with the ideas in evaluating RAG.
Visual reasoning. Models like o3 can reason with images inside their chain-of-thought, even zooming or cropping with tools mid-reasoning.
Big, ambiguous, technical work. Debugging large codebases, literature synthesis, scientific math, data validation.

The limitations that will bite you#

Reasoning models are not magic, and the failure modes are specific:

Overthinking and underthinking. Poorly prompted, they either spin forever or bail too early. The fix is being very specific about the task and the expected output, or routing only the hard subtasks to the reasoning model.
Reasoning can hurt instruction-following. Covered above, but it is the one people trip on most.
Cost and latency. Meaningfully higher than chat models. Track token usage, watch for inconsistent outputs inflating your bill, and lean on streaming to improve perceived latency. If you care about the money side, cutting LLM cost and latency goes deeper.
Shaky tool calling. Some reasoning models still handle parallel or multi-tool calling poorly unless they were trained for it. Do not assume strong agentic behavior comes free with strong reasoning.

The rule I follow: optimize for accuracy first, then claw back latency and cost once the quality is where you need it.

What to take away#

Prompting a reasoning model is mostly about doing less of the wrong thing. Drop the manual step-by-step. Be explicit about outcomes and constraints, not procedure. Treat thinking effort as a real dial and start low. On hybrid models, earn your way into reasoning rather than starting there. And keep reasoning models on the parts of your system that actually reason, not everywhere.

That is the last of the core reasoning-and-agent techniques in this series. If you want to go back to fundamentals, the anatomy of a good prompt still applies, reasoning model or not. The difference is which parts of that anatomy you lean on.

What a reasoning model actually is#

That single fact is why so much of the advice from earlier in this series flips.

The big one: stop writing manual chain-of-thought#

For a plain chat model, "let's think step by step" and a few reasoned exemplars are gold. For a reasoning model, spelling out the steps can actively hurt.

So the default changes:

Chat model: tell it how to think. Give it the steps.
Reasoning model: tell it what you want. Trust it to find the steps.

Warning

Be explicit, not procedural#

Dropping step-by-step instructions does not mean writing vague prompts. It means changing what you are explicit about.

A weak reasoning-model prompt:

Let's solve this step by step. First, identify the variables.
Then set up the equation. Then solve for x. Then check your work.
Question: ...

A stronger one:

Solve for x. Show the final equation you used and the numeric answer.
If the problem is underspecified, say so instead of guessing.
Question: ...

The second prompt is explicit about the output (final equation, numeric answer) and about behavior (flag underspecification), but it leaves the reasoning to the model.

Dial thinking effort like a real parameter#

Reasoning quality scales with thinking time, which is just compute spent before answering. This is "inference-time scaling" or "test-time compute", and most reasoning models expose it as a knob:

Effort	Use it when
`low`	Simple tasks, cost- and latency-sensitive paths, high call volume
`medium`	Default balance of accuracy and speed
`high`	Hard reasoning: tricky math, multi-step planning, gnarly debugging

More thinking usually means better answers, but also more tokens, higher cost, and higher latency. Do not default everything to high. Start low and climb only when the output tells you to.

Hybrid models: start with thinking off#

A lot of current models are hybrid: one model that can run in normal mode or switch reasoning on. Claude is the obvious example. The workflow that has worked well for me:

Run in standard mode, thinking off. Look at the answer. A manual CoT prompt is fine here, because in this mode it behaves like a chat model.
If the answers are shallow or wrong, and the task genuinely needs deeper analysis, turn thinking on at low effort.
Still not enough? Move to medium, then high.
If the problem is format or style rather than depth, reach for a few-shot example instead of more thinking.

The point is to treat reasoning as something you earn your way into, not the starting position. Reasoning is slower and pricier, so you want evidence that the task needs it.

Structure inputs and outputs#

Where reasoning models earn their cost#

Planning inside agents. A reasoning model is good at breaking a fuzzy task into a plan before an orchestrator runs it. This is the backbone of deep-research-style agents.
Agentic RAG. Routing complex queries, deciding which source or tool to hit, and reasoning over messy knowledge bases. If you are building RAG, this connects straight to context engineering for agents.
LLM-as-a-judge. Evaluating and critiquing other outputs, then feeding that critique back into a meta-prompt to improve the base prompt. This pairs naturally with the ideas in evaluating RAG.
Visual reasoning. Models like o3 can reason with images inside their chain-of-thought, even zooming or cropping with tools mid-reasoning.
Big, ambiguous, technical work. Debugging large codebases, literature synthesis, scientific math, data validation.

The limitations that will bite you#

Reasoning models are not magic, and the failure modes are specific:

Overthinking and underthinking. Poorly prompted, they either spin forever or bail too early. The fix is being very specific about the task and the expected output, or routing only the hard subtasks to the reasoning model.
Reasoning can hurt instruction-following. Covered above, but it is the one people trip on most.
Cost and latency. Meaningfully higher than chat models. Track token usage, watch for inconsistent outputs inflating your bill, and lean on streaming to improve perceived latency. If you care about the money side, cutting LLM cost and latency goes deeper.
Shaky tool calling. Some reasoning models still handle parallel or multi-tool calling poorly unless they were trained for it. Do not assume strong agentic behavior comes free with strong reasoning.

The rule I follow: optimize for accuracy first, then claw back latency and cost once the quality is where you need it.

Prompting Reasoning Models Is Almost the Opposite of Prompting Chat Models

What a reasoning model actually is#

The big one: stop writing manual chain-of-thought#

Be explicit, not procedural#

Dial thinking effort like a real parameter#

Hybrid models: start with thinking off#

Structure inputs and outputs#

Where reasoning models earn their cost#

The limitations that will bite you#

What to take away#

Related articles

The Anatomy of a Good Prompt

The Settings That Change Your Output

Generated-Knowledge Prompting: Surface Facts Before You Answer

Prompting Reasoning Models Is Almost the Opposite of Prompting Chat Models

What a reasoning model actually is#

The big one: stop writing manual chain-of-thought#

Be explicit, not procedural#

Dial thinking effort like a real parameter#

Hybrid models: start with thinking off#

Structure inputs and outputs#

Where reasoning models earn their cost#

The limitations that will bite you#

What to take away#

Related articles

The Anatomy of a Good Prompt

The Settings That Change Your Output

Generated-Knowledge Prompting: Surface Facts Before You Answer