LLM Reasoning Research: What Prompting Actually Elicits, Folarin Akinloye

I have spent a good chunk of this year writing about reasoning techniques one at a time: chain-of-thought, self-consistency, tree of thoughts, ReAct. This post zooms out. What does the research literature actually say about whether prompting elicits reasoning, how the techniques relate to each other, and whether any of it deserves the word "reasoning" at all? That last question has a dissent worth taking seriously, especially if you build agents.

The jumping-off point is the LLM Reasoning research page of the Prompt Engineering Guide, which collects three surveys and one pointed position paper.

The landscape: reasoning is not one thing

Sun et al. (2023) survey reasoning with foundation models, and the first useful thing they do is refuse to treat reasoning as a single capability. They split it into task families: mathematical, logical, causal, commonsense, and visual reasoning, among others, spanning multimodal models and autonomous agents. That split matters practically because technique transfer across families is weak. A prompting strategy that adds ten points on math word problems may do nothing for causal questions. When someone says "model X is better at reasoning", the first question is: which kind?

The taxonomy: how prompting elicits reasoning

Qiao et al. (2023) give the field its cleanest map. They divide the research into two branches: reasoning-enhanced strategies and knowledge-enhanced reasoning. The strategy branch splits further into three families that will feel familiar if you have followed my prompt engineering series:

Family	The idea	Examples
Prompt engineering	Shape the prompt so the model externalizes intermediate steps	Chain-of-thought, Active-Prompt, single-stage and multi-stage prompting
Process optimization	Improve or select over the reasoning process itself	Self-consistency, verifiers, calibrators
External engines	Delegate parts of the computation to tools	PAL, tool use, code execution

The knowledge branch covers injecting facts the model lacks, either from its own generations (generated-knowledge prompting) or from retrieval.

Huang et al. (2023) cut the same territory a different way: fully supervised fine-tuning on explanation datasets versus prompting-time techniques like chain-of-thought, problem decomposition, and in-context learning. Reading the two surveys together, the striking thing is how much of the field reduces to one move: make the model produce intermediate tokens, then do something smart with them (sample them, vote over them, verify them, or execute them). Nearly everything I have covered this year is a variation on that move.
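To make "produce intermediate tokens, then vote over them" concrete, here is a minimal self-consistency sketch. It assumes chains end with a line of the form `Answer: <value>`, and `sample_chain` is a hypothetical stand-in for whatever model call you use with temperature above zero. The canned chains below are made up for illustration, not model output.

```python
import re
from collections import Counter
from typing import Callable, Optional, Tuple


def extract_answer(chain: str) -> Optional[str]:
    """Pull the final answer from a chain whose last line is 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+?)\s*$", chain.strip())
    return match.group(1) if match else None


def self_consistency(
    sample_chain: Callable[[str], str], question: str, n_samples: int = 5
) -> Tuple[Optional[str], float]:
    """Sample several chains; return the majority answer and its share of all samples."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain(question))
        if answer is not None:  # chains without a parseable answer get no vote
            answers.append(answer)
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Hypothetical sampler: replays canned chains instead of calling a model.
canned = iter([
    "3 + 4 = 7, times 6 is 42. Answer: 42",
    "3 + 4 = 8, times 6 is 48. Answer: 48",
    "7 * 6 = 42. Answer: 42",
    "I am not sure.",
    "6 * 7 = 42. Answer: 42",
])
answer, share = self_consistency(lambda q: next(canned), "What is (3 + 4) * 6?")
print(answer, share)  # 42 0.6
```

The vote share is worth keeping alongside the answer: a low share is a cheap signal that the question deserves a stronger check than voting.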

The dissent: universal approximate retrieval

Now the uncomfortable part. Kambhampati (2024) wrote a position paper asking directly: can LLMs reason and plan? His conclusion, quoted in full because paraphrase softens it:

To summarize, nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood. What they do instead, armed with web-scale training, is a form of universal approximate retrieval, which, as I have argued, can sometimes be mistaken for reasoning capabilities.

"Universal approximate retrieval" is a genuinely useful lens even if you reject the conclusion. It predicts real failure patterns: performance that collapses when a problem is superficially reworded away from training distributions, plans that look coherent but violate constraints no planner would violate, and confidence uncorrelated with validity. Anyone who has watched an agent produce a beautiful, subtly impossible plan has felt this.
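The rewording failure pattern is easy to measure. The sketch below is one way to do it, under assumptions of my own: each eval item carries an `original` phrasing, a hand-written `reworded` phrasing, and an exact-match `answer`, and `solve` is a hypothetical function wrapping your model. The toy solver here is a lookup table, which is the extreme case of retrieval without reasoning.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def accuracy(solve: Callable[[str], str], items: List[Item], key: str) -> float:
    """Exact-match accuracy when each item is asked using the wording stored under `key`."""
    correct = sum(1 for item in items if solve(item[key]).strip() == item["answer"])
    return correct / len(items)


def rewording_gap(solve: Callable[[str], str], items: List[Item]) -> float:
    """Accuracy on the original wording minus accuracy on the reworded version."""
    return accuracy(solve, items, "original") - accuracy(solve, items, "reworded")


# Made-up eval items for illustration only.
items = [
    {"original": "What is 12 * 12?", "reworded": "Multiply twelve by itself.", "answer": "144"},
    {"original": "What is 9 + 10?", "reworded": "Add nine and ten.", "answer": "19"},
]

# Toy solver that has "memorized" both original phrasings and one rewording.
memory = {
    "What is 12 * 12?": "144",
    "What is 9 + 10?": "19",
    "Add nine and ten.": "19",
}
gap = rewording_gap(lambda q: memory.get(q, "unknown"), items)
print(gap)  # 0.5
```

A gap near zero does not prove reasoning, but a large gap is hard to explain any other way than sensitivity to surface form.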

My position, for what it is worth: the debate matters less than its engineering corollary. If reasoning might be approximate retrieval, you do not trust chains of thought as proofs. You treat them as proposals and verify externally. That is why the techniques that hold up best in production are the ones with a verifier bolted on: self-consistency votes across samples, PAL executes the reasoning as code, agent loops check outcomes against the environment. The research debate and good engineering practice converge on the same answer: verify, do not believe.
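As a rough illustration of "treat chains as proposals and verify externally", here is a PAL-style check: the model's reasoning arrives as a small program, the interpreter computes the result, and the answer the model claimed in prose is compared against it. The names `run_generated_program` and `verify_claim` are mine, and the program string is a hand-written stand-in for model output. Note the assumption in the comments: `exec` with stripped builtins is not a security sandbox, so real use needs process-level isolation.

```python
from typing import Any


def run_generated_program(source: str, result_name: str = "answer") -> Any:
    """Execute model-written code and return the variable named `result_name`.

    Assumption: the code is trusted enough to run here. Removing builtins
    limits accidents but is NOT a sandbox against hostile code.
    """
    namespace: dict = {"__builtins__": {}}
    exec(source, namespace)
    return namespace[result_name]


def verify_claim(claimed: float, source: str, tol: float = 1e-9) -> bool:
    """Return True only if the program runs and reproduces the claimed answer."""
    try:
        computed = float(run_generated_program(source))
    except Exception:
        # A crash, a missing result variable, or a non-numeric result all count as unverified.
        return False
    return abs(computed - claimed) <= tol


# Stand-in for a program a model might emit for a word problem.
program = """
apples_start = 23
apples_used = 20
apples_bought = 6
answer = apples_start - apples_used + apples_bought
"""

print(verify_claim(9, program))   # True
print(verify_claim(27, program))  # False
```

The useful property is the asymmetry: a fluent but wrong chain fails the check, while a correct one costs only one extra execution.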

Note: The debate has softened but not closed in the reasoning-model era. Models trained with reinforcement learning to produce long deliberation traces (the o-series, R1 lineage, and their descendants) score far better on reasoning benchmarks, and skeptics respond that better approximate retrieval over reasoning traces is still retrieval. The engineering corollary survives either way.

What this means for how you prompt

Reading the literature changed a few of my defaults:

Match the technique to the reasoning type. Decomposition helps math, retrieval helps knowledge-heavy causal questions, code execution helps anything computable. The surveys' taxonomies are effectively a routing table.
Spend on verification, not just elicitation. A few extra sampled chains plus a majority vote often beat a longer, fancier single prompt. Process optimization is the underrated branch of the taxonomy.
Test out-of-distribution rewording. If your eval set phrases problems the way the internet does, you are measuring retrieval. Paraphrase your evals aggressively and watch what survives.
For agents, plans are hypotheses. Validate preconditions in the environment before acting on any step. Kambhampati's framing is basically a design requirement for autonomous systems.
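The last default, plans as hypotheses, can be sketched as a precondition check that runs before any step executes. This is a simplified model of my own rather than anything from the cited papers: the state is a set of fact strings, and each hypothetical `Step` declares what it requires, adds, and removes. A real agent would query the environment for those facts instead of trusting a snapshot.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    requires: FrozenSet[str]
    adds: FrozenSet[str] = frozenset()
    removes: FrozenSet[str] = frozenset()


def first_invalid_step(
    plan: List[Step], state: Set[str]
) -> Optional[Tuple[int, Set[str]]]:
    """Simulate the plan from `state`.

    Returns (index, missing facts) for the first step whose preconditions
    do not hold, or None if every step is applicable in order.
    """
    current = set(state)
    for index, step in enumerate(plan):
        missing = set(step.requires) - current
        if missing:
            return index, missing
        current = (current - step.removes) | step.adds
    return None


state = {"repo_cloned"}
install = Step("install deps", frozenset({"repo_cloned"}), adds=frozenset({"deps_installed"}))
run_tests = Step(
    "run tests", frozenset({"repo_cloned", "deps_installed"}), adds=frozenset({"tests_passed"})
)
deploy = Step("deploy", frozenset({"tests_passed"}))

# A plan that reads fine but runs the tests before their dependency exists.
print(first_invalid_step([run_tests, install, deploy], state))  # (0, {'deps_installed'})
print(first_invalid_step([install, run_tests, deploy], state))  # None
```

Returning the missing facts, not just a boolean, gives the agent something specific to feed back into the next planning attempt.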

And a reminder that this whole area moved under our feet: with models that reason internally by default, much of the classic elicitation layer is being absorbed into training. What to do differently when you prompt those models is its own topic, and I covered it in prompting reasoning models.

The literature's honest summary is this: prompting reliably improves performance on reasoning tasks, nobody fully agrees on why, and the systems that work in production are the ones designed to not need to know. Build for that.
