Automatic Reasoning and Tool-use (ART), Explained for Engineers
By Folarin Akinloye

ART is a 2023 idea that looks a lot like the agents everyone builds now. It let a frozen language model write its own step-by-step reasoning, stop when it needed a calculator or a search, run the tool, paste the result back, and keep going. No fine-tuning, no hand-scripted tool orchestration per task. If that sounds like a modern tool-calling agent, that is the point.

This is a standalone deep-dive in my prompting-techniques thread. It pairs well with the pieces on ReAct and on structured outputs and function calling, the mechanisms that made this idea practical at scale.

The problem ART was solving

By early 2023, two tricks were clearly powerful when combined: chain-of-thought (get the model to reason in steps) and tool use (let it call a calculator, a search API, or a code interpreter). The trouble was that stitching them together was manual labor. For each new task you had to hand-write worked reasoning examples and carefully script exactly when the model should stop and call a tool.

That does not scale. Every new task is a new pile of bespoke prompt engineering. Paranjape et al. (2023) proposed Automatic Reasoning and Tool-use, or ART, to automate the whole thing with a frozen LLM. Frozen matters: no gradient updates, no training run. It is all done through prompting and orchestration.

How ART works

The framework has two libraries and a runtime loop.

- A task library holds examples of multi-step reasoning and tool use, written as little programs.
- A tool library holds the tools the model is allowed to call.
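As a rough illustration, both libraries can be modeled as plain data. The names here (`Demo`, `TASK_LIBRARY`, `TOOL_LIBRARY`) and the `[tool] input` step format are assumptions made for this sketch, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Demo:
    """One task-library entry: a task plus a worked program (hypothetical format)."""
    task: str
    program: List[str]  # each step is written as "[tool] input"


def calc(expression: str) -> str:
    """Toy calculator tool: sums integers written as 'a + b + c'."""
    return str(sum(int(part) for part in expression.split("+")))


# Tool library: tool name -> callable mapping a string input to a string output.
TOOL_LIBRARY: Dict[str, Callable[[str], str]] = {"calc": calc}

# Task library: demonstrations of the decompose-then-use-tools pattern.
TASK_LIBRARY: List[Demo] = [
    Demo(
        task="What is the total of 12, 30 and 8?",
        program=["[calc] 12 + 30 + 8"],
    ),
]
```

Keeping both as ordinary data structures is the point: nothing here needs a training run to change.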

Given a new task, ART does roughly this:

1. Pick related demonstrations from the task library and put them in the prompt. These show the model the pattern of decompose-then-use-tools.
2. Let the model generate its reasoning as a program for the new task.
3. Watch the generation. Whenever the model emits a tool call, pause, run the tool, splice the output back in, then resume generation.
4. Produce the final answer.
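The first step, picking related demonstrations, might be sketched like this. Word overlap is a crude stand-in for whatever similarity measure you would really use, and `pick_demos`, `build_prompt`, and the contents of `LIBRARY` are hypothetical names and data for illustration only.

```python
import re
from typing import Dict, List


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for a real similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_demos(task: str, library: Dict[str, str], k: int = 2) -> List[str]:
    """Return the k library tasks that share the most words with the new task."""
    ranked = sorted(library, key=lambda t: len(tokens(t) & tokens(task)), reverse=True)
    return ranked[:k]


def build_prompt(task: str, library: Dict[str, str], k: int = 2) -> str:
    """Place the chosen demonstrations ahead of the new task."""
    parts = [f"Task: {d}\n{library[d]}" for d in pick_demos(task, library, k)]
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# Hypothetical task library: task text -> worked program.
LIBRARY = {
    "What is the sum of 12 and 30?": "Q1: Sum them. -> [calc] -> 42",
    "Who wrote Hamlet?": "Q1: Who wrote Hamlet? -> [search] -> William Shakespeare",
    "What is the capital of France?": "Q1: Capital of France? -> [search] -> Paris",
}
```

With this toy library, a question about summing ages ranks the arithmetic demonstration first, which is the behavior you want: the closest pattern goes into the prompt.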

Task: What is the sum of the ages of the last three US presidents at inauguration?
 
Reasoning (generated by the model, as a program):
  Q1: Who were the last three presidents?  -> [search]  -> Biden, Trump, Obama
  Q2: Age at inauguration for each?        -> [search]  -> 78, 70, 47
  Q3: Sum them.                            -> [calc]    -> 195
Answer: 195

The important move is that pause-run-resume cycle. The model is not hallucinating the calculator's answer: generation stops, the real tool runs, and the real result comes back before the model continues. That interleaving is what makes the output more reliable on things models are bad at, like arithmetic and fresh facts.
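Here is a minimal sketch of that cycle, assuming a made-up `[tool] input` call syntax and a scripted `fake_model` in place of a real frozen LLM. The `lookup` tool is a canned stub that returns the ages from the example above; none of these names come from the paper.

```python
import re
from typing import Callable, Dict

# Hypothetical tool-call syntax for this sketch: "[tool_name] input" ending a step.
TOOL_CALL = re.compile(r"\[(\w+)\]\s*(.+)$")


def run_art_loop(
    task: str,
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Alternate between model generation and real tool execution."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        line = generate(transcript)         # the model writes one step, then stops
        transcript += line + "\n"
        if line.startswith("Answer:"):      # final answer: end the loop
            return transcript
        match = TOOL_CALL.search(line)
        if match:                           # pause: the step contains a tool call
            name, arg = match.group(1), match.group(2)
            result = tools[name](arg)       # run the real tool
            transcript += f"-> {result}\n"  # splice the output back in, then resume
    raise RuntimeError("no final answer within max_steps")


def fake_model(transcript: str) -> str:
    """Scripted stand-in for a frozen LLM; it only reads results from the transcript."""
    results = re.findall(r"^-> (.+)$", transcript, flags=re.M)
    if len(results) == 0:
        return "Q1: Ages at inauguration? [lookup] last three presidents"
    if len(results) == 1:
        return f"Q2: Sum them. [calc] {results[0]}"
    return f"Answer: {results[1]}"


TOOLS = {
    "lookup": lambda query: "78 + 70 + 47",  # canned stub, not a real search
    "calc": lambda expr: str(sum(int(p) for p in expr.split("+"))),
}
```

Calling `run_art_loop("Sum of the ages at inauguration", fake_model, TOOLS)` returns a transcript whose last line is `Answer: 195`. The scripted model never computes that number itself; it can only copy what the `calc` tool wrote into the transcript.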

Because the demonstrations teach a general pattern rather than one task's script, ART can generalize to a new task zero-shot: decompose it and drop tool calls in the right places without a human writing a task-specific example first.

The part that aged really well

ART is extensible by editing libraries, not retraining. Two knobs:

- Fix reasoning by editing the task library. If the model decomposes a class of problems badly, you add or correct a demonstration. No fine-tuning.
- Add capability by editing the tool library. New tool, new prompt entry, done.
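To make the two knobs concrete, here is a small sketch in which both edits are ordinary data changes. The helper names (`add_tool`, `add_demo`, `render_prompt_header`) and the prompt layout are assumptions for illustration.

```python
from typing import Callable, Dict, List

tools: Dict[str, Callable[[str], str]] = {}
tool_docs: Dict[str, str] = {}
demos: List[str] = []


def add_tool(name: str, fn: Callable[[str], str], description: str) -> None:
    """Add capability: register the callable and its prompt entry."""
    tools[name] = fn
    tool_docs[name] = description


def add_demo(program: str) -> None:
    """Fix reasoning: add a worked demonstration to the task library."""
    demos.append(program)


def render_prompt_header() -> str:
    """Everything the frozen model learns about tools and patterns is plain text."""
    tool_lines = [f"[{name}]: {doc}" for name, doc in sorted(tool_docs.items())]
    return "Tools:\n" + "\n".join(tool_lines) + "\n\nExamples:\n" + "\n".join(demos)


# One new tool and one new demonstration; no weights are touched.
add_tool("upper", str.upper, "Uppercase the input string.")
add_demo("Q1: Shout the word 'art'. -> [upper] -> ART")
```

After those two calls, `render_prompt_header()` already includes the new tool and the new example, so the next prompt the model sees reflects the change.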

The paper found ART beat few-shot prompting and automatic CoT on unseen BigBench and MMLU tasks, and, when a human stepped in to fix reasoning steps, it exceeded hand-crafted CoT prompts. The reported numbers are strong: an average improvement of about 10.8 percentage points over few-shot on unseen tasks, with tool use adding roughly another 12 points on the tasks where tools help.

Note: Do not over-index on the exact percentages. The durable lesson is architectural: separate the reasoning pattern (task library) from the capabilities (tool library), and let humans improve the system by editing data instead of weights.

What it prefigured

Read the ART loop again and map it onto 2026 tooling:

ART (2023)	What we call it now
Task library of reasoning demos	Few-shot examples, system prompt patterns
Tool library	Registered tools / function schemas
Pause on tool call, run, resume	Native tool calling / the agent loop
Frozen LLM, no training	Prompt-and-orchestrate agents
Edit libraries to improve	Editing tool sets and examples, not weights

The modern version is cleaner because model providers built tool calling into the models themselves, so you do not have to parse generated pseudo-programs to know when a tool is being called. If you want the mechanics, the piece on structured outputs and function calling covers the how. And the interleaving of reasoning and acting is what ReAct formalized around the same time.
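For a feel of what "tool library as function schemas" means, here is a generic sketch. The schema shape is JSON-Schema-style and the call format (`name` plus `arguments`) is a simplification I am assuming for illustration; it is not any specific provider's wire format.

```python
import json
from typing import Any, Callable, Dict

# A tool described as data: name, description, JSON-Schema-style parameters.
ADD_SCHEMA: Dict[str, Any] = {
    "name": "add",
    "description": "Add a list of integers.",
    "parameters": {
        "type": "object",
        "properties": {"numbers": {"type": "array", "items": {"type": "integer"}}},
        "required": ["numbers"],
    },
}


def add(numbers: list) -> int:
    """The real implementation behind the schema."""
    return sum(numbers)


SCHEMAS: Dict[str, Dict[str, Any]] = {"add": ADD_SCHEMA}
REGISTRY: Dict[str, Callable[..., Any]] = {"add": add}


def dispatch(tool_call_json: str) -> str:
    """Run one structured tool call and return the result as JSON text."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call["arguments"]
    missing = [k for k in SCHEMAS[name]["parameters"]["required"] if k not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps({"result": REGISTRY[name](**args)})
```

Because the call arrives as structured data, there is no pseudo-program to parse: `dispatch('{"name": "add", "arguments": {"numbers": [78, 70, 47]}}')` looks up the function, runs it, and hands back JSON for the model to read.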

Should you use ART today?

Not the literal framework. Native tool calling and today's agent frameworks have absorbed what ART did, and they do it with less glue. But the design principles are worth stealing directly:

- Keep a curated set of reasoning demonstrations for your hard task types, and treat them as data you can improve.
- Keep your tools well-described and swappable.
- Make sure real tool output flows back into the model before it commits to an answer, rather than letting the model guess what a tool would have said.
- When the agent reasons badly, fix the examples before you reach for fine-tuning.
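The third principle is easy to check mechanically. A minimal sketch, assuming a simplified message log with made-up roles (`tool_call`, `tool_result`, `answer`) rather than any real framework's message types:

```python
from typing import Dict, List


def unanswered_tool_calls(messages: List[Dict[str, str]]) -> List[str]:
    """Return ids of tool calls that got no tool result before the final answer."""
    pending: List[str] = []
    for msg in messages:
        if msg["role"] == "tool_call":
            pending.append(msg["id"])
        elif msg["role"] == "tool_result" and msg["id"] in pending:
            pending.remove(msg["id"])
        elif msg["role"] == "answer":
            break  # anything still pending was guessed, not observed
    return pending


good_log = [
    {"role": "tool_call", "id": "c1"},
    {"role": "tool_result", "id": "c1"},
    {"role": "answer", "id": "a1"},
]
bad_log = [
    {"role": "tool_call", "id": "c1"},
    {"role": "answer", "id": "a1"},
]
```

A non-empty return value means the model answered while a tool call was still outstanding, which is the failure mode the principle warns about. Running a check like this over logged transcripts is a cheap way to find it.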

That last one is the mindset shift. A lot of agent quality problems that people try to solve with training are really task-library problems: the model was never shown the pattern you actually want. ART made that explicit three years ago, and it is still the cheapest lever most teams are not pulling.

If tool selection itself is your bottleneck (too many tools, and the model picks the wrong one), that is a retrieval problem, and it is worth reading this alongside the piece on how RAG picks the right tool instead of every tool.
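As a toy illustration of treating tool selection as retrieval, the sketch below ranks tool descriptions against the query with bag-of-words cosine similarity. A real system would more likely use embeddings; the tool names, descriptions, and `top_tools` helper are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag(text: str) -> Counter:
    """Bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_tools(query: str, descriptions: Dict[str, str], k: int = 1) -> List[str]:
    """Return the k tool names whose descriptions best match the query."""
    q = bag(query)
    ranked = sorted(descriptions, key=lambda n: cosine(q, bag(descriptions[n])), reverse=True)
    return ranked[:k]


TOOL_DESCRIPTIONS = {
    "calc": "evaluate arithmetic such as a sum or product of numbers",
    "search": "look up facts about people, places and events on the web",
    "weather": "get the current weather forecast for a city",
}
```

Only the top-ranked tools then go into the prompt, so the model chooses among a handful of relevant options instead of the whole catalog.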
