ART: Let the Model Write Its Own Tool-Using Reasoning
Automatic Reasoning and Tool-use, and why it reads like an early sketch of the agents we build today
ART is a 2023 idea that looks a lot like the agents everyone builds now. It let a frozen language model write its own step-by-step reasoning, stop when it needed a calculator or a search, run the tool, paste the result back, and keep going. No fine-tuning, no hand-scripted tool orchestration per task. If that sounds like a modern tool-calling agent, that is the point.
This is a standalone deep-dive in my prompting-techniques thread. It pairs well with the posts on ReAct and on structured outputs and function calling; the latter are the mechanisms that made this idea practical at scale.
The problem ART was solving
By early 2023, two tricks were clearly powerful when combined: chain-of-thought (get the model to reason in steps) and tool use (let it call a calculator, a search API, or a code interpreter). The trouble was that stitching them together was manual labor. For each new task you had to hand-write reasoned examples and carefully script exactly when the model should stop and call a tool.
That does not scale. Every new task is a new pile of bespoke prompt engineering. Paranjape et al. (2023) proposed Automatic Reasoning and Tool-use, or ART, to automate the whole thing with a frozen LLM. Frozen matters: no gradient updates, no training run. It is all done through prompting and orchestration.
How ART works
The framework has two libraries and a runtime loop.
- A task library holds examples of multi-step reasoning and tool use, written as little programs.
- A tool library holds the tools the model is allowed to call.
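To make the two libraries concrete, here is a minimal sketch of how I would model them as plain data. The names (`Demo`, `TASK_LIBRARY`, `TOOL_LIBRARY`) and the inline `[tool]` notation are my own assumptions for illustration, not the paper's actual format, and both tools are toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Demo:
    """One task-library entry: a solved task written as a small program."""
    task: str
    program: str  # reasoning steps with tool calls marked inline (hypothetical notation)


# Tool library: tool name -> callable that takes text and returns text.
TOOL_LIBRARY: dict[str, Callable[[str], str]] = {
    # Toy calculator: adds integers separated by '+'.
    "calc": lambda expr: str(sum(int(n) for n in expr.split("+"))),
    # Stub search: a real system would call a search API here.
    "search": lambda query: f"(stub result for: {query})",
}

# Task library: demonstrations of decompose-then-use-tools.
TASK_LIBRARY: list[Demo] = [
    Demo(
        task="What is 12 plus 30?",
        program="Q1: Add the numbers. -> [calc] 12+30 -> 42\nAnswer: 42",
    ),
]
```

Both libraries are just data. That is what makes the editing story later in this post possible.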
Given a new task, ART does roughly this:
- Pick related demonstrations from the task library and put them in the prompt. These show the model the pattern of decompose-then-use-tools.
- Let the model generate its reasoning as a program for the new task.
- Watch the generation. Whenever the model emits a tool call, pause, run the tool, splice the output back in, then resume generation.
- Produce the final answer.
```
Task: What is the sum of the ages of the last three US presidents at inauguration?
Reasoning (generated by the model, as a program):
Q1: Who were the last three presidents? -> [search] -> Biden, Trump, Obama
Q2: Age at inauguration for each? -> [search] -> 78, 70, 47
Q3: Sum them. -> [calc] -> 195
Answer: 195
```
The important move is that pause-run-resume cycle. The model is not hallucinating the calculator's answer. Generation literally stops, the real tool runs, and the real result comes back before the model continues. That interleaving is what makes the output trustworthy on things models are bad at, like arithmetic and fresh facts.
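The pause-run-resume cycle can be sketched in a few lines. This is a toy under stated assumptions, not ART's implementation: `scripted_model` is a hard-coded stand-in for a frozen LLM, the `[tool] argument` syntax is hypothetical, and `calc` only adds integers.

```python
import re

# Hypothetical tool-call syntax for this sketch: "[tool_name] argument" at the end of a step.
TOOL_CALL = re.compile(r"\[(\w+)\]\s*(.+)$")


def calc(expr: str) -> str:
    """Toy calculator: adds integers separated by '+'."""
    return str(sum(int(n) for n in expr.split("+")))


def scripted_model(context: str) -> str:
    """Stand-in for a frozen LLM: emits one step, then answers from the tool result."""
    if "->" not in context:
        return "Q1: Sum the ages. [calc] 78+70+47"
    result = context.rsplit("-> ", 1)[1].strip()  # the real tool output, not a guess
    return f"Answer: {result}"


def run_with_tools(generate, tools, task: str, max_steps: int = 10) -> str:
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        chunk = generate(context)              # the model generates until it stops
        context += chunk
        if chunk.startswith("Answer:"):
            return context
        match = TOOL_CALL.search(chunk)
        if match:
            name, arg = match.groups()
            result = tools[name](arg.strip())  # pause: run the real tool
            context += f" -> {result}"         # splice the real output back in
        context += "\n"                        # resume on a fresh line
    raise RuntimeError("model did not produce an answer")


transcript = run_with_tools(scripted_model, {"calc": calc}, "Sum the ages 78, 70 and 47.")
print(transcript)
```

The detail worth noticing is that the model never writes the number itself. It only ever reads it back from the context after the tool has run.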
Because the demonstrations teach a general pattern rather than one task's script, ART can generalize to a new task zero-shot: decompose it and drop tool calls in the right places without a human writing a task-specific example first.
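Picking related demonstrations is the step that makes this generalization work. A rough sketch follows, using plain word overlap as the similarity measure. That heuristic is my own substitute to keep the example dependency-free, not the selection method from the paper, and `select_demos` is a hypothetical helper.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set; crude on purpose."""
    return set(text.lower().replace("?", " ").split())


def select_demos(task: str, library: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k (task, program) demos whose task text shares the most words with `task`."""
    query = tokens(task)
    ranked = sorted(library, key=lambda demo: len(query & tokens(demo[0])), reverse=True)
    return ranked[:k]


library = [
    ("What is the sum of 12 and 30?", "Q1: Add. -> [calc] 12+30 -> 42\nAnswer: 42"),
    ("Translate 'cat' into French.", "Q1: Translate. -> chat\nAnswer: chat"),
    ("Who was the president in 1990?", "Q1: Look it up. -> [search] president 1990\nAnswer: ..."),
]

chosen = select_demos("What is the sum of the ages of two presidents?", library)

# The selected programs go into the prompt ahead of the new task.
prompt = "\n\n".join(program for _, program in chosen) + "\n\nTask: ..."
```

In a real system you would likely swap the overlap score for embeddings, but the shape stays the same: rank the library against the new task, keep the top few, and prepend them.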
The part that aged really well
ART is extensible by editing libraries, not retraining. Two knobs:
- Fix reasoning by editing the task library. If the model decomposes a class of problems badly, you add or correct a demonstration. No fine-tuning.
- Add capability by editing the tool library. New tool, new prompt entry, done.
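Both knobs amount to appending to a data structure. Here is a small sketch under my own assumptions: the prompt layout, the `build_prompt` helper, and the `days_between` tool are hypothetical and only there to show that nothing outside the libraries changes.

```python
from datetime import date

tools = {"calc": lambda expr: str(sum(int(n) for n in expr.split("+")))}
demos = ["Q1: Add the numbers. -> [calc] 12+30 -> 42\nAnswer: 42"]


def build_prompt(task: str) -> str:
    """Assemble the prompt from whatever is currently in the two libraries."""
    tool_lines = "\n".join(f"- {name}" for name in sorted(tools))
    return f"Tools:\n{tool_lines}\n\nExamples:\n" + "\n\n".join(demos) + f"\n\nTask: {task}\n"


before = build_prompt("How many days are in January 2024 after the 1st?")


def days_between(arg: str) -> str:
    """Toy tool: absolute day count between two ISO dates given as 'start, end'."""
    start, end = (date.fromisoformat(part.strip()) for part in arg.split(","))
    return str(abs((end - start).days))


# Knob 1: add capability with one new tool-library entry.
tools["days_between"] = days_between
# Knob 2: fix reasoning with one new task-library demonstration.
demos.append("Q1: Count the days. -> [days_between] 2024-01-01, 2024-01-31 -> 30\nAnswer: 30")

after = build_prompt("How many days are in January 2024 after the 1st?")
```

`build_prompt` did not change between the two calls. Only the data did.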
The paper found ART beat few-shot prompting and automatic CoT on unseen BigBench and MMLU tasks, and, when a human stepped in to fix reasoning steps, it exceeded hand-crafted CoT prompts. The reported numbers are strong: an average improvement of around 10.8 percentage points over few-shot on unseen tasks, with tool use adding roughly another 12 points on the tasks where tools help.
Do not over-index on the exact percentages. The durable lesson is architectural: separate the reasoning pattern (task library) from the capabilities (tool library), and let humans improve the system by editing data instead of weights.
What it prefigured
Read the ART loop again and map it onto 2026 tooling:
| ART (2023) | What we call it now |
|---|---|
| Task library of reasoning demos | Few-shot examples, system prompt patterns |
| Tool library | Registered tools / function schemas |
| Pause on tool call, run, resume | Native tool calling / the agent loop |
| Frozen LLM, no training | Prompt-and-orchestrate agents |
| Edit libraries to improve | Editing tool sets and examples, not weights |
The modern version is cleaner because the model providers built tool calling into the models themselves, so you do not have to parse generated pseudo-programs to know when a tool is being called. If you want the mechanics of that, the post on structured outputs and function calling covers the how. And the reasoning-plus-acting interleaving is exactly what ReAct formalized around the same time.
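To show why structured calls are less glue, here is a sketch of dispatching a tool call that arrives as JSON rather than as free text you have to pattern-match. The field names `tool` and `arguments` are illustrative assumptions, not any provider's actual wire format.

```python
import json

# Registered tools: name -> callable taking a dict of arguments.
TOOLS = {"calc": lambda args: str(sum(args["numbers"]))}


def dispatch(model_output: str) -> str:
    """Run one structured tool call. The JSON shape here is hypothetical."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](call["arguments"])


result = dispatch('{"tool": "calc", "arguments": {"numbers": [78, 70, 47]}}')
print(result)
```

There is no regex over generated text here. The call is either valid JSON naming a registered tool, or it fails loudly.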
Should you use ART today?
Not the literal framework. Native tool calling and today's agent frameworks have absorbed what ART did, and they do it with less glue. But the design principles are worth stealing directly:
- Keep a curated set of reasoning demonstrations for your hard task types, and treat them as data you can improve.
- Keep your tools well-described and swappable.
- Make sure real tool output flows back into the model before it commits to an answer, rather than letting the model guess what a tool would have said.
- When the agent reasons badly, fix the examples before you reach for fine-tuning.
That last one is the mindset shift. A lot of agent quality problems that people try to solve with training are really task-library problems: the model was never shown the pattern you actually want. ART made that explicit three years ago, and it is still the cheapest lever most teams are not pulling.
If tool selection itself is your bottleneck (too many tools, model picks the wrong one), that is a retrieval problem, and this post is worth reading alongside the one on how RAG picks the right tool instead of every tool.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.