The Agent Loop: Building ReAct From Scratch
Reason, act, observe, repeat. The whole engine is about 20 lines, and one counting rule explains all of it.
An agent is not a special kind of model. It is a stateless model wrapped in a loop you write. Once that clicks, the mystery falls away, and you realize the whole engine is about twenty lines of code. This post builds that engine from scratch: the loop, the memory, the exact algorithm, and the one counting rule that explains how many times the model actually runs.
I am going to lean on a worked example the whole way through, because the only way this sticks is by tracing it. Keep "add 2 and 3" and "weather in Lagos, convert to Fahrenheit, do I need a jacket?" in your head. We will count both.
The pattern: Reason, Act, Observe#
ReAct stands for Reason plus Act. The model alternates between three things:
- Reason: "I need the product of 17 and 23, so I will use the multiply tool."
- Act: it actually requests the tool call.
- Observe: your code runs the tool and feeds the result back.
Then it reasons again, with the new result in hand, and keeps going until it can answer in plain text. Here is the shape of the loop:
task -> memory
|
v
model.generate(memory, tools) <-------------+
| |
does the model want a tool? |
| | |
yes no |
| | |
v v |
run the tool return final answer |
| |
append result (observation) to memory ---------+
(stop if max_steps exceeded)Every lap around that loop is exactly one model call. That is the single most important thing to hold onto, and we will come back to it when we count.
Memory is a list of messages, each with a role#
The "memory" is not anything clever. It is a plain Python list of message objects. Each message carries a role that tells the model what kind of turn it is.
| Role | What it is |
|---|---|
system | The instructions and persona, set once at the start. "You are a helpful agent. Use tools when needed. When done, answer in plain text." |
user | The task from the human. |
assistant | What the model says: its text, its tool-call requests, or both. |
tool | The result of running a tool, the observation, fed back so the model can react to it. |
The tool role is the one that carries an observation back into the conversation. That is the channel through which the model finds out what its own action did.
Here is the rule that makes the whole thing work, and it is the piece most people have to learn the hard way:
The model is stateless. The message list is the memory. Every iteration you pass the entire list back. If something is not in the list, the model does not know about it. Full stop.
If you have built a stateless /chat HTTP endpoint, you already know this in your bones. The server holds no session, so the client re-posts the full conversation history on every request. An agent is the same pattern: a list you keep appending to and resend in full on every model call. I went deeper on what belongs in that list, and for how long, in Agent Memory: Short-Term vs Long-Term.
The subtle part: who is speaking when the model asks for a tool?#
When the model decides to call multiply(17, 23), that request gets recorded in memory as a message. What role does it have?
It is an assistant message. The model is the one speaking, so its tool request is an assistant turn, the same as if it had spoken text. And here is the nuance that trips people up: a single assistant turn can carry text and tool calls at the same time. The model sometimes thinks out loud in text and asks for a tool in the same breath.
So one lap of the loop can add two messages to memory: the assistant turn (with its tool calls), and then a separate tool message with the result. Append the assistant turn, run the tool, append the observation. Two messages, one lap.
If the function-calling mechanics here feel shaky, I wrote a primer on them in Structured Outputs and Function Calling. This post assumes you know how a model emits a tool call; here we care about the loop around it.
The algorithm: Generate, Record, Branch, Guard#
This is the engine, as pseudocode. It is genuinely about twenty lines.
memory = [system_message, user_message(task)] # seed
for step in range(1, max_steps + 1): # bounded loop
response = await model.generate(memory, tools) # 1. GENERATE
memory.append(assistant_message(response)) # 2. RECORD the assistant turn
if not response.tool_calls: # 3a. no tools -> final answer
return AgentResult(answer=response.content, steps=step)
for call in response.tool_calls: # 3b. has tools -> run each
observation = run_tool_safely(call)
memory.append(tool_message(observation, call.id))
# loop again so the model can SEE the observations
return AgentResult(answer=..., steps=max_steps, stop_reason="max_steps") # 4. guardFour moves, in order: Generate, Record, Branch, Guard.
Generate asks the model what to do next, given everything so far. Record appends that assistant turn to memory immediately. Branch is the fork: no tool calls means the model is done and its text is the answer, so you return; tool calls means you run each one and append the observations. Guard is the max_steps ceiling that stops a confused model from looping forever (more on that below).
The condition that ends the loop with a normal answer is the one line if not response.tool_calls. When the model stops asking for tools, it has nothing left to do but answer, and its text is the result.
The observation must be appended to memory before the next generate call. If you run a tool but forget to append its result, the next generate sees no record of what happened. The model then either repeats the same call forever or hallucinates a result. The entire point of the loop is that each generate sees everything that came before it.
Mapping it to real code#
In a real minimal agent the plumbing is mechanical. model.generate(memory, tools) returns a ModelResponse with .content (the text) and .tool_calls (a list). A convenience flag like response.wants_tool_call is just True when .tool_calls is non-empty, so if not response.wants_tool_call: is your final-answer branch. The memory item is a Message(role=..., content=..., tool_calls=..., tool_call_id=...). A registry dispatches a tool by name, raising for unknown names. And you return an AgentResult(answer, steps, messages, stop_reason). The only thing you write is the loop body. The hard thinking is the algorithm; the typing is wiring.
The counting rule that explains everything#
Here is the question that separates people who get the loop from people who think they do. How many times does model.generate run for a given task?
Trace "add 2 and 3" with one add tool, watching memory grow:
seed: memory = [system, user("add 2 and 3")]
generate #1 model sees [system, user]
returns: tool_call add(2, 3)
append assistant turn -> memory += assistant(tool_calls=[add(2,3)])
run add(2,3) = 5
append observation -> memory += tool("5")
memory: [system, user, assistant(add), tool("5")]
generate #2 model sees [system, user, assistant, tool("5")]
returns: "The sum is 5." (no tool calls)
RETURNTwo generate calls. Not three. Right before that final call, memory holds four messages: system, user, the assistant turn with the tool call, and the tool observation "5".
The temptation is to count three, because a two-step task like "17 times 23, then add 100" really is three. But that task has two tool steps. "Add 2 and 3" has one. The rule that fixes the confusion for good:
generate calls = tool rounds + 1
The "+1" is the final round where the model has everything it needs and just answers. It is always there for an agent that finishes normally, because the model is stateless and one-shot: it cannot both receive the last observation and answer in the same call. Seeing a result and responding to it are two separate calls.
So "add 2 and 3" is 1 tool round, 2 generate calls. "17 times 23, then plus 100" is 2 tool rounds, 3 generate calls. The +1 lap is the one people drop, and it is the most important lap, because it is where the answer actually comes from.
Tool rounds are not tool calls#
Now the refinement, because the rule above hides something. A single generate can return several tool calls at once. That is exactly why the algorithm loops over response.tool_calls instead of handling one. So we count rounds, not calls:
- A tool call is one tool invocation.
- A tool round is one
generatethat came back asking for one or more tools. You run all of them in that lap before looping. - generate calls = tool rounds + 1. Two tools in one response is still one round, so still one generate.
Whether the model can pack several tools into one round comes down to data dependency.
Independent tools can fire in one round. Ask "what is the population of France and of Japan?" with a lookup(country) tool, and the model can return both lookup("France") and lookup("Japan") in a single response. Neither needs the other's answer. That is 1 round, 2 tool calls, then 1 final answer: 2 generate calls.
Dependent tools must run in separate rounds. Ask "weather in Lagos, then convert that to Fahrenheit, then tell me if I need a jacket," with get_weather(city) and to_fahrenheit(celsius). The model literally cannot ask for to_fahrenheit(???) before it has seen the weather, because it does not have the number yet. So it must ask for the weather, see the result, then ask for the conversion. The "do I need a jacket" judgment is pure reasoning, no tool. Trace it:
generate #1 model asks get_weather("Lagos") (round 1)
run it, append the Celsius observation
generate #2 model asks to_fahrenheit(28) (round 2)
run it, append the Fahrenheit observation
generate #3 model says "yes, bring a light jacket" (final, no tool)Two tool rounds, three generate calls. The dependency forces the rounds apart.
When tools are independent, you do not just batch them, you want to run them concurrently so you are not waiting on each one in turn. That is where asyncio.gather earns its keep:
import asyncio
async def run_round(tool_calls):
# independent calls in this round, run them all at once
results = await asyncio.gather(
*(run_tool_safely(call) for call in tool_calls)
)
return resultsThis is a real performance lever, not a toy detail. A round with three independent API lookups should take as long as the slowest one, not the sum of all three. If you care about agent latency and cost, this is one of the cheapest wins available, and it sits alongside the other levers I covered in Cutting LLM Cost and Latency.
How the loop ends, and how it survives failure#
Everything above is the happy path. What separates a toy from a real agent is the two ways the loop ends and what it does when a tool breaks.
Two termination modes#
Natural termination is the model deciding it is done: an assistant message with no tool calls. That is your if not response.tool_calls: return.
Forced termination is the max_steps ceiling. Without a hard cap, a confused model can loop forever, calling tools endlessly and burning time and money. The for step in range(1, max_steps + 1) is the safety net, and when it runs out you return with stop_reason="max_steps" so the caller knows the answer is not trustworthy.
You must handle both. The line I keep in my head: agents that do not know when to stop cause more production incidents than agents that fail outright. max_steps is the week-one version of stopping. Later you add cost budgets and stuck-detection on top, but the ceiling comes first.
Error recovery is one try/except#
When a tool call fails, because the model invented a tool that does not exist, passed bad arguments, or the tool itself raised, you have two choices. A toy lets the exception bubble up and the whole loop crashes. A real agent catches it, turns the error into an observation, and feeds it back so the model can recover.
def run_tool_safely(call):
try:
return str(registry.call(call.name, call.arguments))
except Exception as e: # intentionally broad: never crash the loop
return f"Error calling {call.name}: {e}"That error string becomes a tool message exactly like a normal result. And here is the quietly magical part: the model often reads the error and fixes its next call by itself. A one-shot call cannot self-correct. An agent in a loop can. That recovery is the entire reason the loop exists.
Three edge cases worth testing explicitly:
- Unknown tool. The model calls
frobnicate. The registry raises, the error becomes an observation, the loop keeps going. - Tool raises. The exception text becomes the observation, the loop keeps going.
- Text and tool calls in one response. Tool calls win: execute them and keep looping. The text was the model thinking out loud, not the final answer.
A broad except is the right starting point, but not the end. Eventually you split errors into recoverable and fatal. A tool failing is recoverable, so you feed it back and continue. The model API call itself dying (bad key, network down) is fatal, so you raise rather than retry a broken call in a tight loop. Mature frameworks also do a graceful max_steps (one last call asking for a best guess instead of returning junk) and treat final_answer as an explicit tool rather than relying on "no tool calls means done." You do not need those on day one, but it helps to know where the road leads.
Why this is the whole game#
Strip away the frameworks and this loop is what every tool-using agent is doing underneath. Seed a message list, generate, record the turn, branch on whether it wants a tool, run tools and append observations, guard with a step ceiling, recover from failures by turning them into observations. That is it.
Once you have built it by hand, the framework abstractions stop being magic and start being conveniences, and you can reason about why an agent looped, stalled, or hallucinated by asking one question: what was in the message list right before that generate call? If you want to go up a level from here, the structural choices (a plain loop like this, a state machine, or a graph) are the subject of Graph, State Machine, or Plain Loop: How to Structure an Agent. And the discipline of deciding what actually goes into that growing message list is Context Engineering for Agents.
Build the twenty lines yourself once. After that, you are not learning agents anymore, you are just choosing how fancy to make the loop.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.