Graph, State Machine, or Plain Loop: How to Structure an Agent
Most agents start as a while loop and should stay that way. Here is how to know when you have actually outgrown it.
Every agent is a control-flow problem wearing an LLM costume. The model decides what to do next, but something has to decide how much freedom the model gets, where it can loop, when it stops, and what happens when a step fails. You have three broad ways to write that something: a plain while loop, an explicit state machine, or a full graph. The mistake I see most is reaching for the graph framework on day one because that is what the tutorials use, when a 20-line loop would have been clearer and easier to debug.
Here is how I actually decide, with the tradeoffs that matter and code for each.
The plain loop: start here, almost always
The simplest agent is a while loop. Call the model; if it asked for a tool, run the tool and feed the result back; repeat until it stops asking for tools. That is the whole ReAct pattern, and it is more capable than people give it credit for.
```python
# Assumes `model` (a chat-model client) and `tool_specs` (the tool schemas it
# is given) are defined elsewhere; `tools` maps tool names to callables.
def run_agent(user_input: str, tools: dict, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        response = model.invoke(messages, tools=tool_specs)
        messages.append(response)
        if not response.tool_calls:
            return response.content  # done, no more tools wanted
        for call in response.tool_calls:
            result = tools[call.name](**call.args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    return "Hit step limit without finishing."
```

This hands the steering wheel to the model and gets out of the way. The model picks tools, the loop executes them, the conversation grows until the model is satisfied. For a single-purpose assistant with a handful of tools, this is the right answer and you should not feel bad about it. It is easy to read, easy to log, and easy to step through in a debugger.
The max_steps guard is the one piece people forget. Without it, a confused model loops forever and burns your budget. Always cap it.
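To see the guard do its job without calling a real model, here is a minimal sketch that swaps in a scripted stand-in. `ScriptedModel`, `tool_call`, and `reply` are hypothetical helpers invented for this example, and `run_agent` takes the model as a parameter here only so the block is self-contained.

```python
from types import SimpleNamespace


class ScriptedModel:
    """Hypothetical stand-in for a chat model: replays canned responses,
    then keeps repeating the last one."""

    def __init__(self, responses):
        self.responses = responses
        self.calls = 0

    def invoke(self, messages, tools=None):
        response = self.responses[min(self.calls, len(self.responses) - 1)]
        self.calls += 1
        return response


def tool_call(name, **args):
    return SimpleNamespace(id=f"call-{name}", name=name, args=args)


def reply(content="", tool_calls=()):
    return SimpleNamespace(content=content, tool_calls=list(tool_calls))


def run_agent(model, user_input, tools, max_steps=10):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        response = model.invoke(messages, tools=list(tools))
        messages.append(response)
        if not response.tool_calls:
            return response.content
        for call in response.tool_calls:
            result = tools[call.name](**call.args)
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )
    return "Hit step limit without finishing."


tools = {"add": lambda a, b: a + b}

# A well-behaved run: one tool call, then a final answer.
good = ScriptedModel(
    [reply(tool_calls=[tool_call("add", a=2, b=3)]), reply("The sum is 5.")]
)
good_result = run_agent(good, "What is 2 + 3?", tools)

# A confused run: the model asks for the same tool on every turn.
stuck = ScriptedModel([reply(tool_calls=[tool_call("add", a=1, b=1)])])
stuck_result = run_agent(stuck, "Loop forever", tools, max_steps=4)
```

The stuck run makes exactly `max_steps` model calls and then returns the fallback string, which is the behaviour you want to pin down in a test before a real model ever gets the chance to loop.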
The loop runs out of room when you need things it cannot express cleanly: running two branches in parallel, pausing for human approval and resuming hours later, surviving a crash mid-run, or enforcing that a validation step always happens before a write. You can bolt all of that onto a loop, but each bolt-on makes it messier, and at some point you are writing a worse version of a state machine.
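As an illustration of what one such bolt-on looks like, here is a sketch of running a batch of tool calls concurrently with the standard library. The `run_tool_calls` helper and the shape of the call objects are assumptions carried over from the loop above, not part of any framework.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def run_tool_calls(tool_calls, tools, max_workers=4):
    """Run one batch of tool calls concurrently and return the tool
    messages in the same order the model requested them."""

    def run_one(call):
        try:
            content = str(tools[call.name](**call.args))
        except Exception as exc:
            # Report the failure to the model instead of crashing the run.
            content = f"Tool error: {exc!r}"
        return {"role": "tool", "tool_call_id": call.id, "content": content}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if calls finish
        # out of order.
        return list(pool.map(run_one, tool_calls))


calls = [
    SimpleNamespace(id="c1", name="square", args={"x": 3}),
    SimpleNamespace(id="c2", name="square", args={"x": 4}),
    SimpleNamespace(id="c3", name="missing", args={}),  # unknown tool
]
tool_messages = run_tool_calls(calls, {"square": lambda x: x * x})
```

It works, but the loop now has to care about result ordering and error handling, and that is only one of the four bolt-ons listed above.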
The explicit state machine: when the flow has rules
A state machine names the stages your agent moves through and the legal transitions between them. You are taking some control back from the model: the model still makes local decisions, but you decide the shape of the overall flow. This is the right move when "what happens next" is not purely the model's call: when there are steps that must happen in a certain order, gates that must pass, or stages you want to reason about explicitly.
```python
from enum import Enum


class Stage(Enum):
    PLAN = "plan"
    RETRIEVE = "retrieve"
    ANSWER = "answer"
    VALIDATE = "validate"
    DONE = "done"


# plan, retrieve, answer, and is_grounded are assumed to be defined elsewhere.
def run(user_input: str) -> str:
    state = {"input": user_input, "stage": Stage.PLAN, "plan": None, "docs": [], "draft": ""}
    while state["stage"] != Stage.DONE:
        if state["stage"] == Stage.PLAN:
            state["plan"] = plan(state["input"])
            state["stage"] = Stage.RETRIEVE
        elif state["stage"] == Stage.RETRIEVE:
            state["docs"] = retrieve(state["plan"])
            state["stage"] = Stage.ANSWER
        elif state["stage"] == Stage.ANSWER:
            state["draft"] = answer(state["input"], state["docs"])
            state["stage"] = Stage.VALIDATE
        elif state["stage"] == Stage.VALIDATE:
            # a gate the model cannot skip
            grounded = is_grounded(state["draft"], state["docs"])
            # not grounded? go get more
            state["stage"] = Stage.DONE if grounded else Stage.RETRIEVE
    return state["draft"]
```

The win here is predictability. The validation gate runs every time, by construction, not because the model remembered to. The flow is inspectable: you can log the stage at every step and know exactly where a run is. And you can loop deliberately (the failed validation sends it back to retrieve) rather than hoping the model decides to.
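One way to make the "inspectable" claim concrete is to write the legal transitions down as data and record a trace as the run moves through them. The sketch below does that and also caps the retrieve loop, for the same reason the plain loop caps its steps. The handler interface, `TRANSITIONS`, `max_retrievals`, and the toy stage functions are all assumptions made for this example.

```python
from enum import Enum


class Stage(Enum):
    PLAN = "plan"
    RETRIEVE = "retrieve"
    ANSWER = "answer"
    VALIDATE = "validate"
    DONE = "done"


# Legal transitions, written down once. Anything not listed is a bug.
TRANSITIONS = {
    Stage.PLAN: {Stage.RETRIEVE},
    Stage.RETRIEVE: {Stage.ANSWER},
    Stage.ANSWER: {Stage.VALIDATE},
    Stage.VALIDATE: {Stage.RETRIEVE, Stage.DONE},
}


def run(user_input, handlers, max_retrievals=3):
    """handlers maps each Stage to a function(state) -> next Stage."""
    state = {"input": user_input, "docs": [], "draft": "", "retrievals": 0}
    stage, trace = Stage.PLAN, [Stage.PLAN]
    while stage != Stage.DONE:
        nxt = handlers[stage](state)
        if nxt not in TRANSITIONS[stage]:
            raise RuntimeError(f"illegal transition {stage.value} -> {nxt.value}")
        if nxt == Stage.RETRIEVE:
            state["retrievals"] += 1
            if state["retrievals"] > max_retrievals:
                # Budget exhausted: stop instead of looping forever.
                state["draft"] = "Could not produce a grounded answer."
                nxt = Stage.DONE
        stage = nxt
        trace.append(stage)
    return state["draft"], trace


# Toy stage functions so the sketch runs end to end.
def plan(state):
    state["plan"] = f"look up: {state['input']}"
    return Stage.RETRIEVE


def retrieve(state):
    state["docs"].append(f"doc-{len(state['docs']) + 1}")
    return Stage.ANSWER


def answer(state):
    state["draft"] = f"answer from {len(state['docs'])} docs"
    return Stage.VALIDATE


def validate(state):
    # Toy grounding check: pretend two documents are enough.
    return Stage.DONE if len(state["docs"]) >= 2 else Stage.RETRIEVE


HANDLERS = {
    Stage.PLAN: plan,
    Stage.RETRIEVE: retrieve,
    Stage.ANSWER: answer,
    Stage.VALIDATE: validate,
}
draft, trace = run("refund policy", HANDLERS)
```

The trace shows the deliberate loop directly: validate fails once, the run goes back to retrieve, and the second pass finishes. A handler that tries to skip validation raises immediately instead of silently shipping an unchecked draft.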
The cost is that you wrote it. You designed the stages and transitions, and if the problem needs a stage you did not anticipate, you change code. That rigidity is exactly the point for workflows with rules, and exactly the wrong fit for open-ended exploration where you do not know the stages in advance.
The graph: when you need parallelism, persistence, and resumability
A graph is a state machine where the framework owns the execution. Nodes are steps, edges are transitions, and a shared state object flows through. The reason to take on a framework like LangGraph is not that it is fancier; it is that it gives you four things that are real work to build yourself:
- Checkpointing and persistence. The framework serializes state at each step, so a run can crash and resume, or pause for human input and pick up days later.
- Parallel branches. Fan out to several nodes at once and join their results, without writing the concurrency by hand.
- Human-in-the-loop. Interrupt before a sensitive step, wait for approval, resume. This is hard to do well on a bare loop.
- Time travel and inspection. Load a past checkpoint, change the state, fork execution from there. Priceless when debugging.
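To give a feel for why the first item counts as real work, here is a stdlib-only sketch of checkpointing a linear pipeline by hand. This is not how LangGraph implements it; the step names, the JSON file layout, and the `crash_before` switch are hypothetical and exist only to simulate a crash.

```python
import json
import tempfile
from pathlib import Path

STEPS = ["plan", "retrieve", "answer"]  # hypothetical linear pipeline


def run_step(name, state):
    # Stand-in for real work: each step just records that it ran.
    state["log"].append(name)
    return state


def run(checkpoint_path: Path, user_input: str, crash_before=None):
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())
    else:
        state = {"input": user_input, "next_step": 0, "log": []}

    while state["next_step"] < len(STEPS):
        name = STEPS[state["next_step"]]
        if name == crash_before:
            raise RuntimeError(f"simulated crash before {name}")
        state = run_step(name, state)
        state["next_step"] += 1
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a half-written checkpoint behind.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(checkpoint_path)
    return state


with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "run-1.json"
    try:
        run(path, "hello", crash_before="answer")
    except RuntimeError:
        pass
    saved = json.loads(path.read_text())  # progress survived the crash
    final = run(path, "hello")  # resumes at "answer", skips finished steps
```

Even this toy ignores state that is not JSON-serializable, concurrent runs sharing a checkpoint, branching flows, and pausing for a human. Those gaps are the part a framework is selling.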
```python
from langgraph.graph import StateGraph, START, END


# AgentState, the *_node functions, and checkpointer are assumed to be
# defined elsewhere.
def build():
    g = StateGraph(AgentState)
    g.add_node("plan", plan_node)
    g.add_node("retrieve", retrieve_node)
    g.add_node("answer", answer_node)
    g.add_node("validate", validate_node)
    g.add_edge(START, "plan")
    g.add_edge("plan", "retrieve")
    g.add_edge("retrieve", "answer")
    g.add_edge("answer", "validate")
    # conditional edge: loop back or finish, decided by a function of state
    g.add_conditional_edges(
        "validate",
        lambda s: "retrieve" if not s["grounded"] else END,
    )
    return g.compile(checkpointer=checkpointer)
```

Notice the structure is the same as the hand-written state machine. That is the point: a graph is the explicit state machine, plus an engine that handles the hard operational parts. You pay for it with a steeper learning curve and a dependency that owns your control flow. If you are not using the checkpointing, parallelism, or human-in-the-loop features, you are paying that cost for nothing, and a plain state machine would have been simpler.
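The graph definition leaves `AgentState` and the node functions undefined. Here is a sketch of what they might look like, assuming the usual convention that each node receives the current state and returns only the keys it wants to update. The field names, the toy bodies, and `route_after_validate` are illustrative guesses, and the plain-Python driver at the end stands in for the compiled graph so the block runs without LangGraph installed.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    input: str
    plan: str
    docs: list[str]
    draft: str
    grounded: bool


def plan_node(state: AgentState) -> dict:
    return {"plan": f"find sources for: {state['input']}"}


def retrieve_node(state: AgentState) -> dict:
    # Placeholder retrieval: add one more fake document per visit.
    docs = state.get("docs", [])
    return {"docs": docs + [f"doc-{len(docs) + 1}"]}


def answer_node(state: AgentState) -> dict:
    return {"draft": f"answer citing {', '.join(state['docs'])}"}


def validate_node(state: AgentState) -> dict:
    # Toy grounding check: require at least two documents.
    return {"grounded": len(state["docs"]) >= 2}


def route_after_validate(state: AgentState) -> str:
    # In the real graph this would return END instead of "done".
    return "done" if state["grounded"] else "retrieve"


# Hand-rolled driver that mimics the graph's edges for this one flow.
state: AgentState = {"input": "refund policy"}
state.update(plan_node(state))
while True:
    for node in (retrieve_node, answer_node, validate_node):
        state.update(node(state))
    if route_after_validate(state) == "done":
        break
```

Because the nodes are plain functions over a dictionary, they can be unit tested this way before any framework is involved, and the routing decision gets a name instead of living in a lambda.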
I went deeper on the persistence side in "LangGraph state, checkpointing, and persistence", and on the human-approval pattern in "Human-in-the-loop in DeepAgents". Those two features are the most common honest reasons to be on a graph.
How to actually choose
| | Plain loop | State machine | Graph |
|---|---|---|---|
| Control over flow | Model decides | You decide shape | You decide shape |
| Parallel steps | Manual | Manual | Built in |
| Pause and resume | No | Hard | Built in |
| Survives a crash | No | No | Built in (checkpoints) |
| Enforced gates | No | Yes | Yes |
| Debuggability | Read the loop | Read the stages | Inspect checkpoints |
| Cost to build | Lowest | Medium | Highest (plus a dependency) |
| Best for | Single-purpose assistants, tool-using chatbots | Workflows with ordered steps and gates | Long-running, parallel, resumable, human-gated agents |
My rule: start with the loop. Move to an explicit state machine when the flow grows rules the model should not be free to break. Move to a graph when you specifically need checkpointing, parallelism, or human-in-the-loop, not before. The framework is a tool for those features, not a badge of seriousness.
A good signal that you have outgrown the loop: you find yourself adding flags to the message list to remember "have I validated yet" or "did the user approve". That bookkeeping is a state machine trying to be born. Make the state explicit and the code gets clearer.
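A small before-and-after sketch of that signal. The marker messages, `RunState`, and `may_write` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


# Before: flags smuggled into the message list as marker messages.
def may_write_from_messages(messages):
    validated = any(m.get("content") == "[validated]" for m in messages)
    approved = any(m.get("content") == "[user approved]" for m in messages)
    return validated and approved


# After: the same facts as named fields that live next to the messages.
@dataclass
class RunState:
    messages: list = field(default_factory=list)
    validated: bool = False
    approved: bool = False

    def may_write(self) -> bool:
        return self.validated and self.approved


messages = [
    {"role": "user", "content": "Update the record"},
    {"role": "system", "content": "[validated]"},
]
state = RunState(messages=[messages[0]], validated=True)
```

Both versions answer the same question, but the second keeps bookkeeping out of the text the model reads, and the fields are one short step away from becoming the stages of a state machine.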
The worst outcome is not picking the "wrong" one. It is picking the heaviest one by default, then fighting its abstractions for a problem a loop would have solved in an afternoon. Match the structure to what the agent actually needs, and let it grow into more only when the need is real. If you are weighing this for a system with several agents, the routing question on top of it is covered in "Multi-agent handoffs vs the supervisor pattern".

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.