Building Agents That Actually Work
The practical habits that separate a reliable agent from a flaky demo: simpler workflows, better information flow, and a debugging ladder
There is a real gap between an agent that works and one that does not, and it is usually not the exciting stuff that decides which side you land on. The flaky agent is rarely flaky because you picked the wrong architecture. It is flaky because it makes too many decisions, its tools do not explain themselves, and its task was worded vaguely. The good news is that these are boring, fixable problems. This post is about the handful of practices that reliably move an agent from "works in the demo" to "works." The examples use smolagents, but the lessons carry to any framework.
If you want the fundamentals first, read what agents are and how agents use tools. This picks up where those leave off.
The best agentic systems are the simplest
Every time you hand a decision to an LLM, you add a chance for it to get that decision wrong. Good systems have error logging and retries so the model can self-correct, but the surest way to cut error is to give the model fewer decisions to make in the first place. So the single most useful guideline is this: reduce the number of LLM calls as much as you can.
Take a bot for a surf trip company. When someone asks about a spot, a naive design lets the agent make two separate calls, one to a travel-distance API and one to a weather API, deciding each time. Instead, collapse them into one tool, return_spot_information, that calls both APIs and returns their combined output. That is one decision instead of two or three, which means lower cost, lower latency, and less room for the model to trip.
Two takeaways fall out of this:
Group tools where you can. If two tools are almost always used together, make them one tool. Every tool you merge is a decision you take off the model's plate.
Prefer deterministic code over agentic decisions. If a step can be a plain function with a predictable output, make it a function. Save the agent's judgment for the parts that genuinely need judgment. An if statement never hallucinates.
Before adding agency anywhere, ask whether that step actually needs it. A hardcoded workflow behaves the same way on every run; an LLM-driven step does not. Reach for the LLM only where the flexibility is worth the risk, and keep everything else deterministic.
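As a minimal sketch of the merged tool from the surf example, the function below wraps two stand-in helpers in a single call. The helpers `get_travel_distance_km` and `get_weather_summary`, and the data they return, are hypothetical placeholders rather than real APIs; in smolagents you would additionally decorate `return_spot_information` with `@tool`.

```python
def get_travel_distance_km(spot: str) -> float:
    """Hypothetical stand-in for a travel-distance API call."""
    fake_distances = {"Anchor Point": 18.5}
    return fake_distances.get(spot, 0.0)


def get_weather_summary(spot: str) -> str:
    """Hypothetical stand-in for a weather API call."""
    return f"sunny, 1.5m waves at {spot}"


def return_spot_information(spot: str) -> str:
    """Call both helpers in plain code and return one readable report."""
    distance_km = get_travel_distance_km(spot)
    weather = get_weather_summary(spot)
    return f"{spot}: {distance_km:.1f} km away; weather: {weather}."


print(return_spot_information("Anchor Point"))
```

The model now makes one decision (call the tool or not), and the order in which the two lookups happen is fixed in ordinary code.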
Improve the information flow to the model
Here is the mental model that fixes the most bugs: your LLM is an intelligent person locked in a room, and the only way it learns anything about the outside world is notes passed under the door. If you did not put something in the prompt or a tool result, the model does not know it. Full stop.
That starts with the task itself. Because an agent is driven by an LLM, small changes in how you word the task can produce completely different behaviour, so make the task unambiguous. Then fix the other channel the model hears through: its tools. Tools should explain themselves, both in how they are described and in what they return.
Compare two versions of the same weather tool. First, the poor one:
```python
from datetime import datetime

from smolagents import tool

# convert_location_to_coordinates and get_weather_report_at_coordinates are
# helper functions assumed to be defined elsewhere.

@tool
def get_weather_api(location: str, date_time: str) -> str:
    """
    Returns the weather report.

    Args:
        location: the name of the place that you want the weather for.
        date_time: the date and time for which you want the report.
    """
    lon, lat = convert_location_to_coordinates(location)
    date_time = datetime.strptime(date_time, "%m/%d/%y %H:%M:%S")
    return str(get_weather_report_at_coordinates((lon, lat), date_time))
```

Why is this bad? The description never says what format date_time should be in. It never says how to specify a location. It has no handling that turns a bad input into a message the model can learn from, so a formatting mistake surfaces as a raw stack trace. And the output is a bare list that is hard to read. When this fails, the model has to reverse-engineer the tool from the error to fix its call. Why make it do that work?
Now the good version:
```python
# Uses the same imports and helper functions as the poor version.
@tool
def get_weather_api(location: str, date_time: str) -> str:
    """
    Returns the weather report.

    Args:
        location: the name of the place that you want the weather for. Should be a place name, followed by possibly a city name, then a country, like "Anchor Point, Taghazout, Morocco".
        date_time: the date and time for which you want the report, formatted as '%m/%d/%y %H:%M:%S'.
    """
    lon, lat = convert_location_to_coordinates(location)
    try:
        date_time = datetime.strptime(date_time, "%m/%d/%y %H:%M:%S")
    except Exception as e:
        raise ValueError(
            "Conversion of `date_time` to datetime format failed, make sure to "
            "provide a string in format '%m/%d/%y %H:%M:%S'. Full trace: " + str(e)
        )
    temperature_celsius, risk_of_rain, wave_height = get_weather_report_at_coordinates((lon, lat), date_time)
    return (
        f"Weather report for {location}, {date_time}: "
        f"Temperature will be {temperature_celsius}°C, "
        f"risk of rain is {risk_of_rain*100:.0f}%, wave height is {wave_height}m."
    )
```

Same logic, completely different reliability. The description spells out the exact format with an example. The error message tells the model precisely what went wrong and how to fix it, so a bad call becomes a self-correcting one. And the output is a sentence a human could read, which means the model can act on it directly.
The question to keep asking is simple: if I were not very smart and using this tool for the first time, how easily could I call it correctly and fix my own mistakes? Design for that reader. And inside your tools, use print liberally, because in a code agent those prints become observations the model sees on the next step. Logging what happened, especially on errors, is one of the cheapest reliability wins there is.
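To make the point about printing concrete, here is a rough stand-in for what a code agent does with printed output: it captures what a step wrote to stdout and hands that back as an observation. The `lookup_tide` tool, its data, and the `run_step` helper are all hypothetical, and the sketch uses only the standard library rather than smolagents internals.

```python
import contextlib
import io


def lookup_tide(spot: str) -> str:
    """Hypothetical tool that prints what it is doing as it goes."""
    known = {"Anchor Point": "low tide at 14:30"}
    print(f"lookup_tide called with spot={spot!r}")
    if spot not in known:
        # Print enough context for the caller to correct the next call.
        print(f"No tide data for {spot!r}. Known spots: {sorted(known)}")
        return "unknown"
    return known[spot]


def run_step(func, *args) -> tuple[str, str]:
    """Run one call and capture everything it printed as an observation."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args)
    return result, buffer.getvalue()


result, observation = run_step(lookup_tide, "Anchor Pt")
print(observation)
```

The captured text names the bad input and lists the valid ones, which is the kind of note a model can use to repair its next call.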
Pass more than a string
An agent's task does not have to be just text. When the model needs an object to work with (a file, an image, some structured data), pass it alongside the task with additional_args:
```python
from smolagents import CodeAgent, InferenceClientModel

agent = CodeAgent(
    tools=[],
    model=InferenceClientModel(model_id="meta-llama/Llama-3.3-70B-Instruct"),
    add_base_tools=True,
)

agent.run(
    "Why does Mike not know many people in New York?",
    additional_args={"mp3_sound_file_url": "https://.../recording.mp3"},
)
```

The agent can then reach those objects directly instead of you trying to cram them into a string. It is a small feature that saves a lot of awkward prompt engineering.
A debugging ladder
When an agent misbehaves, resist the urge to rewrite everything. Climb this ladder in order; most problems are solved on the first two rungs.
Rung 1: use a stronger model. A lot of "bugs" are not bugs in your system, they are the model reasoning poorly. A real example: an agent asked to make a car picture generates the image, but returns the file path instead of the image, because the model forgot to save the output into a variable it could return. That looks like a framework bug, but it is the model's mistake, and a more capable model would not have made it. Before you touch your code, try a better model.
Rung 2: give more information or clearer instructions. Weaker models can work well if you guide them harder. Put yourself in the model's shoes: with only the system prompt, the task, and the tool descriptions, could you solve this? If not, add what is missing, and put it in the right place. Standing rules that should always apply go in the agent's instructions (they are appended to the system prompt rather than replacing it). Details specific to one task go in the task. Guidance about a particular tool goes in that tool's description.
```python
from smolagents import CodeAgent, InferenceClientModel

agent = CodeAgent(
    tools=[],
    model=InferenceClientModel(model_id="meta-llama/Llama-3.3-70B-Instruct"),
    instructions="Always cite the source URL next to any fact you report.",
)
```

Rung 3: change the prompt templates (usually skip this). You can overwrite the whole system prompt template, and it is occasionally necessary, but it is rarely the right first move. Passing instructions gets you most of the benefit with far less risk of breaking the carefully tuned default. Reach for template surgery only when clearer instructions genuinely are not enough.
Rung 4: add a planning step. For longer, multi-step tasks, let the agent periodically stop and think without acting. smolagents supports this with planning_interval: every few steps the model pauses, updates the facts it knows, and reflects on what to do next.
```python
from smolagents import CodeAgent, InferenceClientModel

# search_tool and image_generation_tool are assumed to be defined elsewhere.
agent = CodeAgent(
    tools=[search_tool, image_generation_tool],
    model=InferenceClientModel(model_id="Qwen/Qwen2.5-72B-Instruct"),
    planning_interval=3,  # run a planning step every 3 actions
)
```

That periodic re-planning keeps a long run from drifting off course, which is one of the most common failure modes on hard tasks.
Make it observable, then measure it
The practices above make an agent more likely to work. To know whether it actually does, and to keep it working as you change it, you need two more things that the tutorial hints at and I want to underline.
Observability first. You cannot fix what you cannot see, so trace every run: the plan, each tool call and its arguments, each result, and the final answer. smolagents integrates with OpenTelemetry-based tracing for exactly this. A readable trace turns "the agent did something weird" into "the agent called the wrong tool on step three," which is a fixable statement.
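As a sketch of the kind of record a useful trace holds, the decorator below logs each tool call's name, arguments, and result or error to a list. It is a standard-library illustration only: `traced`, `TRACE`, and `get_wave_height` are hypothetical names, and this is not the OpenTelemetry integration mentioned above.

```python
import functools

TRACE: list[dict] = []  # hypothetical in-memory trace; a real setup would export spans


def traced(func):
    """Record every call to `func`, with its arguments and outcome, in TRACE."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {"step": len(TRACE) + 1, "tool": func.__name__, "args": args, "kwargs": kwargs}
        try:
            entry["result"] = func(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            TRACE.append(entry)
    return wrapper


@traced
def get_wave_height(spot: str) -> float:
    """Hypothetical tool backed by made-up data."""
    heights = {"Anchor Point": 1.5}
    return heights[spot]  # raises KeyError for unknown spots


get_wave_height("Anchor Point")
try:
    get_wave_height("Nowhere")
except KeyError:
    pass

for entry in TRACE:
    print(entry)
```

Each entry carries a step number, so a failure can be pinned to a specific call rather than to the run as a whole.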
Then measure. Once you can see runs, turn the ones that went wrong into test cases and score changes against them, so a prompt tweak that helps one case does not quietly break five others. This is the whole point of evaluating agents, and it is what separates an agent you tinker with from one you can actually trust in front of users.
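A minimal sketch of that regression loop, assuming the agent can be treated as a callable from a task string to an answer string. `EvalCase`, `score`, and `fake_agent` are hypothetical names; in practice the callable would wrap something like `agent.run`, and the checks would come from runs you have seen go wrong.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]  # returns True if the answer is acceptable


def score(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose answer passes its check."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        try:
            answer = agent(case.task)
        except Exception:
            continue  # a crash counts as a failure
        if case.check(answer):
            passed += 1
    return passed / len(cases)


def fake_agent(task: str) -> str:
    """Hypothetical stand-in for a real agent."""
    if "wave" in task:
        return "Wave height is 1.5m."
    raise RuntimeError("no idea")


cases = [
    EvalCase("What is the wave height at Anchor Point?", lambda a: "1.5m" in a),
    EvalCase("How far is Anchor Point?", lambda a: "km" in a),
]
print(f"pass rate: {score(fake_agent, cases):.0%}")
```

Running the same cases before and after a prompt or tool change gives one number to compare, which is enough to catch a tweak that fixes one case and breaks others.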
Wrapping up
Reliable agents are not a matter of clever architecture. They come from a few unglamorous habits: cut the number of decisions the model makes, keep deterministic work deterministic, word the task clearly, and build tools that describe themselves and return readable, self-correcting output. When something breaks, climb the ladder: try a stronger model, then better instructions, then a planning step, and rewrite the prompt templates only as a last resort. Wrap the whole thing in tracing and evaluation so you can see what it does and prove that your changes help. Do that, and your agent lands on the right side of the gap.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.