How I Learn to Build Production AI Systems by Dissecting Open Source
There are no good production-grade courses, so I turn the codebases real teams run into my curriculum
I gave up waiting for a good course on production AI infrastructure. Not because none exist, but because the field moves faster than any curriculum can keep up. By the time a course on "building AI agents" is recorded, edited, and published, the patterns it teaches are a version or two behind what teams actually run. So I stopped looking for the course and started building my own out of the only material that stays current: the codebases production teams ship.
The method is simple to say and hard to do. Take the open-source project that owns a layer of the stack, gut it, and rebuild it from git init until I understand why every important decision was made. I call it Dissect, then Rebuild, then Ship.
Why reading code beats watching courses#
A course teaches you the happy path. Production code teaches you the edge cases, because it was written by people who got paged at 3am when the edge case fired. The retry logic, the budget guard, the deterministic replay, the audit log: none of that shows up in a tutorial, but all of it is in the codebase, and all of it is the actual job.
There is a catch. Reading code is not the same as understanding it. I can read a clever agent loop, nod, and retain nothing. The understanding only sticks when I rebuild it myself and hit the same walls the original authors hit. That is the difference between recognizing a solution and being able to produce one.
The method: Dissect, Rebuild, Ship#
Three phases per project.
Dissect. Read the codebase to find the spine: the five or six decisions that define how it works. Not every file, just the load-bearing ones. Write down what each one does and why.
Rebuild. Start from an empty repo. Do not clone and tweak. Rebuild the core from scratch, in my own structure, until it runs. This is where the understanding actually happens, because you cannot rebuild what you only half-read.
Ship. Get it working end to end, write up what I learned, and put it on GitHub under MIT. If it does not ship, I did not finish learning it.
A few rules of engagement keep me honest:
- No cloning and renaming. Every rebuild starts at
git init. - Every rebuild ships a 20-case eval suite. If I cannot test it, I do not understand its contract.
- Each repo gets a
CLAUDE.md(how the code is structured) and aLEARNING.md(what surprised me). - Each project ships end to end within about four weeks. Timeboxing forces me to find the spine instead of polishing forever.
- Everything is MIT-licensed and public.
The four-week box is the most important rule. Without it I would spend three months on the first project and never reach the rest. Done and shipped beats perfect and abandoned.
The path: one project per layer#
I picked eight projects, each owning one layer of the production agent stack. The chain runs in order: loop, gateway, memory, sandbox, observability, durability, orchestration, tools. Together they go from "how does an agent even loop" up to "how does an agent touch the real world." The point is not the individual tools. It is that each one is the cleanest real-world answer to one hard infrastructure question, and the same layers that proprietary AI tooling companies charge for.
| Layer | Project | The question it answers |
|---|---|---|
| The agent loop | smolagents | What is the minimal loop that makes an agent an agent? |
| Model gateway | LiteLLM | How do you route across models with fallbacks, budgets, and cost tracking? |
| Memory | Letta / MemGPT | How do you give an agent memory tiers and persistent state? |
| Sandbox | OpenHands | How do you run agent-generated code safely in a Docker sandbox? |
| Observability | Langfuse | How do you trace, evaluate, and gate regressions in CI? |
| Durability | Inngest | How do you survive crashes with durable execution and replay? |
| Orchestration | Paperclip | How do you coordinate many agents with budgets and audit logs? |
| Tools | browser-use | How does an agent act on the world through the DOM? |
1. smolagents, the agent loop#
The starting point, because everything else is built on it. Strip away the framework and an agent is a loop: call the model, parse what it wants to do, do it, feed the result back, repeat until done. smolagents is the cleanest expression of that I have found. Rebuilding it means I never again treat "the agent" as a black box. I covered the concept side of this in What Are AI Agents, and What Is a Multi-Agent System?.
2. LiteLLM, the model gateway#
In production you do not call one model, you call whichever model is up, cheapest, and fast enough, and you fall back when one fails. LiteLLM is the gateway that makes a hundred providers look like one API, with routing, fallbacks, budgets, and cost tracking. Rebuilding it teaches you the unglamorous reality that model calls are a distributed-systems problem, not a function call.
3. Letta (MemGPT), memory#
An agent with no memory restarts from zero every session. Letta pioneered memory tiers: a small in-context working memory plus larger stores it pages in and out, like virtual memory for an LLM. Rebuilding it forces you to confront what "remembering" actually means when the context window is finite. It connects directly to the retrieval ideas in Context Engineering for Agents.
4. OpenHands, the sandbox#
The moment an agent writes and runs code, you have a security problem. OpenHands runs that code in a Docker sandbox so a bad command cannot wreck the host. Rebuilding the sandbox layer is where you learn that "let the agent run code" and "let the agent run code safely" are completely different engineering tasks.
5. Langfuse, observability#
You cannot improve what you cannot see. Langfuse traces every step an agent takes, runs evals including LLM-as-judge, and gates regressions in CI. This is the layer that turns "it feels better" into "p95 dropped and the eval suite passed." It pairs with what I wrote in Evaluating Agents with LangSmith; same problem, different tool.
6. Inngest, durability#
Real agent tasks are long, and long tasks crash. Inngest brings durable execution: a workflow that survives process death, resumes where it stopped, and can replay deterministically. Rebuilding it is how you learn that an agent is really a long-lived workflow wearing a chat interface, and workflows need checkpoints.
7. Paperclip, orchestration#
One agent is a loop. Many agents is a coordination problem: who does what, who pays for it, and who is accountable. Paperclip handles multi-agent orchestration with budgets and audit logs. Rebuilding it makes the supervisor-versus-handoff tradeoffs concrete instead of theoretical.
8. browser-use, tools that touch the world#
The last layer is action. browser-use lets an agent observe a real web page through the DOM and act on it: click, type, navigate. Rebuilding it grounds the whole stack in something physical, because a tool that touches the real world fails in ways a pure-text tool never does. The tool-calling foundations are in Giving Agents Tools: Function Calling and MCP.
The capstone: compose all eight into one thing#
The eight projects are not the finish line, they are the parts. The final build is to wire Projects 1 through 7 into one application a real user would pay for: my agent loop, behind my gateway, with my memory layer, running in my sandbox, traced by my observability stack, made crash-safe by my durable executor, governed by my control plane. Pick the most annoying recurring workflow I actually have, automate it end to end, and run it 24/7 with budgets and audit logs in front of at least five real users.
That capstone is the whole point. Anyone can follow a tutorial for one layer. Stacking all of them into a single system that survives contact with real traffic is the thing that turns "I read the code" into "I can build the platform." The profile at the end is someone who builds the platforms other people build on, not someone who only builds on top of them.
What I am actually getting out of this#
By the end I will have rebuilt, from scratch, one honest answer to each hard question in the production agent stack, with eval suites to prove each one works and write-ups of what surprised me. That is a better artifact than any certificate, and it stays current because the source material is whatever teams are shipping right now, not whatever was true when a course was filmed.
If you want to try this yourself, do not start with all eight. Pick the one layer you understand least, find the cleanest open-source project that owns it, and rebuild its spine in two weeks. The first rebuild is the one that teaches you the method. The rest is just repetition. I will be writing up each project as I finish it, starting with the agent loop, so this is the first post in what is going to be a long series.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.