5 Projects to Master AI Engineering (for Experienced Engineers), Folarin Akinloye

If you already know how to build software, most "learn AI engineering" content will waste your time. You do not need another notebook that calls an LLM and prints the result. You need projects that force you into the parts that are actually hard: grounding answers in real data, letting a model take actions safely, proving quality with numbers, controlling cost, and keeping the thing alive in production.

I went through this transition myself, and the projects that taught me the most were the ones that broke in ways my normal engineering instincts did not predict. So here are five. They build on each other, they are scoped to someone who can already ship, and each one targets a specific skill that separates people who can demo from people who can operate.

One framing note before the list. The market has moved past "can you call an API." The 2026 State of Agent Engineering survey found 57% of teams already running agents in production, but only about 52% have real evals, even though 89% have some observability. Read that gap honestly: most people can get something running and watch it, but far fewer can prove it is good. That gap is exactly where these projects aim.

1. Production-grade RAG over your own messy data

Retrieval is still the most in-demand skill in AI engineering, and almost everyone builds the easy version. Chunk some clean docs, embed them, do cosine similarity, ship. That version demos beautifully and falls apart on real questions.

Build the hard version. Point it at a corpus that is genuinely messy: your company wiki, a pile of PDFs with tables, a Slack export, anything with inconsistent structure. Then make it actually work.

What that forces you to learn:

Real chunking decisions, not the default 1,000 characters. See where naive splitting destroys meaning.
Hybrid retrieval. The default architecture in 2026 is keyword (BM25) plus vector search, often with a structured or graph layer for entities, and a reranker on top. Build all of it and measure what each piece adds.
Query rewriting and metadata filters, because users do not phrase questions the way your documents are written.
Knowing when retrieval failed versus when generation failed, which are completely different bugs.
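To make the hybrid retrieval item concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. The fusion method is an assumption on my part (the list above does not prescribe one), and the document IDs and hit lists are made up for illustration. In a real pipeline the two lists would come from your BM25 index and your vector store, and a reranker would run on the fused output.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from a keyword index and a vector index.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers agree on float to the top, which is the behavior you want to measure against each retriever on its own.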

You will know you have succeeded when you can point at a class of questions, explain why the pipeline gets them wrong, and fix that specific failure without breaking the rest. That diagnostic loop is the actual skill. I have written the building blocks up already: chunking strategies, reranking, embeddings for engineers, and choosing a vector database.
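The diagnostic loop starts with sorting each wrong answer into the right bucket. A minimal sketch, assuming you have labeled each test question with the document IDs that contain the answer and have some way to mark the final answer correct or not (all names here are hypothetical):

```python
def classify_failure(gold_doc_ids, retrieved_doc_ids, answer_correct):
    """Label one eval case as ok, a retrieval failure, or a generation failure.

    Assumes gold_doc_ids lists the documents that actually contain the answer.
    """
    if answer_correct:
        return "ok"
    # None of the needed documents reached the model: the retriever is at fault.
    if not set(gold_doc_ids) & set(retrieved_doc_ids):
        return "retrieval_failure"
    # The evidence was in context and the answer is still wrong.
    return "generation_failure"


print(classify_failure(["wiki_12"], ["wiki_40", "wiki_7"], answer_correct=False))
print(classify_failure(["wiki_12"], ["wiki_12", "wiki_7"], answer_correct=False))
```

The first case points you at chunking, retrieval, or query rewriting; the second points you at the prompt or the model.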

Stretch goal: add an evaluation pass (project 3) so every change to the pipeline is backed by a number instead of a vibe.

2. An agent that uses tools and can take real actions

Once retrieval feels solid, build something that acts. Not a chatbot, an agent: a system that plans, calls tools, reads the results, and decides what to do next. Pick a task with real steps. A coding assistant that can read a repo and open a pull request. A research agent that searches, reads, and writes a sourced brief. An ops agent that can query your systems and propose a fix.

This is where you hit the problems that define agentic AI:

Tool design. The model only uses tools as well as their descriptions and schemas let it. You will rewrite these more than you expect.
Context management. A long-running agent fills its own context with junk and gets dumber. You will learn compression, offloading to a filesystem, and isolating heavy work in subagents. I wrote a whole series on how one framework does this in my DeepAgents posts.
Multi-agent shape. When one agent is not enough, do you use handoffs or a supervisor? Build it and feel the tradeoff.
Safety. The moment an agent can delete, send, or deploy, you need human-in-the-loop approval and guardrails. Build the gate before you need it.
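The safety gate in the list above can be as small as a flag on each tool plus a check before execution. This is a sketch under stated assumptions, not any framework's API: `Tool`, `run_tool`, and the `approve` callback are hypothetical names, and in a real system `approve` would block on a human decision instead of returning immediately.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str            # what the model reads when choosing a tool
    func: Callable[..., str]
    destructive: bool = False   # True for delete/send/deploy style actions


class ApprovalRequired(Exception):
    """Raised when a destructive tool call was not approved."""


def run_tool(tool, args, approve):
    # Destructive tools only run if the approval callback says yes.
    if tool.destructive and not approve(tool.name, args):
        raise ApprovalRequired(f"{tool.name} needs human approval: {args}")
    return tool.func(**args)


read_tool = Tool("read_file", "Read a file by path.", lambda path: f"contents of {path}")
delete_tool = Tool("delete_file", "Delete a file by path.", lambda path: f"deleted {path}", destructive=True)


def deny_all(tool_name, args):
    return False  # stand-in for a real human approval step


print(run_tool(read_tool, {"path": "README.md"}, approve=deny_all))
try:
    run_tool(delete_tool, {"path": "README.md"}, approve=deny_all)
except ApprovalRequired as exc:
    print("blocked:", exc)
```

Putting the gate in the executor rather than in the prompt means the model cannot talk its way past it.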

You have succeeded when the agent completes a multi-step task end to end, recovers from at least one tool failure on its own, and you trust it enough to let it act without watching every step. If you want a worked example to learn from, I did a full build in a research agent with Vercel's Eve.

3. An evaluation harness that catches regressions

This is the project almost nobody builds, and it is the one that will set you apart. Take project 1 or 2 and build the thing that tells you, automatically, whether a change made it better or worse.

For RAG, that means scoring the dimensions that matter: faithfulness (is the answer grounded in the retrieved context), context relevance (did you retrieve the right stuff), and answer quality. Industry targets people quote in 2026 sit around faithfulness above 0.9 and context precision above 0.8, so you have real numbers to aim at. For agents, it means trace-based evaluation: judging not just the final answer but the path the agent took.
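As a rough illustration of the retrieval-side score, here is a simplified, label-based precision@k checked against the 0.8 target quoted above. It is a sketch, not the exact formula any particular eval library uses for context precision, and the chunk IDs and relevance labels are made up. Faithfulness is left out because it needs an LLM judge or human labels.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


CONTEXT_PRECISION_TARGET = 0.8  # the target quoted in the text, not a universal constant

# Hypothetical retrieval result and human relevance labels for one question.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c4"}

score = precision_at_k(retrieved, relevant, k=4)
meets_target = score > CONTEXT_PRECISION_TARGET
print(score, meets_target)  # 0.75 False
```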

What you will actually build and learn:

A dataset of real inputs with known-good outputs. Curating this honestly is half the work.
LLM-as-judge scoring for breadth, paired with human review for the high-stakes cases. The survey data backs this mix: human review is still essential, with roughly 60% of teams keeping a human in the loop for nuanced calls.
A regression suite you run on every prompt or model change, so you stop shipping on gut feel.
Tooling fluency. Try the open options (RAGAS, DeepEval, Promptfoo, Langfuse, Phoenix) and form an opinion.
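The regression suite in the list above boils down to a diff between two eval runs. A minimal sketch, assuming each run is a dict mapping a case ID to pass/fail (the case IDs and results below are invented for illustration):

```python
def diff_runs(baseline, candidate):
    """Compare per-case pass/fail results from two eval runs.

    Only cases present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    fixed = sorted(c for c in shared if candidate[c] and not baseline[c])
    broken = sorted(c for c in shared if baseline[c] and not candidate[c])
    return {"fixed": fixed, "broken": broken}


# Hypothetical results before and after a prompt change.
baseline = {"q1": True, "q2": False, "q3": False, "q4": True}
candidate = {"q1": True, "q2": True, "q3": True, "q4": False}

report = diff_runs(baseline, candidate)
print(report)  # {'fixed': ['q2', 'q3'], 'broken': ['q4']}
```

Run it in CI and fail the build when `broken` is non-empty, or at least make someone look at the list before merging.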

You have succeeded when you change a prompt, run the harness, and it tells you that you fixed two cases and broke one, before any user sees it. I went deep on the RAG side in evaluating RAG: faithfulness, context, and answer quality.

Tip

If you only build one project from this list, build this one. Evals are the rarest skill on the market and the one that makes every other project measurably better.

4. A cost and latency rework of something that works

Take a system from project 1 or 2 that already produces good output, and make it cheap and fast without wrecking quality. This is pure engineering, which is exactly why experienced engineers tend to be good at it and tend to skip it.

The work:

Profile honestly. Find where the tokens and the milliseconds actually go. It is rarely where you guess.
Apply prompt caching for the stable parts of your context.
Route by difficulty. Send easy requests to a small cheap model and reserve the expensive one for hard cases.
Stream responses so perceived latency drops even when total latency does not. I wrote up the full path in streaming LLM responses end to end.
Decide what to precompute, what to cache, and what to offload.
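Routing by difficulty, from the list above, can start as a plain heuristic before you reach for a learned classifier. Everything in this sketch is an assumption for illustration: the model names are placeholders, and the word limit and marker phrases are guesses you would replace with signals from your own traffic and eval results.

```python
def route(prompt, max_easy_words=50, hard_markers=("prove", "step by step", "refactor")):
    """Pick a model tier for a request using cheap surface signals.

    Long prompts, or prompts containing a marker phrase, go to the large model.
    """
    text = prompt.lower()
    is_long = len(text.split()) > max_easy_words
    looks_hard = any(marker in text for marker in hard_markers)
    return "large-model" if (is_long or looks_hard) else "small-model"


print(route("What is our refund policy?"))                       # small-model
print(route("Refactor this module and explain it step by step"))  # large-model
```

Whatever rule you use, validate it with the eval harness: a router that saves money by sending hard questions to the small model is a quality regression, not an optimization.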

You have succeeded when you can state the before and after as numbers: cost per request cut by a measured percentage and p95 latency down by a measured amount, with eval scores (project 3) holding steady. "It feels faster" does not count. The playbook is in cutting LLM cost and latency without wrecking quality.
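Stating the before and after as numbers takes very little code. Here is a sketch using the nearest-rank method for p95 and a plain percent change; the latency samples are synthetic, and in practice you would pull them from your tracing data.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, which avoids float rounding surprises.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pct_change(before, after):
    """Percent change from before to after; negative means it went down."""
    return (after - before) / before * 100


# Synthetic latency samples in milliseconds, 20 requests each.
before_ms = [100 + 10 * i for i in range(20)]  # 100 .. 290
after_ms = [80 + 5 * i for i in range(20)]     # 80 .. 175

print(p95(before_ms), p95(after_ms))  # 280 170
print(round(pct_change(p95(before_ms), p95(after_ms)), 1))  # -39.3
```

The same `pct_change` works for cost per request; report it next to the eval scores from the same two runs.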

5. One system, actually deployed, that you keep running

The last project is not a new build. It is taking one of the previous four all the way to production and operating it for a while. This is where you learn the things tutorials cannot teach, because they only show up under real traffic.

What production forces on you:

Observability and tracing, so when a user gets a bad answer you can reconstruct exactly what happened. Most teams have this; make sure yours is good enough to debug a single bad trace.
Handling the long tail: weird inputs, prompt injection attempts, rate limits, provider outages, partial failures.
Versioning prompts and models like the dependencies they are, with a rollback path.
A feedback loop from real usage back into your eval dataset, so the system gets better at the failures that actually occur.
Cost and reliability monitoring with alerts, because surprises here are expensive.
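For the prompt versioning item in the list above, the core idea fits in a small in-memory registry. This is a hypothetical sketch (`PromptRegistry` is not a real library): a production version would persist versions, record which one served each trace, and pin the model version alongside the prompt.

```python
class PromptRegistry:
    """Keeps every published version of a prompt and tracks the active one."""

    def __init__(self):
        self._versions = {}  # name -> list of prompt texts; index + 1 is the version
        self._active = {}    # name -> currently active version number

    def publish(self, name, text):
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        self._active[name] = len(versions)  # newest version becomes active
        return len(versions)

    def get(self, name):
        version = self._active[name]
        return version, self._versions[name][version - 1]

    def rollback(self, name):
        if self._active[name] <= 1:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self._active[name] -= 1
        return self._active[name]


registry = PromptRegistry()
registry.publish("answer", "Answer using only the provided context.")
registry.publish("answer", "Answer using only the provided context. Cite sources.")
print(registry.get("answer")[0])   # 2
registry.rollback("answer")
print(registry.get("answer"))      # (1, 'Answer using only the provided context.')
```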

You have succeeded when the system has survived real users doing things you did not anticipate, and you fixed the failures with data instead of guesses. Hiring managers in 2026 explicitly scan for these production signals: how you handle failure, structure data, connect systems, and ship. A deployed, operated system is the single most convincing thing in a portfolio.

How to actually use this

Do them in order, and let each one feed the next. RAG gives you a system worth evaluating. The agent gives you a harder system worth evaluating. The eval harness makes the cost rework safe. The deployment ties it together and generates the data that improves everything upstream.

Resist two temptations. Do not start a new project the moment one gets hard, because the hard part is the entire point. And do not skip evals just because they are unglamorous: they are the difference between "I built a thing" and "I can prove it works and make it better." That difference is the job.

If you want a sharper version of any one of these scoped to a stack you are using, that is a good next conversation. Pick the project that targets your weakest skill and start there.
