Agentic RAG, and How It Differs from Traditional RAG
When retrieval stops being a fixed step and becomes something the agent decides
Traditional RAG has one move: take the question, fetch some documents, stuff them into the prompt, and generate an answer. It works until the question is even slightly awkward, and then it falls flat because it only ever gets one shot at retrieval. Agentic RAG fixes that by handing the retrieval decision to the model itself. This post is about what actually changes when you make that switch, and how to build it.
If you read the first post in this series on agents and multi-agent systems, you already have the key idea: an agent is a model in a loop that decides its own next move. Agentic RAG is what you get when retrieval becomes one of those moves.
How traditional RAG works#
The classic RAG pipeline is a straight line. A user asks something, you embed the question, search a vector store for the closest chunks, paste those chunks into the prompt, and the model writes an answer from them. Query, retrieve, generate. Done.
# Traditional RAG: one fixed path, every time
chunks = vectorstore.similarity_search(question, k=4)
context = "\n\n".join(c.page_content for c in chunks)
answer = model.invoke(f"Answer using this context:\n{context}\n\nQ: {question}")This is fast, cheap, and easy to reason about. For a lot of use cases it is genuinely all you need, and I would not talk you out of it. But look at what it assumes: that retrieval should always happen, that one search is enough, and that whatever comes back is worth using. Those three assumptions are exactly where it breaks.
Where traditional RAG falls down#
Three failure modes show up constantly.
It retrieves when it should not. Ask "hello" or "summarise our last chat" and a fixed pipeline still runs a vector search, pulling in random chunks that pollute the answer.
It retrieves once and gives up. If the first search misses, there is no second attempt. A vaguely worded question returns vaguely relevant chunks, and the model dutifully answers from junk.
It cannot tell good context from bad. The retrieved chunks go straight into the prompt whether they are relevant or not. Garbage in, confident garbage out.
The root problem is that retrieval is hardcoded. The pipeline cannot make a decision, because there is no decision-maker in it.
What agentic RAG changes#
Agentic RAG puts a model in charge of retrieval instead of wiring it in. Retrieval becomes a tool the agent can choose to call, rather than a step that always runs. That one change unlocks a lot:
The agent can skip retrieval entirely for questions that do not need it. It can decide what to search for, rewriting a messy question into a better query. It can look at what came back, judge whether it is actually relevant, and search again if it is not. And it can keep looping until it has enough to answer well.
The mental shift is from a pipeline to a control loop. Traditional RAG is a conveyor belt. Agentic RAG is someone standing at the belt deciding what to do next at each step.
This is the same loop from the agents post, just pointed at a knowledge base. The model thinks, optionally acts by retrieving, observes the result, and loops. Retrieval is no longer special. It is just another tool.
The patterns worth knowing#
A few named patterns come up again and again, and they stack.
Adaptive RAG routes the question first. A simple factual question might skip retrieval and use the model's own knowledge. A complex one triggers a vector search. A time-sensitive one routes to web search instead of your stale index.
Corrective RAG grades the retrieved documents before trusting them. If they score poorly for relevance, the system rewrites the query and tries again, or falls back to web search rather than answering from weak context.
Self-reflective RAG checks the generated answer, not just the documents. It asks whether the answer is actually grounded in the sources, and if it spots a hallucination, it regenerates with tighter constraints.
You do not need all three at once. Most real systems start with corrective RAG, because grading documents and retrying is where the biggest quality jump lives.
Building agentic RAG with LangGraph#
Let us build the corrective version. The agent decides whether to retrieve, grades what comes back, and rewrites the question and retries if the documents are weak. LangGraph is a good fit here because it lets you express RAG as a graph of nodes and edges, so retrying a step does not mean rewriting the whole thing.
Install and set up:
pip install -U langgraph "langchain[openai]" langchain-community langchain-text-splittersFirst, the retriever as a tool. This is the heart of it: retrieval is now something the model calls, not something that always fires.
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain.tools import tool
vectorstore = InMemoryVectorStore.from_documents(doc_splits, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
@tool
def retrieve_docs(query: str) -> str:
"""Search the knowledge base and return relevant passages."""
docs = retriever.invoke(query)
return "\n\n".join(d.page_content for d in docs)Next, the node that decides whether to retrieve at all. We bind the tool to the model and let it choose. If it calls the tool, we retrieve. If not, it answers directly.
from langgraph.graph import MessagesState
from langchain.chat_models import init_chat_model
model = init_chat_model("openai:gpt-4o", temperature=0)
def decide(state: MessagesState):
"""Let the model choose: retrieve, or answer directly."""
response = model.bind_tools([retrieve_docs]).invoke(state["messages"])
return {"messages": [response]}Now the corrective part: grade the retrieved documents. We use structured output to force a clean yes or no, then route based on it.
from pydantic import BaseModel, Field
from typing import Literal
class Grade(BaseModel):
relevant: str = Field(description="'yes' if the docs answer the question, else 'no'")
def grade_documents(state: MessagesState) -> Literal["answer", "rewrite"]:
"""Check whether retrieved docs are actually relevant."""
question = state["messages"][0].content
docs = state["messages"][-1].content
prompt = f"Question: {question}\n\nDocs: {docs}\n\nAre these relevant?"
score = model.with_structured_output(Grade).invoke(prompt).relevant
return "answer" if score == "yes" else "rewrite"If the grade comes back "no", a rewrite node reformulates the question and sends it back through the loop. If "yes", an answer node generates the final response from the context. Wiring those into a graph with conditional edges gives you the full corrective flow:
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, tools_condition
workflow = StateGraph(MessagesState)
workflow.add_node(decide)
workflow.add_node("retrieve", ToolNode([retrieve_docs]))
workflow.add_node(rewrite) # reformulates the question
workflow.add_node(answer) # generates the final answer
workflow.add_edge(START, "decide")
workflow.add_conditional_edges("decide", tools_condition, {"tools": "retrieve", END: END})
workflow.add_conditional_edges("retrieve", grade_documents) # -> "answer" or "rewrite"
workflow.add_edge("rewrite", "decide")
workflow.add_edge("answer", END)
graph = workflow.compile()Run a question through it and watch the difference. On a clean hit, it retrieves once and answers. On a bad hit, it grades the docs as irrelevant, rewrites the question, and tries again before it ever speaks. That retry loop is the whole point, and it is the thing traditional RAG simply cannot do.
Start the grader as a cheap binary yes/no with a small model. You do not need a fancy relevance score to get most of the benefit, and a tight structured output keeps it reliable and fast.
So which should you use?#
Use traditional RAG when your questions are predictable, your knowledge base is clean, and latency and cost matter more than handling edge cases. It is simpler, and simpler usually wins.
Reach for agentic RAG when questions are messy or multi-step, when one search is often not enough, when you need to pull from several sources, or when wrong answers are expensive. You pay for it in extra model calls and more moving parts, so make sure the quality jump is worth the latency.
The honest framing: agentic RAG is not "better RAG", it is RAG with a brain bolted on. That brain costs money and time. Add it when the questions are hard enough to need one.
Wrapping up#
Traditional RAG retrieves once on a fixed path. Agentic RAG turns retrieval into a decision the model makes, which lets it skip pointless searches, retry bad ones, grade what it finds, and check its own answers. Corrective RAG, where you grade documents and retry, is the pattern that pays off first, and the LangGraph example above is a working starting point.
Next in this series, I get into giving agents tools: how an agent actually does things, how to design tools it uses well, and how MCP lets you plug in tools you did not write.
Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.