Production RAG, Part 1: Chunking That Actually Works
Why naive chunking quietly wrecks retrieval, and how to fix it
Everyone blames the model. The retrieval returned junk, the answer was wrong, so the LLM must be the problem. Nine times out of ten, the problem started earlier, at chunking. If the right information never makes it into a retrievable chunk, no model can save you.
This is Part 1 of a two-part series on production RAG. Here we fix the foundation: how you split documents. Part 2 covers how you measure whether your changes actually helped.
"Chunking" is how you break source documents into the units you embed and retrieve. Get it wrong and every downstream component inherits the damage.
Why Naive Chunking Fails
The default everyone starts with is fixed-size character splitting:
def naive_chunks(text: str, size: int = 1000) -> list[str]:
    # Slice every `size` characters, ignoring sentence and section boundaries.
    return [text[i : i + size] for i in range(0, len(text), size)]

It's fast, it's simple, and it quietly destroys meaning. A 1,000-character window cuts mid-sentence, splits a table from its header, and severs a code block from the paragraph explaining it. The embedding for "...the timeout defaults to 30 seconds. However" carries half a thought.
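To make the damage concrete, here is a small sketch that runs the same splitter over a made-up three-sentence string and counts the chunks that end mid-sentence. The sample text, the window size of 40, and the `ends_mid_sentence` helper are illustrative assumptions, not part of any library.

```python
def naive_chunks(text: str, size: int = 1000) -> list[str]:
    """Fixed-size character splitting, as in the snippet above."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def ends_mid_sentence(chunk: str) -> bool:
    """Rough heuristic: a chunk that does not end in '.', '!' or '?' was cut mid-sentence."""
    return not chunk.rstrip().endswith((".", "!", "?"))


# Illustrative sample text, not taken from a real document.
sample = (
    "The timeout defaults to 30 seconds. "
    "However, long-running jobs can override it. "
    "Set the value in the client config."
)

chunks = naive_chunks(sample, size=40)
cut = [c for c in chunks if ends_mid_sentence(c)]

for c in chunks:
    print(repr(c))
print(f"{len(cut)} of {len(chunks)} chunks end mid-sentence")
```

The first chunk stops partway through "However", which is exactly the half-thought that ends up in the embedding.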
Three failure modes show up again and again:
- Context fragmentation: a single idea is spread across two chunks, so neither ranks well for the query.
- Topic blending: one chunk spans two unrelated sections, diluting its embedding.
- Orphaned references: "as shown above" or "this function" with no antecedent in the chunk.
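Two of these failure modes leave traces you can check for cheaply. The sketch below is a rough heuristic audit, assuming English prose: it flags chunks that appear to start or end mid-sentence (a sign of context fragmentation) and chunks that contain a dangling back-reference. The phrase list and the `audit_chunk` name are assumptions for illustration, and topic blending is not detected at all.

```python
import re

# Phrases that usually point at something outside the chunk.
# This list is an illustrative assumption; extend it for your own corpus.
DANGLING = re.compile(
    r"\b(as (shown|described|noted|mentioned) (above|earlier|below)|see above)\b",
    re.IGNORECASE,
)
SENTENCE_END = (".", "!", "?", ":")


def audit_chunk(chunk: str) -> list[str]:
    """Return warning labels for one chunk; an empty list means nothing was flagged."""
    warnings = []
    stripped = chunk.strip()
    if not stripped:
        return ["empty chunk"]
    if stripped[0].islower():
        warnings.append("starts mid-sentence")
    if not stripped.endswith(SENTENCE_END):
        warnings.append("ends mid-sentence")
    if DANGLING.search(stripped):
        warnings.append("dangling reference")
    return warnings


print(audit_chunk("The timeout defaults to 30 seconds. However"))
print(audit_chunk("as shown above, the client retries twice."))
```

Expect false positives on list items, code, and headings. The value is in comparing the flag rate before and after a chunking change, not in any single flag.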
Structure-Aware Splitting
The fix is to split along the document's natural boundaries (headings, paragraphs, list items, code fences) and only fall back to size limits within those.
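from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are counted in characters by default (length_function=len).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    # Tried in order: headings, blank lines, line breaks, sentences, then words.
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)  # `document` is the source text as one string

The separators list is the important part: the splitter tries to break on headings first, then blank lines, then single line breaks, then sentences, and only falls back to splitting between words as a last resort. Order matters: list the most semantically meaningful boundaries first.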
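To show why separator order matters, here is a dependency-free sketch of the recursive idea. It is a simplified illustration, not LangChain's implementation: it has no overlap, and `recursive_split` is a hypothetical name. It splits on the first separator, greedily merges neighbouring pieces up to the size limit, and only recurses with the finer separators for pieces that are still too long.

```python
def recursive_split(text: str, separators: list[str], max_len: int) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No boundaries left to try: fall back to a hard cut.
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]

    head, *rest = separators
    parts = text.split(head)
    # Re-attach the separator so headings keep their "## " marker.
    parts = [parts[0]] + [head + p for p in parts[1:]]

    chunks: list[str] = []
    buffer = ""
    for part in parts:
        if len(buffer) + len(part) <= max_len:
            buffer += part  # still fits: merge with the previous piece
            continue
        if buffer.strip():
            chunks.append(buffer)
        buffer = ""
        if len(part) <= max_len:
            buffer = part
        else:
            # Too long even on its own: try the next, finer separator.
            chunks.extend(recursive_split(part, rest, max_len))
    if buffer.strip():
        chunks.append(buffer)
    return chunks


# Illustrative document; headings and text are made up.
doc = (
    "# Guide\n\nIntro paragraph."
    "\n## Setup\n\nInstall the package. Then configure it."
    "\n## Usage\n\nCall the client."
)
for chunk in recursive_split(doc, ["\n## ", "\n\n", ". ", " "], max_len=60):
    print(repr(chunk))
```

With this input each section lands in its own chunk, because the heading separator is tried before anything finer. Put " " first in the list and the same text is split at word boundaries instead, with no regard for sections.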
Overlap: a small insurance policy
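A modest overlap (10–15% of chunk size) keeps ideas that straddle a boundary recoverable. Too much overlap inflates your index and retrieves near-duplicates; too little reintroduces fragmentation. Start at about 120 for a chunk size of 800 (15%) and tune from there. Keep the unit consistent with your splitter: RecursiveCharacterTextSplitter counts characters unless you give it a token-based length function, while the size ranges later in this post are in tokens.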
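The index-inflation cost is easy to estimate before you re-embed anything. The sketch below computes window start offsets for a given size and overlap. It assumes plain fixed-size windows and a hypothetical 8,000-unit document, and it works the same whether the unit is characters or tokens; `window_starts` is an illustrative helper, not a library function.

```python
def window_starts(total: int, size: int, overlap: int) -> list[int]:
    """Start offsets of fixed-size windows that share `overlap` units with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if total <= 0:
        return []
    step = size - overlap
    # A new window is needed while the previous one has not reached the end.
    return list(range(0, max(total - overlap, 1), step))


doc_len = 8000  # hypothetical document length
plain = window_starts(doc_len, size=800, overlap=0)
overlapped = window_starts(doc_len, size=800, overlap=120)
print(len(plain), len(overlapped))
```

For this document that is 10 windows without overlap and 12 with it, so a 15% overlap grows the index by roughly 20% here. Structure-aware splitters will not produce exactly these counts, but the ratio is a useful first estimate.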
Metadata Is Half the Battle
A chunk is not just text. Attach where it came from, and retrieval becomes filterable and citable.
chunk = {
    "text": body,  # the chunk text produced by the splitter
    "metadata": {
        "doc_id": "rag-guide",
        "title": "Production RAG",
        "section": "Structure-Aware Splitting",
        "heading_path": ["Production RAG", "Structure-Aware Splitting"],
        "source_url": "https://folarin.dev/blog/production-rag-chunking-that-works",
    },
}

That heading_path lets you prepend section context to the embedded text, a trick that tends to improve retrieval for section-specific queries:
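def contextualize(chunk: dict) -> str:
    # Prefix the chunk with its heading trail, e.g. "Production RAG > Structure-Aware Splitting".
    path = " > ".join(chunk["metadata"]["heading_path"])
    return f"{path}\n\n{chunk['text']}"

Now the embedded text for a chunk about timeouts that sits under a "Core Architecture" section also carries "Production RAG > Core Architecture", so a query about "RAG architecture timeouts" lands closer.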
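The heading_path has to come from somewhere. One way to build it, sketched below, is to walk the document line by line and keep a stack of the headings seen so far. This assumes Markdown input with "#"-style headings; `sections_with_paths` is a hypothetical helper, and lines inside code fences are skipped so that Python comments are not mistaken for headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code fence marker


def sections_with_paths(markdown: str) -> list[dict]:
    """Split Markdown into sections and record each section's heading path."""
    sections: list[dict] = []
    path: list[str] = []  # current heading stack, outermost first
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"text": text, "metadata": {"heading_path": list(path)}})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1 :]  # drop headings at this level or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return sections


# Illustrative document; the headings mirror the example in the text.
doc = (
    "# Production RAG\nIntro.\n"
    "## Core Architecture\nThe timeout defaults to 30 seconds.\n"
    "## Chunking\n### Overlap\nKeep it modest."
)
for section in sections_with_paths(doc):
    print(" > ".join(section["metadata"]["heading_path"]), "|", section["text"])
```

Each returned dict has the same "text" and "metadata" shape as the chunk above, so it can be passed straight to contextualize, or split further by size first with the path copied onto every child chunk.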
A Practical Recipe
| Content type | Strategy | Chunk size |
|---|---|---|
| Prose / docs | Recursive, heading-aware | 600–900 tokens |
| Code-heavy | Split on function/class, keep fences intact | 400–700 tokens |
| Tables / structured | One row-group per chunk + header in metadata | varies |
| Transcripts | Split on speaker turns or timestamps | 500–800 tokens |
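As one example from the table, here is a minimal sketch of the tables row, assuming the table has already been parsed into a header list and row lists. The `table_chunks` name, the group size, and the "column: value" rendering of each row are illustrative choices rather than a fixed recipe.

```python
def table_chunks(
    header: list[str], rows: list[list[str]], rows_per_chunk: int = 20
) -> list[dict]:
    """Emit one chunk per group of rows, carrying the header in metadata."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start : start + rows_per_chunk]
        # Render each row as "column: value" pairs so it reads without the header line.
        lines = [
            "; ".join(f"{col}: {val}" for col, val in zip(header, row))
            for row in group
        ]
        chunks.append(
            {
                "text": "\n".join(lines),
                "metadata": {
                    "header": header,
                    "row_start": start,
                    "row_end": start + len(group) - 1,
                },
            }
        )
    return chunks


# Illustrative settings table; the values are made up.
header = ["name", "default"]
rows = [
    ["timeout", "30"],
    ["retries", "3"],
    ["backoff", "2.0"],
    ["pool_size", "10"],
    ["verify_tls", "true"],
]
for chunk in table_chunks(header, rows, rows_per_chunk=2):
    print(chunk["metadata"]["row_start"], repr(chunk["text"]))
```

Repeating the column names inside each row costs a few tokens, but it means no row is ever embedded as a bare list of values.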
Embed at the chunk level, but store a pointer to the parent document. At answer time you can optionally expand a retrieved chunk back to its surrounding section for richer context, a pattern often called "small-to-big" retrieval.
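A minimal sketch of that expansion step follows. It assumes retrieval has already returned ranked chunks and that parent sections live in a simple in-memory dict; the `Chunk` dataclass, `expand_to_parents`, and the IDs are hypothetical stand-ins for whatever your vector store and document store provide.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str  # pointer back to the section or document the chunk came from
    text: str


def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Swap retrieved chunks for their parent text, de-duplicated, in rank order."""
    seen: set[str] = set()
    contexts: list[str] = []
    for hit in hits:
        if hit.parent_id in seen:
            continue  # two hits from the same section: include the section once
        seen.add(hit.parent_id)
        # Fall back to the chunk itself if the parent is missing from the store.
        contexts.append(parents.get(hit.parent_id, hit.text))
    return contexts


# Illustrative data; a real pipeline would get `hits` from the retriever.
parents = {"sec-1": "Full text of section 1", "sec-2": "Full text of section 2"}
hits = [
    Chunk("c1", "sec-1", "first hit"),
    Chunk("c2", "sec-1", "second hit, same section"),
    Chunk("c3", "sec-2", "third hit"),
]
print(expand_to_parents(hits, parents))
```

De-duplicating by parent matters here: with overlapping chunks, several top hits often come from the same section, and expanding each one separately would fill the prompt with repeated text.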
What's Next
Good chunking is necessary but not sufficient. You still need to know whether a change helped or just felt better. In Part 2 we build a lightweight evaluation harness (hit rate, MRR, and faithfulness) so every tweak to your pipeline is backed by a number, not a vibe.
Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.