Production RAG, Part 1: Chunking That Actually Works, Folarin Akinloye

Everyone blames the model. The retrieval returned junk, the answer was wrong, so the LLM must be the problem. Nine times out of ten, the problem started earlier, at chunking. If the right information never makes it into a retrievable chunk, no model can save you.

This is Part 1 of a two-part series on production RAG. Here we fix the foundation: how you split documents. Part 2 covers how you measure whether your changes actually helped.

Note

"Chunking" is how you break source documents into the units you embed and retrieve. Get it wrong and every downstream component inherits the damage.

Why Naive Chunking Fails#

The default everyone starts with is fixed-size character splitting:

def naive_chunks(text: str, size: int = 1000) -> list[str]:
    return [text[i : i + size] for i in range(0, len(text), size)]

It's fast, it's simple, and it quietly destroys meaning. A 1,000-character window cuts mid-sentence, splits a table from its header, and severs a code block from the paragraph explaining it. The embedding for "...the timeout defaults to 30 seconds. However" carries half a thought.

Three failure modes show up again and again:

Context fragmentation: a single idea is spread across two chunks, so neither ranks well for the query.
Topic blending: one chunk spans two unrelated sections, diluting its embedding.
Orphaned references: "as shown above" or "this function" with no antecedent in the chunk.

Structure-Aware Splitting#

The fix is to split along the document's natural boundaries (headings, paragraphs, list items, code fences) and only fall back to size limits within those.

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)

The separators list is the important part: the splitter tries to break on headings first, then blank lines, then sentences, and only resorts to mid-word splits as a last resort. Order matters, most semantic first.

Overlap: a small insurance policy#

A modest overlap (10–15% of chunk size) keeps ideas that straddle a boundary recoverable. Too much overlap inflates your index and retrieves near-duplicates; too little reintroduces fragmentation. Start at ~120 tokens for an 800-token chunk and tune from there.

Metadata Is Half the Battle#

A chunk is not just text. Attach where it came from, and retrieval becomes filterable and citable.

chunk = {
    "text": body,
    "metadata": {
        "doc_id": "rag-guide",
        "title": "Production RAG",
        "section": "Structure-Aware Splitting",
        "heading_path": ["Production RAG", "Structure-Aware Splitting"],
        "source_url": "https://folarin.dev/blog/production-rag-chunking-that-works",
    },
}

That heading_path lets you prepend section context to the embedded text, a trick that measurably improves retrieval:

def contextualize(chunk: dict) -> str:
    path = " > ".join(chunk["metadata"]["heading_path"])
    return f"{path}\n\n{chunk['text']}"

Now the embedding for a chunk about timeouts also carries "Production RAG > Core Architecture," so a query about "RAG architecture timeouts" lands closer.

A Practical Recipe#

Content type	Strategy	Chunk size
Prose / docs	Recursive, heading-aware	600–900 tokens
Code-heavy	Split on function/class, keep fences intact	400–700 tokens
Tables / structured	One row-group per chunk + header in metadata	varies
Transcripts	Split on speaker turns or timestamps	500–800 tokens

Tip

Embed at the chunk level, but store a pointer to the parent document. At answer time you can optionally expand a retrieved chunk back to its surrounding section for richer context, "small-to-big" retrieval.

What's Next#

Good chunking is necessary but not sufficient. You still need to know whether a change helped or just felt better. In Part 2 we build a lightweight evaluation harness (hit rate, MRR, and faithfulness) so every tweak to your pipeline is backed by a number, not a vibe.

This is Part 1 of a two-part series on production RAG. Here we fix the foundation: how you split documents. Part 2 covers how you measure whether your changes actually helped.

Note

"Chunking" is how you break source documents into the units you embed and retrieve. Get it wrong and every downstream component inherits the damage.

Why Naive Chunking Fails#

The default everyone starts with is fixed-size character splitting:

def naive_chunks(text: str, size: int = 1000) -> list[str]:
    return [text[i : i + size] for i in range(0, len(text), size)]

Three failure modes show up again and again:

Context fragmentation: a single idea is spread across two chunks, so neither ranks well for the query.
Topic blending: one chunk spans two unrelated sections, diluting its embedding.
Orphaned references: "as shown above" or "this function" with no antecedent in the chunk.

Structure-Aware Splitting#

The fix is to split along the document's natural boundaries (headings, paragraphs, list items, code fences) and only fall back to size limits within those.

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)

Overlap: a small insurance policy#

Metadata Is Half the Battle#

A chunk is not just text. Attach where it came from, and retrieval becomes filterable and citable.

chunk = {
    "text": body,
    "metadata": {
        "doc_id": "rag-guide",
        "title": "Production RAG",
        "section": "Structure-Aware Splitting",
        "heading_path": ["Production RAG", "Structure-Aware Splitting"],
        "source_url": "https://folarin.dev/blog/production-rag-chunking-that-works",
    },
}

That heading_path lets you prepend section context to the embedded text, a trick that measurably improves retrieval:

def contextualize(chunk: dict) -> str:
    path = " > ".join(chunk["metadata"]["heading_path"])
    return f"{path}\n\n{chunk['text']}"

Now the embedding for a chunk about timeouts also carries "Production RAG > Core Architecture," so a query about "RAG architecture timeouts" lands closer.

A Practical Recipe#

Content type	Strategy	Chunk size
Prose / docs	Recursive, heading-aware	600–900 tokens
Code-heavy	Split on function/class, keep fences intact	400–700 tokens
Tables / structured	One row-group per chunk + header in metadata	varies
Transcripts	Split on speaker turns or timestamps	500–800 tokens

Tip

Production RAG, Part 1: Chunking That Actually Works

Why Naive Chunking Fails#

Structure-Aware Splitting#

Overlap: a small insurance policy#

Metadata Is Half the Battle#

A Practical Recipe#

What's Next#

Related articles

Production RAG, Part 2: Measuring Retrieval Quality

Choosing a Vector Database in 2026 (Without the Hype)

A FastAPI Backend for LLM Apps: Streaming, Tools, and Sanity

Production RAG, Part 1: Chunking That Actually Works

Why Naive Chunking Fails#

Structure-Aware Splitting#

Overlap: a small insurance policy#

Metadata Is Half the Battle#

A Practical Recipe#

What's Next#

Related articles

Production RAG, Part 2: Measuring Retrieval Quality

Choosing a Vector Database in 2026 (Without the Hype)

A FastAPI Backend for LLM Apps: Streaming, Tools, and Sanity