RAG Chunking Strategies: Fixed vs Recursive vs Semantic (2026), Folarin Akinloye

Chunking is the least glamorous part of a RAG pipeline and the part that quietly caps how good the whole thing can get. You can swap embedding models and add a reranker all day, but if your chunks are bad, retrieval was doomed before the query arrived. This is how the main strategies actually compare, and a simple way to pick one without reading ten papers.

If you want the background on where chunks fit in the bigger picture, I covered that in what a vector database is and how RAG uses it. This post zooms in on the step before storage: how you cut the document up in the first place.

Why chunking matters more than people think

You embed chunks, not documents. So the chunk is the smallest unit your retriever can return. If a chunk is too big, its embedding becomes an average of several topics and matches nothing well. If it is too small, you retrieve a sentence with no surrounding context and the model has nothing to work with. Both failures look the same from the outside: the answer is wrong and you have no idea why.
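
To make the "average of several topics" point concrete, here is a toy sketch. It assumes a chunk's embedding behaves roughly like the mean of its topic vectors, which is a simplification, and the three topic directions are made up for illustration; real embeddings are nowhere near this clean.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean, standing in for 'one embedding for a multi-topic chunk'."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical, perfectly separated topic directions (assumption for the demo).
billing = [1.0, 0.0, 0.0]
shipping = [0.0, 1.0, 0.0]
returns = [0.0, 0.0, 1.0]

focused_chunk = billing                                  # chunk about one topic
mixed_chunk = mean_vector([billing, shipping, returns])  # chunk spanning three

query = billing  # a query that is purely about billing
focused_score = cosine(query, focused_chunk)
mixed_score = cosine(query, mixed_chunk)
print(round(focused_score, 3), round(mixed_score, 3))
```

Under these assumptions the focused chunk scores 1.0 and the mixed chunk about 0.577 against the same query, so the big chunk loses to any competitor that is even moderately on topic.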

The other thing people miss is that chunking interacts with your reranker and your context window. Tiny chunks mean you need to retrieve more of them to cover an answer, which costs reranking latency and context tokens. Big chunks waste context on irrelevant text. The size you pick ripples through the whole system.

The strategies, fastest to fanciest

Fixed-size chunking

Split every N tokens, optionally with some overlap. That is the whole idea. It is fast, predictable, and dumb. It will happily cut a sentence in half or split a code block down the middle.

Fixed-size is fine for one thing: uniform short-paragraph prose where every chunk looks roughly the same. Think product FAQs, marketing copy, a blog corpus. For anything with structure, mixed content, or long arguments that span paragraphs, it leaves quality on the table.

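def fixed_chunks(text, size=512, overlap=64):
    # tokenize() is a placeholder: use the tokenizer that matches your embedding model
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    tokens = tokenize(text)
    step = size - overlap  # each window starts this many tokens after the last
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]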
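
Here is a runnable variant you can poke at, as a sketch rather than a reference implementation. The whitespace `tokenize` is a stand-in I am assuming for the demo, and the early `break` is my addition: it stops once a window has reached the end of the document, so you do not emit a final chunk made only of tokens the previous chunk already contains.

```python
def tokenize(text):
    # Stand-in tokenizer (assumption): whitespace split instead of a real model tokenizer.
    return text.split()

def fixed_chunks(text, size=512, overlap=64):
    """Sliding windows of `size` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    tokens = tokenize(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already covers the tail; skip overlap-only leftovers
    return chunks

text = " ".join(f"t{i}" for i in range(10))  # ten fake tokens: t0 .. t9
chunks = fixed_chunks(text, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With ten tokens, a window of 4 and an overlap of 1, this yields three chunks, and the last token of each chunk is the first token of the next.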

Recursive character splitting

This is the one you should reach for first. It tries to split on the biggest natural boundary available (paragraphs), and if a piece is still too large, it recurses down to sentences, then words. So it respects structure when it can and only falls back to brute force when it has to.

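from langchain_text_splitters import RecursiveCharacterTextSplitter

# By default chunk_size counts characters, not tokens. from_tiktoken_encoder
# makes it count tokens (needs the tiktoken package); pick the encoding that
# is closest to your embedding model's tokenizer.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)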
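
If you want to see the idea without the library, here is a minimal sketch of the recursion. It is not LangChain's implementation: it measures length in characters, skips overlap entirely, and `recursive_split` and `max_len` are names I made up for the example.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_len characters, coarsest boundary first."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # This part alone is too big: recurse with the finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks

doc = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 100
pieces = recursive_split(doc, max_len=60)
print(len(pieces), max(len(p) for p in pieces))
```

The two short paragraphs stay together in one chunk because they fit, while the long run of words has no paragraph or sentence boundary and only gets cut on spaces. That is the "respect structure when you can" behaviour in miniature.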

In recent benchmarks recursive splitting keeps coming out on top for end-to-end accuracy. One February 2026 benchmark across 50 academic papers put recursive 512-token splitting first at 69% accuracy, ahead of semantic approaches. It is fast, cheap, has no extra model in the loop, and handles mixed document types. That combination is hard to beat.

Tip: If you only change one thing today, use recursive splitting at 512 tokens with a small overlap (about 10 to 15%). It is the best default for most corpora and you can tune from there.

Semantic chunking

Instead of splitting on character boundaries, you split on meaning. You embed sentences, walk through the document, and start a new chunk when the similarity between consecutive sentences drops below a threshold. The promise is chunks that line up with topic boundaries.

The reality is mixed. Semantic chunking can win on long-form prose where topics drift and markup is unreliable: research papers, transcripts, books. But it is slower (you are embedding every sentence before you even store anything), it adds a threshold you have to tune, and it can misfire badly. In that same 2026 benchmark, a naive semantic approach landed at 54% and produced fragments averaging just 43 tokens, which is too small to be useful. Done carefully it helps; done carelessly it shreds your documents.

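# sketch: split when adjacent-sentence similarity drops
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def semantic_chunks(sentences, embed, threshold=0.7):
    if not sentences:
        return []
    vecs = embed(sentences)  # embed: list of sentences -> one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:  # topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks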
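
One cheap guard against the tiny-fragment failure is to fold undersized chunks into a neighbour after splitting. This is a sketch of that idea, not something the benchmark prescribes: `merge_small_chunks` and `min_tokens` are hypothetical names, and counting tokens by whitespace is a stand-in for your real tokenizer.

```python
def merge_small_chunks(chunks, min_tokens=100, count_tokens=lambda s: len(s.split())):
    """Fold any chunk shorter than min_tokens into a neighbouring chunk."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) < min_tokens:
            # The previous chunk is still too small: keep growing it.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short final chunk has no successor, so attach it to the one before it.
    if len(merged) > 1 and count_tokens(merged[-1]) < min_tokens:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged

fragments = ["a b", "c", "d e f g", "h"]
result = merge_small_chunks(fragments, min_tokens=3)
print(result)
```

You give up a little boundary precision, since a merged chunk can straddle a topic shift, in exchange for chunks large enough to carry context. Whether that trade pays off is something to check against your own eval set.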

Late chunking

This one flips the order. Normally you chunk first, then embed each chunk in isolation, so a chunk loses any context from the rest of the document. Late chunking embeds the whole document (or a long window) first with a long-context embedding model, then pools the token embeddings into chunks afterward. Each chunk's vector still carries information from its neighbours, which helps when documents are full of pronouns and cross-references that only make sense in context.

Jina's work on late chunking showed gains on retrieval benchmarks that grow with document length, which fits the intuition: the longer the document, the more context a naively isolated chunk was throwing away. It needs a long-context embedding model to work, so it is not a drop-in for every stack, but it is worth knowing about when cross-references are breaking your retrieval.
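
The pooling step itself is small. The sketch below assumes you already have one contextual embedding per token from a long-context model run over the whole document, plus token-index spans from whatever chunker you like; `late_chunk_vectors` is a hypothetical helper, and mean pooling is the assumption I am making about how spans are collapsed.

```python
def late_chunk_vectors(token_embeddings, spans):
    """Mean-pool contextual token embeddings over each (start, end) token span.

    token_embeddings: list of per-token vectors for the whole document.
    spans: list of (start, end) token indices, end exclusive.
    """
    vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        vectors.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return vectors

# Four fake token vectors (made-up numbers) and two chunk spans.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk_vectors(token_embeddings, [(0, 2), (2, 4)])
print(chunk_vectors)
```

The difference from ordinary chunking is entirely upstream: because the token vectors were computed with the full document in view, each pooled chunk vector inherits that context instead of being embedded in isolation.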

A quick decision guide

Your situation	Use this
Uniform short paragraphs (FAQs, marketing)	Fixed-size at 512
Almost everything else, as a default	Recursive at 512 with overlap
Long-form prose where topic boundaries matter	Semantic, tuned carefully
Long documents where cross-references break retrieval	Late chunking with a long-context model
The LLM keeps lacking surrounding context	Bigger chunks, or hierarchical (small to retrieve, large to feed)

Start at recursive 512. Only move off it when you have a measured reason to.
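
The hierarchical option in the last row is easy to prototype. This is a rough sketch under my own assumptions: parents are lists of sentences, children are fixed groups of sentences that remember their parent, and `build_hierarchy` and `expand_to_parents` are hypothetical names. You index and search the children, then feed the parents to the LLM.

```python
def build_hierarchy(parents, child_size=3):
    """Split each parent (a list of sentences) into small child chunks tagged with parent_id."""
    children = []
    for parent_id, sentences in enumerate(parents):
        for i in range(0, len(sentences), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(sentences[i:i + child_size]),
            })
    return children

def expand_to_parents(hits, parents):
    """Map retrieved children to their parent text, deduplicated, keeping rank order."""
    seen, context = set(), []
    for child in hits:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(" ".join(parents[pid]))
    return context

parents = [["A1.", "A2.", "A3.", "A4."], ["B1.", "B2."]]
children = build_hierarchy(parents, child_size=2)

# Pretend the retriever ranked the children in this order.
hits = [children[1], children[0], children[2]]
context = expand_to_parents(hits, parents)
print(context)
```

Two hits from the same parent collapse into one context block, which keeps the prompt from filling up with near-duplicate text.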

Don't guess, measure

Every "best chunking strategy" claim depends on the corpus. The only way to know what works for yours is to score it. Build a small set of real questions with known correct passages, then measure recall@k: for what fraction of questions does the correct passage show up among the top k retrieved chunks? Change one variable at a time (size, overlap, strategy) and watch the number move.

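def recall_at_k(eval_set, retrieve, k=5):
    # eval_set: list of (question, gold_chunk_id) pairs
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_chunk_id in eval_set:
        retrieved = [c.id for c in retrieve(question, k=k)]
        hits += gold_chunk_id in retrieved  # True counts as 1
    return hits / len(eval_set)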
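
One catch: chunk ids change every time you re-chunk, so a gold chunk id is only valid for one configuration. A variant that survives re-chunking is to key the eval on the gold passage text and check whether any retrieved chunk contains it. This is a sketch with assumptions spelled out: `retrieve` returns chunk texts, the match is a plain substring test, and `recall_at_k_by_passage` and the toy retriever are made up for the example.

```python
def recall_at_k_by_passage(eval_set, retrieve, k=5):
    """eval_set: (question, gold_passage) pairs; retrieve(question, k=k) returns chunk texts."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_passage in eval_set:
        retrieved_texts = retrieve(question, k=k)
        # Count a hit if the gold passage sits fully inside any retrieved chunk.
        hits += any(gold_passage in text for text in retrieved_texts)
    return hits / len(eval_set)

corpus = ["Refunds take 5 days. Contact billing.", "Shipping is free over $50."]

def toy_retrieve(question, k=5):
    # Stand-in retriever (assumption): returns the first k chunks regardless of the question.
    return corpus[:k]

eval_set = [
    ("How long do refunds take?", "Refunds take 5 days."),
    ("Is there a student discount?", "Students get 10% off."),
]
score = recall_at_k_by_passage(eval_set, toy_retrieve, k=2)
print(score)
```

Substring matching is strict: a gold passage that gets split across two chunks counts as a miss, and whitespace differences will break it. Keep gold passages short, or swap in a character-span overlap check if that bites.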

This is the same discipline I lean on everywhere in RAG and agents. If you are not measuring retrieval, you are tuning blind. I go deeper on the evaluation side in evaluating agents with LangSmith.

Wrapping up

Use recursive splitting at 512 tokens with light overlap as your default. Reach for semantic chunking on long-form prose where it earns its cost, and late chunking when documents lean hard on context that isolated chunks lose. Above all, build a tiny recall eval before you tune, because chunking choices that feel obvious are often wrong on your specific data.

Next in this series I look at the step after retrieval: reranking, and when a cross-encoder is worth the latency.
