Chunking Strategies for RAG: Fixed, Recursive, Semantic, and How to Choose
Why recursive splitting is the right default, and the few cases where you should reach for something fancier
Chunking is the least glamorous part of a RAG pipeline and the part that quietly caps how good the whole thing can get. You can swap embedding models and add a reranker all day, but if your chunks are bad, retrieval was doomed before the query arrived. This is how the main strategies actually compare, and a simple way to pick one without reading ten papers.
If you want the background on where chunks fit in the bigger picture, I covered that in what a vector database is and how RAG uses it. This post zooms in on the step before storage: how you cut the document up in the first place.
Why chunking matters more than people think
You embed chunks, not documents. So the chunk is the smallest unit your retriever can return. If a chunk is too big, its embedding becomes an average of several topics and matches nothing well. If it is too small, you retrieve a sentence with no surrounding context and the model has nothing to work with. Both failures look the same from the outside: the answer is wrong and you have no idea why.
The other thing people miss is that chunking interacts with your reranker and your context window. Tiny chunks mean you need to retrieve more of them to cover an answer, which costs reranking latency and context tokens. Big chunks waste context on irrelevant text. The size you pick ripples through the whole system.
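To put rough numbers on that ripple effect, here is a back-of-the-envelope sketch. It assumes the answer lives in one contiguous span of the document and that chunks are fixed-size with no overlap; `context_cost` and its parameters are illustrative names, not part of any library.

```python
def context_cost(answer_span_tokens, chunk_size):
    """Worst-case cost of covering one contiguous answer span.

    Assumes fixed-size chunks with no overlap, and that the span can start
    at the least convenient offset inside a chunk. Illustrative only.
    """
    # A span of s tokens starting at the last token of a chunk touches
    # floor((s + c - 2) / c) + 1 chunks of size c.
    chunks_needed = (answer_span_tokens + chunk_size - 2) // chunk_size + 1
    context_tokens = chunks_needed * chunk_size
    wasted_tokens = context_tokens - answer_span_tokens
    return chunks_needed, context_tokens, wasted_tokens


for size in (64, 512, 2048):
    print(size, context_cost(300, size))
```

For a 300-token answer span this gives 6 chunks at size 64 (more to retrieve and rerank), 2 chunks at size 512, and 2 chunks at size 2048, where 3,796 of the 4,096 context tokens are not part of the answer.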
The strategies, fastest to fanciest
Fixed-size chunking
Split every N tokens, optionally with some overlap. That is the whole idea. It is fast, predictable, and dumb. It will happily cut a sentence in half or split a code block down the middle.
Fixed-size is fine for one thing: uniform short-paragraph prose where every chunk looks roughly the same. Think product FAQs, marketing copy, a blog corpus. For anything with structure, mixed content, or long arguments that span paragraphs, it leaves quality on the table.
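def fixed_chunks(text, size=512, overlap=64, tokenize=str.split):
    # tokenize defaults to whitespace splitting as a stand-in;
    # pass your embedding model's tokenizer for real token counts
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = tokenize(text)
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
Recursive character splitting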
This is the one you should reach for first. It tries to split on the biggest natural boundary available (paragraphs), and if a piece is still too large, it recurses down to sentences, then words. So it respects structure when it can and only falls back to brute force when it has to.
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Note: chunk_size and chunk_overlap are measured with length_function, which
# defaults to len (characters). To size chunks in tokens, build the splitter
# with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
In recent benchmarks recursive splitting keeps coming out on top for end-to-end accuracy. One February 2026 benchmark across 50 academic papers put recursive 512-token splitting first at 69% accuracy, ahead of semantic approaches. It is fast, cheap, has no extra model in the loop, and handles mixed document types. That combination is hard to beat.
If you only change one thing today, use recursive splitting at 512 tokens with a small overlap (about 10 to 15%). It is the best default for most corpora and you can tune from there.
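For intuition about the recursion described above, here is a dependency-free sketch. It is not LangChain's implementation: it measures size in characters, has no overlap, and drops the separator wherever it cuts. The function name and defaults are assumptions chosen for illustration.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and only recurses to finer
    separators for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily pack pieces together while the result still fits.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "First paragraph here. It has two sentences.\n\nSecond paragraph.\n\n" + "z" * 40
for chunk in recursive_split(sample, chunk_size=30):
    print(len(chunk), repr(chunk))
```

The sample shows all three behaviours: the first paragraph is too long and gets cut at the sentence boundary, the second fits and stays whole, and the run of 40 characters with no separators falls through to a hard cut.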
Semantic chunking
Instead of splitting on character boundaries, you split on meaning. You embed sentences, walk through the document, and start a new chunk when the similarity between consecutive sentences drops below a threshold. The promise is chunks that line up with topic boundaries.
The reality is mixed. Semantic chunking can win on long-form prose where topics drift and markup is unreliable: research papers, transcripts, books. But it is slower (you are embedding every sentence before you even store anything), it adds a threshold you have to tune, and it can misfire badly. In that same 2026 benchmark, a naive semantic approach landed at 54% and produced fragments averaging just 43 tokens, which is too small to be useful. Done carefully it helps; done carelessly it shreds your documents.
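One possible guard against the tiny-fragment failure is to merge undersized chunks after splitting. The sketch below is an assumption about what "done carefully" could look like, not a method from the benchmark: `min_tokens` and the whitespace-based `count_tokens` are placeholders you would replace with your own threshold and tokenizer.

```python
def merge_small_chunks(chunks, min_tokens=100, count_tokens=lambda s: len(s.split())):
    """Grow each chunk until it reaches min_tokens by absorbing the chunks after it.

    There is no upper size limit here, so pair this with a maximum
    chunk size check in a real pipeline.
    """
    merged = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) < min_tokens:
            # The previous chunk is still too small: append this one to it.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk has nothing after it, so fold it backwards.
    if len(merged) > 1 and count_tokens(merged[-1]) < min_tokens:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged


fragments = ["a b", "c", "d e f g", "h"]
print(merge_small_chunks(fragments, min_tokens=3))
```

The trade-off is that merging blurs some of the topic boundaries the semantic splitter found, so it is worth checking the effect on your recall numbers rather than assuming it helps.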
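import math
def cosine(a, b):
    # plain cosine similarity; returns 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
# sketch: split when adjacent-sentence similarity drops
def semantic_chunks(sentences, embed, threshold=0.7):
    if not sentences:
        return []
    vecs = embed(sentences)  # embed: list of sentences -> list of vectors
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
Late chunking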
This one flips the order. Normally you chunk first, then embed each chunk in isolation, so a chunk loses any context from the rest of the document. Late chunking embeds the whole document (or a long window) first with a long-context embedding model, then pools the token embeddings into chunks afterward. Each chunk's vector still carries information from its neighbours, which helps when documents are full of pronouns and cross-references that only make sense in context.
Jina's work on late chunking showed gains on retrieval benchmarks that grow with document length, which fits the intuition: the longer the document, the more context a naively isolated chunk was throwing away. It needs a long-context embedding model to work, so it is not a drop-in for every stack, but it is worth knowing about when cross-references are breaking your retrieval.
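The pooling step is the part that differs from ordinary chunking, and it is small enough to sketch. This assumes you already have one contextualised vector per token from a single pass of a long-context embedding model over the whole document (how you get those depends on the model and is not shown), and it assumes mean pooling. `late_chunk_vectors` and its arguments are illustrative names.

```python
def late_chunk_vectors(token_embeddings, chunk_spans):
    """Mean-pool contextualised token embeddings into one vector per chunk.

    token_embeddings: list of per-token vectors from ONE forward pass
                      over the whole document (assumed to be given).
    chunk_spans:      list of (start, end) token indices, end exclusive.
    """
    vectors = []
    for start, end in chunk_spans:
        tokens = token_embeddings[start:end]
        if not tokens:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(tokens[0])
        # Average each dimension across the tokens in this span.
        vectors.append([sum(t[d] for t in tokens) / len(tokens) for d in range(dim)])
    return vectors


# Toy input: four tokens with 2-dimensional embeddings, split into two chunks.
toy_tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(late_chunk_vectors(toy_tokens, [(0, 2), (2, 4)]))
```

The chunk boundaries themselves can come from any splitter. What changes is that each token vector was computed with the rest of the document in view before it was averaged.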
A quick decision guide
| Your situation | Use this |
|---|---|
| Uniform short paragraphs (FAQs, marketing) | Fixed-size at 512 |
| Almost everything else, as a default | Recursive at 512 with overlap |
| Long-form prose where topic boundaries matter | Semantic, tuned carefully |
| Long documents where cross-references break retrieval | Late chunking with a long-context model |
| The LLM keeps lacking surrounding context | Bigger chunks, or hierarchical (small to retrieve, large to feed) |
Start at recursive 512. Only move off it when you have a measured reason to.
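The hierarchical row in the table ("small to retrieve, large to feed") can be sketched as a parent-child mapping. This is a minimal illustration under assumptions: children are fixed word windows, `child_size` is arbitrary, and the retrieval step that picks the children is left out. None of the names come from a specific library.

```python
def build_parent_child_index(parents, child_size=50):
    """Split each parent chunk into small children that remember their parent."""
    children = []
    for parent_id, parent_text in enumerate(parents):
        words = parent_text.split()
        for i in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[i:i + child_size]),
            })
    return children


def expand_to_parents(hit_children, parents):
    """Map retrieved children back to unique parent chunks, keeping rank order."""
    seen, result = set(), []
    for child in hit_children:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            result.append(parents[pid])
    return result


parents = ["a b c d", "e f"]
children = build_parent_child_index(parents, child_size=2)
# Pretend the retriever returned these children, best match first.
hits = [children[1], children[0], children[2]]
print(expand_to_parents(hits, parents))
```

You embed and search the children, which keeps each vector focused on one idea, then pass the deduplicated parents to the LLM so it sees the surrounding context.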
Don't guess, measure
Every "best chunking strategy" claim depends on the corpus. The only way to know what works for yours is to score it. Build a small set of real questions with known correct passages, then measure recall@k: for what fraction of questions does the right passage show up in the top k retrieved chunks? Change one variable at a time (size, overlap, strategy) and watch the number move.
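def recall_at_k(eval_set, retrieve, k=5):
    # eval_set: (question, gold_chunk_id) pairs; retrieve returns objects with an .id
    # gold ids are tied to one chunking, so re-derive them whenever the chunks change
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_chunk_id in eval_set:
        retrieved = [c.id for c in retrieve(question, k=k)]
        hits += gold_chunk_id in retrieved
    return hits / len(eval_set)
This is the same discipline I lean on everywhere in RAG and agents. If you are not measuring retrieval, you are tuning blind. I go deeper on the evaluation side in evaluating agents with LangSmith.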
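To make "change one variable at a time" concrete, here is a small sweep harness. Because chunk ids change whenever the chunking changes, this sketch scores against a short gold snippet of text instead of a chunk id. Everything in it is a stand-in: `word_chunks` is a whitespace-token chunker, `toy_retrieve` ranks by shared words rather than embeddings, and the corpus and question are made up.

```python
def word_chunks(text, size, overlap):
    """Fixed-size chunks over whitespace tokens (stand-in for your real chunker)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def toy_retrieve(question, chunks, k):
    """Stand-in retriever: rank chunks by number of shared lowercase words."""
    q = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]


def snippet_recall_at_k(eval_set, chunks, retrieve, k=5):
    """Fraction of questions whose gold snippet appears inside a top-k chunk."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        any(gold in chunk for chunk in retrieve(question, chunks, k))
        for question, gold in eval_set
    )
    return hits / len(eval_set)


def sweep(corpus, eval_set, configs, retrieve=toy_retrieve, k=5):
    """Score each (size, overlap) pair on the same eval set."""
    results = {}
    for size, overlap in configs:
        chunks = word_chunks(corpus, size, overlap)
        results[(size, overlap)] = snippet_recall_at_k(eval_set, chunks, retrieve, k)
    return results


corpus = "alpha beta gamma delta epsilon zeta eta theta"
eval_set = [("what about gamma delta", "gamma delta")]
print(sweep(corpus, eval_set, configs=[(4, 0), (3, 0)], k=1))
```

In this toy run the size-3 configuration scores 0.0 because the gold snippet straddles a chunk boundary, which is exactly the kind of failure the sweep is meant to surface. Gold snippets need the same whitespace normalisation as the chunks for the substring check to work.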
Wrapping up
Use recursive splitting at 512 tokens with light overlap as your default. Reach for semantic chunking on long-form prose where it earns its cost, and late chunking when documents lean hard on context that isolated chunks lose. Above all, build a tiny recall eval before you tune, because chunking choices that feel obvious are often wrong on your specific data.
Next in this series I look at the step after retrieval: reranking, and when a cross-encoder is worth the latency.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.