Building a Multi-Tenant RAG System: Isolation, Per-Tenant Indexes, and the Leaks Nobody Plans For
The day one customer retrieves another customer's documents, it is over. Here is how to make sure that never happens.
There is one bug in a multi-tenant RAG system that ends the conversation: customer A asks a question and gets back a chunk of customer B's document. Everything else is a performance tuning problem. That one is a breach. So before any of the fun retrieval work, you have to decide how tenants are kept apart, and you have to make that decision hold across every part of the pipeline, not just the database.
This post is about building RAG for a SaaS product where many customers share the same deployment. I will cover the three isolation models, when each is right, the namespace-versus-metadata question that decides your cost and your blast radius, and the places isolation quietly leaks even when the database is configured correctly.
Three ways to keep tenants apart
There are really only three shapes, and the industry names them silo, pool, and bridge.
Silo: one index (or database) per tenant. Strongest isolation. Tenant A's data physically lives somewhere tenant B's queries cannot reach. A forgotten filter cannot leak data, because there is no shared store to leak from; the remaining risk is routing a query to the wrong tenant's index. The cost is operational: thousands of tenants means thousands of indexes to provision, migrate, and pay for, and a lot of them sit nearly empty.
Pool: one shared index, partitioned by a tenant key. Every tenant's vectors live in the same index, tagged with a tenant ID, and every query filters on that ID. Cheap and simple to operate: one index to manage, onboarding a tenant is a single write, offboarding is a single delete. The risk is that isolation now depends entirely on your code applying the filter correctly, every single time. Forget it once and you have a leak.
Bridge: a mix. Pool the small tenants to keep costs down, silo the big or high-compliance ones who need (or contractually demand) hard separation. This is what most mature SaaS products land on, because the tenant distribution is usually a long tail of small accounts and a handful of large ones.
| Model | Isolation strength | Cost at scale | Best for |
|---|---|---|---|
| Silo | Strongest (physical) | High | Enterprise, regulated, few large tenants |
| Pool | Logical, code-dependent | Low | Many small tenants, SMB SaaS |
| Bridge | Per-tenant choice | Medium | Mixed customer base (most products) |
Do not start by siloing everyone "to be safe". A silo-per-tenant model with 5,000 tenants is an operations nightmare that will hurt you long before a pool model's filter bug would. Start pooled with disciplined enforcement, and promote individual tenants to silos when their size or compliance needs justify it.
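To make the bridge model concrete, here is a minimal sketch of the routing decision: given a tenant and its tier, pick the index and namespace its queries must use. Every name here (`Tier`, `RetrievalTarget`, `resolve_target`, the `rag-shared` and `rag-<tenant>` index names) is hypothetical, chosen for illustration rather than taken from any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    POOLED = "pooled"
    SILOED = "siloed"


@dataclass(frozen=True)
class RetrievalTarget:
    index_name: str
    namespace: str


# Hypothetical name for the one index that all pooled tenants share.
SHARED_INDEX = "rag-shared"


def resolve_target(tenant_id: str, tiers: dict[str, Tier]) -> RetrievalTarget:
    """Map a tenant to the index and namespace its queries must use."""
    if tenant_id not in tiers:
        # Unknown tenants get an error, never a fallback index.
        raise KeyError(f"unknown tenant: {tenant_id!r}")
    namespace = f"tenant_{tenant_id}"
    if tiers[tenant_id] is Tier.SILOED:
        # Dedicated index; the namespace is kept as a second line of defence.
        return RetrievalTarget(index_name=f"rag-{tenant_id}", namespace=namespace)
    return RetrievalTarget(index_name=SHARED_INDEX, namespace=namespace)


tiers = {"acme": Tier.POOLED, "globex": Tier.SILOED}
pooled = resolve_target("acme", tiers)
siloed = resolve_target("globex", tiers)
```

Promoting a tenant from pool to silo then becomes a data migration plus a one-line change in the tier table, and the code that issues queries does not change.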
Namespaces beat metadata filters for the tenant boundary
Inside the pool model, you have two ways to partition: a namespace per tenant, or a single namespace with a tenant_id metadata filter on every query. They sound equivalent. They are not.
A namespace is a hard partition inside the index. A query against namespace tenant_42 only ever touches tenant_42's vectors. Vector databases that support namespaces (Pinecone, for example) scope a query to one namespace, and query cost typically scales with the namespace size, not the whole index. So one namespace per tenant is both safer and usually cheaper, because you are searching one tenant's vectors instead of scanning everyone's and discarding the rest.
# Pinecone-style: the tenant boundary is the namespace.
# A query physically cannot return another tenant's vectors.
index.query(
    namespace=f"tenant_{tenant_id}",
    vector=query_embedding,
    top_k=8,
)
Metadata filtering still has a job, just not the tenant boundary. Use it for concerns inside a tenant: document-level access control, departments, date ranges, which collection a chunk came from.
# Metadata filtering for sub-tenant concerns: within tenant_42,
# only documents this user's role can see.
index.query(
    namespace=f"tenant_{tenant_id}",
    vector=query_embedding,
    top_k=8,
    filter={"visibility": {"$in": user_role_groups}},
)
The rule I follow: namespaces (or separate indexes) for the tenant boundary, metadata filters for everything finer-grained. Putting the tenant boundary in a metadata filter means a single missing filter= argument leaks every tenant at once. Putting it in the namespace changes the failure mode: in Pinecone, a missing namespace argument falls through to the default namespace, which returns nothing as long as you never write to it. You get an empty result you will notice instead of a quiet leak.
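One way to take the namespace argument out of callers' hands entirely is to give request handlers an object that is already bound to a tenant. The sketch below assumes the underlying client exposes `query(namespace=..., vector=..., top_k=..., filter=...)` as in the snippets above. `TenantScopedIndex` and `FakeIndex` are hypothetical names, and `FakeIndex` exists only so the example runs without a real database.

```python
from typing import Any, Optional


class TenantScopedIndex:
    """Wraps an index client so every query is pinned to one tenant's namespace."""

    def __init__(self, index: Any, tenant_id: str) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._index = index
        self._namespace = f"tenant_{tenant_id}"

    def query(self, vector: list[float], top_k: int = 8,
              filter: Optional[dict] = None) -> Any:
        # Callers cannot pass a namespace; it was fixed at construction time.
        kwargs: dict[str, Any] = {
            "namespace": self._namespace,
            "vector": vector,
            "top_k": top_k,
        }
        if filter is not None:
            kwargs["filter"] = filter
        return self._index.query(**kwargs)


class FakeIndex:
    """Stand-in for a real client; records the arguments it receives."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def query(self, **kwargs: Any) -> dict:
        self.calls.append(kwargs)
        return {"matches": []}


fake = FakeIndex()
scoped = TenantScopedIndex(fake, tenant_id="42")
scoped.query(vector=[0.1, 0.2, 0.3])
```

With this shape, passing `namespace=` to `scoped.query` raises a `TypeError`, so an attempt to override the tenant fails at the call site rather than reaching the database.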
Isolation is an end-to-end property, not a database setting
This is the part that bites teams who think they are done once the namespace is wired up. The retrieval query is one link in a chain, and isolation has to hold across all of them. Every place that touches tenant data is a place it can leak.
The tenant ID must be trustworthy. If the client sends tenant_id in the request body and you trust it, any user can read any tenant by changing a number. Derive the tenant from the authenticated session, ideally a signed token (a JWT claim) that the client cannot forge. The retrieval layer reads the tenant from the verified token, never from user-supplied input.
def retrieve(question: str, claims: dict, body: dict | None = None) -> list[str]:
    # body is the raw request payload; it is never consulted for tenant identity
    tenant_id = claims["tenant_id"]  # from the verified JWT, not the request body
    emb = embed(question)
    # include_metadata=True is needed for the chunk text to come back with each match
    res = index.query(namespace=f"tenant_{tenant_id}", vector=emb, top_k=8, include_metadata=True)
    return [m["metadata"]["text"] for m in res["matches"]]
Ingestion must tag correctly. Retrieval isolation is worthless if a document was written into the wrong namespace at upload time. The write path needs the same tenant check as the read path. A document uploaded by tenant A must land in tenant A's namespace, full stop.
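The write path can mirror the read path almost line for line. This is a minimal sketch, assuming an index client with an `upsert(vectors=..., namespace=...)` method in the Pinecone style. `ingest`, `FakeIndex`, and `fake_embed` are hypothetical stand-ins, and `fake_embed` is a deterministic placeholder rather than a real embedding model.

```python
import hashlib
from typing import Any, Callable


def ingest(index: Any, embed: Callable[[str], list[float]],
           claims: dict, doc_id: str, chunks: list[str]) -> str:
    """Write chunks into the namespace of the authenticated tenant only."""
    tenant_id = claims["tenant_id"]  # verified token, same source as the read path
    namespace = f"tenant_{tenant_id}"
    vectors = [
        {
            "id": f"{doc_id}:{i}",
            "values": embed(text),
            # Recording the tenant in metadata makes a misplaced vector detectable.
            "metadata": {"text": text, "tenant_id": tenant_id},
        }
        for i, text in enumerate(chunks)
    ]
    index.upsert(vectors=vectors, namespace=namespace)
    return namespace


class FakeIndex:
    """In-memory stand-in that stores vectors per namespace."""

    def __init__(self) -> None:
        self.store: dict[str, list[dict]] = {}

    def upsert(self, vectors: list[dict], namespace: str) -> None:
        self.store.setdefault(namespace, []).extend(vectors)


def fake_embed(text: str) -> list[float]:
    # Deterministic placeholder, not a real embedding.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]


idx = FakeIndex()
written_to = ingest(idx, fake_embed, {"tenant_id": "acme"}, "doc1", ["alpha", "beta"])
```

Storing `tenant_id` in the metadata as well as the namespace is redundant on purpose: a periodic audit can scan each namespace and flag any vector whose metadata disagrees with where it lives.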
The cache is a tenant boundary too. If you cache embeddings, retrieval results, or full answers (and you should, for cost), the cache key must include the tenant ID. A cache keyed only on the question hash will happily serve tenant A's cached answer to tenant B. I have seen this exact bug. It is easy to make and embarrassing to explain.
import hashlib
# The tenant MUST be in the cache key, or you build a cross-tenant leak.
cache_key = f"{tenant_id}:{hashlib.sha256(question.encode()).hexdigest()}"
The prompt and the logs leak too. If you build context by concatenating retrieved chunks, make sure nothing global sneaks in (shared few-shot examples drawn from real customer documents, for instance). And scrub tenant content out of any logs or traces that span tenants, or your observability stack becomes the leak.
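For the logging side, one option is a filter from Python's standard `logging` module that swaps chunk text for a fingerprint before the record reaches any handler. This sketch assumes retrieved chunks are attached to log records under a `chunks` attribute via `extra=`. That convention, along with the `RedactChunks` and `ListHandler` names, is hypothetical.

```python
import hashlib
import logging


class RedactChunks(logging.Filter):
    """Replace raw chunk text on a log record with a short fingerprint."""

    def filter(self, record: logging.LogRecord) -> bool:
        chunks = getattr(record, "chunks", None)
        if chunks is not None:
            record.chunks = [
                f"<{len(c)} chars, sha256:{hashlib.sha256(c.encode()).hexdigest()[:8]}>"
                for c in chunks
            ]
        return True  # keep the record, just without the content


class ListHandler(logging.Handler):
    """Collects records in memory so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("rag.retrieval")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
logger.addFilter(RedactChunks())

logger.info(
    "retrieved",
    extra={"tenant_id": "acme", "chunks": ["acme secret roadmap"]},
)
```

The fingerprint keeps enough signal to debug retrieval (how many chunks, how long, whether the same chunk keeps coming back) without copying customer text into a system that every engineer can search.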
The most common multi-tenant RAG breach is not a hacked database. It is a shared cache, a trusted client-supplied tenant ID, or a reused embedding store, all of which look fine in a demo and leak the first week in production. Treat every component that sees tenant data as part of the isolation boundary.
A reference shape
Here is the pool-with-bridge architecture I would reach for in a typical SaaS product:
request ──> auth (verify JWT, extract tenant_id)
        ──> rate limit (per tenant)
        ──> retrieval
              ├─ small tenant? query namespace tenant_<id> in the shared index
              └─ enterprise? query the tenant's dedicated index
        ──> build context (only this tenant's chunks)
        ──> LLM
        ──> cache answer under key tenant_id:question_hash
        ──> response
The tenant ID is extracted once, at the edge, from a verified token, and then threaded through every downstream call. Retrieval picks the namespace or the dedicated index based on the tenant's tier. Caching, logging, and rate limiting are all keyed by tenant. No component anywhere reads the tenant from user-controlled input.
Testing that it actually holds
You cannot eyeball isolation. Write a test that tries to break it. Seed two tenants with distinct documents, then assert that a query authenticated as tenant A never returns any of tenant B's chunks, including when the client tries to spoof the tenant ID in the request body.
def test_no_cross_tenant_leak():
    seed(tenant="acme", docs=["acme secret roadmap"])
    seed(tenant="globex", docs=["globex secret roadmap"])
    # authenticated as acme, but the body lies and claims to be globex
    results = retrieve(
        question="what is the secret roadmap?",
        claims={"tenant_id": "acme"},  # from the verified token
        body={"tenant_id": "globex"},  # spoof attempt, must be ignored
    )
    assert results  # an empty result would make the next check pass vacuously
    assert all("globex" not in r for r in results)
Run that on every deploy. Isolation bugs do not announce themselves; they wait for the worst possible moment.
Where this connects
The mechanics of the index itself (namespaces, filters, hybrid search) build on "What is a vector database, and how does RAG use it?", and the choice of which database to run under all this is in "Choosing a vector database in 2026". Once isolation is solid, the usual quality work applies: chunking, reranking, and evaluation all sit on top, scoped per tenant.
Get the boundary right first. A slow multi-tenant RAG system is a backlog ticket. A leaky one is an incident report and a churned customer. Decide silo, pool, or bridge deliberately, put the tenant boundary in the namespace rather than a filter, treat every component as part of the boundary, and write the test that tries to break it.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.