Embeddings Explained for Engineers (2026 Guide), Folarin Akinloye

An embedding is just a list of numbers that represents a piece of text, arranged so that similar meanings land near each other in space. That is the whole idea. Everything else (RAG, semantic search, clustering, recommendations) is plumbing on top of that one trick. This post is the mental model I wish I had when I started, plus how to actually pick a model in 2026.

This sits in the middle of my RAG series, after chunking and reranking. Your embedding model is the engine of first-stage retrieval, so it is worth understanding properly.

The mental model

Take a sentence, run it through an embedding model, and you get back a fixed-length vector, say 1,536 floating-point numbers. Do the same for another sentence. If the two sentences mean similar things, their vectors point in roughly the same direction. If they mean different things, the vectors point apart.

"How do I reset my password?" and "I forgot my login credentials" share almost no words, but a good embedding model places them close together because they mean nearly the same thing. That is the magic: keyword search would miss the connection, embeddings catch it.

The model learned this by training on huge amounts of text with the objective that related text should produce nearby vectors. You do not need to know the architecture to use it. You need to know that meaning became geometry.

Why cosine similarity

Once text is vectors, "how similar are these two things?" becomes "how close are these two vectors?" The usual measure is cosine similarity: the cosine of the angle between them. It ignores magnitude and only cares about direction, which is what you want, because direction encodes meaning and magnitude often just encodes length or frequency.

import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1.0 means identical direction, 0 means orthogonal (unrelated), -1 means opposite

Note

Many modern embedding models return normalized vectors (length 1). When that is true, cosine similarity and dot product give the same value, and therefore the same ranking, and dot product is cheaper. Check your model's docs; if vectors are pre-normalized, your vector database can use inner product distance and skip the normalization step.
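To make the two claims above concrete, here is a small sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions). The `normalize` helper is my own name for illustration. It checks that scaling a vector leaves the cosine unchanged, and that on unit-length vectors the plain dot product matches the cosine.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    """Scale v to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

# Toy vectors, invented for illustration only.
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Scaling changes magnitude but not direction, so the cosine is unchanged.
same_direction = bool(np.isclose(cosine(a, b), cosine(10 * a, b)))

# On unit-length vectors the plain dot product equals the cosine.
dot_on_unit = float(np.dot(normalize(a), normalize(b)))
matches_cosine = bool(np.isclose(dot_on_unit, cosine(a, b)))
```

Both flags come out true here, which is why a database holding pre-normalized vectors can rank by inner product alone.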

A vector database stores millions of these vectors and answers "find the closest ones to this query vector" fast, using an index like HNSW. I covered that side in what a vector database is and how RAG uses it.
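For intuition, this is roughly the question a vector database answers, written as an exact brute-force scan. It is a sketch, not how HNSW works: an index like HNSW approximates this result without touching every vector. The random corpus and the `top_k` helper are stand-ins I made up, and the code assumes all vectors are already unit length.

```python
import numpy as np

def top_k(query, corpus, k=3):
    """Return (indices, scores) of the k corpus rows most similar to query.

    Assumes query and every row of corpus are unit length,
    so the dot product equals cosine similarity.
    """
    scores = corpus @ query              # one similarity per stored vector
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Stand-in corpus: 1,000 random unit vectors with 64 dimensions.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly noisy copy of stored vector 42.
query = corpus[42] + 0.01 * rng.normal(size=64)
query /= np.linalg.norm(query)

idx, scores = top_k(query, corpus, k=3)
```

The scan costs one dot product per stored vector, which is fine at a thousand vectors and painful at a hundred million. That gap is the reason approximate indexes exist.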

Dimensions, and the Matryoshka trick

More dimensions can capture more nuance but cost more to store and search. A model that outputs 3,072 dimensions uses twice the storage and roughly twice the search compute of a 1,536-dimension one. At a few thousand vectors nobody cares. At a hundred million, dimension count is a real bill.
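A quick back-of-the-envelope calculation shows where the bill comes from. This sketch assumes each value is stored as a 4-byte float32 and counts raw vector bytes only; index structures and metadata add overhead on top. The function name is mine.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

# A hundred million vectors at the two sizes mentioned above.
small = index_size_gb(100_000_000, 1536)   # about 614 GB
large = index_size_gb(100_000_000, 3072)   # about 1,229 GB
```

Doubling the dimension count doubles the raw storage, whatever the vector count.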

The clever development here is Matryoshka representation learning. Models trained this way pack the most important information into the early dimensions, so you can truncate the vector and keep most of the quality. OpenAI's text-embedding-3-large outputs 3,072 dimensions but lets you ask for fewer; you trade a little accuracy for big savings on storage and speed. It means you can start at full size and shrink later if cost bites, without retraining anything.

from openai import OpenAI
client = OpenAI()
 
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="reset my password",
    dimensions=1024,  # truncate from the native 3072
)
vector = resp.data[0].embedding
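If you have already stored full-size vectors, the same shrink can be done offline. Below is a minimal sketch, assuming the vectors came from a model trained Matryoshka-style (truncating an ordinary embedding this way would likely cost far more quality). The helper name `truncate_embedding` is mine, not part of any SDK, and the random vector is a stand-in for a real embedding. A truncated vector is no longer unit length, so the helper re-normalizes it.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` values of an embedding and re-normalize to length 1."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return v / norm

# Stand-in for a stored 3,072-dimension vector; a real one would come from the model.
full = np.random.default_rng(0).normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 1024)
```

Remember that queries must be truncated to the same size as the stored vectors before you compare them.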

Picking a model in 2026

The leaderboard everyone quotes is MTEB, which scores models across many retrieval and classification tasks. It is a useful starting filter and a terrible final answer. A model that tops MTEB on average can be mediocre on your specific domain. Use it to build a shortlist, then test on your own data.
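"Test on your own data" can be as simple as a handful of real queries, each labeled with the document that should come back. Here is one way to score a candidate model, sketched with recall@k as the metric (my choice, not the only option). The toy vectors below stand in for embeddings you would get from the model under test, and the code assumes unit-length rows and exactly one relevant document per query.

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_doc, k=5):
    """Fraction of queries whose labeled relevant document is in the top k.

    query_vecs:   (num_queries, dims) array with unit-length rows
    doc_vecs:     (num_docs, dims) array with unit-length rows
    relevant_doc: for each query, the index of its one relevant document
    """
    scores = query_vecs @ doc_vecs.T               # (num_queries, num_docs)
    top = np.argsort(-scores, axis=1)[:, :k]       # best k doc indices per query
    hits = [relevant_doc[i] in top[i] for i in range(len(relevant_doc))]
    return sum(hits) / len(hits)

# Toy stand-ins: four documents and two queries in a 4-dimensional space.
docs = np.eye(4)
queries = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.2, 0.8, 0.0]])
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

score = recall_at_k(queries, docs, relevant_doc=[0, 2], k=1)
```

Run the same labeled set through each model on your shortlist and compare the scores; a few dozen honest queries from your own domain tell you more than a leaderboard average.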

A few things are worth knowing about the current landscape. The top of MTEB moves constantly, with Gemini, Cohere's embed-v4, OpenAI's text-embedding-3-large, Voyage's voyage-4 family, and strong open models like BGE-M3 and the Qwen3 embedding series all in the conversation. Note that MTEB v2 scores are not directly comparable to v1, so do not compare numbers across benchmark versions.

The decision usually comes down to three questions:

Hosted or self-hosted? If you cannot send data to a third party, you are choosing among open models like BGE-M3 or Qwen3-Embedding, which you run yourself. If you can use an API, OpenAI, Cohere, Voyage, and Gemini are all easy to integrate.
What languages and modalities? If you need strong multilingual retrieval, check the model actually covers your languages well (Cohere's embed-v4 and BGE-M3 are built for breadth here). If you need images or code, pick a model trained for that, not a general text model.
What does it cost at your scale? Multiply your token volume by the price and your vector count by the dimension count. The cheapest model that clears your quality bar wins. OpenAI's small model, for instance, is very cheap per million tokens and is plenty for many use cases.
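The token side of the cost question is one multiplication. In the sketch below, the workload and both prices are placeholders I invented, not real quotes; plug in your own volume and the numbers from each provider's pricing page.

```python
def embedding_cost_usd(total_tokens, price_per_million_tokens):
    """One-off API cost to embed a corpus of `total_tokens` tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical workload: 10 million chunks of about 400 tokens each.
tokens = 10_000_000 * 400

# Placeholder prices in USD per million tokens (assumptions, not real quotes).
cheap = embedding_cost_usd(tokens, price_per_million_tokens=0.05)
pricey = embedding_cost_usd(tokens, price_per_million_tokens=0.50)
```

With these made-up numbers the gap is 200 versus 2,000 dollars for the initial embedding run, before any storage or query-time costs.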

Tip

Whatever you pick, embed your queries and your documents with the same model and the same settings. Mixing models, or mixing dimension counts, makes the vectors live in different spaces and similarity becomes meaningless. This is a surprisingly common bug.
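One cheap defense is to store the embedding settings next to the index and refuse to search when they differ. This is a sketch of that idea; `EmbeddingConfig` and `check_compatible` are hypothetical names, not something your vector database provides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    """Settings that must match between indexing time and query time."""
    model: str
    dimensions: int

def check_compatible(index_config, query_config):
    """Raise if a query was embedded with different settings than the index."""
    if index_config != query_config:
        raise ValueError(
            f"index built with {index_config}, query embedded with {query_config}"
        )

index_cfg = EmbeddingConfig(model="text-embedding-3-large", dimensions=1024)

# Same model and same dimension count: passes silently.
check_compatible(index_cfg, EmbeddingConfig("text-embedding-3-large", 1024))
```

A mismatch then fails loudly at query time instead of quietly returning nonsense rankings.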

The gotchas that bite in production

A few things that are not obvious until they break something:

Embedding models have a token limit. Feed in a chunk longer than the model's maximum input length and, depending on the provider or library, the request is either rejected or the text is silently truncated. In the silent case the tail of your chunk never gets embedded. This is one more reason chunking and your embedding model have to be chosen together.
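A guard in your ingestion pipeline turns the silent case into a loud one. In this sketch the default token counter just splits on whitespace, which is a crude assumption; real tokenizers count differently, so pass in a counter built on your model's own tokenizer. The function name and the limit used in the example are made up.

```python
def check_fits(text, max_tokens, count_tokens=lambda t: len(t.split())):
    """Return the token count of text, or raise if it exceeds max_tokens.

    The default counter splits on whitespace, a rough stand-in only.
    """
    n = count_tokens(text)
    if n > max_tokens:
        raise ValueError(f"chunk has {n} tokens, limit is {max_tokens}")
    return n

# Hypothetical limit of 5 tokens, chosen only to keep the example tiny.
n_tokens = check_fits("reset my password", max_tokens=5)
```

Raising at ingestion time is the design choice here: a failed batch is annoying, but a corpus full of half-embedded chunks is much harder to notice.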

Some models want an instruction prefix. A few retrieval models expect you to prepend something like "query:" to queries and "passage:" to documents. Skip it and quality drops for no obvious reason. Read the model card.
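Centralizing the prefixes in one helper makes them hard to forget. The strings below follow the "query:" and "passage:" pattern described above, but the exact text, including whitespace, is model-specific, so treat these values as placeholders and copy the real ones from the model card. The helper name is mine.

```python
# Placeholder prefixes; the exact strings depend on the model you use.
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(text, kind, prefixes=PREFIXES):
    """Prepend the prefix for `kind` ("query" or "passage") to text."""
    if kind not in prefixes:
        raise ValueError(f"unknown input kind: {kind!r}")
    return prefixes[kind] + text

q = with_prefix("how do I reset my password?", "query")
d = with_prefix("To reset your password, open Settings.", "passage")
```

Route every string through a helper like this before it reaches the embedding call, on both the indexing path and the query path.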

And re-embedding is expensive. If you switch models later, you have to re-embed your entire corpus, because old and new vectors are not comparable. Pick deliberately, because migrating a hundred million vectors is a project, not an afternoon.

Wrapping up

Embeddings turn text into vectors so that meaning becomes distance, cosine similarity measures that distance, and a vector database makes the search fast. To pick a model, shortlist from MTEB, then test on your own data, and decide based on hosting constraints, language and modality needs, and cost at your scale. Use one model consistently across queries and documents, and remember that switching later means re-embedding everything.

Next in the series: choosing a vector database in 2026, where these vectors actually live.

