How Embedding Models Are Trained
Deep-dive into embedding models — from one-hot encoding to contrastive learning, why "king − man + woman = queen" works geometrically, Matryoshka embeddings, math foundations, and production applications in RAG and semantic search with interactive demos.
Open your phone and type "best pizza near me" into Google. In under 300 milliseconds you get results that include "top-rated Italian restaurants nearby" — a page that shares zero keywords with your query. Spotify somehow knows that because you liked one lo-fi hip-hop track, you'll probably enjoy an ambient jazz album you've never heard of. ChatGPT can answer a follow-up question about a paragraph it read three messages ago.
All of this is powered by the same invisible machinery: embeddings — a way of converting messy, ambiguous human language into precise lists of numbers that machines can compare, search, and reason over. Every word, sentence, or document gets its own set of coordinates in a vast, high-dimensional space — and the distance between those coordinates is the distance between their meanings.
The idea is disarmingly simple. Instead of asking "do these two texts share the same words?", embeddings let us ask "do these two texts mean the same thing?" That single shift — from lexical matching to semantic matching — is what makes modern search, recommendations, and RAG pipelines work. And to understand why, we need to see how these numbers are built.
The Embedding Space
Every word lives at a specific coordinate. Hover any cluster — words with related meaning are packed together. Words with unrelated meaning are far apart.
Real embedding models use 768–3,072 dimensions. This 2D projection (via t-SNE) preserves relative distances: words close in meaning stay close on the plot. Notice how "king" and "queen" are almost touching, but miles from "python" or "sad".
The Representation Problem
Before embeddings existed, engineers represented words the only way they could: one-hot encoding. Give each word in your vocabulary a unique index. "cat" becomes [1, 0, 0, 0, 0]. "dog" becomes [0, 1, 0, 0, 0]. Simple — but catastrophically limited.
The problem is geometric. In a one-hot space, every pair of words is exactly the same distance apart. "cat" is just as far from "kitten" as it is from "refrigerator." A vocabulary of 50,000 words creates 50,000-dimensional vectors that are all mutually orthogonal — they carry identity information and literally zero meaning. No machine learning model can learn that "cat" and "kitten" are related from vectors that are indistinguishable in distance from "cat" and "tax audit."
Dense embeddings flip this completely. Instead of one giant sparse vector per word, you learn a compact vector (say, 5 dimensions) where every single number encodes a fragment of meaning. And when you compare two of these compact vectors, the math actually reflects the semantic relationship between the words.
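To make the geometry concrete, here is a tiny numpy sketch. The one-hot vectors follow directly from the definition; the 5-D dense vectors are invented for illustration, with each axis loosely standing in for a trait:

```python
import numpy as np

vocab = ["cat", "kitten", "refrigerator"]
one_hot = np.eye(len(vocab))
for i in range(len(vocab)):
    for j in range(i + 1, len(vocab)):
        d = np.linalg.norm(one_hot[i] - one_hot[j])
        print(f"{vocab[i]}–{vocab[j]}: distance {d:.3f}")  # always √2 ≈ 1.414

# Toy dense vectors: axes loosely encode traits (feline, pet, appliance, ...)
dense = {"cat":          np.array([0.9, 0.8, 0.0, 0.1, 0.0]),
         "kitten":       np.array([0.85, 0.9, 0.0, 0.2, 0.1]),
         "refrigerator": np.array([0.0, 0.1, 0.9, 0.0, 0.8])}
print(np.linalg.norm(dense["cat"] - dense["kitten"]))        # ≈ 0.2 — close
print(np.linalg.norm(dense["cat"] - dense["refrigerator"]))  # ≈ 1.7 — far
```

In one-hot space every pair sits at exactly the same distance; in dense space, distance finally carries meaning.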
One-Hot vs. Dense — See the Difference
Toggle between encodings, then tap any word to reveal similarity scores.
Cosine similarity from "cat"
Notice: "cat" → "kitten" scores 0.998, almost identical. "cat" → "car" drops to −0.168, practically opposite. Dense vectors capture meaning; one-hot cannot.
How Similarity Is Measured — Visually
The central operation in all of embedding-land is cosine similarity. It answers one question: how much do two vectors point in the same direction? Forget the magnitude (length) of each vector — we only care about the angle between them.
THE FORMULA
cos(A, B) = A · B / (‖A‖ × ‖B‖)
A · B
Dot product — multiply matching dimensions and sum
‖A‖
Magnitude — the length of vector A
Result
Ranges from −1 (opposite) through 0 (unrelated) to +1 (identical)
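The formula above is a one-liner in numpy. A minimal sketch, with toy 3-D vectors chosen to show both ends of the range:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(A, B) = A · B / (‖A‖ × ‖B‖)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, 0.8, 0.1])
b = np.array([0.85, 0.9, 0.15])
c = np.array([-0.8, 0.1, 0.9])
print(cosine_similarity(a, b))  # ≈ 0.99 — near-identical direction
print(cosine_similarity(a, c))  # ≈ -0.38 — pointing away
```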
There is a critical distinction worth internalizing early: embeddings are not compression. Compression preserves information — you can decompress and recover the original. Embeddings are a lossy semantic projection. You cannot reconstruct the original text from its embedding vector. What you can do is compare two embedding vectors and instantly know how semantically similar the source texts are. That property — fast, numeric semantic comparison — is the entire value proposition.
One more thing about dimensionality. Each dimension in a dense embedding captures a tiny sliver of meaning — think of it as one axis of the semantic coordinate system. 256 dimensions capture word-level relationships well. 1,536 dimensions handle paragraph-level nuance. 3,072 dimensions (used by OpenAI's largest model) can distinguish subtle differences in tone, intent, and domain. More dimensions means more expressive power, but also more storage, more compute, and slower search. That trade-off sits at the core of every production embedding decision.
So now we know what embeddings are and why they matter. The natural next question: how do we actually learn these vectors? That story begins in 2013, with a Google researcher named Tomas Mikolov and a model called Word2Vec.
Published in 2013, Word2Vec quietly reshaped the entire field of natural language processing. Its central insight was almost embarrassingly simple: you can learn what a word means by looking at the words that surround it.
This is called the distributional hypothesis — the idea that words appearing in similar contexts tend to have similar meanings. "Dog" and "puppy" both show up near words like "bark," "leash," and "walk." So even without anyone ever telling the model that dogs and puppies are related, it figures it out purely from co-occurrence patterns in text.
Word2Vec works by training a tiny neural network on a deceptively simple task: given one word, predict the words around it. The trick is that after training on billions of word pairs, the weights inside that network become the embedding vectors. The network is just scaffolding — what survives is the learned representation.
The Sliding Context Window
Tap any word to make it the center. Drag the slider to expand or shrink the context window. Every (center, context) pair becomes one training example.
Training pairs from this window (4 pairs)
Slide the window across 1 billion words at window=5 and you generate roughly 10 billion training pairs — all for free, from raw text.
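Generating those pairs takes only a few lines. A minimal sketch of the sliding window, assuming whitespace tokenization:

```python
# Emit one (center, context) training pair per neighbor inside the window.
def training_pairs(tokens, window=2):
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in training_pairs(sentence, window=2):
    print(pair)   # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```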
Two Architectures, One Idea
Word2Vec comes in two flavors. Skip-gram takes the center word and tries to predict every context word around it. CBOW (Continuous Bag of Words) does the reverse — it takes all the context words and predicts the center. Skip-gram learns better representations for rare words because each word gets multiple training signals. CBOW trains faster because it averages context into a single prediction.
Skip-gram
One center word in → predict each context word independently. More training signal per word, better for rare words.
The Hidden Layer = The Embedding
After training, we throw away the output layer entirely. The weight matrix W (vocabulary × 300) is the embedding table. Row i of W is the embedding vector for word i.
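In code, "the weight matrix is the embedding table" is just a row lookup. A sketch with illustrative sizes and a hypothetical vocabulary map:

```python
import numpy as np

vocab_size, dim = 50_000, 300                  # illustrative sizes
W = np.random.randn(vocab_size, dim) * 0.01    # learned during training
word_to_index = {"fox": 1_234}                 # hypothetical vocabulary map
fox_vector = W[word_to_index["fox"]]           # row lookup — this IS the embedding
print(fox_vector.shape)                        # (300,)
```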
Negative Sampling — Making It Scale
There is a practical problem with the architecture above. The output layer computes a softmax over the entire vocabulary — 50,000 to 500,000 words. For every single training pair, the model must compute a score for every word in the vocabulary and normalize them into a probability distribution. That is brutally expensive.
The solution is negative sampling: instead of predicting the right word out of 500,000 options, we reframe the task as a series of binary yes/no questions. For the real pair ("fox", "brown"), we ask: "did these words actually appear together?" (answer: yes). Then we randomly sample a few words that did not appear near "fox" — say "pizza", "laptop", "ocean" — and ask the same question (answer: no). Now the model just needs to separate real pairs from fake ones, which reduces computation per step from O(500K) to O(k), where k is typically 5–15.
Negative Sampling — Before & After
Step through the process and watch similarity scores change as the model learns to separate real from fake pairs.
The Loss Function
L = log σ(v_w · v_c) + Σᵢ₌₁ᵏ log σ(−v_w · v_nᵢ)
v_w
Target word vector
The word we're training
v_c
Context vector (+)
A real neighboring word
v_nᵢ
Negative vector (−)
A random fake neighbor
σ
Sigmoid function
Squashes to [0, 1]
Read it like this: maximize the dot product between the target word and its real context (first term), while minimizing the dot product with randomly sampled fake contexts (second term). The sigmoid σ turns dot products into probabilities. After enough iterations, real pairs score near 1.0 and fake pairs score near 0.0 — meaning the vectors have learned to encode which words actually co-occur.
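Here is that objective in numpy, with randomly initialized toy vectors standing in for trained ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_w, v_c, v_negs):
    """L = log σ(v_w · v_c) + Σ log σ(−v_w · v_nᵢ) — training maximizes this."""
    pos_term = np.log(sigmoid(v_w @ v_c))
    neg_term = sum(np.log(sigmoid(-(v_w @ v_n))) for v_n in v_negs)
    return pos_term + neg_term

rng = np.random.default_rng(0)
v_fox, v_brown = rng.normal(size=300), rng.normal(size=300)
v_negs = rng.normal(size=(5, 300))   # k = 5 random "fake" neighbors
print(neg_sampling_objective(v_fox, v_brown, v_negs))
```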
How negatives are sampled
P(w) = freq(w)^0.75 / Σ freq(w)^0.75
Negatives are drawn from a smoothed word-frequency distribution. The 0.75 exponent gives rare words more representation than their raw frequency would suggest — ensuring the model doesn't just learn to recognize common words like "the" and "is."
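The smoothing effect is easy to see numerically. A sketch with hypothetical corpus counts:

```python
import numpy as np

def negative_sampling_probs(freqs, power=0.75):
    """P(w) = freq(w)^0.75 / Σ freq(w)^0.75 — flattens the frequency curve."""
    f = np.asarray(freqs, dtype=float) ** power
    return f / f.sum()

# Hypothetical counts: "the" dominates raw frequency (≈ 99% of this toy corpus)
freqs = {"the": 1_000_000, "fox": 10_000, "quokka": 100}
p = negative_sampling_probs(list(freqs.values()))
print(dict(zip(freqs, p.round(4))))
# "the" falls to ≈ 96.8% of samples, while "quokka" gets ≈ 10× its raw share
```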
Word2Vec has a fundamental limitation that every practitioner should understand: it assigns one vector per word, regardless of context. The word "bank" gets the exact same embedding whether the sentence is "I sat on the river bank" or "I deposited money at the bank." This is the polysemy problem, and it's the reason production systems today don't use vanilla Word2Vec.
But Word2Vec's legacy is not its specific architecture — it's the proof that self-supervised prediction tasks on raw text produce useful representations. Every modern embedding model, from BERT to OpenAI's text-embedding-3-large, still uses this core principle. The prediction task evolved (masked language modeling, contrastive learning). The architecture evolved (transformers replaced shallow networks). But the foundational idea — that predicting context teaches meaning — remains unchanged.
And those learned vectors had one more surprise in store — a property so striking that it became the single most cited result in all of NLP: you could do arithmetic with meaning.
Take the vector for "king." Subtract the vector for "man." Add the vector for "woman." The result lands almost exactly on the vector for "queen." Written as an equation: v(king) − v(man) + v(woman) ≈ v(queen). The operation works because the directionfrom "man" to "king" — the geometric representation of the concept "royalty applied to male" — is nearly identical to the direction from "woman" to "queen." Subtracting "man" removes the male component, adding "woman" inserts the female component, and what remains is royalty. The four words form a parallelogram in vector space.
This is not a hand-tuned trick. It emerges spontaneously from the training process. Because "king" and "queen" appear in nearly identical grammatical contexts ("ruled," "decreed," "inherited the throne"), their vectors share most of their dimensions. The few dimensions that differ are exactly the ones that encode gender — the same dimensions that separate "man" from "woman." Linear algebra does the rest.
Vector Analogy Explorer
Choose an analogy, then step through the arithmetic. Watch the parallelogram form — the hallmark of linear relationships in embedding space.
Result
king − man + woman = queen ✓
1.000
cosine sim
0.0000
euclidean dist
Why It Works — The Parallelogram Law
v(king) − v(man) ≈ v(queen) − v(woman)
Step 1: Isolate the relationship
Subtracting v(man) from v(king) cancels out everything shared between them — "male," "human," "adult." What remains is the pure concept of royalty: the direction in vector space that transforms a commoner into a ruler.
Step 2: Apply to new context
Adding this royalty direction to v(woman) moves through the same semantic offset — from commoner to ruler — but starting from a different gender. The resulting point is both "female" and "royal."
Step 3: Find nearest neighbor
The computed point doesn't exactly match any word, but v(queen) is the nearest real word in the vocabulary — confirming that the embedding space encodes semantic relationships as consistent linear offsets.
The formal task
d* = argmax_d cos(v_d, v_c − v_b + v_a)
For every word d in the vocabulary (excluding a, b, c), compute the cosine similarity between d's vector and the arithmetic result. The word with the highest similarity wins. Levy & Goldberg (2014) showed that this works because Skip-gram with negative sampling implicitly factorizes a shifted PMI (Pointwise Mutual Information) matrix — meaning the linear structure isn't a coincidence, it's a mathematical inevitability of the training objective.
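The formal task translates directly to code. A minimal sketch over a plain dict of word vectors (which you might load from, say, gensim's KeyedVectors — the call at the bottom is illustrative):

```python
import numpy as np

def solve_analogy(a, b, c, embeddings):
    """d* = argmax_d cos(v_d, v_c − v_b + v_a), excluding a, b, c."""
    target = embeddings[c] - embeddings[b] + embeddings[a]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                       # the inputs themselves don't count
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Hypothetical usage with real vectors in a {word: np.ndarray} dict:
# solve_analogy("king", "man", "woman", embeddings)  # → ("queen", ...)
```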
When Arithmetic Goes Wrong
The analogy trick works beautifully for clean semantic relationships — gender, verb tense, country-capital. But it reveals something troubling when you push it into social territory. Compute "doctor − man + woman" and you should get "doctor" — it's a gender-neutral profession. But older embeddings trained on web text returned "nurse." The model had learned the statistical bias of its training corpus: in the texts it read, doctors were disproportionately described as male and nurses as female.
This is not a bug in the algorithm. It is a faithful reflection of the data. Embeddings do not encode truth — they encode the patterns present in whatever text they were trained on. If the corpus is biased (and all internet-scale corpora are), the embeddings will be biased. Production systems must audit for these patterns, apply debiasing techniques, or choose models specifically trained to mitigate them.
One final note: the analogy benchmark is a useful sanity check, but a terrible evaluation metric for production systems. In practice, nobody cares whether king−man+woman=queen. What matters is whether your embedding model returns the right documents for a query, classifies text correctly, or clusters topics coherently. Always evaluate on your actual downstream task, not on toy analogies. Models that ace the analogy test can fail spectacularly on domain-specific retrieval if your domain was underrepresented in the training data.
But Word2Vec had fundamental limitations beyond bias — limitations that motivated the entire next generation of embedding models. One is the polysemy problem we met earlier: "bank" gets the same vector in "river bank" and "investment bank." The other is granularity: Word2Vec only captures word-level relationships. What if we want to compare entire sentences? What if the query "How do I reset my password?" needs to match the document "Steps to change your account password" — a passage that shares almost no keywords?
This is where contrastive learning enters the picture. It is the training paradigm behind every modern sentence embedding model — from sentence-transformers and OpenAI's text-embedding-3-large to Cohere's embed-v3 and Google's Gecko. Instead of predicting neighboring words, contrastive learning directly optimizes the property we actually need: semantically similar texts should have nearby embeddings, and dissimilar texts should be far apart.
The setup is elegant. You start with an anchor text. You pair it with one positive (a text that means the same thing) and several negatives (texts that don't). The training objective: push the anchor's embedding toward the positive and away from the negatives. After millions of such updates, the model has learned a geometric space where proximity equals semantic similarity.
Contrastive Training — Forces in Action
Select an example, then step through training epochs. Watch the positive get pulled in and negatives get pushed out.
Anchor: How do I reset my password?
Positive: Steps to change your account password
Neg 1: Best pasta recipes for dinner
Neg 2: History of the Roman Empire
Neg 3: Python list comprehension tutorial
The InfoNCE Loss Function
The training objective that makes all of this work is called InfoNCE (Information Noise-Contrastive Estimation). It is essentially a softmax cross-entropy loss where the "correct class" is the positive pair. The model computes a similarity score between the anchor and every candidate (the positive + all negatives), divides by a temperature parameter τ, applies softmax, and tries to maximize the probability assigned to the positive.
InfoNCE Loss
L = −log ( exp(sim(q, k⁺) / τ) / Σ exp(sim(q, kᵢ) / τ) )
q
Anchor
k⁺
Positive
kᵢ
All candidates
τ
Temperature
sim
Cosine sim
0.000
InfoNCE Loss (lower = better)
100.0%
P(positive) — goal: near 100%
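A minimal PyTorch sketch of the loss, with illustrative shapes and a commonly used (but here assumed) temperature of 0.05:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.05):
    """L = −log( exp(sim(q, k⁺)/τ) / Σ exp(sim(q, kᵢ)/τ) )"""
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)  # (k+1, d)
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates) / tau  # (k+1,)
    # Softmax cross-entropy where the "correct class" is index 0 (the positive)
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))

d = 128                                    # illustrative embedding size
anchor, positive = torch.randn(d), torch.randn(d)
negatives = torch.randn(5, d)              # k = 5 negatives
print(info_nce(anchor, positive, negatives))   # scalar; lower = better
```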
Hard Negatives — The Secret Weapon
Here is one of the most important practical insights in all of embedding engineering: the quality of your negatives matters more than the architecture of your model. A random negative like "Best pizza in New York" is trivially easy for the model to distinguish from "How to deploy to AWS." It learns nothing useful from that pair. The gradient is tiny, the update is meaningless.
A hard negative, on the other hand, is a text that is topically related but semantically different — like "How to deploy to Google Cloud" when the anchor is "How to deploy to AWS." These force the model to learn fine-grained distinctions: same topic, different platform. The gradient is large, the update is meaningful, and the resulting embeddings are dramatically more discriminative.
Hard Negative Explorer
Toggle between easy and hard negatives to see the difference in training signal.
"How to deploy to AWS"
Easy Negative
"Best pizza in New York"
0.05
sim — trivial
"Python async/await"
Easy Negative
"Dog training tips"
0.05
sim — trivial
Mining Strategies Used in Production
In-batch negatives
Use other positives in the same batch as negatives. Free and surprisingly effective — the standard trick in SimCLR and in dense retrievers like DPR.
BM25 negatives
Retrieve lexically similar but semantically different documents via keyword search. High-quality, moderate cost — sketched after this list.
Cross-encoder mining
Use an expensive cross-encoder to score candidates and select the hardest. Best quality, highest cost.
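As an example of the middle strategy, here is a BM25 mining sketch assuming the rank_bm25 package (`pip install rank-bm25`) and a toy corpus:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "How to deploy to Google Cloud",
    "Best pizza in New York",
    "Kubernetes deployment guide for AWS",
    "Dog training tips",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

anchor = "How to deploy to AWS"
scores = bm25.get_scores(anchor.lower().split())
# The highest-scoring non-positive documents are lexically close to the
# anchor but mean something different — exactly the hard negatives we want.
ranked = sorted(zip(scores, corpus), reverse=True)
print(ranked[0])   # a deployment doc, not the pizza one
```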
Contrastive learning solved the context problem that plagued Word2Vec. Because the model processes entire sentences through a transformer encoder before comparing them, the same word gets different embeddings depending on its surrounding context. "Bank" in a financial document and "bank" in a nature article produce different vectors. This is the power of contextual embeddings — and it's why modern sentence-transformers dominate every retrieval benchmark.
But there is still one more practical challenge that production teams face every day: the dimensionality trade-off. Higher-dimensional embeddings capture more nuance, but they cost more to store, index, and search. What if you could train one model whose embeddings work at multiple dimensions — 256 for fast search, 1024 for high-precision tasks — without retraining? That is exactly what Matryoshka embeddings achieve.
Every embedding model produces a fixed-size vector — 768 dimensions for most open-source models, 1,536 or 3,072 for OpenAI's largest. More dimensions capture more nuance, but they cost more to store, index, and search. At 100 million documents, the difference between 256-D and 1,536-D float32 vectors is roughly half a terabyte of storage and an order of magnitude in search latency.
In practice, teams face an uncomfortable choice: use the full-dimension embedding and pay the compute bill, or retrain a smaller model and accept lower quality. Matryoshka Representation Learning (MRL) eliminates this trade-off entirely.
Named after Russian nesting dolls, a Matryoshka embedding is trained so that the first d dimensions of the full vector form a valid, high-quality embedding on their own. The first 64 dimensions capture the broad strokes — topic, domain, general sentiment. The first 256 dimensions capture most of the semantic nuance. The full 1,536 dimensions capture everything. And because every prefix is valid, you can choose your quality/cost trade-off at query time — without retraining anything.
The Nesting Structure
Drag the slider to truncate. Inner rings are always contained within outer rings — more dimensions add detail but never change the core.
Storage & Latency Calculator
See the real-world impact of dimension reduction at production scale.
6.0 GB
Full 1536-D
1.0 GB
Truncated 256-D
83%
Storage saved
79%
Faster search
Search Latency per Query
Quality Retained
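The calculator's figures fall out of simple arithmetic — float32 storage is documents × dimensions × 4 bytes. A sketch assuming a 1-million-document corpus (matching the gigabyte figures above):

```python
docs = 1_000_000          # corpus size assumed to match the calculator above
bytes_per_float = 4       # float32
full_gb = docs * 1536 * bytes_per_float / 1e9    # ≈ 6.1 GB
trunc_gb = docs * 256 * bytes_per_float / 1e9    # ≈ 1.0 GB
print(f"{full_gb:.1f} GB → {trunc_gb:.1f} GB, "
      f"saving {1 - trunc_gb / full_gb:.0%}")    # saving 83%
```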
How MRL Training Works
The training trick is surprisingly simple. Instead of computing the contrastive loss only on the full 1,536-D embedding, you compute it simultaneously at multiple truncation levels: the first 64, 128, 256, 512, and 1,536 dimensions. The total loss is a weighted sum of all these partial losses. Because every level shares the same model parameters, the gradient from the 64-D loss forces the first 64 dimensions to be maximally informative on their own, while the full-dimension loss ensures the complete vector misses nothing.
MRL Training Pipeline
L = Σ_{m∈M} w_m × InfoNCE(f(x)[:m])
Encode
Pass text through transformer encoder → full 1,536-D embedding
Truncate at each level
Slice to [:64], [:128], [:256], [:512], [:1536] — five views of the same vector
L2-normalize each slice
Each truncated prefix is independently unit-normalized
Compute InfoNCE at each level
Anchor vs positive+negatives — measured at every truncation
Sum weighted losses → backprop
L_total = Σ w_m × L_m — single backward pass updates all dimensions
The key insight: the gradient from the 64-D loss teaches the first 64 dimensions to be self-sufficient. The 256-D loss refines dimensions 65–256. The full 1,536-D loss polishes the tail. The result: information importance is ordered by dimension index. Dimension 1 is the most important; dimension 1,536 is the least.
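A self-contained PyTorch sketch of that objective — the truncation levels, uniform weights, and temperature here are illustrative choices, not the values from any specific paper:

```python
import torch
import torch.nn.functional as F

LEVELS = [64, 128, 256, 512, 1536]

def mrl_loss(anchor, positive, negatives, tau=0.05, weights=None):
    """L = Σ_{m∈M} w_m × InfoNCE(f(x)[:m]) — one loss per truncation level."""
    weights = weights or {m: 1.0 for m in LEVELS}   # uniform, a common default
    total = torch.tensor(0.0)
    for m in LEVELS:
        # Slice every embedding to its first m dims, re-normalize each prefix
        a = F.normalize(anchor[:m], dim=0)
        cands = torch.cat([positive[:m].unsqueeze(0), negatives[:, :m]], dim=0)
        cands = F.normalize(cands, dim=1)
        logits = (cands @ a) / tau        # cosine sims (all vectors unit-norm)
        labels = torch.tensor(0)          # the positive sits at index 0
        total = total + weights[m] * F.cross_entropy(logits.unsqueeze(0),
                                                     labels.unsqueeze(0))
    return total

anchor, positive = torch.randn(1536), torch.randn(1536)
negatives = torch.randn(7, 1536)
print(mrl_loss(anchor, positive, negatives))
```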
There is one honest trade-off to acknowledge: Matryoshka embeddings sacrifice roughly 2–5% quality at the full dimension compared to a model trained exclusively at that size. If your system always uses the full 1,536-D vector and never truncates, a standard contrastive model will be marginally better. But the moment you need flexibility — and production systems almost always do — MRL pays for itself many times over.
The production pattern is a two-stage search. First stage: use 256-D truncated vectors to search 100 million documents in ~20ms. This returns the top 100 candidates at 92% quality. Second stage: re-score those 100 candidates using the full 1,536-D vectors in ~1ms. Total latency: 21ms, with the precision of full-dimension search. This two-stage pattern — fast coarse retrieval followed by precise re-ranking — is how every serious vector search system works at scale, from Google to Spotify to Notion.
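A numpy sketch of the two-stage pattern — brute-force matrix products stand in for the ANN index (e.g. FAISS or HNSW) a real system would use for stage one:

```python
import numpy as np

def two_stage_search(query_vec, doc_vecs, coarse_dims=256, top_k=100):
    # doc_vecs: (N, 1536) float32 matrix of full-dimension embeddings
    q_full = query_vec / np.linalg.norm(query_vec)
    # Stage 1: coarse scores on truncated, re-normalized prefixes
    docs_c = doc_vecs[:, :coarse_dims]
    docs_c = docs_c / np.linalg.norm(docs_c, axis=1, keepdims=True)
    q_c = q_full[:coarse_dims] / np.linalg.norm(q_full[:coarse_dims])
    shortlist = np.argsort(docs_c @ q_c)[-top_k:]
    # Stage 2: exact full-dimension cosine scores on the shortlist only
    docs_f = doc_vecs[shortlist]
    docs_f = docs_f / np.linalg.norm(docs_f, axis=1, keepdims=True)
    exact = docs_f @ q_full
    return shortlist[np.argsort(exact)[::-1]]    # best match first

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 1536)).astype(np.float32)
query = rng.normal(size=1536).astype(np.float32)
print(two_stage_search(query, docs)[:5])
```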
Now we have all the building blocks: embeddings, training objectives, and efficiency techniques. The final question is the most practical one: how do you actually use these embeddings in a production application?
Embeddings are not an end product. They are infrastructure — the invisible layer that powers semantic search, recommendation engines, clustering pipelines, classification systems, anomaly detection, and most critically, Retrieval-Augmented Generation (RAG). If you build anything that needs to understand the meaning of text rather than just its keywords, you are using embeddings, whether you realize it or not.
Understanding how embeddings are trained — everything we have covered so far — gives you a diagnostic superpower. When your RAG pipeline returns irrelevant documents, the root cause is almost always traceable to the embedding layer: wrong model for your domain, wrong chunking strategy, or a distributional mismatch between your data and the model's training corpus. Knowing what went wrong inside the model tells you exactly how to fix it.
The RAG Pipeline — Step by Step
Click each stage to see how embeddings power Retrieval-Augmented Generation from query to answer.
Step 1: Embed Query
The user's natural-language query is passed through the same embedding model used during ingestion. The output is a dense vector — the query's "address" in semantic space.
Semantic Search in Action
The query "How do I change my password?" contains none of the keywords in the document titled "Password Reset Guide." A traditional keyword search (BM25) would miss it entirely. But the embedding model maps both to nearby vectors — because they mean the same thing — and the cosine similarity is 0.94. This is the difference between searching for words and searching for meaning.
Semantic Search Demo
Select a query. See which documents the embedding model retrieves, ranked by cosine similarity.
Password Reset Guide
To reset your password, go to Settings > Security > Change Password. You will need your current password or a recovery email.
0.94
cosine sim
Account Deletion
To permanently delete your account, navigate to Settings > Account > Delete Account. This action is irreversible and removes all data after 30 days.
0.31
cosine sim
Team Management
Invite team members from the Organization settings. Assign roles: Admin (full access), Editor (read/write), Viewer (read-only).
0.22
cosine sim
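A sketch of this demo using the sentence-transformers library; the checkpoint named here is one common public choice, and the scores you get will differ from the illustrative numbers above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed public checkpoint
docs = [
    "To reset your password, go to Settings > Security > Change Password.",
    "To permanently delete your account, navigate to Settings > Account.",
    "Invite team members from the Organization settings.",
]
doc_embs = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode("How do I change my password?",
                         normalize_embeddings=True)

# Rank documents by cosine similarity to the query
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.2f}  {doc[:55]}")
# The password doc ranks first despite sharing almost no exact keywords
```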
Choosing an Embedding Model
The model you choose sets the quality ceiling for your entire retrieval system. No amount of re-ranking, prompt engineering, or chunking optimization can compensate for a fundamentally wrong embedding model. The key dimensions: retrieval quality (MTEB benchmark), cost per million tokens, dimensionality (storage + latency), context length, and whether it supports Matryoshka truncation for flexible dimensionality.
Production Embedding Models
Sort by the dimension that matters most for your use case.
| Model | Provider | Dims | MTEB | Cost | MRL | Context | Open |
|---|---|---|---|---|---|---|---|
| E5-mistral-7b | Microsoft | 4,096 | 66.6 | Free | — | 32,768 | ✓ |
| text-embedding-3-large | OpenAI | 3,072 | 64.6 | $0.13/1M | ✓ | 8,191 | — |
| embed-v3.0 | Cohere | 1,024 | 64.5 | $0.10/1M | — | 512 | — |
| BGE-large-en-v1.5 | BAAI | 1,024 | 63.5 | Free | — | 512 | ✓ |
| text-embedding-3-small | OpenAI | 1,536 | 62.3 | $0.02/1M | ✓ | 8,191 | — |
| nomic-embed-text-v1.5 | Nomic | 768 | 62.3 | Free | ✓ | 8,192 | ✓ |
MTEB (Massive Text Embedding Benchmark) averages performance across 56 tasks. Scores above 64 are state-of-the-art. But always validate on your domain — a model that tops MTEB on web data may underperform on medical or legal text.
Production RAG Architecture
Two pipelines — offline ingestion and online query — connected by a shared vector database.
Where Production Systems Break
Three failure modes account for the vast majority of RAG quality issues in production:
Domain mismatch
A general-purpose model trained on web text will misinterpret domain-specific vocabulary. "Discharge" means leaving a hospital in medicine, firing an employee in HR, and releasing energy in physics. If your domain wasn't well-represented in training, fine-tune or choose a domain-specific model.
Wrong chunk size
Embedding a 50-page document into a single vector loses all granularity. Embedding individual sentences loses context. The sweet spot is usually 256–512 token chunks with 50-token overlap — but the ideal size depends on your query distribution. Profile your actual queries first (a minimal chunking sketch follows this list).
Missing re-ranking
Bi-encoder search is fast but approximate. Adding a cross-encoder re-ranker on the top 100 results typically improves precision by 15–30% for ~10ms additional latency. This is the single highest-ROI improvement most teams skip.
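The chunking sketch promised above — whitespace tokens stand in for the embedding model's real tokenizer, which production code should use instead:

```python
def chunk(text, chunk_size=384, overlap=50):
    tokens = text.split()                 # stand-in tokenizer
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# Each chunk shares `overlap` tokens with its neighbor, so sentences that
# straddle a boundary still appear intact in at least one chunk.
```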
And one final, practical note: always build an evaluation set. Before you tune any parameter — model, chunk size, overlap, re-ranker — create 50–100 query/answer pairs that represent your actual workload. Measure retrieval precision and recall objectively. Without this, you are tuning blindly, and every change is a coin flip. With it, every experiment is a data point, and your system improves monotonically.
You now have the complete picture: from one-hot encodings to Word2Vec, from contrastive learning to Matryoshka embeddings, from vector arithmetic to production RAG. The next time you build a retrieval system, you will understand not just what to do — but why it works.