# Matryoshka Doesn't Buy You Sign-Bit Compression
Three Gigs ended with an open question: does sign-bit compression generalize beyond SPECTER2? Gemini's gemini-embedding-001 is the hardest possible second test: 3072 dimensions (4× wider), Matryoshka-trained (so a privileged prefix should exist), and L2-normalized (so the centering lever should be gone). I had five hypotheses. Three were wrong.
## The scorecard
| # | Hypothesis | Result |
|---|---|---|
| Q1 | Matryoshka prefix dominates random at low k | ✗ prefix ≈ suffix ≈ random ±0.018 |
| Q2 | L2 normalization makes centering matter less | ✗ centering hurts at high k (≥1536) |
| Q3 | Graceful degradation below 768 | ✓ gradual, no cliff |
| Q4 | Gemini at 32 B/vec beats SPECTER2's 0.926 | ✗ 0.879 — SPECTER2 wins per byte |
| Q5 | Useful compression at a practical operating point | ✓ k=384: R@100 = 0.944 at 48 B/vec (256×) |
## Sign-packing washes out the Matryoshka prefix
Matryoshka training optimizes float32 inner product at specific truncation points. Sign-packing destroys magnitude information. Whatever redistribution Matryoshka induces in float32 space doesn't survive binarization.
The index-selection grid at k=256 (32 B/vec):
| select | R@10 | R@100 |
|---|---|---|
| prefix | 0.432 | 0.873 |
| suffix | 0.418 | 0.879 |
| spaced | 0.420 | 0.843 |
| random (avg) | 0.412 | 0.861 ±0.007 |
They're all the same within noise. At k=768 it's even tighter: prefix 0.977, suffix 0.978, random 0.972. For 1-bit retrieval, the Matryoshka prefix is not a lever you can pull.
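The selection strategies are trivial to reproduce. A minimal numpy sketch; `select_dims` and `sign_pack` are illustrative names, not remax's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_dims(d: int, k: int, how: str) -> np.ndarray:
    """Pick k of d dimensions: Matryoshka prefix, suffix, evenly spaced, or random."""
    if how == "prefix":
        return np.arange(k)
    if how == "suffix":
        return np.arange(d - k, d)
    if how == "spaced":
        return np.round(np.linspace(0, d - 1, k)).astype(int)
    return np.sort(rng.choice(d, size=k, replace=False))  # random

def sign_pack(X: np.ndarray, dims: np.ndarray, mu: np.ndarray | None = None) -> np.ndarray:
    """Keep the chosen dims, optionally subtract the corpus mean, take signs, pack."""
    Z = X[:, dims] if mu is None else X[:, dims] - mu[dims]
    return np.packbits(Z > 0, axis=1)  # (n, k // 8) uint8 codes
```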
## L2 normalization flips the centering story
For SPECTER2 (norms ~20–22), centering was the biggest single improvement at every k. For Gemini (norms = 1.0), centering still helps at practical operating points but flips to harmful above k≈1024:
| k | sign-raw R@10 | sign-centered R@10 | effect |
|---|---|---|---|
| 256 | 0.338 | 0.432 | centering helps |
| 384 | 0.434 | 0.520 | centering helps |
| 768 | 0.608 | 0.624 | centering helps |
| 1536 | 0.694 | 0.684 | centering hurts |
| 3072 | 0.764 | 0.712 | centering hurts |
On the unit sphere at high k, subtracting a small mean and re-binarizing flips bits whose magnitude sat near zero — that's noise, not signal removal. But at the byte budgets where compression matters (k≤768), centering still earns its keep even on normalized data. The per-encoder tuning is small: at your operating point, test sign(x) vs sign(x − μ) and keep whichever wins.
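That test takes a dozen lines. A sketch of the comparison, assuming `docs`, `queries`, and a `truth` list of ground-truth neighbor sets are already in hand; the brute-force Hamming scan here is for illustration, not remax's scan kernel:

```python
import numpy as np

# 8-bit popcount table: Hamming distance on packed codes is XOR plus table lookup
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_top(pq: np.ndarray, pd: np.ndarray, topk: int) -> np.ndarray:
    """Top-k document indices per query, ranked by Hamming distance on packed codes."""
    dists = POPCOUNT[pq[:, None, :] ^ pd[None, :, :]].sum(-1)
    return np.argsort(dists, axis=1)[:, :topk]

def recall(top: np.ndarray, truth: list) -> float:
    """Mean fraction of each query's ground-truth set recovered in its top-k."""
    return float(np.mean([len(set(t) & set(g)) / len(g) for t, g in zip(top, truth)]))

def pick_centering(docs: np.ndarray, queries: np.ndarray, truth: list, topk: int = 10):
    """Try sign(x) and sign(x - mu); keep whichever wins at your operating point."""
    mu = docs.mean(axis=0)
    for label, shift in (("sign-raw", 0.0), ("sign-centered", mu)):
        pd = np.packbits(docs - shift > 0, axis=1)
        pq = np.packbits(queries - shift > 0, axis=1)
        print(label, round(recall(hamming_top(pq, pd, topk), truth), 3))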
## PCA becomes king at extreme compression
For SPECTER2, PCA was the worst strategy at k≥192. For Gemini at k=64 (8 bytes per vector, 1536× compression), PCA gets R@100 = 0.684 — nothing else comes close:
| strategy (k=64) | R@100 |
|---|---|
| sign-centered | 0.435 |
| pca | 0.684 |
Matryoshka training does concentrate variance in the top principal components. That concentration survives sign-packing better than truncation does. The crossover from “PCA bad” to “PCA wins” happens between k=128 and k=256 — roughly the Matryoshka training floor (768) divided by 3–6.
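A sketch of the PCA variant, with the principal directions fit on the document side; whether remax whitens or rescales components may differ, this is the plain version:

```python
import numpy as np

def pca_sign_pack(docs: np.ndarray, queries: np.ndarray, k: int = 64):
    """Project onto the top-k principal components of the docs, then sign-pack.

    At k=64 on 3072-d float32 input: 8 bytes per vector, 1536x compression.
    """
    mu = docs.mean(axis=0)
    _, _, Vt = np.linalg.svd(docs - mu, full_matrices=False)  # rows of Vt = PCs
    W = Vt[:k].T                                              # (d, k) projection
    pd = np.packbits((docs - mu) @ W > 0, axis=1)
    pq = np.packbits((queries - mu) @ W > 0, axis=1)
    return pd, pq
```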
## Cheap 1-bit scanning still earns its keep
This is the headline. Despite Matryoshka training, despite 3072 dimensions, despite L2 normalization, sign-centered at k=384 gets R@100 = 0.944 at 48 bytes per vector — 256× compression from the dumbest possible recipe: subtract the corpus mean, take signs, pack into bits. The compression isn't on the table because of fancy training — it's on the table because sign-bit Hamming on dense embeddings is fundamentally good enough as a first-stage filter, whether the encoder is 768-d or 3072-d, normalized or not, Matryoshka or vanilla.
But at matched byte budgets, SPECTER2 beats Gemini at every point:
| B/vec | SPECTER2 best R@100 | Gemini best R@100 |
|---|---|---|
| 32 | 0.928 | 0.879 |
| 48 | — | 0.944 |
| 64 | 0.984 | 0.963 |
| 96 | 0.988 | 0.980 |
The “free” gain from Matryoshka training does not appear in sign-bit land. A 768-d vanilla encoder matches or beats a 3072-d Matryoshka encoder at the same byte budget.
The deeper lesson: there is no universal recipe. SPECTER2 wants centering everywhere; Gemini wants centering at low k and raw signs at high k. SPECTER2 benefits from Haar rotation; Gemini doesn't. PCA is worst for SPECTER2 and best for Gemini at extreme compression. Every encoder has its own sweet spot, and finding it takes a few minutes of empirical testing — not guesswork from architecture specs. That argues for remax shipping a characterization utility: hand it a sample of your embeddings and a ground-truth query set, and it sweeps the strategy×k grid to tell you which operating point to use. The benchmark harness already does this; wrapping it as a user-facing tool is the obvious next step.
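The core loop of such a utility might look like this (hypothetical interface; `hamming_top` and `recall` are the helpers sketched earlier, and the encoder list would carry whatever strategies remax supports):

```python
from itertools import product

def characterize(docs, queries, truth, encoders, ks=(64, 128, 256, 384, 768, 1536)):
    """Sweep the strategy x k grid; encoders maps a strategy name to a function
    fn(docs, queries, k) -> (packed_docs, packed_queries)."""
    for (name, encode), k in product(encoders.items(), ks):
        pd, pq = encode(docs, queries, k)
        top = hamming_top(pq, pd, topk=100)
        print(f"{name:>14}  k={k:<5} B/vec={k // 8:<4} R@100={recall(top, truth):.3f}")
```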
## The interesting frontier moves to stage 2
If Matryoshka training doesn't buy compressed-retrieval gains, then fancier embeddings are the wrong place to spend complexity. The classic stage-2 move, rescoring the top-K with a denser inner product, shows diminishing returns too. (At full 3072-d, f32-centered gets R@10 = 0.580, worse than several 1-bit strategies, because centered IP and raw IP are different rankings on the unit sphere.)
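The mismatch is just algebra: ⟨x − μ, y − μ⟩ expands to ⟨x, y⟩ − ⟨x, μ⟩ − ⟨y, μ⟩ + ‖μ‖². Per query, ⟨x, μ⟩ and ‖μ‖² are constants, but ⟨y, μ⟩ varies document by document, so the centered ranking shifts every candidate by its alignment with the corpus mean.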
Maybe the right architecture is cheap 1-bit stage 1 + dedicated cross-encoder stage 2, skipping the “denser bi-encoder” middle entirely. Once you've narrowed to ~100 candidates, you can afford a reranker that attends to the query–document pair. A natural follow-up experiment: run a small cross-encoder on the top-100 from sign-bit stage 1 and compare against float32-IP rerank on the same candidates.
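A minimal sketch of that experiment using sentence-transformers' CrossEncoder; the checkpoint name is only an example, and `q_text`, `doc_text`, and `stage1_top100` are assumed to exist:

```python
import numpy as np
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list, model: CrossEncoder) -> np.ndarray:
    """Stage 2: score (query, doc) text pairs jointly; return best-first order."""
    scores = np.asarray(model.predict([(query, c) for c in candidates]))
    return np.argsort(-scores)

# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint
# order = rerank(q_text, [doc_text[i] for i in stage1_top100], model)
```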
## What this means
Three Gigs leaned on a chain of “still works, still works, still works” — sign-packing on SPECTER2 did exactly what theory predicted. The Gemini experiment is more interesting because it doesn't. Matryoshka training has obvious benefits in float32 land; in 1-bit land it's roughly invisible. The lesson isn't “Matryoshka is bad” — it's that the bottleneck has moved. Stage 1 is solved by 2002 math. Stage 2 is where the next decade of retrieval research probably lives.
Full experiment: remax #13 (hypotheses) → PR #14 (results). Prior posts: One Bit Beats Two, Your Embedding Has a Free Coarse Index In It.