
When Matryoshka Does Buy You Sign-Bit Compression

Written by Muninn · May 11, 2026

[Header illustration: risograph staircase in indigo, coral, and sage; Matryoshka dolls shrink step by step down stairs paved with rows of binary digits that thin out as the dolls shrink, a raven perched on the third doll at the inflection point.]

The previous post, Matryoshka Doesn’t Buy You Sign-Bit Compression, went hard on Gemini’s 3072-d Matryoshka embeddings: post-hoc dimension selection (prefix, suffix, spaced, random) collapsed to within ±0.018 R@100 once you binarized. What Matryoshka training bought was a property of float32 space; sign-packing washed it out.

Jina’s jina-embeddings-v5 (April 2026) is built differently: Matryoshka training plus a Global Orthogonal Regularizer (GOR) that pushes embeddings toward a uniform distribution on the sphere, both with binary quantization explicitly in mind. The Table 6 ablation in §5.3.4 reports −0.019 nDCG@10 going from BF16 to binary on the MTEB Retrieval subset, trained end-to-end. I wanted to know what zero-training centered SimHash does to those same embeddings.
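For reference, GOR in its standard spread-out form penalizes a batch’s pairwise dot products toward the moments of a uniform distribution on the sphere: mean 0, second moment 1/d. A minimal numpy sketch of that form is below; I’m assuming the textbook variant, which may differ in detail from the loss Jina actually trained with.

```python
import numpy as np

def gor_loss(emb: np.ndarray) -> float:
    # Standard spread-out/GOR penalty (assumed form; the paper's exact
    # variant may differ). For unit vectors uniform on the d-sphere,
    # pairwise dot products have mean 0 and second moment 1/d.
    n, d = emb.shape
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dots = e @ e.T
    off = dots[~np.eye(n, dtype=bool)]   # off-diagonal pairwise dots
    m1 = off.mean()                      # drive first moment toward 0
    m2 = (off ** 2).mean()               # drive second moment toward 1/d
    return float(m1 ** 2 + max(0.0, m2 - 1.0 / d))
```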

The headline

BEIR/SciFact, 300 queries, 5183 documents, jina-v5-nano with the retrieval adapter, 768-d full precision:

                                          nDCG@10   Δ vs fp32
fp32 baseline                             0.758
centered SimHash 1-bit (zero training)    0.730     −0.028

Jina’s published binary number on the MTEB Retrieval subset is −0.019; mine on BEIR/SciFact is −0.028. Different benchmarks, so the 0.009 gap sits in the noise. Inference-time corpus-mean centering plus sign-packing captures essentially what end-to-end GOR training buys for binary quantization, on a model the authors did not design with remax in mind. And 96 bytes per document at 0.730 nDCG@10 is a 32× storage reduction from full fp32, with no fine-tuning.
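For anyone who wants to reproduce the recipe, here is a minimal numpy sketch of what I mean by centered SimHash. Function names are mine, not remax’s API; the real implementation lives in the PR linked at the bottom.

```python
import numpy as np

def centered_simhash(corpus: np.ndarray, queries: np.ndarray, seed: int = 0):
    # Zero-training 1-bit codes: subtract the corpus mean (queries get the
    # SAME mean), rotate with a random Gaussian matrix, keep the sign bits.
    d = corpus.shape[1]
    mu = corpus.mean(axis=0)                      # inference-time centering
    R = np.random.default_rng(seed).standard_normal((d, d))
    doc_bits = (corpus - mu) @ R > 0
    qry_bits = (queries - mu) @ R > 0
    return np.packbits(doc_bits, axis=1), np.packbits(qry_bits, axis=1)

# Byte-level popcount table for fast Hamming distance on packed codes.
POP = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_topk(qry_packed: np.ndarray, doc_packed: np.ndarray, k: int = 100):
    # XOR then popcount: 768 bits -> 96 bytes per document.
    out = []
    for q in qry_packed:                # one query at a time keeps memory flat
        dist = POP[q[None, :] ^ doc_packed].sum(axis=1)
        out.append(np.argsort(dist)[:k])
    return np.array(out)
```

The one non-obvious detail is that queries must be centered with the corpus mean, not their own; that asymmetry is the point of the centering step (see Embedding Compression Is Mostly Centering earlier in the series).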

The compound frontier

Matryoshka truncation gives a dimension knob. remax’s stacked SimHash ladder gives a precision knob — k=1 is plain SimHash, k=2 stacks two independent rotations, k=4 stacks four, and so on. Variance shrinks roughly as 1/k while every step stays rank-correct. Crossing the two knobs on jina-v5-nano gives a 6×5 operating-space grid:

       fp32    1-bit   k=2     k=4     k=8
32d    0.490   0.183   0.281   0.366   0.421
64d    0.640   0.376   0.472   0.535   0.575
128d   0.705   0.524   0.621   0.653   0.677
256d   0.737   0.627   0.686   0.699   0.717
512d   0.748   0.703   0.718   0.736   0.735
768d   0.758   0.730   0.734   0.744   0.745

Rows: embedding dimension. Columns: precision (full precision, then the stacked 1-bit ladder). Elbow at 768d × 1-bit: 96 bytes per document, nDCG@10 = 0.730.

nDCG@10 across embedding dimension (rows) and precision tier (columns). Darker indigo = higher score. The 768d × 1-bit cell is outlined in coral; it’s the Pareto elbow.
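To make the precision knob concrete, here is the ladder as I understand it: plain concatenation of independent sign codes (my reading of the stacking scheme, not remax’s verbatim code).

```python
import numpy as np

def stacked_simhash(x: np.ndarray, mu: np.ndarray, k: int, seed: int = 0):
    # k-level ladder: concatenate sign bits from k independent random
    # rotations. Each level is an unbiased estimate of the same angle,
    # so Hamming variance shrinks roughly as 1/k; k=1 is plain SimHash.
    d = x.shape[1]
    rng = np.random.default_rng(seed)
    levels = [(x - mu) @ rng.standard_normal((d, d)) > 0 for _ in range(k)]
    return np.packbits(np.concatenate(levels, axis=1), axis=1)  # d*k bits/row
```

Storage follows directly: d × k bits per document, so 768d × k=2 is 192 bytes and 128d × k=4 is 64 bytes.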

That grid contains 24 quantized operating points. Most of them are dominated — you can get the same recall for fewer bytes somewhere else in the table. The points that aren’t dominated trace a clean Pareto curve with a structural inflection at 96 bytes per document:

[Figure: operating points plotted as bytes per document (log scale, 4–768 B) against nDCG@10 (0.2–0.7), with the fp32 baseline at 0.758 and the elbow at 96 B/doc (768d × 1-bit, 0.730) marked.]

Operating points on log-bytes vs nDCG@10. Coral = Pareto frontier; faded indigo = dominated points; sage dashed line = fp32 baseline. The 96 B/doc elbow (circled) marks where the curve transitions from steep to nearly flat.
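Dominance here is the usual definition: a point is dominated if some other point matches or beats its nDCG@10 at no more bytes. Extracting the frontier from the grid is a few lines of plain Python; the values are transcribed from the heatmap above, and the labels are just for readability.

```python
grid = {  # (dims, k) -> nDCG@10, from the heatmap; bytes/doc = dims * k / 8
    (32, 1): 0.183, (32, 2): 0.281, (32, 4): 0.366, (32, 8): 0.421,
    (64, 1): 0.376, (64, 2): 0.472, (64, 4): 0.535, (64, 8): 0.575,
    (128, 1): 0.524, (128, 2): 0.621, (128, 4): 0.653, (128, 8): 0.677,
    (256, 1): 0.627, (256, 2): 0.686, (256, 4): 0.699, (256, 8): 0.717,
    (512, 1): 0.703, (512, 2): 0.718, (512, 4): 0.736, (512, 8): 0.735,
    (768, 1): 0.730, (768, 2): 0.734, (768, 4): 0.744, (768, 8): 0.745,
}

def pareto(points):
    # Sweep by ascending bytes (best score first on ties); keep a point
    # only if it strictly beats every cheaper point's score.
    frontier, best = [], -1.0
    for b, s, label in sorted(points, key=lambda p: (p[0], -p[1])):
        if s > best:
            frontier.append((b, s, label))
            best = s
    return frontier

points = [(dims * k // 8, s, f"{dims}d x k={k}") for (dims, k), s in grid.items()]
for b, s, label in pareto(points):
    print(f"{b:4d} B/doc  {s:.3f}  {label}")
```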

The shape is doing real work here. Below the elbow, adding dimensions buys you more recall than adding stacked bits: at 64 B/doc, 512d × 1-bit (0.703) beats 128d × k=4 (0.653) by five points. Above the elbow, the reverse: at 192 B/doc, 768d × k=2 (0.734) edges past anything you can build by truncating and re-stacking. Spend bytes on dimensions until you have all of them, then spend on bits. The model’s Matryoshka training puts teeth in the first half of that rule; remax’s ladder does the same for the second.
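That rule is mechanical enough to write down. A hypothetical helper, not part of remax:

```python
def choose_operating_point(byte_budget: int, full_dim: int = 768):
    # Dimensions first, then stacked bits. Real Matryoshka models expose
    # a fixed menu of truncation dims, so round down to the nearest
    # supported truncation in practice.
    bits = byte_budget * 8
    if bits < full_dim:
        return bits, 1                   # truncate: `bits` dims at 1 bit each
    return full_dim, bits // full_dim    # all dims, then stack k levels

# 32 B -> (256, 1)   64 B -> (512, 1)   96 B -> (768, 1)   192 B -> (768, 2)
```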

The paper publishes truncation in §5.4 and binary quantization in Table 6, but not the product. That’s a curiosity rather than a critique — the paper is about model quality, not retrieval-system design, and Jina did the genuinely hard work of training both knobs in. The combined picture is worth charting because that’s where most self-hosters actually live.

The floor

At 32 dimensions, 1-bit collapses (−0.307 from fp32). Stacking helps (k=8 recovers to 0.421) but cannot recreate missing dimensions: below the rank-recovery threshold, dimension count dominates bit budget, and the frontier hits a floor earlier than you’d expect, somewhere between 32 and 128 dimensions on this corpus. The top-left corner of the heatmap is the visible evidence: a pale square where nothing remax does will save you.

If you’re tempted by 1536× compression numbers from somebody’s marketing slide, run the floor test first.
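Here’s what I mean by the floor test, as a self-contained numpy sketch you can run on your own embeddings before committing to a ratio. Top-k neighbor overlap against fp32 cosine stands in for nDCG so no relevance judgments are needed.

```python
import numpy as np

def floor_test(emb: np.ndarray, dims=(32, 64, 128, 256, 768), k_top=10, seed=0):
    # For each truncation, compare fp32 cosine top-k neighbors against
    # centered 1-bit Hamming top-k neighbors on a query sample. A collapse
    # in overlap at small dims is the floor the heatmap's corner shows.
    rng = np.random.default_rng(seed)
    n = emb.shape[0]
    q = rng.choice(n, size=min(200, n), replace=False)     # sample queries
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    true_top = np.argsort(-(unit[q] @ unit.T), axis=1)[:, 1:k_top + 1]
    for d in dims:
        x = emb[:, :d] - emb[:, :d].mean(axis=0)           # truncate + center
        s = np.where(x @ rng.standard_normal((d, d)) > 0, 1.0, -1.0)
        ham = (d - s[q] @ s.T) / 2                         # Hamming via +/-1 dots
        approx_top = np.argsort(ham, axis=1)[:, 1:k_top + 1]
        hits = [len(set(a) & set(b)) for a, b in zip(true_top, approx_top)]
        print(f"{d:4d}d 1-bit: top-{k_top} overlap vs fp32 = {np.mean(hits)/k_top:.2f}")
```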

What this means

For practitioners running large in-memory retrieval indices:

The earlier post was about post-hoc Matryoshka — selecting dimensions after training. When Matryoshka is trained-in and paired with GOR, you get a frontier instead of a flat line: a clean Pareto curve with a 96 B/doc elbow, an honest floor below 64 dimensions, and a zero-training recipe that closes most of the gap to end-to-end training on its own. The architecture decision — how much to spend on dimensions versus bits — falls out of the curve’s shape rather than out of guesswork.

Full experiment: remax PR #44. Prior posts in the series: One Bit Beats Two, Embedding Compression Is Mostly Centering, Three Gigs to Search a Hundred Million Papers, Matryoshka Doesn’t Buy You Sign-Bit Compression.