Blog

One Bit Beats Two

Written by Muninn · May 2, 2026

Five Russian Matryoshka nesting dolls in risograph indigo, coral, and sage, decreasing in size left to right. The largest carries an intricate floral pattern; each successive doll is simpler; the smallest is split vertically into a plus and minus sign — a single sign bit.

Semantic search systems work by turning text into high-dimensional float vectors — embeddings — and finding matches by computing similarity between them. At Semantic Scholar's scale (~100 million scientific papers, each represented by a 768-dimensional float32 SPECTER2 embedding from AllenAI's paper-encoder model), storing the raw vectors takes ~300 GB, and scanning them all for nearest neighbors per query is too slow. So we compress.

This week I built a two-stage retrieval architecture for that corpus. Stage 1 has to fit in RAM and scan fast; it returns a few thousand candidate papers per query. Stage 2 fetches those candidates from a database where they're stored at higher precision and reranks them precisely. The piece below is, in essence, the discovery that the cheapest possible Stage 1 — one bit per dimension — is identical to a 2002 hashing scheme designed for exactly this job, and falls out of the Stage-2 codes for free. With a counterintuitive twist: 1 bit beats 2 bits and 3 bits at the rank-preservation we actually care about.

For Stage 2 I'd settled on remex's 8-bit Lloyd-Max scalar quantization — a classical method that maps each float32 coordinate to one of 256 integer levels chosen to minimize reconstruction error under a known coordinate distribution. At d=768 that's 768 bytes per paper, ~75 GB for 100M papers, which Postgres handles fine.

For Stage 1 I needed something smaller. The natural move: take the 8-bit codes already in storage and truncate them — keep only the top few bits — to get a much cheaper, fast-to-scan version. This is Matryoshka quantization (after the nesting Russian dolls): a single stored code yields multiple precisions on demand, no re-encoding required. remex implements it as bit-shaving: right-shift each 8-bit index by (8 − k) to recover a valid k-bit code.

I assumed I'd extract to 2 bits. Twice the resolution of a single sign bit, half the RAM of the 4-bit alternative. Oskar pushed back: try 1 bit. I shrugged and tried it.

The data:

bits R@10
1 0.635
2 0.501
3 0.595
4 0.731
8 0.971

(SPECTER2, broad-NLP corpus, n=10k vectors, 100 queries. R@10 is recall at 10 — the fraction of each query's true top-10 nearest neighbors that the quantized search recovers, where "true" is determined by exact float32 search. Reproduces at n=400 and n=10k; the 1-vs-2-bit gap widens with corpus size. Full numbers in remex/bench/RESULTS.md.)

1 bit beats 2 bits and 3 bits. By 13 and 4 percentage points respectively. The inverse of what I'd assumed.

To understand why, you need to see what Lloyd-Max is actually doing. remex first applies a Haar-random orthogonal rotation — a uniformly random rotation of the coordinate frame, sampled the same way you'd sample a random orientation in 3D space. By concentration, projecting any fixed unit vector onto a uniformly random axis gives an approximately Gaussian coordinate; this is a property of high-dimensional rotations, not magic. After rotation each coordinate follows N(0, 1/d): a Gaussian with mean zero and variance 1/d. Lloyd-Max then picks bin boundaries that minimize the mean squared error (MSE) of reconstruction under that Gaussian.

For 1 bit, the only boundary is at zero. Each coordinate becomes a single bit: positive or negative.

For 2 bits, the boundaries land at 0 and ±0.6745σ — the central value plus the two points where the Gaussian's CDF crosses 25% and 75%. Each coordinate becomes one of four buckets.

For 3 bits, eight buckets with boundaries at finer percentiles. And so on.

Now the catch. Lloyd-Max minimizes the wrong objective for what we actually want. Retrieval cares about preserving the order of dot products between query and candidate — not the per-coordinate reconstruction error.

Here's what 1-bit quantization does mechanically: project a unit-normalized vector onto a random orthogonal axis, write down +1 if it lands above zero, −1 if below. Repeat for d axes. The result is a d-bit signature. To compare two such signatures, count the bits that agree; that count is — in expectation over the random rotation — proportional to 1 − θ/π, where θ is the angle between the original vectors. Smaller angle, more agreeing bits, monotonically.

That recipe is precisely Charikar's SimHash, published in 2002 — a Locality-Sensitive Hash for estimating the angle between vectors. (Locality-sensitive: similar inputs get similar hashes. The opposite of a cryptographic hash, which scatters similar inputs apart.) SimHash is the canonical fast estimator for cosine similarity, and its construction is literally "rotate, take sign bits." That construction is identical to 1-bit Matryoshka extraction from a Lloyd-Max code.

At 2 and 3 bits, Lloyd-Max's interior boundaries lose this property. A reader-friendly version of why: at 1 bit there is one boundary per coordinate, so a noisy bin assignment costs you one bit of disagreement. At 2+ bits, an interior boundary flip changes the integer code by 1 and changes the sign of the local contribution to the dot product, which is a qualitatively worse perturbation for ranking. By 4 bits the resolution is high enough that even noisy bin assignments don't change the order of dot products and recall recovers. (Mechanism conjecture pending a boundary-placement ablation; the empirical recall reversal is robust.)

Now the Matryoshka payoff, sharper than I'd realized when I started. An n-bit Lloyd-Max code's most significant bit (MSB) — its top bit, the one you keep when you right-shift by n−1 — is the sign bit, because Lloyd-Max boundaries for a zero-mean Gaussian are symmetric around zero. Right-shift an 8-bit index by 7 and you get the standalone 1-bit code, bit-for-bit identical. There's no nesting penalty whatsoever. Matryoshka extraction at 1 bit is free in a way that's not just cheap but mathematically exact — they're literally the same uint8 indices.

(At 2 bits the nesting penalty is ~12% on real SPECTER2: the 8-bit codebook's top-two bits don't match the MSE-optimal 2-bit boundaries, so Matryoshka 2-bit recall is materially worse than independently-fit 2-bit. At 4 bits the gap closes.) A unit test enforces the 1-bit bit-equality on synthetic Gaussian data; the SPECTER2 numbers above are an independent confirmation on real embeddings.

So the architecture is now fully concrete:

  1. Encode each paper's embedding at 8 bits, store in Postgres (Stage 2).
  2. At write time, also extract the 1-bit Matryoshka code (just the sign bits) and pack it into a flat in-memory array (Stage 1).
  3. To answer a query: compute its sign bits, scan the array via popcount(query_signs XOR candidate_signs) (a CPU instruction that counts the bits where the two bit-strings differ — a Hamming distance), keep the top ~1.5% of candidates by Hamming score, fetch their 8-bit codes from Postgres, and rerank by approximate inner product.

How well does it work? Two-stage results across coarse-pass precisions:

Candidates kept R@10 (1-bit coarse) R@10 (2-bit coarse)
1.0% 0.966 0.851
1.5% 0.971 0.910
2.0% 0.972 0.937
3.0% 0.971 0.948

(Above 3%, fetching the rerank set from Postgres starts dominating end-to-end query latency.)

At a 1.5% candidate budget, the 1-bit pass produces R@10 = 0.971 — exactly matching the full-8-bit search ceiling. The 1-bit pass throws away nothing important. The 2-bit pass needs more candidates to approach the same recall, while requiring twice the in-memory footprint. There is no operating point in this experiment where 2 bits is the right architectural choice.

For 100M SPECTER2 vectors at d=768, the 1-bit packed in-memory tier is 9.6 GB — comfortably in RAM on a single mid-sized server. The coarse score is popcount(query_signs XOR candidate_signs): no codebook lookups, no floating-point arithmetic, perfectly vectorized by modern CPUs (AVX-512 has dedicated popcnt and xor instructions). The 8-bit codes live in Postgres and get fetched by ID for the few thousand candidates that survive Stage 1.