Three Gigs to Search a Hundred Million Papers
The first two posts in this series — One Bit Beats Two and Your Embedding Has a Free Coarse Index In It — established that you can extract a 768-bit binary signature from any dense embedding by centering and taking sign bits. On SPECTER2 — AllenAI's embedding model for scientific papers — that signature recovers 98.8% of the true top-10 at 1% scan depth. No library, no training.
768 bits is 96 bytes per vector. At Semantic Scholar's scale of ~100 million papers with SPECTER2 embeddings — the subset of its 225M paper records that have abstracts — that's 9.6 GB. Comfortably in RAM, but can we do better?
The answer: cut dimensions, not just bit depth. Truncate the centered embedding to its first 256 dimensions before taking signs, and you get 32 bytes per vector. 100 million papers × 32 bytes = 3.2 GB. R@100 stays at 0.926.
A caveat: SPECTER2 is not trained with Matryoshka Representation Learning, where the encoder is optimized so any prefix is a valid shorter embedding. We're throwing away 2/3 of the dimensions of a standard encoder. The experiment measures what that costs.
The experiment
Can you throw away dimensions post-hoc — without retraining the encoder — and still get a useful coarse index? I tested five strategies for producing k-bit binary signatures from 10,000 SPECTER2 embeddings, at bit budgets from 64 to 768, against float32 inner-product ground truth. 256 is the sweet spot: below that, recall drops too fast; above it, diminishing returns.
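Throughout, R@N means the fraction of the true float32 top-10 that survives into the top-N candidates by Hamming distance. A one-liner to pin that down (the helper name is mine, not from the bench script):

```python
def recall_at_n(hamming_ids, true_top10, n):
    """Fraction of the true float32 top-10 found among the top-n Hamming candidates."""
    return len(set(hamming_ids[:n]) & set(true_top10)) / len(true_top10)
```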
The table at the operating point (k=256, 32 bytes/vec, 96× compression):
strategy        R@10    R@100
f32-raw         0.420   0.882    ← truncation alone
f32-centered    0.464   0.943    ← centering recovers +0.061
sign-raw        0.336   0.798    ← truly free, no centering
sign-centered   0.468   0.926    ← the recipe
haar-trunc      0.487   0.928    ← Haar rotation adds +0.002
sign-centered is the free coarse index recipe applied to the first 256 dimensions: compute mu = corpus.mean(axis=0) once over all corpus vectors, then for each vector (corpus or query) subtract mu, take signs, pack into bits. The corpus mean is a single 768-d vector — the only thing you need to store beyond the codes themselves. haar-trunc applies the Haar random rotation from remex before truncating — the same rotation that One Bit Beats Two showed makes every coordinate equally informative. It helps, but not by much — the encoder's native basis is already close to uniform.
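A minimal NumPy sketch of the sign-centered recipe (function names are mine; see the bench script for the real thing):

```python
import numpy as np

def build_index(corpus: np.ndarray, k: int = 256):
    """corpus: (n, 768) float32 SPECTER2 embeddings -> (n, k//8) uint8 codes + mu."""
    mu = corpus.mean(axis=0)                 # the one vector stored besides the codes
    bits = (corpus[:, :k] - mu[:k]) > 0      # center, truncate to k dims, take signs
    return np.packbits(bits, axis=1), mu

def encode_query(q: np.ndarray, mu: np.ndarray, k: int = 256):
    return np.packbits((q[:k] - mu[:k]) > 0)  # same mu, same recipe, per query
```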
R@100 = 0.926 means the top 100 candidates by Hamming distance (1% of the 10k test corpus) contain 93% of the true top-10. Stage 2 reranks those candidates with full-precision vectors.
A caveat on scale: these numbers come from 10,000 embeddings — 0.01% of the full Semantic Scholar corpus. That's enough for directional signal, but recall behavior can shift at 100M scale where the nearest-neighbor distribution is denser and the corpus mean is computed over a much larger sample. Treat the specific numbers as indicative, not final.
A caveat on generalizability: all results here are on SPECTER2. The uniform-information-across-dimensions property that makes this work may not hold for all encoders — different architectures, training objectives, and embedding dimensions could produce different variance profiles. We plan to test against other popular encoders (e.g. Google's Gecko) in a follow-up.
Three surprises
Centering is the single biggest lever. At k=64, centering buys +0.324 R@100 — more than rotation, random projection, or any other technique. SPECTER2 has dimensions whose mean is far from zero; without centering, sign(x) is constant on those dimensions and contributes nothing to the Hamming distance.
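A toy demonstration of the failure mode on synthetic data (not SPECTER2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=100_000)  # a dimension whose mean is far from 0

print((x > 0).mean())         # ~1.0: the raw sign bit is constant, carries no information
print((x > x.mean()).mean())  # ~0.5: the centered sign bit splits the corpus evenly
```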
PCA is the worst basis for sign-bit extraction. PCA concentrates variance in the top components, but Hamming distance weights all bits equally; at k ≥ 192, the sign bits of the low-variance tail components are close to random and drown out the signal carried by the head.
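For reference, a sketch of what the PCA variant presumably looks like (my reconstruction, not taken from the bench script):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_codes(corpus: np.ndarray, k: int = 256):
    # Project onto the top-k principal components, then take sign bits.
    # PCA packs the variance into the first components, but Hamming distance
    # weights every bit equally, so the low-variance tail bits are near-random.
    pca = PCA(n_components=k).fit(corpus)
    return np.packbits(pca.transform(corpus) > 0, axis=1)
```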
1-bit centered sign beats float32 centered inner product at k ≥ 256. The ground truth uses raw (uncentered) inner product, but f32-centered computes centered inner product, which produces a different ranking. Hamming on centered signs approximates cosine similarity of the centered vectors, and that turns out to be a better proxy for raw inner product than centered inner product is.
The architecture
100 million papers × 32 bytes = a 3.2 GB file. mmap it. XOR and popcount each compile to single CPU instructions; a single-threaded brute-force scan returns the top-100 candidates in well under a second. Stage 2 reranks those candidates against full embeddings in S3, Postgres, or S3 Vectors.
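A single-threaded sketch of stage 1 in NumPy (the file name and row layout are assumptions):

```python
import numpy as np

# Byte-wise popcount lookup: POP[b] = number of set bits in byte b.
POP = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)

# codes.bin: n rows of 32-byte packed signatures; mmap loads pages on demand.
codes = np.memmap("codes.bin", dtype=np.uint8, mode="r").reshape(-1, 32)

def top100(query_code: np.ndarray) -> np.ndarray:
    # At 100M rows you would chunk this scan; one shot is fine as a sketch.
    dist = POP[np.bitwise_xor(codes, query_code)].sum(axis=1)  # Hamming per paper
    return np.argpartition(dist, 100)[:100]                    # candidate ids for stage 2
```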
The binary codes are the index. The corpus mean is one vector. The entire system is two files and a subtraction.
Reproducer
The experiment is in remax PR #11. Clone, fetch the SPECTER2 cache, and run python bench/sketch_matryoshka.py; it reproduces in ~10 seconds on a laptop. The architecture direction is tracked in remax #12.
One final test: since SPECTER2 is a SciBERT-based model where the 768 dimensions are just the hidden states of the last transformer layer, there's no reason the first k should be more informative than any other k. We verified this directly — random selection of 256 dimensions, suffix (last 256), and evenly-spaced all perform within noise of prefix truncation:
index selection      R@10    R@100   (k=256, n=10k)
prefix (0..255)      0.468   0.926
suffix (512..767)    0.463   0.937
random (mean, n=5)   0.470   0.935
evenly-spaced        0.470   0.918
The prefix convention is just a convention. Any 256 dimensions will do.
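For the record, a sketch of the four selection schemes above (names and helper are mine; all feed the same sign-centered pipeline):

```python
import numpy as np

d, k = 768, 256
rng = np.random.default_rng(0)
selections = {
    "prefix":        np.arange(k),
    "suffix":        np.arange(d - k, d),
    "random":        rng.choice(d, size=k, replace=False),
    "evenly-spaced": np.arange(0, d, d // k),  # every 3rd dimension
}

def codes_for(idx, corpus, mu):
    # the same sign-centered recipe, restricted to the chosen dimensions
    return np.packbits((corpus[:, idx] - mu[idx]) > 0, axis=1)
```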