remax_kb: Hybrid Search in a File
remax is a quantization trick: it squeezes a text embedding down to about a bit per dimension while keeping enough of it to still rank documents by similarity. How that works, and how well, is its own story, told in "one bit beats two" and "three gigs to search a hundred million papers". This post is about what the trick lets you get away with afterward. Compress the embeddings far enough and a corpus's entire search index stops being infrastructure and becomes a small file, small enough to load into memory and search in place, with no server in the loop. remax_kb is the format that does that.
Why search usually needs a server
A search stack needs a server because the index is too big and too live to be anything else. Embeddings are thousands of floats apiece; a corpus of them is gigabytes that have to sit resident in a vector database, with a query engine in front, the whole thing kept running so an incoming query has something to hit. Add the keyword half — an inverted index in its own engine — and a service to fuse the two. The index isn't data you have; it's a system you operate.
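To put a number on "gigabytes", here is a back-of-the-envelope sketch. The corpus size and embedding width are made-up figures for illustration, not measurements from any real index, and `index_bytes` is a hypothetical helper.

```python
def index_bytes(n_vectors, dims, bits_per_dim):
    """Raw size of a dense index in bytes, ignoring any database overhead."""
    return n_vectors * dims * bits_per_dim // 8

# Hypothetical corpus: one million chunks, 1,024-dimensional float32 embeddings.
float_index = index_bytes(1_000_000, 1024, 32)
print(f"{float_index / 1e9:.1f} GB")  # 4.1 GB
```

That is before the vector database adds its own graph or tree structures on top, and before the keyword engine is counted at all.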
What tiny embeddings change
When the vectors are bits, that pressure is gone. The dense index for this site's ~1,800 chunks is under half a megabyte; the whole hybrid index, keyword side included, is a few megabytes. At that size you don't stand up a service to hold it; you read the file into memory and scan it. The search itself is a Hamming-distance scan over the codes for the semantic side, BM25 over a posting list for exact terms, and the two result lists fused by rank. That's arithmetic over arrays, run by whatever process happened to load the file. The search server was only ever there to hold and serve an index too big to keep in hand; take the size away and it has no job left.
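A minimal sketch of that arithmetic, in plain Python, might look like the following. It makes several assumptions that are not taken from remax_kb itself. Each code is held as a Python integer rather than a packed byte array. BM25 uses the common Lucene-style idf with the usual default `k1` and `b`. And "fused by rank" is read as reciprocal rank fusion, which is one common choice and not necessarily the one the format uses. All function names are hypothetical.

```python
import math
from collections import defaultdict

def hamming_rank(query_code, codes):
    """Doc ids ordered by Hamming distance to the query, closest first."""
    dist = [bin(query_code ^ c).count("1") for c in codes]
    return sorted(range(len(codes)), key=lambda i: dist[i])

def bm25_rank(terms, postings, doc_len, k1=1.2, b=0.75):
    """Doc ids ordered by BM25 score; postings maps term -> {doc_id: tf}."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len) / n_docs
    scores = defaultdict(float)
    for term in terms:
        docs = postings.get(term, {})
        if not docs:
            continue  # term appears in no document
        idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
        for doc_id, tf in docs.items():
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores, key=scores.get, reverse=True)

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranked list."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Toy index: three 8-bit codes and a two-term posting list (illustrative data).
codes = [0b10110010, 0b10110011, 0b01001100]
postings = {"hamming": {1: 2}, "bm25": {1: 1, 2: 1}}
doc_len = [5, 6, 4]

semantic = hamming_rank(0b10110010, codes)          # [0, 1, 2]
keyword = bm25_rank(["hamming"], postings, doc_len)  # [1]
print(rrf_fuse([semantic, keyword]))                 # [1, 0, 2]
```

Document 1 wins because it places well in both lists, even though document 0 is the exact semantic match. That is the behavior a hybrid search is after: neither side alone decides the order.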
What's in the file
For "read the file and query it" to work anywhere, the file has to be self-contained. A remax_kb index carries the packed codes, the keyword index, the chunk text, and a manifest naming the embedder and the parameters it was packed with — enough that a reader can turn a fresh query into the same kind of vector the documents became, and compare them. The packing belongs to remax and the byte-level details to the spec; the point here is only that nothing the search needs lives outside the file.
The codes you scan are tiny, but the raw chunk text is bulkier, and you only need the handful of chunks you're actually going to show. So the index splits in two: a hot part you load into memory — codes, keyword index, and a small table of where each chunk's text lives — and a cold store of the text itself, left on disk or a CDN and fetched one byte-range at a time, only for the hits. What you hold in memory stays small even as the corpus grows.
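The hot/cold split can be sketched with an offset table and a byte-range read. This is an illustration of the idea only: the `(start, length)` layout, the names, and the use of an in-memory `io.BytesIO` as a stand-in for the cold store are all assumptions, and the real byte-level layout is whatever the spec says.

```python
import io

def fetch_chunk(cold_store, offsets, chunk_id):
    """Read one chunk's text from the cold store via its (start, length) entry."""
    start, length = offsets[chunk_id]
    cold_store.seek(start)
    return cold_store.read(length).decode("utf-8")

# Build a toy cold store: chunk texts concatenated, plus the table locating them.
chunks = ["first chunk", "second chunk, a bit longer", "third"]
blob = bytearray()
offsets = []  # the small table that stays in memory with the codes
for text in chunks:
    data = text.encode("utf-8")
    offsets.append((len(blob), len(data)))
    blob += data

cold_store = io.BytesIO(bytes(blob))  # stands in for a file on disk or a CDN object
print(fetch_chunk(cold_store, offsets, 1))  # second chunk, a bit longer
```

Against a CDN the same table entry would become an HTTP range request: a chunk at `(start, length)` maps to the header `Range: bytes=start-(start + length - 1)`, since the end of an HTTP byte range is inclusive.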
The whole thing, running
The search on this site is exactly this; how it's wired is written up separately. The index is two small blobs in a key-value store; a stateless worker reads them into memory, embeds the query, runs the math, and answers. Nothing runs between searches, because there is no index to keep alive, only a file small enough to pick up when a query arrives.