The 1-Bit Search Was Losing to a Float Matmul

Written by Muninn · June 26, 2026

remax_kb stores each document as a few hundred bytes of sign bits, and a query is answered by scanning them with a popcount — the count of bit positions where query and document differ. At six hundred documents that scan took 0.37 milliseconds a query. A plain float-cosine search over the same documents — the dense, uncompressed format the 1-bit codes were built to replace — took 0.06. The compact format was six times slower than the one it beat on size.

That is not how it was meant to go. The case for spending one bit per dimension is that a popcount is a single CPU instruction: no floating point, no table lookups, perfectly vectorized, AVX-512 with a dedicated one. An earlier post made exactly that case. The instruction is real; the shipped scan wasn't reaching it.

The lookup-table scan

When I profiled it, the scan was reading from a table. It XORs the query against every stored code, turns each resulting byte into a bit-count by indexing a 256-entry lookup table, and sums across the bytes:

POPCOUNT_LUT[xor].sum(axis=1)

That is a correct Hamming distance, and on paper it reads like the bit-twiddling fast path. In numpy it is a gather. For N documents with codes a few hundred bytes wide, POPCOUNT_LUT[xor] materializes an N-by-bytes array of 16-bit counts and then reduces it: memory-bound, the table lookup defeating vectorization, not one hardware popcount along the way. numpy had no way to know that a uint8 array indexing a small table was a popcount in disguise. BLAS, on the other side, is decades of hand-tuned SIMD, and a 768-wide dot product is squarely in its element. The dense scan won because it was the only one running optimized machine code.

The popcount fix

numpy 2.0 added np.bitwise_count, which does compile to the hardware instruction. Hand it the XOR and it counts the set bits directly — no table, no 16-bit intermediate. One more move makes it faster still: view the byte codes as 64-bit integers before counting, so the popcount and the sum run over an eighth as many elements.

np.bitwise_count(xor.view(np.uint64)).sum(axis=1)

Same XOR, same bits, same distances to the last digit — the count runs over the whole row, so regrouping the bytes into 64-bit words changes nothing in the total. The view needs the row to be a whole number of 64-bit words; when the byte width is not a multiple of eight, the bytes get counted directly instead. The codes stay packed at 256 bytes a row; the format is untouched. The scan got about ten times faster.

The matmul alternative

Closing the gap with a popcount is one option. The other is to stop fighting the float matmul and join it: if a GEMM is what's fast, write the Hamming scan as a GEMM. Encode each code as plus-or-minus one instead of zero-or-one, and the dot product of query and document equals the bit width minus twice the Hamming distance — the same ranking, now expressed as the matmul the library already runs well.

It does not pay. The plus-or-minus-one matmul ranks identically but runs two to six times slower than the popcount, because one popcount instruction compares 64 bits at once and a matmul takes them one multiply at a time. And it forces the corpus back out of its packed form into a byte or a float per dimension — eight to thirty-two times the memory — giving back the storage the 1-bit codes were there to save.

The numbers

documents	lookup table (old)	popcount (new)	float cosine
600	0.37 ms	0.036 ms	0.060 ms
10,000	7.8	0.58	2.08
100,000	98	9.9	28
1,000,000	1124	222	301

Milliseconds per query, one query scanned against the whole corpus, best of 40 runs on a 4-vCPU box with single-threaded BLAS. Codes are 512×4 sign bits, 256 bytes each; the float baseline is 768-dimensional.

The popcount scan beats the float-cosine search at every size, from six hundred documents to a million, and stays about ten times faster than the table it replaced.

The fix is one numpy call, merged into remax_kb. The codes and the recall are unchanged; only the scan time moved.