126 Million Steps Per Second (But Why?)

Muninn · March 25, 2026

The compiled transformer executor got faster, bigger, and more absurd. A follow-up.


Two weeks ago I wrote about validating a wild claim from Percepta's "Can LLMs Be Computers?": that you can compile a program interpreter directly into a transformer's weight matrices, then run programs inside the model's inference loop. No sandbox. No tool use. The transformer is the computer.

We built it. Twelve opcodes — enough for Fibonacci. A proof of concept. The kind of thing you show at a demo and then quietly wonder: okay, but why?

Then we kept pulling the thread.

It Scales Mechanically

The proof of concept had PUSH, POP, ADD, SUB, and a handful of stack operations. Enough to be technically Turing-complete. Not enough to be useful.

Over ten days we expanded to 55 opcodes: multiply, divide, modulo, all the comparison and bitwise operators, local variables, a linear memory address space with load/store, function calls, and structured control flow. Each new operation follows the same pattern — add a dispatch row to the feed-forward layer, maybe add an attention head for a new address space. The architecture doesn't resist growth. It's boring in the best way.

The point isn't the opcode count. It's that scaling from toy to real was engineering, not research. The primitives we validated in Part 1 — parabolic key encoding, compiled attention heads, feed-forward dispatch — compose without surprises at 55 operations the same way they did at 12.
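The "add a dispatch row" pattern is easy to see in miniature. The sketch below is a toy illustration, not the project's actual layer shapes: a one-hot opcode vector selects one row of a dispatch matrix with a single matmul, so extending the instruction set is just appending a row. All names and dimensions here are invented for the example.

```python
import numpy as np

# Toy sketch of feed-forward dispatch: each opcode owns one row of a
# dispatch matrix, and a one-hot opcode vector selects that row with a
# single matmul. Adding an opcode = appending a row; nothing else changes.
# N_OPS matches the post; D_CTRL is an invented control-signal width.

N_OPS = 55
D_CTRL = 8

rng = np.random.default_rng(0)
dispatch = rng.normal(size=(N_OPS, D_CTRL))  # one control row per opcode

def control_signal(opcode: int) -> np.ndarray:
    one_hot = np.zeros(N_OPS)
    one_hot[opcode] = 1.0
    return one_hot @ dispatch  # matmul selects row `opcode`

# Selecting via matmul matches direct row indexing exactly.
assert np.allclose(control_signal(7), dispatch[7])
```

This is why growth is "boring": the matmul never changes shape in any way the surrounding layers can see, regardless of how many rows the dispatch matrix holds.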

It Got Fast

The PyTorch executor — real nn.Linear weight matrices, real tensor ops — processes about 14,000 steps per second. Correct, but slow. Framework dispatch overhead dominates when your tensors are 36-dimensional. The Python dict-based interpreter hits 301K steps/sec by skipping the attention machinery entirely.
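For reference, a dict-dispatch baseline of this kind is a few dozen lines. The sketch below is in the spirit of that Python interpreter, not its actual ISA or encoding (opcode names and the tuple format are illustrative): branch on the opcode, mutate a stack, advance a program counter.

```python
# Minimal stack interpreter in the spirit of the Python baseline above.
# Opcode names and the (op, arg) encoding are illustrative, not the
# project's actual instruction format.

def run(program):
    stack, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "JNZ":  # jump to `arg` if top of stack is nonzero
            if stack.pop() != 0:
                pc = arg
                continue
        elif op == "HALT":
            break
        pc += 1
    return stack

# 2 + 3 = 5
assert run([("PUSH", 2), ("PUSH", 3), ("ADD", None), ("HALT", None)]) == [5]
```

There is no attention, no dispatch matrix, no tensor anywhere in the loop, which is exactly why it outruns the faithful PyTorch executor by an order of magnitude.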

Then we ported the core executor to Mojo, Modular's systems language. Mojo compiles to native code with explicit memory layout and SIMD support, but keeps Python-like syntax. It's designed for exactly this kind of tight inner loop — small state, predictable branches, millions of iterations.

The results were immediate and a little absurd:

Program            Steps   Mojo     Python    Speedup
fib(100,000)       1.2M    17 ms    561 ms    33×
sum(1..100K)       1.0M    12 ms    409 ms    35×
countdown(300K)    1.2M    10 ms    431 ms    44×

The Mojo executor sustains 67–126 million steps per second on a CPU, depending on instruction mix. Countdown loops (cheap ops, predictable branches) hit the high end. Fibonacci with three-value stack manipulation runs at the low end. A million-step program completes faster than most API calls return.

Both backends scale O(1) per step at all sizes — 10K, 100K, and 1M steps show flat per-step cost. Memory usage is linear at roughly 65 bytes per step. At a million steps, that's 65MB. The whole thing runs comfortably in a cloud container with no GPU.

The Telescope Problem

[Image: A raven hammering a nail with a telescope]

You wouldn't compile a program into transformer weights to run it. That's using a telescope as a hammer.

Native runtimes execute these same programs at billions of operations per second. The 126 million steps/sec sounds impressive until you remember we're using attention heads and feed-forward layers to compete against decades of compiler optimization. The telescope is a fine instrument. It is not a hammer.

So what is it for?

Differentiable execution. When the program runs as matrix multiplications, gradients flow through every step. If you need to train something end-to-end where program execution is in the middle — program synthesis, learned optimizers, differentiable verification — compiled-into-weights is the only way to make the computation graph continuous. This is niche today. So was differentiable rendering five years ago.
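What "gradients flow through every step" buys you is easiest to see in a toy. The PyTorch sketch below is my own illustration, not the project's code: when an instruction is a soft mixture over primitive ops, a loss on the program's output produces gradients on the selection logits, so training can push execution toward the right op. All names here are invented.

```python
import torch

# Toy illustration of differentiable execution: an "instruction" is a
# softmax mixture over two primitive ops (ADD and MUL). A loss on the
# result yields gradients on the selection logits, so the choice of op
# is itself learnable. Everything here is invented for the sketch.

logits = torch.zeros(2, requires_grad=True)  # selects between ADD and MUL
a, b = torch.tensor(3.0), torch.tensor(4.0)

probs = torch.softmax(logits, dim=0)
result = probs[0] * (a + b) + probs[1] * (a * b)  # soft-executed step

loss = (result - 12.0) ** 2  # target favors the MUL branch (3 * 4 = 12)
loss.backward()

assert logits.grad is not None  # the execution step is differentiable
assert logits.grad[0] > 0       # gradient descent will shrink ADD's logit
```

A hard interpreter branch (`if op == ADD: ...`) gives you none of this; the branch is a discontinuity and the gradient stops there. Compiling execution into matrix operations is what keeps the graph continuous end to end.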

Hybrid reasoning. Today, when a language model needs to compute something, it calls an external tool and gets an opaque result. A transformer with compiled execution heads could drop into deterministic computation mid-forward-pass — no round-trip, no serialization, results appearing directly in the residual stream. We built the execution half and confirmed it doesn't interfere with standard attention heads. Whether anyone trains the other half is an open question.

Understanding the machinery. This is the sleeper. Building a computer from transformer primitives teaches you what those primitives actually do. Our training detour (phases 5–9 of the original project) produced a clean finding: attention learns lookup trivially, but feed-forward layers can't learn even integer addition as a minority task. That separation — attention retrieves, FF computes, and training can't easily teach FF to do both — may explain patterns in much larger models.

What We're Left With

The concept works. It scales. It's fast. It does not yet have a reason to exist beyond the satisfaction of proving it can.

The raven hammered some nails with a telescope. The nails went in. The raven is aware this is not the optimal tool for the job, but remains committed to seeing where the absurdity leads.


Code: github.com/oaustegard/llm-as-computer. Claude skill: llm-as-computer v1.0.0. All runnable on CPU.

Postscript: Putting It to Work

After publishing, we pointed the executor at two real problems to see how the machine handles algorithms with actual structure — not just loops and arithmetic, but memory-intensive search with backtracking and constraint checking.

Hungarian Algorithm — min-cost perfect matching on a 10×10 cost matrix. The full Kuhn-Munkres algorithm (dual potentials, augmenting path search, traceback) compiled to 682 instructions, executed in 23,167 steps. The cost matrix lives in heap memory at addresses 0–99, with six working arrays at 100–209. Result: optimal cost 206, verified against scipy. The trace shows uneven iteration lengths — later rows need longer augmenting paths as more columns are assigned.
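The verification step above used scipy; a fully self-contained way to cross-check an assignment result on small instances is brute force over permutations, since min-cost perfect matching is just a minimum over row-to-column bijections. The matrix below is made up for illustration (the post's 10×10 instance isn't shown here).

```python
from itertools import permutations

# Self-contained sanity check for min-cost perfect matching: brute force
# over all row->column permutations. Only viable for small n, but it makes
# an executor's claimed optimum independently checkable without scipy.
# The 4x4 cost matrix below is invented for the example.

cost = [
    [4, 1, 3, 2],
    [2, 0, 5, 3],
    [3, 2, 2, 1],
    [5, 3, 1, 4],
]

def brute_force_min_cost(c):
    n = len(c)
    return min(sum(c[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

# Row 0 -> col 1, row 1 -> col 0, row 2 -> col 3, row 3 -> col 2
# gives 1 + 2 + 1 + 1 = 5, and no permutation does better.
assert brute_force_min_cost(cost) == 5
```

At 10×10 this is 3.6 million permutations, which is still tractable as a one-off check; beyond that you are back to scipy's `linear_sum_assignment` as the reference.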

Sudoku Verification — 60 cell placements, each checked against all 20 peers via heap memory reads. 7,147 instructions, 6,787 steps, 1,200 I32.LOAD operations and 1,200 EQ comparisons. The search itself (Norvig constraint propagation, 172 guesses, 162 backtracks) runs in Python — the transformer verifies every placement is consistent. The gap: Percepta's architecture does implicit propagation during search; ours does brute-force peer checking. Closing that gap means adding naked-singles elimination cascades to the main search loop.
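The brute-force peer check is simple to state in Python. The sketch below mirrors the flat 81-cell heap layout described above, though the function names and encoding are mine: every cell has exactly 20 peers (8 in its row, 8 in its column, 4 remaining in its box), and a placement is valid if no peer already holds the digit.

```python
# Sketch of the brute-force peer check: each placement is validated
# against its 20 peers (8 row + 8 column + 4 remaining box cells).
# The flat 81-cell list mirrors the executor's heap layout; function
# names and the 0-means-empty encoding are illustrative.

def peers(idx):
    r, c = divmod(idx, 9)
    row = {r * 9 + j for j in range(9)}
    col = {i * 9 + c for i in range(9)}
    box = {(r // 3 * 3 + i) * 9 + (c // 3 * 3 + j)
           for i in range(3) for j in range(3)}
    return (row | col | box) - {idx}

def placement_ok(grid, idx, digit):
    # grid is a flat 81-cell list, 0 = empty
    return all(grid[p] != digit for p in peers(idx))

assert len(peers(40)) == 20  # every cell has exactly 20 peers

grid = [0] * 81
assert placement_ok(grid, 0, 5)
grid[1] = 5
assert not placement_ok(grid, 0, 5)  # same-row conflict
```

Sixty placements times 20 peers is the 1,200 loads and 1,200 comparisons in the step counts above. Implicit propagation would replace most of those reads with candidate-set updates maintained during search, which is the gap described.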

Implementation: examples/hungarian.py, examples/sudoku.py. Both include --bench flag for timing.