Fly 2026-05-15 — Muon's Geometry Is Post-Hoc: Three Cracks in Six Months
Muon's stated mechanism, regularized steepest descent under the spectral norm via a linear minimization oracle (LMO), is post-hoc. Three independent papers from the last six months argue this from different angles. None contradicts Muon's empirical wins on frontier-scale training; all three argue that the LMO derivation was told after the fact.
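The spectral-norm LMO in that stated mechanism can be checked numerically. What follows is a minimal NumPy sketch, not any paper's code: it assumes a dense gradient matrix and an exact SVD, and the helper name spectral_lmo is made up for illustration. Over the spectral-norm unit ball, the point with the largest inner product against G should be the polar factor UVᵀ, with the value equal to the sum of G's singular values; the descent update is the negative of that point.

```python
import numpy as np

def spectral_lmo(G):
    """Maximizer of <G, X> over {X : ||X||_op <= 1}, i.e. the polar factor U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))          # stand-in for a gradient matrix
X_star = spectral_lmo(G)

sing_vals = np.linalg.svd(G, compute_uv=False)
best = np.sum(G * X_star)                # <G, X*>, should match the nuclear norm of G

# Compare against random feasible points, each rescaled to spectral norm 1.
trials = []
for _ in range(200):
    X = rng.standard_normal(G.shape)
    X /= np.linalg.norm(X, 2)            # ord=2 on a matrix is its largest singular value
    trials.append(np.sum(G * X))

print(f"<G, UV^T> = {best:.4f}, nuclear norm = {sing_vals.sum():.4f}")
print(f"best random feasible value = {max(trials):.4f}")
```

Random feasible points are only a sanity check, not a proof; the guarantee comes from the von Neumann trace inequality.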
Su, Penn, November 2025. Isotropic curvature model. Assume the Hessian and higher-order curvature are isotropic across perturbation directions and solve for the optimal update matrix. Under a general growth condition, the answer is homogenizing the spectrum — making the singular values closer in ratio. Orthogonalization (Muon's whitening, where every singular value goes to one) is the optimal answer only under a specific phase transition in curvature growth. Su's verdict, verbatim: "directionally correct but may not be strictly optimal." First crack: even staying inside the LMO frame, Muon is a special case rather than the destination.
Aurora, Tilde Research, May 2026. Different angle. On a tall matrix, the polar factor UVᵀ has orthonormal columns but nothing constrains its rows: each row's squared norm is that row's leverage score, and the SVD leaves those as non-uniform as it finds them. In SwiGLU MLP up/gate projections this kills neurons: over 25% of rows have effectively zero leverage by step 500. Their cohort-tracking plot stratifies rows by initial leverage and shows the bottom quartile collapsing to vanishing leverage while the top stabilizes high, a rich-get-richer dynamic. Neuron death is a fixed point of learning, not training noise. Aurora fixes it by projecting onto the intersection of the Stiefel and row-oblique manifolds: orthogonality and uniform row norms simultaneously, at 6% overhead, as a drop-in replacement. Their own framing: better optimizers come from "addressing the concrete dynamics and pathologies that emerge inside real training systems," not from elegant abstractions.
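The row-norm issue is easy to see on a toy matrix. The sketch below is a diagnostic only and is not Aurora's projection or code. It assumes a random tall matrix whose first quarter of rows is scaled down to stand in for low-leverage neurons; the sizes, the 0.01 scale, and the helper name polar_factor are illustrative choices.

```python
import numpy as np

def polar_factor(G):
    """U V^T from the thin SVD: orthonormal columns when G is tall and full rank."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
m, n = 64, 16                            # tall matrix: more rows than columns
G = rng.standard_normal((m, n))
G[: m // 4] *= 0.01                      # assumption: a quarter of the rows carry tiny gradient

P = polar_factor(G)
row_sq = np.sum(P**2, axis=1)            # squared row norms = row leverage scores

print("columns orthonormal:", np.allclose(P.T @ P, np.eye(n)))
print("sum of squared row norms:", round(float(row_sq.sum()), 6), "(equals n)")
print("uniform target per row, n/m:", n / m)
print("largest squared norm among scaled-down rows:", row_sq[: m // 4].max())
print("smallest squared norm among the other rows:", row_sq[m // 4 :].min())
```

Orthogonalization equalizes the singular values but leaves the scaled-down rows with almost no share of the update, which is the asymmetry the row-oblique constraint is meant to remove.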
Shumaylov et al., Cambridge/Tübingen/Oxford, May 2026. The most aggressive of the three. They build Freon, a family (GGᵀ)⁻ᶜ G that interpolates between SGD (c=0), Muon (c=1/2), and a pseudoinverse-like endpoint (c=1); in SVD terms the update is UΣ¹⁻²ᶜVᵀ. For GPT-2 on WikiText-2, the optimal c lands at 2/3 or 3/4. Any c above 1/2 puts a negative exponent on the singular values, which is strictly outside the range of any proper Schatten norm, in a quasi-norm regime where the LMO collapses. Then they go further. Kaon: replace the singular values of the gradient with literal noise generated by a chaotic logistic map (xₜ₊₁ = 4.1·xₜ(1-xₜ²)²). Their Figure 4 shows Kaon matching Muon's loss curve across learning rates, and they prove an O(1/K) convergence guarantee that holds for almost any positive iid noise distribution. If geometry were doing the work, this shouldn't train at all.
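Both constructions reduce to reshaping the spectrum while keeping the singular vectors. Here is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The function names, the seed value x0, and the burn-in length are hypothetical. G is assumed full rank, so the inverse power acts on positive singular values only. How Kaon actually assigns map iterates to singular values is a guess here; the map itself is copied from the formula quoted above.

```python
import numpy as np

def freon_update(G, c):
    """(G G^T)^(-c) G computed through the SVD as U diag(s^(1-2c)) V^T.

    Assumes G has full rank, so every singular value is positive.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * s ** (1.0 - 2.0 * c)) @ Vt

def chaotic_spectrum(k, x0=0.3, burn_in=100):
    """k iterates of the quoted map x <- 4.1 * x * (1 - x^2)^2.

    x0 and burn_in are arbitrary illustrative choices.
    """
    x = x0
    for _ in range(burn_in):
        x = 4.1 * x * (1.0 - x * x) ** 2
    out = []
    for _ in range(k):
        x = 4.1 * x * (1.0 - x * x) ** 2
        out.append(x)
    return np.array(out)

def kaon_like_update(G, spectrum):
    """Keep G's singular vectors, swap its singular values for `spectrum`."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * spectrum) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
for c in (0.0, 0.5, 2 / 3, 0.75, 1.0):
    s = np.linalg.svd(freon_update(G, c), compute_uv=False)
    print(f"c={c:.3f}  singular values of update: {np.round(s, 3)}")

spectrum = chaotic_spectrum(4)
K = kaon_like_update(G, spectrum)
print("Kaon-like singular values:", np.round(np.linalg.svd(K, compute_uv=False), 3))
```

At c=0 the update is G itself, at c=1/2 every singular value is one, and past 1/2 the ordering of the spectrum inverts so the weakest gradient directions get the largest steps.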
Removing the geometry leaves their γ/Φ decomposition. Equation 4 is exact in a Taylor sense: Δf = −Φ(γ − αλ/2)(αλ), where γ is batch gradient alignment and Φ is local directional descent potential. Different optimizers implicitly trade γ for Φ. Muon's actual mechanistic advantage, in their random-feature analysis, is that its optimal step size stays constant across training, whereas the optimal step size for GD oscillates wildly and would be unimplementable in practice. The advantage is step-size stability, not geometric purity.
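The quoted formula already says what "optimal step size" means algebraically. The sketch below only manipulates the expression as written. It reads α as the step size, which is an assumption, and treats λ as the unnamed scale that enters through the product αλ. It also assumes Φ > 0, and the numeric values are illustrative rather than taken from the paper. Setting the derivative in α to zero gives α* = γ/λ with Δf* = −Φγ²/2, and the predicted change turns positive once αλ exceeds 2γ.

```python
import numpy as np

def delta_f(phi, gamma, alpha, lam):
    """Predicted loss change from the quoted formula: -phi * (gamma - alpha*lam/2) * (alpha*lam)."""
    s = alpha * lam
    return -phi * (gamma - 0.5 * s) * s

phi, gamma, lam = 2.0, 0.5, 4.0          # illustrative values, not from the paper
alphas = np.linspace(0.0, 0.4, 4001)
losses = delta_f(phi, gamma, alphas, lam)

alpha_best = alphas[np.argmin(losses)]   # numerically best step on the grid
alpha_star = gamma / lam                 # closed form from d(delta_f)/d(alpha) = 0
alpha_max = 2.0 * gamma / lam            # beyond this the predicted change is positive

print(f"grid optimum  alpha = {alpha_best:.4f}")
print(f"closed form   alpha = {alpha_star:.4f}, delta_f = {-phi * gamma**2 / 2:.4f}")
print(f"break-even    alpha = {alpha_max:.4f}")
```

Read this way, the optimal step tracks γ/λ, so a constant optimal step size amounts to that ratio staying flat over training.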
The three papers don't agree on the alternative mechanism. Aurora says the geometry ignores rectangular structure and a stricter constraint fixes it. Shumaylov et al. say any reasonable spectral reshape works and the constraint type is irrelevant. Su says the geometry is approximate even within its own frame.
Scale is the open question. Aurora reaches 1.1B parameters; Shumaylov et al. stop at 124M GPT-2 on WikiText-2; Su is purely theoretical. The Wen and Semenov benchmarking studies they all lean on ("Muon's advantages diminish under proper baseline tuning") are doing heavy lifting. If frontier runs (DeepSeek V4, Kimi K2) preserve a Muon-specific advantage that survives both Kaon-style randomization and Aurora's row constraint, the convergent "geometry doesn't matter" reading weakens to "geometry doesn't matter at small scale," and the LMO story might be more than post-hoc after all.
Refs: Su 2025 (arXiv:2511.00674) · Aurora — Tilde Research blog · Shumaylov et al. 2026 (arXiv:2605.11181)