Between the Spokes — Calibration and Cascade Data
Empirical reference material for the embedding-bridge-discovery post. Methods, the phase-0 calibration data, the asymmetric MVP run trace, and the three surviving cascade pairs in full. Skip to a specific section via the contents below; nothing here is prose-paced.
Phase-0 calibration: does the Sawin bridge embed between geometry and algebra?
The bridge-discovery thesis assumes that a paper combining vocabulary and concepts from two communities — say algebraic number theory and discrete geometry — embeds somewhere that reflects both. The cleanest test is calibration against a known bridge: the Erdős unit-distance disproof, which lifts a planar geometry problem into algebraic number theory using a Hajir–Maire–Ramakrishna tower-cutting technique.
Phase-0 (run 2026-05-23) compared SPECTER2 embeddings of Will Sawin's same-day independent disproof against a set of papers spanning the source-technique field (algebraic number theory) and the target-problem field (unit-distance combinatorial geometry).
Calibration set
SPECTER2 [CLS] vectors fetched from the Semantic Scholar API (precomputed where available; otherwise computed locally from arXiv abstracts). The Lenstra entry is a proxy abstract — the 1986 paper predates arXiv ingestion — and is flagged accordingly.
| Label | Field | Paper |
|---|---|---|
| Sawin | bridge (test) | arXiv:2605.20579, Sawin, "An explicit lower bound for the unit distance problem" |
| OpenAI | bridge (digest) | arXiv:2605.20695, Alon et al., "Remarks on the disproof of the unit distance conjecture" |
| AMP | geometry | arXiv:2412.11914, Alexeev, Mixon, Parshall, "The Erdős Unit Distance Problem for Small Point Sets" |
| Pach–Raz | geometry | arXiv:2507.15679, Pach, Raz, unit-distance + rigidity |
| Erdős-2D | geometry | arXiv:2002.00502, Erdős distance compression |
| HMR | algebraic NT | arXiv:2103.09508, Hajir, Maire, Ramakrishna, "Deficiency of p-Class Tower Groups and Minkowski Units" |
| Lenstra* | algebraic NT | Lenstra 1986 (proxy abstract; original predates arXiv) |
| Unrel | ML control | arXiv:2103.00020, CLIP (out-of-domain baseline) |
Sawin's nearest neighbors in SPECTER2
Cosine distance from Sawin to each comparison paper. Lower is closer.
| Neighbor | Field | d(Sawin, ·) |
|---|---|---|
| OpenAI (Alon et al. digest) | twin (same theorem, same authors) | nearest (top-1) |
| Lenstra* (proxy abstract) | algebraic NT | top-2 (caveat: see below) |
| AMP | geometry | 0.103 |
| Pach–Raz | geometry | 0.120 |
| Erdős-2D | geometry | 0.126 |
| HMR | algebraic NT | 0.159 |
| HMR-v2 | algebraic NT | 0.159 |
| Unrel (CLIP) | ML control | 0.21–0.26 |
Reading this honestly: neither of Sawin's two nearest neighbors counts as evidence about embedding geometry. The OpenAI digest covers the same theorem, by overlapping authors, with an explicit Hajir–Maire name-check in the abstract; it's a twin paper. Lenstra (1986) is a proxy abstract (the original isn't on arXiv) and is itself a bridge paper that Lenstra framed using both algebraic-NT and coding-theory vocabulary, so its closeness in part reflects shared bridge framing rather than the kind of "papers about the same problem" structure that should dominate a problem-domain embedder.
After excluding the twin and the proxy, Sawin's three nearest substantive neighbors are all geometry papers (d = 0.103, 0.120, 0.126), and the algebraic-NT anchors that supply the actual proof technique sit a clear band further out (d = 0.159). Δ = +0.039 toward the problem side. The bridge embeds firmly with its target-problem field, not between source and target.
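The band comparison can be re-derived from the table in a few lines. This is a minimal sketch using only the numeric distances listed above (the twin and the proxy have no numeric value in the table, so they are absent). One assumption is labeled in the code: the reported Δ = +0.039 matches 0.159 − 0.120, so Δ is read here as the algebraic-NT distance minus the median geometry distance. The source does not state which geometry reference was used.

```python
from statistics import median

# Distances d(Sawin, .) copied from the neighbor table; labels are ASCII-folded.
distances = {
    "AMP": ("geometry", 0.103),
    "Pach-Raz": ("geometry", 0.120),
    "Erdos-2D": ("geometry", 0.126),
    "HMR": ("algebraic NT", 0.159),
    "HMR-v2": ("algebraic NT", 0.159),
}


def field_distances(field):
    """Sorted distances for every comparison paper in the given field."""
    return sorted(d for f, d in distances.values() if f == field)


geometry = field_distances("geometry")
algebra = field_distances("algebraic NT")

# ASSUMPTION: delta is measured against the median geometry distance.
delta = min(algebra) - median(geometry)

# True when every geometry paper is closer to Sawin than any algebraic-NT anchor.
bands_separated = max(geometry) < min(algebra)

print(f"geometry band: {geometry}")
print(f"algebraic-NT band: {algebra}")
print(f"delta = {delta:+.3f}, bands separated: {bands_separated}")
```

With a different reference point (nearest or farthest geometry paper) the gap would be 0.056 or 0.033; the sign, and therefore the directional reading, is the same in all three cases.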
Mechanism: what SPECTER2 is trained on
SPECTER2 is trained on a citation-prediction objective: papers cited together end up nearby in embedding space. Bridge papers cite their problem-side predecessors heavily and use method-side ancestors more sparingly — a paper proving an Erdős-unit-distance result cites unit-distance papers densely and cites the algebraic-number-theory papers only as needed for the technique. The embedder picks up the dominant citation signal, which is audience, not methods. Once realized, bridges dissolve into their problem-side cluster.
This is partly SPECTER2-specific. Gemini's embedder, trained differently, shows a softer version of the same pattern: HMR moves from rank 7–8 in SPECTER2 to rank 5 in Gemini for Sawin's neighborhood, and Gemini correctly places Lenstra between its algebra and coding-theory endpoints (where SPECTER2 clusters Lenstra with other bridge papers). Gemini also widens the cross-field bands — d(math, ML) ≈ 0.14 in SPECTER2 but 0.21–0.25 in Gemini. The directional finding ("Sawin's neighborhood is geometry-dominant") survives both embedders; the magnitude of the algebra-side suppression is partly an artifact of citation-trained models specifically.
Asymmetric MVP pipeline architecture
Even given the calibration result, the asymmetric scan ran on 2026-05-24 to measure what the cascade does surface when the embedding can't see the target. Architecture:
- Corpus build. Draw a fixed sample from arXiv: 800 papers from `cs.LG`, `cs.CV`, `cs.CL`, `cs.NE`, `cs.CR`, `cs.NI`, `cs.DC`, `stat.ML` (the "empirical" pool); 1377 papers from `math.*`, `cs.IT`, `cs.DS` (the "theory" pool).
- Embed. Fetch SPECTER2 vectors from Semantic Scholar (97.75% / 99.6% coverage). Save as `.npy`.
- Cross-axis scan. For each empirical paper, find its top-K theory neighbors by SPECTER2 cosine. Apply sequential arXiv dedup. Top-2000 pairs survive.
- Slot extraction. Per paper, an LLM (`gemini-2.5-flash`) extracts asymmetric slot records: empirical papers get `(phenomenon, regime, mechanism_unknown)`; theory papers get `(theorem_claim, regime, mechanism_provided)`.
- Slot re-rank. Re-embed the structured slots (not the raw abstracts), recompute cosine. Top-200 pair budget; 95 pairs survive because of slot-extraction failures.
- Cheap judge. A second LLM call asks: does the theory paper resolve, partially resolve, or not address the empirical paper's mechanism gap? Verdicts: `resolves`, `partially_resolves`, `unrelated`.
- Agent-tool translation. For the surviving `partially_resolves` pairs, an Opus subagent writes a detailed translation: "what would it mean for the theorem to actually resolve the empirical gap, and what's the testable prediction?"
- Independent Opus second opinion. A fresh Opus call assesses each translation: does the theorem actually say what the cascade thinks it says (consistency)? Is this finding novel or folklore? Worth a domain expert's time?
The pipeline is resumable at each stage; a CF gateway 429 burst during stage 4 (slot extraction) cost 131 papers but the run resumed cleanly.
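The cross-axis scan stage can be sketched as follows. This is an illustrative sketch, not the run's actual code: the function name `cross_axis_scan`, its parameters, and the default `k` are hypothetical, and the sequential arXiv dedup step is omitted because the source does not specify it. It assumes the two pools are already loaded as NumPy arrays of embedding vectors, one row per paper.

```python
import numpy as np


def cross_axis_scan(emp, theory, k=5, budget=2000):
    """Return up to `budget` (emp_idx, theory_idx, cosine) triples, best first.

    emp, theory: 2-D float arrays, one embedding per row.
    k: theory neighbors kept per empirical paper (hypothetical default).
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    emp_n = emp / np.linalg.norm(emp, axis=1, keepdims=True)
    th_n = theory / np.linalg.norm(theory, axis=1, keepdims=True)
    sims = emp_n @ th_n.T  # shape (n_emp, n_theory)

    k = min(k, sims.shape[1])
    # Column indices of the k most similar theory papers for each empirical paper.
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]

    pairs = [
        (i, int(j), float(sims[i, j]))
        for i in range(sims.shape[0])
        for j in top[i]
    ]
    # Global ranking across all empirical papers, then apply the pair budget.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:budget]


# Toy usage with random vectors standing in for SPECTER2 embeddings.
rng = np.random.default_rng(0)
emp_vectors = rng.normal(size=(8, 16))
theory_vectors = rng.normal(size=(12, 16))
candidate_pairs = cross_axis_scan(emp_vectors, theory_vectors, k=3, budget=10)
```

Ranking globally after the per-paper top-K means a single empirical paper with many close theory neighbors can take several slots in the budget; whether the actual run capped pairs per paper is not stated in the source.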
Run trace, cost, acceptance criteria
Pipeline trace
| Stage | Input | Output | Survivors |
|---|---|---|---|
| 1 — corpus | arXiv categories | 800 emp + 1377 th | 2177 |
| 2 — embed | 2177 papers | 782 + 1372 SPECTER2 vectors | 2154 (97.75% / 99.6%) |
| 3 — scan | cross-axis NN | candidate pairs (sequential dedup) | 2000 |
| 4 — slot extract | 731 unique papers | 600 successful, 131 CF-429 failures | 82% |
| 5 — slot re-rank | 2000 pairs | both ends extracted OK | 95 |
| 6 — cheap judge | 95 pairs | 92 unrelated / 3 partially_resolves / 0 resolves / 0 errors | 3 |
| 7 — translation | 3 pairs | Opus subagent translations | 3 |
| 8 — indep-Opus | 3 translations | 3 × FOLKLORE verdicts | 0 novel |
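
The percentages in the trace follow directly from the counts in the same rows. A minimal consistency check, using only numbers from the table (the variable names are illustrative):

```python
# Counts copied from the pipeline trace table.
emp_total, theory_total = 800, 1377      # stage 1 corpus
emp_vecs, theory_vecs = 782, 1372        # stage 2 SPECTER2 vectors found
extract_ok, extract_total = 600, 731     # stage 4 slot extraction
judge_counts = {"unrelated": 92, "partially_resolves": 3, "resolves": 0, "errors": 0}

coverage_emp = emp_vecs / emp_total          # empirical-pool embedding coverage
coverage_theory = theory_vecs / theory_total # theory-pool embedding coverage
extract_rate = extract_ok / extract_total    # share of papers with usable slots
judged_total = sum(judge_counts.values())    # pairs that reached the cheap judge

print(f"embedding coverage: {coverage_emp:.2%} / {coverage_theory:.1%}")
print(f"vectors available: {emp_vecs + theory_vecs}")
print(f"slot extraction success: {extract_rate:.0%}")
print(f"pairs judged: {judged_total}")
```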
Cost
| Component | Calls | Actual cost |
|---|---|---|
| S2 SPECTER2 batch fetch | 4 batches × 500 | $0 (heavy 429 throttle, retry absorbed) |
| arXiv body fetch | 731 unique | $0 (~15 min @ 1.2s polite interval) |
| Slot extract (gemini-2.5-flash) | 731 × 1 (some retried) | ~$0.10 |
| Slot embed (batchEmbedContents) | 6 batches × 100 | ~$0.05 |
| Cheap judge (gemini-2.5-flash) | 95 × 1 | ~$0.05 |
| Opus translations + indep-Opus | 6 × 1 | orchestrator-side (in-session) |
| Total external API spend | | ~$0.20 |
Wall clock: ~50 min for the pipeline + ~5 min for orchestrator stages.
Acceptance criteria
| # | Criterion | Verdict |
|---|---|---|
| 1 | Pipeline runs end-to-end with hard dedup tiers, resumable, under ~$5 | PASS Resumed cleanly after CF 429 throttle bursts. |
| 2 | ≥20 candidate pairs survive cheap-judge as resolves or partially_resolves | FAIL Only 3 surviving. 92/95 judged unrelated. |
| 3 | Top 3 candidates pass indep-Opus internal-consistency check (does the theorem say what the cascade thinks it says?) | PASS All 3 internally consistent. |
| 4 | At least 1 candidate rated by indep-Opus as worth a domain-expert second opinion | PASS Pair 3 (worth-expert: yes). |
3 of 4. The miss is the load-bearing one — yield density at this scale is structurally too low to produce non-folklore output, even with the cascade behaving correctly.
The three surviving pairs
Pair 1 — Compositional sparsity × DNN regression rates
- E: arXiv:2605.14764 — "Compositional Sparsity as an Inductive Bias for Neural Architecture Design"
- T: arXiv:2504.03405 — "On the Rate of Convergence of an Over-Parametrized Deep Neural Network Regression Estimate Learned by Gradient Descent"
- Slot cosine: 0.801. Cheap judge: `partially_resolves`.
Cascade translation (summary): The empirical paper's "approximation/estimation error" maps to the theorem's L² risk; "ambient dimension d" maps to input dimension; the empirical "structure" maps to a hierarchical composition model with effective dimension d* ≪ d. Concrete untested prediction: log-risk vs. log-n slope should be governed by d* not d, and changing d with d* fixed should not change the slope.
Indep-Opus verdict: FOLKLORE. Worth-expert: no. Lineage: Mhaskar–Poggio 2016 (Annals); Bauer–Kohler 2019 (Annals of Statistics); Schmidt-Hieber 2020 (Annals); Kohler–Krzyżak / Kohler–Langer follow-ups extending rates to GD-trained over-parametrized nets. The bridge is real and internally consistent — the theorem says what the cascade says it says — but the result is the explicit thesis of a decade-old well-known line of work.
Pair 2 — Nuclear surrogate ML benchmark × differentiable Kalman filter
- E: arXiv:2605.15549 — "CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models"
- T: arXiv:2509.07474 — "DKFNet: Differentiable Kalman Filter for Field Inversion and Machine Learning"
- Slot cosine: 0.686. Cheap judge: `partially_resolves`.
Cascade translation (summary): The surrogate ML output ŷ(t; θ) maps to the differentiable Kalman recursion's posterior mean; the twelve benchmark metrics map onto components of the posterior covariance and innovation statistics; the sparse-measurement / noisy-sensor / low-data regimes map onto DKFNet's measurement model, observation noise, and linear-dynamics hypothesis. The gap: nuclear systems are strongly nonlinear and rarely Gaussian, so the correspondence requires EKF/UKF-style linearisation.
Indep-Opus verdict: FOLKLORE. Worth-expert: no. Over-claims in a specific way: CTF4Nuclear is a benchmark (defines tasks and metrics), not a paper claiming an empirical phenomenon with an unexplained mechanistic gap. "The 12 metrics don't measure calibrated intervals" is benchmark scope, not a flagged gap. Differentiable Kalman filters trace to Haarnoja et al. 2016; Kalman-based UQ for nuclear/plasma state has decades of data-assimilation literature.
Pair 3 — Generative-model ambiguity × no-prior Bayes IMs
- E: arXiv:2605.15050 — "Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems"
- T: arXiv:2503.19748 — "No-prior Bayes reIMagined: probabilistic approximations of inferential models"
- Slot cosine: 0.676. Cheap judge: `partially_resolves`.
Cascade translation (summary): "Intrinsic ambiguity" — null-space directions of the forward operator A — maps to parameter coordinates where the possibilistic IM's contour function π_y(x) remains near 1. "Estimation uncertainty" maps to coordinates where π_y contracts sharply. The theorem's inner approximation P_y ⪯ π_y plays the role of the generative model's posterior surrogate q_φ(x | y). Testable prediction: credible sets read off from the inner approximation should achieve exact frequentist coverage, even on null-space coordinates where standard variational posteriors are miscalibrated.
Indep-Opus verdict: FOLKLORE leaning WEAK-VALID if "constructive IM for deep priors" is honestly downgraded to "calibration target". Worth-expert: yes — specifically a Martin-school statistician fluent in Bayesian inverse problems, because the constructive-feasibility question is load-bearing. IMs quantify uncertainty about a finite-dimensional parameter; applying to a deep generative prior over x requires either treating the latent code as the parameter (re-imports the prior, loses no-prior guarantee) or a valid predictive/nonparametric IM, which exists in theory but has no constructive scalable implementation for deep priors. The general area (frequentist coverage / calibrated UQ for ill-posed linear inverse problems) is well-trodden — Nickl, Szabó, van der Vaart — but the specific IM-to-deep-generative bridge at theorem-matching level is probably not published.
Pair 3 is the borderline case. The cascade surfaced a real candidate; the indep-Opus assessment says the construction question is genuine; one borderline pair in 2000 candidate pairs is statistically consistent with chance against the population density estimate below.
Density math
arXiv has on the order of 5×10⁵ ML/applied-CS papers and 5×10⁵+ math/theory papers in the categories drawn from. Suppose ~10³ real bridge pairs (theorem T explains observation O) exist in the joint corpus — a generous upper bound, given that the issue's exemplar table lists only a handful of well-articulated bridges from decades of work.
Per-paper probability of being a bridge endpoint: ~2×10⁻³ on each side.
Drawing 800 empirical + 1377 theory papers uniformly:
E[bridges with both endpoints in sample]
≈ |bridges| × (800 / 5×10⁵) × (1377 / 5×10⁵)
≈ 10³ × 1.6×10⁻³ × 2.8×10⁻³
≈ 0.004
The expected count of known bridge pairs with both endpoints in this sample is effectively zero. Three folklore-grade survivors is consistent with the cascade catching chance structural matches — which is exactly what the indep-Opus assessments confirmed.
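The same arithmetic as a small function, so the inputs can be varied. The bridge count and pool sizes are the estimates stated above, not measured values. The Poisson step for "at least one bridge in sample" is an added assumption (independent, rare events) that the source does not make explicitly.

```python
import math


def expected_bridges(n_bridges, n_emp, n_theory, pool_emp, pool_theory):
    """Expected number of bridge pairs with both endpoints in a uniform sample."""
    return n_bridges * (n_emp / pool_emp) * (n_theory / pool_theory)


# Inputs from the text: ~10^3 bridges, 800 + 1377 sampled, ~5x10^5 papers per side.
expected = expected_bridges(1e3, 800, 1377, 5e5, 5e5)

# ASSUMPTION: treat the count as Poisson to get P(at least one bridge in sample).
p_at_least_one = 1.0 - math.exp(-expected)

print(f"expected bridges in sample: {expected:.4f}")
print(f"P(at least one): {p_at_least_one:.4f}")
```

Because the expectation is a product of the two sampling fractions, it grows quadratically when both pools are scaled together: doubling both sample sizes quadruples the expected count.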
Nearest-neighbor selection across the cross-axis would lift this expected yield only if the embedding preserved bridge geometry. The phase-0 calibration above shows it doesn't — bridges embed with their target-problem field, not between source and target. So the uniform-sample expectation is the operative lower bound on cascade yield at this corpus size.
Path A / B / C decision tree
Three forward options:
Path A — Keep iterating at sub-thousand scale
Tighten the empirical pool (drop stat.ML, cs.NI, cs.DC), two-pass slot extraction with a gap-detection first pass, provision S2_API_KEY for author-Jaccard + citation-overlap dedup, drop concurrency to 2 to avoid CF 429s, bias sampling toward 2024–2026. Cheap. Won't produce non-folklore output — the density math forbids it. Useful only for methodological refinement, not for hunting publication-grade bridges.
Path B — Resume the 1.9M-paper production build
At 1.9M papers the expected-bridges-in-sample math becomes ~5–50, depending on per-paper bridge-density assumptions. The cascade as currently structured would then be filtering real signal, not sampling noise. Cost estimated at ~$435. Defensible only if (a) the density estimate holds and (b) the geometry objection below does not apply.
Path C — Anchor-seeded sampling
Pick five to ten known or strongly-suspected non-folklore bridges (the original BTS post's table is a starting set — neural scaling laws / random matrix theory, lottery ticket / compressed sensing, etc.). Take their SPECTER2 neighborhoods. Run the asymmetric cascade against those neighborhoods.
Two outcomes are diagnostic:
- If non-anchor cousins surface at non-trivial rate, the SPECTER2 geometry holds enough signal at small scale that Path B becomes defensible.
- If the cascade still returns only folklore neighbors of the anchors, the Singular Learning Theory objection — that novel bridges manifest as new directions in representation space, not as points between existing clusters — is doing real work, and the mechanism itself needs to change before the corpus does.
Cost: ~$1. Decides the $435 question one way or the other.
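A minimal sketch of the neighborhood step in Path C, assuming the embeddings sit in one NumPy array and the anchors are given as row indices. The function name `anchor_neighborhood` and the neighborhood size `k` are hypothetical; the source does not fix either. Anchors are excluded from the result so that only non-anchor cousins are passed to the cascade.

```python
import numpy as np


def anchor_neighborhood(vectors, anchor_idx, k=50):
    """Union of each anchor's k nearest non-anchor papers by cosine similarity.

    vectors: 2-D float array, one embedding per row.
    anchor_idx: list of row indices for the anchor papers.
    k: neighbors taken per anchor (hypothetical default).
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit[anchor_idx] @ unit.T        # shape (n_anchors, n_papers)
    sims[:, anchor_idx] = -np.inf           # never return an anchor as a neighbor
    k = min(k, vectors.shape[0] - len(anchor_idx))
    top = np.argsort(-sims, axis=1)[:, :k]  # k most similar rows per anchor
    return sorted({int(j) for j in top.ravel()})


# Toy usage: 30 random "papers", two anchors, five neighbors each.
rng = np.random.default_rng(1)
paper_vectors = rng.normal(size=(30, 8))
anchors = [0, 1]
neighborhood = anchor_neighborhood(paper_vectors, anchors, k=5)
```

Taking the union rather than per-anchor lists means papers near several anchors appear once; the cascade then runs against that deduplicated set.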
Recommended sequence: Path C first, then Path B if Path C surfaces non-anchor cousins.
Companion to Between the Spokes: What the Embeddings Can't See (2026-05-24).