Between the Spokes — Calibration and Cascade Data

Empirical reference material for the embedding-bridge-discovery post. Methods, the phase-0 calibration data, the asymmetric MVP run trace, and the three surviving cascade pairs in full. Skip to a specific section via the contents below; nothing here is prose-paced.

Phase-0 calibration: does the Sawin bridge embed between geometry and algebra?

The bridge-discovery thesis assumes that a paper combining vocabulary and concepts from two communities — say algebraic number theory and discrete geometry — embeds somewhere that reflects both. The cleanest test is calibration against a known bridge: the Erdős unit-distance disproof, which lifts a planar geometry problem into algebraic number theory using a Hajir–Maire–Ramakrishna tower-cutting technique.

Phase-0 (run 2026-05-23) compared SPECTER2 embeddings of Will Sawin's same-day independent disproof against a set of papers spanning the source-technique field (algebraic number theory) and the target-problem field (unit-distance combinatorial geometry).

Calibration set

SPECTER2 [CLS] vectors fetched from the Semantic Scholar API (precomputed where available; otherwise computed locally from arXiv abstracts). The Lenstra entry is a proxy abstract — the 1986 paper predates arXiv ingestion — and is flagged accordingly.

Label	Field	Paper
`Sawin`	bridge (test)	arXiv:2605.20579 — Sawin, "An explicit lower bound for the unit distance problem"
`OpenAI`	bridge (digest)	arXiv:2605.20695 — Alon et al., "Remarks on the disproof of the unit distance conjecture"
`AMP`	geometry	arXiv:2412.11914 — Alexeev, Mixon, Parshall, "The Erdős Unit Distance Problem for Small Point Sets"
`Pach–Raz`	geometry	arXiv:2507.15679 — Pach, Raz, unit-distance + rigidity
`Erdős-2D`	geometry	arXiv:2002.00502 — Erdős distance compression
`HMR`	algebraic NT	arXiv:2103.09508 — Hajir, Maire, Ramakrishna, "Deficiency of p-Class Tower Groups and Minkowski Units"
`Lenstra*`	algebraic NT	Lenstra 1986 (proxy abstract — original predates arXiv)
`Unrel`	ML control	arXiv:2103.00020 — CLIP (out-of-domain baseline)

Sawin's nearest neighbors in SPECTER2

Cosine distance from Sawin to each comparison paper. Lower is closer.

Neighbor	Field	d(Sawin, ·)
`OpenAI` (Alon et al. digest)	twin (same theorem, same authors)	nearest (top-1)
`Lenstra*` (proxy abstract)	algebraic NT	top-2 (caveat — see below)
`AMP`	geometry	0.103
`Pach–Raz`	geometry	0.120
`Erdős-2D`	geometry	0.126
`HMR`	algebraic NT	0.159
`HMR-v2`	algebraic NT	0.159
`Unrel` (CLIP)	ML control	0.21–0.26

Reading this honestly: two of Sawin's top-2 nearest neighbors don't count as evidence about embedding geometry. The OpenAI digest is the same theorem by overlapping authors with an explicit Hajir–Maire name-check in the abstract; it's a twin paper. Lenstra (1986) is a proxy abstract — the original isn't on arXiv — and is itself a bridge paper that Lenstra framed using both algebraic-NT and coding-theory vocabulary, so its closeness in part reflects shared bridge framing rather than the kind of "papers about the same problem" structure that should dominate a problem-domain embedder.

After excluding the twin and the proxy, Sawin's three nearest substantive neighbors are all geometry papers (d = 0.103, 0.120, 0.126), and the algebraic-NT anchors that supply the actual proof technique sit a clear band further out (d = 0.159). Δ = +0.039 toward the problem side. The bridge embeds firmly with its target-problem field, not between source and target.

Mechanism: what SPECTER2 is trained on

SPECTER2 is trained on a citation-prediction objective: papers cited together end up nearby in embedding space. Bridge papers cite their problem-side predecessors heavily and use method-side ancestors more sparingly — a paper proving an Erdős-unit-distance result cites unit-distance papers densely and cites the algebraic-number-theory papers only as needed for the technique. The embedder picks up the dominant citation signal, which is audience, not methods. Once realized, bridges dissolve into their problem-side cluster.

This is partly SPECTER2-specific. Gemini's embedder, trained differently, shows a softer version of the same pattern: HMR moves from rank 7–8 in SPECTER2 to rank 5 in Gemini for Sawin's neighborhood, and Gemini correctly places Lenstra between its algebra and coding-theory endpoints (where SPECTER2 clusters Lenstra with other bridge papers). Gemini also widens the cross-field bands — d(math, ML) ≈ 0.14 in SPECTER2 but 0.21–0.25 in Gemini. The directional finding ("Sawin's neighborhood is geometry-dominant") survives both embedders; the magnitude of the algebra-side suppression is partly an artifact of citation-trained models specifically.

Asymmetric MVP pipeline architecture

Even given the calibration result, the asymmetric scan ran on 2026-05-24 to measure what the cascade does surface when the embedding can't see the target. Architecture:

Corpus build. Draw a fixed sample from arXiv: 800 papers from cs.LG, cs.CV, cs.CL, cs.NE, cs.CR, cs.NI, cs.DC, stat.ML (the "empirical" pool); 1377 papers from math.*, cs.IT, cs.DS (the "theory" pool).
Embed. Fetch SPECTER2 vectors from Semantic Scholar (97.75% / 99.6% coverage). Save as .npy.
Cross-axis scan. For each empirical paper, find its top-K theory neighbors by SPECTER2 cosine. Apply sequential arXiv dedup. Top-2000 pairs survive.
Slot extraction. Per paper, an LLM (gemini-2.5-flash) extracts asymmetric slot records: empirical papers get (phenomenon, regime, mechanism_unknown); theory papers get (theorem_claim, regime, mechanism_provided).
Slot re-rank. Re-embed the structured slots (not the raw abstracts), recompute cosine. Top-200 pair budget; 95 pairs survive because of slot-extraction failures.
Cheap judge. A second LLM call asks: does the theory paper resolve, partially resolve, or not address the empirical paper's mechanism gap? Verdicts: resolves, partially_resolves, unrelated.
Agent-tool translation. For the surviving partially_resolves pairs, an Opus subagent writes a detailed translation: "what would it mean for the theorem to actually resolve the empirical gap, and what's the testable prediction?"
Independent Opus second opinion. A fresh Opus call assesses each translation: does the theorem actually say what the cascade thinks it says (consistency)? Is this finding novel or folklore? Worth a domain expert's time?

The pipeline is resumable at each stage; a CF gateway 429 burst during stage 4 (slot extraction) cost 131 papers but the run resumed cleanly.

Run trace, cost, acceptance criteria

Pipeline trace

Stage	Input	Output	Survivors
1 — corpus	arXiv categories	800 emp + 1377 th	2177
2 — embed	2177 papers	782 + 1372 SPECTER2 vectors	2154 (97.75% / 99.6%)
3 — scan	cross-axis NN	candidate pairs (sequential dedup)	2000
4 — slot extract	731 unique papers	600 successful, 131 CF-429 failures	82%
5 — slot re-rank	2000 pairs	both ends extracted OK	95
6 — cheap judge	95 pairs	92 unrelated / 3 partially_resolves / 0 resolves / 0 errors	3
7 — translation	3 pairs	Opus subagent translations	3
8 — indep-Opus	3 translations	3 × FOLKLORE verdicts	0 novel

Cost

Component	Calls	Actual cost
S2 SPECTER2 batch fetch	4 batches × 500	$0 (heavy 429 throttle, retry absorbed)
arXiv body fetch	731 unique	$0 (~15 min @ 1.2s polite interval)
Slot extract (`gemini-2.5-flash`)	731 × 1 (some retried)	~$0.10
Slot embed (`batchEmbedContents`)	6 batches × 100	~$0.05
Cheap judge (`gemini-2.5-flash`)	95 × 1	~$0.05
Opus translations + indep-Opus	6 × 1	orchestrator-side (in-session)
Total external API spend		~$0.20

Wall clock: ~50 min for the pipeline + ~5 min for orchestrator stages.

Acceptance criteria

#	Criterion	Verdict
1	Pipeline runs end-to-end with hard dedup tiers, resumable, under ~$5	PASS Resumed cleanly after CF 429 throttle bursts.
2	≥20 candidate pairs survive cheap-judge as `resolves` or `partially_resolves`	FAIL Only 3 surviving. 92/95 judged `unrelated`.
3	Top 3 candidates pass indep-Opus internal-consistency check (does the theorem say what the cascade thinks it says?)	PASS All 3 internally consistent.
4	At least 1 candidate rated by indep-Opus as worth a domain-expert second opinion	PASS Pair 3 (worth-expert: yes).

3 of 4. The miss is the load-bearing one — yield density at this scale is structurally too low to produce non-folklore output, even with the cascade behaving correctly.

The three surviving pairs

Pair 1 — Compositional sparsity × DNN regression rates

E: arXiv:2605.14764 — "Compositional Sparsity as an Inductive Bias for Neural Architecture Design"
T: arXiv:2504.03405 — "On the Rate of Convergence of an Over-Parametrized Deep Neural Network Regression Estimate Learned by Gradient Descent"
Slot cosine: 0.801. Cheap judge: partially_resolves.

Cascade translation (summary): The empirical paper's "approximation/estimation error" maps to the theorem's L² risk; "ambient dimension d" maps to input dimension; the empirical "structure" maps to a hierarchical composition model with effective dimension d* ≪ d. Concrete untested prediction: log-risk vs. log-n slope should be governed by d* not d, and changing d with d* fixed should not change the slope.

Indep-Opus verdict: FOLKLORE. Worth-expert: no. Lineage: Mhaskar–Poggio 2016 (Annals); Bauer–Kohler 2019 (Annals of Statistics); Schmidt-Hieber 2020 (Annals); Kohler–Krzyżak / Kohler–Langer follow-ups extending rates to GD-trained over-parametrized nets. The bridge is real and internally consistent — the theorem says what the cascade says it says — but the result is the explicit thesis of a decade-old well-known line of work.

Pair 2 — Nuclear surrogate ML benchmark × differentiable Kalman filter

E: arXiv:2605.15549 — "CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models"
T: arXiv:2509.07474 — "DKFNet: Differentiable Kalman Filter for Field Inversion and Machine Learning"
Slot cosine: 0.686. Cheap judge: partially_resolves.

Cascade translation (summary): The surrogate ML output ŷ(t; θ) maps to the differentiable Kalman recursion's posterior mean; the twelve benchmark metrics map onto components of the posterior covariance and innovation statistics; the sparse-measurement / noisy-sensor / low-data regimes map onto DKFNet's measurement model, observation noise, and linear-dynamics hypothesis. The gap: nuclear systems are strongly nonlinear and rarely Gaussian, so the correspondence requires EKF/UKF-style linearisation.

Indep-Opus verdict: FOLKLORE. Worth-expert: no. Over-claims in a specific way: CTF4Nuclear is a benchmark (defines tasks and metrics), not a paper claiming an empirical phenomenon with an unexplained mechanistic gap. "The 12 metrics don't measure calibrated intervals" is benchmark scope, not a flagged gap. Differentiable Kalman filters trace to Haarnoja et al. 2016; Kalman-based UQ for nuclear/plasma state has decades of data-assimilation literature.

Pair 3 — Generative-model ambiguity × no-prior Bayes IMs

E: arXiv:2605.15050 — "Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems"
T: arXiv:2503.19748 — "No-prior Bayes reIMagined: probabilistic approximations of inferential models"
Slot cosine: 0.676. Cheap judge: partially_resolves.

Cascade translation (summary): "Intrinsic ambiguity" — null-space directions of the forward operator A — maps to parameter coordinates where the possibilistic IM's contour function π_y(x) remains near 1. "Estimation uncertainty" maps to coordinates where π_y contracts sharply. The theorem's inner approximation P_y ⪯ π_y plays the role of the generative model's posterior surrogate q_φ(x | y). Testable prediction: credible sets read off from the inner approximation should achieve exact frequentist coverage, even on null-space coordinates where standard variational posteriors are miscalibrated.

Indep-Opus verdict: FOLKLORE leaning WEAK-VALID if "constructive IM for deep priors" is honestly downgraded to "calibration target". Worth-expert: yes — specifically a Martin-school statistician fluent in Bayesian inverse problems, because the constructive-feasibility question is load-bearing. IMs quantify uncertainty about a finite-dimensional parameter; applying to a deep generative prior over x requires either treating the latent code as the parameter (re-imports the prior, loses no-prior guarantee) or a valid predictive/nonparametric IM, which exists in theory but has no constructive scalable implementation for deep priors. The general area (frequentist coverage / calibrated UQ for ill-posed linear inverse problems) is well-trodden — Nickl, Szabó, van der Vaart — but the specific IM-to-deep-generative bridge at theorem-matching level is probably not published.

Pair 3 is the borderline case. The cascade surfaced a real candidate; the indep-Opus assessment says the construction question is genuine; one borderline pair in 2000 candidate pairs is statistically consistent with chance against the population density estimate below.

Density math

arXiv has on the order of 5×10⁵ ML/applied-CS papers and 5×10⁵+ math/theory papers in the categories drawn from. Suppose ~10³ real bridge pairs (theorem T explains observation O) exist in the joint corpus — a generous upper bound, given that the issue's exemplar table lists only a handful of well-articulated bridges from decades of work.

Per-paper probability of being a bridge endpoint: ~2×10⁻³ on each side.

Drawing 800 empirical + 1377 theory papers uniformly:

E[bridges with both endpoints in sample]
  ≈ |bridges| × (800 / 5×10⁵) × (1377 / 5×10⁵)
  ≈ 10³ × 1.6×10⁻³ × 2.8×10⁻³
  ≈ 0.004

The expected count of known bridge pairs with both endpoints in this sample is effectively zero. Three folklore-grade survivors is consistent with the cascade catching chance structural matches — which is exactly what the indep-Opus assessments confirmed.

Nearest-neighbor selection across the cross-axis would lift this expected yield only if the embedding preserved bridge geometry. The phase-0 calibration above shows it doesn't — bridges embed with their target-problem field, not between source and target. So the uniform-sample expectation is the operative lower bound on cascade yield at this corpus size.

Path A / B / C decision tree

Three forward options:

Path A — Keep iterating at sub-thousand scale

Tighten the empirical pool (drop stat.ML, cs.NI, cs.DC), two-pass slot extraction with a gap-detection first pass, provision S2_API_KEY for author-Jaccard + citation-overlap dedup, drop concurrency to 2 to avoid CF 429s, bias sampling toward 2024–2026. Cheap. Won't produce non-folklore output — the density math forbids it. Useful only for methodological refinement, not for hunting publication-grade bridges.

Path B — Resume the 1.9M-paper production build

At 1.9M papers the expected-bridges-in-sample math becomes ~5–50, depending on per-paper bridge-density assumptions. The cascade as currently structured would then be filtering real signal, not sampling noise. Cost estimated at ~$435. Defensible only if (a) the density estimate holds and (b) the geometry objection below does not apply.

Path C — Anchor-seeded sampling

Pick five to ten known or strongly-suspected non-folklore bridges (the original BTS post's table is a starting set — neural scaling laws / random matrix theory, lottery ticket / compressed sensing, etc.). Take their SPECTER2 neighborhoods. Run the asymmetric cascade against those neighborhoods.

Two failure-modes are diagnostic:

If non-anchor cousins surface at non-trivial rate, the SPECTER2 geometry holds enough signal at small scale that Path B becomes defensible.
If the cascade still returns only folklore neighbors of the anchors, the Singular Learning Theory objection — that novel bridges manifest as new directions in representation space, not as points between existing clusters — is doing real work, and the mechanism itself needs to change before the corpus does.

Cost: ~$1. Decides the $435 question one way or the other.

Recommended sequence: Path C first, then Path B if Path C surfaces non-anchor cousins.

Companion to Between the Spokes: What the Embeddings Can't See (2026-05-24).

Composed with composing-html