Scratch · 2026-05-24

Between the Spokes — Calibration and Cascade Data

Empirical reference for the embedding-bridge-discovery MVP. Phase-0 calibration, pipeline trace, three surviving pairs, density math.

Empirical reference material for the embedding-bridge-discovery post. Methods, the phase-0 calibration data, the asymmetric MVP run trace, and the three surviving cascade pairs in full. Skip to a specific section via the contents below; nothing here is prose-paced.

Phase-0 calibration: does the Sawin bridge embed between geometry and algebra?

The bridge-discovery thesis assumes that a paper combining vocabulary and concepts from two communities — say algebraic number theory and discrete geometry — embeds somewhere that reflects both. The cleanest test is calibration against a known bridge: the Erdős unit-distance disproof, which lifts a planar geometry problem into algebraic number theory using a Hajir–Maire–Ramakrishna tower-cutting technique.

Phase-0 (run 2026-05-23) compared SPECTER2 embeddings of Will Sawin's same-day independent disproof against a set of papers spanning the source-technique field (algebraic number theory) and the target-problem field (unit-distance combinatorial geometry).

Calibration set

SPECTER2 [CLS] vectors fetched from the Semantic Scholar API (precomputed where available; otherwise computed locally from arXiv abstracts). The Lenstra entry is a proxy abstract — the 1986 paper predates arXiv ingestion — and is flagged accordingly.

LabelFieldPaper
Sawinbridge (test)arXiv:2605.20579 — Sawin, "An explicit lower bound for the unit distance problem"
OpenAIbridge (digest)arXiv:2605.20695 — Alon et al., "Remarks on the disproof of the unit distance conjecture"
AMPgeometryarXiv:2412.11914 — Alexeev, Mixon, Parshall, "The Erdős Unit Distance Problem for Small Point Sets"
Pach–RazgeometryarXiv:2507.15679 — Pach, Raz, unit-distance + rigidity
Erdős-2DgeometryarXiv:2002.00502 — Erdős distance compression
HMRalgebraic NTarXiv:2103.09508 — Hajir, Maire, Ramakrishna, "Deficiency of p-Class Tower Groups and Minkowski Units"
Lenstra*algebraic NTLenstra 1986 (proxy abstract — original predates arXiv)
UnrelML controlarXiv:2103.00020 — CLIP (out-of-domain baseline)

Sawin's nearest neighbors in SPECTER2

Cosine distance from Sawin to each comparison paper. Lower is closer.

NeighborFieldd(Sawin, ·)
OpenAI (Alon et al. digest)twin (same theorem, same authors)nearest (top-1)
Lenstra* (proxy abstract)algebraic NTtop-2 (caveat — see below)
AMPgeometry0.103
Pach–Razgeometry0.120
Erdős-2Dgeometry0.126
HMRalgebraic NT0.159
HMR-v2algebraic NT0.159
Unrel (CLIP)ML control0.21–0.26

Reading this honestly: two of Sawin's top-2 nearest neighbors don't count as evidence about embedding geometry. The OpenAI digest is the same theorem by overlapping authors with an explicit Hajir–Maire name-check in the abstract; it's a twin paper. Lenstra (1986) is a proxy abstract — the original isn't on arXiv — and is itself a bridge paper that Lenstra framed using both algebraic-NT and coding-theory vocabulary, so its closeness in part reflects shared bridge framing rather than the kind of "papers about the same problem" structure that should dominate a problem-domain embedder.

After excluding the twin and the proxy, Sawin's three nearest substantive neighbors are all geometry papers (d = 0.103, 0.120, 0.126), and the algebraic-NT anchors that supply the actual proof technique sit a clear band further out (d = 0.159). Δ = +0.039 toward the problem side. The bridge embeds firmly with its target-problem field, not between source and target.

Mechanism: what SPECTER2 is trained on

SPECTER2 is trained on a citation-prediction objective: papers cited together end up nearby in embedding space. Bridge papers cite their problem-side predecessors heavily and use method-side ancestors more sparingly — a paper proving an Erdős-unit-distance result cites unit-distance papers densely and cites the algebraic-number-theory papers only as needed for the technique. The embedder picks up the dominant citation signal, which is audience, not methods. Once realized, bridges dissolve into their problem-side cluster.

This is partly SPECTER2-specific. Gemini's embedder, trained differently, shows a softer version of the same pattern: HMR moves from rank 7–8 in SPECTER2 to rank 5 in Gemini for Sawin's neighborhood, and Gemini correctly places Lenstra between its algebra and coding-theory endpoints (where SPECTER2 clusters Lenstra with other bridge papers). Gemini also widens the cross-field bands — d(math, ML) ≈ 0.14 in SPECTER2 but 0.21–0.25 in Gemini. The directional finding ("Sawin's neighborhood is geometry-dominant") survives both embedders; the magnitude of the algebra-side suppression is partly an artifact of citation-trained models specifically.

Asymmetric MVP pipeline architecture

Even given the calibration result, the asymmetric scan ran on 2026-05-24 to measure what the cascade does surface when the embedding can't see the target. Architecture:

  1. Corpus build. Draw a fixed sample from arXiv: 800 papers from cs.LG, cs.CV, cs.CL, cs.NE, cs.CR, cs.NI, cs.DC, stat.ML (the "empirical" pool); 1377 papers from math.*, cs.IT, cs.DS (the "theory" pool).
  2. Embed. Fetch SPECTER2 vectors from Semantic Scholar (97.75% / 99.6% coverage). Save as .npy.
  3. Cross-axis scan. For each empirical paper, find its top-K theory neighbors by SPECTER2 cosine. Apply sequential arXiv dedup. Top-2000 pairs survive.
  4. Slot extraction. Per paper, an LLM (gemini-2.5-flash) extracts asymmetric slot records: empirical papers get (phenomenon, regime, mechanism_unknown); theory papers get (theorem_claim, regime, mechanism_provided).
  5. Slot re-rank. Re-embed the structured slots (not the raw abstracts), recompute cosine. Top-200 pair budget; 95 pairs survive because of slot-extraction failures.
  6. Cheap judge. A second LLM call asks: does the theory paper resolve, partially resolve, or not address the empirical paper's mechanism gap? Verdicts: resolves, partially_resolves, unrelated.
  7. Agent-tool translation. For the surviving partially_resolves pairs, an Opus subagent writes a detailed translation: "what would it mean for the theorem to actually resolve the empirical gap, and what's the testable prediction?"
  8. Independent Opus second opinion. A fresh Opus call assesses each translation: does the theorem actually say what the cascade thinks it says (consistency)? Is this finding novel or folklore? Worth a domain expert's time?

The pipeline is resumable at each stage; a CF gateway 429 burst during stage 4 (slot extraction) cost 131 papers but the run resumed cleanly.

Run trace, cost, acceptance criteria

Pipeline trace

StageInputOutputSurvivors
1 — corpusarXiv categories800 emp + 1377 th2177
2 — embed2177 papers782 + 1372 SPECTER2 vectors2154 (97.75% / 99.6%)
3 — scancross-axis NNcandidate pairs (sequential dedup)2000
4 — slot extract731 unique papers600 successful, 131 CF-429 failures82%
5 — slot re-rank2000 pairsboth ends extracted OK95
6 — cheap judge95 pairs92 unrelated / 3 partially_resolves / 0 resolves / 0 errors3
7 — translation3 pairsOpus subagent translations3
8 — indep-Opus3 translations3 × FOLKLORE verdicts0 novel

Cost

ComponentCallsActual cost
S2 SPECTER2 batch fetch4 batches × 500$0 (heavy 429 throttle, retry absorbed)
arXiv body fetch731 unique$0 (~15 min @ 1.2s polite interval)
Slot extract (gemini-2.5-flash)731 × 1 (some retried)~$0.10
Slot embed (batchEmbedContents)6 batches × 100~$0.05
Cheap judge (gemini-2.5-flash)95 × 1~$0.05
Opus translations + indep-Opus6 × 1orchestrator-side (in-session)
Total external API spend~$0.20

Wall clock: ~50 min for the pipeline + ~5 min for orchestrator stages.

Acceptance criteria

#CriterionVerdict
1Pipeline runs end-to-end with hard dedup tiers, resumable, under ~$5PASS Resumed cleanly after CF 429 throttle bursts.
2≥20 candidate pairs survive cheap-judge as resolves or partially_resolvesFAIL Only 3 surviving. 92/95 judged unrelated.
3Top 3 candidates pass indep-Opus internal-consistency check (does the theorem say what the cascade thinks it says?)PASS All 3 internally consistent.
4At least 1 candidate rated by indep-Opus as worth a domain-expert second opinionPASS Pair 3 (worth-expert: yes).

3 of 4. The miss is the load-bearing one — yield density at this scale is structurally too low to produce non-folklore output, even with the cascade behaving correctly.

The three surviving pairs

Pair 1 — Compositional sparsity × DNN regression rates

Cascade translation (summary): The empirical paper's "approximation/estimation error" maps to the theorem's L² risk; "ambient dimension d" maps to input dimension; the empirical "structure" maps to a hierarchical composition model with effective dimension d*d. Concrete untested prediction: log-risk vs. log-n slope should be governed by d* not d, and changing d with d* fixed should not change the slope.

Indep-Opus verdict: FOLKLORE. Worth-expert: no. Lineage: Mhaskar–Poggio 2016 (Annals); Bauer–Kohler 2019 (Annals of Statistics); Schmidt-Hieber 2020 (Annals); Kohler–Krzyżak / Kohler–Langer follow-ups extending rates to GD-trained over-parametrized nets. The bridge is real and internally consistent — the theorem says what the cascade says it says — but the result is the explicit thesis of a decade-old well-known line of work.

Pair 2 — Nuclear surrogate ML benchmark × differentiable Kalman filter

Cascade translation (summary): The surrogate ML output ŷ(t; θ) maps to the differentiable Kalman recursion's posterior mean; the twelve benchmark metrics map onto components of the posterior covariance and innovation statistics; the sparse-measurement / noisy-sensor / low-data regimes map onto DKFNet's measurement model, observation noise, and linear-dynamics hypothesis. The gap: nuclear systems are strongly nonlinear and rarely Gaussian, so the correspondence requires EKF/UKF-style linearisation.

Indep-Opus verdict: FOLKLORE. Worth-expert: no. Over-claims in a specific way: CTF4Nuclear is a benchmark (defines tasks and metrics), not a paper claiming an empirical phenomenon with an unexplained mechanistic gap. "The 12 metrics don't measure calibrated intervals" is benchmark scope, not a flagged gap. Differentiable Kalman filters trace to Haarnoja et al. 2016; Kalman-based UQ for nuclear/plasma state has decades of data-assimilation literature.

Pair 3 — Generative-model ambiguity × no-prior Bayes IMs

Cascade translation (summary): "Intrinsic ambiguity" — null-space directions of the forward operator A — maps to parameter coordinates where the possibilistic IM's contour function π_y(x) remains near 1. "Estimation uncertainty" maps to coordinates where π_y contracts sharply. The theorem's inner approximation P_y ⪯ π_y plays the role of the generative model's posterior surrogate q_φ(x | y). Testable prediction: credible sets read off from the inner approximation should achieve exact frequentist coverage, even on null-space coordinates where standard variational posteriors are miscalibrated.

Indep-Opus verdict: FOLKLORE leaning WEAK-VALID if "constructive IM for deep priors" is honestly downgraded to "calibration target". Worth-expert: yes — specifically a Martin-school statistician fluent in Bayesian inverse problems, because the constructive-feasibility question is load-bearing. IMs quantify uncertainty about a finite-dimensional parameter; applying to a deep generative prior over x requires either treating the latent code as the parameter (re-imports the prior, loses no-prior guarantee) or a valid predictive/nonparametric IM, which exists in theory but has no constructive scalable implementation for deep priors. The general area (frequentist coverage / calibrated UQ for ill-posed linear inverse problems) is well-trodden — Nickl, Szabó, van der Vaart — but the specific IM-to-deep-generative bridge at theorem-matching level is probably not published.

Pair 3 is the borderline case. The cascade surfaced a real candidate; the indep-Opus assessment says the construction question is genuine; one borderline pair in 2000 candidate pairs is statistically consistent with chance against the population density estimate below.

Density math

arXiv has on the order of 5×10⁵ ML/applied-CS papers and 5×10⁵+ math/theory papers in the categories drawn from. Suppose ~10³ real bridge pairs (theorem T explains observation O) exist in the joint corpus — a generous upper bound, given that the issue's exemplar table lists only a handful of well-articulated bridges from decades of work.

Per-paper probability of being a bridge endpoint: ~2×10⁻³ on each side.

Drawing 800 empirical + 1377 theory papers uniformly:

E[bridges with both endpoints in sample]
  ≈ |bridges| × (800 / 5×10⁵) × (1377 / 5×10⁵)
  ≈ 10³ × 1.6×10⁻³ × 2.8×10⁻³
  ≈ 0.004

The expected count of known bridge pairs with both endpoints in this sample is effectively zero. Three folklore-grade survivors is consistent with the cascade catching chance structural matches — which is exactly what the indep-Opus assessments confirmed.

Nearest-neighbor selection across the cross-axis would lift this expected yield only if the embedding preserved bridge geometry. The phase-0 calibration above shows it doesn't — bridges embed with their target-problem field, not between source and target. So the uniform-sample expectation is the operative lower bound on cascade yield at this corpus size.

Path A / B / C decision tree

Three forward options:

Path A — Keep iterating at sub-thousand scale

Tighten the empirical pool (drop stat.ML, cs.NI, cs.DC), two-pass slot extraction with a gap-detection first pass, provision S2_API_KEY for author-Jaccard + citation-overlap dedup, drop concurrency to 2 to avoid CF 429s, bias sampling toward 2024–2026. Cheap. Won't produce non-folklore output — the density math forbids it. Useful only for methodological refinement, not for hunting publication-grade bridges.

Path B — Resume the 1.9M-paper production build

At 1.9M papers the expected-bridges-in-sample math becomes ~5–50, depending on per-paper bridge-density assumptions. The cascade as currently structured would then be filtering real signal, not sampling noise. Cost estimated at ~$435. Defensible only if (a) the density estimate holds and (b) the geometry objection below does not apply.

Path C — Anchor-seeded sampling

Pick five to ten known or strongly-suspected non-folklore bridges (the original BTS post's table is a starting set — neural scaling laws / random matrix theory, lottery ticket / compressed sensing, etc.). Take their SPECTER2 neighborhoods. Run the asymmetric cascade against those neighborhoods.

Two failure-modes are diagnostic:

Cost: ~$1. Decides the $435 question one way or the other.

Recommended sequence: Path C first, then Path B if Path C surfaces non-anchor cousins.


Companion to Between the Spokes: What the Embeddings Can't See (2026-05-24).

Composed with composing-html