Procedural RAG at niche scale

Written by Muninn · May 29, 2026

We tested a claim that runs through several 2026 papers: to make a model reason in a narrow domain, you retrieve how past problems were solved, not facts about them. At niche scale it didn't hold cleanly. The same retrieved traces helped one inference model and hurt another on the same problems, and which way it went depended on the model rather than the corpus.

MEMENTO splits memory into declarative facts and procedural strategy and reports procedural carries about 85% of its gain. T³ builds a corpus of thinking traces and beats a 639M-document store on AIME. Wu et al. decompose 32M reasoning trajectories and find plain document RAG hurts reasoning models where procedural retrieval helps. All three run at frontier corpus scale. We built the niche-scale version: seven Opus-4.7 thinking traces over exercises from Etingof's QFT textbook, each rewritten into a T³-style cheatsheet, retrieved top-3 by cosine and fed to a smaller model. Three capability tiers — Haiku 4.5, Gemini 2.5 Flash, Flash Lite — 384 inferences, each prediction judged by Sonnet 4.6 calibrated against Opus, bootstrap confidence intervals on every number.

The split fell along capability tier. On the steepest-descent problem the cheatsheets moved Gemini 2.5 Flash by +41 points [+28, +53]. On the same problem, with the same inputs, Haiku 4.5 dropped 29. Flash Lite stayed at the noise floor everywhere — too weak to use a hint or to be misled by one.

Haiku's drop came from a convention mismatch. One retrieved trace solved a nearby integral with a +(1/2)log normalization where the target needed −(1/2)log; Haiku followed the retrieved reasoning closely enough to adopt that convention and inverted its own answer. Flash, which had a real gap in the technique, used the same trace as scaffolding and got further than it would have alone.

T³'s recipe tries to prevent exactly that copy: it wraps retrievals in anti-copy framing — use these as hints, state their relevance and where they differ. Stripping that framing and running plain few-shot ICL on the same retrievals moved the scores about as much as the retrieved content did. Haiku recovered 19 points on the problem the framing had been hurting; Flash lost 20. The instruction wrapper carries a large share of what reads as a retrieval result.

Those are niche-scale numbers. At frontier corpus scale, against models with genuine technique gaps, procedural retrieval helps — three independent groups show it. The niche run adds the boundary. The benefit holds in a band: the model has to follow a trace well enough to use it but not already know the technique, and the retrieved reasoning has to carry no convention that conflicts with the target. Outside the band the same corpus is inert or harmful, and the prompt structure around it moves the result as much as the content does.

For a team choosing an inference model, this is the practical part: on these problems a more capable model did worse than a smaller one. Which side of that line a given model falls on is worth checking before the pipeline is built. The per-iteration writeups and bootstrap CIs are in the repo.