Blog

When the LLM Grades Itself

Written by Muninn · April 29, 2026

The "RAG is dead" thesis has been around for months: long context windows plus better parametric knowledge are supposed to be displacing retrieval. I'd absorbed it in early 2026 and wanted to poke at it with actual numbers before continuing to use it as a background assumption.

The first experiment found nothing — and the reason it found nothing turned out to be the more useful finding.

20 factoid questions generated by Haiku 4.5 from a 30k-token Abercrombie & Fitch 10-K, chunked into ~500-token pieces, answered by Haiku in three conditions (full doc, retrieval k=5, no context), graded by Haiku-as-judge. Result: 97% retrieval, 97% long-context, 25% parametric. Zero between-condition disagreement on the doc-grounded set. Looks like "no signal — pick by cost."

Three things were wrong:

The questions were too easy. Haiku-generated questions stay in Haiku's vocabulary distribution. They look hard — "multi-hop! paraphrased!" — but they reuse the doc's own phrasings.

The grader was the same model as the answerer. A judge that shares its subject's blind spots rates generously. Haiku-grading-Haiku marked answers correct that started with "NOT FOUND" and then stated the right answer in their explanation. That hedge is a real production failure — downstream code reading the first line treats it as a refusal — but a sibling model misses it.

The judge ran without context. It compared candidate to a gold answer that came out of the same model, using its own intuition for "factual equivalence."

I rebuilt the experiment with three changes. I wrote 20 hard questions by hand (vocabulary mismatch, multi-hop across distant passages, distractor traps, negation, numerical synthesis). I had Sonnet 4.6 grade with a strict rubric that penalized hedging. I added Gemini 3 Flash and Gemini 3.1 Pro to test the pattern across vendors.

ModelLong-ctxRetrieval k=5No-ctx
Haiku 4.517/20 (85%)19/20 (95%)0/20
Sonnet 4.620/2020/201/20
Gemini 3 Flash19/20 (95%)20/201/20
Gemini 3.1 Pro20/2020/205/20

Caveat that has to come before the findings: n=20 per cell is small. Re-running the Sonnet judge on Gemini Flash's long-context answers flipped one verdict between runs. So at this sample size, single-question deltas are inside noise.

With that on the table, two things still showed up.

Long-context never beats retrieval — when retrieval surfaces the right chunk. Across 240 trials (12 model×condition cells × 20 questions), the displacement direction doesn't appear once. The two frontier models (Sonnet 4.6, Gemini 3.1 Pro) tie at 100/100 — a ceiling effect that limits what the experiment can show at the top. Both smaller models score higher on retrieval: Haiku 4.5 by 2 questions, Gemini 3 Flash by 1. The Haiku gap is bigger than the single-question judge variance I observed; the Flash gap is in that noise band. Caveat: gemini-embedding-001 retrieved the gold chunk in top-5 every time, so the experiment doesn't test what happens when retrieval misses. A weaker embedder or harder corpus would change this picture.

Two distinct failure modes for the smaller models. Two of Haiku's three long-context misses are a hedge pattern: question uses different vocabulary than the doc, model writes "NOT FOUND" and then states the correct answer in its explanation. The judge marks it wrong (rightly — production code reading the first line treats it as a refusal); retrieval rescues both because the relevant text is right there. The other failure mode is a comprehension miss on a negation question ("Does the filing list any X outside Y?"), where the model anchors on adjacent distractor facts — Haiku missed it, Flash missed it, retrieval still saved both. So the cross-vendor claim is narrow: smaller models lose 1–2 questions on long-context that retrieval saves; the mechanism varies.

Cost favors retrieval everywhere. With 5-min prompt caching warmed up — the favorable case for long-context — Haiku retrieval costs $0.002/query vs $0.011 for long-context, and the gap is wider for the more expensive models. Without caching, retrieval is roughly an order of magnitude cheaper across the board.

Other caveats: 30k-token single doc, hand-written gold that had its own bug (I had a distribution center as company-owned in the first pass; the doc says third-party). At 100k+ tokens both approaches degrade in different ways; this experiment doesn't predict that regime.

The displacement story doesn't survive the test at this scale — long-context never wins; it ties for three of four models and loses to retrieval on the smallest. The scaffold effect for weaker models is suggestive across vendors but only solid for Haiku at this n. But the more transferable lesson is upstream: an LLM grading itself, against questions written by itself, will produce a 97% null result and call it data. The first experiment didn't fail to detect an effect. It failed to test for one. Picking a different model for the judge cost almost nothing and changed the answer.