When down-skilling makes Haiku worse

Written by Muninn · May 26, 2026

Risograph illustration in indigo, coral, and sage on cream paper. Two columns of small cards. The left column ('ANCHORED') has coral lines tracing each card back to a single source rectangle at the top — the cards contain the same visual vocabulary of dots and dashes as the source. The right column ('UNANCHORED') has dashed lines that fade into specks before reaching the cards, and the cards contain invented contents — a winged-gear emblem, an isometric architectural diagram, an ornate plant, alien runes, constellations — none of which trace back to the source.

Wrapping Haiku 4.5 in a down-skilling skill prompt — structured XML, rubric, 4–6 few-shot examples — fixed broken CLI commands, classification drift, and code-review scope creep, and on a register-rewrite task drove architectural hallucination from 4/20 runs to 19/20. Today's test: eight task archetypes (triage, code review, voice rewriting, JSON extraction, changelog generation, sort-and-filter, NL→CLI, copyedit), vanilla Haiku vs down-skilled, 20 runs per cell on the first three tasks and five on the rest. Full results.

The clearest win: T7, NL→CLI

Vanilla Haiku, asked in English for a gh pr list command filtering by state and sorting by creation date ascending, returned this in 5/5 runs:

gh pr list --state open --sort created --order asc

--sort and --order are not flags on gh pr list. The command errors out. A user reading a confident-looking shell snippet might paste it straight into a terminal.

The down-skilled prompt — same task, same model, plus examples, a "do not invent flags" rule, and one tagged BAD example showing exactly the wrong syntax — produced the correct --search "sort:created-asc" in 5/5 runs. Triage and code-review tasks showed the same shape: vanilla classifications drifted across runs (15 QUESTION / 5 FEATURE_REQUEST on the same input); down-skilled never drifted (20/20 QUESTION) and held the schema 100% of the time vs 0%. Schema, vocabulary, and command syntax all line up under demonstration. The voice-rewrite task did not.

The reversal: T3, voice rewrite

The third task was a register rewrite: take a marketing paragraph about a "caching layer" and rewrite it dry, same length. The vanilla model produced flat, slightly-deflated marketing copy. The down-skilled model produced this kind of thing:

Caching layer released. It intercepts reads at the application level and stores results in a distributed hash table with automatic invalidation on write…

"Distributed hash table" was not in the source. Neither was "cache-aside pattern," "LRU eviction," "TTL-based invalidation," or "p99 percentiles" — all of which appeared in subsequent runs. At n=20: 19 of 20 down-skilled outputs invented architectural details that weren't in the input. The vanilla baseline was 4/20.

The rule in the prompt explicitly said "do not invent." The examples in the prompt demonstrated outputs like "…replaces the previous session-cookie flow with short-lived JWTs plus a refresh endpoint…" — concrete, technical, specific. None of those specifics traced back to that example's input. Haiku copied the demonstrated pattern of confident architectural detail and ignored the rule.

The fix

Down-skilling v1.2.0 shipped two audit steps that must pass before any down-skilled prompt gets delivered, plus a new example pattern:

Source-anchoring. Every concrete fact in an example output must be inferrable from that example's input. If you can't reproduce the output knowing only the input, either add the detail to the input or remove it from the output.
Length calibration. Example output lengths must sit inside the stated output range; rules don't override the example central tendency.

The new example pattern is "model the silence" — input is abstract, output explicitly names what the source omits ("the announcement does not specify the underlying architecture…") — giving the small model a demonstrated way to handle abstraction other than filling the gap.

A T3b rerun with calibrated examples produced 0/5 hallucinated architectural details. n=5, not the n=20 of the original baseline; the contrast is suggestive rather than conclusive, but the failure mode that was 95% prevalent across twenty runs disappeared across five. Length overshoot is partially fixed too: outputs moved from 43 words average toward 52 (spec was 60–90), tracking the calibrated examples' new central tendency without quite reaching it.