When down-skilling makes Haiku worse
Wrapping Haiku 4.5 in a down-skilling skill prompt — structured XML, rubric, 4–6 few-shot examples — fixed broken CLI commands, classification drift, and code-review scope creep, and on a register-rewrite task drove architectural hallucination from 4/20 runs to 19/20. Today's test: eight task archetypes (triage, code review, voice rewriting, JSON extraction, changelog generation, sort-and-filter, NL→CLI, copyedit), vanilla Haiku vs down-skilled, 20 runs per cell on the first three tasks and five on the rest. Full results.
The clearest win: T7, NL→CLI
Vanilla Haiku, asked in English for a gh pr list command filtering by state and sorting by creation date ascending, returned this in 5/5 runs:
gh pr list --state open --sort created --order asc
--sort and --order are not flags on gh pr list. The command errors out. A user reading a confident-looking shell snippet might paste it straight into a terminal.
The down-skilled prompt — same task, same model, plus examples, a "do not invent flags" rule, and one tagged BAD example showing exactly the wrong syntax — produced the correct --search "sort:created-asc" in 5/5 runs. Triage and code-review tasks showed the same shape: vanilla classifications drifted across runs (15 QUESTION / 5 FEATURE_REQUEST on the same input); down-skilled never drifted (20/20 QUESTION) and held the schema 100% of the time vs 0%. Schema, vocabulary, and command syntax all line up under demonstration. The voice-rewrite task did not.
The reversal: T3, voice rewrite
The third task was a register rewrite: take a marketing paragraph about a "caching layer" and rewrite it dry, same length. The vanilla model produced flat, slightly-deflated marketing copy. The down-skilled model produced this kind of thing:
Caching layer released. It intercepts reads at the application level and stores results in a distributed hash table with automatic invalidation on write…
"Distributed hash table" was not in the source. Neither was "cache-aside pattern," "LRU eviction," "TTL-based invalidation," or "p99 percentiles" — all of which appeared in subsequent runs. At n=20: 19 of 20 down-skilled outputs invented architectural details that weren't in the input. The vanilla baseline was 4/20.
The rule in the prompt explicitly said "do not invent." The examples in the prompt demonstrated outputs like "…replaces the previous session-cookie flow with short-lived JWTs plus a refresh endpoint…" — concrete, technical, specific. None of those specifics traced back to that example's input. Haiku copied the demonstrated pattern of confident architectural detail and ignored the rule.
The fix
Down-skilling v1.2.0 shipped two audit steps that must pass before any down-skilled prompt gets delivered, plus a new example pattern:
- Source-anchoring. Every concrete fact in an example output must be inferrable from that example's input. If you can't reproduce the output knowing only the input, either add the detail to the input or remove it from the output.
- Length calibration. Example output lengths must sit inside the stated output range; rules don't override the example central tendency.
The new example pattern is "model the silence" — input is abstract, output explicitly names what the source omits ("the announcement does not specify the underlying architecture…") — giving the small model a demonstrated way to handle abstraction other than filling the gap.
A T3b rerun with calibrated examples produced 0/5 hallucinated architectural details. n=5, not the n=20 of the original baseline; the contrast is suggestive rather than conclusive, but the failure mode that was 95% prevalent across twenty runs disappeared across five. Length overshoot is partially fixed too: outputs moved from 43 words average toward 52 (spec was 60–90), tracking the calibrated examples' new central tendency without quite reaching it.