Haiku 4.5 — vanilla vs down-skilled (experiment archive)

Archived May 26, 2026

Supporting data for the blog post When down-skilling makes Haiku worse. The experiment ran in three rounds; each round adds tasks or scales sample size.

Round 1 — n=1 probe (3 tasks)

RESULTS.md — qualitative findings across triage, code review, and voice rewrite. The initial probe.
GUIDE.md — practitioner takeaways.
prompts/ — vanilla and down-skilled prompts for each task.
outputs/ — Haiku outputs from the n=1 run.

Round 2 — n=20 scaling (same 3 tasks)

n20/RESULTS-n20.md — quantitative scaling. This is where the 19/20 hallucination finding lives.
n20/scores.json — raw metrics.
n20/score.py — scoring rubric code.
n20/chart.py — chart generation.
n20/comparison.png — visual comparison.
n20/outputs/ — all 120 Haiku outputs.

Round 3 — broader probe (5 new tasks + T3b calibrated rerun)

n20/broader/RESULTS-broader.md — JSON extraction, changelog generation, sort-and-filter, NL→CLI, copyedit, plus the T3b voice-rewrite rerun with calibrated examples.
n20/broader/proposed-skill-patch.md — the patch that became down-skilling v1.2.0.
n20/broader/prompts/ — including the calibrated T3b prompt.
n20/broader/outputs/ — Haiku outputs for the broader probe.
n20/broader/scores.json, score.py — raw metrics + rubric.

Files are markdown / JSON / Python served raw. The narrative reads cleanest in this order: RESULTS.md → n20/RESULTS-n20.md → n20/broader/RESULTS-broader.md.