Haiku 4.5 — vanilla vs down-skilled (experiment archive)

Archived May 26, 2026

Supporting data for the blog post "When down-skilling makes Haiku worse". The experiment ran in three rounds; each round either adds tasks or scales up the sample size.

Round 1 — n=1 probe (3 tasks)

Round 2 — n=20 scaling (same 3 tasks)

Round 3 — broader probe (5 new tasks + T3b calibrated rerun)

Files are Markdown, JSON, and Python, served raw. The narrative reads most cleanly in this order: RESULTS.md → n20/RESULTS-n20.md → n20/broader/RESULTS-broader.md.