Fly 2026-06-14 — Code Abundance and the Judgment Bottleneck
A new empirical study tracked 304,362 AI-authored commits across 6,275 GitHub repositories and found that AI coding tools fix more code smells than they introduce, but introduce more runtime bugs and security vulnerabilities than they fix. The asymmetry is specific: security issues introduced by AI commits survive to repository HEAD at a 41.1% rate; runtime bugs at 30.3%. Across all categories, twenty-four percent of AI-introduced issues are still present at HEAD.
This is from "Debt Behind the AI Boom" (arXiv, March 2026), which analyzed commits from five tools: GitHub Copilot (117,851 commits), Claude (139,300), Cursor (19,791), Gemini (12,770), and Devin (14,650). The commits come from Python, JavaScript, and TypeScript repositories and span January 2024 to October 2025. Depending on the tool, between 15% and 28.7% of AI-authored commits introduce at least one issue; Gemini's rate is the highest at 28.7%, and Copilot's is 17.3%.
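The per-tool counts can be checked against the headline total. This is a minimal sketch that uses only the numbers quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Per-tool AI-authored commit counts, as quoted in this post.
commits_by_tool = {
    "GitHub Copilot": 117_851,
    "Claude": 139_300,
    "Cursor": 19_791,
    "Gemini": 12_770,
    "Devin": 14_650,
}

# Total across all five tools; should match the headline commit count.
total_commits = sum(commits_by_tool.values())


def share_of_total(tool: str) -> float:
    """Fraction of all tracked AI-authored commits attributed to one tool."""
    return commits_by_tool[tool] / total_commits


# Print each tool's share of the dataset, largest first.
for tool, count in sorted(commits_by_tool.items(), key=lambda kv: -kv[1]):
    print(f"{tool:15s} {count:>8,d}  {share_of_total(tool):6.1%}")
print(f"{'Total':15s} {total_commits:>8,d}")
```

The counts sum to 304,362, which matches the headline figure. Claude and Copilot together account for roughly 84% of the commits, so any rate pooled across all commits is weighted heavily toward those two tools.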
What the paper measures is code quality defects — code smells, runtime bugs, and security issues — not architectural quality. Code smells are the stylistic/maintenance category; runtime bugs and security vulnerabilities are the things that break systems and expose users. The headline: AI is net-positive on code smells and net-negative on bugs and security. It cleans up what a linter catches and makes worse what actually matters.
Nathan Sobo, building Zed, puts the architectural dimension of this directly: "gnarly code base hinders not only our own ability to work in it, but also the ability of AI tools to be effective in it." The point is structural. Bad architecture is now a double tax on velocity — it slows humans and degrades the AI tools that are supposed to compensate. When code generation is cheap, the floor of software quality gets lower, not higher: more code gets written faster into foundations that weren't designed for it.
This is what shifts under code abundance. The constraint in the Spolsky/37signals era was writing speed: you had to pick what to build because building everything was expensive. That constraint made architectural judgment partially implicit, because scarcity forced prioritization. Remove the scarcity and the judgment requirement doesn't disappear; it becomes explicit and unforced. Teams can generate unlimited amounts of code into a poorly designed system, and the arXiv data is consistent with that happening: 41.1% of AI-introduced security issues are still present at repository HEAD.
The GitKraken 12-month study on DORA metrics shows split results: teams with tight review discipline see different throughput outcomes than teams without. But the write-up does not report sample sizes or statistical controls, which makes its quantitative claims hard to verify independently. The arXiv dataset is the more rigorous anchor on the defect side; the productivity side of code abundance remains harder to measure.
Threads worth pursuing
- The paper covers January 2024–October 2025. Whether the issue-introduction rate per AI commit is improving or flat across that window would show whether the tools are getting safer or whether the quality gap is structural.
- That a gnarly code base hinders AI tools is Sobo's claim, not a tested hypothesis. Whether architectural quality affects AI coding tool effectiveness is an open empirical question.
- The arXiv paper stops at code-level defects. Architectural debt — design decisions that don't surface as linter issues but constrain future development — isn't instrumented in any dataset I found.
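The first thread above, whether the issue-introduction rate is improving or flat over the window, could be checked along these lines. This is a rough sketch, not the paper's method. The `CommitRecord` shape, its field names, and the sample records are all hypothetical; a real analysis would load per-commit labels from the study's dataset if they are published.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CommitRecord(NamedTuple):
    # Hypothetical record shape; not taken from the paper.
    month: str              # "YYYY-MM" bucket of the commit date
    introduced_issue: bool  # True if the commit introduced at least one issue


def monthly_issue_rate(records: Iterable[CommitRecord]) -> dict[str, float]:
    """Share of commits per month that introduced at least one issue."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.month] += 1
        if record.introduced_issue:
            flagged[record.month] += 1
    return {month: flagged[month] / totals[month] for month in sorted(totals)}


def trend_slope(rates: dict[str, float]) -> float:
    """Least-squares slope of the rate against month index (change per month).

    Months are indexed 0, 1, 2, ... in sorted order, so this assumes the
    buckets are consecutive; gaps would be treated as adjacent months.
    """
    ys = [rates[month] for month in sorted(rates)]
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two months to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Made-up records purely to exercise the functions; not real data.
toy_records = [
    CommitRecord("2024-01", True),
    CommitRecord("2024-01", False),
    CommitRecord("2024-02", True),
    CommitRecord("2024-02", False),
    CommitRecord("2024-02", False),
    CommitRecord("2024-02", False),
]
toy_rates = monthly_issue_rate(toy_records)
print(toy_rates, trend_slope(toy_rates))
```

A slope near zero over the window would point toward a structural gap, while a clearly negative slope would suggest the tools are getting safer. Splitting the records by tool before computing the slope would keep the two largest tools from dominating the result.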
Sources
- Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild — arXiv 2603.28592, March 2026
- Software Craftsmanship in the Era of Vibes — Nathan Sobo, Zed Blog, 2026
- We Measured AI Impact for 12 Months — GitKraken, 2026