2026.04.25

AI Writes the Code. Who's Testing It?

By Lloyd Rowat

In March 2026, Diffblue ran a benchmark across eight real-world Java projects. Their autonomous testing agent averaged 81% line coverage and 61% mutation coverage across those projects, from a single prompt, with no human in the loop. The baseline was an experienced developer using Claude Code, iterating for two hours per project. That developer hit 32% line coverage and 24% mutation coverage.

A 2.5x gap in favor of the machine, on a task humans have done for fifty years.

Almost nobody covered the result. Nearly every AI dollar this cycle has gone to the writing side: Copilot, Cursor, Claude Code, Cognition. The reading side, the testing side, the verification side, has been treated as an afterthought. That is starting to change, and the tools arriving now are not Selenium with an LLM bolted on.

The Asymmetry

Two years of AI-assisted development created a measurable quality problem. Telemetry analysis from getDX across 22,000 developers found incidents per pull request up 242.7%, more than tripling the chance that a merged change ships an issue to production. Industry surveys put AI-authored code somewhere between a fifth and two-fifths of merged output, depending on which dataset you trust. Throughput went up. Stability went down. The DORA frame for this is "Acceleration Whiplash."

Tests are the natural circuit breaker, but most teams' test suites were already overwhelmed before AI started piping in code at multiples of the previous rate. Google's testing engineers have written publicly that 16% of their tests exhibit some flakiness, and that 84% of pass-to-fail transitions in their CI are flaky failures rather than real bugs. Atlassian's December 2025 engineering post on Flakinator estimated 150,000 developer hours per year wasted investigating flaky failures inside a single major repository.
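To make "flaky" concrete: one common working definition treats a test as flaky when it both passes and fails against the same code revision. The sketch below applies that definition to a list of run records. The record shape and the function name are assumptions made for illustration, not how Google or Atlassian classify failures.

```python
from collections import defaultdict


def find_flaky(runs):
    """Return the (test, revision) pairs that both passed and failed.

    `runs` is an iterable of (test_name, revision, passed) tuples,
    a hypothetical record shape chosen for this example.
    """
    outcomes = defaultdict(set)
    for test_name, revision, passed in runs:
        outcomes[(test_name, revision)].add(passed)
    # Mixed results on identical code mean the test changed its mind, not the code.
    return {key for key, seen in outcomes.items() if seen == {True, False}}


runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),     # same revision, different result
    ("test_checkout", "abc123", False),
    ("test_checkout", "abc123", False),  # fails consistently: more likely a real bug
    ("test_search", "abc123", True),
]
flaky = find_flaky(runs)
```

Here only `test_login` is flagged. `test_checkout` fails every time, so it stays in the pile a human should actually look at.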

The AI testing wave is walking into that inheritance: more code, faster code, and a test infrastructure already buckling. The same shift that put Claude Code in the driver's seat for code authoring is finally arriving on the verification side, and the funding numbers reflect it.

The Four Real Categories

Set aside the marketing pages. The serious work breaks into four kinds of tools, each agentic from the foundation rather than retrofitted onto last decade's frameworks.

Autonomous unit-test generation. Diffblue is the clearest example. Their March 2026 Testing Agent operates on top of an existing AI coding stack like Copilot or Claude Code and writes verified regression tests across an entire codebase without developer intervention. The 81% line coverage figure is not a demo. It is an average across eight projects on the first run, and the mutation coverage of 61% is the harder, more honest number to beat.

Agentic end-to-end testing. QA Wolf, mabl, and Momentic all sit here. QA Wolf raised a $36M Series B in 2024 led by Scale Venture Partners, bringing their total to roughly $57M, and runs a multi-agent pipeline that ingests video walkthroughs and DOM snapshots and emits production Playwright code for web and Appium for mobile. Momentic announced a $15M Series A in November 2025 from Standard Capital with Dropbox Ventures participating, focused on developer-native E2E flows authored in plain English. Mabl unveiled a next-generation agentic platform on April 23, 2026, adding Agent Instructions for persistent quality standards and cloud test generation; their customer list runs through Mercedes-Benz, JetBlue, and LendingClub Bank, which is to say it is no longer a startup-only product.

Visual AI. Applitools shipped Eyes 10.22 in January 2026 with a Storybook Addon and a Figma Plugin, pulling visual regression testing out of CI and into the place developers and designers already work. Their diff algorithm has been trained on years of human-labeled image pairs, so it ignores the variations a human reviewer would also ignore and flags the ones that matter.

Deterministic simulation testing. Antithesis closed a $105M Series A in December 2025 led by Jane Street. Their platform runs years of production-equivalent simulation in a few hours, fully deterministic, and reproduces any bug it finds on demand. It is in production use at the Ethereum network and at firms with the kind of distributed-systems edge cases that conventional E2E tests cannot reach.
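The core idea behind deterministic simulation can be shown at toy scale: route every source of randomness through one seeded generator, search seeds until an invariant breaks, then replay the failing seed to get the identical run. The sketch below is an illustration under those assumptions, not a description of how Antithesis works internally. The two-counter system, its deliberate bug, and all names are invented for the example.

```python
import random


def simulate(seed, steps=20, drop_rate=0.02):
    """Run a toy primary/replica counter under a simulated lossy network.

    Returns (ok, trace). Every random choice comes from `rng`, so the
    same seed always yields the same trace.
    """
    rng = random.Random(seed)  # the only source of nondeterminism
    primary, replica = 0, 0
    trace = []
    for step in range(steps):
        primary += 1
        dropped = rng.random() < drop_rate  # injected fault: lost message
        if not dropped:
            # Deliberate bug: the replica applies a delta, so a lost
            # message is never repaired and the counters diverge for good.
            replica += 1
        trace.append((step, dropped, primary, replica))
    return primary == replica, trace


def first_failing_seed(max_seeds=1000):
    """Return the first seed whose run violates the invariant, or None."""
    for seed in range(max_seeds):
        ok, _ = simulate(seed)
        if not ok:
            return seed
    return None


seed = first_failing_seed()
ok_first, trace_first = simulate(seed)
ok_replay, trace_replay = simulate(seed)  # replay: same seed, same run
```

Because the trace is a pure function of the seed, a failure found once can be reproduced as many times as debugging requires. That is the property conventional E2E runs against real networks and clocks cannot offer.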

Four categories, all funded, all shipping. Disclosed rounds in AI-native testing have crossed a billion dollars in the last eighteen months, and the cycle is still early.

Why Testing Is Where AI Wins Cleanly

There is a structural reason this wave is delivering results faster than the discourse expected. Testing has properties that play to AI's strengths and away from its weaknesses.

Outputs are verifiable. A test either covers a line of code or it does not. A property either holds or it fails. Unlike taste-driven work where "good" is contested, testing has ground truth.

The repetition is hostile to humans and friendly to machines: selector maintenance, snapshot updates, fixture rebuilds, regression triage. Self-healing test platforms consistently report 70% to 90% reductions in false-failure rates within the first two weeks of adoption, with selector maintenance trending toward zero. That is not a marketing claim from one vendor but a consistent number across multiple platforms, because the underlying problem is mechanical.
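The mechanical nature of selector maintenance is easy to see in a sketch. A self-healing locator, in its simplest form, is an ordered list of fallbacks: when the primary selector stops matching, try the next one and record that a heal happened. The code below assumes a page modeled as a list of dicts and locators as (attribute, value) pairs. Both are hypothetical stand-ins, and commercial platforms presumably draw on richer signals than this.

```python
def locate(dom, locators):
    """Try each locator in order; return (element, healed_from).

    `dom` is a hypothetical list of element dicts standing in for a page.
    `healed_from` is the broken primary locator when a fallback matched,
    or None when the primary locator still works.
    """
    for index, (attr, value) in enumerate(locators):
        for element in dom:
            if element.get(attr) == value:
                healed_from = locators[0] if index > 0 else None
                return element, healed_from
    raise LookupError(f"no locator matched: {locators}")


# The button's id changed in a redesign; its test id and text did not.
dom = [
    {"id": "btn-submit-v2", "data-testid": "checkout-submit", "text": "Place order"},
    {"id": "btn-cancel", "data-testid": "checkout-cancel", "text": "Cancel"},
]
locators = [
    ("id", "btn-submit"),                  # primary, now stale
    ("data-testid", "checkout-submit"),    # first fallback
    ("text", "Place order"),               # last resort
]
element, healed_from = locate(dom, locators)
```

The test keeps running against the renamed button, and `healed_from` gives the platform something to report so a human can decide whether the primary selector should be updated.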

The error budget is generous. A QA agent that writes a redundant test costs you compute. A coding agent that writes a redundant abstraction costs you a refactor in six months. The downside profile is asymmetric, and the asymmetry favors letting the machine do the testing first.

None of this is a story about replacing QA engineers. It is a story about giving them leverage proportional to the problem they have been asked to solve. Test-suite maintenance has been a thankless job for a decade. The agentic platforms are taking that load off and freeing test engineers to focus on the work that actually requires judgment: what to test, what risk model to apply, and what to do with the failures the agents surface.

The Catch

There are two real problems this second wave has not solved yet.

Shallow assertions. AI-generated tests pass a lot. They also tend to assert what the code does, not what the code should do. A test that locks in current behavior is regression protection, but it is not specification. Multiple QA leads have noted that AI-written tests can give a false sense of coverage when the assertions do not validate business logic. Diffblue's mutation coverage number is credible precisely because it measures mutation kill rate, not just line touch. Most tools do not.
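The difference between line touch and mutation kill fits in a few lines. In the sketch below, a hand-written mutant flips `>=` to `>`, the kind of small edit a mutation tool applies automatically. A shallow test executes every line and still passes against the mutant; a test that asserts the intended boundary fails against it, which is what "killing" a mutant means. The discount rule and all function names are invented for illustration.

```python
def price_after_discount(total):
    """Original rule: orders of 100 or more get 10% off (integer units)."""
    return total - total // 10 if total >= 100 else total


def mutant(total):
    """Mutant: `>=` changed to `>`, so an order of exactly 100 gets no discount."""
    return total - total // 10 if total > 100 else total


def shallow_test(fn):
    # Exercises both branches (full line coverage) but asserts almost nothing.
    return fn(150) > 0 and fn(50) > 0


def boundary_test(fn):
    # Asserts intended behavior, including the boundary at exactly 100.
    return fn(150) == 135 and fn(50) == 50 and fn(100) == 90


shallow_survives = shallow_test(price_after_discount) and shallow_test(mutant)
boundary_kills = boundary_test(price_after_discount) and not boundary_test(mutant)
```

Both tests report identical line coverage. Only the second one would notice if the discount threshold quietly changed, and only a mutation score would tell the two apart.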

Integration fatigue. Buying an agentic E2E platform, an autonomous unit-test generator, a visual AI tool, and a deterministic simulator is not a strategy. Each one wants its own dashboards, its own CI hooks, and its own failure-triage workflow. The teams getting real value are the ones picking one or two layers and going deep, not stacking four agents and watching them fight for attention.

What This Means

The pattern matches what happened on the development side, just delayed by about eighteen months. First the new generation of tools shipped, then the discourse caught up, then enterprise budgets shifted. The discourse is still catching up to AI testing. The budget shift is starting now.

If the dev-tools wave was about asking "can the AI write this for me?", the test-tools wave is about asking "can the AI verify what just got written?" On benchmarks where the work is well-defined, the honest answer is starting to be yes. Diffblue's 81 versus 32 is not the headline of the year. It is the leading indicator.
