The Autopsy · Apr 2026 · 7 min

Autopsy: the AI copilot we killed three weeks before launch

The hardest thing a studio does isn't shipping. It's killing something that already works in the demo, after the money's been spent and the launch date is on the calendar. Here's one of those decisions, and why we'd make it again.

A note on this one. This is a composite: a pattern we've watched play out more than once, assembled into a single account with identifying details changed. We won't name a real client's failure to make a point. The mechanics, the numbers we cite, and the decision are true to how this actually goes.

The setup

An AI copilot inside a vertical SaaS product — the kind of feature that's on every roadmap in 2026. Type a request in plain English, the copilot reads your data, takes the action, explains what it did. The demo was genuinely impressive. It closed deals. The launch date was set and announced.

The demo that lied

The demo always uses the happy path: clean inputs, a request the copilot has effectively seen, a reviewer who knows what the right answer looks like. None of those hold in production. The copilot that nails the staged request confidently mangles the messy real one — and does it in the same calm, authoritative tone, which is exactly what makes it dangerous. The demo proved the ceiling. It said nothing about the floor.

The eval that failed

Three weeks out, we did the thing that should happen before a launch date is ever announced: we built a real eval set, a few hundred actual user requests with known-correct outcomes, and ran the copilot against it. It passed comfortably on the simple slice and fell apart on the realistic one, taking confidently wrong actions on a share of tasks that no amount of polish would have fixed in three weeks. That tracks with the wider picture: agents that work in a controlled demo fail in real workflows at rates reported as high as ~88% (Fiddler AI).
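For readers who want the shape of that eval, here is a minimal sketch in Python. It is not the harness from this account. The names (EvalCase, run_eval, launch_blockers), the slice labels, the invoice requests, and the 0.95 threshold are all assumptions made up for illustration, and the toy copilot is a stand-in that only recognizes staged phrasing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    request: str          # a real user request, kept verbatim
    expected_action: str  # the known-correct outcome for that request
    slice_name: str       # hypothetical labels: "simple" or "realistic"


def run_eval(copilot: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run the copilot on every case and return the pass rate per slice."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.slice_name] += 1
        if copilot(case.request) == case.expected_action:
            passed[case.slice_name] += 1
    return {name: passed[name] / total[name] for name in total}


def launch_blockers(rates: Dict[str, float], threshold: float) -> List[str]:
    """Slices whose pass rate falls below the bar (threshold is an assumption)."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


def toy_copilot(request: str) -> str:
    # Stand-in for the real system: it handles the staged phrasing and
    # confidently returns a wrong action for everything else.
    staged = {
        "archive invoice 1042": "archive_invoice(1042)",
        "archive invoice 2077": "archive_invoice(2077)",
    }
    return staged.get(request, "delete_invoice(1042)")


cases = [
    EvalCase("archive invoice 1042", "archive_invoice(1042)", "simple"),
    EvalCase("archive invoice 2077", "archive_invoice(2077)", "simple"),
    EvalCase("get rid of the old 1042 one but keep a copy", "archive_invoice(1042)", "realistic"),
    EvalCase("that 2077 thing, file it away pls", "archive_invoice(2077)", "realistic"),
]

rates = run_eval(toy_copilot, cases)
blockers = launch_blockers(rates, threshold=0.95)
```

Reporting per slice is the point of the sketch: a single blended pass rate would hide exactly the gap between the simple slice and the realistic one.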

For a copilot that takes actions on a customer's data, "confidently wrong" isn't a quality issue. It's a liability.

The decision to kill

The pressure to ship anyway is enormous. The date is public, the budget is spent, the demo still works in the room. Every one of those is a sunk cost, and none of them changes what the eval said: this copilot would erode trust faster than it created value, and we'd be back in six months doing the autopsy from the other side.

Shipping it would have felt like progress for three weeks and like a mistake for a year.

So we killed the autonomous copilot — and shipped the 80% of the work that was real: the same AI, reframed as a suggest-and-confirm assistant that drafted the action and made a human approve it. Less magical in the demo. Trusted in production. Still in use.
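The suggest-and-confirm shape can be sketched in a few lines of Python. This is an illustration under assumptions, not the shipped code: DraftAction, suggest_and_confirm, and the approve callback are hypothetical names, and a real product would put the approval step in the UI.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class DraftAction:
    description: str            # plain-English summary shown to the human
    execute: Callable[[], str]  # the side effect, deferred until approval


def suggest_and_confirm(
    draft: DraftAction,
    approve: Callable[[DraftAction], bool],
) -> Optional[str]:
    """Run the drafted action only if the human approves it."""
    if not approve(draft):
        return None  # nothing touched the customer's data
    return draft.execute()


# Toy usage: the "data" is a list, and the drafted action appends to it.
audit_log: List[str] = []


def archive_invoice() -> str:
    audit_log.append("archived invoice 1042")
    return "done"


draft = DraftAction("Archive invoice 1042", archive_invoice)

rejected = suggest_and_confirm(draft, approve=lambda d: False)
log_after_reject = list(audit_log)

accepted = suggest_and_confirm(draft, approve=lambda d: True)
```

The design choice is that the model's output is a description plus a deferred action, never the action itself, so a confidently wrong draft costs a rejected suggestion instead of a changed record.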

The lesson

Two, actually. First: build the eval set before you set the launch date, because the eval is what tells you whether the date is real. Second: "kill the feature" is almost never "kill the work" — the reliable version is usually hiding inside the ambitious one, with the autonomy turned down to where trust can survive.

The feature that demos best and the feature that survives production are rarely the same feature. The eval is how you find out which one you're holding — while you still have time to choose.

Sources
  1. Fiddler AI — AI Agent Failure Rate (demo-to-production gap)
  2. CallSphere — The 10 Biggest Agentic AI Disasters of Early 2026