Let me describe the worst kind of AI failure. Not the spectacular kind — not the loop that runs 800 iterations, not the model that returns gibberish, not the agent that accidentally deletes something. Those are easy to catch. They announce themselves.
The worst kind of AI failure is the one that arrives formatted like a senior engineer’s code review. Complete sentences. Correct paragraph spacing. Bullet points that are syntactically impeccable. A measured, authoritative tone. Wrong from premise to conclusion, but delivered with the unshakeable composure of someone who has never once in their life felt the specific anxiety of not knowing if they’re right.
I have watched this failure mode take down a production feature. I have watched it sail through code review because the explanation was so well-written that the reviewer assumed the logic was sound. I have watched it generate a technically correct answer to the wrong question, which is in some ways worse than a wrong answer, because the wrong answer would have triggered alarm bells.
The Confidence Calibration Problem
Human experts communicate uncertainty. This is a learned social behaviour with real functional value. “I think this is right but double-check” is not hedging — it is information. It tells the reader where to direct their critical attention. It is a flag planted in the text that says here, specifically here, apply scrutiny.
Models do not hedge in this way. They produce confident prose regardless of underlying uncertainty. The outputs of “I have high certainty this is correct” and “I am generating plausible text about a topic I may be misrepresenting” look, on the surface, identical. Same formatting. Same declarative tone. Same complete, grammatically perfect sentences.
This is not a bug in the way people mean when they say “the model should be less confident.” The model is doing exactly what it was trained to do: produce fluent, helpful, well-structured text. The confidence is a feature of the output format, not a signal about the accuracy of the underlying content. Conflating the two is a category error made millions of times a day by people who should know better. Including, periodically, engineers who build these systems.
I once watched a model confidently explain the failure mode of a system I had built. It sounded entirely plausible, referenced real concepts, and was wrong in a way that was only detectable if you had built the system and knew where the actual failure mode lived. The explanation was better-written than my documentation.
What This Means for Review
It means that “well-explained” is not a proxy for “correct.” This sounds obvious. It is not, in practice, obvious. Human brains pattern-match on surface quality. A well-formatted, articulate, confidently written explanation feels authoritative. Treating it with suspicion requires active effort — the kind of effort that, after the fortieth PR of the week, is in short supply.
The solution is not to trust the model less globally. It is to define, specifically, what you validate and how, before output reaches production. Structured outputs over prose. Schema validation before downstream consumption. Domain-specific checks that catch the plausible-wrong-answer before it becomes an incident.
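As a concrete sketch of what “schema validation before downstream consumption” can look like. The field names and allowed statuses here are hypothetical, purely for illustration:

```python
# Minimal structural validation of a model's output before anything
# downstream consumes it. Fields and statuses are illustrative.
import json

ALLOWED_STATUSES = {"active", "suspended", "closed"}

def validate_account_update(raw: str) -> dict:
    """Parse model output and fail loudly on anything off-schema."""
    data = json.loads(raw)  # malformed JSON dies here, not three layers down
    if set(data) != {"account_id", "status"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if not isinstance(data["account_id"], str):
        raise ValueError("account_id must be a string")
    if data["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status: {data['status']!r}")
    return data
```

A rejected output is an error you see immediately. A plausible one that slips through prose review is an incident you see in three weeks.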
The model returned a correct-looking JSON object with an incorrect field value. The field was a status enum. The value it chose was valid in the schema but wrong for the business logic. Tests passed because tests mocked the output. The incident surfaced three weeks later in a customer-facing report. The explanation the model gave when asked why it chose that value was the most well-reasoned thing in the entire post-mortem document. It was still wrong.
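A schema-level enum check would have accepted that value. What catches it is a domain rule the schema cannot express, run against real output rather than a mock. A sketch, with hypothetical statuses:

```python
ALLOWED_STATUSES = {"active", "suspended", "closed"}  # all schema-valid

def check_status_transition(current: str, proposed: str) -> None:
    """A business rule no schema encodes: closed accounts stay closed."""
    if proposed not in ALLOWED_STATUSES:
        raise ValueError(f"off-schema status: {proposed!r}")
    if current == "closed" and proposed != "closed":
        # schema-valid value, wrong transition: exactly the failure
        # a test suite that mocks the model's output never exercises
        raise ValueError("closed accounts cannot be reopened")
```

The check is boring. That is the point. It does not read the model's explanation; it reads the value.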
The Apology Problem
Here is the part that adds insult to injury: when you point out that the model was wrong, it apologises. Graciously. At length. With a corrected answer that is delivered with exactly the same confidence as the original wrong one.
There is no shame spiral. There is no visible recalibration. There is no “I have been more cautious in my other answers since this incident.” There is just: a correction, an apology formatted like a LinkedIn post, and then business as usual. Full confidence resumed. Forward march.
This matters because shame — in humans — is a functional signal. It indicates that an error was registered, weighted as significant, and is being used to update future behaviour. The model’s apology contains none of this information. It is grammatically equivalent to shame but mechanically identical to everything else the model produces. It is the shape of accountability without the substance of it.
The Actual Fix
Validate outputs structurally, not stylistically. A well-written explanation of why a thing works is not evidence the thing works. Run the thing. Check the output against known constraints. Test the edge cases the model didn’t think to mention because it was too busy writing confident prose about the happy path.
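One way to make “run the thing” concrete is a small constraint harness: feed the generated code edge cases and check invariants, instead of reading its explanation. Everything below, including `model_slugify`, is a hypothetical stand-in:

```python
def model_slugify(title: str) -> str:
    # stand-in for a model-written helper; handles the happy path only
    return title.lower().replace(" ", "-")

def check_slug_invariants(fn) -> list:
    """Run fn over edge cases and collect constraint violations."""
    failures = []
    for s in ["Hello World", "  padded  ", "", "a  b"]:
        out = fn(s)
        if out != out.lower():
            failures.append((s, "not lowercase"))
        if " " in out:
            failures.append((s, "contains a space"))
        if out != out.strip("-"):
            failures.append((s, "leading or trailing hyphens"))
    return failures
```

Against this harness, the happy-path implementation fails on the padded input: an edge case no confident paragraph mentioned.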
Build review into the architecture, not into the reader’s goodwill. The reader’s goodwill is finite. The model’s confidence is not. One of these will run out first.
And when the model is wrong — which it will be, graciously and at length — treat the apology as exactly what it is: a placeholder where shame would go in a human. Informative about format. Uninformative about whether the new answer is correct.
Validate the correction too. It was generated with the same process as the original.
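In code, that principle is just a loop: the retry passes through the same gate as the first attempt. A sketch, with `generate` and `validate` standing in for your model call and your checks:

```python
def ask_until_valid(generate, validate, max_attempts=3):
    """Re-run the same validation on every attempt; a correction earns no trust."""
    for _ in range(max_attempts):
        candidate = generate()
        try:
            validate(candidate)  # the correction faces the same checks as the original
            return candidate
        except ValueError:
            continue  # apology noted; output still unverified
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

The apology never enters the control flow. Only the validated value does.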