[Figure: audit findings across 1,000 lines, in four categories. Correct: happy paths, formatting/naming, comments/docs. Concerning: copied error messages, shallow null checks, missing validation. Wrong: async race conditions, auth logic gaps, hallucinated SDK methods. Subtle disaster: works but misleads; passes tests but is wrong; right logic, wrong domain. Real production codebase; anonymity requested. I would have too.]
Peer Review But Make It Existential. 1,000 Lines. No Mercy. The Tests Passed. They Were Wrong. Comments Were Excellent. Code Was Not. Author Still Recovering.

A client asked me to review their codebase before a significant infrastructure migration. Standard stuff. Then they mentioned, almost as an aside, that roughly 40% of the recent additions had been written primarily by AI — Copilot completions, Claude-assisted implementation, GPT-4 for the “quick stuff.” They said this like it was fine. Like it was even slightly normal.

It is increasingly normal. That’s either exciting or alarming depending on how carefully you’re reading the pull requests.

I read those pull requests. All of them. Line by line, function by function, over the course of two days that I will not be getting back. What I found was not what I expected — in both directions.

What AI-Generated Code Is Genuinely Good At

Let’s start here because the discourse typically skips it. AI-generated code is, on average, better formatted, better commented, and better named than equivalent code written by a human in a hurry. This is not a small thing. A significant portion of real-world production code is written by humans in a hurry. The AI is not in a hurry. The AI has infinite patience for consistent casing, clear variable names, and docstrings that actually describe what the function does.

The happy-path logic was also solid. Standard CRUD operations, data transformations, mapping between DTOs, basic validation flows — all of it was correct, readable, and wouldn’t have raised a flag in any reasonable code review. If you showed this code to a mid-level engineer without context, they’d approve it.

This is important. The floor is high. The AI is not producing code that’s obviously broken. It’s producing code that passes a visual scan, passes lint checks, and in most cases passes tests.

That last sentence is where things start getting unsettling.

The Tests Were Wrong

Not all of them. But enough. The AI wrote tests that verified its own assumptions — which is fine if the assumptions are correct, and a very confident lie if they aren’t. Tests that mocked the database, tested the mock, and called it integration coverage. Tests that used expected values that were… technically correct for the code as written, but wrong for the domain the code was supposed to model.
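The mock-the-mock pattern is easy to reproduce. A minimal sketch, with invented names (OrderService, get_order, and save_order are stand-ins, not the client's actual code):

```python
from unittest.mock import MagicMock

# Hypothetical service under test; the names are illustrative stand-ins.
class OrderService:
    def __init__(self, db):
        self.db = db

    def complete_order(self, order_id):
        order = self.db.get_order(order_id)
        order["status"] = "completed"
        self.db.save_order(order)
        return order

def test_complete_order():
    db = MagicMock()
    db.get_order.return_value = {"id": 1, "status": "pending"}

    result = OrderService(db).complete_order(1)

    # Both assertions pass -- but they only verify what the mock was told
    # to return, plus the one field the code mutates. No real persistence,
    # no business definition of "completed", no integration anywhere.
    assert result["status"] == "completed"
    db.save_order.assert_called_once()
```

The test is green, the coverage report is happy, and nothing about the real database, the real schema, or the real meaning of "completed" has been exercised.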

The code was correct. The test was correct. The behaviour they were jointly validating was incorrect. The business stakeholder who defined the original requirement would not have recognised the output as what they asked for.

This is the subtle disaster category, and it’s the one that keeps me up at night. Not the obviously broken code — that gets caught in review. It’s the code that works perfectly and does precisely the wrong thing, validated by tests that confirm it works precisely as written.

To catch this, you need someone who understands the domain, not just the syntax. The AI does not understand your domain. It has extensive knowledge of general patterns and can reason about code structure fluently. It has no idea what your company’s definition of a “completed transaction” is, or why that specific edge case matters, or what the business would look like if that edge case were silently mishandled for six months.

The Patterns That Appeared Repeatedly

Across the thousand lines, certain failure signatures appeared often enough to be diagnostic.

Shallow error handling. AI-generated code catches exceptions and logs them. Consistently. Reliably. And in several cases, entirely silently — the exception was caught, logged to a logger no one monitored, and the function returned a default value that looked like a success to the caller. The happy path continued. The failure was invisible.
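The shape was roughly this (fetch_balance, the client object, and the logger name are all hypothetical, reconstructed from the pattern rather than copied from the codebase):

```python
import logging

logger = logging.getLogger("payments")  # a logger nobody was monitoring

def fetch_balance(account_id, client):
    """Look up an account balance. The except clause swallows the failure."""
    try:
        return client.get_balance(account_id)
    except Exception as exc:
        # Caught, logged, and replaced with a "safe" default. To the
        # caller, a network outage is now indistinguishable from an
        # account that genuinely holds zero.
        logger.error("balance lookup failed: %s", exc)
        return 0.0
```

The caller sees a number, not an error, so the happy path continues and the failure never surfaces anywhere a human looks.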

Hallucinated library methods. This one is well-documented elsewhere but still surprised me in practice. Three separate files called methods on third-party libraries that did not exist. Not deprecated — never existed. The code was syntactically plausible. It looked like the kind of method that should exist on that class. The tests had mocked the calls so the method was never actually invoked. The production error would have been spectacular.
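The mechanics are easy to demonstrate. In the sketch below, send_bulk_receipt is a made-up method standing in for the hallucinated calls; a bare MagicMock will happily fabricate it on first access, which is exactly why the tests never noticed:

```python
from unittest.mock import MagicMock

class RealClient:
    """Stand-in for the actual third-party SDK."""
    def send_receipt(self, order):
        return "sent"

def send_receipts(client, orders):
    # 'send_bulk_receipt' looks like a method that should exist on the
    # SDK. It doesn't -- in production this line raises AttributeError.
    return client.send_bulk_receipt(orders)

# The AI-written test mocked the client, so the call was never real:
def test_send_receipts():
    client = MagicMock()
    send_receipts(client, [{"id": 1}])
    client.send_bulk_receipt.assert_called_once()
```

One cheap guard: constructing the mock with `spec=RealClient` (or `create_autospec`) makes it raise AttributeError for attributes the real class lacks, which would have caught all three files at test time instead of in production.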

Confidence without context. Authorization checks that validated the wrong property. Pagination logic that was off by one in a direction that only matters past 100 records. Date handling that was correct for UTC and wrong for the customer’s actual timezone. All of these required domain knowledge to spot. None of them were visible from the code alone.
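The timezone case is worth making concrete. A sketch with invented numbers (the reporting scenario is illustrative, not the client's actual code):

```python
from datetime import datetime, timedelta, timezone

def report_month(ts: datetime) -> int:
    """Bucket a transaction into a reporting month. Correct -- in UTC."""
    return ts.astimezone(timezone.utc).month

# A customer in UTC-8 completes a sale at 4 p.m. local time on January 31.
pacific = timezone(timedelta(hours=-8))
sale = datetime(2024, 1, 31, 16, 0, tzinfo=pacific)

# In UTC that instant is already February 1, so the sale lands in the
# wrong month's report. Nothing crashes; the number is just quietly wrong.
assert report_month(sale) == 2   # February, per the code
assert sale.month == 1           # January, per the customer
```

Every line of that function is defensible in isolation. Only someone who knows which timezone the business reports in can see the bug.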

The Uncomfortable Realisation

The AI doesn’t know what it doesn’t know. It generates plausible code for the context it has. When the context is missing something important — a business rule, a constraint, a “this has to work differently because of how we handle X” — the AI fills the gap with a reasonable general assumption. Reasonable general assumptions in domain-specific code are how subtle production incidents are born.

What This Actually Means for Code Review

It means code review has to change. Not disappear — change.

The current practice of using AI to write code and then doing a surface-level read of the PR is not a process. It is theatre. The review needs to shift from “is this code syntactically correct and reasonably structured” — the AI handles that — to “does this code correctly model the business requirement, handle real-world edge cases, and fail gracefully when things go wrong.”

That’s a harder review. It requires understanding the requirement, not just the code. It requires asking whether the tests test anything meaningful, not just whether the tests pass. It requires domain knowledge the AI doesn’t have and, frankly, that a lot of code reviewers haven’t been exercising because the old form of review consumed all the available time.

~65% of the AI-generated code: correct and acceptable.
~22%: had issues catchable without domain knowledge.
~13%: required domain context to catch — and almost wasn't caught.

That 13% is the number that matters. In a codebase where review is mostly cursory, that 13% silently accumulates. Correct-looking code, plausible tests, domain-wrong behaviour. Patiently waiting.

The answer isn’t to stop using AI for code generation. The output is genuinely useful and the productivity gains are real. The answer is to understand what AI-generated code is actually good at and where its blind spots are — and to build your review process around the blind spots, not the strengths.

The code review isn’t for the AI’s sake. It’s for yours. Act accordingly.