THE DEMO vs. PRODUCTION // A FAMILIAR STORY

THE DEMO 🏆
✓ Task understood
✓ Plan generated
✓ Code written
✓ Tests pass
✓ PR opened
🏆 Investor applause

PRODUCTION 💀
✗ Hallucinated API
✗ Loop: 47 iterations
✗ Confident. Wrong.
✗ Tests: invented
✗ PR: 847 files changed
💀 Silent incident at 3am

The model isn't dumb. You gave it infinite loops and no guardrails. It did exactly what you built it to do. That's the problem.

There is a specific flavour of humiliation that belongs exclusively to AI engineers. It happens like this: you spend two weeks building a multi-agent system. It works flawlessly in every demo. You present it to stakeholders. You ship it. And then you check the logs at 3am to discover your agent has been confidently doing the wrong thing for six hours, looping through the same three actions, billing tokens like a caffeinated accountant on overtime.

Welcome to production. Population: your regrets.

The Demo Problem

The uncomfortable truth about AI agent demos is that they are, by definition, optimised for success. You use the happy path. You use pre-validated inputs. You run it three times and record the take where it works. This is not dishonesty — it’s how every technology demo in history has worked. But with AI agents, the gap between demo performance and production reality is wide enough to park a data centre in.

// Field Observation

The model isn’t dumb. It’s operating on incomplete context, ambiguous instructions, and zero guardrails in an environment it has never seen. You gave it a goal, no constraints, and access to your production database. The model did exactly what you told it to. That’s the problem.

The failure modes in production are not random. They are almost always traceable to one of three causes:

1. The context is wrong. The agent was designed with your happy-path example as the mental model. Real inputs are messier, shorter, more ambiguous, and frequently nothing like your training set. Your agent confidently misclassifies the edge case because you never showed it an edge case.

2. The loop has no exit. Multi-agent systems need hard limits. Max iterations. Max tokens per step. Explicit failure states. Without these, an agent that encounters an unexpected situation will do what humans do when they’re lost: keep trying variations of the same wrong approach until someone stops them or they run out of money.

3. The tools are trusted too much. Agents hallucinate tool outputs. They call APIs that don’t exist, pass parameters that aren’t valid, and then double down when the error comes back by trying to interpret the error message as a success response. Trust nothing. Validate everything.
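The hard limits from points 2 and 3 can be sketched as a bounded loop with explicit exit states. A minimal sketch: `run_step`, the `Status` enum, and the budget numbers are hypothetical shapes chosen for illustration, not any specific framework's API.

```python
from enum import Enum

class Status(Enum):
    DONE = "done"
    FAILED = "failed"                  # ran out of iterations
    BUDGET_EXCEEDED = "budget_exceeded"  # ran out of tokens

MAX_ITERATIONS = 10      # hard cap: the agent never gets attempt #11
MAX_TOKENS = 50_000      # hard cap on total spend for the task

def run_agent(task, run_step):
    """Drive an agent loop with hard exits instead of trusting the model to stop.

    `run_step` is a placeholder: one model call plus tool execution,
    returning {"tokens": int, "done": bool}.
    """
    tokens_used = 0
    for _ in range(MAX_ITERATIONS):
        result = run_step(task)
        tokens_used += result["tokens"]
        if result["done"]:
            return Status.DONE
        if tokens_used > MAX_TOKENS:
            return Status.BUDGET_EXCEEDED  # explicit failure state, not a retry
    return Status.FAILED  # iteration budget exhausted: fail loudly, alert a human
```

The point is that every exit is an enumerated state the caller must handle; "keep trying variations of the same wrong approach" is structurally impossible.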

What Actually Works

I’ve shipped agentic systems in production — including Axon, which runs autonomously against real Jira tickets in a real codebase. The things that made it reliable weren’t clever prompt engineering or a bigger model. They were boring engineering fundamentals:

- Hard iteration limits on every loop
- Structured outputs, never free-form
- Human checkpoint before irreversible actions
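The human-checkpoint item deserves a sketch. The action names, and the `run` and `confirm` callables, are hypothetical placeholders; in practice you would wire `confirm` to a real approval channel (a CLI prompt, a Slack message, a ticket comment).

```python
# Actions the agent may never take without a human in the loop.
IRREVERSIBLE = {"merge_pr", "force_push", "drop_table"}

def execute(action: str, payload: dict, run, confirm) -> bool:
    """Run an agent action, pausing for human sign-off when it can't be undone."""
    if action in IRREVERSIBLE and not confirm(action, payload):
        return False          # human said no: stop here, the agent does not improvise
    run(action, payload)      # reversible, or explicitly approved
    return True
```

The checkpoint lives in the orchestrator, not the prompt, so the model cannot talk its way past it.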

Structured outputs over free text. Every. Single. Time. The moment you let an agent return prose and then parse that prose to decide what to do next, you’ve introduced a failure surface that will bite you at the worst possible moment. Make the model return JSON. Validate the JSON. Fail loudly if it doesn’t match the schema. Do not give the model the benefit of the doubt.
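A minimal sketch of that fail-loudly parse, using only the standard library. The schema and field names are illustrative, not from any real system; a production version would use a proper schema validator.

```python
import json

# Illustrative schema: every agent response must carry exactly these fields.
REQUIRED = {"action": str, "target": str}

def parse_agent_output(raw: str) -> dict:
    """Parse a model response as JSON and fail loudly on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        # No benefit of the doubt: prose, markdown fences, "helpful"
        # preambles all land here and stop the loop.
        raise ValueError(f"agent returned non-JSON output: {e}") from e
    for field, expected_type in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing required field: {field!r}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} has wrong type")
    return data
```

Everything downstream of this function can then trust the shape of the data, which is the whole point.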

Observability isn’t optional. You need to know what your agent decided, why it decided it, what tool it called, what the tool returned, and how the agent interpreted that response. Not for debugging — for the moment your agent does something unexpected at 3am and you need to reconstruct the crime scene.
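One way to make that crime scene reconstructable is to emit each step as a single structured log record. A sketch with the standard `logging` module; the field names are illustrative, and the function also returns the record so callers can persist it elsewhere.

```python
import json
import logging
import time

log = logging.getLogger("agent")

def log_step(step: int, decision: str, tool: str, tool_input: dict,
             tool_output: str) -> dict:
    """Emit one JSON log line capturing a full decide-call-observe cycle."""
    record = {
        "ts": time.time(),              # when it happened
        "step": step,                   # position in the loop
        "decision": decision,           # what the agent decided, and its stated reason
        "tool": tool,                   # which tool it called
        "input": tool_input,            # the exact arguments passed
        "output": tool_output[:2000],   # raw tool response, truncated
    }
    log.info(json.dumps(record))        # one greppable line per decision
    return record
```

Six hours of wrong behaviour then reads as a sequence of timestamped, self-contained records instead of interleaved print statements.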

The Hard Truth About “Just Use GPT-4o”

The model is not the bottleneck. I have seen teams obsess over model selection while completely ignoring orchestration logic, context management, and tool validation. The best model in the world will fail spectacularly if you give it bad context, infinite loops, and access to production systems with no guardrails.

Conversely, GPT-3.5 with good orchestration will outperform GPT-4o with no orchestration on most real tasks. The model is the engine. The orchestration is the car. Stop modifying the engine and build the car first.

Your agent isn’t stupid. It’s doing exactly what you built it to do. The question is whether what you built it to do is what you actually wanted.

Spoiler: usually not. But that’s what iteration is for. Ship carefully, observe ruthlessly, and treat every production incident as the free architecture review it is.