A few months ago I gave an agent what I thought was a clean task:
“Pick up this ticket, implement the obvious fix, open a PR.”
It did exactly that. It opened a PR. It changed code. It even wrote tests.
The PR was beautifully formatted and strategically wrong.
Not broken. Not random. Wrong in a way that looked reasonable for fifteen seconds, then triggered a twenty-minute code review intervention and one deeply disappointed architect.
The Model Was Not the Problem
When people see this, they blame the model first.
Wrong suspect.
The model only had the instructions I gave it, the tools I exposed, and the guardrails I did not define. That is not intelligence failure. That is interface failure.
If a human engineer got a ticket that vague, they would ask clarifying questions. Agents do not ask enough clarifying questions by default. They infer. And inference under ambiguity creates confident nonsense with clean formatting.
Prompt Quality Is Overrated Without Task Contracts
A lot of teams spend too much time tuning prompts and too little time defining task contracts.
A useful contract for agentic work includes:
- clear success criteria
- allowed files and folders
- explicit output schema
- stop conditions
- fallback behavior
- reviewer expectations
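A contract like this can live in code, as a small object the orchestration layer can enforce before and during a run. A minimal sketch, with illustrative field names that mirror the bullets above (nothing here is a real framework's API):

```python
from dataclasses import dataclass

# Illustrative contract object; field names mirror the list above
# and are assumptions, not part of any real agent framework.
@dataclass
class TaskContract:
    success_criteria: list[str]   # testable statements of "done"
    allowed_paths: list[str]      # files/folders the agent may touch
    output_schema: dict           # expected shape of the final artifact
    stop_conditions: list[str]    # when the agent must halt and ask
    fallback: str                 # behavior when blocked, e.g. "open draft PR, tag owner"
    reviewer_expectations: str    # what the reviewer will inspect hardest

    def is_path_allowed(self, path: str) -> bool:
        """Reject edits outside the declared file boundaries."""
        return any(
            path == p or path.startswith(p.rstrip("/") + "/")
            for p in self.allowed_paths
        )
```

The point is not this exact shape. The point is that the contract is an object your tooling can check, not a vibe in the prompt.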
That looks boring. It is also the reason one workflow feels reliable and another feels haunted.
Most agent failures I see in production are specification failures wearing a model-shaped costume.
What We Changed in Practice
We replaced vague ticket execution with a small structured contract template.
Each ticket now includes:
- objective in one sentence
- non-goals in one sentence
- acceptance test bullets
- risk note for impacted modules
- reviewer lens (what to inspect hardest)
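Enforcing the template at intake is cheap. A sketch, assuming tickets arrive as plain dicts; the field names are invented to match the bullets above, not tied to any real issue tracker:

```python
# Required template fields; names mirror the bullets above and are
# illustrative, not any real tracker's schema.
REQUIRED_FIELDS = {
    "objective",         # one sentence
    "non_goals",         # one sentence
    "acceptance_tests",  # bullets the agent's tests must cover
    "risk_note",         # impacted modules
    "reviewer_lens",     # what to inspect hardest
}

def missing_fields(ticket: dict) -> list[str]:
    """Return the template fields a ticket lacks; empty means agent-ready."""
    return sorted(REQUIRED_FIELDS - ticket.keys())
```

A ticket that fails the check goes back to its author, not to the agent.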
That template did more for output quality than all the prompt adjective upgrades combined.
And yes, I still use good prompts. I just stopped pretending prompts are architecture.
The Review Experience Changed Immediately
Once contract clarity improved, the PRs changed in a very visible way:
- fewer broad rewrites
- fewer speculative abstractions
- tighter test intent
- faster reviewer decisions
The surprising part was not speed. It was reviewer energy. People were less irritated because they were reviewing work, not untangling assumptions.
If You Want Better Agent Output, Start Here
Before changing models, do this for one workflow this week:
- define output schema
- define file boundaries
- define stop conditions
- define what “done” means
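Stop conditions are the item teams skip most often, so one concrete sketch: a list of predicates checked after every agent step. The thresholds and the shape of the `step` dict are assumptions for illustration, not a real framework's API.

```python
# Illustrative stop conditions evaluated between agent steps;
# thresholds and the `step` dict shape are invented for this sketch.
STOP_CONDITIONS = [
    ("touched file outside boundary", lambda step: step["files_outside_boundary"] > 0),
    ("diff too large to review",      lambda step: step["changed_lines"] > 400),
    ("tests deleted",                 lambda step: step["tests_removed"] > 0),
]

def should_stop(step: dict) -> list[str]:
    """Return reasons the agent must halt and ask a human; empty means continue."""
    return [reason for reason, tripped in STOP_CONDITIONS if tripped(step)]
```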
Then let the agent run.
You will still get mistakes. But they will be smaller, easier to review, and less theatrical.
The model can be smart and still fail under vague instructions.
That is not a paradox. That is engineering.