[Figure: Where teams spend time vs. where problems live]
Team attention: prompt wording 68%, model selection 24%, orchestration 5%, error handling 3% (secondhand embarrassment estimate).
Actual failures: no error handling 48%, infinite loops 30%, bad context 17%, bad prompt 5% (from post-mortems I sat in on).
Prompt Archaeology in Progress The Architecture Was Fine Actually String Interpolation Is Not Orchestration Skill Issue. Specifically: Architecture. The Model Didn’t Fail. You Did.

Here is a real thing that happens. A team spends two weeks refining their system prompt. They A/B test phrasings. They add capitalization for emphasis — “YOU MUST return valid JSON” — as if the model is a misbehaving intern who just needs a firmer tone. They consult three blog posts and a YouTube video by someone who has never shipped anything to production. They add more adjectives. They switch from GPT-4o to Claude. They switch back. They try a different temperature.

The system still fails. At which point they add more adjectives.

I’ve been brought in as the third set of eyes on more of these situations than I’d like to count. And the answer is almost never the prompt.

The Prompt Is the Last 5%

Here’s the structural truth that makes prompt-obsession so seductive: prompts are visible. You can read them. Modify them. Show them to stakeholders as evidence of work. A prompt is text. Text is legible. Legibility feels like control.

Orchestration logic, context management, retry strategies, structured output validation — these are invisible until they break. And when they break, they break in production, at volume, in ways that look like the model is being stupid when really the model is being precisely what you built it to be: a very fast text predictor operating on garbage input with no error recovery path.

The model isn’t confused. The model received your poorly assembled context, your ambiguous instruction, your 4,000-token preamble of irrelevant information, and your complete lack of output schema — and it did its best. It always does its best. That’s the terrifying part.

I have reviewed pipelines where the “orchestration” was an f-string. A literal Python f-string. prompt = f"Here is the data: {raw_json_blob}. Do the thing." Shipped to production. Processing real customer data. And when it broke — which it did, spectacularly — the post-mortem discussion was about the word “thing.” Perhaps “task” would perform better?

No. No, the word was not the problem.

The Three Layers Nobody Builds Properly

Good agentic systems have three layers that have nothing to do with what’s in the system prompt.

Context architecture. What information does the agent actually need, in what format, and in what order? Context position matters. Token limits are real. Injecting 6,000 tokens of background documentation before the actual task is not “giving the model full context” — it’s burying the instruction under a haystack. The model will find the needle sometimes. The other times it will confidently summarise the haystack and call it done.
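A minimal sketch of what "context architecture" means in practice, assuming a crude 4-characters-per-token estimate; every name here (build_context, the budget default) is illustrative, not from any particular framework:

```python
# Sketch of a context assembler. Keeps the instruction last and trims
# background to a token budget instead of burying the task under it.

def build_context(task: str, background: list[str], budget_tokens: int = 2000) -> str:
    """Assemble a prompt: most-relevant background first, instruction last."""
    def est(s: str) -> int:
        return len(s) // 4  # crude estimate: roughly 4 chars per token

    kept, used = [], est(task)
    for doc in background:  # assume background is pre-sorted by relevance
        if used + est(doc) > budget_tokens:
            break  # drop the least relevant docs rather than bury the task
        kept.append(doc)
        used += est(doc)

    # The instruction goes last, closest to the model's generation point,
    # so it cannot be lost under a haystack of documentation.
    return "\n\n".join(["Background:"] + kept + ["Task:", task])
```

The point is not the arithmetic; it's that what goes in, in what order, under what budget, is a deliberate decision rather than a concatenation accident.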

Output contracts. If the model can return anything, it will return everything and nothing interchangeably. Define a schema. Enforce the schema. Return a validation error when the schema breaks and either retry with a corrective prompt or fail loudly. Free-form text output that gets parsed downstream is not a feature. It is a time bomb with a randomised fuse.
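What an enforced contract looks like, sketched with only the standard library; the schema, the corrective-retry wording, and the call_model callable are all assumptions for illustration:

```python
import json

# Hypothetical contract: required fields and their expected types.
SCHEMA = {"summary": str, "severity": str, "action_items": list}

def validate(raw: str) -> dict:
    """Parse model output and check it against the contract; raise on breach."""
    data = json.loads(raw)  # raises ValueError if it is not JSON at all
    if not isinstance(data, dict):
        raise ValueError("top-level output is not a JSON object")
    for key, expected_type in SCHEMA.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} missing or not {expected_type.__name__}")
    return data

def call_with_contract(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Retry with a corrective prompt on validation failure, then fail loudly."""
    for _ in range(max_retries + 1):
        try:
            return validate(call_model(prompt))
        except ValueError as err:
            prompt += (f"\n\nYour last reply was invalid ({err}). "
                       "Return only JSON matching the required schema.")
    raise RuntimeError("model output failed validation after retries")
```

Either the output passes validation, or the system retries a bounded number of times and then fails where someone will see it. Nothing parses free-form text downstream.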

Recovery logic. What happens when the agent gets stuck? What happens when the tool call fails? What happens when the model returns output that fails validation twice in a row? If the answer is “it loops until the token budget runs out,” then you have not built a system. You have built an optimistic vibes machine with a billing account attached.
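A bounded loop with a simple circuit breaker, as a sketch; the step() callable and its (ok, result, done) return shape are assumptions, not anyone's real API:

```python
# Sketch of an agent loop that cannot run forever: a hard step cap plus
# a circuit breaker on consecutive failures.

class CircuitOpen(Exception):
    """Raised when consecutive failures exceed the breaker threshold."""

def run_agent(step, max_steps: int = 10, breaker_threshold: int = 3):
    """Run agent steps under a hard limit; trip the breaker on repeated failure."""
    consecutive_failures = 0
    for _ in range(max_steps):              # hard cap: no infinite loops
        ok, result, done = step()
        if not ok:
            consecutive_failures += 1
            if consecutive_failures >= breaker_threshold:
                raise CircuitOpen("too many consecutive tool failures")
            continue                        # bounded retry, not a blind loop
        consecutive_failures = 0
        if done:
            return result
    raise TimeoutError("agent hit step limit without finishing")
```

Every exit path is explicit: success, breaker tripped, or step limit reached. The one outcome that cannot happen is "loops until the token budget runs out."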

// Observation from Consulting

The teams that obsess over prompt engineering are usually the teams that haven’t shipped a production agent yet. The teams that have shipped — and survived the first real incident — stop talking about prompt wording entirely. They talk about observability, retry strategies, and circuit breakers. The vocabulary shift happens fast.

What Prompt Engineering Is Actually For

None of this means prompts don’t matter. They do. But they matter in the same way that a well-written error message matters — they make the difference between a good experience and a frustrating one, but only after the underlying system is solid enough to consistently reach the point where the message is shown.

Good prompt engineering shapes tone, output format, reasoning approach, and edge case handling at the margins. It is the last 5% of a production agent’s reliability, not the first. You reach it after the architecture is solid, the output contracts are enforced, and the observability is in place.

Most teams are spending their time on the last 5% while the other 95% is on fire.

The Uncomfortable Diagnostic

If your AI integration is failing and your first instinct is to rewrite the prompt — sit with that for a moment. Ask: does the model receive complete, correctly ordered context? Is the output schema enforced? Is there a retry limit? Is there a circuit breaker? Is there logging at every decision point?

If the answer to any of those is “no” or “kind of” or “we haven’t looked at that yet” — then you know where to spend the next two weeks.

It is not on the prompt. The prompt is fine. The prompt was never the problem.

Build the car. Stop tuning the engine that isn’t in a car yet.