June 2026 · Applied AI · 8 min

Prompt architecture for real workflows, not toy prompts

A production prompt is a small system: context assembly, task decomposition, output contracts, evals, and guardrails. How to build prompts that survive messy real inputs instead of demo inputs.

The prompt that wins the demo is almost never the prompt that survives Tuesday. You paste in a clean example, the model returns something beautiful, everyone nods, and then you wire it into a real workflow where the inputs are half-empty PDFs, a finance person's notes in two languages, and a field someone renamed in 2021. The clever one-liner falls apart on contact with reality.

I learned this the hard way building LLM steps into the Nordbrief grant pipeline, and again in the recurring ops workflows I ran at Nuorten Kotkat. The lesson is the same every time: a production prompt is not a sentence. It is a small system. And like any system, it has parts that each do one job, and it fails in predictable places if you skip one.

Here is the mental model I use now.

A production prompt has five parts

Think of a real prompt the way you'd think of a small service. Five components, each with a clear responsibility:

Context assembly. What the model gets to see, and in what shape.
Task decomposition. Breaking the job into steps the model can actually do.
Output contract. The exact structure the answer must take.
Evals. How you know it's working before you trust it.
Guardrails. What happens when the input is garbage or the output is wrong.

A toy prompt collapses all five into one hopeful sentence: "Summarise this grant application and extract the budget." That works on the demo PDF. It does not work on the forty real ones, where three have no budget table, one has the budget as a scanned image, and two put the budget in the appendix under a heading the applicant invented.

Let me walk through each part with the grant pipeline as the running example.

Context assembly: the model can only reason about what you put in front of it

Most prompt failures are actually context failures. The model didn't get the thing it needed, or it got it buried under three pages of boilerplate.

In Nordbrief, the job is to help an NGO turn a messy project description into a fundable application for a specific funder. The naive version stuffs everything into one prompt: the funder's guidelines, the org's old applications, the new project notes, and a request to "write the application." The model drowns. It weights the boilerplate equally with the one paragraph that matters, and the output reads like every other grant: safe verbs, generic outcomes, zero edge.

Context assembly means deciding, deliberately, what goes in and how it's labelled. For each LLM step I assemble:

The funder's actual evaluation criteria, pulled out as a short list, not the full PDF
Two or three relevant past paragraphs, retrieved by topic, not the whole archive
The new project facts, structured as fields (problem, who, where, budget line items)
A short note on what's missing, so the model doesn't invent it

That last one matters more than it sounds. If you don't tell the model what you don't have, it will fill the gap with plausible fiction. Telling it "budget is not yet available" up front is the difference between an honest draft and a confident lie about a number that doesn't exist.
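A minimal sketch of that assembly step in Python. The class, field names, and section labels are all made up for illustration, not the actual Nordbrief code. The point it shows is that missing facts become their own labelled section instead of silently disappearing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepContext:
    """Inputs for one LLM step. All names here are illustrative assumptions."""
    criteria: list[str]                       # funder's criteria, as a short list
    past_paragraphs: list[str]                # two or three retrieved paragraphs
    project_facts: dict[str, Optional[str]]   # None or "" means "not available yet"


def render_context(ctx: StepContext) -> str:
    """Render labelled sections, plus an explicit list of what is missing."""
    known = {k: v for k, v in ctx.project_facts.items() if v}
    missing = [k for k, v in ctx.project_facts.items() if not v]

    sections = [
        "## Funder criteria\n" + "\n".join(f"- {c}" for c in ctx.criteria),
        "## Relevant past paragraphs\n" + "\n\n".join(ctx.past_paragraphs),
        "## Project facts\n" + "\n".join(f"{k}: {v}" for k, v in known.items()),
    ]
    if missing:
        # Tell the model what we do not have, so it does not fill the gap.
        sections.append(
            "## Not yet available (do not invent)\n"
            + "\n".join(f"- {k}" for k in missing)
        )
    return "\n\n".join(sections)
```

Calling `render_context` with `project_facts={"problem": "...", "budget": None}` puts `budget` under the "Not yet available" heading rather than rendering it as an empty or `None` value.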

Task decomposition: ask for one thing the model can actually do

The single biggest upgrade to my prompts was to stop asking for the whole deliverable in one shot.

"Write the grant application" is four jobs wearing a trenchcoat: understand the funder, match the project to their criteria, structure the narrative, and produce compliant prose. Bundled together, the model does all four badly. Split apart, it does each one well.

So the Nordbrief pipeline runs as steps. One step maps the project to the funder's criteria and flags weak matches. A separate step drafts the theory of change. Another generates the budget narrative from the line items. Another produces the compliance annexes. Each step is a focused prompt with its own context and its own output contract.

The bonus: when something goes wrong, you know exactly which step did it. A one-shot mega-prompt that produces a bad application gives you nothing to debug. A pipeline that produces a bad budget narrative tells you precisely where to look.

There's a real tradeoff here. More steps means more model calls, more latency, more cost, and more places to maintain. I don't decompose for its own sake. The rule I use: split when the sub-tasks need different context, or when one sub-task failing shouldn't poison the others. If two jobs always share the same input and always succeed or fail together, keep them in one prompt.
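Sketched in Python, the shape looks roughly like this. `Step`, `StepResult`, and `run_pipeline` are hypothetical names, and `model` stands in for whatever client you actually call. Each step builds its own prompt from the shared facts, and a failure in one step is recorded against that step's name instead of stopping the run.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A model call reduced to "prompt in, text out". Swap in a real client here.
ModelFn = Callable[[str], str]


@dataclass
class Step:
    name: str
    build_prompt: Callable[[dict], str]   # each step picks its own context


@dataclass
class StepResult:
    name: str
    output: Optional[str] = None
    error: Optional[str] = None


def run_pipeline(steps: list[Step], facts: dict, model: ModelFn) -> list[StepResult]:
    """Run every step; record failures per step instead of aborting."""
    results = []
    for step in steps:
        try:
            prompt = step.build_prompt(facts)
            results.append(StepResult(step.name, output=model(prompt)))
        except Exception as exc:  # one failing step must not poison the others
            results.append(StepResult(step.name, error=f"{type(exc).__name__}: {exc}"))
    return results


# Illustrative steps only; real prompts would be longer than one line.
steps = [
    Step("criteria_match", lambda f: f"Map this project to the criteria: {f['criteria']}"),
    Step("budget_narrative", lambda f: f"Explain these line items: {f['line_items']}"),
]
```

If `facts` has no `line_items`, the `budget_narrative` result carries the error and `criteria_match` still completes, which is the debugging property described above.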

Output contracts: stop parsing prose

This is the part people skip and then regret. If a downstream system consumes the model's output, the output needs a contract, not a paragraph.

Early on, one of my ops workflows asked the model to "categorise this incoming request and suggest next steps." It returned lovely prose. Then I had to write a parser to figure out which category it picked, and the parser broke every time the model phrased it slightly differently. I was writing regex against vibes.

The fix is to specify the exact shape and make the model fill it in:

{
  "category": "funding | partnership | media | volunteer | other",
  "urgency": "high | normal | low",
  "missing_info": ["string"],
  "suggested_reply_language": "fi | en",
  "confidence": "high | medium | low"
}

Constrained fields, an enum where there's a fixed set of answers, and a place for the model to say what it couldn't determine. Now the downstream code reads a field instead of interpreting a mood. And confidence gives me a cheap routing signal: anything below high goes to a human.

A good contract also disciplines the model's thinking. When you force it to commit to one category from a fixed list, it stops hedging. The structure is half the prompt.
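A contract only helps if the code enforces it. Here is a small validator for the shape above, using only the Python standard library. The function names are my own for this sketch, and a schema library would do the same job with less code.

```python
import json

# Allowed values, copied from the contract above.
ALLOWED = {
    "category": {"funding", "partnership", "media", "volunteer", "other"},
    "urgency": {"high", "normal", "low"},
    "suggested_reply_language": {"fi", "en"},
    "confidence": {"high", "medium", "low"},
}


def parse_contract(raw: str) -> dict:
    """Parse model output; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    for key, allowed in ALLOWED.items():
        value = data.get(key)
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{key!r} must be one of {sorted(allowed)}, got {value!r}")

    missing = data.get("missing_info")
    if not isinstance(missing, list) or not all(isinstance(m, str) for m in missing):
        raise ValueError("'missing_info' must be a list of strings")
    return data


def needs_human(data: dict) -> bool:
    # Routing rule from the text: anything below "high" goes to a human.
    return data["confidence"] != "high"
```

A `ValueError` here is a useful signal in its own right: you can retry the call, or send the raw output to a human, without the downstream code ever seeing a malformed record.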

Evals: you cannot improve what you only spot-check

Here's the uncomfortable bit. Most people "test" a prompt by trying it a few times, liking the result, and shipping. Then they tweak the wording later, eyeball one example, decide it's better, and ship again. They have no idea whether the change helped or quietly broke the other thirty cases.

You don't need a fancy eval framework to fix this. You need a folder of real, messy examples and a way to check the output against what you actually wanted.

For the request categoriser, my eval set was twenty real past requests, hand-labelled with the correct category and urgency. Every time I changed the prompt, I ran all twenty and counted how many it got right. Boring. Decisive. The first time I did this, I discovered that a wording change I was sure improved things had dropped accuracy on bilingual inputs, because I'd over-indexed the prompt on English phrasing. Without the eval set, I'd have shipped that and wondered later why Finnish requests kept getting misrouted.

Build the eval set from the inputs that scare you, not the ones that flatter you. The half-empty PDF. The request that's polite small talk for three paragraphs before the actual ask. The one in Finnish with an English subject line. If your prompt holds on those, the easy cases take care of themselves.
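The harness for this can be very small. A sketch, assuming each case is a dict with `id`, `text`, the hand-labelled `category` and `urgency`, and optional `tags` such as "bilingual" (all of these names are assumptions), and that `classify` wraps your prompt and returns the parsed output:

```python
from collections import defaultdict
from typing import Callable


def run_evals(cases: list[dict], classify: Callable[[str], dict]) -> dict:
    """Score classify() against hand-labelled cases, overall and per tag."""
    per_tag = defaultdict(lambda: [0, 0])   # tag -> [correct, total]
    failures = []
    for case in cases:
        got = classify(case["text"])
        ok = (got.get("category") == case["category"]
              and got.get("urgency") == case["urgency"])
        # Count every case under "all" and under each of its own tags.
        for tag in ["all", *case.get("tags", [])]:
            per_tag[tag][0] += ok
            per_tag[tag][1] += 1
        if not ok:
            failures.append(case["id"])
    return {
        "accuracy": {tag: c / n for tag, (c, n) in per_tag.items()},
        "failures": failures,
    }
```

The per-tag breakdown is what catches the kind of regression described above: overall accuracy can hold steady while the "bilingual" slice drops. Keeping the list of failing ids lets you read the actual cases instead of staring at a percentage.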

Guardrails: assume the input is hostile and the output is wrong

The last part is what you do when things go sideways, because they will.

Two failure modes dominate. The first is garbage input: an empty document, a scan with no extractable text, a field the model can't find. The guardrail is to make "I can't do this" a valid, structured output. My grant steps can return status: "insufficient_input" with a list of what's missing, instead of hallucinating a budget. A model that's allowed to refuse is far safer than one forced to always produce.
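Part of this guardrail can live in plain code before the model is ever called. A sketch, with a hypothetical `budget_step` function, a made-up `MIN_CHARS` threshold, and `draft` standing in for the actual model call:

```python
from typing import Callable

MIN_CHARS = 200  # arbitrary threshold for this sketch; tune it on real documents


def budget_step(extracted_text: str, line_items: list[dict],
                draft: Callable[[str, list[dict]], str]) -> dict:
    """Return a structured refusal instead of calling the model on garbage."""
    missing = []
    if len(extracted_text.strip()) < MIN_CHARS:
        missing.append("extractable document text")
    if not line_items:
        missing.append("budget line items")

    if missing:
        # Same shape the model itself is allowed to return when it cannot proceed.
        return {"status": "insufficient_input", "missing": missing}
    return {"status": "ok", "draft": draft(extracted_text, line_items)}
```

The pre-check only catches the obvious cases, such as an empty scan. The model still needs the same `insufficient_input` option in its own output contract for the cases code can't see, like a budget table that exists but is unreadable.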

The second is confident-but-wrong output. LLM text is fluent, which is exactly the problem. It will invent a partnership history, smooth over a real risk with optimistic language, or assert a metric that doesn't exist anywhere in the source. The guardrail is a verification pass: a separate, cheaper check that asks "is every factual claim in this draft grounded in the provided source material?" Anything ungrounded gets flagged for a human, never auto-shipped.
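The verification pass described here is itself a model call, so I won't pretend to show it in ten lines. What can be shown is a much narrower, deterministic cousin that is cheap to run first: flag any number in the draft that never appears in the source. This is a swapped-in illustration, not the full grounding check, and it compares numbers as written, so "12,500" and "12500" count as different.

```python
import re

# Digit runs, optionally joined by "." or "," (e.g. 40, 12,500, 3.5).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(draft: str, source: str) -> list[str]:
    """Numbers that appear in the draft but nowhere in the source, in order."""
    source_numbers = set(NUMBER.findall(source))
    seen = set()
    flagged = []
    for number in NUMBER.findall(draft):
        if number not in source_numbers and number not in seen:
            seen.add(number)
            flagged.append(number)
    return flagged


source = "Budget: 12,500 EUR for 40 participants in 2025."
draft = "We will reach 40 participants and 300 families with 12,500 EUR."
print(ungrounded_numbers(draft, source))  # ['300']
```

It says nothing about invented partnerships or softened risks, which is why the model-based pass is still needed. But an invented metric is often a number, and this catches those for free.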

The cheapest guardrail is the human checkpoint at the right place. Not reviewing everything, which doesn't scale. Reviewing the things the system itself marked as low-confidence or ungrounded.
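As a routing rule, that checkpoint is a handful of lines. The field names (`status`, `ungrounded_claims`, `confidence`) and the queue names are assumptions for this sketch; use whatever your own output contracts define.

```python
def route(result: dict) -> str:
    """Decide where one step's output goes. Returns a queue name."""
    if result.get("status") == "insufficient_input":
        return "ask_for_missing_info"
    if result.get("ungrounded_claims"):
        return "human_review"
    if result.get("confidence") != "high":
        # A missing confidence value is treated as not high, on purpose.
        return "human_review"
    return "auto"
```

The default is deliberately pessimistic: an output only reaches `auto` if it positively claims high confidence and nothing flagged it along the way.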

That's the thread running through all of this, and it's the same belief I bring to everything I build: scale runs on systems, not goodwill. You don't get reliable AI output by hoping the model behaves. You get it by designing the points where it's allowed to fail safely.

Where this leaves you

The shift is small to describe and large in practice. Stop writing prompts as clever sentences. Start writing them as small systems with five jobs: assemble the right context, decompose the task, contract the output, evaluate against real cases, and guard the edges.

It's more work up front than a one-liner, and that's the honest tradeoff. But the one-liner was never cheaper. It just moved the cost downstream, to the Tuesday when the messy inputs arrived and nobody knew which part had broken. Build the system, and Tuesday gets boring. Boring, in production, is the whole point.

Want to discuss this? Write directly.

jami@impactnode.fi