April 2026 · Operations · 9 min · Published

Operations as code: turning recurring chaos into runbooks

Treat recurring operational chaos (events, reporting, onboarding) like code: write it down, version it, automate where it pays. The runbook is the unit of work.

There is a particular kind of stress that has nothing to do with difficulty. The work is not hard. You have done it twenty times. It is the annual report, the spring event, the new hire's first week. You know how it goes. And yet every single time it arrives, it arrives as a small emergency: a scramble of half-remembered steps, a frantic search through last year's emails for the template, a colleague asking you a question you have answered four times before and will answer again next year.

That stress is a tax, and it is entirely self-imposed. You are paying, every cycle, for the privilege of not having written anything down.

I spent years in roles built almost entirely out of recurring chaos. At Nuoret Kotkat, the youth organisation, I ran more than twenty events a year for hundreds of young people, plus a multi-year grant reporting pipeline with a funder who did not accept "we forgot a step" as an answer. At IKEA I sat in resolution, where the whole job is high-volume repeated judgement: the same categories of escalated customer claim, over and over, all day. In both places I learned the same thing the same painful way. The chaos was never in the work. It was in the fact that the work lived in people's heads instead of on a page.

So here is the frame that changed how I operate: treat operations like code.

What "operations as code" actually means

Developers figured this out a long time ago. You do not solve the same problem twice from memory. You write it down as code, put it in version control, and automate the parts that are mechanical. The cleverness gets captured once and then it just runs. Nobody re-derives how to deploy the app every Friday from memory.

Operations is full of problems you solve over and over, and we treat almost none of them this way. We treat each recurrence as if it were the first time, powered by the heroics of whoever happens to remember. That is the bug.

Operations as code means three moves, borrowed directly from how good engineering teams work:

Write it down so the knowledge lives in a file, not a skull.
Version it so you can see what changed and why, and improve it deliberately instead of by accident.
Automate where it pays so the boring, mechanical parts stop costing human attention.

The unit of all this is the runbook. Not a policy document, not a fat wiki nobody reads. A runbook is the operational equivalent of a function: a named, repeatable procedure that takes some inputs and reliably produces an output, written so that a competent person who is not you can run it and get the same result.

How to find the runbooks hiding in your week

You do not start by writing a hundred runbooks. That is documentation theatre, and it rots faster than you can produce it. You start by finding the few procedures that are actually costing you.

The test I use is simple and a little blunt: frequency times pain times bus-factor.

Frequency. How often does this happen? A thing you do weekly is a far better candidate than a thing you do once.
Pain. How much does each occurrence hurt? Stress, rework, errors, the dread you feel when you see it on the calendar.
Bus-factor. How many people can do it? The fewer there are, the higher this scores. If the answer is one, and that one is you, the risk is concentrated in a single point of failure with a pulse.

Multiply those, roughly, in your head. The procedures that score high are your first runbooks. Everything else can wait, possibly forever.
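If doing this in your head feels too loose, the same test fits in a few lines of Python. This is only a sketch: the 1-to-5 scales, the `Procedure` class, and the way the number of capable people is turned into a risk score are my assumptions for illustration, and the example scores are made up rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    name: str
    frequency: int             # assumed scale: 1 (rare) to 5 (weekly or more)
    pain: int                  # assumed scale: 1 (mild) to 5 (dread)
    people_who_can_do_it: int  # head count, including you

    def bus_factor_risk(self) -> int:
        # Fewer capable people means more risk: 1 person -> 5, 5 or more -> 1.
        return min(5, max(1, 6 - self.people_who_can_do_it))

    def score(self) -> int:
        # frequency times pain times bus-factor
        return self.frequency * self.pain * self.bus_factor_risk()


def rank(procedures):
    """Return procedures sorted from highest score to lowest."""
    return sorted(procedures, key=lambda p: p.score(), reverse=True)


# Illustrative numbers only.
candidates = [
    Procedure("board anniversary dinner", frequency=1, pain=3, people_who_can_do_it=2),
    Procedure("grant reporting", frequency=3, pain=5, people_who_can_do_it=1),
    Procedure("event cycle", frequency=5, pain=4, people_who_can_do_it=1),
]

for p in rank(candidates):
    print(f"{p.score():>4}  {p.name}")
```

The exact numbers matter much less than the ordering. If the top two or three are obvious, stop scoring and start writing.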

At Nuoret Kotkat the highest-scoring thing was obvious once I looked: the event cycle. Twenty-plus times a year (high frequency), the same recurring panic around logistics, registration, and safety paperwork (high pain), far too much of it depending on me specifically (terrible bus-factor). The grant reporting scored almost as high, with the twist that getting it wrong had real consequences with the funder. Those two got written down first. The one-off "organise the board's anniversary dinner" did not, because it would never happen the same way twice and a runbook would have been a museum piece.

How to write a runbook that someone will actually use

Most internal documentation fails because it is written as prose, by someone proving they understand the process, for an imaginary reader who already knows it. A runbook is the opposite. It is written as a checklist, by someone trying to make themselves unnecessary, for a stressed colleague doing this at 4pm with a deadline.

A runbook that works has five parts.

1. A trigger

When does this run? "Every quarter, two weeks before the grant reporting deadline." A procedure with no trigger is a document. A procedure with a trigger is a habit waiting to happen.

2. Inputs

What do you need in hand before you start? The registration list, last quarter's numbers, the venue contact, the budget line. Listing inputs up front kills the most common failure: getting three steps in and discovering you are missing something that takes two days to obtain.

3. Numbered steps, in the real order

The actual sequence, boring and explicit. Not "handle registrations" but "export the registration list, check for under-18s without a guardian signature, flag those to the local coordinator." Each step should be small enough that there is no ambiguity about whether you have done it.

4. Verification

This is the step everyone skips and the one that earns its keep. How do you know it worked? At IKEA, resolution decisions could go wrong in quiet, expensive ways: a refund authorised against the wrong policy, a claim closed without the customer actually being made whole. The verification step is where you write down the checks that catch those: does the amount match the policy, has the customer confirmed, is the case actually closed and not just marked closed. A runbook without verification is just a faster way to ship the same mistakes with more confidence.

5. An owner

A name, not a department. The person responsible for keeping this runbook true when the template changes, the team changes, or the tool changes. Without an owner, every runbook is accurate the day it is written and slowly becomes a trap, because people trust it after it has stopped being right.

Here is what one looks like, stripped down:

# event-runbook.md
# Trigger: 3 weeks before any local event
# Owner: operations coordinator
#
# Inputs:
#   - venue confirmed + contact
#   - registration form live
#   - safety + guardian-consent template
#
# Steps:
#   1. Open registrations, set cap, set close date.
#   2. T-7: export list. Check every under-18 has consent.
#      -> missing consent is a STOP. Flag to local coordinator.
#   3. T-3: confirm catering headcount against list.
#   4. T-1: print sign-in sheet + emergency contacts.
#
# Verify before "done":
#   - every attendee has consent on file (no exceptions)
#   - emergency contact sheet physically at the venue
#   - someone other than you knows where the first-aid kit is

The comment that the missing-consent line is a hard STOP matters more than the prettiness of the format. That is the scar tissue from a real near-miss, and it tells the next person which steps are negotiable and which are not.
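If you keep runbooks somewhere a script can read them, the five parts are easy to check mechanically. The sketch below assumes a hypothetical `Runbook` structure and a `missing_parts` helper of my own invention; it is one way to catch a runbook with no owner or no verification before anyone relies on it, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    name: str
    trigger: str = ""
    inputs: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    verify: list[str] = field(default_factory=list)
    owner: str = ""


def missing_parts(rb: Runbook) -> list[str]:
    """Return the names of the five parts that are still empty."""
    parts = {
        "trigger": rb.trigger,
        "inputs": rb.inputs,
        "steps": rb.steps,
        "verify": rb.verify,
        "owner": rb.owner,
    }
    return [name for name, value in parts.items() if not value]


event = Runbook(
    name="event-runbook",
    trigger="3 weeks before any local event",
    inputs=["venue confirmed + contact", "registration form live",
            "safety + guardian-consent template"],
    steps=["Open registrations, set cap, set close date.",
           "T-7: export list. Check every under-18 has consent.",
           "T-3: confirm catering headcount against list.",
           "T-1: print sign-in sheet + emergency contacts."],
    verify=["every attendee has consent on file",
            "emergency contact sheet physically at the venue",
            "someone other than you knows where the first-aid kit is"],
    owner="operations coordinator",
)

# A half-written draft: it has a trigger and one step, nothing else yet.
draft = Runbook(
    name="grant-report",
    trigger="every quarter, two weeks before the grant reporting deadline",
    steps=["pull last quarter's numbers"],
)

print(missing_parts(event))
print(missing_parts(draft))
```

The check only proves that each part exists. Whether the steps are true is still the owner's job.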

Automate where it pays, not everywhere

Once a procedure is written down as explicit steps, the mechanical ones become obvious, and some are begging to be automated. The order matters: write the runbook first, then automate from it. People who try to automate a process they have not yet written down end up automating their own confusion.

But automation is not free, and the seductive failure here is automating things that should have stayed manual. My rule: automate the step if it is mechanical, high-frequency, and low-judgement. Leave it alone if it needs a human to decide something.

In the event runbook, generating the sign-in sheet from the registration list is pure mechanics. Automate it. Deciding whether to chase a parent about a missing consent form is judgement and a relationship. Leave it human. In grant reporting, pulling the raw numbers into the right structure is mechanical, and in everything I build now it is AI-assisted. Deciding what story those numbers tell the funder is the actual job, and no script should touch it.
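The rule is simple enough to write as a predicate. This is a sketch with hypothetical names (`Step`, `should_automate`), and the true/false flags on each example step are my own reading of the four examples above, not data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    mechanical: bool       # same inputs always give the same output
    high_frequency: bool   # happens often enough to repay the effort
    needs_judgement: bool  # a human has to decide something


def should_automate(step: Step) -> bool:
    """Automate only if mechanical, high-frequency, and low-judgement."""
    return step.mechanical and step.high_frequency and not step.needs_judgement


steps = [
    Step("generate sign-in sheet from registration list", True, True, False),
    Step("chase a parent about a missing consent form", False, True, True),
    Step("pull raw grant numbers into the report structure", True, True, False),
    Step("decide what story the numbers tell the funder", False, True, True),
]

for s in steps:
    verdict = "automate" if should_automate(s) else "keep human"
    print(f"{verdict:<10}  {s.description}")
```

All three conditions have to hold. A step that is frequent and mechanical but still needs a decision stays with a person.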

The honest version of "automate where it pays" includes the cost of the automation breaking. A script that fails silently is worse than a manual step a human would have noticed going wrong. So I only automate where a failure is loud, and I keep the human verification step regardless. The automation drafts; the person still signs off.
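Here is roughly what "loud" can look like for the sign-in sheet step. It is a minimal sketch, assuming a registration export with hypothetical `name`, `age`, and `consent` columns and a made-up `RunbookStop` exception; the attendee names are invented. The point is that the script refuses to produce anything when the data looks wrong, instead of printing a sheet that hides the problem.

```python
import csv
import io


class RunbookStop(Exception):
    """Raised when the automation must stop and hand back to a human."""


def build_sign_in_sheet(registrations_csv: str) -> list[str]:
    """Draft sign-in sheet lines, or stop loudly if the export looks wrong."""
    rows = list(csv.DictReader(io.StringIO(registrations_csv)))
    if not rows:
        raise RunbookStop("registration export is empty; not printing a blank sheet")

    # Missing consent for an under-18 is a hard STOP, same as in the runbook.
    missing = [r["name"] for r in rows
               if int(r["age"]) < 18 and r["consent"].strip().lower() != "yes"]
    if missing:
        raise RunbookStop("missing guardian consent: " + ", ".join(missing))

    ordered = sorted(rows, key=lambda r: r["name"])
    return [f"{r['name']:<20} signature: ________" for r in ordered]


# Invented sample data for illustration.
good_export = "name,age,consent\nVille,15,yes\nAino,14,yes\n"
bad_export = "name,age,consent\nVille,15,yes\nAino,14,no\n"

for line in build_sign_in_sheet(good_export):
    print(line)

try:
    build_sign_in_sheet(bad_export)
except RunbookStop as stop:
    print("STOP:", stop)
```

The output is still a draft. Someone reads the sheet against the list before it goes to the venue.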

The failure modes, named

I have hit all of these, so I will name them plainly.

Runbook rot. A runbook that quietly stops matching reality is more dangerous than none, because people follow it off a cliff. This is the entire reason the owner exists. A document with no owner is a future incident with a delay timer.

Process for its own sake. The over-eager version documents everything and turns a useful tool into bureaucracy people route around. The test is always frequency times pain times bus-factor. If a procedure does not score, it does not get a runbook, full stop.

Writing for the wrong reader. Documentation written to show off the author's mastery instead of to enable a stranger's success. The reader is not your boss. It is a tired colleague who has never done this before and needs to not screw it up. Write for them.

Automating the unwritten. Reaching for a script before the steps are clear. You cannot automate a process you cannot describe. The runbook is the spec; the automation is the implementation. Spec first, always.

The compounding part

The thing nobody tells you about treating operations like code is that the value compounds, quietly, the same way a good codebase does.

The first time you write the event runbook, it costs you an afternoon and saves you almost nothing, because you were going to run that event from memory anyway. The second time, it saves you the scramble. By the fifth event, a new volunteer runs most of it without you in the room, and you have your attention back for the things that genuinely need a human brain. By the time you leave, you hand over a folder of runbooks instead of a crisis. That last part is not a nice-to-have. It is the whole point.

This is the same belief that sits underneath everything I build at Impact Node: scale runs on systems, not goodwill. Goodwill is the colleague who always remembers the steps and saves the day. It is wonderful, and it is a liability, because it does not scale and it does not show up the week they are sick. A runbook is what is left when the heroics are gone. It is goodwill, written down, so it does not have to be re-earned every single time.

So the next time a recurring task lands on your desk as a small emergency, do not just power through it on memory and adrenaline. Notice that you are paying the tax again. Write the function. Run it next time. The chaos was never in the work.

Want to discuss this? Write directly.

jami@impactnode.fi