June 2026 · Fintech · AI · 8 min · Published

What “agentic bookkeeping” actually requires

Most “AI does your books” demos are theater. Real agentic bookkeeping needs deterministic primitives, confidence thresholds, human-in-the-loop on irreversible steps, and full auditability.

Every fintech demo right now shows the same magic trick. Upload a stack of receipts, the agent reads them, posts the entries, reconciles the bank feed, and a clean P&L appears. The room nods. Someone says the word "autonomous." Nobody asks what happens when the model decides a 50,000 euro supplier invoice is a 5,000 euro one, posts it confidently, and the VAT return goes out wrong three weeks later.

I build accounting software. Vantnod is AI-first by design, and I still think most "AI does your books" claims are demo-ware. Not because the models are bad. They are genuinely good at the part of bookkeeping that is reading and guessing. The problem is that bookkeeping is mostly not reading and guessing. It is arithmetic that has to be exactly right, rules that cannot be approximately followed, and a trail that an auditor or a tax authority can walk back through years later. A language model is the wrong tool for all three of those, and pretending otherwise is how you ship something that looks great on stage and quietly corrupts a client's ledger in production.

So here is what actually has to be true underneath, if you want an agent you can trust with real money.

The model reads. It does not compute.

The first mental model I hold onto: the LLM is an interpreter sitting on top of a deterministic ledger, never the ledger itself. It translates messy human reality into structured proposals. It never holds the truth.

Think of it as two layers that must never blur:

The deterministic core. Transactions, accounts, balances, VAT calculations, double-entry rules. This is plain code. It is testable. Given the same inputs it produces the same outputs, every time, forever. A debit equals a credit because an assert says so, not because a model felt confident.
The interpretive layer. The LLM. It looks at a PDF invoice and proposes: "this is a 1,240 euro hosting expense from Hetzner, account 7560, 24% VAT, dated 3 June." That is a proposal, structured and typed, handed down to the core for validation.

The core never asks the model to add anything up. The moment a VAT total comes out of a language model instead of out of amount * rate, you have built a calculator that hallucinates, and you deserve what happens next. Model output is not guaranteed to be repeatable. Ask twice and you can get two answers. That is fine for "what is this document about" and disqualifying for "what is the balance."

In Vantnod the boundary is a hard one. The model fills fields. The engine does math, enforces double-entry, applies the jurisdiction's VAT rules through a deterministic adapter, and rejects anything that does not balance. If the model proposes an entry where debits and credits do not match, the entry does not get a second chance from a smarter prompt. It gets rejected by code.
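
A minimal sketch of that boundary, not Vantnod's actual code. The model supplies only fields, and the engine derives VAT and enforces balance. The names Proposal, Line, build_entry, VAT_RECEIVABLE, and ACCOUNTS_PAYABLE are hypothetical, and half-up rounding to cents is an assumption, since rounding rules vary by jurisdiction.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class Proposal:
    """Fields the model fills in. It supplies no totals."""
    vendor: str
    expense_account: str
    net_amount: Decimal
    vat_rate: Decimal  # e.g. Decimal("0.24") for 24%


@dataclass(frozen=True)
class Line:
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


class UnbalancedEntry(ValueError):
    """Raised when debits and credits differ."""


def compute_vat(net: Decimal, rate: Decimal) -> Decimal:
    # VAT is amount * rate, computed in code with exact decimal arithmetic.
    # Rounding mode is an assumption for this sketch.
    return (net * rate).quantize(CENT, rounding=ROUND_HALF_UP)


def assert_balanced(lines: list[Line]) -> None:
    debits = sum((line.debit for line in lines), Decimal("0"))
    credits = sum((line.credit for line in lines), Decimal("0"))
    if debits != credits:
        raise UnbalancedEntry(f"debits {debits} != credits {credits}")


def build_entry(p: Proposal) -> list[Line]:
    vat = compute_vat(p.net_amount, p.vat_rate)
    lines = [
        Line(p.expense_account, debit=p.net_amount),
        Line("VAT_RECEIVABLE", debit=vat),            # hypothetical account name
        Line("ACCOUNTS_PAYABLE", credit=p.net_amount + vat),  # hypothetical
    ]
    assert_balanced(lines)  # rejected by code, not re-prompted
    return lines
```

Calling build_entry(Proposal("Hetzner", "7560", Decimal("1000.00"), Decimal("0.24"))) yields a 240.00 VAT line and a 1240.00 payable. Decimal is used instead of float so that cent amounts stay exact.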

Confidence thresholds, and a real escalation path

A proposal is not a binary "trust it or do not." Every extracted field carries a confidence signal, and that signal decides who touches it next.

The pattern I use is three lanes:

Auto-post. High confidence, low stakes, and the entry passes every deterministic check. A recurring 12 euro SaaS charge that matches last month's vendor, account, and VAT treatment, and reconciles cleanly against the bank line. The agent posts it. No human looks. This is where the time savings live.
Queue for review. The model is unsure, or the amount is material, or something does not match the usual pattern. The entry is drafted, not posted, and lands in a human queue with the model's reasoning attached. "I think this is a software expense but the vendor is new and the amount is 8x your typical." A human clicks yes or fixes it.
Block and escalate. The deterministic checks fail outright, or the action is irreversible, or it touches something legally loaded. Nothing happens automatically. Full stop.

The trap everyone falls into is treating confidence as a single global dial. It is not. Confidence has to be weighted by consequence. A 90% confident guess on a 5 euro parking receipt and a 90% confident guess on a 50,000 euro intercompany transfer are not the same risk, and they should not share a threshold. The right design weighs model confidence against blast radius. Cheap and reversible can clear a low bar. Expensive or irreversible needs near-certainty and a human, no matter how confident the model claims to be.
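
The three lanes and the consequence-weighted threshold could be sketched roughly like this. The thresholds, the amount bands, and the names Draft, Lane, and route are illustrative assumptions, not values from any real system.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class Lane(Enum):
    AUTO_POST = "auto_post"
    REVIEW = "review"
    BLOCK = "block"


@dataclass(frozen=True)
class Draft:
    amount: Decimal
    confidence: float      # lowest per-field confidence, 0.0 to 1.0
    checks_passed: bool    # outcome of the deterministic checks
    irreversible: bool     # e.g. a filing, a payment, a period close
    matches_history: bool  # same vendor, account, and VAT treatment as before


def required_confidence(amount: Decimal) -> Optional[float]:
    """The bar rises with the amount. None means no confidence level
    is enough to auto-post. Bands are hypothetical."""
    if amount <= Decimal("50"):
        return 0.90
    if amount <= Decimal("500"):
        return 0.97
    return None


def route(d: Draft) -> Lane:
    # Failed checks or irreversible actions never proceed automatically.
    if not d.checks_passed or d.irreversible:
        return Lane.BLOCK
    bar = required_confidence(d.amount)
    if bar is not None and d.matches_history and d.confidence >= bar:
        return Lane.AUTO_POST
    # Unsure, material, or unusual: draft it and hand it to a human.
    return Lane.REVIEW
```

With these bands, a 5 euro receipt at 0.90 confidence auto-posts, while a 50,000 euro transfer lands in review even at 0.99, because no confidence level clears the bar for that amount.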

I learned the shape of this long before I wrote any of it in code. At IKEA I had financial authority over escalated customer claims, which meant a chunk of my day was exactly this judgment: this one is routine, approve it and move on; this one is unusual, look closer; this one is outside what I can sign off, send it up. The model is just doing the first triage now. The escalation ladder is the same ladder. You are not removing human judgment. You are making sure it gets spent on the cases that actually deserve it, instead of being burned on 200 identical SaaS receipts.

Human-in-the-loop belongs on the irreversible steps

Here is the line that matters most, and the one demos love to cross: autonomy is fine for the reversible, never for the irreversible.

Posting a draft entry is reversible. You can edit it, delete it, repost it, and the audit trail records all of it. Let the agent run free there. But some steps in accounting are doors that only open one way:

Filing a VAT return with the tax authority
Executing or approving a payment
Closing a financial period
Submitting statutory reports

These are not "high confidence, go ahead" situations. They are "a human with authority clicks the button" situations, every single time, regardless of how clean the underlying data looks. Not because the agent cannot prepare them. It can, and it should: assemble the entire VAT return, flag the three entries it was unsure about, show its work. But the commit is human. The cost of being wrong is not a bug you patch in the next release. It is a regulator, a penalty, and a client who no longer trusts you.

The useful framing is to ask, for any agent action: if this is wrong, can I quietly undo it before it leaves the system? If yes, the agent can own it. If no, a human owns the final click. Drawing that one line removes most of the catastrophic failure modes while keeping almost all of the speed, because the irreversible steps are a tiny fraction of the daily work.
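
One way to make that line hard to cross is to put it in the commit path itself. This is a sketch under assumptions: the action names, the Approval type, and the commit function are all hypothetical, and a real system would also verify the approver's authority.

```python
from dataclasses import dataclass
from typing import Optional

# The one-way doors listed above, as hypothetical action identifiers.
IRREVERSIBLE = frozenset({
    "file_vat_return",
    "execute_payment",
    "close_period",
    "submit_statutory_report",
})


class HumanApprovalRequired(PermissionError):
    """Raised when an irreversible action lacks a matching human sign-off."""


@dataclass(frozen=True)
class Approval:
    approver: str  # a named human with authority
    action: str    # the approval is valid only for this action


def commit(action: str, actor: str, approval: Optional[Approval] = None) -> str:
    if action in IRREVERSIBLE:
        # Model confidence is not consulted here at all.
        if approval is None or approval.action != action:
            raise HumanApprovalRequired(f"{action} needs a human sign-off")
        return f"{action} committed by {approval.approver}"
    # Reversible work: the agent may own it.
    return f"{action} committed by {actor}"
```

The agent can call commit("post_draft_entry", "agent") freely, but commit("file_vat_return", "agent") raises unless an Approval for that exact action is passed in.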

If you can't audit it, you didn't build bookkeeping

The part that gets skipped in every prototype: an agent that posts entries you cannot reconstruct is not bookkeeping software. It is a black box pointed at your finances.

Every action the agent takes has to leave a trail you can replay cold, months later, when an auditor asks "why is this here?":

The source document it read, stored and linked
The proposal it made, with the confidence on each field
The deterministic checks that passed or failed
Who or what committed it: auto-posted by the agent, or approved by a named human, and when
The model version and prompt behind the decision, because "the AI did it" is not an answer a tax authority accepts

This is non-negotiable in a way that goes beyond good engineering. In most EU jurisdictions you are legally required to keep records that let someone trace a number in the financial statements back to the original transaction. An agent that cannot explain itself does not just fail a code review. It fails compliance. So the audit log is not a feature bolted on at the end. It is the spine. In Vantnod every posted entry knows its own provenance, and "the model was confident" is recorded as exactly what it is: an input to a human decision, not a substitute for one.
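
As a rough illustration of what each entry's provenance could carry, here is a sketch of an audit record covering the items listed above. The field names and the make_record helper are hypothetical. Hashing the source document and the prompt is an assumption about how to make the links checkable, not a description of Vantnod's storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    source_document_sha256: str  # links the entry to the stored document
    proposal: dict               # field -> {"value": ..., "confidence": ...}
    checks: dict                 # check name -> passed (bool)
    committed_by: str            # "agent" or a named human
    committed_at: str            # ISO 8601 timestamp, UTC
    model_version: str
    prompt_sha256: str

    def to_json(self) -> str:
        # Stable key order so the same record always serializes the same way.
        return json.dumps(asdict(self), sort_keys=True)


def make_record(source: bytes, proposal: dict, checks: dict,
                committed_by: str, model_version: str, prompt: str,
                now: Optional[datetime] = None) -> AuditRecord:
    when = now or datetime.now(timezone.utc)
    return AuditRecord(
        source_document_sha256=hashlib.sha256(source).hexdigest(),
        proposal=proposal,
        checks=checks,
        committed_by=committed_by,
        committed_at=when.isoformat(),
        model_version=model_version,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    )
```

Each posted entry would get one such record, written as an append-only log line via to_json(), so that "why is this here?" can be answered from stored data rather than from memory.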

Where the model genuinely earns its keep

I do not want this to read as "AI cannot do accounting." It can do a lot, and the lift is real. The model is excellent at the parts that are genuinely hard for humans and genuinely tedious:

Reading unstructured documents. Pulling vendor, amount, date, and VAT off a photographed receipt in a language nobody on the team speaks.
Pattern matching against history. "You always book Hetzner to 7560, so I will propose that." This is where most of the auto-post volume comes from.
Drafting the boring narrative. Period summaries, variance explanations, the prose around the numbers.
Catching anomalies. "This vendor usually invoices 1,200 and this one is 12,000. Look."

What it must never be trusted to do is the arithmetic, the rule application, and the irreversible commit. Those stay deterministic, stay validated, stay human-gated. Get that division right and you have something genuinely useful: a tool that eats the data entry and hands the judgment back to a person, instead of a tool that does everything fast and some of it wrong.

The honest version of "AI does your books" is less impressive on a demo stage and far more valuable in practice. The agent reads, proposes, and drafts at machine speed. The deterministic core keeps it honest. The human owns the doors that only open one way. Scale runs on systems, not goodwill, and an accounting agent is just that principle pointed at the one place where being approximately right is the same as being wrong.

Want to discuss this? Write directly.

jami@impactnode.fi