How We Built a Cashflow Engine for AI Financial Planning

Our first financial model was a unique piece of LLM-generated Python for every user. It worked - until it didn't. We analysed 1,682 production models, found common patterns and built a deterministic engine to replace them.

Dima Tarasenko

Dima is the Founder and CEO of Meet Warren. Passionate about making top-class financial management accessible to the 99%.

15 min read

2026-05-21

The North Star

Our number one guiding principle for building Warren is: "Traditional fintech apps all deliver the same experience to each user. A personal financial companion, on the other hand, should be exactly that - personal". So when it came to cashflow modelling, we wanted to innovate there as well: no rigid Income, Expenses, Assets and Liabilities categories, but a 1:1 financial model built specifically around your finances.

You come to Warren to understand your retirement options? The UI should say "John's Early Withdrawal Pot", not "Pensions". You're buying a house? Your model doesn't need a "Net Worth", just a growing "Deposit Fund" and then a decreasing "Mortgage Balance".

This is not at all trivial to achieve, and our first version was quite interesting (probably deserving of a separate post!). Of course, asking an LLM to "just do the maths" is out of the question, so what we did instead was ask something like "Write a piece of Python code that will do the maths for this user".

The Problem

It worked. Sometimes impressively well. Every single user's model was unique and represented their finances 1:1 without relying on LLM maths... in theory. But four problems became clear as we scaled:

Projections were opaque. "Your model" was just a string of text that compiled to Python, stored together with the output values observed the one time that Python was executed. It had no components and no set structure beyond an output signature. It could tell you "your pension could reach £850,000 by age 67 when you retire", but asking Warren "what if I retire at 60?" was almost nonsensical - there was no toggle for that.

Maths leaked into the LLM layer too often. For example, the LLM would aggregate income streams (potentially incorrectly) into a single income variable in the code, and the deterministic calculations then ran on that figure.

All edits required full regeneration. Making changes and creating scenarios was possible but there was no concept of small vs big changes, just a text edit and a full re-execution with the associated overhead of compilation error fixes and so on.

There was no ability to record progression. The model was a static, point-in-time artefact and nothing more. If we want customers to treat Warren as a financial companion that helps them stay on track as their lives evolve, at the very least it should be able to update the graphs they see.

And yet after 2,000 beta users we had validated that the key insight was correct: when it worked, it really worked. Users had never before seen charts in an app that truly represented them! Going back to full determinism, with AI plugging in just a couple of inputs, was not an option. So we got to work.

The Analysis

Before building anything, we needed to know whether 1:1 personalisation actually required unique code per user - or whether our 2,000 models were structurally more similar than they appeared.

We pulled 1,682 production models from the database, parsed their Python ASTs, and measured every structural dimension: what series they tracked, what events they fired, what assumptions they used, and what computational patterns they implemented. We wanted to know, given the full flexibility of code, what the LLM was actually producing.
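As a rough illustration of this kind of structural measurement, here is a minimal sketch using Python's standard `ast` module. The sample model source, the function names and the two features measured are all hypothetical; the real generated models and the dimensions we measured were richer than this.

```python
import ast
from collections import Counter

# Hypothetical stand-in for one LLM-generated model (not a real production model).
SAMPLE_MODEL = '''
def run_model(years):
    net_worth = 10000
    pension = 5000
    for year in range(years):
        pension = pension * 1.05 + 2400
        net_worth = net_worth + pension
    return {"Net Worth": net_worth, "Pension Pot": pension}
'''


def structural_profile(source: str) -> Counter:
    """Count AST node types: a crude fingerprint of a model's computational shape."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def output_labels(source: str) -> list:
    """Collect string keys of dict literals, e.g. the series names a model returns."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            for key in node.keys:
                # key is None for ** unpacking, so check the type before reading it
                if isinstance(key, ast.Constant) and isinstance(key.value, str):
                    labels.append(key.value)
    return labels


profile = structural_profile(SAMPLE_MODEL)
labels = output_labels(SAMPLE_MODEL)
```

Running the same two functions over every model in a corpus gives comparable fingerprints and label lists that can then be grouped and counted.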

The finding was striking: 8 pattern combinations covered 70% of all models. 11 covered 80%. Despite 839 unique series IDs and 4,598 unique event labels, the naming variance masked a much smaller set of underlying financial concepts. "Net Worth", "Total Net Worth", and "Cumulative Net Worth" were all the same thing. "Emergency Fund Complete" appeared in 37% of models - always implemented the same way.

We built taxonomies - collapsing the naming chaos into closed vocabularies. 73 series archetypes. 101 event archetypes. 6 assumption categories. These taxonomies became the foundation of the schema we would build: if 1,682 unique Python scripts were really implementing ~8-11 computational templates with different parameters, then a declarative schema covering those templates could replace code generation entirely.
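To make "collapsing the naming chaos" concrete, here is a small sketch of mapping free-form labels onto a closed vocabulary. The alias table below is a hypothetical two-entry excerpt (the "rainy day fund" alias is invented for illustration), and the normalisation rule is an assumption rather than the exact procedure we used.

```python
import re
from typing import Optional

# Hypothetical excerpt of an alias table; the real taxonomy has 73 series archetypes.
ARCHETYPE_ALIASES = {
    "net_worth": {"net worth", "total net worth", "cumulative net worth"},
    "emergency_fund": {"emergency fund", "rainy day fund"},
}


def normalise(label: str) -> str:
    """Lower-case a label and squash punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def to_archetype(label: str) -> Optional[str]:
    """Return the archetype a label belongs to, or None if it is not in the vocabulary."""
    key = normalise(label)
    for archetype, aliases in ARCHETYPE_ALIASES.items():
        if key in aliases:
            return archetype
    return None


examples = ["Net Worth", "Total Net Worth", "Cumulative Net-Worth"]
mapped = [to_archetype(label) for label in examples]
```

Labels that come back as `None` are the interesting ones: they are either a new alias for an existing archetype or evidence that the vocabulary needs another entry.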

We then validated this by transcribing 25 representative models from Python into our proposed schema and running them through a deterministic engine. The first pass found 121 gaps - places where the schema couldn't express what the Python did. After four iterations of surgical schema fixes, that dropped to 47. Most of what remained were edge cases (intra-period ordering, conditional waterfalls) that affected a small fraction of models, not fundamental limitations.

The conclusion: per-user code generation was structurally unnecessary. A well-designed declarative schema could capture the full range of personal finance situations our users presented, while being deterministic, editable, and auditable by design.

What We Built

Meet Warren's Financial Engine is the result of that analysis - a deterministic simulation system. It takes a structured Model Definition (a declarative representation of the user's financial life) and produces a Simulation Output: period-by-period projections for every financial instrument, with taxes applied, events triggered, and assertions checked.

The LLM's job changes fundamentally: instead of generating Python code, it generates structure. It populates a schema that defines the series (salary, pension, ISA, mortgage), the events (retirement, house purchase, salary increase), the taxes, and the relationships between them. The engine then computes every number deterministically.

The core contract is simple: the LLM generates a Model Definition (structure, relationships, initial parameters), and the engine computes a Simulation Output (period-by-period projections, fired events, tax debits, assertion results). Same inputs, same outputs - every time.

The Model Definition captures a user's entire financial life as a structured document. At a high level, it contains:

Horizon - when the model starts, how far it projects, the user's age
Assumptions - named parameters (inflation rate, salary growth, retirement age)
Series - financial instruments, each with a type: income (salary, rental income), accumulator (pension, ISA), flow (pension contributions), debt (mortgage, student loan), or derived (net worth, computed from other series)
Events - things that happen (retirement, house purchase, salary increase), each with a trigger condition and effects on series
Taxes - tax computations applied to series (income tax, NI, CGT)
Assertions - rule-based sanity checks ensuring model cohesion (e.g. "salary never negative")
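The outline above could be sketched as a handful of Python dataclasses. This is an illustration of the shape only: the field names, the tuple form of the trigger and all the values are assumptions made for this example, not the production schema.

```python
from dataclasses import dataclass, field
from typing import Literal

# The five series types listed above.
SeriesType = Literal["income", "accumulator", "flow", "debt", "derived"]


@dataclass
class Horizon:
    as_of_date: str    # ISO date the model starts from
    period_count: int  # how many periods to project
    age: int           # the user's age at the start


@dataclass
class Series:
    id: str
    type: SeriesType
    label: str         # user-facing name, personal to this user


@dataclass
class Event:
    id: str
    trigger: tuple     # a predicate expression (illustrative encoding)
    effects: list = field(default_factory=list)


@dataclass
class ModelDefinition:
    horizon: Horizon
    assumptions: dict = field(default_factory=dict)
    series: list = field(default_factory=list)
    events: list = field(default_factory=list)
    taxes: list = field(default_factory=list)
    assertions: list = field(default_factory=list)


# Illustrative values only.
model = ModelDefinition(
    horizon=Horizon(as_of_date="2026-05-21", period_count=35, age=32),
    assumptions={"inflation_rate": 0.02, "retirement_age": 67},
    series=[
        Series("salary", "income", "Salary"),
        Series("pension", "accumulator", "John's Early Withdrawal Pot"),
        Series("mortgage", "debt", "Mortgage Balance"),
    ],
    events=[Event("retirement", trigger=("gte", "age", "retirement_age"))],
)
```

Note that the personalised label lives on the series while the type stays within a closed set, which is what lets the UI say "John's Early Withdrawal Pot" while the engine treats it as an ordinary accumulator.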

Every element is declarative. A salary series defines its growth by referencing assumptions. A pension accumulates contributions from a flow. A mortgage amortises according to financial mathematics. Events trigger on conditions - when age reaches the retirement assumption, salary and contributions deactivate. Taxes compute from the affected series and debit them. Assertions verify that the plan is on track.

The engine walks this structure period by period, evaluating expressions, firing events, applying taxes, computing growth, and recording values. No LLM involved. No randomness. Pure arithmetic.
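A heavily simplified sketch of such a period-by-period walk is shown below, with one salary, one pension and one retirement event. The function name, parameters and the order of operations inside a period are assumptions for illustration; taxes, assertions and the expression language are left out.

```python
def simulate(start_age, periods, salary, pension, salary_growth,
             pension_growth, contribution_rate, retirement_age):
    """Walk the model one annual period at a time and record every value."""
    rows, fired = [], []
    retired = False
    for period in range(periods):
        age = start_age + period
        # 1. Fire events whose trigger condition is met (each fires at most once).
        if not retired and age >= retirement_age:
            retired = True
            fired.append(("retirement", period))
        # 2. Evaluate gated income: salary is inactive once retired.
        income = 0.0 if retired else salary
        # 3. Flow from salary into the pension, then grow the accumulator.
        contribution = income * contribution_rate
        pension = pension * (1 + pension_growth) + contribution
        # 4. Record this period's values.
        rows.append({"period": period, "age": age,
                     "salary": income, "pension": pension})
        # 5. Apply salary growth for the next period.
        salary *= 1 + salary_growth
    return rows, fired


rows, fired = simulate(start_age=64, periods=3, salary=50_000.0,
                       pension=100_000.0, salary_growth=0.0,
                       pension_growth=0.0, contribution_rate=0.1,
                       retirement_age=65)
```

Because the loop uses nothing but its arguments, calling it twice with the same inputs returns identical rows, which is the property the engine relies on.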

Construction: Key Decisions

Decision 1: LLM Defines Structure, Engine Computes Numbers

This is the foundational split - and the direct result of the corpus analysis.

The LLM is excellent at understanding a user's financial situation and translating it into a structured model - deciding that a 32-year-old with a workplace pension and a mortgage needs specific series, events, and relationships. It was never reliable at writing correct computational code for 35-year compound growth with tax implications and event-driven parameter changes.

We give the LLM the Model Definition schema and ask it to populate the structure. The schema - shaped by the taxonomies we built during the analysis - is expressive enough to capture the semantics of UK personal finance: income gating (salary stops at retirement), event-driven parameter changes (mortgage rate increases on remortgage), tax distribution across multiple series, compound and linear growth mechanisms, and assertions that serve as plan health checks.

The engine then takes this structure and does what computers are good at: arithmetic. Period by period, it evaluates every expression, applies every tax, fires every event whose trigger condition is met, grows every accumulator, amortises every debt, and records every value - in both nominal and real (inflation-adjusted) terms.
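For readers unfamiliar with the nominal/real distinction, the standard conversion deflates a nominal value by compounded inflation. The sketch below assumes a single constant inflation rate and whole periods; the helper name is hypothetical and the engine's handling of partial periods is not shown.

```python
def to_real(nominal: float, inflation_rate: float, periods_elapsed: int) -> float:
    """Express a future nominal amount in period-0 purchasing power."""
    return nominal / (1 + inflation_rate) ** periods_elapsed


# A nominal 121.0 two periods out, at 10% inflation per period,
# buys roughly what 100.0 buys today.
real_value = to_real(121.0, 0.10, 2)
```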

The trade-off was deliberate and costly: designing a schema that could capture the breadth of UK personal finance took far longer than building the simulator. The Model Definition had to be wide enough to handle everything from a straightforward salary-and-pension model to complex multi-property, multi-pension, event-heavy scenarios - while remaining structured enough for deterministic evaluation. The corpus analysis gave us confidence that this was achievable: the patterns were finite, even if the naming was not.

Decision 2: An Expression Language, Not Hardcoded Formulas

Financial models are full of relationships. Pension contributions are a percentage of salary. Salary grows at an assumed rate. Tax is computed from income. Net worth is a derived sum.

We designed a small expression language that the LLM uses to define these relationships:

Layer	Operations	Example
Values	constants, assumption refs, series refs, age, years elapsed	`ref: "inflation_rate"`
Arithmetic	`add`, `multiply`, `pow`, `min`	`min(multiply(salary, rate), cap)`
Predicates	`gte`, `lt`, `eq`, `and`, `or`, `not`	`gte(age, retirement_age)`
Composition	`let` bindings for local variables	`let(tax_free = ..., body)`

These compose freely - a pension contribution can be expressed as `min(multiply(salary, contribution_rate), annual_allowance)`. An event triggers when its predicate evaluates to true. An assertion passes when its predicate holds. An income series can be gated on a predicate - salary is active only while age is below the retirement assumption.

This gives the LLM the vocabulary to express a wide range of financial logic without us having to anticipate every possible relationship. When a user says "I want to increase my pension contributions by 1% every year until I hit the annual allowance," the LLM can express that as a `min()` over a growing contribution and a cap - no new engine feature required.
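A small evaluator for a language of this shape might look like the sketch below. The nested-tuple encoding, the `evaluate` function and the variable names in the example are assumptions made for illustration; only the operator names come from the table above.

```python
import operator

ARITHMETIC = {"add": operator.add, "multiply": operator.mul,
              "pow": operator.pow, "min": min}
COMPARISON = {"gte": operator.ge, "lt": operator.lt, "eq": operator.eq}


def evaluate(expr, env):
    """Evaluate a nested-tuple expression against a dict of named values."""
    if isinstance(expr, (int, float, bool)):
        return expr                                   # constant
    op, *args = expr
    if op == "ref":
        return env[args[0]]                           # assumption, series or local
    if op == "let":
        name, value, body = args
        return evaluate(body, {**env, name: evaluate(value, env)})
    if op in ARITHMETIC or op in COMPARISON:
        left, right = (evaluate(arg, env) for arg in args)
        return (ARITHMETIC.get(op) or COMPARISON[op])(left, right)
    if op == "and":
        return all(evaluate(arg, env) for arg in args)
    if op == "or":
        return any(evaluate(arg, env) for arg in args)
    if op == "not":
        return not evaluate(args[0], env)
    raise ValueError(f"unknown operation: {op}")


# "Increase contributions by 1% a year until the annual allowance is hit":
# min(multiply(salary, add(base_rate, multiply(0.01, years_elapsed))), annual_allowance)
contribution = ("min",
                ("multiply", ("ref", "salary"),
                 ("add", ("ref", "base_rate"),
                  ("multiply", 0.01, ("ref", "years_elapsed")))),
                ("ref", "annual_allowance"))

# Illustrative values only.
env = {"salary": 60_000, "base_rate": 0.05,
       "years_elapsed": 3, "annual_allowance": 60_000}
year_3 = evaluate(contribution, env)                  # about 4,800
capped = evaluate(contribution, {**env, "years_elapsed": 200})
```

The growing-contribution rule needs no new operator: it is an `add` and a `multiply` nested inside the existing `min`, which is the point of keeping the language small and composable.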

Decision 3: The Validate-Repair Loop

An LLM generating a Model Definition from a conversation will make mistakes. Series IDs might not match between a flow and its target accumulator. An event might reference an assumption that doesn't exist. A derived series might accidentally reference itself, creating a cycle.

We built a validation layer that catches these errors - not just structural schema violations, but semantic issues:

ID uniqueness across series, events, taxes, assumptions, and assertions
Cross-reference resolution - every referenced series, event, assumption, and property actually exists
Cycle detection - derived series can't (directly or transitively) reference themselves
Variable binding - every var reference in a let expression resolves to an enclosing binding
Progress point coverage - every trackable series has a corresponding entry in the progress point

Validation is pure and fast - O(N) over the plan's structure, trivial for any realistic model. It produces structured errors and warnings that the LLM can read and act on.
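Three of these checks (ID uniqueness, cross-reference resolution and cycle detection) can be sketched in a few lines. The model shape used here, a list of series each with an `id` and a list of `refs`, and the error codes are hypothetical simplifications; the cycle check is a standard depth-first search with three colours.

```python
from collections import Counter

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def validate(model: dict) -> list:
    """Return structured errors for one simplified model; an empty list means valid."""
    errors = []
    ids = [s["id"] for s in model["series"]]
    for series_id, count in Counter(ids).items():
        if count > 1:
            errors.append({"code": "duplicate_id", "id": series_id})

    known = set(ids)
    graph = {series_id: [] for series_id in known}
    for s in model["series"]:
        for ref in s.get("refs", []):
            if ref in known:
                graph[s["id"]].append(ref)
            else:
                errors.append({"code": "unresolved_ref", "id": s["id"], "ref": ref})

    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:          # back-edge: nxt is on the current path
                errors.append({"code": "cycle", "id": nxt})
            elif colour[nxt] == WHITE:
                visit(nxt)
        colour[node] = BLACK

    for node in sorted(graph):               # sorted for deterministic error order
        if colour[node] == WHITE:
            visit(node)
    return errors


bad_model = {"series": [
    {"id": "isa"},
    {"id": "net_worth", "refs": ["isa", "pension"]},   # "pension" does not exist
    {"id": "a", "refs": ["b"]},
    {"id": "b", "refs": ["a"]},                        # a -> b -> a
]}
bad_errors = validate(bad_model)
```

Each node and edge is visited once, so the cost stays linear in the size of the model, and the errors are plain dictionaries that can be serialised straight into an LLM prompt.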

In the plan generation pipeline, this creates a validate-repair loop: the LLM generates a model, the engine validates it, and if validation fails, the errors are fed back to the LLM with instructions to fix them. This usually converges within a small number of iterations. The LLM doesn't need to perfectly memorise our schema - it just needs to get close enough for the repair loop to guide it to a valid model.

Another advantage of this approach is that it allows us to progressively inject context about the subtle edge behaviours of the engine where relevant without polluting the initial large system instructions.
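The loop itself is short. The sketch below uses a stand-in generator instead of a real LLM call; `generate_with_repair`, `fake_llm`, `check_assumption_refs` and the attempt limit are all hypothetical names and choices for illustration.

```python
def generate_with_repair(generate, validate, max_attempts=3):
    """Call generate(feedback) until validate(model) returns no errors."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        model = generate(feedback)       # feedback is empty on the first attempt
        feedback = validate(model)
        if not feedback:
            return model, attempt
    raise RuntimeError(f"still invalid after {max_attempts} attempts: {feedback}")


def check_assumption_refs(model):
    """Toy validator: every event must reference an assumption that exists."""
    return [
        f"event '{e['id']}' references unknown assumption '{e['assumption']}'"
        for e in model["events"]
        if e["assumption"] not in model["assumptions"]
    ]


def fake_llm(feedback):
    """Stand-in for the LLM: forgets an assumption, then fixes it when told."""
    model = {"assumptions": {},
             "events": [{"id": "retire", "assumption": "retirement_age"}]}
    if feedback:                         # pretend the errors were read and acted on
        model["assumptions"]["retirement_age"] = 67
    return model


model, attempts = generate_with_repair(fake_llm, check_assumption_refs)
```

Passing the error list back into the next generation call is also the natural place to attach extra guidance about a specific engine behaviour, so that context is only spent when the model has actually tripped over it.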

Beyond hard errors, the validator also produces quality hints - suggestions that the model isn't wrong, but could be better. For example, an assumption that's defined but never referenced is probably left over from an earlier version. These hints are surfaced to the generation pipeline and optionally to the user's model editor.
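The unused-assumption hint can be sketched as a walk over every expression in the model, collecting the names that are actually referenced. The nested-tuple expression encoding and both function names are assumptions made for this example.

```python
def referenced_names(expr):
    """Yield every name used by a ("ref", name) node in a nested-tuple expression."""
    if isinstance(expr, tuple):
        if expr[0] == "ref":
            yield expr[1]
        else:
            for child in expr[1:]:       # non-tuple children (numbers, names) are skipped
                yield from referenced_names(child)


def unused_assumptions(assumptions: dict, expressions: list) -> list:
    """Return assumption names that no expression references, sorted for stable output."""
    used = {name for expr in expressions for name in referenced_names(expr)}
    return sorted(set(assumptions) - used)


# Illustrative values only.
assumptions = {"inflation_rate": 0.02, "salary_growth": 0.03, "old_bonus_rate": 0.10}
expressions = [
    ("multiply", ("ref", "salary"), ("ref", "salary_growth")),
    ("ref", "inflation_rate"),
]
hints = unused_assumptions(assumptions, expressions)
```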

Decision 4: Re-anchoring - Plans That Move Through Time

A financial model created in January with current balances and assumptions is stale by June. Balances have changed. Events may have occurred. New contributions have been made. The model's "as of" date is in the past.

There is a trade-off here between silently moving forward in the background by making assumptions (i.e., on every login, the user sees today as the starting point of the model) and showing a stale UI after a couple of weeks without new data. We made a product decision never to advance without the user confirming their current balances (e.g., did the salary increase the model predicted actually happen?).

The way we progress is interesting. We created a reconciliation process internally called "Re-anchoring". When a user captures a new progress point - current balances for all their financial instruments - the engine:

Moves the horizon forward. The asOfDate advances to the new progress point's date. periodCount is clamped so the calendar end-date of the simulation stays the same - the plan still projects to the same future date, just from a new starting point. The engine deterministically handles partial periods for tax and inflation purposes.
Processes event decisions. Events that fell between the old and new asOfDate might have fired (the user got a raise), been rescheduled (the house purchase moved to next year), or been removed (the user decided not to remortgage). The user reviews these events and marks each one. Fired events have their effects baked into the model - parameter changes become declared values, lifecycle changes are recorded.
Seeds the forward simulation. The new progress point provides period-0 values for every trackable series. The engine simulates forward from here with no memory of how the previous simulation ran - only the structural effects of past events (which have been baked into the model) carry forward.
Surfaces progress. We store both historic initial values (e.g. the last time you provided a salary figure) and the future values predicted by historic simulations (e.g. now that you got an unexpected raise, your 2050 Net Worth is X% higher than previously predicted).

This means a financial plan is never frozen at its creation date. It evolves as the user's life evolves, staying grounded in real numbers rather than stale projections.
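The horizon and seeding steps can be illustrated with a short sketch. It assumes monthly periods and whole-month arithmetic purely to keep the example small; the function names are hypothetical, and event decisions and partial-period handling are left out.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def re_anchor(horizon: dict, progress_date: date, balances: dict):
    """Advance asOfDate and clamp periodCount so the simulation end date is unchanged."""
    elapsed = months_between(horizon["asOfDate"], progress_date)
    if elapsed < 0:
        raise ValueError("progress point is earlier than the current asOfDate")
    new_horizon = {
        "asOfDate": progress_date,
        "periodCount": max(horizon["periodCount"] - elapsed, 0),
    }
    seed = dict(balances)                # period-0 values for the forward simulation
    return new_horizon, seed


# Illustrative values: a 35-year monthly plan created in January, re-anchored in June.
old_horizon = {"asOfDate": date(2026, 1, 1), "periodCount": 420}
new_horizon, seed = re_anchor(old_horizon, date(2026, 6, 1),
                              {"pension": 41_250.0, "mortgage": 182_400.0})
```

In this example five months have elapsed, so the plan now runs for 415 periods from June and still ends in the same calendar month as before.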

What We've Learned

Separate structure from computation. The single biggest improvement in plan quality came from taking code generation away from the LLM. Let it reason about financial situations and populate a structured schema - that's what it's good at. Let a deterministic engine handle the arithmetic - that's what engines are good at. This gives you the best of both worlds: flexible, nuanced financial reasoning with precise, reproducible projections that can be edited, debugged and built upon with a simple schema edit.

Measure before you build. The corpus analysis was the most important step in the entire project. Without it, we would have been guessing at what the schema needed to express. With it, we knew exactly which computational templates to support and could design with confidence. The finding that 8-11 patterns covered 70-80% of models turned a risky architectural bet into an evidence-based decision.

Schemas are expensive to design, cheap to use. Designing a Model Definition schema that could capture the breadth of UK personal finance took far longer than building the simulator. But once the schema existed, every new feature - a new type of debt, a new tax mechanism, a new event effect - was a schema extension and a simulator step, not a rewrite.

Validation is the bridge between LLM and engine. The validate-repair loop is what makes it practical to have an LLM generate structured models. Without it, the LLM would need to perfectly satisfy every cross-reference and dependency constraint - an unreasonable expectation. With it, the LLM needs to get the structure approximately right, and the validator guides it the rest of the way.

Explainability comes from structure, not explanation. We never had to build a separate "explain this projection" feature. Because every number is the result of a transparent computation - salary × growth^years, compounded annually, minus tax, plus contributions - the explanation is inherent in the model. Users can trace any number back to its inputs. Warren can use intuitive jq tooling to quickly understand the user's specific model at any point and explain the why behind every single number. No black boxes here.

What's Next

The financial engine is in production and actively expanding:

Coverage. The expression language and event system are general-purpose, but the library of financial instrument types, tax computations, and event patterns continues to grow. More debt types, more nuanced tax modelling, support for complex benefit interactions - each expansion makes the schema capable of representing more financial situations accurately.

Semantic breadth. Real financial lives are messy. Supporting scenarios like "I might take a career break," "my partner's income is variable," or "I'm considering equity release on my property" means expanding the event and series vocabulary while keeping the engine deterministic.

Computation depth. The tax system currently handles income tax, NI, and capital gains at a modelled level. Deepening this to handle reliefs, allowances, tapering, and cross-year interactions - always deterministically, always traceably - is an ongoing investment.

Flexibility. As UK tax rules change (and they change frequently), the model schema needs to accommodate new structures without breaking existing plans. This means careful versioning of model definitions and simulation outputs, so a plan created under 2025-26 tax rules can be re-simulated under 2026-27 rules with the differences clearly surfaced.

Built by the Meet Warren team. For questions, reach us at info@meetwarren.co.uk.