February 20, 2026
7 mins

The Hard Part Isn’t the AI

[Image: Agentic workflow architecture for financial reporting: whiteboard showing pipeline stages, data extraction, prompt dependencies, and audit trails]

How we built an agentic workflow that turns months of manual analysis into auditable, presentation-ready reports — and why the technology was the easy bit.

The Jargon Problem

“The MP produces a DDA for the RA.”

If that sentence means nothing to you, you’re in good company. But in our client’s world, it’s Tuesday. Every industry has its own dialect — acronyms stacked on acronyms, shortcuts forged over decades of practice, meaning compressed into shorthand that only insiders can parse. This is the reality of complex, regulated businesses. And it’s exactly the kind of environment where the instinct is to throw AI at the problem.

The logic seems sound: we have spreadsheets, financial analyses, charts, tables, and a mountain of reference documents. Modern AI can process text. Why not automate the whole thing?

Because here’s what happens when you ask an AI to process documents it doesn’t understand: it produces confident, plausible nonsense. And when you’re working with financial data — real numbers attached to real decisions — plausible nonsense isn’t just embarrassing. It’s dangerous.

When you’re working with financial data — real numbers attached to real decisions — plausible nonsense isn’t just embarrassing. It’s dangerous.

A year ago, we attempted exactly this. Orchestration tools like n8n and Make.com were available. The workflows could be wired together. But the large language models at the time simply weren’t capable of the data processing we needed — the nuance, the cross-referencing, the domain reasoning. We got something “good enough,” but good enough doesn’t survive contact with a client who needs to trust every number on every slide.

If your organisation spends days — per person, per week — manually compiling complex, jargon-heavy documents into presentation-ready reports, what follows is directly relevant to you. We solved this problem. But the breakthrough wasn’t a smarter model or a better prompt. It started with two consultants, a whiteboard, and a lot of coffee.

What We Built

The deliverable is a compiled financial assessment report — over a hundred slides, presentation-ready, with charts, tables, benchmarks, and written narrative. The source material is a collection of complex spreadsheets and reference documents. The output is something a senior consultant can review and present, not something they need to reconstruct.

The proprietary process behind this sits entirely in Excel. Activity-based accounting methodology, industry benchmarks accumulated over decades, qualitative and quantitative surveys, complex pivot tables — this is the intellectual property of consultants who are leaders in their field. Excel is their tool. Frankly, it’s their only tool. And that’s fine, because their value isn’t in software. It’s in the analysis. We didn’t replace any of it. We built the automation layer that takes the outputs of their expertise and transforms them into the final product.

The pipeline works in stages with managed dependencies. First, data and charts are extracted from the source spreadsheets. That extracted data feeds into hundreds of individually crafted prompts — each one targeting a specific piece of analysis or narrative. The architecture is multi-level: raw data feeds level-one prompts, those outputs feed level-two summaries, which feed level-three executive narratives. Prompts feeding prompts feeding prompts. You can’t generate an executive summary before the section summaries exist. You can’t write narrative before the data is extracted and validated. Every dependency is explicit.
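To make the shape of that concrete, here is a minimal sketch of how level-by-level dependencies can be declared and executed in order, using Python's standard-library topological sorter. The task names are invented for illustration and this is not our production configuration; the point is that every prompt's inputs are explicit, so nothing runs before the data it needs exists.

```python
from graphlib import TopologicalSorter

# Illustrative task graph: names and structure are made up for this sketch.
# Keys are tasks; values are the tasks they depend on.
dependencies = {
    "extract_financials":   set(),                         # pull data from spreadsheets
    "extract_charts":       set(),
    "l1_cost_analysis":     {"extract_financials"},        # level-one prompts
    "l1_benchmark_review":  {"extract_financials", "extract_charts"},
    "l2_section_summary":   {"l1_cost_analysis", "l1_benchmark_review"},  # level two
    "l3_executive_summary": {"l2_section_summary"},        # level three
}

def run_task(name: str) -> None:
    # Placeholder for the real work: an extraction step or an LLM call.
    print(f"running {name}")

# static_order() guarantees no prompt runs before the data it needs exists.
for task in TopologicalSorter(dependencies).static_order():
    run_task(task)
```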

Every piece of generated content carries full provenance. Which source document. Which data point. How it was processed, and by which step in the pipeline. The audit trail is a first-class deliverable, not an afterthought — because when you’re working with financial data, the ability to prove where every number came from is non-negotiable.
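A simplified sketch of what such a provenance record might hold is below. The field names are illustrative, not the actual schema, but they capture the idea: every generated artefact carries a pointer back to the document, the cell, and the pipeline step that produced it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """Audit-trail record attached to a generated artefact (field names illustrative)."""
    source_file: str       # which source document the value came from
    source_ref: str        # e.g. the worksheet and cell it was read from
    pipeline_step: str     # which stage of the workflow processed it
    prompt_id: str | None  # the prompt that consumed it, if any

revenue_note = Provenance(
    source_file="benchmarks_fy25.xlsx",
    source_ref="Summary!C14",
    pipeline_step="extract_financials",
    prompt_id="l1_cost_analysis",
)
```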

And critically: zero tolerance for hallucinations. Financial figures are extracted and placed, never generated. The AI writes narrative around verified data. It does not invent data.

The entire workflow is API-invocable — it can be triggered remotely as a service, run on demand, integrated into broader operational processes. This isn’t a script someone runs from a laptop. It’s infrastructure.
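As an illustration of what "invocable as a service" means in practice, a run can be started with a single authenticated HTTP call. The endpoint, credentials, and payload below are entirely hypothetical; they only show the shape of the interaction.

```python
import requests

# Hypothetical endpoint, credentials, and payload: the real service and field
# names differ. The point is the shape: one authenticated call starts a run.
response = requests.post(
    "https://workflows.example.com/api/v1/reports",
    headers={"Authorization": "Bearer <token>"},
    json={
        "client_id": "example-client",
        "source_bundle": "uploads/example-client/fy25/",
        "report_template": "financial-assessment",
    },
    timeout=30,
)
response.raise_for_status()
run = response.json()  # e.g. a run id to poll for status and fetch the finished deck
```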

Why It Failed Before

A year ago, we built the first version of this workflow using Bubble.io as a no-code orchestration layer and GPT-3.5 with chat completions for the language processing. The orchestration worked — we could wire stages together, move data between steps, trigger actions in sequence. But the LLM couldn’t do the job. It couldn’t cross-reference across documents. It couldn’t handle the domain jargon. It couldn’t maintain consistency across the hundreds of prompts the workflow required. No-code gave us plumbing, but the engine it was connected to wasn’t powerful enough.

Fast forward to today. The models have improved dramatically. So we ran a deliberate benchmark: could a modern consumer AI tool — in this case, Claude on claude.ai — handle this workflow? We wanted an honest answer. Is the current state of general-purpose AI tooling better or worse than purpose-built infrastructure?

We spent four hours on it before calling it. The model itself was capable — the raw intelligence wasn’t the problem. But the tooling collapsed under the weight of the task. Context limits meant we couldn’t hold enough of the workflow in a single session. There was no filesystem access, so every document had to be manually fed in. No persistent memory between steps. Constant need to do things outside the tool, then bring the results back. For a workflow with this many stages and dependencies, it was like trying to build a house using only a very clever hammer.

Complex, multi-stage, domain-specific agentic workflows need purpose-built infrastructure.

This isn’t a criticism of any specific tool. Claude.ai, ChatGPT, and the rest are impressive for general tasks. The point is narrower than that: complex, multi-stage, domain-specific agentic workflows need purpose-built infrastructure. They need filesystem integration, persistent context, orchestration logic, and the ability to manage hundreds of parallel and sequential operations without losing track of where they are. General-purpose chat interfaces aren’t designed for that. Not yet.

The Real Foundation: Understanding the Business

Here’s the contrarian claim: the hardest part of building an agentic workflow isn’t the AI, the code, or the architecture. It’s understanding the domain.

“The MP produces a DDA for the RA.” We opened with this sentence for a reason. Without knowing what it means — really knowing, not just expanding the acronyms — no amount of prompt engineering will save you. Every synonym, every abbreviation, every implicit assumption baked into decades of client process had to be understood before a single line of code was written. The spreadsheets weren’t just data. They were encoded expertise, and decoding them required expertise of our own.

There are no spring chickens on this team. That’s not a disadvantage — it’s the whole point. Business modelling, operations, financial analysis, process mapping — this knowledge was built over careers, not bootcamps. When you’re staring at a workbook with thirty tabs of pivot tables and activity-based costings, grey hair is an asset. You know what to look for. You know what questions to ask. You know when something doesn’t smell right.

An AI can only be as good as the instructions it receives. If you don’t understand the business, your prompts are garbage — and your output is confident garbage.

An AI can only be as good as the instructions it receives. If you don’t understand the business, your prompts are garbage — and your output is confident garbage. A no-code tool or a junior development team could wire the orchestration together. The stages would execute. The outputs would look plausible. And they’d be subtly, dangerously wrong, because nobody on the team would catch the domain nuances that make the difference between a trustworthy report and an expensive liability.

We didn’t start with code. We started with questions. What does this spreadsheet actually mean? What’s the relationship between these tabs? Why is this benchmark structured this way? What does the client actually need to see on slide forty-seven? Six hours with coffee and a whiteboard — two people who understood both the technology and the business, sketching the process, mapping dependencies, understanding what had to happen before what. That was the foundation for everything that followed.

Designing the Workflow

The whiteboard sessions didn’t produce code. They produced a design — stages, boundaries, data flows, dependencies, all mapped out on paper before anyone opened an editor.

The first key decision was breaking the report into its natural sections. Each section of the final deliverable became an independent agent context with its own data, its own prompts, and its own outputs. This wasn’t arbitrary. It was essential for two reasons.

First, speed. The source documents are large. The workflows are numerous. Running everything sequentially would take far too long to be practical. Parallel execution wasn’t a nice-to-have — it was a requirement.

Second, context isolation. When an AI agent’s working memory bleeds across unrelated tasks, output quality degrades. We call it context poisoning. If the agent processing financial benchmarks is also holding survey data and executive narrative in its context, it starts making connections that aren’t there. Clean boundaries between sections prevent cross-contamination and keep each agent focused on exactly what it needs to know.

When an AI agent’s working memory bleeds across unrelated tasks, output quality degrades. We call it context poisoning.
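In code, that isolation falls out naturally when each section runs as its own task with only its own slice of the data. The sketch below uses invented section names and a placeholder agent call, but it shows both properties at once: sections execute in parallel, and no section ever sees another section's context.

```python
import asyncio

# Hypothetical section names; each section agent sees only its own slice of data.
SECTIONS = ["cost_structure", "benchmarking", "survey_findings", "operations"]

async def run_section(section: str, section_data: dict) -> dict:
    # Placeholder for the real agent: extraction, prompts, rendering for one section.
    await asyncio.sleep(0.1)
    return {"section": section, "inputs_seen": sorted(section_data)}

async def run_report(source_data: dict) -> list:
    # Sections run in parallel, and no section's context bleeds into another's.
    tasks = [run_section(name, source_data[name]) for name in SECTIONS]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    demo_data = {name: {"tables": {}, "charts": {}} for name in SECTIONS}
    print(asyncio.run(run_report(demo_data)))
```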

The design had to account for hundreds of prompts with explicit dependencies between them. Data extraction before narrative generation. Level-one analysis before level-two summaries. Level-two summaries before level-three executive narrative. Every dependency was mapped, every ordering constraint was deliberate.

Then we proved each step in isolation before attempting the whole job. Extract one chart. Run one prompt. Merge one section into the template. Validate the output. Get each component right on its own before wiring them together. Only when every individual step was reliable did we connect the pipeline end-to-end.

We built in automatic feedback loops and self-correction. The system checks its own output — catches formatting errors, detects text overflow, validates that data has been placed correctly. When something fails, it knows, and it reports exactly what went wrong and where.
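A self-check can be as plain as a function that inspects one piece of generated output and returns a precise list of problems. The example below is a simplified, hypothetical validation step; it flags text overflow and missing figures so the offending stage can be re-run rather than silently shipped.

```python
def validate_slide_text(slide_id, text, max_chars, required_figures):
    """Return a list of problems with one generated slide; an empty list means it passes."""
    problems = []
    if len(text) > max_chars:
        problems.append(f"{slide_id}: text overflow ({len(text)} > {max_chars} chars)")
    for figure in required_figures:
        if figure not in text:
            problems.append(f"{slide_id}: required figure {figure} was not placed")
    return problems

# A failing check reports exactly what went wrong and where, so the offending
# step can be re-run instead of shipping a broken slide.
issues = validate_slide_text(
    slide_id="slide_47_cost_summary",
    text="Total operating cost was $4.2m against a benchmark of $3.9m.",
    max_chars=600,
    required_figures=["$4.2m", "$3.9m"],
)
assert issues == []
```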

Designing the workflow orchestration was the non-trivial part. The actual implementation — writing the configuration, wiring the agents — was straightforward; modern LLMs handle that well. But knowing what to build, how to decompose the problem, where to draw the boundaries, what each agent should and shouldn’t see — that only comes from experience. You need to have designed a lot of agentic processes to understand what good looks like.

And underneath all of this is a substantial amount of Python code with real dependencies — libraries for spreadsheet parsing, chart extraction, image rendering, document generation. This is not a weekend project. This is not something a no-code tool can produce. It required genuine software engineering alongside the domain expertise.
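To give a flavour of that layer, the snippet below uses openpyxl, a common Python library for reading Excel workbooks, to pull a header-plus-rows table out of a worksheet. It is a deliberately simplified illustration rather than our extraction code; the real workbooks demand far more handling of merged cells, pivot tables, and embedded charts.

```python
from openpyxl import load_workbook  # third-party library for reading .xlsx files

def extract_table(path, sheet):
    """Read a simple header-plus-rows table from one worksheet (simplified illustration)."""
    wb = load_workbook(path, data_only=True)  # data_only=True returns cached formula results
    ws = wb[sheet]
    rows = ws.iter_rows(values_only=True)
    header = [str(h) for h in next(rows)]
    return [
        dict(zip(header, row))
        for row in rows
        if any(cell is not None for cell in row)  # skip fully empty rows
    ]
```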

The Result

Twenty million tokens to reach production testing. To put that in perspective: 7 specialist agents, coordinated across 17 skills, process 53 extracted data tables, execute 68 individually crafted prompts that generate 230 distinct pieces of narrative content, render 62 tables and diagrams, place 91 charts and images, and assemble it all into an 82-slide presentation — with every single output traceable to its source.

That’s what the numbers look like. Every chart placement, every narrative paragraph, every financial figure can be traced back to the specific spreadsheet cell it came from and the specific pipeline step that processed it. The audit trail isn’t a feature. For financial reporting, it’s the point.

The workflow runs on Sasha — our agentic workflow platform — which provides the observability and orchestration that a pipeline of this complexity demands. Every stage is logged. Every decision is auditable. When something needs attention, the system surfaces it in structured reports that a consultant can act on, not in stack traces buried in a terminal.

The practical impact is measured in days. The manual version of this process — extracting data, writing narrative, formatting slides, placing charts, cross-checking numbers — consumed days per person per week. Experienced consultants spending their time on compilation and formatting instead of analysis and client work. That time is now returned to them.

What This Means

A year ago, we couldn’t build this. The models weren’t capable enough. The tooling wasn’t mature enough. Today, 7 agents coordinate across 68 prompts to produce an auditable, presentation-ready report from raw spreadsheets — and every number can be traced back to its source.

But the technology was the straightforward part. The months of work were spent understanding a client’s business deeply enough to encode it — sitting with their spreadsheets, learning their jargon, mapping their processes, and asking the questions that only experience teaches you to ask.

The question isn’t whether the AI is smart enough. It is. The question is whether the people building the workflow understand your business well enough to get it right.

If your organisation has complex, domain-heavy document workflows that consume specialist time, this is now solvable. The question isn’t whether the AI is smart enough. It is. The question is whether the people building the workflow understand your business well enough to get it right.