I Kept Removing AI Until It Worked
How I built an invoice automation system for Reno Stars by learning — the hard way — that more AI agents just means more problems.
So I spent 5 days building an invoice automation system for Reno Stars, a renovation company. By the end, I'd rewritten the architecture 4 times. Each rewrite removed more AI. And each time, the system got more reliable.
This is the story of how I went from "let's use multiple AI agents to review each other's work" to "one smart LLM calling dumb, deterministic tools" — and why that turned out to be the answer all along.
The Problem
Reno Stars generates renovation invoices — bathroom remodels, kitchen renovations, flooring, painting. Each invoice follows specific patterns: a tub-to-tiled-shower bathroom always has the same set of steps (demolition, drywall, shower wall, shower base, tile, glass door, vanity, fixtures...). But there are dozens of base configurations, each with optional add-ons and customizations.
A colleague had been writing these by hand, and all the patterns lived in their head. My job was to capture that knowledge and automate the whole thing.
Day 1: Build the Knowledge Base
Before writing any code, I scraped 342 historical invoices from InvoiceSimple and converted them to markdown. Stripped all pricing — this system only handles scope of work, not numbers. Then I organized everything into 13 reference documents by trade: bathroom, kitchen, flooring, painting, electrical, plumbing, and so on.
This part was straightforward. I now had a structured knowledge base. The hard part was figuring out how to actually use it.
Day 2: Composable Templates — Let AI Pick and Compose
My first real architecture: 35 base models (pre-written invoice templates), each with add-ons and replacement rules. The AI's job was simple — read the user's prompt, pick the right model, select the right add-ons, and a compose engine would mechanically stitch them together.
User prompt → AI picks model + add-ons → Compose engine → Invoice
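Under the hood, the compose step was just deterministic list surgery: take the picked model's steps, splice in each add-on's steps at its anchor point. A minimal sketch of that idea, with hypothetical names and toy data (the real models were far richer):

```typescript
// Sketch of the Day 2 compose engine. All names and data here are
// illustrative, not the actual Reno Stars templates.
type BaseModel = { name: string; steps: string[] };
type AddOn = { name: string; insertAfter: string; steps: string[] };

// Deterministic: given a picked model and add-ons, stitch steps together.
function compose(model: BaseModel, addOns: AddOn[]): string[] {
  const steps = [...model.steps];
  for (const addOn of addOns) {
    const i = steps.indexOf(addOn.insertAfter);
    // Splice the add-on's steps in after its anchor step (or append).
    steps.splice(i === -1 ? steps.length : i + 1, 0, ...addOn.steps);
  }
  return steps;
}

const tubToShower: BaseModel = {
  name: "tub-to-tiled-shower",
  steps: ["Demolition", "Drywall", "Shower wall", "Tile", "Vanity"],
};
const bench: AddOn = {
  name: "shower-bench",
  insertAfter: "Shower wall",
  steps: ["Build tiled bench"],
};
```

The engine itself never misbehaved. The failure mode was entirely upstream, in the AI's choice of inputs.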
This worked... kind of. The compose engine was deterministic, which was great. But the AI kept drifting from the templates. It'd add steps that didn't exist in any model. It'd rephrase things in subtle ways. Small hallucinations compounded into invoices that looked plausible but weren't quite right.
I thought: "The AI just needs more structure. More guardrails. More review."
That thought sent me down the wrong path for an entire day.
Day 3: The Multi-Agent Rabbit Hole
This was the longest day. I rebuilt the entire system as a multi-agent pipeline:
- Extractor agent — parses the user's prompt into a structured spec
- Reviewer agent — validates the extraction, loops up to 4 times to correct mistakes
- Section agents — one per trade, each doing a 2-turn conversation (review intake, then generate)
- Post-processor — programmatic fixes to catch whatever the agents still got wrong
- Assembler — stitches everything together
The idea was defense in depth. If one agent hallucinates, the next one catches it. More eyes on the problem. More layers of review.
It made everything worse.
The reviewer agent would "correct" things that were already right. The section agents would rephrase template text in ways that sounded fine but broke the format. Each layer of AI added its own hallucination surface. And debugging became a nightmare — when the output was wrong, which of the 4 AI stages caused it? Good luck figuring that out.
So I added a post-processor to programmatically fix common AI mistakes: strip hallucinated remarks, restore missing steps, enforce bold headings, remove duplicate entries. The post-processor kept growing. And at some point I stepped back and realized — I'm writing code to fix AI output... when I could just write code to generate the output correctly in the first place.
That was the turning point.
I disabled AI generation entirely and just used the preprocessed template directly. The section agent went from a 2-turn conversation (review + generate) to review only. Content generation became fully deterministic.
The output immediately got more reliable.
So I kept going. I built typed step objects — instead of AI modifying text, the system parsed templates into typed objects, applied modifications programmatically, and rendered back to markdown. No AI touches the content at all.
By the end of Day 3, most invoices generated with zero AI tool calls. The extraction still needed AI (you genuinely need language understanding to parse "I want a bathroom with a bench and double sink"), but everything after that was pure code.
Day 4: The Final Architecture — MCP-First
Day 4 started with a big cleanup. I built a proper typed object model: step classes with build() methods, factory functions for base models, modifier functions for add-ons. Each step knows how to render itself. No central renderer, no parser, no compose engine.
Factory function → SectionInvoice { steps: InvoiceStep[] }
→ Modifier functions mutate steps
→ buildSection() sorts by order, calls step.build()
→ Markdown output
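The flow above can be sketched in a few lines of TypeScript. Names like `SimpleStep`, `tubToTiledShower`, and `withDoubleVanity` are illustrative stand-ins, not the actual codebase:

```typescript
// Illustrative sketch of the typed step model (hypothetical names).
interface InvoiceStep {
  order: number;
  build(): string; // each step knows how to render itself
}

class SimpleStep implements InvoiceStep {
  constructor(
    public order: number,
    private title: string,
    private detail: string,
  ) {}
  build(): string {
    return `**${this.title}**: ${this.detail}`;
  }
}

interface SectionInvoice { steps: InvoiceStep[] }

// Factory function: one per base model.
function tubToTiledShower(): SectionInvoice {
  return {
    steps: [
      new SimpleStep(10, "Demolition", "Remove existing tub and surround"),
      new SimpleStep(20, "Shower base", "Install tiled shower base"),
      new SimpleStep(30, "Vanity", "Install single-sink vanity"),
    ],
  };
}

// Modifier function: mutates steps programmatically. No AI touches content.
function withDoubleVanity(section: SectionInvoice): SectionInvoice {
  section.steps = section.steps.filter((s) => s.order !== 30);
  section.steps.push(new SimpleStep(30, "Vanity", "Install double-sink vanity"));
  return section;
}

// buildSection: sort by order, let each step render itself.
function buildSection(section: SectionInvoice): string {
  return section.steps
    .sort((a, b) => a.order - b.order)
    .map((s) => s.build())
    .join("\n");
}
```

Calling `buildSection(withDoubleVanity(tubToTiledShower()))` yields the section markdown with the vanity step swapped, and the same input always produces the same output.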
Then I deleted everything else. The entire multi-agent pipeline — extractor, reviewer, section agents, post-processor, AI client, trace logger — all gone. About 1,900 lines, just wiped.
In its place: an MCP server with 6 deterministic tools. (If you're not familiar with MCP — Model Context Protocol — think of it as a plugin system for LLMs. It lets an AI call external tools in a standardized way.)
Claude Code Opus (the brain)
→ list_catalog — browse available models and modifiers
→ describe_item — inspect a model or modifier in detail
→ build_section — factory + modifiers + preferences → markdown
→ assemble_invoice — header + sections → final file
→ get_invoice — retrieve a saved invoice
→ list_invoices — list all saved invoices
Zero AI inside the server. No API keys. No AI SDK dependencies. The MCP server is purely mechanical — structured input in, deterministic output out. Claude Code Opus is the only brain. It reads the user's prompt, reasons about what models and modifiers to use, and calls the tools.
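Stripped of protocol plumbing, the server's core is just a name-to-handler dispatch over pure functions. A sketch of that shape (illustrative names and toy catalog; the real server would register these handlers through an MCP SDK rather than a plain map):

```typescript
// Sketch of the server's deterministic core. Tool names match the list
// above; the handler bodies and catalog data are illustrative only.
type ToolHandler = (args: Record<string, unknown>) => string;

const catalog = {
  models: ["tub-to-tiled-shower", "kitchen-full-remodel"],
  modifiers: ["shower-bench", "double-vanity"],
};

const tools: Record<string, ToolHandler> = {
  // Browse available models and modifiers.
  list_catalog: () => JSON.stringify(catalog),
  // Build a section: factory + modifiers + preferences → markdown.
  build_section: (args) =>
    `## ${args.model}\n(steps rendered deterministically here)`,
};

// The LLM is the only caller; the server just dispatches.
function callTool(name: string, args: Record<string, unknown>): string {
  const handler = tools[name];
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(args);
}
```

Because every handler is a pure function of its arguments, the whole server can be tested without mocking a single AI call.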
Here's what I love about this: if a better LLM comes out tomorrow, I swap one thing. The tools don't change. The templates don't change. The step classes don't change. The intelligence is cleanly separated from the machinery.
The rest of Day 4 was pure expansion. I added 6 new bathroom models, 4 new sections (foyer, painting, flooring, rough-in), and a bunch of quality fixes. The typed architecture made this trivial — add a factory function, add some modifier functions, register them, done. No parser changes, no renderer changes, no prompt engineering. Just code.
Day 5: E2E Testing and Polish
End-to-end testing surfaced 13 template fixes. Every single fix was localized to a specific step class or modifier function. No cascading failures. No "fixing this broke that."
Wrong output? Find the step class. Fix the build() method. Done.
That's it.
Compare that to Day 3 where a single wording change could ripple through the extractor, reviewer, section agent, and post-processor. Night and day.
What I Learned
More AI layers = more hallucination, not less
This was the big one. My gut told me that adding a reviewer agent would catch mistakes from the generator agent. In practice, each AI layer adds its own failure modes. The reviewer hallucinates corrections. The generator hallucinates content. You end up debugging interactions between AI agents instead of debugging your actual logic.
It's like playing a game of telephone — every retelling introduces distortion. Adding more people to the chain doesn't make the message clearer.
One smart LLM + dumb tools > many AI agents
The final system has exactly one AI decision-maker: Claude Code Opus. It reads the prompt, picks the right models and modifiers, and calls deterministic tools. That's it.
Don't distribute intelligence across multiple AI agents. Concentrate it in one capable model and give it well-defined, deterministic building blocks.
This is the architecture I'd recommend to anyone building AI-powered automation right now.
Code beats "AI review" every time
When I caught the AI hallucinating vanity types, I had two options: add another AI agent to review vanity selections, or write a 5-line function that maps user input to vanity type deterministically.
The function is faster, cheaper, 100% reliable, and debuggable. Not even close.
Every time I replaced an AI decision with a programmatic rule, reliability went up. Not because AI is bad — but because most of these decisions weren't actually ambiguous. "Double sink means vanity sink quantity = 2" doesn't need AI. It needs an if-statement.
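For instance, the entire vanity rule fits in a handful of lines. A sketch, with hypothetical names:

```typescript
// The deterministic alternative to an "AI reviewer" for vanity selection.
// Function and type names are illustrative, not the actual codebase.
interface VanityChoice {
  type: "single" | "double";
  sinkQuantity: number;
}

function pickVanity(prompt: string): VanityChoice {
  // "Double sink means vanity sink quantity = 2" needs an if-statement, not AI.
  if (/double\s+(sink|vanity)/i.test(prompt)) {
    return { type: "double", sinkQuantity: 2 };
  }
  return { type: "single", sinkQuantity: 1 };
}
```

No prompt, no API call, no nondeterminism, and when it's wrong, the fix is one visible line.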
Bugs stay fixed with code. With prompts, they don't.
This one drove me crazy during the multi-agent phase.
With code, you find a bug, you fix it, and it's fixed forever. Done. Move on.
But with prompt engineering, fixing one thing often breaks something else. You tweak a prompt to stop hallucinating vanity types, and now it starts dropping fixture steps. You fix that, and suddenly the reviewer agent over-corrects in a different way. It's like whack-a-mole.
With the deterministic system, Day 5 was 13 bug fixes. Each one was surgical — change a step class, verify the output, done. None of them interfered with each other. That's just not possible when your logic lives in natural language prompts.
Save AI for what's genuinely ambiguous
The one place AI genuinely earns its keep in this system: understanding user intent. "I want a master bathroom, tub to tiled shower, with a bench, double vanity, keep the existing exhaust fan" — parsing that into structured selections requires real language understanding. That's AI's job. Everything downstream is just code.
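Concretely, the LLM's only job is to turn that sentence into a typed spec that the deterministic tools consume. The shape below is a hypothetical illustration, not the actual schema:

```typescript
// Hypothetical shape of the structured spec the LLM produces from a
// free-text prompt. Everything downstream consumes this, never prose.
interface InvoiceSpec {
  model: string;       // base model id
  modifiers: string[]; // add-on ids
  preferences: Record<string, string | number | boolean>;
}

// "I want a master bathroom, tub to tiled shower, with a bench,
// double vanity, keep the existing exhaust fan" might become:
const spec: InvoiceSpec = {
  model: "master-bath-tub-to-tiled-shower",
  modifiers: ["shower-bench", "double-vanity"],
  preferences: { keepExistingExhaustFan: true },
};
```

Once the spec exists, every remaining decision is a lookup or an if-statement.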
The Numbers
| Metric | Multi-Agent Pipeline | MCP-First |
|---|---|---|
| AI calls per invoice | 4-8 | 0 (inside the server) |
| Lines of AI orchestration code | ~1,900 | 0 |
| AI dependencies | @ai-sdk/anthropic, @ai-sdk/openai, ai, dotenv | None |
| Debugging time per issue | 30-60 min (which agent?) | 5 min (which step class?) |
| Output reliability | ~85% (needed human review) | ~99% (deterministic) |
Final Thoughts
I started this project thinking I'd build a sophisticated multi-agent system. I ended up building something way simpler — and way better. Reno Stars gets reliable invoices. I get a system I can maintain and extend without prompt engineering.
If you're building AI-powered automation, start by asking: "What parts of this actually need AI?" You might be surprised how small the answer is. Put your best model in charge of that small part, make everything else deterministic, and resist the urge to add more agents when something goes wrong.
The fix is almost never "more AI." It's usually "less AI, better code."