Building Agentic Deep Research Systems: From Hours to Minutes with AI-Powered Document Generation
I typed a topic into a box, walked off to make coffee, and came back to a finished Microsoft Word document. Twelve pages. Section headings, body copy, generated diagrams, inline citations with source links. The kind of report that would have eaten an afternoon of someone's week.
I'd built the thing that produced it — an AI Document Generator — so I knew it was just five specialised agents passing work between each other. But sitting there reading a coherent, cited document that hadn't existed when I put the kettle on, the part that stuck with me wasn't how fast it was. It was that I now had to read every line of it carefully, because I had no idea which of those confident citations were real.
That tension — generation got cheap, verification didn't — is the whole story. The rest of this post is how I built the generation half, and why the verification half is the part that actually matters.
What the numbers say (and where they stop)
I'm wary of productivity stats because they tend to measure the easy half. But the direction is real, and worth putting on the table before I get into the build:
- 40% average productivity boost for employees using AI in document-related tasks
- 66% faster completion times for reports and content generation
- 20-40% efficiency gains in document processing workflows
- 22% reduction in operating costs on average for companies investing in automation
- 80% reduction in loan processing costs achieved by Direct Mortgage Corp. through automated document workflows
Marc Benioff, Salesforce's CEO, frames it as a labour shift rather than a tooling one: "Digital labor is here, transforming productivity without growing the workforce. The age of AI agents is now."
Here's what every one of those figures quietly assumes: that the output is correct. They measure how fast you got to the document, not how long you then spent checking it. In my experience running agents in production, that second number is the one that decides whether the whole thing was worth it.
How the work gets divided
I've spent twenty years building systems that hand work between components — started out racking servers in a London Docklands colo around 2005, and the instinct never left. A multi-agent setup is the same discipline at a higher altitude: one job per worker, clean handoffs, no agent trying to do everything.
The Document Generator runs five of them.
The orchestrator owns the workflow end to end. It turns a topic into a research plan, hands tasks out, watches progress, and assembles the final document. It's the only agent that holds the whole picture; the rest are deliberately narrow.
The web research agent runs queries through Perplexity, scores each source for credibility, and structures what comes back into citable results. It carries its own retry and rate-limiting logic because real research means hundreds of requests, and a third of them will fail or throttle on a bad day.
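As a rough illustration of that retry logic, here is a minimal sketch of exponential backoff with jitter, using only the standard library. The names `TransientError` and `with_retry`, the attempt count, and the delays are all assumptions made for the example, not the generator's actual code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def with_retry(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles per attempt (0.5s, 1s, 2s, ...) plus up to 100 ms of jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a production version would also want to honour any retry hint the upstream API returns.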
The structure agent takes that raw research and lays out the document — section hierarchy, ordering, the logical spine from intro to conclusion. It runs on OpenAI's o3-mini, which is cheap and good enough for what is essentially an outlining problem.
The content writer does the heavy lifting: full prose for each section, on GPT-4o with token limits cranked up, holding a consistent voice across the whole piece.
The image agent generates the diagrams and illustrations through DALL-E 3, sized and captioned to slot into the layout.
Five workers, one coordinator. The reason it works isn't any single agent being clever — it's that none of them is asked to be clever about more than one thing.
The flow itself is linear with a parallel middle:
- The orchestrator analyses the topic and writes targeted research questions
- Research queries fire simultaneously, each scored for source credibility
- Findings feed into document structuring
- Sections write in parallel, pulling in images as they go
- Everything assembles into a formatted DOCX
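The shape of that flow can be sketched with `asyncio`. This is only an outline under stated assumptions: `research`, `write_section`, and `generate_document` are hypothetical stand-ins that return canned strings, where the real agents would call Perplexity and the OpenAI models.

```python
import asyncio


async def research(question: str) -> dict:
    # Stand-in for the web research agent; a real one would call a search API.
    await asyncio.sleep(0)
    return {"question": question, "finding": f"notes on {question}", "credibility": 0.8}


async def write_section(heading: str, findings: list[dict]) -> str:
    # Stand-in for the content writer (and any image generation for the section).
    await asyncio.sleep(0)
    return f"{heading}: drafted from {len(findings)} findings"


async def generate_document(topic: str) -> list[str]:
    # 1. Plan: turn the topic into targeted research questions.
    questions = [f"{topic}: background", f"{topic}: current state", f"{topic}: risks"]
    # 2. Research: all queries run concurrently.
    findings = await asyncio.gather(*(research(q) for q in questions))
    # 3. Structure: one section per question here; a real outline would be richer.
    headings = [f["question"] for f in findings]
    # 4. Write: sections are drafted concurrently; gather preserves input order.
    sections = await asyncio.gather(*(write_section(h, findings) for h in headings))
    # 5. Assemble: return sections in outline order (DOCX rendering omitted).
    return list(sections)


if __name__ == "__main__":
    for section in asyncio.run(generate_document("grid-scale batteries")):
        print(section)
```

The two `gather` calls are the parallel middle; everything else runs in sequence because each step needs the previous step's output.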
Where I've watched this land
The pattern isn't confined to my weekend repo. Teams shipping this in anger are getting real movement:
In healthcare revenue cycle management, Easterseals Central Illinois put specialised agents on documentation, coding, and claims. They cut accounts-receivable days by 35 and primary denials by 7%.
In banking, credit-risk systems now draft memos from multiple data sources to support relationship managers, with reported productivity increases of 20-60% and 30% faster credit turnaround.
In market research, multi-agent platforms flag data anomalies and synthesise insights, with a projected 60% productivity gain and $3 million in annual savings.
One practitioner described the before-state precisely: "Picture a customer service agent spending hours manually generating documents, switching between systems, and copying information back and forth. Now, imagine transforming this into a simple conversation. That's the intersection where AI meets document automation."
Notice what all three have in common: a human still signs the output. Nobody is letting the agents file the insurance claim or approve the loan unread. The agents draft; a person commits.
The plumbing
None of the interesting decisions here are about the AI. They're about keeping a long-running, failure-prone pipeline stable.
The system runs on FastAPI with Celery task queues backed by Redis. That choice is doing real work: research and content generation are slow and flaky, so each stage is an async task that can retry independently without taking the whole run down. SQLAlchemy persists state so a job that dies at section seven of ten doesn't start over from the topic.
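To make the resume behaviour concrete, here is a minimal checkpointing sketch. It swaps in the standard library's `sqlite3` for SQLAlchemy and a plain function call for a Celery task so that it runs on its own; the table layout and the names `open_store` and `run_job` are assumptions for illustration.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store and create the sections table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections ("
        "job_id TEXT, idx INTEGER, body TEXT, PRIMARY KEY (job_id, idx))"
    )
    return conn


def run_job(
    conn: sqlite3.Connection,
    job_id: str,
    headings: list[str],
    write: Callable[[str], str],
) -> list[str]:
    """Write each section once; sections already stored are reused on a rerun."""
    rows = conn.execute("SELECT idx, body FROM sections WHERE job_id = ?", (job_id,))
    done = dict(rows)  # maps section index to its saved body
    for idx, heading in enumerate(headings):
        if idx in done:
            continue  # finished in an earlier run, so skip the expensive call
        body = write(heading)
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute("INSERT INTO sections VALUES (?, ?, ?)", (job_id, idx, body))
        done[idx] = body
    return [done[i] for i in range(len(headings))]
```

Each section is committed as soon as it is written, so a crash loses at most the section in flight. Rerunning with the same `job_id` picks up at the first missing index.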
The Perplexity layer is where I spent the most care. Every result carries source attribution, a reliability score, and structured formatting for whatever consumes it downstream. That metadata isn't decoration — it's the only thing standing between a polished document and a confidently-wrong one. A citation the system can't trace to a real source gets flagged, not printed as fact.
The quality gates sit at each handoff:
- source credibility scoring before research is accepted
- content length and depth checks before a section passes
- format and style consistency across the assembled document
- citation accuracy and attribution before anything ships
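A sketch of how three of those gates might look as plain functions follows; the format and style gate is left out. The dataclasses, the 0.6 credibility threshold, and the 150-word minimum are hypothetical values chosen for the example, not the system's real settings.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    url: str
    credibility: float  # assumed scale: 0.0 (untrusted) to 1.0 (trusted)


@dataclass
class Section:
    heading: str
    body: str
    cited_urls: list[str] = field(default_factory=list)


def accept_sources(sources: list[Source], min_credibility: float = 0.6) -> list[Source]:
    """Credibility gate: only sources at or above the threshold reach the writer."""
    return [s for s in sources if s.credibility >= min_credibility]


def check_section(section: Section, accepted: list[Source], min_words: int = 150) -> list[str]:
    """Length and citation gates: an empty list means the section passes."""
    problems = []
    word_count = len(section.body.split())
    if word_count < min_words:
        problems.append(f"too short: {word_count} words, need {min_words}")
    known_urls = {s.url for s in accepted}
    for url in section.cited_urls:
        if url not in known_urls:
            # Flag rather than print: this citation is not in the accepted research.
            problems.append(f"untraceable citation: {url}")
    return problems
```

Returning a list of problems instead of raising on the first one lets the orchestrator decide whether to retry the section, route it to a human, or drop the offending citation.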
Scaling is the boring part that matters: configurable concurrency, rate limiting that respects the upstream APIs, progress tracking so you can see where a run is, and enough logging to actually debug a failure at 2am. Redis-backed queues mean you scale horizontally by adding workers, not by rewriting anything.
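Configurable concurrency can be as small as a semaphore around each call. The sketch below caps in-flight work inside a single process with `asyncio.Semaphore`; `run_bounded` is a hypothetical helper, and a Celery deployment would more likely set this through worker concurrency settings than in application code.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: list[Callable[[], Awaitable[T]]], limit: int) -> list[T]:
    """Run all jobs, with at most `limit` in flight at any moment."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:  # waits here while `limit` jobs are already running
            return await job()

    # gather returns results in the same order as `jobs`
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))
```

Setting `limit` from configuration lets the same code respect a strict upstream quota in one environment and run wide open in another.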
The bottleneck moved, and it's not where you'd guess
I now routinely run four or five Claude agents across different projects at once, and I've built a few of these orchestrators — Gluon, scraper-mcp, dagentic, a deep-research harness. The lesson is the same every time, and it's the one this whole post circles: the agents are no longer the constraint. Generation is solved-enough. The constraint is me, reading the output, deciding whether I trust it enough to act on.
That's why the source attribution and credibility scoring aren't a footnote in this build — they're the point. A document generator that produces beautiful prose with unverifiable claims hasn't saved you an afternoon. It's moved the afternoon from writing to fact-checking, and fact-checking someone else's confident draft is slower than writing your own.
We're watching a genuine shift in how people interact with information systems — Salesforce calls it going "from clicks to prompts." Instead of driving software through menus, you state intent and the system figures out the steps. Nitro Software puts the upside plainly: "This eliminates the tedious, manual work... allowing you to save significant time and focus on more important tasks."
And it's not a niche. 43.2% of U.S. workers now use generative AI regularly, with daily usage surging 233% in six months. The adoption argument is over.
But adoption isn't the same as trust. Every one of those workers is now downstream of an output they didn't write and have to decide whether to believe. That's a verification problem, and it scales worse than the generation problem we just solved.
If you want to build one
The repository is a complete starting point — it needs OpenAI and Perplexity keys and runs locally or under Docker. A few things I'd tell anyone adapting it.
Keep each agent to one responsibility. The moment an agent is doing two jobs, you can't test it, you can't debug it, and a failure in one half corrupts the other. The narrowness is the design, not a limitation of it.
Pick your use case for how checkable the output is, not how impressive it sounds. Structured, research-heavy documents — reports, proposals, analytical write-ups — work because their claims are traceable to sources. The further you drift from that, the harder verification gets, and verification is the cost that actually bites.
Instrument everything: agent performance, research quality, content metrics. You won't improve what you can't see.
And build the human checkpoint in from the start, not as a polite afterthought. The system runs autonomously right up to the point where someone has to stand behind the result — and that point is non-negotiable for anything that matters.
What's actually changed
Document automation is the visible edge of something larger. These systems are already creeping into analysis, planning, and other work that used to be unambiguously human. The capability curve is steep and it's not flattening.
But the thing that changed in my workflow isn't that I write fewer documents. It's where my attention goes. The machine took the part I was happy to give up — assembling, formatting, drafting — and handed back, undiluted, the part it can't do: deciding whether the result is true enough to put my name on.
I built a system that turns hours into minutes. The minutes it gives back are the ones I now spend verifying. That's not a complaint — it's the trade, stated honestly. The work didn't disappear. It moved to the one place a person still has to stand.