Building Storytell's Multi-Agent Platform
Recently a user shared a frustration: they’d asked Storytell a deep research question—one that needed both web sources and their internal documents.
The execution started strong: it called multiple tools, fetched multiple sources, and consumed close to 80% of the available tokens. Then came the warning: context window exhausted. The response was incomplete — external research but no internal analysis, no synthesis. Half an answer to a question that mattered.
That failure crystallized something we’d been circling around: single-agent architectures hit a wall when questions span multiple sources. The problem isn’t intelligence — it’s architecture.
Over the past year building Storytell, we ran into this limitation repeatedly. What starts as a simple prompt-and-response loop quickly becomes unwieldy when users ask complex questions spanning external research, internal documents, and multi-step reasoning. The context window fills up, the model loses track of its goal, and quality degrades. If you’ve built anything with LLMs, you’ve likely hit the same wall.
At Storytell, we’re solving a specific problem: helping teams work with their unstructured data. About 80% of organizational knowledge is trapped in documents, emails, Slack threads, and meeting recordings — information that’s valuable but effectively unsearchable. Users ask complex questions that require synthesizing web research with internal documents, and they expect answers without needing to understand how those answers are assembled.
If you’ve used Claude Code or other coding agents, you’ve already experienced a version of this problem — and its solution. When you ask it to “implement a complex new feature,” it doesn’t try to do everything in one shot. It spawns specialized agents, each configured with specific capabilities and constraints, that run in isolation before returning results to the orchestrator.
This orchestrator-worker pattern is what we adopted and incorporated into Storytell. This article shares what we learned making it production-ready for non-technical users.
Foundational Concepts
The vocabulary we use:
- Orchestrator: The parent agent that receives a user request, decides how to decompose it, and delegates work to specialized agents. It maintains the primary context and assembles final results.
- Agent (worker): A specialized agent spawned to execute a focused task. Agents run in isolated context windows — they don't inherit the parent's conversation history and don't consume its token budget.
- Context isolation: The architectural principle that agents operate independently. The parent never sees raw tool outputs from agents; it receives only structured summaries. This prevents context pollution and enables deep task decomposition.
With these concepts in place, let’s look at the core design.
The Core Design: Orchestrator-Worker
The orchestrator-worker pattern is simple: a single orchestrator that spawns specialized workers on demand. Each worker is effectively a new LLM stream with its own context window and set of tools available.
The challenge with single-agent systems is a tradeoff: make them too general and they produce shallow answers; make them too specialized and you need many of them, each requiring coordination. The orchestrator-worker pattern sidesteps this by letting one agent decide what needs doing while delegating how to focused workers.
When a user asks a question, the orchestrator evaluates whether it can answer directly or needs to delegate. Simple queries get handled inline; complex ones spawn specialized workers. If the question requires both web research and internal document analysis, it can spawn multiple workers in parallel — each operating independently before returning results.
The orchestrator never sees the raw web pages or document chunks. It receives pre-digested summaries, keeping its context clean for synthesis and user interaction.
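To make the shape of this concrete, here is a minimal sketch of the loop in TypeScript. The names (`llm`, `spawnWorker`, `AgentSummary`) and the delegation heuristic are ours for illustration, not Storytell's actual API; the point is the flow: decide, delegate in parallel, synthesize from summaries only.

```typescript
// Minimal orchestrator loop (illustrative names, not Storytell's actual API).

type AgentType = "explorer" | "librarian" | "researcher" | "plan" | "general";

interface AgentSummary {
  agent: AgentType;
  summary: string;        // pre-digested findings, never raw tool output
  tokensConsumed: number; // spent inside the worker's own context
}

// Assumed helpers: an LLM call and a worker runner with an isolated context.
declare function llm(prompt: string): Promise<string>;
declare function spawnWorker(agent: AgentType, task: string): Promise<AgentSummary>;

async function orchestrate(question: string): Promise<string> {
  // 1. Decide: answer inline or decompose into delegated tasks.
  const plan = await llm(`Can you answer directly, or which workers are needed?\n${question}`);

  if (plan.includes("ANSWER_DIRECTLY")) {
    return llm(question); // simple queries are handled inline
  }

  // 2. Delegate: spawn workers in parallel, each with only a task prompt.
  const results = await Promise.all([
    spawnWorker("explorer", `Research the web: ${question}`),
    spawnWorker("librarian", `Search internal documents: ${question}`),
  ]);

  // 3. Synthesize from summaries only; raw pages and chunks never enter this context.
  const digest = results.map(r => `${r.agent}: ${r.summary}`).join("\n");
  return llm(`Synthesize an answer to "${question}" from:\n${digest}`);
}
```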
Agent Specialization Through Configuration
At Storytell, our agent specialization is purely configuration-driven. Rather than writing different agent classes, we define agents through declarative specifications that control three things:
- Tool access: Which tools the agent can invoke
- Execution limits: Maximum turns, tool calls, and token budget
- Behavioral instructions: Embedded guidance for how to approach tasks
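As a rough sketch, a specification can be expressed as a small declarative object. The field names below are illustrative rather than our actual schema:

```typescript
// Illustrative shape of a declarative agent specification (not the actual schema).
interface AgentSpec {
  name: string;              // e.g. "explorer"
  tools: string[];           // tool access: the only tools this agent may invoke
  limits: {
    maxTurns: number;        // execution limits: turns...
    maxToolCalls: number;    // ...total tool invocations...
    tokenBudget: number;     // ...and token budget
  };
  instructions: string;      // behavioral guidance embedded in the agent's prompt
}
```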
Our five agent types are specialized as follows:
| Agent | Available Tools | Focus Area |
|---|---|---|
| Explorer | web_search, web_page | External research — finding information from the open web |
| Librarian | knowledge_base_search | Internal knowledge — searching uploaded documents and saved content |
| Researcher | web_search, web_page, knowledge_base_search | Multi-source synthesis — combining external and internal information |
| Plan | All tools | Task decomposition — breaking complex requests into steps |
| General | All tools | Full capability — used when specialization isn’t needed, and only context offload is required |
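Registering a new agent type is then just a matter of declaring another spec. Here is a hypothetical registration of the Explorer agent from the table, with example limits and an assumed `registerAgentSpec` call:

```typescript
// Assumed registry API for this sketch.
declare function registerAgentSpec(spec: AgentSpec): void;

// Hypothetical registration of the Explorer agent from the table above.
const explorerSpec: AgentSpec = {
  name: "explorer",
  tools: ["web_search", "web_page"], // external research only
  limits: { maxTurns: 8, maxToolCalls: 12, tokenBudget: 40_000 }, // example numbers
  instructions:
    "Research the open web for the task you are given. " +
    "Return a concise, source-cited summary; do not return raw page content.",
};

registerAgentSpec(explorerSpec); // shipping a new agent type is just this
```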
This approach makes the system highly extensible.
Adding a new agent type requires no code changes: we define a new specification, register it, and the system handles everything else — tool registry wiring, prompt and instruction injection, and execution limits.
Specialization is enforced architecturally, not through prompting. When spawning an Explorer agent, the system filters the tool registry down to only what that agent is allowed to use. The agent literally cannot call tools outside its specification — no prompt-engineering required.
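One way to implement that enforcement, reusing the `AgentSpec` shape sketched above and assuming a central registry keyed by tool name:

```typescript
// Hypothetical enforcement: the worker is constructed with a pre-filtered registry,
// so tools outside its spec simply do not exist from the model's point of view.
type Tool = { name: string; run: (args: unknown) => Promise<string> };

function toolsForAgent(spec: AgentSpec, registry: Map<string, Tool>): Tool[] {
  return spec.tools
    .map(name => registry.get(name))
    .filter((t): t is Tool => t !== undefined);
}

// Only these tool definitions are ever sent to the model for this agent,
// so an Explorer literally cannot call knowledge_base_search.
```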
This configuration-driven approach enables rapid iteration. New agents are added via declarative specs. New tools are registered in a central registry and become instantly available to any agent configured to use them. We can ship new capabilities quickly and experiment with different agent configurations.
Specialization alone isn’t enough. The real breakthrough is context isolation.
Context Isolation: Why It Matters
Context isolation changed everything. Without it, delegated work consumes the parent’s context budget, and deep task decomposition becomes impossible.
With isolation, each agent gets its own context window. It doesn't inherit the parent's conversation history beyond the task prompt — a description of what to accomplish plus the relevant context. When finished, it returns a structured summary to the parent, not raw outputs. The parent never sees every file or tool result the agent processed.
We consume more total tokens across all contexts. But the benefit — being able to handle arbitrarily complex multi-step workflows — far outweighs the cost.
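Concretely, spawning a worker can amount to starting a brand-new message list that contains only the agent's instructions and the task prompt. A sketch, with an assumed `runAgentLoop` helper standing in for the actual execution loop:

```typescript
// Illustrative: a worker's context starts from scratch with only the task prompt.
interface ChatMessage { role: "system" | "user"; content: string }

interface WorkerResult {
  summary: string;        // the only thing the parent ever sees
  tokensConsumed: number; // what the worker spent in its own window
}

// Assumed: an agent loop that runs the messages plus allowed tools to completion.
declare function runAgentLoop(messages: ChatMessage[], tools: string[]): Promise<WorkerResult>;

async function runWorker(spec: AgentSpec, taskPrompt: string): Promise<WorkerResult> {
  const messages: ChatMessage[] = [
    { role: "system", content: spec.instructions },
    { role: "user", content: taskPrompt }, // no parent conversation history here
  ];
  return runAgentLoop(messages, spec.tools); // isolated window; a summary comes back
}
```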
Isolation keeps contexts clean, but we still need to handle pressure within a single context.
Context Management
Context isolation solves the problem of agents polluting each other’s context. But what happens when a single agent’s context starts to fill up?
We track tokens at two levels: total consumed (everything the agent processed) and returned to parent (only the summary). This separation is key — a worker might consume significant tokens processing web pages, but return only a concise synthesis to the orchestrator. The parent's context grows only by what it receives, and both the agent and the parent always know how much of their budget remains.
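A minimal sketch of that two-level accounting, with invented field names and example numbers:

```typescript
// Illustrative two-level token accounting (field names are invented).
interface TokenUsage {
  totalConsumed: number;    // everything the worker processed in its own context
  returnedToParent: number; // only the tokens in the summary handed back
}

function recordHandoff(workerUsage: TokenUsage, parentBudgetLeft: number): number {
  // The parent pays only for what it receives, not for what the worker read.
  return parentBudgetLeft - workerUsage.returnedToParent;
}

// Example: a worker that read ~60k tokens of web pages but returned a ~1.5k summary
// costs the orchestrator 1.5k tokens of its budget, not 60k.
const remaining = recordHandoff({ totalConsumed: 60_000, returnedToParent: 1_500 }, 100_000);
// remaining === 98_500
```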
When a worker approaches its budget limit, the system guides it to wrap up gracefully rather than failing abruptly. Tools stop accepting new requests, and the agent is prompted to summarize its findings. This gives the model a chance to produce useful output instead of hitting a hard wall.
The goal is always graceful degradation: partial results are better than failures.
Beyond token tracking, agents need hard limits to prevent runaway execution.
Flexible Execution Control
Agents need guardrails. Without them, a runaway agent can exhaust resources, timeout, or loop indefinitely. We implement composable stop conditions that are evaluated each turn:
- Token budget threshold: When remaining tokens fall below a percentage of the budget, the agent receives guidance to wrap up. Tools return errors rather than executing, and the agent is prompted to summarize its findings.
- Turn limit: Each agent type has a maximum turn count. This prevents infinite loops while allowing enough iterations for complex tasks.
- Tool-call limit: Independent of turns, we also cap total tool invocations. An agent might make multiple tool calls per turn, so this provides additional protection.
- Completion signals: Certain tools signal that the agent has completed its work. When the agent calls those tools, it's declaring "I'm done" — the executor can stop cleanly.
- Hard ceiling: A final limit that prevents any single agent from running indefinitely.
These conditions compose — the first one triggered wins. This lets us tune limits per agent type without sacrificing capability for complex work. Checking conditions each turn adds overhead, but it's far better than surfacing errors to users.
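One way to express that composition is a list of predicates checked once per turn, where the first condition to fire decides the outcome. The condition names, ordering, and thresholds below are illustrative:

```typescript
// Illustrative composable stop conditions, checked once per turn; first hit wins.
interface AgentState {
  turns: number;
  toolCalls: number;
  tokensRemaining: number;
  tokenBudget: number;
  completionToolCalled: boolean;
}

type StopCondition = (s: AgentState) => string | null; // reason to stop, or null

const stopConditions: StopCondition[] = [
  s => s.completionToolCalled ? "completed" : null,
  s => s.tokensRemaining < 0.1 * s.tokenBudget ? "budget-exhausted" : null, // example threshold
  s => s.turns >= 12 ? "max-turns" : null,          // example per-agent limit
  s => s.toolCalls >= 20 ? "max-tool-calls" : null, // example cap
];

function shouldStop(state: AgentState): string | null {
  for (const condition of stopConditions) {
    const reason = condition(state);
    if (reason) return reason; // first triggered condition wins
  }
  return null;
}
```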
Beyond the Basics: Production Patterns
The patterns above form the foundation, but production systems require additional robustness. Here are a few extra decisions that made the difference:
- Progressive budget warnings: As context fills, agents receive increasingly urgent signals to wrap up. Tools begin refusing execution, prompting summarization before hitting hard limits.
- Real-time streaming: Users see progress as agents work, not just the final result. Tool invocations and completions from parallel agents are unified into a single stream and forwarded in real time, creating responsiveness even for complex multi-step tasks.
- Error handling: When agents fail, we guide the LLM to handle errors gracefully. The orchestrator evaluates the error and chooses between retrying with narrower scope, delegating to a different agent type, or surfacing the issue to the user with actionable context and partial results.
- Vendor-agnostic design: Tools and agents are defined once and work with any LLM vendor. This enables custom integration with specialized models, graceful fallbacks when services degrade, and future-proofing against lock-in.
- Per-tool output limits: Each tool has configurable output limits to prevent any single result from consuming the context window. When a limit is exceeded, the system can either reject the output (prompting the LLM to retry with narrower parameters) or truncate it with a marker (when partial content is acceptable). A sketch follows this list.
- Dual output formats: Every tool produces both user-facing artifacts and LLM-optimized output. For example, web search results display as rich cards for users while providing structured data to the model. A second sketch after the list shows the shape.
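Here is a sketch of the reject/truncate handling for oversized tool output mentioned above; the limit, strategy names, and error text are illustrative rather than the actual implementation:

```typescript
// Illustrative handling of oversized tool output: reject or truncate.
type OverflowStrategy = "reject" | "truncate";

function enforceOutputLimit(
  output: string,
  maxChars: number,
  strategy: OverflowStrategy,
): { ok: boolean; content: string } {
  if (output.length <= maxChars) return { ok: true, content: output };

  if (strategy === "reject") {
    // The model sees an error and is nudged to retry with narrower parameters.
    return { ok: false, content: `Output exceeded ${maxChars} characters; narrow your query and try again.` };
  }
  // Truncate: accept partial data, marked so the model knows it is incomplete.
  return { ok: true, content: output.slice(0, maxChars) + "\n[truncated]" };
}
```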
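And a second sketch, for the dual output formats, using an invented result shape:

```typescript
// Illustrative dual-format tool result: one shape for the UI, one for the model.
interface SearchItem { title: string; url: string; snippet: string }

interface ToolResult {
  userArtifact: {        // rendered for people, e.g. rich web-search cards
    kind: "web_results_card";
    items: SearchItem[];
  };
  llmContent: string;    // compact structured text the model actually reads
}

function toDualFormat(results: SearchItem[]): ToolResult {
  return {
    userArtifact: { kind: "web_results_card", items: results },
    llmContent: results
      .map((r, i) => `${i + 1}. ${r.title} (${r.url}): ${r.snippet}`)
      .join("\n"),
  };
}
```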
Let’s see how all these patterns work together in a real scenario.
Putting It Together: A Research Workflow
Consider a realistic research workflow showing how these patterns combine.
A user asks: “What are the latest developments in multimodal AI, and how do they relate to the research papers in this project about vision-language models?”
This question requires external research (latest developments) and internal document analysis (saved papers). The sequence:

1. The orchestrator decomposes the request:
   - External research needed (latest developments)
   - Internal research needed (saved papers)
   - Synthesis needed (relate external to internal)
2. An Explorer and a Librarian are spawned in parallel. The Explorer searches the web and returns a summary of recent developments ("1. GPT-5... 2. Gemini... 3. Claude...").
3. The Librarian searches the papers saved in the project and returns its own summary ("1. CLIP variants... 2. VLM training... 3. Cross-modal alignment...").
4. The orchestrator synthesizes the two summaries into an answer that:
   - Maps external developments to the user's saved research
   - Identifies gaps and connections
   - Suggests next steps
A few things worth noting:

- Parallel execution: Explorer and Librarian run simultaneously, reducing time to completion.
- Independent tool usage: The Explorer made 4 tool calls; the Librarian made 4. Each is tracked independently against its limits.
- Clean handoffs: The orchestrator received two focused summaries, not raw search results or document chunks.
- Natural completion: Both agents called `summary` to signal completion, triggering clean termination.
- Context efficiency: The orchestrator's context contains only the user question and two summaries — plenty of room for a thoughtful synthesis.
This architecture powers Storytell’s approach to knowledge work.
Designed for Everyone, Not Just Developers
I mentioned Claude Code because it's one of the most popular agentic tools today, but different tools serve different audiences. Claude Code is built for developers writing code — technical depth and visibility are core to the experience. Storytell is built for knowledge work — querying documents, researching topics, synthesizing information, and generating consumable artifacts like reports, briefs, and presentations (and that's just scratching the surface of what it can do).
Storytell’s users span product managers, analysts, sales teams, executives, and AI enthusiasts. When a user asks “What do customers say about our pricing?”, they shouldn’t need to know that the question was routed to a specialized agent. They just see their question and the answer, with citations they can verify.
- Claude Code: For developers writing code. Technical visibility is central to the experience — understanding what agents do is part of the workflow.
- Storytell: For knowledge work. Generates consumable outcomes: answers, reports, briefs, presentations. Works out of the box, with full visibility and control for advanced users.
The default experience requires no setup. Users don’t need to select agent types, enable tools, or tune parameters. The system evaluates each question and decides what’s needed. A simple factual query might be answered directly. A complex research question spawns multiple agents in parallel.
But advanced users get the same depth as any developer tool. They can see which agents were spawned, what tools were called, and how the system arrived at its answer. They can also orchestrate how the agents will operate, write plans, and tune behavior just using natural language.
What This Means for Users
Users never see this complexity — and that’s the point. No agent configuration, no context management, no token budgets. Just questions and answers.
Better Answers on Complex Questions: Questions that span web research and internal documents get proper treatment. Each source receives dedicated analysis before synthesis. No more “context window exceeded” failures on deep research tasks. A question like “How does our product compare to competitors based on recent reviews?” can spawn a fleet of agents to gather both public web content and internal sales documents.
Faster Time to Insight: Parallel execution means web search and document analysis happen simultaneously rather than sequentially. Real-time streaming shows progress as work happens. Even when an agent run takes a few minutes, the same research done manually would take hours or days.
Adapts to Complexity: Simple questions get quick answers; complex ones spawn deep research. The system decides what’s needed — no configuration, no mode switching, no prompt engineering required.
Building this system taught us that the hard part isn’t the architecture — it’s making complexity invisible. Users shouldn’t need to understand orchestrators and context windows to get answers from their data.
If you’re building something similar, the patterns here — orchestrator-worker, context isolation, error handling — are a solid starting point. AI is evolving at a pace we’ve never seen before, and what works today might need rethinking in a few months. We’re all learning together in this new landscape.
If you found this interesting, you can try it at Storytell.