Building Storytell's Multi-Agent Platform
Recently a user shared a frustration: they’d asked Storytell a deep research question—one that needed both web sources and their internal documents.
The execution started strong: it called multiple tools, fetched multiple sources, and consumed close to 80% of the available tokens. Then came the warning: context window exhausted. The response was incomplete — external research but no internal analysis, no synthesis. Half an answer to a question that mattered.
That failure crystallized something we’d been circling around: single-agent architectures hit a wall when questions span multiple sources. The problem isn’t intelligence — it’s architecture.
Over the past year building Storytell, we ran into this limitation repeatedly. What starts as a simple prompt-and-response loop quickly becomes unwieldy when users ask complex questions spanning external research, internal documents, and multi-step reasoning. The context window fills up, the model loses track of its goal, and quality degrades. If you’ve built anything with LLMs, you’ve likely hit the same wall.
At Storytell, we’re solving a specific problem: helping teams work with their unstructured data. About 80% of organizational knowledge is trapped in documents, emails, Slack threads, and meeting recordings — information that’s valuable but effectively unsearchable. Users ask complex questions that require synthesizing web research with internal documents, and they expect answers without needing to understand how those answers are assembled.
If you’ve used Claude Code or other coding agents, you’ve already experienced a version of this problem — and its solution. When you ask it to “implement a complex new feature,” it doesn’t try to do everything in one shot. It spawns specialized agents, each configured with specific capabilities and constraints, that run in isolation before returning results to the orchestrator.
This orchestrator-worker pattern is what we adopted and incorporated into Storytell. This article shares what we learned making it production-ready for non-technical users.
Foundational Concepts
The vocabulary we use:
- Orchestrator: The parent agent that receives a user request, decides how to decompose it, and delegates work to specialized agents. It maintains the primary context and assembles final results.
- Agent (worker): A specialized agent spawned to execute a focused task. Agents run in isolated context windows — they don't inherit the parent's conversation history and don't consume its token budget.
- Context isolation: The architectural principle that agents operate independently. The parent never sees raw tool outputs from agents; it receives only structured summaries. This prevents context pollution and enables deep task decomposition.
With these concepts in place, let’s look at the core design.
The Core Design: Orchestrator-Worker
The orchestrator-worker pattern is simple: a single orchestrator that spawns specialized workers on demand. Each worker is effectively a new LLM stream with its own context window and set of tools available.
The challenge with single-agent systems is a tradeoff: make them too general and they produce shallow answers; make them too specialized and you need many of them, each requiring coordination. The orchestrator-worker pattern sidesteps this by letting one agent decide what needs doing while delegating how to focused workers.
When a user asks a question, the orchestrator evaluates whether it can answer directly or needs to delegate. Simple queries get handled inline; complex ones spawn specialized workers. If the question requires both web research and internal document analysis, it can spawn multiple workers in parallel — each operating independently before returning results.
The orchestrator never sees the raw web pages or document chunks. It receives pre-digested summaries, keeping its context clean for synthesis and user interaction.
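To make the shape of this concrete, here is a minimal sketch of the loop in TypeScript. The names (`llm`, `spawnWorker`, `AgentSummary`) and the delegation heuristic are ours for illustration, not Storytell's actual API; the point is the flow: decide, delegate in parallel, synthesize from summaries only.

```typescript
// Minimal orchestrator loop (illustrative names, not Storytell's actual API).

type AgentType = "explorer" | "librarian" | "researcher" | "plan" | "general";

interface AgentSummary {
  agent: AgentType;
  summary: string;        // pre-digested findings, never raw tool output
  tokensConsumed: number; // spent inside the worker's own context
}

// Assumed helpers: an LLM call and a worker runner with an isolated context.
declare function llm(prompt: string): Promise<string>;
declare function spawnWorker(agent: AgentType, task: string): Promise<AgentSummary>;

async function orchestrate(question: string): Promise<string> {
  // 1. Decide: answer inline or decompose into delegated tasks.
  const plan = await llm(`Can you answer directly, or which workers are needed?\n${question}`);

  if (plan.includes("ANSWER_DIRECTLY")) {
    return llm(question); // simple queries are handled inline
  }

  // 2. Delegate: spawn workers in parallel, each with only a task prompt.
  const results = await Promise.all([
    spawnWorker("explorer", `Research the web: ${question}`),
    spawnWorker("librarian", `Search internal documents: ${question}`),
  ]);

  // 3. Synthesize from summaries only; raw pages and chunks never enter this context.
  const digest = results.map(r => `${r.agent}: ${r.summary}`).join("\n");
  return llm(`Synthesize an answer to "${question}" from:\n${digest}`);
}
```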
Agent Specialization Through Configuration
At Storytell, our agent specialization is purely configuration-driven. Rather than writing different agent classes, we define agents through declarative specifications that control three things:
- Tool access: Which tools the agent can invoke
- Execution limits: Maximum turns, tool calls, and token budget
- Behavioral instructions: Embedded guidance for how to approach tasks
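As a rough sketch, a specification can be expressed as a small declarative object. The field names below are illustrative rather than our actual schema:

```typescript
// Illustrative shape of a declarative agent specification (not the actual schema).
interface AgentSpec {
  name: string;              // e.g. "explorer"
  tools: string[];           // tool access: the only tools this agent may invoke
  limits: {
    maxTurns: number;        // execution limits: turns...
    maxToolCalls: number;    // ...total tool invocations...
    tokenBudget: number;     // ...and token budget
  };
  instructions: string;      // behavioral guidance embedded in the agent's prompt
}
```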
Our five agent types are specialized as follows:
| Agent | Available Tools | Focus Area |
|---|---|---|
| Explorer | web_search, web_page | External research — finding information from the open web |
| Librarian | knowledge_base_search | Internal knowledge — searching uploaded documents and saved content |
| Researcher | web_search, web_page, knowledge_base_search | Multi-source synthesis — combining external and internal information |
| Plan | All tools | Task decomposition — breaking complex requests into steps |
| General | All tools | Full capability — used when specialization isn’t needed, and only context offload is required |
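Registering a new agent type is then just a matter of declaring another spec. Here is a hypothetical registration of the Explorer agent from the table, with example limits and an assumed `registerAgentSpec` call:

```typescript
// Assumed registry API for this sketch.
declare function registerAgentSpec(spec: AgentSpec): void;

// Hypothetical registration of the Explorer agent from the table above.
const explorerSpec: AgentSpec = {
  name: "explorer",
  tools: ["web_search", "web_page"], // external research only
  limits: { maxTurns: 8, maxToolCalls: 12, tokenBudget: 40_000 }, // example numbers
  instructions:
    "Research the open web for the task you are given. " +
    "Return a concise, source-cited summary; do not return raw page content.",
};

registerAgentSpec(explorerSpec); // shipping a new agent type is just this
```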
This approach makes the system highly extensible.
Adding a new agent type requires no code changes: we define a new specification, register it, and the system handles everything else — tool registry wiring, prompt and instruction injection, and execution limits.
Specialization is enforced architecturally, not through prompting. When spawning an Explorer agent, the system filters the tool registry down to only what that agent is allowed to use. The agent literally cannot call tools outside its specification — no prompt-engineering required.
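One way to implement that enforcement, reusing the `AgentSpec` shape sketched above and assuming a central registry keyed by tool name:

```typescript
// Hypothetical enforcement: the worker is constructed with a pre-filtered registry,
// so tools outside its spec simply do not exist from the model's point of view.
type Tool = { name: string; run: (args: unknown) => Promise<string> };

function toolsForAgent(spec: AgentSpec, registry: Map<string, Tool>): Tool[] {
  return spec.tools
    .map(name => registry.get(name))
    .filter((t): t is Tool => t !== undefined);
}

// Only these tool definitions are ever sent to the model for this agent,
// so an Explorer literally cannot call knowledge_base_search.
```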
This configuration-driven approach enables rapid iteration. New agents are added via declarative specs. New tools are registered in a central registry and become instantly available to any agent configured to use them. We can ship new capabilities quickly and experiment with different agent configurations.
Specialization alone isn’t enough. The real breakthrough is context isolation.
Context Isolation: Why It Matters
Context isolation changed everything. Without it, delegated work consumes the parent’s context budget, and deep task decomposition becomes impossible.
With isolation, each agent gets its own context window. It doesn't inherit the parent's conversation history beyond the task prompt — a description of what to accomplish plus the relevant context. When finished, it returns a structured summary to the parent, not raw outputs. The parent never sees every file or tool result the agent processed.
We consume more total tokens across all contexts. But the benefit — being able to handle arbitrarily complex multi-step workflows — far outweighs the cost.
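Concretely, spawning a worker can amount to starting a brand-new message list that contains only the agent's instructions and the task prompt. A sketch, with an assumed `runAgentLoop` helper standing in for the actual execution loop:

```typescript
// Illustrative: a worker's context starts from scratch with only the task prompt.
interface ChatMessage { role: "system" | "user"; content: string }

interface WorkerResult {
  summary: string;        // the only thing the parent ever sees
  tokensConsumed: number; // what the worker spent in its own window
}

// Assumed: an agent loop that runs the messages plus allowed tools to completion.
declare function runAgentLoop(messages: ChatMessage[], tools: string[]): Promise<WorkerResult>;

async function runWorker(spec: AgentSpec, taskPrompt: string): Promise<WorkerResult> {
  const messages: ChatMessage[] = [
    { role: "system", content: spec.instructions },
    { role: "user", content: taskPrompt }, // no parent conversation history here
  ];
  return runAgentLoop(messages, spec.tools); // isolated window; a summary comes back
}
```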
Isolation keeps contexts clean, but we still need to handle pressure within a single context.
Context Management
Context isolation solves the problem of agents polluting each other’s context. But what happens when a single agent’s context starts to fill up?
We track tokens at two levels: total consumed (everything the agent processed) and returned to parent (only the summary). This separation is key — a worker might consume significant tokens processing web pages, but return only a concise synthesis to the orchestrator. The parent's context grows only by what it receives, and both the agent and the parent always know how much of their budget remains.
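A minimal sketch of that two-level accounting, with invented field names and example numbers:

```typescript
// Illustrative two-level token accounting (field names are invented).
interface TokenUsage {
  totalConsumed: number;    // everything the worker processed in its own context
  returnedToParent: number; // only the tokens in the summary handed back
}

function recordHandoff(workerUsage: TokenUsage, parentBudgetLeft: number): number {
  // The parent pays only for what it receives, not for what the worker read.
  return parentBudgetLeft - workerUsage.returnedToParent;
}

// Example: a worker that read ~60k tokens of web pages but returned a ~1.5k summary
// costs the orchestrator 1.5k tokens of its budget, not 60k.
const remaining = recordHandoff({ totalConsumed: 60_000, returnedToParent: 1_500 }, 100_000);
// remaining === 98_500
```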
When a worker approaches its budget limit, the system guides it to wrap up gracefully rather than failing abruptly. Tools stop accepting new requests, and the agent is prompted to summarize its findings. This gives the model a chance to produce useful output instead of hitting a hard wall.
The goal is always graceful degradation: partial results are better than failures.
Beyond token tracking, agents need hard limits to prevent runaway execution.
Flexible Execution Control
Agents need guardrails. Without them, a runaway agent can exhaust resources, timeout, or loop indefinitely. We implement composable stop conditions that are evaluated each turn:
- Token budget threshold: When remaining tokens fall below a percentage of the budget, the agent receives guidance to wrap up. Tools return errors rather than executing, and the agent is prompted to summarize its findings.
- Turn limit: Each agent type has a maximum turn count. This prevents infinite loops while allowing enough iterations for complex tasks.
- Tool-call limit: Independent of turns, we also cap total tool invocations. An agent might make multiple tool calls per turn, so this provides additional protection.
- Completion signals: Certain tools signal that the agent has completed its work. When the agent calls those tools, it's declaring "I'm done" — the executor can stop cleanly.
- Hard ceiling: A final limit that prevents any single agent from running indefinitely.
These conditions compose — the first one triggered wins. This lets us tune limits per agent type without sacrificing capability for complex work. Checking conditions each turn adds overhead, but it's far better than surfacing errors to users.
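One way to express that composition is a list of predicates checked once per turn, where the first condition to fire decides the outcome. The condition names, ordering, and thresholds below are illustrative:

```typescript
// Illustrative composable stop conditions, checked once per turn; first hit wins.
interface AgentState {
  turns: number;
  toolCalls: number;
  tokensRemaining: number;
  tokenBudget: number;
  completionToolCalled: boolean;
}

type StopCondition = (s: AgentState) => string | null; // reason to stop, or null

const stopConditions: StopCondition[] = [
  s => s.completionToolCalled ? "completed" : null,
  s => s.tokensRemaining < 0.1 * s.tokenBudget ? "budget-exhausted" : null, // example threshold
  s => s.turns >= 12 ? "max-turns" : null,          // example per-agent limit
  s => s.toolCalls >= 20 ? "max-tool-calls" : null, // example cap
];

function shouldStop(state: AgentState): string | null {
  for (const condition of stopConditions) {
    const reason = condition(state);
    if (reason) return reason; // first triggered condition wins
  }
  return null;
}
```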
Beyond the Basics: Production Patterns
The patterns above form the foundation, but production systems require additional robustness. Here are a few extra decisions that made the difference:
- Progressive budget warnings: As context fills, agents receive increasingly urgent signals to wrap up. Tools begin refusing execution, prompting summarization before hitting hard limits.
- Real-time streaming: Users see progress as agents work, not just the final result. Tool invocations and completions from parallel agents are unified into a single stream and forwarded in real time, creating responsiveness even for complex multi-step tasks.
- Error handling: When agents fail, we guide the LLM to handle errors gracefully. The orchestrator evaluates the error and chooses between retrying with narrower scope, delegating to a different agent type, or surfacing the issue to the user with actionable context and partial results.
- Vendor-agnostic design: Tools and agents are defined once and work with any LLM vendor. This enables custom integration with specialized models, graceful fallbacks when services degrade, and future-proofing against lock-in.
- Per-tool output limits: Each tool has configurable output limits to prevent any single result from consuming the context window. When a limit is exceeded, the system can either reject the output (prompting the LLM to retry with narrower parameters) or truncate it with a marker (when partial content is acceptable). A sketch follows this list.
- Dual output formats: Every tool produces both user-facing artifacts and LLM-optimized output. For example, web search results display as rich cards for users while providing structured data to the model. A second sketch after the list shows the shape.
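Here is a sketch of the reject/truncate handling for oversized tool output mentioned above; the limit, strategy names, and error text are illustrative rather than the actual implementation:

```typescript
// Illustrative handling of oversized tool output: reject or truncate.
type OverflowStrategy = "reject" | "truncate";

function enforceOutputLimit(
  output: string,
  maxChars: number,
  strategy: OverflowStrategy,
): { ok: boolean; content: string } {
  if (output.length <= maxChars) return { ok: true, content: output };

  if (strategy === "reject") {
    // The model sees an error and is nudged to retry with narrower parameters.
    return { ok: false, content: `Output exceeded ${maxChars} characters; narrow your query and try again.` };
  }
  // Truncate: accept partial data, marked so the model knows it is incomplete.
  return { ok: true, content: output.slice(0, maxChars) + "\n[truncated]" };
}
```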
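And a second sketch, for the dual output formats, using an invented result shape:

```typescript
// Illustrative dual-format tool result: one shape for the UI, one for the model.
interface SearchItem { title: string; url: string; snippet: string }

interface ToolResult {
  userArtifact: {        // rendered for people, e.g. rich web-search cards
    kind: "web_results_card";
    items: SearchItem[];
  };
  llmContent: string;    // compact structured text the model actually reads
}

function toDualFormat(results: SearchItem[]): ToolResult {
  return {
    userArtifact: { kind: "web_results_card", items: results },
    llmContent: results
      .map((r, i) => `${i + 1}. ${r.title} (${r.url}): ${r.snippet}`)
      .join("\n"),
  };
}
```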
Let’s see how all these patterns work together in a real scenario.
Putting It Together: A Research Workflow
Consider a realistic research workflow showing how these patterns combine.
A user asks: “What are the latest developments in multimodal AI, and how do they relate to the research papers in this project about vision-language models?”
This question requires external research (latest developments) and internal document analysis (saved papers). The sequence:

1. The orchestrator decomposes the request:
   - External research needed (latest developments)
   - Internal research needed (saved papers)
   - Synthesis needed (relate external to internal)
2. An Explorer and a Librarian are spawned in parallel. The Explorer searches the web and returns a summary of recent developments ("1. GPT-5... 2. Gemini... 3. Claude...").
3. The Librarian searches the papers saved in the project and returns its own summary ("1. CLIP variants... 2. VLM training... 3. Cross-modal alignment...").
4. The orchestrator synthesizes the two summaries into an answer that:
   - Maps external developments to the user's saved research
   - Identifies gaps and connections
   - Suggests next steps
A few things worth noting:

- Parallel execution: Explorer and Librarian run simultaneously, reducing time to completion.
- Independent tool usage: The Explorer made 4 tool calls; the Librarian made 4. Each is tracked independently against its limits.
- Clean handoffs: The orchestrator received two focused summaries, not raw search results or document chunks.
- Natural completion: Both agents called `summary` to signal completion, triggering clean termination.
- Context efficiency: The orchestrator's context contains only the user question and two summaries — plenty of room for a thoughtful synthesis.
This architecture powers Storytell’s approach to knowledge work.
Designed for Everyone, Not Just Developers
I mentioned Claude Code because it's one of the most popular agentic tools today, but different tools serve different audiences. Claude Code is built for developers writing code — technical depth and visibility are core to the experience. Storytell is built for knowledge work — querying documents, researching topics, synthesizing information, and generating consumable artifacts like reports, briefs, and presentations (and that's just scratching the surface of what it can do).
Storytell’s users span product managers, analysts, sales teams, executives, and AI enthusiasts. When a user asks “What do customers say about our pricing?”, they shouldn’t need to know that the question was routed to a specialized agent. They just see their question and the answer, with citations they can verify.
- Claude Code: For developers writing code. Technical visibility is central to the experience — understanding what agents do is part of the workflow.
- Storytell: For knowledge work. Generates consumable outcomes: answers, reports, briefs, presentations. Works out of the box, with full visibility and control for advanced users.
The default experience requires no setup. Users don’t need to select agent types, enable tools, or tune parameters. The system evaluates each question and decides what’s needed. A simple factual query might be answered directly. A complex research question spawns multiple agents in parallel.
But advanced users get the same depth as any developer tool. They can see which agents were spawned, what tools were called, and how the system arrived at its answer. They can also orchestrate how the agents will operate, write plans, and tune behavior just using natural language.
What This Means for Users
Users never see this complexity — and that’s the point. No agent configuration, no context management, no token budgets. Just questions and answers.
Better Answers on Complex Questions: Questions that span web research and internal documents get proper treatment. Each source receives dedicated analysis before synthesis. No more “context window exceeded” failures on deep research tasks. A question like “How does our product compare to competitors based on recent reviews?” can spawn a fleet of agents to gather both public web content and internal sales documents.
Faster Time to Insight: Parallel execution means web search and document analysis happen simultaneously rather than sequentially. Real-time streaming shows progress as work happens. Even when an agent run takes a few minutes, the same research done manually would take hours or days.
Adapts to Complexity: Simple questions get quick answers; complex ones spawn deep research. The system decides what’s needed — no configuration, no mode switching, no prompt engineering required.
Building this system taught us that the hard part isn’t the architecture — it’s making complexity invisible. Users shouldn’t need to understand orchestrators and context windows to get answers from their data.
If you’re building something similar, the patterns here — orchestrator-worker, context isolation, error handling — are a solid starting point. AI is evolving at a pace we’ve never seen before, and what works today might need rethinking in a few months. We’re all learning together in this new landscape.
If you found this interesting, you can try it at Storytell.