Source URL: https://simonwillison.net/2025/Jun/14/multi-agent-research-system/#atom-everything
Source: Simon Willison’s Weblog
Title: Anthropic: How we built our multi-agent research system
Feedly Summary: Anthropic: How we built our multi-agent research system
OK, I’m sold on multi-agent LLM systems now.
I’ve been pretty skeptical of these until recently: why make your life more complicated by running multiple different prompts in parallel when you can usually get something useful done with a single, carefully-crafted prompt against a frontier model?
This detailed description from Anthropic about how they engineered their “Claude Research” tool has cured me of that skepticism.
Reverse engineering Claude Code had already shown me a mechanism where certain coding research tasks were passed off to a "sub-agent" using a tool call. This new article describes a more sophisticated approach.
They start strong by providing a clear definition of the term "agent":
A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously.
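That "tools in a loop" definition translates almost directly into code. Here's a minimal sketch of a single such agent, assuming hypothetical call_model and run_tool helpers rather than any particular vendor SDK:

```python
# Minimal sketch of "an LLM autonomously using tools in a loop".
# call_model() and run_tool() are hypothetical stand-ins for a real model
# API and real tool implementations, not any specific SDK's interface.

def call_model(messages, tools):
    """Hypothetical: send the conversation plus tool schemas to an LLM and
    return its reply, which may request one or more tool calls."""
    raise NotImplementedError

def run_tool(name, arguments):
    """Hypothetical: execute a named tool (web search, page fetch, ...)."""
    raise NotImplementedError

def run_agent(task, tools, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools)
        messages.append({"role": "assistant", "content": reply["content"]})
        if not reply.get("tool_calls"):
            # The model produced a final answer instead of requesting tools.
            return reply["content"]
        for call in reply["tool_calls"]:
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return "Stopped after max_steps without a final answer."
```

A multi-agent system is then a lead agent whose tools include spawning more agents like this one, each with a narrower task.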
Why use multiple agents for a research system?
The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. […]
Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.
As anyone who has spent time with Claude Code will already have noticed, the downside of this architecture is that it can burn a lot more tokens:
There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance. […]
We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.
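Those multipliers make the economics easy to estimate. A back-of-the-envelope sketch, where the chat baseline and the per-token price are illustrative assumptions (only the 4× and 15× figures come from the quote above):

```python
# Rough cost comparison using the 4x / 15x token multipliers quoted above.
# The chat baseline and the $/1M-token price are illustrative assumptions.
chat_tokens = 5_000                      # assumed tokens for a typical chat exchange
price_per_million_tokens = 5.00          # assumed blended price in dollars

for label, multiplier in [("chat", 1), ("single agent", 4), ("multi-agent", 15)]:
    tokens = chat_tokens * multiplier
    cost = tokens / 1_000_000 * price_per_million_tokens
    print(f"{label:>12}: ~{tokens:,} tokens, about ${cost:.3f}")
```

At these assumed numbers a single multi-agent run still costs pennies, but the same 15× multiplier applied to long-running research sessions is where the "task value has to pay for it" constraint starts to bite.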
The key benefit is all about managing that 200,000 token context limit. Each sub-task has its own separate context, allowing much larger volumes of content to be processed as part of the research task.
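One way to picture that benefit: each subagent accumulates its own (potentially huge) transcript, and only a condensed summary ever enters the lead agent's context. The helpers below are hypothetical; this is a sketch of the pattern, not Anthropic's implementation:

```python
# Sketch: each subagent owns a separate context window; only distilled
# findings flow back to the lead agent. Both helpers are hypothetical.

def run_subagent(subtask):
    """Hypothetical: run a full tool-use loop with its own message history,
    possibly filling most of a 200,000-token context on its own."""
    ...

def condense(transcript):
    """Hypothetical: distill a long transcript down to the most important tokens."""
    ...

def gather_findings(subtasks):
    summaries = []
    for subtask in subtasks:
        transcript = run_subagent(subtask)      # large context, discarded afterwards
        summaries.append(condense(transcript))
    # The lead agent only ever sees the summaries, not the raw transcripts.
    return summaries
```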
Providing a "memory" mechanism is important as well:
The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan.
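A minimal sketch of that plan-persistence idea, assuming a file-based store (the article only says the plan is saved to "Memory", not how it is stored):

```python
# Sketch of persisting the research plan outside the context window.
# A JSON file per session is an assumption; the article only says the plan
# is saved to "Memory" so it survives context truncation.
import json
from pathlib import Path

MEMORY_DIR = Path("agent_memory")

def save_plan(session_id: str, plan: dict) -> None:
    MEMORY_DIR.mkdir(exist_ok=True)
    (MEMORY_DIR / f"{session_id}.json").write_text(json.dumps(plan, indent=2))

def load_plan(session_id: str) -> dict:
    return json.loads((MEMORY_DIR / f"{session_id}.json").read_text())

# The lead agent would call save_plan() immediately after planning, then
# re-read the plan with load_plan() after any context truncation or compaction.
```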
The rest of the article provides a detailed description of the prompt engineering process needed to build a truly effective system:
Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. […]
In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.
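Those four ingredients map naturally onto a small task-description structure. The field names below are illustrative, not Anthropic's schema:

```python
# Sketch of a subagent task description covering the four ingredients named
# above. The dataclass and its field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SubagentTask:
    objective: str                    # what the subagent should find out
    output_format: str                # e.g. "one bullet per finding, with source URLs"
    tool_guidance: list[str] = field(default_factory=list)  # tools and sources to prefer
    boundaries: str = ""              # what is explicitly out of scope

task = SubagentTask(
    objective="List the current board members of Apple, Microsoft and Nvidia",
    output_format="One line per person: name, company, role, source URL",
    tool_guidance=["web_search", "prefer official investor-relations pages"],
    boundaries="Do not research companies other than these three",
)
```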
They got good results from having special agents help optimize those crucial tool descriptions:
We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes.
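A rough sketch of how such a tool-testing agent might be structured; the loop and helper names are assumptions, not Anthropic's implementation. The idea is simply: exercise the tool repeatedly, record how it fails, then have a model rewrite the description so future agents avoid those failure modes:

```python
# Sketch of a tool-testing agent: exercise a flawed tool many times, collect
# the failures, then ask a model to rewrite its description. All helpers are
# hypothetical stand-ins.

def generate_test_call(tool_description):
    """Hypothetical: ask a model for a plausible invocation of the tool."""
    ...

def invoke_tool(arguments):
    """Hypothetical: actually call the (possibly flawed) MCP tool."""
    ...

def rewrite_description(tool_description, failures):
    """Hypothetical: ask a model to revise the description so the observed
    failure modes are explicitly warned against."""
    ...

def improve_tool_description(tool_description, trials=30):
    failures = []
    for _ in range(trials):
        args = generate_test_call(tool_description)
        try:
            invoke_tool(args)
        except Exception as exc:
            failures.append({"arguments": args, "error": str(exc)})
    if not failures:
        return tool_description        # nothing observed to fix
    return rewrite_description(tool_description, failures)
```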
Sub-agents can run in parallel, which provides significant performance boosts:
For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems.
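The two levels of parallelism map cleanly onto something like asyncio.gather: once at the lead-agent level across subagents, and again inside each subagent across its tool calls. The helpers here are hypothetical:

```python
# Sketch of the two levels of parallelism described above, using asyncio.
# run_subagent() and call_tool() are hypothetical async helpers.
import asyncio

async def call_tool(name, query):
    """Hypothetical: one tool invocation (web search, page fetch, ...)."""
    ...

async def run_subagent(subtask):
    # Level 2: a subagent fires several independent tool calls at once.
    return await asyncio.gather(
        call_tool("web_search", subtask),
        call_tool("news_search", subtask),
        call_tool("internal_docs_search", subtask),
    )

async def run_research(subtasks):
    # Level 1: the lead agent spins up its 3-5 subagents in parallel, not serially.
    return await asyncio.gather(*(run_subagent(t) for t in subtasks))

# asyncio.run(run_research(["subtask A", "subtask B", "subtask C"]))
```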
There's also an extensive section about their approach to evals – they found that LLM-as-a-judge worked well for them, but human evaluation was essential as well:
In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue.
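One way to express that kind of source-quality heuristic is as an explicit rubric handed to the LLM judge (or folded into the research prompts themselves). The wording below is illustrative, not Anthropic's actual prompt:

```python
# Illustrative source-quality rubric for an LLM-as-judge prompt. The criteria
# and wording are assumptions, not Anthropic's published heuristics.
JUDGE_RUBRIC = """\
Score each cited source from 1-5 for quality:
- Prefer primary and authoritative sources (official filings, academic PDFs,
  first-hand expert blogs) over SEO-optimized content farms and aggregators.
- Penalize answers whose key claims rest mainly on low-quality sources.
Then score the overall answer 1-5 for factual accuracy and completeness,
and justify both scores in one sentence each.
"""

def build_judge_prompt(question: str, answer: str, sources: list[str]) -> str:
    cited = "\n".join(f"- {s}" for s in sources)
    return f"{JUDGE_RUBRIC}\nQuestion: {question}\n\nAnswer:\n{answer}\n\nSources:\n{cited}"
```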
There’s so much useful, actionable advice in this piece. I haven’t seen anything else about multi-agent system design that’s anywhere near this practical.
They even added some example prompts from their Research system to their open source prompting cookbook. Here’s the bit that encourages parallel tool use:
And an interesting description of the OODA research loop used by the sub-agents:
Research loop: Execute an excellent OODA (observe, orient, decide, act) loop by (a) observing what information has been gathered so far, what still needs to be gathered to accomplish the task, and what tools are available currently; (b) orienting toward what tools and queries would be best to gather the needed information and updating beliefs based on what has been learned so far; (c) making an informed, well-reasoned decision to use a specific tool in a certain way; (d) acting to use this tool. Repeat this loop in an efficient way to research well and learn based on new results.
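That loop translates directly into code structure: observe and orient read the accumulated state, decide and act commit to one specific tool call. A sketch with hypothetical helpers:

```python
# Sketch of the OODA research loop quoted above. Every helper is a
# hypothetical stand-in for a model call or tool execution.

def identify_gaps(task, gathered):
    """Hypothetical (observe): what is still missing to accomplish the task?"""
    ...

def choose_tool_and_query(gaps, tools, gathered):
    """Hypothetical (orient): which tool and query best fill the biggest gap,
    given everything learned so far?"""
    ...

def execute_tool(tool, query):
    """Hypothetical (act): run the chosen tool with the chosen query."""
    ...

def summarize(gathered):
    """Hypothetical: condense the findings for the lead agent."""
    ...

def research_loop(task, tools, max_iterations=15):
    gathered = []
    for _ in range(max_iterations):
        gaps = identify_gaps(task, gathered)                 # observe
        if not gaps:
            break
        plan = choose_tool_and_query(gaps, tools, gathered)  # orient
        result = execute_tool(plan["tool"], plan["query"])   # decide + act
        gathered.append(result)
    return summarize(gathered)
```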
Tags: ai-assisted-search, anthropic, claude, evals, ai-agents, llm-tool-use, ai, llms, prompt-engineering, generative-ai, paper-review
AI Summary and Description: Yes
**Summary:** The text details how Anthropic built its multi-agent research system using LLMs (Large Language Models). It covers the advantages and trade-offs of deploying multiple agents for research tasks, reporting substantial performance gains over single-agent systems. The material is most relevant to AI practitioners and engineers building research and data-processing applications who need to weigh those gains against increased token costs.
**Detailed Description:**
– **Multi-Agent System Definition:**
– Anthropic defines a multi-agent system as multiple agents (LLMs autonomously using tools in a loop) working together to carry out research tasks.
– **Purpose of Multi-Agent Approach:**
– The aim is to enhance the efficiency of search processes by dividing tasks among subagents, enabling broader exploration and faster information acquisition.
– **Performance Improvements:**
– Internal evaluations showed a multi-agent system (Claude Opus 4 lead agent with Claude Sonnet 4 subagents) outperforming single-agent Claude Opus 4 by 90.2% on Anthropic's internal research eval, with the largest gains on breadth-first queries.
– **Challenges with Token Management:**
– The architecture burns through far more tokens: agents use roughly 4× the tokens of a chat interaction, and multi-agent systems roughly 15×, so the approach is only economically viable for tasks whose value justifies the extra cost.
– **Enhancing Context Handling:**
– A crucial advantage is that each sub-task gets its own context window, allowing the overall research task to cover far more material than would fit in a single 200,000-token context.
– **Memory Mechanism:**
– A memory mechanism lets the lead agent persist its plan outside the context window, so the plan survives if the context exceeds the limit and is truncated.
– **Prompt Engineering as a Lever:**
– The text emphasizes the importance of prompt engineering in refining agent performance, with examples of how improved prompting led to reduced errors and faster task completions.
– **Parallelization Advantages:**
– The introduction of parallel tool use—where lead agents activate multiple subagents simultaneously—greatly accelerates research efforts, cutting completion times significantly.
– **Evaluation Process:**
– Anthropic underscored the necessity of human evaluation alongside automated systems to ensure the quality of sourced information, introducing heuristics to prioritize authoritative sources over lower-quality, SEO-optimized content.
– **Actionable Insights:**
– The article provides practical advice, including examples of prompts that optimize parallel tool usage, which can serve as a guide for practitioners looking to implement similar systems.
– **Research Loop Framework:**
– A structured OODA (Observe, Orient, Decide, Act) loop is introduced, presenting a systematic approach for sub-agents to gather and process information effectively.
This comprehensive exploration of multi-agent systems not only highlights their strengths but also serves as a valuable resource for AI practitioners focused on enhancing system performance, particularly in relation to research and data analysis tasks.