AIPA Day 29

Anthropic orchestrator-worker close read: the 15× tokens and superlinearity behind +90.2%

2026-07-13


Date: 2026-07-13
Phase: Phase 2 - AI-native reference architecture
Tags: #orchestrator-worker #multi-agent #token-economics

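Core question

P1 finished polishing AML Copilot's "single agent + evaluation foundation". The first cut of P2 has to answer an architecture decision: should we adopt multi-agent at all? The most-cited first-hand evidence in the industry is Anthropic's research system, which reports that multi-agent beats single-agent by 90.2% at a cost of 15× the tokens. Most people remember only the "90.2%" and miss three things: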

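On what task was this +90.2% measured? Does it still hold for AML's "single-case deep reasoning"?
Why is the token cost 15×, and not 3× or 5×? Where does that multiple come from?
Is the relationship between quality gain and token spend linear? If not, what trade-off does that imply?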

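This repository's src/agent/orchestrator/orchestratorAgent.ts already implements a Lead-Subagent version (Claude Agent SDK + Vercel AI SDK). Today's goal is to use the Anthropic post (2025-06) to break the architecture down to the level of message flow and cost formula, and to calibrate our expectations so that we do not pay 15× on the wrong type of task.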

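Key content

A. Control flow of orchestrator-worker: the lead does no work itself, it only decomposes, dispatches, and integrates

The standard flow in the Anthropic post (2025-06) is an iterative loop whose key point is that the LeadResearcher does not do retrieval itself: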

User query
   │
   ▼
[LeadResearcher]  ── thinks through a strategy, saves the plan to Memory (so the plan survives context truncation)
   │  decompose: specify 4 things for each subtask
   │   ① objective  ② output format
   │   ③ tools/sources guidance  ④ clear task boundaries
   ├──────────────┬──────────────┐
   ▼              ▼              ▼
[Subagent A]   [Subagent B]   [Subagent C]   ← parallel, each independent
 web search     web search      web search
 interleaved    interleaved     interleaved   ← think while searching (evaluate tool results)
 thinking       thinking        thinking
 return summary return summary  return summary   ← condensed findings only, no raw context
   └──────────────┴──────────────┘
                  ▼
[LeadResearcher]  ── synthesize, decide whether another round is needed
                  │  needs more? ── yes ──► spawn another batch of subagents (back to decompose)
                  │  no
                  ▼
[CitationAgent]   ── a separate pass for citation attribution (ties claims back to sources)
                  ▼
              final answer → user

In the post's own words, the lead agent "develops a strategy, and spawns subagents to explore different aspects simultaneously"; when decomposing, each subagent needs "an objective, an output format, guidance on the tools and sources to use, and clear task boundaries".
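The loop above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the repository's implementation: every name here (`SubagentTask`, `runResearch`, `Synthesize`, `maxRounds`) is hypothetical, the subagent and the synthesis step are passed in as plain functions, and the CitationAgent pass is left out for brevity.

```typescript
// Sketch of the decompose -> parallel dispatch -> synthesize loop.
// All names are hypothetical; nothing here is taken from an SDK.

interface SubagentTask {
  objective: string;    // (1) what to find out
  outputFormat: string; // (2) shape of the returned summary
  toolGuidance: string; // (3) which tools/sources to prefer
  boundaries: string;   // (4) what is explicitly out of scope
}

interface SubagentResult {
  objective: string;
  summary: string; // condensed findings only, never the raw trace
}

type Subagent = (task: SubagentTask) => Promise<SubagentResult>;

interface Synthesis {
  answer: string;
  followUps: SubagentTask[]; // empty when no further round is needed
}

type Synthesize = (results: SubagentResult[]) => Synthesis;

async function runResearch(
  initialTasks: SubagentTask[],
  subagent: Subagent,
  synthesize: Synthesize,
  maxRounds = 3, // assumed cap so the loop always terminates
): Promise<{ answer: string; rounds: number }> {
  let tasks = initialTasks;
  const collected: SubagentResult[] = [];
  let answer = "";
  let rounds = 0;

  while (tasks.length > 0 && rounds < maxRounds) {
    rounds += 1;
    // Subagents run in parallel and independently of each other.
    const results = await Promise.all(tasks.map((t) => subagent(t)));
    collected.push(...results);
    // The lead sees only summaries, then decides whether to go again.
    const synthesis = synthesize(collected);
    answer = synthesis.answer;
    tasks = synthesis.followUps;
  }
  return { answer, rounds };
}
```

The four fields of `SubagentTask` mirror the four things the lead must specify per subtask; the lead never touches a search tool itself, it only produces tasks and consumes summaries.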

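Why subagents return only a condensed summary rather than raw context is the most important and most easily overlooked design choice in the whole architecture. Anthropic spells it out in "Effective context engineering for AI agents" (2025-09): each subagent "might explore extensively, using tens of thousands of tokens or more, but returns only a condensed, distilled summary of its work (often 1,000-2,000 tokens)". This is explicit context isolation: the messy, long retrieval process stays in the subagent's window, and the lead's window holds only distilled conclusions. In essence it counters context rot (recall degrades as context grows) by trading one polluted large window for several clean small ones.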

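To build a sense of scale: without compression, if three subagents pushed their tens of thousands of tokens of raw retrieval back into the lead, the lead's window would immediately swell by a hundred thousand tokens or more, most of it noise (irrelevant pages returned by search, intermediate hypotheses that were rejected). With compression to 1-2K summaries, the lead synthesizes "three already-filtered conclusions" rather than "three piles of raw material". In other words, a subagent is not just a retriever but an information filter; the post uses the phrase "intelligent filters". This also explains why the lead specifies an output format for each subagent (section A, ②): the format constraint forces the subagent to do a structured compression before returning, instead of dumping raw output. Without this compression step, the parallelism advantage of multi-agent is cancelled out by the context blow-up at the merge point, which is one of the most common ways to break an orchestrator.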
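A quick way to see the scale difference is to add up what reaches the lead's window in each case. The per-subagent numbers below are illustrative assumptions picked to sit inside the "tens of thousands" and "1,000-2,000" ranges quoted above; `SubagentFootprint` and `leadInflow` are hypothetical names.

```typescript
// Illustrative arithmetic only; the token counts are assumed, not measured.

interface SubagentFootprint {
  exploredTokens: number; // tokens burned inside the subagent's own window
  summaryTokens: number;  // tokens handed back to the lead
}

// Tokens that land in the lead's context, with or without compression.
function leadInflow(subs: SubagentFootprint[], compress: boolean): number {
  return subs.reduce(
    (total, s) => total + (compress ? s.summaryTokens : s.exploredTokens),
    0,
  );
}

const subs: SubagentFootprint[] = [
  { exploredTokens: 40_000, summaryTokens: 1_500 },
  { exploredTokens: 60_000, summaryTokens: 2_000 },
  { exploredTokens: 50_000, summaryTokens: 1_000 },
];

const raw = leadInflow(subs, false);      // 150000 tokens of raw trace
const condensed = leadInflow(subs, true); // 4500 tokens of summaries
console.log({ raw, condensed });
```

Under these assumed inputs the lead receives roughly one thirtieth of the tokens when subagents compress before returning, and what it receives has already been filtered.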

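B. Where +90.2% and "tokens explain 80% of variance" come from, and their limits

The two hard numbers in the post must be read together, or they will be misused: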

+90.2%: "A multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval." The accompanying example is "find all board members of the companies in the S&P 500 Information Technology sector": the multi-agent system found the correct answers, while the single agent failed outright.
Tokens explain 80% of variance: "token usage by itself explains 80% of the variance [in the BrowseComp evaluation], with the number of tool calls and the model choice as the two other explanatory factors." The three factors together explain 95%.

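Putting the two together gives a counterintuitive causal chain: multi-agent is strong mainly not because "several brains are smarter" but because parallelism lets it fit far more tokens and tool calls than a single window allows into a fixed wall-clock time. Multi-agent is a token delivery mechanism, not an intelligence amplifier.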

This causal chain directly marks out where the approach applies. From the post: multi-agent systems "excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously"; conversely, "most coding tasks involve fewer truly parallelizable tasks than research", so they are a poor fit.

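Counterintuitive insight ① (the board-member example is an extreme sample of "breadth-decomposable" work and cannot be extrapolated to deep reasoning): the +90.2% comes from a task that can be cut into N independent sub-queries whose answers do not depend on each other (each company's board can be looked up on its own). AML's "why is this 780,000 cross-border transfer suspicious" is the opposite. It is deeply coupled: counterparty, source of funds, and historical pattern have to be solved jointly in one reasoning chain. Splitting it across three subagents that each investigate one piece loses the joint information, because each subagent hands back only a condensed summary (section A) and that hand-off becomes a communication bottleneck. Treating the board-member +90.2% as the expected gain for single-case deep reasoning in AML is the most typical misuse.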
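The boundary described above can be written down as a small routing predicate. This is a hypothetical sketch: the `TaskProfile` fields, the `chooseRoute` name, and the 200,000-token window default are all assumptions for illustration, not values from the post or the repository.

```typescript
// Hypothetical routing heuristic: breadth-decomposable work that overflows
// one window goes to the orchestrator; coupled reasoning stays single-agent.

interface TaskProfile {
  independentSubqueries: number; // sub-questions answerable in isolation
  sharesReasoningChain: boolean; // true when evidence must be combined in one chain
  estimatedTokens: number;       // rough size of the material to be read
}

type Route = "single-agent" | "orchestrator";

function chooseRoute(task: TaskProfile, singleWindowTokens = 200_000): Route {
  // Coupled evidence loses information when split across subagents.
  if (task.sharesReasoningChain) return "single-agent";
  const breadth = task.independentSubqueries >= 2;
  const overflowsWindow = task.estimatedTokens > singleWindowTokens;
  return breadth && overflowsWindow ? "orchestrator" : "single-agent";
}

// Board members of every company in a sector: many independent lookups.
const boardLookup: TaskProfile = {
  independentSubqueries: 60,
  sharesReasoningChain: false,
  estimatedTokens: 1_000_000,
};

// One suspicious transfer: counterparty, funds, history reasoned jointly.
const singleCase: TaskProfile = {
  independentSubqueries: 3,
  sharesReasoningChain: true,
  estimatedTokens: 80_000,
};

console.log(chooseRoute(boardLookup), chooseRoute(singleCase));
```

In practice the profile would have to be estimated by the lead or by a cheap classifier; the point of the sketch is only that coupling is checked first and vetoes the multi-agent route.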

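C. Failure modes: coordination is harder than intelligence

The failure modes exposed by Anthropic's early versions were all coordination problems, not capability problems: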

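Failure mode	Description in the post	Root cause	Fix
Subagent explosion	"spawning 50 subagents for simple queries"	Lead has no scaling rule	State a scaling rule in the prompt for how many subagents a simple query may spawn
Duplicated work	"2 others duplicated work investigating current 2025 supply chains"	Vague task boundaries	Give each subagent fixed boundaries (section A, ④)
Aimless wandering	"scouring the web endlessly for nonexistent sources"	No stopping condition	Give the subagent an explicit output format plus a stopping criterion
Mutual interference	"distracting each other with excessive updates"	Too much synchronous communication	Reduce messages between agents; return only the final summary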

The post's root-cause diagnosis is precise: when the lead gives only "simple, short instructions" such as "research the semiconductor shortage", "one subagent explored the 2021 automotive chip crisis while 2 others duplicated work investigating current 2025 supply chains, without an effective division of labor".

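These four failure modes share one essence: none of them is "the subagent is not smart enough"; all of them are "the lead's decomposition is not good enough". Spawning 50 subagents, duplicated work, aimless wandering, mutual interference: the root cause is the same in each case. The lead cut the task badly, or did not cut it at all and passed a vague instruction straight to the subagents. This gives an important engineering lesson: when tuning a multi-agent system, most of the effort should go into the lead's decomposition prompt, not into subagent capability. The fix in the post supports this. It did not switch to a stronger subagent model (the subagents actually use the cheaper Sonnet); it wrote "scaling rules" into the lead's prompt (how many subagents to spawn for simple queries and for complex ones) and "clear task boundaries" (non-overlapping scope per subagent). This is isomorphic to the classic distributed-systems observation that coordinator correctness determines system throughput more than worker compute does: the bottleneck of a multi-agent system is in the orchestration layer, not the execution layer.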
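Because the failures trace back to the lead's plan, one option is to lint the plan before any subagent is spawned. The sketch below is hypothetical: the complexity tiers, the per-tier caps, the `scope` keys, and the word-count test for vague objectives are all illustrative assumptions, not rules from the post or the repository.

```typescript
// Hypothetical pre-dispatch check on the lead's plan.
// Tiers and caps are assumed values for illustration.

type Complexity = "simple" | "comparison" | "complex";

const MAX_SUBAGENTS: Record<Complexity, number> = {
  simple: 1,
  comparison: 4,
  complex: 10,
};

interface Brief {
  objective: string;
  scope: string[]; // keys that should be disjoint across briefs, e.g. years or entity ids
}

function validatePlan(complexity: Complexity, briefs: Brief[]): string[] {
  const problems: string[] = [];

  // Scaling rule: cap the number of subagents by query complexity.
  if (briefs.length > MAX_SUBAGENTS[complexity]) {
    problems.push(`too many subagents: ${briefs.length} > ${MAX_SUBAGENTS[complexity]}`);
  }

  const owner = new Map<string, number>(); // scope key -> index of first brief claiming it
  briefs.forEach((brief, index) => {
    // Crude vagueness heuristic: very short objectives tend to be underspecified.
    if (brief.objective.trim().split(/\s+/).length < 5) {
      problems.push(`objective too vague: "${brief.objective}"`);
    }
    // Boundary rule: no scope key may belong to two different briefs.
    for (const key of brief.scope) {
      const previous = owner.get(key);
      if (previous !== undefined && previous !== index) {
        problems.push(`scope "${key}" claimed by briefs ${previous} and ${index}`);
      } else {
        owner.set(key, index);
      }
    }
  });
  return problems;
}
```

Fed the "research the semiconductor shortage" instruction twice with the same scope, this check flags the vague objective and the overlapping boundary before any tokens are spent. It is a code-level complement to the prompt rules, not a replacement for them.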

Token cost breakdown: why 15× and not 3×. Let a single chat interaction consume $T_{chat}$. The post states that "agents typically use about 4× more tokens than chat interactions" and "multi-agent systems use about 15× more tokens than chats". Breaking the 15× apart:

$$T_{multi} \approx \underbrace{T_{lead\_plan} + T_{lead\_synth}}_{\text{lead itself} \,\approx\, 4\times} + \sum_{i=1}^{N}\underbrace{T_{sub_i}}_{\text{each sub} \,\approx\, 4\times} + \underbrace{T_{cite}}_{\text{citation pass}}$$

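Each subagent is itself a "4× chat" agent. N of them run in parallel and add up, plus the lead's planning and synthesis passes and one CitationAgent pass: 4 + N×4 + overhead. With N≈3 this gives 4 + 12 = 16 before overhead, the same order of magnitude as the reported ≈15×. Treat it as a back-of-envelope reading rather than an exact accounting.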
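The same decomposition as a small calculator, in units of one chat interaction. The 4× per-agent figure comes from the quoted text; the citation overhead of 1 and the function name `estimateMultiplier` are assumptions made for this sketch.

```typescript
// Back-of-envelope estimate in units of T_chat (one chat interaction = 1).
// perAgent = 4 follows the quoted "about 4x" figure; citationOverhead is an
// assumed placeholder, not a number from the post.

function estimateMultiplier(
  subagents: number,
  perAgent = 4,
  citationOverhead = 1,
): number {
  const lead = perAgent;                // planning + synthesis passes
  const workers = subagents * perAgent; // each subagent is itself an agent
  return lead + workers + citationOverhead;
}

for (const n of [1, 2, 3, 5]) {
  console.log(`N=${n}: about ${estimateMultiplier(n)}x chat`);
}
```

With these assumed inputs N=3 comes out at 17, slightly above the reported 15×, which is a reminder that the formula is an order-of-magnitude reading: real subagents on narrow subtasks may run lighter than a full 4× agent, and the multiple grows linearly with N.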

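Counterintuitive insight ② (quality versus tokens is superlinear, not linear): intuition says that "spending 15× the tokens for a 15% improvement" is a bad trade. But on tasks that are breadth-decomposable and do not fit in a single window, a single agent is not "a bit slower"; it cannot do the task at all (in the board-member example it simply fails, a score of 0). Here the quality curve is a step at some token threshold: close to 0 below the threshold, and a jump to a high score once it is crossed. So the gain is not "+15%" but "from impossible to possible", which explains why an extreme number like +90.2% shows up on tasks where the single agent fails. The cost: when a task does not need to cross that step (it fits in a single window), the 15× tokens are pure waste.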

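Design points / decision table

Decision point	Anthropic's approach	Trade-off in this project
Does the lead retrieve itself?	No, it only decomposes, dispatches, and integrates	Same; the lead only calls dispatch tools
What subagents return	Only a distilled 1-2K token summary	Same; the sub-agent returns `{ text }`, not the raw trace
Citation attribution	A separate CitationAgent pass	AML must be traceable, so citation is its own step
When to use multi-agent	Breadth-decomposable and larger than a single window	Only for "batch scans of many transactions / many counterparties"; single-case deep reasoning goes to a single agent
Preventing subagent explosion	Scaling rule in the prompt	Budget hard limits on step/toolCall (code-level backstop)
Preventing vague boundaries	Fixed objective + format + boundary	The dispatch tool's `subQuery` must be specific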

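Applying this to the project

The parts already implemented in src/agent/orchestrator/orchestratorAgent.ts match the post: in runOrchestrator the lead exposes only four tools, invokeKnowledgeAgent / invokeResearchAgent / invokePortfolioAgent / finalAnswer, and does no retrieval itself (section A, "the lead does no work itself"); each sub-agent's execute returns { text: r.text } (section A, "condensed summary only"). The structure is a faithful implementation of orchestrator-worker.
The finalAnswer tool should split out separate citation logic: finalAnswerTool currently produces text and sources together, which amounts to having the lead double as the CitationAgent. AML compliance requires every conclusion to be traceable, so the suggestion is for W to later make citation attribution the last, separate step of the orchestrator (matching the post's single CitationAgent pass), which lowers the risk that the lead "makes up sources on the side" during synthesis. This is a planned change and is not implemented yet.
The condition for enabling multi-agent should be written into the orchestration policy rather than being on by default: per sections B and C, only breadth tasks in AML such as "batch review of N alerts" or "comparing multiple counterparties" should go through the orchestrator; a single-transaction deep investigation should go straight to a single sub-agent (e.g. runResearchAgent) and skip the 15× cost. This criterion belongs in the decomposition strategy in orchestratorPrompt.ts.
budget.ts is the code-level backstop for the failure modes in section C: assertCanStep / assertCanToolCall map to "preventing subagent explosion" and "preventing aimless wandering". The scaling rule in the prompt is a soft constraint; Budget's maxOrchestratorSteps (currently stopWhen: steps>=8) is the hard constraint. Day 32 will wire costCapUsd to real token pricing and put an economic gate on the 15× cost.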
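To make the soft-versus-hard distinction concrete, here is a minimal sketch of what a hard budget of this kind could look like. Only the method names assertCanStep and assertCanToolCall come from the text above; the class shape, the constructor parameters, and `BudgetExceededError` are assumptions, and the real budget.ts may be structured differently.

```typescript
// Hypothetical hard budget. A prompt-level scaling rule can be ignored by the
// model; a thrown error cannot.

class BudgetExceededError extends Error {}

class Budget {
  private steps = 0;
  private toolCalls = 0;
  private readonly maxSteps: number;
  private readonly maxToolCalls: number;

  constructor(maxSteps: number, maxToolCalls: number) {
    this.maxSteps = maxSteps;
    this.maxToolCalls = maxToolCalls;
  }

  // Call before each orchestrator step; throws once the step cap is used up.
  assertCanStep(): void {
    if (this.steps >= this.maxSteps) {
      throw new BudgetExceededError(`step limit ${this.maxSteps} reached`);
    }
    this.steps += 1;
  }

  // Call before each tool invocation; bounds subagent fan-out and wandering.
  assertCanToolCall(): void {
    if (this.toolCalls >= this.maxToolCalls) {
      throw new BudgetExceededError(`tool-call limit ${this.maxToolCalls} reached`);
    }
    this.toolCalls += 1;
  }

  // Remaining headroom, useful for logging or for telling the lead to wrap up.
  remaining(): { steps: number; toolCalls: number } {
    return {
      steps: this.maxSteps - this.steps,
      toolCalls: this.maxToolCalls - this.toolCalls,
    };
  }
}
```

A cost cap in dollars would follow the same pattern, with a running total of priced tokens in place of a counter.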

References

Anthropic Engineering, "How we built our multi-agent research system" (2025-06): orchestrator-worker control flow; +90.2% over single-agent Opus 4; tokens explain 80% of variance; 4×/15× tokens; failure modes (50 subagents, duplicated work).
Anthropic Engineering, "Effective context engineering for AI agents" (2025-09): subagents explore with tens of thousands of tokens and return only a 1-2K distilled summary; context rot; the three techniques compaction, structured note-taking, and multi-agent.
This repository (2026-06): src/agent/orchestrator/orchestratorAgent.ts (Lead-Subagent implementation), src/agent/orchestrator/budget.ts (hard limits on step/toolCall/cost).

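SOTA check (2026-06-11)

orchestrator-worker is still the de facto standard multi-agent pattern as of 2026-06: Anthropic's design (2025-06) has been adopted as one of the default orchestration paradigms by mainstream frameworks such as OpenAI Agents SDK ("agents-as-tools") and Microsoft Agent Framework (2026-04). Day 30 compares it with the handoff paradigm.
"Tokens explain 80% of performance variance" is the most-cited and most-misread conclusion of the post: several 2026 replication studies (for example a single- vs multi-agent comparison under equal token budgets, arXiv 2604.02460, 2026-04) have started to push back, reporting that once token budgets are equalized the multi-agent advantage on multi-hop reasoning disappears or even reverses. This suggests the +90.2% depends heavily on the mechanism "parallelism fits in more tokens" rather than on multi-agent being smarter in itself. Day 31 is dedicated to checking this conflict in the evidence.
The 15× token multiple still holds in 2026, but its meaning is shifting: as prompt caching and context compaction (a technique from the 2025-09 post) become widespread, the cost of repeated subagent context is pushed down and measured multiples may fall below 15×. The qualitative conclusion that quality gains are superlinear in tokens has not been overturned.
To track: once the project is connected to real pricing (Day 32), measure the orchestrator's token multiple and quality gain on AML batch-review tasks, and check whether they fall within the predicted range of "breadth-decomposable = worth it, single-case deep reasoning = not worth it".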