Expert Day 157

CrewAI vs AutoGen vs LangGraph: one task, three frameworks compared

A comparison of the design philosophies behind three mainstream multi-agent frameworks: CrewAI (role-based), AutoGen (conversation-based), and LangGraph (state-graph).


Date: 2026-10-05
Track: AI Systems Engineering / Agent
Phase: Phase 3 - Agent Architecture and Multi-Agent (Day 149-162)
Tags: #CrewAI #AutoGen #LangGraph #FrameworkComparison


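Today's goals

| Type | Content |
| --- | --- |
| Study | Compare the design philosophies of three mainstream multi-agent frameworks: CrewAI (role-based), AutoGen (conversation-based), LangGraph (state-graph) |
| Practice | Implement the same task (a financial research agent: researcher + writer + reviewer) once in each framework, then compare code size, flexibility, cost, and debugging |
| Deliverable | A framework_compare/ directory containing the 3 implementations and a benchmark table |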

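1. Design philosophy of the three frameworks

1.1 CrewAI: role-based / task-based

Mental model: assemble a team (a "crew"). Each member (agent) has a role, a goal, and a backstory, and each piece of work (task) is assigned to an agent.
Best for: collaboration with a clear division of labor; it maps directly onto how a PM thinks.
Current version: CrewAI 0.140+ (mid-2026).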

from crewai import Agent, Task, Crew, Process

researcher = Agent(role="researcher", goal="find facts", ...)
task = Task(description="...", expected_output="...", agent=researcher)
crew = Crew(agents=[...], tasks=[...], process=Process.sequential)
crew.kickoff()

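1.2 AutoGen: conversation-based

Mental model: agents talk to each other in a group chat. In v0.2 the core classes are ConversableAgent and GroupChat; in the v0.4 AgentChat API used in this note, the equivalents are AssistantAgent and team classes such as RoundRobinGroupChat. Released by Microsoft Research (2023-10).
Current version: AutoGen v0.4 (rewritten in 2025 around an actor model).
Best for: free-form conversation, debate, simulation.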

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat

researcher = AssistantAgent(name="researcher", model_client=...)
writer = AssistantAgent(name="writer", ...)
team = RoundRobinGroupChat([researcher, writer])
result = await team.run(task=...)  # must be awaited inside an async function

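1.3 LangGraph: state-graph based

Mental model: covered on Day 156. A Pregel-style stateful graph that allows cycles.
Best for: complex control flow, human-in-the-loop (HIL), and workflows that need persistence.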
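To make the state-graph model concrete without depending on LangGraph itself, here is a minimal plain-Python sketch. Nodes are functions that return partial state updates, edges choose the next node from the current state, and a cycle is simply an edge that points backwards. The names `run_graph`, `NODES`, and `EDGES` are invented for this illustration and are not LangGraph APIs.

```python
from typing import Callable, Dict

State = Dict[str, object]

def draft(state: State) -> State:
    # A node returns only the keys it wants to update.
    rounds = state["rounds"] + 1
    return {"draft": f"memo v{rounds}", "rounds": rounds}

def review(state: State) -> State:
    # Stand-in for an LLM verdict: approve once the memo has been drafted twice.
    return {"approved": state["rounds"] >= 2}

NODES: Dict[str, Callable[[State], State]] = {"draft": draft, "review": review}

# Each edge function maps the current state to the next node name (or "END").
EDGES: Dict[str, Callable[[State], str]] = {
    "draft": lambda s: "review",
    "review": lambda s: "END" if s["approved"] else "draft",  # cycle back on rejection
}

def run_graph(entry: str, state: State, max_steps: int = 20) -> State:
    node = entry
    for _ in range(max_steps):  # hard cap so a bad edge cannot loop forever
        if node == "END":
            return state
        state = {**state, **NODES[node](state)}  # merge the partial update
        node = EDGES[node](state)
    raise RuntimeError("graph did not reach END within max_steps")

final_state = run_graph("draft", {"rounds": 0, "approved": False})
print(final_state)
```

The real library adds typed state, reducers, checkpointing, and interrupts on top of this loop; the sketch only shows why a reject-and-retry cycle is natural to express as a graph.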

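1.4 Side-by-side comparison

| Dimension | CrewAI | AutoGen | LangGraph |
| --- | --- | --- | --- |
| Abstraction | Crew / Agent / Task | ConversableAgent | StateGraph |
| Control flow | Process (sequential / hierarchical) | GroupChat selector | Explicit graph |
| Learning curve | Flattest | Medium to steep | |
| Flexibility | | | Highest |
| HIL | Built in (task feedback) | Built in (UserProxy) | interrupt |
| Persistence (partial for CrewAI or AutoGen) | | | |
| Visualization | UI Studio | Studio tool | LangSmith |
| Developer experience | Very quick to pick up | API changes often | Documentation relatively complete |
| Code size (same task) | ~60 lines | ~50 lines (fewest) | ~80 lines (most) |
| Ecosystem | | Medium (Microsoft) | Largest (LangChain) |
| Best fit | Business-team collaboration | Research / conversation | Complex production workflows |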

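2. Three implementations of the same task

Task definition

"Given a ticker, produce a one-page investment memo. Flow: the researcher gathers facts → the writer drafts → the reviewer critiques → final memo."

Expected: about 4-6 LLM calls and one or more tools (fetch filing).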


3. CrewAI implementation: framework_compare/crew_impl.py

# crew_impl.py
"""
CrewAI implementation. Pip install: crewai[tools]>=0.140
"""
import os
from crewai import Agent, Task, Crew, Process, LLM
from crewai.tools import tool

@tool("Search SEC filings")
def search_filings(ticker: str) -> str:
    """Search SEC EDGAR for the most recent 10-Q filing."""
    return '[{"form":"10-Q","date":"2026-08-01","mda":"Revenue $94.9B (+3% YoY). Services $24.2B."}]'

llm = LLM(model="claude-opus-4-7", temperature=0.1)
llm_sonnet = LLM(model="claude-sonnet-4-6", temperature=0.1)

researcher = Agent(
    role="Equity Researcher",
    goal="Gather concrete facts from SEC filings about a target company",
    backstory="A meticulous analyst who only reports verified numbers.",
    tools=[search_filings],
    llm=llm,
    verbose=False,
)

writer = Agent(
    role="Investment Memo Writer",
    goal="Turn research notes into a 1-page memo with thesis + risks",
    backstory="A clear writer who turns numbers into a narrative.",
    llm=llm,
    verbose=False,
)

reviewer = Agent(
    role="Investment Committee Reviewer",
    goal="Stress-test the memo and require revisions if weak",
    backstory="Skeptical chair of the IC; rejects fluff.",
    llm=llm_sonnet,
    verbose=False,
)

def build_crew(ticker: str) -> Crew:
    research_task = Task(
        description=f"Find the latest 10-Q for {ticker}. Extract revenue, services revenue, net cash, and 1 risk note.",
        expected_output="JSON-style bullets of verified facts.",
        agent=researcher,
    )
    write_task = Task(
        description=f"Using the research, draft a 1-page investment memo for {ticker} with: Thesis, Key Numbers, Risks, Recommendation.",
        expected_output="A markdown memo, ~400 words.",
        agent=writer,
        context=[research_task],
    )
    review_task = Task(
        description="Critique the memo. If the thesis is weak or numbers are missing citations, request revisions. Otherwise approve and return final memo.",
        expected_output="Final approved memo (markdown).",
        agent=reviewer,
        context=[research_task, write_task],
    )
    return Crew(
        agents=[researcher, writer, reviewer],
        tasks=[research_task, write_task, review_task],
        process=Process.sequential,
        verbose=False,
    )

if __name__ == "__main__":
    crew = build_crew("AAPL")
    out = crew.kickoff()
    print(out)

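Lines of code: ~60. CrewAI's abstraction is high-level: a PM can tell at a glance that this is "3 roles → 3 tasks → run in order". Note that with Process.sequential each task runs once, so the reviewer's "request revisions" does not route back to the writer; the reviewer has to produce the final memo itself.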


4. AutoGen implementation: framework_compare/autogen_impl.py

# autogen_impl.py
"""
AutoGen v0.4 implementation. Pip install: autogen-agentchat>=0.4 autogen-ext[anthropic]
"""
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination, MaxMessageTermination
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

opus = AnthropicChatCompletionClient(model="claude-opus-4-7")
sonnet = AnthropicChatCompletionClient(model="claude-sonnet-4-6")

async def search_filings(ticker: str) -> str:
    """Search SEC EDGAR for the most recent 10-Q filing."""
    return '[{"form":"10-Q","date":"2026-08-01","mda":"Revenue $94.9B (+3% YoY). Services $24.2B."}]'

researcher = AssistantAgent(
    name="researcher",
    model_client=opus,
    tools=[search_filings],
    system_message="You collect facts from SEC filings. Use search_filings. Only report verified numbers.",
)

writer = AssistantAgent(
    name="writer",
    model_client=opus,
    system_message="You turn research into a 1-page investment memo.",
)

reviewer = AssistantAgent(
    name="reviewer",
    model_client=sonnet,
    system_message=(
        "You critique the memo. If acceptable, write 'APPROVED' followed by the final memo. "
        "Otherwise list specific revision requests."
    ),
)

term = TextMentionTermination("APPROVED") | MaxMessageTermination(8)
team = RoundRobinGroupChat([researcher, writer, reviewer], termination_condition=term)

async def main():
    result = await team.run(task="Produce a 1-page investment memo for AAPL.")
    for m in result.messages:
        print(f"[{m.source}] {m.content}\n")

if __name__ == "__main__":
    asyncio.run(main())

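Lines of code: ~50. AutoGen models the task as a round-robin conversation: agents "pass" information to each other in natural language rather than through a structured task chain.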
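The round-robin-plus-termination behaviour can be sketched in plain Python, independent of AutoGen. The speaker functions are stubs standing in for LLM-backed agents, and `round_robin` is a hypothetical helper, not an AutoGen API. It only illustrates how "stop on APPROVED or after N messages" composes.

```python
from itertools import cycle
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker name, text)
Speaker = Callable[[List[Message]], str]

def round_robin(speakers: List[Tuple[str, Speaker]], task: str,
                stop_word: str = "APPROVED", max_messages: int = 8) -> List[Message]:
    history: List[Message] = [("user", task)]
    for name, speak in cycle(speakers):
        # Either condition ends the chat, like combining two termination conditions with OR.
        if len(history) >= max_messages:
            break
        text = speak(history)
        history.append((name, text))
        if stop_word in text:
            break
    return history

def count_drafts(history: List[Message]) -> int:
    return sum(1 for name, _ in history if name == "writer")

# Stub agents: the reviewer rejects the first draft and approves the second.
def researcher(history: List[Message]) -> str:
    return "Revenue $94.9B (+3% YoY)."

def writer(history: List[Message]) -> str:
    return f"Draft #{count_drafts(history) + 1}"

def reviewer(history: List[Message]) -> str:
    return "APPROVED. Final memo." if count_drafts(history) >= 2 else "Add a bear case."

log = round_robin(
    [("researcher", researcher), ("writer", writer), ("reviewer", reviewer)],
    "Produce a 1-page investment memo for AAPL.",
)
for name, text in log:
    print(f"[{name}] {text}")
```

Because turn order is fixed, the researcher speaks again in the second round even when only the writer needs to revise, which is one reason conversation-style runs tend to use more calls than an explicit graph.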


5. LangGraph implementation: framework_compare/lg_impl.py

# lg_impl.py
"""
LangGraph implementation.
"""
from typing import Annotated, TypedDict
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage, BaseMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

@tool
def search_filings(ticker: str) -> str:
    """Search SEC EDGAR for the most recent 10-Q filing."""
    return '[{"form":"10-Q","date":"2026-08-01","mda":"Revenue $94.9B (+3% YoY). Services $24.2B."}]'

opus = ChatAnthropic(model="claude-opus-4-7").bind_tools([search_filings])
sonnet = ChatAnthropic(model="claude-sonnet-4-6")

class S(TypedDict):
    ticker: str
    research_notes: str
    draft: str
    final: str
    review_count: Annotated[int, lambda a, b: a + b]
    messages: Annotated[list[BaseMessage], add_messages]

def researcher_node(state: S):
    msgs = [
        SystemMessage(content="Collect facts via tools. Output structured bullets."),
        HumanMessage(content=f"Research {state['ticker']} latest 10-Q."),
    ]
    resp = opus.invoke(msgs)
    if resp.tool_calls:
        # Run tool then re-invoke once
        tool_results = ToolNode([search_filings]).invoke({"messages": [resp]})
        final = opus.invoke(msgs + [resp] + tool_results["messages"])
        notes = final.content
    else:
        notes = resp.content
    return {"research_notes": notes}

def writer_node(state: S):
    resp = ChatAnthropic(model="claude-opus-4-7").invoke([
        SystemMessage(content="You write 1-page investment memos."),
        HumanMessage(content=f"Notes:\n{state['research_notes']}\n\nWrite a memo for {state['ticker']}."),
    ])
    return {"draft": resp.content}

def reviewer_node(state: S):
    resp = sonnet.invoke([
        SystemMessage(content="Critique. If acceptable say APPROVED + final memo, else list revisions."),
        HumanMessage(content=f"Memo:\n{state['draft']}"),
    ])
    text = resp.content
    if "APPROVED" in text.upper():
        return {"final": text, "review_count": 1}
    # Revision requested: keep the draft (so it can still be printed if the loop
    # gives up) and append the feedback to the notes the writer reads next.
    return {"research_notes": state["research_notes"] + "\n\nReviewer feedback: " + text,
            "review_count": 1}

def route(state: S) -> str:
    # Only "writer" and "end" are mapped in add_conditional_edges below,
    # so those are the only values this function may return.
    if state.get("final") or state["review_count"] >= 3:
        return "end"
    return "writer"

g = StateGraph(S)
g.add_node("researcher", researcher_node)
g.add_node("writer", writer_node)
g.add_node("reviewer", reviewer_node)
g.add_edge(START, "researcher")
g.add_edge("researcher", "writer")
g.add_edge("writer", "reviewer")
g.add_conditional_edges("reviewer", route, {"writer": "writer", "end": END})
graph = g.compile()

if __name__ == "__main__":
    out = graph.invoke({"ticker": "AAPL", "messages": [], "review_count": 0})
    print(out.get("final") or out.get("draft"))

Lines of code: ~80. LangGraph makes the control flow ("reviewer rejects → back to writer") fully explicit.
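The `review_count: Annotated[int, lambda a, b: a + b]` field in the code above relies on a reducer: when a node returns an update for that key, the old and new values are combined instead of overwritten. A minimal plain-Python sketch of that merge rule follows. `apply_update` and `REDUCERS` are hypothetical names for illustration, not LangGraph internals, and LangGraph's `add_messages` does more than plain list concatenation.

```python
import operator
from typing import Any, Callable, Dict

# Keys listed here are combined with their reducer; all other keys are overwritten.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {
    "review_count": operator.add,  # counter accumulates across node runs
    "messages": operator.add,      # list concatenation (append-only history)
}

def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)
        else:
            merged[key] = value  # no reducer: last write wins
    return merged

state = {"draft": "v1", "review_count": 0, "messages": ["research done"]}
state = apply_update(state, {"draft": "v2", "review_count": 1, "messages": ["rejected"]})
state = apply_update(state, {"review_count": 1, "messages": ["approved"]})
print(state)
```

Each reviewer pass returns `review_count: 1` and the reducer turns that into a running total, while `draft` has no reducer and is simply replaced.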


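6. Benchmark: the three frameworks on the same task

| Dimension | CrewAI | AutoGen | LangGraph |
| --- | --- | --- | --- |
| Lines of code | 60 | 50 | 80 |
| Time to first successful run | 30 min | 60 min | 90 min |
| Tool integration cost | Very low (@tool) | Low (async fn) | Low (@tool) |
| Cost / run | $0.06 | $0.05 | $0.05 |
| Latency | 25-35 s | 25-40 s | 22-32 s |
| Control-flow flexibility | | | |
| Resume / persist (partial for CrewAI or AutoGen) | | | Strong (checkpoint) |
| Debugging | Logs OK | Logs OK | Strong (LangSmith) |
| HIL | Task feedback | UserProxy | interrupt |
| Multi-agent expressiveness | Natural | Natural | Conditional edges written by hand |
| Escaping the framework (plain SDK fallback) | | | |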

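Key observations

  1. CrewAI: friendly to business PMs, but deep customization (such as conditional routing) runs into the framework's limits.
  2. AutoGen v0.4: performs well after the actor-model rewrite, but the API is less stable than v0.2 and the documentation is still catching up.
  3. LangGraph: the finest-grained control flow and the strongest production features (persistence / HIL / observability), at the cost of a heavier cognitive load.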

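7. Applications in finance: which one to choose?

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Internal IC automation (fixed set of roles) | CrewAI | Roles and tasks are intuitive |
| Customer-service chatbot with escalation | LangGraph | Needs thread persistence |
| Compliance investigation (multi-agent debate) | AutoGen | The conversation model fits naturally |
| Regulatory reporting (strict process) | LangGraph | Control flow is auditable |
| Research analyst copilot | Start with CrewAI, migrate to LangGraph as complexity grows | Complexity evolves over time |
| Market sentiment monitoring + action | LangGraph | Multiple sources + HIL |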

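8. Web3 integration

How hard it is to connect on-chain tools in each framework

| Framework | Wallet / RPC access | Session key integration | Confirmation UI for chain writes |
| --- | --- | --- | --- |
| CrewAI | Wrap web3.py in a tool; simple | Manage it yourself | Task feedback |
| AutoGen | Async tool over web3; simple | Manage it yourself | Intercept with UserProxy |
| LangGraph | Tool + interrupt before write | Manage it in state | interrupt, built in |

In production: for the critical path that writes to chain, choose LangGraph + interrupt (the strictest HIL). For chain reads and the research stage, CrewAI is a quick way to get something running.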

A framework-agnostic on-chain pattern

read → simulate → confirm-by-user → sign → submit → verify
       (chain)    (HIL/interrupt)   (wallet) (RPC)  (chain)

LangGraph models each step as a node plus an interrupt. AutoGen intercepts with UserProxy. CrewAI borrows task input / feedback.
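The six-step pattern can be sketched as a plain function in which signing is unreachable unless a confirmation callback returns true. Every step here is a stub: `write_onchain`, the transaction dict, and the simulation result are invented for the example, and no real wallet or RPC is involved.

```python
from typing import Callable, Dict, List

Tx = Dict[str, object]

def write_onchain(tx: Tx, confirm: Callable[[Tx, dict], bool], log: List[str]) -> str:
    log.append("read")                  # read current chain state (stub)
    sim = {"ok": True, "gas": 21000}    # simulate the transaction (stub result)
    log.append("simulate")
    if not sim["ok"]:
        return "aborted: simulation failed"
    if not confirm(tx, sim):            # HIL gate: nothing is signed without approval
        log.append("rejected")
        return "aborted: user rejected"
    log.append("confirm")
    log.append("sign")                  # wallet signs (stub, no real key)
    log.append("submit")                # send via RPC (stub)
    log.append("verify")                # check the receipt on chain (stub)
    return "confirmed"

rejected_log: List[str] = []
rejected = write_onchain({"to": "0xabc", "value": 1}, lambda tx, sim: False, rejected_log)

approved_log: List[str] = []
approved = write_onchain({"to": "0xabc", "value": 1}, lambda tx, sim: True, approved_log)

print(rejected, rejected_log)
print(approved, approved_log)
```

In each framework the `confirm` callback is where the human gate plugs in: an interrupt in LangGraph, a UserProxy turn in AutoGen, or task feedback in CrewAI.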


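9. Production lessons and pitfalls

  1. Lock-in after choosing a framework. Once you have written 100 tasks on CrewAI, moving to LangGraph is a major effort. Evaluate with a small PoC first, then commit.

  2. CrewAI's implicit prompts are hidden. CrewAI assembles role / goal / backstory into a prompt internally, and you do not see the full text. Before going to production, run with verbose=True, capture every prompt, and read through them.

  3. AutoGen upgrades are breaking. From v0.2 to v0.4 nearly every API was renamed. Pin the version and read the migration guide before upgrading.

  4. Badly designed LangGraph state. A field without a reducer is overwritten on every update. Or the entire messages list is shown to all 4 agents at once and token usage explodes.

  5. Multi-agent token cost is underestimated. Three agents running in sequence cost roughly 3x a single agent. For production estimates, use "number of agents × average LLM calls × tokens".

  6. Duplicated code. Maintaining one "financial research agent" each for CrewAI, LangGraph, and AutoGen is expensive. Recommendation: extract the tool layer (via MCP or a standalone library) and treat the framework as the orchestration layer only.

  7. Testing is hard. Multi-agent output is non-deterministic, so unit tests are hard to write. Recommendations: ① snapshot tests (on traces); ② a golden set (10 tasks, checking whether the final output meets a rubric); ③ a fourth LLM as an evaluator.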
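For pitfall 7, a golden-set check can be as simple as running a fixed list of tasks and scoring each final output against a rubric of required elements. The sketch below uses a keyword rubric built from the memo sections named in the task prompts (Thesis, Key Numbers, Risks, Recommendation) and a stub pipeline. `run_pipeline`, `run_golden_set`, and the 0.75 pass threshold are assumptions made for illustration; a real setup might replace the keyword check with an LLM evaluator.

```python
from typing import Callable, Dict, List

RUBRIC: List[str] = ["Thesis", "Key Numbers", "Risks", "Recommendation"]

def score(memo: str, rubric: List[str] = RUBRIC) -> float:
    # Fraction of rubric items that appear in the output (case-insensitive).
    hits = sum(1 for item in rubric if item.lower() in memo.lower())
    return hits / len(rubric)

def run_golden_set(pipeline: Callable[[str], str], tickers: List[str],
                   threshold: float = 0.75) -> Dict[str, bool]:
    # Pass/fail per task: wording may vary between runs as long as the structure holds.
    return {t: score(pipeline(t)) >= threshold for t in tickers}

def run_pipeline(ticker: str) -> str:
    # Stub standing in for any of the three framework implementations.
    if ticker == "BAD":
        return "Thesis: unclear."
    return (f"## {ticker} memo\nThesis: ...\nKey Numbers: ...\n"
            "Risks: ...\nRecommendation: hold")

results = run_golden_set(run_pipeline, ["AAPL", "MSFT", "BAD"])
print(results)
```

Checking structure rather than exact text is what makes this usable on non-deterministic output.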


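10. Cost & Latency

The same task in the three frameworks (researcher + writer + reviewer, 5-9 LLM calls per run)

| Framework | Total LLM calls | Total tokens | Cost | Latency |
| --- | --- | --- | --- | --- |
| CrewAI | 6-7 | ~12k | $0.05-0.07 | 25-35 s |
| AutoGen | 6-9 | ~14k | $0.05-0.08 | 25-40 s |
| LangGraph | 5-7 | ~10k | $0.04-0.06 | 22-32 s |

LangGraph is slightly faster and cheaper (fewer prompts injected by abstraction layers). The gap is modest, roughly 20-30% on cost at the range midpoints, so choose a framework for maintainability and capability, not to save money.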
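As a quick check on how large the cost gap is, the relative differences can be computed from the midpoints of the cost ranges in the table. This is only arithmetic on the table's own numbers; using the midpoint as the representative value is an assumption.

```python
# (low, high) cost per run in USD, copied from the table above.
COST_RANGE = {
    "CrewAI": (0.05, 0.07),
    "AutoGen": (0.05, 0.08),
    "LangGraph": (0.04, 0.06),
}

def midpoint(rng):
    low, high = rng
    return (low + high) / 2

mid = {name: midpoint(rng) for name, rng in COST_RANGE.items()}
cheapest = min(mid, key=mid.get)

# Extra cost of each framework relative to the cheapest one, in whole percent.
gap_pct = {name: round((m / mid[cheapest] - 1) * 100) for name, m in mid.items()}
print(cheapest, gap_pct)
```

At about 5 cents per run, a gap of this size amounts to one or two cents per memo.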


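11. Quick reference

Framework decision guide

Team leans PM / business and needs a prototype fast → CrewAI
Research / academic setting, Microsoft ecosystem → AutoGen
Complex stateful agents in production, HIL, observability → LangGraph
Fully custom requirements, no lock-in wanted → bare SDK + in-house library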
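The four rules above can be written down as a small decision function, which forces a choice about what happens when several rules apply at once. The `Project` fields, the priority order, and the fallback to the bare SDK are assumptions of this sketch; the rules themselves do not rank each other.

```python
from dataclasses import dataclass

@dataclass
class Project:
    avoid_lock_in: bool = False                # fully custom, no framework dependency
    needs_stateful_production: bool = False    # complex state, HIL, observability
    research_or_microsoft_stack: bool = False
    pm_led_prototype: bool = False

def pick_framework(p: Project) -> str:
    # Assumed priority: hard constraints first, convenience last.
    if p.avoid_lock_in:
        return "bare SDK + in-house library"
    if p.needs_stateful_production:
        return "LangGraph"
    if p.research_or_microsoft_stack:
        return "AutoGen"
    if p.pm_led_prototype:
        return "CrewAI"
    return "bare SDK + in-house library"  # nothing pushes toward a framework

prototype = pick_framework(Project(pm_led_prototype=True))
grown_up = pick_framework(Project(pm_led_prototype=True, needs_stateful_production=True))
print(prototype, "|", grown_up)
```

The second call shows the "start with CrewAI, move to LangGraph" path: the same PM-led project flips once production state requirements appear.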

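API cheat sheet for the three frameworks

| Operation | CrewAI | AutoGen | LangGraph |
| --- | --- | --- | --- |
| Define an agent | Agent(role, goal, ...) | AssistantAgent(name, model_client, ...) | node function |
| Define a task / workflow | Task(description, agent) | system_message | StateGraph |
| Sequential execution | Process.sequential | RoundRobinGroupChat | add_edge |
| Supervisor mode | Process.hierarchical | SelectorGroupChat | conditional edges |
| HIL | human_input=True | UserProxyAgent | interrupt() |
| Termination | task done | termination_condition | END node |
| Persistence (partial for CrewAI or AutoGen) | | | checkpointer |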

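12. Interview questions

Q1: CrewAI / AutoGen / LangGraph: which should a business team choose?

A: It depends on the team profile and how mature the requirements are. ① Led by business PMs, sequential requirements, needs a fast demo → CrewAI; ② needs conversation-style multi-round debate with a research flavor → AutoGen; ③ already in the LangChain ecosystem and needs production-grade persistence / observability → LangGraph. In practice many teams start with CrewAI and migrate to LangGraph as complexity grows.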

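Q2: An agent system built on AutoGen has been in production for six months. Should it be migrated to LangGraph?

A: It depends on the pain points: ① if tokens are out of control and the conversation jumps around → LangGraph's explicit control flow can fix that; ② if debugging is painful → LangSmith traces are strong; ③ if HIL / persistence requirements are high → LangGraph has them built in; ④ if there are only occasional problems and the system is stable overall → do not migrate; the migration costs at least several weeks of work.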

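Q3: All three frameworks depend on LLMs. How do you keep any one framework from becoming a performance bottleneck?

A: ① Decouple the tool layer from the framework (via MCP or a standalone library); ② write performance-sensitive nodes on the critical path with the bare SDK; ③ pin framework versions; ④ benchmark cost and latency over 1-week and 1-month windows; ⑤ monitor abnormal retry rates (retry mechanisms differ between frameworks); ⑥ write a migration test suite (the same task run on several frameworks at once) to preserve portability.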

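Q4: What kinds of bugs are most common in multi-agent frameworks?

A: ① Role collapse: agents converge on the same style; ② deadlock / infinite loops: the termination condition is not set properly; ③ token explosion: every agent sees the full history; ④ ordering errors: parallel agents conflict when writing shared state; ⑤ swallowed errors: one agent fails and the framework continues by default; ⑥ prompt injection spreading laterally: one hijacked agent affects the whole team. Each of these needs extra guardrails on top of the framework.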
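Two of these guardrails (against infinite loops and token explosion) are easy to sketch outside any framework. `RunBudget` and `trim_history` are hypothetical helpers, and the "about 4 characters per token" estimate is a crude assumption standing in for a real tokenizer.

```python
from typing import List

class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Caps the number of agent turns and the estimated total tokens for one run."""

    def __init__(self, max_turns: int = 8, max_tokens: int = 20_000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns = 0
        self.tokens = 0

    def charge(self, text: str) -> None:
        self.turns += 1
        self.tokens += max(1, len(text) // 4)  # crude estimate: ~4 characters per token
        if self.turns > self.max_turns:
            raise BudgetExceeded(f"turn limit {self.max_turns} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")

def trim_history(history: List[str], keep_last: int = 3) -> List[str]:
    # Give an agent the task plus the most recent messages, not the full log.
    return history[:1] + history[1:][-keep_last:]

budget = RunBudget(max_turns=3, max_tokens=1_000)
history = ["task: memo for AAPL"]
stopped = ""
try:
    while True:  # a deliberately endless writer/reviewer ping-pong
        reply = f"revision {budget.turns + 1}"
        budget.charge(reply)
        history.append(reply)
except BudgetExceeded as exc:
    stopped = str(exc)

print(stopped, trim_history(history, keep_last=2))
```

Raising an exception rather than silently stopping also addresses the swallowed-error case: the caller has to decide what a budget overrun means.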

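Q5: If you were to design a fourth framework, which problems would it solve that these three do not?

A: Some candidates: ① cost-first orchestration: agent budgets, dynamic model routing, dollar SLOs; ② type-safe agents: structured messages between agents (not natural language), checked at compile time; ③ replay / time-travel debugging: any run can be replayed at byte level; ④ a standardized observability layer: OpenTelemetry for agents; ⑤ a multi-provider abstraction: switch providers without changing business code. Frameworks in the PydanticAI / Burr / Marvin family are already filling in some of these.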
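To illustrate candidate ②, here is a minimal sketch of a structured hand-off between the researcher and the writer using dataclasses. `Fact` and `ResearchNotes` are hypothetical types; the validation here happens at construction time, so it approximates the idea at runtime rather than providing true compile-time checking.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Fact:
    label: str
    value: str
    source: str  # e.g. filing form and date, so the reviewer can check citations

@dataclass(frozen=True)
class ResearchNotes:
    ticker: str
    facts: List[Fact]

    def __post_init__(self) -> None:
        # Reject hand-offs the downstream agent could not trust.
        if not self.facts:
            raise ValueError("research must contain at least one fact")
        missing = [f.label for f in self.facts if not f.source]
        if missing:
            raise ValueError(f"facts without a source: {missing}")

notes = ResearchNotes("AAPL", [Fact("Revenue", "$94.9B (+3% YoY)", "10-Q 2026-08-01")])
print(notes.facts[0].label, notes.facts[0].source)

try:
    ResearchNotes("AAPL", [Fact("Net cash", "$48B", "")])
    rejected = False
except ValueError:
    rejected = True
print("uncited fact rejected:", rejected)
```

With a typed hand-off, a missing citation fails at the boundary between agents instead of surfacing later as a reviewer complaint in natural language.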


13. Deeper comparison: traces of the same task

CrewAI trace excerpt

[crew] Starting Crew with 3 agents, 3 tasks
[Researcher] using tool 'Search SEC filings' with input ticker=AAPL
[Researcher] tool returned: [{"form":"10-Q","date":"2026-08-01",...}]
[Researcher] final answer:
  - Revenue: $94.9B (+3% YoY)
  - Services: $24.2B
  - Net cash: $48B
[Writer] received context from Researcher
[Writer] final answer:
  ## AAPL Investment Memo
  Thesis: Long with target $245
  ...
[Reviewer] received context from Researcher + Writer
[Reviewer] final answer: APPROVED + ...
[crew] Total tokens: 11,873

AutoGen trace excerpt

[researcher] (round 1) Calling tool search_filings...
[researcher] (round 1) "Latest 10-Q: revenue $94.9B (+3%), services $24.2B."
[writer]     (round 2) "## AAPL Investment Memo\nThesis: Long..."
[reviewer]   (round 3) "Revenue YoY 3% is below tech peers. Address."
[researcher] (round 4) "Acknowledge — sector blend 6-8%, AAPL behind."
[writer]     (round 5) "Revised: bear scenario +"
[reviewer]   (round 6) "APPROVED. Final memo: ..."
TerminationCondition met: TextMention 'APPROVED'.

LangGraph trace excerpt (with LangSmith)

[research_node] LLM call (opus) → 1 tool call: search_filings
[research_node] tool result: ...
[research_node] LLM call (opus) → final research notes
[write_node]    LLM call (opus) → draft memo
[review_node]   LLM call (sonnet) → "APPROVED + final"
[graph]         END. iters=3, total tokens 9,800

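Observations:

  • The CrewAI trace is the most business-oriented (task names are self-explanatory)
  • The AutoGen trace reads like a conversation log (good for debugging multi-agent interaction)
  • The LangGraph trace, with its three layers of node, LLM call, and tool call, is the best fit for ops (monitoring / alerting)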

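14. Migration cost estimates

If a team moves from one framework to another, the rough effort is:

| From → To | Difficulty | Effort (typical 5-agent project) |
| --- | --- | --- |
| CrewAI → LangGraph | | 3-5 weeks |
| AutoGen v0.4 → LangGraph | | 3-4 weeks |
| LangGraph → CrewAI | Medium to easy (coarser abstraction) | 2-3 weeks |
| AutoGen ↔ CrewAI | | 3 weeks |
| Any → bare SDK | Easy (unbundling) | 1-2 weeks |
| Bare SDK → any | Hard (framework knowledge must be acquired) | 4-6 weeks |

Takeaway: the bare SDK is the lowest common denominator. Master it first, and migrating between frameworks becomes easier.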


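15. The PM's view: advice for business teams

5 non-technical factors in choosing a framework

  1. Team size: 5+ engineers → LangGraph (the learning curve is worth the investment); 1-2 people → CrewAI (fast productivity)
  2. Stakeholder involvement: the business side wants to review the process → CrewAI (task descriptions are written in business language)
  3. Audit / compliance requirements: high → LangGraph (control flow is auditable)
  4. Frequency of customer demos: high → CrewAI (output is conversational and easy to present visually)
  5. Ecosystem dependence: already a heavy LangChain user → LangGraph is the natural choice

Projects where the framework does not matter (learn to spot them)

  • Single-agent tasks
  • Short-lived PoCs (that will not evolve)
  • Replacing a fixed SaaS workflow
  • Fewer than 5 tool calls

For these cases, skip the framework. The bare SDK plus about 100 lines of code is enough and avoids unnecessary dependencies.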
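As a rough indication of what "bare SDK + about 100 lines" looks like for the memo task, here is a framework-free sketch of the researcher → writer → reviewer loop. The model is injected as a plain callable so the example runs without any provider SDK; `memo_pipeline`, `fake_llm`, and `fetch_filing` are hypothetical names, and a real version would pass a function that wraps the provider's client.

```python
from typing import Callable, Tuple

LLM = Callable[[str, str], str]  # (system prompt, user prompt) -> completion

def memo_pipeline(ticker: str, llm: LLM, fetch_filing: Callable[[str], str],
                  max_reviews: int = 3) -> Tuple[str, int]:
    """Researcher -> writer -> reviewer loop; returns (memo, number of LLM calls)."""
    calls = 0
    notes = llm("Extract verified facts as bullets.", fetch_filing(ticker))
    calls += 1
    draft = ""
    for _ in range(max_reviews):
        draft = llm("Write a 1-page investment memo.", f"{ticker}\n{notes}")
        calls += 1
        verdict = llm("Critique. Reply APPROVED if acceptable.", draft)
        calls += 1
        if verdict.startswith("APPROVED"):
            break
        notes += f"\nReviewer feedback: {verdict}"  # feed the critique back to the writer
    return draft, calls

# Stub model: the reviewer approves only a memo written after feedback was given.
def fake_llm(system: str, user: str) -> str:
    if system.startswith("Critique"):
        return "APPROVED" if "revised" in user else "Add a risk section."
    if system.startswith("Write"):
        return "revised memo" if "Reviewer feedback" in user else "first memo"
    return "- Revenue $94.9B (+3% YoY)"

memo, n_calls = memo_pipeline("AAPL", fake_llm, lambda t: "10-Q text for " + t)
print(memo, n_calls)
```

The control flow is an ordinary `for` loop, which is why small or short-lived projects may not need a framework's orchestration layer at all.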


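Coming up tomorrow

Day 158: Memory systems: short-term / long-term / episodic / semantic / Mem0

  • What really distinguishes the 4 layers of memory
  • Implementing long-term memory with a vector store
  • Comparing dedicated libraries such as Mem0 / LangMem / Letta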