Expert Day 133

LLM Comparison and Selection: Claude 4.7, GPT-5, Gemini 2.5, and Llama in Practice

Architecture, pricing, and capability differences across Claude 4.7 / GPT-5 / Gemini 2.5 Pro / Llama 3.x

2026-09-11

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)

Date: 2026-09-11
Track: AI Systems Engineering
Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)
Tags: #ModelSelection #Claude #GPT5 #Gemini #Llama #Benchmark

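Today's Goals

Type	Content
Study	Architecture, pricing, and capability differences across Claude 4.7 / GPT-5 / Gemini 2.5 Pro / Llama 3.x
Practice	Run the same tasks across the 4 model families and compare accuracy, latency, cost, and structured output reliability
Deliverable	`model_compare.md`, a model selection decision table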

1. Theoretical Foundations

1.1 Mainstream model matrix (2026 Q3 snapshot)

模型	Provider	Context	Multimodal	Notable Features
Claude 4.7 Opus	Anthropic	200K (1M beta)	Vision, PDF	Extended thinking, tools, citations, prompt caching
Claude 4.6 Sonnet	Anthropic	200K	Vision, PDF	Best $/perf for most tasks
Claude 4.5 Haiku	Anthropic	200K	Vision	Fast, cheap
GPT-5	OpenAI	256K	Vision, voice	Native tool use, voice mode
GPT-5-mini	OpenAI	256K	Vision	Cheap tier
Gemini 2.5 Pro	Google	2M	Full multi-modal (video, audio)	Massive context
Gemini 2.5 Flash	Google	1M	Multi-modal	Speed first
Llama 3.3 70B	Meta (open)	128K	Text only (vision in Llama 3.2)	Self-host, fine-tune
DeepSeek-R1	DeepSeek (open)	128K	Text only	Specialized for math/reasoning
Qwen 3 72B	Alibaba (open)	128K	Vision	Strong in Chinese-language scenarios

1.2 Pricing comparison (USD per 1M tokens, 2026)

Model	Input	Output	Cached read
Claude 4.7 Opus	$15.00	$75.00	$1.50
Claude 4.6 Sonnet	$3.00	$15.00	$0.30
Claude 4.5 Haiku	$0.80	$4.00	$0.08
GPT-5	$10.00	$50.00	varies
GPT-5-mini	$1.00	$5.00	-
Gemini 2.5 Pro	$5.00	$20.00 (>200K: $10/$30)	-
Gemini 2.5 Flash	$0.30	$2.50	-
Llama 3.3 (Together)	$0.20	$0.60	-
DeepSeek-R1 (DeepInfra)	$0.55	$2.19	-

(All figures are estimates; actual prices change quickly.)
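
To make the "Cached read" column concrete, here is a minimal sketch of per-request cost when part of the prompt is served from the prompt cache. The prices are the estimates from the table above. The helper name `request_cost` and the 20K-token workload are illustrative assumptions, and cache-write surcharges are deliberately ignored.

```python
# Sketch: per-request cost with prompt caching, using the table's estimated prices.
# Assumption: cache-write surcharges are ignored; only cached reads vs. fresh input are modeled.

PRICES_PER_MTOK = {
    # model: (input, output, cached_read) in USD per 1M tokens (estimates from the table)
    "claude-opus-4-7": (15.00, 75.00, 1.50),
    "claude-sonnet-4-6": (3.00, 15.00, 0.30),
    "claude-haiku-4-5": (0.80, 4.00, 0.08),
}


def request_cost(model, input_tokens, output_tokens, cached_fraction=0.0):
    """USD cost of one request; cached_fraction is the share of input tokens read from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    p_in, p_out, p_cached = PRICES_PER_MTOK[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * p_in + cached * p_cached + output_tokens * p_out) / 1e6


if __name__ == "__main__":
    # Hypothetical workload: 20K-token prompt, 500-token answer
    for frac in (0.0, 0.9):
        c = request_cost("claude-opus-4-7", 20_000, 500, cached_fraction=frac)
        print(f"cached={frac:.0%}  cost=${c:.4f}")
```

Under these assumptions a 90% cache hit rate cuts the Opus request from about $0.34 to about $0.09, which is why cached pricing belongs in any cost comparison.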

1.3 Qualitative capability comparison

Capability	Strongest	Runner-up	Follower
Code generation	Claude 4.7, GPT-5	Sonnet, GPT-5-mini	Llama 3.3
Math/Reasoning	GPT-5 (o1 successor), Claude Opus thinking	DeepSeek-R1	Gemini
Long context	Gemini 2.5 Pro (2M)	Claude 4.7 (1M beta)	GPT-5 (256K)
Vision/Multimodal	Gemini 2.5 Pro (video)	GPT-5 (voice)	Claude (strong on images and PDFs, no video)
Tool use / Agents	Claude 4.7	GPT-5	Gemini
Speed	Haiku, Flash	mini	Llama (depends on host)
Cost-effective	Haiku, Flash, Llama	Sonnet	Opus
Open weights	Llama, DeepSeek, Qwen	-	Claude/GPT/Gemini (all closed)
Chinese	Qwen, GPT-5	Claude	Gemini

1.4 Architecture Insights

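Inferred (unconfirmed) architecture choices for each vendor:

Claude: Dense decoder + thinking framework + heavy CAI (Constitutional AI) training
GPT-5: Believed to be MoE, with o1-style test-time compute (thinking) integrated
Gemini 2.5: MoE + dedicated long-context optimization (Pathways)
Llama 3.x: Dense; open weights at 405B/70B/8B (Llama 3.1) and 70B (Llama 3.3)
DeepSeek-R1: MoE + large-scale RL on math/code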

1.5 Evaluation benchmarks

Benchmark	What	Top performer (2026)
MMLU	General knowledge	Claude 4.7, GPT-5 (~92%)
HumanEval	Code	GPT-5 (~95%), Claude 4.7 (~94%)
GSM8K	Math	DeepSeek-R1, GPT-5 (~99%)
AIME	Olympiad math	GPT-5, Claude with thinking
SWE-bench	Real bugs	Claude 4.7 (~70%), GPT-5 (~65%)
LongBench	Long context	Gemini 2.5 Pro
HellaSwag	Commonsense	All saturated
MTEB	Embeddings	(different track)

Note: benchmarks are easy to overfit. Real production behavior can diverge significantly from benchmark scores.

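2. Intuition

Why is there no single "best" model?

Different dimensions (cost, latency, capability ceiling, multimodal support) have different winners. Model selection is a multi-objective optimization problem, and business requirements determine the weights.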
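
The multi-objective framing can be sketched as a weighted score. This is a minimal illustration, not a recommended scoring method: the candidate names and every number in `CANDIDATES` are hypothetical placeholders on a 0-1 scale (higher is better), and the weights stand in for business priorities.

```python
# Sketch: model selection as a weighted multi-objective score.
# All scores are hypothetical placeholders (0-1, higher is better), not measurements;
# replace them with numbers from your own eval set.

CANDIDATES = {
    "premium-model": {"quality": 0.95, "speed": 0.40, "cheapness": 0.20},
    "mid-model":     {"quality": 0.85, "speed": 0.70, "cheapness": 0.70},
    "small-model":   {"quality": 0.65, "speed": 0.95, "cheapness": 0.95},
}


def weighted_score(scores, weights):
    """Weighted average of the dimensions named in weights."""
    total = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / total


def pick_model(weights, candidates=CANDIDATES):
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


if __name__ == "__main__":
    faq_weights = {"quality": 0.2, "speed": 0.4, "cheapness": 0.4}
    compliance_weights = {"quality": 0.9, "speed": 0.05, "cheapness": 0.05}
    print("FAQ        ->", pick_model(faq_weights))
    print("Compliance ->", pick_model(compliance_weights))
```

With these placeholder numbers the FAQ weighting picks the small model and the compliance weighting picks the premium one: the same candidates, different winners once the weights change.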

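Why do you need multi-model fallback?

API outages are common (both OpenAI and Anthropic have had them)
Rate limits get hit
Different models have different accuracy on different cases
Regulation/compliance may require vendor diversity
Price competition shifts quickly

For these reasons, mature products typically run a model router.

Open-weight vs. closed trade-offs

Closed (Claude/GPT/Gemini):

Strongest base-model capability
No deployment to manage
But: lock-in, higher cost, privacy concerns

Open weights (Llama/Qwen/DeepSeek):

Self-hosted, so data stays in-house
Can be fine-tuned
But: infrastructure is complex and the capability ceiling is lower
Per-token cost can be 5-10x lower (at sufficient scale)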

3. Code Implementation

3.1 Multi-model benchmark script

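# model_compare.py
"""
Run the same tasks against multiple providers and compare them
(this script covers Anthropic, OpenAI, and Google)
"""
import anthropic
import openai
import google.generativeai as genai
import time
import json

# Configuration
anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()
genai.configure(api_key="<GOOGLE_KEY>")  # placeholder: load the real key from an env var or secret store

# Test tasks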
TEST_QUERIES = [
    {
        "name": "factual_qa",
        "prompt": "What is the capital of France? Answer with one word."
    },
    {
        "name": "math",
        "prompt": "If a train travels 60 km/h for 2.5 hours, how many km did it travel? Answer only with the number."
    },
    {
        "name": "code",
        "prompt": "Write a Python function fibonacci(n) using iteration. Output only code."
    },
    {
        "name": "extraction",
        "prompt": """Extract revenue from this report (number only, in millions USD):

Apple Q3 2026: Reported revenue of $94.5 billion, up 8.5% year over year."""
    },
    {
        "name": "reasoning",
        "prompt": """Janet has 24 candies. She gives 1/3 to Bob and 1/4 of the remainder to Tom. How many does she have left? Show your reasoning briefly."""
    },
]

def call_anthropic(model, prompt, **kwargs):
    t0 = time.time()
    resp = anthropic_client.messages.create(
        model=model, max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "text": resp.content[0].text,
        "latency": time.time() - t0,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }

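def call_openai(model, prompt, **kwargs):
    t0 = time.time()
    # Assumption: GPT-5-family models follow the reasoning-model rules of Chat Completions,
    # which take max_completion_tokens (not max_tokens) and only the default temperature.
    # Reasoning tokens count against this limit, so raise it if outputs come back empty.
    resp = openai_client.chat.completions.create(
        model=model, max_completion_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "text": resp.choices[0].message.content or "",
        "latency": time.time() - t0,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }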

def call_gemini(model, prompt, **kwargs):
    t0 = time.time()
    m = genai.GenerativeModel(model)
    resp = m.generate_content(prompt, generation_config={"temperature": 0, "max_output_tokens": 512})
    usage = resp.usage_metadata
    return {
        "text": resp.text,
        "latency": time.time() - t0,
        "input_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
    }

# Cost calc
PRICES = {
    "claude-opus-4-7":   {"in": 15.0, "out": 75.0},
    "claude-sonnet-4-6": {"in": 3.0,  "out": 15.0},
    "claude-haiku-4-5":  {"in": 0.8,  "out": 4.0},
    "gpt-5":             {"in": 10.0, "out": 50.0},
    "gpt-5-mini":        {"in": 1.0,  "out": 5.0},
    "gemini-2.5-pro":    {"in": 5.0,  "out": 20.0},
    "gemini-2.5-flash":  {"in": 0.3,  "out": 2.5},
}

def cost(in_tok, out_tok, model):
    p = PRICES[model]
    return in_tok * p["in"] / 1e6 + out_tok * p["out"] / 1e6

# Run
MODELS = [
    ("claude-opus-4-7",   call_anthropic),
    ("claude-sonnet-4-6", call_anthropic),
    ("claude-haiku-4-5",  call_anthropic),
    ("gpt-5",             call_openai),
    ("gpt-5-mini",        call_openai),
    ("gemini-2.5-pro",    call_gemini),
    ("gemini-2.5-flash",  call_gemini),
]

results = {}
for q in TEST_QUERIES:
    results[q["name"]] = {}
    print(f"\n=== Task: {q['name']} ===")
    for model, fn in MODELS:
        try:
            r = fn(model, q["prompt"])
            c = cost(r["input_tokens"], r["output_tokens"], model)
            results[q["name"]][model] = {
                **r, "cost": c
            }
            print(f"  {model:<22} latency={r['latency']:.2f}s  "
                  f"cost=${c:.5f}  out={r['text'][:60]!r}")
        except Exception as e:
            print(f"  {model:<22} ERROR: {e}")

# Save
with open("model_compare_results.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
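
The script above records latency and cost but not accuracy, which is one of the stated comparison metrics. Below is a minimal scoring sketch, assuming the `results` structure written by the script (`{task: {model: {"text": ...}}}`). The names `EXPECTED`, `is_correct`, and `accuracy_by_model` are illustrative, the matching is a lenient heuristic, and the "code" task is skipped because grading it would need sandboxed execution.

```python
import re

# Expected answers for the deterministic tasks in TEST_QUERIES.
# The "code" task is omitted: grading it needs sandboxed execution, which is out of scope here.
EXPECTED = {
    "factual_qa": "paris",
    "math": "150",          # 60 km/h * 2.5 h
    "extraction": "94500",  # $94.5 billion expressed in millions of USD
    "reasoning": "12",      # 24 - 8 = 16, then 16 - 4 = 12
}


def is_correct(task, text):
    """Lenient check: substring match for factual_qa, last number in the output otherwise."""
    expected = EXPECTED[task]
    cleaned = text.strip().lower().replace(",", "")
    if task == "factual_qa":
        return expected in cleaned
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return bool(numbers) and float(numbers[-1]) == float(expected)


def accuracy_by_model(results):
    """Fraction of graded tasks each model answered correctly."""
    totals, hits = {}, {}
    for task, per_model in results.items():
        if task not in EXPECTED:
            continue  # ungraded task (e.g. "code")
        for model, r in per_model.items():
            totals[model] = totals.get(model, 0) + 1
            hits[model] = hits.get(model, 0) + int(is_correct(task, r["text"]))
    return {m: hits[m] / totals[m] for m in totals}
```

Taking the last number in the output is a deliberate simplification so that answers with brief reasoning still score. A real eval set would want stricter parsing and far more than five examples.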

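3.2 Unified interface with LiteLLM

# unified_router.py
"""
LiteLLM unifies all providers: write the call once, then switch models without rewriting it
"""
# pip install litellm
from litellm import completion
import os

# Placeholders: setdefault keeps any real key that is already set in the environment
os.environ.setdefault("ANTHROPIC_API_KEY", "...")
os.environ.setdefault("OPENAI_API_KEY", "...")
os.environ.setdefault("GEMINI_API_KEY", "...")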

def ask(prompt, model="claude-sonnet-4-6"):
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].message.content

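# Same interface for every provider
print(ask("Hello", "claude-haiku-4-5"))
print(ask("Hello", "gpt-5-mini"))
print(ask("Hello", "gemini/gemini-2.5-flash"))
# Llama 3.3 ships only as 70B (the 405B checkpoint is Llama 3.1); the exact ID depends on the host's catalog
print(ask("Hello", "together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"))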

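3.3 Smart router with fallback

# smart_router.py
"""
Pick a model automatically based on task type + cost budget
(note: budget_per_query is stored below but not yet enforced)
"""
from unified_router import ask  # LiteLLM wrapper defined in 3.2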
class ModelRouter:
    def __init__(self, budget_per_query=0.05):
        self.budget = budget_per_query
        self.tier_chains = {
            "factual":   ["claude-haiku-4-5", "gpt-5-mini", "gemini-2.5-flash"],
            "code":      ["claude-sonnet-4-6", "gpt-5", "claude-opus-4-7"],
            "long_doc":  ["gemini-2.5-pro", "claude-sonnet-4-6"],
            "reasoning": ["claude-opus-4-7", "gpt-5"],
        }

    def classify(self, prompt):
        if len(prompt) > 50000:
            return "long_doc"
        if any(kw in prompt.lower() for kw in ["code", "function", "def ", "class "]):
            return "code"
        if any(kw in prompt.lower() for kw in ["solve", "calculate", "prove", "step by step"]):
            return "reasoning"
        return "factual"

    def call(self, prompt, max_retries=3):
        task = self.classify(prompt)
        chain = self.tier_chains[task]

        for model in chain[:max_retries]:
            try:
                return ask(prompt, model)
            except Exception as e:
                print(f"  {model} failed: {e}, trying next...")
                continue
        raise RuntimeError("All models failed")
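
The router above stores `budget_per_query` but never uses it. One way to enforce it is sketched below, under loud assumptions: about 4 characters per token (a rough heuristic, not a tokenizer), prices copied from the `PRICES` estimates in 3.1, and an injected `ask_fn(prompt, model)` standing in for the `ask` helper from 3.2 so the logic can be tested without network calls. `worst_case_cost` and `call_with_budget` are hypothetical names.

```python
# Sketch: fallback chain that skips models whose worst-case cost exceeds a per-query budget.
# Assumptions: ~4 characters per token (rough heuristic), prices are the 3.1 estimates,
# and ask_fn(prompt, model) stands in for the ask() helper from 3.2.

PRICES = {
    "claude-opus-4-7":   {"in": 15.0, "out": 75.0},
    "claude-sonnet-4-6": {"in": 3.0,  "out": 15.0},
    "gpt-5":             {"in": 10.0, "out": 50.0},
}


def worst_case_cost(prompt, model, max_output_tokens=512):
    """Upper-bound USD cost: estimated input tokens plus a fully used output limit."""
    est_input_tokens = len(prompt) / 4
    p = PRICES[model]
    return (est_input_tokens * p["in"] + max_output_tokens * p["out"]) / 1e6


def call_with_budget(prompt, chain, ask_fn, budget=0.05):
    """Try models in order; skip over-budget ones and fall through on provider errors."""
    errors = []
    for model in chain:
        if worst_case_cost(prompt, model) > budget:
            errors.append((model, "over budget"))
            continue
        try:
            return model, ask_fn(prompt, model)
        except Exception as e:  # provider outage, rate limit, etc.
            errors.append((model, repr(e)))
    raise RuntimeError(f"All models failed or were over budget: {errors}")
```

Returning the model name alongside the text makes it possible to log which tier actually served each request, which the per-route cost monitoring in the checklist depends on.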

4. Anthropic API Best Practices

4.1 When to choose Claude over others

✅ Where Claude has the edge:

Long agent loops with tool use (parallel + thinking)
Document/PDF-heavy workflows (Files API + citations)
Compliance-critical work (Constitutional AI training, safer defaults)
Code generation/review (leads on SWE-bench)
Cost-sensitive workloads, where the Sonnet tier strikes a good balance

❌ When not to choose Claude:

Need video understanding → Gemini
Need voice mode → GPT-5
Need a 2M-token context → Gemini 2.5 Pro
Need self-hosting → Llama
Extremely cost-sensitive simple tasks → Gemini Flash, or Claude's own Haiku tier

4.2 Using Claude via AWS Bedrock / Vertex AI

If your company already runs on AWS:

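import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.invoke_model(
    modelId="anthropic.claude-opus-4-7-20260615-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Hello"}]
    })
)
# The response body is a stream; decode it to get the Messages API payload
result = json.loads(resp["body"].read())
print(result["content"][0]["text"])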

Or via Vertex AI:

from anthropic import AnthropicVertex
client = AnthropicVertex(region="us-central1", project_id="my-proj")
client.messages.create(model="claude-opus-4-7@20260615", ...)

Benefits: data residency, unified billing, private networking.

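5. Applications in Finance

Case: model selection for a bank's internal AI platform

Requirements matrix:

Use case	Volume	Key requirements	Recommended model
Customer-service FAQ	10M/day	Speed + cost	Haiku 4.5
Compliance review	100K/day	Accuracy + auditability	Opus 4.7 + thinking
Internal research	5K/day	Deep reasoning	Opus 4.7
Code review (Copilot-like)	50K/day	Coding ability	Sonnet 4.6
Document analysis (RFPs, contracts)	1K/day	Long context	Gemini 2.5 Pro
Multilingual customer service (Chinese)	1M/day	Chinese + cost	Qwen 3 self-hosted

Case: legal contract analysis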

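def analyze_contract(contract_pdf):
    # Long document + citation requirement + accuracy = where Claude is strong
    return anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=20000,  # must exceed thinking.budget_tokens; leaves ~4K for the answer
        thinking={"type": "enabled", "budget_tokens": 16000},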
        messages=[{"role": "user", "content": [
            {"type": "document", "source": {...},
             "citations": {"enabled": True}},
            {"type": "text", "text": "Identify all liability clauses, indemnification terms, and termination conditions."}
        ]}]
    )

However, long-document comparison or clause cross-referencing may call for Gemini 2.5 Pro (2M context).

Multi-model architecture example

┌──────────────────────────────────────────────┐
│              API Gateway                      │
│         (rate limit, auth, log)               │
└────────┬─────────────────────────────────────┘
         │
    ┌────▼────┐
    │ Router  │ (classify task → pick model)
    └─┬──┬──┬─┘
      │  │  │
      ▼  ▼  ▼
   Anthropic OpenAI Google
   primary  fallback fallback
      │  │  │
      └──┴──┴── Logger / Audit

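6. Common Pitfalls

Choosing based on stale benchmarks: models iterate so fast that a comparison from 3 months ago is no longer accurate. Re-evaluate monthly.
Ignoring tail latency: an average of 1s does not mean a p99 of 1s. Production cares about p99.
Vendor lock-in: betting everything on one vendor. Keep at least 2 vendors actively integrated.
Trusting MMLU scores blindly: your real tasks likely have little to do with MMLU. Build a domain-specific eval set.
Not accounting for cached cost: with prompt caching enabled, effective cost drops sharply and can make an "expensive" model cheap.
Inconsistent behavior across providers: the same prompt yields different answers on different models. Add evals at the router layer to keep quality consistent.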
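
The tail-latency pitfall is easy to demonstrate with a few lines of arithmetic. The sketch below uses a nearest-rank percentile and a hypothetical latency sample (98 fast requests, 2 slow ones). The data is invented for illustration and `percentile` is an illustrative helper, not a library function.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in seconds: 98 requests at 1.0s, 2 requests at 9.0s
    latencies = [1.0] * 98 + [9.0] * 2
    mean = sum(latencies) / len(latencies)
    print(f"mean={mean:.2f}s  p50={percentile(latencies, 50):.1f}s  p99={percentile(latencies, 99):.1f}s")
```

Here the mean stays close to 1s while p99 is 9s, so an SLA written against the average would hide exactly the requests users complain about.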

7. Quick Reference

Model selection decision tree

Need open weights? → Llama / DeepSeek / Qwen
Need 1M+ context? → Gemini 2.5 Pro
Need video?       → Gemini
Need voice?       → GPT-5
Code intensive?   → Claude 4.7 / GPT-5
Compliance/audit? → Claude (CAI defaults stronger)
Cost sensitive?   → Haiku / Flash / Llama
General?          → Claude Sonnet 4.6 (best $/perf)

Multi-vendor checklist

At least 2 provider integrations
Unified interface via LiteLLM/Portkey
Per-task quality eval set (>50 examples)
Monthly re-benchmark
Fallback config (automatic switchover when a provider is down)
Cost monitoring per model
Latency SLA per route

2026 model ID quick reference

claude-opus-4-7
claude-sonnet-4-6
claude-haiku-4-5
gpt-5, gpt-5-mini, gpt-5-nano
gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.1-405B-Instruct
deepseek-ai/DeepSeek-R1
Qwen/Qwen3-72B-Instruct

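8. Interview Questions

Q1: Given a 100M/year LLM budget for a financial product, how would you allocate it?

Tier it: 80% of volume on Haiku/Flash (basic Q&A); 15% on Sonnet (deeper analysis); 5% on Opus (high-stakes decisions). Cache aggressively. Spread across multiple vendors to reduce lock-in risk. Consider self-hosted Llama for cold (low-traffic or offline) scenarios. Monitor cost and quality per route.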
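
The effect of this tiering can be sanity-checked with quick arithmetic. This is a minimal sketch, assuming the 80/15/5 split applies to input tokens (not just request counts) and using the estimated input prices from section 1.2. `blended_price` is an illustrative helper name.

```python
# Sketch: blended input price for a tiered traffic split.
# Assumptions: shares are fractions of input tokens; prices are the section 1.2 estimates
# in USD per 1M tokens.

INPUT_PRICE = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}
TIER_SHARE = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}


def blended_price(prices, shares):
    """Share-weighted average price across tiers."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(prices[tier] * share for tier, share in shares.items())


if __name__ == "__main__":
    blended = blended_price(INPUT_PRICE, TIER_SHARE)
    print(f"blended input price: ${blended:.2f} per 1M tokens")
    print(f"all-Opus would cost {INPUT_PRICE['opus'] / blended:.1f}x more")
```

Under these assumptions the blended input price works out to $1.84 per 1M tokens, roughly 8x below routing everything to Opus, and the 5% Opus slice still accounts for the largest share of that blended figure.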

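Q2: Why is Sonnet more popular than Opus in practice?

It is Pareto-optimal: Sonnet gets close to Opus quality on roughly 80% of tasks at 5x lower cost and with lower latency. Defaulting to Sonnet and escalating to Opus on specific triggers is the best practice for most products.

Q3: If the Anthropic API goes down for an hour, what happens to your product?

(1) Fallback chain: Anthropic API → Claude on Bedrock → a second AWS region → an equivalent tier on OpenAI/Gemini. (2) Serve cached hot responses as a degraded service. (3) Queue non-urgent requests. (4) Communicate transparently with users (status page, in-app message). Write the SLA and incident playbook into contracts.

Q4: How do you decide whether a task should use thinking?

Run both versions on an eval set: (a) Measure the accuracy gap: is thinking worth the extra cost? As a rule of thumb, an improvement of 5%+ is worth it. (b) Measure the latency gap: thinking typically adds 2-10s. If the UX cannot wait, skip it. (c) Cost: thinking tokens are billed as output, which can make requests 2-5x more expensive. Recommendation: trial it on 100 queries first, then set the default.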
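
The decision rule in Q4 can be written down as a small function. This is a sketch under stated assumptions: the "5%+" rule of thumb is read as 5 percentage points of accuracy, the latency budget comes from your own UX requirements, and `should_enable_thinking` plus the example numbers are hypothetical.

```python
# Sketch: decide whether to turn thinking on by default for a task.
# Inputs come from running both variants on your own eval set; min_gain=0.05 mirrors the
# "5%+" rule of thumb above (read here as 5 percentage points of accuracy).

def should_enable_thinking(acc_base, acc_thinking, added_latency_s, latency_budget_s,
                           min_gain=0.05):
    """Return (enable, reason) from eval-set accuracy and latency measurements."""
    gain = acc_thinking - acc_base
    if added_latency_s > latency_budget_s:
        return False, f"adds {added_latency_s:.1f}s, over the {latency_budget_s:.1f}s UX budget"
    if gain < min_gain:
        return False, f"accuracy gain {gain:+.1%} is below the {min_gain:.0%} threshold"
    return True, f"accuracy gain {gain:+.1%} fits within the latency budget"


if __name__ == "__main__":
    # Hypothetical eval results, for illustration only
    print(should_enable_thinking(0.84, 0.90, added_latency_s=4.0, latency_budget_s=10.0))
    print(should_enable_thinking(0.84, 0.86, added_latency_s=4.0, latency_budget_s=10.0))
```

Latency is checked first because a UX that cannot wait rules thinking out no matter how large the accuracy gain is. Cost could be added as a third gate in the same style.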

9. Tomorrow's Preview

Day 134: Week 20 review. The prompt engineering SOP, consolidating all of this week's methodology.