Expert Day 133

LLM Comparison and Selection: Claude 4.7, GPT-5, Gemini 2.5, and Llama in Practice

Architecture, pricing, and capability differences across Claude 4.7 / GPT-5 / Gemini 2.5 Pro / Llama 3.x


Date: 2026-09-11 | Direction: AI Systems Engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #ModelSelection #Claude #GPT5 #Gemini #Llama #Benchmark


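Today's Goals

| Type | Content |
| --- | --- |
| Study | Architecture, pricing, and capability differences across Claude 4.7 / GPT-5 / Gemini 2.5 Pro / Llama 3.x |
| Hands-on | Run the same tasks on 4 models and compare accuracy, latency, cost, and structured output reliability |
| Deliverable | model_compare.md: a model selection decision table |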

1. Theoretical Foundations

1.1 Mainstream Model Matrix (2026 Q3 snapshot)

| Model | Provider | Context | Multimodal | Notable features |
| --- | --- | --- | --- | --- |
| Claude 4.7 Opus | Anthropic | 200K (1M beta) | Vision, PDF | Extended thinking, tools, citations, prompt caching |
| Claude 4.6 Sonnet | Anthropic | 200K | Vision, PDF | Best $/perf for most tasks |
| Claude 4.5 Haiku | Anthropic | 200K | Vision | Fast, cheap |
| GPT-5 | OpenAI | 256K | Vision, voice | Native tool use, voice mode |
| GPT-5-mini | OpenAI | 256K | Vision | Cheap tier |
| Gemini 2.5 Pro | Google | 2M | Full multimodal (video, audio) | Massive context |
| Gemini 2.5 Flash | Google | 1M | Multimodal | Speed first |
| Llama 3.3 405B | Meta (open) | 128K | Vision (3.2) | Self-host, fine-tune |
| DeepSeek-R1 | DeepSeek (open) | 128K | Text only | Specialized for math/reasoning |
| Qwen 3 72B | Alibaba (open) | 128K | Vision | Strong in Chinese-language scenarios |

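1.2 Pricing Comparison (USD per 1M tokens, 2026)

| Model | Input | Output | Cached read |
| --- | --- | --- | --- |
| Claude 4.7 Opus | $15.00 | $75.00 | $1.50 |
| Claude 4.6 Sonnet | $3.00 | $15.00 | $0.30 |
| Claude 4.5 Haiku | $0.80 | $4.00 | $0.08 |
| GPT-5 | $10.00 | $50.00 | varies |
| GPT-5-mini | $1.00 | $5.00 | - |
| Gemini 2.5 Pro | $5.00 (>200K: $10.00) | $20.00 (>200K: $30.00) | - |
| Gemini 2.5 Flash | $0.30 | $2.50 | - |
| Llama 3.3 (Together) | $0.20 | $0.60 | - |
| DeepSeek-R1 (DeepInfra) | $0.55 | $2.19 | - |

(All figures are estimates; actual prices change quickly.)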
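
To make the "Cached read" column concrete, here is a minimal sketch of the blended input price as a function of the cache hit ratio. The function name `effective_input_price` and the 90% hit ratio are illustrative assumptions, and any cache-write surcharge is ignored.

```python
def effective_input_price(base: float, cached: float, hit_ratio: float) -> float:
    """Blended USD price per 1M input tokens when `hit_ratio` of them are cache reads."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cached + (1.0 - hit_ratio) * base

# (base, cached read) prices taken from the table above, USD per 1M input tokens
opus = effective_input_price(15.00, 1.50, hit_ratio=0.9)   # assumed 90% cache hits
sonnet = effective_input_price(3.00, 0.30, hit_ratio=0.0)  # no caching at all
print(f"Opus at 90% cache hits: ${opus:.2f}")
print(f"Sonnet with no caching: ${sonnet:.2f}")
```

Under these assumptions the Opus input price lands slightly below uncached Sonnet. Output tokens are still billed at the full rate, so this only narrows the gap on prompt-heavy workloads.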

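1.3 Qualitative Capability Comparison

| Capability | Strongest | Runner-up | Trailing |
| --- | --- | --- | --- |
| Code generation | Claude 4.7, GPT-5 | Sonnet, GPT-5-mini | Llama 3.3 |
| Math/Reasoning | GPT-5 (successor to o1), Claude Opus with thinking | DeepSeek-R1 | Gemini |
| Long context | Gemini 2.5 Pro (2M) | Claude 4.7 (1M beta) | GPT-5 (256K) |
| Vision/Multimodal | Gemini 2.5 Pro (video) | GPT-5 (voice) | Claude (strong on images and PDFs, no video) |
| Tool use / Agents | Claude 4.7 | GPT-5 | Gemini |
| Speed | Haiku, Flash | GPT-5-mini | Llama (depends on host) |
| Cost-effectiveness | Haiku, Flash, Llama | Sonnet | Opus |
| Open weights | Llama, DeepSeek, Qwen | - | Claude, GPT, and Gemini are all closed |
| Chinese | Qwen, GPT-5 | Claude | Gemini |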

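1.4 Architecture Insights

Architecture choices that can be inferred for each vendor (none officially confirmed):

  • Claude: dense decoder + thinking framework + heavy CAI (Constitutional AI) training
  • GPT-5: believed to be MoE, with o1-style test-time-compute thinking integrated
  • Gemini 2.5: MoE + dedicated long-context optimization (Pathways)
  • Llama 3.x: dense, with open weights at 405B, 70B, and 8B
  • DeepSeek-R1: MoE + extensive RL on math and code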

1.5 Evaluation Benchmarks

| Benchmark | What it measures | Top performer (2026) |
| --- | --- | --- |
| MMLU | General knowledge | Claude 4.7, GPT-5 (~92%) |
| HumanEval | Code | GPT-5 (~95%), Claude 4.7 (~94%) |
| GSM8K | Math | DeepSeek-R1, GPT-5 (~99%) |
| AIME | Olympiad math | GPT-5, Claude with thinking |
| SWE-bench | Real bugs | Claude 4.7 (~70%), GPT-5 (~65%) |
| LongBench | Long context | Gemini 2.5 Pro |
| HellaSwag | Commonsense | All saturated |
| MTEB | Embeddings | (different track) |

Note: benchmarks are easy to overfit. Real production behavior can diverge significantly from benchmark scores.
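
Since public scores can mislead, a small in-house eval is the usual counterweight. The sketch below measures exact-match accuracy on a tiny eval set; `EVAL_SET`, `normalize`, `exact_match_accuracy`, and the stub `fake_model` are hypothetical names used only for illustration, and a real run would replace the stub with an API call.

```python
from typing import Callable

# Hypothetical domain eval set: (prompt, expected answer) pairs
EVAL_SET = [
    ("What is the capital of France? Answer with one word.", "Paris"),
    ("60 km/h for 2.5 hours is how many km? Number only.", "150"),
]

def normalize(text: str) -> str:
    """Strip whitespace and a trailing period, then lowercase, so trivial formatting is not an error."""
    return text.strip().rstrip(".").strip().lower()

def exact_match_accuracy(model_fn: Callable[[str], str], eval_set) -> float:
    """Fraction of prompts whose normalized answer equals the normalized expected answer."""
    hits = sum(normalize(model_fn(p)) == normalize(exp) for p, exp in eval_set)
    return hits / len(eval_set)

def fake_model(prompt: str) -> str:
    """Stand-in for a real API call: right on the first prompt, wrong on the second."""
    return "Paris." if "France" in prompt else "140"

print(exact_match_accuracy(fake_model, EVAL_SET))
```

Exact match only suits short, closed-form answers; free-form tasks need a rubric or a grader model instead.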


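2. Intuition

Why is there no single "best" model?

Each dimension (cost, latency, capability ceiling, multimodal support) has a different winner. Model selection is a multi-objective optimization problem, and business requirements determine the weights.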
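
One way to picture "business requirements determine the weights" is a weighted score per model. This is only a sketch: the model names and the normalized scores in `SCORES` are made-up placeholders, not measurements.

```python
# Hypothetical normalized scores in [0, 1]; higher is better on every axis
SCORES = {
    "big-model":   {"quality": 0.95, "speed": 0.40, "cheapness": 0.20},
    "mid-model":   {"quality": 0.80, "speed": 0.70, "cheapness": 0.70},
    "small-model": {"quality": 0.65, "speed": 0.95, "cheapness": 0.95},
}

def pick(weights: dict) -> str:
    """Return the model with the highest weighted sum of scores."""
    def total(model: str) -> float:
        return sum(weights[k] * SCORES[model][k] for k in weights)
    return max(SCORES, key=total)

faq_bot = {"quality": 0.2, "speed": 0.4, "cheapness": 0.4}      # speed and cost dominate
compliance = {"quality": 0.8, "speed": 0.1, "cheapness": 0.1}   # quality dominates
print(pick(faq_bot))
print(pick(compliance))
```

The same score table yields a different winner for each weight vector, which is the whole point: changing the use case changes the answer without any model changing.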

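Why do you need multi-model fallback?

  • API outages are common (both OpenAI and Anthropic have had them)
  • Rate limits get hit
  • Different models are more accurate on different cases
  • Regulation or compliance may require vendor diversity
  • Price competition shifts quickly

Mature products generally run a model router.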

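Open-source vs closed-source trade-offs

Closed source (Claude/GPT/Gemini):

  • Strongest base-model capability
  • Nothing to deploy
  • But: lock-in, higher price, privacy concerns

Open source (Llama/Qwen/DeepSeek):

  • Self-hosted, so data never leaves your environment
  • Can be fine-tuned
  • But: complex infrastructure and a lower capability ceiling
  • Per-token cost can be 5-10x lower (if you have the scale)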
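
The "if you have the scale" caveat can be framed as a break-even volume. The sketch below is a back-of-the-envelope calculation only; the fixed monthly cost and the self-hosted marginal price are invented placeholders, and `break_even_tokens_millions` is a hypothetical helper.

```python
def break_even_tokens_millions(fixed_monthly_usd: float,
                               api_price_per_m: float,
                               selfhost_price_per_m: float) -> float:
    """Monthly volume (in millions of tokens) above which self-hosting is cheaper than the API."""
    saving = api_price_per_m - selfhost_price_per_m
    if saving <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_usd / saving

# Assumptions: $26,000/month for GPUs and ops, $3.00 per 1M tokens via API,
# $0.40 per 1M tokens marginal cost when self-hosted
volume = break_even_tokens_millions(26_000, 3.00, 0.40)
print(f"break-even at about {volume:,.0f}M tokens per month")
```

Below that volume the fixed infrastructure cost dominates and the API is cheaper, even though the per-token price looks worse.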

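3. Code Implementation

3.1 Multi-model benchmark script

# model_compare.py
"""
Run the same tasks against Anthropic, OpenAI, and Google models and compare them.
"""
import json
import time

import anthropic
import openai
import google.generativeai as genai

# Configuration: the Anthropic and OpenAI clients read ANTHROPIC_API_KEY and
# OPENAI_API_KEY from the environment
anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()
genai.configure(api_key="<GOOGLE_KEY>")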

# Test tasks
TEST_QUERIES = [
    {
        "name": "factual_qa",
        "prompt": "What is the capital of France? Answer with one word."
    },
    {
        "name": "math",
        "prompt": "If a train travels 60 km/h for 2.5 hours, how many km did it travel? Answer only with the number."
    },
    {
        "name": "code",
        "prompt": "Write a Python function fibonacci(n) using iteration. Output only code."
    },
    {
        "name": "extraction",
        "prompt": """Extract revenue from this report (number only, in millions USD):

Apple Q3 2026: Reported revenue of $94.5 billion, up 8.5% year over year."""
    },
    {
        "name": "reasoning",
        "prompt": """Janet has 24 candies. She gives 1/3 to Bob and 1/4 of the remainder to Tom. How many does she have left? Show your reasoning briefly."""
    },
]

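def call_anthropic(model, prompt, **kwargs):
    """Call the Anthropic Messages API; return text, latency (s), and token usage."""
    t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    resp = anthropic_client.messages.create(
        model=model, max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "text": resp.content[0].text,
        "latency": time.perf_counter() - t0,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }

def call_openai(model, prompt, **kwargs):
    """Call the OpenAI Chat Completions API; return text, latency (s), and token usage."""
    t0 = time.perf_counter()
    # Assumption: like OpenAI's o-series reasoning models, GPT-5 expects
    # max_completion_tokens instead of max_tokens and only the default temperature.
    # The limit also covers hidden reasoning tokens, so raise it if answers come back empty.
    resp = openai_client.chat.completions.create(
        model=model, max_completion_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "text": resp.choices[0].message.content,
        "latency": time.perf_counter() - t0,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }

def call_gemini(model, prompt, **kwargs):
    """Call the Gemini API; return text, latency (s), and token usage."""
    t0 = time.perf_counter()
    m = genai.GenerativeModel(model)
    resp = m.generate_content(prompt, generation_config={"temperature": 0, "max_output_tokens": 512})
    usage = resp.usage_metadata
    return {
        "text": resp.text,
        "latency": time.perf_counter() - t0,
        "input_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
    }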

# Cost calc
PRICES = {
    "claude-opus-4-7":   {"in": 15.0, "out": 75.0},
    "claude-sonnet-4-6": {"in": 3.0,  "out": 15.0},
    "claude-haiku-4-5":  {"in": 0.8,  "out": 4.0},
    "gpt-5":             {"in": 10.0, "out": 50.0},
    "gpt-5-mini":        {"in": 1.0,  "out": 5.0},
    "gemini-2.5-pro":    {"in": 5.0,  "out": 20.0},
    "gemini-2.5-flash":  {"in": 0.3,  "out": 2.5},
}

def cost(in_tok, out_tok, model):
    p = PRICES[model]
    return in_tok * p["in"] / 1e6 + out_tok * p["out"] / 1e6

# Run
MODELS = [
    ("claude-opus-4-7",   call_anthropic),
    ("claude-sonnet-4-6", call_anthropic),
    ("claude-haiku-4-5",  call_anthropic),
    ("gpt-5",             call_openai),
    ("gpt-5-mini",        call_openai),
    ("gemini-2.5-pro",    call_gemini),
    ("gemini-2.5-flash",  call_gemini),
]

results = {}
for q in TEST_QUERIES:
    results[q["name"]] = {}
    print(f"\n=== Task: {q['name']} ===")
    for model, fn in MODELS:
        try:
            r = fn(model, q["prompt"])
            c = cost(r["input_tokens"], r["output_tokens"], model)
            results[q["name"]][model] = {
                **r, "cost": c
            }
            print(f"  {model:<22} latency={r['latency']:.2f}s  "
                  f"cost=${c:.5f}  out={r['text'][:60]!r}")
        except Exception as e:
            print(f"  {model:<22} ERROR: {e}")

# Save
with open("model_compare_results.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
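
The goals list structured output reliability, which the script above does not measure. A minimal sketch of one way to score it: the fraction of raw outputs that parse as a JSON object with the required keys. `json_reliability`, the `revenue_musd` key, and the sample strings are hypothetical.

```python
import json

def json_reliability(outputs, required_keys):
    """Fraction of raw model outputs that parse as a JSON object containing all required keys."""
    ok = 0
    for raw in outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and set(required_keys).issubset(obj):
            ok += 1
    return ok / len(outputs) if outputs else 0.0

# Hypothetical raw outputs collected from repeated runs of an extraction prompt
samples = [
    '{"revenue_musd": 94500}',
    'Sure! Here is the JSON: {"revenue_musd": 94500}',  # prose wrapper breaks parsing
    '{"revenue": 94500}',                               # valid JSON, wrong key
    '{"revenue_musd": 94500, "currency": "USD"}',
]
print(json_reliability(samples, {"revenue_musd"}))
```

Running each prompt several times per model and storing this ratio next to latency and cost would give the fourth column the decision table asks for.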

3.2 A unified interface with LiteLLM

# unified_router.py
"""
LiteLLM unifies all providers: write the call once, then swap models without rewriting.
"""
# pip install litellm
from litellm import completion
import os

os.environ["ANTHROPIC_API_KEY"] = "..."
os.environ["OPENAI_API_KEY"] = "..."
os.environ["GEMINI_API_KEY"] = "..."

def ask(prompt, model="claude-sonnet-4-6"):
    """Send a single-turn prompt to any LiteLLM-supported model and return the reply text."""
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].message.content

# Same interface for every provider; the guard keeps the demo from running on import
if __name__ == "__main__":
    print(ask("Hello", "claude-haiku-4-5"))
    print(ask("Hello", "gpt-5-mini"))
    print(ask("Hello", "gemini/gemini-2.5-flash"))
    print(ask("Hello", "together_ai/meta-llama/Meta-Llama-3.3-405B-Instruct"))

3.3 Smart router with fallback

# smart_router.py
"""
Pick a model automatically based on task type and cost budget.
"""
from unified_router import ask  # ask() from section 3.2


class ModelRouter:
    def __init__(self, budget_per_query=0.05):
        # Per-query budget in USD; stored here but not yet enforced in call()
        self.budget = budget_per_query
        self.tier_chains = {
            "factual":   ["claude-haiku-4-5", "gpt-5-mini", "gemini-2.5-flash"],
            "code":      ["claude-sonnet-4-6", "gpt-5", "claude-opus-4-7"],
            "long_doc":  ["gemini-2.5-pro", "claude-sonnet-4-6"],
            "reasoning": ["claude-opus-4-7", "gpt-5"],
        }

    def classify(self, prompt):
        """Naive keyword heuristic; a production router would use a real classifier."""
        if len(prompt) > 50000:
            return "long_doc"
        text = prompt.lower()
        if any(kw in text for kw in ["code", "function", "def ", "class "]):
            return "code"
        if any(kw in text for kw in ["solve", "calculate", "prove", "step by step"]):
            return "reasoning"
        return "factual"

    def call(self, prompt, max_retries=3):
        """Try up to max_retries models from the task's chain, in order."""
        task = self.classify(prompt)
        chain = self.tier_chains[task]

        last_error = None
        for model in chain[:max_retries]:
            try:
                return ask(prompt, model)
            except Exception as e:
                last_error = e
                print(f"  {model} failed: {e}, trying next...")
        # Chain the last provider error so the root cause stays visible
        raise RuntimeError("All models failed") from last_error
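
The router stores `budget_per_query` but never checks it. One possible way to enforce it, sketched under stated assumptions: estimate a worst-case cost per model and drop models that exceed the budget before walking the chain. `estimate_cost`, `within_budget`, and the 4-characters-per-token ratio are assumptions for illustration, not part of the original router.

```python
# Prices copied from section 1.2 (USD per 1M tokens)
PRICES = {
    "claude-haiku-4-5":  {"in": 0.8,  "out": 4.0},
    "claude-sonnet-4-6": {"in": 3.0,  "out": 15.0},
    "claude-opus-4-7":   {"in": 15.0, "out": 75.0},
}

def estimate_cost(model, prompt, max_output_tokens=512, chars_per_token=4):
    """Worst-case USD cost: estimated input tokens plus a fully used output limit."""
    in_tok = len(prompt) / chars_per_token  # rough assumption for English text
    p = PRICES[model]
    return (in_tok * p["in"] + max_output_tokens * p["out"]) / 1e6

def within_budget(chain, prompt, budget):
    """Keep only the models whose estimated cost fits the per-query budget, preserving order."""
    return [m for m in chain if estimate_cost(m, prompt) <= budget]

chain = ["claude-sonnet-4-6", "claude-opus-4-7"]
prompt = "x" * 4000  # about 1,000 tokens under the 4-chars-per-token assumption
print(within_budget(chain, prompt, budget=0.02))
```

In `ModelRouter.call`, the filtered list could replace `chain` before the retry loop; an empty result would then need its own handling, for example falling back to the cheapest model.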

4. Anthropic API Best Practices

4.1 When to choose Claude over others

✅ Where Claude has the edge:

  • Long agent loops with tool use (parallel + thinking)
  • Document/PDF-heavy workflows (Files API + citations)
  • Compliance-critical work (Constitutional AI training, safer defaults)
  • Code generation/review (leads on SWE-bench)
  • Cost-sensitive workloads where the Sonnet tier balances price and quality

❌ When not to choose Claude:

  • Need video understanding → Gemini
  • Need voice mode → GPT-5
  • Need context beyond 1M tokens (up to 2M) → Gemini 2.5 Pro
  • Need to self-host → Llama
  • Extremely cost-sensitive simple tasks → Gemini Flash or Haiku

4.2 Using Claude via AWS Bedrock / Vertex AI

If your company already runs on AWS:

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.invoke_model(
    modelId="anthropic.claude-opus-4-7-20260615-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Hello"}]
    })
)
# The response body is a stream; read and decode it to get the Messages API payload
payload = json.loads(resp["body"].read())
print(payload["content"][0]["text"])

Or through Vertex AI:

from anthropic import AnthropicVertex
client = AnthropicVertex(region="us-central1", project_id="my-proj")
client.messages.create(model="claude-opus-4-7@20260615", ...)

Benefits: data residency, unified billing, and private networking.


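5. Applications in Finance

Case: model selection for a bank's internal AI platform

Requirements matrix:

| Use case | Volume | Key requirement | Recommended model |
| --- | --- | --- | --- |
| Customer-service FAQ | 10M/day | Speed + cost | Haiku 4.5 |
| Compliance review | 100K/day | Accuracy + auditability | Opus 4.7 + thinking |
| Internal research | 5K/day | Deep reasoning | Opus 4.7 |
| Code review (Copilot-like) | 50K/day | Coding ability | Sonnet 4.6 |
| Document analysis (RFPs, contracts) | 1K/day | Long context | Gemini 2.5 Pro |
| Multilingual customer service (Chinese) | 1M/day | Chinese + cost | Qwen 3, self-hosted |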

Case: legal contract analysis

def analyze_contract(contract_pdf):
    # Long document + citation requirement + accuracy = where Claude is strong
    return anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=20000,  # must be larger than thinking.budget_tokens
        thinking={"type": "enabled", "budget_tokens": 16000},
        messages=[{"role": "user", "content": [
            {"type": "document", "source": {...},
             "citations": {"enabled": True}},
            {"type": "text", "text": "Identify all liability clauses, indemnification terms, and termination conditions."}
        ]}]
    )

However, comparing long documents or cross-referencing clauses across them may call for Gemini 2.5 Pro (2M context).

Multi-model architecture example

┌──────────────────────────────────────────────┐
│              API Gateway                      │
│         (rate limit, auth, log)               │
└────────┬─────────────────────────────────────┘
         │
    ┌────▼────┐
    │ Router  │ (classify task → pick model)
    └─┬──┬──┬─┘
      │  │  │
      ▼  ▼  ▼
   Anthropic OpenAI Google
   primary  fallback fallback
      │  │  │
      └──┴──┴── Logger / Audit

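6. Common Pitfalls

  1. Choosing from stale benchmarks: models iterate so fast that a comparison from 3 months ago no longer holds. Re-evaluate monthly.
  2. Ignoring tail latency: a 1 s average does not imply a 1 s p99. Production cares about p99.
  3. Vendor lock-in: betting everything on one vendor. Keep active integrations with at least 2 vendors.
  4. Blind trust in MMLU scores: your real tasks rarely resemble MMLU. Build your own domain-specific eval set.
  5. Not accounting for cached cost: with prompt caching enabled, effective cost drops sharply, which can make an "expensive" model cheap.
  6. Inconsistent behavior across providers: the same prompt yields different answers on different models. Add evals at the router layer to keep quality consistent.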
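
Pitfall 2 is easy to demonstrate with a few numbers. The sketch below uses the standard library's `statistics.quantiles` to compare mean, p50, and p99; the latency samples are invented, and `latency_summary` is a hypothetical helper.

```python
import statistics

def latency_summary(samples):
    """Return mean, p50, and p99 of latency samples in seconds."""
    # n=100 yields 99 cut points; index 49 is the median and index 98 is p99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"mean": statistics.fmean(samples), "p50": cuts[49], "p99": cuts[98]}

# Hypothetical latencies: 98 fast requests and 2 slow outliers
samples = [0.8] * 98 + [9.0, 12.0]
s = latency_summary(samples)
print(f"mean={s['mean']:.2f}s  p50={s['p50']:.2f}s  p99={s['p99']:.2f}s")
```

With only 2% slow requests the mean stays under one second while p99 sits above nine, which is why an SLA written against the average hides the requests users complain about.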

7. Quick Reference

Selection decision tree

Need open weights? → Llama / DeepSeek / Qwen
Need 1M+ context? → Gemini 2.5 Pro
Need video?       → Gemini
Need voice?       → GPT-5
Code intensive?   → Claude 4.7 / GPT-5
Compliance/audit? → Claude (CAI defaults stronger)
Cost sensitive?   → Haiku / Flash / Llama
General?          → Claude Sonnet 4.6 (best $/perf)
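
The tree above can be encoded as a first-match function. This is a sketch that assumes the rules are checked top to bottom, which the table does not state explicitly; `choose_model` and its parameter names are hypothetical.

```python
def choose_model(*, open_weights=False, context_tokens=0, video=False, voice=False,
                 code_heavy=False, compliance=False, cost_sensitive=False):
    """Walk the decision tree top to bottom and return the first matching recommendation."""
    if open_weights:
        return "Llama / DeepSeek / Qwen"
    if context_tokens >= 1_000_000:
        return "Gemini 2.5 Pro"
    if video:
        return "Gemini"
    if voice:
        return "GPT-5"
    if code_heavy:
        return "Claude 4.7 / GPT-5"
    if compliance:
        return "Claude"
    if cost_sensitive:
        return "Haiku / Flash / Llama"
    return "Claude Sonnet 4.6"  # general-purpose default

print(choose_model(video=True))
print(choose_model(code_heavy=True, cost_sensitive=True))
print(choose_model())
```

When several flags are set, the earlier rule wins, so hard constraints (open weights, context size, modality) take priority over preferences such as cost.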

Multi-vendor checklist

  • At least 2 provider integrations
  • A unified interface via LiteLLM or Portkey
  • Per-task quality eval set (>50 examples)
  • Monthly re-benchmark
  • Fallback config (automatic switch when a provider is down)
  • Cost monitoring per model
  • Latency SLA per route

2026 model ID quick reference

claude-opus-4-7
claude-sonnet-4-6
claude-haiku-4-5
gpt-5, gpt-5-mini, gpt-5-nano
gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
meta-llama/Meta-Llama-3.3-405B-Instruct
deepseek-ai/DeepSeek-R1
Qwen/Qwen3-72B-Instruct

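8. Interview Questions

Q1: Given a financial product with a 100M/year LLM budget, how would you allocate it?

Split by tier: 80% of volume on Haiku/Flash (basic Q&A), 15% on Sonnet (in-depth analysis), and 5% on Opus (high-stakes decisions). Cache heavily. Spread across multiple vendors to reduce lock-in risk. For cold, low-urgency workloads, consider self-hosting Llama. Monitor cost and quality per route.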
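
A quick arithmetic check of what the 80/15/5 split means for spend, using the Anthropic prices from section 1.2. This is a sketch that assumes the shares are by token volume and that caching is off; the `TIERS` layout is illustrative.

```python
# Traffic share per tier (from the answer above) and prices in USD per 1M tokens
TIERS = {
    "haiku":  {"share": 0.80, "in": 0.80,  "out": 4.00},
    "sonnet": {"share": 0.15, "in": 3.00,  "out": 15.00},
    "opus":   {"share": 0.05, "in": 15.00, "out": 75.00},
}

# Volume-weighted average price across tiers
blended_in = sum(t["share"] * t["in"] for t in TIERS.values())
blended_out = sum(t["share"] * t["out"] for t in TIERS.values())
# Fraction of input spend that goes to the Opus tier
opus_spend_share = TIERS["opus"]["share"] * TIERS["opus"]["in"] / blended_in

print(f"blended input:  ${blended_in:.2f} per 1M tokens")
print(f"blended output: ${blended_out:.2f} per 1M tokens")
print(f"Opus share of input spend: {opus_spend_share:.0%}")
```

Under these assumptions the 5% of volume routed to Opus accounts for roughly two fifths of the spend, so the escalation rule for the top tier is where budget control matters most.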

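Q2: Why is Sonnet more popular than Opus in practice?

It is Pareto-optimal: Sonnet approaches Opus quality on roughly 80% of tasks at one fifth of the cost, with shorter latency. Defaulting to Sonnet and escalating to Opus on defined triggers is the best practice for most products.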

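Q3: If the Anthropic API goes down for an hour, what does your product do?

(1) Fallback chain: Anthropic API → Claude on Bedrock → a second AWS region → an equivalent-tier OpenAI/Gemini model. (2) Serve cached hot responses as a degraded service. (3) Queue non-urgent requests. (4) Communicate transparently with users (status page, in-app message). Write the SLA and the incident playbook into the contract.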

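Q4: How do you decide whether a task should use thinking?

Run both versions on an eval set. (a) Measure the accuracy gap: is thinking worth its cost? As a rule of thumb, an improvement of 5% or more is worth it. (b) Measure the latency gap: thinking typically adds 2-10 s, so skip it if the UX cannot wait. (c) Cost: thinking tokens are billed as output tokens, which can make a request 2-5x more expensive. Recommendation: trial it on 100 queries first, then decide the default.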
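
The rule of thumb above can be written as a small gate. A sketch only: it reads "5%+" as 5 percentage points of accuracy, which is an interpretation, and `should_enable_thinking` with its parameters is a hypothetical helper.

```python
def should_enable_thinking(acc_base, acc_thinking, extra_latency_s, latency_budget_s,
                           min_gain=0.05):
    """Enable thinking only if the accuracy gain clears the threshold and the added latency fits the UX budget."""
    gain = acc_thinking - acc_base
    return gain >= min_gain and extra_latency_s <= latency_budget_s

# Hypothetical eval results from a 100-query trial
print(should_enable_thinking(0.82, 0.90, extra_latency_s=4.0, latency_budget_s=6.0))
print(should_enable_thinking(0.82, 0.84, extra_latency_s=4.0, latency_budget_s=6.0))
print(should_enable_thinking(0.82, 0.90, extra_latency_s=8.0, latency_budget_s=3.0))
```

Cost is left out of the gate on purpose; it could be added as a third condition once the per-query price difference has been measured on the same trial.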


9. Tomorrow's Preview

Day 134: Week 20 review. A prompt engineering SOP that consolidates all of this week's methodology.