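LLM Comparison and Selection: Hands-On with Claude 4.7, GPT-5, Gemini 2.5, and Llama
Architecture, pricing, and capability differences across Claude 4.7, GPT-5, Gemini 2.5 Pro, and Llama 3.x
Date: 2026-09-11 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #ModelSelection #Claude #GPT5 #Gemini #Llama #Benchmark
Today's Goals
| Type | Content |
|---|---|
| Study | Architecture, pricing, and capability differences across Claude 4.7, GPT-5, Gemini 2.5 Pro, and Llama 3.x |
| Practice | Run the same tasks on 4 models and compare accuracy, latency, cost, and structured output reliability |
| Deliverable | model_compare.md: a model selection decision table |
1. Theoretical Foundations
1.1 Mainstream Model Matrix (2026 Q3 snapshot)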
| 模型 | Provider | Context | Multimodal | Notable Features |
|---|---|---|---|---|
| Claude 4.7 Opus | Anthropic | 200K (1M beta) | Vision, PDF | Extended thinking, tools, citations, prompt caching |
| Claude 4.6 Sonnet | Anthropic | 200K | Vision, PDF | Best $/perf for most tasks |
| Claude 4.5 Haiku | Anthropic | 200K | Vision | Fast, cheap |
| GPT-5 | OpenAI | 256K | Vision, voice | Native tool use, voice mode |
| GPT-5-mini | OpenAI | 256K | Vision | Cheap tier |
| Gemini 2.5 Pro | Google | 2M | Full multi-modal (video, audio) | Massive context |
| Gemini 2.5 Flash | Google | 1M | Multi-modal | Speed-first |
| Llama 3.x (up to 405B) | Meta (open) | 128K | Vision (3.2) | Self-host, fine-tune |
| DeepSeek-R1 | DeepSeek (open) | 128K | Text only | Specialized for math/reasoning |
| Qwen 3 72B | Alibaba (open) | 128K | Vision | Strong in Chinese-language scenarios |
1.2 Pricing Comparison (USD per 1M tokens, 2026)
| Model | Input | Output | Cached read |
|---|---|---|---|
| Claude 4.7 Opus | $15.00 | $75.00 | $1.50 |
| Claude 4.6 Sonnet | $3.00 | $15.00 | $0.30 |
| Claude 4.5 Haiku | $0.80 | $4.00 | $0.08 |
| GPT-5 | $10.00 | $50.00 | varies |
| GPT-5-mini | $1.00 | $5.00 | - |
| Gemini 2.5 Pro | $5.00 | $20.00 (>200K: $10/$30) | - |
| Gemini 2.5 Flash | $0.30 | $2.50 | - |
| Llama 3.3 (Together) | $0.20 | $0.60 | - |
| DeepSeek-R1 (DeepInfra) | $0.55 | $2.19 | - |
(All figures are estimates; actual prices change quickly.)
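As a worked illustration of how the table turns into a per-request cost, the sketch below prices one request with and without cached input. It assumes a fraction of the input tokens is billed at the "Cached read" rate and ignores any cache-write surcharge; `request_cost` and `cached_fraction` are hypothetical names, and the prices are the estimates from the table above.

```python
# USD per 1M tokens, copied from the pricing table above (estimates).
PRICES_PER_MTOK = {
    "claude-opus-4-7": {"in": 15.00, "out": 75.00, "cached": 1.50},
    "claude-sonnet-4-6": {"in": 3.00, "out": 15.00, "cached": 0.30},
    "claude-haiku-4-5": {"in": 0.80, "out": 4.00, "cached": 0.08},
}


def request_cost(model, input_tokens, output_tokens, cached_fraction=0.0):
    """Cost in USD of one request.

    cached_fraction is the share of input tokens served from the prompt
    cache (assumption: billed at the cached-read rate, no write surcharge).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    p = PRICES_PER_MTOK[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * p["in"] + cached * p["cached"] + output_tokens * p["out"]) / 1e6


if __name__ == "__main__":
    # 10,000 input tokens and 500 output tokens on Opus
    print(f"no cache:   ${request_cost('claude-opus-4-7', 10_000, 500):.4f}")
    print(f"90% cached: ${request_cost('claude-opus-4-7', 10_000, 500, 0.9):.4f}")
    print(f"Sonnet:     ${request_cost('claude-sonnet-4-6', 10_000, 500):.4f}")
```

Under these assumptions the uncached Opus request costs $0.1875, the 90%-cached one $0.0660, and the uncached Sonnet request $0.0375, so caching narrows the gap between tiers without closing it.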
1.3 Qualitative Capability Comparison
| Capability | Strongest | Runner-up | Trailing |
|---|---|---|---|
| Code generation | Claude 4.7, GPT-5 | Sonnet, GPT-5-mini | Llama 3.3 |
| Math/Reasoning | GPT-5 (successor to o1), Claude Opus thinking | DeepSeek-R1 | Gemini |
| Long context | Gemini 2.5 Pro (2M) | Claude 4.7 (1M beta) | GPT-5 (256K) |
| Vision/Multimodal | Gemini 2.5 Pro (video) | GPT-5 (voice) | Claude (strong on images and PDFs, no video) |
| Tool use / Agents | Claude 4.7 | GPT-5 | Gemini |
| Speed | Haiku, Flash | mini | Llama (depends on host) |
| Cost-effective | Haiku, Flash, Llama | Sonnet | Opus |
| Open weights | Llama, DeepSeek, Qwen | - | Claude/GPT/Gemini are all closed |
| Chinese | Qwen, GPT-5 | Claude | Gemini |
1.4 Architecture Insights
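Architecture choices that can be inferred for each vendor:
- Claude: Dense decoder + thinking framework + heavy CAI training
- GPT-5: Believed to be MoE, with o1-style test-time-compute thinking integrated
- Gemini 2.5: MoE plus dedicated long-context optimization (Pathways)
- Llama 3.x: Dense, with open weights at 8B, 70B, and 405B
- DeepSeek-R1: MoE plus heavy RL on math and code
1.5 Evaluation Benchmarks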
| Benchmark | What | Top performer (2026) |
|---|---|---|
| MMLU | General knowledge | Claude 4.7, GPT-5 (~92%) |
| HumanEval | Code | GPT-5 (~95%), Claude 4.7 (~94%) |
| GSM8K | Math | DeepSeek-R1, GPT-5 (~99%) |
| AIME | Olympiad math | GPT-5, Claude with thinking |
| SWE-bench | Real bugs | Claude 4.7 (~70%), GPT-5 (~65%) |
| LongBench | Long context | Gemini 2.5 Pro |
| HellaSwag | Commonsense | All saturated |
| MTEB | Embeddings | (different track) |
Note: benchmarks are easy to overfit. Real production behavior can diverge significantly from benchmark scores.
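2. Intuition
Why is there no "best" model?
Different dimensions (cost, latency, capability ceiling, multimodal support) have different winners. Model selection is a multi-objective optimization in which business requirements set the weights.
Why is multi-model fallback needed?
- API outages are common (both OpenAI and Anthropic have had them)
- Rate limits get hit
- Accuracy varies by model from case to case
- Regulation or compliance may require vendor diversity
- Price competition shifts quickly
Mature products generally run a model router.
Open vs. closed: the trade-offs
Closed (Claude/GPT/Gemini):
- Strongest base-model capability
- No deployment work
- But lock-in, higher cost, and privacy concerns
Open (Llama/Qwen/DeepSeek):
- Self-hosted, so data never leaves your environment
- Can be fine-tuned
- But the infrastructure is complex and the capability ceiling is lower
- Per-token cost can be 5-10x lower (if you have the scale)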
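The "if you have the scale" caveat can be made concrete with a break-even calculation. The sketch below is illustrative only: the function name and every dollar figure in the usage example are hypothetical assumptions, not measurements. It treats self-hosting as a fixed monthly cost (GPUs, ops) plus a lower marginal per-token cost.

```python
def breakeven_mtok_per_month(fixed_monthly_usd, api_usd_per_mtok, selfhost_usd_per_mtok):
    """Monthly volume, in millions of tokens, above which self-hosting wins.

    Self-hosting costs fixed_monthly_usd plus selfhost_usd_per_mtok per 1M
    tokens; the API costs api_usd_per_mtok per 1M tokens with no fixed fee.
    Returns infinity when self-hosting has no per-token advantage.
    """
    saving_per_mtok = api_usd_per_mtok - selfhost_usd_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")
    return fixed_monthly_usd / saving_per_mtok


if __name__ == "__main__":
    # Hypothetical inputs: $20,000/month fixed, API at $3.00 per 1M tokens,
    # self-hosted marginal cost at $0.40 per 1M tokens.
    volume = breakeven_mtok_per_month(20_000, 3.00, 0.40)
    print(f"break-even at about {volume:,.0f}M tokens per month")
```

With these made-up inputs the break-even sits near 7,700M tokens per month; below that volume the fixed cost dominates and the API is cheaper.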
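3. Code Implementation
3.1 Multi-model benchmark script
# model_compare.py
"""
Run the same tasks against several providers and compare the results.
"""
import json
import os
import time
import anthropic
import google.generativeai as genai
import openai
# Configuration: every key is read from the environment, never hard-coded
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
openai_client = openai.OpenAI()  # reads OPENAI_API_KEY
genai.configure(api_key=os.environ["GEMINI_API_KEY"])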
# Test tasks
TEST_QUERIES = [
    {
        "name": "factual_qa",
        "prompt": "What is the capital of France? Answer with one word.",
    },
    {
        "name": "math",
        "prompt": "If a train travels 60 km/h for 2.5 hours, how many km did it travel? Answer only with the number.",
    },
    {
        "name": "code",
        "prompt": "Write a Python function fibonacci(n) using iteration. Output only code.",
    },
    {
        "name": "extraction",
        "prompt": """Extract revenue from this report (number only, in millions USD):
Apple Q3 2026: Reported revenue of $94.5 billion, up 8.5% year over year.""",
    },
    {
        "name": "reasoning",
        "prompt": """Janet has 24 candies. She gives 1/3 to Bob and 1/4 of the remainder to Tom. How many does she have left? Show your reasoning briefly.""",
    },
]
def call_anthropic(model, prompt, **kwargs):
    """Call the Anthropic Messages API; return text, latency, and token usage."""
    t0 = time.time()
    resp = anthropic_client.messages.create(
        model=model, max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "text": resp.content[0].text,
        "latency": time.time() - t0,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }
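def call_openai(model, prompt, **kwargs):
    """Call the OpenAI Chat Completions API; same return shape as call_anthropic."""
    t0 = time.time()
    # Reasoning-style models take max_completion_tokens rather than max_tokens
    # and may reject a non-default temperature, so temperature is left unset.
    resp = openai_client.chat.completions.create(
        model=model, max_completion_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "text": resp.choices[0].message.content,
        "latency": time.time() - t0,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
    }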
def call_gemini(model, prompt, **kwargs):
    """Call the Gemini API; same return shape as call_anthropic."""
    t0 = time.time()
    m = genai.GenerativeModel(model)
    resp = m.generate_content(
        prompt,
        generation_config={"temperature": 0, "max_output_tokens": 512},
    )
    usage = resp.usage_metadata
    return {
        "text": resp.text,
        "latency": time.time() - t0,
        "input_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
    }
# Cost calculation (USD per 1M tokens, from section 1.2)
PRICES = {
    "claude-opus-4-7": {"in": 15.0, "out": 75.0},
    "claude-sonnet-4-6": {"in": 3.0, "out": 15.0},
    "claude-haiku-4-5": {"in": 0.8, "out": 4.0},
    "gpt-5": {"in": 10.0, "out": 50.0},
    "gpt-5-mini": {"in": 1.0, "out": 5.0},
    "gemini-2.5-pro": {"in": 5.0, "out": 20.0},
    "gemini-2.5-flash": {"in": 0.3, "out": 2.5},
}
def cost(in_tok, out_tok, model):
    """Return the cost in USD of one request with the given token counts."""
    p = PRICES[model]
    return in_tok * p["in"] / 1e6 + out_tok * p["out"] / 1e6
# Run every task against every model
MODELS = [
    ("claude-opus-4-7", call_anthropic),
    ("claude-sonnet-4-6", call_anthropic),
    ("claude-haiku-4-5", call_anthropic),
    ("gpt-5", call_openai),
    ("gpt-5-mini", call_openai),
    ("gemini-2.5-pro", call_gemini),
    ("gemini-2.5-flash", call_gemini),
]
results = {}
for q in TEST_QUERIES:
    results[q["name"]] = {}
    print(f"\n=== Task: {q['name']} ===")
    for model, fn in MODELS:
        try:
            r = fn(model, q["prompt"])
            c = cost(r["input_tokens"], r["output_tokens"], model)
            results[q["name"]][model] = {
                **r, "cost": c
            }
            print(f"  {model:<22} latency={r['latency']:.2f}s "
                  f"cost=${c:.5f} out={r['text'][:60]!r}")
        except Exception as e:
            # Keep going so one failing provider does not abort the whole run
            print(f"  {model:<22} ERROR: {e}")
# Save the raw results for later analysis
with open("model_compare_results.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
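The script records latency and cost but does not grade the answers. A minimal sketch of an accuracy pass over the saved `results` structure follows. `EXPECTED`, `is_correct`, and `accuracy_by_model` are hypothetical names; the expected answers are worked out by hand from the prompts (94.5 billion USD is 94,500 million; 24 - 8 - 4 = 12), and the `code` task is skipped because it would need execution-based grading.

```python
import re

# Hand-derived answers for the gradable tasks (assumption: not in the notes).
EXPECTED = {"factual_qa": "paris", "math": "150", "extraction": "94500", "reasoning": "12"}


def tokenize(text):
    """Lowercase, drop thousands separators, split into words and numbers."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower().replace(",", ""))


def _matches(token, expected):
    """Compare numerically when both sides parse as numbers, else as strings."""
    try:
        return float(token) == float(expected)
    except ValueError:
        return token == expected


def is_correct(task, output):
    """Lenient grading: the expected answer must appear in the output."""
    tokens = tokenize(output)
    if task == "reasoning":
        # The model shows its work, so grade only the last number it states.
        # Limitation: fails if the working is restated after the answer.
        numbers = [t for t in tokens if t[0].isdigit()]
        return bool(numbers) and _matches(numbers[-1], EXPECTED[task])
    return any(_matches(t, EXPECTED[task]) for t in tokens)


def accuracy_by_model(results):
    """results maps task -> model -> {"text": ...}; returns model -> accuracy."""
    hits, totals = {}, {}
    for task, per_model in results.items():
        if task not in EXPECTED:
            continue  # e.g. "code" is not graded here
        for model, record in per_model.items():
            totals[model] = totals.get(model, 0) + 1
            hits[model] = hits.get(model, 0) + is_correct(task, record["text"])
    return {model: hits[model] / totals[model] for model in totals}
```

Token matching like this is deliberately forgiving; for a real selection decision, a larger task set with stricter, task-specific graders would be more trustworthy.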
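3.2 Unified interface with LiteLLM
# unified_router.py
"""
LiteLLM puts every provider behind one interface: write the call once and
switch models without rewriting it.
"""
# pip install litellm
import os
import litellm
from litellm import completion
# Placeholders: supply real keys through your environment or secret manager
os.environ["ANTHROPIC_API_KEY"] = "..."
os.environ["OPENAI_API_KEY"] = "..."
os.environ["GEMINI_API_KEY"] = "..."
# Drop parameters a provider does not accept instead of raising an error
litellm.drop_params = True
def ask(prompt, model="claude-sonnet-4-6"):
    """Send a single-turn prompt to any supported model and return its text."""
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].message.content
# Same interface for every provider. Model IDs are as recorded in these notes;
# if LiteLLM cannot infer the provider from a bare name, add a provider prefix
# such as "anthropic/" or "openai/".
if __name__ == "__main__":
    print(ask("Hello", "claude-haiku-4-5"))
    print(ask("Hello", "gpt-5-mini"))
    print(ask("Hello", "gemini/gemini-2.5-flash"))
    print(ask("Hello", "together_ai/meta-llama/Meta-Llama-3.3-405B-Instruct"))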
3.3 Smart router with fallback
# smart_router.py
"""
Automatically pick a model based on task type and cost budget.
"""
from unified_router import ask
class ModelRouter:
    def __init__(self, budget_per_query=0.05):
        # Stored for cost-aware routing; call() does not enforce it yet
        self.budget = budget_per_query
        self.tier_chains = {
            "factual": ["claude-haiku-4-5", "gpt-5-mini", "gemini-2.5-flash"],
            "code": ["claude-sonnet-4-6", "gpt-5", "claude-opus-4-7"],
            "long_doc": ["gemini-2.5-pro", "claude-sonnet-4-6"],
            "reasoning": ["claude-opus-4-7", "gpt-5"],
        }
    def classify(self, prompt):
        """Keyword heuristic; the length threshold counts characters, not tokens."""
        if len(prompt) > 50000:
            return "long_doc"
        if any(kw in prompt.lower() for kw in ["code", "function", "def ", "class "]):
            return "code"
        if any(kw in prompt.lower() for kw in ["solve", "calculate", "prove", "step by step"]):
            return "reasoning"
        return "factual"
    def call(self, prompt, max_retries=3):
        """Try up to max_retries models from the task's chain, in order."""
        task = self.classify(prompt)
        chain = self.tier_chains[task]
        for model in chain[:max_retries]:
            try:
                return ask(prompt, model)
            except Exception as e:
                print(f"  {model} failed: {e}, trying next...")
                continue
        raise RuntimeError("All models failed")
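`ModelRouter` stores `budget_per_query`, but `call` never consults it. One possible way to enforce the budget, sketched below, is to estimate each candidate's worst-case cost up front and drop models that exceed it. `estimate_cost` and `affordable` are hypothetical helpers, the prices are the section 1.2 estimates, and the 4-characters-per-token ratio is a rough assumption rather than a real tokenizer.

```python
# USD per 1M tokens (input, output), from the section 1.2 estimates.
PRICES = {
    "claude-opus-4-7": (15.0, 75.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5": (0.8, 4.0),
    "gpt-5": (10.0, 50.0),
    "gpt-5-mini": (1.0, 5.0),
    "gemini-2.5-pro": (5.0, 20.0),
    "gemini-2.5-flash": (0.3, 2.5),
}


def estimate_cost(model, prompt, max_output_tokens=512, chars_per_token=4):
    """Upper-bound cost in USD, assuming the full output allowance is used."""
    price_in, price_out = PRICES[model]
    input_tokens = len(prompt) / chars_per_token  # crude length-based estimate
    return (input_tokens * price_in + max_output_tokens * price_out) / 1e6


def affordable(chain, prompt, budget):
    """Keep the chain's order but drop models whose estimate exceeds budget."""
    return [m for m in chain if estimate_cost(m, prompt) <= budget]


if __name__ == "__main__":
    chain = ["claude-sonnet-4-6", "gpt-5", "claude-opus-4-7"]
    prompt = "x" * 4000  # roughly 1,000 tokens under the 4-chars assumption
    print(affordable(chain, prompt, budget=0.05))
```

For this roughly 1,000-token prompt and a $0.05 budget, the Opus estimate ($0.0534) is over budget, so only Sonnet and GPT-5 remain in the chain. Inside `call`, the loop could iterate over `affordable(chain, prompt, self.budget)` instead of `chain`.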
4. Anthropic API Best Practices
4.1 When to choose Claude over others
✅ Where Claude has the edge:
- Long agent loops with tool use (parallel + thinking)
- Document/PDF heavy workflow (Files API + citations)
- Compliance-critical (Constitutional AI training, safer defaults)
- Code generation/review (leads on SWE-bench)
- Cost-sensitive workloads, where the Sonnet tier strikes a good balance
❌ When not to choose Claude:
- Need video understanding → Gemini
- Need voice mode → GPT-5
- Need context up to 2M tokens → Gemini 2.5 Pro
- Need to self-host → Llama
- Extremely cost-sensitive simple tasks → Gemini Flash (or, within the Claude family, Haiku)
4.2 Using Claude through AWS Bedrock / Vertex AI
If your company already runs on AWS:
import json
import boto3
client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.invoke_model(
    modelId="anthropic.claude-opus-4-7-20260615-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Hello"}],
    }),
)
# The response body is a stream holding the JSON-encoded message
print(json.loads(resp["body"].read())["content"][0]["text"])
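Or via Vertex AI:
from anthropic import AnthropicVertex
client = AnthropicVertex(region="us-central1", project_id="my-proj")
# The notes elide the remaining arguments; these mirror the Bedrock example
resp = client.messages.create(
    model="claude-opus-4-7@20260615",
    max_tokens=512,
    messages=[{"role": "user", "content": "Hello"}],
)
Benefits: data residency, consolidated billing, and private networking.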
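5. Applications in Finance
Case: model selection for a bank's internal AI platform
Requirements matrix:
| Use case | Volume | Key requirements | Recommended model |
|---|---|---|---|
| Customer-service FAQ | 10M/day | Speed + cost | Haiku 4.5 |
| Compliance review | 100K/day | Accuracy + auditability | Opus 4.7 + thinking |
| Internal research | 5K/day | Deep reasoning | Opus 4.7 |
| Code review (Copilot-like) | 50K/day | Coding ability | Sonnet 4.6 |
| Document analysis (RFPs, contracts) | 1K/day | Long context | Gemini 2.5 Pro |
| Multilingual customer service (Chinese) | 1M/day | Chinese + cost | Qwen 3 self-hosted |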
Case: legal contract analysis
def analyze_contract(contract_pdf):
    # Long document + citation requirement + accuracy: where Claude is strong
    return anthropic_client.messages.create(
        model="claude-opus-4-7",
        # max_tokens must exceed the thinking budget and leave room for the answer
        max_tokens=20000,
        thinking={"type": "enabled", "budget_tokens": 16000},
        messages=[{"role": "user", "content": [
            # The document source is elided in these notes
            {"type": "document", "source": {...},
             "citations": {"enabled": True}},
            {"type": "text", "text": "Identify all liability clauses, indemnification terms, and termination conditions."}
        ]}]
    )
However, comparing long documents or cross-referencing clauses across them may call for Gemini 2.5 Pro (2M context).
Example multi-model architecture
┌──────────────────────────────────────────────┐
│                 API Gateway                  │
│           (rate limit, auth, log)            │
└────────┬─────────────────────────────────────┘
         │
    ┌────▼────┐
    │ Router  │  (classify task → pick model)
    └─┬──┬──┬─┘
      │  │  │
      ▼  ▼  ▼
  Anthropic  OpenAI    Google
   primary   fallback  fallback
      │  │  │
      └──┴──┴── Logger / Audit
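6. Common Pitfalls
- Choosing on stale benchmarks: models iterate so fast that a comparison from 3 months ago no longer holds. Re-evaluate monthly.
- Ignoring tail latency: an average of 1s does not mean a p99 of 1s. Production cares about p99.
- Vendor lock-in: betting everything on one vendor. Keep at least 2 vendors actively integrated.
- Blind trust in MMLU scores: your real task has little to do with MMLU. Build your own domain-specific eval set.
- Not counting cached cost: with prompt caching enabled, effective cost drops substantially, which can make an "expensive" model cheap.
- Inconsistent behavior across providers: the same prompt yields different answers on different models. Add evals at the router layer to keep quality consistent.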
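The tail-latency pitfall is easy to see with numbers. The sketch below uses a nearest-rank percentile (one of several common definitions) on a made-up sample in which 2 of 100 requests are slow; `percentile` is a hypothetical helper and the latencies are illustrative, not measured.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Illustrative latencies in seconds: 98 fast requests and 2 slow ones
    latencies = [1.0] * 98 + [9.0] * 2
    print(f"mean={mean(latencies):.2f}s "
          f"p50={percentile(latencies, 50):.1f}s "
          f"p99={percentile(latencies, 99):.1f}s")
```

Here the mean is 1.16s and the median 1.0s, yet the p99 is 9.0s: the average hides exactly the requests that users complain about.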
7. Quick Reference
Selection decision tree
Need open weights? → Llama / DeepSeek / Qwen
Need 1M+ context? → Gemini 2.5 Pro
Need video? → Gemini
Need voice? → GPT-5
Code intensive? → Claude 4.7 / GPT-5
Compliance/audit? → Claude (CAI defaults stronger)
Cost sensitive? → Haiku / Flash / Llama
General? → Claude Sonnet 4.6 (best $/perf)
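The tree above can be encoded as an ordered rule list. This sketch assumes the questions are checked top to bottom and the first match wins, which the notes imply by reading order but do not state; `RULES`, `pick_model`, and the requirement names are hypothetical.

```python
# (requirement flag, recommendation), in the same order as the decision tree.
RULES = [
    ("needs_open_weights", "Llama / DeepSeek / Qwen"),
    ("needs_1m_context", "Gemini 2.5 Pro"),
    ("needs_video", "Gemini"),
    ("needs_voice", "GPT-5"),
    ("code_intensive", "Claude 4.7 / GPT-5"),
    ("compliance", "Claude"),
    ("cost_sensitive", "Haiku / Flash / Llama"),
]
DEFAULT = "Claude Sonnet 4.6"


def pick_model(**needs):
    """Return the first recommendation whose requirement flag is truthy."""
    unknown = set(needs) - {name for name, _ in RULES}
    if unknown:
        raise ValueError(f"unknown requirement(s): {sorted(unknown)}")
    for name, recommendation in RULES:
        if needs.get(name):
            return recommendation
    return DEFAULT


if __name__ == "__main__":
    print(pick_model(needs_video=True))
    print(pick_model(code_intensive=True, cost_sensitive=True))
    print(pick_model())
```

Because the first match wins, a request that is both code-intensive and cost-sensitive resolves to the code branch; reordering `RULES` is the way to express different priorities.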
Multi-vendor checklist
- At least 2 provider integrations
- Unified interface via LiteLLM or Portkey
- Per-task quality eval set (>50 examples)
- Monthly re-benchmark
- Fallback config (automatic switchover when a provider is down)
- Cost monitoring per model
- Latency SLA per route
2026 model ID quick reference
claude-opus-4-7
claude-sonnet-4-6
claude-haiku-4-5
gpt-5, gpt-5-mini, gpt-5-nano
gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
meta-llama/Meta-Llama-3.3-405B-Instruct
deepseek-ai/DeepSeek-R1
Qwen/Qwen3-72B-Instruct
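8. Interview Questions
Q1: Given a financial product with a 100M/year LLM budget, how would you allocate it?
Split by tier: route 80% of volume to Haiku/Flash (basic Q&A), 15% to Sonnet (deeper analysis), and 5% to Opus (high-stakes decisions). Lean heavily on prompt caching. Spread lock-in risk across multiple vendors. For cold-path workloads, consider self-hosted Llama. Monitor cost and quality per route.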
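A volume split is not a spend split, which the following arithmetic sketch makes visible. It applies the 80/15/5 volume shares to the section 1.2 price estimates; the request shape of 1,000 input and 300 output tokens is a made-up assumption, and `spend_breakdown` is a hypothetical helper.

```python
# USD per 1M tokens (input, output), from the section 1.2 estimates.
PRICES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}
# Share of request volume routed to each tier, as in the answer above.
VOLUME_SHARE = {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05}


def spend_breakdown(input_tokens, output_tokens):
    """Return (blended USD per request, share of total spend per tier)."""
    per_request = {
        tier: (input_tokens * p_in + output_tokens * p_out) / 1e6
        for tier, (p_in, p_out) in PRICES.items()
    }
    weighted = {tier: VOLUME_SHARE[tier] * per_request[tier] for tier in PRICES}
    blended = sum(weighted.values())
    return blended, {tier: w / blended for tier, w in weighted.items()}


if __name__ == "__main__":
    # Assumed request shape: 1,000 input tokens and 300 output tokens
    blended, shares = spend_breakdown(1_000, 300)
    print(f"blended cost per request: ${blended:.4f}")
    for tier, share in shares.items():
        print(f"  {tier:<7} {share:.1%} of spend")
```

Under these assumptions the blended cost is $0.0046 per request, and the 5% of traffic on Opus accounts for about 41% of spend, so the escalation criteria for Opus deserve the most scrutiny.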
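Q2: Why is Sonnet more popular than Opus in practice?
It is Pareto-optimal: Sonnet comes close to Opus quality on 80% of tasks at one fifth of the cost and with lower latency. Defaulting to Sonnet and escalating to Opus on specific triggers is the best practice for most products.
Q3: If the Anthropic API goes down for 1 hour, what does your product do?
(1) Fallback chain: Anthropic API → Anthropic on Bedrock → a second AWS region → an equivalent tier from OpenAI/Gemini. (2) Serve cached hot responses as a degraded service. (3) Queue non-urgent requests. (4) Communicate transparently with users (status page, in-app message). Write the SLA and incident playbook into the contract.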
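Q4: How do you decide whether a task should use thinking?
Run both variants on your eval set. (a) Accuracy gap: is the gain from thinking worth the cost? As a rule of thumb, an improvement of 5% or more is worth it. (b) Latency gap: thinking typically adds 2-10s; if the UX cannot wait, skip it. (c) Cost: thinking tokens are billed as output, so a request can cost 2-5x more. Recommendation: trial it on 100 queries first, then decide the default.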
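The rule of thumb above can be written down as a small decision helper. This sketch interprets the 5% threshold as 5 absolute percentage points of accuracy, which is an assumption, and treats cost as something to report rather than gate on; `should_enable_thinking` and its parameters are hypothetical names.

```python
def should_enable_thinking(acc_base, acc_thinking, added_latency_s,
                           latency_headroom_s, min_gain=0.05):
    """Decide whether thinking should be the default for a task.

    acc_base / acc_thinking: accuracy on the eval set, as fractions in [0, 1].
    added_latency_s: extra seconds per request measured with thinking on.
    latency_headroom_s: extra seconds the UX can tolerate.
    min_gain: minimum accuracy gain (assumed to mean absolute points).
    """
    gain = acc_thinking - acc_base
    return gain >= min_gain and added_latency_s <= latency_headroom_s


if __name__ == "__main__":
    # Illustrative numbers, not measurements
    print(should_enable_thinking(0.84, 0.90, added_latency_s=4.0, latency_headroom_s=5.0))
    print(should_enable_thinking(0.84, 0.87, added_latency_s=4.0, latency_headroom_s=5.0))
    print(should_enable_thinking(0.84, 0.95, added_latency_s=8.0, latency_headroom_s=5.0))
```

With a 100-query trial, each query moves accuracy by one point, so gains near the threshold are within noise; a larger eval set or repeated runs would make a borderline decision more reliable.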
9. Coming Up Tomorrow
Day 134: Week 20 review. Prompt engineering SOP, consolidating all of this week's methodology.