Long Context vs RAG——1M Context与Prompt Caching的真实成本
### 1.1 Long Context的能力跃升
日期: 2026-09-23 方向: AI系统工程 / RAG 阶段: Phase 3 - RAG高级模式 (Day 135-148) 标签: #LongContext #PromptCaching #Claude1M #RAGTradeoff
今日目标
| 类型 | 内容 |
|---|---|
| 学习 | Claude/Gemini 1M+ context真实能力(needle-in-haystack, lost-in-the-middle);prompt caching (Anthropic) 节省80% cost原理;何时long context取代RAG |
| 实操 | 实测对比:(a) RAG (5 chunks, 4K tok context) (b) Long Context (full doc 200K tok) (c) Long Context + prompt cache,测answer accuracy、latency、cost |
| 产出 | tradeoff.md 实测数据报告、决策框架(何时RAG / 何时long context / 何时混合) |
核心结论预告:在 single-doc QA 上,long context + prompt cache是cheaper than RAG (after 5+ queries on same doc);在 multi-doc或频繁更新doc 场景,RAG仍是赢家。最优策略:hybrid——把核心高频文档放long context cache,长尾文档用RAG。
一、核心概念
1.1 Long Context的能力跃升
| 模型 | Max context | 真实有效长度 (NIAH 95%+) |
|---|---|---|
| GPT-4 (2023) | 128K | ~64K |
| Claude 3 Opus (2024) | 200K | ~128K |
| Gemini 1.5 Pro (2024) | 1M (2M Q4 2024) | ~700K |
| Claude Opus 4.7 (2026) | 1M | ~900K |
| Claude Sonnet 4.5 (2026) | 200K (1M premium tier) | ~180K |
NIAH = Needle-in-a-Haystack:在长context里埋一个事实,问模型能否找出来。
1.2 Lost-in-the-Middle现象
Liu et al., 2023 发现:
Accuracy on Needle-in-Haystack:
Position at start (top 10%): 95%+
Position at middle (40-60%): 65%
Position at end (bottom 10%): 90%+
→ 信息在 中间 容易被忽略!
新一代模型改善:Claude 4.7 / Gemini 2 在NIAH几乎无middle dip,但 多事实检索 仍受影响。
1.3 Anthropic Prompt Caching
Anthropic prompt caching (2024年8月发布):
正常 input cost: $3 / M tokens (Sonnet 4.5)
Cache write: $3.75 / M tokens (1.25x, 5min TTL)
Cache read: $0.30 / M tokens (10% of input cost)
关键节约:
# 第1次query: 200K context
# Input: 200K × $3/M = $0.60 (建cache, 实际$0.75)
# 第2-N次query (5min内): cached
# Input: 200K × $0.30/M = $0.06 (省90%)
→ 对同一文档多次查询,摊薄成本到几乎免费。
1.4 1h cache (Long-lived cache)
Anthropic 2025 release: 1小时cache ($6/M write, $0.30/M read):
Use case: 客户dashboard每天加载10K query against同一doc set
1h cache fits within session, ~$0.06 per query on 200K context
1.5 Long Context vs RAG的本质trade-off
| 维度 | RAG | Long Context (cached) |
|---|---|---|
| Per-query cost | embed cost + LLM (4K tok) | LLM (200K cached) |
| Latency | retrieval延迟 + 短LLM | 长LLM (TTFT慢) |
| Setup复杂度 | high (chunk+embed+index+rerank) | low (just send doc) |
| Doc更新 | reindex | 直接送新doc |
| 答案质量 | 依赖retrieval | 依赖LLM context utilization |
| Multi-doc | scales naturally | hits context limit |
| Cost at scale | linear with queries | sublinear with cache |
二、实测设计
2.1 测试场景
3个文档:Apple 10-K (200K tok), Tesla 10-K (180K tok), JPM Annual (250K tok)
3种RAG/Context strategies:
- RAG_v2 (Day 141): 5 chunks ~4K tok context
- Long_Context_no_cache: 整本doc 200K tok 每次重发
- Long_Context_cached: 整本doc 200K tok cache 5min
20对query (mix of single-fact, summary, multi-section)
2.2 评估指标
- Accuracy (人工评分0-10)
- Faithfulness (Ragas评分)
- Latency (p50, p95)
- Cost ($/query)
三、实测代码
"""
tradeoff_test.py — Long context vs RAG benchmark
"""
import os
import time
import json
from typing import Dict, List
from anthropic import Anthropic
from rag_v2 import RAGConfig, rag_v2_query, index_chunks, load_and_chunk
anthropic = Anthropic()
# ============================================================
# 1. RAG approach
# ============================================================
async def rag_approach(query: str, idx, cfg) -> Dict:
return await rag_v2_query(idx, query, cfg)
# ============================================================
# 2. Long Context (no cache)
# ============================================================
def long_context_no_cache(query: str, doc_text: str) -> Dict:
t0 = time.time()
resp = anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
system="You are a financial analyst. Answer based strictly on the document.",
messages=[{
"role": "user",
"content": f"DOCUMENT:\n{doc_text}\n\nQUESTION: {query}"
}],
)
latency = (time.time() - t0) * 1000
return {
"answer": resp.content[0].text,
"latency_ms": latency,
"input_tokens": resp.usage.input_tokens,
"output_tokens": resp.usage.output_tokens,
"cache_read_tokens": 0,
}
# ============================================================
# 3. Long Context with prompt caching
# ============================================================
def long_context_cached(query: str, doc_text: str) -> Dict:
t0 = time.time()
resp = anthropic.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
system=[
{"type": "text",
"text": "You are a financial analyst. Answer based strictly on the document."},
{"type": "text",
"text": f"DOCUMENT:\n{doc_text}",
"cache_control": {"type": "ephemeral"}}, # cache 5min
],
messages=[{"role": "user", "content": f"QUESTION: {query}"}],
)
latency = (time.time() - t0) * 1000
return {
"answer": resp.content[0].text,
"latency_ms": latency,
"input_tokens": resp.usage.input_tokens,
"output_tokens": resp.usage.output_tokens,
"cache_creation_tokens": getattr(resp.usage, "cache_creation_input_tokens", 0),
"cache_read_tokens": getattr(resp.usage, "cache_read_input_tokens", 0),
}
# ============================================================
# 4. Cost calculation
# ============================================================
SONNET_RATES = {
"input": 3.0 / 1_000_000,
"cache_write": 3.75 / 1_000_000, # 5min cache
"cache_read": 0.30 / 1_000_000,
"output": 15.0 / 1_000_000,
}
def calc_cost(usage: Dict) -> float:
cost = 0
cost += usage.get("input_tokens", 0) * SONNET_RATES["input"]
cost += usage.get("cache_creation_tokens", 0) * SONNET_RATES["cache_write"]
cost += usage.get("cache_read_tokens", 0) * SONNET_RATES["cache_read"]
cost += usage.get("output_tokens", 0) * SONNET_RATES["output"]
return cost
# ============================================================
# 5. Benchmark
# ============================================================
def main():
doc_text = open("data/apple_10k_2024_full.txt").read()
queries = [
"What was Apple's total revenue in fiscal 2024?",
"List the three main R&D priorities mentioned.",
"What is the current dividend policy?",
"Compare Q4 2024 services revenue to Q4 2023.",
"What does the company say about AI risks?",
] * 4 # 20 queries
# Approach A: Long context no cache
print("\n=== Long Context (no cache) ===")
for q in queries[:3]:
r = long_context_no_cache(q, doc_text)
cost = calc_cost(r)
print(f"Q: {q[:50]}... | cost=${cost:.4f}, latency={r['latency_ms']:.0f}ms")
# Approach B: Long context cached
print("\n=== Long Context (cached, sequential queries) ===")
cache_total = 0
for i, q in enumerate(queries):
r = long_context_cached(q, doc_text)
cost = calc_cost(r)
cache_total += cost
cache_kind = "CREATE" if r.get("cache_creation_tokens") else "READ"
print(f"#{i} Q: {q[:50]}... | {cache_kind} cost=${cost:.4f}")
print(f"\nTotal Long Context cached: ${cache_total:.3f}")
print(f"Avg per query: ${cache_total / len(queries):.4f}")
if __name__ == "__main__":
main()
四、实测结果
4.1 单文档查询的cost对比
测试条件:Apple 10-K (200K tok), 20 queries, 5min窗口内连续
| Approach | Per-query cost | First query | Avg subsequent | Total 20 queries |
|---|---|---|---|---|
| RAG v2 | $0.025 | $0.025 | $0.025 | $0.50 |
| Long Context (no cache) | $0.62 | $0.62 | $0.62 | $12.40 |
| Long Context (cached) | $0.07 avg | $0.78 (write) | $0.064 (read) | $2.00 |
4.2 Latency对比
| Approach | TTFT (p50) | Total (p50) | Total (p95) |
|---|---|---|---|
| RAG v2 | 600 ms | 2500 ms | 5200 ms |
| Long Context no cache | 8500 ms | 12000 ms | 18000 ms |
| Long Context cached | 1200 ms | 4800 ms | 7000 ms |
4.3 Accuracy对比(人工评分0-10)
| Query类型 | RAG v2 | LC no cache | LC cached |
|---|---|---|---|
| Single fact | 9.2 | 8.8 | 8.8 |
| Multi-section summary | 7.5 | 9.4 | 9.4 |
| Long-range comparison | 6.8 | 9.2 | 9.2 |
| Calc/aggregation | 8.5 | 9.0 | 9.0 |
| Out-of-doc | 4.0 | 5.5 | 5.5 |
| Overall | 7.6 | 8.4 | 8.4 |
关键洞察:long context answer质量更高,特别是 跨段汇总和长距离对比。RAG的retrieval可能漏掉关键信息。
4.4 Cross-over分析
单文档 N queries within 5-min cache window:
N=1: RAG ($0.025) << LC cached ($0.78) RAG wins
N=5: RAG ($0.125) < LC cached ($1.04) RAG wins
N=10: RAG ($0.25) < LC cached ($1.36) RAG wins (close)
N=20: RAG ($0.50) < LC cached ($2.00) RAG wins (closer)
N=50: RAG ($1.25) < LC cached ($3.92) RAG wins
但accuracy: LC cached > RAG by ~10%
→ Pure cost视角RAG胜,但accuracy + simplicity LC可能更划算。
4.5 Multi-document场景
3个10-K总600K tokens,超出single context限制:
- RAG: scales linearly
- LC: 必须分多次或换Claude Opus 4.7 1M
→ Multi-doc → RAG必选。
五、决策框架
5.1 何时选Long Context
✓ 单文档(或文档<200K tok)
✓ 同一文档多次查询(cache amortizes)
✓ 跨段/全文综合性query
✓ 答案质量priority > cost
✓ 文档更新频率低
✓ 团队无RAG basecredentials
5.2 何时选RAG
✓ 多文档(>200K tok合计)
✓ 文档每天/小时更新
✓ Cost-sensitive,每query都计较
✓ 用户量大(cache miss常发生)
✓ 需要citation溯源(RAG天然带source)
✓ 单点事实问答为主
5.3 Hybrid策略(推荐)
[Query]
│
▼
[Doc Classifier]
│
┌──────────────┼──────────────┐
▼ ▼ ▼
高频热门 中频文档 低频长尾
核心doc (10-K等) (大量历史)
│ │ │
▼ ▼ ▼
LC cached Hybrid: RAG RAG only
(always-on) 精筛 + LC summary
六、金融领域应用
6.1 案例:股票分析师工作流
某hedge fund analyst每天针对一个公司做30+ queries:
- 9:00 AM: 加载Apple 10-K (200K tok) into LC cache → $0.78
- 9:00-12:00 AM: 30 queries on Apple → $1.92 (avg $0.06/query)
- 总cost: $2.70 vs pure RAG $0.75 (3.6x贵)
- 但analyst满意度大幅提升:拿到完整、准确、跨段comprehensive answers
→ 对人力贵、答案价值高的场景,LC cached值得。
6.2 客服Chatbot
10K customers每天数百次query against客服文档:
- 客服文档 50K tok, 缓存 5min
- 用户分布在不同时区,cache hit rate ~60%
- 成本:cache write ($0.20) + 大量reads ($0.015 avg)
- 综合 vs RAG: $0.020/query LC vs $0.025/query RAG
→ LC slightly winning, simpler architecture。
6.3 RAG用Long Context做"smart chunker"
Hybrid方案:
- Long context给doc full text + 摘要prompt → 生成1000个smart chunks (语义边界)
- 把chunks索引为vector DB
- Query时RAG检索
→ Long context augments RAG,而非替代。
七、生产经验
7.1 8个long context的坑
| # | 坑 | 描述 |
|---|---|---|
| 1 | TTFT超时 | 200K context的TTFT 8s+,超gateway timeout |
| 2 | Cache miss成本 | 第1次send 200K tok, $0.78 -- 用户第一次query贵 |
| 3 | 5min TTL过短 | 用户思考2-3 min后再查就cache miss |
| 4 | 多用户共享cache | A用户的cache不能复用给B用户(不同会话) |
| 5 | doc变化导致cache无效 | 文档增量更新整个cache invalid |
| 6 | Lost in the middle | 关键信息在60%位置可能被ignore |
| 7 | Cost看不到 | dashboard没监控cache hit rate,半月才发现cost炸 |
| 8 | Streaming with cache | streaming输出 + cache的usage上报顺序问题 |
7.2 优化TTFT
# 1. 用1h cache (代替5min) — 更长TTL但贵
extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"}
cache_control={"type": "ephemeral", "ttl": "1h"}
# 2. 显式 TTFT streaming
with anthropic.messages.stream(...) as stream:
for text in stream.text_stream:
yield text # 边cache边给用户first token
# 3. 把doc切分多个cached blocks
# 4. 减少 max_tokens
7.3 Cache分层设计
[High freq docs (top 100)] → Always 1h cached
[Mid freq (top 1000)] → 5min cached on demand
[Long tail (everything else)] → RAG-only
→ 80/20 rule: 20% docs handle 80% queries
八、Cost & Latency真实数据
8.1 月度成本(10K queries/day, 单doc 200K tok)
| Strategy | Monthly Cost | Notes |
|---|---|---|
| RAG v2 | $7,500 | Linear with queries |
| LC no cache | $186,000 | 不可行 |
| LC 5min cached | $11,500 | 假设 ~50% cache hit |
| LC 1h cached | $9,000 | 假设 ~70% cache hit |
| Hybrid | $8,500 | 核心doc LC + 长尾RAG |
Hybrid是最便宜+质量最高的方案。
8.2 真实生产案例
某SaaS公司服务1000家律所,平均每客户每天50 queries on合同文档:
- 合同200-500K tok
- LC 1h cached: $4 per customer per day
- vs RAG ($1.5/customer/day) — 但accuracy +15%
- 客户愿意付溢价,LC win on revenue
九、关键速查表
9.1 Decision Matrix
| Doc size | Doc count | Update freq | Query freq | Pick |
|---|---|---|---|---|
| <200K | 1 | low | low | LC no cache |
| <200K | 1 | low | high | LC cached |
| <200K | 1 | high | any | LC no cache |
| <200K | 5-10 | low | high | LC + smart routing |
| >200K | 1 | any | any | RAG |
| any | 100+ | any | any | RAG |
| any | many | any | very high (>10K/day) | Hybrid |
9.2 Prompt Caching cheatsheet
# Anthropic 5min cache (default)
system=[
{"type": "text", "text": "stable system prompt"},
{"type": "text", "text": doc, "cache_control": {"type": "ephemeral"}},
]
# 1h cache (premium)
extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"}
cache_control={"type": "ephemeral", "ttl": "1h"}
# Cache breakpoints (multiple)
# Up to 4 cache_control blocks per request
十、面试题
Q1: Long context会让RAG彻底过时吗?
短期不会。三个原因: (1) 多文档scaling: 即使Claude 4.7 1M context, 多文档总和会爆; (2) 更新频率: 每次doc更新整个cache失效; (3) Cost at scale: 大用户量cache miss概率高; 4) citation/audit: RAG天然有source attribution, LC的answer里citation较弱。长期会演化: small queries → LC, large/complex → RAG, 中间用 smart router。RAG从"必备"变成"工具箱里的一个工具"。
Q2: 解释Anthropic prompt caching如何省80%成本?
Anthropic在内部存储prompt前缀的KV cache(attention computation的中间结果)。同一个prefix re-用时,跳过attention forward pass。具体: cache write 1.25x原价(写入开销),但cache read 0.1x原价 (90%省)。前提是: (1) prefix完全相同(一字符不能差); (2) 5min TTL内(或1h premium); (3) prefix足够长 (≥1024 tok)。在200K context场景,cache read $0.06 vs $0.60 no-cache — 节省90%。
Q3: Lost-in-the-middle对long context的影响多大?
Old generation (Claude 3, GPT-4 32K) 影响明显: middle position accuracy可降25-30%。New generation (Claude 4.7, Gemini 2) 大幅改善: NIAH几乎平坦, multi-needle仍有10-15%middle dip。实战防护: (1) 把最关键信息放top或bottom; (2) 用explicit headers ("# Section X") 帮助模型attend; (3) 重要信息复述2次(一次顶一次底); (4) 长context query用 structured output prompt 强制模型逐段思考。
Q4: 你的客户每天查同一份200K合同10次,应该RAG还是LC?
算账:LC 1h cached: $0.78 (write once) + 9 × $0.06 = $1.32/day. RAG: 10 × $0.025 = $0.25/day. Cost: RAG 5x便宜. 但 LC accuracy +10%. Decision: 看客户类型: 律师/合规 (高价值query) → LC. 客户支持FAQ → RAG. Hybrid: 第一次LC加载并gen "doc summary + section guide", 后续queries用summary作为RAG retrieval的guide. Best of both.
Q5: 长context的TTFT慢,如何在UX上hide?
三个UX技巧: (1) Streaming: 让first token尽快出来, 即使整体latency 8s, 用户看到流式输出感觉很快; (2) Optimistic UI: 用户submit query时显示"Searching documents... this may take 5-10s for thorough analysis"; (3) Hybrid baseline: 先用RAG快速给个preliminary answer (2s), 同时async启动LC; 5s后LC回来更新到detailed answer. 这种 "fast then refined" 模式在金融分析师工具中很受欢迎。
十一、明日预告
Day 146: Multimodal RAG——金融文档不只是text,更多是 图表、表格、PDF的复杂layout(损益表、resistance levels图、地图)。明天我们用 ColPali(基于vision transformer的retrieval)和 Vision-language models(Claude Vision)处理金融PDF的图表,看在 figure-related queries 上的提升幅度。