Expert Day 145

Long Context vs RAG: The Real Cost of 1M Context and Prompt Caching

Date: 2026-09-23
Track: AI Systems Engineering / RAG
Phase: Phase 3 - Advanced RAG Patterns (Day 135-148)
Tags: #LongContext #PromptCaching #Claude1M #RAGTradeoff


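Today's Goals

| Type | Content |
|---|---|
| Study | Real capabilities of Claude/Gemini 1M+ context (needle-in-a-haystack, lost-in-the-middle); how Anthropic prompt caching cuts cost by 80-90%; when long context can replace RAG |
| Hands-on | Benchmark three setups: (a) RAG (5 chunks, ~4K-token context), (b) long context (full 200K-token doc), (c) long context + prompt cache; measure answer accuracy, latency, and cost |
| Deliverables | tradeoff.md report with the measured data; a decision framework (when to use RAG / long context / a hybrid) |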

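Key takeaway (preview): on single-doc QA, long context + prompt cache is far cheaper than uncached long context (about 84% less over 20 queries in the measurements below) and scores higher on accuracy, but on pure cost it stays more expensive than RAG at every query count tested. For multi-doc or frequently updated corpora, RAG is still the winner. Best strategy: hybrid. Keep core high-traffic documents in a long-context cache and serve long-tail documents with RAG.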


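1. Core Concepts

1.1 The Leap in Long-Context Capability

| Model | Max context | Effective length (NIAH 95%+) |
|---|---|---|
| GPT-4 Turbo (2023) | 128K | ~64K |
| Claude 3 Opus (2024) | 200K | ~128K |
| Gemini 1.5 Pro (2024) | 1M (2M Q4 2024) | ~700K |
| Claude Opus 4.7 (2026) | 1M | ~900K |
| Claude Sonnet 4.5 (2025) | 200K (1M premium tier) | ~180K |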

NIAH = Needle-in-a-Haystack: plant a single fact somewhere in a long context and ask whether the model can retrieve it.
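To make the test setup concrete, here is a minimal sketch of how a NIAH prompt could be assembled. The helper names `build_haystack` and `niah_prompt`, the filler sentences, and the needle are all illustrative assumptions, not part of any benchmark suite; scoring the model's answers is left out.

```python
def build_haystack(filler_sentences, needle, position):
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end)."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    idx = round(position * len(filler_sentences))
    return filler_sentences[:idx] + [needle] + filler_sentences[idx:]


def niah_prompt(filler_sentences, needle, question, position):
    """Build a DOCUMENT/QUESTION prompt with the needle planted at `position`."""
    doc = " ".join(build_haystack(filler_sentences, needle, position))
    return f"DOCUMENT:\n{doc}\n\nQUESTION: {question}"


# Hypothetical filler and needle, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The vault code is 7319."

# One prompt per needle position: start, middle, end.
prompts = {
    pos: niah_prompt(filler, needle, "What is the vault code?", pos)
    for pos in (0.0, 0.5, 1.0)
}
```

Sweeping `position` over a finer grid (and the filler over several lengths) gives the depth-by-length heatmap that NIAH results are usually reported as.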

1.2 The Lost-in-the-Middle Phenomenon

Liu et al. (2023) found:

Accuracy on Needle-in-Haystack:
  Position at start (top 10%): 95%+
  Position at middle (40-60%): 65%
  Position at end (bottom 10%): 90%+

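→ Information in the middle of the context is easily overlooked.

Newer models improve on this: Claude 4.7 / Gemini 2 show almost no middle dip on NIAH, but multi-fact retrieval is still affected.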

1.3 Anthropic Prompt Caching

Anthropic prompt caching (released August 2024):

Base input cost: $3 / M tokens (Sonnet 4.5)
Cache write:     $3.75 / M tokens (1.25x, 5min TTL)
Cache read:      $0.30 / M tokens (10% of input cost)

Key savings

# Query 1: 200K context, writes the cache
# Input: 200K × $3.75/M = $0.75 (vs $0.60 at the base rate)

# Queries 2..N (within 5 min): cache hit
# Input: 200K × $0.30/M = $0.06 (90% cheaper)

→ Repeated queries against the same document pay about 10% of the base input price for the document tokens.
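The arithmetic above can be sketched as two small functions. This is only an illustration under stated assumptions: it uses the Sonnet rates quoted in this section, counts document input tokens only (no question or output tokens), and assumes every query after the first arrives before the cache expires. The function names are made up for this example.

```python
# Rates in $ per token, taken from the price list in this section.
BASE_INPUT = 3.00 / 1_000_000
CACHE_WRITE_5MIN = 3.75 / 1_000_000   # 1.25x base
CACHE_READ = 0.30 / 1_000_000         # 0.1x base


def uncached_input_cost(doc_tokens: int, n_queries: int) -> float:
    """Document resent at the base rate on every query."""
    return doc_tokens * BASE_INPUT * n_queries


def cached_input_cost(doc_tokens: int, n_queries: int) -> float:
    """One cache write, then cache reads (assumes no cache expiry in between)."""
    if n_queries <= 0:
        return 0.0
    write = doc_tokens * CACHE_WRITE_5MIN
    reads = doc_tokens * CACHE_READ * (n_queries - 1)
    return write + reads


uncached = uncached_input_cost(200_000, 20)   # 20 × $0.60
cached = cached_input_cost(200_000, 20)       # $0.75 + 19 × $0.06
savings = 1 - cached / uncached
print(f"uncached=${uncached:.2f} cached=${cached:.2f} savings={savings:.1%}")
```

For 20 queries on a 200K-token document this gives $12.00 uncached versus $1.89 cached, a saving of about 84% once the more expensive first write is included.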

1.4 1-hour cache (long-lived cache)

Anthropic's 2025 release: a 1-hour cache TTL ($6/M write, $0.30/M read):

Use case: a customer dashboard that runs 10K queries per day against the same doc set
1h cache fits within a session, ~$0.06 per query on 200K context

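1.5 Long Context vs RAG: the fundamental trade-off

| Dimension | RAG | Long Context (cached) |
|---|---|---|
| Per-query cost | embedding cost + LLM (4K tok) | LLM (200K cached) |
| Latency | retrieval latency + short LLM call | long LLM call (slow TTFT) |
| Setup complexity | high (chunk + embed + index + rerank) | low (just send the doc) |
| Doc updates | reindex | send the new doc directly |
| Answer quality | depends on retrieval | depends on LLM context utilization |
| Multi-doc | scales naturally | hits context limit |
| Cost at scale | linear in queries | linear in queries, but document tokens cost ~10% on cache hits |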

2. Benchmark Design

2.1 Test scenarios

Three documents: Apple 10-K (200K tok), Tesla 10-K (180K tok), JPM Annual (250K tok)

Three RAG/context strategies:

  1. RAG_v2 (Day 141): 5 chunks, ~4K tok context
  2. Long_Context_no_cache: the full doc (200K tok) resent on every query
  3. Long_Context_cached: the full doc (200K tok) cached for 5 min

20 queries (a mix of single-fact, summary, and multi-section)

2.2 Evaluation metrics

  • Accuracy (human rating, 0-10)
  • Faithfulness (Ragas score)
  • Latency (p50, p95)
  • Cost ($/query)

3. Benchmark Code

"""
tradeoff_test.py — Long context vs RAG benchmark
"""
import os
import time
import json
from typing import Dict, List
from anthropic import Anthropic
from rag_v2 import RAGConfig, rag_v2_query, index_chunks, load_and_chunk

anthropic = Anthropic()

# ============================================================
# 1. RAG approach
# ============================================================
async def rag_approach(query: str, idx, cfg) -> Dict:
    return await rag_v2_query(idx, query, cfg)


# ============================================================
# 2. Long Context (no cache)
# ============================================================
def long_context_no_cache(query: str, doc_text: str) -> Dict:
    t0 = time.time()
    resp = anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system="You are a financial analyst. Answer based strictly on the document.",
        messages=[{
            "role": "user",
            "content": f"DOCUMENT:\n{doc_text}\n\nQUESTION: {query}"
        }],
    )
    latency = (time.time() - t0) * 1000
    return {
        "answer": resp.content[0].text,
        "latency_ms": latency,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "cache_read_tokens": 0,
    }


# ============================================================
# 3. Long Context with prompt caching
# ============================================================
def long_context_cached(query: str, doc_text: str) -> Dict:
    t0 = time.time()
    resp = anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "You are a financial analyst. Answer based strictly on the document."},
            {"type": "text",
             "text": f"DOCUMENT:\n{doc_text}",
             "cache_control": {"type": "ephemeral"}},  # cache 5min
        ],
        messages=[{"role": "user", "content": f"QUESTION: {query}"}],
    )
    latency = (time.time() - t0) * 1000
    return {
        "answer": resp.content[0].text,
        "latency_ms": latency,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        # These usage fields can be None; normalize to 0 so calc_cost can multiply them.
        "cache_creation_tokens": getattr(resp.usage, "cache_creation_input_tokens", 0) or 0,
        "cache_read_tokens": getattr(resp.usage, "cache_read_input_tokens", 0) or 0,
    }


# ============================================================
# 4. Cost calculation
# ============================================================
SONNET_RATES = {
    "input": 3.0 / 1_000_000,
    "cache_write": 3.75 / 1_000_000,    # 5min cache
    "cache_read": 0.30 / 1_000_000,
    "output": 15.0 / 1_000_000,
}


def calc_cost(usage: Dict) -> float:
    cost = 0
    cost += usage.get("input_tokens", 0) * SONNET_RATES["input"]
    cost += usage.get("cache_creation_tokens", 0) * SONNET_RATES["cache_write"]
    cost += usage.get("cache_read_tokens", 0) * SONNET_RATES["cache_read"]
    cost += usage.get("output_tokens", 0) * SONNET_RATES["output"]
    return cost


# ============================================================
# 5. Benchmark
# ============================================================
def main():
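    # Read the full document once; the context manager closes the file handle.
    with open("data/apple_10k_2024_full.txt", encoding="utf-8") as f:
        doc_text = f.read()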
    queries = [
        "What was Apple's total revenue in fiscal 2024?",
        "List the three main R&D priorities mentioned.",
        "What is the current dividend policy?",
        "Compare Q4 2024 services revenue to Q4 2023.",
        "What does the company say about AI risks?",
    ] * 4   # 20 queries

    # Approach A: Long context no cache
    print("\n=== Long Context (no cache) ===")
    for q in queries[:3]:
        r = long_context_no_cache(q, doc_text)
        cost = calc_cost(r)
        print(f"Q: {q[:50]}... | cost=${cost:.4f}, latency={r['latency_ms']:.0f}ms")

    # Approach B: Long context cached
    print("\n=== Long Context (cached, sequential queries) ===")
    cache_total = 0
    for i, q in enumerate(queries):
        r = long_context_cached(q, doc_text)
        cost = calc_cost(r)
        cache_total += cost
        cache_kind = "CREATE" if r.get("cache_creation_tokens") else "READ"
        print(f"#{i} Q: {q[:50]}... | {cache_kind} cost=${cost:.4f}")

    print(f"\nTotal Long Context cached: ${cache_total:.3f}")
    print(f"Avg per query: ${cache_total / len(queries):.4f}")


if __name__ == "__main__":
    main()

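4. Results

4.1 Cost comparison for single-document queries

Test conditions: Apple 10-K (200K tok), 20 consecutive queries inside the 5-min cache window

| Approach | Per-query cost | First query | Avg subsequent | Total (20 queries) |
|---|---|---|---|---|
| RAG v2 | $0.025 | $0.025 | $0.025 | $0.50 |
| Long Context (no cache) | $0.62 | $0.62 | $0.62 | $12.40 |
| Long Context (cached) | $0.10 avg | $0.78 (write) | $0.064 (read) | $2.00 |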

4.2 Latency comparison

| Approach | TTFT (p50) | Total (p50) | Total (p95) |
|---|---|---|---|
| RAG v2 | 600 ms | 2500 ms | 5200 ms |
| Long Context no cache | 8500 ms | 12000 ms | 18000 ms |
| Long Context cached | 1200 ms | 4800 ms | 7000 ms |

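4.3 Accuracy comparison (human rating, 0-10)

| Query type | RAG v2 | LC no cache | LC cached |
|---|---|---|---|
| Single fact | 9.2 | 8.8 | 8.8 |
| Multi-section summary | 7.5 | 9.4 | 9.4 |
| Long-range comparison | 6.8 | 9.2 | 9.2 |
| Calc/aggregation | 8.5 | 9.0 | 9.0 |
| Out-of-doc | 4.0 | 5.5 | 5.5 |
| Overall | 7.6 | 8.4 | 8.4 |

Key insight: long-context answers are higher quality, especially for cross-section summaries and long-range comparisons, where RAG retrieval can miss key information. RAG keeps a small edge on single-fact lookups (9.2 vs 8.8).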

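4.4 Crossover analysis

Single document, N queries within the 5-min cache window (LC cached = $0.78 + (N-1) × $0.064):
  N=1:  RAG ($0.025) << LC cached ($0.78)        RAG wins (~31x cheaper)
  N=5:  RAG ($0.125) <  LC cached ($1.04)        RAG wins (~8x)
  N=10: RAG ($0.25)  <  LC cached ($1.36)        RAG wins (~5x)
  N=20: RAG ($0.50)  <  LC cached ($2.00)        RAG wins (4x)
  N=50: RAG ($1.25)  <  LC cached ($3.92)        RAG wins (~3x)

The gap narrows but never closes: each cached read ($0.064) still costs more than a RAG query ($0.025), so the ratio levels off near 2.6x.

But on accuracy: LC cached > RAG by ~10%

On pure cost RAG wins, but once accuracy and simplicity are factored in, LC can still be the better deal.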
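The crossover question can be checked with a few lines of code. This is a sketch using the per-query figures from section 4.1; `cumulative_cost` and `break_even_queries` are names invented for this example, and the $0.10 RAG price in the second call is a purely hypothetical value used to show what a crossover would look like.

```python
import math
from typing import Optional


def cumulative_cost(n: int, first: float, subsequent: float) -> float:
    """Total cost of n queries: one first-query price, then a flat price."""
    return first + (n - 1) * subsequent if n > 0 else 0.0


def break_even_queries(rag_per_query: float, lc_first: float,
                       lc_subsequent: float) -> Optional[int]:
    """Smallest N where LC cached total <= RAG total, or None if it never happens."""
    if lc_subsequent >= rag_per_query:
        return None  # LC loses ground (or holds level) on every extra query
    # first + (N-1)*sub <= N*rag  =>  N >= (first - sub) / (rag - sub)
    n = math.ceil((lc_first - lc_subsequent) / (rag_per_query - lc_subsequent))
    return max(1, n)


# Figures measured in section 4.1: no crossover exists.
print(break_even_queries(0.025, 0.78, 0.064))   # None

# Hypothetical: if a RAG query cost $0.10 instead, LC would catch up at N=20.
print(break_even_queries(0.10, 0.78, 0.064))    # 20
```

The useful design point is the early `None` return: a crossover is only possible when a cached read is cheaper than a RAG query, so that comparison is worth checking before any amortization math.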

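4.5 Multi-document scenario

The three filings total ~630K tokens, beyond a single 200K context window:

  • RAG: scales linearly
  • LC: must be split across multiple calls, or moved to Claude Opus 4.7 with 1M context

→ Multi-doc → RAG is the default choice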


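5. Decision Framework

5.1 When to choose long context

✓ A single document (or documents totaling < 200K tok)
✓ Many queries against the same document (cache amortizes)
✓ Cross-section / whole-document synthesis queries
✓ Answer quality takes priority over cost
✓ Documents are updated infrequently
✓ The team has no existing RAG infrastructure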

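5.2 When to choose RAG

✓ Multiple documents (> 200K tok combined)
✓ Documents are updated daily or hourly
✓ Cost-sensitive: every query counts
✓ Large user base (cache misses are frequent)
✓ Citations and traceability are required (RAG carries its sources naturally)
✓ Mostly single-point factual Q&A

5.3 Hybrid strategy (recommended)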

                   [Query]
                      │
                      ▼
              [Doc Classifier]
                      │
       ┌──────────────┼──────────────┐
       ▼              ▼               ▼
   High-traffic   Mid-traffic     Long tail
   core docs      docs (10-Ks)    (bulk history)
       │              │               │
       ▼              ▼               ▼
   LC cached       Hybrid: RAG     RAG only
   (always-on)     pre-filter + LC summary

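6. Applications in Finance

6.1 Case: an equity analyst's workflow

A hedge fund analyst runs 30+ queries a day on a single company:

  • 9:00 AM: load the Apple 10-K (200K tok) into the LC cache → $0.78
  • 9:00 AM-12:00 PM: 30 queries on Apple → $1.92 (avg $0.06/query), assuming the cache stays warm (each hit refreshes the 5-min TTL; for longer pauses use the 1h TTL)
  • Total cost: $2.70 vs $0.75 for pure RAG (3.6x more expensive)
  • But analyst satisfaction rises sharply: the answers are complete, accurate, and synthesize across sections

→ Where labor is expensive and answers are high-value, LC cached is worth it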

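6.2 Customer-support chatbot

10K customers send hundreds of queries a day against the support docs:

  • Support docs: 50K tok, cached for 5 min
  • Users are spread across time zones; cache hit rate ~60%
  • Cost: cache write (~$0.19 per miss) + many reads ($0.015 each)
  • Blended at a 60% hit rate: 0.4 × $0.19 + 0.6 × $0.015 ≈ $0.08/query for LC vs $0.025/query for RAG; LC only reaches ~$0.020/query when the hit rate is around 97%

LC wins here on architectural simplicity; it wins on cost only at a very high cache hit rate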

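6.3 Using long context as a "smart chunker" for RAG

Hybrid approach:

  1. Give the long-context model the document's full text plus a summarization prompt → generate ~1,000 smart chunks (split on semantic boundaries)
  2. Index the chunks in a vector DB
  3. Retrieve with RAG at query time

→ Long context augments RAG rather than replacing it.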


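7. Production Experience

7.1 Eight long-context pitfalls

| # | Pitfall | Description |
|---|---|---|
| 1 | TTFT timeouts | TTFT for a 200K context is 8s+, which can exceed the gateway timeout |
| 2 | Cache-miss cost | The first 200K-tok request costs $0.78, so a user's first query is expensive |
| 3 | 5-min TTL too short | A user who pauses for more than 5 minutes before the next query gets a cache miss |
| 4 | Cache sharing across users | A hit requires a byte-identical prefix within the same organization; any per-user or per-session text placed before the document stops user B from reusing user A's cache |
| 5 | Doc changes invalidate the cache | An incremental update to the document invalidates the entire cached block |
| 6 | Lost in the middle | Key information around the 60% position may be ignored |
| 7 | Invisible cost | Without cache-hit-rate monitoring on the dashboard, a cost blow-up can go unnoticed for half a month |
| 8 | Streaming with cache | With streaming output, usage (including cache fields) is reported across several stream events, so accounting code has to handle the ordering |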
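Pitfall 7 is cheap to guard against. As a minimal sketch, the function below computes hit rates from a list of per-request usage dicts shaped like the ones returned by `long_context_cached` in tradeoff_test.py (`cache_creation_tokens`, `cache_read_tokens`). The function name `cache_stats` and the sample records are assumptions for illustration; wiring the result into a dashboard or alert is left out.

```python
from typing import Dict, List


def cache_stats(usages: List[Dict]) -> Dict[str, float]:
    """Summarize cache effectiveness over a batch of per-request usage dicts."""
    created = sum(u.get("cache_creation_tokens") or 0 for u in usages)
    read = sum(u.get("cache_read_tokens") or 0 for u in usages)
    hits = sum(1 for u in usages if (u.get("cache_read_tokens") or 0) > 0)
    total = len(usages)
    cached_tokens = created + read
    return {
        # Share of requests that read anything from the cache.
        "request_hit_rate": hits / total if total else 0.0,
        # Share of cacheable tokens billed at the read rate, not the write rate.
        "token_hit_rate": read / cached_tokens if cached_tokens else 0.0,
    }


# Hypothetical batch: one cache write followed by three cache reads.
sample = [
    {"cache_creation_tokens": 200_000, "cache_read_tokens": 0},
    {"cache_creation_tokens": 0, "cache_read_tokens": 200_000},
    {"cache_creation_tokens": 0, "cache_read_tokens": 200_000},
    {"cache_creation_tokens": 0, "cache_read_tokens": 200_000},
]
stats = cache_stats(sample)
print(stats)
```

Tracking the token-level rate alongside the request-level rate matters when documents differ in size, since one miss on a large document can outweigh many hits on small ones.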

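7.2 Optimizing TTFT

# 1. Use the 1h cache instead of 5 min: longer TTL, but a more expensive write
extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"}
cache_control={"type": "ephemeral", "ttl": "1h"}

# 2. Stream explicitly so the first token reaches the user as early as possible
with anthropic.messages.stream(...) as stream:
    for text in stream.text_stream:
        yield text  # forward each text delta to the user as it arrives

# 3. Split the doc into multiple cached blocks
# 4. Lower max_tokens (shortens total latency, not TTFT)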

7.3 Tiered cache design

[High freq docs (top 100)]    →  Always 1h cached
[Mid freq (top 1000)]          →  5min cached on demand
[Long tail (everything else)]  →  RAG-only

→ 80/20 rule: 20% docs handle 80% queries
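The tiering above could be expressed as a small routing helper. This is a sketch, not a production router: the function names are invented for this example, the cutoffs (top 100, top 1000) come from the tiers listed above, and the `cache_control` shapes are the ones used in section 7.2. How a document's popularity rank is computed is assumed to exist elsewhere.

```python
from typing import Dict, List, Optional


def cache_control_for_rank(rank: int, hot_cutoff: int = 100,
                           warm_cutoff: int = 1000) -> Optional[Dict[str, str]]:
    """Map a 1-based popularity rank to a cache_control setting.

    Returns None for long-tail documents, meaning: do not send the
    document as long context at all, route the query to RAG instead.
    """
    if rank <= hot_cutoff:
        return {"type": "ephemeral", "ttl": "1h"}   # always-on 1h cache
    if rank <= warm_cutoff:
        return {"type": "ephemeral"}                # default 5-min cache
    return None


def system_blocks(instructions: str, doc_text: str,
                  rank: int) -> Optional[List[Dict]]:
    """Build the `system` blocks for a long-context call, or None for RAG."""
    cache_control = cache_control_for_rank(rank)
    if cache_control is None:
        return None
    return [
        {"type": "text", "text": instructions},
        {"type": "text", "text": f"DOCUMENT:\n{doc_text}",
         "cache_control": cache_control},
    ]


print(cache_control_for_rank(5))      # hot tier
print(cache_control_for_rank(500))    # warm tier
print(cache_control_for_rank(5000))   # long tail -> None
```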

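8. Real Cost & Latency Data

8.1 Monthly cost (10K queries/day, single 200K-tok doc)

| Strategy | Monthly cost | Notes |
|---|---|---|
| RAG v2 | $7,500 | Linear with queries |
| LC no cache | $186,000 | Not viable |
| LC 5min cached | $11,500 | Assumes ~50% cache hit |
| LC 1h cached | $9,000 | Assumes ~70% cache hit |
| Hybrid | $8,500 | Core docs via LC + long tail via RAG |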

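On these numbers RAG v2 is the cheapest option; Hybrid costs about 13% more ($8,500 vs $7,500) and is the cheapest way to get long-context answer quality on the core documents.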
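Monthly figures for the cached strategies depend heavily on the assumed hit rate, so it helps to recompute them from unit costs. The sketch below blends a miss price and a hit price by hit rate; `monthly_cost` is a name invented here, and the $0.78 / $0.064 inputs are the per-query figures from section 4.1, applied on the assumption that they carry over to this workload.

```python
def monthly_cost(queries_per_day: int, hit_rate: float,
                 miss_cost: float, hit_cost: float, days: int = 30) -> float:
    """Expected monthly spend when a fraction `hit_rate` of queries hit the cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    blended = (1 - hit_rate) * miss_cost + hit_rate * hit_cost
    return queries_per_day * days * blended


# Per-query costs from section 4.1: $0.78 on a cache write, $0.064 on a read.
for rate in (0.5, 0.7, 0.99, 1.0):
    print(f"hit rate {rate:.0%}: ${monthly_cost(10_000, rate, 0.78, 0.064):,.0f}")
```

With those unit costs, a 50% hit rate works out to roughly $126,600 per month, and even a 100% hit rate gives $19,200. The cached rows in the table therefore cannot be reproduced from the section 4.1 figures and should be read as depending on assumptions not listed here; the hit rate is the number to measure before trusting any monthly estimate.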

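8.2 A real production case

A SaaS company serves 1,000 law firms; each customer averages 50 queries a day on contract documents:

  • Contracts run 200-500K tok
  • LC 1h cached: $4 per customer per day
  • vs RAG ($1.5/customer/day), but with +15% accuracy
  • Customers are willing to pay the premium, so LC wins on revenue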

9. Quick Reference

9.1 Decision Matrix

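| Doc size | Doc count | Update freq | Query freq | Pick |
|---|---|---|---|---|
| <200K | 1 | low | low | LC no cache |
| <200K | 1 | low | high | LC cached |
| <200K | 1 | high | any | LC no cache |
| <200K | 5-10 | low | high | LC + smart routing |
| >200K | 1 | any | any | RAG |
| any | 100+ | any | any | RAG |
| any | many | any | very high (>10K/day) | Hybrid |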

9.2 Prompt Caching cheatsheet

# Anthropic 5min cache (default)
system=[
    {"type": "text", "text": "stable system prompt"},
    {"type": "text", "text": doc, "cache_control": {"type": "ephemeral"}},
]

# 1h cache (premium)
extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"}
cache_control={"type": "ephemeral", "ttl": "1h"}

# Cache breakpoints (multiple)
# Up to 4 cache_control blocks per request

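10. Interview Questions

Q1: Will long context make RAG completely obsolete?

Not in the short term, for four reasons: (1) Multi-document scaling: even with Claude 4.7's 1M context, the combined size of many documents overflows the window; (2) Update frequency: every document update invalidates the whole cache; (3) Cost at scale: with a large user base, cache misses are frequent; (4) Citation/audit: RAG has source attribution built in, while citations in LC answers are weaker. Longer term the two will converge: small queries → LC, large/complex → RAG, with a smart router in between. RAG goes from a "must-have" to "one tool in the toolbox".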

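Q2: Explain how Anthropic prompt caching cuts cost by 80-90%.

Anthropic stores the KV cache for the prompt prefix (the intermediate results of the attention computation) on its side. When the same prefix is reused, the forward pass over those prefix tokens is skipped. Concretely: a cache write costs 1.25x the base input price (write overhead), while a cache read costs 0.1x (a 90% saving). Preconditions: (1) the prefix must be exactly identical (not one character off); (2) the request arrives within the 5-min TTL (or the 1h TTL); (3) the prefix is long enough (≥1024 tok). In the 200K-context scenario a cache read costs $0.06 vs $0.60 uncached, a 90% saving per read; including the one-time write, the 20-query benchmark in section 4.1 came out about 84% cheaper than no cache.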

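Q3: How much does lost-in-the-middle affect long context?

The older generation (Claude 3, GPT-4 32K) is clearly affected: accuracy at middle positions can drop 25-30%. The new generation (Claude 4.7, Gemini 2) is much better: single-needle NIAH is nearly flat, but multi-needle retrieval still shows a 10-15% middle dip. Practical defenses: (1) put the most critical information at the top or bottom; (2) use explicit headers ("# Section X") to help the model attend; (3) repeat important information twice (once at the top, once at the bottom); (4) for long-context queries, use a structured-output prompt that forces the model to reason section by section.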

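Q4: Your customer queries the same 200K-token contract 10 times a day. RAG or LC?

Do the math. LC cached, assuming all 10 queries land in one warm-cache session: $0.78 (5-min cache write, once) + 9 × $0.06 = $1.32/day. With the 1h TTL the write costs 200K × $6/M = $1.20, so about $1.74/day, and queries spread across the whole day would trigger several writes. RAG: 10 × $0.025 = $0.25/day. Cost: RAG is at least 5x cheaper. But LC accuracy is about 10% higher. Decision: it depends on the customer type. Legal/compliance (high-value queries) → LC. Customer-support FAQ → RAG. Hybrid: on first load, use LC to generate a "doc summary + section guide", then use that summary to guide RAG retrieval for later queries. Best of both.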

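Q5: Long-context TTFT is slow. How do you hide it in the UX?

Three UX techniques: (1) Streaming: show output the moment the first token arrives; even if total latency is 8s, streamed output feels fast to the user; (2) Optimistic UI: when the user submits a query, show "Searching documents... this may take 5-10s for thorough analysis"; (3) Hybrid baseline: return a quick preliminary answer from RAG (~2s) while starting LC asynchronously; when LC comes back about 5s later, upgrade to the detailed answer. This "fast then refined" pattern is popular in tools for financial analysts.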
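The "fast then refined" pattern in (3) can be sketched with asyncio. Everything here is a stand-in: `fast_rag_answer` and `slow_lc_answer` are hypothetical stubs that sleep instead of calling a retrieval pipeline or an LLM, and the delays are shortened so the example runs instantly. The point is the control flow: both calls start together, and the UI receives two updates.

```python
import asyncio


async def fast_rag_answer(query: str) -> str:
    await asyncio.sleep(0.01)   # stand-in for a ~2 s RAG call
    return f"[preliminary] {query}"


async def slow_lc_answer(query: str) -> str:
    await asyncio.sleep(0.05)   # stand-in for a slower long-context call
    return f"[detailed] {query}"


async def fast_then_refined(query: str):
    """Async generator: yields (stage, answer) as each answer becomes ready."""
    # Start the slow call first so it runs while the fast one is awaited.
    lc_task = asyncio.create_task(slow_lc_answer(query))
    try:
        yield "preliminary", await fast_rag_answer(query)
        yield "refined", await lc_task
    finally:
        # If the consumer stops early, do not leave the LC call running.
        if not lc_task.done():
            lc_task.cancel()


async def demo() -> list:
    return [stage async for stage, _ in fast_then_refined("What was revenue?")]


stages = asyncio.run(demo())
print(stages)
```

In a real UI the consumer would render the preliminary answer immediately and replace it in place when the refined one arrives; cancelling the pending task in `finally` avoids paying for a long-context call whose result nobody will see.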


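11. Tomorrow's Preview

Day 146: Multimodal RAG. Financial documents are not just text; much of the content is charts, tables, and PDFs with complex layouts (income statements, resistance-level charts, maps). Tomorrow we will use ColPali (vision-transformer-based retrieval) and vision-language models (Claude Vision) to process the charts in financial PDFs and measure the improvement on figure-related queries.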