Expert Day 130

Long-Context Engineering: 1M Context, Lost-in-the-Middle, Prompt Caching

Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching engineering


Date: 2026-09-08 | Track: AI Systems Engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #LongContext #Caching #NeedleInHaystack #LostInTheMiddle


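Today's Goals

Type       Content
Study      Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching engineering
Hands-on   1M-context "needle in a haystack" test; measured savings from prompt caching
Output     needle_test.md + long-context best practices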

1. Theoretical Foundations

1.1 Context Window Evolution

Year   Model                       Max Context
2020   GPT-3                       2K
2022   GPT-3.5                     4K (16K variant in 2023)
2023   GPT-4                       8K/32K → 128K
2024   Claude 3                    200K
2024   Gemini 1.5 Pro              1M, 2M
2025   Claude 4.6                  200K (1M for select customers)
2026   Claude 4.7 (1M context)     1M (the model this system uses)
2026   Gemini 2.5                  2M+

1.2 Lost-in-the-Middle (Liu et al. 2023)

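Empirical finding: models retrieve information from the middle of the context with markedly lower accuracy than information near the beginning or the end.

Position vs accuracy (typical curve, illustrative numbers):
  Beginning:  90%
  Middle:     65%
  End:        85%

Hypothesized causes

  • In pretraining data, "important information" tends to sit at the beginning or end of documents
  • Attention sink (Xiao et al. 2023): the first token consistently receives high attention
  • Position embeddings become "blurry" in the middle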

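1.3 Needle-in-Haystack Test

The classic test from Greg Kamradt (2023):

  1. Build a document of N tokens
  2. Insert a "needle" at some position k (a unique fact, e.g. "the special password is 12345")
  3. Ask a question at the end: "What's the special password?"
  4. Measure retrieval accuracy across (N, k)

The "99%+ retrieval at 1M context" figures in Claude and Gemini marketing come from this test. It has a limitation, though: a single needle is relatively easy, while multi-hop reasoning across a long context remains hard.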

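1.4 Context Compression Techniques

  • LLMLingua (Microsoft): uses a small model to identify "important tokens" and drops the rest
  • Hierarchical summarization: summarize a long doc level by level
  • RAG: don't feed the full text; retrieve only the relevant chunks
  • Selective context: make multiple API calls, feeding only the relevant part each time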
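
A minimal sketch of the hierarchical summarization idea from the list above. The `summarize` argument is a hypothetical pluggable callable (in practice an LLM call); the `truncate_stub` stand-in only truncates so the example runs offline, and chunk size is measured in characters for simplicity.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_chars: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def hierarchical_summarize(text: str,
                           summarize: Callable[[str], str],
                           chunk_chars: int = 2000) -> str:
    """Summarize chunk by chunk, then summarize the joined summaries,
    repeating until the text fits in a single chunk.

    Assumes summarize() shrinks its input; if a pass does not make the
    text shorter, we raise instead of looping forever.
    """
    current = text
    while len(current) > chunk_chars:
        chunks = chunk_text(current, chunk_chars)
        merged = "\n".join(summarize(c) for c in chunks)
        if len(merged) >= len(current):
            raise ValueError("summarize() did not shrink the text")
        current = merged
    return summarize(current)


def truncate_stub(chunk: str) -> str:
    """Stand-in for an LLM summarizer: keep the first 100 characters."""
    return chunk[:100]
```

Swapping `truncate_stub` for a real model call turns this into the level-by-level scheme; the trade-off is one API call per chunk per level.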

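1.5 Prompt Caching: The Engineering Lifeline for Long Context

Without caching: a 1M-token context costs $15 per request (Opus input price). One query per second = 86,400 × $15 ≈ $1.3M/day. With caching: one write at 1.25x plus N reads at 0.1x, so the amortized cost drops sharply as N grows.

N=1 (write only, cache miss): cost = $18.75
N=100 reads: total = $18.75 + 100*$1.50 = $168.75, avg=$1.69/req (89% saved)
N=1000 reads: total = $18.75 + 1000*$1.50 = $1518.75, avg=$1.52/req (90% saved)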
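
The amortization arithmetic above can be reproduced with a small helper. This is a sketch under the stated assumptions only: a base cost of $15 per uncached request, a 1.25x write multiplier, a 0.1x read multiplier, and the same convention as the figures above (the average divides by the number of reads). `cached_cost` is a hypothetical name.

```python
def cached_cost(base_cost: float, n_reads: int,
                write_mult: float = 1.25, read_mult: float = 0.1) -> dict:
    """Total and average cost for one cache write followed by n_reads cache hits.

    The saving is measured against base_cost, the price of one uncached request.
    """
    total = base_cost * write_mult + n_reads * base_cost * read_mult
    avg = total / n_reads if n_reads else total
    saving = 1 - avg / base_cost
    return {"total": total, "avg": avg, "saving": saving}


for n in (100, 1000):
    r = cached_cost(15.0, n)
    print(f"N={n}: total=${r['total']:.2f}, "
          f"avg=${r['avg']:.2f}/req, saving={r['saving']:.0%}")
```

Because the write is a one-time cost, the average converges to the read price (10% of base) as N grows.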

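2. Intuition

Why doesn't long context mean RAG is dead?

In theory a 1M context can hold all the knowledge in one shot. But:

  • Cost: 1M tokens is $15 per query
  • Latency: time to first token at 1M context is roughly 10-30s
  • Quality: the more information, the more attention is diluted

RAG that pulls only the relevant 5K tokens is faster, cheaper, and may give better quality.

Where long context really pays off

  • Complex reasoning across many docs (relevance can't be known in advance)
  • Few-shot with long examples
  • Understanding a code base as a whole
  • Multi-turn conversations that accumulate context

Why is lost-in-the-middle so hard to kill?

Technical fixes exist (attention modification), but they require new architectures. Claude 4.7 uses some tricks (speculation: positional augmentation, RAG-style retrieval at the attention level), yet mid-position recall is still weaker. The engineering countermeasure: place key information explicitly at the head and tail.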


3. Implementation

3.1 Needle-in-Haystack Test

# needle_test.py
"""
Long-context retrieval test against Claude 4.7
"""
import anthropic
import random
import time

client = anthropic.Anthropic()

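# Synthetic filler sentence used as the haystack (not real Project Gutenberg text).
# It is formatted one sentence at a time: a template repeated 1000 times would
# need 4000 format arguments and .format() would raise IndexError.
HAYSTACK_TEMPLATE = ("The history of {} is long and complex. {} was founded in {}, "
                     "with profound implications for trade routes. "
                     "Subsequently, scholars in the {} century debated...\n")

# Fill the placeholders repeatedly until the target length is reached
def make_haystack(target_tokens=200000):
    cities = ["Athens", "Rome", "Beijing", "Cairo", "Paris"]
    target_chars = target_tokens * 4  # rough estimate: 1 token ~ 4 chars
    parts, n_chars = [], 0
    while n_chars <= target_chars:
        sentence = HAYSTACK_TEMPLATE.format(
            random.choice(cities), random.choice(cities),
            random.randint(100, 1900), random.randint(1, 21)
        )
        parts.append(sentence)
        n_chars += len(sentence)
    return "".join(parts)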

def insert_needle(haystack, needle, position_pct):
    """Insert the needle at position_pct (0.0-1.0) of the haystack's length."""
    pos = int(len(haystack) * position_pct)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def test_retrieval(haystack, needle, question):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        temperature=0.0,
        system="Answer the question using only information from the provided text.",
        messages=[{"role": "user", "content": haystack + f"\n\nQuestion: {question}"}]
    )
    return resp.content[0].text, resp.usage

# Experiment
NEEDLE = "The secret password is FALCON-9876."
QUESTION = "What is the secret password?"

results = []
target_length = 200_000  # 200K tokens
haystack = make_haystack(target_length)

for pos_pct in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    test_doc = insert_needle(haystack, NEEDLE, pos_pct)
    t0 = time.time()
    ans, usage = test_retrieval(test_doc, NEEDLE, QUESTION)
    latency = time.time() - t0
    found = "FALCON-9876" in ans
    print(f"Position {pos_pct*100:>5.0f}%: found={found}, "
          f"latency={latency:.1f}s, tokens={usage.input_tokens}, "
          f"answer={ans[:80]!r}")
    results.append((pos_pct, found, latency, usage.input_tokens))

Typical expected output (abridged):

Position    0%: found=True,  latency=12.3s, tokens=200145
Position   10%: found=True,  latency=12.1s
Position   25%: found=True,  latency=12.5s
Position   50%: found=True,  latency=12.8s   # Claude 4.7 is already near 100%
Position   75%: found=True
Position   90%: found=True
Position  100%: found=True

Harder tests (multiple needles, multi-hop), however, lower accuracy noticeably.
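
To build the harder multi-needle variant mentioned above, the single-needle insertion can be generalized. This is an offline sketch: `insert_needles` and `score_answer` are hypothetical helpers, the scoring is plain substring matching, and the model call itself is left out.

```python
from typing import Dict, List


def insert_needles(haystack: str, needles: List[str],
                   positions: List[float]) -> str:
    """Insert each needle at its fractional position (0.0-1.0) of the original haystack."""
    if len(needles) != len(positions):
        raise ValueError("needles and positions must have the same length")
    # Insert from the back so that earlier character offsets stay valid
    pairs = sorted(zip(positions, needles), reverse=True)
    doc = haystack
    for pct, needle in pairs:
        pos = int(len(haystack) * pct)
        doc = doc[:pos] + " " + needle + " " + doc[pos:]
    return doc


def score_answer(answer: str, expected: List[str]) -> Dict[str, object]:
    """Fraction of expected strings present in the answer, plus an all-found flag."""
    hits = [e for e in expected if e in answer]
    return {"recall": len(hits) / len(expected),
            "all_found": len(hits) == len(expected)}
```

For a multi-hop probe, the needles could be two facts that must be combined (for example, two halves of a code placed at 30% and 70%), with `all_found` as the pass criterion.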

3.2 Measuring Prompt Caching

# caching_benefit.py
"""
Compare cost & latency with and without prompt caching
"""
import anthropic
import time

client = anthropic.Anthropic()

LARGE_CONTEXT = "..." * 50000  # ~50K tokens of stable docs

QUESTIONS = [
    "Summarize the main themes.",
    "Who are the key persons mentioned?",
    "What time period does this cover?",
    "What are the financial figures?",
    "List any product names.",
]

def call_no_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=LARGE_CONTEXT,
        messages=[{"role": "user", "content": question}]
    )

def call_with_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {"type": "text", "text": LARGE_CONTEXT,
             "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": question}]
    )

# Baseline (no cache)
print("=== No cache ===")
total_in, total_lat = 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_no_cache(q)
    lat = time.time() - t0
    print(f"  in={r.usage.input_tokens}, latency={lat:.1f}s")
    total_in += r.usage.input_tokens
    total_lat += lat

print(f"\nTotal input: {total_in}, total latency: {total_lat:.1f}s")
# Cost: 5 × 50K × $3/M = $0.75

# With cache
print("\n=== With cache ===")
total_in, total_cache_read, total_lat = 0, 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_with_cache(q)
    lat = time.time() - t0
    print(f"  in={r.usage.input_tokens}, "
          f"cache_create={r.usage.cache_creation_input_tokens}, "
          f"cache_read={r.usage.cache_read_input_tokens}, "
          f"latency={lat:.1f}s")
    total_cache_read += r.usage.cache_read_input_tokens
    total_lat += lat

# Cost: 1 write (50K × $3.75/M = $0.19) + 4 reads (200K × $0.30/M = $0.06) = $0.25 (67% saved)
# Latency: the first call matches no-cache; later calls may be 20-40% faster

Expected (illustrative values; with caching, input_tokens counts only the uncached tokens after the breakpoint):

=== No cache ===
  in=50000, latency=8.2s
  in=50000, latency=8.4s
  ...

=== With cache ===
  in=10, cache_create=50000, cache_read=0, latency=8.3s   # write
  in=10, cache_create=0, cache_read=50000, latency=5.1s   # hit
  in=10, cache_create=0, cache_read=50000, latency=5.0s

3.3 RAG vs Long Context Cost Comparison

# rag_vs_longcontext.py
"""
Same question answered via RAG (5K context) vs the full doc (200K context).
Estimates cost only; latency and accuracy have to be measured separately.
"""

# USD per million tokens: (input, output)
PRICES = {"claude-sonnet-4-6": (3.0, 15.0)}

def cost_estimate(input_tokens, output_tokens=200, model="claude-sonnet-4-6",
                  input_multiplier=1.0):
    # Cache multipliers (1.25x write, 0.1x read) apply to input tokens only
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in * input_multiplier + output_tokens * p_out) / 1e6

# Assume 100 queries per day
n_queries = 100

# Option A: send the full document (200K) on every query
cost_a = n_queries * cost_estimate(200_000)
print(f"A (full context, no cache): ${cost_a:.2f}/day")

# Option B: full document + cache (assumes every later query hits a warm cache)
cost_b_write = cost_estimate(200_000, input_multiplier=1.25)  # first call writes the cache
cost_b_reads = (n_queries - 1) * cost_estimate(200_000, input_multiplier=0.1)
cost_b = cost_b_write + cost_b_reads
print(f"B (full context, cached): ${cost_b:.2f}/day")

# Option C: RAG retrieves ~5K tokens of context (per-query overheads are placeholders)
embed_cost = 0.01 * n_queries  # embed query
retrieval_cost = 0.001 * n_queries  # vector DB query
gen_cost = n_queries * cost_estimate(5_000)
cost_c = embed_cost + retrieval_cost + gen_cost
print(f"C (RAG): ${cost_c:.2f}/day")

Output (savings relative to A added by hand):

A (full context, no cache): $60.30/day
B (full context, cached): $6.99/day  (88% saved)
C (RAG): $2.90/day  (95% saved)

Accuracy is not a given either way:

  • A & B: complete information, but subject to lost-in-the-middle
  • C: retrieval may miss relevant passages

4. Anthropic API Best Practices

4.1 Prompt Caching Parameters in Detail

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # The cached prefix is built in the order tools -> system -> messages.
    # At most 4 breakpoints per request, and 1h breakpoints must come
    # before any 5min breakpoint.
    tools=[
        *TOOLS[:-1],
        # bp1: one breakpoint on the last tool caches all tool definitions
        {**TOOLS[-1], "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ],
    system=[
        # No cache_control of its own: short immutable instruction
        {"type": "text", "text": "You are a helpful assistant."},
        # bp2: 1h cache (hot KB)
        {"type": "text", "text": HOT_KB,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # bp3: large stable KB, 5min default
        {"type": "text", "text": LARGE_DOC,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # bp4: conversation history can be cached too
        {"role": "user", "content": [
            {"type": "text", "text": COMPRESSED_HISTORY,
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "assistant", "content": "..."},
        {"role": "user", "content": NEW_QUERY}  # not cached
    ]
)

4.2 Debugging Cache Hits

def cache_attribution(response):
    u = response.usage
    # input_tokens counts only uncached tokens; cache writes and reads are reported separately
    cache_write = u.cache_creation_input_tokens or 0
    cache_read = u.cache_read_input_tokens or 0
    total = u.input_tokens + cache_write + cache_read
    return {
        "non_cached": u.input_tokens,
        "cache_write": cache_write,
        "cache_read": cache_read,
        "hit_ratio": cache_read / max(total, 1),
    }
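
The same usage counters can be turned into an input-side dollar figure. This is a rough sketch, assuming the multipliers from the cache matrix in this note (1.25x write for the 5min cache, 0.1x read) and that `input_tokens` excludes cached tokens; `usage_cost` is a hypothetical helper, and the `SimpleNamespace` objects stand in for `response.usage`.

```python
from types import SimpleNamespace


def usage_cost(usage, input_price_per_m: float,
               write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Input-side cost in USD from the three usage counters (output tokens excluded)."""
    uncached = usage.input_tokens
    written = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    weighted = uncached + written * write_mult + read * read_mult
    return weighted * input_price_per_m / 1e6


# Hypothetical usage values standing in for response.usage
first_call = SimpleNamespace(input_tokens=12,
                             cache_creation_input_tokens=50_000,
                             cache_read_input_tokens=0)
later_call = SimpleNamespace(input_tokens=12,
                             cache_creation_input_tokens=0,
                             cache_read_input_tokens=50_000)

print(f"write call: ${usage_cost(first_call, 3.0):.4f}")
print(f"hit call:   ${usage_cost(later_call, 3.0):.4f}")
```

Logging this per request makes a silently broken prefix visible: a call that should have been a hit shows up at the write price instead.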

4.3 Recommended Long-Context Layout

[System: Brief instruction]      ← always re-read
[KB / Long Docs (cached)]        ← cached, immutable
[Examples (cached if stable)]
[Conversation history (cached selectively)]
[Most recent user message]       ← never cached

Restate the key info in the final user message: this is one of the most effective defenses against lost-in-the-middle.
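
A minimal sketch of how the layout above might be assembled into request arguments. `build_request_parts` and the "Key points" wording are illustrative assumptions, not a prescribed format; the block shapes follow the `system` and `messages` structures used earlier in this note.

```python
from typing import Dict, List


def build_request_parts(instruction: str, long_docs: str,
                        question: str, key_facts: List[str]) -> Dict[str, list]:
    """Return system blocks and messages following the head/tail layout.

    The long document carries the cache breakpoint; the key facts are
    restated next to the question at the very end of the prompt.
    """
    system = [
        {"type": "text", "text": instruction},
        {"type": "text", "text": long_docs,
         "cache_control": {"type": "ephemeral"}},
    ]
    reminder = "\n".join(f"- {fact}" for fact in key_facts)
    user_text = f"Key points to keep in mind:\n{reminder}\n\nQuestion: {question}"
    messages = [{"role": "user", "content": user_text}]
    return {"system": system, "messages": messages}
```

The result can be splatted into the call, e.g. `client.messages.create(model=..., max_tokens=..., **parts)`. Keeping the restated facts in the uncached user message means they can change per query without invalidating the cached prefix.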


5. Applications in Finance

Case study: real-time financial-report Q&A (cache-driven)

class FinancialQA:
    def __init__(self, ticker):
        self.ticker = ticker
        self.report = self._load_10k(ticker)  # 200K tokens

    def _load_10k(self, ticker):
        # Simulated load (placeholder text)
        return f"<10-K of {ticker}>... " * 20000

    def query(self, question):
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[
                {"type": "text",
                 "text": f"You are an analyst expert on {self.ticker}."},
                {"type": "text",
                 "text": self.report,
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}}
            ],
            messages=[{"role": "user", "content": question}]
        )

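# Usage
qa = FinancialQA("AAPL")
# First query: ~10s, writes the 1h cache
qa.query("What's Q3 revenue?")
# Any later AAPL question within the hour: ~2s, about 90% cheaper on input
qa.query("What are the top risks?")
qa.query("Compare to FY25...")

Business model: users pay a monthly fee while the cache absorbs much of the infrastructure cost. In this illustrative scenario, margin could rise from 30% to 80%.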

Case study: multi-document compliance review

# Feed in a monstrous compliance manual + transaction history
def compliance_audit(transaction):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system=[
            {"type": "text", "text": COMPLIANCE_MANUAL,    # 100K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": REGULATIONS_2026,     # 50K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}}
        ],
        messages=[{"role": "user",
                   "content": f"Review this transaction:\n{transaction}"}]
    )

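6. Common Pitfalls

  1. Assuming long context = RAG is dead: wrong. Long context is expensive and suffers from lost-in-the-middle. RAG is still the cost-optimal first choice.
  2. Scattering cache breakpoints: every breakpoint has to earn its place (>1024 tokens and reused). cache_control on a prefix shorter than the 1024-token minimum is ignored.
  3. Not noticing that the cache prefix changed: Anthropic does not warn you that "this prefix differs from the last one". Actively monitor that cache_read_input_tokens > 0.
  4. The 1h-cache cost trap: the write costs 2x. If the cache is used only 2-3 times before it expires, it costs more than a warm 5min cache, and with a single reuse more than not caching at all.
  5. Filling the context window leaves no room for max_tokens: input 199K + max_tokens 4K exceeds a 200K window, so the request may be rejected or the output truncated. Leave headroom.
  6. Lost-in-the-middle is more pronounced in multi-hop tasks: a single needle is easy to find, but for "comparing X and Y" with X at 30% and Y at 70%, accuracy drops.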
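
Pitfalls 4 and 5 reduce to simple arithmetic. The sketch below assumes the multipliers from this note (2x write for the 1h cache, 1.25x for 5min, 0.1x read); `relative_cost` and `fits_context` are hypothetical helper names, and costs are expressed in units of one uncached request.

```python
def relative_cost(n_requests: int, write_mult: float,
                  read_mult: float = 0.1, rewrites: int = 1) -> float:
    """Cost in units of one uncached request.

    `rewrites` requests pay the cache-write price (at least the first one);
    the remaining requests are cache hits.
    """
    if not 1 <= rewrites <= n_requests:
        raise ValueError("rewrites must be between 1 and n_requests")
    return rewrites * write_mult + (n_requests - rewrites) * read_mult


def fits_context(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_tokens <= context_window


# 1h cache, two requests: more expensive than 2 uncached requests
print(relative_cost(2, write_mult=2.0) > 2)
# Requests spaced more than 5 minutes apart: the 5min cache re-writes every time
print(relative_cost(3, write_mult=1.25, rewrites=3) > relative_cost(3, write_mult=2.0))
# Pitfall 5: 199K input + 4K output does not fit a 200K window
print(fits_context(199_000, 4_000, 200_000))
```

Under these assumptions the 1h cache beats no caching from the third request onward, and it beats the 5min cache only when gaps between requests would otherwise force repeated writes.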

7. Quick Reference

Anthropic Cache Matrix

                  TTL      Write Multiplier   Read Multiplier
ephemeral 5min:   5min     1.25x              0.1x
ephemeral 1h:     1h       2.0x               0.1x

Min cache size:   1024 tokens (Opus/Sonnet), 2048 (Haiku)
Max breakpoints:  4 per request
Cache key:        exact prefix match (incl. whitespace)

Long-Context Decision Tree

Structured, stable knowledge?         → RAG (low cost)
Need the whole context to reason?     → Long context + cache
Multi-turn conversation?              → Cache history blocks
Occasional one-off long document?     → Long context, no cache is OK

Using 1M Context (Claude 4.7 1M)

client.messages.create(
    model="claude-opus-4-7-1m",  # hypothetical 1M-context variant
    max_tokens=8192,
    # placeholder beta flag; check the current docs for the actual header value
    extra_headers={"anthropic-beta": "1m-context-beta"},
    ...
)
# Pricing: input ~2x normal beyond 200K

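8. Interview Questions

Q1: Why does Claude advertise 99% needle retrieval at 1M context yet still make mistakes in practice?

"99%" comes from the simple single-needle, single-hop test. Accuracy drops markedly for multi-needle tasks (find two things and aggregate them) and multi-hop reasoning (X depends on Y). In production, rely on prompt structure (key information placed explicitly at the head and tail) plus verification.

Q2: Design a financial-report chatbot that supports 5,000 users. How do you use prompt caching to save money?

(a) Group by ticker: each ticker's 10-K is its own cache breakpoint. (b) Use the 1h cache for the few hot tickers where traffic concentrates (2x write, but reads are as cheap as they get). (c) Use the default 5min cache for long-tail tickers. (d) Monitor caches with a low hit ratio; they may need reorganizing. (e) Pre-warm caches during off-hours (cron job).

Q3: How does lost-in-the-middle affect your product architecture?

(1) Don't depend on mid-context information: put key instructions at the head and tail. (2) Split multi-hop reasoning into multiple API calls, each focused. (3) For long documents, use Citations so the model explicitly points to source positions. (4) Where necessary, replace long context with precise RAG retrieval.

Q4: Long context vs RAG: how do you choose?

It depends on whether you need the "full picture": (1) You know the relevant chunks → RAG. (2) You need cross-document reasoning → long context + cache. (3) Repeated queries against the same corpus → long context, cached once. (4) Cost-critical → RAG is almost always cheaper.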


9. Up Next

Day 131: Prompt security (prompt injection, jailbreaks, indirect injection, red-team testing).