Expert Day 130

Long-Context Engineering: 1M Context, Lost-in-the-Middle, Prompt Caching

Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching engineering

2026-09-08

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-09-08
Track: AI Systems Engineering
Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)
Tags: #LongContext #Caching #NeedleInHaystack #LostInTheMiddle

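Goals for Today

Type	Content
Study	Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching engineering
Practice	1M-context "needle in a haystack" test, measured prompt caching savings
Deliverable	`needle_test.md` + long-context best practices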

1. Theoretical Foundations

1.1 Context Window Evolution

Year	Model	Max Context
2020	GPT-3	2K
2022	GPT-3.5	4K (16K variant in 2023)
2023	GPT-4	8K/32K → 128K (GPT-4 Turbo)
2024	Claude 3	200K
2024	Gemini 1.5 Pro	1M, 2M
2025	Claude 4.6	200K (1M for select customers)
2026	Claude 4.7 (1M context)	1M (the model used by this system)
2026	Gemini 2.5	2M+

1.2 Lost-in-the-Middle (Liu et al. 2023)

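Empirical finding: models retrieve information from the middle of the context with markedly lower accuracy than from the beginning or the end.

Position vs accuracy (typical curve, illustrative numbers):
  Beginning:  90%
  Middle:     65%
  End:        85%

Hypothesized causes:

In pretraining data, "important information" tends to sit at the beginning or end of documents
Attention sink phenomenon (Xiao 2023): the first token always receives high attention
Position embeddings become "blurry" in the middle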

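1.3 Needle-in-a-Haystack Test

Greg Kamradt's classic 2023 test:

Build a document of N tokens
Insert a "needle" at some position k (a unique fact, e.g. "the special password is 12345")
Ask a question at the end: "What's the special password?"
Measure retrieval accuracy across (N, k)

The "99%+ retrieval at 1M context" claims in Claude/Gemini marketing refer to this test. It has limits, though: a single needle is relatively easy, while multi-hop reasoning across a long context remains hard.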

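1.4 Context Compression Techniques

LLMLingua (Microsoft): uses a small model to identify "important tokens" and drops the rest
Hierarchical summarization: summarize a long doc level by level
RAG: don't feed the full text; retrieve the relevant chunks
Selective context: make several API calls, each fed only the relevant part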
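Of the techniques above, hierarchical summarization is the easiest to sketch without any external service. The block below is a minimal illustration, not a reference implementation: `chunk_text`, `hierarchical_summarize` and the stand-in summarizer `first_sentence` are hypothetical names, and in practice `summarize` would wrap a model call.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_chars: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def hierarchical_summarize(text: str,
                           summarize: Callable[[str], str],
                           chunk_chars: int = 4000,
                           max_chars: int = 4000) -> str:
    """Summarize each chunk, join the summaries, repeat until the text fits max_chars."""
    current = text
    while len(current) > max_chars:
        chunks = chunk_text(current, chunk_chars)
        reduced = "\n".join(summarize(c) for c in chunks)
        if len(reduced) >= len(current):
            # The summarizer did not shrink the text; stop instead of looping forever.
            break
        current = reduced
    return current


def first_sentence(chunk: str) -> str:
    """Stand-in summarizer (assumption): keep only the first sentence of a chunk."""
    return chunk.split(". ")[0].strip() + "."


if __name__ == "__main__":
    doc = "Alpha is first. Beta is second. " * 500   # 16,000 characters
    print(hierarchical_summarize(doc, first_sentence))
```

Passing the summarizer in as a function keeps the level-by-level loop testable offline; swapping in a real model call changes only that one argument.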

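1.5 Prompt Caching: the Engineering Lifeline for Long Context

Without caching: a 1M-token context costs $15 per request (Opus input at $15/M). One query per second = 86,400 × $15 ≈ $1.3M/day. With caching: one write at 1.25x plus N reads at 0.1x. As N grows, the amortized cost drops sharply. In the figures below, N counts the cache reads that follow the single write, and the average is per read, compared against $15 uncached.

First request (cache miss, write): cost = $18.75
N=100: total = $18.75 + 100*$1.50 = $168.75, avg=$1.69/req (89% saved)
N=1000: total = $18.75 + 1000*$1.50 = $1518.75, avg=$1.52/req (90% saved)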
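The amortization above can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the notes (one write at 1.25x, every later request a cache hit at 0.1x, average taken over the N reads); the function name `amortized_cache_cost` is made up for illustration.

```python
def amortized_cache_cost(base_cost: float, n_reads: int,
                         write_mult: float = 1.25,
                         read_mult: float = 0.1) -> dict:
    """Total and per-read cost of one cache write followed by n_reads cache hits."""
    if n_reads < 1:
        raise ValueError("n_reads must be at least 1")
    total = base_cost * write_mult + n_reads * base_cost * read_mult
    avg_per_read = total / n_reads
    # Saving is measured against paying base_cost for every read without a cache.
    saving = 1 - avg_per_read / base_cost
    return {"total": total, "avg_per_read": avg_per_read, "saving": saving}


if __name__ == "__main__":
    for n in (100, 1000):
        r = amortized_cache_cost(15.0, n)
        print(f"N={n}: total=${r['total']:.2f}, "
              f"avg=${r['avg_per_read']:.2f}/req, saving={r['saving']:.0%}")
```

The saving approaches but never reaches 90%, because the read multiplier of 0.1x is a floor on the per-request cost.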

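2. Intuition

Why doesn't long context mean RAG is dead?

In theory a 1M context can hold all the knowledge in one go. But:

Cost: 1M tokens cost $15 per query
Latency: time to first token at 1M context is ~10-30s
Quality: the more information, the more attention gets diluted

RAG that pulls only the relevant 5K tokens is faster and cheaper, and its quality may be better.

Where long context really pays off:

Complex reasoning across many docs (relevance can't be known in advance)
Few-shot with long examples
Understanding a code base as a whole
Accumulated context in multi-turn conversations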

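Why is lost-in-the-middle so hard to kill?

Technical fixes exist (attention modification), but they require new architectures. Claude 4.7 uses some tricks (speculation: positional augmentation, RAG-style retrieval at the attention level), yet mid-position recall is still weaker. The engineering countermeasure: explicitly place key information at the beginning or the end.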

3. Code Implementation

3.1 Needle-in-a-Haystack Test

# needle_test.py
"""
Long-context retrieval test against Claude 4.7
"""
import anthropic
import random
import time

client = anthropic.Anthropic()

# Haystack filler: a synthetic template stands in here for Project Gutenberg free text.
# Kept as ONE paragraph (4 placeholders); repeating the template before .format()
# would leave thousands of unfilled placeholders and raise IndexError.
HAYSTACK_TEMPLATE = """The history of {} is long and complex. {} was founded in {}, with profound implications for trade routes. Subsequently, scholars in the {} century debated...
"""

# Fill the placeholders and repeat until the target length is reached
def make_haystack(target_tokens=200000):
    cities = ["Athens", "Rome", "Beijing", "Cairo", "Paris"]
    parts = []
    n_chars = 0
    # Rough estimate: tokens ≈ chars / 4
    while n_chars / 4 <= target_tokens:
        paragraph = HAYSTACK_TEMPLATE.format(
            random.choice(cities), random.choice(cities),
            random.randint(100, 1900), random.randint(1, 21)
        )
        parts.append(paragraph)
        n_chars += len(paragraph)
    return "".join(parts)

def insert_needle(haystack, needle, position_pct):
    """Insert the needle at position_pct (0.0-1.0) of the haystack."""
    pos = int(len(haystack) * position_pct)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def test_retrieval(haystack, needle, question):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        temperature=0.0,
        system="Answer the question using only information from the provided text.",
        messages=[{"role": "user", "content": haystack + f"\n\nQuestion: {question}"}]
    )
    return resp.content[0].text, resp.usage

# Experiment
NEEDLE = "The secret password is FALCON-9876."
QUESTION = "What is the secret password?"

results = []
target_length = 200_000  # 200K tokens
haystack = make_haystack(target_length)

for pos_pct in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    test_doc = insert_needle(haystack, NEEDLE, pos_pct)
    t0 = time.time()
    ans, usage = test_retrieval(test_doc, NEEDLE, QUESTION)
    latency = time.time() - t0
    found = "FALCON-9876" in ans
    print(f"Position {pos_pct*100:>5.0f}%: found={found}, "
          f"latency={latency:.1f}s, tokens={usage.input_tokens}, "
          f"answer={ans[:80]!r}")
    results.append((pos_pct, found, latency, usage.input_tokens))

Expected typical output (abridged):

Position    0%: found=True,  latency=12.3s, tokens=200145
Position   10%: found=True,  latency=12.1s
Position   25%: found=True,  latency=12.5s
Position   50%: found=True,  latency=12.8s   # Claude 4.7 is already close to 100%
Position   75%: found=True
Position   90%: found=True
Position  100%: found=True

Harder tests (multiple needles, multi-hop) lower accuracy significantly.
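A multi-needle variant needs only two extra helpers: one that plants several facts at different depths and one that scores how many of them the answer recovered. The sketch below is offline and self-contained; `insert_needles` and `score_answer` are hypothetical helper names, and the model call itself is left out.

```python
from typing import Dict, List


def insert_needles(haystack: str, needles: List[str],
                   positions: List[float]) -> str:
    """Insert each needle at its fractional position of the ORIGINAL haystack."""
    # Insert from the back so that earlier offsets are not shifted.
    pairs = sorted(zip(positions, needles), reverse=True)
    doc = haystack
    for pct, needle in pairs:
        pos = int(len(haystack) * pct)
        doc = doc[:pos] + " " + needle + " " + doc[pos:]
    return doc


def score_answer(answer: str, expected: List[str]) -> Dict[str, object]:
    """Fraction of expected strings that appear verbatim in the answer."""
    found = [e for e in expected if e in answer]
    return {"recall": len(found) / len(expected),
            "all_found": len(found) == len(expected)}


if __name__ == "__main__":
    filler = "lorem ipsum " * 1000
    doc = insert_needles(
        filler,
        ["The vault code is FALCON-9876.", "The backup code is OSPREY-1234."],
        [0.3, 0.7],
    )
    # A real run would send `doc` plus a question to the model; here we score a mock answer.
    print(score_answer("The vault code is FALCON-9876.", ["FALCON-9876", "OSPREY-1234"]))
```

Reporting recall per needle rather than a single found/not-found flag makes partial failures visible, which is where multi-needle runs typically differ from the single-needle test.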

3.2 Measuring Prompt Caching

# caching_benefit.py
"""
Compare cost & latency with and without prompt caching
"""
"""
import anthropic
import time

client = anthropic.Anthropic()

LARGE_CONTEXT = "..." * 50000  # ~50K tokens of stable docs

QUESTIONS = [
    "Summarize the main themes.",
    "Who are the key persons mentioned?",
    "What time period does this cover?",
    "What are the financial figures?",
    "List any product names.",
]

def call_no_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=LARGE_CONTEXT,
        messages=[{"role": "user", "content": question}]
    )

def call_with_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {"type": "text", "text": LARGE_CONTEXT,
             "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": question}]
    )

# Baseline (no cache)
print("=== No cache ===")
total_in, total_lat = 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_no_cache(q)
    lat = time.time() - t0
    print(f"  in={r.usage.input_tokens}, latency={lat:.1f}s")
    total_in += r.usage.input_tokens
    total_lat += lat

print(f"\nTotal input: {total_in}, total latency: {total_lat:.1f}s")
# Cost: 5 × 50K × $3/M = $0.75

# With cache
print("\n=== With cache ===")
total_in, total_cache_read, total_lat = 0, 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_with_cache(q)
    lat = time.time() - t0
    print(f"  in={r.usage.input_tokens}, "
          f"cache_create={r.usage.cache_creation_input_tokens}, "
          f"cache_read={r.usage.cache_read_input_tokens}, "
          f"latency={lat:.1f}s")
    total_in += r.usage.input_tokens  # uncached tokens only
    total_cache_read += r.usage.cache_read_input_tokens
    total_lat += lat

print(f"\nUncached input: {total_in}, cache reads: {total_cache_read}, "
      f"total latency: {total_lat:.1f}s")
# Cost: 1 write (50K × $3.75/M = $0.19) + 4 reads (200K × $0.30/M = $0.06) = $0.25 (67% saved)
# Latency: first call same as no-cache; later calls may be 20-40% faster

Expected (illustrative numbers; with caching, input_tokens counts only the uncached tokens, and cached tokens are reported in the cache_* fields):

=== No cache ===
  in=50012, latency=8.2s
  in=50011, latency=8.4s
  ...

=== With cache ===
  in=12, cache_create=50000, cache_read=0, latency=8.3s   # write
  in=11, cache_create=0, cache_read=50000, latency=5.1s   # hit
  in=11, cache_create=0, cache_read=50000, latency=5.0s

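3.3 RAG vs Long Context Cost Comparison

# rag_vs_longcontext.py
"""
Same question: RAG (5K context) vs full doc (200K context).
This script estimates cost only; latency and accuracy have to be measured separately.
"""

# ($/M input tokens, $/M output tokens), as assumed throughout these notes
PRICES = {"claude-sonnet-4-6": (3.0, 15.0)}

def cost_estimate(input_tokens, output_tokens=200, model="claude-sonnet-4-6",
                  input_multiplier=1.0):
    """USD cost of one request. input_multiplier models cache write (1.25) or read (0.1);
    it applies to input tokens only, never to output tokens."""
    p_in, p_out = PRICES[model]
    return input_tokens * p_in * input_multiplier / 1e6 + output_tokens * p_out / 1e6

# Assume 100 queries per day
n_queries = 100

# Plan A: feed the full document (200K) every time
cost_a = n_queries * cost_estimate(200_000)
print(f"A (full context, no cache): ${cost_a:.2f}/day")

# Plan B: full document + cache (assumes every later request hits the cache)
cost_b_write = cost_estimate(200_000, input_multiplier=1.25)  # first request writes the cache
cost_b_reads = (n_queries - 1) * cost_estimate(200_000, input_multiplier=0.1)
cost_b = cost_b_write + cost_b_reads
print(f"B (full context, cached): ${cost_b:.2f}/day")

# Plan C: RAG retrieves a 5K context
embed_cost = 0.01 * n_queries  # embed query (placeholder figure)
retrieval_cost = 0.001 * n_queries  # vector DB query (placeholder figure)
gen_cost = n_queries * cost_estimate(5_000)
cost_c = embed_cost + retrieval_cost + gen_cost
print(f"C (RAG): ${cost_c:.2f}/day")

Output (savings relative to A added by hand):

A (full context, no cache): $60.30/day
B (full context, cached): $6.99/day  (88% saved)
C (RAG): $2.90/day  (95% saved)

Accuracy is a separate question:

A & B: complete information, but exposed to lost-in-the-middle
C: retrieval may miss relevant passages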

4. Anthropic API Best Practices

4.1 Prompt Caching Parameters in Detail

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # The cached prefix is built in the order tools → system → messages.
    tools=[
        # bp1: tool definitions can be cached too. Mark only the LAST tool:
        # one breakpoint covers everything before it, and a breakpoint on every
        # tool would exceed the limit of 4 per request.
        *TOOLS[:-1],
        {**TOOLS[-1], "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ],
    system=[
        # No cache_control of its own: short immutable instruction
        {"type": "text", "text": "You are a helpful assistant."},
        # bp2: 1h cache (hot KB). Longer TTLs must come before shorter ones.
        {"type": "text", "text": HOT_KB,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # bp3: large stable KB, default 5min TTL
        {"type": "text", "text": LARGE_DOC,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # bp4: conversation history can be cached
        {"role": "user", "content": [
            {"type": "text", "text": COMPRESSED_HISTORY,
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "assistant", "content": "..."},
        {"role": "user", "content": NEW_QUERY}  # not cached
    ]
)

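4.2 Debugging Cache Hits

def cache_attribution(response):
    u = response.usage
    # input_tokens counts only the uncached tokens; cached tokens are reported
    # separately, so the true input total is the sum of all three fields.
    cache_write = u.cache_creation_input_tokens or 0
    cache_read = u.cache_read_input_tokens or 0
    total = u.input_tokens + cache_write + cache_read
    return {
        "non_cached": u.input_tokens,
        "cache_write": cache_write,
        "cache_read": cache_read,
        "hit_ratio": cache_read / max(total, 1),
    }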

4.3 Recommended Long-Context Structure

[System: Brief instruction]      ← always re-read
[KB / Long Docs (cached)]        ← cached, immutable
[Examples (cached if stable)]
[Conversation history (cached selectively)]
[Most recent user message]       ← never cached

Restate key info in the final user message: this is the most effective defense against lost-in-the-middle.
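The layout above can be assembled by a small helper that keeps the long document in a cached system block and repeats the key facts in the final, uncached user message. This is a sketch only: `build_request` is a hypothetical helper, and its return value is meant to be passed as keyword arguments to `client.messages.create` together with `model` and `max_tokens`.

```python
from typing import Dict, List


def build_request(instruction: str, long_docs: str,
                  question: str, key_facts: List[str]) -> Dict[str, object]:
    """Build system/messages kwargs: stable docs cached, key facts restated last."""
    reminder = "\n".join(f"- {fact}" for fact in key_facts)
    user_text = (f"Key points to keep in mind:\n{reminder}\n\n"
                 f"Question: {question}")
    return {
        "system": [
            # Short instruction, no breakpoint of its own.
            {"type": "text", "text": instruction},
            # Long, immutable material: the breakpoint goes on this block.
            {"type": "text", "text": long_docs,
             "cache_control": {"type": "ephemeral"}},
        ],
        # The newest user message changes every call, so it is never cached.
        "messages": [{"role": "user", "content": user_text}],
    }


if __name__ == "__main__":
    req = build_request("Answer from the documents only.", "<long docs>",
                        "What is the fiscal year end?",
                        ["Figures are in USD millions"])
    print(req["messages"][0]["content"])
```

Because the restated facts live in the user message rather than in the cached block, changing them between calls does not invalidate the cached prefix.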

5. Applications in Finance

Case: real-time financial-report Q&A (cache-driven)

class FinancialQA:
    def __init__(self, ticker):
        self.ticker = ticker
        self.report = self._load_10k(ticker)  # 200K tokens

    def _load_10k(self, ticker):
        # simulated load
        return f"<10-K of {ticker}>... " * 20000

    def query(self, question):
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[
                {"type": "text",
                 "text": f"You are an analyst expert on {self.ticker}."},
                {"type": "text",
                 "text": self.report,
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}}
            ],
            messages=[{"role": "user", "content": question}]
        )

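# Usage
qa = FinancialQA("AAPL")
# First query: ~10s, writes the 1h cache
qa.query("What's Q3 revenue?")
# Any question about the same AAPL report within the next hour: ~2s, ~90% of input cost saved
qa.query("What are the top risks?")
qa.query("Compare to FY25...")

Business model: users pay a monthly fee, and caching absorbs much of the infrastructure cost. As an illustrative, unmeasured figure: margin could move from 30% to 80%.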

Case: multi-doc compliance review

# Feed a monstrous compliance manual + transaction history
def compliance_audit(transaction):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system=[
            {"type": "text", "text": COMPLIANCE_MANUAL,    # 100K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": REGULATIONS_2026,     # 50K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}}
        ],
        messages=[{"role": "user",
                   "content": f"Review this transaction:\n{transaction}"}]
    )

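6. Common Pitfalls

Assuming long context = RAG is dead: wrong. Long context is expensive and suffers from lost-in-the-middle. RAG remains the cost-optimal first choice.
Scattering cache breakpoints: every breakpoint has to earn its place (>1024 tokens and reused). A cache_control on a prefix shorter than 1024 tokens is ignored.
Not noticing that the cache prefix changed: Anthropic does not warn you that "this prefix differs from the last one". Actively monitor that cache_read_input_tokens > 0.
The 1h cache cost trap: the write costs 2x. If the entry is read at most once before it expires, it costs more than not caching at all (2.0x + 0.1x = 2.1x vs 2x), and if all reads land within 5 minutes the 1.25x tier is cheaper.
Filling the context window leaves no room for max_tokens: input 199K + max_tokens 4K on a 200K window means the output may be truncated or the request rejected. Leave headroom.
Lost-in-the-middle is worse for multi-hop: a single needle is easy to find; for "comparing X and Y" with X at 30% and Y at 70%, accuracy drops.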
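The 1h cost trap is easy to check with the multipliers from these notes. The sketch below works in units of "one uncached request" and assumes every read hits the cache; the function names are made up for illustration.

```python
def cached_cost_units(n_reads: int, write_mult: float,
                      read_mult: float = 0.1) -> float:
    """Cost of one cache write plus n_reads hits, in units of one uncached request."""
    return write_mult + n_reads * read_mult


def uncached_cost_units(n_reads: int) -> float:
    """Cost of the first request plus n_reads repeats without any caching."""
    return 1.0 + n_reads


def min_reads_to_break_even(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache reads at which caching beats not caching."""
    n = 0
    while cached_cost_units(n, write_mult, read_mult) >= uncached_cost_units(n):
        n += 1
    return n


if __name__ == "__main__":
    print("5min tier break-even reads:", min_reads_to_break_even(1.25))
    print("1h tier break-even reads:", min_reads_to_break_even(2.0))
```

Under these assumptions the 5-minute tier pays off from the first reuse, while the 1-hour tier needs at least two reads before it beats sending the prompt uncached.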

7. Quick Reference

Anthropic Cache Matrix

                  TTL      Write Multiplier   Read Multiplier
ephemeral 5min:   5min     1.25x              0.1x
ephemeral 1h:     1h       2.0x               0.1x

Min cache size:   1024 tokens (Opus/Sonnet), 2048 (Haiku)
Max breakpoints:  4 per request
Cache key:        exact prefix match (incl. whitespace)
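Since the cache key is an exact prefix match, a cheap local safeguard is to fingerprint the prefix you send and log when it changes. The sketch below is a client-side proxy only: it does not reproduce how the API computes its cache key, and `prefix_fingerprint` is a hypothetical helper.

```python
import hashlib
import json
from typing import List, Optional


def prefix_fingerprint(system_blocks: List[dict],
                       tools: Optional[List[dict]] = None) -> str:
    """Short hash of the tools + system prefix, for spotting accidental changes."""
    payload = json.dumps({"tools": tools or [], "system": system_blocks},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    a = prefix_fingerprint([{"type": "text", "text": "stable docs"}])
    b = prefix_fingerprint([{"type": "text", "text": "stable docs "}])  # trailing space
    print(a, b, "changed" if a != b else "same")
```

Logging this fingerprint next to `cache_read_input_tokens` makes it easier to tell a cache miss caused by expiry from one caused by a silently edited prefix.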

Long-Context Decision Tree

Structured, stable knowledge?          → RAG (low cost)
Need the entire context to reason?     → Long context + cache
Multi-turn conversation?               → Cache history blocks
Occasional one-off long document?      → Long context, no cache is OK
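The decision tree can be written as a small function. The notes do not say how to break ties when several answers are "yes"; this sketch assumes that needing the full context overrides the RAG branch, and `choose_strategy` is a hypothetical name.

```python
def choose_strategy(stable_structured_kb: bool,
                    needs_full_context: bool,
                    multi_turn: bool) -> str:
    """Return the first matching plan from the decision tree (tie-breaking is an assumption)."""
    if stable_structured_kb and not needs_full_context:
        return "RAG"
    if needs_full_context:
        return "long context + cache"
    if multi_turn:
        return "cache history blocks"
    # Fallback: an occasional one-off long document.
    return "long context, no cache"


if __name__ == "__main__":
    print(choose_strategy(stable_structured_kb=True,
                          needs_full_context=False,
                          multi_turn=False))
```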

1M Context Usage (Claude 4.7 1M)

client.messages.create(
    model="claude-opus-4-7-1m",  # hypothetical 1M-context variant
    max_tokens=8192,
    extra_headers={"anthropic-beta": "1m-context-beta"},  # placeholder value; check the current docs
    ...
)
# Pricing: input ~2x normal beyond 200K

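8. Interview Questions

Q1: Why does Claude advertise 99% needle retrieval at 1M context yet still make mistakes in practice?

The "99%" comes from a simple single-needle, single-hop test. Accuracy drops significantly for multi-needle (find two things and aggregate them) and multi-hop reasoning (X depends on Y). In production, rely on prompt structure (key information placed explicitly at the beginning or end) and on verification.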

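Q2: Design a financial-report chatbot for 5,000 users. How do you use prompt caching to save money?

(a) Group by ticker: each ticker's 10-K is its own cache breakpoint. (b) Use the 1h cache for the few hot stocks where traffic concentrates (the write is 2x, but reads are as cheap as they get). (c) Use the default 5min TTL for long-tail stocks. (d) Monitor caches with a low hit ratio; they may need reorganizing. (e) Pre-warm caches off-hours (cron job).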

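Q3: How does lost-in-the-middle affect your product architecture?

(1) Don't depend on mid-context information: put key instructions at the beginning or end. (2) Split multi-hop reasoning into several API calls (each focused). (3) For long documents, use Citations so the model explicitly points to the source position. (4) Where needed, replace long context with precise RAG retrieval.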

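Q4: How do you choose between long context and RAG?

It depends on whether you need the "whole picture": (1) You know the relevant chunks → RAG. (2) You need cross-document reasoning → long context + cache. (3) Repeated queries over the same corpus → long context, cached once. (4) Cost-critical → RAG is almost always cheaper.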

9. Preview for Tomorrow

Day 131: Prompt security: prompt injection, jailbreaks, indirect injection, red-team testing.