Long-Context Engineering: 1M Context, Lost-in-the-Middle, Prompt Caching
Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching in practice
Date: 2026-09-08 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #LongContext #Caching #NeedleInHaystack #LostInTheMiddle
Today's Goals
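| Type | Content |
|---|---|
| Study | Context window evolution, the lost-in-the-middle phenomenon, context compression, prompt caching in practice |
| Hands-on | 1M-context "needle in a haystack" test; measured savings from prompt caching |
| Deliverable | needle_test.md + long-context best practices |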
1. Theoretical Foundations
1.1 Context Window Evolution
| Year | Model | Max Context |
|---|---|---|
| 2020 | GPT-3 | 2K |
| 2022 | GPT-3.5 | 4K (16K variant in 2023) |
| 2023 | GPT-4 | 8K / 32K → 128K (GPT-4 Turbo) |
| 2024 | Claude 3 | 200K |
| 2024 | Gemini 1.5 Pro | 1M, 2M |
| 2025 | Claude 4.6 | 200K (1M for select customers) |
| 2026 | Claude 4.7 (1M context) | 1M (the model used by this system) |
| 2026 | Gemini 2.5 | 2M+ |
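1.2 Lost-in-the-Middle (Liu et al. 2023)
Empirical finding: models retrieve information from the middle of the context with markedly lower accuracy than from the beginning or the end.
Position vs. accuracy (typical curve, illustrative numbers):
Beginning: 90%
Middle: 65%
End: 85%
Hypothesized causes:
- In pretraining data, "important information" tends to sit at the start or end of documents
- Attention sinks (Xiao et al. 2023): the first tokens consistently receive high attention
- Position embeddings become "blurry" (less discriminative) in the middle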
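1.3 Needle-in-a-Haystack Test
Greg Kamradt's classic 2023 test:
- Build a document of N tokens
- Insert a "needle" at position k (a unique fact, e.g. "the special password is 12345")
- Ask a question at the end: "What's the special password?"
- Measure retrieval accuracy across (N, k)
The "99%+ retrieval at 1M context" figures in Claude and Gemini marketing come from this kind of test. Its limitation: a single needle is relatively easy, while multi-hop reasoning across a long context remains hard.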
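1.4 Context Compression Techniques
- LLMLingua (Microsoft): uses a small model to identify the important tokens and drops the rest
- Hierarchical summarization: summarize a long document level by level
- RAG: retrieve the relevant chunks instead of feeding the full text
- Selective context: make multiple API calls, each fed only the relevant portion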
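Of the techniques above, hierarchical summarization is the easiest to sketch without extra dependencies. The following is a minimal illustration, not a reference implementation: `hierarchical_summarize`, `chunk_text`, and the character-based limits are hypothetical names and choices, and `first_sentence` is only a stand-in for what would normally be an LLM summarization call.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_chars: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def hierarchical_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_chars: int = 2000,
    max_chars: int = 2000,
) -> str:
    """Summarize chunk by chunk, then re-summarize the joined result, until it fits."""
    while len(text) > max_chars:
        reduced = "\n".join(summarize(chunk) for chunk in chunk_text(text, chunk_chars))
        if len(reduced) >= len(text):
            # Guard against a summarizer that does not shrink its input (would loop forever).
            raise ValueError("summarize() did not shrink the text")
        text = reduced
    return text


def first_sentence(chunk: str) -> str:
    """Stand-in summarizer: keep only the first sentence of the chunk."""
    return chunk.split(". ")[0].strip() + "."


doc = "Alpha is first. " * 500  # 8000 characters of filler
summary = hierarchical_summarize(doc, first_sentence)
print(len(doc), "->", len(summary))
```

In a real pipeline the `summarize` callable would wrap a model call, and the limits would be expressed in tokens rather than characters.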
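1.5 Prompt Caching: The Engineering Lifeline for Long Context
Without caching: a 1M-token context costs $15 per request (Opus input at $15 per 1M tokens). One query per second is 86,400 × $15 ≈ $1.3M/day. With caching: one write at 1.25x plus N reads at 0.1x, so the average cost drops sharply as N grows.
First request (cache write, no hit yet): cost = $15 × 1.25 = $18.75
N=100 reads: total = $18.75 + 100 × $1.50 = $168.75, avg ≈ $1.69/req (about 89% cheaper than $15)
N=1000 reads: total = $18.75 + 1000 × $1.50 = $1518.75, avg ≈ $1.52/req (about 90% cheaper)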
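The arithmetic above can be reproduced with a small helper. This is a sketch under the assumptions stated in this section: `amortized_cost` is an illustrative name, and the $15 base cost and the 1.25x / 0.1x multipliers are the figures quoted above, not values read from any API.

```python
def amortized_cost(base_cost: float, n_reads: int,
                   write_mult: float = 1.25, read_mult: float = 0.1):
    """Return (total, average per read) for one cache write followed by n_reads cache hits."""
    total = base_cost * write_mult + n_reads * base_cost * read_mult
    average = total / n_reads if n_reads else total
    return total, average


for n in (100, 1000):
    total, average = amortized_cost(15.0, n)
    saving = 1 - average / 15.0  # relative to paying the full $15 every time
    print(f"N={n}: total=${total:.2f}, avg=${average:.4f}/req, saving={saving:.1%}")
```

The average divides by the number of reads, matching the figures above; dividing by N + 1 requests (write included) gives a slightly lower average.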
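2. Intuition
Why doesn't long context mean RAG is dead?
In theory a 1M context can hold all of your knowledge at once. But:
- Cost: 1M tokens is $15 per query
- Latency: time to first token at 1M context is roughly 10-30s
- Quality: the more information, the more attention is diluted
RAG that pulls only the relevant 5K tokens is faster, cheaper, and possibly higher quality.
Where long context really pays off:
- Complex reasoning across many docs (relevance cannot be predicted in advance)
- Few-shot prompting with long examples
- Whole-codebase understanding
- Context accumulated over multi-turn conversations
Why won't lost-in-the-middle go away?
Technical fixes exist (attention modifications), but they require new architectures. Claude 4.7 presumably uses some tricks (speculation: positional augmentation, RAG-style retrieval at the attention level), yet mid-position recall is still weaker. Engineering countermeasure: place key information explicitly at the beginning and the end.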
3. Code Implementation
3.1 Needle-in-a-Haystack Test
# needle_test.py
"""
Long-context retrieval test against Claude 4.7.
"""
import random
import time

import anthropic

client = anthropic.Anthropic()

# Synthetic filler sentence; make_haystack() appends formatted copies until the
# target length is reached. Real prose (e.g. Project Gutenberg text) can be
# substituted for a more realistic haystack.
HAYSTACK_TEMPLATE = (
    "The history of {} is long and complex. {} was founded in {}, "
    "with profound implications for trade routes. "
    "Subsequently, scholars in the {} century debated...\n"
)


def make_haystack(target_tokens=200_000):
    """Build filler text of roughly target_tokens tokens (rough estimate: chars / 4)."""
    cities = ["Athens", "Rome", "Beijing", "Cairo", "Paris"]
    parts, n_chars = [], 0
    while n_chars / 4 <= target_tokens:
        sentence = HAYSTACK_TEMPLATE.format(
            random.choice(cities), random.choice(cities),
            random.randint(100, 1900), random.randint(1, 21)
        )
        parts.append(sentence)
        n_chars += len(sentence)
    return "".join(parts)


def insert_needle(haystack, needle, position_pct):
    """Insert the needle at position_pct (0.0-1.0) of the haystack."""
    pos = int(len(haystack) * position_pct)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]


def test_retrieval(haystack, needle, question):
    """Ask the question after the haystack; `needle` is kept only for logging by callers."""
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        temperature=0.0,
        system="Answer the question using only information from the provided text.",
        messages=[{"role": "user", "content": haystack + f"\n\nQuestion: {question}"}]
    )
    return resp.content[0].text, resp.usage


# Experiment
NEEDLE = "The secret password is FALCON-9876."
QUESTION = "What is the secret password?"
results = []
target_length = 200_000  # 200K tokens
haystack = make_haystack(target_length)
for pos_pct in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    test_doc = insert_needle(haystack, NEEDLE, pos_pct)
    t0 = time.time()
    ans, usage = test_retrieval(test_doc, NEEDLE, QUESTION)
    latency = time.time() - t0
    found = "FALCON-9876" in ans
    print(f"Position {pos_pct*100:>5.0f}%: found={found}, "
          f"latency={latency:.1f}s, tokens={usage.input_tokens}, "
          f"answer={ans[:80]!r}")
    results.append((pos_pct, found, latency, usage.input_tokens))
Typical expected output:
Position 0%: found=True, latency=12.3s, tokens=200145
Position 10%: found=True, latency=12.1s
Position 25%: found=True, latency=12.5s
Position 50%: found=True, latency=12.8s # Claude 4.7 is already close to 100%
Position 75%: found=True
Position 90%: found=True
Position 100%: found=True
Harder tests (multiple needles, multi-hop) lower accuracy significantly.
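A multi-needle variant can reuse the same harness. The sketch below only builds the document and scores an answer string, so it runs without an API call; `insert_needles` and `recall` are hypothetical helper names, and substring matching is a deliberately crude scoring rule.

```python
def insert_needles(haystack: str, needles: dict) -> str:
    """Insert each needle at its fractional position (0.0-1.0) of the original haystack.

    `needles` maps position -> needle text. Insertion runs from the back of the
    document to the front so that earlier offsets are not shifted.
    """
    placements = sorted(
        ((int(len(haystack) * pct), text) for pct, text in needles.items()),
        reverse=True,
    )
    for pos, text in placements:
        haystack = haystack[:pos] + " " + text + " " + haystack[pos:]
    return haystack


def recall(answer: str, expected: list) -> float:
    """Fraction of expected strings that appear verbatim in the answer."""
    hits = sum(1 for item in expected if item in answer)
    return hits / len(expected)


filler = "x" * 1000
doc = insert_needles(filler, {0.3: "CODE-A is 111.", 0.7: "CODE-B is 222."})
print(doc.index("CODE-A"), doc.index("CODE-B"))
print(recall("The codes are 111 and 222.", ["111", "222", "333"]))
```

A multi-hop question would then ask for something that needs both needles, for example the sum of the two codes, and score the combined answer instead of each needle separately.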
3.2 Measuring Prompt Caching
# caching_benefit.py
"""
Compare cost and latency with and without prompt caching.
"""
import time

import anthropic

client = anthropic.Anthropic()

LARGE_CONTEXT = "..." * 50000  # placeholder standing in for ~50K tokens of stable docs
QUESTIONS = [
    "Summarize the main themes.",
    "Who are the key persons mentioned?",
    "What time period does this cover?",
    "What are the financial figures?",
    "List any product names.",
]


def call_no_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=LARGE_CONTEXT,
        messages=[{"role": "user", "content": question}]
    )


def call_with_cache(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {"type": "text", "text": LARGE_CONTEXT,
             "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": question}]
    )


# Baseline (no cache)
print("=== No cache ===")
total_in, total_lat = 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_no_cache(q)
    lat = time.time() - t0
    print(f" in={r.usage.input_tokens}, latency={lat:.1f}s")
    total_in += r.usage.input_tokens
    total_lat += lat
print(f"\nTotal input: {total_in}, total latency: {total_lat:.1f}s")
# Cost: 5 × 50K × $3/M = $0.75

# With cache
print("\n=== With cache ===")
total_in, total_cache_write, total_cache_read, total_lat = 0, 0, 0, 0
for q in QUESTIONS:
    t0 = time.time()
    r = call_with_cache(q)
    lat = time.time() - t0
    print(f" in={r.usage.input_tokens}, "
          f"cache_create={r.usage.cache_creation_input_tokens}, "
          f"cache_read={r.usage.cache_read_input_tokens}, "
          f"latency={lat:.1f}s")
    # With caching, input_tokens counts only the uncached tokens after the breakpoint
    total_in += r.usage.input_tokens
    total_cache_write += r.usage.cache_creation_input_tokens or 0
    total_cache_read += r.usage.cache_read_input_tokens or 0
    total_lat += lat
print(f"\nUncached input: {total_in}, cache write: {total_cache_write}, "
      f"cache read: {total_cache_read}, total latency: {total_lat:.1f}s")
# Cost: 1 write (50K × $3.75/M = $0.19) + 4 reads (200K × $0.30/M = $0.06) = $0.25 (67% cheaper)
# Latency: the first call matches no-cache; later calls may be 20-40% faster
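Expected (illustrative numbers):
=== No cache ===
in=50000, latency=8.2s
in=50000, latency=8.4s
...
=== With cache ===
in=10, cache_create=49990, cache_read=0, latency=8.3s # write
in=12, cache_create=0, cache_read=49990, latency=5.1s # hit
in=11, cache_create=0, cache_read=49990, latency=5.0s
With caching enabled, in (usage.input_tokens) covers only the uncached tokens after the breakpoint; the cached prefix is reported under cache_create / cache_read.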
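3.3 RAG vs. Long Context: Cost Comparison
# rag_vs_longcontext.py
"""
Same question: RAG (5K context) vs. full document (200K context).
Compares cost only; latency and accuracy have to be measured separately.
"""


def cost_estimate(input_tokens, output_tokens=200, model="claude-sonnet-4-6",
                  input_multiplier=1.0):
    """Dollar cost of one request. input_multiplier models a cache write (1.25) or read (0.1);
    it applies to input tokens only, since output tokens are never discounted."""
    PRICES = {"claude-sonnet-4-6": (3.0, 15.0)}  # $ per 1M tokens: (input, output)
    p_in, p_out = PRICES[model]
    return input_tokens * p_in * input_multiplier / 1e6 + output_tokens * p_out / 1e6


# Assume 100 queries per day
n_queries = 100

# Option A: feed the full text (200K) every time
cost_a = n_queries * cost_estimate(200_000)
print(f"A (full context, no cache): ${cost_a:.2f}/day")

# Option B: full text + cache
# (assumes queries arrive close enough together that one write keeps the cache warm)
cost_b_write = cost_estimate(200_000, input_multiplier=1.25)  # first call writes the cache
cost_b_reads = (n_queries - 1) * cost_estimate(200_000, input_multiplier=0.1)
cost_b = cost_b_write + cost_b_reads
print(f"B (full context, cached): ${cost_b:.2f}/day")

# Option C: RAG extracting a 5K context
embed_cost = 0.01 * n_queries       # embed query (assumed $0.01 each)
retrieval_cost = 0.001 * n_queries  # vector DB query (assumed $0.001 each)
gen_cost = n_queries * cost_estimate(5_000)
cost_c = embed_cost + retrieval_cost + gen_cost
print(f"C (RAG): ${cost_c:.2f}/day")
Output:
A (full context, no cache): $60.30/day
B (full context, cached): $6.99/day (88% cheaper)
C (RAG): $2.90/day (95% cheaper)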
Accuracy is a separate question:
- A & B: complete information, but subject to lost-in-the-middle
- C: retrieval may miss relevant passages
4. Anthropic API Best Practices
4.1 Prompt Caching Parameters in Detail
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # The cached prefix is built in the order tools → system → messages.
    # At most 4 breakpoints per request; longer TTLs must come before shorter ones.
    tools=[
        # bp1: tool definitions can be cached too; one breakpoint on the last tool covers them all
        *TOOLS[:-1],
        {**TOOLS[-1], "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ],
    system=[
        # Without cache_control: short immutable instruction
        {"type": "text", "text": "You are a helpful assistant."},
        # bp2: 1h cache (hot KB)
        {"type": "text", "text": HOT_KB,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # bp3: large stable KB, default 5min TTL
        {"type": "text", "text": LARGE_DOC,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # bp4: conversation history can be cached
        {"role": "user", "content": [
            {"type": "text", "text": COMPRESSED_HISTORY,
             "cache_control": {"type": "ephemeral"}}
        ]},
        {"role": "assistant", "content": "..."},
        {"role": "user", "content": NEW_QUERY}  # not cached
    ]
)
4.2 Debugging Cache Hits
def cache_attribution(response):
    u = response.usage
    # input_tokens counts only uncached tokens; cached tokens are reported separately
    cache_write = u.cache_creation_input_tokens or 0
    cache_read = u.cache_read_input_tokens or 0
    total = u.input_tokens + cache_write + cache_read
    return {
        "non_cached": u.input_tokens,
        "cache_write": cache_write,
        "cache_read": cache_read,
        "hit_ratio": cache_read / max(total, 1),
    }
4.3 Recommended Long-Context Structure
[System: Brief instruction] ← always re-read
[KB / Long Docs (cached)] ← cached, immutable
[Examples (cached if stable)]
[Conversation history (cached selectively)]
[Most recent user message] ← never cached
Restate the key information in the final user message; this is the most effective defense against lost-in-the-middle.
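The layout above can be assembled by a small helper. This is a minimal sketch: `build_long_context_prompt` and its parameters are hypothetical names, and the block shapes simply mirror the `system` / `messages` dictionaries used elsewhere in these notes. It builds the request pieces only and does not call the API.

```python
def build_long_context_prompt(instruction, documents, question, key_facts=()):
    """Return (system_blocks, messages) following the head/tail layout.

    - the brief instruction comes first and is left uncached
    - the stable documents follow, with one cache breakpoint on the last document
    - the final user message restates the key facts right before the question
    """
    system = [{"type": "text", "text": instruction}]
    for i, doc in enumerate(documents):
        block = {"type": "text", "text": doc}
        if i == len(documents) - 1:
            # One breakpoint at the end of the stable prefix caches everything before it.
            block["cache_control"] = {"type": "ephemeral"}
        system.append(block)

    user_text = question
    if key_facts:
        reminder = "\n".join(f"- {fact}" for fact in key_facts)
        user_text = f"Key facts to keep in mind:\n{reminder}\n\nQuestion: {question}"
    return system, [{"role": "user", "content": user_text}]


system, messages = build_long_context_prompt(
    "Answer using only the provided documents.",
    ["<doc A text>", "<doc B text>"],
    "What is the refund policy?",
    key_facts=["The refund policy is described in doc A."],
)
print(messages[-1]["content"])
```

The returned values can be passed as `system=system, messages=messages` to `client.messages.create`.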
5. Applications in Finance
Case: Real-Time Earnings Report Q&A (cache-driven)
import anthropic

client = anthropic.Anthropic()


class FinancialQA:
    def __init__(self, ticker):
        self.ticker = ticker
        self.report = self._load_10k(ticker)  # 200K tokens

    def _load_10k(self, ticker):
        # Simulated load (placeholder text)
        return f"<10-K of {ticker}>... " * 20000

    def query(self, question):
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[
                {"type": "text",
                 "text": f"You are an analyst expert on {self.ticker}."},
                {"type": "text",
                 "text": self.report,
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}}
            ],
            messages=[{"role": "user", "content": question}]
        )


# Usage
qa = FinancialQA("AAPL")
# First query: ~10s, writes the 1h cache
qa.query("What's Q3 revenue?")
# Any later AAPL question within the hour: ~2s, input cost about 90% lower
qa.query("What are the top risks?")
qa.query("Compare to FY25...")
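Business model: users pay a monthly subscription while caching absorbs much of the infrastructure cost. Margin moves from roughly 30% to 80% (illustrative figures).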
Case: Multi-Document Compliance Review
# Feed a monstrous compliance manual + transaction history
def compliance_audit(transaction):
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system=[
            {"type": "text", "text": COMPLIANCE_MANUAL,  # 100K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            {"type": "text", "text": REGULATIONS_2026,  # 50K, 1h cache
             "cache_control": {"type": "ephemeral", "ttl": "1h"}}
        ],
        messages=[{"role": "user",
                   "content": f"Review this transaction:\n{transaction}"}]
    )
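6. Common Pitfalls
- Assuming long context means RAG is dead: wrong. Long context is expensive and suffers from lost-in-the-middle. RAG remains the cost-optimal default.
- Scattering cache breakpoints: every breakpoint must earn its place (longer than the minimum cacheable length of 1024 tokens, and actually reused). A cache_control on a prefix shorter than 1024 tokens is ignored.
- Not noticing that the cache prefix changed: Anthropic does not warn you that this prefix differs from the previous one. Actively monitor that cache_read_input_tokens > 0.
- 1h cache cost trap: the write costs 2x. One write plus a single hit costs 2.1x, more than two uncached requests (2x), so a 1h cache that expires after one hit loses money; it only pays off from the second hit onward.
- Filling the context window leaves no room for max_tokens: with 199K of input plus max_tokens of 4K in a 200K window, the output may be truncated or the request rejected. Leave headroom.
- Lost-in-the-middle is worse for multi-hop: a single needle is easy to find, but for "comparing X and Y" with X at 30% and Y at 70%, accuracy drops.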
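Because a changed prefix fails silently, it can help to fingerprint the cacheable prefix on the client and log when it drifts. The sketch below is a client-side heuristic only: `prefix_fingerprint` and `PrefixWatcher` are hypothetical names, and the JSON serialization is an assumption for illustration, not how the server computes its cache key.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks, tools=()):
    """Short hash of the cacheable prefix (tools + system blocks)."""
    payload = json.dumps(
        {"tools": list(tools), "system": list(system_blocks)},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class PrefixWatcher:
    """Remembers the last fingerprint and reports when the prefix changes."""

    def __init__(self):
        self.last = None

    def changed(self, system_blocks, tools=()):
        fingerprint = prefix_fingerprint(system_blocks, tools)
        was_changed = self.last is not None and fingerprint != self.last
        self.last = fingerprint
        return was_changed


watcher = PrefixWatcher()
stable = [{"type": "text", "text": "large stable doc"}]
print(watcher.changed(stable))  # first call: nothing to compare against
print(watcher.changed(stable))  # same prefix
print(watcher.changed([{"type": "text", "text": "large stable doc "}]))  # trailing space added
```

Pairing this with the `cache_read_input_tokens > 0` check separates two failure modes: a prefix that changed on the client versus a cache entry that simply expired.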
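7. Quick Reference
Anthropic Cache Matrix
| Cache type | TTL | Write multiplier | Read multiplier |
|---|---|---|---|
| ephemeral (default) | 5 min | 1.25x | 0.1x |
| ephemeral, ttl "1h" | 1 h | 2.0x | 0.1x |
Min cache size: 1024 tokens (Opus/Sonnet), 2048 (Haiku); minimums can differ by model generation, so check the current docs
Max breakpoints: 4 per request
Cache key: exact prefix match (incl. whitespace)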
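The multipliers in the matrix imply a break-even point for each TTL. The helper below is a sketch with illustrative names (`cached_cost`, `uncached_cost`, `break_even_reads`); costs are expressed in units of one uncached request's input cost, and it assumes every hit lands before the cache expires.

```python
def cached_cost(n_reads: int, write_mult: float, read_mult: float = 0.1) -> float:
    """Input cost of one cache write plus n_reads hits, in units of one uncached request."""
    return write_mult + n_reads * read_mult


def uncached_cost(n_reads: int) -> float:
    """Input cost of the same 1 + n_reads requests without caching."""
    return 1.0 + n_reads


def break_even_reads(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of hits at which caching is strictly cheaper (requires read_mult < 1)."""
    n = 0
    while cached_cost(n, write_mult, read_mult) >= uncached_cost(n):
        n += 1
    return n


for label, write_mult in (("5min", 1.25), ("1h", 2.0)):
    print(f"{label}: caching wins from {break_even_reads(write_mult)} hit(s)")
```

Under these multipliers the 5-minute cache pays off after a single hit, while the 1-hour cache needs two.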
Long-Context Decision Tree
Structured, stable knowledge? → RAG (low cost)
Need the entire context to reason? → Long context + cache
Multi-turn conversation? → Cache history blocks
Occasional one-off long document? → Long context without cache is fine
1M Context Usage (Claude 4.7 1M)
client.messages.create(
    model="claude-opus-4-7-1m",  # hypothetical 1M-context variant
    max_tokens=8192,
    extra_headers={"anthropic-beta": "1m-context-beta"},  # header value likewise assumed; check the current docs
    ...
)
# Pricing: input ~2x normal beyond 200K
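Before sending a request this large, a rough pre-flight check helps avoid the headroom pitfall from section 6. The sketch below uses the same chars / 4 token estimate as the needle test; `fits_context`, the 5% safety margin, and the default window size are illustrative assumptions, and a real tokenizer count would be more reliable.

```python
def fits_context(prompt_chars: int, max_tokens: int,
                 context_window: int = 200_000,
                 chars_per_token: float = 4.0,
                 safety_margin: float = 0.05) -> bool:
    """Return True if the estimated input plus max_tokens fits in the window with headroom."""
    estimated_input = prompt_chars / chars_per_token
    budget = context_window * (1 - safety_margin)
    return estimated_input + max_tokens <= budget


# ~199K estimated input tokens + 4K output does not fit a 200K window
print(fits_context(prompt_chars=4 * 199_000, max_tokens=4096))
# the same prompt fits comfortably in a 1M window
print(fits_context(prompt_chars=4 * 199_000, max_tokens=4096, context_window=1_000_000))
```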
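8. Interview Questions
Q1: Why does Claude advertise 99% needle retrieval at 1M context yet still make mistakes in practice?
The "99%" comes from a simple single-needle, single-hop test. Accuracy drops markedly on multi-needle tasks (find two facts and aggregate them) and multi-hop reasoning (X depends on Y). In production, rely on prompt structure (key information placed explicitly at the beginning and end) and on verification.
Q2: Design an earnings-report chatbot for 5,000 users. How do you use prompt caching to save money?
(a) Group by ticker: each ticker's 10-K is its own cache breakpoint. (b) Use the 1h cache for the few hot tickers where traffic concentrates (the write costs 2x, but reads are as cheap as they get). (c) Use the default 5min cache for long-tail tickers. (d) Monitor caches with a low hit ratio; they may need reorganizing. (e) Pre-warm caches during off-hours (cron job).
Q3: How does lost-in-the-middle affect your product architecture?
(1) Do not depend on mid-context information: put key instructions at the beginning and end. (2) For multi-hop reasoning, split the work into multiple focused API calls. (3) For long documents, use Citations so the model explicitly identifies the source position. (4) Where necessary, replace long context with precise RAG retrieval.
Q4: How do you choose between long context and RAG?
It depends on whether you need the full picture: (1) You know which chunks are relevant → RAG. (2) You need cross-document reasoning → long context + cache. (3) Repeated queries against the same corpus → long context, cached once. (4) Cost-critical → RAG is the cheaper option.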
9. Coming Up Next
Day 131: Prompt Security: prompt injection, jailbreaks, indirect injection, red-team testing.