Expert Day 127

Week 19 Review: 50 Things Every LLM Engineer Must Know

Consolidating the core knowledge from Day 121-126

2026-09-05

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-09-05 | Track: AI Systems Engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #Review #LLMEngineering #Cheatsheet

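Today's Goals

Type	Content
Study	Consolidate the core knowledge from Day 121-126
Practice	Tie every concept into a single mental model; run a rapid-fire interview quiz
Output	`llm_essentials.md`: 50 things every LLM engineer must know (grouped by topic)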

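I. Transformer Architecture (10 points)

Self-attention complexity: $O(n^2 d)$ time, $O(n^2)$ space. Flash Attention reduces the space to $O(n)$.
Why divide by $\sqrt{d_k}$: the variance of the dot product grows with $d_k$; without scaling, the softmax saturates and gradients vanish.
Multi-head, in essence: several parallel attention subspaces, each learning a different relation. Anthropic's interpretability research reports specialized heads such as "induction heads" and "copy heads".
KV-cache: during autoregressive generation, cache the past K/V so that each step costs $O(n)$ instead of $O(n^2)$. This is the engineering foundation of long context.
GQA (Grouped Query Attention): several Q heads share one KV head (e.g. 8:1), which shrinks the KV cache 8x. Llama 3 uses it; Claude is assumed to as well, though its architecture is not public.
RoPE > absolute position encoding: the rotation between a query and a key depends only on their relative position; it can be extrapolated (YaRN scaling); it is compatible with GQA.
ALiBi vs RoPE: ALiBi has zero parameters and extrapolates well but is less expressive; RoPE is more expressive and is the mainstream choice.
Pre-norm vs post-norm: modern LLMs almost universally use pre-norm because it trains more stably. RMSNorm is roughly 20% faster than LayerNorm.
SwiGLU: the modern FFN, $(\text{SiLU}(xW_1) \odot xW_2)\,W_3$, about 1% better than a ReLU FFN.
Decoder-only > encoder-decoder: for autoregressive generation, the decoder-only architecture is simpler and scales well. The exception is dedicated translation/summarization models.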
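To make the GQA point concrete, here is a minimal sizing sketch. The model shape (32 layers, 32 query heads, head dimension 128, fp16, 8K context) is a hypothetical configuration chosen for round numbers, not any specific model's published spec.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every position."""
    # Leading factor 2: one tensor for K plus one for V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model: 32 layers, head_dim 128, fp16 (2 bytes), 8K context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# GQA at 8:1 means 32 query heads share 4 KV heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.2f} GiB per sequence")
print(f"GQA 8:1: {gqa / 2**30:.2f} GiB per sequence")
print(f"Reduction: {mha // gqa}x")
```

The cache grows linearly with sequence length and batch size, which is why the KV head count matters most for long-context, high-concurrency serving.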

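II. Scaling Laws & Training (8 points)

Chinchilla rule: N : D ≈ 1 : 20 (parameters to training tokens) is compute-optimal.
Total training FLOPs ≈ 6ND. A single training run of GPT-4 (reportedly ~1.8T MoE, 13T tokens) is estimated at ~$60M-100M.
Over-training small models: Llama-3 8B was trained on 15T tokens, far beyond the optimum, trading training cost for lower inference cost.
MFU: Model FLOPs Utilization is typically 0.3-0.5; H100 fp16 peak is 989 TFLOP/s.
Emergent abilities: some are real (CoT), some are artifacts of the evaluation metric. Look at the scaling curve for the specific task.
Test-time compute scaling: o1 and Claude extended thinking open a new dimension. Performance keeps improving as inference tokens go from 1K to 100K.
MoE models: total params and active params differ. GPT-4 is rumored to be 1.8T total / ~280B active per token.
Inference cost = forever cost: training is a one-off, while inference cost continues for the lifetime of the service. Compute the total TCO when making decisions.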
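The 6ND rule and the Chinchilla ratio can be checked with a few lines of arithmetic. This sketch plugs in the Llama-3 8B figures quoted above; the MFU of 0.4 is an assumed mid-range value, and the GPU-hour figure is a rough estimate, not a reported number.

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute: 6 * N * D."""
    return 6 * n_params * n_tokens


def chinchilla_tokens(n_params, ratio=20):
    """Compute-optimal token count under the N : D = 1 : 20 rule of thumb."""
    return ratio * n_params


def gpu_hours(flops, peak_flops_per_s=989e12, mfu=0.4):
    """GPU-hours at a given peak throughput and utilization (mfu is an assumption)."""
    return flops / (peak_flops_per_s * mfu) / 3600


n_params, n_tokens = 8e9, 15e12  # Llama-3 8B on 15T tokens
flops = training_flops(n_params, n_tokens)
overtrain = n_tokens / chinchilla_tokens(n_params)
hours = gpu_hours(flops)

print(f"Training compute: {flops:.2e} FLOPs")
print(f"Tokens vs Chinchilla-optimal: {overtrain:.1f}x")
print(f"H100 GPU-hours at MFU 0.4: {hours:,.0f}")
```

Under these assumptions the model sees roughly 94 times the compute-optimal token count, which is the "over-training" trade described above.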

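III. Tokenization (6 points)

Byte-level BPE: the vocab starts from the 256 byte values, so nothing is ever OOV. Mainstream since GPT-2.
Chinese token density (approximate): Anthropic Claude ~1.5 chars/token; GPT-4o ~1 char/token; GPT-3.5 cl100k ~0.5 char/token.
Number tokenization pitfall: "1234" may be 1 token while "12345678" is split into several, which breaks the decimal structure and leads to arithmetic failures.
Whitespace sensitivity: " Apple" and "Apple" are different tokens. Keep the format of few-shot examples consistent.
Prompt cache depends on an exact prefix: a single trailing space causes a cache miss.
The count_tokens API is free: estimate cost before going to production. Anthropic's count_tokens is a separate endpoint.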
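The "never OOV" and whitespace points both follow from working on UTF-8 bytes. The sketch below shows only the base layer of a byte-level tokenizer (one ID per byte, before any BPE merges); `byte_level_ids` is a hypothetical helper for illustration, not a real tokenizer API.

```python
def byte_level_ids(text):
    """Base token IDs before any BPE merges: one ID in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Any string, in any script, maps into the 256-entry base vocabulary.
print(byte_level_ids("中"))       # one Chinese character becomes three bytes
print(byte_level_ids("Apple"))
print(byte_level_ids(" Apple"))   # the leading space is its own byte (32)
```

Because the leading space changes the byte sequence from the first position, the merges that follow differ too, which is why " Apple" and "Apple" end up as different tokens.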

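IV. Sampling (6 points)

Temperature: T=0 is deterministic greedy decoding; T=1 samples the original distribution; T>1 is more random. Anthropic caps it at 1.0.
Top-p > Top-k: top-p adapts to the shape of the distribution. A common setting is top_p=0.95.
Self-consistency: CoT + multiple samples + majority vote. Roughly +10% on GSM8K, at N times the cost.
Speculative decoding: a draft model proposes tokens quickly and the target model verifies them. 2-4x speedup, with a mathematical guarantee that the output distribution is unchanged.
Extended thinking forces T=1: Anthropic extended thinking does not let you set T, so greedy decoding cannot degrade the reasoning.
T=0 is not 100% deterministic: GPU non-determinism (parallel reductions, kernel selection) can cause tiny diffs. Strict reproducibility requires a seed plus deterministic kernels.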
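Temperature and top-p are easy to mix up, so here is a minimal pure-Python sketch of both, applied to a toy logit vector. It illustrates the standard definitions only; it does not reflect how any particular inference server implements sampling.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return a token index; temperature == 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [1.0, 3.0, 2.0, 0.5]
print(sample(logits, temperature=0))                # greedy pick
print(sample(logits, temperature=1.0, top_p=0.95))  # nucleus sample
```

Lower temperature sharpens the distribution before the nucleus is cut, so the two knobs interact: at low T, top_p=0.95 often keeps only one or two tokens.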

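V. Prompt Engineering (10 points)

Few-shot example quality > quantity: 5 high-quality examples beat 50 low-quality ones.
Anthropic recommends XML tags: <example>, <context>, <query>. Claude's training favors them.
Set the role in the system prompt: a role in the system message is respected more than one in a user message.
CoT: "Let's think step by step" is the classic incantation; Claude 4.7 already reasons step by step by default in many scenarios, so it may be unnecessary.
Self-consistency only works for tasks with a "standard answer"; it is meaningless for creative writing.
High-value ToT scenarios: 24-game, puzzles, code search, writing outlines; that is, whenever intermediate states can be evaluated.
Few-shot vs fine-tune: < 100 examples → few-shot; > 1000 → fine-tune; in between, try both.
Lost-in-the-middle: the middle of a long context is easily ignored. Put important instructions at the head and the tail.
No wrapping around JSON output: use the Tools API, or say explicitly "output ONLY raw JSON, no markdown".
Tool descriptions must state when to use the tool and when not to, so the model does not call tools at random.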
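Two of the points above (XML tags and lost-in-the-middle) combine naturally in a prompt builder. This is a minimal sketch; `build_prompt` and the exact layout are illustrative choices, not an official template.

```python
def build_prompt(instruction, context, examples, query):
    """Assemble an XML-tagged prompt with the instruction at both head and tail."""
    parts = [
        instruction,
        f"<context>\n{context}\n</context>",
        "\n".join(f"<example>\n{e}\n</example>" for e in examples),
        f"<query>\n{query}\n</query>",
        # Repeat the instruction last so it is not buried mid-context.
        f"Reminder: {instruction}",
    ]
    return "\n\n".join(p for p in parts if p)  # drop empty sections


prompt = build_prompt(
    instruction="Answer using only the context. Reply with a single number.",
    context="Q3 revenue was 12.4M USD; Q2 revenue was 11.1M USD.",
    examples=[
        "Q: What was Q2 revenue?\nA: 11.1",
        "Q: What was the Q2 to Q3 change?\nA: 1.3",
    ],
    query="What was Q3 revenue?",
)
print(prompt)
```

Keeping every example in the same Q/A shape also addresses the whitespace-consistency point from the tokenization section.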

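VI. Anthropic Claude Specifics (8 points)

Claude 4.7 (Opus): flagship, $15/Mtok input, $75/Mtok output. Sonnet 4.6: $3/$15. Haiku 4.5: $0.8/$4.
Prompt caching: writes cost 1.25x, reads 0.1x; 5-minute ephemeral TTL, or 1 hour at 2x write cost. Minimum 1024 tokens (Sonnet/Opus), 2048 (Haiku).
Extended thinking: budget_tokens from 1K to 32K+. Forces T=1, top_p=1. Thinking tokens are billed at the output price.
Parallel tool use: a single assistant message can contain multiple tool_use blocks; the user reply must contain all of the tool_results.
Tool use with thinking: plan in thinking → tool calls → results → final answer. Quality is noticeably higher than direct tool use.
At most 4 cache breakpoints per request. Common usage: one on the long system KB plus one on the conversation history.
Files API: upload a PDF/image once and reference it many times without retransmitting, which saves request size and bandwidth compared with inline base64.
Citations: Claude can cite source documents and mark the quoted spans. Essential for legal/research scenarios.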
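The parallel tool use rule (every tool_use block must get a tool_result in the next user message) can be sketched as below. Content blocks are represented as plain dicts for illustration; the SDK returns typed objects, so a real integration would read attributes instead. The handler names `get_price` and `get_fx` are hypothetical.

```python
def run_parallel_tools(assistant_content, handlers):
    """Build one user message that answers every tool_use block in an assistant turn."""
    results = []
    for block in assistant_content:
        if block.get("type") != "tool_use":
            continue  # skip text and thinking blocks
        result = {"type": "tool_result", "tool_use_id": block["id"]}
        try:
            output = handlers[block["name"]](**block["input"])
            result["content"] = str(output)
        except Exception as exc:
            # Report the failure to the model instead of dropping the result.
            result["content"] = f"error: {exc}"
            result["is_error"] = True
        results.append(result)
    return {"role": "user", "content": results}


# Toy assistant turn with two parallel tool calls.
assistant_content = [
    {"type": "text", "text": "Fetching both values."},
    {"type": "tool_use", "id": "toolu_1", "name": "get_price", "input": {"ticker": "ACME"}},
    {"type": "tool_use", "id": "toolu_2", "name": "get_fx", "input": {"pair": "USDJPY"}},
]
handlers = {"get_price": lambda ticker: 42.0}  # get_fx is deliberately missing

reply = run_parallel_tools(assistant_content, handlers)
print(reply)
```

Returning an error result for the failed call keeps the tool_use/tool_result pairing complete, so the conversation stays valid even when one tool breaks.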

VII. Production Engineering (4 points)

Key metrics to monitor:

cache_creation_input_tokens, cache_read_input_tokens (attribution)
input_tokens, output_tokens (cost)
latency p50/p95/p99
stop_reason distribution (stop_reason=max_tokens means the output was truncated)
Tool use error rate
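A minimal sketch of aggregating those metrics from per-call records follows. The record keys (`latency_s`, `cache_read`, and so on) are assumed to match the metrics dict used in this document's helper code; `summarize` itself is a hypothetical function.

```python
import statistics
from collections import Counter


def summarize(records):
    """Aggregate per-call metric dicts into latency percentiles and health rates."""
    latencies = [r["latency_s"] for r in records]
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    stops = Counter(r["stop_reason"] for r in records)
    cache_read = sum(r["cache_read"] for r in records)
    # Total prompt size = uncached input + cache writes + cache reads.
    total_in = sum(r["input_tokens"] + r["cache_create"] + r["cache_read"] for r in records)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "truncation_rate": stops["max_tokens"] / len(records),
        "cache_hit_rate": cache_read / total_in if total_in else 0.0,
    }


# Synthetic records: 100 calls, 5 of them truncated at max_tokens.
records = [
    {
        "latency_s": 0.1 * i,
        "stop_reason": "max_tokens" if i % 20 == 0 else "end_turn",
        "input_tokens": 100,
        "cache_create": 0,
        "cache_read": 900,
    }
    for i in range(1, 101)
]
print(summarize(records))
```

A rising truncation rate is usually the first sign that max_tokens is set too low for a prompt that has grown over time.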

Failover strategy:

Model layer: Claude Opus → Sonnet → Haiku → GPT-5 → Gemini
Region layer: Anthropic API → Bedrock → Vertex AI Anthropic
On a rate limit hit: exponential backoff + the Message Batches API (asynchronous, 50% discount)
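The failover chain can be sketched as an ordered list of callables with backoff inside each provider. Everything here is provider-agnostic and hypothetical: `call_with_failover` and the fake providers stand in for real SDK calls, and a production version would catch specific error types rather than bare `Exception`.

```python
import time


def call_with_failover(providers, request, max_attempts=2, base_delay=0.5, sleep=time.sleep):
    """Try each (name, fn) in order; back off exponentially before retrying a provider."""
    errors = []
    for name, fn in providers:
        for attempt in range(max_attempts):
            try:
                return name, fn(request)
            except Exception as exc:  # narrow this to retryable errors in real code
                errors.append((name, repr(exc)))
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all providers failed: {errors}")


def flaky_primary(request):
    raise TimeoutError("primary unavailable")


def healthy_secondary(request):
    return f"answer to: {request}"


used, answer = call_with_failover(
    [("primary", flaky_primary), ("secondary", healthy_secondary)],
    "What was Q3 revenue?",
    sleep=lambda s: None,  # skip real waiting in this demo
)
print(used, answer)
```

Injecting `sleep` keeps the function testable; in production the default `time.sleep` applies the real backoff.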

II. Integrating the Mental Model

The LLM as a function

output = LLM(
    weights,           # fixed by training (Day 121-122)
    tokenization,      # vocab / encoding (Day 123)
    sampling,          # T, top_p (Day 124)
    prompt,            # your input (Day 125)
    schema_constraint  # tools / structured (Day 126)
)

Day-to-day engineering discussion is all about these five variables on the right-hand side.

Cost = (input + output) × price × scale

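Every token costs money. Ways to optimize:

Cut input: caching, compressing history, not resending the KB
Cut output: set explicit length limits
Choose the model: don't use Opus where Haiku will do
Choose the API mode: don't use sync where batch will do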

Quality = (model capability) × (prompt quality) × (sampling)

Model capability is the floor: Claude 4.7 (Opus) > Sonnet 4.6 > Haiku
Prompt quality is the amplifier: with a bad prompt, even Opus looks dumb
Sampling is the variance: T=0 is stable but lacks diversity

III. Code: an LLM-Ops Cheat Sheet Implementation

# llm_ops_cheatsheet.py
"""
A few production-grade helper functions for day-to-day use of the Claude API.
"""
import anthropic
import time
import functools

client = anthropic.Anthropic()

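def with_retry(max_attempts=3, base_delay=1.0):
    """Retry with exponential backoff on rate-limit and connection errors.

    Note: the SDK client also retries some errors on its own (see its
    max_retries option); this decorator adds an outer layer on top.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except (anthropic.RateLimitError, anthropic.APIConnectionError):
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    # Wait 1s, 2s, 4s, ... between attempts.
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator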

# Estimated per-1M-token prices in USD (2026 projections).
PRICES = {
    "claude-opus-4-7":   {"in": 15.0, "out": 75.0},
    "claude-sonnet-4-6": {"in": 3.0,  "out": 15.0},
    "claude-haiku-4-5":  {"in": 0.8,  "out": 4.0},
}

@with_retry(max_attempts=3)
def call_claude(model, system, user_msg, **kwargs):
    """Single entry point with automatic metric recording."""
    params = {
        "model": model,
        "max_tokens": kwargs.pop("max_tokens", 4096),
        "temperature": kwargs.pop("temperature", 0.0),
        "messages": [{"role": "user", "content": user_msg}],
        **kwargs,
    }
    if system:
        # Omit the field instead of sending an empty system prompt.
        params["system"] = system

    t0 = time.perf_counter()
    resp = client.messages.create(**params)
    latency = time.perf_counter() - t0

    # Record metrics. Cache fields can be missing or None, hence the "or 0".
    usage = resp.usage
    metrics = {
        "latency_s": latency,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_create": getattr(usage, "cache_creation_input_tokens", 0) or 0,
        "cache_read": getattr(usage, "cache_read_input_tokens", 0) or 0,
        "stop_reason": resp.stop_reason,
        "model": resp.model,
    }

    # Cost estimate. The response may carry a dated model ID, so match by prefix.
    p = next((v for k, v in PRICES.items() if resp.model.startswith(k)),
             {"in": 0.0, "out": 0.0})
    # input_tokens counts only uncached input; cache writes and reads are
    # reported separately, so the three are added rather than subtracted.
    cost = (
        metrics["cache_create"] * p["in"] * 1.25 / 1e6 +
        metrics["cache_read"]   * p["in"] * 0.10 / 1e6 +
        metrics["input_tokens"] * p["in"] / 1e6 +
        metrics["output_tokens"] * p["out"] / 1e6
    )
    metrics["cost_usd"] = round(cost, 6)

    return resp, metrics

def tiered_completion(user_msg, system="", quality="auto"):
    """
    Pick a model automatically based on question complexity.
    quality: 'auto' / 'fast' / 'balanced' / 'best', or an explicit model ID.
    """
    if quality == "fast":
        model = "claude-haiku-4-5"
    elif quality == "best":
        model = "claude-opus-4-7"
    elif quality == "balanced":
        model = "claude-sonnet-4-6"
    elif quality == "auto":
        # Simple heuristic: question length + keywords
        complex_keywords = ["analyze", "calculate", "prove", "debug", "design"]
        is_complex = (
            len(user_msg) > 500 or
            any(k in user_msg.lower() for k in complex_keywords)
        )
        model = "claude-sonnet-4-6" if is_complex else "claude-haiku-4-5"
    else:
        model = quality

    return call_claude(model, system, user_msg)

if __name__ == "__main__":
    resp, m = tiered_completion("What is 2+2?")
    print(f"Model: {m['model']}, Latency: {m['latency_s']:.2f}s, Cost: ${m['cost_usd']:.6f}")

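IV. Anthropic API Best Practices

The week's highlights

Always do:

Set the role in the system message
Structure the prompt with XML tags
Add cache_control to a long system prompt/KB
Use temperature=0 for structured tasks
Re-validate JSON output with Pydantic
Monitor cache hit rate and the stop_reason distribution

Never do:

Change tool definitions dynamically (tweaking a tool description on every call) → cache miss
Leave max_tokens without a hard cap → the occasional 1000x cost spike
Demand "output JSON" through the prompt alone → use the Tools API instead
Assume T=0 without a seed is 100% deterministic → non-determinism remains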
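The "use the Tools API for JSON" advice can be sketched as a forced tool call plus a second validation pass. This example only builds the request payload and parses a response-shaped dict; the tool name `record_answer` and its schema are hypothetical, and the hand-written required-field check is a stdlib stand-in for the Pydantic validation recommended above.

```python
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "revenue_musd": {"type": "number"},
        "quarter": {"type": "string"},
    },
    "required": ["revenue_musd", "quarter"],
}


def structured_request(model, user_msg, tool_name, schema, max_tokens=1024):
    """Keyword arguments for messages.create that force a single tool call."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "tools": [{
            "name": tool_name,
            "description": "Record the final structured answer.",
            "input_schema": schema,
        }],
        # Forcing the tool means the reply is schema-shaped input, not free text.
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{"role": "user", "content": user_msg}],
    }


def extract_tool_input(content_blocks, tool_name, schema):
    """Return the tool input dict, checking required fields as a second line of defense."""
    for block in content_blocks:
        if block["type"] == "tool_use" and block["name"] == tool_name:
            missing = [k for k in schema.get("required", []) if k not in block["input"]]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            return block["input"]
    raise ValueError(f"no tool_use block for {tool_name}")


request = structured_request("claude-haiku-4-5", "What was Q3 revenue?", "record_answer", ANSWER_SCHEMA)
# A response-shaped stand-in; a real call would pass resp.content converted to dicts.
fake_content = [{"type": "tool_use", "id": "toolu_1", "name": "record_answer",
                 "input": {"revenue_musd": 12.4, "quarter": "Q3"}}]
print(extract_tool_input(fake_content, "record_answer", ANSWER_SCHEMA))
```

Keeping the tool definition constant across calls also avoids the cache-miss pitfall listed above.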

V. Application to Finance

The week, integrated: architecture of an earnings-report Q&A system

User Question
     ↓
[Tier 1: Haiku 4.5 + cache]   ← simple factual: "What was Q3 revenue?"
     ↓ (escalate if confidence low)
[Tier 2: Sonnet 4.6 + extended thinking + cache]   ← multi-step analysis
     ↓ (escalate if complexity exceeds)
[Tier 3: Opus 4.7 + thinking + tools (data fetch)]   ← deep research

Cost: roughly Tier 1 $0.001 / Tier 2 $0.05 / Tier 3 $0.50 per query.
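The escalation flow in the diagram can be sketched as a loop over tiers that stops at the first sufficiently confident answer. How confidence is obtained (self-reported score, a verifier, log-probs) is left open in the notes, so each tier here is a hypothetical function returning `(answer, confidence)`.

```python
def answer_with_escalation(question, tiers, threshold=0.8):
    """Walk the tiers in order and stop at the first answer with enough confidence.

    tiers: list of (name, fn) where fn(question) -> (answer, confidence in [0, 1]).
    If no tier clears the threshold, the last tier's answer is returned.
    """
    trail = []
    answer = None
    for name, fn in tiers:
        answer, confidence = fn(question)
        trail.append((name, confidence))
        if confidence >= threshold:
            break
    return answer, trail


# Stand-in tiers; real ones would call Haiku, Sonnet, and Opus respectively.
tiers = [
    ("tier1_haiku", lambda q: ("not sure", 0.40)),
    ("tier2_sonnet", lambda q: ("Margin fell because input costs rose.", 0.85)),
    ("tier3_opus", lambda q: ("Full research memo.", 0.95)),
]

answer, trail = answer_with_escalation("Why did Q3 margin fall?", tiers)
print(answer)
print(trail)
```

Logging the trail per query shows how often each tier is reached, which feeds directly into the per-tier cost estimate.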

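VI. Common Pitfalls

Assuming the Anthropic API is OpenAI-compatible: parameter names differ (max_tokens vs max_completion_tokens).
Using OpenAI-style "user/assistant" turns with no system prompt: you give up Claude's strong adherence to the system prompt.
Stuffing data into the prompt without layering: putting instructions, examples, and the KB all in system → hard to maintain and cache-unfriendly.
Not monitoring stop_reason: output truncated at max_tokens gets treated as a normal response.
Ignoring cost differences between models: a week of running everything on Opus blows up the bill, and nobody noticed that Sonnet could do the same job.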

VII. Quick Reference

One-page cheatsheet

Cost (per 1M tokens, 2026):
  Opus 4.7   $15 in  / $75 out
  Sonnet 4.6 $3  in  / $15 out
  Haiku 4.5  $0.8 in / $4  out

Cache:
  write 1.25x, read 0.1x; 5min default, 1h (2x write)
  min: 1024 tokens (Opus/Sonnet), 2048 (Haiku)

Thinking budgets:
  1024 (light), 8K (medium), 16K-32K (hard), 64K+ (research)

Sampling defaults:
  Structured/code:  T=0
  Factual:          T=0
  Summarization:    T=0.3-0.5
  Creative:         T=0.7-1.0

Token density:
  English: ~4 char/token
  Chinese: ~1.5 char/token (Claude), ~1 (GPT-4o)
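The cache multipliers in the cheatsheet imply a break-even point that is worth computing once. This sketch expresses cost in units of "one uncached send of the prefix" and assumes every call after the first lands within the TTL.

```python
def cached_cost(n_calls, write_mult=1.25, read_mult=0.10):
    """Relative input cost of a shared prefix over n_calls with caching."""
    # The first call writes the cache; every later call reads it.
    return write_mult + read_mult * (n_calls - 1)


def uncached_cost(n_calls):
    """Relative input cost of resending the same prefix n_calls times."""
    return float(n_calls)


for n in (1, 2, 10):
    c5, c1h = cached_cost(n), cached_cost(n, write_mult=2.0)
    print(f"{n:>2} calls: uncached {uncached_cost(n):.2f}, 5min cache {c5:.2f}, 1h cache {c1h:.2f}")
```

With the 5-minute TTL the cache pays off from the second call; with the 1-hour TTL (2x write) it takes a third call, so the longer TTL only makes sense when hits are spread out over time.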

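VIII. Interview Questions (Comprehensive)

Q1: A financial research bot has a budget of $10,000 per day. How do you design the tiers?

With a 90/8/2 traffic mix, the blended cost is 0.90 × $0.002 + 0.08 × $0.10 + 0.02 × $0.50 ≈ $0.0198 per query, so $10K/day supports roughly 505K queries/day:
Tier 1 (90% of queries, factual): Haiku + heavy caching, $0.002/query → ~455K queries/day, ~$0.9K
Tier 2 (8%, analysis): Sonnet + thinking 8K + cache, $0.10/query → ~40K queries/day, ~$4.0K
Tier 3 (2%, deep): Opus + thinking 32K + tools, $0.50/query → ~10K queries/day, ~$5.1K
Total: ~505K queries/day at ~$10K. Tier 3 is 2% of traffic but about half of the spend, so the escalation threshold is the main cost lever.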
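The capacity planning behind this answer is a blended-cost calculation. The sketch below uses the per-query costs and the 90/8/2 traffic mix stated in the answer; the tier names are labels for illustration only.

```python
TIERS = {
    "tier1_haiku":  {"share": 0.90, "cost_per_query": 0.002},
    "tier2_sonnet": {"share": 0.08, "cost_per_query": 0.10},
    "tier3_opus":   {"share": 0.02, "cost_per_query": 0.50},
}


def plan(daily_budget, tiers):
    """Return blended cost per query, total daily queries, and a per-tier breakdown."""
    blended = sum(t["share"] * t["cost_per_query"] for t in tiers.values())
    total_queries = daily_budget / blended
    breakdown = {
        name: {
            "queries": total_queries * t["share"],
            "spend": total_queries * t["share"] * t["cost_per_query"],
        }
        for name, t in tiers.items()
    }
    return blended, total_queries, breakdown


blended, total_queries, breakdown = plan(10_000, TIERS)
print(f"Blended cost: ${blended:.4f}/query, capacity: {total_queries:,.0f} queries/day")
for name, row in breakdown.items():
    print(f"{name}: {row['queries']:,.0f} queries, ${row['spend']:,.0f}")
```

Changing a single share (for example, escalating 4% instead of 2% to Tier 3) shows immediately how sensitive total capacity is to the escalation policy.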

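Q2: What is the essential difference between a prompt cache and an HTTP cache?

An HTTP cache works at the response level (exact request match); a prompt cache works at the KV-state level (prefix match). The latter covers "same context + different question", which is the most common pattern in agents and RAG.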
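The exact-match versus prefix-match distinction can be shown with a toy model. Characters stand in for tokens here, which is a simplification: a real prompt cache matches tokenized content up to marked breakpoints, not arbitrary character prefixes.

```python
def http_cache_hit(cached_request, new_request):
    """Response-level cache: reusable only when the whole request is identical."""
    return cached_request == new_request


def shared_prefix_len(cached_tokens, new_tokens):
    """Prefix-level cache: count how many leading tokens can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


context = "KB: Q3 revenue was 12.4M USD. Q2 revenue was 11.1M USD.\n"
first = context + "Question: What was Q3 revenue?"
second = context + "Question: What was Q2 revenue?"

print(http_cache_hit(first, second))                 # whole-request comparison
print(shared_prefix_len(list(first), list(second)))  # reusable leading characters
print(len(context))
```

The same context followed by a different question is a miss for the response-level cache but reuses at least the full context under prefix matching; a change at the very start of the prompt would reduce the reusable prefix to zero.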

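Q3: Which combination of CoT, ToT, and Self-Consistency is strongest?

Strongest but expensive: Tree-of-Thoughts + Self-Consistency + Extended Thinking. Production systems rarely stack all three because the cost is prohibitive. Pareto-optimal: Sonnet + thinking 16K + 3-sample self-consistency.

Q4: How do you iterate a prompt from zero to production-ready?

(1) Zero-shot baseline → measure accuracy.
(2) Add 3 few-shot examples → measure.
(3) Add CoT → measure.
(4) Tune T and test stability at different settings.
(5) Add structured output (Tools API).
(6) Validate with Pydantic.
(7) Cache the stable parts.
(8) Monitor cache hits, stop_reason, and cost.
(9) Set up A/B tests against new model versions.
(10) Document everything so the prompt can be re-tuned later.

IX. Tomorrow's Preview

Day 128: Automated prompt optimization: OPRO, APE, DSPy.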