Expert Day 121

Transformer Deep Dive: From Self-Attention to KV-Cache

Self-attention math derivation, multi-head attention, KV-cache principles, position encoding (RoPE/ALiBi/sinusoidal)

2026-08-30

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-08-30  Track: AI Systems Engineering  Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)  Tags: #LLM #Transformer #Attention #KVCache #RoPE

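Today's Goals

Type	Content
Learn	Self-attention math derivation, multi-head attention, KV-cache principles, position encoding (RoPE/ALiBi/sinusoidal)
Practice	Implement a mini-transformer forward pass from scratch in numpy and verify that the attention computation matches PyTorch
Deliverable	`transformer.py` (~250 lines of runnable code) + notes + engineering insights into Claude 4.7's long-context implementation

Why start here: to work as an "AI×Web3 PM / AI architect", you must be able to tell "building products with LLMs" apart from "understanding LLMs well enough to make architecture decisions". Day 121-134 is the two weeks that turn the LLM from a black box into a white box; today lays the foundation.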

1. Theoretical Foundations

1.1 Self-Attention Math

Given a sequence $X \in \mathbb{R}^{n \times d}$ ($n$ tokens, each $d$-dimensional), attention is computed as:

$$ Q = XW_Q, \quad K = XW_K, \quad V = XW_V $$

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

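Why divide by $\sqrt{d_k}$: if the components of $q$ and $k$ are independent with zero mean and unit variance, the dot product $q \cdot k$ has variance $d_k$. For large $d_k$ the logits grow large, softmax saturates toward one-hot, and gradients vanish. Dividing by $\sqrt{d_k}$ normalizes the variance back to 1.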
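The variance argument can be checked numerically. The sketch below assumes i.i.d. standard-normal components for $q$ and $k$ (an idealization, not a property of trained weights); the helper name `dot_product_variance` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d_k, n_samples=10000):
    """Empirical variance of q.k, before and after scaling, for N(0, 1) entries."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (16, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={raw:8.2f}  var(q.k/sqrt(d_k))={scaled:.3f}")
```

The raw variance should track $d_k$ while the scaled variance stays near 1, which is what keeps the softmax out of its saturated regime at initialization.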

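Complexity:

Time: $O(n^2 d)$, where $n$ is the sequence length
Space: $O(n^2)$ for the attention matrix
This is why long context is expensive: at 1M tokens the score matrix has $n^2 = 10^{12}$ entries per head per layer, and each entry costs $O(d_k)$ multiply-adds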
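To make the quadratic growth concrete, here is a back-of-the-envelope calculator. It is a rough sketch: `d_head=128` and fp16 scores are assumptions, the FLOP count covers only $QK^T$ and the attention-times-$V$ product for one head in one layer, and it ignores causal masking and projections.

```python
def attention_cost(n_tokens, d_head=128, bytes_per_score=2):
    """Rough per-head, per-layer cost of standard attention (assumed d_head, fp16 scores)."""
    score_entries = n_tokens ** 2                   # entries in QK^T
    flops = 2 * 2 * score_entries * d_head          # QK^T and attn @ V, 2 FLOPs per multiply-add
    score_bytes = score_entries * bytes_per_score   # only if the matrix is materialised
    return score_entries, flops, score_bytes

for n in (8_192, 131_072, 1_000_000):
    entries, flops, nbytes = attention_cost(n)
    print(f"n={n:>9,}  scores={entries:.2e}  FLOPs={flops:.2e}  matrix={nbytes / 2**30:,.1f} GiB")
```

Kernels in the FlashAttention family avoid materialising the full score matrix, so the memory column is an upper bound on what a naive implementation would allocate; the FLOP count is unchanged.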

1.2 Multi-Head Attention

Split the $d$ dimensions into $h$ heads; each head computes attention independently:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W_O $$

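Intuition: different heads learn different "relations". Mechanistic interpretability research (including Anthropic's) has found heads in transformer language models that specialize in "induction" (recognizing repeated patterns), "copy" (copying a span), and "name resolution".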

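1.3 KV-Cache: The Engineering Secret of Decoders

In autoregressive generation (one token per step), the naive approach recomputes the entire $K, V$ matrices and the full attention at every step: $O(n^2)$ per step and $O(n^3)$ for the whole generation.

KV-cache: cache the already-computed $K_{1:t-1}, V_{1:t-1}$; at step $t$, compute only $K_t, V_t$ and append them. The per-step attention cost drops to $O(n)$.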
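The asymptotic gap can be seen by simply counting attention-score entries. This is a counting sketch only (the two function names are invented for the illustration, and the per-entry cost factor $d$ is left out):

```python
def score_entries_naive(n):
    """Without a cache, step t recomputes the full t x t score matrix."""
    return sum(t * t for t in range(1, n + 1))

def score_entries_cached(n):
    """With a KV-cache, step t computes one new query row against t keys."""
    return sum(t for t in range(1, n + 1))

for n in (128, 1024, 4096):
    naive, cached = score_entries_naive(n), score_entries_cached(n)
    print(f"n={n:5d}  naive={naive:.2e}  cached={cached:.2e}  ratio={naive / cached:.0f}x")
```

The ratio works out to $(2n+1)/3$, so the saving grows linearly with the generated length.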

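Memory cost: $$ \text{KV cache size} = 2 \times n_{layers} \times n_{kv\_heads} \times d_{head} \times \text{seq\_len} \times 2\text{ bytes (fp16)} $$

Example: a Llama 3 70B-sized model (80 layers, $d_{head}=128$) at 128k context. With full multi-head attention (64 KV heads): $2 \times 80 \times 64 \times 128 \times 131072 \times 2$ bytes $\approx 320$ GiB for a single conversation. Llama 3 70B actually uses GQA with 8 KV heads, which brings this down to about 40 GiB, still a large amount per sequence. This is why serving engines such as vLLM (PagedAttention) and SGLang (RadixAttention) manage the KV-cache in pages and shared prefixes.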
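The formula is easy to wrap in a small calculator. The sketch below plugs in the publicly documented Llama 3 70B shape (80 layers, head dimension 128, 64 query heads, 8 KV heads); `kv_cache_bytes` is a name chosen for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """K and V for one sequence: 2 * layers * kv_heads * d_head * seq_len elements."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

GIB = 2 ** 30
mha = kv_cache_bytes(80, 64, 128, 131_072)   # hypothetical: one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, 131_072)    # GQA with 8 KV heads
print(f"64 KV heads: {mha / GIB:.0f} GiB")
print(f" 8 KV heads: {gqa / GIB:.0f} GiB")
```

Only the number of KV heads enters the formula; the number of query heads does not affect the cache size.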

Anthropic's prompt caching can be understood as persisting this KV cache across requests; section 4 expands on this.

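1.4 The Evolution of Position Encoding

Self-attention on its own is permutation-equivariant (it has no notion of token order), so position information must be injected:

Method	Paper	Approach	Pros and cons
Sinusoidal	Attention is All You Need (2017)	$PE_{pos,2i}=\sin(pos/10000^{2i/d})$, $PE_{pos,2i+1}=\cos(pos/10000^{2i/d})$, added to the embedding	Simple; extrapolates poorly
Learned absolute	BERT/GPT-2	Learn one embedding per position	Cannot extrapolate beyond the training length
RoPE	Su et al. 2021 (RoFormer)	Rotate Q and K by a position-dependent angle $\theta_{pos}$	Used by LLaMA and Qwen (Claude's architecture is not public); context can be extended with scaling methods such as YaRN
ALiBi	Press et al. 2022	Add a linear bias to the attention scores	Used by BLOOM/MPT; zero parameters; strong extrapolation
NoPE	Kazemnejad 2023	Add nothing at all	Workable for small models; large models still need position encoding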
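Of the schemes in the table, ALiBi is the quickest to sketch: each head gets a fixed slope and the score for a key at distance $i-j$ is penalized by slope × distance. The version below follows the geometric slope schedule from the ALiBi paper and assumes the number of heads is a power of two; the function names are illustrative.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(n_heads, seq_len):
    """Bias of shape (n_heads, seq_len, seq_len): -slope * (query_pos - key_pos)."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]           # i - j, non-negative for past keys
    bias = -alibi_slopes(n_heads)[:, None, None] * distance
    # Future keys (j > i) are excluded, as a causal mask would do.
    return np.where(distance >= 0, bias, -np.inf)

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0])   # head 0 uses slope 0.5
```

The bias is added to the scaled scores before the softmax, in place of any embedding-level position signal.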

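Core RoPE formula (rotate each pair of dimensions): $$ R_\theta(x_{2i}, x_{2i+1}) = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}\begin{pmatrix}x_{2i} \\ x_{2i+1}\end{pmatrix} $$

where $\theta = pos \cdot 10000^{-2i/d}$. Key property: after rotation, the inner product of a query at position $m$ and a key at position $n$ depends only on the relative position $m-n$, not on the absolute positions. This relative-position structure is what makes RoPE amenable to context-extension methods such as YaRN; without such scaling, quality still degrades beyond the training length.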
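The relative-position property can be verified directly. This minimal sketch rotates single vectors with the same pairing of dimensions $(2i, 2i+1)$ as the formula; `rope_rotate` is a helper written for this check, and the random vectors are arbitrary.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate a single vector x (even length d) to position pos, pairing dims (2i, 2i+1)."""
    d = x.shape[-1]
    theta = pos * base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset m - n = 5 at two very different absolute positions.
near = rope_rotate(q, 7) @ rope_rotate(k, 2)
far = rope_rotate(q, 1007) @ rope_rotate(k, 1002)
print(near, far)
```

The two inner products agree up to floating-point error, while changing the offset (for example positions 7 and 3) gives a different value.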

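1.5 Modern LLM Architecture Variants

Model	Architecture
Claude 4.7 (speculated)	Decoder-only, GQA, RoPE+YaRN, RMSNorm, SwiGLU
GPT-5	Decoder-only, MoE (speculated)
Gemini 2.5 Pro	Mixed (Pathways), specifically optimized for long context
Llama 3.1	Decoder-only, GQA (8 KV heads), RoPE θ=500000

GQA (Grouped Query Attention): several query heads share one KV head. With 8 query heads per KV head (as in Llama 3.1 70B: 64 query heads, 8 KV heads) the KV cache shrinks 8x, which is an engineering necessity for 128k+ context.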
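A minimal numpy sketch of grouped-query attention, under simplifying assumptions: no batch dimension, no projections, and consecutive query heads mapped to the same KV head. The shapes (8 query heads, 2 KV heads) are toy values, not those of any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """
    q: (n_q_heads, T, d_head); k, v: (n_kv_heads, T, d_head).
    Each KV head serves n_q_heads // n_kv_heads consecutive query heads.
    """
    n_q_heads, T, d_head = q.shape
    group = n_q_heads // k.shape[0]
    k = np.repeat(k, group, axis=0)                 # expanded only at compute time
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    causal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))                 # 8 query heads
k = rng.standard_normal((2, 5, 16))                 # only 2 KV heads need to be cached
v = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape, "cache shrink factor:", q.shape[0] // k.shape[0])
```

The cache stores the un-repeated `k` and `v`, so the memory saving equals the group size even though the attention output keeps the full number of query heads.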

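2. Intuition

Why does attention work?

Think of attention as a soft table lookup (a differentiable hash table):

Query: "what am I looking for right now"
Key: each token's "label"
Value: each token's "content"
Softmax(Q·K): a fuzzy-match similarity
Weighted V: pull out the matching content and blend it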
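The lookup analogy can be made literal with a three-entry "table". In this toy sketch the keys are orthogonal one-hot labels and `sharpness` is an invented knob that scales the logits, standing in for how confidently query and key match.

```python
import numpy as np

def soft_lookup(query, keys, values, sharpness=1.0):
    """Return the softmax-weighted blend of values, plus the weights."""
    scores = sharpness * keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

keys = np.eye(3)                                    # three orthogonal "labels"
values = np.array([[10.0], [20.0], [30.0]])         # the "content" stored under each label
query = np.array([0.0, 1.0, 0.0])                   # asks for the second entry

for sharpness in (1.0, 10.0):
    out, w = soft_lookup(query, keys, values, sharpness)
    print(f"sharpness={sharpness:4.1f}  weights={np.round(w, 3)}  output={out[0]:.2f}")
```

As the match gets sharper the weights approach one-hot and the soft lookup behaves like an ordinary dictionary read, while staying differentiable throughout.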

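Induction head (Anthropic's 2022 paper): the model learns "if [A][B] appeared earlier and I now see [A] again, the next token is probably [B]". The paper presents this as a key microscopic mechanism behind in-context learning, and it is a plausible part of why few-shot prompting works.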

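Why does depth matter more than width?

As a rough heuristic, each attention layer performs about one hop of reasoning (1-hop). Answering "what is the name of Xiao Ming's mother's sister" takes at least 3 hops, so the model needs ≥3 layers. Chain-of-Thought turns reasoning inside the layer stack into explicit reasoning across the token sequence, which sidesteps the depth limit.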

3. Code Implementation

3.1 A mini-transformer in numpy (multi-head attention)

# transformer.py
"""
Mini-Transformer forward pass in pure numpy.
Intended to be cross-checked against PyTorch; that comparison is not included in this file.
"""
import numpy as np

np.random.seed(42)

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    """RMSNorm (no mean subtraction), as used by Llama-style models; cheaper than LayerNorm."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

def rope(x, base=10000.0, offset=0):
    """
    Rotary Position Embedding.
    x: (batch, n_heads, seq_len, head_dim)
    offset: absolute position of the first token in x (number of tokens already cached)
    """
    *_, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0
    half = head_dim // 2
    # Per-pair rotation frequencies
    theta = base ** (-2 * np.arange(half) / head_dim)  # (half,)
    pos = offset + np.arange(seq_len)  # (seq_len,) absolute positions
    freqs = np.outer(pos, theta)  # (seq_len, half)
    cos, sin = np.cos(freqs), np.sin(freqs)  # (seq_len, half)
    # Split x into even/odd dimensions
    x1, x2 = x[..., 0::2], x[..., 1::2]  # (..., seq_len, half)
    rotated_1 = x1 * cos - x2 * sin
    rotated_2 = x1 * sin + x2 * cos
    out = np.empty_like(x)
    out[..., 0::2] = rotated_1
    out[..., 1::2] = rotated_2
    return out

class MultiHeadAttention:
    def __init__(self, d_model, n_heads, use_rope=True, causal=True):
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.use_rope = use_rope
        self.causal = causal
        # Initialize the Q, K, V, O weights
        scale = 1.0 / np.sqrt(d_model)
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale
        # KV cache (used for incremental decoding)
        self.k_cache = None
        self.v_cache = None

    def _split_heads(self, x):
        # (B, T, D) -> (B, H, T, Dh)
        B, T, D = x.shape
        return x.reshape(B, T, self.n_heads, self.head_dim).transpose(0, 2, 1, 3)

    def _merge_heads(self, x):
        # (B, H, T, Dh) -> (B, T, D)
        B, H, T, Dh = x.shape
        return x.transpose(0, 2, 1, 3).reshape(B, T, H * Dh)

    def forward(self, x, use_cache=False):
        B, T, D = x.shape
        Q = self._split_heads(x @ self.W_Q)  # (B, H, T, Dh)
        K = self._split_heads(x @ self.W_K)
        V = self._split_heads(x @ self.W_V)

        # Tokens already in the cache fix the absolute position of the new ones.
        # Without this offset every incrementally decoded token would be rotated
        # as if it sat at position 0, and the cached path would diverge from the full pass.
        past_len = self.k_cache.shape[2] if (use_cache and self.k_cache is not None) else 0

        if self.use_rope:
            Q = rope(Q, offset=past_len)
            K = rope(K, offset=past_len)

        if use_cache and self.k_cache is not None:
            # Incremental decoding: compute Q only for the new tokens, but append K/V to the history
            K = np.concatenate([self.k_cache, K], axis=2)
            V = np.concatenate([self.v_cache, V], axis=2)

        if use_cache:
            self.k_cache = K
            self.v_cache = V

        # Attention scores
        scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(self.head_dim)  # (B,H,Tq,Tk)

        # Causal mask
        if self.causal:
            Tq, Tk = scores.shape[-2], scores.shape[-1]
            mask = np.triu(np.ones((Tq, Tk)), k=Tk - Tq + 1).astype(bool)
            scores = np.where(mask, -1e9, scores)

        attn = softmax(scores, axis=-1)
        out = attn @ V  # (B, H, T, Dh)
        out = self._merge_heads(out) @ self.W_O  # (B, T, D)
        return out, attn

class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, n_heads)
        scale = 1.0 / np.sqrt(d_model)
        # SwiGLU FFN (Llama-style)
        self.W_gate = np.random.randn(d_model, d_ff) * scale
        self.W_up = np.random.randn(d_model, d_ff) * scale
        self.W_down = np.random.randn(d_ff, d_model) * scale

    def silu(self, x):
        return x / (1 + np.exp(-x))

    def forward(self, x):
        # Pre-norm (most modern LLMs use pre-norm; it trains more stably than post-norm)
        h = rms_norm(x)
        attn_out, _ = self.attn.forward(h)
        x = x + attn_out  # residual
        h = rms_norm(x)
        # SwiGLU: silu(xW_gate) * (xW_up) -> W_down
        ff = (self.silu(h @ self.W_gate) * (h @ self.W_up)) @ self.W_down
        x = x + ff
        return x

if __name__ == "__main__":
    # Smoke test
    B, T, D, H = 2, 16, 64, 8
    x = np.random.randn(B, T, D)
    block = TransformerBlock(d_model=D, n_heads=H, d_ff=4 * D)
    out = block.forward(x)
    print(f"Input:  {x.shape}")
    print(f"Output: {out.shape}")
    print(f"Mean: {out.mean():.4f}, Std: {out.std():.4f}")

    # KV cache demo
    print("\n--- KV cache test ---")
    attn = MultiHeadAttention(D, H)
    full = attn.forward(x)[0]
    # Feed tokens one at a time
    attn2 = MultiHeadAttention(D, H)
    attn2.W_Q, attn2.W_K, attn2.W_V, attn2.W_O = attn.W_Q, attn.W_K, attn.W_V, attn.W_O
    incremental_outputs = []
    for t in range(T):
        out_t = attn2.forward(x[:, t:t+1, :], use_cache=True)[0]
        incremental_outputs.append(out_t)
    incremental = np.concatenate(incremental_outputs, axis=1)
    diff = np.abs(full - incremental).max()
    print(f"Max diff (full vs incremental w/ KV cache): {diff:.2e}")
    # Note: with RoPE and a causal mask, a correct implementation should give a difference close to 0

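Expected output (the Mean/Std figures are illustrative and depend on the random weights; a correct implementation reports a max diff at floating-point round-off level):

Input:  (2, 16, 64)
Output: (2, 16, 64)
Mean: 0.0123, Std: 1.0421

--- KV cache test ---
Max diff (full vs incremental w/ KV cache): 1.42e-15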

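4. Anthropic API Best Practices

4.1 Prompt Caching = the External Face of a Shared KV-Cache

The essence of Claude 4.7's prompt caching, as far as can be inferred from its behavior: the prefix's KV-cache is persisted in Anthropic's infrastructure (5-minute TTL, refreshed on every hit), and the next request with the same prefix reuses it directly.

API usage: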

# pip install anthropic
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert financial analyst.",
        },
        {
            "type": "text",
            "text": LARGE_10K_REPORT,  # ~50K tokens
            "cache_control": {"type": "ephemeral"}  # <-- marks the prefix up to here as cacheable
        }
    ],
    messages=[
        {"role": "user", "content": "What was Q3 net income?"}
    ]
)

# Check the response for a cache hit
print(response.usage)
# Usage(cache_creation_input_tokens=50000, cache_read_input_tokens=0, ...)
# Second request with the same prefix:
# Usage(cache_creation_input_tokens=0, cache_read_input_tokens=50000, ...)

Economics:

Cache write: input price × 1.25
Cache read: input price × 0.1 (90% cheaper)
TTL: 5 minutes (default) or 1 hour ("ttl": "1h", written at input price × 2.0)
Best-fit scenarios: long system prompts, RAG over stable documents, agent loops with shared context

4.2 How Extended Thinking Interacts with the KV-Cache

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,  # must be larger than budget_tokens
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Analyze..."}]
)

Note: thinking content is not cached (the model reasons afresh on every request), but the system + messages prefix can be cached.

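5. Applications in Finance

Scenario: a high-frequency earnings-report Q&A system

A 10-K filing typically runs 50-200K tokens. Without prompt caching, every user question re-sends the full text. Cost, assuming an Opus input price of $15/Mtok:

100K tokens × $15/Mtok (Opus input) = $1.50 per question

With prompt caching:

First request: 100K × $18.75/Mtok (cache write) = $1.88
Within the next 5 minutes: 100K × $1.50/Mtok (cache read) = $0.15 per question
That is 90% off each cached question, a monthly saving of $X; at production scale the gap reaches orders of magnitude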
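The arithmetic above generalizes to any number of questions. The sketch below is a simplified model: it counts only the document's input tokens, assumes every follow-up lands inside the cache TTL, and takes the $15/Mtok price and the 1.25× / 0.1× multipliers from this section as inputs rather than as authoritative pricing. `qa_cost` is a name invented for the example.

```python
def qa_cost(n_questions, doc_tokens, input_price_per_mtok,
            write_multiplier=1.25, read_multiplier=0.10, cached=True):
    """Input-side dollar cost of n questions over one document, all within the cache TTL."""
    if n_questions == 0:
        return 0.0
    base = doc_tokens / 1_000_000 * input_price_per_mtok   # one uncached pass over the doc
    if not cached:
        return n_questions * base
    # One cache write, then (n - 1) cache reads.
    return base * write_multiplier + (n_questions - 1) * base * read_multiplier

for n in (1, 2, 10, 100):
    plain = qa_cost(n, 100_000, 15.0, cached=False)
    with_cache = qa_cost(n, 100_000, 15.0)
    print(f"{n:3d} questions: uncached ${plain:7.2f}  cached ${with_cache:6.2f}")
```

Under these assumptions caching costs more for a single question (the write premium) and already wins from the second question onward.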

Architecture diagram:

User Q ─┐
        ├─> [System prompt cache (5min TTL)]
10-K ───┘                ↓
                  Claude 4.7 (Opus)
                         ↓
                  Structured answer

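PM decision points:

Is a 5-minute TTL enough? For hot filings use the 1h TTL (writes cost 2x the base input price, reads are still 10x cheaper)
How large must a document be to be worth caching? The threshold is ~1024 tokens (Anthropic's minimum)
How to invalidate? There is no explicit cache key: the prefix content is the key, so change the prefix when the document is updated (e.g. put a version number in the system prompt)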
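One way to apply the versioning idea is to build the `system` blocks through a small helper so the version tag is always part of the cached prefix. This is only a sketch: `build_system_blocks`, the `[doc_version=...]` tag format, and the placeholder document text are all hypothetical, and no API call is made.

```python
def build_system_blocks(instructions, document, doc_version, ttl=None):
    """Build the `system` list for a request; the last block carries the cache breakpoint."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl                  # e.g. "1h" for hot filings
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            # The version tag is part of the prefix, so bumping it forces a cache miss.
            "text": f"[doc_version={doc_version}]\n{document}",
            "cache_control": cache_control,
        },
    ]

v1 = build_system_blocks("You are an expert financial analyst.", "10-K text goes here", "2025-Q3")
v2 = build_system_blocks("You are an expert financial analyst.", "10-K text goes here", "2025-Q3-amended")
print(v1[1]["text"] != v2[1]["text"])               # different prefix, so a fresh cache entry
```

Keeping the instructions block byte-identical across requests matters just as much as the version tag, since any change earlier in the prefix also invalidates everything after it.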

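6. Common Pitfalls

KV-cache memory blow-up: run Llama 70B at 128k context yourself without GQA and you OOM immediately. In production, use a paged KV-cache such as vLLM + PagedAttention.
RoPE extrapolation failure: a model trained with max_pos=8k tends to produce garbage when fed 32k directly at inference. YaRN/NTK scaling is needed. Claude 4.7 presumably uses a similar technique to cover its 200K native context (not publicly confirmed).
Misplacing Anthropic cache_control: cache_control must sit on the last block of the prefix you want cached, and the prefix must be exactly identical (including whitespace) to hit. A single differing space means a cache miss.
Assuming multiple heads means "multiple perspectives": in practice many heads are redundant (Voita 2019), and 20%-40% of heads can be pruned. The highly interpretable ones, such as induction heads, are few but critical.
Getting the causal mask wrong: swapping the upper and lower triangles in a hand-written attention still lets the model "train", but accuracy is terrible. A classic debugging nightmare.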
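The causal-mask pitfall is cheap to guard against with a leak test: perturb the last token and confirm that no earlier output changes. The sketch below uses a stripped-down single-head attention with identity projections; `flip_mask` deliberately reproduces the swapped-triangle bug, and both function names are made up for the illustration.

```python
import numpy as np

def causal_attention(x, flip_mask=False):
    """Single-head attention with identity projections; flip_mask simulates the bug."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    past = np.tril(np.ones((T, T), dtype=bool), k=-1)    # True strictly below the diagonal
    mask = past if flip_mask else future                  # correct code hides the future
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def leaks_future(attn_fn, T=6, d=8, seed=0):
    """Perturb the last token and report whether any earlier output changes."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((T, d))
    x_perturbed = x.copy()
    x_perturbed[-1] += 1.0
    before, after = attn_fn(x), attn_fn(x_perturbed)
    return not np.allclose(before[:-1], after[:-1])

print("correct mask leaks:", leaks_future(causal_attention))
print("flipped mask leaks:", leaks_future(lambda x: causal_attention(x, flip_mask=True)))
```

A check like this runs in milliseconds and catches the bug before any training run, which is far cheaper than diagnosing it from a loss curve.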

7. Quick Reference

Attention complexity

Method	Time	Space
Standard	O(n²d)	O(n²)
Flash Attention	O(n²d)	O(n)
Linear Attention	O(nd²)	O(nd)
State Space (Mamba)	O(nd)	O(d)

Anthropic prompt caching parameters

cache_control: {"type": "ephemeral"}                 # 5min default
cache_control: {"type": "ephemeral", "ttl": "1h"}    # 1h, write cost 2x
Minimum 1024 tokens (Sonnet/Opus); 2048 (Haiku); the exact minimum varies by model version
At most 4 cache breakpoints per request

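8. Interview Questions

Q1: Why does the KV-cache store only K and V, not Q?

Q is the query of the current token; it differs for every newly generated token and is never needed again, so there is nothing to reuse. K/V belong to past tokens and are read again at every future step, so caching K/V is sufficient.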

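Q2: RoPE or ALiBi, which is better, and why do large models almost all choose RoPE?

RoPE: encodes relative position directly in the Q·K inner product with no learned parameters (the rotation angles are fixed by the frequency base); context extension is controllable by rescaling those frequencies (YaRN/NTK); and because it only transforms Q and K, it works unchanged with GQA and fused attention kernels. ALiBi: zero parameters and naturally good extrapolation, but its fixed distance penalty is less expressive. Llama and Qwen chose RoPE because it has scaled better in practice.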

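Q3: What is the essential difference between Claude 4.7's prompt caching and caching responses in my own Redis?

My Redis cache: hits only on a fully identical input, at the response level. Anthropic's cache: hits whenever the prefix is identical (prefix matching), at the KV-state level. The former helps only with exact matches; the latter saves heavily in the "same system prompt + different question" case, which is the common pattern for agents and RAG.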

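Q4: Why does Anthropic charge for cache writes? Isn't it storing the cache for me for free?

Because the KV cache occupies accelerator memory (MB to GB scale per prefix), and keeping it for 5 minutes means reserving that memory, which is a real cost. OpenAI's automatic caching charges no write fee, but the hit rate is less controllable. Anthropic's design is "pay explicitly in exchange for a predictable hit rate".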

9. Tomorrow's Preview

Day 122: Scaling laws, from Kaplan 2020 to Chinchilla 2022 to today's compute-optimal paradigm, to understand the evolution path from GPT-4 to Claude 4.7.