Expert Day 122

Scaling Laws: From Kaplan to Chinchilla to Claude 4.7

Kaplan 2020 vs Chinchilla 2022, compute-optimal training, the emergent-abilities debate, test-time compute (inference scaling)

2026-08-31

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-08-31 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #ScalingLaws #Chinchilla #ComputeOptimal #Emergent #LLM

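Today's Goals

Type	Content
Study	Kaplan 2020 vs Chinchilla 2022, compute-optimal training, the emergent-abilities debate, test-time compute (inference scaling)
Practice	Use the Chinchilla formula to estimate how much compute training GPT-4 / Claude 4.7 requires, and plot loss vs N/D curves
Output	Notes + a GPT-4 to Claude 4.7 evolution timeline + a personal view on the "AGI scaling path"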

1. Theoretical Foundations

1.1 Kaplan Scaling Laws (2020)

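Kaplan et al. at OpenAI found that test loss $L$ follows a power law in model parameters $N$, dataset size $D$, and compute $C$:

$$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076 $$
$$ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095 $$

Kaplan's recommendation: for a given compute budget $C$, spend most of it on model parameters $N$ and keep the data $D$ merely "sufficient". This led to the "big model, small data" path of GPT-3 (175B params, 300B tokens).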
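To get a feel for how shallow these power laws are, the sketch below computes the loss ratio implied by scaling $N$ or $D$ by a factor $k$. Only the two exponents quoted above are used; the constants $N_c$ and $D_c$ cancel out of the ratio. The helper name `loss_ratio` is made up for this illustration.

```python
# Exponents quoted in the notes above (Kaplan et al. 2020).
ALPHA_N = 0.076  # parameter-count exponent
ALPHA_D = 0.095  # dataset-size exponent


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Return L(k * x) / L(x) for a pure power law L(x) = (x_c / x) ** alpha.

    The constant x_c cancels, so only the scale factor k and the exponent matter.
    """
    return scale_factor ** (-alpha)


for k in (2, 10, 100):
    print(
        f"{k:>4}x params -> loss x {loss_ratio(k, ALPHA_N):.3f} | "
        f"{k:>4}x data -> loss x {loss_ratio(k, ALPHA_D):.3f}"
    )
```

Under these exponents, doubling the parameter count lowers the loss by only about 5%, which is why meaningful gains need order-of-magnitude increases.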

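1.2 Chinchilla (2022): The Reversal

DeepMind's Chinchilla paper (Hoffmann et al. 2022) reran the experiments and concluded that Kaplan had underestimated data scaling. Under compute-optimal training:

$$ N_{opt} : D_{opt} \approx 1 : 20 $$

In other words, each parameter should see roughly 20 training tokens.

Chinchilla 70B, trained on 1.4T tokens, beat Gopher 280B (trained on 300B tokens): at the same compute, a smaller model with more data wins.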

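1.3 Compute-Optimal Formula

$$ C \approx 6 \cdot N \cdot D $$

(The forward pass costs about 2N FLOPs per token and the backward pass about 4N FLOPs per token, so 6N·D is a rough estimate of total training FLOPs.)

Given a budget $C$, the optimal $N^*$ and $D^*$ satisfy: $$ N^* \propto C^{0.5}, \quad D^* \propto C^{0.5}, \quad N^* \cdot D^* \propto C $$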
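As a quick check of the "same compute" claim about Gopher and Chinchilla, the sketch below plugs the parameter and token counts quoted in these notes into $C \approx 6ND$. The function name `train_flops` is illustrative, and the 6ND rule is itself only a rough estimate.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough total training FLOPs using the C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) as quoted in the notes
runs = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in runs.items():
    print(f"{name:<11} C ~= {train_flops(n, d):.2e} FLOPs, tokens/param = {d / n:.1f}")

# Ratio of Chinchilla's compute to Gopher's (close to 1 means a similar budget)
ratio = train_flops(*runs["Chinchilla"]) / train_flops(*runs["Gopher"])
print(f"Chinchilla / Gopher compute ratio: {ratio:.2f}")
```

Both runs land within about 20% of each other in FLOPs, while their tokens-per-parameter ratios differ by roughly a factor of 20.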

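1.4 The Post-Chinchilla Era: Over-Training

In production, the amount of training data is chosen not only for compute-optimality but also for inference efficiency. Llama 3 8B was trained on 15T tokens, far beyond its Chinchilla-optimal budget of 160B tokens (20 × 8B).

Why over-train?

Cheap inference: an 8B model is roughly 10x faster at inference than a 70B model
Deployment-friendly: it can run on a single GPU or at the edge
At a fixed model size, extra training only raises the one-time training cost, whereas inference cost is paid for the model's whole lifetime

→ The mainstream practice today is to train on roughly 5-50x the Chinchilla-optimal amount of data (Llama 3 8B, at about 94x, is an extreme case); such models are called "over-trained".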
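The degree of over-training can be expressed as a single number: the actual token count divided by the Chinchilla-optimal count of 20 tokens per parameter. A minimal sketch, using the Llama 3 figures quoted in these notes (the helper name `overtraining_factor` is made up for illustration):

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb used throughout these notes


def overtraining_factor(n_params: float, n_tokens: float) -> float:
    """How many times more data was used than the Chinchilla-optimal 20 * N tokens."""
    return n_tokens / (TOKENS_PER_PARAM * n_params)


for name, n, d in [
    ("Llama 3 8B", 8e9, 15e12),
    ("Llama 3 70B", 70e9, 15e12),
]:
    optimal = TOKENS_PER_PARAM * n
    print(
        f"{name}: optimal D ~= {optimal / 1e9:.0f}B tokens, "
        f"actual {d / 1e12:.0f}T, factor {overtraining_factor(n, d):.1f}x"
    )
```

With these inputs the 8B model comes out at about 94x and the 70B model at about 11x the Chinchilla-optimal data.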

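1.5 Emergent Abilities: Real or Not?

Wei et al. (2022) argued that large models such as GPT-3 and LaMDA suddenly acquire new abilities at a certain scale (multi-step arithmetic, word unscrambling, and so on).

The counter-argument (Schaeffer et al. 2023): emergence is a measurement artifact. Abilities appear to "emerge" only under exact-match scoring; under a continuous metric (e.g. log-probability) the same models show smooth scaling.

Current consensus:

Some abilities (few-shot CoT) do show a genuine phase transition
Some are an artifact of the evaluation method
What matters for downstream tasks: a capability threshold does exist, and usable performance is not smooth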
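The metric-artifact argument can be illustrated with a toy model. This is not Schaeffer et al.'s actual experiment. It assumes that per-token accuracy improves smoothly with scale and that token errors are independent, so exact match on an answer of length `answer_len` is simply the per-token accuracy raised to that power.

```python
def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """P(all tokens correct) under the toy assumption of independent token errors."""
    return per_token_acc ** answer_len


ANSWER_LEN = 10  # hypothetical answer length in tokens

# A smooth, unremarkable improvement in per-token accuracy...
for p in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    em = exact_match_rate(p, ANSWER_LEN)
    # ...looks like a sudden jump when scored by exact match.
    print(f"per-token acc {p:.2f} -> exact match on {ANSWER_LEN} tokens: {em:.3f}")
```

Exact match stays near zero for most of the range and then rises steeply, even though the underlying per-token quantity changes gradually. That is the shape of the counter-argument; both views agree that the downstream (exact-match) threshold is what users experience.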

1.6 Test-Time Compute Scaling (Inference Scaling)

OpenAI o1 and Claude 4.7 extended thinking opened a new dimension: without growing the model, spending more "thinking" tokens at inference time also improves performance.

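The DeepSeek-R1 paper illustrates the trend: on AIME 2024, DeepSeek-R1-Zero's pass@1 rose from 15.6% to 71.0% over the course of RL training, while its average response length kept growing.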

A new scaling law (speculative): $$ L = f(N_{train}, D_{train}, T_{inference}) $$

Three axes to optimize: training parameters, training data, and inference tokens.

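2. Intuition

Why does scaling work?

Two schools of explanation:

Data view: internet text contains a "compressed" form of human intelligence. The larger the model and the more data it sees, the more reasoning, planning, and knowledge it can decompress. This is the "compressed humanity" view attributed to Anthropic CEO Dario Amodei.
Algorithm view: as scale grows, a Transformer learns more complex circuits. Mechanistic interpretability is uncovering them (induction heads, attention-head specialization, MLP neurons acting as concept storage).

Why is Chinchilla more accurate than Kaplan?

In Kaplan's early experiments the learning-rate schedule was not fully tuned, so the loss reachable with more data was underestimated; smaller models can in fact keep lowering their loss when given more data. It is a plain case of experimental-design bias, and a reminder that engineering details shape the theory.

Intuition for test-time compute

Humans also "think a bit longer" on hard problems. Letting the model spend more tokens on thinking (CoT, self-consistency, search) amounts to running the circuits it learned during training more times. This is why a strong but small model with long thinking can approximate a large model answering directly.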
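Self-consistency is the easiest of these to reason about numerically. The sketch below is a toy model, not a measurement: it assumes each sampled answer is independently correct with probability `p`, and computes the chance that more than half of `k` samples are correct, which is a lower bound on majority voting picking the right answer. The function name `majority_correct` is made up for this illustration.

```python
from math import comb


def majority_correct(p: float, k: int) -> float:
    """P(more than half of k independent samples are correct), each correct w.p. p."""
    need = k // 2 + 1  # smallest number of correct samples that forms a strict majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


P_SINGLE = 0.6  # hypothetical single-sample accuracy

for k in (1, 5, 15, 41):
    print(f"k={k:>2} samples -> majority correct with prob {majority_correct(P_SINGLE, k):.3f}")
```

Spending k times more inference tokens pushes accuracy well above the single-sample rate as long as `p` is above 0.5; below 0.5 the same vote makes things worse, which is one reason a reasonably strong base model is still needed.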

3. Code Implementation

3.1 Chinchilla-Optimal Calculator

# scaling_laws.py
"""
Chinchilla compute-optimal estimator + training cost estimate
"""
import numpy as np
import matplotlib.pyplot as plt

# Approximate peak throughput per GPU in FLOPs/s (fp16 unless noted).
PEAK_FLOPS_PER_SEC = {
    "A100": 312e12,
    "H100": 989e12,
    "H200": 989e12,   # same compute throughput as H100, larger HBM
    "B200": 4500e12,  # Blackwell low-precision figure from the original notes; not comparable to the fp16 rows
}
DEFAULT_MFU = 0.4  # assumed MFU (Model FLOPs Utilization)

def chinchilla_optimal(compute_flops):
    """
    Given a compute budget (FLOPs), return the optimal parameter count N and token count D.
    Uses the Hoffmann et al. 2022 rule of thumb N : D = 1 : 20 together with C = 6ND:
        C = 6 * N * (20 * N) = 120 * N^2
    => N = sqrt(C / 120),  D = 20 * N
    """
    N = np.sqrt(compute_flops / 120)
    D = 20 * N
    return N, D

def gpu_hours_to_flops(gpu_hours, gpu_type="H100", mfu=DEFAULT_MFU):
    """Estimate the useful FLOPs delivered by a number of GPU hours."""
    return gpu_hours * 3600 * PEAK_FLOPS_PER_SEC[gpu_type] * mfu

def estimate_training_cost(N_params, D_tokens, gpu_type="H100", gpu_hourly_cost=2.5,
                           mfu=DEFAULT_MFU):
    """Estimate training cost from parameter count and token count."""
    flops = 6 * N_params * D_tokens  # total training FLOPs
    seconds = flops / (PEAK_FLOPS_PER_SEC[gpu_type] * mfu)
    gpu_hours = seconds / 3600
    cost = gpu_hours * gpu_hourly_cost
    return {
        "total_flops": flops,
        "gpu_hours": gpu_hours,
        "estimated_cost_usd": cost,
        "days_on_1024_gpus": gpu_hours / 1024 / 24,
    }

# === Historical model comparison table ===
# NOTE: N is the total parameter count. For MoE models (the GPT-4 row) this
# overstates training FLOPs; use active params instead (see Common Pitfalls).
models = [
    # (name, N (params), D (tokens), note)
    ("GPT-3",          175e9,  300e9,  "Kaplan-style undertrained"),
    ("Chinchilla",      70e9, 1400e9,  "Compute-optimal"),
    ("LLaMA-1 65B",     65e9, 1400e9,  "Following Chinchilla"),
    ("LLaMA-3 70B",     70e9,15000e9,  "Heavy over-training"),
    ("LLaMA-3 8B",       8e9,15000e9,  "Extreme over-training"),
    ("GPT-4 (est)",   1.8e12, 13e12,   "MoE 1.8T total / ~280B active"),
    ("Claude 3 Opus", 0.5e12,  5e12,   "Estimated"),  # all of these are estimates
    ("Claude 4.7 (guess)", 1e12, 30e12, "Estimated, post-training heavy"),
]

print(f"{'Model':<20} {'N (B)':>8} {'D (T)':>8} {'D:N':>6} {'Cost (M$)':>10}")
print("-" * 70)
for name, N, D, _ in models:
    cost_info = estimate_training_cost(N, D, "H100", 2.5)
    cost_m = cost_info["estimated_cost_usd"] / 1e6
    print(f"{name:<20} {N/1e9:>8.1f} {D/1e12:>8.2f} {D/N:>6.0f} {cost_m:>10.1f}")

# === Compute -> optimal N, D curves ===
compute_range = np.logspace(20, 26, 100)  # 1e20 to 1e26 FLOPs
Ns, Ds = [], []
for C in compute_range:
    N, D = chinchilla_optimal(C)
    Ns.append(N)
    Ds.append(D)

plt.figure(figsize=(10, 6))
plt.loglog(compute_range, np.array(Ns) / 1e9, label="Optimal N (B params)")
plt.loglog(compute_range, np.array(Ds) / 1e9, label="Optimal D (B tokens)")
plt.xlabel("Compute Budget (FLOPs)")
plt.ylabel("Billions")
plt.title("Chinchilla Compute-Optimal Scaling")
plt.legend()
plt.grid(True, which="both", alpha=0.3)
# plt.savefig("chinchilla_curves.png")

# === GPT-4 assumption: 1e25 FLOPs ===
C_gpt4 = 1e25
N_opt, D_opt = chinchilla_optimal(C_gpt4)
print(f"\nGPT-4 ({C_gpt4:.0e} FLOPs) Chinchilla-optimal:")
print(f"  N = {N_opt/1e9:.1f}B, D = {D_opt/1e12:.2f}T")
print(f"  Actual GPT-4 (~1.8T MoE / 13T tokens) deviates from this optimum; MoE changes the rules")

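Example output:

Model                   N (B)    D (T)    D:N  Cost (M$)
----------------------------------------------------------------------
GPT-3                   175.0     0.30      2        0.6
Chinchilla               70.0     1.40     20        1.0
LLaMA-1 65B              65.0     1.40     22        1.0
LLaMA-3 70B              70.0    15.00    214       11.1
LLaMA-3 8B                8.0    15.00   1875        1.3
GPT-4 (est)            1800.0    13.00      7      246.5
Claude 3 Opus           500.0     5.00     10       26.3
Claude 4.7 (guess)     1000.0    30.00     30      316.0

GPT-4 (1e+25 FLOPs) Chinchilla-optimal:
  N = 288.7B, D = 5.77T
  Actual GPT-4 (~1.8T MoE / 13T tokens) deviates from this optimum; MoE changes the rules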

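3.2 Testing Emergent Abilities

# emergence_test.py
"""
Compare Claude models of different sizes on multi-digit multiplication
(is the ability really emergent?)
Scoring is exact match, so any sharp jump may partly reflect the metric.
"""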
import anthropic
import random

client = anthropic.Anthropic()

models_to_test = [
    "claude-haiku-4-5",
    "claude-sonnet-4-6",
    "claude-opus-4-7",
]

def gen_problem(n_digits):
    a = random.randint(10**(n_digits-1), 10**n_digits - 1)
    b = random.randint(10**(n_digits-1), 10**n_digits - 1)
    return a, b, a * b

def test_model(model, n_digits, n_problems=20):
    correct = 0
    for _ in range(n_problems):
        a, b, ans = gen_problem(n_digits)
        resp = client.messages.create(
            model=model,
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"What is {a} * {b}? Answer with only the number."
            }]
        )
        text = resp.content[0].text.strip().replace(",", "")
        try:
            if int(text) == ans:
                correct += 1
        except ValueError:
            pass
    return correct / n_problems

if __name__ == "__main__":
    for model in models_to_test:
        for n in [2, 3, 4, 5]:
            acc = test_model(model, n, n_problems=20)
            print(f"{model:<25} {n}-digit: {acc:.0%}")

Expected output (typical and illustrative only; claude-sonnet-4-6 rows omitted):

claude-haiku-4-5          2-digit: 100%
claude-haiku-4-5          3-digit: 85%
claude-haiku-4-5          4-digit: 30%
claude-haiku-4-5          5-digit: 5%
claude-opus-4-7           2-digit: 100%
claude-opus-4-7           3-digit: 100%
claude-opus-4-7           4-digit: 95%
claude-opus-4-7           5-digit: 70%

4. Anthropic API Best Practices

4.1 Extended Thinking Is Productized Test-Time Scaling

import anthropic
client = anthropic.Anthropic()

# Easy question: leave thinking off to save money
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

# Hard question: enable thinking so the model can "think" longer.
# max_tokens must be larger than budget_tokens, because the thinking
# budget counts toward max_tokens.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=20000,
    thinking={"type": "enabled", "budget_tokens": 16000},
    messages=[{"role": "user", "content": "Solve this AIME problem: ..."}]
)
print(resp.usage.input_tokens, resp.usage.output_tokens)
# thinking tokens are billed at the output-token rate

Rules of thumb for choosing budget_tokens:

Simple questions: don't enable thinking
Hard math/coding: 8K-16K
Multi-step planning: 16K-32K
Research/agentic: 32K+
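These rules of thumb can be turned into a small lookup that builds the request arguments. The sketch below is illustrative only: the task labels, the chosen budgets (upper ends of the ranges above), and the helper name `build_request_kwargs` are assumptions, not part of the Anthropic SDK. It only produces a dict to pass to `client.messages.create(...)`, keeping `max_tokens` above `budget_tokens`.

```python
from typing import Optional

# Hypothetical task labels mapped to thinking budgets from the list above.
THINKING_BUDGETS: dict[str, Optional[int]] = {
    "simple": None,                  # thinking disabled
    "hard_math_or_code": 16_000,
    "multi_step_planning": 32_000,
    "research_or_agentic": 32_000,   # "32K+": treat as a floor and raise as needed
}


def build_request_kwargs(task_type: str, answer_tokens: int = 4_096) -> dict:
    """Return max_tokens/thinking kwargs for a task type (illustrative helper)."""
    budget = THINKING_BUDGETS[task_type]
    if budget is None:
        return {"max_tokens": answer_tokens}
    return {
        # Reserve room for the visible answer on top of the thinking budget.
        "max_tokens": budget + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }


print(build_request_kwargs("simple"))
print(build_request_kwargs("hard_math_or_code"))
```

Usage would look like `client.messages.create(model=..., messages=..., **build_request_kwargs("hard_math_or_code"))`. For the largest budgets the request can run long, so streaming may be the safer choice.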

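4.2 Replacing a Large Model with a Small Model + Thinking

Economic trade-off:

Opus 4.7 direct answer: $15/Mtok input
Sonnet 4.6 + 16K thinking: $3/Mtok input + ~$15/Mtok thinking
Haiku 4.5 + 32K thinking: $0.8/Mtok input + ~$4/Mtok thinking (speculative)

In many scenarios Sonnet + thinking is both cheaper and more accurate than a direct Opus answer; this is the core proposition behind Anthropic's productization of test-time scaling.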
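A per-request cost comparison makes the trade-off concrete. In the sketch below the input prices and the Sonnet thinking price are the ones listed above; the Opus output price of $75/Mtok and all token counts are assumptions made for illustration, so treat the results as a template rather than a quote.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    thinking_tokens: int,
    in_price: float,   # USD per million input tokens
    out_price: float,  # USD per million output tokens (thinking is billed as output)
) -> float:
    """Cost of one request; thinking tokens are charged at the output rate."""
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * in_price + billed_output * out_price) / 1e6


# Hypothetical long-context request: 100K input tokens, 2K-token answer.
opus_direct = request_cost_usd(100_000, 2_000, 0, in_price=15.0, out_price=75.0)  # out_price assumed
sonnet_thinking = request_cost_usd(100_000, 2_000, 16_000, in_price=3.0, out_price=15.0)

print(f"Opus direct:           ${opus_direct:.2f}")
print(f"Sonnet + 16K thinking: ${sonnet_thinking:.2f}")
```

With a long input the cheaper input rate dominates and Sonnet + thinking wins. For short prompts the thinking tokens dominate instead (try `input_tokens=1_000`), so the comparison is worth rerunning for each workload.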

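5. Applications in Finance

Case study: generating investment-bank research reports

Task: 10-K + industry data → a 5-page research report

Option	Model	Cost/report	Quality
A: Opus direct answer	claude-opus-4-7	$4.50	9/10
B: Sonnet + 16K thinking	claude-sonnet-4-6	$1.20	9/10
C: Haiku + 32K thinking	claude-haiku-4-5	$0.40	7/10

PM decision: choose option B. Its quality matches Opus while cutting cost by about 73% ($1.20 vs $4.50). This is "compute-optimal scaling for inference" showing up in a product.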

Historical timeline (from a financial-AI perspective)

Year	Model	Financial application
2020	GPT-3 175B	Bloomberg starts paying attention
2022	ChatGPT	Goes mainstream; Bloomberg builds BloombergGPT (50B, published in 2023)
2023	GPT-4, Claude 2	JPMorgan IndexGPT, internal copilots
2024	Claude 3 Opus, GPT-4o	Morgan Stanley deploys an AI advisor
2025	Claude 3.7, GPT-4.5, o1	Goldman Sachs automates research
2026	Claude 4.7, GPT-5, Gemini 2.5 Pro	Multi-agent investment workflows

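6. Common Pitfalls

Assuming bigger is always better: after over-training, Llama 3 8B beats older 70B models on many tasks. For deployment, size matters.
Ignoring inference cost: training cost is paid once, inference cost for the model's whole lifetime. Compute-optimal for training ≠ optimal for serving.
Overusing extended thinking: enabling thinking on simple questions wastes money and brings no benefit. Build a task-difficulty router.
Blind faith in emergent abilities: claims that "GPT-4 suddenly can do X" are often an artifact of the evaluation metric. Look at the scaling curve for the specific task.
Confusing MoE with dense: GPT-4's estimated 1.8T is its total parameter count, but only ~280B are active per token. When estimating cost, multiply the active params by D.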
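The size of the MoE error is easy to quantify with the 6ND rule. The sketch below uses the GPT-4 estimates quoted in these notes (1.8T total, ~280B active, 13T tokens), which are themselves unconfirmed figures.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training FLOPs via C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


TOTAL_PARAMS = 1.8e12   # estimated total parameters (all experts)
ACTIVE_PARAMS = 280e9   # estimated parameters active per token
TOKENS = 13e12          # estimated training tokens

naive = train_flops(TOTAL_PARAMS, TOKENS)     # wrong for MoE: counts inactive experts
active = train_flops(ACTIVE_PARAMS, TOKENS)   # per-token compute only touches active params

print(f"Using total params : {naive:.2e} FLOPs")
print(f"Using active params: {active:.2e} FLOPs")
print(f"Overestimate factor: {naive / active:.1f}x")
```

Under these estimates, using total parameters overstates training compute (and therefore cost) by more than 6x.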

7. Quick Reference

Training cost estimation formulas

Total FLOPs ≈ 6 × N × D
GPU hours = FLOPs / (peak_flops × MFU × 3600)
Typical MFU: 0.3-0.5
H100 fp16 peak: 989 TFLOP/s
Cost = GPU_hours × hourly_rate ($1.5-$3/hr cloud, $0.5/hr owned)

Anthropic thinking budget suggestions

{"type": "enabled", "budget_tokens": 1024}    # minimum (sanity check)
{"type": "enabled", "budget_tokens": 8000}    # medium math/coding
{"type": "enabled", "budget_tokens": 16000}   # hard tasks (AIME, complex code)
{"type": "enabled", "budget_tokens": 32000}   # very hard / research

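8. Interview Questions

Q1: Why did Chinchilla revise Kaplan's conclusion?

Kaplan's experiments used an under-tuned learning-rate schedule and a limited number of training steps, which underestimated data scaling. With a more careful training setup, Chinchilla found that compute-optimal training needs N:D ≈ 1:20 rather than the "big model, small data" recipe Kaplan implied. This moved the field from GPT-3's 175B/300B recipe toward Chinchilla's 70B/1.4T recipe, which LLaMA-1 65B (1.4T tokens) then followed.

Q2: Why did LLaMA-3 8B use 15T tokens, far more than Chinchilla-optimal?

Because training cost is paid once, while inference cost recurs on every request. Over-training a small model brings it close to a large model's quality while making inference roughly 10x faster and 10x lighter on memory, which is a major deployment advantage. It is the trade-off between "training-optimal" and "inference-optimal".

Q3: Will test-time compute scaling (e.g. o1, Claude extended thinking) bring training scaling to a halt?

It will not stop it, but it will shift how investment is allocated. Training scaling still yields a stronger base model; inference scaling deepens reasoning on a fixed model. The two multiply: strong base + long thinking = strongest. Anthropic's strategy is clearly to pursue both.

Q4: If you were CTO with a $100M training budget, how would you spend it?

(1) Do not pretrain yourself; the cost-effective route is an open-source base model (LLaMA) or an API. (2) Spend the $100M first on high-quality SFT/RLHF data and domain post-training. (3) As a rough estimate, a $100M pretraining effort today yields only a ~10B-30B model (on an H100 cluster), whose output quality cannot match Claude/GPT. Unless you have unique data or a strategic reason, do not pretrain yourself.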
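As a sanity check on the $100M figure, the quick-reference formulas give the raw compute that budget could buy if every dollar went to H100 time for a single run. The sketch uses the notes' own assumptions (H100 at $2.5/hr, MFU 0.4, 20 tokens per parameter) and should be read as an upper bound: a real effort also pays for data, staff, ablations, and failed runs, so the model actually trained ends up much smaller. The function names are made up for this illustration.

```python
H100_PEAK_FLOPS = 989e12   # fp16 peak from the quick reference
MFU = 0.4                  # assumed Model FLOPs Utilization
HOURLY_RATE_USD = 2.5      # assumed cloud price per H100-hour


def budget_to_flops(budget_usd: float) -> float:
    """Useful training FLOPs if the whole budget is spent on GPU hours."""
    gpu_hours = budget_usd / HOURLY_RATE_USD
    return gpu_hours * 3600 * H100_PEAK_FLOPS * MFU


def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n


flops = budget_to_flops(100e6)
n_opt, d_opt = chinchilla_split(flops)
print(f"GPU hours: {100e6 / HOURLY_RATE_USD:.2e}")
print(f"Compute  : {flops:.2e} FLOPs")
print(f"Chinchilla-optimal upper bound: N ~= {n_opt / 1e9:.0f}B params, D ~= {d_opt / 1e12:.1f}T tokens")
```

The gap between this ceiling and what a team can realistically ship is part of the argument for spending the budget on post-training instead.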

9. Tomorrow's Preview

Day 123: Tokenization: BPE/SentencePiece, why numbers, Chinese, and code are pain points, and a hands-on comparison of tiktoken vs the Anthropic tokenizer.