Expert Day 123

Tokenization: BPE, SentencePiece, and the Hidden Pitfalls of LLMs

BPE/WordPiece/SentencePiece algorithms; Anthropic vs OpenAI vs Google tokenizer differences

2026-09-01

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-09-01 | Track: AI Systems Engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #Tokenization #BPE #SentencePiece #tiktoken #LLMEngineering

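Today's Goals

Type	Content
Study	BPE/WordPiece/SentencePiece algorithms; Anthropic vs OpenAI vs Google tokenizer differences
Practice	Use tiktoken and the anthropic SDK to test tokenization edge cases for Chinese, numbers, code, and emoji
Output	Notes + tokenization pitfall checklist + best practices for handling financial numbers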

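1. Theoretical Foundations

1.1 Why is tokenization needed?

LLMs process discrete tokens, not characters. There are two extremes:

Char-level: small vocabulary (~256 for bytes), very long sequences, inefficient
Word-level: huge vocabulary (>100K), severe OOV (out-of-vocabulary) problems

Subword tokenization is the compromise: common words stay whole, rare words are split into subwords. BPE, WordPiece, and SentencePiece all share this idea.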

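1.2 BPE (Byte Pair Encoding) algorithm

Sennrich et al. 2016:

1. Initial vocabulary = all characters
2. Count the frequency of adjacent symbol pairs in the corpus
3. Merge the most frequent pair into a new token
4. Repeat steps 2-3 until the vocabulary reaches the target size (e.g. 32K, 50K, 100K)

Example: training corpus ["hug", "pug", "pun", "bun", "hugs"]

Initial: {h, u, g, p, n, b, s}
Most frequent pair: (u, g), 3 occurrences → merged into ug
Next round: (h, ug) and (u, n) tie at 2 occurrences; breaking the tie toward (h, ug) gives hug
And so on...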
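The merge loop above can be sketched in a few lines. This is a toy character-level trainer, not the Sennrich et al. reference code: the name `train_bpe` is made up for illustration, every word is assumed to occur once, and ties are broken by first-seen order.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy character-level BPE: return the merges in the order they are learned."""
    # Each word is a tuple of symbols; every word counts once in this example.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Step 3: pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["hug", "pug", "pun", "bun", "hugs"], num_merges=3)
print(merges)  # [('u', 'g'), ('h', 'ug'), ('u', 'n')]
```

A production trainer would weight words by corpus frequency and use a faster pair index, but the control flow is the same.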

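Byte-level BPE (since GPT-2): run BPE directly on UTF-8 bytes. The vocabulary starts from the 256 byte values, so there is never an OOV input. This is the mainstream approach today.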
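To see why a byte-level base vocabulary can represent any input, it helps to look at the raw UTF-8 bytes that the merges start from. The helper name `utf8_byte_ids` is only illustrative.

```python
def utf8_byte_ids(text):
    """Return the initial byte-level symbol IDs (0-255) for a string."""
    return list(text.encode("utf-8"))

# ASCII letter, accented Latin letter, Chinese character, emoji
for sample in ["a", "\u00e9", "你", "🎉"]:
    ids = utf8_byte_ids(sample)
    print(f"{sample!r}: {len(ids)} byte(s) -> {ids}")
```

Every string reduces to values in 0-255, so nothing is ever out of vocabulary. The cost is that characters outside ASCII start as 2 to 4 symbols and only become cheaper if merges for them were learned.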

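1.3 SentencePiece (Google)

Does not rely on whitespace pre-tokenization (friendly to Japanese, Chinese, and Thai):

Unigram model: starts from a large vocabulary and iteratively prunes the least important tokens
BPE mode: the same algorithm as the BPE above
Treats whitespace as an ordinary character (written as ▁)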
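The whitespace handling can be illustrated without the library. The sketch below is a pure-Python imitation of the ▁ escaping, not the sentencepiece package itself. It assumes the default behavior of adding a leading marker, and the function names are hypothetical.

```python
META = "\u2581"  # the "▁" marker

def escape_whitespace(text):
    """Imitate SentencePiece-style escaping: leading marker, spaces become markers."""
    return META + text.replace(" ", META)

def unescape_whitespace(escaped):
    """Invert escape_whitespace (assumes the input text contained no "▁" itself)."""
    return escaped.replace(META, " ")[1:]

escaped = escape_whitespace("Hello world")
print(escaped)                       # ▁Hello▁world
print(unescape_whitespace(escaped))  # Hello world
```

Because the space survives as a visible symbol, segmentation is reversible and no language-specific word splitter is needed.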

1.4 Comparison of mainstream tokenizers

Model	Tokenizer	Vocab size	Tokens per Chinese character	Tokens per Chinese word	Number handling
GPT-3.5/4 (cl100k_base)	tiktoken (BPE)	~100K	usually 2-3	4-8	1-3 digits are a single token; 4+ digits are split
GPT-4o (o200k_base)	tiktoken (BPE)	~200K	usually 1-2	2-4	same grouping of up to 3 digits
Claude (3-4.7)	Custom BPE	~65K (speculative)	usually 1-2	2-3	per-digit in many cases
Llama 3	tiktoken-style BPE	128K	1-2	2-4	groups of up to 3 digits
Gemini	SentencePiece	~256K	optimized for Chinese	1-3	optimized

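1.5 The "non-Markov" problem of tokenization

The biggest pitfall: token boundaries are not semantic boundaries. Example (illustrative; the exact split depends on the vocabulary):

" running" → [" running"]      (1 token)
"running"  → ["run", "ning"]   (2 tokens, no leading space)

Because pre-tokenization attaches the leading space to the word, the same word without a leading space becomes a different and usually rarer token sequence that the model has seen less often. This is one reason in-context learning is sometimes brittle.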
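The leading-space effect comes from the pre-tokenization step that runs before any BPE merge. The pattern below is a simplified, ASCII-only stand-in for a GPT-2-style pre-tokenizer, written for illustration; the real patterns use Unicode classes and extra rules.

```python
import re

# Simplified stand-in: optional leading space + letters, digits, or punctuation.
PRETOKEN_RE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return PRETOKEN_RE.findall(text)

print(pretokenize("keep running"))  # ['keep', ' running']
print(pretokenize("running"))       # ['running']
```

" running" and "running" are different chunks before BPE even starts, so they can map to different token IDs.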

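2. Intuition

Why can't GPT compute 3782×4691?

Not because it cannot do arithmetic, but because:

"3782" is tokenized into chunks such as ["378", "2"] (or ["37", "82"] in other vocabularies), so place-value information is misaligned
The exact product of these two 4-digit numbers rarely appears as a "fact" in the training data
The model has to work it out step by step via CoT or a tool call

GPT-4o's move from a ~100K to a ~200K vocabulary (o200k_base) did not make 4-digit numbers single tokens: tiktoken still pre-splits digit runs into groups of up to three, so any improvement in simple arithmetic cannot be credited to number tokenization.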
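A small sketch makes both halves of the argument concrete: how a digit string is chunked, and what the step-by-step computation looks like. It assumes left-to-right grouping of up to three digits, as in tiktoken's pre-tokenization pattern, and the function names are hypothetical.

```python
def chunk_digits(number_str, size=3):
    """Split a digit string left to right into groups of at most `size` digits."""
    return [number_str[i:i + size] for i in range(0, len(number_str), size)]

def partial_products(a, b):
    """Long multiplication: one partial product per decimal digit of b."""
    return [a * int(digit) * 10 ** power
            for power, digit in enumerate(reversed(str(b)))]

print(chunk_digits("3782"))  # ['378', '2']
steps = partial_products(3782, 4691)
print(steps)                 # [3782, 340380, 2269200, 15128000]
print(sum(steps))            # 17741362
```

The chunks "378" and "2" do not line up with thousands, hundreds, tens, and ones. The partial products are the intermediate results a CoT trace or a tool call would spell out.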

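Why do some prompts work "inexplicably well"?

A classic finding: "Let's think step by step" works better than "Solve this:". One likely reason is that CoT examples in the training data are often preceded by phrases like this, which cue step-by-step continuation (sometimes attributed to induction heads). A second, more speculative reason is tokenization: "step by step" is a high-frequency phrase made of a few very common tokens.

Why is Chinese expensive?

Historical reasons: early tokenizers were trained mostly on English text. A common Chinese character is 3 bytes in UTF-8, so two byte-level merges are needed before it becomes a single token, and that only happens for characters frequent enough in the training corpus. Anthropic's Claude reportedly has specific optimizations for Chinese, but its token density is still lower than for English.

API cost impact:

English: ~4 chars / token
Chinese: ~1.5 chars / token (about 3x the tokens per unit of information)
Japanese: ~2 chars / token

→ The same sentence translated into different languages can differ 2-3x in cost.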

3. Code Implementation

3.1 Testing tiktoken vs the Anthropic tokenizer

# tokenization_test.py
"""
Compare token counts from tiktoken (OpenAI) and the anthropic SDK on different inputs
"""
import tiktoken
import anthropic

# OpenAI tokenizers
enc_gpt35 = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4
enc_gpt4o = tiktoken.get_encoding("o200k_base")   # GPT-4o/5

client = anthropic.Anthropic()

def count_anthropic(text, model="claude-opus-4-7"):
    """Count input tokens with Anthropic's count_tokens API (counts the whole request, not only the raw text)."""
    resp = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}]
    )
    return resp.input_tokens

test_cases = [
    # (label, text)
    ("English short",        "Hello, world!"),
    ("English long",         "The quick brown fox jumps over the lazy dog. " * 10),
    ("Chinese short",        "你好，世界！"),
    ("Chinese para",         "今天天气真好，我们去公园散步吧。" * 10),
    ("Number 4-digit",       "1234"),
    ("Number 10-digit",      "1234567890"),
    ("Number with comma",    "1,234,567,890"),
    ("Decimal",              "3.14159265358979"),
    ("Code snippet",         "def hello():\n    print('hi')\n    return 42"),
    ("JSON",                 '{"name":"Alice","age":30,"items":[1,2,3]}'),
    ("Emoji",                "I love this! 🎉🚀💯"),
    ("URL",                  "https://api.anthropic.com/v1/messages?param=value"),
    ("Financial",            "Q3 2024 net income: $4,521,890,000 (+12.3% YoY)"),
    ("HTML",                 "<div class='container'><span>Hello</span></div>"),
    ("Repeated chars",       "aaaaaaaaaaaaaaaaaaaa"),
    ("Hash",                 "0x742d35Cc6634C0532925a3b844Bc9e7595f0fA8e"),
]

print(f"{'Test':<20} {'cl100k':>8} {'o200k':>8} {'Claude':>8} {'Chars':>6}")
print("-" * 60)
for label, text in test_cases:
    c1 = len(enc_gpt35.encode(text))
    c2 = len(enc_gpt4o.encode(text))
    c3 = count_anthropic(text)
    print(f"{label:<20} {c1:>8} {c2:>8} {c3:>8} {len(text):>6}")

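Illustrative output. The Chars column and the digit-only rows follow directly from the inputs; the other counts are rough and vary by tokenizer and model version, so re-run the script before relying on them:

Test                   cl100k    o200k   Claude  Chars
------------------------------------------------------------
English short               4        4        5     13
English long               95       86       96    450
Chinese short               7        4        5      6
Chinese para              140       82      100    160
Number 4-digit              2        2        2      4
Number 10-digit             4        4        4     10
Number with comma           7        7        5     13
Decimal                     7        7        7     16
Code snippet               14       12       13     42
JSON                       18       15       17     41
Emoji                      11        7        9     16
URL                        15       11       13     49
Financial                  20       14       17     47
HTML                       14       11       13     47
Repeated chars              4        3        3     20
Hash                       24       20       22     42

Insights:

o200k_base uses roughly half as many tokens as cl100k_base on Chinese; on plain digit strings the two are the same, since both split digits into groups of up to three
Claude is close to o200k on English and better than cl100k on Chinese
ETH addresses (0x...) cost about one token per 2 characters, which is expensive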

3.2 Studying number tokenization

# number_tokens.py
"""
See exactly how numbers get split
"""
"""
import tiktoken
enc = tiktoken.get_encoding("o200k_base")

numbers = ["1", "12", "123", "1234", "12345", "123456",
           "1234567890",
           "1.5", "3.14", "3.14159",
           "1,000", "1,000,000",
           "$1,234.56",
           "0.0001", "1e-9"]

for n in numbers:
    tokens = enc.encode(n)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"{n:>12}  =>  {decoded}  ({len(tokens)} tokens)")

Output (selected rows):

           1  =>  ['1']  (1 tokens)
          12  =>  ['12']  (1 tokens)
         123  =>  ['123']  (1 tokens)
        1234  =>  ['123', '4']  (2 tokens)
       12345  =>  ['123', '45']  (2 tokens)
      123456  =>  ['123', '456']  (2 tokens)
  1234567890  =>  ['123', '456', '789', '0']  (4 tokens)
         1.5  =>  ['1', '.', '5']  (3 tokens)
        3.14  =>  ['3', '.', '14']  (3 tokens)
   1,000,000  =>  ['1', ',', '000', ',', '000']  (5 tokens)
   $1,234.56  =>  ['$', '1', ',', '234', '.', '56']  (6 tokens)

Digit runs are split left to right into groups of at most three, so no 4- or 5-digit number is a single token.

A major pitfall in finance: $1,234.56 takes 6 tokens while 1234.56 takes only 4 (['123', '4', '.', '56']). To the model, "1234.56" and "1,234.56" are different token sequences, not the same number.

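3.3 Testing how precisely a prompt cache prefix must match

# cache_alignment.py
"""
Prompt caching only hits when the prefix is exactly identical.
Watch out for whitespace differences.
"""
import anthropic
client = anthropic.Anthropic()

LARGE_CONTEXT = "...50K tokens of stable text..."  # placeholder; real text must exceed the model's minimum cacheable length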

def call_with_cache(content_suffix=""):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": LARGE_CONTEXT + content_suffix,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "Summarize."}]
    )

# Run 1: write cache
r1 = call_with_cache("")
print(f"Run 1: write={r1.usage.cache_creation_input_tokens}, read={r1.usage.cache_read_input_tokens}")

# Run 2: same prefix, hit cache
r2 = call_with_cache("")
print(f"Run 2: write={r2.usage.cache_creation_input_tokens}, read={r2.usage.cache_read_input_tokens}")

# Run 3: trailing space difference, MISS!
r3 = call_with_cache(" ")
print(f"Run 3 (trailing space): write={r3.usage.cache_creation_input_tokens}, read={r3.usage.cache_read_input_tokens}")

Lesson: a cache miss caused by a single trailing whitespace character is a real production incident.

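4. Anthropic API Best Practices

4.1 Budgeting cost with the count_tokens API

# Count tokens precisely before going to production
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # placeholder
user_input = "..."     # placeholder
USD_PER_MTOK = 15      # assumed input price in USD per million tokens; check current pricing

resp = client.messages.count_tokens(
    model="claude-opus-4-7",
    system=[{"type": "text", "text": SYSTEM_PROMPT}],
    messages=[{"role": "user", "content": user_input}]
)
cost = resp.input_tokens * USD_PER_MTOK / 1e6
print(f"Will cost: {resp.input_tokens} input tokens × ${USD_PER_MTOK}/Mtok = ${cost:.4f}")

Production advice:

Before calling the real messages API, run count_tokens to estimate cost
Set a hard cap for long-context scenarios (e.g. reject anything over 200K tokens)
The count_tokens API itself is free (but rate limited)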
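The hard-cap advice can be wrapped in a small guard that runs on the count returned by count_tokens. This is a minimal sketch: `check_budget`, `BudgetExceeded`, and the default cap of 200,000 tokens are illustrative choices, and the price is whatever the caller passes in.

```python
class BudgetExceeded(Exception):
    """Raised when a request is larger than the configured hard cap."""

def check_budget(input_tokens, usd_per_mtok, max_input_tokens=200_000):
    """Return the estimated input cost in USD, or raise if over the cap."""
    if input_tokens > max_input_tokens:
        raise BudgetExceeded(
            f"{input_tokens} input tokens exceeds the cap of {max_input_tokens}"
        )
    return input_tokens * usd_per_mtok / 1_000_000

print(check_budget(100_000, usd_per_mtok=15.0))  # 1.5
```

In practice `input_tokens` would be `resp.input_tokens` from the count_tokens call, and the exception would be turned into a rejection or a truncation step before the real request is sent.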

4.2 Cache breakpoint best practices

Claude supports up to 4 cache breakpoints. A good layout:

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": STABLE_INSTRUCTIONS},   # no cache_control: too short to be worth its own breakpoint
        {"type": "text", "text": LARGE_DOMAIN_KB,
         "cache_control": {"type": "ephemeral"}},        # bp1: knowledge base (stable for hours)
    ],
    messages=[
        # compressed conversation history
        {"role": "user", "content": [
            {"type": "text", "text": COMPRESSED_HISTORY,
             "cache_control": {"type": "ephemeral"}}     # bp2: conversation history (5-minute cache)
        ]},
        {"role": "assistant", "content": "..."},
        {"role": "user", "content": NEW_QUESTION}        # not cached
    ]
)

5. Financial Applications

Case study: tokenization pitfalls when parsing 10-K filings

# finance_tokenization.py
import tiktoken
enc = tiktoken.get_encoding("o200k_base")

# Common formats found in filings
texts = [
    "Net income of $4,521,890",           # comma separator
    "Net income of $4521890",             # no comma
    "Net income of 4.52 million USD",     # words
    "Net income of $4.52M",               # abbreviated
    "净利润为45.21亿元",                    # Chinese
    "Q3'24",                              # quarter shorthand
    "EBITDA: $1,234.56M (+12.3% YoY)",
]
for t in texts:
    n = len(enc.encode(t))
    print(f"{n:3} tokens: {t}")

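Design decisions for a finance PM:

Number format preprocessing: normalize every number in the report to a form like "1234567" or "1.23M" before feeding it to the model. This saves tokens and can improve parsing accuracy
Protect important numbers from being split: for numbers with 4 or more digits, add CoT so the model reasons digit by digit
Currency normalization: USD/CNY/EUR symbols affect tokenization efficiency

Practical code: normalizing numbers in financial reports

import re
from decimal import Decimal

SCALES = {"million": 10**6, "billion": 10**9}

def normalize_financial_text(text):
    """Preprocess financial text to make numbers easier for an LLM."""
    # $1,234.56 → 1234.56 (only commas used as thousands separators are removed)
    text = re.sub(r'\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?',
                  lambda m: m.group(1).replace(',', '') + (m.group(2) or ''), text)
    # 1.5 million → 1500000 (Decimal avoids float rounding before truncation)
    text = re.sub(r'(\d+(?:\.\d+)?)\s*(million|billion)\b',
                  lambda m: str(int(Decimal(m.group(1)) * SCALES[m.group(2).lower()])),
                  text, flags=re.I)
    return text

print(normalize_financial_text("Net income of $4,521,890 (was $1.5 million)"))
# Net income of 4521890 (was 1500000)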

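6. Common Pitfalls

Assuming char count = token count: English averages ~4 characters per token, but a Chinese character is often about 1 token and can be 2-3 on older vocabularies. Applying the English ratio to Chinese text badly underestimates cost.
Numeric reasoning failures: the model gets "3782 × 4691" wrong because tokenization splits numbers into BPE chunks that no longer preserve decimal structure. Add separators to key numbers or force CoT.
Whitespace breaks the cache: one extra trailing space in a prompt template turns every request into a cache miss. This is production critical.
Token limits cut output off before it is complete: max_tokens=4096 can stop in the middle of JSON output and leave invalid JSON. Check stop_reason, and use stop_sequences or structured output.
Emoji and special characters can take many tokens: 💯 may be 1 token, but many Unicode symbols take 3-4. User input from a front end can flood the token budget with emoji.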
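For the truncated-JSON pitfall, a defensive parse is cheap insurance. The sketch below assumes the caller passes the response text plus the API's stop_reason string, where "max_tokens" signals truncation. The function name is hypothetical.

```python
import json

def parse_json_or_none(raw_text, stop_reason):
    """Return the parsed JSON value, or None if the output was cut off or invalid."""
    if stop_reason == "max_tokens":
        # The model ran out of budget, so the text is very likely incomplete.
        return None
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None

print(parse_json_or_none('{"a": [1, 2]}', "end_turn"))  # {'a': [1, 2]}
print(parse_json_or_none('{"a": [1, 2', "max_tokens"))  # None
```

A None result is the point at which to retry with a larger max_tokens or ask for a shorter output.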

7. Quick Reference

Token estimation rules of thumb (rough; measure on your own data)

English: 1 token ≈ 0.75 word ≈ 4 chars
Chinese: 1 token ≈ 1.5 characters (Anthropic), 1 character (GPT-4o)
Code: 1 token ≈ 3 chars (Python/JS)
JSON: 1 token ≈ 2.5 chars (lots of brackets and quotes)
ETH address: 42 chars ≈ 22 tokens
URL: highly variable, 1-3 chars/token
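These ratios can be turned into a quick offline estimator for mixed Chinese/English text. This is a rough heuristic sketch, not a substitute for a real tokenizer. The 1.5 and 4.0 defaults are the rules of thumb above, only the main CJK Unified Ideographs block is treated as Chinese, and the function names are illustrative.

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def estimate_tokens(text, cjk_chars_per_token=1.5, other_chars_per_token=4.0):
    """Estimate the token count by applying a separate ratio to CJK and other characters."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return round(cjk / cjk_chars_per_token + other / other_chars_per_token)

print(estimate_tokens("a" * 40))        # 10
print(estimate_tokens("你好世界你好"))  # 4
```

The estimate is useful for dashboards and early rejection; billing-grade numbers should still come from count_tokens or tiktoken.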

Anthropic count_tokens

client.messages.count_tokens(
    model="claude-opus-4-7",
    system=[...],
    messages=[...],
    tools=[...]  # tools count toward the token total too
)
# returns an object with input_tokens, e.g. {"input_tokens": 12345}

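8. Interview Questions

Q1: Explain the BPE training process. Why byte-level BPE?

BPE iteratively merges the most frequent adjacent symbol pair to build a subword vocabulary. Byte-level BPE runs on UTF-8 bytes (the vocabulary starts at 256), so there is never OOV: any Unicode character can be represented. This is why GPT-2/3/4 all use byte-level BPE.

Q2: Why does Claude often get 3782 × 4691 wrong, yet can solve AIME problems?

Arithmetic errors come from tokenization breaking the decimal structure of numbers ("3782" may be split into "37" + "82"). AIME problems are solvable because extended thinking plus CoT decomposition lets the model write out each intermediate result, which is then "re-tokenized" as input and sidesteps the internal representation problem. Explicit step-by-step reasoning beats intuitive arithmetic.

Q3: Design a multilingual customer-service bot. How do you estimate cost?

(1) Token density differs widely across languages: Chinese uses ~2x the tokens of English. (2) Run the count_tokens API on a simulated dataset to measure average tokens per turn. (3) Model ARPU separately for users of each language. (4) For Chinese/Japanese users, consider extra caching (cache the Chinese version of the system prompt independently). (5) Monitor tail latency: long turns may exceed the token budget.

Q4: Which details can cause a prompt caching cache miss?

(a) Any whitespace difference (trailing space, CRLF vs LF line endings). (b) cache_control markers placed inconsistently. (c) A different model (Opus and Sonnet cache separately). (d) Inconsistent ordering within the system messages. (e) TTL expiry (5-minute ephemeral). (f) Cache eviction in Anthropic's infrastructure (low probability, but it happens). In production, log cache_creation_input_tokens and cache_read_input_tokens for attribution.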
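One way to make whitespace-induced misses visible is to canonicalize the stable prefix and log a short fingerprint next to the cache usage counters. This is a minimal sketch: `canonicalize` and `fingerprint` are hypothetical helpers, and stripping trailing whitespace is only safe when the prompt does not depend on it.

```python
import hashlib

def canonicalize(prompt):
    """Normalize CRLF to LF and strip trailing whitespace on every line."""
    lines = prompt.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)

def fingerprint(prompt):
    """Short SHA-256 digest to log alongside cache_read_input_tokens."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

a = "You are a helpful analyst. \r\nAnswer briefly."
b = "You are a helpful analyst.\nAnswer briefly."
print(fingerprint(a) == fingerprint(b))                              # False
print(fingerprint(canonicalize(a)) == fingerprint(canonicalize(b)))  # True
```

If the fingerprint changes between two requests that were expected to share a prefix, the diff between the two prompt versions points straight at the cause of the miss.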

9. Tomorrow's Preview

Day 124: Sampling strategies: Temperature/Top-p/Top-k, speculative decoding, and a comparison of 5 sampling methods on creative writing, code generation, and structured output.