Expert Day 125

Basic Prompting Patterns: Zero-shot / Few-shot / CoT / ToT / Self-consistency

Zero-shot, Few-shot, Chain-of-Thought, Tree-of-Thoughts, Self-consistency, ReAct

2026-09-03

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-09-03 | Track: AI Systems Engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #Prompt #FewShot #CoT #ToT #SelfConsistency

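Today's Goals

Type	Content
Learn	Zero-shot, Few-shot, Chain-of-Thought, Tree-of-Thoughts, Self-consistency, ReAct
Practice	Implement and compare 5 modes on math / classification / extraction tasks
Deliverable	`prompts_v1.md` (with a table of measured accuracy and cost)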

1. Theory

1.1 Zero-shot

Describe the task only, with no examples:

Classify this email as spam/not-spam:
"You won 1M USD! Click here..."

1.2 Few-shot (In-Context Learning)

Brown et al. 2020 (the GPT-3 paper): give the model a few examples and it "learns" the pattern from them, with no weight updates:

Email: "Buy cheap meds!" → spam
Email: "Meeting at 3pm" → not-spam
Email: "Click this link to claim prize" → ?

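Mechanism: Olsson et al. 2022 (Anthropic, "In-context Learning and Induction Heads") present evidence that induction heads, attention heads that find an earlier occurrence of the current token and copy what followed it, account for much of this pattern matching.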
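The copy rule attributed to induction heads can be illustrated with a toy function. This is only a sketch of the behavior (see `[A][B] ... [A]`, predict `[B]`), not of how attention computes it; the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    last token and predict the token that followed it."""
    last = tokens[-1]
    # Scan backwards over earlier positions (the final token itself is excluded).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so there is nothing to copy


print(induction_predict(["A", "B", "C", "A"]))
```

A few-shot prompt gives this rule something to match against: the repeated `Email: ... →` scaffold plays the role of `[A]`, and the label plays the role of `[B]`.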

1.3 Chain-of-Thought (CoT)

Wei et al. 2022: have the model write out its reasoning steps:

Q: Janet has 3 apples. She buys 5 more, then gives 2 to Bob. How many?
A: Let's think step by step. Janet starts with 3. After buying 5, she has 8. After giving 2, she has 6. Answer: 6.

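Why it works (a common explanation): a transformer performs a fixed amount of sequential computation per generated token, bounded by its depth, so multi-hop reasoning inside a single forward pass is limited. CoT externalizes the intermediate hops as tokens that later steps can attend to, which sidesteps the depth limit.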
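Because the reasoning is written out as text, the final answer has to be parsed back out of it. A minimal sketch, assuming the model ends with an `Answer: <number>` marker as in the example above (the helper name `extract_final_answer` is hypothetical):

```python
import re


def extract_final_answer(cot_text):
    """Return the number after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*\$?(-?\d[\d,]*(?:\.\d+)?)", cot_text)
    if not matches:
        return None
    # Drop thousands separators before converting.
    return float(matches[-1].replace(",", ""))


cot = ("Let's think step by step. Janet starts with 3. After buying 5, "
       "she has 8. After giving 2, she has 6. Answer: 6.")
print(extract_final_answer(cot))
```

Anchoring on an explicit marker is more robust than taking the last number in the text, since intermediate values such as 3, 5 and 8 also appear in the chain.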

1.4 Self-Consistency (Wang et al. 2022)

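CoT + multiple samples + majority vote. For example, sample 40 CoT paths for a math problem at nonzero temperature and pick the most frequent final answer. This improves accuracy substantially: Wang et al. report gains of up to +17.9 percentage points on GSM8K over greedy CoT.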
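The voting step itself is small. The sketch below also returns the vote share, which can serve as a rough agreement signal; the sampled answers are a made-up list standing in for parsed model outputs.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winner, vote_share) over non-None answers, or (None, 0.0)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)


# Hypothetical parsed answers from 7 sampled chains (None = unparseable output).
sampled = [6, 6, 8, 6, None, 5, 6]
winner, share = majority_vote(sampled)
print(winner, round(share, 2))
```

A low vote share suggests the samples disagree, which is a cheap cue to sample more or escalate to a stronger model.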

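1.5 Tree-of-Thoughts (ToT) (Yao et al. 2023)

Extends CoT into a tree search:

At each step, generate several candidate thoughts
Use the LLM to evaluate how promising each thought is
Expand with BFS/DFS and prune weak branches
Suited to complex planning and search (Game of 24, mini crosswords, creative writing)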
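The propose / evaluate / prune loop above can be sketched without any LLM calls. In this sketch `propose` and `evaluate` are plain-Python stand-ins for the two prompts (both names and the distance heuristic are assumptions, not the paper's implementation), and the search is a breadth-first beam over Game of 24 states.

```python
from fractions import Fraction
from itertools import combinations


def propose(state):
    """Generate child states by merging any two numbers with one operation.
    A state is a tuple of (value, expression) pairs."""
    children = []
    for i, j in combinations(range(len(state)), 2):
        (a, ea), (b, eb) = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        options = [
            (a + b, f"({ea}+{eb})"),
            (a * b, f"({ea}*{eb})"),
            (a - b, f"({ea}-{eb})"),
            (b - a, f"({eb}-{ea})"),
        ]
        if b != 0:
            options.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            options.append((b / a, f"({eb}/{ea})"))
        children.extend(tuple(rest + [opt]) for opt in options)
    return children


def evaluate(state, target=24):
    """Stand-in for the LLM value prompt: lower means more promising.
    Crude heuristic: distance to the target of the closest value in the state."""
    return min(abs(value - target) for value, _ in state)


def tot_bfs(numbers, beam_width=1000, target=24):
    """Breadth-first search that keeps the beam_width best states per level."""
    frontier = [tuple((Fraction(n), str(n)) for n in numbers)]
    for _ in range(len(numbers) - 1):  # each level merges two numbers into one
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=lambda s: evaluate(s, target))
        frontier = candidates[:beam_width]  # prune the least promising states
    for state in frontier:
        value, expr = state[0]
        if value == target:
            return expr
    return None


solution = tot_bfs([4, 8, 6, 3], beam_width=1000)
print(solution)
```

With a wide beam nothing is pruned before the last level, so the search is exhaustive. A narrow beam is cheaper but can discard the branch that leads to 24, which is the same trade-off an LLM-based evaluator faces.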

1.6 ReAct (Reasoning + Acting)

Yao et al. 2022: alternate Reasoning (Thought) → Action → Observation:

Thought: I need to find the population of Paris.
Action: search("Paris population 2026")
Observation: 2.16 million
Thought: Now I need to compare with London.
Action: search("London population 2026")
...

This is the core pattern of the agent loop.
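A minimal sketch of that loop, assuming the model emits `Action: tool("arg")` lines and finishes with `Final Answer:`. The `scripted_model` and `fake_search` functions are hypothetical stand-ins for an LLM call and a real search tool, and the index values are test fixtures, not real statistics.

```python
import re


def react_loop(model, tools, question, max_steps=5):
    """Alternate model output (Thought/Action) with tool Observations until
    the model emits 'Final Answer:' or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)  # the model sees the full transcript so far
        transcript += step + "\n"
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:
            return final.group(1).strip(), transcript
        action = re.search(r'Action:\s*(\w+)\("([^"]*)"\)', step)
        if action:
            name, arg = action.groups()
            observation = tools[name](arg) if name in tools else f"unknown tool {name}"
            transcript += f"Observation: {observation}\n"
    return None, transcript


# Fixture data for the fake tool (illustrative values only).
FAKE_INDEX = {"Paris population": "2.16 million", "London population": "8.9 million"}


def fake_search(query):
    return FAKE_INDEX.get(query, "no result")


def scripted_model(transcript):
    """Deterministic stand-in for an LLM: decides by counting observations."""
    n_obs = transcript.count("Observation:")
    if n_obs == 0:
        return 'Thought: I need the population of Paris.\nAction: search("Paris population")'
    if n_obs == 1:
        return 'Thought: Now I need London.\nAction: search("London population")'
    return "Thought: London is larger.\nFinal Answer: London"


answer, transcript = react_loop(
    scripted_model, {"search": fake_search}, "Which is larger, Paris or London?"
)
print(answer)
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer loops forever.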

1.7 Comparison of Modes

Mode	Cost	Accuracy on Math	Accuracy on Classification
Zero-shot	1x	low	medium
Few-shot	1.5x	medium	high
Zero-shot CoT	2x	high	medium
Few-shot CoT	3x	very high	high
Self-Consistency (40 samples)	40x	highest	high
ToT	10-50x	highest on complex tasks	overkill

2. Intuition

Few-shot vs Fine-tuning

Few-shot: in-context, no weight update, instant adaptation, costly per request. Fine-tuning: weight update, fast inference, slow setup.

Rule of thumb: < 100 examples → few-shot; > 1000 examples → fine-tune.

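Why doesn't CoT work for small models?

PaLM 540B and GPT-3 175B gain substantially from CoT on math. A 7B model such as Llama-7B does not necessarily improve with CoT, because the chain it produces is often incoherent to begin with. Wei et al. 2022 describe CoT as an emergent ability that appears at roughly the 100B-parameter scale: the model needs enough reasoning ability of its own for the chain to help.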

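The cost of Self-Consistency

40 samples = 40x cost. In production, 3-5 samples are usually enough (diminishing marginal returns). Alternatively, use Best-of-N with a reward model to pick the best sample.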
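Best-of-N replaces the vote with a scoring function. The sketch below assumes a reward model exposed as a plain callable; `toy_reward` is a hypothetical placeholder that only checks for a well-formed final answer.

```python
import re


def best_of_n(candidates, score):
    """Return the candidate with the highest score, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=score)


def toy_reward(text):
    """Placeholder reward: 1.0 if the text ends with 'Answer: <number>', else 0.0."""
    return 1.0 if re.search(r"Answer:\s*-?\d+(\.\d+)?\s*$", text) else 0.0


candidates = ["I think it is six", "3 + 5 - 2 = 6. Answer: 6"]
print(best_of_n(candidates, toy_reward))
```

Unlike majority voting, this also works for open-ended outputs, but the result is only as good as the scorer.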

3. Code

3.1 Measuring the Five Modes

# prompt_patterns.py
"""
Compare zero-shot, few-shot, CoT, few-shot+CoT, self-consistency
on a simple math benchmark.
"""
import anthropic
from collections import Counter
import re

client = anthropic.Anthropic()

PROBLEMS = [
    ("If a train travels 60km/h for 2.5 hours, how far?", 150),
    ("A book costs $12. Sales tax is 8%. Total?", 12.96),
    ("Sarah has 24 candies. She gives 1/3 to Bob and 1/4 of the rest to Tom. How many left?", 12),
    ("A rectangle is 7m by 4m. What's the perimeter?", 22),
    ("If x + 5 = 17, what is 2x - 3?", 21),
]

def extract_number(text):
    matches = re.findall(r"-?\d+\.?\d*", text)
    return float(matches[-1]) if matches else None

def call(prompt, T=0.0):
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        temperature=T,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text

# Mode 1: Zero-shot
def zero_shot(q):
    return call(f"Q: {q}\nAnswer with only the number.")

# Mode 2: Few-shot
FEW_SHOT_EXAMPLES = """Q: A car drives 80km in 2 hours. What is its speed?
A: 40

Q: 15 + 27 = ?
A: 42

"""
def few_shot(q):
    return call(FEW_SHOT_EXAMPLES + f"Q: {q}\nA:")

# Mode 3: Zero-shot CoT
def zero_shot_cot(q):
    return call(f"Q: {q}\nLet's think step by step.")

# Mode 4: Few-shot CoT
FEW_SHOT_COT_EXAMPLES = """Q: A car drives 80km in 2 hours. What is its speed?
A: Let's think step by step. Speed = distance / time = 80 / 2 = 40 km/h. Answer: 40

Q: There are 24 students, divided into groups of 4. How many groups?
A: Let's think step by step. 24 / 4 = 6. Answer: 6

"""
def few_shot_cot(q):
    return call(FEW_SHOT_COT_EXAMPLES + f"Q: {q}\nA:")

# Mode 5: Self-consistency (sample N times, majority vote)
def self_consistency(q, n=5):
    answers = []
    for _ in range(n):
        out = call(f"Q: {q}\nLet's think step by step.", T=0.7)
        num = extract_number(out)
        if num is not None:
            answers.append(num)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Run experiments
def evaluate(mode_fn, name):
    correct = 0
    for q, ans in PROBLEMS:
        out = mode_fn(q)
        if isinstance(out, str):
            num = extract_number(out)
        else:
            num = out
        if num is not None and abs(num - ans) < 0.01:
            correct += 1
    return correct / len(PROBLEMS)

if __name__ == "__main__":
    modes = [
        ("zero-shot",      zero_shot),
        ("few-shot",       few_shot),
        ("zero-shot CoT",  zero_shot_cot),
        ("few-shot CoT",   few_shot_cot),
        ("self-consist.",  self_consistency),
    ]
    for name, fn in modes:
        acc = evaluate(fn, name)
        print(f"{name:<20}: {acc:.0%}")

Expected output (illustrative; actual numbers depend on the model and the run):

zero-shot           : 60%
few-shot            : 80%
zero-shot CoT       : 80%
few-shot CoT        : 100%
self-consist.       : 100%
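The deliverable also calls for a cost column. A rough sketch of the accounting, assuming token counts are collected from each response (the Anthropic SDK exposes them as `resp.usage.input_tokens` and `resp.usage.output_tokens`). The `Usage` dataclass and the per-million-token prices are placeholders, not real pricing.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


def total_cost_usd(usages, price_in_per_mtok, price_out_per_mtok):
    """Sum the cost of several calls given per-million-token prices."""
    tokens_in = sum(u.input_tokens for u in usages)
    tokens_out = sum(u.output_tokens for u in usages)
    return (tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok) / 1_000_000


# Two calls with made-up token counts and made-up prices.
calls = [Usage(1000, 500), Usage(1000, 500)]
print(total_cost_usd(calls, price_in_per_mtok=1.0, price_out_per_mtok=5.0))
```

Recording one `Usage` per API call makes self-consistency's multiplier visible directly: n samples produce n entries.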

3.2 Simplified ToT

# tot_demo.py
"""
Game of 24: combine 4 numbers with +, -, *, / to reach 24.
Compares ToT against CoT.
"""
import anthropic
client = anthropic.Anthropic()

def cot_24(numbers):
    prompt = f"""Use the numbers {numbers} with +, -, *, / to make 24.
Show your work step by step. Output the final equation."""
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text

def tot_24(numbers):
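    """Simplified ToT: propose 5 candidate first moves, then pick the best and solve from it (one propose/evaluate round, no backtracking)."""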
    # Step 1: propose
    prompt1 = f"""For the numbers {numbers} (target=24), list 5 possible first moves
(e.g., '4+8=12 leaves [12, 6, 3]'). Be diverse."""
    resp1 = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.7,
        messages=[{"role": "user", "content": prompt1}]
    )
    candidates = resp1.content[0].text

    # Step 2: evaluate
    prompt2 = f"""Given candidates:\n{candidates}\n
Which has the highest chance of reaching 24? Pick the best and continue solving."""
    resp2 = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt2}]
    )
    return resp2.content[0].text

print("=== CoT ===")
print(cot_24([4, 8, 6, 3]))
print("\n=== ToT ===")
print(tot_24([4, 8, 6, 3]))
# Valid solutions for [4, 8, 6, 3] include (8 - 6/3) * 4 = 24 and (6 + 8/4) * 3 = 24.
# Near misses to watch for in model output: (8 - 6 + 4) * 3 = 18 and 4 * (8 - (6 - 3)) = 20.

3.3 Refining the Few-shot Format

def build_few_shot(examples, query, format_style="qa"):
    """
    examples: list of (input, output) pairs
    format_style: 'qa', 'arrow', 'json', 'xml'
    """
    if format_style == "qa":
        prefix = "\n\n".join(f"Q: {i}\nA: {o}" for i, o in examples)
        return f"{prefix}\n\nQ: {query}\nA:"
    elif format_style == "arrow":
        prefix = "\n".join(f"{i} → {o}" for i, o in examples)
        return f"{prefix}\n{query} →"
    elif format_style == "xml":
        prefix = "\n".join(
            f"<example><input>{i}</input><output>{o}</output></example>"
            for i, o in examples
        )
        return f"{prefix}\n<input>{query}</input>\n<output>"
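    elif format_style == "json":
        import json
        prefix = json.dumps([{"input": i, "output": o} for i, o in examples])
        return f"Examples: {prefix}\nNew input: {query}\nOutput:"
    else:
        # Fail loudly instead of silently returning None for an unknown style
        raise ValueError(f"unknown format_style: {format_style!r}")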

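Anthropic recommends XML tags for structuring prompts to Claude. This is commonly attributed to Claude's training, though the specific link to Constitutional AI data is a conjecture, not something Anthropic documents.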

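4. Anthropic API Best Practices

4.1 Claude Prefers XML Structure

Anthropic's official documentation recommends wrapping content in tags such as <example>, <context>, and <instructions>. The tag names are not special; pick descriptive names and use them consistently.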

prompt = """<context>
You are a financial analyst at JPMorgan.
</context>

<examples>
<example>
<input>Revenue: $5B, Expenses: $4B</input>
<output>Profit margin: 20%</output>
</example>
</examples>

<query>
{user_question}
</query>"""
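The template above still needs its `{user_question}` slot filled. One way to do that is sketched below, escaping the user text so it cannot open or close tags inside the prompt. The escaping step is a defensive design choice assumed here, not something the Anthropic docs prescribe, and `render` is a hypothetical helper.

```python
from xml.sax.saxutils import escape

TEMPLATE = """<context>
You are a financial analyst.
</context>

<query>
{user_question}
</query>"""


def render(template, user_question):
    """Fill the template; escape &, < and > in the untrusted user text."""
    return template.format(user_question=escape(user_question))


rendered = render(TEMPLATE, "Is </query> margin > 10%?")
print(rendered)
```

Note that `str.format` treats every `{...}` in the template as a slot, so a template containing literal braces (JSON examples, for instance) needs them doubled as `{{` and `}}`.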

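4.2 Set the Role in the System Prompt, Not the User Prompt

# Less effective: role stuffed into the user turn
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "You are a doctor. Diagnose this..."}],
)

# Preferred: role in the system parameter (Anthropic recommends this for role prompting)
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a board-certified physician...",
    messages=[{"role": "user", "content": "Symptoms: ..."}],
)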

4.3 Few-shot as Multi-turn Messages

A cleaner approach: put the examples in messages as conversation history:

client.messages.create(
    model="claude-opus-4-7",
    max_tokens=64,  # required by the Messages API
    system="You classify financial transactions.",
    messages=[
        {"role": "user", "content": "Tx: Wire to Cayman LLC $5M"},
        {"role": "assistant", "content": "Risk level: HIGH"},
        {"role": "user", "content": "Tx: Coffee shop $4.50"},
        {"role": "assistant", "content": "Risk level: LOW"},
        {"role": "user", "content": "Tx: Crypto exchange $50K"},  # the actual query
    ]
)

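This way the model treats the examples as its own past answers, which tends to anchor the output format better than examples pasted in as plain text.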
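Building that history by hand gets tedious once the example set changes often. A small helper can generate it from (input, output) pairs; `few_shot_messages` is a hypothetical name, and the output is just the list passed as `messages=`.

```python
def few_shot_messages(examples, query):
    """Turn (user_text, assistant_text) pairs plus a final query into a
    messages list with alternating user/assistant roles."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Tx: Wire to Cayman LLC $5M", "Risk level: HIGH"),
    ("Tx: Coffee shop $4.50", "Risk level: LOW"),
]
msgs = few_shot_messages(examples, "Tx: Crypto exchange $50K")
print(len(msgs), msgs[-1]["role"])
```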

5. Applications in Finance

Case: Few-shot CoT for Trade Intent Classification

EXAMPLES = [
    ("BUY 100 AAPL @ MARKET",
     "User wants to immediately purchase 100 shares of Apple at current market price. Action: place market buy order. Risk: high slippage on illiquid stocks."),
    ("Sell half my Tesla position",
     "User wants to liquidate 50% of their TSLA holding. Action: query portfolio for TSLA shares, divide by 2, place sell order."),
]

def classify_intent(user_msg):
    """Build the few-shot CoT prompt; the caller sends it via client.messages.create."""
    examples_str = "\n".join(
        f"<example><input>{i}</input><analysis>{o}</analysis></example>"
        for i, o in EXAMPLES
    )
    prompt = f"""<examples>{examples_str}</examples>
<input>{user_msg}</input>
<analysis>"""
    return prompt

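Financial Report Extraction: Self-Consistency vs Single-shot

from collections import Counter

def extract_revenue(report_text, n_samples=5):
    """Extract the revenue figure from a financial report by voting over multiple samples."""
    # Assumes `client = anthropic.Anthropic()` as in section 3.1
    answers = []
    for _ in range(n_samples):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            temperature=0.7,
            system="Extract Q3 revenue in USD millions. Output only the number.",
            messages=[{"role": "user", "content": report_text}]
        )
        # Tolerate a leading "$" and thousands separators such as "1,234.5"
        raw = resp.content[0].text.strip().lstrip("$").replace(",", "")
        try:
            answers.append(float(raw))
        except ValueError:
            pass  # skip samples that are not a bare number
    if not answers:
        return None
    # Majority vote over the parsed values
    return Counter(answers).most_common(1)[0][0]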

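Production decision: 5 samples cost 5x, but if accuracy rises from, say, 92% to 99%, that is worth it in finance, where the damage from one error exceeds the 5x cost. The accuracy figures here are illustrative and need to be measured per task.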
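The trade-off can be made explicit as an expected cost per document. In the sketch below the accuracies come from the text, while the per-call cost and the damage per error are hypothetical numbers chosen only to show the arithmetic.

```python
def expected_cost(accuracy, error_damage, call_cost, n_samples):
    """Expected cost per document: API spend plus expected damage from errors."""
    return n_samples * call_cost + (1 - accuracy) * error_damage


def break_even_damage(acc_single, acc_voted, call_cost, n_samples):
    """Damage per error above which voting is cheaper than a single call."""
    return (n_samples - 1) * call_cost / (acc_voted - acc_single)


# Hypothetical: $0.01 per call, $100 damage per wrong extraction.
single = expected_cost(0.92, error_damage=100.0, call_cost=0.01, n_samples=1)
voted = expected_cost(0.99, error_damage=100.0, call_cost=0.01, n_samples=5)
print(round(single, 2), round(voted, 2))
print(round(break_even_damage(0.92, 0.99, call_cost=0.01, n_samples=5), 3))
```

Under these assumptions voting wins as soon as one error costs more than about 57 cents, which is why the 5x multiplier is rarely the deciding factor in high-stakes extraction.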

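6. Common Pitfalls

Few-shot example quality decides everything: examples containing errors teach the model the wrong pattern. As a rule of thumb, 5 bad examples are worse than none.
"Let's think step by step" is not required for Claude 4.7: Claude already reasons step by step by default in most scenarios, and extended thinking is stronger.
Self-Consistency only works for tasks with a single checkable answer: creative writing has no notion of a majority vote.
ToT is expensive and easy to over-engineer: CoT is enough for roughly 80% of cases.
Confusing few-shot with long context: stuffing 100 examples into the context is wasteful. Models attend most to content near the beginning and end of the context (the lost-in-the-middle effect, Liu et al. 2023), so 5-10 well-chosen examples usually do more.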

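7. Quick Reference

Prompting checklist

Set the role in the system prompt
Structure with XML tags (Anthropic)
Put examples first and the query last
Trigger CoT with "Let's think step by step" or an instruction in the system prompt
State the output format explicitly ("Output only JSON")
Put important instructions at the start or end to avoid lost-in-the-middle
T=0 for factual/structured, T=0.7+ for creative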

Task-to-Mode Mapping

Math reasoning      → CoT or Extended Thinking
Classification      → Few-shot (3-5 examples)
Code generation     → Zero-shot + clear spec
Data extraction     → Few-shot CoT or Self-Consistency
Long-form writing   → Zero-shot with detailed system
Multi-step planning → ReAct or Extended Thinking

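8. Interview Questions

Q1: How do you choose between few-shot and fine-tuning?

Fewer than 100 examples → few-shot; 100-1000 → try both; more than 1000 → fine-tune. Few-shot advantages: fast iteration, no infra. Disadvantages: every inference call is more expensive and context is limited. Fine-tuning is the reverse.

Q2: What is the statistical principle behind Self-Consistency?

If each sample is independently correct with probability p > 0.5, the accuracy of the majority answer approaches 1 quickly as the number of samples grows (Condorcet's jury theorem). If p < 0.5, voting makes things worse, so the base model must beat chance for this to help. With many possible answers the condition relaxes to the correct answer being the single most likely one. In practice samples from one model are correlated, so gains are smaller than the independent case suggests.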
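The claim can be checked numerically for the binary, independent case. This is a sketch of the idealized model only; real samples from one LLM are correlated, so it gives an upper bound on the benefit.

```python
from math import comb


def majority_accuracy(p, n):
    """P(a strict majority of n independent votes is correct), for odd n,
    when each vote is correct with probability p (binary outcome)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 3, 5, 11, 41):
    print(n, round(majority_accuracy(0.6, n), 3), round(majority_accuracy(0.4, n), 3))
```

The p = 0.6 column climbs toward 1 with diminishing increments, while the p = 0.4 column falls toward 0. This matches both the "3-5 samples is usually enough" rule and the warning about sub-chance models.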

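Q3: Why does Anthropic recommend XML tags while OpenAI recommends markdown?

Most likely training-data preference, though neither vendor documents the cause. One conjecture is that Anthropic's Constitutional AI data makes heavy use of XML-style structured critiques, while OpenAI's data leans on markdown headers. In practice both formats work with either model family; it is not a hard rule.

Q4: When is ToT really worth it?

When intermediate states can be evaluated (Game of 24, puzzles, code search, writing outlines). If intermediate states are hard to evaluate (dialogue, for example), ToT does worse than CoT + self-consistency. Cost is typically 10-50x, so work out the ROI.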

9. Coming Up Tomorrow

Day 126: Structured Output: JSON mode, function calling, constrained decoding.