Prompt Basics: Zero-shot / Few-shot / CoT / ToT / Self-consistency
Zero-shot、Few-shot、Chain-of-Thought、Tree-of-Thoughts、Self-consistency、ReAct
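Date: 2026-09-03 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #Prompt #FewShot #CoT #ToT #SelfConsistency
Today's goals
| Type | Content |
|---|---|
| Study | Zero-shot, Few-shot, Chain-of-Thought, Tree-of-Thoughts, Self-consistency, ReAct |
| Practice | Implement five modes and compare them on math / classification / extraction tasks |
| Deliverable | prompts_v1.md (with a table of measured accuracy and cost) |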
1. Theoretical foundations
1.1 Zero-shot
Describe the task only, with no examples:
Classify this email as spam/not-spam:
"You won 1M USD! Click here..."
1.2 Few-shot (In-Context Learning)
Brown et al. 2020 (the GPT-3 paper): give the model a few examples in the prompt and it "learns" the task from them, with no weight updates:
Email: "Buy cheap meds!" → spam
Email: "Meeting at 3pm" → not-spam
Email: "Click this link to claim prize" → ?
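Mechanism: Olsson et al. 2022 (Anthropic) present evidence that induction heads, attention heads that find an earlier occurrence of the current token and copy what followed it, account for much of this in-context pattern matching.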
1.3 Chain-of-Thought (CoT)
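Wei et al. 2022: have the model write out its reasoning steps before the answer. (The "Let's think step by step" trigger used below comes from the zero-shot variant of Kojima et al. 2022.)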
Q: Janet has 3 apples. She buys 5 more, then gives 2 to Bob. How many?
A: Let's think step by step. Janet starts with 3. After buying 5, she has 8. After giving 2, she has 6. Answer: 6.
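Why it works (a common intuition): a fixed-depth transformer can do only a bounded amount of sequential computation per generated token. CoT externalizes multi-hop reasoning into the token sequence, so each step can condition on the previous ones and the depth limit is sidestepped.
1.4 Self-Consistency (Wang et al. 2022)
CoT + multiple samples + majority vote. For example, sample 40 CoT paths for a math problem at nonzero temperature and return the most frequent final answer. This improves accuracy substantially: Wang et al. report +17.9% on GSM8K over greedy CoT decoding.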
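1.5 Tree-of-Thoughts (ToT): Yao et al. 2023
Extends CoT into a tree search:
- Generate several candidate thoughts at each step
- Have the LLM evaluate how promising each thought is
- Expand with BFS/DFS and prune weak branches
- Suited to complex planning (Game of 24, mini crosswords, creative writing)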
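The four steps above can be sketched without any API calls. This is a toy under stated assumptions: `propose` enumerates every pairwise arithmetic move instead of asking an LLM for a few thoughts, and `score` is a crude "closest to 24" heuristic standing in for the LLM evaluator. The names `propose`, `score`, `tot_bfs`, and `beam_width` are hypothetical and are not taken from the ToT paper's code.

```python
from fractions import Fraction
from itertools import combinations

TARGET = 24

def propose(state):
    """Generate child states by combining two numbers with one operator.

    A state is (numbers, steps). In real ToT an LLM would propose a
    handful of thoughts here instead of enumerating every move.
    """
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        options = [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                   (a - b, f"{a}-{b}"), (b - a, f"{b}-{a}")]
        if b != 0:
            options.append((a / b, f"{a}/{b}"))
        if a != 0:
            options.append((b / a, f"{b}/{a}"))
        for value, expr in options:
            children.append((tuple(rest + [value]), steps + [f"{expr}={value}"]))
    return children

def score(state):
    """Stand-in for the LLM evaluator: closer to the target is more promising."""
    nums, _ = state
    return -min(abs(n - TARGET) for n in nums)

def tot_bfs(numbers, beam_width=5):
    """Breadth-first search that keeps only the best beam_width states per level."""
    frontier = [(tuple(Fraction(n) for n in numbers), [])]
    while frontier:
        # Expand every kept state by one step.
        candidates = [child for state in frontier for child in propose(state)]
        # A single remaining number equal to the target is a solution.
        for nums, steps in candidates:
            if len(nums) == 1 and nums[0] == TARGET:
                return steps
        # Prune: keep only the most promising unfinished states.
        open_states = [c for c in candidates if len(c[0]) > 1]
        frontier = sorted(open_states, key=score, reverse=True)[:beam_width]
    return None

if __name__ == "__main__":
    # A very wide beam makes the search exhaustive.
    print(tot_bfs([4, 8, 6, 3], beam_width=10**6))
```

With a narrow beam the heuristic can prune the branch that leads to 24, which mirrors how a weak LLM evaluator limits real ToT.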
1.6 ReAct (Reasoning + Acting)
Yao et al. 2022: alternate Reason → Action → Observation:
Thought: I need to find the population of Paris.
Action: search("Paris population 2026")
Observation: 2.16 million
Thought: Now I need to compare with London.
Action: search("London population 2026")
...
This is the core pattern of the agent loop.
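A minimal sketch of that loop, runnable offline. Everything here is an assumption made for illustration: `fake_llm` is a scripted stand-in for a model call, `search` is a lookup table rather than a real search API, the `Action: tool("arg")` syntax is one possible convention, and the population strings are placeholders.

```python
import re

# Hypothetical tool: a lookup table standing in for a real search API.
# The figures are placeholders for illustration, not real data.
FAKE_INDEX = {
    "Paris population": "2.16 million",
    "London population": "8.9 million",
}

def search(query):
    return FAKE_INDEX.get(query, "no result")

TOOLS = {"search": search}

def fake_llm(transcript):
    """Scripted stand-in for a model call: picks the next step from the
    number of observations already in the transcript."""
    seen = transcript.count("Observation:")
    if seen == 0:
        return 'Thought: I need the population of Paris.\nAction: search("Paris population")'
    if seen == 1:
        return 'Thought: Now I need London.\nAction: search("London population")'
    return "Thought: I have both numbers.\nFinal Answer: London is larger."

ACTION_RE = re.compile(r'Action: (\w+)\("([^"]*)"\)')

def react(question, llm=fake_llm, max_steps=5):
    """Run Thought -> Action -> Observation until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        match = ACTION_RE.search(step)
        if match is None:
            break  # the model produced neither an action nor an answer
        tool_name, arg = match.groups()
        tool = TOOLS.get(tool_name)
        observation = tool(arg) if tool else f"unknown tool {tool_name}"
        # Feed the tool result back so the next thought can use it.
        transcript += f"Observation: {observation}\n"
    return None, transcript

if __name__ == "__main__":
    answer, log = react("Which city is larger, Paris or London?")
    print(log)
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer loops forever.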
1.7 Comparison of modes (rough relative estimates, not measurements)
| Mode | Cost | Accuracy on Math | Accuracy on Classification |
|---|---|---|---|
| Zero-shot | 1x | low | medium |
| Few-shot | 1.5x | medium | high |
| Zero-shot CoT | 2x | high | medium |
| Few-shot CoT | 3x | very high | high |
| Self-Consistency (40 samples) | 40x | highest | high |
| ToT | 10-50x | highest on complex tasks | overkill |
2. Intuition
Few-shot vs Fine-tuning
Few-shot: in-context, no weight update, instant adaptation, costly per request. Fine-tuning: weight update, fast inference, slow setup.
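Rule of thumb: < 100 examples → few-shot; > 1000 examples → fine-tune.
Why doesn't CoT work on small models?
PaLM 540B and GPT-3 175B gain substantially on math with CoT. Llama-7B with CoT is not necessarily better: the chain of thought it produces is itself often incoherent. Wei et al. describe CoT as an emergent ability of scale; it helps only when the model already has the underlying reasoning ability.
The cost of Self-Consistency
40 samples = 40x cost. In production, 3-5 samples are usually enough (diminishing marginal returns). Alternatively, use Best-of-N and pick the best sample with a reward model.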
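The Best-of-N alternative differs from voting in one respect: it needs a scorer rather than comparable final answers. A minimal sketch, assuming two hypothetical callables: `generate` (which would wrap a sampled API call at T > 0) and `score` (which would be a reward model). Here both are toy stand-ins.

```python
import itertools

def best_of_n(generate, score, n=5):
    """Draw n candidates and return the one the scorer rates highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a fixed cycle of drafts and a scorer that prefers
# longer answers (i.e. ones that show their working).
drafts = itertools.cycle([
    "6",
    "The answer is 6 because 3+5-2=6.",
    "maybe 7?",
])
pick = best_of_n(lambda: next(drafts), score=len, n=3)
print(pick)
```

Unlike majority voting, this also applies to open-ended outputs, but the result is only as good as the scorer.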
3. Code implementation
3.1 Measuring the five modes
# prompt_patterns.py
"""
Compare zero-shot, few-shot, zero-shot CoT, few-shot CoT, and self-consistency
on a small math benchmark.
"""
import re
from collections import Counter

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# (question, expected numeric answer)
PROBLEMS = [
    ("If a train travels 60km/h for 2.5 hours, how far?", 150),
    ("A book costs $12. Sales tax is 8%. Total?", 12.96),
    ("Sarah has 24 candies. She gives 1/3 to Bob and 1/4 of the rest to Tom. How many left?", 12),
    ("A rectangle is 7m by 4m. What's the perimeter?", 22),
    ("If x + 5 = 17, what is 2x - 3?", 21),
]


def extract_number(text):
    """Return the last number in the text (thousands separators removed), or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def call(prompt, T=0.0):
    """Send one user message and return the text of the first content block."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        temperature=T,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text


# Mode 1: Zero-shot
def zero_shot(q):
    return call(f"Q: {q}\nAnswer with only the number.")


# Mode 2: Few-shot
FEW_SHOT_EXAMPLES = """Q: A car drives 80km in 2 hours. What is its speed?
A: 40
Q: 15 + 27 = ?
A: 42
"""


def few_shot(q):
    return call(FEW_SHOT_EXAMPLES + f"Q: {q}\nA:")


# Mode 3: Zero-shot CoT
def zero_shot_cot(q):
    return call(f"Q: {q}\nLet's think step by step.")


# Mode 4: Few-shot CoT
FEW_SHOT_COT_EXAMPLES = """Q: A car drives 80km in 2 hours. What is its speed?
A: Let's think step by step. Speed = distance / time = 80 / 2 = 40 km/h. Answer: 40
Q: There are 24 students, divided into groups of 4. How many groups?
A: Let's think step by step. 24 / 4 = 6. Answer: 6
"""


def few_shot_cot(q):
    return call(FEW_SHOT_COT_EXAMPLES + f"Q: {q}\nA:")


# Mode 5: Self-consistency (sample n times at T > 0, majority vote)
def self_consistency(q, n=5):
    answers = []
    for _ in range(n):
        out = call(f"Q: {q}\nLet's think step by step.", T=0.7)
        num = extract_number(out)
        if num is not None:
            answers.append(num)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Run experiments
def evaluate(mode_fn, name):
    """Return the fraction of PROBLEMS that mode_fn answers within 0.01."""
    correct = 0
    for q, ans in PROBLEMS:
        out = mode_fn(q)
        # Text modes return a string; self_consistency returns a number or None.
        num = extract_number(out) if isinstance(out, str) else out
        if num is not None and abs(num - ans) < 0.01:
            correct += 1
    return correct / len(PROBLEMS)


if __name__ == "__main__":
    modes = [
        ("zero-shot", zero_shot),
        ("few-shot", few_shot),
        ("zero-shot CoT", zero_shot_cot),
        ("few-shot CoT", few_shot_cot),
        ("self-consist.", self_consistency),
    ]
    for name, fn in modes:
        acc = evaluate(fn, name)
        print(f"{name:<20}: {acc:.0%}")
Illustrative output (actual numbers depend on the model and the run):
zero-shot : 60%
few-shot : 80%
zero-shot CoT : 80%
few-shot CoT : 100%
self-consist. : 100%
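The script reports accuracy only, while the deliverable also asks for a cost column. One way to get it is to accumulate the `usage` field of each Messages API response, which carries `input_tokens` and `output_tokens`. The sketch below assumes a hypothetical `UsageTally` helper; the per-million-token prices are parameters because they vary by model and are not given in these notes. `SimpleNamespace` objects stand in for real responses so the example runs offline.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class UsageTally:
    """Running token totals for one prompting mode (hypothetical helper)."""
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, usage):
        """Accumulate one response's usage (needs .input_tokens / .output_tokens)."""
        self.calls += 1
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens

    def cost(self, usd_per_mtok_in, usd_per_mtok_out):
        """Dollar cost given prices per million input / output tokens."""
        return (self.input_tokens * usd_per_mtok_in
                + self.output_tokens * usd_per_mtok_out) / 1_000_000

# Offline demo with stand-in usage objects; with the real SDK you would
# call tally.add(resp.usage) after each client.messages.create(...).
tally = UsageTally()
tally.add(SimpleNamespace(input_tokens=100, output_tokens=50))
tally.add(SimpleNamespace(input_tokens=200, output_tokens=150))
print(tally.calls, tally.input_tokens, tally.output_tokens)
```

Keeping one tally per mode makes the relative cost multipliers in the comparison table something you measure rather than assume.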
3.2 Simplified ToT implementation
# tot_demo.py
"""
Game of 24: combine 4 numbers with +, -, *, / to reach 24.
ToT vs CoT comparison.
"""
import anthropic

client = anthropic.Anthropic()


def cot_24(numbers):
    prompt = f"""Use the numbers {numbers} with +, -, *, / to make 24.
Show your work step by step. Output the final equation."""
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text


def tot_24(numbers):
    """Simplified ToT: propose 5 candidate first moves, then have the model
    pick the most promising one and continue from it (one level, no backtracking)."""
    # Step 1: propose
    prompt1 = f"""For the numbers {numbers} (target=24), list 5 possible first moves
(e.g., '4+8=12 leaves [12, 6, 3]'). Be diverse."""
    resp1 = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.7,
        messages=[{"role": "user", "content": prompt1}]
    )
    candidates = resp1.content[0].text
    # Step 2: evaluate and continue from the best candidate
    prompt2 = f"""Given candidates:\n{candidates}\n
Which has the highest chance of reaching 24? Pick the best and continue solving."""
    resp2 = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt2}]
    )
    return resp2.content[0].text


print("=== CoT ===")
print(cot_24([4, 8, 6, 3]))
print("\n=== ToT ===")
print(tot_24([4, 8, 6, 3]))
# Valid solutions for [4, 8, 6, 3] include:
#   (8 - 6/3) * 4 = 24
#   (6 + 8/4) * 3 = 24
# Near misses to watch for in model output:
#   (8 - 6 + 4) * 3 = 18
#   4 * (8 - (6 - 3)) = 20
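Because both functions return free text, it helps to verify any final equation mechanically rather than trust the model's arithmetic. A minimal checker, assuming the equation has already been pulled out of the response as a plain expression string; `check_24` is a hypothetical helper name. It walks the parsed expression with `ast` instead of calling `eval`, and uses `Fraction` so division stays exact.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node, used):
    """Evaluate a + - * / expression tree exactly, recording literals used."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return OPS[type(node.op)](left, right)
    raise ValueError("only integers and + - * / are allowed")

def check_24(expr, numbers, target=24):
    """True if expr uses each given number exactly once and equals target."""
    used = []
    try:
        value = _eval(ast.parse(expr, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target

if __name__ == "__main__":
    for candidate in ["(6+8/4)*3", "(8-6/3)*4", "(8-6+4)*3", "4*6"]:
        print(candidate, check_24(candidate, [4, 8, 6, 3]))
```

The same check can serve as the success criterion when comparing CoT and ToT over many puzzles.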
3.3 Refining the few-shot format
import json


def build_few_shot(examples, query, format_style="qa"):
    """
    examples: list of (input, output) pairs
    format_style: 'qa', 'arrow', 'json', 'xml'
    """
    if format_style == "qa":
        prefix = "\n\n".join(f"Q: {i}\nA: {o}" for i, o in examples)
        return f"{prefix}\n\nQ: {query}\nA:"
    elif format_style == "arrow":
        prefix = "\n".join(f"{i} → {o}" for i, o in examples)
        return f"{prefix}\n{query} →"
    elif format_style == "xml":
        prefix = "\n".join(
            f"<example><input>{i}</input><output>{o}</output></example>"
            for i, o in examples
        )
        return f"{prefix}\n<input>{query}</input>\n<output>"
    elif format_style == "json":
        prefix = json.dumps([{"input": i, "output": o} for i, o in examples])
        return f"Examples: {prefix}\nNew input: {query}\nOutput:"
    # Fail loudly instead of silently returning None for an unknown style.
    raise ValueError(f"unknown format_style: {format_style!r}")
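Anthropic recommends XML tags for structuring prompts to Claude. Its documentation also notes that there are no canonical "best" tag names, so pick names that describe their contents.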
4. Anthropic API best practices
4.1 Claude prefers XML structure
Anthropic's official documentation is explicit: wrap content in tags such as <example>, <context>, and <instruction>.
prompt = """<context>
You are a financial analyst at JPMorgan.
</context>
<examples>
<example>
<input>Revenue: $5B, Expenses: $4B</input>
<output>Profit margin: 20%</output>
</example>
</examples>
<query>
{user_question}
</query>"""
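4.2 Set the role in the system prompt, not the user prompt
# Less effective: role stuffed into the user turn
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "You are a doctor. Diagnose this..."}
    ],
)
# Preferred: set the role through the system parameter
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="You are a board-certified physician...",
    messages=[{"role": "user", "content": "Symptoms: ..."}],
)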
4.3 Few-shot as multi-turn messages
A cleaner approach: put the examples in messages as conversation history:
client.messages.create(
    model="claude-opus-4-7",
    max_tokens=50,  # required by the Messages API
    system="You classify financial transactions.",
    messages=[
        {"role": "user", "content": "Tx: Wire to Cayman LLC $5M"},
        {"role": "assistant", "content": "Risk level: HIGH"},
        {"role": "user", "content": "Tx: Coffee shop $4.50"},
        {"role": "assistant", "content": "Risk level: LOW"},
        {"role": "user", "content": "Tx: Crypto exchange $50K"},  # the real query
    ]
)
The model then treats the examples as its own earlier replies, which often works better than examples formatted as plain text.
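Writing the turns by hand gets tedious once the example set changes often. A small helper can build the list from (input, output) pairs; `few_shot_messages` is a hypothetical name, and the output is the plain list of dicts that the `messages` parameter expects.

```python
def few_shot_messages(examples, query):
    """Turn (input, output) pairs into alternating user/assistant turns,
    ending with the real query as the final user turn."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

# The example pairs from the transaction classifier in these notes.
msgs = few_shot_messages(
    [("Tx: Wire to Cayman LLC $5M", "Risk level: HIGH"),
     ("Tx: Coffee shop $4.50", "Risk level: LOW")],
    "Tx: Crypto exchange $50K",
)
for m in msgs:
    print(m["role"], "|", m["content"])
```

Keeping the examples as data also makes it easy to swap or reorder them when testing which ones help.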
5. Applications in finance
Case: few-shot CoT for trade-intent classification
EXAMPLES = [
    ("BUY 100 AAPL @ MARKET",
     "User wants to immediately purchase 100 shares of Apple at current market price. Action: place market buy order. Risk: high slippage on illiquid stocks."),
    ("Sell half my Tesla position",
     "User wants to liquidate 50% of their TSLA holding. Action: query portfolio for TSLA shares, divide by 2, place sell order."),
]


def classify_intent(user_msg):
    """Build the few-shot CoT prompt for intent analysis.

    Returns the prompt string only; send it through the Messages API to get
    the model's analysis.
    """
    examples_str = "\n".join(
        f"<example><input>{i}</input><analysis>{o}</analysis></example>"
        for i, o in EXAMPLES
    )
    prompt = f"""<examples>{examples_str}</examples>
<input>{user_msg}</input>
<analysis>"""
    return prompt
Earnings-report extraction: Self-Consistency vs. single-shot
from collections import Counter


def extract_revenue(report_text, n_samples=5):
    """Extract the revenue figure from an earnings report by multi-sample voting."""
    answers = []
    for _ in range(n_samples):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=200,
            temperature=0.7,
            system="Extract Q3 revenue in USD millions. Output only the number.",
            messages=[{"role": "user", "content": report_text}]
        )
        try:
            # Drop thousands separators so "1,234.5" parses as a float.
            answers.append(float(resp.content[0].text.strip().replace(",", "")))
        except ValueError:
            pass  # skip samples that are not a bare number
    if not answers:
        return None
    # Majority vote
    return Counter(answers).most_common(1)[0][0]
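Production decision: five samples cost 5x, but if accuracy rises from 92% to 99% (illustrative figures), it is worth it in finance, where the damage from one error exceeds the 5x cost.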
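That trade-off can be made explicit with a small expected-cost calculation. The sketch below uses the 92% and 99% accuracies from the sentence above; `call_cost` and `error_cost` are made-up placeholder values, and `expected_cost` and `break_even_error_cost` are hypothetical helper names.

```python
def expected_cost(n_samples, accuracy, call_cost, error_cost):
    """Expected cost per document: API spend plus expected damage from a wrong answer."""
    return n_samples * call_cost + (1 - accuracy) * error_cost

def break_even_error_cost(n_single, acc_single, n_voted, acc_voted, call_cost):
    """Error cost above which the multi-sample setup is cheaper overall."""
    extra_spend = (n_voted - n_single) * call_cost
    return extra_spend / (acc_voted - acc_single)

# Placeholder prices: $0.01 per call, $100 damage per wrong extraction.
single = expected_cost(1, 0.92, call_cost=0.01, error_cost=100.0)
voted = expected_cost(5, 0.99, call_cost=0.01, error_cost=100.0)
threshold = break_even_error_cost(1, 0.92, 5, 0.99, call_cost=0.01)
print(f"single: {single:.2f}  voted: {voted:.2f}  break-even error cost: {threshold:.2f}")
```

Under these placeholder numbers, voting pays off once a single error costs more than roughly 57 calls' worth of API spend; the conclusion flips for cheap errors or smaller accuracy gains.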
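6. Common pitfalls
- Few-shot example quality is decisive: examples containing errors teach the model the wrong behavior. Rule of thumb: five bad examples are worse than none.
- "Let's think step by step" is not required for Claude 4.7: Claude already reasons step by step by default in most scenarios. Extended thinking is stronger still.
- Self-Consistency only works for tasks with a single "standard answer": creative writing has no notion of a majority vote.
- ToT is expensive and easy to over-engineer: CoT is enough for roughly 80% of cases.
- Confusing few-shot with long context: stuffing 100 examples into the context is wasteful. Models attend most to content near the start and end of the context (the "lost in the middle" effect, Liu et al. 2023), so the 5-10 examples at either end carry most of the weight.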
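7. Quick reference
Prompt technique checklist
- Set the role in the system prompt
- Structure with XML tags (Anthropic)
- Put examples first and the query last
- Trigger CoT with "Let's think step by step" or an instruction in the system prompt
- State the output format explicitly ("Output only JSON")
- Put important instructions at the beginning and end to avoid lost-in-the-middle
- T=0 for factual/structured tasks, T=0.7+ for creative tasks
Task-to-mode mapping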
Math reasoning → CoT or Extended Thinking
Classification → Few-shot (3-5 examples)
Code generation → Zero-shot + clear spec
Data extraction → Few-shot CoT or Self-Consistency
Long-form writing → Zero-shot with detailed system
Multi-step planning → ReAct or Extended Thinking
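8. Interview questions
Q1: How do you choose between few-shot and fine-tuning?
Fewer than 100 examples → few-shot; 100-1000 → try both; more than 1000 → fine-tune. Few-shot advantages: fast iteration, no infrastructure. Disadvantages: higher cost per inference, limited context. Fine-tuning is the reverse.
Q2: What is the statistical principle behind Self-Consistency?
Assume each sample is correct independently with probability p > 0.5 and the answer is binary. The probability that the majority is correct then approaches 1 quickly as the number of samples grows (Condorcet's jury theorem). When p < 0.5, voting makes things worse. So the base model must beat chance for voting to help.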
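The claim in Q2 is a binomial tail sum and is easy to check numerically. A minimal sketch under the same idealized assumptions (independent samples, binary right/wrong outcome, odd n so there are no ties); `majority_correct` is a hypothetical helper name.

```python
from math import comb

def majority_correct(p, n):
    """P(more than half of n independent samples are correct), for odd n."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

if __name__ == "__main__":
    for p in (0.4, 0.6, 0.7):
        row = "  ".join(f"n={n}: {majority_correct(p, n):.3f}" for n in (1, 5, 41))
        print(f"p={p}  {row}")
```

For p = 0.7, five samples lift the majority's accuracy to about 0.837; for p = 0.4, five samples lower it to about 0.317. Real LLM samples are correlated and answers are usually multi-valued, so treat these numbers as an upper-bound intuition rather than a prediction.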
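Q3: Why does Anthropic recommend XML tags while OpenAI recommends markdown?
A commonly cited explanation is training-data preference, though neither vendor documents the cause. Anthropic's docs recommend XML tags to delimit prompt sections; OpenAI's examples lean more on markdown headers. In practice both work with either model family; this is not a hard rule.
Q4: When is ToT actually worth using?
When the problem has intermediate states that can be evaluated (Game of 24, puzzles, code search, writing outlines). If intermediate states are hard to evaluate (e.g., dialogue), ToT is worse than CoT + self-consistency. Cost is typically 10-50x, so work out the ROI.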
9. Tomorrow's preview
Day 126: Structured Output: JSON mode, function calling, constrained decoding.