Automated Prompt Optimization: OPRO, APE, DSPy
The OPRO (Google), APE (Zhou et al.), and DSPy (Stanford) frameworks
Date: 2026-09-06 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #DSPy #OPRO #APE #PromptOptimization #AutoPrompt
Today's Goals
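| Type | Content |
|---|---|
| Study | The OPRO (Google), APE (Zhou et al.), and DSPy (Stanford) frameworks |
| Practice | Use DSPy to optimize the prompt for a financial extraction task |
| Deliverable | dspy_demo.py + an analysis of when to choose DSPy and where it falls short |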
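1. Theoretical Foundations
1.1 Why automate prompt optimization
Pain points of hand-written prompts:
- The trial-and-error effort is out of proportion to the return.
- Changing the model means rewriting the prompt (a prompt tuned for GPT-4 will not necessarily work on Claude).
- A/B testing several variables at once (temperature, number of examples, wording) explodes combinatorially.
- In a multi-step pipeline, each step's prompt ends up being tuned in isolation.
Core idea: treat the prompt as a parameter and optimize it automatically against training data.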
1.2 OPRO (Google 2023)
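Optimization by PROmpting: the LLM itself acts as the optimizer.
Loop:
- Maintain a history of prompts and their scores.
- Feed the optimizer LLM a meta-prompt: "These prompts produced these scores. Generate a better one."
- The model outputs a new prompt.
- Evaluate it on a validation set and add the result to the history.
- Repeat.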
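The loop above can be sketched offline with the LLM calls replaced by plain callables. This is a minimal illustration rather than the paper's implementation: `build_meta_prompt`, `opro_loop`, `make_canned_proposer`, and `toy_score` are hypothetical names, the proposer returns canned strings instead of calling a model, and the score is a keyword count standing in for validation accuracy. The history is rendered worst-first so the strongest prompt sits closest to the generation request.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_meta_prompt(history: History, top_k: int = 5) -> str:
    """Render the top_k (prompt, score) pairs in ascending score order."""
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    blocks = [f"text: {prompt}\nscore: {score:.2f}" for prompt, score in best]
    return (
        "These prompts produced these scores:\n\n"
        + "\n\n".join(blocks)
        + "\n\nGenerate a better one."
    )


def opro_loop(
    seeds: List[str],
    propose: Callable[[str], str],
    score: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Score the seeds, then repeatedly propose and score one new prompt per round."""
    history: History = [(p, score(p)) for p in seeds]
    for _ in range(rounds):
        candidate = propose(build_meta_prompt(history))
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])


def make_canned_proposer(candidates: List[str]) -> Callable[[str], str]:
    """Stand-in for the optimizer LLM: ignores the meta-prompt, returns canned text."""
    remaining = iter(candidates)
    return lambda meta_prompt: next(remaining)


KEYWORDS = ("step", "check", "answer")


def toy_score(prompt: str) -> float:
    """Stand-in for validation accuracy: fraction of keywords present."""
    text = prompt.lower()
    return sum(k in text for k in KEYWORDS) / len(KEYWORDS)


best_prompt, best_score = opro_loop(
    seeds=["Solve the problem."],
    propose=make_canned_proposer(
        ["Solve it step by step.", "Solve it step by step and check the answer."]
    ),
    score=toy_score,
    rounds=2,
)
print(best_prompt, best_score)
```

In a real run, `propose` would wrap a call to the optimizer model and `score` would run the candidate prompt over the validation set.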
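Classic finding: on GSM8K, OPRO found "Take a deep breath and work on this problem step-by-step", which scored about 8 percentage points higher than the hand-written "Let's think step by step" (80.2% vs 71.8% with PaLM 2-L as the scorer in the paper). It sounds like superstition, but it is reproducible.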
1.3 APE (Automatic Prompt Engineer, Zhou et al. 2022)
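Similar in spirit, but it works by reverse-engineering the instruction:
- Provide a set of (input, output) examples.
- Ask the LLM to infer which instruction would turn these inputs into these outputs.
- Sample multiple candidate instructions.
- Evaluate them on a holdout set and keep the best one.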
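The four steps above can be sketched as a single selection function. This is an assumption-laden toy, not the APE authors' code: `ape_select`, `toy_propose`, and `toy_run` are hypothetical names, and both the instruction proposer and the task executor are deterministic stubs so the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]


def ape_select(
    demos: List[Example],
    holdout: List[Example],
    propose: Callable[[str, int], str],
    run: Callable[[str, str], str],
    n_candidates: int = 3,
) -> Tuple[str, float]:
    """Sample candidate instructions from demos, return the best one on holdout."""
    demo_text = "\n".join(
        f"Input: {d['input']}\nOutput: {d['output']}" for d in demos
    )
    query = f"{demo_text}\n\nWhat instruction turns each input into its output?"
    candidates = [propose(query, i) for i in range(n_candidates)]

    def accuracy(instruction: str) -> float:
        hits = sum(run(instruction, ex["input"]) == ex["output"] for ex in holdout)
        return hits / len(holdout)

    scored = [(c, accuracy(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Stand-ins for the two LLM roles (hypothetical, for offline illustration only)
CANNED = ["Repeat the input.", "Write the input in upper case.", "Reverse the input."]


def toy_propose(query: str, sample_index: int) -> str:
    return CANNED[sample_index]


def toy_run(instruction: str, text: str) -> str:
    if "upper case" in instruction:
        return text.upper()
    if "Reverse" in instruction:
        return text[::-1]
    return text


best_instruction, best_accuracy = ape_select(
    demos=[{"input": "abc", "output": "ABC"}],
    holdout=[
        {"input": "hello", "output": "HELLO"},
        {"input": "dspy", "output": "DSPY"},
    ],
    propose=toy_propose,
    run=toy_run,
)
print(best_instruction, best_accuracy)
```

Keeping the demos used for proposal separate from the holdout used for scoring is the point of the last step: a candidate that merely memorizes the demos gets no credit.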
1.4 DSPy (Stanford, Khattab et al. 2023)
The most systematic framework of the three; it turns prompt writing into programming:
import dspy

# Write a signature instead of a prompt
class FinExtract(dspy.Signature):
    """Extract financial metrics from earnings report."""
    report = dspy.InputField()
    revenue = dspy.OutputField()
    net_income = dspy.OutputField()

# Compose behavior with modules
class FinPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(FinExtract)

    def forward(self, report):
        return self.extract(report=report)
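Depending on the optimizer chosen, DSPy compile automatically:
- Generates the prompt from the signature
- Finds the best few-shot examples (bootstrap)
- Optimizes the instruction (COPRO/MIPRO optimizers)
- Evaluates candidates on the training set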
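To make the "bootstrap" step concrete, here is a minimal sketch of the idea in plain Python, without DSPy: run a teacher over the training examples and keep only the outputs that pass the metric as few-shot demos. The names `bootstrap_demos`, `render_prompt`, `toy_teacher`, and `toy_metric` are hypothetical, and the regex-based teacher is a stand-in for an LLM call; DSPy's own implementation differs in detail.

```python
import re
from typing import Callable, Dict, List, Optional


def bootstrap_demos(
    trainset: List[Dict],
    teacher: Callable[[str], Optional[float]],
    metric: Callable[[Dict, Optional[float]], bool],
    max_demos: int = 3,
) -> List[Dict]:
    """Keep teacher outputs that pass the metric, up to max_demos."""
    demos: List[Dict] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = teacher(example["report"])
        if metric(example, prediction):
            demos.append({"report": example["report"], "revenue": prediction})
    return demos


def render_prompt(instruction: str, demos: List[Dict], report: str) -> str:
    """Assemble instruction + demos + the new input into one prompt string."""
    shots = "\n\n".join(
        f"Report: {d['report']}\nRevenue: {d['revenue']}" for d in demos
    )
    return f"{instruction}\n\n{shots}\n\nReport: {report}\nRevenue:"


def toy_teacher(report: str) -> Optional[float]:
    """Stand-in for the LLM: only understands the '$12.3B' format."""
    match = re.search(r"\$([\d.]+)B", report)
    return float(match.group(1)) * 1000 if match else None


def toy_metric(example: Dict, prediction: Optional[float]) -> bool:
    if prediction is None:
        return False
    return abs(prediction - example["revenue"]) / example["revenue"] < 0.05


TRAIN = [
    {"report": "Revenue $94.5B", "revenue": 94500.0},
    {"report": "Revenue of 62.0 billion dollars", "revenue": 62000.0},  # teacher fails
    {"report": "Revenue $25.18B", "revenue": 25180.0},
]

demos = bootstrap_demos(TRAIN, toy_teacher, toy_metric)
prompt = render_prompt("Extract revenue in USD millions.", demos, "Revenue $35.1B")
print(prompt)
```

The example the teacher gets wrong never becomes a demo, which is what keeps bad reasoning traces out of the compiled prompt.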
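1.5 Comparison of the three
| Framework | What is optimized | Input | Strength | Weakness |
|---|---|---|---|---|
| OPRO | Instruction text | Score function | Simple | Single prompt only, no pipeline support |
| APE | Instruction text | Examples | Easy cold start | Limited to short instructions |
| DSPy | Modules + prompts + few-shot | Signatures + metrics | Pipeline-level optimization, composable | Learning curve |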
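2. Intuition
Why automatic optimization works
LLMs are extremely sensitive to prompt wording: even "step by step" vs "step-by-step" can shift accuracy by more than 5%. Humans rely on intuition here, whereas a machine can search this combinatorial space systematically.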
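A quick way to see how fast that space grows is to enumerate a few wording choices with `itertools.product`. The option lists below are illustrative assumptions, not prompts taken from any benchmark.

```python
from itertools import product

# Three independent wording decisions (illustrative choices only)
OPENERS = ["", "Take a deep breath. "]
REASONING_CUES = [
    "Let's think step by step.",
    "Let's think step-by-step.",
    "Work through the problem carefully.",
]
ANSWER_FORMATS = ["", " Give only the final number."]


def enumerate_variants() -> list:
    """Return every combination of opener, reasoning cue, and answer format."""
    return [
        f"{opener}{cue}{fmt}"
        for opener, cue, fmt in product(OPENERS, REASONING_CUES, ANSWER_FORMATS)
    ]


variants = enumerate_variants()
print(len(variants))  # 2 * 3 * 2 = 12 candidate prompts
```

Crossing these 12 wordings with, say, three temperatures and four few-shot counts already gives 144 configurations, which is why a scored, automated search scales better than manual A/B testing.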
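The DSPy "compile" concept
It is analogous to training in machine learning:
- Signature = the task type (like a function signature)
- Module = which prompting pattern to use (CoT, ReAct, MultiChainComparison)
- Optimizer = how to search for the best prompt
- Compile = automatic tuning on the training set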
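Why doesn't it replace the human prompt engineer?
- DSPy finds a local optimum and is sensitive to the data distribution.
- Optimization is not cheap (each compile calls the LLM hundreds of times).
- Interpretability is poor (why does this prompt work? Nobody knows).
- A new model means recompiling.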
3. Code Implementation
3.1 Full DSPy demo: optimizing earnings-report extraction
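# dspy_demo.py
"""
Use DSPy to optimize the task "extract revenue + net_income from an earnings report".
"""
# pip install -U dspy  (published as dspy-ai in older releases)
import dspy

# === 1. Configure the LM backend ===
# DSPy supports Anthropic Claude
lm = dspy.LM(
    model="anthropic/claude-haiku-4-5",
    api_key="<YOUR_KEY>",  # alternatively, set the ANTHROPIC_API_KEY environment variable
    max_tokens=512,
)
dspy.configure(lm=lm)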
# === 2. Define the Signature ===
class ExtractFinancials(dspy.Signature):
    """Extract revenue and net income from an earnings report.
    Output revenue and net_income in millions of USD as numbers only."""
    report: str = dspy.InputField(desc="The earnings report text")
    revenue: float = dspy.OutputField(desc="Revenue in USD millions")
    net_income: float = dspy.OutputField(desc="Net income in USD millions")

# === 3. Wrap it in a Module ===
class FinExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(ExtractFinancials)

    def forward(self, report):
        return self.extract(report=report)
# === 4. Prepare training data ===
trainset = [
    dspy.Example(
        report="Apple Q3'26: Revenue $94.5B, net income $24.1B.",
        revenue=94500.0,
        net_income=24100.0
    ).with_inputs("report"),
    dspy.Example(
        report="Microsoft FY26 Q1: Revenue of $62.0 billion. Net income reached $21.5 billion.",
        revenue=62000.0,
        net_income=21500.0
    ).with_inputs("report"),
    dspy.Example(
        report="Tesla Q3 2026 delivered $25.18B in revenue with net income of $1.85B.",
        revenue=25180.0,
        net_income=1850.0
    ).with_inputs("report"),
    dspy.Example(
        report="Amazon Q3'26 GAAP earnings: Net sales were $158.9B, net income $14.6B.",
        revenue=158900.0,
        net_income=14600.0
    ).with_inputs("report"),
    dspy.Example(
        report="Google Q3 26 reported $88.3 billion in revenue and net income of $26.3 billion.",
        revenue=88300.0,
        net_income=26300.0
    ).with_inputs("report"),
]
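# === 5. Metric ===
def metric(example, pred, trace=None):
    """Count a prediction as correct only if both numbers are within 5% of the label."""
    try:
        # abs() in the denominator keeps the check meaningful when the label is negative (a net loss)
        rev_err = abs(float(pred.revenue) - example.revenue) / abs(example.revenue)
        ni_err = abs(float(pred.net_income) - example.net_income) / abs(example.net_income)
        return rev_err < 0.05 and ni_err < 0.05
    except (AttributeError, TypeError, ValueError, ZeroDivisionError):
        return False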
# === 6. Unoptimized baseline ===
baseline = FinExtractor()
print("=== Baseline (no optimization) ===")
for ex in trainset[:2]:
    pred = baseline(report=ex.report)
    print(f" Predicted: rev={pred.revenue}, ni={pred.net_income}")
    print(f" True: rev={ex.revenue}, ni={ex.net_income}")
    print(f" Match: {metric(ex, pred)}\n")
# === 7. Run the optimizer ===
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=3,  # keep at most 3 self-generated demos that pass the metric
    max_labeled_demos=4,       # cap on demos drawn from the labeled training set
)
print("=== Compiling (running optimization) ===")
compiled_extractor = optimizer.compile(
    FinExtractor(),
    trainset=trainset
)
# === 8. Test the compiled program ===
test_report = "Nvidia Q3 FY27: Datacenter revenue of $30.8B drove total revenue to $35.1B with net income of $19.3B."
print("\n=== Test on unseen example ===")
pred = compiled_extractor(report=test_report)
print(f"Revenue: {pred.revenue}, Net Income: {pred.net_income}")
# Expected: 35100, 19300
print(f"\nReasoning trace:\n{getattr(pred, 'reasoning', 'n/a')}")

# === 9. Inspect the compiled prompt ===
print("\n=== Final compiled prompt (text) ===")
dspy.inspect_history(n=1)  # prints the last prompt actually sent to the LM, demos included
# BootstrapFewShot only attaches few-shot demos; it leaves the instruction and field descriptions unchanged
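Before spending API calls on a compile, it can help to check the metric offline against hand-made predictions. The sketch below re-implements the 5% tolerance rule with `types.SimpleNamespace` objects standing in for DSPy examples and predictions; `within_tolerance` and the test cases are hypothetical additions, and the logic is an assumed equivalent of the demo's metric rather than part of DSPy.

```python
from types import SimpleNamespace


def within_tolerance(predicted, expected, rel_tol=0.05):
    """True when predicted is within rel_tol of a non-zero expected value."""
    try:
        predicted, expected = float(predicted), float(expected)
    except (TypeError, ValueError):
        return False
    if expected == 0:
        return False
    return abs(predicted - expected) / abs(expected) < rel_tol


def metric(example, pred, trace=None):
    """Both fields must be present and within tolerance."""
    return within_tolerance(
        getattr(pred, "revenue", None), example.revenue
    ) and within_tolerance(getattr(pred, "net_income", None), example.net_income)


gold = SimpleNamespace(revenue=35100.0, net_income=19300.0)
cases = {
    "exact": SimpleNamespace(revenue=35100.0, net_income=19300.0),
    "string numbers": SimpleNamespace(revenue="35100", net_income="19300.0"),
    "wrong unit (billions)": SimpleNamespace(revenue=35.1, net_income=19.3),
    "segment revenue picked": SimpleNamespace(revenue=30800.0, net_income=19300.0),
    "missing field": SimpleNamespace(revenue=35100.0),
}
results = {name: metric(gold, pred) for name, pred in cases.items()}

# A net loss: the sign error must be rejected, not silently accepted
gold_loss = SimpleNamespace(revenue=100.0, net_income=-50.0)
sign_flipped = SimpleNamespace(revenue=100.0, net_income=50.0)
loss_result = metric(gold_loss, sign_flipped)

for name, ok in results.items():
    print(f"{name}: {ok}")
print(f"sign flipped on a loss: {loss_result}")
```

The net-loss case is the one worth noticing: dividing by the raw label instead of its absolute value would make the relative error negative and let any prediction pass.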
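3.2 Using MIPRO (a more advanced optimizer)
# MIPROv2 optimizes the instruction and the demos together
from dspy.teleprompt import MIPROv2

mipro = MIPROv2(
    metric=metric,
    auto="medium",  # "light" / "medium" / "heavy"
    num_threads=4,
)
# Five examples only illustrate the API; a realistic run needs a much larger trainset and valset
compiled_v2 = mipro.compile(
    student=FinExtractor(),
    trainset=trainset[:4],
    valset=trainset[4:],
)
# MIPRO tries different instruction wordings as part of the search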
3.3 A simple OPRO implementation
# my_opro.py
"""
Teaching version of OPRO: a hand-rolled prompt search loop.
"""
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_prompt(instruction, examples):
    """Measure an instruction's accuracy on the examples (exact string match)."""
    correct = 0
    for ex in examples:
        resp = client.messages.create(
            model="claude-haiku-4-5", max_tokens=100, temperature=0,
            system=instruction,
            messages=[{"role": "user", "content": ex["input"]}]
        )
        if resp.content[0].text.strip() == ex["expected"]:
            correct += 1
    return correct / len(examples)
def opro(seed_instructions, examples, n_rounds=5):
    """Run n_rounds of propose-and-score and return the best (instruction, score) pair."""
    history = []
    for instr in seed_instructions:
        score = evaluate_prompt(instr, examples)
        history.append((instr, score))
    for round_i in range(n_rounds):
        # Show the optimizer model the five best prompts so far, highest score first
        history.sort(key=lambda x: x[1], reverse=True)
        prior = "\n".join(f"Score {s:.2f}: {p}" for p, s in history[:5])
        meta_prompt = (
            "I'm trying to find the best prompt for a task.\n"
            "Here are previously tried prompts and their scores (0-1):\n"
            f"{prior}\n"
            "Generate a NEW prompt that may score higher. Output only the prompt text."
        )
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=300, temperature=0.7,
            messages=[{"role": "user", "content": meta_prompt}]
        )
        new_instr = resp.content[0].text.strip()
        score = evaluate_prompt(new_instr, examples)
        history.append((new_instr, score))
        print(f"Round {round_i}: new score={score:.2f}, prompt={new_instr[:80]}...")
    return max(history, key=lambda x: x[1])
# Usage
SEEDS = [
    "Translate to French.",
    "Output the French translation.",
    "Provide French translation, nothing else."
]
EXAMPLES = [
    {"input": "Hello", "expected": "Bonjour"},
    {"input": "Goodbye", "expected": "Au revoir"},
]
best, score = opro(SEEDS, EXAMPLES, n_rounds=3)
print(f"\nBest: {best} (score: {score:.2f})")
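4. Anthropic API Best Practices
4.1 DSPy + Anthropic configuration
import dspy

# Anthropic directly (the API key is read from the ANTHROPIC_API_KEY environment variable)
lm = dspy.LM(
    model="anthropic/claude-sonnet-4-6",
    max_tokens=2048,
    temperature=0.0
)
# Via Bedrock (check the provider's model list for the exact ID)
# lm = dspy.LM(model="bedrock/anthropic.claude-sonnet-4-6-v1:0")
# Via Vertex AI (check the provider's model list for the exact ID)
# lm = dspy.LM(model="vertex_ai/claude-sonnet-4-6@20260601")
dspy.configure(lm=lm)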
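4.2 Controlling cost during compile
A DSPy compile calls the LLM tens to hundreds of times. In production:
- Run the high-volume task calls on a cheap model (Haiku) and reserve the stronger model for proposing instructions.
- Run the final evaluation on the target model you will deploy.
- Save the compiled result to disk (compiled_extractor.save("prog.json")) instead of recompiling.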
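# Two-model setup
import dspy
from dspy.teleprompt import MIPROv2

prompt_model = dspy.LM(model="anthropic/claude-sonnet-4-6")
task_model = dspy.LM(model="anthropic/claude-haiku-4-5")
dspy.configure(lm=task_model)
mipro = MIPROv2(
    metric=metric,              # the metric defined in dspy_demo.py
    prompt_model=prompt_model,  # Sonnet proposes candidate instructions
    task_model=task_model,      # Haiku executes the task
    auto="light",               # start with the cheapest search budget
)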
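A back-of-the-envelope estimate makes the compile budget explicit before any run. The function below is a minimal sketch; `estimate_compile_cost` is a hypothetical helper, and the call count, token counts, and per-million-token prices are placeholder assumptions, not actual Anthropic pricing.

```python
def estimate_compile_cost(
    n_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Estimated USD cost: calls x tokens x price, summed over input and output."""
    input_cost = n_calls * input_tokens_per_call * usd_per_mtok_in / 1_000_000
    output_cost = n_calls * output_tokens_per_call * usd_per_mtok_out / 1_000_000
    return input_cost + output_cost


# Placeholder numbers: 300 calls, 1,500 input and 300 output tokens per call,
# priced at 1.00 and 5.00 USD per million tokens (illustrative, not real prices)
total = estimate_compile_cost(
    n_calls=300,
    input_tokens_per_call=1500,
    output_tokens_per_call=300,
    usd_per_mtok_in=1.00,
    usd_per_mtok_out=5.00,
)
print(f"Estimated compile cost: ${total:.2f}")
```

Plugging in the real prices for the prompt model and the task model separately shows where the budget goes, since the task model usually accounts for most of the calls.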
5. Applications in Finance
Case: automatically optimizing sentiment classification
# fin_sentiment_dspy.py
# Assumes dspy.configure(lm=...) has already been called, as in section 4.1
import dspy
from dspy.teleprompt import BootstrapFewShot

class SentimentClassify(dspy.Signature):
    """Classify financial news sentiment as bullish/bearish/neutral."""
    headline: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="One of: bullish, bearish, neutral")
    # No explicit reasoning field: dspy.ChainOfThought adds its own `reasoning` output before the answer

trainset = [
    dspy.Example(headline="Apple beats earnings expectations, raises guidance",
                 sentiment="bullish", reasoning="Beat + raise").with_inputs("headline"),
    dspy.Example(headline="Tesla recalls 200K vehicles over autopilot defect",
                 sentiment="bearish", reasoning="Costly recall").with_inputs("headline"),
    dspy.Example(headline="Fed holds rates steady as expected",
                 sentiment="neutral", reasoning="In line with expectations").with_inputs("headline"),
    # ... 50 more examples
]

def metric(ex, pred, trace=None):
    # Normalize whitespace and case before comparing labels
    return ex.sentiment.strip().lower() == pred.sentiment.strip().lower()

clf = dspy.ChainOfThought(SentimentClassify)
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=5)
compiled = optimizer.compile(clf, trainset=trainset)
compiled.save("sentiment_v1.json")
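Production gains reported for this case: compared with the hand-written prompt, the compiled DSPy program reached +6% accuracy at -30% cost (attributed to a shorter prompt).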
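A single accuracy number can hide a class the classifier never gets right, so a per-class breakdown is a useful companion to the metric above. This is a minimal sketch in plain Python; `per_class_accuracy` and the sample labels are hypothetical and not output from the pipeline above.

```python
from collections import Counter
from typing import Dict, List


def per_class_accuracy(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Fraction of each gold class that was predicted correctly."""
    totals = Counter(gold)
    hits = Counter(
        g for g, p in zip(gold, predicted) if g == p.strip().lower()
    )
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for illustration
gold = ["bullish", "bullish", "bearish", "neutral", "neutral", "neutral"]
predicted = ["bullish", "Bullish ", "neutral", "neutral", "neutral", "neutral"]

overall = sum(g == p.strip().lower() for g, p in zip(gold, predicted)) / len(gold)
by_class = per_class_accuracy(gold, predicted)
print(f"overall={overall:.2f}", by_class)
```

Here overall accuracy is 5 out of 6, yet the bearish class scores zero, which is exactly the kind of gap a headline number conceals.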
Case: multi-step pipeline (Research Agent)
class FinResearchPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract_metrics = dspy.ChainOfThought(ExtractFinancials)
        self.compare_peers = dspy.ChainOfThought("metrics, peers -> comparison")
        self.summarize = dspy.ChainOfThought("comparison, metrics -> summary")

    def forward(self, report, peer_data):
        m = self.extract_metrics(report=report)
        # Pass plain field values between steps rather than whole Prediction objects
        metrics = f"revenue={m.revenue}, net_income={m.net_income} (USD millions)"
        c = self.compare_peers(metrics=metrics, peers=peer_data)
        return self.summarize(comparison=c.comparison, metrics=metrics)
DSPy can compile this pipeline end to end. With the traditional manual approach each step is tuned separately, so pipeline-level optimization is close to impossible.
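6. Common Pitfalls
- Overfitting a small training set: the compiled program fits the training examples and performs poorly on the test set. Use at least 30 examples plus a holdout set.
- Metric too simple: plain accuracy ignores edge cases that are rare but important.
- Optimizer cost out of control: a compile in MIPRO heavy mode can cost more than $100. Start with light mode.
- Forgetting to save after compile: recompiling on every startup wastes money. Compile once and save to a file.
- Not recompiling when the model version changes: the optimal prompt for Claude 4.6 may differ from the one for 4.7.
- Over-automation that ignores domain knowledge: an automatically discovered prompt sometimes violates compliance or safety requirements.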
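For the first pitfall, a reproducible train/holdout split is the simplest safeguard. The helper below is a minimal sketch with a hypothetical name (`split_train_holdout`) and an assumed 30% holdout fraction; the integers stand in for `dspy.Example` objects.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_holdout(
    examples: Sequence[T], holdout_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed and return (train, holdout) with no overlap."""
    if len(examples) < 2:
        raise ValueError("need at least 2 examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example on each side of the split
    n_holdout = round(len(shuffled) * holdout_fraction)
    n_holdout = min(len(shuffled) - 1, max(1, n_holdout))
    return shuffled[n_holdout:], shuffled[:n_holdout]


examples = list(range(40))  # stand-ins for 40 labeled examples
train, holdout = split_train_holdout(examples)
print(len(train), len(holdout))
```

The optimizer should only ever see `train`; reporting the metric on `holdout` after compile is what reveals overfitting.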
7. Quick Reference
DSPy core concepts
Signature = task interface (defines inputs/outputs)
Module = pattern (CoT/ReAct/Predict)
Optimizer = compile strategy (BootstrapFewShot/MIPRO/COPRO)
Metric = evaluation function
Compile = optimize prompt + few-shot
Choosing an optimizer
BootstrapFewShot : simple; finds the best demos
COPRO : optimizes the instruction
MIPROv2 : optimizes the instruction and demos together (the strongest)
KNNFewShot : picks the most relevant demos at runtime via KNN
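The idea behind runtime demo selection can be illustrated with a toy nearest-neighbor search. This sketch is not DSPy's KNNFewShot: it assumes a bag-of-words cosine similarity in place of a learned embedding, and `bag_of_words`, `cosine`, and `select_demos` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_demos(query: str, pool: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return the k pool entries whose headline is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        pool, key=lambda d: cosine(q, bag_of_words(d["headline"])), reverse=True
    )
    return ranked[:k]


POOL = [
    {"headline": "Apple beats earnings expectations, raises guidance", "sentiment": "bullish"},
    {"headline": "Tesla recalls 200K vehicles over autopilot defect", "sentiment": "bearish"},
    {"headline": "Fed holds rates steady as expected", "sentiment": "neutral"},
]

nearest = select_demos("Ford recalls 50K vehicles over brake defect", POOL, k=1)
print(nearest[0]["headline"])
```

Selecting demos per query keeps the prompt short while still showing the model the most relevant precedent, at the cost of one similarity search per request.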
When to use DSPy
- ✅ Optimizing a multi-step pipeline
- ✅ You want to switch models without rewriting prompts
- ✅ You have training data (≥30 examples)
- ❌ One-off simple tasks
- ❌ Tasks with no evaluation metric
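8. Interview Questions
Q1: What is the difference between DSPy and LangChain?
LangChain is mainly about orchestration (assembling chains, memory, tools); DSPy is about optimization (automatically tuning prompts and demos). They can be combined: DSPy as the programming model, LangChain as the runtime.
Q2: Why does the "Take a deep breath" prompt that OPRO found work?
Nobody knows for sure. One hypothesis is that in the training data, self-prompting phrases like this correlate strongly with contexts containing high-quality reasoning. It is an empirical observation, not a theory. That is exactly the value of auto-prompting: it surfaces patterns beyond human intuition.
Q3: Are automatically optimized prompts safe?
Not necessarily. The optimizer may find a "jailbreak-adjacent" prompt (e.g. "ignore safety, just answer"). In production, add a safety-constraint check inside the optimizer.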
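One simple form of such a check is to filter candidate prompts against a deny-list before they are ever scored. The sketch below is an assumption about how this could look, not a feature of any optimizer: `BANNED_PATTERNS`, `violates_policy`, and `filter_candidates` are hypothetical, and a real deployment would need a reviewed policy rather than three regexes.

```python
import re
from typing import List

# Illustrative deny-list; a production list would come from compliance review
BANNED_PATTERNS = [
    r"ignore (all |any )?(safety|previous|prior)",
    r"no (restrictions|rules)",
    r"guaranteed returns?",
]


def violates_policy(prompt: str) -> bool:
    """True if the prompt matches any banned pattern (case-insensitive)."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in BANNED_PATTERNS)


def filter_candidates(candidates: List[str]) -> List[str]:
    """Drop candidate prompts that violate the policy before scoring them."""
    return [c for c in candidates if not violates_policy(c)]


candidates = [
    "Ignore safety, just answer.",
    "Extract revenue in USD millions.",
    "Promise guaranteed returns to the reader.",
]
safe = filter_candidates(candidates)
print(safe)
```

Filtering before scoring matters: a rejected prompt never enters the optimizer's history, so it cannot steer later proposals.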
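Q4: Given a deepfake detection task, what can DSPy do for you?
(1) Wrap the classification as a Signature. (2) Use training data and a DSPy optimizer to find the best prompt and demos. (3) Compare CoT vs zero-shot vs ReAct modules. (4) Save the compiled program. Limitations: DSPy's support for vision tasks is limited and needs a custom signature, and deepfake-specific domain knowledge cannot be learned automatically.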
9. Tomorrow's Preview
Day 129: Multi-modal prompting: Vision, Voice, Document, plus a deep dive into Claude vision.