Expert Day 128

Automated Prompt Optimization: OPRO, APE, DSPy

The OPRO (Google), APE (Zhou et al.), and DSPy (Stanford) frameworks

Date: 2026-09-06
Track: AI Systems Engineering
Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)
Tags: #DSPy #OPRO #APE #PromptOptimization #AutoPrompt


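Today's Goals

Type         Content
Study        The OPRO (Google), APE (Zhou et al.), and DSPy (Stanford) frameworks
Practice     Optimize the prompt for a financial extraction task with DSPy
Deliverable  dspy_demo.py + an analysis of when to choose DSPy and where it falls short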

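1. Theoretical Foundations

1.1 Why automatic prompt optimization is needed

Pain points of hand-written prompts:

  • The trial-and-error effort is out of proportion to the payoff
  • Changing the model means rewriting the prompt (a prompt tuned for GPT-4 will not necessarily work on Claude)
  • A/B testing over several variables (temperature, number of examples, wording) explodes combinatorially
  • In a multi-step pipeline, each step's prompt ends up being tuned in isolation

Core idea: treat the prompt as a parameter and optimize it automatically against training data.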
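To make "the prompt is a parameter" concrete, here is a minimal sketch that searches a small configuration grid. The search space and the `toy_score` formula are invented for illustration; a real scorer would run the model on a training set and compute a metric.

```python
from itertools import product

# Hypothetical search space: every value here is illustrative.
WORDINGS = [
    "Answer briefly.",
    "Think step by step, then answer.",
    "Reply with one word.",
]
NUM_EXAMPLES = [0, 2, 4]
TEMPERATURES = [0.0, 0.7]


def toy_score(wording: str, n_examples: int, temperature: float) -> float:
    """Stand-in for 'run the model on the training set and measure accuracy'.

    The formula is made up so that the search runs offline.
    """
    score = 0.5
    if "step by step" in wording:
        score += 0.2
    score += 0.05 * n_examples
    score -= 0.1 * temperature
    return round(score, 3)


def grid_search():
    """Score every configuration exhaustively and return the best one."""
    configs = list(product(WORDINGS, NUM_EXAMPLES, TEMPERATURES))
    best = max(configs, key=lambda cfg: toy_score(*cfg))
    return len(configs), best, toy_score(*best)


n_configs, best_config, best_score = grid_search()
print(n_configs, best_config, best_score)
```

Even this tiny grid has 3 x 3 x 2 = 18 configurations, and each one costs a full evaluation pass. That growth is why the frameworks below replace exhaustive search with guided search.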

1.2 OPRO (Google 2023)

Optimization by PROmpting: the LLM itself acts as the optimizer.

Loop:

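  1. Maintain a history of prompts and their scores
  2. Meta prompt: "These prompts produced these scores. Generate a better one"
  3. The model outputs a new prompt
  4. Evaluate it on a validation set
  5. Repeat

Classic finding: on GSM8K, OPRO found "Take a deep breath and work on this problem step-by-step", which scored about 8 percentage points higher than the hand-written "Let's think step by step" (80.2% vs. 71.8% with PaLM 2-L in the paper). It sounds like superstition, but it is reproducible.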

1.3 APE (Automatic Prompt Engineer, Zhou et al. 2022)

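Similar in spirit, but it reverse-engineers the instruction:

  1. Provide a set of (input, output) examples
  2. The LLM works backwards: "which instruction would make these examples hold?"
  3. Sample multiple candidate instructions
  4. Evaluate them on a holdout set and pick the best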
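The four steps above can be sketched offline as follows. This is not the authors' implementation: the induction prompt is a loose paraphrase of the style used in the paper, and `fake_propose` / `fake_run` are hypothetical stand-ins for the two LLM calls (proposing instructions, executing an instruction on an input).

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


def build_induction_prompt(demos: List[Example]) -> str:
    """Ask the model to infer the instruction that maps inputs to outputs."""
    pairs = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return (
        "I gave a friend an instruction. Based on it they produced "
        "the following input-output pairs:\n"
        f"{pairs}\n"
        "The instruction was:"
    )


def ape_select(
    demos: List[Example],
    holdout: List[Example],
    propose: Callable[[str, int], List[str]],
    run: Callable[[str, str], str],
    n_candidates: int = 4,
) -> Tuple[str, float]:
    """Sample candidate instructions, score each on the holdout set, keep the best."""
    candidates = propose(build_induction_prompt(demos), n_candidates)

    def accuracy(instruction: str) -> float:
        hits = sum(run(instruction, x) == y for x, y in holdout)
        return hits / len(holdout)

    best = max(candidates, key=accuracy)
    return best, accuracy(best)


# --- Offline stand-ins for the LLM calls (hypothetical) ---
def fake_propose(prompt: str, n: int) -> List[str]:
    pool = [
        "Repeat the word.",
        "Write the word in upper case.",
        "Reverse the word.",
        "Translate the word.",
    ]
    return pool[:n]


def fake_run(instruction: str, text: str) -> str:
    if "upper case" in instruction:
        return text.upper()
    if "Reverse" in instruction:
        return text[::-1]
    return text


demos = [("cat", "CAT"), ("dog", "DOG")]
holdout = [("bird", "BIRD"), ("fish", "FISH")]
best_instruction, best_accuracy = ape_select(demos, holdout, fake_propose, fake_run)
print(best_instruction, best_accuracy)
```

Scoring on a holdout set rather than on the demos themselves is the important design choice: the demos were already used to induce the instruction, so they would give an optimistic score.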

1.4 DSPy (Stanford, Khattab et al. 2023)

The most systematic framework of the three; it turns prompting into programming:

import dspy

# Write a signature instead of a prompt
class FinExtract(dspy.Signature):
    """Extract financial metrics from earnings report."""
    report = dspy.InputField()
    revenue = dspy.OutputField()
    net_income = dspy.OutputField()

# Compose with modules
class FinPipeline(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module's own state first
        self.extract = dspy.ChainOfThought(FinExtract)

    def forward(self, report):
        return self.extract(report=report)

When compiling, DSPy automatically:

  • Generates the prompt
  • Finds the best few-shot examples (bootstrap)
  • Optimizes the instruction (COPRO/MIPRO optimizers)
  • Evaluates on the training set
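The "bootstrap" step is easier to picture with a plain-Python sketch. This is a simplified illustration of the idea, not DSPy's internal code: run the program on training inputs, keep only the outputs that pass the metric, and reuse those as few-shot demos. `toy_program` is a hypothetical stand-in for an LLM-backed module.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: List[Example],
    metric: Callable[[Example, str], bool],
    max_demos: int = 2,
) -> List[Example]:
    """Run the program on training inputs and keep outputs that pass the metric."""
    demos: List[Example] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # The program's own successful output becomes a demonstration.
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def render_prompt(instruction: str, demos: List[Example], question: str) -> str:
    """Assemble instruction + few-shot demos + the new question."""
    shots = "\n\n".join(f"Q: {d['question']}\nA: {d['answer']}" for d in demos)
    return f"{instruction}\n\n{shots}\n\nQ: {question}\nA:"


def toy_program(question: str) -> str:
    """Hypothetical stand-in for an LLM call: it only knows how to add."""
    left, op, right = question.split()
    return str(int(left) + int(right)) if op == "+" else "unknown"


def exact_match(example: Example, prediction: str) -> bool:
    return example["answer"] == prediction


trainset = [
    {"question": "2 + 3", "answer": "5"},
    {"question": "7 - 4", "answer": "3"},  # toy_program fails here, so it is skipped
    {"question": "10 + 1", "answer": "11"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
prompt = render_prompt("Solve the arithmetic problem.", demos, "4 + 4")
print(prompt)
```

The metric acts as a filter: only traces the program got right are allowed to teach the program, which is why a weak metric tends to produce weak demos.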

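1.5 Comparison of the three

Framework  Optimizes                     Input                 Strength                     Weakness
OPRO       Instruction text              Score function        Simple                       Single prompt only, no pipeline
APE        Instruction text              Examples              Easy cold start              Short instructions only
DSPy       Modules + prompts + few-shot  Signatures + metrics  Pipeline-level + composable  Learning curve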

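2. Intuition

Why automatic optimization works

LLMs are extremely sensitive to prompt wording ("step by step" vs. "step-by-step" can differ in accuracy by more than 5%). Humans rely on intuition; a machine can search this combinatorial space systematically.

The DSPy "compile" concept

It resembles training in machine learning:

  • Signature = the task type (like a function signature)
  • Module = "which pattern to use" (CoT, ReAct, MultiChainComparison)
  • Optimizer = "how to find the best prompt"
  • Compile = automatic tuning on the training set

Why doesn't it replace the human prompt engineer?

  • DSPy finds a local optimum and is sensitive to the data distribution
  • Optimization is not cheap (each compile calls the LLM hundreds of times)
  • Poor interpretability ("why does this prompt work?" has no clear answer)
  • A new model requires a recompile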

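3. Code

3.1 Full DSPy demo: optimizing earnings-report extraction

# dspy_demo.py
"""
Use DSPy to optimize the task "extract revenue + net_income from an earnings report".
"""
# pip install dspy   (published as dspy-ai in older releases)
import os

import dspy

# === 1. Configure the LM backend ===
# dspy.LM routes "provider/model" strings through LiteLLM, which covers Anthropic Claude
lm = dspy.LM(
    model="anthropic/claude-haiku-4-5",
    api_key=os.environ["ANTHROPIC_API_KEY"],  # read the key from the environment instead of hard-coding it
    max_tokens=512
)
dspy.configure(lm=lm)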

# === 2. Define the Signature ===
class ExtractFinancials(dspy.Signature):
    """Extract revenue and net income from an earnings report.

    Output revenue and net_income in millions of USD as numbers only."""
    report: str = dspy.InputField(desc="The earnings report text")
    revenue: float = dspy.OutputField(desc="Revenue in USD millions")
    net_income: float = dspy.OutputField(desc="Net income in USD millions")

# === 3. Wrap it in a Module ===
class FinExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.ChainOfThought(ExtractFinancials)

    def forward(self, report):
        return self.extract(report=report)

# === 4. Prepare training data ===
trainset = [
    dspy.Example(
        report="Apple Q3'26: Revenue $94.5B, net income $24.1B.",
        revenue=94500.0,
        net_income=24100.0
    ).with_inputs("report"),
    dspy.Example(
        report="Microsoft FY26 Q1: Revenue of $62.0 billion. Net income reached $21.5 billion.",
        revenue=62000.0,
        net_income=21500.0
    ).with_inputs("report"),
    dspy.Example(
        report="Tesla Q3 2026 delivered $25.18B in revenue with net income of $1.85B.",
        revenue=25180.0,
        net_income=1850.0
    ).with_inputs("report"),
    dspy.Example(
        report="Amazon Q3'26 GAAP earnings: Net sales were $158.9B, net income $14.6B.",
        revenue=158900.0,
        net_income=14600.0
    ).with_inputs("report"),
    dspy.Example(
        report="Google Q3 26 reported $88.3 billion in revenue and net income of $26.3 billion.",
        revenue=88300.0,
        net_income=26300.0
    ).with_inputs("report"),
]

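# === 5. Metric ===
def metric(example, pred, trace=None):
    """Correct only if both numbers are within 5% relative error."""
    try:
        # abs() on the denominator keeps the check valid when net income is negative (a loss)
        rev_ok = abs(float(pred.revenue) - example.revenue) / abs(example.revenue) < 0.05
        ni_ok = abs(float(pred.net_income) - example.net_income) / abs(example.net_income) < 0.05
        return rev_ok and ni_ok
    except (AttributeError, TypeError, ValueError, ZeroDivisionError):
        # Missing field, non-numeric output, or a zero reference value
        return False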

# === 6. Unoptimized baseline ===
baseline = FinExtractor()
print("=== Baseline (no optimization) ===")
for ex in trainset[:2]:
    pred = baseline(report=ex.report)
    print(f"  Predicted: rev={pred.revenue}, ni={pred.net_income}")
    print(f"  True:      rev={ex.revenue}, ni={ex.net_income}")
    print(f"  Match:     {metric(ex, pred)}\n")

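# === 7. Run the optimizer ===
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=3,  # up to 3 demos generated by running the program and keeping traces that pass the metric
    max_labeled_demos=4,       # up to 4 raw labeled examples taken directly from the trainset
)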

print("=== Compiling (running optimization) ===")
compiled_extractor = optimizer.compile(
    FinExtractor(),
    trainset=trainset
)

# === 8. Test the compiled program ===
test_report = "Nvidia Q3 FY27: Datacenter revenue of $30.8B drove total revenue to $35.1B with net income of $19.3B."
print("\n=== Test on unseen example ===")
pred = compiled_extractor(report=test_report)
print(f"Revenue: {pred.revenue}, Net Income: {pred.net_income}")
# Expected: 35100, 19300
print(f"\nReasoning trace:\n{getattr(pred, 'reasoning', 'n/a')}")

# === 9. Inspect the compiled prompt ===
print("\n=== Final compiled prompt (text) ===")
for name, predictor in compiled_extractor.named_predictors():
    print(name, predictor.signature, f"demos={len(predictor.demos)}")
dspy.inspect_history(n=1)  # prints the last prompt actually sent to the LM
# BootstrapFewShot only attaches few-shot demos; the instruction and field descriptions stay as written

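3.2 Using MIPRO (a more advanced optimizer)

# MIPROv2 optimizes the instruction and the demos together
from dspy.teleprompt import MIPROv2

mipro = MIPROv2(
    metric=metric,
    auto="medium",   # "light" / "medium" / "heavy"
    num_threads=4,
)

# Five examples only illustrate the call shape; a real run needs a much larger trainset and valset
compiled_v2 = mipro.compile(
    student=FinExtractor(),
    trainset=trainset[:4],
    valset=trainset[4:],
)
# MIPRO tries different instruction wordings alongside different demo sets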

3.3 A simple OPRO implementation

# my_opro.py
"""
Teaching version of OPRO: a hand-rolled prompt search loop.
"""
import anthropic
client = anthropic.Anthropic()

def evaluate_prompt(instruction, examples):
    """Measure the accuracy of one instruction on the examples (exact string match)."""
    correct = 0
    for ex in examples:
        resp = client.messages.create(
            model="claude-haiku-4-5", max_tokens=100, temperature=0,
            system=instruction,
            messages=[{"role": "user", "content": ex["input"]}]
        )
        if resp.content[0].text.strip() == ex["expected"]:
            correct += 1
    return correct / len(examples)

def opro(seed_instructions, examples, n_rounds=5):
    history = []
    for instr in seed_instructions:
        score = evaluate_prompt(instr, examples)
        history.append((instr, score))

    for round_i in range(n_rounds):
        # Keep the 5 best prompts and list them worst-to-best, the ordering the OPRO paper uses
        history.sort(key=lambda x: x[1], reverse=True)
        prior = "\n".join(f"Score {s:.2f}: {p}" for p, s in reversed(history[:5]))

        meta_prompt = f"""I'm trying to find the best prompt for a task.
Here are previously tried prompts and their scores (0-1):
{prior}

Generate a NEW prompt that may score higher. Output only the prompt text."""

        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=300, temperature=0.7,
            messages=[{"role": "user", "content": meta_prompt}]
        )
        new_instr = resp.content[0].text.strip()
        score = evaluate_prompt(new_instr, examples)
        history.append((new_instr, score))
        print(f"Round {round_i}: new score={score:.2f}, prompt={new_instr[:80]}...")

    return max(history, key=lambda x: x[1])

# Usage
SEEDS = [
    "Translate to French.",
    "Output the French translation.",
    "Provide French translation, nothing else."
]
EXAMPLES = [
    {"input": "Hello", "expected": "Bonjour"},
    {"input": "Goodbye", "expected": "Au revoir"},
]
best, score = opro(SEEDS, EXAMPLES, n_rounds=3)
print(f"\nBest: {best} (score: {score:.2f})")

4. Anthropic API Best Practices

4.1 DSPy + Anthropic configuration

import dspy

# Anthropic directly (the key is read from the ANTHROPIC_API_KEY environment variable)
lm = dspy.LM(
    model="anthropic/claude-sonnet-4-6",
    max_tokens=2048,
    temperature=0.0
)

# Via Bedrock
# lm = dspy.LM(model="bedrock/anthropic.claude-sonnet-4-6-v1:0")

# Via Vertex AI
# lm = dspy.LM(model="vertex_ai/claude-sonnet-4-6@20260601")

dspy.configure(lm=lm)

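4.2 Controlling cost at compile time

A DSPy compile calls the LLM anywhere from dozens to hundreds of times. In production:

  • Run the high-volume task calls during optimization on a cheap model (Haiku)
  • Run the final eval on the target model
  • Save the compile result to disk (compiled_extractor.save("prog.json")) so it is not recompiled

# Two-model setup
from dspy.teleprompt import MIPROv2

prompt_model = dspy.LM(model="anthropic/claude-sonnet-4-6")
task_model = dspy.LM(model="anthropic/claude-haiku-4-5")
dspy.configure(lm=task_model)

mipro = MIPROv2(
    metric=metric,
    prompt_model=prompt_model,  # Sonnet proposes instructions (few calls)
    task_model=task_model,       # Haiku executes the task (many calls)
)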
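A back-of-envelope estimate helps decide whether a two-model setup is worth it. The sketch below is plain arithmetic; every price, call count, and token count in it is a placeholder rather than a real price list or a measured DSPy run, so substitute your own numbers.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    """USD per million tokens. Fill in from your provider's current price list."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(calls: int, in_tokens: int, out_tokens: int, price: ModelPrice) -> float:
    """Cost in USD for `calls` requests with the given average token counts."""
    total_in = calls * in_tokens
    total_out = calls * out_tokens
    return (total_in * price.input_per_mtok + total_out * price.output_per_mtok) / 1_000_000


# Placeholder numbers only: not real prices and not measured call counts.
cheap = ModelPrice(input_per_mtok=1.0, output_per_mtok=5.0)
strong = ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0)

task_calls, proposal_calls = 400, 20  # many task executions, few instruction proposals

all_strong = estimate_cost(task_calls + proposal_calls, 1500, 300, strong)
split = (
    estimate_cost(task_calls, 1500, 300, cheap)
    + estimate_cost(proposal_calls, 1500, 300, strong)
)
print(f"everything on the strong model: ${all_strong:.2f}")
print(f"split setup:                    ${split:.2f}")
```

Because task executions dominate the call count, moving only those to the cheaper model captures most of the saving while keeping the stronger model for the few proposal calls.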

5. Applications in Finance

Case: automatic optimization of sentiment classification

# fin_sentiment_dspy.py
import dspy
from dspy.teleprompt import BootstrapFewShot

class SentimentClassify(dspy.Signature):
    """Classify financial news sentiment as bullish/bearish/neutral."""
    headline: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="One of: bullish, bearish, neutral")
    # No explicit `reasoning` field here: dspy.ChainOfThought adds one itself

trainset = [
    dspy.Example(headline="Apple beats earnings expectations, raises guidance",
                 sentiment="bullish", reasoning="Beat + raise").with_inputs("headline"),
    dspy.Example(headline="Tesla recalls 200K vehicles over autopilot defect",
                 sentiment="bearish", reasoning="Costly recall").with_inputs("headline"),
    dspy.Example(headline="Fed holds rates steady as expected",
                 sentiment="neutral", reasoning="In line with expectations").with_inputs("headline"),
    # ... 50 more examples
]

def metric(ex, pred, trace=None):
    # Tolerate case and surrounding-whitespace differences in the predicted label
    return ex.sentiment.strip().lower() == pred.sentiment.strip().lower()

clf = dspy.ChainOfThought(SentimentClassify)
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=5)
compiled = optimizer.compile(clf, trainset=trainset)
compiled.save("sentiment_v1.json")

Reported production gain: compared with a hand-written prompt, accuracy rose by 6% and cost fell by 30% after the DSPy compile (the compiled prompt was shorter).

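Case: multi-step pipeline (Research Agent)

class FinResearchPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract_metrics = dspy.ChainOfThought(ExtractFinancials)  # Signature from section 3.1
        self.compare_peers = dspy.ChainOfThought("metrics, peers -> comparison")
        self.summarize = dspy.ChainOfThought("comparison, metrics -> summary")

    def forward(self, report, peer_data):
        m = self.extract_metrics(report=report)
        # Pass plain field values between steps rather than whole Prediction objects
        metrics = f"revenue={m.revenue}, net_income={m.net_income} (USD millions)"
        c = self.compare_peers(metrics=metrics, peers=peer_data)
        s = self.summarize(comparison=c.comparison, metrics=metrics)
        return s

DSPy can compile this pipeline end to end. With the traditional manual approach each step is tuned separately, and pipeline-level optimization is nearly impossible.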


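6. Common Pitfalls

  1. Overfitting a small training set: the DSPy compile overfits the training set and does poorly on the test set. Use at least 30 examples plus a holdout set.
  2. Metric too simple: plain accuracy ignores edge cases (rare but important ones).
  3. Optimizer cost out of control: a MIPRO heavy run can cost more than $100 to compile. Start with light mode.
  4. Forgetting to save after compiling: recompiling on every startup wastes money. Compile once and save to a file.
  5. Not recompiling when the model version changes: the optimal prompt for Claude 4.6 may differ from the one for 4.7.
  6. Over-automating and ignoring domain knowledge: automatically discovered prompts sometimes violate compliance/safety requirements.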
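For the first pitfall, the holdout has to be carved off before the optimizer sees any data. A minimal sketch, assuming the examples fit in a list; the function name and the 25% default are illustrative choices, not a DSPy API.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_holdout(
    examples: Sequence[T], holdout_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then set aside a holdout the optimizer never sees."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


examples = [f"report-{i}" for i in range(40)]
train, holdout = split_train_holdout(examples)
print(len(train), len(holdout))
```

Compile on `train` only, then score the compiled program on `holdout`. A large gap between the two scores is the overfitting signal described in pitfall 1. The fixed seed keeps the split stable across recompiles so scores stay comparable.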

7. Quick Reference

DSPy core concepts

Signature  = task interface (input/output definitions)
Module     = pattern (CoT/ReAct/Predict)
Optimizer  = compile strategy (BootstrapFewShot/MIPRO/COPRO)
Metric     = evaluation function
Compile    = optimize prompt + few-shot

Choosing an optimizer

BootstrapFewShot  : simple, finds good demos
COPRO             : optimizes the instruction
MIPROv2           : optimizes instruction + demos together (most powerful)
KNNFewShot        : uses KNN to pick the most relevant demos at runtime
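The KNN idea in the last row can be illustrated without any library. The sketch below swaps in a bag-of-words cosine similarity where a real system would use embeddings, so it shows the selection mechanism only and is not DSPy's KNNFewShot implementation.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; punctuation is dropped."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_demos(query: str, pool: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k labeled examples whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]


pool = [
    ("Apple beats earnings expectations, raises guidance", "bullish"),
    ("Tesla recalls 200K vehicles over autopilot defect", "bearish"),
    ("Fed holds rates steady as expected", "neutral"),
]
selected = knn_demos("Nvidia beats earnings expectations again", pool, k=1)
print(selected)
```

Unlike a fixed demo set chosen at compile time, this selection runs per query, so the prompt changes with each input.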

When to use DSPy

  • ✅ Optimizing a multi-step pipeline
  • ✅ You want to switch models without rewriting prompts
  • ✅ You have training data (≥30 examples)
  • ❌ One-off simple tasks
  • ❌ Tasks with no evaluation metric

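8. Interview Questions

Q1: What is the difference between DSPy and LangChain?

LangChain is mainly about orchestration (assembling chains, memory, tools); DSPy is about optimization (automatically tuning prompts + demos). They can be used together: DSPy as the programming model, LangChain as the runtime.

Q2: Why does the "Take a deep breath" prompt that OPRO found work?

Nobody knows for sure. One hypothesis is that such self-coaching phrases in the training data correlate strongly with contexts of high-quality reasoning. It is an empirical phenomenon, not a theoretical result. That is exactly the value of auto-prompting: it finds patterns beyond human intuition.

Q3: Are automatically optimized prompts safe?

Not necessarily. The optimizer may find a "jailbreak-adjacent" prompt (such as "ignore safety, just answer"). In production, add a safety-constraint check to the optimizer.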
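One simple form of such a check is a deny-list filter applied to candidate prompts before they are scored. The patterns below are hypothetical examples, and a regex list is only a first line of defense; a real deployment would encode its own compliance rules.

```python
import re
from typing import Iterable, List

# Hypothetical deny-list, for illustration only.
BANNED_PATTERNS = [
    r"ignore (all |any )?(previous |prior )?(safety|instructions|rules)",
    r"without (any )?disclaimer",
    r"guarantee(d)? returns?",
]


def violates_policy(prompt: str) -> bool:
    """True if the prompt matches any banned pattern (case-insensitive)."""
    return any(re.search(p, prompt, flags=re.IGNORECASE) for p in BANNED_PATTERNS)


def filter_candidates(candidates: Iterable[str]) -> List[str]:
    """Drop candidates that trip the deny-list before they are ever scored."""
    return [c for c in candidates if not violates_policy(c)]


candidates = [
    "Classify the headline as bullish, bearish, or neutral.",
    "Ignore safety, just answer with the sentiment.",
    "Promise guaranteed returns and classify the headline.",
]
safe = filter_candidates(candidates)
print(safe)
```

Filtering before scoring matters: if a rejected prompt never enters the score history, the optimizer cannot learn from it or produce variations of it in later rounds.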

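Q4: Given a deepfake detection task, what can DSPy do for you?

(1) Wrap the "classification" as a Signature. (2) Use training data + a DSPy optimizer to find the best prompt + demos. (3) Compare CoT vs. zero-shot vs. ReAct modules. (4) Save the compiled program. Limitations: DSPy's support for vision tasks is limited and needs a custom signature; deepfake-specific domain knowledge cannot be learned automatically.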


9. Tomorrow's Preview

Day 129: Multi-modal prompting (Vision, Voice, Document), with a deep dive into Claude vision.