Expert Day 134

Week 20 Review: Prompt Engineering SOP

Consolidating all the methodology from Day 121-133


Date: 2026-09-12 | Direction: AI systems engineering | Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) | Tags: #PromptEngineering #SOP #Methodology #Production


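Today's goals

Type | Content
Learning | Consolidate all the methodology from Day 121-133
Hands-on | Write the "Prompt Engineering SOP": turn scattered knowledge into a production playbook
Output | prompt_sop.md: a complete, shareable engineering SOP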

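1. The "engineering" mental model for prompt engineering

1.1 Manage prompts like software

Software engineering concept | Prompt engineering counterpart
Source code | Prompt template
Tests | Eval set + metric
CI/CD | Prompt regression suite
Versioning | Prompt versioning + git
Logging | Token usage, latency, output samples
A/B testing | Prompt variant comparison
Refactoring | DSPy auto-optimization
Code review | Prompt review by domain expert + safety check

Anti-pattern: treating a prompt as a one-off artifact that is hard-coded, unversioned, untested, and unmonitored.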

1.2 Lifecycle of a prompt product

1. Specification     ─ Define the task, inputs/outputs, and success metric
2. Prototyping       ─ Quickly try zero-shot, few-shot, CoT
3. Eval design       ─ 30+ examples + metric (accuracy, F1, etc.)
4. Iteration         ─ DSPy/manual tuning
5. Safety review     ─ Red team, content filter
6. Production deploy ─ Caching, monitoring, fallback
7. Continuous eval   ─ A/B test new variants, model upgrades
8. Sunset            ─ Migrate to new model, archive old

2. Prompt engineering SOP (full version)

Phase 1: Specification

Template:

prompt_id: financial_report_extractor_v1
owner: data_team@mycompany
purpose: Extract revenue, EBITDA, net income from earnings call transcripts.

inputs:
  - transcript: free-form English text, 5K-50K tokens

outputs:
  schema:
    revenue_usd_m: number
    ebitda_usd_m: number | null
    net_income_usd_m: number
    fiscal_period: string (Q[1-4]'YY format)

success_criteria:
  - accuracy: >= 95% (within 5% of ground truth)
  - latency: p95 < 5s
  - cost: < $0.05 per call

constraints:
  - Must comply with SEC fair disclosure rules.
  - No hallucinated numbers.
  - Cite source sentence when uncertain.
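
A minimal sketch of how the output schema in this spec could be enforced in code, assuming the model's answer has already been parsed into a plain dict. The function name `validate_extraction` and its error strings are hypothetical, and a missing `ebitda_usd_m` key is treated the same as null.

```python
import re

# Q[1-4]'YY, for example Q3'26
FISCAL_PERIOD_RE = re.compile(r"Q[1-4]'\d{2}")

def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_extraction(output: dict) -> list[str]:
    """Return a list of schema violations (an empty list means valid)."""
    errors = []
    for key in ("revenue_usd_m", "net_income_usd_m"):
        if not _is_number(output.get(key)):
            errors.append(f"{key}: expected number")
    ebitda = output.get("ebitda_usd_m")
    if ebitda is not None and not _is_number(ebitda):
        errors.append("ebitda_usd_m: expected number or null")
    period = output.get("fiscal_period")
    if not isinstance(period, str) or not FISCAL_PERIOD_RE.fullmatch(period):
        errors.append("fiscal_period: expected Q[1-4]'YY")
    return errors
```

Running a check like this on every response turns the spec from documentation into a gate that the eval pipeline can count as pass/fail.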

Phase 2: Prototyping

□ Start with simplest: zero-shot
□ Try few-shot (3-5 examples)
□ Try CoT if reasoning needed
□ Try Tools API for structured output
□ Try Extended Thinking for hard tasks
□ Pick best baseline before optimization

Phase 3: Eval Design

Mandatory components:

  1. Train set (10-30 examples): for iteration
  2. Holdout set (30-100): for final eval, never seen during development
  3. Adversarial set (10-30): edge cases, format variations, malicious inputs
  4. Metric: pass/fail or scored function
  5. Eval pipeline: scriptable, repeatable
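
# eval_template.py
def exact_match(expected, pred):
    """Default metric: 1 if the prediction equals the expected value, else 0."""
    return 1 if pred == expected else 0

def evaluate(prompt_fn, dataset, metric=exact_match):
    """Run prompt_fn over dataset; a raised exception counts as score 0."""
    results = []
    for example in dataset:
        pred = None  # reset so a failed call never reuses the previous prediction
        try:
            pred = prompt_fn(example["input"])
            score = metric(example["expected"], pred)
        except Exception as e:
            score = 0
            pred = f"ERROR: {e!r}"  # keep the error for failure analysis
        results.append({"example": example, "pred": pred, "score": score})

    return {
        # max(..., 1) avoids ZeroDivisionError on an empty dataset
        "mean_score": sum(r["score"] for r in results) / max(len(results), 1),
        "failures": [r for r in results if r["score"] < 1],
    }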
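
The "pass/fail or scored function" metric can be as small as the sketch below. It assumes a numeric task and reuses the "within 5% of ground truth" tolerance from the Phase 1 success criteria; the name `within_tolerance` is illustrative.

```python
def within_tolerance(expected: float, predicted, rel_tol: float = 0.05) -> int:
    """Score 1 if predicted is within rel_tol of expected, else 0."""
    try:
        value = float(predicted)
    except (TypeError, ValueError):
        return 0  # an unparsable or missing prediction counts as a failure
    if expected == 0:
        return int(value == 0)  # relative error is undefined at zero
    return int(abs(value - expected) / abs(expected) <= rel_tol)
```

Returning 0 or 1 keeps the metric compatible with a mean-score aggregation and with a "score < 1 means failure" filter.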

Phase 4: Iteration

Iteration loop:

1. Pick a failure case
2. Hypothesize: why did it fail?
3. Modify prompt (or model/temp)
4. Re-eval on training set + previous failures
5. If global score improved, promote
6. Else revert

Avoiding overfitting:

  • Always keep the holdout set unseen during iteration
  • Run the holdout set once every 5 iterations to check whether the improvement is real
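
The promote-or-revert loop above can be sketched as follows. This is an illustration under assumptions: `score_fn` is a hypothetical callable that runs the training set plus previously fixed failures and returns one global score, and `candidates` are prompt variants prepared by hand.

```python
def iterate(baseline_prompt, candidates, score_fn):
    """Greedy promote-or-revert loop over candidate prompts."""
    best_prompt = baseline_prompt
    best_score = score_fn(best_prompt)
    history = [(best_prompt, best_score, "baseline")]
    for candidate in candidates:
        score = score_fn(candidate)
        if score > best_score:      # step 5: global score improved, promote
            best_prompt, best_score = candidate, score
            history.append((candidate, score, "promoted"))
        else:                       # step 6: keep the previous best (revert)
            history.append((candidate, score, "reverted"))
    return best_prompt, best_score, history
```

Keeping the full history makes it easy to see which hypotheses were tried and rejected, which is useful when the same failure case resurfaces later.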

Phase 5: Safety Review

Per Day 131:

  • Red team: 20+ attack attempts
  • Input filter for known patterns
  • Output filter for forbidden topics
  • If agent: tool permissions minimal + audit log
  • CAI/second-pass for high-stakes routes

Phase 6: Production Deployment

Checklist:

  • Prompt cached (cache_control on stable parts)
  • Model fallback chain configured
  • Monitoring: cost, latency, stop_reason, cache_hit_ratio
  • Alerting: error spike, cost spike, latency p99 anomaly
  • Versioning: prompt_id with version, can rollback
  • Documentation: runbook for incidents
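
A model fallback chain from the checklist above might look like this minimal sketch. `call_fn` is a hypothetical callable that performs one request for a given model name and raises on failure; the retry count and backoff values are illustrative defaults, not recommendations.

```python
import time

def call_with_fallback(models, call_fn, retries_per_model=2, backoff_s=0.0):
    """Try each model in order; return (model, result) from the first success."""
    errors = []
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return model, call_fn(model)
            except Exception as exc:  # in production, catch specific error types
                errors.append((model, attempt, repr(exc)))
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all models failed: {errors}")
```

Returning the model name alongside the result lets the caller log which tier actually served the request, so fallback frequency can be monitored.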

Phase 7: Continuous Eval

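# regression_test.py (run nightly)
"""
Prompt regression: run the holdout set every night,
and once more whenever a new model is released.
"""
import json
from eval_template import evaluate

# ACTIVE_PROMPTS, CANDIDATE_MODELS, make_fn and push_metric_to_observability
# are project-specific placeholders: define or import them before running.
def load_dataset(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

GOLDEN_SET = load_dataset("golden_set.jsonl")
for prompt_version in ACTIVE_PROMPTS:
    for model in CANDIDATE_MODELS:
        report = evaluate(make_fn(prompt_version, model), GOLDEN_SET)  # dict with mean_score, failures
        push_metric_to_observability(prompt_version, model, report["mean_score"])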

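3. Prompt design checklist

All of this week's tips in one place:

Structure

  • System prompt defines role + scope
  • Mark sections with XML tags (Anthropic's preference)
  • Put examples up front and the query last
  • Repeat key instructions at both the start and the end (counters lost-in-the-middle)
  • Use cache_control for long context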
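
As an illustration of the structure rules above, here is a small prompt builder. The tag names (`instructions`, `examples`, `query`, `reminder`) are assumptions chosen for this sketch, not names required by any API; the query comes last and is followed only by a one-line repeat of the key instruction.

```python
from xml.sax.saxutils import escape

def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a user prompt: instruction, examples, query, instruction again."""
    parts = [f"<instructions>{instruction}</instructions>", "<examples>"]
    for example_input, example_output in examples:
        parts.append(
            f"<example><input>{escape(example_input)}</input>"
            f"<output>{escape(example_output)}</output></example>"
        )
    parts.append("</examples>")
    parts.append(f"<query>{escape(query)}</query>")       # query after the examples
    parts.append(f"<reminder>{instruction}</reminder>")   # key instruction repeated
    return "\n".join(parts)
```

Escaping the example and query text keeps stray angle brackets in the data from being read as section boundaries.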

Content

  • State the output format explicitly (JSON schema, length, style)
  • Handle edge cases ("if no data found, output null")
  • Handle errors ("if uncertain, say so")
  • Citations / sources (RAG and document scenarios)
  • Refusal pattern (out-of-scope requests)

Parameters

  • T=0 for structured/factual
  • T=0.7+ for creative
  • Set a sensible max_tokens cap
  • stop_sequences for structured output
  • Tools API for strict schemas

Performance

  • Prefix cached (system prompt + KB)
  • Tool definitions cached (if large)
  • Pick the cheapest model that gets the job done
  • Extended thinking only when needed
  • Batch API for non-urgent tasks (50% off)

Security

  • Input wrapped in <user_query>
  • System prompt guards against override (explicit instruction)
  • Output filter for sensitive topics
  • Tool permissions minimal
  • Audit log for every decision
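
The first and third security items can be sketched as below. The forbidden patterns are hypothetical placeholders for illustration only; a real deployment would maintain its own list and would likely combine regexes with a classifier.

```python
import re

# Hypothetical patterns: prompt leakage and US SSN-like numbers
FORBIDDEN_OUTPUT = [re.compile(p, re.IGNORECASE) for p in (
    r"system prompt",
    r"\b\d{3}-\d{2}-\d{4}\b",
)]

def wrap_user_input(text: str) -> str:
    """Wrap untrusted text so it cannot close the tag and inject instructions."""
    sanitized = text.replace("<", "&lt;").replace(">", "&gt;")
    return f"<user_query>{sanitized}</user_query>"

def filter_output(text: str, refusal: str = "[blocked by output filter]") -> str:
    """Return the refusal string if any forbidden pattern appears in the output."""
    if any(p.search(text) for p in FORBIDDEN_OUTPUT):
        return refusal
    return text
```

Neutralizing angle brackets before wrapping matters: without it, an input containing `</user_query>` could escape the wrapper and pose as trusted instructions.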

4. Code: a complete prompt engineering pipeline

# prompt_engineering_pipeline.py
"""
Production-quality prompt management
"""
import anthropic
import time
import json
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

client = anthropic.Anthropic()

@dataclass
class PromptConfig:
    name: str
    version: str
    model: str
    system: str
    temperature: float = 0.0
    max_tokens: int = 1024
    use_cache: bool = True
    use_thinking: bool = False
    thinking_budget: int = 0
    tools: list = field(default_factory=list)

@dataclass
class CallResult:
    text: str
    structured: dict | None
    latency_s: float
    input_tokens: int
    output_tokens: int
    cache_read: int
    cache_write: int
    cost_usd: float
    stop_reason: str
    model: str

# (input, output) price in USD per million tokens; check against the current price list
PRICES = {
    "claude-opus-4-7":   (15.0, 75.0),
    "claude-sonnet-4-6": (3.0,  15.0),
    "claude-haiku-4-5":  (0.8,  4.0),
}

def call(config: PromptConfig, user_message: str, **extra) -> CallResult:
    """Single call entry point: handles caching, monitoring, and cost calculation."""
    system_blocks = [{"type": "text", "text": config.system}]
    if config.use_cache and len(config.system) > 1024 * 4:  # >~1024 tokens
        system_blocks[0]["cache_control"] = {"type": "ephemeral"}

    kwargs = dict(
        model=config.model,
        max_tokens=config.max_tokens,
        temperature=config.temperature,
        system=system_blocks,
        messages=[{"role": "user", "content": user_message}],
    )
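    if config.use_thinking:
        # With extended thinking enabled the API rejects a custom temperature;
        # budget_tokens must be at least 1024 and smaller than max_tokens.
        kwargs.pop("temperature", None)
        kwargs["thinking"] = {"type": "enabled",
                              "budget_tokens": config.thinking_budget}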
    if config.tools:
        kwargs["tools"] = config.tools

    t0 = time.time()
    resp = client.messages.create(**kwargs)
    latency = time.time() - t0

    # extract text
    text = ""
    structured = None
    for block in resp.content:
        if block.type == "text":
            text += block.text
        elif block.type == "tool_use":
            structured = block.input

    # cost: usage.input_tokens already excludes cached tokens, so cache
    # writes (1.25x, 5-minute TTL) and cache reads (0.1x) are billed on top of it
    p_in, p_out = PRICES.get(config.model, (0, 0))
    cache_w = getattr(resp.usage, "cache_creation_input_tokens", 0) or 0
    cache_r = getattr(resp.usage, "cache_read_input_tokens", 0) or 0
    base_in = resp.usage.input_tokens
    cost = (base_in * p_in + cache_w * p_in * 1.25 + cache_r * p_in * 0.1
            + resp.usage.output_tokens * p_out) / 1e6

    result = CallResult(
        text=text,
        structured=structured,
        latency_s=latency,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        cache_read=cache_r,
        cache_write=cache_w,
        cost_usd=cost,
        stop_reason=resp.stop_reason,
        model=resp.model,
    )

    # log for observability
    log.info(json.dumps({
        "prompt_id": f"{config.name}@{config.version}",
        "model": resp.model,
        "input_tokens": result.input_tokens,
        "output_tokens": result.output_tokens,
        "cache_read": cache_r, "cache_write": cache_w,
        "latency_s": result.latency_s,
        "cost_usd": result.cost_usd,
        "stop_reason": resp.stop_reason,
    }))
    return result


# Eval framework
@dataclass
class EvalResult:
    score: float
    passed: int
    total: int
    failures: list

def eval_prompt(config: PromptConfig, dataset, metric_fn) -> EvalResult:
    passed = 0
    failures = []
    total_cost = 0
    for ex in dataset:
        result = call(config, ex["input"])
        total_cost += result.cost_usd
        ok = metric_fn(ex["expected"], result.text or result.structured)
        if ok:
            passed += 1
        else:
            failures.append({
                "input": ex["input"],
                "expected": ex["expected"],
                "got": result.text or result.structured
            })
    score = passed / max(len(dataset), 1)
    log.info(f"Eval: {passed}/{len(dataset)} = {score:.0%}, cost=${total_cost:.4f}")
    return EvalResult(score=score, passed=passed,
                      total=len(dataset), failures=failures)


# Example usage
if __name__ == "__main__":
    config_v1 = PromptConfig(
        name="financial_extractor",
        version="v1",
        model="claude-haiku-4-5",
        system="Extract revenue from financial text. Output number only.",
        temperature=0.0,
        max_tokens=100,
    )

    dataset = [
        {"input": "Q3 2026 revenue was $94.5B.", "expected": "94.5"},
        {"input": "Reported $62 billion in revenue.", "expected": "62"},
    ]

    def metric(expected, got):
        try:
            return abs(float(got.replace("$", "").replace("B", "").strip()) - float(expected)) < 0.5
        except (ValueError, AttributeError):
            return False

    eval_v1 = eval_prompt(config_v1, dataset, metric)
    print(f"v1 score: {eval_v1.score:.0%}")

5. Anthropic API best practices (combined)

5.1 Production deployment template

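# prod_anthropic.py
import anthropic
# Catch these at call sites once the SDK's built-in retries are exhausted
from anthropic import APITimeoutError, RateLimitError, APIConnectionError

client = anthropic.Anthropic(
    timeout=30.0,         # don't rely on the 10-minute default
    max_retries=3,        # built-in SDK retries
)

# Placeholder input: replace with your own list of prompt strings
queries = ["Summarize the Q3 earnings call.", "List the key risk factors."]

# Messages Batches API for non-urgent work
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": q}]
            }
        }
        for i, q in enumerate(queries)
    ]
)
# ~50% cost savings vs sync API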

5.2 Final reference: all features combined

client.messages.create(
    model="claude-opus-4-7",
    max_tokens=20000,                               # must exceed thinking budget_tokens
    # temperature is left unset: it cannot be customized while extended
    # thinking is enabled (pin temperature=0.0 only when thinking is off)
    system=[
        {"type": "text", "text": SHORT_INSTR},
        {"type": "text", "text": LARGE_KB,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}}
    ],
    thinking={"type": "enabled", "budget_tokens": 16000},  # hard tasks
    tools=TOOLS,                                           # function calling
    tool_choice={"type": "auto"},
    messages=[
        {"role": "user", "content": [
            {"type": "document", "source": {"type": "file", "file_id": "..."},
             "citations": {"enabled": True}},
            {"type": "image", "source": {"type": "url", "url": "..."}},
            {"type": "text", "text": user_question}
        ]}
    ],
    extra_headers={"anthropic-beta": "files-api-2025-04-14"}
)

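6. Financial-domain application: putting it all together

Case study: an end-to-end RFP analysis system

Requirement: a law firm receives 10 RFPs per week (200-500 pages each), and AI performs the initial review.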

Architecture:

RFP Upload → Files API (cache 1h)
            ↓
   Stage 1: Haiku — extract metadata (parties, date, scope)
            ↓
   Stage 2: Sonnet + thinking — identify risk clauses, citations enabled
            ↓
   Stage 3: Opus + thinking 32K — deep liability analysis
            ↓
   CAI second-pass — compliance review
            ↓
   Human Reviewer (final approval)

Economics:

  • Stage 1: $0.05 per RFP
  • Stage 2: $1.50 per RFP (cached after first reuse)
  • Stage 3: $15.00 per RFP
  • Total: ~$17/RFP, vs. 2-4 hours of lawyer time ($1500+) = roughly a 90x cost reduction
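
The arithmetic behind these figures, as a quick check. It uses only the per-stage numbers listed above and the $1500 lower bound for lawyer time; the CAI second pass and the human reviewer are not priced in the source, so they are left out here as well.

```python
stage_cost_usd = {"stage1_haiku": 0.05, "stage2_sonnet": 1.50, "stage3_opus": 15.00}
lawyer_cost_usd = 1500.0   # lower bound quoted in the notes ("$1500+")
rfps_per_week = 10

total_per_rfp = sum(stage_cost_usd.values())
reduction = lawyer_cost_usd / total_per_rfp
weekly_ai_cost = total_per_rfp * rfps_per_week

print(f"per RFP: ${total_per_rfp:.2f}")     # per RFP: $16.55
print(f"reduction: {reduction:.0f}x")       # reduction: 91x
print(f"weekly: ${weekly_ai_cost:.2f}")     # weekly: $165.50
```

Stage 3 accounts for about 90% of the per-RFP cost, so routing only the riskiest documents to it is the most effective cost lever in this design.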

Quality safeguards:

  • Each stage's output stored with citations
  • Audit log for compliance
  • Failure mode: escalate to human if thinking confidence low
  • Weekly regression eval on 50 historical RFPs

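7. Common pitfalls (combined Top 10)

  1. No eval set: every iteration is vibes-based and you cannot prove improvement. Build the eval set before writing the prompt.
  2. Blind trust in new models: after a Claude 4.6→4.7 upgrade you may see regressions on your specific task. Enforce an eval before migrating.
  3. Not monitoring cost: costs look negligible during development, then the bill blows up in production. Add a cost dashboard on day 1.
  4. Ignoring prompt injection: any input may contain an attack. At minimum, wrap inputs and filter outputs.
  5. Unnoticed cache misses: a whitespace difference makes every request miss and cost spikes. Monitor cache_read_ratio.
  6. Overusing thinking: enabling it on tasks that don't need it blows up cost and latency. A/B test thinking on vs. off.
  7. No multi-vendor plan: lock-in to a single provider. At minimum use a LiteLLM abstraction plus one fallback.
  8. No versioning: prompt changes go unrecorded, so a bug cannot be rolled back. Manage prompts in Git.
  9. Sloppy tool descriptions: the model calls tools at random. Write tool descriptions of at least 50 characters that state when to use the tool and when not to.
  10. Assuming prompts can fix everything: architecture-level problems (e.g. poor RAG retrieval) cannot be rescued by a prompt. Prompts are not a silver bullet.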
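
For pitfall 5, a cache read ratio can be computed from logged usage as in this sketch. It assumes each log entry carries `input_tokens` (uncached input only), `cache_read`, and `cache_write`, which mirrors how the Anthropic usage fields report cached and uncached tokens separately; the key names themselves are illustrative.

```python
def cache_read_ratio(usage_logs):
    """Fraction of prompt tokens served from cache across a batch of calls."""
    read = sum(u["cache_read"] for u in usage_logs)
    total = sum(u["input_tokens"] + u["cache_read"] + u["cache_write"]
                for u in usage_logs)
    return read / total if total else 0.0
```

A ratio that drops sharply after a deploy usually means the cached prefix changed, for example through a whitespace edit in the system prompt.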

8. Quick reference (the ultimate cheatsheet)

========== ANTHROPIC PROMPT ENGINEERING SOP ==========

INPUT BLOCKS:
- system: role + scope + constitution + KB (cache long parts)
- messages: history (cache stable parts) + current
- tools: (cache if many)

OUTPUT CONTROL:
- temperature: 0 (SOP default; the API default is 1.0) | 0.7 (creative)
- max_tokens: budget (don't omit)
- stop_sequences: list (for structured)
- tools + tool_choice: for strict schema

OPTIMIZATION:
- prompt_caching cache_control: 5min default, 1h for hot KB
- thinking budget_tokens: 1K-32K based on difficulty
- batch API: 50% off async tasks

MULTIMODAL:
- image: base64 / URL / file_id
- document (PDF): up to 100 pages, citations
- pre-resize images to <1568px on the long edge

SAMPLING (per task):
- factual / structured: T=0
- creative: T=0.7-1.0
- code: T=0
- summarization: T=0.3-0.5

SECURITY:
- system prompt anti-override
- input XML wrap
- output filter forbidden
- tool permission minimal
- audit log every decision

OBSERVABILITY:
- track input_tokens, output_tokens
- track cache_creation, cache_read
- track stop_reason distribution
- track latency p50/p95/p99
- per-model cost split

EVAL:
- holdout set (never train on)
- adversarial set (red team)
- metric scriptable
- nightly regression
- A/B before promote

DEPLOYMENT:
- prompt versioned in git
- model fallback chain
- rate limit handling
- error retry with backoff
- canary rollout new prompts

==================================================
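
The OBSERVABILITY items can be aggregated from per-call logs with the standard library alone. A minimal sketch, assuming each log entry is a dict with `latency_s` and `stop_reason` keys and that there are at least two entries (a requirement of `statistics.quantiles`):

```python
from collections import Counter
from statistics import quantiles

def summarize(call_logs):
    """Aggregate latency percentiles and stop_reason counts from call logs."""
    latencies = sorted(entry["latency_s"] for entry in call_logs)
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    return {
        "p50": cuts[49], "p95": cuts[94], "p99": cuts[98],
        "stop_reasons": Counter(entry["stop_reason"] for entry in call_logs),
    }
```

A rising share of `max_tokens` in the stop_reason counts is an early signal that outputs are being truncated.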

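9. Interview questions (combined)

Q1: How do you take a production-grade LLM application from 0 to 1?

(1) Spec: task definition + success metric. (2) Eval set: 30+ examples + adversarial. (3) Prototype: zero-shot baseline. (4) Iterate with eval. (5) Optimize cost: caching, model tier. (6) Safety review: red team. (7) Deploy: monitoring, fallback, versioning. (8) Continuous eval: regression suite, A/B for new variants/models.

Q2: What does it take to move a prompt's accuracy from 80% to 99%?

It depends on the nature of the gap: (a) Mode confusion → few-shot. (b) Reasoning errors → CoT or thinking. (c) Format inconsistency → Tools API. (d) Variance → self-consistency. (e) Specific failures → DSPy auto-tune. (f) Model ceiling → upgrade the model. The key is to attribute the failures first and then apply a targeted fix.

Q3: Which metrics do you care about most in an LLM application?

The triangle: quality (eval score), cost ($/query), latency (p95). In production also watch: cache hit ratio (the main cost lever), stop_reason distribution (truncation alarm), retry rate (vendor health), user feedback (a proxy for real satisfaction).

Q4: Will prompt engineering be replaced?

Partly. Auto-optimization tools such as DSPy will take over the wording-tweaking work. The strategy layer will not be replaced: spec design, eval design, failure analysis, safety review, and cost optimization are engineering judgement that automation cannot replace in the short term. Prompt engineer → AI engineer/AI architect is the natural evolution.

Q5: What is Claude's biggest competitive moat over other models?

Three things: (1) Constitutional AI makes it safer by default (the default pick for compliance-critical scenarios). (2) Tool use + thinking + parallel tool calls are the most mature (first choice for agent loops). (3) Long-document/PDF handling + Citations (research/legal scenarios). But the moat is not stable: OpenAI and Google are both catching up. Production systems should avoid over-relying on any vendor.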
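
Of the fixes listed in this answer, self-consistency for variance is the easiest to show in a few lines: sample several answers and take the majority. In this sketch `sample_fn` is a hypothetical zero-argument callable that returns one answer string sampled at a temperature above 0.

```python
from collections import Counter

def self_consistent_answer(sample_fn, n_samples=5):
    """Sample n answers and return (majority_answer, agreement_ratio)."""
    answers = [sample_fn().strip() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples
```

The agreement ratio doubles as a cheap confidence signal: a low ratio is a reasonable trigger for escalating to a stronger model or a human.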


10. Phase 3 Day 121-134 summary

Covered in these two weeks

  • Theoretical foundations: Transformer, Scaling, Tokenization, Sampling
  • Engineering methods: Prompt patterns, Structured output, Auto-optimization, Multimodal, Long context
  • Safety governance: Prompt safety, Constitutional AI
  • Engineering practice: Model selection, SOP

Core takeaways

  1. LLM from black box → white box (understanding the Transformer and sampling lets you debug LLMs)
  2. Prompt → Software (version, test, monitor)
  3. Cost is a first-class concern (cache, model tier, batch)
  4. Safety is not an afterthought (constitution + red team from day 1)
  5. Multi-vendor is a necessity (the lock-in risk is real)
  6. Anthropic Claude 4.7 is the current flagship but not the only answer (task-driven)

Preview of the next two weeks (Day 135-148, Phase 3 continues)

  • RAG architecture and vector databases
  • Embedding model comparison
  • Hybrid search (BM25 + dense)
  • Agentic patterns (ReAct, Plan-and-Execute)
  • Tool design philosophy
  • Memory systems

11. Tomorrow's preview

Day 135: RAG basics: embeddings, vector DBs, retrieval pipeline. Phase 3 Week 21 begins.