Week 20复习——Prompt工程SOP
整合Day 121-133所有方法论
日期: 2026-09-12 方向: AI系统工程 阶段: Phase 3 - LLM基础与Prompt工程 (Day 121-134) 标签: #PromptEngineering #SOP #Methodology #Production
今日目标
| 类型 | 内容 |
|---|---|
| 学习 | 整合Day 121-133所有方法论 |
| 实操 | 写"Prompt工程SOP"——把零散知识变production playbook |
| 产出 | prompt_sop.md — 一份完整、可分享的工程SOP |
一、Prompt工程的"工程化"心智模型
1.1 把Prompt当软件来管理
| 软件工程概念 | Prompt工程对应 |
|---|---|
| Source code | Prompt template |
| Tests | Eval set + metric |
| CI/CD | Prompt regression suite |
| Versioning | Prompt versioning + git |
| Logging | Token usage, latency, output samples |
| A/B testing | Prompt variant comparison |
| Refactoring | DSPy auto-optimization |
| Code review | Prompt review by domain expert + safety check |
反模式:把prompt当一次性artifact写死、不版本化、不测试、不监控。
1.2 Prompt产品的生命周期
1. Specification ─ 定义任务、输入输出、success metric
2. Prototyping ─ 快速尝试zero-shot, few-shot, CoT
3. Eval design ─ 30+ examples + metric (accuracy, F1, etc.)
4. Iteration ─ DSPy/manual tuning
5. Safety review ─ Red team, content filter
6. Production deploy ─ Caching, monitoring, fallback
7. Continuous eval ─ A/B test new variants, model upgrades
8. Sunset ─ Migrate to new model, archive old
二、Prompt工程SOP(完整版)
Phase 1: Specification
模板:
prompt_id: financial_report_extractor_v1
owner: data_team@mycompany
purpose: Extract revenue, EBITDA, net income from earnings call transcripts.
inputs:
- transcript: free-form English text, 5K-50K tokens
outputs:
schema:
revenue_usd_m: number
ebitda_usd_m: number | null
net_income_usd_m: number
fiscal_period: string (Q[1-4]'YY format)
success_criteria:
- accuracy: >= 95% (within 5% of ground truth)
- latency: p95 < 5s
- cost: < $0.05 per call
constraints:
- Must comply with SEC fair disclosure rules.
- No hallucinated numbers.
- Cite source sentence when uncertain.
Phase 2: Prototyping
□ Start with simplest: zero-shot
□ Try few-shot (3-5 examples)
□ Try CoT if reasoning needed
□ Try Tools API for structured output
□ Try Extended Thinking for hard tasks
□ Pick best baseline before optimization
Phase 3: Eval Design
Mandatory components:
- Train set (10-30 examples): for iteration
- Holdout set (30-100): for final eval, never see during dev
- Adversarial set (10-30): edge cases, format変异, malicious inputs
- Metric: pass/fail or scored function
- Eval pipeline: scriptable, repeatable
# eval_template.py
def evaluate(prompt_fn, dataset):
results = []
for example in dataset:
try:
pred = prompt_fn(example["input"])
score = metric(example["expected"], pred)
except Exception as e:
score = 0
results.append({"example": example, "pred": pred, "score": score})
return {
"mean_score": sum(r["score"] for r in results) / len(results),
"failures": [r for r in results if r["score"] < 1],
}
Phase 4: Iteration
Iteration loop:
1. Pick a failure case
2. Hypothesize: why did it fail?
3. Modify prompt (or model/temp)
4. Re-eval on training set + previous failures
5. If global score improved, promote
6. Else revert
避免over-fitting:
- 始终保留holdout没见过
- 每5轮 iteration跑一次holdout看是否真improve
Phase 5: Safety Review
Per Day 131:
- Red team: 20+ attack attempts
- Input filter for known patterns
- Output filter for forbidden topics
- If agent: tool permissions minimal + audit log
- CAI/second-pass for high-stake routes
Phase 6: Production Deployment
Checklist:
- Prompt cached (cache_control on stable parts)
- Model fallback chain configured
- Monitoring: cost, latency, stop_reason, cache_hit_ratio
- Alerting: error spike, cost spike, latency p99 anomaly
- Versioning: prompt_id with version, can rollback
- Documentation: runbook for incidents
Phase 7: Continuous Eval
# regression_test.py (run nightly)
"""
对prompt regression: 每晚跑holdout set
新model release也跑一次
"""
GOLDEN_SET = load_dataset("golden_set.jsonl")
for prompt_version in ACTIVE_PROMPTS:
for model in CANDIDATE_MODELS:
score = evaluate(make_fn(prompt_version, model), GOLDEN_SET)
push_metric_to_observability(prompt_version, model, score)
三、Prompt设计Checklist
整合本周所有tips:
结构
- System prompt定义role + scope
- 用XML tag标记 sections (Anthropic preference)
- Examples放前部, query放最后
- 关键instruction头尾各放一遍 (anti lost-in-the-middle)
- 长context用cache_control
内容
- Output format明确说明(JSON schema, 长度, 风格)
- Edge cases处理("if no data found, output null")
- 错误处理 ("if uncertain, say so")
- Citations / sources (RAG/document场景)
- Refuse pattern (out-of-scope)
参数
- T=0 for structured/factual
- T=0.7+ for creative
- max_tokens适当上限
- stop_sequences for structured
- Tools API for严格schema
性能
- Prefix cached (system prompt + KB)
- Tools定义cached (if 大)
- 选最便宜的能干活的model
- Extended thinking only when needed
- Batch API for非紧急任务(50% off)
安全
- Input wrapped in
<user_query> - System prompt防override (explicit instruction)
- Output filter for sensitive topics
- Tool permissions minimal
- Audit log每个decision
四、代码:完整Prompt工程pipeline
# prompt_engineering_pipeline.py
"""
Production-quality prompt management
"""
import anthropic
import time
import json
import logging
from dataclasses import dataclass, field
from typing import Callable, List
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
client = anthropic.Anthropic()
@dataclass
class PromptConfig:
name: str
version: str
model: str
system: str
temperature: float = 0.0
max_tokens: int = 1024
use_cache: bool = True
use_thinking: bool = False
thinking_budget: int = 0
tools: list = field(default_factory=list)
@dataclass
class CallResult:
text: str
structured: dict | None
latency_s: float
input_tokens: int
output_tokens: int
cache_read: int
cache_write: int
cost_usd: float
stop_reason: str
model: str
PRICES = {
"claude-opus-4-7": (15.0, 75.0),
"claude-sonnet-4-6": (3.0, 15.0),
"claude-haiku-4-5": (0.8, 4.0),
}
def call(config: PromptConfig, user_message: str, **extra) -> CallResult:
"""统一调用入口 — 包含cache、监控、cost计算"""
system_blocks = [{"type": "text", "text": config.system}]
if config.use_cache and len(config.system) > 1024 * 4: # >~1024 tokens
system_blocks[0]["cache_control"] = {"type": "ephemeral"}
kwargs = dict(
model=config.model,
max_tokens=config.max_tokens,
temperature=config.temperature,
system=system_blocks,
messages=[{"role": "user", "content": user_message}],
)
if config.use_thinking:
kwargs["thinking"] = {"type": "enabled",
"budget_tokens": config.thinking_budget}
if config.tools:
kwargs["tools"] = config.tools
t0 = time.time()
resp = client.messages.create(**kwargs)
latency = time.time() - t0
# extract text
text = ""
structured = None
for block in resp.content:
if block.type == "text":
text += block.text
elif block.type == "tool_use":
structured = block.input
# cost
p_in, p_out = PRICES.get(config.model, (0, 0))
cache_w = getattr(resp.usage, "cache_creation_input_tokens", 0)
cache_r = getattr(resp.usage, "cache_read_input_tokens", 0)
base_in = resp.usage.input_tokens - cache_w - cache_r
cost = (base_in * p_in + cache_w * p_in * 1.25 + cache_r * p_in * 0.1
+ resp.usage.output_tokens * p_out) / 1e6
result = CallResult(
text=text,
structured=structured,
latency_s=latency,
input_tokens=resp.usage.input_tokens,
output_tokens=resp.usage.output_tokens,
cache_read=cache_r,
cache_write=cache_w,
cost_usd=cost,
stop_reason=resp.stop_reason,
model=resp.model,
)
# log for observability
log.info(json.dumps({
"prompt_id": f"{config.name}@{config.version}",
"model": resp.model,
"input_tokens": result.input_tokens,
"output_tokens": result.output_tokens,
"cache_read": cache_r, "cache_write": cache_w,
"latency_s": result.latency_s,
"cost_usd": result.cost_usd,
"stop_reason": resp.stop_reason,
}))
return result
# Eval framework
@dataclass
class EvalResult:
score: float
passed: int
total: int
failures: list
def eval_prompt(config: PromptConfig, dataset, metric_fn) -> EvalResult:
passed = 0
failures = []
total_cost = 0
for ex in dataset:
result = call(config, ex["input"])
total_cost += result.cost_usd
ok = metric_fn(ex["expected"], result.text or result.structured)
if ok:
passed += 1
else:
failures.append({
"input": ex["input"],
"expected": ex["expected"],
"got": result.text or result.structured
})
score = passed / max(len(dataset), 1)
log.info(f"Eval: {passed}/{len(dataset)} = {score:.0%}, cost=${total_cost:.4f}")
return EvalResult(score=score, passed=passed,
total=len(dataset), failures=failures)
# Example用法
if __name__ == "__main__":
config_v1 = PromptConfig(
name="financial_extractor",
version="v1",
model="claude-haiku-4-5",
system="Extract revenue from financial text. Output number only.",
temperature=0.0,
max_tokens=100,
)
dataset = [
{"input": "Q3 2026 revenue was $94.5B.", "expected": "94.5"},
{"input": "Reported $62 billion in revenue.", "expected": "62"},
]
def metric(expected, got):
try:
return abs(float(got.replace("$", "").replace("B", "").strip()) - float(expected)) < 0.5
except (ValueError, AttributeError):
return False
eval_v1 = eval_prompt(config_v1, dataset, metric)
print(f"v1 score: {eval_v1.score:.0%}")
五、Anthropic API最佳实践(综合)
5.1 Production deployment template
# prod_anthropic.py
import anthropic
from anthropic import APITimeoutError, RateLimitError, APIConnectionError
client = anthropic.Anthropic(
timeout=30.0, # 不要默认10分钟
max_retries=3, # SDK内置retry
)
# Messages Batches API for非urgent
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"req-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"messages": [{"role": "user", "content": q}]
}
}
for i, q in enumerate(queries)
]
)
# ~50% cost savings vs sync API
5.2 Final reference: 全部feature combined
client.messages.create(
model="claude-opus-4-7",
max_tokens=4096,
temperature=0.0, # 锁定 (除非creative)
system=[
{"type": "text", "text": SHORT_INSTR},
{"type": "text", "text": LARGE_KB,
"cache_control": {"type": "ephemeral", "ttl": "1h"}}
],
thinking={"type": "enabled", "budget_tokens": 16000}, # 难任务
tools=TOOLS, # function calling
tool_choice={"type": "auto"},
messages=[
{"role": "user", "content": [
{"type": "document", "source": {"type": "file", "file_id": "..."},
"citations": {"enabled": True}},
{"type": "image", "source": {"type": "url", "url": "..."}},
{"type": "text", "text": user_question}
]}
],
extra_headers={"anthropic-beta": "files-api-2025-04-14"}
)
六、金融领域应用:Putting it all together
案例:完整RFP分析system
需求:律所每周收10份RFP(200-500页),AI做initial review。
Architecture:
RFP Upload → Files API (cache 1h)
↓
Stage 1: Haiku — extract metadata (parties, date, scope)
↓
Stage 2: Sonnet + thinking — identify risk clauses, citations enabled
↓
Stage 3: Opus + thinking 32K — deep liability analysis
↓
CAI second-pass — compliance review
↓
Human Reviewer (final approval)
经济:
- Stage 1: $0.05 per RFP
- Stage 2: $1.50 per RFP (cached after first reuse)
- Stage 3: $15.00 per RFP
- Total: ~$17/RFP,vs 律师2-4小时 ($1500+) = 100x cost reduction
Quality safeguards:
- Each stage's output stored with citations
- Audit log for compliance
- Failure mode: escalate to human if thinking confidence low
- Weekly regression eval on 50 historical RFPs
七、常见陷阱(综合版Top 10)
- 没有eval set:所有迭代都是vibes-based,无法证明improvement。先建eval set再写prompt。
- 盲信新model:Claude 4.6→4.7升级后可能regression on you特定任务。强制eval before migrate。
- 不监控cost:开发期cost小看不出来,production上线后炸账单。day 1就加cost dashboard。
- Prompt注入忽视:所有input都可能含attack。至少做input wrap + output filter。
- Cache miss不知道:whitespace差异导致全部miss,cost飙升。监控cache_read_ratio。
- Thinking开太多:不需要thinking的任务也开,cost+latency爆。A/B test thinking on/off。
- Multi-vendor没规划:单一provider lock-in。至少LiteLLM抽象 + 1 fallback。
- No versioning:prompt改完没记录,bug回滚不了。Git管理prompts。
- Tools description草率:模型乱call。Tool description≥50字,明确何时用何时不用。
- 以为Prompt能解决所有问题:架构层面问题(如RAG retrieval差)prompt救不了。Prompt不是银弹。
八、关键速查(终极cheatsheet)
========== ANTHROPIC PROMPT ENGINEERING SOP ==========
INPUT BLOCKS:
- system: role + scope + constitution + KB (cache long parts)
- messages: history (cache stable parts) + current
- tools: (cache if many)
OUTPUT CONTROL:
- temperature: 0 (default) | 0.7 (creative)
- max_tokens: budget (don't omit)
- stop_sequences: list (for structured)
- tools + tool_choice: for strict schema
OPTIMIZATION:
- prompt_caching cache_control: 5min default, 1h for hot KB
- thinking budget_tokens: 1K-32K based on difficulty
- batch API: 50% off async tasks
MULTIMODAL:
- image: base64 / URL / file_id
- document (PDF): up to 100 pages, citations
- pre-resize images to <1568px长边
SAMPLING (per task):
- factual / structured: T=0
- creative: T=0.7-1.0
- code: T=0
- summarization: T=0.3-0.5
SECURITY:
- system prompt anti-override
- input XML wrap
- output filter forbidden
- tool permission minimal
- audit log every decision
OBSERVABILITY:
- track input_tokens, output_tokens
- track cache_creation, cache_read
- track stop_reason distribution
- track latency p50/p95/p99
- per-model cost split
EVAL:
- holdout set (never train on)
- adversarial set (red team)
- metric scriptable
- nightly regression
- A/B before promote
DEPLOYMENT:
- prompt versioned in git
- model fallback chain
- rate limit handling
- error retry with backoff
- canary rollout new prompts
==================================================
九、面试题(综合)
Q1: 你怎么从0到1做一个production-grade LLM应用?
(1) Spec: 任务定义 + success metric。(2) Eval set: 30+ examples + adversarial。(3) Prototype: zero-shot baseline。(4) Iterate with eval. (5) Optimize cost: caching, model tier。(6) Safety review: red team. (7) Deploy: monitoring, fallback, versioning。(8) Continuous eval: regression suite, A/B for new variants/models.
Q2: 一个prompt精度从80%到99%要做什么?
取决于gap nature:(a) Mode confusion → few-shot。(b) Reasoning errors → CoT or thinking。(c) Format inconsistency → Tools API。(d) Variance → self-consistency。(e) Specific failures → DSPy auto-tune。(f) 模型ceiling → upgrade model。关键:先归因,再针对性fix。
Q3: 你最关心一个LLM应用的什么指标?
三角:quality (eval score)、cost ($/query)、latency (p95)。Production还要看:cache hit ratio (cost攻击器)、stop_reason分布 (truncation警报)、retry rate (vendor health)、user feedback (proxy for真实满意度)。
Q4: Prompt engineering会被取代吗?
部分会。DSPy等auto-optimization会接管"调措辞"工作。但策略层不会:spec design、eval design、failure analysis、safety review、cost optimization——这些是engineering judgement,自动化短期不能替代。Prompt engineer→AI engineer/AI architect是自然演进。
Q5: Claude对其他model最大的competitive moat是什么?
三个:(1) Constitutional AI默认更安全(compliance-critical场景默认选Claude)。(2) Tool use + thinking + parallel最成熟(agent loop首选)。(3) 长文档/PDF处理 + Citations(research/legal场景)。但moat不稳——OpenAI、Google都在追赶。生产应避免over-rely任何vendor。
十、Phase 3 Day 121-134 总结
两周覆盖:
- 理论基础:Transformer、Scaling、Tokenization、Sampling
- 工程方法:Prompt patterns、Structured output、Auto-optimization、Multimodal、Long context
- 安全治理:Prompt safety、Constitutional AI
- 工程实践:Model selection、SOP
核心take-aways:
- LLM从黑盒 → 白盒(理解Transformer + sampling让你debug LLM)
- Prompt → Software(version, test, monitor)
- Cost是first-class concern(cache, model tier, batch)
- Safety不是事后想(constitution + red team从day 1)
- Multi-vendor是必需(lock-in风险real)
- Anthropic Claude 4.7是current 旗舰但不是唯一答案(task driven)
下两周(Day 135-148, Phase 3继续)预告:
- RAG架构与向量数据库
- Embedding model对比
- Hybrid search (BM25 + dense)
- Agentic patterns (ReAct, Plan-and-Execute)
- Tool design philosophy
- Memory systems
十一、明日预告
Day 135: RAG基础 — Embedding、Vector DB、Retrieval pipeline。Phase 3 Week 21开始。