Expert Day 171
Versioning and CI: Prompt Versioning and PromptOps
2026-10-19
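Date: 2026-10-19
Track: AI Systems Engineering / LLMOps / DevOps
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #PromptOps #CICD #PromptRegistry #Versioning #Rollout
Today's Goals
| Type | Content |
|---|---|
| Learn | Managing prompts as code; prompt registry vs git; semantic versioning; gradual rollout / canary; rollback; integration with eval/observability |
| Practice | Build a complete PromptOps pipeline: git → CI runs eval → publish to Langfuse registry → gradual rollout |
| Deliverable | docs/ai-infra/ci_pipeline.yml: complete CI config + prompt management code |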
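1. Core Concepts
1.1 Prompts Are Software Assets
| Dimension | Ordinary config | Prompt |
|---|---|---|
| Affects system behavior | ✓ | ✓✓✓ (fully determines it) |
| Likelihood of bugs | Medium | High (implicit, hard to test) |
| Multi-person collaboration | Yes | Yes |
| Needs versioning | Yes | Mandatory |
| Needs review | Yes | Mandatory (PM/SME/AI Eng) |
| Needs gradual rollout | Occasionally | Mandatory (a change affects the behavior of 100% of traffic) |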
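1.2 Two Prompt Management Paradigms (and a Hybrid)
| Paradigm | Storage | Switching | Recommended for |
|---|---|---|---|
| Code-centric (in git) | .txt / .md / .py strings | Goes through the code release process | Simple teams |
| Registry-centric (Langfuse / Promptfoo) | DB + version number | Hot swap without restarting the service | Mid-to-large teams that need A/B |
| Hybrid | git as primary source + sync to registry | git is the source of truth | Production (recommended) |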
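The hybrid row only works if the registry never drifts away from git. A minimal sketch of a drift check, assuming both sides can be dumped to a `{prompt_id: text}` mapping; `fingerprint`, `find_drift`, and the mapping shape are hypothetical names for illustration, not part of any registry SDK.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the prompt body, ignoring leading/trailing whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def find_drift(git_prompts: dict[str, str], registry_prompts: dict[str, str]) -> dict[str, list[str]]:
    """Compare the git copy (source of truth) with the registry copy."""
    git_ids, reg_ids = set(git_prompts), set(registry_prompts)
    mismatched = sorted(
        pid for pid in git_ids & reg_ids
        if fingerprint(git_prompts[pid]) != fingerprint(registry_prompts[pid])
    )
    return {
        "missing_in_registry": sorted(git_ids - reg_ids),   # never synced
        "unknown_in_registry": sorted(reg_ids - git_ids),   # created outside git
        "content_mismatch": mismatched,                     # edited on one side only
    }
```

Running a check like this on a schedule turns "do not let the two sides decouple" into an alert instead of a convention.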
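1.3 Semantics of Prompt Version Numbers
Modeled on SemVer, but with different semantics:
v MAJOR . MINOR . PATCH
v   2   .   3   .   1
MAJOR: breaking behavior change (output format changed, new required field added)
MINOR: backward-compatible behavior improvement (few-shot examples added, wording improved)
PATCH: no behavior change (typos, comments)
Key point: a MAJOR upgrade must come with a dataset migration and a downstream review.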
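The rules above can be sketched as a small helper that computes the next version from the declared upgrade type and flags when the MAJOR rule applies. The function names `bump_version` and `requires_migration` are hypothetical.

```python
def bump_version(current: str, upgrade_type: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given upgrade type."""
    major, minor, patch = (int(part) for part in current.split("."))
    if upgrade_type == "major":
        return f"{major + 1}.0.0"
    if upgrade_type == "minor":
        return f"{major}.{minor + 1}.0"
    if upgrade_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown upgrade type: {upgrade_type!r}")


def requires_migration(old: str, new: str) -> bool:
    """A MAJOR bump needs a dataset migration and a downstream review."""
    return int(new.split(".")[0]) > int(old.split(".")[0])


print(bump_version("2.3.1", "minor"))        # 2.4.0
print(requires_migration("2.3.1", "3.0.0"))  # True
```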
1.4 Gradual Rollout Strategy
v1.0 (100%)
│
▼ new v2.0 passes CI eval
│
v1.0 (95%) + v2.0 (5%) ← canary for 24h, monitor metrics
│
▼ user_score does not drop, cost does not rise, error rate does not rise
│
v1.0 (50%) + v2.0 (50%) ← 24h
│
▼ all clear
│
v2.0 (100%)
│
▼ keep v1.0 hot for 7 days so rollback takes seconds
│
deprecated
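The flow above is a small state machine: each stage either advances or rolls back depending on the gate. A minimal sketch, assuming the three stages from the diagram (5%, 50%, 100%) and metric deltas computed as new minus old; `Metrics`, `gate_passed`, and `next_stage` are hypothetical names.

```python
from dataclasses import dataclass

STAGES = [5, 50, 100]  # percent of traffic on the new version, per the diagram


@dataclass
class Metrics:
    user_score_delta: float  # new minus old; negative means users are less happy
    cost_delta: float        # new minus old; positive means more expensive
    error_delta: float       # new minus old; positive means more errors


def gate_passed(m: Metrics) -> bool:
    """user_score does not drop, cost does not rise, error rate does not rise."""
    return m.user_score_delta >= 0 and m.cost_delta <= 0 and m.error_delta <= 0


def next_stage(current_pct: int, m: Metrics) -> int:
    """Return the next traffic percentage; 0 means roll back to the old version."""
    if not gate_passed(m):
        return 0
    idx = STAGES.index(current_pct)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

In practice the strict zero thresholds would likely become tolerances, since small metric differences between two traffic slices are often noise.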
2. Production Architecture Diagram
Developer ── git push prompt change ──→ GitHub
                                          │
                                          ▼
                                     ┌──────────┐
                                     │    CI    │
                                     │ (Day 169)│
                                     │   eval   │
                                     └────┬─────┘
                                          │ pass
                                          ▼
                              auto-publish to Langfuse
                              (registry, status=draft)
                                          │
                          human approve → status=staging
                                          │
                               ┌──────────┴──────────┐
                               ▼                     ▼
                          App fetches           Eval against
                          on cold start         full golden
                               │
                        status=production?
                               │
                      ┌────────┴───────┐
                      ▼                ▼
                  Canary 5%         Old 95%
                      │                │
               monitor metrics ─┐      │
                                ▼      │
                         promote / rollback
3. Code Implementation
3.1 Prompt File Layout (git source)
prompts/
├── manifest.yaml              # master index
├── customer_support/
│   ├── system_v2.3.0.md       # currently in production
│   ├── system_v2.4.0.md       # in an open PR
│   ├── tests/
│   │   ├── test_basic.json
│   │   └── test_compliance.json
│   └── README.md              # owner / changelog
└── credit_decision/
    ├── system_v1.5.2.md
    └── ...
# prompts/manifest.yaml
prompts:
  customer_support:
    owner: pm-jane
    sme: compliance-li
    current_prod_version: 2.3.0
    eval_dataset: golden/customer_support_v1.1.0.json
    min_pass_rate: 0.92
    max_avg_cost_per_req: 0.005
    max_p95_latency_s: 2.0
  credit_decision:
    owner: pm-bob
    sme: risk-zhao
    current_prod_version: 1.5.2
    eval_dataset: golden/credit_v1.0.0.json
    min_pass_rate: 1.0  # strict
    max_avg_cost_per_req: 0.05
    max_p95_latency_s: 4.0
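The thresholds in the manifest only matter if CI enforces them. A minimal sketch of such a gate, assuming the eval report is a plain dict: the keys `rate` and `p95_latency_s` follow the report fields read by the PR comment step in 3.6, while `avg_cost_per_req` and the function name `check_gates` are hypothetical.

```python
def check_gates(report: dict, thresholds: dict) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    failures = []
    if report["rate"] < thresholds["min_pass_rate"]:
        failures.append(
            f"pass rate {report['rate']:.3f} < {thresholds['min_pass_rate']}"
        )
    if report["avg_cost_per_req"] > thresholds["max_avg_cost_per_req"]:
        failures.append(
            f"avg cost {report['avg_cost_per_req']} > {thresholds['max_avg_cost_per_req']}"
        )
    if report["p95_latency_s"] > thresholds["max_p95_latency_s"]:
        failures.append(
            f"p95 latency {report['p95_latency_s']}s > {thresholds['max_p95_latency_s']}s"
        )
    return failures


# Thresholds copied from the credit_decision entry of the manifest
credit = {"min_pass_rate": 1.0, "max_avg_cost_per_req": 0.05, "max_p95_latency_s": 4.0}
print(check_gates({"rate": 0.99, "avg_cost_per_req": 0.02, "p95_latency_s": 3.1}, credit))
```

Returning all violations at once, rather than failing on the first, keeps the CI log useful when a change regresses several metrics together.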
3.2 Extracting Prompt Versions (used by CI)
"""prompt_version.py: extract metadata from a markdown prompt file"""
import re
from pathlib import Path

import yaml


def parse_prompt(path: Path) -> dict:
    """Parse optional YAML frontmatter plus the prompt body."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # "---<frontmatter>---<body>": split on the first two delimiters only
        _, fm, body = text.split("---", 2)
        meta = yaml.safe_load(fm) or {}  # empty frontmatter parses to None
    else:
        meta = {}
        body = text
    # The version in the file name takes precedence over the frontmatter value
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", path.name)
    if m:
        meta["version"] = ".".join(m.groups())
    return {"meta": meta, "body": body.strip(), "path": str(path)}
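Example frontmatter:
---
prompt_id: customer_support.system
version: 2.4.0
upgrade_type: minor
parent_version: 2.3.0
changelog: |
  - Added 5 few-shot examples for common credit card questions
  - Fixed the tone of the "early repayment" Q&A
authors: [pm-jane]
reviewers: [compliance-li]
---
You are a customer service assistant for a commercial bank.
# Behavior guidelines
1. ...
# Few-shot examples
...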
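The CI in 3.6 calls `scripts/validate_prompts.py`, which is not shown. A minimal sketch of the checks such a validator might run on the parsed frontmatter dict. The rule set (required fields, version greater than parent, `upgrade_type` matching the actual bump, at least one independent reviewer) is an assumption, and `validate_meta`, `changed_level`, and `as_tuple` are hypothetical names.

```python
import re

REQUIRED = ("prompt_id", "version", "upgrade_type", "parent_version",
            "changelog", "authors", "reviewers")
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def as_tuple(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def changed_level(parent: str, version: str) -> str:
    """Name the most significant component that differs between two versions."""
    for level, old, new in zip(("major", "minor", "patch"), as_tuple(parent), as_tuple(version)):
        if old != new:
            return level
    return "none"


def validate_meta(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the frontmatter is valid."""
    errors = [f"missing field: {key}" for key in REQUIRED if not meta.get(key)]
    if errors:
        return errors
    version, parent = str(meta["version"]), str(meta["parent_version"])
    errors += [f"not MAJOR.MINOR.PATCH: {v}" for v in (version, parent) if not SEMVER.match(v)]
    if errors:
        return errors
    if as_tuple(version) <= as_tuple(parent):
        errors.append("version must be greater than parent_version")
    elif changed_level(parent, version) != meta["upgrade_type"]:
        errors.append(f"upgrade_type {meta['upgrade_type']!r} does not match {parent} -> {version}")
    if set(meta["reviewers"]) <= set(meta["authors"]):
        errors.append("at least one reviewer must not be an author")
    return errors
```

Checking `upgrade_type` against the real version difference catches the common case where a breaking change is labeled as a minor bump.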
3.3 Prompt Registry Sync (git → Langfuse)
"""publish_prompt.py: called after CI eval passes; syncs a prompt to the Langfuse registry"""
import sys
from pathlib import Path

from langfuse import Langfuse

from prompt_version import parse_prompt

lf = Langfuse()


def publish(prompt_path: str, status: str = "staging") -> int:
    p = parse_prompt(Path(prompt_path))
    name = p["meta"]["prompt_id"]
    version = p["meta"]["version"]
    # Langfuse assigns its own integer version numbers, so the semantic version
    # is tracked through the f"v{version}" label and the commit message instead.
    try:
        existing = lf.get_prompt(name, label=f"v{version}")
        print(f"Already published: {name} {version}")
        return existing.version
    except Exception:
        pass  # no version carries this label yet, so publish it
    git_sha = p["meta"].get("git_sha", "unknown")[:7]
    created = lf.create_prompt(
        name=name,
        prompt=p["body"],
        # Assumption: labels allow only lowercase letters, digits, "_", "-" and ".",
        # so the git SHA is attached as "git-<sha>" rather than "git_sha:<sha>".
        labels=[status, f"v{version}", f"git-{git_sha}"],
        commit_message=version,
        config={
            "model": "claude-sonnet-4-6",
            "temperature": 0.0,
            "max_tokens": 1024,
        },
    )
    print(f"Published: {name} {version} as {status}")
    return created.version  # Langfuse's own integer version of this prompt


if __name__ == "__main__":
    publish(sys.argv[1], status=sys.argv[2] if len(sys.argv) > 2 else "staging")
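3.4 Fetching Prompts in the Application (with fallback)
"""app_prompt.py"""
import logging
import time
from functools import lru_cache
from pathlib import Path
from types import SimpleNamespace

import anthropic
from langfuse import Langfuse

lf = Langfuse()
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
log = logging.getLogger(__name__)


@lru_cache(maxsize=128)
def _get_prompt_cached(name: str, label: str, bucket: int):
    """bucket changes once per minute, so a new cache key forces a refetch."""
    return lf.get_prompt(name, label=label)


def get_prompt(name: str, label: str = "production"):
    """60s cache; falls back to the local copy when the registry is unreachable."""
    try:
        return _get_prompt_cached(name, label, bucket=int(time.time()) // 60)
    except Exception:
        # Fallback: the last successfully fetched prompt, stored locally
        return _load_local_fallback(name, label)


def _load_local_fallback(name: str, label: str):
    # Synced to local disk at startup as a last line of defense
    text = Path(f"./prompts_fallback/{name}_{label}.md").read_text(encoding="utf-8")
    # Mirror the attributes used below so callers need no special case
    return SimpleNamespace(name=name, version="local-fallback", prompt=text)


# Usage:
def chat_with_prompt(question: str) -> str:
    p = get_prompt("customer_support.system", label="production")
    # Anthropic call
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=p.prompt,  # the actual prompt text
        messages=[{"role": "user", "content": question}],
    )
    # Anthropic's request `metadata` field is documented only for `user_id`, so the
    # prompt identity is logged here; attach it to your trace at the same point.
    log.info("prompt_name=%s prompt_version=%s", p.name, p.version)
    return r.content[0].text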
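The cache in `_get_prompt_cached` expires by putting a time bucket (`int(time.time()) // 60`) into the `lru_cache` key. The sketch below isolates that trick with an injectable clock so the expiry behavior can be observed without waiting; `make_ttl_cached` and its parameters are hypothetical names.

```python
import time
from functools import lru_cache
from typing import Callable


def make_ttl_cached(
    fetch: Callable[[str], str],
    ttl_s: int = 60,
    clock: Callable[[], float] = time.time,
) -> Callable[[str], str]:
    """Wrap fetch so each key is re-fetched at most once per ttl_s-second window."""

    @lru_cache(maxsize=128)
    def _cached(key: str, bucket: int) -> str:
        # bucket is unused here; it only makes the cache key change over time
        return fetch(key)

    def get(key: str) -> str:
        return _cached(key, int(clock()) // ttl_s)

    return get


# Demo with a fake clock: the second fetch happens only after the window rolls over
calls = []
now = [0.0]
get = make_ttl_cached(lambda k: calls.append(k) or k.upper(), ttl_s=60, clock=lambda: now[0])
get("p"); now[0] = 59; get("p"); now[0] = 60; get("p")
print(len(calls))  # 2
```

Windows are aligned to the clock, not to the fetch time, so an entry fetched at second 59 is refreshed at second 60. The effective lifetime is anywhere between 0 and `ttl_s` seconds, which is fine for prompts but worth knowing during a rollback drill.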
3.5 Canary Routing
"""canary_router.py: split users between canary and stable by user-id hash"""
import hashlib

from app_prompt import get_prompt


def route_prompt_label(user_id: str, canary_pct: float = 0.05) -> str:
    """Hash user_id into a bucket so the same user always gets the same label."""
    # md5 is used only for uniform bucketing, not for security
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 10000
    if bucket < canary_pct:
        return "canary"
    return "production"


def chat(user_id: str, question: str):
    label = route_prompt_label(user_id, canary_pct=0.05)
    p = get_prompt("customer_support.system", label=label)
    # ... call the LLM ...
    # Record the label and version on the trace so the two arms can be compared later
    metadata = {"prompt_label": label, "prompt_version": p.version}
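Two properties make this router suitable for ramping: assignment is sticky per user, and raising `canary_pct` only adds users to the canary without moving anyone out. The sketch below repeats the routing function so it runs on its own, then checks both properties on synthetic user ids; `canary_share` is a hypothetical helper.

```python
import hashlib


def route_prompt_label(user_id: str, canary_pct: float = 0.05) -> str:
    """Same bucketing as canary_router.py: hash the user id into [0, 1)."""
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 10000
    return "canary" if bucket < canary_pct else "production"


def canary_share(user_ids: list[str], canary_pct: float) -> float:
    """Fraction of the given users that land in the canary arm."""
    hits = sum(route_prompt_label(u, canary_pct) == "canary" for u in user_ids)
    return hits / len(user_ids)


users = [f"user-{i}" for i in range(20_000)]

# Sticky: repeated calls for one user always agree
assert route_prompt_label("user-42") == route_prompt_label("user-42")

# Monotonic ramp: everyone in the 5% canary is still in the 50% canary
early = [u for u in users if route_prompt_label(u, 0.05) == "canary"]
assert all(route_prompt_label(u, 0.50) == "canary" for u in early)

print(f"observed canary share at 5%: {canary_share(users, 0.05):.3f}")
```

The monotonic property follows from comparing a fixed bucket value against a growing threshold, so users do not flip between versions as the rollout widens.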
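3.6 Complete CI (GitHub Actions)
# .github/workflows/promptops.yml
name: PromptOps
on:
  pull_request:
    # "labeled" is required for the approval-label jobs below to fire
    types: [opened, synchronize, reopened, labeled]
    paths: ['prompts/**']
permissions:
  contents: read
  pull-requests: write   # needed to post the PR comment
jobs:
  prompt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history so origin/main exists for git diff
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -e eval_v1 langfuse pyyaml
      - name: Detect changed prompts
        id: detect
        run: |
          # --diff-filter=AM skips deleted files, which cannot be evaluated
          changed=$(git diff --name-only --diff-filter=AM origin/main HEAD -- 'prompts/**/*.md' | tr '\n' ' ')
          echo "files=$changed" >> "$GITHUB_OUTPUT"
      - name: Validate prompt frontmatter
        run: python scripts/validate_prompts.py ${{ steps.detect.outputs.files }}
      - name: Run eval against golden set
        run: |
          for f in ${{ steps.detect.outputs.files }}; do
            # prompt_id lives in the markdown frontmatter (mikefarah yq v4 syntax)
            prompt_id=$(yq --front-matter=extract '.prompt_id' "$f")
            key=${prompt_id%%.*}   # manifest.yaml is keyed by the first segment
            golden=$(yq ".prompts.${key}.eval_dataset" prompts/manifest.yaml)
            min_rate=$(yq ".prompts.${key}.min_pass_rate" prompts/manifest.yaml)
            python -m eval run --golden "$golden" --target "$f" --min-pass-rate "$min_rate"
          done
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Publish to Langfuse (staging label)
        if: success()
        run: |
          for f in ${{ steps.detect.outputs.files }}; do
            python scripts/publish_prompt.py "$f" staging
          done
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
      - name: Generate PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const r = JSON.parse(fs.readFileSync('reports/last.json', 'utf8'));
            const body = `## PromptOps CI ✅
            - **Pass rate**: ${(r.rate*100).toFixed(1)}%
            - **P95 latency**: ${r.p95_latency_s.toFixed(2)}s
            - **Cost**: $${(r.cost).toFixed(3)}
            - **Status**: ⚠️ Published to staging. Approval required to promote to canary.`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
  promote-canary:
    needs: prompt-ci
    if: github.event.action == 'labeled' && github.event.label.name == 'approved-canary'
    runs-on: ubuntu-latest
    env:
      LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
      LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
      LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
    steps:
      - uses: actions/checkout@v4   # the scripts live in the repo
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install langfuse pyyaml
      - run: python scripts/promote.py --label canary --pct 5
      - run: |
          # Schedule the follow-up check
          echo "Canary started at $(date). Auto-promote check in 24h."
  promote-prod:
    needs: prompt-ci
    if: github.event.action == 'labeled' && github.event.label.name == 'approved-prod'
    runs-on: ubuntu-latest
    env:
      LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
      LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
      LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install langfuse pyyaml
      - name: Verify canary metrics
        run: python scripts/verify_canary.py --hours 24 --min-score-delta 0
      - run: python scripts/promote.py --label production --pct 100
      - run: python scripts/keep_old_hot.py --days 7  # keep the old prompt for 7 days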
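3.7 Canary Monitoring Script
"""verify_canary.py: check whether 24h of canary metrics are acceptable"""
import argparse
import sys
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

lf = Langfuse()

# avg_score(traces, score_name) and avg_cost(traces) are project helpers that the
# source does not show; they must be defined or imported before this script runs.


def check(hours: int, min_score_delta: float = 0) -> None:
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    # Assumption: a v2-style SDK whose get_traces returns one page of results in
    # `.data`; other SDK versions expose this query under a different call.
    canary_traces = lf.get_traces(tags=["canary"], from_timestamp=start, to_timestamp=end).data
    prod_traces = lf.get_traces(tags=["production"], from_timestamp=start, to_timestamp=end).data
    canary_score = avg_score(canary_traces, "user_thumbs")
    prod_score = avg_score(prod_traces, "user_thumbs")
    print(f"canary score: {canary_score:.3f} ({len(canary_traces)} traces)")
    print(f"prod score: {prod_score:.3f} ({len(prod_traces)} traces)")
    delta = canary_score - prod_score
    if delta < min_score_delta:
        print(f"❌ score regressed by {-delta:.3f}, blocking promotion")
        sys.exit(1)
    # Cost is gated as well (latency would follow the same pattern)
    canary_cost = avg_cost(canary_traces)
    prod_cost = avg_cost(prod_traces)
    if canary_cost > prod_cost * 1.10:
        print(f"❌ cost +{(canary_cost/prod_cost-1)*100:.1f}%, blocking")
        sys.exit(1)
    print("✅ canary clean, ok to promote")


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--hours", type=int, default=24)
    ap.add_argument("--min-score-delta", type=float, default=0.0)
    args = ap.parse_args()
    check(args.hours, args.min_score_delta)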
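`check` relies on `avg_score` and `avg_cost`, which are not shown. A minimal sketch of what they might look like, assuming each trace has already been flattened into a plain dict with a `total_cost` number and a `scores` list of `{"name", "value"}` items. That dict shape is hypothetical and is not the response format of the Langfuse SDK, so an adapter would be needed in front of it.

```python
from statistics import mean


def avg_score(traces: list[dict], score_name: str) -> float:
    """Mean value of the named score across all traces that carry it."""
    values = [
        score["value"]
        for trace in traces
        for score in trace.get("scores", [])
        if score["name"] == score_name
    ]
    if not values:
        # Refuse to compare arms when one of them has no feedback at all
        raise ValueError(f"no {score_name!r} scores found")
    return mean(values)


def avg_cost(traces: list[dict]) -> float:
    """Mean cost per trace, skipping traces without cost data."""
    costs = [t["total_cost"] for t in traces if t.get("total_cost") is not None]
    if not costs:
        raise ValueError("no cost data found")
    return mean(costs)


sample = [
    {"total_cost": 0.004, "scores": [{"name": "user_thumbs", "value": 1}]},
    {"total_cost": 0.006, "scores": [{"name": "user_thumbs", "value": 0}]},
]
print(avg_score(sample, "user_thumbs"))  # 0.5
```

Raising on empty input is deliberate: a canary that collected no scores should block promotion rather than pass with a meaningless average.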
4. Cost & Performance: Measured Data
| Stage | Traffic | Duration | Metrics monitored |
|---|---|---|---|
| Staging | Internal dogfooding | 1-3 days | functional |
| Canary | 5% | 24h | user_score, cost, latency, error_rate |
| Ramp | 25% / 50% / 75% | 24h each | Same as above |
| Production | 100% | - | Long-term monitoring |
| Hot rollback ready | - | 7 days | Old version retained |
Field data from one team's prompt change, prod_v3 → prod_v4:
- CI deterministic eval over 60 cases: 97% pass, $0.03
- CI judge vs v3: 58% win rate, $1.80
- Canary for 24h (5% of traffic): user_score +2.1%, cost -7%, P95 latency +50ms (within tolerance)
- v4 rolled out to 100%, saving roughly $420 per month
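5. Applications in Finance
- Prompts are compliance assets: every production prompt must be signed off by an SME (compliance / risk / legal), and review follows a formal process
- Prompt diff audit: which words changed from v2 → v3, whether newly added few-shot examples are compliant, a mandatory changelog, all retained for 6 years
- No "auto-publish" in finance: CI may only publish to staging; promotion requires a human-applied approval label
- Rollback drills: run a rollback drill every month (switch production back to the old version) to verify that hot rollback really completes in < 1 minute
- Multi-tenant prompt isolation: private banking / retail / corporate each have their own prompt repository, with no cross-contamination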
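6. Production Lessons and Pitfalls
- Prompts hardcoded in source: a one-line change is inconspicuous in a git diff and slips past code review. Prompts must live in external files
- Upgrading prompts without versions: when a bug appears, nobody knows which version introduced it or how many users are affected. Enforce semantic versioning
- CI that is too heavy: running a 5000-case judge on every PR takes an hour to report, so developers bypass CI. CI must finish in < 5min
- Too little canary traffic: below 1% the sample is too small to reveal problems. Use at least 5%
- Canary too short: is 1h enough? No. User behavior has daily and weekly cycles, so run at least 24h, and 7 days for critical financial prompts
- Caches outlive the rollback: after a rollback, the application-side prompt cache (the 60s cache in 3.4) keeps serving the rolled-back version until it expires. Anthropic's prompt cache (cache_control) matches on the exact prompt prefix, so it cannot serve the wrong prompt text, but the restored version may start with a cold cache because entries expire after 5min/1h
- Prompt files containing PII / real customer data: these must not be committed. Run a PII scanning hook in CI
- A/B data contaminated by new features: do not launch other new features during the canary, or attribution becomes muddled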
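The traffic and duration pitfalls above both come down to sample size. A rough sketch of the arithmetic, using the standard normal-approximation formula for comparing two proportions, n = (z_{1-α/2} + z_{power})² · (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)² per arm. The baseline rate, the detectable drop, and the assumption that every request yields a thumbs rating are illustrative inputs, not data from the source; `samples_per_arm` and `hours_needed` are hypothetical names.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(p_base: float, p_new: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Ratings needed in each arm to detect p_base -> p_new (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_base - p_new) ** 2)


def hours_needed(n: int, rated_requests_per_hour: float, canary_pct: float) -> float:
    """Hours until the canary arm alone has collected n ratings."""
    return n / (rated_requests_per_hour * canary_pct)


# Detecting a drop in thumbs-up rate from 80% to 78%
n = samples_per_arm(0.80, 0.78)
print(n, hours_needed(n, rated_requests_per_hour=1000, canary_pct=0.05))
```

With these illustrative inputs the canary arm needs several thousand ratings, which at 5% of 1000 rated requests per hour takes far longer than 24h. The calculation is a reminder that the 5% / 24h rule is a floor, and that low-traffic prompts need a larger canary share or a longer window.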
7. Quick Reference
| Step | Tool |
|---|---|
| Write prompt | git markdown |
| Lint | yq + custom validator |
| Eval | eval_v1 (Day 169) |
| Publish | Langfuse SDK create_prompt |
| Canary | label-based routing |
| Monitor | Langfuse score + Prometheus |
| Rollback | label switch |
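"Label switch" is what makes rollback fast: no prompt text is re-published, a pointer simply moves. A toy in-memory sketch of that idea, assuming each label points at exactly one version; `LabelRegistry` is a hypothetical class for illustration and not the Langfuse API.

```python
from typing import Optional


class LabelRegistry:
    """Toy registry: each label points at exactly one prompt version."""

    def __init__(self) -> None:
        self.labels: dict = {}   # label -> version
        self.history: list = []  # (label, previous_version, new_version)

    def move(self, label: str, version: str) -> None:
        """Point label at version, remembering what it pointed at before."""
        previous: Optional[str] = self.labels.get(label)
        self.history.append((label, previous, version))
        self.labels[label] = version

    def rollback(self, label: str) -> str:
        """Point label back at the version it had before its latest move."""
        for moved_label, previous, _ in reversed(self.history):
            if moved_label == label:
                if previous is None:
                    raise LookupError(f"{label!r} has no earlier version")
                self.move(label, previous)
                return previous
        raise LookupError(f"{label!r} was never set")


reg = LabelRegistry()
reg.move("production", "2.3.0")
reg.move("canary", "2.4.0")
reg.move("production", "2.4.0")   # promote
print(reg.rollback("production"))  # 2.3.0
```

Because the rollback is itself recorded as a move, the history doubles as an audit trail of who pointed production at what, and in which order.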
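8. Interview Questions
- Should prompts be managed in git or in a registry?
  - A hybrid is recommended: git is the source of truth and CI syncs automatically to the registry; the registry provides hot swapping, gradual rollout, and A/B capability; never let the two sides decouple
- Does a one-word prompt change still need to go through CI?
  - Yes. The behavioral effect of a "small change" on an LLM is the hardest to predict. CI should at least run the deterministic eval (finishes in 30s)
- How long must a canary run before promotion?
  - At least 24h to cover one full business cycle; 7 days for prompts behind critical financial decisions; monitor four dimensions: user_score, cost, latency, error
- Why is prompt rollback harder than code rollback?
  - Caches may keep serving the previous version for a while; downstream consumers may have already adapted to the new behavior; attribution is hard when several prompts are being rolled out at once
- How do you stop developers from editing prompts directly in prod?
  - Require two sign-offs through git code owners on prompt files + allow the Langfuse production label to be applied only through the CI publish flow + grant prod write access only to the CI service account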
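Tomorrow's Preview
Day 172: Fine-tuning Decisions (Prompt vs RAG vs FT). When is a prompt not enough, so that you need RAG? When is RAG not enough, so that you need fine-tuning? A decision framework plus a comparison experiment of the three approaches on the same task.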