Expert Day 171

Versioning and CI: Prompt Versioning and PromptOps


2026-10-19

Phase 3 - Production Infrastructure and Evaluation (Day 163-176)


Date: 2026-10-19 | Track: AI Systems Engineering / LLMOps / DevOps | Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176) | Tags: #PromptOps #CICD #PromptRegistry #Versioning #Rollout

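Today's Goals

Type	Content
Learn	Managing prompts as code; prompt registry vs git; semantic versioning; gradual rollout / canary; rollback; integration with eval and observability
Hands-on	Build a complete PromptOps pipeline: git → CI runs eval → publish to the Langfuse registry → gradual rollout
Deliverable	`docs/ai-infra/ci_pipeline.yml`: full CI config plus prompt management code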

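1. Core Concepts

1.1 Prompts Are Software Assets

Dimension	Ordinary config	Prompt
Affects system behavior	✓	✓✓✓ (fully determines it)
Likelihood of bugs	Medium	High (implicit, hard to test)
Multi-person collaboration	Yes	Yes
Needs versioning	Yes	Mandatory
Needs review	Yes	Mandatory (PM/SME/AI Eng)
Needs gradual rollout	Occasionally	Mandatory (a change affects behavior for 100% of traffic)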

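1.2 Two Paradigms for Prompt Management (Plus a Hybrid)

Paradigm	Storage	Switching	Recommended for
Code-centric (in git)	`.txt` / `.md` / `.py` strings	Goes through the code release process	Small, simple teams
Registry-centric (Langfuse / Promptfoo)	DB + version number	Hot swap without restarting the service	Mid-size and large teams; teams that need A/B testing
Hybrid	git as the primary source, synced to the registry	git is the source of truth	Production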

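1.3 Semantics of Prompt Version Numbers

Modeled on SemVer, but with different semantics:

v MAJOR . MINOR . PATCH
v 2     . 3     . 1

MAJOR: breaking behavior change (output format changed, new required field added)
MINOR: backward-compatible behavior improvement (added few-shot examples, better wording)
PATCH: no intended behavior change (typos, comments)

Key point: a MAJOR bump must come with a dataset migration and a downstream review.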
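
The bump rules above can be checked mechanically in CI. The following is a minimal sketch that only compares version numbers; the function names (`parse_version`, `bump_type`, `needs_migration_review`) are illustrative and do not come from any library.

```python
def parse_version(v: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(x) for x in v.split("."))
    return major, minor, patch


def bump_type(parent: str, new: str) -> str:
    """Classify the step from parent to new as 'major', 'minor' or 'patch'."""
    p, n = parse_version(parent), parse_version(new)
    if n <= p:
        raise ValueError(f"{new} is not newer than {parent}")
    if n[0] != p[0]:
        return "major"
    if n[1] != p[1]:
        return "minor"
    return "patch"


def needs_migration_review(parent: str, new: str) -> bool:
    """MAJOR bumps require a dataset migration and a downstream review."""
    return bump_type(parent, new) == "major"
```

A CI check could compare `bump_type(parent_version, version)` with the declared `upgrade_type` and fail the PR when they disagree.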

1.4 Gradual Rollout Strategy

v1.0 (100%)
  │
  ▼ new v2.0 passes CI eval
  │
v1.0 (95%) + v2.0 (5%)   ← canary for 24h, watch metrics
  │
  ▼ user_score not down, cost not up, errors not up
  │
v1.0 (50%) + v2.0 (50%)  ← 24h
  │
  ▼ all clear
  │
v2.0 (100%)
  │
  ▼ keep v1.0 hot for 7 days so rollback takes seconds
  │
deprecated
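
The flow above amounts to a small state machine: advance one stage while the canary is healthy, otherwise drop the new version to 0%. A minimal sketch, assuming the three metric deltas are computed elsewhere as "new version minus old version"; the names `STAGES`, `MetricDelta` and `next_pct` are hypothetical.

```python
from dataclasses import dataclass

# Traffic share of the new version at each stage, as in the flow above
STAGES = [5, 50, 100]


@dataclass
class MetricDelta:
    user_score: float   # new minus old; should not be negative
    cost: float         # new minus old; should not be positive
    error_rate: float   # new minus old; should not be positive


def next_pct(current_pct: int, delta: MetricDelta) -> int:
    """Return the next traffic share for the new version, or 0 to roll back."""
    healthy = delta.user_score >= 0 and delta.cost <= 0 and delta.error_rate <= 0
    if not healthy:
        return 0
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else current_pct
```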

2. Production Architecture Diagram

Developer ── git push prompt change ──→ GitHub
                                         │
                                         ▼
                                    ┌──────────┐
                                    │  CI      │
                                    │ (Day 169)│
                                    │  eval    │
                                    └────┬─────┘
                                         │ pass
                                         ▼
                              auto-publish to Langfuse
                              (registry, status=draft)
                                         │
                              human approve → status=staging
                                         │
                              ┌──────────┴──────────┐
                              ▼                     ▼
                          App fetches             Eval against
                          on cold start            full golden
                                                    │
                                              status=production?
                                                    │
                                          ┌─────────┴──────┐
                                          ▼                ▼
                                       Canary 5%       Old 95%
                                          │                │
                                       monitor metrics ─┐  │
                                                        ▼  │
                                                 promote / rollback

3. Code Implementation

3.1 Prompt File Layout (git source)

prompts/
├── manifest.yaml              # master index
├── customer_support/
│   ├── system_v2.3.0.md       # currently in production
│   ├── system_v2.4.0.md       # in an open PR
│   ├── tests/
│   │   ├── test_basic.json
│   │   └── test_compliance.json
│   └── README.md              # owner / changelog
└── credit_decision/
    ├── system_v1.5.2.md
    └── ...

# prompts/manifest.yaml
prompts:
  customer_support:
    owner: pm-jane
    sme: compliance-li
    current_prod_version: 2.3.0
    eval_dataset: golden/customer_support_v1.1.0.json
    min_pass_rate: 0.92
    max_avg_cost_per_req: 0.005
    max_p95_latency_s: 2.0
  credit_decision:
    owner: pm-bob
    sme: risk-zhao
    current_prod_version: 1.5.2
    eval_dataset: golden/credit_v1.0.0.json
    min_pass_rate: 1.0   # strict
    max_avg_cost_per_req: 0.05
    max_p95_latency_s: 4.0
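
The three thresholds per prompt in the manifest are meant to gate CI. A minimal sketch of that check follows, assuming the eval run produces a report dict with `pass_rate`, `avg_cost_per_req` and `p95_latency_s`; those report field names and the function `check_gates` are assumptions, not something the manifest defines.

```python
def check_gates(thresholds: dict, report: dict) -> list[str]:
    """Compare an eval report with one prompt's manifest entry.

    Returns a list of human-readable failures; an empty list means the gate passes.
    """
    failures = []
    if report["pass_rate"] < thresholds["min_pass_rate"]:
        failures.append(
            f"pass rate {report['pass_rate']:.2%} < {thresholds['min_pass_rate']:.2%}"
        )
    if report["avg_cost_per_req"] > thresholds["max_avg_cost_per_req"]:
        failures.append(
            f"avg cost ${report['avg_cost_per_req']} > ${thresholds['max_avg_cost_per_req']}"
        )
    if report["p95_latency_s"] > thresholds["max_p95_latency_s"]:
        failures.append(
            f"p95 latency {report['p95_latency_s']}s > {thresholds['max_p95_latency_s']}s"
        )
    return failures
```

Returning every failure at once, rather than stopping at the first, lets the PR comment show the full picture in a single CI run.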

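3.2 Extracting Prompt Versions (Used by CI)

"""prompt_version.py: extract metadata from a markdown prompt file."""
import re
from pathlib import Path

import yaml


def parse_prompt(path: Path) -> dict:
    """Parse a prompt file made of optional YAML frontmatter plus a body."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # Split on the first two '---' markers only, so a '---' in the body survives
        _, fm, body = text.split("---", 2)
        meta = yaml.safe_load(fm) or {}  # empty frontmatter parses to None
    else:
        meta = {}
        body = text

    # A version in the file name takes precedence over the frontmatter value
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", path.name)
    if m:
        meta["version"] = ".".join(m.groups())
    return {"meta": meta, "body": body.strip(), "path": str(path)}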

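Frontmatter example:

---
prompt_id: customer_support.system
version: 2.4.0
upgrade_type: minor
parent_version: 2.3.0
changelog: |
  - Added 5 few-shot examples for common credit card questions
  - Fixed the tone of the "early repayment" Q&A
authors: [pm-jane]
reviewers: [compliance-li]
---

You are a customer service assistant for a commercial bank.

# Behavior guidelines
1. ...

# Few-shot examples
...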
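
The CI workflow in 3.6 calls `scripts/validate_prompts.py`, which these notes do not show. The following is one possible core for it, a sketch that validates the already-parsed frontmatter dict (for example the `meta` returned by `parse_prompt`). The required-field list is taken from the example above; treating every field as mandatory is an assumption.

```python
import re

REQUIRED = (
    "prompt_id", "version", "upgrade_type",
    "parent_version", "changelog", "authors", "reviewers",
)
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def validate_meta(meta: dict) -> list[str]:
    """Return a list of problems found in a prompt's frontmatter."""
    errors = [f"missing field: {key}" for key in REQUIRED if not meta.get(key)]
    for key in ("version", "parent_version"):
        value = meta.get(key)
        if value and not SEMVER.match(str(value)):
            errors.append(f"{key} is not MAJOR.MINOR.PATCH: {value}")
    upgrade = meta.get("upgrade_type")
    if upgrade and upgrade not in ("major", "minor", "patch"):
        errors.append(f"unknown upgrade_type: {upgrade}")
    return errors
```

A wrapper script would call this for each changed file and exit non-zero when any list is non-empty.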

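3.3 Syncing to the Prompt Registry (git → Langfuse)

"""publish_prompt.py: called after CI eval passes; syncs a prompt to the Langfuse registry."""
import sys
from pathlib import Path

from langfuse import Langfuse
from prompt_version import parse_prompt

lf = Langfuse()


def publish(prompt_path: str, status: str = "staging") -> str:
    p = parse_prompt(Path(prompt_path))
    name = p["meta"]["prompt_id"]
    version = p["meta"]["version"]  # semantic version, e.g. "2.4.0"

    # Langfuse assigns its own integer version on every create_prompt call, so the
    # semantic version cannot be used as `version=`. Look the prompt up through the
    # f"v{version}" label attached at publish time instead.
    try:
        existing = lf.get_prompt(name, label=f"v{version}")
        print(f"Already published: {name} {version}")
        return f"{name}@{existing.version}"
    except Exception:
        pass  # not published yet: fall through and create it

    git_sha = str(p["meta"].get("git_sha", "unknown"))[:7]
    created = lf.create_prompt(
        name=name,
        prompt=p["body"],
        # Assumption: labels accept only lowercase letters, digits, '_', '-' and '.',
        # so the git SHA is joined with '-' rather than ':'.
        labels=[status, f"v{version}", f"git-sha-{git_sha}"],
        commit_message=version,
        config={
            "model": "claude-sonnet-4-6",
            "temperature": 0.0,
            "max_tokens": 1024,
        },
    )
    print(f"Published: {name} {version} as {status}")
    # Returned as "name@registry_version"; adjust if your SDK exposes a prompt ID
    return f"{name}@{created.version}"


if __name__ == "__main__":
    publish(sys.argv[1], status=sys.argv[2] if len(sys.argv) > 2 else "staging")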

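3.4 Fetching Prompts in the Application (with Fallback)

"""app_prompt.py"""
import time
from functools import lru_cache
from pathlib import Path
from types import SimpleNamespace

import anthropic
from langfuse import Langfuse

lf = Langfuse()
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


@lru_cache(maxsize=128)
def _get_prompt_cached(name: str, label: str, ttl_bucket: int):
    """ttl_bucket changes once a minute, which turns lru_cache into a 60s TTL cache."""
    return lf.get_prompt(name, label=label)


def get_prompt(name: str, label: str = "production"):
    """60s cache, falling back to a local copy when the registry is down."""
    try:
        return _get_prompt_cached(name, label, int(time.time()) // 60)
    except Exception:
        # Fallback: the last successfully fetched prompt, stored locally
        return _load_local_fallback(name, label)


def _load_local_fallback(name: str, label: str):
    # Synced to local disk at startup as a safety net. Wrapped so callers can still
    # read .prompt / .name / .version, as on a registry prompt object.
    text = Path(f"./prompts_fallback/{name}_{label}.md").read_text(encoding="utf-8")
    return SimpleNamespace(prompt=text, name=name, version="local-fallback")


# Usage:
def chat_with_prompt(question: str):
    p = get_prompt("customer_support.system", label="production")
    # Anthropic call
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=p.prompt,  # the actual prompt text
        messages=[{"role": "user", "content": question}],
    )
    # Attach this to your trace to link the call to its prompt version. It is not
    # passed to messages.create: Anthropic's `metadata` parameter only accepts `user_id`.
    trace_meta = {"prompt_name": p.name, "prompt_version": p.version}
    return r.content[0].text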
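
The `lru_cache` trick above drops the previous value as soon as the minute changes, so a registry outage right after expiry goes straight to the disk fallback. An alternative is a small TTL cache that keeps serving the last known good value when a refresh fails. This is a standard-library sketch, not the Langfuse SDK's own caching; `PromptCache` and its `fetch` callable are hypothetical names.

```python
import time
from typing import Callable


class PromptCache:
    """TTL cache that serves the last good value if a refresh fails."""

    def __init__(self, fetch: Callable[[str, str], str], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # e.g. a function wrapping the registry call
        self._ttl_s = ttl_s
        self._clock = clock          # injectable so tests can control time
        self._entries: dict[tuple[str, str], tuple[float, str]] = {}

    def get(self, name: str, label: str = "production") -> str:
        key = (name, label)
        now = self._clock()
        entry = self._entries.get(key)
        if entry and now - entry[0] < self._ttl_s:
            return entry[1]          # still fresh
        try:
            text = self._fetch(name, label)
        except Exception:
            if entry:
                return entry[1]      # registry down: serve the stale copy
            raise                    # nothing cached yet: surface the error
        self._entries[key] = (now, text)
        return text

    def invalidate(self, name: str, label: str = "production") -> None:
        """Drop one entry, e.g. right after a rollback."""
        self._entries.pop((name, label), None)
```

Calling `invalidate` from a rollback hook is one way to avoid waiting out the TTL.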

3.5 Canary Routing

"""canary_router.py: split traffic between canary and stable by hashing the user ID."""
import hashlib

from app_prompt import get_prompt


def route_prompt_label(user_id: str, canary_pct: float = 0.05) -> str:
    """Hash user_id into a bucket so the same user always gets the same label."""
    # md5 is used only for a stable, even spread, not for security
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 10000
    if bucket < canary_pct:
        return "canary"
    return "production"


def chat(user_id: str, question: str):
    label = route_prompt_label(user_id, canary_pct=0.05)
    p = get_prompt("customer_support.system", label=label)
    # ... call the LLM ...
    metadata = {"prompt_label": label, "prompt_version": p.version}
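
Comparing a fixed per-user bucket against a threshold has a useful side effect: when the canary share is raised from 5% to 50%, everyone already in the canary stays in it, so no user flips back and forth during the ramp. The sketch below reuses the same hashing as `route_prompt_label` to illustrate this on synthetic user IDs; `bucket` and `in_canary` are illustrative helper names.

```python
import hashlib


def bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 1), same scheme as the router."""
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    return (h % 10000) / 10000


def in_canary(user_id: str, canary_pct: float) -> bool:
    return bucket(user_id) < canary_pct


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_canary(u, 0.05)}
at_50 = {u for u in users if in_canary(u, 0.50)}

share_at_5 = len(at_5) / len(users)   # close to 0.05 for evenly spread hashes
stays_in = at_5 <= at_50              # every 5% user is still in the 50% group
```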

3.6 Full CI (GitHub Actions)

# .github/workflows/promptops.yml
name: PromptOps
on:
  pull_request:
    # 'labeled' is required so the promote-* jobs fire when an approval label is added
    types: [opened, synchronize, reopened, labeled]
    paths: ['prompts/**']

permissions:
  contents: read
  pull-requests: write   # lets the PR comment step post

jobs:
  prompt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history, so origin/main exists for the diff
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -e eval_v1 langfuse pyyaml

      - name: Detect changed prompts
        id: detect
        run: |
          # Added or modified prompt files only; README.md and deletions are skipped
          changed=$(git diff --name-only --diff-filter=AM origin/main...HEAD -- 'prompts/**/system_v*.md' | tr '\n' ' ')
          echo "files=$changed" >> "$GITHUB_OUTPUT"

      - name: Validate prompt frontmatter
        run: python scripts/validate_prompts.py $FILES
        env:
          FILES: ${{ steps.detect.outputs.files }}

      - name: Run eval against golden set
        run: |
          for f in $FILES; do
            # prompt_id lives in the markdown frontmatter, e.g. customer_support.system
            prompt_id=$(yq --front-matter=extract '.prompt_id' "$f")
            key="${prompt_id%%.*}"   # manifest keys omit the ".system" suffix
            golden=$(yq ".prompts.${key}.eval_dataset" prompts/manifest.yaml)
            min_rate=$(yq ".prompts.${key}.min_pass_rate" prompts/manifest.yaml)
            python -m eval run --golden "$golden" --target "$f" --min-pass-rate "$min_rate"
          done
        env:
          FILES: ${{ steps.detect.outputs.files }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Publish to Langfuse (staging label)
        if: success()
        run: |
          for f in $FILES; do
            python scripts/publish_prompt.py "$f" staging
          done
        env:
          FILES: ${{ steps.detect.outputs.files }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}

      - name: Generate PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const r = JSON.parse(fs.readFileSync('reports/last.json'));
            const body = `## PromptOps CI ✅
            - **Pass rate**: ${(r.rate*100).toFixed(1)}%
            - **P95 latency**: ${r.p95_latency_s.toFixed(2)}s
            - **Cost**: $${(r.cost).toFixed(3)}
            - **Status**: ⚠️ Published to staging. Approval required to promote to canary.`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  promote-canary:
    needs: prompt-ci
    if: github.event.action == 'labeled' && github.event.label.name == 'approved-canary'
    runs-on: ubuntu-latest
    env:
      LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
      LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
      LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
    steps:
      - uses: actions/checkout@v4   # the scripts live in the repo
      - run: pip install langfuse pyyaml
      - run: python scripts/promote.py --label canary --pct 5
      - run: |
          # Placeholder: schedule the follow-up check here
          echo "Canary started at $(date). Auto-promote check in 24h."

  promote-prod:
    needs: prompt-ci
    if: github.event.action == 'labeled' && github.event.label.name == 'approved-prod'
    runs-on: ubuntu-latest
    env:
      LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
      LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
      LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install langfuse pyyaml
      - name: Verify canary metrics
        run: python scripts/verify_canary.py --hours 24 --min-score-delta 0
      - run: python scripts/promote.py --label production --pct 100
      - run: python scripts/keep_old_hot.py --days 7  # keep the old prompt for 7 days
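
The workflow calls `scripts/promote.py`, which these notes do not show. With label-based routing, promote and rollback are the same operation: point a label at a different registry version. The sketch below assumes the registry client exposes `update_prompt(name=..., version=..., new_labels=[...])` and that a label points at exactly one version per prompt; check both against your SDK version. `PromptRegistry`, `InMemoryRegistry` and `move_label` are hypothetical names, and the handling of `--pct` is not covered.

```python
from typing import Protocol


class PromptRegistry(Protocol):
    """The one call this sketch needs from a label-based registry client."""

    def update_prompt(self, *, name: str, version: int, new_labels: list[str]) -> None: ...


class InMemoryRegistry:
    """Stand-in for local testing; each label points at exactly one version."""

    def __init__(self) -> None:
        self.labels: dict[tuple[str, str], int] = {}

    def update_prompt(self, *, name: str, version: int, new_labels: list[str]) -> None:
        for label in new_labels:
            self.labels[(name, label)] = version  # moving a label replaces its target


def move_label(registry: PromptRegistry, name: str, version: int, label: str) -> None:
    """Point `label` at `version`; used for both promotion and rollback."""
    registry.update_prompt(name=name, version=version, new_labels=[label])


registry = InMemoryRegistry()
move_label(registry, "customer_support.system", 12, "canary")      # start the canary
move_label(registry, "customer_support.system", 12, "production")  # promote
move_label(registry, "customer_support.system", 11, "production")  # roll back
```

Because rollback is just another label move, the monthly rollback drill can exercise exactly the code path used in a real incident.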

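3.7 Canary Monitoring Script

"""verify_canary.py: check whether the canary's 24h metrics are acceptable."""
import argparse
import sys
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

lf = Langfuse()


def list_traces(tag, start, end):
    # Assumes the v2 Python SDK; in v3 the same query is lf.api.trace.list(...).
    # Only the first page is read here; paginate for real traffic volumes.
    return lf.fetch_traces(tags=[tag], from_timestamp=start, to_timestamp=end).data


def avg_score(traces, name):
    # Assumes each trace carries full score objects (.name, .value). If the list
    # endpoint returns only score IDs, fetch the scores separately.
    vals = [s.value for t in traces for s in (t.scores or [])
            if getattr(s, "name", None) == name and getattr(s, "value", None) is not None]
    return sum(vals) / len(vals) if vals else None


def avg_cost(traces):
    costs = [t.total_cost for t in traces if getattr(t, "total_cost", None) is not None]
    return sum(costs) / len(costs) if costs else None


def check(hours: int, min_score_delta: float = 0):
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)

    # The app must tag each trace with its prompt label ("canary" / "production")
    canary_traces = list_traces("canary", start, end)
    prod_traces = list_traces("production", start, end)

    canary_score = avg_score(canary_traces, "user_thumbs")
    prod_score = avg_score(prod_traces, "user_thumbs")
    if canary_score is None or prod_score is None:
        print("❌ no user_thumbs scores in one of the groups, blocking promotion")
        sys.exit(1)

    print(f"canary score: {canary_score:.3f} ({len(canary_traces)} traces)")
    print(f"prod   score: {prod_score:.3f} ({len(prod_traces)} traces)")

    delta = canary_score - prod_score
    if delta < min_score_delta:
        print(f"❌ score delta {delta:+.3f} below required {min_score_delta:+.3f}, blocking promotion")
        sys.exit(1)

    # Cost check (latency can be gated the same way)
    canary_cost = avg_cost(canary_traces)
    prod_cost = avg_cost(prod_traces)
    if canary_cost and prod_cost and canary_cost > prod_cost * 1.10:
        print(f"❌ cost +{(canary_cost/prod_cost-1)*100:.1f}%, blocking")
        sys.exit(1)

    print("✅ canary clean, ok to promote")


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--hours", type=int, default=24)
    ap.add_argument("--min-score-delta", type=float, default=0.0)
    args = ap.parse_args()
    check(args.hours, args.min_score_delta)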
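
Comparing raw averages does not say whether a score difference is larger than noise, which matters at 5% traffic. One option is a two-proportion z-test, sketched below under the assumption that `user_thumbs` is a binary thumbs-up/thumbs-down signal. This test is an addition for illustration, not part of the script above, and the function names are hypothetical.

```python
import math


def two_proportion_z(up_a: int, n_a: int, up_b: int, n_b: int) -> float:
    """z statistic for the difference in thumbs-up rate between groups a and b."""
    p_a, p_b = up_a / n_a, up_b / n_b
    pooled = (up_a + up_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # all votes identical in both groups: no measurable difference
    return (p_a - p_b) / se


def p_value_two_sided(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Example: canary 60 of 100 thumbs-up vs production 40 of 100
z = two_proportion_z(60, 100, 40, 100)
p = p_value_two_sided(z)
```

A promotion gate might require the canary not to be significantly worse than production, rather than requiring the raw delta to be non-negative.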

4. Cost & Performance: Measured Data

Stage	Traffic	Duration	Metrics monitored
Staging	Internal dogfooding	1-3 days	functional
Canary	5%	24h	user_score, cost, latency, error_rate
Ramp	25% / 50% / 75%	24h each	Same as above
Production	100%	-	Long-term monitoring
Hot rollback ready	-	7 days	Previous version kept available

Field data from one team's prompt change, prod_v3 → prod_v4:

CI deterministic eval, 60 cases: 97% pass, $0.03
CI judge vs v3: 58% win rate, $1.80
Canary 24h (5% traffic): user_score +2.1%, cost -7%, P95 latency +50ms (within tolerance)
Full rollout of v4: monthly cost reduced by about $420

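5. Applications in Finance

Prompts are compliance assets: every production prompt must be signed off by an SME (compliance / risk / legal), and review follows a formal process
Prompt diff audits: record which words changed from v2 to v3 and whether new few-shot examples are compliant; the changelog is mandatory and records are retained for 6 years
No "auto-publish" in financial settings: CI may only publish to staging; promotion requires a human to apply the approval label
Rollback drills: run a rollback drill every month (switch production back to the previous version) to verify that hot rollback really takes < 1 minute
Multi-tenant prompt isolation: private banking, retail, and corporate banking each have their own prompt repository, so they cannot contaminate one another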
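
For the diff audit, the standard library's `difflib` can produce a reviewable record of exactly which lines changed between two prompt versions. A minimal sketch; `prompt_diff` and the sample strings are illustrative.

```python
import difflib


def prompt_diff(old: str, new: str, old_label: str = "v2", new_label: str = "v3") -> str:
    """Return a unified diff of two prompt texts, or '' if they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


old_prompt = "You are a bank assistant.\nAnswer briefly."
new_prompt = "You are a bank assistant.\nAnswer briefly and cite the policy."
audit_record = prompt_diff(old_prompt, new_prompt)
```

Storing `audit_record` next to the changelog entry gives reviewers the literal text change alongside the stated intent.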

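6. Production Lessons and Pitfalls

Prompts hard-coded in source: a one-line change is inconspicuous in git diff and slips through code review. Keep prompts in external files
Upgrading prompts without versions: when a bug appears, nobody knows which version introduced it or how many users were affected. Enforce semantic versioning
CI that is too heavy: running a 5,000-case judge on every PR means no result within an hour, so developers route around CI. CI must finish in < 5 min
Too little canary traffic: below 1% the sample is too small to reveal problems. Use at least 5%
Canary too short: is 1h enough? No. User behavior has daily and weekly cycles, so run at least 24h, and 7 days for critical financial prompts
Caches outlive a rollback: the app-side prompt cache (60s TTL in 3.4) keeps serving the version you rolled back from until it expires, so budget for that delay or invalidate it explicitly. Anthropic's prompt cache (cache_control) matches on the exact prompt prefix, so it cannot serve the wrong prompt, but the restored version starts with a cold cache (entries expire after 5 min or 1h), so expect higher cost and latency until it warms up
Prompt files containing PII or real customer data: these must never be committed. Run a PII scan hook in CI
A/B data contaminated by new features: do not ship other new features during the canary window, or attribution becomes muddled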
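
The PII scan hook mentioned above could start as a few regular expressions run over each changed prompt file. The patterns below (mainland China mobile numbers, 18-character resident ID numbers, email addresses) are assumptions chosen for the banking context of these notes; they are deliberately simple and will miss cases that a dedicated scanner would catch.

```python
import re

# Illustrative patterns only; tune them for the data your prompts may contain
PATTERNS = {
    "cn_mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "cn_id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def scan(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) for every suspected PII hit in text."""
    return [
        (kind, match.group())
        for kind, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
```

A CI step would call `scan` on each changed file and fail the job when the list is non-empty.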

7. Quick Reference

Step	Tool
Prompt authoring	git markdown
Lint	yq + custom validator
Eval	eval_v1 (Day 169)
Publish	Langfuse SDK `create_prompt`
Canary	label-based routing
Monitor	Langfuse score + Prometheus
Rollback	label switch

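8. Interview Questions

Should prompts be managed in git or in a registry?
- Hybrid is recommended: git is the source of truth and CI syncs automatically to the registry; the registry provides hot swapping, gradual rollout, and A/B testing; never let the two drift apart
Does changing a single character in a prompt still need CI?
- Yes. The behavioral effect of "small changes" on an LLM is the hardest to predict. CI should at least run the deterministic eval (finishes in 30s)
How long should a canary run before promotion?
- At least 24h to cover one full business cycle; 7 days for prompts that drive critical financial decisions; monitor four dimensions: user_score, cost, latency, error
Why is prompt rollback harder than code rollback?
- Client-side prompt caches may keep serving the rolled-back version until they expire; downstream systems may already have adapted to the new behavior; attribution is hard when several prompts are in canary at once
How do you stop developers from editing prompts directly in prod?
- Require two sign-offs through git code owners on prompt files + allow the Langfuse production label to be applied only through the CI publish flow + grant prod write access only to the CI service account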

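Coming Up Tomorrow

Day 172: Fine-tuning decisions: Prompt vs RAG vs FT. When is a prompt not enough, so that RAG is needed? When is RAG not enough, so that fine-tuning is needed? A decision framework plus a comparison experiment running all three approaches on the same task.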