Expert Day 171

Versioning and CI: Prompt Versioning and PromptOps

Date: 2026-10-19
Track: AI Systems Engineering / LLMOps / DevOps
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #PromptOps #CICD #PromptRegistry #Versioning #Rollout


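Today's Goals

| Type | Content |
|---|---|
| Learn | Managing prompts as code; prompt registry vs git; semantic versioning; gradual rollout / canary; rollback; integration with eval/observability |
| Practice | Build a complete PromptOps pipeline: git → CI runs eval → publish to Langfuse registry → gradual rollout |
| Deliverable | docs/ai-infra/ci_pipeline.yml: full CI config + prompt management code |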

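I. Core Concepts

1.1 Prompt = Software Asset

| Dimension | Ordinary config | Prompt |
|---|---|---|
| Affects system behavior | | ✓✓✓ (fully determines it) |
| Probability of bugs | | High (subtle, hard to test) |
| Multi-person collaboration | | |
| Needs versioning | | Required |
| Needs review | | Required (PM/SME/AI Eng) |
| Needs gradual rollout | Occasionally | Required (affects behavior for 100% of traffic) |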

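1.2 Two Paradigms for Prompt Management

| Paradigm | Storage | Switching | Recommended for |
|---|---|---|---|
| Code-centric (in git) | .txt / .md / .py strings | Goes through the code release process | Simple teams |
| Registry-centric (Langfuse / Promptfoo) | DB + version number | Hot swap without restarting the service | Mid-size to large teams, A/B needs |
| Hybrid | git as primary source + sync to registry | git is the source of truth | Recommended for production |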

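1.3 Prompt Version Number Semantics

Modeled on SemVer, but with different semantics:

v MAJOR . MINOR . PATCH
v 2     . 3     . 1

MAJOR: breaking behavior change (output format changed, new required field)
MINOR: backward-compatible behavior improvement (added few-shot examples, better wording)
PATCH: no behavior change (typo, comment)

Key point: a MAJOR upgrade must come with a dataset migration and a downstream review.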
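
The versioning rule above can be sketched as a small helper that classifies a version bump and lists the extra gates a MAJOR bump triggers. This is only an illustration: the function names and the gate names (`ci_eval`, `dataset_migration`, `downstream_review`) are hypothetical, chosen to mirror the wording of the rule.

```python
def parse_version(v: str) -> tuple[int, int, int]:
    """Parse '2.3.1' or 'v2.3.1' into (major, minor, patch)."""
    major, minor, patch = (int(x) for x in v.lstrip("v").split("."))
    return major, minor, patch


def bump_type(old: str, new: str) -> str:
    """Return 'major', 'minor' or 'patch' for an upgrade from old to new."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


def required_gates(old: str, new: str) -> list[str]:
    """Every change runs CI eval; MAJOR bumps add migration and review (hypothetical gate names)."""
    gates = ["ci_eval"]
    if bump_type(old, new) == "major":
        gates += ["dataset_migration", "downstream_review"]
    return gates
```

For example, `required_gates("2.3.0", "3.0.0")` would include the two extra gates, while a 2.3.0 to 2.4.0 bump would only need the CI eval.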

1.4 Gradual Rollout Strategy

v1.0 (100%)
  │
  ▼ new v2.0 passes CI eval
  │
v1.0 (95%) + v2.0 (5%)   ← canary for 24h, monitor metrics
  │
  ▼ user_score not down, cost not up, errors not up
  │
v1.0 (50%) + v2.0 (50%)  ← 24h
  │
  ▼ all clear
  │
v2.0 (100%)
  │
  ▼ keep v1.0 hot for 7 days for rollback within seconds
  │
deprecated
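
The ladder above (5% → 50% → 100%, with a health check at each step) could be expressed as a tiny decision function. This is a sketch under assumptions: the metric dictionary keys and the function names are hypothetical, and the health rule is the literal reading of "score not down, cost not up, errors not up".

```python
STAGES = [5, 50, 100]  # traffic share of the new version, as in the diagram


def metrics_ok(canary: dict, baseline: dict) -> bool:
    """True when the new version is no worse than the old one on all three metrics."""
    return (
        canary["user_score"] >= baseline["user_score"]
        and canary["cost"] <= baseline["cost"]
        and canary["error_rate"] <= baseline["error_rate"]
    )


def next_step(current_pct: int, healthy: bool) -> tuple[str, int]:
    """Decide the next action and the traffic share the new version should get."""
    if not healthy:
        return ("rollback", 0)
    for pct in STAGES:
        if pct > current_pct:
            return ("promote", pct)
    return ("hold", 100)  # already at full traffic
```

In practice the comparison would likely allow a tolerance band rather than a strict inequality; the strict form is used here only to keep the sketch short.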

II. Production Architecture

Developer ── git push prompt change ──→ GitHub
                                         │
                                         ▼
                                    ┌──────────┐
                                    │  CI      │
                                    │  (Day 169)│
                                    │  eval    │
                                    └────┬─────┘
                                         │ pass
                                         ▼
                              auto-publish to Langfuse
                              (registry, status=draft)
                                         │
                              human approve → status=staging
                                         │
                              ┌──────────┴──────────┐
                              ▼                     ▼
                          App fetches             Eval against
                          on cold start            full golden
                                                    │
                                              status=production?
                                                    │
                                          ┌─────────┴──────┐
                                          ▼                ▼
                                       Canary 5%       Old 95%
                                          │                │
                                       monitor metrics ─┐  │
                                                        ▼  │
                                                 promote / rollback

III. Code Implementation

3.1 Prompt File Layout (git source)

prompts/
├── manifest.yaml              # master index
├── customer_support/
│   ├── system_v2.3.0.md       # currently in production
│   ├── system_v2.4.0.md       # in an open PR
│   ├── tests/
│   │   ├── test_basic.json
│   │   └── test_compliance.json
│   └── README.md              # owner / changelog
└── credit_decision/
    ├── system_v1.5.2.md
    └── ...

# prompts/manifest.yaml
prompts:
  customer_support:
    owner: pm-jane
    sme: compliance-li
    current_prod_version: 2.3.0
    eval_dataset: golden/customer_support_v1.1.0.json
    min_pass_rate: 0.92
    max_avg_cost_per_req: 0.005
    max_p95_latency_s: 2.0
  credit_decision:
    owner: pm-bob
    sme: risk-zhao
    current_prod_version: 1.5.2
    eval_dataset: golden/credit_v1.0.0.json
    min_pass_rate: 1.0   # strict
    max_avg_cost_per_req: 0.05
    max_p95_latency_s: 4.0
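
The three thresholds in the manifest suggest a simple gate that CI could apply to an eval report. The sketch below assumes a report dictionary with the keys `rate`, `avg_cost_per_req` and `p95_latency_s`; that shape is hypothetical (the manifest itself only defines the limits).

```python
def check_thresholds(report: dict, limits: dict) -> list[str]:
    """Compare an eval report with one manifest entry; return human-readable failures."""
    failures = []
    if report["rate"] < limits["min_pass_rate"]:
        failures.append(
            f"pass rate {report['rate']:.3f} < {limits['min_pass_rate']}"
        )
    if report["avg_cost_per_req"] > limits["max_avg_cost_per_req"]:
        failures.append(
            f"avg cost {report['avg_cost_per_req']} > {limits['max_avg_cost_per_req']}"
        )
    if report["p95_latency_s"] > limits["max_p95_latency_s"]:
        failures.append(
            f"p95 latency {report['p95_latency_s']}s > {limits['max_p95_latency_s']}s"
        )
    return failures


# Limits copied from the customer_support entry in manifest.yaml
customer_support_limits = {
    "min_pass_rate": 0.92,
    "max_avg_cost_per_req": 0.005,
    "max_p95_latency_s": 2.0,
}
```

An empty list means the prompt may proceed; a CI step could print the failures and exit non-zero otherwise.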

3.2 Extracting the Prompt Version (for CI)

"""prompt_version.py: extract metadata from a markdown prompt file"""
import re
import yaml
from pathlib import Path

def parse_prompt(path: Path) -> dict:
    """Parse an optional YAML frontmatter block plus the prompt body."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # maxsplit=2 leaves any later "---" inside the body untouched
        _, fm, body = text.split("---", 2)
        meta = yaml.safe_load(fm) or {}  # an empty frontmatter parses to None
    else:
        meta = {}
        body = text

    # The version in the file name takes precedence over the frontmatter
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", path.name)
    if m:
        meta["version"] = ".".join(m.groups())
    return {"meta": meta, "body": body.strip(), "path": str(path)}

Frontmatter example:

---
prompt_id: customer_support.system
version: 2.4.0
upgrade_type: minor
parent_version: 2.3.0
changelog: |
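  - Added 5 few-shot examples for common credit card questions
  - Fixed the tone of the "early repayment" Q&A
authors: [pm-jane]
reviewers: [compliance-li]
---

You are a customer service assistant for a commercial bank.

# Behavior guidelines
1. ...

# Few-shot examples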
...
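
A frontmatter validator for CI might look roughly like the following. It is a sketch, not the actual validation script: the list of required fields is taken from the example above, and the rule that `upgrade_type` must match the difference between `parent_version` and `version` is an assumption about what such a check would enforce.

```python
REQUIRED_FIELDS = (
    "prompt_id", "version", "upgrade_type",
    "parent_version", "changelog", "authors", "reviewers",
)


def _as_tuple(version) -> tuple[int, ...]:
    return tuple(int(x) for x in str(version).split("."))


def validate_frontmatter(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the frontmatter looks valid."""
    errors = [f"missing field: {k}" for k in REQUIRED_FIELDS if not meta.get(k)]
    if errors:
        return errors

    new, old = _as_tuple(meta["version"]), _as_tuple(meta["parent_version"])
    if len(new) != 3 or len(old) != 3:
        return ["versions must have the form MAJOR.MINOR.PATCH"]
    if new <= old:
        return [f"version {meta['version']} must be newer than parent {meta['parent_version']}"]

    # The first differing component decides the actual bump type
    changed = next(i for i in range(3) if new[i] != old[i])
    actual = ("major", "minor", "patch")[changed]
    if actual != meta["upgrade_type"]:
        errors.append(f"upgrade_type is {meta['upgrade_type']!r} but the bump is {actual!r}")
    return errors
```

Taking a plain dictionary keeps the check independent of how the frontmatter was parsed, so it can be fed from `parse_prompt(...)["meta"]` or from a test fixture.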

3.3 Prompt Registry Sync (git → Langfuse)

"""publish_prompt.py: called after CI eval passes; syncs the prompt to the Langfuse registry"""
import sys
from pathlib import Path

from langfuse import Langfuse
from prompt_version import parse_prompt

lf = Langfuse()

def publish(prompt_path: str, status: str = "staging") -> str:
    p = parse_prompt(Path(prompt_path))
    name = p["meta"]["prompt_id"]
    version = p["meta"]["version"]
    version_label = f"v{version}"

    # Idempotency check. Langfuse numbers prompt versions with its own
    # auto-incrementing integers, so look the SemVer up through the label
    # attached below instead of through the Langfuse version number.
    try:
        existing = lf.get_prompt(name, label=version_label)
        print(f"Already published: {name} {version}")
        return f"{name}@{existing.version}"
    except Exception:
        pass  # not found: fall through and create it

    created = lf.create_prompt(
        name=name,
        prompt=p["body"],
        labels=[status, version_label, f"git_sha:{p['meta'].get('git_sha','unknown')[:7]}"],
        commit_message=version,
        config={
            "model": "claude-sonnet-4-6",
            "temperature": 0.0,
            "max_tokens": 1024,
        },
    )
    print(f"Published: {name} {version} as {status}")
    # Return the prompt name plus the Langfuse-assigned version number
    return f"{name}@{created.version}"


if __name__ == "__main__":
    publish(sys.argv[1], status=sys.argv[2] if len(sys.argv) > 2 else "staging")

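3.4 Fetching Prompts in the Application (with fallback)

"""app_prompt.py"""
import time
from dataclasses import dataclass
from functools import lru_cache

from anthropic import Anthropic
from langfuse import Langfuse

lf = Langfuse()
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


@dataclass
class LocalPrompt:
    """Fallback stand-in with the attributes the app reads from a Langfuse prompt."""
    name: str
    prompt: str
    version: str = "local-fallback"


@lru_cache(maxsize=128)
def _get_prompt_cached(name: str, label: str, ttl_bucket: int):
    """ttl_bucket is only part of the cache key; it changes every 60s and forces a refetch."""
    return lf.get_prompt(name, label=label)


def get_prompt(name: str, label: str = "production"):
    """60s cache; falls back to the local copy when the registry is unreachable."""
    try:
        return _get_prompt_cached(name, label, int(time.time()) // 60)
    except Exception:
        # Fallback: the last successfully fetched prompt stored locally
        return _load_local_fallback(name, label)


def _load_local_fallback(name: str, label: str) -> LocalPrompt:
    # Synced to local disk at startup, as a last resort
    with open(f"./prompts_fallback/{name}_{label}.md", encoding="utf-8") as f:
        return LocalPrompt(name=name, prompt=f.read())


# Usage:
def chat_with_prompt(question: str):
    p = get_prompt("customer_support.system", label="production")
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=p.prompt,  # the actual prompt text
        messages=[{"role": "user", "content": question}],
    )
    # Record p.name / p.version on your trace; the Anthropic `metadata`
    # parameter only accepts `user_id`, so it cannot carry these fields.
    return r.content[0].text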
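
The local fallback in 3.4 only works if something has written the file beforehand. A minimal sketch of that snapshot step is shown below; the helper names and the `root` parameter are hypothetical, while the file naming follows the `{name}_{label}.md` pattern used by the fallback loader.

```python
import os
import tempfile
from pathlib import Path


def save_fallback(name: str, label: str, text: str,
                  root: str = "./prompts_fallback") -> Path:
    """Persist the last successfully fetched prompt text to local disk."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    target = root_path / f"{name}_{label}.md"
    # Write to a temp file first, then replace atomically, so a crash
    # mid-write never leaves a truncated fallback behind.
    fd, tmp = tempfile.mkstemp(dir=root_path, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)
    os.replace(tmp, target)
    return target


def load_fallback(name: str, label: str,
                  root: str = "./prompts_fallback") -> str:
    """Read the snapshot back; raises FileNotFoundError if none was saved."""
    return (Path(root) / f"{name}_{label}.md").read_text(encoding="utf-8")
```

Calling `save_fallback` at startup and after every successful registry fetch would keep the snapshot close to what production last served.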

3.5 Canary Routing

"""canary_router.py: split users between canary and stable by hashing the user id"""
import hashlib

from app_prompt import get_prompt

def route_prompt_label(user_id: str, canary_pct: float = 0.05) -> str:
    """Bucket by user_id hash so the same user always lands in the same bucket."""
    # md5 is used only for stable bucketing here, not for security
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 10000
    if bucket < canary_pct:
        return "canary"
    return "production"


def chat(user_id: str, question: str):
    label = route_prompt_label(user_id, canary_pct=0.05)
    p = get_prompt("customer_support.system", label=label)
    # ... call the LLM ...
    # Attach this to the trace so canary and production can be compared later
    metadata = {"prompt_label": label, "prompt_version": p.version}
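
A quick sanity check on the hash-based router: the observed canary share over synthetic user ids should sit near the configured percentage, and raising the percentage should never move an existing canary user back to production. The block repeats the routing function so it runs on its own; `canary_share` and the synthetic ids are illustrative.

```python
import hashlib


def route_prompt_label(user_id: str, canary_pct: float = 0.05) -> str:
    """Same bucketing rule as the router above, repeated for a standalone check."""
    h = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 10000
    return "canary" if bucket < canary_pct else "production"


def canary_share(user_ids: list[str], canary_pct: float = 0.05) -> float:
    """Fraction of the given users that would be routed to the canary."""
    hits = sum(route_prompt_label(u, canary_pct) == "canary" for u in user_ids)
    return hits / len(user_ids)


users = [f"user-{i}" for i in range(10_000)]
share_at_5 = canary_share(users, 0.05)

# Users in the canary at 5% stay in the canary when the share is raised to 50%,
# because each user's bucket value is fixed.
canary_at_5 = {u for u in users if route_prompt_label(u, 0.05) == "canary"}
canary_at_50 = {u for u in users if route_prompt_label(u, 0.50) == "canary"}
```

The monotonic property matters during ramp-up: a user who has already seen the new prompt does not flip back and forth between versions as the percentage grows.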

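3.6 Full CI (GitHub Actions)

# .github/workflows/promptops.yml
name: PromptOps
on:
  pull_request:
    # `labeled` is required so that adding an approved-* label triggers the promote jobs
    types: [opened, synchronize, reopened, labeled]
    paths: ['prompts/**']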

jobs:
  prompt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }  # full history, so origin/main is available for git diff
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -e eval_v1 langfuse pyyaml

      - name: Detect changed prompts
        id: detect
        run: |
          changed=$(git diff --name-only origin/main HEAD -- 'prompts/**/*.md' | tr '\n' ' ')
          echo "files=$changed" >> $GITHUB_OUTPUT

      - name: Validate prompt frontmatter
        run: python scripts/validate_prompts.py ${{ steps.detect.outputs.files }}

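      - name: Run eval against golden set
        run: |
          for f in ${{ steps.detect.outputs.files }}; do
            # prompt_id lives in the markdown frontmatter, e.g. customer_support.system
            prompt_id=$(yq --front-matter=extract '.prompt_id' "$f")
            # manifest.yaml is keyed by the part before the first dot
            key="${prompt_id%%.*}"
            golden=$(yq ".prompts.${key}.eval_dataset" prompts/manifest.yaml)
            min_rate=$(yq ".prompts.${key}.min_pass_rate" prompts/manifest.yaml)
            python -m eval run --golden "$golden" --target "$f" --min-pass-rate "$min_rate"
          done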
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Publish to Langfuse (staging label)
        if: success()
        run: |
          for f in ${{ steps.detect.outputs.files }}; do
            python scripts/publish_prompt.py $f staging
          done
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PK }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SK }}
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}

      - name: Generate PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const r = JSON.parse(fs.readFileSync('reports/last.json'));
            const body = `## PromptOps CI ✅
            - **Pass rate**: ${(r.rate*100).toFixed(1)}%
            - **P95 latency**: ${r.p95_latency_s.toFixed(2)}s
            - **Cost**: $${(r.cost).toFixed(3)}
            - **Status**: ⚠️ Published to staging. Approval required to promote to canary.`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  promote-canary:
    needs: prompt-ci
    if: github.event.label.name == 'approved-canary'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # the promote scripts live in the repo
      - run: python scripts/promote.py --label canary --pct 5
      - run: |
          # Set up the scheduled check (this step only logs the start time)
          echo "Canary started at $(date). Auto-promote check in 24h."

  promote-prod:
    needs: prompt-ci
    if: github.event.label.name == 'approved-prod'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify canary metrics
        run: python scripts/verify_canary.py --hours 24 --min-score-delta 0
      - run: python scripts/promote.py --label production --pct 100
      - run: python scripts/keep_old_hot.py --days 7  # keep the old prompt for 7 days

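3.7 Canary Monitoring Script

"""verify_canary.py: check whether the canary's 24h metrics are acceptable"""
import argparse
import sys
from datetime import datetime, timedelta, timezone

from langfuse import Langfuse

lf = Langfuse()


def check(hours: int, min_score_delta: float = 0):
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)

    # NOTE: the trace-listing call may differ between Langfuse SDK versions;
    # adjust `get_traces` to what your installed SDK exposes.
    # avg_score / avg_cost are helper functions that are not shown in this file.
    canary_traces = lf.get_traces(tags=["canary"], from_timestamp=start, to_timestamp=end)
    prod_traces = lf.get_traces(tags=["production"], from_timestamp=start, to_timestamp=end)

    canary_score = avg_score(canary_traces, "user_thumbs")
    prod_score = avg_score(prod_traces, "user_thumbs")

    print(f"canary score: {canary_score:.3f} ({len(canary_traces)} traces)")
    print(f"prod   score: {prod_score:.3f} ({len(prod_traces)} traces)")

    delta = canary_score - prod_score
    if delta < min_score_delta:
        print(f"❌ score regressed by {-delta:.3f}, blocking promotion")
        sys.exit(1)

    # Also gate on cost (latency can be checked the same way)
    canary_cost = avg_cost(canary_traces)
    prod_cost = avg_cost(prod_traces)
    if prod_cost > 0 and canary_cost > prod_cost * 1.10:
        print(f"❌ cost +{(canary_cost/prod_cost-1)*100:.1f}%, blocking")
        sys.exit(1)

    print("✅ canary clean, ok to promote")


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--hours", type=int, default=24)
    ap.add_argument("--min-score-delta", type=float, default=0.0)
    args = ap.parse_args()
    check(args.hours, args.min_score_delta)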
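
The script calls `avg_score` and `avg_cost` without defining them. One possible shape for these helpers is sketched below, assuming each trace has already been converted to a plain dictionary with a `scores` list (items with `name` and `value`) and a `total_cost` field. That dictionary layout is an assumption for illustration, not the Langfuse response format.

```python
from statistics import mean


def avg_score(traces: list[dict], score_name: str) -> float:
    """Mean value of the named score across all traces that carry it."""
    values = [
        s["value"]
        for t in traces
        for s in t.get("scores", [])
        if s["name"] == score_name
    ]
    if not values:
        # Refuse to judge on no data rather than silently returning 0
        raise ValueError(f"no '{score_name}' scores in {len(traces)} traces")
    return mean(values)


def avg_cost(traces: list[dict]) -> float:
    """Mean total cost per trace, skipping traces without cost data."""
    costs = [t["total_cost"] for t in traces if t.get("total_cost") is not None]
    if not costs:
        raise ValueError("no cost data in traces")
    return mean(costs)
```

Raising on empty input is a deliberate choice here: a canary with no scored traces should block promotion instead of passing by default.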

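IV. Cost & Performance: Measured Data

| Stage | Traffic | Duration | Metrics monitored |
|---|---|---|---|
| Staging | Internal dogfooding | 1-3 days | functional |
| Canary | 5% | 24h | user_score, cost, latency, error_rate |
| Ramp | 25% / 50% / 75% | 24h each | Same as above |
| Production | 100% | Long-term | Monitoring |
| Hot rollback ready | | 7 days | Old version retained |

Field data: one team's prompt change from prod_v3 → prod_v4:

  • CI deterministic eval, 60 cases: 97% pass, $0.03
  • CI judge vs v3: win rate 58%, $1.80
  • Canary 24h (5% traffic): user_score +2.1%, cost -7%, P95 latency +50ms (within tolerance)
  • Full rollout of v4, saving ~$420 in monthly cost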

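V. Applications in Finance

  1. Prompts are compliance assets: every production prompt must be signed off by an SME (compliance / risk / legal), and review follows a formal process
  2. Prompt diff audit: which words changed from v2 → v3, whether new few-shot examples are compliant, changelog mandatory, retained for 6 years
  3. No "auto-publish" in financial settings: CI may only publish to staging; promotion requires a manually applied approval label
  4. Rollback drills: run a rollback drill once a month (switch production back to the old version) to verify that hot rollback really takes < 1 minute
  5. Multi-tenant prompt isolation: private banking / retail / corporate each have their own prompt repository, with no cross-contamination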

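VI. Production Lessons and Pitfalls

  1. Prompts hard-coded in source: a one-line change is inconspicuous in git diff and slips through code review. Keep prompts in external files
  2. Upgrading prompts without versions: when a bug appears, you cannot tell which version introduced it or how many users were affected. Enforce semantic versioning
  3. CI too heavy: running a 5000-case judge on every PR takes an hour to return results, so developers bypass CI. CI must finish in < 5min
  4. Too little canary traffic: below 1% the sample is too small to reveal problems. Use at least 5%
  5. Canary too short: is 1h enough? No. User behavior follows daily/weekly cycles; run at least 24h, and 7 days for critical financial prompts
  6. Prompt caches are still around at rollback time: the app-side cache (60s TTL in 3.4) keeps serving the rolled-back version until it expires. Anthropic's prompt cache (cache_control) is keyed on the exact prompt prefix, so it does not return stale text, but entries for the rolled-back version linger until their 5min/1h TTL expires, and the restored prompt may start with cache misses
  7. Prompt files containing PII / real customer data: must not be committed. Run a PII-scanning hook in CI
  8. A/B data polluted by new features: do not ship other new features during the canary, or attribution becomes muddled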
VII. Quick Reference

| Step | Tool |
|---|---|
| Write prompts | git markdown |
| Lint | yq + custom validator |
| Eval | eval_v1 (Day 169) |
| Publish | Langfuse SDK create_prompt |
| Canary | label-based routing |
| Monitor | Langfuse score + Prometheus |
| Rollback | label switch |

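VIII. Interview Questions

  1. Should prompts be managed in git or in a registry?

    • Hybrid is recommended: git is the source of truth and CI syncs automatically to the registry; the registry provides hot swap, gradual rollout, and A/B capabilities; never let the two drift apart
  2. Does a one-word prompt change still need to go through CI?

    • Yes. The behavioral effect of "small changes" on an LLM is the hardest to predict. CI should at least run the deterministic eval (finishes in 30s)
  3. How long must a canary run before it can be promoted?

    • At least 24h to cover one full business cycle; 7 days for critical financial decision prompts; monitor four dimensions: user_score, cost, latency, error
  4. Why is prompt rollback harder than code rollback?

    • Prompt caches may still serve the version being rolled back; downstream business logic may already have adapted to the new behavior; attribution is hard when several prompts are in gradual rollout at once
  5. How do you prevent developers from editing prompts directly in prod?

    • Dual sign-off via git owners on prompt files + the Langfuse production label can only be applied through the CI publish flow + prod write access granted only to the CI service account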

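Coming Up Tomorrow

Day 172: Fine-tuning Decisions: Prompt vs RAG vs FT. When is a prompt not enough and RAG needed? When is RAG not enough and fine-tuning needed? A decision framework + a comparison experiment of the three approaches on the same task.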