Expert Day 169
Week 25 Review: Integrated Eval Pipeline `eval_v1`
2026-10-17
Date: 2026-10-17 Track: AI Systems Engineering / Eval / Integration Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176) Tags: #EvalPipeline #CICD #Reporting #Integration
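Today's Goals
| Type | Content |
|---|---|
| Learn | Integrate the past 6 days of work (vLLM, cost, latency, deterministic checks, judge, golden set) into one end-to-end eval system |
| Practice | Write the `eval_v1` CLI supporting `eval run --golden golden.json --target prod_v2 --model claude-sonnet-4-6`, with a markdown report as output |
| Deliverable | `eval_v1/`: full directory layout, CLI, CI workflow, HTML report |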
1. Core Concepts
1.1 Eval Pipeline Overview
Golden Set (json) ──┐
                    │
Target System ──────┼── Runner ── Outputs ──┐
                    │                       │
Reference System ───┘                       ├── Det Eval ──┐
                                            │              │
                                            └── LLM Judge ─┤
                                                           │
                                                  ┌────────┴────────┐
                                                  │   Aggregator    │
                                                  └────────┬────────┘
                                                           │
                                          ┌────────────────┼─────────────────┐
                                          ▼                ▼                 ▼
                                      MD report      JSON metrics       PR comment
                                                           │
                                                           ▼
                                                     Langfuse / DB
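1.2 Design Principles
- Declarative: the eval flow is configured in YAML, with no code changes needed
- Re-entrant: runs can start or stop at any time and resume from a checkpoint
- Multi-dimensional reports: slice by category / subcategory / severity
- CI-friendly: return a non-zero exit code on failure and generate a PR comment
- Production reuse: the same runner code can also evaluate sampled production traffic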
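The re-entrant principle above can be sketched with an append-only JSONL checkpoint. This is a minimal illustration, not the `eval_v1` implementation: `load_done_ids`, `run_resumable`, and the checkpoint layout are hypothetical names chosen for this example, and `evaluate` stands in for whatever per-case evaluation the runner performs.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_done_ids(checkpoint: Path) -> set[str]:
    """Return ids of cases already recorded in the JSONL checkpoint."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_resumable(
    cases: Iterable[dict],
    evaluate: Callable[[dict], dict],
    checkpoint: Path,
) -> int:
    """Evaluate only cases missing from the checkpoint; return how many ran."""
    done = load_done_ids(checkpoint)
    ran = 0
    with checkpoint.open("a", encoding="utf-8") as f:
        for case in cases:
            if case["id"] in done:
                continue  # finished in an earlier run
            result = evaluate(case)
            f.write(json.dumps({"id": case["id"], **result}, ensure_ascii=False) + "\n")
            f.flush()  # hand each line to the OS so an interrupted run keeps its progress
            ran += 1
    return ran
```

Writing one line per finished case means a stopped run loses at most the case that was in flight; rerunning with the same checkpoint path picks up where it left off.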
2. Production Architecture
eval_v1/
├── eval/
│   ├── __init__.py
│   ├── cli.py                # entry: `python -m eval.cli run/report`
│   ├── runner.py             # runs the LLM calls
│   ├── checks/
│   │   ├── __init__.py
│   │   ├── deterministic.py  # Day 166
│   │   ├── judge.py          # Day 167
│   │   └── numeric.py
│   ├── golden.py             # load / validate the golden set
│   ├── report.py             # markdown / html / pr-comment
│   └── observability.py      # Langfuse / Prometheus push
├── configs/
│   ├── default.yaml
│   ├── ci.yaml               # PR stage (fast, small)
│   └── nightly.yaml          # nightly full run
├── golden/
│   ├── v1.0.0.json
│   └── v1.1.0.json
├── .github/workflows/
│   └── eval.yml
└── pyproject.toml
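The tree lists `golden.py` as the place where the golden set is loaded and validated, but its contents are not shown. As a rough sketch of what such validation might look like: the required fields below are the ones the runner reads (`version`, `cases`, and per case `id`, `input`, `category`, `severity`), while the function name `validate_golden` and the error-message format are assumptions made for this example.

```python
REQUIRED_KEYS = {"id", "input", "category", "severity"}
SEVERITIES = {"P0", "P1", "P2"}


def validate_golden(data: dict) -> list[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    errors = []
    if "version" not in data:
        errors.append("missing top-level 'version'")
    seen = set()
    for i, case in enumerate(data.get("cases", [])):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            errors.append(f"case #{i}: missing {sorted(missing)}")
            continue
        if case["id"] in seen:
            errors.append(f"case #{i}: duplicate id {case['id']!r}")
        seen.add(case["id"])
        if case["severity"] not in SEVERITIES:
            errors.append(f"case {case['id']!r}: unknown severity {case['severity']!r}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a CI job print every broken case in a single pass.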
3. Code Implementation
3.1 CLI
"""eval/cli.py"""
import argparse
import asyncio
import json
import sys
from pathlib import Path

import yaml

from eval.report import render_html, render_markdown, render_pr_comment
from eval.runner import run_pipeline


def main():
    p = argparse.ArgumentParser(prog="eval")
    sub = p.add_subparsers(dest="cmd", required=True)

    pr = sub.add_parser("run")
    pr.add_argument("--config", default="configs/default.yaml")
    pr.add_argument("--golden", required=True)
    pr.add_argument("--target", required=True, help="target system version (e.g. prompt v3)")
    pr.add_argument("--model", default="claude-sonnet-4-6")
    pr.add_argument("--out", default="reports/")
    pr.add_argument("--max-cases", type=int, default=None)
    pr.add_argument("--baseline", default=None, help="reference system to compare")

    rp = sub.add_parser("report")
    rp.add_argument("--results", required=True)
    rp.add_argument("--format", choices=["md", "html", "pr"], default="md")

    args = p.parse_args()

    if args.cmd == "run":
        with open(args.config, encoding="utf-8") as f:
            cfg = yaml.safe_load(f) or {}
        cfg.update({"target": args.target, "model": args.model, "max_cases": args.max_cases, "baseline": args.baseline})
        results = asyncio.run(run_pipeline(args.golden, cfg))

        # Save results before deciding the exit code, so failed runs still leave a report
        out_path = Path(args.out) / f"{args.target}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)

        # exit code = 1 if any P0 fails
        p0_fails = sum(1 for r in results["cases"] if r["severity"] == "P0" and not r["passed"])
        if p0_fails > 0:
            print(f"❌ {p0_fails} P0 failures", file=sys.stderr)
            sys.exit(1)
        print(f"✓ saved {out_path}")
    elif args.cmd == "report":
        with open(args.results, encoding="utf-8") as f:
            data = json.load(f)
        # "pr" is an accepted --format choice, so it needs a renderer as well
        renderers = {"md": render_markdown, "html": render_html, "pr": render_pr_comment}
        print(renderers[args.format](data))


if __name__ == "__main__":
    main()
3.2 Runner (core scheduling)
"""eval/runner.py"""
import asyncio
import json
import time

from anthropic import AsyncAnthropic

from eval.checks.deterministic import check_deterministic
from eval.checks.judge import judge_pairwise

client = AsyncAnthropic()


async def call_target(case: dict, model: str, prompt_version: str) -> dict:
    """Stand-in for your product system: pick the system prompt by prompt version."""
    system_prompts = {
        "prod_v2": "You are a financial assistant v2 ...",
        "prod_v3": "You are a financial assistant v3 ...",
    }
    sys_p = system_prompts[prompt_version]
    t0 = time.perf_counter()
    r = await client.messages.create(
        model=model,
        max_tokens=1024,
        system=sys_p,
        messages=[{"role": "user", "content": json.dumps(case["input"], ensure_ascii=False)}],
        temperature=0.0,
    )
    return {
        "text": r.content[0].text,
        "elapsed_s": time.perf_counter() - t0,
        "input_tokens": r.usage.input_tokens,
        "output_tokens": r.usage.output_tokens,
    }


def _slice_stats(results: list, key: str, values: tuple) -> dict:
    """Pass count and pass rate per value of `key`, skipping empty slices."""
    stats = {}
    for v in values:
        sub = [r for r in results if r[key] == v]
        if sub:
            passed = sum(r["passed"] for r in sub)
            stats[v] = {"n": len(sub), "pass": passed, "rate": passed / len(sub)}
    return stats


async def run_pipeline(golden_path: str, cfg: dict) -> dict:
    with open(golden_path, encoding="utf-8") as f:
        data = json.load(f)
    cases = data["cases"][: cfg.get("max_cases")]  # a None limit keeps every case
    sem = asyncio.Semaphore(cfg.get("concurrency", 16))

    async def one(case):
        async with sem:
            out = await call_target(case, cfg["model"], cfg["target"])
            det = check_deterministic(case, out["text"])
            judge = None
            if cfg.get("baseline"):
                base_out = await call_target(case, cfg["model"], cfg["baseline"])
                judge = await judge_pairwise(case, out["text"], base_out["text"])
            return {
                "id": case["id"],
                "category": case["category"],
                "severity": case["severity"],
                "passed": det["passed"],
                "failures": det["failures"],
                "judge": judge,
                "elapsed_s": out["elapsed_s"],
                "tokens": {"in": out["input_tokens"], "out": out["output_tokens"]},
            }

    results = await asyncio.gather(*[one(c) for c in cases])

    # aggregate
    n = len(results)
    p = sum(r["passed"] for r in results)
    latencies = sorted(r["elapsed_s"] for r in results)
    return {
        "version": data["version"],
        "target": cfg["target"],
        "model": cfg["model"],
        "n": n,
        "passed": p,
        "rate": p / n if n else 0.0,  # guard against an empty golden set
        "by_severity": _slice_stats(results, "severity", ("P0", "P1", "P2")),
        "by_category": _slice_stats(results, "category", ("normal", "edge", "adversarial", "regression")),
        # int(0.95 * n) is a valid index for any n >= 1; an empty run reports 0.0
        "p95_latency_s": latencies[int(0.95 * n)] if n else 0.0,
        "total_input_tokens": sum(r["tokens"]["in"] for r in results),
        "total_output_tokens": sum(r["tokens"]["out"] for r in results),
        "cases": results,
    }
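The runner picks the P95 latency by indexing directly into the sorted list. If percentiles are needed in more than one place, a small helper makes the definition explicit. The sketch below uses the nearest-rank definition, which is one common convention and can differ by one position from the runner's index; the helper name `percentile` and the choice to return 0.0 for an empty input are assumptions of this example.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    if not values:
        return 0.0  # assumption: an empty run reports 0.0 instead of raising
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For small PR runs (a few dozen cases) P95 is driven by one or two samples, so it is worth reading alongside the median rather than on its own.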
3.3 Report Generation
"""eval/report.py"""


def _cost_usd(data: dict) -> float:
    # claude-sonnet-4-6 pricing: $3 / $15 per 1M input / output tokens
    return data["total_input_tokens"] * 3 / 1e6 + data["total_output_tokens"] * 15 / 1e6


def render_markdown(data: dict) -> str:
    out = []
    out.append(f"# Eval Report: {data['target']}")
    # Two trailing spaces force a markdown line break between the summary lines
    out.append(f"\n**Model**: {data['model']}  ")
    out.append(f"**Golden version**: {data['version']}  ")
    out.append(f"**Total**: {data['passed']}/{data['n']} = **{data['rate']*100:.1f}%**  ")
    out.append(f"**P95 latency**: {data['p95_latency_s']:.2f}s  ")
    out.append(f"**Cost**: ${_cost_usd(data):.3f}\n")

    out.append("\n## By Severity\n| Sev | Pass | Rate |")
    out.append("|-----|------|------|")
    for sev, s in data["by_severity"].items():
        emoji = "✅" if s["rate"] == 1.0 else ("⚠️" if s["rate"] > 0.85 else "❌")
        out.append(f"| {emoji} {sev} | {s['pass']}/{s['n']} | {s['rate']*100:.1f}% |")

    out.append("\n## By Category\n| Category | Pass | Rate |")
    out.append("|----------|------|------|")
    for cat, s in data["by_category"].items():
        out.append(f"| {cat} | {s['pass']}/{s['n']} | {s['rate']*100:.1f}% |")

    fails = [r for r in data["cases"] if not r["passed"]]
    if fails:
        out.append(f"\n## Failures ({len(fails)})\n")
        for r in fails[:20]:  # cap the list so the report stays readable
            out.append(f"- **{r['id']}** ({r['severity']}/{r['category']}): {'; '.join(r['failures'])}")
    return "\n".join(out)


def render_html(data: dict) -> str:
    import markdown  # third-party; imported lazily so md-only runs do not need it

    md = render_markdown(data)
    return f"<html><body>{markdown.markdown(md, extensions=['tables'])}</body></html>"


def render_pr_comment(data: dict) -> str:
    rate = data["rate"]
    p0 = data["by_severity"].get("P0", {}).get("rate", 1.0)
    badge = "🟢" if rate > 0.95 else ("🟡" if rate > 0.85 else "🔴")
    # {{report_url}} renders as the literal placeholder {report_url} for the caller to fill in
    return f"""## {badge} Eval Result: {rate*100:.1f}% pass

| Metric | Value |
|--------|-------|
| Total | {data['passed']}/{data['n']} |
| P0 pass rate | {p0*100:.1f}% |
| P95 latency | {data['p95_latency_s']:.2f}s |
| Cost | ${_cost_usd(data):.3f} |

[Full report]({{report_url}})
"""
3.4 GitHub Actions CI
# .github/workflows/eval.yml
name: LLM Eval
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'eval_v1/**'
  schedule:
    - cron: '0 18 * * *'  # 18:00 UTC = 02:00 Beijing time, daily
permissions:
  contents: read
  pull-requests: write  # needed to post the PR comment
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e eval_v1
      - name: PR fast eval (only normal+regression P0)
        if: github.event_name == 'pull_request'
        run: |
          python -m eval.cli run \
            --config eval_v1/configs/ci.yaml \
            --golden eval_v1/golden/v1.1.0.json \
            --target prod_v3 \
            --model claude-sonnet-4-6 \
            --max-cases 50
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Nightly full eval
        if: github.event_name == 'schedule'
        run: |
          python -m eval.cli run \
            --config eval_v1/configs/nightly.yaml \
            --golden eval_v1/golden/v1.1.0.json \
            --target prod_v3 \
            --model claude-opus-4-7 \
            --baseline prod_v2
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Render PR comment
        # report.py is Python, so render with Python instead of require() from JS;
        # always() keeps this step running when the eval step exits 1 on a P0 failure
        if: always() && github.event_name == 'pull_request'
        run: |
          python -c "import json; from eval.report import render_pr_comment; print(render_pr_comment(json.load(open('reports/prod_v3.json', encoding='utf-8'))))" > pr_comment.md
      - name: PR comment
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('pr_comment.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
3.5 Default config
# configs/default.yaml
concurrency: 16
deterministic_checks:
  - schema
  - regex
  - contains
  - numeric
  - length
judge:
  enabled: false
  model: claude-opus-4-7
  flip_position: true
report:
  format: [md, json]
  push_to_langfuse: true
exit_on_p0_fail: true
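One detail to watch when combining a YAML config with CLI flags: the CLI in 3.1 calls `cfg.update(...)` with every flag, so an unset `--max-cases` (which argparse reports as `None`) would overwrite a `max_cases` value set in the config file. A minimal sketch of a merge that lets only explicitly provided flags win, with the helper name `merge_overrides` being hypothetical:

```python
def merge_overrides(cfg: dict, overrides: dict) -> dict:
    """Return a copy of cfg in which only non-None overrides replace file values."""
    merged = dict(cfg)  # shallow copy; the loaded config is left untouched
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged


file_cfg = {"concurrency": 16, "max_cases": 50}
cli_flags = {"max_cases": None, "model": "claude-sonnet-4-6", "baseline": None}
print(merge_overrides(file_cfg, cli_flags))
```

With this merge, `ci.yaml` could carry the 50-case limit itself and the workflow would not need to repeat `--max-cases`.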
4. Measured Cost & Performance
| Pipeline configuration | Time for 100 cases | Cost |
|---|---|---|
| Det only (CI fast) | 28 s | $0.04 |
| Det + judge (pairwise, flip) | 4.5 min | $1.80 |
| Det + judge + flip + ensemble | 7 min | $3.40 |
| Nightly full (500 cases, judge, baseline) | 35 min | $12 |
Monthly cost budget (typical SaaS team):
- PR eval (10 PRs/day × 28 days/month × $0.04) ≈ $11
- Nightly full (30 × $12) ≈ $360
- Sampled monitoring (200 items/day × $0.02 × 30 days) ≈ $120
- Total ≈ $500/month
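The budget above is simple enough to check by hand, but writing it out makes the assumptions visible. In this sketch the per-run costs come from the table and the volumes from the bullets; reading the "28" as days per month and multiplying the sampling line by 30 days are interpretations inferred from the stated totals.

```python
# Per-run costs in USD, taken from the table above.
PR_COST, NIGHTLY_COST, SAMPLE_COST = 0.04, 12.0, 0.02

pr_monthly = 10 * 28 * PR_COST             # 10 PRs/day x 28 days (assumed reading)
nightly_monthly = 30 * NIGHTLY_COST        # one full run per night
sampling_monthly = 200 * 30 * SAMPLE_COST  # 200 sampled items/day x 30 days (assumed)
total = pr_monthly + nightly_monthly + sampling_monthly

print(f"PR={pr_monthly:.2f} nightly={nightly_monthly:.2f} "
      f"sampling={sampling_monthly:.2f} total={total:.2f}")
# PR=11.20 nightly=360.00 sampling=120.00 total=491.20
```

The nightly run accounts for roughly three quarters of the total, so run frequency and judge usage there are the first levers if the budget has to shrink.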
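5. Applications in Finance
- Compliance audit evidence: link each deploy to its eval report → permanent storage in S3 → satisfies the regulatory 7-year retention requirement
- Reversible prompt changes: CI has already compared baseline and target, so a rollback decision has quantitative backing
- Reuse across business lines: credit, risk control, and customer service each keep their own golden set and share one CI pipeline
- Champion/challenger contest: once a month, "challenge prod_vN"; teams submit prompts and eval runs rank them automatically
- Regulatory filing: before a financial AI system goes live, eval pass rate + bias tests + adversarial robustness can serve as filing materials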
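For the audit-evidence item, one way to make a stored report tamper-evident is to hash a canonical form of the eval JSON and keep that hash next to the identifying fields. The sketch below is only an illustration: `build_evidence_manifest` is a hypothetical helper, `git_sha` is assumed to be supplied by the caller (for example from CI), the `model`, `version`, and `target` keys are the ones produced by `run_pipeline` in 3.2, and the actual upload to S3 is not shown.

```python
import hashlib
import json


def build_evidence_manifest(report: dict, git_sha: str) -> dict:
    """Hash a canonical JSON form of the eval report and attach identifying fields."""
    # sort_keys + fixed separators make the hash independent of key order and spacing
    canonical = json.dumps(report, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return {
        "report_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "model_id": report["model"],
        "golden_version": report["version"],
        "target": report["target"],
        "git_sha": git_sha,
    }
```

Hashing the canonical form rather than the file bytes means re-serializing the same results with different indentation still yields the same digest.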
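6. Production Lessons and Pitfalls
- If CI is too slow, nobody uses it: PR eval must finish in < 5 minutes. Use a cache to skip cases that have already been evaluated
- PR comments get noisy: comment only when the pass rate changes by > 1% or a P0 case fails
- Golden version not tied to code version: put the golden set's hash in the commit message
- Watching only the overall score: 80% pass looks fine, but a P0 pass rate of just 60% is a disaster. Reports must break results down by severity
- The eval system itself goes unreviewed: eval code needs unit tests too. Add an "eval pipeline self-test" case to the golden set
- Wrong model in CI: CI on haiku shows 95% pass while production on sonnet gets only 88%. CI must use the same model as production
- Latency checks that ignore cold start: the first request's TTFT can be high (vLLM not warmed up), so send warm-up requests before collecting statistics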
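Two of the pitfalls above translate directly into small helpers: fingerprinting the golden file so it can be quoted in a commit message or report, and gating the PR comment on a meaningful change. Both functions are hypothetical sketches; the 12-character prefix and the decision to always comment when no previous rate exists are choices made for this example, not part of `eval_v1`.

```python
import hashlib
from typing import Optional


def golden_fingerprint(raw: bytes) -> str:
    """Short SHA-256 prefix of the golden file bytes."""
    return hashlib.sha256(raw).hexdigest()[:12]


def should_comment(
    prev_rate: Optional[float],
    rate: float,
    p0_failures: int,
    min_delta: float = 0.01,
) -> bool:
    """Comment for P0 failures, first runs, or a pass-rate shift above min_delta."""
    if p0_failures > 0 or prev_rate is None:
        return True
    return abs(rate - prev_rate) > min_delta
```

A typical call would read the golden file with `Path(...).read_bytes()` and pass the previous run's rate from wherever the last report is stored.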
7. Quick Reference
| Command | Purpose |
|---|---|
| `eval run --golden v1.json --target v3` | Run an eval |
| `eval report --results out.json --format md` | Generate a report |
| `eval run --golden v1.json --baseline v2 --target v3` | Compare against a baseline |

| Required report fields |
|---|
| Overall pass rate |
| Breakdown by severity / category |
| Top failures |
| Cost / Latency |
| Golden version + Target version |
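8. Interview Questions
- What are the key decisions in eval pipeline design?
  - Sync vs. async (async is a must); tiered fast CI / slow nightly runs; severity levels; declarative config; re-entrancy
- How do you handle a failed CI eval?
  - P0 failures always block the merge; P1/P2 produce a warning in the PR comment; require review; golden updates need joint review with an SME
- Can the eval system itself be trusted?
  - Self-test with synthetic data; monitor the alert false-positive rate; periodically have humans spot-check reports against real behavior
- How do you run an A/B test across prompt versions in practice?
  - Run the same dataset twice (with different system prompts), compare with a pairwise judge and compute the win rate; promote the new version only if its win rate is > 55%
- How do you use eval reports as compliance evidence?
  - Link golden cases to regulatory clauses + store the eval JSON hash in an S3 WORM bucket + include the model_id/git_sha/golden_version triple in the report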
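The A/B answer above relies on a win rate computed from pairwise judge verdicts. The shape of `judge_pairwise`'s return value is not shown in these notes, so the sketch below assumes each verdict has already been reduced to one of three hypothetical labels, `"target"`, `"baseline"`, or `"tie"`, and it excludes ties from the denominator, which is one possible convention.

```python
from collections import Counter


def win_rate(verdicts: list[str]) -> float:
    """Share of decided comparisons won by the target; ties are excluded."""
    counts = Counter(verdicts)
    decided = counts["target"] + counts["baseline"]
    return counts["target"] / decided if decided else 0.0


def should_promote(verdicts: list[str], threshold: float = 0.55) -> bool:
    """Promote the new prompt only when its win rate clears the threshold."""
    return win_rate(verdicts) > threshold
```

With only a few dozen decided comparisons, a 55% win rate is well within noise, so the threshold is more meaningful on the nightly full set than on the PR subset.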
Coming Up Tomorrow
Day 170: LLMOps toolchain: Langfuse / Helicone / LangSmith for observability, tracing, and scoring. Connect eval reports, production calls, and cost metrics to Langfuse to achieve "production-grade observability".