AI MLOps Continuous Delivery Release Playbook
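Audience: AI PMs, AI Architects, Platform PMs, MLOps Leads, Model Risk, AI Governance, Release Managers, and technology leaders in retail financial services.
Core question: when an AI system's code, data, features, models, prompts, RAG indexes, tool schemas, policies, and evals all change at once, how does a team build a continuous delivery system that is reproducible, auditable, reversible, and able to ramp traffic in risk-tiered stages?
Learning objectives: this playbook skips business-analysis basics and does not stop at "train a model." It trains senior practitioners to design a CD4ML / MLOps continuous delivery architecture, release gates, a promotion workflow, risk-tiered rollout, governance evidence, and a presentable portfolio.
Important note: this document is learning, architecture-design, and portfolio material. It is not legal, regulatory, model-validation, or formal compliance advice. For a production project in retail financial services, the business owner, technology, security, privacy, legal, compliance, model risk, operational risk, and internal audit must jointly confirm the applicable requirements, approval authority, evidence retention, and release boundaries.
1. One-Sentence Positioning
CD4ML / MLOps continuous delivery is not about automatically deploying a notebook as an API. It is about:
Using a controlled pipeline to connect code, data, features, models, prompts, evaluation, release, monitoring, rollback, and governance evidence into a repeatable AI release system.
In traditional software, the core release object is usually a code package. The release object of an AI system is more complex: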
AI release =
code version
+ data snapshot
+ feature schema and transformation
+ model artifact
+ prompt / policy / tool config
+ eval dataset and result
+ deployment route
+ monitoring and rollback plan
+ governance evidence
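This playbook trains three senior-level capabilities:

| Capability | Senior-level performance | Portfolio asset |
|---|---|---|
| Release architecture | Combine ML pipelines, CI/CD, continuous training, a model registry, a feature store, eval gates, and deployment strategy into a production architecture | CD4ML reference architecture |
| Release decision | Make go / limited go / no-go / rollback decisions based on business risk, model performance, data quality, drift, cost, latency, and human controls | Risk-tiered release gate memo |
| Governance evidence | For every release, reproduce the model origin, training data, feature versions, eval results, approvals, exceptions, ramp, and rollback records | AI release evidence binder |

Core principles:
A model without lineage must not be released.
A model without an eval gate must not be ramped.
A model without a rollback path must not enter a high-risk process.
A model without an evidence binder will not pass a retail financial services audit.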
2. Source Anchors
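The sources below serve as learning anchors and terminology calibration. A production project must re-verify, as of the access date, the latest versions, product status, regional availability, contract terms, regulatory requirements, and internal institutional policies.

| Anchor | Link | How this playbook uses it |
|---|---|---|
| Martin Fowler / Thoughtworks CD4ML | https://martinfowler.com/articles/cd4ml.html | Learn the core idea of Continuous Delivery for Machine Learning: treat ML delivery as a cross-team, cross-artifact, cross-environment continuous delivery problem rather than one-off model training. |
| Google Cloud MLOps continuous delivery and automation pipelines | https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning | Learn the layered framework of MLOps automation levels, CI/CD/CT, pipeline orchestration, model deployment, and continuous training. |
| TensorFlow TFX | https://www.tensorflow.org/tfx | Learn production-grade ML pipeline component thinking: data ingestion, statistics, schema, transform, trainer, evaluator, pusher, metadata, and pipeline orchestration. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Use Govern / Map / Measure / Manage to organize risk identification, assessment, treatment, monitoring, and governance evidence for AI releases. |
| AI EvalOps Platform Architecture Playbook | docs/AI_EVALOPS_PLATFORM_ARCHITECTURE_PLAYBOOK.md | Connects eval datasets, evaluators, experiments, release gates, and production monitoring to this playbook's release process. |
| AI Model Risk Management Playbook | docs/AI_MODEL_RISK_MANAGEMENT_PLAYBOOK.md | Connects inventory, validation, ongoing monitoring, change management, and effective challenge to MLOps release governance. |
| AI Audit Evidence Binder Playbook | docs/AI_AUDIT_EVIDENCE_BINDER_PLAYBOOK.md | Turns every training run, evaluation, release, approval, exception, monitoring record, and rollback into an audit evidence package. |
| AI Incident / Postmortem / Reliability Playbook | docs/AI_INCIDENT_POSTMORTEM_RELIABILITY_PLAYBOOK.md | Connects rollback, containment, incident triggers, postmortems, and regression evidence to release engineering. |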
3. CD4ML Architecture
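3.1 Reference Architecture
The goal of a CD4ML architecture is not to "get the training script running" but to let every AI release answer six questions:
What did this change modify?
Which data, features, prompts, models, and configuration does it depend on?
Which datasets and evaluators prove its quality?
At which risk tier is it allowed to go live?
How is it ramped and monitored in stages?
If something goes wrong, how is it rolled back, contained, and reproduced?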
flowchart TB
A[Change Request] --> B[Risk Tiering and Release Scope]
B --> C[Source Control]
C --> D[CI: Code, Config, Prompt, Schema Tests]
D --> E[Data and Feature Validation]
E --> F[Pipeline Orchestrator]
F --> G[Training / Fine-tuning / Prompt Build]
G --> H[Model and Artifact Registry]
H --> I[Offline Eval and Validation]
I --> J[Release Gate]
J --> K[Staging / Shadow]
K --> L[Canary / Ramp]
L --> M[Production Route]
M --> N[Monitoring and Online Eval]
N --> O[Incident / Rollback / Retraining Trigger]
O --> B
H --> P[Lineage and Metadata Store]
I --> P
J --> Q[Governance Evidence Binder]
M --> Q
N --> Q
3.2 Eight Core Control Objects

| Control object | What it governs | Version granularity | Key evidence |
|---|---|---|---|
| Code | pipeline code, training code, serving code, feature transformation, policy code | commit SHA, build ID, container digest | CI result, test report, dependency scan, review approval |
| Data | raw data, training set, validation set, production sample, label set | dataset snapshot, partition, query hash, retention tag | dataset card, lineage, quality report, access approval |
| Feature | feature definition, schema, transformation, feature store view | feature view version, schema version, transform hash | feature validation, training-serving skew report |
| Model | trained artifact, fine-tuned model, adapter, calibration layer | model version, artifact hash, training run ID | model card, metrics, registry approval, signature |
| Prompt / Policy | system prompt, few-shot, tool instruction, guardrail policy, decision policy | prompt version, policy version, config hash | prompt diff, policy test, approved use boundary |
| Eval | eval dataset, evaluator, rubric, judge, threshold | eval run ID, dataset version, evaluator version | experiment report, slice analysis, gate decision |
| Deployment | endpoint, route, traffic split, feature flag, environment | release ID, deployment manifest, route config | deployment record, canary metrics, rollback plan |
| Governance | risk tier, approval, exception, monitoring, incident linkage | evidence binder version, decision log ID | gate memo, attestation, issue log, audit sample |
3.3 Model / Data / Code / Prompt Version Coupling
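AI releases most often get out of control when the team versions only the model artifact without binding it to the data, features, prompt, eval, and serving route.
The recommended approach is to define each released version as a release bundle: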
release_id: AIREL-AML-COPILOT-2026-007
use_case_id: AML-COPILOT-001
risk_tier: Tier 1
code:
  repo_commit: 9f2c81a
  pipeline_image: registry/aml-pipeline@sha256:4b7...
  serving_image: registry/aml-serving@sha256:2c1...
data:
  training_snapshot: AML-TRAIN-2026Q2-v3
  validation_snapshot: AML-VALIDATION-2026Q2-v2
  label_policy: AML-LABEL-GUIDE-v1.4
features:
  feature_view: aml-alert-features-v6
  transform_hash: 72e4...
model:
  model_id: aml-narrative-ranker
  model_version: 1.8.0
  artifact_hash: c55a...
prompt_policy:
  prompt_version: aml-summary-system-prompt-v2.3
  guardrail_policy: aml-output-policy-v1.2
  tool_schema: case-evidence-tool-v1.1
eval:
  dataset_versions:
    - AML-GOLDEN-2026Q2-v2
    - AML-REGRESSION-2026Q2-v4
    - AML-REDTEAM-2026Q2-v1
  evaluator_versions:
    - RULE-CITATION-v2
    - JUDGE-GROUNDEDNESS-v1.3
deployment:
  route: aml-copilot-prod
  traffic_start: shadow
  rollback_target: AIREL-AML-COPILOT-2026-006
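If a release bundle is missing any key object, the risk is not "incomplete documentation" but:

| Missing item | Actual risk |
|---|---|
| No data snapshot | Training results cannot be reproduced, drift cannot be explained, and validation disputes cannot be traced |
| No feature version | Training and online features diverge, and model performance degrades suddenly |
| No prompt / policy version | The same model version behaves differently, and incident reviews cannot locate the cause |
| No eval dataset version | Metrics are not comparable, and the release gate can be adjusted by hand |
| No deployment manifest | The version actually running in production differs from the approved version |
| No rollback target | During an incident the only option is an ad hoc shutdown, with no orderly recovery |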
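The completeness rule above can be sketched as a pre-gate check. This is a minimal illustration, not a definitive implementation: the function name `missing_bundle_fields` and the choice of which keys count as required are assumptions for this example, and the key names follow the sample release bundle shown earlier.

```python
# Hypothetical required keys per bundle section (an assumption for this sketch;
# a real project would derive this list from its own release policy).
REQUIRED_FIELDS = {
    "code": ["repo_commit", "serving_image"],
    "data": ["training_snapshot", "validation_snapshot"],
    "features": ["feature_view"],
    "model": ["model_version", "artifact_hash"],
    "prompt_policy": ["prompt_version", "guardrail_policy"],
    "eval": ["dataset_versions", "evaluator_versions"],
    "deployment": ["route", "rollback_target"],
}


def missing_bundle_fields(bundle: dict) -> list[str]:
    """Return dotted names of required fields that are absent or empty."""
    missing = []
    for section, keys in REQUIRED_FIELDS.items():
        values = bundle.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


example_bundle = {
    "code": {"repo_commit": "9f2c81a", "serving_image": "registry/aml-serving"},
    "data": {
        "training_snapshot": "AML-TRAIN-2026Q2-v3",
        "validation_snapshot": "AML-VALIDATION-2026Q2-v2",
    },
    "features": {"feature_view": "aml-alert-features-v6"},
    "model": {"model_version": "1.8.0", "artifact_hash": "placeholder-hash"},
    "prompt_policy": {
        "prompt_version": "aml-summary-system-prompt-v2.3",
        "guardrail_policy": "aml-output-policy-v1.2",
    },
    "eval": {
        "dataset_versions": ["AML-GOLDEN-2026Q2-v2"],
        "evaluator_versions": ["RULE-CITATION-v2"],
    },
    "deployment": {"route": "aml-copilot-prod", "rollback_target": ""},
}

# The empty rollback_target is reported, so this bundle would be blocked.
print(missing_bundle_fields(example_bundle))
```

A check like this fits naturally in CI, so that an incomplete bundle fails before it reaches a human gate review.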
A CD4ML architecture needs at least six kinds of registry / metadata capability:

| Registry | Managed objects | Must support |
|---|---|---|
| Model registry | model artifact, signature, metric, approval status, deployment stage | versioning, stage promotion, owner, risk tag, rollback target |
| Dataset registry | training / validation / eval / production sample | source lineage, snapshot, schema, classification, retention, allowed use |
| Feature registry / feature store | feature definition, transform, serving view | training-serving parity, freshness, schema validation, owner |
| Prompt / policy registry | prompt, rubric, tool instruction, guardrail config | diff, approval, hash, environment promotion, rollback |
| Experiment / eval registry | training run, eval run, slice metric, judge result | baseline comparison, critical failure, confidence interval, report link |
| Deployment registry | release bundle, route, traffic split, environment, rollback | immutable manifest, change approval, canary result, incident linkage |
3.5 Pipeline Types
Not every AI system needs the same pipeline. A senior architect chooses the pipeline by system type.

| System type | Pipeline focus | Release risk |
|---|---|---|
| Traditional ML classifier / ranker | data validation, feature engineering, training, model evaluation, serving parity | data drift, feature skew, threshold shift, label leakage |
| RAG assistant | source ingestion, chunking, embedding, index build, retrieval eval, answer eval | stale docs, ACL failure, wrong citation, index rollback |
| LLM prompt product | prompt registry, golden set, judge calibration, policy tests, route config | prompt drift, over-refusal, unsafe answer, vendor model change |
| Agent workflow | tool schema, permission matrix, simulation, state tests, side-effect audit | tool misuse, approval bypass, loop, irreversible action |
| Decision automation | model + rules + workflow + human override | customer impact, fairness, adverse action, regulatory audit |
4. Release Gate Model
4.1 Gate Philosophy
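A release gate is not an approval form that slows the project down. It turns "is the release risk acceptable?" into a repeatable, auditable decision.
Gates are best divided into three types:

| Gate type | Purpose | Decision output |
|---|---|---|
| Engineering gate | Prove the artifact can be built, tested, reproduced, and deployed | pass / fail |
| Quality gate | Prove the model, prompt, RAG, and tools meet thresholds on the target task | go / fix / compare again |
| Risk gate | Prove residual risk matches the business scenario, human controls, and ramp scope | go / limited go / no-go / risk acceptance |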
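4.2 Risk-Tiered Release Model

| Risk tier | Typical AI use cases | Gate strength | Ramp strategy | Approval requirements |
|---|---|---|---|---|
| Tier 0 - Prohibited or restricted | Direct automated loan denial, unapproved funds movement, automated regulatory report submission | Not allowed by default; if regulation and internal policy permit, requires the highest level of review | Does not enter production automation | Executive, Legal, Compliance, Risk, Board-level evidence |
| Tier 1 - High impact | AML / KYC copilot, fraud review, credit policy RAG, customer-visible answers on fees or rights | Full gate: data, feature, eval, security, privacy, model risk, business sign-off | shadow -> staff pilot -> canary -> limited ramp -> full scale | Business, Model Risk, Compliance, Security, Data Owner |
| Tier 2 - Medium impact | Internal staff analytics, operations summaries, low-risk recommendations, assistance for non-customer-visible processes | Standard gate: CI, eval, data quality, owner approval, monitoring | staging -> canary -> ramp | Product, Platform, Data Owner, Risk consult |
| Tier 3 - Low impact | Internal copywriting, low-risk summaries, personal productivity tools | Lightweight gate: security, data classification, basic eval, usage monitoring | direct limited release with monitoring | Product owner, Security policy |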
4.3 Gate Stack
| Gate | Entry condition | Checks | Action on failure |
|---|---|---|---|
| G0 Scope and Risk | Use case is registered | approved use, prohibited use, risk tier, customer impact, human role | Return to intake, restrict the use, or refuse entry into development |
| G1 Source and Data Readiness | Data sources are available | data owner, classification, lineage, label policy, schema, sampling, retention | Block training or restrict to sandbox |
| G2 Build and CI | Code, config, and prompt can be built | unit tests, schema tests, prompt tests, dependency scan, container build | Fix and rebuild |
| G3 Feature Validation | Features or index have been generated | schema drift, missingness, range, freshness, training-serving skew, ACL | Block training or roll back the feature / index |
| G4 Offline Eval | Candidate artifact has been produced | golden set, regression set, red-team, slice analysis, cost, latency | No-go or targeted remediation |
| G5 Risk and Security Review | Eval passes engineering thresholds | privacy, security, model risk, fairness, explainability, human control | Limited go, risk acceptance, or no-go |
| G6 Shadow Readiness | Staging is runnable | production-like traffic replay, no side effect, logging, monitoring dashboard | Stay in shadow and fix observability |
| G7 Canary | Small-traffic production | quality, latency, cost, feedback, override, critical failure | Automatic rollback or pause the ramp |
| G8 Ramp and Scale | Canary is stable | segment expansion, capacity, support readiness, issue aging | Reduce traffic or restrict scenarios |
| G9 Post-Release Review | Production has run for a period | expected vs actual, incident, drift, business value, evidence completeness | Update gates, datasets, controls, and training plan |
4.4 Hard Gates vs Soft Gates
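| Gate item | Hard gate example | Soft gate example |
|---|---|---|
| Critical failure | PII leakage, unauthorized tool actions, incorrect customer commitments, and unsupported regulated claims must all be 0 | Tone score is slightly low on low-risk copy |
| Data quality | Missing rate of a critical feature exceeds the threshold; training-serving schema mismatch | Slight freshness delay on a non-critical feature |
| Eval coverage | A high-risk slice has no sample coverage | Long-tail, low-risk intents are under-covered but not yet ramped |
| Security | Approval is missing for a write tool; a secret appears in logs | A low-to-medium-risk dependency CVE has a patch plan |
| Monitoring | Production traces cannot be tracked by release_id | A non-critical dashboard tile refreshes late |
| Rollback | No executable rollback target exists | The rollback drill exceeds the target time but remains executable |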
4.5 Release Gate Decision
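| Decision | Meaning | Applicable scenario |
|---|---|---|
| Go | All hard gates for the target risk tier and release scope are met | May proceed to the next stage or to production |
| Limited go | Hard gates pass, but users, regions, products, traffic, features, or human review must be restricted | A high-risk slice is under-covered, human review capacity is limited, or some monitoring still needs strengthening |
| No-go | A critical failure, major regression, missing evidence, or unavailable control exists | Fix, then re-run the gate |
| Rollback | A live version trips a quality, risk, cost, latency, or incident threshold | Switch back to the previous approved release bundle |
| Risk acceptance | A known residual risk is explicitly accepted, with a time limit, scope, compensating controls, and approval | Short-term business necessity is high and the risk is controllable |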
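A rough sketch of how hard and soft gate results could be combined into a pre-release decision. The `GateCheck` type and `decide` function are hypothetical names, and mapping a failed soft gate to "limited go" is a simplifying assumption of this example; rollback and risk acceptance are deliberately not modeled because they need human judgment and post-release signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCheck:
    name: str
    passed: bool
    hard: bool  # True for a hard gate, False for a soft gate


def decide(checks: list[GateCheck]) -> str:
    """Map gate results to a pre-release decision.

    Assumption of this sketch: any failed hard gate means no-go, and any
    failed soft gate downgrades the decision to limited go.
    """
    if any(c.hard and not c.passed for c in checks):
        return "no-go"
    if any(not c.hard and not c.passed for c in checks):
        return "limited go"
    return "go"


checks = [
    GateCheck("pii_leakage_is_zero", passed=True, hard=True),
    GateCheck("rollback_target_exists", passed=True, hard=True),
    GateCheck("long_tail_intent_coverage", passed=False, hard=False),
]
print(decide(checks))
```

Keeping the hard/soft flag on each check, rather than in the decision logic, makes it easier to review which items a given risk tier treats as non-negotiable.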
5.1 Environments and Promotion Path
local / notebook
-> experiment workspace
-> controlled training pipeline
-> staging
-> shadow
-> canary
-> limited production
-> scaled production
-> post-release review
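Each environment has a different responsibility:

| Environment | Purpose | Allowed | Prohibited |
|---|---|---|---|
| Local / notebook | Exploration, feature hypotheses, quick experiments | Use de-identified samples, generate experiment ideas | Connecting directly to production data and production tools |
| Experiment workspace | Controlled experiments and baseline comparison | Record runs, datasets, parameters, metrics | Manually copying artifacts to production |
| Training pipeline | Reproducible training and evaluation | Pin the data snapshot, feature version, and container | Running unregistered data or unreviewed dependencies |
| Staging | Validation in a production-like environment | Use approved configuration and mock / replay traffic | Producing side effects on real customers |
| Shadow | Out-of-band online evaluation | Read real requests; outputs do not affect users or system state | Triggering write tools or customer-visible content |
| Canary | Small share of real traffic | Restrict users, products, regions, and business hours | Expanding without monitoring and automatic rollback |
| Limited production | Controlled ramp | Expand by segment, with human review and enhanced monitoring | Crossing into unevaluated scenarios |
| Scaled production | Operation at scale | Routine monitoring, drift detection, periodic gates | Vendor or policy changes that bypass the gate |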
5.2 CI / CD / CT for ML
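| Capability | Meaning in conventional software | Extension in ML / AI |
|---|---|---|
| CI | Code build, unit tests, static checks | Test all of pipeline code, training code, serving code, feature transforms, prompts, policies, schemas, and eval code |
| CD | Automatically deploy artifacts that pass the gate | Deploy model / prompt / index / policy / route in stages as a release bundle |
| CT | Usually does not exist | Trigger training and re-validation on data drift, label arrival, business rule change, model degradation, or a periodic schedule |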
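5.3 CI Checklist

| Check | Example |
|---|---|
| Pipeline unit tests | Data ingestion, join, sampling, split, and label transform do not introduce leakage |
| Feature transform tests | Feature computation is stable on boundary, missing, and outlier values |
| Schema tests | Training, validation, and serving schemas are compatible |
| Prompt tests | Prohibited use, output format, tool-calling rules, and refusal rules pass smoke cases |
| Policy tests | Guardrail, DLP, permission, and threshold rules are explainable and regression-testable |
| Eval code tests | Evaluators do not depend on implicit global state; metric computation is reproducible |
| Reproducibility tests | With snapshot, container, and parameters pinned, a rerun produces explainable differences |
| Security tests | Scans of dependencies, secrets, containers, data egress, and tool permissions |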
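5.4 Continuous Training Triggers
Continuous training must not be equated with "new data arrives, so deploy automatically." Triggering training and triggering a release are two separate things.

| Trigger | Training action | Release action |
|---|---|---|
| Data drift | Train a candidate on the new distribution, or recalibrate | Must pass offline eval, slice comparison, and canary |
| Label arrival | Backtest and retrain with the new labels | If the gain is significant and there is no regression, enter the release gate |
| Business rule change | Update labeling rules, features, prompt, or policy | Decide by materiality whether it is a major release |
| Performance degradation | Produce a candidate fix version | Open an incident / issue; the fix must be proven before release |
| Scheduled refresh | Periodic training | Low-risk systems may generate candidates automatically; high-risk systems still require approval |
| Vendor model change | Re-evaluate the route or pin the old version | Pointing to latest without a gate is not allowed |
| RAG source update | Rebuild the index / embeddings | Promote only after retrieval eval, freshness check, and ACL check |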
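The separation between "trigger training" and "trigger release" can be illustrated with a small sketch. The `Candidate` type and both functions are hypothetical, and the rule that every tier needs both a gate pass and a recorded approval is a simplification; in practice the depth of gate and approval varies by risk tier.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    trigger: str       # e.g. "data_drift", "label_arrival", "scheduled_refresh"
    risk_tier: int     # 0 = prohibited/restricted, 1 = high, 2 = medium, 3 = low
    gate_passed: bool = False
    approved: bool = False


def on_trigger(trigger: str, risk_tier: int) -> Candidate:
    """A trigger only creates a candidate; it never changes production."""
    return Candidate(trigger=trigger, risk_tier=risk_tier)


def may_promote(candidate: Candidate) -> bool:
    """Promotion is a separate decision that needs gate and approval evidence."""
    if candidate.risk_tier == 0:
        return False  # not allowed by default in this sketch
    return candidate.gate_passed and candidate.approved


candidate = on_trigger("data_drift", risk_tier=1)
print(may_promote(candidate))  # freshly trained candidates are never promotable

candidate.gate_passed = True
candidate.approved = True
print(may_promote(candidate))
```

The design point is that `on_trigger` has no code path to production: the only way there is through `may_promote`.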
5.5 Feature Validation
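Feature validation is a hard gate for an MLOps release. It is more specific than "data quality" because it directly affects model behavior.

| Validation dimension | Question | Failure example |
|---|---|---|
| Schema | Are field names, types, enums, and units consistent? | income_amount changes from monthly income to annual income |
| Distribution | Has the distribution shifted abnormally relative to the training baseline? | The transaction amount distribution for the fraud model suddenly shifts right |
| Missingness | Does the missing rate exceed the threshold? | The missing rate of KYC document OCR fields rises from 4% to 22% |
| Freshness | Are features updated within SLA? | The AML transaction velocity feature is delayed by 2 days |
| Range | Are numeric ranges plausible? | Negative ages, or an abnormally high share of transactions with amount 0 |
| Cardinality | Have category values exploded or gone missing? | Merchant category gains a large number of unknown values |
| Label leakage | Does a feature contain future information or a target proxy? | Chargeback outcome enters the training features |
| Training-serving skew | Is online feature computation consistent with training? | Training uses net amount, serving uses gross amount |
| Access / ACL | Is the feature permitted for this use case? | Sensitive customer attributes enter a recommendation model where they are not permitted |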
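Two of the dimensions above, missingness and range, are simple enough to sketch in plain Python. The function names, the row format (a list of dicts), and the thresholds are assumptions for illustration; production pipelines would more likely use a dedicated data validation library.

```python
def missing_rate(rows: list[dict], field: str) -> float:
    """Share of rows where the field is absent or None."""
    if not rows:
        raise ValueError("cannot compute a missing rate on an empty sample")
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def validate_feature(
    rows: list[dict], field: str, max_missing: float, low: float, high: float
) -> list[str]:
    """Return the names of the failed checks for one numeric feature."""
    failures = []
    if missing_rate(rows, field) > max_missing:
        failures.append("missingness")
    present = [row[field] for row in rows if row.get(field) is not None]
    if any(value < low or value > high for value in present):
        failures.append("range")
    return failures


sample = [{"age": 30}, {"age": -1}, {"age": None}, {"age": 45}]

# Hypothetical thresholds: at most 10% missing, age between 0 and 120.
print(validate_feature(sample, "age", max_missing=0.10, low=0, high=120))
```

Returning the list of failed checks, rather than a single boolean, gives the gate memo something concrete to cite.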
5.6 Shadow / Canary / Ramp
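| Stage | Goal | Traffic | Success criteria | Stop conditions |
|---|---|---|---|---|
| Shadow | Validate behavior on real inputs without affecting users | 0% customer impact | Output quality, trace, latency, cost, and policy checks pass | Critical failure, missing logs, cost anomaly |
| Canary 1 | Small group of staff or a low-risk segment | 1%-5% | No critical failure; human override within threshold | Any high-risk failure or SLO breach |
| Canary 2 | Expand to representative segments | 5%-20% | Slices across products, regions, and channels are stable | A slice degrades significantly |
| Controlled ramp | Expand in stages | 20%-50% | Quality, business value, support queue, and cost are stable | Incident, complaints, human review backlog |
| Full scale | Routine production | Target traffic | Continuous monitoring and periodic review | Drift, vendor change, policy change |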
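The staged progression above behaves like a small state machine. The sketch below is illustrative only: the stage identifiers and `next_stage` are hypothetical names, and the two inputs stand in for the richer per-stage stop conditions listed in the table.

```python
STAGES = ["shadow", "canary_1", "canary_2", "controlled_ramp", "full_scale"]


def next_stage(current: str, critical_failures: int, slo_ok: bool) -> str:
    """Advance one stage at a time, or halt when a stop condition fires.

    "halt" means the ramp stops for review or rollback; deciding which of
    the two applies is left to the release process in this sketch.
    """
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current}")
    if critical_failures > 0 or not slo_ok:
        return "halt"
    index = STAGES.index(current)
    # The last stage has nowhere further to go, so it stays where it is.
    return STAGES[min(index + 1, len(STAGES) - 1)]


print(next_stage("shadow", critical_failures=0, slo_ok=True))
print(next_stage("canary_1", critical_failures=1, slo_ok=True))
```

Because the function only ever moves one step forward, skipping a stage requires an explicit code change that a reviewer can see.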
5.7 Rollback Strategy
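AI rollback is not just "rolling back code." It must be possible to roll back component by component.

| Rollback type | When to use | Action |
|---|---|---|
| Route rollback | A new model, prompt, or policy degrades | The model gateway or feature flag switches back to the previous approved release bundle |
| Model rollback | The model artifact introduces a regression | Revert the registry stage; the endpoint points to the old model version |
| Prompt rollback | A prompt change causes refusals, overreach, or tone degradation | Switch the prompt registry pointer back to the old version and record an incident |
| Index rollback | The new RAG index retrieves stale, wrong, or incorrectly permissioned documents | Restore the old index snapshot and pause source ingestion |
| Feature rollback | Feature computation error or schema drift | Switch back to the old feature view or disable the affected features |
| Tool rollback | An agent misuses a write tool, or approval is missing | Disable write tools; switch to read-only / draft-only mode |
| Policy rollback | A guardrail over-blocks or under-blocks | Restore the old policy config and strengthen human review |
| Batch quarantine | Batch output may be wrong | Pause downstream consumption, mark for review, and keep erroneous results out of customer processes |
| Compensation | A business side effect has already occurred | Reversal, correcting entry, customer remediation, regulatory or audit records |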
1. Product owner submits an AI release change request.
2. Release manager confirms the risk tier, scope, and affected components.
3. Pipeline pins the code commit, data snapshot, feature version, and prompt version.
4. CI runs code, config, schema, prompt, policy, and security tests.
5. Training pipeline produces the candidate model / prompt / index artifact.
6. Eval runner runs evaluation on the golden, regression, red-team, and slice sets.
7. Model registry records the artifact, metrics, lineage, and approval status.
8. Release gate board reviews eval, risk, security, rollback, and monitoring.
9. The release enters staging and shadow to confirm trace, latency, cost, policy, and fallback.
10. Canary goes live on a small share of traffic, with automated monitoring of critical failures, overrides, complaints, and cost.
11. Ramp expands by segment, and every stage produces evidence.
12. Post-release review updates the dataset, controls, issue log, and portfolio record.
6. Financial Retail Examples
6.1 Credit Policy RAG Assistant
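| Dimension | Design |
|---|---|
| Use case | Answer internal policy questions for credit operations and underwriting staff; answers must cite approved policy clauses |
| Release bundle | prompt, embedding index, policy document snapshot, retriever, reranker, answerability gate, citation judge |
| High-risk failures | Citing outdated policy, answering without evidence, presenting internal guidance as a customer decision, incorrect adverse-action-related statements |
| Gate | source freshness, ACL, retrieval recall, citation correctness, regulated refusal, SME review |
| Shadow / canary | Replay historical questions first, then open to a small group of trained underwriters |
| Rollback | Roll back the index, disable free-form generation, fall back to policy portal search and SME escalation |
| Evidence | policy source manifest, index lineage, eval report, SME sign-off, release gate memo |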
6.2 AML Alert Narrative Copilot
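| Dimension | Design |
|---|---|
| Use case | Help AML analysts summarize transaction evidence and draft the investigation narrative |
| Release bundle | model route, prompt, case evidence tool, transaction feature snapshot, groundedness judge, red-team set |
| High-risk failures | Unsupported suspicious activity conclusion, incorrect entity merging, omitted key evidence, improper PII handling |
| Gate | Critical failure is 0; historical incident regression 100% pass; SME review of high-risk slices |
| Shadow / canary | Generate analyst drafts only; never write to a SAR or close a case automatically |
| Rollback | Disable narrative generation, keep the evidence summary, allow manual drafting only |
| Evidence | trace sample, tool audit, human edit rate, model risk validation note |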
6.3 Fraud Decisioning Model
| Dimension | Design |
|---|---|
| Use case | Real-time transaction fraud scoring that supports approve / challenge / decline routing |
| Release bundle | feature view, training data snapshot, model artifact, threshold config, decision policy, monitoring dashboard |
| High-risk failures | False positives wrongly block customer transactions; false negatives increase losses; feature delay distorts scores |
| Gate | ROC / PR, cost-based threshold, segment fairness, latency, feature freshness, shadow backtest |
| Shadow / canary | Shadow first, recording recommendations without affecting transactions, then canary on a low-risk segment |
| Rollback | threshold rollback, model rollback, fallback rules, human review queue |
| Evidence | threshold decision memo, business loss simulation, segment analysis, rollback rehearsal |
6.4 Payment Dispute Agent
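| Dimension | Design |
|---|---|
| Use case | Read dispute materials, suggest next steps, and draft customer communication; may create a case note after approval |
| Release bundle | prompt, tool schema, approval workflow, idempotency policy, dispute document RAG, eval set |
| High-risk failures | Duplicate provisional credit, wrongly closed dispute, incorrect customer commitment, regulatory complaint not escalated |
| Gate | tool simulation, approval UI test, idempotency test, regulated phrase block, complaint escalation test |
| Shadow / canary | Draft-only with write tools off; then enable post-approval case note writing for trained agents |
| Rollback | Disable write tools, keep the read-only summary, freeze affected dispute outputs |
| Evidence | tool call ledger, approval record, customer communication review, incident drill result |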
6.5 Customer-Facing Service AI
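| Dimension | Design |
|---|---|
| Use case | Customer inquiries about accounts, fees, benefits, disputes, and service processes |
| Release bundle | model route, prompt, policy guardrail, answer templates, handoff rule, DLP, conversation eval |
| High-risk failures | Fabricated fee waiver commitments, incorrect complaint rights, PII leakage, failure to hand off to a human |
| Gate | Customer-visible critical failure is 0; tone, disclosure, handoff, and policy compliance pass |
| Shadow / canary | Internal agent assist -> small share of customer conversations -> ramp by topic |
| Rollback | Disable free-form answers; keep templated FAQ and human handoff |
| Evidence | conversation QA, complaint keyword monitoring, DLP report, handoff SLA |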
7. Controls and Evidence
7.1 Control Objective Map
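| Control objective | Control question | Evidence |
|---|---|---|
| Release scope is approved | Are the AI use case, risk tier, release scope, and prohibited uses clear? | use case card, risk tier memo, approved use record |
| Artifact is reproducible | Can training and evaluation be rerun from the same snapshot? | release bundle, pipeline run ID, container digest, dataset snapshot |
| Data is governed | Are data sources, classification, lineage, quality, and access controlled? | dataset card, data quality report, access approval, retention rule |
| Features are valid | Do features meet schema, freshness, range, and serving parity requirements? | feature validation report, skew report, feature owner approval |
| Model is validated | Does the model meet quality, stability, fairness, cost, and latency requirements? | experiment report, slice analysis, model card, validation memo |
| Prompt / policy is controlled | Are prompts, guardrails, and tool instructions versioned and approved? | prompt diff, policy test result, config hash |
| Eval is meaningful | Do the eval data and evaluators cover the target risks? | eval contract, dataset card, evaluator card, calibration report |
| Release is risk-tiered | Do gate strength and ramp strategy match the risk? | gate memo, approval trail, limited-go conditions |
| Deployment is observable | Can release_id, artifact, trace, and business impact be tracked in production? | deployment manifest, trace schema, dashboard |
| Rollback is executable | Can the system return quickly to a controlled state after an incident? | rollback runbook, rollback rehearsal, previous release bundle |
| Governance is auditable | Can decisions, exceptions, issues, and monitoring be audited? | evidence binder, issue log, attestation, review minutes |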
7.2 Reproducibility Minimum Evidence
Every model or AI system release must retain:

| Evidence | Minimum content |
|---|---|
| Code snapshot | repo, commit, branch protection, reviewer, build ID |
| Runtime environment | container digest, base image, Python / package lock, hardware profile |
| Data snapshot | source systems, query, partition, time window, sampling, row count, hash |
| Label snapshot | labeling guide, label source, reviewer role, quality check, label version |
| Feature snapshot | feature definition, transform code, feature store version, freshness |
| Training run | parameters, seed, algorithm, training metrics, resource usage |
| Model artifact | artifact URI, hash, signature, input/output schema, calibration |
| Prompt / policy | prompt text hash, policy config, tool schema, guardrail version |
| Eval run | dataset version, evaluator version, thresholds, slice metrics, raw failures |
| Deployment | route config, traffic split, feature flags, environment variables |
| Approval | decision maker, decision date, conditions, exceptions, expiry |
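Several rows above call for a hash (data snapshot, model artifact, prompt text). One common way to get a stable hash for a config or prompt record is to serialize it canonically before hashing, sketched below with the standard library. The function name `config_hash` and the sample config are hypothetical; canonical JSON with sorted keys is one reasonable convention, not the only one.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical JSON serialization of the config.

    Sorting keys and fixing separators makes the hash independent of key
    order and whitespace, so equal configs always hash to the same value.
    """
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


prompt_config = {
    "prompt_version": "aml-summary-system-prompt-v2.3",
    "guardrail_policy": "aml-output-policy-v1.2",
    "tool_schema": "case-evidence-tool-v1.1",
}
reordered = dict(reversed(list(prompt_config.items())))

# Same content in a different key order yields the same hash.
print(config_hash(prompt_config) == config_hash(reordered))
```

Storing this hash in the release bundle lets a reviewer confirm that the config running in production is byte-for-byte the one that was approved.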
7.3 Artifact Lineage
Every production prediction / response should be traceable to:
request_id
release_id
use_case_id
risk_tier
model_version
prompt_version
policy_version
feature_view_version
dataset_or_index_version
tool_schema_version
eval_gate_id
deployment_route
human_review_status
monitoring_tags
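For high-risk AI in retail financial services, the trace should also retain:

| Trace element | Why it matters |
|---|---|
| input classification | Determines whether the input contains sensitive categories such as PII, PCI, AML, credit, complaint, or vulnerable customer |
| retrieved evidence | Shows whether the RAG answer is supported by evidence and whether stale or incorrectly permissioned documents were used |
| feature values | Shows whether a traditional ML decision can be explained and recomputed |
| policy decisions | Explains why guardrail, DLP, handoff, or tool permission allowed or blocked an action |
| tool call ledger | Shows whether the agent triggered a side effect and whether it went through approval and idempotency controls |
| human override | Shows whether a human accepted, modified, or rejected the AI output |
| customer impact flag | Shows whether the output is customer-visible and whether it affects accounts, transactions, complaints, credit, or regulatory processes |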
7.4 Model Registry Minimum Fields

| Field | Example |
|---|---|
| model_id | fraud-realtime-score-v3 |
| model_version | 3.4.2 |
| artifact_hash | sha256:7b8f... |
| training_run_id | TRN-FRAUD-2026-0521 |
| training_dataset | FRAUD-TRAIN-2026Q2-v5 |
| feature_view | fraud-auth-features-v11 |
| model_signature | input schema, output schema, score range |
| owner | Fraud Analytics Lead |
| risk_tier | Tier 1 |
| approved_stage | staging / shadow / canary / production |
| approval_conditions | low-risk card-present segment, human review for high-value decline |
| eval_report | EVAL-FRAUD-2026-014 |
| rollback_target | fraud-realtime-score-v3:3.3.9 |
| monitoring_dashboard | production quality and drift dashboard |
| review_expiry | next quarterly validation date |
7.5 Governance Evidence Binder
Each Tier 1 / Tier 2 release should produce an evidence binder:

| Section | Evidence |
|---|---|
| Executive summary | release scope, risk tier, decision, conditions, rollback target |
| Architecture | pipeline diagram, component map, data flow, tool boundary, human role |
| Data and feature | dataset card, feature validation, lineage, privacy classification |
| Model / prompt / policy | model card, prompt diff, policy config, tool schema, model registry record |
| Eval | eval contract, golden / regression / red-team result, slice analysis, failure review |
| Security / privacy | DLP, access control, threat model, dependency scan, egress control |
| Release decision | gate memo, approval trail, exceptions, limited-go conditions |
| Deployment | manifest, traffic plan, shadow/canary metrics, runbook |
| Monitoring | dashboard snapshot, alert rules, sampling plan, human review queue |
| Rollback | rollback plan, rollback target, drill evidence, incident trigger |
| Post-release | actual outcomes, issues, remediation, dataset updates, review minutes |
8. Templates
8.1 AI Release Gate Memo
# AI Release Gate Memo: AML Copilot Narrative v2.3
## Decision
- decision: Limited go to canary
- release_id: AIREL-AML-COPILOT-2026-007
- use_case_id: AML-COPILOT-001
- risk_tier: Tier 1
- decision date: 2026-06-29
- rollback target: AIREL-AML-COPILOT-2026-006
## Scope
- approved use: analyst-facing draft narrative and evidence summary
- prohibited use: final SAR decision, automatic regulatory submission, automatic case closure
- canary scope: 15 trained Tier 2 AML analysts, high-risk alerts excluded from first 72 hours
- human control: analyst must review and edit before saving to case record
## Release Bundle
- code commit: 9f2c81a
- serving image: registry/aml-serving@sha256:2c1
- model version: aml-narrative-ranker 1.8.0
- prompt version: aml-summary-system-prompt v2.3
- evidence tool schema: case-evidence-tool v1.1
- transaction feature view: aml-alert-features v6
- RAG index: aml-policy-index-2026Q2-v2
## Eval Results
| Gate metric | Result | Threshold | Decision |
|---|---:|---:|---|
| Critical unsupported claim | 0 | 0 | pass |
| PII leakage | 0 | 0 | pass |
| Citation correctness | 96.8% | 95.0% | pass |
| Historical incident regression | 100% | 100% | pass |
| High-risk slice groundedness | 97.1% | 97.0% | pass |
| P95 latency | 7.8s | 9.0s | pass |
## Conditions
- first 72 hours: all generated narratives reviewed by AML Quality Lead sample queue
- high-risk entity ambiguity cases remain disabled until targeted sample count reaches approved coverage
- daily release check for citation correctness, unsupported claim, analyst major rewrite rate and latency
## Rollback
- automatic rollback trigger: any unsupported suspicious activity conclusion saved to case record
- manual rollback trigger: citation correctness below threshold for two consecutive daily reviews
- rollback action: route pointer returns to AIREL-AML-COPILOT-2026-006 and narrative generation is disabled for affected alerts
## Approval
- Business owner: AML Operations Director
- Model Risk: Model Risk VP
- Compliance: BSA/AML Compliance Lead
- Platform: AI Platform Owner
- Security / Privacy: approved for canary scope
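The pass/fail column in the memo's Eval Results table depends on metric direction: rates such as citation correctness must be at or above the threshold, while counts and latency must be at or below it. The sketch below replays the memo's numbers under that reading; `GateMetric` and the metric identifiers are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateMetric:
    name: str
    result: float
    threshold: float
    higher_is_better: bool

    def passed(self) -> bool:
        # Quality rates must meet or exceed the threshold;
        # failure counts and latency must stay at or below it.
        if self.higher_is_better:
            return self.result >= self.threshold
        return self.result <= self.threshold


# Values copied from the sample memo above.
MEMO_METRICS = [
    GateMetric("critical_unsupported_claim", 0, 0, higher_is_better=False),
    GateMetric("pii_leakage", 0, 0, higher_is_better=False),
    GateMetric("citation_correctness_pct", 96.8, 95.0, higher_is_better=True),
    GateMetric("historical_incident_regression_pct", 100.0, 100.0, higher_is_better=True),
    GateMetric("high_risk_slice_groundedness_pct", 97.1, 97.0, higher_is_better=True),
    GateMetric("p95_latency_s", 7.8, 9.0, higher_is_better=False),
]


def failed_metrics(metrics: list[GateMetric]) -> list[str]:
    return [metric.name for metric in metrics if not metric.passed()]


print(failed_metrics(MEMO_METRICS))
```

Recording the direction next to each threshold avoids a quiet class of gate errors, such as treating a latency above its limit as a pass.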
| Field | Example |
|---|---|
| Promotion ID | PROMO-FRAUD-2026-018 |
| From stage | shadow |
| To stage | canary |
| Model | fraud-realtime-score-v3 version 3.4.2 |
| Release bundle | AIREL-FRAUD-2026-018 |
| Baseline | production version 3.3.9 |
| Candidate benefit | 4.2% improvement in fraud capture at same false positive cost |
| Critical risks | false positive in vulnerable customer segment, feature freshness delay |
| Required controls | feature freshness alert, high-value decline human review, threshold rollback |
| Decision | promote to 5% low-risk card-present traffic |
| Expiry | decision expires after 14 days if ramp not completed |
8.3 Dataset and Feature Validation Card
| Field | Example |
|---|---|
| Validation ID | DATA-FEATURE-AML-2026-011 |
| Dataset snapshot | AML-TRAIN-2026Q2-v3 |
| Feature view | aml-alert-features-v6 |
| Owner | AML Data Product Owner |
| Source systems | transaction monitoring warehouse, KYC profile store, case management |
| Sampling method | stratified by alert type, risk score band, entity type and investigation outcome |
| Schema result | pass, 0 incompatible fields |
| Missingness result | pass, all critical features below threshold |
| Freshness result | pass, transaction velocity features updated within SLA |
| Leakage check | pass, post-investigation outcome fields excluded |
| Access control | pass, restricted AML fields available only in approved environment |
| Decision | approved for Tier 1 candidate training and offline eval |
8.4 Experiment Comparison Report
| Section | Content |
|---|---|
| Experiment ID | EXP-CREDIT-RAG-2026-009 |
| Baseline | prompt v1.7 + index 2026Q2-v1 + model route A |
| Candidate | prompt v1.8 + index 2026Q2-v2 + model route A |
| Dataset | CREDIT-GOLDEN-2026Q2-v3, CREDIT-REGRESSION-2026Q2-v2, CREDIT-REDTEAM-2026Q2-v1 |
| Primary outcome | citation correctness improved from 93.4% to 96.2% |
| Critical failures | 0 in baseline, 0 in candidate |
| Material regression | candidate over-refusal increased in small business lending policy slice |
| Cost / latency | latency +0.6s due to reranker; cost within approved budget |
| Decision | limited go for consumer credit policy; small business slice remains on baseline |
| Evidence | eval run report, failure sample review, SME sign-off, prompt diff |
8.5 Canary Plan
| Field | Example |
|---|---|
| Canary ID | CANARY-CX-AI-2026-004 |
| Use case | customer-facing fee and account service assistant |
| Entry condition | shadow passed for 10 business days, 0 customer-visible critical failures |
| Traffic | 2% authenticated web chat, excluding complaint, hardship and vulnerable customer tags |
| Duration | 5 business days before ramp decision |
| Monitored metrics | policy violation, incorrect fee statement, handoff failure, DLP hit, customer negative feedback, p95 latency |
| Auto rollback | any incorrect fee waiver commitment, any PII leak, handoff failure above threshold |
| Manual review | daily QA sample of 100 conversations plus all negative feedback with fee / complaint keywords |
| Exit decision | go to 10% ramp, extend canary, rollback or no-go |
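The auto rollback row of the canary plan can be expressed as a small check that returns the reasons it fired. This is a sketch under assumptions: the metric key names, the function name, and the 2% handoff threshold used in the example are hypothetical, since the plan only says "above threshold".

```python
def auto_rollback_reasons(metrics: dict, max_handoff_failure_rate: float) -> list[str]:
    """Return the auto rollback triggers that fired; an empty list means none."""
    reasons = []
    if metrics.get("incorrect_fee_waiver_commitments", 0) > 0:
        reasons.append("incorrect fee waiver commitment")
    if metrics.get("pii_leaks", 0) > 0:
        reasons.append("PII leak")
    if metrics.get("handoff_failure_rate", 0.0) > max_handoff_failure_rate:
        reasons.append("handoff failure above threshold")
    return reasons


canary_window = {
    "incorrect_fee_waiver_commitments": 0,
    "pii_leaks": 1,
    "handoff_failure_rate": 0.01,
}

# Hypothetical threshold of 2% for handoff failures.
reasons = auto_rollback_reasons(canary_window, max_handoff_failure_rate=0.02)
print(reasons)
```

Returning reasons rather than a bare boolean means the rollback event can be logged with its cause, which the incident and evidence records need anyway.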
8.6 Rollback Runbook
| Step | Owner | Action | Evidence |
|---|---|---|---|
| 1 | Incident Commander | Declare rollback trigger and freeze ramp | incident / decision log |
| 2 | Platform Owner | Set route to previous approved release bundle | deployment event, route diff |
| 3 | Model Registry Owner | Confirm production stage points to rollback target | registry audit record |
| 4 | Prompt / Policy Owner | Revert prompt and guardrail config if affected | prompt registry version diff |
| 5 | Data / RAG Owner | Revert index or feature view if affected | index manifest, feature view version |
| 6 | Business Owner | Activate fallback workflow and user communication | operations notice |
| 7 | Risk / Compliance | Confirm impacted population query and evidence preservation | impact query result |
| 8 | EvalOps Owner | Add failure cases to regression set and validate fix | regression run ID |
| 9 | Release Manager | Prepare restart gate memo | restart decision record |
8.7 Continuous Training Trigger Record
| Field | Example |
|---|---|
| Trigger ID | CT-FRAUD-2026-021 |
| Trigger source | production drift dashboard |
| Trigger condition | card-not-present merchant category distribution shifted beyond approved threshold |
| Candidate action | train fraud model candidate using 2026Q2-v6 snapshot |
| Release rule | candidate generation is automatic; production promotion requires full Tier 1 gate |
| Required eval | segment performance, false positive cost, vulnerable customer slice, latency, feature freshness |
| Owner | Fraud Analytics Lead |
| Decision | candidate trained and held in staging pending model risk review |
8.8 Evidence Binder Index
| Binder section | Artifact |
|---|---|
| 01 Scope | use case card, risk tier memo, approved/prohibited use |
| 02 Architecture | CD4ML diagram, component map, data flow, rollback map |
| 03 Data | dataset card, feature validation, lineage and access evidence |
| 04 Build | CI report, container digest, dependency and security scan |
| 05 Model | model card, training run, registry entry, calibration record |
| 06 Prompt / Policy | prompt diff, policy tests, tool schema review |
| 07 Eval | eval contract, experiment report, slice failures, SME review |
| 08 Gate | release gate memo, approvals, limited-go conditions |
| 09 Deployment | manifest, shadow result, canary plan, ramp log |
| 10 Monitoring | dashboard, alert rules, human review sample, drift report |
| 11 Rollback | rollback target, runbook, drill result, actual rollback record |
| 12 Post-release | review minutes, issues, remediation, regression updates |
9. 30-Day Training Plan
Goal: within 30 days, build a presentable CD4ML / MLOps release engineering portfolio around one retail financial services AI use case. Recommended main threads: Credit Policy RAG, AML Copilot, Fraud Scoring Model, Payment Dispute Agent, or Customer-Facing Service AI.

| Day | Task | Artifact |
|---|---|---|
| 1 | Choose a use case; define the business goal, users, customer impact, and prohibited uses | Use Case Card |
| 2 | Assign the risk tier and explain why it is Tier 1 / Tier 2 / Tier 3 | Risk Tier Memo |
| 3 | Draw the AS-IS / TO-BE AI release lifecycle | Release Lifecycle Map |
| 4 | Design the CD4ML reference architecture | Architecture Diagram |
| 5 | Break down the release bundle: code, data, feature, model, prompt, policy, eval, deployment | Release Bundle Spec |
| 6 | Design model registry fields and stage promotion | Model Registry Spec |
| 7 | Design the dataset registry and dataset card | Dataset Governance Pack |
| 8 | Design feature validation checks | Feature Validation Card |
| 9 | Design the prompt / policy registry and the prompt diff process | Prompt Governance Spec |
| 10 | Design CI checks: code, schema, prompt, policy, security | CI Checklist |
| 11 | Design the training pipeline and reproducibility evidence | Training Pipeline Spec |
| 12 | Design eval datasets: golden, regression, red-team, slice | Eval Dataset Plan |
| 13 | Design evaluators: deterministic, human, judge, business metric | Evaluator Card |
| 14 | Write a baseline vs candidate experiment report | Experiment Report |
| 15 | Design the release gate stack: G0 to G9 | Gate Model |
| 16 | Write a sample Tier 1 release gate memo | Gate Memo |
| 17 | Design the shadow process and production-like replay | Shadow Plan |
| 18 | Design the canary and ramp strategy | Canary Plan |
| 19 | Design the rollback matrix: model, prompt, index, feature, tool, route | Rollback Matrix |
| 20 | Design continuous training triggers and release constraints | CT Trigger Policy |
| 21 | Design the monitoring dashboard: quality, drift, cost, latency, human override | Monitoring Spec |
| 22 | Design artifact lineage and the trace schema | Lineage Spec |
| 23 | Design the governance evidence binder | Evidence Binder Index |
| 24 | Write retail financial services case 1: Credit Policy RAG | Case Study 1 |
| 25 | Write retail financial services case 2: AML / Fraud / Payment Agent | Case Study 2 |
| 26 | Run a release tabletop: candidate fail, canary rollback, vendor change | Tabletop Decision Log |
| 27 | Write the post-release review template and issue loop | Post-Release Review |
| 28 | Write a build vs buy ADR: TFX / managed MLOps / internal platform | Architecture Decision Record |
| 29 | Assemble a 15-page portfolio pack | Portfolio Deck |
| 30 | Prepare 8-10 interview answers and a 5-minute narrative | Interview Pack |
10. Interview Answers
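Q1: What is the biggest difference between CD4ML and ordinary CI/CD?
| Version | Answer |
| --- | --- |
| 30 seconds | Ordinary CI/CD mainly ships code. CD4ML ships a combination of code, data, feature, model, prompt, eval, and deployment route. It must not only deploy, but also reproduce training, prove quality, ramp in stages, monitor drift, and roll back to a controlled state. |
| 2 minutes | I think of CD4ML as AI release engineering. In traditional software, one commit builds an artifact, and once it passes tests it can be deployed. In ML/AI, model behavior depends on training data, labels, features, parameters, prompt, RAG index, tool schema, and the production distribution. So a release has to be a bundle: code commit, data snapshot, feature view, model artifact, prompt/policy version, eval run, deployment manifest, and rollback target. In financial retail, this evidence also has to go into the release gate and the audit binder. |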
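The release bundle named in the 2-minute answer can be sketched as a small immutable record. This is a minimal sketch, not a reference implementation: the class name `ReleaseBundle`, the `release_id` method, and every version string are hypothetical, and hashing a canonical JSON form is just one possible way to derive an identifier that changes whenever any component changes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReleaseBundle:
    """Illustrative release bundle: each field pins one versioned component."""
    code_commit: str
    data_snapshot: str
    feature_view: str
    model_artifact: str
    prompt_policy_version: str
    eval_run: str
    deployment_manifest: str
    rollback_target: str

    def release_id(self) -> str:
        # Hash a canonical JSON form so that changing any component changes the ID.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# All identifiers below are made-up placeholders.
baseline = ReleaseBundle(
    code_commit="a1b2c3d",
    data_snapshot="snapshot-001",
    feature_view="fv-12",
    model_artifact="model-3.4",
    prompt_policy_version="prompt-v7",
    eval_run="eval-118",
    deployment_manifest="deploy-42",
    rollback_target="release-prev",
)

# A prompt-only change still produces a new release with a new ID.
candidate = replace(baseline, prompt_policy_version="prompt-v8")
print(baseline.release_id(), candidate.release_id())
```

Because the ID is derived from all fields, a production trace that records it can be resolved back to the exact code, data, feature, model, prompt, and eval versions.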
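Q2: Why does a model release need model/data/code/prompt version coupling?
| Version | Answer |
| --- | --- |
| 30 seconds | Because model behavior is not determined by the model artifact alone. A change to data, features, prompt, policy, RAG index, or serving route can change the output. Without version coupling, you cannot reproduce, audit, roll back, or explain an incident. |
| 2 minutes | I manage coupling with a release bundle. For example, one version of AML Copilot includes not only the model but also transaction features, the case evidence tool, prompt, guardrail policy, RAG index, eval dataset, and judge version. In an incident, knowing only the model version does not tell you whether a new prompt caused overconfidence, the index cited an outdated policy, or the tool schema allowed a wrong action. Version coupling is what makes lineage, gates, rollback, and model risk evidence possible. |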
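One practical payoff of version coupling is incident triage: comparing the last known-good bundle with the current one narrows the suspects to the components that actually changed. The following is a rough sketch under the assumption that each bundle is stored as a flat mapping from component name to version; the function name and all version strings are hypothetical.

```python
from __future__ import annotations


def changed_components(
    last_good: dict[str, str], current: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Return {component: (last_good_version, current_version)} for every difference."""
    diff = {}
    for name in sorted(set(last_good) | set(current)):
        before, after = last_good.get(name), current.get(name)
        if before != after:
            diff[name] = (before, after)
    return diff


# Placeholder versions for an AML Copilot style bundle.
last_good = {
    "model": "aml-llm-3",
    "prompt": "p-11",
    "rag_index": "idx-41",
    "tool_schema": "ts-5",
    "judge": "j-2",
}
current = dict(last_good, prompt="p-12", rag_index="idx-42")

suspects = changed_components(last_good, current)
print(suspects)
```

Here the model version is unchanged, so the investigation can start with the prompt and the index rather than with retraining.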
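Q3: How are CI, CD, and CT divided in MLOps?
| Version | Answer |
| --- | --- |
| 30 seconds | CI validates code, schema, feature, prompt, policy, and security. CD deploys release bundles that have passed the gates, in stages. CT generates candidate models in response to data drift, label arrival, business rule changes, or performance degradation, but a candidate model cannot bypass the release gate and go to production automatically. |
| 2 minutes | CI is the entry point for engineering quality. It includes pipeline tests, feature transform tests, schema tests, prompt tests, policy tests, eval code tests, and dependency scans. CD is deployment orchestration. It includes registry promotion, staging, shadow, canary, ramp, and rollback. CT is continuous training. Its triggers can be data drift, label arrival, model degradation, or a periodic refresh. The key point is that CT only generates a candidate automatically, which is not the same as releasing automatically. High-risk financial use cases must go through eval, model risk, business approval, and canary. |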
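The rule that CT produces a candidate but never a release can be made explicit in the trigger logic itself. The sketch below is illustrative only: the signal names, the thresholds, and the returned dictionary shape are all assumptions, and the drift metric is left abstract.

```python
from dataclasses import dataclass


@dataclass
class TrainingSignals:
    drift_score: float        # any drift statistic; the choice of metric is an assumption
    new_labels: int           # labels that arrived since the last training run
    metric_drop: float        # drop versus the last validated baseline
    days_since_training: int


def ct_reasons(s: TrainingSignals) -> list:
    """Collect the trigger types named in the text; all thresholds are placeholders."""
    reasons = []
    if s.drift_score > 0.2:
        reasons.append("data_drift")
    if s.new_labels >= 10_000:
        reasons.append("labels_arrived")
    if s.metric_drop > 0.02:
        reasons.append("model_degradation")
    if s.days_since_training >= 90:
        reasons.append("periodic_refresh")
    return reasons


def ct_action(s: TrainingSignals) -> dict:
    reasons = ct_reasons(s)
    if not reasons:
        return {"action": "none", "reasons": []}
    # CT only ever registers a candidate; promotion stays with the release gate.
    return {
        "action": "train_candidate",
        "stage": "candidate",
        "auto_promote": False,
        "reasons": reasons,
    }


print(ct_action(TrainingSignals(0.35, 500, 0.0, 10)))
```

Keeping `auto_promote` hard-coded to `False` in the CT path means a retraining job has no code path to production that skips eval, model risk review, business approval, and canary.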
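Q4: Why is feature validation a hard threshold in the release gate?
| Version | Answer |
| --- | --- |
| 30 seconds | Features are the business world as the model actually sees it. An error in schema, freshness, missingness, range, label leakage, or training-serving skew makes the model produce wrong decisions even though the deployment looks successful. |
| 2 minutes | A traditional API test may only prove that the service is available. It does not prove that the model inputs are correct. For example, if a fraud model receives gross transaction amounts online but was trained on net amounts, its scores shift systematically. If an AML velocity feature lags by two days, key behavior is missed. If a credit model has future outcomes in its training data, that is label leakage. Feature validation has to check schema, distribution, missingness, freshness, range, cardinality, access control, and training-serving skew. A Tier 1 model that has not passed these checks should not enter canary. |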
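A few of the checks listed above (schema, missingness, range, freshness, training-serving skew) can be sketched for a single numeric feature. This is a deliberately simplified sketch: the function name and parameters are hypothetical, the thresholds are placeholders, and the skew check is a crude relative shift of the mean rather than a proper distribution test.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def validate_feature(values, *, max_missing_rate, value_range, last_updated,
                     max_age, training_mean, max_relative_shift, now):
    """Return the names of failed checks for one numeric feature (empty list = pass)."""
    failures = []
    present = [v for v in values if v is not None]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]

    if len(numeric) != len(present):
        failures.append("schema")
    if not values or (len(values) - len(present)) / len(values) > max_missing_rate:
        failures.append("missingness")
    low, high = value_range
    if any(v < low or v > high for v in numeric):
        failures.append("range")
    if now - last_updated > max_age:
        failures.append("freshness")
    if numeric:
        # Crude skew check: relative shift of the serving mean against the training mean.
        shift = abs(mean(numeric) - training_mean) / (abs(training_mean) or 1.0)
        if shift > max_relative_shift:
            failures.append("training_serving_skew")
    return failures


# Made-up serving sample: amounts about 20% higher than in training, refreshed two days ago.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
result = validate_feature(
    [118.0, 122.0, 120.0, None],
    max_missing_rate=0.3,
    value_range=(0.0, 10_000.0),
    last_updated=now - timedelta(days=2),
    max_age=timedelta(days=1),
    training_mean=100.0,
    max_relative_shift=0.1,
    now=now,
)
print(result)
```

The sample mimics the gross-versus-net and delayed-feature examples from the answer: the service would respond normally, yet both the freshness and the skew check fail.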
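Q5: How do you design an AI release gate?
| Version | Answer |
| --- | --- |
| 30 seconds | I design the gate stack by risk tier: scope, data, CI, feature validation, offline eval, security/privacy, shadow, canary, ramp, post-release review. For a high-risk use case, critical failures must be 0 and there must be a rollback target. |
| 2 minutes | A release gate should look at engineering, quality, and risk separately. Engineering covers build, tests, schema, container, and deployment manifest. Quality covers golden set, regression set, red-team, slice analysis, cost, and latency. Risk covers customer impact, regulated processes, human control, privacy, security, model risk, and residual risk. The decision can be go, limited go, no-go, rollback, or risk acceptance. In financial retail, customer-visible false commitments, PII leakage, unauthorized tool actions, and unsupported regulated claims are all hard stops. |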
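The decision logic in this answer can be sketched as a small function. Several points here are assumptions rather than statements from the text: Tier 1 is treated as the highest-risk tier, the zero-critical-failure rule is applied at every tier (stricter than the answer requires), and a non-hard-stop failure at a lower tier maps to "limited go". The names `GateResult` and `gate_decision` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    hard_stop: bool = False   # e.g. PII leakage or an unauthorized tool action


def gate_decision(risk_tier: int, gates: list, critical_failures: int,
                  has_rollback_target: bool) -> str:
    failed = [g for g in gates if not g.passed]
    # Hard stops and critical failures block the release at every tier (assumption).
    if any(g.hard_stop for g in failed) or critical_failures > 0:
        return "no-go"
    # Assumption: Tier 1 is the highest-risk tier and tolerates no failed gate.
    if risk_tier == 1 and (failed or not has_rollback_target):
        return "no-go"
    # Assumption: lower tiers may proceed with restricted scope on soft failures.
    if failed:
        return "limited go"
    return "go"


all_passed = [
    GateResult("offline_eval", True),
    GateResult("security_privacy", True, hard_stop=True),
]
latency_failed = [GateResult("latency", False)]

print(gate_decision(1, all_passed, 0, True))        # go
print(gate_decision(1, all_passed, 0, False))       # no-go
print(gate_decision(2, latency_failed, 0, True))    # limited go
```

Rollback and risk acceptance are left out on purpose: they are decisions taken after release or by an accountable owner, not outcomes a pre-release function should return on its own.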
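Q6: How should shadow, canary, and ramp be used?
| Version | Answer |
| --- | --- |
| 30 seconds | Shadow evaluates real inputs on a side path without affecting users. Canary validates quality, cost, latency, and human feedback on a small share of real user traffic. Ramp expands by segment. Every step needs success criteria and automatic stop conditions. |
| 2 minutes | I first replay real requests in shadow to confirm trace, policy, latency, cost, and output quality, without triggering write tools or customer-visible actions. Canary starts with a low-risk segment or trained internal users, for example 1%-5% of traffic, and monitors critical failures, human override, complaints, DLP, latency, and cost. Ramp should not expand mechanically by percentage. It should be segmented by product, region, user, risk tier, and support capacity. Any degradation in a high-risk slice should pause the expansion. |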
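An automatic stop condition for canary can be as simple as comparing each monitored metric against an upper limit. The sketch below is one possible shape, not a prescribed one: the metric names follow the signals listed in the answer, while the function name, the limit values, and the fail-closed treatment of missing metrics are assumptions.

```python
def canary_check(metrics: dict, limits: dict):
    """Return ("halt" or "continue", sorted list of breached metric names)."""
    breaches = [
        name for name in sorted(limits)
        # A missing metric counts as a breach so the check fails closed.
        if metrics.get(name, float("inf")) > limits[name]
    ]
    return ("halt" if breaches else "continue", breaches)


# Placeholder limits; real values would come from the release gate memo.
LIMITS = {
    "critical_failures": 0,
    "human_override_rate": 0.15,
    "complaints": 2,
    "dlp_alerts": 0,
    "p95_latency_ms": 2500,
    "cost_per_request_usd": 0.05,
}

healthy = {
    "critical_failures": 0,
    "human_override_rate": 0.08,
    "complaints": 0,
    "dlp_alerts": 0,
    "p95_latency_ms": 1800,
    "cost_per_request_usd": 0.03,
}
degraded = dict(healthy, human_override_rate=0.22, dlp_alerts=1)

print(canary_check(healthy, LIMITS))
print(canary_check(degraded, LIMITS))
```

Running the same check per segment (product, region, user group, risk tier) rather than on aggregate traffic is what lets a degraded high-risk slice pause the ramp even when the overall numbers look healthy.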
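Q7: How do you design rollback for an AI system?
| Version | Answer |
| --- | --- |
| 30 seconds | AI rollback has to be designed per component: model route, prompt, RAG index, feature view, tool permission, policy config, and batch output may each need to be rolled back separately. A high-risk system without a rollback target should not go live. |
| 2 minutes | I specify the rollback target in the release bundle. For example, if Credit Policy RAG returns an outdated citation, roll back the index. If a prompt causes wrong refusals, roll back the prompt. If fraud model false positives rise, roll back the model or the threshold. If the Payment Agent misuses a tool, first disable the write tools and keep the read-only summary. Rollback also has to include the impacted population query, evidence preservation, customer remediation, and regression case updates. |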
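The per-component rollback described in this answer can be expressed as a lookup from symptom to component and action. The four rows mirror the examples in the answer; the symptom keys, the function name, and the escalation path for unknown symptoms are assumptions made for this sketch.

```python
# symptom -> (component to roll back, action); rows mirror the examples in the text.
ROLLBACK_MATRIX = {
    "stale_citation": ("rag_index", "restore the previous index version"),
    "wrong_refusal": ("prompt", "restore the previous prompt version"),
    "false_positive_spike": ("model_or_threshold",
                             "restore the previous model or threshold"),
    "tool_misuse": ("tool_permission",
                    "disable write tools and keep the read-only summary"),
}

# Follow-up steps the text attaches to every rollback.
FOLLOW_UP = [
    "impacted population query",
    "evidence preservation",
    "customer remediation",
    "regression case update",
]


def plan_rollback(symptom: str) -> dict:
    if symptom not in ROLLBACK_MATRIX:
        # Assumption: unknown symptoms are escalated rather than guessed at.
        return {"component": None,
                "action": "escalate to incident response",
                "follow_up": list(FOLLOW_UP)}
    component, action = ROLLBACK_MATRIX[symptom]
    return {"component": component, "action": action, "follow_up": list(FOLLOW_UP)}


print(plan_rollback("tool_misuse"))
```

Keeping the matrix as data rather than as branching code makes it easy to review with risk and operations teams and to rehearse in a release tabletop.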
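Q8: What value does a model registry have in governance?
| Version | Answer |
| --- | --- |
| 30 seconds | A model registry is not a file store. It is the control plane for model release. It records artifact, lineage, metrics, risk tier, approval status, deployment stage, rollback target, and review expiry. |
| 2 minutes | In financial retail, the model registry has to serve both engineering and governance. Engineering needs the artifact hash, signature, container, endpoint, and stage. Governance needs training data, feature view, eval result, owner, risk tier, approval, restrictions, monitoring dashboard, and the next validation date. That way model promotion is not just "copying the model to production" but an auditable stage transition. |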
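To show what "an auditable stage transition" could look like, the sketch below promotes a registry record one stage at a time and refuses to do so when governance fields are missing. The stage order, the required field list, and the record shape are assumptions for illustration; a real registry product will have its own schema and API.

```python
from datetime import date

# Assumed stage order; real registries may name or order stages differently.
NEXT_STAGE = {"candidate": "staging", "staging": "shadow",
              "shadow": "canary", "canary": "production"}
REQUIRED = ("artifact_hash", "risk_tier", "approval_status",
            "rollback_target", "review_expiry")


def promote(record: dict, approver: str, today: date) -> dict:
    """Return a copy of the record moved one stage forward, or raise ValueError."""
    missing = [f for f in REQUIRED if not record.get(f)]
    if missing:
        raise ValueError(f"missing registry fields: {missing}")
    if record["approval_status"] != "approved":
        raise ValueError("promotion requires an approved record")
    if record["review_expiry"] < today:
        raise ValueError("review has expired")
    stage = record["stage"]
    if stage not in NEXT_STAGE:
        raise ValueError(f"no promotion path from stage {stage!r}")
    # Append an audit entry instead of overwriting the stage silently.
    history = record.get("history", []) + [{
        "from": stage, "to": NEXT_STAGE[stage],
        "approver": approver, "date": today.isoformat(),
    }]
    return {**record, "stage": NEXT_STAGE[stage], "history": history}


# Placeholder record; every value is made up.
record = {
    "artifact_hash": "sha256:0f3a",
    "risk_tier": 1,
    "approval_status": "approved",
    "rollback_target": "fraud-model-3.3",
    "review_expiry": date(2025, 1, 1),
    "stage": "staging",
}
promoted = promote(record, approver="model-risk-lead", today=date(2024, 6, 1))
print(promoted["stage"], promoted["history"])
```

The original record is left unchanged and the returned copy carries the transition history, which is the piece an evidence binder would later reference.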
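Q9: How do you connect NIST AI RMF to an MLOps release?
| Version | Answer |
| --- | --- |
| 30 seconds | I align the release with Govern / Map / Measure / Manage. Govern defines accountability and evidence. Map defines the use case and its risks. Measure quantifies risk through eval and monitoring. Manage treats risk through gates, rollback, issue remediation, and risk acceptance. |
| 2 minutes | NIST AI RMF can serve as a shared language across functions. Map corresponds to use case intake, risk tier, data, and customer impact. Measure corresponds to offline eval, feature validation, red-team, and shadow/canary monitoring. Manage corresponds to release gate, rollback, fallback, incident response, and remediation. Govern runs throughout and covers owner, policy, approval, evidence binder, and review cadence. This makes MLOps not just a technical pipeline but the execution mechanism for AI risk management. |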
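Q10: How do you turn CD4ML into a portfolio?
| Version | Answer |
| --- | --- |
| 30 seconds | I pick one high-risk financial retail use case and show the release architecture, release bundle, gate model, eval report, shadow/canary plan, rollback runbook, and evidence binder, so the interviewer can see that I can take AI from experiment to controlled production. |
| 2 minutes | A portfolio should not only show model performance. I use Credit Policy RAG or AML Copilot as the main thread. I first explain the business process and risk tier, then present the CD4ML architecture: source control, data validation, feature store, training pipeline, model/prompt registry, eval gate, deployment, monitoring, rollback. Next I present one candidate release: baseline vs candidate, slice analysis, limited-go decision, canary plan, rollback trigger, and post-release review. Finally I organize all the evidence into a binder to demonstrate integrated product, architecture, risk, and governance capability. |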
11. Portfolio Package
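A senior-level CD4ML / MLOps release engineering portfolio is best built as 15-20 pages. Do not fill it with pipeline screenshots alone.
| Page | Content | Capability shown |
| --- | --- | --- |
| 1 | Executive summary: why financial retail AI needs release engineering | Executive communication |
| 2 | Source anchors: CD4ML, Google MLOps, TFX, NIST AI RMF | Learning anchors and methodology sources |
| 3 | Use case: Credit Policy RAG / AML Copilot / Fraud Model | Business understanding |
| 4 | Risk tier and approved use | Risk tiering and boundaries |
| 5 | CD4ML reference architecture | Architecture design |
| 6 | Release bundle: code/data/feature/model/prompt/eval/deployment | Version coupling |
| 7 | Data and feature validation | Data product and feature governance |
| 8 | Model / prompt / policy registry | Asset control |
| 9 | CI/CD/CT workflow | Engineering system |
| 10 | Eval gate: golden, regression, red-team, slice | Quality gating |
| 11 | Release gate memo | Decision communication |
| 12 | Shadow / canary / ramp plan | Rollout strategy |
| 13 | Rollback and incident trigger | Reliability |
| 14 | Monitoring dashboard | Production operations |
| 15 | Governance evidence binder | Audit and model risk evidence |
| 16 | Financial retail case examples | Industry-specific framing |
| 17 | Build vs buy ADR | Platform product judgment |
| 18 | Interview story | Job-search conversion |
11.1 Sample Portfolio Titles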
CD4ML Release Engineering Pack:
Controlled Production Rollout for Credit Policy RAG
MLOps Continuous Delivery Evidence Binder:
AML Copilot Model / Prompt / Data / Eval Release Governance
Risk-Tiered AI Release System:
Fraud Scoring Continuous Training, Canary and Rollback Design
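11.2 5-Minute Narrative Structure
1. I chose a high-risk financial retail AI use case.
2. I did not start from the model. I first defined approved use, prohibited use, risk tier, and customer impact.
3. I defined the AI release as a bundle of code, data, feature, model, prompt, eval, and deployment.
4. I designed CI/CD/CT, but CT only generates candidates and does not bypass the release gate.
5. I control go-live with feature validation, offline eval, security/privacy review, and a model risk gate.
6. I use shadow, canary, and ramp to reduce production risk.
7. I designed rollback separately for model, prompt, index, feature, tool, and route.
8. I organized lineage, approval, monitoring, incident, and post-release review into an evidence binder.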
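11.3 Self-Check Checklist
| Check | Passing standard |
| --- | --- |
| Architecture | There is an end-to-end CD4ML pipeline, not just a training script |
| Version coupling | The release bundle binds code, data, feature, model, prompt, policy, eval, and deployment |
| CI/CD/CT | The division between CI, CD, and CT is explained, and CT does not bypass the gate automatically |
| Feature validation | Covers schema, distribution, missingness, freshness, leakage, and serving skew |
| Eval gate | Covers golden, regression, red-team, slice, cost, latency, and critical failure |
| Promotion workflow | Includes staging, shadow, canary, ramp, and post-release review |
| Rollback | Can roll back by model, prompt, index, feature, tool, policy, and route |
| Registry | The model / dataset / feature / prompt / eval / deployment registries have minimum fields |
| Reproducibility | Training and evaluation can be reproduced; container, snapshot, parameters, and hash are retained |
| Lineage | A production trace can be traced back to the release_id and key component versions |
| Governance | There is a gate memo, approval, exception, monitoring, issue, and evidence binder |
| Financial retail | Cases reflect the risk differences among AML, KYC, fraud, credit, payment, and customer-facing AI |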
12. Final Principle
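The maturity of AI release engineering can be tested with one question:
If the model, prompt, data, features, RAG index, or tool permissions change tomorrow, can the team, within the same day, build a candidate version, reproduce the training provenance, complete the risk-tiered eval, make a gate decision, ramp under control, monitor in real time, roll back quickly, and produce complete governance evidence?
If the answer is yes, MLOps is no longer just model engineering. It is the production operating system for scaling financial retail AI.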