AI Experimentation Platform / Release Science Playbook

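These sources are used to calibrate the language around online controlled experiments, Microsoft ExP, CUPED, sequential testing, guardrail metrics, multiple testing, safe deployment, and AI risk governance. For a formal project, re-verify product status, statistical methods, regulatory requirements, and internal institutional policy as of the access date.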

AI Experimentation Platform & Release Science Playbook

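Audience: AI Platform PM / AI Product Architect / Experimentation PM / Decision Science / Risk Product / leaders of AI transformation in financial retail.
Core question: how to connect AI eval, online experiments, progressive release, risk gates, and causal decisioning into a reusable, auditable, rollback-capable Release Science platform capability.
Learning goals: design an AI experimentation platform; bridge offline evaluation of LLM / RAG / Agent systems to online controlled experiments; support scale / stop decisions with guardrails, CUPED, sequential testing, ramp / rollback, and risk-based release gates.
Portfolio positioning: this playbook can be turned into portfolio evidence for senior AI product architecture, including an Experimentation Platform Capability Map, Metric Tree, Guardrail Matrix, Ramp Plan, Stop Rule, Release Review Memo, Post-Experiment Decision Record, and a financial retail case pack.
Scope: this document is not basic BA requirements analysis, an introduction to statistics, legal advice, compliance advice, or a model validation report. A formal financial retail project must be jointly confirmed by the business owner, risk, model risk, legal, compliance, privacy, security, data owner, architecture review, and operations owner.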

Source Anchors

Anchor	Official / primary source	Usage in this playbook
Kohavi, Tang, Xu: Trustworthy Online Controlled Experiments	https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59	Terminology anchor for online controlled experiments, trustworthiness, organizational metrics, experimentation culture, leakage and interference, long-running experiments, and platform capability.
Microsoft Research ExP	https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/	Used to understand how a large-scale experimentation platform combines low-friction experiments, trustworthy analysis, scorecards, A/B infrastructure, and GenAI continuous improvement.
CUPED paper	https://robotics.stanford.edu/~ronnyk/2013-02CUPEDImprovingSensitivityOfControlledExperiments.pdf	Used for variance reduction, pre-experiment covariates, experiment sensitivity, triggered samples, and pre-triggering discipline.
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Govern / Map / Measure / Manage is used to organize AI risk identification, measurement, monitoring, treatment, and evidence.
Statsig sequential testing	https://docs.statsig.com/experiments/advanced-setup/sequential-testing	Used to explain fixed horizon, the peeking problem, sequential testing, and the statistical discipline of early decisions.
Eppo experiment protocols / guardrails	https://docs.geteppo.com/quick-starts/analysis-integration/defining-protocols/	Used for standardizing pre-registered metrics, analysis methods, decision criteria, and guardrail plans.
Optimizely false discovery rate control	https://support.optimizely.com/hc/en-us/articles/4410283967245-False-discovery-rate-control	Used for multiple testing, secondary / monitoring metrics, FDR, and the risk of slice exploration.
Microsoft Azure Safe Deployment Practices	https://learn.microsoft.com/en-us/azure/well-architected/operational-excellence/safe-deployments	Used for release science, progressive exposure, blast radius, ring deployment, and risk-based release governance.
Microsoft SRM article	https://www.microsoft.com/en-us/research/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/	Used to treat sample ratio mismatch as a hard gate on experiment trustworthiness.
Interleaved search evaluation	https://authors.library.caltech.edu/records/r3zrn-kd453	Used for online interleaving comparison of recommendation, search, RAG retrieval, and ranking models.
OpenAI Evals guide	https://developers.openai.com/api/docs/guides/evals	Used to describe the bridge from LLM eval task, run, and analysis to online experiments. As of 2026-06-29 the OpenAI documentation shows the Evals platform on a deprecation timeline, so this playbook uses it as eval design language and not as the sole platform dependency.
LaunchDarkly guarded rollouts	https://launchdarkly.com/docs/home/releases/creating-guarded-rollouts	Product-form reference for progressive rollout, metric regression detection, automatic rollback, randomization unit, and Agent config rollout.

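1. Senior-Level Positioning: Release Science Is an AI Production Decision System

The common failure of AI teams in financial retail is not that the model is unusable; it is that the release decision cannot be trusted:

Offline eval scores improve, but the real customer service workflow does not.
RAG answer quality improves, but complaints, wrong citations, escalation rate, or compliance risk get worse.
The recommendation model's click-through rate rises, but margin, customer long-term value, or fairness falls.
The KYC extraction automation rate rises, but missed fields cause downstream rework.
The fraud model blocks more fraud, but false declines of high-value customers and payment-failure complaints increase.
Agent tool calls pass testing, but long-tail intents in production trigger unauthorized, duplicate, or wrong actions.

In one sentence:

AI Experimentation Platform = the unified control plane for assignment, exposure, metrics, eval, release gate, ramp, rollback, evidence, and the learning loop. Release Science = using statistical evidence, risk thresholds, progressive exposure, and causal inference to decide when an AI version enters pilot, scale, freeze, rollback, or retire.

This is not "run a few more A/B tests." A senior AI PM / Product Architect has to answer:

Decision question	Wrong approach	Mature approach
Should the new model ship	Release because the offline score is high	Layered release through offline eval, shadow, canary, online controlled experiment, and guardrail gate
Can the experiment be interpreted	Read significance off the dashboard	First check SRM, assignment, exposure, triggering, metric lineage, multiple testing, and peeking
Is the risk acceptable	Look only at the primary metric	Use a guardrail matrix and stop rule to manage customer harm, compliance, cost, latency, fairness, and human escalation
When to increase traffic	"No incident, so go to 100%"	Ramp step by step by risk tier, minimum exposure, sequential boundary, operations readiness, and rollback capacity
How AI eval connects to business value	Treat the judge score as ROI	Use eval as a release gate; prove incremental effect with online outcomes and causal decisioning
Why the platform is worth building	Every team experiments on its own	Reduce decision noise with a unified protocol, metrics catalog, evidence binder, release review, and institutional memory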

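2. Why It Matters: AI Release Risk Differs from Ordinary Feature Risk

Traditional feature releases are mainly concerned with functional correctness, performance, and user experience. An AI release adds six kinds of instability:

AI release risk	How it shows up	Release Science control
Behavior drift	Behavior changes nonlinearly after a change to the model, prompt, RAG index, or tool schema	component lineage, offline regression suite, shadow trace comparison
Context drift	The knowledge base, policy, customer context, or product rules are refreshed	source freshness metric, retrieval eval, policy effective-date gate
Human-AI interaction	Users over-trust the AI, ignore it, copy its errors, or bypass the process	adoption-adjusted metrics, human override, quality audit
Long-tail harm	Low-probability, high-loss errors are hidden in the average score	critical guardrail, zero-tolerance failure class, risk-based stop rule
Feedback contamination	The new policy changes user behavior and the label distribution	holdout, delayed outcome tracking, counterfactual logging
Operational coupling	AI output affects queues, manual review, payments, KYC, CRM, and Agent tools	ramp capacity check, manual fallback, rollback rehearsal

Senior-level framing:

An AI release is not a deploy event; it is the controlled exposure of probabilistic behavior under an explicit risk appetite.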

3. Capability Map: AI Experimentation Platform Control Plane

3.1 Reference Architecture

AI change request
  -> component registry
  -> experiment protocol selection
  -> assignment and exposure service
  -> offline eval and shadow comparison
  -> feature flag / release gate
  -> online controlled experiment
  -> metric pipeline and semantic layer
  -> statistical analysis and decision engine
  -> risk-based release review
  -> ramp / rollback orchestration
  -> post-experiment decision record
  -> evidence binder and learning repository

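3.2 Core Platform Components

Component	Responsibility	Senior product question	Key evidence
Experiment Registry	Manages experiment id, hypothesis, owner, risk tier, protocol, state	Who is testing which intervention in which scenario	Experiment card, approval trail
Assignment Service	Stable random assignment, cluster assignment, holdout, eligibility	Is the randomization unit the customer, agent, case, merchant, branch, or household	Assignment log, salt, split config
Exposure Tracker	Records who actually saw or was affected by the AI	Assignment is not exposure: who actually received the treatment	Exposure event, trigger reason
Feature Flag / Release Gate	Controls traffic, segment, ramp, kill switch	Do experiments, releases, and config changes share one control plane	Flag config, targeting rule, rollback state
Metrics Catalog	Defines primary, secondary, guardrail, invariant, and diagnostic metrics	Are metric definitions consistent, directions explicit, and latency acceptable	Metric contract, lineage, owner
Eval Bridge	Connects offline eval, shadow eval, and online outcome	Can the eval score explain online behavior	Eval-to-online mapping, calibration report
Analysis Engine	SRM, CUPED, sequential testing, multiple testing, slice analysis	Is the analysis trustworthy, and does it support early stopping	Scorecard, variance plan, decision boundary
Release Review Workflow	go / limited go / no-go / rollback / exception	How to decide when a result is statistically significant but the risk is unacceptable	Release review memo, risk sign-off
Evidence Binder	Stores design, data, results, approvals, exceptions, and retrospectives	Can audit and model risk teams reproduce the judgment made at the time	Immutable evidence package
Learning Repository	Records experiment conclusions, failure modes, meta-analysis	Does the organization avoid repeating mistakes	Decision log, pattern library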
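
The Assignment Service row asks for stable random assignment with a logged salt and split config. A minimal sketch of one common way to get that, hashing the unit id with an experiment-specific salt, is shown here. The function names, the log fields, and the choice of SHA-256 are illustrative assumptions, not something the sources above prescribe.

```python
import hashlib

def assign_variant(unit_id: str, experiment_id: str, salt: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to a variant.

    The same (unit_id, experiment_id, salt) always yields the same variant,
    so a unit keeps its arm across sessions without any stored state.
    """
    key = f"{experiment_id}:{salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-looking value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def assignment_log_entry(unit_id: str, experiment_id: str, salt: str,
                         treatment_share: float, eligible: bool) -> dict:
    """Hypothetical assignment log record; ineligible units get no variant."""
    variant = (assign_variant(unit_id, experiment_id, salt, treatment_share)
               if eligible else None)
    return {
        "unit_id": unit_id,
        "experiment_id": experiment_id,
        "salt": salt,
        "treatment_share": treatment_share,
        "eligible": eligible,
        "variant": variant,
    }
```

Exposure is deliberately absent from this record: it would be logged as a separate event when the unit is actually affected by the AI, which is what keeps ITT and triggered analysis distinguishable later.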

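3.3 Choosing an Architecture Pattern

Pattern	When it fits	Advantage	Risk control
Central Experimentation Platform	Multiple business lines, teams, and metric systems	Unified assignment, metrics, analysis, evidence	Mandatory protocol, metrics catalog, SRM gate
Embedded Experiment SDK	Low-latency front-end or server-side assignment is needed	Tightly integrated with feature flag / config release	SDK version gate, fallback variation, telemetry validation
Warehouse-Native Analysis	The institution already has a mature data warehouse and governance	Less data copying; reuses the semantic layer	Review of data latency, PII permissions, metric lineage
Decisioning-Coupled Experiment	Risk control, recommendation, KYC, Agent tool routing	Assignment and business decision share one source, so the counterfactual can be logged	Decision log, policy version, outcome delay
Shadow / Replay Platform	Validation before a high-risk model or Agent tool goes live	Compares old and new policy with no customer exposure	Shadow does not prove a change in user behavior; online validation is still needed
Champion-Challenger Framework	Iterating fraud, KYC, credit, and routing policies	Stable baseline with long-running challenger comparison	champion lock, challenger cap, override monitoring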

4. Release Science Operating Model

4.1 Five-Stage Release Path

Design
  hypothesis, treatment, metric tree, guardrail, sample size, stop rule

Dry run
  A/A test, telemetry validation, SRM expectation, metrics lineage check

Shadow
  new AI runs without affecting customer or staff decision

Controlled exposure
  canary, A/B, interleaving, cluster, switchback, champion-challenger

Scale decision
  release review, ramp, rollback readiness, post-experiment decision

4.2 Decision Layers

Layer	Decision	Evidence	Typical conclusion
L0 Technical readiness	Can it run, be observed, and be rolled back	integration test, telemetry validation, fallback drill	Cleared to enter shadow
L1 Eval readiness	Does offline quality clear the minimum bar	golden set, red-team set, slice regression	Cleared for a small-traffic canary
L2 Experiment validity	Is the online experiment trustworthy	SRM, triggering, sample size, variance, multiple testing discipline	Cleared to interpret experiment results
L3 Risk acceptability	Is the risk within appetite	guardrail matrix, incident log, manual audit	go / limited go / rollback
L4 Business impact	Does it produce attributable incremental value	primary outcome, CUPED-adjusted estimate, causal review	scale / iterate / stop
L5 Portfolio learning	Is the platform capability being captured	pattern reuse, cost-benefit, future experiment backlog	platformize / retire / merge

5. Experiment Design Method Library

5.1 Online Controlled Experiments / A/B Testing

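Use this when the treatment directly affects users or business processes and stable random assignment is possible.

Design point	Senior judgment
Randomization unit	In financial retail the unit is not always the user. A customer service Copilot may randomize by agent or case; payment fraud by transaction, card, merchant, or account; KYC by application; recommendation by session or customer.
Assignment vs exposure	Assignment means being allocated to the treatment; exposure means actually being affected by the AI. AI products often have units that are assigned but not exposed, so both the ITT and the triggered analysis views must be kept.
Invariant metrics	Country, device, channel, case type, risk tier, and traffic allocation should be used to check the split and data quality. An SRM failure normally freezes interpretation until it is resolved.
Outcome delay	Fraud loss, KYC defect, complaint, and chargeback may lag by days to weeks. A short-term proxy cannot replace the final outcome.
Heterogeneity	An average lift can hide harm to a high-risk segment. Slice by customer segment, case complexity, agent tenure, risk tier, language, and channel.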
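
The SRM check named under invariant metrics can be sketched as a chi-square goodness-of-fit test on the observed arm counts. The version below covers only the two-arm case, where the chi-square statistic has one degree of freedom and its tail probability can be written with the standard library's `math.erfc`. The function names and the 0.001 alert threshold are assumptions for illustration; the threshold in a real protocol should be pre-registered.

```python
import math

def srm_p_value(control_n: int, treatment_n: int,
                expected_treatment_share: float = 0.5) -> float:
    """Two-arm sample ratio mismatch test (chi-square, 1 degree of freedom)."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

def srm_gate_passes(control_n: int, treatment_n: int,
                    expected_treatment_share: float = 0.5,
                    alpha: float = 0.001) -> bool:
    """Return False when the split deviates enough to freeze interpretation."""
    return srm_p_value(control_n, treatment_n, expected_treatment_share) >= alpha
```

A failed gate says nothing about which arm is right; it only says the assignment or logging pipeline needs investigation before any metric is read.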

5.2 CUPED / Variance Reduction

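The product value of CUPED is not that it is a "statistical trick" but that it shortens the learning cycle of costly AI experiments.

Dimension	Design rule
Precondition	The covariate must be measured before the treatment is triggered and must not be affected by the treatment.
Financial retail covariates	Historical AHT in customer service, agent historical QA, the customer's past complaint rate, merchant historical fraud rate, KYC applicant historical resubmission rate, the recommendation user's historical purchase propensity.
Poor fit	Too large a share of new users with no history, weak correlation between covariate and outcome, poor pre-period data quality, or a covariate affected by the release.
Product decision	State in the sample size plan the raw variance, expected correlation, post-CUPED variance, minimum detectable effect, and the change in experiment duration.
Risk reminder	CUPED improves sensitivity; it does not fix wrong assignment, wrong exposure, interference, metric contamination, or multiple testing problems.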
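
As a rough illustration of the adjustment itself, the sketch below applies the standard CUPED form: each outcome is shifted by theta times the centered pre-experiment covariate, with theta = cov(X, Y) / var(X). It assumes one covariate, and that theta is estimated on data pooled across both arms before the adjusted values are split back by arm. The function names are hypothetical.

```python
from statistics import fmean

def cuped_adjust(outcome: list[float], covariate: list[float]) -> list[float]:
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    `covariate` must be measured before treatment is triggered.
    Pass pooled treatment + control data so both arms share one theta.
    """
    mean_x, mean_y = fmean(covariate), fmean(outcome)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, outcome))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    theta = cov_xy / var_x if var_x > 0 else 0.0
    return [y - theta * (x - mean_x) for x, y in zip(covariate, outcome)]

def variance_reduction(outcome: list[float], covariate: list[float]) -> float:
    """Fraction of outcome variance removed by the adjustment (0 = none)."""
    def var(values: list[float]) -> float:
        m = fmean(values)
        return sum((v - m) ** 2 for v in values) / len(values)
    base = var(outcome)
    return 1.0 - var(cuped_adjust(outcome, covariate)) / base if base > 0 else 0.0
```

The adjustment leaves the mean unchanged and removes a share of variance roughly equal to the squared correlation between covariate and outcome, which is why a weakly correlated covariate buys little.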

5.3 Sequential Testing

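Sequential testing allows stopping early at pre-declared boundaries; it is not a license to look at the p-value every day and interpret it freely.

Scenario	Practice
High-risk release	Pre-define the interim looks; at each look decide escalate / pause / rollback only by the stop rule.
Costly experiment	Use a sequential boundary to reduce unnecessary exposure.
Long-horizon outcome	Apply sequential monitoring to early guardrails; keep a fixed window or stratified analysis for the final business outcome.
Organizational governance	The dashboard states clearly whether the method is fixed horizon or sequential; switching interpretation between unregistered methods is prohibited.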
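
To make "pre-declared boundaries" concrete, here is a deliberately simple sketch that splits the overall alpha evenly across a fixed number of planned looks (a Bonferroni correction). It is conservative and is not the method used by Statsig or any other vendor cited above; production platforms typically use alpha-spending or always-valid approaches. The function name and return labels are assumptions.

```python
from statistics import NormalDist

def interim_decision(z_stat: float, look_index: int, planned_looks: int,
                     alpha: float = 0.05) -> str:
    """Evaluate one pre-registered interim look on a two-sided z statistic.

    Each look spends alpha / planned_looks, so by the union bound the chance
    of any false boundary crossing across all looks is at most alpha.
    """
    if not 1 <= look_index <= planned_looks:
        raise ValueError("look_index must be within the pre-registered looks")
    per_look_alpha = alpha / planned_looks
    boundary = NormalDist().inv_cdf(1.0 - per_look_alpha / 2.0)
    if z_stat >= boundary:
        return "crossed_upper"   # candidate for early release review
    if z_stat <= -boundary:
        return "crossed_lower"   # candidate for pause / rollback
    if look_index < planned_looks:
        return "continue"
    return "final_no_crossing"
```

With five planned looks and alpha = 0.05, each look uses a boundary near |z| = 2.58 instead of 1.96, which is the price of being allowed to look early.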

5.4 Guardrail Metrics

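A guardrail is not a metric to "glance at along the way"; it is a release contract.

Type	Examples	Decision meaning
Customer harm	Complaint rate, false rejection rate, wrong commitment, monetary impact, supervisor escalation	Pause or rollback once the threshold is exceeded
Compliance / policy	Missed KYC fields, PII leakage, unauthorized advice, record retention failure	Critical classes can be set to zero tolerance
Operational	AHT, reopen rate, manual override, queue backlog, fallback rate	Prevents AI from shifting cost onto operations
Technical	latency, timeout, tool error, retrieval empty rate, schema failure	Controls system reliability and user experience
Financial	fraud loss, false positive cost, margin, chargeback, refund	Judges benefit and risk cost together
Fairness / segment	Differences across elderly customers, language, channel, region, risk tier	Prevents average gains from hiding harm to specific groups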

5.5 Multiple Testing

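AI experiments usually involve many metrics, segments, prompts, models, arms, and judge dimensions. Uncontrolled multiple comparisons produce "illusory significance."

Problem	Control
Multiple primary metrics	Require one primary metric or build an OEC composite; everything else is secondary / guardrail.
Multiple variants	Use pre-declared contrasts; apply FWER / FDR control or hierarchical testing to multi-arm comparisons.
Extensive segment exploration	Slices are for diagnosis; the scale decision uses only pre-registered segments or a follow-up validation experiment.
Multi-dimensional LLM eval	Make critical safety metrics a hard gate and compare quality dimensions hierarchically, so an average judge score cannot hide severe errors.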
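
One widely used FDR control is the Benjamini-Hochberg step-up procedure. The sketch below applies it to a set of secondary-metric p-values; the metric names in the usage note are made up, and nothing here is claimed to match how any cited vendor implements FDR control.

```python
def benjamini_hochberg(p_values: dict[str, float], q: float = 0.05) -> set[str]:
    """Return the metric names declared significant at FDR level q.

    Sort p-values ascending, find the largest rank k with p_(k) <= q * k / m,
    and reject the k smallest p-values.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= q * rank / m:
            cutoff_rank = rank
    return {name for name, _ in ranked[:cutoff_rank]}
```

For example, with hypothetical p-values `{"aht": 0.001, "reopen": 0.012, "csat": 0.025, "refinement": 0.045, "override": 0.2}`, a naive 0.05 cutoff would flag four metrics, while the procedure keeps only the first three because 0.045 exceeds its rank threshold of 0.04.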

5.6 Network Effects / Interference

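When one person's treatment affects another person's outcome, the usual independence assumption behind randomization fails.

Scenario	Interference path	Recommended design
Customer service Copilot	Agents under the same supervisor share prompts and workflow tips	agent team / supervisor group cluster randomization
Recommendation system	Exposure changes inventory, prices, merchant traffic, and later user choices	session holdout, market / merchant cluster, switchback
Payment fraud	The blocking policy changes attacker behavior and merchant routing	merchant / card / account cluster, time-window switchback
Agent tools	Tool execution changes case state and affects later handling by users or staff	case-level isolation, workflow state lock, tool action audit
RAG knowledge assistant	Team members copy answers into a shared knowledge base or macro templates	team-level randomization, shared artifact monitoring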

5.7 Interleaving

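Interleaving suits comparing two candidate ranking / retrieval / recommendation policies, especially when an ordinary A/B test would need a large amount of traffic.

Use	Financial retail example	Caveat
Retrieval interleaving	A RAG knowledge assistant compares the document ordering returned by two retrievers / rerankers	Compares ranking preference only; it does not directly prove that the risk of the final answer is acceptable.
Recommendation interleaving	Comparing next-best-action policies in a banking app or e-commerce recommendation policies	Control for position bias, inventory constraints, repeated exposure, and long-term value.
Search ranking interleaving	Internal policy search, customer service knowledge base search, product catalog search	A click is not correctness; combine with downstream task success and QA.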
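
A minimal sketch of one interleaving variant, team-draft interleaving, may help: the two rankers take turns contributing their best not-yet-shown document, a coin flip breaks ties on whose turn it is, and each click is credited to the ranker that contributed the clicked document. The function names are illustrative, and this is one of several interleaving schemes rather than the specific method of the cited source.

```python
import random

def team_draft_interleave(ranking_a: list[str], ranking_b: list[str],
                          length: int, rng: random.Random):
    """Build an interleaved list and record which ranker supplied each doc."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved: list[str] = []
    team_of: dict[str, str] = {}
    while len(interleaved) < length:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        doc = next((d for d in rankings[team] if d not in team_of), None)
        if doc is None:
            # This ranker is exhausted; let the other one fill the slot.
            team = "B" if team == "A" else "A"
            doc = next((d for d in rankings[team] if d not in team_of), None)
            if doc is None:
                break  # both rankings exhausted
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of

def credit_clicks(team_of: dict[str, str], clicked: list[str]) -> dict[str, int]:
    """Count clicks per contributing ranker; unknown docs are ignored."""
    credit = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team_of:
            credit[team_of[doc]] += 1
    return credit
```

As the table notes, the resulting click credit expresses a ranking preference only; it does not show that the downstream answer is correct or that its risk is acceptable.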

5.8 Shadow Launch

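Shadow launch is a key stage of a high-risk AI release, but it is not final evidence for launch.

Shadow can prove	Shadow cannot prove
The new model runs on production traffic	Whether users or staff will change their behavior
How outputs differ from the champion	Whether business metrics improve
latency, cost, timeout, schema error	Customer experience and real adoption
Offline human review results for high-risk cases	Treatment effect and long-term outcomes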

5.9 Ramp / Rollback

Ramp is controlled exposure, and rollback is a product capability, not an improvised operation after an incident.

Release tier	Typical ramp	Required controls
Low risk internal assistive	internal dogfood -> 10% staff -> 50% -> 100%	usage, latency, feedback, support channel
Medium risk employee copilot	shadow -> 5% agents -> 20% -> 50% -> all trained agents	QA audit, manual override, case reopen, supervisor review
High risk customer / financial impact	shadow -> 1% eligible low-risk segment -> 5% -> 10% -> release review	critical guardrail, manual fallback, risk sign-off, rollback drill
Agent actioning	read-only -> draft-only -> supervised action -> limited autonomous action	tool allowlist, approval threshold, kill switch, transaction idempotency

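6. Bridging LLM / RAG / Agent Eval to Online Experiments

6.1 Bridging Principles

Offline eval answers "does this version behave better on known samples." An online experiment answers "does this version produce net incremental value in the real workflow without causing unacceptable risk." A bridging layer is required between the two.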

offline eval
  component quality: retrieval, groundedness, tool call correctness
  -> shadow launch
     production trace comparison without user impact
  -> canary
     limited real exposure with hard guardrails
  -> controlled experiment
     causal estimate on workflow and business outcomes
  -> ramp decision
     risk-adjusted value and operational readiness

6.2 Eval-to-Online Mapping

AI type	Offline eval	Shadow signal	Online primary	Online guardrail
Customer Copilot	answer correctness, policy compliance, tone, citation	draft acceptance difference, QA reviewer preference	resolved cases per hour, quality-adjusted AHT	complaint, reopen, wrong commitment, escalation
RAG knowledge assistant	retrieval recall, groundedness, citation validity	old vs new answer divergence, missing source rate	task success, agent accepted answer, search deflection	stale source, PII exposure, policy mismatch, latency
Recommendation	offline precision / recall, business rule pass	ranking diff, inventory / merchant exposure shift	conversion, margin, engagement quality	churn, complaint, fairness, inventory starvation
KYC extraction	field-level F1, document type robustness	human reviewer diff, critical field miss	straight-through processing with audit pass	false accept, false reject, manual rework, policy breach
Payment fraud	AUC / PR, loss simulation, reason code stability	champion vs challenger alert diff	fraud loss avoided net of false positive cost	false decline, VIP impact, chargeback delay, manual queue
Agent tool rollout	tool selection accuracy, argument schema pass, policy refusal	tool-call diff, idempotency check	successful supervised action rate	unauthorized action, duplicate action, rollback failure

6.3 Bridge Anti-Patterns

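Anti-pattern	Why it is dangerous	Alternative
Going to full traffic directly on the LLM judge score	Judge bias and insufficient offline set coverage hide harm in the real workflow	The judge score feeds the release gate, followed by shadow + controlled exposure
Looking only at adoption	Staff may adopt wrong suggestions, or use the tool passively under management pressure	Read adoption together with quality, outcome, and override
Never refreshing the offline golden set	New production failures never enter the regression set	Feed incidents, human overrides, complaints, and shadow diffs back into the eval dataset
Treating a prompt change as low-risk config	A prompt can change policy boundaries, tool calls, and customer commitments	Put the prompt version in the component registry and release gate
Opening Agent tools all at once	Actioning risk is far higher than answer drafting	read -> draft -> supervised action -> limited autonomous action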

7. Causal Decisioning: From Significance to Decision Quality

7.1 Decision Formula

release decision =
  treatment effect estimate
  + uncertainty
  + guardrail status
  + risk severity
  + operational readiness
  + reversibility
  + strategic value

Statistical significance is not a release permit. A high-risk AI version may significantly improve the primary metric and still be rolled back because a guardrail failed.
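
The point that guardrails can override a significant primary result can be sketched as a small decision function. Everything in it is a hypothetical simplification: the field names, the ordering of checks, and the mapping to outcomes are assumptions chosen for illustration, and a real release review would weigh risk severity, reversibility, and strategic value with human judgment rather than a fixed rule.

```python
from dataclasses import dataclass

@dataclass
class ReleaseEvidence:
    effect_ci_low: float               # lower bound of the treatment effect interval
    effect_ci_high: float              # upper bound of the treatment effect interval
    srm_passed: bool                   # experiment validity check
    critical_guardrail_breached: bool  # e.g. a zero-tolerance failure class
    noncritical_guardrail_breaches: int
    rollback_verified: bool            # operational readiness / reversibility

def release_decision(evidence: ReleaseEvidence) -> str:
    """Illustrative ordering: validity, then critical risk, then readiness, then value."""
    if not evidence.srm_passed:
        return "invalid_readout"
    if evidence.critical_guardrail_breached:
        return "rollback"
    if not evidence.rollback_verified:
        return "no_go"
    if evidence.effect_ci_low > 0:
        # Significant positive effect: scale only if no guardrail is breached.
        return "scale_go" if evidence.noncritical_guardrail_breaches == 0 else "limited_go"
    if evidence.effect_ci_high < 0:
        return "no_go"
    return "iterate"
```

The effect interval is consulted last on purpose: in this sketch a significant lift never reaches the scale branch if validity, a critical guardrail, or rollback readiness has already failed.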

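7.2 Causal Design Discipline

Discipline	Product action
Treatment definition	State whether the AI changes information, ranking, drafts, suggestions, automated actions, thresholds, or permissions.
Unit of analysis	Distinguish customer, case, session, agent, merchant, transaction, branch, household.
Assignment logging	Log config, salt, version, eligibility, and reason for every assignment.
Exposure logging	Record whether the treatment was actually seen, adopted, overridden, or executed.
Counterfactual preservation	Fraud, recommendation, and agent action need to retain the champion decision or the alternative ranking.
Triggered analysis	Analyze the samples that actually triggered the AI, and keep ITT alongside to guard against selection bias.
Delayed outcome	Use a mature outcome window for chargeback, complaint, and KYC defect.
Heterogeneous treatment effect	Use pre-registered segments to analyze who benefits and who is harmed.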

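7.3 Alternatives When Randomization Is Not Possible

Scenario	Alternative design	Risk reminder
Regulation or ethics does not permit random withholding	phased rollout, stepped wedge, matched controls	Rollout order may correlate with how capable the units are
Supply or inventory constraints	switchback, geo / branch cluster	Time and location shocks must be controlled
Low-frequency, high-loss events	shadow + simulation + human review + limited canary	Do not present simulation results as online causal effects
Organization-level tools	team-level cluster randomization	Too few clusters reduces power
Single policy release date	pre/post with synthetic control or interrupted time series	External events and seasonality must be modeled explicitly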

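8. Financial Retail Case Pack

8.1 Customer Service Copilot Rollout

Design item	Decision
Treatment	The Copilot provides policy retrieval, reply drafts, and next-step suggestions; the agent sends the final reply.
Randomization unit	agent-team cluster, to avoid contamination from knowledge spreading within a supervisor group.
Offline gate	Policy correctness, citation validity, prohibited commitments, escalation of sensitive scenarios.
Shadow	Generate drafts for real cases; QA compares the champion macro templates with the Copilot drafts.
Primary metric	quality-adjusted resolved cases per hour.
Guardrail	Complaint rate, reopen rate, wrong commitment, supervisor escalation, customer wait time, agent override.
Ramp	5% trained agents -> 20% -> 50% -> all eligible teams; each stage moves on only after QA sampling is complete.
Stop rule	Any critical policy breach, or a complaint rate above the control upper limit, means pause; if AHT improves but QA declines for two consecutive observation windows, limited go.
Decision output	If low-complexity cases improve significantly and the guardrails on high-complexity cases are near the boundary, scale low-complexity cases and keep high-complexity cases human-first.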

8.2 RAG Knowledge Assistant Model Upgrade

Design item	Decision
Treatment	Upgrade of the embedding model, reranker, answer model, or citation policy.
Randomization unit	agent-session or case; teams with heavy sharing use team-level randomization.
Offline gate	retrieval recall@k, groundedness, citation freshness, policy effective-date match.
Shadow	Run champion and challenger on the same query and compare source, answer, latency, and refusals.
Online experiment	Use A/B for low-risk internal search; for retrieval comparisons with large ranking differences, interleaving can come first.
Primary metric	task success rate, or accepted answer with no downstream correction.
Guardrail	stale policy citation, PII leakage, hallucinated source, latency p95, empty retrieval rate.
Stop rule	A critical source hallucination triggers rollback; stop if latency p95 exceeds the SLO with no significant quality gain.
Decision output	If the challenger improves hard queries but degrades common ones, use a query intent router rather than a full replacement.

8.3 Recommendation Policy Launch

Design item	Decision
Treatment	next-best-offer / product ranking / content recommendation strategy.
Randomization unit	customer or session; use cluster or switchback when inventory, merchant traffic, or social diffusion is involved.
Offline gate	policy eligibility, fairness, margin rules, suppression list, customer suitability.
Interleaving	Run interleaving first on search / ranking candidates to judge preference quickly.
Online primary	risk-adjusted conversion, margin, qualified engagement.
Guardrail	opt-out, complaint, unsuitable offer, minority segment lift gap, inventory starvation, long-term churn.
Multiple testing	One primary metric, with secondary business metrics tiered; segment discovery is diagnostic only.
Stop rule	Any suitability guardrail failure means immediate rollback; do not scale if short-term clicks rise while margin or complaints worsen.
Decision output	Do a targeted rollout for high-uplift, low-risk segments; fall back to the champion for negative-uplift segments.

8.4 KYC Extraction Model Release

Design item	Decision
Treatment	Upgrade of the OCR + document understanding model, which automatically extracts identity, address, company registration, and beneficial owner fields.
Randomization unit	application or document package.
Shadow	The challenger only generates fields; the human reviewer does not see them, or compares them only in an audit view.
Primary metric	audit-pass straight-through processing rate.
Guardrail	false accept, critical field miss, manual rework, document type slice regression, policy breach.
Ramp	shadow -> dual-read low-risk document types -> supervised auto-fill -> limited auto-approve for low-risk cases.
Stop rule	Freeze the slice if critical field miss exceeds control or a high-risk document type regresses.
Decision output	No global go/no-go; release in tiers by document type, country, risk tier, and image quality.

8.5 Payment Fraud Rule / Model Champion-Challenger

Design item	Decision
Treatment	A fraud score model or rule combination challenges the champion.
Randomization unit	transaction, card, merchant, or account, chosen according to interference and attacker adaptation paths.
Shadow	The challenger computes score and action without affecting authorization; the champion action is retained.
Primary metric	net fraud loss avoided after false positive cost.
Guardrail	false decline, VIP impact, merchant complaint, manual review backlog, chargeback lag.
Causal issue	True fraud outcomes are delayed and attacker behavior adapts to the policy, so a mature window and holdout are needed.
Ramp	low-risk segment score-only -> manual review routing -> limited step-up -> limited decline.
Stop rule	Downgrade immediately if false decline exceeds risk appetite or the manual queue exceeds capacity.
Decision output	If the challenger improves detection but increases false declines, use threshold policy, step-up authentication, or a segment-specific champion-challenger instead of a direct full rollout.

8.6 Agent Tool Rollout

Design item	Decision
Treatment	The Agent can call tools such as CRM update, refund quote, case routing, payment investigation, and KYC checklist.
Randomization unit	tool intent, case type, or agent cohort.
Offline gate	tool selection accuracy, argument validation, policy refusal, idempotency, permission check.
Shadow	The Agent generates the intended tool call; the system compares it with the human action and the policy engine.
Progressive autonomy	read-only -> draft tool call -> human approval -> limited autonomous execution.
Primary metric	successful supervised action rate, or reduced manual handling with no quality regression.
Guardrail	unauthorized action, duplicate action, wrong account, missing approval, compensation failure, kill switch failure.
Stop rule	Any unauthorized financial action or duplicate execution triggers immediate rollback.
Decision output	Tier tool capability by action risk: low-risk read tools can expand; tools involving money or customer commitments stay under approval.

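9. Product Decisions and Governance

9.1 Platform Product Decisions

Decision	Recommended principle
Platform first or business experiment first	Platformize when two or more teams repeatedly need assignment, metric catalog, release gate, and evidence binder; for a single low-risk use case, start with a lightweight protocol.
Whether to merge feature flags and experimentation	Release control and experiment assignment must share the exposure log, but analysis permissions, approvals, and metric governance can be separate.
Who owns the metrics catalog	The platform maintains the technical specification; business / risk / finance own metric definitions and decision direction.
Whether guardrails can be overridden	Critical safety / compliance guardrails cannot be overridden unilaterally by the business; non-critical guardrails can use a documented exception and compensating control.
Relationship between the AI eval platform and the A/B platform	EvalOps owns component behavior, Experimentation owns causal effect in the real workflow, and Release Science chains the two as gates.
When to introduce sequential testing	For costly or high-risk rollouts, or rollouts that need continuous monitoring; must come with a pre-registered stop rule and training.
When to use interleaving	When ranking / retrieval candidates differ substantially and the preference signal is observable within one user session; it does not replace the final risk assessment.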

9.2 RACI

Activity	AI PM	Experimentation PM	Data Science	Risk Product	Platform Engineering	Model Risk / Compliance	Operations
Hypothesis and treatment	A	C	C	C	C	C	C
Metric tree	A	A	C	C	C	C	C
Assignment design	C	A	A	C	R	C	C
Offline eval gate	A	C	R	C	R	C	C
Guardrail matrix	A	A	C	A	C	A	C
Ramp / rollback	A	C	C	A	R	C	A
Release review	A	A	C	A	C	A	A
Post-experiment decision	A	A	R	C	C	C	C

R = responsible, A = accountable, C = consulted.

9.3 Release Gate State Machine

Draft
  -> protocol approved
  -> telemetry validated
  -> shadow running
  -> canary running
  -> experiment running
  -> analysis locked
  -> release review
  -> limited go / scale go / no-go / rollback / retire
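
The state chain above can be sketched as an explicit transition table, so that skipping a gate is rejected rather than merely discouraged. The state identifiers are hypothetical snake_case versions of the names in the diagram. The extra edges from the two live-exposure states straight to rollback are an assumption added here to reflect the kill-switch idea; they are not drawn in the diagram.

```python
TRANSITIONS: dict[str, set[str]] = {
    "draft": {"protocol_approved"},
    "protocol_approved": {"telemetry_validated"},
    "telemetry_validated": {"shadow_running"},
    "shadow_running": {"canary_running"},
    "canary_running": {"experiment_running"},
    "experiment_running": {"analysis_locked"},
    "analysis_locked": {"release_review"},
    "release_review": {"limited_go", "scale_go", "no_go", "rollback", "retire"},
}

# Assumption: any state with live exposure may jump straight to rollback.
for live_state in ("canary_running", "experiment_running"):
    TRANSITIONS[live_state] = TRANSITIONS[live_state] | {"rollback"}

def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the gate sequence would be skipped."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target
```

Terminal outcomes such as `scale_go` have no outgoing edges in this sketch; a follow-up change would start again from `draft` as a new experiment.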

9.4 Risk-Based Release Gate

Risk tier	AI capability	Minimum gate
Tier 1 Internal productivity	Internal summarization, search, drafts not visible to customers	telemetry validation, offline eval, usage monitoring, manual feedback
Tier 2 Employee assistive	Customer service, operations, KYC reviewer copilot	offline eval, shadow, QA audit, A/B, guardrail, rollback
Tier 3 Customer visible	Customer-visible replies, recommendations, quotes, explanations	legal/compliance review, controlled exposure, critical guardrail, complaint monitoring
Tier 4 Financial / eligibility impact	Fraud blocking, KYC approval, credit, payments, money movement	model risk review, champion-challenger, mature outcome window, manual fallback, board-level risk reporting as applicable
Tier 5 Agent autonomous action	Automatically executing actions on customer accounts or funds	tool-level permissions, human approval threshold, kill switch, transaction reconciliation, incident drill

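10. Ready-to-Use Deliverable Templates

The templates below are presented as "field + how to write it + example" and can be reused directly as project document structures. All examples use the "RAG knowledge assistant model upgrade" scenario, so that together they form a complete set of portfolio evidence.

10.1 Experiment Design Doc

Field	How to write it	Example
Experiment name	Business scenario + AI change + release stage	RAG Knowledge Assistant Reranker Upgrade Canary
Decision owner	The person accountable for scale / stop	AI Platform PM and Customer Operations Director, jointly
Hypothesis	How the treatment changes behavior and outcome	The new reranker improves citation accuracy of policy answers, so agents finish complex cases faster without more wrong citations
Treatment	The exact system behavior being changed	Use reranker v3 for eligible policy-search queries; answer model and UI unchanged
Control	The current champion	retriever v2 + reranker v1
Population	Eligible traffic and exclusions	English-language customer service policy queries, excluding complaint escalation, legal dispute, and monetary compensation cases
Randomization unit	Assignment unit and rationale	agent-session, because multiple queries in one session share context
Exposure event	When a unit counts as treated	The agent sees an answer card backed by reranker v3
Primary metric	One main success criterion	QA-verified accepted answer rate
Guardrails	Risk thresholds and actions	hallucinated citation is critical; p95 latency, empty retrieval, and policy mismatch are release guardrails
Analysis method	Rules for fixed window, sequential, CUPED, multiple testing	7-day fixed-window primary readout; sequential monitoring for latency and critical citation issues; CUPED using the agent's historical accepted answer rate
Sample / duration	Sample and duration plan	At least 18,000 exposed sessions, covering two work weeks and low-traffic weekends
Stop rule	Pre-registered stopping conditions	Pause on 2 confirmed hallucinated sources, or p95 latency more than 25% over SLO for 4 consecutive hours
Ramp plan	Traffic stages	shadow 3 days -> 5% -> 20% -> 50% -> release review
Evidence	Evidence package needed for the decision	design doc, eval report, shadow diff, SRM check, scorecard, QA sample, release memo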

10.2 Metric Tree

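Level	Metric	Definition	Decision use
North Star	Quality-adjusted knowledge task success	The agent accepts the answer and QA sampling finds no policy error	scale decision
Primary	QA-verified accepted answer rate	Share of accepted answers that pass QA	Main experiment readout
Secondary	time to answer	Time from query to answer accepted	Productivity gain
Secondary	search refinement rate	Share of cases with a second, rewritten query in the same case	retrieval friction
Guardrail	hallucinated citation rate	The citation does not exist, is not permitted, or does not support the conclusion	hard release gate
Guardrail	stale policy source rate	Cites an expired policy or a superseded version	compliance gate
Guardrail	p95 latency	p95 time until the answer card is ready	UX / ops gate
Invariant	traffic split by language / queue / agent tenure	Consistency of the treatment and control distributions	Experiment trustworthiness
Diagnostic	retriever empty rate	Share of queries returning no candidate documents	Root cause analysis
Diagnostic	judge-human disagreement	Share of cases where the LLM judge and QA reviewer disagree	eval calibration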

10.3 Guardrail Matrix

Guardrail	Severity	Threshold	Detection	Action	Owner
Hallucinated citation	Critical	confirmed count >= 2 in canary	QA review + citation validator	pause treatment, run RCA, keep control	AI PM + Compliance
Stale policy source	High	treatment worse than control by 0.2pp and statistically credible	source effective-date check	freeze ramp, patch source filter	Knowledge owner
p95 latency	Medium	25% above control for 4 consecutive hours	live telemetry	hold ramp, route high-latency traffic to control	Platform engineering
Empty retrieval	Medium	10% relative increase vs control	retrieval logs	rollback reranker for affected query class	Search platform
Agent override spike	Medium	15% relative increase in override on complex cases	UI event	segment review before scale	Operations
Complaint linkage	High	any confirmed customer complaint tied to wrong answer	complaint triage	limited go only after review	Customer risk

10.4 Ramp Plan

Stage	Traffic	Entry criteria	Monitoring window	Exit criteria
Shadow	0% customer / staff impact	offline eval passes all critical gates	3 business days	no critical diff, latency within SLO, QA sample accepted
Canary 5%	eligible low-risk sessions	telemetry validated, shadow review approved	24 hours	no critical guardrail, SRM pass, p95 stable
Canary 20%	eligible sessions excluding high-risk queues	5% stage release review complete	48 hours	primary trend non-negative, guardrails within appetite
Experiment 50%	full eligible population	sample size plan supports readout	7-14 days	analysis locked and release memo ready
Scale	100% eligible or targeted segment	release review approves scale	first 14 days after scale	daily guardrail review, rollback switch active

10.5 Stop Rule

Rule class	Concrete rule	Decision
Critical safety	confirmed hallucinated legal / policy citation count reaches 2 during canary	immediate pause, route to control, executive notification
Compliance	any PII exposure to unauthorized agent group	rollback, incident process, access control RCA
Operational	manual queue backlog increases 20% for 2 consecutive business days	freeze ramp, reduce treatment to previous stage
Statistical validity	SRM test fails on primary assignment unit	invalidate current readout, investigate assignment / logging before analysis
Business harm	primary metric negative and guardrail negative after minimum exposure	no-go, preserve learnings, open redesign decision
Sequential success	primary crosses pre-registered boundary and all critical guardrails pass	early release review, not automatic scale

10.6 Sample Size / Variance Plan

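Field	How to write it	Example
Primary metric baseline	Use the most recent stable window, excluding anomalous events	accepted answer rate baseline 62.0% from last 28 days
Minimum detectable effect	Smallest increment that matters to the business	+1.5pp, because a smaller gain would not cover model cost and operational review
Unit of analysis	Same as the randomization unit, or explain the aggregation	agent-session
Variance source	Call out the high-variance sources	agent tenure, queue, case complexity, weekday seasonality
CUPED covariate	Pre-treatment and correlated with the outcome	agent previous 28-day accepted answer rate and queue complexity mix
Expected variance reduction	Estimated from historical replay	Estimated 20-30% variance reduction; not used to change the guardrail critical threshold
Duration	Cover the business cycle	At least two work weeks, covering low-traffic weekends and month-end policy updates
Power caveat	State the risks that cannot be detected	critical safety is gated by incident review, not by significance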
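
The example values in this plan (62.0% baseline, +1.5pp minimum detectable effect, 20-30% variance reduction) can be plugged into a textbook two-proportion sample size formula to see how the pieces interact. The sketch assumes a two-sided test at alpha = 0.05 with 80% power, independent sessions, and an equal split; none of these are stated in the plan. Because sessions from the same agent are correlated, the result is a simplified illustration and is not meant to reproduce the session count given in the design doc example in 10.1.

```python
import math
from statistics import NormalDist

def sessions_per_arm(baseline: float, mde_abs: float, alpha: float = 0.05,
                     power: float = 0.80, variance_reduction: float = 0.0) -> int:
    """Approximate sessions needed per arm to detect an absolute lift.

    n = (z_{1-alpha/2} + z_{power})^2 * (p0(1-p0) + p1(1-p1)) / mde^2,
    then scaled by (1 - variance_reduction) to reflect a CUPED-style gain.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    treated = baseline + mde_abs
    variance_sum = baseline * (1.0 - baseline) + treated * (1.0 - treated)
    n_raw = (z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2
    return math.ceil(n_raw * (1.0 - variance_reduction))

unadjusted = sessions_per_arm(0.62, 0.015)
with_cuped = sessions_per_arm(0.62, 0.015, variance_reduction=0.25)
```

Under these assumptions the required sample scales linearly with the remaining variance, so a 25% variance reduction cuts the per-arm requirement by about a quarter, which is the "change in experiment duration" the plan asks to be written down.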

10.7 Release Review Memo

Section	Content	Example wording
Decision requested	Request go, limited go, no-go, rollback, or scale	Request to expand reranker v3 from the 50% experiment to 100% of low-risk English policy queries
Evidence summary	eval, shadow, online, guardrail	Offline groundedness +3.8pp, shadow shows no critical diff, online primary +1.7pp, guardrails pass
Validity checks	SRM, exposure, metric lineage, multiple testing	SRM pass; 94% assigned sessions exposed; primary pre-registered; segment analysis used for diagnosis
Risk position	Residual risk and compensating controls	High-complexity complaint queues stay on control because citation risk is near the yellow threshold
Operational readiness	training, support, rollback, on-call	rollback flag verified, support runbook active, QA sampling doubled for 14 days
Decision	A clear conclusion	limited go: scale low-risk queues, keep high-risk queues on champion, review after 14 days
Accountability	Owner and review date	AI PM owns ramp, Knowledge owner owns source freshness, next review on 2026-07-13

10.8 Post-Experiment Decision Template

Section	Decision record
Experiment conclusion	Reranker v3 improves QA-verified accepted answer rate in low-risk English policy queries without guardrail breach.
Decision	Scale low-risk English queues; keep complaint escalation and legal-sensitive queries on champion.
Causal confidence	Randomization and SRM checks passed; exposure rate high; CUPED-adjusted estimate consistent with unadjusted direction.
Where it works	Short policy lookup, billing FAQ, account maintenance cases, experienced agents.
Where it fails	Legal-sensitive complaints, outdated policy sources, cases requiring cross-product reasoning.
Product action	Add intent router, strengthen source effective-date filter, create separate experiment for high-risk queues.
Governance action	Update release gate: any RAG answer touching legal-sensitive policy requires citation freshness hard check.
Reusable learning	Offline groundedness predicted online QA direction, but latency guardrail was the best early indicator of agent rejection.

11. 30 天训练计划

Day	训练主题	产出
1	读 Kohavi OCE 目录和 experimentation trustworthiness 主题	1 页实验可信度原则卡
2	拆解 Microsoft ExP 平台能力	Experimentation platform capability map
3	设计 AI release 状态机	Draft -> shadow -> canary -> experiment -> scale 状态图
4	为客服 Copilot 写 experiment design doc	完整 design doc v1
5	为客服 Copilot 建 metric tree	primary、secondary、guardrail、invariant、diagnostic 指标树
6	SRM、assignment、exposure 日志设计	telemetry event spec
7	A/A test 和 instrumentation validation	A/A readiness checklist
8	CUPED 训练	sample size / variance plan
9	Sequential testing 训练	stop rule 和 interim look plan
10	Multiple testing 训练	metric hierarchy 和 FDR / segment policy
11	Guardrail matrix 训练	customer harm、compliance、ops、technical、financial guardrails
12	Shadow launch 训练	champion-challenger shadow comparison memo
13	Ramp / rollback 训练	5-stage ramp plan 和 kill switch checklist
14	Release review 演练	release review memo
15	RAG eval bridge	offline-to-online mapping table
16	RAG interleaving	retrieval / reranker interleaving design
17	Agent tool rollout	tool risk tiering and progressive autonomy plan
18	KYC extraction release	document-type segmented release plan
19	Payment fraud champion-challenger	fraud model release science brief
20	Recommendation strategy	interleaving + A/B + long-term holdout design
21	Network effects	cluster randomization / switchback decision memo
22	Causal decisioning	treatment, exposure, outcome, counterfactual map
23	Delayed outcomes	mature outcome window and proxy metric policy
24	Risk-based release gate	tiered gate matrix for six financial retail use cases
25	Evidence binder	design, eval, analysis, approval, incident evidence index
26	Post-experiment decision	scale / stop / segment / redesign decision record
27	Portfolio learning	experiment learning repository taxonomy
28	Executive communication	1-page executive release memo
29	Interview rehearsal	30-second and 2-minute answers to 6 senior-level interview questions
30	Portfolio integration	AI Experimentation Platform case portfolio pack

12. Interview Question Preparation

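Q1: How would you design an AI experimentation platform?

30-second version: I would not design it as a plain A/B dashboard but as a release science control plane: assignment, exposure, metrics catalog, eval bridge, analysis engine, feature gate, ramp / rollback, release review, and evidence binder. What is special about AI is that offline eval can only demonstrate component behavior, while an online controlled experiment is what demonstrates the incremental effect on the real workflow, so the platform must connect the two.

2-minute version: I would first define the core control objects: experiment, AI component, randomization unit, exposure, metric contract, guardrail, release gate, and decision record. Architecturally, an assignment service guarantees stable traffic splitting, an exposure tracker records which units were actually affected by the AI, a metrics catalog manages primary / guardrail / invariant metrics, an analysis engine runs SRM checks, CUPED, sequential testing, and multiple-testing correction, and a feature gate handles ramp and rollback. For LLM / RAG / Agent, there also needs to be an EvalOps bridge that links the golden set, shadow traces, online outcomes, and post-release monitoring. On governance, every high-risk release must have a design doc, sample plan, stop rule, release memo, and post-experiment decision.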
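
As a minimal sketch of two of the components named in this answer, the Python below shows a hash-based stable assignment function and an SRM check. The function names `assign_arm` and `srm_p_value`, the SHA-256 bucketing scheme, and the 50/50 default split are illustrative assumptions, not the design of any specific platform.

```python
import hashlib
import math


def assign_arm(unit_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to an arm (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def srm_p_value(n_control: int, n_treatment: int, treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treatment
    expected_t = total * treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))
```

The same unit always lands in the same arm for a given experiment, which is what "stable" means here. A very small SRM p-value (the cutoff is a policy choice) would normally block metric analysis until the assignment or logging problem is explained.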

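Q2: If the offline eval score improves, why is an online experiment still needed?

30-second version: Offline eval shows that the new version performs better on known samples, but it cannot show that employees will adopt it, that customer experience will improve, or that process cost will fall, and it cannot fully expose distribution shift or human-AI collaboration risk. An online experiment estimates the incremental effect using real exposure and a counterfactual.

2-minute version: For LLM / RAG / Agent, offline eval is a release gate, not ROI evidence. For example, RAG groundedness may improve while higher latency online leads agents not to adopt it; a customer service draft may be more complete yet introduce more incorrect commitments; a fraud model may have higher AUC but a higher false decline cost. So I would run offline regression and shadow first, then small-traffic controlled exposure, and judge on the primary business outcome together with the guardrails. The conclusion is not necessarily full rollout either; it may be segment-specific rollout, router, limited go, or no-go.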

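Q3: How is CUPED used in AI experiments?

30-second version: CUPED uses pre-treatment covariates to reduce variance and increase experiment sensitivity. In AI scenarios it can use pre-experiment data such as an agent's historical QA, a customer's historical purchase propensity, or a merchant's historical fraud rate, but it must not use any variable that the treatment can affect.

2-minute version: In the sample size plan I would first fix the primary metric baseline and the minimum meaningful effect, then use historical data to estimate the correlation between the covariate and the outcome. The customer service Copilot can use each agent's AHT and QA over the past 28 days; RAG can use the agent's historical accepted answer rate; payment fraud can use the merchant's historical risk level. CUPED can shorten the experiment, but it cannot fix faulty randomization, SRM, interference, or multiple comparisons. High-risk guardrails are still managed against their critical thresholds and are not relaxed because of the CUPED adjustment.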
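
The adjustment itself is small enough to sketch. The Python below applies the single-covariate CUPED form, y_adj = y - theta * (x - mean(x)) with theta = cov(y, x) / var(x), estimated on the pooled data of both arms. With that theta the adjusted variance shrinks by roughly a factor of 1 - rho^2, where rho is the covariate-outcome correlation. The function names and the 0/1 arm encoding are illustrative assumptions.

```python
import random
import statistics


def cuped_adjust(outcome, covariate):
    """Return CUPED-adjusted outcomes using one pre-experiment covariate."""
    mean_x = statistics.fmean(covariate)
    mean_y = statistics.fmean(outcome)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, outcome)) / (len(outcome) - 1)
    theta = cov_xy / statistics.variance(covariate)  # pooled across both arms
    return [y - theta * (x - mean_x) for x, y in zip(covariate, outcome)]


def diff_in_means(values, arm):
    """Treatment (arm == 1) mean minus control (arm == 0) mean."""
    treated = [v for v, a in zip(values, arm) if a == 1]
    control = [v for v, a in zip(values, arm) if a == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def simulate_aht(n_agents=2000, true_effect=-10.0, seed=7):
    """Synthetic data only: pre-period AHT predicts in-experiment AHT."""
    rng = random.Random(seed)
    pre = [rng.gauss(300, 40) for _ in range(n_agents)]
    arm = [i % 2 for i in range(n_agents)]
    post = [0.8 * x + rng.gauss(60, 20) + true_effect * a for x, a in zip(pre, arm)]
    return pre, arm, post
```

Usage: `pre, arm, post = simulate_aht()` followed by `diff_in_means(cuped_adjust(post, pre), arm)` gives the adjusted effect estimate. The covariate must be measured before assignment, as the answer above requires.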

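Q4: What is the difference between sequential testing and looking at the results every day?

30-second version: Peeking at results ad hoc every day inflates the false positive rate. Sequential testing declares the interim looks and boundaries in advance, which allows early stopping while controlling the error rate.

2-minute version: I would use sequential testing for guardrail monitoring during high-risk ramps or for experiments that are very costly to run. The design must spell out the look frequency, stopping boundaries, minimum sample, and success / harm / futility rules. For example, in an Agent tool rollout, unauthorized action is an immediate stop, while latency and error rate can be monitored sequentially; the final business benefit still requires a mature outcome window. The key is not to switch post hoc between a fixed-horizon and a sequential interpretation.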
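
A pre-declared look plan can be sketched as follows. This example splits the overall alpha evenly across the planned looks (a Bonferroni-style bound), which is more conservative than the alpha-spending or mixture boundaries that commercial platforms typically use; it is swapped in here only because it is simple to verify. The class and function names, the sign convention (positive z means treatment is better), and the default values are assumptions, and futility stopping is deliberately left out.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class SequentialPlan:
    planned_looks: int               # interim looks plus the final look, fixed before launch
    alpha: float = 0.05              # overall two-sided false positive budget
    min_samples_per_arm: int = 1000  # no stopping decision below this sample size

    def z_boundary(self) -> float:
        """Two-sided critical value applied at every planned look."""
        per_look_alpha = self.alpha / self.planned_looks
        return NormalDist().inv_cdf(1 - per_look_alpha / 2)


def interim_decision(plan: SequentialPlan, look_index: int,
                     n_per_arm: int, z_stat: float) -> str:
    """Return 'continue', 'stop_success', 'stop_harm', or 'end_inconclusive'."""
    if not 1 <= look_index <= plan.planned_looks:
        raise ValueError("this look was not declared in the plan")
    if n_per_arm < plan.min_samples_per_arm:
        return "continue"
    if abs(z_stat) >= plan.z_boundary():
        return "stop_success" if z_stat > 0 else "stop_harm"
    return "end_inconclusive" if look_index == plan.planned_looks else "continue"
```

With five planned looks the boundary is about 2.58 rather than the fixed-horizon 1.96, so a z-statistic of 2.2 at an interim look does not justify stopping.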

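Q5: How do you design guardrail metrics for an AI release?

30-second version: Guardrails must cover customer harm, compliance, operations, technical, financial, and fairness. They are not auxiliary observations but a release contract, and some critical guardrails trigger a pause or rollback as soon as they are breached.

2-minute version: I would first build a matrix by risk path: this AI might make incorrect commitments, leak PII, wrongly decline payments, miss KYC fields, call tools incorrectly, or add queue pressure. Each guardrail needs a severity, threshold, detection, action, and owner. Guardrails for the customer service Copilot include complaints, reopens, and incorrect commitments; for RAG, hallucinated citation and stale source; for fraud, false decline and manual queue; for Agent, unauthorized action and duplicate action. High-risk guardrails are not offset by average scores; they directly affect the release decision.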
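
One way to make the matrix executable is sketched below. The fields mirror the severity, threshold, detection, action, and owner attributes listed in the answer; the severity levels, the rule that a critical breach maps to rollback and any other breach to pause, and the fail-closed handling of missing telemetry are illustrative assumptions rather than a prescribed policy.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"


@dataclass(frozen=True)
class Guardrail:
    name: str
    severity: Severity
    threshold: float  # maximum tolerated value; higher observed values are worse
    detection: str
    action: str
    owner: str


def release_decision(guardrails, observed):
    """Return (decision, breached guardrails). Each guardrail is checked on its own,
    so a good average elsewhere cannot offset a breach."""
    breached = [
        g for g in guardrails
        # Missing telemetry is treated as a breach (fail closed): an assumption.
        if observed.get(g.name, float("inf")) > g.threshold
    ]
    if any(g.severity is Severity.CRITICAL for g in breached):
        return "rollback", breached
    if breached:
        return "pause", breached
    return "proceed", breached
```

For example, an Agent rollout might register `unauthorized_action_rate` as critical with a threshold of 0.0 and `p95_latency_s` as major; a single unauthorized action then returns `rollback` regardless of how the other metrics look.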

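Q6: Why might recommendation or RAG ranking use interleaving?

30-second version: Interleaving mixes the results of two rankings within the same user session and uses implicit feedback to compare preference faster, which suits search, recommendation, and retrieval / reranking candidates.

2-minute version: A regular A/B test assigns users to different rankings and may need a large amount of traffic to detect small differences. Interleaving compares two rankers in the same context and is useful for RAG retriever / reranker, knowledge base search, and recommendation ranking. But it mainly answers ranking preference and cannot directly demonstrate that the final answer is safe or has business value. So I would use interleaving as an early screen and still follow up with a controlled experiment to validate downstream task success, complaint, latency, and fairness guardrails.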
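
The answer does not name a specific interleaving method. The sketch below uses team-draft interleaving, one common variant, as an assumption: the two rankers take turns contributing their best not-yet-shown document, a coin flip breaks ties on who picks next, and clicks are credited to the ranker that contributed the clicked document. Function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Return (interleaved list of up to k docs, {doc: 'A' or 'B'})."""
    interleaved = []
    team_of = {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}

    def next_unused(name):
        return next((d for d in rankings[name] if d not in team_of), None)

    while len(interleaved) < k:
        if picks["A"] == picks["B"]:
            first = "A" if rng.random() < 0.5 else "B"  # coin flip on ties
        else:
            first = "A" if picks["A"] < picks["B"] else "B"
        second = "B" if first == "A" else "A"
        for name in (first, second):
            doc = next_unused(name)
            if doc is not None:
                interleaved.append(doc)
                team_of[doc] = name
                picks[name] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team_of


def credit_clicks(clicked_docs, team_of):
    """Count clicks per contributing ranker for one session."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins
```

Aggregated over many sessions, the share of sessions in which one ranker earns more clicks is the preference signal; as the answer notes, it says nothing by itself about answer safety or business value.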

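Q7: If an experiment has network effects, how do you handle it?

30-second version: First identify the interference paths, then change the randomization unit. Cluster customer service by team, recommendation by market or session, and fraud by merchant / account or switchback, so that one unit's treatment does not affect another unit's outcome.

2-minute version: Network effects break the independence assumption. For example, customer service teams share Copilot scripts, recommendation changes inventory and merchant exposure, and fraud policy changes attacker behavior. My approach is to write the interference paths into the design doc and prefer cluster randomization, switchback, holdout, or market-level rollout, while also recording contamination signals. In the analysis, I do not treat an ordinary user-level p-value as the final evidence; the effect must be estimated at the design unit, and the loss of power explained.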
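
Estimating at the design unit can be sketched as below: outcomes are first averaged within each cluster (for example, a customer service team), and the arms are then compared using cluster means, so the effective sample size is the number of clusters, not the number of cases. Weighting every cluster equally is one simple choice among several and is an assumption here, as are the function name and the row layout.

```python
from collections import defaultdict
import math
import statistics


def cluster_level_effect(rows):
    """rows: iterable of (cluster_id, arm, outcome), arm in {'treatment', 'control'}.

    Returns (effect, standard_error, clusters_per_arm). Needs at least two
    clusters per arm so that a between-cluster variance can be computed.
    """
    outcomes = defaultdict(list)
    arm_of = {}
    for cluster_id, arm, outcome in rows:
        if arm_of.setdefault(cluster_id, arm) != arm:
            # A cluster seen in both arms is a contamination signal.
            raise ValueError(f"cluster {cluster_id} appears in both arms")
        outcomes[cluster_id].append(outcome)

    means = {"treatment": [], "control": []}
    for cluster_id, values in outcomes.items():
        means[arm_of[cluster_id]].append(statistics.fmean(values))

    effect = statistics.fmean(means["treatment"]) - statistics.fmean(means["control"])
    se = math.sqrt(sum(statistics.variance(m) / len(m) for m in means.values()))
    return effect, se, {arm: len(m) for arm, m in means.items()}
```

Because the standard error is driven by the number of clusters, it is usually much wider than a case-level analysis would suggest, which is the power loss the answer says must be explained.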

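Q8: How do you run a champion-challenger release for a payment fraud model?

30-second version: Shadow the challenger first and record the action diff between champion and challenger, then expose it progressively in low-risk segments. The primary metric is net fraud loss avoided after false positive cost; the guardrails are false decline, VIP impact, manual queue, and chargeback delay.

2-minute version: A fraud model cannot be judged on AUC alone. Real decisions have to weigh fraud loss, false declines, customer experience, operations queues, and attacker adaptation. I would let the challenger run in shadow for a mature outcome window and put the action diff through manual and risk review; on entering canary, use it first for step-up or manual review, then for limited declines. Keep a champion holdout and delayed outcome tracking. Any false decline rate above risk appetite or manual queue above capacity triggers a downgrade. The end result may not be replacing the champion but a segment-specific challenger or threshold policy.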
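
The shadow comparison can be sketched as a small calculation over matured shadow logs. Only transactions where the two models disagree (the action diff) change the result. The record layout, the function names, and in particular `false_decline_cost_rate` (the cost of wrongly declining a good transaction, expressed as a fraction of its amount) are hypothetical; a real cost model would be set with risk and finance owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowTransaction:
    amount: float
    is_fraud: bool            # matured label, known only after the outcome window
    champion_declines: bool
    challenger_declines: bool


def action_diff(transactions):
    """Transactions where champion and challenger would act differently."""
    return [t for t in transactions if t.champion_declines != t.challenger_declines]


def net_fraud_loss_avoided(transactions, false_decline_cost_rate):
    """Challenger relative to champion: fraud loss avoided net of false decline cost."""
    net = 0.0
    for t in action_diff(transactions):
        sign = 1.0 if t.challenger_declines else -1.0
        if t.is_fraud:
            # Fraud stopped only by the challenger (+) or only by the champion (-).
            net += sign * t.amount
        else:
            # Good transaction declined only by the challenger (-) or only by the champion (+).
            net -= sign * t.amount * false_decline_cost_rate
    return net
```

A positive value favors the challenger on this one metric; the guardrails listed in the answer (false decline, VIP impact, manual queue, chargeback delay) still have to be checked separately.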

13. Reference Source Links

Kohavi, Tang, Xu: Trustworthy Online Controlled Experiments: https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59
Microsoft Research Experimentation Platform ExP: https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data: https://robotics.stanford.edu/~ronnyk/2013-02CUPEDImprovingSensitivityOfControlledExperiments.pdf
NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
Statsig Frequentist Sequential Testing: https://docs.statsig.com/experiments/advanced-setup/sequential-testing
Eppo Experiment Protocols: https://docs.geteppo.com/quick-starts/analysis-integration/defining-protocols/
Eppo Guardrail Cutoffs: https://docs.geteppo.com/data-management/organizing-metrics/guardrails/
Optimizely False Discovery Rate Control: https://support.optimizely.com/hc/en-us/articles/4410283967245-False-discovery-rate-control
Microsoft Azure Safe Deployment Practices: https://learn.microsoft.com/en-us/azure/well-architected/operational-excellence/safe-deployments
Microsoft Research: Diagnosing Sample Ratio Mismatch in A/B Testing: https://www.microsoft.com/en-us/research/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/
Large-scale validation and analysis of interleaved search evaluation: https://authors.library.caltech.edu/records/r3zrn-kd453
OpenAI Evals guide: https://developers.openai.com/api/docs/guides/evals
LaunchDarkly Guarded Rollouts: https://launchdarkly.com/docs/home/releases/creating-guarded-rollouts