AI Experimentation Platform / Release Science Playbook
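Audience: AI Platform PM / AI Product Architect / Experimentation PM / Decision Science / Risk Product / leaders of AI transformation in financial retail.
Core question: how to connect AI eval, online experiments, progressive release, risk gates, and causal decisioning into a reusable, auditable, rollback-capable Release Science platform capability.
Learning goals: design an AI experimentation platform; bridge offline evaluation of LLM / RAG / Agent systems to online controlled experiments; use guardrails, CUPED, sequential testing, ramp / rollback, and risk-based release gates to support scale / stop decisions.
Portfolio positioning: this playbook can be turned into portfolio evidence for senior AI product architecture work, including an Experimentation Platform Capability Map, Metric Tree, Guardrail Matrix, Ramp Plan, Stop Rule, Release Review Memo, Post-Experiment Decision Record, and a financial-retail case pack.
Scope note: this document is not basic BA requirements analysis, an introduction to statistics, legal advice, compliance advice, or a model validation report. Formal financial-retail projects must be jointly confirmed by the business owner, risk, model risk, legal, compliance, privacy, security, data owner, architecture review, and operations owner.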
Source Anchors
These sources calibrate the language used here for online controlled experiments, Microsoft ExP, CUPED, sequential testing, guardrail metrics, multiple testing, safe deployment, and AI risk governance. Formal projects must re-verify product status, statistical methods, regulatory requirements, and internal institutional policy as of the access date.
| Anchor | Official / primary source | Use in this playbook |
|---|---|---|
| Kohavi, Tang, Xu: Trustworthy Online Controlled Experiments | https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59 | Anchors the terminology for online controlled experiments, trustworthiness, organizational metrics, experimentation culture, leakage and interference, long-running experiments, and platform capabilities. |
| Microsoft Research ExP | https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/ | Shows how a large-scale experimentation platform combines low-friction experiments, trustworthy analysis, scorecards, A/B infrastructure, and GenAI continuous improvement. |
| CUPED paper | https://robotics.stanford.edu/~ronnyk/2013-02CUPEDImprovingSensitivityOfControlledExperiments.pdf | Covers variance reduction, pre-experiment covariates, experiment sensitivity, triggered samples, and pre-triggering discipline. |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI risk identification, measurement, monitoring, treatment, and evidence. |
| Statsig sequential testing | https://docs.statsig.com/experiments/advanced-setup/sequential-testing | Explains fixed horizon, the peeking problem, sequential testing, and the statistical discipline of early decisions. |
| Eppo experiment protocols / guardrails | https://docs.geteppo.com/quick-starts/analysis-integration/defining-protocols/ | Standardizes pre-registered metrics, analysis methods, decision criteria, and guardrail plans. |
| Optimizely false discovery rate control | https://support.optimizely.com/hc/en-us/articles/4410283967245-False-discovery-rate-control | Covers multiple testing, secondary / monitoring metrics, FDR, and the risk of slice exploration. |
| Microsoft Azure Safe Deployment Practices | https://learn.microsoft.com/en-us/azure/well-architected/operational-excellence/safe-deployments | Covers release science, progressive exposure, blast radius, ring deployment, and risk-based release governance. |
| Microsoft SRM article | https://www.microsoft.com/en-us/research/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/ | Treats sample ratio mismatch as a hard gate on experiment trustworthiness. |
| Interleaved search evaluation | https://authors.library.caltech.edu/records/r3zrn-kd453 | Covers online interleaving comparisons for recommendation, search, RAG retrieval, and ranking models. |
| OpenAI Evals guide | https://developers.openai.com/api/docs/guides/evals | Describes how LLM eval tasks, runs, and analysis bridge to online experiments. As of 2026-06-29, OpenAI documentation shows the Evals platform has entered a deprecation timeline, so this playbook treats it as eval design language rather than as the sole platform dependency. |
| LaunchDarkly guarded rollouts | https://launchdarkly.com/docs/home/releases/creating-guarded-rollouts | Product-form reference for progressive rollout, metric regression detection, automatic rollback, randomization unit, and Agent config rollout. |
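1. Senior positioning: Release Science is the decision system for AI in production
The common failure of financial-retail AI teams is not that the model is unusable; it is that the release decision cannot be trusted:
- Offline eval scores improve, but the real customer-service workflow does not.
- RAG answer quality improves, but complaints, mis-citations, escalation rate, or compliance risk get worse.
- A recommendation model raises click-through rate, but margin, customer lifetime value, or fairness declines.
- KYC extraction raises the automation rate, but missed fields cause downstream rework.
- A fraud model raises the interception rate, but false declines of high-value customers and payment-failure complaints increase.
- Agent tool calls pass testing, but long-tail intents in production trigger privilege overreach, duplicate submissions, or wrong actions.
In one sentence:
AI Experimentation Platform = the unified control plane for assignment, exposure, metrics, eval, release gate, ramp, rollback, evidence, and the learning loop. Release Science = using statistical evidence, risk thresholds, progressive exposure, and causal inference to decide when an AI version enters pilot, scale, freeze, rollback, or retire.
This is not "running a few more A/B tests." A senior AI PM / Product Architect has to answer:
| Decision question | Wrong practice | Mature practice |
|---|---|---|
| Should the new model ship | Ship because the offline score is high | Layered release through offline eval, shadow, canary, online controlled experiment, and guardrail gate |
| Can the experiment be interpreted | Read significance off the dashboard | First check SRM, assignment, exposure, triggering, metric lineage, multiple testing, and peeking |
| Is the risk acceptable | Look only at the primary metric | Use a guardrail matrix and stop rule to manage customer harm, compliance, cost, latency, fairness, and human escalation |
| When to expand traffic | "No incident, so go to 100%" | Ramp step by step according to risk tier, minimum exposure, sequential boundary, operations readiness, and rollback capacity |
| How does AI eval connect to business value | Treat the judge score as ROI | Use eval as a release gate; use online outcomes and causal decisioning to prove incremental effect |
| Why is the platform worth building | Every team experiments on its own | Reduce decision noise with a unified protocol, metrics catalog, evidence binder, release review, and institutional memory |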
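2. Why it matters: AI release risk differs from ordinary feature release risk
A traditional feature release is mainly concerned with functional correctness, performance, and user experience. An AI release adds six kinds of instability:
| AI release risk | Symptom | Release Science control |
|---|---|---|
| Behavior drift | Behavior changes nonlinearly after changes to the model, prompt, RAG index, or tool schema | component lineage, offline regression suite, shadow trace comparison |
| Context drift | Knowledge base, policy, customer context, or product rules are refreshed | source freshness metric, retrieval eval, policy effective-date gate |
| Human-AI interaction | Users over-trust the AI, ignore it, copy its errors, or bypass the process | adoption-adjusted metrics, human override, quality audit |
| Long-tail harm | Low-probability, high-loss errors are hidden in the average score | critical guardrail, zero-tolerance failure class, risk-based stop rule |
| Feedback contamination | The new policy changes user behavior and the label distribution | holdout, delayed outcome tracking, counterfactual logging |
| Operational coupling | AI output affects queues, manual review, payments, KYC, CRM, and Agent tools | ramp capacity check, manual fallback, rollback rehearsal |
Senior framing:
An AI release is not a deploy event; it is a controlled exposure of probabilistic behavior under an explicit risk appetite.
3. Capability map: AI Experimentation Platform Control Plane
3.1 Reference architecture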
AI change request
-> component registry
-> experiment protocol selection
-> assignment and exposure service
-> offline eval and shadow comparison
-> feature flag / release gate
-> online controlled experiment
-> metric pipeline and semantic layer
-> statistical analysis and decision engine
-> risk-based release review
-> ramp / rollback orchestration
-> post-experiment decision record
-> evidence binder and learning repository
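3.2 Core platform components
| Component | Responsibility | Senior product question | Key evidence |
|---|---|---|---|
| Experiment Registry | Manages experiment id, hypothesis, owner, risk tier, protocol, state | Who is testing which intervention in which scenario | Experiment card, approval trail |
| Assignment Service | Stable random assignment, cluster assignment, holdout, eligibility | Is the randomization unit the customer, agent, case, merchant, branch, or household | Assignment log, salt, split config |
| Exposure Tracker | Records what the user actually saw or where the AI actually had an effect | Assignment is not exposure; who actually received the treatment | Exposure event, trigger reason |
| Feature Flag / Release Gate | Controls traffic, segment, ramp, kill switch | Do experiments, releases, and config changes share one control plane | Flag config, targeting rule, rollback state |
| Metrics Catalog | Defines primary, secondary, guardrail, invariant, diagnostic metrics | Are metric definitions consistent, directions explicit, and latencies acceptable | Metric contract, lineage, owner |
| Eval Bridge | Connects offline eval, shadow eval, and online outcome | Does the eval score explain online behavior | Eval-to-online mapping, calibration report |
| Analysis Engine | SRM, CUPED, sequential testing, multiple testing, slice analysis | Is the analysis trustworthy, and does it support early stopping | Scorecard, variance plan, decision boundary |
| Release Review Workflow | go / limited go / no-go / rollback / exception | How to decide when the result is statistically significant but the risk is unacceptable | Release review memo, risk sign-off |
| Evidence Binder | Stores design, data, results, approvals, exceptions, retrospectives | Can audit and model risk teams reproduce the judgment made at the time | Immutable evidence package |
| Learning Repository | Records experiment conclusions, failure modes, meta-analysis | Does the organization avoid repeating mistakes | Decision log, pattern library |
3.3 Choosing an architecture pattern
| Pattern | When to use | Advantage | Risk control |
|---|---|---|---|
| Central Experimentation Platform | Multiple business lines, teams, and metric systems | Unified assignment, metrics, analysis, evidence | Mandatory protocol, metrics catalog, SRM gate |
| Embedded Experiment SDK | Low-latency client-side or server-side assignment is required | Tight integration with feature flag / config release | SDK version gate, fallback variation, telemetry validation |
| Warehouse-Native Analysis | The institution already has a mature data warehouse and governance | Less data copying; reuses the semantic layer | Review of data latency, PII permissions, metric lineage |
| Decisioning-Coupled Experiment | Risk control, recommendation, KYC, Agent tool routing | Assignment and business decision share one source, so counterfactuals can be logged | Decision log, policy version, outcome delay |
| Shadow / Replay Platform | Validating a high-risk model or Agent tool before launch | Compares old and new policies with no customer exposure | Shadow does not prove that user behavior changes; online validation is still required |
| Champion-Challenger Framework | Iterating fraud, KYC, credit, and routing policies | Long-term comparison of a stable baseline against challengers | champion lock, challenger cap, override monitoring |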
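The Assignment Service row asks for stable random assignment backed by an assignment log, salt, and split config. A minimal sketch of one common way to get that, hashing the unit id with an experiment-specific salt, is shown here. The function names, the record fields, and the identifiers in the usage lines are hypothetical illustrations, not part of any platform named in this playbook.

```python
import hashlib


def assign_variant(unit_id: str, experiment_id: str, salt: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to a variant.

    The same (experiment_id, salt, unit_id) always yields the same variant,
    so assignment stays stable across sessions and services.
    """
    key = f"{experiment_id}:{salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


def assignment_log_record(unit_id: str, experiment_id: str, salt: str,
                          treatment_share: float = 0.5) -> dict:
    """Build the evidence record kept for each assignment (hypothetical fields)."""
    return {
        "experiment_id": experiment_id,
        "unit_id": unit_id,
        "salt": salt,
        "treatment_share": treatment_share,
        "variant": assign_variant(unit_id, experiment_id, salt, treatment_share),
    }


# Hypothetical identifiers, for illustration only.
record = assignment_log_record("agent-team-017", "rag-reranker-canary", "salt-v1")
```

Changing the salt reshuffles units between variants, which is why the salt belongs in the assignment log alongside the split config.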
4. Release Science Operating Model
4.1 Five-stage release path
Design
hypothesis, treatment, metric tree, guardrail, sample size, stop rule
Dry run
A/A test, telemetry validation, SRM expectation, metrics lineage check
Shadow
new AI runs without affecting customer or staff decision
Controlled exposure
canary, A/B, interleaving, cluster, switchback, champion-challenger
Scale decision
release review, ramp, rollback readiness, post-experiment decision
4.2 Decision layers
| Layer | Decision | Evidence | Typical conclusion |
|---|---|---|---|
| L0 Technical readiness | Can it run, be observed, and be rolled back | integration test, telemetry validation, fallback drill | Cleared to enter shadow |
| L1 Eval readiness | Does offline quality exceed the minimum bar | golden set, red-team set, slice regression | Cleared for small-traffic canary |
| L2 Experiment validity | Is the online experiment trustworthy | SRM, triggering, sample size, variance, multiple testing discipline | Cleared to interpret results |
| L3 Risk acceptability | Is the risk within appetite | guardrail matrix, incident log, manual audit | go / limited go / rollback |
| L4 Business impact | Is there attributable incremental value | primary outcome, CUPED-adjusted estimate, causal review | scale / iterate / stop |
| L5 Portfolio learning | Has platform capability been captured | pattern reuse, cost-benefit, future experiment backlog | platformize / retire / merge |
5. Experiment design method library
5.1 Online Controlled Experiments / A/B Testing
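Use this when the treatment directly affects users or a business process and units can be randomized stably.
| Design point | Senior judgment |
|---|---|
| Randomization unit | In financial retail the unit is not always the user. A customer-service Copilot may randomize by agent or case; payment fraud by transaction, card, merchant, or account; KYC by application; recommendation by session or customer. |
| Assignment vs exposure | Assignment means being allocated to treatment; exposure means actually being affected by the AI. AI products often have units that are assigned but not exposed, so keep both the ITT and the triggered analysis views. |
| Invariant metrics | country, device, channel, case type, risk tier, and traffic allocation should be used to check assignment and data quality. An SRM failure normally freezes interpretation first. |
| Outcome delay | fraud loss, KYC defect, complaint, and chargeback may lag by days to weeks. A short-term proxy cannot replace the final outcome. |
| Heterogeneity | An average lift can hide harm to a high-risk segment. Slice by customer segment, case complexity, agent tenure, risk tier, language, and channel. |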
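The SRM check mentioned under invariant metrics can be sketched as a chi-square goodness-of-fit test on the observed arm counts. The version below handles the two-arm case with the standard library only; the function name and the 0.001 alert threshold in the usage lines are assumptions for illustration, and a real platform would fix its threshold in the protocol.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Two-arm sample ratio mismatch test (chi-square, 1 degree of freedom)."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: a 5000 / 5400 split under a planned 50/50 allocation.
p = srm_p_value(5000, 5400)
srm_failed = p < 0.001  # assumed alert threshold; freeze interpretation if True
```

A failed check says nothing about the treatment effect; it says the assignment or logging path needs investigation before any readout.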
5.2 CUPED / Variance Reduction
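The product value of CUPED is not that it is a "statistical trick"; it shortens the learning cycle of expensive AI experiments.
| Dimension | Design rule |
|---|---|
| Precondition | The covariate must be measured before the treatment is triggered and must not be affected by the treatment. |
| Financial-retail covariates | Historical customer-service AHT, agent historical QA, customer's past complaint rate, merchant historical fraud rate, KYC applicant's historical rate of document resubmission requests, recommendation user's historical purchase propensity. |
| When it does not fit | Too many new users without history, weak correlation between covariate and outcome, poor pre-period data quality, or a covariate affected by the release. |
| Product decision | State in the sample size plan the raw variance, expected correlation, post-CUPED variance, minimum detectable effect, and change in experiment duration. |
| Risk reminder | CUPED improves sensitivity; it does not repair wrong assignment, wrong exposure, interference, metric contamination, or multiple testing problems. |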
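As a minimal sketch of the adjustment itself: CUPED replaces each outcome y with y - theta * (x - mean(x)), where x is the pre-experiment covariate and theta = cov(x, y) / var(x) is estimated on the pooled data from both arms. The function name and the toy numbers are illustrative assumptions; a production implementation would also handle units with no pre-period history.

```python
from statistics import mean, pvariance


def cuped_adjust(outcomes: list[float], covariates: list[float]):
    """Return (adjusted outcomes, theta) for pooled treatment + control data.

    theta = cov(x, y) / var(x). Subtracting theta * (x - mean(x)) keeps the
    overall mean of the outcome unchanged and removes the variance explained
    by the pre-experiment covariate.
    """
    mx, my = mean(covariates), mean(outcomes)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(covariates, outcomes))
    var_x = sum((x - mx) ** 2 for x in covariates)
    theta = cov_xy / var_x if var_x > 0 else 0.0
    adjusted = [y - theta * (x - mx) for x, y in zip(covariates, outcomes)]
    return adjusted, theta


# Toy data: the pre-period value x strongly predicts the in-experiment outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_adj, theta = cuped_adjust(y, x)
variance_ratio = pvariance(y_adj) / pvariance(y)  # below 1 when x is informative
```

After the adjustment, the treatment effect is estimated as the usual difference in means between arms, computed on the adjusted values.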
5.3 Sequential Testing
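Sequential testing allows stopping early at pre-declared boundaries; it does not mean checking the p-value every day and interpreting it freely.
| Scenario | Practice |
|---|---|
| High-risk release | Pre-define the interim looks; at each look, decide escalate / pause / rollback strictly by the stop rule. |
| Costly experiment | Use a sequential boundary to reduce unnecessary exposure. |
| Long-horizon outcome | Apply sequential monitoring to early guardrails; keep a fixed window or layered analysis for the final business outcome. |
| Organizational governance | The dashboard states clearly whether the method is fixed horizon or sequential; switching interpretation between unregistered methods is prohibited. |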
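To make "pre-declared boundaries" concrete, here is a deliberately conservative sketch: the two-sided alpha is split evenly across the planned interim looks (a Bonferroni split), so each look uses a stricter critical value than a single fixed-horizon test. This is a simple stand-in chosen for clarity. It is not the method used by Statsig or any other vendor, and group-sequential or always-valid approaches are usually less conservative. All names and the example z-score are assumptions.

```python
from statistics import NormalDist


def look_boundary(planned_looks: int, alpha: float = 0.05) -> float:
    """Critical |z| per look when alpha is split evenly across planned looks."""
    return NormalDist().inv_cdf(1 - alpha / (2 * planned_looks))


def evaluate_look(z_score: float, look_number: int, planned_looks: int,
                  alpha: float = 0.05) -> str:
    """Apply the pre-registered boundary at one interim look."""
    if not 1 <= look_number <= planned_looks:
        raise ValueError("look was not pre-registered")
    if abs(z_score) >= look_boundary(planned_looks, alpha):
        return "boundary crossed: escalate to release review"
    if look_number == planned_looks:
        return "final look: no boundary crossed"
    return "continue to next planned look"


# Five planned looks at overall alpha 0.05 give a per-look boundary near 2.58.
boundary = look_boundary(planned_looks=5)
status = evaluate_look(z_score=2.1, look_number=2, planned_looks=5)
```

Note that z = 2.1 would pass a single fixed-horizon test at 1.96 but does not cross this boundary, which is the cost of being allowed to look early.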
5.4 Guardrail Metrics
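A guardrail is not a metric to "glance at on the side"; it is a release contract.
| Type | Examples | Decision meaning |
|---|---|---|
| Customer harm | Complaint rate, false rejection rate, wrong commitments, monetary impact, supervisor escalation | Pause or rollback as soon as the threshold is exceeded |
| Compliance / policy | Missed KYC fields, PII leakage, unauthorized advice, record-retention failure | Critical classes can be set to zero tolerance |
| Operational | AHT, reopen rate, manual override, queue backlog, fallback rate | Prevents AI from shifting cost onto operations |
| Technical | latency, timeout, tool error, retrieval empty rate, schema failure | Controls system reliability and user experience |
| Financial | fraud loss, false positive cost, margin, chargeback, refund | Judges benefit and risk cost together |
| Fairness / segment | Differences across elderly customers, language, channel, region, risk tier | Prevents average gains from hiding harm to specific groups |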
5.5 Multiple Testing
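AI experiments usually involve many metrics, segments, prompts, models, arms, and judge dimensions. Uncontrolled multiple comparisons create an illusion of significance.
| Problem | Control |
|---|---|
| Multiple primary metrics | Require one primary metric or build an OEC composite; the rest are secondary / guardrail. |
| Multiple variants | Use pre-declared contrasts; apply FWER / FDR control or hierarchical testing to multi-arm comparisons. |
| Heavy segment exploration | Treat slices as diagnosis; base the scale decision only on pre-registered segments or a follow-up validation experiment. |
| Multi-dimensional LLM eval | Make critical safety metrics a hard gate and compare quality dimensions in layers, so that an average judge score cannot hide severe errors. |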
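For the FDR option in the table, the Benjamini-Hochberg step-up procedure is one widely used control. A minimal sketch follows; the metric names and p-values are made up for illustration, and the procedure is shown generically rather than as any vendor's implementation.

```python
def benjamini_hochberg(p_values: dict[str, float], q: float = 0.05) -> set[str]:
    """Return the names of hypotheses rejected at false discovery rate q."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Step-up rule: keep the largest rank whose p-value is under q * rank / m.
        if p <= q * rank / m:
            cutoff_rank = rank
    return {name for name, _ in ranked[:cutoff_rank]}


# Hypothetical secondary-metric p-values from one experiment scorecard.
secondary = {
    "time_to_answer": 0.001,
    "search_refinement_rate": 0.012,
    "agent_override_rate": 0.025,
    "fallback_rate": 0.045,
    "reopen_rate": 0.2,
}
discoveries = benjamini_hochberg(secondary, q=0.05)
```

In this example `fallback_rate` has p = 0.045, which would pass an unadjusted 0.05 threshold, but it is not a discovery once the five comparisons are controlled together.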
5.6 Network Effects / Interference
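When one unit's treatment affects another unit's outcome, the usual assumption behind randomization that units do not interfere with each other breaks down.
| Scenario | Interference path | Recommended design |
|---|---|---|
| Customer-service Copilot | Agents under the same supervisor share prompts and process tips | agent team / supervisor group cluster randomization |
| Recommendation system | Exposure changes inventory, price, merchant traffic, and later users' choices | session holdout, market / merchant cluster, switchback |
| Payment fraud | The blocking policy changes attacker behavior and merchant routing | merchant / card / account cluster, time-window switchback |
| Agent tools | Tool execution changes case state and affects later handling by users or staff | case-level isolation, workflow state lock, tool action audit |
| RAG knowledge assistant | Team members copy answers into a shared knowledge base or macro templates | team-level randomization, shared artifact monitoring |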
5.7 Interleaving
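Interleaving suits comparing two candidate ranking / retrieval / recommendation policies, especially when an ordinary A/B test would need a lot of traffic.
| Use | Financial-retail example | Caveats |
|---|---|---|
| Retrieval interleaving | A RAG knowledge assistant compares the document rankings returned by two retrievers / rerankers | Compares ranking preference only; does not directly prove that final-answer risk is acceptable. |
| Recommendation interleaving | Comparing next-best-action in a banking app or e-commerce recommendation policies | Must control position bias, inventory constraints, repeated exposure, and long-term value. |
| Search ranking interleaving | Internal policy search, customer-service knowledge base search, product catalog search | A click is not correctness; combine with downstream task success and QA. |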
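One common interleaving variant is team-draft interleaving: the two rankers take turns "drafting" their best not-yet-shown document, a coin flip breaks ties on who picks next, and each click is credited to the team that contributed the clicked document. The sketch below assumes this variant; the function names and document ids are hypothetical.

```python
import random


def team_draft_interleave(ranking_a: list[str], ranking_b: list[str],
                          k: int, rng: random.Random):
    """Return (interleaved list, doc -> team) using team-draft interleaving."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved: list[str] = []
    team_of: dict[str, str] = {}
    while len(interleaved) < k:
        remaining = {team: [doc for doc in docs if doc not in team_of]
                     for team, docs in rankings.items()}
        if not remaining["A"] and not remaining["B"]:
            break
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            chooser = "A" if picks["A"] < picks["B"] else "B"
        else:
            chooser = rng.choice(["A", "B"])
        if not remaining[chooser]:
            chooser = "B" if chooser == "A" else "A"
        doc = remaining[chooser][0]
        interleaved.append(doc)
        team_of[doc] = chooser
        picks[chooser] += 1
    return interleaved, team_of


def credit_clicks(clicked: list[str], team_of: dict[str, str]) -> dict[str, int]:
    """Count clicks per team; the team with more clicks wins this impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


shown, team_of = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, rng=random.Random(7))
wins = credit_clicks([shown[0]], team_of)
```

Preference is read by aggregating per-impression wins across many sessions, which is why the caveats above still apply: a winning ranker is preferred by clicks, not proven correct.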
5.8 Shadow Launch
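Shadow launch is a key stage of a high-risk AI release, but it is not final launch evidence.
| Shadow can prove | Shadow cannot prove |
|---|---|
| The new model runs on production traffic | Whether users or staff will change their behavior |
| How the output differs from the champion | Whether business metrics improve |
| latency, cost, timeout, schema error | Customer experience and real adoption |
| Offline human review results on high-risk cases | Treatment effect and long-term outcomes |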
5.9 Ramp / Rollback
Ramp is controlled exposure; rollback is a product capability, not an improvised post-incident operation.
| Release tier | Typical ramp | Required controls |
|---|---|---|
| Low risk internal assistive | internal dogfood -> 10% staff -> 50% -> 100% | usage, latency, feedback, support channel |
| Medium risk employee copilot | shadow -> 5% agents -> 20% -> 50% -> all trained agents | QA audit, manual override, case reopen, supervisor review |
| High risk customer / financial impact | shadow -> 1% eligible low-risk segment -> 5% -> 10% -> release review | critical guardrail, manual fallback, risk sign-off, rollback drill |
| Agent actioning | read-only -> draft-only -> supervised action -> limited autonomous action | tool allowlist, approval threshold, kill switch, transaction idempotency |
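6. Bridging LLM / RAG / Agent eval to online experiments
6.1 Bridging principles
Offline eval answers "does this version behave better on known samples." An online experiment answers "does this version produce net incremental value in the real workflow without unacceptable risk." A bridging layer is required between the two.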
offline eval
component quality: retrieval, groundedness, tool call correctness
-> shadow launch
production trace comparison without user impact
-> canary
limited real exposure with hard guardrails
-> controlled experiment
causal estimate on workflow and business outcomes
-> ramp decision
risk-adjusted value and operational readiness
6.2 Eval-to-Online Mapping
| AI type | Offline eval | Shadow signal | Online primary | Online guardrail |
|---|---|---|---|---|
| Customer Copilot | answer correctness, policy compliance, tone, citation | draft acceptance difference, QA reviewer preference | resolved cases per hour, quality-adjusted AHT | complaint, reopen, wrong commitment, escalation |
| RAG knowledge assistant | retrieval recall, groundedness, citation validity | old vs new answer divergence, missing source rate | task success, agent accepted answer, search deflection | stale source, PII exposure, policy mismatch, latency |
| Recommendation | offline precision / recall, business rule pass | ranking diff, inventory / merchant exposure shift | conversion, margin, engagement quality | churn, complaint, fairness, inventory starvation |
| KYC extraction | field-level F1, document type robustness | human reviewer diff, critical field miss | straight-through processing with audit pass | false accept, false reject, manual rework, policy breach |
| Payment fraud | AUC / PR, loss simulation, reason code stability | champion vs challenger alert diff | fraud loss avoided net of false positive cost | false decline, VIP impact, chargeback delay, manual queue |
| Agent tool rollout | tool selection accuracy, argument schema pass, policy refusal | tool-call diff, idempotency check | successful supervised action rate | unauthorized action, duplicate action, rollback failure |
6.3 Bridge Anti-Patterns
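| Anti-pattern | Why it is dangerous | Alternative |
|---|---|---|
| Deciding full rollout directly from the LLM judge score | Judge bias and thin offline-set coverage hide harm in the real workflow | The judge score feeds the release gate, followed by shadow + controlled exposure |
| Looking only at adoption | Staff may adopt wrong suggestions or use the tool passively under management pressure | Read adoption together with quality, outcome, and override |
| Leaving the offline golden set stale | New production failures never enter the regression set | Feed incidents, human overrides, complaints, and shadow diffs back into the eval dataset |
| Treating a prompt change as low-risk config | A prompt can change policy boundaries, tool calls, and customer commitments | Put the prompt version in the component registry and behind the release gate |
| Opening Agent tools all at once | Actioning risk is far higher than answer drafting | read -> draft -> supervised action -> limited autonomous action |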
7. Causal Decisioning: from significance to decision quality
7.1 Decision formula
release decision =
treatment effect estimate
+ uncertainty
+ guardrail status
+ risk severity
+ operational readiness
+ reversibility
+ strategic value
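Statistical significance is not a release permit. A high-risk AI version can show a significant lift on the primary metric and still be rolled back because a guardrail failed.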
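One way to read the formula is as an ordered set of checks in which guardrails and readiness are evaluated before the effect estimate. The sketch below encodes that ordering. The field names, the precedence of the checks, and the use of a confidence-interval lower bound are all illustrative assumptions; strategic value and risk severity are left to the human release review rather than encoded.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:
    effect_ci_low: float                 # lower bound of the primary-metric effect interval
    critical_guardrail_breached: bool
    noncritical_guardrail_breached: bool
    rollback_verified: bool              # reversibility has been rehearsed
    ops_ready: bool                      # operational readiness confirmed


def release_decision(evidence: ReleaseEvidence) -> str:
    """Map evidence to a release gate outcome (hypothetical precedence)."""
    if evidence.critical_guardrail_breached:
        return "rollback"                # overrides any primary-metric lift
    if not (evidence.rollback_verified and evidence.ops_ready):
        return "no-go"
    if evidence.effect_ci_low <= 0:
        return "no-go"                   # lift not distinguishable from zero
    if evidence.noncritical_guardrail_breached:
        return "limited go"
    return "scale go"


# A significant lift with a critical guardrail breach still ends in rollback.
decision = release_decision(ReleaseEvidence(
    effect_ci_low=0.004, critical_guardrail_breached=True,
    noncritical_guardrail_breached=False, rollback_verified=True, ops_ready=True))
```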
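7.2 Causal design discipline
| Discipline | Product action |
|---|---|
| Treatment definition | State whether the AI changes information, ranking, a draft, a suggestion, an automated action, a threshold, or a permission. |
| Unit of analysis | Distinguish customer, case, session, agent, merchant, transaction, branch, household. |
| Assignment logging | Log config, salt, version, eligibility, and reason for every assignment. |
| Exposure logging | Log whether the treatment was actually seen, adopted, overridden, or executed. |
| Counterfactual preservation | fraud, recommendation, and agent action need to keep the champion decision or the alternative ranking. |
| Triggered analysis | Analyze the sample that actually triggered the AI, and keep ITT alongside it to guard against selection bias. |
| Delayed outcome | Use a mature outcome window for chargeback, complaint, and KYC defect. |
| Heterogeneous treatment effect | Use pre-registered segments to analyze who benefits and who is harmed. |
7.3 Alternatives when randomization is not possible
| Scenario | Alternative design | Risk reminder |
|---|---|---|
| Regulation or ethics forbids random withholding | phased rollout, stepped wedge, matched controls | Rollout order may correlate with how capable each unit is |
| Supply or inventory constraints | switchback, geo / branch cluster | Time and location shocks must be controlled |
| Low-frequency, high-loss events | shadow + simulation + human review + limited canary | Simulation results must not be presented as online causal effects |
| Organization-wide tools | team-level cluster randomization | Too few clusters reduces power |
| A policy with a single release date | pre/post with synthetic control or interrupted time series | External events and seasonality must be modeled explicitly |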
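8. Financial-retail case pack
8.1 Customer-service Copilot rollout
| Design item | Decision |
|---|---|
| Treatment | Copilot provides policy retrieval, reply drafts, and next-step suggestions; the agent sends the final reply. |
| Randomization unit | agent-team cluster, to avoid contamination from knowledge spreading within a supervisor group. |
| Offline gate | Policy correctness, citation validity, prohibited commitments, escalation of sensitive scenarios. |
| Shadow | Generate drafts for real cases; QA compares the champion macro template with the Copilot draft. |
| Primary metric | quality-adjusted resolved cases per hour. |
| Guardrail | Complaint rate, reopen rate, wrong commitments, supervisor escalation, customer wait time, agent override. |
| Ramp | 5% trained agents -> 20% -> 50% -> all eligible teams; each stage moves on only after QA sampling is complete. |
| Stop rule | Pause on any critical policy breach or when complaint rate exceeds the control upper limit; limited go if AHT improves but QA declines for two consecutive observation windows. |
| Decision output | If low-complexity cases improve significantly while high-complexity case guardrails are near the boundary, scale low-complexity cases and keep high-complexity cases human-first. |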
8.2 RAG knowledge assistant model upgrade
| Design item | Decision |
|---|---|
| Treatment | Upgrade of the embedding model, reranker, answer model, or citation policy. |
| Randomization unit | agent-session or case; teams that share heavily use team-level randomization. |
| Offline gate | retrieval recall@k, groundedness, citation freshness, policy effective-date match. |
| Shadow | Run champion and challenger on the same query and compare source, answer, latency, and refusals. |
| Online experiment | Use A/B for low-risk internal search; for retrieval comparisons where rankings differ strongly, interleaving can come first. |
| Primary metric | task success rate, or accepted answer with no downstream correction. |
| Guardrail | stale policy citation, PII leakage, hallucinated source, latency p95, empty retrieval rate. |
| Stop rule | A critical source hallucination triggers rollback; stop if latency p95 exceeds the SLO without a significant quality gain. |
| Decision output | If the challenger improves hard queries but degrades common ones, use a query intent router instead of full replacement. |
8.3 Recommendation policy launch
| Design item | Decision |
|---|---|
| Treatment | next-best-offer / product ranking / content recommendation strategy. |
| Randomization unit | customer or session; use cluster or switchback when inventory, merchant traffic, or social spread is present. |
| Offline gate | policy eligibility, fairness, margin rules, suppression list, customer suitability. |
| Interleaving | Run interleaving first on search / ranking candidates to read preference quickly. |
| Online primary | risk-adjusted conversion, margin, qualified engagement. |
| Guardrail | opt-out, complaint, unsuitable offer, minority segment lift gap, inventory starvation, long-term churn. |
| Multiple testing | One primary metric, with secondary business metrics layered beneath it; segment discovery is diagnostic only. |
| Stop rule | Roll back immediately if any suitability guardrail fails; do not scale if short-term clicks rise while margin or complaints worsen. |
| Decision output | Targeted rollout for high-uplift, low-risk segments; fall back to the champion for negative-uplift segments. |
8.4 KYC Extraction Model Release
| Design item | Decision |
|---|---|
| Treatment | Upgrade of the OCR + document understanding model that automatically extracts identity, address, company registration, and beneficiary fields. |
| Randomization unit | application or document package. |
| Shadow | The challenger only generates fields; the human reviewer does not see them, or compares them only in an audit view. |
| Primary metric | audit-pass straight-through processing rate. |
| Guardrail | false accept, critical field miss, manual rework, document type slice regression, policy breach. |
| Ramp | shadow -> dual-read low-risk document types -> supervised auto-fill -> limited auto-approve for low-risk cases. |
| Stop rule | Freeze the slice if critical field miss is higher than control or a high-risk document type regresses. |
| Decision output | No global go/no-go; release in layers by document type, country, risk tier, and image quality. |
8.5 Payment Fraud Rule / Model Champion-Challenger
| Design item | Decision |
|---|---|
| Treatment | A fraud score model or rule combination challenges the champion. |
| Randomization unit | transaction, card, merchant, or account, chosen according to interference and attacker adaptation paths. |
| Shadow | The challenger computes score and action without affecting authorization; the champion action is retained. |
| Primary metric | net fraud loss avoided after false positive cost. |
| Guardrail | false decline, VIP impact, merchant complaint, manual review backlog, chargeback lag. |
| Causal issue | True fraud outcomes are delayed and attacker behavior adapts to the policy, so a mature window and a holdout are needed. |
| Ramp | low-risk segment score-only -> manual review routing -> limited step-up -> limited decline. |
| Stop rule | Downgrade immediately if false decline exceeds risk appetite or the manual queue exceeds capacity. |
| Decision output | If the challenger raises detection but also raises false declines, use a threshold policy, step-up authentication, or segment-specific champion-challenger instead of direct full rollout. |
8.6 Agent Tool Rollout
| Design item | Decision |
|---|---|
| Treatment | The Agent can call tools such as CRM update, refund quote, case routing, payment investigation, and KYC checklist. |
| Randomization unit | tool intent, case type, or agent cohort. |
| Offline gate | tool selection accuracy, argument validation, policy refusal, idempotency, permission check. |
| Shadow | The Agent generates the intended tool call; the system compares it with the human action and the policy engine. |
| Progressive autonomy | read-only -> draft tool call -> human approval -> limited autonomous execution. |
| Primary metric | successful supervised action rate, or reduced manual handling with no quality regression. |
| Guardrail | unauthorized action, duplicate action, wrong account, missing approval, compensation failure, kill switch failure. |
| Stop rule | Any unauthorized financial action or duplicate execution triggers immediate rollback. |
| Decision output | Tool capabilities are tiered by action risk: low-risk read tools can expand, while tools touching funds or customer commitments stay under approval. |
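9. Product decisions and governance
9.1 Platform product decisions
| Decision | Recommended principle |
|---|---|
| Build the platform first or run business experiments first | Platformize when two or more teams repeatedly need assignment, metric catalog, release gate, and evidence binder; for a single low-risk use case, start with a lightweight protocol. |
| Whether to merge feature flags and experimentation | Release control and experiment assignment must share the exposure log, but analysis permissions, approvals, and metric governance can be separate. |
| Who owns the metrics catalog | The platform maintains the technical specification; business / risk / finance own metric definitions and decision direction. |
| Whether guardrail override is allowed | Critical safety / compliance guardrails cannot be overridden unilaterally by the business; non-critical guardrails can use a documented exception and compensating control. |
| Relationship between the AI eval platform and the A/B platform | EvalOps owns component behavior, Experimentation owns causal effect in the real workflow, and Release Science chains the two as gates. |
| When to introduce sequential testing | For high-cost, high-risk, or continuously monitored rollouts; it must come with a pre-registered stop rule and training. |
| When to use interleaving | When ranking / retrieval candidates differ substantially and the preference signal can be observed within the same user session; it does not replace final risk assessment. |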
9.2 RACI
| Activity | AI PM | Experimentation PM | Data Science | Risk Product | Platform Engineering | Model Risk / Compliance | Operations |
|---|---|---|---|---|---|---|---|
| Hypothesis and treatment | A | C | C | C | C | C | C |
| Metric tree | A | A | C | C | C | C | C |
| Assignment design | C | A | A | C | R | C | C |
| Offline eval gate | A | C | R | C | R | C | C |
| Guardrail matrix | A | A | C | A | C | A | C |
| Ramp / rollback | A | C | C | A | R | C | A |
| Release review | A | A | C | A | C | A | A |
| Post-experiment decision | A | A | R | C | C | C | C |
R = responsible, A = accountable, C = consulted.
9.3 Release Gate state machine
Draft
-> protocol approved
-> telemetry validated
-> shadow running
-> canary running
-> experiment running
-> analysis locked
-> release review
-> limited go / scale go / no-go / rollback / retire
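The flow above can be enforced in code as a transition table, so that a release cannot skip a gate. The sketch below models only the forward path shown in the diagram; loops such as returning to Draft after a no-go are not modeled, and the function names are hypothetical.

```python
RELEASE_STATES = [
    "draft",
    "protocol approved",
    "telemetry validated",
    "shadow running",
    "canary running",
    "experiment running",
    "analysis locked",
    "release review",
]
REVIEW_OUTCOMES = {"limited go", "scale go", "no-go", "rollback", "retire"}


def allowed_next(state: str) -> set[str]:
    """Return the states reachable from `state` in one step."""
    if state == "release review":
        return set(REVIEW_OUTCOMES)
    if state in REVIEW_OUTCOMES:
        return set()  # terminal in this sketch
    position = RELEASE_STATES.index(state)  # raises ValueError if unknown
    return {RELEASE_STATES[position + 1]}


def advance(state: str, target: str) -> str:
    """Move to `target`, refusing any transition that skips a gate."""
    if target not in allowed_next(state):
        raise ValueError(f"transition {state!r} -> {target!r} is not allowed")
    return target


state = advance("draft", "protocol approved")
```

Calling `advance("draft", "canary running")` raises `ValueError`, which is the point: shadow and telemetry validation cannot be bypassed by configuration alone.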
9.4 Risk-Based Release Gate
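| Risk tier | AI capability | Minimum gates |
|---|---|---|
| Tier 1 Internal productivity | Internal summarization, search, drafts not visible to customers | telemetry validation, offline eval, usage monitoring, manual feedback |
| Tier 2 Employee assistive | Copilots for customer service, operations, and KYC reviewers | offline eval, shadow, QA audit, A/B, guardrail, rollback |
| Tier 3 Customer visible | Customer-visible replies, recommendations, quotes, explanations | legal/compliance review, controlled exposure, critical guardrail, complaint monitoring |
| Tier 4 Financial / eligibility impact | Fraud blocking, KYC clearance, credit, payments, fund movements | model risk review, champion-challenger, mature outcome window, manual fallback, board-level risk reporting as applicable |
| Tier 5 Agent autonomous action | Automatically executes actions on customer accounts or funds | tool-level permissions, human approval threshold, kill switch, transaction reconciliation, incident drill |
10. Ready-to-use deliverable templates
The templates below are presented as "field + how to write it + example" and can be reused directly as project document structure. All examples use the "RAG knowledge assistant model upgrade" scenario, so that together they form one complete set of portfolio evidence.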
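The tier table lends itself to a machine-checkable config: a release request lists the gates it has completed, and the platform reports what is still missing for its tier. A minimal sketch, with gate names copied from the table and hypothetical tier keys and function name:

```python
MINIMUM_GATES = {
    "tier1": {"telemetry validation", "offline eval", "usage monitoring",
              "manual feedback"},
    "tier2": {"offline eval", "shadow", "QA audit", "A/B", "guardrail", "rollback"},
    "tier3": {"legal/compliance review", "controlled exposure",
              "critical guardrail", "complaint monitoring"},
    # Board-level risk reporting is "as applicable" and is left out of the hard set.
    "tier4": {"model risk review", "champion-challenger",
              "mature outcome window", "manual fallback"},
    "tier5": {"tool-level permissions", "human approval threshold", "kill switch",
              "transaction reconciliation", "incident drill"},
}


def missing_gates(tier: str, completed: set[str]) -> set[str]:
    """Return the minimum gates for `tier` that are not yet completed."""
    return MINIMUM_GATES[tier] - completed


# A Tier 2 copilot that has not yet run shadow or a rollback drill.
gaps = missing_gates("tier2", {"offline eval", "QA audit", "A/B", "guardrail"})
```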
10.1 Experiment Design Doc
| Field | How to write it | Example |
|---|---|---|
| Experiment name | Business scenario + AI change + release stage | RAG Knowledge Assistant Reranker Upgrade Canary |
| Decision owner | The person accountable for scale / stop | AI Platform PM and Customer Operations Director, jointly |
| Hypothesis | How the treatment changes behavior and outcome | The new reranker raises citation accuracy of policy answers, so agents finish complex cases faster without more wrong citations |
| Treatment | The system behavior that changes, stated explicitly | Use reranker v3 for eligible policy-search queries; answer model and UI unchanged |
| Control | Current champion | retriever v2 + reranker v1 |
| Population | Eligible traffic and exclusions | English customer-service policy queries, excluding complaint escalation, legal dispute, and monetary compensation cases |
| Randomization unit | Unit of assignment and rationale | agent-session, because multiple queries within one session share context |
| Exposure event | When the unit counts as treated | The agent sees an answer card backed by reranker v3 |
| Primary metric | One main success criterion | QA-verified accepted answer rate |
| Guardrails | Risk thresholds and actions | hallucinated citation is critical; p95 latency, empty retrieval, and policy mismatch are release guardrails |
| Analysis method | Fixed window, sequential, CUPED, multiple testing rules | 7-day fixed-window primary readout; sequential monitoring of latency and critical citation issues; CUPED using the agent's historical accepted answer rate |
| Sample / duration | Sample and duration plan | At least 18,000 exposed sessions, covering two working weeks and low-traffic weekends |
| Stop rule | Pre-registered stop conditions | Pause on 2 confirmed hallucinated sources or p95 latency more than 25% above SLO for 4 consecutive hours |
| Ramp plan | Traffic stages | shadow 3 days -> 5% -> 20% -> 50% -> release review |
| Evidence | Evidence package needed for the decision | design doc, eval report, shadow diff, SRM check, scorecard, QA sample, release memo |
10.2 Metric Tree
| Layer | Metric | Definition | Decision use |
|---|---|---|---|
| North Star | Quality-adjusted knowledge task success | The agent accepts the answer and QA sampling finds no policy error | scale decision |
| Primary | QA-verified accepted answer rate | Share of accepted answers that pass QA | Main experiment judgment |
| Secondary | time to answer | Time from query to answer accepted | Productivity gain |
| Secondary | search refinement rate | Share of cases in which the query is rewritten within the same case | retrieval friction |
| Guardrail | hallucinated citation rate | The citation does not exist, is unauthorized, or does not support the conclusion | hard release gate |
| Guardrail | stale policy source rate | Cites an expired policy or a superseded version | compliance gate |
| Guardrail | p95 latency | p95 time until the answer card is ready | UX / ops gate |
| Invariant | traffic split by language / queue / agent tenure | Consistency of the treatment and control distributions | Experiment trustworthiness |
| Diagnostic | retriever empty rate | Share of queries returning no candidate documents | Root-cause analysis |
| Diagnostic | judge-human disagreement | Share of cases where the LLM judge and QA reviewer disagree | eval calibration |
10.3 Guardrail Matrix
| Guardrail | Severity | Threshold | Detection | Action | Owner |
|---|---|---|---|---|---|
| Hallucinated citation | Critical | confirmed count >= 2 in canary | QA review + citation validator | pause treatment, run RCA, keep control | AI PM + Compliance |
| Stale policy source | High | treatment worse than control by 0.2pp and statistically credible | source effective-date check | freeze ramp, patch source filter | Knowledge owner |
| p95 latency | Medium | 25% above control for 4 consecutive hours | live telemetry | hold ramp, route high-latency traffic to control | Platform engineering |
| Empty retrieval | Medium | 10% relative increase vs control | retrieval logs | rollback reranker for affected query class | Search platform |
| Agent override spike | Medium | 15% relative increase in override on complex cases | UI event | segment review before scale | Operations |
| Complaint linkage | High | any confirmed customer complaint tied to wrong answer | complaint triage | limited go only after review | Customer risk |
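Two rows of the matrix translate directly into executable checks: the confirmed-count threshold for hallucinated citations and the "25% above control for 4 consecutive hours" latency rule. The sketch below assumes hourly p95 series aligned between treatment and control; the function names and sample numbers are hypothetical.

```python
def latency_breach(treatment_p95: list[float], control_p95: list[float],
                   ratio: float = 1.25, consecutive_hours: int = 4) -> bool:
    """True if treatment p95 exceeds ratio * control p95 for enough hours in a row."""
    run = 0
    for treatment, control in zip(treatment_p95, control_p95):
        run = run + 1 if treatment > ratio * control else 0
        if run >= consecutive_hours:
            return True
    return False


def guardrail_actions(confirmed_hallucinated_citations: int,
                      treatment_p95: list[float],
                      control_p95: list[float]) -> list[str]:
    """Return the matrix actions triggered by the current canary readings."""
    actions = []
    if confirmed_hallucinated_citations >= 2:
        actions.append("pause treatment, run RCA, keep control")
    if latency_breach(treatment_p95, control_p95):
        actions.append("hold ramp, route high-latency traffic to control")
    return actions


# Hypothetical hourly p95 latencies in milliseconds.
control = [800, 810, 790, 805, 800, 795]
treatment = [820, 1050, 1040, 1060, 1055, 900]
actions = guardrail_actions(1, treatment, control)
```

The run counter resets on any hour back under the threshold, so four breaching hours spread across a day do not trigger the rule.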
10.4 Ramp Plan
| Stage | Traffic | Entry criteria | Monitoring window | Exit criteria |
|---|---|---|---|---|
| Shadow | 0% customer / staff impact | offline eval passes all critical gates | 3 business days | no critical diff, latency within SLO, QA sample accepted |
| Canary 5% | eligible low-risk sessions | telemetry validated, shadow review approved | 24 hours | no critical guardrail, SRM pass, p95 stable |
| Canary 20% | eligible sessions excluding high-risk queues | 5% stage release review complete | 48 hours | primary trend non-negative, guardrails within appetite |
| Experiment 50% | full eligible population | sample size plan supports readout | 7-14 days | analysis locked and release memo ready |
| Scale | 100% eligible or targeted segment | release review approves scale | first 14 days after scale | daily guardrail review, rollback switch active |
10.5 Stop Rule
| Rule class | Concrete rule | Decision |
|---|---|---|
| Critical safety | confirmed hallucinated legal / policy citation count reaches 2 during canary | immediate pause, route to control, executive notification |
| Compliance | any PII exposure to unauthorized agent group | rollback, incident process, access control RCA |
| Operational | manual queue backlog increases 20% for 2 consecutive business days | freeze ramp, reduce treatment to previous stage |
| Statistical validity | SRM test fails on primary assignment unit | invalidate current readout, investigate assignment / logging before analysis |
| Business harm | primary metric negative and guardrail negative after minimum exposure | no-go, preserve learnings, open redesign decision |
| Sequential success | primary crosses pre-registered boundary and all critical guardrails pass | early release review, not automatic scale |
10.6 Sample Size / Variance Plan
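| Field | How to write it | Example |
|---|---|---|
| Primary metric baseline | Use the most recent stable window and exclude anomalous events | accepted answer rate baseline 62.0% from last 28 days |
| Minimum detectable effect | Smallest increment that matters to the business | +1.5pp, because a smaller gain would not cover model cost and operational review |
| Unit of analysis | Same as the randomization unit, or explain the aggregation | agent-session |
| Variance source | Flag the high-variance sources | agent tenure, queue, case complexity, weekday seasonality |
| CUPED covariate | Pre-treatment and correlated with the outcome | agent previous 28-day accepted answer rate and queue complexity mix |
| Expected variance reduction | Estimated from historical replay | Estimated 20-30% variance reduction; not used to change the guardrail critical threshold |
| Duration | Cover the business cycle | At least two working weeks, covering low-traffic weekends and month-end policy updates |
| Power caveat | State the risks that cannot be detected | critical safety is gated by event review, not by significance |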
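To show how baseline, minimum detectable effect, and variance reduction combine, here is a rough normal-approximation calculation for a two-arm test on a proportion: n per arm = 2 * (z_alpha + z_power)^2 * p * (1 - p) * (1 - variance reduction) / MDE^2. The alpha of 0.05 and power of 0.80 are assumptions, as is treating sessions as independent; sessions from the same agent are likely correlated, which would raise the requirement. The numbers are illustrative and are not meant to reproduce the session count in the design doc example, which may rest on different assumptions.

```python
import math
from statistics import NormalDist


def sessions_per_arm(baseline: float, mde: float, alpha: float = 0.05,
                     power: float = 0.80, variance_reduction: float = 0.0) -> int:
    """Approximate sample size per arm for a two-sided test on a proportion."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_power = normal.inv_cdf(power)
    # Bernoulli variance at the baseline, shrunk by the expected CUPED reduction.
    variance = baseline * (1 - baseline) * (1 - variance_reduction)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / mde ** 2)


# Baseline 62.0% and MDE +1.5pp from the plan; 25% is the midpoint of the
# 20-30% expected variance reduction.
n_raw = sessions_per_arm(baseline=0.62, mde=0.015)
n_cuped = sessions_per_arm(baseline=0.62, mde=0.015, variance_reduction=0.25)
```

Under these assumptions the required sample scales linearly with the remaining variance, so a 25% reduction cuts the per-arm requirement by about a quarter.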
10.7 Release Review Memo
| Section | Content | Example wording |
|---|---|---|
| Decision requested | Request go, limited go, no-go, rollback, or scale | Request to expand reranker v3 from the 50% experiment to 100% of English low-risk policy queries |
| Evidence summary | eval, shadow, online, guardrail | Offline groundedness +3.8pp, shadow shows no critical diff, online primary +1.7pp, guardrails pass |
| Validity checks | SRM, exposure, metric lineage, multiple testing | SRM pass; 94% assigned sessions exposed; primary pre-registered; segment analysis used for diagnosis |
| Risk position | Residual risk and compensating controls | High-complexity complaint queues stay on control because citation risk is near the yellow threshold |
| Operational readiness | training, support, rollback, on-call | rollback flag verified, support runbook active, QA sampling doubled for 14 days |
| Decision | The explicit conclusion | limited go: scale low-risk queues, keep high-risk queues on champion, review after 14 days |
| Accountability | Owner and review date | AI PM owns ramp, Knowledge owner owns source freshness, next review on 2026-07-13 |
10.8 Post-Experiment Decision Template
| Section | Decision record |
|---|---|
| Experiment conclusion | Reranker v3 improves QA-verified accepted answer rate in low-risk English policy queries without guardrail breach. |
| Decision | Scale low-risk English queues; keep complaint escalation and legal-sensitive queries on champion. |
| Causal confidence | Randomization and SRM checks passed; exposure rate high; CUPED-adjusted estimate consistent with unadjusted direction. |
| Where it works | Short policy lookup, billing FAQ, account maintenance cases, experienced agents. |
| Where it fails | Legal-sensitive complaints, outdated policy sources, cases requiring cross-product reasoning. |
| Product action | Add intent router, strengthen source effective-date filter, create separate experiment for high-risk queues. |
| Governance action | Update release gate: any RAG answer touching legal-sensitive policy requires citation freshness hard check. |
| Reusable learning | Offline groundedness predicted online QA direction, but latency guardrail was the best early indicator of agent rejection. |
11. 30-Day Training Plan
| Day | Training topic | Deliverable |
|---|---|---|
| 1 | Read the Kohavi OCE table of contents and the experimentation trustworthiness themes | 1-page experiment trustworthiness principles card |
| 2 | Break down Microsoft ExP platform capabilities | Experimentation platform capability map |
| 3 | Design the AI release state machine | Draft -> shadow -> canary -> experiment -> scale state diagram |
| 4 | Write an experiment design doc for the customer service Copilot | Complete design doc v1 |
| 5 | Build a metric tree for the customer service Copilot | Metric tree of primary, secondary, guardrail, invariant, and diagnostic metrics |
| 6 | Design SRM, assignment, and exposure logging | telemetry event spec |
| 7 | A/A test and instrumentation validation | A/A readiness checklist |
| 8 | CUPED practice | sample size / variance plan |
| 9 | Sequential testing practice | stop rule and interim look plan |
| 10 | Multiple testing practice | metric hierarchy and FDR / segment policy |
| 11 | Guardrail matrix practice | customer harm, compliance, ops, technical, financial guardrails |
| 12 | Shadow launch practice | champion-challenger shadow comparison memo |
| 13 | Ramp / rollback practice | 5-stage ramp plan and kill switch checklist |
| 14 | Release review drill | release review memo |
| 15 | RAG eval bridge | offline-to-online mapping table |
| 16 | RAG interleaving | retrieval / reranker interleaving design |
| 17 | Agent tool rollout | tool risk tiering and progressive autonomy plan |
| 18 | KYC extraction release | document-type segmented release plan |
| 19 | Payment fraud champion-challenger | fraud model release science brief |
| 20 | Recommendation strategy | interleaving + A/B + long-term holdout design |
| 21 | Network effects | cluster randomization / switchback decision memo |
| 22 | Causal decisioning | treatment, exposure, outcome, counterfactual map |
| 23 | Delayed outcomes | mature outcome window and proxy metric policy |
| 24 | Risk-based release gate | tiered gate matrix for six financial retail use cases |
| 25 | Evidence binder | design、eval、analysis、approval、incident evidence index |
| 26 | Post-experiment decision | scale / stop / segment / redesign decision record |
| 27 | Portfolio learning | experiment learning repository taxonomy |
| 28 | Executive communication | 1-page executive release memo |
| 29 | Interview drill | 30-second and 2-minute answers to 6 senior-level interview questions |
| 30 | Portfolio integration | AI Experimentation Platform case portfolio pack |
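12. Interview Question Preparation
Q1: How would you design an AI experimentation platform?
30-second version: I would not design it as a plain A/B dashboard but as a release science control plane: assignment, exposure, metrics catalog, eval bridge, analysis engine, feature gate, ramp / rollback, release review, and evidence binder. What is special about AI is that offline eval can only prove component behavior, while only an online controlled experiment can prove incremental effect in the real workflow, so the platform must connect the two.
2-minute version: I would first define the core control objects: experiment, AI component, randomization unit, exposure, metric contract, guardrail, release gate, and decision record. Architecturally, an assignment service guarantees stable traffic splitting, an exposure tracker records which units were actually affected by the AI, a metrics catalog manages primary / guardrail / invariant metrics, an analysis engine runs SRM, CUPED, sequential, and multiple-testing analysis, and a feature gate handles ramp and rollback. For LLM / RAG / Agent releases there must also be an EvalOps bridge that links the golden set, shadow traces, online outcomes, and post-release monitoring. On governance, every high-risk release must have a design doc, sample plan, stop rule, release memo, and post-experiment decision.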
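Q2: If the offline eval score improves, why is an online experiment still needed?
30-second version: Offline eval shows that the new version performs better on known samples, but it cannot show that employees will adopt it, that customer experience will improve, or that process cost will fall, nor can it fully expose distribution shift and human-AI collaboration risk. An online experiment estimates the incremental effect from real exposure against a counterfactual.
2-minute version: For LLM / RAG / Agent, offline eval is a release gate, not ROI evidence. For example, RAG groundedness may improve while higher latency causes agents not to adopt the answers online; a more complete customer service draft may also increase incorrect commitments; a fraud model may have a higher AUC but a higher false-decline cost. So I would run offline regression and shadow first, then small-traffic controlled exposure, and judge on the primary business outcome and the guardrails jointly. The conclusion is not necessarily full rollout; it may be a segment-specific rollout, a router, limited go, or no-go.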
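Q3: How is CUPED used in AI experiments?
30-second version: CUPED uses pre-treatment covariates to reduce variance and increase experiment sensitivity. AI scenarios can use pre-experiment data such as agent historical QA scores, customer historical purchase propensity, or merchant historical fraud rate, but must not use any variable that the treatment can affect.
2-minute version: In the sample size plan I would first fix the primary metric baseline and the minimum meaningful effect, then use historical data to estimate the correlation between covariate and outcome. A customer service Copilot can use each agent's AHT and QA over the past 28 days; RAG can use the agent's historical accepted answer rate; payment fraud can use the merchant's historical risk level. CUPED can shorten the experiment, but it cannot fix faulty randomization, SRM, interference, or multiple comparisons. High-risk guardrails are still managed against their critical thresholds and are not relaxed because of a CUPED adjustment.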
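The CUPED adjustment described in this answer can be sketched in a few lines. The example below computes a single pooled coefficient theta = cov(X, Y) / var(X) and subtracts theta times the centered pre-period covariate from each outcome. The data is synthetic and every name and number in it is a hypothetical stand-in, not a result from a real experiment.

```python
import random
from statistics import mean, variance

def cuped_adjust(outcomes, covariates):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(covariates), mean(outcomes)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(covariates, outcomes)) / (len(outcomes) - 1)
    theta = cov_xy / variance(covariates)
    return [y - theta * (x - x_bar) for x, y in zip(covariates, outcomes)]

# Synthetic data (hypothetical): pre-period rate x, in-experiment outcome y.
rng = random.Random(7)
pre = [rng.gauss(0.62, 0.08) for _ in range(2000)]
arm = [i % 2 for i in range(2000)]  # 0 = control, 1 = treatment
post = [x + 0.015 * a + rng.gauss(0, 0.04) for x, a in zip(pre, arm)]

adjusted = cuped_adjust(post, pre)  # theta is pooled across both arms

def arm_diff(values):
    """Difference in means, treatment minus control."""
    treated = [v for v, a in zip(values, arm) if a == 1]
    control = [v for v, a in zip(values, arm) if a == 0]
    return mean(treated) - mean(control)

raw_effect, cuped_effect = arm_diff(post), arm_diff(adjusted)
```

Because the covariate is centered, the adjustment leaves the overall mean unchanged and only shrinks the variance. The covariate must be measured strictly before exposure, which is the discipline the answer above insists on.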
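Q4: What is the difference between sequential testing and looking at results every day?
30-second version: Casually checking results every day inflates the false positive rate. Sequential testing declares the interim looks and boundaries in advance, which allows early stopping while controlling the error rate.
2-minute version: I would use sequential testing for guardrail monitoring during high-risk ramps or for experiments that are expensive to run. The design must spell out the look frequency, stopping boundaries, minimum sample, and success / harm / futility rules. For example, in an Agent tool rollout, an unauthorized action is an immediate stop, while latency and error rate can be monitored sequentially; the final business benefit still needs a mature outcome window. The key is not to switch interpretation after the fact between a fixed-horizon and a sequential method.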
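One way to make "pre-declared interim looks and boundaries" concrete is an alpha spending function. The sketch below uses the Lan-DeMets O'Brien-Fleming-type spending function as an illustrative choice; it is an assumption here, not the method this playbook or any particular vendor prescribes. The look schedule is hypothetical.

```python
from statistics import NormalDist

def obf_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent (Lan-DeMets O'Brien-Fleming-type)."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / information_fraction ** 0.5))

# Pre-registered interim looks as fractions of the planned sample (hypothetical).
looks = [0.25, 0.50, 0.75, 1.00]
cumulative = [obf_alpha_spent(t) for t in looks]
# Alpha available at each look is the increment over the previous look.
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
```

This schedule spends very little alpha at early looks and reserves most of it for the final analysis, so the total stays at the planned level. Converting the spend into exact z-boundaries requires numerical integration over the joint distribution of the interim statistics, which is not shown here.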
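Q5: How do you design guardrail metrics for an AI release?
30-second version: Guardrails should cover customer harm, compliance, operations, technical, financial, and fairness risk. They are not supplementary observations but a release contract; some critical guardrails trigger pause or rollback as soon as they are breached.
2-minute version: I would first build a matrix by risk path: this AI might make incorrect commitments, leak PII, wrongly decline payments, miss KYC fields, call tools incorrectly, or add queue pressure. Each guardrail needs a severity, threshold, detection, action, and owner. Customer service Copilot guardrails include complaints, reopens, and incorrect commitments; RAG includes hallucinated citation and stale source; fraud includes false decline and manual queue; Agent includes unauthorized action and duplicate action. High-risk guardrails are not offset by an average score; they directly affect the release decision.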
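The idea that a critical guardrail cannot be averaged away can be expressed as a small decision rule. The sketch below is a hypothetical data model, with the severity labels, metric names, thresholds, and owners chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    name: str
    severity: str     # "critical" or "major"
    threshold: float  # highest tolerated observed value
    action: str
    owner: str

def release_decision(guardrails, observed):
    """Any critical breach forces rollback; other breaches pause the ramp."""
    # A missing observation raises KeyError, so an unmeasured guardrail
    # can never pass silently.
    breached = [g for g in guardrails if observed[g.name] > g.threshold]
    if any(g.severity == "critical" for g in breached):
        return "rollback", breached
    return ("pause" if breached else "continue"), breached

matrix = [
    Guardrail("unauthorized_action_count", "critical", 0, "rollback", "security"),
    Guardrail("manual_queue_growth", "major", 0.20, "freeze ramp", "operations"),
]
decision, breached = release_decision(
    matrix, {"unauthorized_action_count": 1, "manual_queue_growth": 0.05})
```

In this example a single unauthorized action produces a rollback even though the operational guardrail is comfortably within its limit.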
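Q6: Why might recommendation or RAG ranking use interleaving?
30-second version: Interleaving mixes the results of two rankings within the same user session and uses implicit feedback to compare preference faster; it suits search, recommendation, and retrieval / reranking candidates.
2-minute version: A standard A/B test assigns users to different rankings and may need a lot of traffic to detect small differences. Interleaving compares two rankers in the same context, which is useful for RAG retrievers / rerankers, knowledge base search, and recommendation ranking. But it mainly answers ranking preference and cannot directly prove that the final answer is safe or valuable to the business. So I would use interleaving as an early screen and still validate downstream task success, complaint, latency, and fairness guardrails with a controlled experiment.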
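To show how two rankings are mixed in one session, here is a minimal sketch of team-draft interleaving, one well-known variant. The document ids and function names are hypothetical, and a production version would also need exposure logging and click attribution rules that are not covered here.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Build an interleaved list of up to k documents and record each pick's team."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team_of, picks = [], {}, {"A": 0, "B": 0}
    while len(interleaved) < k:
        remaining = {t: [d for d in r if d not in team_of]
                     for t, r in rankings.items()}
        if not remaining["A"] and not remaining["B"]:
            break
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        if not remaining[team]:
            team = "B" if team == "A" else "A"
        doc = remaining[team][0]  # that team's best document not yet shown
        interleaved.append(doc)
        team_of[doc] = team
        picks[team] += 1
    return interleaved, team_of

def credit_clicks(team_of, clicked_docs):
    """Credit each click to the ranker whose team contributed the document."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins

champion = ["d1", "d2", "d3", "d4"]
challenger = ["d3", "d1", "d5", "d6"]
shown, team_of = team_draft_interleave(champion, challenger, k=4,
                                       rng=random.Random(0))
```

Aggregating `credit_clicks` over many sessions gives a preference signal between the two rankers. As the answer above notes, that signal is about ranking preference only and does not replace the downstream controlled experiment.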
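Q7: If an experiment has network effects, how do you handle it?
30-second version: First identify the interference path, then change the randomization unit. Customer service clusters by team, recommendation by market or session, fraud by merchant / account or switchback, so that one unit's treatment does not affect another unit's outcome.
2-minute version: Network effects break the independence assumption. For example, customer service teams share Copilot scripts, recommendations change inventory and merchant exposure, and fraud policy changes attacker behavior. My approach is to write the interference path into the design doc and prefer cluster randomization, switchback, holdout, or market-level rollout, while logging contamination signals. In analysis I do not treat an ordinary user-level p-value as final evidence; I estimate the effect at the design unit and explain the loss of power.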
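Changing the randomization unit to a cluster can be as simple as hashing the cluster id instead of the individual id. The sketch below assumes a team-level cluster for a customer service Copilot; the experiment key, team names, and function name are hypothetical.

```python
import hashlib

def assign_cluster(cluster_id, experiment_key, treatment_share=0.5):
    """Deterministically assign a whole cluster (for example a team) to one arm."""
    digest = hashlib.sha256(f"{experiment_key}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2 ** 32  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical mapping from agent to team; agents inherit the team's arm.
agent_team = {"agent-1": "team-a", "agent-2": "team-a", "agent-3": "team-b"}
agent_arm = {agent: assign_cluster(team, "copilot-v2-ramp")
             for agent, team in agent_team.items()}
```

Every agent on the same team lands in the same arm, so shared scripts cannot leak treatment into control within a team. The effective sample size is then the number of teams, which is why the analysis has to be done at the team level and the power loss stated up front.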
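Q8: How do you run a champion-challenger release for a payment fraud model?
30-second version: Shadow the challenger first and log the action diff between champion and challenger, then expose it progressively in low-risk segments. The primary metric is net fraud loss avoided after false positive cost; the guardrails are false decline, VIP impact, manual queue, and chargeback delay.
2-minute version: A fraud model cannot be judged on AUC alone. The real decision has to weigh fraud loss, false declines, customer experience, operational queues, and attacker adaptation. I would run the challenger in shadow for a mature outcome window and put the action diff through manual and risk review; at canary it is first used for step-up or manual review, then for limited declines. I would keep a champion holdout and delayed outcome tracking. Any false decline rate above risk appetite or manual queue above capacity triggers a downgrade. The final outcome may not be replacing the champion but a segment-specific challenger or threshold policy.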
13. Reference Source Links
- Kohavi, Tang, Xu: Trustworthy Online Controlled Experiments: https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59
- Microsoft Research Experimentation Platform ExP: https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/
- Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data: https://robotics.stanford.edu/~ronnyk/2013-02CUPEDImprovingSensitivityOfControlledExperiments.pdf
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- Statsig Frequentist Sequential Testing: https://docs.statsig.com/experiments/advanced-setup/sequential-testing
- Eppo Experiment Protocols: https://docs.geteppo.com/quick-starts/analysis-integration/defining-protocols/
- Eppo Guardrail Cutoffs: https://docs.geteppo.com/data-management/organizing-metrics/guardrails/
- Optimizely False Discovery Rate Control: https://support.optimizely.com/hc/en-us/articles/4410283967245-False-discovery-rate-control
- Microsoft Azure Safe Deployment Practices: https://learn.microsoft.com/en-us/azure/well-architected/operational-excellence/safe-deployments
- Microsoft Research: Diagnosing Sample Ratio Mismatch in A/B Testing: https://www.microsoft.com/en-us/research/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/
- Large-scale validation and analysis of interleaved search evaluation: https://authors.library.caltech.edu/records/r3zrn-kd453
- OpenAI Evals guide: https://developers.openai.com/api/docs/guides/evals
- LaunchDarkly Guarded Rollouts: https://launchdarkly.com/docs/home/releases/creating-guarded-rollouts