AI Shadow Mode: Shadow Mode and Counterfactual Evaluation Architecture
AI Shadow Mode Architecture: Shadow Mode / Counterfactual Evaluation / Silent Launch
Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note, decision model, evidence pattern, interview asset
Why Shadow Mode Matters For AI Product/Architecture
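Shadow mode connects an AI system to real business context without letting it affect customers, employee operations, system state, or regulatory commitments. The question it answers is not “does the model score well on a test set” but one closer to architecture and product decisions:
| Decision question | Value of shadow mode | What should not substitute for it |
|---|---|---|
| Does the AI understand real business context | Validate with production-like inputs, policies, queues, permissions, latency, and missing data | A static golden set alone |
| What would change if the AI took part in decisions | Record the proposed decision, current champion decision, human decision, and subsequent outcome | Going straight to a live A/B test |
| Is risk concentrated in specific segments | Monitor fairness, false positive, override, appeal, and service-level differences without reaching customers | Discovering it from complaints after launch |
| Do human review and AI recommendations agree | Compare disagreements between analysts / underwriters / agents and the AI to build review calibration | Looking only at model accuracy |
| Can it move to pilot / assisted mode | Use an evidence packet to support go / limited go / no-go / redesign | Subjective judgment in meetings |
This note deliberately does not repeat online experimentation, UAT regression certification, release governance, or adoption analytics. It sits one layer earlier: pre-production and pre-decision architecture. In financial retail, this capability determines whether AI can move from demo into real decision workflows.
In one sentence: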
Shadow mode turns production-like traffic into decision evidence before the AI is allowed to change customer outcomes.
Concept Diagram
flowchart LR
A[Live business event] --> B[Current champion path]
A --> C[Shadow decisioning path]
B --> B1[Human / rules / existing model decision]
B1 --> B2[Actual action taken]
B2 --> B3[Delayed outcome / label]
C --> C1[Feature snapshot]
C1 --> C2[AI challenger decision]
C2 --> C3[Counterfactual event log]
B1 --> D[Decision comparison]
C2 --> D
B3 --> E[Outcome join]
C3 --> E
D --> F[Segment / fairness / control analysis]
E --> F
F --> G[Gate memo]
G --> H{Rollout decision}
H -->|No-go| I[Redesign / retrain / control update]
H -->|Limited go| J[Assisted mode with HITL]
H -->|Go| K[Controlled production rollout]
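Architecture boundaries:
- The champion path retains authority over business decisions.
- The challenger path may only observe, infer, record, and explain; it must not write business state.
- The counterfactual log must support replay and attribution after outcomes arrive with delay.
- The gate decision must weigh effectiveness, risk, explainability, operational readiness, and evidence quality together.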
Core Architecture Model
1. Core Components
| Component | Responsibility | Architecture decision |
|---|---|---|
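| Shadow router | Copies the minimum necessary input from real events to the challenger | Asynchronous, read-only, and rate-limitable by default; never blocks the champion path |
| Feature snapshot service | Freezes the features, policy version, and context version visible at decision time | Prevents outcome leakage and after-the-fact backfilling |
| Challenger decision service | Runs the AI model / RAG / agent / copilot logic | Outputs recommendation, confidence, reason, evidence, abstention |
| Counterfactual event store | Stores what the AI would have done, without executing it | append-only, versioned, immutable enough for audit |
| Champion decision capture | Captures the actual decision made by humans, rules, or the existing model | Records actor, policy, timestamp, override reason |
| Outcome joiner | Joins real outcomes back to shadow events once labels mature | Supports label delays of 7/30/60/90 days |
| Segment and fairness analyzer | Analyzes by customer, product, channel, region, language, and risk tier | Does not expose protected attributes directly to the runtime decision |
| Gate workbench | Aggregates metrics, disagreements, risks, controls, and evidence | Outputs go / limited go / no-go / rollback trigger |
| Evidence binder | Freezes data lineage, versions, evaluations, approvals, exceptions, and issue tickets | Supports model risk, internal audit, risk committee |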
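
The shadow router row above can be sketched as follows. This is a minimal illustration, assuming an in-process bounded queue stands in for whatever messaging layer is actually used; `ShadowRouter`, `handle_event`, and the field names are hypothetical and not part of the original design.

```python
import queue
from typing import Any, Callable, Dict, List, Set


class ShadowRouter:
    """Hypothetical router: copies a minimal view of each event into a bounded queue."""

    def __init__(self, allowed_fields: Set[str], max_pending: int = 1000):
        self.allowed_fields = allowed_fields
        self._pending: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # shadow copies shed under load

    def offer(self, event: Dict[str, Any]) -> bool:
        """Enqueue a copy holding only allowed fields; never block the caller."""
        minimal = {k: v for k, v in event.items() if k in self.allowed_fields}
        try:
            self._pending.put_nowait(minimal)
            return True
        except queue.Full:
            self.dropped += 1  # rate limit: drop the shadow copy, not the business event
            return False

    def drain(self, challenger: Callable[[Dict[str, Any]], Any]) -> List[Any]:
        """Run the challenger over queued copies, e.g. from a background worker."""
        results = []
        while True:
            try:
                item = self._pending.get_nowait()
            except queue.Empty:
                return results
            results.append(challenger(item))


def handle_event(event, champion, router):
    """The champion decides; the shadow copy is a side effect that cannot change the result."""
    decision = champion(event)
    router.offer(event)
    return decision
```

The design choice illustrated here is that a full queue sheds shadow traffic rather than slowing the champion path; the `dropped` counter would then feed the traffic report so the sampled population stays explainable.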
2. Four Maturity Levels Of Shadow Mode
| Level | Mode | What happens | Suitable decision |
|---|---|---|---|
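| L0 Replay | Offline historical replay | Re-run the challenger on historical cases | Whether it is worth moving to production-like traffic |
| L1 Silent shadow | Real events are copied; the AI affects no user or employee | Only the counterfactual log is written | Whether the AI can handle real inputs and workflows |
| L2 Human comparison | AI recommendations are visible to independent reviewers but not in the frontline workbench | Human experts are compared with the AI | Whether it can move to an assisted pilot |
| L3 Assisted silent launch | AI recommendations appear in the employee interface but are not executed by default; human confirmation is required | Usability, disagreement, and operational load are measured | Whether to delegate authority step by step or widen scope |
Key principles:
- Shadow output must never trigger customer notifications, price changes, credit limit changes, freezes, declines, collections actions, or SAR / regulatory filings.
- If employees need to see AI recommendations, state explicitly whether that will change human judgment. Otherwise the silent launch becomes a de facto pilot.
- Every analysis must distinguish “AI would have recommended” from “business actually did”.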
3. Decision Object
Shadow mode does not just record a score. Each AI decision should store at least:
| Field group | Examples |
|---|---|
| Business context | use case, product, channel, customer segment, workflow step, jurisdiction |
| Input snapshot | event payload hash, feature vector version, policy version, RAG corpus version, tool schema version |
| AI output | recommendation, score, confidence, reason codes, citations, abstain flag, proposed next action |
| Champion output | actual decision, human reviewer, rule/model id, override reason, action taken |
| Comparison | agreement, severity of disagreement, expected customer impact, review queue |
| Outcome | label type, label maturity date, outcome value, appeal/complaint/fraud loss/default indicator |
| Controls | leakage checks, fairness slice, PII minimization, retention class, reviewer calibration status |
| Evidence | trace id, model/prompt version, eval run id, gate memo id, approval or issue id |
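
One way to picture the decision object is an immutable record. The sketch below is illustrative only: `ShadowDecision`, `payload_hash`, `with_champion`, and the chosen subset of fields are assumptions drawn from the field groups above, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Optional, Tuple


def payload_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)  # frozen: a stored decision cannot be edited in place
class ShadowDecision:
    trace_id: str
    use_case: str
    payload_sha256: str
    feature_version: str
    policy_version: str
    model_version: str
    recommendation: Optional[str]
    confidence: Optional[float]
    reason_codes: Tuple[str, ...]
    abstained: bool
    champion_decision: Optional[str] = None  # captured separately, after the AI output

    def agrees_with_champion(self) -> Optional[bool]:
        """None when no comparison is possible (abstention or champion not yet captured)."""
        if self.abstained or self.champion_decision is None:
            return None
        return self.recommendation == self.champion_decision


def with_champion(decision: ShadowDecision, champion_decision: str) -> ShadowDecision:
    """Return a new record with the champion decision attached; the original is untouched."""
    return replace(decision, champion_decision=champion_decision)
```

Keeping the record frozen and attaching the champion decision as a new record mirrors the append-only intent of the counterfactual store.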
Counterfactual Logging And Evaluation Lifecycle
1. Lifecycle
use case intake
-> decision boundary and prohibited actions
-> feature and policy snapshot design
-> shadow router implementation
-> counterfactual event logging
-> daily quality and control checks
-> delayed label / outcome join
-> human comparison and disagreement review
-> segment / fairness scorecard
-> gate memo and rollout recommendation
-> evidence archive
2. Logging Model
Counterfactual logging must be able to answer three questions:
| Question | Required evidence | Failure if missing |
|---|---|---|
| What did AI know at decision time? | feature snapshot, context snapshot, policy version, retrieval corpus version | Results are flattered after the fact with future information |
| What would AI have done? | proposed decision, score, reason, confidence, action intent, abstention | Only the score is kept, so the business action cannot be determined |
| What actually happened later? | champion decision, customer outcome, label maturity, complaint/appeal/fraud/default result | Real risk or benefit cannot be estimated |
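
The third question depends on joining later outcomes back to shadow events. A minimal sketch, assuming events and outcomes share a `trace_id` key (a hypothetical field name) and fit in memory:

```python
from typing import Dict, List, Tuple


def join_outcomes(
    shadow_events: List[dict], outcomes: Dict[str, dict]
) -> Tuple[List[dict], float]:
    """Attach outcomes to shadow events by trace id and report the join rate."""
    joined = []
    for event in shadow_events:
        outcome = outcomes.get(event["trace_id"])
        if outcome is not None:
            # Build a new dict so the logged event itself is never modified.
            joined.append({**event, "outcome": outcome})
    join_rate = len(joined) / len(shadow_events) if shadow_events else 0.0
    return joined, join_rate
```

A low join rate is itself evidence: it signals that estimates of risk or benefit rest on a partial, possibly biased, subset of events.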
3. Outcome Delay Handling
In financial retail, many outcomes do not mature on the same day:
| Use case | Immediate proxy | Mature label | Recommended shadow window |
|---|---|---|---|
| Credit line management | underwriter agreement, utilization change | delinquency, loss, complaint, adverse action dispute | 60-180 days |
| AML alert triage | analyst disposition | SAR decision, QA finding, law enforcement feedback where available | 30-120 days |
| KYC onboarding | document verification result | fraud hit, synthetic identity signal, account closure reason | 30-180 days |
| Payment fraud intervention | authorization decision, customer confirmation | confirmed fraud, false decline, chargeback | 7-60 days |
| Collections contact strategy | agent agreement, contact success | cure, re-default, hardship complaint, contact violation | 30-120 days |
| Contact center agent assist | agent acceptance, QA score | complaint, repeat contact, resolution quality | 7-45 days |
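Outcome delay is not purely a data problem; it is a product decision problem. Gating too early overestimates speed and underestimates harm; gating too late slows learning. A mature approach splits the gate into:
- Readiness gate: input completeness, log quality, control completeness, preliminary agreement.
- Risk gate: high-risk disagreement, fairness slice, prohibited action, leakage, complaint-sensitive cases.
- Outcome gate: lift / harm / false positive / false negative / cost / operational burden after labels mature.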
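
For the outcome gate, only events whose labels have matured should count. The sketch below assumes the upper bound of each recommended shadow window is used as the maturity horizon; that choice, the `MATURITY_DAYS` mapping, and the event field names are illustrative assumptions.

```python
from datetime import date, timedelta
from typing import List

# Assumption: upper bound of the recommended shadow window, in days.
MATURITY_DAYS = {"payment_fraud": 60, "aml_alert_triage": 120, "credit_line": 180}


def label_maturity_date(use_case: str, decision_date: date) -> date:
    """Earliest date on which the outcome label is treated as mature."""
    return decision_date + timedelta(days=MATURITY_DAYS[use_case])


def outcome_gate_population(events: List[dict], as_of: date) -> List[dict]:
    """Keep only events whose labels are mature as of the evaluation date."""
    return [
        e for e in events
        if label_maturity_date(e["use_case"], e["decision_date"]) <= as_of
    ]
```

Readiness and risk gates can run on all events from day one; only the outcome gate needs this filter.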
4. Leakage Control
The most common source of false progress in shadow mode is leakage.
| Leakage type | Example | Control |
|---|---|---|
| Future outcome leakage | Using delinquency status 30 days later as a feature for the same-day line adjustment | decision-time feature store, snapshot timestamp enforcement |
| Human decision leakage | Feeding the underwriter's final decision into the challenger | separate champion capture after challenger output |
| Queue leakage | Shadowing only the good cases already screened by humans | route-level sampling and population definition |
| Label leakage | Training the same-day triage model on post-QA AML dispositions | label maturity registry and train/eval split by time |
| Policy leakage | Replaying old cases under the new policy but comparing against decisions made under the policy of the time | policy version lock and policy-era analysis |
| Reviewer leakage | A human reviewer sees the AI output and then produces an “independent” label | blind review protocol for comparison samples |
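
Snapshot timestamp enforcement, the control listed for future outcome leakage, could be checked roughly like this. The sketch assumes each feature carries an `available_at` timestamp, which is a hypothetical convention rather than something the note specifies.

```python
from datetime import datetime
from typing import Dict, List


def find_future_features(
    decision_time: datetime, feature_available_at: Dict[str, datetime]
) -> List[str]:
    """Return names of features that only became available after the decision time."""
    return sorted(
        name
        for name, available_at in feature_available_at.items()
        if available_at > decision_time
    )
```

Any non-empty result would mark the snapshot as leaked and exclude it from evaluation until the feature is removed or re-timestamped.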
Gate Criteria And Rollout Decision Model
1. Gate Layers
| Gate | Pass condition | Stop condition |
|---|---|---|
| Technical readiness | shadow path stable, trace complete, latency/cost within budget, no champion path impact | missing logs, champion slowdown, non-deterministic versioning |
| Data and leakage | decision-time snapshots complete, no future feature, sampling representative | leakage found in material feature or label |
| Business performance | challenger improves target metric or reduces manual burden without unacceptable harm | average lift comes from narrow or low-risk slice only |
| Risk and fairness | no critical segment regression, protected/proxy monitoring approved | false positive/negative disparity breaches threshold |
| Human comparison | disagreements explainable, SME review supports limited use | AI disagrees on high-severity cases without defensible reason |
| Operational readiness | human review capacity, escalation, fallback, monitoring, evidence owner ready | reviewers cannot absorb alerts or overrides |
| Governance evidence | use case card, data lineage, eval report, gate memo, risk acceptance complete | decision cannot be reconstructed |
2. Decision Model
| Result | Meaning | Action |
|---|---|---|
| No-go | AI is not fit for workflow exposure | redesign model/process, fix data, repeat shadow |
| Continue shadow | evidence incomplete or outcome labels immature | extend window, narrow sample, improve logging |
| Limited go | fit for assisted mode with human approval and tight monitoring | expose suggestions to trained users, no autonomous action |
| Conditional go | fit for narrow segment or low-risk decision | feature flag by segment/channel/product, monitor guardrails |
| Rollout go | fit for controlled production rollout | staged ramp with rollback triggers and audit evidence |
| Decommission challenger | AI adds no value or introduces unmanaged risk | archive evidence, close initiative, record lessons |
3. Rollback Triggers Before Full Rollout
Rollback does not apply only to production releases. The shadow / silent launch stage also needs stop rules:
- Trace completeness below agreed threshold for two consecutive business days.
- Challenger produces prohibited actions, unauthorized tool intent, or customer-impacting recommendation outside approved scope.
- Material leakage discovered in feature, label, prompt context, or human comparison protocol.
- High-severity disagreement rate exceeds threshold in credit, fraud, AML, KYC or vulnerable customer slices.
- Fairness scorecard shows unexplained false positive / false negative disparity in protected or proxy segments.
- Reviewer queue load exceeds planned capacity and delays existing controls.
- Evidence binder cannot reconstruct model, prompt, data, policy and decision versions.
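
The first stop rule above (trace completeness below threshold for two consecutive business days) can be expressed as a small check. This is a sketch that assumes the input already contains one value per business day in chronological order; the function name and any threshold value are illustrative.

```python
from typing import Sequence


def trace_stop_triggered(daily_completeness: Sequence[float], threshold: float) -> bool:
    """True if completeness fell below the threshold on two consecutive business days."""
    below = [value < threshold for value in daily_completeness]
    return any(first and second for first, second in zip(below, below[1:]))
```

For example, with a hypothetical threshold of 0.98, the series `[0.99, 0.97, 0.96]` triggers the rule, while `[0.97, 0.99, 0.97]` does not because the breaches are not consecutive.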
Financial Retail Scenarios
1. Credit Line Management
Shadow question: if AI recommended line increase / decrease / hold, would it improve risk-adjusted growth without unfair treatment or adverse action inconsistency?
| Architecture object | Design |
|---|---|
| Champion | existing credit policy, scorecard, underwriter override |
| Challenger | AI line-management recommender with reason-code constraints |
| Counterfactual action | increase, decrease, freeze, keep, manual review |
| Delayed label | utilization, delinquency, charge-off, complaint, dispute, attrition |
| Critical controls | no adverse action leakage, reason-code consistency, fair lending segment analysis |
| Gate blocker | AI recommends line decrease for protected/proxy segment at materially higher rate without justified risk signal |
2. AML Alert Triage
Shadow question: can AI triage alerts, summarize rationale, and recommend priority without missing suspicious activity or creating analyst automation bias?
| Architecture object | Design |
|---|---|
| Champion | current rules, analyst disposition, QA review |
| Challenger | alert severity ranker + narrative copilot |
| Counterfactual action | close, escalate, request enhanced review, priority rank |
| Delayed label | SAR decision, QA defect, reopened case, typology hit |
| Critical controls | blind SME sample, typology coverage, reviewer calibration, audit narrative trace |
| Gate blocker | AI under-escalates high-risk typology or de-prioritizes vulnerable jurisdiction slice without explanation |
3. KYC Onboarding
Shadow question: can AI reduce manual review and detect synthetic identity risk without discouraging legitimate applicants?
| Architecture object | Design |
|---|---|
| Champion | identity verification vendor, KYC rules, operations review |
| Challenger | document / entity / risk signal synthesis model |
| Counterfactual action | approve, reject, request document, enhanced due diligence |
| Delayed label | fraud confirmation, account closure, AML hit, customer complaint |
| Critical controls | no direct customer messaging in shadow, document provenance, bias by language/geography |
| Gate blocker | false reject concentration by document type, language, country corridor, or accessibility need |
4. Payment Fraud Intervention
Shadow question: would AI intervene on risky payments with lower fraud loss and fewer false declines?
| Architecture object | Design |
|---|---|
| Champion | rules/model authorization and fraud queue |
| Challenger | real-time fraud intervention recommender |
| Counterfactual action | allow, step-up, hold, decline, manual review |
| Delayed label | confirmed fraud, chargeback, customer confirmation, complaint |
| Critical controls | latency budget, false decline harm, scam typology evidence, customer vulnerability signal |
| Gate blocker | fraud savings depend on unacceptable false decline rate for payroll, benefit, remittance or vulnerable customer slices |
5. Collections Contact Strategy
Shadow question: would AI choose a better contact channel, timing and treatment while respecting hardship, consent and conduct risk?
| Architecture object | Design |
|---|---|
| Champion | current collections segmentation and dialer strategy |
| Challenger | treatment optimizer with vulnerability and consent guardrails |
| Counterfactual action | call, SMS, email, letter, hardship route, no contact |
| Delayed label | cure, promise kept, complaint, re-default, contact violation |
| Critical controls | consent, contact frequency, vulnerability escalation, conduct-risk QA |
| Gate blocker | AI increases pressure on vulnerable customers or repeats contact near legal limits |
6. Contact Center Agent Assist
Shadow question: can a copilot suggest policy-grounded answers and next best actions without misleading agents or changing regulated communications?
| Architecture object | Design |
|---|---|
| Champion | agent judgment, knowledge base, QA scorecard |
| Challenger | RAG copilot / summarizer / next-action recommender |
| Counterfactual action | suggested answer, citation, escalation, after-call summary |
| Delayed label | QA score, repeat contact, complaint, resolution, supervisor correction |
| Critical controls | citation grounding, no policy invention, agent independence sample, coaching readiness |
| Gate blocker | AI produces fluent but uncited policy advice in complaint-sensitive or regulated product cases |
Metrics/Control/Evidence Model
1. Metrics
| Metric group | Examples | Decision use |
|---|---|---|
| Decision agreement | champion/challenger agreement, severity-weighted disagreement, SME upheld rate | Whether to enter assisted mode |
| Counterfactual performance | expected loss avoided, fraud captured, false decline avoided, manual queue reduction | Whether there is business value |
| Risk and harm | false positive/negative by slice, complaint proxy, adverse action inconsistency | Whether customer harm exists |
| Fairness | selection rate, false positive disparity, false negative disparity, calibration by segment | Whether it is explainable and controllable |
| Operational | latency, cost per shadow event, queue load, reviewer time, abstention rate | Whether it is operable |
| Evidence quality | trace completeness, version reconstructability, outcome join rate, missingness | Whether it is auditable |
| Human comparison | independent reviewer agreement, override rationale quality, automation-bias signal | Whether it can be handed to frontline use |
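
As an illustration of the fairness row, false positive disparity can be computed per segment from joined shadow records. The record shape `(segment, predicted_positive, actual_positive)` and the max-minus-min disparity measure are assumptions made for this sketch; real scorecards may use ratios or other definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rate_by_segment(
    records: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """FPR per segment = false positives / actual negatives in that segment."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for segment, predicted_positive, actual_positive in records:
        if not actual_positive:
            negatives[segment] += 1
            if predicted_positive:
                false_positives[segment] += 1
    # Segments with no actual negatives are omitted rather than reported as 0.
    return {s: false_positives[s] / negatives[s] for s in negatives}


def max_disparity(rates: Dict[str, float]) -> float:
    """Largest absolute gap between any two segment rates."""
    return max(rates.values()) - min(rates.values()) if rates else 0.0
```

A segment-level hard gate would then compare `max_disparity` against an agreed threshold instead of relying on the overall average.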
2. Control Model
| Control | Purpose | Evidence |
|---|---|---|
| Read-only runtime permissions | Ensure the challenger does not affect customers or system state | service account policy, tool deny-list, write-attempt logs |
| Decision-time snapshot | Prevent leakage of future information | feature snapshot hash, timestamp, feature availability contract |
| Population definition | Prevent shadowing only flattering samples | sampling plan, inclusion/exclusion rules, traffic report |
| Blind human review | Obtain independent comparison labels | reviewer assignment, hidden AI flag, calibration report |
| Segment monitoring | Detect concentrated harm and fairness issues | segment scorecard, threshold breach log |
| Outcome maturity registry | Manage label delay and keep outcome definitions consistent | label plan, maturity dates, join rate |
| Evidence binder | Make gate decisions traceable | decision memo, run ids, lineage, approvals, issue records |
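
The read-only runtime permission control can be sketched as a tool sandbox that executes only registered read tools and records everything else as a denied intent. Note that this sketch uses an allow-list, a stricter variant of the deny-list named in the table; `ReadOnlyToolSandbox` and the tool names in the usage note are hypothetical.

```python
from typing import Any, Callable, Dict, List


class ReadOnlyToolSandbox:
    """Hypothetical sandbox: runs allow-listed read tools, logs every other tool intent."""

    def __init__(self, read_tools: Dict[str, Callable[..., Any]]):
        self.read_tools = dict(read_tools)
        self.denied_intents: List[dict] = []  # write-attempt log for the evidence binder

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        """Execute a read tool, or record the intent and return None without executing."""
        if tool_name in self.read_tools:
            return self.read_tools[tool_name](**kwargs)
        self.denied_intents.append({"tool": tool_name, "args": kwargs})
        return None
```

A challenger that asks for, say, a `freeze_account` tool gets `None` back, and the attempt lands in `denied_intents`, which can feed write-attempt alerting.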
3. Evidence Packet
A packet ready for governance review should contain:
- Use case and decision boundary.
- Customer and employee impact statement.
- Champion/challenger architecture diagram.
- Data, feature, prompt, RAG, model and policy version lineage.
- Counterfactual schema and sample trace.
- Leakage assessment and remediation record.
- Human comparison protocol and calibration results.
- Outcome label plan and maturity analysis.
- Segment/fairness scorecard.
- Gate recommendation with residual risks and controls.
- Rollout limits, rollback triggers and owner map.
- Audit-ready evidence index.
Anti-Patterns And Failure Modes
| Anti-pattern | Why it fails | Better pattern |
|---|---|---|
| “Shadow mode” that writes system state | It already affects customers and cannot be called silent | strict read-only permissions and write-attempt alerting |
| Only logging scores | Cannot explain action, reason, confidence, authority | log full decision object |
| Comparing AI to contaminated human labels | reviewer already saw AI output | blind review or independent SME calibration |
| Declaring success before outcome maturity | Short-term proxies mask harm | staged gate by immediate, risk and mature outcome |
| Average lift hides segment harm | Harm to small groups is masked by overall gains | segment-level hard gates |
| No leakage registry | Historical replay and live shadow cannot be compared | decision-time snapshot and leakage control table |
| Silent launch without operations | reviewers cannot handle disagreements | queue capacity and escalation model |
| No abstention design | The AI is forced to give recommendations on uncertain cases | abstain / escalate as first-class outcome |
| Treating challenger as model-only | RAG, prompt, tool, policy and workflow also change behavior | full AI object versioning |
| Evidence after the fact | audit cannot reconstruct decision | event-first evidence architecture |
Architecture Mapping To RAG / Agent / Copilot / Eval / Governance
| Architecture area | Shadow mode mapping | Key control |
|---|---|---|
| RAG | Log retrieved chunks, corpus version, citation support, no-answer handling | citation accuracy and retrieval coverage by scenario |
| Agent | Log planned tool calls, denied writes, authority boundary, approval path | read-only tool sandbox and action intent classification |
| Copilot | Compare AI suggestion with human final response or action | automation-bias sampling and agent override rationale |
| Eval | Convert shadow disagreements and failures into golden set and regression cases | production-derived eval case registry |
| Governance | Link every gate decision to use case, model version, data lineage, risk acceptance | evidence binder and decision memo |
| Observability | Trace event from business trigger to challenger output to outcome join | OpenTelemetry-style trace ids and metrics |
| Model risk | Support independent challenge with champion/challenger analysis | validation-ready logs and segment reports |
| Product architecture | Define when AI is advisor, recommender, ranker or autonomous actor | decision authority matrix |
ADR Draft
| Field | Content |
|---|---|
| ADR title | Adopt shadow mode and counterfactual event logging before exposing AI to customer-impacting financial retail decisions |
| Status | Proposed for high-impact AI use cases |
| Context | Credit, AML, KYC, fraud, collections and agent-assist AI systems can change eligibility, intervention, escalation, customer treatment or employee judgment. Offline evaluation alone cannot prove readiness because real workflow context, delayed outcomes, segment risk and human comparison are missing. |
| Decision | Implement a read-only shadow decisioning architecture that captures champion decisions, challenger outputs, decision-time feature/context snapshots, delayed outcomes, human comparison, segment/fairness analysis and audit evidence before assisted or production rollout. |
| Alternatives | Offline-only evaluation; immediate pilot with human oversight; A/B testing in production; manual SME review without production-like traffic. |
| Rationale | Shadow mode provides production-similar evidence without customer impact, supports leakage control and delayed outcome learning, and gives governance teams a reconstructable basis for go / limited go / no-go decisions. |
| Consequences | Requires event logging, feature snapshot discipline, outcome join, reviewer capacity and evidence ownership. It delays broad rollout but reduces unmanaged customer, regulatory, model and operational risk. |
| Guardrails | No write permissions, no customer communication, no autonomous decision, explicit leakage registry, segment hard gates, rollback triggers, evidence binder. |
| Success criteria | Complete traceability, stable shadow operations, no material leakage, acceptable high-severity disagreement rate, fair segment scorecard, mature outcome support, operational readiness and governance approval. |
Interview Answer
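30-Second Version
Shadow mode lets the AI read real business events and record “what it would have done if it were deciding”, without affecting customers or business state. I would record the champion decision, AI challenger output, decision-time features, delayed outcomes, human review disagreements, segment fairness, and audit evidence, then use gates to decide whether to move to assisted mode or a controlled rollout. The point is not to produce a model score but to prove the AI's decision boundary, risk, and operational readiness while customers are still unaffected.
2-Minute Version
I would first define the decision boundary: is the AI recommending credit lines, ranking AML alerts, recommending KYC handling, intercepting payments, recommending collections strategy, or assisting contact center answers? Then I would design a read-only shadow path so that real business events flow into both the existing champion process and the AI challenger. The champion still makes the actual decision; the challenger may only output a recommendation, confidence, reasons, citations, and whether it abstains.
The core is the counterfactual event log. It must store the decision-time feature snapshot, policy version, RAG corpus, and model/prompt/tool versions, plus the actual human or rule decision. Once outcomes mature (for example fraud confirmation, delinquency, SAR result, complaint, QA score), I join them back and analyze what benefit or harm the AI would have produced.
Gates cannot rely on average accuracy alone. I would look at high-severity disagreement, false positive/negative, fairness slices, leakage, human review agreement, trace completeness, operational queue load, and rollback triggers. If evidence is insufficient, continue shadow; if low-risk scenarios are stable, move first to human-approved assisted mode; if segment harm or leakage appears, call no-go and go back to fix data, model, or process.
CTO Version
I would treat shadow mode as a pre-decision control plane, not a one-off test activity. Architecturally, as a production event enters the existing champion path, the minimum necessary context is copied to a read-only challenger path. Any tool intent from the challenger is recorded in a sandbox but not executed; every output carries a trace id, model/prompt/RAG/policy/tool versions, and a decision-time feature snapshot.
The data layer needs an append-only counterfactual store, outcome joiner, segment analyzer, and evidence binder. The control layer needs a leakage registry, population sampling plan, blind human comparison, fairness hard gates, operational readiness, and rollback triggers. This lets us answer, without changing customer outcomes: in which scenarios the AI improves decisions, in which it creates customer harm, whether outcome delay changes the conclusion, whether frontline teams can absorb the load, and whether audit can reconstruct the full chain of judgment.
I would not let high-impact financial AI jump straight from offline eval to a customer-impacting release. A sound path is replay -> silent shadow -> human comparison -> assisted mode -> narrow rollout. Each layer needs a gate memo and a residual risk decision. That way the CTO can explain to risk, audit, and the business: we are not launching on the strength of a demo; we are delegating authority step by step based on production-like evidence.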
7-Day Practice Plan
| Day | Practice | Output |
|---|---|---|
| 1 | Pick one financial retail use case and define the champion, challenger, decision boundary, and prohibited actions | Shadow use case card |
| 2 | Design the counterfactual event schema, including feature snapshot, AI output, champion output, trace, and outcome plan | Event schema table |
| 3 | Write a leakage control matrix covering future outcome, human label, queue selection, and policy version | Leakage register |
| 4 | Design a human comparison protocol, including blind review, SME calibration, and disagreement severity | Review protocol |
| 5 | Build a segment/fairness scorecard covering false positive/negative, agreement, outcome, and complaint proxy | Segment scorecard |
| 6 | Write a rollout gate memo with criteria for no-go / continue shadow / limited go / rollout go | Gate decision memo |
| 7 | Combine everything into a portfolio artifact and walk through architecture, risk, evidence, and decision using the CTO version | 5-minute interview narrative |
Source Anchors
| Source | Link | How this note uses it |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize the risk identification, assessment, treatment, and evidence language of shadow mode. |
| NIST AI RMF Resources and TEVV | https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources | Uses test, evaluation, verification and validation thinking to support counterfactual evaluation, measurement, and independent challenge. |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Uses the operation, performance evaluation, and continual improvement framing of an AI management system to organize the operating model. |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | Uses AI risk management vocabulary to support risk identification, risk treatment, and monitoring. |
| Google Rules of Machine Learning | https://developers.google.com/machine-learning/guides/rules-of-ml | References ML systems engineering principles on pre-launch checks, monitoring, data, and training/serving consistency. |
| DORA metrics | https://dora.dev/ | Serves as the engineering governance anchor for delivery reliability, change quality, and rollback/restore thinking, without reducing shadow mode to a release-velocity metric. |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Serves as the observability anchor for trace, metric, log, and context propagation, supporting event-to-outcome tracing. |