AI Shadow Mode: Shadow Mode and Counterfactual Evaluation Architecture

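Shadow mode connects an AI system to real business context without letting it affect customers, employee actions, system state, or regulatory commitments. The question it answers is not "does the model score well on a test set" but a set of questions closer to architecture and product decisions.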

AI Shadow Mode Architecture: Shadow Mode / Counterfactual Evaluation / Silent Launch

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note, decision model, evidence pattern, interview asset

Why Shadow Mode Matters For AI Product/Architecture

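Decision question	Value of shadow mode	Poor substitute
Does the AI understand the real business context?	Validates against production-like inputs, policies, queues, permissions, latency, and missing data	A static golden set alone
What would change if the AI took part in the decision?	Records the proposed decision, the current champion decision, the human decision, and the subsequent outcome	Going straight to a live A/B test
Is risk concentrated in specific segments?	Monitors fairness, false positives, overrides, appeals, and service-level differences without reaching customers	Finding out from complaints after launch
Do human review and AI recommendations agree?	Compares analyst / underwriter / agent decisions with the AI's and turns the disagreements into review calibration	Looking only at model accuracy
Is the system ready for pilot / assisted mode?	Backs go / limited go / no-go / redesign with an evidence packet	Subjective judgment in a meeting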

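This note deliberately does not repeat online experimentation, UAT regression certification, release governance, or adoption analytics. It sits one layer earlier: pre-production and pre-decision architecture. In financial retail, this layer determines whether AI can move from a demo into a real decision workflow.

In one sentence: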

Shadow mode turns production-like traffic into decision evidence before the AI is allowed to change customer outcomes.

Concept Diagram

flowchart LR
  A[Live business event] --> B[Current champion path]
  A --> C[Shadow decisioning path]

  B --> B1[Human / rules / existing model decision]
  B1 --> B2[Actual action taken]
  B2 --> B3[Delayed outcome / label]

  C --> C1[Feature snapshot]
  C1 --> C2[AI challenger decision]
  C2 --> C3[Counterfactual event log]

  B1 --> D[Decision comparison]
  C2 --> D
  B3 --> E[Outcome join]
  C3 --> E
  D --> F[Segment / fairness / control analysis]
  E --> F
  F --> G[Gate memo]
  G --> H{Rollout decision}
  H -->|No-go| I[Redesign / retrain / control update]
  H -->|Limited go| J[Assisted mode with HITL]
  H -->|Go| K[Controlled production rollout]

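Architecture boundaries:

The champion path retains authority over the business decision.
The challenger path may only observe, infer, record, and explain; it must not write business state.
The counterfactual log must support replay and attribution after delayed outcomes arrive.
The gate decision must weigh effectiveness, risk, explainability, operational readiness, and evidence quality together.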
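A minimal sketch of these boundaries, assuming a single-process setting. The names `route_event`, `champion`, `challenger`, and `counterfactual_log` are hypothetical, and the call is synchronous for brevity, whereas a real shadow router would normally run the challenger asynchronously so it cannot slow the champion path.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def route_event(
    event: Event,
    champion: Callable[[Event], str],
    challenger: Callable[[Event], str],
    counterfactual_log: List[Dict[str, Any]],
) -> str:
    """Return the champion decision; record the challenger proposal only."""
    # The champion path always owns the business decision.
    actual = champion(event)

    # The challenger receives a shallow copy, so it cannot rebind keys on the live event.
    try:
        proposed = challenger(dict(event))
        record = {"event_id": event["event_id"], "proposed": proposed, "error": None}
    except Exception as exc:
        # A challenger failure is logged and must never reach the champion path.
        record = {"event_id": event["event_id"], "proposed": None, "error": repr(exc)}

    # The proposal is appended to the log and never executed.
    counterfactual_log.append(record)
    return actual
```

The function returns only the champion result, so a caller has no way to act on the challenger output by accident.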

Core Architecture Model

1. Core Components

Component	Responsibility	Architecture decision
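Shadow router	Copies the minimum necessary input from live events to the challenger	Asynchronous, read-only, and rate-limitable by default; never blocks the champion path
Feature snapshot service	Freezes the features, policy version, and context version visible at decision time	Prevents outcome leakage and after-the-fact backfilling
Challenger decision service	Runs the AI model / RAG / agent / copilot logic	Outputs recommendation, confidence, reason, evidence, abstention
Counterfactual event store	Stores what the AI would have done, without executing it	append-only, versioned, immutable enough for audit
Champion decision capture	Captures the actual decision made by a human, rule, or existing model	Records actor, policy, timestamp, override reason
Outcome joiner	Joins real outcomes back to shadow events once labels mature	Supports 7/30/60/90-day label delays
Segment and fairness analyzer	Analyzes by customer, product, channel, region, language, and risk tier	Does not expose protected attributes directly to the runtime decision
Gate workbench	Aggregates metrics, disagreements, risks, controls, and evidence	Outputs go / limited go / no-go / rollback trigger
Evidence binder	Freezes data lineage, versions, evaluations, approvals, exceptions, and issue tickets	Supports model risk, internal audit, risk committee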

2. Four Maturity Levels of Shadow Mode

Level	Mode	What happens	Suitable decision
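L0 Replay	Offline historical replay	Re-runs the challenger on historical cases	Whether production-like traffic is worth pursuing
L1 Silent shadow	Live events are copied; the AI affects no user or employee	Writes only to the counterfactual log	Whether the AI handles real inputs and workflows
L2 Human comparison	AI recommendations are visible to independent reviewers but not in the frontline workbench	Compares human experts with the AI	Whether an assisted pilot is warranted
L3 Assisted silent launch	AI recommendations appear in the employee UI, are not executed by default, and require human confirmation	Measures usability, disagreement, and operational load	Whether to delegate more authority or widen scope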

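Key principles:

Shadow output must never trigger customer notifications, pricing changes, credit line changes, freezes, declines, collections actions, or SAR / regulatory filings.
If employees need to see AI recommendations, state explicitly whether that will change human judgment; otherwise a silent launch becomes a de facto pilot.
Every analysis must distinguish "AI would have recommended" from "business actually did".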

3. Decision Object

Shadow mode does not just record a score. Each AI decision should store at least:

Field group	Examples
Business context	use case, product, channel, customer segment, workflow step, jurisdiction
Input snapshot	event payload hash, feature vector version, policy version, RAG corpus version, tool schema version
AI output	recommendation, score, confidence, reason codes, citations, abstain flag, proposed next action
Champion output	actual decision, human reviewer, rule/model id, override reason, action taken
Comparison	agreement, severity of disagreement, expected customer impact, review queue
Outcome	label type, label maturity date, outcome value, appeal/complaint/fraud loss/default indicator
Controls	leakage checks, fairness slice, PII minimization, retention class, reviewer calibration status
Evidence	trace id, model/prompt version, eval run id, gate memo id, approval or issue id
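As an illustration, a subset of the field groups above could be modeled as immutable records. This is a sketch under assumptions: the class names `AIOutput` and `ShadowDecision`, the helper `to_log_line`, and the choice of fields are hypothetical, and a real schema would cover every field group in the table.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AIOutput:
    recommendation: Optional[str]  # None when the challenger abstains
    score: Optional[float]
    confidence: Optional[float]
    reason_codes: List[str] = field(default_factory=list)
    abstain: bool = False


@dataclass(frozen=True)
class ShadowDecision:
    trace_id: str
    use_case: str
    feature_vector_version: str
    policy_version: str
    model_version: str
    ai_output: AIOutput
    champion_decision: Optional[str] = None    # captured after the challenger output
    outcome_value: Optional[str] = None        # set on a new record version by a later outcome join
    label_maturity_date: Optional[str] = None  # ISO date string


def to_log_line(decision: ShadowDecision) -> str:
    """Serialize one decision as a single JSON line for an append-only store."""
    return json.dumps(asdict(decision), sort_keys=True)
```

Freezing the dataclasses means an outcome join has to write a new record version instead of editing the original, which fits an append-only store.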

Counterfactual Logging And Evaluation Lifecycle

1. Lifecycle

use case intake
  -> decision boundary and prohibited actions
  -> feature and policy snapshot design
  -> shadow router implementation
  -> counterfactual event logging
  -> daily quality and control checks
  -> delayed label / outcome join
  -> human comparison and disagreement review
  -> segment / fairness scorecard
  -> gate memo and rollout recommendation
  -> evidence archive

2. Logging Model

Counterfactual logging must be able to answer three questions:

Question	Required evidence	Failure if missing
What did AI know at decision time?	feature snapshot, context snapshot, policy version, retrieval corpus version	Results get flattered after the fact with future information
What would AI have done?	proposed decision, score, reason, confidence, action intent, abstention	Only the score is kept, so the business action cannot be assessed
What actually happened later?	champion decision, customer outcome, label maturity, complaint/appeal/fraud/default result	Real risk or benefit cannot be estimated

3. Outcome Delay Handling

In financial retail, many outcomes do not mature on the same day:

Use case	Immediate proxy	Mature label	Recommended shadow window
Credit line management	underwriter agreement, utilization change	delinquency, loss, complaint, adverse action dispute	60-180 days
AML alert triage	analyst disposition	SAR decision, QA finding, law enforcement feedback where available	30-120 days
KYC onboarding	document verification result	fraud hit, synthetic identity signal, account closure reason	30-180 days
Payment fraud intervention	authorization decision, customer confirmation	confirmed fraud, false decline, chargeback	7-60 days
Collections contact strategy	agent agreement, contact success	cure, re-default, hardship complaint, contact violation	30-120 days
Contact center agent assist	agent acceptance, QA score	complaint, repeat contact, resolution quality	7-45 days

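Outcome delay is not purely a data problem; it is a product decision problem. Gating too early overestimates speed and underestimates harm; gating too late slows learning. A mature practice splits the gate into:

Readiness gate: input completeness, log quality, control completeness, preliminary agreement.
Risk gate: high-risk disagreement, fairness slices, prohibited actions, leakage, complaint-sensitive cases.
Outcome gate: lift / harm / false positive / false negative / cost / operational burden once labels have matured.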
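The outcome gate depends on how many shadow events already have mature labels. A minimal sketch, assuming each event carries a decision date and the use case has one agreed maturity window; the function names and the `min_share` default of 0.9 are illustrative assumptions, not values from this note.

```python
from datetime import date, timedelta
from typing import List


def label_is_mature(decision_date: date, maturity_days: int, today: date) -> bool:
    """A label counts as mature once the agreed window has fully elapsed."""
    return today >= decision_date + timedelta(days=maturity_days)


def mature_share(decision_dates: List[date], maturity_days: int, today: date) -> float:
    """Share of shadow events whose outcome label is mature as of `today`."""
    if not decision_dates:
        return 0.0
    mature = sum(label_is_mature(d, maturity_days, today) for d in decision_dates)
    return mature / len(decision_dates)


def outcome_gate_open(
    decision_dates: List[date],
    maturity_days: int,
    today: date,
    min_share: float = 0.9,  # assumed threshold; set per use case
) -> bool:
    """Only evaluate lift or harm when enough labels have matured."""
    return mature_share(decision_dates, maturity_days, today) >= min_share
```

Readiness and risk gates can run while this returns False, since they do not depend on mature labels.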

4. Leakage Control

The most common source of false progress in shadow mode is leakage.

Leakage type	Example	Control
Future outcome leakage	Using delinquency status 30 days later as a feature for a same-day line adjustment	decision-time feature store, snapshot timestamp enforcement
Human decision leakage	Feeding the underwriter's final decision into the challenger	separate champion capture after challenger output
Queue leakage	Shadowing only the good cases already screened by humans	route-level sampling and population definition
Label leakage	Training a same-day triage model on post-QA AML dispositions	label maturity registry and train/eval split by time
Policy leakage	Replaying old cases under the new policy while comparing against decisions made under the policy of the time	policy version lock and policy-era analysis
Reviewer leakage	A human reviewer sees the AI output and then produces a supposedly independent label	blind review protocol for comparison samples
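Snapshot timestamp enforcement, the control for future outcome leakage, can be sketched as a check that no feature became available after the decision time. The function name `find_future_features` and the per-feature availability map are assumptions for illustration; naive datetimes are used for brevity and must all be in the same time zone.

```python
from datetime import datetime
from typing import Dict, List


def find_future_features(
    decision_time: datetime,
    feature_available_at: Dict[str, datetime],
) -> List[str]:
    """Return the names of features that became available after the decision time."""
    return sorted(
        name
        for name, available_at in feature_available_at.items()
        if available_at > decision_time
    )
```

A non-empty result would be recorded in the leakage register and would block the affected snapshot from evaluation.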

Gate Criteria And Rollout Decision Model

1. Gate Layers

Gate	Pass condition	Stop condition
Technical readiness	shadow path stable, trace complete, latency/cost within budget, no champion path impact	missing logs, champion slowdown, non-deterministic versioning
Data and leakage	decision-time snapshots complete, no future feature, sampling representative	leakage found in material feature or label
Business performance	challenger improves target metric or reduces manual burden without unacceptable harm	average lift comes from narrow or low-risk slice only
Risk and fairness	no critical segment regression, protected/proxy monitoring approved	false positive/negative disparity breaches threshold
Human comparison	disagreements explainable, SME review supports limited use	AI disagrees on high-severity cases without defensible reason
Operational readiness	human review capacity, escalation, fallback, monitoring, evidence owner ready	reviewers cannot absorb alerts or overrides
Governance evidence	use case card, data lineage, eval report, gate memo, risk acceptance complete	decision cannot be reconstructed

2. Decision Model

Result	Meaning	Action
No-go	AI is not fit for workflow exposure	redesign model/process, fix data, repeat shadow
Continue shadow	evidence incomplete or outcome labels immature	extend window, narrow sample, improve logging
Limited go	fit for assisted mode with human approval and tight monitoring	expose suggestions to trained users, no autonomous action
Conditional go	fit for narrow segment or low-risk decision	feature flag by segment/channel/product, monitor guardrails
Rollout go	fit for controlled production rollout	staged ramp with rollback triggers and audit evidence
Decommission challenger	AI adds no value or introduces unmanaged risk	archive evidence, close initiative, record lessons
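A deliberately simplified sketch of how three of these results could be derived from gate findings. The precedence shown (hard stops first, then evidence sufficiency) is an assumption for illustration, the names `GateResult` and `recommend` are hypothetical, and the remaining results (conditional go, rollout go, decommission) are left to human judgment in a gate memo.

```python
from enum import Enum


class GateResult(Enum):
    NO_GO = "no-go"
    CONTINUE_SHADOW = "continue shadow"
    LIMITED_GO = "limited go"


def recommend(
    leakage_found: bool,
    segment_harm_found: bool,
    evidence_complete: bool,
    labels_mature: bool,
) -> GateResult:
    # Hard stops come first: material leakage or segment harm blocks exposure.
    if leakage_found or segment_harm_found:
        return GateResult.NO_GO
    # Without complete evidence and mature labels, keep collecting.
    if not (evidence_complete and labels_mature):
        return GateResult.CONTINUE_SHADOW
    # Otherwise return the most cautious positive result: human-approved assisted mode.
    return GateResult.LIMITED_GO
```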

3. Rollback Triggers Before Full Rollout

Rollback does not apply only to production releases. The shadow / silent launch stage also needs stop rules:

Trace completeness below agreed threshold for two consecutive business days.
Challenger produces prohibited actions, unauthorized tool intent, or customer-impacting recommendation outside approved scope.
Material leakage discovered in feature, label, prompt context, or human comparison protocol.
High-severity disagreement rate exceeds threshold in credit, fraud, AML, KYC or vulnerable customer slices.
Fairness scorecard shows unexplained false positive / false negative disparity in protected or proxy segments.
Reviewer queue load exceeds planned capacity and delays existing controls.
Evidence binder cannot reconstruct model, prompt, data, policy and decision versions.
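The first stop rule can be expressed as a streak check. A minimal sketch, assuming the input is an ordered list with one completeness value per business day; the function name `trace_stop_triggered` and its parameters are hypothetical, and the threshold is whatever the team has agreed.

```python
from typing import Sequence


def trace_stop_triggered(
    daily_completeness: Sequence[float],
    threshold: float,
    run_length: int = 2,
) -> bool:
    """True if completeness was below `threshold` on `run_length` consecutive days."""
    streak = 0
    for value in daily_completeness:
        # A day at or above the threshold resets the streak.
        streak = streak + 1 if value < threshold else 0
        if streak >= run_length:
            return True
    return False
```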

Financial Retail Scenarios

1. Credit Line Management

Shadow question: if AI recommended line increase / decrease / hold, would it improve risk-adjusted growth without unfair treatment or adverse action inconsistency?

Architecture object	Design
Champion	existing credit policy, scorecard, underwriter override
Challenger	AI line-management recommender with reason-code constraints
Counterfactual action	increase, decrease, freeze, keep, manual review
Delayed label	utilization, delinquency, charge-off, complaint, dispute, attrition
Critical controls	no adverse action leakage, reason-code consistency, fair lending segment analysis
Gate blocker	AI recommends line decrease for protected/proxy segment at materially higher rate without justified risk signal

2. AML Alert Triage

Shadow question: can AI triage alerts, summarize rationale, and recommend priority without missing suspicious activity or creating analyst automation bias?

Architecture object	Design
Champion	current rules, analyst disposition, QA review
Challenger	alert severity ranker + narrative copilot
Counterfactual action	close, escalate, request enhanced review, priority rank
Delayed label	SAR decision, QA defect, reopened case, typology hit
Critical controls	blind SME sample, typology coverage, reviewer calibration, audit narrative trace
Gate blocker	AI under-escalates high-risk typology or de-prioritizes vulnerable jurisdiction slice without explanation

3. KYC Onboarding

Shadow question: can AI reduce manual review and detect synthetic identity risk without discouraging legitimate applicants?

Architecture object	Design
Champion	identity verification vendor, KYC rules, operations review
Challenger	document / entity / risk signal synthesis model
Counterfactual action	approve, reject, request document, enhanced due diligence
Delayed label	fraud confirmation, account closure, AML hit, customer complaint
Critical controls	no direct customer messaging in shadow, document provenance, bias by language/geography
Gate blocker	false reject concentration by document type, language, country corridor, or accessibility need

4. Payment Fraud Intervention

Shadow question: would AI intervene on risky payments with lower fraud loss and fewer false declines?

Architecture object	Design
Champion	rules/model authorization and fraud queue
Challenger	real-time fraud intervention recommender
Counterfactual action	allow, step-up, hold, decline, manual review
Delayed label	confirmed fraud, chargeback, customer confirmation, complaint
Critical controls	latency budget, false decline harm, scam typology evidence, customer vulnerability signal
Gate blocker	fraud savings depend on unacceptable false decline rate for payroll, benefit, remittance or vulnerable customer slices

5. Collections Contact Strategy

Shadow question: would AI choose a better contact channel, timing and treatment while respecting hardship, consent and conduct risk?

Architecture object	Design
Champion	current collections segmentation and dialer strategy
Challenger	treatment optimizer with vulnerability and consent guardrails
Counterfactual action	call, SMS, email, letter, hardship route, no contact
Delayed label	cure, promise kept, complaint, re-default, contact violation
Critical controls	consent, contact frequency, vulnerability escalation, conduct-risk QA
Gate blocker	AI increases pressure on vulnerable customers or repeats contact near legal limits

6. Contact Center Agent Assist

Shadow question: can a copilot suggest policy-grounded answers and next best actions without misleading agents or changing regulated communications?

Architecture object	Design
Champion	agent judgment, knowledge base, QA scorecard
Challenger	RAG copilot / summarizer / next-action recommender
Counterfactual action	suggested answer, citation, escalation, after-call summary
Delayed label	QA score, repeat contact, complaint, resolution, supervisor correction
Critical controls	citation grounding, no policy invention, agent independence sample, coaching readiness
Gate blocker	AI produces fluent but uncited policy advice in complaint-sensitive or regulated product cases

Metrics/Control/Evidence Model

1. Metrics

Metric group	Examples	Decision use
Decision agreement	champion/challenger agreement, severity-weighted disagreement, SME upheld rate	Whether to enter assisted mode
Counterfactual performance	expected loss avoided, fraud captured, false decline avoided, manual queue reduction	Whether there is business value
Risk and harm	false positive/negative by slice, complaint proxy, adverse action inconsistency	Whether customer harm exists
Fairness	selection rate, false positive disparity, false negative disparity, calibration by segment	Whether it is explainable and controllable
Operational	latency, cost per shadow event, queue load, reviewer time, abstention rate	Whether it is operable
Evidence quality	trace completeness, version reconstructability, outcome join rate, missingness	Whether it is auditable
Human comparison	independent reviewer agreement, override rationale quality, automation-bias signal	Whether it can be handed to frontline staff
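Two of the metrics above, sketched under assumptions: severity-weighted disagreement is taken here as the severity-weighted share of events where champion and challenger differ, and the false positive rate per slice is FP / (FP + TN) against mature outcome labels. The function names and tuple layouts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def severity_weighted_disagreement(rows: Iterable[Tuple[str, str, float]]) -> float:
    """rows: (champion_decision, challenger_decision, severity_weight)."""
    total = 0.0
    disagree = 0.0
    for champion, challenger, weight in rows:
        total += weight
        if champion != challenger:
            disagree += weight
    return disagree / total if total else 0.0


def false_positive_rate_by_slice(
    rows: Iterable[Tuple[str, bool, bool]],
) -> Dict[str, float]:
    """rows: (slice_name, challenger_flagged, outcome_positive)."""
    false_positives: Dict[str, int] = defaultdict(int)
    negatives: Dict[str, int] = defaultdict(int)
    for slice_name, flagged, positive in rows:
        if not positive:  # only true negatives and false positives enter the rate
            negatives[slice_name] += 1
            if flagged:
                false_positives[slice_name] += 1
    # Slices with no negative outcomes are omitted rather than reported as zero.
    return {name: false_positives[name] / count for name, count in negatives.items()}
```

Reporting the rate per slice rather than overall is what lets a segment-level hard gate catch harm that an average would hide.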

2. Control Model

Control	Purpose	Evidence
Read-only runtime permissions	Ensures the challenger does not affect customers or system state	service account policy, tool deny-list, write-attempt logs
Decision-time snapshot	Prevents future-information leakage	feature snapshot hash, timestamp, feature availability contract
Population definition	Prevents shadowing only flattering samples	sampling plan, inclusion/exclusion rules, traffic report
Blind human review	Obtains independent comparison labels	reviewer assignment, hidden AI flag, calibration report
Segment monitoring	Detects concentrated harm and fairness issues	segment scorecard, threshold breach log
Outcome maturity registry	Controls label delay and interpretation conventions	label plan, maturity dates, join rate
Evidence binder	Makes gate decisions traceable	decision memo, run ids, lineage, approvals, issue records
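The read-only runtime control can be sketched as a tool wrapper that records write intent instead of executing it. Note the swap: this sketch uses an allow-list of read-only tools rather than the deny-list named in the table, since an allow-list fails closed for unknown tools. The class name `ReadOnlyToolSandbox` and the tool names are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool names; anything not listed here is treated as a write.
READ_ONLY_TOOLS = {"get_account_summary", "search_policy"}


class ReadOnlyToolSandbox:
    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self._tools = tools
        self.write_attempts: List[Dict[str, Any]] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in READ_ONLY_TOOLS:
            # Record the intent as evidence; the underlying tool is never invoked.
            self.write_attempts.append({"tool": name, "args": kwargs})
            return {"status": "denied", "tool": name}
        return self._tools[name](**kwargs)
```

The `write_attempts` list doubles as the write-attempt log, and a non-empty list is the signal for the prohibited-action stop rule.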

3. Evidence Packet

A packet ready for governance review should contain:

Use case and decision boundary.
Customer and employee impact statement.
Champion/challenger architecture diagram.
Data, feature, prompt, RAG, model and policy version lineage.
Counterfactual schema and sample trace.
Leakage assessment and remediation record.
Human comparison protocol and calibration results.
Outcome label plan and maturity analysis.
Segment/fairness scorecard.
Gate recommendation with residual risks and controls.
Rollout limits, rollback triggers and owner map.
Audit-ready evidence index.

Anti-Patterns And Failure Modes

Anti-pattern	Why it fails	Better pattern
"Shadow mode" that writes system state	Already affects customers, so it cannot be called silent	strict read-only permissions and write-attempt alerting
Only logging scores	无法解释 action, reason, confidence, authority	log full decision object
Comparing AI to contaminated human labels	reviewer already saw AI output	blind review or independent SME calibration
Declaring success before outcome maturity	Short-term proxies mask harm	staged gate by immediate, risk and mature outcome
Average lift hides segment harm	Harm to small groups is masked by aggregate gains	segment-level hard gates
No leakage registry	Historical replay and live shadow cannot be compared	decision-time snapshot and leakage control table
Silent launch without operations	reviewers cannot handle disagreements	queue capacity and escalation model
No abstention design	The AI is forced to give a recommendation on uncertain cases	abstain / escalate as first-class outcome
Treating challenger as model-only	RAG, prompt, tool, policy and workflow also change behavior	full AI object versioning
Evidence after the fact	audit cannot reconstruct decision	event-first evidence architecture

Architecture Mapping To RAG / Agent / Copilot / Eval / Governance

Architecture area	Shadow mode mapping	Key control
RAG	Log retrieved chunks, corpus version, citation support, no-answer handling	citation accuracy and retrieval coverage by scenario
Agent	Log planned tool calls, denied writes, authority boundary, approval path	read-only tool sandbox and action intent classification
Copilot	Compare AI suggestion with human final response or action	automation-bias sampling and agent override rationale
Eval	Convert shadow disagreements and failures into golden set and regression cases	production-derived eval case registry
Governance	Link every gate decision to use case, model version, data lineage, risk acceptance	evidence binder and decision memo
Observability	Trace event from business trigger to challenger output to outcome join	OpenTelemetry-style trace ids and metrics
Model risk	Support independent challenge with champion/challenger analysis	validation-ready logs and segment reports
Product architecture	Define when AI is advisor, recommender, ranker or autonomous actor	decision authority matrix

ADR Draft

Field	Content
ADR title	Adopt shadow mode and counterfactual event logging before exposing AI to customer-impacting financial retail decisions
Status	Proposed for high-impact AI use cases
Context	Credit, AML, KYC, fraud, collections and agent-assist AI systems can change eligibility, intervention, escalation, customer treatment or employee judgment. Offline evaluation alone cannot prove readiness because real workflow context, delayed outcomes, segment risk and human comparison are missing.
Decision	Implement a read-only shadow decisioning architecture that captures champion decisions, challenger outputs, decision-time feature/context snapshots, delayed outcomes, human comparison, segment/fairness analysis and audit evidence before assisted or production rollout.
Alternatives	Offline-only evaluation; immediate pilot with human oversight; A/B testing in production; manual SME review without production-like traffic.
Rationale	Shadow mode provides production-similar evidence without customer impact, supports leakage control and delayed outcome learning, and gives governance teams a reconstructable basis for go / limited go / no-go decisions.
Consequences	Requires event logging, feature snapshot discipline, outcome join, reviewer capacity and evidence ownership. It delays broad rollout but reduces unmanaged customer, regulatory, model and operational risk.
Guardrails	No write permissions, no customer communication, no autonomous decision, explicit leakage registry, segment hard gates, rollback triggers, evidence binder.
Success criteria	Complete traceability, stable shadow operations, no material leakage, acceptable high-severity disagreement rate, fair segment scorecard, mature outcome support, operational readiness and governance approval.

Interview Answer

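30-Second Version

Shadow mode lets the AI read real business events and record "what it would have done if the AI were deciding", without affecting customers or business state. I would record champion decisions, AI challenger outputs, decision-time features, delayed outcomes, human review disagreements, segment fairness, and audit evidence, then use gates to decide whether to move into assisted mode or a controlled rollout. The point is not to produce a model score; it is to establish the AI's decision boundary, risks, and operational readiness while customers are still unaffected.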

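2-Minute Version

I would start by defining the decision boundary: is the AI recommending credit lines, ranking AML alerts, recommending KYC handling, intercepting payments, recommending collections strategies, or assisting contact center answers? Then I would design a read-only shadow path so that live business events flow into both the existing champion process and the AI challenger. The champion still makes the actual decision; the challenger may only output a recommendation, confidence, reasons, citations, and whether it abstains.

The core is the counterfactual event log. It has to store the decision-time feature snapshot, policy version, RAG corpus, and model/prompt/tool versions, plus the actual human or rule decision. Once outcomes mature (fraud confirmation, delinquency, SAR result, complaints, QA scores), I join them back to analyze what benefit or harm the AI would have caused.

The gate cannot rest on average accuracy alone. I would look at high-severity disagreement, false positives/negatives, fairness slices, leakage, human review agreement, trace completeness, operational queue load, and rollback triggers. If evidence is insufficient, continue shadow; if a low-risk scenario is stable, move first into human-approved assisted mode; if segment harm or leakage appears, call no-go and return to fixing the data, model, or process.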

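CTO Version

I treat shadow mode as a pre-decision control plane, not a one-off testing activity. Architecturally, a production event enters the existing champion path while the minimum necessary context is copied to a read-only challenger path. Any tool intent from the challenger is recorded in a sandbox but not executed; every output carries a trace id, model/prompt/RAG/policy/tool versions, and a decision-time feature snapshot.

The data layer needs an append-only counterfactual store, outcome joiner, segment analyzer, and evidence binder. The control layer needs a leakage registry, population sampling plan, blind human comparison, fairness hard gates, operational readiness, and rollback triggers. That lets us answer, without changing customer outcomes: in which scenarios the AI would improve decisions, in which it would cause customer harm, whether outcome delay changes the conclusion, whether frontline teams can absorb it, and whether audit can reconstruct the full chain of judgment.

I would not let high-impact financial AI jump straight from offline eval to a customer-impacting release. The sound path is replay -> silent shadow -> human comparison -> assisted mode -> narrow rollout. Each layer needs a gate memo and a residual risk decision. That lets the CTO explain to risk, audit, and the business: we are not launching on the strength of a demo; we are delegating authority step by step on production-like evidence.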

7-Day Practice Plan

Day	Practice	Output
1	Pick one financial retail use case; define champion, challenger, decision boundary, and prohibited actions	Shadow use case card
2	Design the counterfactual event schema, covering feature snapshot, AI output, champion output, trace, and outcome plan	Event schema table
3	Write a leakage control matrix covering future outcome, human label, queue selection, and policy version	Leakage register
4	Design a human comparison protocol covering blind review, SME calibration, and disagreement severity	Review protocol
5	Build a segment/fairness scorecard covering false positive/negative, agreement, outcome, and complaint proxy	Segment scorecard
6	Write a rollout gate memo with criteria for no-go / continue shadow / limited go / rollout go	Gate decision memo
7	Assemble a portfolio artifact and walk through architecture, risk, evidence, and decision using the CTO version	5-minute interview narrative

Source Anchors

Source	Link	How this note uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize the language of risk identification, assessment, treatment, and evidence for shadow mode.
NIST AI RMF Resources and TEVV	https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources	Uses test, evaluation, verification and validation thinking to support counterfactual evaluation, measurement, and independent challenge.
ISO/IEC 42001	https://www.iso.org/standard/81230.html	Uses the AI management system context of operation, performance evaluation, and continual improvement to organize the operating model.
ISO/IEC 23894	https://www.iso.org/standard/77304.html	Uses AI risk management vocabulary to support risk identification, risk treatment, and monitoring.
Google Rules of Machine Learning	https://developers.google.com/machine-learning/guides/rules-of-ml	Reference for ML systems engineering principles on pre-launch checks, monitoring, data, and training/serving consistency.
DORA metrics	https://dora.dev/	Engineering governance anchor for delivery reliability, change quality, and rollback/restore thinking; shadow mode is not reduced to a release-velocity metric.
OpenTelemetry docs	https://opentelemetry.io/docs/	Observability anchor for trace, metric, log, and context propagation, supporting event-to-outcome tracing.