
AI Shadow Mode / Counterfactual Evaluation / Silent Launch Playbook


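Positioning: An advanced implementation handbook for experienced CBAP practitioners, financial retail PMs, product architects, solution architects and AI governance leads.
Core question: How to use real business context to validate whether an AI decision is ready for assisted mode, limited rollout or formal governance approval, without affecting customers, employee decisions or system state.
Scope: credit line management, AML alert triage, KYC onboarding, payment fraud intervention, collections contact strategy, contact center agent assist.
Boundary: This handbook does not replace legal advice, model validation reports, compliance approval, UAT certification, online experimentation, release governance or adoption analytics.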

Purpose And When To Use

Purpose

The purpose of shadow mode / counterfactual evaluation / silent launch is to generate auditable, comparable, decision-ready evidence before the AI actually affects customers or frontline staff:

Purpose	What it answers	Artifact
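Prove workflow fit	Does the AI understand real inputs, policies, exceptions, latency and missing data?	shadow run report
Estimate counterfactual value	If the AI took part in the decision, how much benefit, harm and operational load would result?	champion/challenger comparison
Detect concentrated harm	Is the AI unfair to any segment, channel, language, region, product or vulnerable customer?	fairness/segment scorecard
Calibrate human review	Where does the AI differ from SMEs, analysts, underwriters and agents?	disagreement review
Prepare rollout decision	Is the decision no-go, continue shadow, assisted mode or limited rollout?	gate memo and evidence packet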

When To Use

Use shadow mode when	Do not use it as a shortcut when
AI recommendation may change eligibility, intervention, escalation, customer treatment, prioritization or regulated communication	You already know the workflow is unsafe or prohibited
Outcome labels are delayed and offline eval cannot answer business impact	You need production A/B experimentation with customer impact
Human review quality and AI comparison matter	You only need technical regression for a non-decision component
Fairness, leakage, operational readiness or auditability are material	Logs cannot be retained or reconstructed under policy
You need confidence before exposing suggestions to staff	The AI output will secretly influence staff without controls

Use Case Fit

Use case	Shadowable decision	Primary risk	Mature outcome
Credit line management	increase / decrease / hold / review	unfair credit treatment, adverse action inconsistency	delinquency, loss, complaint, attrition
AML alert triage	close / escalate / prioritize / narrative	missed suspicious activity, analyst bias	SAR decision, QA defect, reopened case
KYC onboarding	approve / reject / document request / EDD	false reject, synthetic identity miss, discouragement	fraud hit, closure reason, complaint
Payment fraud intervention	allow / step-up / hold / decline	false decline, fraud loss, vulnerable customer harm	confirmed fraud, chargeback, customer confirmation
Collections contact strategy	channel / timing / hardship route / no contact	conduct risk, consent breach, vulnerable customer harm	cure, re-default, complaint, violation
Contact center agent assist	answer / citation / escalation / summary	hallucinated policy, automation bias, wrong regulated message	QA score, repeat contact, complaint

Operating Model

1. Roles And Decision Rights

Role	Responsibilities	Decision rights
Product owner	defines use case, customer impact, business value, rollout objective	recommends continue / hold / limited go
Senior BA / CBAP	maps decision boundary, policy, exception flows, outcome labels, human workflow	accepts business process completeness
Product architect	designs shadow operating model, authority boundary, evidence artifacts	approves product architecture readiness
Solution architect	designs event routing, logging, snapshot, versioning, access, observability	approves technical readiness
AI / ML owner	owns challenger model, prompt, RAG, tool logic, evals and limitations	approves model candidate for shadow
Operations owner	owns reviewer capacity, queue impact, training and escalation	approves operational readiness
Risk / compliance / model risk	challenges fairness, leakage, customer harm, evidence and residual risk	approves risk acceptance or no-go
Data governance	validates feature availability, lineage, retention and access controls	approves data readiness
Internal audit liaison	reviews evidence reconstructability and control clarity	gives evidence quality feedback

2. End-To-End Flow

intake
  -> decision boundary
  -> population and sampling plan
  -> feature/context snapshot design
  -> read-only challenger integration
  -> counterfactual event logging
  -> leakage and evidence checks
  -> human comparison
  -> delayed outcome join
  -> fairness and segment scorecard
  -> gate decision
  -> assisted mode / continue shadow / no-go
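
The read-only challenger integration and counterfactual event logging steps in the flow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `run_with_shadow`, `log_event` and the `budget_s` latency budget are hypothetical names, and a real read-only guarantee has to come from credentials and network policy rather than from a wrapper function.

```python
import concurrent.futures


def run_with_shadow(snapshot, champion, challenger, log_event, budget_s=0.2):
    """Return the champion decision; the challenger result is logged only.

    All names are hypothetical. `budget_s` is an assumed latency budget.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    # The challenger starts on a shallow copy of the snapshot before the
    # champion decision exists, so it cannot read that decision.
    future = pool.submit(challenger, dict(snapshot))
    decision = champion(snapshot)
    try:
        shadow = {"ai_recommendation": future.result(timeout=budget_s), "status": "ok"}
    except concurrent.futures.TimeoutError:
        shadow = {"ai_recommendation": None, "status": "timeout"}
    except Exception as exc:  # a challenger failure must not reach the champion path
        shadow = {"ai_recommendation": None, "status": "error:" + type(exc).__name__}
    finally:
        pool.shutdown(wait=False)  # do not block on a slow challenger
    # The counterfactual event records both sides; only `decision` is returned.
    log_event({"champion_decision": decision, **shadow})
    return decision
```

In this sketch the caller still waits up to `budget_s` for the challenger. A production design would more likely collect the challenger result off the request path so the champion latency is unaffected.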

3. Cadence

Cadence	Meeting	Inputs	Outputs
Daily during first week	Shadow run health check	trace completeness, errors, blocked events, write attempts	run issue log
Weekly	Counterfactual review	agreement, disagreement severity, sample reviews, missing labels	tuning and control actions
Biweekly or monthly	Gate readiness review	outcome maturity, fairness scorecard, operations capacity, evidence binder	continue / narrow / expand / stop recommendation
At label maturity	Outcome review	mature labels, losses, complaints, QA defects, appeals	rollout gate memo

Shadow Mode Intake Template

Field	Required content	Example
Use case	Business decision being shadowed	Payment fraud intervention for high-value card-not-present transactions
Business owner	Accountable decision owner	Fraud strategy director
Workflow insertion point	Exact event and step	After authorization risk score, before customer step-up
Champion path	Current actual decision maker	Fraud rules engine + fraud analyst queue
Challenger path	AI candidate behavior	AI recommends allow, step-up, hold or decline with reason
Customer impact prohibited in shadow	Actions AI cannot trigger	No customer message, no hold, no decline, no case note write
Employee exposure	Whether staff can see AI output	Hidden during L1; SME-only review during L2
Population	Included and excluded traffic	Include domestic card-not-present above threshold; exclude disputed accounts
Sampling	How events enter shadow	20% stratified sample by risk band and merchant category
Outcome labels	Mature labels and proxy labels	confirmed fraud, chargeback, customer confirmation, complaint
Fairness segments	Approved segments/proxies	region, age band where permitted, language, product, device, vulnerability flag
Required evidence	Gate artifacts	event schema, leakage register, scorecard, human comparison, gate memo
Initial gate date	First decision date based on label maturity	30-day readiness gate, 60-day outcome gate
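
One way to make the Sampling row auditable is a deterministic, hash-based inclusion rule, so that the same event always receives the same in/out answer. The sketch below assumes strata are encoded as strings such as `risk_band|merchant_category`; the stratum names, the salt and the function name are illustrative and do not come from any specific platform.

```python
import hashlib

# Hypothetical per-stratum rates; the intake example uses a 20% stratified sample.
SAMPLE_RATES = {"high_risk|electronics": 0.20, "low_risk|grocery": 0.20}


def in_shadow_sample(event_id, stratum, rates=SAMPLE_RATES, salt="shadow-v1"):
    """Deterministically decide whether an event enters shadow.

    Hashing (salt, stratum, event_id) yields a stable pseudo-random bucket in
    [0, 1), so inclusion can be replayed during audit. Strata without an
    approved rate are excluded (rate 0).
    """
    rate = rates.get(stratum, 0.0)
    digest = hashlib.sha256(f"{salt}:{stratum}:{event_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Changing the salt draws a fresh sample, so the salt would need to be versioned alongside the population and sampling plan.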

Counterfactual Event Schema Template

Field	Type	Description	Example
event_id	string	Stable shadow event id	`shd_pay_20260630_000184`
trace_id	string	Observability trace across router, challenger, store, outcome join	`trc_75f4b28a`
use_case_id	string	Registered AI use case	`PAY-FRAUD-INTERVENTION-AI`
event_time	timestamp	Time of business event	`2026-06-30T14:22:09Z`
decision_time	timestamp	Time challenger produced decision	`2026-06-30T14:22:10Z`
population_slice	object	Product/channel/region/risk-band descriptors	card-not-present, high-value, mobile
feature_snapshot_id	string	Decision-time feature snapshot	`fs_pay_v17_20260630_142209`
policy_version	string	Policy/rule version visible at decision time	`fraud_policy_2026_06_v3`
model_version	string	Challenger model version	`fraud_challenger_v0.8.2`
prompt_version	string	Prompt/system instruction version when applicable	`prompt_fraud_reason_v5`
rag_corpus_version	string	Knowledge corpus version when applicable	`fraud_typology_corpus_2026_06_15`
tool_schema_version	string	Tool contract version when applicable	`readonly_fraud_tools_v2`
ai_recommendation	enum	Proposed decision	step-up
ai_confidence	number	Calibrated confidence or score	0.78
ai_reason_codes	array	Business-readable reasons	unusual merchant, device mismatch, velocity spike
ai_citations	array	Policy or evidence references	fraud policy section 4.2
ai_abstained	boolean	Whether AI declined to recommend	false
champion_decision	enum	Actual decision made by current process	allow
champion_actor	string	Rule, model, human role, or workflow owner	rules_engine_v12
action_taken	string	Actual customer/system impact	authorization approved
disagreement_severity	enum	none / low / medium / high / critical	high
human_review_label	enum	SME judgment for comparison sample	step-up appropriate
outcome_label_status	enum	pending / proxy / mature / unavailable	pending
outcome_value	object	Mature label once available	confirmed_fraud=true, chargeback=false
fairness_slice_id	string	Approved monitoring slice id	mobile_high_value_region_3
leakage_check_status	enum	pass / fail / exception	pass
retention_class	string	Retention and privacy class	regulated_decision_shadow_7y
evidence_refs	array	Linked eval, review, issue and gate ids	eval_run_241, gate_pay_2026_08

Design rule: every event must reconstruct what the challenger knew, what it would have done, what the champion did, and what later happened.
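
The design rule can be turned into an automated completeness check on each stored event. The sketch below is one possible reading: the field names come from the schema table, but the grouping of fields under the four questions and the function name `reconstruction_gaps` are assumptions of this example.

```python
# Hypothetical grouping of schema fields under the four questions in the
# design rule. The field names come from the schema; the grouping does not.
RECONSTRUCTION_FIELDS = {
    "what_challenger_knew": ["feature_snapshot_id", "policy_version", "model_version"],
    "what_challenger_would_do": ["ai_recommendation", "ai_confidence", "ai_reason_codes"],
    "what_champion_did": ["champion_decision", "champion_actor", "action_taken"],
    "what_later_happened": ["outcome_label_status"],
}


def reconstruction_gaps(event):
    """Return {question: [missing or empty fields]} for one shadow event."""
    gaps = {}
    for question, names in RECONSTRUCTION_FIELDS.items():
        missing = [n for n in names if event.get(n) in (None, "", [])]
        if missing:
            gaps[question] = missing
    return gaps


# Example event built from the sample values in the schema table.
example_event = {
    "event_id": "shd_pay_20260630_000184",
    "feature_snapshot_id": "fs_pay_v17_20260630_142209",
    "policy_version": "fraud_policy_2026_06_v3",
    "model_version": "fraud_challenger_v0.8.2",
    "ai_recommendation": "step-up",
    "ai_confidence": 0.78,
    "ai_reason_codes": ["unusual merchant", "device mismatch", "velocity spike"],
    "champion_decision": "allow",
    "champion_actor": "rules_engine_v12",
    "action_taken": "authorization approved",
    "outcome_label_status": "pending",
}
```

An empty result means the sampled event answers all four questions. Any non-empty result is a candidate for the evidence gap trigger.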

Label/Outcome Plan Template

Label / outcome	Source system	Maturity window	Decision use	Quality control
Champion decision	workflow engine / case system	same day	agreement baseline	reconcile against audit trail
Human SME comparison	independent review queue	3-10 business days	disagreement severity and calibration	blind review for sampled cases
Customer complaint	complaint management	7-60 days	harm proxy	map to event when complaint references decision
Confirmed fraud	fraud case system	7-60 days	payment fraud outcome	exclude unresolved investigations
Delinquency / default	credit servicing	30-180 days	credit line risk outcome	vintage by decision date
SAR / QA disposition	AML case management	30-120 days	AML triage outcome	separate analyst disposition and QA defects
KYC fraud hit	identity / fraud ops	30-180 days	onboarding risk outcome	tag synthetic identity confidence
Collections cure / re-default	servicing and collections	30-120 days	treatment effectiveness	separate payment cure from sustainable cure
Contact center QA	QA platform / CRM	7-45 days	answer quality and resolution	sample by agent, queue and issue type

Outcome rules:

Each gate memo must state which outcomes are mature, proxy-only, or unavailable.
Proxy outcomes can support readiness but cannot prove full business impact.
Outcome join failures are evidence quality issues, not missing details to ignore.
Label definitions must be frozen before gate analysis to avoid decision-driven relabeling.
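
The outcome rules above could be supported by a small status classifier that a gate memo draws on. The sketch makes a conservative assumption that is not stated in the plan: a label stays `pending` until the upper bound of its maturity window has passed, even if an early label has already been joined. Function and key names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def outcome_status(decision_date, as_of, window_days, label_joined, is_proxy=False):
    """Classify one outcome as pending / proxy / mature / unavailable."""
    if as_of < decision_date + timedelta(days=window_days):
        return "pending"  # window still open: no outcome claim yet
    if not label_joined:
        return "unavailable"  # join failure, reported as an evidence quality issue
    return "proxy" if is_proxy else "mature"


def maturity_summary(events, as_of):
    """Return (status counts, join rate over events whose window has closed)."""
    counts = Counter(
        outcome_status(e["decision_date"], as_of, e["window_days"],
                       e["label_joined"], e.get("is_proxy", False))
        for e in events
    )
    closed = counts["mature"] + counts["proxy"] + counts["unavailable"]
    join_rate = (counts["mature"] + counts["proxy"]) / closed if closed else None
    return dict(counts), join_rate
```

Reporting the join rate next to the status counts keeps join failures visible instead of letting them drop silently out of the analysis.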

Leakage Controls Template

Leakage risk	Control design	Evidence	Owner
Future outcome feature enters shadow decision	Feature snapshot only includes values available at event_time	feature availability contract, snapshot audit	data governance
Champion decision influences challenger output	Challenger runs before champion decision capture is made available	event ordering logs	solution architect
Reviewer sees AI before independent label	Blind review queue hides challenger output	reviewer UI screenshot, assignment log	operations owner
Sample excludes hard cases	Population and sampling plan includes risk bands and edge cases	traffic inclusion report	product owner
RAG corpus contains post-event policy	Corpus version is locked by decision_time	corpus version manifest	AI owner
Labels from remediated cases mix with original decisions	Label maturity registry separates initial and corrected disposition	outcome lineage report	Senior BA
Protected attributes leak into runtime decision	Monitoring environment separated from runtime feature set	access control and feature list	risk/compliance
Human-in-the-loop becomes influenced pilot	Staff exposure level is documented and separated by L1/L2/L3	exposure register	product architect
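
Two of the controls in the table, the decision-time feature snapshot and the locked RAG corpus version, lend themselves to an automated timestamp check. The sketch below assumes that each feature carries an availability timestamp and that the corpus has a lock timestamp. Both are hypothetical metadata fields that the schema itself does not define.

```python
from datetime import datetime, timezone


def leakage_check(event_time, decision_time, feature_available_at, corpus_locked_at):
    """Return ("pass" | "fail", findings) for one shadow event.

    feature_available_at maps feature name -> time the value became available.
    A feature available only after event_time is treated as future leakage.
    A corpus locked after decision_time may contain post-event policy.
    """
    findings = [
        "feature:" + name
        for name, available in sorted(feature_available_at.items())
        if available > event_time
    ]
    if corpus_locked_at > decision_time:
        findings.append("rag_corpus:locked_after_decision")
    return ("fail" if findings else "pass"), findings


# Timestamps reuse the example values from the event schema.
event_time = datetime(2026, 6, 30, 14, 22, 9, tzinfo=timezone.utc)
decision_time = datetime(2026, 6, 30, 14, 22, 10, tzinfo=timezone.utc)
```

The returned status maps onto the `leakage_check_status` field. The `exception` value would still need a documented manual approval path, which this sketch does not model.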

Fairness/Segment Scorecard Template

Segment	Population share	AI recommendation rate	Champion rate	High-severity disagreement	False positive proxy	False negative proxy	Outcome harm signal	Action
Mobile high-value payments	18%	11.4% step-up	6.8% step-up	4.1%	2.2% false step-up	0.9% missed fraud	complaint proxy stable	monitor
New-to-bank KYC	12%	9.6% EDD	5.1% EDD	5.7%	3.4% false EDD	1.3% missed fraud	onboarding drop-off elevated	investigate
Small business credit line	8%	7.0% line decrease	4.9% line decrease	3.3%	1.8% adverse change proxy	2.5% missed risk	complaint proxy stable	require SME review
Vulnerability flag collections	5%	2.1% hardship route	3.8% hardship route	6.2%	0.7% over-route	4.8% under-route	complaint proxy elevated	no-go for this segment
Limited-English contact center	7%	15.2% escalation	10.4% escalation	4.9%	2.9% unnecessary escalation	1.5% under-escalation	repeat contact elevated	improve RAG and language eval

Scorecard rules:

Show champion rate beside AI rate so the review is about decision change, not raw model output.
Separate false positive harm and false negative harm; different use cases value them differently.
Segment thresholds must be approved before reviewing results.
Any critical segment regression blocks broad rollout even when aggregate metrics improve.
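
As a rough illustration of the scorecard rules, the sketch below computes AI and champion rates side by side for each segment and flags a breach against a threshold that is passed in, meaning it was approved beforehand. It covers only the rate and high-severity disagreement columns. The function names and the single-threshold design are simplifying assumptions.

```python
from collections import defaultdict


def segment_scorecard(events, intervention, max_high_severity_rate):
    """Build per-segment rates from counterfactual events.

    `intervention` is the decision being tracked (for example "step-up").
    `max_high_severity_rate` is assumed to be approved before results are seen.
    """
    by_segment = defaultdict(list)
    for event in events:
        by_segment[event["fairness_slice_id"]].append(event)

    scorecard = {}
    for segment, rows in by_segment.items():
        n = len(rows)
        high = sum(r["disagreement_severity"] in ("high", "critical") for r in rows) / n
        scorecard[segment] = {
            "ai_rate": sum(r["ai_recommendation"] == intervention for r in rows) / n,
            "champion_rate": sum(r["champion_decision"] == intervention for r in rows) / n,
            "high_severity_rate": high,
            "breach": high > max_high_severity_rate,
        }
    return scorecard


def blocks_broad_rollout(scorecard):
    """Any breaching segment blocks broad rollout, whatever the aggregate shows."""
    return any(row["breach"] for row in scorecard.values())
```

Because `blocks_broad_rollout` looks at each segment on its own, a strong aggregate cannot mask a regression in a small slice.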

Rollout Gate Template

Gate item	Evidence required	Pass standard	Decision
Scope and authority	use case card, prohibited action list, authority matrix	AI has no unapproved customer or system impact	pass
Shadow stability	trace completeness, error rate, latency/cost report	agreed trace completeness and no champion path impact	pass
Leakage	leakage register, failed check log, remediation evidence	no material unresolved leakage	pass
Decision performance	agreement, disagreement, SME upheld rate, outcome metrics	challenger improves or supports target decision without critical harm	conditional pass
Outcome maturity	label maturity report and join rate	gate uses only mature labels for outcome claims	pass
Fairness / segment	scorecard and threshold breaches	no unexplained critical segment regression	investigate if any red slice
Human comparison	blind review protocol, reviewer calibration	high-severity disagreements reviewed and dispositioned	pass
Operations	reviewer capacity, escalation, fallback, runbook	team can operate assisted mode without control degradation	pass
Evidence	evidence packet index and reconstructability sample	sampled decisions can be reconstructed end-to-end	pass
Residual risk	risk acceptance, compensating controls, expiry	accountable owner accepts limited rollout risk	limited go only

Gate outputs:

No-go: material leakage, prohibited action, unfair segment harm, unreviewed critical disagreement or unreconstructable evidence.
Continue shadow: evidence incomplete, labels immature, operations not ready, or value unclear.
Limited go: narrow segment, human approval, explicit rollback triggers, daily monitoring.
Rollout go: mature evidence supports controlled expansion and risk owners approve.
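
The four gate outputs can be read as a precedence order, with blocking findings checked first. The sketch below encodes one such reading. The finding and control identifiers are hypothetical labels derived from the wording above, and falling back to "continue shadow" when the limited-go conditions are not all met is an assumption, not something the gate definitions state.

```python
NO_GO_FINDINGS = {
    "material_leakage", "prohibited_action", "unfair_segment_harm",
    "unreviewed_critical_disagreement", "unreconstructable_evidence",
}
CONTINUE_FINDINGS = {
    "evidence_incomplete", "labels_immature", "operations_not_ready", "value_unclear",
}
LIMITED_GO_CONDITIONS = {
    "narrow_segment", "human_approval", "rollback_triggers", "daily_monitoring",
}


def gate_decision(findings, controls, mature_evidence, risk_owners_approve):
    """Map gate findings and controls (both sets of strings) to a gate output."""
    if findings & NO_GO_FINDINGS:
        return "no-go"
    if findings & CONTINUE_FINDINGS:
        return "continue shadow"
    if mature_evidence and risk_owners_approve:
        return "rollout go"
    if LIMITED_GO_CONDITIONS <= controls:  # every limited-go condition is in place
        return "limited go"
    return "continue shadow"  # assumed default when nothing else is satisfied
```

A function like this is an aid for consistent gate memos. The recorded decision still belongs to the accountable owners.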

Rollback Trigger Template

Trigger	Threshold example	Detection source	Immediate action	Owner
Write attempt from challenger	any unauthorized write or customer message	access logs / tool sandbox	disable challenger integration	solution architect
Trace completeness breach	below agreed threshold for two business days	observability dashboard	pause gate analysis and fix logging	AI platform owner
Leakage confirmed	any material future or champion leakage	leakage review	invalidate affected run	data governance
Critical disagreement spike	above approved threshold in high-risk slice	daily disagreement report	stop expansion, SME review	product owner
Fairness breach	segment false positive/negative disparity above threshold	segment scorecard	no-go for impacted segment	risk/compliance
Reviewer overload	queue SLA breach or control backlog	operations dashboard	reduce sample or pause L2/L3	operations owner
Outcome harm signal	complaint, appeal, false decline, QA defect spike	outcome joiner / complaint system	rollback to hidden shadow or no-go	product owner
Evidence gap	sampled event cannot reconstruct versions and decision	evidence binder QA	hold gate decision	governance lead

Rollback rule: a rollback trigger must specify who can stop shadow exposure, how quickly the path is disabled, what evidence is preserved, and how re-entry is approved.
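
The rollback rule names four elements that every trigger must specify, which makes it straightforward to validate trigger definitions before they are approved. The dataclass below is a sketch: the field names are hypothetical, and the example values for the time limit, preserved evidence and re-entry approver are illustrative placeholders, not values taken from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    name: str
    stop_authority: str          # who can stop shadow exposure
    disable_within_minutes: int  # how quickly the path is disabled
    evidence_preserved: tuple    # what evidence is preserved
    reentry_approver: str        # how (by whom) re-entry is approved

    def missing_elements(self):
        """List the rollback-rule elements this trigger fails to specify."""
        missing = []
        if not self.stop_authority.strip():
            missing.append("stop_authority")
        if self.disable_within_minutes <= 0:
            missing.append("disable_within_minutes")
        if not self.evidence_preserved:
            missing.append("evidence_preserved")
        if not self.reentry_approver.strip():
            missing.append("reentry_approver")
        return missing


# Illustrative values only; thresholds and approvers must come from governance.
write_attempt = RollbackTrigger(
    name="Write attempt from challenger",
    stop_authority="solution architect",
    disable_within_minutes=15,
    evidence_preserved=("access logs", "tool sandbox logs"),
    reentry_approver="risk/compliance",
)
```

A trigger with a non-empty `missing_elements()` result would not satisfy the rollback rule and should not be approved as written.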

Evidence Packet Template

Evidence item	Description	Minimum content
Executive summary	Decision recommendation and risk posture	no-go / continue / limited go / go, scope, residual risks
Use case card	Business context and authority boundary	customer impact, employee exposure, prohibited actions
Architecture diagram	Champion/challenger and logging path	router, snapshot, challenger, event store, outcome join
Event schema	Counterfactual data contract	fields, retention class, versioning, examples
Data lineage	Input and outcome provenance	feature availability, policy versions, RAG corpus, label sources
Leakage register	Leakage assessment and controls	risks, controls, failed checks, remediation
Shadow run report	Operational health	volume, errors, latency, cost, trace completeness
Human comparison	SME / analyst / agent comparison	blind review method, disagreement severity, calibration
Outcome analysis	Mature and proxy outcomes	label maturity, join rate, business metrics, limitations
Fairness scorecard	Segment and harm analysis	thresholds, breaches, mitigations, owner decisions
Gate memo	Go / no-go decision	criteria results, open issues, risk acceptance, rollout limits
Rollback plan	Stop rules and execution path	triggers, owners, communication, re-entry criteria
Audit index	Reconstructable evidence references	trace ids, run ids, approvals, issue records

PM/BA/Architecture Questions

PM Questions

Question	Why it matters
Which customer or business outcome could change if AI were allowed to act?	Defines decision impact and risk tier
Is the AI recommending, ranking, drafting, deciding, or executing?	Determines authority boundary
Which segment could be harmed even if aggregate performance improves?	Forces fairness and customer harm analysis
What is the smallest low-risk surface for limited go?	Avoids broad rollout based on thin evidence
What value would justify added operational and governance cost?	Prevents shadow mode from becoming a research exercise

BA / CBAP Questions

Question	Why it matters
What is the exact business rule, policy or exception AI is shadowing?	Prevents vague “AI assist” scope
Which decisions are observable immediately and which outcomes are delayed?	Builds label/outcome plan
What must be recorded to replay the decision?	Defines evidence and event schema
When does human review need to be blind?	Protects comparison validity
Which workflow step creates customer impact?	Separates silent mode from pilot

Architecture Questions

Question	Why it matters
Can challenger service technically write to any production system?	Confirms read-only control
Are feature, policy, prompt, model, RAG and tool versions reconstructable?	Enables audit and root cause analysis
How are trace ids propagated from event to outcome join?	Connects observability to evaluation
Where are protected attributes or proxies handled for monitoring?	Separates fairness analytics from runtime decisioning
What happens if shadow path fails or exceeds latency/cost budget?	Protects champion path and operations

Release Checklist

This checklist is named “release” because it decides whether the AI can be released from hidden shadow into a more exposed mode. It is not a general software release checklist.

Check	Done condition
Use case registered	use case id, owner, risk tier and decision boundary approved
Authority boundary defined	AI cannot trigger customer/system action in L1/L2 shadow
Population and sampling approved	inclusion/exclusion and stratification documented
Counterfactual schema implemented	event reconstructs input, AI output, champion decision and outcome plan
Feature snapshots locked	only decision-time available features used
Version lineage complete	model, prompt, RAG, policy, tool and data versions captured
Leakage controls passed	no material unresolved leakage
Human comparison ready	blind review and SME calibration protocol active
Outcome plan active	label sources, maturity windows and join logic approved
Fairness scorecard active	segments, thresholds and owners approved
Operational readiness proven	reviewer capacity, escalation, fallback and support path ready
Rollback triggers approved	stop rules, owners and re-entry criteria documented
Evidence packet complete	gate memo can reconstruct sampled events end-to-end
Gate decision recorded	no-go, continue shadow, limited go or rollout go approved by accountable owners

Executive Narrative

1-Minute Steering Committee Version

We are not asking to let the AI change customer outcomes yet. We are asking to continue or expand a controlled shadow mode where the AI sees real business events, produces counterfactual recommendations, and records what it would have done while the existing champion process remains fully in control. This gives us evidence on decision quality, delayed outcomes, fairness, human review differences, operational readiness and audit traceability before any customer-impacting rollout.

3-Minute Risk Committee Version

The control objective is to avoid moving from offline testing directly into customer-impacting AI decisions. In shadow mode, the challenger AI is read-only. It cannot write to production systems, send customer communications, change account status, alter credit lines, close AML alerts, decline payments or direct collections actions. Every recommendation is stored as a counterfactual event with decision-time features, policy and model versions, the actual champion decision and later outcome labels.

We will use this evidence to answer five questions: does AI add decision value, does it create concentrated harm, does it disagree with human experts in high-risk cases, are outcome labels mature enough to support the claim, and can operations and governance reconstruct the decision. If any material leakage, fairness breach, unauthorized action, reviewer overload or evidence gap occurs, the rollout path stops and returns to remediation.

Portfolio Storyline

I designed shadow mode as a pre-decision architecture pattern for regulated financial AI. It combines champion/challenger comparison, counterfactual logging, delayed outcome evaluation, leakage controls, fairness scorecards, human review calibration, rollout gates and rollback triggers. The value is that product, architecture and governance teams can make a disciplined go / no-go decision before the AI is allowed to affect customers.

Interview Drills

Drill 1: Credit Line Management

Question: “How would you validate an AI credit line recommender before launch?”

Strong answer:

I would run it in read-only shadow mode against real line-management events. The current credit policy remains champion, and the AI challenger records increase/decrease/hold/review recommendations with reason codes and decision-time feature snapshots. I would compare against actual decisions, wait for delinquency, utilization, complaint and attrition labels to mature, and run fair lending segment analysis. It can only move to assisted mode if reason-code consistency, high-severity disagreement, fairness and evidence gates pass.

Drill 2: AML Alert Triage

Question: “What makes shadow mode hard for AML?”

Strong answer:

AML labels are delayed and noisy. Analyst disposition, QA review and SAR decision are related but not identical labels. I would separate immediate analyst agreement from mature outcomes, run blind SME comparison on sampled alerts, monitor typology-specific false negatives, and preserve narrative evidence. A model that reduces queue volume but under-escalates high-risk typologies should fail the gate even if aggregate agreement looks good.

Drill 3: KYC Onboarding

Question: “How do you avoid shadow evaluation becoming biased by the current process?”

Strong answer:

I would define the population before sampling, include cases the current process sends to different paths, and keep challenger output separate from champion decision capture. For human comparison, reviewers must be blind to AI output. I would also control policy version and document availability at decision time, because onboarding results can look better if future fraud flags or corrected documents leak into evaluation.

Drill 4: Payment Fraud Intervention

Question: “Why not just A/B test the fraud model?”

Strong answer:

For payment fraud, customer harm from false declines and missed fraud can be immediate. Before any customer-impacting experiment, I want counterfactual evidence using real transaction context. The AI proposes allow, step-up, hold or decline, but the champion path still decides. We then join confirmed fraud, chargebacks, customer confirmations and complaints. Only if false positive and false negative tradeoffs are acceptable by segment would I move to a limited, reversible rollout.

Drill 5: Collections Contact Strategy

Question: “What would block rollout even if cure rate improves?”

Strong answer:

I would block rollout if the AI increases contact pressure on vulnerable customers, violates consent or contact frequency rules, under-routes hardship cases, or concentrates complaints in a protected/proxy segment. Collections AI must be judged on conduct risk and customer treatment, not only cure rate.

Drill 6: Contact Center Agent Assist

Question: “How do you test agent assist without creating automation bias?”

Strong answer:

Start with hidden shadow mode and compare AI suggested answers against actual agent responses and QA outcomes. For sampled cases, run independent review where reviewers do not see the AI output. If we later expose suggestions to agents, that becomes assisted silent launch, and we need acceptance/override tracking, citation quality, repeat contact, complaint and QA monitoring to detect automation bias.

Source Anchors

Source	Link	How this handbook uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize shadow evidence, risk gates and management communication.
NIST AI RMF Resources and TEVV	https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources	Uses the TEVV framing to organize test, evaluation, verification, validation and independent challenge.
ISO/IEC 42001	https://www.iso.org/standard/81230.html	Uses AI management system thinking to organize the operating model, performance evaluation and continual improvement.
ISO/IEC 23894	https://www.iso.org/standard/77304.html	Uses the AI risk management framing to organize risk treatment, monitoring and review.
Google Rules of Machine Learning	https://developers.google.com/machine-learning/guides/rules-of-ml	Uses ML engineering rules to calibrate training/serving consistency, monitoring and pre-launch system checks.
DORA metrics	https://dora.dev/	Uses delivery performance and resilience language to connect rollout gates, change failure and restore thinking.
OpenTelemetry docs	https://opentelemetry.io/docs/	Uses traces, metrics, logs and context propagation to support counterfactual event-to-outcome observability.