AI Shadow Mode / Counterfactual Evaluation / Silent Launch Playbook

405AI_SHADOW_MODE_COUNTERFACTUAL_EVALUATION_SILENT_LAUNCH_PLAYBOOK.md

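Positioning: an advanced implementation handbook for experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead readers.
Core question: how to use real business context to verify whether AI decisions merit progression to assisted mode, limited rollout or formal governance approval, without affecting customers, employee decisions or system state.
Scope: credit line management, AML alert triage, KYC onboarding, payment fraud intervention, collections contact strategy, contact center agent assist.
Boundaries: this handbook does not replace legal advice, model validation reports, compliance approval, UAT certification, online experimentation, release governance or adoption analytics.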


Purpose And When To Use

Purpose

The purpose of shadow mode / counterfactual evaluation / silent launch is to generate auditable, comparable, decision-ready evidence before AI actually affects customers or frontline staff:

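| Purpose | What it answers | Artifact |
| --- | --- | --- |
| Prove workflow fit | Whether AI understands real inputs, policies, exceptions, latency and missing data | shadow run report |
| Estimate counterfactual value | How much benefit, harm and operational load AI would bring if it participated in the decision | champion/challenger comparison |
| Detect concentrated harm | Whether AI is unfair to a segment, channel, language, region, product or vulnerable customer | fairness/segment scorecard |
| Calibrate human review | Where AI differs from SME, analyst, underwriter and agent judgment | disagreement review |
| Prepare rollout decision | Whether to no-go, continue shadow, move to assisted mode or start limited rollout | gate memo and evidence packet |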

When To Use

| Use shadow mode when | Do not use it as a shortcut when |
| --- | --- |
| AI recommendation may change eligibility, intervention, escalation, customer treatment, prioritization or regulated communication | You already know the workflow is unsafe or prohibited |
| Outcome labels are delayed and offline eval cannot answer business impact | You need production A/B experimentation with customer impact |
| Human review quality and AI comparison matter | You only need technical regression for a non-decision component |
| Fairness, leakage, operational readiness or auditability are material | Logs cannot be retained or reconstructed under policy |
| You need confidence before exposing suggestions to staff | The AI output will secretly influence staff without controls |

Use Case Fit

| Use case | Shadowable decision | Primary risk | Mature outcome |
| --- | --- | --- | --- |
| Credit line management | increase / decrease / hold / review | unfair credit treatment, adverse action inconsistency | delinquency, loss, complaint, attrition |
| AML alert triage | close / escalate / prioritize / narrative | missed suspicious activity, analyst bias | SAR decision, QA defect, reopened case |
| KYC onboarding | approve / reject / document request / EDD | false reject, synthetic identity miss, discouragement | fraud hit, closure reason, complaint |
| Payment fraud intervention | allow / step-up / hold / decline | false decline, fraud loss, vulnerable customer harm | confirmed fraud, chargeback, customer confirmation |
| Collections contact strategy | channel / timing / hardship route / no contact | conduct risk, consent breach, vulnerable customer harm | cure, re-default, complaint, violation |
| Contact center agent assist | answer / citation / escalation / summary | hallucinated policy, automation bias, wrong regulated message | QA score, repeat contact, complaint |

Operating Model

1. Roles And Decision Rights

| Role | Responsibilities | Decision rights |
| --- | --- | --- |
| Product owner | defines use case, customer impact, business value, rollout objective | recommends continue / hold / limited go |
| Senior BA / CBAP | maps decision boundary, policy, exception flows, outcome labels, human workflow | accepts business process completeness |
| Product architect | designs shadow operating model, authority boundary, evidence artifacts | approves product architecture readiness |
| Solution architect | designs event routing, logging, snapshot, versioning, access, observability | approves technical readiness |
| AI / ML owner | owns challenger model, prompt, RAG, tool logic, evals and limitations | approves model candidate for shadow |
| Operations owner | owns reviewer capacity, queue impact, training and escalation | approves operational readiness |
| Risk / compliance / model risk | challenges fairness, leakage, customer harm, evidence and residual risk | approves risk acceptance or no-go |
| Data governance | validates feature availability, lineage, retention and access controls | approves data readiness |
| Internal audit liaison | reviews evidence reconstructability and control clarity | gives evidence quality feedback |

2. End-To-End Flow

intake
  -> decision boundary
  -> population and sampling plan
  -> feature/context snapshot design
  -> read-only challenger integration
  -> counterfactual event logging
  -> leakage and evidence checks
  -> human comparison
  -> delayed outcome join
  -> fairness and segment scorecard
  -> gate decision
  -> assisted mode / continue shadow / no-go

3. Cadence

| Cadence | Meeting | Inputs | Outputs |
| --- | --- | --- | --- |
| Daily during first week | Shadow run health check | trace completeness, errors, blocked events, write attempts | run issue log |
| Weekly | Counterfactual review | agreement, disagreement severity, sample reviews, missing labels | tuning and control actions |
| Biweekly or monthly | Gate readiness review | outcome maturity, fairness scorecard, operations capacity, evidence binder | continue / narrow / expand / stop recommendation |
| At label maturity | Outcome review | mature labels, losses, complaints, QA defects, appeals | rollout gate memo |

Shadow Mode Intake Template

| Field | Required content | Example |
| --- | --- | --- |
| Use case | Business decision being shadowed | Payment fraud intervention for high-value card-not-present transactions |
| Business owner | Accountable decision owner | Fraud strategy director |
| Workflow insertion point | Exact event and step | After authorization risk score, before customer step-up |
| Champion path | Current actual decision maker | Fraud rules engine + fraud analyst queue |
| Challenger path | AI candidate behavior | AI recommends allow, step-up, hold or decline with reason |
| Customer impact prohibited in shadow | Actions AI cannot trigger | No customer message, no hold, no decline, no case note write |
| Employee exposure | Whether staff can see AI output | Hidden during L1; SME-only review during L2 |
| Population | Included and excluded traffic | Include domestic card-not-present above threshold; exclude disputed accounts |
| Sampling | How events enter shadow | 20% stratified sample by risk band and merchant category |
| Outcome labels | Mature labels and proxy labels | confirmed fraud, chargeback, customer confirmation, complaint |
| Fairness segments | Approved segments/proxies | region, age band where permitted, language, product, device, vulnerability flag |
| Required evidence | Gate artifacts | event schema, leakage register, scorecard, human comparison, gate memo |
| Initial gate date | First decision date based on label maturity | 30-day readiness gate, 60-day outcome gate |
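
The Sampling row above could be implemented in many ways. The following is a minimal sketch that assumes a hash-based, deterministic sampler. The function name `in_shadow_sample`, the stratum string format and the synthetic event ids are hypothetical and not part of the template.

```python
import hashlib


def in_shadow_sample(event_id: str, stratum: str, rate: float = 0.20) -> bool:
    """Deterministically decide whether an event enters the shadow sample.

    Hashing the stratum together with the event id means the same event
    always gets the same answer, and the target rate applies within each
    stratum rather than only in aggregate.
    """
    digest = hashlib.sha256(f"{stratum}:{event_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Hypothetical stratum built from risk band and merchant category.
events = [(f"evt_{i}", "high_risk|electronics") for i in range(10_000)]
sampled = [eid for eid, stratum in events if in_shadow_sample(eid, stratum)]
share = len(sampled) / len(events)  # close to the 20% target
```

A deterministic rule of this kind lets an auditor re-derive why a given event was or was not included, which supports the traffic inclusion report.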

Counterfactual Event Schema Template

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| event_id | string | Stable shadow event id | shd_pay_20260630_000184 |
| trace_id | string | Observability trace across router, challenger, store, outcome join | trc_75f4b28a |
| use_case_id | string | Registered AI use case | PAY-FRAUD-INTERVENTION-AI |
| event_time | timestamp | Time of business event | 2026-06-30T14:22:09Z |
| decision_time | timestamp | Time challenger produced decision | 2026-06-30T14:22:10Z |
| population_slice | object | Product/channel/region/risk-band descriptors | card-not-present, high-value, mobile |
| feature_snapshot_id | string | Decision-time feature snapshot | fs_pay_v17_20260630_142209 |
| policy_version | string | Policy/rule version visible at decision time | fraud_policy_2026_06_v3 |
| model_version | string | Challenger model version | fraud_challenger_v0.8.2 |
| prompt_version | string | Prompt/system instruction version when applicable | prompt_fraud_reason_v5 |
| rag_corpus_version | string | Knowledge corpus version when applicable | fraud_typology_corpus_2026_06_15 |
| tool_schema_version | string | Tool contract version when applicable | readonly_fraud_tools_v2 |
| ai_recommendation | enum | Proposed decision | step-up |
| ai_confidence | number | Calibrated confidence or score | 0.78 |
| ai_reason_codes | array | Business-readable reasons | unusual merchant, device mismatch, velocity spike |
| ai_citations | array | Policy or evidence references | fraud policy section 4.2 |
| ai_abstained | boolean | Whether AI declined to recommend | false |
| champion_decision | enum | Actual decision made by current process | allow |
| champion_actor | string | Rule, model, human role, or workflow owner | rules_engine_v12 |
| action_taken | string | Actual customer/system impact | authorization approved |
| disagreement_severity | enum | none / low / medium / high / critical | high |
| human_review_label | enum | SME judgment for comparison sample | step-up appropriate |
| outcome_label_status | enum | pending / proxy / mature / unavailable | pending |
| outcome_value | object | Mature label once available | confirmed_fraud=true, chargeback=false |
| fairness_slice_id | string | Approved monitoring slice id | mobile_high_value_region_3 |
| leakage_check_status | enum | pass / fail / exception | pass |
| retention_class | string | Retention and privacy class | regulated_decision_shadow_7y |
| evidence_refs | array | Linked eval, review, issue and gate ids | eval_run_241, gate_pay_2026_08 |

Design rule: every event must reconstruct what the challenger knew, what it would have done, what the champion did, and what later happened.
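
One way to make the design rule testable is a reconstructability check over a subset of the schema fields. The sketch below is illustrative only: the `CounterfactualEvent` class, the choice of required fields and the `reconstruction_gaps` helper are assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

OUTCOME_STATUSES = {"pending", "proxy", "mature", "unavailable"}


@dataclass
class CounterfactualEvent:
    # Subset of the schema above; a real event would carry every field.
    event_id: str
    feature_snapshot_id: Optional[str]
    policy_version: Optional[str]
    model_version: Optional[str]
    ai_recommendation: Optional[str]
    ai_abstained: bool
    champion_decision: Optional[str]
    action_taken: Optional[str]
    outcome_label_status: str = "pending"


def reconstruction_gaps(e: CounterfactualEvent) -> List[str]:
    """Return the fields that prevent the event from being replayed."""
    gaps = []
    # What the challenger knew.
    for name in ("feature_snapshot_id", "policy_version", "model_version"):
        if not getattr(e, name):
            gaps.append(name)
    # What it would have done (an abstention counts as an answer).
    if not e.ai_abstained and not e.ai_recommendation:
        gaps.append("ai_recommendation")
    # What the champion did.
    for name in ("champion_decision", "action_taken"):
        if not getattr(e, name):
            gaps.append(name)
    # What later happened: the status must at least be a known value.
    if e.outcome_label_status not in OUTCOME_STATUSES:
        gaps.append("outcome_label_status")
    return gaps


example = CounterfactualEvent(
    event_id="shd_pay_20260630_000184",
    feature_snapshot_id="fs_pay_v17_20260630_142209",
    policy_version="fraud_policy_2026_06_v3",
    model_version="fraud_challenger_v0.8.2",
    ai_recommendation="step-up",
    ai_abstained=False,
    champion_decision="allow",
    action_taken="authorization approved",
)
gaps = reconstruction_gaps(example)  # empty list for the schema example
```

Running a check like this on a sample of events is one possible input to the reconstructability sample in the evidence packet.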


Label/Outcome Plan Template

| Label / outcome | Source system | Maturity window | Decision use | Quality control |
| --- | --- | --- | --- | --- |
| Champion decision | workflow engine / case system | same day | agreement baseline | reconcile against audit trail |
| Human SME comparison | independent review queue | 3-10 business days | disagreement severity and calibration | blind review for sampled cases |
| Customer complaint | complaint management | 7-60 days | harm proxy | map to event when complaint references decision |
| Confirmed fraud | fraud case system | 7-60 days | payment fraud outcome | exclude unresolved investigations |
| Delinquency / default | credit servicing | 30-180 days | credit line risk outcome | vintage by decision date |
| SAR / QA disposition | AML case management | 30-120 days | AML triage outcome | separate analyst disposition and QA defects |
| KYC fraud hit | identity / fraud ops | 30-180 days | onboarding risk outcome | tag synthetic identity confidence |
| Collections cure / re-default | servicing and collections | 30-120 days | treatment effectiveness | separate payment cure from sustainable cure |
| Contact center QA | QA platform / CRM | 7-45 days | answer quality and resolution | sample by agent, queue and issue type |

Outcome rules:

  • Each gate memo must state which outcomes are mature, proxy-only, or unavailable.
  • Proxy outcomes can support readiness but cannot prove full business impact.
  • Outcome join failures are evidence quality issues that must be tracked and resolved, not gaps to be ignored.
  • Label definitions must be frozen before gate analysis to avoid decision-driven relabeling.
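
The first outcome rule can be supported by a simple status classifier. This sketch assumes a label is treated as mature once the upper bound of its maturity window has elapsed, which is a simplification (for example, it does not exclude unresolved investigations). The label keys, `MATURITY_DAYS` and both function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Upper bounds of the maturity windows from the plan above, in days.
MATURITY_DAYS = {
    "confirmed_fraud": 60,
    "customer_complaint": 60,
    "delinquency_default": 180,
}


def outcome_label_status(label, decision_date, as_of, has_proxy=False):
    """Classify a label as mature, proxy, pending or unavailable."""
    days = MATURITY_DAYS.get(label)
    if days is None:
        return "unavailable"  # no registered source or window for this label
    if as_of >= decision_date + timedelta(days=days):
        return "mature"
    return "proxy" if has_proxy else "pending"


def gate_memo_summary(events, as_of):
    """Count events per status so the gate memo can state them explicitly."""
    return Counter(
        outcome_label_status(
            e["label"], e["decision_date"], as_of, e.get("has_proxy", False)
        )
        for e in events
    )


events = [
    {"label": "confirmed_fraud", "decision_date": date(2026, 6, 30)},
    {"label": "delinquency_default", "decision_date": date(2026, 6, 30), "has_proxy": True},
    {"label": "appeal", "decision_date": date(2026, 6, 30)},
]
summary = gate_memo_summary(events, as_of=date(2026, 9, 15))
```

Here the fraud label has matured by the review date, the delinquency label is still proxy-only, and the unregistered label is reported as unavailable rather than silently dropped.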

Leakage Controls Template

| Leakage risk | Control design | Evidence | Owner |
| --- | --- | --- | --- |
| Future outcome feature enters shadow decision | Feature snapshot only includes values available at event_time | feature availability contract, snapshot audit | data governance |
| Champion decision influences challenger output | Challenger runs before champion decision capture is made available | event ordering logs | solution architect |
| Reviewer sees AI before independent label | Blind review queue hides challenger output | reviewer UI screenshot, assignment log | operations owner |
| Sample excludes hard cases | Population and sampling plan includes risk bands and edge cases | traffic inclusion report | product owner |
| RAG corpus contains post-event policy | Corpus version is locked by decision_time | corpus version manifest | AI owner |
| Labels from remediated cases mix with original decisions | Label maturity registry separates initial and corrected disposition | outcome lineage report | Senior BA |
| Protected attributes leak into runtime decision | Monitoring environment separated from runtime feature set | access control and feature list | risk/compliance |
| Human-in-the-loop becomes influenced pilot | Staff exposure level is documented and separated by L1/L2/L3 | exposure register | product architect |
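
The first control, that a feature snapshot only includes values available at event_time, lends itself to an automated check. A minimal sketch follows, assuming each feature carries an availability timestamp. The feature names and the `leaked_features` and `leakage_check_status` helpers are hypothetical.

```python
from datetime import datetime, timezone


def leaked_features(event_time, feature_available_at):
    """Return names of features that only became available after event_time."""
    return sorted(
        name for name, ts in feature_available_at.items() if ts > event_time
    )


def leakage_check_status(event_time, feature_available_at):
    """Map the check onto the pass / fail values of the event schema."""
    return "fail" if leaked_features(event_time, feature_available_at) else "pass"


event_time = datetime(2026, 6, 30, 14, 22, 9, tzinfo=timezone.utc)
snapshot = {
    # Hypothetical features with the time each value became available.
    "device_mismatch": datetime(2026, 6, 30, 14, 22, 8, tzinfo=timezone.utc),
    "txn_velocity_1h": datetime(2026, 6, 30, 14, 22, 0, tzinfo=timezone.utc),
    "chargeback_filed": datetime(2026, 7, 12, 9, 0, 0, tzinfo=timezone.utc),
}
leaks = leaked_features(event_time, snapshot)
status = leakage_check_status(event_time, snapshot)
```

In this example the chargeback flag is a future outcome, so the snapshot fails the check and the event would be recorded in the leakage register.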

Fairness/Segment Scorecard Template

| Segment | Population share | AI recommendation rate | Champion rate | High-severity disagreement | False positive proxy | False negative proxy | Outcome harm signal | Action |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mobile high-value payments | 18% | 11.4% step-up | 6.8% step-up | 4.1% | 2.2% false step-up | 0.9% missed fraud | complaint proxy stable | monitor |
| New-to-bank KYC | 12% | 9.6% EDD | 5.1% EDD | 5.7% | 3.4% false EDD | 1.3% missed fraud | onboarding drop-off elevated | investigate |
| Small business credit line | 8% | 7.0% line decrease | 4.9% line decrease | 3.3% | 1.8% adverse change proxy | 2.5% missed risk | complaint proxy stable | require SME review |
| Vulnerability flag collections | 5% | 2.1% hardship route | 3.8% hardship route | 6.2% | 0.7% over-route | 4.8% under-route | complaint proxy elevated | no-go for this segment |
| Limited-English contact center | 7% | 15.2% escalation | 10.4% escalation | 4.9% | 2.9% unnecessary escalation | 1.5% under-escalation | repeat contact elevated | improve RAG and language eval |

Scorecard rules:

  • Show champion rate beside AI rate so the review is about decision change, not raw model output.
  • Separate false positive harm and false negative harm; different use cases value them differently.
  • Segment thresholds must be approved before reviewing results.
  • Any critical segment regression blocks broad rollout even when aggregate metrics improve.
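
The last rule, that a critical segment regression blocks broad rollout even when aggregate metrics improve, can be sketched as a per-segment threshold check. The rates below are taken from two scorecard rows above, but the 3% thresholds, the class and the function names are hypothetical placeholders for values that would be approved before results are reviewed.

```python
from dataclasses import dataclass


@dataclass
class SegmentResult:
    segment: str
    false_positive_proxy: float
    false_negative_proxy: float


def breached_segments(results, thresholds):
    """thresholds maps segment -> (max_fp, max_fn), kept separate by design."""
    breaches = []
    for r in results:
        max_fp, max_fn = thresholds[r.segment]
        if r.false_positive_proxy > max_fp or r.false_negative_proxy > max_fn:
            breaches.append(r.segment)
    return breaches


def broad_rollout_allowed(aggregate_improved, results, thresholds):
    """An aggregate improvement never overrides a segment breach."""
    return aggregate_improved and not breached_segments(results, thresholds)


results = [
    SegmentResult("Mobile high-value payments", 0.022, 0.009),
    SegmentResult("Vulnerability flag collections", 0.007, 0.048),
]
# Hypothetical pre-approved thresholds: 3% for each error type.
thresholds = {r.segment: (0.03, 0.03) for r in results}

breaches = breached_segments(results, thresholds)
allowed = broad_rollout_allowed(True, results, thresholds)
```

With these assumed thresholds, the under-route rate in the vulnerability segment blocks broad rollout even though the aggregate flag is positive.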

Rollout Gate Template

| Gate item | Evidence required | Pass standard | Decision |
| --- | --- | --- | --- |
| Scope and authority | use case card, prohibited action list, authority matrix | AI has no unapproved customer or system impact | pass |
| Shadow stability | trace completeness, error rate, latency/cost report | agreed trace completeness and no champion path impact | pass |
| Leakage | leakage register, failed check log, remediation evidence | no material unresolved leakage | pass |
| Decision performance | agreement, disagreement, SME upheld rate, outcome metrics | challenger improves or supports target decision without critical harm | conditional pass |
| Outcome maturity | label maturity report and join rate | gate uses only mature labels for outcome claims | pass |
| Fairness / segment | scorecard and threshold breaches | no unexplained critical segment regression | investigate if any red slice |
| Human comparison | blind review protocol, reviewer calibration | high-severity disagreements reviewed and dispositioned | pass |
| Operations | reviewer capacity, escalation, fallback, runbook | team can operate assisted mode without control degradation | pass |
| Evidence | evidence packet index and reconstructability sample | sampled decisions can be reconstructed end-to-end | pass |
| Residual risk | risk acceptance, compensating controls, expiry | accountable owner accepts limited rollout risk | limited go only |

Gate outputs:

  • No-go: material leakage, prohibited action, unfair segment harm, unreviewed critical disagreement or unreconstructable evidence.
  • Continue shadow: evidence incomplete, labels immature, operations not ready, or value unclear.
  • Limited go: narrow segment, human approval, explicit rollback triggers, daily monitoring.
  • Rollout go: mature evidence supports controlled expansion and risk owners approve.
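
The four gate outputs can be read as a precedence order. The sketch below encodes one possible reading, and the precedence itself is an assumption: any no-go finding wins, then any continue-shadow finding, then risk owner approval decides between rollout go and limited go. All identifiers are hypothetical.

```python
NO_GO = {
    "material_leakage", "prohibited_action", "unfair_segment_harm",
    "unreviewed_critical_disagreement", "unreconstructable_evidence",
}
CONTINUE_SHADOW = {
    "evidence_incomplete", "labels_immature", "operations_not_ready",
    "value_unclear",
}
LIMITED_GO_CONTROLS = {
    "narrow_segment", "human_approval", "rollback_triggers", "daily_monitoring",
}


def gate_output(findings, controls_in_place, risk_owners_approved):
    """Map gate findings to one of the four gate outputs."""
    if findings & NO_GO:
        return "no-go"
    if findings & CONTINUE_SHADOW:
        return "continue shadow"
    if risk_owners_approved:
        return "rollout go"
    if LIMITED_GO_CONTROLS <= controls_in_place:
        return "limited go"
    # No blockers, but neither approval nor limited-go controls exist yet.
    return "continue shadow"


decision = gate_output({"labels_immature"}, LIMITED_GO_CONTROLS, True)
```

Here immature labels keep the use case in shadow even though the limited-go controls exist and risk owners have approved.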

Rollback Trigger Template

| Trigger | Threshold example | Detection source | Immediate action | Owner |
| --- | --- | --- | --- | --- |
| Write attempt from challenger | any unauthorized write or customer message | access logs / tool sandbox | disable challenger integration | solution architect |
| Trace completeness breach | below agreed threshold for two business days | observability dashboard | pause gate analysis and fix logging | AI platform owner |
| Leakage confirmed | any material future or champion leakage | leakage review | invalidate affected run | data governance |
| Critical disagreement spike | above approved threshold in high-risk slice | daily disagreement report | stop expansion, SME review | product owner |
| Fairness breach | segment false positive/negative disparity above threshold | segment scorecard | no-go for impacted segment | risk/compliance |
| Reviewer overload | queue SLA breach or control backlog | operations dashboard | reduce sample or pause L2/L3 | operations owner |
| Outcome harm signal | complaint, appeal, false decline, QA defect spike | outcome joiner / complaint system | rollback to hidden shadow or no-go | product owner |
| Evidence gap | sampled event cannot reconstruct versions and decision | evidence binder QA | hold gate decision | governance lead |

Rollback rule: a rollback trigger must specify who can stop shadow exposure, how quickly the path is disabled, what evidence is preserved, and how re-entry is approved.
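
The rollback rule names four things every trigger must specify, which makes it easy to validate trigger definitions before approval. This is a sketch under assumed field names; `RollbackTrigger`, `missing_rollback_fields` and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RollbackTrigger:
    name: str
    stop_authority: Optional[str] = None           # who can stop shadow exposure
    disable_within_minutes: Optional[int] = None   # how quickly the path is disabled
    evidence_preserved: Optional[List[str]] = None  # what evidence is preserved
    reentry_approver: Optional[str] = None         # how re-entry is approved


def missing_rollback_fields(trigger: RollbackTrigger) -> List[str]:
    """List the required elements that the trigger definition leaves empty."""
    required = {
        "stop_authority": trigger.stop_authority,
        "disable_within_minutes": trigger.disable_within_minutes,
        "evidence_preserved": trigger.evidence_preserved,
        "reentry_approver": trigger.reentry_approver,
    }
    # A value of 0 minutes is valid, so only None and empty values count.
    return [k for k, v in required.items() if v is None or v == "" or v == []]


draft = RollbackTrigger(
    name="Write attempt from challenger",
    stop_authority="solution architect",
    disable_within_minutes=0,
)
missing = missing_rollback_fields(draft)
```

The draft trigger above names an owner and a disable time but would not pass review until preserved evidence and a re-entry approver are added.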


Evidence Packet Template

| Evidence item | Description | Minimum content |
| --- | --- | --- |
| Executive summary | Decision recommendation and risk posture | no-go / continue / limited go / go, scope, residual risks |
| Use case card | Business context and authority boundary | customer impact, employee exposure, prohibited actions |
| Architecture diagram | Champion/challenger and logging path | router, snapshot, challenger, event store, outcome join |
| Event schema | Counterfactual data contract | fields, retention class, versioning, examples |
| Data lineage | Input and outcome provenance | feature availability, policy versions, RAG corpus, label sources |
| Leakage register | Leakage assessment and controls | risks, controls, failed checks, remediation |
| Shadow run report | Operational health | volume, errors, latency, cost, trace completeness |
| Human comparison | SME / analyst / agent comparison | blind review method, disagreement severity, calibration |
| Outcome analysis | Mature and proxy outcomes | label maturity, join rate, business metrics, limitations |
| Fairness scorecard | Segment and harm analysis | thresholds, breaches, mitigations, owner decisions |
| Gate memo | Go / no-go decision | criteria results, open issues, risk acceptance, rollout limits |
| Rollback plan | Stop rules and execution path | triggers, owners, communication, re-entry criteria |
| Audit index | Reconstructable evidence references | trace ids, run ids, approvals, issue records |

PM/BA/Architecture Questions

PM Questions

| Question | Why it matters |
| --- | --- |
| Which customer or business outcome could change if AI were allowed to act? | Defines decision impact and risk tier |
| Is the AI recommending, ranking, drafting, deciding, or executing? | Determines authority boundary |
| Which segment could be harmed even if aggregate performance improves? | Forces fairness and customer harm analysis |
| What is the smallest low-risk surface for limited go? | Avoids broad rollout based on thin evidence |
| What value would justify added operational and governance cost? | Prevents shadow mode from becoming a research exercise |

BA / CBAP Questions

| Question | Why it matters |
| --- | --- |
| What is the exact business rule, policy or exception AI is shadowing? | Prevents vague “AI assist” scope |
| Which decisions are observable immediately and which outcomes are delayed? | Builds label/outcome plan |
| What must be recorded to replay the decision? | Defines evidence and event schema |
| When does human review need to be blind? | Protects comparison validity |
| Which workflow step creates customer impact? | Separates silent mode from pilot |

Architecture Questions

| Question | Why it matters |
| --- | --- |
| Can challenger service technically write to any production system? | Confirms read-only control |
| Are feature, policy, prompt, model, RAG and tool versions reconstructable? | Enables audit and root cause analysis |
| How are trace ids propagated from event to outcome join? | Connects observability to evaluation |
| Where are protected attributes or proxies handled for monitoring? | Separates fairness analytics from runtime decisioning |
| What happens if shadow path fails or exceeds latency/cost budget? | Protects champion path and operations |
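
The last architecture question, what happens when the shadow path fails or exceeds its budget, can be illustrated with a small wrapper. This is a sketch only: `decide_with_shadow`, the record fields beyond the schema and the 0.2 second budget are assumptions, and a production design would more likely log asynchronously instead of waiting for the challenger.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_shadow_pool = ThreadPoolExecutor(max_workers=4)


def decide_with_shadow(event, champion, challenger, log, budget_s=0.2):
    """Return the champion decision; record the challenger result or failure."""
    # The challenger is started first and never receives the champion decision.
    future = _shadow_pool.submit(challenger, event)
    decision = champion(event)  # the champion path always decides
    record = {"event_id": event["event_id"], "champion_decision": decision}
    try:
        record["ai_recommendation"] = future.result(timeout=budget_s)
        record["shadow_status"] = "ok"
    except FutureTimeout:
        record["shadow_status"] = "timeout"  # latency budget exceeded
    except Exception:
        record["shadow_status"] = "error"    # challenger failure is logged only
    log(record)
    return decision


logs = []
result = decide_with_shadow(
    {"event_id": "shd_pay_20260630_000184"},
    champion=lambda e: "allow",
    challenger=lambda e: "step-up",
    log=logs.append,
)
```

Whatever the challenger does, the caller only ever receives the champion decision, and failures and timeouts show up as shadow run health data rather than as customer impact.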

Release Checklist

This checklist is named “release” because it decides whether the AI can be released from hidden shadow into a more exposed mode. It is not a general software release checklist.

| Check | Done condition |
| --- | --- |
| Use case registered | use case id, owner, risk tier and decision boundary approved |
| Authority boundary defined | AI cannot trigger customer/system action in L1/L2 shadow |
| Population and sampling approved | inclusion/exclusion and stratification documented |
| Counterfactual schema implemented | event reconstructs input, AI output, champion decision and outcome plan |
| Feature snapshots locked | only decision-time available features used |
| Version lineage complete | model, prompt, RAG, policy, tool and data versions captured |
| Leakage controls passed | no material unresolved leakage |
| Human comparison ready | blind review and SME calibration protocol active |
| Outcome plan active | label sources, maturity windows and join logic approved |
| Fairness scorecard active | segments, thresholds and owners approved |
| Operational readiness proven | reviewer capacity, escalation, fallback and support path ready |
| Rollback triggers approved | stop rules, owners and re-entry criteria documented |
| Evidence packet complete | gate memo can reconstruct sampled events end-to-end |
| Gate decision recorded | no-go, continue shadow, limited go or rollout go approved by accountable owners |

Executive Narrative

1-Minute Steering Committee Version

We are not asking to let the AI change customer outcomes yet. We are asking to continue or expand a controlled shadow mode where the AI sees real business events, produces counterfactual recommendations, and records what it would have done while the existing champion process remains fully in control. This gives us evidence on decision quality, delayed outcomes, fairness, human review differences, operational readiness and audit traceability before any customer-impacting rollout.

3-Minute Risk Committee Version

The control objective is to avoid moving from offline testing directly into customer-impacting AI decisions. In shadow mode, the challenger AI is read-only. It cannot write to production systems, send customer communications, change account status, alter credit lines, close AML alerts, decline payments or direct collections actions. Every recommendation is stored as a counterfactual event with decision-time features, policy and model versions, the actual champion decision and later outcome labels.

We will use this evidence to answer five questions: does AI add decision value, does it create concentrated harm, does it disagree with human experts in high-risk cases, are outcome labels mature enough to support the claim, and can operations and governance reconstruct the decision. If any material leakage, fairness breach, unauthorized action, reviewer overload or evidence gap occurs, the rollout path stops and returns to remediation.

Portfolio Storyline

I designed shadow mode as a pre-decision architecture pattern for regulated financial AI. It combines champion/challenger comparison, counterfactual logging, delayed outcome evaluation, leakage controls, fairness scorecards, human review calibration, rollout gates and rollback triggers. The value is that product, architecture and governance teams can make a disciplined go / no-go decision before the AI is allowed to affect customers.


Interview Drills

Drill 1: Credit Line Management

Question: “How would you validate an AI credit line recommender before launch?”

Strong answer:

I would run it in read-only shadow mode against real line-management events. The current credit policy remains champion, and the AI challenger records increase/decrease/hold/review recommendations with reason codes and decision-time feature snapshots. I would compare against actual decisions, wait for delinquency, utilization, complaint and attrition labels to mature, and run fair lending segment analysis. It can only move to assisted mode if reason-code consistency, high-severity disagreement, fairness and evidence gates pass.

Drill 2: AML Alert Triage

Question: “What makes shadow mode hard for AML?”

Strong answer:

AML labels are delayed and noisy. Analyst disposition, QA review and SAR decision are related but not identical labels. I would separate immediate analyst agreement from mature outcomes, run blind SME comparison on sampled alerts, monitor typology-specific false negatives, and preserve narrative evidence. A model that reduces queue volume but under-escalates high-risk typologies should fail the gate even if aggregate agreement looks good.

Drill 3: KYC Onboarding

Question: “How do you avoid shadow evaluation becoming biased by the current process?”

Strong answer:

I would define the population before sampling, include cases the current process sends to different paths, and keep challenger output separate from champion decision capture. For human comparison, reviewers must be blind to AI output. I would also control policy version and document availability at decision time, because onboarding results can look better if future fraud flags or corrected documents leak into evaluation.

Drill 4: Payment Fraud Intervention

Question: “Why not just A/B test the fraud model?”

Strong answer:

For payment fraud, customer harm from false declines and missed fraud can be immediate. Before any customer-impacting experiment, I want counterfactual evidence using real transaction context. The AI proposes allow, step-up, hold or decline, but the champion path still decides. We then join confirmed fraud, chargebacks, customer confirmations and complaints. Only if false positive and false negative tradeoffs are acceptable by segment would I move to a limited, reversible rollout.

Drill 5: Collections Contact Strategy

Question: “What would block rollout even if cure rate improves?”

Strong answer:

I would block rollout if the AI increases contact pressure on vulnerable customers, violates consent or contact frequency rules, under-routes hardship cases, or concentrates complaints in a protected/proxy segment. Collections AI must be judged on conduct risk and customer treatment, not only cure rate.

Drill 6: Contact Center Agent Assist

Question: “How do you test agent assist without creating automation bias?”

Strong answer:

Start with hidden shadow mode and compare AI suggested answers against actual agent responses and QA outcomes. For sampled cases, run independent review where reviewers do not see the AI output. If we later expose suggestions to agents, that becomes assisted silent launch, and we need acceptance/override tracking, citation quality, repeat contact, complaint and QA monitoring to detect automation bias.


Source Anchors

| Source | Link | How this playbook uses it |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize shadow evidence, risk gates and management communication. |
| NIST AI RMF Resources and TEVV | https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources | Uses the TEVV framing to organize test, evaluation, verification, validation and independent challenge. |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Uses AI management system thinking to organize the operating model, performance evaluation and continual improvement. |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | Uses the AI risk management framing to organize risk treatment, monitoring and review. |
| Google Rules of Machine Learning | https://developers.google.com/machine-learning/guides/rules-of-ml | Uses ML engineering rules to calibrate training/serving consistency, monitoring and pre-launch system checks. |
| DORA metrics | https://dora.dev/ | Uses delivery performance and resilience language to connect rollout gates, change failure and restore thinking. |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Uses traces, metrics, logs and context propagation to support counterfactual event-to-outcome observability. |