AI Closed-Loop Learning / Corrective Action Playbook

751AI_CLOSED_LOOP_LEARNING_CORRECTIVE_ACTION_PLAYBOOK.md

AI Closed-Loop Learning / Corrective Action Architecture Playbook

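Positioning: For CBAP+, AI Product Managers, AI BAs, AI Product Architects, Model Risk, Compliance, Customer Operations, Data Governance, AI Platform and retail financial services business owners. This playbook is not about the basic claim that "collecting feedback makes the model better". It is about how to turn feedback, complaints, human overrides, expert reviews, eval failures, drift signals, incidents and audit findings into a closed-loop corrective system that is governed and backed by root cause, change, verification and evidence.

Scope: This playbook covers fraud, credit, KYC / AML, payments dispute, customer servicing RAG, complaints, collections, wealth servicing, internal copilot, agentic workflow and AI-assisted operations. It designs closed-loop learning as a CAPA-like corrective and preventive action architecture and does not reduce it to active learning, RLHF, automatic retraining or prompt tweaking.

Important note: This playbook is learning, portfolio and internal solution-training material. It does not constitute legal advice, a compliance conclusion, an audit opinion, a model validation report, a regulatory interpretation or any specific institution's policy. Real projects must be confirmed by Legal, Compliance, Model Risk, Fair Lending, Privacy, Security, Third Party Risk, Business Owner, Operations, Customer Experience, Data Governance and management, in light of institution type, jurisdiction, product scope, customer impact and internal policy. Access date recorded as 2026-06-30.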


Source Anchors

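The official sources below are the governance anchors for this playbook. It translates them into product, process, architecture, evidence and management language, and does not treat any framework as equivalent to a regulatory compliance conclusion.

| Anchor | Official link | How this playbook uses it |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize ownership, risk scenarios, measurement, treatment, residual risk and continuous-improvement evidence for closed-loop learning. |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/42001 | Uses the AI management system lens to design objectives, controls, operating records, internal audit, management review, continual improvement and the corrective action loop. |
| Federal Reserve SR 26-2 | https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm | Serves as the 2026 anchor for bank model risk management, emphasizing risk orientation, model lifecycle, independent challenge, governance, issue remediation and documentation. Nuance: SR 26-2 replaces SR 11-7 / SR 21-8 and takes a more risk-based approach tiered by institution size and model materiality. Its formal scope focuses on model risk management at banking organizations, so generative AI, agentic AI, non-model automation and customer-facing AI still need broader AI governance, consumer compliance, privacy, security, third-party, operational and product controls. |
| CFPB Consumer Complaint Database | https://www.consumerfinance.gov/data-research/consumer-complaints/ | Treats consumer complaint topics, narratives, company responses and trends as inputs to customer harm, root cause, remediation and external signal calibration. |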

Source-to-artifact mapping:

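| Source lens | Artifact it must land in | Interview phrasing |
| --- | --- | --- |
| NIST AI RMF | AI issue register, risk mapping, measurement dashboard, corrective action workflow, residual risk memo | "I manage the feedback loop with Govern / Map / Measure / Manage rather than treating feedback as a pile of data." |
| ISO/IEC 42001 | AI management system procedure, corrective action SOP, management review pack, internal audit evidence | "Closed-loop learning has to enter the management system so that internal audit, management review and continual improvement can track it." |
| SR 26-2 | Model issue management, model change log, independent challenge, validation evidence, closure record | "Fixing bank AI is not just the model team changing code. It also means evidencing the problem, the root cause, the change, the validation and the treatment of residual risk." |
| CFPB complaints | Complaint taxonomy, customer harm trigger, trend monitor, external complaint reconciliation | "Complaints are a production feedback signal, not customer-service noise. They calibrate whether internal monitoring is missing customer harm." |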

1. Executive Framing

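1.1 One-Sentence Positioning

Closed-Loop Learning / Corrective Action Architecture =
turn production feedback and risk signals into governed issues,
link them through root cause analysis to specific AI / data / policy / process changes,
then prove with independent, reproducible, auditable evidence that the fix works and recurrence is prevented.

For executives, it answers four questions:

  1. How do we know an AI system is causing errors, harm, control weaknesses or performance degradation?
  2. How do we decide whether the root cause lies in the model, data, prompt, RAG, tool, process, policy, vendor or governance?
  3. How do we make sure the fix is applied to the right artifact rather than left in meeting minutes?
  4. How do we prove the fix works, customers have been restored, risk has dropped and the same class of problem has stopped occurring?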

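1.2 It Is Not Active Learning

| Capability | Main question | Typical output |
| --- | --- | --- |
| Active learning | Which samples are most worth expert labeling | label queue, dataset update |
| Human feedback operations | Who reviews, how to label, how to assure label quality | reviewer workflow, calibration, adjudication |
| Drift monitoring | Have production distribution and performance changed | alert, triage, response |
| Customer harm remediation | Was the customer harmed, and how are they restored | recourse case, compensation, notification |
| Closed-loop corrective action | Why it failed, what to fix, how to prove it is fixed | CAPA record, change linkage, effectiveness evidence |

Active learning can supply inputs, but the endpoint of closed-loop learning is not "more training data". It is "the production problem is corrected and the evidence is auditable".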

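1.3 The Role of a Senior PM / BA / Architect

| Role capability | What it looks like |
| --- | --- |
| Product judgment | Judges whether feedback indicates customer harm, a departure from risk appetite or a broken product promise |
| BA discipline | Defines issue taxonomy, state machine, fields, SLA, stakeholder approval and evidence requirements |
| Architecture thinking | Connects the feedback ledger, model registry, prompt repo, vector index, tool policy, case system and evidence binder |
| Governance fluency | Brings Model Risk, Compliance, Legal, Operations and the Business Owner into the right gates |
| Outcome focus | Defines success as improvement in customer, control, performance and recurrence metrics, not as "the model was changed" |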

2. Operating Model

2.1 Main Closed-Loop Flow

Signals
  complaints | appeals | overrides | expert reviews | eval failures | drift alerts | incidents | audits
    -> signal normalization
    -> AI contribution matching
    -> issue taxonomy and severity
    -> containment and customer protection
    -> root cause analysis
    -> corrective and preventive action
    -> linked change artifacts
    -> release and approval gate
    -> effectiveness verification
    -> recurrence monitoring
    -> closure and management reporting

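2.2 What Enters the Corrective Action Ledger

| Candidate issue | Entry condition |
| --- | --- |
| Customer complaint | Points to an AI-involved answer, routing, denial, delay, explanation, fee, account restriction or service failure |
| Human override spike | Employees keep overriding the same model, prompt, RAG answer, classification or agent action |
| Expert QA failure | High-risk output lacks evidence, violates policy, is misclassified or cannot be explained |
| Eval regression | A locked eval, gold set, red-team, fairness, RAG citation or tool safety gate fails |
| Drift alert | A distribution, score, decision, outcome, embedding, knowledge or segment metric crosses the action threshold |
| Incident | AI-related production incident, security incident, privacy incident, customer harm or operational disruption |
| Audit / validation finding | Internal audit, model validation, compliance, regulatory readiness or third-party assessment finds a control gap |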

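2.3 Issue State Machine

| Status | Entry criteria | Exit criteria |
| --- | --- | --- |
| New signal | Signal received, AI contribution not yet confirmed | AI contribution matching completed |
| Triage | Relevant AI system, journey, customer impact and initial severity identified | Owner, SLA and next step assigned |
| Contained | Temporary control executed, or explicitly judged unnecessary | Customer protection, degradation, pause, human review or risk acceptance recorded |
| RCA in progress | Root cause hypotheses under validation | Root cause classification and evidence reach an actionable level |
| Action approved | Corrective / preventive action decided | Change request, approval and verification plan established |
| Implemented | Change deployed or SOP updated | Verification window entered |
| Verifying | Effectiveness evidence being collected | Closure gate passed, or issue reopened |
| Closed | Fix effective, evidence complete, residual risk accepted | Included in trend reporting and management review |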
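The status table can be read as a small state machine. The following is a minimal Python sketch, not a reference implementation. It assumes transitions follow the table order, and it assumes (the table does not say) that a reopened issue returns from Verifying to RCA in progress. The names `Status`, `ALLOWED` and `advance` are illustrative.

```python
from enum import Enum


class Status(Enum):
    NEW_SIGNAL = "New signal"
    TRIAGE = "Triage"
    CONTAINED = "Contained"
    RCA_IN_PROGRESS = "RCA in progress"
    ACTION_APPROVED = "Action approved"
    IMPLEMENTED = "Implemented"
    VERIFYING = "Verifying"
    CLOSED = "Closed"


# Allowed transitions, in table order. Assumption: a failed closure gate
# reopens the issue into RCA; a real workflow might reopen elsewhere.
ALLOWED = {
    Status.NEW_SIGNAL: {Status.TRIAGE},
    Status.TRIAGE: {Status.CONTAINED},
    Status.CONTAINED: {Status.RCA_IN_PROGRESS},
    Status.RCA_IN_PROGRESS: {Status.ACTION_APPROVED},
    Status.ACTION_APPROVED: {Status.IMPLEMENTED},
    Status.IMPLEMENTED: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED, Status.RCA_IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} is not allowed")
    return target
```

Encoding the transitions as data makes "Implemented but never verified" impossible to record as Closed, which is the failure this state machine is meant to prevent.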

3. Feedback Taxonomy

3.1 Signal taxonomy

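| Signal category | Examples | Main value | Typical risk |
| --- | --- | --- | --- |
| Customer complaint | CFPB complaints, internal complaints, customer-service escalations, social media escalations | Customer-perceived harm and journey breakpoints | Customer narratives may not contain the full technical facts |
| Appeal / dispute | Credit appeals, payment disputes, account restriction reviews, KYC appeals | Automation errors and quality of the recovery path | Covers only customers able to appeal |
| Human override | Employees change an AI classification, summary, recommendation, answer or risk rating | Exposes untrustworthy AI, unclear policy or misleading UI | An override may reflect employee preference rather than fact |
| Expert review | SME QA, second-line review, model validation sample | High-quality labels, policy interpretation and eval evidence | High cost, requires reviewer calibration |
| Eval failure | locked test, red-team, gold set, RAG claim QA, tool simulation fail | Release-gate and regression-test signal | May overfit to one eval set |
| Drift / monitoring | feature drift, score drift, complaint spike, override trend, RAG freshness | Changes in production assumptions and early warning | A statistical difference does not necessarily mean customer impact |
| Incident | Privacy leak, faulty batch processing, agent tool misuse, mass wrongful denial | Requires containment and customer recovery | After-the-fact logs may be incomplete |
| Audit / validation finding | Model validation, internal audit, compliance, third-party assessment | Control design or operating defects | Closure may lean toward documentation rather than substantive repair |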

3.2 Feedback usage policy

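| Feedback type | Usable for training | Usable for eval | Usable for RCA | Usable for audit evidence |
| --- | --- | --- | --- | --- |
| Confirmed outcome label | Yes, with lineage and quality gates | May enter a future eval, kept isolated | | |
| Human override | With caution, after reviewing reason and checking for bias | May generate a review sample | | |
| Complaint | Usually not used directly for training | May be turned into a harm scenario and eval case | | |
| Expert adjudication | Yes, with calibration | Yes, with lock and versioning | | |
| Eval failure | Do not train directly on the locked eval itself | | | |
| Drift alert | Not used directly for training | May trigger sampling | | |
| Audit finding | Not used directly for training | May form a control eval | | |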

3.3 Signal quality grading

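| Grade | Criteria | Handling |
| --- | --- | --- |
| A - Actionable evidence | Has customer impact, AI version, evidence, a reproducible path and an owner | Enters RCA and action planning |
| B - Strong indicator | Has a trend, samples and plausible AI contribution, but evidence is incomplete | Gather evidence, sample for review, monitor temporarily |
| C - Weak signal | Single narrative or statistical fluctuation, customer impact unconfirmed | Keep in trend data, do not trigger a change directly |
| D - Noise / duplicate | Duplicate, irrelevant or cannot be linked to an AI system | Merge or close, retain the audit rationale |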
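The grading criteria can be sketched as a simple decision function. This is one possible reading of the table under stated assumptions: the boolean fields on the hypothetical `SignalFacts` type are illustrative, and the order of checks (D first, then A, B, C) is an interpretation, not something the table prescribes.

```python
from dataclasses import dataclass


@dataclass
class SignalFacts:
    linked_to_ai_system: bool
    is_duplicate: bool
    has_customer_impact: bool
    has_ai_version: bool
    has_evidence: bool
    is_reproducible: bool
    has_owner: bool
    has_trend_and_samples: bool


def grade(facts: SignalFacts) -> str:
    """Map signal facts to grade A, B, C or D."""
    # D: duplicate, irrelevant, or cannot be tied to an AI system.
    if facts.is_duplicate or not facts.linked_to_ai_system:
        return "D"
    # A: every element of actionable evidence is present.
    if all([facts.has_customer_impact, facts.has_ai_version,
            facts.has_evidence, facts.is_reproducible, facts.has_owner]):
        return "A"
    # B: a trend with samples, but the evidence set is incomplete.
    if facts.has_trend_and_samples:
        return "B"
    # C: single narrative or statistical fluctuation.
    return "C"
```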

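3.4 Key Fields

| Field | Meaning | Example |
| --- | --- | --- |
| signal_id | Unique signal identifier | SIG-2026-06-00182 |
| ai_system_id | AI system or capability ID | servicing_rag_fee_policy |
| artifact_versions | Model, prompt, retriever, tool and policy versions | prompt_v4.8, index_2026_06_12 |
| customer_journey_step | Position in the customer journey | fee_dispute_chat_answer |
| customer_impact | Funds, access, explanation, delay, privacy, fairness and so on | wrong explanation and missed escalation |
| segment | Product, channel, region, language, customer lifecycle | mobile, credit_card, Spanish |
| source_evidence | Pointers to complaint, log, QA, eval, case and monitor records | complaint_case_99102, trace_8831 |
| proposed_usage | train, eval, RCA, monitor, audit, exclude | RCA + eval scenario |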
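As an illustration, the fields above could be carried in a typed record that rejects unknown usage tags at intake. This is a sketch assuming Python 3.9+; the class name `SignalRecord`, the list-valued field types and the rule that `exclude` cannot be combined with other tags are assumptions, not part of the field table.

```python
from dataclasses import dataclass, field

ALLOWED_USAGE = {"train", "eval", "RCA", "monitor", "audit", "exclude"}


@dataclass
class SignalRecord:
    signal_id: str
    ai_system_id: str
    artifact_versions: list[str]
    customer_journey_step: str
    customer_impact: str
    segment: list[str]
    source_evidence: list[str]
    proposed_usage: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        unknown = set(self.proposed_usage) - ALLOWED_USAGE
        if unknown:
            raise ValueError(f"unknown usage tags: {sorted(unknown)}")
        # Assumption: "exclude" cannot be combined with any other usage.
        if "exclude" in self.proposed_usage and len(self.proposed_usage) > 1:
            raise ValueError("'exclude' cannot be combined with other tags")


# Example values are taken from the field table above.
record = SignalRecord(
    signal_id="SIG-2026-06-00182",
    ai_system_id="servicing_rag_fee_policy",
    artifact_versions=["prompt_v4.8", "index_2026_06_12"],
    customer_journey_step="fee_dispute_chat_answer",
    customer_impact="wrong explanation and missed escalation",
    segment=["mobile", "credit_card", "Spanish"],
    source_evidence=["complaint_case_99102", "trace_8831"],
    proposed_usage=["RCA", "eval"],
)
```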

4. CAPA Workflow

4.1 End-to-end workflow

1. Detect signal
2. Confirm AI contribution
3. Classify issue and severity
4. Contain customer and operational risk
5. Assign owner and SLA
6. Perform root cause analysis
7. Define corrective action and preventive action
8. Link actions to change artifacts
9. Approve and release changes
10. Verify effectiveness
11. Monitor recurrence
12. Close with evidence and residual risk decision

4.2 Stage detail

| Stage | Required decision | Output artifact |
| --- | --- | --- |
| Detect | Is this signal relevant to an AI-involved process | Signal record |
| Match | Which AI system, version, policy and journey were involved | AI contribution packet |
| Classify | What issue type and severity apply | Issue taxonomy record |
| Contain | What immediate protection is required | Containment order or risk acceptance |
| RCA | What root cause explains the issue | RCA worksheet |
| Plan | What corrective and preventive actions are needed | CAPA plan |
| Change | Which artifact changes and who approves | Change request and release plan |
| Verify | What evidence proves effect | Verification plan |
| Close | Who accepts residual risk and closure | Closure memo |

4.3 Severity matrix

| Severity | Trigger | Required response |
| --- | --- | --- |
| L0 - Learning signal | No customer impact, useful for improvement | Track trend, consider sampling, no formal CAPA |
| L1 - Control weakness | Internal QA, eval, drift or audit issue without confirmed customer harm | RCA, backlog with due date, verification evidence |
| L2 - Customer inconvenience | Wrong explanation, delay, repeat effort or small financial impact | Containment, customer review, corrective action, KRI monitoring |
| L3 - Material customer harm | Access denial, funds impact, privacy, complaint escalation, fairness concern | Incident-style CAPA, customer remediation, executive reporting |
| L4 - Systemic / regulatory risk | Batch harm, protected-class disparity, major privacy / funds / regulator issue | Crisis protocol, legal hold, executive risk committee, regulator-ready evidence |

4.4 Containment patterns

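| Pattern | When it applies | Example |
| --- | --- | --- |
| Pause automation | A high-impact path may keep harming customers | Pause automated RAG answers about dispute deadlines |
| Narrow scope | Only some segments or intents are affected | Turn off automated Spanish-language fee waiver answers |
| Raise human gate | The model is usable but risk has risen | Route all high-value disputes to human review |
| Revert artifact | A new prompt, model, index or tool policy has regressed | Roll back to the previous retriever config |
| Customer recovery | Customer impact has already occurred | Reopen cases, refund fees, restore access, send correction notices |
| Evidence hold | External complaint, privacy or regulatory risk | Freeze logs, communications, versions, approvals and samples |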

4.5 Closure gates

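| Gate | Pass criteria |
| --- | --- |
| Scope gate | Affected customers, segments, journeys, versions and time window identified |
| Root cause gate | Root cause classification is evidenced, not a blanket attribution to "the AI is inaccurate" |
| Change gate | Every action links to a specific artifact, owner, version and approval |
| Verification gate | Before / after results and production KRIs prove the effect |
| Customer recovery gate | Customer funds, entitlements, access, explanation or service path restored |
| Recurrence gate | No threshold breach within the recurrence monitoring window, or residual risk formally accepted |
| Evidence gate | Audit evidence is complete and can reconstruct discovery, decision, fix and closure |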
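The closure gates behave like an all-must-pass checklist. A minimal sketch, assuming each gate outcome is recorded as a boolean and that a gate with no recorded outcome counts as failed; the identifiers `GATES`, `failed_gates` and `can_close` are hypothetical.

```python
GATES = [
    "scope",
    "root_cause",
    "change",
    "verification",
    "customer_recovery",
    "recurrence",
    "evidence",
]


def failed_gates(results: dict[str, bool]) -> list[str]:
    """Return gates that failed or were never recorded, in gate order."""
    return [gate for gate in GATES if not results.get(gate, False)]


def can_close(results: dict[str, bool]) -> bool:
    """An issue may close only when every gate has an explicit pass."""
    return not failed_gates(results)
```

Treating a missing gate as a failure is a deliberate choice in this sketch: a gate that does not apply (for example customer recovery when no customer was affected) would be recorded as passed with a rationale rather than silently skipped.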

5. Root Cause Taxonomy

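5.1 Root Cause Classification

| Root cause family | Typical root causes | Fix pattern |
| --- | --- | --- |
| Data quality | Wrong, missing or late fields, mapping changes, identity merge errors | data contract, validation, backfill, lineage repair |
| Dataset coverage | Training or eval lacks key segment, language, product or edge-case samples | targeted sampling, eval expansion, coverage dashboard |
| Label / taxonomy | Unclear label definitions, reviewer disagreement, inconsistent policy interpretation | taxonomy governance, adjudication, label guideline update |
| Model behavior | Calibration failure, concept drift, stale thresholds, high-confidence errors | recalibration, retraining, threshold governance, human gate |
| Prompt behavior | Overconfidence, too few refusals, unstable format, no escalation path | prompt policy, rubric update, response contract, handoff rule |
| RAG / knowledge | Stale sources, wrong retrieval, poor chunking, citations that do not support the claim, unclear permissions | source governance, index rebuild, retriever/reranker change |
| Tool / agent | Overly broad tool permissions, unclear schema, state machine gaps, no approval gate | tool contract, permission boundary, approval workflow |
| Workflow | Wrong queue priority, no appeal entry point, duplicate submissions, broken handoff | service blueprint redesign, case routing, SLA control |
| Policy | Ambiguous business rules, risk appetite not updated, inconsistent reason codes | policy clarification, rule update, approval matrix |
| Human adoption | Employees over-rely on AI, skip checks, do not record override reasons | training, UI friction, supervisor QA, reason capture |
| Vendor / third party | Vendor model changes, insufficient logs, unclear SLA, weak exit path | contract control, change notice, evidence rights, fallback |
| Governance | Unclear owner, missing gates, no one accepts residual risk | RACI, release gate, management review, control library |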

5.2 RCA depth test

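| Shallow statement | Better RCA question |
| --- | --- |
| "Model hallucination" | Why did the system allow an unsupported claim into a customer answer |
| "The data is inaccurate" | Which data contract failed, and why did validation not catch it |
| "The user phrased it unusually" | Does the eval cover that intent, language and journey |
| "The employee did not review" | How did the UI, SOP, training and supervision allow over-reliance |
| "The policy changed" | Why did the source owner, effective date and RAG freshness gate not trigger |
| "The vendor pushed an update" | Why did the contract, change notification and regression test not cover it |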

5.3 Five-why pattern for AI

Issue: Customer received wrong fee waiver answer.
Why 1: RAG answer cited superseded policy.
Why 2: Retriever ranked old policy summary above current policy.
Why 3: Source metadata did not mark old policy as inactive.
Why 4: Knowledge owner SOP required upload of new policy but not deactivation of superseded content.
Why 5: Release gate tested answer correctness, but not source lifecycle controls.
Corrective action: update metadata and retriever filter.
Preventive action: add source retirement SOP and freshness gate to release checklist.
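The corrective action in this drill ("update metadata and retriever filter") could look roughly like the sketch below: a lifecycle filter applied to candidate sources before ranking. The `SourceDoc` type, its field names and the document IDs are hypothetical, and a real retriever would apply the same rule inside its own metadata filter rather than in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SourceDoc:
    doc_id: str
    effective_from: date
    superseded_on: Optional[date] = None  # None means still current


def is_current(doc: SourceDoc, as_of: date) -> bool:
    """A source is usable if it is in effect and not yet superseded."""
    if doc.effective_from > as_of:
        return False
    return doc.superseded_on is None or doc.superseded_on > as_of


def filter_current(docs: list[SourceDoc], as_of: date) -> list[SourceDoc]:
    """Drop superseded or not-yet-effective sources before ranking."""
    return [doc for doc in docs if is_current(doc, as_of)]


docs = [
    SourceDoc("fee_policy_2025", date(2025, 1, 1), superseded_on=date(2026, 6, 1)),
    SourceDoc("fee_policy_2026", date(2026, 6, 1)),
]
current = filter_current(docs, as_of=date(2026, 6, 30))
```

The filter keys on effective and superseded dates rather than upload date, which is the control gap Why 3 to Why 5 point at.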

6. Change Linkage

6.1 Evidence graph

signal_id
  -> issue_id
  -> harm_or_control_classification
  -> root_cause_id
  -> corrective_action_id
  -> preventive_action_id
  -> change_request_id
  -> artifact_version
  -> eval_run_id
  -> release_approval_id
  -> monitoring_window_id
  -> closure_memo_id
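One way to use this graph is to walk it in order and report where traceability first breaks. A minimal sketch, assuming each issue is stored as a flat dictionary keyed by the link names above; the function name `first_break` is illustrative.

```python
from typing import Optional

CHAIN = [
    "signal_id",
    "issue_id",
    "harm_or_control_classification",
    "root_cause_id",
    "corrective_action_id",
    "preventive_action_id",
    "change_request_id",
    "artifact_version",
    "eval_run_id",
    "release_approval_id",
    "monitoring_window_id",
    "closure_memo_id",
]


def first_break(record: dict) -> Optional[str]:
    """Return the first missing or empty link in the chain, or None if complete."""
    for link in CHAIN:
        if not record.get(link):
            return link
    return None
```

A record that returns `"eval_run_id"` here, for example, has a released change with no linked evaluation run.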

6.2 Artifact linkage matrix

| Artifact | Version evidence | Typical approver |
| --- | --- | --- |
| Dataset | dataset hash, label guideline, split policy, lineage | Data Owner, Model Owner, Model Risk |
| Eval set | eval version, locked samples, scenario map, pass criteria | Model Validation, Product, Compliance |
| Prompt | prompt diff, policy mapping, regression output | Product, AI Platform, Compliance for high-risk |
| Model | model card, training data, validation report, calibration | Model Owner, Model Risk, Business Owner |
| RAG index | source list, effective dates, chunking, retriever config | Knowledge Owner, Compliance, AI Platform |
| Tool policy | schema, permissions, approval gates, logs | Security, Product, Operations |
| Workflow SOP | process map, SLA, handoff, case state | Operations, Product, Risk |
| Customer communication | template, legal review, accessibility review | Legal, Compliance, CX |

6.3 Traceability anti-patterns

| Anti-pattern | Why it fails |
| --- | --- |
| “Issue closed because PR merged” | Code change does not prove customer or control outcome improved |
| “Prompt fixed in playground” | No production version, no approval, no regression evidence |
| “Data refreshed” | No lineage, no label quality, no eval protection |
| “Model retrained” | Root cause may be workflow, policy or RAG source, not model |
| “Monitoring added” | Monitoring does not correct existing harm or prove fix |

7. Effectiveness Verification

7.1 Verification hierarchy

| Evidence type | Strength | Use |
| --- | --- | --- |
| Unit / contract test | Low to medium | Confirms schema, prompt format, tool permission, source freshness |
| Locked regression eval | Medium | Confirms known failure modes no longer fail |
| Counterfactual replay | Medium to high | Runs historical affected cases through fixed path |
| Shadow / canary | High | Observes production-like performance before full rollout |
| Human QA sample | High when calibrated | Confirms customer-facing quality and policy adherence |
| Production KRI | High | Confirms complaints, overrides, appeals, drift or harm signals improve |
| Customer remediation proof | Required for harm | Confirms affected customers were restored |

7.2 Metric families

| Metric family | Example |
| --- | --- |
| Quality | accuracy, citation support, unsupported claim rate, tool success rate |
| Risk | false decline, appeal upheld, adverse action defect, privacy exposure |
| Operations | override rate, queue SLA, escalation miss, rework, duplicate contact |
| Customer | complaint rate, repeat contact, abandonment, remediation SLA |
| Fairness / segment | disparity in denial, delay, appeal upheld, language support |
| Governance | evidence completeness, approval timeliness, recurrence, overdue CAPA |

7.3 Closure rule examples

| Issue type | Closure evidence |
| --- | --- |
| RAG stale policy | locked eval pass, source freshness proof, 2-week complaint trend below threshold |
| Fraud false positive spike | segment false decline proxy improves, analyst override declines, affected customers reviewed |
| Credit explanation defect | reason code parity test passes, adverse action QA pass, complaints stabilize |
| Agent tool misuse | tool permission test passes, simulation shows no unauthorized state transition, audit logs complete |
| Data pipeline skew | online-offline parity restored, backfill complete, affected decisions assessed |

7.4 Recurrence monitoring

| Severity | Minimum monitoring window | Review cadence |
| --- | --- | --- |
| L1 | 14 days or next release cycle | Weekly |
| L2 | 30 days | Weekly with Product and Ops |
| L3 | 60 days | Twice weekly until stable, then weekly |
| L4 | 90 days or executive-defined | Executive risk committee cadence |
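The minimum windows translate directly into an earliest possible closure date. A small sketch under stated assumptions: the window is counted from the implementation date, and the "or next release cycle" and "or executive-defined" alternatives for L1 and L4 are not modeled. The function name `earliest_closure` is hypothetical.

```python
from datetime import date, timedelta

# Minimum recurrence monitoring window in days, per severity.
MIN_WINDOW_DAYS = {"L1": 14, "L2": 30, "L3": 60, "L4": 90}


def earliest_closure(severity: str, implemented_on: date) -> date:
    """Earliest date the recurrence gate can pass for a given severity."""
    if severity not in MIN_WINDOW_DAYS:
        raise ValueError(f"no monitoring window defined for {severity!r}")
    return implemented_on + timedelta(days=MIN_WINDOW_DAYS[severity])


# An L2 fix implemented on 2026-06-30 cannot close before 2026-07-30.
closure = earliest_closure("L2", date(2026, 6, 30))
```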

8. Update Governance

8.1 Dataset update governance

| Control | Requirement |
| --- | --- |
| Lineage | Every added sample has source, time, AI version, policy version and usage tag |
| Label quality | Reviewer role, rubric, confidence, adjudication and evidence captured |
| Split protection | Train, calibration, eval, gold, monitoring and legal hold are separated |
| Bias control | Segment coverage and selection bias reviewed before training |
| Approval | High-impact dataset changes reviewed by Model Risk or validation function |
| Rollback | Prior dataset version and model artifact remain recoverable |

8.2 Prompt update governance

| Control | Requirement |
| --- | --- |
| Prompt diff | Store exact before / after system, developer and task instructions |
| Policy mapping | Each high-risk instruction maps to approved policy or control |
| Regression | Run locked scenario tests, refusal tests, citation tests and escalation tests |
| Human review | SME reviews high-risk answer changes, not only prompt text |
| Rollout | Use canary or limited cohort for customer-facing prompts |
| Monitoring | Watch unsupported claim, escalation miss, complaint and override signals |

8.3 Model update governance

| Control | Requirement |
| --- | --- |
| Change reason | Link model change to root cause and expected benefit |
| Validation | Compare challenger against champion on overall, segment, calibration and fairness metrics |
| Stress testing | Include drifted segments, edge cases and high customer impact cases |
| Independent challenge | Model Risk or validation reviews assumptions for high-impact models |
| Decision threshold | Threshold changes approved with customer, operations and risk impact |
| Post-release | Monitor performance, overrides, appeals, complaints and drift |

8.4 RAG update governance

| Control | Requirement |
| --- | --- |
| Source authority | Only approved sources can support customer-impact claims |
| Effective date | Current and superseded documents must be explicit |
| Chunking / metadata | Chunk policy preserves definitions, exceptions, deadlines and eligibility |
| Retrieval evaluation | Test source recall, citation support, conflict handling and permissions |
| Knowledge owner | Business / Compliance owner signs off source lifecycle |
| Deactivation | Superseded content is removed, demoted or blocked with audit trail |

8.5 Tool and agent update governance

| Control | Requirement |
| --- | --- |
| Tool contract | Input schema, output schema, side effects and failure modes defined |
| Permission boundary | Agent can only call tools needed for approved purpose |
| Approval gate | High-impact actions require human approval or policy confirmation |
| State machine | Agent workflow prevents skipped review, duplicate action and dead-end automation |
| Simulation | Run scenario tests for incorrect tool choice, retries, timeout and rollback |
| Audit log | Every tool call links to user/session, model, prompt, action and result |
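The permission boundary and approval gate controls can be illustrated with a small authorization check. This is a sketch only: the tool names, the purpose string and the three-valued result are hypothetical, and a production system would enforce this in the tool gateway and write each decision to the audit log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed_purposes: frozenset
    high_impact: bool


# Hypothetical tool registry for a fee-dispute agent.
POLICIES = {
    "lookup_fee_policy": ToolPolicy(frozenset({"fee_dispute"}), high_impact=False),
    "issue_refund": ToolPolicy(frozenset({"fee_dispute"}), high_impact=True),
}


def authorize(tool: str, purpose: str, human_approved: bool = False) -> str:
    """Return 'allow', 'needs_approval' or 'deny' for a proposed tool call."""
    policy = POLICIES.get(tool)
    # Permission boundary: unknown tools and off-purpose calls are denied.
    if policy is None or purpose not in policy.allowed_purposes:
        return "deny"
    # Approval gate: high-impact actions wait for a human decision.
    if policy.high_impact and not human_approved:
        return "needs_approval"
    return "allow"
```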

8.6 Policy and workflow update governance

| Control | Requirement |
| --- | --- |
| Policy versioning | Eligibility, dispute, fee, credit, KYC and complaint policies have effective dates |
| SOP alignment | Human review SOP matches AI routing and escalation behavior |
| Reason code integrity | Customer explanations trace to approved reason or source |
| Case management | Corrective actions update actual customer records, not only AI artifacts |
| Training | Employees receive workflow and review training when AI behavior changes |
| Auditability | Process map, owner, SLA and evidence fields are updated together |

9. Dashboards and KRIs

9.1 Executive dashboard

| Metric | Why executives need it |
| --- | --- |
| Open CAPA by severity | Shows unresolved risk exposure |
| Aging CAPA | Reveals governance or owner bottlenecks |
| Repeat issue rate | Shows whether fixes prevent recurrence |
| Customer harm linked to AI | Connects AI quality to customer impact |
| Overdue verification | Prevents “implemented but unproven” closure |
| Residual risk acceptances | Shows what risk leadership has accepted |

9.2 Product and operations dashboard

| Metric | Slice |
| --- | --- |
| Complaint rate after AI interaction | product, channel, intent, version |
| Appeal upheld / reversal rate | decision type, segment, model version |
| Human override rate | team, output type, reason, prompt/model version |
| Escalation miss rate | journey step, risk tier, language |
| Remediation SLA | severity, owner, customer segment |
| Repeat contact and abandonment | journey, device, channel |

9.3 Model risk dashboard

| Metric | Slice |
| --- | --- |
| Eval failure by scenario | model, prompt, RAG, tool, release |
| Drift-to-action conversion | alert type, owner, action |
| Dataset update lineage completeness | dataset version, use case |
| Segment performance change | protected or high-control segment where lawful |
| Independent challenge findings | model family, issue type |
| Validation exceptions and closure | severity, due date |

9.4 Data / RAG / tool dashboard

| Metric | Meaning |
| --- | --- |
| Data contract violations | Bad inputs that can cause downstream AI defects |
| Feature freshness delay | Real-time decision risk |
| Source freshness | RAG policy staleness |
| Citation support pass rate | Whether answer claims are grounded |
| Tool failure / retry / unauthorized attempt | Agent reliability and security risk |
| Permission boundary violations | Potential privacy or action authority issue |

9.5 KRI thresholds

| KRI | Green | Amber | Red |
| --- | --- | --- | --- |
| Repeat issue rate | No recurrence in window | Single low-severity recurrence | Same root cause causes L2+ issue |
| Evidence completeness L3+ | 100% required evidence | Non-critical evidence delayed | Critical log, customer notice or approval missing |
| Overdue CAPA | None past SLA | L1/L2 past SLA | L3/L4 past SLA |
| Unsupported RAG claim | Below baseline | 1.5x baseline or high-risk sample | 2x baseline or customer harm |
| Human override spike | Within control band | 2-week upward trend | Spike plus complaint or appeal signal |
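As a worked example, the Unsupported RAG claim row can be expressed as a threshold function. The table leaves rates between baseline and 1.5x baseline unspecified; this sketch assumes they stay green, which a real policy may want to tighten. The function and parameter names are illustrative.

```python
def unsupported_claim_status(
    rate: float,
    baseline: float,
    high_risk_sample: bool = False,
    customer_harm: bool = False,
) -> str:
    """Return 'green', 'amber' or 'red' for the unsupported RAG claim KRI."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = rate / baseline
    # Red: 2x baseline or any confirmed customer harm.
    if customer_harm or ratio >= 2.0:
        return "red"
    # Amber: 1.5x baseline or an unsupported claim in a high-risk sample.
    if high_risk_sample or ratio >= 1.5:
        return "amber"
    # Assumption: anything below 1.5x baseline is treated as green.
    return "green"
```

With a baseline of 2.0 percent, a measured rate of 3.0 percent is amber and 4.0 percent is red, while a confirmed customer harm is red regardless of rate.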

10. RACI

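| Activity | Product | BA | Architect | Data Science | AI Platform | Ops | Model Risk | Compliance / Legal | Data Gov | Security | Business Owner |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Signal intake design | A | R | C | C | C | R | C | C | C | C | A |
| Taxonomy and workflow | A | R | C | C | C | R | C | C | C | C | A |
| AI contribution matching | C | R | A | R | R | C | C | C | C | C | A |
| Severity decision | A | R | C | C | C | R | C | C | C | C | A |
| Customer containment | A | C | C | C | C | R | C | C | C | C | A |
| Root cause analysis | A | R | R | R | R | C | C | C | C | C | A |
| Dataset change approval | C | C | C | R | C | C | A | C | A | C | A |
| Prompt / RAG change approval | A | C | R | C | R | C | C | C | C | C | A |
| Model change approval | C | C | C | R | C | C | A | C | C | C | A |
| Tool / agent permission change | A | C | R | C | R | C | C | C | C | A | A |
| Effectiveness verification | A | R | C | R | R | R | C | C | C | C | A |
| Closure and residual risk | A | R | C | C | C | C | A | C | C | C | A |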

Legend: R = responsible, A = accountable, C = consulted.


11. Templates

11.1 Corrective Action Intake

| Field | Rule | Example |
| --- | --- | --- |
| Issue title | Customer or control impact in plain language | RAG gave stale annual fee waiver guidance |
| Signal source | Complaint, override, eval, drift, audit, incident | complaint + RAG eval failure |
| AI system | Registered system and version | servicing_rag v4.8, index_2026_06_12 |
| Journey step | Where customer or employee saw output | mobile chat fee dispute answer |
| Impact | Customer, risk, operations or control impact | wrong explanation, possible missed escalation |
| Severity | L0-L4 with rationale | L2 due to customer-facing wrong policy |
| Containment | Immediate protection | route fee waiver intents to human review |
| Owner | Named accountable business owner | Credit Card Servicing Product Owner |
| Due date | SLA based on severity | 2026-07-03 |

11.2 RCA Worksheet

| Field | Rule | Example |
| --- | --- | --- |
| Confirmed failure | Observable failure, not opinion | Answer cited superseded fee policy |
| Direct cause | Immediate technical or process cause | retriever ranked old policy summary first |
| Contributing causes | Other enabling weaknesses | source lifecycle SOP lacked deactivation step |
| Control failure | Why control did not catch it | freshness gate checked upload date, not effective date |
| Root cause family | Use taxonomy | RAG / knowledge + governance |
| Evidence | Logs, samples, policies, evals | trace_8831, complaint_99102, eval_fee_v1 |
| Corrective action | Fix current defect | deactivate old source and update retriever filter |
| Preventive action | Stop recurrence | source retirement SOP and release gate update |

11.3 Change Linkage Memo

| Section | Content rule | Example |
| --- | --- | --- |
| Issue link | issue_id and severity | ISS-2026-06-014, L2 |
| Root cause | concise root cause | current-source priority missing |
| Artifact changed | exact artifact and version | retriever_config v5.3, source_registry 2026-06-30 |
| Approval | who approved and why | Knowledge Owner + Product + Compliance |
| Expected effect | measurable outcome | unsupported fee claim below 5% |
| Rollback | rollback condition | complaint spike or eval pass below 95% |
| Monitoring | production KRI and window | 30-day fee complaint and override monitor |

11.4 Effectiveness Verification Memo

| Field | Evidence standard | Example |
| --- | --- | --- |
| Baseline | Before metric with time window | unsupported claim 14% on fee eval v1 |
| Post-fix result | After metric with same or stronger method | 3% on locked fee eval v2 |
| Segment check | Key segments verified | mobile, contact center, Spanish |
| Customer signal | Complaints, appeals, overrides | fee complaint back to baseline for 2 weeks |
| Control signal | New gate or monitoring works | freshness gate blocks superseded docs |
| Residual risk | Accepted remaining risk | low risk for rarely used legacy fee terms |
| Closure decision | approver and date | Product + Model Risk closure on 2026-07-30 |

11.5 Executive Status Memo

| Section | Good content |
| --- | --- |
| Decision needed | Approve closure, extend monitoring, pause automation, fund fix or accept residual risk |
| Customer impact | Plain-language customer effect and affected count |
| Root cause | One sentence, with system and control cause |
| Actions completed | Corrective and preventive actions, not activity list |
| Evidence | Before / after results and production KRI |
| Open risk | What remains, who owns it and review date |
| Recommendation | Clear management action |

12. 30-Day Lab

Goal: within 30 days, turn closed-loop learning from a concept into demonstrable retail financial services AI product, architecture and governance assets. Readers are assumed to have a senior BA / PM / architecture foundation, so there are no basic flowchart exercises.

| Day | Topic | Output |
| --- | --- | --- |
| 1 | Choose a retail financial services AI use case, define the customer journey and AI touchpoints | AI touchpoint map |
| 2 | Read NIST AI RMF, map Govern / Map / Measure / Manage | governance mapping note |
| 3 | Read the ISO/IEC 42001 overview, extract management system controls | AI management system control map |
| 4 | Read SR 26-2, summarize the 2026 model risk management nuance | model risk nuance memo |
| 5 | Analyze which feedback signals the CFPB complaint database can supply | complaint signal taxonomy |
| 6 | Design the signal intake schema | corrective action intake schema |
| 7 | Design the feedback usage policy | train / eval / RCA / audit usage matrix |
| 8 | Design the issue taxonomy | issue taxonomy and examples |
| 9 | Design the severity matrix | L0-L4 response matrix |
| 10 | Design containment patterns | pause, narrow, gate, revert, remediate playbook |
| 11 | Design the root cause taxonomy | data/model/prompt/RAG/tool/workflow/policy/governance map |
| 12 | Complete a Five-why RCA drill | RCA worksheet |
| 13 | Design the change linkage model | issue-to-change evidence graph |
| 14 | Design dataset update governance | dataset change checklist |
| 15 | Design prompt update governance | prompt diff and regression gate |
| 16 | Design model update governance | challenger validation and threshold memo |
| 17 | Design RAG update governance | source freshness and citation support gate |
| 18 | Design tool / agent update governance | tool permission and state machine gate |
| 19 | Design the effectiveness verification hierarchy | verification evidence matrix |
| 20 | Design recurrence monitoring windows | monitoring plan by severity |
| 21 | Design the executive dashboard | CAPA severity and aging dashboard mock |
| 22 | Design the operations dashboard | complaint, override, appeal, remediation view |
| 23 | Design the model risk dashboard | eval, drift, validation and closure view |
| 24 | Write the RACI | role matrix |
| 25 | Write the Corrective Action Intake template | completed example |
| 26 | Write the RCA + Change Linkage memo | completed example |
| 27 | Write the Effectiveness Verification memo | completed example |
| 28 | Run a tabletop exercise: RAG stale policy incident | incident simulation pack |
| 29 | Prepare the interview story | STAR-T answers |
| 30 | Complete the portfolio package | architecture diagram, workflow, dashboards, templates, executive memo |

13. Interview Answers

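13.1 What Is the Difference Between Closed-Loop Learning and Active Learning?

30-second answer:

Active learning selects the samples most worth human labeling in order to improve the model. Closed-loop learning turns production feedback, complaints, overrides, eval failures, drift, incidents and audit findings into governed issues, finds the root cause, links it to specific changes, then proves the fix works and prevents recurrence. It is closer to CAPA for AI.

2-minute expansion:

In retail financial services, feedback cannot be treated as training data by default. A complaint may be a customer harm signal, an override may reflect employee distrust of the AI, an eval failure may expose a release-gate gap, and drift may reflect a change in data, policy or behavior. My design first performs signal normalization and AI contribution matching, then classifies severity and issue type. RCA follows and decides whether to fix data, prompt, model, RAG, tool, workflow or policy. Every action links to a specific version and approval, and locked evals, production KRIs, complaint trends and recurrence monitoring prove the effect.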

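13.2 How Would You Design a CAPA-like AI Corrective Action Workflow?

30-second answer:

I would design a loop of detect, match, classify, contain, RCA, action, change, verify, monitor and close. What matters is not a good-looking flowchart but that every step has an owner, an SLA, evidence and closure criteria.

2-minute expansion:

Step one collects signals: complaints, appeals, human overrides, expert QA, eval failures, drift, incidents and audit findings. Step two confirms whether an AI system and version took part in the customer journey. Step three grades the issue by customer impact and control risk. High-risk scenarios get containment first, for example pausing automated answers, raising human review or restoring customers. Then comes root cause analysis against the taxonomy, distinguishing data, model, prompt, RAG, tool, workflow, policy, vendor and governance. Fixes must go through change management and link to a dataset, prompt, model, index, tool policy or SOP. Finally, eval, replay, canary, QA, production KRIs and the recurrence window verify the fix, and the issue closes only when the evidence is complete.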

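13.3 How Do You Keep the Feedback Loop from Amplifying Bias?

30-second answer:

Separate feedback by usage, record sampling probabilities and business actions, protect the eval set, review coverage and harm by segment, and keep random sentinel samples. Not every override or complaint can go straight into training.

2-minute expansion:

If a fraud model is trained only on transactions the model blocked, it never sees the missed fraud among approved transactions. If credit explanations are trained only on feedback from customers who appeal, the silent harm to vulnerable customers is underestimated. My design tags feedback with a usage: train, eval, RCA, monitor, audit or exclude. Training data needs lineage, label quality and segment coverage. The eval set must be locked so that failing samples are not tuned against repeatedly until they pass. Production monitoring covers complaints, appeal upheld, overrides, abandonment and segment disparity. That makes closed-loop learning governed learning rather than model self-reinforcement.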
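To make the point about recording sampling probabilities concrete, here is a minimal sketch of self-normalized inverse-probability weighting, a standard way to correct a rate estimate when some cases were far more likely to be reviewed than others. The technique is swapped in for illustration and is not prescribed by this playbook; the sample data are invented, with blocked transactions always reviewed (p = 1.0) and approved ones reviewed as a 25% random sentinel sample.

```python
def ipw_rate(samples: list[tuple[int, float]]) -> float:
    """Self-normalized inverse-probability-weighted rate.

    Each sample is (label, sampling_probability), where label is 1 for a
    confirmed positive and sampling_probability is the recorded chance
    that the case was selected for review.
    """
    if any(p <= 0 or p > 1 for _, p in samples):
        raise ValueError("sampling probabilities must be in (0, 1]")
    weighted_positives = sum(label / p for label, p in samples)
    total_weight = sum(1 / p for _, p in samples)
    return weighted_positives / total_weight


# Invented data: two blocked cases (always reviewed, both fraud) and
# three sentinel cases from approved traffic (one fraud).
samples = [(1, 1.0), (1, 1.0), (0, 0.25), (0, 0.25), (1, 0.25)]

naive_rate = sum(label for label, _ in samples) / len(samples)
corrected_rate = ipw_rate(samples)
```

The naive rate over reviewed cases is 0.6, while the weighted estimate is 6/14 (about 0.43), because each sentinel case stands in for four approved transactions. Without the recorded probabilities this correction is not possible.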

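13.4 How Do You Prove That a Prompt or RAG Fix Actually Works?

30-second answer:

Saying the prompt was changed is not enough. I want to see the prompt diff, policy mapping, a passing locked eval, production QA sampling, a drop in related complaints and overrides, and no issue of the same class within the recurrence window.

2-minute expansion:

Take a RAG stale-policy case. I first confirm whether the root cause is source lifecycle, retriever ranking, the prompt citation instruction or the handoff rule. After the fix, the eval cases are locked and must show that answer claims are supported by currently effective sources, that citations directly support the answer, and that high-risk intents are refused or escalated to a human. I also run a production canary and human QA, and monitor related complaints, agent overrides, unsupported claims, source freshness and escalation misses. If customers were affected, I also need proof that customer notification, correction and compensation are complete. The issue closes only when quality, customer and control metrics all pass.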

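13.5 How Does SR 26-2 Affect AI Closed-Loop Learning?

30-second answer:

SR 26-2 is an important 2026 anchor for bank model risk management, emphasizing risk orientation, lifecycle, governance, independent challenge and issue remediation. The nuance is that it formally focuses on model risk management, so closed-loop learning for generative and agentic AI still has to layer on the AI management system, consumer compliance, privacy, security, third-party and operational controls.

2-minute expansion:

I use the spirit of SR 26-2 to require full lifecycle governance for model-related issues: issue identification, risk assessment, validation, change control, independent challenge, documentation and closure evidence. For generative AI or an agentic workflow, I do not only ask whether it falls under the traditional model definition. I start from customer impact and controls and add NIST AI RMF, ISO/IEC 42001, complaint signals, security, privacy, third-party and operational resilience. In other words, SR 26-2 supplies model risk discipline, but the AI corrective action architecture has to cover a broader system.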

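13.6 Why Does a Root Cause Taxonomy Matter?

30-second answer:

Without a root cause taxonomy, teams attribute every problem to "the model is inaccurate" or "the prompt needs changing". The real root cause in financial AI may be data, labels, the RAG source, tool permission, workflow, policy, vendor or governance.

2-minute expansion:

The right root cause determines the right fix. Suppose a customer receives a wrong fee explanation. On the surface the RAG answered incorrectly. The real root cause may be that an old policy was never deactivated, the retriever has no effective-date filter, the prompt does not require current sources, or the service process does not escalate conflicting answers. If only the prompt changes, the problem recurs. I would establish a root cause taxonomy and require the RCA to state the direct cause, contributing causes, control failure and preventive action. That way closed-loop learning fixes the system, not just the output.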

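13.7 Which Metrics Show That Corrective Action Is Working?

30-second answer:

I look at open CAPA by severity, aging, repeat issue rate, overdue verification, evidence completeness, complaint / appeal / override trend, post-fix recurrence and residual risk acceptance.

2-minute expansion:

A mature system cannot just count closed tickets. I check whether high-severity issues were contained on time, whether RCA is classified by root cause, whether actions link to specific artifacts, whether verification finished as planned, whether customers were restored and whether recurrence is falling. For product and operations, I look at complaints, appeal upheld, overrides, abandonment and remediation SLA. For model risk, I look at eval regression, drift-to-action, dataset lineage and validation exception closure. For executives, I look at overdue CAPA, repeat root causes, residual risk and investment gaps.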

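13.8 How Do You Handle the Relationship Between Customer Complaints and Model Training?

30-second answer:

A complaint is a high-value customer harm signal, but it cannot be used directly as a training label. Case review, AI contribution matching, root cause analysis and usage tagging come first.

2-minute expansion:

Customer complaints reflect experience and impact, but the narrative may be incomplete or emotional. My approach is to enter the complaint into the harm and corrective action ledger and link it to the AI trace, case record, policy version and journey step. If review confirms the AI output was wrong, the complaint can produce an eval scenario, RCA evidence and a customer remediation case, and where necessary a training label. Before it enters training, it needs expert adjudication, privacy review, a segment bias check and dataset lineage. The more direct value of complaints is exposing broken product promises and defects in the recovery path.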

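13.9 How Do You Turn Closed-Loop Learning into a Portfolio Piece?

30-second answer:

I would show one complete case: signal taxonomy, CAPA workflow, root cause taxonomy, change linkage, effectiveness dashboard, RACI and executive memo, rather than only model metrics.

2-minute expansion:

Take a credit card servicing RAG stale-policy case. The portfolio first shows the customer journey and AI touchpoints, then shows how complaints, agent overrides, RAG eval failures and a drift cluster enter the corrective action ledger. It then uses the root cause taxonomy to show the problem is not simply the prompt but source lifecycle, retrieval ranking and the handoff gate. Next it shows change linkage: source registry, retriever config, prompt version, eval set and release approval. Finally it proves effectiveness with locked eval, production QA, complaint trend, override decline and the recurrence window. This demonstrates PM, BA, architecture and governance capability together.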

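13.10 How Do You Respond if the Team Says "We Already Have a Monitoring Dashboard"?

30-second answer:

A dashboard only detects problems. It is not a closed loop. I want to see whether each alert has an owner, severity, RCA, action, change linkage, verification, closure and recurrence monitoring.

2-minute expansion:

Many AI teams have polished drift, latency and quality dashboards, but nobody acts after the alert, or the action leaves no evidence that the fix worked. My standard is drift-to-action and signal-to-CAPA. Every red metric must map to a triage owner and runbook, every high-risk issue must link to a specific change, and every closure must carry before / after evidence and residual risk acceptance. Otherwise the dashboard is an observation system, not a governance system.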


14. Final Operating View

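Mature AI closed-loop learning does not mean letting the model "automatically absorb feedback". Maturity means the organization can continuously demonstrate that: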

We know where AI touches customers and operations.
We capture complaints, appeals, overrides, expert reviews, eval failures, drift, incidents and audits.
We classify issues by customer impact, control weakness and root cause.
We protect customers before permanent fixes are shipped.
We link each corrective action to a specific dataset, prompt, model, RAG, tool, workflow or policy change.
We verify fixes with locked evals, replay, QA, production KRIs and recurrence monitoring.
We prevent eval leakage, feedback bias and undocumented changes.
We can show executives, auditors and regulators what failed, why it failed, what changed, and why it is safe to close.

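For CBAP+, AI PMs, AI BAs and AI Architects, this is the core capability that moves from "the AI system keeps being optimized" to "the AI organization keeps learning, correcting, proving and preventing recurrence".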