AI Closed-Loop Learning / Corrective Action Playbook

AI Closed-Loop Learning / Corrective Action Architecture Playbook

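Positioning: For CBAP+ practitioners, AI Product Managers, AI BAs, AI Product Architects, Model Risk, Compliance, Customer Operations, Data Governance, AI Platform and financial retail business owners. The focus is not the basic claim that “collecting feedback makes the model better”, but how to turn feedback, complaints, human overrides, expert reviews, eval failures, drift signals, incidents and audit findings into a closed-loop corrective system that is governed and backed by root causes, changes, verification and evidence.

Scope: This playbook covers fraud, credit, KYC / AML, payments dispute, customer servicing RAG, complaints, collections, wealth servicing, internal copilot, agentic workflow and AI-assisted operations. It designs closed-loop learning as a CAPA-like corrective and preventive action architecture and does not reduce it to active learning, RLHF, automatic retraining or prompt tweaking.

Important note: This document is learning, portfolio and internal solution-training material. It does not constitute legal advice, a compliance conclusion, an audit opinion, a model validation report, a regulatory interpretation or any specific institution's policy. Formal projects must be confirmed by Legal, Compliance, Model Risk, Fair Lending, Privacy, Security, Third Party Risk, Business Owner, Operations, Customer Experience, Data Governance and management in light of institution type, jurisdiction, product scope, customer impact and internal policy. Access dates are recorded as 2026-06-30.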

Source Anchors

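The official sources below are the governance anchors for this playbook. It translates them into product, process, architecture, evidence and management language, and does not treat any framework as equivalent to a regulatory compliance conclusion.

Anchor	Official link	How this playbook uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize ownership, risk scenarios, measurement, treatment, residual risk and continuous-improvement evidence for closed-loop learning.
ISO/IEC 42001 AI management system	https://www.iso.org/standard/42001	Uses the AI management system lens to design objectives, controls, operating records, internal audit, management review, continual improvement and the corrective action loop.
Federal Reserve SR 26-2	https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm	Serves as the 2026 anchor for bank model risk management, emphasizing a risk-based approach, the model lifecycle, independent challenge, governance, issue remediation and documentation. Nuance: SR 26-2 supersedes SR 11-7 / SR 21-8 and takes a more risk-based approach tiered by institution size and model materiality; its formal scope focuses on model risk management at banking organizations, so generative AI, agentic AI, non-model automation and customer-facing AI still need broader AI governance, consumer compliance, privacy, security, third-party, operational and product controls.
CFPB Consumer Complaint Database	https://www.consumerfinance.gov/data-research/consumer-complaints/	Treats consumer complaint topics, narratives, company responses and trends as inputs to customer harm, root cause, remediation and external signal calibration.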

Source-to-artifact mapping:

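Source lens	Target artifact	Interview phrasing
NIST AI RMF	AI issue register, risk mapping, measurement dashboard, corrective action workflow, residual risk memo	“I use Govern / Map / Measure / Manage to manage the feedback loop, rather than treating feedback as a data pile.”
ISO/IEC 42001	AI management system procedure, corrective action SOP, management review pack, internal audit evidence	“Closed-loop learning has to live inside the management system, where internal audit, management review and continual improvement can track it.”
SR 26-2	Model issue management, model change log, independent challenge, validation evidence, closure record	“Fixing bank AI is not just the model team changing code; you also have to evidence the issue, root cause, change, verification and residual risk treatment.”
CFPB complaints	Complaint taxonomy, customer harm trigger, trend monitor, external complaint reconciliation	“Complaints are production feedback signals, not customer-service noise. They calibrate whether internal monitoring is missing customer harm.”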

1. Executive Framing

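1.1 One-sentence positioning

Closed-Loop Learning / Corrective Action Architecture =
turn production feedback and risk signals into governed issues,
link them through root cause analysis to specific AI / data / policy / process changes,
then prove with independent, reproducible, auditable evidence that the fix works and recurrence is prevented.

For executives, it answers four questions:

How do we know an AI system is causing errors, harm, control weaknesses or performance degradation?
How do we determine whether the root cause is the model, data, prompt, RAG, tool, process, policy, vendor or governance?
How do we ensure the fix is applied to the right artifact instead of staying in meeting minutes?
How do we prove the fix works, customers are restored, risk has dropped and the same class of issue has stopped occurring?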

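1.2 It is not active learning

Capability	Core question	Typical output
Active learning	Which samples are most worth expert labeling	label queue, dataset update
Human feedback operations	Who reviews, how to label, and how to assure label quality	reviewer workflow, calibration, adjudication
Drift monitoring	Whether production distribution and performance have changed	alert, triage, response
Customer harm remediation	Whether customers were harmed and how to restore them	recourse case, compensation, notification
Closed-loop corrective action	Why it failed, what to fix, and how to prove it is fixed	CAPA record, change linkage, effectiveness evidence

Active learning can supply inputs, but the endpoint of closed-loop learning is not “a bit more training data”; it is “the production issue is corrected and the evidence is auditable”.

1.3 The role of the senior PM / BA / Architect

Role capability	What it looks like
Product judgment	Judges whether feedback represents customer harm, deviation from risk appetite or a broken product promise
BA discipline	Defines issue taxonomy, state machine, fields, SLA, stakeholder approval and evidence requirements
Architecture thinking	Connects the feedback ledger, model registry, prompt repo, vector index, tool policy, case system and evidence binder
Governance fluency	Brings Model Risk, Compliance, Legal, Operations and the Business Owner into the right gates
Outcome focus	Defines success not as “the model was changed” but as improvement in customer, control, performance and recurrence metrics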

2. Operating Model

2.1 Main closed-loop flow

Signals
  complaints | appeals | overrides | expert reviews | eval failures | drift alerts | incidents | audits
    -> signal normalization
    -> AI contribution matching
    -> issue taxonomy and severity
    -> containment and customer protection
    -> root cause analysis
    -> corrective and preventive action
    -> linked change artifacts
    -> release and approval gate
    -> effectiveness verification
    -> recurrence monitoring
    -> closure and management reporting

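2.2 What enters the corrective action ledger

Candidate issue	Entry condition
Customer complaint	Points to an AI-involved answer, routing, denial, delay, explanation, fee, account restriction or service failure
Human override spike	Employees repeatedly override the same model, prompt, RAG answer, classification or agent action
Expert QA failure	High-risk output lacks evidence, violates policy, is misclassified or cannot be explained
Eval regression	A locked eval, gold set, red-team, fairness, RAG citation or tool safety gate fails
Drift alert	A distribution, score, decision, outcome, embedding, knowledge or segment metric crosses the action threshold
Incident	AI-related production incident, security incident, privacy incident, customer harm or operational disruption
Audit / validation finding	Internal audit, model validation, compliance, regulatory readiness or third-party assessment finds a control gap

2.3 Issue state machine

Status	Entry criteria	Exit criteria
New signal	Signal received but AI contribution not yet confirmed	AI contribution matching completed
Triage	Relevant AI system, journey, customer impact and initial severity identified	Owner, SLA and next step assigned
Contained	Temporary control executed, or explicitly deemed unnecessary	Customer protection, degradation, pause, human review or risk acceptance recorded
RCA in progress	Root cause hypotheses under validation	Root cause classification and evidence reach an actionable level
Action approved	Corrective / preventive action decided	Change request, approval and verification plan established
Implemented	Change deployed or SOP updated	Enters the verification window
Verifying	Effectiveness evidence being collected	Passes the closure gate or is reopened
Closed	Fix effective, evidence complete, residual risk accepted	Fed into trend analysis and management review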
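
The state machine above can be sketched as a transition table that rejects skipped gates. This is a minimal illustration, not a prescribed implementation: the `Status` enum, `ALLOWED` map and `transition` helper are hypothetical names, and the reopen edge (Verifying back to RCA in progress) is an assumption, because the table does not name the status a reopened issue returns to.

```python
from enum import Enum


class Status(Enum):
    NEW_SIGNAL = "New signal"
    TRIAGE = "Triage"
    CONTAINED = "Contained"
    RCA_IN_PROGRESS = "RCA in progress"
    ACTION_APPROVED = "Action approved"
    IMPLEMENTED = "Implemented"
    VERIFYING = "Verifying"
    CLOSED = "Closed"


# Forward edges follow the table order. The reopen target is an assumption.
ALLOWED = {
    Status.NEW_SIGNAL: {Status.TRIAGE},
    Status.TRIAGE: {Status.CONTAINED},
    Status.CONTAINED: {Status.RCA_IN_PROGRESS},
    Status.RCA_IN_PROGRESS: {Status.ACTION_APPROVED},
    Status.ACTION_APPROVED: {Status.IMPLEMENTED},
    Status.IMPLEMENTED: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED, Status.RCA_IN_PROGRESS},
    Status.CLOSED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the target status, or raise if the move skips a gate."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

With this table, an issue cannot jump from Implemented to Closed without passing through Verifying, which is the “implemented but unproven” closure the playbook warns against.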

3. Feedback Taxonomy

3.1 Signal taxonomy

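Signal category	Examples	Primary value	Typical risk
Customer complaint	CFPB complaints, internal complaints, customer-service escalations, social media escalations	Customer-perceived harm and journey breakpoints	Customer narratives may not contain the full technical facts
Appeal / dispute	Credit appeals, payment disputes, account restriction reviews, KYC appeals	Automation errors and quality of the recovery path	Covers only customers who are able to appeal
Human override	Employees change an AI classification, summary, recommendation, answer or risk rating	Exposes untrustworthy AI, unclear policy or misleading UI	An override may reflect employee preference rather than fact
Expert review	SME QA, second-line review, model validation sample	High-quality labels, policy interpretation and eval evidence	High cost; requires reviewer calibration
Eval failure	locked test, red-team, gold set, RAG claim QA, tool simulation fail	Release gate and regression test signal	May overfit to a particular eval set
Drift / monitoring	feature drift, score drift, complaint spike, override trend, RAG freshness	Changes in production assumptions and early warning	A statistical difference does not necessarily mean customer impact
Incident	Privacy leak, faulty batch processing, agent tool misuse, mass false declines	Requires containment and customer recovery	After-the-fact logs may be incomplete
Audit / validation finding	Model validation, internal audit, compliance, third-party assessment	Control design or operating deficiencies	Closure may lean toward documentation rather than substantive fix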

3.2 Feedback usage policy

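Feedback type	Usable for training	Usable for eval	Usable for RCA	Usable for audit evidence
Confirmed outcome label	Yes, with lineage and quality gates	May enter future eval; must be isolated	Yes	Yes
Human override	With caution; requires reason review and bias check	Can generate review samples	Yes	Yes
Complaint	Usually not used directly for training	Can be converted into harm scenarios and eval cases	Yes	Yes
Expert adjudication	Yes, with calibration	Yes, with lock and versioning	Yes	Yes
Eval failure	Never train on the locked eval itself	Yes	Yes	Yes
Drift alert	Not used directly for training	Can trigger sampling	Yes	Yes
Audit finding	Not used directly for training	Can form a control eval	Yes	Yes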

3.3 Signal quality grading

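Grade	Criteria	Handling
A - Actionable evidence	Has customer impact, AI version, evidence, a reproducible path and an owner	Enter RCA and action planning
B - Strong indicator	Has a trend, samples and a plausible AI contribution, but evidence is incomplete	Gather evidence, sample for review, monitor temporarily
C - Weak signal	Single narrative or statistical fluctuation; customer impact unconfirmed	Keep in trend tracking; do not trigger a change directly
D - Noise / duplicate	Duplicate, irrelevant or cannot be linked to an AI system	Merge or close; retain the audit rationale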
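
As a rough sketch, the grading table can be expressed as an ordered rule check. The `SignalEvidence` fields and the `grade_signal` function are hypothetical, and reducing each criterion to a boolean is a simplifying assumption; in practice a triage reviewer would make these calls with supporting evidence.

```python
from dataclasses import dataclass


@dataclass
class SignalEvidence:
    linked_to_ai_system: bool        # can the signal be tied to a registered AI system
    is_duplicate: bool
    customer_impact_confirmed: bool
    ai_version_known: bool
    reproducible: bool
    has_owner: bool
    has_trend_or_sample: bool        # trend or sample supporting a plausible AI contribution


def grade_signal(e: SignalEvidence) -> str:
    """Return the quality grade A-D, checking the strictest exclusions first."""
    if e.is_duplicate or not e.linked_to_ai_system:
        return "D"
    if (e.customer_impact_confirmed and e.ai_version_known
            and e.reproducible and e.has_owner):
        return "A"
    if e.has_trend_or_sample:
        return "B"
    return "C"
```

Checking D before A keeps duplicates out of RCA even when they carry complete evidence, since the original signal already holds it.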

3.4 Key fields

Field	Meaning	Example
signal_id	Unique signal identifier	`SIG-2026-06-00182`
ai_system_id	AI system or capability ID	`servicing_rag_fee_policy`
artifact_versions	Model, prompt, retriever, tool and policy versions	`prompt_v4.8`, `index_2026_06_12`
customer_journey_step	Position in the customer journey	`fee_dispute_chat_answer`
customer_impact	Funds, access, explanation, delay, privacy, fairness and so on	`wrong explanation and missed escalation`
segment	Product, channel, region, language, customer lifecycle	`mobile, credit_card, Spanish`
source_evidence	Pointers to complaint, log, QA, eval, case or monitor	`complaint_case_99102`, `trace_8831`
proposed_usage	train, eval, RCA, monitor, audit, exclude	`RCA + eval scenario`
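
One way to make these fields enforceable at intake is a typed record with validation. The sketch below assumes Python 3.9+; the `SignalRecord` class is hypothetical, and two rules are assumptions not stated in the table: `exclude` cannot be combined with other usage tags, and at least one evidence pointer is required.

```python
from dataclasses import dataclass, field

# Usage tags listed in the proposed_usage row of the table.
ALLOWED_USAGE = {"train", "eval", "RCA", "monitor", "audit", "exclude"}


@dataclass
class SignalRecord:
    signal_id: str
    ai_system_id: str
    artifact_versions: list[str]
    customer_journey_step: str
    customer_impact: str
    segment: list[str]
    source_evidence: list[str]
    proposed_usage: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        unknown = self.proposed_usage - ALLOWED_USAGE
        if unknown:
            raise ValueError(f"Unknown usage tags: {sorted(unknown)}")
        # Assumption: 'exclude' is exclusive of every other usage.
        if "exclude" in self.proposed_usage and len(self.proposed_usage) > 1:
            raise ValueError("'exclude' cannot be combined with other usage tags")
        # Assumption: a signal without evidence pointers is not accepted.
        if not self.source_evidence:
            raise ValueError("A signal needs at least one evidence pointer")


example = SignalRecord(
    signal_id="SIG-2026-06-00182",
    ai_system_id="servicing_rag_fee_policy",
    artifact_versions=["prompt_v4.8", "index_2026_06_12"],
    customer_journey_step="fee_dispute_chat_answer",
    customer_impact="wrong explanation and missed escalation",
    segment=["mobile", "credit_card", "Spanish"],
    source_evidence=["complaint_case_99102", "trace_8831"],
    proposed_usage={"RCA", "eval"},
)
```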

4. CAPA Workflow

4.1 End-to-end workflow

1. Detect signal
2. Confirm AI contribution
3. Classify issue and severity
4. Contain customer and operational risk
5. Assign owner and SLA
6. Perform root cause analysis
7. Define corrective action and preventive action
8. Link actions to change artifacts
9. Approve and release changes
10. Verify effectiveness
11. Monitor recurrence
12. Close with evidence and residual risk decision

4.2 Stage detail

Stage	Required decision	Output artifact
Detect	Is this signal relevant to an AI-involved process	Signal record
Match	Which AI system, version, policy and journey were involved	AI contribution packet
Classify	What issue type and severity apply	Issue taxonomy record
Contain	What immediate protection is required	Containment order or risk acceptance
RCA	What root cause explains the issue	RCA worksheet
Plan	What corrective and preventive actions are needed	CAPA plan
Change	Which artifact changes and who approves	Change request and release plan
Verify	What evidence proves effect	Verification plan
Close	Who accepts residual risk and closure	Closure memo

4.3 Severity matrix

Severity	Trigger	Required response
L0 - Learning signal	No customer impact, useful for improvement	Track trend, consider sampling, no formal CAPA
L1 - Control weakness	Internal QA, eval, drift or audit issue without confirmed customer harm	RCA, backlog with due date, verification evidence
L2 - Customer inconvenience	Wrong explanation, delay, repeat effort or small financial impact	Containment, customer review, corrective action, KRI monitoring
L3 - Material customer harm	Access denial, funds impact, privacy, complaint escalation, fairness concern	Incident-style CAPA, customer remediation, executive reporting
L4 - Systemic / regulatory risk	Batch harm, protected-class disparity, major privacy / funds / regulator issue	Crisis protocol, legal hold, executive risk committee, regulator-ready evidence
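
The severity matrix resolves to the highest level whose trigger is present. A minimal sketch follows, assuming the triggers have already been assessed and reduced to booleans; `ImpactFacts` and `classify_severity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ImpactFacts:
    systemic_or_regulatory: bool = False   # batch harm, protected-class disparity, regulator issue
    material_harm: bool = False            # access denial, funds impact, privacy, fairness concern
    customer_inconvenience: bool = False   # wrong explanation, delay, small financial impact
    control_weakness: bool = False         # QA, eval, drift or audit issue, no confirmed harm


def classify_severity(facts: ImpactFacts) -> str:
    """Return the highest severity level (L0-L4) whose trigger is present."""
    if facts.systemic_or_regulatory:
        return "L4"
    if facts.material_harm:
        return "L3"
    if facts.customer_inconvenience:
        return "L2"
    if facts.control_weakness:
        return "L1"
    return "L0"
```

Evaluating from L4 downward means an issue that is both a control weakness and a material harm is handled at L3, never diluted to L1.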

4.4 Containment patterns

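Pattern	When to use	Example
Pause automation	A high-impact path may keep harming customers	Pause automated RAG answers about dispute deadlines
Narrow scope	Only some segments or intents are affected	Turn off automated Spanish-language fee waiver answers
Raise human gate	The model is usable but risk has increased	Route all high-value disputes to human review
Revert artifact	A new prompt, model, index or tool policy regressed	Roll back to the previous retriever config
Customer recovery	Customer impact has already occurred	Reopen cases, refund fees, restore access, send correction notices
Evidence hold	External complaint, privacy or regulatory risk	Freeze logs, communications, versions, approvals and samples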

4.5 Closure gates

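Gate	Pass criteria
Scope gate	Affected customers, segments, journeys, versions and time window are identified
Root cause gate	Root cause classification is evidenced, not a generic attribution to “the AI is inaccurate”
Change gate	Every action links to a specific artifact, owner, version and approval
Verification gate	Before / after results and production KRIs prove the effect
Customer recovery gate	Customer funds, entitlements, access, explanations or service path are restored
Recurrence gate	No threshold breach within the recurrence monitoring window, or residual risk is formally accepted
Evidence gate	Audit evidence is complete and can reconstruct detection, decision, fix and closure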
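
The closure gates lend themselves to a simple all-must-pass check. The sketch below assumes Python 3.9+ and uses hypothetical names (`GATES`, `failed_gates`, `can_close`). It also assumes the customer recovery gate applies only when customer harm occurred, which is an interpretation rather than something the table states.

```python
# Gate keys mirror the seven closure gates in the table.
GATES = [
    "scope", "root_cause", "change", "verification",
    "customer_recovery", "recurrence", "evidence",
]


def failed_gates(results: dict[str, bool], customer_harm: bool) -> list[str]:
    """List gates that block closure; a missing result counts as failed."""
    # Assumption: customer recovery is only required when harm occurred.
    required = [g for g in GATES if g != "customer_recovery" or customer_harm]
    return [g for g in required if not results.get(g, False)]


def can_close(results: dict[str, bool], customer_harm: bool) -> bool:
    return not failed_gates(results, customer_harm)
```

Treating a missing gate result as a failure keeps an issue open until someone records an explicit pass.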

5. Root Cause Taxonomy

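5.1 Root cause classification

Root cause family	Typical root causes	Fix pattern
Data quality	Wrong fields, missing values, latency, mapping changes, identity merge errors	data contract, validation, backfill, lineage repair
Dataset coverage	Training or eval lacks key segments, languages, products or boundary samples	targeted sampling, eval expansion, coverage dashboard
Label / taxonomy	Unclear label definitions, reviewer disagreement, inconsistent policy interpretation	taxonomy governance, adjudication, label guideline update
Model behavior	Calibration failure, concept drift, threshold breakdown, high-confidence errors	recalibration, retraining, threshold governance, human gate
Prompt behavior	Overconfidence, insufficient refusal, unstable format, no escalation path	prompt policy, rubric update, response contract, handoff rule
RAG / knowledge	Stale sources, wrong retrieval, poor chunking, citations that do not support the claim, unclear permissions	source governance, index rebuild, retriever/reranker change
Tool / agent	Overly broad tool permissions, unclear schema, state machine gaps, no approval gate	tool contract, permission boundary, approval workflow
Workflow	Wrong queue priority, no appeal entry point, duplicate submissions, broken handoff	service blueprint redesign, case routing, SLA control
Policy	Ambiguous business rules, risk appetite not updated, inconsistent reason codes	policy clarification, rule update, approval matrix
Human adoption	Employees over-rely on AI, skip checks, do not record override reasons	training, UI friction, supervisor QA, reason capture
Vendor / third party	Vendor model changes, insufficient logs, unclear SLA, weak exit path	contract control, change notice, evidence rights, fallback
Governance	Unclear owner, missing gates, residual risk accepted by no one	RACI, release gate, management review, control library

5.2 RCA depth test

Shallow statement	Better RCA question
“Model hallucination”	Why did the system allow an unevidenced claim into a customer answer
“The data is inaccurate”	Which data contract failed, and why did validation not catch it
“The user phrased it unusually”	Did the eval cover that intent, language and journey
“The employee did not review”	Why did the UI, SOP, training and supervision allow over-reliance
“The policy changed”	Why did the source owner, effective date and RAG freshness gate not trigger
“The vendor pushed an update”	Why did the contract, change notification and regression test not cover it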

5.3 Five-why pattern for AI

Issue: Customer received wrong fee waiver answer.
Why 1: RAG answer cited superseded policy.
Why 2: Retriever ranked old policy summary above current policy.
Why 3: Source metadata did not mark old policy as inactive.
Why 4: Knowledge owner SOP required upload of new policy but not deactivation of superseded content.
Why 5: Release gate tested answer correctness, but not source lifecycle controls.
Corrective action: update metadata and retriever filter.
Preventive action: add source retirement SOP and freshness gate to release checklist.

6. Change Linkage

6.1 Evidence graph

signal_id
  -> issue_id
  -> harm_or_control_classification
  -> root_cause_id
  -> corrective_action_id
  -> preventive_action_id
  -> change_request_id
  -> artifact_version
  -> eval_run_id
  -> release_approval_id
  -> monitoring_window_id
  -> closure_memo_id
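
A traceability check over this chain can be as simple as listing the links that are still empty. The sketch below assumes Python 3.9+ and a flat record keyed by the identifiers above; `CHAIN` and `missing_links` are hypothetical names, and a real evidence graph would likely hold several actions and changes per issue.

```python
from typing import Optional

# Link names in the order shown in the evidence graph.
CHAIN = [
    "signal_id", "issue_id", "harm_or_control_classification",
    "root_cause_id", "corrective_action_id", "preventive_action_id",
    "change_request_id", "artifact_version", "eval_run_id",
    "release_approval_id", "monitoring_window_id", "closure_memo_id",
]


def missing_links(record: dict[str, Optional[str]]) -> list[str]:
    """Return chain links that are absent or empty, in chain order."""
    return [link for link in CHAIN if not record.get(link)]
```

An issue whose only populated links are `signal_id` and `change_request_id` would surface the “closed because PR merged” anti-pattern immediately: root cause, eval run and approval are all missing.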

6.2 Artifact linkage matrix

Artifact	Version evidence	Typical approver
Dataset	dataset hash, label guideline, split policy, lineage	Data Owner, Model Owner, Model Risk
Eval set	eval version, locked samples, scenario map, pass criteria	Model Validation, Product, Compliance
Prompt	prompt diff, policy mapping, regression output	Product, AI Platform, Compliance for high-risk
Model	model card, training data, validation report, calibration	Model Owner, Model Risk, Business Owner
RAG index	source list, effective dates, chunking, retriever config	Knowledge Owner, Compliance, AI Platform
Tool policy	schema, permissions, approval gates, logs	Security, Product, Operations
Workflow SOP	process map, SLA, handoff, case state	Operations, Product, Risk
Customer communication	template, legal review, accessibility review	Legal, Compliance, CX

6.3 Traceability anti-patterns

Anti-pattern	Why it fails
“Issue closed because PR merged”	Code change does not prove customer or control outcome improved
“Prompt fixed in playground”	No production version, no approval, no regression evidence
“Data refreshed”	No lineage, no label quality, no eval protection
“Model retrained”	Root cause may be workflow, policy or RAG source, not model
“Monitoring added”	Monitoring does not correct existing harm or prove fix

7. Effectiveness Verification

7.1 Verification hierarchy

Evidence type	Strength	Use
Unit / contract test	Low to medium	Confirms schema, prompt format, tool permission, source freshness
Locked regression eval	Medium	Confirms known failure modes no longer fail
Counterfactual replay	Medium to high	Runs historical affected cases through fixed path
Shadow / canary	High	Observes production-like performance before full rollout
Human QA sample	High when calibrated	Confirms customer-facing quality and policy adherence
Production KRI	High	Confirms complaints, overrides, appeals, drift or harm signals improve
Customer remediation proof	Required for harm	Confirms affected customers were restored

7.2 Metric families

Metric family	Example
Quality	accuracy, citation support, unsupported claim rate, tool success rate
Risk	false decline, appeal upheld, adverse action defect, privacy exposure
Operations	override rate, queue SLA, escalation miss, rework, duplicate contact
Customer	complaint rate, repeat contact, abandonment, remediation SLA
Fairness / segment	disparity in denial, delay, appeal upheld, language support
Governance	evidence completeness, approval timeliness, recurrence, overdue CAPA

7.3 Closure rule examples

Issue type	Closure evidence
RAG stale policy	locked eval pass, source freshness proof, 2-week complaint trend below threshold
Fraud false positive spike	segment false decline proxy improves, analyst override declines, affected customers reviewed
Credit explanation defect	reason code parity test passes, adverse action QA pass, complaints stabilize
Agent tool misuse	tool permission test passes, simulation shows no unauthorized state transition, audit logs complete
Data pipeline skew	online-offline parity restored, backfill complete, affected decisions assessed

7.4 Recurrence monitoring

Severity	Minimum monitoring window	Review cadence
L1	14 days or next release cycle	Weekly
L2	30 days	Weekly with Product and Ops
L3	60 days	Twice weekly until stable, then weekly
L4	90 days or executive-defined	Executive risk committee cadence
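
The minimum windows translate directly into a date calculation. This sketch uses only the day counts from the table; the alternatives “next release cycle” (L1) and “executive-defined” (L4) are not modeled, and `MIN_WINDOW_DAYS` and `monitoring_window_end` are hypothetical names.

```python
from datetime import date, timedelta

# Minimum monitoring window in days per severity, taken from the table.
MIN_WINDOW_DAYS = {"L1": 14, "L2": 30, "L3": 60, "L4": 90}


def monitoring_window_end(severity: str, implemented_on: date) -> date:
    """Earliest date the recurrence window can end for a given severity."""
    if severity not in MIN_WINDOW_DAYS:
        raise ValueError(f"No monitoring window defined for {severity}")
    return implemented_on + timedelta(days=MIN_WINDOW_DAYS[severity])
```

For example, an L2 fix implemented on 2026-06-30 cannot close before 2026-07-30.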

8. Update Governance

8.1 Dataset update governance

Control	Requirement
Lineage	Every added sample has source, time, AI version, policy version and usage tag
Label quality	Reviewer role, rubric, confidence, adjudication and evidence captured
Split protection	Train, calibration, eval, gold, monitoring and legal hold are separated
Bias control	Segment coverage and selection bias reviewed before training
Approval	High-impact dataset changes reviewed by Model Risk or validation function
Rollback	Prior dataset version and model artifact remain recoverable
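
The split protection control can be tested mechanically by checking that no sample ID appears in more than one split. A minimal sketch, assuming Python 3.9+ and that every sample has a stable ID; `split_overlaps` is a hypothetical helper.

```python
def split_overlaps(splits: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return sample IDs shared between any two splits, keyed by split pair."""
    names = sorted(splits)
    overlaps: dict[tuple[str, str], set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = splits[first] & splits[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps
```

An empty result is the pass condition. Any shared ID between `train` and `eval` or `gold` is eval leakage and should block the dataset change.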

8.2 Prompt update governance

Control	Requirement
Prompt diff	Store exact before / after system, developer and task instructions
Policy mapping	Each high-risk instruction maps to approved policy or control
Regression	Run locked scenario tests, refusal tests, citation tests and escalation tests
Human review	SME reviews high-risk answer changes, not only prompt text
Rollout	Use canary or limited cohort for customer-facing prompts
Monitoring	Watch unsupported claim, escalation miss, complaint and override signals

8.3 Model update governance

Control	Requirement
Change reason	Link model change to root cause and expected benefit
Validation	Compare challenger against champion on overall, segment, calibration and fairness metrics
Stress testing	Include drifted segments, edge cases and high customer impact cases
Independent challenge	Model Risk or validation reviews assumptions for high-impact models
Decision threshold	Threshold changes approved with customer, operations and risk impact
Post-release	Monitor performance, overrides, appeals, complaints and drift

8.4 RAG update governance

Control	Requirement
Source authority	Only approved sources can support customer-impact claims
Effective date	Current and superseded documents must be explicit
Chunking / metadata	Chunk policy preserves definitions, exceptions, deadlines and eligibility
Retrieval evaluation	Test source recall, citation support, conflict handling and permissions
Knowledge owner	Business / Compliance owner signs off source lifecycle
Deactivation	Superseded content is removed, demoted or blocked with audit trail
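
The source authority, effective date and deactivation controls can be combined into a retrieval-time filter. The sketch below is illustrative only: `SourceDoc`, `is_citable` and `citable_sources` are hypothetical names, and it assumes each source carries an approval flag, an effective date and an optional superseded date in its metadata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SourceDoc:
    doc_id: str
    approved: bool                         # source authority sign-off
    effective_from: date
    superseded_on: Optional[date] = None   # None means still current


def is_citable(doc: SourceDoc, as_of: date) -> bool:
    """True if the document may support a customer-impact claim on as_of."""
    if not doc.approved:
        return False
    if doc.effective_from > as_of:
        return False  # not yet in force
    return doc.superseded_on is None or doc.superseded_on > as_of


def citable_sources(docs: list[SourceDoc], as_of: date) -> list[SourceDoc]:
    return [d for d in docs if is_citable(d, as_of)]
```

Filtering on effective and superseded dates, rather than upload date, addresses the control failure in the fee waiver example, where the old policy stayed retrievable after the new one was uploaded.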

8.5 Tool and agent update governance

Control	Requirement
Tool contract	Input schema, output schema, side effects and failure modes defined
Permission boundary	Agent can only call tools needed for approved purpose
Approval gate	High-impact actions require human approval or policy confirmation
State machine	Agent workflow prevents skipped review, duplicate action and dead-end automation
Simulation	Run scenario tests for incorrect tool choice, retries, timeout and rollback
Audit log	Every tool call links to user/session, model, prompt, action and result
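
The permission boundary and approval gate controls can be sketched as a single authorization decision made before every tool call. `ToolPolicy`, `authorize` and the tool names in the example are hypothetical, and the block assumes Python 3.9+; a production system would also write each decision to the audit log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed_tools: frozenset[str]       # tools needed for the approved purpose
    approval_required: frozenset[str]   # high-impact tools behind a human gate


def authorize(policy: ToolPolicy, tool: str, human_approved: bool = False) -> str:
    """Return 'deny', 'needs_approval' or 'allow' for a requested tool call."""
    if tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.approval_required and not human_approved:
        return "needs_approval"
    return "allow"


# Hypothetical policy for a fee dispute servicing agent.
policy = ToolPolicy(
    allowed_tools=frozenset({"lookup_fee_policy", "refund_fee"}),
    approval_required=frozenset({"refund_fee"}),
)
```

Denying anything outside `allowed_tools` by default means a newly added tool stays unreachable until the policy version is updated and approved.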

8.6 Policy and workflow update governance

Control	Requirement
Policy versioning	Eligibility, dispute, fee, credit, KYC and complaint policies have effective dates
SOP alignment	Human review SOP matches AI routing and escalation behavior
Reason code integrity	Customer explanations trace to approved reason or source
Case management	Corrective actions update actual customer records, not only AI artifacts
Training	Employees receive workflow and review training when AI behavior changes
Auditability	Process map, owner, SLA and evidence fields are updated together

9. Dashboards and KRIs

9.1 Executive dashboard

Metric	Why executives need it
Open CAPA by severity	Shows unresolved risk exposure
Aging CAPA	Reveals governance or owner bottlenecks
Repeat issue rate	Shows whether fixes prevent recurrence
Customer harm linked to AI	Connects AI quality to customer impact
Overdue verification	Prevents “implemented but unproven” closure
Residual risk acceptances	Shows what risk leadership has accepted

9.2 Product and operations dashboard

Metric	Slice
Complaint rate after AI interaction	product, channel, intent, version
Appeal upheld / reversal rate	decision type, segment, model version
Human override rate	team, output type, reason, prompt/model version
Escalation miss rate	journey step, risk tier, language
Remediation SLA	severity, owner, customer segment
Repeat contact and abandonment	journey, device, channel

9.3 Model risk dashboard

Metric	Slice
Eval failure by scenario	model, prompt, RAG, tool, release
Drift-to-action conversion	alert type, owner, action
Dataset update lineage completeness	dataset version, use case
Segment performance change	protected or high-control segment where lawful
Independent challenge findings	model family, issue type
Validation exceptions and closure	severity, due date

9.4 Data / RAG / tool dashboard

Metric	Meaning
Data contract violations	Bad inputs that can cause downstream AI defects
Feature freshness delay	Real-time decision risk
Source freshness	RAG policy staleness
Citation support pass rate	Whether answer claims are grounded
Tool failure / retry / unauthorized attempt	Agent reliability and security risk
Permission boundary violations	Potential privacy or action authority issue

9.5 KRI thresholds

KRI	Green	Amber	Red
Repeat issue rate	No recurrence in window	Single low-severity recurrence	Same root cause causes L2+ issue
Evidence completeness L3+	100% required evidence	Non-critical evidence delayed	Critical log, customer notice or approval missing
Overdue CAPA	None past SLA	L1/L2 past SLA	L3/L4 past SLA
Unsupported RAG claim	Below baseline	1.5x baseline or high-risk sample	2x baseline or customer harm
Human override spike	Within control band	2-week upward trend	Spike plus complaint or appeal signal
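
As one example, the unsupported RAG claim row can be turned into a rating function. `rate_unsupported_claim` is a hypothetical helper. The table leaves the band between baseline and 1.5x baseline undefined; this sketch treats it as Green, which is an assumption a risk owner may want to tighten.

```python
def rate_unsupported_claim(
    rate: float,
    baseline: float,
    high_risk_sample: bool = False,
    customer_harm: bool = False,
) -> str:
    """Rate the unsupported RAG claim KRI as Green, Amber or Red."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if customer_harm or rate >= 2 * baseline:
        return "Red"
    if high_risk_sample or rate >= 1.5 * baseline:
        return "Amber"
    # Assumption: rates below 1.5x baseline are Green, including the
    # band between baseline and 1.5x that the table does not define.
    return "Green"
```

With a 5% baseline, a 14% unsupported claim rate rates Red and 8% rates Amber, while any confirmed customer harm rates Red regardless of the measured rate.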

10. RACI

Activity	Product	BA	Architect	Data Science	AI Platform	Ops	Model Risk	Compliance / Legal	Data Gov	Security	Business Owner
Signal intake design	A	R	C	C	C	R	C	C	C	C	A
Taxonomy and workflow	A	R	C	C	C	R	C	C	C	C	A
AI contribution matching	C	R	A	R	R	C	C	C	C	C	A
Severity decision	A	R	C	C	C	R	C	C	C	C	A
Customer containment	A	C	C	C	C	R	C	C	C	C	A
Root cause analysis	A	R	R	R	R	C	C	C	C	C	A
Dataset change approval	C	C	C	R	C	C	A	C	A	C	A
Prompt / RAG change approval	A	C	R	C	R	C	C	C	C	C	A
Model change approval	C	C	C	R	C	C	A	C	C	C	A
Tool / agent permission change	A	C	R	C	R	C	C	C	C	A	A
Effectiveness verification	A	R	C	R	R	R	C	C	C	C	A
Closure and residual risk	A	R	C	C	C	C	A	C	C	C	A

Legend: R = responsible, A = accountable, C = consulted.

11. Templates

11.1 Corrective Action Intake

Field	Rule	Example
Issue title	Customer or control impact in plain language	`RAG gave stale annual fee waiver guidance`
Signal source	Complaint, override, eval, drift, audit, incident	`complaint + RAG eval failure`
AI system	Registered system and version	`servicing_rag v4.8, index_2026_06_12`
Journey step	Where customer or employee saw output	`mobile chat fee dispute answer`
Impact	Customer, risk, operations or control impact	`wrong explanation, possible missed escalation`
Severity	L0-L4 with rationale	`L2 due to customer-facing wrong policy`
Containment	Immediate protection	`route fee waiver intents to human review`
Owner	Named accountable business owner	`Credit Card Servicing Product Owner`
Due date	SLA based on severity	`2026-07-03`

11.2 RCA Worksheet

Field	Rule	Example
Confirmed failure	Observable failure, not opinion	`Answer cited superseded fee policy`
Direct cause	Immediate technical or process cause	`retriever ranked old policy summary first`
Contributing causes	Other enabling weaknesses	`source lifecycle SOP lacked deactivation step`
Control failure	Why control did not catch it	`freshness gate checked upload date, not effective date`
Root cause family	Use taxonomy	`RAG / knowledge + governance`
Evidence	Logs, samples, policies, evals	`trace_8831, complaint_99102, eval_fee_v1`
Corrective action	Fix current defect	`deactivate old source and update retriever filter`
Preventive action	Stop recurrence	`source retirement SOP and release gate update`

11.3 Change Linkage Memo

Section	Content rule	Example
Issue link	issue_id and severity	`ISS-2026-06-014, L2`
Root cause	concise root cause	`current-source priority missing`
Artifact changed	exact artifact and version	`retriever_config v5.3, source_registry 2026-06-30`
Approval	who approved and why	`Knowledge Owner + Product + Compliance`
Expected effect	measurable outcome	`unsupported fee claim below 5%`
Rollback	rollback condition	`complaint spike or eval pass below 95%`
Monitoring	production KRI and window	`30-day fee complaint and override monitor`

11.4 Effectiveness Verification Memo

Field	Evidence standard	Example
Baseline	Before metric with time window	`unsupported claim 14% on fee eval v1`
Post-fix result	After metric with same or stronger method	`3% on locked fee eval v2`
Segment check	Key segments verified	`mobile, contact center, Spanish`
Customer signal	Complaints, appeals, overrides	`fee complaint back to baseline for 2 weeks`
Control signal	New gate or monitoring works	`freshness gate blocks superseded docs`
Residual risk	Accepted remaining risk	`low risk for rarely used legacy fee terms`
Closure decision	approver and date	`Product + Model Risk closure on 2026-07-30`

11.5 Executive Status Memo

Section	Good content
Decision needed	Approve closure, extend monitoring, pause automation, fund fix or accept residual risk
Customer impact	Plain-language customer effect and affected count
Root cause	One sentence, with system and control cause
Actions completed	Corrective and preventive actions, not activity list
Evidence	Before / after results and production KRI
Open risk	What remains, who owns it and review date
Recommendation	Clear management action

12. 30-Day Lab

Goal: within 30 days, turn closed-loop learning from a concept into presentable financial retail AI product, architecture and governance assets. The assumed reader already has a senior BA / PM / architecture foundation, so there is no basic flowchart training.

Day	Topic	Deliverable
1	Choose a financial retail AI use case; define the customer journey and AI touchpoints	AI touchpoint map
2	Read NIST AI RMF; map Govern / Map / Measure / Manage	governance mapping note
3	Read the ISO/IEC 42001 overview; extract management system controls	AI management system control map
4	Read SR 26-2; summarize the 2026 model risk management nuance	model risk nuance memo
5	Analyze which feedback signals the CFPB complaint database can supply	complaint signal taxonomy
6	Design the signal intake schema	corrective action intake schema
7	Design the feedback usage policy	train / eval / RCA / audit usage matrix
8	Design the issue taxonomy	issue taxonomy and examples
9	Design the severity matrix	L0-L4 response matrix
10	Design the containment patterns	pause, narrow, gate, revert, remediate playbook
11	Design the root cause taxonomy	data/model/prompt/RAG/tool/workflow/policy/governance map
12	Complete a Five-why RCA drill	RCA worksheet
13	Design the change linkage model	issue-to-change evidence graph
14	Design dataset update governance	dataset change checklist
15	Design prompt update governance	prompt diff and regression gate
16	Design model update governance	challenger validation and threshold memo
17	Design RAG update governance	source freshness and citation support gate
18	Design tool / agent update governance	tool permission and state machine gate
19	Design the effectiveness verification hierarchy	verification evidence matrix
20	Design recurrence monitoring windows	monitoring plan by severity
21	Design the executive dashboard	CAPA severity and aging dashboard mock
22	Design the operations dashboard	complaint, override, appeal, remediation view
23	Design the model risk dashboard	eval, drift, validation and closure view
24	Write the RACI	role matrix
25	Write the Corrective Action Intake template	completed example
26	Write the RCA + Change Linkage memo	completed example
27	Write the Effectiveness Verification memo	completed example
28	Run a tabletop exercise: RAG stale policy incident	incident simulation pack
29	Prepare the interview story	STAR-T answers
30	Complete the portfolio package	architecture diagram, workflow, dashboards, templates, executive memo

13. Interview Answers

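13.1 What is the difference between closed-loop learning and active learning?

30-second answer:

Active learning selects the samples most worth human labeling in order to improve the model. Closed-loop learning turns production feedback, complaints, overrides, eval failures, drift, incidents and audit findings into governed issues, finds the root cause, links it to a specific change, then proves the fix works and prevents recurrence. It is closer to CAPA for AI.

2-minute expansion:

In financial retail, feedback cannot be equated with training data. A complaint may be a customer harm signal, an override may reflect employee distrust of the AI, an eval failure may expose a release gate gap, and drift may reflect a change in data, policy or behavior. My design starts with signal normalization and AI contribution matching, then classifies severity and issue type. RCA follows and decides whether to fix data, prompt, model, RAG, tool, workflow or policy. Every action links to a specific version and approval, and locked evals, production KRIs, complaint trends and recurrence monitoring prove the effect.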

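13.2 How would you design a CAPA-like AI corrective action workflow?

30-second answer:

I would design a closed loop of detect, match, classify, contain, RCA, action, change, verify, monitor and close. What matters is not a good-looking flowchart but that every step has an owner, an SLA, evidence and closure criteria.

2-minute expansion:

Step one collects signals, including complaints, appeals, human overrides, expert QA, eval failures, drift, incidents and audit findings. Step two confirms whether an AI system and version took part in the customer journey. Step three grades the issue by customer impact and control risk. High-risk scenarios get containment first, for example pausing automated answers, raising human review or restoring customers. Next comes the root cause taxonomy, which separates data, model, prompt, RAG, tool, workflow, policy, vendor and governance. Fixes must go through change management and link to a dataset, prompt, model, index, tool policy or SOP. Finally, eval, replay, canary, QA, production KRIs and the recurrence window verify the fix, and the issue closes only when the evidence is complete.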

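13.3 How do you keep the feedback loop from amplifying bias?

30-second answer:

Separate feedback usage, record sampling probability and the business action taken, protect the eval set, review coverage and harm by segment, and retain random sentinel samples. Not every override or complaint can go straight into training.

2-minute expansion:

If you train a fraud model only on transactions the model blocked, you never see the missed fraud among approved transactions. If you train credit explanations only on feedback from customers who appeal, you underestimate the silent harm to vulnerable customers. My design gives each piece of feedback a usage tag: train, eval, RCA, monitor, audit or exclude. Training data needs lineage, label quality and segment coverage. Eval sets are locked so that failing samples are not tuned repeatedly until they pass. Production monitoring watches complaints, appeal upheld, override, abandonment and segment disparity. That makes closed-loop learning governed learning rather than model self-reinforcement.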

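13.4 How do you prove that a prompt or RAG fix really works?

30-second answer:

Saying the prompt was changed is not enough. I want to see the prompt diff, policy mapping, a passing locked eval, production sample QA, a drop in related complaints and overrides, and no issue of the same class within the recurrence window.

2-minute expansion:

Take a stale RAG policy as the example. I first confirm whether the root cause is source lifecycle, retriever ranking, prompt citation instruction or handoff rule. After the fix, the eval cases are locked to verify that answer claims are supported by currently effective sources, that citations directly support the answer, and that high-risk intents are refused or escalated to a human. I also run a production canary and human QA, and monitor related complaints, agent override, unsupported claim, source freshness and escalation miss. If customers were affected, customer notification, correction and compensation must also be evidenced as complete. The issue closes only when quality, customer and control metrics all pass.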

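13.5 How does SR 26-2 affect AI closed-loop learning?

30-second answer:

SR 26-2 is an important 2026 anchor for bank model risk management, emphasizing a risk-based approach, lifecycle, governance, independent challenge and issue remediation. The nuance is that it formally focuses on model risk management, so closed-loop learning for generative and agentic AI still needs an AI management system, consumer compliance, privacy, security, third-party and operational controls layered on top.

2-minute expansion:

I would use the spirit of SR 26-2 to require full lifecycle governance for model-related issues: issue identification, risk assessment, validation, change control, independent challenge, documentation and closure evidence. For generative AI or agentic workflows, I would not only ask whether the system falls under the traditional model definition. I would start from customer impact and control, and supplement with NIST AI RMF, ISO/IEC 42001, complaint signals, security, privacy, third-party and operational resilience. In other words, SR 26-2 supplies model risk discipline, but the AI corrective action architecture has to cover the broader system.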

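13.6 Why does a root cause taxonomy matter?

30-second answer:

Without a root cause taxonomy, teams attribute every problem to “the model is inaccurate” or “the prompt needs changing”. The real root cause in financial AI may be data, labels, RAG source, tool permission, workflow, policy, vendor or governance.

2-minute expansion:

The right root cause determines the right fix. Suppose a customer receives a wrong fee explanation. On the surface the RAG answered incorrectly, but the real root cause may be that the old policy was never deactivated, the retriever had no effective-date filter, the prompt did not require current sources, or the service workflow did not escalate conflicting answers. If only the prompt is changed, the problem recurs. I would establish a root cause taxonomy and require every RCA to state the direct cause, contributing cause, control failure and preventive action. That way closed-loop learning fixes the system, not just the output.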

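13.7 Which metrics show that corrective action is working well?

30-second answer:

I look at open CAPA by severity, aging, repeat issue rate, overdue verification, evidence completeness, complaint / appeal / override trend, post-fix recurrence and residual risk acceptance.

2-minute expansion:

A mature system cannot just count closed tickets. I want to know whether high-severity issues were contained in time, whether RCA was classified by root cause, whether actions link to specific artifacts, whether verification finished on plan, whether customers were restored and whether recurrence is falling. For product and operations, I look at complaints, appeal upheld, override, abandonment and remediation SLA. For model risk, I look at eval regression, drift-to-action, dataset lineage and validation exception closure. For executives, I look at overdue CAPA, repeat root cause, residual risk and investment gaps.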

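13.8 How do you handle the relationship between customer complaints and model training?

30-second answer:

A complaint is a high-value customer harm signal, but it cannot be used directly as a training label. It first needs case review, AI contribution matching, root cause analysis and usage tagging.

2-minute expansion:

Customer complaints reflect customer experience and impact, but the narrative may be incomplete or emotional. My approach is to enter the complaint into the harm and corrective action ledger and link it to the AI trace, case record, policy version and journey step. If review confirms the AI output was wrong, the complaint can generate an eval scenario, RCA evidence and a customer remediation case, and where necessary a training label. Before it enters training, it needs expert adjudication, privacy review, segment bias check and dataset lineage. The more direct value of a complaint is that it exposes broken product promises and defects in the recovery path.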

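13.9 How do you turn closed-loop learning into a portfolio piece?

30-second answer:

I would present one complete case: signal taxonomy, CAPA workflow, root cause taxonomy, change linkage, effectiveness dashboard, RACI and executive memo, rather than showing only model metrics.

2-minute expansion:

Take a credit card servicing RAG stale policy case. The portfolio first shows the customer journey and AI touchpoints, then how complaints, agent overrides, RAG eval failures and a drift cluster enter the corrective action ledger. It then uses the root cause taxonomy to show the problem is not simply the prompt but source lifecycle, retrieval ranking and handoff gate. Next it shows the change linkage: source registry, retriever config, prompt version, eval set and release approval. Finally it uses the locked eval, production QA, complaint trend, override decline and recurrence window to prove effectiveness. This demonstrates PM, BA, architecture and governance capability together.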

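13.10 How do you respond if the team says “we already have a monitoring dashboard”?

30-second answer:

A dashboard only detects problems; it does not close the loop. I want to know whether each alert has an owner, severity, RCA, action, change linkage, verification, closure and recurrence monitoring.

2-minute expansion:

Many AI teams have polished drift, latency and quality dashboards, but nobody acts on the alerts, or the issue is handled with no evidence that the fix worked. My standard is drift-to-action and signal-to-CAPA. Every red metric must map to a triage owner and a runbook, every high-risk issue must link to a specific change, and every closure must carry before / after evidence and residual risk acceptance. Otherwise the dashboard is an observation system, not a governance system.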

14. Final Operating View

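Mature AI closed-loop learning does not mean letting the model “automatically absorb feedback”. In the mature state, the organization can continuously prove: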

We know where AI touches customers and operations.
We capture complaints, appeals, overrides, expert reviews, eval failures, drift, incidents and audits.
We classify issues by customer impact, control weakness and root cause.
We protect customers before permanent fixes are shipped.
We link each corrective action to a specific dataset, prompt, model, RAG, tool, workflow or policy change.
We verify fixes with locked evals, replay, QA, production KRIs and recurrence monitoring.
We prevent eval leakage, feedback bias and undocumented changes.
We can show executives, auditors and regulators what failed, why it failed, what changed, and why it is safe to close.

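For CBAP+ practitioners, AI PMs, AI BAs and AI Architects, this is the core capability that moves from “continuously optimizing the AI system” to “an AI organization that continuously learns, corrects, proves and prevents recurrence”.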