AI Closed-Loop Learning / Corrective Action Architecture Playbook
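Positioning: For CBAP+, AI Product Manager, AI BA, AI Product Architect, Model Risk, Compliance, Customer Operations, Data Governance, AI Platform, and retail financial services business owners. This playbook goes beyond the basic claim that “collecting feedback makes the model better.” It focuses on how to turn feedback, complaints, human overrides, expert reviews, eval failures, drift signals, incidents, and audit findings into a closed-loop corrective system that is governed, root-caused, linked to changes, verified, and backed by evidence.
Scope: This playbook covers fraud, credit, KYC / AML, payments dispute, customer servicing RAG, complaints, collections, wealth servicing, internal copilot, agentic workflow, and AI-assisted operations. It designs closed-loop learning as a CAPA-like corrective and preventive action architecture and does not reduce it to active learning, RLHF, automated retraining, or prompt tweaking.
Important note: This playbook is learning, portfolio, and internal solution-training material. It does not constitute legal advice, a compliance conclusion, an audit opinion, a model validation report, a regulatory interpretation, or any specific institution's policy. Formal projects must be confirmed by Legal, Compliance, Model Risk, Fair Lending, Privacy, Security, Third Party Risk, Business Owner, Operations, Customer Experience, Data Governance, and management in light of institution type, jurisdiction, product scope, customer impact, and internal policy. Source access dates are recorded as 2026-06-30.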
Source Anchors
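The official sources below are the governance anchors for this playbook. The playbook translates them into product, process, architecture, evidence, and management language, and does not treat any framework as equivalent to a regulatory compliance conclusion.
| Anchor | Official link | How this playbook uses it |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize closed-loop learning accountability, risk scenarios, measurement, treatment, residual risk, and continuous-improvement evidence. |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/42001 | Uses the AI management system lens to design objectives, controls, operating records, internal audit, management review, continual improvement, and the corrective action loop. |
| Federal Reserve SR 26-2 | https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm | Serves as the 2026 anchor for bank model risk management, emphasizing a risk-based approach, the model lifecycle, independent challenge, governance, issue remediation, and documentation. Nuance: SR 26-2 replaces SR 11-7 / SR 21-8 and adopts a more risk-based approach tiered by institution size and model materiality. Its formal scope is model risk management at banking organizations, so generative AI, agentic AI, non-model automation, and customer-facing AI still need broader AI governance, consumer compliance, privacy, security, third-party, operational, and product controls. |
| CFPB Consumer Complaint Database | https://www.consumerfinance.gov/data-research/consumer-complaints/ | Treats consumer complaint topics, narratives, company responses, and trends as inputs to customer harm, root cause, remediation, and external signal calibration. |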
Source-to-artifact mapping:
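| Source lens | Artifact it must land in | Interview phrasing |
|---|---|---|
| NIST AI RMF | AI issue register, risk mapping, measurement dashboard, corrective action workflow, residual risk memo | “I use Govern / Map / Measure / Manage to manage the feedback loop instead of treating feedback as a pile of data.” |
| ISO/IEC 42001 | AI management system procedure, corrective action SOP, management review pack, internal audit evidence | “Closed-loop learning has to live inside the management system, where internal audit, management review, and continual improvement can track it.” |
| SR 26-2 | Model issue management, model change log, independent challenge, validation evidence, closure record | “Fixing bank AI is not just the model team changing code. You also have to evidence the issue, the root cause, the change, the validation, and how residual risk was handled.” |
| CFPB complaints | Complaint taxonomy, customer harm trigger, trend monitor, external complaint reconciliation | “Complaints are a production feedback signal, not customer-service noise. They calibrate whether internal monitoring is missing customer harm.” |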
1. Executive Framing
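1.1 One-sentence positioning
Closed-Loop Learning / Corrective Action Architecture =
turn production feedback and risk signals into governed issues,
link them through root cause analysis to specific AI / data / policy / process changes,
then use independent, reproducible, auditable evidence to prove the fix works and prevent recurrence.
For executives, it answers four questions:
- How do we know an AI system is causing errors, harm, control weaknesses, or performance degradation?
- How do we determine whether the root cause lies in the model, data, prompt, RAG, tool, process, policy, vendor, or governance?
- How do we make sure the fix is applied to the right artifact instead of stopping at meeting minutes?
- How do we prove the fix worked, customers were made whole, risk went down, and the same class of issue has stopped occurring?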
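1.2 It is not active learning
| Capability | Core question | Typical output |
|---|---|---|
| Active learning | Which samples are most worth expert labeling | label queue, dataset update |
| Human feedback operations | Who reviews, how labels are applied, and how label quality is assured | reviewer workflow, calibration, adjudication |
| Drift monitoring | Whether production distributions and performance have shifted | alert, triage, response |
| Customer harm remediation | Whether customers were harmed and how to make them whole | recourse case, compensation, notification |
| Closed-loop corrective action | Why it failed, what to fix, and how to prove it is fixed | CAPA record, change linkage, effectiveness evidence |
Active learning can supply inputs, but the end point of closed-loop learning is not “more training data.” It is “the production issue is corrected and the evidence is auditable.”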
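1.3 The role of a senior PM / BA / Architect
| Role capability | What it looks like |
|---|---|
| Product judgment | Judges whether feedback indicates customer harm, a departure from risk appetite, or a broken product promise |
| BA discipline | Defines the issue taxonomy, state machine, fields, SLA, stakeholder approval, and evidence requirements |
| Architecture thinking | Connects the feedback ledger, model registry, prompt repo, vector index, tool policy, case system, and evidence binder |
| Governance fluency | Brings Model Risk, Compliance, Legal, Operations, and the Business Owner into the right gates |
| Outcome focus | Measures success by improved customer, control, performance, and recurrence metrics, not by “the model was changed” |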
2. Operating Model
2.1 Main closed-loop flow
Signals
complaints | appeals | overrides | expert reviews | eval failures | drift alerts | incidents | audits
-> signal normalization
-> AI contribution matching
-> issue taxonomy and severity
-> containment and customer protection
-> root cause analysis
-> corrective and preventive action
-> linked change artifacts
-> release and approval gate
-> effectiveness verification
-> recurrence monitoring
-> closure and management reporting
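2.2 What enters the corrective action ledger
| Candidate issue | Entry condition |
|---|---|
| Customer complaint | Points to an AI-involved answer, routing, denial, delay, explanation, fee, account restriction, or service failure |
| Human override spike | Staff repeatedly override the same model, prompt, RAG answer, classification, or agent action |
| Expert QA failure | A high-risk output lacks evidence, violates policy, is misclassified, or cannot be explained |
| Eval regression | A locked eval, gold set, red-team, fairness, RAG citation, or tool safety gate fails |
| Drift alert | A distribution, score, decision, outcome, embedding, knowledge, or segment metric crosses its action threshold |
| Incident | An AI-related production incident, security incident, privacy incident, customer harm event, or operational disruption |
| Audit / validation finding | Internal audit, model validation, compliance, regulatory readiness, or a third-party assessment finds a control gap |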
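2.3 Issue state machine
| Status | Entry criteria | Exit criteria |
|---|---|---|
| New signal | Signal received, AI contribution not yet confirmed | AI contribution matching completed |
| Triage | Related AI system, journey, customer impact, and initial severity identified | Owner, SLA, and next step assigned |
| Contained | Temporary control executed or explicitly deemed unnecessary | Record completed for customer protection, degradation, pause, human review, or risk acceptance |
| RCA in progress | Root cause hypotheses are being tested | Root cause classification and evidence reach an actionable level |
| Action approved | Corrective / preventive action decided | Change request, approval, and verification plan established |
| Implemented | Change deployed or SOP updated | Verification window entered |
| Verifying | Effectiveness evidence is being collected | Passes the closure gate or is reopened |
| Closed | Fix effective, evidence complete, residual risk accepted | Fed into trend reporting and management review |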
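The status table can be read as a small state machine. The sketch below is one hedged way to encode it: forward moves follow the table order, and every move needs a pointer to exit evidence. The `advance` helper, the `ALLOWED` map, and the choice to send a reopened issue back to RCA are illustrative assumptions, not something the table prescribes.

```python
from enum import Enum


class Status(Enum):
    NEW_SIGNAL = "New signal"
    TRIAGE = "Triage"
    CONTAINED = "Contained"
    RCA_IN_PROGRESS = "RCA in progress"
    ACTION_APPROVED = "Action approved"
    IMPLEMENTED = "Implemented"
    VERIFYING = "Verifying"
    CLOSED = "Closed"


# Forward edges follow the table order. The reopen target (Verifying -> RCA in
# progress) is an assumption; the table only says a failed issue is reopened.
ALLOWED = {
    Status.NEW_SIGNAL: {Status.TRIAGE},
    Status.TRIAGE: {Status.CONTAINED},
    Status.CONTAINED: {Status.RCA_IN_PROGRESS},
    Status.RCA_IN_PROGRESS: {Status.ACTION_APPROVED},
    Status.ACTION_APPROVED: {Status.IMPLEMENTED},
    Status.IMPLEMENTED: {Status.VERIFYING},
    Status.VERIFYING: {Status.CLOSED, Status.RCA_IN_PROGRESS},
    Status.CLOSED: set(),
}


def advance(current: Status, target: Status, exit_evidence: str) -> Status:
    """Return the new status, or raise if the move or the evidence is invalid."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} is not allowed")
    if not exit_evidence.strip():
        raise ValueError("exit criteria need a pointer to evidence")
    return target
```

Because "Contained" also covers the case where containment is explicitly deemed unnecessary, no issue skips that status in this sketch; it simply records the decision as its exit evidence.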
3. Feedback Taxonomy
3.1 Signal taxonomy
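| Signal category | Examples | Primary value | Typical risk |
|---|---|---|---|
| Customer complaint | CFPB complaints, internal complaints, customer-service escalations, social media escalations | Customer-perceived harm and journey breakpoints | Customer narratives may not contain the full technical facts |
| Appeal / dispute | Credit appeals, payment disputes, account restriction reviews, KYC appeals | Automation errors and the quality of the recovery path | Covers only customers who are able to appeal |
| Human override | Staff change an AI classification, summary, recommendation, answer, or risk rating | Exposes untrustworthy AI, unclear policy, or a misleading UI | An override may reflect staff preference rather than fact |
| Expert review | SME QA, second-line review, model validation sample | High-quality labels, policy interpretation, and eval evidence | High cost, requires reviewer calibration |
| Eval failure | locked test, red-team, gold set, RAG claim QA, tool simulation fail | Release-gate and regression-test signal | May overfit to a particular eval set |
| Drift / monitoring | feature drift, score drift, complaint spike, override trend, RAG freshness | Changes in production assumptions and early warning | A statistical difference does not necessarily mean customer impact |
| Incident | Privacy breach, faulty batch processing, agent tool misuse, mass wrongful declines | Requires containment and customer recovery | After-the-fact logs may be incomplete |
| Audit / validation finding | Model validation, internal audit, compliance, third-party assessment | Control design or operating deficiencies | Closure may be documentation-driven rather than a substantive fix |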
3.2 Feedback usage policy
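| Feedback type | Usable for training | Usable for eval | Usable for RCA | Usable for audit evidence |
|---|---|---|---|---|
| Confirmed outcome label | Yes, with lineage and quality gates | May enter a future eval, kept isolated | Yes | Yes |
| Human override | With caution, after reviewing the reason and checking for bias | May generate a review sample | Yes | Yes |
| Complaint | Usually not used directly for training | May become a harm scenario and eval case | Yes | Yes |
| Expert adjudication | Yes, with calibration | Yes, with lock and versioning | Yes | Yes |
| Eval failure | Do not train directly on the locked eval itself | Yes | Yes | Yes |
| Drift alert | Not used directly for training | May trigger sampling | Yes | Yes |
| Audit finding | Not used directly for training | May form a control eval | Yes | Yes |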
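A usage policy like this is easier to enforce when it is held as data rather than prose. The sketch below is a simplified encoding: the three labels (`yes`, `conditional`, `no`), the key names, and the single `controls_passed` flag are assumptions that compress the table's more specific conditions (lineage, calibration, locking, and so on) into one switch.

```python
# "yes" = allowed, "conditional" = allowed only after the listed controls pass,
# "no" = not used directly. Keys and labels are illustrative simplifications.
USAGE_POLICY = {
    "confirmed_outcome_label": {"train": "conditional", "eval": "conditional"},
    "human_override": {"train": "conditional", "eval": "conditional"},
    "complaint": {"train": "no", "eval": "conditional"},
    "expert_adjudication": {"train": "conditional", "eval": "conditional"},
    "eval_failure": {"train": "no", "eval": "yes"},
    "drift_alert": {"train": "no", "eval": "conditional"},
    "audit_finding": {"train": "no", "eval": "conditional"},
}


def allowed_usages(feedback_type: str, controls_passed: bool = False) -> set:
    """Return the usages permitted for a feedback type under this sketch."""
    policy = USAGE_POLICY[feedback_type]
    # Every row in the table allows RCA and audit evidence.
    allowed = {"rca", "audit"}
    for usage, rule in policy.items():
        if rule == "yes" or (rule == "conditional" and controls_passed):
            allowed.add(usage)
    return allowed
```

For example, `allowed_usages("complaint")` yields only RCA and audit use until review converts the complaint into an eval scenario, and a complaint never becomes directly trainable under this encoding.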
3.3 Signal quality grading
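| Grade | Criteria | Handling |
|---|---|---|
| A - Actionable evidence | Has customer impact, AI version, evidence, a reproducible path, and an owner | Enters RCA and action planning |
| B - Strong indicator | Has a trend, samples, and a plausible AI contribution, but evidence is incomplete | Gather more evidence, sample for review, add temporary monitoring |
| C - Weak signal | A single narrative or statistical fluctuation with no confirmed customer impact | Keep in trend tracking, do not trigger a change directly |
| D - Noise / duplicate | Duplicate, irrelevant, or cannot be linked to an AI system | Merge or close, retaining the audit rationale |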
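The grading criteria can be approximated as a few ordered checks. The sketch below assumes each criterion has already been reduced to a boolean by a human reviewer; the `SignalFacts` fields and the order of the checks are hypothetical, and real grading would still involve judgment.

```python
from dataclasses import dataclass


@dataclass
class SignalFacts:
    """Reviewer-supplied facts about one signal (hypothetical field names)."""
    linked_to_ai_system: bool
    is_duplicate: bool = False
    customer_impact_confirmed: bool = False
    ai_version_known: bool = False
    has_evidence: bool = False
    reproducible: bool = False
    has_owner: bool = False
    has_trend_and_samples: bool = False


def grade(s: SignalFacts) -> str:
    """Map reviewer facts to grade A-D, checking the disqualifiers first."""
    if s.is_duplicate or not s.linked_to_ai_system:
        return "D"
    grade_a_criteria = [
        s.customer_impact_confirmed,
        s.ai_version_known,
        s.has_evidence,
        s.reproducible,
        s.has_owner,
    ]
    if all(grade_a_criteria):
        return "A"
    if s.has_trend_and_samples:
        return "B"
    return "C"
```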
3.4 Key fields
| Field | Meaning | Example |
|---|---|---|
| signal_id | Unique signal identifier | SIG-2026-06-00182 |
| ai_system_id | AI system or capability ID | servicing_rag_fee_policy |
| artifact_versions | Model, prompt, retriever, tool, and policy versions | prompt_v4.8, index_2026_06_12 |
| customer_journey_step | Position in the customer journey | fee_dispute_chat_answer |
| customer_impact | Funds, access, explanation, delay, privacy, fairness, and so on | wrong explanation and missed escalation |
| segment | Product, channel, region, language, customer lifecycle stage | mobile, credit_card, Spanish |
| source_evidence | Pointers to complaint, log, QA, eval, case, and monitor records | complaint_case_99102, trace_8831 |
| proposed_usage | train, eval, RCA, monitor, audit, exclude | RCA + eval scenario |
4. CAPA Workflow
4.1 End-to-end workflow
1. Detect signal
2. Confirm AI contribution
3. Classify issue and severity
4. Contain customer and operational risk
5. Assign owner and SLA
6. Perform root cause analysis
7. Define corrective action and preventive action
8. Link actions to change artifacts
9. Approve and release changes
10. Verify effectiveness
11. Monitor recurrence
12. Close with evidence and residual risk decision
4.2 Stage detail
| Stage | Required decision | Output artifact |
|---|---|---|
| Detect | Is this signal relevant to an AI-involved process | Signal record |
| Match | Which AI system, version, policy and journey were involved | AI contribution packet |
| Classify | What issue type and severity apply | Issue taxonomy record |
| Contain | What immediate protection is required | Containment order or risk acceptance |
| RCA | What root cause explains the issue | RCA worksheet |
| Plan | What corrective and preventive actions are needed | CAPA plan |
| Change | Which artifact changes and who approves | Change request and release plan |
| Verify | What evidence proves effect | Verification plan |
| Close | Who accepts residual risk and closure | Closure memo |
4.3 Severity matrix
| Severity | Trigger | Required response |
|---|---|---|
| L0 - Learning signal | No customer impact, useful for improvement | Track trend, consider sampling, no formal CAPA |
| L1 - Control weakness | Internal QA, eval, drift or audit issue without confirmed customer harm | RCA, backlog with due date, verification evidence |
| L2 - Customer inconvenience | Wrong explanation, delay, repeat effort or small financial impact | Containment, customer review, corrective action, KRI monitoring |
| L3 - Material customer harm | Access denial, funds impact, privacy, complaint escalation, fairness concern | Incident-style CAPA, customer remediation, executive reporting |
| L4 - Systemic / regulatory risk | Batch harm, protected-class disparity, major privacy / funds / regulator issue | Crisis protocol, legal hold, executive risk committee, regulator-ready evidence |
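One hedged reading of the severity matrix is "take the highest level that any observed trigger maps to." The sketch below encodes that rule. The trigger names are illustrative labels derived from the Trigger column, not a fixed vocabulary, and an unknown trigger raises an error so that it gets classified by a person rather than silently ignored.

```python
# Trigger labels are illustrative; each maps to the level whose row names it.
TRIGGER_LEVEL = {
    "internal_qa_issue": 1,
    "eval_failure": 1,
    "drift_alert": 1,
    "audit_issue": 1,
    "wrong_explanation": 2,
    "delay": 2,
    "repeat_effort": 2,
    "small_financial_impact": 2,
    "access_denial": 3,
    "funds_impact": 3,
    "privacy": 3,
    "complaint_escalation": 3,
    "fairness_concern": 3,
    "batch_harm": 4,
    "protected_class_disparity": 4,
    "regulator_issue": 4,
}


def classify_severity(triggers: set) -> str:
    """Return L0-L4 as the highest level among the observed triggers."""
    level = max((TRIGGER_LEVEL[t] for t in triggers), default=0)
    return f"L{level}"
```

With no triggers the function returns `L0`, which matches the learning-signal row: tracked for trend, no formal CAPA.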
4.4 Containment patterns
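| Pattern | When to use | Example |
|---|---|---|
| Pause automation | A high-impact path may keep harming customers | Pause RAG auto-answers about dispute deadlines |
| Narrow scope | Only some segments or intents are affected | Turn off Spanish-language fee waiver auto-answers |
| Raise human gate | The model is usable but risk has risen | Send all high-value disputes to human review |
| Revert artifact | A new prompt, model, index, or tool policy regressed | Roll back to the previous retriever config |
| Customer recovery | Customer impact has already occurred | Reopen cases, refund fees, restore access, send correction notices |
| Evidence hold | External complaint, privacy, or regulatory risk | Freeze logs, communications, versions, approvals, and samples |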
4.5 Closure gates
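| Gate | Pass criteria |
|---|---|
| Scope gate | Affected customers, segments, journeys, versions, and time window are identified |
| Root cause gate | Root cause classification is backed by evidence, not loosely attributed to “the AI is inaccurate” |
| Change gate | Every action links to a specific artifact, owner, version, and approval |
| Verification gate | Before / after results and production KRIs demonstrate the effect |
| Customer recovery gate | Customer funds, entitlements, access, explanations, or service paths are restored |
| Recurrence gate | No threshold breach within the recurrence monitoring window, or residual risk is formally accepted |
| Evidence gate | Audit evidence is complete and can reconstruct detection, decision, fix, and closure |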
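The closure gates work as a conjunction: an issue closes only when every applicable gate passes. A minimal sketch follows. The gate keys and function names are hypothetical, and waiving the customer recovery gate when no customer harm occurred is an assumption layered on top of the table.

```python
GATES = [
    "scope",
    "root_cause",
    "change",
    "verification",
    "customer_recovery",
    "recurrence",
    "evidence",
]


def failing_gates(results: dict, customer_harm: bool = True) -> list:
    """List the gates that block closure; a missing result counts as a failure."""
    # Assumption: the customer recovery gate applies only when harm occurred.
    required = [g for g in GATES if customer_harm or g != "customer_recovery"]
    return [g for g in required if not results.get(g, False)]


def can_close(results: dict, customer_harm: bool = True) -> bool:
    """True only when no required gate is failing."""
    return not failing_gates(results, customer_harm)
```

Treating a missing gate result as a failure keeps the default conservative: an issue cannot close just because nobody recorded the recurrence check.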
5. Root Cause Taxonomy
5.1 Root cause classification
| Root cause family | Typical root causes | Fix pattern |
|---|---|---|
| Data quality | Wrong, missing, or late fields, mapping changes, identity merge errors | data contract, validation, backfill, lineage repair |
| Dataset coverage | Training or eval lacks key segments, languages, products, or edge-case samples | targeted sampling, eval expansion, coverage dashboard |
| Label / taxonomy | Unclear label definitions, reviewer disagreement, inconsistent policy interpretation | taxonomy governance, adjudication, label guideline update |
| Model behavior | Miscalibration, concept drift, stale thresholds, high-confidence errors | recalibration, retraining, threshold governance, human gate |
| Prompt behavior | Overconfidence, under-refusal, unstable formatting, no escalation path | prompt policy, rubric update, response contract, handoff rule |
| RAG / knowledge | Stale sources, wrong retrieval, poor chunking, unsupportive citations, unclear permissions | source governance, index rebuild, retriever/reranker change |
| Tool / agent | Overly broad tool permissions, unclear schema, state machine gaps, no approval gate | tool contract, permission boundary, approval workflow |
| Workflow | Wrong queue priority, no appeal entry point, duplicate submissions, broken handoffs | service blueprint redesign, case routing, SLA control |
| Policy | Ambiguous business rules, risk appetite not updated, inconsistent reason codes | policy clarification, rule update, approval matrix |
| Human adoption | Staff over-rely on AI, skip checks, or override without recording a reason | training, UI friction, supervisor QA, reason capture |
| Vendor / third party | Vendor model changes, insufficient logs, unclear SLA, weak exit path | contract control, change notice, evidence rights, fallback |
| Governance | Unclear owner, missing gates, residual risk accepted by no one | RACI, release gate, management review, control library |
5.2 RCA depth test
| Shallow statement | Better RCA question |
|---|---|
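| “Model hallucination” | Why does the system allow unsupported claims into customer answers? |
| “The data is inaccurate” | Which data contract failed, and why did validation not catch it? |
| “The user phrased it unusually” | Does the eval cover that intent, language, and journey? |
| “Staff did not review it” | Why did the UI, SOP, training, and supervision let over-reliance happen? |
| “The policy changed” | Why did the source owner, effective date, and RAG freshness gate not trigger? |
| “The vendor pushed an update” | Why did the contract, change notification, and regression test not cover it? |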
5.3 Five-why pattern for AI
Issue: Customer received wrong fee waiver answer.
Why 1: RAG answer cited superseded policy.
Why 2: Retriever ranked old policy summary above current policy.
Why 3: Source metadata did not mark old policy as inactive.
Why 4: Knowledge owner SOP required upload of new policy but not deactivation of superseded content.
Why 5: Release gate tested answer correctness, but not source lifecycle controls.
Corrective action: update metadata and retriever filter.
Preventive action: add source retirement SOP and freshness gate to release checklist.
6. Change Linkage
6.1 Evidence graph
signal_id
-> issue_id
-> harm_or_control_classification
-> root_cause_id
-> corrective_action_id
-> preventive_action_id
-> change_request_id
-> artifact_version
-> eval_run_id
-> release_approval_id
-> monitoring_window_id
-> closure_memo_id
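The evidence graph is only useful if a broken link can be detected mechanically. The sketch below walks the chain in order and reports the first missing identifier. Storing the chain as a flat dictionary per issue is an assumption made for illustration; a real ledger would more likely hold these as relations across systems.

```python
from typing import Optional

# Link names are taken from the chain above, in order.
CHAIN = [
    "signal_id",
    "issue_id",
    "harm_or_control_classification",
    "root_cause_id",
    "corrective_action_id",
    "preventive_action_id",
    "change_request_id",
    "artifact_version",
    "eval_run_id",
    "release_approval_id",
    "monitoring_window_id",
    "closure_memo_id",
]


def first_broken_link(record: dict) -> Optional[str]:
    """Return the first link that is missing or empty, or None if complete."""
    for key in CHAIN:
        if not record.get(key):
            return key
    return None
```

An issue whose record stops at `change_request_id` would report `artifact_version` as its first broken link, which is the "PR merged, nothing proven" anti-pattern in data form.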
6.2 Artifact linkage matrix
| Artifact | Version evidence | Typical approver |
|---|---|---|
| Dataset | dataset hash, label guideline, split policy, lineage | Data Owner, Model Owner, Model Risk |
| Eval set | eval version, locked samples, scenario map, pass criteria | Model Validation, Product, Compliance |
| Prompt | prompt diff, policy mapping, regression output | Product, AI Platform, Compliance for high-risk |
| Model | model card, training data, validation report, calibration | Model Owner, Model Risk, Business Owner |
| RAG index | source list, effective dates, chunking, retriever config | Knowledge Owner, Compliance, AI Platform |
| Tool policy | schema, permissions, approval gates, logs | Security, Product, Operations |
| Workflow SOP | process map, SLA, handoff, case state | Operations, Product, Risk |
| Customer communication | template, legal review, accessibility review | Legal, Compliance, CX |
6.3 Traceability anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| “Issue closed because PR merged” | Code change does not prove customer or control outcome improved |
| “Prompt fixed in playground” | No production version, no approval, no regression evidence |
| “Data refreshed” | No lineage, no label quality, no eval protection |
| “Model retrained” | Root cause may be workflow, policy or RAG source, not model |
| “Monitoring added” | Monitoring does not correct existing harm or prove fix |
7. Effectiveness Verification
7.1 Verification hierarchy
| Evidence type | Strength | Use |
|---|---|---|
| Unit / contract test | Low to medium | Confirms schema, prompt format, tool permission, source freshness |
| Locked regression eval | Medium | Confirms known failure modes no longer fail |
| Counterfactual replay | Medium to high | Runs historical affected cases through fixed path |
| Shadow / canary | High | Observes production-like performance before full rollout |
| Human QA sample | High when calibrated | Confirms customer-facing quality and policy adherence |
| Production KRI | High | Confirms complaints, overrides, appeals, drift or harm signals improve |
| Customer remediation proof | Required for harm | Confirms affected customers were restored |
7.2 Metric families
| Metric family | Example |
|---|---|
| Quality | accuracy, citation support, unsupported claim rate, tool success rate |
| Risk | false decline, appeal upheld, adverse action defect, privacy exposure |
| Operations | override rate, queue SLA, escalation miss, rework, duplicate contact |
| Customer | complaint rate, repeat contact, abandonment, remediation SLA |
| Fairness / segment | disparity in denial, delay, appeal upheld, language support |
| Governance | evidence completeness, approval timeliness, recurrence, overdue CAPA |
7.3 Closure rule examples
| Issue type | Closure evidence |
|---|---|
| RAG stale policy | locked eval pass, source freshness proof, 2-week complaint trend below threshold |
| Fraud false positive spike | segment false decline proxy improves, analyst override declines, affected customers reviewed |
| Credit explanation defect | reason code parity test passes, adverse action QA pass, complaints stabilize |
| Agent tool misuse | tool permission test passes, simulation shows no unauthorized state transition, audit logs complete |
| Data pipeline skew | online-offline parity restored, backfill complete, affected decisions assessed |
7.4 Recurrence monitoring
| Severity | Minimum monitoring window | Review cadence |
|---|---|---|
| L1 | 14 days or next release cycle | Weekly |
| L2 | 30 days | Weekly with Product and Ops |
| L3 | 60 days | Twice weekly until stable, then weekly |
| L4 | 90 days or executive-defined | Executive risk committee cadence |
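The minimum monitoring windows translate directly into an earliest possible closure date. The sketch below uses only the day counts from the table; it deliberately ignores the "or next release cycle" and "or executive-defined" alternatives, and L0 is absent because it carries no formal CAPA. The function name is hypothetical.

```python
from datetime import date, timedelta

# Minimum recurrence monitoring windows in days, taken from the table.
MIN_WINDOW_DAYS = {"L1": 14, "L2": 30, "L3": 60, "L4": 90}


def earliest_closure_date(severity: str, implemented_on: date) -> date:
    """Earliest date the recurrence gate could pass for a given severity."""
    return implemented_on + timedelta(days=MIN_WINDOW_DAYS[severity])
```

For an L2 issue whose fix went live on 2026-06-30, the earliest closure date under this rule is 2026-07-30.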
8. Update Governance
8.1 Dataset update governance
| Control | Requirement |
|---|---|
| Lineage | Every added sample has source, time, AI version, policy version and usage tag |
| Label quality | Reviewer role, rubric, confidence, adjudication and evidence captured |
| Split protection | Train, calibration, eval, gold, monitoring and legal hold are separated |
| Bias control | Segment coverage and selection bias reviewed before training |
| Approval | High-impact dataset changes reviewed by Model Risk or validation function |
| Rollback | Prior dataset version and model artifact remain recoverable |
8.2 Prompt update governance
| Control | Requirement |
|---|---|
| Prompt diff | Store exact before / after system, developer and task instructions |
| Policy mapping | Each high-risk instruction maps to approved policy or control |
| Regression | Run locked scenario tests, refusal tests, citation tests and escalation tests |
| Human review | SME reviews high-risk answer changes, not only prompt text |
| Rollout | Use canary or limited cohort for customer-facing prompts |
| Monitoring | Watch unsupported claim, escalation miss, complaint and override signals |
8.3 Model update governance
| Control | Requirement |
|---|---|
| Change reason | Link model change to root cause and expected benefit |
| Validation | Compare challenger against champion on overall, segment, calibration and fairness metrics |
| Stress testing | Include drifted segments, edge cases and high customer impact cases |
| Independent challenge | Model Risk or validation reviews assumptions for high-impact models |
| Decision threshold | Threshold changes approved with customer, operations and risk impact |
| Post-release | Monitor performance, overrides, appeals, complaints and drift |
8.4 RAG update governance
| Control | Requirement |
|---|---|
| Source authority | Only approved sources can support customer-impact claims |
| Effective date | Current and superseded documents must be explicit |
| Chunking / metadata | Chunk policy preserves definitions, exceptions, deadlines and eligibility |
| Retrieval evaluation | Test source recall, citation support, conflict handling and permissions |
| Knowledge owner | Business / Compliance owner signs off source lifecycle |
| Deactivation | Superseded content is removed, demoted or blocked with audit trail |
8.5 Tool and agent update governance
| Control | Requirement |
|---|---|
| Tool contract | Input schema, output schema, side effects and failure modes defined |
| Permission boundary | Agent can only call tools needed for approved purpose |
| Approval gate | High-impact actions require human approval or policy confirmation |
| State machine | Agent workflow prevents skipped review, duplicate action and dead-end automation |
| Simulation | Run scenario tests for incorrect tool choice, retries, timeout and rollback |
| Audit log | Every tool call links to user/session, model, prompt, action and result |
8.6 Policy and workflow update governance
| Control | Requirement |
|---|---|
| Policy versioning | Eligibility, dispute, fee, credit, KYC and complaint policies have effective dates |
| SOP alignment | Human review SOP matches AI routing and escalation behavior |
| Reason code integrity | Customer explanations trace to approved reason or source |
| Case management | Corrective actions update actual customer records, not only AI artifacts |
| Training | Employees receive workflow and review training when AI behavior changes |
| Auditability | Process map, owner, SLA and evidence fields are updated together |
9. Dashboards and KRIs
9.1 Executive dashboard
| Metric | Why executives need it |
|---|---|
| Open CAPA by severity | Shows unresolved risk exposure |
| Aging CAPA | Reveals governance or owner bottlenecks |
| Repeat issue rate | Shows whether fixes prevent recurrence |
| Customer harm linked to AI | Connects AI quality to customer impact |
| Overdue verification | Prevents “implemented but unproven” closure |
| Residual risk acceptances | Shows what risk leadership has accepted |
9.2 Product and operations dashboard
| Metric | Slice |
|---|---|
| Complaint rate after AI interaction | product, channel, intent, version |
| Appeal upheld / reversal rate | decision type, segment, model version |
| Human override rate | team, output type, reason, prompt/model version |
| Escalation miss rate | journey step, risk tier, language |
| Remediation SLA | severity, owner, customer segment |
| Repeat contact and abandonment | journey, device, channel |
9.3 Model risk dashboard
| Metric | Slice |
|---|---|
| Eval failure by scenario | model, prompt, RAG, tool, release |
| Drift-to-action conversion | alert type, owner, action |
| Dataset update lineage completeness | dataset version, use case |
| Segment performance change | protected or high-control segment where lawful |
| Independent challenge findings | model family, issue type |
| Validation exceptions and closure | severity, due date |
9.4 Data / RAG / tool dashboard
| Metric | Meaning |
|---|---|
| Data contract violations | Bad inputs that can cause downstream AI defects |
| Feature freshness delay | Real-time decision risk |
| Source freshness | RAG policy staleness |
| Citation support pass rate | Whether answer claims are grounded |
| Tool failure / retry / unauthorized attempt | Agent reliability and security risk |
| Permission boundary violations | Potential privacy or action authority issue |
9.5 KRI thresholds
| KRI | Green | Amber | Red |
|---|---|---|---|
| Repeat issue rate | No recurrence in window | Single low-severity recurrence | Same root cause causes L2+ issue |
| Evidence completeness L3+ | 100% required evidence | Non-critical evidence delayed | Critical log, customer notice or approval missing |
| Overdue CAPA | None past SLA | L1/L2 past SLA | L3/L4 past SLA |
| Unsupported RAG claim | Below baseline | 1.5x baseline or high-risk sample | 2x baseline or customer harm |
| Human override spike | Within control band | 2-week upward trend | Spike plus complaint or appeal signal |
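Of the KRIs above, the overdue CAPA row is fully specified by the table, so it makes a clean example of a threshold rule. The sketch below assumes each open CAPA is a `(severity, due_date)` pair and that "past SLA" means the due date is strictly before today; both are illustrative choices.

```python
from datetime import date


def overdue_capa_status(open_capas: list, today: date) -> str:
    """Return Green, Amber, or Red from (severity, due_date) pairs."""
    overdue = {severity for severity, due in open_capas if due < today}
    if overdue & {"L3", "L4"}:
        return "Red"
    if overdue & {"L1", "L2"}:
        return "Amber"
    return "Green"
```

The highest overdue severity wins, so a single late L3 turns the indicator red regardless of how many L1 items are on time.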
10. RACI
| Activity | Product | BA | Architect | Data Science | AI Platform | Ops | Model Risk | Compliance / Legal | Data Gov | Security | Business Owner |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Signal intake design | A | R | C | C | C | R | C | C | C | C | A |
| Taxonomy and workflow | A | R | C | C | C | R | C | C | C | C | A |
| AI contribution matching | C | R | A | R | R | C | C | C | C | C | A |
| Severity decision | A | R | C | C | C | R | C | C | C | C | A |
| Customer containment | A | C | C | C | C | R | C | C | C | C | A |
| Root cause analysis | A | R | R | R | R | C | C | C | C | C | A |
| Dataset change approval | C | C | C | R | C | C | A | C | A | C | A |
| Prompt / RAG change approval | A | C | R | C | R | C | C | C | C | C | A |
| Model change approval | C | C | C | R | C | C | A | C | C | C | A |
| Tool / agent permission change | A | C | R | C | R | C | C | C | C | A | A |
| Effectiveness verification | A | R | C | R | R | R | C | C | C | C | A |
| Closure and residual risk | A | R | C | C | C | C | A | C | C | C | A |
Legend: R = responsible, A = accountable, C = consulted.
11. Templates
11.1 Corrective Action Intake
| Field | Rule | Example |
|---|---|---|
| Issue title | Customer or control impact in plain language | RAG gave stale annual fee waiver guidance |
| Signal source | Complaint, override, eval, drift, audit, incident | complaint + RAG eval failure |
| AI system | Registered system and version | servicing_rag v4.8, index_2026_06_12 |
| Journey step | Where customer or employee saw output | mobile chat fee dispute answer |
| Impact | Customer, risk, operations or control impact | wrong explanation, possible missed escalation |
| Severity | L0-L4 with rationale | L2 due to customer-facing wrong policy |
| Containment | Immediate protection | route fee waiver intents to human review |
| Owner | Named accountable business owner | Credit Card Servicing Product Owner |
| Due date | SLA based on severity | 2026-07-03 |
11.2 RCA Worksheet
| Field | Rule | Example |
|---|---|---|
| Confirmed failure | Observable failure, not opinion | Answer cited superseded fee policy |
| Direct cause | Immediate technical or process cause | retriever ranked old policy summary first |
| Contributing causes | Other enabling weaknesses | source lifecycle SOP lacked deactivation step |
| Control failure | Why control did not catch it | freshness gate checked upload date, not effective date |
| Root cause family | Use taxonomy | RAG / knowledge + governance |
| Evidence | Logs, samples, policies, evals | trace_8831, complaint_99102, eval_fee_v1 |
| Corrective action | Fix current defect | deactivate old source and update retriever filter |
| Preventive action | Stop recurrence | source retirement SOP and release gate update |
11.3 Change Linkage Memo
| Section | Content rule | Example |
|---|---|---|
| Issue link | issue_id and severity | ISS-2026-06-014, L2 |
| Root cause | concise root cause | current-source priority missing |
| Artifact changed | exact artifact and version | retriever_config v5.3, source_registry 2026-06-30 |
| Approval | who approved and why | Knowledge Owner + Product + Compliance |
| Expected effect | measurable outcome | unsupported fee claim below 5% |
| Rollback | rollback condition | complaint spike or eval pass below 95% |
| Monitoring | production KRI and window | 30-day fee complaint and override monitor |
11.4 Effectiveness Verification Memo
| Field | Evidence standard | Example |
|---|---|---|
| Baseline | Before metric with time window | unsupported claim 14% on fee eval v1 |
| Post-fix result | After metric with same or stronger method | 3% on locked fee eval v2 |
| Segment check | Key segments verified | mobile, contact center, Spanish |
| Customer signal | Complaints, appeals, overrides | fee complaint back to baseline for 2 weeks |
| Control signal | New gate or monitoring works | freshness gate blocks superseded docs |
| Residual risk | Accepted remaining risk | low risk for rarely used legacy fee terms |
| Closure decision | approver and date | Product + Model Risk closure on 2026-07-30 |
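The baseline and post-fix rows reduce to simple arithmetic that is worth stating explicitly in the memo. The sketch below computes the absolute and relative reduction for the example values (14% before, 3% after) and checks them against the "below 5%" expected effect from the Change Linkage Memo. The function name and output keys are hypothetical. Note that the example compares eval v1 with eval v2, so the numbers are comparable only to the extent that v2 is the same or a stronger method.

```python
def effect_summary(baseline: float, post_fix: float, target: float) -> dict:
    """Summarize a before/after rate comparison against a target rate."""
    reduction = baseline - post_fix
    return {
        # Percentage points, e.g. 14% -> 3% is an 11.0 point drop.
        "absolute_reduction_pp": round(reduction * 100, 1),
        # Reduction relative to the baseline rate, in percent.
        "relative_reduction_pct": round(reduction / baseline * 100, 1),
        "meets_target": post_fix < target,
    }


summary = effect_summary(baseline=0.14, post_fix=0.03, target=0.05)
```

For the example values this gives an 11.0 percentage point drop, a 78.6% relative reduction, and a post-fix rate that meets the target.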
11.5 Executive Status Memo
| Section | Good content |
|---|---|
| Decision needed | Approve closure, extend monitoring, pause automation, fund fix or accept residual risk |
| Customer impact | Plain-language customer effect and affected count |
| Root cause | One sentence, with system and control cause |
| Actions completed | Corrective and preventive actions, not activity list |
| Evidence | Before / after results and production KRI |
| Open risk | What remains, who owns it and review date |
| Recommendation | Clear management action |
12. 30-Day Lab
Goal: Within 30 days, turn closed-loop learning from a concept into demonstrable retail financial services AI product, architecture, and governance assets. The reader is assumed to have a senior BA / PM / architecture foundation, so there is no basic flowchart practice.
| Day | Topic | Deliverable |
|---|---|---|
| 1 | Pick a retail financial services AI use case and define the customer journey and AI touchpoints | AI touchpoint map |
| 2 | Read the NIST AI RMF and map Govern / Map / Measure / Manage | governance mapping note |
| 3 | Read the ISO/IEC 42001 overview and extract management system controls | AI management system control map |
| 4 | Read SR 26-2 and summarize the 2026 model risk management nuance | model risk nuance memo |
| 5 | Analyze which feedback signals the CFPB complaint database can support | complaint signal taxonomy |
| 6 | Design the signal intake schema | corrective action intake schema |
| 7 | Design the feedback usage policy | train / eval / RCA / audit usage matrix |
| 8 | Design the issue taxonomy | issue taxonomy and examples |
| 9 | Design the severity matrix | L0-L4 response matrix |
| 10 | Design the containment patterns | pause, narrow, gate, revert, remediate playbook |
| 11 | Design the root cause taxonomy | data/model/prompt/RAG/tool/workflow/policy/governance map |
| 12 | Complete a Five-why RCA drill | RCA worksheet |
| 13 | Design the change linkage model | issue-to-change evidence graph |
| 14 | Design dataset update governance | dataset change checklist |
| 15 | Design prompt update governance | prompt diff and regression gate |
| 16 | Design model update governance | challenger validation and threshold memo |
| 17 | Design RAG update governance | source freshness and citation support gate |
| 18 | Design tool / agent update governance | tool permission and state machine gate |
| 19 | Design the effectiveness verification hierarchy | verification evidence matrix |
| 20 | Design the recurrence monitoring windows | monitoring plan by severity |
| 21 | Design the executive dashboard | CAPA severity and aging dashboard mock |
| 22 | Design the operations dashboard | complaint, override, appeal, remediation view |
| 23 | Design the model risk dashboard | eval, drift, validation and closure view |
| 24 | Write the RACI | role matrix |
| 25 | Write the Corrective Action Intake template | completed example |
| 26 | Write the RCA + Change Linkage memo | completed example |
| 27 | Write the Effectiveness Verification memo | completed example |
| 28 | Run a tabletop exercise: RAG stale policy incident | incident simulation pack |
| 29 | Prepare the interview story | STAR-T answers |
| 30 | Complete the portfolio package | architecture diagram, workflow, dashboards, templates, executive memo |
13. Interview Answers
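13.1 How does closed-loop learning differ from active learning?
30-second answer:
Active learning selects the samples most worth human labeling in order to improve the model. Closed-loop learning turns production feedback, complaints, overrides, eval failures, drift, incidents, and audit findings into governed issues, finds the root cause, links it to specific changes, and then proves the fix works and prevents recurrence. It is closer to CAPA for AI.
2-minute expansion:
In retail financial services, feedback cannot be treated as training data by default. A complaint may be a customer harm signal, an override may reflect staff distrust of the AI, an eval failure may be a release-gate gap, and drift may reflect a change in data, policy, or behavior. My design starts with signal normalization and AI contribution matching, then classifies severity and issue type. RCA follows, deciding whether to fix data, prompt, model, RAG, tool, workflow, or policy. Every action links to a specific version and approval, and effectiveness is demonstrated with locked evals, production KRIs, complaint trends, and recurrence monitoring.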
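13.2 How would you design a CAPA-like AI corrective action workflow?
30-second answer:
I would design a closed loop of detect, match, classify, contain, RCA, action, change, verify, monitor, and close. What matters is not a good-looking flowchart but that every step has an owner, an SLA, evidence, and closure criteria.
2-minute expansion:
Step one collects signals, including complaints, appeals, human overrides, expert QA, eval failures, drift, incidents, and audit findings. Step two confirms whether an AI system and version took part in the customer journey. Step three assigns severity by customer impact and control risk. High-risk scenarios get containment first, for example pausing automated answers, raising human review, or making customers whole. Next comes the root cause taxonomy, which separates data, model, prompt, RAG, tool, workflow, policy, vendor, and governance causes. Fixes must go through change management and link to a dataset, prompt, model, index, tool policy, or SOP. Finally, eval, replay, canary, QA, production KRIs, and the recurrence window verify the fix, and the issue closes only once the evidence is complete.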
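13.3 How do you keep the feedback loop from amplifying bias?
30-second answer:
Distinguish feedback usage, record sampling probabilities and business actions, protect the eval set, review coverage and harm by segment, and retain random sentinel samples. Not every override or complaint can go straight into training.
2-minute expansion:
If a fraud model is trained only on the transactions the model blocked, it never sees the missed fraud among approved transactions. If credit explanations are trained only on feedback from customers who appeal, the silent harm to vulnerable customers is underestimated. My design gives each piece of feedback a usage tag: train, eval, RCA, monitor, audit, or exclude. Training data needs lineage, label quality, and segment coverage. Eval sets must be locked so failing samples are not tuned repeatedly until they pass. Production monitoring tracks complaints, appeal upheld, override, abandonment, and segment disparity. That way closed-loop learning is governed learning rather than model self-reinforcement.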
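Recording sampling probabilities matters because it lets a reviewer correct for selective labeling. The sketch below is a self-normalized inverse-probability-weighted (Hájek) estimate of a population rate from logged `(label, sampling_probability)` pairs. It is one standard correction, named here as an illustration rather than something this playbook prescribes, and the numbers are made up: blocked transactions are always reviewed, while approved ones are reviewed through a 10% random sentinel sample.

```python
def naive_rate(samples: list) -> float:
    """Unweighted share of positive labels among reviewed cases."""
    return sum(label for label, _ in samples) / len(samples)


def ipw_rate(samples: list) -> float:
    """Self-normalized inverse-probability-weighted (Hajek) rate estimate.

    Each sample is (label, p), where p is the logged probability that the
    case was selected for review. Cases reviewed rarely count for more.
    """
    weighted_positives = sum(label / p for label, p in samples)
    total_weight = sum(1 / p for _, p in samples)
    return weighted_positives / total_weight


# Illustrative data: 10 blocked transactions, all reviewed (p = 1.0), 8 fraud;
# 10 approved transactions reviewed via a 10% sentinel sample (p = 0.1), 1 fraud.
reviewed = [(1, 1.0)] * 8 + [(0, 1.0)] * 2 + [(1, 0.1)] * 1 + [(0, 0.1)] * 9
```

On this data the naive rate is 0.45, while the weighted estimate is 18 / 110, about 0.164, because each sentinel case stands in for ten approved transactions. Without the sentinel sample there would be no approved cases to weight at all, which is the gap the answer above describes.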
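13.4 How do you prove a prompt or RAG fix actually worked?
30-second answer:
Saying the prompt was changed is not enough. I want to see the prompt diff, policy mapping, a passing locked eval, production sample QA, a drop in related complaints and overrides, and no recurrence of the same issue within the recurrence window.
2-minute expansion:
Take stale RAG policy as an example. I first confirm whether the root cause is source lifecycle, retriever ranking, the prompt's citation instruction, or the handoff rule. After the fix, eval cases are locked to verify that answer claims are supported by currently effective sources, that citations directly support the answer, and that high-risk intents are refused or escalated to a human. I also run a production canary and human QA, and monitor related complaints, agent overrides, unsupported claims, source freshness, and escalation misses. If customers were affected, I also need proof that customer notification, correction, and compensation were completed. The issue closes only when quality, customer, and control metrics all pass.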
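13.5 How does SR 26-2 affect AI closed-loop learning?
30-second answer:
SR 26-2 is an important anchor for bank model risk management in 2026, emphasizing a risk-based approach, lifecycle, governance, independent challenge and issue remediation. The nuance is that it formally focuses on model risk management, so closed-loop learning for generative and agentic AI still has to layer on AI management system, consumer compliance, privacy, security, third-party and operational controls.
2-minute expansion:
I would use the spirit of SR 26-2 to require full lifecycle governance for model-related issues: issue identification, risk assessment, validation, change control, independent challenge, documentation and closure evidence. For generative AI or agentic workflows, I would not only ask whether the system falls within the traditional model definition. I would also add NIST AI RMF, ISO/IEC 42001, complaint signals, security, privacy, third-party and operational resilience from a customer impact and control perspective. In other words, SR 26-2 provides model risk discipline, but the AI corrective action architecture has to cover the broader system.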
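13.6 Why does the root cause taxonomy matter?
30-second answer:
Without a root cause taxonomy, teams attribute every problem to “the model is inaccurate” or “the prompt needs changing”. The real root cause in financial AI may be data, labels, RAG source, tool permission, workflow, policy, vendor or governance.
2-minute expansion:
The right root cause determines the right fix. Suppose a customer receives an incorrect fee explanation. On the surface the RAG system answered wrongly; the real root cause may be that an old policy was never retired, the retriever has no effective-date filter, the prompt does not require current sources, or the servicing process does not escalate conflicting answers. If only the prompt is changed, the problem will recur. I would establish a root cause taxonomy and require every RCA to state the direct cause, contributing cause, control failure and preventive action. Closed-loop learning then fixes the system, not just the output.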
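A root cause taxonomy and the four required RCA fields can be sketched as a typed record with a completeness check. The category list follows the one used in 13.2; the class names, field names and the check itself are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class RootCause(Enum):
    DATA = "data"
    MODEL = "model"
    PROMPT = "prompt"
    RAG = "rag"
    TOOL = "tool"
    WORKFLOW = "workflow"
    POLICY = "policy"
    VENDOR = "vendor"
    GOVERNANCE = "governance"

@dataclass(frozen=True)
class RcaRecord:
    direct_cause: RootCause
    contributing_causes: tuple[RootCause, ...]
    control_failure: str     # which control should have caught the issue
    preventive_action: str   # what stops the same cause from recurring

def rca_gaps(rca: RcaRecord) -> list[str]:
    """List the required free-text RCA fields that are still empty."""
    gaps = []
    if not rca.control_failure.strip():
        gaps.append("control_failure")
    if not rca.preventive_action.strip():
        gaps.append("preventive_action")
    return gaps
```

For the fee-explanation example, an RCA with `direct_cause=RootCause.RAG` and contributing causes `PROMPT` and `WORKFLOW` would still be rejected by `rca_gaps` until a preventive action, such as a source retirement step, is written down.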
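13.7 Which metrics show that corrective action is working well?
30-second answer:
I would look at open CAPA by severity, aging, repeat issue rate, overdue verification, evidence completeness, complaint / appeal / override trend, post-fix recurrence and residual risk acceptance.
2-minute expansion:
A mature system cannot just count how many tickets were closed. I want to see whether high-severity issues were contained in time, whether each RCA is classified by root cause, whether actions link to specific artifacts, whether verification was completed as planned, whether customers were made whole, and whether recurrence is falling. For product and operations, the metrics are complaints, appeals upheld, overrides, abandonment and remediation SLA. For model risk, they are eval regression, drift-to-action, dataset lineage and validation exception closure. For executives, they are overdue CAPA, repeat root cause, residual risk and investment gaps.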
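Several of these metrics can be computed directly from a list of CAPA records. The sketch below covers four of them; the `Capa` fields and the metric definitions (for example, repeat issue rate as repeats divided by all cases) are assumptions chosen for illustration, not standard definitions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Capa:
    severity: str
    root_cause: str
    opened: date
    verification_due: date
    closed: Optional[date] = None
    is_repeat: bool = False  # True when the same root cause recurred after an earlier fix

def capa_metrics(cases: list[Capa], today: date) -> dict:
    """Compute a small set of CAPA health metrics as of `today`."""
    open_cases = [c for c in cases if c.closed is None]
    return {
        "open_by_severity": dict(Counter(c.severity for c in open_cases)),
        "max_open_age_days": max(((today - c.opened).days for c in open_cases), default=0),
        "overdue_verification": sum(1 for c in open_cases if c.verification_due < today),
        "repeat_issue_rate": (sum(c.is_repeat for c in cases) / len(cases)) if cases else 0.0,
    }
```

The same records can feed all three audiences named above: operations reads aging and overdue verification, model risk reads repeats by root cause, and executives read the open count by severity.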
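13.8 How do you handle the relationship between customer complaints and model training?
30-second answer:
A complaint is a high-value customer harm signal, but it cannot be used directly as a training label. It first needs case review, AI contribution matching, root cause analysis and usage tagging.
2-minute expansion:
Customer complaints reflect customer experience and impact, but the narrative may be incomplete or emotional. My approach is to enter each complaint into the harm and corrective action ledger and link it to the AI trace, case record, policy version and journey step. If review confirms that the AI output was wrong, the complaint can generate an eval scenario, RCA evidence and a customer remediation case, and a training label where necessary. Before anything enters training, it needs expert adjudication, privacy review, a segment bias check and dataset lineage. The more direct value of complaints is that they expose broken product promises and flaws in the recovery path.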
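The path from a reviewed complaint to its downstream artifacts can be sketched as a small decision function. The `ComplaintReview` fields, the artifact names and the `customer_affected` condition are assumptions added for the example; the four training preconditions come from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComplaintReview:
    ai_output_confirmed_wrong: bool
    customer_affected: bool
    expert_adjudicated: bool
    privacy_reviewed: bool
    segment_bias_checked: bool
    lineage_recorded: bool

def derived_artifacts(r: ComplaintReview) -> list[str]:
    """Decide which artifacts a reviewed complaint may generate."""
    artifacts = ["ledger_entry"]  # every complaint is logged in the ledger
    if not r.ai_output_confirmed_wrong:
        return artifacts
    artifacts += ["eval_scenario", "rca_evidence"]
    if r.customer_affected:
        artifacts.append("customer_remediation_case")
    # A training label requires all four reviews, not just a confirmed error.
    if all([r.expert_adjudicated, r.privacy_reviewed,
            r.segment_bias_checked, r.lineage_recorded]):
        artifacts.append("training_label")
    return artifacts
```

The ordering reflects the priority in the text: a confirmed error always yields an eval scenario and RCA evidence, while the training label is the last and most heavily gated output.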
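13.9 How do you turn closed-loop learning into a portfolio piece?
30-second answer:
I would present one complete case: signal taxonomy, CAPA workflow, root cause taxonomy, change linkage, effectiveness dashboard, RACI and executive memo, rather than model metrics alone.
2-minute expansion:
Take a credit card servicing RAG stale policy case. The portfolio first shows the customer journey and AI touchpoints, then how complaints, agent overrides, RAG eval failures and a drift cluster enter the corrective action ledger. It then uses the root cause taxonomy to show that the problem is not simply the prompt but source lifecycle, retrieval ranking and the handoff gate. Next it shows the change linkage: source registry, retriever config, prompt version, eval set and release approval. Finally it demonstrates effectiveness with locked evals, production QA, complaint trends, a decline in overrides and the recurrence window. This demonstrates PM, BA, architecture and governance capability.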
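13.10 If the team says “we already have a monitoring dashboard”, how do you respond?
30-second answer:
A dashboard only detects problems; it is not a closed loop. I want to see whether each alert has an owner, severity, RCA, action, change linkage, verification, closure and recurrence monitoring.
2-minute expansion:
Many AI teams have attractive drift, latency and quality dashboards, but no one acts when an alert fires, or there is no evidence after the fix that it worked. My standard is drift-to-action and signal-to-CAPA. Every red metric must map to a triage owner and a runbook; every high-risk issue must link to a specific change; every closure must carry before / after evidence and residual risk acceptance. Otherwise the dashboard is an observation system, not a governance system.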
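The rule that every red metric maps to a triage owner and a runbook is easy to check mechanically. The sketch below is illustrative only; the routing table shape, the key names and the metric names in the usage example are hypothetical.

```python
def unmapped_alerts(red_metrics: list[str], routing: dict[str, dict]) -> list[str]:
    """Return red metrics that lack a triage owner or a runbook."""
    missing = []
    for metric in red_metrics:
        route = routing.get(metric, {})
        if not route.get("triage_owner") or not route.get("runbook"):
            missing.append(metric)
    return missing

# Usage: one metric is fully routed, one has no owner, one is not routed at all.
routing = {
    "drift_psi": {"triage_owner": "model-risk", "runbook": "RB-12"},
    "unsupported_claim_rate": {"triage_owner": "", "runbook": "RB-7"},
}
gaps = unmapped_alerts(["drift_psi", "unsupported_claim_rate", "latency_p95"], routing)
```

Running a check like this in the release pipeline turns "the dashboard has an owner" from a slide claim into a testable control.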
14. Final Operating View
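Mature AI closed-loop learning does not mean letting the model “automatically absorb feedback”. In the mature state, the organization can continuously demonstrate: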
We know where AI touches customers and operations.
We capture complaints, appeals, overrides, expert reviews, eval failures, drift, incidents and audits.
We classify issues by customer impact, control weakness and root cause.
We protect customers before permanent fixes are shipped.
We link each corrective action to a specific dataset, prompt, model, RAG, tool, workflow or policy change.
We verify fixes with locked evals, replay, QA, production KRIs and recurrence monitoring.
We prevent eval leakage, feedback bias and undocumented changes.
We can show executives, auditors and regulators what failed, why it failed, what changed, and why it is safe to close.
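For CBAP+, AI PM, AI BA and AI Architect roles, this is the core capability that moves the work from “continuously optimizing AI systems” to “an AI organization that continuously learns, corrects, proves and prevents recurrence”.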