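AI Exception / Risk Acceptance: Exception Governance Architecture
In one sentence:
A walkthrough of AI Exception / Risk Acceptance / Waiver Architecture.
Audience: AI Product Lead / Senior BA / Product Architect / Enterprise Architect / Model Risk Partner / Operational Risk Lead / Audit Evidence Owner.
Core question: when an AI product temporarily cannot meet a standard control, how should the organization manage policy exceptions, temporary waivers, residual risk acceptance, compensating controls, expiry and renewal, evidence, escalation, board/audit visibility, and hard stop conditions?
Learning objective: distinguish risk appetite from exception management. Risk appetite first defines the boundary of AI risk the organization is willing to take; waiver architecture manages the time-boxed, scope-limited, evidenced deviation from standard controls and the exit from it.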
Source Anchors
| Source | Link | Purpose |
|---|---|---|
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Use Govern / Map / Measure / Manage to organize exception context, risk measurement, compensating controls, and ongoing governance |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Use AI management system language to design responsibilities, operational controls, performance evaluation, management review, and continual improvement |
| Federal Reserve SR 26-2 | https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm | Serves as the new model risk management anchor for financial institutions; SR 26-2 replaced SR 11-7 and SR 21-8 on 2026-04-17 |
| FFIEC IT Examination Handbook Management booklet | https://ithandbook.ffiec.gov/it-booklets/management.aspx | Organize management evidence from the perspectives of IT governance, risk management, third-party, change, audit, and board reporting |
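In one sentence:
AI Exception Architecture turns "temporarily cannot meet a standard control" into a risk acceptance system with boundaries, compensating controls, an expiry, evidence, escalation, and hard stop conditions.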
1. Thesis
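Risk appetite answers which AI risks the organization is willing to take overall, which uses are prohibited, which uses are conditionally allowed, and what the default controls are for each risk tier.
AI exception / waiver management answers, for one specific use case: why it temporarily deviates from standard controls, which policy or control it deviates from, who accepts the residual risk, which compensating controls limit the risk, which metrics trigger an immediate stop, and how it returns to standard controls or stops at expiry.
So a waiver is not a substitute for risk appetite. It is a controlled exception to a control, granted only after risk appetite has been defined.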
2. Why It Matters
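The common reality for AI teams: the product value is clear, but some controls are not yet mature. For example, the golden eval set is still being expanded, the RAG citation checker does not yet cover every document format, third-party vendor evidence is not yet fed automatically into the evidence binder, or the agent tools are already read-only but the tool deny reason does not yet meet the audit field standard.
Without a waiver architecture, organizations drift toward one of two extremes:
| Extreme | Risk |
|---|---|
| Block everything | Low-risk learning and real feedback cannot happen, and teams may bypass governance |
| Verbal sign-off | Exceptions become permanent shadow policy, and at audit nobody can explain who accepted which residual risk |
A mature mechanism allows limited experimentation but does not let exceptions get out of control.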
3. Core Concepts
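| Concept | Meaning | AI productization notes |
|---|---|---|
| Exception | A limited deviation from a standard policy or control | Bind to control id, scope, duration, owner |
| Waiver | A formally approved temporary exemption | Never open-ended; must have an expiry, renewal rules, and an exit path |
| Risk acceptance | Management explicitly accepts the residual risk | What is accepted is residual risk, not the abandonment of controls |
| Compensating control | An alternative control that reduces risk | traffic cap, HITL, sampling, read-only, disclosure |
| Residual risk | Risk that remains after compensating controls | Must be describable, monitorable, and escalatable |
| Hard stop | A condition that triggers immediate pause or rollback | Written into the runbook in advance |
| Shadow policy | An exception that becomes de facto policy through repeated renewal | Resolve via policy update, control investment, or closure |
Key principles: an exception may deviate from a control but may never bypass a prohibited use; it must be limited by time, volume, channel, user, or data; approval authority rises with risk tier, customer impact, and degree of automation; the number of exceptions, the renewal count, and expired-but-open items are themselves governance signals.
GenAI / agentic AI exceptions must combine model risk, operational risk, consumer compliance, privacy, security, and third-party controls; they cannot be approved on model risk alone.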
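The tiered-authority principle can be made concrete with a small lookup. The following is a minimal sketch, not a prescribed matrix: the role names, the four-tier ordering, and the rule that customer impact and automation each raise the effective tier by one step are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical ordering of risk tiers, lowest to highest.
TIERS = ["low", "medium", "high", "critical"]

# Hypothetical authority matrix: minimum approver role per effective tier.
APPROVER_BY_TIER = {
    "low": "product_owner",
    "medium": "business_unit_risk_lead",
    "high": "head_of_business_line",
    "critical": "executive_risk_committee",
}


@dataclass(frozen=True)
class ExceptionRequest:
    risk_tier: str
    customer_facing: bool
    autonomous_actions: bool  # e.g. write-enabled agent tools
    prohibited_use: bool


def required_approver(req: ExceptionRequest) -> str:
    """Return the minimum approver role, or raise if the use is prohibited."""
    if req.prohibited_use:
        # An exception may deviate from a control, never from a prohibition.
        raise ValueError("prohibited use: no-go, not eligible for a waiver")
    level = TIERS.index(req.risk_tier)
    # Assumption: customer impact and automation each raise the tier by one.
    level += int(req.customer_facing) + int(req.autonomous_actions)
    level = min(level, len(TIERS) - 1)
    return APPROVER_BY_TIER[TIERS[level]]
```

Raising on a prohibited use, rather than returning a very senior approver, keeps the no-go rule outside the approval path entirely.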
4. Architecture Diagram
Risk appetite and AI policy baseline
-> standard control catalog
-> AI use case / change request
-> control gap
-> exception request
-> residual risk analysis
-> compensating controls
-> tiered approval
-> restricted operation
-> KRI dashboard and evidence binder
-> expiry review
-> close / renew / remediate / policy update / stop
Control plane:
Exception registry
+ policy/control IDs
+ risk tier and scope
+ residual risk owner
+ compensating controls
+ expiry and renewal rules
+ hard stop triggers
+ evidence links
+ board/audit reporting status
This should connect to release governance, incident management, model inventory, vendor risk, privacy review, security review, EvalOps and audit evidence.
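One way to picture the exception registry is as a record type plus a table of allowed lifecycle transitions. This is a sketch under assumptions: the field names mirror the control-plane list above, while the state names, the `TRANSITIONS` table, and the renewal counter are hypothetical choices, not part of any cited framework.

```python
from dataclasses import dataclass, field
from datetime import date

# Allowed lifecycle transitions, loosely following the flow above.
# State names are illustrative assumptions.
TRANSITIONS = {
    "requested": {"approved", "rejected"},
    "approved": {"active"},
    "active": {"expiry_review", "stopped"},
    "expiry_review": {"closed", "renewed", "remediated", "policy_updated", "stopped"},
    "renewed": {"active"},
}


@dataclass
class RegistryEntry:
    exception_id: str
    control_ids: list[str]
    risk_tier: str
    scope: dict
    residual_risk_owner: str
    compensating_controls: list[str]
    expiry_date: date
    hard_stop_triggers: list[str]
    evidence_links: list[str] = field(default_factory=list)
    reporting_status: str = "not_reported"
    renewal_count: int = 0
    state: str = "requested"

    def transition(self, new_state: str) -> None:
        """Move to new_state only if the lifecycle allows it."""
        allowed = TRANSITIONS.get(self.state, set())
        if new_state not in allowed:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state == "renewed":
            # Renewals are counted so that aging can be reported later.
            self.renewal_count += 1
        self.state = new_state
```

Terminal states such as `closed` and `stopped` have no outgoing transitions, so a stopped exception cannot quietly return to `active` without a new request.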
5. Exception Taxonomy
| Type | Example | Typical limit |
|---|---|---|
| Eval coverage | High-risk slices not yet fully covered | limited pilot, sample review |
| Model validation | Independent challenge has not issued its final report | shadow mode or capped release |
| Data/privacy | Insufficient redaction / retention evidence | restricted data, no export |
| Security | Gateway lacks the tool deny attribute | read-only mode, SIEM alert |
| Third-party | Vendor SOC / SLA evidence is being refreshed | fallback route, traffic cap |
| Operational | Human review SLA below standard | lower traffic, queue monitoring |
| Evidence | Automation gap in the evidence binder | manual evidence pack with owner |
| Policy interpretation | New agentic pattern not covered by policy | short expiry and policy update path |
6. Financial Retail Case
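Scenario: a retail bank customer-service AI copilot expands from internal agent drafts to a limited customer-facing FAQ.
Standard high-tier controls require: source citation correctness >= 95%; complaint / fee dispute / credit commitment / account closure route to human; customer-facing answer uses approved disclosure; production trace includes model, prompt, source, policy and risk tier; write-enabled tools remain disabled.
Exception request: the citation checker has insufficient coverage of PDF fee documents laid out as tables and can currently verify only 88% of citations automatically; the team wants a 30-day pilot on 5% of low-risk FAQ traffic.
A qualifying waiver decision:
| Area | Decision |
|---|---|
| Scope | Low-risk FAQ only; excludes complaints, fee waivers, credit commitments, and account closure |
| Duration | 30 days, no automatic renewal |
| Compensating controls | Daily manual sample of 100 answers, RAG source allowlist, no-answer fallback, complaint hard-route |
| Residual risk | Customers may receive incomplete citations, but high-risk intents are excluded and answers are manually sampled |
| Hard stop | wrong fee commitment > 0; complaint escalation miss > 0; unsupported citation sample rate > 2% |
| Evidence | waiver memo, source registry, sample QA, trace dashboard, daily KRI, expiry decision |
| Exit path | Return to standard controls once the PDF citation checker is fixed; stop the pilot if it is not fixed |
Unacceptable practices: "business pressure is high, ship first and fix later", "we will just renew this exception every month", "the risk team agreed verbally", "we will look at it once customers complain".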
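The three hard stop conditions in the decision table are simple enough to check mechanically against a daily KRI snapshot. The sketch below assumes a hypothetical `DailyKri` record; treating a day with no sampled answers as a stop reason is an added assumption (the compensating control did not run), not something the case states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyKri:
    wrong_fee_commitments: int
    complaint_escalation_misses: int
    sampled_answers: int        # size of the daily manual sample
    unsupported_citations: int  # unsupported citations found in that sample


def hard_stop_reasons(kri: DailyKri, max_unsupported_rate: float = 0.02) -> list[str]:
    """Return every hard stop condition breached today; empty means keep running."""
    reasons = []
    if kri.wrong_fee_commitments > 0:
        reasons.append("wrong_fee_commitment")
    if kri.complaint_escalation_misses > 0:
        reasons.append("complaint_escalation_miss")
    if kri.sampled_answers == 0:
        # Assumption: a missing sample means the compensating control failed.
        reasons.append("sample_review_missing")
    elif kri.unsupported_citations / kri.sampled_answers > max_unsupported_rate:
        reasons.append("unsupported_citation_rate")
    return reasons
```

Returning all breached conditions, rather than stopping at the first, gives the incident record a complete picture of why the pilot was paused.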
7. PM / BA / Architect Checklist
Product / BA:
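- Which policy, control, or release gate does the exception deviate from?
- Does the exception touch a prohibited use? If so, it is a direct no-go.
- Is the business rationale specific about customer value, operational pressure, or a regulatory deadline?
- Is the scope limited by user, channel, region, data, tool, model, or traffic?
- Does the residual risk statement cover customer, operational, compliance, and reputational impact?
- Can the compensating controls actually be executed, rather than being paper commitments?
- Is the expiry date explicit, with a review forum assigned?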
Architect / risk:
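- Can the runtime enforce the scope limits?
- Do the feature flag, tool gateway, and policy engine support an immediate pause?
- Can telemetry tag the exception id, control gap, and model/prompt/source/tool version?
- Can evidence be generated from systems, without relying on screenshots?
- Does the approver have the authority to accept residual risk at this tier?
- Are aging, repeat renewals, and expired exceptions monitored and reported?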
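The first three architect questions (scope enforcement, immediate pause, telemetry tagging) can be illustrated with a small routing gate. This is a sketch under assumptions: the `SCOPE` values are taken from the retail FAQ case, while the hash-based traffic bucketing, the `kill_switch_on` flag, and the decision labels are hypothetical design choices.

```python
import hashlib

SCOPE = {
    "exception_id": "AI-WVR-2026-0042",
    "channel": "web_faq",
    "traffic_cap_percent": 5,
    "excluded_intents": {"complaint", "fee_waiver", "credit_commitment", "account_closure"},
}


def in_traffic_cap(session_id: str, cap_percent: int) -> bool:
    """Deterministically bucket a session into 0..99 and admit buckets below the cap."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < cap_percent


def route(session_id: str, channel: str, intent: str,
          kill_switch_on: bool, scope: dict = SCOPE) -> dict:
    """Decide where a request goes and which exception id, if any, to log."""
    if intent in scope["excluded_intents"]:
        decision = "human"  # excluded intents never reach the pilot
    elif kill_switch_on or channel != scope["channel"]:
        decision = "standard_path"  # paused, or outside the waiver's channel
    elif in_traffic_cap(session_id, scope["traffic_cap_percent"]):
        decision = "ai_pilot"
    else:
        decision = "standard_path"
    # Tag telemetry with the exception id only when the pilot serves the request.
    tagged = scope["exception_id"] if decision == "ai_pilot" else None
    return {"decision": decision, "exception_id": tagged}
```

Hashing the session id keeps a given session on the same side of the cap across requests, which makes sampled evidence easier to reconcile with traces.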
8. Code-Lite Experiment
exception_id: AI-WVR-2026-0042
use_case: retail_service_low_risk_faq_pilot
risk_tier: high
control_gap: CTRL-CITATION-CHECKER-PDF-TABLES
scope: {channel: web_faq, traffic_cap: 5_percent, excluded_intents: [complaint, fee_waiver, credit_commitment, account_closure]}
expiry_date: 2026-07-30
residual_risk_owner: head_of_retail_service
compensating_controls: [daily_sample_review_100_cases, no_write_tools, source_allowlist_only, complaint_intent_hard_route]
hard_stop: [confirmed_wrong_fee_commitment_gt_0, complaint_escalation_miss_gt_0, unsupported_citation_sample_rate_gt_2_percent]
evidence: [waiver_memo, daily_kri_dashboard, qa_sample_log, release_trace_samples]
decision: limited_approval
Validation rule examples:
Reject if expiry_date is missing.
Reject if hard_stop is empty.
Reject if prohibited_use is true.
Reject if risk_tier is high/critical and residual_risk_owner lacks authority.
Escalate if renewal_count >= 2 or days_active > 90.
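The validation rules above translate almost line for line into a function over the waiver record. This is a minimal sketch: `HIGH_TIER_AUTHORITY` is a hypothetical stand-in for a real authority matrix, and `start_date` and `renewal_count` are assumed fields that the sample record does not show.

```python
from datetime import date

# Hypothetical set of roles allowed to accept high/critical residual risk.
HIGH_TIER_AUTHORITY = {"head_of_retail_service", "chief_risk_officer"}


def validate(record: dict, today: date) -> tuple[str, list[str]]:
    """Return (outcome, reasons) where outcome is 'reject', 'escalate' or 'pass'."""
    rejects = []
    if not record.get("expiry_date"):
        rejects.append("expiry_date missing")
    if not record.get("hard_stop"):
        rejects.append("hard_stop empty")
    if record.get("prohibited_use", False):
        rejects.append("prohibited use")
    if (record.get("risk_tier") in {"high", "critical"}
            and record.get("residual_risk_owner") not in HIGH_TIER_AUTHORITY):
        rejects.append("residual_risk_owner lacks authority")
    if rejects:
        return "reject", rejects

    escalations = []
    if record.get("renewal_count", 0) >= 2:
        escalations.append("renewal_count >= 2")
    start = record.get("start_date")  # assumed field, a datetime.date
    if start is not None and (today - start).days > 90:
        escalations.append("days_active > 90")
    if escalations:
        return "escalate", escalations
    return "pass", []
```

Reject reasons are checked before escalation reasons, so a record that is both invalid and long-running is rejected rather than sent upward for approval.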
9. Interview Questions
| Question | Strong answer angle |
|---|---|
| How is waiver management different from risk appetite? | Risk appetite defines baseline boundaries; waiver management controls specific deviations from standard controls after the baseline exists. |
| When would you approve an AI exception? | When the use is not prohibited, scope is narrow, residual risk is accepted by the right owner, compensating controls are testable, expiry is fixed and hard stops are enforceable. |
| What makes GenAI exception handling different? | It combines model risk, operational risk, consumer compliance, privacy, security, third-party, data and agent tool-control concerns. |
| How do you prevent exceptions becoming shadow policy? | Track aging, renewals, repeat reasons, expired waivers, remediation backlog and require policy update or closure after limited renewals. |
| What should board/audit see? | Active high/critical exceptions, residual risk accepted, breached conditions, expired items, repeat waivers, hard stops triggered and remediation progress. |
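30-second version:
Risk appetite is the baseline; a waiver is a controlled exception that deviates from it. I would require every AI exception to have a policy/control id, business rationale, scope, expiry, residual risk owner, compensating controls, hard stop, evidence, and exit path. Any exception that keeps getting renewed must be escalated, because it may already have become shadow policy.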
10. Pitfalls
| Pitfall | Why dangerous | Fix |
|---|---|---|
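| Open-ended waiver | The exception becomes de facto policy | fixed expiry and renewal limit |
| Business rationale only | Nobody can see who accepted the risk | residual risk memo and approver role |
| Compensating controls that cannot be executed | Only promises remain at audit time | map every control to workflow, log or sample evidence |
| Ignoring third parties | Vendor changes can alter data, model, and availability risk | include third-party risk and fallback evidence |
| Approving on model risk alone | GenAI/agentic AI also involves tools, privacy, security, operations, and consumer compliance | cross-domain waiver review |
| No action at expiry | Expired exceptions keep running in production | automatic disable and escalation |
| No hard stop | Operation continues after metrics deteriorate | pre-approved pause/rollback conditions |
| Low-level approval of high risk | Residual risk accountability is misplaced | tiered authority matrix |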
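The "automatic disable and escalation" fix for expired waivers could run as a scheduled sweep over the registry. A minimal sketch, assuming each waiver is a dict with hypothetical `status`, `expiry_date`, and `renewal_count` fields, and assuming a waiver stays valid through its expiry date:

```python
from datetime import date


def sweep(waivers: list[dict], today: date, max_renewals: int = 2) -> dict:
    """Split active waivers into ones to disable and ones to escalate."""
    to_disable, to_escalate = [], []
    for w in waivers:
        if w["status"] != "active":
            continue
        if w["expiry_date"] < today:
            # Expired but still running: disable and report it.
            to_disable.append(w["exception_id"])
            to_escalate.append(w["exception_id"])
        elif w["renewal_count"] >= max_renewals:
            # Not expired, but renewed often enough to look like shadow policy.
            to_escalate.append(w["exception_id"])
    return {"disable": to_disable, "escalate": to_escalate}
```

In practice the `disable` list would feed the feature flag or gateway that enforces the waiver scope, and the `escalate` list would feed the board/audit reporting row; both integrations are outside this sketch.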
11. Practice Assignment
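Pick a retail financial services AI use case and write an exception architecture mini-pack covering the risk appetite baseline, control gap, exception taxonomy, residual risk memo, compensating controls, hard stop, expiry/renewal criteria, board/audit dashboard row, and exit path.
Completion criteria: the exception does not breach a prohibited use; the scope can be enforced by systems and operations; the residual risk owner is appropriate; the hard stop can be monitored and executed; at expiry the exception can be renewed, have its control gap closed, trigger a policy update, or be stopped; and you can explain in 2 minutes why this is not shadow policy.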