
AI Operational Resilience: Degradation and Continuity Architecture


A Reading of AI Operational Resilience / BCP / Degraded Mode Architecture

Audience: AI Product Architect / Enterprise Architect / Operational Resilience Lead / Senior BA / CBAP-level PM / Model Risk / Business Continuity.

Core question: A retail financial AI system cannot be safe only in its normal state. When the model, RAG, tools, identity, policy engine, eval, HITL, vendor, or evidence stack degrades, critical customer journeys and regulated processes must still remain controlled, explainable, and recoverable.

Learning objectives: Build architectural competence in AI critical operation mapping, degraded mode taxonomy, BCP/DR decision rights, manual fallback, RTO/RPO/SLO, evidence preservation, and recovery exercises.

Source Anchors

Source	Link	Purpose
FFIEC Business Continuity Management booklet	https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx	Organizes AI BCP in the language of business impact analysis, critical operations, dependencies, testing, and resilience
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to define AI risk, controls, monitoring, and degradation governance
ISO/IEC 42001	https://www.iso.org/standard/42001	Uses the AI management system to connect policy, operational controls, performance evaluation, and continual improvement
Federal Reserve SR 26-2	https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm	New 2026 anchor for model risk management; supersedes SR 11-7 / SR 21-8, but its current scope excludes generative AI and agentic AI models
Federal Reserve SR 20-24 / Interagency Sound Practices for Operational Resilience	https://www.federalreserve.gov/supervisionreg/srletters/SR2024.htm	Designs AI continuity in the language of operational resilience, critical operations, core business lines, third-party dependency, and impact tolerance

In one sentence:

AI resilience is proven in degraded mode, not in the happy path.

1. Thesis

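AI operational resilience is not an incident postmortem.

A postmortem looks at what follows an incident: what happened, what the root cause was, who was affected, and how to prevent recurrence.

Operational resilience / BCP addresses what must be decided before an incident:

Which AI-enabled operations are critical operations.
Which dependency failures must trigger a pre-approved degraded mode.
Which outputs must stop, fall back to templates, go to human review, or switch to an alternative path.
Who has the authority to trigger fallback, accept backlog, and restore automation.
How evidence stays preserved and auditable during degradation.
Whether the organization has validated all of this through tabletop, simulation, and recovery exercises.

The value of a senior PM / BA / architect lies in turning AI from a "feature" into a service capability that can be operated sustainably.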

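2. Why This Matters for Retail Finance

Retail financial AI often touches customer entitlements, fees, complaints, credit, KYC, AML, fraud, wealth, branch operations, and regulatory materials. In normal operation, teams can easily demonstrate that:

Model accuracy meets its target, RAG citations are available, and tools have approvals.
The human review queue can absorb exceptions, and logs and evidence are complete.

The real question is whether service stays controlled under stress:

Degraded dependency	Retail finance impact
Model provider outage	Customer service, credit summaries, and AML case narratives lose generation capability
RAG index stale	Customers receive outdated fee, complaint, credit, or product policies
Tool gateway degraded	The agent cannot look up cases, or write actions must stop
Identity / entitlement failure	The AI cannot confirm user, employee, role, purpose, or authorization
Policy engine false deny	Compliance rules over-block and business halts
Policy engine false allow	High-risk advice or actions bypass controls
Eval pipeline unavailable	Changes cannot be shown to be regression-free, so releases should freeze
HITL queue saturation	What was meant to be human oversight becomes an operational bottleneck
Vendor evidence export failure	Prompts, retrievals, tool calls, and approvals cannot be reconstructed during the incident

Operational resilience requires that a critical operation, even when it is not fully functional, remains controlled, explainable, recoverable, and provable.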

3. Core Concepts

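Concept	Senior-level definition
Critical operation	A business capability whose failure would significantly affect customers, markets, the institution's safety and soundness, regulatory obligations, or core business lines
Impact tolerance	The maximum disruption, degradation, backlog, error rate, or evidence gap that a critical operation can absorb
Degraded mode	A pre-designed operating state with reduced capability that is still controlled, such as read-only, templated, human-first, cached answers, or write tools disabled
Fallback decision rights	Who may trigger degradation, widen it, accept backlog, restore automation, and close the incident
AI BIA	Maps AI dependency failures to customer journeys, business processes, regulatory obligations, the evidence chain, and recovery objectives
AI RTO	Target time from an AI dependency failure to restored controlled service capability
AI RPO	Acceptable loss window for evidence, indexes, memory, logs, transaction context, or review records
Minimum viable service	The boundary of customer and employee service that must still be provided during degradation
Safe stop	Proactively stopping a class of AI outputs or tool actions when safe degradation is not possible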
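
The impact tolerance, AI RTO, and AI RPO concepts above can be made concrete as a small check. The following is a minimal sketch, assuming hypothetical class names, field names, and illustrative limits that are not taken from any of the source frameworks:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tolerance:
    """Hypothetical impact tolerance for one critical operation."""
    rto: timedelta       # max time to restore controlled service
    rpo: timedelta       # max acceptable evidence / log loss window
    max_backlog: int     # max items waiting for manual handling


@dataclass(frozen=True)
class Observed:
    """What monitoring currently reports (assumed inputs)."""
    time_since_failure: timedelta
    evidence_gap: timedelta
    backlog: int


def breaches(limit: Tolerance, now: Observed) -> list[str]:
    """Return the tolerance dimensions that are currently exceeded."""
    found = []
    if now.time_since_failure > limit.rto:
        found.append("rto")
    if now.evidence_gap > limit.rpo:
        found.append("rpo")
    if now.backlog > limit.max_backlog:
        found.append("backlog")
    return found


# Illustrative numbers only.
complaints = Tolerance(rto=timedelta(hours=4), rpo=timedelta(minutes=15), max_backlog=200)
state = Observed(timedelta(hours=1), timedelta(minutes=40), 120)
print(breaches(complaints, state))  # ['rpo']
```

In this example the service is still inside its RTO and backlog limits, yet the evidence gap alone breaches tolerance, which is the kind of failure an uptime-only view would miss.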

4. Degraded Mode Architecture

customer / employee request
  -> channel gateway
  -> AI service boundary
  -> dependency health and policy router
      -> normal mode: model + RAG + tools + HITL + evidence
      -> degraded model mode: approved fallback model or extractive summary
      -> degraded RAG mode: authoritative search + templates + citations only
      -> degraded tool mode: read-only / draft-only / no side effects
      -> degraded identity mode: no personalization, no sensitive data, manual auth
      -> degraded policy mode: deny risky actions, allow low-risk templates
      -> degraded HITL mode: risk-prioritized triage and surge staffing
      -> degraded evidence mode: local immutable capture and delayed export
  -> customer / employee response
  -> evidence ledger
  -> recovery gate

What matters in the architecture is not how many fallbacks exist, but whether the router can make controlled decisions based on dependency health, risk tier, customer impact, regulatory deadline, and evidence state.
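
The routing step in the diagram can be sketched as a lookup from unhealthy dependencies to degraded branches. This is a simplified sketch: the dependency keys and health labels are assumptions, and it routes on dependency health only, leaving out risk tier, customer impact, regulatory deadline, and evidence state.

```python
# Branch names mirror the diagram; the dependency keys are hypothetical.
DEGRADED_BRANCH = {
    "model": "degraded model mode",
    "rag": "degraded RAG mode",
    "tools": "degraded tool mode",
    "identity": "degraded identity mode",
    "policy": "degraded policy mode",
    "hitl": "degraded HITL mode",
    "evidence": "degraded evidence mode",
}


def route(dependency_health: dict) -> list:
    """Return the branches to activate for the reported health states.

    Any state other than "healthy" counts as degraded; a dependency that is
    not reported is assumed healthy (a simplification for this sketch).
    """
    unhealthy = [
        name for name in DEGRADED_BRANCH
        if dependency_health.get(name, "healthy") != "healthy"
    ]
    if not unhealthy:
        return ["normal mode"]
    return [DEGRADED_BRANCH[name] for name in unhealthy]


print(route({"rag": "stale", "evidence": "export_failing"}))
# ['degraded RAG mode', 'degraded evidence mode']
```

Several branches can be active at once, so a production router would also need a rule for combining them, for example letting the most restrictive one win.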

5. Degraded Mode Taxonomy

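Mode	Allowed	Prohibited	When to use
Read-only	Look up accounts, cases, and policy summaries	Write to the CRM, submit decisions, trigger customer notifications	Tool gateway or approval chain is unstable
Draft-only	Generate drafts for employee confirmation	Send customer or regulatory materials directly	Degraded text generation for complaints, credit, or AML
Template-only	Use pre-approved scripts and static FAQs	Freely generate commitments, eligibility, fees, or denial reasons	Reduced trust in RAG or the model
Citation-required	Answer only questions that can cite an authoritative source	Unsourced inference, inference that combines multiple policies	Partial degradation of RAG recall
Manual-first	AI only queues, summarizes, and retrieves	Automatically suggest conclusions or execute actions	Degraded high-risk processes
Cache-assisted	Use recently validated low-risk answers	Answer time-sensitive or personalized questions	Vendor outage while the cache is still fresh
Safe-stop	Suspend a class of AI functions	Continue serving uncontrolled AI output	Policy, identity, and evidence fail at the same time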
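
A taxonomy like the one above is easier to enforce when each mode carries an explicit allow-list. The sketch below covers five of the seven modes; the action names are hypothetical labels chosen for illustration, and anything not listed is denied by default.

```python
from enum import Enum


class Mode(Enum):
    READ_ONLY = "read-only"
    DRAFT_ONLY = "draft-only"
    TEMPLATE_ONLY = "template-only"
    MANUAL_FIRST = "manual-first"
    SAFE_STOP = "safe-stop"


# Hypothetical action names derived from the "Allowed" column.
ALLOWED = {
    Mode.READ_ONLY: {"query_account", "query_case", "summarize_policy"},
    Mode.DRAFT_ONLY: {"generate_draft"},
    Mode.TEMPLATE_ONLY: {"send_approved_template", "serve_static_faq"},
    Mode.MANUAL_FIRST: {"enqueue", "summarize", "retrieve"},
    Mode.SAFE_STOP: set(),
}


def is_permitted(mode: Mode, action: str) -> bool:
    """Deny by default: an action passes only if the mode lists it."""
    return action in ALLOWED[mode]


print(is_permitted(Mode.READ_ONLY, "query_case"))   # True
print(is_permitted(Mode.READ_ONLY, "write_crm"))    # False
```

An allow-list is used here rather than a deny-list so that a newly added tool action stays blocked in every degraded mode until someone explicitly approves it.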

6. Financial Retail Case

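Scenario: A retail bank's customer service AI, credit policy assistant, AML summarizer, and complaint response agent share an LLM provider, RAG platform, policy engine, identity service, HITL queue, and evidence export API. On Monday morning, vendor model latency rises, evidence export becomes unstable, and the internal RAG refresh is delayed.

Wrong approach: waiting for the vendor to recover, continuing normal traffic, leaving frontline staff to judge for themselves, backfilling logs after the incident, and treating it as a purely technical outage.

Correct degradation:

Workflow	Degraded decision
Customer service fee / entitlement questions	template-only + citation-required; no answers on personalized exceptions
Credit policy assistant	draft-only; adverse action / eligibility reasons confirmed by a human
AML summarizer	extractive summary only; no disposition recommendations
Complaint response agent	no auto-send; all customer-visible content goes to high-priority HITL
Tool write actions	all switched to read-only, with the manual action path kept open
Evidence	local append-only evidence ledger, synced once vendor export recovers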

Decision rights:

Decision	Owner
Trigger domain degrade	Business continuity lead + AI incident commander
Stop customer-visible free generation	AI product owner + compliance
Stop write tools	Platform owner + security
Accept HITL backlog	Business owner + operations executive
Restore automation	Joint approval by AI governance / model risk / business owner
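
Decision rights only help if they can be checked at the moment someone tries to act. The sketch below encodes the table as data, under the assumption that every listed owner must approve; the role and decision identifiers are hypothetical.

```python
# Assumption: "+" and "/" in the Owner column both mean joint approval.
DECISION_OWNERS = {
    "trigger_domain_degrade": {"business_continuity_lead", "ai_incident_commander"},
    "stop_customer_visible_free_generation": {"ai_product_owner", "compliance"},
    "stop_write_tools": {"platform_owner", "security"},
    "accept_hitl_backlog": {"business_owner", "operations_executive"},
    "restore_automation": {"ai_governance", "model_risk", "business_owner"},
}


def missing_approvals(decision: str, approvals: list) -> list:
    """Return the owners who have not yet approved, in sorted order."""
    return sorted(DECISION_OWNERS[decision] - set(approvals))


print(missing_approvals("restore_automation", ["model_risk", "business_owner"]))
# ['ai_governance']
```

An empty result means the decision may proceed. Recording the approvals list together with a timestamp would also give the evidence ledger a record of who authorized each mode change.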

7. PM / BA / Architect Checklist

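Every AI use case maps to a critical operation, core business line, customer journey, and regulatory obligation.
Every critical AI workflow has a minimum viable service definition.
Every dependency has normal / degraded / safe-stop behavior.
Every degraded mode has a trigger, owner, customer impact, evidence requirement, and recovery condition.
RTO/RPO/SLO cover the model, RAG, tools, identity, policy, HITL, and evidence, not just uptime.
Manual fallback has staffing, queues, SLAs, scripts, permissions, approval records, and a surge trigger.
Customer communication distinguishes outage, reduced service, manual review, delay, and correction.
Evidence preservation does not depend entirely on the vendor portal.
Tabletop exercises cover the model vendor, RAG, identity, policy engine, HITL, and evidence stack.
Automation is restored only after regression eval, sample review, evidence completeness, and business signoff.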
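
The last checklist item describes a recovery gate with four conditions. A minimal sketch of that gate follows, assuming each condition has already been reduced to a boolean by some upstream process; the class and field names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class RecoveryEvidence:
    """One boolean per gate condition named in the checklist."""
    regression_eval_passed: bool
    sample_review_passed: bool
    evidence_complete: bool
    business_signoff: bool


def recovery_blockers(evidence: RecoveryEvidence) -> list:
    """Return the names of unmet conditions, in field order."""
    return [name for name, ok in asdict(evidence).items() if not ok]


def may_restore_automation(evidence: RecoveryEvidence) -> bool:
    """The gate opens only when no blocker remains."""
    return not recovery_blockers(evidence)


status = RecoveryEvidence(True, True, False, True)
print(recovery_blockers(status))        # ['evidence_complete']
print(may_restore_automation(status))   # False
```

Returning the list of blockers rather than a bare boolean makes the gate explainable: the incident record can state exactly why automation stayed off.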

8. Code-Lite Experiment

Use a small dependency router to practice architectural thinking:

workflow: complaint_response_agent
risk_tier: tier_1
customer_visible: true
dependencies:
  model_provider: degraded_latency
  rag_index: stale_6h
  policy_engine: healthy
  hitl_queue: saturated
  evidence_export: degraded
result:
  ai_mode: draft_only_template
  tool_mode: read_only
  human_queue: high_priority_complaint
  recovery_gate: evidence_sync_plus_sample_review

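Exercise questions:

Why does a degraded evidence_export push a customer-visible workflow down to draft-only?
Why can a stale RAG index no longer support free generation of fee, complaint, or credit explanations?
When HITL is saturated, how should the queue be ordered by regulatory deadline and customer harm?
If the policy engine is also degraded, which intents should safe-stop?
Which evals, sample reviews, and evidence are needed before returning to normal mode?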
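
One way to explore these questions is to write the router as code and vary the inputs. The sketch below reproduces the ai_mode, tool_mode, and recovery_gate values of the scenario under three rules that are assumptions made for this exercise, not rules stated by any framework. It does not model human_queue, and it ignores risk_tier and the policy engine, which is healthy in the scenario.

```python
def decide(workflow: dict) -> dict:
    """Derive a degraded-mode decision from dependency states (assumed rules)."""
    deps = workflow["dependencies"]
    degraded = {name for name, state in deps.items() if state != "healthy"}

    restrictions = []
    # Assumed rule 1: customer-visible output is not auto-sent while evidence
    # capture is unreliable, so the workflow drops to draft-only.
    if workflow["customer_visible"] and "evidence_export" in degraded:
        restrictions.append("draft_only")
    # Assumed rule 2: a stale index rules out free generation, so only
    # pre-approved templates may be used.
    if "rag_index" in degraded:
        restrictions.append("template")
    ai_mode = "_".join(restrictions) or "normal"

    # Assumed rule 3: any degraded dependency disables write actions.
    tool_mode = "read_only" if degraded else "read_write"

    gates = []
    if "evidence_export" in degraded:
        gates.append("evidence_sync")
    if degraded:
        gates.append("sample_review")

    return {
        "ai_mode": ai_mode,
        "tool_mode": tool_mode,
        "recovery_gate": "_plus_".join(gates) or "none",
    }


scenario = {
    "workflow": "complaint_response_agent",
    "risk_tier": "tier_1",
    "customer_visible": True,
    "dependencies": {
        "model_provider": "degraded_latency",
        "rag_index": "stale_6h",
        "policy_engine": "healthy",
        "hitl_queue": "saturated",
        "evidence_export": "degraded",
    },
}
print(decide(scenario))
```

Changing one dependency at a time, for example setting evidence_export back to healthy, shows which rule is responsible for each part of the result.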

9. Interview Questions

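Question	30-second answer
What is the difference between AI BCP and an AI incident postmortem?	A postmortem is learning after the incident; BCP / degraded mode is design before it. AI BCP first defines critical operations, impact tolerance, RTO/RPO, dependency failure, manual fallback, decision rights, customer communication, and recovery exercises.
How do you design a degraded mode for customer-visible AI?	Step capability down by risk from normal to citation-required, template-only, draft-only, manual-first, or safe-stop. Content about customer entitlements, fees, complaints, credit, and investments must not be freely generated while RAG, policy, or evidence is incomplete.
What is the current nuance of SR 26-2 for GenAI / agentic AI?	SR 26-2 is new model risk management guidance issued in 2026 that supersedes SR 11-7 / SR 21-8, but its current scope explicitly excludes generative AI and agentic AI models. In practice, borrow its governance thinking, but do not treat it as a GenAI compliance checklist.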

10. Pitfalls

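Pitfall	Why it is dangerous	Better practice
Uptime-only DR	The AI may return HTTP 200 while its semantics and evidence have failed	Define semantic, policy, and evidence degraded modes
Preparing only a model fallback	RAG, identity, policy, HITL, and evidence fail too	Build a dependency-by-dependency degraded matrix
Treating humans as an unlimited fallback	Review queues saturate and reviewer skills are limited	Build a capacity model, prioritization, and surge staffing
Continuing to write through tools while degraded	Model or policy instability is amplified into business actions	read-only / draft-only / dual control
Evidence depends on the vendor	Traces may not be exportable during a vendor incident	Local immutable evidence ledger
Recovery judged only by green service status	Service recovery does not mean AI behavior has recovered	Recovery gate includes eval, sample review, policy check, and evidence completeness
No exercises	Fallbacks that exist only on paper are often not executable	Tabletop, technical, business, and management decision exercises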
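
The capacity model mentioned for the human fallback pitfall can start as simple arithmetic. The sketch below assumes a constant arrival rate and a constant per-reviewer throughput, both of which are simplifications, and all numbers are illustrative.

```python
import math


def backlog_after(hours: int, arrivals_per_hour: int, reviewers: int,
                  items_per_reviewer_hour: int, start_backlog: int = 0) -> int:
    """Project the review backlog hour by hour; it never drops below zero."""
    capacity = reviewers * items_per_reviewer_hour
    backlog = start_backlog
    for _ in range(hours):
        backlog = max(0, backlog + arrivals_per_hour - capacity)
    return backlog


def reviewers_needed(arrivals_per_hour: int, items_per_reviewer_hour: int) -> int:
    """Smallest team that keeps the backlog from growing."""
    return math.ceil(arrivals_per_hour / items_per_reviewer_hour)


# Illustrative: 120 items/hour arrive, 5 reviewers handle 15 items/hour each.
print(backlog_after(4, 120, 5, 15))   # 180
print(reviewers_needed(120, 15))      # 8
```

Comparing the projected backlog with the accepted backlog limit gives a concrete surge trigger: in this example, staffing must rise from 5 to 8 reviewers, or incoming volume must be reduced, before the queue stabilizes.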