AI Operational Resilience / BCP / Degraded Mode Playbook
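Positioning: Written for CBAP+ practitioners, retail financial services AI PMs, AI Product Architects, Enterprise Architects, and leads in Business Continuity, Operational Resilience, Model Risk, Third-Party Risk, Cyber, Compliance, Internal Audit and frontline operations. This is not a basic BA process document; it trains you to turn customer-facing and regulated AI workflows into a production capability that can degrade, recover, be evidenced and be exercised.
Core thesis:
AI operational resilience is not “model availability”. It is whether critical operations can keep running safely within predefined boundaries when the model, RAG, tools, identity, policy engine, eval, HITL, vendor or evidence stack degrades.
This playbook is deliberately distinct from an incident postmortem:
- An incident postmortem is learning after the fact.
- Operational resilience / BCP is design before the fact.
- Degraded mode is the controlled state that keeps critical services running during an incident.
- A recovery exercise is the mechanism that proves these designs are actually executable.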
1. Executive Framing
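The key risk for retail financial services AI comes not only from wrong answers, but also from critical operations having no executable degradation plan under stress.
Once customer service AI, credit policy assistants, AML case summarizers, fraud review copilots, complaint response agents, wealth advisor copilots, branch employee assistants and regulatory change assistants go into production, they gradually become part of business operations. They may not make the final decision directly, but they influence:
- The explanations customers receive.
- The evidence employees see.
- Case priority.
- The pace of complaint handling.
- Explanations in credit or fraud review.
- AML / KYC investigation materials.
- The evidence that management and regulatory examinations require.
Business-as-usual AI governance asks:
- Is this use case approved for go-live?
- Has the model been validated?
- Has the prompt been tested?
- Does RAG provide citations?
- Are tools subject to approval?
- Is human review in place?
BCP / operational resilience goes on to ask:
- If the model provider is down to 20% capacity, how do critical processes run?
- If the RAG index is 12 hours stale, which questions must be refused or answered from templates?
- If the identity service cannot return entitlements, can the AI still personalize?
- If the policy engine produces a false allow, which tool actions must stop?
- If the HITL queue is backlogged by 8 hours, which cases come first?
- If the evidence export API fails, which outputs must go to the local evidence ledger?
- If a vendor degrades both the runtime path and the evidence path at once, who can accept the residual risk?
The higher-order goal of operational resilience is not to make AI never fail, but to keep failures within impact tolerance:
- Customers are not misled.
- Regulated processes do not run out of control.
- Critical services are not interrupted in a disorderly way.
- Manual queues are not flooded by low-risk tasks.
- Evidence is not lost.
- Recovery is gated.
- Management knows who is making which decision.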
2. Source Anchors
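The sources below establish the regulatory, governance and architectural control vocabulary. This document is learning, portfolio and architecture training material; it is not legal advice, regulatory interpretation, an audit conclusion or a certification recommendation. Real projects must be reviewed by legal, compliance, model risk, operational risk, technology, business owners, third-party risk, privacy, security and internal audit against institutional policy and the applicable jurisdiction.
| Anchor | Official link | How this document uses it |
|---|---|---|
| FFIEC Business Continuity Management booklet | https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx | Uses business impact analysis, risk assessment, critical operations, dependencies, testing, training, exercises and board / senior management oversight to structure AI BCP. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to structure AI risk identification, risk measurement, control selection, continuous monitoring and management reporting. |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Uses AI management system language to connect policy, roles and responsibilities, operational controls, performance evaluation, management review and continual improvement. |
| Federal Reserve SR 26-2 | https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm | New 2026 model risk management guidance that supersedes SR 11-7 and SR 21-8; its current scope excludes generative AI and agentic AI models, but its thinking on risk tiering, validation, governance, monitoring and change control can serve as a design reference for controls on non-covered AI. |
| Federal Reserve SR 20-24 / Interagency Sound Practices for Operational Resilience | https://www.federalreserve.gov/supervisionreg/srletters/SR2024.htm | Uses operational resilience, critical operations, core business lines, impact tolerance, third-party dependencies, scenario testing and continuous improvement to structure AI service continuity. |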
2.1 2026 SR 26-2 Nuance
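The current nuance of SR 26-2 matters to AI architects:
- It was issued on 2026-04-17.
- It supersedes SR 11-7 and SR 21-8.
- It adopts more risk-tiered model risk management language.
- It currently does not cover generative AI or agentic AI models.
- It should not be misread as the complete regulatory requirement for GenAI / agentic AI.
- For retail financial services AI BCP, it can still supply structured thinking on governance, validation, ongoing monitoring, change control and risk acceptance.
How to say it in practice:
I explicitly separate “does SR 26-2 apply directly” from “can SR 26-2's governance principles be borrowed”. BCP design for customer-visible or regulated GenAI / agentic workflows needs to layer NIST AI RMF, ISO/IEC 42001, FFIEC BCM, operational resilience guidance and the institution's internal model / AI / third-party / cyber / business continuity policies.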
3. Critical Operation Mapping
AI BCP cannot start from a model inventory; it has to start from critical operations.
3.1 Critical Operation Definition
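In retail financial services, a critical operation is a business capability or process whose interruption, severe degradation or missing evidence could affect:
- Customer funds, accounts, entitlements, fees, appeals or legal rights.
- Continued operation of core business lines.
- Regulated obligations such as AML / KYC / fraud / credit / complaint / privacy.
- The institution's safety and soundness.
- The demonstrability that management, audit or regulatory examinations require.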
3.2 AI-Enabled Critical Operation Inventory
| Critical operation | AI workflow | Customer / regulatory impact | Minimum viable service |
|---|---|---|---|
| Customer service for fees, disputes, complaints | Customer-facing AI or agent-assist | Wrong fees, wrong commitments, misleading customers about complaint rights | Pre-approved templates + human escalation + authoritative policy citation |
| Credit underwriting support | Lending policy assistant / document summarizer | Credit reasons, eligibility, exception policy, adverse action explanations | Summary draft + human confirmation + reason-code consistency |
| AML investigation | AML case summarizer / SAR narrative drafter | Investigation quality, SAR narrative, regulatory examination evidence | Extractive summary + analyst review + source trace |
| Fraud alert triage | Fraud copilot | Account freezes, transaction blocks, customer friction | Read-only evidence aggregation + human disposition |
| Complaint handling | Complaint response agent | Statutory response deadlines, customer remediation, brand and regulatory risk | Draft-only + deadline priority queue |
| Wealth advice support | Advisor copilot | Investment advice, suitability, disclosure, customer profile | Internal-only summary + required advisor review |
| Branch operations support | Employee knowledge assistant | Branch service consistency, customer identity, process compliance | Template + SOP search + no personalization if entitlement degraded |
| Regulatory change management | Regulatory summarizer | Policy updates, control updates, audit evidence | Manual legal review + traceable summary |
3.3 Impact Tolerance
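Impact tolerance is not a technical SLO. It answers: under a severe but plausible stress scenario, how much interruption, backlog, error and evidence gap can the institution withstand at most?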
| Dimension | Example tolerance |
|---|---|
| Customer-visible misinformation | Tier 1 workflows: zero known unsupported commitments after detection |
| Manual backlog | Complaint high-risk queue: no regulatory deadline breach |
| Evidence gap | Tier 1 AI outputs: no unrecoverable trace gap; local capture allowed for delayed export |
| Tool side effect | No high-risk write action without approval when policy or identity is degraded |
| Policy freshness | Customer fee / credit / complaint answers cannot use superseded policy |
| Recovery approval | Normal automation restarts only after regression, evidence sync and business signoff |
4. AI Dependency Graph
AI BCP needs a drawn dependency graph, not just a written system inventory.
4.1 Dependency Types
| Dependency | Failure examples | Resilience question |
|---|---|---|
| Model provider | outage, latency, model rollback, rate limit, quality drift | Is there an approved fallback model, template path or extractive baseline? |
| RAG corpus | stale policy, source removal, document conflict | Can it switch to authoritative search, templates or citation-only? |
| Vector index | rebuild failure, ACL filter error, metadata corruption | Is there an index manifest, a previous version and a rollback? |
| Tool gateway | CRM / LOS / AML tool degraded, side effect risk | Can it switch to read-only, draft-only or manual action? |
| Identity / entitlement | SSO degraded, role mapping missing, consent unavailable | Can personalization and sensitive data access be disabled? |
| Policy engine | false allow, false deny, latency, unavailable | Is there a safe default, a policy snapshot and manual decision rights? |
| Eval pipeline | judge unavailable, golden set runner failed | Are high-risk releases and recovery frozen? |
| HITL queue | backlog, skill mismatch, reviewer outage | Are there risk priorities, a surge team and capacity triggers? |
| Vendor support | delayed notice, portal down, evidence export failure | Is there contract escalation and local evidence capture? |
| Evidence store | trace loss, export lag, retention failure | Can the fields needed for recovery be saved locally, append-only? |
4.2 Graph View
Critical operation
-> AI workflow
-> customer / employee channel
-> model route
-> prompt / policy bundle
-> RAG source and index
-> tool gateway
-> identity and entitlement
-> policy engine
-> HITL queue
-> evidence ledger
-> vendor and cloud dependencies
-> recovery gate
4.3 Dependency Edge Attributes
| Attribute | Why it matters |
|---|---|
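| Synchronous vs asynchronous | A failed synchronous dependency interrupts the customer journey immediately |
| Read vs write | A failed or misused write dependency causes real business impact |
| Customer-visible vs internal | Customer-visible output needs stricter degradation |
| Regulatory deadline | Complaint, AML, KYC and credit processes have time constraints |
| Evidence criticality | Missing evidence impairs post-incident review, audit and regulatory response |
| Substitutability | The harder the substitution, the earlier BCP must be exercised |
| Manual capacity | Whether manual fallback is actually executable |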
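The edge attributes above can be carried on the graph itself so that exercise planning is driven by data rather than memory. The following is a minimal sketch, not a prescribed implementation: the `DependencyEdge` class, the `exercise_priority` scoring (one point per attribute that makes a failure harder to absorb) and the sample edges are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyEdge:
    """One edge of the AI dependency graph with the attributes from 4.3."""
    source: str
    target: str
    synchronous: bool
    write: bool
    customer_visible: bool
    regulatory_deadline: bool
    evidence_critical: bool
    substitutable: bool
    manual_capacity_tested: bool


def exercise_priority(edge: DependencyEdge) -> int:
    """Assumed scoring: count the attributes that make failure harder to absorb."""
    return sum([
        edge.synchronous,
        edge.write,
        edge.customer_visible,
        edge.regulatory_deadline,
        edge.evidence_critical,
        not edge.substitutable,
        not edge.manual_capacity_tested,
    ])


# Hypothetical edges for illustration only.
edges = [
    DependencyEdge("complaint_agent", "model_route", synchronous=True, write=False,
                   customer_visible=True, regulatory_deadline=True, evidence_critical=True,
                   substitutable=True, manual_capacity_tested=True),
    DependencyEdge("complaint_agent", "tool_gateway", synchronous=True, write=True,
                   customer_visible=True, regulatory_deadline=True, evidence_critical=True,
                   substitutable=False, manual_capacity_tested=False),
    DependencyEdge("employee_assistant", "rag_index", synchronous=True, write=False,
                   customer_visible=False, regulatory_deadline=False, evidence_critical=False,
                   substitutable=True, manual_capacity_tested=True),
]

# Edges with the highest score are candidates for the earliest exercises.
ranked = sorted(edges, key=exercise_priority, reverse=True)
for e in ranked:
    print(f"{e.source} -> {e.target}: score {exercise_priority(e)}")
```

An equal-weight count is the simplest possible ranking; a real program would likely weight write and customer-visible edges more heavily.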
5. Degraded Mode Taxonomy
Degraded mode is a pre-approved operating state. It is not an ad hoc “turn a few features off”; it has a trigger, allowed behavior, blocked behavior, an owner, evidence and a recovery condition.
5.1 Taxonomy Matrix
| Mode | Allowed behavior | Blocked behavior | Typical trigger |
|---|---|---|---|
| Normal | Full approved AI workflow within policy | Out-of-bound use | All dependencies healthy |
| Conservative generation | Short answer, citation, lower temperature, no speculation | Broad reasoning, unsupported advice | Model quality KRI yellow |
| Citation-required | Answer only with authoritative sources | Unsourced conclusion | RAG confidence degraded |
| Template-only | Pre-approved static language | Free-form customer commitments | Customer-visible regulated topic |
| Draft-only | Employee sees draft and must approve | Auto-send or auto-submit | Evidence, policy, or model not fully healthy |
| Read-only | Retrieve and summarize | Any write, notification, case closure, credit action | Tool, identity, or policy degraded |
| Manual-first | AI supports queueing and retrieval only | AI recommendation or decision framing | HITL or model risk high |
| Cache-assisted | Recently validated low-risk answers | Personalized, time-sensitive, eligibility answers | Vendor outage with fresh cache |
| Safe-stop | Stop a workflow category | Continuing AI output | Multiple critical controls degraded |
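One way to make the matrix enforceable is to encode the blocked behaviors as a deny-list that the orchestration layer checks before every action. This is a sketch under stated assumptions: the `Mode` enum covers only a subset of the modes above, and the action names (`auto_send`, `close_case` and so on) are hypothetical labels, not identifiers from any real system.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    TEMPLATE_ONLY = "template-only"
    DRAFT_ONLY = "draft-only"
    READ_ONLY = "read-only"
    SAFE_STOP = "safe-stop"


# Blocked actions per mode, paraphrasing the "Blocked behavior" column.
# "*" is an assumed wildcard meaning every AI action is blocked.
BLOCKED = {
    Mode.NORMAL: set(),
    Mode.TEMPLATE_ONLY: {"free_form_answer"},
    Mode.DRAFT_ONLY: {"auto_send", "auto_submit"},
    Mode.READ_ONLY: {"write", "notify", "close_case", "credit_action"},
    Mode.SAFE_STOP: {"*"},
}


def is_allowed(mode: Mode, action: str) -> bool:
    """Return True only if the action is not blocked in the given mode."""
    blocked = BLOCKED[mode]
    return "*" not in blocked and action not in blocked


print(is_allowed(Mode.DRAFT_ONLY, "generate_draft"))
print(is_allowed(Mode.DRAFT_ONLY, "auto_send"))
```

A deny-list keeps the sketch short; an allow-list per mode would fail closed for actions nobody anticipated and may be the safer choice for Tier 1 workflows.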
5.2 Trigger Design
| Trigger | Degraded mode |
|---|---|
| Model p95 latency exceeds tolerance but quality normal | Conservative generation or fallback model |
| Model quality eval fails Tier 1 golden set | Draft-only or safe-stop for affected workflow |
| RAG index freshness breach for customer policy | Template-only or citation-required |
| Vector ACL uncertainty | Safe-stop for sensitive retrieval; no personalization |
| Tool gateway side-effect uncertainty | Read-only |
| Identity entitlement unavailable | No sensitive data, no personalized answer, manual verification |
| Policy engine unavailable | Deny high-risk action, allow low-risk template where policy snapshot valid |
| HITL queue > capacity threshold | Risk-prioritized manual-first, stop low-value escalations |
| Evidence export degraded | Draft-only for customer-visible Tier 1; local evidence ledger |
| Eval pipeline unavailable after high-risk change | Freeze release and freeze recovery to normal |
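When several triggers fire at once, the workflow needs a single resulting mode. The sketch below resolves to the most restrictive mode among the active triggers. Two things are assumptions rather than content of this playbook: the linear `RESTRICTIVENESS` ordering (the modes are not strictly comparable in practice) and the choice to treat an unknown trigger as safe-stop. Trigger names are hypothetical.

```python
# Assumed ordering from least to most restrictive; the playbook does not rank modes.
RESTRICTIVENESS = [
    "normal",
    "conservative",
    "citation-required",
    "template-only",
    "draft-only",
    "read-only",
    "manual-first",
    "safe-stop",
]

# Hypothetical trigger names mapped to one mode each, following the table above.
TRIGGER_TO_MODE = {
    "model_latency_breach": "conservative",
    "tier1_eval_fail": "draft-only",
    "rag_freshness_breach": "template-only",
    "vector_acl_uncertain": "safe-stop",
    "tool_side_effect_uncertain": "read-only",
    "hitl_over_capacity": "manual-first",
    "evidence_export_degraded": "draft-only",
}


def resolve_mode(active_triggers: list) -> str:
    """Pick the most restrictive mode implied by the active triggers."""
    if not active_triggers:
        return "normal"
    # Assumption: an unrecognized trigger fails closed to safe-stop.
    modes = [TRIGGER_TO_MODE.get(t, "safe-stop") for t in active_triggers]
    return max(modes, key=RESTRICTIVENESS.index)


print(resolve_mode(["model_latency_breach", "tool_side_effect_uncertain"]))
```

Keeping the mapping in data rather than in branching code makes it reviewable by the business owner and versionable alongside the policy bundle.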
6. RTO, RPO And AI SLOs
Traditional DR metrics must be translated for AI.
6.1 Definitions
| Metric | AI interpretation |
|---|---|
| RTO | Time to restore controlled service, not necessarily full automation |
| RPO | Acceptable loss window for trace, retrieved source IDs, prompts, policy decisions, tool calls and human review records |
| SLO | Service level objective for availability, latency, groundedness, policy compliance, HITL queue, evidence completeness and fallback success |
| SLI | Measured indicator such as p95 latency, citation support, unsupported answer rate, queue age or trace completeness |
| Error budget | Allowed degradation before forced mode change, not just downtime allowance |
6.2 Suggested AI Resilience Targets
| Workflow tier | RTO | RPO | Key SLO |
|---|---|---|---|
| Tier 1 customer-visible regulated | Restore controlled service within 30-60 minutes | No unrecoverable evidence gap | Evidence completeness, no unsupported commitment, queue deadline protection |
| Tier 1 employee decision support | Restore read-only or draft-only within 60 minutes | Trace gap under approved local capture window | Source trace, human approval, no high-risk auto action |
| Tier 2 customer service | Restore template or cached service within 2 hours | Recoverable transcript and source references | Template availability, escalation SLA |
| Tier 2 operations support | Restore internal knowledge search within 4 hours | Recoverable query and answer log | Search freshness, manual SOP access |
| Tier 3 productivity assistant | Restore when platform stable | Standard log retention | Cost and availability |
6.3 SLO Catalog
| SLO | Example threshold |
|---|---|
| Model route availability | Tier 1 approved route or fallback route available 99.5% during business hours |
| Grounded answer rate | Tier 1 customer-visible answers with citation support >= 99% |
| Policy freshness | Critical policy update indexed or template updated within approved window |
| Tool safety | 100% high-risk write actions have approval ID and idempotency key |
| HITL queue age | Tier 1 escalations under regulatory or customer harm threshold |
| Evidence completeness | Tier 1 traces include model, prompt, source IDs, policy decision, tool call, human action |
| Fallback activation time | Dependency health trigger to mode switch under defined minutes |
| Recovery gate quality | Restart sample review pass before automation resumes |
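The error budget idea from 6.1 can be made concrete with a small calculation: given an SLO target and a count of bad events, how much allowed degradation is left before a forced mode change. This is an illustrative sketch; the function name and the sample counts are hypothetical, and the 99% target is taken from the grounded answer rate row above.

```python
def error_budget_remaining(total: int, bad: int, slo_target: float) -> float:
    """Fraction of the error budget left; a negative value means it is overspent."""
    if total == 0:
        return 1.0  # nothing measured yet, budget untouched
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad == 0:
        # A 100% target (for example tool safety) has no budget to spend.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad


# Hypothetical window: 10,000 Tier 1 answers, 60 without citation support.
remaining = error_budget_remaining(total=10_000, bad=60, slo_target=0.99)
print(f"budget remaining: {remaining:.0%}")

# With 150 unsupported answers the budget is overspent, which would force a mode change.
overspent = error_budget_remaining(total=10_000, bad=150, slo_target=0.99)
print(f"mode change required: {overspent < 0}")
```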
7. Manual Fallback Design
Manual fallback is not “let humans handle it”. It is a designed operating model.
7.1 Manual Fallback Components
| Component | Design requirement |
|---|---|
| Scope | Which intents, products, segments, channels and regions enter manual mode |
| Queue | Dedicated queue with risk tier, deadline, customer harm and skill tags |
| Staffing | Named teams, surge pool, hours, handoff and fatigue controls |
| Decision aids | Static SOP, policy snapshots, templates, approved calculators, evidence viewer |
| Authority | Who can approve, reject, override, remediate and communicate |
| Evidence | Manual action record, reason, source, timestamp, approver and customer impact |
| SLA | Queue age, regulatory deadline, customer callback and internal escalation |
| Exit | Criteria to move from manual-first back to draft-only or normal |
7.2 Manual Fallback Flow
degradation trigger
-> freeze risky automation
-> classify active requests by risk and deadline
-> route Tier 1 cases to skilled reviewers
-> provide policy snapshot and evidence viewer
-> record manual decision and customer communication
-> monitor backlog and breach risk
-> recover automation only after gate approval
7.3 Queue Priority Rules
| Priority | Criteria |
|---|---|
| P0 | Potential customer harm, regulatory deadline, irreversible action, privacy exposure |
| P1 | Customer-visible complaint, fee, credit, fraud, KYC / AML, wealth suitability |
| P2 | Employee productivity, internal summary, non-deadline operations |
| P3 | Low-risk general information or training support |
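The priority rules can be applied mechanically when a degradation trigger reclassifies the active queue. The sketch below assigns P0 to P3 from case tags and then orders the queue by priority and deadline. The tag vocabulary and the `Case` class are hypothetical; a real queue would draw these from the case system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Case:
    case_id: str
    tags: set
    deadline: datetime


# Hypothetical tag names paraphrasing the criteria column above.
P0_TAGS = {"customer_harm", "regulatory_deadline", "irreversible_action", "privacy_exposure"}
P1_TAGS = {"complaint", "fee", "credit", "fraud", "kyc_aml", "wealth_suitability"}
P2_TAGS = {"employee_productivity", "internal_summary", "non_deadline_ops"}


def priority(case: Case) -> int:
    """Return 0 for P0 through 3 for P3; the highest matching tier wins."""
    if case.tags & P0_TAGS:
        return 0
    if case.tags & P1_TAGS:
        return 1
    if case.tags & P2_TAGS:
        return 2
    return 3


def order_queue(cases: list) -> list:
    """Sort by priority first, then by earliest deadline within a priority."""
    return sorted(cases, key=lambda c: (priority(c), c.deadline))


queue = [
    Case("c1", {"internal_summary"}, datetime(2026, 1, 1, 9)),
    Case("c2", {"complaint"}, datetime(2026, 1, 1, 12)),
    Case("c3", {"privacy_exposure"}, datetime(2026, 1, 1, 15)),
    Case("c4", {"fee"}, datetime(2026, 1, 1, 10)),
]
print([c.case_id for c in order_queue(queue)])
```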
8. Customer Communication
Degraded AI service must be communicated carefully. The message should avoid over-disclosing technical internals while being accurate about service state, customer impact and next steps.
8.1 Communication Principles
| Principle | Meaning |
|---|---|
| Accuracy | Do not claim AI is working normally when high-risk paths are degraded |
| Boundaries | Make clear when information is general, draft, or pending human review |
| No false reassurance | Avoid saying there is no impact before evidence confirms it |
| Timeliness | Provide updates before regulatory or service deadlines are missed |
| Consistency | Customer, branch, call center, complaint and digital channels use aligned language |
| Evidence | Keep the exact message version and affected population record |
8.2 Customer-Facing States
| State | Customer message posture |
|---|---|
| Normal | Standard disclosure and service response |
| Reduced automation | “Some requests are being reviewed by a specialist before completion.” |
| Manual review | “We are reviewing this manually and will respond by the stated timeframe.” |
| Delayed service | “This request may take longer than usual; urgent account or safety concerns can use [approved channel].” |
| Correction | “We are correcting information previously provided and will explain any customer-specific impact.” |
8.3 Regulated Workflow Messaging
| Workflow | Communication control |
|---|---|
| Complaint | Preserve deadline, avoid premature conclusion, route to complaint specialist |
| Credit | Do not generate eligibility, pricing or adverse-action reasons without approved process |
| AML / Fraud | Do not expose investigation logic; communicate operationally approved next steps |
| Wealth | Advisor owns client communication; AI output remains internal support |
| Fees / disputes | Use approved fee, dispute and rights language only |
9. Evidence Preservation
Evidence is part of resilience. If the organization cannot reconstruct what the AI saw, decided, retrieved, proposed, blocked, escalated and communicated, recovery is incomplete.
9.1 Evidence Fields
| Field | Why it matters |
|---|---|
| Request ID / trace ID | Links channels, AI, tools and case systems |
| Use case and risk tier | Drives severity and retention |
| Customer / case reference | Identifies affected population |
| Model provider, model ID, version | Supports model route reconstruction |
| Prompt / policy bundle version | Explains behavior boundary |
| Retrieved source IDs and index version | Validates grounding and policy freshness |
| Policy engine decision ID | Shows allow, deny, escalate or fallback |
| Tool calls and side effects | Identifies real-world impact |
| HITL action and reviewer role | Shows oversight |
| Degraded mode state | Proves system was operating under approved constraints |
| Customer-visible message version | Supports remediation and complaint response |
| Recovery gate result | Shows why automation restarted |
9.2 Local Evidence Ledger
When vendor evidence export is degraded, Tier 1 workflows should write a local append-only record:
trace_id
timestamp
workflow
risk_tier
mode
model_route
prompt_version
source_ids
policy_decision
tool_action
human_review
customer_visible_flag
fallback_reason
hash_of_output
This ledger should be:
- append-only.
- access-controlled.
- time-synchronized.
- retained under approved policy.
- reconciled with vendor export after recovery.
- included in audit evidence binder.
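A minimal sketch of such a ledger follows. It keeps records in memory and links each record to the previous one with a SHA-256 hash so that later edits are detectable. The hash chain, the `LocalEvidenceLedger` class and the `prev_hash` / `record_hash` fields are assumptions added for illustration; the field list above does not require them, and durable storage, access control and retention are out of scope here.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # assumed placeholder hash for the first record


class LocalEvidenceLedger:
    """In-memory append-only ledger with a hash chain for tamper evidence."""

    def __init__(self):
        self._records = []

    def append(self, fields: dict, output_text: str) -> dict:
        """Add one record; fields must be JSON-serializable."""
        record = dict(fields)
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
        record["hash_of_output"] = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
        record["prev_hash"] = self._records[-1]["record_hash"] if self._records else GENESIS
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        record["record_hash"] = hashlib.sha256(payload).hexdigest()
        self._records.append(record)
        return dict(record)  # return a copy so callers cannot edit the ledger

    def verify(self) -> bool:
        """Recompute every hash and check each link to the previous record."""
        prev = GENESIS
        for r in self._records:
            body = {k: v for k, v in r.items() if k != "record_hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if r["prev_hash"] != prev or hashlib.sha256(payload).hexdigest() != r["record_hash"]:
                return False
            prev = r["record_hash"]
        return True


ledger = LocalEvidenceLedger()
first = ledger.append(
    {"trace_id": "t-1", "workflow": "complaint", "risk_tier": 1, "mode": "draft-only",
     "source_ids": ["pol-7"], "customer_visible_flag": True,
     "fallback_reason": "evidence_export_degraded"},
    output_text="Draft reply pending analyst review.",
)
print(ledger.verify())
```

Storing only `hash_of_output` keeps customer text out of the ledger; reconciliation after recovery then compares these hashes with the vendor export.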
10. BCP / DR Operating Model
10.1 Decision Rights
| Decision | Accountable role |
|---|---|
| Declare AI operational degradation | Incident Commander / Operational Resilience Lead |
| Activate domain degraded mode | Business Owner + AI Product Owner |
| Disable model route | AI Platform Owner |
| Disable write tools | Security / Platform Owner |
| Freeze release or recovery | Model Risk / AI Governance |
| Prioritize HITL queue | Operations Executive |
| Approve customer communication | Business Owner + Legal / Compliance |
| Accept residual risk during extended degradation | Senior management risk committee |
| Restore normal automation | Business Owner + Platform + Model Risk + Compliance |
10.2 Recovery Gate
Normal mode must not resume just because the vendor status page turns green.
Recovery gate should include:
- dependency health stable for defined window.
- no unresolved critical KRI breach.
- regression eval pass for affected workflows.
- evidence sync or approved local ledger reconciliation.
- sampled human review of degraded-period outputs.
- tool side-effect reconciliation.
- HITL queue within tolerance or approved backlog plan.
- business owner signoff.
- model risk / compliance signoff for Tier 1.
- restart decision recorded.
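The gate conditions above can be expressed as an explicit checklist that must be fully satisfied before automation resumes. The sketch below returns both the decision and the list of unmet checks so the restart memo can record them. The check names and the `evaluate_recovery_gate` function are hypothetical; a missing result is treated as a failed check, which is an assumed fail-closed choice.

```python
# Hypothetical check names mirroring the recovery gate list above.
REQUIRED_CHECKS = [
    "dependency_health_stable",
    "no_critical_kri_breach",
    "regression_eval_passed",
    "evidence_reconciled",
    "sample_review_passed",
    "tool_side_effects_reconciled",
    "hitl_queue_within_tolerance",
    "business_owner_signoff",
    "restart_decision_recorded",
]
TIER1_CHECKS = ["model_risk_compliance_signoff"]


def evaluate_recovery_gate(results: dict, risk_tier: int) -> tuple:
    """Return (approved, unmet_checks); absent results count as not passed."""
    required = REQUIRED_CHECKS + (TIER1_CHECKS if risk_tier == 1 else [])
    unmet = [name for name in required if not results.get(name, False)]
    return (not unmet, unmet)


# Example: everything passed except the Tier 1 signoff.
results = {name: True for name in REQUIRED_CHECKS}
approved, unmet = evaluate_recovery_gate(results, risk_tier=1)
print(approved, unmet)
```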
11. Tabletop Exercises
Exercises prove whether BCP is executable.
11.1 Exercise Types
| Exercise | Purpose |
|---|---|
| Tabletop | Test decision rights, communication and tradeoffs |
| Technical failover drill | Test routing, fallback model, index rollback, tool disable |
| Evidence drill | Test local ledger, export reconciliation, audit pack |
| HITL surge drill | Test manual queue capacity and prioritization |
| Vendor exit tabletop | Test contract, data export, replacement route and executive decisions |
| Full scenario simulation | Combine model, RAG, policy, HITL and evidence degradation |
11.2 Scenario 1: Model Provider Outage
At 09:15, Model Provider A reports elevated errors and rate limiting.
Customer service AI, lending assistant and complaint agent use Provider A.
RAG and tools are healthy.
Evidence export remains healthy.
Expected decisions:
- Switch Tier 2 general service to fallback model or template.
- Move Tier 1 complaint responses to draft-only.
- Keep credit reason explanations under human confirmation.
- Monitor fallback quality and latency.
- Notify operations of reduced automation.
11.3 Scenario 2: RAG Index Stale
Policy team updated fee waiver and dispute policies at 07:00.
RAG index refresh failed silently.
Customer-facing AI continues retrieving yesterday's policy.
Expected decisions:
- Disable free-form fee / dispute answers.
- Use approved templates and authoritative policy search.
- Query affected customer-visible answers since 07:00.
- Preserve retrieved source IDs and policy effective dates.
- Recover only after index validation and sample review.
11.4 Exercise Scoring
| Dimension | Pass criteria |
|---|---|
| Detection | Trigger recognized within target time |
| Decision rights | Correct owner makes decision |
| Mode switch | System enters approved degraded state |
| Customer protection | High-risk customer-visible output blocked or reviewed |
| Manual fallback | Queue routing and staffing work |
| Evidence | Required fields captured |
| Communication | Internal and external messages aligned |
| Recovery | Restart gate used before normal mode |
| Improvement | Gaps become funded actions |
12. RACI
| Activity | Accountable | Responsible | Consulted | Informed |
|---|---|---|---|---|
| Critical operation mapping | Business Executive | AI PM / Senior BA | Enterprise Architecture, Risk, Compliance | Operations, Audit |
| AI dependency graph | Enterprise Architect | AI Platform + Product Architect | Security, Data, Vendor Owner | Business Owners |
| Impact tolerance approval | Senior Management / Risk Committee | Operational Resilience Lead | Legal, Compliance, Model Risk | Board Risk Committee as appropriate |
| Degraded mode design | AI Product Owner | Product Architect + Platform Owner | Business Ops, Compliance, Security | Customer Ops |
| Manual fallback plan | Operations Executive | Queue Owners / Workforce Planning | Business Owner, Compliance | Frontline Teams |
| Evidence preservation design | Governance / Audit Evidence Owner | Platform Engineering | Legal, Privacy, Security | Internal Audit |
| Tabletop exercise | Operational Resilience Lead | BCP Team + AI Platform | Vendor, Legal, Business, Risk | Senior Management |
| Trigger monitoring | Platform Owner | SRE / AI Ops | Product, Model Risk | Business Ops |
| Customer communication | Business Owner | Customer Ops / Communications | Legal, Compliance | Frontline Teams |
| Recovery approval | Business Owner + Model Risk | Platform + Product | Compliance, Security, Operations | Senior Management |
13. Templates
13.1 AI BCP Use Case Card
| Field | Example |
|---|---|
| Use case | Complaint response agent |
| Critical operation | Complaint handling |
| Risk tier | Tier 1 |
| Customer-visible | Yes |
| Regulatory exposure | Complaint deadlines, consumer protection, recordkeeping |
| Normal AI behavior | Draft and send approved responses after policy and human checks |
| Minimum viable service | Draft-only, manual review, deadline priority |
| Degraded modes | Template-only, draft-only, read-only, manual-first, safe-stop |
| RTO | Controlled service within 60 minutes |
| RPO | No unrecoverable trace gap |
| Recovery gate | Eval, evidence sync, sample review, business signoff |
13.2 Degraded Mode Decision Card
| Field | Filled example |
|---|---|
| Trigger | Evidence export API degraded for Tier 1 complaint workflow |
| Mode | Draft-only + local evidence ledger |
| Allowed | Generate internal draft from approved template, route to complaint analyst |
| Blocked | Auto-send, free-form promise, CRM closure |
| Decision owner | AI Incident Commander + Complaint Business Owner |
| Evidence required | Trace ID, prompt version, source IDs, policy decision, output hash |
| Customer impact | Response may require manual review; deadlines protected |
| Recovery condition | Vendor export restored, local ledger reconciled, 30-sample review passed |
13.3 Recovery Decision Memo
| Field | Filled example |
|---|---|
| Workflow | Complaint response agent |
| Incident window | 2026-06-30 09:15-13:40 CT |
| Degraded mode | Draft-only + local evidence ledger |
| Customer impact | No auto-send; 42 cases manually reviewed |
| Evidence status | Local ledger reconciled; 0 missing Tier 1 traces |
| Regression status | Complaint policy golden set and citation support passed |
| Tool reconciliation | No automated CRM closure during degraded window |
| Decision | Restore approved complaint intents with 24-hour enhanced monitoring |
| Approvers | Business Owner, Platform Owner, Model Risk, Compliance |
13.4 Communication And Executive Brief Template
| Audience | Required content |
|---|---|
| Customer | Reduced automation, manual review, expected response path, urgent channel |
| Frontline | Current mode, allowed actions, blocked actions, queue and next update |
| Executive | Situation, impact, controls, decisions needed, recovery path, residual risk |
| Regulator / examiner | System scope, affected population, controls, evidence, remediation |
14. Governance Cadence
| Cadence | Forum | Decisions / outputs |
|---|---|---|
| Daily during degradation | AI operational bridge | Mode state, backlog, customer impact, evidence status, next decision |
| Weekly | AI operations review | SLO breaches, degraded-mode activations, queue health, vendor notices |
| Monthly | AI governance / model risk committee | KRI trends, recovery exercise gaps, risk acceptance, control effectiveness |
| Quarterly | Operational resilience review | Critical operation map, impact tolerance, scenario exercise, third-party dependency |
| Semiannual | Executive tabletop | Senior decision rights, customer communication, board / regulator brief |
| Annual | Full BCP / DR exercise | Technical failover, manual fallback, evidence recovery, vendor exit scenario |
| Event-driven | Major model, RAG, tool, policy, identity, evidence or vendor change | Impact assessment, regression scope, degraded mode update |
15. 30-Day Lab
Goal: complete a presentable AI Operational Resilience / BCP / Degraded Mode Architecture portfolio pack within 30 days. Recommended choices: Complaint response agent, Lending policy assistant, AML case summarizer or Customer service AI.
| Day | Task | Artifact |
|---|---|---|
| 1 | Pick one retail financial services AI workflow and define its customer-visible and regulated boundary | Use Case Boundary Card |
| 2 | Map the critical operation and core business line | Critical Operation Map |
| 3 | Write a business impact analysis covering customers, regulation, operations and evidence | AI BIA Worksheet |
| 4 | List model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies | Dependency Register |
| 5 | Draw the dependency graph and edge attributes | Dependency Graph |
| 6 | Define impact tolerance | Impact Tolerance Memo |
| 7 | Design the minimum viable service | Minimum Service Definition |
| 8 | Design the degraded mode taxonomy | Degraded Mode Matrix |
| 9 | Design fallback for a model provider outage | Model Fallback Runbook |
| 10 | Design fallback for RAG stale / ACL failure | RAG Degradation Runbook |
| 11 | Design fallback for tool gateway / side-effect failure | Tool Degradation Runbook |
| 12 | Design fallback for identity / entitlement failure | Identity Degradation Runbook |
| 13 | Design fallback for policy engine false allow / false deny | Policy Degradation Runbook |
| 14 | Design fallback for HITL queue saturation | HITL Surge Plan |
| 15 | Design fallback for evidence export failure | Evidence Preservation Plan |
| 16 | Define RTO / RPO / SLO / SLI | AI Resilience Metrics Table |
| 17 | Design the manual fallback queue and staffing model | Manual Operations Plan |
| 18 | Design the customer and internal communication matrix | Communication Matrix |
| 19 | Write a degraded mode decision card | Decision Card |
| 20 | Write the recovery gate and restart memo | Recovery Gate Pack |
| 21 | Design monitoring triggers | Trigger Dashboard Spec |
| 22 | Design RACI and decision rights | RACI |
| 23 | Design tabletop scenario 1: model outage | Scenario Script |
| 24 | Design tabletop scenario 2: RAG stale | Scenario Script |
| 25 | Design tabletop scenario 3: evidence failure | Scenario Script |
| 26 | Run one 90-minute tabletop and record the decision log | Exercise Decision Log |
| 27 | Turn exercise gaps into a remediation backlog | Remediation Register |
| 28 | Write an executive brief | Executive One-Pager |
| 29 | Write a 1,500-2,500 word portfolio case study | Case Study |
| 30 | Prepare 8 interview Q&As and a 5-minute narrative | Interview Story Pack |
16. Interview Answers
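Q1: How does AI operational resilience differ from ordinary BCP?
30 seconds:
Ordinary BCP focuses on service interruption, site failover, and recovery of people and processes. AI operational resilience must also cover semantic correctness, RAG freshness, policy decisions, tool side effects, identity entitlement, HITL capacity and evidence completeness. AI can be unsafe without being down, so it needs normal, degraded, safe-stop and recovery-gate states.
Q2: How do you design degraded mode for customer-visible AI?
30 seconds:
I step down by risk to citation-required, template-only, draft-only, manual-first or safe-stop. Content about fees, complaints, credit, wealth and customer rights must not be freely generated or auto-sent when RAG, policy, identity or evidence is incomplete. Before returning to normal, the workflow must pass eval, sample review, evidence sync and business signoff.
Q3: Is HITL a natural fallback?
30 seconds:
No. HITL is a control only when it has capacity, skills, prioritization, SLAs, evidence and decision authority. Otherwise it becomes the bottleneck during an incident. AI BCP must design queue tiering, surge staffing, deadline routing, the review payload and backlog thresholds.
Q4: Why does an evidence export failure trigger AI degradation?
30 seconds:
Because high-risk AI output must be reconstructable, auditable and explainable. Without evidence, there is no way after an incident to prove the model version, prompt, retrieved sources, policy decision, tool action and human review. When evidence export fails, Tier 1 customer-visible workflows should at minimum switch to draft-only and enable the local evidence ledger.
Q5: How do you handle a policy engine false allow?
30 seconds:
First switch high-risk write tools to read-only, query side effects within the exposure window, preserve the policy decision log and tool ledger, and then carry out customer and business remediation. Before recovery, the workflow must pass policy regression, approval-chain verification and tool idempotency checks.
Q6: How are RTO and RPO defined for AI?
30 seconds:
AI RTO is the time to restore controlled service, not necessarily full automation. AI RPO is the acceptable loss window for evidence, source, prompt, tool and human review records. A Tier 1 customer-visible process may require template or manual service to be restored within 30-60 minutes, with no unrecoverable evidence gap.
Q7: How do you design the gate for returning to normal mode?
30 seconds:
You cannot just watch the vendor status page turn green. The recovery gate looks at dependency health, regression eval, sample review, evidence reconciliation, tool side-effect reconciliation, HITL backlog, customer impact and business / risk / compliance signoff.
Q8: How do you explain the value of AI BCP to executives?
30 seconds:
I say AI BCP does not hold back automation; it protects critical operations. When an AI dependency fails, the institution needs to know which services continue, which degrade, which actions stop, who decides, how customers are told, how evidence is preserved and when to recover. That reduces customer harm, regulatory risk and chaotic outages.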
17. Self-Assessment Checklist
| Check | Passing evidence |
|---|---|
| Critical operation mapping | AI workflows linked to customer journeys, core business lines and regulatory obligations |
| Dependency graph | Model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies mapped |
| Impact tolerance | Business-approved tolerance for downtime, backlog, misinformation and evidence loss |
| Degraded mode taxonomy | Normal, conservative, citation-required, template-only, draft-only, read-only, manual-first, safe-stop defined |
| Trigger logic | Dependency health and risk-tier triggers mapped to mode switch |
| Manual fallback | Queue, staffing, templates, authority, SLA and evidence designed |
| Customer communication | Reduced automation, manual review, delay and correction states prepared |
| Evidence preservation | Local ledger and reconciliation design exists |
| Tabletop exercises | Model, RAG, identity, policy, HITL and evidence scenarios exercised |
| Recovery gate | Eval, evidence, tool reconciliation, queue, business and risk signoff required |
| Governance | RACI, cadence, risk acceptance and remediation tracking active |
18. Final Principle
The maturity of AI operational resilience can be tested with one sentence:
When AI is degraded, do we already know which customer promises stop, which operations continue, which humans take over, which evidence is preserved, which executives decide, and what proof is required before automation returns?
If the answer is unclear, the enterprise has only shipped an AI feature; it has not yet made AI a critical operational capability.