AI Operational Resilience / BCP / Degraded Mode Playbook

AI Operational Resilience / BCP / Degraded Mode Architecture Playbook

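Audience: CBAP+ practitioners, retail financial services AI PMs, AI Product Architects, Enterprise Architects, and leads in Business Continuity, Operational Resilience, Model Risk, Third-Party Risk, Cyber, Compliance, Internal Audit and frontline operations. This is not a basic BA process document; it trains you to turn AI customer-facing / regulated workflows into production capabilities that can degrade, recover, be evidenced and be exercised.

Core thesis:

AI operational resilience is not "model availability". It is whether critical operations can keep running safely within predefined boundaries when the model, RAG, tools, identity, policy engine, eval, HITL, vendor or evidence stack degrades.

This playbook is explicitly distinct from an incident postmortem:

An incident postmortem is learning after the fact.
Operational resilience / BCP is design before the fact.
Degraded mode is the controlled state that keeps critical services running during an incident.
A recovery exercise is the mechanism that proves these designs are actually executable.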

1. Executive Framing

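The key risk in retail financial services AI comes not only from wrong answers but also from critical operations having no executable degradation plan under stress.

Once a customer service AI, credit policy assistant, AML case summarizer, fraud review copilot, complaint response agent, wealth advisor copilot, branch employee assistant or regulatory change assistant goes into production, it gradually becomes part of business operations. These systems may not make the final decision directly, but they influence:

The explanations customers receive.
The evidence employees see.
Case priority.
The pace of complaint handling.
Explanations in credit or fraud review.
AML / KYC investigation materials.
The evidence that management and regulatory examinations require.

In normal operation, AI governance asks:

Is this use case approved for launch?
Is the model validated?
Is the prompt tested?
Does RAG provide citations?
Are tools approved?
Is human review in place?

BCP / operational resilience goes on to ask:

If the model provider is down to 20% capacity, how do critical processes run?
If the RAG index is 12 hours stale, which questions must be refused or answered from templates?
If the identity service cannot return entitlements, can the AI still personalize?
If the policy engine produces a false allow, which tool actions must stop?
If the HITL queue is backed up 8 hours, which cases go first?
If the evidence export API fails, which outputs must go into the local evidence ledger?
If a vendor degrades both the runtime path and the evidence path at once, who can accept the residual risk?

The higher-order goal of operational resilience is not that AI never fails, but that failure stays within impact tolerance:

Customers are not misled.
Regulated processes do not run out of control.
Critical services are not disrupted in a disorderly way.
Manual queues are not swamped by low-risk tasks.
Evidence is not lost.
Recovery is gated.
Management knows who is making which decision.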

2. Source Anchors

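The following sources establish the regulatory, governance and architectural control vocabulary. This document is learning, portfolio and architecture training material; it is not legal advice, regulatory interpretation, an audit conclusion or a certification recommendation. Real projects must be reviewed by legal, compliance, model risk, operational risk, technology, business owner, third-party risk, privacy, security and internal audit against institutional policy and jurisdiction.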

Anchor	Official link	How this document uses it
FFIEC Business Continuity Management booklet	https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx	Organizes AI BCP around business impact analysis, risk assessment, critical operations, dependencies, testing, training, exercises and board / senior management oversight.
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Organizes AI risk identification, risk measurement, control selection, continuous monitoring and management reporting around Govern / Map / Measure / Manage.
ISO/IEC 42001	https://www.iso.org/standard/42001	Uses AI management system language to connect policy, roles and responsibilities, operational controls, performance evaluation, management review and continual improvement.
Federal Reserve SR 26-2	https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm	New 2026 model risk management guidance that supersedes SR 11-7 and SR 21-8; its current scope excludes generative AI and agentic AI models, but its ideas on risk tiering, validation, governance, monitoring and change control can inform control design for non-covered AI.
Federal Reserve SR 20-24 / Interagency Sound Practices for Operational Resilience	https://www.federalreserve.gov/supervisionreg/srletters/SR2024.htm	Organizes AI service continuity around operational resilience, critical operations, core business lines, impact tolerance, third-party dependencies, scenario testing and continuous improvement.

2.1 2026 SR 26-2 Nuance

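The current SR 26-2 nuance matters to AI architects:

It was issued on 2026-04-17.
It supersedes SR 11-7 and SR 21-8.
It adopts more risk-tiered model risk management language.
It does not currently cover generative AI or agentic AI models.
It should not be misread as the complete regulatory requirement for GenAI / agentic AI.
For retail financial services AI BCP, it can still provide structured thinking on governance, validation, continuous monitoring, change control and risk acceptance.

How to say it in practice:

I explicitly separate "does SR 26-2 directly apply" from "can SR 26-2's governance principles be borrowed". BCP design for customer-visible or regulated GenAI / agentic workflows needs to layer NIST AI RMF, ISO/IEC 42001, FFIEC BCM, operational resilience guidance and the institution's internal model / AI / third-party / cyber / business continuity policies.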

3. Critical Operation Mapping

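AI BCP cannot start from a model inventory; it has to start from critical operations.

3.1 Critical Operation Definition

In a retail financial services setting, a critical operation is a business capability or process whose disruption, severe degradation or evidence gap could affect:

Customer funds, accounts, entitlements, fees, appeals or legal rights.
Continued operation of core business lines.
Regulated obligations such as AML / KYC / fraud / credit / complaint / privacy.
The institution's safety and soundness.
The provability that management, audit or regulatory examination requires.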

3.2 AI-Enabled Critical Operation Inventory

Critical operation	AI workflow	Customer / regulatory impact	Minimum viable service
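Customer service for fees, disputes, complaints	Customer-facing AI or agent-assist	Wrong fees, wrong commitments, misleading guidance on complaint rights	Pre-approved templates + human escalation + authoritative policy citation
Credit underwriting support	Lending policy assistant / document summarizer	Credit reasons, eligibility, exception policy, adverse action explanations	Draft summary + human confirmation + reason-code consistency
AML investigation	AML case summarizer / SAR narrative drafter	Investigation quality, SAR narrative, regulatory examination evidence	Extractive summary + analyst review + source trace
Fraud alert triage	Fraud copilot	Account freezes, transaction blocks, customer friction	Read-only evidence aggregation + human disposition
Complaint handling	Complaint response agent	Statutory response deadlines, customer remediation, brand and regulatory risk	Draft-only + deadline priority queue
Wealth advice support	Advisor copilot	Investment advice, suitability, disclosure, customer profile	Internal-only summary + required advisor review
Branch operations support	Employee knowledge assistant	Branch service consistency, customer identity, process compliance	Template + SOP search + no personalization if entitlement degraded
Regulatory change management	Regulatory summarizer	Policy updates, control updates, audit evidence	Manual legal review + traceable summary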

3.3 Impact Tolerance

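Impact tolerance is not a technical SLO. It answers: under a severe but plausible stress scenario, how much disruption, backlog, error and evidence gap can the institution absorb at most?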

Dimension	Example tolerance
Customer-visible misinformation	Tier 1 workflows: zero known unsupported commitments after detection
Manual backlog	Complaint high-risk queue: no regulatory deadline breach
Evidence gap	Tier 1 AI outputs: no unrecoverable trace gap; local capture allowed for delayed export
Tool side effect	No high-risk write action without approval when policy or identity is degraded
Policy freshness	Customer fee / credit / complaint answers cannot use superseded policy
Recovery approval	Normal automation restarts only after regression, evidence sync and business signoff

4. AI Dependency Graph

AI BCP needs a dependency graph, not just a system inventory.

4.1 Dependency Types

Dependency	Failure examples	Resilience question
Model provider	outage, latency, model rollback, rate limit, quality drift	Is there an approved fallback model, template path or extractive baseline?
RAG corpus	stale policy, source removal, document conflict	Can we switch to authoritative search, templates or citation-only?
Vector index	rebuild failure, ACL filter error, metadata corruption	Is there an index manifest, a previous version and a rollback?
Tool gateway	CRM / LOS / AML tool degraded, side effect risk	Can we switch to read-only, draft-only or manual action?
Identity / entitlement	SSO degraded, role mapping missing, consent unavailable	Can we disable personalization and sensitive data access?
Policy engine	false allow, false deny, latency, unavailable	Is there a safe default, a policy snapshot and manual decision rights?
Eval pipeline	judge unavailable, golden set runner failed	Do we freeze high-risk releases and recovery?
HITL queue	backlog, skill mismatch, reviewer outage	Are there risk priorities, a surge team and capacity triggers?
Vendor support	delayed notice, portal down, evidence export failure	Are there contract escalation and local evidence capture?
Evidence store	trace loss, export lag, retention failure	Can we keep the fields needed for recovery in a local append-only store?

4.2 Graph View

Critical operation
  -> AI workflow
  -> customer / employee channel
  -> model route
  -> prompt / policy bundle
  -> RAG source and index
  -> tool gateway
  -> identity and entitlement
  -> policy engine
  -> HITL queue
  -> evidence ledger
  -> vendor and cloud dependencies
  -> recovery gate

4.3 Dependency Edge Attributes

Attribute	Why it matters
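Synchronous vs asynchronous	A failed synchronous dependency interrupts the customer journey immediately
Read vs write	A failed or misused write dependency causes real business impact
Customer-visible vs internal	Customer-visible output needs stricter degradation
Regulatory deadline	Complaint, AML, KYC and credit processes are time-bound
Evidence criticality	Missing evidence impairs post-incident review, audit and regulatory response
Substitutability	The harder the substitution, the more the BCP needs rehearsal in advance
Manual capacity	Whether the manual fallback is actually executable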
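
One way to make the graph and its edge attributes reviewable is to hold each edge as a small record and rank edges for exercise planning. The sketch below is illustrative only: the `DependencyEdge` fields mirror some of the attributes above, while the additive `drill_priority` score and the sample edges are assumptions, not a scoring method this playbook prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyEdge:
    """One workflow -> dependency edge with a subset of the attributes above."""
    workflow: str
    dependency: str
    synchronous: bool
    write_access: bool
    customer_visible: bool
    evidence_critical: bool
    substitutable: bool


def drill_priority(edge):
    """Hypothetical score: count of risk-raising attributes on the edge (0-5)."""
    return sum([
        edge.synchronous,
        edge.write_access,
        edge.customer_visible,
        edge.evidence_critical,
        not edge.substitutable,
    ])


def rank_edges(edges):
    """Edges most in need of rehearsal first."""
    return sorted(edges, key=drill_priority, reverse=True)


# Sample edges are made up for illustration.
EDGES = [
    DependencyEdge("complaint_response_agent", "model_provider", True, False, True, True, True),
    DependencyEdge("complaint_response_agent", "tool_gateway", True, True, True, True, False),
    DependencyEdge("employee_knowledge_assistant", "rag_index", True, False, False, False, True),
]

for edge in rank_edges(EDGES):
    print(edge.workflow, "->", edge.dependency, drill_priority(edge))
```

A real register would likely weight the attributes rather than count them, and would add regulatory deadline and manual capacity once those can be expressed as data.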

5. Degraded Mode Taxonomy

Degraded mode is a pre-approved operating state. It is not an ad hoc "turn a few features off"; it has a trigger, allowed behavior, blocked behavior, an owner, evidence and a recovery condition.

5.1 Taxonomy Matrix

Mode	Allowed behavior	Blocked behavior	Typical trigger
Normal	Full approved AI workflow within policy	Out-of-bound use	All dependencies healthy
Conservative generation	Short answer, citation, lower temperature, no speculation	Broad reasoning, unsupported advice	Model quality KRI yellow
Citation-required	Answer only with authoritative sources	Unsourced conclusion	RAG confidence degraded
Template-only	Pre-approved static language	Free-form customer commitments	Customer-visible regulated topic
Draft-only	Employee sees draft and must approve	Auto-send or auto-submit	Evidence, policy, or model not fully healthy
Read-only	Retrieve and summarize	Any write, notification, case closure, credit action	Tool, identity, or policy degraded
Manual-first	AI supports queueing and retrieval only	AI recommendation or decision framing	HITL or model risk high
Cache-assisted	Recently validated low-risk answers	Personalized, time-sensitive, eligibility answers	Vendor outage with fresh cache
Safe-stop	Stop a workflow category	Continuing AI output	Multiple critical controls degraded
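
The allowed / blocked columns can be enforced as a fail-closed lookup in front of every AI action. This is a minimal sketch: the mode and action names are hypothetical identifiers chosen for the example, only a few rows of the matrix are encoded, and an unknown mode is treated as blocked by assumption.

```python
# Blocked actions per mode; None means every AI action is blocked.
# Mode and action names are illustrative, not a fixed vocabulary.
BLOCKED_ACTIONS = {
    "normal": set(),
    "template_only": {"free_form_answer"},
    "draft_only": {"auto_send", "auto_submit"},
    "read_only": {"write", "notify", "close_case", "credit_action"},
    "safe_stop": None,
}


def is_allowed(mode, action):
    """Return True only if the mode is known and does not block the action."""
    if mode not in BLOCKED_ACTIONS:
        return False  # unknown mode: fail closed
    blocked = BLOCKED_ACTIONS[mode]
    if blocked is None:
        return False
    return action not in blocked


print(is_allowed("draft_only", "generate_draft"))  # True
print(is_allowed("draft_only", "auto_send"))       # False
print(is_allowed("safe_stop", "generate_draft"))   # False
```

Keeping the table as data rather than scattered conditionals makes it easier for business and compliance owners to review exactly what each mode blocks.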

5.2 Trigger Design

Trigger	Degraded mode
Model p95 latency exceeds tolerance but quality normal	Conservative generation or fallback model
Model quality eval fails Tier 1 golden set	Draft-only or safe-stop for affected workflow
RAG index freshness breach for customer policy	Template-only or citation-required
Vector ACL uncertainty	Safe-stop for sensitive retrieval; no personalization
Tool gateway side-effect uncertainty	Read-only
Identity entitlement unavailable	No sensitive data, no personalized answer, manual verification
Policy engine unavailable	Deny high-risk action, allow low-risk template where policy snapshot valid
HITL queue > capacity threshold	Risk-prioritized manual-first, stop low-value escalations
Evidence export degraded	Draft-only for customer-visible Tier 1; local evidence ledger
Eval pipeline unavailable after high-risk change	Freeze release and freeze recovery to normal
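
When several triggers fire at once, the workflow needs a single resulting mode. The sketch below assumes a strict ordering of modes from least to most restrictive and picks the most restrictive one; both that ordering and the trigger names are assumptions for illustration, and an unrecognized trigger is mapped to safe-stop as a conservative default.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Modes ordered from least to most restrictive (ordering is an assumption)."""
    NORMAL = 0
    CONSERVATIVE = 1
    CITATION_REQUIRED = 2
    TEMPLATE_ONLY = 3
    DRAFT_ONLY = 4
    READ_ONLY = 5
    MANUAL_FIRST = 6
    SAFE_STOP = 7


# Hypothetical trigger names mapped to one mode from the trigger table.
TRIGGER_TO_MODE = {
    "model_latency_breach": Mode.CONSERVATIVE,
    "rag_freshness_breach": Mode.TEMPLATE_ONLY,
    "evidence_export_degraded": Mode.DRAFT_ONLY,
    "tool_side_effect_uncertain": Mode.READ_ONLY,
    "hitl_queue_over_capacity": Mode.MANUAL_FIRST,
    "vector_acl_uncertain": Mode.SAFE_STOP,
}


def resolve_mode(active_triggers):
    """Return the most restrictive mode implied by the active triggers."""
    modes = [TRIGGER_TO_MODE.get(t, Mode.SAFE_STOP) for t in active_triggers]
    return max(modes, default=Mode.NORMAL)


print(resolve_mode([]).name)
print(resolve_mode(["model_latency_breach", "evidence_export_degraded"]).name)
```

A total ordering is a simplification: some modes restrict different things (read-only limits tools, template-only limits language), so a production design may need to combine restrictions rather than pick one.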

6. RTO, RPO And AI SLOs

Traditional DR metrics must be translated for AI.

6.1 Definitions

Metric	AI interpretation
RTO	Time to restore controlled service, not necessarily full automation
RPO	Acceptable loss window for trace, retrieved source IDs, prompts, policy decisions, tool calls and human review records
SLO	Service level objective for availability, latency, groundedness, policy compliance, HITL queue, evidence completeness and fallback success
SLI	Measured indicator such as p95 latency, citation support, unsupported answer rate, queue age or trace completeness
Error budget	Allowed degradation before forced mode change, not just downtime allowance

6.2 Suggested AI Resilience Targets

Workflow tier	RTO	RPO	Key SLO
Tier 1 customer-visible regulated	Restore controlled service within 30-60 minutes	No unrecoverable evidence gap	Evidence completeness, no unsupported commitment, queue deadline protection
Tier 1 employee decision support	Restore read-only or draft-only within 60 minutes	Trace gap under approved local capture window	Source trace, human approval, no high-risk auto action
Tier 2 customer service	Restore template or cached service within 2 hours	Recoverable transcript and source references	Template availability, escalation SLA
Tier 2 operations support	Restore internal knowledge search within 4 hours	Recoverable query and answer log	Search freshness, manual SOP access
Tier 3 productivity assistant	Restore when platform stable	Standard log retention	Cost and availability

6.3 SLO Catalog

SLO	Example threshold
Model route availability	Tier 1 approved route or fallback route available 99.5% during business hours
Grounded answer rate	Tier 1 customer-visible answers with citation support >= 99%
Policy freshness	Critical policy update indexed or template updated within approved window
Tool safety	100% high-risk write actions have approval ID and idempotency key
HITL queue age	Tier 1 escalations under regulatory or customer harm threshold
Evidence completeness	Tier 1 traces include model, prompt, source IDs, policy decision, tool call, human action
Fallback activation time	Dependency health trigger to mode switch under defined minutes
Recovery gate quality	Restart sample review pass before automation resumes
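
Two of these SLOs reduce to simple arithmetic. The sketch below computes an evidence-completeness SLI over a batch of traces and the share of an error budget consumed against a target such as the 99% grounded answer rate. Field names follow the evidence completeness row; the trace dictionary shape and both function names are assumptions.

```python
REQUIRED_TRACE_FIELDS = (
    "model", "prompt", "source_ids", "policy_decision", "tool_call", "human_action",
)


def evidence_completeness(traces):
    """Share of traces carrying every required field (1.0 when there are no traces)."""
    if not traces:
        return 1.0
    complete = sum(
        1 for t in traces
        if all(t.get(field) is not None for field in REQUIRED_TRACE_FIELDS)
    )
    return complete / len(traces)


def error_budget_consumed(total_events, bad_events, slo_target):
    """Fraction of the allowed bad events already used; above 1.0 means breach."""
    allowed_bad = total_events * (1.0 - slo_target)
    if allowed_bad <= 0:
        return 0.0 if bad_events == 0 else float("inf")
    return bad_events / allowed_bad


full = {f: "recorded" for f in REQUIRED_TRACE_FIELDS}
partial = {**full, "human_action": None}
print(evidence_completeness([full, partial]))
print(error_budget_consumed(total_events=1000, bad_events=5, slo_target=0.99))
```

With a 99% target over 1,000 answers the budget is about 10 unsupported answers, so 5 unsupported answers consume roughly half of it. A forced mode change can then be tied to a consumption threshold instead of waiting for an outright breach.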

7. Manual Fallback Design

Manual fallback is not “let humans handle it”. It is a designed operating model.

7.1 Manual Fallback Components

Component	Design requirement
Scope	Which intents, products, segments, channels and regions enter manual mode
Queue	Dedicated queue with risk tier, deadline, customer harm and skill tags
Staffing	Named teams, surge pool, hours, handoff and fatigue controls
Decision aids	Static SOP, policy snapshots, templates, approved calculators, evidence viewer
Authority	Who can approve, reject, override, remediate and communicate
Evidence	Manual action record, reason, source, timestamp, approver and customer impact
SLA	Queue age, regulatory deadline, customer callback and internal escalation
Exit	Criteria to move from manual-first back to draft-only or normal

7.2 Manual Fallback Flow

degradation trigger
  -> freeze risky automation
  -> classify active requests by risk and deadline
  -> route Tier 1 cases to skilled reviewers
  -> provide policy snapshot and evidence viewer
  -> record manual decision and customer communication
  -> monitor backlog and breach risk
  -> recover automation only after gate approval

7.3 Queue Priority Rules

Priority	Criteria
P0	Potential customer harm, regulatory deadline, irreversible action, privacy exposure
P1	Customer-visible complaint, fee, credit, fraud, KYC / AML, wealth suitability
P2	Employee productivity, internal summary, non-deadline operations
P3	Low-risk general information or training support
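
The priority rules translate directly into a classifier plus a sort. This is a sketch under stated assumptions: the `QueuedCase` fields are hypothetical flags that an intake step would have to set, and ordering within a priority by hours to deadline is an added tie-breaker, not a rule from the table.

```python
from dataclasses import dataclass


@dataclass
class QueuedCase:
    case_id: str
    hours_to_deadline: float
    customer_harm: bool = False
    regulatory_deadline: bool = False
    irreversible_action: bool = False
    privacy_exposure: bool = False
    customer_visible_regulated: bool = False  # complaint, fee, credit, fraud, KYC / AML, wealth
    internal_operations: bool = False


def priority(case):
    """Return 0-3, matching P0-P3 in the table above."""
    if (case.customer_harm or case.regulatory_deadline
            or case.irreversible_action or case.privacy_exposure):
        return 0
    if case.customer_visible_regulated:
        return 1
    if case.internal_operations:
        return 2
    return 3


def order_queue(cases):
    """Highest priority first; within a priority, nearest deadline first."""
    return sorted(cases, key=lambda c: (priority(c), c.hours_to_deadline))


cases = [
    QueuedCase("a", 48, internal_operations=True),
    QueuedCase("b", 6, regulatory_deadline=True),
    QueuedCase("c", 24, customer_visible_regulated=True),
    QueuedCase("d", 2, privacy_exposure=True),
]
print([c.case_id for c in order_queue(cases)])
```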

8. Customer Communication

Degraded AI service must be communicated carefully. The message should avoid over-disclosing technical internals while being accurate about service state, customer impact and next steps.

8.1 Communication Principles

Principle	Meaning
Accuracy	Do not claim AI is working normally when high-risk paths are degraded
Boundaries	Make clear when information is general, draft, or pending human review
No false reassurance	Avoid saying there is no impact before evidence confirms it
Timeliness	Provide updates before regulatory or service deadlines are missed
Consistency	Customer, branch, call center, complaint and digital channels use aligned language
Evidence	Keep the exact message version and affected population record

8.2 Customer-Facing States

State	Customer message posture
Normal	Standard disclosure and service response
Reduced automation	“Some requests are being reviewed by a specialist before completion.”
Manual review	“We are reviewing this manually and will respond by the stated timeframe.”
Delayed service	“This request may take longer than usual; urgent account or safety concerns can use [approved channel].”
Correction	“We are correcting information previously provided and will explain any customer-specific impact.”

8.3 Regulated Workflow Messaging

Workflow	Communication control
Complaint	Preserve deadline, avoid premature conclusion, route to complaint specialist
Credit	Do not generate eligibility, pricing or adverse-action reasons without approved process
AML / Fraud	Do not expose investigation logic; communicate operationally approved next steps
Wealth	Advisor owns client communication; AI output remains internal support
Fees / disputes	Use approved fee, dispute and rights language only

9. Evidence Preservation

Evidence is part of resilience. If the organization cannot reconstruct what the AI saw, decided, retrieved, proposed, blocked, escalated and communicated, recovery is incomplete.

9.1 Evidence Fields

Field	Why it matters
Request ID / trace ID	Links channels, AI, tools and case systems
Use case and risk tier	Drives severity and retention
Customer / case reference	Identifies affected population
Model provider, model ID, version	Supports model route reconstruction
Prompt / policy bundle version	Explains behavior boundary
Retrieved source IDs and index version	Validates grounding and policy freshness
Policy engine decision ID	Shows allow, deny, escalate or fallback
Tool calls and side effects	Identifies real-world impact
HITL action and reviewer role	Shows oversight
Degraded mode state	Proves system was operating under approved constraints
Customer-visible message version	Supports remediation and complaint response
Recovery gate result	Shows why automation restarted

9.2 Local Evidence Ledger

When vendor evidence export is degraded, Tier 1 workflows should write a local append-only record:

trace_id
timestamp
workflow
risk_tier
mode
model_route
prompt_version
source_ids
policy_decision
tool_action
human_review
customer_visible_flag
fallback_reason
hash_of_output

This ledger should be:

append-only.
access-controlled.
time-synchronized.
retained under approved policy.
reconciled with vendor export after recovery.
included in audit evidence binder.
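
A minimal in-memory sketch of such a ledger follows. It uses the field names listed above and adds two things that are assumptions rather than requirements of this playbook: a `prev_hash` / `record_hash` chain so that edits to earlier records are detectable, and a `missing_from_vendor` helper for post-recovery reconciliation. Durable storage, access control and clock synchronization are out of scope for the example.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64


def _digest(entry):
    """Stable SHA-256 over a record's JSON form."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class LocalEvidenceLedger:
    """Append-only, hash-chained list of evidence records (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record, output_text):
        entry = dict(record)
        entry.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
        entry["hash_of_output"] = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
        entry["prev_hash"] = self._records[-1]["record_hash"] if self._records else GENESIS
        entry["record_hash"] = _digest(entry)  # computed before the key is set
        self._records.append(entry)
        return entry

    def verify(self):
        """Recompute the chain; False if any record was altered or reordered."""
        prev = GENESIS
        for entry in self._records:
            body = {k: v for k, v in entry.items() if k != "record_hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["record_hash"]:
                return False
            prev = entry["record_hash"]
        return True

    def missing_from_vendor(self, vendor_trace_ids):
        """Trace IDs captured locally that the vendor export does not contain."""
        return {e["trace_id"] for e in self._records} - set(vendor_trace_ids)


ledger = LocalEvidenceLedger()
ledger.append({"trace_id": "t-1", "workflow": "complaint", "risk_tier": 1,
               "mode": "draft_only", "fallback_reason": "evidence_export_degraded"},
              output_text="Draft reply text")
ledger.append({"trace_id": "t-2", "workflow": "complaint", "risk_tier": 1,
               "mode": "draft_only", "fallback_reason": "evidence_export_degraded"},
              output_text="Another draft")
print(ledger.verify(), ledger.missing_from_vendor(["t-1"]))
```

Storing only the hash of the output keeps customer content out of the ledger; whether the full output must also be retained is a policy decision for legal, privacy and records owners.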

10. BCP / DR Operating Model

10.1 Decision Rights

Decision	Accountable role
Declare AI operational degradation	Incident Commander / Operational Resilience Lead
Activate domain degraded mode	Business Owner + AI Product Owner
Disable model route	AI Platform Owner
Disable write tools	Security / Platform Owner
Freeze release or recovery	Model Risk / AI Governance
Prioritize HITL queue	Operations Executive
Approve customer communication	Business Owner + Legal / Compliance
Accept residual risk during extended degradation	Senior management risk committee
Restore normal automation	Business Owner + Platform + Model Risk + Compliance

10.2 Recovery Gate

Normal mode must not resume just because the vendor status page turns green.

Recovery gate should include:

dependency health stable for defined window.
no unresolved critical KRI breach.
regression eval pass for affected workflows.
evidence sync or approved local ledger reconciliation.
sampled human review of degraded-period outputs.
tool side-effect reconciliation.
HITL queue within tolerance or approved backlog plan.
business owner signoff.
model risk / compliance signoff for Tier 1.
restart decision recorded.
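
The gate can be expressed as an all-checks-must-pass evaluation that also reports what is still open. In this sketch the check identifiers are hypothetical names for the items above, a missing or non-`True` result counts as a failure, and the model risk / compliance signoff is required only for Tier 1, as the list states.

```python
RECOVERY_CHECKS = (
    "dependency_health_stable",
    "no_critical_kri_breach",
    "regression_eval_pass",
    "evidence_reconciled",
    "sample_review_pass",
    "tool_side_effects_reconciled",
    "hitl_queue_within_tolerance",
    "business_owner_signoff",
    "model_risk_compliance_signoff",
    "restart_decision_recorded",
)


def evaluate_recovery_gate(results, risk_tier):
    """Return (passed, failed_checks). Anything not explicitly True fails."""
    required = [
        check for check in RECOVERY_CHECKS
        if check != "model_risk_compliance_signoff" or risk_tier == 1
    ]
    failed = [check for check in required if results.get(check) is not True]
    return len(failed) == 0, failed


results = {check: True for check in RECOVERY_CHECKS}
results["evidence_reconciled"] = False
print(evaluate_recovery_gate(results, risk_tier=1))
```

Returning the failed checks, not just a boolean, gives the operational bridge a concrete list to work through and something to attach to the recovery decision memo.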

11. Tabletop Exercises

Exercises prove whether BCP is executable.

11.1 Exercise Types

Exercise	Purpose
Tabletop	Test decision rights, communication and tradeoffs
Technical failover drill	Test routing, fallback model, index rollback, tool disable
Evidence drill	Test local ledger, export reconciliation, audit pack
HITL surge drill	Test manual queue capacity and prioritization
Vendor exit tabletop	Test contract, data export, replacement route and executive decisions
Full scenario simulation	Combine model, RAG, policy, HITL and evidence degradation

11.2 Scenario 1: Model Provider Outage

At 09:15, Model Provider A reports elevated errors and rate limiting.
Customer service AI, lending assistant and complaint agent use Provider A.
RAG and tools are healthy.
Evidence export remains healthy.

Expected decisions:

Switch Tier 2 general service to fallback model or template.
Move Tier 1 complaint responses to draft-only.
Keep credit reason explanations under human confirmation.
Monitor fallback quality and latency.
Notify operations of reduced automation.

11.3 Scenario 2: RAG Index Stale

Policy team updated fee waiver and dispute policies at 07:00.
RAG index refresh failed silently.
Customer-facing AI continues retrieving yesterday's policy.

Expected decisions:

Disable free-form fee / dispute answers.
Use approved templates and authoritative policy search.
Query affected customer-visible answers since 07:00.
Preserve retrieved source IDs and policy effective dates.
Recover only after index validation and sample review.
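
The "query affected answers" step depends on the evidence fields being queryable. The sketch below filters an answer log for customer-visible answers produced after the policy update that cited a superseded source. The log record shape, the source IDs and the function name are assumptions for illustration; a real implementation would query the trace store.

```python
from datetime import datetime


def affected_answers(answer_log, policy_updated_at, superseded_source_ids):
    """Customer-visible answers since the update that cited a superseded source."""
    superseded = set(superseded_source_ids)
    return [
        answer for answer in answer_log
        if answer["customer_visible"]
        and answer["timestamp"] >= policy_updated_at
        and superseded.intersection(answer["source_ids"])
    ]


# Made-up sample log for the exercise.
updated_at = datetime(2026, 6, 30, 7, 0)
log = [
    {"trace_id": "t-1", "timestamp": datetime(2026, 6, 30, 6, 50),
     "customer_visible": True, "source_ids": ["fee-waiver-v3"]},
    {"trace_id": "t-2", "timestamp": datetime(2026, 6, 30, 8, 10),
     "customer_visible": True, "source_ids": ["fee-waiver-v3", "faq-12"]},
    {"trace_id": "t-3", "timestamp": datetime(2026, 6, 30, 8, 30),
     "customer_visible": False, "source_ids": ["fee-waiver-v3"]},
    {"trace_id": "t-4", "timestamp": datetime(2026, 6, 30, 9, 0),
     "customer_visible": True, "source_ids": ["faq-12"]},
]
print([a["trace_id"] for a in affected_answers(log, updated_at, ["fee-waiver-v3"])])
```

If retrieved source IDs were not captured during the window, this query cannot be answered, which is one reason the evidence drill belongs in the same exercise.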

11.4 Exercise Scoring

Dimension	Pass criteria
Detection	Trigger recognized within target time
Decision rights	Correct owner makes decision
Mode switch	System enters approved degraded state
Customer protection	High-risk customer-visible output blocked or reviewed
Manual fallback	Queue routing and staffing work
Evidence	Required fields captured
Communication	Internal and external messages aligned
Recovery	Restart gate used before normal mode
Improvement	Gaps become funded actions

12. RACI

Activity	Accountable	Responsible	Consulted	Informed
Critical operation mapping	Business Executive	AI PM / Senior BA	Enterprise Architecture, Risk, Compliance	Operations, Audit
AI dependency graph	Enterprise Architect	AI Platform + Product Architect	Security, Data, Vendor Owner	Business Owners
Impact tolerance approval	Senior Management / Risk Committee	Operational Resilience Lead	Legal, Compliance, Model Risk	Board Risk Committee as appropriate
Degraded mode design	AI Product Owner	Product Architect + Platform Owner	Business Ops, Compliance, Security	Customer Ops
Manual fallback plan	Operations Executive	Queue Owners / Workforce Planning	Business Owner, Compliance	Frontline Teams
Evidence preservation design	Governance / Audit Evidence Owner	Platform Engineering	Legal, Privacy, Security	Internal Audit
Tabletop exercise	Operational Resilience Lead	BCP Team + AI Platform	Vendor, Legal, Business, Risk	Senior Management
Trigger monitoring	Platform Owner	SRE / AI Ops	Product, Model Risk	Business Ops
Customer communication	Business Owner	Customer Ops / Communications	Legal, Compliance	Frontline Teams
Recovery approval	Business Owner + Model Risk	Platform + Product	Compliance, Security, Operations	Senior Management

13. Templates

13.1 AI BCP Use Case Card

Field	Example
Use case	Complaint response agent
Critical operation	Complaint handling
Risk tier	Tier 1
Customer-visible	Yes
Regulatory exposure	Complaint deadlines, consumer protection, recordkeeping
Normal AI behavior	Draft and send approved responses after policy and human checks
Minimum viable service	Draft-only, manual review, deadline priority
Degraded modes	Template-only, draft-only, read-only, manual-first, safe-stop
RTO	Controlled service within 60 minutes
RPO	No unrecoverable trace gap
Recovery gate	Eval, evidence sync, sample review, business signoff

13.2 Degraded Mode Decision Card

Field	Filled example
Trigger	Evidence export API degraded for Tier 1 complaint workflow
Mode	Draft-only + local evidence ledger
Allowed	Generate internal draft from approved template, route to complaint analyst
Blocked	Auto-send, free-form promise, CRM closure
Decision owner	AI Incident Commander + Complaint Business Owner
Evidence required	Trace ID, prompt version, source IDs, policy decision, output hash
Customer impact	Response may require manual review; deadlines protected
Recovery condition	Vendor export restored, local ledger reconciled, 30-sample review passed

13.3 Recovery Decision Memo

Field	Filled example
Workflow	Complaint response agent
Incident window	2026-06-30 09:15-13:40 CT
Degraded mode	Draft-only + local evidence ledger
Customer impact	No auto-send; 42 cases manually reviewed
Evidence status	Local ledger reconciled; 0 missing Tier 1 traces
Regression status	Complaint policy golden set and citation support passed
Tool reconciliation	No automated CRM closure during degraded window
Decision	Restore approved complaint intents with 24-hour enhanced monitoring
Approvers	Business Owner, Platform Owner, Model Risk, Compliance

13.4 Communication And Executive Brief Template

Audience	Required content
Customer	Reduced automation, manual review, expected response path, urgent channel
Frontline	Current mode, allowed actions, blocked actions, queue and next update
Executive	Situation, impact, controls, decisions needed, recovery path, residual risk
Regulator / examiner	System scope, affected population, controls, evidence, remediation

14. Governance Cadence

Cadence	Forum	Decisions / outputs
Daily during degradation	AI operational bridge	Mode state, backlog, customer impact, evidence status, next decision
Weekly	AI operations review	SLO breaches, degraded-mode activations, queue health, vendor notices
Monthly	AI governance / model risk committee	KRI trends, recovery exercise gaps, risk acceptance, control effectiveness
Quarterly	Operational resilience review	Critical operation map, impact tolerance, scenario exercise, third-party dependency
Semiannual	Executive tabletop	Senior decision rights, customer communication, board / regulator brief
Annual	Full BCP / DR exercise	Technical failover, manual fallback, evidence recovery, vendor exit scenario
Event-driven	Major model, RAG, tool, policy, identity, evidence or vendor change	Impact assessment, regression scope, degraded mode update

15. 30-Day Lab

Goal: within 30 days, complete a presentable AI Operational Resilience / BCP / Degraded Mode Architecture portfolio pack. Recommended choices: Complaint response agent, Lending policy assistant, AML case summarizer or Customer service AI.

Day	Task	Artifact
1	Pick one retail financial services AI workflow and define its customer-visible and regulated boundary	Use Case Boundary Card
2	Map the critical operation and core business line	Critical Operation Map
3	Write the business impact analysis covering customer, regulatory, operational and evidence impact	AI BIA Worksheet
4	List model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies	Dependency Register
5	Draw the dependency graph and edge attributes	Dependency Graph
6	Define impact tolerance	Impact Tolerance Memo
7	Design the minimum viable service	Minimum Service Definition
8	Design the degraded mode taxonomy	Degraded Mode Matrix
9	Design the fallback for model provider outage	Model Fallback Runbook
10	Design the fallback for RAG stale / ACL failure	RAG Degradation Runbook
11	Design the fallback for tool gateway / side effect failure	Tool Degradation Runbook
12	Design the fallback for identity / entitlement failure	Identity Degradation Runbook
13	Design the fallback for policy engine false allow / false deny	Policy Degradation Runbook
14	Design the fallback for HITL queue saturation	HITL Surge Plan
15	Design the fallback for evidence export failure	Evidence Preservation Plan
16	Define RTO / RPO / SLO / SLI	AI Resilience Metrics Table
17	Design the manual fallback queue and staffing model	Manual Operations Plan
18	Design the customer and internal communication matrix	Communication Matrix
19	Write the degraded mode decision card	Decision Card
20	Write the recovery gate and restart memo	Recovery Gate Pack
21	Design monitoring triggers	Trigger Dashboard Spec
22	Design the RACI and decision rights	RACI
23	Design tabletop scenario 1: model outage	Scenario Script
24	Design tabletop scenario 2: RAG stale	Scenario Script
25	Design tabletop scenario 3: evidence failure	Scenario Script
26	Run one 90-minute tabletop and record the decision log	Exercise Decision Log
27	Turn exercise gaps into a remediation backlog	Remediation Register
28	Write the executive brief	Executive One-Pager
29	Write a 1500-2500 word portfolio case study	Case Study
30	Prepare 8 interview Q&As and a 5-minute narrative	Interview Story Pack

16. Interview Answers

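Q1: How does AI operational resilience differ from ordinary BCP?

30 seconds:

Ordinary BCP focuses on service outage, site failover, and recovery of people and processes. AI operational resilience must also cover semantic correctness, RAG freshness, policy decisions, tool side effects, identity entitlement, HITL capacity and evidence completeness. AI can be unsafe without being down, so it needs normal, degraded and safe-stop modes plus a recovery gate.

Q2: How do you design degraded mode for customer-visible AI?

30 seconds:

I step down by risk to citation-required, template-only, draft-only, manual-first or safe-stop. Content on fees, complaints, credit, wealth and customer rights cannot be free-form generated or auto-sent while RAG, policy, identity or evidence is incomplete. Returning to normal requires passing eval, sample review, evidence sync and business signoff.

Q3: Is HITL a natural fallback?

30 seconds:

No. HITL is a control only when it has capacity, skills, priorities, SLAs, evidence and decision authority. Otherwise it becomes the bottleneck during an incident. AI BCP must design queue tiering, surge staffing, deadline routing, the review payload and backlog thresholds.

Q4: Why does an evidence export failure trigger AI degradation?

30 seconds:

Because high-risk AI output must be reconstructable, auditable and explainable. Without evidence, you cannot prove after an incident which model version, prompt, retrieved source, policy decision, tool action and human review applied. When evidence export fails, a Tier 1 customer-visible workflow should at minimum switch to draft-only and enable the local evidence ledger.

Q5: How do you handle a policy engine false allow?

30 seconds:

First switch high-risk write tools to read-only, query side effects within the exposure window, and preserve the policy decision log and tool ledger; then carry out customer and business remediation. Recovery requires passing policy regression, approval-chain verification and tool idempotency checks.

Q6: How are RTO/RPO defined for AI?

30 seconds:

AI RTO is the time to restore controlled service, not necessarily full automation. AI RPO is the acceptable loss window for evidence, source, prompt, tool and human review records. A Tier 1 customer-visible process may require template or manual service restored within 30-60 minutes, with no unrecoverable evidence gap.

Q7: How do you design the gate for returning to normal mode?

30 seconds:

You cannot just watch the vendor status page turn green. The recovery gate looks at dependency health, regression eval, sample review, evidence reconciliation, tool side-effect reconciliation, HITL backlog, customer impact and business / risk / compliance signoff.

Q8: How do you explain the value of AI BCP to executives?

30 seconds:

I would say AI BCP does not obstruct automation; it protects critical operations. When an AI dependency fails, the institution needs to know which services continue, which degrade, which actions stop, who decides, how customers are told, how evidence is preserved and when to recover. That reduces customer harm, regulatory risk and chaotic downtime.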

17. Self-Assessment Checklist

Check	Passing evidence
Critical operation mapping	AI workflows linked to customer journeys, core business lines and regulatory obligations
Dependency graph	Model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies mapped
Impact tolerance	Business-approved tolerance for downtime, backlog, misinformation and evidence loss
Degraded mode taxonomy	Normal, conservative, citation-required, template-only, draft-only, read-only, manual-first, safe-stop defined
Trigger logic	Dependency health and risk-tier triggers mapped to mode switch
Manual fallback	Queue, staffing, templates, authority, SLA and evidence designed
Customer communication	Reduced automation, manual review, delay and correction states prepared
Evidence preservation	Local ledger and reconciliation design exists
Tabletop exercises	Model, RAG, identity, policy, HITL and evidence scenarios exercised
Recovery gate	Eval, evidence, tool reconciliation, queue, business and risk signoff required
Governance	RACI, cadence, risk acceptance and remediation tracking active

18. Final Principle

The maturity of AI operational resilience can be tested with one sentence:

When AI is degraded, do we already know which customer promises stop, which operations continue, which humans take over, which evidence is preserved, which executives decide, and what proof is required before automation returns?

If the answer is unclear, the enterprise has only shipped an AI feature; it has not yet made AI a critical operational capability.