AI Operational Resilience / BCP / Degraded Mode Playbook

AI Operational Resilience / BCP / Degraded Mode Architecture Playbook

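Positioning: Written for CBAP+ practitioners, retail financial services AI PMs, AI Product Architects, Enterprise Architects, and leads in Business Continuity, Operational Resilience, Model Risk, Third-Party Risk, Cyber, Compliance, Internal Audit and frontline operations. This is not a basic BA process document; it trains you to turn customer-facing and regulated AI workflows into a production capability that can be degraded, recovered, evidenced and exercised.

Core thesis:

AI operational resilience is not "model availability". It is whether critical operations can keep running safely within predefined boundaries when the model, RAG, tools, identity, policy engine, eval, HITL, vendor or evidence stack degrades.

This playbook is explicitly distinct from an incident postmortem:

  • An incident postmortem is learning after the fact.
  • Operational resilience / BCP is design before the fact.
  • Degraded mode is the controlled state that keeps critical services running during an incident.
  • A recovery exercise is the mechanism that proves these designs can actually be executed.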

1. Executive Framing

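The key risk for retail financial services AI comes not only from wrong answers, but also from critical operations that have no executable degradation plan under stress.

Once customer service AI, credit policy assistants, AML case summarizers, fraud review copilots, complaint response agents, wealth advisor copilots, branch employee assistants and regulatory change assistants go into production, they gradually become part of business operations. They may not make the final decision directly, but they influence:

  • The explanations customers receive.
  • The evidence employees see.
  • Case priority.
  • The pace of complaint handling.
  • The explanations behind credit or fraud reviews.
  • AML / KYC investigation materials.
  • The evidence that management and regulatory examinations require.

Under normal conditions, AI governance asks:

Is this use case approved for launch?
Is the model validated?
Are the prompts tested?
Does RAG provide citations?
Are tools subject to approval?
Is human review in place?

BCP / operational resilience goes on to ask:

If the model provider has only 20% of its capacity left, how do critical processes run?
If the RAG index is 12 hours stale, which questions must be refused or answered from templates?
If the identity service cannot return entitlements, can the AI still personalize?
If the policy engine produces a false allow, which tool actions must stop?
If the HITL queue is backlogged by 8 hours, which cases take priority?
If the evidence export API fails, which outputs must go into a local evidence ledger?
If a vendor degrades the runtime path and the evidence path at the same time, who can accept the residual risk?

The senior goal of operational resilience is not that AI never fails, but that failure stays within impact tolerance:

  • Customers are not misled.
  • Regulated processes do not run out of control.
  • Critical services are not interrupted in a disorderly way.
  • Manual queues are not flooded by low-risk tasks.
  • Evidence is not lost.
  • Recovery is gated.
  • Management knows who is making which decision.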

2. Source Anchors

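The sources below establish the regulatory, governance and architecture-control vocabulary. This document is learning, portfolio and architecture-training material; it does not constitute legal advice, regulatory interpretation, an audit conclusion or certification advice. Real projects must be reviewed by legal, compliance, model risk, operational risk, technology, business owner, third-party risk, privacy, security and internal audit in light of institutional policy and jurisdiction.

| Anchor | Official link | How this document uses it |
| --- | --- | --- |
| FFIEC Business Continuity Management booklet | https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx | Organizes AI BCP around business impact analysis, risk assessment, critical operations, dependencies, testing, training, exercises and board / senior management oversight. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI risk identification, risk measurement, control selection, continuous monitoring and management reporting. |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Uses AI management system language to connect policy, roles and responsibilities, operational controls, performance evaluation, management review and continual improvement. |
| Federal Reserve SR 26-2 | https://www.federalreserve.gov/supervisionreg/srletters/SR2602.htm | The 2026 model risk management guidance, which replaces SR 11-7 and SR 21-8. Its current scope excludes generative AI and agentic AI models, but its thinking on risk tiering, validation, governance, monitoring and change control can serve as a reference for designing controls for non-covered AI. |
| Federal Reserve SR 20-24 / Interagency Sound Practices for Operational Resilience | https://www.federalreserve.gov/supervisionreg/srletters/SR2024.htm | Organizes AI service continuity around operational resilience, critical operations, core business lines, impact tolerance, third-party dependencies, scenario testing and continuous improvement. |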

2.1 2026 SR 26-2 Nuance

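The current nuance of SR 26-2 matters to AI architects:

  • It was issued on 2026-04-17.
  • It replaces SR 11-7 and SR 21-8.
  • It uses more risk-tiered model risk management language.
  • It does not currently cover generative AI or agentic AI models.
  • It should not be misread as the complete regulatory requirement for GenAI / agentic AI.
  • For retail financial services AI BCP, it still offers structured thinking on governance, validation, continuous monitoring, change control and risk acceptance.

In practice, I would put it this way:

I would explicitly separate "does SR 26-2 apply directly" from "can SR 26-2's governance principles be borrowed". BCP design for customer-visible or regulated GenAI / agentic workflows needs to layer on NIST AI RMF, ISO/IEC 42001, FFIEC BCM, operational resilience guidance and the institution's internal model, AI, third-party, cyber and business continuity policies.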


3. Critical Operation Mapping

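AI BCP cannot start from the model inventory; it has to start from critical operations.

3.1 Critical Operation Definition

In a retail financial services environment, a critical operation is a business capability or process whose interruption, severe degradation or loss of evidence could affect:

  • Customer funds, accounts, entitlements, fees, appeals or legal rights.
  • Continued operation of core business lines.
  • Regulated obligations such as AML / KYC / fraud / credit / complaint / privacy.
  • The institution's safety and soundness.
  • The provability that management, audit or regulatory examinations require.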

3.2 AI-Enabled Critical Operation Inventory

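| Critical operation | AI workflow | Customer / regulatory impact | Minimum viable service |
| --- | --- | --- | --- |
| Customer service for fees, disputes, complaints | Customer-facing AI or agent-assist | Wrong fees, wrong commitments, misleading guidance on complaint rights | Pre-approved templates + human escalation + authoritative policy citation |
| Credit underwriting support | Lending policy assistant / document summarizer | Credit reasons, eligibility, exception policy, adverse action explanations | Draft summary + human confirmation + reason-code consistency |
| AML investigation | AML case summarizer / SAR narrative drafter | Investigation quality, SAR narrative, evidence for regulatory examination | Extractive summary + analyst review + source trace |
| Fraud alert triage | Fraud copilot | Account freezes, transaction blocks, customer friction | Read-only evidence aggregation + human disposition |
| Complaint handling | Complaint response agent | Statutory response deadlines, customer remediation, brand and regulatory risk | Draft-only + deadline priority queue |
| Wealth advice support | Advisor copilot | Investment advice, suitability, disclosure, client profile | Internal-only summary + required advisor review |
| Branch operations support | Employee knowledge assistant | Branch service consistency, customer identity, process compliance | Template + SOP search + no personalization if entitlement degraded |
| Regulatory change management | Regulatory summarizer | Policy updates, control updates, audit evidence | Manual legal review + traceable summary |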

3.3 Impact Tolerance

Impact tolerance is not the same as a technical SLO. It answers one question: under a severe but plausible stress scenario, how much interruption, backlog, error and evidence gap can the institution absorb at most?

| Dimension | Example tolerance |
| --- | --- |
| Customer-visible misinformation | Tier 1 workflows: zero known unsupported commitments after detection |
| Manual backlog | Complaint high-risk queue: no regulatory deadline breach |
| Evidence gap | Tier 1 AI outputs: no unrecoverable trace gap; local capture allowed for delayed export |
| Tool side effect | No high-risk write action without approval when policy or identity is degraded |
| Policy freshness | Customer fee / credit / complaint answers cannot use superseded policy |
| Recovery approval | Normal automation restarts only after regression, evidence sync and business signoff |

4. AI Dependency Graph

AI BCP requires drawing a dependency graph, not just writing a system inventory.

4.1 Dependency Types

| Dependency | Failure examples | Resilience question |
| --- | --- | --- |
| Model provider | outage, latency, model rollback, rate limit, quality drift | Is there an approved fallback model, template path or extractive baseline? |
| RAG corpus | stale policy, source removal, document conflict | Can the workflow switch to authoritative search, templates or citation-only? |
| Vector index | rebuild failure, ACL filter error, metadata corruption | Is there an index manifest, a previous version and a rollback path? |
| Tool gateway | CRM / LOS / AML tool degraded, side effect risk | Can it switch to read-only, draft-only or manual action? |
| Identity / entitlement | SSO degraded, role mapping missing, consent unavailable | Can personalization and sensitive data access be disabled? |
| Policy engine | false allow, false deny, latency, unavailable | Are there a safe default, a policy snapshot and manual decision rights? |
| Eval pipeline | judge unavailable, golden set runner failed | Are high-risk releases and recovery frozen? |
| HITL queue | backlog, skill mismatch, reviewer outage | Are there risk priorities, a surge team and capacity triggers? |
| Vendor support | delayed notice, portal down, evidence export failure | Are there contract escalation and local evidence capture? |
| Evidence store | trace loss, export lag, retention failure | Can the fields needed for recovery be saved locally in append-only form? |

4.2 Graph View

Critical operation
  -> AI workflow
  -> customer / employee channel
  -> model route
  -> prompt / policy bundle
  -> RAG source and index
  -> tool gateway
  -> identity and entitlement
  -> policy engine
  -> HITL queue
  -> evidence ledger
  -> vendor and cloud dependencies
  -> recovery gate

4.3 Dependency Edge Attributes

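| Attribute | Why it matters |
| --- | --- |
| Synchronous vs asynchronous | A failed synchronous dependency interrupts the customer journey immediately |
| Read vs write | A failed or misused write dependency causes real business impact |
| Customer-visible vs internal | Customer-visible output needs stricter degradation |
| Regulatory deadline | Complaint, AML, KYC and credit processes have time constraints |
| Evidence criticality | Missing evidence affects post-incident review, audit and regulatory response |
| Substitutability | The harder a dependency is to substitute, the more the BCP must be exercised in advance |
| Manual capacity | Whether the manual fallback is actually executable |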
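
One way to make the graph and its edge attributes concrete is to hold each edge as a small record and rank edges for exercise planning. The sketch below is illustrative only: the `DependencyEdge` fields, the example edges and the weights in `drill_priority` are assumptions chosen for the example, not values prescribed by this playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyEdge:
    """One edge of the AI dependency graph with a subset of the attributes above."""
    source: str
    target: str
    synchronous: bool
    writes: bool
    customer_visible: bool
    substitutable: bool


def drill_priority(edge: DependencyEdge) -> int:
    """Higher score means exercise this dependency earlier (hypothetical weights)."""
    score = 0
    if edge.synchronous:
        score += 2  # failure interrupts the journey immediately
    if edge.writes:
        score += 2  # failure or misuse has real business impact
    if edge.customer_visible:
        score += 1
    if not edge.substitutable:
        score += 2  # hard to replace, so rehearse in advance
    return score


edges = [
    DependencyEdge("complaint_agent", "model_route", True, False, True, True),
    DependencyEdge("complaint_agent", "tool_gateway", True, True, True, False),
    DependencyEdge("complaint_agent", "evidence_ledger", False, True, False, False),
]

# Most urgent dependency to exercise comes first.
ranked = sorted(edges, key=drill_priority, reverse=True)
for edge in ranked:
    print(edge.target, drill_priority(edge))
```

In a real program the weights would come from the business impact analysis rather than from constants in code.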

5. Degraded Mode Taxonomy

Degraded mode is a pre-approved operating state. It is not an ad hoc "turn some features off"; it has a trigger, allowed behavior, blocked behavior, an owner, evidence and a recovery condition.

5.1 Taxonomy Matrix

| Mode | Allowed behavior | Blocked behavior | Typical trigger |
| --- | --- | --- | --- |
| Normal | Full approved AI workflow within policy | Out-of-bound use | All dependencies healthy |
| Conservative generation | Short answer, citation, lower temperature, no speculation | Broad reasoning, unsupported advice | Model quality KRI yellow |
| Citation-required | Answer only with authoritative sources | Unsourced conclusion | RAG confidence degraded |
| Template-only | Pre-approved static language | Free-form customer commitments | Customer-visible regulated topic |
| Draft-only | Employee sees draft and must approve | Auto-send or auto-submit | Evidence, policy, or model not fully healthy |
| Read-only | Retrieve and summarize | Any write, notification, case closure, credit action | Tool, identity, or policy degraded |
| Manual-first | AI supports queueing and retrieval only | AI recommendation or decision framing | HITL or model risk high |
| Cache-assisted | Recently validated low-risk answers | Personalized, time-sensitive, eligibility answers | Vendor outage with fresh cache |
| Safe-stop | Stop a workflow category | Continuing AI output | Multiple critical controls degraded |

5.2 Trigger Design

| Trigger | Degraded mode |
| --- | --- |
| Model p95 latency exceeds tolerance but quality normal | Conservative generation or fallback model |
| Model quality eval fails Tier 1 golden set | Draft-only or safe-stop for affected workflow |
| RAG index freshness breach for customer policy | Template-only or citation-required |
| Vector ACL uncertainty | Safe-stop for sensitive retrieval; no personalization |
| Tool gateway side-effect uncertainty | Read-only |
| Identity entitlement unavailable | No sensitive data, no personalized answer, manual verification |
| Policy engine unavailable | Deny high-risk action, allow low-risk template where policy snapshot valid |
| HITL queue > capacity threshold | Risk-prioritized manual-first, stop low-value escalations |
| Evidence export degraded | Draft-only for customer-visible Tier 1; local evidence ledger |
| Eval pipeline unavailable after high-risk change | Freeze release and freeze recovery to normal |
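
The trigger table can be read as a lookup from active triggers to a mode. The sketch below shows that lookup under two loud assumptions that are not in the playbook: modes are placed on a single strictness scale (in practice read-only and draft-only restrict different things), and an unrecognized trigger fails closed to safe-stop. The trigger keys are hypothetical names, and where the table offers two options the sketch picks one.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Degraded modes on an ASSUMED linear strictness scale (higher = stricter)."""
    NORMAL = 0
    CONSERVATIVE = 1
    CITATION_REQUIRED = 2
    TEMPLATE_ONLY = 3
    DRAFT_ONLY = 4
    READ_ONLY = 5
    MANUAL_FIRST = 6
    SAFE_STOP = 7


# Hypothetical trigger names mapped to one of the modes listed for that trigger.
TRIGGER_TO_MODE = {
    "model_latency_breach": Mode.CONSERVATIVE,
    "rag_freshness_breach": Mode.TEMPLATE_ONLY,
    "evidence_export_degraded": Mode.DRAFT_ONLY,
    "tool_side_effect_uncertain": Mode.READ_ONLY,
    "hitl_queue_over_capacity": Mode.MANUAL_FIRST,
    "vector_acl_uncertain": Mode.SAFE_STOP,
}


def select_mode(active_triggers: list[str]) -> Mode:
    """Return the strictest mode implied by the active triggers.

    Unknown triggers map to SAFE_STOP (assumed fail-closed default).
    """
    modes = [TRIGGER_TO_MODE.get(t, Mode.SAFE_STOP) for t in active_triggers]
    return max(modes, default=Mode.NORMAL)


print(select_mode([]).name)
print(select_mode(["model_latency_breach", "evidence_export_degraded"]).name)
```

A per-workflow table (for example, different mappings for Tier 1 and Tier 2) would follow the same shape with the risk tier added to the lookup key.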

6. RTO, RPO And AI SLOs

Traditional DR metrics must be translated for AI.

6.1 Definitions

| Metric | AI interpretation |
| --- | --- |
| RTO | Time to restore controlled service, not necessarily full automation |
| RPO | Acceptable loss window for trace, retrieved source IDs, prompts, policy decisions, tool calls and human review records |
| SLO | Service level objective for availability, latency, groundedness, policy compliance, HITL queue, evidence completeness and fallback success |
| SLI | Measured indicator such as p95 latency, citation support, unsupported answer rate, queue age or trace completeness |
| Error budget | Allowed degradation before forced mode change, not just downtime allowance |

6.2 Suggested AI Resilience Targets

| Workflow tier | RTO | RPO | Key SLO |
| --- | --- | --- | --- |
| Tier 1 customer-visible regulated | Restore controlled service within 30-60 minutes | No unrecoverable evidence gap | Evidence completeness, no unsupported commitment, queue deadline protection |
| Tier 1 employee decision support | Restore read-only or draft-only within 60 minutes | Trace gap under approved local capture window | Source trace, human approval, no high-risk auto action |
| Tier 2 customer service | Restore template or cached service within 2 hours | Recoverable transcript and source references | Template availability, escalation SLA |
| Tier 2 operations support | Restore internal knowledge search within 4 hours | Recoverable query and answer log | Search freshness, manual SOP access |
| Tier 3 productivity assistant | Restore when platform stable | Standard log retention | Cost and availability |

6.3 SLO Catalog

| SLO | Example threshold |
| --- | --- |
| Model route availability | Tier 1 approved route or fallback route available 99.5% during business hours |
| Grounded answer rate | Tier 1 customer-visible answers with citation support >= 99% |
| Policy freshness | Critical policy update indexed or template updated within approved window |
| Tool safety | 100% high-risk write actions have approval ID and idempotency key |
| HITL queue age | Tier 1 escalations under regulatory or customer harm threshold |
| Evidence completeness | Tier 1 traces include model, prompt, source IDs, policy decision, tool call, human action |
| Fallback activation time | Dependency health trigger to mode switch under defined minutes |
| Recovery gate quality | Restart sample review pass before automation resumes |
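
As an illustration of how an SLI behind the evidence completeness SLO might be computed, the sketch below counts the share of traces that carry all six fields named in the catalog. The field keys and the convention that "no tool was called" is recorded as an explicit value such as `"none"` (rather than left empty) are assumptions made for the example.

```python
# Field names are hypothetical keys for the six items listed in the SLO catalog.
REQUIRED_FIELDS = (
    "model", "prompt", "source_ids", "policy_decision", "tool_call", "human_action",
)


def is_complete(trace: dict) -> bool:
    """A trace is complete when every required field is present and non-empty."""
    return all(trace.get(field) not in (None, "", []) for field in REQUIRED_FIELDS)


def evidence_completeness(traces: list[dict]) -> float:
    """Fraction of traces that are complete; an empty window counts as 1.0 (assumed)."""
    if not traces:
        return 1.0
    return sum(is_complete(t) for t in traces) / len(traces)


traces = [
    {"model": "route-a", "prompt": "v12", "source_ids": ["policy-17"],
     "policy_decision": "allow", "tool_call": "none", "human_action": "approved"},
    {"model": "route-a", "prompt": "v12", "source_ids": [],  # missing sources
     "policy_decision": "allow", "tool_call": "none", "human_action": "approved"},
]

print(evidence_completeness(traces))
```

Comparing this ratio with the approved threshold for Tier 1 is what would feed the mode-switch trigger.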

7. Manual Fallback Design

Manual fallback is not “let humans handle it”. It is a designed operating model.

7.1 Manual Fallback Components

| Component | Design requirement |
| --- | --- |
| Scope | Which intents, products, segments, channels and regions enter manual mode |
| Queue | Dedicated queue with risk tier, deadline, customer harm and skill tags |
| Staffing | Named teams, surge pool, hours, handoff and fatigue controls |
| Decision aids | Static SOP, policy snapshots, templates, approved calculators, evidence viewer |
| Authority | Who can approve, reject, override, remediate and communicate |
| Evidence | Manual action record, reason, source, timestamp, approver and customer impact |
| SLA | Queue age, regulatory deadline, customer callback and internal escalation |
| Exit | Criteria to move from manual-first back to draft-only or normal |

7.2 Manual Fallback Flow

degradation trigger
  -> freeze risky automation
  -> classify active requests by risk and deadline
  -> route Tier 1 cases to skilled reviewers
  -> provide policy snapshot and evidence viewer
  -> record manual decision and customer communication
  -> monitor backlog and breach risk
  -> recover automation only after gate approval

7.3 Queue Priority Rules

| Priority | Criteria |
| --- | --- |
| P0 | Potential customer harm, regulatory deadline, irreversible action, privacy exposure |
| P1 | Customer-visible complaint, fee, credit, fraud, KYC / AML, wealth suitability |
| P2 | Employee productivity, internal summary, non-deadline operations |
| P3 | Low-risk general information or training support |
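
The priority rules above can be sketched as a small classifier that routes cases into the manual queue. The `Case` flags below are hypothetical simplifications of the criteria in the table; a real queue would derive them from case metadata and deadline calendars.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """A manual-queue case with hypothetical flags derived from the criteria table."""
    case_id: str
    customer_harm: bool = False
    regulatory_deadline: bool = False
    irreversible_action: bool = False
    privacy_exposure: bool = False
    customer_visible_regulated: bool = False  # complaint, fee, credit, fraud, KYC / AML, wealth
    internal_operations: bool = False         # productivity, internal summary


def priority(case: Case) -> str:
    """Return P0-P3, checking the most severe criteria first."""
    if (case.customer_harm or case.regulatory_deadline
            or case.irreversible_action or case.privacy_exposure):
        return "P0"
    if case.customer_visible_regulated:
        return "P1"
    if case.internal_operations:
        return "P2"
    return "P3"


queue = [
    Case("c-3", internal_operations=True),
    Case("c-1", customer_visible_regulated=True, regulatory_deadline=True),
    Case("c-4"),
    Case("c-2", customer_visible_regulated=True),
]

# "P0" < "P1" < "P2" < "P3" as strings, so a plain sort puts the most urgent first.
ordered = sorted(queue, key=priority)
print([(c.case_id, priority(c)) for c in ordered])
```

Within a priority band, a real queue would add a secondary sort key such as time to deadline.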

8. Customer Communication

Degraded AI service must be communicated carefully. The message should avoid over-disclosing technical internals while being accurate about service state, customer impact and next steps.

8.1 Communication Principles

| Principle | Meaning |
| --- | --- |
| Accuracy | Do not claim AI is working normally when high-risk paths are degraded |
| Boundaries | Make clear when information is general, draft, or pending human review |
| No false reassurance | Avoid saying there is no impact before evidence confirms it |
| Timeliness | Provide updates before regulatory or service deadlines are missed |
| Consistency | Customer, branch, call center, complaint and digital channels use aligned language |
| Evidence | Keep the exact message version and affected population record |

8.2 Customer-Facing States

| State | Customer message posture |
| --- | --- |
| Normal | Standard disclosure and service response |
| Reduced automation | “Some requests are being reviewed by a specialist before completion.” |
| Manual review | “We are reviewing this manually and will respond by the stated timeframe.” |
| Delayed service | “This request may take longer than usual; urgent account or safety concerns can use [approved channel].” |
| Correction | “We are correcting information previously provided and will explain any customer-specific impact.” |

8.3 Regulated Workflow Messaging

| Workflow | Communication control |
| --- | --- |
| Complaint | Preserve deadline, avoid premature conclusion, route to complaint specialist |
| Credit | Do not generate eligibility, pricing or adverse-action reasons without approved process |
| AML / Fraud | Do not expose investigation logic; communicate operationally approved next steps |
| Wealth | Advisor owns client communication; AI output remains internal support |
| Fees / disputes | Use approved fee, dispute and rights language only |

9. Evidence Preservation

Evidence is part of resilience. If the organization cannot reconstruct what the AI saw, decided, retrieved, proposed, blocked, escalated and communicated, recovery is incomplete.

9.1 Evidence Fields

| Field | Why it matters |
| --- | --- |
| Request ID / trace ID | Links channels, AI, tools and case systems |
| Use case and risk tier | Drives severity and retention |
| Customer / case reference | Identifies affected population |
| Model provider, model ID, version | Supports model route reconstruction |
| Prompt / policy bundle version | Explains behavior boundary |
| Retrieved source IDs and index version | Validates grounding and policy freshness |
| Policy engine decision ID | Shows allow, deny, escalate or fallback |
| Tool calls and side effects | Identifies real-world impact |
| HITL action and reviewer role | Shows oversight |
| Degraded mode state | Proves system was operating under approved constraints |
| Customer-visible message version | Supports remediation and complaint response |
| Recovery gate result | Shows why automation restarted |

9.2 Local Evidence Ledger

When vendor evidence export is degraded, Tier 1 workflows should write a local append-only record:

trace_id
timestamp
workflow
risk_tier
mode
model_route
prompt_version
source_ids
policy_decision
tool_action
human_review
customer_visible_flag
fallback_reason
hash_of_output

This ledger should be:

  • append-only.
  • access-controlled.
  • time-synchronized.
  • retained under approved policy.
  • reconciled with vendor export after recovery.
  • included in audit evidence binder.
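
A minimal sketch of such a ledger follows. It is in-memory only and uses a SHA-256 hash chain to make tampering detectable; the hash chain, the `prev_hash` / `entry_hash` fields and the class name are assumptions added for illustration, and a real ledger would also need durable, access-controlled storage and synchronized clocks.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes identically.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class LocalEvidenceLedger:
    """In-memory, hash-chained, append-only record of degraded-period outputs."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, record: dict) -> dict:
        body = dict(record)
        body.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
        body["prev_hash"] = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body["entry_hash"] = _digest(body)  # hash covers every field except itself
        self._entries.append(body)
        return dict(body)  # return a copy so callers cannot edit the stored entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True


ledger = LocalEvidenceLedger()
ledger.append({
    "trace_id": "t-001", "workflow": "complaint_response", "risk_tier": 1,
    "mode": "draft_only", "model_route": "fallback", "prompt_version": "v12",
    "source_ids": ["policy-fee-001"], "policy_decision": "escalate",
    "tool_action": None, "human_review": "pending", "customer_visible_flag": True,
    "fallback_reason": "evidence_export_degraded",
    "hash_of_output": hashlib.sha256(b"draft text").hexdigest(),
})
ledger.append({"trace_id": "t-002", "workflow": "complaint_response", "mode": "draft_only"})
print(ledger.verify())
```

Storing only `hash_of_output` keeps customer text out of the ledger while still letting reconciliation confirm that the vendor export matches what was produced.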

10. BCP / DR Operating Model

10.1 Decision Rights

| Decision | Accountable role |
| --- | --- |
| Declare AI operational degradation | Incident Commander / Operational Resilience Lead |
| Activate domain degraded mode | Business Owner + AI Product Owner |
| Disable model route | AI Platform Owner |
| Disable write tools | Security / Platform Owner |
| Freeze release or recovery | Model Risk / AI Governance |
| Prioritize HITL queue | Operations Executive |
| Approve customer communication | Business Owner + Legal / Compliance |
| Accept residual risk during extended degradation | Senior management risk committee |
| Restore normal automation | Business Owner + Platform + Model Risk + Compliance |

10.2 Recovery Gate

Normal mode must not resume just because the vendor status page turns green.

Recovery gate should include:

  • dependency health stable for defined window.
  • no unresolved critical KRI breach.
  • regression eval pass for affected workflows.
  • evidence sync or approved local ledger reconciliation.
  • sampled human review of degraded-period outputs.
  • tool side-effect reconciliation.
  • HITL queue within tolerance or approved backlog plan.
  • business owner signoff.
  • model risk / compliance signoff for Tier 1.
  • restart decision recorded.
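
The gate can be expressed as an explicit checklist that must be fully satisfied before normal mode resumes. The sketch below mirrors the ten items above as boolean fields; the field names are hypothetical, and treating the model risk / compliance signoff as mandatory only for Tier 1 follows the wording of the list rather than any specific policy.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RecoveryGate:
    """One boolean per recovery gate item (hypothetical field names)."""
    dependency_stable: bool
    no_critical_kri_breach: bool
    regression_eval_passed: bool
    evidence_reconciled: bool
    sample_review_passed: bool
    tool_side_effects_reconciled: bool
    hitl_queue_within_tolerance: bool
    business_signoff: bool
    risk_compliance_signoff: bool
    restart_decision_recorded: bool


def failed_checks(gate: RecoveryGate, risk_tier: int) -> list[str]:
    """List the gate items that still block a return to normal mode."""
    failed = [f.name for f in fields(gate) if not getattr(gate, f.name)]
    if risk_tier != 1:
        # Per the list above, this signoff is stated for Tier 1 only.
        failed = [name for name in failed if name != "risk_compliance_signoff"]
    return failed


def may_resume_normal(gate: RecoveryGate, risk_tier: int) -> bool:
    return not failed_checks(gate, risk_tier)


all_green = RecoveryGate(*([True] * 10))
vendor_green_only = replace(all_green, regression_eval_passed=False,
                            evidence_reconciled=False)

print(may_resume_normal(all_green, risk_tier=1))
print(failed_checks(vendor_green_only, risk_tier=1))
```

Returning the list of failed items, rather than a bare boolean, gives the restart memo a record of what was still open at each decision point.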

11. Tabletop Exercises

Exercises prove whether BCP is executable.

11.1 Exercise Types

| Exercise | Purpose |
| --- | --- |
| Tabletop | Test decision rights, communication and tradeoffs |
| Technical failover drill | Test routing, fallback model, index rollback, tool disable |
| Evidence drill | Test local ledger, export reconciliation, audit pack |
| HITL surge drill | Test manual queue capacity and prioritization |
| Vendor exit tabletop | Test contract, data export, replacement route and executive decisions |
| Full scenario simulation | Combine model, RAG, policy, HITL and evidence degradation |

11.2 Scenario 1: Model Provider Outage

At 09:15, Model Provider A reports elevated errors and rate limiting.
Customer service AI, lending assistant and complaint agent use Provider A.
RAG and tools are healthy.
Evidence export remains healthy.

Expected decisions:

  • Switch Tier 2 general service to fallback model or template.
  • Move Tier 1 complaint responses to draft-only.
  • Keep credit reason explanations under human confirmation.
  • Monitor fallback quality and latency.
  • Notify operations of reduced automation.

11.3 Scenario 2: RAG Index Stale

Policy team updated fee waiver and dispute policies at 07:00.
RAG index refresh failed silently.
Customer-facing AI continues retrieving yesterday's policy.

Expected decisions:

  • Disable free-form fee / dispute answers.
  • Use approved templates and authoritative policy search.
  • Query affected customer-visible answers since 07:00.
  • Preserve retrieved source IDs and policy effective dates.
  • Recover only after index validation and sample review.
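
The exposure query in this scenario can be sketched as a filter over evidence records: customer-visible answers on the affected topics, served after the policy change, from an index built before it. The record keys `topic` and `index_built_at` and the sample data are hypothetical; they assume the evidence store captures the index version alongside each answer.

```python
from datetime import datetime


def affected_answers(records: list[dict], policy_updated_at: datetime,
                     topics: set[str]) -> list[dict]:
    """Customer-visible answers on the given topics that were served from a stale index."""
    return [
        r for r in records
        if r["customer_visible_flag"]
        and r["topic"] in topics
        and r["timestamp"] >= policy_updated_at
        and r["index_built_at"] < policy_updated_at
    ]


policy_updated_at = datetime(2026, 1, 15, 7, 0)   # illustrative date and time
stale_index = datetime(2026, 1, 14, 23, 0)        # last successful index build

records = [
    {"trace_id": "t-1", "topic": "fee_waiver", "customer_visible_flag": True,
     "timestamp": datetime(2026, 1, 15, 8, 15), "index_built_at": stale_index},
    {"trace_id": "t-2", "topic": "fee_waiver", "customer_visible_flag": True,
     "timestamp": datetime(2026, 1, 15, 6, 30), "index_built_at": stale_index},
    {"trace_id": "t-3", "topic": "branch_hours", "customer_visible_flag": True,
     "timestamp": datetime(2026, 1, 15, 8, 30), "index_built_at": stale_index},
    {"trace_id": "t-4", "topic": "dispute", "customer_visible_flag": False,
     "timestamp": datetime(2026, 1, 15, 8, 40), "index_built_at": stale_index},
]

exposed = affected_answers(records, policy_updated_at, {"fee_waiver", "dispute"})
print([r["trace_id"] for r in exposed])
```

The resulting list is the population for sample review and, where needed, customer correction.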

11.4 Exercise Scoring

| Dimension | Pass criteria |
| --- | --- |
| Detection | Trigger recognized within target time |
| Decision rights | Correct owner makes decision |
| Mode switch | System enters approved degraded state |
| Customer protection | High-risk customer-visible output blocked or reviewed |
| Manual fallback | Queue routing and staffing work |
| Evidence | Required fields captured |
| Communication | Internal and external messages aligned |
| Recovery | Restart gate used before normal mode |
| Improvement | Gaps become funded actions |

12. RACI

| Activity | Accountable | Responsible | Consulted | Informed |
| --- | --- | --- | --- | --- |
| Critical operation mapping | Business Executive | AI PM / Senior BA | Enterprise Architecture, Risk, Compliance | Operations, Audit |
| AI dependency graph | Enterprise Architect | AI Platform + Product Architect | Security, Data, Vendor Owner | Business Owners |
| Impact tolerance approval | Senior Management / Risk Committee | Operational Resilience Lead | Legal, Compliance, Model Risk | Board Risk Committee as appropriate |
| Degraded mode design | AI Product Owner | Product Architect + Platform Owner | Business Ops, Compliance, Security | Customer Ops |
| Manual fallback plan | Operations Executive | Queue Owners / Workforce Planning | Business Owner, Compliance | Frontline Teams |
| Evidence preservation design | Governance / Audit Evidence Owner | Platform Engineering | Legal, Privacy, Security | Internal Audit |
| Tabletop exercise | Operational Resilience Lead | BCP Team + AI Platform | Vendor, Legal, Business, Risk | Senior Management |
| Trigger monitoring | Platform Owner | SRE / AI Ops | Product, Model Risk | Business Ops |
| Customer communication | Business Owner | Customer Ops / Communications | Legal, Compliance | Frontline Teams |
| Recovery approval | Business Owner + Model Risk | Platform + Product | Compliance, Security, Operations | Senior Management |

13. Templates

13.1 AI BCP Use Case Card

| Field | Example |
| --- | --- |
| Use case | Complaint response agent |
| Critical operation | Complaint handling |
| Risk tier | Tier 1 |
| Customer-visible | Yes |
| Regulatory exposure | Complaint deadlines, consumer protection, recordkeeping |
| Normal AI behavior | Draft and send approved responses after policy and human checks |
| Minimum viable service | Draft-only, manual review, deadline priority |
| Degraded modes | Template-only, draft-only, read-only, manual-first, safe-stop |
| RTO | Controlled service within 60 minutes |
| RPO | No unrecoverable trace gap |
| Recovery gate | Eval, evidence sync, sample review, business signoff |

13.2 Degraded Mode Decision Card

| Field | Filled example |
| --- | --- |
| Trigger | Evidence export API degraded for Tier 1 complaint workflow |
| Mode | Draft-only + local evidence ledger |
| Allowed | Generate internal draft from approved template, route to complaint analyst |
| Blocked | Auto-send, free-form promise, CRM closure |
| Decision owner | AI Incident Commander + Complaint Business Owner |
| Evidence required | Trace ID, prompt version, source IDs, policy decision, output hash |
| Customer impact | Response may require manual review; deadlines protected |
| Recovery condition | Vendor export restored, local ledger reconciled, 30-sample review passed |

13.3 Recovery Decision Memo

| Field | Filled example |
| --- | --- |
| Workflow | Complaint response agent |
| Incident window | 2026-06-30 09:15-13:40 CT |
| Degraded mode | Draft-only + local evidence ledger |
| Customer impact | No auto-send; 42 cases manually reviewed |
| Evidence status | Local ledger reconciled; 0 missing Tier 1 traces |
| Regression status | Complaint policy golden set and citation support passed |
| Tool reconciliation | No automated CRM closure during degraded window |
| Decision | Restore approved complaint intents with 24-hour enhanced monitoring |
| Approvers | Business Owner, Platform Owner, Model Risk, Compliance |

13.4 Communication And Executive Brief Template

| Audience | Required content |
| --- | --- |
| Customer | Reduced automation, manual review, expected response path, urgent channel |
| Frontline | Current mode, allowed actions, blocked actions, queue and next update |
| Executive | Situation, impact, controls, decisions needed, recovery path, residual risk |
| Regulator / examiner | System scope, affected population, controls, evidence, remediation |

14. Governance Cadence

| Cadence | Forum | Decisions / outputs |
| --- | --- | --- |
| Daily during degradation | AI operational bridge | Mode state, backlog, customer impact, evidence status, next decision |
| Weekly | AI operations review | SLO breaches, degraded-mode activations, queue health, vendor notices |
| Monthly | AI governance / model risk committee | KRI trends, recovery exercise gaps, risk acceptance, control effectiveness |
| Quarterly | Operational resilience review | Critical operation map, impact tolerance, scenario exercise, third-party dependency |
| Semiannual | Executive tabletop | Senior decision rights, customer communication, board / regulator brief |
| Annual | Full BCP / DR exercise | Technical failover, manual fallback, evidence recovery, vendor exit scenario |
| Event-driven | Major model, RAG, tool, policy, identity, evidence or vendor change | Impact assessment, regression scope, degraded mode update |

15. 30-Day Lab

Goal: within 30 days, complete a presentable AI Operational Resilience / BCP / Degraded Mode Architecture portfolio pack. Recommended choices are the Complaint response agent, Lending policy assistant, AML case summarizer or Customer service AI.

| Day | Task | Artifact |
| --- | --- | --- |
| 1 | Choose one retail financial services AI workflow and define its customer-visible and regulated boundary | Use Case Boundary Card |
| 2 | Map the critical operation and core business line | Critical Operation Map |
| 3 | Write the business impact analysis covering customers, regulation, operations and evidence | AI BIA Worksheet |
| 4 | List model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies | Dependency Register |
| 5 | Draw the dependency graph and edge attributes | Dependency Graph |
| 6 | Define impact tolerance | Impact Tolerance Memo |
| 7 | Design the minimum viable service | Minimum Service Definition |
| 8 | Design the degraded mode taxonomy | Degraded Mode Matrix |
| 9 | Design the fallback for a model provider outage | Model Fallback Runbook |
| 10 | Design the fallback for RAG stale / ACL failure | RAG Degradation Runbook |
| 11 | Design the fallback for tool gateway / side effect failure | Tool Degradation Runbook |
| 12 | Design the fallback for identity / entitlement failure | Identity Degradation Runbook |
| 13 | Design the fallback for policy engine false allow / false deny | Policy Degradation Runbook |
| 14 | Design the fallback for HITL queue saturation | HITL Surge Plan |
| 15 | Design the fallback for evidence export failure | Evidence Preservation Plan |
| 16 | Define RTO / RPO / SLO / SLI | AI Resilience Metrics Table |
| 17 | Design the manual fallback queue and staffing model | Manual Operations Plan |
| 18 | Design the customer and internal communication matrix | Communication Matrix |
| 19 | Write the degraded mode decision card | Decision Card |
| 20 | Write the recovery gate and restart memo | Recovery Gate Pack |
| 21 | Design monitoring triggers | Trigger Dashboard Spec |
| 22 | Design RACI and decision rights | RACI |
| 23 | Design tabletop scenario 1: model outage | Scenario Script |
| 24 | Design tabletop scenario 2: RAG stale | Scenario Script |
| 25 | Design tabletop scenario 3: evidence failure | Scenario Script |
| 26 | Run one 90-minute tabletop and record the decision log | Exercise Decision Log |
| 27 | Turn exercise gaps into a remediation backlog | Remediation Register |
| 28 | Write the executive brief | Executive One-Pager |
| 29 | Write a 1500-2500 word portfolio case study | Case Study |
| 30 | Prepare 8 interview Q&As and a 5-minute narrative | Interview Story Pack |

16. Interview Answers

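Q1: How does AI operational resilience differ from ordinary BCP?

30 seconds:

Ordinary BCP focuses on service interruption, site failover, and the recovery of people and processes. AI operational resilience must also cover semantic correctness, RAG freshness, policy decisions, tool side effects, identity entitlement, HITL capacity and evidence completeness. AI can be unsafe without being down, so it needs normal, degraded and safe-stop states plus a recovery gate.

Q2: How do you design degraded mode for customer-visible AI?

30 seconds:

I would degrade by risk to citation-required, template-only, draft-only, manual-first or safe-stop. Content about fees, complaints, credit, wealth and customer rights must not be freely generated or auto-sent when RAG, policy, identity or evidence is incomplete. Before returning to normal, the workflow must pass eval, sample review, evidence sync and business signoff.

Q3: Is HITL a natural fallback?

30 seconds:

No. HITL is a control only when it has capacity, skills, priorities, SLAs, evidence and decision authority. Otherwise it becomes the bottleneck during an incident. AI BCP must design queue tiering, surge staffing, deadline routing, review payloads and backlog thresholds.

Q4: Why does an evidence export failure trigger AI degradation?

30 seconds:

Because high-risk AI output must be reconstructable, auditable and explainable. Without evidence, there is no way after an incident to prove the model version, prompt, retrieved sources, policy decision, tool action and human review. When evidence export fails, a Tier 1 customer-visible workflow should at minimum switch to draft-only and enable the local evidence ledger.

Q5: How do you handle a policy engine false allow?

30 seconds:

First switch high-risk write tools to read-only, query the side effects within the exposure window, preserve the policy decision log and tool ledger, and then carry out customer and business remediation. Before recovery, the workflow must pass policy regression, approval-chain verification and tool idempotency checks.

Q6: How are RTO and RPO defined for AI?

30 seconds:

AI RTO is the time to restore controlled service, not necessarily full automation. AI RPO is the acceptable loss window for evidence, source, prompt, tool and human review records. A Tier 1 customer-visible process may require template or manual service to be restored within 30-60 minutes, with no unrecoverable evidence gap.

Q7: How do you design the gate for returning to normal mode?

30 seconds:

A green vendor status page is not enough. The recovery gate has to look at dependency health, regression eval, sample review, evidence reconciliation, tool side-effect reconciliation, HITL backlog, customer impact and business / risk / compliance signoff.

Q8: How do you explain the value of AI BCP to executives?

30 seconds:

I would say AI BCP does not obstruct automation; it protects critical operations. When an AI dependency fails, the institution needs to know which services continue, which degrade, which actions stop, who decides, how customers are told, how evidence is preserved and when to recover. That reduces customer harm, regulatory risk and chaotic outages.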


17. Self-Assessment Checklist

| Check | Passing evidence |
| --- | --- |
| Critical operation mapping | AI workflows linked to customer journeys, core business lines and regulatory obligations |
| Dependency graph | Model, RAG, tool, identity, policy, eval, HITL, vendor and evidence dependencies mapped |
| Impact tolerance | Business-approved tolerance for downtime, backlog, misinformation and evidence loss |
| Degraded mode taxonomy | Normal, conservative, citation-required, template-only, draft-only, read-only, manual-first, safe-stop defined |
| Trigger logic | Dependency health and risk-tier triggers mapped to mode switch |
| Manual fallback | Queue, staffing, templates, authority, SLA and evidence designed |
| Customer communication | Reduced automation, manual review, delay and correction states prepared |
| Evidence preservation | Local ledger and reconciliation design exists |
| Tabletop exercises | Model, RAG, identity, policy, HITL and evidence scenarios exercised |
| Recovery gate | Eval, evidence, tool reconciliation, queue, business and risk signoff required |
| Governance | RACI, cadence, risk acceptance and remediation tracking active |

18. Final Principle

The maturity of AI operational resilience can be tested with one sentence:

When AI is degraded, do we already know which customer promises stop, which operations continue, which humans take over, which evidence is preserved, which executives decide, and what proof is required before automation returns?

If the answer is unclear, the enterprise has only shipped an AI feature; it has not yet made AI a critical operational capability.