
AI Operating Model / RACI Runbook

These sources serve as learning anchors and do not constitute legal, compliance, or operational advice.

557AI_OPERATING_MODEL_RACI_RUNBOOK.md


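Positioning: An enterprise AI operating model handbook for AI BAs, AI PMs, AI Solutions Architects, and Enterprise Architects.

Goal: Ensure that once an AI system goes live, someone owns it, someone monitors it, someone fixes it, someone approves changes, and someone runs the retrospective.

Core idea: An AI launch is not the finish line. Production AI needs operating ownership.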


Source Anchors


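| Anchor | Link | Usage |
| --- | --- | --- |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Organize the AI risk lifecycle and continuous monitoring |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Establish an AI management system, responsibilities, and continuous improvement |
| EU AI Act | https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng | Use a risk-based lens to think about high-risk use, human oversight, and documentation |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Build runbooks for LLM risks |
| TOGAF | https://www.opengroup.org/togaf | Manage change through architecture governance and a review board |
| SRE concepts | https://sre.google/ | Borrow the SLO, incident, postmortem, and error budget ideas |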

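1. Why Enterprise AI Needs an Operating Model

AI systems keep changing:

  • Model versions change.
  • Prompts change.
  • The knowledge base changes.
  • Policies change.
  • User behavior changes.
  • Vendor behavior changes.
  • Risks and regulatory requirements change.

Without an operating model, common problems include:

  • The knowledge base goes stale and nobody maintains it.
  • A prompt is changed but nobody runs the regression suite.
  • Users do not trust the system but nobody collects the reasons.
  • The model produces wrong output but nobody assigns a severity.
  • Risk assumes Product is responsible, and Product assumes Engineering is responsible.
  • A vendor updates the model and quality drops.
  • There is no review after launch, only remediation after incidents.

An operating model must answer: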

Who owns the AI capability?
Who owns data and knowledge?
Who approves changes?
Who monitors quality?
Who handles incidents?
Who decides release and rollback?
Who proves business value?

2. Core Roles

| Role | Responsibility |
| --- | --- |
| AI Product Owner | owns value, scope, roadmap, adoption, release decision |
| Business Process Owner | owns workflow performance and operational policy |
| AI BA | owns process evidence, requirements, stakeholder alignment |
| Solution Architect | owns architecture, integration, controls, NFRs, rollback |
| Data Owner | owns source data quality, access, retention, lineage |
| Knowledge Owner | owns policy/content accuracy, versioning, freshness |
| Model / Platform Owner | owns model gateway, platform reliability, provider changes |
| EvalOps Owner | owns golden set, eval runner, release gate, quality dashboard |
| Risk / Compliance | owns risk classification, controls, oversight, audit evidence |
| Security / Privacy | owns IAM, data protection, vendor/security review |
| Operations Lead | owns day-to-day workflow adoption and frontline support |
| Vendor Manager | owns vendor SLA, contract, incidents, review cadence |
| Frontline Champion | owns user feedback, training, trust-building |

3. RACI: Use Case Intake

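| Activity | PM | BA | Process Owner | Architect | Risk | Data | Ops |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Identify business problem | A | R | R | C | C | I | C |
| Define baseline metrics | A | R | R | I | C | C | R |
| Assess AI fit / no-AI option | A | R | C | C | C | C | C |
| Initial risk tier | C | C | C | C | A/R | C | I |
| Discovery decision | A | R | A | C | C | C | C |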
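
A RACI table is easier to keep honest when it is also stored as data and checked. The sketch below encodes the Use Case Intake rows and flags any activity that does not have exactly one Accountable role. The single-Accountable rule is a common RACI convention assumed here for illustration; it is not a rule stated in this runbook, and the function names are hypothetical.

```python
ROLES = ["PM", "BA", "Process Owner", "Architect", "Risk", "Data", "Ops"]

# Rows copied from the Use Case Intake RACI in this section.
INTAKE_RACI = {
    "Identify business problem": ["A", "R", "R", "C", "C", "I", "C"],
    "Define baseline metrics": ["A", "R", "R", "I", "C", "C", "R"],
    "Assess AI fit / no-AI option": ["A", "R", "C", "C", "C", "C", "C"],
    "Initial risk tier": ["C", "C", "C", "C", "A/R", "C", "I"],
    "Discovery decision": ["A", "R", "A", "C", "C", "C", "C"],
}


def accountable_roles(cells, roles=ROLES):
    """Return the roles whose cell contains 'A' (either 'A' or 'A/R')."""
    return [role for role, cell in zip(roles, cells) if "A" in cell.split("/")]


def raci_findings(table, roles=ROLES):
    """Map each activity without exactly one Accountable to its Accountable roles."""
    findings = {}
    for activity, cells in table.items():
        owners = accountable_roles(cells, roles)
        if len(owners) != 1:  # assumed convention: one Accountable per activity
            findings[activity] = owners
    return findings


findings = raci_findings(INTAKE_RACI)
```

Here `findings` contains only "Discovery decision", where both PM and Process Owner are marked Accountable. Whether that shared accountability is intended is a governance decision; the check only makes it visible.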

4. RACI: Data and Knowledge Readiness

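| Activity | PM | BA | Architect | Data Owner | Knowledge Owner | Security | Risk |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source inventory | C | R | C | A/R | R | C | C |
| Data classification | I | C | C | R | C | A/R | C |
| Access control design | I | C | R | C | C | A/R | C |
| Knowledge versioning | C | C | C | I | A/R | C | C |
| Retention and logging | C | C | R | C | C | A/R | A/R |
| Data readiness sign-off | C | R | C | A/R | A/R | C | C |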

5. RACI: Prompt / Model / Knowledge Change

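| Activity | PM | Architect | EvalOps | Knowledge Owner | Platform Owner | Risk | Ops |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Change request | A/R | C | C | R | C | C | C |
| Impact assessment | A | R | R | R | C | C | C |
| Regression eval | C | C | A/R | C | C | C | I |
| Risk review | C | C | C | C | I | A/R | I |
| Release approval | A | R | R | C | R | C | C |
| Rollback decision | A | R | R | C | R | C | R |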

6. RACI: Incident Response

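| Activity | PM | Ops | Architect | EvalOps | Risk | Security | Vendor |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Triage incident | A | R | R | R | C | C | C |
| Severity classification | A | R | C | C | A/R | C | C |
| Containment | A | R | R | C | C | A/R | R |
| User/customer communication | A/R | R | I | I | C | C | I |
| Root cause analysis | C | R | R | R | C | C | C |
| Corrective action | A | R | R | R | C | C | C |
| Postmortem | A | R | R | R | A/R | C | C |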

7. Governance Cadence

| Cadence | Meeting | Inputs | Decisions |
| --- | --- | --- | --- |
| Daily | Pilot standup | incidents, feedback, defects | quick fixes, user support |
| Weekly | AI quality review | eval results, overrides, complaints | prompt/index/action fixes |
| Biweekly | Product/ops adoption review | adoption dashboard, workflow metrics | training, rollout adjustments |
| Monthly | AI risk/governance review | risk register, incidents, model changes | risk acceptance, controls |
| Quarterly | AI capability review | ROI, cost, quality trend, vendor review | scale, refresh, retire |

8. Required Operating Artifacts

  • Use case inventory.
  • Owner map.
  • RACI.
  • Data and knowledge owner registry.
  • Prompt/model/index version registry.
  • Eval dashboard.
  • Incident log.
  • Risk register.
  • Change log.
  • Adoption dashboard.
  • Vendor review log.
  • Quarterly capability review.
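
One way to picture the prompt/model/index version registry is as a history of released "bundles" with a rollback operation. The following is a minimal in-memory sketch under that assumption; `VersionBundle`, `BundleRegistry`, and the version strings are hypothetical names, and a real registry would persist its history and record who approved each release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionBundle:
    """One released combination of prompt, model, and index versions."""
    prompt: str
    model: str
    index: str


class BundleRegistry:
    """Release history whose last entry is the active bundle."""

    def __init__(self):
        self._history = []

    def release(self, bundle):
        self._history.append(bundle)

    @property
    def active(self):
        return self._history[-1] if self._history else None

    def rollback(self):
        """Drop the active bundle and return the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no previous bundle to roll back to")
        self._history.pop()
        return self._history[-1]


registry = BundleRegistry()
registry.release(VersionBundle(prompt="p-1", model="m-1", index="i-1"))
registry.release(VersionBundle(prompt="p-2", model="m-1", index="i-1"))
restored = registry.rollback()  # back to the p-1 bundle
```

Versioning the three parts together means a rollback restores a combination that was actually evaluated, rather than a mix of old and new components. An audit-friendly variant would record the rollback as a new entry instead of popping history.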

9. Runbook: Hallucination Incident

Trigger

  • Unsupported factual claim.
  • Wrong citation.
  • User reports incorrect answer.
  • QA flags hallucinated rationale.

Immediate containment

  1. Capture prompt, evidence IDs, model version, output, user role.
  2. Classify severity.
  3. If high risk, disable affected route or switch to fallback.
  4. Notify PM, Ops, EvalOps, Risk.

Diagnosis

  • Was evidence missing?
  • Was retrieval wrong?
  • Was policy stale?
  • Did prompt over-instruct?
  • Did model ignore evidence?
  • Did evaluator miss this case?

Corrective actions

  • Add failure to golden set.
  • Fix retrieval metadata.
  • Update prompt or output schema.
  • Add citation validator.
  • Add human review for similar cases.

Postmortem

  • What failed?
  • Why did gate not catch it?
  • Which control will prevent recurrence?
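
The "citation validator" corrective action can be sketched as a check that every source cited in an answer was actually among the retrieved evidence. The bracketed `[DOC-12]` citation format, the regular expression, and the function name are assumptions made for this example; the runbook does not specify a citation format.

```python
import re

# Assumed citation format: an ID in square brackets, e.g. [DOC-12].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def unsupported_citations(answer, evidence_ids):
    """Return cited IDs in the answer that are not among the retrieved evidence IDs."""
    cited = CITATION_PATTERN.findall(answer)
    allowed = set(evidence_ids)
    return sorted({cid for cid in cited if cid not in allowed})


answer = "Wire fees are waived for premium accounts [DOC-12]. Limits reset daily [DOC-99]."
bad = unsupported_citations(answer, evidence_ids=["DOC-12", "DOC-30"])
```

In this example `bad` is `["DOC-99"]`, because that ID was never retrieved. The check only proves that a cited source exists in the evidence set; it does not prove that the source supports the claim, so it complements rather than replaces evals and human review.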

10. Runbook: Prompt Injection / Tool Misuse

Trigger

  • Model follows instruction from retrieved document.
  • User asks model to bypass policy.
  • Tool call attempts unauthorized action.

Immediate containment

  1. Stop affected tool route if action risk exists.
  2. Preserve logs and tool traces.
  3. Notify Security, Architect, Risk, PM.
  4. Review access and action policy.

Diagnosis

  • Was retrieved content labeled as evidence?
  • Was instruction hierarchy clear?
  • Was tool allowlist enforced?
  • Did entitlement filter run?
  • Was red-team case in eval suite?

Corrective actions

  • Add prompt injection tests.
  • Harden tool gateway.
  • Add action approval.
  • Improve content sanitization.
  • Update security training.
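
The tool allowlist and action approval controls can be illustrated with a small authorization function that sits in the tool gateway. This is a sketch under assumed names: the roles, tool names, and the three-way result (`allow`, `needs_approval`, `deny`) are hypothetical and not taken from any specific gateway product.

```python
# role -> tools that role may invoke (hypothetical roles and tool names)
ALLOWLIST = {
    "agent": {"lookup_case", "draft_reply"},
    "supervisor": {"lookup_case", "draft_reply", "issue_refund"},
}

# Tools that change state and therefore require explicit human approval.
NEEDS_APPROVAL = {"issue_refund"}


def authorize_tool_call(role, tool, approved=False):
    """Return 'allow', 'needs_approval', or 'deny' for a requested tool call."""
    if tool not in ALLOWLIST.get(role, set()):
        return "deny"  # unknown roles and unlisted tools are denied by default
    if tool in NEEDS_APPROVAL and not approved:
        return "needs_approval"
    return "allow"


decision = authorize_tool_call("agent", "issue_refund")
```

The decision depends only on the caller's role and the approval flag, never on text produced by the model or found in a retrieved document. That is the property that limits the damage of a successful prompt injection.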

11. Runbook: Data Leakage

Trigger

  • Output includes unauthorized customer data.
  • Logs contain sensitive fields.
  • Cache served answer across permission boundary.

Immediate containment

  1. Disable affected cache/retrieval route.
  2. Preserve evidence for investigation.
  3. Notify Security/Privacy/Risk.
  4. Identify affected users/records.

Diagnosis

  • Was entitlement checked before retrieval?
  • Was cache key missing role/product/region/version?
  • Were logs redacted?
  • Did prompt include unnecessary data?

Corrective actions

  • Fix entitlement filter.
  • Fix cache key.
  • Redact logs.
  • Add leakage eval.
  • Update data minimization rule.
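
The diagnosis question about a cache key missing role/product/region/version points at a concrete fix: make the permission scope part of the key. The sketch below shows one possible construction using only the standard library; the function name, field values, and the lowercase normalization of the question are illustrative assumptions.

```python
import hashlib
import json


def cache_key(question, role, product, region, version):
    """Build a cache key that changes whenever the permission scope or version changes."""
    scope = {
        "question": question.strip().lower(),  # assumed normalization
        "role": role,
        "product": product,
        "region": region,
        "version": version,
    }
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    payload = json.dumps(scope, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


teller = cache_key("What is the fee?", "teller", "checking", "US", "kb-7")
auditor = cache_key("What is the fee?", "auditor", "checking", "US", "kb-7")
```

The two keys differ, so an answer cached for one role cannot be served to another. Including the knowledge version also means a reindex naturally invalidates older cached answers.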

12. Runbook: Model / Provider Outage

Trigger

  • Provider unavailable.
  • Latency above SLA.
  • Cost spike.
  • Model returns abnormal errors.

Immediate containment

  1. Route to fallback model/provider if approved.
  2. Degrade to retrieval-only or template path.
  3. Inform operations users.
  4. Monitor queue and retry.

Diagnosis

  • Vendor outage?
  • Rate limit?
  • Internal network?
  • Prompt/token explosion?
  • Tool dependency?

Corrective actions

  • Adjust routing.
  • Add circuit breaker.
  • Update capacity plan.
  • Review vendor SLA.
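
The fallback routing and circuit breaker items can be combined in a small sketch. This is a deliberately simplified breaker: it opens after a number of consecutive failures and stays open until someone calls `reset()`, with no timed half-open state. All class, function, and provider names are hypothetical.

```python
class CircuitBreaker:
    """Open after `threshold` consecutive failures; stay open until reset() is called."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1

    def reset(self):
        self.failures = 0


def answer(question, primary, fallback, breaker):
    """Try the primary provider unless the breaker is open; otherwise use the fallback."""
    if not breaker.is_open:
        try:
            result = primary(question)
            breaker.record(success=True)
            return result
        except Exception:  # a real gateway would catch provider-specific errors
            breaker.record(success=False)
    return fallback(question)


def failing_primary(question):
    raise TimeoutError("provider unavailable")


def template_fallback(question):
    return "Service is degraded; showing the standard template answer."


breaker = CircuitBreaker(threshold=2)
replies = [answer("fee?", failing_primary, template_fallback, breaker) for _ in range(3)]
```

After two failures the breaker opens, so the third request skips the primary provider entirely and goes straight to the template path. That keeps latency predictable during an outage and avoids piling retries onto a struggling vendor.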

13. Runbook: Eval Regression

Trigger

  • New prompt/model/index fails regression.
  • Critical case fails.
  • LLM judge drift detected.

Immediate containment

  1. Block release.
  2. Revert to previous version bundle.
  3. Open defect with owner.

Diagnosis

  • Prompt changed?
  • Model changed?
  • Index changed?
  • Rubric changed?
  • Data changed?

Corrective actions

  • Fix and rerun eval.
  • Add missing test cases.
  • Calibrate judge.
  • Document release decision.
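
The "block release" step implies an automated gate over eval results. A minimal sketch, assuming each eval run is a mapping from case ID to pass/fail: the gate blocks when any critical case fails or when the overall pass rate drops by more than a tolerance. The `max_drop` value of 0.02 and all names are illustrative assumptions, not thresholds recommended by this runbook.

```python
def release_gate(baseline, candidate, critical_ids, max_drop=0.02):
    """Decide whether a candidate bundle may ship.

    baseline and candidate map case ID -> True (pass) or False (fail).
    Both are assumed to be non-empty.
    """
    # A critical case that is missing from the candidate run counts as failed.
    failed_critical = sorted(cid for cid in critical_ids if not candidate.get(cid, False))
    base_rate = sum(baseline.values()) / len(baseline)
    cand_rate = sum(candidate.values()) / len(candidate)
    regressed = (base_rate - cand_rate) > max_drop
    return {
        "release": not failed_critical and not regressed,
        "failed_critical": failed_critical,
        "pass_rate_drop": round(base_rate - cand_rate, 4),
    }


baseline = {"c1": True, "c2": True, "c3": True, "c4": False}
candidate = {"c1": True, "c2": False, "c3": True, "c4": True}
decision = release_gate(baseline, candidate, critical_ids={"c2"})
```

In this example the overall pass rate is unchanged, yet the release is blocked because critical case `c2` now fails. Aggregate scores alone can hide exactly the regressions that matter most, which is why critical cases are checked separately.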

14. Runbook: Knowledge Staleness

Trigger

  • Old policy cited.
  • Product team updates fee or rule.
  • User flags outdated answer.

Immediate containment

  1. Mark stale source.
  2. Remove from retrieval or lower priority.
  3. Notify knowledge owner.
  4. Add temporary warning if needed.

Corrective actions

  • Reindex updated content.
  • Update metadata and effective date.
  • Rerun policy Q/A eval.
  • Monitor stale citation rate.
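
"Stale citation rate" is not defined in this runbook; one plausible definition is the share of citations that point at a source marked superseded or past its expiry date. The sketch below computes it under that assumption. The metadata fields `superseded` and `expires`, and the source IDs, are hypothetical.

```python
from datetime import date


def stale_citation_rate(citations, sources, today):
    """Share of citations pointing at a superseded or expired source."""
    if not citations:
        return 0.0
    stale = 0
    for source_id in citations:
        meta = sources[source_id]
        expired = meta["expires"] is not None and meta["expires"] < today
        if meta["superseded"] or expired:
            stale += 1
    return stale / len(citations)


sources = {
    "fee-policy-v1": {"superseded": True, "expires": None},
    "fee-policy-v2": {"superseded": False, "expires": None},
    "promo-2024": {"superseded": False, "expires": date(2024, 12, 31)},
}
rate = stale_citation_rate(
    ["fee-policy-v1", "fee-policy-v2", "fee-policy-v2", "promo-2024"],
    sources,
    today=date(2025, 3, 1),
)
```

Two of the four citations are stale here, giving a rate of 0.5. The metric only works if the knowledge owner maintains the metadata, which is why the corrective actions pair reindexing with updating metadata and effective dates.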

15. Runbook: User Trust Drop

Trigger

  • Adoption drops.
  • Override rises.
  • Users stop using suggestions.
  • Qualitative feedback says “not trustworthy”.

Diagnosis

  • Accuracy issue?
  • Latency issue?
  • Too much text?
  • Poor workflow fit?
  • Managers not reinforcing?
  • Users fear audit/blame?

Corrective actions

  • Improve UX and explanation.
  • Add citations.
  • Shorten output.
  • Train champions.
  • Clarify accountability.
  • Narrow use case.

16. Financial Retail Examples

AML Copilot Operating Model

Owners:

  • PM owns adoption and scope.
  • AML Ops owns workflow.
  • Compliance owns SAR boundary.
  • Knowledge owner owns SOP and typology library.
  • EvalOps owns red-flag eval.
  • Architect owns audit and RAG/tool architecture.

Cadence:

  • weekly QA defect review.
  • monthly typology update.
  • quarterly risk review.

Customer Service RAG

Owners:

  • Knowledge manager owns articles.
  • Contact center ops owns adoption.
  • QA owns answer quality sample.
  • PM owns roadmap.
  • EvalOps owns policy Q/A regression.

Runbook focus:

  • outdated policy.
  • unauthorized promise.
  • escalation miss.

Payments Exception Agent

Owners:

  • Payment ops owns process.
  • Architect owns tool gateway.
  • Risk owns action approval.
  • Platform owns idempotency and audit.

Runbook focus:

  • duplicate action.
  • wrong repair recommendation.
  • payment rail outage.

Lending Assistant

Owners:

  • Credit policy owns policy.
  • Underwriting owns decisions.
  • Compliance owns fair lending review.
  • EvalOps owns reason-code eval.

Runbook focus:

  • unsupported reason code.
  • protected/proxy factor issue.
  • incorrect policy citation.

17. Adoption and Change Management

Adoption is not login count. Adoption means users safely change how work is done.

Adoption metrics

  • eligible users activated.
  • eligible cases touched.
  • repeat usage.
  • accepted suggestions.
  • edited suggestions.
  • override reasons.
  • time saved.
  • QA defects.
  • trust survey.
  • escalation rate.
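
Several of these metrics can be derived from a simple log of suggestion events. The sketch below assumes each event is a `(user, outcome)` pair with outcomes `accepted`, `edited`, or `overridden`; that event shape, the metric names, and the sample users are assumptions for illustration, not a schema defined by this runbook.

```python
def adoption_metrics(eligible_users, events):
    """Summarize suggestion events; each event is a (user, outcome) pair.

    eligible_users is assumed to be non-empty.
    """
    active = {user for user, _ in events}
    outcomes = [outcome for _, outcome in events]
    total = len(outcomes)

    def share(name):
        return outcomes.count(name) / total if total else 0.0

    return {
        "activation_rate": len(active & set(eligible_users)) / len(eligible_users),
        "accepted_rate": share("accepted"),
        "edited_rate": share("edited"),
        "override_rate": share("overridden"),
    }


events = [("ana", "accepted"), ("ana", "edited"), ("ben", "overridden"), ("ana", "accepted")]
metrics = adoption_metrics(["ana", "ben", "cy", "di"], events)
```

With this sample, half of the eligible users are active and a quarter of suggestions are overridden. Tracking the override rate next to the activation rate helps separate "people log in" from "people actually rely on the output", which is the distinction this section draws.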

Trust-building

  • Start with read-only support.
  • Show citations and evidence.
  • Keep user in control.
  • Explain limitations.
  • Capture feedback.
  • Close feedback loop visibly.

Resistance handling

| Resistance | Response |
| --- | --- |
| AI will replace my judgment | Position as decision support, not decision owner |
| AI makes mistakes | Show eval, citations, human review and feedback loop |
| It slows me down | Reduce output length, improve workflow fit |
| I do not know who is accountable | Clarify RACI and approval boundary |
| I do not trust the knowledge | Show source, version, owner, update process |

18. Interview Talking Points

How do you operate AI after launch?

30-second answer:

I define an operating model before release: product owner, process owner, data and knowledge owners, platform owner, EvalOps owner, risk/compliance, security and frontline champions. I set RACI for changes, eval gates, incidents, vendor reviews and adoption. Production AI needs dashboards, runbooks, quarterly reviews and rollback paths.

What runbooks do you need for enterprise AI?

Answer:

At minimum: hallucination, prompt injection/tool misuse, data leakage, provider outage, eval regression, stale knowledge, user trust drop and high-risk output escape. Each runbook needs trigger, containment, diagnosis, corrective action, owner and postmortem.

Why is adoption part of architecture?

Answer:

If users do not safely change workflow, the architecture has not delivered value. Adoption affects where AI enters the process, what evidence users need, how controls are accepted, and what metrics prove ROI.


19. Operating Model Checklist

  • AI product owner assigned.
  • Business process owner assigned.
  • Data owner assigned.
  • Knowledge owner assigned.
  • EvalOps owner assigned.
  • Risk/compliance owner assigned.
  • Security/privacy owner assigned.
  • Vendor owner assigned.
  • RACI approved.
  • Release gate defined.
  • Incident runbooks drafted.
  • Monitoring dashboard live.
  • Adoption dashboard live.
  • Quarterly review scheduled.
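
The ownership part of this checklist can be enforced mechanically, for example as a pre-release check over the owner map. The sketch below is one possible shape; the role keys mirror the eight "owner assigned" items, while the key spellings and the sample names are hypothetical.

```python
# Keys mirror the eight "owner assigned" checklist items (spellings are assumed).
REQUIRED_OWNERS = [
    "ai_product",
    "business_process",
    "data",
    "knowledge",
    "evalops",
    "risk_compliance",
    "security_privacy",
    "vendor",
]


def unassigned_owners(owner_map):
    """Return required owner roles that are missing or empty in the owner map."""
    return [role for role in REQUIRED_OWNERS if not owner_map.get(role)]


owner_map = {
    "ai_product": "J. Rivera",
    "business_process": "M. Chen",
    "data": "",
    "knowledge": "S. Patel",
    "evalops": "A. Novak",
    "risk_compliance": "L. Okafor",
    "security_privacy": "T. Brandt",
}
missing = unassigned_owners(owner_map)
```

Here `missing` is `["data", "vendor"]`: one role has an empty value and one is absent. A release pipeline could refuse to proceed while this list is non-empty.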

20. Connections

| Existing asset | Use |
| --- | --- |
| docs/abpa/templates/09-operating-model-raci.md | Create RACI |
| docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md | Use release gate evidence |
| docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md | Connect eval to release and incident |
| docs/AI_VENDOR_BUILD_BUY_ADOPTION_PLAYBOOK.md | Vendor and adoption governance |
| docs/AI_GOVERNANCE_EVALOPS_RISK_90_PLAN.md | Deep governance practice |
| docs/AI_ARCHITECTURE_DIAGRAM_PLAYBOOK.md | Draw operating model and runbook architecture |

21. Final Rule

An AI system is not production-ready until you can answer:

Who owns quality?
Who owns data?
Who owns knowledge?
Who approves changes?
Who handles incidents?
Who can roll back?
Who measures adoption?
Who proves business value?