AI Operating Model / RACI / Runbook

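Positioning: an enterprise AI operating model handbook for AI BAs, AI PMs, AI Solutions Architects, and Enterprise Architects. Goal: once an AI system is live, someone owns it, someone monitors it, someone fixes it, someone approves changes, and someone runs the retrospective. Core idea: AI launch is not the finish line. Production AI needs operating ownership.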

Source Anchors

These sources serve as learning anchors. They do not constitute legal, compliance, or operations advice.

Anchor	Link	Usage
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Organize the AI risk lifecycle and continuous monitoring
ISO/IEC 42001	https://www.iso.org/standard/42001	Establish an AI management system, accountability, and continual improvement
EU AI Act	https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng	Use a risk-based lens to think about high-risk systems, human oversight, and documentation
OWASP LLM Top 10	https://owasp.org/www-project-top-10-for-large-language-model-applications/	Build runbooks for LLM risks
TOGAF	https://www.opengroup.org/togaf	Manage change through architecture governance and a review board
SRE concepts	https://sre.google/	Borrow SLO, incident, postmortem, and error budget thinking

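1. Why Enterprise AI Needs an Operating Model

AI systems change continuously:

Model versions change.
Prompts change.
Knowledge bases change.
Policies change.
User behavior changes.
Vendor behavior changes.
Risk and regulatory requirements change.

Without an operating model, common problems include:

The knowledge base goes stale and nobody owns it.
A prompt is changed but nobody runs regression.
Users do not trust the system but nobody collects the reasons.
The model produces wrong output but nobody assigns a severity.
Risk assumes product is responsible; product assumes engineering is responsible.
A vendor updates the model and quality degrades.
There is no post-launch review, only remediation after incidents.

An operating model must answer: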

Who owns the AI capability?
Who owns data and knowledge?
Who approves changes?
Who monitors quality?
Who handles incidents?
Who decides release and rollback?
Who proves business value?

2. Core Roles

Role	Responsibility
AI Product Owner	owns value, scope, roadmap, adoption, release decision
Business Process Owner	owns workflow performance and operational policy
AI BA	owns process evidence, requirements, stakeholder alignment
Solution Architect	owns architecture, integration, controls, NFRs, rollback
Data Owner	owns source data quality, access, retention, lineage
Knowledge Owner	owns policy/content accuracy, versioning, freshness
Model / Platform Owner	owns model gateway, platform reliability, provider changes
EvalOps Owner	owns golden set, eval runner, release gate, quality dashboard
Risk / Compliance	owns risk classification, controls, oversight, audit evidence
Security / Privacy	owns IAM, data protection, vendor/security review
Operations Lead	owns day-to-day workflow adoption and frontline support
Vendor Manager	owns vendor SLA, contract, incidents, review cadence
Frontline Champion	owns user feedback, training, trust-building

3. RACI: Use Case Intake

Activity	PM	BA	Process Owner	Architect	Risk	Data	Ops
Identify business problem	A	R	R	C	C	I	C
Define baseline metrics	A	R	R	I	C	C	R
Assess AI fit / no-AI option	A	R	C	C	C	C	C
Initial risk tier	C	C	C	C	A/R	C	I
Discovery decision	A	R	A	C	C	C	C
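
The intake table above can be checked mechanically. The sketch below assumes a common RACI convention that this runbook does not state: each activity has exactly one Accountable role and at least one Responsible role. The function names are illustrative. Run on the rows above, it flags "Discovery decision", where both PM and Process Owner are marked A. Whether that shared accountability is intended is a call for the owners, not for the code.

```python
from typing import Dict, List

# Rows copied from the Use Case Intake table: activity -> RACI code per role.
ROLES = ["PM", "BA", "Process Owner", "Architect", "Risk", "Data", "Ops"]
INTAKE_RACI: Dict[str, List[str]] = {
    "Identify business problem": ["A", "R", "R", "C", "C", "I", "C"],
    "Define baseline metrics": ["A", "R", "R", "I", "C", "C", "R"],
    "Assess AI fit / no-AI option": ["A", "R", "C", "C", "C", "C", "C"],
    "Initial risk tier": ["C", "C", "C", "C", "A/R", "C", "I"],
    "Discovery decision": ["A", "R", "A", "C", "C", "C", "C"],
}


def accountable_roles(codes: List[str]) -> List[str]:
    """Return the roles whose cell contains A (including the combined 'A/R')."""
    return [role for role, code in zip(ROLES, codes) if "A" in code.split("/")]


def find_raci_issues(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Flag activities without exactly one Accountable or without any Responsible.

    The "exactly one A" rule is an assumed convention, not a rule from this runbook.
    """
    issues: Dict[str, str] = {}
    for activity, codes in table.items():
        owners = accountable_roles(codes)
        has_responsible = any("R" in code.split("/") for code in codes)
        if len(owners) != 1:
            issues[activity] = f"expected 1 Accountable, found {len(owners)}: {owners}"
        elif not has_responsible:
            issues[activity] = "no Responsible role"
    return issues


if __name__ == "__main__":
    for activity, problem in find_raci_issues(INTAKE_RACI).items():
        print(f"{activity}: {problem}")
```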

4. RACI: Data and Knowledge Readiness

Activity	PM	BA	Architect	Data Owner	Knowledge Owner	Security	Risk
Source inventory	C	R	C	A/R	R	C	C
Data classification	I	C	C	R	C	A/R	C
Access control design	I	C	R	C	C	A/R	C
Knowledge versioning	C	C	C	I	A/R	C	C
Retention and logging	C	C	R	C	C	A/R	A/R
Data readiness sign-off	C	R	C	A/R	A/R	C	C

5. RACI: Prompt / Model / Knowledge Change

Activity	PM	Architect	EvalOps	Knowledge Owner	Platform Owner	Risk	Ops
Change request	A/R	C	C	R	C	C	C
Impact assessment	A	R	R	R	C	C	C
Regression eval	C	C	A/R	C	C	C	I
Risk review	C	C	C	C	I	A/R	I
Release approval	A	R	R	C	R	C	C
Rollback decision	A	R	R	C	R	C	R

6. RACI: Incident Response

Activity	PM	Ops	Architect	EvalOps	Risk	Security	Vendor
Triage incident	A	R	R	R	C	C	C
Severity classification	A	R	C	C	A/R	C	C
Containment	A	R	R	C	C	A/R	R
User/customer communication	A/R	R	I	I	C	C	I
Root cause analysis	C	R	R	R	C	C	C
Corrective action	A	R	R	R	C	C	C
Postmortem	A	R	R	R	A/R	C	C

7. Governance Cadence

Cadence	Meeting	Inputs	Decisions
Daily	Pilot standup	incidents, feedback, defects	quick fixes, user support
Weekly	AI quality review	eval results, overrides, complaints	prompt/index/action fixes
Biweekly	Product/ops adoption review	adoption dashboard, workflow metrics	training, rollout adjustments
Monthly	AI risk/governance review	risk register, incidents, model changes	risk acceptance, controls
Quarterly	AI capability review	ROI, cost, quality trend, vendor review	scale, refresh, retire

8. Required Operating Artifacts

Use case inventory.
Owner map.
RACI.
Data and knowledge owner registry.
Prompt/model/index version registry.
Eval dashboard.
Incident log.
Risk register.
Change log.
Adoption dashboard.
Vendor review log.
Quarterly capability review.

9. Runbook: Hallucination Incident

Trigger

Unsupported factual claim.
Wrong citation.
User reports incorrect answer.
QA flags hallucinated rationale.

Immediate containment

Capture prompt, evidence IDs, model version, output, user role.
Classify severity.
If high risk, disable affected route or switch to fallback.
Notify PM, Ops, EvalOps, Risk.
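
The containment steps above can be captured as a structured record plus a small decision function. This is a minimal sketch: the `HallucinationIncident` type, the three-level severity scale, and the action names are assumptions made for illustration. Only the captured fields and the notification list come from the steps above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

# Roles to notify, as listed in the containment steps.
NOTIFY = ["PM", "Ops", "EvalOps", "Risk"]


@dataclass
class HallucinationIncident:
    """Evidence captured at containment time."""
    prompt: str
    evidence_ids: List[str]
    model_version: str
    output: str
    user_role: str
    severity: str  # assumed scale: "low", "medium", "high"
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def containment_plan(incident: HallucinationIncident) -> Dict[str, List[str]]:
    """Return the ordered containment actions and the roles to notify."""
    actions = ["capture_record", "classify_severity"]
    if incident.severity == "high":
        # High risk: take the affected route offline or switch to the fallback path.
        actions.append("disable_route_or_fallback")
    actions.append("notify")
    return {"actions": actions, "notify": list(NOTIFY)}
```

Keeping the record immutable after capture (for example by writing it to the incident log) makes the later diagnosis questions answerable from evidence rather than memory.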

Diagnosis

Was evidence missing?
Was retrieval wrong?
Was policy stale?
Did prompt over-instruct?
Did model ignore evidence?
Did evaluator miss this case?

Corrective actions

Add failure to golden set.
Fix retrieval metadata.
Update prompt or output schema.
Add citation validator.
Add human review for similar cases.
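
One possible shape for the citation validator mentioned above: compare the citation markers in an answer with the evidence IDs that retrieval actually returned. The bracketed marker format such as `[POL-7]` is a hypothetical convention chosen for this sketch, and the check only proves that a cited source was retrieved, not that the source supports the claim.

```python
import re
from typing import Iterable, List

# Hypothetical citation format: an evidence ID in square brackets, e.g. [POL-7].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def unsupported_citations(output: str, evidence_ids: Iterable[str]) -> List[str]:
    """Return cited IDs that were not among the retrieved evidence, in order."""
    allowed = set(evidence_ids)
    missing: List[str] = []
    for cited in CITATION_PATTERN.findall(output):
        if cited not in allowed and cited not in missing:
            missing.append(cited)
    return missing


def validate_answer(output: str, evidence_ids: Iterable[str]) -> bool:
    """Pass only if the answer cites something and every citation was retrieved."""
    cited = CITATION_PATTERN.findall(output)
    return bool(cited) and not unsupported_citations(output, evidence_ids)
```

Answers that fail this check are natural candidates for the human review step and for new golden set cases.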

Postmortem

What failed?
Why did gate not catch it?
Which control will prevent recurrence?

10. Runbook: Prompt Injection / Tool Misuse

Trigger

Model follows instruction from retrieved document.
User asks model to bypass policy.
Tool call attempts unauthorized action.

Immediate containment

Stop affected tool route if action risk exists.
Preserve logs and tool traces.
Notify Security, Architect, Risk, PM.
Review access and action policy.

Diagnosis

Was retrieved content labeled as evidence?
Was instruction hierarchy clear?
Was tool allowlist enforced?
Did entitlement filter run?
Was red-team case in eval suite?

Corrective actions

Add prompt injection tests.
Harden tool gateway.
Add action approval.
Improve content sanitization.
Update security training.
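
A hardened tool gateway with action approval could look roughly like the sketch below. It enforces an allowlist, a role entitlement, and an approval requirement outside the model, so instructions injected into a prompt or a retrieved document cannot widen what the model may do. The tool names, roles, and return strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: FrozenSet[str]
    requires_approval: bool


# Hypothetical allowlist: any tool not listed here is denied by default.
POLICIES: Dict[str, ToolPolicy] = {
    "lookup_case": ToolPolicy(frozenset({"analyst", "supervisor"}), False),
    "release_payment": ToolPolicy(frozenset({"supervisor"}), True),
}


def authorize_tool_call(tool: str, user_role: str, approved: bool = False) -> str:
    """Decide a tool call from policy alone, never from model output."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny: tool not on allowlist"
    if user_role not in policy.allowed_roles:
        return "deny: role not entitled"
    if policy.requires_approval and not approved:
        return "hold: approval required"
    return "allow"
```

Each denied or held call is worth logging with its tool trace, since those records feed both the diagnosis questions above and new prompt injection tests.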

11. Runbook: Data Leakage

Trigger

Output includes unauthorized customer data.
Logs contain sensitive fields.
Cache served answer across permission boundary.

Immediate containment

Disable affected cache/retrieval route.
Preserve evidence for investigation.
Notify Security/Privacy/Risk.
Identify affected users/records.

Diagnosis

Was entitlement checked before retrieval?
Was cache key missing role/product/region/version?
Were logs redacted?
Did prompt include unnecessary data?

Corrective actions

Fix entitlement filter.
Fix cache key.
Redact logs.
Add leakage eval.
Update data minimization rule.
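
The cache key fix can be illustrated with a key that includes every dimension named in the diagnosis: role, product, region, and version. This is a sketch under assumptions: `version` is taken to mean the knowledge or index version, and the whitespace and case normalization of the question is a design choice of this example.

```python
import hashlib
import json


def cache_key(question: str, *, role: str, product: str, region: str, version: str) -> str:
    """Build a cache key that cannot collide across permission boundaries.

    Two users with different roles, products, regions, or knowledge versions
    get different keys even when they ask the same question.
    """
    payload = json.dumps(
        {
            "question": " ".join(question.lower().split()),  # normalize case and spacing
            "role": role,
            "product": product,
            "region": region,
            "version": version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A leakage eval can then assert that the same question asked under two different roles never resolves to the same key.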

12. Runbook: Model / Provider Outage

Trigger

Provider unavailable.
Latency above SLA.
Cost spike.
Model returns abnormal errors.

Immediate containment

Route to fallback model/provider if approved.
Degrade to retrieval-only or template path.
Inform operations users.
Monitor queue and retry.

Diagnosis

Vendor outage?
Rate limit?
Internal network?
Prompt/token explosion?
Tool dependency?

Corrective actions

Adjust routing.
Add circuit breaker.
Update capacity plan.
Review vendor SLA.
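
The circuit breaker and fallback routing above might be sketched as follows. This is a deliberately simplified version: it counts consecutive failures and stays open until a success is recorded or the breaker is reset, and it omits the timed half-open probe that production breakers usually add. `primary` and `fallback` are placeholder callables standing in for a provider call and an approved degraded path.

```python
from typing import Callable


class CircuitBreaker:
    """Open after a number of consecutive failures (no recovery timer in this sketch)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1


def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    breaker: CircuitBreaker,
) -> str:
    """Call the primary provider unless the breaker is open; degrade on failure."""
    if breaker.is_open:
        return fallback(prompt)  # skip the failing provider entirely
    try:
        result = primary(prompt)
    except Exception:
        breaker.record_failure()
        return fallback(prompt)
    breaker.record_success()
    return result
```

Skipping the primary call while the breaker is open is what keeps queues and retries from piling up against a provider that is already down.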

13. Runbook: Eval Regression

Trigger

New prompt/model/index fails regression.
Critical case fails.
LLM judge drift detected.

Immediate containment

Block release.
Revert to previous version bundle.
Open defect with owner.

Diagnosis

Prompt changed?
Model changed?
Index changed?
Rubric changed?
Data changed?

Corrective actions

Fix and rerun eval.
Add missing test cases.
Calibrate judge.
Document release decision.
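
A release gate that blocks on regression can be expressed in a few lines. The sketch below assumes two rules that this runbook implies but does not quantify: any failed critical case blocks the release, and the overall pass rate must meet a threshold. The 0.95 default is a placeholder, not a recommended value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    critical: bool = False


def release_gate(results: List[EvalResult], min_pass_rate: float = 0.95) -> Dict[str, Any]:
    """Return a release decision with reasons, suitable for the change log."""
    if not results:
        return {"decision": "block", "pass_rate": 0.0, "reasons": ["no eval results"]}

    failed_critical = [r.case_id for r in results if r.critical and not r.passed]
    pass_rate = sum(r.passed for r in results) / len(results)

    reasons: List[str] = []
    if failed_critical:
        reasons.append(f"critical cases failed: {failed_critical}")
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2%} below {min_pass_rate:.2%}")

    return {
        "decision": "block" if reasons else "release",
        "pass_rate": pass_rate,
        "reasons": reasons,
    }
```

Returning the reasons alongside the decision gives the "document release decision" step something concrete to record.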

14. Runbook: Knowledge Staleness

Trigger

Old policy cited.
Product team updates fee or rule.
User flags outdated answer.

Immediate containment

Mark stale source.
Remove from retrieval or lower priority.
Notify knowledge owner.
Add temporary warning if needed.

Corrective actions

Reindex updated content.
Update metadata and effective date.
Rerun policy Q/A eval.
Monitor stale citation rate.
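
The stale citation rate can be computed from a source registry that records effective dates. This sketch assumes a hypothetical `Source` record with a `superseded_on` date, and it counts citations to sources missing from the registry as stale, which is a conservative choice rather than a rule from this runbook.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Source:
    source_id: str
    effective_from: date
    superseded_on: Optional[date] = None  # None means still current


def is_stale(source: Source, as_of: date) -> bool:
    """A source is stale once its superseded date has been reached."""
    return source.superseded_on is not None and source.superseded_on <= as_of


def stale_citation_rate(cited_ids: List[str], registry: Dict[str, Source], as_of: date) -> float:
    """Share of citations that point at stale or unregistered sources."""
    if not cited_ids:
        return 0.0
    stale = sum(
        1
        for source_id in cited_ids
        if source_id not in registry or is_stale(registry[source_id], as_of)
    )
    return stale / len(cited_ids)
```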

15. Runbook: User Trust Drop

Trigger

Adoption drops.
Override rises.
Users stop using suggestions.
Qualitative feedback says “not trustworthy”.

Diagnosis

Accuracy issue?
Latency issue?
Too much text?
Poor workflow fit?
Managers not reinforcing?
Users fear audit/blame?

Corrective actions

Improve UX and explanation.
Add citations.
Shorten output.
Train champions.
Clarify accountability.
Narrow use case.

16. Financial Retail Examples

AML Copilot Operating Model

Owners:

PM owns adoption and scope.
AML Ops owns workflow.
Compliance owns SAR boundary.
Knowledge owner owns SOP and typology library.
EvalOps owns red-flag eval.
Architect owns audit and RAG/tool architecture.

Cadence:

Weekly QA defect review.
Monthly typology update.
Quarterly risk review.

Customer Service RAG

Owners:

Knowledge manager owns articles.
Contact center ops owns adoption.
QA owns answer quality sample.
PM owns roadmap.
EvalOps owns policy Q/A regression.

Runbook focus:

Outdated policy.
Unauthorized promise.
Escalation miss.

Payments Exception Agent

Owners:

Payment ops owns process.
Architect owns tool gateway.
Risk owns action approval.
Platform owns idempotency and audit.

Runbook focus:

Duplicate action.
Wrong repair recommendation.
Payment rail outage.

Lending Assistant

Owners:

Credit policy owns policy.
Underwriting owns decisions.
Compliance owns fair lending review.
EvalOps owns reason-code eval.

Runbook focus:

Unsupported reason code.
Protected/proxy factor issue.
Incorrect policy citation.

17. Adoption and Change Management

Adoption is not login count. Adoption means users safely change how work is done.

Adoption metrics

Eligible users activated.
Eligible cases touched.
Repeat usage.
Accepted suggestions.
Edited suggestions.
Override reasons.
Time saved.
QA defects.
Trust survey.
Escalation rate.
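
Several of the metrics above can be derived from a simple event log. The sketch below assumes a hypothetical event shape, one dict per suggestion with a `user` and an `outcome` of `accepted`, `edited`, or `overridden`, and covers only activation, repeat usage, and suggestion outcomes. Time saved, QA defects, and trust surveys need other data sources.

```python
from collections import Counter
from typing import Dict, List


def adoption_summary(events: List[dict], eligible_users: int) -> Dict[str, float]:
    """Summarize adoption from suggestion events (hypothetical event shape)."""
    per_user = Counter(event["user"] for event in events)
    outcomes = Counter(event["outcome"] for event in events)
    total = len(events)
    repeat_users = sum(1 for count in per_user.values() if count > 1)

    def rate(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "activation_rate": rate(len(per_user), eligible_users),
        "repeat_usage_rate": rate(repeat_users, len(per_user)),
        "accept_rate": rate(outcomes["accepted"], total),
        "edit_rate": rate(outcomes["edited"], total),
        "override_rate": rate(outcomes["overridden"], total),
    }
```

Tracking the edit and override rates separately from the accept rate matters because a rising override rate is one of the triggers in the user trust drop runbook.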

Trust-building

Start with read-only support.
Show citations and evidence.
Keep user in control.
Explain limitations.
Capture feedback.
Close feedback loop visibly.

Resistance handling

Resistance	Response
AI will replace my judgment	Position as decision support, not decision owner
AI makes mistakes	Show eval, citations, human review and feedback loop
It slows me down	Reduce output length, improve workflow fit
I do not know who is accountable	Clarify RACI and approval boundary
I do not trust the knowledge	Show source, version, owner, update process

18. Interview Talking Points

How do you operate AI after launch?

30-second answer:

I define an operating model before release: product owner, process owner, data and knowledge owners, platform owner, EvalOps owner, risk/compliance, security and frontline champions. I set RACI for changes, eval gates, incidents, vendor reviews and adoption. Production AI needs dashboards, runbooks, quarterly reviews and rollback paths.

What runbooks do you need for enterprise AI?

Answer:

At minimum: hallucination, prompt injection/tool misuse, data leakage, provider outage, eval regression, stale knowledge, user trust drop and high-risk output escape. Each runbook needs trigger, containment, diagnosis, corrective action, owner and postmortem.

Why is adoption part of architecture?

Answer:

If users do not safely change workflow, the architecture has not delivered value. Adoption affects where AI enters the process, what evidence users need, how controls are accepted, and what metrics prove ROI.

19. Operating Model Checklist

AI product owner assigned.
Business process owner assigned.
Data owner assigned.
Knowledge owner assigned.
EvalOps owner assigned.
Risk/compliance owner assigned.
Security/privacy owner assigned.
Vendor owner assigned.
RACI approved.
Release gate defined.
Incident runbooks drafted.
Monitoring dashboard live.
Adoption dashboard live.
Quarterly review scheduled.
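
The checklist above can double as an automated readiness check before release. This is a sketch only: the item names mirror the checklist, while the input shape (an owner map of role to person and a control map of item to boolean) is an assumption of this example.

```python
from typing import Dict, List

# Item names mirror the checklist above.
REQUIRED_OWNERS = [
    "AI product owner",
    "Business process owner",
    "Data owner",
    "Knowledge owner",
    "EvalOps owner",
    "Risk/compliance owner",
    "Security/privacy owner",
    "Vendor owner",
]
REQUIRED_CONTROLS = [
    "RACI approved",
    "Release gate defined",
    "Incident runbooks drafted",
    "Monitoring dashboard live",
    "Adoption dashboard live",
    "Quarterly review scheduled",
]


def readiness_gaps(owners: Dict[str, str], controls: Dict[str, bool]) -> List[str]:
    """List every checklist item that is unassigned or not yet in place."""
    gaps = [
        f"{role} not assigned"
        for role in REQUIRED_OWNERS
        if not owners.get(role, "").strip()
    ]
    gaps += [
        f"{control} missing"
        for control in REQUIRED_CONTROLS
        if not controls.get(control, False)
    ]
    return gaps
```

An empty result means every item on the checklist is covered. It says nothing about whether the named owners are actually doing the work.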

20. Connections

Existing asset	Use
`docs/abpa/templates/09-operating-model-raci.md`	Create RACI
`docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md`	Use release gate evidence
`docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md`	Connect eval to release and incident
`docs/AI_VENDOR_BUILD_BUY_ADOPTION_PLAYBOOK.md`	Vendor and adoption governance
`docs/AI_GOVERNANCE_EVALOPS_RISK_90_PLAN.md`	Deep governance practice
`docs/AI_ARCHITECTURE_DIAGRAM_PLAYBOOK.md`	Draw operating model and runbook architecture

21. Final Rule

An AI system is not production-ready until you can answer:

Who owns quality?
Who owns data?
Who owns knowledge?
Who approves changes?
Who handles incidents?
Who can roll back?
Who measures adoption?
Who proves business value?