AI Operating Model / RACI Runbook
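Positioning: an enterprise AI operating model handbook for AI BA / AI PM / AI Solutions Architect / Enterprise Architect roles.
Goal: make sure that once an AI system is live, someone owns it, someone monitors it, someone fixes it, someone approves changes, and someone runs the review.
Core idea: AI launch is not the finish line. Production AI needs operating ownership.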
Source Anchors
These sources serve as learning anchors. They do not constitute legal, compliance, or operational advice.
| Anchor | Link | Use |
|---|---|---|
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Organize the AI risk lifecycle and continuous monitoring |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Establish an AI management system, accountability, and continual improvement |
| EU AI Act | https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng | Apply a risk-based lens to high-risk use cases, human oversight, and documentation |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Build runbooks for LLM risks |
| TOGAF | https://www.opengroup.org/togaf | Manage change through architecture governance and a review board |
| SRE concepts | https://sre.google/ | Borrow the SLO, incident, postmortem, and error budget concepts |
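1. Why Enterprise AI Needs an Operating Model
AI systems keep changing:
- Model versions change.
- Prompts change.
- Knowledge bases change.
- Policies change.
- User behavior changes.
- Vendor behavior changes.
- Risk and regulatory requirements change.
Without an operating model, common problems include:
- The knowledge base goes stale and nobody maintains it.
- A prompt is changed but nobody runs regression.
- Users lose trust but nobody collects the reasons.
- Model output is wrong but nobody assigns a severity.
- Risk assumes Product is responsible, and Product assumes Engineering is responsible.
- A vendor model update degrades quality.
- There is no post-launch review, only remediation after incidents.
An operating model must answer: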
Who owns the AI capability?
Who owns data and knowledge?
Who approves changes?
Who monitors quality?
Who handles incidents?
Who decides release and rollback?
Who proves business value?
2. Core Roles
| Role | Responsibility |
|---|---|
| AI Product Owner | owns value, scope, roadmap, adoption, release decision |
| Business Process Owner | owns workflow performance and operational policy |
| AI BA | owns process evidence, requirements, stakeholder alignment |
| Solution Architect | owns architecture, integration, controls, NFRs, rollback |
| Data Owner | owns source data quality, access, retention, lineage |
| Knowledge Owner | owns policy/content accuracy, versioning, freshness |
| Model / Platform Owner | owns model gateway, platform reliability, provider changes |
| EvalOps Owner | owns golden set, eval runner, release gate, quality dashboard |
| Risk / Compliance | owns risk classification, controls, oversight, audit evidence |
| Security / Privacy | owns IAM, data protection, vendor/security review |
| Operations Lead | owns day-to-day workflow adoption and frontline support |
| Vendor Manager | owns vendor SLA, contract, incidents, review cadence |
| Frontline Champion | owns user feedback, training, trust-building |
3. RACI: Use Case Intake
| Activity | PM | BA | Process Owner | Architect | Risk | Data | Ops |
|---|---|---|---|---|---|---|---|
| Identify business problem | A | R | R | C | C | I | C |
| Define baseline metrics | A | R | R | I | C | C | R |
| Assess AI fit / no-AI option | A | R | C | C | C | C | C |
| Initial risk tier | C | C | C | C | A/R | C | I |
| Discovery decision | A | R | A | C | C | C | C |
4. RACI: Data and Knowledge Readiness
| Activity | PM | BA | Architect | Data Owner | Knowledge Owner | Security | Risk |
|---|---|---|---|---|---|---|---|
| Source inventory | C | R | C | A/R | R | C | C |
| Data classification | I | C | C | R | C | A/R | C |
| Access control design | I | C | R | C | C | A/R | C |
| Knowledge versioning | C | C | C | I | A/R | C | C |
| Retention and logging | C | C | R | C | C | A/R | A/R |
| Data readiness sign-off | C | R | C | A/R | A/R | C | C |
5. RACI: Prompt / Model / Knowledge Change
| Activity | PM | Architect | EvalOps | Knowledge Owner | Platform Owner | Risk | Ops |
|---|---|---|---|---|---|---|---|
| Change request | A/R | C | C | R | C | C | C |
| Impact assessment | A | R | R | R | C | C | C |
| Regression eval | C | C | A/R | C | C | C | I |
| Risk review | C | C | C | C | I | A/R | I |
| Release approval | A | R | R | C | R | C | C |
| Rollback decision | A | R | R | C | R | C | R |
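A RACI table like the one above can be checked mechanically before it is approved. The sketch below is a minimal illustration, assuming one common RACI convention (exactly one Accountable and at least one Responsible per activity). That convention, the function name `check_raci`, and the data layout are assumptions for illustration, not rules stated in this handbook.

```python
from typing import Dict, List

# Column order and rows copied from the section 5 table.
ROLES: List[str] = ["PM", "Architect", "EvalOps", "Knowledge Owner",
                    "Platform Owner", "Risk", "Ops"]

CHANGE_RACI: Dict[str, List[str]] = {
    "Change request":    ["A/R", "C", "C", "R", "C", "C", "C"],
    "Impact assessment": ["A", "R", "R", "R", "C", "C", "C"],
    "Regression eval":   ["C", "C", "A/R", "C", "C", "C", "I"],
    "Risk review":       ["C", "C", "C", "C", "I", "A/R", "I"],
    "Release approval":  ["A", "R", "R", "C", "R", "C", "C"],
    "Rollback decision": ["A", "R", "R", "C", "R", "C", "R"],
}


def check_raci(roles: List[str], rows: Dict[str, List[str]]) -> List[str]:
    """Return findings; an empty list means every row passed the checks."""
    findings: List[str] = []
    for activity, codes in rows.items():
        if len(codes) != len(roles):
            findings.append(f"{activity}: expected {len(roles)} cells, got {len(codes)}")
            continue
        # A cell such as "A/R" carries both letters, so split before matching.
        accountable = [r for r, c in zip(roles, codes) if "A" in c.split("/")]
        responsible = [r for r, c in zip(roles, codes) if "R" in c.split("/")]
        if len(accountable) != 1:
            names = ", ".join(accountable) or "none"
            findings.append(f"{activity}: expected one Accountable, found {names}")
        if not responsible:
            findings.append(f"{activity}: no Responsible role")
    return findings


for finding in check_raci(ROLES, CHANGE_RACI):
    print(finding)
```

Rows where accountability is deliberately shared would be reported by this check. Whether to treat such a finding as an error or as an accepted exception is a governance decision, not something the script can settle.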
6. RACI: Incident Response
| Activity | PM | Ops | Architect | EvalOps | Risk | Security | Vendor |
|---|---|---|---|---|---|---|---|
| Triage incident | A | R | R | R | C | C | C |
| Severity classification | A | R | C | C | A/R | C | C |
| Containment | A | R | R | C | C | A/R | R |
| User/customer communication | A/R | R | I | I | C | C | I |
| Root cause analysis | C | R | R | R | C | C | C |
| Corrective action | A | R | R | R | C | C | C |
| Postmortem | A | R | R | R | A/R | C | C |
7. Governance Cadence
| Cadence | Meeting | Inputs | Decisions |
|---|---|---|---|
| Daily | Pilot standup | incidents, feedback, defects | quick fixes, user support |
| Weekly | AI quality review | eval results, overrides, complaints | prompt/index/action fixes |
| Biweekly | Product/ops adoption review | adoption dashboard, workflow metrics | training, rollout adjustments |
| Monthly | AI risk/governance review | risk register, incidents, model changes | risk acceptance, controls |
| Quarterly | AI capability review | ROI, cost, quality trend, vendor review | scale, refresh, retire |
8. Required Operating Artifacts
- Use case inventory.
- Owner map.
- RACI.
- Data and knowledge owner registry.
- Prompt/model/index version registry.
- Eval dashboard.
- Incident log.
- Risk register.
- Change log.
- Adoption dashboard.
- Vendor review log.
- Quarterly capability review.
9. Runbook: Hallucination Incident
Trigger
- Unsupported factual claim.
- Wrong citation.
- User reports incorrect answer.
- QA flags hallucinated rationale.
Immediate containment
- Capture prompt, evidence IDs, model version, output, user role.
- Classify severity.
- If high risk, disable affected route or switch to fallback.
- Notify PM, Ops, EvalOps, Risk.
Diagnosis
- Was evidence missing?
- Was retrieval wrong?
- Was policy stale?
- Did prompt over-instruct?
- Did model ignore evidence?
- Did evaluator miss this case?
Corrective actions
- Add failure to golden set.
- Fix retrieval metadata.
- Update prompt or output schema.
- Add citation validator.
- Add human review for similar cases.
Postmortem
- What failed?
- Why did gate not catch it?
- Which control will prevent recurrence?
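The containment steps of this runbook can be sketched as a small incident record plus a decision function. This is a minimal illustration, not a prescribed implementation: the class name `HallucinationIncident`, the three-level severity scale, and the function `containment_plan` are hypothetical. The captured fields and the notification list come from the runbook above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Hypothetical severity scale; the runbook only says "classify severity".
SEVERITIES = ("low", "medium", "high")

# Roles to notify, as listed under "Immediate containment".
NOTIFY = ["PM", "Ops", "EvalOps", "Risk"]


@dataclass
class HallucinationIncident:
    """Evidence captured at containment time, per the runbook."""
    prompt: str
    evidence_ids: List[str]
    model_version: str
    output: str
    user_role: str
    severity: str
    captured_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def containment_plan(incident: HallucinationIncident) -> List[str]:
    """Return the ordered containment actions for an incident."""
    if incident.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {incident.severity!r}")
    steps: List[str] = []
    if incident.severity == "high":
        # Only high-risk incidents take the route offline.
        steps.append("Disable affected route or switch to fallback")
    steps.append("Notify " + ", ".join(NOTIFY))
    return steps


incident = HallucinationIncident(
    prompt="What is the wire transfer fee?",
    evidence_ids=["kb-102"],
    model_version="model-v3",
    output="The fee is waived for all customers.",
    user_role="contact-center-agent",
    severity="high",
)
for step in containment_plan(incident):
    print(step)
```

Keeping the record immutable after capture and storing it in the incident log gives the postmortem the same evidence the responders saw.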
10. Runbook: Prompt Injection / Tool Misuse
Trigger
- Model follows instruction from retrieved document.
- User asks model to bypass policy.
- Tool call attempts unauthorized action.
Immediate containment
- Stop affected tool route if action risk exists.
- Preserve logs and tool traces.
- Notify Security, Architect, Risk, PM.
- Review access and action policy.
Diagnosis
- Was retrieved content labeled as evidence?
- Was instruction hierarchy clear?
- Was tool allowlist enforced?
- Did entitlement filter run?
- Was red-team case in eval suite?
Corrective actions
- Add prompt injection tests.
- Harden tool gateway.
- Add action approval.
- Improve content sanitization.
- Update security training.
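The diagnosis questions about the tool allowlist, entitlement, and action approval can be illustrated with a small gateway check. This is a sketch under stated assumptions: the tool names, the roles, and the function `authorize_tool_call` are hypothetical. The point it shows is that the decision depends only on the caller's role and an explicit approval, never on text produced by the model or found in a retrieved document.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: FrozenSet[str]
    needs_approval: bool  # True for tools that change state


# Hypothetical allowlist; anything not listed here is denied.
POLICIES: Dict[str, ToolPolicy] = {
    "lookup_case": ToolPolicy(frozenset({"analyst", "supervisor"}), False),
    "release_payment": ToolPolicy(frozenset({"supervisor"}), True),
}


def authorize_tool_call(
    tool: str, user_role: str, approved_by: Optional[str] = None
) -> str:
    """Return 'allow', 'hold: ...' or 'deny: ...' for a requested tool call."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny: tool not on allowlist"
    if user_role not in policy.allowed_roles:
        return "deny: role not entitled"
    if policy.needs_approval and not approved_by:
        return "hold: action approval required"
    return "allow"


print(authorize_tool_call("lookup_case", "analyst"))
print(authorize_tool_call("release_payment", "analyst"))
print(authorize_tool_call("release_payment", "supervisor"))
```

Each deny or hold result is a natural candidate for the tool trace that the containment steps ask to preserve.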
11. Runbook: Data Leakage
Trigger
- Output includes unauthorized customer data.
- Logs contain sensitive fields.
- Cache served answer across permission boundary.
Immediate containment
- Disable affected cache/retrieval route.
- Preserve evidence for investigation.
- Notify Security/Privacy/Risk.
- Identify affected users/records.
Diagnosis
- Was entitlement checked before retrieval?
- Was cache key missing role/product/region/version?
- Were logs redacted?
- Did prompt include unnecessary data?
Corrective actions
- Fix entitlement filter.
- Fix cache key.
- Redact logs.
- Add leakage eval.
- Update data minimization rule.
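The "fix cache key" action can be made concrete with a short sketch. It assumes an answer cache keyed by a hash and builds the key from the question plus the role, product, region, and version named in the diagnosis step. The function name `answer_cache_key` and the normalization of the question are assumptions for illustration.

```python
import hashlib
import json


def answer_cache_key(
    question: str, *, role: str, product: str, region: str, version: str
) -> str:
    """Build a cache key that cannot collide across permission boundaries.

    Every field that affects what the user is allowed to see, or which
    knowledge version produced the answer, is part of the key.
    """
    payload = json.dumps(
        {
            "question": question.strip().lower(),
            "role": role,
            "product": product,
            "region": region,
            "version": version,
        },
        sort_keys=True,  # stable ordering so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


agent_key = answer_cache_key(
    "What is the dispute limit?",
    role="agent", product="cards", region="EU", version="kb-2024-06",
)
supervisor_key = answer_cache_key(
    "What is the dispute limit?",
    role="supervisor", product="cards", region="EU", version="kb-2024-06",
)
print(agent_key != supervisor_key)
```

A key built only from the question text would let an answer generated for one role be served to another, which is the third trigger listed in this runbook.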
12. Runbook: Model / Provider Outage
Trigger
- Provider unavailable.
- Latency above SLA.
- Cost spike.
- Model returns abnormal errors.
Immediate containment
- Route to fallback model/provider if approved.
- Degrade to retrieval-only or template path.
- Inform operations users.
- Monitor queue and retry.
Diagnosis
- Vendor outage?
- Rate limit?
- Internal network?
- Prompt/token explosion?
- Tool dependency?
Corrective actions
- Adjust routing.
- Add circuit breaker.
- Update capacity plan.
- Review vendor SLA.
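The "add circuit breaker" action, combined with routing to a fallback, can be sketched as follows. This is a minimal single-threaded illustration, assuming the fallback path has already been approved. The class `CircuitBreaker`, its thresholds, and `call_with_fallback` are hypothetical names, and a production version would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after repeated failures; allow a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows to the primary
        # Open: only permit a trial call once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_fallback(
    breaker: CircuitBreaker,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
) -> str:
    """Try the primary provider unless the breaker is open, else fall back."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

The fallback callable can be a second provider, a retrieval-only path, or a template response, matching the degradation options listed under immediate containment.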
13. Runbook: Eval Regression
Trigger
- New prompt/model/index fails regression.
- Critical case fails.
- LLM judge drift detected.
Immediate containment
- Block release.
- Revert to previous version bundle.
- Open defect with owner.
Diagnosis
- Prompt changed?
- Model changed?
- Index changed?
- Rubric changed?
- Data changed?
Corrective actions
- Fix and rerun eval.
- Add missing test cases.
- Calibrate judge.
- Document release decision.
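The block-and-revert logic of this runbook can be sketched as a release gate over a version bundle. The sketch assumes a bundle is the triple of prompt, model, and index versions, and that the gate blocks on any failed critical case or on a pass-rate drop beyond a tolerance. The names `VersionBundle`, `CaseResult`, `release_gate`, `choose_bundle`, and the 0.02 tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VersionBundle:
    prompt_version: str
    model_version: str
    index_version: str


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    critical: bool


def release_gate(
    results: List[CaseResult], baseline_pass_rate: float, max_drop: float = 0.02
) -> Tuple[str, List[str]]:
    """Return ('release' or 'block', reasons)."""
    if not results:
        return "block", ["no eval results"]
    reasons: List[str] = []
    failed_critical = [r.case_id for r in results if r.critical and not r.passed]
    if failed_critical:
        reasons.append("critical cases failed: " + ", ".join(failed_critical))
    pass_rate = sum(r.passed for r in results) / len(results)
    if pass_rate < baseline_pass_rate - max_drop:
        reasons.append(
            f"pass rate {pass_rate:.2f} below baseline {baseline_pass_rate:.2f}"
        )
    return ("block" if reasons else "release"), reasons


def choose_bundle(
    candidate: VersionBundle,
    previous: VersionBundle,
    results: List[CaseResult],
    baseline_pass_rate: float,
) -> Tuple[VersionBundle, List[str]]:
    """Serve the candidate only if the gate passes; otherwise keep the previous bundle."""
    decision, reasons = release_gate(results, baseline_pass_rate)
    return (candidate if decision == "release" else previous), reasons
```

Returning the reasons alongside the decision supports the last corrective action, documenting the release decision.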
14. Runbook: Knowledge Staleness
Trigger
- Old policy cited.
- Product team updates fee or rule.
- User flags outdated answer.
Immediate containment
- Mark stale source.
- Remove from retrieval or lower priority.
- Notify knowledge owner.
- Add temporary warning if needed.
Corrective actions
- Reindex updated content.
- Update metadata and effective date.
- Rerun policy Q/A eval.
- Monitor stale citation rate.
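The stale citation rate mentioned above can be computed from source metadata. The sketch assumes each source carries an effective date and, once replaced, a superseded date, and it defines the rate as the share of cited sources already superseded on the day of the answer. The field names and this definition are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    source_id: str
    effective_date: date
    superseded_on: Optional[date] = None  # set when a newer version takes effect


def is_stale(source: Source, as_of: date) -> bool:
    """A source is stale once its replacement has taken effect."""
    return source.superseded_on is not None and source.superseded_on <= as_of


def stale_citation_rate(cited: List[Source], as_of: date) -> float:
    """Share of cited sources that were stale on the given date."""
    if not cited:
        return 0.0
    return sum(is_stale(s, as_of) for s in cited) / len(cited)


cited = [
    Source("fee-policy-v1", date(2023, 1, 1), superseded_on=date(2024, 3, 1)),
    Source("fee-policy-v2", date(2024, 3, 1)),
    Source("dispute-sop", date(2023, 6, 1)),
    Source("kyc-guide", date(2022, 9, 1)),
]
print(stale_citation_rate(cited, as_of=date(2024, 6, 1)))
```

The same `is_stale` check can be applied at retrieval time to remove or down-rank a source, which is the containment step in this runbook.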
15. Runbook: User Trust Drop
Trigger
- Adoption drops.
- Override rises.
- Users stop using suggestions.
- Qualitative feedback says “not trustworthy”.
Diagnosis
- Accuracy issue?
- Latency issue?
- Too much text?
- Poor workflow fit?
- Managers not reinforcing?
- Users fear audit/blame?
Corrective actions
- Improve UX and explanation.
- Add citations.
- Shorten output.
- Train champions.
- Clarify accountability.
- Narrow use case.
16. Financial Retail Examples
AML Copilot Operating Model
Owners:
- PM owns adoption and scope.
- AML Ops owns workflow.
- Compliance owns SAR boundary.
- Knowledge owner owns SOP and typology library.
- EvalOps owns red-flag eval.
- Architect owns audit and RAG/tool architecture.
Cadence:
- weekly QA defect review.
- monthly typology update.
- quarterly risk review.
Customer Service RAG
Owners:
- Knowledge manager owns articles.
- Contact center ops owns adoption.
- QA owns answer quality sample.
- PM owns roadmap.
- EvalOps owns policy Q/A regression.
Runbook focus:
- outdated policy.
- unauthorized promise.
- escalation miss.
Payments Exception Agent
Owners:
- Payment ops owns process.
- Architect owns tool gateway.
- Risk owns action approval.
- Platform owns idempotency and audit.
Runbook focus:
- duplicate action.
- wrong repair recommendation.
- payment rail outage.
Lending Assistant
Owners:
- Credit policy owns policy.
- Underwriting owns decisions.
- Compliance owns fair lending review.
- EvalOps owns reason-code eval.
Runbook focus:
- unsupported reason code.
- protected/proxy factor issue.
- incorrect policy citation.
17. Adoption and Change Management
Adoption is not login count. Adoption means users safely change how work is done.
Adoption metrics
- eligible users activated.
- eligible cases touched.
- repeat usage.
- accepted suggestions.
- edited suggestions.
- override reasons.
- time saved.
- QA defects.
- trust survey.
- escalation rate.
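Several of the metrics above reduce to simple ratios over usage counts. The sketch below is one possible way to compute them, assuming the counts are already available from product telemetry. The field names, the choice of denominators, and the function `adoption_summary` are assumptions, and metrics such as time saved or trust survey need other data sources.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AdoptionCounts:
    eligible_users: int
    activated_users: int
    eligible_cases: int
    touched_cases: int
    suggestions_shown: int
    suggestions_accepted: int
    suggestions_edited: int
    suggestions_overridden: int


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty denominator yields 0.0 instead of an error."""
    return numerator / denominator if denominator else 0.0


def adoption_summary(c: AdoptionCounts) -> Dict[str, float]:
    return {
        "user_activation_rate": _ratio(c.activated_users, c.eligible_users),
        "case_coverage_rate": _ratio(c.touched_cases, c.eligible_cases),
        "acceptance_rate": _ratio(c.suggestions_accepted, c.suggestions_shown),
        "edit_rate": _ratio(c.suggestions_edited, c.suggestions_shown),
        "override_rate": _ratio(c.suggestions_overridden, c.suggestions_shown),
    }


counts = AdoptionCounts(
    eligible_users=200, activated_users=120,
    eligible_cases=1000, touched_cases=250,
    suggestions_shown=400, suggestions_accepted=200,
    suggestions_edited=100, suggestions_overridden=100,
)
for name, value in adoption_summary(counts).items():
    print(f"{name}: {value:.2f}")
```

Using eligible users and eligible cases as denominators, rather than login counts, keeps the numbers aligned with the definition of adoption given at the start of this section.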
Trust-building
- Start with read-only support.
- Show citations and evidence.
- Keep user in control.
- Explain limitations.
- Capture feedback.
- Close feedback loop visibly.
Resistance handling
| Resistance | Response |
|---|---|
| AI will replace my judgment | Position as decision support, not decision owner |
| AI makes mistakes | Show eval, citations, human review and feedback loop |
| It slows me down | Reduce output length, improve workflow fit |
| I do not know who is accountable | Clarify RACI and approval boundary |
| I do not trust the knowledge | Show source, version, owner, update process |
18. Interview Talking Points
How do you operate AI after launch?
30-second answer:
I define an operating model before release: product owner, process owner, data and knowledge owners, platform owner, EvalOps owner, risk/compliance, security and frontline champions. I set RACI for changes, eval gates, incidents, vendor reviews and adoption. Production AI needs dashboards, runbooks, quarterly reviews and rollback paths.
What runbooks do you need for enterprise AI?
Answer:
At minimum: hallucination, prompt injection/tool misuse, data leakage, provider outage, eval regression, stale knowledge, user trust drop and high-risk output escape. Each runbook needs trigger, containment, diagnosis, corrective action, owner and postmortem.
Why is adoption part of architecture?
Answer:
If users do not safely change workflow, the architecture has not delivered value. Adoption affects where AI enters the process, what evidence users need, how controls are accepted, and what metrics prove ROI.
19. Operating Model Checklist
- AI product owner assigned.
- Business process owner assigned.
- Data owner assigned.
- Knowledge owner assigned.
- EvalOps owner assigned.
- Risk/compliance owner assigned.
- Security/privacy owner assigned.
- Vendor owner assigned.
- RACI approved.
- Release gate defined.
- Incident runbooks drafted.
- Monitoring dashboard live.
- Adoption dashboard live.
- Quarterly review scheduled.
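The checklist above can be turned into a pre-release check that lists what is still missing. This is a minimal sketch, assuming the owner map is a dictionary from role to a named person and that completed items are tracked as a set. The function name `readiness_gaps` and both data shapes are hypothetical.

```python
from typing import Dict, List, Set

# Owner roles and items taken from the checklist above.
REQUIRED_OWNERS: List[str] = [
    "AI product owner", "Business process owner", "Data owner",
    "Knowledge owner", "EvalOps owner", "Risk/compliance owner",
    "Security/privacy owner", "Vendor owner",
]
REQUIRED_ITEMS: List[str] = [
    "RACI approved", "Release gate defined", "Incident runbooks drafted",
    "Monitoring dashboard live", "Adoption dashboard live",
    "Quarterly review scheduled",
]


def readiness_gaps(owners: Dict[str, str], done: Set[str]) -> List[str]:
    """Return every unassigned owner role and every open checklist item."""
    gaps = [f"{role} not assigned" for role in REQUIRED_OWNERS if not owners.get(role)]
    gaps += [f"{item}: open" for item in REQUIRED_ITEMS if item not in done]
    return gaps


owners = {role: "named person" for role in REQUIRED_OWNERS}
owners["Vendor owner"] = ""  # role exists but nobody is named yet
done = set(REQUIRED_ITEMS) - {"Adoption dashboard live"}
for gap in readiness_gaps(owners, done):
    print(gap)
```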
20. Connections
| Existing asset | Use |
|---|---|
| docs/abpa/templates/09-operating-model-raci.md | Create RACI |
| docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md | Use release gate evidence |
| docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md | Connect eval to release and incident |
| docs/AI_VENDOR_BUILD_BUY_ADOPTION_PLAYBOOK.md | Vendor and adoption governance |
| docs/AI_GOVERNANCE_EVALOPS_RISK_90_PLAN.md | Deep governance practice |
| docs/AI_ARCHITECTURE_DIAGRAM_PLAYBOOK.md | Draw operating model and runbook architecture |
21. Final Rule
An AI system is not production-ready until you can answer:
Who owns quality?
Who owns data?
Who owns knowledge?
Who approves changes?
Who handles incidents?
Who can roll back?
Who measures adoption?
Who proves business value?