
AI Enterprise Reference Architecture / Control Plane Playbook

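The sources below are used to calibrate the governance, control, observability, and security language. This document is study, portfolio, and architecture-design training material; it is not legal, compliance, audit, or certification advice. Real projects need review against institution type, jurisdiction, customer impact, data types, vendor contracts, and internal policy.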


AI Enterprise Reference Architecture Control Plane Playbook

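Positioning: For senior AI PMs, AI Architects, Enterprise Architects, Platform PMs, Model Risk leads, and heads of digital in financial retail. It moves enterprise AI from one-off project designs to a reusable, governable, observable, and auditable reference architecture and control plane.
Goal: Be able to explain how the AI application plane, model plane, data / knowledge plane, tool / action plane, policy / control plane, eval / observability plane, identity / security plane, and governance / evidence plane combine into an enterprise-grade AI platform architecture.
Core view: The hard part of enterprise AI is not wiring one model into one process. It is getting every AI use case to perform routing, entitlement, policy, evaluation, monitoring, evidence, risk acceptance, and continuous improvement under one control plane.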

Source Anchors

Anchor	Link	How this document uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize AI risk identification, measurement, treatment, monitoring, and governance evidence.
ISO/IEC 42001:2023	https://www.iso.org/standard/81230.html	Serves as the AI Management System anchor, connecting organization-level policy, objectives, processes, operation, performance evaluation, and continual improvement to the control plane.
OpenTelemetry Documentation	https://opentelemetry.io/docs/	Uses the language of traces, metrics, logs, the collector, and vendor-neutral telemetry to design AI observability, trace replay, and the evidence pipeline.
NIST Cybersecurity Framework 2.0	https://www.nist.gov/cyberframework	Uses a cybersecurity risk management lens to fill in identity, protect, detect, respond, recover, security evidence, and resilience controls.

One-Sentence Positioning

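Enterprise AI Reference Architecture = putting business experience, workflow orchestration, models, data and knowledge, tools and actions, security and identity, policy control, evaluation and observability, and governance evidence onto one enterprise architecture blueprint, so that each AI use case is not a "one-off launch" but "runs under control within a unified control plane".

A shorter version for interviews:

The core of an AI reference architecture is the control plane: let models innovate, but let platform rules govern access, actions, risk, release, monitoring, and audit.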

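1. Why Enterprise AI Needs a Reference Architecture, Not a Project-Level Solution Diagram

A project-level solution diagram usually answers one question: "How do we get this use case working?" An enterprise reference architecture has to answer a different set of questions:

Which AI capabilities can be reused across business lines, and which must be isolated?
Which models, data, knowledge bases, tools, and agent workflows are approved for use?
Which models, retrievals, tools, policies, approvals, and human reviews did a given request pass through?
How are high-risk actions blocked, degraded, approved, dual-controlled, or recorded?
When a model, prompt, RAG, tool schema, or vendor changes, which use cases are affected?
Do pre-release eval, post-release monitoring, and post-incident replay form a closed loop?
Can management, audit, model risk, security, and business owners see the same body of evidence?

Project-level diagrams are valuable in the PoC stage, but in a financial retail enterprise they quickly get out of control:

Common problem with project-level AI solutions	How the enterprise reference architecture responds
Each team connects its own models, writes its own prompts, and keeps its own logs	Establish a model gateway, prompt registry, policy engine, trace schema, and evidence binder
Risk level is judged by the project manager's experience	Establish use case intake, risk tier, impact assessment, and release gate
RAG knowledge base entitlements are disconnected from business system entitlements	Establish a data / knowledge plane that binds data classification, document entitlement, retrieval filter, and citation evidence
Agents call tools directly with business system tokens	Establish a tool gateway, with the policy engine deciding on entitlement, purpose, tenant, risk, approval, and audit
Eval is a one-time test sheet before launch	Establish EvalOps that connects offline eval, online monitoring, production trace mining, and regression gate
Audit evidence is scattered across Jira, Confluence, notebooks, and vendor emails	Establish a governance / evidence plane where every decision, change, approval, exception, and incident is traceable
Security relies only on prompts to constrain the model	Establish an identity / security fabric that turns identity, session, least privilege, DLP, secrets, egress, and kill switch into platform controls

Enterprise AI architects need to move from "drawing application diagrams" to "designing an operable organizational control system". For people with a CBAP / PM / BA background, the key is not relearning basic architecture concepts but mapping business capabilities, risk accountability, process gates, platform components, and the evidence chain onto one enterprise architecture.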

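1.1 Design Boundaries of the Reference Architecture

The enterprise AI reference architecture defined here covers four types of scenarios:

Scenario	Examples	Architecture focus
Customer-facing AI	Customer service assistant, wealth management Q&A, complaint explanation, product recommendation	Customer impact, compliant wording, evidence citation, human escalation, complaint loop closure
Employee copilot	Branch staff assistant, operations knowledge assistant, compliance policy Q&A, developer assistant	Internal entitlements, knowledge quality, citations, employee reliance, training and supervision
Decision support AI	AML case summary, credit approval recommendation, fraud investigation, collections strategy recommendation	Human-AI division of labor, model risk, explanation evidence, override, second-line review
Agentic workflow	Ticket handling, refund drafts, CRM updates, case routing, vendor inquiries	tool gateway, approval, idempotency, rollback, action evidence, kill switch

Putting every AI scenario into a single runtime is not recommended. A reference architecture is a shared language and control model, not a single technology stack. A mature architecture lets different business units use different models, orchestration frameworks, knowledge bases, and deployment modes, but they must share the core control plane.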

2. The Eight-Layer Architecture: Enterprise AI Reference Architecture

2.1 Overview

Experience / Application Layer
  -> Orchestration / Workflow Layer
  -> Model Layer
  -> Data / Knowledge Layer
  -> Tool / Action Layer
  -> Policy / Control Layer
  -> Eval / Observability Layer
  -> Governance / Evidence Layer

Identity / Security Fabric cuts across all layers:
  identity, session, tenant, role, entitlement, purpose, secrets, DLP, network, egress, incident response

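These eight layers are not a strict linear call sequence; they are a decomposition of responsibilities across the enterprise AI platform. A single request may pass through several layers at once: the user issues a request in the application layer, the workflow orchestrates models and retrieval, the model proposes a tool call, the tool action goes through policy control, the full chain feeds observability, and release and change evidence enters the governance plane.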

2.2 Layer 1: Experience / Application

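Design question	Reference answer
Who uses the AI?	Customers, employees, operations specialists, risk investigators, compliance analysts, developers, management
Which business process step does the AI plug into?	Front-office Q&A, employee assistance, case triage, draft generation, decision recommendation, action execution
What boundaries does the user see?	AI role, confidence wording, evidence citations, prohibited uses, human escalation, appeal or correction entry point
How is over-reliance avoided?	Show evidence, limitations, approval status, human owner, decision finality

Key components:

Component	Responsibility
AI-enabled channel	Web, mobile, branch desktop, contact center console, operations workbench, developer IDE
Conversation / task UI	Conversation, forms, case view, review queue, approval packet, diff preview
Human oversight UX	approve, reject, edit, escalate, override, feedback, incident flag
Disclosure and boundary	Customer-visible AI usage notice, evidence citations, human service entry point, error correction path

Design principles for financial retail:

Customer-visible AI should not present probabilistic output as a definite commitment.
An employee copilot must let employees see which content comes from policy and which comes from model inference.
High-risk recommendations should enter as a draft or recommendation, not directly become the final business decision.
The approval screen must show tool parameters, data sources, risk tier, policy hits, and diff impact.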

2.3 Layer 2: Orchestration / Workflow

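This layer breaks business tasks into AI workflows but does not own final authorization.

Component	Responsibility
Workflow engine	Embeds AI steps into BPMN / case management / event-driven workflows
Agent orchestrator	Manages plan, memory, tool proposal, multi-step state, retry, timeout, fallback
Prompt assembly service	Assembles system instruction, developer instruction, task context, policy snippet, retrieved evidence
State manager	Persists workflow state, conversation state, case state, tool execution state
Queue and escalation	Human review queue, exception queue, SLA, fallback path

Core control points:

Control point	Description
Workflow insertion point	State clearly whether the AI suggests, drafts, performs QA, routes, or executes actions
Step budget	Caps the agent's maximum steps, tool calls, cost, and runtime
Deterministic wrapper	Critical processes constrain model output with state machines and rules; the model cannot freely alter the process
Human checkpoint	High-risk nodes force human confirmation or dual control
Fallback path	On model failure, retrieval failure, policy denial, or vendor outage, fall back to a human or rule-based path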
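
The step budget and fallback path control points can be sketched in a few lines. This is a minimal illustration, not a prescribed design: `StepBudget`, `run_agent`, the step dictionary shape, and the `"fallback_to_human"` label are all hypothetical names, and the runtime limit from the table is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class StepBudget:
    """Hypothetical per-run limits for an agent workflow."""
    max_steps: int
    max_tool_calls: int
    max_cost_usd: float
    steps: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, *, tool_call: bool, cost_usd: float) -> bool:
        """Record one agent step; return False once any limit is exceeded."""
        self.steps += 1
        self.tool_calls += int(tool_call)
        self.cost_usd += cost_usd
        return (
            self.steps <= self.max_steps
            and self.tool_calls <= self.max_tool_calls
            and self.cost_usd <= self.max_cost_usd
        )


def run_agent(planned_steps: list, budget: StepBudget) -> str:
    """Run steps until the plan ends or the budget trips, then fall back."""
    for step in planned_steps:
        if not budget.charge(tool_call=step["tool_call"], cost_usd=step["cost_usd"]):
            return "fallback_to_human"  # corresponds to the fallback path control point
        # A real orchestrator would call the model or tool here.
    return "completed"
```

The budget lives in the orchestrator rather than in the prompt, so the limit holds even when the model ignores instructions.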

2.4 Layer 3: Model

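The model layer is not about "picking the strongest model"; it is about governing model supply, routing, versions, cost, performance, and risk.

Component	Responsibility
Model gateway	Model routing, allowlist, version pinning, rate limit, token budget, fallback, logging
Model registry	Model card, vendor, purpose, risk tier, data boundary, approval status, retirement date
Inference runtime	Hosted models, privately deployed models, API models, small models, embedding, reranker, classifier
Prompt / template registry	Prompt version, owner, purpose, eval baseline, change history
Model policy	Which use cases may use which models, and which data may be sent to which vendors or deployment regions

Model selection should be split by task and risk:

Model type	Typical use	Control focus
Frontier LLM API	Complex reasoning, complex customer service, summarization, code, planning	Data boundary, vendor risk, cost, version changes, output control
Private / self-hosted model	Sensitive data, internal knowledge, fixed tasks, high-control scenarios	Model performance, operations capability, security patching, capacity planning
Small task model	Classification, routing, extraction, policy matching, risk labels	Explainability, thresholds, stability, regression testing
Embedding / reranker	RAG retrieval and ranking	Semantic drift, index version, recall, entitlement filtering
Judge model	Eval, QA, first-pass screening in online monitoring	Calibration, agreement with human review, false pass, judge drift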
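
As a rough sketch of how a model gateway could combine an allowlist, version pinning, and a data boundary check, consider the following. The route names, version strings, tier limits, and data classes are invented for illustration and do not refer to real models or vendors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRoute:
    model_id: str
    version: str             # pinned version, never "latest"
    deployment: str          # "vendor_api" or "private"
    max_risk_tier: int       # highest risk tier approved for this route
    data_classes: frozenset  # data classifications allowed on this route


# Hypothetical allowlist, ordered from cheapest to most controlled.
ROUTES = [
    ModelRoute("small-task-model", "1.4.0", "private", 1,
               frozenset({"public", "internal"})),
    ModelRoute("frontier-llm-api", "2024-06", "vendor_api", 2,
               frozenset({"public", "internal"})),
    ModelRoute("private-llm", "3.1.2", "private", 3,
               frozenset({"public", "internal", "confidential"})),
]


def select_route(risk_tier: int, data_class: str) -> Optional[ModelRoute]:
    """Return the first approved route, or None so the caller can fall back."""
    for route in ROUTES:
        if risk_tier <= route.max_risk_tier and data_class in route.data_classes:
            return route
    return None
```

With this table, `select_route(1, "confidential")` skips both routes that are not approved for confidential data and lands on the private deployment, while a tier 4 request gets `None` and has to follow the fallback path.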

2.5 Layer 4: Data / Knowledge

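The data layer of an AI system is not just a data lake or a vector store; it is the combination of knowledge availability, entitlement, lineage, quality, freshness, and evidence.

Component	Responsibility
Data source registry	Business systems, document repositories, CRM, transaction systems, policy library, case archive, third-party data
Knowledge ingestion pipeline	Cleaning, chunking, classification, masking, tagging, owner approval, index publishing
Vector / hybrid search	embedding, keyword, rerank, metadata filter, entitlement filter
Knowledge graph / ontology	Relationships among customers, accounts, products, transactions, merchants, entities, policies, and processes
Semantic layer	Metric definitions, business terms, calculation rules, entitlements, explainable queries
Data quality and lineage	Source, last updated time, owner, approval status, expiry date, quality rules

The data / knowledge plane in financial retail must handle:

Problem	Control
Sensitive data entering the prompt	data classification, field minimization, masking, DLP, purpose check
Stale knowledge base	freshness SLA, document owner, expiry review, retrieval time filter
Incorrect entitlement inheritance	document entitlement, customer scope, case scope, tenant isolation
Poisoned retrieval content	source trust level, approval workflow, document signature, poisoning eval
Citations that do not support the answer	citation requirement, groundedness eval, evidence bundle
Bias from reusing historical cases	dataset governance, representativeness review, fairness slice analysis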
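
To make the entitlement and freshness controls concrete, here is a small sketch of a post-retrieval filter. The `Doc` fields, the three classification levels, and the `entitled` function are assumptions chosen for the example; a production system would more likely push these conditions into the search query as metadata filters.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical ordering of classification levels.
LEVEL = {"public": 0, "internal": 1, "confidential": 2}


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    allowed_roles: frozenset
    classification: str
    expires_on: date


def entitled(docs, *, tenant, role, max_classification, today):
    """Keep only documents the caller may see and that are still current."""
    return [
        d for d in docs
        if d.tenant == tenant                                        # tenant isolation
        and role in d.allowed_roles                                  # document entitlement
        and LEVEL[d.classification] <= LEVEL[max_classification]     # data classification
        and d.expires_on >= today                                    # freshness / expiry
    ]
```

Passing `today` in explicitly keeps the filter deterministic, which helps when the same retrieval has to be replayed during an incident review.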

2.6 Layer 5: Tool / Action

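The tool / action plane is where agentic AI risk concentrates. The model may propose actions, but every action must pass through the platform's execution boundary.

Component	Responsibility
Tool catalog	Tool name, business capability, schema, owner, risk tier, allowed workflow
Tool gateway	Parameter validation, entitlement check entry point, dry run, idempotency, execution policy, result wrapping
Connector runtime	CRM, core banking, case management, payment, email, ticketing, data query
Action ledger	Records action requests, policy decisions, approvals, execution results, rollback or compensation
Sandbox / simulator	Generates a preview, impact assessment, or test execution before a high-risk action runs

Action tiers:

Action tier	Examples	Default controls
Read-only	Query public policy, read the current case, retrieve product descriptions	Least privilege, logging, scope filter
Draft	Draft a customer reply, generate a SAR narrative draft, generate a credit memo	Human editing and confirmation
Low-risk write	Update internal tags, create routine tasks, add a case note	Entitlement check, logging, reversible
Customer-impacting write	Change customer status, send customer notifications, adjust credit limit recommendations	Approval, customer impact notice, evidence record
Financial / legal / compliance action	Refunds, account freezes, SAR filing, closing AML cases, external regulatory reporting	Dual control, strong audit, tiered authorization, rollback or compensation plan
Prohibited action	Bypassing KYC, deleting audit logs, exporting all customer data, forging customer consent	Reject outright, alert, incident review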
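
The dry run, idempotency, and action ledger responsibilities of a tool gateway might look like the sketch below. `ToolGateway`, its `execute` signature, and the ledger tuple layout are hypothetical; entitlement and policy checks are deliberately omitted so the example stays focused on side-effect control.

```python
class ToolGateway:
    """Hypothetical gateway: the model proposes, the gateway executes."""

    def __init__(self):
        self._completed = {}   # idempotency_key -> stored result
        self.ledger = []       # append-only action ledger entries

    def execute(self, tool, params, idempotency_key, handler, dry_run=False):
        if dry_run:
            # Preview only: describe the effect without calling the handler.
            self.ledger.append((idempotency_key, tool, "dry_run"))
            return {"status": "preview", "tool": tool, "params": params}
        if idempotency_key in self._completed:
            # A retried request must not repeat the side effect.
            self.ledger.append((idempotency_key, tool, "replayed"))
            return self._completed[idempotency_key]
        result = {"status": "executed", "tool": tool, "result": handler(**params)}
        self._completed[idempotency_key] = result
        self.ledger.append((idempotency_key, tool, "executed"))
        return result
```

Because the handler is only reachable through `execute`, an agent that retries the same step after a timeout gets the stored result back instead of writing a second case note or issuing a second refund.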

2.7 Layer 6: Policy / Control

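The policy / control layer is the core of this document. It turns enterprise rules, risk appetite, security controls, compliance requirements, human review, and release gates into executable controls.

Component	Responsibility
Policy engine	RBAC, ABAC, purpose, tenant, risk tier, data sensitivity, workflow state, approval rule
Control library	Defines required controls, compensating controls, and evidence required per risk tier
Risk tier service	Derives the risk tier from use case, customer impact, action type, data sensitivity, and degree of automation
Eval gate	Uses eval contract, threshold, critical failure, and slice regression to decide release status
Approval engine	Human review, dual control, exception approval, risk acceptance, expiry, review cadence
Kill switch	Tiered shutdown by use case, model, tool, tenant, workflow, action tier, supplier
Policy-as-code repository	Policy versions, tests, approval, release, rollback, change evidence

Policy decisions are more than allow / deny:

Decision	When it applies	Output
allow	Low risk, entitlements match, data and purpose compliant	Continue execution and record the trace
allow_with_redaction	Usable, but needs masking or field minimization	Redacted context, response, or tool parameters
require_human_review	Complex judgment, medium customer impact, insufficient model confidence	Enters the human review queue
require_dual_control	High-risk compliance, funds, account, or regulatory actions	Approval by two independent roles
dry_run_only	Has side effects but can be previewed first	Generates a plan, diff, and impact scope
deny	Exceeds entitlement, crosses tenants, purpose mismatch, policy violation	Reject; record the rule and explanation
degrade	Model, tool, or knowledge source anomaly	Switch to a small model, rule path, or manual process
kill_switched	Scenario or component is suspended	Block and trigger an operations notification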
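
The decision vocabulary above maps naturally onto an enum plus an ordered rule list. The sketch below is one possible reading, not the document's specification: the request keys, the action tier labels, and the rule order (most restrictive first) are assumptions, and a real policy engine would load versioned rules from a policy-as-code repository instead of hardcoding them.

```python
from enum import Enum


class Decision(str, Enum):
    ALLOW = "allow"
    ALLOW_WITH_REDACTION = "allow_with_redaction"
    REQUIRE_HUMAN_REVIEW = "require_human_review"
    REQUIRE_DUAL_CONTROL = "require_dual_control"
    DRY_RUN_ONLY = "dry_run_only"
    DENY = "deny"
    DEGRADE = "degrade"
    KILL_SWITCHED = "kill_switched"


def decide(req: dict) -> Decision:
    """Evaluate checks from most to least restrictive; the first match wins."""
    if req.get("kill_switched"):
        return Decision.KILL_SWITCHED
    if req["action_tier"] == "prohibited" or req["user_tenant"] != req["resource_tenant"]:
        return Decision.DENY
    if req.get("dependency_unhealthy"):
        return Decision.DEGRADE
    if req["action_tier"] == "financial_legal_compliance":
        return Decision.REQUIRE_DUAL_CONTROL
    if req["action_tier"] == "customer_impacting_write":
        return Decision.REQUIRE_HUMAN_REVIEW
    # Assumption: low-risk writes must be previewed once before they run.
    if req["action_tier"] == "low_risk_write" and not req.get("preview_done"):
        return Decision.DRY_RUN_ONLY
    if req.get("has_sensitive_fields"):
        return Decision.ALLOW_WITH_REDACTION
    return Decision.ALLOW
```

Returning a typed decision rather than a boolean lets the orchestrator route each outcome to a different path (review queue, dual-control approval, preview, or refusal) and lets the trace record exactly which outcome applied.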

2.8 Layer 7: Eval / Observability

AI observability is not only about system availability; it must also answer "Is the AI still fit for its current purpose?"

Component	Responsibility
Trace instrumentation	request, identity, prompt, model, retrieval, tool, policy, approval, output, feedback
Metrics pipeline	latency, cost, quality, grounding, tool accuracy, fallback, override, complaint, incident
Log and event store	Structured logs, sensitive field handling, retention, query, audit export
Eval runner	offline eval, regression eval, red-team eval, slice eval, model migration eval
Online monitoring	production sample review, judge screening, human QA, drift signal, alert
Replay and incident lab	Replays production traces to verify fixes and regressions

Suggested AI trace schema:

Field	Description
trace_id / session_id / case_id	Links one AI interaction, workflow, and business case
user_id / role / tenant / purpose	Identity, role, tenant, business purpose
use_case_id / risk_tier	Use case ID and risk tier
model_id / model_version / prompt_version	Model and prompt versions
data_sources / retrieval_ids / citations	Retrieval sources, document IDs, cited evidence
tool_calls / action_tier / policy_decision	Tool calls, action tier, policy result
approval_id / approver_role	Human review or dual-control record
eval_scores / quality_flags	Quality, safety, grounding, and policy compliance signals
user_feedback / override / complaint	Production feedback and business consequences
final_output_hash / redaction_status	Output digest, redaction status, and audit reference
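
A subset of the trace schema can be expressed as a typed record. The sketch below uses field names from the table; the `AITrace` class, the `seal` helper, and the choice of SHA-256 for `final_output_hash` are assumptions made for illustration.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AITrace:
    trace_id: str
    user_id: str
    role: str
    tenant: str
    purpose: str
    use_case_id: str
    risk_tier: int
    model_id: str
    model_version: str
    prompt_version: str
    retrieval_ids: list = field(default_factory=list)
    policy_decision: str = "allow"
    approval_id: Optional[str] = None
    final_output_hash: str = ""
    redaction_status: str = "none"


def seal(trace: AITrace, final_output: str) -> dict:
    """Store a SHA-256 digest of the output instead of the raw text."""
    trace.final_output_hash = hashlib.sha256(final_output.encode("utf-8")).hexdigest()
    return asdict(trace)
```

Keeping only the digest in the trace lets an auditor confirm that a stored customer message matches what the model produced, without copying customer text into the observability store.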

2.9 Layer 8: Governance / Evidence

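The governance / evidence layer connects architecture, risk, compliance, model management, security, operations, and audit.

Component	Responsibility
AI inventory	use case, owner, risk tier, model, data, tool, supplier, status
Architecture decision record	Key architecture decisions, alternatives, trade-offs, scope of impact
Evidence binder	intake, risk assessment, design review, eval, approval, release, monitoring, incident, review
Management review dashboard	KPI, KRI, control exceptions, incident, audit findings, corrective actions
Exception and risk acceptance	Exception reason, compensating controls, approver, expiry review, closure evidence
Retirement record	Retirement reason, data handling, customer impact, dependency cleanup, evidence retention

Evidence is not material assembled after the fact; it is a by-product of runtime: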

AI use case intake
  -> risk tier and impact assessment
  -> architecture review
  -> eval contract and eval run
  -> release gate memo
  -> production monitoring
  -> incident / exception / change record
  -> management review and improvement action
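
One way to treat the lifecycle above as data is a binder that knows which stages it expects and can report what is missing. The stage identifiers, the `EvidenceBinder` class, and the completeness formula are hypothetical simplifications of the flow.

```python
# Hypothetical stage names, one per step of the lifecycle flow above.
REQUIRED_STAGES = [
    "intake", "risk_assessment", "architecture_review", "eval_run",
    "release_memo", "monitoring", "change_record", "management_review",
]


class EvidenceBinder:
    def __init__(self, use_case_id: str):
        self.use_case_id = use_case_id
        self._records = {}  # stage -> list of evidence references

    def attach(self, stage: str, reference: str) -> None:
        """Attach an evidence reference (for example a trace or memo ID)."""
        if stage not in REQUIRED_STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._records.setdefault(stage, []).append(reference)

    def missing_stages(self) -> list:
        """Stages with no evidence yet, in lifecycle order."""
        return [s for s in REQUIRED_STAGES if s not in self._records]

    def completeness(self) -> float:
        """Share of lifecycle stages that have at least one reference."""
        return 1 - len(self.missing_stages()) / len(REQUIRED_STAGES)
```

A completeness figure like this is the kind of value a binder completeness dashboard could show per use case, so gaps surface before an audit request rather than after it.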

2.10 Identity / Security Fabric

Identity / security is not a ninth layer; it is a security fabric that cuts across all layers:

Security object	Control points
Human identity	SSO, MFA, role, department, license, training status, least privilege
Workload identity	service account, mTLS, secret rotation, scoped credential, service-to-service auth
Tenant / customer scope	Tenant isolation, customer authorization, case scope, account scope, region boundary
Purpose binding	Every query or action must be bound to a business purpose, such as servicing, AML investigation, or credit review
Data protection	classification, masking, tokenization, encryption, DLP, log redaction
Tool credential	The model never touches the underlying token; the tool gateway executes scoped actions on the system's behalf
Network and egress	allowlist, private endpoint, vendor boundary, external send control
Incident response	alert, triage, containment, kill switch, customer impact assessment, postmortem

3. Control Plane vs Data Plane

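3.1 Core Differences

Dimension	Data Plane	Control Plane
Focus	How business requests are actually processed	What is allowed, routed, recorded, evaluated, and approved
Typical objects	User input, retrieved content, model responses, tool results, business data	policy, identity, risk tier, model route, eval gate, trace, approval, evidence
Rate of change	Changes with every request	Changes with policy, use case, model, process, and governance requirements
Success criteria	The business task is completed	The task is completed within the right boundaries and is explainable, monitorable, and auditable
Failure risk	Wrong answers, slowness, high cost, process interruption	Privilege overreach, violations, untraceability, inability to prove, inability to shut down, uncontrolled risk

In one sentence:

The data plane lets AI do the work; the control plane decides under which identity, purpose, risk tier, and evidence requirements the AI does it.

3.2 Core Control Plane Components

Control component	Key responsibilities	Financial retail example
Model gateway	Model routing, version pinning, vendor boundary, cost limits, fallback, call logging	Credit and AML use approved models; low-risk customer service questions can use a cheaper model
Tool gateway	Tool catalog, parameter validation, entitlements, idempotency, dry run, approval, execution logging	AI may draft a refund, but the actual refund must go through approval and payment system controls
Policy engine	Decisions based on RBAC / ABAC / purpose / risk / data sensitivity / workflow state	Branch staff cannot use AI to query data for customers outside their own service scope
Eval gate	Pre-release gates on quality, safety, compliance, and business metrics	Customer reply AI must pass groundedness, policy compliance, and complaint-sensitive wording evals
Logging / trace	End-to-end trace, structured events, audit export, replay	An AML case summary can be traced to retrieved documents, model version, analyst edits, and final decision
Risk tier	Tiering by customer impact, degree of automation, action risk, and data sensitivity	Internal knowledge Q&A is low risk; account freeze recommendations and SAR narratives are high risk
Approval and exception	Human review, dual control, exception approval, expiry review	Large refunds, regulatory reporting, and account restrictions require dual control
Kill switch	Tiered shutdown by model, use case, tool, tenant, and supplier	When a model produces incorrect financial advice, only wealth Q&A is shut down; internal policy retrieval is unaffected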
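
The layered kill switch in the last row can be sketched as a set of disabled (scope, value) pairs checked on every request. The scope names follow the table; the `KillSwitch` class and the request dictionary shape are hypothetical.

```python
class KillSwitch:
    """Hypothetical layered kill switch keyed by (scope, value) pairs."""

    SCOPES = ("use_case", "model", "tool", "tenant", "supplier")

    def __init__(self):
        self._disabled = set()

    def disable(self, scope: str, value: str) -> None:
        if scope not in self.SCOPES:
            raise ValueError(f"unknown scope: {scope}")
        self._disabled.add((scope, value))

    def enable(self, scope: str, value: str) -> None:
        self._disabled.discard((scope, value))

    def blocked_by(self, request: dict) -> list:
        """Return the (scope, value) pairs that stop this request, if any."""
        return [
            (scope, request[scope])
            for scope in self.SCOPES
            if (scope, request.get(scope)) in self._disabled
        ]
```

Disabling `("use_case", "wealth_qa")` blocks wealth Q&A while a policy retrieval request on the same model passes, which mirrors the example in the table; disabling a supplier instead would stop every use case routed to that supplier.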

3.3 Logical Sequence of a Controlled AI Request

sequenceDiagram
  participant U as User / Channel
  participant I as Identity and Session
  participant A as AI Application
  participant W as Orchestration / Workflow
  participant P as Policy Engine
  participant K as Data / Knowledge Plane
  participant M as Model Gateway
  participant T as Tool Gateway
  participant E as Eval / Observability
  participant G as Governance Evidence

  U->>I: Authenticate, role, tenant, purpose
  I->>A: Session context and entitlement
  A->>W: User task and case context
  W->>P: Check use case, risk tier, data access
  P->>K: Allowed retrieval scope
  K->>W: Evidence with labels and citations
  W->>M: Model request with prompt version and context labels
  M->>E: Log model route and token/cost metrics
  M->>W: Response or tool proposal
  W->>P: Check proposed action, data, purpose, risk
  alt tool allowed or approved
    P->>T: allow / approval requirement / dry run
    T->>E: Log tool request and result
    T->>W: Scoped tool result
  else denied
    P->>E: Log denial and policy rule
    P->>W: Safe refusal or escalation path
  end
  W->>A: Answer, draft, decision support, or escalation
  A->>E: Final output, feedback hook, quality signal
  E->>G: Evidence bundle, gate status, monitoring record

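3.4 Risk Tier Design

Risk tier is not a purely technical classification; it should combine business impact, customer impact, action capability, data sensitivity, degree of automation, reversibility, and regulatory sensitivity.

Tier	Definition	Examples	Required controls
Tier 0 Utility	Handles no sensitive data and does not affect customers or decisions	Document summaries, meeting minutes, Q&A over internally public knowledge	Basic logging, model allowlist, user notice
Tier 1 Internal Assist	Assists internal employees, with low- or medium-sensitivity data and no direct customer decision	Branch staff policy Q&A, operations SOP assistant	Identity entitlements, RAG citation, feedback, sampled QA
Tier 2 Customer / Case Support	Affects customer communication or case handling, with final human confirmation	Customer reply drafts, complaint summaries, AML case summary	eval gate, human review, trace, knowledge entitlements, policy compliance
Tier 3 Regulated Decision Support	Affects credit, fraud, AML, funds, or customer rights; a human is responsible for the final decision	Credit approval recommendations, fraud investigation recommendations, SAR narrative drafts	Model risk assessment, expert eval, dual control, override log, monitoring
Tier 4 Automated High-Impact Action	Automatically executes actions with high customer or regulatory impact	Automatic account freeze, automatic loan decline, automatic regulatory reporting	Strict approval, legal and compliance sign-off, strong audit, real-time monitoring, kill switch, management risk acceptance

Risk tier should drive control requirements: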

risk tier
  -> allowed model class
  -> allowed data class
  -> allowed tool action tier
  -> eval depth
  -> human oversight mode
  -> logging retention
  -> release authority
  -> monitoring cadence
  -> evidence binder depth
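
As a sketch of risk tier driving control requirements, the required controls from the tier table can be held in a lookup and compared against what a use case has implemented. The control identifiers are shorthand invented for this example, and the rule that each tier inherits the controls of the tiers below it is an assumption the table does not state.

```python
# Shorthand identifiers for the "Required controls" column of the tier table.
REQUIRED_CONTROLS = {
    0: {"basic_logging", "model_allowlist", "user_notice"},
    1: {"identity_entitlement", "rag_citation", "feedback", "sampled_qa"},
    2: {"eval_gate", "human_review", "trace", "knowledge_entitlement",
        "policy_compliance"},
    3: {"model_risk_assessment", "expert_eval", "dual_control", "override_log",
        "monitoring"},
    4: {"strict_approval", "legal_compliance_signoff", "strong_audit",
        "realtime_monitoring", "kill_switch", "management_risk_acceptance"},
}


def controls_for(tier: int) -> set:
    """Assumption: each tier inherits every control of the tiers below it."""
    required = set()
    for level in range(tier + 1):
        required |= REQUIRED_CONTROLS[level]
    return required


def control_gaps(tier: int, implemented) -> list:
    """Controls still missing before the use case can pass its release gate."""
    return sorted(controls_for(tier) - set(implemented))
```

A release gate could call `control_gaps` with the controls recorded for a use case and refuse to proceed while the list is non-empty.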

4. Reference Architecture Views

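A reference architecture cannot be a single big diagram. Different audiences need different views, but every view must point to the same set of architecture objects.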

4.1 Capability View

The capability view explains to the CTO, CDAO, COO, CRO, and business leaders which capabilities the enterprise AI platform needs to build.

Capability domain	L2 capabilities
AI Product and Portfolio	use case intake, value hypothesis, risk tiering, roadmap, benefits tracking
AI Experience	customer assistant, employee copilot, case workbench, approval UX, feedback UX
AI Workflow	orchestration, agent state, human handoff, case routing, fallback, SLA
Model Platform	model gateway, model registry, prompt registry, inference runtime, cost management
Knowledge Platform	ingestion, classification, RAG, knowledge graph, semantic layer, lineage
Tool and Action Platform	tool catalog, tool gateway, connector runtime, dry run, action ledger
Policy and Control	policy engine, risk tier, approval, exception, kill switch, control library
EvalOps	dataset registry, eval runner, judge calibration, release gate, production eval
Observability	traces, metrics, logs, dashboards, alerts, replay, incident analysis
Identity and Security	entitlement, purpose binding, DLP, secrets, egress, tenant isolation
Governance and Evidence	AI inventory, ADR, evidence binder, management review, audit export

The output of the capability view is not a technology selection but a platform build roadmap:

Phase 1: inventory + model gateway + basic RAG + trace
Phase 2: policy engine + tool gateway + eval gate + risk tier
Phase 3: evidence binder + production eval + advanced workflow + management dashboard

4.2 C4 / Container View

The C4 container view explains system boundaries, runtime components, and the allocation of responsibilities to the architecture review board.

flowchart TB
  Channels[Channels: Web, Mobile, Contact Center, Ops Workbench] --> App[AI Application Services]
  App --> IAM[Identity and Entitlement Service]
  App --> Orchestrator[AI Orchestration Service]
  Orchestrator --> Prompt[Prompt and Context Service]
  Orchestrator --> ModelGW[Model Gateway]
  Orchestrator --> Retrieval[Knowledge Retrieval Service]
  Orchestrator --> ToolGW[Tool Gateway]
  Retrieval --> Knowledge[(Knowledge Index / Vector Store)]
  Retrieval --> DataLake[(Data Lake / Document Store)]
  ToolGW --> Policy[Policy Engine]
  ToolGW --> Connectors[Business Connectors]
  Connectors --> Core[Core Banking / CRM / Case / Payment / AML Systems]
  Policy --> ControlLib[(Control Library and Policy Store)]
  ModelGW --> ModelReg[(Model and Prompt Registry)]
  Orchestrator --> Trace[Trace and Event Collector]
  ModelGW --> Trace
  ToolGW --> Trace
  Policy --> Trace
  Trace --> Obs[(Observability Store)]
  Trace --> Eval[EvalOps Platform]
  Eval --> Evidence[(Evidence Binder)]
  ControlLib --> Evidence
  ModelReg --> Evidence

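Container responsibility boundaries:

Container	Owner	Main interfaces	Key risks
AI Application Services	Product engineering	channel API, workflow API	Unclear user boundaries, over-reliance, customer impact
AI Orchestration Service	AI platform	task API, state API, tool proposal	Agent runaway, untraceable state
Model Gateway	AI platform / architecture	model route API	Vendor changes, data leaving the boundary, runaway cost
Knowledge Retrieval Service	Data / AI platform	retrieval API	Entitlement filter errors, stale knowledge, insufficient evidence
Tool Gateway	Platform / security	tool execution API	Unauthorized actions, side effects, approval bypass
Policy Engine	Risk / security / architecture	policy decision API	Policy conflicts, unversioned rules, uncontrolled exceptions
EvalOps Platform	AI quality / model risk	eval run API	Insufficient datasets, uncalibrated judge, ineffective gates
Evidence Binder	Governance / audit	evidence API / export	Incomplete evidence, sensitive log leakage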

4.3 Sequence View

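The sequence view explains the dynamic behavior of key scenarios, especially where the control points occur.

Draw at least four kinds of sequence for each high-risk AI use case:

Sequence	Must show
Normal path	Identity, retrieval, model, tool, policy, approval, output, trace
Policy denial path	Which policy denied the request, how it is explained to the user, and how it is recorded
Human review path	What evidence the reviewer sees and how they approve, modify, or reject
Incident / replay path	How a production trace becomes an incident, and how it is replayed and fixed

Example: customer service AI generates a refund handling draft: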

Customer complaint
  -> agent desktop loads case
  -> identity service confirms agent role and customer scope
  -> retrieval service fetches refund policy and transaction history with entitlement filter
  -> model gateway generates refund rationale and proposed action
  -> tool gateway classifies refund as customer-impacting action
  -> policy engine requires supervisor approval above threshold
  -> supervisor sees approval packet: reason, policy citation, amount, customer impact, prior refunds
  -> approved action executed through payment / case system
  -> trace, approval, tool result and final customer message stored in evidence binder

4.4 Control View

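The control view maps architecture components to control objectives, risks, controls, evidence, and owners. It is the most valuable view in a financial retail architecture review.

Risk	Control objective	Control	Evidence	Owner
A model accesses unapproved data	Models are called only within permitted data boundaries	model gateway data policy, DLP, region route	model route log, DLP event, policy version	AI Platform / Security
RAG leaks customer or tenant data	Retrieval must honor entitlement and purpose	metadata filter, case scope, document ACL	retrieval trace, document ids, access decision	Data Owner / Security
An agent calls tools beyond its authority	Tool actions must be bound to identity, role, purpose, and risk tier	tool gateway + policy engine	tool decision log, approval record	Platform / Risk
High-risk output goes live without evaluation	Releases must pass the eval gate	eval contract, critical failure threshold, slice regression	eval run report, release memo	Model Risk / Product
Incidents cannot be reconstructed	All key AI interactions are traceable and replayable	trace schema, log retention, output hash	trace link, incident replay report	Observability / Ops
Audit evidence is incomplete	Lifecycle evidence is archived as routine	evidence binder required fields	binder completeness dashboard	Governance / Audit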

4.5 Operating View

The operating view describes how the organization runs this reference architecture.

Forum / process	Frequency	Inputs	Outputs
AI Use Case Intake	Weekly or biweekly	use case brief, value, data, users, automation level	risk tier, owner, review path
Architecture Review	At project milestones	C4, sequence, data flow, control view	ADR, required controls, conditions
Eval and Release Gate	At every major change	eval report, risk assessment, monitoring plan	go / limited go / no-go / exception
Production Monitoring Review	Weekly to monthly, by risk tier	quality, cost, latency, override, incident, complaint	remediation, dataset update, policy change
AI Incident Review	Event-triggered	trace, impact, root cause, containment	postmortem, control improvement, regression cases
Management Review	Quarterly or annually	KPI / KRI, exceptions, audit findings, portfolio	risk appetite update, resource decision, corrective action

RACI example:

Activity	Product Owner	AI Platform	Enterprise Architect	Data Owner	Security	Model Risk	Compliance	Operations
Use case value and scope	A	C	C	C	C	C	C	C
Reference architecture conformance	C	R	A	C	R	C	C	C
Model and prompt route	C	A	R	C	C	R	C	I
Data / knowledge access	C	R	C	A	R	C	C	I
Tool action policy	R	R	C	C	A	C	R	C
Eval gate	R	R	C	C	C	A	C	C
Production monitoring	R	R	C	C	R	R	C	A
Evidence binder	R	R	C	R	R	R	R	C

5. Financial Retail Case: AML / Credit / Customer Service AI Platform Reference Architecture

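5.1 Case Background

A regional bank wants to build a unified AI platform that supports three groups of priority use cases:

Domain	Use case	Business goal	Risk profile
AML	Alert triage summary, entity resolution explanation, SAR narrative draft	Reduce analyst preparation time and improve case consistency	Compliance-sensitive, high evidence requirements, high-risk cases must not be closed automatically
Credit	Credit application summary, policy matching, approval memo draft, anomalous document flags	Improve approval efficiency and policy consistency	Customer rights, high-impact decisions, fairness, model risk
Customer Service	Customer service knowledge Q&A, complaint summaries, reply drafts, refund recommendations	Reduce average handling time and improve customer experience	Customer-visible output, risk of misleading customers, refund and commitment risk

Key architecture choices:

Do not build three separate AI platforms for the three business lines.
Build a unified model gateway, knowledge ingestion, policy engine, tool gateway, eval gate, trace schema, and evidence binder.
Let each business domain own its knowledge base, workflows, eval dataset, approval rules, and risk thresholds.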

5.2 Capability Mapping

Shared platform capability	AML	Credit	Customer Service
Model gateway	Approved LLM generates case summary and narrative draft	Document summaries, policy explanation, approval memo	Reply drafts, intent detection, knowledge Q&A
Knowledge plane	AML policy, typology, case history, entity graph	credit policy, product terms, underwriting guideline	FAQ, policy, product guide, complaint handbook
Tool gateway	case note update, entity lookup, SAR draft package	LOS case update, document checklist, policy checklist	CRM note, ticket routing, refund draft
Policy engine	No automatic closure of SAR / high-risk cases	No automatic approve / decline of credit	Refunds above the amount threshold require approval
EvalOps	narrative completeness, evidence citation, policy compliance	policy match, fairness slice, reason code quality	groundedness, tone, complaint escalation
Observability	analyst override, case rework, regulatory finding	override, appeal, adverse action review	CSAT, complaint, refund error, handoff
Evidence binder	case trace, source documents, analyst decision	model / policy / decision support evidence	customer response trace, approval and escalation

5.3 AML Reference Architecture

AML case workbench
  -> identity confirms analyst role, unit, case assignment
  -> orchestration loads alert, customer profile, transaction pattern, prior case links
  -> knowledge plane retrieves AML policy, typology, case procedures, entity graph evidence
  -> model gateway generates case summary and suspicious pattern explanation
  -> policy engine blocks auto-disposition of high-risk alerts
  -> tool gateway allows draft case note, not final SAR submission
  -> analyst reviews, edits, attaches evidence, decides disposition
  -> trace and decision package enter evidence binder

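Key AML controls:

Risk	Control
AI fabricates reasons for suspicious activity	groundedness eval, source citation, unsupported claim flag
AI misses key transaction patterns	typology coverage eval, historical case backtesting, analyst override review
Suspicious cases are closed automatically	policy engine prohibits high-risk auto-close; a human makes the final decision
Audit cannot reconstruct the case	trace records alert, source, model, prompt, draft, analyst edit, final disposition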

5.4 Credit Reference Architecture

Loan origination system
  -> applicant documents and bureau data loaded under purpose binding
  -> AI summarizes application and checks required documents
  -> credit policy retrieval returns applicable rules and exceptions
  -> model drafts underwriting memo and reason code candidates
  -> policy engine classifies as regulated decision support
  -> human underwriter confirms facts, edits memo, makes decision in LOS
  -> model output, human edits, override and final decision evidence are linked

Key credit controls:

Risk	Control
AI introduces factual errors or misreads policy	document-grounded answer, policy citation, underwriter confirmation
AI affects a protected class or produces unfair recommendations	fairness slice eval, feature policy, reason code review, model risk validation
A customer is wrongly declined or the decline cannot be explained	adverse action reason trace, human accountability, appeal evidence
A model vendor or version change affects decision support	version pinning, regression eval, release gate, change notification

5.5 Customer Service Reference Architecture

Contact center console
  -> agent handles authenticated customer session
  -> AI retrieves product policy, customer case history and current complaint context
  -> model generates answer draft with citations and escalation flag
  -> policy engine checks complaint, vulnerability, refund, legal phrase and customer impact
  -> low-risk answer can be sent by agent after review
  -> refund or commitment language enters approval queue
  -> final response and approval trace are stored for QA and complaint monitoring

Key customer service controls:

Risk	Control
Incorrect customer-visible commitments	approved phrase library, policy compliance eval, human review
Complaints are not escalated	complaint intent classifier, mandatory escalation policy, QA sampling
Sensitive information leaks	DLP, customer scope, masked display, external send control
Quality declines without the business noticing	online monitoring, CSAT / complaint / recontact correlation, trace sampling

5.6 Route-to-Release Shared by the Three Domains

Use case intake
  -> risk tier assignment
  -> architecture view package
  -> data / knowledge readiness review
  -> model and tool route approval
  -> eval contract
  -> offline eval and red-team
  -> pilot release with monitoring
  -> production gate
  -> ongoing review and evidence refresh

6. Artifact Templates

6.1 Reference Architecture Canvas

Section	Content
Use case / domain	Business domain, process step, users, customer impact, main value hypothesis
Risk tier	Tier 0-4, with tiering rationale: data sensitivity, action capability, degree of automation, reversibility, regulatory sensitivity
Experience / application	Entry point, user roles, AI boundaries, human escalation, feedback entry point
Orchestration / workflow	workflow steps, AI insertion point, state, fallback, SLA
Model plane	model class, model gateway route, prompt version, vendor or deployment boundary
Data / knowledge plane	Data sources, knowledge sources, classification, entitlement, freshness, lineage
Tool / action plane	tool catalog, action tier, schema, dry run, approval, idempotency
Policy / control plane	policy decisions, required controls, risk acceptance, kill switch
Eval / observability plane	eval datasets, metrics, critical failures, trace schema, monitoring
Identity / security fabric	SSO, role, tenant, purpose, DLP, secrets, egress, incident controls
Governance / evidence plane	AI inventory, ADR, release memo, evidence binder, review cadence
Key ADRs	Key decisions, alternatives, trade-offs, consequences
Open risks and compensating controls	Risks not fully eliminated, compensating controls, owner, review date

6.2 Control Plane Checklist

Question	Yes / No	Evidence
Is the use case registered in the AI inventory with a clear owner, risk tier, and status?		inventory record
Do model calls go through the model gateway rather than the application calling the vendor API directly?		model route log
Do prompts, models, RAG indexes, and tool schemas have versions and owners?		registry records
Does retrieval enforce data classification, entitlement, purpose, and freshness controls?		retrieval trace
Do tool calls go through the tool gateway and policy engine?		tool decision log
Do high-risk actions support dry run, approval, dual control, or outright prohibition?		approval policy
Does the risk tier drive eval depth, human oversight, logging, and release authority?		risk tier matrix
Does the eval gate include critical failures and business scenario slices rather than only an average score?		eval report
Does the production trace record identity, model, prompt, retrieval, tool, policy, output, and feedback?		trace sample
Do DLP, secrets, log redaction, and retention cover prompt, context, tool result, and output?		security test
Can the kill switch be applied in tiers by use case, model, tool, tenant, and workflow?		kill switch drill
Can the evidence binder export intake, design, eval, release, monitoring, incident, and exception evidence?		binder export

6.3 Architecture Decision Matrix

Decision	Option A	Option B	Option C	Recommended decision	Rationale	Evidence required
Model deployment	Public API	Private hosted	Hybrid route	Hybrid route by risk tier	Low-risk use cases use the API for speed; highly sensitive data and high-impact decisions use a controlled deployment or dedicated route	model policy, data classification, cost model
RAG architecture	Per-app RAG	Shared retrieval service	Federated knowledge plane	Federated knowledge plane	Shared ingestion and governance while keeping business-domain knowledge isolated	knowledge registry, ACL test, freshness SLA
Tool execution	App direct API	Agent direct token	Tool gateway	Tool gateway	The model should not hold business system tokens; actions must be auditable and approvable	tool catalog, policy log, approval test
Policy implementation	Hardcoded app rules	Central policy engine	Manual review only	Central policy engine + human review for high risk	Rules are versionable, testable, and reusable; high risk retains human accountability	policy tests, decision logs
Eval strategy	Manual UAT	Generic benchmark	Use-case eval contract	Use-case eval contract	Financial retail risk depends on scenario, policy, data, and customer impact	eval dataset, threshold, release memo
Observability	Infra metrics only	App logs	AI trace schema	AI trace schema + OpenTelemetry-compatible export	Needs to link models, retrieval, tools, policy, and business feedback	trace sample, dashboard
Evidence	Static document folder	Runtime evidence binder	Audit-only sampling	Runtime evidence binder	Evidence should be a by-product of runtime, which lowers audit preparation cost	binder completeness report

6.4 Route-to-Release Diagram

flowchart LR
  A[Use Case Intake] --> B[Risk Tier and Impact Assessment]
  B --> C[Reference Architecture Canvas]
  C --> D[Data / Knowledge Readiness]
  C --> E[Model and Tool Route Approval]
  D --> F[Eval Contract]
  E --> F
  F --> G[Offline Eval and Red-team]
  G --> H{Eval Gate}
  H -->|No-go| R[Remediation and Re-test]
  R --> G
  H -->|Limited go| P[Pilot with Monitoring]
  H -->|Go| P
  P --> I{Production Gate}
  I -->|Scale| S[Production Release]
  I -->|Restrict| L[Limited Scope Release]
  I -->|Rollback| RB[Rollback / Manual Process]
  S --> M[Ongoing Monitoring]
  L --> M
  M --> N[Trace Mining and Incident Review]
  N --> O[Dataset / Policy / Control Update]
  O --> F
  M --> V[Evidence Binder and Management Review]
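
The Eval Gate node in the flow can be sketched as a pure function over an eval report and an eval contract. Everything here is illustrative: the dictionary keys, the thresholds, and in particular the rule that a slice regression leads to "limited go" rather than "no-go" are assumptions, since the mapping from findings to gate outcomes is left to each organization.

```python
def eval_gate(results: dict, contract: dict) -> str:
    """Return "go", "limited go", or "no-go" for one eval run."""
    # Any critical failure blocks release, whatever the average scores say.
    if results["critical_failures"] > contract["max_critical_failures"]:
        return "no-go"
    # Every contracted metric must meet its minimum; a missing metric fails.
    for metric, minimum in contract["thresholds"].items():
        if results["metrics"].get(metric, 0.0) < minimum:
            return "no-go"
    # Assumption: a regression on any slice restricts the release scope.
    worst_drop = max(results["slice_regressions"].values(), default=0.0)
    if worst_drop > contract["max_slice_regression"]:
        return "limited go"
    return "go"
```

Checking critical failures and slices before looking at aggregate scores keeps the gate from passing a release on the strength of a good average alone.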

Minimum release memo fields:

Field	Content
Use case and release scope	Which business domain, users, processes, regions, and channels
Risk tier and rationale	Tier, customer impact, data types, action capability, degree of automation
Architecture summary	Eight-layer architecture summary, key components, boundaries, control points
Model / data / tool versions	Model, prompt, knowledge index, tool schema, policy version
Eval result	Datasets, metrics, critical failures, slice regression, red-team outcome
Control status	required controls, exceptions, compensating controls, owner
Monitoring plan	signals, thresholds, sample rate, review cadence, incident triggers
Rollback / kill switch	Trigger conditions, shutdown scope, manual fallback process
Approval	Product, Architecture, Security, Risk, Compliance, Operations

7. Interview Answers

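7.1 30-Second Version

Enterprise AI cannot rely on project-level solution diagrams alone, because the real risk comes from models, data, tools, entitlements, evaluation, and audit being uncontrolled across use cases. I would use an enterprise AI reference architecture to connect eight layers: experience, workflow, model, data / knowledge, tool / action, policy / control, eval / observability, and governance / evidence. I would then form the control plane from the model gateway, tool gateway, policy engine, eval gate, trace, and risk tier. That way every AI use case performs model routing, data entitlement, action approval, release gating, production monitoring, and evidence capture within one boundary.

7.2 2-Minute Version

I would first separate the data plane from the control plane. The data plane handles the actual execution of business requests, such as retrieving knowledge, calling models, generating replies, and proposing tool actions. The control plane decides whether those actions are allowed, whether they need approval, how they are evaluated, how they are recorded, and how they are shut down.

Architecturally, I would split it into eight layers. The first is experience / application, which handles the customer, employee, operations, and approval experience. The second is orchestration / workflow, which manages workflow state, agent steps, fallback, and human escalation. The third is the model plane, which uses the model gateway to manage model routing, versions, vendor boundaries, and cost. The fourth is the data / knowledge plane, which manages data classification, knowledge bases, RAG, entitlements, and evidence citation. The fifth is the tool / action plane, which uses the tool gateway to manage business system actions, schemas, idempotency, and execution logs. The sixth is the policy / control plane, which manages risk with the policy engine, risk tier, approval, eval gate, and kill switch. The seventh is eval / observability, which connects offline eval, online monitoring, trace, metrics, logs, and replay. The eighth is governance / evidence, which produces the AI inventory, ADRs, release memo, evidence binder, and management review.

In financial retail, AML, credit, and customer service can share the model gateway, policy engine, tool gateway, EvalOps, and trace schema while keeping their own knowledge bases, eval sets, approval rules, and risk thresholds. For example, AML can let AI generate case summaries and SAR narrative drafts, but the policy engine prohibits automatic closure of high-risk cases; credit AI can generate approval memos, but the underwriter owns the final decision; customer service AI can generate reply drafts, but refund or commitment actions must go through approval.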

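7.3 Chief Architect / CTO Version

I would define the enterprise AI reference architecture as a platform control plane, not a single application framework. What the CTO needs to care about is not "which team connected which model" but whether the enterprise has unified AI route, control, observe, and govern capabilities.

My target architecture has three principles.

First, runtime decoupling. Application, workflow, model, knowledge, and tool can be implemented by different teams and technology stacks, but they must connect to unified control through the model gateway, knowledge entitlement, tool gateway, and policy engine.

Second, risk-based control. Different risk tiers get different models, data, tools, eval, approval, log retention, and release authority. Low-risk internal knowledge Q&A does not need the same governance as automated account freezing, but both must be in the inventory, trace, and evidence.

Third, evidence by design. Every key request generates a structured trace that links identity, purpose, model, prompt, retrieval, tool, policy, approval, output, and feedback. The pre-release eval gate, post-release monitoring, post-incident replay, and management review all derive from the same evidence chain.

For financial retail, the value of this architecture is that it productizes innovation speed and risk control together. The platform team provides reusable capabilities, business teams focus on use cases and value, risk and compliance teams exercise effective challenge through the control view and evidence binder, and the architecture team uses the reference architecture to ensure that scaling does not create a pile of ungovernable AI silos.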

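7.4 Follow-Up Question Prep

Follow-up question	Key points
Why not let each business line choose its own models and build its own RAG?	Business lines may choose the models and knowledge sources that suit them, but they must connect to the unified gateway, policy, trace, eval, and evidence. Otherwise vendor risk, entitlements, cost, incidents, and audit all fragment.
Will the control plane slow down innovation?	A good control plane is a platform product, not an approval wall. Low-risk use cases can take a fast route, while high-risk use cases automatically trigger deeper eval and approval. Using risk tier to scale control strength actually shows teams how to ship faster.
Which component matters most?	For agentic AI, the tool gateway and policy engine; for RAG, knowledge entitlement and eval; for enterprise governance, the trace schema and evidence binder. Overall, the model gateway is the entry point and the control plane is the core.
How do you judge architecture maturity?	Look at five things: whether the AI inventory is complete, whether models and tools all go through a gateway, whether risk tier drives the release gate, whether production traces can be replayed, and whether the evidence binder can support audit and management review.
What comes first on a limited budget?	Start with the inventory, model gateway, basic trace, risk tier, and an eval gate for high-risk use cases. Then add the tool gateway, policy engine, production eval, and evidence binder. Do not build complex agents first without tool entitlements and audit boundaries.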

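8. Portfolio Presentation

If you turn this document into a portfolio, produce four assets:

Artifact	What it demonstrates
Enterprise AI Reference Architecture Canvas	The eight-layer architecture, the cross-cutting identity and security fabric, the control plane, and the evidence plane
Financial Retail Control Plane Case	How AML / credit / customer service share platform capabilities while staying isolated by domain
C4 + Sequence + Control View Pack	A complete architecture narrative from high-level capabilities to containers, sequences, and control evidence
Route-to-Release Gate Memo	risk tier, eval gate, pilot, production monitoring, kill switch, and evidence binder

In an interview, do not start by stacking technical components; start from the enterprise problem:

The problem I solve is the governance breakdown that appears once AI spreads from PoC into enterprise production: models, data, tool actions, release quality, production risk, and audit evidence all become hard to control. So I use a reference architecture and a control plane to unify business value, platform capabilities, and risk governance.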