AI Enterprise Integration / Event-Driven Agent Playbook

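Audience: AI Architects, Product Architects, AI Platform PMs, Integration Architects, Enterprise Architects, and technology or product leads in financial retail.

Core question: once an enterprise AI agent stops merely answering questions and has to collaborate across CRM, payments, KYC, AML, ticketing, core banking, notification, and approval systems, how do you choose among APIs, events, workflow engines, and agent tool contracts so that every action is authorizable, auditable, replayable, recoverable, and governable?

Learning goals: this playbook skips BA fundamentals and generic requirements analysis. It trains senior practitioners to design agent integration patterns, OpenAPI / AsyncAPI / CloudEvents contracts, idempotency, outbox / inbox, sagas, schema governance, human-in-loop queues, dead-letter handling, action approval, audit trails, replay runbooks, and governance evidence at financial retail grade.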

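Important: this document is study, portfolio, and architecture-training material. It is not legal advice, a compliance determination, an audit opinion, or a production architecture approval. For a real financial retail project, the Business Owner, Architecture, Engineering, Security, Privacy, Legal, Compliance, Model Risk, Operational Risk, Internal Audit, and the data / system owners must jointly confirm the applicable scope, customer impact, regulatory obligations, and release gates.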

Source Anchors

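The official or primary sources below anchor the terminology and architecture. This playbook translates them into the language of enterprise AI integration, event-driven agents, and financial retail governance.

Source	Official / primary link	How this playbook uses it
AsyncAPI Documentation	https://www.asyncapi.com/docs	Describes event-driven APIs, message channels, operations, schemas, bindings, and asynchronous contract governance.
CloudEvents Specification	https://cloudevents.io/	Defines the standard fields of the event envelope, such as `id`, `source`, `type`, `specversion`, `subject`, and `time`, and the boundary of the data content.
OpenAPI Specification	https://spec.openapis.org/oas/latest.html	Covers synchronous HTTP API contracts: request / response schemas, operationId, security schemes, error models, and API versioning.
CNCF Serverless Workflow	https://serverlessworkflow.io/	Describes event-driven workflows: states, actions, compensation, branching, timeouts, and long-running transaction orchestration.
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize risk identification, measurement, controls, evidence, and continuous governance for agent integration.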

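1. Positioning in One Sentence

The hard part of an enterprise AI agent is not whether the model can call tools. It is this:

For every external action the agent takes,
is there a clear contract, permission, approval, idempotency guarantee, compensation, event evidence, audit trail, and recovery path?

This playbook focuses on the control plane of AI Enterprise Integration: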

User intent
-> agent plan
-> tool / API / event / workflow contract
-> policy and approval
-> execution
-> observation
-> audit event
-> replay / compensation / investigation
-> governance learning loop

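Final deliverables suited to a portfolio:

Portfolio artifact	Capability demonstrated
Agent Integration Decision Matrix	Can decide whether a capability should be an API, an event, a workflow, or a human queue.
OpenAPI + AsyncAPI Contract Pack	Can model synchronous commands and asynchronous events separately and produce reviewable contracts.
CloudEvents Event Envelope Standard	Can standardize event metadata, trace, tenant, risk, schema version, and audit fields.
Tool Contract Card	Can raise an agent tool from a function wrapper to an authorizable, auditable, governable business capability.
Idempotency / Outbox / Inbox Design	Can prevent agent retries from causing duplicate refunds, notifications, tickets, or SAR evidence writes.
Saga and Compensation Map	Can design long-running transaction recovery across payments, cases, notifications, the ledger, and human approval.
HITL Queue Design	Can turn action approval, exception handling, four-eyes review, and SLAs into system design.
Event Schema Governance Pack	Can establish schema owners, compatibility rules, versioning policy, consumer impact review, and deprecation gates.
Replay and DLQ Runbook	Can handle event failures, dead letters, replay, ordering, deduplication, and incident review.
NIST AI RMF Control Mapping	Can translate agent integration risk into governance, measurement, management, and audit evidence.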

2. Advanced Framework: Agent Integration Control Plane

AI agent integration is not a single connector but a control plane. A suggested split into eight layers:

1. Intent and risk classification
2. Capability discovery and tool selection
3. Contract validation
4. Policy, entitlement and approval
5. Execution through API / event / workflow / queue
6. Observation normalization
7. Audit, trace, replay and recovery
8. Governance, eval and schema lifecycle

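2.1 The Eight-Layer Control Plane

Layer	Key question	Main design objects	Loss-of-control signal
Intent and risk	Is this request a query, a suggestion, a draft, a write, an outbound message, or an action on customer rights?	intent taxonomy, risk tier, decision boundary	A low-risk path triggers a high-risk tool.
Capability discovery	Which capabilities can the agent discover, and what are each one's owner, scope, limits, and risk?	capability catalog, tool card, service registry	The agent uses ad hoc tools, tools with unknown owners, or unreviewed connectors.
Contract validation	Are inputs, outputs, errors, events, and states machine-verifiable?	OpenAPI, AsyncAPI, CloudEvents, JSON Schema	Parameters are assembled from natural language, and results cannot be parsed reliably.
Policy and approval	Who may execute, which actions need approval, and which situations must halt?	ABAC / RBAC, policy engine, approval rule, kill switch	The model decides permissions or approvals directly.
Execution	Should this go through a synchronous API, an asynchronous event, a workflow engine, or a human queue?	API gateway, event bus, workflow engine, HITL queue	A single HTTP call carries a long-running transaction with side effects in multiple systems.
Observation	How are tool results returned to the agent, and how are prompt injection and misreading prevented?	normalized observation schema, trusted / untrusted label	Tool output is treated as system instructions, or lacks source and timestamp.
Audit and recovery	Which evidence do failure handling, retry, replay, compensation, and investigation depend on?	trace ID, audit event, idempotency key, DLQ, replay log	Only the final error is visible, not the call chain or the decision rationale.
Governance	How do contracts, schemas, tools, models, and workflows change?	ADR, schema review, release gate, eval set, risk review	Consumers break, audit evidence goes missing, and risk acceptance cannot be traced.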

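2.2 Agent Integration Maturity

Maturity	Typical state	Main risk	Next upgrade
Level 0: Manual copy	Staff copy and paste from multiple systems into the AI	Data leakage, broken evidence chain, no way to reconstruct what happened	Identify high-frequency context and action paths.
Level 1: Point connector	A dedicated API wrapper is written per use case	Scattered permissions, inconsistent error handling, poor reuse	Introduce tool contracts and audit events.
Level 2: Tool gateway	Tools are registered, authenticated, and logged centrally	Still lacks event contracts, asynchronous recovery, and schema governance	Introduce OpenAPI / AsyncAPI / CloudEvents.
Level 3: Event-driven agent	The agent takes part in long-running processes through events and workflows	Replay, ordering, DLQ, and schema compatibility become complex	Build outbox / inbox, sagas, HITL queues, and governance gates.
Level 4: Governed integration platform	Multiple business lines reuse tools, events, workflows, approvals, and audit	Platform ownership, cost, change governance, and organizational coordination are hard	Build an integration operating model, a control dashboard, and a quarterly review.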

3. API vs Event vs Workflow Engine vs Human Queue

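The core of senior-level design is not which technology is most advanced, but choosing the integration pattern that fits the business semantics.

3.1 Decision Matrix

Pattern	Good for	Not good for	Financial retail examples	Key controls
Synchronous API	Queries or commands that need an immediate result, have clear boundaries, and can fail fast	Long-running transactions, multi-system coordination, processes that wait on people	Query a dispute case, fetch a KYC profile, check refund eligibility	OpenAPI, timeout, rate limit, idempotency key, structured error.
Event-driven pub/sub	State-change notifications, fact broadcast, multiple consumers reacting independently	Immediate strongly consistent responses, single approval decisions	`payment.dispute.opened`, `kyc.case.status_changed`, `aml.alert.evidence_requested`	AsyncAPI, CloudEvents, schema compatibility, outbox, consumer lag.
Workflow engine	Long-running, multi-step orchestration with branching, compensation, human approval, and multiple systems	Simple queries, one-off actions that need no state management	KYC review workflow, payment dispute provisional credit, AML evidence pack generation	state machine, saga, compensation, timeout, human task, audit state.
Human-in-loop queue	High-risk judgment, exception handling, dual approval, customer-impacting actions	Low-risk high-volume automation, pure system event broadcast	Large-refund approval, SAR narrative review, review of account-freeze recommendations	queue SLA, assignment, maker-checker, evidence view, decision log.
Agent-only planning	Low-risk reasoning, drafts, summaries, and next-step suggestions	Writing to systems, moving funds, submitting regulatory material, making customer commitments	Customer service reply drafts, case summaries, investigation checklists	no-side-effect mode, citation, review gate, output schema.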

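3.2 Selection Rules

Question	Design conclusion
Does the caller need the result immediately?	Yes: API. No: event or workflow.
Do multiple downstream systems need to react independently to the same fact?	Yes: event. Do not have the agent call each downstream system synchronously.
Are there waits, human approvals, compensation, timeouts, or multi-step state?	Yes: workflow engine. Do not cram a long-running transaction into one API endpoint.
Will it change funds, accounts, customer rights, regulatory records, or outbound communication?	It needs at least an approval policy, an audit event, and a recovery path.
Can it be retried safely after a failure?	If not, design idempotency and compensation first.
Does the event state that a fact has happened, or ask someone to do something?	Use a past-tense event for facts that have happened. Prefer a command API or workflow task for requested actions.
Is the agent just one of several consumers?	In an event stream the agent should be subject to schema, permission, and backpressure controls like any other consumer.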
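
The selection rules above can be sketched as a small decision function. This is a minimal illustration, not a definitive rule engine: the `Capability` fields, the pattern names, and the precedence (long-running state first, then fan-out, then request/response) are assumptions introduced here, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical answers to the selection questions for one capability."""
    needs_immediate_result: bool
    multiple_independent_consumers: bool
    has_waits_approvals_or_compensation: bool
    changes_money_or_customer_rights: bool


def choose_pattern(cap: Capability) -> dict:
    """Suggest an integration pattern and the minimum controls it must carry."""
    # Assumed precedence: long-running state wins over fan-out,
    # and fan-out wins over plain request/response.
    if cap.has_waits_approvals_or_compensation:
        pattern = "workflow_engine"
    elif cap.multiple_independent_consumers:
        pattern = "event"
    elif cap.needs_immediate_result:
        pattern = "synchronous_api"
    else:
        pattern = "event"

    controls = []
    if cap.changes_money_or_customer_rights:
        # The rule table asks for at least these three controls.
        controls = ["approval_policy", "audit_event", "recovery_path"]
    return {"pattern": pattern, "required_controls": controls}


kyc_profile_lookup = Capability(
    needs_immediate_result=True,
    multiple_independent_consumers=False,
    has_waits_approvals_or_compensation=False,
    changes_money_or_customer_rights=False,
)
provisional_credit = Capability(
    needs_immediate_result=False,
    multiple_independent_consumers=False,
    has_waits_approvals_or_compensation=True,
    changes_money_or_customer_rights=True,
)
print(choose_pattern(kyc_profile_lookup))
print(choose_pattern(provisional_credit))
```

In practice the answers would come from an architecture review rather than boolean flags; the function only makes the precedence explicit and reviewable.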

3.3 Boundaries Between Command, Event, and Query

Type	Naming tendency	Semantics	Contract focus
Query	`GET /cases/{id}`, `searchCustomerEvidence`	Does not change business state; returns current facts	entitlement, data minimization, freshness, pagination.
Command	`POST /refunds`, `approveDisputeCredit`	Asks the system to perform an action, which may succeed or fail	idempotency, authorization, validation, error code, approval ID.
Event	`payment.dispute.opened`, `refund.approved`	Describes a fact that has already happened	immutable event, schema version, source, trace, ordering, replay.
Workflow task	`ReviewKycException`, `ApproveHighRiskAction`	Waits for a person or system to complete a step	assignment, SLA, state transition, decision reason, escalation.

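4. Contract-First: OpenAPI, AsyncAPI, CloudEvents

Agent integration must be contract-first. A natural-language prompt cannot replace an interface contract.

4.1 OpenAPI for Synchronous Capabilities

OpenAPI suits HTTP APIs, especially queries and commands.

Design item	Senior-level requirement
`operationId`	Must be stable and semantically clear, and map to a tool name.
request schema	Every field specifies type, enum, format, minimum-necessity rationale, and sensitivity label.
response schema	Returns a structured observation so the agent does not have to guess state from free text.
error model	Uses classifiable errors: validation, entitlement, approval_required, conflict, rate_limited, dependency_timeout.
security scheme	States explicitly whether the call uses a user-delegated token, service account, mTLS, OAuth scope, or internal auth.
idempotency	Every write command accepts an `Idempotency-Key` and returns the idempotent replay result.
audit headers	Supports `X-Correlation-Id`, `X-Request-Id`, `X-Actor-Id`, `X-Case-Id`, `X-Policy-Decision-Id`.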

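4.2 AsyncAPI for Asynchronous Events and Messages

AsyncAPI suits event buses, message brokers, streams, topics, and channels.

Design item	Senior-level requirement
channel	The name reflects the domain and the fact semantics, for example `payments.disputes.events.v1`.
operation	Distinguishes publish from subscribe (`send` / `receive` in AsyncAPI 3.x) and names the producer / consumer owner.
message	payload schema, headers, examples, content type, correlation ID.
bindings	Records broker-specific constraints for Kafka, AMQP, SNS/SQS, EventBridge, and so on.
schema version	Each message is bound to a schema version and a compatibility policy.
consumer contract	Important consumers declare their SLA, filter conditions, failure mode, and DLQ owner.
replay policy	States event retention, replayable range, replay approval, and deduplication requirements.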

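4.3 CloudEvents for a Unified Event Envelope

The value of CloudEvents is a unified set of event metadata, so that different producers, consumers, agents, and audit systems can all understand the basic identity of the same event.

Suggested enterprise-standard event envelope: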

{
  "specversion": "1.0",
  "id": "evt_20260629_000123",
  "source": "urn:service:payment-dispute",
  "type": "com.momofinance.payment.dispute.opened.v1",
  "subject": "dispute/DSP-2026-09291",
  "time": "2026-06-29T14:35:20Z",
  "datacontenttype": "application/json",
  "dataschema": "https://schema.example.com/payment/dispute-opened/v1",
  "data": {
    "dispute_id": "DSP-2026-09291",
    "customer_id_ref": "cust_ref_9f22",
    "amount": "128.40",
    "currency": "USD",
    "reason_code": "goods_not_received",
    "risk_tier": "medium"
  }
}

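Enterprise extension fields should go in extension attributes or a governed data header, with consistent naming:

Field	Purpose
`correlationid`	Business-chain ID spanning API, event, workflow, and agent trace.
`causationid`	Which command, event, or workflow step triggered the current event.
`tenantid`	Multi-tenant isolation and audit filtering.
`actorid`	Human, service account, or agent identity.
`policydecisionid`	Reference to the authorization, approval, risk-control, or guardrail decision.
`schemaid`	Immutable schema identifier in the schema registry.
`riskclass`	low, medium, high, regulated, customer-impacting.
`traceparent`	Distributed tracing context.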
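
A producer-side or gateway-side check for this envelope might look like the sketch below. It relies on two rules from the CloudEvents 1.0 specification (the required context attributes `specversion`, `id`, `source`, `type`, and attribute names limited to lower-case letters and digits). The `REQUIRED_EXTENSIONS` tuple is a hypothetical enterprise policy chosen for illustration, and `validate_envelope` is not part of any CloudEvents SDK.

```python
import re

# Required context attributes in CloudEvents 1.0.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")
# CloudEvents attribute names use lower-case letters and digits only.
ATTRIBUTE_NAME = re.compile(r"[a-z0-9]+")
# JSON-format members that carry the payload rather than context attributes.
PAYLOAD_MEMBERS = ("data", "data_base64")
# Hypothetical enterprise policy: extensions every event must carry.
REQUIRED_EXTENSIONS = ("correlationid", "tenantid", "schemaid")


def validate_envelope(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the envelope passes."""
    problems = []
    for name in REQUIRED_ATTRIBUTES:
        value = event.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"missing required attribute: {name}")
    if event.get("specversion") not in (None, "1.0"):
        problems.append("unsupported specversion")
    for name in event:
        if name not in PAYLOAD_MEMBERS and not ATTRIBUTE_NAME.fullmatch(name):
            problems.append(f"invalid attribute name: {name}")
    for name in REQUIRED_EXTENSIONS:
        if name not in event:
            problems.append(f"missing enterprise extension: {name}")
    return problems


event = {
    "specversion": "1.0",
    "id": "evt_20260629_000123",
    "source": "urn:service:payment-dispute",
    "type": "com.momofinance.payment.dispute.opened.v1",
    "correlationid": "corr_8b9c",
    "tenantid": "tenant_01",
    "schemaid": "payment-dispute-opened-v1",
    "data": {"dispute_id": "DSP-2026-09291"},
}
print(validate_envelope(event))
```

The naming rule is why the extension fields above are written `correlationid` rather than `correlation_id`: an underscore would not be a valid CloudEvents attribute name.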

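4.4 Contract Separation Principles

Bad practice	Recommended practice
A single `POST /agent/doEverything` carries query, approval, write, and notification	Model query, command, event, and workflow task separately.
The API returns a natural-language "it worked"	Return a machine-readable status, entity ID, version, effective time, and audit reference.
The event payload reuses database tables directly	Build a domain event schema that hides internal table structure and sensitive fields.
Schema changes are announced by word of mouth	schema registry + compatibility check + consumer impact review.
The agent tool declares only a function name and description	The tool contract covers side effect, risk tier, approval, idempotency, audit, and rollback.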

5. Agent Integration Patterns

5.1 Read-Only Evidence Agent

Agent -> query APIs / retrieval -> evidence pack -> draft recommendation -> human review

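Applies to:

AML evidence aggregation.
KYC profile summary.
Payment dispute case summary.
Summaries of customer service interaction history.

Key controls:

Every read interface filters by user permission and case entitlement.
Each observation carries source system, record ID, retrieval time, and policy version.
The agent may output only summaries, differences, and recommendations; it may not write to the source of truth.
Content visible to regulators or customers requires human review.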

5.2 Draft-Then-Approve Action Agent

Agent creates action proposal
-> policy engine evaluates
-> human approves
-> command API executes
-> event confirms result

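Applies to:

Customer service refund drafts.
Dispute provisional credit recommendations.
KYC exception close recommendations.
Draft replies for customer communication.

Key controls:

The proposal and the execution are two distinct objects.
The approval ID must be passed to the command API.
Execution produces an immutable domain event.
The audit record keeps the agent suggestion, human edit diff, approval reason, and command result.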

5.3 Event-Subscribed Agent

Business event -> event bus -> agent consumer -> enrichment / triage / task creation

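Applies to:

A new AML alert triggers evidence aggregation.
A KYC status change triggers a missing-document check.
A dispute opened event triggers a case summary.
A customer complaint escalation triggers policy retrieval.

Key controls:

The agent consumer has its own consumer group, lag monitoring, and DLQ.
The agent does not alter the original event; it only produces an enrichment event or a workflow task.
Processing logic deduplicates through an inbox so that redelivered events are not processed twice.
Each output event records which input event and which model / prompt / tool version produced it.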

5.4 Workflow-Orchestrated Agent

Workflow engine owns state
-> agent performs bounded task
-> workflow decides next state
-> human queue handles exceptions

Applies to:

KYC event workflow.
AML investigation checklist.
Payment dispute resolution.
High-risk customer service actions.

Key controls:

The agent is a task worker for the workflow, not the owner of state.
The workflow handles timeout, retry, compensation, manual escalation, and state transition.
Agent output must conform to the task output schema.
High-risk transitions are controlled by the workflow policy gate.

5.5 Tool Gateway with Policy Enforcement

Agent -> tool gateway -> policy engine -> connector -> source system

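Applies to:

Multiple agents sharing CRM, case, payment, notification, document, and KYC vendor connectors.
Platform-level tool catalog and permission governance.

Key controls:

The gateway enforces allowlist, scopes, field masking, rate limit, approval gate, and audit.
The connector does not trust agent parameters and still performs schema validation and entitlement checks.
High-risk tools provide a dry-run endpoint that first returns an action impact preview.
The gateway records tool call spans and policy decisions in one place.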

6. Tool Contract: From Function to Governed Capability

Every agent tool should have a contract card. It is not an engineering comment but a contract shared by architecture, risk, product, and audit.

6.1 Tool Contract Card Template

Field	Content
Tool name	`payment_dispute.create_provisional_credit_proposal`
Business capability	Generates a provisional credit proposal for an eligible dispute case.
Owner	Payment Dispute Platform Owner.
Source system	Dispute Case Management, Payment Ledger read API.
Side effect tier	Draft only. Does not post to the ledger directly.
Risk class	Customer-impacting, regulated operation support.
Allowed intents	dispute investigation, customer complaint handling.
Disallowed intents	marketing retention, general refund request, non-dispute adjustment.
Input schema	`case_id`, `customer_id_ref`, `amount`, `currency`, `reason_code`, `evidence_refs`.
Output schema	proposal ID, eligibility status, reason codes, required approvals, evidence gaps.
Permission	Case-assigned agent or supervisor with dispute entitlement.
Approval	Execution requires human approval; an amount threshold determines whether a second approval is needed.
Idempotency	key = `case_id + proposed_action_type + evidence_hash`.
Audit	actor, agent session, tool version, input hash, evidence refs, policy decision ID, proposal ID.
Failure mode	eligibility_conflict, missing_evidence, approval_required, source_timeout, policy_denied.
Replay behavior	Draft creation can be replayed idempotently; the execution command cannot proceed without a valid approval ID.
Observability	trace span, latency, error category, policy denial rate, human override rate.

6.2 Tool Risk Tier

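Tier	Semantics	Default controls
T0 Read	Read-only query, but may include sensitive information	entitlement, field masking, audit, rate limit.
T1 Analyze	Aggregate, summarize, classify, or label without writing to the source of truth	evidence refs, confidence boundary, review sample.
T2 Draft	Produce drafts, recommendations, proposals, or objects pending approval	human review, edit diff, approval reason.
T3 Write	Modify cases, customer data, task status, or internal records	approval policy, idempotency, outbox event, rollback / compensation.
T4 External / Financial	Send customer notifications, trigger refunds, freeze accounts, or submit regulatory material	dual control, step-up auth, pre-action preview, saga, post-action audit.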

6.3 Action Approval

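Action approval must be promoted from a UI button to a system contract.

Design item	Requirement
Approval subject	States exactly which proposal is approved, not a natural-language suggestion.
Approval scope	Specifies amount, customer, case, action, validity period, and number of permitted executions.
Approval evidence	Keeps AI evidence, source records, human edits, and policy checks.
Approval authority	Amount, customer risk, action type, region, and role determine single or dual approval.
Approval revocation	Revocable before execution; after execution, use compensation.
Approval binding	The command API must verify the approval ID and proposal hash, so that approving A cannot execute B.
Approval audit	Records approver, timestamp, reason code, UI version, and decision context.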
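
Approval binding, scope, and revocation can be illustrated with a short check that a command API might run before executing. This is a sketch under stated assumptions: the `Approval` record, its field names, and the returned status strings are hypothetical, and hashing a canonical JSON form of the proposal is one possible way to compute a proposal hash.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def proposal_hash(proposal: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(proposal, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Approval:
    """Hypothetical approval record bound to exactly one proposal."""
    approval_id: str
    proposal_hash: str
    expires_at: datetime
    remaining_executions: int = 1
    revoked: bool = False


def authorize_execution(approval: Approval, command_payload: dict, now: datetime) -> str:
    """Decide whether a command may run under this approval."""
    if approval.revoked:
        return "approval_revoked"
    if now >= approval.expires_at:
        return "approval_expired"
    if approval.remaining_executions < 1:
        return "approval_exhausted"
    # Binding: the payload being executed must be the payload that was approved.
    if proposal_hash(command_payload) != approval.proposal_hash:
        return "proposal_mismatch"
    approval.remaining_executions -= 1
    return "authorized"


now = datetime(2026, 6, 29, 15, 0, tzinfo=timezone.utc)
approved = {"case_id": "DSP-2026-09291", "action": "provisional_credit", "amount": "128.40"}
approval = Approval("apv_6182", proposal_hash(approved), now + timedelta(hours=4))

tampered = dict(approved, amount="1280.40")
print(authorize_execution(approval, tampered, now))   # proposal_mismatch
print(authorize_execution(approval, approved, now))   # authorized
print(authorize_execution(approval, approved, now))   # approval_exhausted
```

A production system would persist the approval and decrement the execution count atomically with the command; the in-memory mutation here only shows the order of checks.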

7. Event Schema Governance

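The goal of event schema governance is not polished field lists. It is to keep producers, consumers, agents, workflows, audit, and replay from breaking when something changes.

7.1 Domain Event Standards

Dimension	Rule
Naming	Use past-tense facts: `payment.dispute.opened`, `kyc.case.approved`, `aml.evidence.pack.generated`.
Granularity	An event represents a business fact, not a database row change.
Immutability	Published events are never modified; corrections are issued as new events.
Data minimization	Do not stuff the full customer profile into the event; use references and fetch details through an entitlement-aware API.
Versioning	A breaking change uses a new event type or a new major schema version.
Traceability	Every event carries correlation, causation, source, actor, policy decision, and schema ID.
Replayability	Consumers must be able to recognize a replay and use inbox / idempotency to prevent duplicate side effects.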

7.2 Compatibility Policy

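Change	Compatibility	Handling
Add an optional field	Usually compatible	The schema registry checks automatically; consumers ignore unknown fields.
Add a required field	breaking	New major schema version, consumer impact review.
Change a field type	breaking	New major schema version with a migration window.
Remove a field	breaking	Deprecate first, observe consumer usage, then retire.
Change the meaning of an enum value	High risk	architecture review, consumer contract test, release note.
Change event semantics	breaking	New event type; do not reuse the old name.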
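
The structural rows of this policy can be checked mechanically. The sketch below assumes a deliberately simplified, hypothetical schema shape (`{"fields": {name: type}, "required": [names]}`) rather than real JSON Schema or Avro, and `classify_change` is an illustrative helper, not a schema registry API. Semantic changes, such as a new meaning for an enum value, cannot be detected this way and still need human review.

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema diff using the structural rules of the compatibility policy."""
    old_fields, new_fields = old["fields"], new["fields"]
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))

    if set(old_fields) - set(new_fields):
        return "breaking: field removed"
    # Safe to index new_fields: every old field still exists at this point.
    if any(old_fields[name] != new_fields[name] for name in old_fields):
        return "breaking: field type changed"
    if new_required - old_required:
        return "breaking: required field added"
    if set(new_fields) - set(old_fields):
        return "compatible: optional field added"
    return "compatible: no structural change"


v1 = {
    "fields": {"dispute_id": "string", "amount": "string", "currency": "string"},
    "required": ["dispute_id", "amount", "currency"],
}
v1_plus_optional = {
    "fields": dict(v1["fields"], risk_tier="string"),
    "required": v1["required"],
}
v1_retyped = {
    "fields": dict(v1["fields"], amount="number"),
    "required": v1["required"],
}
print(classify_change(v1, v1_plus_optional))
print(classify_change(v1, v1_retyped))
```

A check like this fits naturally into a release gate: a "breaking" result would block the merge until a new major version and a consumer impact review exist.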

7.3 Schema Review Gate

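Before any high-impact event goes live, answer at least:

Question	Expected evidence
What business fact does this event describe?	event definition and examples.
Who is the producer owner?	service owner, on-call, data steward.
Which consumers depend on it?	consumer inventory, SLA, risk tier.
Does it contain sensitive PII / PCI / AML / KYC fields?	data classification and minimization decision.
Does it support replay and deduplication?	event ID, idempotency key, inbox strategy.
How are schema changes communicated and verified?	compatibility check, contract test, release gate.
How are event failures handled?	retry, DLQ, owner, triage runbook.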

8. Reliability Patterns for Event-Driven Agent

8.1 Idempotency

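One of the most dangerous agent failures is a retry that succeeds twice. Every write action should be designed for idempotency.

Scenario	Idempotency key design
Create a dispute evidence pack	`case_id + evidence_scope + source_snapshot_version`
Create a refund proposal	`case_id + action_type + amount + evidence_hash`
Execute an approved refund	`approval_id + proposal_hash`
Send a customer notification	`case_id + notification_template_id + recipient + content_hash`
Create a KYC missing document task	`customer_id_ref + requirement_code + policy_version`

Key principles:

Derive the idempotency key from business semantics, not from a random UUID.
For a repeated key, the API returns the original result and does not repeat the side effect.
Store the request hash so that a different payload under the same key is not mistaken for the same request.
High-risk actions must have a retention policy for idempotency records.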
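
These principles can be sketched as a small in-memory store. The `IdempotencyStore` class and `IdempotencyConflict` exception are hypothetical names for illustration. A real implementation would persist records durably, apply a retention policy, and guard against concurrent requests with the same key, none of which this sketch does.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key was reused with a different payload."""


class IdempotencyStore:
    def __init__(self):
        self._records = {}  # key -> (request_hash, result)

    def execute(self, key: str, payload: dict, action):
        """Run action(payload) at most once per key and replay the stored result."""
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        request_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

        record = self._records.get(key)
        if record is not None:
            stored_hash, stored_result = record
            if stored_hash != request_hash:
                # Same key, different payload: reject instead of guessing.
                raise IdempotencyConflict(key)
            return stored_result  # replay without repeating the side effect

        result = action(payload)
        self._records[key] = (request_hash, result)
        return result


executed = []


def execute_refund(payload: dict) -> dict:
    executed.append(payload)  # stands in for the real side effect
    return {"status": "executed", "refund_ref": f"rf_{len(executed)}"}


store = IdempotencyStore()
payload = {"approval_id": "apv_6182", "amount": "128.40", "currency": "USD"}
key = "apv_6182:proposal-hash-abc"  # business-semantic key: approval_id + proposal_hash

first = store.execute(key, payload, execute_refund)
retry = store.execute(key, payload, execute_refund)
print(first == retry, len(executed))
```

The retry returns the original result and the refund runs once, which is the behavior the second principle asks for.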

8.2 Outbox / Inbox

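The outbox addresses the case where the database commit succeeds but the event is never sent. The inbox addresses the case where duplicate event delivery causes a consumer to repeat its side effects.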

Command API
-> write business state and outbox row in same transaction
-> outbox publisher emits CloudEvent
-> consumer receives event
-> inbox records event id / processing result
-> consumer performs idempotent side effect

Pattern	Where it applies	Controls
Outbox	Any service that must publish an event after writing to the source of truth	same transaction, publisher retry, outbox lag alert.
Inbox	Any event consumer with side effects, including agent workers	event ID dedupe, processing status, retry count, poison message flag.
Agent memory write	The agent writes to case memory or a task note in response to an event	memory idempotency, source event reference, human visibility.
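
The outbox / inbox flow can be demonstrated end to end with Python's standard `sqlite3` module. This is a minimal sketch: table names, column names, and function names are assumptions made for illustration, and the producer and consumer share one in-memory database only to keep the example short (in a real system each side owns its own store).

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE cases (case_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL,
                             published INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE inbox (event_id TEXT PRIMARY KEY);
        """
    )
    return db


def update_case(db, case_id: str, status: str, event_id: str) -> None:
    """Outbox: business state and event row commit in one transaction."""
    event = {"id": event_id, "type": "case.status_changed",
             "case_id": case_id, "status": status}
    with db:  # commits on success, rolls back both writes on error
        db.execute("INSERT OR REPLACE INTO cases VALUES (?, ?)", (case_id, status))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (event_id, json.dumps(event)))


def publish_pending(db, send) -> int:
    """Publisher: at-least-once, since a crash after send() causes a resend."""
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, payload in rows:
        send(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                       (event_id,))
    return len(rows)


def consume(db, event: dict, side_effect) -> bool:
    """Inbox: the primary key on event_id rejects a duplicate delivery."""
    try:
        with db:
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event["id"],))
            side_effect(event)
    except sqlite3.IntegrityError:
        return False  # already processed, skip the side effect
    return True


db = open_db()
update_case(db, "case-4421", "closed", "evt_20260629_000123")
delivered, effects = [], []
publish_pending(db, delivered.append)
print(consume(db, delivered[0], effects.append))  # True: first delivery
print(consume(db, delivered[0], effects.append))  # False: duplicate ignored
```

The side effect here is an in-process call, so a failure rolls back the inbox row and allows a retry. An external side effect cannot be rolled back that way and needs its own idempotency key.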

8.3 Sagas and Compensation

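Long-running transactions in financial retail usually cannot rely on distributed transactions and need a saga.

Saga step	Forward action	Compensation
Create dispute proposal	Create a proposal pending approval	Cancel the proposal and record the reason.
Reserve credit action	Reserve the amount to be executed or validate the execution window	Release the reservation.
Apply provisional credit	Write to the ledger or case state	Start a reversal / adjustment workflow.
Notify customer	Send the notification	Send a corrective notification or create a manual-contact task.
Close workflow task	Update the case milestone	Reopen the task and attach an incident note.

Design principles:

Compensation is not rollback. Customer-visible actions and ledger actions usually need a reversing transaction or a remediation process.
Saga state should be maintained by the workflow engine or a transaction coordinator, not by the agent prompt.
Each step records actor, approval, input hash, output event, and compensation eligibility.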
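
The forward / compensation pairing can be sketched as a tiny coordinator that runs steps in order and, on failure, compensates the completed steps in reverse. `run_saga` and the step functions are hypothetical. A real workflow engine would also persist state between steps and handle a failing compensation, which this sketch leaves out.

```python
def run_saga(steps, context: dict):
    """steps: list of (name, forward, compensate). Returns (outcome, log)."""
    completed = []
    log = []
    for name, forward, compensate in steps:
        try:
            forward(context)
        except Exception:
            log.append(f"failed:{name}")
            # Compensate only what actually completed, newest first.
            for done_name, done_compensate in reversed(completed):
                done_compensate(context)
                log.append(f"compensated:{done_name}")
            return "compensated", log
        completed.append((name, compensate))
        log.append(f"done:{name}")
    return "completed", log


def create_proposal(ctx):
    ctx["proposal"] = "open"


def cancel_proposal(ctx):
    ctx["proposal"] = "cancelled"


def reserve_credit(ctx):
    ctx["reservation"] = "held"


def release_reservation(ctx):
    ctx["reservation"] = "released"


def apply_credit(ctx):
    # Simulated failure in the ledger step.
    raise RuntimeError("ledger API timeout")


def start_reversal(ctx):
    ctx["reversal"] = "started"


steps = [
    ("create_proposal", create_proposal, cancel_proposal),
    ("reserve_credit", reserve_credit, release_reservation),
    ("apply_credit", apply_credit, start_reversal),
]
context = {}
outcome, log = run_saga(steps, context)
print(outcome)
print(log)
```

The failed step itself is not compensated because its forward action never completed, and each compensation is a new business action (cancel, release) rather than a database rollback.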

8.4 Replay

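Replay is an investigation and recovery capability, not a routine business operation.

Replay type	Purpose	Controls
Read-only replay	Re-run an agent summary or classification without writing to any system	safe mode, model / prompt version pinning, output comparison.
Event replay	Redeliver historical events to consumers	replay marker, consumer inbox, rate limit, approval.
Workflow replay	Rebuild workflow state or resume from a given step	state snapshot, versioned workflow definition, manual approval.
Incident replay	Reproduce the incident's inputs, context, tool results, and outputs	evidence preservation, redacted trace, audit binder.

Before any replay, confirm:

Whether it will trigger write actions.
Whether consumers support the replay marker.
Whether idempotency / inbox protection is effective.
Whether to use the original model version or the current one.
Whether risk / compliance approval is required.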

8.5 Dead-Letter Handling

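A DLQ is not a trash can; it is an operations queue.

DLQ category	Example	Handling strategy
Schema failure	The event lacks a required field or carries an unrecognized enum value	Stop automatic retries, notify the producer owner, open a schema incident.
Entitlement failure	The agent consumer is not entitled to fetch the referenced record	Check the permission configuration; do not bypass least privilege.
Dependency timeout	The KYC vendor, case API, or ledger API times out	Exponential backoff, circuit breaker, workflow wait state.
Policy denied	A high-risk action does not meet approval or data conditions	Route to the human queue; do not downgrade automatically to get around it.
Poison message	A single message fails repeatedly and blocks the queue	Isolate the message, create a triage task, record the decision.

The DLQ runbook must cover owner, SLA, retry limit, manual-fix permissions, replay procedure, and customer impact assessment.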

8.6 Ordering, Backpressure and Exactly-Once Myth

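Concern	Design judgment
Ordering	Guarantee order only within the aggregate that truly needs it, such as a single `case_id`. Do not pursue global ordering across customers or cases.
Backpressure	An agent consumer may be slower than an ordinary service, so it needs lag alerts, work shedding, and a priority queue.
Exactly-once	Do not treat exactly-once as a business guarantee. Use at-least-once delivery + idempotent consumer + inbox to reach business safety.
Retry	Distinguish transient from permanent failures. validation / entitlement / policy denied failures should not be retried blindly.
Timeout	Each agent task needs a step timeout and a total workflow timeout to prevent infinite loops and runaway cost.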
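
The retry rule in the table can be sketched as a function that decides between retrying with exponential backoff and dead-lettering. The category sets, the default limits, and the name `next_action` are assumptions for illustration. Jitter is left out to keep the arithmetic visible, though real consumers usually add it.

```python
# Failure categories taken from the error model; the split is an assumed policy.
PERMANENT = {"validation", "entitlement", "policy_denied"}
TRANSIENT = {"dependency_timeout", "rate_limited"}


def next_action(category: str, attempt: int, max_attempts: int = 5,
                base_delay: float = 1.0, max_delay: float = 60.0):
    """Return ("retry", delay_seconds) or ("dead_letter", None).

    attempt is the 1-based number of the attempt that just failed.
    """
    if category in PERMANENT:
        # Retrying cannot fix these; send them to the DLQ for triage.
        return ("dead_letter", None)
    if category in TRANSIENT and attempt < max_attempts:
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        return ("retry", delay)
    # Unknown categories and exhausted retries also go to the DLQ.
    return ("dead_letter", None)


print(next_action("dependency_timeout", attempt=1))
print(next_action("dependency_timeout", attempt=3))
print(next_action("policy_denied", attempt=1))
```

Treating unknown categories as dead letters is a deliberately conservative choice here: an unclassified failure is handed to a person instead of being retried in a loop.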

9. Human-in-Loop Queues

The human-in-loop queue is a key component of an event-driven agent, not just a UI list.

9.1 Queue Types

Queue	Trigger	Decision output
Approval queue	An agent proposal needs to execute a write or customer-impacting action	approve, reject, edit and approve, request evidence.
Exception queue	policy conflict, schema issue, missing evidence, low confidence	resolve, escalate, return to agent with constraints.
Quality review queue	Sampled review of agent summaries, classifications, and drafts	accept, correct, calibration feedback.
Incident triage queue	DLQ, tool misuse, prompt injection, replay request	contain, assign owner, start incident.
Dual-control queue	Large, sensitive, regulated, or irreversible actions	second approval, segregation-of-duties validation.

9.2 Queue Contract

Every human queue must define:

Field	Description
Work item type	`ApproveRefundProposal`, `ReviewKycException`, `ValidateAmlEvidencePack`.
Entry criteria	Which policy / risk / event conditions trigger entry.
Evidence bundle	Which source records, AI output, tool calls, and policy decisions the reviewer must see.
Allowed decisions	The state transition and consequence behind each button.
SLA and priority	Ordered by customer impact, regulatory deadline, amount, and risk level.
Assignment rule	Skill, permission, region, segregation of duties.
Audit fields	reviewer, decision, reason, edits, time spent, override pattern.
Exit event	Which event or command is published after the human decision.

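9.3 Anti-Patterns

Anti-pattern	Risk
The approver sees only the AI conclusion, not the evidence	Approval becomes a rubber stamp.
All exceptions go into one queue	SLAs are distorted and high-risk items are buried under low-risk ones.
Human edits are not recorded as a diff	Neither agent quality nor reviewer judgment can be evaluated.
Approval and execution are not bound	What was approved and what was executed may differ.
The queue backlog has no owner	The human in the loop becomes a source of incident delay.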

10. Audit, Observability and Governance

10.1 Agent Action Audit Event

Each tool action should emit a unified audit event:

{
  "audit_event_type": "agent.tool_action.executed",
  "correlation_id": "corr_8b9c",
  "agent_session_id": "ags_20260629_001",
  "actor_type": "human_delegated_agent",
  "actor_id": "user_1827",
  "agent_id": "customer_service_action_agent",
  "tool_name": "case.update_status",
  "tool_version": "2.3.0",
  "risk_tier": "T3_WRITE",
  "input_hash": "sha256:...",
  "output_ref": "case_event_7781",
  "approval_id": "apv_6182",
  "policy_decision_id": "pdp_9182",
  "idempotency_key": "case-4421-close-v2",
  "source_event_id": "evt_20260629_000123",
  "model": "model-alias-prod-2026-06",
  "prompt_version": "cs-action-v17",
  "decision": "executed",
  "timestamp": "2026-06-29T15:04:11Z"
}
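
One way to produce such an event is a small builder that hashes the tool input instead of storing it and refuses to record an executed high-risk action without an approval reference. This is a sketch: `build_audit_event`, the tier label `T4_EXTERNAL`, and the rule that every T3 / T4 execution needs an `approval_id` are assumptions made for illustration, since the approval policy in a real deployment would come from the policy engine.

```python
import hashlib
import json
from datetime import datetime, timezone


def canonical_hash(value: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_event(*, tool_name, tool_version, risk_tier, tool_input,
                      correlation_id, actor_id, approval_id=None, now=None):
    # Assumed policy: executed write / external actions must reference an approval.
    if risk_tier in ("T3_WRITE", "T4_EXTERNAL") and not approval_id:
        raise ValueError("write or external actions need an approval_id")
    now = now or datetime.now(timezone.utc)
    return {
        "audit_event_type": "agent.tool_action.executed",
        "correlation_id": correlation_id,
        "actor_id": actor_id,
        "tool_name": tool_name,
        "tool_version": tool_version,
        "risk_tier": risk_tier,
        # Only the hash is stored, which keeps raw customer data out of the audit log.
        "input_hash": canonical_hash(tool_input),
        "approval_id": approval_id,
        "decision": "executed",
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


event = build_audit_event(
    tool_name="case.update_status",
    tool_version="2.3.0",
    risk_tier="T3_WRITE",
    tool_input={"case_id": "case-4421", "status": "closed"},
    correlation_id="corr_8b9c",
    actor_id="user_1827",
    approval_id="apv_6182",
    now=datetime(2026, 6, 29, 15, 4, 11, tzinfo=timezone.utc),
)
print(event["timestamp"], event["input_hash"][:14])
```

Storing a hash rather than the input lets an auditor later prove which payload was executed, provided the payload itself is retained in a system with the right access controls.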

10.2 Traceability Chain

Every high-risk action must be traceable from the customer impact back through:

Customer-facing outcome
-> domain event
-> command API
-> approval ID
-> agent proposal
-> tool calls
-> retrieved evidence
-> source event / user request
-> model / prompt / policy / schema versions

10.3 NIST AI RMF Mapping

RMF Function	Agent integration question	Evidence
Govern	Who owns the tools, events, workflows, schemas, approvals, and residual risk?	RACI, tool catalog owner, schema governance board, risk acceptance.
Map	Which agent actions affect customers, funds, identity, AML/KYC, complaints, or regulatory material?	use case inventory, risk tier, data flow, event map, decision boundary.
Measure	How are tool misuse, schema failure, DLQ, human override, replay, latency, and cost measured?	dashboard, contract test, eval set, incident trend, approval analytics.
Manage	When something fails, how do you halt, degrade, replay, compensate, notify, and repair?	kill switch, DLQ runbook, saga compensation, postmortem, release gate.

10.4 Governance Cadence

Cadence	Content
Per release	contract diff, schema compatibility, tool risk review, approval rule regression, replay test.
Weekly ops review	DLQ aging, consumer lag, tool error rate, policy denial, manual queue SLA.
Monthly architecture review	new tool onboarding, event taxonomy change, workflow bottleneck, saga incidents.
Quarterly risk review	high-risk action sample, audit evidence completeness, NIST RMF control mapping, residual risk.

11. Architecture / Product Mapping

11.1 Platform Capability Map

Architecture component	Product capability	Key requirement language
Tool catalog	Business teams can discover available agent tools	Each tool shows owner, risk tier, allowed intent, schema, approval, and SLA.
API gateway	Boundary for synchronous tool calls	Enforces auth, rate limit, schema validation, idempotency, and audit headers.
Event bus	Broadcast of business facts	Supports CloudEvents, AsyncAPI, schema registry, consumer lag, and replay.
Workflow engine	Long-running transactions and human-machine collaboration	Supports state, branch, timeout, compensation, human task, and versioned definition.
Policy engine	Permission and action gating	Decides allow / deny / review from user, case, risk, tool, amount, jurisdiction, and approval.
HITL queue	Approval and exception handling	evidence bundle, decision UI, SLA, assignment, audit, exit event.
Audit store	Reconstructable evidence	Keeps action, policy, approval, trace, input hash, output ref, and version.
Schema registry	Contract governance	compatibility, owner, version, consumer impact, deprecation.
Replay service	Recovery and investigation	read-only replay, event replay, workflow replay, incident replay.
Observability	Production control	trace, metrics, logs, cost, quality, DLQ, queue, consumer lag.

11.2 Senior-Level Slices of the Product Backlog

Epic	Key stories
Governed tool onboarding	As a platform owner, I can register a tool contract and set its risk tier, scope, approval, and owner, so that agents can use only reviewed tools.
Event contract platform	As an integration architect, I can publish an event contract with AsyncAPI and see the affected consumers before a breaking schema change.
Human approval binding	As a risk owner, I can ensure that the approved proposal hash matches the executed command payload.
Safe replay	As an incident commander, I can replay an agent trace in read-only mode without triggering write actions.
DLQ operations	As an operations lead, I can manage dead letters by failure category, age, risk tier, and owner.
Saga recovery	As a product architect, I can see the forward action, compensation, and customer impact of every workflow step.
Audit evidence export	As internal audit, I can sample and export the evidence chain of tool actions, approvals, policy decisions, events, and traces.

12. Financial Retail Cases

12.1 Payment Dispute Agent

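Goal: help the dispute analyst aggregate evidence, produce an investigation summary, and propose provisional credit or chargeback handling, without letting the agent change the ledger directly.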

payment.dispute.opened event
-> workflow starts investigation
-> agent aggregates transaction, merchant, customer contact, policy evidence
-> agent creates proposal
-> human approval queue
-> approved command executes
-> dispute.action.executed event

Design point	Decision
API	OpenAPI describes `getDisputeCase`, `getTransactionDetails`, `createCreditProposal`, `executeApprovedCredit`.
Event	AsyncAPI describes `payment.dispute.opened.v1`, `dispute.evidence.pack.generated.v1`, `dispute.credit.approved.v1`.
Workflow	Defines evidence gathering, eligibility check, approval, execution, and notification in Serverless Workflow style.
Idempotency	credit execution key = `approval_id + proposal_hash`.
Approval	Amount, customer risk, and dispute type determine single or dual approval.
Audit	Keeps merchant data, transaction refs, policy version, AI proposal, human edit diff, and execution event.
DLQ	A merchant API timeout goes to the dependency DLQ; a schema mismatch goes to the integration DLQ.
Replay	The investigation summary can be replayed read-only; credit execution cannot be replayed without approval.

Key interview points:

The product value of a payment dispute agent comes from shorter investigation time, not from automatic refunds.
The automation boundary should stop at the proposal and evidence pack; execution must go through an approval-bound command.

12.2 AML Evidence Aggregation Agent

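Goal: at the AML alert or case stage, automatically collect transactions, customer data, relationship networks, historical alerts, and policy evidence to help the analyst assemble an investigation pack.

Design point	Decision
Event trigger	`aml.alert.created.v1` triggers the agent consumer.
Data access	Reads only through entitlement-aware query APIs; complete sensitive data is never placed in events.
Tool tier	Most tools are T0 / T1; the SAR narrative draft is T2 draft.
HITL	The analyst must review the evidence pack and narrative; the agent does not submit the SAR.
Schema governance	The evidence item schema includes source, record type, time range, confidence, and redaction.
Replay	Incident replay requires a pinned source snapshot and prompt version.
Audit	Records the provenance of every evidence item to support regulator and internal audit inquiries.

Key controls:

The agent must not widen the scope of AML data access.
Before a tool result enters the prompt, it is labeled with trusted source and sensitivity.
The narrative must distinguish factual evidence, analyst judgment, and AI draft.
Any action that submits or closes an alert is T3 / T4 and requires human authorization.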

12.3 KYC Event Workflow

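Goal: customer data, documents, screening, risk rating, and human review collaborate through events and workflow, while the agent handles missing-document detection, discrepancy explanation, and reviewer assist.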

customer.profile.updated
-> kyc.workflow.started
-> document.requested / document.received
-> screening.completed
-> agent.risk_summary.generated
-> human.review.completed
-> kyc.case.approved / kyc.case.rejected

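Design point	Decision
API vs event	Profile queries use an API; state changes use events; multi-step approval uses a workflow.
Workflow owner	The KYC platform owns state; the agent is only a task worker.
Schema	`kyc.case.status_changed.v1` does not expose full PII; it carries only references and risk flags.
HITL	High-risk customers, PEPs, near-match sanctions hits, and inconsistent documents go to the review queue.
Idempotency	missing document task key = `customer_id_ref + document_type + policy_version`.
Compensation	A wrongful approval cannot simply be deleted; it requires reopening the case, customer resubmission of documents, a risk notification, and an audit note.
Governance	A KYC policy version change must trigger agent eval and workflow regression.

Key interview points:

The KYC agent should not become the master decision maker.
As a workflow task worker it should produce explanations, gaps, and evidence packs; final state transitions are controlled jointly by workflow, policy, and human review.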

12.4 Customer Service Action Agent

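Goal: the customer service agent suggests the next step from the customer conversation and account context, drafts replies, and under controlled conditions executes low-risk case updates or notifications.

Design point	Decision
Tool contract	Distinguishes `search_policy`, `summarize_case`, `draft_response`, `update_case_status`, `send_customer_message`.
Risk tier	Search and summary are T0/T1, drafts T2, case updates T3, outbound customer messages T4.
Approval	Customer commitments, fees, refunds, complaint rights, and case closure require human confirmation.
Prompt injection	User input and retrieved emails are labeled untrusted context and cannot override system policy.
Audit	Keeps final message, AI draft, human edits, policy citations, and send event.
DLQ	A notification provider failure creates a callback task instead of letting the agent resend in a loop.
Replay	Customer service conversation replay must be redacted and pinned to a policy version, and is used for quality review.

Key controls:

The agent must not let customer-supplied text override internal policy.
Customer-visible commitments must come from a policy-backed template or human confirmation.
Outbound tools must support preview, approval, idempotency, and a send receipt event.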

13. Templates and Artifacts

13.1 Integration Pattern ADR

# ADR: [Capability] Integration Pattern

## Decision
Use [OpenAPI command / AsyncAPI event / workflow engine / human queue / hybrid] for [capability].

## Context
- Business process:
- Customer / regulatory impact:
- Source systems:
- Consumers:
- Required response time:
- Failure cost:

## Options Considered
| Option | Pros | Cons | Rejected / selected reason |
|---|---|---|---|
| Synchronous API | | | |
| Event-driven | | | |
| Workflow engine | | | |
| Human queue | | | |

## Selected Design
- Contract:
- Identity and permission:
- Idempotency:
- Audit:
- Retry and DLQ:
- Replay:
- Compensation:

## Risk Acceptance
- Residual risk:
- Owner:
- Review cadence:

13.2 Event Contract Card

Field	Example
Event type	`com.momofinance.kyc.case.status_changed.v1`
Business meaning	The state of a KYC case has changed.
Producer	KYC workflow service.
Channel	`kyc.cases.events.v1`
CloudEvents source	`urn:service:kyc-workflow`
Subject	`kyc-case/{case_id}`
Schema ID	`kyc-case-status-changed-v1`
Data classification	PII references only, no raw identity document.
Ordering key	`case_id`
Retention	Set according to institutional policy and audit requirements.
Replay	Allowed with replay marker and inbox dedupe.
DLQ owner	KYC integration operations.
Consumers	case dashboard, notification workflow, risk analytics, agent worker.

13.3 Tool Contract Checklist

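Check item	Pass criterion
Business owner	Owner and escalation path are identified.
Tool tier	T0 to T4 classification is complete.
Input schema	Fields, types, enums, sensitivity, and validation rules are defined.
Output schema	Returns parseable status and source references.
Auth	User-delegated or service identity is stated.
Approval	High-risk actions have approval subject, scope, expiry, and binding.
Idempotency	Write actions have a key, request hash, and retention.
Audit	Action, policy, approval, model, prompt, and tool version are traceable.
Failure model	validation, entitlement, policy, dependency, conflict, and timeout are clearly classified.
Replay	The boundary between read-only replay and write replay is defined.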

13.4 DLQ Triage Template

Field	Content
DLQ item ID	Dead-letter record ID.
Source event	CloudEvents `id`, `type`, `source`, `time`.
Failure category	schema, entitlement, dependency, policy, poison, unknown.
Risk tier	low, medium, high, regulated, customer-impacting.
Customer impact	Whether it affects customers, funds, identity, compliance deadlines, or complaints.
Owner	producer, consumer, workflow, tool gateway, or vendor owner.
Decision	retry, fix and replay, drop with reason, manual process, start incident.
Evidence	error trace, payload hash, schema version, policy decision, consumer logs.
Approval	Approver and reason for a high-risk replay.

13.5 Replay Runbook

1. Classify replay type: read-only, event, workflow, incident.
2. Freeze relevant versions: schema, workflow definition, model, prompt, policy, tool.
3. Identify blast radius: cases, customers, events, workflows, consumers.
4. Validate idempotency and inbox readiness.
5. Run dry-run or read-only replay when possible.
6. Obtain approval for any replay that can trigger writes or customer-visible effects.
7. Execute replay with rate limit and replay marker.
8. Monitor consumer lag, DLQ, duplicate detection, workflow state and customer impact.
9. Record replay evidence and compare before / after outcomes.
10. Feed new cases into eval, contract tests and release gates.

14. 30-Day Lab

Goal: within 30 days, build a presentable financial retail event-driven agent integration portfolio covering contracts, architecture, reliability, governance, and interview delivery.

Day	Training task	Output
1	Pick one main scenario: payment dispute, AML evidence, KYC workflow, or customer service action.	Use case boundary + risk tier.
2	Draw the source systems, agent, API, event bus, workflow, human queue, and audit store.	C4 context / container diagram.
3	Separate query, command, event, workflow task, and human decision.	CQE / workflow boundary map.
4	Define 5 key business events and express them in a CloudEvents envelope.	Event catalog draft.
5	Write AsyncAPI-style channel, message, and schema for 2 events.	AsyncAPI contract excerpt.
6	Write OpenAPI-style operation, request, response, and error model for 3 synchronous capabilities.	OpenAPI contract excerpt.
7	Design the tool catalog and label each tool with a T0 to T4 risk tier.	Tool inventory.
8	Write 1 complete Tool Contract Card.	Tool contract artifact.
9	Design approval subject, scope, expiry, binding, and audit fields.	Action approval design.
10	Design idempotency keys covering proposal, execution, notification, and task creation.	Idempotency matrix.
11	Design the outbox / inbox flow and the event publishing path.	Reliability sequence diagram.
12	Design the workflow state machine, including waiting, approved, failed, and compensating.	Workflow definition sketch.
13	Write the saga forward / compensation map.	Saga compensation table.
14	Design human-in-loop queues and SLAs.	Queue contract.
15	Design the event schema governance policy and compatibility rules.	Schema governance one-pager.
16	Design the DLQ taxonomy and triage runbook.	DLQ operations template.
17	Design the replay runbook, separating read-only from write replay.	Replay runbook.
18	Design the audit event schema and traceability chain.	Audit evidence schema.
19	Map NIST AI RMF Govern / Map / Measure / Manage to the scenario.	RMF control mapping.
20	Design observability dashboard metrics: lag, DLQ, tool error, approval rate, override, cost.	Metrics spec.
21	Write 3 failure scenarios: duplicate execution, schema break, policy denied.	Failure mode analysis.
22	Write containment and recovery for each failure scenario.	Incident mini-runbook.
23	Design defenses against prompt injection and tool misuse.	Security control matrix.
24	Design contract tests and consumer compatibility tests.	Test strategy.
25	Write the Integration Pattern ADR.	ADR artifact.
26	Write a Financial Retail Case Study explaining value, boundaries, architecture, and governance.	Case study of 1,500 to 2,500 words.
27	Prepare 6 portfolio diagrams: architecture, event flow, workflow, approval, audit, failure recovery.	Diagram pack.
28	Run a tabletop: an event schema break causes a surge in the AML agent's DLQ.	Tabletop notes.
29	Turn the tabletop results into backlog items, controls, and release gates.	Corrective action register.
30	Prepare the interview script: a 5-minute architecture narrative + answers to 10 follow-up questions.	Interview story pack.

15. Interview Answers

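Q1: When should an AI Agent use an API, and when should it use events?

30-second answer:

APIs fit immediate queries or explicit commands; events fit broadcasting business facts that have already happened. An agent should not simulate an enterprise process with a chain of synchronous API calls: multi-consumer, replayable, asynchronous state changes belong on events, and long-running transactions and human approval belong in a workflow engine.

2-minute answer:

If the agent needs customer, case, transaction or policy data right away, I describe a query API with OpenAPI and add entitlement, field masking and timeout.
If the agent wants to request a write action, I use a command API with explicit idempotency, approval ID, structured error and audit headers.
If a fact has already happened, such as dispute.opened or kyc.case.status_changed, I publish an event with CloudEvents + AsyncAPI and let the agent act as a governed consumer.
If the process involves waiting, branching, compensation and human tasks, I let the workflow engine hold the state and keep the agent as a bounded task worker.
The point is not technology preference but semantic boundaries, failure recovery and audit evidence.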
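
The four cases in the 2-minute answer can be sketched as a small decision helper. This is an illustration only: `Capability`, its flags, the returned labels and the precedence order (workflow before event before command before query) are assumptions, not a rule stated in this playbook.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    needs_immediate_answer: bool = False    # agent needs data right away
    is_write_request: bool = False          # agent asks for a side effect
    is_fact_already_happened: bool = False  # a business fact to broadcast
    has_wait_or_human_step: bool = False    # waiting, branching, compensation, human task


def choose_pattern(c: Capability) -> str:
    """Map the semantic boundary of a capability to an integration pattern."""
    if c.has_wait_or_human_step:
        return "workflow"       # the workflow engine holds the state
    if c.is_fact_already_happened:
        return "event"          # CloudEvents + AsyncAPI
    if c.is_write_request:
        return "command-api"    # idempotency, approval ID, audit headers
    if c.needs_immediate_answer:
        return "query-api"      # OpenAPI with entitlement and field masking
    return "needs-adr"          # no clear fit: decide explicitly in an ADR
```

For example, `choose_pattern(Capability(is_write_request=True, has_wait_or_human_step=True))` returns `"workflow"`, because a write that waits on human approval is treated as a workflow concern first.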

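Q2: Why must agent write actions be designed for idempotency?

30-second answer:

Both agents and distributed systems retry. Without idempotency, a single network timeout can turn into a duplicate refund, duplicate notification or duplicate case closure. High-risk write actions must use a business-semantic key plus a request hash to make retries safe.

2-minute answer:

A write action must not use a random key that masks duplicates; the key should derive from business semantics, for example approval_id + proposal_hash.
A request with the same key returns the original execution result without repeating the side effect.
If the key matches but the payload hash differs, the system should reject the request and raise a conflict.
For event consumers, I use an inbox that records event ID and processing status, so duplicate delivery cannot cause a second side effect.
In payment, KYC and AML scenarios, idempotency is also evidence for audit and incident recovery.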
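
The same-key and hash-mismatch rules can be illustrated with a small in-memory executor. This is a sketch, assuming a single process and no durable store; `IdempotentExecutor`, `IdempotencyConflict` and `request_hash` are hypothetical names, and a production version would persist the key, hash and result atomically with the side effect.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


def request_hash(payload: Dict[str, Any]) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    def __init__(self) -> None:
        # key -> (request hash, stored result); stand-in for a durable table
        self._seen: Dict[str, Tuple[str, Any]] = {}

    def execute(self, key: str, payload: Dict[str, Any],
                action: Callable[[Dict[str, Any]], Any]) -> Any:
        h = request_hash(payload)
        if key in self._seen:
            stored_hash, stored_result = self._seen[key]
            if stored_hash != h:
                raise IdempotencyConflict(key)  # same key, different payload
            return stored_result                # retry: no second side effect
        result = action(payload)                # first execution only
        self._seen[key] = (h, result)
        return result
```

Calling `execute` twice with the same key and payload runs `action` once and returns the stored result the second time; reusing the key with a changed payload raises `IdempotencyConflict` instead of executing.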

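Q3: How does an event-driven agent handle dead letters?

30-second answer:

A DLQ is not a technical trash can; it is a risk operations queue. Classify entries by schema, entitlement, dependency, policy and poison message, and assign an owner, SLA, replay approval and customer impact assessment.

2-minute answer:

Schema failure usually means the producer / consumer contract is broken; do not retry blindly.
Entitlement failure must not bypass permissions; fix the access design.
Dependency timeout can be retried with backoff or moved into a workflow wait state.
Policy denied should go to a human queue or end the flow; it must not be automatically downgraded to get around the policy.
Before replaying a high-risk DLQ, verify the inbox, idempotency, replay marker and approval.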
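
The triage rules above can be captured as a lookup table plus a small function. This is a sketch: the class labels and `triage` are hypothetical, and the disposition for poison messages (quarantine and assign to an owner) is an assumption, since the answer above does not spell it out.

```python
from typing import Dict

# Failure class -> disposition, following the 2-minute answer.
DLQ_DISPOSITION: Dict[str, str] = {
    "schema": "hold: fix producer / consumer contract, no blind retry",
    "entitlement": "hold: fix access design, never bypass permissions",
    "dependency": "retry with backoff or move to workflow wait state",
    "policy": "route to human queue or end the flow",
    "poison": "quarantine and assign to owner",  # assumption, not from the text
}


def triage(failure_class: str, high_risk: bool) -> Dict[str, object]:
    """Return the disposition and replay controls for one dead-lettered message."""
    disposition = DLQ_DISPOSITION.get(failure_class, "quarantine and assign to owner")
    return {
        "disposition": disposition,
        # Only dependency timeouts on low-risk messages are retried automatically.
        "auto_retry": failure_class == "dependency" and not high_risk,
        # High-risk replays need inbox, idempotency, replay marker and approval checks.
        "replay_needs_approval": high_risk,
    }
```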

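Q4: What is the boundary between the workflow engine and the agent planner?

30-second answer:

The workflow engine should own long-lived state, branching, timeouts, compensation and human tasks. The agent planner fits evidence aggregation, summarization, drafting or recommendation inside a single bounded task; it should not maintain financial-retail long-running transaction state through prompts.

2-minute answer:

A workflow definition is a versionable, auditable, replayable process contract.
Agent output is only the result of a workflow task and must conform to a schema.
High-risk transitions are controlled by the workflow policy gate and human approval.
Saga compensation, timeout, retry and manual escalation should be expressed at the workflow layer.
This prevents model context drift, session loss or retries from driving the process out of control.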
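
One way to picture "the workflow owns the state" is an explicit transition table with an approval gate, so that an agent can only propose a target state and the workflow layer decides. This is a minimal sketch: the state names, `TRANSITIONS`, `HIGH_RISK_TARGETS` and `advance` are hypothetical and not tied to any workflow engine.

```python
from typing import Dict, Optional, Set

# Allowed transitions, owned by the workflow definition (not by the prompt).
TRANSITIONS: Dict[str, Set[str]] = {
    "waiting": {"approved", "failed"},
    "approved": {"executed", "failed"},
    "executed": {"compensating"},
    "failed": {"compensating"},
    "compensating": set(),
}

# Target states that require a human approval reference (assumption).
HIGH_RISK_TARGETS = {"executed"}


def advance(state: str, target: str, approval_id: Optional[str] = None) -> str:
    """Validate a proposed transition and return the new state."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    if target in HIGH_RISK_TARGETS and not approval_id:
        raise PermissionError(f"{target} requires approval")
    return target
```

An agent task that proposes `waiting -> executed` is rejected with `ValueError`, and `approved -> executed` without `approval_id` is rejected with `PermissionError`, regardless of what the model output says.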

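Q5: How do you design an agent tool contract?

30-second answer:

I treat a tool as a governed business capability, not a function. The contract must include owner, side effect, risk tier, input / output schema, permission, approval, idempotency, audit, failure mode and replay behavior.

2-minute answer:

First decide whether the tool is read, analyze, draft, write or external / financial.
Then define allowed intent and disallowed intent so the agent cannot call the tool in the wrong scenario.
For write actions, require approval binding, idempotency key, request hash and compensation.
For every tool, record tool version, policy decision, input hash, output ref and model / prompt version.
The tool result must come back as an observation; tool output must never override system instructions.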
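
A tool contract can be held as structured data and checked for missing controls before the tool is onboarded. The sketch below covers only part of the contract described above; `ToolContract`, its field names and `gaps` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Tool kinds that carry side effects and need the full action-safety set.
WRITE_KINDS = {"write", "external"}


@dataclass
class ToolContract:
    name: str
    owner: str
    kind: str        # read | analyze | draft | write | external
    risk_tier: str   # e.g. "T0" .. "T4"
    allowed_intents: List[str] = field(default_factory=list)
    requires_approval: bool = False
    idempotency_key_fields: List[str] = field(default_factory=list)
    compensation: str = ""

    def gaps(self) -> List[str]:
        """List missing controls; an empty list means no gap was detected."""
        missing = []
        if not self.owner:
            missing.append("owner")
        if not self.allowed_intents:
            missing.append("allowed_intents")
        if self.kind in WRITE_KINDS:
            if not self.requires_approval:
                missing.append("approval binding")
            if not self.idempotency_key_fields:
                missing.append("idempotency key")
            if not self.compensation:
                missing.append("compensation")
        return missing
```

A read tool with an owner and allowed intents has no gaps, while a write tool declared without approval, idempotency fields or compensation reports all three.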

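Q6: What value does CloudEvents bring to enterprise AI integration?

30-second answer:

CloudEvents standardizes event identity and metadata, so agents, workflows, audit, replay and ordinary services can all interpret an event's id, source, type, time, subject and data boundary the same way.

2-minute answer:

It reduces the chaos of every team defining its own event envelope.
For an agent, source and type help determine event semantics, id supports deduplication, time supports ordering and subject identifies the aggregate.
The enterprise can add extensions for correlation ID, causation ID, tenant, policy decision, risk class and trace context.
CloudEvents does not replace the domain schema; the payload still needs AsyncAPI / JSON Schema and a schema registry.
Its greatest value is a common language for cross-system tracing, audit and replay.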
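
To make the envelope concrete, the sketch below builds a CloudEvents 1.0 event as a plain dictionary in the JSON structured form. `specversion`, `id`, `source` and `type` are the required context attributes; `correlationid` and `causationid` are hypothetical extension attributes (CloudEvents attribute names are restricted to lower-case letters and digits), and `make_event` plus the example source and subject values are illustrative.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def make_event(source: str, type_: str, subject: str, data: Dict[str, Any],
               correlation_id: str, causation_id: str = "") -> Dict[str, Any]:
    event = {
        # Required CloudEvents context attributes
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": type_,
        # Optional CloudEvents context attributes
        "subject": subject,
        "time": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
        "datacontenttype": "application/json",
        # Extension attribute (hypothetical name)
        "correlationid": correlation_id,
        # Domain payload; its schema is governed separately
        "data": data,
    }
    if causation_id:
        event["causationid"] = causation_id  # id of the event that caused this one
    return event


opened = make_event("/disputes/service", "dispute.opened", "dispute/123",
                    {"amount": 50}, correlation_id="corr-1")
enriched = make_event("/agents/dispute-enricher", "dispute.enriched", "dispute/123",
                      {"summary_ref": "doc-9"}, correlation_id="corr-1",
                      causation_id=opened["id"])
```

Here `enriched` carries the same `correlationid` as `opened` and points at it through `causationid`, which is what allows tracing and replay tooling to rebuild the chain.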

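Q7: How do you prevent an agent from causing a cascading incident through the event stream?

30-second answer:

Treat the agent as a governed consumer and do not let it publish high-risk events without limit. Contain the spread with consumer lag monitoring, rate limit, step budget, tool allowlist, policy gate, DLQ, kill switch and replay marker.

2-minute answer:

An agent consumer must have its own consumer group, backpressure and lag alert.
An enrichment event published by the agent must state its source event, model, prompt, tool version and confidence boundary.
A high-risk event must not be published by the agent directly as fact; it should enter a proposal or workflow task.
For loop risk, use a causation chain and a loop detector so the agent cannot consume events it triggered itself and amplify them without bound.
During an incident, you can disable a specific consumer, tool or workflow path instead of shutting down the whole platform.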
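
A loop detector over the causation chain can be very small. The sketch below assumes each event carries the ordered list of producer IDs that led to it; `should_process`, `MAX_CHAIN_DEPTH` and the chain representation are hypothetical.

```python
from typing import List

MAX_CHAIN_DEPTH = 5  # hypothetical step budget for derived events


def should_process(consumer_id: str, causation_chain: List[str]) -> bool:
    """Decide whether a consumer may act on an event.

    causation_chain lists the producers that led to this event, oldest first.
    """
    if len(causation_chain) >= MAX_CHAIN_DEPTH:
        return False  # step budget exhausted: stop the amplification
    if consumer_id in causation_chain:
        return False  # this consumer already contributed: potential loop
    return True
```

An event whose chain is `["dispute-service", "enrich-agent"]` is skipped by `enrich-agent` but still processed by `notify-agent`; a skipped event would typically be logged or routed to the DLQ rather than dropped silently.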

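Q8: In financial retail, which agent actions must be human-in-loop?

30-second answer:

Any action that affects funds, account status, customer rights, KYC/AML conclusions, complaint handling, regulatory materials or customer-visible commitments needs human review by default, or at least policy-bound confirmation. The higher the risk, the greater the need for dual control and audit evidence.

2-minute answer:

A read-only summary can be reviewed by sampling, but once it enters regulatory or customer materials it needs review.
A draft action can be generated by the agent; execution must be bound to an approval.
Large refunds, account freezes, KYC rejections, AML narratives, complaint conclusions and outbound customer communications are high risk.
A HITL queue needs evidence bundle, allowed decisions, SLA, assignment and audit fields.
Oversight effectiveness is judged by override rate, edit diff, backlog, review quality and incident pattern.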
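
The oversight rules above can be sketched as a lookup from action and stage to a required oversight mode. This is an illustration under assumptions: the action identifiers, the stage names and the choice to map every high-risk execution to dual control are hypothetical; a real policy would be set by the accountable risk owners.

```python
# Hypothetical identifiers for the high-risk examples listed above.
HIGH_RISK_ACTIONS = {
    "large_refund", "account_freeze", "kyc_rejection",
    "aml_narrative", "complaint_conclusion", "customer_outbound",
}


def oversight(action: str, stage: str, external_material: bool = False) -> str:
    """Return the oversight mode for an action at a stage.

    stage is one of "summary", "draft" or "execute".
    """
    if stage == "execute":
        # Execution is always bound to an approval; high risk adds a second approver.
        return "dual-control" if action in HIGH_RISK_ACTIONS else "approval"
    if stage == "summary":
        # Sampling is acceptable until the summary feeds regulatory or customer material.
        return "review" if external_material else "sample"
    if stage == "draft":
        return "agent-may-draft"  # nothing executes at this stage
    raise ValueError(f"unknown stage: {stage}")
```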

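Q9: How do you handle an event schema breaking change?

30-second answer:

A breaking change must not go live silently. Manage it through schema registry, compatibility check, consumer inventory, impact review, a new major version, a migration window and a deprecation gate.

2-minute answer:

Adding an optional field is usually compatible; adding a required field, changing a type, removing a field or changing enum semantics can all be breaking.
For high-impact events, you must know every consumer, including agent workers, workflows, analytics and audit.
Run consumer contract tests before the new schema goes live.
If the business semantics change, bumping the version is not enough; use a new event type.
In financial retail, a schema break can cause missing AML evidence, misjudged KYC status or wrong customer notifications, so it belongs in the release gate.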
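
The structural part of the compatibility rules can be checked mechanically. The sketch below compares two JSON-Schema-like dictionaries (top-level `properties` and `required` only) and reports removed fields, type changes and newly required fields; `breaking_changes` is a hypothetical helper, it does not detect changes in enum semantics, and it is not a substitute for a schema registry's compatibility check.

```python
from typing import Any, Dict, List


def breaking_changes(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """List structural changes in `new` that can break consumers of `old`."""
    issues = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    old_req = set(old.get("required", []))
    new_req = set(new.get("required", []))

    for name, spec in old_props.items():
        if name not in new_props:
            issues.append(f"removed field: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            issues.append(f"type changed: {name}")

    for name in sorted(new_req - old_req):
        issues.append(f"new required field: {name}")
    return issues


v1 = {"properties": {"case_id": {"type": "string"}, "amount": {"type": "number"}},
      "required": ["case_id"]}
v1_1 = {"properties": {"case_id": {"type": "string"}, "amount": {"type": "number"},
                       "note": {"type": "string"}},
        "required": ["case_id"]}
v2 = {"properties": {"case_id": {"type": "integer"}, "channel": {"type": "string"}},
      "required": ["case_id", "channel"]}
```

With these example schemas, `breaking_changes(v1, v1_1)` returns an empty list because only an optional field was added, while `breaking_changes(v1, v2)` reports the type change on `case_id`, the removal of `amount` and the new required field `channel`.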

Q10: How do you apply NIST AI RMF to an event-driven agent architecture?

30-second answer:

I use Govern to set owners and gates, Map to define use cases, data flows and harm paths, Measure to define tool misuse, DLQ, override, quality and cost metrics, and Manage to define kill switch, replay, compensation and the incident loop.

2-minute answer:

Govern: tool catalog, schema owner, workflow owner, risk acceptance, approval authority.
Map: identify which events, tools and workflows affect customers, funds, identity, AML/KYC or regulatory materials.
Measure: monitor contract failure, policy denial, manual override, consumer lag, DLQ, latency, unit cost and incident trend.
Manage: design containment, fallback, replay, compensation, postmortem and release gate.
This makes NIST AI RMF more than compliance language: it becomes a control loop for architecture and product operations.

16. Final Checklist

Area	Senior-level check question
Pattern choice	Does each agent capability explicitly choose API, event, workflow, human queue or hybrid, backed by an ADR?
Contract	Do OpenAPI, AsyncAPI, CloudEvents and the tool contract cover schema, auth, error, version and owner?
Action safety	Does every write action have approval binding, idempotency, request hash, audit and compensation?
Event reliability	Are outbox, inbox, DLQ, replay, ordering and backpressure clearly designed?
Human oversight	Does each queue have evidence bundle, SLA, assignment, decision set, diff and exit event?
Governance	Do schema change, tool onboarding, workflow change and model / prompt change all go through the release gate?
Audit	Can you trace from customer impact back to event, command, approval, agent proposal, tool calls, evidence and versions?
Incident readiness	Can you disable a single tool / consumer / workflow path and safely replay or compensate?

Key takeaway:

The production-grade capability of an enterprise AI Agent is not "calling more tools".
It is designing tools, APIs, events, workflows, human approval, audit and recovery as one governed integration system.