AI Durable Agent Workflow / State Machine Playbook

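Positioning: A production-grade agent workflow design handbook for AI Product Architects, Enterprise Architects, Platform PMs, Workflow Architects, and Risk Ops.

Core question: When an AI agent no longer just answers questions but must execute long-running processes across payments, KYC, AML, credit, customer service, approvals, core systems, and external tools, how do durable workflows, state machines, event sourcing, checkpoints, sagas, HITL, idempotency, replay, and audit evidence turn that automation into a controllable enterprise capability?

Learning objective: This handbook does not cover basic BA process mapping or restate general workflow concepts. The goal is to train senior roles to design agentic workflows as retail-finance-grade architecture assets that are recoverable, replayable, compensable, approvable, auditable, drillable, and governable.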

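Important note: This document is learning, portfolio, and architecture-training material. It is not legal advice, a compliance conclusion, an audit opinion, or a production architecture approval. For a formal retail financial services project, the Business Owner, Architecture, Engineering, Security, Privacy, Legal, Compliance, Model Risk, Operational Risk, Internal Audit, Platform Owner, and the relevant system owners must jointly confirm the applicable boundaries, customer impact, regulatory obligations, model risk, and release gates.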

Source Anchors

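The following primary / official sources serve as terminology and architecture anchors. This document translates them into the language of AI agent workflows, retail financial processes, platform governance, and auditable evidence.

Source	Official / primary link	How this handbook uses it
Temporal Workflows	https://docs.temporal.io/workflows	To understand durable execution, workflow execution, event history, workflow replay, long-running workflows, and failure recovery.
Temporal Workflow Execution	https://docs.temporal.io/workflow-execution	To understand how replay restores progress from the Event History, and how deterministic workflows affect reliability.
Temporal Event History	https://docs.temporal.io/encyclopedia/event-history	To design event history, commands, events, activity results, worker crash recovery, and the boundaries of audit replay.
AWS Step Functions	https://docs.aws.amazon.com/step-functions/	To understand state machines, Amazon States Language, Task / Choice / Fail / Succeed, retry, catch, timeout, and visual orchestration.
Amazon States Language	https://states-language.net/	To write the workflow state machine as a machine-verifiable JSON-style structure, and to practice expressing states, transitions, and error handling precisely.
Workflow Patterns	https://www.workflowpatterns.com/	To identify structural complexity in a workflow from the control-flow, data, resource, and exception pattern perspectives, rather than drawing only linear flowcharts.
Sagas paper	https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf	To understand how a long-lived transaction is split into a set of local transactions and compensating transactions, and to turn that into a recovery strategy for agent tool side effects.
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	To organize agent workflow risk identification, controls, measurement, incident recovery, and governance evidence under Govern / Map / Measure / Manage.
NIST AI RMF Playbook	https://airc.nist.gov/airmf-resources/playbook/	To turn AI risk governance from principles into executable actions, evidence, and continuous-improvement activities.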

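1. One-Sentence Positioning

A durable agent workflow is not "letting the agent call a few more tools." It places the AI's plans, tool actions, human approvals, external side effects, failure recovery, and audit evidence inside a persistable business state machine.

A more precise definition:

Durable agent workflow =
manage long-running AI tasks with a deterministic workflow state machine,
isolate model calls and tool side effects into recordable activities,
recover progress from event history / checkpoints,
control external impact with idempotency / saga / compensation,
manage high-risk judgments with human approval states,
support incident review, regulatory explanation, and operational recovery with replay / audit evidence.

Its core boundaries: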

Agent can reason.
Workflow owns state.
Tool gateway owns side effects.
Policy engine owns permissions.
Human approval owns high-impact decisions.
Event history owns replay.
Audit evidence owns accountability.

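Final deliverables suitable for a portfolio:

Portfolio artifact	Capability demonstrated
Durable Agent Workflow Architecture	Can place the agent runtime, workflow engine, tool gateway, policy, checkpoint, audit, DLQ, and HITL queue in a single architecture diagram.
State Machine Spec	Can turn a natural-language process into states, events, transitions, guard conditions, timeouts, errors, and terminal states.
Agent Checkpoint Schema	Can define the minimal persistent state, model context summary, tool results, approval status, and evidence references needed to resume a long-running task.
Tool Side-Effect Matrix	Can distinguish the permission, idempotency, approval, and replay rules for read, draft, write, customer-facing, money movement, and regulatory action tools.
Saga Compensation Table	Can define local transactions, success events, failure paths, compensating actions, and non-compensable risks for long cross-system transactions.
HITL Approval Map	Can elevate human approval from a button to states, roles, SLAs, evidence, maker-checker, escalation, and expiry handling.
Replay / Audit Evidence Binder	Can prove why the workflow reached a given state, which inputs the AI used, which side effects the tools produced, and who approved what.
Incident Recovery Drill	Can rehearse duplicate execution, approval timeouts, downstream system failures, model output errors, DLQ backlog, and replay incompatibility.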

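2. Why It Matters

The most dangerous aspect of agentic AI in retail financial services is not one wrong answer. It is the steady accumulation of invisible state and external side effects over a long-running process:

An agent drafts a credit memo, new material arrives three days later, and the earlier and later evidence and version differences must be retained.
An AML investigation copilot aggregates transactions, accounts, KYC, external lists, and historical cases, and the source of every conclusion must be reviewable.
A KYC onboarding workflow waits for customer documents, compliance review, and risk approval, and must not lose state because a worker restarts.
A payment dispute agent may trigger provisional credit, merchant evidence requests, customer notifications, and chargeback file submission; any duplicated action can have real monetary and customer impact.
A complaint-handling agent spans customer service, complaints, remediation, legal escalation, and regulatory deadlines; a single failure must not let a case fall out of SLA.
A core system change approval agent may produce impact analyses, approval packs, rollback plans, and release evidence; every step must be provable.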

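2.1 Demo Agent vs. Durable Agent

Dimension	Demo Agent	Durable Agent Workflow
State	Conversation history and in-memory variables	Persistent workflow state, event history, checkpoint, business state reference
Failure recovery	Rerun the prompt	Resume from the last confirmed event without repeating external side effects
Tool calls	Function call succeeds or fails	activity result, idempotency key, side-effect ledger, compensation path
Human approval	Click approve on a page	approval state, role, evidence, SLA, maker-checker, override reason, audit event
Long-running tasks	Depend on the session staying connected	Process instances lasting days to months, with timer, pause, resume, cancel, escalate
Audit	Save the chat transcript	Save state transitions, input/output hashes, tool results, approval evidence, versions, and the event chain
Change	Update the prompt or code directly	workflow versioning, replay compatibility, migration, canary, control gate
Risk control	System prompt constraints	policy engine, tool gateway, state guard, kill switch, DLQ, and incident drill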

2.2 Failure Modes in Retail Financial Services

Failure mode	Real-world impact	Durable design response
Worker crash	A long-running task is interrupted while staff or customers believe the system is still processing	Event history + deterministic replay restore workflow progress.
Duplicate retry	Duplicate refunds, duplicate notifications, duplicate document submissions	Idempotency ledger + tool side-effect matrix + replay does not re-execute completed activities.
Human approval timeout	Payment disputes, KYC, or complaints exceed SLA	Approval state timeout + escalation + automated reminder + evidence log.
Model output drift	Rerunning the same case produces a different conclusion	Treat the model call as an activity and record prompt template version, model version, input hash, and output hash.
Downstream outage	KYC vendor, core banking, or payment gateway unavailable	Retry with backoff + circuit breaker + DLQ + recovery workflow.
Partial success	One system has been written to while another failed	Saga compensation table + manual repair queue + audit evidence.
Poison event	One class of event keeps failing and blocks the queue	DLQ triage + schema validation + replay approval + consumer isolation.
Replay incompatibility	Old workflow events cannot be interpreted by new code	Deterministic constraints + workflow versioning + replay test before release.

3. Advanced Architecture: Durable Agent Workflow Control Plane

A production-grade agent workflow is best organized as a ten-layer control plane:

1. Intent and risk classification
2. Workflow definition and state machine
3. Event history and workflow replay
4. Agent runtime and model activity
5. Tool gateway and side-effect ledger
6. Idempotency and saga compensation
7. Human approval and exception queue
8. Timeout, retry, DLQ and incident recovery
9. Audit evidence and regulatory traceability
10. Governance, versioning and continuous improvement

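3.1 Component View

Component	Responsibility	Must not be responsible for
Workflow engine	Persist process instances, state transitions, timers, activity scheduling, retry, replay, cancel, resume	Does not own business facts directly and does not replace core-system ledgers or case status.
State machine spec	Define states, events, transitions, guard conditions, terminal states, timeouts, approvals, and error paths	Does not rely on natural-language descriptions to decide which transitions are executable.
Agent runtime	Produce plans, summaries, drafts, candidate judgments, and suggested tool parameters	Does not own final business state and does not bypass workflow guards or policy.
Model activity	Execute one model call and return structured output	Does not call the model again by default during replay to produce a new answer.
Tool gateway	Execute external APIs, RPA, messages, files, notifications, and core-system actions through one entry point	Does not give the agent direct, ungoverned system credentials.
Side-effect ledger	Record external actions, idempotency keys, results, compensation status, and audit references	Does not replace the business system's actual transaction records.
Idempotency service	Prevent retries and replays from causing duplicate writes	Does not guarantee business-semantic correctness; it only ensures the same intent is not executed twice.
HITL queue	Manage human approval, review, document requests, escalation, returns, and SLA	Does not let approval become an evidence-free rubber stamp.
Event store / audit log	Store workflow events, tool events, approval events, policy events, and model activity metadata	Does not dump all sensitive raw content into logs indiscriminately.
DLQ and recovery console	Receive failed events, failed activities, poison messages, and manual repair tasks	Does not automatically replay high-risk side effects.
Policy engine	Compute permissions, risk tiering, approval requirements, limits, and shutdown rules	Does not hard-code policy in prompts.
Observability layer	Metrics, traces, logs, cost, latency, state distribution, retry trend, approval backlog	Does not watch only token cost while ignoring business state and customer impact.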

3.2 End-to-End Control Flow

Business event / user request
-> intent and risk classification
-> workflow instance starts or resumes
-> state machine selects next allowed transition
-> agent generates plan or structured decision support
-> policy engine checks tool and approval requirements
-> tool gateway executes idempotent activity
-> event history records command, event and activity result
-> checkpoint stores recoverable agent context
-> human approval state handles high-impact decisions
-> saga compensation handles partial failure
-> audit evidence binder links state, model, tool, human and policy evidence
-> replay / recovery uses event history without repeating completed side effects

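3.3 Maturity Model

Level	State	Main risk	Upgrade action
Level 0: Chat assist	AI only suggests the next step to staff in conversation	Staff copy and paste manually, the evidence trail breaks, and review is impossible	Establish an output schema, case evidence link, and human decision log.
Level 1: Tool calling	Agent can call a small number of APIs	Permissions, idempotency, error handling, and audit are scattered	Introduce a tool gateway, side-effect matrix, and idempotency key.
Level 2: Agent workflow	Agent participates in multi-step processes	State is stitched together from application code and database fields, so long-running tasks are hard to recover	Introduce a state machine spec, checkpoint, and timer.
Level 3: Durable workflow	A workflow engine manages state and replay	Saga, approval, DLQ, and version governance are complex	Build a compensation table, HITL map, replay test, and incident drill.
Level 4: Governed agent platform	Durable workflow capabilities are reused across business lines	Platform ownership, control standards, migration cost, and organizational coordination are hard	Build a platform pattern library, control dashboard, quarterly risk review, and evidence pack.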

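4. Core Architecture Patterns

4.1 Deterministic Workflow

The goal of a deterministic workflow is not to mechanize all business logic. It is to ensure that the same workflow code, run against the same event history, produces the same command sequence, so that replay is reliable.

Design point	Recommended practice	Risk signal
Random numbers, current time, external queries	Put them in an activity and write the result to event history	Workflow code calls system time, random functions, or external APIs directly.
Model calls	Treat as a model activity; record the input summary, version, output, and references	Replay calls the model again, turning an old process into a new conclusion.
Tool side effects	Execute as an activity through the tool gateway and persist the result	Replay triggers duplicate refunds, duplicate emails, or duplicate freezes.
Branch decisions	Base them on facts and policy decisions already recorded in event history	A branch depends on live database state with no recorded snapshot or version.
Code changes	Use workflow versioning and replay tests	State order is changed directly and replay of old instances fails.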
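
The replay rule in the table can be illustrated with a minimal sketch. It is not how any particular workflow engine is implemented: `EventHistory`, `WorkflowContext`, `run_activity`, and `dispute_workflow` are hypothetical names, and the history is an in-memory list standing in for a persisted event store. The point is only that a completed activity returns its recorded result on replay instead of running again.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class EventHistory:
    """Append-only record of completed activities (hypothetical structure)."""
    records: list[dict[str, Any]] = field(default_factory=list)


class WorkflowContext:
    """Executes an activity once, then serves its recorded result on replay."""

    def __init__(self, history: EventHistory) -> None:
        self.history = history
        self._cursor = 0  # index of the next recorded activity to replay

    def run_activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history.records):
            recorded = self.history.records[self._cursor]
            # A different activity at this position means the workflow code
            # no longer matches the recorded history.
            if recorded["name"] != name:
                raise RuntimeError(f"non-deterministic replay at step {self._cursor}")
            self._cursor += 1
            return recorded["result"]
        result = fn()  # first execution: the only place side effects happen
        self.history.records.append({"name": name, "result": result})
        self._cursor += 1
        return result


def dispute_workflow(ctx: WorkflowContext, calls: list[str]) -> str:
    """Workflow code: all I/O and model calls go through run_activity."""

    def fetch_txn() -> dict[str, int]:
        calls.append("fetch_txn")  # stands in for a tool gateway call
        return {"amount": 120}

    def draft_eligibility() -> str:
        calls.append("draft_eligibility")  # stands in for a model call
        return "eligible"

    txn = ctx.run_activity("fetch_txn", fetch_txn)
    draft = ctx.run_activity("draft_eligibility", draft_eligibility)
    return f"{draft}:{txn['amount']}"
```

Running `dispute_workflow` a second time with a fresh `WorkflowContext` over the same `EventHistory` returns the same value without invoking either activity again, which is the behavior a replay test would assert before a release.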

4.2 Event Sourcing for Workflow State

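Workflow state should be rebuildable from the event chain. Events are not just logs; they are the source of truth for recovery, audit, and explanation.

Event type	Example	Audit value
WorkflowStarted	`PaymentDisputeAgentStarted`	Proves when, why, and by whom or by which event the process was started.
StateEntered	`EvidenceGatheringEntered`	Proves the current stage of the process and the reason for entering it.
ActivityScheduled	`FetchCardTransactionScheduled`	Proves which tool and parameter version the system was about to call.
ActivityCompleted	`FetchCardTransactionCompleted`	Stores the tool output reference, status code, hash, and latency.
ModelActivityCompleted	`DisputeEligibilityDrafted`	Stores the model version, prompt template, input evidence references, and structured output.
ApprovalRequested	`ProvisionalCreditApprovalRequested`	Proves that a high-risk action entered human approval.
ApprovalRecorded	`ProvisionalCreditApproved`	Records the approver, rationale, evidence pack, and time.
SideEffectCommitted	`ProvisionalCreditPosted`	Proves that the external write is complete and that replay must not repeat it.
CompensationScheduled	`ReverseProvisionalCreditScheduled`	Proves the compensating action and what triggered it.
WorkflowFailed / Completed / Canceled	`DisputeWorkflowCompleted`	Proves the terminal state, outcome, and chain of responsibility.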
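
As a rough illustration of rebuilding state from the event chain, the sketch below folds a list of events into a state object. The event type names follow the table; the dictionary event shape, the `DisputeState` fields, and the split of the last row into three event types are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DisputeState:
    current_state: str = "NotStarted"
    pending_approval: bool = False
    committed_side_effects: list[str] = field(default_factory=list)
    terminal: bool = False


def apply(state: DisputeState, event: dict) -> DisputeState:
    """Apply one recorded event; unknown event types leave state unchanged."""
    kind = event["type"]
    if kind == "WorkflowStarted":
        state.current_state = "Started"
    elif kind == "StateEntered":
        state.current_state = event["state"]
    elif kind == "ApprovalRequested":
        state.pending_approval = True
    elif kind == "ApprovalRecorded":
        state.pending_approval = False
    elif kind == "SideEffectCommitted":
        # Remembered so that recovery never repeats this external write.
        state.committed_side_effects.append(event["name"])
    elif kind in {"WorkflowCompleted", "WorkflowFailed", "WorkflowCanceled"}:
        state.terminal = True
    return state


def rebuild(events: list[dict]) -> DisputeState:
    """Reconstruct workflow state purely from the ordered event chain."""
    state = DisputeState()
    for event in events:
        state = apply(state, event)
    return state
```

Because `rebuild` reads nothing except the events, the same event list always yields the same state, which is what makes the chain usable for both recovery and audit.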

4.3 Saga / Compensation

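A saga suits long cross-system transactions: each step is a local transaction, and on failure compensation is executed according to business semantics.

Principle	Retail financial services interpretation
Do not disguise a long transaction as a single ACID transaction	KYC, AML, payment disputes, and complaint handling often span systems, people, and days, and cannot be wrapped in one database transaction.
Every local transaction must have a business result event	For example `ProvisionalCreditPosted`, `KycDocumentVerified`, `ComplaintAcknowledgementSent`.
Compensation is a business action, not a technical rollback	Reverse the provisional credit, send a correction notice, reopen the case, cancel an outbound file, record a manual repair.
Some actions cannot be fully compensated	Once the customer has received a notice, a regulatory filing has been submitted, or the customer has noticed an account restriction, the only options are to correct, explain, and keep a record.
Compensation also needs approval and audit	High-impact compensation must not be decided by the agent alone.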
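
A minimal sketch of the saga idea, under simplifying assumptions: steps run in order, and when one fails the completed steps are compensated in reverse order. `SagaStep` and `run_saga` are hypothetical names. A step with no compensation is routed to manual repair rather than silently skipped, and the approval that the table requires for high-impact compensation is deliberately left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    # None marks a step that cannot be compensated automatically.
    compensation: Optional[Callable[[], None]] = None


def run_saga(steps: list[SagaStep], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"{step.name}:failed")
            for done in reversed(completed):
                if done.compensation is None:
                    log.append(f"{done.name}:manual_repair")
                else:
                    done.compensation()
                    log.append(f"{done.name}:compensated")
            return False
        log.append(f"{step.name}:done")
        completed.append(step)
    return True
```

The `log` list plays the role of the business result events: every local transaction, failure, compensation, and manual repair leaves an entry, so a partial success is never invisible.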

4.4 Agent Checkpoint

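A checkpoint does not store the model's "inner state." It stores the business context, evidence references, structured intermediate results, and pending actions needed to resume the task.

Checkpoint item	What to store	What not to store
Workflow position	current state, last event id, pending timers, pending approvals	The model's free-text reasoning chain treated as business fact.
Evidence references	Immutable references to transactions, documents, policies, cases, and tool results	The full set of unauthorized sensitive raw content.
Agent outputs	Structured drafts, candidate labels, confidence bounds, list of missing information	Unsourced assertions and unverified conclusions.
Tool ledger	Executed tools, idempotency keys, result references, compensation status	Temporary API tokens or credentials open to abuse.
Human decisions	Approval status, decision, rationale, approver, time	approve / reject without a rationale.
Privacy and retention	Data classification, retention period, deletion restrictions, basis for audit retention	All context stored permanently.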

4.5 Tool Side Effect Control

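Agent tools must be tiered by side effect. A query tool and a tool that can change a customer's rights or balances must not share the same approval, retry, and replay rules.

Tool class	Examples	Default controls
Read-only	Query transactions, read the KYC profile, read complaint history	RBAC / ABAC, data minimization, source timestamp, rate limit.
Draft-only	Generate a credit memo, a draft customer reply, or a draft AML narrative	Must be marked as draft; not written to customer or regulatory systems.
Internal write	Update a case note, add a missing-document tag	Idempotency key, audit event, business object version check.
Customer-facing	Send email, SMS, app notification, complaint reply	Human approval, template version, compliance language check, delivery receipt.
Money movement	Provisional credit, refund, fee waiver, chargeback submission	Dual approval, limits, idempotency, compensation, ledger reconciliation.
Account restriction	Freeze, unfreeze, restrict transactions, close account	High-risk approval, legal / compliance rule, customer impact assessment, kill switch.
Regulatory action	SAR evidence pack, regulatory complaint materials, audit evidence pack	Controlled generation, human sign-off, immutable archiving, version trail.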
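
One way a tool gateway might encode this tiering is a lookup from tool class to a control profile, checked before any call is executed. The sketch below is illustrative only: `ControlProfile`, `DEFAULT_CONTROLS`, and `gateway_check` are hypothetical, and the approver counts and retry flags are one possible reading of the table, not values the table prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ToolClass(Enum):
    READ_ONLY = "read_only"
    DRAFT_ONLY = "draft_only"
    INTERNAL_WRITE = "internal_write"
    CUSTOMER_FACING = "customer_facing"
    MONEY_MOVEMENT = "money_movement"
    ACCOUNT_RESTRICTION = "account_restriction"
    REGULATORY_ACTION = "regulatory_action"


@dataclass(frozen=True)
class ControlProfile:
    approvals_required: int         # distinct human approvers before execution
    idempotency_key_required: bool  # writes must carry a key
    auto_retry_allowed: bool        # whether the gateway may retry on its own


# Assumed defaults for illustration; real values belong to the policy engine.
DEFAULT_CONTROLS = {
    ToolClass.READ_ONLY: ControlProfile(0, False, True),
    ToolClass.DRAFT_ONLY: ControlProfile(0, False, True),
    ToolClass.INTERNAL_WRITE: ControlProfile(0, True, True),
    ToolClass.CUSTOMER_FACING: ControlProfile(1, True, False),
    ToolClass.MONEY_MOVEMENT: ControlProfile(2, True, False),
    ToolClass.ACCOUNT_RESTRICTION: ControlProfile(1, True, False),
    ToolClass.REGULATORY_ACTION: ControlProfile(1, True, False),
}


def gateway_check(
    tool_class: ToolClass,
    approver_ids: tuple[str, ...] = (),
    idempotency_key: Optional[str] = None,
) -> list[str]:
    """Return the list of control violations; an empty list means allowed."""
    profile = DEFAULT_CONTROLS[tool_class]
    violations = []
    if len(set(approver_ids)) < profile.approvals_required:
        violations.append("missing_approval")
    if profile.idempotency_key_required and not idempotency_key:
        violations.append("missing_idempotency_key")
    return violations
```

Keeping the profile in data rather than in prompts means a change to, say, the approval requirement for customer-facing tools is a reviewed configuration change and not a prompt edit.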

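5. State Machine Design: From Flowchart to Executable Control

An advanced state machine is not a flowchart with its nodes renamed in English. It makes the following explicit:

State = which events, actions, approvals, timeouts, and error handling the system allows at a given business stage.
Transition = how an event moves the workflow to the next state when its guard conditions are satisfied.
Terminal state = the business process is completed, canceled, rejected, failed, compensated, or handed over to a human.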

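5.1 State Types

State type	Purpose	Retail financial services example
Intake	Receive the request, classify risk, create the workflow instance	Payment dispute created, KYC application submitted, AML alert received.
Evidence gathering	Fetch or wait for evidence	Fetch transactions, merchant response, KYC documents, customer chat records.
Agent drafting	AI produces a structured draft or candidate judgment	Credit memo, AML narrative, suggested complaint reply.
Policy gate	Decide the next step according to rules	Amount limits, KYC risk tier, regulatory deadline for complaints.
Human approval	A human reviews, approves, rejects, returns, or escalates	Provisional credit approval, SAR narrative review, CAB approval of a core system change.
External action	Call a tool that produces an external side effect	Send a notification, write to core banking, submit a chargeback.
Wait / timer	Wait for the customer, merchant, reviewer, or downstream system	Wait for customer documents, merchant evidence, or the second approver.
Compensation	Compensate for a completed local transaction	Reverse the provisional credit, cancel a notification, reopen the case.
DLQ / repair	Enter the failure queue or manual repair	Schema error, downstream outage, duplicate key conflict.
Terminal	Completed, rejected, canceled, failed, manual takeover	Dispute completed, KYC rejected, Complaint escalated.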

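5.2 Transition Guard Conditions

Guard type	Example	Design requirement
Risk guard	Auto-advance only when `risk_tier <= medium`	The guard result comes from a policy decision, not from the model's free judgment.
Evidence guard	Transaction, customer statement, merchant response, or KYC documents must be present	Missing evidence routes to `AwaitingEvidence` or `HumanReview`.
Authority guard	The approver holds the required permission and is not the creator	maker-checker and segregation of duties must be auditable.
Amount guard	Amounts above the threshold require dual approval	Record the threshold version, currency conversion, and exception rationale.
Freshness guard	Policy, KYC, and account status must be within their validity period	Store the source timestamp and policy version.
Idempotency guard	The same business intent has not been executed before	Check the idempotency key against the side-effect ledger.
Replay guard	Replay mode must not re-trigger completed external actions	The workflow engine restores from recorded results.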
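
Several of these guards can be expressed as plain predicates over recorded facts, evaluated before a transition is allowed. The following sketch covers four of them. `TransitionRequest`, `failed_guards`, the risk tier ordering, and the `1000.0` dual-approval threshold are all assumptions for illustration; in a real system the threshold and its version would come from the policy engine.

```python
from dataclasses import dataclass

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}  # assumed tier ordering


@dataclass(frozen=True)
class TransitionRequest:
    risk_tier: str
    amount: float
    approver_ids: tuple[str, ...]
    creator_id: str
    idempotency_key: str


def failed_guards(
    req: TransitionRequest,
    executed_keys: set[str],
    dual_approval_threshold: float = 1000.0,  # hypothetical threshold
) -> list[str]:
    """Return the names of guards that block the transition, in a fixed order."""
    approvers_needed = 2 if req.amount > dual_approval_threshold else 1
    checks = {
        # Risk guard: only tiers up to medium may advance automatically.
        "risk_guard": RISK_ORDER[req.risk_tier] <= RISK_ORDER["medium"],
        # Authority guard: maker-checker, the creator cannot approve.
        "authority_guard": bool(req.approver_ids)
        and req.creator_id not in req.approver_ids,
        # Amount guard: large amounts need two distinct approvers.
        "amount_guard": len(set(req.approver_ids)) >= approvers_needed,
        # Idempotency guard: the same intent must not run twice.
        "idempotency_guard": req.idempotency_key not in executed_keys,
    }
    return [name for name, passed in checks.items() if not passed]
```

Returning the names of the failed guards, rather than a single boolean, gives the audit log a concrete reason for every blocked transition.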

5.3 Payment Dispute Agent State Machine Example

workflow_name: PaymentDisputeAgentWorkflow
workflow_version: 1.0.0
business_object: payment_dispute
risk_domain: customer_funds_and_card_dispute
start_event: payment.dispute.opened.v1
terminal_states:
  - Completed
  - Rejected
  - Canceled
  - ManualTakeover
  - Compensated
states:
  Intake:
    allowed_events:
      - DisputeOpened
    actions:
      - classify_dispute_risk
      - create_audit_binder
    next:
      - EvidenceGathering
  EvidenceGathering:
    actions:
      - fetch_payment_transaction
      - fetch_customer_profile
      - fetch_merchant_response
    retry_policy: transient_errors_with_backoff
    timeout_policy: merchant_response_10_business_days
    next:
      - EligibilityDrafting
      - AwaitingEvidence
      - DLQRepair
  EligibilityDrafting:
    model_activity: draft_dispute_eligibility
    output_schema: dispute_eligibility_v1
    next:
      - HumanApprovalRequired
      - AutoRejectReview
  HumanApprovalRequired:
    approval_policy: provisional_credit_maker_checker
    sla: 4_business_hours
    next:
      - ProvisionalCreditPosting
      - ReworkRequested
      - EscalatedReview
  ProvisionalCreditPosting:
    tool_activity: post_provisional_credit
    idempotency_key: dispute_id_plus_credit_intent_version
    compensation: reverse_provisional_credit
    next:
      - CustomerNotification
      - CompensationRequired
  CustomerNotification:
    tool_activity: send_customer_notification
    approval_policy: approved_template_only
    next:
      - Completed
  CompensationRequired:
    actions:
      - schedule_reverse_provisional_credit
      - open_manual_repair_case
    next:
      - Compensated

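The point of this example is not YAML syntax. It is to make the workflow's governance semantics explicit: every high-impact action has a state, approval, idempotency, compensation, and evidence.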
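
A spec like this can also be checked mechanically. The sketch below is a small, hypothetical lint (`lint_state_machine` is not part of any tool named in this handbook) that reports `next` targets that are neither defined states nor terminal states, and terminal states that no transition reaches. The spec is transcribed as a Python dict, keeping only the `next` lists, to avoid depending on a YAML parser.

```python
def lint_state_machine(spec: dict) -> dict[str, list[str]]:
    """Report transition targets with no definition and terminals never reached."""
    states = spec["states"]
    terminals = set(spec["terminal_states"])
    targets = {t for s in states.values() for t in s.get("next", [])}
    return {
        "undefined_targets": sorted(targets - set(states) - terminals),
        "unreachable_terminals": sorted(terminals - targets),
    }


# The `next` lists of the example above, transcribed by hand.
PAYMENT_DISPUTE_SPEC = {
    "terminal_states": [
        "Completed", "Rejected", "Canceled", "ManualTakeover", "Compensated",
    ],
    "states": {
        "Intake": {"next": ["EvidenceGathering"]},
        "EvidenceGathering": {
            "next": ["EligibilityDrafting", "AwaitingEvidence", "DLQRepair"],
        },
        "EligibilityDrafting": {
            "next": ["HumanApprovalRequired", "AutoRejectReview"],
        },
        "HumanApprovalRequired": {
            "next": ["ProvisionalCreditPosting", "ReworkRequested", "EscalatedReview"],
        },
        "ProvisionalCreditPosting": {
            "next": ["CustomerNotification", "CompensationRequired"],
        },
        "CustomerNotification": {"next": ["Completed"]},
        "CompensationRequired": {"next": ["Compensated"]},
    },
}
```

Run against the example, the lint reports five referenced but undefined states (`AutoRejectReview`, `AwaitingEvidence`, `DLQRepair`, `EscalatedReview`, `ReworkRequested`) and three terminal states with no incoming transition (`Canceled`, `ManualTakeover`, `Rejected`). That is expected for an abbreviated example, but a full spec would need to define or remove them before release.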

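6. Workflow Replay and the Special Boundaries of AI

Workflow replay is a key capability of durable workflows, but AI workflows must avoid one common misconception: replay does not mean "ask the model again."

6.1 Correct Replay Semantics

Scenario	Replay behavior	Prohibited behavior
Recovery after worker crash	Rebuild workflow state from event history and continue unfinished activities	Redo completed payments, notifications, approvals, or model conclusions.
Completed model activity	Use the model output and metadata recorded in event history	Call a new model during replay and produce a different conclusion.
Completed tool activity	Use the recorded activity result and side-effect ledger	Call the external system to write again.
Replaying old instances after a code upgrade	Interpret old events through version-compatibility logic	Rename or delete states or change branches so that old event history cannot be interpreted.
Incident investigation replay	Reconstruct the process state, evidence, decisions, and actions as they were at the time	Overwrite the evidence of the time with the current policy or current model.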

6.2 Model Call as Activity

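In a durable workflow, a model call should be treated as a non-deterministic activity:

Metadata	Example	Purpose
`model_provider`	`internal_model_gateway`	Proves the call entry point and supply chain.
`model_name`	`gpt-4.1-enterprise-routing`	Records the model routing result or abstract model name.
`model_version`	`gateway_policy_2026_06_29`	Supports replay, diff analysis, and model change audit.
`prompt_template_version`	`dispute_eligibility_prompt_v3`	Proves the prompt version and control boundary.
`input_evidence_refs`	`txn_ref_812`, `policy_ref_44`, `customer_statement_ref_19`	Makes the output traceable to evidence.
`input_hash`	`sha256:...`	Tamper resistance and repeat verification.
`output_schema`	`dispute_eligibility_v1`	Prevents free text from driving state transitions.
`output_hash`	`sha256:...`	Incident review and evidence archiving.
`safety_policy_decision_id`	`poldec_20260629_0041`	Proves guardrail and permission checks.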
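
The hash fields can be produced by hashing a canonical serialization of the input and output. The sketch below shows one possible approach using only the Python standard library. The field names follow the table, while `sha256_of`, `ModelActivityRecord`, `record_model_activity`, and the choice of sorted-key compact JSON as the canonical form are assumptions for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


def sha256_of(payload: Any) -> str:
    """Hash a JSON-serializable payload in a key-order-independent form."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelActivityRecord:
    model_provider: str
    model_name: str
    model_version: str
    prompt_template_version: str
    input_evidence_refs: tuple[str, ...]
    input_hash: str
    output_schema: str
    output_hash: str


def record_model_activity(
    inputs: dict, output: dict, *, provider: str, name: str, version: str,
    prompt_template_version: str, output_schema: str,
) -> ModelActivityRecord:
    """Build the metadata to store in event history for one model call."""
    return ModelActivityRecord(
        model_provider=provider,
        model_name=name,
        model_version=version,
        prompt_template_version=prompt_template_version,
        input_evidence_refs=tuple(inputs.get("evidence_refs", [])),
        input_hash=sha256_of(inputs),
        output_schema=output_schema,
        output_hash=sha256_of(output),
    )
```

Storing the record, and replaying from it, is what keeps an incident investigation tied to the output the model actually produced at the time rather than to whatever a later model version would say.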

6.3 Workflow Replay vs Re-evaluation

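Operation	Purpose	Trigger	Output
Replay	Restore the process state as it was	worker crash, post-deployment compatibility test, incident investigation	State reconstruction under the same event history.
Re-evaluation	Reassess with a new model, new policy, or new evidence	Policy change, model upgrade, new customer material, quality sampling	New workflow activity and new audit event.
Simulation	Test alternative paths in an environment with no side effects	Training, change review, risk drill	Simulation results and diff report.
Remediation run	Apply a controlled fix to a production problem	DLQ, compensation failure, audit finding	Repair workflow, approval evidence, and recovery conclusion.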

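7. Product Decision: When to Use a Durable Workflow

Not every AI feature needs a durable workflow. Senior PMs and architects must be able to weigh cost against risk.

7.1 Decision Matrix

Question	If the answer is "yes"	Architecture tendency
Does the task span minutes, hours, days, or months?	State must survive a dropped session	Durable workflow + checkpoint.
Is there human approval, document collection, or waiting on external responses?	Timer, SLA, and escalation are required	State machine + HITL queue.
Does it call multiple systems and produce side effects?	Partial success is possible	Saga + side-effect ledger.
Does it involve funds, accounts, customer rights, or regulatory materials?	High-impact actions must be provable	Approval state + audit evidence.
After a failure, must it resume rather than start over?	Starting over would cause duplicates or inconsistent evidence	Event history + replay.
Will the model, policy, or tool versions change in the future?	Old instances must remain explainable	Versioned workflow + replay test.
Must the path be explained to audit, compliance, or incident review?	A chat transcript alone is not enough	Evidence binder + state transition log.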

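7.2 Scenarios That Should Not Be Over-Workflowed

Scenario	A better fit
One-off low-risk Q&A	RAG / chat assist + source citation + session log.
One-time data extraction with no external side effects	Batch job + validation report.
Local draft generation for staff	Draft-only tool + document version history.
Millisecond-level real-time risk decisions	Decisioning engine + feature store + model serving; the workflow handles only downstream case management.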

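7.3 Product Boundaries for the Platform PM

Platform capability	Productization requirement
Workflow template catalog	Provide standard state machine templates for payment dispute, KYC, AML, credit memo, complaint, and change approval.
Tool side-effect registry	Every tool must register its owner, risk class, idempotency strategy, approval policy, and compensation strategy.
Checkpoint service	Store agent checkpoints centrally, with support for permissions, encryption, retention periods, evidence references, and deletion policy.
Replay console	Support rebuilding state by workflow id, viewing event history, comparing versions, and exporting audit evidence.
HITL workbench	Provide approval queues, evidence views, decision rationale, escalation, return, SLA, and reviewer calibration.
DLQ operations	Manage failed events, failed activities, poison messages, manual repair, and recovery approval in one place.
Governance dashboard	Monitor state distribution, approval backlog, retry rate, compensation rate, replay failures, and tool risk exposure.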

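8. Deliverable Template Pack

This section provides ready-to-use deliverable templates. Each template takes the form "field list + retail financial services example + acceptance criteria" so that it does not stop at generic description.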

8.1 State Machine Spec Template

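Section	Required content	Payment dispute agent example	Acceptance criteria
Workflow identity	workflow name, version, domain, owner, business object	`PaymentDisputeAgentWorkflow`, `1.0.0`, Payments Ops	The process definition and responsible team can be uniquely identified.
Scope and boundary	Automation scope, human boundary, prohibited actions	The agent may organize evidence and generate recommendations but may not approve large provisional credits on its own	Approval boundaries and prohibited areas are explicit.
Trigger events	Start, resume, and cancel events	`payment.dispute.opened.v1`, `evidence.received.v1`, `customer.cancelled.v1`	Every trigger event has a schema and a producer.
States	State name, type, entry condition, allowed actions	`EvidenceGathering`, `HumanApprovalRequired`, `ProvisionalCreditPosting`	No state mixes multiple responsibilities.
Transitions	from, event, guard, action, to	`HumanApprovalRequired` + `Approved` + amount guard -> `ProvisionalCreditPosting`	Every transition has a guard condition and an outcome.
Activities	model activity, tool activity, human task	`draft_dispute_eligibility`, `post_provisional_credit`	Every activity has input and output schemas.
Timers and SLA	state timeout, business SLA, escalation	Approval within 4 business hours, merchant response within 10 business days	Timeout behavior is specified, not left to verbal agreement.
Error handling	retry, catch, DLQ, manual repair	After three retries on payment gateway timeout, enter `DLQRepair`	Error types and handling paths are executable.
Idempotency	key strategy, dedupe window, result reuse	`dispute_id + action_type + intent_version`	Retry and replay do not repeat side effects.
Compensation	Compensating action, approval, statement of non-compensable effects	Reverse the provisional credit and send a correction notice	Partial success has a business recovery path.
Audit evidence	event list, input/output hash, approval evidence	state transitions, model activity, tool receipt, approver reason	Audit can reconstruct the path.
Versioning	workflow version, schema version, migration policy	`v1` instances continue on a compatibility branch; new `v2` instances use the new states	Deployment does not break replay of old instances.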

8.2 Agent Checkpoint Schema Template

{
  "checkpoint_id": "ckpt_payment_dispute_DSP-2026-09291_007",
  "workflow_id": "wf_DSP-2026-09291",
  "workflow_name": "PaymentDisputeAgentWorkflow",
  "workflow_version": "1.0.0",
  "business_object": {
    "type": "payment_dispute",
    "id": "DSP-2026-09291",
    "customer_ref": "cust_ref_9f22",
    "risk_tier": "medium",
    "jurisdiction": "US"
  },
  "state": {
    "current_state": "HumanApprovalRequired",
    "last_event_id": "evt_20260629_000341",
    "pending_events": ["ApprovalRecorded", "ReworkRequested", "ApprovalExpired"],
    "pending_timers": ["approval_sla_4_business_hours"]
  },
  "agent_context": {
    "task_goal": "Assess provisional credit eligibility for disputed card transaction",
    "prompt_template_version": "dispute_eligibility_prompt_v3",
    "model_activity_id": "act_model_20260629_7781",
    "structured_output_ref": "evidence://model_outputs/dispute_eligibility/7781",
    "input_evidence_refs": [
      "evidence://payments/txn_812",
      "evidence://customer_statement/stmt_19",
      "evidence://policy/dispute_policy_2026_04"
    ],
    "unsupported_claim_count": 0
  },
  "tool_ledger": [
    {
      "tool_name": "fetch_payment_transaction",
      "tool_class": "read_only",
      "activity_id": "act_tool_110",
      "result_ref": "evidence://tool_results/payment_txn_812",
      "idempotency_key": "read_txn_812_wf_DSP-2026-09291",
      "side_effect_committed": false
    },
    {
      "tool_name": "post_provisional_credit",
      "tool_class": "money_movement",
      "activity_id": "act_tool_116",
      "result_ref": "not_executed_waiting_for_approval",
      "idempotency_key": "DSP-2026-09291_provisional_credit_v1",
      "side_effect_committed": false
    }
  ],
  "human_approval": {
    "approval_state": "Pending",
    "approval_policy": "maker_checker_for_customer_funds",
    "assigned_role": "Payments Dispute Senior Reviewer",
    "sla_due_at": "2026-06-29T22:00:00Z",
    "evidence_binder_ref": "evidence://binders/DSP-2026-09291"
  },
  "recovery": {
    "replay_safe": true,
    "resume_action": "wait_for_approval_or_escalation",
    "compensation_status": "not_required",
    "dlq_ticket_ref": "none"
  },
  "governance": {
    "data_classification": "confidential_customer_financial_data",
    "retention_policy": "payments_dispute_case_retention",
    "audit_log_ref": "audit://workflow/wf_DSP-2026-09291",
    "created_at": "2026-06-29T18:00:00Z"
  }
}

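Checkpoint acceptance criteria:

Check	Criterion
Resume	After a worker crash, the workflow resumes in the same state without losing pending approvals or timers.
Replay	Replay uses the recorded activity results and does not re-execute completed side effects.
Evidence	Every model conclusion and tool result has an evidence reference or hash.
Privacy	No unnecessary raw content, credentials, or unauthorized sensitive data is stored.
Governance	The checkpoint can be linked to the workflow version, policy version, model metadata, and audit log.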

8.3 Tool Side-Effect Matrix Template

Tool	Tool class	Side effect	Approval	Idempotency key	Retry / replay rule	Compensation	Evidence
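`fetch_payment_transaction`	Read-only	No business write	No approval needed; authorization required	`workflow_id + transaction_id + read_version`	Replay uses the recorded result; recovery may re-read according to source freshness and write a new event	Not required	source timestamp, response hash, entitlement decision
`draft_customer_response`	Draft-only	Generates a draft	Reviewer approval before sending	`case_id + draft_type + draft_version`	Replay uses the generated draft; re-evaluation writes a new draft version	Delete or retire the draft version	prompt version, evidence refs, draft diff
`update_case_note`	Internal write	Writes a case note	Automatic for medium risk, human review for high risk	`case_id + note_intent + content_hash`	Retry uses the same key; replay does not write again	Append a correction note; do not delete the original record	note id, actor, content hash
`send_customer_notification`	Customer-facing	Customer receives a notification	Low-risk templated messages may be automatic; disputes and complaints require human approval	`case_id + notification_type + approved_template_version`	Retry depends on the send receipt; replay does not resend	Send a correction notice and record the reason	approved content, recipient, delivery receipt
`post_provisional_credit`	Money movement	Provisional credit to the customer account	Dual approval and limits	`dispute_id + credit_amount + intent_version`	Retry checks the ledger first; replay uses the posted result	`reverse_provisional_credit`, approval required	approval ids, ledger entry, posting receipt
`restrict_account`	Account restriction	Restricts account transactions	Compliance / Risk authorization	`customer_ref + restriction_type + case_id`	No automatic retry on high-risk failure; route to manual repair	Lift the restriction or send a correction notice, depending on law and policy	policy decision, approver, customer impact note
`submit_regulatory_pack`	Regulatory action	Submits regulatory or audit materials	Sign-off by an authorized person	`case_id + submission_type + final_pack_hash`	Replay does not resubmit; recovery goes through an amended submission	Corrected or supplementary submission	final pack hash, sign-off, submission receipt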
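
The "retry checks the ledger first" rule can be sketched as an execute-once wrapper keyed by the idempotency key. This is a simplified, in-memory illustration: `SideEffectLedger`, `execute_once`, and `idempotency_key` are hypothetical names, and a production ledger would be durable and would also reconcile against the downstream system when a call times out after succeeding.

```python
from typing import Any, Callable


def idempotency_key(*parts: object) -> str:
    """Join key parts in a fixed order, e.g. dispute id, amount, intent version."""
    return "+".join(str(part) for part in parts)


class SideEffectLedger:
    """Remembers the result of each committed side effect by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def execute_once(self, key: str, action: Callable[[], Any]) -> tuple[Any, bool]:
        """Return (result, executed_now); a repeated key reuses the stored result."""
        if key in self._results:
            return self._results[key], False
        result = action()
        self._results[key] = result
        return result, True
```

A retry or replay that presents the same key receives the original posting receipt and `executed_now` is `False`, so a single business intent produces a single ledger entry however many times it is attempted.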

8.4 Saga Compensation Table Template

Step	Local transaction	Success event	Failure trigger	Compensation	Compensability	Approval	Evidence
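1	Create the payment dispute case	`DisputeCaseCreated`	Customer cancels after case creation	Close the case and record the cancellation reason	Compensable	Ops lead for exception	case id, actor, cancel reason
2	Request merchant evidence	`MerchantEvidenceRequested`	Request channel fails or expires	Record that the request could not be made and hand over to manual channel contact	Partially compensable	Dispute reviewer	request id, deadline, error
3	Post provisional credit	`ProvisionalCreditPosted`	Later found ineligible or posted twice	Reverse the provisional credit and send an explanation	Compensable, with customer impact	Maker-checker	posting receipt, reversal receipt, customer notice
4	Notify the customer	`CustomerNotificationSent`	Content error or evidence change	Send a correction notice linked to the original	Not fully compensable	Compliance review for regulated complaints	message hash, delivery receipt, correction note
5	Submit the chargeback file	`ChargebackSubmitted`	File field error	Submit an amended file or contact the network channel manually	Depends on network rules	Payments specialist	file hash, submission receipt, amendment
6	Close the dispute case	`DisputeCaseClosed`	New evidence arrives later	Reopen the case or create a linked case	Compensable	Ops lead	close reason, reopen reason, linked case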

8.5 HITL Approval Map Template

Approval state	Trigger	Reviewer role	Evidence required	Allowed decisions	SLA	Escalation	Audit evidence
`ProvisionalCreditApprovalPending`	Agent judges that provisional credit may be considered	Payments Dispute Senior Reviewer	Transaction details, customer statement, merchant information, policy clauses, AI eligibility output	approve, reject, request_rework, escalate	4 business hours	Payments Ops Manager	reviewer id, decision, reason code, evidence binder version
`AMLNarrativeReviewPending`	Agent generates an AML investigation narrative	AML Investigator Level 2	alert evidence, transaction graph, KYC profile, sanctions hits, source freshness	approve_narrative, edit, request_more_evidence, escalate_to_MLRO	1 business day	AML Team Lead	narrative diff, approver, policy references
`KYCExceptionApprovalPending`	KYC documents are inconsistent or risk has risen	KYC Quality Reviewer	customer documents, risk score, exception reason, policy version	approve_exception, reject_onboarding, request_documents	2 business days	Compliance Manager	document refs, decision rationale, customer impact
`CreditMemoSignoffPending`	AI finishes drafting the credit memo	Underwriter / Credit Officer	financials, bureau data, collateral, covenant analysis, AI limitations	approve_memo, edit_memo, return_to_relationship_manager	1 business day	Credit Committee	memo version, edit diff, signoff
`ComplaintRegulatoryResponsePending`	Complaint involves a regulatory deadline or a high-impact customer	Complaint Manager / Legal	complaint timeline, call transcripts, policy, remediation proposal	approve_response, revise_response, legal_escalation	jurisdiction SLA	Head of Complaints	final response hash, legal note, delivery evidence
`CoreSystemChangeCABPending`	Agent generates the core system change approval pack	CAB members, Risk, Security, Ops	impact analysis, rollback plan, test evidence, deployment window	approve_change, reject_change, defer_change, require_controls	CAB calendar	Change Advisory Board chair	approval record, control checklist, release decision

8.6 Replay / Audit Evidence Template

Evidence category	Required fields	Example evidence	Audit question answered
Workflow identity	workflow id, name, version, business object, owner	`wf_DSP-2026-09291`, `PaymentDisputeAgentWorkflow`, `1.0.0`	Which process instance produced this result.
Event history	event id, type, timestamp, causation id, actor	`ApprovalRecorded`, `SideEffectCommitted`	Why the state changed in this order.
Model activity	model name, version, prompt template, input refs, output hash, schema	`dispute_eligibility_prompt_v3`	What output the AI generated and from which evidence.
Tool activity	tool name, request hash, response hash, idempotency key, side-effect flag	`post_provisional_credit`, receipt id	Whether the external action happened, was duplicated, and can be compensated.
Policy decision	policy id, version, input facts, decision, reason	`maker_checker_policy_2026_05`	Why the system required approval or prohibited automation.
Human approval	reviewer, role, decision, reason, evidence binder version	`approve`, `customer_funds_policy_match`	Who approved, when, and on what evidence.
Checkpoint	checkpoint id, state, pending timers, pending approvals	`ckpt_payment_dispute_007`	How to resume after an interruption.
Compensation	compensation trigger, action, approver, result	`reverse_provisional_credit`	How a partial failure was recovered or corrected.
Incident linkage	incident id, DLQ ticket, recovery run, postmortem ref	`INC-2026-044`	How the incident was handled and whether recovery is complete.

8.7 Incident Recovery Drill Template

Drill	Injected failure	Expected detection	Recovery action	Success evidence
Worker crash recovery	Stop the worker while in the `HumanApprovalRequired` state	Workflow execution stuck alert, worker restart event	replay event history, resume pending approval	state unchanged, no duplicate tool side effect, checkpoint valid
Duplicate tool retry	`post_provisional_credit` returns a timeout although the ledger posting succeeded	idempotency ledger detects existing posting	reuse posting receipt, do not repost	one ledger entry, one customer credit, audit event notes timeout
Approval timeout	Reviewer does not act within the SLA	approval SLA breach alert	escalate to manager, send reminder, keep workflow paused	escalation event, assignment update, SLA report
DLQ poison event	`kyc.document.received.v1` payload is missing a schema field	schema validation failure and DLQ ticket	fix producer or map missing field through repair workflow	DLQ item resolved, replay approved, consumer lag normal
Model output schema error	model returns unsupported enum	output validator rejects activity result	retry with constrained schema or route to human drafting	invalid output captured, no state transition based on bad output
Compensation failure	Reversal of the provisional credit fails	compensation activity failed after retry	open manual repair, freeze further customer notification, assign Ops	repair ticket, no silent completion, customer impact tracked
Replay compatibility failure	New code cannot interpret an old event	replay test fails in pre-release	block deployment, add version branch, rerun replay test	release gate evidence, old instance compatible

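9. Retail Financial Services Case Library

9.1 Payment Dispute Agent

Design dimension	Architecture design
Durable task	From customer dispute submission through evidence gathering, eligibility draft, human approval, provisional credit, customer notification, and final closure.
State machine	`Intake -> EvidenceGathering -> EligibilityDrafting -> HumanApprovalRequired -> ProvisionalCreditPosting -> CustomerNotification -> Completed`.
Agent role	Summarize the dispute reason, check policy match, generate the eligibility draft, and list missing evidence.
High-risk side effect	Provisional credit, fee waiver, chargeback file submission, customer notification.
HITL	Large amounts, repeat disputes, high-risk customers, policy uncertainty, and actions affecting customer rights require human approval.
Idempotency	`dispute_id + action_type + amount + intent_version`.
Compensation	Reverse the provisional credit, send a correction notice, reopen the case.
Audit	Transaction evidence, customer statement, policy version, model output, approval rationale, ledger receipt, notification receipt.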

9.2 AML Investigation Copilot

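Design dimension	Architecture design
Durable task	From AML alert through evidence collection, transaction graph summary, risk hypotheses, narrative draft, investigator review, and case disposition.
State machine	`AlertReceived -> EvidenceCollection -> PatternDrafting -> InvestigatorReview -> NarrativeFinalization -> CaseDisposition`.
Agent role	Summarize transaction patterns, correlate KYC and account behavior, and generate investigation hypotheses and a narrative draft.
High-risk side effect	Closing the alert, escalating the case, generating the SAR evidence pack, recommending account restriction.
HITL	SAR-related narratives, case closure, and account restriction recommendations must be reviewed by an authorized investigator.
Idempotency	`alert_id + investigation_step + evidence_snapshot_hash`.
Compensation	Reopen the alert, append a corrected narrative, submit an amended evidence pack.
Audit	Transaction graph version, KYC profile timestamp, external list results, AI narrative diff, investigator decision.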

9.3 KYC Onboarding Workflow

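Design dimension	Architecture design
Durable task	From application submission through document collection, identity verification, risk scoring, exception review, customer document requests, and the account opening decision.
State machine	`ApplicationSubmitted -> DocumentCollection -> Verification -> RiskScoring -> ExceptionReview -> AccountOpeningDecision`.
Agent role	Check document completeness, explain failure reasons, draft document requests, and organize exception evidence.
High-risk side effect	Rejecting onboarding, activating the account, customer notification, writing risk labels.
HITL	High-risk customers, inconsistent documents, and PEP / sanctions scenarios go to human review.
Idempotency	`application_id + verification_provider + document_hash`.
Compensation	Remove incorrect labels, reopen onboarding, send a corrected document request.
Audit	Document hash, verification vendor response, risk score version, reviewer reason, customer notification receipt.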

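9.4 Credit Memo Workflow

Design dimension	Architecture design
Durable task	From RM submission of materials through financial analysis, cash-flow summary, risk points, covenants, memo drafting, and underwriter sign-off.
State machine	`RequestIntake -> DataValidation -> FinancialAnalysisDraft -> RiskNarrativeDraft -> UnderwriterReview -> CreditCommitteePack`.
Agent role	Structure financial data, generate the risk summary, check against credit policy, and generate the memo draft and committee Q&A.
High-risk side effect	Memo entering the approval pack, credit recommendation, commitments to the customer, rejection reasons.
HITL	The underwriter is accountable for all credit decision support output; the AI may only draft and organize evidence.
Idempotency	`credit_request_id + memo_section + source_data_version`.
Compensation	Retire the memo version, regenerate a section, append an approval note.
Audit	Data sources, ratio calculation reference, policy version, AI draft hash, underwriter edit diff.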

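9.5 Customer Complaint Handling

Design dimension	Architecture design
Durable task	From complaint intake through classification, regulatory deadline determination, fact investigation, remediation proposal, customer reply, closure, and quality sampling.
State machine	`ComplaintReceived -> Classification -> Investigation -> RemediationDraft -> ManagerReview -> CustomerResponse -> Closure`.
Agent role	Build the complaint timeline, identify regulatory keywords, summarize evidence, and draft the reply and remediation proposal.
High-risk side effect	Customer-visible reply, remediation, regulatory classification, case closure.
HITL	High-impact customers, regulatory complaints, legal threats, remediation, and the final reply require human approval.
Idempotency	`complaint_id + response_version + approved_template_hash`.
Compensation	Send a corrected reply, reopen the complaint, adjust the remediation plan.
Audit	Call records, chat records, policy references, remediation approval, final reply hash, delivery evidence.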

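9.6 Core System Change Approval

Design dimension	Architecture design
Durable task	From change request through impact analysis, risk assessment, test evidence, CAB approval, deployment window, and rollback verification.
State machine	`ChangeRequested -> ImpactAnalysisDraft -> RiskReview -> CABApproval -> DeploymentWindow -> PostImplementationReview`.
Agent role	Draft the impact analysis, identify dependent systems, organize test evidence, and generate rollback and communication checklists.
High-risk side effect	CAB approval, deployment instruction, core parameter change, customer impact notification.
HITL	CAB, Security, Ops, and the Business Owner sign off by risk category.
Idempotency	`change_id + deployment_action + release_version`.
Compensation	Roll back the release, restore parameters, publish incident communication.
Audit	approval record, test evidence, deployment logs, rollback decision, post-implementation review.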

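10. Timeout / Retry / DLQ Design

10.1 Error Classification

Error class	Example	Handling strategy
Transient dependency error	Brief payment gateway timeout, KYC vendor 5xx	Bounded retry, exponential backoff, jitter, circuit breaker.
Validation error	Tool arguments missing fields, output schema invalid	Do not retry blindly; return to agent rework or human repair.
Entitlement error	Current actor lacks permission to read or execute	Stop the action, record the policy decision, request authorization or hand off to a human.
Business conflict	Case version changed, customer cancelled, duplicate application	Read the latest business state and enter a conflict-handling state.
Human timeout	Reviewer has not acted, customer has not supplied documents	Reminder, escalation, auto-close, or manual takeover, as determined by policy.
Poison event	Events of the same type repeatedly fail to parse	Move to the DLQ, pause automatic replay, fix the schema or the producer.
Determinism error	Replay produces a different command sequence	Block deployment or recovery and follow workflow version repair.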
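The error classes above can be turned into an explicit routing table so that only one class is ever retried automatically. This is a sketch under stated assumptions: the enum members mirror the table rows, while the next-step labels (`bounded_retry`, `dlq_pause_replay`, and so on) are hypothetical names that summarize the handling column.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT_DEPENDENCY = "transient_dependency"
    VALIDATION = "validation"
    ENTITLEMENT = "entitlement"
    BUSINESS_CONFLICT = "business_conflict"
    HUMAN_TIMEOUT = "human_timeout"
    POISON_EVENT = "poison_event"
    DETERMINISM = "determinism"


# Hypothetical next-step labels, one per row of the error classification table.
NEXT_STEP = {
    ErrorClass.TRANSIENT_DEPENDENCY: "bounded_retry",
    ErrorClass.VALIDATION: "agent_rework_or_human_repair",
    ErrorClass.ENTITLEMENT: "stop_and_request_authorization",
    ErrorClass.BUSINESS_CONFLICT: "reload_state_and_resolve_conflict",
    ErrorClass.HUMAN_TIMEOUT: "remind_or_escalate_per_policy",
    ErrorClass.POISON_EVENT: "dlq_pause_replay",
    ErrorClass.DETERMINISM: "block_and_version_repair",
}


def next_step(error_class: ErrorClass) -> str:
    """Look up the handling path for a classified error."""
    return NEXT_STEP[error_class]


def is_auto_retryable(error_class: ErrorClass) -> bool:
    """Only transient dependency errors are retried without a human or a fix."""
    return error_class is ErrorClass.TRANSIENT_DEPENDENCY
```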

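10.2 Retry Policy

Activity type	Retry policy	Limit	State after failure
Read-only source fetch	Fast retry + backoff	3 to 5 attempts, depending on source SLA	`AwaitingDependency` or `DLQRepair`
Model activity	A schema validation failure may be retried once; content risk is never retried automatically	1 to 2 attempts	`AgentOutputReview`
Internal write	Retry with the idempotency key	Until ledger confirmation or the retry limit	`ManualRepair`
Customer-facing notification	Check the delivery receipt first, then decide whether to retry	Strictly governed by send status	`NotificationReview`
Money movement	Never retry blindly on timeout; check the accounting ledger first	Governed by the idempotency result	`FundsManualRepair`
Regulatory submission	No automatic resubmission	Human confirmation	`RegulatorySubmissionReview`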
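For the read-only source fetch row, bounded retry with exponential backoff and jitter can be sketched as follows. The `TransientError` class, the `call_with_retry` helper, and the default delay values are assumptions for illustration; the helper is not suitable for money movement or regulatory submission, which the table routes to ledger checks and human confirmation instead.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as a vendor 5xx."""


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(activity, max_attempts=3,
                    failure_state="AwaitingDependency", sleep=time.sleep):
    """Run a read-only activity with a hard attempt limit.

    Returns (state, result). Non-transient exceptions are not caught here,
    so validation or entitlement errors propagate without any retry.
    """
    for attempt in range(max_attempts):
        try:
            return "Completed", activity()
        except TransientError:
            # Sleep only if another attempt remains.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    # Attempts exhausted: hand over to the failure state from the policy table.
    return failure_state, None
```

Passing `sleep` as a parameter keeps the helper testable; inside a real workflow engine the wait would normally be an engine timer rather than a blocking sleep.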

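10.3 DLQ Operations

A DLQ is not a trash bin; it is a controlled recovery queue.

DLQ field	Purpose
`dlq_item_id`	Uniquely identifies the failed event or activity.
`workflow_id`	Links to the business process instance.
`failure_class`	transient, schema, policy, business_conflict, determinism, unknown.
`payload_ref`	Reference to the original payload, stored encrypted.
`first_failed_at` / `last_failed_at`	Used to judge backlog and recovery priority.
`retry_count`	Prevents unbounded retries.
`customer_impact`	Determines whether customer or regulatory handling is needed.
`replay_allowed`	Whether automatic replay is permitted.
`repair_decision`	discard, replay, transform_and_replay, manual_complete, compensate.
`approver`	Approver of high-risk recovery actions.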
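The DLQ fields above map naturally onto a typed record plus a guard on repair decisions. The sketch below is illustrative only: the field names come from the table, but typing `customer_impact` as a boolean and treating every customer-impacting item as requiring an approver are assumptions, as is the `apply_repair` helper itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

REPAIR_DECISIONS = {
    "discard", "replay", "transform_and_replay", "manual_complete", "compensate",
}


@dataclass
class DlqItem:
    dlq_item_id: str
    workflow_id: str
    failure_class: str
    payload_ref: str            # reference only; the payload stays in encrypted storage
    first_failed_at: datetime
    last_failed_at: datetime
    retry_count: int
    customer_impact: bool       # assumption: modeled as a simple flag
    replay_allowed: bool
    repair_decision: Optional[str] = None
    approver: Optional[str] = None


def apply_repair(item: DlqItem, decision: str, approver: Optional[str] = None) -> DlqItem:
    """Record a repair decision only if it passes the replay and approval guards."""
    if decision not in REPAIR_DECISIONS:
        raise ValueError(f"unknown repair decision: {decision}")
    if decision in {"replay", "transform_and_replay"} and not item.replay_allowed:
        raise PermissionError("replay is not allowed for this item")
    # Assumption: customer-impacting items count as high risk and need an approver.
    if item.customer_impact and approver is None:
        raise PermissionError("an approver is required for customer-impacting items")
    item.repair_decision = decision
    item.approver = approver
    return item
```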

11. Governance: NIST AI RMF Mapping

RMF Function	Durable agent workflow question	Key controls	Evidence
Govern	Who owns the workflow, state machine, tools, approvals, compensation, and replay risk	RACI, policy, release gate, model/tool/workflow owner, risk acceptance	governance record, approval matrix, control owner
Map	Which business states, customer impacts, external side effects, and failure paths exist	use case risk map, side-effect inventory, harm scenario, dependency map	architecture diagram, state machine spec, tool matrix
Measure	How to measure whether the workflow is reliable, auditable, and recoverable	replay test, state distribution, retry rate, DLQ aging, approval SLA, compensation rate	metrics dashboard, eval report, audit sample
Manage	How incidents, model changes, tool changes, and policy changes are handled	incident drill, change management, kill switch, migration, postmortem	incident record, recovery run, change approval

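11.1 Key Governance Metrics

Metric	Why it matters
Workflow completion rate by state	Identifies process bottlenecks and abnormal states.
Median / P95 state duration	Finds long waits, human queue backlog, and dependency bottlenecks.
Approval SLA breach rate	Measures whether HITL is actually operable.
Retry exhaustion rate	Identifies unstable downstream systems or poor error classification.
DLQ aging and customer impact	Prevents failed events from sitting unattended.
Compensation rate	Identifies problems in process design, tool reliability, or agent judgment quality.
Duplicate side-effect prevented	Shows that idempotency controls deliver value.
Replay test pass rate	Shows that a new version does not break old instances.
Model output schema violation rate	Measures whether AI output is fit to drive the workflow.
Manual override reason distribution	Identifies defects in the model, policy, data, or UI design.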
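Several of these metrics reduce to simple aggregations over workflow records. The sketch below assumes state durations are already collected per state in a common unit; the nearest-rank percentile definition and the helper names `percentile`, `state_duration_summary`, and `rate` are illustrative choices, not definitions from this playbook.

```python
import math
from statistics import median


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def state_duration_summary(durations_by_state):
    """Median and P95 duration per state; states with no samples are skipped."""
    return {
        state: {"median": median(samples), "p95": percentile(samples, 95)}
        for state, samples in durations_by_state.items()
        if samples
    }


def rate(numerator, denominator):
    """Generic ratio for breach, exhaustion, or violation rates; 0.0 when empty."""
    return numerator / denominator if denominator else 0.0
```

A single long-running instance barely moves the median but dominates P95, which is why the table pairs the two when looking for queue backlog.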

11.2 Versioning and Change Control

Change type	Risk	Control
Prompt template change	The same evidence produces a different structured output	Prompt versioning, regression eval, approval for high-risk workflows.
Model routing change	Output quality, format, refusal behavior, and safety policy change	Model gateway policy version, canary, rollback.
Tool API change	Activity result schema changes; replay or recovery fails	Contract test, schema version, consumer impact review.
State machine change	Replay of old instances becomes incompatible	Workflow version branch, migration plan, replay test.
Policy threshold change	The boundary between automation and approval shifts	Policy decision version, risk sign-off, effective date.
Compensation change	The incident recovery procedure changes	Operational drill, approval update, audit training.

12. 30-Day Training Plan

Day	Training topic	Exercise	Deliverable
1	Durable workflow core concepts	Read the Temporal workflow, AWS Step Functions state machine, and Workflow Patterns overviews	One-page concept comparison: workflow engine vs agent runtime vs business system
2	Identifying long-running financial retail processes	Pick six scenarios: payment dispute, KYC, AML, credit memo, complaint, core system change	Long-running process risk map
3	State taxonomy	Classify the states of the six scenarios as Intake, Evidence, Agent Drafting, Approval, External Action, Terminal	State type list
4	Payment dispute state machine	Write the complete states, events, transitions, and terminal states	Payment Dispute State Machine Spec
5	KYC onboarding state machine	Add customer document resubmission, vendor failure, exception review	KYC State Machine Spec
6	AML investigation state machine	Add evidence graph, narrative review, case disposition	AML State Machine Spec
7	State guard design	Add risk, evidence, authority, freshness, and idempotency guards to the three state machines	Guard Matrix
8	Event history design	Define workflow events, activity events, and approval events for payment dispute	Event History Catalog
9	Determinism drill	List every source of non-determinism in the workflow	Determinism Risk Register
10	Model activity design	Define model activity metadata and output schema for the credit memo	Model Activity Contract
11	Checkpoint schema	Write a checkpoint schema for the AML copilot	AML Agent Checkpoint Schema
12	Tool side-effect classification	Tier 20 financial retail tools by side-effect risk	Tool Side-Effect Matrix
13	Idempotency design	Design idempotency keys for money movement, notification, case note, regulatory submission	Idempotency Key Standard
14	Saga compensation	Write local transactions and compensations for payment dispute and complaint handling	Saga Compensation Tables
15	HITL approval	Design approval states, roles, SLA, and evidence for KYC, AML, credit memo	HITL Approval Maps
16	Timeout policy	Define timeouts for waiting on the customer, merchant, reviewer, and vendor	Timeout and Escalation Policy
17	Retry policy	Define retry, catch, and manual repair by error type	Retry Policy Matrix
18	DLQ operations	Design DLQ fields, triage, replay approval, and recovery paths	DLQ Runbook
19	Replay evidence	Design the minimum evidence binder needed for workflow replay	Replay / Audit Evidence Binder
20	Incident drill 1	Rehearse a worker crash and recovery	Worker Crash Recovery Drill
21	Incident drill 2	Rehearse a duplicate payment and idempotent recovery	Duplicate Side-Effect Drill
22	Incident drill 3	Rehearse an approval timeout and escalation	Approval Timeout Drill
23	Incident drill 4	Rehearse a DLQ poison event	DLQ Poison Event Drill
24	Versioning	Write version governance for state machine, prompt, model, tool, and policy	Versioning and Migration Plan
25	Governance mapping	Map controls and evidence using NIST AI RMF	Govern / Map / Measure / Manage Matrix
26	Platform capability	Design the capability list for a durable agent workflow platform	Platform PM Capability Roadmap
27	Metrics dashboard	Define metrics such as state duration, DLQ aging, and compensation rate	Workflow Reliability Dashboard Spec
28	Architecture review	Write an architecture review pack for the payment dispute agent	Architecture Review Pack
29	Portfolio story	Turn one case into an interview-ready STAR-T story	Durable Agent Workflow Portfolio Story
30	Executive readout	Explain business value, risk controls, and rollout path in 10 pages or fewer	Executive Briefing Memo

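13. Interview Answer Bank

13.1 Why does an AI agent need a durable workflow rather than ordinary function calls?

30-second answer

Because agent tasks in financial retail often span systems, people, and days, and they produce side effects on funds, customer rights, or regulatory obligations. An ordinary function call expresses a single action; a durable workflow expresses state, waiting, approval, retry, compensation, recovery, and audit.

2-minute answer

I would separate the agent runtime from the workflow control plane. The agent generates plans, drafts, and structured judgments; the workflow engine owns durable state, event history, timers, retry, approval state, and replay; the tool gateway owns external side effects and idempotency. For example, a payment dispute agent must not issue a duplicate provisional credit because of a network timeout, nor lose approval state because a worker restarts. With a durable workflow, completed tool activities are recorded in the event history and the side-effect ledger, and replay restores state without re-executing side effects. This improves reliability and also supports audit and incident review.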

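13.2 What is the biggest design challenge that workflow replay poses for AI agents?

30-second answer

The biggest challenge is that model calls and tool calls are both non-deterministic. Replay should restore the state as it was, not ask the model again or re-execute tools.

2-minute answer

In a durable workflow, workflow logic should be as deterministic as possible, while model calls, the current time, external APIs, and randomness belong in activities whose results are written to the event history. AI scenarios in particular need to record model version, prompt template version, input evidence refs, output schema, and output hash. Tool side effects such as payments, notifications, and account restrictions must be controlled through idempotency keys and the side-effect ledger. Replay uses the historical activity results; re-evaluating with a new model or a new policy is a re-evaluation workflow, not replay.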
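The replay rule in this answer can be illustrated with a toy event history. This is a simplified sketch and not how any particular engine is implemented: `WorkflowContext`, `NonDeterminismError`, and `dispute_workflow` are hypothetical names, and the history is an in-memory list of `(activity_name, result)` pairs.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a different activity than was recorded."""


class WorkflowContext:
    def __init__(self, history=None):
        self.history = list(history or [])  # recorded (activity_name, result) pairs
        self._cursor = 0

    def activity(self, name, fn):
        if self._cursor < len(self.history):
            # Replay path: reuse the recorded result and never call fn again.
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} but code asked for {name!r}"
                )
        else:
            # First execution: run the activity and record its result.
            result = fn()
            self.history.append((name, result))
        self._cursor += 1
        return result


def dispute_workflow(ctx, call_model, post_credit):
    # Model and tool calls go through activities, so the workflow body is deterministic.
    draft = ctx.activity("draft_summary", call_model)
    receipt = ctx.activity("post_provisional_credit", post_credit)
    return draft, receipt
```

Running `dispute_workflow` a second time against the recorded history returns the same draft and receipt without calling the model or the credit tool, which is the difference between replay and re-evaluation.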

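13.3 How do you design an agent checkpoint?

30-second answer

A checkpoint stores what is needed to resume the task: business state references, current state, evidence references, structured output, tool ledger, approval state, timers, and governance metadata. It does not store the model's free-form reasoning.

2-minute answer

I would split the checkpoint into seven parts: workflow position, business object, evidence refs, agent structured output, tool ledger, human approval, and governance metadata. For example, the checkpoint for an AML investigation copilot records alert id, current investigation state, transaction graph reference, KYC profile timestamp, narrative draft hash, pending investigator review, policy version, and retention policy. That way the workflow can resume after a worker crash, audit can verify where the evidence came from, and unnecessary sensitive source text or model reasoning chains are not retained long term.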
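The checkpoint parts listed above can be expressed as a small serializable schema. The sketch below follows the AML example in the answer; the class name `AmlCheckpoint`, the exact field names, and the JSON encoding are assumptions, and the record deliberately holds references and hashes rather than raw evidence or model reasoning.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AmlCheckpoint:
    # Workflow position
    workflow_id: str
    state: str
    # Business object
    alert_id: str
    # Evidence refs (pointers, not raw content)
    transaction_graph_ref: str
    kyc_profile_timestamp: str
    # Agent structured output
    narrative_draft_hash: str
    # Human approval: reviewer the workflow is waiting on, if any
    pending_investigator_review: Optional[str]
    # Governance metadata
    policy_version: str
    retention_policy: str
    # Tool ledger: idempotency keys of side effects already committed
    tool_ledger: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AmlCheckpoint":
        return cls(**json.loads(raw))
```

After a worker crash, a new worker would load the JSON, rebuild the record with `from_json`, and resume from `state` while skipping anything already listed in `tool_ledger`.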

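13.4 How is saga compensation used in an agent workflow?

30-second answer

A saga handles partial success in long transactions that span systems. Every local transaction needs a success event and a compensating action; compensation is a business action, not a simple technical rollback.

2-minute answer

Take payment dispute as an example: creating the case, requesting documents from the merchant, provisional credit, customer notification, and chargeback submission are local transactions in different systems. If a later step finds that the conditions are not met after the provisional credit has succeeded, compensation may be to reverse the provisional credit and send a correction notice. If the customer notification has already been sent, the fact cannot be "recalled"; the only option is to send a correction and keep a record. The saga compensation table must therefore spell out, for each step, the local transaction, success event, failure trigger, compensation, approval requirement, and evidence. The agent may recommend a compensation path, but high-impact compensation must be subject to approval and audit controls.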
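The compensation ordering described in this answer can be sketched as a small saga runner. This is a simplified illustration: `SagaStep` and `run_saga` are hypothetical names, compensations run in reverse order of completion, and the approval gate that high-impact compensations require is deliberately left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # local transaction in one system
    compensation: Callable[[], None]  # business action that offsets it


def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    log = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"{step.name}:failed")
            # The failed step made no committed change, so only earlier steps are offset.
            for done in reversed(completed):
                done.compensation()
                log.append(f"{done.name}:compensated")
            return False, log
        completed.append(step)
        log.append(f"{step.name}:succeeded")
    return True, log
```

Note that the compensation for a sent notification is a new outbound message (a correction), not a deletion, which is why compensations are modeled as ordinary callables rather than database rollbacks.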

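13.5 Why is idempotency design the core control for agent tools?

30-second answer

Agents and workflows both retry, recover, and replay. Without idempotency control in the tool, a single timeout can turn into a duplicate refund, a duplicate notification, or a duplicate case note.

2-minute answer

I would design idempotency keys at the business-intent level for every write tool, rather than relying on a request id alone. For example, the key for a provisional credit should include dispute id, amount, action type, and intent version; the key for a customer notification should include case id, notification type, and approved template version. The tool gateway checks the side-effect ledger before executing and stores the receipt afterwards. Replay does not re-execute committed side effects, and a retry first determines whether the previous attempt already succeeded. Idempotency does not replace business validation, but it is the baseline control against duplicate side effects.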
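A business-intent key plus a ledger lookup can be sketched as below. The `idempotency_key` function and `ToolGateway` class are hypothetical, and the in-memory dictionary stands in for a durable ledger; a production ledger would need an atomic insert or uniqueness constraint so that two concurrent attempts cannot both execute.

```python
import hashlib
import json


def idempotency_key(*parts) -> str:
    """Derive a stable key from business-intent fields, not from a request id."""
    # JSON encoding keeps field boundaries unambiguous before hashing.
    canonical = json.dumps(list(parts))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToolGateway:
    def __init__(self):
        self.ledger = {}  # idempotency key -> receipt of the committed side effect

    def execute(self, key, tool):
        # Check the ledger first: a retry or replay gets the stored receipt back.
        if key in self.ledger:
            return self.ledger[key]
        receipt = tool()
        self.ledger[key] = receipt
        return receipt
```

Because the amount is part of the key, a changed amount is a different intent and produces a new key, while a retry of the same intent after a timeout maps to the existing receipt.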

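13.6 How do you keep the human approval state from becoming a rubber stamp?

30-second answer

Approval must be part of the state machine, with role, permission, evidence, SLA, allowed actions, decision reason, escalation, and audit record. It cannot be just an approve button.

2-minute answer

I would first distinguish approval, review, exception handling, and escalation. For example, AML narrative review requires the investigator to see the transaction graph, KYC profile, source freshness, AI draft, cited evidence, and model limitations. The UI must allow approve, edit, request more evidence, and escalate, and must require a reason code. The architecture needs an approval event, evidence binder version, reviewer role, segregation of duties, and an SLA timer. Only then is approval a real control that operations and audit can assess.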
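The approval event described in this answer can be validated before it is accepted into the workflow. The sketch below is an assumption-heavy illustration: `ApprovalEvent`, `record_approval`, and the specific checks (role match, mandatory reason code, reviewer different from the drafter) are hypothetical, and the SLA timer is out of scope.

```python
from dataclasses import dataclass

# Actions named in the answer above.
ALLOWED_ACTIONS = {"approve", "edit", "request_more_evidence", "escalate"}


@dataclass(frozen=True)
class ApprovalEvent:
    case_id: str
    reviewer_id: str
    reviewer_role: str
    action: str
    reason_code: str
    evidence_binder_version: str


def record_approval(event: ApprovalEvent, drafted_by: str, required_role: str) -> ApprovalEvent:
    """Accept the event only if it carries a valid action, reason, role, and reviewer."""
    if event.action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {event.action!r} is not allowed")
    if not event.reason_code:
        raise ValueError("a reason code is required")
    if event.reviewer_role != required_role:
        raise PermissionError("reviewer does not hold the required role")
    # Segregation of duties: whoever drafted the content cannot approve it.
    if event.reviewer_id == drafted_by:
        raise PermissionError("drafter and reviewer must be different actors")
    return event
```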

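13.7 How is the DLQ governed in a durable agent workflow?

30-second answer

The DLQ is a controlled recovery queue, not a pile of failed messages. Every DLQ item needs a failure class, customer impact, replay permission, repair decision, and owner.

2-minute answer

I would divide the DLQ into categories such as transient, schema, policy, business conflict, determinism, and unknown. When a KYC document event is missing schema fields, it must not be replayed automatically into the same consumer only to fail again; it should go to DLQ triage to decide between a producer fix, transform and replay, manual complete, or discard. Replay in high-risk scenarios requires approval and must record payload ref, workflow id, failure class, repair decision, approver, and recovery result. The metrics to watch are DLQ aging, customer impact, and repeated poison patterns.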

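13.8 How do you explain the business value of durable agent workflow to management?

30-second answer

It is not engineering for show. It lets AI automation enter high-value long-running processes: fewer dropped cases, fewer duplicate side effects, faster approvals, stronger audit, and more controlled incident recovery.

2-minute answer

I would explain it in business risk terms. Payment dispute, KYC, AML, credit, and complaint handling all involve waiting, approval, evidence, regulatory deadlines, and external side effects. Without a durable workflow, the agent tends to stay at the assistant layer and cannot safely approach core processes. With state machines, checkpoints, idempotency, saga, HITL, and audit evidence, the platform can move AI from "generating text" to "participating in controlled business processes". Metrics include cycle time, approval SLA, manual rework, duplicate side-effect prevented, DLQ aging, compensation rate, and audit finding reduction.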

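13.9 What do you do if a new workflow version breaks replay of old instances?

30-second answer

First block the release or stop the impact from spreading, use replay tests to locate the incompatible events, repair through a workflow version branch or a migration workflow, and retain the change evidence.

2-minute answer

I would treat it as both a release gate and a production reliability issue. Step one: confirm whether it was caught only in testing or whether production instances are already affected. Step two: find the non-determinism caused by changes to state names, activity order, branch conditions, or schema. Step three: add a compatibility branch for the old version, or design a migration workflow for the affected instances. Step four: rerun the replay tests and update the architecture decision record and the change approval. A high-risk workflow must not let old instances run incompatible new logic directly.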

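13.10 Can an agent execute compensating actions automatically?

30-second answer

Low-risk, fully reversible, pre-authorized compensation can be automated; compensation involving funds, customer rights, account restrictions, or regulatory materials requires approval and evidence.

2-minute answer

A compensating action is itself a side effect, and controls cannot be relaxed just because it is called "recovery". For example, retracting an erroneous case note can be automated by appending a correction record, but reversing a provisional credit, sending a corrected customer notification, or resubmitting regulatory materials all require approval. In design, I would define a risk class, approval policy, idempotency key, customer impact, evidence, and post-compensation verification for every compensating action. The agent can identify failure patterns, recommend a compensation path, and prepare the evidence binder, but it cannot bypass policy or HITL.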

14. Reference Links

Temporal Workflows: https://docs.temporal.io/workflows
Temporal Workflow Execution and Replay: https://docs.temporal.io/workflow-execution
Temporal Event History: https://docs.temporal.io/encyclopedia/event-history
AWS Step Functions Documentation: https://docs.aws.amazon.com/step-functions/
AWS Step Functions State Machines: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-statemachines.html
AWS Step Functions Error Handling: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
AWS Step Functions Task State Timeout / Retry / Catch: https://docs.aws.amazon.com/step-functions/latest/dg/state-task.html
Amazon States Language: https://states-language.net/
Workflow Patterns: https://www.workflowpatterns.com/
Workflow Patterns Control-Flow Patterns: https://www.workflowpatterns.com/patterns/control/
Sagas paper, Hector Garcia-Molina and Kenneth Salem: https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/
NIST AI RMF Playbook: https://airc.nist.gov/airmf-resources/playbook/