AI Runtime Evidence / Observability Architecture Playbook

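Written for CBAP+ financial-retail AI PMs, AI Product Architects, Solutions Architects, Enterprise Architects, and AI Governance, Model Risk, internal audit, and platform teams. This playbook is not about ordinary logging in the "log everything" sense. It is about designing the prompt, context, retrieval, tool calls, human approvals, model config, cost, quality, policy decisions, and incident evidence of a production AI system into a body of runtime evidence that is observable, auditable, accountable, and learnable. Readers should be able to use this playbook to design an AI runtime evidence architecture that answers the questions shared by business owners, regulators, model risk, internal audit, security teams, and engineering teams: why did this AI behavior occur, what evidence was it based on, who approved it, which versions were used, did it cross a boundary, how much did it cost, was the quality acceptable, can the incident be reconstructed, and was the improvement loop closed?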

1. Source Anchors

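The official sources below are the architectural anchors for this playbook. Rather than mechanically stitching them into one all-encompassing table, the playbook translates them into a runtime evidence design language for financial-retail AI production systems. Access date: 2026-06-30.

Anchor	Official link	Idea adopted here	Application to AI runtime evidence
OpenTelemetry	https://opentelemetry.io/docs/	Design production telemetry with vendor-neutral traces, metrics, logs, context propagation, and semantic attributes.	AI request trace, prompt span, retrieval span, tool span, judge span, approval span, cost metric, quality metric.
OpenLineage	https://openlineage.io/docs/	Record the lineage of data and processing at run time using jobs, runs, datasets, and extensible facets.	RAG knowledge base, index version, retrieved chunk, embedding job, rerank run, eval dataset, evidence lake lineage.
W3C PROV	https://www.w3.org/TR/prov-overview/	Use the Entity, Activity, Agent model to describe who generated a piece of evidence, from what, and through which activity.	Prompt/config/output are Entities; model call/retrieval/tool call are Activities; user/service/reviewer/policy engine are Agents.
CloudEvents	https://cloudevents.io/	Use a consistent event envelope so that events produced by different systems can be routed, filtered, and consumed.	Evidence events such as `ai.prompt.rendered`, `ai.tool.invoked`, `ai.approval.decided`, and `ai.incident.signal.detected`.
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Organize AI risk context, assessment, controls, and continuous improvement around Govern, Map, Measure, Manage.	Connect runtime evidence to risk classification, release gates, quality measurement, incident management, and governance reporting.
NIST CSF	https://www.nist.gov/cyberframework	Organize cybersecurity risk management around Govern, Identify, Protect, Detect, Respond, Recover.	Bring the AI evidence pipeline under identity and access management, data protection, detection and alerting, response and post-incident review, and recovery capability.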

1.1 Standards-to-Architecture Translation

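Standard	Architectural translation	Senior-level framing
OpenTelemetry	Every AI request should have a trace, every key AI action should have a span, and trace context should link them across services.	"I don't just look at the final answer. I can see why the model was routed, what retrieval returned, which tool was called, and which policy allowed or blocked it."
OpenLineage	RAG, embedding, indexing, eval, feature, and policy data all need lineage; the knowledge base is not treated as a black box.	"I can prove which knowledge source, which version, which index build job, and which permission filter rule a citation came from."
W3C PROV	Runtime evidence must express Entity, Activity, and Agent relationships to support trustworthiness judgments.	"Evidence is trustworthy not because it was stored, but because we can show how it was generated, who was involved, which entities it used, and which decision it supported."
CloudEvents	Publish key events to the evidence bus in a uniform envelope, so that systems do not each define their own ungovernable logs.	"AI evidence is an event product with contracts, versions, schemas, routing, retention, and consumers."
NIST AI RMF	Runtime evidence supports the full AI risk management lifecycle, not just SRE troubleshooting.	"Every monitoring signal can be traced back to an AI risk, a control objective, an eval contract, a release gate, and an improvement action."
NIST CSF	Runtime evidence is itself a security asset that must be governed, identified, and protected, with detection, response, and recovery built around it.	"The observability system itself must not become a source of PII leakage, unauthorized evidence access, or missing incident evidence."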

2. Core Mental Model: Runtime Evidence = Black Box + Audit Trail + Learning Loop

The core mental model for AI runtime evidence:

Runtime evidence is the product's black box + audit trail + learning loop.

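Expanded:

Black Box: Like an aircraft flight recorder, it records the key state at the moment an AI behavior occurs: input summary, model configuration, retrieval evidence, tool actions, policy decisions, human approvals, output, and outcome. Its purpose is not to "spy on user content" but to explain behavior, locate problems, and reconstruct incidents in high-risk business settings.
Audit Trail: A queryable evidence chain that runs from business request, user entitlement, prompt version, model version, knowledge base version, policy version, tool schema, and approval decision through to the final output. The audit trail must support internal audit, model risk, compliance, regulatory inquiries, customer complaints, and incident reviews.
Learning Loop: Turn production failures, human rewrites, user feedback, judge scores, incident signals, cost anomalies, and model drift into eval cases, prompt improvements, knowledge base updates, policy-as-code fixes, tool permission adjustments, and process redesign. Observability without a learning loop is just an expensive dashboard.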

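2.1 Why Ordinary Logs Are Not Enough

Ordinary service logs usually answer:

Did this API return an error?
How long did the request take?
Which dependency timed out?
Are system resources behaving abnormally?

AI runtime evidence must also answer:

Which template and version rendered the prompt?
Were user context, session memory, and case facts selected correctly?
Which chunks did RAG retrieve, and why were those chunks permitted for use?
Is every material claim in the output supported by a citation?
Did each tool call go through a policy decision, dry-run, approval, idempotency check, and side-effect recording?
What did the human approver see, approve, and change?
What were the model parameters, routing policy, fallback policy, and vendor version?
Did the judge or human review find quality, safety, or compliance problems?
What is the cost per request, per case, per business line, and per model route?
After an incident, can the key timeline be reconstructed within permission boundaries?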

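2.2 Three Design Principles

Principle	Meaning	Financial-retail application
Evidence by design	Define evidence objects, fields, sampling, redaction, retention, and queries at the architecture stage; do not rely on assembling material after an incident.	High-impact use cases such as AML, payment disputes, lending, and customer service must pass an evidence readiness gate before launch.
Minimum sufficient content	Store the minimum sufficient evidence needed to prove behavior and decisions; avoid over-retaining raw text and spreading PII.	For prompt/input/output, use summaries, hashes, classification labels, controlled excerpts, and pointers that allow authorized replay.
Traceability over screenshots	Support audits with queryable traces, events, provenance graphs, and evidence packs rather than screenshots and manual explanations.	Audit inquiries retrieve the evidence pack by querying on `trace_id`, `case_id`, `policy_decision_id`, and `approval_id`.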
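
A minimal sketch of the "hash plus authorized replay pointer" pattern behind the Minimum sufficient content principle, using only the Python standard library. The function names, the `replay_pointer` field, and the `vault://` URI are hypothetical illustrations and do not come from any of the referenced standards.

```python
import hashlib
import json


def content_hash(text: str) -> str:
    """Return a 'sha256:<hex>' digest so evidence can reference content without storing it."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def minimal_prompt_evidence(rendered_prompt: str, template_id: str,
                            version: str, replay_pointer: str) -> dict:
    """Build an evidence record that holds a hash and a pointer, never the raw prompt."""
    return {
        "prompt_template_id": template_id,
        "prompt_version": version,
        "render_hash": content_hash(rendered_prompt),
        "char_count": len(rendered_prompt),
        # Hypothetical pointer: resolved only by an authorized replay service.
        "replay_pointer": replay_pointer,
    }


record = minimal_prompt_evidence(
    "Summarize case 123 for Jane Doe", "aml_summary", "v12", "vault://prompts/abc"
)
print(json.dumps(record, indent=2))
```

One caveat on the design choice: a plain hash of short or predictable input can be guessed by brute force, so a keyed hash (for example `hmac` with a managed secret) may be a better fit where the hashed value is itself sensitive.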

3. Evidence Objects Taxonomy

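A mature AI runtime evidence model covers at least 10 classes of objects. For each class, state explicitly what is recorded, what is not recorded, who may view it, how long it is retained, and which decisions it supports.

Evidence object	Core question	Minimum fields	Risk controls
Prompt / Config Evidence	What instructions and configuration did the model see?	`prompt_template_id`, `prompt_version`, `render_policy_version`, `model_alias`, `model_version`, `temperature`, `max_tokens`, `route_reason`, `config_hash`	Store raw text according to risk tier; redact sensitive fields; use the config hash for reproduction; enforce versioning in the prompt registry.
Context Evidence	Which business context entered the AI?	`context_type`, `source_system`, `case_id`, `customer_segment`, `entitlement_result`, `context_snapshot_hash`, `redaction_profile`	Store the context source and a summary, not the full customer profile by default; apply permission filtering before the model call.
Retrieval Evidence	What evidence did RAG find?	`kb_id`, `kb_version`, `index_version`, `query_hash`, `retrieved_chunk_ids`, `reranked_chunk_ids`, `citation_chunk_ids`, `freshness_days`, `permission_filter_result`	Record the difference between what was retrieved and what was finally cited; keep chunk pointers and versions; enforce citation support for high-risk claims.
Tool Call Evidence	What external action did the AI request or execute?	`tool_name`, `tool_schema_version`, `tool_risk_tier`, `input_hash`, `dry_run_result`, `policy_decision_id`, `approval_id`, `side_effect_id`, `idempotency_key`	Dry-run write operations first; require human approval for high-risk actions; require a compensating control for irreversible actions.
Approval Evidence	Which person or system approved what?	`approval_id`, `approver_role`, `approver_id_hash`, `decision`, `decision_reason_code`, `visible_evidence_set`, `timestamp`, `expiry`	The evidence shown in the approval UI must also be traceable; an approval record must capture more than "approved".
Policy Decision Evidence	Which policy allowed, blocked, or escalated?	`policy_id`, `policy_version`, `decision`, `decision_reason`, `risk_tier`, `input_attributes_hash`, `obligations`, `policy_engine_version`	Version policy-as-code; make every block/allow/escalate explainable; route policy changes through a release gate.
Output Evidence	What did the AI output, and is it verifiable?	`output_id`, `output_type`, `output_hash`, `claim_count`, `citation_support_rate`, `safety_label`, `final_status`, `delivery_channel`	Customer-visible output gets a higher retention tier than internal drafts; link material claims to citations.
User Feedback Evidence	How did the user accept, modify, or reject the output?	`feedback_type`, `accepted`, `edited`, `edit_diff_hash`, `thumbs_signal`, `reason_code`, `workflow_completion`	User feedback feeds the learning loop but cannot be treated directly as quality ground truth; significant edits enter the eval queue.
Cost / Latency Evidence	Are cost and speed within bounds?	`input_tokens`, `output_tokens`, `reasoning_tokens`, `embedding_cost`, `rerank_cost`, `tool_cost`, `human_review_cost`, `ttft_ms`, `total_latency_ms`	Attribute cost by use case, risk tier, model route, and business unit; exceeding budget triggers a route review.
Incident Signal Evidence	Which signals indicate that something may have gone wrong?	`signal_id`, `signal_type`, `severity`, `linked_trace_ids`, `detector_version`, `trigger_threshold`, `triage_status`, `incident_id`	Every alert must be linkable to a trace; severe events activate an evidence preservation hold.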

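3.1 Evidence Objects and Owners

Object	Product owner focus	Architect focus	Risk / Audit focus
Prompt / Config	Whether the requirement intent is expressed correctly and versions can be rolled back.	Registry, deployment, routing, cache, fallback.	Change approval, usage boundary, evidence retention.
Context	Whether the minimum context the business needs was obtained.	Context service, PII redaction, entitlement.	Minimization, authorization, data usage boundary.
Retrieval	Whether the answer has a real basis.	Vector index, rerank, chunking, lineage.	Citation support, knowledge source approval, stale content control.
Tool Call	Whether the AI can actually complete the workflow.	Tool gateway, policy gate, idempotency, side effects.	Unauthorized actions, approval, non-repudiation.
Approval	Whether human-AI collaboration is effective.	Workflow state, audit record, review UI evidence set.	Who approved, on what basis, and whether independently.
Policy Decision	Whether risk boundaries are enforced.	PDP/PEP, policy version, obligations.	Whether allow/block/escalate is explainable.
Output	Whether user value was delivered.	Output store, hash, delivery state.	Customer impact, complaint evidence, compliant language.
Feedback	Whether an improvement loop is formed.	Feedback pipeline, eval queue.	Human override rate, systematic bias.
Cost / Latency	Whether unit economics hold.	Token ledger, metrics pipeline, SLO.	Runaway cost, vendor dependency.
Incident Signal	Whether risk is detected in time.	Detector, alert routing, forensic preservation.	Incident response, remediation, post-incident review.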

4. Reference Architecture

An AI runtime evidence architecture is not a single log store but a set of cooperating capabilities: trace/span model, event envelope, provenance graph, evidence lake, dashboards, retention/deletion boundary, and access control.

4.1 Logical Architecture

+--------------------+       +-----------------------+       +---------------------+
| AI Experience      |       | AI Runtime Services   |       | Business Systems    |
| copilot / agent /  |-----> | gateway / orchestrator|-----> | CRM / AML / lending |
| workflow UI        |       | RAG / tools / policy  |       | payment / case mgmt |
+---------+----------+       +-----------+-----------+       +----------+----------+
          |                              |                              |
          | trace context                | spans + metrics + events      |
          v                              v                              v
+--------------------------------------------------------------------------------+
| Evidence Collection Layer                                                      |
| OpenTelemetry SDK/Collector + evidence event producer + redaction processor     |
+-------------------------+--------------------------+---------------------------+
                          |                          |
                          v                          v
        +-------------------------+       +-------------------------------+
        | Trace / Metrics Store   |       | Evidence Event Bus             |
        | spans, exemplars, SLO   |       | CloudEvents-style contracts    |
        +-----------+-------------+       +----------------+--------------+
                    |                                      |
                    v                                      v
        +-------------------------+       +-------------------------------+
        | Evidence Lake           |<----->| Provenance / Lineage Graph     |
        | immutable raw zone,     |       | PROV + OpenLineage-inspired    |
        | curated zone, packs     |       | entities, activities, agents   |
        +-----------+-------------+       +----------------+--------------+
                    |                                      |
                    v                                      v
        +-------------------------+       +-------------------------------+
        | Dashboards / Alerts     |       | Audit Query / Evidence Pack    |
        | quality, safety, cost,  |       | incident, release, regulator   |
        | latency, drift, KRI     |       | request, control testing       |
        +-------------------------+       +-------------------------------+

4.2 Trace / Span Model

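A trace is the full path of one business request or AI workflow. A span is a key activity within that path. For high-risk financial-retail AI, cut spans at a granularity that can answer audit questions, rather than recording a single POST /chat.

Span name	Parent	What it records	Failure example
`ai.request`	root	Use case, risk tier, user role, case id, workflow step, release version.	The request has no risk tier, so the retention policy cannot be determined.
`ai.context.load`	`ai.request`	Context source, entitlement, redaction, snapshot hash.	Customer PII enters the prompt unredacted.
`ai.prompt.render`	`ai.request`	Prompt template/version, variable summary, render policy.	A prompt hotfix was not versioned.
`ai.retrieval.query`	`ai.request`	Query rewrite, KB/index, filters, retrieved chunks.	An outdated policy document is retrieved.
`ai.model.invoke`	`ai.request`	Provider, model, params, tokens, latency, cost, finish reason.	Fallback to an unapproved model.
`ai.policy.evaluate`	`ai.request` or `ai.tool.invoke`	Policy id/version, decision, obligations, reason.	A high-risk action was not escalated for approval.
`ai.tool.invoke`	`ai.request`	Tool schema, input hash, dry-run, side effect, idempotency.	The tool call succeeded but its parameters cannot be proven.
`ai.approval.review`	`ai.request`	Reviewer, decision, visible evidence, edit diff, latency.	The approval recorded only yes/no.
`ai.output.finalize`	`ai.request`	Output hash, citations, safety label, delivery state.	The version the customer received does not match the internal record.
`ai.feedback.capture`	`ai.request`	Accept/edit/reject/escalate, reason, workflow outcome.	An agent edited heavily but the case did not enter the quality queue.
`ai.judge.evaluate`	`ai.request`	Rubric, judge version, score, critical failure.	Judge prompt drift makes scores incomparable.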
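
The span model above lends itself to an automated coverage check. The following is a minimal sketch in plain Python rather than an OpenTelemetry API; the `Span` dataclass, the `REQUIRED_HIGH_RISK` set, and both helper functions are assumptions chosen for illustration, and the required set for a real use case would come from its own evidence readiness gate.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Span:
    span_id: str
    name: str
    parent_id: Optional[str]  # None marks the root span


# Assumption: the spans a high-risk trace must contain to be auditable.
REQUIRED_HIGH_RISK: Set[str] = {
    "ai.request", "ai.context.load", "ai.prompt.render",
    "ai.model.invoke", "ai.policy.evaluate", "ai.output.finalize",
}


def missing_spans(spans: Iterable[Span], required: Set[str]) -> List[str]:
    """Return required span names that never appear in the trace."""
    present = {s.name for s in spans}
    return sorted(required - present)


def orphan_spans(spans: Iterable[Span]) -> List[str]:
    """Return names of spans whose parent id is not in the trace (broken context)."""
    spans = list(spans)
    ids = {s.span_id for s in spans}
    return [s.name for s in spans if s.parent_id is not None and s.parent_id not in ids]


trace = [
    Span("s1", "ai.request", None),
    Span("s2", "ai.model.invoke", "s1"),
    Span("s3", "ai.tool.invoke", "lost-parent"),
]
print(missing_spans(trace, REQUIRED_HIGH_RISK))
print(orphan_spans(trace))
```

A check of this kind could run both in a release gate against synthetic traffic and continuously against sampled production traces.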

4.3 Span Attribute Naming

Use stable naming so that each team does not invent its own fields.

Attribute group	Example fields
Identity	`ai.trace_id`, `ai.request_id`, `ai.workflow_id`, `ai.case_id_hash`, `ai.tenant_id`, `ai.use_case_id`
Risk	`ai.risk_tier`, `ai.customer_impact`, `ai.regulatory_scope`, `ai.decision_boundary`
Version	`ai.release_version`, `ai.prompt.version`, `ai.model.version`, `ai.kb.version`, `ai.policy.version`, `ai.tool.schema_version`
Prompt	`ai.prompt.template_id`, `ai.prompt.render_hash`, `ai.prompt.variable_profile`, `ai.prompt.redaction_profile`
Retrieval	`ai.retrieval.kb_id`, `ai.retrieval.index_version`, `ai.retrieval.top_k`, `ai.retrieval.citation_chunk_ids`, `ai.retrieval.support_score`
Model	`ai.model.provider`, `ai.model.name`, `ai.model.route_reason`, `ai.model.temperature`, `ai.model.finish_reason`
Tool	`ai.tool.name`, `ai.tool.risk_tier`, `ai.tool.input_hash`, `ai.tool.side_effect_id`, `ai.tool.idempotency_key`
Approval	`ai.approval.id`, `ai.approval.decision`, `ai.approval.reason_code`, `ai.approval.visible_evidence_hash`
Policy	`ai.policy.id`, `ai.policy.decision`, `ai.policy.reason_code`, `ai.policy.obligations`
Cost	`ai.cost.total_usd`, `ai.tokens.input`, `ai.tokens.output`, `ai.tokens.reasoning`, `ai.cost.human_review_usd`
Quality	`ai.quality.score`, `ai.quality.rubric_id`, `ai.citation.support_rate`, `ai.safety.label`

4.4 Event Envelope

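Traces are suited to viewing timelines; events are suited to driving evidence pipelines, alerts, audit packs, and the learning loop. Use a CloudEvents-style envelope: a uniform outer layer with domain data inside.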

{
  "specversion": "1.0",
  "id": "evt_01JZ_RUNTIME_EVIDENCE_0001",
  "source": "ai-runtime/aml-copilot/prod",
  "type": "ai.tool.invoked",
  "subject": "trace/trc_8f3d/tool/case_lookup",
  "time": "2026-06-30T14:22:18.428Z",
  "datacontenttype": "application/json",
  "dataschema": "https://internal.example/schemas/ai-tool-invoked/v1",
  "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
  "data": {
    "use_case_id": "aml_copilot",
    "risk_tier": "high",
    "tool_name": "aml_case_lookup",
    "tool_schema_version": "2026-06-15",
    "tool_risk_tier": "read_sensitive",
    "input_hash": "sha256:9b17...",
    "policy_decision_id": "poldec_7731",
    "policy_decision": "allowed",
    "side_effect": "none",
    "latency_ms": 187,
    "evidence_class": "restricted",
    "retention_policy_id": "ret_ai_high_7y"
  }
}
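
An envelope like the one above could be produced and checked with a small helper. This is a hedged sketch using only the standard library, not the CloudEvents SDK: `make_evidence_event` and `validate_event` are hypothetical names, and the rule that high-risk events must carry `traceparent` is this playbook's own quality rule rather than part of the CloudEvents specification. The four attributes in `REQUIRED_CONTEXT_ATTRS` are the ones CloudEvents 1.0 marks as required.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional

# Context attributes that CloudEvents 1.0 defines as REQUIRED.
REQUIRED_CONTEXT_ATTRS = ("specversion", "id", "source", "type")


def make_evidence_event(event_type: str, source: str, subject: str,
                        data: dict, traceparent: Optional[str] = None) -> dict:
    """Wrap domain data in a CloudEvents-style envelope with a UTC timestamp."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    event = {
        "specversion": "1.0",
        "id": f"evt_{uuid.uuid4().hex}",
        "source": source,
        "type": event_type,
        "subject": subject,
        "time": now.replace("+00:00", "Z"),
        "datacontenttype": "application/json",
        "data": data,
    }
    if traceparent:
        event["traceparent"] = traceparent
    return event


def validate_event(event: dict) -> List[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = [f"missing {attr}" for attr in REQUIRED_CONTEXT_ATTRS if not event.get(attr)]
    # Playbook rule (assumption, not CloudEvents): high-risk events need trace linkage.
    if event.get("data", {}).get("risk_tier") == "high" and "traceparent" not in event:
        errors.append("high-risk event without traceparent")
    return errors


ok_event = make_evidence_event(
    "ai.tool.invoked", "ai-runtime/aml-copilot/prod", "trace/trc_8f3d/tool/case_lookup",
    {"risk_tier": "high", "tool_name": "aml_case_lookup"},
    traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
print(validate_event(ok_event))
```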

4.5 Provenance Graph

Traces answer the "timeline" question; the provenance graph answers the "evidence relationship" question. Use the W3C PROV Entity / Activity / Agent model, and borrow OpenLineage's run / job / dataset / facet concepts for data processing lineage.

PROV object	AI runtime mapping	Example
Entity	prompt template, rendered prompt hash, KB chunk, model config, output, approval record, incident pack	`prompt_template:aml_summary:v12`, `output_hash:sha256:72ab...`
Activity	context load, retrieval query, model invocation, tool invocation, policy evaluation, human review, judge evaluation	`activity:ai.model.invoke:span_991`
Agent	end user, AI service account, policy engine, model provider, reviewer, compliance approver	`agent:role:aml_analyst`, `agent:service:policy-engine`

Minimum graph edges:

Edge	Meaning
`wasGeneratedBy(output, model_invocation)`	Which model invocation generated the output.
`used(model_invocation, rendered_prompt)`	Which prompt the model invocation used.
`used(prompt_render, context_snapshot)`	Which context summary the prompt rendering activity used (in PROV, `used` relates an Activity to an Entity).
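`used(model_invocation, retrieved_chunks)`	Which retrieval evidence the output depends on.
`wasAssociatedWith(tool_invocation, service_account)`	Which service identity executed the tool call.
`wasInformedBy(tool_invocation, policy_evaluation)`	Which policy evaluation influenced the tool call.
`wasGeneratedBy(approval_record, human_review)`	Which human review generated the approval record.
`wasDerivedFrom(eval_case, incident_signal)`	Which production failure a new eval case came from.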
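
To show how such edges answer an audit question, here is a minimal sketch that stores PROV-style relations as triples and walks upstream from an output. The node identifiers, the `EDGES` list, and the `upstream` function are illustrative assumptions; a production system would more likely use a graph store.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (relation, subject, object), e.g. used(activity, entity)

# Hypothetical edges for one trace, following PROV argument order.
EDGES = [
    ("wasGeneratedBy", "entity:output", "activity:model_invoke"),
    ("used", "activity:model_invoke", "entity:rendered_prompt"),
    ("used", "activity:model_invoke", "entity:chunk:kb7#c12"),
    ("wasGeneratedBy", "entity:rendered_prompt", "activity:prompt_render"),
    ("used", "activity:prompt_render", "entity:context_snapshot"),
    ("wasAssociatedWith", "activity:model_invoke", "agent:service:ai-gateway"),
]


def upstream(node: str, edges: Iterable[Edge]) -> Set[str]:
    """Collect every node reachable from `node` by following edges subject -> object."""
    graph = defaultdict(list)
    for _relation, subject, obj in edges:
        graph[subject].append(obj)
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for nxt in graph[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


lineage = upstream("entity:output", EDGES)
print(sorted(lineage))
```

Starting from `entity:output`, the walk reaches the model invocation, the rendered prompt, the retrieved chunk, the context snapshot, and the responsible service agent, which is the evidence set an audit query such as "what was this output based on" would need.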

4.6 Evidence Lake

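An evidence lake is not an arbitrary set of directories in an ordinary data lake; it is a controlled evidence product.

Zone	Contents	Controls
Raw immutable zone	Raw trace exports, events, metrics snapshots, policy decision events, approval records.	Append-only, WORM or object lock, hash chain, strict access control.
Redacted curated zone	Redacted analyzable data, aggregated metrics, quality scores, cost ledger.	Field-level permissions, data classification, proof of redaction, re-identification risk control.
Evidence pack zone	Evidence packs generated per incident, release, audit query, or control test.	Versioning, approval, legal hold, retention policy, export audit.
Learning loop zone	Production failure samples, human-edited samples, regression cases, prompt improvement candidates.	Data usage approval, sampling strategy, label quality, privacy boundary.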
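
The hash chain control listed for the raw immutable zone can be illustrated with a short sketch. It assumes an in-memory list as the store and a fixed all-zero genesis value; `append_record`, `verify_chain`, and the entry layout are hypothetical, and a real deployment would pair this with WORM storage or object lock rather than rely on it alone.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # assumption: fixed starting value for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: List[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "entry_hash": _digest(prev_hash, payload),
    })


def verify_chain(chain: List[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if _digest(prev_hash, entry["payload"]) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


chain: List[dict] = []
append_record(chain, {"type": "ai.policy.decided", "decision": "allowed"})
append_record(chain, {"type": "ai.tool.invoked", "tool_name": "aml_case_lookup"})
print(verify_chain(chain))
```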

4.7 Dashboards

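Dashboards should be role-specific; do not pile every metric onto every audience.

Dashboard	Users	Must answer
Executive AI Risk Dashboard	Management, AI governance committee	Are high-risk use cases running within bounds? What are the trends in incidents, exceptions, cost, and customer impact?
Product Quality Dashboard	AI PM, business owner	Which workflows show declining quality, low user adoption, heavy human rewriting, or no improvement in business outcomes?
Architect Runtime Dashboard	Architects, platform team	Which spans are slow, and where are fallbacks frequent, tool errors high, traces missing, or lineage broken?
Model / EvalOps Dashboard	Model team, EvalOps	Which eval slices are failing in production, and where is there judge drift, declining retrieval support, or vendor drift?
Compliance / Audit Dashboard	Compliance, internal audit, model risk	What are the approval coverage rate, policy blocks, overrides, evidence freshness, and retention exceptions?
FinOps Dashboard	PM, platform, finance	What are the cost per task, cost per case, cache hit rate, model routing cost, and budget consumption?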

4.8 Retention / Deletion Boundary

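AI evidence needs to be retained, but not on a "keep everything forever" basis. Financial retail must design boundaries that balance audit, regulation, privacy, customer rights, security, and cost.

Data class	Examples	Suggested retention	Deletion / restriction
Telemetry metadata	trace id, span type, latency, cost, version, decision reason code	Medium to long term, for trends and audit.	Contains no raw PII; deletion requirements are relatively low.
Sensitive content pointer	case id hash, document id, chunk id, output hash	Aligned with the retention policy of the system of record.	After the source document is deleted, keep the irreversible hash and the audit pointer.
Full prompt / output content	High-risk incidents, formal customer-visible output, content related to regulatory inquiries	Retain only where there is a clear business and compliance basis.	Redacted or tier-encrypted by default; a customer deletion request triggers an assessment.
Approval record	Approver, decision, visible evidence set, timestamp	Retain per business record and model risk requirements.	Approver identity may be pseudonymized, but the accountability chain must be resolvable under authorization.
Incident evidence pack	Incident timeline, key traces, affected customers, remediation actions	Legal hold or incident retention policy.	After closure, archive per the retention matrix; ad hoc cleanup is not allowed.
Learning samples	Failure cases, human edits, feedback, labels	Keep only the minimum sufficient samples.	Require data usage approval and redaction before use in training or eval.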

4.9 Access Control

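The evidence architecture must have access control built in. Otherwise observability becomes a new platform for spreading sensitive data.

Role	May access	Should not access by default
AI PM	Aggregated quality, cost, adoption, case-level summaries, redacted failure samples.	Full customer PII, full raw prompt text, approvers' real identities.
Architect / SRE	Traces, spans, errors, latency, dependencies, config hashes, tool schemas.	Customer business data in plaintext, personally identifiable information.
Compliance / Model Risk	Evidence packs for high-risk use cases, policy decisions, approvals, incident evidence, sampled content.	Bulk raw-text exports unrelated to the review.
Internal Audit	Control test samples, evidence integrity, approval records, retention proof, access logs.	Write access to production systems, the full set of customer content without approval.
Security	Access logs, anomalous access, secrets, policy violations, forensic packs.	Bulk downloads of business plaintext unrelated to a security event.
Data Subject Request Team	Content pointers and deletion status related to a customer rights request.	Model debugging data unrelated to the request.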

Control baseline:

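Attribute-based access control: authorize by role, purpose, case assignment, risk tier, jurisdiction, and data class.
Break-glass access: incident response may temporarily widen permissions, but this must be logged, approved, and reviewed afterward.
Field-level encryption: encrypt prompts, outputs, and content pointers by tier.
Export controls: evidence exports require a reason code, approval, watermark, and download log.
Query audit: every evidence query is itself an evidence event.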
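
The baseline above can be sketched as an attribute-based decision function. Everything in this example is hypothetical: the `ALLOWED` policy table, the role and purpose names, and the decision labels are placeholders, and a real deployment would typically delegate the decision to a policy engine and emit each result as an `ai.evidence.query.executed` event.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AccessRequest:
    role: str
    purpose: str
    data_class: str
    assigned_to_case: bool
    break_glass_approved: bool = False


# Hypothetical policy table: (role, purpose) -> data classes that may be read.
ALLOWED = {
    ("ai_pm", "quality_review"): {"aggregate", "redacted_sample"},
    ("sre", "troubleshooting"): {"telemetry_metadata"},
    ("model_risk", "control_testing"): {"aggregate", "redacted_sample", "evidence_pack"},
}


def decide(req: AccessRequest) -> Tuple[str, str]:
    """Return (decision, reason_code) for an evidence access request."""
    allowed = ALLOWED.get((req.role, req.purpose), set())
    if req.data_class in allowed:
        # Case-bound evidence also requires assignment to the case.
        if req.data_class == "evidence_pack" and not req.assigned_to_case:
            return ("deny", "not_assigned_to_case")
        return ("allow", "policy_match")
    if req.break_glass_approved:
        # Break-glass widens access but always demands a follow-up review.
        return ("allow_break_glass", "requires_post_review")
    return ("deny", "no_matching_policy")


print(decide(AccessRequest("sre", "troubleshooting", "telemetry_metadata", False)))
print(decide(AccessRequest("ai_pm", "quality_review", "raw_prompt", False)))
```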

5. Financial Retail Runtime Cases

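The four cases below show what "observable AI behavior" looks like in practice in financial retail. The point is not technical showmanship but whether business behavior, risk controls, and audit evidence can be proven complete.

5.1 AML Copilot Trace

Scenario: An anti-money-laundering analyst opens a high-risk alert, and the AI copilot summarizes transaction patterns, the customer profile, historical case notes, watchlist screening, suspicious indicators, and suggested next investigative questions. The AI is not allowed to file a SAR/STR directly; it may only assist the analysis.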

Trace timeline

Step	Span / Event	Evidence captured	Control point
1	`ai.request`	Use case `aml_copilot`, risk tier high, analyst role, case id hash.	Only the analyst assigned to the case can trigger it.
2	`ai.context.load`	Pointers to case facts, transaction summary, KYC summary, and historical notes, plus the redaction profile.	PII is minimized; sensitive fields are shown according to analyst entitlement.
3	`ai.retrieval.query`	AML policy KB version, typology KB version, retrieved chunk ids, freshness.	Only the approved AML procedure and typology library are allowed.
4	`ai.model.invoke`	Prompt version, model route, temperature, tokens, cost.	High-risk scenarios enforce low randomness and an approved model route.
5	`ai.output.finalize`	Suspicious pattern claims, citation support, unsupported claim count.	Every material claim must have a case-fact or policy citation.
6	`ai.feedback.capture`	Analyst accepted / edited / escalated, edit reason.	Heavy edits enter the QA sample and eval queue.
7	`ai.judge.evaluate`	Missing red flag, overstatement, policy compliance score.	A critical failure triggers a quality incident.

What audit can ask

Audit question	Evidence query
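Did this AI summary use the AML policy in effect at the time?	Use `trace_id` to look up `ai.retrieval.kb_version` and the chunk effective date.
Did the AI state speculation as fact?	Check the output claim classifier, citation support, and analyst edit reason.
Did anyone use the AI result directly as a SAR conclusion?	Check the downstream workflow: whether the AI output entered a formal filing field and whether there was human approval.
How is quality degradation detected?	Check the judge score trend, analyst override rate, and QA critical failures.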

5.2 Payment Dispute Agent Trace

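Scenario: A payment dispute agent helps customer service or operations staff summarize the transaction, merchant, dispute rules, and evidence materials, and drafts a reply to the customer or the card network. The agent may create drafts but must not automatically initiate refunds, chargebacks, or customer notifications unless low-risk automation rules are met.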

Trace timeline

Step	Span / Event	Evidence captured	Control point
1	`ai.request`	Dispute case id hash, card product, jurisdiction, channel.	Jurisdiction determines the applicable rules and retention policy.
2	`ai.tool.invoke`	Transaction lookup, merchant lookup, chargeback status; all read-only.	The tool gateway limits the read scope.
3	`ai.retrieval.query`	Card network rule KB, fee policy, customer communication template.	Rule version and effective date are captured as evidence.
4	`ai.policy.evaluate`	Refund action classification, customer impact, requires approval.	High amounts or sensitive customers require human approval.
5	`ai.tool.invoke`	Draft response create, side effect `draft_created`.	Draft writes use an idempotency key.
6	`ai.approval.review`	Supervisor approves/edits/rejects, visible evidence set hash.	The transaction, rules, and AI draft the approver saw can be reconstructed.
7	`ai.output.finalize`	Final response hash, delivery state.	The customer-visible text matches the approval record.

Product architecture insight

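In payment disputes, the key is not "answer accuracy" but keeping customer-impacting actions under control. The trace should therefore classify tool actions as:

read-only evidence gathering;
draft creation;
customer communication;
monetary action;
network submission;
reversal / remediation.

The policy decision, approval, side effect, and idempotency evidence for each action must be recorded separately.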
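
The six action classes can be encoded so that each tool call is checked for the evidence its class requires. This is a sketch under stated assumptions: the `ActionClass` enum mirrors the list above, but the `REQUIRED_EVIDENCE` mapping is a hypothetical policy chosen for illustration, and each institution would set its own.

```python
from enum import Enum
from typing import Dict, List, Set


class ActionClass(Enum):
    READ_ONLY = "read_only_evidence_gathering"
    DRAFT = "draft_creation"
    CUSTOMER_COMMUNICATION = "customer_communication"
    MONETARY = "monetary_action"
    NETWORK_SUBMISSION = "network_submission"
    REVERSAL = "reversal_remediation"


_FULL = {"policy_decision_id", "approval_id", "idempotency_key", "side_effect_id"}

# Hypothetical mapping: evidence fields each action class must carry.
REQUIRED_EVIDENCE: Dict[ActionClass, Set[str]] = {
    ActionClass.READ_ONLY: {"policy_decision_id"},
    ActionClass.DRAFT: {"policy_decision_id", "idempotency_key", "side_effect_id"},
    ActionClass.CUSTOMER_COMMUNICATION: _FULL,
    ActionClass.MONETARY: _FULL,
    ActionClass.NETWORK_SUBMISSION: _FULL,
    ActionClass.REVERSAL: _FULL,
}


def evidence_gaps(action: ActionClass, record: dict) -> List[str]:
    """Return the required evidence fields that are absent or empty in the record."""
    return sorted(f for f in REQUIRED_EVIDENCE[action] if not record.get(f))


refund = {"policy_decision_id": "poldec_7731", "idempotency_key": "idem_42"}
print(evidence_gaps(ActionClass.MONETARY, refund))
```

A tool gateway could refuse to execute any action whose gap list is non-empty, which also gives the 99.9% monetary-action SLO a direct measurement point.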

5.3 Lending Policy RAG Trace

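Scenario: The credit policy team and frontline managers use a RAG assistant to look up lending policy, exception conditions, documentation requirements, and reason codes. The system provides policy interpretation support only and makes no automated credit decisions.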

Trace timeline

Step	Span / Event	Evidence captured	Control point
1	`ai.request`	Role, product, state, loan type, customer segment.	Entitlement and geographic applicability are filtered first.
2	`ai.context.load`	User-provided scenario classification; full customer private data is not recorded.	Scenario abstraction is used to reduce PII.
3	`ai.retrieval.query`	Policy manual version, exception memo, effective dates, retired docs excluded.	Outdated policy may not be cited except as historical explanation.
4	`ai.model.invoke`	Prompt instructs "no credit decision, cite policy, show uncertainty".	Prompt version is bound to the risk control.
5	`ai.output.finalize`	Answer, citations, uncertainty label, escalation recommendation.	A policy conclusion without a citation must not be output as a definitive recommendation.
6	`ai.feedback.capture`	Policy owner marks helpful / incorrect / needs update.	Feedback triggers knowledge governance and does not change the production KB directly.

Critical evidence

Evidence	Why it matters
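`policy_effective_date`	Credit policy sometimes differs by date, state, product, and channel. Without an effective date, a citation is not sufficiently trustworthy.
`retired_doc_filter_result`	Prevents RAG from citing old policy.
`decision_boundary`	Makes explicit that the system does not make automated credit decisions.
`escalation_recommendation`	When policy is uncertain or an exception is complex, the AI should recommend escalation to the policy owner.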

5.4 Customer Service Escalation Trace

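Scenario: A customer service AI assistant helps agents answer questions about fees, transactions, account restrictions, and the complaints process, and determines whether to escalate to a supervisor, compliance, or a second-line team. The system's output is customer-visible, and a wrong commitment can lead to complaints, compensation, or regulatory risk.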

Trace timeline

Step	Span / Event	Evidence captured	Control point
1	`ai.request`	Customer service channel, language, product, complaint indicator.	Complaints, vulnerable customers, and fraud risk take the high-risk route.
2	`ai.context.load`	Account status summary, recent interaction summary, redaction profile.	Customer content is minimized and filtered by agent entitlement.
3	`ai.policy.evaluate`	Escalation policy, complaint classification, vulnerable customer policy.	When mandatory escalation is triggered, the AI must not continue with an ordinary reply.
4	`ai.retrieval.query`	Fees policy, account restrictions SOP, approved customer language.	Only customer-communicable versions are cited.
5	`ai.model.invoke`	Tone policy, prohibited commitments, language settings.	Unauthorized fee waivers and legal or regulatory commitments are prohibited.
6	`ai.output.finalize`	Customer-visible response hash, citation support, safety label.	Customer-visible output goes into a higher retention tier.
7	`ai.feedback.capture`	Agent sent / edited / escalated, customer satisfaction signal.	Outputs that are not adopted or are heavily rewritten enter quality review.

Escalation evidence

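Evidence for a customer service escalation is more than a record that "an escalation happened". At a minimum, record:

escalation trigger: complaint keywords, vulnerable customer signals, account freeze, fraud risk, legal threat, mention of a regulator.
escalation policy version: which policy required the escalation.
AI recommendation: which queue the AI recommended escalating to.
agent action: whether the agent accepted the escalation.
override reason: why the agent did not escalate.
downstream outcome: whether the case actually entered the second-line queue.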

6. Metrics, SLO and KRI

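AI runtime observability must support both SLOs and KRIs. SLOs track service commitments; KRIs track risk exposure. Financial-retail AI should not pursue only speed and low cost; it must also measure whether the system is verifiable, controllable, and recoverable.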

6.1 Metric Taxonomy

Category	Metric	Definition	Product decision
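Quality	Task success rate	Share of tasks judged successful by a judge, expert sampling, or workflow outcome.	Whether to expand, roll back, change the prompt, change the KB, or change the process.
Quality	Human edit distance	How much humans change the AI draft.	A high edit rate indicates the AI does not fit the workflow or lacks evidence.
Quality	Critical failure rate	Rate of severe errors that violate safety, compliance, or customer-impact boundaries.	Above threshold, the high-risk feature must be stopped or degraded.
Safety	Policy block rate	Share of requests or actions blocked by the policy engine.	A spike may indicate an attack, prompt injection, a requirement mismatch, or overly strict rules.
Safety	Unsafe tool attempt rate	Share of AI attempts to call unauthorized or high-risk tools.	Adjust the tool schema, prompt, policy, or agent planner.
Latency	TTFT	Time to first token.	Affects agent experience and streaming response strategy.
Latency	End-to-end workflow latency	Time from user request to usable output or completed approval.	Determines whether the AI actually improves process efficiency.
Cost	Cost per successful case	Total cost of completing one successful business task.	Look beyond per-call token cost; the denominator is successes.
Cost	Cost per accepted output	Model, RAG, tool, judge, and human review cost per output the user accepted.	High cost with low adoption calls for a product redesign.
Citation	Citation support rate	Share of material claims supported by valid evidence.	The core metric for whether a RAG system is usable.
Citation	Stale citation rate	Share of citations to outdated or retired sources.	Points to knowledge base governance and index refresh problems.
Escalation	Required escalation capture rate	Share of cases requiring escalation that were correctly escalated.	A key KRI for high-risk customer service, AML, and lending scenarios.
Override	Human override rate	Share of AI suggestions that humans reject, override, or heavily modify.	Monitors over-reliance and system misfit.
Incident	Evidence completeness rate	Share of incidents for which the required trace, event, approval, output, and policy evidence is complete.	Missing evidence is itself a control failure.
Drift	Model route drift	Change in the distribution of traffic, quality, and cost across model/vendor routes.	Whether the vendor version, routing policy, or fallback is behaving abnormally.
Drift	Vendor behavior drift	Change in performance on the same eval slice after a vendor model version update.	Third-party model risk management.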
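
Two of the metrics above, citation support rate and cost per successful case, are simple ratios whose denominators are easy to get wrong. The sketch below makes the denominators explicit; the record shapes (`material`, `supported`, `success`, `cost_cents`) are assumptions for illustration, and costs are kept in integer cents to avoid floating-point drift.

```python
from typing import Iterable, Optional


def citation_support_rate(claims: Iterable[dict]) -> Optional[float]:
    """Supported material claims divided by all material claims; None if there are none."""
    material = [c for c in claims if c["material"]]
    if not material:
        return None
    supported = sum(1 for c in material if c["supported"])
    return supported / len(material)


def cost_per_successful_case(cases: Iterable[dict]) -> Optional[float]:
    """Total cost of ALL cases divided by the number of successful cases, in cents."""
    cases = list(cases)
    successes = sum(1 for c in cases if c["success"])
    if successes == 0:
        return None  # no successes: the metric is undefined, not zero
    total_cents = sum(c["cost_cents"] for c in cases)
    return total_cents / successes


claims = [
    {"material": True, "supported": True},
    {"material": True, "supported": True},
    {"material": True, "supported": True},
    {"material": True, "supported": False},
    {"material": False, "supported": False},  # non-material claims are excluded
]
cases = [
    {"success": True, "cost_cents": 10},
    {"success": False, "cost_cents": 20},
    {"success": True, "cost_cents": 30},
]
print(citation_support_rate(claims))
print(cost_per_successful_case(cases))
```

Failed cases still contribute their cost to the numerator, which is why this figure can diverge sharply from average per-request token cost.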

6.2 SLO Examples

Use case	SLO	Measurement	Error budget interpretation
Customer service assistant	95% of agent requests produce a first useful draft within 3 seconds.	`ai.request` end-to-end latency + agent accept signal.	When over budget, optimize retrieval, cache, and model route first rather than lowering citation requirements.
Lending policy RAG	99% of customer-impacting policy answers carry a valid citation.	Claim extraction + citation support audit.	A citation failure is a quality risk and must not be masked by using a faster model.
AML copilot	98% of high-risk case summaries include the required evidence sections.	Judge rubric + QA sampling.	A missing red flag section triggers a release hold.
Payment dispute agent	99.9% of monetary actions have policy decision and approval evidence.	Tool side effect records + approval joins.	Even a single missing piece of evidence should be treated as a control failure.

6.3 KRI Examples

KRI	Trigger	Response
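High-risk trace missing rate > 0.1%	The root trace or a key span is missing in a high-risk use case.	Stop automated actions, switch to manual fallback, and fix instrumentation.
PII redaction failure detected	An unauthorized sensitive field appears in a prompt, output, or evidence event.	Raise a security incident, isolate the data, rotate keys, and fix the redaction policy.
Unsupported claim rate exceeds threshold	Material claims lack a citation or the citation does not support them.	Pause customer-visible output, switch to draft-only, and update the KB/eval.
Human override rate doubles week-over-week	Agents or analysts frequently override the AI.	Run a workflow review and slice analysis, and fix the prompt/RAG.
Tool policy bypass attempt	The AI or a code path bypasses the tool gateway.	Block the path immediately, then run an access review and build an incident pack.
Vendor fallback spike	The share of fallbacks to the backup model rises abnormally.	Check the primary vendor's SLA, model behavior drift, and the cost and quality impact.
Evidence query anomaly	Unauthorized personnel query sensitive evidence in bulk.	Raise a security alert, freeze access, and audit the requester's purpose.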
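
Two of these KRIs have numeric triggers that can be evaluated directly. The following sketch assumes counts and rates are already aggregated per period; the function names and parameters are hypothetical, and the 0.1% threshold and the "doubles week-over-week" rule are taken from the table above.

```python
def missing_trace_kri(total_high_risk_requests: int,
                      requests_with_complete_trace: int,
                      threshold: float = 0.001) -> bool:
    """True when the high-risk trace missing rate exceeds the threshold (default 0.1%)."""
    if total_high_risk_requests == 0:
        return False  # no traffic, nothing to breach
    missing = total_high_risk_requests - requests_with_complete_trace
    return missing / total_high_risk_requests > threshold


def override_rate_doubled(previous_week_rate: float, current_week_rate: float) -> bool:
    """True when the human override rate has at least doubled week-over-week."""
    if previous_week_rate <= 0:
        return False  # no baseline to compare against; handle separately
    return current_week_rate >= 2 * previous_week_rate


# 15 of 10,000 high-risk requests lack a complete trace: 0.15% > 0.1%.
print(missing_trace_kri(10_000, 9_985))
# Override rate moved from 8% to 17% in one week.
print(override_rate_doubled(0.08, 0.17))
```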

6.4 Metric Anti-Patterns

Anti-pattern	Why dangerous	Better metric
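Watching only answer thumbs-up	Users liking an answer does not make it compliant or correct.	thumbs-up + citation support + expert sample + complaint outcome.
Watching only average latency	Tail latency on high-risk cases can seriously disrupt the workflow.	p50/p90/p95/p99 by risk tier and workflow step.
Watching only token cost	Low cost with low adoption has no value.	cost per accepted output / successful case / avoided manual minutes.
Watching only block rate	Both a high and a low block rate can signal a problem.	block reason distribution + false block review + incident correlation.
Watching only dashboard availability	A dashboard being up does not mean the evidence is complete.	evidence completeness, freshness, lineage coverage, query reproducibility.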

7. Artifact Templates

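The templates below can be used directly as material for portfolios, architecture reviews, release gates, and audit evidence design.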

7.1 AI Span Schema

schema_id: ai-span-schema-v1
span_kind: internal
span_name: ai.model.invoke
required_attributes:
  ai.trace_id: string
  ai.request_id: string
  ai.use_case_id: string
  ai.risk_tier: enum[low, medium, high, critical]
  ai.workflow_step: string
  ai.release_version: string
  ai.model.provider: string
  ai.model.name: string
  ai.model.version: string
  ai.model.route_reason: string
  ai.prompt.template_id: string
  ai.prompt.version: string
  ai.prompt.render_hash: string
  ai.policy.version: string
  ai.tokens.input: integer
  ai.tokens.output: integer
  ai.cost.total_usd: decimal
  ai.latency.ttft_ms: integer
  ai.latency.total_ms: integer
  ai.model.finish_reason: string
recommended_attributes:
  ai.retrieval.kb_version: string
  ai.retrieval.citation_chunk_ids: array
  ai.quality.score: decimal
  ai.safety.label: string
  ai.cache.status: enum[hit, miss, bypass, stale_blocked]
data_controls:
  no_raw_pii_in_attributes: true
  prompt_content_storage: governed_by_risk_tier
  output_content_storage: governed_by_delivery_channel
  field_level_encryption: true

7.2 Evidence Event Contract

contract_id: ai-evidence-event-contract-v1
envelope:
  specversion: "1.0"
  id: globally_unique_event_id
  source: producing_service
  type: ai.domain.action
  subject: trace_or_business_subject
  time: rfc3339_timestamp
  datacontenttype: application/json
  dataschema: schema_uri
  traceparent: w3c_trace_context
required_data_fields:
  use_case_id: string
  risk_tier: string
  workflow_step: string
  evidence_class: enum[public, internal, confidential, restricted]
  retention_policy_id: string
  producer_service: string
  producer_version: string
  entity_ids: array
  activity_id: string
  agent_id_hash: string
quality_rules:
  event_time_must_be_utc: true
  schema_version_must_be_registered: true
  pii_fields_must_be_classified: true
  high_risk_events_must_include_traceparent: true
  policy_decision_events_must_include_policy_version: true

Recommended event types:

Event type	Purpose
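`ai.request.started`	Records the start of an AI workflow.
`ai.context.loaded`	Records context sources, entitlement, and redaction results.
`ai.prompt.rendered`	Records the prompt template, version, and render hash.
`ai.retrieval.completed`	Records the KB, index, retrieved chunks, citations, and filter results.
`ai.model.invoked`	Records the model call, tokens, latency, cost, and finish reason.
`ai.policy.decided`	Records allow/block/escalate and the reason code.
`ai.tool.invoked`	Records the tool call, schema, side effect, and idempotency.
`ai.approval.decided`	Records the human approval or review result.
`ai.output.finalized`	Records the final output, hash, citation support, and delivery.
`ai.feedback.captured`	Records acceptance, modification, rejection, and workflow outcome.
`ai.incident.signal.detected`	Records the anomaly signal and linked traces.
`ai.evidence.query.executed`	Records who queried which evidence, for what purpose, and the scope of results.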

7.3 Dashboard Spec

dashboard_id: ai-runtime-evidence-executive-risk-v1
audience:
  - AI governance committee
  - product leadership
  - model risk
refresh_cadence: daily
filters:
  - use_case_id
  - business_unit
  - risk_tier
  - model_provider
  - release_version
  - date_range
tiles:
  - name: High-risk evidence completeness
    metric: evidence_completeness_rate
    target: ">= 99.9%"
    drilldown: missing_trace_by_span_type
  - name: Critical failure trend
    metric: critical_failure_rate
    target: "0 for customer-impacting high-risk outputs"
    drilldown: failed_eval_cases_and_trace_ids
  - name: Citation support
    metric: citation_support_rate
    target: ">= 99% for policy answers"
    drilldown: unsupported_claim_samples
  - name: Human override and escalation
    metric: override_rate_and_required_escalation_capture
    target: risk_tier_specific
    drilldown: reason_code_distribution
  - name: Cost per successful case
    metric: total_cost_usd / successful_cases
    target: use_case_budget
    drilldown: model_route_and_workflow_step
  - name: Incident signals
    metric: open_signals_by_severity
    target: severity_sla
    drilldown: incident_evidence_pack_status
controls:
  no_raw_customer_content_on_dashboard: true
  row_level_security: true
  export_requires_approval: true

7.4 Incident Evidence Pack

Section	Required contents	Why it matters
Executive summary	Incident id, severity, time window, affected use case, customer impact, current state.	Lets management and risk teams judge impact quickly.
Timeline	Linked traces, events, alerts, deployments, policy changes, vendor events.	Proves the order of events.
Version set	Model, prompt, KB, index, tool schema, policy, release, judge rubric.	Prevents the review from using the wrong versions.
Evidence objects	Prompt/config, context, retrieval, tool calls, approval, policy decisions, outputs, feedback.	Forms the complete behavioral evidence.
Control assessment	Which preventive, detective, and corrective controls worked or failed.	Determines whether there is a control deficiency.
Customer / business impact	Affected cases, messages sent, monetary impact, complaints, regulatory exposure.	Supports remediation and reporting.
Root cause	Product, process, model, data, tool, policy, vendor, human factors.	Avoids blaming only the model.
Remediation	Rollback, disable, KB fix, prompt fix, policy change, training, customer remediation.	Closes out the risk.
Learning loop	New eval cases, regression tests, dashboard changes, control updates.	Demonstrates continuous improvement.
Approvals	Incident commander, business owner, risk, legal/compliance, model owner sign-off.	Accountability and closure.

7.5 Audit Query Catalog

Query id	Audit question	Required joins	Expected evidence
AQ-001	Which model, prompt, and KB produced a given customer-visible AI output?	trace -> model span -> prompt span -> retrieval span -> output event	version set, output hash, citation chunk ids.
AQ-002	Did a high-risk tool action go through a policy decision and human approval?	tool event -> policy decision -> approval record -> side effect	allow/escalate reason, approval id, idempotency key.
AQ-003	Are any traces missing within an incident window?	request count -> trace store -> event bus -> evidence lake	missing trace list, span coverage, collection errors.
AQ-004	Which outputs cited a retired policy?	output citations -> KB lineage -> document lifecycle	source ids, retired date, affected traces.
AQ-005	What are the main reasons human agents override AI suggestions?	feedback events -> edit reason -> workflow step -> prompt/model version	override reason distribution, sample traces.
AQ-006	Did quality drift after a model vendor change?	vendor version -> eval scores -> production quality -> incident signals	before/after slices, critical failures, rollback decision.
AQ-007	Who queried AI evidence related to a given customer?	evidence query events -> access control logs -> purpose codes	requester, purpose, approved export, fields accessed.
AQ-008	Is the evidence required for a regulatory inquiry fresh and complete?	evidence pack -> object freshness -> control coverage	evidence freshness score, missing objects, owner actions.
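
The join chain for AQ-002 can be sketched in a few lines. This is a minimal illustration, assuming evidence records have already been loaded into memory; the record types, field names, and gap labels below are hypothetical and not part of any standard or product schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolEvent:
    trace_id: str
    tool_call_id: str
    risk_tier: str                  # hypothetical tiers: "high" or "low"
    side_effect_id: Optional[str]
    idempotency_key: Optional[str]


@dataclass(frozen=True)
class PolicyDecision:
    tool_call_id: str
    decision: str                   # "allow", "block", or "escalate"
    policy_version: str


@dataclass(frozen=True)
class ApprovalRecord:
    tool_call_id: str
    approval_id: str
    decision: str                   # "approved" or "rejected"


def audit_high_risk_tool_actions(tool_events, policy_decisions, approvals):
    """Join tool event -> policy decision -> approval record and list evidence gaps."""
    policy_by_call = {p.tool_call_id: p for p in policy_decisions}
    approval_by_call = {a.tool_call_id: a for a in approvals}
    findings = []
    for event in tool_events:
        if event.risk_tier != "high":
            continue  # AQ-002 only covers high-risk actions
        gaps = []
        policy = policy_by_call.get(event.tool_call_id)
        approval = approval_by_call.get(event.tool_call_id)
        if policy is None:
            gaps.append("missing_policy_decision")
        elif policy.decision == "block":
            gaps.append("executed_despite_block")
        if approval is None or approval.decision != "approved":
            gaps.append("missing_human_approval")
        if event.side_effect_id is None:
            gaps.append("missing_side_effect_id")
        if event.idempotency_key is None:
            gaps.append("missing_idempotency_key")
        findings.append(
            {"trace_id": event.trace_id, "tool_call_id": event.tool_call_id, "gaps": gaps}
        )
    return findings
```

An empty `gaps` list means the action has the full chain of expected evidence. In a real evidence lake the same joins would more likely run as a query over the curated zone than in application code.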

7.6 Retention Matrix

Evidence class	Examples	Default retention	Access	Deletion handling
Operational telemetry	latency, tokens, cost, span status, error code	13-24 months for trend analysis	platform, SRE, PM aggregate	Keep anonymized aggregates after the raw linkage is deleted.
High-risk decision support evidence	AML summaries, lending policy answers, payment dispute drafts	5-7 years or policy-defined business record period	restricted, case-bound	Aligned with business record retention; supports legal hold.
Customer-visible AI output	sent response, approved message, final advice	aligned to customer communication record	restricted, need-to-know	Customer rights requests are assessed against legal and record-keeping obligations.
Raw prompt/input content	rendered prompt, context snippets	minimized, risk-tiered, shorter than business record unless required	highly restricted	Prefer storing a hash, pointer, or redacted version.
Approval evidence	reviewer decision, visible evidence set hash, reason code	aligned to control evidence policy	risk, audit, assigned management	Reviewer identity can be resolved with authorization and is not openly exposed.
Incident evidence pack	timeline, linked traces, impact, remediation	legal hold or incident retention policy	incident team, risk, legal, audit	Moved to long-term archive after the incident is closed.
Learning samples	failed examples, human edits, judge failures	limited and reviewed periodically	EvalOps, model team under purpose control	Redact, de-identify, and purge expired samples.
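
A retention matrix is easier to enforce when it exists as data that a scheduled job can evaluate. The sketch below is one possible shape, not a prescribed implementation. The class keys, the day counts (730 days, 7 years, 180 days), and the action names are illustrative assumptions; actual periods come from the institution's records policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionRule:
    retention_days: int   # hypothetical value; set by records policy in practice
    expiry_action: str    # what to do once the period has elapsed


# Illustrative subset of the matrix; keys and numbers are assumptions.
RULES = {
    "operational_telemetry": RetentionRule(730, "keep_anonymized_aggregate"),
    "high_risk_decision_support": RetentionRule(7 * 365, "follow_business_record_policy"),
    "learning_sample": RetentionRule(180, "purge_sample"),
}


def retention_action(evidence_class, created_on, today, legal_hold=False):
    """Return the action a retention job should take for one evidence object."""
    if legal_hold:
        # Legal hold overrides every expiry rule.
        return "retain_legal_hold"
    rule = RULES[evidence_class]
    if today - created_on > timedelta(days=rule.retention_days):
        return rule.expiry_action
    return "retain"
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the rule deterministic and easy to test.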

8. Failure Modes and Controls

8.1 Missing Trace

Failure mode	Impact	Controls
root trace missing	Cannot prove that an AI action took place end to end.	The AI gateway forces trace creation; a high-risk workflow cannot proceed without a trace.
critical span missing	Cannot prove the prompt, retrieval, tool, approval, or policy decision.	span coverage SLO; release gate checks instrumentation; tail sampling retains high-risk traces.
trace context broken	The multi-service chain breaks and the incident cannot be reconstructed.	W3C trace context propagation; contract test; collector health dashboard.
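
Two of these controls lend themselves to a small contract test: validating the W3C `traceparent` header at a service boundary and checking that a trace contains its critical spans. The sketch below handles only the four-field header form; the span names in `REQUIRED_SPANS` are hypothetical and would come from the team's own naming standard.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Hypothetical span names for a high-risk workflow.
REQUIRED_SPANS = frozenset(
    {"prompt.render", "retrieval", "model.invoke", "policy.evaluate", "approval"}
)


def parse_traceparent(header):
    """Parse a four-field traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def missing_critical_spans(span_names, required=REQUIRED_SPANS):
    """Return the required span names that are absent from a trace, sorted."""
    return sorted(required - set(span_names))
```

A gateway could refuse to continue a high-risk workflow when `parse_traceparent` returns `None`, and a release gate could fail when `missing_critical_spans` is non-empty for sampled test traces.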

8.2 PII in Logs

Failure mode	Impact	Controls
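Raw prompt containing full customer PII enters ordinary logs	The observability system becomes a source of sensitive data leakage.	redaction processor, PII classifier, field-level encryption, ban on raw content attributes.
Verbose logging enabled for debugging is left on	Large volumes of sensitive content are exposed.	production logging policy, config guardrail, security scan, break-glass review.
Evidence dashboard shows customer data in cleartext	Sensitive information is visible to people who do not need it.	role-based views, masking by default, purpose-based access, export approval.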
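
A redaction processor can be sketched as a function that turns a rendered prompt into log-safe attributes: a hash for integrity, a redacted copy for debugging, and no raw content. The two regular expressions and the attribute names below are illustrative assumptions. A production redactor would typically combine a PII classifier with patterns tuned to local data formats.

```python
import hashlib
import re

# Simplified patterns for illustration; they will miss many real PII formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13-19 digits, optional separators


def redact(text):
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = CARD_RE.sub("[CARD]", text)
    return text


def to_log_attributes(rendered_prompt):
    """Build span attributes that never carry the raw prompt (hypothetical names)."""
    return {
        "prompt.sha256": hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest(),
        "prompt.redacted": redact(rendered_prompt),
        "prompt.length": len(rendered_prompt),
    }
```

Storing the hash alongside the redacted text lets an authorized reviewer later confirm that a separately stored raw prompt is the one that was actually sent.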

8.3 Unverifiable Tool Action

Failure mode	Impact	Controls
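AI performs a write without a side effect id	The customer-impacting action cannot be proven or reversed.	The tool gateway enforces side_effect_id, idempotency key, and dry-run record.
Tool input is stored only as a natural-language description	The actual parameters cannot be reproduced.	canonical tool input hash, schema version, validation result.
Human approval is decoupled from the tool action	What was approved is not what was executed.	Bind the approval's visible evidence hash to the execution input hash.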
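
Binding the approval to the execution amounts to hashing a canonical form of the tool input at approval time and comparing it again just before execution. The sketch below assumes JSON-serializable tool inputs and uses sorted-key JSON as the canonical form. That choice is an assumption for illustration, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_hash(payload):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def execution_matches_approval(approved_hash, execution_input):
    """True only if the input about to be executed is what the reviewer approved."""
    return hmac.compare_digest(approved_hash, canonical_hash(execution_input))


# Example: the reviewer approves a refund, then the amount is changed before execution.
approved = canonical_hash({"action": "refund", "amount": 120, "case_id": "C-1"})
same_input = {"case_id": "C-1", "amount": 120, "action": "refund"}
tampered_input = {"case_id": "C-1", "amount": 1200, "action": "refund"}
```

Here `execution_matches_approval(approved, same_input)` holds regardless of key order, while the tampered input fails the check, so the tool gateway can block the call and emit an evidence event.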

8.4 Broken Lineage

Failure mode	Impact	Controls
RAG chunk has no KB/index/document version	Cited evidence cannot be verified.	OpenLineage-style KB build events, chunk ids, effective dates.
Eval result does not record the dataset version	Evaluations cannot be compared.	eval run metadata schema, dataset hash, slice definition.
Prompt config hotfix is not recorded	The incident review uses the wrong version.	prompt registry, config-as-code, release approval, runtime config snapshot.

8.5 Non-Replayable Incident

Failure mode	Impact	Controls
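Incident window lacks the model/prompt/KB/policy version set	The root cause cannot be determined.	Incident evidence preservation freezes the version set automatically.
Vendor model version is not traceable	Third-party behavior drift cannot be proven.	model alias registry, vendor response metadata, contractual version notice.
Only the final answer is stored, not retrieval or policy	You can only say "the model was wrong", not locate why.	Three evidence layers: trace/span, event, and provenance graph.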
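
Freezing a version set is most useful when two snapshots can be compared. The sketch below models the version set as an immutable record and diffs the snapshot before the incident against the one taken during it. The field list follows the Version set row of the incident evidence pack; the class and function names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a preserved snapshot must not be mutated later
class VersionSet:
    model: str
    prompt: str
    kb: str
    index: str
    tool_schema: str
    policy: str
    release: str
    judge_rubric: str


def diff_version_sets(before, after):
    """Return {field: (before_value, after_value)} for every field that changed."""
    old, new = asdict(before), asdict(after)
    return {name: (old[name], new[name]) for name in old if old[name] != new[name]}
```

A diff that shows only `prompt` and `kb` changing between the last healthy trace and the first failing one narrows the root cause search before anyone reads a single transcript.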

8.6 Dashboard Theater

Failure mode	Impact	Controls
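Dashboard metrics look good but have no drilldown	Management believes the system is under control, and evidence cannot be collected during an incident.	Every metric must drill down to trace/evidence samples.
Metrics have no owner or action	Alerts go unhandled and declining quality becomes normal.	metric owner, threshold, runbook, SLA, post-incident review.
Only aggregates are reviewed, hiding high-risk slices	The average looks normal while high-risk customers or channels are failing.	risk-tiered and slice-based dashboards.
Judge scores replace human sampling	Drift in the automated evaluation itself goes unnoticed.	judge calibration, expert sample, rubric versioning.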
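
The aggregate-versus-slice problem is easy to show with numbers. The sketch below computes a pass rate per slice and flags slices below a threshold. The record shape (`risk_tier`, `passed`) and the 0.9 threshold used in the note that follows are illustrative assumptions.

```python
from collections import defaultdict


def pass_rate_by_slice(records, slice_key):
    """Return {slice_value: pass_rate} for records shaped like {"passed": bool, ...}."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for record in records:
        bucket = counts[record[slice_key]]
        bucket[0] += 1 if record["passed"] else 0
        bucket[1] += 1
    return {name: passed / total for name, (passed, total) in counts.items()}


def breached_slices(records, slice_key, threshold):
    """Return the slices whose pass rate falls below the threshold, sorted."""
    rates = pass_rate_by_slice(records, slice_key)
    return sorted(name for name, rate in rates.items() if rate < threshold)
```

With 95 passing low-risk records and 5 high-risk records of which only 1 passes, the overall pass rate is 96%, yet `breached_slices(records, "risk_tier", 0.9)` flags the high-risk tier, whose pass rate is 20%.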

9. Implementation Roadmap

9.1 Maturity Levels

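Level	Status	Capability	Next step
L0 Ad hoc logs	Scattered logs	Can only troubleshoot technical errors; AI behavior cannot be audited.	Define the evidence taxonomy and high-risk use cases.
L1 Traceable request	Basic trace	Every AI request has a trace id, model, prompt, latency, and cost.	Add retrieval/tool/policy/approval spans.
L2 Evidence events	Event-based evidence	Key AI actions reach the evidence lake through contract events.	Build the provenance graph and dashboards.
L3 Audit-ready evidence	Audit-ready	Can generate evidence packs by incident, release, control, and case.	Establish retention, access, query audit, and legal hold.
L4 Learning loop	Continuous improvement	Production failures flow automatically into the eval, KB, prompt, policy, and process improvement loop.	Establish portfolio-level governance and vendor drift controls.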

9.2 First 90 Days Plan

Phase	Days	Deliverables
Foundation	1-15	High-risk use case inventory, evidence object taxonomy, trace/span naming standard, data classification.
Instrumentation	16-35	AI gateway instrumentation, prompt/model/retrieval/tool/policy/approval spans, cost ledger.
Evidence pipeline	36-55	CloudEvents-style evidence contracts, event bus, evidence lake raw/curated zones, redaction processor.
Governance	56-70	retention matrix, access control, evidence query audit, release evidence gate, incident evidence pack template.
Dashboards and learning	71-90	quality/safety/cost/latency/KRI dashboards, audit query catalog, eval queue from production failures, model/vendor drift review.

9.3 Architecture Decisions to Record

ADR	Decision question	Key trade-off
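ADR-001 Trace sampling	Should high-risk AI traces be retained in full?	Cost vs. audit completeness.
ADR-002 Prompt content storage	Should the full rendered prompt be stored?	Reproducibility vs. PII risk.
ADR-003 Evidence lake design	Should raw evidence and analytical data be zoned and tiered separately?	Query convenience vs. access control.
ADR-004 Tool gateway	Must all AI tool calls go through a single gateway?	Development speed vs. control over actions.
ADR-005 Policy-as-code	Should every policy decision be generated by the PDP?	Flexibility vs. auditability.
ADR-006 Vendor metadata	How is third-party model response metadata standardized?	Multi-vendor compatibility vs. drift tracking.
ADR-007 Retention boundary	Which content is stored as hash/pointer and which as raw text?	Incident replay vs. privacy minimization.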

10. Governance Operating Model

10.1 RACI

Activity	AI PM	Product Architect	Platform / SRE	Model / EvalOps	Risk / Compliance	Security / Privacy	Internal Audit
Evidence taxonomy	A	R	C	C	C	C	C
Span schema	C	A	R	C	C	C	I
Event contract	C	A	R	C	C	C	I
Retention matrix	C	C	C	I	A	R	C
Access control	C	C	R	I	C	A	C
Dashboard metrics	A	R	R	R	C	C	I
Incident evidence pack	R	R	R	C	A	A	C
Audit query catalog	C	R	C	C	A	C	R
Learning loop	A	R	C	R	C	C	I

R = Responsible, A = Accountable, C = Consulted, I = Informed.

10.2 Release Evidence Gate

Before any customer-impacting or regulated decision support AI use case goes live, it must pass at least the following gates:

Gate	Required proof
Trace coverage	Coverage tests for the root trace and critical spans of high-risk workflows pass.
Evidence contract	All required events have registered schemas, and validation is enabled in production.
Redaction	PII redaction test, sensitive field scan, and dashboard masking verification pass.
Policy decision	The allow/block/escalate policy version is traceable, and tool actions go through the gateway.
Retrieval lineage	KB, index, chunk, effective date, and permission filter are traceable.
Approval	The human review workflow records the visible evidence set, decision, reason, and timestamp.
Dashboard	The SLO/KRI dashboard can drill down to traces and sample evidence.
Incident readiness	The incident evidence pack generation drill passes, and the legal hold process is defined.
Retention/access	The retention matrix and access control are approved and enforced automatically.
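
The gate can be evaluated mechanically in a release pipeline. The sketch below assumes each gate reports a boolean result under a hypothetical key; a gate that is missing from the results is treated as failed, so an unreported check can never pass silently.

```python
# Hypothetical gate keys, one per row of the release evidence gate table.
REQUIRED_GATES = (
    "trace_coverage",
    "evidence_contract",
    "redaction",
    "policy_decision",
    "retrieval_lineage",
    "approval",
    "dashboard",
    "incident_readiness",
    "retention_access",
)


def evaluate_release(gate_results):
    """Return the release decision and the gates that failed or were not reported."""
    failed = [gate for gate in REQUIRED_GATES if gate_results.get(gate) is not True]
    return {"release_allowed": not failed, "failed_gates": failed}
```

Comparing with `is not True` instead of relying on truthiness means that a value such as the string "pending" also blocks the release.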

11. Interview Section

Question

How do you make an AI system auditable in production?

30-second answer

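I would design AI auditability as a runtime evidence architecture rather than backfilling logs after the fact. At its core, every AI request has a trace and every key behavior has a span: prompt/config, context, retrieval, model call, tool call, policy decision, human approval, output, feedback, cost, and incident signal. A unified event contract carries these into the evidence lake, and a provenance graph connects who acted, on what basis, what was done, and which evidence was produced. For financial scenarios I would add PII redaction, access control, retention boundary, high-risk full trace, audit query catalog, and incident evidence pack, so that production behavior is explainable, replayable, accountable, and continuously improvable.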

2-minute answer

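I would build it in four layers.

The first layer is instrumentation. The AI gateway and orchestrator must force trace id generation and record prompt render, context load, RAG retrieval, model invoke, tool call, policy evaluation, human approval, output finalize, and feedback capture as spans. An ordinary API 200 does not mean the AI succeeded, so spans must record model version, prompt version, KB/index version, tool schema, policy version, tokens, latency, cost, citation support, safety label, and outcome.

The second layer is the evidence contract. Key behaviors cannot be left scattered across service logs; they must be published as stable events such as ai.policy.decided, ai.tool.invoked, ai.approval.decided, and ai.output.finalized. Each event carries a unified envelope, schema version, trace context, data classification, retention policy, and producer version.

The third layer is provenance and the evidence lake. Use PROV thinking to connect Entity, Activity, and Agent: which model call generated the output, and which prompt, retrieved chunks, permitting policy, and human approval that model call relied on. The evidence lake is split into raw immutable, redacted curated, evidence pack, and learning loop zones, with matching access control, redaction, retention, and deletion boundaries.

The fourth layer is governance and learning. Prepare an audit query catalog, dashboards, SLO/KRI, and an incident evidence pack for audit, regulators, incidents, and release gates. Production failures, human overrides, unsupported citations, policy blocks, cost anomalies, and vendor drift all feed the eval, prompt, KB, policy, and process improvement loop. The AI system then does not merely "have logs"; it is observable, provable, and governable in production.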

Follow-up points

Follow-up	Strong answer angle
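Won't this store too much sensitive content?	Store the minimum sufficient evidence, preferring hash, pointer, metadata, and redacted content; raw text is stored in tiers according to risk and legal basis.
How do you reproduce third-party model output?	Do not promise fully deterministic replay; retain the version set, prompt/config, input hash, vendor metadata, output hash, and eval replay case.
How do you prove a tool action did not exceed its authority?	All tool calls go through gateway, policy decision, schema validation, dry-run, approval, side-effect id, and idempotency.
How do you avoid dashboard theater?	Every metric must drill down to trace/sample evidence and have an owner, threshold, runbook, and improvement action.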

12. Portfolio Exercise

Choose a financial retail AI scenario and complete a runtime evidence architecture pack. Suggested topics:

AML copilot for suspicious activity investigation.
Payment dispute response agent.
Lending policy RAG assistant.
Customer service escalation copilot.
Branch banker product recommendation assistant.

12.1 Deliverables

Artifact	Requirements
Runtime evidence context diagram	Shows the AI gateway, orchestrator, RAG, tool gateway, policy engine, approval workflow, evidence lake, and dashboards.
Evidence object taxonomy	Covers at least prompt/config, context, retrieval, tool call, approval, policy decision, output, feedback, cost/latency, and incident signal.
Trace/span schema	Defines the root trace and key spans and lists the required attributes.
Event contracts	Defines at least 5 CloudEvents-style evidence events.
Provenance graph	Uses Entity / Activity / Agent to show the evidence relationships for one request.
SLO/KRI dashboard spec	Includes at least quality, safety, latency, cost, citation, escalation, override, incident, and drift.
Incident evidence pack sample	Picks one failure scenario and produces an auditable evidence pack structure.
Audit query catalog	At least 8 audit questions with the joins they require.
Retention/access matrix	States which content is stored as raw text, hash, pointer, or aggregate data, and who can access it.
Interview narrative	30-second and 2-minute versions explaining why this is not ordinary logging.
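
For the Event contracts deliverable, a CloudEvents-style envelope might look like the sketch below. The required context attributes (`specversion`, `id`, `source`, `type`) follow CloudEvents 1.0. The extension attributes `traceid`, `dataclassification`, and `schemaversion`, the helper names, and the example source URI are hypothetical choices for this exercise.

```python
import uuid
from datetime import datetime, timezone

REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_evidence_event(event_type, source, trace_id, data_classification, schema_version, data):
    """Build one evidence event as a plain dict in CloudEvents 1.0 style."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Hypothetical extension attributes (lowercase alphanumeric names).
        "traceid": trace_id,
        "dataclassification": data_classification,
        "schemaversion": schema_version,
        "data": data,
    }


def missing_required_attributes(event):
    """Return the required CloudEvents attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


event = make_evidence_event(
    event_type="ai.tool.invoked",
    source="/ai-gateway/tool-gateway",  # hypothetical producer URI
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    data_classification="restricted",
    schema_version="1.0.0",
    data={"tool": "refund", "input_sha256": "<hash of canonical input>"},
)
```

Keeping the payload down to hashes and identifiers, as in `data` above, lets the event travel on a shared bus while the raw content stays in the restricted zone of the evidence lake.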

12.2 Scenario Template

# Runtime Evidence Architecture Pack: [Use Case Name]

## Business Context
- Business workflow:
- User roles:
- Customer impact:
- Regulatory / risk scope:
- Decision boundary:

## Runtime Evidence Goals
- Explainability goal:
- Audit goal:
- Incident replay goal:
- Learning loop goal:
- Privacy and retention goal:

## Trace Model
| Span | Required attributes | Evidence object | Control |
|---|---|---|---|

## Evidence Events
| Event type | Producer | Consumer | Schema version | Retention |
|---|---|---|---|---|

## Provenance Graph
| Entity | Activity | Agent | Edge |
|---|---|---|---|

## Metrics and KRI
| Metric | Threshold | Owner | Response |
|---|---|---|---|

## Incident Evidence Pack
| Section | Evidence source | Owner |
|---|---|---|

## Retention and Access
| Evidence class | Retention | Access | Deletion handling |
|---|---|---|---|

13. Checklist

13.1 Design Checklist

The business, risk, audit, and learning goals of AI runtime evidence are defined.
Use cases are classified by risk tier into those that need full trace and those that can use sampling.
Evidence objects are defined for prompt/config, context, retrieval, tool call, approval, policy decision, output, feedback, cost/latency, and incident signal.
The root trace and key AI spans are designed, with required attributes defined.
The CloudEvents-style evidence event contract and schema registry process are defined.
The provenance graph is designed and can connect Entity, Activity, and Agent.
RAG knowledge base, index, chunk, effective date, and permission filter are included in lineage.
Tool calls are routed through the tool gateway, policy decision, dry-run, approval, side-effect id, and idempotency.
Separate retention policies are defined for customer-visible output and internal drafts.
PII redaction, field-level encryption, dashboard masking, and export control are defined.

13.2 Operations Checklist

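Trace coverage and critical span coverage for high-risk workflows meet the release threshold.
The SLO/KRI dashboard can drill down by use case, risk tier, model provider, release version, and workflow step.
Every dashboard metric has an owner, threshold, runbook, and review cadence.
The incident evidence pack can automatically collect the timeline, version set, linked traces, policy decisions, approval records, outputs, and remediation.
Evidence queries are themselves logged, and sensitive exports require approval and a purpose code.
The retention matrix is enforced by the system and does not rely on manual cleanup.
The learning loop can turn production failures into eval cases, KB updates, prompt changes, policy updates, or workflow redesign.
Vendor drift review can compare quality, cost, safety, and fallback behavior before and after a model vendor version change.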

13.3 Audit Readiness Checklist

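Can answer "Which prompt, model, KB, policy, and tool schema did this AI output use?"
Can answer "Did this tool action go through a policy decision and human approval?"
Can answer "Does the customer-visible output match the approved version?"
Can answer "Which claims are supported by citations and which are not?"
Can answer "Is any trace or evidence missing within a given incident window?"
Can answer "Who accessed or exported sensitive evidence, and for what purpose?"
Can answer "Which post-release failures entered the continuous improvement loop?"
Can answer "How are retention and deletion boundaries enforced, and is there over-retention or insufficient evidence?"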

14. Closing Synthesis

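The maturity of an AI Runtime Evidence / Observability Architecture does not depend on how expensive the logging platform is, how polished the dashboards are, or how long the model call records are. It depends on whether the system can answer three kinds of questions at the critical moment:

Behavior: What exactly did the AI do in this business scenario, and based on which context, evidence, model, policies, and tools?
Accountability: Who, or which system, allowed, approved, modified, sent, overrode, or escalated this behavior?
Learning: How do the quality, safety, cost, customer impact, and incident signals from this behavior feed the next round of eval, governance, architecture, and product improvement?

For CBAP+ practitioners, AI PMs, and architects, the real competitive edge is not being able to say "we have observability" but being able to design runtime evidence as the intersection of product capability, control capability, and learning capability. A financial retail AI production system must be usable, and it must also be provable.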