AI Runtime Evidence: Observable Evidence Architecture

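AI Runtime Evidence / Observability Architecture: A Reading Guide

Audience: AI Product Architect / Platform PM / EvalOps Lead / SRE / Senior BA / Audit Evidence Owner.

Core problem: once an AI system is in production, chat transcripts and user complaints are not enough to judge its quality. Production-grade AI needs a runtime evidence architecture that covers prompt, context, retrieval, tool call, policy decision, human approval, output, feedback, cost, latency, and incident.

Learning goal: apply the thinking behind OpenTelemetry, OpenLineage, W3C PROV, CloudEvents, and the AI RMF to design an AI evidence plane that is observable, auditable, replayable, and able to support continuous improvement.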

Source Anchors

Source	Link	Purpose
OpenTelemetry	https://opentelemetry.io/docs/	Observability model: traces, spans, metrics, logs
OpenLineage	https://openlineage.io/docs/	Data lineage thinking: lineage events, jobs, datasets, runs
W3C PROV	https://www.w3.org/TR/prov-overview/	Provenance graph of entities, activities, and agents
CloudEvents	https://cloudevents.io/	Event envelope: source, type, id, time, subject
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Turn measure/manage into runtime evidence and risk feedback
NIST CSF	https://www.nist.gov/cyberframework	Operational control thinking: identify/protect/detect/respond/recover

In one sentence:

Runtime evidence is the AI product's black box, audit trail and learning loop.

1. Why AI Observability Is Not the Same as Logging

Ordinary logging typically answers:

request time, status code, error

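AI runtime evidence has to answer:

What the user asked.
What instructions the system gave.
Which sources were retrieved.
Which context was put into the prompt.
Which model and parameters were used.
Which tools were called.
Which policy gate allowed or denied the action.
Whether a human approved.
Whether the user adopted the output.
Whether complaints, errors, losses, or rework followed.

Without this evidence, incident review stalls at:

The AI seems to have answered wrong, but we do not know why.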

2. Evidence Object Taxonomy

Evidence object	Example fields
Request	user_id, role, tenant, channel, purpose
Prompt/config	system prompt version, policy profile, model, temperature
Context	memory keys, retrieved chunks, tool observations
Retrieval	query, index version, top_k, source ids, scores
Tool call	tool name, args hash, policy decision, result summary
Human approval	approver, decision, rationale, timestamp
Output	response hash, citation ids, schema validation
Feedback	thumbs, edit distance, reviewer score, override
Cost/latency	token count, model cost, wall time, queue delay
Safety signal	refusal, escalation, red-team hit, policy violation
Incident link	incident id, severity, containment action

Not everything has to be stored verbatim. High-risk fields can be hashed, masked, tokenized, or stored according to a retention policy.
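The field-handling options above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `FIELD_POLICY` mapping, the field names, and the `tok_` / `sha256:` prefixes are assumptions made for the example, and a real system would load the policy from a data classification catalog and keep the key in a secrets manager.

```python
import hashlib
import hmac

# Hypothetical per-field handling policy (illustrative field names).
FIELD_POLICY = {
    "user_id": "tokenize",
    "tool_args": "hash",
    "account_number": "mask",
    "model": "keep",
}

def protect_field(name: str, value: str, secret: bytes) -> str:
    """Apply the configured handling to one evidence field."""
    action = FIELD_POLICY.get(name, "hash")  # unknown fields default to hashing
    if action == "keep":
        return value
    if action == "mask":
        if len(value) <= 4:
            return "*" * len(value)  # too short to reveal any suffix
        # Keep only the last four characters visible.
        return "*" * (len(value) - 4) + value[-4:]
    if action == "tokenize":
        # Keyed HMAC: a stable pseudonym that cannot be recomputed without the key.
        digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
        return "tok_" + digest[:16]
    # Plain SHA-256 digest: supports integrity checks without storing raw content.
    return "sha256:" + hashlib.sha256(value.encode()).hexdigest()

def protect_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with every field protected per policy."""
    return {k: protect_field(k, str(v), secret) for k, v in record.items()}
```

Defaulting unknown fields to `hash` is a deliberate choice in this sketch: a new field that nobody classified is never stored in the clear.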

3. Reference Architecture

AI app / agent
  -> instrumentation SDK
  -> trace/span/events
  -> policy/eval annotations
  -> evidence lake
  -> dashboards + audit queries + incident workbench

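Core planes:

Plane	Role
Trace plane	Links the steps of one AI run
Event plane	Records key business and risk events
Provenance plane	Explains where an output came from and how it was generated
Metrics plane	Monitors SLOs, KRIs, and trends
Evidence plane	Supports audit, regulatory review, validation, and incident review
Retention plane	Controls retention, deletion, privacy, and access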

4. AI Span Model

A single agent run can be broken into:

root span: ai.workflow.run
  child span: ai.policy.precheck
  child span: ai.retrieval.query
  child span: ai.model.generate
  child span: ai.tool.call
  child span: ai.human.approval
  child span: ai.policy.postcheck
  child span: ai.output.deliver

Span attributes:

Attribute	Purpose
ai.use_case	Use case category
ai.risk_tier	Risk tier
ai.agent_id	Agent identity
ai.model_id	Model
ai.prompt_version	Prompt version
ai.index_version	RAG index version
ai.tool_scope	Tool permission scope
ai.policy_decision	allow/deny/escalate
ai.human_decision	approve/reject/edit
ai.output_schema_valid	Whether the output passed schema validation
ai.citation_support	Citation support
ai.cost_usd	Cost
ai.latency_ms	Latency
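To make the span tree and attributes concrete, here is a small standard-library sketch of a recorder that nests spans under a root and attaches `ai.*` attributes. The `Recorder` and `Span` classes are hypothetical stand-ins written for this example; a production system would more likely use a tracing SDK such as OpenTelemetry, and the attribute values shown are invented.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0

class Recorder:
    """Toy in-memory tracer; a stand-in for a real tracing SDK."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans

    @contextmanager
    def span(self, name, attributes=None):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id,
                       dict(attributes or {}), time.monotonic_ns())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_ns = time.monotonic_ns()
            self._stack.pop()
            self.spans.append(current)

rec = Recorder()
with rec.span("ai.workflow.run", {"ai.use_case": "lending_policy_qa", "ai.risk_tier": "high"}):
    with rec.span("ai.policy.precheck", {"ai.policy_decision": "allow"}):
        pass
    with rec.span("ai.retrieval.query", {"ai.index_version": "idx-v1"}):
        pass
    with rec.span("ai.model.generate", {"ai.model_id": "example-model", "ai.prompt_version": "p-v1"}):
        pass
```

Because every child carries the root's `trace_id` and its `parent_id`, the whole run can later be reassembled from a flat list of span records.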

5. Retail Financial Services Cases

5.1 AML copilot trace

Evidence needed:

The transactions and alerts being summarized.
The typology / policy source used.
The prompt/config that generated the narrative.
What the analyst changed.
The final case disposition.
Any subsequent QA finding.

5.2 Payment dispute agent trace

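Evidence needed:

Source of the transaction data.
Dispute reason code.
The next step the agent recommended.
Whether a human approved the customer notification.
Whether the refund/denial action was executed by a human.
Whether customer complaints flow back into monitoring.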

5.3 Lending policy RAG trace

Evidence needed:

Policy repository version.
Retrieved policy sections.
Citation support.
Low-confidence escalation.
Reviewer override.
Fair lending / adverse action boundary.
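For the lending policy RAG trace, citation support and low-confidence escalation can be expressed as a small post-check whose result is recorded as span attributes. This is only a sketch under stated assumptions: the `DraftAnswer` shape, the `confidence` score, the `0.7` threshold, and the reason strings are all hypothetical and would need to be defined by the owning risk and policy teams.

```python
from dataclasses import dataclass

@dataclass
class DraftAnswer:
    cited_section_ids: list      # policy sections the answer cites
    retrieved_section_ids: list  # policy sections actually retrieved for this run
    confidence: float            # hypothetical score in [0, 1]

def postcheck(draft: DraftAnswer, min_confidence: float = 0.7) -> dict:
    """Decide allow/escalate and return attributes to attach to ai.policy.postcheck."""
    unsupported = sorted(set(draft.cited_section_ids) - set(draft.retrieved_section_ids))
    if not draft.cited_section_ids:
        decision, reason = "escalate", "no_citations"
    elif unsupported:
        decision, reason = "escalate", "citation_not_in_retrieved_set"
    elif draft.confidence < min_confidence:
        decision, reason = "escalate", "low_confidence"
    else:
        decision, reason = "allow", "grounded"
    return {
        "ai.policy_decision": decision,
        "ai.citation_support": bool(draft.cited_section_ids) and not unsupported,
        "reason": reason,
        "unsupported_citations": unsupported,
    }
```

Recording the reason alongside the decision is what lets a later reviewer override be compared against the gate that fired.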

6. Metrics / SLO / KRI

Category	Metric
Quality	task success, expert score, correction rate
Grounding	citation support, unsupported claim rate
Safety	refusal quality, policy violation, harmful completion
Human control	approval rate, override rate, escalation miss
Reliability	error rate, retry, timeout, fallback usage
Cost	cost per case, token per workflow, cache hit
Latency	p50/p95/p99 end-to-end and per span
Risk	incident rate, complaint linkage, drift alert
Adoption	active users, accepted suggestions, review load
Audit	evidence completeness, missing trace rate

Senior-level product metrics should avoid looking only at:

DAU, message count, token usage

These cannot prove that an AI system is safe, effective, or auditable.
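A few of the metrics in the table can be computed directly from per-run evidence records. The sketch below assumes a hypothetical record shape (`trace_id`, `human_decision`, `latency_ms`, plus the fields in `REQUIRED_EVIDENCE`) and uses a nearest-rank percentile; the choice of required fields and the definition of an override are illustrative assumptions.

```python
import math

# Hypothetical minimum evidence set for a run to count as "complete".
REQUIRED_EVIDENCE = ("prompt_version", "model_id", "policy_decision", "output_hash")

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(runs):
    """Compute audit, human-control, and latency indicators over run records."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    traced = [r for r in runs if r.get("trace_id")]
    complete = [r for r in traced if all(r.get(k) is not None for k in REQUIRED_EVIDENCE)]
    reviewed = [r for r in runs if r.get("human_decision")]
    overridden = [r for r in reviewed if r["human_decision"] in ("reject", "edit")]
    return {
        "missing_trace_rate": (total - len(traced)) / total,
        "evidence_completeness": len(complete) / total,
        "override_rate": len(overridden) / len(reviewed) if reviewed else None,
        "latency_p95_ms": nearest_rank_percentile([r["latency_ms"] for r in runs], 95),
    }
```

Unlike DAU or token usage, each of these numbers maps to a question an auditor or incident reviewer would actually ask.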

7. Failure Modes

Failure	Consequence	Control
Missing trace	Incidents cannot be reconstructed	required instrumentation gate
PII in logs	Privacy and compliance risk	masking, classification, retention
Broken lineage	Output source is unknown	source ids + index version
Unverifiable tool action	Cannot prove who did what	tool event contract
Dashboard theater	Many charts, none supporting decisions	audit query catalog
Non-replayable incident	Context cannot be rebuilt	config/version capture
Over-retention	Too much sensitive data is kept	retention matrix
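The "required instrumentation gate" control for missing traces could look roughly like the check below, run in CI or at release time against a sample trace. The per-tier `REQUIRED_SPANS` sets are assumptions chosen for illustration, reusing the span names from the AI span model; the real requirements would come from the risk tiering policy.

```python
# Hypothetical required span names per risk tier.
REQUIRED_SPANS = {
    "low": {"ai.workflow.run", "ai.model.generate", "ai.output.deliver"},
    "high": {
        "ai.workflow.run",
        "ai.policy.precheck",
        "ai.model.generate",
        "ai.human.approval",
        "ai.policy.postcheck",
        "ai.output.deliver",
    },
}

def instrumentation_gaps(span_names, risk_tier):
    """Return the required span names missing from a trace, sorted for stable output."""
    return sorted(REQUIRED_SPANS[risk_tier] - set(span_names))

def passes_gate(span_names, risk_tier):
    """True when the trace contains every span required for its risk tier."""
    return not instrumentation_gaps(span_names, risk_tier)
```

For example, a trace containing only the three low-tier spans passes for `"low"` but fails for `"high"`, and `instrumentation_gaps` names exactly which steps were not instrumented.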

8. Evidence Event Contract

{
  "id": "event_id",
  "source": "ai.customer_service_agent",
  "type": "ai.tool.call.completed",
  "time": "timestamp",
  "subject": "case_id",
  "data": {
    "run_id": "run_id",
    "agent_id": "agent_id",
    "human_actor": "user_id",
    "risk_tier": "high",
    "tool": "crm.case.note.create",
    "policy_decision": "allow",
    "approval_id": "approval_id",
    "evidence_hash": "hash"
  }
}

The benefit of CloudEvents thinking: an event gets a stable envelope first, and the data schema is extended afterward.
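A small builder and validator shows the envelope-first idea in practice. Note that the CloudEvents 1.0 specification also requires a `specversion` attribute alongside `id`, `source`, and `type`, so the sketch adds it. The helper names `make_event` and `validate_event`, and the choice of which `data` fields are mandatory, are assumptions made for this example.

```python
import json
import uuid
from datetime import datetime, timezone

# Context attributes that CloudEvents 1.0 marks as required.
REQUIRED_ATTRS = ("specversion", "id", "source", "type")

def make_event(source, event_type, subject, data):
    """Wrap a data payload in a CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "subject": subject,
        "datacontenttype": "application/json",
        "data": data,
    }

def validate_event(event, required_data=("run_id", "agent_id", "policy_decision")):
    """Return a list of contract problems; an empty list means the event is acceptable."""
    problems = [f"missing attribute: {a}" for a in REQUIRED_ATTRS if not event.get(a)]
    data = event.get("data") or {}
    problems += [f"missing data field: {f}" for f in required_data if f not in data]
    return problems

event = make_event(
    "ai.customer_service_agent",
    "ai.tool.call.completed",
    "case_id",
    {"run_id": "run_id", "agent_id": "agent_id", "risk_tier": "high",
     "tool": "crm.case.note.create", "policy_decision": "allow"},
)
print(json.dumps(event, indent=2))
```

Keeping envelope validation separate from `data` validation means new event types can add payload fields without touching the routing and storage code that only reads the envelope.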

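9. Interview Talking Points

30-second version:

I would design AI observability as a runtime evidence architecture, not just logs. Every AI run needs a trace that records prompt/config, retrieval, tool call, policy decision, human approval, output, feedback, cost, latency, and incident link. After launch, that supports SLO/KRI monitoring, audit queries, incident review, and continuous improvement.

2-minute version:

For retail financial AI, I would not look only at chat transcripts or model scores. I need to be able to answer where an output came from, which sources it relied on, which prompt and model were used, which tools were called, who approved the action, whether a policy gate fired, and whether the customer was ultimately affected. Architecturally, I would follow OpenTelemetry's trace/span model, use CloudEvents as the envelope for key events, and organize sources and transformation chains with W3C PROV / OpenLineage thinking. Evidence flows into an evidence lake, but with privacy classification, masking, retention, and access control. Operationally, dashboards must cover quality, grounding, safety, human control, latency, cost, incident, and evidence completeness. That is what makes an AI product governable, auditable, and continuously improvable.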

10. Portfolio Exercise

For a lending policy assistant, design:

AI span tree.
Evidence event contract.
Dashboard spec.
Audit query catalog.
Retention matrix.
Incident replay checklist.

Final deliverables:

One runtime evidence architecture diagram.
One SLO/KRI dashboard sketch.
One example incident-review query.
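As a starting point for the incident-review query, the sketch below replays one case's timeline from an in-memory SQLite table. The `evidence_events` schema, the event type names, and the sample rows are all hypothetical; a real evidence lake would have its own storage engine and a much richer schema.

```python
import sqlite3

def build_store():
    """Create an in-memory table standing in for the evidence lake."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE evidence_events (
               run_id TEXT, event_time TEXT, event_type TEXT, subject TEXT,
               prompt_version TEXT, index_version TEXT, policy_decision TEXT)"""
    )
    return conn

def replay_case(conn, subject):
    """Return one case's events in time order, with the versions needed for replay."""
    cur = conn.execute(
        "SELECT event_time, event_type, prompt_version, index_version, policy_decision "
        "FROM evidence_events WHERE subject = ? ORDER BY event_time",
        (subject,),
    )
    return cur.fetchall()

conn = build_store()
conn.executemany(
    "INSERT INTO evidence_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("run-1", "2024-05-01T10:00:03Z", "ai.model.generate.completed", "case-42", "p-v3", "idx-v7", None),
        ("run-1", "2024-05-01T10:00:01Z", "ai.retrieval.query.completed", "case-42", None, "idx-v7", None),
        ("run-1", "2024-05-01T10:00:05Z", "ai.policy.postcheck.completed", "case-42", None, None, "escalate"),
        ("run-2", "2024-05-01T11:00:00Z", "ai.model.generate.completed", "case-77", "p-v3", "idx-v7", None),
    ],
)
for row in replay_case(conn, "case-42"):
    print(row)
```

Ordering by `event_time` works here because the timestamps are fixed-width ISO 8601 strings in UTC, which sort lexicographically in chronological order.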