AI Observability / Cost / SLO Playbook

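Goal: build the ability to turn an AI system from a demo into an operable service. A production-ready AI service must do more than "answer": it must also explain what each answer went through, how much it cost, where it was slow, whether quality is stable, whether risks were intercepted, whether users actually adopt it, and whether business outcomes improve.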

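1. Positioning: Connecting Architecture, Platform, Inference Optimization, and EvalOps

This playbook is the operations-layer connector across four learning assets:

Existing asset	How this playbook connects to it
`docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md`	Makes concrete the monitoring dashboard, incident plan, quality trend, and cost review required by the G4 Architecture Gate, G5 Eval and Risk Gate, G7 Release Gate, and G8 Scale Gate.
`docs/AI_PLATFORM_PM_PLAYBOOK.md`	Turns the model gateway, prompt registry, RAG starter kit, eval harness, audit log, cost dashboard, and adoption analytics into platform-level observability capabilities.
`docs/ai-foundations/papers/07-inference-optimization-kv-cache-flashattention-speculative.md`	Translates TTFT, total latency, KV cache, batching, streaming, caching, model routing, and long-context cost into production metrics and cost policies.
`docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md`	Wires the requirement -> eval contract -> release gate -> monitoring -> incident loop into online traces, judge spans, human review spans, and quality dashboards.

In one sentence:

AI observability is the operational control plane of an AI system. It lets the PM/BA/Architect answer: why this output looks the way it does, where the evidence came from, whether the model or tools failed, whether the cost is reasonable, whether an SLO was violated, whether users adopt the system, and whether the business actually improved.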

In financial and retail settings, the launch evidence for an AI service cannot stop at prompt screenshots or offline eval scores. Real operational evidence must cover:

request trace: the complete path of one user request from entry to output.
prompt / retrieval / tool / judge span: the input, output, version, latency, errors, and risk labels of each key AI action.
token / cost ledger: attribution by use case, team, customer segment, model, route, knowledge base, and tool.
latency: TTFT, total latency, queue wait, retrieval latency, tool latency, judge latency, human review turnaround.
quality: answer quality, retrieval quality, groundedness, citation correctness, policy compliance, judge score, expert review result.
safety: policy block, PII leakage block, unsafe tool attempt, high-risk escalation, human override.
adoption: target user activation, repeat use, accepted suggestion, workflow completion.
business outcome: resolved case, AHT reduction, false positive reduction, first contact resolution, dispute cycle time, forecast accuracy, ROI.

2. Source Anchors

Source	URL	How this playbook uses it
OpenTelemetry GenAI Semantic Conventions	https://opentelemetry.io/docs/specs/semconv/gen-ai/	The standards anchor for naming GenAI traces/spans and designing attributes, events, and metrics. That page notes that the conventions have moved to a separate repository, so an actual rollout should also check the OpenTelemetry semantic conventions genai repository.
NIST AI 600-1 GenAI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Organizes quality, safety, monitoring, incident, and governance evidence with the Govern / Map / Measure / Manage risk framework.
FlashAttention Paper	https://arxiv.org/abs/2205.14135	Explains IO-aware optimization of long-context attention, latency, and memory efficiency, so that long-context cost is not mistaken for a pure prompt problem.
AI Architecture Review Gate Checklists	`docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md`	Treats observability as release/scale gate evidence.
AI Platform PM Playbook	`docs/AI_PLATFORM_PM_PLAYBOOK.md`	Turns observability, cost, and adoption into platform capabilities.
Inference Optimization Note	`docs/ai-foundations/papers/07-inference-optimization-kv-cache-flashattention-speculative.md`	Maps inference-optimization terminology to latency/cost metrics.
AI Requirements-to-Eval Cookbook	`docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md`	Connects online quality metrics and the incident loop to the eval contract.

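3. Why AI Observability Is Not Ordinary APM

Ordinary APM asks whether the service is available, whether endpoints are slow, and whether the error rate is high. The core risks of an AI service go beyond these.

The same HTTP 200 can stand for very different results:

The answer is fast but cites the wrong policy.
The model raised no error, but retrieval returned outdated knowledge.
The tool call succeeded, but the action policy should not have allowed it.
The judge gave a high score, but the judge prompt version has drifted.
The user saw the answer and did not adopt it, and the business process did not improve.
Per-request cost looks normal, but long context blows the budget at peak hours.

AI observability therefore has to move from "is the system running normally" to "is the AI's behavior explainable, acceptable against requirements, governable, and optimizable".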

Core differences:

Dimension	Ordinary service observability	AI service observability
Definition of success	uptime, error rate, latency	Successful answer, adopted by the user, correct evidence, controlled risk, improved business metrics
Request structure	API call	request + prompt + retrieval + model + tool + judge + human review
Failure modes	5xx, timeout, dependency error	hallucination, ungrounded answer, bad retrieval, unsafe tool use, policy violation, cost overrun
Version control	code version, config version	model, prompt, retriever, index, chunking, reranker, judge, policy, tool schema
Cost attribution	infra cost, request count	tokens, embedding, rerank, tool/API, human review, eval, cache miss, long context
Release evidence	load test, SLA, monitoring	eval report, risk-tiered SLO, trace sampling, quality dashboard, incident playbook

4. AI Observability Stack

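A production-grade AI service needs at least nine classes of observable objects.

4.1 Request Trace

A request trace is the complete timeline of one AI request. It answers:

Who triggered the request, and in which workflow?
Which use case, risk tier, tenant, and customer segment does the request belong to?
Which prompt, model, retriever, index, tool, judge, and policy were used?
Which step took the longest, cost the most, degraded quality, or was blocked by a safety policy?
Was the output accepted, edited, overridden, or escalated to a human?

Minimum fields:

Field	Meaning
`trace_id`	Unique identifier for the end-to-end path.
`request_id`	Business request identifier; can be linked to a case id / ticket id / workflow id.
`use_case_id`	For example `customer_service_rag`, `aml_copilot`.
`risk_tier`	low / medium / high / critical.
`user_role`	branch agent, AML analyst, credit policy analyst, QA reviewer.
`tenant_id`	Attribution to a tenant or business line.
`workflow_step`	classify, retrieve, draft, review, submit, escalate.
`route_policy_version`	Version of the model routing and guardrail policy.
`eval_profile`	low_risk, customer_facing, regulated_decision_support.
`outcome`	success, blocked, escalated, failed, overridden.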
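As an illustration of the minimum field set above, the following is a minimal sketch of a trace-level record with enum validation. The class name `RequestTrace` and the constant names are assumptions made for this example; only the field names and the allowed values come from the table.

```python
from dataclasses import dataclass

# Allowed values copied from the minimum-field table above.
RISK_TIERS = {"low", "medium", "high", "critical"}
OUTCOMES = {"success", "blocked", "escalated", "failed", "overridden"}


@dataclass(frozen=True)
class RequestTrace:
    """Hypothetical trace-level record; field names follow the table."""
    trace_id: str
    request_id: str
    use_case_id: str
    risk_tier: str
    user_role: str
    tenant_id: str
    workflow_step: str
    route_policy_version: str
    eval_profile: str
    outcome: str

    def __post_init__(self) -> None:
        # Reject values outside the documented enums so that dashboards
        # never have to group by free-text variants.
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"unknown risk_tier: {self.risk_tier!r}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")


trace = RequestTrace(
    trace_id="trc_001",
    request_id="ticket_849201",
    use_case_id="customer_service_rag",
    risk_tier="medium",
    user_role="branch agent",
    tenant_id="retail_banking",
    workflow_step="draft",
    route_policy_version="route_v12",
    eval_profile="customer_facing",
    outcome="success",
)
```

Validating at construction time is one possible design choice; a team could equally enforce the enums in the collector or the schema registry.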

4.2 Model Call Span

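A model call span records every model invocation. It should not record only the model name; it should capture why the model was chosen, the input and output size, versions, latency, cost, and quality signals.

Key fields:

model.provider: model vendor or internal service.
model.name: model or deployment name.
model.version: vendor version, deployment version, or internal alias.
prompt.template_id / prompt.version: template and version in the prompt registry.
input_tokens / output_tokens / reasoning_tokens: token volume.
ttft_ms: time to first token.
total_latency_ms: time to the complete response.
temperature / top_p / max_output_tokens: generation parameters.
route_reason: fast_path, high_risk_route, fallback, budget_route, quality_retry.
cache_status: hit, miss, bypass, stale_blocked.
cost_usd: cost of this model call.
finish_reason: stop, length, tool_call, content_filter, error.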

4.3 Retrieval Span

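A retrieval span tells the team whether RAG retrieved the right evidence, instead of leaving them to look only at the final answer.

Key fields:

knowledge_base_id / index_version: knowledge base and index version.
chunking_policy_version: chunking policy version.
embedding_model: embedding model version.
query_rewrite_version: query rewrite policy version.
retrieval_top_k: number of items retrieved.
reranker_model: rerank model.
filter_policy: filters on permission, region, product, customer type, and date.
retrieved_doc_ids: identifiers of the retrieved documents.
citation_doc_ids: identifiers of the documents cited in the final answer.
retrieval_latency_ms: retrieval latency.
retrieval_quality_score: score from an offline or online judge, or from expert sampling.
freshness_days: age of the evidence relative to the current date.
permission_filter_result: allowed, filtered, blocked.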

4.4 Tool Call Span

A tool call span observes how an agent or workflow calls external systems, especially high-risk actions in financial and retail settings.

Key fields:

tool.name: for example case_lookup, payment_dispute_status, aml_case_notes, policy_search.
tool.risk_level: read_only, write_low, write_high, irreversible.
tool.schema_version: version of the tool input/output schema.
tool.input_hash: digest of the input, to avoid leaking sensitive plaintext.
tool.latency_ms: tool call latency.
tool.error_code: business or technical error.
policy_decision: allowed, blocked, requires_human_approval.
approval_id: human approval record.
side_effect: none, draft_created, case_updated, message_sent, payment_action_requested.

4.5 Judge Span

A judge span records what happened during automated evaluation. It is the key to bringing the eval cookbook online.

Key fields:

judge.type: deterministic, llm_as_judge, expert_sampling_trigger, hybrid.
judge.model / judge.prompt_version: the LLM judge's model and rubric version.
eval_requirement_id: the matching requirement in the requirements-to-eval matrix.
rubric_id / rubric_version: scoring rules.
score: numeric score.
pass_fail: whether the threshold was met.
severity: S0 / S1 / S2 / S3.
judge_latency_ms: evaluation latency.
judge_cost_usd: evaluation cost.
failure_reason: unsupported_claim, bad_citation, unsafe_advice, tool_policy_violation.

4.6 Human Review Span

AI in financial settings should not pretend that "fully automated" means "mature". A human review span makes human approval, sampled QA, and expert review measurable.

Key fields:

review.type: pre_send_approval, post_hoc_qa, expert_review, risk_escalation.
reviewer_role: QA, compliance, AML lead, credit policy owner, branch supervisor.
review_decision: approved, edited, rejected, escalated.
edit_distance: how much the human changed the output.
review_latency_ms: time from entering review to completion.
override_reason: wrong_policy, missing_evidence, tone_issue, regulatory_risk, customer_context_missing.
linked_eval_case: whether the case was converted into a golden set or regression case.

4.7 Cost Ledger

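The cost ledger is the system of record for cost. It must not merely total the bill by model; it must attribute cost to business value.

Cost items: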

input token cost。
output token cost。
reasoning token cost。
embedding cost。
vector DB / retrieval cost。
rerank cost。
tool/API cost。
judge/eval cost。
human review cost。
cache storage / invalidation cost。
monitoring and trace storage cost。
peak capacity / idle capacity cost。

Attribution dimensions:

use case。
business unit。
workflow step。
customer segment。
model route。
risk tier。
prompt version。
knowledge base。
tool。
user role。
success / failure outcome。

4.8 Quality Dashboard

A quality dashboard is not a one-off eval report; it shows the quality trend after launch.

At a minimum, show:

pass rate by requirement。
groundedness score trend。
citation correctness。
retrieval recall / precision proxy。
unsupported claim rate。
policy violation rate。
judge disagreement rate。
human rejection / edit / override rate。
quality by model route。
quality by knowledge base version。
quality by customer segment / product line。

4.9 Incident Timeline

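An AI incident timeline records the whole process from the anomaly appearing through recovery, postmortem, and regression test updates.

Timeline stages:

detection: found through a dashboard, user feedback, QA sampling, a risk alert, or a cost anomaly.
triage: determine the blast radius, the risk level, and whether to disable the feature.
containment: disable the prompt/model/tool route, switch to a fallback, escalate to humans, freeze high-risk actions.
diagnosis: localize to the model, prompt, retrieval, tool, policy, judge, data freshness, or user workflow.
correction: fix configuration, knowledge base, prompt, routing, policy, tool schema, or user education.
validation: offline regression, shadow run, canary release.
postmortem: add to the golden set, SLOs, dashboards, runbook, and owner assignments.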

5. Reference Architecture: From a Single Request to an Operational Closed Loop

flowchart LR
  U[User / Workflow] --> GW[AI Gateway]
  GW --> POL[Policy and Risk Tiering]
  POL --> TRACE[Trace Context]
  TRACE --> RET[Retrieval Span]
  RET --> MC[Model Call Span]
  MC --> TOOL[Tool Call Span]
  TOOL --> JUDGE[Judge Span]
  JUDGE --> HR[Human Review Span]
  HR --> OUT[Response / Action]
  OUT --> FB[User Feedback and Adoption]
  TRACE --> LEDGER[Cost Ledger]
  JUDGE --> QD[Quality Dashboard]
  FB --> BO[Business Outcome Metrics]
  LEDGER --> CR[Cost Review]
  QD --> INC[Incident Timeline]
  BO --> QR[Quarterly AI Review]

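Design principles:

Trace context must start at the entry point; do not wait until the model call to backfill logs.
Every span must carry use case, risk tier, owner, version, and outcome.
Quality signals must link back to a requirement id and an eval case.
The cost ledger must explain "how much one successful business outcome cost", not "how many tokens were called this month".
The tool span of a high-risk action must be bound to a human approval / policy decision.
Dashboards must serve PM, BA, Architect, SRE, Risk, and Finance at once: different roles look at the same facts from different angles.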

6. Metrics Taxonomy

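AI metrics should be layered; do not pile every metric onto a single dashboard.

6.1 Latency Metrics

Metric	Definition	Why it matters	Common diagnoses
TTFT	Time from request arrival to the first token	Drives perceived speed	model cold start, queue wait, long-prompt prefill, provider network
Total latency	Time from request arrival to the complete result	Drives workflow completion time	slow retrieval, slow tools, slow judge, overly long output
Queue wait	Time spent waiting for model capacity or a worker	Reflects capacity and rate limiting	peak concurrency, batch policy, provider rate limits
Retrieval latency	Time for query rewrite, vector search, and rerank	Key to RAG experience and cost	index performance, top_k too large, complex metadata filters
Tool latency	Time for external tool calls	Bottleneck of agent workflows	slow downstream systems, retries, permission checks
Judge latency	Time for automated evaluation	Cost of release gates and online quality control	judge model too large, rubric too long, too much synchronous evaluation
Human review turnaround	Time to complete human review	The real bottleneck of the SLA in high-risk flows	approval queues, unclear roles, overly broad escalation rules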

6.2 Token and Context Metrics

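Metric	Definition	Diagnostic value
Input tokens	Token count of the prompt, system instruction, context, and conversation history	Detects context bloat, RAG over-fetching, and template redundancy.
Output tokens	Number of tokens the model generates	Affects cost, latency, and review time.
Reasoning tokens	Internal reasoning tokens, for models that report this dimension	Explains cost and latency variance on complex tasks.
Context utilization	Context actually used / maximum context	Identifies long-context overuse.
Evidence token ratio	Evidence tokens / total input tokens	Shows whether RAG spends the budget on evidence.
Conversation carryover tokens	Tokens of multi-turn conversation history	Identifies cost growth caused by uncompressed history.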
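The two ratio metrics in the table above are simple divisions. The sketch below shows them with illustrative numbers; the function names, the 128,000-token window, and the token counts are assumptions chosen for this example, not values from a real system.

```python
def context_utilization(input_tokens: int, max_context_tokens: int) -> float:
    """Context actually used divided by the model's maximum context."""
    if max_context_tokens <= 0:
        raise ValueError("max_context_tokens must be positive")
    return input_tokens / max_context_tokens


def evidence_token_ratio(evidence_tokens: int, input_tokens: int) -> float:
    """Share of the input spent on retrieved evidence (0.0 for empty input)."""
    if input_tokens == 0:
        return 0.0
    return evidence_tokens / input_tokens


# Illustrative request: 6,000 input tokens, of which 1,500 are evidence,
# sent to a model with an assumed 128,000-token window.
utilization = context_utilization(6_000, 128_000)
evidence_share = evidence_token_ratio(1_500, 6_000)
print(f"context utilization: {utilization:.3f}")
print(f"evidence token ratio: {evidence_share:.2f}")
```

A low evidence token ratio on a RAG route suggests that most of the input budget goes to templates or history rather than to evidence.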

6.3 Cache Metrics

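Metric	Definition	Key question
Prompt cache hit rate	Share of hits on identical or reusable prompt/context	Does it reduce repeated prefill cost?
Retrieval cache hit rate	Share of hits on high-frequency queries or document chunks	Are FAQ/policy lookups being reused?
Answer cache hit rate	Share of hits on directly reusable answers	Can latency be reduced in low-risk scenarios?
Cache bypass rate	Share of requests that bypass the cache because of permission, risk, version, or sensitive data	Is the cache key well designed?
Stale cache block rate	Share of cache uses blocked because the knowledge version is out of date	Are knowledge updates and cache invalidation reliable?

In financial settings, the cache key includes at least: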

tenant / business unit。
user role / permission scope。
customer segment。
product / jurisdiction。
policy version。
knowledge base version。
risk tier。
prompt version。
tool permission profile。
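One way to make sure none of the dimensions listed above is forgotten is to derive the key from a fixed field list and hash a canonical serialization. This is a minimal sketch: the field names, `build_cache_key`, and the example scope values are hypothetical, and SHA-256 over sorted JSON is just one reasonable choice.

```python
import hashlib
import json

# Dimensions from the list above; the snake_case names are assumptions.
CACHE_KEY_FIELDS = (
    "tenant", "permission_scope", "customer_segment", "product",
    "jurisdiction", "policy_version", "kb_version", "risk_tier",
    "prompt_version", "tool_permission_profile",
)


def build_cache_key(normalized_query: str, scope: dict) -> str:
    """Return a stable digest over the query plus every scoping dimension."""
    missing = [f for f in CACHE_KEY_FIELDS if f not in scope]
    if missing:
        # Refuse to build a key that silently ignores a dimension.
        raise KeyError(f"missing cache key fields: {missing}")
    payload = {"q": normalized_query}
    payload.update({f: scope[f] for f in CACHE_KEY_FIELDS})
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


scope = {
    "tenant": "retail_banking_us", "permission_scope": "contact_center_agent",
    "customer_segment": "mass_retail", "product": "credit_card",
    "jurisdiction": "US-IL", "policy_version": "pol_2026_06",
    "kb_version": "kb_v41", "risk_tier": "low",
    "prompt_version": "prompt_v18", "tool_permission_profile": "read_only",
}
key = build_cache_key("late fee policy", scope)
```

Because every dimension is part of the digest, a policy or knowledge base version bump yields a new key, so stale entries are never served under the new version.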

6.4 Retrieval Quality Metrics

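Metric	Definition	How to apply
Recall@k	Whether the correct evidence appears in the top-k	Offline eval on a golden set.
MRR / nDCG	Whether the correct evidence is ranked near the top	Evaluating the reranker.
Citation precision	Whether the cited sources support the claims in the answer	Automated judge + expert sampling.
Freshness	Whether the evidence falls within the allowed time window	Policies, rates, regulatory requirements.
Permission correctness	Whether documents the user should not see are filtered out	Permission tests and audits.
Evidence coverage	Whether every key conclusion in the answer has evidence	Groundedness evaluation.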
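For offline evaluation on a golden set, Recall@k and MRR can be computed as in the sketch below. It uses the common fraction-of-relevant-documents form of Recall@k, which is slightly stricter than the "does the correct evidence appear" reading in the table. The document ids and function names are made up for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) golden-set cases."""
    scores: List[float] = [reciprocal_rank(r, rel) for r, rel in cases]
    return sum(scores) / len(scores) if scores else 0.0


# Two hypothetical golden-set cases.
golden = [
    (["d7", "d2", "d9", "d4"], {"d2", "d4"}),
    (["a", "b"], {"a"}),
]
recall = recall_at_k(golden[0][0], golden[0][1], k=3)
mrr = mean_reciprocal_rank(golden)
```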

6.5 Judge and Quality Metrics

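Metric	Definition	Risk
Judge score	Quality score assigned against the rubric	Must not be the sole basis for releasing high-risk output.
Pass rate by requirement	Pass rate of each requirement	Locates quality gaps at the requirement level.
Groundedness	Whether the output is supported by the retrieved evidence	Essential for RAG, policy assistants, and customer service.
Unsupported claim rate	Share of claims without evidence	Must be tightly controlled in high-risk scenarios.
Judge drift	Score drift caused by judge version changes	Requires versioning and a calibration set.
Expert disagreement rate	Share of cases where experts and the judge disagree	Indicates whether the judge can be trusted.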

6.6 Tool and Agent Metrics

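Metric	Definition	Diagnosis
Tool call success rate	Share of successful tool calls	API stability, schema fit.
Tool error rate	Share of technical and business errors	Downstream systems, input validation, permissions.
Tool retry rate	Share of retries	Wrong model-generated parameters, service timeouts.
Unsafe tool attempt rate	Share of high-risk calls blocked by policy	The agent's tendency to overreach.
Action rollback rate	Share of executed actions that were reversed	Weak workflow design or insufficient approval.
Human approval required rate	Share of requests that need human approval	Whether risk tiering is too broad or too narrow.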

6.7 Safety and Risk Metrics

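Metric	Definition	Examples
Safety block rate	Share of content or actions blocked	Investment advice, credit commitments, sensitive personal information.
PII leakage block	Number of blocked PII leakage risks	Customer service, KYC, payment disputes.
High-risk escalation rate	Share escalated to humans	AML, credit, complaints, regulatory inquiries.
Human override rate	Share of AI suggestions overturned by humans	Quality, risk, workflow fit.
Critical failure count	Number of S0/S1 failures	Mandatory for release/scale gates.
Policy exception rate	Share that triggers exception handling	Unclear compliance policy or system rules.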

6.8 Adoption Metrics

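Metric	Definition	Notes
Target user activation	Share of people in the target role who actually start using the service	Do not look at total users in general.
Repeat usage	Sustained use by the same target role	Indicates whether the service is embedded in the workflow.
Suggestion acceptance rate	Share of AI suggestions adopted	Distinguish accepted as-is from accepted after edits.
Edit distance	How much users modify the AI output	Indirect signal of quality and trust.
Workflow completion rate	Whether the workflow completes after AI is used	Guards against improving only a local experience.
Abandonment rate	Share of users who abandon after AI steps in	May mean slow, wrong, untrusted, or a clumsy flow.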
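The table above does not prescribe how edit distance is computed. One lightweight option, shown here as a sketch under that assumption, is a normalized distance based on the standard library's `difflib.SequenceMatcher`, together with an acceptance breakdown that keeps "accepted as-is" and "accepted after edits" separate. The event labels and function names are hypothetical.

```python
import difflib
from typing import Dict, List

EVENT_TYPES = ("accepted_as_is", "accepted_with_edit", "rejected")


def normalized_edit_distance(ai_text: str, final_text: str) -> float:
    """0.0 means the user kept the AI text unchanged; 1.0 means no overlap."""
    ratio = difflib.SequenceMatcher(None, ai_text, final_text).ratio()
    return 1.0 - ratio


def acceptance_breakdown(events: List[str]) -> Dict[str, float]:
    """Share of each outcome, keeping as-is and edited acceptance apart."""
    total = len(events)
    if total == 0:
        return {name: 0.0 for name in EVENT_TYPES}
    return {name: events.count(name) / total for name in EVENT_TYPES}


events = ["accepted_as_is", "accepted_with_edit", "rejected", "accepted_as_is"]
breakdown = acceptance_breakdown(events)
distance = normalized_edit_distance(
    "Your late fee is waived once per year.",
    "Your late fee can be waived once per calendar year.",
)
```

Character-level similarity is cheap but blunt; a team that cares about meaning-changing edits may prefer a token-level or semantic measure.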

6.9 Business Outcome and ROI Metrics

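Metric	Definition	Examples
Cost per resolved case	AI + human cost of completing one business case	Customer service, payment disputes, AML.
Cost per successful answer	Cost of answers that pass the quality threshold and are adopted	RAG, policy assistants.
AHT reduction	Reduction in average handling time	Customer service and operations.
Rework reduction	Reduction in rework rate	Credit policy, complaint handling.
False positive reduction	Fewer false alarms	AML, fraud, risk control.
SLA attainment	Attainment rate of the business process SLA	Payment disputes, customer service response.
Revenue / loss impact	Revenue gain, retention, loss reduction	Retail demand forecasting, marketing, fraud.
ROI	(business benefit - total cost) / total cost	The CFO cares about net benefit, not just the per-token price.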

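7. Metric Tree: Deriving Observability Metrics Backward from Business Outcomes

Do not start from "what can we collect"; work backward from the business goal.

Example: Customer Service RAG

Business outcome
  First contact resolution goes up
  Average handling time goes down
  Complaint escalation goes down
  Compliance incidents stay at 0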

Operational outcomes
  Answer accepted by agent
  Answer has correct citation
  High-risk answer escalated
  Response visible within workflow SLA

AI quality drivers
  Retrieval recall@k
  Citation precision
  Groundedness score
  Unsupported claim rate
  Judge pass rate
  Human edit distance

System drivers
  TTFT
  Total latency
  Retrieval latency
  Tool error rate
  Cache hit rate
  Token per answer
  Cost per successful answer

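Key judgments for the PM/BA/Architect:

If business metrics have not improved but AI metrics look good, the AI is probably not embedded in the real workflow.
If adoption is high but quality is falling, users may be relying on a risky shortcut.
If quality is high but cost is uncontrolled, the scale gate is not ready.
If latency meets the target but human review turnaround is slow, the real bottleneck is the operating model.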

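8. SLO / SLA: An AI SLO Cannot Be Uptime Alone

8.1 Why Uptime Is Not Enough

An AI service can be 99.9% available and still not be fit to launch:

It keeps giving answers without evidence.
It fails to escalate to a human in high-risk scenarios.
It runs over budget, so a single case is not economically viable.
Its p95 latency meets the API SLA, but users wait 24 hours for human review.
It stays within the technical error rate but violates compliance wording.

An AI SLO therefore has to cover five classes of objectives:

Availability: the service is available.
Latency: it returns within the time the business workflow allows.
Quality: the output meets the requirements and eval thresholds.
Safety: risky behavior is blocked or escalated.
Cost: unit economics stay within budget.

An SLA is an external or cross-team commitment and is usually smaller, more stable, and more conservative. An SLO is an internal operating target and can be finer-grained and iterate faster. For AI systems, establish SLOs first and form SLAs cautiously afterwards.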

8.2 Risk-Tiered AI SLO Matrix

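Risk tier	Scenario	Quality SLO	Latency SLO	Cost SLO	Safety SLO	Human review
Low	Internal FAQ, product knowledge search	grounded answer pass rate >= 90%; unsupported claim <= 3%	p95 total latency <= 5s	cost per successful answer <= target value	blocked unsafe content >= 99%	Sampled review
Medium	Contact-center agent drafts, operations analysis summaries	citation correctness >= 95%; human reject <= 8%	p95 TTFT <= 2s; p95 total <= 8s	cost per resolved case <= 30%-50% of the labor cost saved	high-risk query escalation >= 99%	QA sampling + pre-send review for high-risk items
High	AML copilot, credit policy assistant, payment dispute recommendations	S0 critical failure = 0; expert pass >= 98%	p95 total <= workflow window; review is not sacrificed	cost per accepted recommendation <= approved budget	prohibited final decision by AI = 0	Human approval must be retained
Critical	Automated irreversible financial actions, regulatory submissions	Models are generally not allowed to execute the final action independently	Safety and approval take priority	Separately approved budget	unsafe action escape = 0	Sign-off by an accountable human owner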
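A row of the matrix can be turned into a machine-checkable rule set. The sketch below encodes the numeric thresholds of the Medium tier and reports which ones a set of observed metrics violates. The metric keys, the tuple layout, and `slo_violations` are assumptions made for this example; the thresholds are the ones in the matrix.

```python
from typing import Dict, List, Tuple

# Thresholds taken from the Medium row; key names are hypothetical.
MEDIUM_TIER_SLO: Dict[str, Tuple[str, float]] = {
    "citation_correctness": (">=", 0.95),
    "human_reject_rate": ("<=", 0.08),
    "p95_ttft_s": ("<=", 2.0),
    "p95_total_latency_s": ("<=", 8.0),
    "high_risk_escalation_rate": (">=", 0.99),
}


def slo_violations(observed: Dict[str, float],
                   slo: Dict[str, Tuple[str, float]]) -> List[str]:
    """Return a human-readable line for every SLO that is missed."""
    violations: List[str] = []
    for metric, (op, target) in slo.items():
        value = observed.get(metric)
        if value is None:
            # An unmeasured SLO is treated as a gap, not as a pass.
            violations.append(f"{metric}: not measured")
        elif op == ">=" and value < target:
            violations.append(f"{metric}: {value} < {target}")
        elif op == "<=" and value > target:
            violations.append(f"{metric}: {value} > {target}")
    return violations


observed = {
    "citation_correctness": 0.93,
    "human_reject_rate": 0.05,
    "p95_ttft_s": 1.7,
    "p95_total_latency_s": 7.4,
    "high_risk_escalation_rate": 0.995,
}
report = slo_violations(observed, MEDIUM_TIER_SLO)
```

Treating a missing measurement as a violation is a deliberate choice in this sketch: an SLO that nobody measures should not look green.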

8.3 AI SLO Design Steps

Define the use case and workflow step.
Mark the risk tier.
Turn the requirements into eval requirement ids.
Design an observable online signal for each requirement.
Split the latency budget across queue, retrieval, model, tool, judge, and human review.
Set unit metrics for cost: successful answer, resolved case, accepted recommendation.
Define safety escapes: which errors must be 0.
Define the error budget policy: when to rate-limit, roll back, degrade, or pause scaling.
Define the review cadence: daily pilot review, weekly ops review, monthly cost review, quarterly AI review.

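8.4 What Error Budget Means for AI

An ordinary error budget covers availability errors. An AI error budget has to be split into buckets:

quality budget: how many low-severity quality failures are allowed.
latency budget: how many requests may exceed p95/p99.
cost budget: how much budget variance is allowed.
safety budget: high-risk escapes usually have no budget; the target is 0.
adoption budget: if adoption falls below the threshold, the feature has not been successfully productized.

Examples:

Violation	Action
Low-risk FAQ unsupported claim above 3% for 3 consecutive days	Trigger a knowledge base and prompt review; pause traffic expansion.
Customer service high-risk escalation below 99%	Switch to the conservative policy immediately; all related questions go to human review.
1 AML S0 failure occurs	Stop automated draft publishing, enter the incident process, add eval regression cases.
Cost per resolved case more than 20% over budget for 2 weeks	Start a model routing / cache / prompt compression review.
Adoption below target while quality meets the target	PM/BA redo the workflow research and do not blame the model.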
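Rules such as "above 3% for 3 consecutive days" are easy to get subtly wrong when implemented as a simple average. A minimal sketch of the consecutive-breach check follows; the function name and the sample daily rates are illustrative assumptions.

```python
from typing import Sequence


def breached_consecutive_days(daily_rates: Sequence[float],
                              threshold: float,
                              days: int) -> bool:
    """True if the rate exceeds the threshold on `days` days in a row."""
    run = 0
    for rate in daily_rates:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if rate > threshold else 0
        if run >= days:
            return True
    return False


# Unsupported claim rate per day for a low-risk FAQ use case (made up).
steady_breach = [0.020, 0.035, 0.040, 0.031]
interrupted = [0.035, 0.020, 0.040, 0.040]

print(breached_consecutive_days(steady_breach, threshold=0.03, days=3))
print(breached_consecutive_days(interrupted, threshold=0.03, days=3))
```

The first series triggers the review because days 2 to 4 all exceed 3%; the second does not, because the run is interrupted on day 2.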

9. Cost Management: From the Token Bill to Unit Economics

9.1 Cost Formulas

AI cost is not the same as the model bill. A more complete formula:

cost per request =
  input token cost
+ output token cost
+ reasoning token cost
+ embedding / retrieval / rerank cost
+ tool and downstream API cost
+ judge / eval cost
+ human review cost
+ trace and monitoring storage cost
+ cache storage / invalidation cost
+ peak capacity and fallback cost

Business decisions care more about:

cost per successful answer =
  total AI service cost for accepted answers
/ number of answers that pass quality threshold and are accepted

cost per resolved case =
  total AI + operation cost for resolved cases
/ number of cases resolved without rework or escalation beyond threshold

AI ROI =
  (measurable business benefit - incremental AI operating cost)
/ incremental AI operating cost

9.2 Unit Economics

Scenario	Cost unit	Benefit unit	Key judgment
Customer Service RAG	cost per successful answer / resolved ticket	Lower AHT, higher FCR, lower QA cost	AI cost must be lower than the agent time and rework cost it saves.
AML Copilot	cost per accepted investigation summary	analyst time saving, false positive reduction, case consistency	Cost must not override risk control; human review remains a cost item.
Credit Policy Assistant	cost per compliant policy answer	policy query time saving, fewer errors	Groundedness and the correct policy version matter more than a low price.
Payment Dispute Assistant	cost per resolved dispute	cycle time reduction, write-off reduction, customer satisfaction	Tool call and approval costs must be included.
Retail Demand Analyst	cost per forecast / insight accepted	inventory reduction, stockout reduction, margin improvement	The value lies in business decision quality, not in the number of chats.

9.3 Budget Guardrail

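Budget guardrails must take effect at runtime, not when the bill is read at month end.

Suggested mechanisms:

Each use case has a monthly budget, a daily burn alert, a per-request soft limit, and a hard stop.
Each risk tier has an allowed model set and a maximum-context policy.
High-cost routes must record route_reason.
Low-risk, high-frequency requests prefer cache / small model / batch.
High-risk requests may cost more, but need approval from both the business owner and the risk owner.
Set automatic interception for abnormal token growth: when input tokens exceed 2x the historical p95, trigger compression or a prompt to the user.
Every prompt, retrieval, model, or judge version change must be compared for cost delta.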
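The "2x the historical p95" rule for abnormal input growth can be sketched as follows. The nearest-rank percentile, the function names, and the returned action labels are assumptions for illustration; a production system would more likely read the p95 from its metrics store.

```python
from typing import Sequence


def p95_nearest_rank(values: Sequence[int]) -> int:
    """95th percentile using the nearest-rank method (integer arithmetic)."""
    if not values:
        raise ValueError("history must not be empty")
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def token_guardrail(input_tokens: int,
                    history: Sequence[int],
                    multiplier: float = 2.0) -> str:
    """Return a hypothetical action label for one incoming request."""
    limit = multiplier * p95_nearest_rank(history)
    if input_tokens > limit:
        return "compress_or_prompt_user"
    return "allow"


# Made-up history of input token counts: 100, 200, ..., 2000.
history = list(range(100, 2100, 100))
print(p95_nearest_rank(history))       # 1900 for this history
print(token_guardrail(4000, history))  # above 2 x 1900
print(token_guardrail(3000, history))  # within the limit
```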

9.4 Model Routing

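Model routing is the shared control point for cost, quality, latency, and risk.

Example policy:

Condition	Route
Low risk, high frequency, cacheable answer	small / fast model + answer cache.
Medium-risk customer service draft	fast model generates, judge checks, stronger model retries on failure.
High-risk AML / credit policy	stronger model + stricter retrieval + judge + human approval.
Long document summarization	chunk summarize + hierarchical synthesis, rather than filling the long context directly.
Cost budget near its limit	Downgrade to a cheaper approved model, shorten context, turn off non-critical judges.
Model provider outage	fallback provider or manual workflow.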
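The routing table can be expressed as an ordered set of rules that returns both a route and a `route_reason`. The sketch below covers only part of the table (it omits long-document summarization and the judge-triggered retry), and the precedence order, the route labels, and the treatment of the critical tier are assumptions, since the table does not specify them.

```python
from typing import Tuple


def choose_route(risk_tier: str,
                 cacheable: bool,
                 budget_near_limit: bool,
                 provider_healthy: bool) -> Tuple[str, str]:
    """Return (route, route_reason). Rule order is an assumed precedence."""
    if risk_tier not in {"low", "medium", "high", "critical"}:
        raise ValueError(f"unknown risk_tier: {risk_tier!r}")
    if not provider_healthy:
        # Provider outage: fall back before anything else.
        return ("fallback_provider_or_manual_workflow", "fallback")
    if risk_tier in {"high", "critical"}:
        # Assumption: high-risk traffic is never downgraded for budget reasons.
        return ("strong_model+strict_retrieval+judge+human_approval",
                "high_risk_route")
    if budget_near_limit:
        return ("cheaper_approved_model+short_context", "budget_route")
    if risk_tier == "low" and cacheable:
        return ("small_model+answer_cache", "fast_path")
    # Default: fast model plus judge; a failed judge check would trigger a
    # stronger-model retry, recorded with route_reason "quality_retry".
    return ("fast_model+judge", "fast_path")


print(choose_route("low", True, False, True))
print(choose_route("medium", False, True, True))
print(choose_route("high", False, True, True))
```

Returning the reason together with the route keeps the decision auditable in the model call span.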

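Routing records must answer:

Why was this model chosen?
Was it because of risk, quality, latency, cost, or fallback?
Did quality and cost after routing match expectations?
Which requests were downgraded, and did that affect business outcomes?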

9.5 Cache

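Cache is a cost and latency optimization, and also an object of risk control.

Cacheable:

Public or low-risk FAQ.
Versioned policy snippets.
Candidate sets of retrieval results.
High-frequency embeddings.
Low-risk judge results.

Cache with caution:

Context containing customers' personal information.
High-risk decision recommendations.
Answers that depend on real-time rates, account status, or case status.
Cross-tenant knowledge that requires permission filtering.

The cache policy must record:

cache key.
knowledge version.
permission scope.
expiry.
invalidation trigger.
bypass reason.
impact of hit/miss on latency and cost.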

9.6 Batch

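Batch suits asynchronous, low-interaction, queueable tasks:

Bulk ticket classification.
Overnight AML case summarization.
Retail demand analysis reports.
Large-scale knowledge base eval.
Monthly policy drift checks.

Not suitable for:

Real-time conversation suggestions for agents.
Payment actions that need immediate risk interception.
Transaction dispute handling while the user is waiting.

Batch metrics: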

throughput。
queue wait。
batch size。
cost per batch。
failed item rate。
retry rate。
completion before business deadline。

9.7 Long-Context Cost

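Long context is not a free capability. IO-aware techniques such as FlashAttention relieve the memory read/write bottleneck of attention, but they do not eliminate the token, KV cache, latency, and quality problems that long context brings.

The PM/BA/Architect should avoid three misconceptions:

Misconception 1: the model supports 128K context, so all documents should be stuffed in. Correction: first do retrieval, chunking, context budgeting, and evidence selection.
Misconception 2: long context can replace knowledge governance. Correction: versions, permissions, freshness, and the source of truth still need to be managed.
Misconception 3: latency is purely an engineering problem. Correction: requirement design, evidence scope, output length, and review strategy all affect latency.

Long-context control metrics: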

input tokens by request。
context utilization。
evidence token ratio。
irrelevant context rate。
KV cache memory pressure。
p95 prefill latency。
cost per long-context request。
quality delta compared with RAG route。

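9.8 How FinOps and RiskOps Relate

FinOps makes cloud and AI cost visible, attributable, and optimizable. RiskOps makes risk visible, controllable, and auditable. An AI service needs both working together.

Question	FinOps view	RiskOps view	Joint decision
Use a stronger model?	Higher cost per call	Fewer high-risk failures	Decide the route by risk tier and business value.
Reduce judging?	Lower cost and latency	Quality control may be lost	Sample for low risk, keep it for high risk.
Enable cache?	Lower cost and latency	Permissions or knowledge may be stale	Cache key and invalidation must pass review.
Expand traffic?	Unit cost falls or rises	Risk exposure increases	The scale gate requires cost + quality + safety to pass together.
Automate tool actions?	Lower labor cost	Higher operational risk	Irreversible actions keep human approval.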

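10. Observability Blueprints for Financial and Retail Scenarios

10.1 AML Copilot

Purpose: help AML analysts summarize cases, explain suspicious patterns, and generate a SAR draft or investigation narrative.

Dimension	Design
Risk tier	High. AI must not make the final suspicious-activity determination; the analyst's decision and the audit trail must be retained.
Core trace	case intake -> data retrieval -> pattern summary -> evidence citation -> narrative draft -> analyst review.
Retrieval span	Customer transactions, KYC, historical cases, rule hits, external watchlists, policy manuals.
Tool span	case_lookup, transaction_graph, watchlist_check, case_note_draft. High-risk write operations require approval.
Quality metrics	evidence coverage, unsupported claim rate, typology mapping accuracy, expert pass rate.
Safety metrics	final decision by AI = 0, PII leakage block, policy violation, human override.
Cost metrics	cost per accepted narrative, cost per case reviewed, human review cost.
SLO	S0 critical failure = 0; expert pass >= 98%; p95 draft within investigation workflow window.
Incident trigger	AI cites the wrong transaction, omits key evidence, implies a final SAR decision, or accesses customer data beyond its permissions.

Operational focus:

Every narrative must have an evidence map.
Analyst edit distance is a quality signal, not "users being picky".
False positive reduction must be defined carefully so that it does not encourage missed reports.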

10.2 Customer Service RAG

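Purpose: give contact-center agents policies, procedures, product terms, and answer drafts.

Dimension	Design
Risk tier	Medium; raised when complaints, fees, credit, or regulated wording are involved.
Core trace	customer intent -> risk classification -> retrieval -> answer draft -> citation check -> agent accept/edit/send.
Retrieval span	Product FAQ, rate tables, policies, scripts, regional rules, customer type filters.
Tool span	customer_profile_read, ticket_history_read, case_disposition_suggest. Usually read-only at first.
Quality metrics	citation correctness, groundedness, tone compliance, agent acceptance, QA pass.
Safety metrics	high-risk escalation, PII handling, prohibited promise block.
Cost metrics	cost per successful answer, cost per resolved ticket, cache hit saving.
SLO	p95 TTFT <= 2s; p95 total <= 8s; citation correctness >= 95%; high-risk escalation >= 99%.
Incident trigger	Outdated policy, wrong fee commitment, complaint not escalated, cross-customer information leakage.

Operational focus:

High adoption with high edit distance means "usable but not trusted".
High-frequency FAQs should drive cache and knowledge base productization.
The quality dashboard should be sliced by product line and policy version.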

10.3 Credit Policy Assistant

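Purpose: help credit operations, branches, and risk teams look up policies, explain reason codes, and prepare compliant, consistent explanations.

Dimension	Design
Risk tier	High. Must not produce a final credit approval result and must not bypass fair lending review.
Core trace	policy question -> jurisdiction/product filter -> retrieval -> policy answer -> reason code explanation -> compliance review.
Retrieval span	Credit policies, exception rules, regulatory requirements, product terms, effective dates.
Tool span	policy_search, reason_code_lookup, application_status_read. Automatic approve/decline is prohibited.
Quality metrics	policy version correctness, groundedness, reason code consistency, expert pass.
Safety metrics	protected class leakage, unsupported eligibility claim, final decision prohibition.
Cost metrics	cost per compliant answer, cost per avoided escalation.
SLO	S0 failure = 0; policy version correctness >= 99%; unsupported eligibility claim = 0.
Incident trigger	Wrong explanation of a decline reason, citing an outdated policy, generating discriminatory or proxy-variable recommendations.

Operational focus:

The trace must record jurisdiction, product, and policy effective date.
Use the judge for first-pass screening, but high-risk samples need expert review.
Every policy update must trigger a regression eval.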

10.4 Payment Dispute Assistant

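Purpose: assist agents or back-office operations in handling payment disputes, collecting evidence, and generating next-step recommendations and customer communication drafts.

Dimension	Design
Risk tier	Medium to High, depending on whether money movement is affected.
Core trace	dispute intake -> transaction lookup -> rules retrieval -> evidence checklist -> action recommendation -> approval -> customer message.
Retrieval span	Card network rules, bank policies, transaction evidence requirements, time windows.
Tool span	dispute_status_read, transaction_lookup, evidence_upload_check, case_note_draft. Money movement actions require approval.
Quality metrics	rule citation correctness, evidence completeness, next-best-action accuracy.
Safety metrics	irreversible action block, deadline miss warning, wrong customer disclosure block.
Cost metrics	cost per resolved dispute, cost per prevented write-off, tool/API cost.
SLO	p95 recommendation <= workflow SLA; irreversible action without approval = 0; deadline detection >= 99%.
Incident trigger	Missed dispute deadline, wrong refund/chargeback recommendation, access to the wrong customer's transactions.

Operational focus:

The tool call span matters more than the model span, because the dispute flow depends on core system state.
The incident timeline must cover "was the wrong recommendation already executed".
Cost must include downstream system APIs and human approval.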

10.5 Retail Demand Analyst

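Purpose: help retail teams analyze sales, inventory, promotions, weather, holidays, and supply chain signals, and generate demand insights.

Dimension	Design
Risk tier	Low to Medium; raised if the output directly drives purchasing or pricing.
Core trace	business question -> data retrieval -> analysis tool -> narrative generation -> analyst review -> decision tracking.
Retrieval span	Sales data, inventory, promotions, historical forecasts, store/region dimensions.
Tool span	sql_query, forecast_model_run, inventory_lookup, scenario_simulation.
Quality metrics	data freshness, calculation correctness, forecast error, analyst acceptance.
Safety metrics	unsupported causal claim, sensitive competitive data leakage.
Cost metrics	cost per accepted insight, cost per forecast run.
SLO	data freshness within business cadence; calculation error = 0; accepted insight rate >= target.
Incident trigger	A SQL error leads to a wrong replenishment recommendation, stale data, or correlation written up as causation.

Operational focus:

This kind of scenario must not be evaluated on text quality alone; data and calculation correctness must be evaluated.
Business outcome should track inventory turnover, stockouts, gross margin, or markdown.
The tool span should record the SQL, the data version, and a summary of the calculation result.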

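11. PM / BA / Architect Deliverables

11.1 Metric Tree

Purpose: connect business goals, AI behavior, system metrics, and cost metrics into one tree.

Structure:

North-star business outcome.
Target users and workflow step.
Definition of AI success.
Quality drivers.
Safety controls.
Latency drivers.
Cost drivers.
Adoption drivers.
Leading / lagging indicators.
Metric owner and review cadence.

Criteria for a good metric tree:

It explains why this AI use case is worth doing.
It localizes a problem to data, model, retrieval, tool, user workflow, or operations.
It lets the CFO see unit economics, Risk see controls, SRE see stability, and the PM see adoption.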

11.2 Dashboard Spec

Purpose: tell the team which dashboards to build, who they are for, and how to act on them.

Must include:

Dashboard name。
Audience。
Decision supported。
Refresh cadence。
Data sources。
Filters。
Tiles / charts。
Alert rules。
Drill-down trace link。
Owner。
Runbook link。

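Dashboard types:

Dashboard	Primary audience	Key questions
AI Service Health	SRE / Architect	Is the system stable, and where is the bottleneck?
Quality and Eval	PM / BA / Risk / QA	Does quality meet the target, and which requirements fail?
Cost and Unit Economics	PM / Finance / Platform	Is cost attributable, and do the unit economics hold?
Adoption and Workflow	PM / Business Owner	Are users adopting, and is the workflow improving?
Safety and Incident	Risk / Compliance / SRE	Are risks escaping, and are incidents closed out?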

11.3 Trace Schema

Purpose: unify all AI trace/span fields so that teams do not log ad hoc.

Must include:

trace-level common fields。
span-level common fields。
model span fields。
retrieval span fields。
tool span fields。
judge span fields。
human review span fields。
cost fields。
outcome fields。
privacy and retention rules。

11.4 SLO Doc

Purpose: define the AI service's operating targets, thresholds, error budget, and violation actions.

Must include:

Use case and workflow scope。
Risk tier。
Business objective。
Availability SLO。
Latency SLO。
Quality SLO。
Safety SLO。
Cost SLO。
Adoption SLO。
Measurement method。
Exclusions。
Alert threshold。
Error budget policy。
Review cadence。
Owner。

11.5 Incident Playbook

Purpose: make sure the team knows how to detect, classify, contain, fix, and review when an AI incident happens.

Must include:

Incident severity。
Detection sources。
Triage owner。
Immediate containment。
Customer / internal communication。
Rollback / fallback。
Data and trace collection。
Root cause categories。
Regression eval update。
Postmortem template。
Governance follow-up。

11.6 Cost Review Memo

Purpose: let PM/Finance/Platform/Risk periodically review whether AI cost matches value.

Must include:

Total cost for the period.
cost by use case / team / model / route / risk tier。
cost per successful answer。
cost per resolved case。
cache savings。
long-context cost。
judge and human review cost。
budget variance。
quality and safety impact。
optimization actions。
decision request。

12. 21-Day Lab: AI Service Operations Pack

Goal: in 21 days, build a complete operations pack for one real or simulated AI use case. Choosing a financial or retail scenario is recommended, for example Customer Service RAG or AML Copilot.

Day	Task	Artifact
1	Choose a use case; state the target users, workflow, business pain points, and risk tier.	Use Case Operations Brief
2	Draw the workflow from user trigger to AI output to human handling.	Workflow + AI Touchpoint Map
3	Write the business outcome and baseline metrics, for example AHT, case backlog, QA fail rate.	Baseline Metric Sheet
4	Turn the requirements into 8-12 eval-ready requirements.	Requirements-to-Eval Mini Matrix
5	Design the request trace, from entry fields to outcome fields.	Trace-Level Schema
6	Design the model call span, including prompt/model/version/token/latency/cost.	Model Span Schema
7	Design the retrieval span, including KB, index, filter, doc ids, freshness, citation.	Retrieval Span Schema
8	Design the tool call span, distinguishing read-only, write, and irreversible actions.	Tool Span Schema
9	Design the judge span, linking rubric, score, severity, and requirement id.	Judge Span Schema
10	Design the human review span, recording approval, edit, rejection, and override reason.	Human Review Span Schema
11	Build the latency budget, split across queue, retrieval, model, tool, judge, human review.	Latency Budget Sheet
12	Build the cost ledger, listing token, retrieval, tool, judge, human, monitoring.	Cost Ledger Model
13	Calculate unit economics; define cost per successful answer / resolved case.	Unit Economics Sheet
14	Design the risk-tiered SLO matrix, covering quality, latency, cost, safety, adoption.	AI SLO Matrix
15	Design the quality dashboard, including groundedness, citation, judge, human override.	Quality Dashboard Spec
16	Design the cost dashboard, including route, model, token, cache, budget variance.	Cost Dashboard Spec
17	Design the adoption dashboard, including activation, repeat use, acceptance, edit distance.	Adoption Dashboard Spec
18	Write alert rules and the error budget policy.	Alert and Error Budget Policy
19	Write the incident playbook, covering detection, classification, containment, rollback, postmortem.	AI Incident Playbook
20	Run one tabletop incident drill, for example "an outdated policy was cited to a customer".	Incident Drill Report
21	Consolidate into a presentable portfolio pack; write the executive summary and an interview talk track.	AI Service Operations Pack

Suggested final pack directory:

AI Service Operations Pack
  01-use-case-brief.md
  02-workflow-and-risk-tier.md
  03-requirements-to-eval-matrix.md
  04-trace-schema.md
  05-latency-and-cost-model.xlsx
  06-slo-matrix.md
  07-dashboard-spec.md
  08-incident-playbook.md
  09-cost-review-memo.md
  10-executive-summary.md

13. Template 1: AI Trace Schema

13.1 Trace-Level Fields

Field	Type	Example	Required	Notes
`trace_id`	string	`trc_20260629_001`	Yes	Unique across the end-to-end path.
`request_id`	string	`ticket_849201`	Yes	Bound to the business object.
`use_case_id`	string	`customer_service_rag`	Yes	Core of cost and quality attribution.
`risk_tier`	enum	`medium`	Yes	low / medium / high / critical.
`tenant_id`	string	`retail_banking_us`	Yes	Tenant or business line.
`user_role`	string	`contact_center_agent`	Yes	Target user role.
`workflow_step`	string	`answer_draft`	Yes	Workflow node where the AI sits.
`customer_segment`	string	`mass_retail`	Conditional	Recorded when customer context exists.
`policy_region`	string	`US-IL`	Conditional	Recorded when region affects policy.
`route_policy_version`	string	`route_v12`	Yes	Routing and guardrail version.
`eval_profile`	string	`customer_facing_medium_risk`	Yes	Matches the eval strength.
`start_time`	timestamp	`2026-06-29T14:03:21Z`	Yes	Single time zone.
`end_time`	timestamp	`2026-06-29T14:03:27Z`	Yes	Used to compute total latency.
`outcome`	enum	`accepted_with_edit`	Yes	success / blocked / escalated / failed / overridden.
`business_outcome_id`	string	`case_resolved_same_contact`	Conditional	Can be backfilled from asynchronous results.

13.2 Span-Level Common Fields

Field	Type	Example	Required	Notes
`span_id`	string	`spn_model_001`	Yes	Unique per span.
`parent_span_id`	string	`spn_retrieval_001`	Conditional	Forms the hierarchy.
`span_type`	enum	`model_call`	Yes	request / model / retrieval / tool / judge / human_review.
`component_owner`	string	`ai_platform`	Yes	Operational owner.
`version`	string	`prompt_v18`	Conditional	Depends on the span type.
`latency_ms`	number	`1840`	Yes	Span latency.
`status`	enum	`ok`	Yes	ok / error / blocked / escalated.
`error_code`	string	`retrieval_timeout`	Conditional	Recorded on error.
`cost_usd`	number	`0.014`	Conditional	Recorded for billable spans.
`privacy_class`	enum	`confidential`	Yes	public / internal / confidential / restricted.
`retention_policy`	string	`90d_trace_1y_aggregate`	Yes	Avoids retaining sensitive input indefinitely.

13.3 Example Trace Summary

{
  "trace_id": "trc_customer_service_20260629_001",
  "request_id": "ticket_849201",
  "use_case_id": "customer_service_rag",
  "risk_tier": "medium",
  "user_role": "contact_center_agent",
  "workflow_step": "answer_draft",
  "route_policy_version": "route_v12",
  "eval_profile": "customer_facing_medium_risk",
  "outcome": "accepted_with_edit",
  "latency_ms": 6420,
  "cost_usd": 0.038,
  "quality": {
    "groundedness_score": 0.94,
    "citation_correct": true,
    "judge_pass": true,
    "human_edit_distance": 0.12
  },
  "business": {
    "case_resolved": true,
    "agent_accepted": true
  }
}

14. Template 2: AI SLO Matrix

Use case	Risk tier	Workflow step	Metric type	SLO	Measurement	Alert	Violation action	Owner
Customer Service RAG	Medium	answer draft	Latency	p95 TTFT <= 2s；p95 total <= 8s	trace latency	2 consecutive hours breach	enable cache, reduce top_k, route to faster model	AI Platform + Contact Center Ops
Customer Service RAG	Medium	answer draft	Quality	citation correctness >= 95%	judge + QA sample	daily rate below 95%	freeze scale, review KB and prompt	PM + QA
Customer Service RAG	Medium	risk classification	Safety	high-risk escalation >= 99%	classifier span + QA sample	any day below 99%	conservative policy route, manual review	Risk Owner
Customer Service RAG	Medium	resolved ticket	Cost	cost per resolved ticket <= approved target	cost ledger + case outcome	weekly 20% variance	cost review, routing and cache changes	PM + Finance
AML Copilot	High	narrative draft	Quality	S0 critical failure = 0; expert pass >= 98%	expert review + eval	any S0	stop release, incident process	AML Owner + Risk
AML Copilot	High	decision boundary	Safety	final SAR decision by AI = 0	tool/action audit	any breach	disable action path, executive escalation	Compliance

An SLO doc should also include:

Why this SLO matters.
Data source.
Sampling strategy.
Exclusions.
Dashboard link.
Runbook link.
Review cadence.
Last review decision.
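
The latency row of the matrix (p95 total <= 8s, alert on 2 consecutive hours of breach) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank method (other definitions give slightly different values), and latencies are assumed to be grouped into hourly buckets in milliseconds.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breached_hours(hourly_latencies_ms: List[List[float]], p95_target_ms: float) -> List[bool]:
    """One flag per hour: True when that hour's p95 exceeds the target."""
    return [percentile(hour, 95) > p95_target_ms for hour in hourly_latencies_ms]


def should_alert(breaches: List[bool], consecutive: int = 2) -> bool:
    """True when at least `consecutive` breached hours occur in a row."""
    run = 0
    for breached in breaches:
        run = run + 1 if breached else 0
        if run >= consecutive:
            return True
    return False


# Hypothetical data: three hours of total latency samples, 20 requests each.
hours = [
    [1000] * 19 + [9000],      # one slow request: p95 stays at 1000 ms
    [1000] * 18 + [9000] * 2,  # two slow requests: p95 rises to 9000 ms
    [1000] * 18 + [9000] * 2,
]
flags = breached_hours(hours, p95_target_ms=8000)
print(flags, should_alert(flags))
```

Requiring consecutive breaches keeps a single noisy hour from paging anyone, at the cost of a delay of one window before the alert fires.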

15. Template 3: Cost Unit Economics Sheet

15.1 Sheet Columns

Column	Example	Notes
Date	2026-06-29	Daily or weekly granularity.
Use case	customer_service_rag	Same value as in the trace.
Business unit	contact_center	Cost attribution.
Risk tier	medium	Explains route differences.
Requests	10,000	Total requests.
Successful answers	7,800	Passed the quality threshold and were adopted.
Resolved cases	6,400	Business task completed.
Input tokens	45,000,000	Watch for context bloat.
Output tokens	9,000,000	Watch answer length.
Model cost	520.00	Model calls.
Retrieval cost	80.00	Embedding, vector DB, rerank.
Judge cost	120.00	Online eval.
Tool/API cost	60.00	Downstream systems.
Human review cost	900.00	QA, approval, expert review.
Observability cost	45.00	Trace, metrics, storage.
Total cost	1,725.00	Fully loaded cost.
Cost per successful answer	0.221	Total cost / successful answers.
Cost per resolved case	0.270	Total cost / resolved cases.
Estimated labor saving	5,200.00	State the business assumptions.
Net benefit	3,475.00	Estimated labor saving - total cost.
ROI	2.01	Net benefit / total cost.
Quality pass rate	96.2%	Cost cannot be judged apart from quality.
Safety incidents	0	Cost optimization must not sacrifice safety.
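
The derived rows of the sheet follow directly from the cost components and outcome counts. The sketch below reproduces the example row; `unit_economics` is a hypothetical helper name, and the rounding precision is chosen only to match the sheet.

```python
from typing import Dict


def unit_economics(costs: Dict[str, float], successful_answers: int,
                   resolved_cases: int, labor_saving: float) -> Dict[str, float]:
    """Derive total cost, unit costs, net benefit and ROI from sheet inputs."""
    total = sum(costs.values())
    net_benefit = labor_saving - total
    return {
        "total_cost": round(total, 2),
        "cost_per_successful_answer": round(total / successful_answers, 3),
        "cost_per_resolved_case": round(total / resolved_cases, 3),
        "net_benefit": round(net_benefit, 2),
        "roi": round(net_benefit / total, 2),
    }


# Values copied from the example column of the sheet.
costs = {
    "model": 520.00,
    "retrieval": 80.00,
    "judge": 120.00,
    "tool_api": 60.00,
    "human_review": 900.00,
    "observability": 45.00,
}
result = unit_economics(costs, successful_answers=7800,
                        resolved_cases=6400, labor_saving=5200.00)
print(result)
# total_cost 1725.0, cost_per_successful_answer 0.221,
# cost_per_resolved_case 0.27, net_benefit 3475.0, roi 2.01
```

Note that human review is more than half of the total in this example, which is why the sheet tracks it as a separate line instead of reporting model cost alone.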

15.2 Review Questions

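Does the cost growth come from request volume, tokens, route, judge, human review, or cache misses?
Is cost per successful answer falling, or is only total cost falling?
Is the cheaper model driving up rejects, rework, or escalations?
Has the cost of high-risk scenarios been accepted by the business owner and the risk owner?
Which long-context requests should be changed to RAG or segmented summarization?
Are the cache savings built on correct permission and version controls?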

16. Template 4: Dashboard Spec

# Dashboard Spec: Customer Service RAG Quality and Cost

## Audience
Contact Center PM, QA Lead, AI Platform, Risk, Finance.

## Decisions Supported
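- Whether pilot traffic can be expanded.
- Which product lines need knowledge base fixes.
- Whether cost is in line with the unit economics.
- Whether an incident or SLO violation has been triggered.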

## Data Sources
- AI trace store.
- Cost ledger.
- QA review system.
- Ticketing system.
- Knowledge base version registry.

## Filters
- Date range.
- Product line.
- Region.
- Risk tier.
- Model route.
- Prompt version.
- Knowledge base version.
- User role.

## Tiles
1. p50/p95 TTFT and total latency.
2. Request volume by route.
3. Citation correctness trend.
4. Groundedness score by product line.
5. Unsupported claim rate.
6. Agent acceptance and edit distance.
7. High-risk escalation rate.
8. Cost per successful answer.
9. Cache hit rate and estimated savings.
10. Open incidents and SLO violations.

## Alerts
- Citation correctness below 95% for one business day.
- High-risk escalation below 99%.
- Cost per successful answer 20% above approved target for one week.
- Unsupported claim rate above 3% for low-risk and above 0 for high-risk regulated answers.

## Drill-Down
Every chart links to sampled traces with model, retrieval, tool, judge and human review spans.

## Owner and Cadence
Dashboard owner: AI Platform PM.
Review cadence: daily during pilot, weekly after release, monthly cost review.
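
The four alert rules in the spec can be expressed as a small rule check. This is a sketch under assumptions: the metric keys, alert names, and `evaluate_alerts` function are hypothetical, inputs are assumed to be pre-aggregated (daily rates, weekly cost), and because the spec only states unsupported-claim limits for low-risk and high-risk answers, the sketch rejects other tiers instead of guessing a limit.

```python
from typing import Dict, List

# Unsupported-claim limits stated in the spec; other tiers are deliberately absent.
UNSUPPORTED_CLAIM_LIMIT = {"low": 0.03, "high": 0.0}


def evaluate_alerts(daily: Dict[str, float], weekly_cost_per_answer: float,
                    approved_cost_target: float, risk_tier: str) -> List[str]:
    """Return the names of the alert rules that fire for the given aggregates."""
    if risk_tier not in UNSUPPORTED_CLAIM_LIMIT:
        raise ValueError(f"no unsupported-claim limit defined for tier: {risk_tier}")
    alerts = []
    if daily["citation_correctness"] < 0.95:
        alerts.append("citation_correctness_below_95")
    if daily["high_risk_escalation_rate"] < 0.99:
        alerts.append("high_risk_escalation_below_99")
    if weekly_cost_per_answer > approved_cost_target * 1.20:
        alerts.append("cost_per_successful_answer_20pct_over_target")
    if daily["unsupported_claim_rate"] > UNSUPPORTED_CLAIM_LIMIT[risk_tier]:
        alerts.append("unsupported_claim_rate_above_limit")
    return alerts


# Hypothetical aggregates for one day and one week.
daily = {
    "citation_correctness": 0.912,
    "high_risk_escalation_rate": 0.995,
    "unsupported_claim_rate": 0.01,
}
print(evaluate_alerts(daily, weekly_cost_per_answer=0.25,
                      approved_cost_target=0.20, risk_tier="high"))
```

Keeping the thresholds in data instead of in code makes it easier to review them alongside the SLO matrix when a limit changes.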

17. Template 5: Incident Postmortem

# AI Incident Postmortem: Incorrect Policy Citation in Customer Service RAG

## Incident Summary
On 2026-06-29, the customer service RAG assistant generated answers citing an outdated fee waiver policy for checking accounts in one region. The service remained technically available, but quality and compliance SLOs were violated.

## Severity
S1: Customer-facing regulated policy error with limited scope and no irreversible financial action executed.

## Detection
- QA sampling found citation mismatch.
- Quality dashboard showed citation correctness drop from 96.4% to 91.2%.
- Trace drill-down linked failures to knowledge base version `kb_fee_policy_v31`.

## Impact
- 184 answer drafts generated under affected policy version.
- 73 drafts accepted by agents.
- 16 customer conversations required follow-up clarification.
- No fee reversal action was automatically executed.

## Timeline
- 09:10: QA reviewer flags first mismatch.
- 09:25: Risk owner opens incident.
- 09:40: AI Platform disables affected KB route and falls back to previous approved policy page.
- 10:15: Contact center supervisors receive correction guidance.
- 12:30: Knowledge owner publishes fixed index version.
- 14:00: Regression eval passes affected policy cases.
- 15:00: Traffic restored to normal route.

## Root Cause
The policy source was updated, but the RAG index refresh completed before the effective-date metadata validation ran. Retrieval then returned a superseded document because the filter did not enforce active policy status.

## Contributing Factors
- Dashboard tracked KB version but not policy effective-date mismatch.
- Golden set had fee policy cases but did not include this region.
- Agent UI showed citation title but not effective date.

## Containment
- Disabled affected retrieval route.
- Required human review for fee waiver questions during incident window.
- Sent supervisor guidance for follow-up conversations.

## Corrective Actions
1. Add `policy_effective_status` to retrieval filter.
2. Add effective date to UI citation display.
3. Add regional fee waiver cases to golden set.
4. Add dashboard tile for stale or superseded policy retrieval.
5. Add SLO alert for citation correctness by policy region.

## Regression Evidence
- 40 new regional policy cases pass deterministic date validation.
- LLM judge groundedness pass rate returns to 97.1%.
- QA sample of 60 answers shows 0 outdated policy citations.

## Governance Follow-Up
- Knowledge owner added to weekly AI ops review.
- Release gate updated: policy KB refresh requires metadata validation before production route.
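
The timeline and impact figures in the postmortem can be turned into summary numbers for the weekly AI ops review. The sketch below is illustrative only: the metric names (`minutes_to_contain`, `minutes_to_restore`, and the two shares) are hypothetical labels, and it assumes all timeline entries fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Values copied from the Timeline and Impact sections of the postmortem.
timeline = {"detected": "09:10", "contained": "09:40", "restored": "15:00"}
impact = {"drafts_generated": 184, "drafts_accepted": 73, "follow_ups": 16}

summary = {
    "minutes_to_contain": minutes_between(timeline["detected"], timeline["contained"]),
    "minutes_to_restore": minutes_between(timeline["detected"], timeline["restored"]),
    "accepted_share": round(impact["drafts_accepted"] / impact["drafts_generated"], 3),
    "follow_up_share_of_accepted": round(impact["follow_ups"] / impact["drafts_accepted"], 3),
}
print(summary)
# minutes_to_contain 30, minutes_to_restore 350,
# accepted_share 0.397, follow_up_share_of_accepted 0.219
```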

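18. Interview Talking Points

18.1 30-Second Version

I don't treat AI observability as ordinary logging. A production AI service must be able to trace, for each request, the prompt, model, retrieval, tool, judge, human review, token, cost, latency, quality, safety, and adoption signals. An SLO cannot cover uptime alone; it has to define quality, cost, latency, and safety thresholds together, per use case and risk tier. That is how a demo becomes a service that can be operated, audited, and optimized.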

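18.2 2-Minute Version

I start the AI observability design from the business workflow. Taking customer service RAG as an example, each request needs a complete trace: the user question, the risk classification, which documents were retrieved, which prompt and model were used, whether a tool was called, how the judge scored the answer, whether the agent accepted or edited it, and whether the ticket was ultimately resolved. The metrics fall into layers: latency covers TTFT and total latency; token and cache metrics cover cost; retrieval covers recall, freshness, and citation; quality covers groundedness, unsupported claims, judge pass, and QA review; safety covers high-risk escalation and policy blocks; adoption covers target user activation, acceptance rate, and edit distance; business outcome covers FCR, AHT, cost per resolved case, and ROI.

SLOs must be tiered by risk. A low-risk FAQ can emphasize low latency and low cost; customer service drafts must emphasize citation correctness and escalation to a human; high-risk scenarios such as AML, credit, and payment disputes must set S0 failures to 0, prohibit AI from making the final decision, and keep human review and an audit trail. Cost management is also not a matter of reading the token bill at month end; it means doing unit economics: cost per successful answer, cost per resolved case, judge and human review cost, cache savings, and long-context cost. The final deliverables include a metric tree, trace schema, dashboard spec, SLO doc, incident playbook, and cost review memo.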

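18.3 CTO Deep Dive

Q: How do you design an AI trace schema?

A: I first define the trace-level common fields: trace id, request id, use case, risk tier, tenant, user role, workflow step, route policy version, eval profile, and outcome. Then I split spans into model, retrieval, tool, judge, and human review. The model span records the model, prompt version, tokens, TTFT, total latency, route reason, cache, and cost. The retrieval span records the KB, index, chunking, filter, doc ids, freshness, and citation. The tool span records the tool risk level, policy decision, approval, and side effects. The judge span records the requirement id, rubric, score, and severity. This makes it possible to trace a business outcome back to the model, knowledge, tool, and eval versions behind it.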

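Q: How does OpenTelemetry fit in here?

A: I use OpenTelemetry's trace/span model and the GenAI semantic conventions as the anchor for naming and attributes, but I don't copy only the technical fields. At the platform layer I extend them with business fields such as use case, risk tier, workflow step, cost center, eval requirement id, and human review decision. Standardization makes latency, tokens, errors, and cost comparable across models, applications, and teams; the business extensions are what support release gates, risk audits, and ROI.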
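
As a rough illustration of combining standard and business attributes, the sketch below builds a flat attribute dictionary for a model span without depending on any SDK. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them at the time of writing (they are still evolving, so check the current spec); the `app.ai.*` prefix, the function name, and the model name are hypothetical and not part of any standard.

```python
from typing import Any, Dict

# Hypothetical namespace for platform-specific business attributes.
BUSINESS_PREFIX = "app.ai"


def build_model_span_attributes(model: str, input_tokens: int, output_tokens: int,
                                business: Dict[str, Any]) -> Dict[str, Any]:
    """Merge GenAI-convention attributes with namespaced business fields."""
    attributes = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    for key, value in business.items():
        # Namespacing keeps business fields from colliding with standard keys.
        attributes[f"{BUSINESS_PREFIX}.{key}"] = value
    return attributes


attrs = build_model_span_attributes(
    model="example-model",  # placeholder, not a real model name
    input_tokens=4500,
    output_tokens=900,
    business={
        "use_case_id": "customer_service_rag",
        "risk_tier": "medium",
        "workflow_step": "answer_draft",
        "cost_center": "contact_center",
    },
)
print(attrs)
```

A dictionary like this could be passed to whatever span API the platform uses; keeping the standard and extension keys in separate namespaces makes it possible to strip or remap the business fields when exporting to a shared backend.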

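Q: How do you diagnose an AI service that has become slow?

A: First break total latency down: queue wait, retrieval, model prefill, decode, tool, judge, human review. High TTFT may come from queueing, cold starts, or long-prompt prefill; high total latency may come from overly long output, slow tools, a judge running synchronously, or a human review bottleneck. Then look at route distribution, input tokens, cache hit, top_k, rerank, and provider errors. Don't switch to a faster model right away, because the real bottleneck may be in retrieval or tools.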
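
The latency decomposition described in this answer can be sketched as a per-stage breakdown of one trace. The stage names and the per-stage numbers below are hypothetical (chosen to sum to the 6420 ms of the example trace summary), and the sketch assumes spans run sequentially without overlap; with parallel spans, a critical-path analysis would be needed instead.

```python
from typing import Dict, List, Tuple


def latency_breakdown(spans: List[dict]) -> Tuple[float, Dict[str, float], str]:
    """Sum latency per stage and return (total_ms, share per stage, slowest stage)."""
    if not spans:
        raise ValueError("at least one span is required")
    totals: Dict[str, float] = {}
    for span in spans:
        totals[span["stage"]] = totals.get(span["stage"], 0) + span["latency_ms"]
    total_ms = sum(totals.values())
    shares = {stage: round(ms / total_ms, 3) for stage, ms in totals.items()}
    bottleneck = max(totals, key=totals.get)
    return total_ms, shares, bottleneck


# Hypothetical spans for a single request.
spans = [
    {"stage": "queue_wait", "latency_ms": 300},
    {"stage": "retrieval", "latency_ms": 2900},
    {"stage": "model_prefill", "latency_ms": 700},
    {"stage": "model_decode", "latency_ms": 1500},
    {"stage": "tool", "latency_ms": 420},
    {"stage": "judge", "latency_ms": 600},
]
total_ms, shares, bottleneck = latency_breakdown(spans)
print(total_ms, bottleneck, shares[bottleneck])  # 6420 retrieval 0.452
```

In this made-up trace retrieval accounts for about 45% of total latency, so switching to a faster model would address at most the prefill and decode share.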

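18.4 CFO Deep Dive

Q: How do you prove the AI cost is worth it?

A: I don't report only the token bill. I report unit economics: cost per successful answer, cost per resolved case, and cost per accepted recommendation, with model, retrieval, tool, judge, human review, and observability costs all included. Then I compare that against business benefits such as lower AHT, less rework, fewer false positives, better inventory, or reduced losses. Only a request that meets the quality bar, is safe, and is adopted by the target user counts as a successful answer. This avoids creating rework with a cheap but low-quality model.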

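Q: What do you do when cost goes over budget?

A: Attribute first: is it request volume growth, input token bloat, long context, cache misses, too much routing to the strong model, an overly heavy judge, tool/API cost, or too much human review? Then optimize by risk tier: low-risk, high-frequency requests use cache, small models, batching, and prompt compression; medium risk uses fast model + judge + selective retry; high risk keeps the strong model and human approval, but the business owner must explicitly accept the cost. No cost optimization may lower a high-risk safety SLO.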
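
The "attribute first" step can start with a simple component-level comparison between a baseline period and the current period. In the sketch below, the baseline values are the example costs from the unit economics sheet, the current values are invented for illustration, and `attribute_cost_growth` is a hypothetical helper; a fuller analysis would also separate volume effects from per-request effects.

```python
from typing import Dict, List, Tuple


def attribute_cost_growth(baseline: Dict[str, float],
                          current: Dict[str, float]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the total cost change and per-component deltas, largest increase first."""
    components = sorted(set(baseline) | set(current))
    deltas = {name: current.get(name, 0.0) - baseline.get(name, 0.0) for name in components}
    total_delta = sum(deltas.values())
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return total_delta, ranked


baseline = {"model": 520.0, "retrieval": 80.0, "judge": 120.0,
            "tool_api": 60.0, "human_review": 900.0, "observability": 45.0}
# Hypothetical current period: model and judge cost have grown.
current = {"model": 700.0, "retrieval": 80.0, "judge": 150.0,
           "tool_api": 60.0, "human_review": 900.0, "observability": 45.0}

total_delta, ranked = attribute_cost_growth(baseline, current)
print(total_delta)  # 210.0
print(ranked[:2])   # [('model', 180.0), ('judge', 30.0)]
```

Once the largest component is known, the trace fields (route reason, input tokens, cache hit) can be used to explain why that component grew.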

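18.5 SRE Deep Dive

Q: How does an AI incident differ from an ordinary incident?

A: An ordinary incident is usually a service outage or a rising error rate. An AI incident may be an HTTP 200 with a wrong answer, stale evidence, a tool exceeding its authority, judge drift, a cost anomaly, or user misuse. Beyond rollback, the response must be able to disable a prompt, model route, retriever index, tool action, or judge version; it must also recall affected outputs, notify the business owner, and extend the golden set and regression eval. The postmortem asks not only why the system broke, but also why the eval, dashboard, human review, or gate did not catch it earlier.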

Q: Which alerts are most useful?

A: I combine technical and business alerts: p95 TTFT/total latency, tool error, provider fallback, cache miss spike, cost anomaly, unsupported claim, citation correctness drop, high-risk escalation drop, human override spike, S0 failure, and sudden adoption drop. Every alert must drill down to a trace; otherwise it only creates noise.

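19. Turning This into Portfolio Material

This playbook can be turned into three kinds of portfolio material.

19.1 One-Page Executive Memo

Title: From AI Demo to Operated Service

Core content:

Problem: AI demos lack trace, quality, cost, SLO, and an incident loop.
Approach: build an AI observability stack, risk-tiered SLOs, unit economics, and dashboards.
Scenarios: Customer Service RAG / AML Copilot / Payment Dispute Assistant.
Deliverables: trace schema, dashboard spec, SLO matrix, cost review memo, incident playbook.
Value: lower production risk, controlled cost, higher adoption, demonstrated ROI.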

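19.2 Interview Case Narrative

Structure:

Background: the common break points when finance and retail AI moves from pilot to release.
Diagnosis: looking only at uptime and the token bill is not enough.
Solution: trace + metrics taxonomy + SLO + cost ledger + incident loop.
Risk: high-risk scenarios keep human review and audit.
Result: demonstrated with cost per resolved case, quality pass, safety incidents, adoption, and business outcome.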

19.3 Architecture Diagram to Draw

The diagram must include:

AI gateway.
Policy and risk tiering.
Model gateway.
Retrieval service.
Tool gateway.
Judge/eval service.
Human review queue.
Trace store.
Cost ledger.
Quality dashboard.
Incident timeline.
Business outcome store.

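20. Self-Check List

Before launch, or before submitting the portfolio, check against the following questions:

Does every use case have a risk tier?
Can every request be traced to its prompt, model, retrieval, tool, judge, and human review?
Are token, latency, cost, cache, and route reason recorded?
Can you answer "which evidence did this answer cite"?
Can you answer "why did this request cost so much"?
Can you answer "did the user adopt it, and did the business improve"?
Is there a quality dashboard, not just an eval report?
Is there a cost ledger, not just a vendor bill?
Is there an SLO matrix, not just an uptime SLA?
Is there a safety escape target, for example high-risk final decisions by AI = 0?
Is there an incident playbook and a postmortem template?
Are incident findings written back into the eval golden set?
Can you explain the trace schema to a CTO?
Can you explain the unit economics to a CFO?
Can you explain alerts, rollback, and error budget to an SRE?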

21. Final Principle

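An AI demo proves that "the model might be useful." AI observability, cost management, and risk-tiered SLOs prove that "this capability can be operated."

For finance and retail AI, real maturity is not getting the model to answer more questions; it is making every answer traceable, metered, evaluated, auditable, reviewable, and improvable.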