AI Product Metrics / North Star / Value Measurement Playbook

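Audience: AI PMs, AI BAs, AI Product Architects, AI Value Office Leads, and heads of retail-finance digitalization who already have a BA / CBAP / product management foundation.

Core question: how does an AI product move from "the model performs well and users say it is useful" to having a North Star, input metrics, guardrails, causal evidence, realized benefits, risk adjustment, finance recognition, and auditable governance?

Learning objective: design an advanced AI product metric system for retail-finance scenarios such as AML, customer service, credit, wealth / branch, and AI platforms, and turn the metric tree, experiment design, benefits realization, risk-adjusted value, and product analytics governance into portfolio assets.

Scope: this is not an introductory metrics course. It does not cover DAU/MAU basics, funnel terminology, or BI reporting tutorials. A real retail-finance project must be jointly confirmed by the business owner, risk, model risk, legal, compliance, privacy, security, finance, data owner, architecture, and operations.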

Source Anchors

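These sources anchor the playbook's treatment of product measurement, trustworthy AI, delivery capability, and value governance. They do not constitute legal, regulatory, audit, or vendor-selection advice.

Anchor	Official / primary source	How this playbook uses it
Amplitude North Star Metric official guide	https://amplitude.com/north-star	Anchors the product-management language for the North Star Metric: connecting customer value, product usage, business results, and team input metrics into one actionable metric system.
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize AI metric risk, guardrails, measurement evidence, monitoring, and the governance loop.
DORA	https://dora.dev/	Uses the language of software delivery and operational capability to connect AI platforms, engineering productivity, reliability, change risk, and business goals, so that AI platform value is not measured only by the number of demos.
Trustworthy Online Controlled Experiments	https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59	Uses online controlled experiments, organization-level metrics, experiment trustworthiness, and long-term impact to support causal evidence for AI products.
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Applies to measurement, monitoring, content risk, human oversight, and evidence design in GenAI risk scenarios.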

1. One-Sentence Positioning

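AI Product Metrics is a product governance system that connects AI behavior quality, real user adoption, business outcomes, risk guardrails, causal attribution, unit economics, and benefits realization. The North Star is the system's steering wheel, not an isolated KPI.

A shorter version for interviews: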

AI North Star = qualified value events created by AI, adjusted by trust, risk, adoption and unit economics.

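For a senior AI PM / BA / Architect, the key is not to list many metrics but to answer:

Which decision, action, workflow, or customer experience did AI actually change?
Which metric represents sustained customer value and business value?
Which input metrics can the team move directly?
Which guardrails, once they deteriorate, require expansion to stop?
What evidence shows the improvement came from AI rather than seasonality, staff self-selection, management attention, or process reshuffling?
Which part of the benefit is recognized by finance, risk, ops, and the business owner?
Are metric definitions, lineage, owners, changes, permissions, and incident response governable?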

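2. Why AI Product Measurement Cannot Stop at Model Metrics

Common ways AI products are reported:

Accuracy improved by 8%.
Answer quality judge score reached 4.5/5.
User visits rose 30%.
200,000 summaries generated.
Estimated 30% savings in manual effort.

These signals are useful but not sufficient to support a scale / stop / fund decision, because each one leaves key questions unanswered:

Single-point metric	What it shows	What it does not show	What to add
Model accuracy / eval score	AI output is closer to the reference on the test set	Whether users adopt it, whether the process improves, whether risk falls	exposure, adoption, workflow outcome, guardrail
Usage count	Someone opened or called the feature	Whether a qualified value event occurred	accepted action, completed workflow, quality pass
Time saved	A step is faster	Whether rework rose, quality fell, or the saved time was actually realized	rework, QA defect, capacity redeployment
Automation rate	The system handles more tasks	Whether it is safe, whether customers are wrongly affected, whether risk is shifted to humans	exception rate, manual override, customer harm
Cost per call	Inference cost	Whether the unit business value holds	cost per resolved case, cost per risk avoided
Customer satisfaction	An experience signal	Whether it leads to long-term value and whether compliance risk is controlled	retention, complaint, policy breach, segment fairness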

Maturity progression:

Model metric
-> AI behavior metric
-> workflow metric
-> customer / risk / financial outcome
-> causal evidence
-> risk-adjusted value
-> benefits realization
-> portfolio funding decision

In one sentence:

Model metrics are part of the release evidence, not the endpoint of AI product value.

3. AI Product Metric Taxonomy

3.1 Advanced Classification

Metric class	Question it answers	Retail-finance example	Owner	Typical use
North Star Metric	What is the core qualified value event the AI product creates	`risk-adjusted resolved customer issues`, `quality-approved AI-assisted AML cases`	Product / Business owner	Strategic alignment, team focus, scale decisions
Business Outcome Metric	Did the final business result improve	cost per case, loss avoided, FCR, approval cycle time, complaint rate	Business / Finance	benefits realization, funding gate
Workflow Metric	Did AI change process performance	handle time, case aging, queue backlog, reopen rate, touches per case	Ops / BA	Process optimization, bottleneck identification
Adoption Metric	Do target users adopt it in the right scenarios	eligible user activation, repeat use, accepted suggestion rate, copilot-assisted workflow share	Product / Ops	Adoption management, training, and UX iteration
AI Quality / Eval Metric	Does AI behavior meet task requirements	groundedness, citation correctness, policy adherence, tool success, format validity	AI PM / EvalOps	release gate, regression test
Decision Quality Metric	Are human-AI decisions better	escalation precision, memo defect rate, override quality, appeal overturn rate	Risk / Ops	Governance of high-risk decision support
Guardrail Metric	Does AI cause unacceptable harm	PII leakage, unauthorized advice, wrong denial, complaint spike, fairness gap	Risk / Compliance	stop rule, rollback, incident
Cost / Unit Economics Metric	Does value cover cost	cost per resolved case, token cost per accepted action, review cost per case	Finance / Platform	route optimization, budget, scale economics
Data / Knowledge Metric	Is the data AI depends on trustworthy	freshness, coverage, retrieval recall, policy effective-date correctness	Data owner	RAG, eval, audit
Platform / Engineering Metric	Is AI delivery capability scalable and reliable	lead time to AI change, deployment frequency, change failure rate, MTTR, platform reuse rate	Platform / Engineering	DORA-style platform value, operational maturity
Benefits Realization Metric	Were promised benefits realized	finance-signed benefit, redeployed capacity, avoided loss recognized	Value Office / Finance	portfolio review, scale / stop

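3.2 Metric Object Boundaries

Commonly confused	Correct distinction	Product governance significance
KPI vs North Star	There can be many KPIs; the North Star is the core value event and direction	Prevents each team from optimizing in a different direction
Eval metric vs business metric	Eval measures AI behavior; business metrics measure business results	Prevents a judge score from being packaged as ROI
Feature metric vs decision metric	A feature is a model input; a decision metric measures the quality of a business action	Prevents input quality from being mistaken for business benefit
Adoption vs impact	Adoption is usage and exposure; impact is an attributable change in outcomes	Prevents usage growth from being misreported as value
Guardrail vs secondary metric	A guardrail is a constraint and stop rule; a secondary metric explains results	Prevents serious risk from being masked by averages
Activity vs value event	Activity is clicks, queries, and generations; a value event is a business result completed to standard	Prevents "AI is busy" with no business improvement

3.3 AI Product Measurement Stack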

flowchart TB
  S[Strategy and risk appetite] --> N[North Star Metric]
  N --> V[Qualified value event]
  V --> I[Input metrics]
  I --> A[AI behavior and eval metrics]
  I --> W[Workflow and adoption metrics]
  I --> B[Business outcome metrics]
  A --> G[Guardrail metrics]
  W --> G
  B --> R[Risk-adjusted value]
  G --> R
  R --> C[Causal evidence]
  C --> F[Finance-recognized benefits]
  F --> P[Portfolio scale / stop decision]

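4. North Star Metric: Advanced Design Principles

4.1 Criteria for a Qualified North Star

An AI product North Star must satisfy all nine of the following conditions:

Condition	Test question	Failure signal
Clear user value	Why is the user or business process better off	The metric only counts model calls or page visits
Clear business value	Why does growth in this metric support business results	No link to cost, risk, revenue, or customer experience
Explainable AI contribution	How does AI affect the value event	AI is only a background tool with no exposure record
Movable by the team	Can input metrics be decomposed into product, data, model, and operations actions	The metric is too lagging or too macro
Hard to game	Can growth come from lowering quality or shifting risk (it must not)	More cases are auto-closed while rework and complaints rise
Constrained by guardrails	Are risk thresholds explicit, so harm cannot be traded for growth	Only efficiency is tracked, not compliance or customer harm
Diagnosable by segment	Can it be broken down by segment, channel, risk tier, and team	Averages mask harm to high-risk customer groups
Usable baseline	Can a pre-AI / control / holdout comparison be established	Only subjective estimates are possible
Translatable to finance	Can it be mapped to cost, revenue, loss avoided, or capacity	Finance cannot sign off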

4.2 Common AI North Star Patterns

Product pattern	Recommended North Star form	Example
RAG knowledge assistant	Grounded and accepted resolutions	`weekly grounded accepted answers that resolve workflow without QA defect`
Copilot / draft assistant	Quality-approved assisted work completed	`AI-assisted customer responses sent with policy pass and no reopen`
Decision support	Better decisions under human accountability	`risk-reviewed decisions with improved precision and no adverse guardrail breach`
Automation	Safely automated eligible outcomes	`eligible cases safely resolved by AI within SLA and no critical defect`
Agent workflow	Approved actions completed safely	`bounded AI actions completed with human-approved audit trail and no rollback`
AI platform	Production AI workflows delivering governed value	`active production AI use cases passing value, risk and DORA-style reliability gates`

4.3 Recommended Formula: Qualified Value Event

North Star =
sum(Qualified value events)
where each event passes:
  target workflow eligibility
  real user or system exposure
  accepted or completed action
  quality / eval threshold
  risk guardrail threshold
  cost ceiling
  auditable evidence
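
The gate logic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `CandidateEvent` fields, the `min_quality` and `cost_ceiling` defaults, and the function names are all assumptions introduced here, and a real product would take its thresholds from the approved metric contract.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CandidateEvent:
    """One candidate value event; every field name is hypothetical."""
    eligible: bool            # target workflow eligibility
    exposed: bool             # real user or system exposure
    accepted: bool            # accepted or completed action
    quality_score: float      # eval / QA score, assumed to lie in [0, 1]
    guardrail_breached: bool  # True if any risk guardrail threshold was breached
    cost: float               # cost attributed to this event
    has_audit_evidence: bool  # auditable evidence exists


def is_qualified(e: CandidateEvent, min_quality: float = 0.9,
                 cost_ceiling: float = 1.0) -> bool:
    """An event counts only if it passes every gate in the formula."""
    return (
        e.eligible
        and e.exposed
        and e.accepted
        and e.quality_score >= min_quality
        and not e.guardrail_breached
        and e.cost <= cost_ceiling
        and e.has_audit_evidence
    )


def north_star(events: Iterable[CandidateEvent], min_quality: float = 0.9,
               cost_ceiling: float = 1.0) -> int:
    """North Star = number of qualified value events."""
    return sum(1 for e in events if is_qualified(e, min_quality, cost_ceiling))


events = [
    CandidateEvent(True, True, True, 0.95, False, 0.40, True),   # passes every gate
    CandidateEvent(True, True, True, 0.95, True, 0.40, True),    # guardrail breach
    CandidateEvent(True, True, False, 0.99, False, 0.40, True),  # not accepted
    CandidateEvent(True, True, True, 0.95, False, 1.50, True),   # over cost ceiling
]
print(north_star(events))  # 1
```

Writing the gates as a single conjunction makes the design choice explicit: a high score on one gate cannot compensate for failing another.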

A risk-adjusted version better suited to retail finance:

Risk-adjusted North Star =
sum(value_event_count * value_weight * confidence_weight * adoption_weight)
- expected_harm_cost
- quality_failure_cost
- incremental_operating_cost

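Where:

value_event_count: number of qualified business events, for example resolved case, approved memo, completed branch interaction.
value_weight: value weight of the event, derived from labor hours, loss avoided, revenue, customer experience, or risk exposure.
confidence_weight: strength of evidence, which rises with experiments, quasi-experiments, holdouts, and finance sign-off.
adoption_weight: degree of real user adoption and process change.
expected_harm_cost: probability of a risk event multiplied by severity and remediation cost.
quality_failure_cost: cost of rework, QA defects, complaints, appeals, and manual review.
incremental_operating_cost: cost of models, platform, human review, labeling, monitoring, training, and governance.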
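
As a worked illustration of the risk-adjusted formula, the sketch below sums the weighted value over segments and subtracts the three cost terms. The segment names, all numbers, and the choice to express every term in one currency unit are assumptions made for the example, not figures from a real program.

```python
def risk_adjusted_north_star(segments, expected_harm_cost,
                             quality_failure_cost, incremental_operating_cost):
    """Sum weighted value over segments, then subtract the three cost terms."""
    gross = sum(
        s["value_event_count"] * s["value_weight"]
        * s["confidence_weight"] * s["adoption_weight"]
        for s in segments
    )
    return (gross - expected_harm_cost
            - quality_failure_cost - incremental_operating_cost)


# Hypothetical segments; weights in [0, 1], value_weight in currency units.
segments = [
    {"name": "low_risk_intents", "value_event_count": 1000, "value_weight": 10.0,
     "confidence_weight": 0.5, "adoption_weight": 0.8},   # contributes 4000
    {"name": "high_risk_intents", "value_event_count": 200, "value_weight": 25.0,
     "confidence_weight": 0.4, "adoption_weight": 0.5},   # contributes 1000
]

score = risk_adjusted_north_star(
    segments,
    expected_harm_cost=500.0,
    quality_failure_cost=300.0,
    incremental_operating_cost=1200.0,
)
print(round(score, 2))  # 3000.0
```

Because confidence_weight multiplies the value term, weak evidence lowers the score even when event counts are high, which is the behavior the formula is meant to encourage.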

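4.4 North Star Examples by Retail-Finance Scenario

Scenario	North Star	Why it beats usage volume
AML Copilot	`quality-approved AI-assisted investigations completed within SLA with no critical evidence defect`	Focuses on qualified investigations completed and does not reward fast, low-quality case closure
Customer service RAG / Copilot	`customer issues resolved with grounded AI assistance and no reopen or policy breach`	Constrains resolution rate, evidence, rework, and policy risk together
Credit Memo Assistant	`credit memos completed with AI assistance, underwriter acceptance and no policy exception defect`	AI only assists the human decision and does not override credit-approval accountability
Wealth / Branch Advisor Assistant	`compliant client interactions improved by AI with advisor acceptance and suitability guardrail pass`	Prevents sales conversion from being placed above suitability and compliance
AI Platform	`production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates`	Platform value comes from reusable, controlled delivery, not from the number of models connected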

5. North Star to Input Metrics

5.1 Metric Tree Logic

Business goal
-> North Star
-> Qualified value event
-> Input metric groups
-> Product levers
-> Guardrails
-> Evidence and benefit realization

5.2 Input Metric Layers

Level	Metric group	Key question	Example
L1 Eligibility	Coverage	Which objects should AI affect	eligible cases, eligible users, eligible workflow steps
L2 Exposure	Real exposure	Did the target objects actually see or use AI	exposed cases, AI suggestion visible rate, default-on share
L3 Adoption	Adoption behavior	Did users adopt the AI output or action	accepted suggestion rate, draft edit rate, repeat use
L4 Workflow change	Process change	Did AI shorten, simplify, or improve the process	time-to-summary, touches per case, queue age
L5 AI quality	Behavior quality	Is the AI output usable, evidence-backed, and auditable	groundedness, citation correctness, policy adherence
L6 Business outcome	Business result	Did customer, operations, revenue, or risk outcomes improve	FCR, loss avoided, approval cycle time, complaint rate
L7 Unit economics	Unit economics	Does unit value cover unit cost	cost per resolved case, cost per accepted memo
L8 Guardrail	Risk constraint	Did unacceptable harm occur	PII leakage, wrong advice, false negative, fairness gap
L9 Evidence quality	Evidence strength	Is the metric improvement attributable	experiment pass, DiD estimate, holdout comparison

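5.3 Customer Service AI North Star Tree Example

Business goal:
  Reduce customer service cost, raise first-contact resolution, and control policy errors and complaint risk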

North Star:
  grounded AI-assisted customer issues resolved without reopen or policy breach

Input metrics:
  coverage:
    eligible intent coverage
    approved policy knowledge coverage
  exposure:
    agent-visible AI answer rate
    customer self-service AI answer exposure
  adoption:
    accepted answer rate
    answer edit distance
    repeat use by agent cohort
  quality:
    citation correctness
    policy adherence
    unsupported claim rate
  workflow:
    AHT
    after-call work time
    transfer rate
    reopen rate
  business:
    FCR
    complaint rate
    cost per resolved contact
  guardrail:
    wrong fee disclosure
    unauthorized promise
    PII leakage
    vulnerable customer escalation miss
  unit economics:
    token and retrieval cost per resolved contact
    QA and human review cost

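5.4 Input Metrics and Product Levers

Input metric	Actionable product lever	Common misjudgment
Eligible workflow coverage	Expand the intent taxonomy, policy knowledge coverage, and tool integration	Bringing every scenario under AI, so risk gets out of control
Exposure rate	Show by default, embed in the workbench, reduce switching cost	Forced exposure while users bypass it or copy-paste into external tools
Acceptance rate	Improve citation quality, editable drafts, and structured next steps	High acceptance may come from user over-trust
Edit distance	Improve format, tone, and context injection	Low editing does not mean correct; QA sampling is needed
Workflow cycle time	Auto-prefill, summarization, ranking, tool calls	Time falls but rework rises
Groundedness	Retrieval filtering, rerank, enforced citations, refusing to answer when evidence is insufficient	A citation exists but does not support the conclusion
Cost per value event	model routing, cache, prompt compression, small model	Cost cutting degrades quality or increases risk
Repeat use	onboarding, manager cadence, workflow fit	Repeat use may only show that AI replaced search, not that it created value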

6. Guardrail Metrics

Guardrails are a release contract, not a peripheral dashboard metric. An AI product may optimize its North Star, but it may not cross a guardrail.

6.1 Guardrail Classes

Guardrail class	Retail-finance example	Threshold strategy	Owner
Customer harm	Wrongful denial, incorrect fee explanation, misleading repayment guidance, incorrect investment advice	critical = 0; medium breaches have a pause threshold	Business / CX / Risk
Compliance / policy	Unauthorized promise, violation of KYC/AML/credit policy, record-retention failure	critical = 0; cap on policy defect rate	Compliance / Legal
Privacy / security	PII leakage, unauthorized retrieval, successful prompt injection, sensitive fields written to logs	zero tolerance for critical leakage	Privacy / Security
Model behavior	hallucinated rationale, unsupported claim, wrong citation, overconfident answer	Thresholds set by risk tier	AI PM / EvalOps
Decision quality	Credit memo misses a key risk, AML evidence defect, wrong escalation / de-escalation	Hard gate for high-risk cases	Risk / Ops
Fairness / segment	A specific age, language, region, channel, or risk tier is wrongly affected	gap cap + slice review	Fair lending / Risk
Operational	queue backlog, manual override spike, QA capacity overload, fallback failure	bounded degradation	Ops
Financial	loss rate, chargeback, refund, margin erosion, review cost overrun	risk appetite threshold	Finance / Risk
Reliability	latency, timeout, tool error, retrieval empty rate, rollback failure	SLO/SLA threshold	Platform / Engineering
Engineering delivery	change failure rate, MTTR, incident recurrence, deployment rollback	DORA-style reliability guardrail	Platform / SRE

6.2 Threshold Types

Threshold type	Use	Example
Zero tolerance	For unacceptable risk	`PII leakage critical = 0`, `unauthorized credit decision = 0`
Bounded degradation	Allows slight fluctuation within limits	`AHT increase <= 3%`, `latency P95 <= 2.5s`
Segment parity	Prevents average gains from masking harm	`approval support defect gap by protected class <= approved threshold`
Capacity limit	Prevents work from being shifted to humans	`QA queue backlog <= baseline + 10%`
Cost ceiling	Prevents unit economics from getting out of control	`AI cost per resolved contact <= benefit per contact * 20%`
Stop trigger	Pause or roll back once reached	`high severity complaint attributable to AI >= 3 in rolling 7 days`
Review trigger	Does not necessarily roll back, but requires review	`manual override increases > 15% for two consecutive weeks`
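
The threshold types above can be expressed as data plus a small evaluator. The sketch below is one possible shape, under stated assumptions: the metric keys, the `Guardrail` class, and the mapping of each example threshold to a "stop" or "review" action are choices made for illustration, and the real mapping belongs to the risk owner.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Guardrail:
    name: str
    action: str                                  # "stop" or "review" (assumed mapping)
    breached: Callable[[Dict[str, float]], bool]


# Thresholds mirror the examples in the table; the metric keys are hypothetical.
GUARDRAILS: List[Guardrail] = [
    Guardrail("pii_leakage_critical", "stop",
              lambda m: m["pii_leakage_critical"] > 0),
    Guardrail("ai_complaints_rolling_7d", "stop",
              lambda m: m["high_sev_ai_complaints_7d"] >= 3),
    Guardrail("latency_p95", "review",
              lambda m: m["latency_p95_s"] > 2.5),
    Guardrail("qa_backlog", "review",
              lambda m: m["qa_backlog"] > 1.10 * m["qa_backlog_baseline"]),
    Guardrail("cost_ceiling", "review",
              lambda m: m["ai_cost_per_contact"] > 0.20 * m["benefit_per_contact"]),
]


def evaluate(metrics: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return (decision, breached guardrail names); any stop breach dominates."""
    breached = [g for g in GUARDRAILS if g.breached(metrics)]
    if any(g.action == "stop" for g in breached):
        decision = "stop"
    elif breached:
        decision = "review"
    else:
        decision = "continue"
    return decision, [g.name for g in breached]


healthy = {
    "pii_leakage_critical": 0, "high_sev_ai_complaints_7d": 1,
    "latency_p95_s": 1.8, "qa_backlog": 100, "qa_backlog_baseline": 100,
    "ai_cost_per_contact": 0.10, "benefit_per_contact": 1.00,
}
print(evaluate(healthy))                               # ('continue', [])
print(evaluate({**healthy, "latency_p95_s": 3.0})[0])  # review
```

Keeping the rules as declarative entries makes it easier to review threshold changes through the same change policy as other metric definitions.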

6.3 Guardrail Matrix Example

Use case	North Star	Guardrail	Stop / review rule
AML Copilot	Qualified AI-assisted investigations completed	critical evidence defect, SAR narrative unsupported claim, analyst override spike	critical defect = stop expansion; override spike = expert review
Customer service RAG	Evidence-backed first-contact resolution	wrong policy answer, vulnerable customer escalation miss, reopen rate	wrong regulated policy answer = stop affected intent
Credit Memo	Compliant memo accepted by the underwriter	unauthorized recommendation, fair lending sensitive wording, missing adverse action reason	unauthorized decision language = release block
Wealth / Branch	Improved compliant client interactions	unsuitable recommendation, unapproved product promotion, complaint spike	suitability breach = immediate disable for segment
AI Platform	Controlled AI workflows delivered through the platform	change failure rate, policy bypass, cost overrun, audit log gap	audit log gap = platform gate fail

7. Risk-Adjusted Value

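AI product value cannot count only efficiency or revenue. Retail finance must include risk, quality, operations, governance, and customer harm in net value.

7.1 Core Formulas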

Gross incremental value =
  incremental revenue
  + cost avoided
  + loss avoided
  + rework avoided
  + capacity redeployed value
  + risk exposure reduction value

AI total cost =
  model and infrastructure cost
  + data / labeling / eval cost
  + human review and QA cost
  + platform support cost
  + change management cost
  + governance and audit cost
  + vendor and legal cost

Expected risk cost =
  probability of harm * severity * remediation cost
  + regulatory / compliance exposure
  + customer compensation and complaint handling
  + reputational and operational disruption adjustment

Risk-adjusted net value =
  credible incremental value * adoption realization factor * quality pass factor
  - AI total cost
  - expected risk cost
  - opportunity cost

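7.2 Parameter Definitions

Parameter	Definition	Evidence source
Credible incremental value	Incremental benefit attributable to AI, not the whole observed change	A/B, cluster test, DiD, CausalImpact, holdout
Adoption realization factor	Share that actually enters the workflow and changes behavior	exposure log, accepted action, manager audit
Quality pass factor	Share of value that passes quality and risk thresholds	eval report, QA sample, expert review
AI total cost	Full cost of running, governing, and changing the product	finance model, cloud bill, vendor contract, ops staffing
Expected risk cost	Expected cost of risk events	risk register, incident history, severity matrix
Opportunity cost	Value forgone by not investing the same resources in other AI use cases	portfolio scoring, capacity plan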
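
To make the arithmetic of the core formulas concrete, here is a small worked sketch. All amounts are invented for illustration, the function and parameter names are hypothetical, and treating severity as a unitless multiplier on remediation cost is an assumption about how the severity matrix is scaled.

```python
def expected_risk_cost(p_harm, severity, remediation_cost,
                       regulatory_exposure=0.0, compensation=0.0,
                       disruption_adjustment=0.0):
    """Probability * severity * remediation cost, plus the additive exposure terms."""
    return (p_harm * severity * remediation_cost
            + regulatory_exposure + compensation + disruption_adjustment)


def risk_adjusted_net_value(credible_incremental_value, adoption_factor,
                            quality_pass_factor, ai_total_cost,
                            risk_cost, opportunity_cost):
    """Discount credible value by adoption and quality, then subtract all costs."""
    realized = credible_incremental_value * adoption_factor * quality_pass_factor
    return realized - ai_total_cost - risk_cost - opportunity_cost


# Illustrative annual figures in one currency unit (assumptions, not benchmarks).
risk = expected_risk_cost(p_harm=0.02, severity=3.0, remediation_cost=100_000.0,
                          regulatory_exposure=20_000.0, compensation=4_000.0)
net = risk_adjusted_net_value(
    credible_incremental_value=1_000_000.0,
    adoption_factor=0.7,
    quality_pass_factor=0.9,
    ai_total_cost=250_000.0,
    risk_cost=risk,
    opportunity_cost=50_000.0,
)
print(round(risk, 2))  # 30000.0
print(round(net, 2))   # 300000.0
```

In this example the adoption and quality factors remove 370,000 of the credible value before any cost is subtracted, which shows why those two factors deserve as much evidence as the headline estimate.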

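7.3 Risk Adjustment Is More Than a Deduction

Risk adjustment does not veto high-risk projects outright; it makes the decision clearer:

Situation	Decision implication
High value, high risk, strong evidence, strong controls	Controlled scale with strong gates and phased expansion
High value, high risk, weak evidence	Pilot only; prioritize filling gaps in causal and control evidence
Medium value, low risk, strong reuse	Can be rolled out as a platform pattern
Low value, high governance cost	Stop, or return to process optimization
Small single-point benefit, large portfolio reuse	Evaluate as a platform; do not reject on single use case ROI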

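7.4 Retail-Finance Value Types

Value type	Example	Notes
Labor efficiency	Lower customer service AHT, less time spent on AML evidence gathering	Counts as realized only when labor is reduced, reassigned, or released to higher-value tasks
Capacity creation	The same team handles more cases and shortens the backlog	Must show that quality and risk did not deteriorate
Revenue uplift	Wealth next-step suggestions raise conversion; branch cross-sell is better targeted	Must deduct suitability, complaint, and long-term customer value risk
Loss avoidance	Lower fraud loss, lower AML false negative risk	Requires a counterfactual and tracking of delayed outcomes
Quality improvement	Lower rework, reopen, and QA defect rates	Can be converted into cost, risk, or customer experience value
Risk exposure reduction	Fewer audit findings, better evidence completeness	Finance may not book it directly, but it can enter the risk-adjusted portfolio score
Platform leverage	Reusing gateway, eval, and observability shortens time to production	Demonstrate with DORA-style lead time, change failure, MTTR, and reuse rate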

8. Causal / Experimental Evidence

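8.1 Evidence Ladder

Evidence level	Evidence type	Conclusion it can support	Limitation
L0 Anecdote	User interviews, expert examples, demo screenshots	Surfaces opportunities and failure modes	Cannot prove value
L1 Descriptive analytics	Usage, adoption rate, before/after trends	Shows correlated change	No counterfactual
L2 Baseline comparison	pre/post, target vs actual	Initial estimate of improvement	Vulnerable to seasonality and process change
L3 Matched / adjusted analysis	propensity matching, case mix adjustment	Reduces selection bias	Depends on observable confounders
L4 Quasi-experiment	DiD, interrupted time series, CausalImpact, synthetic control	Builds a more credible counterfactual when randomization is not possible	Assumptions must be tested
L5 Randomized experiment	A/B, cluster randomization, switchback, champion-challenger	Strongest causal evidence	Financial settings require control of risk and interference
L6 Scaled holdout	Long-term holdout, phased rollout, policy experiment	Supports scaling and sustained value	Requires organizational discipline and ethical boundaries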
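
To show why an L4 estimate differs from an L2 pre/post comparison, the sketch below computes a basic two-group, two-period difference-in-differences (DiD) from group means. The review-time numbers are invented, and the estimate is only credible under the parallel-trends assumption, which this code does not test.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Two-group, two-period DiD: treatment change minus control change."""
    treat_change = mean(treat_post) - mean(treat_pre)
    ctrl_change = mean(ctrl_post) - mean(ctrl_pre)
    return treat_change - ctrl_change


# Hypothetical review minutes per case, before and after an AI rollout.
treat_pre, treat_post = [40, 44], [31, 35]      # treatment team means: 42 -> 33
ctrl_pre, ctrl_post = [41, 43], [38.5, 40.5]    # control team means:   42 -> 39.5

naive_pre_post = mean(treat_post) - mean(treat_pre)
did = did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(naive_pre_post)  # -9
print(did)             # -6.5
```

The naive pre/post figure credits AI with the full 9-minute drop, while the DiD estimate removes the 2.5-minute improvement that the control team also experienced.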

8.2 Choosing an Experiment Design in Retail Finance

Use case	Recommended design	Randomization unit	Key guardrails
Customer service Copilot	Agent-level or team-level cluster A/B	agent, team, queue	wrong policy answer, complaint, reopen, AHT
AML Copilot	Team / jurisdiction phased rollout + matched case analysis	analyst team, case cohort	evidence defect, SAR narrative quality, false negative proxy
Credit Memo Assistant	Underwriter / branch cluster or eligible application randomization	underwriter, application	fair lending, policy exception, appeal overturn
Wealth Advisor Assistant	Advisor cohort controlled rollout + compliance audit	advisor, branch, client segment	unsuitable recommendation, complaint, disclosure miss
AI Platform capability	Team cohort rollout + DORA-style before/after with control teams	product team, use case team	change failure, incident, cost overrun, audit gap

8.3 Telemetry That Must Be Recorded

Telemetry	Purpose	Example fields
Assignment	Who was assigned to treatment / control	experiment_id, unit_id, variant, assignment_time
Eligibility	Who is eligible to receive AI	risk_tier, workflow_step, case_type, exclusion_reason
Exposure	Who actually saw or was affected by AI	AI_visible, AI_suggestion_generated, tool_result_shown
Adoption	Whether the user adopted the output	accepted, edited, ignored, override_reason
Action	What happened after adoption	response_sent, case_escalated, memo_submitted
Outcome	Downstream result	resolved, reopened, QA_passed, loss_avoided
Guardrail	Risk and quality	policy_breach, complaint, PII_block, fairness_slice
Cost	Unit cost	tokens, model_cost, review_minutes, platform_cost
Version	Component versions	model, prompt, retriever, policy, tool_schema, knowledge_index
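
A minimal sketch of how these telemetry fields feed the eligibility, exposure, and adoption layers is shown below. The `TelemetryRecord` class is a hypothetical, heavily reduced record (a real one would carry the version, cost, and guardrail fields from the table as well), and the funnel definitions are assumptions that a metric contract would need to confirm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TelemetryRecord:
    """Reduced, hypothetical record covering four rows of the telemetry table."""
    unit_id: str
    variant: str       # assignment: "treatment" or "control"
    eligible: bool     # eligibility
    ai_visible: bool   # exposure
    accepted: bool     # adoption


def _rate(numerator: int, denominator: int) -> Optional[float]:
    # Return None rather than dividing by zero when a layer is empty.
    return numerator / denominator if denominator else None


def funnel(records: List[TelemetryRecord]) -> Dict[str, Optional[float]]:
    """Each layer is conditioned on the previous one: eligible -> exposed -> accepted."""
    eligible = [r for r in records if r.eligible]
    exposed = [r for r in eligible if r.ai_visible]
    accepted = [r for r in exposed if r.accepted]
    return {
        "eligible_count": len(eligible),
        "exposure_rate": _rate(len(exposed), len(eligible)),
        "acceptance_rate": _rate(len(accepted), len(exposed)),
    }


records = [
    TelemetryRecord("c1", "treatment", True, True, True),
    TelemetryRecord("c2", "treatment", True, True, False),
    TelemetryRecord("c3", "treatment", True, False, False),
    TelemetryRecord("c4", "treatment", False, False, False),  # ineligible, excluded
]
result = funnel(records)
print(result["eligible_count"], result["acceptance_rate"])  # 3 0.5
```

Conditioning each rate on the previous layer keeps ineligible or unexposed units out of the denominators, which is one of the errors the telemetry is meant to prevent.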

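8.4 Common Causal Threats

Threat	How it shows up in AI products	Control
Selection bias	High performers are more willing to use AI	Randomized default-on, encouragement design, matched analysis
Seasonality	Holidays, regulatory cycles, and marketing seasons affect metrics	Concurrent control, time series, switchback
Case mix shift	The types of cases handled change after the pilot	case complexity adjustment, fixed eligibility
Management attention	Pilot teams get more training and supervisor attention	Treat training as a separate treatment, or train all groups the same way
Spillover	The control group learns prompts from the treatment group	team cluster, isolated knowledge base, contamination log
Metric drift	Definitions or source systems change	metric contract, lineage, change freeze
Outcome delay	Fraud loss, complaints, and appeals appear late	Delay windows; separate the leading proxy from the final outcome
Risk displacement	Cost saved in one place increases risk in another	risk-adjusted value and cross-functional guardrails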

9. Benefits Realization

Benefits realization turns AI value from a business case estimate into an evidence chain that finance and the business owner can recognize.

9.1 Standard Process

Problem baseline
-> Value hypothesis
-> Metric contract
-> Measurement design
-> Pilot evidence
-> Adoption proof
-> Quality and risk proof
-> Finance translation
-> Benefits register
-> Scale / stop decision
-> Post-scale audit

9.2 Benefits Register Fields

Field	How to fill in	Retail-finance example
Benefit id	Stable identifier	`AML-COPILOT-BEN-001`
Business owner	Person accountable for the benefit	Head of Financial Crime Operations
Baseline	Volume, cost, quality, and risk before AI	18,000 cases per month, median review 42 min
Target	Pilot or scale target	15% reduction in review time for qualified cases
Metric contract	Metric definition and data source	`aml.case_review_minutes_p50`
Evidence design	How attribution is done	phased rollout + matched case complexity
Observed change	Change observed	treatment team median down 9 min
Incremental estimate	Increment after causal adjustment	DiD estimate -6.5 min per case
Adoption proof	Evidence of real adoption	AI evidence summary accepted in 72% of eligible cases
Quality proof	Quality evidence	critical evidence defect = 0, QA pass +3.2pp
Risk adjustment	Risk cost or restriction	high-risk typology remains human-first
Cost	Full AI cost	model, retrieval, QA, training, support
Finance treatment	How it is booked or managed	capacity redeployed to backlog reduction
Sign-off	Approval status	business + finance signed at monthly value review
Scale decision	Decision	expand to two additional analyst teams

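9.3 How to State Realized Benefits

Benefit claim	Immature statement	Mature statement
Time saved	AI saves 5 minutes per summary	In 68% of eligible cases, adopted AI output saved 3.8 minutes after case mix adjustment, with no rise in QA defects
Cost reduction	Customer service cost fell 20%	Cost per resolved contact in treatment queues fell 8.4%, with no deterioration in the reopen and complaint guardrails
Quality improvement	The judge score is higher	Citation correctness rose 12pp, critical wrong policy answer defects were 0, and QA pass rose 5pp
Risk reduction	AML is safer	Mandatory evidence completeness rose 9pp, with no increase in high-risk typology escalation misses
Platform reuse	The platform improves efficiency	Teams using the shared eval/gateway cut lead time by 35%, with no rise in change failure rate and lower MTTR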

9.4 Monthly Value Review

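Topic	Key question	Output
Value	Does the incremental benefit hold up against a credible counterfactual	benefits register update
Adoption	Have target users changed how they work	adoption intervention
Quality	Do eval and QA support scaling	release / scale gate
Risk	Is residual risk within appetite	risk acceptance or mitigation
Cost	Do unit economics improve with scale	routing, cache, and capacity decisions
Portfolio	Continue, expand, platformize, or stop	scale / stop memo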

10. Retail-Finance Cases

10.1 AML Investigator Copilot

Positioning: AI helps the analyst gather evidence, summarize transactions, generate a narrative draft, and flag missing information, but does not replace accountability for the SAR / STR judgment.

Layer	Metric design
North Star	`quality-approved AI-assisted investigations completed within SLA with no critical evidence defect`
Value event	The analyst completes a case using the AI evidence summary or narrative draft, QA passes, and there is no critical defect
Input metrics	eligible case coverage, AI summary acceptance, evidence checklist completeness, time-to-evidence, narrative edit distance
AI quality	citation correctness, unsupported claim rate, missing evidence detection, typology coverage
Business outcome	case cycle time, backlog aging, QA pass, SAR narrative quality, capacity redeployed
Guardrail	critical evidence defect = 0, high-risk typology escalation miss, PII overexposure, analyst overreliance
Causal evidence	team phased rollout + matched case complexity + expert QA sample
Benefit realization	capacity released to aged backlog, not automatic headcount reduction claim

Key design points:

AI output must distinguish source fact, inference, and missing evidence.
High-risk typologies keep stronger human review.
Metrics are segmented by jurisdiction, typology, risk tier, and analyst tenure.
Benefits cannot count only review minutes; evidence completeness and regulatory quality also matter.

10.2 Customer Service / Contact Center RAG + Copilot

Positioning: AI gives agents or self-service channels policy answers with citations, next-step suggestions, and reply drafts.

Layer	Metric design
North Star	`customer issues resolved with grounded AI assistance and no reopen or policy breach`
Value event	The AI answer is adopted by the agent or the customer completes self-service, and the issue is resolved on first contact with no reopen and no policy error
Input metrics	intent coverage, approved knowledge freshness, answer exposure, acceptance rate, edit distance, self-service containment
AI quality	groundedness, citation correctness, policy effective-date correctness, tone suitability
Business outcome	FCR, AHT, after-call work, transfer rate, complaint rate, cost per resolved contact
Guardrail	wrong policy answer, unauthorized promise, vulnerable customer escalation miss, PII leakage, latency
Causal evidence	queue / agent cluster A/B, triggered analysis for actual exposure
Benefit realization	Lower cost per resolved contact with no deterioration in reopen or complaint

Key design points:

The North Star is not answers generated, because generating more does not mean resolving more.
Set zero-tolerance critical errors for regulated intents.
self-service containment must be read together with complaint, repeat contact, and abandonment.

10.3 Credit / Underwriting Decision Support

Positioning: AI helps organize application materials, explain policy, draft the memo, and flag missing information; the final credit decision remains with authorized staff and the model governance process.

Layer	Metric design
North Star	`underwriter-accepted AI-assisted credit memos completed with policy adherence and no fairness guardrail breach`
Value event	The memo is adopted by the underwriter or submitted with few edits, and QA / policy review passes
Input metrics	eligible application coverage, document extraction confidence, policy citation correctness, memo acceptance, missing-info detection
AI quality	unsupported risk rationale, wrong policy citation, adverse action wording risk, data extraction accuracy
Business outcome	cycle time, rework rate, condition clearing time, decision consistency, underwriter capacity
Guardrail	unauthorized recommendation, fair lending sensitive defect, appeal overturn, policy exception miss
Causal evidence	underwriter / branch cluster rollout + case mix adjustment + delayed outcome monitoring
Benefit realization	Faster, more consistent memos and document-request handling; a rise in approval rate is not automatically counted as a benefit

Key design points:

AI should not output final authorization language such as "approve / decline" unless the governance scope explicitly allows it.
approval rate cannot serve as a standalone North Star, because it may introduce credit risk and fairness risk.
Segment by product, customer group, channel, region, and underwriter tenure.

10.4 Wealth / Branch Advisor Assistant

Positioning: AI helps RMs / advisors / branch staff with client preparation, product knowledge retrieval, compliant talking points, and next best conversation, but must not bypass suitability, disclosure, or supervision processes.

Layer	Metric design
North Star	`compliant advisor interactions improved by AI with accepted preparation and suitability guardrail pass`
Value event	The advisor uses AI preparation or conversation suggestions, the client interaction is completed, and compliance QA passes
Input metrics	advisor activation, client prep usage, approved content coverage, accepted next-step suggestion, meeting follow-up completion
AI quality	suitability context completeness, disclosure citation, product restriction accuracy, tone appropriateness
Business outcome	meeting conversion, follow-up completion, client retention, assets retained, advisor productivity
Guardrail	unsuitable recommendation, unapproved product promotion, complaint, missing disclosure, vulnerable client miss
Causal evidence	branch / advisor cohort rollout + matched client segment + compliance sampling
Benefit realization	Only incremental revenue or retained value that passes the suitability and complaint guardrails is recognized

Key design points:

Sales volume is not set directly as the North Star, to avoid incentivizing wrong recommendations.
Use source-of-truth data and policy citations for client profiles and product suitability.
Value is segmented by client segment, advisor tenure, and branch capacity.

10.5 AI Platform / Model Gateway / EvalOps Platform

Positioning: a platform is not about "how many models are connected" but about letting multiple AI use cases deliver business value faster, more safely, more cheaply, and more auditably.

Layer	Metric design
North Star	`production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates`
Value event	A production AI workflow uses the shared gateway / eval / observability / guardrail and passes the release gate
Input metrics	platform adoption by use case, reuse rate, eval coverage, trace coverage, policy coverage, model routing hit rate
Engineering / DORA	lead time to AI change, deployment frequency, change failure rate, MTTR, reliability
Business outcome	time-to-market reduction, duplicated platform spend avoided, incident reduction, cost per workflow
Guardrail	audit log gap, policy bypass, unapproved model use, cost overrun, change failure
Causal evidence	cohort comparison between platform and non-platform teams, before/after with delivery complexity adjustment
Benefit realization	Platform benefits are allocated across use case reuse, risk control, shorter delivery cycles, and lower cost

Key design points:

The platform North Star is not API calls or number of models connected.
DORA metrics demonstrate delivery and reliability improvement, but must connect to AI use case value.
Platform value must deduct platform team, infrastructure, governance, and migration costs.

11. Product Analytics Governance

The goal of AI product metric governance is for metrics to be trusted jointly by people, BI, LLMs, the eval harness, the Value Office, and audit.

11.1 Metric Contract

Field	Required content	Example
Metric name	Stable name and namespace	`cx.ai_grounded_resolution_rate`
Business decision	Which decision it supports	Whether to extend customer service RAG to more intents
Definition	Business definition	Share of eligible tickets resolved on first contact with AI assistance and correct citations
Formula	Executable formula	`grounded_resolved_contacts / eligible_exposed_contacts`
Numerator	Numerator definition	AI exposed, resolved, no reopen in 7 days, citation QA pass
Denominator	Denominator definition	eligible and exposed contacts in approved intents
Grain	Granularity	contact_id
Time window	Time basis	contact close date, rolling 7 / 28 days
Dimensions	Sliceable dimensions	intent, channel, agent_team, customer_segment, risk_tier
Source-of-truth	Authoritative system	contact center system + QA system + AI trace store
Data quality SLO	Quality target	exposure event completeness >= 99%, QA linkage >= 98%
AI consumption policy	How AI may use it	May be used for Value Office summaries; may not be used to generate individual performance penalties
Guardrail linkage	Linked constraints	wrong policy answer, complaint, reopen, PII leakage
Owner	Accountability model	CX ops accountable, AI PM responsible, risk consulted
Change policy	Change process	Changes to intent eligibility require product + risk approval
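
The contract's formula, numerator, and denominator can be checked against each other with a short sketch like the one below. The dictionary keys and the `APPROVED_INTENTS` set are hypothetical stand-ins for fields in the source-of-truth systems; only the metric logic follows the contract rows above.

```python
from typing import Dict, List, Optional

# Hypothetical approved-intent list; the real one is owned by product + risk.
APPROVED_INTENTS = {"fee_inquiry", "card_replacement"}


def in_denominator(c: Dict) -> bool:
    """Eligible and exposed contacts in approved intents."""
    return c["eligible"] and c["ai_exposed"] and c["intent"] in APPROVED_INTENTS


def in_numerator(c: Dict) -> bool:
    """Denominator contacts that are resolved, not reopened in 7 days, citation QA pass."""
    return (in_denominator(c) and c["resolved"]
            and not c["reopened_within_7d"] and c["citation_qa_pass"])


def ai_grounded_resolution_rate(contacts: List[Dict]) -> Optional[float]:
    """grounded_resolved_contacts / eligible_exposed_contacts; None if undefined."""
    denominator = sum(1 for c in contacts if in_denominator(c))
    numerator = sum(1 for c in contacts if in_numerator(c))
    return numerator / denominator if denominator else None


def contact(intent, eligible=True, ai_exposed=True, resolved=True,
            reopened_within_7d=False, citation_qa_pass=True) -> Dict:
    return {"intent": intent, "eligible": eligible, "ai_exposed": ai_exposed,
            "resolved": resolved, "reopened_within_7d": reopened_within_7d,
            "citation_qa_pass": citation_qa_pass}


contacts = [
    contact("fee_inquiry"),                            # counts in both
    contact("fee_inquiry", reopened_within_7d=True),   # denominator only
    contact("card_replacement", ai_exposed=False),     # excluded: not exposed
    contact("mortgage_hardship"),                      # excluded: intent not approved
]
print(ai_grounded_resolution_rate(contacts))  # 0.5
```

Defining the numerator as a strict subset of the denominator guards against a rate above 1 and against ineligible contacts entering either side.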

11.2 RACI

Activity	Business owner	AI PM	BA	Data owner	Risk / Compliance	Finance	Platform	Ops
North Star definition	A	R	R	C	C	C	C	C
Metric contract	A	R	R	R	C	C	C	C
Guardrail threshold	C	R	C	C	A/R	C	C	C
Experiment design	A	R	R	C	C	C	C	C
Telemetry spec	C	R	R	R	C	I	R	C
Benefits register	A	R	C	I	C	A/R	I	C
Release / scale gate	A	R	C	C	A/R	C	R	R
Metric incident response	A	R	C	R	C	C	R	R

11.3 Governance Forums

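Forum	Cadence	Key question	Output
Metric design review	Use case discovery / before pilot	Does the North Star represent a qualified value event	approved metric tree
Guardrail review	Before release and before high-risk changes	Are risk thresholds enforceable	guardrail matrix
Experiment review	Before pilot	Are the counterfactual, randomization, sample, and telemetry credible	experiment brief approval
Value review	monthly	Are benefits attributable, realizable, and expandable	benefits register update
Metric incident review	After an incident	Did the metric, data, or interpretation mislead a decision	metric correction and comms
Portfolio review	quarterly	Which use cases to scale, stop, or platformize	funding decision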

11.4 Metric Incident

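AI metric incidents include:

Lost exposure events inflate or deflate adoption.
A knowledge base version change goes unrecorded and affects the groundedness definition.
A dashboard includes ineligible cases in the denominator.
An LLM analytics assistant interprets an unapproved metric.
An experiment fails the sample ratio mismatch (SRM) check but its results are still used for the scale decision.
The benefits register attributes the entire pre/post improvement to AI.

Response process: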

Detect
-> classify severity
-> freeze affected decision
-> identify lineage and consumers
-> correct metric / dashboard / AI summary
-> communicate impacted decisions
-> update contract and tests
-> add regression check
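
One of the incidents listed above, a failed sample ratio mismatch check, can be detected automatically and wired into the "add regression check" step. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom, computed with the standard library only. The 0.001 alert threshold is a common convention assumed here, not a value from this playbook, and the counts must be randomization units (for example teams in a cluster design), not downstream contacts.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_treatment - exp_treatment) ** 2 / exp_treatment
            + (n_control - exp_control) ** 2 / exp_control)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control: int, n_treatment: int,
            expected_treatment_share: float = 0.5,
            alpha: float = 0.001) -> bool:
    """Flag the experiment when the observed split is very unlikely by chance."""
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < alpha


print(has_srm(5000, 5050))  # False: small imbalance, consistent with chance
print(has_srm(5000, 5400))  # True: imbalance too large for a 50/50 design
```

A flagged experiment would then follow the response process above, starting with freezing any scale decision that depends on it.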

12. Templates

12.1 North Star Metric Canvas

Field	Filled example
Product / use case	Customer Service AI Policy Copilot
Target user	Contact center agents handling regulated servicing intents
Business problem	High AHT and reopen rate caused by policy search friction and inconsistent answers
One-sentence North Star	`grounded AI-assisted customer issues resolved without reopen or policy breach`
Qualified value event	eligible contact, AI answer exposed, agent accepted or customer self-served, resolved, no reopen in 7 days, citation QA pass
Primary value dimension	cost per resolved contact and customer issue resolution quality
Input metric groups	coverage, exposure, acceptance, groundedness, cycle time, reopen, cost
Guardrails	wrong policy answer, PII leakage, vulnerable customer escalation miss, complaint spike
Causal evidence plan	agent-team cluster A/B with triggered analysis and QA sample
Finance translation	incremental resolved contacts * adjusted cost reduction - AI total cost - risk cost
Scale rule	expand only if FCR improves, AHT decreases, critical policy defects remain zero, and cost per resolved contact improves

12.2 Metric Tree Template

Business outcome:
  reduce cost per resolved customer issue while maintaining policy compliance

North Star:
  grounded AI-assisted customer issues resolved without reopen or policy breach

Qualified value event:
  eligible contact + AI exposure + accepted answer + resolution + QA pass + no reopen

Input metrics:
  coverage:
    approved intent coverage
    knowledge base freshness
  exposure:
    AI answer visible rate
    triggered contact rate
  adoption:
    accepted answer rate
    edit distance
  quality:
    citation correctness
    unsupported claim rate
  workflow:
    AHT
    transfer rate
    after-call work
  outcome:
    FCR
    reopen rate
    complaint rate
  economics:
    AI cost per resolved contact
    QA cost per contact
  guardrails:
    wrong policy answer
    PII leakage
    vulnerable customer miss

12.3 Guardrail Matrix Template

Guardrail	Severity	Metric	Threshold	Detection	Decision
Wrong regulated policy answer	Critical	expert QA critical defect count	`0` per release gate	QA sample + user report	stop affected intent and run root cause
PII leakage	Critical	confirmed leakage event	`0`	DLP + trace audit	disable feature path and incident response
Reopen rate	High	7-day reopen rate	no statistically credible increase above control	experiment scorecard	pause scale and diagnose intents
Latency	Medium	P95 response latency	<= 2.5 seconds for agent desktop	observability dashboard	route optimization or fallback
Cost overrun	Medium	AI cost per resolved contact	<= approved unit economics ceiling	cost ledger	model routing review

12.4 Experiment Design Brief

Field	Filled example
Hypothesis	Grounded AI policy answers reduce AHT and reopen rate for eligible servicing intents without increasing policy defects
Treatment	AI answer with citation shown in agent desktop
Control	Existing policy search and macro workflow
Unit of assignment	Agent team
Unit of analysis	Contact
Eligibility	Approved servicing intents, excluding complaints and vulnerable customer cases in first pilot
Primary metric	Grounded AI-assisted resolved contact rate
Secondary metrics	AHT, after-call work, transfer rate, agent acceptance
Guardrails	wrong policy answer, PII leakage, complaint, reopen, latency
Exposure logging	contact_id, agent_id, team_id, variant, AI_visible, accepted, edited
Analysis	ITT + triggered exposure analysis, case mix adjustment, pre-registered slices
Decision rule	scale if primary improves, cost per resolved contact improves, no critical guardrail breach
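
Because assignment is by agent team while analysis is by contact, one simple and conservative option is to aggregate contacts to team means before comparing arms. The sketch below computes an ITT estimate and a triggered estimate for AHT under that approach. It assumes each logged row carries a hypothetical `would_trigger` flag recorded in both arms (counterfactual trigger logging) and an `aht_min` outcome; neither name comes from the brief.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def team_means(rows: Iterable[Mapping], only_triggered: bool = False) -> Dict[str, List[float]]:
    """Average AHT per team, grouped by variant ('treatment' or 'control')."""
    by_team = defaultdict(list)
    for r in rows:
        if only_triggered and not r["would_trigger"]:
            continue  # triggered analysis keeps only contacts the AI would fire on
        by_team[(r["variant"], r["team_id"])].append(r["aht_min"])
    by_variant = defaultdict(list)
    for (variant, _team), values in by_team.items():
        by_variant[variant].append(mean(values))
    return by_variant


def effect(rows: Iterable[Mapping], only_triggered: bool = False) -> float:
    """Difference in mean team-level AHT: treatment minus control."""
    m = team_means(list(rows), only_triggered)
    return mean(m["treatment"]) - mean(m["control"])
```

Calling `effect(rows)` gives the ITT estimate over all assigned contacts; `effect(rows, only_triggered=True)` restricts both arms to contacts where the AI answer would have been shown. Uncertainty intervals and case mix adjustment are left out of this sketch.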

12.5 Benefits Realization Register

Field	Filled example
Benefit id	`CX-AI-BEN-004`
Use case	Customer Service AI Policy Copilot
Baseline	620,000 monthly eligible contacts, AHT P50 7.8 min, reopen 11.2%
Incremental effect	Cluster experiment estimates -0.7 min AHT per exposed resolved contact
Adoption	64% eligible exposed contacts accepted AI answer
Quality	citation QA pass 96.4%, critical policy defect 0
Gross value	capacity equivalent from reduced handle time for accepted contacts
Cost	model, retrieval, QA sample, training, platform support
Risk adjustment	complaint guardrail unchanged; high-risk intents excluded until separate gate
Recognized benefit	capacity redeployed to backlog and peak coverage
Sign-off	CX operations and finance approved for limited scale
Next review	60-day post-scale value audit with expanded intent set
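
The gross value row can be turned into a rough capacity figure. The sketch below is a simplification under stated assumptions: it applies the estimated per-contact AHT reduction only to exposed contacts where the AI answer was accepted, and `exposed_share` is a purely hypothetical input because the register does not state what share of eligible contacts were exposed.

```python
def monthly_capacity_hours(
    eligible_contacts: int,
    exposed_share: float,            # hypothetical: not given in the register
    accepted_share: float,           # share of exposed contacts that accepted the AI answer
    minutes_saved_per_contact: float,
) -> float:
    """Capacity equivalent, in agent hours per month, from reduced handle time."""
    qualifying_contacts = eligible_contacts * exposed_share * accepted_share
    return qualifying_contacts * minutes_saved_per_contact / 60.0


# Register values: 620,000 eligible contacts, 64% acceptance, 0.7 min saved.
# The 0.5 exposure share is an illustrative placeholder only.
hours = monthly_capacity_hours(620_000, 0.5, 0.64, 0.7)
```

The result is a gross capacity equivalent. It becomes a recognized benefit only after cost, risk adjustment, and the redeployment path in the register are agreed with finance.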

12.6 Scale / Stop Memo

Section	Required content
Decision	scale, limited scale, continue pilot, redesign, stop
Evidence	North Star movement, input metric movement, causal estimate, confidence
Guardrails	pass / breach / trend / mitigation
Unit economics	value per event, cost per event, scale cost curve
Adoption	target user adoption and workflow change evidence
Residual risk	risk owner view and control plan
Benefits	finance treatment and benefits register update
Platform reuse	reusable components, shared controls, additional use cases
Decision log	owner, date, rationale, conditions

13. Review Checklists

13.1 North Star Review

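Is the North Star a qualified value event rather than call volume, login count, or generation count?
Does it clearly connect customer value, business value, and the AI contribution?
Can it be decomposed into input metrics that teams can directly move?
Does it have explicit guardrails that prevent trading risk for growth?
Can it be sliced by risk tier, channel, customer segment, team, and region?
Can it be translated into value that finance can discuss?
Does it avoid encouraging out-of-authority automation or fast but low-quality completion?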

13.2 Metric Contract Review

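Are the metric name, definition, formula, numerator, denominator, grain, and time window clear?
Are the source of truth, data quality SLO, lineage, and owner explicit?
Does the AI consumption policy state which AI systems may use this metric?
Do definition changes go through approval, versioning, and impact analysis?
Can the metric be consumed consistently by eval, dashboards, LLM analytics, and the Value Office?
Is there a defined freeze, correction, and communication process for metric incidents?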
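
Part of this review can be automated by treating the contract as a typed record and flagging empty fields. The sketch below is a minimal illustration; the `MetricContract` class and its field names are hypothetical and cover only a subset of the checklist.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class MetricContract:
    # Hypothetical schema; extend with numerator, denominator, SLO, lineage as needed.
    name: str = ""
    definition: str = ""
    formula: str = ""
    grain: str = ""
    time_window: str = ""
    source_of_truth: str = ""
    owner: str = ""
    version: str = ""
    ai_consumers: List[str] = field(default_factory=list)  # AI systems allowed to read it

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A contract with a non-empty `missing_fields()` result would fail the review before anyone debates the definition itself.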

13.3 Experiment / Causal Evidence Review

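Are treatment, control, eligibility, assignment, and exposure clearly defined?
Does the randomization unit match the analysis unit, and how is clustering handled?
Are assignment, exposure, adoption, action, outcome, and guardrail events logged?
Are the primary, secondary, and guardrail metrics and the slices declared in advance?
Have SRM (sample ratio mismatch), case mix, seasonality, spillover, and metric drift been checked?
Where randomization is not possible, are the quasi-experimental assumptions written down and sensitivity-checked?
Are both ITT and triggered analysis reported, to avoid looking only at adopters?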
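
The SRM check in this list can be sketched as a chi-square goodness-of-fit test on the assignment counts. The version below uses only the standard library and relies on the fact that, with one degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`. The function name is hypothetical, and the alert threshold is a team decision not specified here.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """p-value for the observed split against the planned assignment ratio."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Chi-square survival function with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A very small p-value suggests the assignment or logging pipeline is dropping units unevenly, so effect estimates should not be trusted until the cause is found. With cluster assignment, the counts should be assignment units (for example agent teams), not contacts.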

13.4 Benefits Realization Review

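Was the baseline frozen before the pilot?
Are observed change and incremental effect reported separately?
Are benefits net of AI total cost, human review, QA, training, governance, and risk cost?
Is saved time converted into a concrete realization path: headcount, capacity, SLA, revenue, or risk reduction?
Have finance, the business owner, and the risk owner agreed to the measurement basis?
Is a post-scale audit scheduled after scale-up to catch decay of the pilot effect?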

13.5 Guardrail Review

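Do critical guardrails have zero tolerance or a hard stop?
Do guardrails cover customer harm, compliance, privacy, security, fairness, operations, finance, and reliability?
Are thresholds differentiated by risk tier?
Are the detection source, owner, response time, and rollback defined?
Is there protection against average metrics masking harm in high-risk segments?
Are guardrail breaches factored into risk-adjusted value?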
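
The point about averages masking segment harm can be illustrated with a per-segment rate check. This is a minimal sketch with hypothetical event fields (`segment`, `harm`); it ignores sample-size uncertainty, which a real review would need to handle for small segments.

```python
from collections import defaultdict
from typing import Iterable, List, Mapping


def segment_breaches(events: Iterable[Mapping], threshold: float) -> List[str]:
    """Segments whose harm rate exceeds the threshold, even if the overall rate does not."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [harm_count, event_count]
    for e in events:
        counts = totals[e["segment"]]
        counts[0] += int(e["harm"])
        counts[1] += 1
    return sorted(seg for seg, (harm, n) in totals.items() if harm / n > threshold)
```

In a mix of 100 low-risk events with 1 harm and 10 high-risk events with 2 harms, the overall rate is about 2.7% and passes a 5% threshold, while the high-risk segment sits at 20% and is flagged.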

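14. Anti-patterns

Anti-pattern	Symptom	Why it is dangerous	Better practice
Treating model accuracy as the North Star	"95% accuracy" becomes the only success metric	Cannot demonstrate adoption, workflow, or business value	Use qualified value event + eval guardrail
Treating AI call volume as value	API calls, answers generated, tokens consumed grow	Rewards ineffective usage and cost inflation	Count accepted and quality-passed workflow outcomes
Reporting only hours saved	Subjective estimate multiplied by usage count	Finance is unlikely to accept it; ignores rework and risk	Use causal estimate + capacity redeployment
Looking only at averages	Overall AHT falls	High-risk segments may deteriorate	Slice by risk tier, channel, customer segment
Guardrails as an afterthought	Complaints and compliance issues are reviewed only after launch	Harm in high-risk scenarios may be irreversible	Define stop rules before the release gate
Looking only at adopters	People who adopt AI perform better	Selection bias overstates the effect	Review assignment, exposure, ITT, and triggered together
Treating pilot-team success as full-scale success	The strongest team performs well in the pilot	Adoption and quality decay after scale	phased rollout + heterogeneity analysis
Treating platform integrations as platform value	20 models and 50 applications connected	Does not mean faster, safer, or cheaper	Use DORA-style delivery + value/risk gates
Using revenue directly as the wealth AI North Star	Sales rise after recommendations	May sacrifice suitability and customer trust	compliant accepted interactions + risk-adjusted revenue
Metrics without an owner	Nobody is accountable for dashboard numbers	Cannot fix or explain the metric during an incident	metric contract + RACI + incident playbook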

15. 30-Day Training Plan

Goal: Within 30 days, produce a portfolio-ready AI Product Metrics / North Star / Value Measurement evidence pack, going deep on one financial retail use case while also covering the platform governance perspective.

Day	Training topic	Output
1	Choose a use case: AML, customer service, credit, wealth/branch, or AI platform	Use case decision card
2	Write the problem baseline: volume, cost, quality, risk, cycle time	Baseline table
3	Define the AI intervention and decision boundary	Intervention brief
4	Identify users, workflow, system-of-record, and risk owner	Stakeholder / system map
5	Design 3 candidate North Stars and score them	North Star option matrix
6	Select the North Star and qualified value event	North Star canvas
7	Decompose the North Star into input metrics	Metric tree v1
8	Define AI quality / eval metrics	Eval-to-business matrix
9	Design guardrail categories and critical thresholds	Guardrail matrix v1
10	Write the metric contract: name, formula, grain, source, owner	Metric contract
11	Design telemetry: assignment, exposure, adoption, action, outcome	Telemetry spec
12	Draw data lineage: source -> metric -> dashboard -> AI summary	Metric lineage map
13	Design the causal evidence plan: A/B, cluster, DiD, or time series	Experiment / quasi-experiment brief
14	Identify causal threats: selection, seasonality, spillover, case mix	Threat-to-validity register
15	Design benefits register fields and finance translation	Benefits register v1
16	Work a sample calculation of gross value, cost, and risk adjustment	Risk-adjusted value model
17	Define the adoption realization factor and quality pass factor	Value adjustment rules
18	Write the pilot release gate: eval, risk, ops, cost	Pilot gate checklist
19	Write the scale / stop decision rule	Scale / stop memo skeleton with filled example
20	Add DORA-style platform metrics or engineering delivery metrics	Platform metric addendum
21	Build a full example for the AML or customer service case	Case metric pack
22	Build a comparison example for the credit or wealth/branch case	Second case comparison
23	Design dashboard information architecture: exec, product, risk, ops layers	Dashboard outline
24	Write the product analytics governance RACI	RACI table
25	Write the metric incident response process	Metric incident playbook
26	Compile anti-patterns and interview risk points	Anti-pattern cheat sheet
27	Write 5 advanced interview answers	Interview answer pack v1
28	Assemble all outputs into a portfolio narrative	Portfolio storyline
29	Self-assess: are North Star, guardrail, causal, benefits, and governance all present?	Review checklist results
30	Complete the final portfolio pack	Final AI product metrics portfolio pack

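Completion criteria:

There is one clear North Star, and it is not an activity metric.
There is a complete metric tree and guardrail matrix.
There is a causal evidence plan that goes beyond pre/post comparison.
There is a risk-adjusted value model and a benefits register.
There is product analytics governance: metric contract, RACI, incident process.
You can explain benefits realization and risk control in financial retail language.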

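16. Interview Answers

Q1: How would you design a North Star for a bank's AI customer service Copilot?

30-second version

I would not use call volume or number of generated answers as the North Star. I would define it as "the number of AI-assisted customer issues that are evidence-grounded, accepted, resolved on first contact, and free of reopen or policy breach". This metric captures customer value, business value, AI contribution, and the risk boundary at once.

2-minute version

I would first define the qualified value event: eligible contact, AI answer exposed, agent accepted or customer completed self-service, issue resolved, no reopen within 7 days, citation QA passed, no critical policy defect. Then I would decompose the input metrics: intent coverage, knowledge freshness, exposure rate, acceptance rate, citation correctness, AHT, FCR, reopen, complaint, cost per resolved contact. Guardrails include wrong policy answer, PII leakage, vulnerable customer escalation miss, and latency. For attribution I would prefer an agent-team cluster A/B, logging assignment, exposure, adoption, and outcome so that looking only at adopters does not introduce selection bias. For benefits realization I would recognize only the incremental resolved volume that passes the quality and risk gates, then deduct model, QA, training, and platform cost.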

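Q2: What is the relationship between AI eval metrics and business metrics?

30-second version

Eval metrics show whether AI behavior is acceptable; business metrics show whether the workflow and business results improved. Eval is a release gate, not ROI itself.

2-minute version

For example, eval metrics for a credit memo assistant include policy citation correctness, unsupported risk rationale, missing document detection, and prohibited recommendation language. These decide whether it may enter pilot. The business metrics, however, are cycle time, rework, underwriter capacity, appeal overturn, and policy exception defect. The two are connected through the metric tree: if citation correctness improves, memo acceptance and rework should improve, which in turn affects cycle time and cost. If eval improves but business results do not move, the cause may be poor workflow embedding, user distrust, a case mix change, or the AI improving only an unimportant part of the work.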

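Q3: How do you prove that an AI project's benefits are not caused by seasonality or team selection?

30-second version

Build a counterfactual. Where randomization is possible, run an A/B test or cluster rollout; where it is not, use DiD, interrupted time series, matched cohort, or synthetic control, and log assignment, exposure, adoption, outcome, and guardrail.

2-minute version

I would first define treatment and eligibility, then choose the randomization unit. Customer service suits an agent-team cluster A/B; AML may use phased rollout plus matched case complexity; wealth branches suit advisor cohort rollout plus compliance sampling. In analysis, I separate ITT from triggered exposure and pre-declare the primary metric, guardrails, and segment slices. I also check SRM, case mix, seasonality, spillover, metric drift, and outcome delay. Finally, only the credible incremental effect goes into the benefits register; the full pre/post change is not credited to AI.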
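
When randomization is not available, the simplest of the quasi-experimental options named above is a two-group, two-period difference-in-differences. The sketch below shows only the arithmetic; it assumes parallel trends between the groups, and the function and argument names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def did_estimate(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change
```

If treated AHT falls by 0.9 minutes while control AHT falls by 0.3 minutes over the same period, the estimate credits AI with -0.6 minutes, not the full -0.9 pre/post change. These figures are illustrative only.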

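Q4: How do you compute risk-adjusted AI ROI?

30-second version

I start from credible incremental value, multiply by the adoption and quality pass factors, then deduct full AI cost and expected risk cost. In financial retail you cannot count efficiency alone; quality, compliance, customer harm, and governance cost must also be deducted.

2-minute version

The formula is: risk-adjusted net value = credible incremental value * adoption realization * quality pass - model/platform/data/QA/change/governance cost - expected risk cost - opportunity cost. Take an AML Copilot: if each case saves 6 minutes but only 70% of eligible cases actually adopt it, and high-risk typologies need stronger human review, then the benefit must be scaled by the adoption and quality pass ratios. If an evidence defect or an unsupported claim in a SAR narrative occurs, the related value should be deducted or frozen. Finance sign-off must also state how saved time is realized as backlog reduction, capacity redeployment, or cost reduction.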
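
The formula can be written as a small function so that each adjustment is explicit. This is a sketch under stated assumptions: the parameter names are hypothetical, all cost categories are collapsed into one `total_ai_cost` input, and the monetary figures in the usage line are illustrative placeholders (only the 0.70 adoption factor echoes the AML example).

```python
def risk_adjusted_net_value(
    credible_incremental_value: float,
    adoption_realization: float,   # share of eligible cases that actually adopt, 0..1
    quality_pass: float,           # share of adopted cases that pass quality gates, 0..1
    total_ai_cost: float,          # model, platform, data, QA, change, governance
    expected_risk_cost: float,
    opportunity_cost: float = 0.0,
) -> float:
    """Credible value scaled by adoption and quality, net of cost and expected risk."""
    realized_value = credible_incremental_value * adoption_realization * quality_pass
    return realized_value - total_ai_cost - expected_risk_cost - opportunity_cost


# Illustrative numbers only; none of the monetary values come from the playbook.
net = risk_adjusted_net_value(1_000_000, 0.70, 0.95, 300_000, 50_000, 20_000)
```

Keeping adoption and quality as separate multipliers makes it visible which factor is limiting realized value when the net figure disappoints.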

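Q5: How do you design a North Star for an AI platform?

30-second version

The platform North Star should not be the number of models connected or API call volume. It should be "the number of production AI workflows delivered through the shared platform that pass the value, risk, reliability, and cost gates".

2-minute version

AI platform value comes from reuse and governance capability, such as model gateway, prompt registry, eval harness, observability, cost ledger, policy guardrail, and audit log. Input metrics include platform adoption by use case, eval coverage, trace coverage, policy coverage, model routing hit rate, and cost per workflow. DORA-style metrics can demonstrate delivery capability: lead time to AI change, deployment frequency, change failure rate, MTTR, and reliability. But these must connect to business use case value; otherwise the platform is only engineering activity. Guardrails include unapproved model use, audit log gap, policy bypass, cost overrun, and change failure.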

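Q6: What do you do if the North Star rises but a guardrail deteriorates?

30-second version

First freeze scale-up and assess guardrail severity. For a critical breach, stop or roll back the affected path directly; for a non-critical one, diagnose by segment, case type, version, and workflow step, and do not count the growth as realizable value until the risk owner accepts it.

2-minute version

The North Star cannot override the risk boundary. For example, if resolved contacts for customer service AI rise but wrong policy answers or complaints also rise, I would first check whether the problem is concentrated in certain intents, knowledge base versions, agent cohorts, or prompt versions. A critical policy defect means shutting down the affected intent and updating the eval set and release gate. For bounded degradation, the options are to limit traffic, add human review, adjust the retrieval filter, or return to pilot. In the benefits calculation, events affected by a guardrail breach do not count as qualified value events and also feed into expected risk cost.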

17. Portfolio Deliverables

An advanced AI Product Metrics portfolio can include the following assets:

Artifact	Content	Evaluation criteria
One-page metric strategy	use case, North Star, qualified event, metric tree, guardrail	Explains value and risk on one page
North Star option matrix	Trade-offs among 2-3 candidate North Stars	Explains why an activity metric was not chosen
AI product metric taxonomy	business, workflow, adoption, eval, guardrail, cost, platform	Clear categories, clear owners
Metric contract	definition, formula, grain, source, owner, AI consumption policy	Reusable by dashboard / eval / audit
Guardrail matrix	severity, threshold, detection, decision	Has hard stops and review triggers
Experiment / causal design brief	treatment, control, unit, telemetry, analysis, threats	Can demonstrate incremental value
Risk-adjusted value model	gross value, cost, risk adjustment, finance treatment	Does not overstate ROI
Benefits realization register	baseline, target, incremental estimate, sign-off, scale decision	Can support a Value Office review
Dashboard information architecture	Layered views for exec, product, risk, ops, platform	Does not pile all metrics into one view
Product analytics governance pack	RACI, metric incident, change policy, lineage	Auditable and operable
Financial retail case pack	Examples for AML, customer service, credit, wealth/branch, AI platform	Shows ability to transfer across domains
Interview answer pack	Answers to 6-10 advanced questions	Explains North Star, causal, guardrail, and benefits clearly

Suggested portfolio narrative:

I did not start with model accuracy.
I started with the business decision and the qualified value event.
Then I designed the North Star, input metrics, guardrails, causal evidence, risk-adjusted value and benefits realization governance.
This is how I would help a regulated financial institution scale AI without confusing usage with value.

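18. Final Check: Is the Metric System Mature?

Question	Mature answer
What is the North Star	A qualified value event constrained by quality, adoption, risk, and cost
What are the input metrics	Metrics that product, data, model, operations, and platform teams can directly move
What are the guardrails	Each has an owner, threshold, detection, response, and stop / rollback rule
How is AI quality measured	eval, QA, human review, trace, and online monitoring working together
How is business benefit proven	Experiments or quasi-experiments establish a credible counterfactual
How are benefits realized	Entered in the benefits register and accepted by business and finance
How does risk enter ROI	Adjusted by expected risk cost, quality pass, and guardrail breach
How is platform value proven	Reuse, cost, reliability, and DORA-style delivery metrics connected to use case value
How are metrics governed	metric contract, semantic layer, RACI, lineage, change policy, incident
Is it fit for financial retail	Covers compliance, privacy, fairness, customer harm, audit, and human accountability boundaries

Closing in one sentence:

Advanced AI product measurement is not "checking whether AI gets used"; it is proving that AI creates attributable, realizable, scalable, and governable business value under controlled risk.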