AI Product Metrics / North Star Value Measurement Playbook
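- Audience: AI PMs, AI BAs, AI Product Architects, AI Value Office Leads, and retail financial services digital leads who already have a BA / CBAP / product management foundation.
- Core question: how an AI product moves from "the model performs well and users say it is useful" to "it has a North Star, input metrics, guardrails, causal evidence, realized benefits, risk adjustment, finance recognition, and auditable governance".
- Learning objective: design an advanced AI product metric system for retail financial services scenarios such as AML, customer service, credit, wealth / branch, and AI platform, and turn the metric tree, experiment design, benefits realization, risk-adjusted value, and product analytics governance into portfolio assets.
- Scope: this is not an introductory metrics course. It does not cover DAU/MAU basics, funnel terminology, or BI reporting tutorials. A formal retail financial services project must be jointly confirmed by the business owner, risk, model risk, legal, compliance, privacy, security, finance, data owner, architecture, and operations.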
Source Anchors
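These sources serve as anchors for product measurement, trustworthy AI, delivery capability, and value governance. They do not constitute legal, regulatory, audit, or vendor-selection advice.
| Anchor | Official / primary source | How this playbook uses it |
|---|---|---|
| Amplitude North Star Metric official guide | https://amplitude.com/north-star | Anchors the product-management language of the North Star Metric: connecting customer value, product usage, business results, and team input metrics into one actionable metric system. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI metric risk, guardrails, measurement evidence, monitoring, and the governance loop. |
| DORA | https://dora.dev/ | Uses the language of software delivery and operational capability to connect the AI platform, engineering productivity, reliability, change risk, and business goals, so that AI platform value does not stop at the number of demos. |
| Trustworthy Online Controlled Experiments | https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59 | Uses the language of online controlled experiments, organization-level metrics, experiment trustworthiness, and long-term impact to support causal evidence for AI products. |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Used for measurement, monitoring, content risk, human oversight, and evidence design in GenAI risk scenarios. |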
1. One-Sentence Positioning
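AI Product Metrics is a product governance system that connects AI behavior quality, real user adoption, business outcomes, risk guardrails, causal attribution, unit economics, and benefits realization. The North Star is the steering wheel of that system, not an isolated KPI.
A shorter interview version:
AI North Star = qualified value events created by AI, adjusted by trust, risk, adoption and unit economics.
The key for a senior AI PM / BA / Architect is not to "list many metrics" but to answer:
- Which decision, action, workflow, or customer experience the AI actually changed.
- Which metric represents sustained customer value and business value.
- Which input metrics the team can move directly.
- Which guardrails, once they deteriorate, require expansion to stop.
- What evidence shows the improvement came from AI rather than seasonality, staff selection, management attention, or process reshuffling.
- Which portion of the benefit is recognized by finance, risk, ops, and the business owner.
- Whether metric definitions, lineage, owners, changes, permissions, and incident response are governable.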
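2. Why AI Product Measurement Cannot Stop at Model Metrics
Common ways AI products are reported:
- Accuracy improved by 8%.
- Answer quality judge score reached 4.5/5.
- User visits rose 30%.
- 200,000 summaries were generated.
- An estimated 30% of manual time is saved.
These signals are useful but not enough to support a scale / stop / fund decision, because:
| Single metric | What it shows | What it does not show | What to add |
|---|---|---|---|
| Model accuracy / eval score | AI output is closer to the reference on the test set | Whether users adopt it, whether it improves the workflow, whether it reduces risk | exposure, adoption, workflow outcome, guardrail |
| Usage count | Someone opened or called it | Whether a qualified value event occurred | accepted action, completed workflow, quality pass |
| Time saved | A step is faster | Whether rework increased, quality dropped, or the saved time was actually realized | rework, QA defect, capacity redeployment |
| Automation rate | The system handles more tasks | Whether it is safe, whether it harms customers, whether it shifts risk to humans | exception rate, manual override, customer harm |
| Cost per call | Inference cost | Whether the unit business value holds up | cost per resolved case, cost per risk avoided |
| Customer satisfaction | Experience signal | Whether it leads to long-term value and whether compliance risk is under control | retention, complaint, policy breach, segment fairness |
Maturity progression: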
Model metric
-> AI behavior metric
-> workflow metric
-> customer / risk / financial outcome
-> causal evidence
-> risk-adjusted value
-> benefits realization
-> portfolio funding decision
In one sentence:
Model metrics are one part of release evidence, not the end point of AI product value.
3. AI Product Metrics Taxonomy
3.1 High-Level Classification
| Metric class | Question to answer | Retail financial services example | Owner | Typical use |
|---|---|---|---|---|
| North Star Metric | What is the core qualified value event the AI product creates | risk-adjusted resolved customer issues, quality-approved AI-assisted AML cases | Product / Business owner | Strategic alignment, team focus, scale decisions |
| Business Outcome Metric | Whether the final business result improved | cost per case, loss avoided, FCR, approval cycle time, complaint rate | Business / Finance | benefits realization, funding gate |
| Workflow Metric | Whether AI changed process performance | handle time, case aging, queue backlog, reopen rate, touches per case | Ops / BA | Process optimization, bottleneck identification |
| Adoption Metric | Whether target users adopt in the right scenarios | eligible user activation, repeat use, accepted suggestion rate, copilot-assisted workflow share | Product / Ops | Adoption management, training, and UX iteration |
| AI Quality / Eval Metric | Whether AI behavior meets task requirements | groundedness, citation correctness, policy adherence, tool success, format validity | AI PM / EvalOps | release gate, regression test |
| Decision Quality Metric | Whether human-AI decisions are better | escalation precision, memo defect rate, override quality, appeal overturn rate | Risk / Ops | Governance of high-risk decision support |
| Guardrail Metric | Whether AI causes unacceptable harm | PII leakage, unauthorized advice, wrong denial, complaint spike, fairness gap | Risk / Compliance | stop rule, rollback, incident |
| Cost / Unit Economics Metric | Whether value covers cost | cost per resolved case, token cost per accepted action, review cost per case | Finance / Platform | route optimization, budget, scale economics |
| Data / Knowledge Metric | Whether the data AI depends on is trustworthy | freshness, coverage, retrieval recall, policy effective-date correctness | Data owner | RAG, eval, audit |
| Platform / Engineering Metric | Whether AI delivery capability is scalable and reliable | lead time to AI change, deployment frequency, change failure rate, MTTR, platform reuse rate | Platform / Engineering | DORA-style platform value, operational maturity |
| Benefits Realization Metric | Whether promised benefits were realized | finance-signed benefit, redeployed capacity, avoided loss recognized | Value Office / Finance | portfolio review, scale / stop |
3.2 Metric Object Boundaries
| Easily confused | Correct distinction | Product governance significance |
|---|---|---|
| KPI vs North Star | There can be many KPIs; the North Star is the core value event and direction | Prevents each team from optimizing in a different direction |
| Eval metric vs business metric | Eval measures AI behavior; a business metric measures business results | Prevents a judge score from being packaged as ROI |
| Feature metric vs decision metric | A feature is a model input; a decision metric measures the quality of the business action | Prevents input quality from being mistaken for business benefit |
| Adoption vs impact | Adoption is usage and exposure; impact is an attributable change in outcome | Prevents usage growth from being misreported as value |
| Guardrail vs secondary metric | A guardrail is a constraint and stop rule; a secondary metric explains results | Prevents severe risk from being masked by averages |
| Activity vs value event | Activity is clicks, queries, and generations; a value event is a qualified, completed business result | Prevents "AI is busy" with no business improvement |
3.3 AI Product Measurement Stack
flowchart TB
S[Strategy and risk appetite] --> N[North Star Metric]
N --> V[Qualified value event]
V --> I[Input metrics]
I --> A[AI behavior and eval metrics]
I --> W[Workflow and adoption metrics]
I --> B[Business outcome metrics]
A --> G[Guardrail metrics]
W --> G
B --> R[Risk-adjusted value]
G --> R
R --> C[Causal evidence]
C --> F[Finance-recognized benefits]
F --> P[Portfolio scale / stop decision]
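4. North Star Metric: Advanced Design Principles
4.1 Criteria for a Qualified North Star
An AI product North Star must satisfy all nine conditions:
| Condition | Test question | Disqualifying signal |
|---|---|---|
| Clear user value | Why is the user or business process better off | The metric only counts model calls or page visits |
| Clear business value | Why does growth in this metric support business results | No link to cost, risk, revenue, or customer experience |
| Explainable AI contribution | How does AI affect this value event | AI is only a background tool with no exposure record |
| Movable by the team | Input metrics decompose into product, data, model, and operations actions | The metric is too lagging or too macro |
| Hard to game | Growth cannot come from lowering quality or shifting risk | More cases are auto-closed while rework and complaints rise |
| Constrained by guardrails | Risk thresholds are explicit; growth must not be bought with harm | Looks only at efficiency, not at compliance and customer harm |
| Diagnosable by layer | Can be broken down by segment, channel, risk tier, and team | Averages mask harm to high-risk customer groups |
| Usable baseline | A pre-AI / control / holdout comparison can be established | Only subjective estimates are possible |
| Financially translatable | Maps to cost, revenue, loss avoided, or capacity | Finance cannot sign off |
4.2 Common AI North Star Forms
| Product pattern | Recommended North Star form | Example |
|---|---|---|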
| RAG knowledge assistant | Grounded and accepted resolutions | weekly grounded accepted answers that resolve workflow without QA defect |
| Copilot / draft assistant | Quality-approved assisted work completed | AI-assisted customer responses sent with policy pass and no reopen |
| Decision support | Better decisions under human accountability | risk-reviewed decisions with improved precision and no adverse guardrail breach |
| Automation | Safely automated eligible outcomes | eligible cases safely resolved by AI within SLA and no critical defect |
| Agent workflow | Approved actions completed safely | bounded AI actions completed with human-approved audit trail and no rollback |
| AI platform | Production AI workflows delivering governed value | active production AI use cases passing value, risk and DORA-style reliability gates |
4.3 Recommended Formula: Qualified Value Event
North Star =
sum(Qualified value events)
where each event passes:
target workflow eligibility
real user or system exposure
accepted or completed action
quality / eval threshold
risk guardrail threshold
cost ceiling
auditable evidence
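The gate list above can be read as a predicate applied to each candidate event. The following is a minimal sketch under stated assumptions: the `CandidateEvent` fields, the `min_quality` default of 0.8, and the `cost_ceiling` default of 1.5 are all hypothetical placeholders, not values from this playbook, and a real implementation would read them from the approved metric contract.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CandidateEvent:
    """One candidate business event; every field name here is illustrative."""
    eligible: bool           # target workflow eligibility
    exposed: bool            # real user or system exposure
    action_completed: bool   # accepted or completed action
    quality_score: float     # eval / QA score, assumed to lie in [0, 1]
    guardrail_breach: bool   # True if any risk guardrail threshold was violated
    unit_cost: float         # AI cost attributed to this event
    has_audit_trail: bool    # auditable evidence is linked to the event


def is_qualified(event: CandidateEvent,
                 min_quality: float = 0.8,
                 cost_ceiling: float = 1.5) -> bool:
    """Return True only if the event passes every gate in the formula."""
    return (
        event.eligible
        and event.exposed
        and event.action_completed
        and event.quality_score >= min_quality
        and not event.guardrail_breach
        and event.unit_cost <= cost_ceiling
        and event.has_audit_trail
    )


def north_star(events: Iterable[CandidateEvent],
               min_quality: float = 0.8,
               cost_ceiling: float = 1.5) -> int:
    """Count qualified value events; activity that fails any gate adds nothing."""
    return sum(1 for e in events if is_qualified(e, min_quality, cost_ceiling))
```

Modelling the gates as a conjunction means a high-volume but low-quality period cannot raise the North Star, which is the property the formula is meant to enforce.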
A risk-adjusted version better suited to retail financial services:
Risk-adjusted North Star =
sum(value_event_count * value_weight * confidence_weight * adoption_weight)
- expected_harm_cost
- quality_failure_cost
- incremental_operating_cost
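Where:
- value_event_count: number of qualified business events, for example resolved case, approved memo, completed branch interaction.
- value_weight: value weight of the event; it can come from labor hours, loss avoided, revenue, customer experience, or risk exposure.
- confidence_weight: evidence strength; it increases with experiments, quasi-experiments, holdouts, and finance sign-off.
- adoption_weight: degree of real user adoption and workflow change.
- expected_harm_cost: probability of a risk event multiplied by severity and remediation cost.
- quality_failure_cost: cost of rework, QA defects, complaints, appeals, and manual re-review.
- incremental_operating_cost: cost of model, platform, human review, labeling, monitoring, training, and governance.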
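The risk-adjusted formula can be sketched as a small function. This is an illustration only: the `ValueSegment` grouping (one row per event type or segment) is an assumption made here so the sum has something to iterate over, and the numbers in the usage note are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ValueSegment:
    """One group of qualified value events sharing the same weights (hypothetical grouping)."""
    value_event_count: int
    value_weight: float       # value per event, e.g. in currency units
    confidence_weight: float  # evidence strength, assumed to lie in [0, 1]
    adoption_weight: float    # real adoption and workflow change, assumed in [0, 1]


def risk_adjusted_north_star(segments: Iterable[ValueSegment],
                             expected_harm_cost: float,
                             quality_failure_cost: float,
                             incremental_operating_cost: float) -> float:
    """Weighted sum of qualified value events minus the three cost terms."""
    gross = sum(
        s.value_event_count * s.value_weight * s.confidence_weight * s.adoption_weight
        for s in segments
    )
    return gross - expected_harm_cost - quality_failure_cost - incremental_operating_cost
```

For example, two segments of `ValueSegment(1000, 4.0, 0.5, 0.5)` and `ValueSegment(200, 10.0, 1.0, 0.5)` with costs of 300, 200 and 500 give a gross weighted value of 2000 and a risk-adjusted result of 1000. The weak-evidence segment contributes the same as the much smaller strong-evidence one, which shows how confidence_weight discounts unproven value.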
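4.4 North Star Examples for Retail Financial Services Scenarios
| Scenario | North Star | Why it is better than usage volume |
|---|---|---|
| AML Copilot | quality-approved AI-assisted investigations completed within SLA with no critical evidence defect | Focuses on qualified investigation completion; does not reward low-quality fast case closure |
| Customer service RAG / Copilot | customer issues resolved with grounded AI assistance and no reopen or policy breach | Constrains resolution rate, evidence, rework, and policy risk at the same time |
| Credit Memo Assistant | credit memos completed with AI assistance, underwriter acceptance and no policy exception defect | AI only assists human decisions and does not override credit-granting accountability |
| Wealth / Branch Advisor Assistant | compliant client interactions improved by AI with advisor acceptance and suitability guardrail pass | Prevents sales conversion from being placed above suitability and compliance |
| AI Platform | production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates | Platform value comes from reusable, controlled delivery, not the number of models connected |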
5. North Star to Input Metrics
5.1 Metric Tree Logic
Business goal
-> North Star
-> Qualified value event
-> Input metric groups
-> Product levers
-> Guardrails
-> Evidence and benefit realization
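5.2 Input Metric Layers
| Level | Metric group | Key question | Example |
|---|---|---|---|
| L1 Eligibility | Coverage | Which objects should AI affect | eligible cases, eligible users, eligible workflow steps |
| L2 Exposure | Real exposure | Whether target objects actually see or use AI | exposed cases, AI suggestion visible rate, default-on share |
| L3 Adoption | Adoption behavior | Whether users adopt AI outputs or actions | accepted suggestion rate, draft edit rate, repeat use |
| L4 Workflow change | Process change | Whether AI shortens, simplifies, or improves the process | time-to-summary, touches per case, queue age |
| L5 AI quality | Behavior quality | Whether AI output is usable, evidence-backed, and auditable | groundedness, citation correctness, policy adherence |
| L6 Business outcome | Business result | Whether customer, operations, revenue, or risk outcomes improved | FCR, loss avoided, approval cycle time, complaint rate |
| L7 Unit economics | Unit economics | Whether unit value covers unit cost | cost per resolved case, cost per accepted memo |
| L8 Guardrail | Risk constraint | Whether unacceptable harm occurred | PII leakage, wrong advice, false negative, fairness gap |
| L9 Evidence quality | Evidence strength | Whether the metric improvement is attributable | experiment pass, DiD estimate, holdout comparison |
5.3 Customer Service AI North Star Tree Example
Business goal:
Reduce customer service cost, improve first-contact resolution, and control policy errors and complaint risk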
North Star:
grounded AI-assisted customer issues resolved without reopen or policy breach
Input metrics:
coverage:
eligible intent coverage
approved policy knowledge coverage
exposure:
agent-visible AI answer rate
customer self-service AI answer exposure
adoption:
accepted answer rate
answer edit distance
repeat use by agent cohort
quality:
citation correctness
policy adherence
unsupported claim rate
workflow:
AHT
after-call work time
transfer rate
reopen rate
business:
FCR
complaint rate
cost per resolved contact
guardrail:
wrong fee disclosure
unauthorized promise
PII leakage
vulnerable customer escalation miss
unit economics:
token and retrieval cost per resolved contact
QA and human review cost
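5.4 Input Metrics and Product Levers
| Input metric | Actionable product lever | Common misjudgment |
|---|---|---|
| Eligible workflow coverage | Expand the intent taxonomy, policy knowledge coverage, and tool integrations | Bringing every scenario into AI, so risk gets out of control |
| Exposure rate | Show by default, embed in the workbench, reduce switching cost | Forced exposure while users work around it or copy-paste into external tools |
| Acceptance rate | Improve citation quality, editable drafts, structured next steps | High acceptance may come from user over-trust |
| Edit distance | Improve format, tone, and context injection | Low editing does not mean correct; QA sampling is needed |
| Workflow cycle time | Auto-prefill, summarization, ranking, tool calls | Time falls but rework rises |
| Groundedness | Retrieval filtering, rerank, mandatory citations, refusing to answer when evidence is insufficient | A citation exists but does not support the conclusion |
| Cost per value event | model routing, cache, prompt compression, small model | Cost reduction degrades quality or risk |
| Repeat use | onboarding, manager cadence, workflow fit | Repeat use may only indicate search substitution, not value |
6. Guardrail Metrics
A guardrail is a release contract, not a corner metric on a dashboard. An AI product may optimize the North Star but must not cross a guardrail.
6.1 Guardrail Classes
| Guardrail class | Retail financial services example | Threshold strategy | Owner |
|---|---|---|---|
| Customer harm | Wrongful denial, wrong fee explanation, misleading repayment guidance, wrong investment advice | critical = 0; medium breaches have a pause threshold | Business / CX / Risk |
| Compliance / policy | Unauthorized promise, violation of KYC/AML/credit policy, record retention failure | critical = 0; upper limit on policy defect rate | Compliance / Legal |
| Privacy / security | PII leakage, unauthorized retrieval, successful prompt injection, sensitive fields entering logs | zero tolerance for critical leakage | Privacy / Security |
| Model behavior | hallucinated rationale, unsupported claim, wrong citation, overconfident answer | Thresholds set by risk tier | AI PM / EvalOps |
| Decision quality | Credit memo missing a key risk, AML evidence defect, wrong escalation / de-escalation | Hard gate for high-risk cases | Risk / Ops |
| Fairness / segment | Specific age, language, region, channel, or risk-tier groups wrongly harmed | Gap upper limit + slice review | Fair lending / Risk |
| Operational | queue backlog, manual override spike, QA capacity overload, fallback failure | bounded degradation | Ops |
| Financial | loss rate, chargeback, refund, margin erosion, review cost overrun | risk appetite threshold | Finance / Risk |
| Reliability | latency, timeout, tool error, retrieval empty rate, rollback failure | SLO/SLA threshold | Platform / Engineering |
| Engineering delivery | change failure rate, MTTR, incident recurrence, deployment rollback | DORA-style reliability guardrail | Platform / SRE |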
6.2 Threshold Types
| Threshold type | Use | Example |
|---|---|---|
| Zero tolerance | For unacceptable risks | PII leakage critical = 0, unauthorized credit decision = 0 |
| Bounded degradation | Allows slight fluctuation within limits | AHT increase <= 3%, latency P95 <= 2.5s |
| Segment parity | Prevents average gains from masking harm | approval support defect gap by protected class <= approved threshold |
| Capacity limit | Prevents shifting work to humans | QA queue backlog <= baseline + 10% |
| Cost ceiling | Prevents unit economics from getting out of control | AI cost per resolved contact <= benefit per contact * 20% |
| Stop trigger | Pause or roll back once reached | high severity complaint attributable to AI >= 3 in rolling 7 days |
| Review trigger | Does not necessarily roll back, but requires review | manual override increases > 15% for two consecutive weeks |
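A few of the threshold types above can be encoded as a single evaluation function. This is a minimal sketch, not a definitive rule engine: the function name, argument names, and the choice to map a bounded-degradation breach to a review (rather than a stop) are assumptions made for illustration. The numeric limits are the example values from the table.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    CONTINUE = "continue"
    REVIEW = "review"
    STOP = "stop"


def evaluate_guardrails(*,
                        critical_pii_leaks: int,
                        ai_high_severity_complaints_7d: int,
                        aht_increase_pct: float,
                        weekly_override_increase_pct: Sequence[float]) -> Decision:
    """Apply example thresholds in order of severity: stop, then review, then continue."""
    # Zero tolerance: any critical PII leakage stops expansion.
    if critical_pii_leaks > 0:
        return Decision.STOP
    # Stop trigger: >= 3 high-severity AI-attributable complaints in a rolling 7 days.
    if ai_high_severity_complaints_7d >= 3:
        return Decision.STOP
    # Bounded degradation: AHT may rise by at most 3% (treated as a review here by assumption).
    if aht_increase_pct > 3.0:
        return Decision.REVIEW
    # Review trigger: manual override increase > 15% for two consecutive weeks.
    last_two = list(weekly_override_increase_pct)[-2:]
    if len(last_two) == 2 and all(pct > 15.0 for pct in last_two):
        return Decision.REVIEW
    return Decision.CONTINUE
```

Checking stop conditions before review conditions keeps a critical breach from being downgraded when several thresholds fire at once.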
6.3 Guardrail Matrix Example
| Use case | North Star | Guardrail | Stop / review rule |
|---|---|---|---|
| AML Copilot | Qualified AI-assisted investigations completed | critical evidence defect, SAR narrative unsupported claim, analyst override spike | critical defect = stop expansion; override spike = expert review |
| Customer service RAG | Evidence-backed first-contact resolution | wrong policy answer, vulnerable customer escalation miss, reopen rate | wrong regulated policy answer = stop affected intent |
| Credit Memo | Compliant memos accepted by the underwriter | unauthorized recommendation, fair lending sensitive wording, missing adverse action reason | unauthorized decision language = release block |
| Wealth / Branch | Improved compliant client interactions | unsuitable recommendation, unapproved product promotion, complaint spike | suitability breach = immediate disable for segment |
| AI Platform | Controlled AI workflows delivered through the platform | change failure rate, policy bypass, cost overrun, audit log gap | audit log gap = platform gate fail |
7. Risk-Adjusted Value
AI product value cannot count only efficiency or revenue. Retail financial services must include risk, quality, operations, governance, and customer harm in net value.
7.1 Core Formulas
Gross incremental value =
incremental revenue
+ cost avoided
+ loss avoided
+ rework avoided
+ capacity redeployed value
+ risk exposure reduction value
AI total cost =
model and infrastructure cost
+ data / labeling / eval cost
+ human review and QA cost
+ platform support cost
+ change management cost
+ governance and audit cost
+ vendor and legal cost
Expected risk cost =
probability of harm * severity * remediation cost
+ regulatory / compliance exposure
+ customer compensation and complaint handling
+ reputational and operational disruption adjustment
Risk-adjusted net value =
credible incremental value * adoption realization factor * quality pass factor
- AI total cost
- expected risk cost
- opportunity cost
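The last formula above, risk-adjusted net value, can be sketched as follows. The function and parameter names mirror the formula terms; the validation that both factors lie in [0, 1] is an assumption added here, since the formulas treat them as realization shares.

```python
def risk_adjusted_net_value(credible_incremental_value: float,
                            adoption_realization_factor: float,
                            quality_pass_factor: float,
                            ai_total_cost: float,
                            expected_risk_cost: float,
                            opportunity_cost: float) -> float:
    """Discount credible incremental value by adoption and quality, then subtract costs."""
    for name, factor in (("adoption_realization_factor", adoption_realization_factor),
                         ("quality_pass_factor", quality_pass_factor)):
        # Assumption: both factors are shares of value, so they must lie in [0, 1].
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {factor}")
    realized_value = (credible_incremental_value
                      * adoption_realization_factor
                      * quality_pass_factor)
    return realized_value - ai_total_cost - expected_risk_cost - opportunity_cost
```

With purely illustrative inputs of 1,000,000 credible incremental value, factors of 0.5 and 0.5, and costs of 100,000, 50,000 and 25,000, the realized value is 250,000 and the net value is 75,000. The two multiplicative factors remove three quarters of the headline value before any cost is subtracted, which is why adoption and quality evidence matter as much as the causal estimate.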
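7.2 Parameter Definitions
| Parameter | Definition | Evidence source |
|---|---|---|
| Credible incremental value | Incremental benefit attributable to AI, not the entire observed change | A/B, cluster test, DiD, CausalImpact, holdout |
| Adoption realization factor | Share that actually enters the workflow and changes behavior | exposure log, accepted action, manager audit |
| Quality pass factor | Share of value that passes quality and risk thresholds | eval report, QA sample, expert review |
| AI total cost | Full cost of running, governance, and change | finance model, cloud bill, vendor contract, ops staffing |
| Expected risk cost | Expected cost of risk events | risk register, incident history, severity matrix |
| Opportunity cost | Opportunity loss from not investing the same resources in other AI use cases | portfolio scoring, capacity plan |
7.3 Risk Adjustment Is More Than Deducting Points
Risk adjustment is not a veto on high-risk projects; it makes the decision clearer:
| Situation | Decision implication |
|---|---|
| High value, high risk, strong evidence, strong controls | controlled scale, strong gates, phased expansion |
| High value, high risk, weak evidence | pilot only; prioritize filling in causal and control evidence |
| Medium value, low risk, strong reuse | Can spread as a platform pattern |
| Low value, high governance cost | stop, or return to process optimization |
| Small single-point benefit, large portfolio reuse | Evaluate as a platform; do not reject on single use case ROI |
7.4 Retail Financial Services Value Types
| Value type | Example | Notes |
|---|---|---|
| Labor efficiency | Lower customer service AHT, less AML evidence gathering time | Counts as realized only when labor is reduced, reassigned, or released to higher-value tasks |
| Capacity creation | The same team handles more cases and shortens the backlog | Must show that quality and risk did not deteriorate |
| Revenue uplift | Wealth next-step suggestions lift conversion, branch cross-sell is more precise | Must deduct suitability, complaint, and long-term customer value risk |
| Loss avoidance | Lower fraud loss, lower AML false negative risk | Requires a counterfactual and delayed outcome tracking |
| Quality improvement | Less rework, fewer reopens and QA defects | Can be translated into cost, risk, or customer experience value |
| Risk exposure reduction | Fewer audit findings, better evidence completeness | Finance may not book it directly, but it can enter the risk-adjusted portfolio score |
| Platform leverage | Reusing gateway, eval, and observability shortens time to launch | Prove with DORA-style lead time, change failure, MTTR, and reuse rate |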
8. Causal / Experimental Evidence
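8.1 Evidence Ladder
| Evidence level | Evidence type | Conclusion it can support | Limitation |
|---|---|---|---|
| L0 Anecdote | User interviews, expert examples, demo screenshots | Discovering opportunities and failure modes | Cannot prove value |
| L1 Descriptive analytics | Usage, adoption rate, before/after trends | Seeing correlated changes | No counterfactual |
| L2 Baseline comparison | pre/post, target vs actual | Initial estimate of improvement | Vulnerable to seasonality and process changes |
| L3 Matched / adjusted analysis | propensity matching, case mix adjustment | Reduces selection bias | Depends on observable confounders |
| L4 Quasi-experiment | DiD, interrupted time series, CausalImpact, synthetic control | A more credible counterfactual when randomization is not possible | Assumptions need to be tested |
| L5 Randomized experiment | A/B, cluster randomization, switchback, champion-challenger | Strongest causal evidence | Financial scenarios need to control risk and interference |
| L6 Scaled holdout | Long-term holdout, phased rollout, policy experiment | Supports scaling and sustained value | Requires organizational discipline and ethical boundaries |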
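To make the gap between a pre/post comparison and a quasi-experiment concrete, here is a minimal two-period difference-in-differences (DiD) sketch on group means. It assumes parallel trends between treatment and control, ignores standard errors and clustering, and uses invented review-time samples; a production analysis would use a regression framework with proper inference.

```python
from statistics import fmean
from typing import Sequence


def pre_post_change(pre: Sequence[float], post: Sequence[float]) -> float:
    """Naive pre/post change in the mean for a single group (no counterfactual)."""
    return fmean(post) - fmean(pre)


def did_estimate(treat_pre: Sequence[float], treat_post: Sequence[float],
                 control_pre: Sequence[float], control_post: Sequence[float]) -> float:
    """2x2 DiD: treatment change minus control change.

    Valid only under the parallel-trends assumption, which must be checked separately.
    """
    return (pre_post_change(treat_pre, treat_post)
            - pre_post_change(control_pre, control_post))


# Illustrative review minutes per case (made-up numbers).
treat_pre, treat_post = [42.0, 44.0, 40.0], [33.0, 35.0, 31.0]
control_pre, control_post = [41.0, 43.0, 42.0], [39.0, 41.0, 40.0]

naive = pre_post_change(treat_pre, treat_post)                       # -9.0
adjusted = did_estimate(treat_pre, treat_post, control_pre, control_post)  # -7.0
```

In this toy data the control group also improved by 2 minutes, so only 7 of the 9 observed minutes are attributed to the treatment. That difference is the part a benefits register should not credit to AI.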
8.2 Choosing Experiment Designs for Retail Financial Services
| Use case | Recommended design | Randomization unit | Key guardrail |
|---|---|---|---|
| Customer service Copilot | Agent-level or team-level cluster A/B | agent, team, queue | wrong policy answer, complaint, reopen, AHT |
| AML Copilot | Team / jurisdiction phased rollout + matched case analysis | analyst team, case cohort | evidence defect, SAR narrative quality, false negative proxy |
| Credit Memo Assistant | Underwriter / branch cluster or eligible application randomization | underwriter, application | fair lending, policy exception, appeal overturn |
| Wealth Advisor Assistant | Advisor cohort controlled rollout + compliance audit | advisor, branch, client segment | unsuitable recommendation, complaint, disclosure miss |
| AI Platform capability | Team cohort rollout + DORA-style before/after with control teams | product team, use case team | change failure, incident, cost overrun, audit gap |
8.3 Required Telemetry
| Telemetry | Purpose | Example fields |
|---|---|---|
| Assignment | Who was assigned to treatment / control | experiment_id, unit_id, variant, assignment_time |
| Eligibility | Who is eligible to receive AI | risk_tier, workflow_step, case_type, exclusion_reason |
| Exposure | Who actually saw or was affected by AI | AI_visible, AI_suggestion_generated, tool_result_shown |
| Adoption | Whether the user adopted | accepted, edited, ignored, override_reason |
| Action | What was done after adoption | response_sent, case_escalated, memo_submitted |
| Outcome | Downstream result | resolved, reopened, QA_passed, loss_avoided |
| Guardrail | Risk and quality | policy_breach, complaint, PII_block, fairness_slice |
| Cost | Unit cost | tokens, model_cost, review_minutes, platform_cost |
| Version | Component versions | model, prompt, retriever, policy, tool_schema, knowledge_index |
8.4 Common Causal Threats
| Threat | How it shows up in AI products | Control |
|---|---|---|
| Selection bias | High performers are more willing to use AI | Randomized default-on, encouragement design, matched analysis |
| Seasonality | Holidays, regulatory cycles, and marketing seasons affect metrics | Concurrent control, time series, switchback |
| Case mix shift | The types of cases handled change after the pilot | case complexity adjustment, fixed eligibility |
| Management attention | The pilot team gets more training and supervisor attention | Treat training as a separate treatment, or train all groups identically |
| Spillover | The control group learns the treatment group's prompts | team cluster, isolated knowledge base, contamination log |
| Metric drift | Metric definition or source system changes | metric contract, lineage, change freeze |
| Outcome delay | Fraud loss, complaints, and appeals appear with a delay | Delay windows; keep leading proxy and final outcome separate |
| Risk displacement | Cost is saved in one place while risk is added in another | risk-adjusted value and cross-functional guardrails |
9. Benefits Realization
Benefits realization turns AI value from a business case estimate into an evidence chain that finance and the business owner can recognize.
9.1 Standard Process
Problem baseline
-> Value hypothesis
-> Metric contract
-> Measurement design
-> Pilot evidence
-> Adoption proof
-> Quality and risk proof
-> Finance translation
-> Benefits register
-> Scale / stop decision
-> Post-scale audit
9.2 Benefits Register Fields
| Field | Fill-in rule | Retail financial services example |
|---|---|---|
| Benefit id | Stable identifier | AML-COPILOT-BEN-001 |
| Business owner | Person accountable for the benefit | Head of Financial Crime Operations |
| Baseline | Pre-AI volume, cost, quality, and risk | 18,000 cases per month, median review 42 min |
| Target | Pilot or scale target | Reduce qualified case review time by 15% |
| Metric contract | Metric definition and data source | aml.case_review_minutes_p50 |
| Evidence design | How the benefit is attributed | phased rollout + matched case complexity |
| Observed change | The change observed | treatment team median down 9 min |
| Incremental estimate | Increment after causal adjustment | DiD estimate -6.5 min per case |
| Adoption proof | Evidence of real adoption | AI evidence summary accepted in 72% of eligible cases |
| Quality proof | Quality evidence | critical evidence defect = 0, QA pass +3.2pp |
| Risk adjustment | Risk cost or restriction | high-risk typology remains human-first |
| Cost | Full AI cost | model, retrieval, QA, training, support |
| Finance treatment | How it is booked or managed | capacity redeployed to backlog reduction |
| Sign-off | Recognition status | business + finance signed at monthly value review |
| Scale decision | Decision | expand to two additional analyst teams |
9.3 How to State Realized Benefits
| Benefit claim | Immature statement | Mature statement |
|---|---|---|
| Time saved | AI saves 5 minutes per summary | In 68% of eligible cases, adopting AI saved 3.8 minutes after case mix adjustment, with no rise in QA defects |
| Cost reduction | Customer service cost fell 20% | Cost per resolved contact in treatment queues fell 8.4%, and the reopen and complaint guardrails did not deteriorate |
| Quality improvement | Judge score is higher | citation correctness up 12pp, zero wrong policy answer critical defects, QA pass up 5pp |
| Risk reduction | AML is safer | mandatory evidence completeness up 9pp, no increase in high-risk typology escalation misses |
| Platform reuse | The platform improves efficiency | Teams using the shared eval/gateway saw lead time fall 35%, change failure rate did not rise, and MTTR fell |
9.4 Monthly Value Review
| Topic | Key question | Output |
|---|---|---|
| Value | Whether incremental benefit exceeds the credible counterfactual | benefits register update |
| Adoption | Whether target users changed how they work | adoption intervention |
| Quality | Whether eval and QA support scale | release / scale gate |
| Risk | Whether residual risk is within appetite | risk acceptance or mitigation |
| Cost | Whether unit economics improve with scale | routing, cache, capacity decisions |
| Portfolio | Whether to continue, expand, platformize, or stop | scale / stop memo |
10. Retail Financial Services Cases
10.1 AML Investigator Copilot
Positioning: AI helps analysts gather evidence, summarize transactions, generate narrative drafts, and flag missing information, but does not replace accountability for the SAR / STR judgment.
| Layer | Metric design |
|---|---|
| North Star | quality-approved AI-assisted investigations completed within SLA with no critical evidence defect |
| Value event | The analyst completes the case using the AI evidence summary or narrative draft, QA passes, and there is no critical defect |
| Input metrics | eligible case coverage, AI summary acceptance, evidence checklist completeness, time-to-evidence, narrative edit distance |
| AI quality | citation correctness, unsupported claim rate, missing evidence detection, typology coverage |
| Business outcome | case cycle time, backlog aging, QA pass, SAR narrative quality, capacity redeployed |
| Guardrail | critical evidence defect = 0, high-risk typology escalation miss, PII overexposure, analyst overreliance |
| Causal evidence | team phased rollout + matched case complexity + expert QA sample |
| Benefit realization | capacity released to aged backlog, not automatic headcount reduction claim |
Key design points:
- AI output must distinguish source fact, inference, and missing evidence.
- High-risk typologies keep stronger human review.
- Metrics are sliced by jurisdiction, typology, risk tier, and analyst tenure.
- Benefits cannot count only review minutes; evidence completeness and regulatory quality also matter.
10.2 Customer Service / Contact Center RAG + Copilot
Positioning: AI provides agents or self-service channels with cited policy answers, next-step suggestions, and reply drafts.
| Layer | Metric design |
|---|---|
| North Star | customer issues resolved with grounded AI assistance and no reopen or policy breach |
| Value event | The AI answer is adopted by the agent or the customer completes self-service, the issue is resolved on first contact, with no reopen and no policy error |
| Input metrics | intent coverage, approved knowledge freshness, answer exposure, acceptance rate, edit distance, self-service containment |
| AI quality | groundedness, citation correctness, policy effective-date correctness, tone suitability |
| Business outcome | FCR, AHT, after-call work, transfer rate, complaint rate, cost per resolved contact |
| Guardrail | wrong policy answer, unauthorized promise, vulnerable customer escalation miss, PII leakage, latency |
| Causal evidence | queue / agent cluster A/B, triggered analysis for actual exposure |
| Benefit realization | Lower cost per resolved contact while reopen and complaint do not deteriorate |
Key design points:
- The North Star is not "answers generated", because more generation does not mean more resolution.
- Set zero-tolerance critical errors for regulated intents.
- self-service containment must be read together with complaint, repeat contact, and abandonment.
10.3 Credit / Underwriting Decision Support
Positioning: AI helps organize application materials, explain policy, draft memos, and flag missing information. The final credit decision remains with authorized personnel and the model governance process.
| Layer | Metric design |
|---|---|
| North Star | underwriter-accepted AI-assisted credit memos completed with policy adherence and no fairness guardrail breach |
| Value event | The memo is adopted by the underwriter or submitted with minor edits, and QA / policy review passes |
| Input metrics | eligible application coverage, document extraction confidence, policy citation correctness, memo acceptance, missing-info detection |
| AI quality | unsupported risk rationale, wrong policy citation, adverse action wording risk, data extraction accuracy |
| Business outcome | cycle time, rework rate, condition clearing time, decision consistency, underwriter capacity |
| Guardrail | unauthorized recommendation, fair lending sensitive defect, appeal overturn, policy exception miss |
| Causal evidence | underwriter / branch cluster rollout + case mix adjustment + delayed outcome monitoring |
| Benefit realization | Faster and more consistent memo and missing-document follow-up processes; a rise in approval rate is not automatically counted as a benefit |
Key design points:
- AI should not output final authorization language such as "approve / decline" unless the governance scope explicitly allows it.
- approval rate cannot be a standalone North Star, because it may introduce credit risk and fairness risk.
- Slice by product, customer group, channel, region, and underwriter tenure.
10.4 Wealth / Branch Advisor Assistant
Positioning: AI helps RMs / advisors / branch staff with client preparation, product knowledge retrieval, compliant talking points, and next best conversation, but must not bypass suitability, disclosure, and supervision processes.
| Layer | Metric design |
|---|---|
| North Star | compliant advisor interactions improved by AI with accepted preparation and suitability guardrail pass |
| Value event | The advisor uses AI preparation or conversation suggestions, the client interaction is completed, and compliance QA passes |
| Input metrics | advisor activation, client prep usage, approved content coverage, accepted next-step suggestion, meeting follow-up completion |
| AI quality | suitability context completeness, disclosure citation, product restriction accuracy, tone appropriateness |
| Business outcome | meeting conversion, follow-up completion, client retention, assets retained, advisor productivity |
| Guardrail | unsuitable recommendation, unapproved product promotion, complaint, missing disclosure, vulnerable client miss |
| Causal evidence | branch / advisor cohort rollout + matched client segment + compliance sampling |
| Benefit realization | Only recognize incremental revenue or retention value that passes the suitability and complaint guardrails |
Key design points:
- Do not set sales volume directly as the North Star, to avoid incentivizing wrong recommendations.
- Use source-of-truth and policy citation for client profile and product suitability.
- Slice value by client segment, advisor tenure, and branch capacity.
10.5 AI Platform / Model Gateway / EvalOps Platform
Positioning: the platform is not about "how many models are connected"; it is about letting multiple AI use cases deliver business value faster, more safely, more cheaply, and more auditably.
| Layer | Metric design |
|---|---|
| North Star | production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates |
| Value event | A production AI workflow uses the shared gateway / eval / observability / guardrail and passes the release gate |
| Input metrics | platform adoption by use case, reuse rate, eval coverage, trace coverage, policy coverage, model routing hit rate |
| Engineering / DORA | lead time to AI change, deployment frequency, change failure rate, MTTR, reliability |
| Business outcome | time-to-market reduction, duplicated platform spend avoided, incident reduction, cost per workflow |
| Guardrail | audit log gap, policy bypass, unapproved model use, cost overrun, change failure |
| Causal evidence | cohort comparison between platform and non-platform teams, before/after with delivery complexity adjustment |
| Benefit realization | Platform benefits are apportioned by use case reuse, risk control, delivery cycle reduction, and cost reduction |
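Key design points:
- The platform North Star is not "API calls" or "number of models connected".
- DORA metrics are used to demonstrate delivery and reliability improvement, but they must connect to AI use case value.
- Platform value must deduct platform team, infrastructure, governance, and migration costs.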
11. Product Analytics Governance
The goal of AI product metric governance is for metrics to be trusted by people, BI, LLMs, the eval harness, the Value Office, and audit alike.
11.1 Metric Contract
| Field | Required content | Example |
|---|---|---|
| Metric name | Stable name and namespace | cx.ai_grounded_resolution_rate |
| Business decision | What decision it supports | Whether to expand customer service RAG to more intents |
| Definition | Business definition | Share of eligible tickets that were AI-assisted and resolved on first contact with correct citations |
| Formula | Executable formula | grounded_resolved_contacts / eligible_exposed_contacts |
| Numerator | Numerator definition | AI exposed, resolved, no reopen in 7 days, citation QA pass |
| Denominator | Denominator definition | eligible and exposed contacts in approved intents |
| Grain | Granularity | contact_id |
| Time window | Time basis | contact close date, rolling 7 / 28 days |
| Dimensions | Sliceable dimensions | intent, channel, agent_team, customer_segment, risk_tier |
| Source-of-truth | Authoritative systems | contact center system + QA system + AI trace store |
| Data quality SLO | Quality target | exposure event completeness >= 99%, QA linkage >= 98% |
| AI consumption policy | How AI may use it | May be used for Value Office summaries; must not be used to generate individual performance penalties |
| Guardrail linkage | Linked constraints | wrong policy answer, complaint, reopen, PII leakage |
| Owner | Responsibility model | CX ops accountable, AI PM responsible, risk consulted |
| Change policy | Change process | Changes to intent eligibility require product + risk approval |
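The Formula, Numerator, and Denominator rows of the contract can be turned into an executable check at contact_id grain. The sketch below is hypothetical: the `Contact` record and its field names are illustrative stand-ins for joined data from the contact center system, QA system, and AI trace store, not an actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Contact:
    """One row per contact_id; field names are illustrative assumptions."""
    contact_id: str
    eligible: bool            # contact falls in an approved intent
    ai_exposed: bool          # AI answer was actually shown
    resolved: bool
    reopened_within_7d: bool
    citation_qa_pass: bool


def ai_grounded_resolution_rate(contacts: Iterable[Contact]) -> Optional[float]:
    """grounded_resolved_contacts / eligible_exposed_contacts, or None if undefined."""
    # Denominator: eligible and exposed contacts in approved intents.
    denominator = [c for c in contacts if c.eligible and c.ai_exposed]
    if not denominator:
        return None  # avoid reporting 0% when there is no eligible exposure at all
    # Numerator: resolved, no reopen in 7 days, citation QA pass.
    numerator = [
        c for c in denominator
        if c.resolved and not c.reopened_within_7d and c.citation_qa_pass
    ]
    return len(numerator) / len(denominator)
```

Building the numerator from the denominator guarantees the rate stays within [0, 1] and that ineligible contacts can never enter either side.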
11.2 RACI
| Activity | Business owner | AI PM | BA | Data owner | Risk / Compliance | Finance | Platform | Ops |
|---|---|---|---|---|---|---|---|---|
| North Star definition | A | R | R | C | C | C | C | C |
| Metric contract | A | R | R | R | C | C | C | C |
| Guardrail threshold | C | R | C | C | A/R | C | C | C |
| Experiment design | A | R | R | C | C | C | C | C |
| Telemetry spec | C | R | R | R | C | I | R | C |
| Benefits register | A | R | C | I | C | A/R | I | C |
| Release / scale gate | A | R | C | C | A/R | C | R | R |
| Metric incident response | A | R | C | R | C | C | R | R |
11.3 Governance Forums
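| Forum | Cadence | Key question | Output |
|---|---|---|---|
| Metric design review | Use case discovery / before pilot | Whether the North Star represents a qualified value event | approved metric tree |
| Guardrail review | Before release and before high-risk changes | Whether risk thresholds are enforceable | guardrail matrix |
| Experiment review | Before pilot | Whether the counterfactual, randomization, sample, and telemetry are credible | experiment brief approval |
| Value review | monthly | Whether benefits are attributable, realizable, and expandable | benefits register update |
| Metric incident review | After an incident | Whether the metric, data, or interpretation misled decisions | metric correction and comms |
| Portfolio review | quarterly | Which use cases to scale, stop, or platformize | funding decision |
11.4 Metric Incident
AI metric incidents include:
- Lost exposure events inflate or deflate adoption.
- A knowledge base version change goes unrecorded and affects the groundedness definition.
- A dashboard puts ineligible cases in the denominator.
- An LLM analytics assistant interprets an unapproved metric.
- An experiment fails the SRM check but its results continue to be used for a scale decision.
- The benefits register attributes the entire pre/post improvement to AI.
Response process: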
Detect
-> classify severity
-> freeze affected decision
-> identify lineage and consumers
-> correct metric / dashboard / AI summary
-> communicate impacted decisions
-> update contract and tests
-> add regression check
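One regression check that fits the last step is an automated sample ratio mismatch (SRM) test, since a failed SRM is one of the incidents listed in this section. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom, computed with the standard library only. The `alpha=0.001` default is a commonly used alerting convention and is an assumption here, not a value from this playbook.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-square test (df = 1) that the observed split matches the design."""
    total = n_control + n_treatment
    if total <= 0 or not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("need a positive sample and a share strictly between 0 and 1")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(n_control: int, n_treatment: int,
            expected_treatment_share: float = 0.5,
            alpha: float = 0.001) -> bool:
    """Flag the experiment when the split deviates more than chance plausibly allows."""
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < alpha
```

A 5,000 / 5,400 split under a 50/50 design is flagged, while 5,000 / 5,050 is not. A flagged experiment should be treated as a metric incident and its scorecard frozen until the assignment or logging cause is found.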
12. Templates
12.1 North Star Metric Canvas
| Field | Filled example |
|---|---|
| Product / use case | Customer Service AI Policy Copilot |
| Target user | Contact center agents handling regulated servicing intents |
| Business problem | High AHT and reopen rate caused by policy search friction and inconsistent answers |
| One-sentence North Star | grounded AI-assisted customer issues resolved without reopen or policy breach |
| Qualified value event | eligible contact, AI answer exposed, agent accepted or customer self-served, resolved, no reopen in 7 days, citation QA pass |
| Primary value dimension | cost per resolved contact and customer issue resolution quality |
| Input metric groups | coverage, exposure, acceptance, groundedness, cycle time, reopen, cost |
| Guardrails | wrong policy answer, PII leakage, vulnerable customer escalation miss, complaint spike |
| Causal evidence plan | agent-team cluster A/B with triggered analysis and QA sample |
| Finance translation | incremental resolved contacts * adjusted cost reduction - AI total cost - risk cost |
| Scale rule | expand only if FCR improves, AHT decreases, critical policy defects remain zero, and cost per resolved contact improves |
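The qualified value event in the canvas is a conjunction of conditions, so it can be written as a single predicate over a contact record. The following is a minimal sketch under stated assumptions: the `Contact` fields and the function name are hypothetical, and the caller is assumed to score a contact only after its 7-day reopen window has closed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Reopen window taken from the canvas ("no reopen in 7 days").
REOPEN_WINDOW = timedelta(days=7)


@dataclass
class Contact:
    """Hypothetical contact record; field names are illustrative only."""
    eligible: bool
    ai_answer_exposed: bool
    agent_accepted: bool
    customer_self_served: bool
    resolved_at: Optional[datetime]
    reopened_at: Optional[datetime]
    citation_qa_pass: bool


def is_qualified_value_event(c: Contact) -> bool:
    """Return True only if every condition in the canvas definition holds."""
    if not (c.eligible and c.ai_answer_exposed):
        return False
    if not (c.agent_accepted or c.customer_self_served):
        return False
    if c.resolved_at is None:
        return False
    # A reopen only disqualifies the event if it falls inside the window.
    reopened_in_window = (
        c.reopened_at is not None
        and c.reopened_at - c.resolved_at <= REOPEN_WINDOW
    )
    return c.citation_qa_pass and not reopened_in_window
```

Counting contacts for which this predicate is true gives the North Star numerator. Each failed condition maps to one input metric group (coverage, exposure, acceptance, groundedness, reopen).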
12.2 Metric Tree Template
Business outcome:
reduce cost per resolved customer issue while maintaining policy compliance
North Star:
grounded AI-assisted customer issues resolved without reopen or policy breach
Qualified value event:
eligible contact + AI exposure + accepted answer + resolution + QA pass + no reopen
Input metrics:
coverage:
approved intent coverage
knowledge base freshness
exposure:
AI answer visible rate
triggered contact rate
adoption:
accepted answer rate
edit distance
quality:
citation correctness
unsupported claim rate
workflow:
AHT
transfer rate
after-call work
outcome:
FCR
reopen rate
complaint rate
economics:
AI cost per resolved contact
QA cost per contact
guardrails:
wrong policy answer
PII leakage
vulnerable customer miss
12.3 Guardrail Matrix Template
| Guardrail | Severity | Metric | Threshold | Detection | Decision |
|---|---|---|---|---|---|
| Wrong regulated policy answer | Critical | expert QA critical defect count | 0 per release gate | QA sample + user report | stop affected intent and run root cause |
| PII leakage | Critical | confirmed leakage event | 0 | DLP + trace audit | disable feature path and incident response |
| Reopen rate | High | 7-day reopen rate | no statistically credible increase above control | experiment scorecard | pause scale and diagnose intents |
| Latency | Medium | P95 response latency | <= 2.5 seconds for agent desktop | observability dashboard | route optimization or fallback |
| Cost overrun | Medium | AI cost per resolved contact | <= approved unit economics ceiling | cost ledger | model routing review |
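The rows of the guardrail matrix that have numeric thresholds can be expressed as data plus a small evaluator. This is a sketch, not a definitive implementation: the metric keys, the `Guardrail` class and both function names are assumptions. The reopen-rate and cost rows are left out because their thresholds depend on an experiment scorecard and an approved ceiling that the matrix does not quantify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2}


@dataclass(frozen=True)
class Guardrail:
    metric: str                          # hypothetical key in the observed-metrics dict
    severity: str                        # "Critical", "High" or "Medium"
    is_breached: Callable[[float], bool]
    decision: str                        # action copied from the matrix


GUARDRAILS = [
    Guardrail("critical_policy_defects", "Critical", lambda v: v > 0,
              "stop affected intent and run root cause"),
    Guardrail("confirmed_pii_leaks", "Critical", lambda v: v > 0,
              "disable feature path and incident response"),
    Guardrail("p95_latency_seconds", "Medium", lambda v: v > 2.5,
              "route optimization or fallback"),
]


def evaluate_guardrails(observed: Dict[str, float]) -> List[Guardrail]:
    """Return breached guardrails, most severe first.

    A missing metric raises KeyError so that a gap in measurement
    cannot be mistaken for a pass.
    """
    breached = [g for g in GUARDRAILS if g.is_breached(observed[g.metric])]
    return sorted(breached, key=lambda g: SEVERITY_RANK[g.severity])


def release_blocked(observed: Dict[str, float]) -> bool:
    """Any Critical breach is a hard stop for the release gate."""
    return any(g.severity == "Critical" for g in evaluate_guardrails(observed))
```

Keeping thresholds as data rather than scattered conditionals makes it easier to version them alongside the metric contract and to review changes by risk tier.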
12.4 Experiment Design Brief
| Field | Filled example |
|---|---|
| Hypothesis | Grounded AI policy answers reduce AHT and reopen rate for eligible servicing intents without increasing policy defects |
| Treatment | AI answer with citation shown in agent desktop |
| Control | Existing policy search and macro workflow |
| Unit of assignment | Agent team |
| Unit of analysis | Contact |
| Eligibility | Approved servicing intents, excluding complaints and vulnerable customer cases in first pilot |
| Primary metric | Grounded AI-assisted resolved contact rate |
| Secondary metrics | AHT, after-call work, transfer rate, agent acceptance |
| Guardrails | wrong policy answer, PII leakage, complaint, reopen, latency |
| Exposure logging | contact_id, agent_id, team_id, variant, AI_visible, accepted, edited |
| Analysis | ITT + triggered exposure analysis, case mix adjustment, pre-registered slices |
| Decision rule | scale if primary improves, cost per resolved contact improves, no critical guardrail breach |
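Because the brief assigns by agent team but analyzes by contact, a simple and conservative ITT estimate first aggregates contacts to team level and then compares team means by assigned variant. The sketch below illustrates that idea for AHT. The tuple layout and function name are assumptions, and a real analysis would add uncertainty estimates and case mix adjustment.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def team_level_itt(contacts: Iterable[Tuple[str, str, float]]) -> float:
    """Intent-to-treat difference in mean AHT, treatment minus control.

    contacts: (team_id, variant, aht_minutes) with variant in
    {"treatment", "control"}. Every contact of an assigned team is
    included, whether or not the AI answer was actually used.
    """
    aht_by_team: Dict[str, List[float]] = defaultdict(list)
    variant_of: Dict[str, str] = {}
    for team_id, variant, aht in contacts:
        # A team must belong to exactly one arm in a cluster design.
        if variant_of.setdefault(team_id, variant) != variant:
            raise ValueError(f"team {team_id} appears in both variants")
        aht_by_team[team_id].append(aht)

    # Aggregate to the unit of assignment before comparing arms.
    team_means: Dict[str, List[float]] = defaultdict(list)
    for team_id, values in aht_by_team.items():
        team_means[variant_of[team_id]].append(mean(values))

    return mean(team_means["treatment"]) - mean(team_means["control"])
```

A negative result means lower AHT in the treatment arm. Averaging team means gives each team equal weight, which avoids letting one large team dominate the estimate. Weighting by contact volume is an alternative that should be declared in advance.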
12.5 Benefits Realization Register
| Field | Filled example |
|---|---|
| Benefit id | CX-AI-BEN-004 |
| Use case | Customer Service AI Policy Copilot |
| Baseline | 620,000 monthly eligible contacts, AHT P50 7.8 min, reopen 11.2% |
| Incremental effect | Cluster experiment estimates -0.7 min AHT per exposed resolved contact |
| Adoption | 64% eligible exposed contacts accepted AI answer |
| Quality | citation QA pass 96.4%, critical policy defect 0 |
| Gross value | capacity equivalent from reduced handle time for accepted contacts |
| Cost | model, retrieval, QA sample, training, platform support |
| Risk adjustment | complaint guardrail unchanged; high-risk intents excluded until separate gate |
| Recognized benefit | capacity redeployed to backlog and peak coverage |
| Sign-off | CX operations and finance approved for limited scale |
| Next review | 60-day post-scale value audit with expanded intent set |
12.6 Scale / Stop Memo
| Section | Required content |
|---|---|
| Decision | scale, limited scale, continue pilot, redesign, stop |
| Evidence | North Star movement, input metric movement, causal estimate, confidence |
| Guardrails | pass / breach / trend / mitigation |
| Unit economics | value per event, cost per event, scale cost curve |
| Adoption | target user adoption and workflow change evidence |
| Residual risk | risk owner view and control plan |
| Benefits | finance treatment and benefits register update |
| Platform reuse | reusable components, shared controls, additional use cases |
| Decision log | owner, date, rationale, conditions |
13. Review Checklists
13.1 North Star Review
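- Is the North Star a qualified value event rather than call volume, login volume or generation volume?
- Does it clearly connect customer value, business value and the AI contribution?
- Can it be decomposed into input metrics that teams can move?
- Are there explicit guardrails that prevent trading risk for growth?
- Can it be sliced by risk tier, channel, customer segment, team and region?
- Can it be translated into value that finance can discuss?
- Does it avoid encouraging unauthorized automation or fast, low-quality completion?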
13.2 Metric Contract Review
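- Are the metric name, definition, formula, numerator, denominator, grain and time window clear?
- Are the source of truth, data quality SLO, lineage and owner explicit?
- Does the AI consumption policy state which AI systems may use the metric?
- Do definition changes go through approval, versioning and impact analysis?
- Can eval, dashboards, LLM analytics and the Value Office consume the metric consistently?
- Are the freeze, correction and communication steps for a metric incident defined?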
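One way to make these contract questions checkable is to hold the contract as a typed, immutable record and enforce the AI consumption policy in code. This is a sketch only: the field names mirror the checklist, while the class, the function and the example consumer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContract:
    """Hypothetical metric contract; fields follow the review checklist."""
    name: str
    definition: str
    numerator: str
    denominator: str
    grain: str
    time_window: str
    source_of_truth: str
    owner: str
    version: str
    # AI consumption policy: systems allowed to read or explain the metric.
    approved_ai_consumers: frozenset = frozenset()


def may_consume(contract: MetricContract, consumer: str) -> bool:
    """True only if the consumer is listed in the AI consumption policy."""
    return consumer in contract.approved_ai_consumers


reopen_rate = MetricContract(
    name="seven_day_reopen_rate",
    definition="Share of resolved eligible contacts reopened within 7 days",
    numerator="resolved eligible contacts reopened within 7 days",
    denominator="resolved eligible contacts",
    grain="contact",
    time_window="7 days after resolution",
    source_of_truth="contact center system of record",
    owner="CX operations",
    version="1.0.0",
    approved_ai_consumers=frozenset({"exec_dashboard"}),
)
```

With this shape, an LLM analytics assistant that is absent from `approved_ai_consumers` can be refused before it explains the metric, which addresses the "unapproved metric" incident type directly.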
13.3 Experiment / Causal Evidence Review
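- Are treatment, control, eligibility, assignment and exposure clearly defined?
- Do the randomization unit and the analysis unit match, and if not, how is clustering handled?
- Are assignment, exposure, adoption, action, outcome and guardrail events logged?
- Are primary, secondary and guardrail metrics and slices declared in advance?
- Are SRM, case mix, seasonality, spillover and metric drift checked?
- Where randomization is not possible, are the quasi-experimental assumptions written down and sensitivity-checked?
- Are both ITT and triggered analysis reported, to avoid looking only at adopters?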
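The SRM item in the checklist above can be automated with a chi-square goodness-of-fit test on the number of randomization units per arm. The sketch below uses only the standard library and relies on the fact that a chi-square variable with one degree of freedom has survival function erfc(sqrt(x / 2)). The function name and the 0.001 alert threshold mentioned afterwards are assumptions, not part of the playbook.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a two-arm sample ratio mismatch (SRM) check.

    Counts should be of the randomization unit (for example agent
    teams in a cluster design), not of downstream events.
    """
    if not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("expected_treatment_share must be in (0, 1)")
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_treatment - expected_treatment) ** 2 / expected_treatment
        + (n_control - expected_control) ** 2 / expected_control
    )
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A very small p-value (teams often use a strict cutoff such as 0.001) suggests that assignment or logging is broken. In that case the experiment result should be frozen rather than fed into a scale decision.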
13.4 Benefits Realization Review
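- Was the baseline frozen before the pilot?
- Are observed change and incremental effect reported separately?
- Is the benefit net of AI total cost, human review, QA, training, governance and risk cost?
- Is saved time converted into a concrete realization path: headcount, capacity, SLA, revenue or risk reduction?
- Have finance, the business owner and the risk owner agreed on the definition?
- Is a post-scale audit scheduled after scaling, to catch decay of the pilot effect?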
13.5 Guardrail Review
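- Do critical guardrails have zero tolerance or a hard stop?
- Do guardrails cover customer harm, compliance, privacy, security, fairness, operations, finance and reliability?
- Are thresholds differentiated by risk tier?
- Are detection source, owner, response time and rollback defined?
- Do they prevent averages from masking harm in high-risk segments?
- Are guardrail breaches included in risk-adjusted value?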
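14. Anti-Patterns
| Anti-pattern | Symptom | Why it is dangerous | Better practice |
|---|---|---|---|
| Treating model accuracy as the North Star | "95% accuracy" becomes the only success metric | Cannot demonstrate adoption, workflow or business value | Use qualified value event + eval guardrail |
| Treating AI call volume as value | API calls, answers generated and tokens consumed grow | Rewards ineffective usage and cost inflation | Count accepted and quality-passed workflow outcomes |
| Reporting only hours saved | Subjective estimates multiplied by usage counts | Hard for finance to accept; ignores rework and risk | Use causal estimate + capacity redeployment |
| Looking only at averages | Overall AHT decreases | High-risk segments may get worse | Slice by risk tier, channel and customer segment |
| Guardrails as an afterthought | Complaints and compliance issues are reviewed only after launch | Harm in high-risk scenarios may be irreversible | Define stop rules before the release gate |
| Looking only at adopters | People who adopt AI perform better | Selection bias overstates the effect | Look at assignment, exposure, ITT and triggered together |
| Treating pilot team success as full-rollout success | The strongest team performs well in the pilot | Adoption and quality decay after scale | phased rollout + heterogeneity analysis |
| Treating platform integration counts as platform value | 20 models and 50 applications connected | Does not mean faster, safer or cheaper | Use DORA-style delivery + value/risk gates |
| Using revenue directly as the wealth AI North Star | Sales rise after recommendations | May sacrifice suitability and customer trust | compliant accepted interactions + risk-adjusted revenue |
| Metrics without an owner | Nobody is accountable for dashboard numbers | Cannot fix or explain during an incident | metric contract + RACI + incident playbook |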
15. 30-Day Training Plan
Goal: within 30 days, produce a portfolio-ready evidence pack for AI Product Metrics / North Star / Value Measurement, going deep on one financial retail use case while also covering the platform governance view.
| Day | Training topic | Output |
|---|---|---|
| 1 | Choose a use case: AML, customer service, credit, wealth/branch or AI platform | Use case decision card |
| 2 | Write the problem baseline: volume, cost, quality, risk, cycle time | Baseline table |
| 3 | Define the AI intervention and decision boundary | Intervention brief |
| 4 | Identify users, workflows, system of record and risk owner | Stakeholder / system map |
| 5 | Design 3 candidate North Stars and score them | North Star option matrix |
| 6 | Select the North Star and qualified value event | North Star canvas |
| 7 | Decompose the North Star into input metrics | Metric tree v1 |
| 8 | Define AI quality / eval metrics | Eval-to-business matrix |
| 9 | Design guardrail categories and critical thresholds | Guardrail matrix v1 |
| 10 | Write the metric contract: name, formula, grain, source, owner | Metric contract |
| 11 | Design telemetry: assignment, exposure, adoption, action, outcome | Telemetry spec |
| 12 | Draw the data lineage: source -> metric -> dashboard -> AI summary | Metric lineage map |
| 13 | Design the causal evidence plan: A/B, cluster, DiD or time series | Experiment / quasi-experiment brief |
| 14 | Identify threats to causal validity: selection, seasonality, spillover, case mix | Threat-to-validity register |
| 15 | Design benefits register fields and finance translation | Benefits register v1 |
| 16 | Work a sample calculation of gross value, cost and risk adjustment | Risk-adjusted value model |
| 17 | Define the adoption realization factor and quality pass factor | Value adjustment rules |
| 18 | Write the pilot release gate: eval, risk, ops, cost | Pilot gate checklist |
| 19 | Write the scale / stop decision rule | Scale / stop memo skeleton with filled example |
| 20 | Add DORA-style platform metrics or engineering delivery metrics | Platform metric addendum |
| 21 | Build a complete example for the AML case or the customer service case | Case metric pack |
| 22 | Build a comparison example for the credit or wealth/branch case | Second case comparison |
| 23 | Design the dashboard information architecture: exec, product, risk and ops layers | Dashboard outline |
| 24 | Write the product analytics governance RACI | RACI table |
| 25 | Write the metric incident response flow | Metric incident playbook |
| 26 | Compile anti-patterns and interview risk points | Anti-pattern cheat sheet |
| 27 | Write 5 advanced interview answers | Interview answer pack v1 |
| 28 | Assemble all outputs into a portfolio narrative | Portfolio storyline |
| 29 | Self-assess: are North Star, guardrail, causal, benefits and governance all present? | Review checklist results |
| 30 | Complete the final portfolio pack | Final AI product metrics portfolio pack |
Completion criteria:
- A clear North Star that is not an activity metric.
- A complete metric tree and guardrail matrix.
- A causal evidence plan that goes beyond pre/post comparison.
- A risk-adjusted value model and a benefits register.
- Product analytics governance: metric contract, RACI, incident process.
- The ability to explain benefits realization and risk control in financial retail language.
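16. Interview Answers
Q1: How would you design the North Star for a bank's AI customer service Copilot?
30-second version
I would not use call volume or the number of generated answers as the North Star. I would define it as "the number of AI-assisted customer issues that are evidence-grounded, accepted, resolved on first contact, and free of reopen or policy breach". This metric captures customer value, business value, AI contribution and the risk boundary at once.
2-minute version
I would first define the qualified value event: eligible contact, AI answer exposed, agent accepted or customer self-served, issue resolved, no reopen within 7 days, citation QA pass, no critical policy defect. Then I would break out the input metrics: intent coverage, knowledge freshness, exposure rate, acceptance rate, citation correctness, AHT, FCR, reopen, complaint, cost per resolved contact. Guardrails include wrong policy answer, PII leakage, vulnerable customer escalation miss and latency. For attribution I would prefer an agent-team cluster A/B, logging assignment, exposure, adoption and outcome so that looking only at adopters does not introduce selection bias. For benefits realization, I would recognize only the incremental resolutions that pass the quality and risk gates, then deduct model, QA, training and platform costs.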
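Q2: How do AI eval metrics relate to business metrics?
30-second version
Eval metrics show whether the AI behaves acceptably; business metrics show whether the workflow and business results improve. Eval is a release gate, not the ROI itself.
2-minute version
For a credit memo assistant, for example, eval metrics include policy citation correctness, unsupported risk rationale, missing document detection and prohibited recommendation language. These decide whether the product may enter pilot. The business metrics are cycle time, rework, underwriter capacity, appeal overturn and policy exception defect. The metric tree connects the two: if citation correctness improves, memo acceptance and rework should improve, which in turn affects cycle time and cost. If eval improves but business results do not move, the likely causes are poor workflow embedding, lack of user trust, a shift in case mix, or the AI improving only an unimportant part of the task.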
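Q3: How do you show that an AI project's benefit is not caused by seasonality or team selection?
30-second version
Establish a counterfactual. Where randomization is possible, run an A/B test or cluster rollout; where it is not, use DiD, interrupted time series, matched cohort or synthetic control, and log assignment, exposure, adoption, outcome and guardrail.
2-minute version
I would first define treatment and eligibility, then choose the randomization unit. Customer service suits an agent-team cluster A/B; AML may use a phased rollout with matched case complexity; wealth branches suit an advisor cohort rollout with compliance sampling. In the analysis I would separate ITT from triggered exposure and declare the primary metric, guardrails and segment slices in advance. I would also check SRM, case mix, seasonality, spillover, metric drift and outcome delay. Finally, only the credible incremental effect goes into the benefits register; the full pre/post change is not credited to AI.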
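Q4: How do you calculate risk-adjusted AI ROI?
30-second version
I start from credible incremental value, multiply by the adoption and quality pass factors, then deduct full AI cost and expected risk cost. Financial retail cannot count efficiency alone; quality, compliance, customer harm and governance costs must also be deducted.
2-minute version
The formula is: risk-adjusted net value = credible incremental value * adoption realization * quality pass - model/platform/data/QA/change/governance cost - expected risk cost - opportunity cost. Take an AML Copilot: if each case saves 6 minutes but only 70% of eligible cases actually adopt it, and high-risk typologies need stronger human review, the benefit must be scaled by the adoption and quality pass rates. If an evidence defect or an unsupported claim in a SAR narrative occurs, the related value should be deducted or frozen. Finance sign-off must also state how saved time is realized as backlog reduction, capacity redeployment or cost reduction.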
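The formula in this answer can be written directly as a small function. In the sketch below the function and parameter names are assumptions, and every number in the usage example is illustrative except the 6 minutes per case and the 70% adoption taken from the AML example: the case volume, cost per minute, quality pass rate and cost figures are made up for the arithmetic.

```python
def risk_adjusted_net_value(
    credible_incremental_value: float,
    adoption_realization: float,
    quality_pass: float,
    total_cost: float,
    expected_risk_cost: float,
    opportunity_cost: float = 0.0,
) -> float:
    """Net value after adoption, quality, cost and risk adjustments.

    adoption_realization and quality_pass are rates in [0, 1].
    total_cost bundles model, platform, data, QA, change and
    governance cost.
    """
    for name, rate in (("adoption_realization", adoption_realization),
                       ("quality_pass", quality_pass)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    realized_value = credible_incremental_value * adoption_realization * quality_pass
    return realized_value - total_cost - expected_risk_cost - opportunity_cost


# Illustrative AML Copilot month (hypothetical inputs):
# 10,000 eligible cases * 6 minutes saved * 1.0 cost unit per minute.
gross_value = 10_000 * 6 * 1.0
net_value = risk_adjusted_net_value(
    credible_incremental_value=gross_value,
    adoption_realization=0.70,   # share of eligible cases that really adopt
    quality_pass=0.90,           # assumed share passing QA and risk gates
    total_cost=20_000.0,         # assumed full AI cost for the month
    expected_risk_cost=5_000.0,  # assumed expected risk cost
)
```

Under these made-up inputs, a gross value of 60,000 shrinks to a net value of about 12,800, which shows how much of a headline saving can disappear once adoption, quality and cost are applied.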
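Q5: How do you design the North Star for an AI platform?
30-second version
The platform North Star should not be the number of integrated models or API call volume. It should be "the number of production AI workflows delivered through the shared platform that pass the value, risk, reliability and cost gates".
2-minute version
AI platform value comes from reuse and governance capabilities such as model gateway, prompt registry, eval harness, observability, cost ledger, policy guardrail and audit log. Input metrics include platform adoption by use case, eval coverage, trace coverage, policy coverage, model routing hit rate and cost per workflow. DORA-style metrics can demonstrate delivery capability: lead time to AI change, deployment frequency, change failure rate, MTTR and reliability. These must connect to business use case value, otherwise the platform is only engineering activity. Guardrails include unapproved model use, audit log gap, policy bypass, cost overrun and change failure.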
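Q6: What do you do if the North Star rises but a guardrail deteriorates?
30-second version
Freeze expansion first and assess the guardrail severity. A critical breach means an immediate stop or rollback of the affected path. A non-critical breach is diagnosed by segment, case type, version and workflow step, and the growth is not counted as realizable value until the risk owner accepts it.
2-minute version
The North Star cannot override the risk boundary. Suppose resolved contacts for a customer service AI rise but wrong policy answers or complaints also rise. I would first check whether the problem is concentrated in certain intents, knowledge base versions, agent cohorts or prompt versions. A critical policy defect means closing the affected intent and updating the eval set and release gate. For bounded degradation, the options are limiting traffic, adding human review, adjusting the retrieval filter or returning to pilot. In the benefits calculation, events affected by a guardrail breach do not count as qualified value events and also feed into expected risk cost.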
17. Portfolio Deliverables
An advanced AI Product Metrics portfolio can contain the following assets:
| Artifact | Content | Evaluation criterion |
|---|---|---|
| One-page metric strategy | use case, North Star, qualified event, metric tree, guardrail | Explains value and risk on one page |
| North Star option matrix | trade-offs among 2-3 candidate North Stars | Explains why an activity metric was rejected |
| AI product metric taxonomy | business, workflow, adoption, eval, guardrail, cost, platform | Clear categories and clear owners |
| Metric contract | definition, formula, grain, source, owner, AI consumption policy | Reusable by dashboard / eval / audit |
| Guardrail matrix | severity, threshold, detection, decision | Has hard stops and review triggers |
| Experiment / causal design brief | treatment, control, unit, telemetry, analysis, threats | Can demonstrate incremental value |
| Risk-adjusted value model | gross value, cost, risk adjustment, finance treatment | Does not overstate ROI |
| Benefits realization register | baseline, target, incremental estimate, sign-off, scale decision | Can support a Value Office review |
| Dashboard information architecture | layered views for exec, product, risk, ops and platform | Does not pile all metrics together |
| Product analytics governance pack | RACI, metric incident, change policy, lineage | Auditable and operable |
| Financial retail case pack | examples for AML, customer service, credit, wealth/branch and AI platform | Shows ability to transfer across domains |
| Interview answer pack | answers to 6-10 advanced questions | Explains North Star, causal, guardrail and benefits clearly |
Suggested portfolio narrative:
I did not start with model accuracy.
I started with the business decision and the qualified value event.
Then I designed the North Star, input metrics, guardrails, causal evidence, risk-adjusted value and benefits realization governance.
This is how I would help a regulated financial institution scale AI without confusing usage with value.
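18. Final Check: Is the Metric System Mature?
| Question | Mature answer |
|---|---|
| What is the North Star? | A qualified value event constrained by quality, adoption, risk and cost |
| What are the input metrics? | Metrics that product, data, model, operations and platform teams can move directly |
| What is a guardrail? | It has an owner, threshold, detection, response and stop / rollback rules |
| How is AI quality measured? | Eval, QA, human review, trace and online monitoring work together |
| How is business benefit demonstrated? | By an experiment or quasi-experiment that establishes a credible counterfactual |
| How is the benefit realized? | It enters the benefits register and is recognized by business and finance |
| How does risk enter ROI? | Through adjustment by expected risk cost, quality pass and guardrail breach |
| How is platform value demonstrated? | By reuse, cost, reliability and DORA-style delivery metrics tied to use case value |
| How are metrics governed? | metric contract, semantic layer, RACI, lineage, change policy, incident |
| Is it fit for financial retail? | It covers compliance, privacy, fairness, customer harm, audit and the boundary of human accountability |
Closing line:
Advanced AI product measurement is not "checking whether the AI is being used"; it is demonstrating that the AI creates attributable, realizable, scalable and governable business value under controlled risk.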