AI Product Metrics / North Star / Value Measurement Playbook

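Audience: AI PMs, AI BAs, AI Product Architects, AI Value Office Leads and digital leaders in retail financial services who already have a BA / CBAP / product management foundation.

Core question: how an AI product moves from "the model performs well and users say it helps" to having a North Star, input metrics, guardrails, causal evidence, realized benefits, risk adjustment, finance recognition and auditable governance.

Learning goal: design advanced AI product metric systems for retail financial scenarios such as AML, customer service, credit, wealth / branch and AI platform, and turn the metric tree, experiment design, benefits realization, risk-adjusted value and product analytics governance into portfolio assets.

Scope: this is not a basic metrics course. It does not cover DAU/MAU basics, funnel terminology or BI reporting tutorials. A formal retail financial project must be jointly confirmed by the business owner, risk, model risk, legal, compliance, privacy, security, finance, data owner, architecture and operations.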


Source Anchors

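These sources serve as anchors for product measurement, trustworthy AI, delivery capability and value governance. They do not constitute legal, regulatory, audit or vendor-selection advice.

| Anchor | Official / primary source | Use in this playbook |
| --- | --- | --- |
| Amplitude North Star Metric official guide | https://amplitude.com/north-star | Anchors the product management language of the North Star Metric: connecting customer value, product usage, business results and team input metrics into one actionable metric system. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI metric risk, guardrails, measurement evidence, monitoring and the governance loop. |
| DORA | https://dora.dev/ | Uses the language of software delivery and operational capability to connect the AI platform, engineering productivity, reliability, change risk and business goals, so that AI platform value is not judged by demo count alone. |
| Trustworthy Online Controlled Experiments | https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59 | Uses the language of online controlled experiments, organization-level metrics, experiment trustworthiness and long-term impact to support causal evidence for AI products. |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Supports measurement, monitoring, content risk, human oversight and evidence design in GenAI risk scenarios. |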

1. One-Sentence Positioning

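AI Product Metrics is a product governance system that connects AI behavior quality, real user adoption, business outcomes, risk guardrails, causal attribution, unit economics and benefits realization. The North Star is the steering wheel of this system, not an isolated KPI.

A shorter interview version:

AI North Star = qualified value events created by AI, adjusted by trust, risk, adoption and unit economics.

For a senior AI PM / BA / Architect, the point is not to list many metrics but to answer:

  • Which decision, action, workflow or customer experience the AI actually changes.
  • Which metric represents sustained customer value and business value.
  • Which input metrics the team can move directly.
  • Which guardrails must halt expansion as soon as they deteriorate.
  • Which evidence shows the improvement comes from AI rather than seasonality, staff selection, management attention or process reshuffling.
  • Which part of the benefit is recognized by finance, risk, ops and the business owner.
  • Whether metric definitions, lineage, owners, changes, permissions and incident response are governable.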

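2. Why AI Product Measurement Cannot Stop at Model Metrics

AI products are commonly reported like this:

  • Accuracy improved by 8%.
  • Answer quality judge score reached 4.5/5.
  • User visits rose 30%.
  • 200,000 summaries were generated.
  • An estimated 30% of manual time is saved.

These signals are useful, but they are not enough to support a scale / stop / fund decision, because:

| Single-point metric | What it shows | What it does not show | What to add |
| --- | --- | --- | --- |
| Model accuracy / eval score | AI output is closer to the reference on the test set | Whether users adopt it, whether the workflow improves, whether risk falls | exposure, adoption, workflow outcome, guardrail |
| Usage count | Someone opened or called the product | Whether a qualified value event was produced | accepted action, completed workflow, quality pass |
| Time saved | A step got faster | Whether rework rose, quality fell, or the saved time was actually realized | rework, QA defect, capacity redeployment |
| Automation rate | The system handles more tasks | Whether it is safe, whether customers are wrongly harmed, whether risk is shifted to humans | exception rate, manual override, customer harm |
| Cost per call | Inference cost | Whether unit business value holds up | cost per resolved case, cost per risk avoided |
| Customer satisfaction | Experience signal | Whether it leads to long-term value and whether compliance risk is under control | retention, complaint, policy breach, segment fairness |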

Maturity progression:

Model metric
-> AI behavior metric
-> workflow metric
-> customer / risk / financial outcome
-> causal evidence
-> risk-adjusted value
-> benefits realization
-> portfolio funding decision

In one sentence:

Model metrics are part of the release evidence, not the endpoint of AI product value.


3. AI Product Metrics Taxonomy

3.1 High-Level Classification

| Metric class | Question to answer | Retail financial example | Owner | Typical use |
| --- | --- | --- | --- | --- |
| North Star Metric | What is the core qualified value event the AI product creates | risk-adjusted resolved customer issues; quality-approved AI-assisted AML cases | Product / Business owner | Strategic alignment, team focus, scale decision |
| Business Outcome Metric | Did the final business result improve | cost per case, loss avoided, FCR, approval cycle time, complaint rate | Business / Finance | benefits realization, funding gate |
| Workflow Metric | Did AI change process performance | handle time, case aging, queue backlog, reopen rate, touches per case | Ops / BA | Process optimization, bottleneck diagnosis |
| Adoption Metric | Do target users adopt in the right scenarios | eligible user activation, repeat use, accepted suggestion rate, copilot-assisted workflow share | Product / Ops | Adoption management, training and UX iteration |
| AI Quality / Eval Metric | Does AI behavior meet the task requirements | groundedness, citation correctness, policy adherence, tool success, format validity | AI PM / EvalOps | release gate, regression test |
| Decision Quality Metric | Are human-AI decisions better | escalation precision, memo defect rate, override quality, appeal overturn rate | Risk / Ops | Governance of high-risk decision support |
| Guardrail Metric | Does AI cause unacceptable harm | PII leakage, unauthorized advice, wrong denial, complaint spike, fairness gap | Risk / Compliance | stop rule, rollback, incident |
| Cost / Unit Economics Metric | Does value cover cost | cost per resolved case, token cost per accepted action, review cost per case | Finance / Platform | Route optimization, budget, scale economics |
| Data / Knowledge Metric | Is the data AI depends on trustworthy | freshness, coverage, retrieval recall, policy effective-date correctness | Data owner | RAG, eval, audit |
| Platform / Engineering Metric | Is AI delivery capability scalable and reliable | lead time to AI change, deployment frequency, change failure rate, MTTR, platform reuse rate | Platform / Engineering | DORA-style platform value, operational maturity |
| Benefits Realization Metric | Were the promised benefits realized | finance-signed benefit, redeployed capacity, avoided loss recognized | Value Office / Finance | portfolio review, scale / stop |

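3.2 Metric Object Boundaries

| Easily confused | Correct distinction | Product governance significance |
| --- | --- | --- |
| KPI vs North Star | There can be many KPIs; the North Star is the core value event and direction | Prevents each team from optimizing in a different direction |
| Eval metric vs business metric | Eval measures AI behavior; business metrics measure business results | Prevents a judge score from being packaged as ROI |
| Feature metric vs decision metric | A feature is a model input; a decision metric measures the quality of the business action | Prevents input quality from being mistaken for business benefit |
| Adoption vs impact | Adoption is usage and exposure; impact is an attributable change in results | Prevents usage growth from being misreported as value |
| Guardrail vs secondary metric | A guardrail is a constraint and stop rule; a secondary metric explains the result | Prevents severe risk from being hidden by averages |
| Activity vs value event | Activity is clicks, queries, generations; a value event is a qualified, completed business result | Prevents "AI is busy" with no business improvement |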

3.3 AI Product Measurement Stack

flowchart TB
  S[Strategy and risk appetite] --> N[North Star Metric]
  N --> V[Qualified value event]
  V --> I[Input metrics]
  I --> A[AI behavior and eval metrics]
  I --> W[Workflow and adoption metrics]
  I --> B[Business outcome metrics]
  A --> G[Guardrail metrics]
  W --> G
  B --> R[Risk-adjusted value]
  G --> R
  R --> C[Causal evidence]
  C --> F[Finance-recognized benefits]
  F --> P[Portfolio scale / stop decision]

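4. North Star Metric: Advanced Design Principles

4.1 Criteria for a Qualified North Star

An AI product North Star must satisfy all nine conditions:

| Condition | Test question | Disqualifying signal |
| --- | --- | --- |
| Clear user value | Why is the user or business process better off | The metric only counts model calls or page visits |
| Clear business value | Why does growth in this metric support operating results | No link to cost, risk, revenue or customer experience |
| Explainable AI contribution | How does AI affect this value event | AI is only a background tool with no exposure record |
| Movable by the team | Input metrics decompose into product, data, model and operations actions | The metric is too lagging or too macro |
| Hard to game | Growth cannot come from lowering quality or shifting risk | More cases are auto-closed but rework and complaints rise |
| Bound by guardrails | Risk thresholds are explicit; harm cannot be traded for growth | Only efficiency is tracked, not compliance or customer harm |
| Diagnosable by slice | Can be broken down by segment, channel, risk tier and team | Averages hide harm to high-risk customer groups |
| Usable baseline | A pre-AI / control / holdout comparison can be established | Only subjective estimates are possible |
| Translatable to finance | Maps to cost, revenue, loss avoided or capacity | Finance cannot sign off |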

4.2 Common AI North Star Forms

| Product pattern | Recommended North Star form | Example |
| --- | --- | --- |
| RAG knowledge assistant | Grounded and accepted resolutions | weekly grounded accepted answers that resolve workflow without QA defect |
| Copilot / draft assistant | Quality-approved assisted work completed | AI-assisted customer responses sent with policy pass and no reopen |
| Decision support | Better decisions under human accountability | risk-reviewed decisions with improved precision and no adverse guardrail breach |
| Automation | Safely automated eligible outcomes | eligible cases safely resolved by AI within SLA and no critical defect |
| Agent workflow | Approved actions completed safely | bounded AI actions completed with human-approved audit trail and no rollback |
| AI platform | Production AI workflows delivering governed value | active production AI use cases passing value, risk and DORA-style reliability gates |

4.3 Recommended Formula: Qualified Value Event

North Star =
sum(Qualified value events)
where each event passes:
  target workflow eligibility
  real user or system exposure
  accepted or completed action
  quality / eval threshold
  risk guardrail threshold
  cost ceiling
  auditable evidence

A risk-adjusted version better suited to retail financial services:

Risk-adjusted North Star =
sum(value_event_count * value_weight * confidence_weight * adoption_weight)
- expected_harm_cost
- quality_failure_cost
- incremental_operating_cost

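Notes:

  • value_event_count: the number of qualified business events, for example resolved case, approved memo, completed branch interaction.
  • value_weight: the value weight of an event, which can come from labor hours, loss avoided, revenue, customer experience or risk exposure.
  • confidence_weight: evidence strength, which rises with experiments, quasi-experiments, holdouts and finance sign-off.
  • adoption_weight: the degree of real user adoption and workflow change.
  • expected_harm_cost: probability of a risk event multiplied by severity and remediation cost.
  • quality_failure_cost: the cost of rework, QA defects, complaints, appeals and manual re-review.
  • incremental_operating_cost: the cost of models, platform, human review, labeling, monitoring, training and governance.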
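The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Event` fields, the cost ceiling and every weight and cost value are hypothetical, and the sum over event types is collapsed to a single event type for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One candidate value event; field names are illustrative assumptions."""
    eligible: bool
    exposed: bool
    accepted: bool
    quality_pass: bool
    guardrail_pass: bool
    cost: float
    has_audit_trail: bool


def is_qualified(e: Event, cost_ceiling: float) -> bool:
    # An event counts only if it passes every gate listed in the formula.
    return (e.eligible and e.exposed and e.accepted and e.quality_pass
            and e.guardrail_pass and e.cost <= cost_ceiling
            and e.has_audit_trail)


def risk_adjusted_north_star(value_event_count, value_weight,
                             confidence_weight, adoption_weight,
                             expected_harm_cost, quality_failure_cost,
                             incremental_operating_cost):
    # Weighted value of qualified events minus the three cost terms.
    weighted = (value_event_count * value_weight
                * confidence_weight * adoption_weight)
    return (weighted - expected_harm_cost - quality_failure_cost
            - incremental_operating_cost)


# Hypothetical events: only the first passes every gate.
events = [
    Event(True, True, True, True, True, 0.40, True),
    Event(True, True, True, False, True, 0.35, True),   # fails quality
    Event(True, True, True, True, True, 0.90, True),    # over cost ceiling
    Event(True, False, False, True, True, 0.10, True),  # never exposed
]
qualified = sum(is_qualified(e, cost_ceiling=0.50) for e in events)
score = risk_adjusted_north_star(qualified, 12.0, 0.5, 0.5, 1.0, 0.5, 0.5)
```

Keeping the qualification gate separate from the weighting makes it easier to audit which gate rejected an event before any value weight is applied.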

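4.4 North Star Examples by Retail Financial Scenario

| Scenario | North Star | Why it beats usage volume |
| --- | --- | --- |
| AML Copilot | quality-approved AI-assisted investigations completed within SLA with no critical evidence defect | Focuses on qualified investigation completion and does not reward fast, low-quality case closure |
| Customer service RAG / Copilot | customer issues resolved with grounded AI assistance and no reopen or policy breach | Constrains resolution rate, evidence, rework and policy risk together |
| Credit Memo Assistant | credit memos completed with AI assistance, underwriter acceptance and no policy exception defect | AI only assists human decisions and does not override credit accountability |
| Wealth / Branch Advisor Assistant | compliant client interactions improved by AI with advisor acceptance and suitability guardrail pass | Prevents sales conversion from being placed above suitability and compliance |
| AI Platform | production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates | Platform value comes from reusable, controlled delivery, not from the number of models connected |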

5. North Star to Input Metrics

5.1 Metric Tree Logic

Business goal
-> North Star
-> Qualified value event
-> Input metric groups
-> Product levers
-> Guardrails
-> Evidence and benefit realization

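5.2 Input Metric Layers

| Level | Metric group | Key question | Examples |
| --- | --- | --- | --- |
| L1 Eligibility | Coverage | Which objects should AI affect | eligible cases, eligible users, eligible workflow steps |
| L2 Exposure | Real exposure | Did the target objects actually see or use AI | exposed cases, AI suggestion visible rate, default-on share |
| L3 Adoption | Adoption behavior | Did users accept the AI output or action | accepted suggestion rate, draft edit rate, repeat use |
| L4 Workflow change | Process change | Did AI shorten, simplify or improve the workflow | time-to-summary, touches per case, queue age |
| L5 AI quality | Behavior quality | Is AI output usable, evidenced and auditable | groundedness, citation correctness, policy adherence |
| L6 Business outcome | Business result | Did customer, operations, revenue or risk outcomes improve | FCR, loss avoided, approval cycle time, complaint rate |
| L7 Unit economics | Unit economics | Does unit value cover unit cost | cost per resolved case, cost per accepted memo |
| L8 Guardrail | Risk constraint | Did unacceptable harm occur | PII leakage, wrong advice, false negative, fairness gap |
| L9 Evidence quality | Evidence strength | Is the metric improvement attributable | experiment pass, DiD estimate, holdout comparison |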
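As a rough sketch of how the first three layers chain together, the function below computes eligibility, exposure and acceptance rates where each rate uses the previous layer as its denominator. The record keys (`eligible`, `exposed`, `accepted`) and the sample data are assumptions made for illustration, not a schema defined by this playbook.

```python
def funnel_rates(cases):
    """Compute L1-L3 rates; each layer is conditioned on the one before it."""
    eligible = [c for c in cases if c["eligible"]]
    exposed = [c for c in eligible if c["exposed"]]
    accepted = [c for c in exposed if c["accepted"]]

    def ratio(numerator, denominator):
        # Return 0.0 rather than raising when a layer is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "eligibility_rate": ratio(len(eligible), len(cases)),
        "exposure_rate": ratio(len(exposed), len(eligible)),
        "acceptance_rate": ratio(len(accepted), len(exposed)),
    }


# Hypothetical case records.
cases = [
    {"eligible": True, "exposed": True, "accepted": True},
    {"eligible": True, "exposed": True, "accepted": False},
    {"eligible": True, "exposed": False, "accepted": False},
    {"eligible": True, "exposed": True, "accepted": True},
    {"eligible": False, "exposed": False, "accepted": False},
]
rates = funnel_rates(cases)
```

Conditioning each rate on the prior layer keeps ineligible or unexposed cases from diluting the acceptance rate.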

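5.3 Customer Service AI North Star Tree Example

Business goal:
  reduce customer service cost, raise first-contact resolution, and control policy errors and complaint risk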

North Star:
  grounded AI-assisted customer issues resolved without reopen or policy breach

Input metrics:
  coverage:
    eligible intent coverage
    approved policy knowledge coverage
  exposure:
    agent-visible AI answer rate
    customer self-service AI answer exposure
  adoption:
    accepted answer rate
    answer edit distance
    repeat use by agent cohort
  quality:
    citation correctness
    policy adherence
    unsupported claim rate
  workflow:
    AHT
    after-call work time
    transfer rate
    reopen rate
  business:
    FCR
    complaint rate
    cost per resolved contact
  guardrail:
    wrong fee disclosure
    unauthorized promise
    PII leakage
    vulnerable customer escalation miss
  unit economics:
    token and retrieval cost per resolved contact
    QA and human review cost

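5.4 Input Metrics and Product Levers

| Input metric | Actionable product levers | Common misreading |
| --- | --- | --- |
| Eligible workflow coverage | Expand the intent taxonomy, policy knowledge coverage and tool integration | Putting every scenario under AI, so risk gets out of control |
| Exposure rate | Show by default, embed in the workbench, reduce switching cost | Exposure is forced but users bypass it or copy-paste into external tools |
| Acceptance rate | Improve citation quality, editable drafts, structured next steps | A high acceptance rate may come from user over-trust |
| Edit distance | Improve format, tone and context injection | Low editing does not mean correct; QA sampling is needed |
| Workflow cycle time | Auto-prefill, summarization, ranking, tool calls | Time falls but rework rises |
| Groundedness | Retrieval filtering, rerank, enforced citations, refusal when evidence is insufficient | A citation exists but does not support the conclusion |
| Cost per value event | model routing, cache, prompt compression, small model | Cost cuts degrade quality or risk |
| Repeat use | onboarding, manager cadence, workflow fit | Repeat use may only show that AI replaced search, not that it created value |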

6. Guardrail Metrics

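Guardrails are a release contract, not a metric tucked into the corner of a dashboard. An AI product may optimize its North Star, but it may not cross a guardrail.

6.1 Guardrail Classes

| Guardrail class | Retail financial examples | Threshold policy | Owner |
| --- | --- | --- | --- |
| Customer harm | Wrong denial, wrong fee explanation, misleading repayment guidance, wrong investment advice | critical = 0; medium breaches have a pause threshold | Business / CX / Risk |
| Compliance / policy | Unauthorized promise, violation of KYC/AML/credit policy, record retention failure | critical = 0; cap on policy defect rate | Compliance / Legal |
| Privacy / security | PII leakage, unauthorized retrieval, successful prompt injection, sensitive fields written to logs | zero tolerance for critical leakage | Privacy / Security |
| Model behavior | hallucinated rationale, unsupported claim, wrong citation, overconfident answer | Thresholds set by risk tier | AI PM / EvalOps |
| Decision quality | Credit memo misses a key risk, AML evidence defect, wrong escalation / de-escalation | Hard gate for high-risk cases | Risk / Ops |
| Fairness / segment | A specific age, language, region, channel or risk tier is wrongly harmed | Gap cap + slice review | Fair lending / Risk |
| Operational | queue backlog, manual override spike, QA capacity overload, fallback failure | bounded degradation | Ops |
| Financial | loss rate, chargeback, refund, margin erosion, review cost overrun | risk appetite threshold | Finance / Risk |
| Reliability | latency, timeout, tool error, retrieval empty rate, rollback failure | SLO/SLA threshold | Platform / Engineering |
| Engineering delivery | change failure rate, MTTR, incident recurrence, deployment rollback | DORA-style reliability guardrail | Platform / SRE |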

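6.2 Threshold Types

| Threshold type | Use | Example |
| --- | --- | --- |
| Zero tolerance | For unacceptable risk | PII leakage critical = 0; unauthorized credit decision = 0 |
| Bounded degradation | Allows slight fluctuation within a limit | AHT increase <= 3%; latency P95 <= 2.5s |
| Segment parity | Prevents average gains from hiding harm | approval support defect gap by protected class <= approved threshold |
| Capacity limit | Prevents work from being shifted to humans | QA queue backlog <= baseline + 10% |
| Cost ceiling | Prevents unit economics from running away | AI cost per resolved contact <= benefit per contact * 20% |
| Stop trigger | Pause or roll back once reached | high severity complaint attributable to AI >= 3 in rolling 7 days |
| Review trigger | Does not necessarily roll back, but requires review | manual override increases > 15% for two consecutive weeks |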
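A threshold table like this can be turned into an executable check. The sketch below is one possible shape under stated assumptions: the `Guardrail` and `Action` types, the metric names and the mapping of a bounded-degradation breach to a review (rather than a stop) are all hypothetical choices, and multi-week conditions such as "two consecutive weeks" are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"
    REVIEW = "review"
    STOP = "stop"


@dataclass(frozen=True)
class Guardrail:
    name: str
    limit: float
    on_breach: Action
    inclusive: bool = False  # True: breach when observed >= limit

    def breached(self, observed: float) -> bool:
        if self.inclusive:
            return observed >= self.limit
        return observed > self.limit


def evaluate(guardrails, observed):
    """Return the overall decision and the per-guardrail breaches."""
    breaches = {g.name: g.on_breach
                for g in guardrails if g.breached(observed[g.name])}
    if Action.STOP in breaches.values():
        decision = Action.STOP  # any stop-level breach dominates
    elif breaches:
        decision = Action.REVIEW
    else:
        decision = Action.PASS
    return decision, breaches


# Thresholds taken from the table; the on_breach mapping is an assumption.
guardrails = [
    Guardrail("pii_leakage_critical", 0, Action.STOP),
    Guardrail("aht_increase", 0.03, Action.REVIEW),
    Guardrail("high_severity_complaints_7d", 3, Action.STOP, inclusive=True),
]
decision, breaches = evaluate(guardrails, {
    "pii_leakage_critical": 0,
    "aht_increase": 0.04,
    "high_severity_complaints_7d": 1,
})
```

With these hypothetical observations only the AHT limit is breached, so the overall decision is a review rather than a stop.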

6.3 Guardrail Matrix Example

| Use case | North Star | Guardrail | Stop / review rule |
| --- | --- | --- | --- |
| AML Copilot | Qualified AI-assisted investigations completed | critical evidence defect, SAR narrative unsupported claim, analyst override spike | critical defect = stop expansion; override spike = expert review |
| Customer service RAG | Evidence-backed first-contact resolution | wrong policy answer, vulnerable customer escalation miss, reopen rate | wrong regulated policy answer = stop affected intent |
| Credit Memo | Compliant memo accepted by the underwriter | unauthorized recommendation, fair lending sensitive wording, missing adverse action reason | unauthorized decision language = release block |
| Wealth / Branch | Improved compliant client interactions | unsuitable recommendation, unapproved product promotion, complaint spike | suitability breach = immediate disable for segment |
| AI Platform | Controlled AI workflows delivered through the platform | change failure rate, policy bypass, cost overrun, audit log gap | audit log gap = platform gate fail |

7. Risk-Adjusted Value

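AI product value cannot be counted as efficiency or revenue alone. Retail financial services must bring risk, quality, operations, governance and customer harm into the net value.

7.1 Core Formulas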

Gross incremental value =
  incremental revenue
  + cost avoided
  + loss avoided
  + rework avoided
  + capacity redeployed value
  + risk exposure reduction value

AI total cost =
  model and infrastructure cost
  + data / labeling / eval cost
  + human review and QA cost
  + platform support cost
  + change management cost
  + governance and audit cost
  + vendor and legal cost

Expected risk cost =
  probability of harm * severity * remediation cost
  + regulatory / compliance exposure
  + customer compensation and complaint handling
  + reputational and operational disruption adjustment

Risk-adjusted net value =
  credible incremental value * adoption realization factor * quality pass factor
  - AI total cost
  - expected risk cost
  - opportunity cost

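7.2 Parameter Definitions

| Parameter | Definition | Evidence source |
| --- | --- | --- |
| Credible incremental value | The incremental benefit attributable to AI, not the entire observed change | A/B, cluster test, DiD, CausalImpact, holdout |
| Adoption realization factor | The share that truly enters the workflow and changes behavior | exposure log, accepted action, manager audit |
| Quality pass factor | The share of value that passes quality and risk thresholds | eval report, QA sample, expert review |
| AI total cost | Full cost of running, governing and changing the product | finance model, cloud bill, vendor contract, ops staffing |
| Expected risk cost | Expected cost of risk events | risk register, incident history, severity matrix |
| Opportunity cost | What is forgone by not investing the same resources in other AI use cases | portfolio scoring, capacity plan |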
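The risk-adjusted net value formula from 7.1 reduces to simple arithmetic once the parameters above are estimated. The following sketch uses entirely hypothetical figures to show how the two realization factors shrink the credible value before costs are subtracted.

```python
def risk_adjusted_net_value(credible_incremental_value,
                            adoption_realization_factor,
                            quality_pass_factor,
                            ai_total_cost,
                            expected_risk_cost,
                            opportunity_cost):
    """Apply the 7.1 formula: realized value minus the three cost terms."""
    realized_value = (credible_incremental_value
                      * adoption_realization_factor
                      * quality_pass_factor)
    return (realized_value - ai_total_cost
            - expected_risk_cost - opportunity_cost)


# Hypothetical annual figures, all in the same currency unit.
net_value = risk_adjusted_net_value(
    credible_incremental_value=1_000_000,
    adoption_realization_factor=0.5,   # half the value reaches the workflow
    quality_pass_factor=0.9,           # 90% passes quality and risk gates
    ai_total_cost=200_000,
    expected_risk_cost=50_000,
    opportunity_cost=100_000,
)
```

In this example a headline value of 1,000,000 nets down to 100,000, which is why the adoption and quality factors deserve as much scrutiny as the gross estimate.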

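7.3 Risk Adjustment Is More Than Deducting Points

Risk adjustment is not a veto on high-risk projects. It makes the decision clearer:

| Situation | Decision implication |
| --- | --- |
| High value, high risk, strong evidence, strong controls | controlled scale, strong gates, staged expansion |
| High value, high risk, weak evidence | pilot only; prioritize causal and control evidence |
| Medium value, low risk, strong reuse | Can spread as a platform pattern |
| Low value, high governance cost | stop, or go back to process optimization |
| Small single-point benefit, large portfolio reuse | Evaluate as a platform; do not reject on single use case ROI |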

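7.4 Value Types in Retail Financial Services

| Value type | Examples | Caveats |
| --- | --- | --- |
| Labor efficiency | Lower customer service AHT, less AML evidence gathering time | Counts as realized only when headcount is reduced, reassigned or released to higher-value tasks |
| Capacity creation | The same team handles more cases, shorter backlog | Must show that quality and risk did not deteriorate |
| Revenue uplift | Wealth next-step suggestions lift conversion, more precise branch cross-sell | Must deduct suitability, complaint and long-term customer value risk |
| Loss avoidance | Lower fraud loss, lower AML false negative risk | Needs a counterfactual and delayed outcome tracking |
| Quality improvement | Less rework, reopen and QA defect | Can translate into cost, risk or customer experience value |
| Risk exposure reduction | Fewer audit findings, better evidence completeness | Finance may not book it directly, but it can enter the risk-adjusted portfolio score |
| Platform leverage | Reusing gateway, eval and observability shortens time to launch | Prove with DORA-style lead time, change failure, MTTR and reuse rate |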

8. Causal / Experimental Evidence

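8.1 Evidence Ladder

| Evidence level | Evidence type | Conclusion it supports | Limitation |
| --- | --- | --- | --- |
| L0 Anecdote | User interviews, expert examples, demo screenshots | Finds opportunities and failure modes | Cannot prove value |
| L1 Descriptive analytics | Usage, adoption rate, before/after trends | Shows correlated change | No counterfactual |
| L2 Baseline comparison | pre/post, target vs actual | Initial estimate of improvement | Vulnerable to seasonality and process change |
| L3 Matched / adjusted analysis | propensity matching, case mix adjustment | Reduces selection bias | Depends on observable confounders |
| L4 Quasi-experiment | DiD, interrupted time series, CausalImpact, synthetic control | Builds a more credible counterfactual when randomization is not possible | Assumptions need testing |
| L5 Randomized experiment | A/B, cluster randomization, switchback, champion-challenger | Strongest causal evidence | Financial scenarios need risk and interference controls |
| L6 Scaled holdout | Long-term holdout, phased rollout, policy experiment | Supports scaling and sustained value | Needs organizational discipline and ethical boundaries |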

8.2 Choosing an Experiment Design in Retail Financial Services

| Use case | Recommended design | Randomization unit | Key guardrails |
| --- | --- | --- | --- |
| Customer service Copilot | Agent-level or team-level cluster A/B | agent, team, queue | wrong policy answer, complaint, reopen, AHT |
| AML Copilot | Team / jurisdiction phased rollout + matched case analysis | analyst team, case cohort | evidence defect, SAR narrative quality, false negative proxy |
| Credit Memo Assistant | Underwriter / branch cluster or eligible application randomization | underwriter, application | fair lending, policy exception, appeal overturn |
| Wealth Advisor Assistant | Advisor cohort controlled rollout + compliance audit | advisor, branch, client segment | unsuitable recommendation, complaint, disclosure miss |
| AI Platform capability | Team cohort rollout + DORA-style before/after with control teams | product team, use case team | change failure, incident, cost overrun, audit gap |

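8.3 Telemetry That Must Be Logged

| Telemetry | Purpose | Example fields |
| --- | --- | --- |
| Assignment | Who was assigned to treatment / control | experiment_id, unit_id, variant, assignment_time |
| Eligibility | Who is eligible to receive AI | risk_tier, workflow_step, case_type, exclusion_reason |
| Exposure | Who actually saw or was affected by AI | AI_visible, AI_suggestion_generated, tool_result_shown |
| Adoption | Whether the user adopted | accepted, edited, ignored, override_reason |
| Action | What was done after adoption | response_sent, case_escalated, memo_submitted |
| Outcome | Downstream result | resolved, reopened, QA_passed, loss_avoided |
| Guardrail | Risk and quality | policy_breach, complaint, PII_block, fairness_slice |
| Cost | Unit cost | tokens, model_cost, review_minutes, platform_cost |
| Version | Component versions | model, prompt, retriever, policy, tool_schema, knowledge_index |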

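8.4 Common Causal Threats

| Threat | How it shows up in AI products | Control |
| --- | --- | --- |
| Selection bias | High performers are more willing to use AI | Randomized default-on, encouragement design, matched analysis |
| Seasonality | Holidays, regulatory cycles and marketing seasons move the metrics | Concurrent control, time series, switchback |
| Case mix shift | The types of cases handled change after the pilot | case complexity adjustment, fixed eligibility |
| Management attention | Pilot teams get more training and supervisor attention | Treat training as a separate treatment, or train all groups equally |
| Spillover | The control group learns prompts from the treatment group | team cluster, isolated knowledge base, contamination log |
| Metric drift | Definition or source system changes | metric contract, lineage, change freeze |
| Outcome delay | Fraud loss, complaints and appeals appear late | Delay windows; separate leading proxy from final outcome |
| Risk displacement | Cost saved in one place raises risk in another | risk-adjusted value and cross-functional guardrail |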

9. Benefits Realization

Benefits realization turns AI value from a business case estimate into an evidence chain that finance and the business owner can recognize.

9.1 Standard Flow

Problem baseline
-> Value hypothesis
-> Metric contract
-> Measurement design
-> Pilot evidence
-> Adoption proof
-> Quality and risk proof
-> Finance translation
-> Benefits register
-> Scale / stop decision
-> Post-scale audit

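9.2 Benefits Register Fields

| Field | How to fill in | Retail financial example |
| --- | --- | --- |
| Benefit id | Stable identifier | AML-COPILOT-BEN-001 |
| Business owner | Person accountable for the benefit | Head of Financial Crime Operations |
| Baseline | Volume, cost, quality and risk before AI | 18,000 cases per month, median review 42 min |
| Target | Pilot or scale target | Review time for qualified cases down 15% |
| Metric contract | Metric definition and data source | aml.case_review_minutes_p50 |
| Evidence design | How attribution is done | phased rollout + matched case complexity |
| Observed change | The change observed | Treatment team median down 9 min |
| Incremental estimate | Causally adjusted increment | DiD estimate -6.5 min per case |
| Adoption proof | Evidence of real adoption | 72% eligible cases AI evidence summary accepted |
| Quality proof | Quality evidence | critical evidence defect = 0, QA pass +3.2pp |
| Risk adjustment | Risk cost or restriction | high-risk typology remains human-first |
| Cost | Full AI cost | model, retrieval, QA, training, support |
| Finance treatment | How it is booked or managed | capacity redeployed to backlog reduction |
| Sign-off | Recognition status | business + finance signed at monthly value review |
| Scale decision | Decision | expand to two additional analyst teams |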
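The gap between the "Observed change" and "Incremental estimate" rows is what a difference-in-differences (DiD) adjustment produces. The sketch below shows the basic two-by-two arithmetic. The treatment figures follow the register example, but the control-team values are hypothetical, chosen only so that the result reproduces the -6.5 minutes shown in the table. A real analysis would typically fit a regression with case-complexity covariates rather than differencing group summaries.

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-differences: treatment change minus control change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


# Review minutes per case. Treatment values follow the register example
# (42 min baseline, 9 min observed drop); control values are ASSUMED.
treat_pre, treat_post = 42.0, 33.0
ctrl_pre, ctrl_post = 42.0, 39.5

observed_change = treat_post - treat_pre  # naive pre/post change
incremental = did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post)
```

Here the control team also improved by 2.5 minutes without AI, so only 6.5 of the 9 observed minutes would be credited to the copilot under the parallel-trends assumption.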

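9.3 How to State a Realized Benefit

| Benefit claim | Immature statement | Mature statement |
| --- | --- | --- |
| Time saved | AI saves 5 minutes per summary | In 68% of eligible cases, adopted AI saves 3.8 minutes after case mix adjustment, with no rise in QA defects |
| Lower cost | Customer service cost fell 20% | Cost per resolved contact in treatment queues fell 8.4%, with reopen and complaint guardrails not deteriorating |
| Higher quality | Judge score is higher | Citation correctness up 12pp, zero critical wrong policy answer defects, QA pass up 5pp |
| Lower risk | AML is safer | Mandatory evidence completeness up 9pp, no increase in high-risk typology escalation misses |
| Platform reuse | The platform improves efficiency | Teams using shared eval / gateway cut lead time 35%, change failure rate did not rise, MTTR fell |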

9.4 Monthly Value Review

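| Topic | Key question | Output |
| --- | --- | --- |
| Value | Does the incremental benefit exceed a credible counterfactual | benefits register update |
| Adoption | Did target users change how they work | adoption intervention |
| Quality | Do eval and QA support scale | release / scale gate |
| Risk | Is residual risk within appetite | risk acceptance or mitigation |
| Cost | Do unit economics improve with scale | routing, cache and capacity decisions |
| Portfolio | Continue, expand, platformize or stop | scale / stop memo |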

10. Retail Financial Services Cases

10.1 AML Investigator Copilot

Positioning: AI helps analysts gather evidence, summarize transactions, generate narrative drafts and flag missing information, but it does not replace accountability for the SAR / STR judgment.

| Layer | Metric design |
| --- | --- |
| North Star | quality-approved AI-assisted investigations completed within SLA with no critical evidence defect |
| Value event | The analyst completes a case using the AI evidence summary or narrative draft, QA passes, no critical defect |
| Input metrics | eligible case coverage, AI summary acceptance, evidence checklist completeness, time-to-evidence, narrative edit distance |
| AI quality | citation correctness, unsupported claim rate, missing evidence detection, typology coverage |
| Business outcome | case cycle time, backlog aging, QA pass, SAR narrative quality, capacity redeployed |
| Guardrail | critical evidence defect = 0, high-risk typology escalation miss, PII overexposure, analyst overreliance |
| Causal evidence | team phased rollout + matched case complexity + expert QA sample |
| Benefit realization | capacity released to aged backlog, not automatic headcount reduction claim |

Key design points:

  • AI output must distinguish source fact, inference and missing evidence.
  • High-risk typologies keep stronger human review.
  • Metrics are sliced by jurisdiction, typology, risk tier and analyst tenure.
  • Benefits cannot count review minutes alone; evidence completeness and regulatory quality matter too.

10.2 Customer Service / Contact Center RAG + Copilot

Positioning: AI provides agents or self-service channels with cited policy answers, next-step suggestions and reply drafts.

| Layer | Metric design |
| --- | --- |
| North Star | customer issues resolved with grounded AI assistance and no reopen or policy breach |
| Value event | The AI answer is accepted by the agent or the customer completes self-service, the issue is resolved on first contact, with no reopen and no policy error |
| Input metrics | intent coverage, approved knowledge freshness, answer exposure, acceptance rate, edit distance, self-service containment |
| AI quality | groundedness, citation correctness, policy effective-date correctness, tone suitability |
| Business outcome | FCR, AHT, after-call work, transfer rate, complaint rate, cost per resolved contact |
| Guardrail | wrong policy answer, unauthorized promise, vulnerable customer escalation miss, PII leakage, latency |
| Causal evidence | queue / agent cluster A/B, triggered analysis for actual exposure |
| Benefit realization | Lower cost per resolved contact while reopen and complaint do not deteriorate |

Key design points:

  • The North Star is not answers generated, because generating more does not mean resolving more.
  • Regulated intents get zero-tolerance critical errors.
  • Self-service containment must be read together with complaint, repeat contact and abandonment.

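10.3 Credit / Underwriting Decision Support

Positioning: AI helps organize application materials, explain policy, draft memos and flag missing information. The final credit judgment remains with authorized staff and the model governance process.

| Layer | Metric design |
| --- | --- |
| North Star | underwriter-accepted AI-assisted credit memos completed with policy adherence and no fairness guardrail breach |
| Value event | The memo is accepted by the underwriter or submitted with minor edits, and QA / policy review passes |
| Input metrics | eligible application coverage, document extraction confidence, policy citation correctness, memo acceptance, missing-info detection |
| AI quality | unsupported risk rationale, wrong policy citation, adverse action wording risk, data extraction accuracy |
| Business outcome | cycle time, rework rate, condition clearing time, decision consistency, underwriter capacity |
| Guardrail | unauthorized recommendation, fair lending sensitive defect, appeal overturn, policy exception miss |
| Causal evidence | underwriter / branch cluster rollout + case mix adjustment + delayed outcome monitoring |
| Benefit realization | Faster, more consistent memo and document follow-up workflow; a rise in approval rate is not automatically counted as a benefit |

Key design points:

  • AI should not output final authorization language such as "approve / decline" unless the governance scope explicitly allows it.
  • Approval rate cannot serve as a standalone North Star, because it may introduce credit risk and fairness risk.
  • Slice by product, customer segment, channel, region and underwriter tenure.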

10.4 Wealth / Branch Advisor Assistant

Positioning: AI helps RMs, advisors and branch staff with client preparation, product knowledge retrieval, compliant talking points and next best conversation, but must not bypass suitability, disclosure and supervision processes.

| Layer | Metric design |
| --- | --- |
| North Star | compliant advisor interactions improved by AI with accepted preparation and suitability guardrail pass |
| Value event | The advisor uses AI preparation or conversation suggestions, the client interaction is completed, and compliance QA passes |
| Input metrics | advisor activation, client prep usage, approved content coverage, accepted next-step suggestion, meeting follow-up completion |
| AI quality | suitability context completeness, disclosure citation, product restriction accuracy, tone appropriateness |
| Business outcome | meeting conversion, follow-up completion, client retention, assets retained, advisor productivity |
| Guardrail | unsuitable recommendation, unapproved product promotion, complaint, missing disclosure, vulnerable client miss |
| Causal evidence | branch / advisor cohort rollout + matched client segment + compliance sampling |
| Benefit realization | Only incremental revenue or retained value that passes the suitability and complaint guardrails is recognized |

Key design points:

  • Sales volume is not set directly as the North Star, to avoid incentivizing wrong recommendations.
  • Client profile and product suitability use a source-of-truth and policy citation.
  • Value is sliced by client segment, advisor tenure and branch capacity.

10.5 AI Platform / Model Gateway / EvalOps Platform

Positioning: the platform is not about "how many models are connected". It is about letting multiple AI use cases deliver business value faster, more safely, more cheaply and more auditably.

| Layer | Metric design |
| --- | --- |
| North Star | production AI workflows shipped through shared platform that pass value, risk, reliability and cost gates |
| Value event | A production AI workflow uses the shared gateway / eval / observability / guardrail and passes the release gate |
| Input metrics | platform adoption by use case, reuse rate, eval coverage, trace coverage, policy coverage, model routing hit rate |
| Engineering / DORA | lead time to AI change, deployment frequency, change failure rate, MTTR, reliability |
| Business outcome | time-to-market reduction, duplicated platform spend avoided, incident reduction, cost per workflow |
| Guardrail | audit log gap, policy bypass, unapproved model use, cost overrun, change failure |
| Causal evidence | cohort comparison between platform and non-platform teams, before/after with delivery complexity adjustment |
| Benefit realization | Platform benefit is allocated across use case reuse, risk control, shorter delivery cycle and lower cost |

Key design points:

  • The platform North Star is not API calls or number of models connected.
  • DORA metrics prove delivery and reliability improvement, but they must connect to AI use case value.
  • Platform value must deduct platform team, infrastructure, governance and migration costs.

11. Product Analytics Governance

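The goal of AI product metric governance is for metrics to be trusted jointly by people, BI, LLMs, the eval harness, the Value Office and audit.

11.1 Metric Contract

| Field | Required content | Example |
| --- | --- | --- |
| Metric name | Stable name and namespace | cx.ai_grounded_resolution_rate |
| Business decision | Which decision it supports | Whether to expand customer service RAG to more intents |
| Definition | Business definition | Share of eligible contacts resolved on first contact with AI assistance and a correct citation |
| Formula | Executable formula | grounded_resolved_contacts / eligible_exposed_contacts |
| Numerator | Numerator definition | AI exposed, resolved, no reopen in 7 days, citation QA pass |
| Denominator | Denominator definition | eligible and exposed contacts in approved intents |
| Grain | Granularity | contact_id |
| Time window | Time definition | contact close date, rolling 7 / 28 days |
| Dimensions | Sliceable dimensions | intent, channel, agent_team, customer_segment, risk_tier |
| Source-of-truth | Authoritative systems | contact center system + QA system + AI trace store |
| Data quality SLO | Quality target | exposure event completeness >= 99%, QA linkage >= 98% |
| AI consumption policy | How AI may use the metric | May be used for Value Office summaries; must not be used to generate individual performance penalties |
| Guardrail linkage | Linked constraints | wrong policy answer, complaint, reopen, PII leakage |
| Owner | Accountability model | CX ops accountable, AI PM responsible, risk consulted |
| Change policy | Change process | Changes to intent eligibility require product + risk approval |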
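To make the Formula, Numerator and Denominator rows concrete, here is a minimal sketch that computes `cx.ai_grounded_resolution_rate` from contact records. The `Contact` type and its field names are hypothetical stand-ins for whatever the source-of-truth systems expose, and the approved-intent filter is assumed to be folded into `eligible`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    """One row at contact_id grain; field names are assumptions."""
    contact_id: str
    eligible: bool            # in an approved intent
    exposed: bool             # AI answer actually shown
    resolved: bool
    reopened_within_7d: bool
    citation_qa_pass: bool


def grounded_resolution_rate(contacts) -> Optional[float]:
    # Denominator: eligible AND exposed contacts only.
    denominator = [c for c in contacts if c.eligible and c.exposed]
    # Numerator: resolved, not reopened in 7 days, citation QA passed.
    numerator = [c for c in denominator
                 if c.resolved and not c.reopened_within_7d
                 and c.citation_qa_pass]
    if not denominator:
        return None  # undefined rather than a misleading 0.0
    return len(numerator) / len(denominator)


contacts = [
    Contact("c1", True, True, True, False, True),    # qualifies
    Contact("c2", True, True, True, True, True),     # reopened
    Contact("c3", True, True, True, False, True),    # qualifies
    Contact("c4", False, True, True, False, True),   # ineligible, excluded
]
rate = grounded_resolution_rate(contacts)
```

Deriving the numerator from the denominator list guarantees the numerator is always a subset, which guards against ineligible cases leaking into either side of the ratio.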

11.2 RACI

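| Activity | Business owner | AI PM | BA | Data owner | Risk / Compliance | Finance | Platform | Ops |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| North Star definition | A | R | R | C | C | C | C | C |
| Metric contract | A | R | R | R | C | C | C | C |
| Guardrail threshold | C | R | C | C | A/R | C | C | C |
| Experiment design | A | R | R | C | C | C | C | C |
| Telemetry spec | C | R | R | R | C | I | R | C |
| Benefits register | A | R | C | I | C | A/R | I | C |
| Release / scale gate | A | R | C | C | A/R | C | R | R |
| Metric incident response | A | R | C | R | C | C | R | R |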

11.3 Governance Forums

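| Forum | Cadence | Key question | Output |
| --- | --- | --- | --- |
| Metric design review | Use case discovery / before pilot | Does the North Star represent a qualified value event | approved metric tree |
| Guardrail review | Before release and before high-risk changes | Are the risk thresholds enforceable | guardrail matrix |
| Experiment review | Before pilot | Are the counterfactual, randomization, sample and telemetry credible | experiment brief approval |
| Value review | monthly | Is the benefit attributable, realizable and scalable | benefits register update |
| Metric incident review | After an incident | Did the metric, data or interpretation mislead a decision | metric correction and comms |
| Portfolio review | quarterly | Which use cases to scale, stop or platformize | funding decision |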

11.4 Metric Incident

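AI metric incidents include:

  • Lost exposure events that inflate or deflate adoption.
  • An unrecorded knowledge base version change that affects the groundedness definition.
  • A dashboard that puts ineligible cases into the denominator.
  • An LLM analytics assistant that interprets an unapproved metric.
  • An experiment that fails the SRM (sample ratio mismatch) check but whose results are still used for a scale decision.
  • A benefits register that attributes the entire pre/post improvement to AI.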
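One of the incidents above, the sample ratio mismatch (SRM), is cheap to detect automatically. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom using only the standard library. The function name, the sample counts and the idea of gating on a small p-value are illustrative assumptions; the alert threshold itself should come from the experiment review forum.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value that the observed split matches the planned assignment ratio."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, the chi-square survival function equals
    # erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical assignment counts for a planned 50/50 split.
balanced = srm_p_value(5000, 5000)
skewed = srm_p_value(5000, 5300)
```

A very small p-value suggests the assignment or logging pipeline is broken, in which case the experiment result should be frozen rather than used for a scale decision.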

Response flow:

Detect
-> classify severity
-> freeze affected decision
-> identify lineage and consumers
-> correct metric / dashboard / AI summary
-> communicate impacted decisions
-> update contract and tests
-> add regression check

12. Templates

12.1 North Star Metric Canvas

| Field | Filled example |
| --- | --- |
| Product / use case | Customer Service AI Policy Copilot |
| Target user | Contact center agents handling regulated servicing intents |
| Business problem | High AHT and reopen rate caused by policy search friction and inconsistent answers |
| One-sentence North Star | grounded AI-assisted customer issues resolved without reopen or policy breach |
| Qualified value event | eligible contact, AI answer exposed, agent accepted or customer self-served, resolved, no reopen in 7 days, citation QA pass |
| Primary value dimension | cost per resolved contact and customer issue resolution quality |
| Input metric groups | coverage, exposure, acceptance, groundedness, cycle time, reopen, cost |
| Guardrails | wrong policy answer, PII leakage, vulnerable customer escalation miss, complaint spike |
| Causal evidence plan | agent-team cluster A/B with triggered analysis and QA sample |
| Finance translation | incremental resolved contacts * adjusted cost reduction - AI total cost - risk cost |
| Scale rule | expand only if FCR improves, AHT decreases, critical policy defects remain zero, and cost per resolved contact improves |

12.2 Metric Tree Template

Business outcome:
  reduce cost per resolved customer issue while maintaining policy compliance

North Star:
  grounded AI-assisted customer issues resolved without reopen or policy breach

Qualified value event:
  eligible contact + AI exposure + accepted answer + resolution + QA pass + no reopen

Input metrics:
  coverage:
    approved intent coverage
    knowledge base freshness
  exposure:
    AI answer visible rate
    triggered contact rate
  adoption:
    accepted answer rate
    edit distance
  quality:
    citation correctness
    unsupported claim rate
  workflow:
    AHT
    transfer rate
    after-call work
  outcome:
    FCR
    reopen rate
    complaint rate
  economics:
    AI cost per resolved contact
    QA cost per contact
  guardrails:
    wrong policy answer
    PII leakage
    vulnerable customer miss

12.3 Guardrail Matrix Template

| Guardrail | Severity | Metric | Threshold | Detection | Decision |
| --- | --- | --- | --- | --- | --- |
| Wrong regulated policy answer | Critical | expert QA critical defect count | 0 per release gate | QA sample + user report | stop affected intent and run root cause |
| PII leakage | Critical | confirmed leakage event | 0 | DLP + trace audit | disable feature path and incident response |
| Reopen rate | High | 7-day reopen rate | no statistically credible increase above control | experiment scorecard | pause scale and diagnose intents |
| Latency | Medium | P95 response latency | <= 2.5 seconds for agent desktop | observability dashboard | route optimization or fallback |
| Cost overrun | Medium | AI cost per resolved contact | <= approved unit economics ceiling | cost ledger | model routing review |

12.4 Experiment Design Brief

| Field | Filled example |
| --- | --- |
| Hypothesis | Grounded AI policy answers reduce AHT and reopen rate for eligible servicing intents without increasing policy defects |
| Treatment | AI answer with citation shown in agent desktop |
| Control | Existing policy search and macro workflow |
| Unit of assignment | Agent team |
| Unit of analysis | Contact |
| Eligibility | Approved servicing intents, excluding complaints and vulnerable customer cases in first pilot |
| Primary metric | Grounded AI-assisted resolved contact rate |
| Secondary metrics | AHT, after-call work, transfer rate, agent acceptance |
| Guardrails | wrong policy answer, PII leakage, complaint, reopen, latency |
| Exposure logging | contact_id, agent_id, team_id, variant, AI_visible, accepted, edited |
| Analysis | ITT + triggered exposure analysis, case mix adjustment, pre-registered slices |
| Decision rule | scale if primary improves, cost per resolved contact improves, no critical guardrail breach |
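Because the brief assigns by agent team but analyzes contacts, contacts within a team are not independent. One conservative way to respect that, sketched below, is to aggregate to team-level means first and compare variants on those means, using assignment (intention-to-treat) rather than actual exposure. The row layout and the AHT values are hypothetical, and a production analysis would more likely use cluster-robust standard errors or a mixed model.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows):
    """rows: (team_id, variant, aht_minutes) per contact.

    Returns {variant: [mean AHT of each team assigned to that variant]}.
    """
    by_team = defaultdict(list)
    variant_of = {}
    for team_id, variant, aht in rows:
        by_team[team_id].append(aht)
        variant_of[team_id] = variant
    by_variant = defaultdict(list)
    for team_id, values in by_team.items():
        by_variant[variant_of[team_id]].append(mean(values))
    return dict(by_variant)


def itt_effect(rows):
    # Intention-to-treat: compare by assigned variant, one value per team.
    means = cluster_means(rows)
    return mean(means["treatment"]) - mean(means["control"])


# Hypothetical contact-level AHT in minutes.
rows = [
    ("team_a", "treatment", 7.0), ("team_a", "treatment", 6.0),
    ("team_b", "treatment", 7.5),
    ("team_c", "control", 8.0), ("team_c", "control", 8.0),
    ("team_d", "control", 7.0), ("team_d", "control", 8.0),
]
effect = itt_effect(rows)
```

Aggregating by team gives each cluster equal weight regardless of contact volume, which trades some precision for protection against one large team dominating the estimate.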

12.5 Benefits Realization Register

| Field | Filled example |
| --- | --- |
| Benefit id | CX-AI-BEN-004 |
| Use case | Customer Service AI Policy Copilot |
| Baseline | 620,000 monthly eligible contacts, AHT P50 7.8 min, reopen 11.2% |
| Incremental effect | Cluster experiment estimates -0.7 min AHT per exposed resolved contact |
| Adoption | 64% eligible exposed contacts accepted AI answer |
| Quality | citation QA pass 96.4%, critical policy defect 0 |
| Gross value | capacity equivalent from reduced handle time for accepted contacts |
| Cost | model, retrieval, QA sample, training, platform support |
| Risk adjustment | complaint guardrail unchanged; high-risk intents excluded until separate gate |
| Recognized benefit | capacity redeployed to backlog and peak coverage |
| Sign-off | CX operations and finance approved for limited scale |
| Next review | 60-day post-scale value audit with expanded intent set |
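
The "Gross value" row can be made concrete with a short worked calculation. This Python sketch uses the register's own figures (620,000 eligible contacts, 64% acceptance, 0.7 min saved) but rests on assumptions the register does not state: that every eligible contact is exposed (`exposure_rate = 1.0`), that the 0.7 minute effect applies to each accepted contact, and that one FTE delivers 130 productive hours per month. The function names and the FTE figure are hypothetical, so the result is illustrative rather than a recognized benefit.

```python
def monthly_hours_saved(eligible_contacts: int,
                        exposure_rate: float,
                        acceptance_rate: float,
                        minutes_saved_per_contact: float) -> float:
    """Handle-time capacity released per month, in hours."""
    accepted = eligible_contacts * exposure_rate * acceptance_rate
    return accepted * minutes_saved_per_contact / 60.0

def fte_equivalent(hours: float, productive_hours_per_fte: float) -> float:
    """Convert hours into FTE capacity under an assumed productivity."""
    return hours / productive_hours_per_fte

hours = monthly_hours_saved(
    eligible_contacts=620_000,      # from the Baseline row
    exposure_rate=1.0,              # assumption: all eligible contacts exposed
    acceptance_rate=0.64,           # from the Adoption row
    minutes_saved_per_contact=0.7,  # from the Incremental effect row
)
fte = fte_equivalent(hours, productive_hours_per_fte=130.0)  # hypothetical

print(f"Hours released per month: {hours:,.0f}")
print(f"FTE capacity equivalent: {fte:.1f}")
```

This is gross capacity only. The Cost and Risk adjustment rows still have to be deducted before anything is recorded as a recognized benefit.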

12.6 Scale / Stop Memo

| Section | Required content |
| --- | --- |
| Decision | scale, limited scale, continue pilot, redesign, stop |
| Evidence | North Star movement, input metric movement, causal estimate, confidence |
| Guardrails | pass / breach / trend / mitigation |
| Unit economics | value per event, cost per event, scale cost curve |
| Adoption | target user adoption and workflow change evidence |
| Residual risk | risk owner view and control plan |
| Benefits | finance treatment and benefits register update |
| Platform reuse | reusable components, shared controls, additional use cases |
| Decision log | owner, date, rationale, conditions |

13. Review Checklists

13.1 North Star Review

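  • Is the North Star a qualified value event rather than call volume, login count, or generation volume?
  • Does it clearly connect customer value, business value, and the AI contribution?
  • Can it be decomposed into input metrics that teams can move?
  • Does it have explicit guardrails that prevent trading risk for growth?
  • Can it be sliced by risk tier, channel, customer segment, team, and region?
  • Can it be translated into value that finance can discuss?
  • Does it avoid encouraging out-of-authority automation or fast, low-quality completion?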

13.2 Metric Contract Review

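  • Are the metric name, definition, formula, numerator, denominator, grain, and time window clear?
  • Are the source of truth, data quality SLO, lineage, and owner explicit?
  • Does the AI consumption policy state which AI systems may use the metric?
  • Do definition changes go through approval, versioning, and impact analysis?
  • Can the metric be consumed consistently by eval, dashboards, LLM analytics, and the Value Office?
  • Is there a defined freeze, correction, and communication process for metric incidents?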

13.3 Experiment / Causal Evidence Review

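  • Are treatment, control, eligibility, assignment, and exposure clearly defined?
  • Do the randomization unit and the analysis unit match, and if not, how is clustering handled?
  • Are assignment, exposure, adoption, action, outcome, and guardrail events logged?
  • Are the primary metric, secondary metrics, guardrails, and slices declared in advance?
  • Have sample ratio mismatch (SRM), case mix, seasonality, spillover, and metric drift been checked?
  • Where randomization is not possible, are the quasi-experimental assumptions written down and sensitivity-checked?
  • Are both ITT and triggered analysis reported, to avoid looking only at adopters?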
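
The SRM item in the checklist can be checked with a chi-square goodness-of-fit test. The sketch below uses only the Python standard library and covers the two-arm case, where the chi-square statistic has one degree of freedom and its tail probability can be written with `math.erfc`. The function name, the counts, and the 50/50 design split are hypothetical. With cluster assignment the counts should be of the assignment unit (for example agent teams), since contact counts per arm can differ for legitimate reasons.

```python
import math

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-square test that the observed split matches the design.

    Two arms give one degree of freedom, for which
    P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical assignment counts under a planned 50/50 split.
balanced = srm_p_value(5000, 5000)
skewed = srm_p_value(5000, 5300)

print(f"balanced split p-value: {balanced:.3f}")
print(f"skewed split p-value:   {skewed:.4f}")
```

A very small p-value suggests the assignment or logging pipeline is broken, in which case the experiment scorecard should not be trusted until the cause is found. The cutoff used to flag SRM is a team convention and should be fixed before the experiment starts.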

13.4 Benefits Realization Review

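  • Was the baseline frozen before the pilot?
  • Are observed change and incremental effect reported separately?
  • Is the benefit net of AI total cost, human review, QA, training, governance, and risk cost?
  • Is saved time converted into a concrete realization path: headcount, capacity, SLA, revenue, or risk reduction?
  • Have finance, the business owner, and the risk owner agreed on the definitions?
  • Is a post-scale audit scheduled after scaling, to catch decay of the pilot effect?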

13.5 Guardrail Review

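  • Do critical guardrails have zero tolerance or a hard stop?
  • Do guardrails cover customer harm, compliance, privacy, security, fairness, operations, finance, and reliability?
  • Are thresholds differentiated by risk tier?
  • Are the detection source, owner, response time, and rollback defined?
  • Are averages prevented from masking harm in high-risk segments?
  • Are guardrail breaches included in risk-adjusted value?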

14. Anti-patterns

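| Anti-pattern | What it looks like | Why it is dangerous | Better practice |
| --- | --- | --- | --- |
| Treating model accuracy as the North Star | "95% accuracy" becomes the only success metric | Cannot prove adoption, workflow change, or business value | Use qualified value event + eval guardrail |
| Treating AI call volume as value | API calls, answers generated, and tokens consumed keep growing | Incentivizes wasteful usage and cost inflation | Count accepted and quality-passed workflow outcomes |
| Reporting only hours saved | Subjective estimate multiplied by usage count | Finance is unlikely to accept it; ignores rework and risk | Use causal estimate + capacity redeployment |
| Looking only at averages | Overall AHT goes down | High-risk segments may get worse | Slice by risk tier, channel, and customer segment |
| Guardrails as an afterthought | Complaints and compliance issues are reviewed only after launch | Harm in high-risk scenarios may be irreversible | Define stop rules before the release gate |
| Looking only at adopters | People who adopt AI perform better | Selection bias overstates the effect | Look at assignment, exposure, ITT, and triggered together |
| Treating pilot team success as full-scale success | The strongest team performs well in the pilot | Adoption and quality decay after scale | phased rollout + heterogeneity analysis |
| Treating platform onboarding count as platform value | 20 models and 50 applications are connected | Does not mean faster, safer, or cheaper | Use DORA-style delivery + value/risk gates |
| Using revenue directly as the wealth AI North Star | Sales rise after recommendations | May sacrifice suitability and customer trust | compliant accepted interactions + risk-adjusted revenue |
| Metrics without an owner | Nobody is accountable for dashboard numbers | Cannot fix or explain them during an incident | metric contract + RACI + incident playbook |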

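15. 30-Day Training Plan

Goal: within 30 days, produce a portfolio-ready evidence pack for AI Product Metrics / North Star / Value Measurement, going deep on one financial retail use case while also covering the platform governance perspective.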

| Day | Training topic | Output |
| --- | --- | --- |
| 1 | Choose a use case: AML, customer service, credit, wealth/branch, or AI platform | Use case decision card |
| 2 | Write the problem baseline: volume, cost, quality, risk, cycle time | Baseline table |
| 3 | Define the AI intervention and decision boundary | Intervention brief |
| 4 | Identify users, process, system-of-record, and risk owner | Stakeholder / system map |
| 5 | Design and score 3 candidate North Stars | North Star option matrix |
| 6 | Select the North Star and qualified value event | North Star canvas |
| 7 | Decompose the North Star into input metrics | Metric tree v1 |
| 8 | Define AI quality / eval metrics | Eval-to-business matrix |
| 9 | Design guardrail categories and critical thresholds | Guardrail matrix v1 |
| 10 | Write the metric contract: name, formula, grain, source, owner | Metric contract |
| 11 | Design telemetry: assignment, exposure, adoption, action, outcome | Telemetry spec |
| 12 | Draw data lineage: source -> metric -> dashboard -> AI summary | Metric lineage map |
| 13 | Design the causal evidence plan: A/B, cluster, DiD, or time series | Experiment / quasi-experiment brief |
| 14 | Identify threats to causal validity: selection, seasonality, spillover, case mix | Threat-to-validity register |
| 15 | Design benefits register fields and finance translation | Benefits register v1 |
| 16 | Work a sample calculation of gross value, cost, and risk adjustment | Risk-adjusted value model |
| 17 | Define the adoption realization factor and quality pass factor | Value adjustment rules |
| 18 | Write the pilot release gate: eval, risk, ops, cost | Pilot gate checklist |
| 19 | Write the scale / stop decision rule | Scale / stop memo skeleton with filled example |
| 20 | Add DORA-style platform metrics or engineering delivery metrics | Platform metric addendum |
| 21 | Build a complete example for the AML or customer service case | Case metric pack |
| 22 | Build a comparison example for the credit or wealth/branch case | Second case comparison |
| 23 | Design the dashboard information architecture: exec, product, risk, ops layers | Dashboard outline |
| 24 | Write the product analytics governance RACI | RACI table |
| 25 | Write the metric incident response process | Metric incident playbook |
| 26 | Compile anti-patterns and interview risk points | Anti-pattern cheat sheet |
| 27 | Write 5 advanced interview answers | Interview answer pack v1 |
| 28 | Assemble all outputs into a portfolio narrative | Portfolio storyline |
| 29 | Self-assess: are North Star, guardrail, causal, benefits, and governance all present? | Review checklist results |
| 30 | Complete the final portfolio pack | Final AI product metrics portfolio pack |

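Completion criteria:

  • There is one clear North Star, and it is not an activity metric.
  • There is a complete metric tree and guardrail matrix.
  • There is a causal evidence plan that goes beyond pre/post comparison.
  • There is a risk-adjusted value model and a benefits register.
  • There is product analytics governance: metric contract, RACI, incident.
  • Benefits realization and risk control can be explained in financial retail language.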

16. Interview Answers

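Q1: How would you design the North Star for a bank's AI customer service Copilot?

30-second version

I would not use call volume or the number of generated answers as the North Star. I would define it as "the number of AI-assisted customer issues that are evidence-backed, accepted, resolved on first contact, and free of reopen or policy violation." This metric captures customer value, business value, AI contribution, and the risk boundary at once.

2-minute version

I would first define the qualified value event: eligible contact, AI answer exposed, accepted by the agent or completed by the customer through self-service, issue resolved, no reopen within 7 days, citation QA passed, no critical policy defect. Then I would decompose the input metrics: intent coverage, knowledge freshness, exposure rate, acceptance rate, citation correctness, AHT, FCR, reopen, complaint, cost per resolved contact. Guardrails include wrong policy answer, PII leakage, vulnerable customer escalation miss, and latency. For attribution, I would prefer an agent-team cluster A/B test while logging assignment, exposure, adoption, and outcome, so that looking only at adopters does not introduce selection bias. For benefits realization, I would recognize only the incremental resolved volume that passes the quality and risk gates, then deduct model, QA, training, and platform costs.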
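
The qualified value event in this answer is a conjunction of conditions, which is easy to express as a predicate. The Python sketch below is illustrative only: the `Contact` class and its field names are hypothetical and simply mirror the conditions listed in the answer, and a real definition would live in the metric contract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contact:
    # Hypothetical fields mirroring the conditions in the answer.
    eligible: bool
    ai_answer_exposed: bool
    agent_accepted: bool
    self_service_completed: bool
    resolved: bool
    reopened_within_7d: bool
    citation_qa_passed: bool
    critical_policy_defect: bool

def is_qualified_value_event(c: Contact) -> bool:
    """True only when every condition of the qualified event holds."""
    return (
        c.eligible
        and c.ai_answer_exposed
        and (c.agent_accepted or c.self_service_completed)
        and c.resolved
        and not c.reopened_within_7d
        and c.citation_qa_passed
        and not c.critical_policy_defect
    )

good = Contact(True, True, True, False, True, False, True, False)
reopened = Contact(True, True, True, False, True, True, True, False)
not_accepted = Contact(True, True, False, False, True, False, True, False)

contacts = [good, reopened, not_accepted]
north_star_count = sum(is_qualified_value_event(c) for c in contacts)
print(north_star_count)
```

Counting through a single predicate keeps the North Star, the dashboards, and the benefits register on the same definition: a contact that was resolved but reopened, or resolved without the AI answer being accepted, never enters the count.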

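Q2: What is the relationship between AI eval metrics and business metrics?

30-second version

Eval metrics prove whether AI behavior is acceptable; business metrics prove whether the workflow and operating results improved. Eval is a release gate, not ROI itself.

2-minute version

For example, the eval metrics for a credit memo assistant include policy citation correctness, unsupported risk rationale, missing document detection, and prohibited recommendation language. These decide whether the assistant is allowed into pilot. The business metrics, however, are cycle time, rework, underwriter capacity, appeal overturn, and policy exception defect. The metric tree connects the two: if citation correctness improves, memo acceptance and rework should improve, which in turn affects cycle time and cost. If eval improves but business results do not move, the likely causes are poor workflow embedding, user distrust, a change in case mix, or AI improving only an unimportant part of the work.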

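Q3: How do you prove that an AI project's benefit is not caused by seasonality or team selection?

30-second version

Build a counterfactual. Randomize with A/B or cluster rollout where possible; where not, use DiD, interrupted time series, matched cohort, or synthetic control, and log assignment, exposure, adoption, outcome, and guardrail.

2-minute version

I would first define treatment and eligibility, then choose the randomization unit. Customer service suits an agent-team cluster A/B; AML may use a phased rollout with matching on case complexity; wealth branches suit an advisor cohort rollout with compliance sampling. In analysis, I separate ITT from triggered exposure and pre-declare the primary metric, guardrails, and segment slices. I also check SRM, case mix, seasonality, spillover, metric drift, and outcome delay. Finally, only the credible incremental effect goes into the benefits register; the full pre/post change is not credited to AI.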

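Q4: How do you calculate risk-adjusted AI ROI?

30-second version

I start from credible incremental value, multiply by the adoption and quality pass factors, then subtract full AI cost and expected risk cost. In financial retail you cannot count efficiency alone; quality, compliance, customer harm, and governance costs must also be deducted.

2-minute version

The formula is: risk-adjusted net value = credible incremental value * adoption realization * quality pass - model/platform/data/QA/change/governance cost - expected risk cost - opportunity cost. Take an AML Copilot: if each case saves 6 minutes but only 70% of eligible cases actually adopt it, and high-risk typologies require stronger human review, the benefit must be scaled by the adoption and quality pass ratios. If an evidence defect or an unsupported claim in a SAR narrative occurs, the associated value should be deducted or frozen. The finance sign-off must also state how saved time is realized as backlog reduction, capacity redeployment, or cost reduction.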
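
The formula in this answer translates directly into a small function. The Python sketch below is a hedged illustration: the function and parameter names are hypothetical, the 70% adoption figure comes from the AML example in the answer, and every monetary amount and the 95% quality pass factor are made-up inputs chosen only to show the arithmetic.

```python
def risk_adjusted_net_value(credible_incremental_value: float,
                            adoption_realization: float,
                            quality_pass: float,
                            total_cost: float,
                            expected_risk_cost: float,
                            opportunity_cost: float = 0.0) -> float:
    """Apply the formula from the answer.

    total_cost bundles model, platform, data, QA, change, and
    governance cost into one number.
    """
    realized = credible_incremental_value * adoption_realization * quality_pass
    return realized - total_cost - expected_risk_cost - opportunity_cost

# Hypothetical annual figures in one currency unit.
value = risk_adjusted_net_value(
    credible_incremental_value=1_000_000,  # made-up causal estimate
    adoption_realization=0.70,             # 70% adoption, as in the AML example
    quality_pass=0.95,                     # made-up quality pass factor
    total_cost=300_000,                    # made-up full AI cost
    expected_risk_cost=50_000,             # made-up expected risk cost
    opportunity_cost=40_000,               # made-up opportunity cost
)
print(f"Risk-adjusted net value: {value:,.0f}")
```

The multiplicative factors matter more than they look: with these inputs, adoption and quality pass together remove about a third of the headline value before any cost is deducted.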

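Q5: How do you design the North Star for an AI platform?

30-second version

The platform North Star should not be the number of models onboarded or API call volume. It should be "the number of production AI workflows delivered through the shared platform that pass the value, risk, reliability, and cost gates."

2-minute version

AI platform value comes from reuse and governance capabilities, such as model gateway, prompt registry, eval harness, observability, cost ledger, policy guardrail, and audit log. Input metrics include platform adoption by use case, eval coverage, trace coverage, policy coverage, model routing hit rate, and cost per workflow. DORA-style metrics can demonstrate delivery capability: lead time to AI change, deployment frequency, change failure rate, MTTR, and reliability. These must connect to business use case value, otherwise the platform is just engineering activity. Guardrails include unapproved model use, audit log gap, policy bypass, cost overrun, and change failure.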

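Q6: What do you do if the North Star rises but a guardrail deteriorates?

30-second version

First freeze expansion and assess guardrail severity. A critical breach means an immediate stop or rollback of the affected path; a non-critical one is diagnosed by segment, case type, version, and workflow step, and the growth is not counted as realizable value until the risk owner accepts it.

2-minute version

The North Star must not override risk boundaries. For example, if resolved contacts for customer service AI rise but wrong policy answers or complaints also rise, I would first check whether the problem is concentrated in certain intents, knowledge base versions, agent cohorts, or prompt versions. A critical policy defect requires closing the affected intent and updating the eval set and release gate. For bounded degradation, the options are to limit traffic, add human review, adjust the retrieval filter, or return to pilot. In the benefits calculation, events affected by a guardrail breach do not count as qualified value events and also feed into expected risk cost.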


17. Portfolio Deliverables

An advanced AI Product Metrics portfolio can include the following assets:

| Artifact | Contents | Evaluation criteria |
| --- | --- | --- |
| One-page metric strategy | use case, North Star, qualified event, metric tree, guardrail | Explains value and risk on one page |
| North Star option matrix | Trade-offs among 2-3 candidate North Stars | Explains why an activity metric was not chosen |
| AI product metric taxonomy | business, workflow, adoption, eval, guardrail, cost, platform | Clear categories, clear owners |
| Metric contract | definition, formula, grain, source, owner, AI consumption policy | Reusable by dashboard / eval / audit |
| Guardrail matrix | severity, threshold, detection, decision | Has hard stops and review triggers |
| Experiment / causal design brief | treatment, control, unit, telemetry, analysis, threats | Can prove incremental value |
| Risk-adjusted value model | gross value, cost, risk adjustment, finance treatment | Does not overstate ROI |
| Benefits realization register | baseline, target, incremental estimate, sign-off, scale decision | Supports Value Office review |
| Dashboard information architecture | Layered views for exec, product, risk, ops, platform | Does not pile all metrics together |
| Product analytics governance pack | RACI, metric incident, change policy, lineage | Auditable and operable |
| Financial retail case pack | Examples for AML, customer service, credit, wealth/branch, AI platform | Shows ability to transfer across domains |
| Interview answer pack | Answers to 6-10 advanced questions | Can explain North Star, causal, guardrail, benefits |

Suggested portfolio narrative:

I did not start with model accuracy.
I started with the business decision and the qualified value event.
Then I designed the North Star, input metrics, guardrails, causal evidence, risk-adjusted value and benefits realization governance.
This is how I would help a regulated financial institution scale AI without confusing usage with value.

18. Final Check: Is the Metric System Mature?

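| Question | Mature answer |
| --- | --- |
| What is the North Star | A qualified value event constrained by quality, adoption, risk, and cost |
| What are the input metrics | Metrics that product, data, model, operations, and platform teams can move directly |
| What is a guardrail | Has an owner, threshold, detection, response, and stop / rollback rules |
| How is AI quality measured | eval, QA, human review, trace, and online monitoring working together |
| How is business benefit proven | Experiments or quasi-experiments establish a credible counterfactual |
| How is benefit realized | Entered in the benefits register and accepted by business and finance |
| How does risk enter ROI | Adjusted by expected risk cost, quality pass, and guardrail breach |
| How is platform value proven | Reuse, cost, reliability, and DORA-style delivery metrics connected to use case value |
| How are metrics governed | metric contract, semantic layer, RACI, lineage, change policy, incident |
| Is it fit for financial retail | Covers compliance, privacy, fairness, customer harm, audit, and human accountability boundaries |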

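One-sentence wrap-up:

Advanced AI product measurement is not "checking whether AI is being used"; it is proving that AI creates attributable, realizable, scalable, and governable business value under controlled risk.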