
AI Real-Time Feature Store / Decisioning Playbook


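Positioning: A handbook on real-time feature platforms and real-time decisioning architecture for AI Product Architects, AI PMs, Data Product, Risk Product, and Platform Architects.

Core goal: Connect feature store, streaming features, online/offline consistency, point-in-time correctness, freshness SLO, replay audit, and real-time decisioning in retail financial services into a capability you can design, review, ship, and explain in an interview.

Core conclusion: Real-time AI decisioning is not "a model API plus a few real-time fields." It is a platform capability centered on entity/time semantics, feature contracts, low-latency online serving, historically consistent training sets, decision orchestration, monitoring and replay, and governance gates.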

Source Anchors

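These sources serve as learning anchors for establishing platform terminology, architectural boundaries, and governance language. They do not constitute vendor selection advice, legal advice, or regulatory advice.

Source	Link	How this playbook uses it
Feast Docs: Quickstart	https://docs.feast.dev/getting-started/quickstart	Understand the basic engineering shape of offline store, online store, materialization, push features, and real-time inference
Feast Docs: Use Cases	https://docs.feast.dev/getting-started/use-cases	Align scenario language for risk scorecards, historical feature retrieval, feature monitoring, and point-in-time training data
Feast GitHub	https://github.com/feast-dev/feast	Treat Feast as an open-source feature store reference implementation to understand the product boundaries of registry, feature server, and offline/online serving
Feast Feature Server	https://docs.feast.dev/getting-started/components/feature-server	Reference for the online feature serving API, push/read paths, and secure communication requirements in production
Uber Michelangelo	https://www.uber.com/us/en/blog/michelangelo-machine-learning-platform/	Reference for how an end-to-end ML platform covers data, training, deployment, prediction, and monitoring
Uber Palette Meta Store Journey	https://www.uber.com/us/en/blog/palette-meta-store-journey/	Reference for how a large-scale feature store manages curated features, auto-generated pipelines, and feature dispersal
Metaflow Docs	https://docs.metaflow.org/introduction/what-is-metaflow	Reference for production ML workflows, versioning, reproducibility, and process governance from local to production
Metaflow Production Deployments	https://docs.metaflow.org/production/introduction	Reference for production deployment, event triggering, fresh results, caching, model serving, and failure recovery
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Use the Govern / Map / Measure / Manage language to organize risk, monitoring, and governance evidence for AI decisioning systems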

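1. Positioning: What Capability This Playbook Adds

Many AI product documents describe real-time decisioning as:

Ingest the transaction stream.
Call the model for a score.
Block based on a threshold.
Log the result.

That describes only a demo. What is actually hard in a retail financial production system is:

Challenge	Production question	Architectural implication
Time semantics	Can a training sample see only the information known at decision time?	Requires point-in-time correctness, event_time, created_time, and late-arriving data controls
Consistency	Do the features used in training and in online inference share the same source and meaning?	Requires feature registry, feature contract, and offline/online parity tests
Freshness	Can payment fraud or agent tool-risk decisions use sufficiently recent behavioral signals?	Requires freshness SLO, streaming feature pipeline, and online store TTL
Latency	The decision must complete before authorization, payment, or a customer-service tool call	Requires low-latency feature serving, rule short-circuiting, and a fallback strategy
Leakage	Does the model use fields known only in the future or generated after the decision?	Requires leakage review, feature availability timestamps, and replay validation
Audit	Can you reproduce why a decision was made when a regulator, risk team, or customer disputes it?	Requires decision event, feature snapshot, model/rule/policy version, and replay
Governance	Does a new feature change fairness, privacy, decline reasons, or customer impact?	Requires feature governance, release gates, and risk sign-off

In one sentence:

The product value of a real-time feature platform is not "storing more features." It is letting high-risk decisions run at the right time, on the right entity, against the right version, within latency and freshness constraints, in a way that is provable, replayable, and governable.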

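2. Capability Map: From Batch Model to Real-Time Decisioning Platform

Level	Typical approach	Main risks	Mature framing
Batch scoring	Nightly batch job produces risk scores	Risk signals go stale; fast-moving fraud cannot be blocked	Fits low-frequency, low-urgency scenarios such as monthly marketing lists
Near-real-time scoring	Features or scores refresh every few minutes	Unclear handling of window delay, backfill, and duplicate events	Fits credit pre-approval and customer next-best-action
Real-time feature serving	Latest entity features are read online	Online/offline skew, unstable freshness	Fits payment fraud, account takeover, and customer-service tool risk
Real-time decisioning	Features, models, rules, policy, and human escalation are orchestrated together	Unexplainable decisions, difficult audits, wrongful blocks that affect customers	Fits customer rights, funds, KYC, credit, and agent tool-call risk
Closed-loop decision platform	Decisions, feedback, replay, monitoring, and governance form a closed loop	Feedback contamination, policy drift, missing regulatory evidence	Production-grade AI platform for risk, credit, payment, and operational decisions

Role view:

Role	Focus	Key deliverables
AI Product Architect	Platform boundaries, capability reuse, business system integration, risk tiering	Reference architecture, ADR, rollout strategy
AI PM	Scenario priority, latency experience, false-positive cost, human fallback, metrics	Decision PRD, freshness SLO, release gate
Data Product	Feature owner, contract, quality, lineage, data productization	Feature contract, quality SLO, monitoring pack
Risk Product	Thresholds, rules, models, reason codes, override, risk appetite	Decision policy, champion/challenger, override analysis
Platform Architect	Offline/online store, streaming, serving, audit, reliability	C4, sequence, capacity, resilience plan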

3. Real-Time Decisioning Reference Architecture

3.1 Architecture Diagram, Text Version

Event sources
  Payment authorization, login, device fingerprint, transaction, KYC upload, loan application, CRM interaction, agent tool-call intent

Event ingestion layer
  API Gateway / Kafka / CDC / webhook / file drop
  Responsibilities: schema validation, idempotency key, event_time, source timestamp, PII classification

Stream processing layer
  Flink / Spark Structured Streaming / Kafka Streams / managed streaming
  Responsibilities: window aggregation, watermark, late event handling, entity enrichment, feature value emission

Feature Platform
  Feature Registry
    feature definition, entity, owner, version, contract, permission, freshness SLO, lineage
  Offline Store
    historical features, point-in-time training retrieval, backfill, batch scoring
  Online Store
    low-latency lookup, TTL, latest feature values, serving freshness
  Feature Server
    standardized API, auth, rate limit, request trace, online feature retrieval

Decisioning Layer
  Decision Orchestrator
    collects request context, online features, model score, rules, policy, reason codes
  Model Serving
    fraud / credit / KYC / NBA / tool-risk models, versioned and observable
  Rules and Policy Engine
    hard blocks, risk appetite, regulatory rules, manual review routing, fallback
  Human Review Queue
    high-risk exceptions, borderline cases, disputes, adverse action review

Systems of Record
  payment switch, core banking, loan origination, KYC case manager, CRM, contact center, AI tool gateway

Observability and Governance
  feature freshness, quality, latency, skew, model drift, decision outcomes, override, audit replay, incident loop

3.2 Mermaid View

flowchart TB
  E[Events: payment, KYC, credit, CRM, agent tool intent] --> I[Event ingestion and schema validation]
  I --> S[Streaming feature pipelines]
  I --> L[Immutable event log]
  S --> R[Feature registry and contracts]
  S --> O[Online feature store]
  S --> F[Offline feature store]
  F --> T[Point-in-time training dataset]
  O --> FS[Feature server]
  FS --> D[Decision orchestrator]
  T --> M[Model training and validation]
  M --> MS[Model serving]
  MS --> D
  D --> P[Rules and policy engine]
  P --> A[Allow / block / step-up / manual review / fallback]
  A --> B[Business systems]
  D --> G[Decision audit log]
  R --> G
  L --> RP[Replay and simulation]
  G --> RP
  RP --> V[Release gate and governance review]
  Mon[Freshness, skew, drift, latency, error budget monitoring] --> V
  Mon --> Inc[Incident and rollback]

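3.3 Component Responsibility Boundaries

Component	Responsible for	Not responsible for
Event ingestion	Event structure, identity, idempotency, event_time, source metadata, data classification	Complex model decisions
Streaming feature pipeline	Window aggregation, entity enrichment, late event policy, feature emission	Deciding business blocking thresholds
Feature registry	Feature definition, owner, version, contract, lineage, permission, SLO, purpose	Replacing the data warehouse or the model registry
Offline store	Historical features, training set construction, replay, batch scoring	Millisecond-level online decisions
Online store	Low-latency reads, TTL, latest values, serving freshness	Complex historical joins
Feature server	API, authentication, rate limiting, trace, online reads	Directly executing high-risk business actions
Decision orchestrator	Combining request, features, model, rules, policy, reason codes, and fallback	Owning every feature definition
Rules and policy engine	Hard rules, human escalation, regulatory constraints, risk appetite	Training models
Audit and replay	Decision evidence, versions, snapshots, replay, simulation, dispute handling	Serving as a technical log only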

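4. Entity / Time Semantics: The Underlying Contract of a Real-Time Feature Platform

4.1 An Entity Is More Than a Primary Key

Real-time decisioning in retail financial services usually involves several entities at once:

Entity	Example	Decision value
customer_id	Retail customer, cardholder, borrower	Historical behavior, risk tier, lifecycle
account_id	Deposit account, credit card account, loan account	Account balance, repayment, transaction patterns
card_id / token_id	Physical card, virtual card, wallet token	Payment authorization, card fraud, device migration
merchant_id	Merchant, acquiring merchant, platform storefront	Merchant risk, MCC, chargeback rate
device_id	Device fingerprint, browser, mobile device	Account takeover, KYC document upload risk
application_id	Loan application, account-opening application, KYC case	Decision context, document status, approval stage
agent_session_id	Customer-service agent session, tool-call plan	Tool risk, privilege overreach, customer impact

A mature design makes the following explicit:

Whether a feature binds to a single entity or needs a multi-entity join.
Who owns entity resolution, and what confidence level it carries.
How the decision degrades when an online request is missing an entity.
How historical features are handled after an entity merge or split.
Whether high-risk entity relationships need a graph or link analysis as independent evidence.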

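4.2 Minimum Semantics for Time Fields

Time field	Meaning	Consequence of misuse
`event_time`	When the business event actually occurred	Misaligned decision windows, training-set leakage
`ingestion_time`	When the platform received the event	Network delay is mistaken for business delay
`created_time`	When the feature value or source record was generated	Historical training sets wrongly use features generated only later
`available_time`	When the feature became available to the decision system	Data latency is ignored, causing point-in-time errors
`decision_time`	When the model or rule made the decision	The audit cannot reproduce the decision
`effective_time`	When a policy, rule, or feature definition took effect	Old and new rules get mixed
`expiry_time`	When a feature value, rule, or source expires	Expired risk signals get used

Production-grade requirement:

For a training sample at decision_time T:
join only feature values with available_time <= T.
If a source event has event_time <= T but available_time > T, it must not be used in training either.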
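
The rule above can be sketched as a small lookup. This is a minimal illustration, not the retrieval API of any particular feature store: `FeatureValue`, `point_in_time_lookup`, and the sample timestamps are hypothetical names and values chosen for the example. The second value has `event_time` before T but only became available after T, so it is excluded.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FeatureValue:
    value: float
    event_time: datetime      # when the business event happened
    available_time: datetime  # when the decision system could first read the value


def point_in_time_lookup(history: list[FeatureValue],
                         decision_time: datetime) -> Optional[FeatureValue]:
    """Return the most recent value with available_time <= decision_time."""
    ordered = sorted(history, key=lambda fv: fv.available_time)
    keys = [fv.available_time for fv in ordered]
    idx = bisect_right(keys, decision_time)  # count of values visible at T
    return ordered[idx - 1] if idx > 0 else None


def utc(hour: int, minute: int, second: int) -> datetime:
    # Hypothetical sample date, used only to build the example timestamps.
    return datetime(2026, 6, 29, hour, minute, second, tzinfo=timezone.utc)


history = [
    FeatureValue(1.0, event_time=utc(14, 0, 0), available_time=utc(14, 0, 1)),
    # Event happened before T but arrived late: not visible at T.
    FeatureValue(2.0, event_time=utc(14, 4, 50), available_time=utc(14, 5, 40)),
]
T = utc(14, 5, 31)
picked = point_in_time_lookup(history, T)
```

Filtering on `available_time` rather than `event_time` is what keeps late-arriving data out of the training sample.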

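4.3 Point-in-Time Correctness

The core of point-in-time correctness is not SQL technique but business provability:

Checkpoint	Passing behavior	High-risk failure
Label timing	The label occurs after the decision, with a clearly defined window	Chargeback outcomes leak back into fraud features
Feature availability	Training uses only features that were available at the time	Post-origination performance, manual review conclusions, or final KYC status used as pre-origination features
Historical join	Each entity takes the most recent available value as of decision_time	Taking the latest snapshot directly
Late-arriving data	Backfill does not change what was visible at the historical decision time	Replay sees data that was not available online at the time
Rule/model version	Replay uses the version in effect at the time	Explaining past decisions with current rules
Source version	Policies, blocklists, and feature definitions are resolved by validity period	Explaining a past block with a list updated later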
5. Online / Offline Feature Consistency

5.1 Consistency Goals

One of the core values of a feature store is letting model training and online serving share feature definitions, lineage, and quality constraints.

Consistency dimension	Training side	Serving side	Validation method
Definition	Feature view, SQL, UDF, window, filter conditions	Materialized / pushed value uses the same contract	Definition hash, contract review
Entity	Historical entity dataframe	Online lookup entity keys	Entity completeness, join hit rate
Time	Point-in-time join	Latest available value at request time	Replay parity test
Transformation	Batch transform	Stream transform	Golden entity comparison
Defaults	Missing/null imputation	Online fallback/default	Null policy parity
Freshness	Historical backfill	Online TTL and freshness SLO	Freshness monitor
Permission	Training dataset access	Serving request access	Entitlement replay

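5.2 Training-Serving Skew Taxonomy

Skew type	Typical symptom	Impact in retail financial services	Control
Compute skew	Batch SQL and streaming job logic differ	Fraud scores lose accuracy after launch	Single feature contract, shared test set, code generation or same-source transform
Freshness skew	Training assumes availability at time T; online data actually lags by 2 minutes	Missed payment blocks	available_time, freshness SLO, degraded decision policy
Missingness skew	Training samples are fully backfilled; online requests have many missing values	New and thin-file customers are misjudged	Online missing policy, slice monitoring
Entity skew	Training keys on customer_id; serving keys on account_id	Customer relationships are mismatched	Entity mapping contract, join hit rate
Default skew	Training fills missing values with 0; serving fills with null or the previous value	Risk score shifts	Default value registry, contract test
Policy skew	Training set uses the old blocklist; serving uses the new one	Results cannot be reproduced	Source versioning, policy effective_time
Timezone/clock skew	Systems use different time bases	Window aggregation errors	UTC normalization, clock drift monitor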
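
A golden-sample parity check is one way to surface compute and default skew. The sketch below is illustrative only: `parity_report`, the key shape, and the sample values are assumptions, not part of any feature store API. It treats a `None` on one side and a number on the other as a mismatch, which is exactly the default-skew case in the table.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical key: (entity id, decision timestamp in epoch seconds).
Key = tuple[str, int]


def parity_report(offline: dict[Key, Optional[float]],
                  online: dict[Key, Optional[float]],
                  tolerance: float = 0.0) -> dict:
    """Compare offline recomputation with online values on a golden sample."""
    keys = sorted(set(offline) | set(online))
    if not keys:
        raise ValueError("golden sample is empty")
    mismatches = []
    for key in keys:
        # A key absent from one side is treated the same as a None value.
        off, on = offline.get(key), online.get(key)
        if off is None or on is None:
            if off is not on:  # null on exactly one side: default skew
                mismatches.append(key)
        elif abs(off - on) > tolerance:
            mismatches.append(key)
    return {"parity": 1 - len(mismatches) / len(keys), "mismatches": mismatches}


offline_values = {("card_a", 100): 3, ("card_b", 100): 0,
                  ("card_c", 100): 2, ("card_d", 100): 1}
online_values = {("card_a", 100): 3, ("card_b", 100): None,  # served null, trained on 0
                 ("card_c", 100): 2, ("card_d", 100): 1}
report = parity_report(offline_values, online_values)
```

Keeping the mismatching keys in the report matters as much as the rate, because each difference still has to be explained before release.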

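5.3 Feature Leakage Taxonomy

Leakage type	Example	Why it is dangerous	Prevention
Future leakage	Using the chargeback outcome at T+7 days to predict payment fraud at time T	Offline metrics are inflated; the model fails online	Point-in-time retrieval, label cutoff
Post-decision leakage	Using the manual review conclusion as input to automated review	The model learns the human outcome rather than upstream signals	Feature availability review
Target leakage	A feature directly encodes the target variable	Falsely high AUC for credit, KYC, and fraud models	Leakage audit, feature importance review
Operational leakage	Using "entered the manual queue" to predict high risk	The model replicates bias in the old process	Process feature review
Source leakage	A data source field that exists only for bad samples	Sample selection bias	Source coverage and missingness slice
Feedback leakage	Using outcomes after model blocking as labels for unblocked outcomes	The policy validates itself	Exploration, reject inference, shadow labels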

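6. Streaming Features and Freshness SLO

6.1 Streaming Feature Types

Feature type	Example	Decision value
Rolling count	Failed payments on the same card within 5 minutes	Payment fraud, account takeover
Rolling amount	Total cross-border transaction amount within 10 minutes	Authorization risk control, anti-money-laundering leads
Velocity	Device switches within 1 hour	Login risk, KYC document risk
Distinct count	Customers linked to the same device within 30 minutes	Device farms, synthetic identity
Ratio	Failed transaction rate within 24 hours	Merchant risk, payment stability
Time since last event	Time since the last successful KYC upload	KYC case prioritization
Sequence pattern	Login, phone number change, and transfer initiation in succession	Account takeover
Agent action context	Number of customer data reads in the current session, risk of the proposed tool call	Customer-service agent tool risk control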

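6.2 Event-Time vs Processing-Time

Choice	When it applies	Risk
Event-time windows	Risk judgment depends on when the business event really happened	Requires watermark and late event policy
Processing-time windows	Only real-time handling after platform receipt matters	Network delay changes what the window means
Hybrid	Processing-time for low-latency serving; event-time + available_time for replay and training	Online/offline differences must be clearly labeled

Recommendations for financial scenarios:

Payment authorization risk control treats low latency as the first constraint, but training and replay must record event_time, ingestion_time, and available_time.
Credit pre-approval can usually tolerate delays of seconds to minutes, but leakage of post-origination and manual review outcomes must be strictly controlled.
Agent tool-risk decisions usually need in-session real-time state, so their windows are closer to processing-time, but the audit must still keep the event sequence.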
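
To make the event-time option concrete, here is a minimal single-entity rolling counter with a watermark. It is a teaching sketch and not how Flink, Spark Structured Streaming, or Kafka Streams implement windows. `RollingEventTimeCounter`, the 300-second window, and the 30-second allowed lateness are assumptions for the example, and old events are never evicted.

```python
from __future__ import annotations

from bisect import bisect_right, insort
from typing import Optional


class RollingEventTimeCounter:
    """Counts events in the window (as_of - window_s, as_of], by event time."""

    def __init__(self, window_s: float = 300, allowed_lateness_s: float = 30):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.events: list[float] = []          # event times, kept sorted
        self.max_event_time: Optional[float] = None

    def watermark(self) -> Optional[float]:
        # Events older than the watermark are considered too late for the online path.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness_s

    def add(self, event_time_s: float) -> bool:
        """Insert an event; return False if it arrived behind the watermark."""
        wm = self.watermark()
        if wm is not None and event_time_s < wm:
            return False  # would be handled by offline replay only
        insort(self.events, event_time_s)
        if self.max_event_time is None or event_time_s > self.max_event_time:
            self.max_event_time = event_time_s
        return True

    def count(self, as_of_s: float) -> int:
        lo = bisect_right(self.events, as_of_s - self.window_s)
        hi = bisect_right(self.events, as_of_s)
        return hi - lo


counter = RollingEventTimeCounter()
# 270 is 20 s late (accepted); 200 is 90 s late (rejected by the watermark).
accepted = [counter.add(t) for t in (0, 100, 290, 270, 200)]
```

With a processing-time window the rejected event would simply have been counted at its arrival time, which is the change of window meaning the table warns about.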

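6.3 Freshness SLO and Error Budget

Freshness is not a single metric. At minimum, break it down into:

Metric	Definition	Example
Source lag	Time from the source event occurring to the platform receiving it	Payment events p99 < 500 ms
Pipeline lag	Time from platform receipt to feature computation completing	Velocity feature p99 < 800 ms
Online materialization lag	Time from feature computation completing to the value being readable in the online store	Online write p99 < 300 ms
Serving freshness	Age of the feature value read at decision time	`card_decline_count_5m` age p99 < 2 s
Decision latency	Time from request arrival to returning allow/block/review	Payment authorization p99 < 120 ms
Staleness rate	Share of requests that exceed the freshness threshold	< 0.1% per day

The error budget must be tied to a business action:

Feature group	Freshness SLO	Error budget	Action when the budget is exceeded
Payment fraud velocity	99.9% requests age <= 2s	Stale requests <= 0.1% per day	Degrade to conservative rules mode; step-up for high-risk transactions
Credit pre-approval bureau enrichment	99% records age <= 24h	Stale applications <= 1% per month	Stop automated pre-approval; route to manual review or re-pull
KYC document risk	99% document signals age <= 5m	Stale cases <= 0.5% per day	Pause auto-pass; allow manual review only
Agent tool-risk session features	99.5% session state age <= 1s	Stale tool calls <= 0.2% per day	Require approval for high-risk tools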
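
The staleness rate and its budget check reduce to simple arithmetic. The sketch below uses the payment fraud velocity row (2-second threshold, 0.1% daily budget); `stale_rate`, `budget_action`, the action labels, and the sample ages are hypothetical names and data for illustration.

```python
from __future__ import annotations


def stale_rate(feature_ages_ms: list[int], threshold_ms: int) -> float:
    """Share of reads whose feature age exceeds the freshness threshold."""
    if not feature_ages_ms:
        raise ValueError("no feature reads in the SLO window")
    stale = sum(1 for age in feature_ages_ms if age > threshold_ms)
    return stale / len(feature_ages_ms)


def budget_action(rate: float, error_budget: float) -> str:
    # Assumed action labels; the real action comes from the feature group's policy.
    return "normal" if rate <= error_budget else "conservative_rules_and_step_up"


# Hypothetical day of reads: 9,990 fresh reads and 10 stale reads.
ages_ok = [300] * 9990 + [2500] * 10
# Same day with one more stale read, which pushes the rate over the budget.
ages_over = [300] * 9989 + [2500] * 11

rate_ok = stale_rate(ages_ok, threshold_ms=2000)
rate_over = stale_rate(ages_over, threshold_ms=2000)
```

Here `rate_ok` sits exactly at the 0.1% budget and stays in normal mode, while one additional stale read in `ages_over` triggers the degraded action.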

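7. Real-Time Decisioning Scenarios in Retail Financial Services

7.1 Real-Time Payment Fraud Blocking

Dimension	Design points
Decision point	After the authorization request enters the payment switch, before approve / decline / step-up / manual review is returned
Entities	card_id, token_id, customer_id, merchant_id, device_id, ip, account_id
Real-time features	5-minute failure count, 10-minute amount velocity, customers per device, merchant risk velocity, geo jump, new payee flag
Offline features	Customer historical risk tier, merchant historical chargeback rate, account lifecycle, historical disputed transaction ratio
Latency budget	Total decision latency is usually managed in milliseconds; feature lookup must have a strict timeout
Fallback	On feature timeout, fall back to rules-only or step-up; high-risk requests are never approved by default
Audit	Request context, feature values, feature age, model version, rule version, decision, reason code
Risk tradeoff	A false negative is fraud loss; a false positive costs customer experience, revenue, and complaints

Senior product judgment:

Do not equate "high model score" with decline. Payment scenarios should offer a multi-action policy: step-up, 3DS, limits, delayed release, manual review.
Monitor the false-positive rate separately for new customers, thin-file customers, and cross-border scenarios; otherwise the real-time model treats missing data as risk.
Record decision arbitration when rules and the model conflict, because post-hoc disputes usually ask "why was it not blocked" or "why was it blocked."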
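
The fallback and multi-action points above can be sketched as a small arbitration function. Everything here is an assumption for illustration: `decide_payment`, the score bands (0.95 and 0.80), and the action labels are hypothetical, and a real policy would come from the risk team's decision policy.

```python
from __future__ import annotations

from typing import Optional


def decide_payment(score: Optional[float], features_fresh: bool,
                   high_risk_slice: bool, hard_block: bool,
                   rules_only_action: str = "approve") -> dict:
    """Arbitrate between hard rules, fallback, and model score bands."""
    if hard_block:
        return {"action": "decline", "path": "hard_rule", "fallback_used": False}
    if not features_fresh or score is None:
        # Feature timeout or stale velocity features: never default-approve high risk.
        action = "step_up_authentication" if high_risk_slice else rules_only_action
        return {"action": action, "path": "fallback", "fallback_used": True}
    if score >= 0.95:   # assumed band
        action = "decline"
    elif score >= 0.80:  # assumed band
        action = "step_up_authentication"
    else:
        action = "approve"
    return {"action": action, "path": "model", "fallback_used": False}


normal = decide_payment(0.87, features_fresh=True, high_risk_slice=False, hard_block=False)
timeout_high = decide_payment(None, features_fresh=False, high_risk_slice=True, hard_block=False)
timeout_low = decide_payment(None, features_fresh=False, high_risk_slice=False, hard_block=False)
```

Returning the `path` alongside the action is a simple way to keep the arbitration result available for the audit log.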

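7.2 Credit Pre-Approval

Dimension	Design points
Decision point	Before the customer browses products, starts an application, receives a limit increase, or receives a marketing touch
Entities	customer_id, account_id, application_id, household_id
Real-time features	Recent income deposits, recent delinquency, account balance changes, recent hard inquiries, application frequency
Offline features	Credit history, income stability, product holdings, repayment history, behavioral score
Freshness	Most features can be minute-level to day-level; high-risk credit bureau or internal delinquency status must carry a clear version
Decision output	eligible / not eligible / invite to apply / manual review / insufficient data
Governance	Fairness, decline reasons, adverse action, explainability, human review boundary
Leakage risk	Using post-origination performance, approval conclusions, manual notes, or decline reasons as pre-origination features

Senior product judgment:

Pre-approval is not final credit approval. The PRD must define customer-facing wording so that marketing eligibility is not written as a credit commitment.
The feature contract must state which features may be used for eligibility, which only for internal ranking, and which may not be used as an adverse action reason.
Replay validation must cover slices such as declined, thin-file, low-income, region, and channel, not just overall KS/AUC.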

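7.3 KYC Document Decisioning

Dimension	Design points
Decision point	Document upload, OCR, authenticity check, list screening, case routing
Entities	customer_id, application_id, document_id, device_id, ip, beneficial_owner_id
Real-time features	Upload device change, document reuse, OCR confidence, liveness risk, multiple applications per device, sanctions screening status
Offline features	Customer KYC remediation history, country/industry risk, entity structure complexity, historical missing-document rate
Decision output	auto-pass / request-more-info / enhanced due diligence / manual review / reject recommendation
Freshness	Document and list status need a traceable effective_time; the sanctions/PEP source version must be auditable
Audit	Document version, OCR output, model score, rules, human override, customer communication record
Leakage risk	Using the final manual KYC status to train the automated triage model that runs at upload time

Senior product judgment:

The safe boundary for KYC automation is usually triage and evidence gathering, not unconstrained final rejection.
For features related to sanctions, PEP, and adverse media, source version and match confidence must go into the audit log.
Once document models, rules, and manual review form a closed loop, guard against learning only historical human bias.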

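7.4 Customer Next-Best-Action

Dimension	Design points
Decision point	App home page, customer-service session, marketing touch, branch relationship-manager workbench
Entities	customer_id, household_id, channel_id, session_id
Real-time features	Current session intent, recent transactions, complaint status, service failures, channel activity, recent touches
Offline features	Customer value, product holdings, lifecycle, preferences, risk restrictions, consent status
Decision output	recommend / suppress / defer / service-first / human follow-up
Governance	Consent, fair treatment, suitability, vulnerability, complaint status, frequency capping
Freshness	Customer complaints, opt-outs, and risk restrictions must take effect in near real time
Monitoring	Uplift, complaint rate, opt-out rate, offer fatigue, protected slice

Senior product judgment:

NBA is not a pure recommender system. Retail financial services must put suitability, consent, complaint, vulnerability, and risk restriction into the policy layer.
Some real-time signals should trigger "do not sell, serve first," for example a payment that just failed or a complaint that just escalated.
The feature platform must support suppress features and policy features, not only propensity features.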

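7.5 Customer-Service Agent Tool-Call Risk Decisions

Dimension	Design points
Decision point	Before the LLM agent calls a read / write / external-send tool
Entities	agent_session_id, user_id, customer_id, case_id, tool_id, tenant_id
Real-time features	Current session sensitivity, number of customer fields already read, tool risk level, prompt injection signal, DLP hit, approval history
Offline features	User role, historical permissions, tool catalog, customer risk, case type, policy version
Decision output	allow / redact_then_allow / dry_run / require_approval / deny / kill_switched
Freshness	Session state and kill switch must take effect within seconds
Audit	Tool proposal, arguments, features, policy decision, approver, tool result summary
Leakage risk	Treating a model suggestion as authorization, or letting instructions inside a tool result trigger the next tool call

Senior product judgment:

The "features" here are not only traditional ML features; they are also policy decision context.
Agent risk decisions need low-latency online features working together with a rules engine, and should not rely on prompt guardrails alone.
Any tool call that involves external sending, funds, customer rights, or regulatory records must be replayable, with proof of why it was allowed or blocked at the time.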
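
A rules-first gate in front of tool calls might look like the sketch below. The ordering of checks, `ToolCallContext`, and `decide_tool_call` are hypothetical; the output labels reuse the decision outputs from the table, and `dry_run` is left out of this simplified example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCallContext:
    tool_kind: str                 # "read", "write", or "external_send"
    kill_switch_on: bool
    session_state_fresh: bool      # session features within their freshness SLO
    prompt_injection_signal: bool
    dlp_hit: bool
    has_approval: bool


def decide_tool_call(ctx: ToolCallContext) -> str:
    """Return a policy decision for a proposed agent tool call."""
    if ctx.kill_switch_on:
        return "kill_switched"
    if ctx.prompt_injection_signal:
        return "deny"
    high_risk = ctx.tool_kind in {"write", "external_send"}
    # High-risk tools need approval, and stale session state invalidates prior approval.
    if high_risk and (not ctx.session_state_fresh or not ctx.has_approval):
        return "require_approval"
    if ctx.dlp_hit:
        return "redact_then_allow"
    return "allow"


read_with_dlp = decide_tool_call(ToolCallContext("read", False, True, False, True, False))
stale_send = decide_tool_call(ToolCallContext("external_send", False, False, False, False, True))
killed = decide_tool_call(ToolCallContext("read", True, True, False, False, False))
```

The decision depends only on explicit context fields, so the same inputs stored in the audit log reproduce the same outcome on replay.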

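8. Product Decision: When You Need a Real-Time Feature Store

8.1 When You Do Not Need a Real-Time Feature Store

Scenario	Better-suited approach
Daily batch marketing lists	Batch feature table + campaign rules
Low-risk internal reports	Warehouse semantic layer + BI metrics
A few features with no reuse	Service-local cache + contract tests
Pure document Q&A	RAG source governance + retrieval eval
No clear decision action	Do decision discovery first; do not rush to build a real-time platform

8.2 Signals That Call for a Platform

Trigger signal	Meaning
Multiple models reimplement the same feature	Features need a registry, owner, reuse, and quality management
Training and online feature logic frequently diverge	Needs offline/online consistency and parity tests
Decisions are sensitive to feature freshness	Needs streaming features, freshness SLO, and online store
High-risk decisions need audit replay	Needs immutable event log, feature snapshot, and decision trace
Features involve PII, credit, KYC, or payment risk	Needs feature governance, permission, retention, and risk review
Multiple teams share entity and time semantics	Needs entity registry, feature contracts, and a data product model

8.3 Feast / In-House / Commercial Platform Tradeoffs

Option	Good fit	Risk
Feast-style open-source feature store	The team has platform engineering capability and needs an open architecture with controllable integration	Governance, UI, SLO, permissions, and operations must be built in-house
In-house lightweight feature platform	Narrow scenario, simple architecture, and a team that needs fast control of the critical path	Tends to grow into an implicit platform that lacks a registry and governance
Commercial feature platform	Multi-team, multi-cloud, heavy governance requirements, and a wish to shorten platform build time	Vendor lock-in and integration complexity; financial audit requirements remain an internal responsibility
Warehouse + online cache	Mostly batch features with low-frequency online reads	Point-in-time, streaming, parity, and SLO need additional design

The selection question is not "who has more features," but:

Can you prove that training and online features mean the same thing?
Can you replay by decision_time?
Can you define and meet a freshness SLO?
Can you govern PII, permissions, lineage, owner, purpose, and risk?
Can you locate during an incident which layer failed: feature, model, rule, or policy?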

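9. Governance Model

9.1 Feature Lifecycle

Stage	Key actions	Evidence
Propose	State business purpose, entity, time, source, risk, and expected decision impact	Feature proposal
Contract	Define computation logic, freshness, quality, permission, retention, and allowed use	Feature contract
Build	Implement batch/stream transform, tests, lineage, and registry entry	CI result, data quality report
Validate	Offline/online parity, leakage review, replay, slice impact	Validation report
Approve	Sign-off by PM, Data owner, Risk, Compliance, and Architect	Governance review
Serve	Materialize/push to the online store and connect to the feature server	Serving readiness
Monitor	Freshness, quality, latency, skew, drift, decision outcome	Dashboard and alerts
Deprecate	Mark the replacement feature, stop new dependencies, retain audit replay	Deprecation record

9.2 NIST AI RMF Mapping

AI RMF function	Where it lands in a real-time decisioning platform
Govern	Feature ownership, decision authority, risk appetite, approval workflow, audit responsibility
Map	Scenario, customer impact, data sources, entity/time, decision actions, failure modes, affected populations
Measure	Freshness, skew, leakage, latency, model quality, false positive/negative, override, incident
Manage	Release gate, fallback, manual review, kill switch, rollback, feature deprecation, post-incident replay

9.3 Governance Principles for High-Risk Features

Principle	Explanation
Purpose-bound	A feature must declare its allowed use, for example fraud detection, credit eligibility, or customer service risk
Time-aware	Every feature must have event_time, available_time, or explicit snapshot semantics
Owner-backed	Business owner, data owner, technical owner, and risk owner must all be present
Explainable enough	Features that affect customer rights must be able to produce a reason code or enter the explanation chain
Privacy-aware	PII, PCI, sensitive identity, credit, KYC, and AML data must be labeled and minimized
Replayable	Feature values used in high-risk decisions must be reproducible or provable in an audit
Degradable	When a feature is unavailable there is an explicit fallback; the model is not left to guess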

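10. Ready-to-Use Deliverable Templates

The templates below are all filled with concrete examples. When you copy them into a project you may replace business names and thresholds, but do not delete the owner, time, freshness, audit, or risk fields.

10.1 Real-Time Decisioning Architecture, Text Version

Use case:
  Real-time payment fraud blocking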

Decision SLA:
  End-to-end authorization decision p99 <= 120ms; feature lookup p99 <= 20ms; model scoring p99 <= 35ms

Event sources:
  payment_authorization, card_decline, device_fingerprint, merchant_profile, customer_profile, chargeback_case

Entities:
  card_id, token_id, customer_id, merchant_id, device_id, account_id

Streaming features:
  card_decline_count_5m, card_amount_sum_10m, device_distinct_customer_count_30m,
  merchant_high_risk_auth_count_10m, geo_velocity_score_1h

Offline features:
  customer_lifetime_dispute_rate_180d, merchant_chargeback_rate_90d,
  account_age_days, customer_risk_segment, prior_fraud_case_count_365d

Feature platform:
  Registry stores feature contracts, owners, versions, allowed uses, freshness SLO, lineage.
  Offline store supports point-in-time retrieval for training and replay.
  Online store serves latest features with TTL and feature age.
  Feature server enforces auth, timeout, trace id and rate limit.

Decision orchestration:
  Request context + online features + fraud model score + hard rules + policy constraints.
  Output: approve, decline, step_up_authentication, manual_review.
  Fallback: if velocity features stale, use conservative rule set and step-up for high-risk slices.

Audit and replay:
  Store request id, entity ids, feature values, feature age, model version, rule version,
  policy version, decision output, reason codes, downstream action, feedback label.

10.2 Feature Contract: Payment Fraud Velocity Feature

Field	Example
Feature name	`card_decline_count_5m`
Business purpose	Real-time fraud blocking and step-up routing for payment authorization
Allowed use	fraud detection, payment authorization risk, fraud model training, audit replay
Disallowed use	credit eligibility, marketing targeting, customer value ranking
Primary entity	`card_id`
Secondary entity	`token_id`, `customer_id`
Source events	`payment_authorization` with status `declined`
Event time	`payment_authorization.event_time_utc`
Available time	max of ingestion timestamp and streaming output timestamp
Window	Rolling 5 minutes, event-time based, watermark 30 seconds
Aggregation	Count declined authorizations for same card_id excluding system test transactions
Late event policy	Late events within 30 seconds update online value; later events only affect offline replay
Null policy	Missing value means no event observed; default `0` with `is_missing=false`
TTL	10 minutes in online store
Freshness SLO	99.9% online reads feature age <= 2 seconds
Offline retrieval	Point-in-time join by `decision_time`, using `available_time <= decision_time`
Online serving	Feature server returns value, feature timestamp, feature age, contract version
Data classification	Customer financial behavior; confidential; no external sharing
Owner	Payment Risk Data Product Owner
Risk owner	Fraud Strategy Lead
Technical owner	Real-Time Feature Platform Team
Quality tests	non-negative integer, 99.9% not null, drift alert on 7-day percentile shift > 30%
Parity tests	1,000 golden card_id/time samples compare streaming output vs offline recomputation
Leakage control	Excludes chargeback label, manual review result, post-decision dispute status
Audit fields	feature_name, value, event_time, available_time, contract_version, source_event_count
Review cadence	Monthly risk review; immediate review after fraud incident or source schema change
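
A contract like this can also be held as a typed record so that missing fields and purpose violations are caught mechanically. The sketch below covers only a subset of the fields above; `FeatureContract`, `missing_fields`, and `is_use_allowed` are hypothetical names, not part of Feast or any other registry.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FeatureContract:
    name: str
    primary_entity: str
    event_time: str
    available_time: str
    freshness_slo: str
    owner: str
    risk_owner: str
    allowed_use: frozenset
    disallowed_use: frozenset
    audit_fields: tuple


def missing_fields(contract: FeatureContract) -> list[str]:
    """List fields that are empty; any entry here should block the release gate."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


def is_use_allowed(contract: FeatureContract, purpose: str) -> bool:
    return purpose in contract.allowed_use and purpose not in contract.disallowed_use


contract = FeatureContract(
    name="card_decline_count_5m",
    primary_entity="card_id",
    event_time="payment_authorization.event_time_utc",
    available_time="max(ingestion timestamp, streaming output timestamp)",
    freshness_slo="99.9% online reads feature age <= 2 seconds",
    owner="Payment Risk Data Product Owner",
    risk_owner="Fraud Strategy Lead",
    allowed_use=frozenset({"fraud detection", "payment authorization risk",
                           "fraud model training", "audit replay"}),
    disallowed_use=frozenset({"credit eligibility", "marketing targeting",
                              "customer value ranking"}),
    audit_fields=("feature_name", "value", "event_time", "available_time",
                  "contract_version", "source_event_count"),
)
```

A serving-side check such as `is_use_allowed(contract, "marketing targeting")` returning `False` is one way to enforce the purpose-bound principle at request time rather than only at review time.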

10.3 Freshness and Error Budget

Item	Definition
Feature group	Payment fraud velocity features
Business impact	Stale features may miss rapid fraud bursts or cause overly conservative step-up
SLO window	Daily, calculated by decision requests
Primary SLO	99.9% of online feature reads have feature_age <= 2 seconds
Secondary SLO	Feature server p99 latency <= 20ms; online store read error rate <= 0.05%
Error budget	Stale feature reads > 2 seconds must stay <= 0.1% per day
Burn alert	2-hour rolling stale rate >= 0.05% triggers warning; >= 0.1% triggers incident
Degraded mode	High-risk transactions route to step-up; low-risk transactions use rules-only score
Stop condition	Stale rate >= 0.5% for 15 minutes disables model path for affected region
Recovery condition	Freshness SLO met for 30 minutes, replay confirms no material missed fraud cluster
Owner	Real-Time Feature Platform on-call plus Payment Risk on-call
Audit evidence	Alert id, affected features, affected entity count, decision fallback count, replay result
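
The burn alert and stop condition above can be expressed as one classification step. This sketch assumes the 15-minute stop condition is evaluated on per-minute stale rates and means "every one of the last 15 minutes is at or above 0.5%"; that interpretation, `burn_state`, and the state labels are assumptions for illustration.

```python
from __future__ import annotations


def burn_state(stale_rate_2h: float, per_minute_rates: list[float]) -> str:
    """Map rolling stale rates to an alert state using the thresholds in the table."""
    last_15 = per_minute_rates[-15:]
    # Assumed reading of "stale rate >= 0.5% for 15 minutes": sustained in every minute.
    sustained_stop = len(last_15) == 15 and min(last_15) >= 0.005
    if sustained_stop:
        return "disable_model_path"
    if stale_rate_2h >= 0.001:    # 0.1% over 2 hours
        return "incident"
    if stale_rate_2h >= 0.0005:   # 0.05% over 2 hours
        return "warning"
    return "ok"


states = [
    burn_state(0.0002, [0.0] * 15),
    burn_state(0.0006, [0.0] * 15),
    burn_state(0.0020, [0.001] * 15),
    burn_state(0.0020, [0.006] * 15),
]
```

Evaluating the stop condition before the incident threshold keeps the most severe state from being masked by a lower one.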

10.4 Release Gates

Gate	Pass evidence	Blocking failure
Feature contract	Contract includes entity, time, source, owner, allowed use, freshness, leakage control	Missing owner, missing time semantics, allowed use unclear
Point-in-time validation	10,000 historical decisions replay with `available_time <= decision_time`	Any future feature used in high-risk sample
Offline/online parity	Golden sample parity >= 99.5%, all differences explained	Streaming and batch definitions diverge on critical feature
Freshness SLO	7-day load test meets SLO and error budget	No degraded mode or stale rate above threshold
Latency	Decision path p99 within scenario SLA under peak load	Timeout causes silent default approve for high-risk request
Leakage review	Feature list reviewed for target, post-decision and operational leakage	Manual review result or future label appears in model input
Monitoring	Dashboards and alerts for freshness, nulls, drift, skew, latency, decisions, overrides	No alert owner or no incident route
Audit replay	Sample decisions replay to same decision or documented tolerance	Cannot reconstruct feature values, model version or rules
Governance	PM, Data owner, Risk, Compliance, Architect approve release scope	Risk owner rejects or customer-impact boundary unclear
Rollback	Feature disable flag, model rollback, rules fallback tested	Rollback requires code deploy during incident

10.5 Replay Validation Plan

Step	Execution	Evidence
1	Select 30 days of historical payment authorization events and decisions	Replay cohort manifest with date range, regions, channels, risk slices
2	Rebuild feature values using event log and `available_time <= decision_time`	Offline feature table with contract version and computation hash
3	Compare rebuilt features against stored decision-time feature snapshot	Parity report by feature, entity, channel, timestamp bucket
4	Run current candidate model and rules in shadow mode on historical requests	Shadow decision table with score, reason codes and proposed action
5	Compare candidate decisions to historical outcomes and human overrides	False positive / false negative / step-up impact by slice
6	Inject freshness degradation and missing-feature scenarios	Resilience report showing fallback decisions and customer impact
7	Replay known fraud incidents and near misses	Incident replay memo with missed signal, new feature contribution and residual risk
8	Produce release recommendation	Pilot / limited release / no-go memo with risk acceptance owner

Example replay pass criteria:

Metric	Threshold
Critical point-in-time violation	0
Critical feature leakage	0
Feature parity for top 20 high-impact features	>= 99.5%
Stored decision replay reconstruction	>= 99% exact or documented deterministic tolerance
High-risk stale fallback correctness	100% routes to step-up, manual review or deny according to policy
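
These pass criteria can be checked mechanically at the end of a replay run. The sketch below encodes the five thresholds from the table; the metric key names and `evaluate_replay` are hypothetical, and a missing metric is treated as a failure.

```python
from __future__ import annotations

import operator

# Thresholds from the pass-criteria table; the metric key names are assumed.
CRITERIA = {
    "critical_pit_violations": (operator.eq, 0),
    "critical_feature_leakage": (operator.eq, 0),
    "top20_feature_parity": (operator.ge, 0.995),
    "decision_reconstruction": (operator.ge, 0.99),
    "stale_fallback_correctness": (operator.eq, 1.0),
}


def evaluate_replay(metrics: dict) -> list[str]:
    """Return the names of failed criteria; an empty list means the replay passes."""
    failures = []
    for name, (compare, threshold) in CRITERIA.items():
        if name not in metrics or not compare(metrics[name], threshold):
            failures.append(name)
    return failures


passing_run = {
    "critical_pit_violations": 0,
    "critical_feature_leakage": 0,
    "top20_feature_parity": 0.997,
    "decision_reconstruction": 0.992,
    "stale_fallback_correctness": 1.0,
}
failing_run = dict(passing_run, top20_feature_parity=0.990)
```

Listing the failed criteria by name, rather than returning a single boolean, gives the release memo something concrete to cite.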

10.6 Monitoring Checklist

Monitor	Signal	Slice
Feature freshness	feature_age p50/p95/p99, stale rate, online TTL expiration	feature group, region, channel, entity type
Feature quality	null rate, range violation, enum drift, negative value, outlier rate	new vs existing customer, product, merchant category
Online/offline skew	golden sample parity, batch vs stream delta	feature, window, source system
Serving reliability	feature server latency, timeout, error rate, cache hit rate	API client, region, risk tier
Decision quality	approve/block/step-up/manual-review rate	channel, merchant, customer segment, protected slice where legally appropriate
Model drift	score distribution, calibration, PSI/CSI, reason-code shift	model version, feature group
Business outcome	confirmed fraud, chargeback, false positive complaint, approval conversion	cohort, decision action
Override	human override rate, override reason, override outcome	reviewer team, scenario
Governance	contract violations, unapproved feature use, stale owner review	feature owner, platform team
Audit health	missing trace field, replay failure, version lookup failure	decision service, model, rules engine
Incident	SLO burn, kill switch activation, fallback count	feature group, workflow, tenant

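10.7 Governance Review Form

Review area	Question	Payment fraud example answer	Decision
Business purpose	Which explicit decision action does this feature or model serve?	Authorization risk blocking and step-up routing	Approved
Customer impact	What customer impact could it cause?	Wrongly declined payments, extra authentication, transaction delay	Approved with FP monitoring
Data source	Is the source authoritative, with an owner?	Payment switch event stream, owner Payment Platform	Approved
Time semantics	Are event_time, available_time, and decision_time defined?	All three go into the contract and audit trace	Approved
Leakage	Does it include future, post-origination, manual-conclusion, or target-variable data?	Excludes chargeback label and manual review result	Approved
Privacy	Does it involve PII/PCI/sensitive financial behavior?	Uses tokenized card_id; no PAN in feature store	Approved with DLP evidence
Fairness	Could it adversely affect specific groups?	Monitor wrongful blocks for new, cross-border, and thin-file customers	Approved with monthly slice review
Explainability	Can it produce a reason code or a human explanation?	velocity, geo jump, merchant risk as reason factors	Approved
Freshness	Does the SLO match the business action?	99.9% <= 2s; stale routes to step-up	Approved
Replay	Can the decision be reproduced in a dispute?	feature snapshot + event replay + model/rule version	Approved
Operations	Who responds to SLO incidents?	Feature platform on-call + Payment Risk on-call	Approved
Scope	Is the release scope limited?	Card-not-present US region controlled release	Approved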

10.8 Decision Audit Schema

decision_event:
  decision_id: "payauth-2026-06-29-00048192"
  use_case: "payment_fraud_authorization"
  decision_time_utc: "2026-06-29T14:05:31.830Z"
  request_context:
    channel: "card_not_present"
    amount_currency: "USD"
    amount_value: 284.90
    merchant_category: "electronics"
  entities:
    card_id_hash: "card_hash_8f10"
    customer_id_hash: "cust_hash_91ab"
    merchant_id_hash: "m_hash_3e52"
    device_id_hash: "dev_hash_77cc"
  features:
    - name: "card_decline_count_5m"
      value: 3
      feature_time_utc: "2026-06-29T14:05:30.510Z"
      feature_age_ms: 1320
      contract_version: "v4"
    - name: "device_distinct_customer_count_30m"
      value: 5
      feature_time_utc: "2026-06-29T14:05:29.980Z"
      feature_age_ms: 1850
      contract_version: "v2"
  model:
    model_name: "payment_fraud_rt"
    model_version: "2026-06-20-champion"
    score: 0.87
    threshold_band: "step_up"
  rules:
    rule_policy_version: "fraud-policy-2026-06-15"
    triggered_rules:
      - "velocity_high"
      - "new_device_high_amount"
  decision:
    action: "step_up_authentication"
    reason_codes:
      - "recent_decline_velocity"
      - "new_device_pattern"
    fallback_used: false
  audit:
    trace_id: "trace_42cb6"
    feature_server_latency_ms: 14
    model_latency_ms: 27
    policy_latency_ms: 5
    replayable: true
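
An audit record is only useful if its fields agree with each other. As a minimal sketch, the check below recomputes `feature_age_ms` from `decision_time_utc` and `feature_time_utc` for the two features in the example event. `parse_utc`, `feature_age_ms`, and the inline `event` dict (a subset of the schema above) are illustrative helpers, not part of any audit tooling.

```python
from datetime import datetime, timedelta


def parse_utc(ts: str) -> datetime:
    # Replace the trailing "Z" so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def feature_age_ms(decision_time_utc: str, feature_time_utc: str) -> int:
    delta = parse_utc(decision_time_utc) - parse_utc(feature_time_utc)
    return delta // timedelta(milliseconds=1)  # integer milliseconds


# Subset of the decision_event example above.
event = {
    "decision_time_utc": "2026-06-29T14:05:31.830Z",
    "features": [
        {"name": "card_decline_count_5m",
         "feature_time_utc": "2026-06-29T14:05:30.510Z", "feature_age_ms": 1320},
        {"name": "device_distinct_customer_count_30m",
         "feature_time_utc": "2026-06-29T14:05:29.980Z", "feature_age_ms": 1850},
    ],
}

# Names of features whose stored age disagrees with the timestamps.
inconsistent = [
    f["name"] for f in event["features"]
    if feature_age_ms(event["decision_time_utc"], f["feature_time_utc"]) != f["feature_age_ms"]
]
```

Both stored ages match the timestamps in the example, and both stay under the 2-second freshness threshold used for this feature group.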

11. 30 天训练计划

Day	主题	任务	产出
1	Use case framing	选定支付欺诈、信贷预审批、KYC 或 Agent 工具风险中的一个场景, 写 decision point 和 customer impact	`decision-use-case-brief.md`
2	Decision taxonomy	定义 allow / block / step-up / manual review / fallback 的业务含义	`decision-action-taxonomy.md`
3	Entity model	列出 customer、account、card、merchant、device、application、session 等实体关系	`entity-model.md`
4	Time semantics	定义 event_time、available_time、decision_time、effective_time、expiry_time	`time-semantics-note.md`
5	Source inventory	盘点事件源、批量源、政策源、标签源、人工审核源	`source-inventory.md`
6	Feature candidate review	为 20 个候选特征标注用途、风险、泄漏可能性、freshness 需求	`feature-candidate-review.md`
7	Leakage review	识别 future、post-decision、target、operational、feedback leakage	`leakage-review.md`
8	Feature contract	写 3 个高影响特征 contract, 覆盖 entity/time/source/freshness/audit	`feature-contract-pack.md`
9	Architecture diagram	画事件、streaming、offline/online store、decisioning、audit、monitoring 架构	`realtime-decision-architecture.md`
10	Offline retrieval	设计 point-in-time training dataset 构造规则	`pit-training-dataset-spec.md`
11	Online serving	设计 feature server API、timeout、default、auth、trace 字段	`online-serving-spec.md`
12	Streaming design	定义 rolling windows、watermark、late event policy、TTL	`streaming-feature-design.md`
13	Freshness SLO	为 feature group 写 SLO、error budget、degraded mode、recovery	`freshness-error-budget.md`
14	Decision orchestration	定义模型、规则、policy、fallback、人工升级的组合逻辑	`decision-orchestration-spec.md`
15	Audit schema	写 decision event schema, 包括 feature snapshot、model/rule/policy version	`decision-audit-schema.md`
16	Replay cohort	选择历史样本切片: 时间、渠道、客户、商户、风险等级	`replay-cohort-manifest.md`
17	Replay validation	设计离线重算、线上快照对比、shadow decision 对比	`replay-validation-plan.md`
18	Parity tests	设计 batch vs stream、offline vs online、default policy 一致性测试	`parity-test-pack.md`
19	Monitoring dashboard	定义 freshness、quality、skew、latency、decision、override、audit 指标	`monitoring-metric-pack.md`
20	Incident workflow	写 stale feature、online store outage、leakage discovery、bad threshold 的响应流程	`decisioning-incident-runbook.md`
21	Governance review	填写治理评审表, 明确 PM、Data、Risk、Compliance、Architect 责任	`governance-review-record.md`
22	Release gate	写 prototype、shadow、pilot、production 四级门禁	`release-gate-spec.md`
23	Fallback design	针对 feature timeout、model timeout、rule engine failure 设计降级	`fallback-and-resilience-plan.md`
24	Fairness and slice review	为客户、渠道、地区、薄档、新客户等 slice 设计监控	`slice-impact-review.md`
25	Reason codes	定义模型分数、规则触发、客户沟通、内部解释之间的映射	`reason-code-mapping.md`
26	Champion/challenger	设计 shadow model、threshold experiment、risk appetite 评估	`champion-challenger-plan.md`
27	Platform roadmap	区分场景级实现、共享 feature platform、企业 decisioning platform	`platform-roadmap.md`
28	Architecture ADR	写是否采用 Feast-style feature store、streaming platform、rules engine 的 ADR	`feature-platform-adr.md`
29	Portfolio case	整理 problem、architecture、contracts、SLO、replay、governance、business impact	`portfolio-case-study.md`
30	Interview pack	准备 30 秒、2 分钟、CTO、PM、Risk、Data Product 深挖回答	`interview-answer-pack.md`

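30-day completion criteria:

Can draw the end-to-end architecture of a real-time feature and decisioning platform.
Can explain how entity/time/available_time affect point-in-time correctness.
Can write a feature contract and distinguish allowed use from disallowed use.
Can design a freshness SLO, error budget, degraded mode, and recovery condition.
Can identify feature leakage and training-serving skew.
Can validate online decisions with replay rather than looking only at offline AUC.
Can present a feature store launch as a product, architecture, risk, audit, and operations capability.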

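12. Interview Answers

12.1 30-Second Version

The key to a real-time feature platform is not putting features into Redis; it is keeping model training and online decisions consistent in entity, time, definition, freshness, and permissions. I would use a feature registry to manage feature contracts, an offline store for point-in-time training retrieval and replay, an online store for low-latency serving, and a streaming pipeline to produce freshness-sensitive features, then use a decision orchestrator to combine features, models, rules, policy, and human escalation. The release gate requires no feature leakage, offline/online parity on target, a freshness SLO backed by an error budget, replayable decisions, and verified monitoring and fallback.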

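12.2 2-Minute Version

I would split real-time decisioning into four layers.

The first layer is time and entity semantics. Every feature must define its entity key, event_time, available_time, decision_time, TTL, and allowed use. The training set may use only information that was already available at decision time; otherwise offline metrics are inflated by leakage of future information.

The second layer is the feature platform. The feature registry manages owner, contract, version, lineage, freshness SLO, and permissions; the offline store handles point-in-time retrieval, backfill, batch scoring, and replay; the online store handles low-latency lookup; the feature server handles authentication, rate limiting, trace, and returning feature age.

The third layer is decisioning. When a real-time request arrives, the decision orchestrator fetches online features, calls the model service, executes rules and policy, and outputs allow, block, step-up, manual review, or fallback. Payment fraud emphasizes millisecond-level latency and velocity features; credit pre-approval emphasizes leakage control, fairness, and reason codes; KYC emphasizes document evidence, screening-list versions, and the boundary of human review; Agent tool risk emphasizes session state, DLP, approval, and audit.

The fourth layer is governance and operations. Before launch, run offline/online parity, leakage review, freshness load test, historical replay, shadow decision, and governance review. After launch, monitor freshness, null, drift, skew, latency, decision outcome, override, and audit replay health. When a high-risk feature exceeds its error budget, the system should not approve silently; it should move to step-up, manual review, or rule-based degradation.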
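The error-budget rule in the fourth layer can be made concrete with a little arithmetic. The sketch below is illustrative only: the 99.5% target, the 5-second freshness threshold, and the names `FreshnessSLO`, `budget_remaining`, and `decide_mode` are assumptions made for this example, not part of any specific feature store API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical freshness SLO for one feature group."""
    max_age_seconds: float   # a read counts as fresh if feature age <= this
    target: float            # required fraction of fresh reads, e.g. 0.995


def budget_remaining(slo: FreshnessSLO, ages_seconds: list[float]) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    total = len(ages_seconds)
    if total == 0:
        return 1.0
    stale = sum(1 for age in ages_seconds if age > slo.max_age_seconds)
    allowed_stale = (1.0 - slo.target) * total
    if allowed_stale == 0:
        return 1.0 if stale == 0 else float("-inf")
    return 1.0 - stale / allowed_stale


def decide_mode(remaining: float, high_risk: bool) -> str:
    """Map budget state to an operating mode; never approve silently once overspent."""
    if remaining > 0:
        return "normal"
    return "step_up" if high_risk else "rule_fallback"


slo = FreshnessSLO(max_age_seconds=5.0, target=0.995)
ages = [1.0] * 996 + [30.0] * 4           # 4 stale reads out of 1000
remaining = budget_remaining(slo, ages)    # about 5 stale reads allowed, 4 spent
print(round(remaining, 3), decide_mode(remaining, high_risk=True))
```

In practice the window length, the burn-rate alert thresholds, and the set of degraded actions would come from the feature group's own SLO document rather than from constants in code.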

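12.3 CTO Deep Dive

Q: What is the essential difference between a feature store and an ordinary data warehouse?

A: A data warehouse serves analytics and batch processing; a feature store serves reusable features for training and serving. The difference is that a feature store must manage entity/time semantics, point-in-time retrieval, online serving, offline/online consistency, freshness, feature contracts, and model serving integration. In real-time decisioning it must also return feature age, contract version, and trace to support audit and replay.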

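Q: How do you avoid training-serving skew?

A: I treat skew as a release gate rather than relying on verbal agreement. Concretely: the feature contract is the single source of the definition, so batch and stream share the same semantics; the training set uses available_time for the point-in-time join; online serving returns feature timestamp and age; golden entity/time samples are used to compare offline recomputation, stream output, and online snapshot; and the default/null policy is also part of the contract. If any critical feature misses its parity target, it does not go to production.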
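A parity gate over golden samples can be sketched in a few lines. Everything concrete here is an assumption for illustration: the `GOLDEN` sample values, the default of `0.0` for missing values, the tolerance, and the 0.99 gate threshold would all come from the feature contract and release gate spec in a real setup.

```python
import math

# Hypothetical golden samples: (entity_id, as_of_time) -> feature value per path.
# None stands for "missing"; the contract's default policy is applied before comparing.
GOLDEN = {
    ("card_1", "2024-01-01T10:00:00Z"): {"offline": 3.0, "stream": 3.0, "online": 3.0},
    ("card_2", "2024-01-01T10:00:00Z"): {"offline": 120.5, "stream": 120.5, "online": 118.0},
    ("card_3", "2024-01-01T10:00:00Z"): {"offline": None, "stream": None, "online": 0.0},
}

DEFAULT_VALUE = 0.0      # assumed null/default policy from the feature contract
REL_TOLERANCE = 1e-6     # assumed tolerance for floating point drift
MIN_PARITY = 0.99        # assumed release gate threshold


def apply_default(value):
    """Apply the same default policy to every path before comparison."""
    return DEFAULT_VALUE if value is None else value


def parity_report(golden):
    """Compare stream and online values against the offline recomputation."""
    mismatches = []
    for key, paths in golden.items():
        values = {name: apply_default(v) for name, v in paths.items()}
        reference = values["offline"]
        for name in ("stream", "online"):
            if not math.isclose(values[name], reference,
                                rel_tol=REL_TOLERANCE, abs_tol=1e-9):
                mismatches.append((key, name, reference, values[name]))
    mismatched_keys = {m[0] for m in mismatches}
    parity = 1.0 - len(mismatched_keys) / len(golden)
    return parity, mismatches


parity, mismatches = parity_report(GOLDEN)
gate_passed = parity >= MIN_PARITY
print(f"parity={parity:.3f} gate_passed={gate_passed}")
for key, path, expected, actual in mismatches:
    print(f"  {key} {path}: expected {expected}, got {actual}")
```

With these sample values only `card_2` disagrees on the online path, so the gate fails. Note that `card_3` passes only because the default policy is applied identically on every path, which is why the default/null policy belongs in the contract.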

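Q: What if real-time feature latency misses its target?

A: First separate source lag, pipeline lag, online materialization lag, and serving latency. On the product side there must be a degraded mode: step-up for high-risk payment transactions, pausing auto-pass for KYC, and require approval for high-risk Agent tools. On the architecture side, set up timeout, TTL, cache, asynchronous compensation, a fallback feature group, and a kill switch. A stale feature must never turn into a silent approve.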
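The "no silent approve" rule can be expressed as a small guard in front of the model path. This is a minimal sketch under stated assumptions: `FeatureRead`, `DEGRADED_ACTION`, `is_usable`, and `decide` are hypothetical names, and the action strings simply mirror the examples in the answer above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical degraded actions per scenario, following the examples in the text.
DEGRADED_ACTION = {
    "payment_high_risk": "step_up",
    "kyc": "pause_auto_pass",
    "agent_high_risk_tool": "require_approval",
}


@dataclass(frozen=True)
class FeatureRead:
    """Result of one online feature lookup (illustrative shape)."""
    value: Optional[float]         # None when the lookup timed out or the key is missing
    age_seconds: Optional[float]   # age of the value as reported by the feature server
    timed_out: bool = False


def is_usable(read: FeatureRead, ttl_seconds: float) -> bool:
    """A read is usable only if it returned in time and is within its TTL."""
    if read.timed_out or read.value is None or read.age_seconds is None:
        return False
    return read.age_seconds <= ttl_seconds


def decide(scenario: str, read: FeatureRead, ttl_seconds: float,
           kill_switch: bool = False) -> str:
    """Return 'model_path' when the feature is usable, else an explicit degraded action."""
    if kill_switch or not is_usable(read, ttl_seconds):
        # Unknown scenarios default to manual review rather than approval.
        return DEGRADED_ACTION.get(scenario, "manual_review")
    return "model_path"


fresh = FeatureRead(value=7.0, age_seconds=2.0)
stale = FeatureRead(value=7.0, age_seconds=600.0)
timeout = FeatureRead(value=None, age_seconds=None, timed_out=True)

print(decide("payment_high_risk", fresh, ttl_seconds=60))    # model_path
print(decide("payment_high_risk", stale, ttl_seconds=60))    # step_up
print(decide("kyc", timeout, ttl_seconds=60))                # pause_auto_pass
```

The design choice worth noting is that the fallback for an unrecognized scenario is a conservative action, so a configuration gap cannot become an approval.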

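Q: Why is replay hard?

A: Because replay is not rerunning the current code. It has to restore the event visibility, feature contract, model version, rule version, policy version, source version, and decision context as they were at the time. Late-arriving data, rule changes, entity merges, and source corrections all break reproducibility. High-risk scenarios should store a decision-time feature snapshot and also keep the event log to support recomputation and explanation of differences.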
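The effect of late-arriving data on replay is easy to show with a toy event log. The sketch below assumes a hypothetical velocity feature `txn_count_1h` and a made-up log and snapshot; it contrasts a replay that respects what was visible at decision time with a naive recomputation that uses everything known today.

```python
from datetime import datetime, timedelta


def ts(text: str) -> datetime:
    return datetime.fromisoformat(text)


# Hypothetical event log: each transaction has an event_time and the time it
# became visible to the feature pipeline (available_time).
EVENTS = [
    {"card": "c1", "event_time": ts("2024-01-01T09:10:00"),
     "available_time": ts("2024-01-01T09:10:02")},
    {"card": "c1", "event_time": ts("2024-01-01T09:40:00"),
     "available_time": ts("2024-01-01T09:40:01")},
    # Late-arriving event: happened before the decision, visible only afterwards.
    {"card": "c1", "event_time": ts("2024-01-01T09:55:00"),
     "available_time": ts("2024-01-01T10:05:00")},
]

# Hypothetical decision-time snapshot stored in the audit log.
SNAPSHOT = {"card": "c1", "decision_time": ts("2024-01-01T10:00:00"), "txn_count_1h": 2}


def txn_count_1h(events, card, decision_time, respect_visibility):
    """Count transactions in the trailing hour, optionally hiding late arrivals."""
    window_start = decision_time - timedelta(hours=1)
    count = 0
    for e in events:
        if e["card"] != card:
            continue
        if not (window_start < e["event_time"] <= decision_time):
            continue
        if respect_visibility and e["available_time"] > decision_time:
            continue  # not yet visible when the decision was made
        count += 1
    return count


as_of = txn_count_1h(EVENTS, "c1", SNAPSHOT["decision_time"], respect_visibility=True)
hindsight = txn_count_1h(EVENTS, "c1", SNAPSHOT["decision_time"], respect_visibility=False)

print("snapshot:", SNAPSHOT["txn_count_1h"])
print("replay as of decision time:", as_of)
print("recomputed with hindsight:", hindsight)
print("replay matches snapshot:", as_of == SNAPSHOT["txn_count_1h"])
```

Keeping both the snapshot and the event log is what makes the difference explainable: the snapshot proves what the decision saw, and the log shows which event arrived late.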

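12.4 PM Deep Dive

Q: How do you decide which real-time decisioning scenario to build in phase one?

A: I look at four dimensions: the value of decision timeliness, the cost of errors, data readiness, and replayability. Payment fraud has high timeliness value but also high latency and false-block risk; credit pre-approval has high governance and fairness requirements; KYC document decisions need evidence and human-review boundaries; Agent tool risk can start with approval for high-risk tools. Phase one suits a scenario with clear business value, an existing human backstop, sufficient historical events and labels, and the ability to run shadow replay.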

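Q: How do you define product success for a real-time feature platform?

A: Not only by model metrics. Platform metrics include time to onboard a new feature, feature reuse rate, contract coverage, freshness SLO attainment, offline/online parity, replay success rate, number of feature incidents, and release gate pass rate. Business metrics, by scenario, are fraud loss, false-block complaints, pre-approval conversion, KYC turnaround time, and interception of mistaken Agent tool authorizations. Governance metrics are unapproved feature use, leakage incidents, and time to gather audit evidence.

Q: How do you handle false positives and user experience?

A: Real-time decisions should not be limited to the binary allow/block. Payments can step-up, KYC can request-more-info, credit can invite-to-apply rather than commit to extending credit, and Agent tools can dry-run or require approval. The PM should translate error cost into an action gradient and monitor false positive by slice, override reason, complaints, and conversion loss.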

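12.5 Risk / Compliance Deep Dive

Q: How do you prove there is no feature leakage?

A: The evidence includes event_time and available_time in the feature contract, the point-in-time join in the training set generation logic, item-by-item conclusions from the leakage review on future, post-decision, target, operational, and feedback leakage, and the results of checking available_time <= decision_time in historical replay. High-risk features also need source lineage and proof that human-review fields were excluded.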
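The point-in-time join and the `available_time <= decision_time` check can be sketched with plain Python. The rows, column names, and the helpers `point_in_time_join` and `leakage_violations` are hypothetical; a production pipeline would typically do this in the offline store's query engine, but the rule is the same.

```python
from datetime import datetime


def ts(text: str) -> datetime:
    return datetime.fromisoformat(text)


# Hypothetical feature history; available_time is when the value could first
# have been read by a decision, which is later than event_time here.
FEATURE_ROWS = [
    {"customer": "u1", "value": 0.10,
     "event_time": ts("2024-03-01T00:00:00"), "available_time": ts("2024-03-01T06:00:00")},
    {"customer": "u1", "value": 0.35,
     "event_time": ts("2024-03-02T00:00:00"), "available_time": ts("2024-03-02T06:00:00")},
]

# Hypothetical label rows: one row per historical decision.
DECISIONS = [
    {"customer": "u1", "decision_time": ts("2024-03-02T03:00:00"), "label": 0},
    {"customer": "u1", "decision_time": ts("2024-03-02T09:00:00"), "label": 1},
]


def point_in_time_join(decisions, feature_rows):
    """Attach the latest feature value whose available_time <= decision_time."""
    training_rows = []
    for d in decisions:
        candidates = [
            f for f in feature_rows
            if f["customer"] == d["customer"]
            and f["available_time"] <= d["decision_time"]
        ]
        chosen = max(candidates, key=lambda f: f["available_time"], default=None)
        training_rows.append({**d, "feature": chosen})
    return training_rows


def leakage_violations(training_rows):
    """Return rows whose attached feature was not yet available at decision time."""
    return [
        r for r in training_rows
        if r["feature"] is not None
        and r["feature"]["available_time"] > r["decision_time"]
    ]


rows = point_in_time_join(DECISIONS, FEATURE_ROWS)
for r in rows:
    value = r["feature"]["value"] if r["feature"] else None
    print(r["decision_time"].isoformat(), "->", value)
print("violations:", len(leakage_violations(rows)))
```

The first decision illustrates why joining on event_time is not enough: the 0.35 value has an event_time before the decision but only became available three hours after it, so a join on event_time would have leaked it into training.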

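Q: A customer disputes a blocked payment. How do you explain it?

A: I retrieve the decision audit event and show the request context, entity, feature values, feature age, model version, rule version, policy version, triggered reason codes, fallback status, and downstream action at that time. The explanation should rest on the information available at the time, not on a later chargeback or a rerun with the current model.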
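One possible shape for such an audit event is sketched below. The field names, version strings, and the `explain` helper are assumptions chosen to mirror the items listed in the answer; they are not a standard schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FeatureSnapshot:
    """Feature value as served at decision time (illustrative shape)."""
    name: str
    value: float
    age_seconds: float
    contract_version: str


@dataclass(frozen=True)
class DecisionAuditEvent:
    """Illustrative audit record; field names are assumptions, not a standard schema."""
    decision_id: str
    decision_time: str
    entity: dict
    request_context: dict
    features: list
    model_version: str
    rule_version: str
    policy_version: str
    reason_codes: list
    fallback_used: bool
    action: str


def explain(event: DecisionAuditEvent) -> str:
    """Build an explanation only from what was recorded at decision time."""
    lines = [
        f"Decision {event.decision_id} at {event.decision_time}: {event.action}",
        f"Versions: model={event.model_version}, rules={event.rule_version}, "
        f"policy={event.policy_version}",
        f"Reason codes: {', '.join(event.reason_codes)}",
        f"Fallback used: {event.fallback_used}",
    ]
    for f in event.features:
        lines.append(
            f"  {f.name}={f.value} (age {f.age_seconds}s, contract {f.contract_version})"
        )
    return "\n".join(lines)


# Hypothetical event for a blocked card payment.
event = DecisionAuditEvent(
    decision_id="d-001",
    decision_time="2024-05-01T10:00:00Z",
    entity={"card_id": "c1"},
    request_context={"channel": "ecom", "amount": 950.0},
    features=[FeatureSnapshot("txn_count_1h", 9.0, 1.8, "v3")],
    model_version="fraud-gbm-12",
    rule_version="rules-2024-04",
    policy_version="policy-7",
    reason_codes=["HIGH_VELOCITY"],
    fallback_used=False,
    action="block",
)

print(explain(event))
print(json.dumps(asdict(event), indent=2))
```

Because `explain` reads only from the stored event, the explanation cannot drift when the model or rules change later.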

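Q: How does the AI RMF map onto this platform?

A: Govern covers owner, approval, risk appetite, and audit responsibility; Map covers scenario, customer impact, data sources, entity/time, and failure modes; Measure covers freshness, skew, leakage, latency, model quality, false blocks, override, and incident; Manage covers release gate, fallback, manual review, kill switch, rollback, and post-incident replay.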

12.6 Data Product Deep Dive

Q: Which fields of a feature contract matter most?

A: In high-risk real-time decisioning, the most critical are business purpose, allowed/disallowed use, entity, event_time, available_time, source, transformation, freshness SLO, null/default policy, TTL, owner, data classification, leakage control, online/offline serving semantics, and audit fields. Without these fields a feature is just a data column, not a governable data product.
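The field list above translates naturally into a typed record that review tooling can check. This is a sketch under stated assumptions: the `FeatureContract` class, its attribute names, and the example values are hypothetical and do not correspond to any particular registry's schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FeatureContract:
    """Illustrative contract; the field set mirrors the list above, names are assumptions."""
    name: str
    business_purpose: str
    allowed_use: tuple
    disallowed_use: tuple
    entity: str
    event_time_column: str
    available_time_column: str
    source: str
    transformation: str
    freshness_slo_seconds: int
    null_default_policy: str
    ttl_seconds: int
    owner: str
    data_classification: str
    leakage_control: str
    serving_semantics: str
    audit_fields: tuple


def missing_fields(contract):
    """Return names of empty fields so review can reject incomplete contracts."""
    return [f.name for f in fields(contract)
            if getattr(contract, f.name) in ("", (), None)]


def is_use_allowed(contract, use_case):
    """Disallowed use wins; anything not explicitly allowed is rejected."""
    if use_case in contract.disallowed_use:
        return False
    return use_case in contract.allowed_use


# Hypothetical contract for a payment fraud velocity feature.
contract = FeatureContract(
    name="card_txn_count_1h",
    business_purpose="payment fraud velocity signal",
    allowed_use=("payment_fraud",),
    disallowed_use=("credit_preapproval",),
    entity="card_id",
    event_time_column="event_time",
    available_time_column="available_time",
    source="payments.authorizations",
    transformation="count of authorizations in trailing 1h window",
    freshness_slo_seconds=5,
    null_default_policy="default 0, flagged as defaulted",
    ttl_seconds=3600,
    owner="fraud-data-product",
    data_classification="confidential",
    leakage_control="exclude post-decision events",
    serving_semantics="online and offline share one definition",
    audit_fields=("feature_age", "contract_version"),
)

print(missing_fields(contract))                        # []
print(is_use_allowed(contract, "payment_fraud"))       # True
print(is_use_allowed(contract, "credit_preapproval"))  # False
print(is_use_allowed(contract, "marketing"))           # False
```

Treating unlisted use cases as rejected is a deliberate default here: it forces a new consumer to go through contract review before depending on the feature.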

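Q: How do you manage feature deprecation?

A: A feature cannot simply be deleted. First mark it deprecated, forbid new models from depending on it, list replacement features, run a dependency scan, preserve historical replay capability, update model lineage, and notify the owner and risk reviewer. For features that once affected customer decisions, also keep the contract, source lineage, and versions until the audit retention period ends.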

13. Self-Check Checklist

Area	Check
Architecture	Does it include event ingestion, streaming features, registry, offline store, online store, feature server, decision orchestrator, rules/policy, audit/replay, and monitoring?
Entity/time	Are entity keys, event_time, available_time, decision_time, effective_time, and expiry_time defined?
Consistency	Are there checks for offline/online parity, batch/stream parity, and default/null policy consistency?
Leakage	Does it cover future, post-decision, target, operational, source, and feedback leakage?
Freshness	Are there an SLO, error budget, burn alert, degraded mode, and recovery condition?
Financial fit	Does it cover payment fraud, credit pre-approval, KYC, NBA, and customer service Agent tool risk?
Governance	Are owner, allowed use, PII, retention, risk review, and approval in place?
Release	Are point-in-time validation, load test, shadow mode, replay, and rollback in place?
Monitoring	Are freshness, quality, skew, latency, outcome, override, and audit health monitored?
Audit	Can the feature values, model, rules, policy, and decision rationale at the time be reproduced?

14. Final Takeaways

Real-time decisioning is a time-aware, governed, replayable decision architecture.
Feature store is the contract layer between historical learning and online action.

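Put another way:

The essence of a real-time feature platform is to place "how we trained in the past" and "how we decide right now" within the same system of entities, time, contracts, serving, monitoring, and audit.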