
AI Real-Time Feature Store / Decisioning Playbook

865AI_REAL_TIME_FEATURE_STORE_DECISIONING_PLAYBOOK.md

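Positioning: A handbook on real-time feature platforms and real-time decisioning architecture for AI Product Architects, AI PMs, Data Product, Risk Product and Platform Architects.

Core goal: Connect feature stores, streaming features, online/offline consistency, point-in-time correctness, freshness SLOs, replay audit and real-time decisioning in retail financial services into a capability you can design, review, ship and explain in an interview.

Core conclusion: Real-time AI decisioning is not "a model API plus a few real-time fields". It is a platform capability centered on entity/time semantics, feature contracts, low-latency online serving, historically consistent training sets, decision orchestration, monitoring and replay, and governance gates.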


Source Anchors

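These sources serve as learning anchors for establishing platform terminology, architectural boundaries and governance language. They do not constitute vendor selection advice, legal advice or regulatory advice.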

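| Source | Link | How this playbook uses it |
| --- | --- | --- |
| Feast Docs: Quickstart | https://docs.feast.dev/getting-started/quickstart | Understand the basic engineering shape of offline store, online store, materialization, push features and real-time inference |
| Feast Docs: Use Cases | https://docs.feast.dev/getting-started/use-cases | Align on scenario language such as risk scorecards, historical feature retrieval, feature monitoring and point-in-time training data |
| Feast GitHub | https://github.com/feast-dev/feast | Treat Feast as an open-source feature store reference implementation to understand the product boundaries of registry, feature server and offline/online serving |
| Feast Feature Server | https://docs.feast.dev/getting-started/components/feature-server | Reference for the online feature serving API, push/read paths and secure communication requirements in production |
| Uber Michelangelo | https://www.uber.com/us/en/blog/michelangelo-machine-learning-platform/ | Reference for how an end-to-end ML platform covers data, training, deployment, prediction and monitoring |
| Uber Palette Meta Store Journey | https://www.uber.com/us/en/blog/palette-meta-store-journey/ | Reference for how a large-scale feature store manages curated features, auto-generated pipelines and feature dispersal |
| Metaflow Docs | https://docs.metaflow.org/introduction/what-is-metaflow | Reference for production ML workflows, versioning, reproducibility and process governance from local to production |
| Metaflow Production Deployments | https://docs.metaflow.org/production/introduction | Reference for production deployment, event triggering, fresh results, cache, model serving and failure recovery |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Use the Govern / Map / Measure / Manage language to organize risk, monitoring and governance evidence for AI decision systems |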

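1. Positioning: What Capability This Playbook Adds

Many AI product documents describe real-time decisioning as:

  • Connect to the transaction stream.
  • Call the model for a score.
  • Block based on a threshold.
  • Record the result.

That only describes a demo. What is actually hard in a retail financial production system is:

| Challenge | Production question | Architectural implication |
| --- | --- | --- |
| Time semantics | Does a training sample see only the information known at decision time? | Requires point-in-time correctness, event_time, created_time and late-arriving data controls |
| Consistency | Do training features and online inference features share the same source and meaning? | Requires a feature registry, feature contracts and offline/online parity tests |
| Freshness | Can payment fraud or Agent tool-risk judgments use sufficiently fresh behavioral signals? | Requires freshness SLOs, streaming feature pipelines and online store TTL |
| Latency | The decision must complete before authorization, payment or a customer-service tool call | Requires low-latency feature serving, rule short-circuiting and a fallback strategy |
| Leakage | Does the model use fields that are known only in the future or generated only after the decision? | Requires leakage review, feature availability timestamps and replay validation |
| Audit | Can you reproduce why a decision was made when a regulator, the risk team or a customer disputes it? | Requires decision events, feature snapshots, model/rule/policy versions and replay |
| Governance | Does a new feature change fairness, privacy, decline reasons or customer impact? | Requires feature governance, release gates and risk sign-off |

In one sentence:

The product value of a real-time feature platform is not "storing more features". It is letting high-risk decisions happen at the right time, on the right entity, reading the right version, within latency and freshness constraints, in a way that is provable, replayable and governable.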


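2. Capability Map: From Batch Model to Real-Time Decisioning Platform

| Level | Typical approach | Main risk | Mature framing |
| --- | --- | --- | --- |
| Batch scoring | Nightly batch job produces risk scores | Risk signals go stale; instantaneous fraud cannot be blocked | Fits low-frequency, low-timeliness scenarios such as monthly marketing lists |
| Near-real-time scoring | Features or scores refreshed every few minutes | Window delay, backfill and duplicate-event handling are unclear | Fits credit pre-approval and customer next-best-action |
| Real-time feature serving | Read the latest entity features online | Online/offline skew, unstable freshness | Fits payment fraud, account takeover and customer-service tool risk |
| Real-time decisioning | Features, models, rules, policy and manual escalation orchestrated together | Unexplainable decisions, hard audits, false blocks that hurt customers | Fits customer entitlements, funds, KYC, credit and Agent tool-call risk |
| Closed-loop decision platform | Decisions, feedback, replay, monitoring and governance form a closed loop | Feedback contamination, policy drift, missing regulatory evidence | Production-grade AI platform for risk control, credit, payments and operational decisions |

Role perspectives:

| Role | Focus | Key deliverables |
| --- | --- | --- |
| AI Product Architect | Platform boundaries, capability reuse, business system integration, risk tiering | Reference architecture, ADR, rollout strategy |
| AI PM | Scenario priority, latency experience, false-positive cost, human fallback, metrics | Decision PRD, freshness SLO, release gate |
| Data Product | Feature owner, contract, quality, lineage, data productization | Feature contract, quality SLO, monitoring pack |
| Risk Product | Thresholds, rules, models, reason codes, override, risk appetite | Decision policy, champion/challenger, override analysis |
| Platform Architect | Offline/online store, streaming, serving, audit, reliability | C4, sequence, capacity, resilience plan |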

3. Real-Time Decisioning Reference Architecture

3.1 Architecture Diagram, Text Version

Event sources
  Payment authorization, login, device fingerprint, transactions, KYC upload, loan application, CRM interaction, Agent tool-call intent

Event ingestion layer
  API Gateway / Kafka / CDC / webhook / file drop
  Responsibilities: schema validation, idempotency key, event_time, source timestamp, PII classification

Streaming compute layer
  Flink / Spark Structured Streaming / Kafka Streams / managed streaming
  Responsibilities: window aggregation, watermark, late event handling, entity enrichment, feature value emission

Feature Platform
  Feature Registry
    feature definition, entity, owner, version, contract, permission, freshness SLO, lineage
  Offline Store
    historical features, point-in-time training retrieval, backfill, batch scoring
  Online Store
    low-latency lookup, TTL, latest feature values, serving freshness
  Feature Server
    standardized API, auth, rate limit, request trace, online feature retrieval

Decisioning Layer
  Decision Orchestrator
    collects request context, online features, model score, rules, policy, reason codes
  Model Serving
    fraud / credit / KYC / NBA / tool-risk models, versioned and observable
  Rules and Policy Engine
    hard blocks, risk appetite, regulatory rules, manual review routing, fallback
  Human Review Queue
    high-risk exceptions, borderline cases, disputes, adverse action review

Systems of Record
  payment switch, core banking, loan origination, KYC case manager, CRM, contact center, AI tool gateway

Observability and Governance
  feature freshness, quality, latency, skew, model drift, decision outcomes, override, audit replay, incident loop

3.2 Mermaid View

flowchart TB
  E[Events: payment, KYC, credit, CRM, agent tool intent] --> I[Event ingestion and schema validation]
  I --> S[Streaming feature pipelines]
  I --> L[Immutable event log]
  S --> R[Feature registry and contracts]
  S --> O[Online feature store]
  S --> F[Offline feature store]
  F --> T[Point-in-time training dataset]
  O --> FS[Feature server]
  FS --> D[Decision orchestrator]
  T --> M[Model training and validation]
  M --> MS[Model serving]
  MS --> D
  D --> P[Rules and policy engine]
  P --> A[Allow / block / step-up / manual review / fallback]
  A --> B[Business systems]
  D --> G[Decision audit log]
  R --> G
  L --> RP[Replay and simulation]
  G --> RP
  RP --> V[Release gate and governance review]
  Mon[Freshness, skew, drift, latency, error budget monitoring] --> V
  Mon --> Inc[Incident and rollback]

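3.3 Component Responsibility Boundaries

| Component | Responsible for | Not responsible for |
| --- | --- | --- |
| Event ingestion | Event structure, identity, idempotency, event_time, source metadata, data classification | Complex model decisions |
| Streaming feature pipeline | Window aggregation, entity enrichment, late event policy, feature emission | Deciding business blocking thresholds |
| Feature registry | Feature definitions, owner, version, contract, lineage, permission, SLO, purpose | Replacing the data warehouse or the model registry |
| Offline store | Historical features, training set construction, replay, batch scoring | Millisecond-level online decisions |
| Online store | Low-latency reads, TTL, latest values, serving freshness | Complex historical joins |
| Feature server | API, authentication, rate limiting, trace, online reads | Directly executing high-risk business actions |
| Decision orchestrator | Aggregating request, features, model, rules, policy, reason codes, fallback | Owning all feature definitions |
| Rules and policy engine | Hard rules, manual escalation, regulatory constraints, risk appetite | Training models |
| Audit and replay | Decision evidence, versions, snapshots, replay, simulation, dispute handling | Acting as a technical log only |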

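4. Entity / Time Semantics: The Underlying Contract of a Real-Time Feature Platform

4.1 An Entity Is More Than a Primary Key

Real-time decisioning in retail financial services usually involves several entities at once:

| Entity | Examples | Decision value |
| --- | --- | --- |
| customer_id | Retail customer, cardholder, borrower | Historical behavior, risk tier, lifecycle |
| account_id | Deposit account, credit card account, loan account | Account balance, repayment, transaction patterns |
| card_id / token_id | Physical card, virtual card, wallet token | Payment authorization, card fraud, device migration |
| merchant_id | Merchant, acquiring merchant, platform storefront | Merchant risk, MCC, chargeback rate |
| device_id | Device fingerprint, browser, mobile device | Account takeover, KYC document upload risk |
| application_id | Loan application, account opening application, KYC case | Decision context, document status, approval stage |
| agent_session_id | Customer-service Agent session, tool-call plan | Tool risk, privilege overreach, customer impact |

A mature design makes the following explicit:

  • Whether a feature is bound to a single entity or is a multi-entity join.
  • Who owns entity resolution, and with what confidence.
  • How to degrade when an online request is missing an entity.
  • How historical features are handled after an entity merge or split.
  • Whether high-risk entity relationships need a graph or link analysis as independent evidence.

4.2 Minimum Semantics of Time Fields

| Time field | Meaning | Consequence of misuse |
| --- | --- | --- |
| event_time | When the business event actually happened | Misaligned decision windows, training set leakage |
| ingestion_time | When the platform received the event | Network delay mistaken for business delay |
| created_time | When the feature value or source record was generated | Historical training sets use features generated only in the future |
| available_time | When the feature became available to the decision system | Data latency is ignored, causing point-in-time errors |
| decision_time | When the model or rule made the decision | Audits cannot reproduce the decision |
| effective_time | When a policy, rule or feature definition took effect | Old and new rules get mixed |
| expiry_time | When a feature value, rule or source expires | Expired risk signals get used |

Production-grade requirement:

Training sample at decision_time T
may only join feature values with available_time <= T.
If a source event has event_time <= T but available_time > T, it must not be used in training either.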
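
The rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the retrieval logic of any particular feature store: `FeatureValue`, `point_in_time_lookup`, `utc` and the sample timestamps are hypothetical names and data introduced here. The sketch filters by `available_time` and then picks the value with the latest `event_time`, which is one reasonable reading of "most recent available value".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class FeatureValue:
    event_time: datetime      # when the business event actually happened
    available_time: datetime  # when the value became readable by the decision system
    value: float


def point_in_time_lookup(history: List[FeatureValue], decision_time: datetime) -> Optional[FeatureValue]:
    """Return the newest value that was already available at decision_time, or None."""
    visible = [fv for fv in history if fv.available_time <= decision_time]
    return max(visible, key=lambda fv: fv.event_time, default=None)


def utc(text: str) -> datetime:
    """Parse an ISO timestamp and mark it as UTC (hypothetical helper)."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


history = [
    FeatureValue(utc("2026-06-29T14:00:00"), utc("2026-06-29T14:00:01"), 1.0),
    # event_time <= T but available_time > T: a late arrival that training must not see.
    FeatureValue(utc("2026-06-29T14:04:59"), utc("2026-06-29T14:05:40"), 2.0),
]
decision_time = utc("2026-06-29T14:05:31")
visible_value = point_in_time_lookup(history, decision_time)
print(visible_value.value if visible_value else None)  # prints 1.0
```

The second record is excluded even though its event happened before the decision, because the decision system could not have read it yet.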

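4.3 Point-in-Time Correctness

The core of point-in-time correctness is not a SQL trick. It is business provability:

| Checkpoint | What passing looks like | High-risk failure |
| --- | --- | --- |
| Label timing | The label occurs after the decision, with a clearly defined window | Chargeback outcomes flow back and contaminate fraud features |
| Feature availability | Training uses only features that were available at the time | Post-origination performance, manual review conclusions or final KYC status used as pre-origination features |
| Historical join | Each entity takes the most recent available value as of decision_time | Taking the latest snapshot directly |
| Late-arriving data | The backfill policy does not change what was visible at the historical decision point | Replay sees data that was not available online at the time |
| Rule/model version | Replay uses the version in effect at the time | Explaining past decisions with current rules |
| Source version | Policies, blocklists and feature definitions are resolved by validity period | Explaining a past block with a list updated later |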

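5. Online / Offline Feature Consistency

5.1 Consistency Goals

One of the core values of a feature store is that model training and online serving share feature definitions, lineage and quality constraints.

| Consistency dimension | Training side | Serving side | Verification |
| --- | --- | --- | --- |
| Definition | Feature view, SQL, UDF, window, filter conditions | Materialized / pushed values use the same contract | Definition hash, contract review |
| Entity | Historical entity dataframe | Online lookup entity keys | Entity completeness, join hit rate |
| Time | Point-in-time join | Latest available value at request time | Replay parity test |
| Transformation | Batch transform | Stream transform | Golden entity comparison |
| Defaults | Missing/null imputation | Online fallback/default | Null policy parity |
| Freshness | Historical backfill | Online TTL and freshness SLO | Freshness monitor |
| Permission | Training dataset access | Serving request access | Entitlement replay |

5.2 Training-Serving Skew Taxonomy

| Skew type | Typical symptom | Impact in retail financial services | Control |
| --- | --- | --- | --- |
| Compute skew | Batch SQL and streaming job logic differ | Fraud scores lose accuracy after launch | Single feature contract, shared test set, code generation or same-source transform |
| Freshness skew | Training assumes availability at time T; online is actually 2 minutes late | Missed payment blocks | available_time, freshness SLO, degraded decision policy |
| Missingness skew | Training samples are fully imputed; online has many missing values | New and thin-file customers are misjudged | Online missing policy, slice monitoring |
| Entity skew | Training keyed by customer_id, online keyed by account_id | Customer relationships mismatched | Entity mapping contract, join hit rate |
| Default skew | Training fills missing values with 0; online fills null or the previous value | Risk score shifts | Default value registry, contract test |
| Policy skew | Training set uses the old blocklist; online uses the new one | Results cannot be reproduced | Source versioning, policy effective_time |
| Timezone/clock skew | Systems use inconsistent time bases | Window aggregation errors | UTC normalization, clock drift monitor |

5.3 Feature Leakage Taxonomy

| Leakage type | Example | Why it is dangerous | Prevention |
| --- | --- | --- | --- |
| Future leakage | Using a chargeback result from T+7 days to predict payment fraud at time T | Offline metrics are inflated and fail online | Point-in-time retrieval, label cutoff |
| Post-decision leakage | Using a manual review conclusion as input to automated review | The model learns the human outcome rather than upstream signals | Feature availability review |
| Target leakage | A feature directly encodes the target variable | Falsely high AUC for credit, KYC and fraud models | Leakage audit, feature importance review |
| Operational leakage | Using "entered the manual queue" to predict high risk | The model replicates the bias of the old process | Process feature review |
| Source leakage | A data source field that exists only for bad samples | Sample selection bias | Source coverage and missingness slice |
| Feedback leakage | Using outcomes after a model block as labels for unblocked outcomes | The policy proves itself right | Exploration, reject inference, shadow labels |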

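6. Streaming Features and Freshness SLO

6.1 Streaming Feature Types

| Feature type | Example | Decision value |
| --- | --- | --- |
| Rolling count | Failed payments on the same card within 5 minutes | Payment fraud, account takeover |
| Rolling amount | Total cross-border transaction amount within 10 minutes | Authorization risk control, anti-money-laundering leads |
| Velocity | Device switches within 1 hour | Login risk, KYC document risk |
| Distinct count | Customers linked to the same device within 30 minutes | Device farms, synthetic identity |
| Ratio | Failed transaction rate within 24 hours | Merchant risk, payment stability |
| Time since last event | Time since the last successful KYC upload | KYC case prioritization |
| Sequence pattern | Login, phone number change and transfer initiation in quick succession | Account takeover |
| Agent action context | Number of customer data reads in the current session, risk of the tool about to be called | Customer-service Agent tool risk control |

6.2 Event-Time vs Processing-Time

| Choice | Fits when | Risk |
| --- | --- | --- |
| Event-time windows | The risk judgment depends on when the business event really happened | Needs a watermark and a late event policy |
| Processing-time windows | Only real-time handling after platform receipt matters | Network delay changes the meaning of the window |
| Hybrid | Online uses processing-time for low latency; replay and training use event-time + available_time | Online/offline differences must be clearly labeled |

Recommendations for financial scenarios:

  • Payment authorization risk control treats low latency as the first constraint, but training and replay must record event_time, ingestion_time and available_time.
  • Credit pre-approval can usually tolerate latency of seconds to minutes, but leakage of post-origination and manual results must be strictly controlled.
  • Agent tool-risk judgment usually needs in-session real-time state, so its windows are closer to processing-time, but audit must still retain the event sequence.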
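
To make the event-time option concrete, the following is a minimal in-memory sketch of a rolling count with a watermark. It is an assumption-laden illustration, not how Flink, Spark or Kafka Streams implement windows: `RollingDeclineCounter` is a hypothetical class, the watermark is a single global value (maximum event time seen minus the allowed lateness) rather than a per-partition one, and events that arrive behind the watermark are set aside for offline replay only.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class RollingDeclineCounter:
    """Event-time rolling count per card with a simple global watermark."""

    def __init__(self, window=timedelta(minutes=5), allowed_lateness=timedelta(seconds=30)):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.online_events = defaultdict(list)  # card_id -> event times accepted online
        self.offline_only = []                  # (card_id, event_time) too late for online
        self.max_event_time = None

    def add(self, card_id, event_time):
        """Accept the event online unless it is older than the watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness
        if event_time < watermark:
            self.offline_only.append((card_id, event_time))
            return False
        self.online_events[card_id].append(event_time)
        return True

    def count(self, card_id, as_of):
        """Count accepted events with event_time in (as_of - window, as_of]."""
        start = as_of - self.window
        return sum(1 for t in self.online_events.get(card_id, []) if start < t <= as_of)


t0 = datetime(2026, 6, 29, 14, 0, 0, tzinfo=timezone.utc)
counter = RollingDeclineCounter()
for offset in (0, 60, 120):
    counter.add("card_a", t0 + timedelta(seconds=offset))
accepted_slightly_late = counter.add("card_a", t0 + timedelta(seconds=100))  # 20 s behind
accepted_too_late = counter.add("card_a", t0 + timedelta(seconds=50))        # 70 s behind
```

The event that is 20 seconds late still updates the online value, while the one that is 70 seconds late is kept out of the online count. That gap between what online serving saw and what a later recomputation sees is exactly what `available_time` has to capture for replay.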

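6.3 Freshness SLO and Error Budget

Freshness is not a single metric. At a minimum it breaks down into:

| Metric | Definition | Example |
| --- | --- | --- |
| Source lag | Time from the source event occurring to the platform receiving it | Payment events p99 < 500 ms |
| Pipeline lag | Time from platform receipt to feature computation complete | Velocity feature p99 < 800 ms |
| Online materialization lag | Time from feature computation complete to readable in the online store | Online write p99 < 300 ms |
| Serving freshness | Age of the feature as read at decision time | card_decline_count_5m age p99 < 2 s |
| Decision latency | Time from request entry to returning allow/block/review | Payment authorization p99 < 120 ms |
| Staleness rate | Share of requests that exceed the freshness threshold | < 0.1% per day |

The error budget must be tied to a business action:

| Feature group | Freshness SLO | Error budget | Action when budget is exceeded |
| --- | --- | --- | --- |
| Payment fraud velocity | 99.9% requests age <= 2s | Daily stale requests <= 0.1% | Degrade to conservative rules mode; step-up for high-risk transactions |
| Credit pre-approval bureau enrichment | 99% records age <= 24h | Monthly stale applications <= 1% | Stop automatic pre-approval; route to manual review or re-pull |
| KYC document risk | 99% document signals age <= 5m | Daily stale cases <= 0.5% | Pause auto-pass; allow manual review only |
| Agent tool-risk session features | 99.5% session state age <= 1s | Daily stale tool calls <= 0.2% | High-risk tools require approval |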

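7. Real-Time Decisioning Scenarios in Retail Financial Services

7.1 Real-Time Payment Fraud Blocking

| Dimension | Design points |
| --- | --- |
| Decision point | After the authorization request enters the payment switch and before approve / decline / step-up / manual review is returned |
| Entities | card_id, token_id, customer_id, merchant_id, device_id, ip, account_id |
| Real-time features | 5-minute failure count, 10-minute amount velocity, customers per device, merchant risk velocity, geo jump, new payee flag |
| Offline features | Customer historical risk tier, merchant historical chargeback rate, account lifecycle, historical disputed transaction ratio |
| Latency budget | Total decision latency is usually managed in milliseconds; feature lookup must have a strict timeout |
| Fallback | On feature timeout, fall back to rules-only or step-up; high-risk requests are never approved by default |
| Audit | Request context, feature values, feature age, model version, rule version, decision, reason code |
| Risk tradeoff | A false negative is a fraud loss; a false positive costs customer experience, revenue and complaints |

Senior product judgment:

  • Do not equate "high model score" with a decline. Payment scenarios should design a multi-action policy: step-up, 3DS, limits, delayed release, manual review and so on.
  • Monitor the false-positive rate separately for new customers, thin-file customers and cross-border scenarios; otherwise the real-time model will treat missing data as risk.
  • Record decision arbitration when rules and the model conflict, because post-hoc disputes usually ask "why was this not blocked at the time, or why was it blocked".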
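
The fallback behavior described in this section can be sketched as a small decision function. This is a hedged illustration only: the score bands, the amount threshold and the names `FeatureRead` and `decide` are assumptions made up for the example, and the "rules-only" path is reduced to a single amount rule. The property it demonstrates is that a stale or missing feature never turns into a silent approval for a high-risk request.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FEATURE_AGE_MS = 2000   # assumed freshness threshold
STEP_UP_SCORE = 0.70        # assumed score bands, not taken from any real policy
DECLINE_SCORE = 0.95
HIGH_RISK_AMOUNT = 200.0    # assumed amount above which a request counts as high risk


@dataclass(frozen=True)
class FeatureRead:
    value: Optional[float]  # None when the lookup timed out or the key was missing
    age_ms: Optional[int]   # None when no timestamp came back


def decide(amount: float, read: FeatureRead, model_score: Optional[float]) -> dict:
    """Choose an action; degrade to a rules-only path when inputs are unusable."""
    unusable = read.value is None or read.age_ms is None or read.age_ms > MAX_FEATURE_AGE_MS
    if unusable or model_score is None:
        # Rules-only fallback: high-risk requests get step-up instead of default approve.
        action = "step_up_authentication" if amount >= HIGH_RISK_AMOUNT else "approve"
        return {"action": action, "fallback_used": True,
                "reason_codes": ["feature_or_model_unavailable"]}
    if model_score >= DECLINE_SCORE:
        action = "decline"
    elif model_score >= STEP_UP_SCORE:
        action = "step_up_authentication"
    else:
        action = "approve"
    return {"action": action, "fallback_used": False, "reason_codes": []}


fresh = decide(284.90, FeatureRead(value=3, age_ms=1320), model_score=0.87)
timed_out = decide(284.90, FeatureRead(value=None, age_ms=None), model_score=0.10)
```

Recording `fallback_used` in the result is what lets the audit trail distinguish a model-driven step-up from a degraded-mode step-up.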

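7.2 Credit Pre-Approval

| Dimension | Design points |
| --- | --- |
| Decision point | Before the customer browses a product, enters an application, receives a limit increase or receives marketing outreach |
| Entities | customer_id, account_id, application_id, household_id |
| Real-time features | Recent income deposits, recent delinquency, account balance changes, recent hard inquiries, application frequency |
| Offline features | Credit history, income stability, product holdings, repayment history, behavioral score |
| Freshness | Most features can be minute-level to day-level; high-risk credit bureau data or internal delinquency status must be clearly versioned |
| Decision output | eligible / not eligible / invite to apply / manual review / insufficient data |
| Governance | Fairness, decline reasons, adverse action, explainability, human review boundary |
| Leakage risk | Using post-origination performance, approval conclusions, manual notes or decline reasons as pre-origination features |

Senior product judgment:

  • Pre-approval is not a final credit decision. The PRD must specify customer-facing wording so that marketing eligibility is not written as a credit commitment.
  • The feature contract must state which features may be used for eligibility, which only for internal ranking, and which must not be used as an adverse action reason.
  • Replay validation must cover slices such as declined, thin-file, low-income, region and channel, not just overall KS/AUC.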

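7.3 KYC Document Decisioning

| Dimension | Design points |
| --- | --- |
| Decision point | Document upload, OCR, authenticity detection, list screening, case routing |
| Entities | customer_id, application_id, document_id, device_id, ip, beneficial_owner_id |
| Real-time features | Upload device change, document reuse, OCR confidence, liveness risk, multiple applications per device, sanctions screening status |
| Offline features | Customer historical KYC remediation, country/industry risk, entity structure complexity, historical missing-document rate |
| Decision output | auto-pass / request-more-info / enhanced due diligence / manual review / reject recommendation |
| Freshness | Document and list status need a traceable effective_time; sanctions/PEP source version must be auditable |
| Audit | Document version, OCR output, model score, rules, manual override, customer communication records |
| Leakage risk | Using the final manual KYC status to train the automatic triage model that runs at upload time |

Senior product judgment:

  • The safe boundary of KYC automation is usually triage and evidence gathering, not unconstrained final rejection.
  • For features related to sanctions, PEP and adverse media, source version and match confidence must enter the audit log.
  • Once document models, rules and manual review form a closed loop, guard against learning only historical human bias.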

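7.4 Customer Next-Best-Action

| Dimension | Design points |
| --- | --- |
| Decision point | App home page, customer-service session, marketing outreach, branch relationship manager workbench |
| Entities | customer_id, household_id, channel_id, session_id |
| Real-time features | Current session intent, recent transactions, complaint status, service failures, channel activity, recent outreach |
| Offline features | Customer value, product holdings, lifecycle, preferences, risk restrictions, consent status |
| Decision output | recommend / suppress / defer / service-first / human follow-up |
| Governance | Consent, fair treatment, suitability, vulnerability, complaint status, frequency capping |
| Freshness | Customer complaints, opt-out and risk restrictions must take effect in near real time |
| Monitoring | Uplift, complaint rate, opt-out rate, offer fatigue, protected slice |

Senior product judgment:

  • NBA is not a pure recommender system. Retail financial services must put suitability, consent, complaint, vulnerability and risk restriction into the policy layer.
  • Some real-time signals should trigger "do not sell, serve first", for example a payment failure that just happened or an escalated complaint.
  • The feature platform must support suppress features and policy features, not only propensity features.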

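7.5 Customer-Service Agent Tool-Call Risk Judgment

| Dimension | Design points |
| --- | --- |
| Decision point | Before the LLM Agent calls a read/write/external-send tool |
| Entities | agent_session_id, user_id, customer_id, case_id, tool_id, tenant_id |
| Real-time features | Current session sensitivity, number of customer fields already read, tool risk tier, prompt injection signal, DLP hit, approval history |
| Offline features | User role, historical permissions, tool catalog, customer risk, case type, policy version |
| Decision output | allow / redact_then_allow / dry_run / require_approval / deny / kill_switched |
| Freshness | Session state and kill switch must take effect within seconds |
| Audit | Tool proposal, arguments, features, policy decision, approver, tool result summary |
| Leakage risk | Treating a model suggestion as authorization, or letting instructions inside a tool result trigger the next tool |

Senior product judgment:

  • Here, "features" are not only traditional ML features; they are also policy decision context.
  • Agent risk judgment needs low-latency online features working together with a rules engine, and should not rely on prompt guardrails alone.
  • Any tool call that involves external sending, funds, customer entitlements or regulatory records must be replayable, with proof of why it was allowed or blocked at the time.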

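8. Product Decision: When You Need a Real-Time Feature Store

8.1 When You Do Not Need a Real-Time Feature Store

| Scenario | Better-fitting approach |
| --- | --- |
| Daily batch marketing list | Batch feature table + campaign rules |
| Low-risk internal reporting | Warehouse semantic layer + BI metrics |
| Few features and no reuse | Service-local cache + contract tests |
| Pure document Q&A | RAG source governance + retrieval eval |
| No clear decision action | Do decision discovery first; do not rush to build a real-time platform |

8.2 Trigger Signals for Platformization

| Trigger signal | What it means |
| --- | --- |
| Multiple models re-implement the same feature | Features need a registry, owner, reuse and quality management |
| Training and online feature logic frequently diverge | Offline/online consistency and parity tests are needed |
| Decisions are sensitive to feature freshness | Streaming features, freshness SLO and an online store are needed |
| High-risk decisions need audit replay | An immutable event log, feature snapshot and decision trace are needed |
| Features involve PII, credit, KYC or payment risk | Feature governance, permission, retention and risk review are needed |
| Multiple teams share entity and time semantics | An entity registry, feature contracts and a data product model are needed |

8.3 Feast / Build In-House / Commercial Platform Tradeoffs

| Option | Fits when | Risk |
| --- | --- | --- |
| Feast-style open-source feature store | The team has platform engineering capability and needs an open architecture with controllable integration | Governance, UI, SLO, permission and operations capabilities must be built in-house |
| In-house lightweight feature platform | The scenario is narrow, the architecture is simple, and the team needs fast control of the critical path | Easily grows into a hidden platform that lacks a registry and governance |
| Commercial feature platform | Multiple teams, multiple clouds and many governance requirements, with a wish to shorten the platform build cycle | Vendor lock-in and integration complexity; financial audit requirements remain an internal responsibility |
| Warehouse + online cache | Mostly batch features with low-frequency online reads | Point-in-time, streaming, parity and SLO need extra design |

The selection question is not "who has more features". It is:

Can you prove that training and online features mean the same thing?
Can you replay by decision_time?
Can you define and meet a freshness SLO?
Can you govern PII, permission, lineage, owner, purpose and risk?
Can you locate which layer failed during an incident: feature, model, rule or policy?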

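9. Governance Model

9.1 Feature Lifecycle

| Stage | Key actions | Evidence |
| --- | --- | --- |
| Propose | State the business purpose, entity, time, source, risk and expected decision impact | Feature proposal |
| Contract | Define computation logic, freshness, quality, permission, retention, allowed use | Feature contract |
| Build | Implement batch/stream transform, tests, lineage, registry entry | CI result, data quality report |
| Validate | Offline/online parity, leakage review, replay, slice impact | Validation report |
| Approve | PM, Data owner, Risk, Compliance and Architect sign off | Governance review |
| Serve | Materialize/push to the online store and connect to the feature server | Serving readiness |
| Monitor | Freshness, quality, latency, skew, drift, decision outcome | Dashboard and alerts |
| Deprecate | Mark the replacement feature, stop new dependencies, retain audit replay | Deprecation record |

9.2 NIST AI RMF Mapping

| AI RMF function | Where it lands in a real-time decisioning platform |
| --- | --- |
| Govern | Feature ownership, decision authority, risk appetite, approval workflow, audit responsibility |
| Map | Scenario, customer impact, data sources, entity/time, decision actions, failure modes, affected populations |
| Measure | Freshness, skew, leakage, latency, model quality, false positive/negative, override, incident |
| Manage | Release gate, fallback, manual review, kill switch, rollback, feature deprecation, post-incident replay |

9.3 Governance Principles for High-Risk Features

| Principle | Explanation |
| --- | --- |
| Purpose-bound | A feature must declare its allowed use, for example fraud detection, credit eligibility, customer service risk |
| Time-aware | Every feature must have event_time, available_time or explicit snapshot semantics |
| Owner-backed | Business owner, data owner, technical owner and risk owner must all be present |
| Explainable enough | Features that affect customer entitlements must be able to produce a reason code or enter the explanation chain |
| Privacy-aware | PII, PCI, sensitive identity, credit, KYC and AML data must be tagged and minimized |
| Replayable | Feature values used by high-risk decisions must be reproducible or provable in audit |
| Degradable | When a feature is unavailable there is an explicit fallback; the model is not left to guess |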

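10. Ready-to-Use Deliverable Templates

The templates below are all filled with concrete examples. When copying them into a project you may replace business names and thresholds, but do not delete the owner, time, freshness, audit and risk fields.

10.1 Real-Time Decisioning Architecture Diagram, Text Version

Use case:
  Real-time payment fraud blocking

Decision SLA:
  Total authorization decision p99 <= 120ms; feature lookup p99 <= 20ms; model scoring p99 <= 35ms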

Event sources:
  payment_authorization, card_decline, device_fingerprint, merchant_profile, customer_profile, chargeback_case

Entities:
  card_id, token_id, customer_id, merchant_id, device_id, account_id

Streaming features:
  card_decline_count_5m, card_amount_sum_10m, device_distinct_customer_count_30m,
  merchant_high_risk_auth_count_10m, geo_velocity_score_1h

Offline features:
  customer_lifetime_dispute_rate_180d, merchant_chargeback_rate_90d,
  account_age_days, customer_risk_segment, prior_fraud_case_count_365d

Feature platform:
  Registry stores feature contracts, owners, versions, allowed uses, freshness SLO, lineage.
  Offline store supports point-in-time retrieval for training and replay.
  Online store serves latest features with TTL and feature age.
  Feature server enforces auth, timeout, trace id and rate limit.

Decision orchestration:
  Request context + online features + fraud model score + hard rules + policy constraints.
  Output: approve, decline, step_up_authentication, manual_review.
  Fallback: if velocity features stale, use conservative rule set and step-up for high-risk slices.

Audit and replay:
  Store request id, entity ids, feature values, feature age, model version, rule version,
  policy version, decision output, reason codes, downstream action, feedback label.

10.2 Feature Contract: Payment Fraud Velocity Feature

| Field | Example |
| --- | --- |
| Feature name | card_decline_count_5m |
| Business purpose | Real-time fraud blocking and step-up routing for payment authorization |
| Allowed use | fraud detection, payment authorization risk, fraud model training, audit replay |
| Disallowed use | credit eligibility, marketing targeting, customer value ranking |
| Primary entity | card_id |
| Secondary entity | token_id, customer_id |
| Source events | payment_authorization with status declined |
| Event time | payment_authorization.event_time_utc |
| Available time | max of ingestion timestamp and streaming output timestamp |
| Window | Rolling 5 minutes, event-time based, watermark 30 seconds |
| Aggregation | Count declined authorizations for same card_id excluding system test transactions |
| Late event policy | Late events within 30 seconds update online value; later events only affect offline replay |
| Null policy | Missing value means no event observed; default 0 with is_missing=false |
| TTL | 10 minutes in online store |
| Freshness SLO | 99.9% online reads feature age <= 2 seconds |
| Offline retrieval | Point-in-time join by decision_time, using available_time <= decision_time |
| Online serving | Feature server returns value, feature timestamp, feature age, contract version |
| Data classification | Customer financial behavior; confidential; no external sharing |
| Owner | Payment Risk Data Product Owner |
| Risk owner | Fraud Strategy Lead |
| Technical owner | Real-Time Feature Platform Team |
| Quality tests | non-negative integer, 99.9% not null, drift alert on 7-day percentile shift > 30% |
| Parity tests | 1,000 golden card_id/time samples compare streaming output vs offline recomputation |
| Leakage control | Excludes chargeback label, manual review result, post-decision dispute status |
| Audit fields | feature_name, value, event_time, available_time, contract_version, source_event_count |
| Review cadence | Monthly risk review; immediate review after fraud incident or source schema change |
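
The "Parity tests" row of this contract can be approximated with a small comparison routine. The sketch below is an assumption, not a prescribed test harness: `GoldenSample`, `parity_report` and the synthetic 1,000-sample dataset are invented for illustration, and the 99.5% threshold mirrors the parity figure this playbook uses in its release gate examples.

```python
from dataclasses import dataclass
from typing import List, Tuple

PARITY_GATE = 0.995  # parity threshold used in this playbook's gate examples


@dataclass(frozen=True)
class GoldenSample:
    card_id: str
    decision_time_utc: str
    streaming_value: int  # value emitted by the streaming pipeline
    offline_value: int    # value recomputed offline for the same card and time


def parity_report(samples: List[GoldenSample]) -> Tuple[float, List[GoldenSample]]:
    """Return the share of exactly matching samples and the list of mismatches."""
    if not samples:
        raise ValueError("parity needs at least one golden sample")
    mismatches = [s for s in samples if s.streaming_value != s.offline_value]
    parity = (len(samples) - len(mismatches)) / len(samples)
    return parity, mismatches


# Synthetic data: 996 matching samples and 4 where the stream missed a late event.
samples = [GoldenSample(f"card_{i}", "2026-06-29T14:05:31Z", 3, 3) for i in range(996)]
samples += [GoldenSample(f"card_late_{i}", "2026-06-29T14:05:31Z", 2, 3) for i in range(4)]

parity, mismatches = parity_report(samples)
gate_passed = parity >= PARITY_GATE
```

Returning the mismatching samples, not just the rate, matters because the gate also expects every difference to be explained, for example by the late event policy.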

10.3 Freshness and Error Budget

| Item | Definition |
| --- | --- |
| Feature group | Payment fraud velocity features |
| Business impact | Stale features may miss rapid fraud bursts or cause overly conservative step-up |
| SLO window | Daily, calculated by decision requests |
| Primary SLO | 99.9% of online feature reads have feature_age <= 2 seconds |
| Secondary SLO | Feature server p99 latency <= 20ms; online store read error rate <= 0.05% |
| Error budget | Stale feature reads > 2 seconds must stay <= 0.1% per day |
| Burn alert | 2-hour rolling stale rate >= 0.05% triggers warning; >= 0.1% triggers incident |
| Degraded mode | High-risk transactions route to step-up; low-risk transactions use rules-only score |
| Stop condition | Stale rate >= 0.5% for 15 minutes disables model path for affected region |
| Recovery condition | Freshness SLO met for 30 minutes, replay confirms no material missed fraud cluster |
| Owner | Real-Time Feature Platform on-call plus Payment Risk on-call |
| Audit evidence | Alert id, affected features, affected entity count, decision fallback count, replay result |
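
The burn alert and stop condition in this template translate into a few comparisons. The following is a rough sketch under assumptions: the function names are hypothetical, the 2-hour rolling rate is passed in already aggregated, and "for 15 minutes" is interpreted as every per-minute stale rate in the last 15 minutes being at or above the stop threshold.

```python
from typing import Iterable, List

MAX_AGE_MS = 2000      # reads older than this count as stale
WARNING_RATE = 0.0005  # 0.05% rolling stale rate
INCIDENT_RATE = 0.001  # 0.1% rolling stale rate
STOP_RATE = 0.005      # 0.5% sustained stale rate


def stale_rate(feature_ages_ms: Iterable[int]) -> float:
    """Share of feature reads whose age exceeds the freshness threshold."""
    ages = list(feature_ages_ms)
    if not ages:
        raise ValueError("no feature reads in the window")
    return sum(1 for age in ages if age > MAX_AGE_MS) / len(ages)


def burn_level(rolling_rate: float) -> str:
    """Map a rolling stale rate to an alert level."""
    if rolling_rate >= INCIDENT_RATE:
        return "incident"
    if rolling_rate >= WARNING_RATE:
        return "warning"
    return "ok"


def should_disable_model_path(per_minute_rates: List[float]) -> bool:
    """Assumed reading of the stop condition: 15 consecutive minutes at or above STOP_RATE."""
    last_15 = per_minute_rates[-15:]
    return len(last_15) == 15 and all(rate >= STOP_RATE for rate in last_15)


window_reads = [100] * 9994 + [2500] * 6   # synthetic ages in milliseconds
level = burn_level(stale_rate(window_reads))
```

With 6 stale reads out of 10,000 the sketch reports a warning. Keeping the stop condition as a separate function reflects that it triggers an operational action (disabling the model path) rather than just paging someone.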

10.4 Release Gates

| Gate | Pass evidence | Blocking failure |
| --- | --- | --- |
| Feature contract | Contract includes entity, time, source, owner, allowed use, freshness, leakage control | Missing owner, missing time semantics, allowed use unclear |
| Point-in-time validation | 10,000 historical decisions replay with available_time <= decision_time | Any future feature used in high-risk sample |
| Offline/online parity | Golden sample parity >= 99.5%, all differences explained | Streaming and batch definitions diverge on critical feature |
| Freshness SLO | 7-day load test meets SLO and error budget | No degraded mode or stale rate above threshold |
| Latency | Decision path p99 within scenario SLA under peak load | Timeout causes silent default approve for high-risk request |
| Leakage review | Feature list reviewed for target, post-decision and operational leakage | Manual review result or future label appears in model input |
| Monitoring | Dashboards and alerts for freshness, nulls, drift, skew, latency, decisions, overrides | No alert owner or no incident route |
| Audit replay | Sample decisions replay to same decision or documented tolerance | Cannot reconstruct feature values, model version or rules |
| Governance | PM, Data owner, Risk, Compliance, Architect approve release scope | Risk owner rejects or customer-impact boundary unclear |
| Rollback | Feature disable flag, model rollback, rules fallback tested | Rollback requires code deploy during incident |
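
The first gate, contract completeness, is simple enough to automate. The sketch below is only one possible shape: the dictionary keys, `contract_gate` and `use_is_allowed` are hypothetical names chosen for this example, and a real registry would likely store contracts in its own schema.

```python
REQUIRED_FIELDS = (
    "primary_entity", "event_time", "available_time", "source_events",
    "owner", "allowed_use", "freshness_slo", "leakage_control",
)


def contract_gate(contract: dict) -> list:
    """Return the required contract fields that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not contract.get(field)]


def use_is_allowed(contract: dict, purpose: str) -> bool:
    """A purpose passes only if it is explicitly allowed and not disallowed."""
    allowed = contract.get("allowed_use", ())
    disallowed = contract.get("disallowed_use", ())
    return purpose in allowed and purpose not in disallowed


# Example values taken from the payment fraud velocity contract in this playbook.
contract = {
    "feature_name": "card_decline_count_5m",
    "primary_entity": "card_id",
    "event_time": "payment_authorization.event_time_utc",
    "available_time": "max of ingestion timestamp and streaming output timestamp",
    "source_events": "payment_authorization with status declined",
    "owner": "Payment Risk Data Product Owner",
    "allowed_use": ("fraud detection", "payment authorization risk",
                    "fraud model training", "audit replay"),
    "disallowed_use": ("credit eligibility", "marketing targeting",
                       "customer value ranking"),
    "freshness_slo": "99.9% online reads feature age <= 2 seconds",
    "leakage_control": "Excludes chargeback label, manual review result, "
                       "post-decision dispute status",
}

missing_fields = contract_gate(contract)
```

A check like this only proves the fields exist. Whether the time semantics or allowed uses are actually correct still needs the human reviews listed in the other gates.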

10.5 Replay Validation Plan

| Step | Execution | Evidence |
| --- | --- | --- |
| 1 | Select 30 days of historical payment authorization events and decisions | Replay cohort manifest with date range, regions, channels, risk slices |
| 2 | Rebuild feature values using event log and available_time <= decision_time | Offline feature table with contract version and computation hash |
| 3 | Compare rebuilt features against stored decision-time feature snapshot | Parity report by feature, entity, channel, timestamp bucket |
| 4 | Run current candidate model and rules in shadow mode on historical requests | Shadow decision table with score, reason codes and proposed action |
| 5 | Compare candidate decisions to historical outcomes and human overrides | False positive / false negative / step-up impact by slice |
| 6 | Inject freshness degradation and missing-feature scenarios | Resilience report showing fallback decisions and customer impact |
| 7 | Replay known fraud incidents and near misses | Incident replay memo with missed signal, new feature contribution and residual risk |
| 8 | Produce release recommendation | Pilot / limited release / no-go memo with risk acceptance owner |

Example replay pass criteria:

| Metric | Threshold |
| --- | --- |
| Critical point-in-time violation | 0 |
| Critical feature leakage | 0 |
| Feature parity for top 20 high-impact features | >= 99.5% |
| Stored decision replay reconstruction | >= 99% exact or documented deterministic tolerance |
| High-risk stale fallback correctness | 100% routes to step-up, manual review or deny according to policy |

10.6 Monitoring Checklist

| Monitor | Signal | Slice |
| --- | --- | --- |
| Feature freshness | feature_age p50/p95/p99, stale rate, online TTL expiration | feature group, region, channel, entity type |
| Feature quality | null rate, range violation, enum drift, negative value, outlier rate | new vs existing customer, product, merchant category |
| Online/offline skew | golden sample parity, batch vs stream delta | feature, window, source system |
| Serving reliability | feature server latency, timeout, error rate, cache hit rate | API client, region, risk tier |
| Decision quality | approve/block/step-up/manual-review rate | channel, merchant, customer segment, protected slice where legally appropriate |
| Model drift | score distribution, calibration, PSI/CSI, reason-code shift | model version, feature group |
| Business outcome | confirmed fraud, chargeback, false positive complaint, approval conversion | cohort, decision action |
| Override | human override rate, override reason, override outcome | reviewer team, scenario |
| Governance | contract violations, unapproved feature use, stale owner review | feature owner, platform team |
| Audit health | missing trace field, replay failure, version lookup failure | decision service, model, rules engine |
| Incident | SLO burn, kill switch activation, fallback count | feature group, workflow, tenant |

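10.7 Governance Review Form

| Review area | Question | Payment fraud example answer | Decision |
| --- | --- | --- | --- |
| Business purpose | Which explicit decision action does this feature or model serve? | Authorization risk blocking and step-up routing | Approved |
| Customer impact | What customer impact could it cause? | False payment declines, extra authentication, transaction delay | Approved with FP monitoring |
| Data source | Is the source authoritative and does it have an owner? | Payment switch event stream, owner Payment Platform | Approved |
| Time semantics | Are event_time, available_time and decision_time defined? | All three enter the contract and audit trace | Approved |
| Leakage | Does it include future, post-origination, manual-conclusion or target-variable data? | Excludes chargeback label and manual review result | Approved |
| Privacy | Does it involve PII/PCI/sensitive financial behavior? | Uses tokenized card_id; no PAN in feature store | Approved with DLP evidence |
| Fairness | Could it adversely affect specific groups? | Monitor false blocks for new, cross-border and thin-file customers | Approved with monthly slice review |
| Explainability | Can it provide a reason code or a human explanation? | velocity, geo jump, merchant risk as reason factors | Approved |
| Freshness | Does the SLO match the business action? | 99.9% <= 2s; stale routes to step-up | Approved |
| Replay | Can it be reproduced in a dispute? | feature snapshot + event replay + model/rule version | Approved |
| Operations | Who responds to SLO incidents? | Feature platform on-call + Payment Risk on-call | Approved |
| Scope | Is the release scope limited? | Card-not-present US region controlled release | Approved |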

10.8 Decision Audit Schema

decision_event:
  decision_id: "payauth-2026-06-29-00048192"
  use_case: "payment_fraud_authorization"
  decision_time_utc: "2026-06-29T14:05:31.830Z"
  request_context:
    channel: "card_not_present"
    amount_currency: "USD"
    amount_value: 284.90
    merchant_category: "electronics"
  entities:
    card_id_hash: "card_hash_8f10"
    customer_id_hash: "cust_hash_91ab"
    merchant_id_hash: "m_hash_3e52"
    device_id_hash: "dev_hash_77cc"
  features:
    - name: "card_decline_count_5m"
      value: 3
      feature_time_utc: "2026-06-29T14:05:30.510Z"
      feature_age_ms: 1320
      contract_version: "v4"
    - name: "device_distinct_customer_count_30m"
      value: 5
      feature_time_utc: "2026-06-29T14:05:29.980Z"
      feature_age_ms: 1850
      contract_version: "v2"
  model:
    model_name: "payment_fraud_rt"
    model_version: "2026-06-20-champion"
    score: 0.87
    threshold_band: "step_up"
  rules:
    rule_policy_version: "fraud-policy-2026-06-15"
    triggered_rules:
      - "velocity_high"
      - "new_device_high_amount"
  decision:
    action: "step_up_authentication"
    reason_codes:
      - "recent_decline_velocity"
      - "new_device_pattern"
    fallback_used: false
  audit:
    trace_id: "trace_42cb6"
    feature_server_latency_ms: 14
    model_latency_ms: 27
    policy_latency_ms: 5
    replayable: true
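
One property of this schema can be checked mechanically: `feature_age_ms` should equal `decision_time_utc` minus `feature_time_utc`. The sketch below is a hypothetical validator (`parse_utc`, `audit_problems` and the chosen list of required top-level keys are assumptions), shown against an abbreviated copy of the sample event loaded as a Python dict.

```python
from datetime import datetime, timedelta

REQUIRED_KEYS = ("decision_id", "decision_time_utc", "features", "model", "rules", "decision")


def parse_utc(text: str) -> datetime:
    """Parse an ISO-8601 timestamp that ends in 'Z' into an aware datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def audit_problems(event: dict) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in event]
    if problems:
        return problems
    decision_time = parse_utc(event["decision_time_utc"])
    for feature in event["features"]:
        feature_time = parse_utc(feature["feature_time_utc"])
        expected_age = (decision_time - feature_time) // timedelta(milliseconds=1)
        if expected_age != feature["feature_age_ms"]:
            problems.append(
                f"{feature['name']}: feature_age_ms {feature['feature_age_ms']} != {expected_age}"
            )
    return problems


event = {
    "decision_id": "payauth-2026-06-29-00048192",
    "decision_time_utc": "2026-06-29T14:05:31.830Z",
    "features": [
        {"name": "card_decline_count_5m", "value": 3,
         "feature_time_utc": "2026-06-29T14:05:30.510Z", "feature_age_ms": 1320},
        {"name": "device_distinct_customer_count_30m", "value": 5,
         "feature_time_utc": "2026-06-29T14:05:29.980Z", "feature_age_ms": 1850},
    ],
    "model": {"model_name": "payment_fraud_rt", "score": 0.87},
    "rules": {"rule_policy_version": "fraud-policy-2026-06-15"},
    "decision": {"action": "step_up_authentication", "fallback_used": False},
}

problems = audit_problems(event)
```

For the sample event both ages are consistent with the timestamps (1320 ms and 1850 ms), so the validator returns no problems. A check of this kind could run as part of the "Audit health" monitor.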

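11. 30-Day Training Plan

| Day | Topic | Task | Deliverable |
| --- | --- | --- | --- |
| 1 | Use case framing | Pick one scenario among payment fraud, credit pre-approval, KYC or Agent tool risk; write the decision point and customer impact | decision-use-case-brief.md |
| 2 | Decision taxonomy | Define the business meaning of allow / block / step-up / manual review / fallback | decision-action-taxonomy.md |
| 3 | Entity model | List the relationships among customer, account, card, merchant, device, application, session and other entities | entity-model.md |
| 4 | Time semantics | Define event_time, available_time, decision_time, effective_time, expiry_time | time-semantics-note.md |
| 5 | Source inventory | Inventory event sources, batch sources, policy sources, label sources and manual review sources | source-inventory.md |
| 6 | Feature candidate review | Annotate 20 candidate features with purpose, risk, leakage potential and freshness needs | feature-candidate-review.md |
| 7 | Leakage review | Identify future, post-decision, target, operational and feedback leakage | leakage-review.md |
| 8 | Feature contract | Write contracts for 3 high-impact features, covering entity/time/source/freshness/audit | feature-contract-pack.md |
| 9 | Architecture diagram | Draw the architecture of events, streaming, offline/online store, decisioning, audit and monitoring | realtime-decision-architecture.md |
| 10 | Offline retrieval | Design the construction rules for the point-in-time training dataset | pit-training-dataset-spec.md |
| 11 | Online serving | Design the feature server API, timeout, default, auth and trace fields | online-serving-spec.md |
| 12 | Streaming design | Define rolling windows, watermark, late event policy, TTL | streaming-feature-design.md |
| 13 | Freshness SLO | Write the SLO, error budget, degraded mode and recovery for a feature group | freshness-error-budget.md |
| 14 | Decision orchestration | Define the combination logic of model, rules, policy, fallback and manual escalation | decision-orchestration-spec.md |
| 15 | Audit schema | Write the decision event schema, including feature snapshot and model/rule/policy version | decision-audit-schema.md |
| 16 | Replay cohort | Select historical sample slices: time, channel, customer, merchant, risk tier | replay-cohort-manifest.md |
| 17 | Replay validation | Design offline recomputation, online snapshot comparison and shadow decision comparison | replay-validation-plan.md |
| 18 | Parity tests | Design consistency tests for batch vs stream, offline vs online and default policy | parity-test-pack.md |
| 19 | Monitoring dashboard | Define metrics for freshness, quality, skew, latency, decision, override, audit | monitoring-metric-pack.md |
| 20 | Incident workflow | Write response procedures for stale features, online store outage, leakage discovery and bad thresholds | decisioning-incident-runbook.md |
| 21 | Governance review | Fill in the governance review form and clarify PM, Data, Risk, Compliance and Architect responsibilities | governance-review-record.md |
| 22 | Release gate | Write four gate levels: prototype, shadow, pilot, production | release-gate-spec.md |
| 23 | Fallback design | Design degradation for feature timeout, model timeout and rule engine failure | fallback-and-resilience-plan.md |
| 24 | Fairness and slice review | Design monitoring for slices such as customer, channel, region, thin-file and new customers | slice-impact-review.md |
| 25 | Reason codes | Define the mapping among model scores, rule triggers, customer communication and internal explanation | reason-code-mapping.md |
| 26 | Champion/challenger | Design the shadow model, threshold experiment and risk appetite evaluation | champion-challenger-plan.md |
| 27 | Platform roadmap | Distinguish scenario-level implementation, shared feature platform and enterprise decisioning platform | platform-roadmap.md |
| 28 | Architecture ADR | Write an ADR on whether to adopt a Feast-style feature store, streaming platform and rules engine | feature-platform-adr.md |
| 29 | Portfolio case | Compile problem, architecture, contracts, SLO, replay, governance and business impact | portfolio-case-study.md |
| 30 | Interview pack | Prepare 30-second, 2-minute, CTO, PM, Risk and Data Product deep-dive answers | interview-answer-pack.md |

30-day completion criteria:

  • Can draw the end-to-end architecture of a real-time feature and decisioning platform.
  • Can explain how entity/time/available_time affect point-in-time correctness.
  • Can write a feature contract and distinguish allowed use from disallowed use.
  • Can design a freshness SLO, error budget, degraded mode and recovery condition.
  • Can identify feature leakage and training-serving skew.
  • Can validate online decisions with replay instead of looking only at offline AUC.
  • Can present a feature store launch as a product, architecture, risk, audit and operations capability.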

12. 面试回答

12.1 30 秒版本

实时特征平台的关键不是把特征放进 Redis, 而是保证模型训练和线上决策在 entity、time、definition、freshness 和权限上保持一致。我会用 feature registry 管特征契约, offline store 做 point-in-time training retrieval 和回放, online store 做低延迟服务, streaming pipeline 生成 freshness-sensitive features, 再通过 decision orchestrator 把特征、模型、规则、policy 和人工升级组合起来。上线门禁会要求无 feature leakage、offline/online parity 达标、freshness SLO 有 error budget、决策可回放、监控和 fallback 已验证。

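12.2 2-Minute Version

I would split real-time decisioning into four layers.

The first layer is time and entity semantics. Every feature must define its entity key, event_time, available_time, decision_time, TTL, and allowed use. The training set may use only information that was already available at decision time; otherwise offline metrics are inflated by leakage of future information.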
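
The difference between event_time and available_time is easiest to see on a late-arriving row. The sketch below is an illustration only: the tuple layout, the sample timestamps, and the function `point_in_time_value` are assumptions made for this example, not part of any feature store API.

```python
from datetime import datetime
from typing import Optional

# Hypothetical feature log for one entity: (event_time, available_time, value).
# available_time is when the value became readable by the serving path.
FeatureRow = tuple[datetime, datetime, float]


def point_in_time_value(rows: list[FeatureRow], decision_time: datetime) -> Optional[float]:
    """Return the value with the latest event_time among rows that were
    already available at decision_time, or None if nothing was available."""
    visible = [r for r in rows if r[1] <= decision_time]
    if not visible:
        return None
    return max(visible, key=lambda r: r[0])[2]


rows = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), 3.0),
    # Late-arriving: the event happened at 11:00 but was only readable at 12:30.
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 12, 30), 7.0),
]
decision_time = datetime(2024, 1, 1, 12, 0)

# A join on event_time alone would pick 7.0, which was not yet visible.
print(point_in_time_value(rows, decision_time))  # 3.0
```

Filtering on available_time rather than event_time is what keeps the training row identical to what the online path could have served at 12:00.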

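The second layer is the feature platform. The feature registry manages owner, contract, version, lineage, freshness SLO, and permissions; the offline store handles point-in-time retrieval, backfill, batch scoring, and replay; the online store handles low-latency lookup; the feature server handles authentication, rate limiting, tracing, and returning feature age.

The third layer is decisioning. When a real-time request arrives, the decision orchestrator fetches online features, calls the model service, executes rules and policy, and outputs allow, block, step-up, manual review, or fallback. Payment fraud emphasizes millisecond-level latency and velocity features; credit pre-approval emphasizes leakage control, fairness, and reason codes; KYC emphasizes document evidence, list versions, and the boundary of manual review; Agent tool risk emphasizes session state, DLP, approval, and audit.

The fourth layer is governance and operations. Before launch, run offline/online parity, leakage review, freshness load test, historical replay, shadow decision, and governance review. After launch, monitor freshness, nulls, drift, skew, latency, decision outcome, override, and audit replay health. When a high-risk feature exceeds its budget, the system should not silently approve; it should move to step-up, manual review, or rule-based degradation.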

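12.3 CTO Deep Dive

Q: What is the essential difference between a feature store and an ordinary data warehouse?

A: A data warehouse serves analytics and batch processing; a feature store serves reusable features for training and serving. The difference is that a feature store must manage entity/time semantics, point-in-time retrieval, online serving, offline/online consistency, freshness, feature contracts, and model serving integration. In real-time decisioning it must also return feature age, contract version, and trace to support audit and replay.

Q: How do you avoid training-serving skew?

A: I treat skew as a release gate rather than relying on verbal agreement. Concretely: the feature contract is the single source of definition, and batch and stream use the same semantics; the training set uses available_time for the point-in-time join; online serving returns the feature timestamp and age; golden entity/time samples are used to compare offline recomputation, stream output, and the online snapshot; and the default/null policy is also part of the contract. If any critical feature fails parity, it does not enter production.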
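
The golden-sample comparison can be sketched as a small parity report. This is a hedged illustration: the sample keys, the tolerance, and the helpers `values_match` and `parity_report` are hypothetical, and a real gate would also compare the stream output.

```python
import math

_MISSING = object()  # sentinel: key absent from the online snapshot


def values_match(off_val, on_val, rel_tol: float) -> bool:
    """None only matches None, so default/null policy differences surface."""
    if off_val is None or on_val is None:
        return off_val is None and on_val is None
    return math.isclose(off_val, on_val, rel_tol=rel_tol)


def parity_report(offline: dict, online: dict, rel_tol: float = 1e-6) -> dict:
    """Compare offline recomputation with the online snapshot, key by key."""
    mismatches = []
    for key, off_val in offline.items():
        on_val = online.get(key, _MISSING)
        if on_val is _MISSING or not values_match(off_val, on_val, rel_tol):
            mismatches.append(key)
    total = len(offline)
    rate = 1.0 - len(mismatches) / total if total else 1.0
    return {"parity_rate": rate, "mismatches": mismatches}


# Hypothetical golden samples keyed by (entity_id, as_of timestamp).
offline = {("c1", "t1"): 3.0, ("c2", "t1"): 0.25, ("c3", "t1"): None}
online = {("c1", "t1"): 3.0, ("c2", "t1"): 0.30, ("c3", "t1"): None}

report = parity_report(offline, online)
print(report["mismatches"])  # [('c2', 't1')]
```

For a critical feature the gate could require `parity_rate == 1.0`; any looser threshold is a risk decision that belongs in the contract rather than in the test code.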

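Q: What if real-time feature latency cannot meet the target?

A: First distinguish source lag, pipeline lag, online materialization lag, and serving latency. On the product side there must be a degraded mode: step-up for high-risk payment transactions, pausing auto-pass for KYC, and require approval for high-risk Agent tools. On the architecture side, set up timeout, TTL, cache, asynchronous compensation, a fallback feature group, and a kill switch. A stale feature must never turn into a silent approve.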
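
The rule that a stale feature must not become a silent approve can be sketched as a guard in front of the score thresholds. Everything here is an assumption for illustration: the `Action` enum, the `decide` function, the thresholds, and the reason strings are hypothetical, and a production orchestrator would also handle timeouts and kill switches.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    STEP_UP = "step_up"
    BLOCK = "block"


def decide(score: float, feature_age_s: float, ttl_s: float,
           block_at: float = 0.9, step_up_at: float = 0.6) -> tuple[Action, str]:
    """Return (action, reason code). Thresholds and TTL are hypothetical."""
    if feature_age_s > ttl_s:
        # Degraded mode: the score was computed on stale input, so escalate
        # instead of trusting a low score.
        return Action.STEP_UP, "degraded:stale_feature"
    if score >= block_at:
        return Action.BLOCK, "model:high_risk"
    if score >= step_up_at:
        return Action.STEP_UP, "model:medium_risk"
    return Action.ALLOW, "model:low_risk"


# A low score on a 120-second-old feature with a 60-second TTL is not approved.
print(decide(score=0.1, feature_age_s=120, ttl_s=60))
```

Returning the reason code alongside the action keeps the degraded path visible in monitoring and in the audit trail.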

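Q: Why is replay hard?

A: Because replay is not re-running the current code. It has to restore the event visibility, feature contract, model version, rule version, policy version, source version, and decision context of that moment. Late-arriving data, rule changes, entity merges, and source corrections all break reproducibility. High-risk scenarios should store a decision-time feature snapshot and also keep the event log to support recomputation and diff explanation.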

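12.4 PM Deep Dive

Q: How do you decide which real-time decisioning scenario to build in phase one?

A: I look at four dimensions: the value of decision timeliness, error cost, data readiness, and replayability. Payment fraud has high timeliness value but also high latency and false-block risk; credit pre-approval has high governance and fairness requirements; KYC document decisions need evidence and a manual boundary; Agent tool risk can start from approval for high-risk tools. Phase one suits a scenario with clear business value, an existing human backstop, sufficient historical events and labels, and the ability to run shadow replay.

Q: How do you define product success for a real-time feature platform?

A: Not by model metrics alone. Platform metrics include time to onboard a new feature, feature reuse rate, contract coverage, freshness SLO attainment, offline/online parity, replay success rate, number of feature incidents, and release gate pass rate. Business metrics vary by scenario: fraud loss, false-block complaints, pre-approval conversion, KYC turnaround time, and interception of erroneous Agent tool authorizations. Governance metrics cover unapproved feature use, leakage incidents, and audit evidence retrieval time.

Q: How do you handle false positives and user experience?

A: Real-time decisioning should not be limited to a binary allow/block. Payments can step-up, KYC can request-more-info, credit can invite-to-apply instead of committing to credit, and Agent tools can dry-run or require approval. The PM should translate error cost into a gradient of actions and monitor false positive by slice, override reason, complaints, and conversion loss.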

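12.5 Risk / Compliance Deep Dive

Q: How do you prove there is no feature leakage?

A: The evidence includes event_time and available_time in the feature contract; the point-in-time join in the training set generation logic; the leakage review's item-by-item conclusions on future, post-decision, target, operational, and feedback leakage; and the results of checking available_time <= decision_time in historical replay. High-risk features also need source lineage and proof that manual-review fields are excluded.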
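
The `available_time <= decision_time` check over replay rows can be sketched as follows. The row layout, the feature names, and the function `find_leaks` are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def find_leaks(rows: list[dict]) -> list[dict]:
    """Return replay rows whose feature was not yet available at decision time."""
    return [r for r in rows if r["available_time"] > r["decision_time"]]


# Hypothetical replay rows: one feature value used by one historical decision.
replay_rows = [
    {"decision_id": "d1", "feature": "txn_count_1h",
     "available_time": datetime(2024, 3, 1, 9, 59),
     "decision_time": datetime(2024, 3, 1, 10, 0)},
    {"decision_id": "d1", "feature": "chargeback_flag",
     # Post-decision information: known only days after the decision.
     "available_time": datetime(2024, 3, 8, 0, 0),
     "decision_time": datetime(2024, 3, 1, 10, 0)},
]

leaks = find_leaks(replay_rows)
print([r["feature"] for r in leaks])  # ['chargeback_flag']
```

An empty result is evidence for the timing dimension only; target, operational, and feedback leakage still need the item-by-item review described above.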

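Q: A customer disputes a blocked payment. How do you explain it?

A: I would retrieve the decision audit event and show the request context, entity, feature values, feature age, model version, rule version, policy version, triggered reason codes, fallback status, and downstream action at that time. The explanation should rest on the information available at that time, not on a later chargeback or a re-run with the current model.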
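
One possible shape for such a decision audit event is sketched below. The class `DecisionAuditEvent`, its field names, and the sample values are assumptions made for this example; the playbook lists the information to capture but does not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionAuditEvent:
    decision_id: str
    decision_time: str     # ISO 8601 timestamp of the decision
    entity_key: str
    request_context: dict
    feature_values: dict   # feature name -> value seen at decision time
    feature_age_s: dict    # feature name -> age in seconds at decision time
    model_version: str
    rule_version: str
    policy_version: str
    reason_codes: list
    fallback_used: bool
    action: str            # downstream action, e.g. "block" or "step_up"

    def to_json(self) -> str:
        """Serialize with sorted keys so stored events diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


event = DecisionAuditEvent(
    decision_id="d-001",
    decision_time="2024-03-01T10:00:00Z",
    entity_key="card:1234",
    request_context={"channel": "ecommerce", "amount": 420.0},
    feature_values={"txn_count_1h": 9},
    feature_age_s={"txn_count_1h": 4.2},
    model_version="fraud-model-v7",
    rule_version="rules-2024-02",
    policy_version="policy-12",
    reason_codes=["VELOCITY_HIGH"],
    fallback_used=False,
    action="block",
)
print(event.to_json())
```

Storing the values and ages seen at decision time, rather than pointers to mutable tables, is what lets the dispute be answered from what was known then.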

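Q: How does the AI RMF map onto this platform?

A: Govern covers owner, approval, risk appetite, and audit responsibility; Map covers scenario, customer impact, data sources, entity/time, and failure modes; Measure covers freshness, skew, leakage, latency, model quality, false blocks, override, and incident; Manage covers release gate, fallback, manual review, kill switch, rollback, and post-incident replay.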

12.6 Data Product Deep Dive

Q: What are the most important fields in a feature contract?

A: In high-risk real-time decisioning the most critical are business purpose, allowed/disallowed use, entity, event_time, available_time, source, transformation, freshness SLO, null/default policy, TTL, owner, data classification, leakage control, online/offline serving semantics, and audit fields. Without these fields, a feature is just a data column, not a governable data product.
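
A subset of these contract fields can be expressed as a typed record with a completeness check. This is a sketch only: `FeatureContract`, `contract_issues`, the field names, and the sample values are hypothetical and cover only part of the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class FeatureContract:
    name: str
    business_purpose: str
    entity: str
    event_time_column: str
    available_time_column: str
    source: str
    owner: str
    data_classification: str
    null_default_policy: str
    freshness_slo_s: int
    ttl_s: int
    allowed_use: tuple = ()
    disallowed_use: tuple = ()


def contract_issues(c: FeatureContract) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    issues = []
    for f in fields(c):
        value = getattr(c, f.name)
        if isinstance(value, str) and not value.strip():
            issues.append(f"missing {f.name}")
    if c.freshness_slo_s <= 0 or c.ttl_s <= 0:
        issues.append("freshness_slo_s and ttl_s must be positive")
    if not c.allowed_use:
        issues.append("allowed_use is empty")
    overlap = set(c.allowed_use) & set(c.disallowed_use)
    if overlap:
        issues.append(f"use listed as both allowed and disallowed: {sorted(overlap)}")
    return issues


contract = FeatureContract(
    name="txn_count_1h", business_purpose="payment fraud velocity signal",
    entity="card_id", event_time_column="event_time",
    available_time_column="available_time", source="payments_stream",
    owner="", data_classification="internal", null_default_policy="zero",
    freshness_slo_s=60, ttl_s=3600,
    allowed_use=("payment_fraud",), disallowed_use=("credit_preapproval",),
)
print(contract_issues(contract))  # ['missing owner']
```

A registry could run a check like this at registration time so that a feature without an owner or an allowed use never reaches the online store.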

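Q: How do you manage feature deprecation?

A: A feature cannot simply be deleted. First mark it deprecated, forbid new models from depending on it, list replacement features, run a dependency scan, preserve historical replay capability, update model lineage, and notify the owner and risk reviewer. For features that once affected customer decisions, also retain the contract, source lineage, and versions until the audit retention period ends.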


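13. Self-Check Checklist

| Area | Check |
| --- | --- |
| Architecture | Does it include event ingestion, streaming features, registry, offline store, online store, feature server, decision orchestrator, rules/policy, audit/replay, and monitoring? |
| Entity/time | Are entity keys, event_time, available_time, decision_time, effective_time, and expiry_time defined? |
| Consistency | Are there offline/online parity, batch/stream parity, and default/null policy consistency checks? |
| Leakage | Does it cover future, post-decision, target, operational, source, and feedback leakage? |
| Freshness | Are there an SLO, error budget, burn alert, degraded mode, and recovery condition? |
| Financial fit | Does it cover payment fraud, credit pre-approval, KYC, NBA, and customer-service Agent tool risk? |
| Governance | Are owner, allowed use, PII, retention, risk review, and approval defined? |
| Release | Are there point-in-time validation, load test, shadow mode, replay, and rollback? |
| Monitoring | Are freshness, quality, skew, latency, outcome, override, and audit health monitored? |
| Audit | Can the feature values, model, rules, policy, and decision reasons of that moment be reproduced? |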

14. Final Sentences to Remember

Real-time decisioning is a time-aware, governed, replayable decision architecture.
Feature store is the contract layer between historical learning and online action.

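In other words:

The essence of a real-time feature platform is to put "how we trained in the past" and "how we decide right now" into one system of entities, time, contracts, serving, monitoring, and audit.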