Feature Stores / Real-Time ML: Feast, Michelangelo, and Real-Time Decisioning

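Feature Stores / Real-Time ML: A Reading Guide

Audience: AI Architect / Platform PM / Data Product Manager / Risk Product / ML Platform Owner.

Core question: Why can't real-time AI decisioning rely on model serving alone? How does a Feature Store address training-serving consistency, time correctness, real-time freshness, reuse, monitoring, and governance?

Learning goals: Understand Feast, Uber Michelangelo, offline/online stores, point-in-time correctness, training-serving skew, and streaming features, and map them to payment fraud, credit pre-approval, KYC, recommendation, and Agent risk decisions.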

Source Anchors

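Source	Link	Purpose
Feast docs	https://docs.feast.dev/	Understand the open-source feature store's entities, feature views, and offline/online serving
Feast GitHub	https://github.com/feast-dev/feast	Understand the engineering boundaries, registry, providers, and serving modes
Uber Michelangelo	https://www.uber.com/blog/michelangelo-machine-learning-platform/	Understand a large-scale ML platform: features, training, deployment, and monitoring
Metaflow docs	https://docs.metaflow.org/	Understand ML/data workflows, versioning, experiments, and production pipelines
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Bring real-time ML under governance, measurement, monitoring, and risk management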

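In one sentence:

A Feature Store is the AI data infrastructure that turns "the business facts a model needs" into something productized, versioned, reusable, replayable, and monitorable.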

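1. Why the Feature Store Is an Architecture Problem

Real-time ML decisions often fail because of the features rather than the model:

Training fields and online serving fields do not match.
Labels are only known in the future, so that information leaks into training.
Real-time features arrive too late or are missing.
Multiple teams compute the same feature separately, each with a different definition.
After a new model goes live, feature drift degrades its performance.
An audit cannot explain which feature values a given decision used.
Batch and streaming features have different freshness, so model behavior becomes unpredictable.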

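What a Feature Store provides:

Capability	Problem solved
Feature registry	Know which features exist, with their owner, definition, and version
Offline store	Training, replay, historical features
Online store	Low-latency serving
Point-in-time join	Prevent leakage of future information
Materialization	Push offline/streaming features into the online store
Feature view	Manage entity, schema, freshness, source
Reuse	Keep teams from rebuilding the same feature
Monitoring	Freshness, missing rate, drift, quality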

2. Offline / Online Consistency

The core promise of a Feature Store:

training features ~= serving features

2.1 Offline Store

Used for:

Generating historical training sets.
Backtesting.
Point-in-time joins.
Model validation.
Audit replay.

Data forms:

Data warehouse.
Lakehouse.
Batch tables.
Historical event logs.

2.2 Online Store

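Used for:

Feature reads at millisecond or low-second latency.
Real-time fraud, recommendation, pricing, risk control, Agent gating.
Serving-time feature lookup.

Common implementations:

Redis / DynamoDB / Cassandra / Bigtable / key-value store.
Streaming materialization.
Feature service API.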

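2.3 Consistency Challenges

Problem	Example	Control
Training-serving skew	Training uses SQL logic while online serving uses different Java/Python logic	Unified feature definition, feature view
Data leakage	Training uses fields that are only known after the decision	Point-in-time join, label cutoff
Freshness drift	A real-time feature has not been updated for 10 minutes	Freshness SLO, alert
Null mismatch	Training fills 0 while online serving returns null	Default policy, schema contract
Entity mismatch	customer_id/account_id/device_id mappings are inconsistent	Entity registry, identity resolution
Version mismatch	Model v2 reads the old feature definition	Model-feature dependency registry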
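
The first and fourth rows of the table can be illustrated with a minimal sketch: one feature definition, including its default policy, is shared by the offline and online paths, and a sampled parity check compares the two. This is plain Python rather than the Feast API; `FeatureDefinition`, `failed_ratio`, `card_failed_ratio_5m`, and `parity_mismatches` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FeatureDefinition:
    """One definition used by BOTH the training and the serving path (hypothetical)."""
    name: str
    compute: Callable[[dict], Optional[float]]  # raw record -> value, or None if undefined
    default: float                               # explicit null/default policy

    def value(self, record: dict) -> float:
        raw = self.compute(record)
        # The same default applies offline and online, so the
        # "training fills 0, serving returns null" mismatch cannot occur.
        return self.default if raw is None else float(raw)


def failed_ratio(record: dict) -> Optional[float]:
    attempts = record.get("attempts")
    if not attempts:  # missing or zero attempts: the ratio is undefined
        return None
    return record.get("failed", 0) / attempts


FAILED_RATIO = FeatureDefinition("card_failed_ratio_5m", failed_ratio, default=0.0)


def parity_mismatches(offline_rows, online_rows, feature, tol=1e-9):
    """Return the entity ids whose offline and online feature values differ."""
    mismatched = []
    for entity_id, offline_record in offline_rows.items():
        online_record = online_rows.get(entity_id, {})
        if abs(feature.value(offline_record) - feature.value(online_record)) > tol:
            mismatched.append(entity_id)
    return sorted(mismatched)


# Sampled records for the same entities from both stores (made-up data).
offline = {"c1": {"attempts": 4, "failed": 1}, "c2": {"attempts": 0}}
online = {"c1": {"attempts": 4, "failed": 1}, "c2": {"attempts": 2, "failed": 1}}
print(parity_mismatches(offline, online, FAILED_RATIO))
```

In this made-up sample only `c2` is reported, because the two stores hold different raw counts for it. The design point is that the comparison runs through the shared definition, so any difference comes from the data and not from two diverging implementations.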

3. Point-in-Time Correctness

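A real-time decision system must be able to answer:

At that point in time, what could the system actually have known?

Incorrect example:

A credit pre-approval decision is made on 2026-06-01
but the training sample uses a delinquency outcome that only appeared on 2026-06-05

This is label leakage: offline metrics look inflated, and the model then fails online.

Correct approach: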

entity_id + event_timestamp
  -> join only feature values available before event_timestamp
  -> enforce feature timestamp and created timestamp rules
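
The join rule above can be sketched in plain Python with a binary search over each entity's feature history. This is a minimal illustration, not Feast's implementation: `point_in_time_value`, `overdue_history`, and the sample data are hypothetical, and the history is assumed to be sorted by the time each value became available.

```python
from bisect import bisect_left
from datetime import datetime


def point_in_time_value(history, event_ts):
    """Return the latest value that became available strictly before event_ts.

    history: list of (available_ts, value) tuples sorted by available_ts.
    Returns None when nothing was available yet.
    """
    timestamps = [available_ts for available_ts, _ in history]
    idx = bisect_left(timestamps, event_ts)  # number of entries with ts < event_ts
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of an "overdue count" feature for one customer.
overdue_history = {
    "cust_1": [
        (datetime(2026, 5, 20), 0),
        (datetime(2026, 6, 5), 1),  # only becomes known AFTER the decision below
    ]
}
decision_events = [("cust_1", datetime(2026, 6, 1))]

training_rows = [
    (entity_id, event_ts,
     point_in_time_value(overdue_history.get(entity_id, []), event_ts))
    for entity_id, event_ts in decision_events
]
print(training_rows)
```

For the 2026-06-01 decision the join picks the value recorded on 2026-05-20 and ignores the 2026-06-05 update. Swapping `bisect_left` for `bisect_right` would also admit values stamped exactly at the event time; which of the two is right depends on the timestamp semantics of the source.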

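Senior product/architecture questions:

Do we distinguish event time, processing time, and available time?
Does the feature have late-arriving data?
Can a replayed training set reproduce the state that was visible online at the time?
Can an audit retrieve the feature values, model version, policy version, and output as they were at the time?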

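4. Streaming Features and Freshness SLOs

Examples of real-time features:

Scenario	Real-time features
Payment fraud	Transactions on the same device in the past 5 minutes, failures on the same card, merchant velocity
Credit pre-approval	Recent salary deposits, balance changes, external bureau update status
KYC	Document upload count, OCR failure count, device/IP anomalies
Recommendation	Current session clicks, searches, shopping cart
Agent tool risk	The action the user just requested, tool failure count, policy violation count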

Freshness SLO examples:

Feature tier	Freshness	Missing tolerance	Use cases
Critical real-time	< 5s	near 0	Payment blocking, account takeover
Operational real-time	< 60s	very low	KYC routing, customer-service Agent gating
Near real-time	< 15m	low	Recommendation, next-best-action
Batch	daily/weekly	medium	Long-term customer profiles, strategy analysis
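
As a rough illustration, the tiers in the table can be encoded as configuration and checked at serving time. The tier keys and the `is_fresh` helper are hypothetical names, and the one-day limit for the batch tier is an assumption (the table says daily/weekly).

```python
from datetime import datetime, timedelta, timezone

# Maximum allowed feature age per tier, taken from the table above.
# The batch tier is "daily/weekly" in the table; one day is assumed here.
FRESHNESS_SLO = {
    "critical_real_time": timedelta(seconds=5),
    "operational_real_time": timedelta(seconds=60),
    "near_real_time": timedelta(minutes=15),
    "batch": timedelta(days=1),
}


def is_fresh(tier, feature_ts, now):
    """True when the feature value is younger than its tier's freshness SLO."""
    return (now - feature_ts) < FRESHNESS_SLO[tier]


now = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
last_update = now - timedelta(seconds=8)  # the feature was last written 8 seconds ago

print(is_fresh("critical_real_time", last_update, now))     # violates the < 5s tier
print(is_fresh("operational_real_time", last_update, now))  # within the < 60s tier
```

The same 8-second-old value is acceptable for KYC routing but not for payment blocking, which is why the SLO belongs to the feature's tier and not to the pipeline as a whole.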

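Design questions:

When a feature expires, do we reject, degrade, fall back to batch features, or route to manual review?
When the streaming pipeline lags, how do we protect customers and the business?
When the online store returns a stale value, is the model aware of it?
Are freshness alerts wired into model serving and the risk workflow?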
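
One possible answer to the first and third questions is an explicit fallback policy that also returns a staleness flag the model or decision service can consume. The sketch below is an assumption-laden illustration: `FeatureAction`, `resolve_stale_feature`, and the ordering of the branches (manual review first for high-risk flows, then batch fallback, then degrade) are hypothetical choices, and rejecting outright is another option that is not shown.

```python
from enum import Enum


class FeatureAction(Enum):
    USE_ONLINE = "use_online"
    USE_BATCH = "use_batch"
    DEGRADE = "degrade"
    MANUAL_REVIEW = "manual_review"


def resolve_stale_feature(online_value, age_seconds, slo_seconds,
                          batch_value=None, high_risk=False):
    """Return (action, value, is_stale) for one feature lookup."""
    if online_value is not None and age_seconds < slo_seconds:
        return FeatureAction.USE_ONLINE, online_value, False
    # From here on the online value is missing or older than its SLO.
    if high_risk:
        # Assumed policy: high-risk flows never decide on stale data.
        return FeatureAction.MANUAL_REVIEW, None, True
    if batch_value is not None:
        return FeatureAction.USE_BATCH, batch_value, True
    return FeatureAction.DEGRADE, None, True


# A 40-second-old velocity feature against a 5-second SLO, low-risk flow.
print(resolve_stale_feature(3, age_seconds=40, slo_seconds=5, batch_value=1))
```

Returning `is_stale` alongside the value lets the downstream model or policy engine treat a fallback differently from a fresh read, and it gives monitoring a direct source for a fallback-rate metric.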

5. Feature Contract

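A Feature Contract is the agreement between the data product and the model product.

Field	Description
Feature name	Stable naming that does not follow temporary projects
Entity	customer, account, device, merchant, case
Definition	Business definition and computation logic
Source	System, table, topic, owner
Timestamp semantics	event time, processing time, available time
Freshness SLO	Maximum acceptable delay
Null/default policy	How missing values are handled
Allowed use	training, serving, monitoring, analysis
Prohibited use	Must not be used for certain automated decisions or customer segmentation
Privacy class	PII, sensitive, consent required
Quality checks	range, distribution, missing, drift
Consumers	Models, rules, dashboards, Agents
Approval owner	data owner, risk owner, privacy owner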

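Senior-level judgments:

A feature is not a field; it is a data product that carries an owner, semantics, freshness, permissions, and quality commitments.
Feature Contracts should be part of the model release gate.
High-risk features need purpose limitation and explanation boundaries.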
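
A contract like this can be made machine-checkable so that a release gate can reject a model that uses a feature outside its contract. The sketch below covers only a subset of the fields in the table; `FeatureContract`, `gate_violations`, the `"privacy_owner"` label, and the sample feature are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureContract:
    name: str
    entity: str
    definition: str
    source: str
    timestamp_semantics: str  # "event_time", "processing_time", or "available_time"
    freshness_slo_seconds: int
    default_policy: str
    allowed_use: set = field(default_factory=set)
    prohibited_use: set = field(default_factory=set)
    privacy_class: str = "none"
    approval_owners: set = field(default_factory=set)


def gate_violations(contract, intended_use):
    """Return human-readable reasons why this use should not pass the gate."""
    problems = []
    if not contract.definition.strip():
        problems.append("missing business definition")
    if intended_use in contract.prohibited_use:
        problems.append(f"use '{intended_use}' is prohibited")
    elif intended_use not in contract.allowed_use:
        problems.append(f"use '{intended_use}' is not explicitly allowed")
    # Assumed rule: PII or sensitive features need a privacy owner's sign-off.
    if contract.privacy_class in {"PII", "sensitive"} \
            and "privacy_owner" not in contract.approval_owners:
        problems.append("privacy owner approval missing")
    return problems


contract = FeatureContract(
    name="device_txn_count_5m",
    entity="device",
    definition="Number of card transactions from this device in the last 5 minutes",
    source="payments.transactions topic",
    timestamp_semantics="event_time",
    freshness_slo_seconds=5,
    default_policy="0 when the device has no history",
    allowed_use={"training", "serving", "monitoring"},
    prohibited_use={"customer_segmentation"},
    privacy_class="sensitive",
    approval_owners={"data_owner", "risk_owner"},
)
print(gate_violations(contract, "serving"))
print(gate_violations(contract, "customer_segmentation"))
```

In this example serving is an allowed use, but the gate still flags the missing privacy sign-off; the segmentation use is flagged as prohibited as well.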

6. Real-Time Decision Architecture

Event / Request
  -> identity and entity resolution
  -> online feature lookup
  -> real-time feature computation
  -> model score
  -> rules / policy engine
  -> decision service
  -> action / recommendation / escalation
  -> trace log
  -> feedback / label capture
  -> offline training and replay

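6.1 The Decision Service Is Not the Model Service

The model service outputs a score; the decision service outputs an action:

Layer	Output	Responsibility
Feature service	feature vector	Data freshness, definitions, permissions
Model service	score / probability / embedding	Prediction quality
Rule/policy service	allow/deny/escalate constraints	Compliance and business rules
Decision service	approve/reject/review/recommend	Business decision orchestration
Workflow service	case/action/task	Human-machine process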
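
The separation between score and action can be shown with a small sketch in which the decision service combines a model score with a policy constraint. The function name `decide`, the threshold values, and the precedence of policy over the model are assumptions made for illustration, using the payment-fraud action vocabulary.

```python
def decide(score, policy_constraint, step_up_threshold=0.5, decline_threshold=0.9):
    """Combine a fraud score with a policy constraint into a business action.

    score: model output in [0, 1]; higher means more likely fraud.
    policy_constraint: "allow", "deny", or "escalate" from the rule/policy service.
    """
    # Assumed precedence: compliance and business rules override the model.
    if policy_constraint == "deny":
        return "decline"
    if policy_constraint == "escalate":
        return "manual_review"
    # Policy allows the transaction, so the score picks the action.
    if score >= decline_threshold:
        return "decline"
    if score >= step_up_threshold:
        return "step_up"
    return "approve"


print(decide(0.12, "allow"))     # low score, no policy objection
print(decide(0.12, "escalate"))  # same score, but policy forces a review
print(decide(0.65, "allow"))     # medium score triggers step-up authentication
```

The same score of 0.12 leads to two different actions depending on policy, which is the reason the table assigns prediction quality and business decision orchestration to different layers.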

6.2 Retail Financial Services Cases

Payment fraud:

transaction event
  -> card/account/device/merchant velocity features
  -> fraud model
  -> policy rules
  -> approve / step-up / decline / manual review
  -> customer notification and dispute path

Credit pre-approval:

customer context
  -> eligibility features
  -> bureau / income / relationship features
  -> model score
  -> policy boundary
  -> marketing eligibility or manual review

Agent tool authorization:

agent wants to call tool
  -> user role + customer context + action risk + recent behavior features
  -> policy model/rules
  -> allow / require approval / block
  -> audit trace

7. Monitoring and Release Gate

Feature monitoring:

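Metric	Meaning
freshness lag	Feature delay
missing rate	Share of missing values
distribution drift	Shift in the feature distribution
schema drift	Schema or type changes
online/offline parity	Gap between training and serving definitions
entity coverage	Whether entities are matched
fallback rate	Share of degraded or default values
decision impact	How many decisions a feature anomaly affected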
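Release gate:

Every critical feature has a contract.
The point-in-time dataset passes a leakage check.
Online/offline parity passes sampled verification.
Freshness SLOs have monitoring and alerts.
The missing-value and default-value policy has passed risk review.
The model version is bound to feature view versions.
The complete feature vector of any given decision can be replayed.
High-risk actions have a policy engine and a human escalation path.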
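
The gate item on binding model versions to feature view versions can be sketched as a comparison between the versions a model pins and the versions the registry currently serves. The manifest layout, `version_mismatches`, and the feature view names are hypothetical and do not reflect any particular registry's format.

```python
def version_mismatches(model_manifest, registry):
    """Return {feature_view: (pinned, served)} where the two versions differ.

    model_manifest: {"model": ..., "feature_views": {view_name: pinned_version}}
    registry: {view_name: version currently served online}
    A view missing from the registry shows up with served=None.
    """
    problems = {}
    for view, pinned in model_manifest["feature_views"].items():
        served = registry.get(view)
        if served != pinned:
            problems[view] = (pinned, served)
    return problems


manifest = {
    "model": "fraud_v2",
    "feature_views": {"device_velocity": "v3", "card_failures": "v2"},
}
registry = {"device_velocity": "v3", "card_failures": "v1"}

print(version_mismatches(manifest, registry))
```

Here `fraud_v2` was trained on `card_failures` v2 while the registry still serves v1, so the gate would block the release until the versions agree.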

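8. Interview Talking Points

30-second version

A Feature Store solves the production consistency problem of real-time ML. It unifies feature definitions, historical training, online serving, point-in-time correctness, freshness, reuse, and monitoring, which prevents training-serving skew and data leakage. In payment fraud, credit, KYC, recommendation, and Agent tool authorization, the feature platform often determines system reliability more than the model itself does.

2-minute version

I would split real-time AI decisioning into a feature service, model service, policy service, decision service, and workflow service. The Feature Store is responsible for offline/online consistency: the offline store produces historical training sets and replays, and the online store supports low-latency serving. The key control is point-in-time correctness, which ensures training samples only use information that was visible at the time and so avoids leakage. Real-time scenarios also need freshness SLOs; for example, payment-fraud velocity features may need to update within 5 seconds, while customer profiles can refresh daily. At launch I would require feature contracts, schema/quality/freshness monitoring, online-offline parity, model-feature version binding, and decision replay capability.

Architect version

The Feature Store is the data control plane of an enterprise AI platform. It is not a cache table; it is a combined system of feature definitions, entity semantics, time semantics, permissions, quality, SLOs, versions, and lineage. Without it, real-time decisioning ends up with duplicated computation, inconsistent definitions, decisions that cannot be audited, and uncontrolled model drift.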

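9. Portfolio Task

Pick one real-time decision scenario and produce:

A real-time decision architecture diagram.
10 feature contracts.
A freshness SLO matrix.
A point-in-time training dataset design.
An online/offline parity test plan.
A sketch of a feature monitoring dashboard.
A release gate checklist.
A one-page incident replay: how to locate the cause and roll back when a stale feature leads to a wrong decision.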