AI Real-Time Feature Store / Decisioning Playbook
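Positioning: a handbook on real-time feature platforms and real-time decisioning architecture for AI Product Architects, AI PMs, Data Product, Risk Product, and Platform Architects.

Core goal: connect feature store, streaming features, online/offline consistency, point-in-time correctness, freshness SLO, replay audit, and real-time decisioning in retail financial services into a capability that can be designed, reviewed, launched, and articulated in interviews.

Core conclusion: real-time AI decisioning is not "a model API plus a few real-time fields." It is a platform capability centered on entity/time semantics, feature contracts, low-latency online serving, historically consistent training sets, decision orchestration, monitoring and replay, and governance gates.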
Source Anchors
These sources serve as learning anchors for establishing platform terminology, architectural boundaries, and governance language. They do not constitute vendor selection advice, legal opinion, or regulatory opinion.
| Source | Link | Use in this document |
|---|---|---|
| Feast Docs: Quickstart | https://docs.feast.dev/getting-started/quickstart | Understand the basic engineering shape of offline store, online store, materialization, push features, and real-time inference |
| Feast Docs: Use Cases | https://docs.feast.dev/getting-started/use-cases | Align on scenario language such as risk scorecards, historical feature retrieval, feature monitoring, and point-in-time training data |
| Feast GitHub | https://github.com/feast-dev/feast | Treat Feast as an open-source feature store reference implementation to understand the product boundaries of registry, feature server, and offline/online serving |
| Feast Feature Server | https://docs.feast.dev/getting-started/components/feature-server | Reference for the online feature serving API, push/read paths, and secure-communication requirements in production |
| Uber Michelangelo | https://www.uber.com/us/en/blog/michelangelo-machine-learning-platform/ | Reference for how an end-to-end ML platform covers data, training, deployment, prediction, and monitoring |
| Uber Palette Meta Store Journey | https://www.uber.com/us/en/blog/palette-meta-store-journey/ | Reference for how a large-scale feature store manages curated features, auto-generated pipelines, and feature dispersal |
| Metaflow Docs | https://docs.metaflow.org/introduction/what-is-metaflow | Reference for production ML workflows, versioning, reproducibility, and workflow governance from local to production |
| Metaflow Production Deployments | https://docs.metaflow.org/production/introduction | Reference for production deployment, event triggering, fresh results, caching, model serving, and failure recovery |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Use the Govern / Map / Measure / Manage language to organize risk, monitoring, and governance evidence for AI decisioning systems |
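1. Positioning: What Capability This Playbook Adds
Many AI product documents describe real-time decisioning as:
- Ingest the transaction stream.
- Call the model for a score.
- Block based on a threshold.
- Log the result.
That only describes a demo. What is actually hard in a retail financial production system is:
| Challenge | Production question | Architectural implication |
|---|---|---|
| Time semantics | Do training samples see only information known at decision time? | Requires point-in-time correctness, event_time, created_time, and late-arriving data controls |
| Consistency | Do the features used in training and in online inference share the same source and meaning? | Requires feature registry, feature contract, offline/online parity test |
| Freshness | Can payment fraud or Agent tool-risk decisions use sufficiently fresh behavioral signals? | Requires freshness SLO, streaming feature pipeline, online store TTL |
| Latency | Decisions must complete before authorization, payment, or a customer-service tool call | Requires low-latency feature serving, rule short-circuiting, fallback policy |
| Leakage | Does the model use fields known only in the future or generated after the decision? | Requires leakage review, feature availability timestamp, replay validation |
| Audit | Can the reason for a decision be reproduced for regulators, risk teams, or customer disputes? | Requires decision event, feature snapshot, model/rule/policy version, replay |
| Governance | Does a new feature change fairness, privacy, decline reasons, or customer impact? | Requires feature governance, release gates, risk sign-off |
In one sentence:
The product value of a real-time feature platform is not "storing more features"; it is letting high-risk decisions happen at the right time, on the right entity, with the right version, within latency and freshness constraints, in a way that is provable, replayable, and governable.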
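2. Capability Map: From Batch Model to Real-Time Decisioning Platform
| Level | Typical approach | Main risks | Mature framing |
|---|---|---|---|
| Batch scoring | Nightly batch jobs generate risk scores | Risk signals go stale; fast-moving fraud cannot be stopped | Suits low-frequency, low-timeliness scenarios such as monthly marketing lists |
| Near-real-time scoring | Features or scores refresh every few minutes | Window delay, backfill, and duplicate-event handling are poorly defined | Suits credit pre-approval and customer next-best-action |
| Real-time feature serving | Read the latest entity features online | Online/offline skew, unstable freshness | Suits payment fraud, account takeover, customer-service tool risk |
| Real-time decisioning | Features, models, rules, policies, and human escalation are orchestrated together | Unexplainable decisions, difficult audits, false blocks that hurt customers | Suits decisions on customer entitlements, funds, KYC, credit, and Agent tool-call risk |
| Closed-loop decision platform | Decisions, feedback, replay, monitoring, and governance form a closed loop | Feedback contamination, policy drift, missing regulatory evidence | Production-grade AI decisioning platform for risk, credit, payments, and operations |
Role perspectives:
| Role | Focus | Key outputs |
|---|---|---|
| AI Product Architect | Platform boundaries, capability reuse, business system integration, risk tiering | Reference architecture, ADR, rollout strategy |
| AI PM | Scenario priority, latency experience, false-positive cost, human fallback, metrics | Decision PRD, freshness SLO, release gate |
| Data Product | Feature owner, contracts, quality, lineage, data productization | Feature contract, quality SLO, monitoring pack |
| Risk Product | Thresholds, rules, models, reason codes, override, risk appetite | Decision policy, champion/challenger, override analysis |
| Platform Architect | offline/online store, streaming, serving, audit, reliability | C4, sequence, capacity, resilience plan |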
3. Real-Time Decisioning Reference Architecture
3.1 Architecture Diagram, Text Version
- Event sources
  - Payment authorization, login, device fingerprint, transactions, KYC upload, loan application, CRM interaction, Agent tool-call intent
- Event ingestion layer
  - API Gateway / Kafka / CDC / webhook / file drop
  - Responsibilities: schema validation, idempotency key, event_time, source timestamp, PII classification
- Stream processing layer
  - Flink / Spark Structured Streaming / Kafka Streams / managed streaming
  - Responsibilities: window aggregation, watermark, late event handling, entity enrichment, feature value emission
- Feature Platform
  - Feature Registry: feature definition, entity, owner, version, contract, permission, freshness SLO, lineage
  - Offline Store: historical features, point-in-time training retrieval, backfill, batch scoring
  - Online Store: low-latency lookup, TTL, latest feature values, serving freshness
  - Feature Server: standardized API, auth, rate limit, request trace, online feature retrieval
- Decisioning Layer
  - Decision Orchestrator: collects request context, online features, model score, rules, policy, reason codes
  - Model Serving: fraud / credit / KYC / NBA / tool-risk models, versioned and observable
  - Rules and Policy Engine: hard blocks, risk appetite, regulatory rules, manual review routing, fallback
  - Human Review Queue: high-risk exceptions, borderline cases, disputes, adverse action review
- Systems of Record
  - payment switch, core banking, loan origination, KYC case manager, CRM, contact center, AI tool gateway
- Observability and Governance
  - feature freshness, quality, latency, skew, model drift, decision outcomes, override, audit replay, incident loop
3.2 Mermaid View
```mermaid
flowchart TB
    E[Events: payment, KYC, credit, CRM, agent tool intent] --> I[Event ingestion and schema validation]
    I --> S[Streaming feature pipelines]
    I --> L[Immutable event log]
    S --> R[Feature registry and contracts]
    S --> O[Online feature store]
    S --> F[Offline feature store]
    F --> T[Point-in-time training dataset]
    O --> FS[Feature server]
    FS --> D[Decision orchestrator]
    T --> M[Model training and validation]
    M --> MS[Model serving]
    MS --> D
    D --> P[Rules and policy engine]
    P --> A[Allow / block / step-up / manual review / fallback]
    A --> B[Business systems]
    D --> G[Decision audit log]
    R --> G
    L --> RP[Replay and simulation]
    G --> RP
    RP --> V[Release gate and governance review]
    Mon[Freshness, skew, drift, latency, error budget monitoring] --> V
    Mon --> Inc[Incident and rollback]
```
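3.3 Component Responsibility Boundaries
| Component | Responsible for | Not responsible for |
|---|---|---|
| Event ingestion | Event structure, identity, idempotency, event_time, source metadata, data classification | Complex model decisions |
| Streaming feature pipeline | Window aggregation, entity enrichment, late event policy, feature emission | Deciding business blocking thresholds |
| Feature registry | Feature definitions, owner, version, contract, lineage, permissions, SLO, purpose | Replacing the data warehouse or model registry |
| Offline store | Historical features, training set construction, replay, batch scoring | Millisecond-level online decisioning |
| Online store | Low-latency reads, TTL, latest values, serving freshness | Complex historical joins |
| Feature server | API, authentication, rate limiting, trace, online reads | Directly executing high-risk business actions |
| Decision orchestrator | Aggregating request, features, model, rules, policy, reason codes, fallback | Owning all feature definitions |
| Rules and policy engine | Hard rules, human escalation, regulatory constraints, risk appetite | Training models |
| Audit and replay | Decision evidence, versions, snapshots, replay, simulation, dispute handling | Being limited to technical logging |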
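4. Entity / Time Semantics: The Underlying Contract of a Real-Time Feature Platform
4.1 An Entity Is More Than a Primary Key
Real-time decisioning in retail financial services usually involves several entities at once:
| Entity | Examples | Decision value |
|---|---|---|
| customer_id | Retail customer, cardholder, borrower | Historical behavior, risk rating, lifecycle |
| account_id | Deposit account, credit card account, loan account | Account balance, repayment, transaction patterns |
| card_id / token_id | Physical card, virtual card, wallet token | Payment authorization, card fraud, device migration |
| merchant_id | Merchant, acquiring merchant, platform storefront | Merchant risk, MCC, chargeback rate |
| device_id | Device fingerprint, browser, mobile device | Account takeover, KYC document upload risk |
| application_id | Loan application, account-opening application, KYC case | Decision context, document status, approval stage |
| agent_session_id | Customer-service Agent session, tool-call plan | Tool risk, privilege overreach, customer impact |
A mature design makes explicit:
- Whether a feature binds to a single entity or requires a multi-entity join.
- Who owns entity resolution, and at what confidence level.
- How to degrade when an online request is missing an entity.
- How historical features are handled after an entity merge or split.
- Whether high-risk entity relationships need a graph or link analysis as independent evidence.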
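4.2 Minimum Semantics for Time Fields
| Time field | Meaning | Consequence of misuse |
|---|---|---|
| event_time | When the business event actually occurred | Misaligned decision windows, training set leakage |
| ingestion_time | When the platform received the event | Network delay is mistaken for business delay |
| created_time | When the feature value or source record was generated | Historical training sets use features that were generated only later |
| available_time | When the feature became available to the decisioning system | Data delay is ignored, causing point-in-time errors |
| decision_time | When the model or rule made the decision | Audit cannot reproduce the decision |
| effective_time | When a policy, rule, or feature definition takes effect | Old and new rules get mixed |
| expiry_time | When a feature value, rule, or source expires | Stale risk signals get used |
Production-grade requirement:
- A training sample at decision_time T may only join feature values with available_time <= T.
- If a source event has event_time <= T but available_time > T, it must be excluded from training as well.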
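The rule above can be sketched as a small lookup. This is a minimal illustration, not a feature store API: the row layout `(event_time, available_time, value)`, the helper names, and the choice to pick the latest `event_time` among visible rows are all assumptions made for the example.

```python
from datetime import datetime, timezone

def ts(text):
    """Parse an ISO timestamp and treat it as UTC (helper for this example only)."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)

def point_in_time_value(feature_rows, decision_time):
    """Return the value visible at decision_time, or None if nothing was available yet.

    feature_rows: list of (event_time, available_time, value) tuples.
    Only rows with available_time <= decision_time are eligible; among those,
    the row with the latest event_time wins (an assumption of this sketch).
    """
    visible = [row for row in feature_rows if row[1] <= decision_time]
    if not visible:
        return None
    return max(visible, key=lambda row: row[0])[2]

rows = [
    (ts("2026-06-29T14:00:00"), ts("2026-06-29T14:00:01"), 1),
    (ts("2026-06-29T14:04:00"), ts("2026-06-29T14:04:02"), 2),
    # Event happened before T, but only became available after T (late arrival).
    (ts("2026-06-29T14:05:00"), ts("2026-06-29T14:06:30"), 3),
]
T = ts("2026-06-29T14:05:31")
value_at_T = point_in_time_value(rows, T)
```

Here the third row has `event_time <= T` but `available_time > T`, so the lookup returns the second row's value rather than the newest one.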
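4.3 Point-in-Time Correctness
The core of point-in-time correctness is not a SQL trick but business provability:
| Checkpoint | Passing behavior | High-risk failure |
|---|---|---|
| Label timing | The label occurs after the decision, with a clearly defined window | Chargeback outcomes contaminate fraud features retroactively |
| Feature availability | Training uses only features that were available at the time | Post-origination performance, manual review conclusions, or final KYC status used as pre-origination features |
| Historical join | Each entity takes the most recent available value as of decision_time | Taking the latest snapshot directly |
| Late-arriving data | Backfill policy does not change what was visible at the historical decision point | Replay sees data that was not available online at the time |
| Rule/model version | Replay uses the version in effect at the time | Explaining past decisions with current rules |
| Source version | Policies, blocklists, and feature definitions are resolved by validity period | Explaining a past block with a list that was updated later |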
5. Online / Offline Feature Consistency
5.1 Consistency Goals
One core value of a feature store is letting model training and online serving share feature definitions, lineage, and quality constraints.
| Consistency dimension | Training side | Serving side | Verification |
|---|---|---|---|
| Definition | feature view, SQL, UDF, window, filter conditions | materialized / pushed value uses the same contract | definition hash, contract review |
| Entity | historical entity dataframe | online lookup entity keys | entity completeness, join hit rate |
| Time | point-in-time join | request time latest available value | replay parity test |
| Transformation | batch transform | stream transform | golden entity comparison |
| Defaults | missing/null imputation | online fallback/default | null policy parity |
| Freshness | historical backfill | online TTL and freshness SLO | freshness monitor |
| Permission | training dataset access | serving request access | entitlement replay |
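5.2 Training-Serving Skew Taxonomy
| Skew type | Typical symptom | Retail financial impact | Control |
|---|---|---|---|
| Compute skew | Batch SQL and streaming job logic differ | Fraud score loses accuracy after launch | Single feature contract, shared test set, code generation or a common-source transform |
| Freshness skew | Training assumes availability at time T; online is actually delayed by 2 minutes | Payment fraud goes unblocked | available_time, freshness SLO, degraded decision policy |
| Missingness skew | Training samples are fully populated; online has many missing values | New and thin-file customers are misjudged | online missing policy, slice monitoring |
| Entity skew | Training keys on customer_id; online keys on account_id | Customer relationships are mismatched | entity mapping contract, join hit rate |
| Default skew | Training fills missing values with 0; online fills them with null or the previous value | Risk score shifts | default value registry, contract test |
| Policy skew | Training set uses the old blocklist; online uses the new one | Results cannot be reproduced | source versioning, policy effective_time |
| Timezone/clock skew | Systems use inconsistent time bases | Window aggregation errors | UTC normalization, clock drift monitor |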
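Several of these skew types can be surfaced by comparing offline recomputation against online values for a fixed set of golden samples. The sketch below is one possible shape for such a parity check; the function name, the `(entity_id, decision_time)` key layout, and the sample values are hypothetical.

```python
def parity_report(offline, online, tolerance=0.0):
    """Compare offline and online feature values for the same golden samples.

    offline / online: dict mapping (entity_id, decision_time) -> numeric value or None.
    Returns (parity_rate, mismatches), where each mismatch is (key, offline, online).
    A value of None on only one side counts as a mismatch, which is how
    default skew (for example 0 vs null) shows up.
    """
    keys = sorted(set(offline) | set(online))
    mismatches = []
    for key in keys:
        a, b = offline.get(key), online.get(key)
        if a is None or b is None:
            if a is not b:
                mismatches.append((key, a, b))
        elif abs(a - b) > tolerance:
            mismatches.append((key, a, b))
    rate = 1.0 - len(mismatches) / len(keys) if keys else 1.0
    return rate, mismatches

offline_values = {("card_1", "t1"): 3, ("card_2", "t1"): 0,
                  ("card_3", "t1"): None, ("card_4", "t1"): 2}
online_values = {("card_1", "t1"): 3, ("card_2", "t1"): None,
                 ("card_3", "t1"): None, ("card_4", "t1"): 2}
rate, diffs = parity_report(offline_values, online_values)
```

In this example `card_2` is filled with 0 offline but is null online, so it is reported as a mismatch even though both sides arguably mean "no events observed."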
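5.3 Feature Leakage Taxonomy
| Leakage type | Example | Why it is dangerous | Prevention |
|---|---|---|---|
| Future leakage | Using a chargeback outcome from T+7 days to predict payment fraud at time T | Offline metrics are inflated and the model fails online | point-in-time retrieval, label cutoff |
| Post-decision leakage | Using manual review conclusions as input to automated review | The model learns the human outcome rather than upstream signals | feature availability review |
| Target leakage | A feature directly encodes the target variable | Spuriously high AUC in credit, KYC, and fraud models | leakage audit, feature importance review |
| Operational leakage | Using "entered the manual queue" to predict high risk | The model replicates the biases of the old process | process feature review |
| Source leakage | A data-source field that exists only for bad samples | Sample selection bias | source coverage and missingness slice |
| Feedback leakage | Using outcomes observed after model blocks as labels for unblocked outcomes | The policy validates itself | exploration, reject inference, shadow labels |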
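6. Streaming Features and Freshness SLO
6.1 Streaming Feature Types
| Feature type | Example | Decision value |
|---|---|---|
| Rolling count | Failed payments on the same card within 5 minutes | Payment fraud, account takeover |
| Rolling amount | Total cross-border transaction amount within 10 minutes | Authorization risk control, anti-money-laundering leads |
| Velocity | Device switches within 1 hour | Login risk, KYC document risk |
| Distinct count | Customers linked to the same device within 30 minutes | Device farms, synthetic identity |
| Ratio | Failed transaction rate within 24 hours | Merchant risk, payment stability |
| Time since last event | Time since the last successful KYC upload | KYC case prioritization |
| Sequence pattern | Login, phone number change, and transfer initiation occur in succession | Account takeover |
| Agent action context | Number of customer data reads in the current session, risk of the proposed tool call | Customer-service Agent tool risk control |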
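6.2 Event-Time vs Processing-Time
| Option | Applies when | Risk |
|---|---|---|
| Event-time windows | Risk judgment depends on when the business event actually occurred | Requires watermark and late event policy |
| Processing-time windows | Only real-time processing after platform receipt matters | Network delay changes what the window means |
| Hybrid | The low-latency online path uses processing-time; replay and training use event-time + available_time | Online/offline differences must be clearly labeled |
Recommendations for financial scenarios:
- Payment authorization risk control treats low latency as the first constraint, but training and replay must record event_time, ingestion_time, and available_time.
- Credit pre-approval can usually tolerate second-to-minute latency, but must strictly control leakage of post-origination and manual outcomes.
- Agent tool-risk decisions usually need real-time in-session state, so their windows are closer to processing-time, but audit must still preserve the event sequence.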
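To make the event-time versus processing-time distinction concrete, the sketch below counts declines in a 5-minute window both ways and applies a simple watermark check. It is a plain-Python illustration under assumed names and integer epoch seconds, not the API of Flink or any other streaming engine.

```python
def rolling_count(timestamps, as_of, window_s=300):
    """Count timestamps t with as_of - window_s < t <= as_of."""
    return sum(1 for t in timestamps if as_of - window_s < t <= as_of)

def accept_for_online(event_time, max_event_time_seen, allowed_lateness_s=30):
    """Watermark check: watermark = max event time seen - allowed lateness.

    Events older than the watermark are too late to update the online value
    and, in this sketch, would only be picked up by offline replay.
    """
    return event_time >= max_event_time_seen - allowed_lateness_s

# (event_time, arrival_time) in epoch seconds; the first decline arrived ~4 minutes late.
declines = [(1095, 1320), (1290, 1291), (1310, 1312)]
as_of = 1400

event_time_count = rolling_count([e for e, _ in declines], as_of)       # by occurrence
processing_time_count = rolling_count([a for _, a in declines], as_of)  # by arrival
late_event_accepted = accept_for_online(1095, max_event_time_seen=1310)
```

The delayed event falls outside the event-time window but inside the processing-time window, which is the "network delay changes what the window means" risk from the table.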
6.3 Freshness SLO and Error Budget
Freshness is not a single metric. At minimum, break it down into:
| Metric | Definition | Example |
|---|---|---|
| Source lag | Time from source event occurrence to platform receipt | Payment event 99p < 500ms |
| Pipeline lag | Time from platform receipt to feature computation complete | velocity feature 99p < 800ms |
| Online materialization lag | Time from feature computation complete to readable in the online store | online write 99p < 300ms |
| Serving freshness | Age of the feature value read at decision time | card_decline_count_5m age 99p < 2s |
| Decision latency | Time from request arrival to returning allow/block/review | Payment authorization 99p < 120ms |
| Staleness rate | Share of requests exceeding the freshness threshold | < 0.1% per day |
Error budgets must be tied to business actions:
| Feature group | Freshness SLO | Error budget | Action when budget is exceeded |
|---|---|---|---|
| Payment fraud velocity | 99.9% requests age <= 2s | Daily stale requests <= 0.1% | Degrade to conservative rules mode; step-up for high-risk transactions |
| Credit pre-approval bureau enrichment | 99% records age <= 24h | Monthly stale applications <= 1% | Stop automatic pre-approval; route to manual review or re-pull |
| KYC document risk | 99% document signals age <= 5m | Daily stale cases <= 0.5% | Pause auto-pass; allow manual review only |
| Agent tool-risk session features | 99.5% session state age <= 1s | Daily stale tool calls <= 0.2% | High-risk tools require approval |
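7. Real-Time Decisioning Scenarios in Retail Financial Services
7.1 Real-Time Payment Fraud Blocking
| Dimension | Design points |
|---|---|
| Decision point | After the authorization request enters the payment switch and before approve / decline / step-up / manual review is returned |
| Entities | card_id, token_id, customer_id, merchant_id, device_id, ip, account_id |
| Real-time features | 5-minute failure count, 10-minute amount velocity, customers per device, merchant risk velocity, geo jump, new payee flag |
| Offline features | Customer historical risk rating, merchant historical chargeback rate, account lifecycle, historical disputed transaction ratio |
| Latency budget | Total decision latency is typically managed at the millisecond level; feature lookup must have a strict timeout |
| Fallback | On feature timeout, fall back to rules-only or step-up; high-risk requests are never approved by default |
| Audit | request context, feature values, feature age, model version, rule version, decision, reason code |
| Risk tradeoff | A false negative is fraud loss; a false positive costs customer experience, revenue, and complaints |
Advanced product judgment:
- Do not equate "high model score" with decline. Payment scenarios should design multi-action policies such as step-up, 3DS, limits, delayed release, and manual review.
- Monitor false-positive rates separately for new customers, thin-file customers, and cross-border scenarios; otherwise a real-time model will treat missing data as risk.
- Record decision arbitration when rules and models conflict, because post-hoc disputes usually ask "why was it not blocked at the time, or why was it blocked."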
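The fallback row and the multi-action guidance above could be combined roughly as follows. This is a hedged sketch: `FeatureRead`, `decide`, the score thresholds, and the `rules_only` action label are illustrative assumptions; only the 2-second freshness limit and the "high risk never defaults to approve" behavior come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeatureRead:
    value: Optional[float]   # None when the lookup timed out or returned nothing
    age_ms: Optional[int]    # age of the value at decision time

def decide(read, model_score, high_risk, max_age_ms=2000,
           step_up_at=0.80, decline_at=0.95):
    """Return the action and whether the fallback path was used.

    Thresholds are placeholders for illustration, not recommended values.
    """
    stale = read.value is None or read.age_ms is None or read.age_ms > max_age_ms
    if stale:
        # Feature timeout or staleness: never silently approve a high-risk request.
        action = "step_up_authentication" if high_risk else "rules_only"
        return {"action": action, "fallback_used": True}
    if model_score >= decline_at:
        action = "decline"
    elif model_score >= step_up_at:
        action = "step_up_authentication"
    else:
        action = "approve"
    return {"action": action, "fallback_used": False}

fresh_result = decide(FeatureRead(3, 1320), model_score=0.87, high_risk=True)
timeout_result = decide(FeatureRead(None, None), model_score=None, high_risk=True)
```

Returning `fallback_used` alongside the action keeps the degraded path visible in the audit record instead of hiding it behind a default.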
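7.2 Credit Pre-Approval
| Dimension | Design points |
|---|---|
| Decision point | When the customer browses products, enters an application, requests a limit increase, or before a marketing contact |
| Entities | customer_id, account_id, application_id, household_id |
| Real-time features | Recent income deposits, recent delinquency, account balance changes, recent hard inquiries, application frequency |
| Offline features | Credit history, income stability, product holdings, repayment history, behavior score |
| Freshness | Most features can be minute-level to day-level; high-risk credit bureau or internal delinquency status must be clearly versioned |
| Decision output | eligible / not eligible / invite to apply / manual review / insufficient data |
| Governance | Fairness, decline reasons, adverse action, explainability, human-review boundaries |
| Leakage risk | Using post-origination performance, approval conclusions, manual notes, or decline reasons as pre-origination features |
Advanced product judgment:
- Pre-approval is not a final credit decision. The PRD must specify customer-facing wording and avoid presenting marketing eligibility as a credit commitment.
- The feature contract must state which features may be used for eligibility, which only for internal ranking, and which must not be used as an adverse action reason.
- Replay validation must cover slices such as declined, thin-file, low-income, region, and channel; overall KS/AUC alone is not enough.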
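7.3 KYC Document Decisioning
| Dimension | Design points |
|---|---|
| Decision point | Document upload, OCR, authenticity detection, watchlist screening, case routing |
| Entities | customer_id, application_id, document_id, device_id, ip, beneficial_owner_id |
| Real-time features | Upload device change, document reuse, OCR confidence, liveness risk, multiple applications from the same device, sanctions screening status |
| Offline features | Customer KYC remediation history, country/industry risk, entity structure complexity, historical rate of missing documents |
| Decision output | auto-pass / request-more-info / enhanced due diligence / manual review / reject recommendation |
| Freshness | Document and list status need a traceable effective_time; sanctions/PEP source version must be auditable |
| Audit | Document version, OCR output, model score, rules, manual override, customer communication records |
| Leakage risk | Using the final manual KYC status to train the automated triage model that runs at upload time |
Advanced product judgment:
- The safe boundary for KYC automation is usually triage and evidence gathering, not unconstrained final rejection.
- For features related to sanctions, PEP, and adverse media, source version and match confidence must go into the audit log.
- Once document models, rules, and manual review form a closed loop, guard against learning only historical human bias.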
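7.4 Customer Next-Best-Action
| Dimension | Design points |
|---|---|
| Decision point | App home page, customer-service session, marketing contact, branch relationship-manager workbench |
| Entities | customer_id, household_id, channel_id, session_id |
| Real-time features | Current session intent, recent transactions, complaint status, service failures, channel activity, recent contacts |
| Offline features | Customer value, product holdings, lifecycle, preferences, risk restrictions, consent status |
| Decision output | recommend / suppress / defer / service-first / human follow-up |
| Governance | consent, fair treatment, suitability, vulnerability, complaint status, frequency capping |
| Freshness | Customer complaints, opt-out, and risk restrictions must take effect in near real time |
| Monitoring | uplift, complaint rate, opt-out rate, offer fatigue, protected slice |
Advanced product judgment:
- NBA is not just a recommender system. Retail financial services must put suitability, consent, complaint, vulnerability, and risk restriction into the policy layer.
- Some real-time signals should trigger "do not sell, serve first," for example a payment failure that just occurred or an escalated complaint.
- The feature platform must support suppress features and policy features, not only propensity features.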
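7.5 Customer-Service Agent Tool-Call Risk Decisioning
| Dimension | Design points |
|---|---|
| Decision point | Before the LLM Agent calls a read/write/external-send tool |
| Entities | agent_session_id, user_id, customer_id, case_id, tool_id, tenant_id |
| Real-time features | Current session sensitivity, number of customer fields already read, tool risk level, prompt injection signal, DLP hit, approval history |
| Offline features | User role, historical permissions, tool catalog, customer risk, case type, policy version |
| Decision output | allow / redact_then_allow / dry_run / require_approval / deny / kill_switched |
| Freshness | Session state and kill switch must take effect within seconds |
| Audit | tool proposal, arguments, features, policy decision, approver, tool result summary |
| Leakage risk | Treating a model suggestion as authorization, or letting instructions inside a tool result trigger the next tool call |
Advanced product judgment:
- "Features" here are not only traditional ML features; they are also policy decision context.
- Agent risk decisioning needs low-latency online features working together with a rules engine; it should not rely on prompt guardrails alone.
- Any tool call involving outbound sending, funds, customer entitlements, or regulatory records must be replayable, with provable reasons for why it was allowed or blocked at the time.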
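One way to picture session features feeding a rules engine is a small ordered policy function. The rule order, parameter names, and tool kinds below are hypothetical and chosen only to exercise a subset of the decision outputs listed in the table; a real policy would be defined and versioned by the risk owner.

```python
def tool_decision(tool_kind, kill_switch_on, injection_signal, dlp_hit, approved):
    """Map session-level signals to a decision output for a proposed tool call.

    tool_kind: "read", "write", or "external_send" (labels assumed for this sketch).
    Rules are evaluated in priority order; the first match wins.
    """
    if kill_switch_on:
        return "kill_switched"
    if injection_signal and tool_kind != "read":
        # A suspected prompt injection must not drive a state-changing or outbound call.
        return "deny"
    if tool_kind == "external_send" and not approved:
        return "require_approval"
    if dlp_hit:
        return "redact_then_allow"
    return "allow"

send_without_approval = tool_decision("external_send", False, False, False, approved=False)
read_with_dlp_hit = tool_decision("read", False, False, True, approved=False)
```

Because the function is deterministic over explicit inputs, storing those inputs with the policy version is enough to replay why a call was allowed or blocked.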
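8. Product Decision: When a Real-Time Feature Store Is Needed
8.1 When a Real-Time Feature Store Is Not Needed
| Scenario | Better-suited approach |
|---|---|
| Daily batch marketing lists | Batch feature table + campaign rules |
| Low-risk internal reports | Warehouse semantic layer + BI metrics |
| Few features with no reuse | Service-local cache + contract tests |
| Pure document Q&A | RAG source governance + retrieval eval |
| No clear decision action | Do decision discovery first; do not rush to build a real-time platform |
8.2 Signals That Trigger Platformization
| Trigger signal | Implication |
|---|---|
| Multiple models re-implement the same feature | Features need a registry, owner, reuse, and quality management |
| Training and online feature logic frequently diverge | Needs offline/online consistency and parity tests |
| Decisions are sensitive to feature freshness | Needs streaming features, freshness SLO, online store |
| High-risk decisions require audit replay | Needs immutable event log, feature snapshot, decision trace |
| Features involve PII, credit, KYC, or payment risk | Needs feature governance, permission, retention, risk review |
| Multiple teams share entity and time semantics | Needs entity registry, feature contracts, data product model |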
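8.3 Feast / Build In-House / Commercial Platform Tradeoffs
| Option | Good fit | Risk |
|---|---|---|
| Feast-style open-source feature store | The team has platform engineering capability and needs an open architecture with controllable integration | Governance, UI, SLO, permissions, and operations must be built in-house |
| In-house lightweight feature platform | Narrow scenario, simple architecture, and the team needs fast control of the critical path | Easily evolves into a shadow platform that lacks registry and governance |
| Commercial feature platform | Multi-team, multi-cloud, multi-governance requirements, with a wish to shorten the platform build cycle | Vendor lock-in and integration complexity; financial audit requirements remain an internal responsibility |
| Warehouse + online cache | Mostly batch features with low-frequency online reads | point-in-time, streaming, parity, and SLO need additional design |
The selection question is not "who has more functionality" but:
- Can you prove that training and online features mean the same thing?
- Can you replay by decision_time?
- Can you define and meet a freshness SLO?
- Can you govern PII, permissions, lineage, owner, purpose, and risk?
- During an incident, can you locate which layer failed: feature, model, rule, or policy?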
9. Governance Model
9.1 Feature Lifecycle
| Stage | Key actions | Evidence |
|---|---|---|
| Propose | State business purpose, entity, time, source, risk, and expected decision impact | Feature proposal |
| Contract | Define computation logic, freshness, quality, permission, retention, allowed use | Feature contract |
| Build | Implement batch/stream transform, tests, lineage, registry entry | CI result, data quality report |
| Validate | offline/online parity, leakage review, replay, slice impact | Validation report |
| Approve | Sign-off by PM, Data owner, Risk, Compliance, Architect | Governance review |
| Serve | materialize/push to the online store and connect to the feature server | Serving readiness |
| Monitor | freshness, quality, latency, skew, drift, decision outcome | Dashboard and alerts |
| Deprecate | Mark the replacement feature, stop new dependencies, retain audit replay | Deprecation record |
9.2 NIST AI RMF Mapping
| AI RMF function | Where it lands in a real-time decisioning platform |
|---|---|
| Govern | feature ownership, decision authority, risk appetite, approval workflow, audit responsibility |
| Map | Scenario, customer impact, data sources, entity/time, decision actions, failure modes, affected populations |
| Measure | freshness, skew, leakage, latency, model quality, false positive/negative, override, incident |
| Manage | release gate, fallback, manual review, kill switch, rollback, feature deprecation, post-incident replay |
9.3 Governance Principles for High-Risk Features
| Principle | Explanation |
|---|---|
| Purpose-bound | A feature must declare its allowed use, for example fraud detection, credit eligibility, customer service risk |
| Time-aware | Every feature must have event_time, available_time, or explicit snapshot semantics |
| Owner-backed | Business owner, data owner, technical owner, and risk owner must all be present |
| Explainable enough | Features that affect customer entitlements must be able to produce a reason code or enter the explanation chain |
| Privacy-aware | PII, PCI, sensitive identity, credit, KYC, and AML data must be tagged and minimized |
| Replayable | Feature values used in high-risk decisions must be reproducible or provable in audit |
| Degradable | When a feature is unavailable there is an explicit fallback; the model is not left to guess |
10. Ready-to-Use Deliverable Templates
The templates below are all filled with concrete examples. When copying them into a project you may replace business names and thresholds, but do not delete the owner, time, freshness, audit, or risk fields.
10.1 Real-Time Decisioning Architecture Diagram, Text Version
- Use case: Real-time payment fraud blocking
- Decision SLA: total authorization decision 99p <= 120ms; feature lookup 99p <= 20ms; model scoring 99p <= 35ms
- Event sources: payment_authorization, card_decline, device_fingerprint, merchant_profile, customer_profile, chargeback_case
- Entities: card_id, token_id, customer_id, merchant_id, device_id, account_id
- Streaming features: card_decline_count_5m, card_amount_sum_10m, device_distinct_customer_count_30m, merchant_high_risk_auth_count_10m, geo_velocity_score_1h
- Offline features: customer_lifetime_dispute_rate_180d, merchant_chargeback_rate_90d, account_age_days, customer_risk_segment, prior_fraud_case_count_365d
- Feature platform:
  - Registry stores feature contracts, owners, versions, allowed uses, freshness SLO, lineage.
  - Offline store supports point-in-time retrieval for training and replay.
  - Online store serves latest features with TTL and feature age.
  - Feature server enforces auth, timeout, trace id and rate limit.
- Decision orchestration:
  - Request context + online features + fraud model score + hard rules + policy constraints.
  - Output: approve, decline, step_up_authentication, manual_review.
  - Fallback: if velocity features stale, use conservative rule set and step-up for high-risk slices.
- Audit and replay:
  - Store request id, entity ids, feature values, feature age, model version, rule version, policy version, decision output, reason codes, downstream action, feedback label.
10.2 Feature Contract: Payment Fraud Velocity Feature
| Field | Example |
|---|---|
| Feature name | card_decline_count_5m |
| Business purpose | Real-time fraud blocking and step-up routing for payment authorization |
| Allowed use | fraud detection, payment authorization risk, fraud model training, audit replay |
| Disallowed use | credit eligibility, marketing targeting, customer value ranking |
| Primary entity | card_id |
| Secondary entity | token_id, customer_id |
| Source events | payment_authorization with status declined |
| Event time | payment_authorization.event_time_utc |
| Available time | max of ingestion timestamp and streaming output timestamp |
| Window | Rolling 5 minutes, event-time based, watermark 30 seconds |
| Aggregation | Count declined authorizations for same card_id excluding system test transactions |
| Late event policy | Late events within 30 seconds update online value; later events only affect offline replay |
| Null policy | Missing value means no event observed; default 0 with is_missing=false |
| TTL | 10 minutes in online store |
| Freshness SLO | 99.9% online reads feature age <= 2 seconds |
| Offline retrieval | Point-in-time join by decision_time, using available_time <= decision_time |
| Online serving | Feature server returns value, feature timestamp, feature age, contract version |
| Data classification | Customer financial behavior; confidential; no external sharing |
| Owner | Payment Risk Data Product Owner |
| Risk owner | Fraud Strategy Lead |
| Technical owner | Real-Time Feature Platform Team |
| Quality tests | non-negative integer, 99.9% not null, drift alert on 7-day percentile shift > 30% |
| Parity tests | 1,000 golden card_id/time samples compare streaming output vs offline recomputation |
| Leakage control | Excludes chargeback label, manual review result, post-decision dispute status |
| Audit fields | feature_name, value, event_time, available_time, contract_version, source_event_count |
| Review cadence | Monthly risk review; immediate review after fraud incident or source schema change |
10.3 Freshness and Error Budget
| Item | Definition |
|---|---|
| Feature group | Payment fraud velocity features |
| Business impact | Stale features may miss rapid fraud bursts or cause overly conservative step-up |
| SLO window | Daily, calculated by decision requests |
| Primary SLO | 99.9% of online feature reads have feature_age <= 2 seconds |
| Secondary SLO | Feature server 99p latency <= 20ms; online store read error rate <= 0.05% |
| Error budget | Stale feature reads > 2 seconds must stay <= 0.1% per day |
| Burn alert | 2-hour rolling stale rate >= 0.05% triggers warning; >= 0.1% triggers incident |
| Degraded mode | High-risk transactions route to step-up; low-risk transactions use rules-only score |
| Stop condition | Stale rate >= 0.5% for 15 minutes disables model path for affected region |
| Recovery condition | Freshness SLO met for 30 minutes, replay confirms no material missed fraud cluster |
| Owner | Real-Time Feature Platform on-call plus Payment Risk on-call |
| Audit evidence | Alert id, affected features, affected entity count, decision fallback count, replay result |
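The burn alert and stop condition rows can be expressed as two small checks. The thresholds (0.05%, 0.1%, and 0.5% for 15 minutes) come from the table above; the function names and the per-minute rate list are assumptions of this sketch.

```python
def burn_level(stale_reads, total_reads):
    """Classify a rolling-window stale rate against the burn alert thresholds."""
    if total_reads == 0:
        return "no_traffic"
    rate = stale_reads / total_reads
    if rate >= 0.001:      # >= 0.1% triggers an incident
        return "incident"
    if rate >= 0.0005:     # >= 0.05% triggers a warning
        return "warning"
    return "ok"

def should_disable_model_path(per_minute_stale_rates, threshold=0.005, minutes=15):
    """Stop condition: stale rate >= 0.5% for each of the last 15 minutes."""
    recent = per_minute_stale_rates[-minutes:]
    return len(recent) == minutes and all(r >= threshold for r in recent)

level = burn_level(stale_reads=7, total_reads=10_000)        # 0.07% over the window
stop = should_disable_model_path([0.006] * 15)               # sustained 0.6%
```

Keeping the stop condition separate from the alert level mirrors the table: alerts page people, while the stop condition changes the decision path.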
10.4 Release Gates
| Gate | Pass evidence | Blocking failure |
|---|---|---|
| Feature contract | Contract includes entity, time, source, owner, allowed use, freshness, leakage control | Missing owner, missing time semantics, allowed use unclear |
| Point-in-time validation | 10,000 historical decisions replay with available_time <= decision_time | Any future feature used in high-risk sample |
| Offline/online parity | Golden sample parity >= 99.5%, all differences explained | Streaming and batch definitions diverge on critical feature |
| Freshness SLO | 7-day load test meets SLO and error budget | No degraded mode or stale rate above threshold |
| Latency | Decision path 99p within scenario SLA under peak load | Timeout causes silent default approve for high-risk request |
| Leakage review | Feature list reviewed for target, post-decision and operational leakage | Manual review result or future label appears in model input |
| Monitoring | Dashboards and alerts for freshness, nulls, drift, skew, latency, decisions, overrides | No alert owner or no incident route |
| Audit replay | Sample decisions replay to same decision or documented tolerance | Cannot reconstruct feature values, model version or rules |
| Governance | PM, Data owner, Risk, Compliance, Architect approve release scope | Risk owner rejects or customer-impact boundary unclear |
| Rollback | Feature disable flag, model rollback, rules fallback tested | Rollback requires code deploy during incident |
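The first gate in this table lends itself to an automated check in CI. The sketch below assumes a contract represented as a plain dictionary with hypothetical field keys; it only verifies presence of the fields the gate names, not their quality.

```python
REQUIRED_CONTRACT_FIELDS = (
    "entity", "event_time", "available_time", "source",
    "owner", "allowed_use", "freshness_slo", "leakage_control",
)

def contract_gate(contract):
    """Return pass/fail plus the list of missing or empty required fields."""
    missing = [f for f in REQUIRED_CONTRACT_FIELDS if not contract.get(f)]
    return {"passed": not missing, "missing": missing}

draft_contract = {
    "entity": "card_id",
    "event_time": "payment_authorization.event_time_utc",
    "available_time": "max(ingestion timestamp, streaming output timestamp)",
    "source": "payment_authorization with status declined",
    "owner": "",  # blocking failure: missing owner
    "allowed_use": ["fraud detection", "audit replay"],
    "freshness_slo": "99.9% online reads feature age <= 2 seconds",
    "leakage_control": "Excludes chargeback label and manual review result",
}
gate_result = contract_gate(draft_contract)
```

A check like this catches the "missing owner" and "missing time semantics" blocking failures early; whether the allowed use is clear enough still needs human review.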
10.5 Replay Validation Plan
| Step | Execution | Evidence |
|---|---|---|
| 1 | Select 30 days of historical payment authorization events and decisions | Replay cohort manifest with date range, regions, channels, risk slices |
| 2 | Rebuild feature values using event log and available_time <= decision_time | Offline feature table with contract version and computation hash |
| 3 | Compare rebuilt features against stored decision-time feature snapshot | Parity report by feature, entity, channel, timestamp bucket |
| 4 | Run current candidate model and rules in shadow mode on historical requests | Shadow decision table with score, reason codes and proposed action |
| 5 | Compare candidate decisions to historical outcomes and human overrides | False positive / false negative / step-up impact by slice |
| 6 | Inject freshness degradation and missing-feature scenarios | Resilience report showing fallback decisions and customer impact |
| 7 | Replay known fraud incidents and near misses | Incident replay memo with missed signal, new feature contribution and residual risk |
| 8 | Produce release recommendation | Pilot / limited release / no-go memo with risk acceptance owner |
Example replay pass criteria:
| Metric | Threshold |
|---|---|
| Critical point-in-time violation | 0 |
| Critical feature leakage | 0 |
| Feature parity for top 20 high-impact features | >= 99.5% |
| Stored decision replay reconstruction | >= 99% exact or documented deterministic tolerance |
| High-risk stale fallback correctness | 100% routes to step-up, manual review or deny according to policy |
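These thresholds can be evaluated mechanically once a replay run has produced its metrics. The metric key names below are hypothetical; the thresholds are the ones in the table, expressed as fractions.

```python
def replay_passes(metrics):
    """Evaluate replay metrics against the example pass criteria.

    Returns (passed, failed_checks). Rates are fractions in [0, 1].
    """
    checks = {
        "point_in_time": metrics["critical_pit_violations"] == 0,
        "leakage": metrics["critical_feature_leakage"] == 0,
        "parity_top20": metrics["top20_feature_parity"] >= 0.995,
        "reconstruction": metrics["decision_reconstruction"] >= 0.99,
        "stale_fallback": metrics["high_risk_stale_fallback_correct"] == 1.0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)

run = {
    "critical_pit_violations": 0,
    "critical_feature_leakage": 0,
    "top20_feature_parity": 0.9972,
    "decision_reconstruction": 0.985,   # below the 99% reconstruction threshold
    "high_risk_stale_fallback_correct": 1.0,
}
passed, failed_checks = replay_passes(run)
```

Returning the names of failed checks, rather than a bare boolean, gives the release memo something concrete to cite.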
10.6 Monitoring Checklist
| Monitor | Signal | Slice |
|---|---|---|
| Feature freshness | feature_age p50/p95/p99, stale rate, online TTL expiration | feature group, region, channel, entity type |
| Feature quality | null rate, range violation, enum drift, negative value, outlier rate | new vs existing customer, product, merchant category |
| Online/offline skew | golden sample parity, batch vs stream delta | feature, window, source system |
| Serving reliability | feature server latency, timeout, error rate, cache hit rate | API client, region, risk tier |
| Decision quality | approve/block/step-up/manual-review rate | channel, merchant, customer segment, protected slice where legally appropriate |
| Model drift | score distribution, calibration, PSI/CSI, reason-code shift | model version, feature group |
| Business outcome | confirmed fraud, chargeback, false positive complaint, approval conversion | cohort, decision action |
| Override | human override rate, override reason, override outcome | reviewer team, scenario |
| Governance | contract violations, unapproved feature use, stale owner review | feature owner, platform team |
| Audit health | missing trace field, replay failure, version lookup failure | decision service, model, rules engine |
| Incident | SLO burn, kill switch activation, fallback count | feature group, workflow, tenant |
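The model drift row mentions PSI. A commonly used formulation of the population stability index sums `(actual - expected) * ln(actual / expected)` over bins; the sketch below follows that formulation, with the epsilon floor for empty bins and the bin counts being assumptions of the example.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    expected_counts / actual_counts: counts per bin, same binning on both sides.
    Proportions are floored at eps so empty bins do not cause log(0).
    """
    total_e, total_a = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / total_e, eps)
        pa = max(a / total_a, eps)
        value += (pa - pe) * math.log(pa / pe)
    return value

baseline_bins = [50, 30, 20]   # score distribution at training time
current_bins = [20, 30, 50]    # score distribution observed online
stable = psi(baseline_bins, baseline_bins)
shifted = psi(baseline_bins, current_bins)
```

Identical distributions give a PSI of zero, and the value grows as mass moves between bins; alert thresholds should be set per model and slice rather than copied from a rule of thumb.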
10.7 Governance Review Form
| Review area | Question | Payment fraud example answer | Decision |
|---|---|---|---|
| Business purpose | Which explicit decision action does this feature or model serve? | Authorization risk blocking and step-up routing | Approved |
| Customer impact | What customer impact could it cause? | False payment declines, extra authentication, transaction delay | Approved with FP monitoring |
| Data source | Is the source authoritative and does it have an owner? | Payment switch event stream, owner Payment Platform | Approved |
| Time semantics | Are event_time, available_time, and decision_time defined? | All three are in the contract and audit trace | Approved |
| Leakage | Does it include future, post-origination, manual-conclusion, or target-variable data? | Excludes chargeback label and manual review result | Approved |
| Privacy | Does it involve PII/PCI/sensitive financial behavior? | Uses tokenized card_id; no PAN in feature store | Approved with DLP evidence |
| Fairness | Could it adversely affect specific groups? | Monitor false blocks for new, cross-border, and thin-file customers | Approved with monthly slice review |
| Explainability | Can it produce a reason code or a human explanation? | velocity, geo jump, merchant risk as reason factors | Approved |
| Freshness | Does the SLO match the business action? | 99.9% <= 2s; stale routes to step-up | Approved |
| Replay | Can the decision be reproduced in a dispute? | feature snapshot + event replay + model/rule version | Approved |
| Operations | Who responds to SLO incidents? | Feature platform on-call + Payment Risk on-call | Approved |
| Scope | Is the release scope limited? | Card-not-present US region controlled release | Approved |
10.8 Decision Audit Schema
```yaml
decision_event:
  decision_id: "payauth-2026-06-29-00048192"
  use_case: "payment_fraud_authorization"
  decision_time_utc: "2026-06-29T14:05:31.830Z"
  request_context:
    channel: "card_not_present"
    amount_currency: "USD"
    amount_value: 284.90
    merchant_category: "electronics"
  entities:
    card_id_hash: "card_hash_8f10"
    customer_id_hash: "cust_hash_91ab"
    merchant_id_hash: "m_hash_3e52"
    device_id_hash: "dev_hash_77cc"
  features:
    - name: "card_decline_count_5m"
      value: 3
      feature_time_utc: "2026-06-29T14:05:30.510Z"
      feature_age_ms: 1320
      contract_version: "v4"
    - name: "device_distinct_customer_count_30m"
      value: 5
      feature_time_utc: "2026-06-29T14:05:29.980Z"
      feature_age_ms: 1850
      contract_version: "v2"
  model:
    model_name: "payment_fraud_rt"
    model_version: "2026-06-20-champion"
    score: 0.87
    threshold_band: "step_up"
  rules:
    rule_policy_version: "fraud-policy-2026-06-15"
    triggered_rules:
      - "velocity_high"
      - "new_device_high_amount"
  decision:
    action: "step_up_authentication"
    reason_codes:
      - "recent_decline_velocity"
      - "new_device_pattern"
    fallback_used: false
  audit:
    trace_id: "trace_42cb6"
    feature_server_latency_ms: 14
    model_latency_ms: 27
    policy_latency_ms: 5
    replayable: true
```
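In this schema `feature_age_ms` is derivable from `decision_time_utc` and `feature_time_utc`, so an audit health check can recompute it and flag inconsistent records. The helper names below are hypothetical; the timestamps are the ones from the example event.

```python
from datetime import datetime, timedelta

def parse_utc(text):
    """Parse an ISO 8601 timestamp with a trailing 'Z' as a UTC-aware datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))

def feature_age_ms(decision_time_utc, feature_time_utc):
    """Age of a feature value at decision time, in whole milliseconds."""
    delta = parse_utc(decision_time_utc) - parse_utc(feature_time_utc)
    return delta // timedelta(milliseconds=1)

decision_time = "2026-06-29T14:05:31.830Z"
age_card_decline = feature_age_ms(decision_time, "2026-06-29T14:05:30.510Z")
age_device_count = feature_age_ms(decision_time, "2026-06-29T14:05:29.980Z")
```

Both recomputed ages match the stored `feature_age_ms` values in the example (1320 and 1850), and both fall within the 2-second freshness SLO.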
11. 30 天训练计划
| Day | 主题 | 任务 | 产出 |
|---|---|---|---|
| 1 | Use case framing | Pick one scenario from payment fraud, credit pre-approval, KYC, or agent tool risk; write up the decision point and customer impact | decision-use-case-brief.md |
| 2 | Decision taxonomy | Define the business meaning of allow / block / step-up / manual review / fallback | decision-action-taxonomy.md |
| 3 | Entity model | List the relationships among entities such as customer, account, card, merchant, device, application, and session | entity-model.md |
| 4 | Time semantics | Define event_time, available_time, decision_time, effective_time, and expiry_time | time-semantics-note.md |
| 5 | Source inventory | Inventory event sources, batch sources, policy sources, label sources, and manual review sources | source-inventory.md |
| 6 | Feature candidate review | Annotate 20 candidate features with purpose, risk, leakage potential, and freshness requirement | feature-candidate-review.md |
| 7 | Leakage review | Identify future, post-decision, target, operational, and feedback leakage | leakage-review.md |
| 8 | Feature contract | Write contracts for 3 high-impact features, covering entity/time/source/freshness/audit | feature-contract-pack.md |
| 9 | Architecture diagram | Draw the architecture covering events, streaming, offline/online store, decisioning, audit, and monitoring | realtime-decision-architecture.md |
| 10 | Offline retrieval | Design the construction rules for the point-in-time training dataset | pit-training-dataset-spec.md |
| 11 | Online serving | Design the feature server API plus its timeout, default, auth, and trace fields | online-serving-spec.md |
| 12 | Streaming design | Define rolling windows, watermark, late event policy, and TTL | streaming-feature-design.md |
| 13 | Freshness SLO | Write the SLO, error budget, degraded mode, and recovery for each feature group | freshness-error-budget.md |
| 14 | Decision orchestration | Define how models, rules, policy, fallback, and human escalation combine | decision-orchestration-spec.md |
| 15 | Audit schema | Write the decision event schema, including feature snapshot and model/rule/policy version | decision-audit-schema.md |
| 16 | Replay cohort | Choose historical sample slices: time, channel, customer, merchant, risk tier | replay-cohort-manifest.md |
| 17 | Replay validation | Design offline recomputation, online snapshot comparison, and shadow decision comparison | replay-validation-plan.md |
| 18 | Parity tests | Design consistency tests for batch vs stream, offline vs online, and default policy | parity-test-pack.md |
| 19 | Monitoring dashboard | Define freshness, quality, skew, latency, decision, override, and audit metrics | monitoring-metric-pack.md |
| 20 | Incident workflow | Write response procedures for stale feature, online store outage, leakage discovery, and bad threshold | decisioning-incident-runbook.md |
| 21 | Governance review | Complete the governance review form and assign responsibilities across PM, Data, Risk, Compliance, and Architect | governance-review-record.md |
| 22 | Release gate | Write the four gate levels: prototype, shadow, pilot, production | release-gate-spec.md |
| 23 | Fallback design | Design degradation for feature timeout, model timeout, and rule engine failure | fallback-and-resilience-plan.md |
| 24 | Fairness and slice review | Design monitoring for slices such as customer, channel, region, thin-file, and new customer | slice-impact-review.md |
| 25 | Reason codes | Define the mapping among model score, triggered rules, customer communication, and internal explanation | reason-code-mapping.md |
| 26 | Champion/challenger | Design the shadow model, threshold experiment, and risk appetite evaluation | champion-challenger-plan.md |
| 27 | Platform roadmap | Distinguish scenario-level implementation, shared feature platform, and enterprise decisioning platform | platform-roadmap.md |
| 28 | Architecture ADR | Write an ADR on whether to adopt a Feast-style feature store, streaming platform, and rules engine | feature-platform-adr.md |
| 29 | Portfolio case | Assemble problem, architecture, contracts, SLO, replay, governance, and business impact | portfolio-case-study.md |
| 30 | Interview pack | Prepare 30-second, 2-minute, CTO, PM, Risk, and Data Product deep-dive answers | interview-answer-pack.md |
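30-day completion criteria:
- Can draw the end-to-end architecture of a real-time feature and decisioning platform.
- Can explain how entity, time, and available_time affect point-in-time correctness.
- Can write a feature contract and distinguish allowed use from disallowed use.
- Can design a freshness SLO, error budget, degraded mode, and recovery condition.
- Can identify feature leakage and training-serving skew.
- Can validate online decisions with replay rather than looking only at offline AUC.
- Can present a feature store launch as a product, architecture, risk, audit, and operations capability.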
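12. Interview Answers
12.1 30-Second Version
The key to a real-time feature platform is not putting features into Redis; it is keeping model training and online decisions consistent on entity, time, definition, freshness, and permissions. I would use a feature registry to manage feature contracts, an offline store for point-in-time training retrieval and replay, an online store for low-latency serving, and a streaming pipeline to produce freshness-sensitive features, then a decision orchestrator to combine features, models, rules, policy, and human escalation. The release gate requires no feature leakage, offline/online parity within target, a freshness SLO with an error budget, replayable decisions, and verified monitoring and fallback.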
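12.2 2-Minute Version
I break real-time decisioning into four layers.
The first layer is time and entity semantics. Every feature must define its entity key, event_time, available_time, decision_time, TTL, and allowed use. The training set may only use information that was already available at decision time; otherwise offline metrics are inflated by leakage of future information.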
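The second layer is the feature platform. The feature registry manages owner, contract, version, lineage, freshness SLO, and permissions; the offline store handles point-in-time retrieval, backfill, batch scoring, and replay; the online store handles low-latency lookup; the feature server handles authentication, rate limiting, tracing, and returning feature age.
The third layer is decisioning. When a real-time request arrives, the decision orchestrator fetches online features, calls the model service, executes rules and policy, and outputs allow, block, step-up, manual review, or fallback. Payment fraud emphasizes millisecond-level latency and velocity features; credit pre-approval emphasizes leakage control, fairness, and reason codes; KYC emphasizes documentary evidence, screening-list versions, and the boundary of manual review; agent tool risk emphasizes session state, DLP, approval, and audit.
The fourth layer is governance and operations. Before launch, run offline/online parity, leakage review, freshness load test, historical replay, shadow decision, and governance review. After launch, monitor freshness, nulls, drift, skew, latency, decision outcome, override, and audit replay health. When a high-risk feature exceeds its error budget, the system should not silently approve; it should move to step-up, manual review, or rule-based degradation.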
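The error-budget idea in the fourth layer can be made concrete with a small calculation. The sketch below assumes a freshness SLO expressed as the fraction of lookups that must be fresh within a window; the function name `freshness_budget_status`, the 0.999 target, and the lookup counts are hypothetical illustrations, not values from this playbook.

```python
def freshness_budget_status(total_lookups: int, stale_lookups: int, slo_target: float) -> dict:
    """Summarize error-budget consumption for one feature group over one window.

    slo_target is the fraction of lookups that must be fresh, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of stale lookups the SLO tolerates in this window.
    allowed_stale = (1.0 - slo_target) * total_lookups
    if allowed_stale > 0:
        burn_ratio = stale_lookups / allowed_stale
    else:
        burn_ratio = 0.0 if stale_lookups == 0 else float("inf")
    exhausted = burn_ratio >= 1.0
    return {
        "allowed_stale": allowed_stale,
        "burn_ratio": burn_ratio,
        # Once the budget is spent, the decision path should leave normal mode
        # (step-up, manual review, or rule-based degradation), never silent approve.
        "mode": "degraded" if exhausted else "normal",
    }

# Hypothetical window: one million lookups against a 99.9% freshness target.
print(freshness_budget_status(1_000_000, 500, 0.999)["mode"])
print(freshness_budget_status(1_000_000, 1_500, 0.999)["mode"])
```

A production setup would more likely alert on burn rate over several windows than on a single ratio; the single-window form is kept here only to show the arithmetic.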
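12.3 CTO Deep Dive
Q: What is the essential difference between a feature store and an ordinary data warehouse?
A: A data warehouse serves analytics and batch processing; a feature store serves reusable features for training and serving. The difference is that a feature store must manage entity/time semantics, point-in-time retrieval, online serving, offline/online consistency, freshness, feature contracts, and model serving integration. In real-time decisioning it must also return feature age, contract version, and trace to support audit and replay.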
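Point-in-time retrieval, one of the capabilities named in this answer, can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: `FeatureRow`, `point_in_time_value`, and the integer millisecond timestamps are hypothetical, and a real offline store would do this as a join over large tables rather than a list scan.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class FeatureRow:
    entity_key: str
    event_time: int      # when the underlying event happened (epoch ms)
    available_time: int  # when the value became visible to serving (epoch ms)
    value: float

def point_in_time_value(rows: List[FeatureRow], entity_key: str,
                        decision_time: int) -> Optional[float]:
    """Latest value (by event_time) that was already available at decision_time."""
    visible = [r for r in rows
               if r.entity_key == entity_key and r.available_time <= decision_time]
    if not visible:
        return None
    return max(visible, key=lambda r: r.event_time).value

ROWS = [
    FeatureRow("card_1", event_time=50, available_time=60, value=1.0),
    # Happened at t=100 but only landed at t=250 (late-arriving data).
    FeatureRow("card_1", event_time=100, available_time=250, value=2.0),
]

print(point_in_time_value(ROWS, "card_1", decision_time=200))
print(point_in_time_value(ROWS, "card_1", decision_time=300))
```

Filtering on `event_time` alone would hand the second row to a decision made at t=200, even though serving could not have seen it yet. Filtering on `available_time` is what keeps the training set aligned with what was known at decision time.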
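Q: How do you avoid training-serving skew?
A: I treat skew as a release gate rather than relying on verbal agreement. Concretely: the feature contract is the single source of definition, and batch and stream use the same semantics; the training set does a point-in-time join on available_time; online serving returns the feature timestamp and age; golden entity/time samples are used to compare offline recomputation, stream output, and online snapshot; the default/null policy is also part of the contract. If any critical feature fails parity, it does not go to production.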
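The golden-sample comparison described in this answer could look like the sketch below. It assumes both sides have been reduced to dictionaries keyed by `(entity_key, decision_time, feature_name)`; the function name `parity_report`, the tolerance, and the sample values are hypothetical.

```python
import math

_MISSING = object()  # distinguishes "key absent" from an explicit null value

def parity_report(offline: dict, online: dict, rel_tol: float = 1e-6):
    """Compare offline recomputation with online snapshots on golden samples.

    Returns (match_rate, sorted list of mismatching keys).
    """
    keys = sorted(set(offline) | set(online))
    mismatches = []
    for key in keys:
        a = offline.get(key, _MISSING)
        b = online.get(key, _MISSING)
        if a is _MISSING or b is _MISSING:
            mismatches.append(key)           # present on one side only
        elif a is None or b is None:
            if a is not b:
                mismatches.append(key)       # null policy differs between sides
        elif not math.isclose(a, b, rel_tol=rel_tol):
            mismatches.append(key)           # numeric values disagree
    match_rate = 1.0 - len(mismatches) / len(keys) if keys else 1.0
    return match_rate, mismatches

OFFLINE = {("card_1", 1000, "card_decline_count_5m"): 3.0,
           ("card_2", 1000, "card_decline_count_5m"): 5.0}
ONLINE = {("card_1", 1000, "card_decline_count_5m"): 3.0,
          ("card_2", 1000, "card_decline_count_5m"): 4.0}

rate, bad = parity_report(OFFLINE, ONLINE)
print(rate, bad)
```

A release gate would then compare `match_rate` against a per-feature threshold recorded in the contract. The sketch treats a missing key and an explicit null as different outcomes so that default/null policy drift shows up as a mismatch.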
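Q: What if real-time feature latency misses the target?
A: First separate source lag, pipeline lag, online materialization lag, and serving latency. On the product side there must be a degraded mode: step-up for high-risk payment transactions, suspend auto-pass for KYC, require approval for high-risk agent tools. On the architecture side, set up timeout, TTL, cache, asynchronous compensation, fallback feature group, and kill switch. A stale feature must never turn into a silent approve.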
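The "no silent approve" rule can be expressed as a small guard around the normal decision. This is a sketch for the payment case only; `decide_with_fallback`, the `rules_only_fallback` label, and the TTL value are assumptions introduced for illustration.

```python
from typing import Optional, Tuple

def decide_with_fallback(model_action: str, feature_age_ms: Optional[int],
                         ttl_ms: int, high_risk: bool) -> Tuple[str, bool]:
    """Return (action, fallback_used) for one payment decision.

    feature_age_ms is None when the feature lookup timed out.
    """
    stale = feature_age_ms is None or feature_age_ms > ttl_ms
    if not stale:
        return model_action, False
    # Stale or missing feature: never pass the model's action through unchanged.
    if high_risk:
        return "step_up_authentication", True
    return "rules_only_fallback", True  # hypothetical label for rule-based degradation

print(decide_with_fallback("allow", 1320, ttl_ms=5000, high_risk=True))
print(decide_with_fallback("allow", 9000, ttl_ms=5000, high_risk=True))
print(decide_with_fallback("allow", None, ttl_ms=5000, high_risk=False))
```

Returning `fallback_used` alongside the action keeps the degraded path visible in the audit event, which matters when the decision is later replayed or disputed.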
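Q: Why is replay hard?
A: Because replay is not re-running the current code. It has to restore the event visibility, feature contract, model version, rule version, policy version, source version, and decision context as of that time. Late-arriving data, rule changes, entity merges, and source corrections all break reproducibility. High-risk scenarios should store a decision-time feature snapshot and also retain the event log to support recomputation and explanation of differences.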
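12.4 PM Deep Dive
Q: How do you decide which real-time decision scenario to build in phase one?
A: I look at four dimensions: the value of decision timeliness, the cost of errors, data readiness, and replayability. Payment fraud has high timeliness value but also high latency and false-decline risk; credit pre-approval has high governance and fairness requirements; KYC document decisions need evidence and a human boundary; agent tool risk can start with approval for high-risk tools. Phase one suits a scenario with clear business value, an existing human backstop, sufficient historical events and labels, and the ability to run shadow replay.
Q: How do you define product success for a real-time feature platform?
A: Not by model metrics alone. Platform metrics include time to onboard a new feature, feature reuse rate, contract coverage, freshness SLO attainment, offline/online parity, replay success rate, number of feature incidents, and release gate pass rate. Business metrics depend on the scenario: fraud loss, false-block complaints, pre-approval conversion, KYC turnaround time, and interception of mistaken agent tool authorizations. Governance metrics cover unapproved feature use, leakage incidents, and time to retrieve audit evidence.
Q: How do you handle false positives and user experience?
A: Real-time decisions should not be limited to a binary allow/block. Payments can step-up, KYC can request-more-info, credit can invite-to-apply rather than commit to granting credit, and agent tools can dry-run or require approval. The PM should translate error cost into a gradient of actions and monitor false positive by slice, override reason, complaints, and conversion loss.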
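12.5 Risk / Compliance Deep Dive
Q: How do you prove there is no feature leakage?
A: The evidence includes event_time and available_time in the feature contract, the point-in-time join in the training set generation logic, the leakage review's item-by-item conclusions on future, post-decision, target, operational, and feedback leakage, and the results of checking available_time <= decision_time in historical replay. High-risk features also need source lineage and proof that manual review fields were excluded.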
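The `available_time <= decision_time` check named in this answer is simple to automate over a training set or a replay cohort. The sketch below assumes each row carries its decision time and, per feature, the time the value became available; the field names and the `find_leaky_features` helper are hypothetical.

```python
def find_leaky_features(rows: list) -> list:
    """Return (row_id, feature_name) pairs where a feature was not yet
    available at decision time, i.e. available_time > decision_time."""
    leaks = []
    for row in rows:
        for feat in row["features"]:
            if feat["available_time_ms"] > row["decision_time_ms"]:
                leaks.append((row["row_id"], feat["name"]))
    return leaks

# Hypothetical training rows with epoch-millisecond timestamps.
TRAINING_ROWS = [
    {"row_id": "r1", "decision_time_ms": 1000, "features": [
        {"name": "card_decline_count_5m", "available_time_ms": 900},
        # A post-decision signal that must not feed the training row.
        {"name": "chargeback_flag", "available_time_ms": 1200},
    ]},
    {"row_id": "r2", "decision_time_ms": 2000, "features": [
        {"name": "card_decline_count_5m", "available_time_ms": 2000},
    ]},
]

print(find_leaky_features(TRAINING_ROWS))
```

An empty result is evidence for only the future and post-decision categories; target, operational, and feedback leakage still need the item-by-item review.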
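Q: A customer disputes a blocked payment. How do you explain it?
A: I retrieve the decision audit event and show the request context, entities, feature values, feature age, model version, rule version, policy version, triggered reason codes, fallback status, and downstream action as of that time. The explanation should rest on the information available at the time, not on a later chargeback or a re-run of the current model.
Q: How does the AI RMF map onto this platform?
A: Govern covers owner, approval, risk appetite, and audit responsibility; Map covers scenario, customer impact, data sources, entity/time, and failure modes; Measure covers freshness, skew, leakage, latency, model quality, false positives, override, and incident; Manage covers release gate, fallback, manual review, kill switch, rollback, and post-incident replay.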
12.6 Data Product Deep Dive
Q: What are the most important fields in a feature contract?
A: In high-risk real-time decisioning the most critical are business purpose, allowed/disallowed use, entity, event_time, available_time, source, transformation, freshness SLO, null/default policy, TTL, owner, data classification, leakage control, online/offline serving semantics, and audit fields. Without these fields a feature is just a data column, not a governable data product.
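A registry could enforce the field list in this answer with a completeness check at registration time. The sketch below is one possible shape; the snake_case key names in `REQUIRED_CONTRACT_FIELDS` and the `missing_contract_fields` helper are assumptions derived from the list above, not a schema defined by this playbook or by any feature store product.

```python
REQUIRED_CONTRACT_FIELDS = (
    "business_purpose", "allowed_use", "disallowed_use", "entity",
    "event_time", "available_time", "source", "transformation",
    "freshness_slo", "null_default_policy", "ttl", "owner",
    "data_classification", "leakage_control", "serving_semantics",
    "audit_fields",
)

def missing_contract_fields(contract: dict) -> list:
    """Return required fields that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_CONTRACT_FIELDS
            if contract.get(name) in (None, "", [], {})]

# Hypothetical contract that is complete except for two fields.
draft = {name: "filled in" for name in REQUIRED_CONTRACT_FIELDS}
del draft["available_time"]
draft["owner"] = ""

print(missing_contract_fields(draft))
```

The check only proves that fields are populated. Whether the values are correct, for example whether `available_time` really reflects serving visibility, still needs review by the feature owner and risk reviewer.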
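Q: How do you manage feature deprecation?
A: Do not delete it outright. First mark it deprecated, block new models from depending on it, list replacement features, run a dependency scan, preserve historical replay capability, update model lineage, and notify the owner and risk reviewer. For features that once influenced customer decisions, also retain the contract, source lineage, and versions until the audit retention period ends.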
13. Self-Check Checklist
| Area | Check |
|---|---|
| Architecture | Does it include event ingestion, streaming features, registry, offline store, online store, feature server, decision orchestrator, rules/policy, audit/replay, and monitoring? |
| Entity/time | Are entity keys, event_time, available_time, decision_time, effective_time, and expiry_time defined? |
| Consistency | Is there offline/online parity, batch/stream parity, and consistency of the default/null policy? |
| Leakage | Does it cover future, post-decision, target, operational, source, and feedback leakage? |
| Freshness | Is there an SLO, error budget, burn alert, degraded mode, and recovery condition? |
| Financial fit | Does it cover payment fraud, credit pre-approval, KYC, NBA, and customer-service agent tool risk? |
| Governance | Is there an owner, allowed use, PII, retention, risk review, and approval? |
| Release | Is there point-in-time validation, load test, shadow mode, replay, and rollback? |
| Monitoring | Are freshness, quality, skew, latency, outcome, override, and audit health monitored? |
| Audit | Can the feature values, model, rules, policy, and decision reasons as of that time be reproduced? |
14. Final Sentences to Remember
Real-time decisioning is a time-aware, governed, replayable decision architecture.
Feature store is the contract layer between historical learning and online action.
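Put differently:
The essence of a real-time feature platform is to place "how we trained in the past" and "how we decide right now" inside one system of entities, time, contracts, serving, monitoring, and audit.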