AI Data Contracts / Lineage / Quality Playbook
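Positioning: an advanced data governance and productization handbook for AI Data Product Managers, AI Product Architects, Data Architects, AI Governance, and Risk Tech.
Goal: connect AI data contracts, lineage, metadata products, schema evolution, data quality SLOs, drift, label governance, and data incident response into an enterprise AI data control plane that can be shipped, audited, and operated.
Core view: AI data governance is not "data cleaning" plus "field descriptions". It gives each AI use case a data product capability that can be contracted, traced, tested, changed, held accountable, and reviewed after the fact.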
Source Anchors
These sources serve as methodological anchors. They do not replace the formal judgment of internal legal, compliance, model risk, privacy, and architecture review.
| Anchor | Link | How this handbook uses it |
|---|---|---|
| OpenLineage Docs | https://openlineage.io/docs/ | Turn job, run, dataset, facet, and runtime lineage events into an observability design for AI data pipelines. |
| OpenLineage Object Model | https://openlineage.io/docs/spec/object-model/ | Distinguish runtime lineage, design-time job metadata, and dataset metadata to support the evidence chain for training, eval, RAG, and feature pipelines. |
| OpenLineage Facets | https://openlineage.io/docs/spec/facets/ | Use facets to extend AI metadata such as source code, schema, quality, model, prompt, RAG corpus, and label batch. |
| DataHub Data Contracts | https://docs.datahub.com/docs/generated/metamodel/entities/datacontract | Express a contract as schema, freshness, quality, and SLA assertions, and wire it into CI/CD and quality tooling. |
| DataHub Lineage | https://docs.datahub.com/docs/api/tutorials/lineage | Reference for expressing table-level, column-level, data job, dashboard, and chart lineage. |
| OpenMetadata Data Contracts | https://docs.open-metadata.org/v1.13.x/how-to-guides/data-contracts | Organize a contract around schema, semantics, security, quality assertions, SLA, terms of use, and status. |
| OpenMetadata Data Lineage | https://docs.open-metadata.org/v1.13.x/how-to-guides/data-lineage | Reference for visual lineage and impact analysis across table, column, pipeline, dashboard, and ML model. |
| OpenMetadata Quality Observability | https://docs.open-metadata.org/v1.13.x/how-to-guides/data-quality-observability | Bring tests, profiler, alerts, incident manager, and anomaly detection into the data operations loop. |
| Great Expectations GX Core | https://docs.greatexpectations.io/docs/core/introduction/gx_overview/ | Use Expectation Suite, Validation Definition, Validation Result, and Checkpoint to build data quality tests and reports. |
| Great Expectations Checkpoints | https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/create_a_checkpoint_with_actions/ | Turn validation results into notifications, Data Docs, custom actions, and release gates. |
| NIST AI RMF Core | https://airc.nist.gov/airmf-resources/airmf/5-sec-core/ | Organize AI data risk governance around Govern / Map / Measure / Manage. |
| NIST AI RMF GenAI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Turn data sourcing, evaluation, monitoring, and risk response across the GenAI lifecycle into artifacts. |
| This repository's AI Data Product Management Playbook | docs/AI_DATA_PRODUCT_MANAGEMENT_PLAYBOOK.md | This handbook expands on contract, metadata, lineage, quality SLO, and feedback loop. |
| This repository's AI Requirements-to-Eval Cookbook | docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md | Connect data contracts and lineage to eval contract, release gate, monitoring, and incident loop. |
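1. Positioning: Advanced Capabilities of the AI Data Control Plane
This handbook does not repeat basic BA, traditional data warehouse, or general data governance concepts. It covers four capabilities that AI products and architectures need once they reach production:
| Capability | Advanced question | Deliverable evidence |
|---|---|---|
| AI data contracts | Does the data an AI workflow depends on carry commitments on schema, semantics, quality, permissions, refresh, purpose, and change? | data contract, contract tests, schema change approval |
| Lineage | Can a model output, RAG answer, eval result, or feature value be traced back to source records, transformations, versions, permissions, and quality checks? | lineage map, OpenLineage event, impact analysis |
| Metadata product | Has metadata been upgraded from a passive catalog to a control plane for the AI runtime? | metadata product canvas, catalog policy, ownership model |
| Quality operations | Can quality degradation, drift, label disputes, stale corpora, and schema breaks be detected, blocked, triaged, and reviewed? | quality SLO matrix, incident playbook, drift dashboard |
In one sentence:
An AI data product is not "data a model can use"; it is a data capability that is accountable to AI consumers through a contract.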
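2. Why It Matters
Many enterprise AI incidents look like model problems on the surface, but the root cause is a data product out of control.
| Surface problem | Underlying data root cause | Business risk |
|---|---|---|
| KYC document extraction frequently maps values to the wrong fields | Document type, OCR version, field semantics, and confidence thresholds have no contract | Customers resubmit documents repeatedly, account opening is delayed, regulatory spot checks are hard to explain |
| AML copilot case summaries reach skewed conclusions | Case labels lack source, reviewer, point in time, jurisdiction, and disagreement records | Wrong risk narrative, missed investigation steps, distorted model evaluation |
| Credit model performance drops suddenly | Upstream application channel, income fields, rejection reasons, and customer segments drift | Unfair outcomes, lower approval quality, escalated model risk |
| RAG policy assistant cites outdated policy | Corpus version, effective date, retirement date, and index refresh have no SLO | Misleading customer service, incorrect execution by staff, compliance risk |
| Customer service conversation traces cannot be replayed | Prompt, retrieval, tool call, human edit, and user feedback have no unified trace | Incidents cannot be localized, golden sets cannot be built up |
| Recommendation conversion fluctuates | Feature definitions, window definitions, attribution events, and experiment bucketing change in schema or distribution | Mistargeted marketing, worse customer experience, distorted revenue attribution |
The AI data control plane does not address single-point data quality. It addresses end-to-end provability across source, pipeline, metadata, contract, eval, runtime, feedback, and incident.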
3. Capability Map
| Layer | Key objects | Core responsibility | Mature deliverables |
|---|---|---|---|
| Source layer | core banking, CRM, KYC, AML, policy repo, call center, feature events | Define source-of-truth, owner, permissions, retention period, and business semantics | source inventory, ownership map |
| Contract layer | AI data contract, schema assertion, freshness assertion, usage policy | Give producers and AI consumers a machine-testable commitment on availability, quality, and purpose | data contract, contract testing suite |
| Metadata product layer | DataHub / OpenMetadata, glossary, domain, classification, data product | Make metadata the control plane for discovery, governance, runtime filtering, and audit | metadata product canvas, catalog policy |
| Lineage layer | OpenLineage event, column lineage, feature lineage, RAG corpus lineage | Record the evidence chain and impact scope from source systems to AI output | lineage map, impact analysis report |
| Quality SLO layer | Great Expectations, OpenMetadata tests, DataHub assertions | Turn completeness, freshness, accuracy, consistency, validity, coverage, and access correctness into SLOs | quality SLO matrix, quality dashboard |
| Drift layer | feature drift, data drift, label drift, policy drift, corpus freshness drift | Detect distribution changes across training, online, eval, and retrieval corpora | drift report, retraining or recrawl decision |
| Incident layer | data incident, AI incident, contract violation, schema break | Unify severity, containment, root cause, repair, review, and evidence preservation | data incident playbook, post-incident review |
| Governance layer | RACI, approval, release gate, model risk, privacy, audit | Embed data accountability in the AI lifecycle | governance RACI, release evidence pack |
4. Reference Architecture
```mermaid
flowchart LR
    S1[KYC / CRM / Core Banking] --> I[Ingestion and Transformation]
    S2[AML Case System] --> I
    S3[Credit LOS / Bureau / Ledger] --> I
    S4[Policy Repository] --> RAG[RAG Corpus Builder]
    S5[Contact Center / Trace Store] --> I
    S6[Digital Events / Feature Logs] --> F[Feature Pipeline]
    I --> OL[OpenLineage Events]
    F --> OL
    RAG --> OL
    OL --> M[Metadata Platform: DataHub / OpenMetadata]
    M --> C[AI Data Contracts]
    C --> Q[Quality Validation: GX / Assertions / Tests]
    Q --> DP[AI Data Products]
    DP --> FS[Feature Store]
    DP --> TS[Training Dataset Registry]
    DP --> ES[Eval Dataset Registry]
    DP --> KB[RAG Corpus Registry]
    DP --> LS[Label Store]
    FS --> AI[AI Services / Models / Agents]
    TS --> AI
    ES --> Gate[Eval and Release Gate]
    KB --> AI
    LS --> Gate
    AI --> T[AI Trace and Feedback]
    T --> M
    T --> Incident[Data and AI Incident Response]
    Incident --> C
    Incident --> Q
```
4.1 Control Plane
An AI data control plane contains at least six centers:
| Center | Responsibility | Required capability |
|---|---|---|
| Metadata catalog | Discovery, owner, domain, classification, glossary, usage, lineage | DataHub or OpenMetadata as the governance entry point |
| Contract registry | Data contracts, schema, SLO, allowed AI use, change policy | Versioning, approval, contract testing, violation status |
| Lineage backend | Evidence chain across job, run, dataset, column, feature, corpus, and eval sample | OpenLineage event, impact analysis, trace linkage |
| Quality runner | expectation, assertion, checkpoint, anomaly detection | Great Expectations, OpenMetadata tests, DataHub assertions |
| AI asset registry | training set, eval set, RAG corpus, feature set, label set | provenance card, version, risk tier, owner |
| Incident workflow | breach detection, severity, containment, root cause, postmortem | Linked handling of data incidents and AI output incidents |
4.2 Runtime Instrumentation
OpenLineage is well suited to recording the runtime facts of pipeline jobs, but AI data pipelines need extended facets. An event from a retail finance AI pipeline should at minimum be able to express:
| Facet | Fields | Purpose |
|---|---|---|
| source facet | source_system, source_record_count, extract_window, jurisdiction | Explain where the data came from and what it covers |
| contract facet | contract_id, contract_version, assertion_status, breaking_change_flag | Link producer commitment to pipeline status |
| quality facet | validation_suite_id, checkpoint_id, failed_expectations, severity | Bring quality results into the lineage graph |
| model data facet | training_dataset_id, feature_set_id, snapshot_time, sampling_policy | Trace training data |
| eval data facet | eval_dataset_id, golden_set_version, reviewer_pool, rubric_version | Trace eval data |
| RAG corpus facet | corpus_id, document_version, effective_date, index_version, chunking_policy | Trace the retrieval corpus |
| label facet | label_batch_id, label_source, review_status, inter_annotator_agreement | Manage label quality and provenance |
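The facet table above can be sketched as a plain event payload. This is a minimal sketch, not the OpenLineage client API: it builds the event as a dict, and the facet keys `ai_contract` and `ai_quality`, the producer URI, the facet schema URL, and the choice to attach quality results at run level are all illustrative assumptions. The spec version in `schemaURL` is shown only as an example.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/ai-data-platform"            # hypothetical producer URI
FACET_SCHEMA = "https://example.com/schemas/ai-facets.json"  # hypothetical facet schema URL


def custom_facet(**fields):
    """Wrap fields in the _producer/_schemaURL envelope that OpenLineage facets carry."""
    return {"_producer": PRODUCER, "_schemaURL": FACET_SCHEMA, **fields}


def build_run_event(job_name, output_dataset, contract, quality):
    """Build a COMPLETE run event as a plain, JSON-serializable dict."""
    namespace = "retail_banking_kyc"  # assumed namespace, taken from the contract domain
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            # Quality results attached to the run (one possible placement).
            "facets": {"ai_quality": custom_facet(**quality)},
        },
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [
            {
                "namespace": namespace,
                "name": output_dataset,
                # Contract state attached to the dataset this run produced.
                "facets": {"ai_contract": custom_facet(**contract)},
            }
        ],
    }


event = build_run_event(
    "kyc_extraction_daily",
    "kyc_extracted_fields",
    contract={"contract_id": "kyc_document_extraction_v1", "contract_version": "1.0.0",
              "assertion_status": "passed", "breaking_change_flag": False},
    quality={"checkpoint_id": "kyc_extract_daily", "failed_expectations": 0, "severity": "none"},
)
payload = json.dumps(event)  # ready to send to whatever lineage backend is in use
```

Keeping the payload a plain dict makes the facet design reviewable before committing to a specific client library or transport.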
4.3 Release Gates
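An AI data release gate checks more than model scores; it also checks whether the data contract can be relied on.
| Gate | Pass conditions | Failure action |
|---|---|---|
| G1 Contract Gate | contract active, owner assigned, allowed AI use defined, schema assertions pass | Block entry into the production pipeline |
| G2 Lineage Gate | source -> curated -> feature/training/eval/RAG -> AI output has queryable lineage | Disallow use in high-risk use cases |
| G3 Quality Gate | critical SLOs pass, severe quality breaches = 0 | Degrade, roll back, or route to human review |
| G4 Drift Gate | feature/data drift within thresholds, label drift explained | Pause automated decision support or trigger re-evaluation |
| G5 Incident Readiness Gate | severity, owner, communication, containment, and postmortem processes have been rehearsed | Delay expanding the user base |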
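The gate table above can be sketched as a small evaluator. This is a minimal sketch: the evidence keys (`contract_active`, `severe_breaches`, and so on) are hypothetical names for signals that would come from the contract registry, lineage backend, and quality runner.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    gate: str
    passed: bool
    failure_action: str  # empty when the gate passes


def evaluate_gates(evidence):
    """Evaluate G1-G5 against a dict of evidence flags (hypothetical key names)."""
    checks = [
        ("G1 Contract Gate",
         evidence["contract_active"] and evidence["owner_assigned"]
         and evidence["allowed_ai_use_defined"] and evidence["schema_assertions_passed"],
         "block entry into the production pipeline"),
        ("G2 Lineage Gate", evidence["lineage_queryable"],
         "disallow use in high-risk use cases"),
        ("G3 Quality Gate",
         evidence["critical_slo_passed"] and evidence["severe_breaches"] == 0,
         "degrade, roll back, or route to human review"),
        ("G4 Drift Gate",
         evidence["drift_within_threshold"] and evidence["label_drift_explained"],
         "pause automated decision support or trigger re-evaluation"),
        ("G5 Incident Readiness Gate", evidence["incident_process_rehearsed"],
         "delay expanding the user base"),
    ]
    return [GateResult(name, bool(ok), "" if ok else action) for name, ok, action in checks]


def release_allowed(results):
    """A release proceeds only when every gate passes."""
    return all(r.passed for r in results)
```

Returning the failure action alongside each result lets the release pipeline act on the specific gate that failed instead of treating every failure the same way.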
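5. Architecture Patterns
5.1 AI Data Contract as API Boundary
Treat the data contract as the API boundary of the AI system. Models, RAG, eval, feature pipelines, and agent tools do not trust tables or files directly; they trust only data products that have passed a contract.
| Contract area | AI-specific requirement | Retail finance example |
|---|---|---|
| Schema | Field types, required fields, enums, nested structure, nullable policy | document_type allows only passport, driver_license, utility_bill, bank_statement |
| Semantics | Field meaning, time basis, units, status definitions, business priority | case_status=closed means the investigation has ended, not that a SAR has been filed |
| Freshness | Maximum delay from source event to visibility for AI consumption | KYC document extraction results reach the review queue within 15 minutes at P95 |
| Quality | completeness, validity, consistency, accuracy, coverage | Critical identity field completeness >= 99.5% |
| Permission | field-level, row-level, retrieval entitlement, trace redaction | Customer service can retrieve public policy but not AML investigation notes |
| Allowed AI use | Boundaries for RAG, eval, training, routing, analytics, and monitoring | Customer service conversations may be used for quality analysis and eval sampling, but do not go directly into training sets |
| Change policy | schema evolution, deprecation, backfill, dual-run, consumer approval | A credit feature definition change requires rerunning validation and a model risk impact review |
| Incident trigger | Which violations trigger shutdown, degradation, or human review | If the policy corpus freshness breach exceeds 24 hours, RAG may only answer with a prompt for human confirmation |
A mature design is not "there is a contract in the docs"; it is a contract that CI, pipelines, the catalog, the quality runner, and release gates can read automatically.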
5.2 Metadata Product as AI Control Plane
The job of a metadata product is not to collect descriptions, but to let the AI runtime use metadata for filtering, routing, explanation, and audit.
| Metadata type | AI runtime use | Example |
|---|---|---|
| Business metadata | Scenario routing, user intent mapping, metric slicing | product_line, risk_type, customer_segment |
| Technical metadata | freshness, schema version, job run, source checksum | ingestion_time, contract_version, index_version |
| Governance metadata | PII, consent, retention, policy classification | PII.Sensitive, retention_7y, no_model_training |
| Quality metadata | validation result, failure history, SLO status | expectation_suite_id, quality_score, breach_count |
| Lineage metadata | source records, transform jobs, downstream AI assets | source_table -> feature_set -> training_dataset |
| Eval metadata | scenario, severity, rubric, expected behavior | AML_typology, critical_failure, expert_reviewed |
| Feedback metadata | human edit, override reason, accepted suggestion | wrong_policy, missing_evidence, tone_issue |
The AI Product Architect should design the metadata platform as an active metadata layer: RAG filtering, model routing, access control, incident triage, impact analysis, and release evidence all read from it.
5.3 OpenLineage for Runtime and Design-Time Evidence
OpenLineage's job, run, and dataset model is well suited to turning scattered pipelines into a unified lineage graph. AI scenarios need to record all of the following:
| Lineage type | Recorded objects | Key question |
|---|---|---|
| Runtime lineage | Each pipeline run, input dataset, output dataset, status, timestamp | Which run produced the data the model used? |
| Design-time job metadata | Job source location, declared inputs and outputs, owner, code version | If the code changes, which AI assets are affected? |
| Dataset metadata | schema, owner, documentation, classification, contract | What are the definition and governance status of the data itself? |
| Column lineage | Field-level transforms, derived fields, sensitive field propagation | Which source fields does risk_score come from? |
| AI asset lineage | feature set, training set, eval set, RAG corpus, label batch | Which versions do the model and its answers depend on? |
Practical principles:
- Emit an event at every key pipeline boundary: source extract, curation, quality validation, feature materialization, training snapshot, eval build, RAG indexing.
- Keep column-level lineage for high-risk fields such as income, risk_rating, customer_type, and adverse_action_reason.
- For a RAG corpus, record document id, version, effective date, approval status, chunking policy, embedding model, and index version.
- For an eval dataset, record query source, gold evidence, reviewer, rubric, risk tier, and sample inclusion reason.
5.4 Schema Evolution without AI Breakage
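Schema evolution is more dangerous in AI systems than in ordinary reporting, because a model can silently absorb the wrong semantics.
| Change type | Compatibility | Handling | AI risk |
|---|---|---|---|
| Additive field | Usually compatible | Add the field to the contract; consumers opt in | A new field needs purpose approval before entering a prompt or training |
| Rename field | High risk | Set up alias, dual-run, deprecation window, consumer migration | Prompts, feature code, and eval parsers fail silently |
| Type change | High risk | Contract test blocks the change; migration plan; backfill validation | Numbers treated as strings, date parsing errors |
| Enum expansion | Medium-high risk | Add the new enum to the glossary and routing policy; update eval cases | The model has not learned the new state; rule fallback misses it |
| Semantic change | Highest risk | New contract major version, impact analysis, model/eval rerun | Field name unchanged but business meaning changed |
| Nullability change | Medium-high risk | Recompute the completeness SLO; consumers fail fast | Summaries missing fields, bias from missing model inputs |
| Aggregation window change | High risk | Versioned metric / feature definition, drift comparison | Feature distribution shifts, online/offline inconsistency |
Versioning strategy:
| Version | Applies to | Example |
|---|---|---|
| patch | Documentation, owner, and non-behavioral metadata fixes | kyc_doc_contract_v1.0.1 |
| minor | Compatible field additions, added quality tests, added tags | credit_feature_contract_v1.1 |
| major | Field deletion, rename, semantic change, SLO downgrade, change to purpose boundary | aml_label_contract_v2 |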
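The versioning table can be sketched as a lookup that a contract registry or CI check might apply to a change request. This is a minimal sketch: the change-type keys are hypothetical labels for the rows above, and treating unknown change types as major is an assumed fail-safe default, not something the table states.

```python
# Hypothetical change-type labels mapped to the bump level from the table above.
BUMP_FOR_CHANGE = {
    "doc_fix": "patch",
    "owner_update": "patch",
    "additive_field": "minor",
    "new_quality_test": "minor",
    "new_tag": "minor",
    "field_removed": "major",
    "field_renamed": "major",
    "semantic_change": "major",
    "slo_downgrade": "major",
    "allowed_use_change": "major",
}
_RANK = {"patch": 0, "minor": 1, "major": 2}


def required_bump(change_types):
    """Return the highest bump level implied by a set of changes."""
    if not change_types:
        raise ValueError("at least one change type is required")
    # Unknown change types default to major (assumed fail-safe).
    levels = [BUMP_FOR_CHANGE.get(c, "major") for c in change_types]
    return max(levels, key=_RANK.__getitem__)


def next_version(current, level):
    """Apply a bump level to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, a change request that adds a field and also changes a field's meaning resolves to a major bump, because the semantic change dominates.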
5.5 Data Quality SLO as Product Commitment
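Great Expectations, OpenMetadata tests, and DataHub assertions should all serve one goal: make data quality an SLO of the AI product, not an after-the-fact report.
| SLO dimension | AI-specific measure | Release relevance |
|---|---|---|
| Freshness | Delay from source event to availability in feature/RAG/eval | Keeps stale policy, stale account status, and stale labels out of AI |
| Completeness | Coverage of critical fields, critical documents, and critical labels | Prevents the model from producing output on missing evidence |
| Validity | Type, range, enum, format, business rules | Prevents input parsing errors |
| Consistency | Consistent definitions across systems, tables, and versions | Prevents conflicting customer, case, and product status |
| Accuracy | Reconciliation against source-of-truth or human review | Prevents factual errors |
| Coverage | Coverage of scenarios, segments, and edge samples | Prevents biased eval or training samples |
| Access correctness | Permission filtering, masking, purpose restriction | Prevents data leakage |
| Trace completeness | Whether each AI request has source, version, prompt, retrieval, tool, and review | Supports incident review |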
5.6 Feature/Data Drift and Policy Drift
AI data drift needs to be handled in layers according to how the data is consumed.
| Drift type | Detection target | Trigger | Product action |
|---|---|---|---|
| Feature drift | numerical / categorical feature distribution | Online application channel shifts, abnormal income field distribution | Trigger model performance review and feature owner review |
| Data drift | source records, document types, conversation topics | New KYC document types, shifting customer complaint topics | Extend the contract, update the parser, add eval cases |
| Label drift | label distribution, reviewer agreement, case outcome lag | AML disposition standards change, review teams apply inconsistent criteria | Recalibrate the label rubric and adjudication |
| Corpus drift | policy docs, effective dates, approval status | Old policy not retired, new policy not ingested | recrawl, reindex, RAG freshness warning |
| Schema drift | columns, types, nested structure, enum | Upstream ships a breaking change | contract test fail-fast |
| Usage drift | AI consumers, new downstreams, prompt changes | Data used for unapproved training or external sharing | purpose limitation review |
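The table does not prescribe a drift metric. One common choice for feature drift on binned distributions is the Population Stability Index (PSI), sketched below as an assumption rather than the handbook's method. The `eps` floor for empty bins is also an assumption.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts from the reference window (e.g. training).
    actual_counts:   per-bin counts from the comparison window (e.g. last week online).
    Both lists must use the same bin edges, in the same order.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must have the same number of bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each share at eps so an empty bin does not produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score


# Income band counts: reference window vs. a window after a channel shift.
reference = [50, 30, 20]
shifted = [20, 30, 50]
drift_score = psi(reference, shifted)
```

A widely used rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those cut-offs are a convention, and the actual G4 threshold should come from the contract for the feature in question.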
5.7 Training / Eval / RAG Corpus Lineage
Training data, eval data, and RAG corpora are all called data, but their governance goals differ.
| Asset | Required lineage fields | Quality focus | Typical incident |
|---|---|---|---|
| Training dataset lineage | source snapshot, feature definition, label source, sampling, exclusion, PII handling | representativeness, leakage, label validity, feature consistency | Training set contains future information or non-compliant data |
| Eval dataset lineage | scenario source, gold evidence, expected behavior, reviewer, rubric, severity | coverage, review quality, rubric stability, regression value | Eval set does not match online risk |
| RAG corpus lineage | source doc, approval status, effective date, chunking, embedding, index version | freshness, authority, permission, citation granularity | Cites outdated policy or unapproved drafts |
5.8 Label Governance
Labels are decision assets in an AI risk system, not ordinary fields.
| Label area | Governance requirement | Retail finance example |
|---|---|---|
| Label definition | label ontology, positive/negative criteria, exclusions | Definition and exclusions of AML suspicious_activity_confirmed |
| Label source | expert, operation outcome, customer feedback, LLM-assisted, synthetic | SAR filing, case closure, QA outcome |
| Reviewer model | reviewer role, training, dual review, adjudicator | Dual review for high-risk AML cases |
| Agreement metric | inter-annotator agreement, conflict rate, review latency | AML typology label agreement >= 0.85 |
| Versioning | label rubric version, policy version, jurisdiction | Label version changes after a BSA/AML rule update |
| Leakage control | Whether a label references future information; point in time of the human conclusion | Credit default labels must not leak information known only after approval |
| Auditability | label event, reviewer, timestamp, evidence | Reproducible for regulatory or model risk review |
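The agreement metric in the table is not tied to a specific statistic. A minimal sketch using Cohen's kappa for two reviewers follows; choosing kappa (rather than raw agreement or another coefficient) is an assumption, and the typology labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same cases, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of cases")
    n = len(labels_a)
    # Observed agreement: share of cases where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each reviewer's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["structuring", "structuring", "mule_account", "mule_account"]
reviewer_2 = ["structuring", "structuring", "mule_account", "structuring"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # compare against the 0.85 target per typology
```

Kappa corrects for chance agreement, so it is stricter than a raw match rate when one typology dominates the batch.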
5.9 Contract Testing
Contract testing should cover three layers:
| Test layer | What is tested | When it runs |
|---|---|---|
| Producer contract tests | schema, semantic metadata, quality assertions, freshness, row/column rules | pipeline build, release, daily run |
| Consumer contract tests | Dependencies of prompt parser, feature code, eval loader, and RAG indexer on the contract | AI service build, model retrain, index rebuild |
| Change impact tests | schema evolution, downstream lineage, model/eval/RAG assets impact | pull request, catalog contract change, source release |
For high-risk AI use cases, a contract test failure should block the production release outright, not just send a notification.
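A consumer contract test of the kind described above can be sketched as a comparison between what a consumer declares it needs and what the contract schema promises. This is a minimal sketch: the `consumer_needs` shape and the `risk_tier` values are hypothetical.

```python
def consumer_contract_violations(contract_schema, consumer_needs):
    """Compare a consumer's declared field needs against the contract schema.

    contract_schema: field name -> {"type": str, "required": bool}
    consumer_needs:  field name -> expected type (hypothetical declaration format)
    """
    violations = []
    for field, expected_type in consumer_needs.items():
        spec = contract_schema.get(field)
        if spec is None:
            violations.append(f"{field}: not defined in contract")
        elif spec["type"] != expected_type:
            violations.append(f"{field}: consumer expects {expected_type}, contract has {spec['type']}")
        elif not spec.get("required", False):
            violations.append(f"{field}: consumer depends on it but contract marks it optional")
    return violations


def release_decision(violations, risk_tier):
    """High-risk use cases block on any violation; others only warn."""
    if not violations:
        return "pass"
    return "block" if risk_tier == "high" else "warn"


schema = {
    "document_type": {"type": "enum", "required": True},
    "review_status": {"type": "enum", "required": True},
}
needs = {"document_type": "enum", "reviewer_id": "string"}
decision = release_decision(consumer_contract_violations(schema, needs), risk_tier="high")
```

Running this at AI service build time surfaces a renamed or removed field before the prompt parser or feature code fails silently in production.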
6. Retail Finance Case Blueprints
| Scenario | Data pipeline | Key contract | Lineage focus | Quality / Drift focus |
|---|---|---|---|---|
| KYC document extraction pipeline | upload -> OCR -> extraction -> validation -> review queue -> customer profile | Document type, field confidence, PII classification, review status, human correction | Document version, OCR model, extractor version, reviewer edit | extraction accuracy, critical field completeness, stale document detection |
| AML case labels | transaction monitoring -> alert -> case investigation -> disposition -> label store -> eval/training | label definition, jurisdiction, case status, reviewer, evidence completeness | label batch, case evidence, SAR decision, QA sample | label agreement, case outcome lag, typology distribution drift |
| Credit model training data | application -> bureau -> income verification -> feature pipeline -> training snapshot | feature definition, time window, leakage exclusion, allowed use | source snapshot, feature code, model dataset version | feature drift, missing income, adverse action reason coverage |
| RAG policy corpus versions | policy repo -> approval -> chunk -> embed -> index -> answer citation | approval status, effective date, expiry date, jurisdiction, role access | source doc, chunk id, embedding model, index version | corpus freshness, citation correctness, permission correctness |
| Customer service conversation trace data | conversation -> intent -> retrieval/tool -> answer -> human edit -> feedback | consent, PII redaction, logging purpose, retention, feedback labels | request trace, prompt version, retrieval docs, tool calls, human edits | trace completeness, unsafe response, policy drift, topic drift |
| Recommendation feature data | event stream -> identity resolution -> feature aggregation -> model scoring -> campaign response | event schema, identity match, window definition, opt-out policy | event source, feature window, experiment assignment | feature drift, event loss, consent change, conversion label lag |
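6.1 KYC Document Extraction
Architecture focus:
- Define a document contract per document type: document type, field set, confidence thresholds, acceptable missing fields, human review rules.
- OCR and extractor versions go into lineage; human corrections flow back as feedback labels.
- Identity fields must pass the critical field SLO before entering the customer profile, for example name, date of birth, ID number, and address.
- Document validity periods and regional rules go into metadata; they must not live only in prompts or manual SOPs.
Key deliverables:
- kyc_document_extraction_v1 data contract.
- Lineage map for source document -> OCR run -> extraction result -> validation -> reviewer edit.
- SLO matrix for extraction accuracy, field completeness, review override rate, and cycle time.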
6.2 AML Case Labels
Architecture focus:
- AML labels must distinguish alert outcome, case disposition, SAR filed, typology, and QA finding; they must not be merged into one vague label field.
- Labels must record reviewer role, decision timestamp, evidence references, rubric version, and jurisdiction.
- Training and eval use different label snapshots and state the case closure lag explicitly, so that future conclusions do not leak into model training.
Key deliverables:
- AML label governance rubric.
- label batch provenance card.
- label drift dashboard: typology mix, agreement, dispute rate, late outcome update.
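6.3 Credit Model Training Data
Architecture focus:
- Feature definitions must be versioned, including time window, aggregation definition, exclusion rules, and source priority.
- Fields such as income, employment, bureau score, and existing relationship must have a source-of-truth and a reconciliation rule.
- Adverse action reason, reject reason, and model reason code must be traceable in lineage to features and rules.
Key deliverables:
- credit feature contract.
- training dataset lineage card.
- feature/data drift report and model risk impact memo.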
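6.4 RAG Policy Corpus Versions
Architecture focus:
- Policy documents must be in approved status before entering the RAG index, and must carry effective date, expiry date, jurisdiction, business unit, and role access.
- Chunking policy, embedding model, and index version must go into corpus lineage.
- Citations must trace back to the document version and paragraph-level evidence, and conflicts between old and new policy must be detectable.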
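The inclusion conditions in the first bullet above can be sketched as a filter applied before index promotion. This is a minimal sketch: the `PolicyDoc` fields and the convention that a document is active until, but not including, its expiry date are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PolicyDoc:
    doc_id: str
    status: str                   # e.g. "draft", "approved", "retired"
    effective_date: date
    expiry_date: Optional[date]   # None means no scheduled expiry
    jurisdiction: str


def eligible_for_index(doc, today):
    """True only for approved, currently effective documents with a jurisdiction assigned."""
    return (
        doc.status == "approved"
        and doc.effective_date <= today
        and (doc.expiry_date is None or today < doc.expiry_date)
        and bool(doc.jurisdiction)
    )


today = date(2026, 6, 1)
docs = [
    PolicyDoc("fee_policy_v3", "approved", date(2026, 1, 1), None, "US"),
    PolicyDoc("fee_policy_v4_draft", "draft", date(2026, 1, 1), None, "US"),
    PolicyDoc("fee_policy_v2", "approved", date(2025, 1, 1), date(2026, 5, 1), "US"),
    PolicyDoc("fee_policy_v5", "approved", date(2026, 7, 1), None, "US"),
]
indexable = [d.doc_id for d in docs if eligible_for_index(d, today)]
```

Passing `today` explicitly keeps the check reproducible, which matters when replaying which documents were eligible at the time of a past answer.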
Key deliverables:
- RAG corpus freshness policy.
- policy corpus lineage map.
- citation support and stale source incident playbook.
6.5 Customer Service Conversation Trace Data
Architecture focus:
- Each conversation trace must link user intent, prompt version, retrieved evidence, tool call, model version, human edit, and feedback.
- Conversation logs must pass consent, PII redaction, purpose limitation, and retention checks before entering eval or training.
- User dissatisfaction and human rewrites must be classified as factual error, missing evidence, tone issue, policy issue, or tool issue.
Key deliverables:
- customer service trace contract.
- trace completeness SLO.
- feedback-to-eval conversion policy.
6.6 Recommendation Feature Data
Architecture focus:
- Event schema, identity resolution, feature window, campaign assignment, and conversion label each get their own contract.
- opt-out, consent, and sensitive segment exclusion must be hard controls in the feature pipeline.
- Feature drift and event loss must be read together with campaign performance, not just model AUC or CTR.
Key deliverables:
- feature contract and event contract.
- recommendation feature lineage map.
- feature/data drift monitor and campaign rollback rule.
7. Product Decision Framework
| Decision | Options | Recommended judgment |
|---|---|---|
| Contract orientation | producer-owned, consumer-specific, hybrid | Use producer-owned for highly reused core data; high-risk AI consumers may add consumer-specific assertions |
| Enforcement mode | warn, quarantine, block | Use block for critical breaches in high-risk compliance, credit, and customer-impacting scenarios |
| Metadata platform | DataHub, OpenMetadata, or both | Decide by the organization's existing ecosystem and automation capability; do not treat the catalog as a static wiki |
| Lineage granularity | table, column, record, chunk, feature | High-risk fields, RAG citations, and model training need column/chunk/feature level |
| Schema evolution | backward compatible, versioned break, dual-run | Semantic changes and field deletions require major version + impact analysis |
| Quality tooling | GX, OpenMetadata tests, DataHub assertions, dbt tests | Express uniformly at the contract layer; execution tools can be combined |
| Drift action | monitor, review, retrain, rollback, human-only | When drift affects customer rights or compliance obligations, degrade first, then assess |
| Label source | expert, operational, LLM-assisted, synthetic | High-risk labels rely mainly on expert or audited operational labels |
| RAG recrawl | time-based, event-based, approval-based | Policy corpora center on approval events and effective dates |
| Incident severity | data only, AI output affected, customer/regulatory affected | Escalate to high severity as soon as customer decisions, regulatory obligations, or permission leakage are involved |
ADR question checklist:
| ADR section | Question |
|---|---|
| Context | Which AI use case depends on this data, and what is its risk tier? |
| Decision | How are the boundaries of contract, lineage, quality, drift, and incident defined? |
| Alternatives | What are the shortcomings of doing only catalog, only quality tests, or only pipeline monitoring? |
| Consequences | What is the impact on release speed, producer cost, audit evidence, and AI quality? |
| Review trigger | Which schema, policy, model, regulatory, or incident events trigger an ADR review? |
8. Governance and Operating Model
8.1 RACI
| Artifact / Activity | AI DPM | AI Product Architect | Data Architect | Data Owner | Risk / Compliance | MLOps / Data Eng | Security / Privacy |
|---|---|---|---|---|---|---|---|
| AI data contract | A | C | R | A | C | R | C |
| Lineage map | C | A | R | C | C | R | C |
| Quality SLO matrix | A | C | R | A | C | R | C |
| Schema change approval | C | A | R | A | C | R | C |
| Eval dataset provenance card | A | C | C | C | R | R | C |
| RAG corpus freshness policy | A | C | R | A | C | R | C |
| Label governance rubric | A | C | C | R | A | R | C |
| Data incident response | A | A | R | R | A | R | R |
8.2 NIST AI RMF Mapping
| Function | Data control plane action | Evidence |
|---|---|---|
| Govern | Define ownership, policy, allowed use, RACI, approval, third-party data controls | governance charter, RACI, policy mapping |
| Map | Identify AI use case, source-of-truth, data flows, risk tier, affected stakeholders | data flow map, lineage map, risk tier memo |
| Measure | Establish contract tests, quality SLO, drift metrics, label agreement, trace completeness | validation results, drift report, quality dashboard |
| Manage | Run release gates, incident response, rollback, data repair, post-incident review | release evidence pack, incident report, change log |
8.3 Governance Cadence
| Cadence | Participants | Input | Output |
|---|---|---|---|
| Weekly | AI DPM, Data Owner, Data Eng, MLOps | contract violations, quality breach, schema changes | repair actions, release blockers |
| Monthly | AI Product Architect, Risk, Compliance, Privacy | drift trends, incident trends, label disputes, RAG freshness | control updates, eval refresh decisions |
| Quarterly | Governance board, Model Risk, Security, Business Owner | high-risk AI portfolio, audit findings, SLO history | risk acceptance, funding, decommission decisions |
| Event-driven | Incident responders | data breach, schema break, wrong output, policy corpus stale | containment, customer/regulatory impact assessment |
9. Deployable Deliverable Templates
9.1 Data Contract Template
Below is a complete sample for kyc_document_extraction_v1, usable as a structural template for AI data contracts.
```yaml
contract_id: kyc_document_extraction_v1
contract_version: 1.0.0
status: active
domain: retail_banking_kyc
producer:
  team: kyc_data_platform
  owner: kyc_data_owner
  steward: kyc_operations_steward
consumers:
  - kyc_document_extraction_service
  - kyc_review_queue
  - kyc_quality_eval_harness
source_of_truth:
  system: enterprise_document_management
  source_dataset: kyc_uploaded_documents
  conflict_rule: reviewed_extraction_overrides_raw_ocr
schema:
  document_id:
    type: string
    required: true
    pii_class: internal_identifier
  customer_id:
    type: string
    required: true
    pii_class: direct_identifier
  document_type:
    type: enum
    required: true
    allowed_values:
      - passport
      - driver_license
      - utility_bill
      - bank_statement
  extracted_fields:
    type: object
    required: true
    fields:
      full_name:
        type: string
        required: true
        confidence_min: 0.92
      date_of_birth:
        type: date
        required: true
        confidence_min: 0.95
      address:
        type: string
        required: false
        confidence_min: 0.90
  extraction_model_version:
    type: string
    required: true
  review_status:
    type: enum
    required: true
    allowed_values:
      - auto_accepted
      - human_review_required
      - human_corrected
      - rejected
semantics:
  full_name: customer legal name extracted from approved identity evidence
  review_status: operational status after automated validation and human review
freshness_slo:
  p95_minutes_from_upload_to_extraction: 15
  p99_minutes_from_review_to_profile_update: 60
quality_slo:
  critical_field_completeness_min: 0.995
  document_type_validity_min: 0.999
  reviewer_override_rate_review_threshold: 0.08
permissions:
  row_level_rule: assigned_branch_or_kyc_operations
  field_redaction:
    date_of_birth: masked_outside_kyc_and_compliance
    document_id: visible_to_support_with_case_context
allowed_ai_use:
  rag: false
  eval: true
  training: true
  routing: true
  customer_facing_generation: false
retention:
  raw_document: governed_by_kyc_record_policy
  extracted_fields: governed_by_customer_profile_policy
  eval_samples: redacted_and_reviewed
change_policy:
  additive_field: minor_version_with_consumer_notice
  semantic_change: major_version_with_risk_review
  enum_change: minor_version_with_eval_refresh
incident_triggers:
  permission_leakage: severity_0
  critical_field_completeness_below_slo: severity_1
  freshness_p95_breach_two_runs: severity_2
contract_tests:
  - schema_assertion_required_columns
  - enum_assertion_document_type
  - completeness_assertion_critical_fields
  - freshness_assertion_upload_to_extraction
  - permission_assertion_branch_scope
```
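To show how a pipeline might read such a contract, here is a minimal sketch that checks one extraction record against the enum and confidence rules from the sample. The contract fragment is copied into a Python dict to keep the example self-contained, and the record shape (each extracted field as a `value` plus `confidence` pair) is an assumption, since the sample does not define it.

```python
# Fragment of the sample contract, mirrored as a dict for a self-contained example.
CONTRACT = {
    "document_type_allowed": ["passport", "driver_license", "utility_bill", "bank_statement"],
    "extracted_fields": {
        "full_name": {"required": True, "confidence_min": 0.92},
        "date_of_birth": {"required": True, "confidence_min": 0.95},
        "address": {"required": False, "confidence_min": 0.90},
    },
}


def validate_record(record):
    """Return a list of contract issues for one extraction record (empty if clean)."""
    issues = []
    if record.get("document_type") not in CONTRACT["document_type_allowed"]:
        issues.append("document_type: value not in allowed_values")
    extracted = record.get("extracted_fields", {})
    for name, rule in CONTRACT["extracted_fields"].items():
        field = extracted.get(name)
        if field is None:
            if rule["required"]:
                issues.append(f"{name}: required field missing")
            continue
        if field["confidence"] < rule["confidence_min"]:
            issues.append(f"{name}: confidence below {rule['confidence_min']}")
    return issues


record = {
    "document_type": "passport",
    "extracted_fields": {
        "full_name": {"value": "A. Example", "confidence": 0.97},
        "date_of_birth": {"value": "1990-01-01", "confidence": 0.91},
    },
}
issues = validate_record(record)
# One possible routing rule: any issue sends the record to human review.
review_status = "human_review_required" if issues else "auto_accepted"
```

In a real pipeline the dict would be loaded from the contract registry rather than hard-coded, so that a contract version change takes effect without a code change.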
9.2 Lineage Map Template
| Node ID | Asset | Type | Owner | Contract | Quality Gate | Downstream AI Use | Evidence |
|---|---|---|---|---|---|---|---|
| src_kyc_docs | kyc_uploaded_documents | source table / object store | KYC Data Owner | kyc_document_extraction_v1 | upload integrity check | extraction, review | source checksum, upload timestamp |
| job_ocr | OCR extraction job | OpenLineage job | Data Engineering | ocr_input_contract_v1 | OCR success rate | document parsing | run id, OCR model version |
| ds_extracted | kyc_extracted_fields | curated dataset | KYC Data Platform | kyc_document_extraction_v1 | GX checkpoint kyc_extract_daily | profile update, eval | validation result, failed rows |
| job_review | human review workflow | operational job | KYC Operations | kyc_review_contract_v1 | reviewer completeness | label feedback | reviewer id hash, edit reason |
| ai_eval | kyc_extraction_eval_v2026_06 | eval dataset | AI DPM | eval_dataset_contract_v1 | provenance card approved | model release gate | rubric version, sample reasons |
```mermaid
flowchart LR
    A[Uploaded KYC Document] --> B[OCR Run]
    B --> C[Extraction Result]
    C --> D[Quality Validation]
    D --> E[Human Review]
    E --> F[Customer Profile Update]
    E --> G[Eval Dataset]
    C --> H[Training Snapshot]
    D --> I[Contract Violation Dashboard]
```
Lineage map review questions:
| Question | Pass evidence |
|---|---|
| Can every AI output be traced back to a source record and transform run? | trace_id, run_id, dataset version |
| Do high-risk fields have column-level lineage? | column transform, source field, derived rule |
| Can downstream impact analysis be run for a data change? | catalog lineage query, consumer list |
| Are quality check results visible in lineage? | validation result linked to run |
| Do human corrections become feedback lineage? | reviewer edit event, reason code |
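The downstream impact question can be illustrated with a graph traversal over the KYC lineage flowchart in this section. This is a minimal sketch: the edges are hard-coded from that flowchart with snake_case node names, whereas a real implementation would query the catalog's lineage API.

```python
from collections import deque

# Edges copied from the KYC lineage flowchart (node names converted to snake_case).
EDGES = {
    "uploaded_kyc_document": ["ocr_run"],
    "ocr_run": ["extraction_result"],
    "extraction_result": ["quality_validation", "training_snapshot"],
    "quality_validation": ["human_review", "contract_violation_dashboard"],
    "human_review": ["customer_profile_update", "eval_dataset"],
}


def downstream(node):
    """Return every asset reachable from `node`, using breadth-first search."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in EDGES.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Which assets are affected if the OCR job changes its output?
impacted = downstream("ocr_run")
```

The result for `ocr_run` includes both the training snapshot and the eval dataset, which is the consumer list a schema change approval would need.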
9.3 Quality SLO Matrix
| Data Product | SLO | Target | Measurement | Breach Severity | Automated Action | Owner |
|---|---|---|---|---|---|---|
| KYC extraction | critical field completeness | >= 99.5% | GX expectation over extracted fields | S1 | quarantine failed batch and route to human review | KYC Data Owner |
| AML labels | reviewer agreement | >= 0.85 | agreement score by typology and jurisdiction | S2 | pause new label ingestion for disputed typology | AML QA Lead |
| Credit training set | feature null rate for income | <= 1.0% | feature validation checkpoint | S1 | block training snapshot promotion | Credit Data Owner |
| Policy RAG corpus | approved active document coverage | 100% for in-scope policy set | corpus registry vs policy repository | S1 | block index promotion | Policy Owner |
| Customer service trace | trace completeness | >= 98% | spans with prompt, retrieval, model, response, feedback fields | S2 | exclude incomplete trace from eval conversion | AI Platform Owner |
| Recommendation features | event ingestion freshness P95 | <= 10 minutes | event time to feature availability | S2 | switch campaign scoring to last stable feature snapshot | Growth Data Owner |
| All high-risk AI datasets | permission assertion pass rate | 100% sampled pass | access test and redaction test | S0 | disable affected endpoint and open security incident | Security Owner |
SLO design principles:
- Tier SLOs by AI use case risk level; do not apply one global standard.
- Predefine breach actions for critical fields in the contract.
- Low-risk analytics may warn; regulated decision support must block or go human-only.
- Averages must not mask segment-level risk; slice by jurisdiction, product, channel, and customer segment.
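The last principle can be made concrete with a sliced completeness check. This is a minimal sketch: the record fields (`channel`, `income`) and the 0.95 target are illustrative values, not figures from the SLO matrix.

```python
from collections import defaultdict


def completeness_by_slice(records, field, slice_key):
    """Share of records with a non-empty `field`, computed per value of `slice_key`."""
    totals = defaultdict(int)
    present = defaultdict(int)
    for rec in records:
        key = rec[slice_key]
        totals[key] += 1
        if rec.get(field) not in (None, ""):
            present[key] += 1
    return {key: present[key] / totals[key] for key in totals}


def breaching_slices(rates, target):
    """Slices whose completeness falls below the target, sorted by name."""
    return sorted(key for key, rate in rates.items() if rate < target)


# Illustrative data: branch records are complete, mobile records are not.
records = (
    [{"channel": "branch", "income": 5000}] * 8
    + [{"channel": "mobile", "income": 4200}, {"channel": "mobile", "income": None}]
)
rates = completeness_by_slice(records, "income", "channel")
overall = sum(1 for r in records if r["income"] is not None) / len(records)
failing = breaching_slices(rates, target=0.95)
```

Here overall completeness is 0.9, which hides that the mobile channel sits at 0.5 while branch is at 1.0; only the per-slice view identifies which channel to quarantine.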
9.4 Schema Change Approval
| Section | Content |
|---|---|
| Change ID | schema_change_credit_features_2026_06_income_window |
| Requested by | Credit Data Platform |
| Affected contract | credit_feature_contract_v2.3 |
| Change type | semantic change and aggregation window change |
| Current definition | verified_income_90d_avg uses verified income records from the past 90 days |
| New definition | verified_income_180d_avg uses verified income records from the past 180 days, excluding disputed records |
| Affected consumers | credit scoring model, adverse action reason service, model monitoring, portfolio analytics |
| Downstream impact | feature distribution shift expected; adverse action reason mapping requires rerun; existing eval slices require refresh |
| Required tests | schema assertion, null rate check, distribution comparison, model backtest, fairness slice review, reason code consistency |
| Rollback | keep verified_income_90d_avg materialized for 60 days and preserve model route switch |
| Approvers | Data Owner, AI Product Architect, Model Risk, Credit Policy Owner |
| Decision | approved for shadow mode and blocked from production promotion until SLO and backtest pass |
Approval checklist:
| Check | Evidence |
|---|---|
| Contract version updated | major or minor version chosen with reason |
| Lineage impact generated | downstream AI assets listed |
| Quality tests updated | expectations and assertions changed |
| Eval and model impact reviewed | backtest and regression report available |
| Communication completed | producers and consumers informed through catalog and release notes |
| Rollback path verified | previous dataset and feature definition available |
9.5 Eval Dataset Provenance Card
| Field | Value |
|---|---|
| Eval dataset ID | aml_case_narrative_eval_v2026_06 |
| Purpose | Evaluate AML copilot case summary grounding, completeness, escalation, and policy compliance |
| Source systems | AML case management, transaction monitoring alerts, QA review outcomes |
| Source time window | cases closed from 2025-07-01 to 2026-05-31 |
| Sampling policy | stratified by typology, jurisdiction, risk tier, case complexity, historical failure mode |
| Exclusion policy | active investigations, legally restricted cases, cases with unresolved QA dispute |
| PII handling | analyst-facing eval uses masked customer identifiers; expert reviewers access source evidence through approved case system |
| Gold evidence | transaction summary, alert rationale, analyst notes, approved disposition, QA findings |
| Label source | expert AML review and audited operational outcomes |
| Reviewer model | two expert reviewers for high-risk cases, adjudicator for disagreement |
| Rubric version | aml_summary_rubric_v3 |
| Severity tags | unsupported_claim, missing_evidence, under_escalation, wrong_typology, policy_violation |
| Known limitations | closed-case distribution underrepresents newly emerging typologies; drift review monthly |
| Refresh cadence | monthly minor refresh, quarterly rubric calibration |
| Owner | AI DPM for AML Copilot |
| Risk sign-off | AML Compliance and Model Risk |
Provenance card quality bar:
- Every eval case has a source reference, expected behavior, unacceptable behavior, severity, and reviewer record.
- The eval set does not contain unauthorized copies of production PII.
- LLM-assisted labels may only assist reviewers; they must not be the sole source of a high-risk golden label.
- Eval dataset lineage links to the source case, label batch, rubric version, and release gate.
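Part of this quality bar can be checked mechanically. The sketch below assumes each eval case is a dictionary with hypothetical keys (`source_reference`, `label_sources`, `risk_tier`, and so on); the key names and the `"llm_assisted"` marker are illustrative, not a defined schema.

```python
# Hypothetical field names mirroring the first quality-bar item.
REQUIRED_FIELDS = (
    "source_reference",
    "expected_behavior",
    "unacceptable_behavior",
    "severity",
    "reviewer_record",
)


def check_eval_case(case: dict) -> list[str]:
    """Return the quality-bar problems found for one eval case."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not case.get(field)]
    label_sources = set(case.get("label_sources", []))
    # LLM-assisted labels may assist reviewers but cannot be the only source
    # of a high-risk golden label.
    if case.get("risk_tier") == "high" and label_sources == {"llm_assisted"}:
        problems.append("high-risk golden label relies only on LLM-assisted labels")
    return problems


good_case = {
    "source_reference": "case-0001",
    "expected_behavior": "cite alert rationale",
    "unacceptable_behavior": "unsupported_claim",
    "severity": "under_escalation",
    "reviewer_record": "reviewer-a, reviewer-b",
    "risk_tier": "high",
    "label_sources": ["expert_review", "llm_assisted"],
}
bad_case = {"source_reference": "case-0002", "risk_tier": "high", "label_sources": ["llm_assisted"]}
```

A check like this only covers the structural part of the bar; the PII and lineage items still need access tests and catalog evidence.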
9.6 RAG Corpus Freshness Policy
| Policy area | Rule |
|---|---|
| Corpus ID | retail_policy_rag_corpus |
| Source-of-truth | approved policy repository, not shared drive, email attachment, or draft folder |
| Inclusion rule | document status must be approved, effective date active, jurisdiction and business unit assigned |
| Exclusion rule | draft, expired, superseded, legal-review-only, no-RAG-use documents |
| Freshness SLO | approved policy change indexed within 4 hours; emergency regulatory bulletin indexed within 60 minutes |
| Version handling | old version remains retrievable only for historical-date questions and is excluded from current-policy answers |
| Chunking policy | section-aware chunks with page, section, paragraph, table row, effective date and policy owner metadata |
| Embedding policy | embedding model version recorded; re-embed triggered by model change, chunking change, or major policy format change |
| Permission policy | role, region, business unit and document classification filter applied before context assembly |
| Citation policy | answer must cite current active policy version and section; conflicting policies trigger escalation message |
| Staleness action | if corpus freshness SLO breached, customer-facing answers disabled and employee-facing answers display stale-source warning |
| Audit | every answer stores corpus_id, index_version, doc_ids, chunk_ids, effective_date, retrieval filters |
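The inclusion, freshness, and staleness rules in this policy can be sketched as three small functions. The document fields (`status`, `effective_date`, `expiry_date`, `jurisdiction`, `business_unit`), the change-type keys, and the action strings are hypothetical names chosen for illustration; only the 4-hour and 60-minute targets come from the policy table.

```python
from datetime import date, datetime, timedelta

# Freshness targets from the policy table; the dictionary keys are assumptions.
FRESHNESS_SLO = {
    "policy_change": timedelta(hours=4),
    "emergency_bulletin": timedelta(minutes=60),
}


def is_includable(doc: dict, today: date) -> bool:
    """Inclusion rule: approved, currently effective, jurisdiction and business unit assigned."""
    expiry = doc.get("expiry_date")  # optional field, assumed for the "expired" exclusion
    return (
        doc.get("status") == "approved"
        and doc["effective_date"] <= today
        and (expiry is None or today < expiry)
        and bool(doc.get("jurisdiction"))
        and bool(doc.get("business_unit"))
    )


def freshness_breached(approved_at, indexed_at, change_type, now) -> bool:
    """True if the change was indexed late, or is still unindexed past its deadline."""
    deadline = approved_at + FRESHNESS_SLO[change_type]
    effective_index_time = indexed_at if indexed_at is not None else now
    return effective_index_time > deadline


def staleness_action(breached: bool, audience: str) -> str:
    """Map a freshness breach to the staleness action for each audience."""
    if not breached:
        return "serve"
    if audience == "customer":
        return "disable_customer_answers"
    return "serve_with_stale_source_warning"
```

For example, a policy approved at 09:00 and indexed at 12:00 is within the 4-hour target, while an emergency bulletin approved at 09:00 and still unindexed at 10:30 is a breach.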
9.7 Data Incident Playbook
| Phase | Action | Owner | Evidence |
|---|---|---|---|
| Detect | contract violation, quality SLO breach, lineage gap, drift alert, wrong AI output, access failure | Monitoring Owner | alert id, dataset, contract, time window |
| Classify | assign severity based on data scope, AI output impact, customer impact, regulatory impact | AI DPM + Risk | severity decision log |
| Contain | block pipeline, quarantine batch, roll back feature/RAG index, disable affected route, switch to human review | Data Eng + MLOps | containment action record |
| Assess impact | identify downstream models, eval sets, RAG corpora, traces, reports, customers, analysts | Data Architect | lineage impact report |
| Repair | fix data, rerun validation, rebuild asset, refresh eval or labels, re-run release gate | Data Owner + MLOps | repair run id, validation result |
| Communicate | notify business, risk, compliance, security, affected product owners | Incident Lead | communication log |
| Decide restart | verify SLO pass, downstream impact cleared, risk sign-off completed | AI Product Architect | restart approval |
| Review | root cause, missed detection, control improvement, contract update, training update | Governance Board | post-incident review |
Severity examples:
| Severity | Trigger | Required response |
|---|---|---|
| S0 | unauthorized PII exposed through RAG context or trace logs | immediate endpoint disablement, security incident, legal/privacy review |
| S1 | stale policy corpus causes wrong customer-facing answer | disable customer-facing response path, notify compliance, rebuild corpus |
| S2 | credit feature freshness breach affects scoring shadow pipeline | block model promotion and rerun validation |
| S3 | non-critical metadata owner missing for low-risk analytics dataset | catalog correction within governance cadence |
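The severity examples can be turned into a first-pass triage rule. This is a simplified sketch: the three boolean inputs and the first-match ordering are assumptions, and a real classification would also weigh data scope and regulatory impact as the Classify phase requires.

```python
# Required responses copied from the severity table above.
REQUIRED_RESPONSE = {
    "S0": "immediate endpoint disablement, security incident, legal/privacy review",
    "S1": "disable customer-facing response path, notify compliance, rebuild corpus",
    "S2": "block model promotion and rerun validation",
    "S3": "catalog correction within governance cadence",
}


def classify_severity(pii_exposed: bool, customer_facing_impact: bool, affects_ai_pipeline: bool) -> str:
    """First-match triage: the most severe applicable trigger wins (hypothetical inputs)."""
    if pii_exposed:
        return "S0"
    if customer_facing_impact:
        return "S1"
    if affects_ai_pipeline:
        return "S2"
    return "S3"


severity = classify_severity(
    pii_exposed=False, customer_facing_impact=True, affects_ai_pipeline=True
)
response = REQUIRED_RESPONSE[severity]
```

The output is a starting point for the severity decision log, not a replacement for the AI DPM and Risk judgment.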
9.8 Contract Testing Matrix
| Test | Example | Tooling pattern | Block condition |
|---|---|---|---|
| Schema required columns | customer_id, document_type, review_status exist | DataHub schema assertion, OpenMetadata contract, GX expectation | missing critical column |
| Type and enum validation | document_type in approved values | GX expectation, dbt test, OpenMetadata no-code test | invalid enum in production batch |
| Freshness validation | policy doc indexed within 4 hours | DataHub freshness assertion, custom checkpoint | freshness breach for high-risk RAG |
| Completeness validation | credit income feature null rate <= 1% | GX checkpoint with action | critical feature breach |
| Semantic metadata validation | owner, domain, description, PII tag present | OpenMetadata semantics and governance checks | high-risk dataset lacks owner or PII tag |
| Permission validation | analyst cannot retrieve cases outside assignment | integration test against retrieval endpoint | unauthorized access possible |
| Downstream compatibility | prompt parser, feature loader, eval loader pass contract version | CI consumer tests | consumer fails against new contract |
| Lineage completeness | run emits source, input, output, contract and quality facets | OpenLineage event validation | missing lineage for regulated AI asset |
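The first few rows of the matrix (required columns, enum validation, completeness) can be sketched in plain Python without any of the listed tools. This is not the DataHub, OpenMetadata, or Great Expectations API; the function name, argument shapes, and sample batch are assumptions for illustration.

```python
def run_contract_tests(rows, required_columns, enum_rules, max_null_rate):
    """Return a list of contract violations for one batch of dict rows."""
    violations = []
    columns = set().union(*(row.keys() for row in rows)) if rows else set()

    # Schema required columns
    for col in required_columns:
        if col not in columns:
            violations.append(f"missing critical column: {col}")

    # Type and enum validation (enum part only)
    for col, allowed in enum_rules.items():
        bad = sorted(
            {row[col] for row in rows if row.get(col) is not None and row[col] not in allowed}
        )
        if bad:
            violations.append(f"invalid enum in {col}: {bad}")

    # Completeness validation
    for col, limit in max_null_rate.items():
        nulls = sum(1 for row in rows if row.get(col) is None)
        rate = nulls / len(rows) if rows else 0.0
        if rate > limit:
            violations.append(f"null rate {rate:.2%} for {col} exceeds {limit:.2%}")
    return violations


# Hypothetical batch; column names follow the examples in the matrix.
batch = [
    {"customer_id": "c1", "document_type": "passport", "review_status": "approved", "verified_income": 100.0},
    {"customer_id": "c2", "document_type": "selfie", "review_status": "pending", "verified_income": None},
]
violations = run_contract_tests(
    batch,
    required_columns=["customer_id", "document_type", "review_status"],
    enum_rules={"document_type": {"passport", "national_id"}},
    max_null_rate={"verified_income": 0.01},
)
```

A pipeline could treat any non-empty result as the block condition for a high-risk dataset and as a warning for a low-risk one.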
10. 30-Day Training Plan
| Day | Topic | Deliverable |
|---|---|---|
| 1 | Select a retail financial AI use case; define its risk tier and AI data products | one-page use case data map |
| 2 | Inventory sources of truth, system owners, data consumers, and allowed AI use | source inventory |
| 3 | Write the first AI data contract draft for the core data product | contract draft |
| 4 | Break schema, semantics, security, quality, SLA, and terms of use into assertions | assertion catalog |
| 5 | Design producer contract tests and consumer contract tests | contract testing matrix |
| 6 | Draw the source -> curated -> AI asset lineage map | lineage map |
| 7 | Design OpenLineage events and AI-specific facets | lineage event spec |
| 8 | Define the metadata product: domain, owner, classification, glossary, data product | metadata product canvas |
| 9 | Design a catalog policy following the DataHub or OpenMetadata approach | catalog governance policy |
| 10 | Design a Great Expectations / assertion suite for one data product | quality validation suite |
| 11 | Build a quality SLO matrix split into low / medium / high risk | SLO matrix |
| 12 | Design the schema evolution policy and approval workflow | schema change approval template |
| 13 | Walk through contract + lineage + SLO using the KYC document extraction scenario | KYC case artifact pack |
| 14 | Design label governance using the AML case labels scenario | label rubric and provenance card |
| 15 | Design training dataset lineage using credit model training data | training data lineage card |
| 16 | Design corpus freshness and citation policy using the RAG policy corpus | RAG corpus freshness policy |
| 17 | Design a trace contract and feedback taxonomy using customer service trace data | trace data contract |
| 18 | Design a feature drift monitor using recommendation feature data | feature drift report format |
| 19 | Design the eval dataset provenance card and golden set refresh cadence | eval provenance card |
| 20 | Design feature/data drift, label drift, and corpus drift metrics | drift metric catalog |
| 21 | Design the data incident severity matrix and containment actions | data incident playbook |
| 22 | Run a schema breaking change tabletop exercise | impact analysis memo |
| 23 | Run a stale RAG corpus incident tabletop exercise | RAG incident report |
| 24 | Run a label dispute and adjudication exercise | label governance decision log |
| 25 | Map NIST AI RMF Govern / Map / Measure / Manage to artifacts | AI data risk control matrix |
| 26 | Design release gates: contract, lineage, quality, drift, incident readiness | release gate checklist |
| 27 | Prepare the architecture selection rationale for DataHub / OpenMetadata / OpenLineage / GX | tooling ADR |
| 28 | Turn one case into a portfolio narrative: problem -> architecture -> controls -> outcomes | portfolio story |
| 29 | Practice 10 interview questions, each in a 30-second and a 2-minute version | interview answer sheet |
| 30 | Run a full retrospective: artifacts, risks, gaps, and areas to deepen next | 30-day capstone pack |
Minimum bar for training deliverables:
- At least one complete data contract.
- At least one lineage map covering source, transform, quality, AI asset, and runtime trace.
- At least one quality SLO matrix with breach actions.
- At least one eval dataset provenance card.
- At least one RAG corpus freshness policy.
- At least one data incident playbook.
- At least one schema change approval.
- At least one tooling ADR.
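11. Interview Questions and Answers
Q1: How do AI data contracts fundamentally differ from a traditional data dictionary?
30-second version: A data dictionary explains fields; an AI data contract commits to how AI consumers can use the data reliably, securely, and compliantly. It covers schema, semantics, quality SLOs, freshness, permissions, allowed AI use, change process, and incident triggers, and it should be automatically verifiable through contract tests.
2-minute version: In an AI system, a data change affects more than reports: it also affects model training, RAG citations, eval results, agent routing, and customer or employee decisions. A traditional data dictionary is usually a descriptive asset and cannot stop a schema breaking change, a superseded policy entering the corpus, label definition drift, or a permission leak. An AI data contract is an executable commitment between producer and AI consumer: the producer commits to structure, semantics, refresh, quality, and permitted use; the consumer declares how it uses the data and how it degrades on breach. In retail finance, for example, if KYC document extraction fields have no contract, a model may write a low-confidence OCR address into the customer profile; with a contract, completeness, confidence, review_status, and human review all become release gates.
Q2: How would you design AI lineage so that it supports incident review?
30-second version: I would design lineage across six layers: source, pipeline, contract, quality, AI asset, and runtime trace. Every model output or RAG answer can be traced back to source records, transform run, contract version, quality result, index/model/eval version, and user-visible evidence.
2-minute version: Table-level lineage alone is not enough. AI incidents often occur at the field, chunk, feature, label, or prompt level. I would use OpenLineage to record pipeline runs, input/output datasets, and facets, and link them in the metadata catalog to DataHub/OpenMetadata datasets, columns, jobs, contracts, and owners. For RAG, lineage records source doc, effective date, chunk id, embedding model, index version, retrieval filters, and citations. For training, it records source snapshot, feature definitions, label source, and sampling policy. For eval, it records gold evidence, rubric, reviewer, and severity. That way a stale policy, wrong label, schema break, or permission leak can be traced through lineage impact analysis to the affected models, answers, and customer processes.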
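Lineage impact analysis of the kind described in Q2 reduces to a downstream traversal over lineage edges. The sketch below uses a breadth-first search over a hypothetical edge list; the asset names are illustrative, and a real system would read the edges from the catalog or from OpenLineage events rather than a hard-coded list.

```python
from collections import deque


def downstream_assets(edges, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical lineage edges (upstream, downstream).
edges = [
    ("policy_repo", "retail_policy_rag_corpus"),
    ("retail_policy_rag_corpus", "index_v12"),
    ("index_v12", "answer_trace_001"),
    ("kyc_source", "kyc_features"),
]
impacted = downstream_assets(edges, "policy_repo")
```

Here a problem in `policy_repo` reaches the corpus, the index, and one answer trace, while the unrelated KYC branch is left out of the impact report.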
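Q3: Why does schema evolution need to be stricter in AI projects?
30-second version: An AI consumer can silently absorb a change in field semantics instead of failing immediately like an ordinary API. When the field name stays the same but the definition changes, the result can be training data bias, distorted evals, wrong RAG citations, and feature drift.
2-minute version: I would classify schema evolution into additive, rename, type change, enum expansion, semantic change, nullability change, and aggregation window change. Adding a field is generally a minor version, but semantic changes, field removal, and aggregation window changes must be a major version and trigger downstream impact analysis, contract tests, model/eval rerun, and risk review. For example, when credit verified_income_90d_avg moves to a 180-day window, the field type is unchanged, but the model feature distribution and adverse action reasons both change. The right approach is dual-run, backfill validation, drift comparison, shadow mode, and an explicit rollback.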
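The versioning rule in Q3 can be written down as a small classifier. Only the mappings stated in the answer are encoded (additive as minor; semantic change, field removal, and aggregation window change as major); every other change type is routed to manual review, which is an assumption of this sketch. The function names and the `major.minor` version format are also hypothetical.

```python
# Mappings stated in the answer above.
MINOR_CHANGES = {"additive"}
MAJOR_CHANGES = {"semantic_change", "field_removal", "aggregation_window_change"}


def required_bump(change_types):
    """Return 'major', 'minor', or 'needs_review' for a set of change types."""
    kinds = set(change_types)
    if kinds & MAJOR_CHANGES:
        return "major"
    if kinds and kinds <= MINOR_CHANGES:
        return "minor"
    # rename, type change, enum expansion, nullability change: not decided here
    return "needs_review"


def next_version(version: str, level: str) -> str:
    """Bump a 'major.minor' contract version string."""
    major, minor = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0"
    if level == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError(f"cannot bump version for level {level!r}")


# The income window change combines two major-level change types.
level = required_bump(["semantic_change", "aggregation_window_change"])
new_version = next_version("2.3", level)
```

Applied to a contract at version 2.3, the income window change yields a major bump, which is what should trigger the downstream impact analysis and reruns.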
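Q4: How do you distinguish training data lineage, eval dataset lineage, and RAG corpus lineage?
30-second version: Training data lineage proves what the model learned; eval lineage proves whether the test samples and gold evidence are trustworthy; RAG corpus lineage proves which version of authoritative knowledge an answer retrieved. All three need source and version, but the governance focus differs.
2-minute version: Training data focuses on source snapshot, feature definitions, label source, sampling, exclusion, PII handling, and leakage control. Eval data focuses on scenario source, gold evidence, expected behavior, reviewer, rubric, severity, and refresh cadence. The RAG corpus focuses on source document, approval status, effective date, expiry date, chunking policy, embedding model, index version, permission filters, and citation support. In financial scenarios, mixing them causes problems: customer service logs may be usable for eval sampling but not necessarily for training; expired policies may serve historical questions but cannot support current-policy answers.
Q5: How do you set data quality SLOs?
30-second version: Derive SLOs backward from AI use case risk and business consequences. High-risk scenarios get hard-stop metrics, such as 100% permission correctness, critical field completeness >= 99.5%, and 100% coverage of current policy in RAG. Low-risk scenarios can use warn plus manual repair.
2-minute version: I would not set only a generic null rate. AI data quality SLOs cover at least freshness, completeness, validity, consistency, accuracy, coverage, access correctness, and trace completeness. KYC extraction tracks critical field completeness, confidence, and reviewer override rate; AML labels track reviewer agreement, dispute rate, and outcome lag; the RAG policy corpus tracks approved active document coverage, index freshness, and citation correctness; recommendation features track event freshness, identity match, and feature drift. Every SLO needs a measurement, target, severity, automated action, and owner.
Q6: After feature/data drift is detected, how should the product manager decide?
30-second version: First judge whether the drift affects customer rights, compliance obligations, or key business decisions. Low risk can be monitored and explained; medium and high risk require degrading, pausing automation, switching to human review, rerunning evals, or triggering model/data repair.
2-minute version: Drift is not a single model metric. Feature drift may come from channel changes, upstream schema changes, or real business change; data drift may come from new document types, new customer segments, or new policies; label drift may come from changes in review criteria; corpus drift may come from policy updates. The product decision depends on the cause of the drift and the scope of its impact. For example, drift in a credit income feature affects approval support and should go to model risk review; a policy corpus freshness breach affects customer service answers, so customer-visible answers should be disabled and a human-confirmation notice displayed; event drift in a recommendation system affects marketing, so the team can roll back to a stable feature snapshot and pause the related campaigns.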
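Q7: What roles do DataHub, OpenMetadata, OpenLineage, and Great Expectations each play in the architecture?
30-second version: OpenLineage records lineage events; DataHub/OpenMetadata serve as the metadata catalog and governance control plane; Great Expectations handles data quality expectations, validation, and checkpoints. They are not mutually exclusive tools but different layers of the control plane.
2-minute version: OpenLineage is suited to collecting job, run, dataset, and facets from the pipeline runtime, so the lineage graph carries execution evidence. DataHub and OpenMetadata manage data assets, owners, domains, glossary, classification, contracts, lineage, quality signals, and governance workflow. Great Expectations turns data assumptions into runnable Expectation Suites, Validation Definitions, Validation Results, and Checkpoints, and sends notifications or updates documentation through actions. In a mature architecture, the contract is defined in the catalog, validation is executed by GX or built-in assertions, results are written back to the catalog, and lineage records every run and data artifact.
Q8: Why do AML case labels need label governance?
30-second version: AML labels affect training, eval, risk narrative, and model release decisions. Without label definition, reviewer, jurisdiction, evidence, rubric, and disagreement handling, the model learns wrong or inconsistent investigation conclusions.
2-minute version: AML's suspicious_activity_confirmed is not an ordinary label; it can represent different investigation stages, jurisdictions, and QA criteria. Label governance must specify ontology, positive/negative criteria, case status, SAR decision, typology, reviewer role, evidence reference, timestamp, rubric version, and adjudication. It must also monitor inter-annotator agreement, label drift, dispute rate, and case outcome lag. A high-risk label should not be generated by an LLM as the sole ground truth; an LLM can assist with pre-classification or produce a reviewer draft, but the final golden label must come from an expert or from an audited operational outcome.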
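Inter-annotator agreement between two reviewers is often measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. Kappa is one common choice and is used here as an assumption; the answer above does not name a specific metric, and the label values below are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same cases in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of cases")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both reviewers independently pick the same label.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two expert reviewers on four cases.
reviewer_1 = ["confirmed", "confirmed", "cleared", "cleared"]
reviewer_2 = ["confirmed", "cleared", "cleared", "cleared"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this example the reviewers agree on three of four cases (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. The cases they disagree on are the ones that would go to the adjudicator.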
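Q9: If the RAG policy corpus cites an expired policy, how do you handle it?
30-second version: Stop the bleeding first: disable the affected customer-visible answers or switch to human confirmation. Then use lineage to find the source doc, chunk, index version, retrieval filter, and affected traces. After the fix, rebuild the index, rerun the citation eval, and update the freshness policy and incident controls.
2-minute version: I would follow the data incident playbook. Step 1, classify severity: is it customer-visible, does it affect compliance obligations, did it cause wrong execution. Step 2, containment: take the affected corpus/index offline and switch the route policy to the current approved index or human-only. Step 3, impact analysis: use corpus lineage to find the expired doc version, chunk ids, answers, users, and business units. Step 4, repair: fix the source approval/expiry metadata, re-chunk, re-embed, and re-index, then run freshness, permission, citation support, and regression evals. Step 5, post-incident review: why the policy repo approval event did not trigger a reindex, why the freshness SLO did not block it, and how the contract and alerts should be updated.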
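Step 3 of this answer can be sketched as a filter over stored answer traces. The trace structure below (`trace_id`, `business_unit`, `citations` with `doc_id` and `doc_version`) is a hypothetical shape based on the audit fields the freshness policy asks every answer to store; it is not a defined log format.

```python
def affected_traces(traces, stale_doc_versions):
    """Return traces whose citations include any (doc_id, doc_version) known to be stale."""
    stale = set(stale_doc_versions)
    hits = []
    for trace in traces:
        cited = {(c["doc_id"], c["doc_version"]) for c in trace["citations"]}
        overlap = cited & stale
        if overlap:
            hits.append(
                {
                    "trace_id": trace["trace_id"],
                    "business_unit": trace["business_unit"],
                    "stale_citations": sorted(overlap),
                }
            )
    return hits


# Hypothetical stored traces.
traces = [
    {
        "trace_id": "t-001",
        "business_unit": "cards",
        "citations": [{"doc_id": "fee_policy", "doc_version": "v3"}],
    },
    {
        "trace_id": "t-002",
        "business_unit": "loans",
        "citations": [{"doc_id": "fee_policy", "doc_version": "v4"}],
    },
]
hits = affected_traces(traces, [("fee_policy", "v3")])
```

The resulting list names the answers and business units to notify, and only works if every answer stored its document and version references at serving time.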
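Q10: How do you turn a metadata product into a portfolio highlight?
30-second version: Do not just show catalog screenshots; show how metadata drives AI controls: RAG permission filter, contract testing, lineage impact analysis, quality SLO, drift dashboard, incident response, and release gate.
2-minute version: The portfolio can pick one retail financial AI use case, such as KYC extraction or policy RAG. Show the full chain: data product canvas, data contract, lineage map, quality SLO matrix, schema change approval, eval provenance card, RAG freshness policy, and incident playbook. Then use a concrete incident drill to show the value of metadata: after an old policy enters the corpus, the catalog's effective_date, approval_status, lineage, index_version, and trace let the team locate the impact, disable the faulty path, repair it, and prove recovery. This demonstrates AI Product Architect and Data Architect capability, not simple data tidying.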
12. Go-Live Self-Check
| Check | Pass condition |
|---|---|
| Contract | every high-risk AI data product has an active contract, owner, allowed use, and change policy |
| Contract tests | producer, consumer, and change impact tests are wired into CI or the pipeline |
| Lineage | source -> transform -> quality -> AI asset -> runtime trace is traceable |
| Metadata | owner, domain, classification, glossary, PII, retention, and access policy are registered |
| Quality SLO | critical metrics have a target, measurement, owner, and breach action |
| Drift | feature/data/label/corpus/schema/usage drift has metrics and decision rules |
| Training data | snapshot, feature definition, label source, sampling, exclusion, and PII handling are reproducible |
| Eval data | provenance card, gold evidence, reviewer, rubric, severity, and refresh cadence are complete |
| RAG corpus | approval status, effective date, chunking, index version, permission filters, and citation policy are complete |
| Label governance | label definition, reviewer model, agreement metric, adjudication, and versioning are complete |
| Incident response | severity, containment, impact analysis, repair, restart, and post-incident review have been exercised |
| Governance | RACI, release gate, risk sign-off, and audit evidence pack can be presented |
13. Reference Links
- OpenLineage Docs: https://openlineage.io/docs/
- OpenLineage Object Model: https://openlineage.io/docs/spec/object-model/
- OpenLineage Facets: https://openlineage.io/docs/spec/facets/
- DataHub Data Contracts: https://docs.datahub.com/docs/generated/metamodel/entities/datacontract
- DataHub Lineage API Tutorial: https://docs.datahub.com/docs/api/tutorials/lineage
- OpenMetadata Data Contracts: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-contracts
- OpenMetadata Data Lineage: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-lineage
- OpenMetadata Data Quality Observability: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-quality-observability
- Great Expectations GX Core Overview: https://docs.greatexpectations.io/docs/core/introduction/gx_overview/
- Great Expectations Checkpoints with Actions: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/create_a_checkpoint_with_actions/
- NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/
- NIST AI RMF Generative AI Profile: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence