AI Data Contracts / Lineage / Quality Playbook

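Positioning: an advanced data governance and productization playbook for AI Data Product Managers, AI Product Architects, Data Architects, AI Governance, and Risk Tech.
Goal: connect AI data contracts, lineage, metadata product, schema evolution, data quality SLOs, drift, label governance, and data incident response into an enterprise AI data control plane that can be shipped, audited, and operated.
Core view: AI data governance is not "data cleaning" plus "field descriptions"; it gives each AI use case a data product capability that can be contracted, traced, tested, changed, held accountable, and reviewed after the fact.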

Source Anchors

These sources serve as methodological anchors; they do not replace the formal judgment of internal legal, compliance, model risk, privacy, and architecture review.

Anchor	Link	How this playbook uses it
OpenLineage Docs	https://openlineage.io/docs/	Turn job, run, dataset, facet, and runtime lineage events into an observability design for AI data pipelines.
OpenLineage Object Model	https://openlineage.io/docs/spec/object-model/	Distinguish runtime lineage, design-time job metadata, and dataset metadata to support the evidence chain for training, eval, RAG, and feature pipelines.
OpenLineage Facets	https://openlineage.io/docs/spec/facets/	Use facets to extend AI metadata such as source code, schema, quality, model, prompt, RAG corpus, and label batch.
DataHub Data Contracts	https://docs.datahub.com/docs/generated/metamodel/entities/datacontract	Express a contract as schema, freshness, quality, and SLA assertions, and wire it into CI/CD and quality tools.
DataHub Lineage	https://docs.datahub.com/docs/api/tutorials/lineage	Reference for how lineage is expressed at table level, column level, data job, dashboard, and chart.
OpenMetadata Data Contracts	https://docs.open-metadata.org/v1.13.x/how-to-guides/data-contracts	Organize a contract around schema, semantics, security, quality assertions, SLA, terms of use, and status.
OpenMetadata Data Lineage	https://docs.open-metadata.org/v1.13.x/how-to-guides/data-lineage	Reference for visual lineage and impact analysis across table, column, pipeline, dashboard, and ML model.
OpenMetadata Quality Observability	https://docs.open-metadata.org/v1.13.x/how-to-guides/data-quality-observability	Bring tests, profiler, alerts, incident manager, and anomaly detection into the data operations loop.
Great Expectations GX Core	https://docs.greatexpectations.io/docs/core/introduction/gx_overview/	Use Expectation Suite, Validation Definition, Validation Result, and Checkpoint to build data quality testing and reporting.
Great Expectations Checkpoints	https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/create_a_checkpoint_with_actions/	Turn validation results into notifications, Data Docs, custom actions, and release gates.
NIST AI RMF Core	https://airc.nist.gov/airmf-resources/airmf/5-sec-core/	Organize AI data risk governance around Govern / Map / Measure / Manage.
NIST AI RMF GenAI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Turn data sourcing, evaluation, monitoring, and risk response across the GenAI lifecycle into artifacts.
This repo: AI Data Product Management Playbook	`docs/AI_DATA_PRODUCT_MANAGEMENT_PLAYBOOK.md`	This playbook expands on contract, metadata, lineage, quality SLO, and feedback loop.
This repo: AI Requirements-to-Eval Cookbook	`docs/AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md`	Connect data contracts and lineage to eval contract, release gate, monitoring, and incident loop.

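1. Positioning: Advanced Capabilities of the AI Data Control Plane

This playbook does not repeat basic BA, traditional data warehouse, or general data governance concepts. It covers the four capabilities that AI products and architectures need once they reach production:

Capability	Advanced question	Deliverable evidence
AI data contracts	Does the data an AI workflow depends on carry commitments on schema, semantics, quality, permissions, refresh, purpose, and change?	data contract, contract tests, schema change approval
Lineage	Can a model output, RAG answer, eval result, or feature value be traced back to source records, transformations, versions, permissions, and quality checks?	lineage map, OpenLineage event, impact analysis
Metadata product	Has metadata moved from a passive catalog to the control plane of the AI runtime?	metadata product canvas, catalog policy, ownership model
Quality operations	Can quality degradation, drift, label disputes, stale corpora, and schema breaks be detected, blocked, triaged, and reviewed?	quality SLO matrix, incident playbook, drift dashboard

In one sentence:

An AI data product is not "data a model can use"; it is a data capability that answers to its AI consumers through a contract.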

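2. Why It Matters

Many enterprise AI incidents look like model problems on the surface, but the root cause is a data product out of control.

Surface problem	Underlying data root cause	Business risk
KYC document extraction frequently maps values to the wrong field	No contract for document type, OCR version, field semantics, or confidence threshold	Customers asked to resubmit documents, account opening delays, regulatory spot checks that are hard to explain
AML copilot case summaries reach skewed conclusions	Case labels lack source, reviewer, timestamp, jurisdiction, and disagreement records	Wrong risk narrative, missed investigation steps, distorted model evaluation
Credit model performance drops suddenly	Drift in upstream application channel, income field, rejection reason, or customer segment	Unfair outcomes, lower approval quality, model risk escalation
RAG policy assistant cites an outdated policy	No SLO for corpus version, effective date, retirement date, or index refresh	Misleading customer service, incorrect execution by employees, compliance risk
Customer service conversation traces cannot be replayed	No unified trace across prompt, retrieval, tool call, human edit, and user feedback	Incidents cannot be located, golden sets cannot be built
Recommendation system conversion fluctuates	Schema or distribution changes in feature definition, window definition, attribution events, or experiment bucketing	Mistargeted marketing, worse customer experience, distorted revenue attribution

The AI data control plane is not about single-point data quality; it is about end-to-end provability across source, pipeline, metadata, contract, eval, runtime, feedback, and incident.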

3. Capability Map

Layer	Key objects	Core responsibility	Mature deliverables
Source layer	core banking, CRM, KYC, AML, policy repo, call center, feature events	Establish source-of-truth, owner, permissions, retention, and business semantics	source inventory, ownership map
Contract layer	AI data contract, schema assertion, freshness assertion, usage policy	Let producer and AI consumer agree on machine-testable commitments for availability, quality, and purpose	data contract, contract testing suite
Metadata product layer	DataHub / OpenMetadata, glossary, domain, classification, data product	Make metadata the control plane for discovery, governance, runtime filtering, and audit	metadata product canvas, catalog policy
Lineage layer	OpenLineage event, column lineage, feature lineage, RAG corpus lineage	Record the evidence chain and impact scope from source system to AI output	lineage map, impact analysis report
Quality SLO layer	Great Expectations, OpenMetadata tests, DataHub assertions	Turn completeness, freshness, accuracy, consistency, validity, coverage, and access correctness into SLOs	quality SLO matrix, quality dashboard
Drift layer	feature drift, data drift, label drift, policy drift, corpus freshness drift	Detect distribution changes across training, production, eval, and retrieval corpora	drift report, retraining or recrawl decision
Incident layer	data incident, AI incident, contract violation, schema break	Unify severity grading, containment, root cause, repair, review, and evidence preservation	data incident playbook, post-incident review
Governance layer	RACI, approval, release gate, model risk, privacy, audit	Embed data accountability in the AI lifecycle	governance RACI, release evidence pack

4. Reference Architecture

flowchart LR
  S1[KYC / CRM / Core Banking] --> I[Ingestion and Transformation]
  S2[AML Case System] --> I
  S3[Credit LOS / Bureau / Ledger] --> I
  S4[Policy Repository] --> RAG[RAG Corpus Builder]
  S5[Contact Center / Trace Store] --> I
  S6[Digital Events / Feature Logs] --> F[Feature Pipeline]

  I --> OL[OpenLineage Events]
  F --> OL
  RAG --> OL

  OL --> M[Metadata Platform: DataHub / OpenMetadata]
  M --> C[AI Data Contracts]
  C --> Q[Quality Validation: GX / Assertions / Tests]

  Q --> DP[AI Data Products]
  DP --> FS[Feature Store]
  DP --> TS[Training Dataset Registry]
  DP --> ES[Eval Dataset Registry]
  DP --> KB[RAG Corpus Registry]
  DP --> LS[Label Store]

  FS --> AI[AI Services / Models / Agents]
  TS --> AI
  ES --> Gate[Eval and Release Gate]
  KB --> AI
  LS --> Gate

  AI --> T[AI Trace and Feedback]
  T --> M
  T --> Incident[Data and AI Incident Response]
  Incident --> C
  Incident --> Q

4.1 Control Plane

The AI data control plane includes at least six hubs:

Hub	Responsibility	Must-have capability
Metadata catalog	discovery, owner, domain, classification, glossary, usage, lineage	DataHub or OpenMetadata as the governance entry point
Contract registry	data contract, schema, SLO, allowed AI use, change policy	versioning, approval, contract testing, violation status
Lineage backend	evidence chain across job, run, dataset, column, feature, corpus, eval sample	OpenLineage event, impact analysis, trace linkage
Quality runner	expectation, assertion, checkpoint, anomaly detection	Great Expectations, OpenMetadata tests, DataHub assertions
AI asset registry	training set, eval set, RAG corpus, feature set, label set	provenance card, version, risk tier, owner
Incident workflow	breach detection, severity, containment, root cause, postmortem	Linked handling of data incidents and AI output incidents

4.2 Runtime Instrumentation

OpenLineage is well suited to recording the runtime facts of pipeline jobs, but AI data pipelines need extended facets. An event from a retail financial AI pipeline should be able to express at least:

Facet	Fields	Purpose
source facet	source_system, source_record_count, extract_window, jurisdiction	Explain where the data came from and what it covers
contract facet	contract_id, contract_version, assertion_status, breaking_change_flag	Connect producer commitment to pipeline status
quality facet	validation_suite_id, checkpoint_id, failed_expectations, severity	Bring quality results into the lineage graph
model data facet	training_dataset_id, feature_set_id, snapshot_time, sampling_policy	Trace training data
eval data facet	eval_dataset_id, golden_set_version, reviewer_pool, rubric_version	Trace eval data
RAG corpus facet	corpus_id, document_version, effective_date, index_version, chunking_policy	Trace the retrieval corpus
label facet	label_batch_id, label_source, review_status, inter_annotator_agreement	Manage label quality and provenance
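
As a rough illustration of the facet table, the sketch below assembles an OpenLineage-style run event as a plain dictionary and attaches a contract facet and a quality facet. It does not use the OpenLineage client library. The facet keys `ai_contract` and `ai_quality`, the producer URI, the facet schema URLs, the `kyc` namespace, and the spec version in `schemaURL` are all assumptions made for the example; only the field names inside the facets come from the table.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/ai-data-platform"    # hypothetical producer URI
FACET_SCHEMA_BASE = "https://example.com/schemas"    # hypothetical schema location


def custom_facet(name: str, fields: dict) -> dict:
    """Wrap facet fields with the _producer and _schemaURL keys facets carry."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": f"{FACET_SCHEMA_BASE}/{name}.json",
        **fields,
    }


def build_run_event(job_name: str, inputs: list, outputs: list,
                    contract: dict, quality: dict) -> dict:
    """Build a COMPLETE run event with AI-specific run facets."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Assumed spec version; check it against the version you target.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                "ai_contract": custom_facet("ai_contract", contract),
                "ai_quality": custom_facet("ai_quality", quality),
            },
        },
        "job": {"namespace": "kyc", "name": job_name},
        "inputs": [{"namespace": "kyc", "name": n} for n in inputs],
        "outputs": [{"namespace": "kyc", "name": n} for n in outputs],
    }


event = build_run_event(
    job_name="kyc_extraction",
    inputs=["kyc_uploaded_documents"],
    outputs=["kyc_extracted_fields"],
    contract={"contract_id": "kyc_document_extraction_v1", "contract_version": "1.0.0",
              "assertion_status": "passed", "breaking_change_flag": False},
    quality={"validation_suite_id": "kyc_extract_suite", "checkpoint_id": "kyc_extract_daily",
             "failed_expectations": [], "severity": None},
)
print(json.dumps(event, indent=2))
```

Keeping the contract and quality status on the run, rather than in a separate system, is what lets a later lineage query answer "which run produced this data, and did it pass its contract".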

4.3 Release Gates

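An AI data release gate checks not only model scores but also whether the data contract can be relied on.

Gate	Pass conditions	Action on failure
G1 Contract Gate	contract active, owner assigned, allowed AI use defined, schema assertions pass	Block entry into the production pipeline
G2 Lineage Gate	Queryable lineage exists for source -> curated -> feature/training/eval/RAG -> AI output	Not allowed into high-risk use cases
G3 Quality Gate	critical SLOs pass, zero severe quality breaches	Degrade, roll back, or route to human review
G4 Drift Gate	feature/data drift within thresholds, label drift explained	Pause automated decision support or trigger re-evaluation
G5 Incident Readiness Gate	severity, owner, communication, containment, and postmortem processes have been rehearsed	Delay expansion of the user base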
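
One way to make the gate table executable is sketched below. The evidence keys (`contract_ok`, `lineage_queryable`, and so on) are hypothetical names for whatever upstream checks produce; the gate names and failure actions follow the table. Missing evidence is treated as a failure, which is an assumption that suits high-risk use cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    evidence_key: str      # hypothetical key in the evidence dictionary
    failure_action: str


GATES = [
    Gate("G1 Contract Gate", "contract_ok", "block production pipeline"),
    Gate("G2 Lineage Gate", "lineage_queryable", "exclude from high-risk use cases"),
    Gate("G3 Quality Gate", "critical_slo_ok", "degrade, roll back, or human review"),
    Gate("G4 Drift Gate", "drift_within_threshold", "pause automation or re-evaluate"),
    Gate("G5 Incident Readiness Gate", "incident_drill_done", "delay user expansion"),
]


def failed_gates(evidence: dict) -> list:
    """Return (gate name, failure action) for every gate that does not pass.

    A gate with no evidence at all counts as failed.
    """
    return [
        (g.name, g.failure_action)
        for g in GATES
        if not evidence.get(g.evidence_key, False)
    ]


all_green = {g.evidence_key: True for g in GATES}
partial = {k: v for k, v in all_green.items() if k != "drift_within_threshold"}

print(failed_gates(all_green))
print(failed_gates(partial))
```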

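5. Architecture Patterns

5.1 AI Data Contract as API Boundary

Treat the data contract as the API boundary of the AI system. Models, RAG, eval, feature pipelines, and agent tools do not trust tables or files directly; they trust only data products that pass the contract.

Contract area	AI-specific requirement	Retail financial example
Schema	field types, required fields, enums, nested structure, nullable policy	`document_type` allows only passport, driver_license, utility_bill, bank_statement
Semantics	field meaning, time basis, units, status definitions, business priority	`case_status=closed` means the investigation has ended; it does not mean a SAR was filed
Freshness	maximum delay from source event to visibility for AI consumption	KYC document extraction results reach the review queue within 15 minutes at P95
Quality	completeness, validity, consistency, accuracy, coverage	critical identity field completeness >= 99.5%
Permission	field-level, row-level, retrieval entitlement, trace redaction	Customer service can retrieve public policies but not AML investigation notes
Allowed AI use	boundaries for RAG, eval, training, routing, analytics, monitoring	Customer service conversations may be used for quality analysis and eval sampling, but do not go directly into the training set
Change policy	schema evolution, deprecation, backfill, dual-run, consumer approval	A definition change to a credit feature requires rerunning validation and a model risk impact review
Incident trigger	which violations trigger shutdown, degradation, or human review	If the policy corpus freshness breach exceeds 24 hours, RAG may answer only with a prompt for human confirmation

A mature design is not "the contract exists in a document"; it is a contract that CI, pipelines, the catalog, the quality runner, and the release gate can read automatically.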
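
To show what a machine-readable contract can look like at the boundary, here is a minimal sketch that validates one record against the schema area of a contract. The `CONTRACT` dictionary is a trimmed, hypothetical structure; a real registry would load it from a versioned file and cover the other contract areas as well.

```python
from typing import Any

# Hypothetical, trimmed-down contract; the enum values mirror the schema example.
CONTRACT = {
    "document_type": {
        "required": True,
        "allowed_values": {"passport", "driver_license", "utility_bill", "bank_statement"},
    },
    "customer_id": {"required": True},
    "address": {"required": False},
}


def validate_record(record: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the record passes."""
    violations = []
    for field, rule in contract.items():
        value: Any = record.get(field)
        if value is None:
            if rule.get("required"):
                violations.append(f"{field}: missing required field")
            continue
        allowed = rule.get("allowed_values")
        if allowed is not None and value not in allowed:
            violations.append(f"{field}: value {value!r} not in allowed enum")
    return violations


good = {"document_type": "passport", "customer_id": "c-001"}
bad = {"document_type": "selfie"}

print(validate_record(good, CONTRACT))
print(validate_record(bad, CONTRACT))
```

A consumer that calls a check like this before using a record is trusting the contract rather than the table, which is the point of the pattern.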

5.2 Metadata Product as AI Control Plane

The job of a metadata product is not to collect descriptions but to let the AI runtime use metadata for filtering, routing, explanation, and audit.

Metadata type	AI runtime use	Example
Business metadata	scenario routing, user intent mapping, metric slicing	product_line, risk_type, customer_segment
Technical metadata	freshness, schema version, job run, source checksum	ingestion_time, contract_version, index_version
Governance metadata	PII, consent, retention, policy classification	PII.Sensitive, retention_7y, no_model_training
Quality metadata	validation result, failure history, SLO status	expectation_suite_id, quality_score, breach_count
Lineage metadata	source records, transform jobs, downstream AI assets	source_table -> feature_set -> training_dataset
Eval metadata	scenario, severity, rubric, expected behavior	AML_typology, critical_failure, expert_reviewed
Feedback metadata	human edit, override reason, accepted suggestion	wrong_policy, missing_evidence, tone_issue

AI Product Architects should design the metadata platform as an active metadata layer: RAG filter, model routing, access control, incident triage, impact analysis, and release evidence all read from it.

5.3 OpenLineage for Runtime and Design-Time Evidence

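The OpenLineage job, run, and dataset model is suited to turning scattered pipelines into a unified lineage graph. AI scenarios need to record all of the following:

Lineage type	Recorded objects	Key question
Runtime lineage	each pipeline run, input dataset, output dataset, status, timestamp	Which run produced the data the model used
Design-time job metadata	job source code location, declared inputs and outputs, owner, code version	If the code changes, which AI assets are affected
Dataset metadata	schema, owner, documentation, classification, contract	What are the definition and governance status of the data itself
Column lineage	field-level transformations, derived fields, sensitive field propagation	Which source fields does `risk_score` come from
AI asset lineage	feature set, training set, eval set, RAG corpus, label batch	Which versions do the model and the answer depend on

Practice principles:

Emit an event at every key pipeline boundary: source extract, curation, quality validation, feature materialization, training snapshot, eval build, RAG indexing.
Keep column-level lineage for high-risk fields such as income, risk_rating, customer_type, and adverse_action_reason.
For the RAG corpus, record document id, version, effective date, approval status, chunking policy, embedding model, and index version.
For the eval dataset, record query source, gold evidence, reviewer, rubric, risk tier, and sample inclusion reason.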

5.4 Schema Evolution without AI Breakage

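Schema evolution is more dangerous in AI systems than in ordinary reporting, because a model can silently absorb wrong semantics.

Change type	Compatibility	Handling	AI risk
Additive field	Usually compatible	Contract adds the field; consumers may opt in	New fields need purpose approval before entering prompts or training
Rename field	High risk	Set up alias, dual-run, deprecation window, consumer migration	Prompts, feature code, and eval parsers fail silently
Type change	High risk	Contract test blocks the change; migration plan; backfill validation	Numbers treated as strings, date parsing errors
Enum expansion	Medium-high risk	New enum values go into the glossary and routing policy; update eval cases	Model has not learned the new state; rule fallback misses it
Semantic change	Highest risk	New contract major version, impact analysis, model/eval rerun	Field name unchanged but business meaning changed
Nullability change	Medium-high risk	Recompute completeness SLO; consumers fail fast	Summaries missing fields; bias from missing model inputs
Aggregation window change	High risk	versioned metric / feature definition, drift comparison	Feature distribution shifts; online and offline values diverge

Versioning strategy:

Version	Applies to	Example
patch	documentation text, owner, non-behavioral metadata corrections	`kyc_doc_contract_v1.0.1`
minor	compatible field additions, added quality tests, added labels	`credit_feature_contract_v1.1`
major	field removal, rename, semantic change, SLO downgrade, change in usage boundary	`aml_label_contract_v2`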
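
The versioning table can be turned into a small rule, sketched below. The change-type identifiers (`field_removed`, `doc_fix`, and so on) are hypothetical labels for the rows of the table. Change types the table does not list, such as a type change, default to a major bump here; that default is a conservative assumption, not something the table states.

```python
MAJOR = {"field_removed", "field_renamed", "semantic_change",
         "slo_downgrade", "usage_boundary_change"}
MINOR = {"field_added", "quality_test_added", "label_added"}
PATCH = {"doc_fix", "owner_update", "metadata_fix"}


def required_bump(changes: set) -> str:
    """Pick the highest version level demanded by a set of change types."""
    if not changes:
        raise ValueError("no changes supplied")
    unknown = changes - (MAJOR | MINOR | PATCH)
    if changes & MAJOR or unknown:   # unknown change types are treated as major
        return "major"
    if changes & MINOR:
        return "minor"
    return "patch"


def bump(version: str, level: str) -> str:
    """Apply a major/minor/patch bump to a 'MAJOR.MINOR.PATCH' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")


print(bump("1.0.1", required_bump({"field_added", "doc_fix"})))
print(bump("1.1.0", required_bump({"field_added", "semantic_change"})))
```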

5.5 Data Quality SLO as Product Commitment

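Great Expectations, OpenMetadata tests, and DataHub assertions should all serve one goal: make data quality an SLO of the AI product, not an after-the-fact report.

SLO dimension	AI-specific measure	Release relevance
Freshness	delay from source event to feature/RAG/eval availability	Keeps stale policies, account states, and labels out of AI
Completeness	coverage of critical fields, documents, and labels	Prevents model output based on missing evidence
Validity	type, range, enum, format, business rules	Prevents input parsing errors
Consistency	consistent definitions across systems, tables, and versions	Prevents conflicting customer, case, and product states
Accuracy	reconciliation against source-of-truth or human review	Prevents factual errors
Coverage	coverage of scenarios, segments, and edge samples	Prevents biased eval or training samples
Access correctness	permission filtering, masking, purpose limitation	Prevents data leakage
Trace completeness	whether each AI request has source, version, prompt, retrieval, tool, review	Supports incident review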

5.6 Feature/Data Drift and Policy Drift

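AI data drift should be handled in layers according to how the data is consumed.

Drift type	Detection target	Triggering problem	Product action
Feature drift	numerical / categorical feature distribution	Online application channel shifts, abnormal income field distribution	Trigger model performance review and feature owner review
Data drift	source records, document types, conversation topics	New KYC document types, changes in customer complaint topics	Extend the contract, update the parser, add eval cases
Label drift	label distribution, reviewer agreement, case outcome lag	AML disposition criteria change, inconsistent standards across review teams	Recalibrate label rubric and adjudication
Corpus drift	policy docs, effective dates, approval status	Old policies not retired, new policies not ingested	recrawl, reindex, RAG freshness warning
Schema drift	columns, types, nested structure, enum	Upstream ships a breaking change	contract test fail-fast
Usage drift	AI consumers, new downstream consumers, prompt changes	Data used for unapproved training or external sharing	purpose limitation review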
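
The table does not prescribe a drift metric. As one common option, the sketch below computes a Population Stability Index (PSI) over a categorical distribution, for example the mix of KYC document types in a baseline window versus a current window. PSI is a swapped-in illustration, and the 0.25 threshold is a widely used rule of thumb rather than a value taken from this playbook.

```python
import math
from collections import Counter


def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical samples.

    PSI = sum over categories of (a - e) * ln(a / e), where e and a are the
    category shares in the baseline and current samples. Shares are floored
    at eps so that a category missing from one side does not divide by zero.
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(expected) | set(actual):
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


baseline = ["passport"] * 50 + ["utility_bill"] * 50
current = ["passport"] * 20 + ["utility_bill"] * 50 + ["bank_statement"] * 30

drift_score = psi(baseline, current)
needs_review = drift_score > 0.25   # rule-of-thumb threshold, an assumption
print(round(drift_score, 3), needs_review)
```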

5.7 Training / Eval / RAG Corpus Lineage

Training data, eval data, and RAG corpora are all called data, but their governance goals differ.

Asset	Required lineage fields	Quality focus	Typical incident
Training dataset lineage	source snapshot, feature definition, label source, sampling, exclusion, PII handling	representativeness, leakage, label validity, feature consistency	Training set contaminated with future information or non-compliant data
Eval dataset lineage	scenario source, gold evidence, expected behavior, reviewer, rubric, severity	coverage, review quality, rubric stability, regression value	Eval set does not match production risk
RAG corpus lineage	source doc, approval status, effective date, chunking, embedding, index version	freshness, authority, permission, citation granularity	Citing an old policy or an unapproved draft

5.8 Label Governance

Labels are decision assets in AI risk systems, not ordinary fields.

Label area	Governance requirement	Retail financial example
Label definition	label ontology, positive/negative criteria, exclusions	Definition and exclusions for AML `suspicious_activity_confirmed`
Label source	expert, operation outcome, customer feedback, LLM-assisted, synthetic	SAR filing, case closure, QA outcome
Reviewer model	reviewer role, training, dual review, adjudicator	Dual review for high-risk AML cases
Agreement metric	inter-annotator agreement, conflict rate, review latency	AML typology label agreement >= 0.85
Versioning	label rubric version, policy version, jurisdiction	Labeling version changes after BSA/AML rule updates
Leakage control	whether labels reference future information; timestamp of the human conclusion	Credit default labels must not leak information known only after approval
Auditability	label event, reviewer, timestamp, evidence	Reproducible for regulatory or model risk review
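
The agreement row gives a threshold of 0.85 without naming the metric. As one possible choice, the sketch below computes Cohen's kappa for two reviewers; using kappa here is an assumption, and the label values are made up for illustration.

```python
def cohens_kappa(reviewer_a: list, reviewer_b: list) -> float:
    """Cohen's kappa for two reviewers labelling the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate
    and p_e is the agreement expected by chance from each reviewer's label mix.
    """
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("both reviewers must label the same non-empty case list")
    n = len(reviewer_a)
    p_o = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n
    labels = set(reviewer_a) | set(reviewer_b)
    p_e = sum((reviewer_a.count(l) / n) * (reviewer_b.count(l) / n) for l in labels)
    if p_e == 1.0:          # both reviewers used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


a = ["structuring", "structuring", "clean", "clean"]
b = ["structuring", "clean", "clean", "clean"]

kappa = cohens_kappa(a, b)
meets_bar = kappa >= 0.85   # threshold from the agreement metric row
print(kappa, meets_bar)
```

In practice the score would be computed per typology and jurisdiction, so that one well-agreed category does not hide a disputed one.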

5.9 Contract Testing

Contract testing should cover three layers:

Test layer	Test content	When it runs
Producer contract tests	schema, semantic metadata, quality assertions, freshness, row/column rules	pipeline build, release, daily run
Consumer contract tests	dependencies of prompt parser, feature code, eval loader, and RAG indexer on the contract	AI service build, model retrain, index rebuild
Change impact tests	schema evolution, downstream lineage, model/eval/RAG assets impact	pull request, catalog contract change, source release

For high-risk AI use cases, a contract test failure should block the production release outright; sending a notification is not enough.

6. Retail Financial Case Blueprints

Scenario	Data pipeline	Key contract	Lineage focus	Quality / drift focus
KYC document extraction pipeline	upload -> OCR -> extraction -> validation -> review queue -> customer profile	document type, field confidence, PII classification, review status, human corrections	document version, OCR model, extractor version, reviewer edit	extraction accuracy, critical field completeness, stale document detection
AML case labels	transaction monitoring -> alert -> case investigation -> disposition -> label store -> eval/training	label definition, jurisdiction, case status, reviewer, evidence completeness	label batch, case evidence, SAR decision, QA sample	label agreement, case outcome lag, typology distribution drift
Credit model training data	application -> bureau -> income verification -> feature pipeline -> training snapshot	feature definition, time window, leakage exclusion, allowed use	source snapshot, feature code, model dataset version	feature drift, missing income, adverse action reason coverage
RAG policy corpus versions	policy repo -> approval -> chunk -> embed -> index -> answer citation	approval status, effective date, expiry date, jurisdiction, role access	source doc, chunk id, embedding model, index version	corpus freshness, citation correctness, permission correctness
Customer service conversation trace data	conversation -> intent -> retrieval/tool -> answer -> human edit -> feedback	consent, PII redaction, logging purpose, retention, feedback labels	request trace, prompt version, retrieval docs, tool calls, human edits	trace completeness, unsafe response, policy drift, topic drift
Recommendation system feature data	event stream -> identity resolution -> feature aggregation -> model scoring -> campaign response	event schema, identity match, window definition, opt-out policy	event source, feature window, experiment assignment	feature drift, event loss, consent change, conversion label lag

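6.1 KYC Document Extraction

Architecture focus:

Define a document contract per document class: document type, field set, confidence thresholds, acceptable missing fields, human review rules.
OCR and extractor versions go into lineage; human corrections flow back as feedback labels.
Identity fields such as name, date of birth, ID number, and address must pass the critical field SLO before entering the customer profile.
Document validity periods and regional rules go into metadata; they must not live only in prompts or manual SOPs.

Key deliverables:

kyc_document_extraction_v1 data contract.
Lineage map for source document -> OCR run -> extraction result -> validation -> reviewer edit.
SLO matrix for extraction accuracy, field completeness, review override rate, and cycle time.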

6.2 AML Case Labels

Architecture focus:

AML labels must distinguish alert outcome, case disposition, SAR filed, typology, and QA finding; they must not be merged into one vague label field.
Labels must record reviewer role, decision timestamp, evidence references, rubric version, and jurisdiction.
Training and eval use different label snapshots and make case closure lag explicit, so that future conclusions do not leak into model training.

Key deliverables:

AML label governance rubric.
label batch provenance card.
label drift dashboard: typology mix, agreement, dispute rate, late outcome update.

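6.3 Credit Model Training Data

Architecture focus:

Feature definitions must be versioned, including time window, aggregation definition, exclusion rules, and source priority.
Fields such as income, employment, bureau score, and existing relationship must have a source-of-truth and a reconciliation rule.
adverse action reason, reject reason, and model reason code must be traceable in lineage to features and rules.

Key deliverables:

credit feature contract.
training dataset lineage card.
feature/data drift report and model risk impact memo.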

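6.4 RAG Policy Corpus Versions

Architecture focus:

Policy documents must be in approved status before entering the RAG index, and carry effective date, expiry date, jurisdiction, business unit, and role access.
chunking policy, embedding model, and index version must go into corpus lineage.
Citations must trace back to the document version and paragraph-level evidence, and conflicts between old and new policies must be detectable.

Key deliverables:

RAG corpus freshness policy.
policy corpus lineage map.
citation support and stale source incident playbook.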

6.5 Customer Service Conversation Trace Data

Architecture focus:

Each conversation trace must link user intent, prompt version, retrieved evidence, tool call, model version, human edit, and feedback.
Conversation logs must complete consent, PII redaction, purpose limitation, and retention checks before entering eval or training.
User dissatisfaction and human rewrites must be classified as factual error, missing evidence, tone issue, policy issue, or tool issue.

Key deliverables:

customer service trace contract.
trace completeness SLO.
feedback-to-eval conversion policy.

6.6 Recommendation System Feature Data

Architecture focus:

Event schema, identity resolution, feature window, campaign assignment, and conversion label each need their own contract.
opt-out, consent, and sensitive segment exclusion must be hard controls in the feature pipeline.
Feature drift and event loss must be read together with campaign performance, not model AUC or CTR alone.

Key deliverables:

feature contract and event contract.
recommendation feature lineage map.
feature/data drift monitor and campaign rollback rule.

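7. Product Decision Framework

Decision	Options	Recommended judgment
Contract orientation	producer-owned, consumer-specific, hybrid	Use producer-owned for highly reused core data; high-risk AI consumers may add consumer-specific assertions
Enforcement mode	warn, quarantine, block	Use block on critical breaches in high-risk compliance, credit, and customer-impacting scenarios
Metadata platform	DataHub, OpenMetadata, or both	Follow the organization's existing ecosystem and automation capability; do not treat the catalog as a static wiki
Lineage granularity	table, column, record, chunk, feature	High-risk fields, RAG citation, and model training need column/chunk/feature level
Schema evolution	backward compatible, versioned break, dual-run	Semantic changes and field removal require major version + impact analysis
Quality tooling	GX, OpenMetadata tests, DataHub assertions, dbt tests	Express uniformly at the contract layer; execution tools can be combined
Drift action	monitor, review, retrain, rollback, human-only	When drift affects customer rights or compliance obligations, degrade first and assess afterward
Label source	expert, operational, LLM-assisted, synthetic	High-risk labels rely mainly on expert or audited operational labels
RAG recrawl	time-based, event-based, approval-based	Policy corpora center on approval event and effective date
Incident severity	data only, AI output affected, customer/regulatory affected	Escalate to high severity as soon as customer decisions, regulatory obligations, or permission leakage are involved

ADR question checklist:

ADR section	Question
Context	Which AI use case depends on this data, and what is its risk tier
Decision	How are the boundaries of contract, lineage, quality, drift, and incident defined
Alternatives	What falls short if we do only catalog, only quality tests, or only pipeline monitoring
Consequences	What is the impact on release speed, data producer cost, audit evidence, and AI quality
Review trigger	Which schema, policy, model, regulatory, or incident events trigger an ADR review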

8. Governance and Operating Model

8.1 RACI

Artifact / Activity	AI DPM	AI Product Architect	Data Architect	Data Owner	Risk / Compliance	MLOps / Data Eng	Security / Privacy
AI data contract	A	C	R	A	C	R	C
Lineage map	C	A	R	C	C	R	C
Quality SLO matrix	A	C	R	A	C	R	C
Schema change approval	C	A	R	A	C	R	C
Eval dataset provenance card	A	C	C	C	R	R	C
RAG corpus freshness policy	A	C	R	A	C	R	C
Label governance rubric	A	C	C	R	A	R	C
Data incident response	A	A	R	R	A	R	R

8.2 NIST AI RMF Mapping

Function	Data control plane action	Evidence
Govern	Define ownership, policy, allowed use, RACI, approval, third-party data controls	governance charter, RACI, policy mapping
Map	Identify AI use case, source-of-truth, data flows, risk tier, affected stakeholders	data flow map, lineage map, risk tier memo
Measure	Establish contract tests, quality SLO, drift metrics, label agreement, trace completeness	validation results, drift report, quality dashboard
Manage	Run release gates, incident response, rollback, data repair, post-incident review	release evidence pack, incident report, change log

8.3 Governance Cadence

Cadence	Participants	Input	Output
Weekly	AI DPM, Data Owner, Data Eng, MLOps	contract violations, quality breach, schema changes	repair actions, release blockers
Monthly	AI Product Architect, Risk, Compliance, Privacy	drift trends, incident trends, label disputes, RAG freshness	control updates, eval refresh decisions
Quarterly	Governance board, Model Risk, Security, Business Owner	high-risk AI portfolio, audit findings, SLO history	risk acceptance, funding, decommission decisions
Event-driven	Incident responders	data breach, schema break, wrong output, policy corpus stale	containment, customer/regulatory impact assessment

9. Deployable Deliverable Templates

9.1 Data Contract Template

Below is a complete sample for kyc_document_extraction_v1 that can serve as a structural template for AI data contracts.

contract_id: kyc_document_extraction_v1
contract_version: 1.0.0
status: active
domain: retail_banking_kyc
producer:
  team: kyc_data_platform
  owner: kyc_data_owner
  steward: kyc_operations_steward
consumers:
  - kyc_document_extraction_service
  - kyc_review_queue
  - kyc_quality_eval_harness
source_of_truth:
  system: enterprise_document_management
  source_dataset: kyc_uploaded_documents
  conflict_rule: reviewed_extraction_overrides_raw_ocr
schema:
  document_id:
    type: string
    required: true
    pii_class: internal_identifier
  customer_id:
    type: string
    required: true
    pii_class: direct_identifier
  document_type:
    type: enum
    required: true
    allowed_values:
      - passport
      - driver_license
      - utility_bill
      - bank_statement
  extracted_fields:
    type: object
    required: true
    fields:
      full_name:
        type: string
        required: true
        confidence_min: 0.92
      date_of_birth:
        type: date
        required: true
        confidence_min: 0.95
      address:
        type: string
        required: false
        confidence_min: 0.90
  extraction_model_version:
    type: string
    required: true
  review_status:
    type: enum
    required: true
    allowed_values:
      - auto_accepted
      - human_review_required
      - human_corrected
      - rejected
semantics:
  full_name: customer legal name extracted from approved identity evidence
  review_status: operational status after automated validation and human review
freshness_slo:
  p95_minutes_from_upload_to_extraction: 15
  p99_minutes_from_review_to_profile_update: 60
quality_slo:
  critical_field_completeness_min: 0.995
  document_type_validity_min: 0.999
  reviewer_override_rate_review_threshold: 0.08
permissions:
  row_level_rule: assigned_branch_or_kyc_operations
  field_redaction:
    date_of_birth: masked_outside_kyc_and_compliance
    document_id: visible_to_support_with_case_context
allowed_ai_use:
  rag: false
  eval: true
  training: true
  routing: true
  customer_facing_generation: false
retention:
  raw_document: governed_by_kyc_record_policy
  extracted_fields: governed_by_customer_profile_policy
  eval_samples: redacted_and_reviewed
change_policy:
  additive_field: minor_version_with_consumer_notice
  semantic_change: major_version_with_risk_review
  enum_change: minor_version_with_eval_refresh
incident_triggers:
  permission_leakage: severity_0
  critical_field_completeness_below_slo: severity_1
  freshness_p95_breach_two_runs: severity_2
contract_tests:
  - schema_assertion_required_columns
  - enum_assertion_document_type
  - completeness_assertion_critical_fields
  - freshness_assertion_upload_to_extraction
  - permission_assertion_branch_scope
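
To illustrate how a runtime might consume the `allowed_ai_use` and `incident_triggers` sections of the sample contract, here is a minimal sketch. The two dictionaries are transcribed by hand from the sample rather than parsed from YAML, the function names are hypothetical, and reading `severity_0` as the most severe level is an assumption based on its use for permission leakage.

```python
# Transcribed from the allowed_ai_use section of the sample contract.
ALLOWED_AI_USE = {
    "rag": False,
    "eval": True,
    "training": True,
    "routing": True,
    "customer_facing_generation": False,
}

# Transcribed from incident_triggers; the integer is the severity level.
INCIDENT_TRIGGERS = {
    "permission_leakage": 0,
    "critical_field_completeness_below_slo": 1,
    "freshness_p95_breach_two_runs": 2,
}


def is_use_allowed(purpose: str) -> bool:
    """Deny by default: a purpose the contract does not list is not allowed."""
    return ALLOWED_AI_USE.get(purpose, False)


def highest_severity(violations: list):
    """Return the most severe level among known violations, or None.

    Assumes a lower number means a more severe incident.
    """
    levels = [INCIDENT_TRIGGERS[v] for v in violations if v in INCIDENT_TRIGGERS]
    return min(levels) if levels else None


print(is_use_allowed("eval"), is_use_allowed("rag"), is_use_allowed("fine_tuning"))
print(highest_severity(["freshness_p95_breach_two_runs", "permission_leakage"]))
```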

9.2 Lineage Map Template

Node ID	Asset	Type	Owner	Contract	Quality Gate	Downstream AI Use	Evidence
`src_kyc_docs`	`kyc_uploaded_documents`	source table / object store	KYC Data Owner	`kyc_document_extraction_v1`	upload integrity check	extraction, review	source checksum, upload timestamp
`job_ocr`	OCR extraction job	OpenLineage job	Data Engineering	`ocr_input_contract_v1`	OCR success rate	document parsing	run id, OCR model version
`ds_extracted`	`kyc_extracted_fields`	curated dataset	KYC Data Platform	`kyc_document_extraction_v1`	GX checkpoint `kyc_extract_daily`	profile update, eval	validation result, failed rows
`job_review`	human review workflow	operational job	KYC Operations	`kyc_review_contract_v1`	reviewer completeness	label feedback	reviewer id hash, edit reason
`ai_eval`	`kyc_extraction_eval_v2026_06`	eval dataset	AI DPM	`eval_dataset_contract_v1`	provenance card approved	model release gate	rubric version, sample reasons

flowchart LR
  A[Uploaded KYC Document] --> B[OCR Run]
  B --> C[Extraction Result]
  C --> D[Quality Validation]
  D --> E[Human Review]
  E --> F[Customer Profile Update]
  E --> G[Eval Dataset]
  C --> H[Training Snapshot]
  D --> I[Contract Violation Dashboard]

Lineage map review questions:

Question	Pass evidence
Can every AI output be traced back to a source record and transform run	trace_id, run_id, dataset version
Do high-risk fields have column-level lineage	column transform, source field, derived rule
Can a data change be run through downstream impact analysis	catalog lineage query, consumer list
Are quality check results visible in lineage	validation result linked to run
Do human corrections become feedback lineage	reviewer edit event, reason code
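
The downstream impact question can be sketched as a graph traversal. The edge list below is copied from the KYC flowchart in this section; a real implementation would query the lineage backend or catalog instead of a hard-coded list, and the function name is hypothetical.

```python
from collections import deque

# Edges transcribed from the KYC lineage flowchart (upstream, downstream).
EDGES = [
    ("Uploaded KYC Document", "OCR Run"),
    ("OCR Run", "Extraction Result"),
    ("Extraction Result", "Quality Validation"),
    ("Quality Validation", "Human Review"),
    ("Human Review", "Customer Profile Update"),
    ("Human Review", "Eval Dataset"),
    ("Extraction Result", "Training Snapshot"),
    ("Quality Validation", "Contract Violation Dashboard"),
]


def downstream(node: str, edges: list) -> set:
    """Breadth-first search for every asset reachable from the given node."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([node])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


impacted = downstream("Extraction Result", EDGES)
print(sorted(impacted))
```

Run against a proposed change to the extraction result, the output is the consumer list that a schema change approval would need to notify.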

9.3 Quality SLO Matrix

Data Product	SLO	Target	Measurement	Breach Severity	Automated Action	Owner
KYC extraction	critical field completeness	>= 99.5%	GX expectation over extracted fields	S1	quarantine failed batch and route to human review	KYC Data Owner
AML labels	reviewer agreement	>= 0.85	agreement score by typology and jurisdiction	S2	pause new label ingestion for disputed typology	AML QA Lead
Credit training set	feature null rate for income	<= 1.0%	feature validation checkpoint	S1	block training snapshot promotion	Credit Data Owner
Policy RAG corpus	approved active document coverage	100% for in-scope policy set	corpus registry vs policy repository	S1	block index promotion	Policy Owner
Customer service trace	trace completeness	>= 98%	spans with prompt, retrieval, model, response, feedback fields	S2	exclude incomplete trace from eval conversion	AI Platform Owner
Recommendation features	event ingestion freshness P95	<= 10 minutes	event time to feature availability	S2	switch campaign scoring to last stable feature snapshot	Growth Data Owner
All high-risk AI datasets	permission assertion pass rate	100% sampled pass	access test and redaction test	S0	disable affected endpoint and open security incident	Security Owner

SLO design principles:

Tier SLOs by AI use case risk level; do not apply one global standard.
Predefine breach actions for critical fields in the contract.
Low-risk analytics may warn; regulated decision support must block or go human-only.
Averages must not mask segment risk; slice by jurisdiction, product, channel, and customer segment.
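
The last principle is easy to demonstrate with numbers. In the sketch below, overall completeness meets the 99.5% target while one jurisdiction falls well short. The row data, jurisdiction codes, and function names are made up for illustration.

```python
from collections import defaultdict


def completeness(rows: list, field: str) -> float:
    """Share of rows in which the field is present (not None)."""
    return sum(r.get(field) is not None for r in rows) / len(rows)


def breaching_slices(rows: list, slice_key: str, field: str, target: float) -> list:
    """Return the slice values whose completeness falls below the target."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[slice_key]].append(row)
    return sorted(k for k, g in groups.items() if completeness(g, field) < target)


# Synthetic data: 1000 US rows with 1 gap, 100 SG rows with 3 gaps.
rows = (
    [{"jurisdiction": "US", "full_name": "x"}] * 999
    + [{"jurisdiction": "US", "full_name": None}]
    + [{"jurisdiction": "SG", "full_name": "x"}] * 97
    + [{"jurisdiction": "SG", "full_name": None}] * 3
)

TARGET = 0.995
overall_ok = completeness(rows, "full_name") >= TARGET
print(overall_ok, breaching_slices(rows, "jurisdiction", "full_name", TARGET))
```

Here the global check passes at roughly 99.6% while the SG slice sits at 97%, which is the case the slicing principle is meant to catch.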

9.4 Schema Change Approval

Section	Content
Change ID	`schema_change_credit_features_2026_06_income_window`
Requested by	Credit Data Platform
Affected contract	`credit_feature_contract_v2.3`
Change type	semantic change and aggregation window change
Current definition	`verified_income_90d_avg` uses verified income records from the past 90 days
New definition	`verified_income_180d_avg` uses verified income records from the past 180 days and excludes disputed records
Affected consumers	credit scoring model, adverse action reason service, model monitoring, portfolio analytics
Downstream impact	feature distribution shift expected; adverse action reason mapping requires rerun; existing eval slices require refresh
Required tests	schema assertion, null rate check, distribution comparison, model backtest, fairness slice review, reason code consistency
Rollback	keep `verified_income_90d_avg` materialized for 60 days and preserve model route switch
Approvers	Data Owner, AI Product Architect, Model Risk, Credit Policy Owner
Decision	approved for shadow mode and blocked from production promotion until SLO and backtest pass

Approval checklist:

Check	Evidence
Contract version updated	major or minor version chosen with reason
Lineage impact generated	downstream AI assets listed
Quality tests updated	expectations and assertions changed
Eval and model impact reviewed	backtest and regression report available
Communication completed	producers and consumers informed through catalog and release notes
Rollback path verified	previous dataset and feature definition available

9.5 Eval Dataset Provenance Card

Field	Value
Eval dataset ID	`aml_case_narrative_eval_v2026_06`
Purpose	Evaluate AML copilot case summary grounding, completeness, escalation, and policy compliance
Source systems	AML case management, transaction monitoring alerts, QA review outcomes
Source time window	cases closed from 2025-07-01 to 2026-05-31
Sampling policy	stratified by typology, jurisdiction, risk tier, case complexity, historical failure mode
Exclusion policy	active investigations, legally restricted cases, cases with unresolved QA dispute
PII handling	analyst-facing eval uses masked customer identifiers; expert reviewers access source evidence through approved case system
Gold evidence	transaction summary, alert rationale, analyst notes, approved disposition, QA findings
Label source	expert AML review and audited operational outcomes
Reviewer model	two expert reviewers for high-risk cases, adjudicator for disagreement
Rubric version	`aml_summary_rubric_v3`
Severity tags	unsupported_claim, missing_evidence, under_escalation, wrong_typology, policy_violation
Known limitations	closed-case distribution underrepresents newly emerging typologies; drift review monthly
Refresh cadence	monthly minor refresh, quarterly rubric calibration
Owner	AI DPM for AML Copilot
Risk sign-off	AML Compliance and Model Risk

Provenance card quality bar:

Every eval case has a source reference, expected behavior, unacceptable behavior, severity, and reviewer record.
The eval set does not include unauthorized copies of production PII.
LLM-assisted labels may only assist reviewers; they must not be the sole source of a high-risk golden label.
Eval dataset lineage connects to source case, label batch, rubric version, and release gate.

9.6 RAG Corpus Freshness Policy

Policy area	Rule
Corpus ID	`retail_policy_rag_corpus`
Source-of-truth	approved policy repository, not shared drive, email attachment, or draft folder
Inclusion rule	document status must be approved, effective date active, jurisdiction and business unit assigned
Exclusion rule	draft, expired, superseded, legal-review-only, no-RAG-use documents
Freshness SLO	approved policy change indexed within 4 hours; emergency regulatory bulletin indexed within 60 minutes
Version handling	old version remains retrievable only for historical-date questions and is excluded from current-policy answers
Chunking policy	section-aware chunks with page, section, paragraph, table row, effective date and policy owner metadata
Embedding policy	embedding model version recorded; re-embed triggered by model change, chunking change, or major policy format change
Permission policy	role, region, business unit and document classification filter applied before context assembly
Citation policy	answer must cite current active policy version and section; conflicting policies trigger escalation message
Staleness action	if corpus freshness SLO breached, customer-facing answers disabled and employee-facing answers display stale-source warning
Audit	every answer stores corpus_id, index_version, doc_ids, chunk_ids, effective_date, and retrieval filters
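
The Freshness SLO and Staleness action rows translate directly into a check that can run on a schedule. This is a minimal sketch under stated assumptions: the document records are plain dicts, and the `kind` labels, the function names, and the returned action strings are hypothetical. Only the 4-hour and 60-minute windows come from the policy table.

```python
from datetime import datetime, timedelta, timezone

# SLO windows from the Freshness SLO row; the dict keys are hypothetical labels.
FRESHNESS_SLO = {
    "policy_change": timedelta(hours=4),
    "emergency_bulletin": timedelta(minutes=60),
}


def stale_documents(docs, now):
    """Return doc_ids that are approved but still unindexed past their SLO window."""
    stale = []
    for doc in docs:
        deadline = doc["approved_at"] + FRESHNESS_SLO[doc["kind"]]
        if doc.get("indexed_at") is None and now > deadline:
            stale.append(doc["doc_id"])
    return stale


def staleness_action(stale, audience):
    """Apply the Staleness action row: disable customer answers, warn employees."""
    if not stale:
        return "serve"
    if audience == "customer":
        return "disable_answers"
    return "serve_with_stale_source_warning"


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
docs = [
    {"doc_id": "POL-1", "kind": "policy_change",
     "approved_at": now - timedelta(hours=5), "indexed_at": None},
    {"doc_id": "BUL-7", "kind": "emergency_bulletin",
     "approved_at": now - timedelta(minutes=30), "indexed_at": None},
]
stale = stale_documents(docs, now)
print(stale, staleness_action(stale, "customer"))
```

Here `POL-1` has waited five hours against a four-hour window, so the customer-facing path is disabled, while `BUL-7` is still inside its 60-minute window.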

9.7 Data Incident Playbook

Phase	Action	Owner	Evidence
Detect	contract violation, quality SLO breach, lineage gap, drift alert, wrong AI output, access failure	Monitoring Owner	alert id, dataset, contract, time window
Classify	assign severity based on data scope, AI output impact, customer impact, regulatory impact	AI DPM + Risk	severity decision log
Contain	block pipeline, quarantine batch, roll back feature/RAG index, disable affected route, switch to human review	Data Eng + MLOps	containment action record
Assess impact	identify downstream models, eval sets, RAG corpora, traces, reports, customers, analysts	Data Architect	lineage impact report
Repair	fix data, rerun validation, rebuild asset, refresh eval or labels, re-run release gate	Data Owner + MLOps	repair run id, validation result
Communicate	notify business, risk, compliance, security, affected product owners	Incident Lead	communication log
Decide restart	verify SLO pass, downstream impact cleared, risk sign-off completed	AI Product Architect	restart approval
Review	root cause, missed detection, control improvement, contract update, training update	Governance Board	post-incident review

Severity examples:

Severity	Trigger	Required response
S0	unauthorized PII exposed through RAG context or trace logs	immediate endpoint disablement, security incident, legal/privacy review
S1	stale policy corpus causes wrong customer-facing answer	disable customer-facing response path, notify compliance, rebuild corpus
S2	credit feature freshness breach affects scoring shadow pipeline	block model promotion and rerun validation
S3	non-critical metadata owner missing for low-risk analytics dataset	catalog correction within governance cadence
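
The severity examples can be encoded as an ordered rule list so that the Classify phase produces a consistent first answer. This is a sketch only: the `IncidentSignal` fields and the first-match ordering are assumptions, and a real classifier would weigh data scope, customer impact, and regulatory impact as the playbook table describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignal:
    # Hypothetical flags derived from the four example triggers.
    unauthorized_pii_exposed: bool = False
    wrong_customer_facing_answer: bool = False
    high_risk_pipeline_breach: bool = False


# Required responses copied from the severity examples table.
REQUIRED_RESPONSE = {
    "S0": "immediate endpoint disablement, security incident, legal/privacy review",
    "S1": "disable customer-facing response path, notify compliance, rebuild corpus",
    "S2": "block model promotion and rerun validation",
    "S3": "catalog correction within governance cadence",
}


def classify(signal: IncidentSignal) -> str:
    """Return the most severe level whose trigger is present (first match wins)."""
    if signal.unauthorized_pii_exposed:
        return "S0"
    if signal.wrong_customer_facing_answer:
        return "S1"
    if signal.high_risk_pipeline_breach:
        return "S2"
    return "S3"
```

Checking the most severe trigger first means an incident that both exposes PII and produces a wrong answer is handled as S0, not S1.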

9.8 Contract Testing Matrix

Test	Example	Tooling pattern	Block condition
Schema required columns	`customer_id`, `document_type`, `review_status` exist	DataHub schema assertion, OpenMetadata contract, GX expectation	missing critical column
Type and enum validation	`document_type` in approved values	GX expectation, dbt test, OpenMetadata no-code test	invalid enum in production batch
Freshness validation	policy doc indexed within 4 hours	DataHub freshness assertion, custom checkpoint	freshness breach for high-risk RAG
Completeness validation	credit income feature null rate <= 1%	GX checkpoint with action	critical feature breach
Semantic metadata validation	owner, domain, description, PII tag present	OpenMetadata semantics and governance checks	high-risk dataset lacks owner or PII tag
Permission validation	analyst cannot retrieve cases outside assignment	integration test against retrieval endpoint	unauthorized access possible
Downstream compatibility	prompt parser, feature loader, eval loader pass contract version	CI consumer tests	consumer fails against new contract
Lineage completeness	run emits source, input, output, contract and quality facets	OpenLineage event validation	missing lineage for regulated AI asset
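
The first three data-level rows of the matrix (required columns, enum validation, completeness) can be illustrated without any specific tool. The sketch below is tool-agnostic plain Python, not the DataHub, OpenMetadata, or GX API; the `contract` dict layout and the `run_contract_tests` helper are hypothetical.

```python
def run_contract_tests(rows, contract):
    """Return a list of block-condition messages for one batch (empty means pass)."""
    failures = []

    # Schema required columns: every required column must appear in the batch.
    columns = set().union(*(row.keys() for row in rows)) if rows else set()
    missing = sorted(set(contract["required_columns"]) - columns)
    if missing:
        failures.append(f"missing required columns: {missing}")

    # Type and enum validation: non-null values must be in the approved set.
    for column, allowed in contract.get("enums", {}).items():
        bad = sorted({row[column] for row in rows
                      if row.get(column) is not None and row[column] not in allowed})
        if bad:
            failures.append(f"invalid {column} values: {bad}")

    # Completeness validation: null rate must not exceed the contract ceiling.
    for column, max_rate in contract.get("max_null_rate", {}).items():
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows) if rows else 0.0
        if rate > max_rate:
            failures.append(f"{column} null rate {rate:.2%} exceeds {max_rate:.2%}")
    return failures


contract = {
    "required_columns": ["customer_id", "document_type", "review_status"],
    "enums": {"document_type": {"passport", "id_card", "utility_bill"}},
    "max_null_rate": {"review_status": 0.01},
}
```

In CI, a non-empty result would fail the producer build; in a pipeline, it would quarantine the batch before any AI asset is rebuilt.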

10. 30-Day Training Plan

Day	Training topic	Deliverable
1	Pick one retail finance AI use case and define its risk tier and AI data products	one-page use case data map
2	Inventory source-of-truth systems, system owners, data consumers, and allowed AI use	source inventory
3	Write the first version of the AI data contract for the core data product	contract draft
4	Break schema, semantics, security, quality, SLA, and terms of use into assertions	assertion catalog
5	Design producer contract tests and consumer contract tests	contract testing matrix
6	Draw the source -> curated -> AI asset lineage map	lineage map
7	Design OpenLineage events and AI-specific facets	lineage event spec
8	Define the metadata product: domain, owner, classification, glossary, data product	metadata product canvas
9	Design a catalog policy following the DataHub or OpenMetadata approach	catalog governance policy
10	Design a Great Expectations / assertion suite for one data product	quality validation suite
11	Build a quality SLO matrix split into low / medium / high risk	SLO matrix
12	Design the schema evolution policy and approval workflow	schema change approval template
13	Rehearse contract + lineage + SLO on the KYC document extraction scenario	KYC case artifact pack
14	Design label governance on the AML case labels scenario	label rubric and provenance card
15	Design training dataset lineage using credit model training data	training data lineage card
16	Design corpus freshness and citation policy using the RAG policy corpus	RAG corpus freshness policy
17	Design the trace contract and feedback taxonomy using customer service trace data	trace data contract
18	Design a feature drift monitor using recommendation feature data	feature drift report format
19	Design the eval dataset provenance card and golden set refresh cadence	eval provenance card
20	Design feature/data drift, label drift, and corpus drift metrics	drift metric catalog
21	Design the data incident severity matrix and containment actions	data incident playbook
22	Run a schema breaking change tabletop exercise	impact analysis memo
23	Run a stale RAG corpus incident tabletop exercise	RAG incident report
24	Run a label dispute and adjudication exercise	label governance decision log
25	Map NIST AI RMF Govern / Map / Measure / Manage to artifacts	AI data risk control matrix
26	Design release gates: contract, lineage, quality, drift, incident readiness	release gate checklist
27	Prepare the architecture selection rationale for DataHub / OpenMetadata / OpenLineage / GX	tooling ADR
28	Turn one case into a portfolio narrative: problem -> architecture -> controls -> outcomes	portfolio story
29	Practice 10 interview questions, each in a 30-second and a 2-minute version	interview answer sheet
30	Run a full retrospective: artifacts, risks, gaps, and next areas to deepen	30-day capstone pack

Minimum standard for training deliverables:

At least one complete data contract.
At least one lineage map covering source, transform, quality, AI asset, and runtime trace.
At least one quality SLO matrix with breach actions.
At least one eval dataset provenance card.
At least one RAG corpus freshness policy.
At least one data incident playbook.
At least one schema change approval.
At least one tooling ADR.

11. Interview Questions and Answers

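Q1: What is the essential difference between AI data contracts and a traditional data dictionary?

30-second version: A data dictionary explains fields; an AI data contract commits to how an AI consumer can use data reliably, securely, and compliantly. It covers schema, semantics, quality SLOs, freshness, permissions, allowed AI use, change process, and incident triggers, and it should be automatically verifiable through contract tests.

2-minute version: In an AI system, a data change affects more than reports: it also affects model training, RAG citations, eval results, agent routing, and customer or employee decisions. A traditional data dictionary is usually a descriptive asset and cannot stop a schema breaking change, an outdated policy entering the corpus, label-definition drift, or a permission leak. An AI data contract is an executable commitment between the producer and the AI consumer: the producer commits to structure, semantics, refresh, quality, and allowed use; the consumer declares how it uses the data and how it degrades when the contract is broken. In retail finance, for example, if KYC document extraction fields have no contract, a model may write a low-confidence OCR address into the customer profile; with a contract, completeness, confidence, review_status, and human review all become release gates.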

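Q2: How would you design AI lineage so that it supports incident review?

30-second version: I would design lineage across six layers: source, pipeline, contract, quality, AI asset, and runtime trace. Every model output or RAG answer can be traced back to source records, transform run, contract version, quality result, index/model/eval version, and user-visible evidence.

2-minute version: Table-level lineage alone is not enough. AI incidents often occur at the field, chunk, feature, label, or prompt level. I would use OpenLineage to record pipeline runs, input/output datasets, and facets, and connect them in the metadata catalog to DataHub/OpenMetadata datasets, columns, jobs, contracts, and owners. For RAG, lineage must record source doc, effective date, chunk id, embedding model, index version, retrieval filters, and citations. For training, it must record source snapshot, feature definitions, label source, and sampling policy. For eval, it must record gold evidence, rubric, reviewer, and severity. With this in place, a stale policy, wrong label, schema break, or permission leak can be traced through lineage impact analysis to the affected models, answers, and customer processes.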
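
To make the RAG part of this answer concrete, the sketch below builds a simplified OpenLineage-style run event as a plain dict and checks it for lineage gaps. The top-level keys (`eventType`, `eventTime`, `run.runId`, `job`, `inputs`, `outputs`) follow the OpenLineage object model; the `ai_ragCorpus` facet, its fields, the run ID, and the `lineage_gaps` helper are hypothetical illustrations. Real facets also carry `_producer` and `_schemaURL` entries, which are omitted here for brevity.

```python
# Hypothetical custom facet fields for a RAG index build.
REQUIRED_RAG_FACET_FIELDS = (
    "source_doc_id", "effective_date", "embedding_model", "index_version",
)


def lineage_gaps(event):
    """Return the paths of missing lineage evidence in one run event."""
    gaps = []
    if not event.get("run", {}).get("runId"):
        gaps.append("run.runId")
    if not event.get("job", {}).get("name"):
        gaps.append("job.name")
    if not event.get("inputs"):
        gaps.append("inputs")
    outputs = event.get("outputs") or []
    if not outputs:
        gaps.append("outputs")
    for dataset in outputs:
        facet = dataset.get("facets", {}).get("ai_ragCorpus", {})
        for name in REQUIRED_RAG_FACET_FIELDS:
            if not facet.get(name):
                gaps.append(f"outputs[{dataset.get('name')}].ai_ragCorpus.{name}")
    return gaps


event = {
    "eventType": "COMPLETE",
    "eventTime": "2026-06-01T12:00:00Z",
    "run": {"runId": "0190f3c2-7d3e-7a51-9c1e-2f6d1b7a4e10"},  # made-up ID
    "job": {"namespace": "rag", "name": "retail_policy_index_build"},
    "inputs": [{"namespace": "policy_repo", "name": "approved_policies"}],
    "outputs": [{
        "namespace": "vector_store",
        "name": "retail_policy_rag_corpus",
        "facets": {"ai_ragCorpus": {
            "source_doc_id": "POL-1",
            "effective_date": "2026-05-01",
            "embedding_model": "embed-v3",
            "index_version": "idx-42",
        }},
    }],
}
```

An empty result from `lineage_gaps(event)` means the run carries enough evidence for impact analysis; any returned path names exactly what a regulated asset is missing.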

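Q3: Why does schema evolution need to be stricter in AI projects?

30-second version: An AI consumer may silently absorb a change in field semantics instead of failing immediately the way an ordinary API would. When the field name stays the same but its definition changes, the result can be biased training data, distorted evals, wrong RAG citations, and feature drift.

2-minute version: I would classify schema evolution into additive, rename, type change, enum expansion, semantic change, nullability change, and aggregation window change. An added field is generally a minor version, but semantic changes, field removal, and aggregation window changes must be a major version and trigger downstream impact analysis, contract tests, model/eval reruns, and risk review. For example, if the credit feature verified_income_90d_avg moves to a 180-day window, the field type is unchanged, but the model feature distribution and adverse action reasons both change. The right approach is dual-run, backfill validation, drift comparison, shadow mode, and an explicit rollback.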
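
The versioning rule in this answer can be sketched as a small classifier. The answer states the bump only for additive, semantic, removal, and aggregation window changes; treating the remaining change types as major is an assumption made here for safety, and the change-type labels and function names are hypothetical.

```python
# Stated in the answer: additive is generally minor; semantic change,
# field removal, and aggregation window change must be major.
MINOR = {"additive"}
MAJOR_FROM_TEXT = {"semantic_change", "field_removal", "aggregation_window_change"}
# Assumption (not stated in the answer): the remaining change types are treated
# conservatively as major until a reviewer downgrades them.
MAJOR_BY_ASSUMPTION = {"rename", "type_change", "enum_expansion", "nullability_change"}


def required_bump(change_types):
    """Return 'major' if any change in the set is breaking, else 'minor'."""
    changes = set(change_types)
    unknown = changes - MINOR - MAJOR_FROM_TEXT - MAJOR_BY_ASSUMPTION
    if unknown:
        raise ValueError(f"unclassified change types: {sorted(unknown)}")
    if changes & (MAJOR_FROM_TEXT | MAJOR_BY_ASSUMPTION):
        return "major"
    return "minor"


def next_version(current, change_types):
    """Bump a 'major.minor' contract version according to the change set."""
    major, minor = (int(part) for part in current.split("."))
    if required_bump(change_types) == "major":
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"
```

For the verified_income_90d_avg example, `next_version("2.3", ["aggregation_window_change"])` yields `"3.0"` even though the column type never changed, which is the point of classifying by semantics and not by type.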

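Q4: How do you distinguish training data lineage, eval dataset lineage, and RAG corpus lineage?

30-second version: Training data lineage proves what the model learned; eval lineage proves whether test samples and gold evidence are trustworthy; RAG corpus lineage proves which version of authoritative knowledge an answer retrieved. All three need sources and versions, but the governance focus differs.

2-minute version: For training data the focus is source snapshot, feature definitions, label source, sampling, exclusion, PII handling, and leakage control. For eval data it is scenario source, gold evidence, expected behavior, reviewer, rubric, severity, and refresh cadence. For a RAG corpus it is source document, approval status, effective date, expiry date, chunking policy, embedding model, index version, permission filters, and citation support. In financial scenarios, mixing them causes problems: customer service logs may be usable for eval sampling but not necessarily for training; an expired policy may serve historical-date questions but cannot support current-policy answers.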

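Q5: How do you set data quality SLOs?

30-second version: Derive SLOs backward from the AI use case's risk and business consequences. High-risk scenarios set hard-stop metrics, such as permission correctness of 100%, key-field completeness >= 99.5%, and RAG current-policy coverage of 100%. Low-risk scenarios can use warnings and manual repair.

2-minute version: I would not set only a generic null rate. AI data quality SLOs should cover at least freshness, completeness, validity, consistency, accuracy, coverage, access correctness, and trace completeness. KYC extraction tracks key-field completeness, confidence, and reviewer override rate; AML labels track reviewer agreement, dispute rate, and outcome lag; the RAG policy corpus tracks approved active document coverage, index freshness, and citation correctness; recommendation features track event freshness, identity match, and feature drift. Every SLO needs a measurement, target, severity, automated action, and owner.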
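
The five attributes that every SLO needs map naturally onto a small record type. This is a minimal sketch: the `Slo` dataclass, the metric names, and the owners are hypothetical, and treating a missing measurement as a breach is an assumption. The 100% and 99.5% targets come from the 30-second answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str        # key of the measurement this SLO reads
    target: float    # minimum acceptable value, as a fraction
    severity: str    # "hard_stop" or "warn"
    action: str      # automated action on breach
    owner: str


def evaluate(slos, measurements):
    """Return (name, severity, action, owner) for every breached SLO."""
    breaches = []
    for slo in slos:
        value = measurements.get(slo.name)
        # Assumption: a missing measurement counts as a breach.
        if value is None or value < slo.target:
            breaches.append((slo.name, slo.severity, slo.action, slo.owner))
    return breaches


def release_blocked(breaches):
    """A single hard-stop breach blocks the release."""
    return any(severity == "hard_stop" for _, severity, _, _ in breaches)


SLOS = [
    Slo("access_correctness", 1.0, "hard_stop", "disable_route", "Security"),
    Slo("key_field_completeness", 0.995, "hard_stop", "block_pipeline", "Data Owner"),
    Slo("description_coverage", 0.90, "warn", "open_ticket", "Data Steward"),
]
```

Keeping the action and owner on the SLO itself means a breach is never just a dashboard color: it always resolves to something that happens and someone who answers for it.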

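Q6: After feature/data drift is detected, how should the product manager decide?

30-second version: First determine whether the drift affects customer rights, compliance obligations, or key business decisions. Low-risk drift can be monitored and explained; medium- and high-risk drift requires degrading the service, pausing automation, switching to human review, rerunning evals, or triggering model/data repair.

2-minute version: Drift is not a single model metric. Feature drift may come from channel changes, upstream schema changes, or real business changes; data drift may come from new document types, new customer segments, or new policies; label drift may come from changes in review criteria; corpus drift may come from policy updates. The product decision depends on the cause of the drift and its scope of impact. For example, credit income feature drift affects approval assistance and should enter model risk review; a policy corpus freshness breach affects customer service answers, so customer-visible answers should be disabled and routed to human confirmation; recommendation event drift affects marketing, so the system can roll back to a stable feature snapshot and pause the related campaigns.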
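
One common way to quantify feature drift before making these decisions is the Population Stability Index (PSI). The text does not prescribe a metric, so PSI is a swapped-in example; the sketch below uses equal-width bins over the baseline range, which is one of several reasonable binning choices.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline range; values outside that range
    are clipped into the first or last bin. Empty bins are floored at eps
    so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are a convention, not part of this playbook; the decision rule should still follow the risk tier of the affected use case.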

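Q7: What roles do DataHub, OpenMetadata, OpenLineage, and Great Expectations each play in the architecture?

30-second version: OpenLineage records lineage events; DataHub/OpenMetadata serve as the metadata catalog and governance control plane; Great Expectations handles data quality expectations, validation, and checkpoints. They are not mutually exclusive tools but different layers of the control plane.

2-minute version: OpenLineage is suited to collecting jobs, runs, datasets, and facets from the pipeline runtime, giving the lineage graph execution evidence. DataHub and OpenMetadata manage data assets, owners, domains, glossary, classification, contracts, lineage, quality signals, and governance workflows. Great Expectations turns data assumptions into runnable Expectation Suites, Validation Definitions, Validation Results, and Checkpoints, and uses actions to send notifications or update documentation. In a mature architecture, the contract is defined in the catalog, validation is executed by GX or built-in assertions, results are written back to the catalog, and lineage records every run and data artifact.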

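Q8: Why do AML case labels need label governance?

30-second version: AML labels affect training, eval, risk narratives, and model release decisions. Without label definition, reviewer, jurisdiction, evidence, rubric, and disagreement handling, the model learns wrong or inconsistent investigation conclusions.

2-minute version: The AML label suspicious_activity_confirmed is not an ordinary label; it can reflect different investigation stages, jurisdictions, and QA criteria. Label governance must define ontology, positive/negative criteria, case status, SAR decision, typology, reviewer role, evidence reference, timestamp, rubric version, and adjudication. It must also monitor inter-annotator agreement, label drift, dispute rate, and case outcome lag. High-risk labels should not be generated automatically by an LLM as the sole ground truth; an LLM can assist with pre-classification or produce a reviewer draft, but the final golden label needs an expert or an audited operational outcome.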
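
Inter-annotator agreement for the two-reviewer setup can be measured with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text names the metric family but not a specific statistic, so kappa is an illustrative choice; the `dispute_queue` helper is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty case list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each reviewer's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def dispute_queue(case_ids, labels_a, labels_b):
    """Cases where the reviewers disagree and an adjudicator is needed."""
    return [case for case, a, b in zip(case_ids, labels_a, labels_b) if a != b]
```

The dispute queue feeds the adjudicator role, while the kappa trend over label batches is one signal for rubric calibration.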

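Q9: If the RAG policy corpus cites an expired policy, how do you handle it?

30-second version: Contain first: disable the affected customer-visible answers or switch to human confirmation. Then use lineage to find the source doc, chunk, index version, retrieval filter, and affected traces. After the fix, rebuild the index, rerun the citation eval, and update the freshness policy and incident controls.

2-minute version: I would follow the data incident playbook. Step one, classify severity: is it customer-visible, does it affect compliance obligations, did it cause wrong execution. Step two, containment: take the affected corpus/index offline and switch the route policy to the current approved index or human-only. Step three, impact analysis: use corpus lineage to find the expired doc version, chunk ids, answers, users, and business units. Step four, repair: correct the source approval/expiry metadata, re-chunk, re-embed, and re-index, then run freshness, permission, citation support, and regression evals. Step five, post-incident review: why the policy repo approval event did not trigger a reindex, why the freshness SLO did not block it, and how the contract and alerts should be updated.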

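Q10: How do you make a metadata product a portfolio highlight?

30-second version: Do not just show catalog screenshots; show how metadata drives AI controls: RAG permission filter, contract testing, lineage impact analysis, quality SLO, drift dashboard, incident response, and release gate.

2-minute version: The portfolio can pick one retail finance AI use case, such as KYC extraction or policy RAG. Show the full chain: data product canvas, data contract, lineage map, quality SLO matrix, schema change approval, eval provenance card, RAG freshness policy, and incident playbook. Then use a concrete incident exercise to show the value of metadata: after an old policy enters the corpus, the catalog's effective_date, approval_status, lineage, index_version, and traces let the team locate the impact, disable the faulty path, repair it, and prove recovery. This demonstrates AI Product Architect and Data Architect capability, not simple data cleanup.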

12. Pre-Launch Self-Check Checklist

Check	Pass condition
Contract	every high-risk AI data product has an active contract, owner, allowed use, and change policy
Contract tests	producer, consumer, and change impact tests are wired into CI or the pipeline
Lineage	source -> transform -> quality -> AI asset -> runtime trace is traceable
Metadata	owner, domain, classification, glossary, PII, retention, and access policy are registered
Quality SLO	critical metrics have target, measurement, owner, and breach action
Drift	feature/data/label/corpus/schema/usage drift has metrics and decision rules
Training data	snapshot, feature definition, label source, sampling, exclusion, and PII handling are reproducible
Eval data	provenance card, gold evidence, reviewer, rubric, severity, and refresh cadence are complete
RAG corpus	approval status, effective date, chunking, index version, permission filters, and citation policy are complete
Label governance	label definition, reviewer model, agreement metric, adjudication, and versioning are complete
Incident response	severity, containment, impact analysis, repair, restart, and post-incident review have been rehearsed
Governance	RACI, release gate, risk sign-off, and audit evidence pack can be demonstrated

13. Reference Links

OpenLineage Docs: https://openlineage.io/docs/
OpenLineage Object Model: https://openlineage.io/docs/spec/object-model/
OpenLineage Facets: https://openlineage.io/docs/spec/facets/
DataHub Data Contracts: https://docs.datahub.com/docs/generated/metamodel/entities/datacontract
DataHub Lineage API Tutorial: https://docs.datahub.com/docs/api/tutorials/lineage
OpenMetadata Data Contracts: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-contracts
OpenMetadata Data Lineage: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-lineage
OpenMetadata Data Quality Observability: https://docs.open-metadata.org/v1.13.x/how-to-guides/data-quality-observability
Great Expectations GX Core Overview: https://docs.greatexpectations.io/docs/core/introduction/gx_overview/
Great Expectations Checkpoints with Actions: https://docs.greatexpectations.io/docs/core/trigger_actions_based_on_results/create_a_checkpoint_with_actions/
NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/
NIST AI RMF Generative AI Profile: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence