AI Eval Dataset Lifecycle: Golden Set and Test Data Factory Architecture

Date: 2026-06-30

AI Eval Dataset Lifecycle Architecture: Eval Dataset Lifecycle / Golden Set / Test Data Factory

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead / model risk partner
Output: advanced architecture note for portfolio, interview discussion, release governance design and AI platform roadmap

Why it matters for AI product/architecture

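In financial retail AI projects, an eval dataset is neither a test spreadsheet nor an experiment file private to the data science team. It is a production control asset: it determines how a system demonstrates fitness for use, how regressions are detected, how release decisions are explained, and how regulatory and internal audit questions are answered.

Mature teams manage AI eval datasets as lifecycle objects: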

business risk
  -> dataset portfolio
  -> golden / challenge / adversarial / regression sets
  -> real and synthetic case factory
  -> label governance
  -> coverage and drift control
  -> release promotion gates
  -> evidence packet
  -> production failure mining
  -> dataset retirement and retention

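This matters to PMs, BAs and architects because the core question in an AI product release is not "is the model score high enough" but the following:

Advanced question	Why it is an architecture question
Which samples qualify for the golden set	Business, risk, data, model and operations must define this jointly; engineers should not pick samples ad hoc
Which scenarios must be hard gates	Involves customer impact, regulatory obligations, human review, rollback and management risk acceptance
How to combine synthetic and real cases	Involves privacy, rare risks, coverage, representativeness and reproducibility
How label definition changes affect historical scores	Requires lineage, versioning, impact analysis and a consistent comparison basis
How coverage drift triggers new sampling	Shifts in the production distribution gradually invalidate an old golden set
How audit re-verifies a release decision	Must be able to trace dataset version, case source, label authority, run result, approval and exception

In one sentence:

AI Eval Dataset Lifecycle Architecture = turning test data from one-off samples into a quality asset supply chain that is versioned, governed, measurable, auditable and retirable.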

Concept diagram

flowchart TB
  A[Business use case and risk tier] --> B[Dataset intake]
  B --> C{Source type}
  C --> D[Real production or historical cases]
  C --> E[Synthetic and mutated cases]
  C --> F[Policy and scenario authored cases]

  D --> G[Privacy, consent, retention and de-identification gate]
  E --> H[Realism and oracle gate]
  F --> I[Business policy and SME gate]

  G --> J[Candidate case registry]
  H --> J
  I --> J

  J --> K[Label and expected-behavior governance]
  K --> L[Coverage model]
  L --> M{Promotion decision}

  M --> N[Golden set]
  M --> O[Challenge set]
  M --> P[Adversarial set]
  M --> Q[Regression set]
  M --> R[Monitor-only sample]

  N --> S[Eval runner and release gates]
  O --> S
  P --> S
  Q --> S
  S --> T[Evidence packet and approval ledger]
  T --> U[Production monitoring and failure mining]
  U --> B

  V[Dataset lineage graph] --- J
  V --- K
  V --- S
  W[Retention and legal hold controls] --- D
  W --- T

Core architecture model

1. Dataset portfolio, not one dataset

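Different datasets serve different decisions and should not be merged into a single "test set".

Dataset type	Purpose	Entry criteria	Typical financial retail cases
Golden set	Stable comparison of core capabilities; protects key business behavior	Labels adjudicated, key workflows covered, expected behavior clear, version frozen	KYC document review, credit memo summarization, customer service policy answers
Challenge set	Stress-tests boundaries, complex exceptions and low-frequency, high-risk situations	Difficulty clearly defined, failure causes explainable, no need to match production proportions	Complex AML typologies, multi-product complaint liability attribution, payment dispute exceptions
Adversarial set	Tests bypass, injection, privilege escalation, data leakage and excessive agency	Security objective explicit, severity clearly defined, must not be diluted by average scores	RAG prompt injection, agent updating CRM without authorization, leaking customer PII
Regression set	Must be rerun on every change; protects fixed defects and critical paths	Sourced from past incidents, defects, complaints, model regressions or risk findings	A past wrong explanation of a loan denial reason, a past missed escalation of a high-risk AML case
Synthetic set	Extends coverage, constructs rare scenarios, protects privacy and tests boundary combinations	Generation logic traceable, facts consistent, expected output verifiable	Rare KYC document combinations, extreme payment anomalies, bilingual complaints
Production sample set	Monitors the live distribution, finds new failures, measures coverage drift	Sampling strategy, privacy controls, retention period and purpose boundary approved	Sampled real customer service Q&A, analyst copilot traces, complaint response quality spot checks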

Architecture principles:

Golden protects stable core behavior.
Challenge exposes business edge cases.
Adversarial protects trust and security boundaries.
Regression prevents known failures from returning.
Synthetic expands controlled coverage.
Production sample detects distribution change.

2. Case object as the atomic unit

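Each eval case should be designed as a traceable object, not a single line of prompt.

Field	Description
case_id	Globally unique identifier, for example `KYC-DOC-GOLD-2026Q3-00142`
use_case_id	Binds the business use case and risk tier
source_type	real / synthetic / policy-authored / mutated / incident-derived
source_lineage	Source system, event, sampling window, generation rule, de-identification version, legal hold status
customer_context	Usable slices such as channel, product, region, language, customer type and vulnerable-customer flag
workflow_state	Workflow stage at which the AI intervenes, human role, available evidence, prohibited actions
input_payload	User question, document summary, transaction facts, case notes, retrieved evidence or tool observation
expected_behavior	What the AI is expected to do: answer, refuse, escalate, cite, draft, or request approval before calling a tool
unacceptable_behavior	Explicit failure behaviors: unsupported assertion, wrongful refusal, unauthorized commitment, data leakage, missed escalation
labels	Business label, risk label, policy label, failure mode, severity, segment
label_authority	Adjudication record from SME / risk / compliance / model validation / product owner
dataset_membership	Which version of golden, challenge, adversarial, regression or monitor-only the case belongs to
retention_policy	Retention period, deletion conditions, anonymization status, audit retention and access control
evidence_links	Reviews, approvals, run results, defects, release gates, change impact records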
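
To make the case object concrete, here is a minimal Python sketch of a subset of the fields above with a basic validator. The class name `EvalCase`, the helper `validate_case`, the specific validation rules and all example values other than the `case_id` format are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

# Values taken from the source_type row of the field table.
ALLOWED_SOURCE_TYPES = {"real", "synthetic", "policy-authored", "mutated", "incident-derived"}


@dataclass(frozen=True)  # frozen: a registered case is never edited in place
class EvalCase:
    case_id: str
    use_case_id: str
    source_type: str
    input_payload: str
    expected_behavior: str
    unacceptable_behavior: Tuple[str, ...]
    label_authority: Optional[str] = None      # who adjudicated the label
    dataset_membership: Tuple[str, ...] = ()   # e.g. ("golden@2026Q3",)
    retention_policy: Optional[str] = None


def validate_case(case: EvalCase) -> List[str]:
    """Return governance problems; an empty list means the case is well formed."""
    problems = []
    if case.source_type not in ALLOWED_SOURCE_TYPES:
        problems.append(f"unknown source_type: {case.source_type}")
    if not case.expected_behavior.strip():
        problems.append("expected_behavior is empty")
    if not case.unacceptable_behavior:
        problems.append("unacceptable_behavior is empty")
    # Assumed rule: a case cannot sit in a dataset without an adjudicated label.
    if case.dataset_membership and case.label_authority is None:
        problems.append("dataset member without label_authority")
    if case.retention_policy is None:
        problems.append("retention_policy is missing")
    return problems


# Hypothetical example values for illustration only.
case = EvalCase(
    case_id="KYC-DOC-GOLD-2026Q3-00142",
    use_case_id="kyc-document-review",
    source_type="synthetic",
    input_payload="Proof of address with an expiry date in the past",
    expected_behavior="Flag the document as expired and route to human review",
    unacceptable_behavior=("Accept the document as valid",),
    label_authority="compliance-sme",
    dataset_membership=("golden@2026Q3",),
    retention_policy="retain-7y-audit",
)
draft = replace(case, label_authority=None)  # same case, label not yet adjudicated

print(validate_case(case))
print(validate_case(draft))
```

Freezing the dataclass mirrors the idea that changes to a registered case go through a new version rather than a silent in-place edit.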

3. Test data factory

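A Test Data Factory is not a script that "generates more test samples"; it is a controlled case supply chain.

Factory capability	Architecture responsibility	Example
Case mining	Discover candidate cases from production traces, defects, complaints, human overrides, incidents and monitoring alerts	A wrong policy explanation appears in real customer service copilot Q&A
Case mutation	Vary language, channel, customer type, amount, date and evidence completeness while keeping business facts consistent	Turn one payment dispute case into mobile / branch / Spanish / missing receipt versions
Synthetic authoring	Construct cases that are low-frequency in the real world, based on policy, process and risk scenarios	Construct a high-risk AML typology alongside a control group of harmless look-alike behavior
Privacy transformation	De-identification, tokenization, data minimization, PII masking, synthetic substitution, access isolation	A KYC proof of address keeps its field structure but has the real name and address removed
Oracle construction	Maintain reference answer, expected action, expected refusal, tool argument and citation target	The credit copilot must cite the correct reason code and must not generate new denial reasons
Coverage balancing	Adjust the mix by product, channel, language, risk, failure mode and customer impact	Complaint data cannot cover only high-frequency credit card complaints; it must also cover loans, fraud and vulnerable customers
Version packaging	Produce a reproducible dataset release bound to manifest, hash, lineage and approval	`AML-COPILOT-REGRESSION-v2026.06.30`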
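
The version packaging row can be sketched as a content-addressed manifest. This is a minimal illustration, assuming cases are plain dictionaries and that SHA-256 over canonical JSON is an acceptable fingerprint; the function names `case_hash` and `build_manifest`, the manifest field names and the example cases are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def case_hash(case: Dict) -> str:
    """Hash a case over canonical JSON so key order does not change the digest."""
    canonical = json.dumps(case, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(release_id: str, cases: List[Dict], approved_by: str) -> Dict:
    """Build a manifest whose dataset digest is independent of case ordering."""
    entries = sorted(
        ({"case_id": c["case_id"], "sha256": case_hash(c)} for c in cases),
        key=lambda entry: entry["case_id"],
    )
    digest = hashlib.sha256(json.dumps(entries, sort_keys=True).encode("utf-8")).hexdigest()
    return {
        "release_id": release_id,
        "case_count": len(entries),
        "cases": entries,
        "dataset_sha256": digest,
        "approved_by": approved_by,
    }


# Hypothetical cases for illustration only.
cases = [
    {"case_id": "AML-REG-0002", "expected_behavior": "escalate", "label_version": 3},
    {"case_id": "AML-REG-0001", "expected_behavior": "cite transaction chain", "label_version": 1},
]
manifest = build_manifest("AML-COPILOT-REGRESSION-v2026.06.30", cases, approved_by="model-risk")
print(manifest["case_count"], manifest["dataset_sha256"][:12])
```

Recording the dataset digest alongside each eval run lets a later reviewer confirm that the cases behind a release decision are byte-for-byte the ones that were approved.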

4. Lineage graph

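Dataset lineage should answer:

Where did this case come from?
What privacy and quality processing did it go through?
Who assigned which labels and adjudications?
When did it enter which dataset version?
Which releases have used it?
Which results and approvals depend on it?
When does it need to be retired or relabeled?

Treat lineage as a graph, not a folder: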

source event
  -> candidate case
  -> de-identified case
  -> label decision
  -> dataset version
  -> eval run
  -> release gate
  -> evidence packet
  -> production outcome
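
One way to see why a graph helps: when a label decision is disputed, the question "which runs and release gates depended on it" becomes a downstream traversal. The sketch below assumes lineage is stored as a simple adjacency mapping; the node identifiers and the `downstream` helper are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges following the chain above (node -> direct dependents).
LINEAGE: Dict[str, List[str]] = {
    "source:complaint-8841": ["candidate:C-17"],
    "candidate:C-17": ["deid:C-17-v1"],
    "deid:C-17-v1": ["label:L-7"],
    "label:L-7": ["dataset:regression-v3"],
    "dataset:regression-v3": ["run:R-12", "run:R-13"],
    "run:R-12": ["gate:release-41"],
    "run:R-13": ["gate:release-42"],
    "gate:release-41": ["evidence:EP-41"],
    "gate:release-42": [],
    "evidence:EP-41": [],
}


def downstream(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search for every node that depends on `start`."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# If label L-7 is relabeled, which eval runs and release gates need review?
impacted = downstream(LINEAGE, "label:L-7")
impacted_runs = sorted(n for n in impacted if n.startswith("run:"))
impacted_gates = sorted(n for n in impacted if n.startswith("gate:"))
print(impacted_runs, impacted_gates)
```

A folder layout can store the same files, but it cannot answer this dependency question without reconstructing the edges by hand.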

Lifecycle states and gates

Lifecycle states

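State	Meaning	Allowed actions	Disallowed actions
Discovered	Candidate found from production, incidents, policy, SME workshops or the synthetic factory	Record source, initial classification, risk flag	Enter a release gate directly
Quarantined	Contains sensitive data, is under legal hold, or has quality doubts or an unclear purpose	Privacy assessment, access restriction, source system confirmation	Spread to development environments or vendors
Candidate	Passed basic purpose and privacy checks; awaiting label and coverage review	Labeling, adjudication, slice assignment, fact checking	Be cited as a golden score
Reviewed	Labels, expected behavior and risk severity confirmed by authorized roles	Enter coverage review and the promotion gate	Repeatedly modify historical results without versioning
Promoted	Approved for the golden / challenge / adversarial / regression or other sets	Participate in release gates, evidence packets, trend analysis	Silently modify case content or labels
Active	Current version in use for release, monitoring or periodic review	Execution, comparison, sampled re-review, impact analysis	Be selectively deleted by the model team to pass tests
Watchlisted	Coverage drift, label dispute, policy change or quality risk detected	Restrict use, trigger re-evaluation, retain historical results	Serve as the sole evidence for a high-risk release
Deprecated	Replaced by a new policy, process, product or higher-quality case	Retain historical lineage, stop adding new dependencies	Be physically deleted from historical evidence
Archived	No longer in active use but still retained for audit or retrospectives	Archive per retention policy, controlled access	Be used in a new release gate
Purged	Retention period expired and no legal or audit hold applies	Record proof of deletion	Be restored for use or redistributed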
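
The states above can be enforced as a small state machine. The table defines the states but not the exact transitions between them, so the `TRANSITIONS` map below is an assumption made for illustration, as are the `transition` function and its event fields.

```python
from datetime import datetime, timezone
from typing import Dict, Set

# Assumed transition map; the state names come from the table, the edges do not.
TRANSITIONS: Dict[str, Set[str]] = {
    "Discovered": {"Quarantined", "Candidate"},
    "Quarantined": {"Candidate", "Purged"},
    "Candidate": {"Reviewed", "Quarantined"},
    "Reviewed": {"Promoted", "Candidate"},
    "Promoted": {"Active"},
    "Active": {"Watchlisted", "Deprecated"},
    "Watchlisted": {"Active", "Deprecated"},
    "Deprecated": {"Archived"},
    "Archived": {"Purged"},
    "Purged": set(),  # terminal: a purged case is never restored
}


def transition(case_id: str, current: str, target: str, reason: str, actor: str) -> Dict[str, str]:
    """Validate a state change and return an auditable event record."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"{case_id}: {current} -> {target} is not allowed")
    if not reason.strip():
        raise ValueError(f"{case_id}: a state change requires a recorded reason")
    return {
        "case_id": case_id,
        "from": current,
        "to": target,
        "reason": reason,
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
    }


event = transition("KYC-DOC-GOLD-2026Q3-00142", "Reviewed", "Promoted",
                   reason="promotion gate memo approved", actor="risk-steward")
print(event["from"], "->", event["to"])

try:
    # A discovered case must not jump straight into active use.
    transition("KYC-DOC-GOLD-2026Q3-00143", "Discovered", "Active", "shortcut", "engineer")
except ValueError as err:
    print("blocked:", err)
```

Requiring a reason on every transition means a retirement or deletion always leaves a governance record.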

Promotion gates

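Gate	Key judgment	Required evidence	Typical blocking conditions
Intake gate	Does the candidate case fall within an approved purpose	use case mapping, source lineage, business rationale	Unknown source, purpose outside the approved scope
Privacy and retention gate	Is use for eval permitted, for how long, and who may access it	data classification, PII handling, retention rule, access group	Production PII not de-identified, unclear customer consent boundary, legal hold conflict
Label authority gate	Have labels and expected behavior been adjudicated by authorized roles	reviewer role, decision timestamp, rationale, conflict resolution	Unresolved SME disagreement, policy owner has not confirmed
Coverage gate	Does the case fill a key slice or risk gap	coverage matrix, slice target, failure mode mapping	Only adds duplicate high-frequency samples without improving coverage
Promotion gate	Which set should the case enter, and can it act as a release blocker	severity, dataset membership, gate threshold, owner approval	Synthetic case without an oracle, adversarial case with unclear severity
Change impact gate	Does a model / prompt / RAG / tool / policy change require reruns or relabeling	impacted datasets, version diff, required reruns	Old scores still used after a label definition or policy change
Retirement gate	Is the case expired, retained, replaced or deleted	deprecation reason, replacement case, archive location, purge record	Deleted because of poor results, with no governance record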
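
The first five gates can be read as an ordered evidence checklist that stops at the first gap. The sketch below is a simplified illustration: the evidence names are taken from the table, while the `GATES` structure, the `evaluate_gates` function and the rule "a gate passes when every evidence item is present" are assumptions. A real gate would also evaluate the content of the evidence.

```python
from typing import Dict, List, Tuple

# Ordered gates with the evidence each one requires (names from the table above).
GATES: List[Tuple[str, List[str]]] = [
    ("intake", ["use_case_mapping", "source_lineage", "business_rationale"]),
    ("privacy_and_retention", ["data_classification", "pii_handling", "retention_rule", "access_group"]),
    ("label_authority", ["reviewer_role", "decision_timestamp", "rationale", "conflict_resolution"]),
    ("coverage", ["coverage_matrix", "slice_target", "failure_mode_mapping"]),
    ("promotion", ["severity", "dataset_membership", "gate_threshold", "owner_approval"]),
]


def evaluate_gates(evidence: Dict[str, str]) -> Dict[str, object]:
    """Walk the gates in order and stop at the first one with missing evidence."""
    passed: List[str] = []
    for gate, required in GATES:
        missing = [name for name in required if not evidence.get(name)]
        if missing:
            return {"passed": passed, "blocked_at": gate, "missing": missing}
        passed.append(gate)
    return {"passed": passed, "blocked_at": None, "missing": []}


# Hypothetical candidate: SME disagreement has not been resolved yet.
candidate_evidence = {
    "use_case_mapping": "UC-CC-RAG-01",
    "source_lineage": "complaint-8841",
    "business_rationale": "new fee policy failure mode",
    "data_classification": "confidential",
    "pii_handling": "de-identified v2",
    "retention_rule": "7y",
    "access_group": "eval-stewards",
    "reviewer_role": "policy SME",
    "decision_timestamp": "2026-06-30T09:00:00Z",
    "rationale": "matches fee policy clause",
}
result = evaluate_gates(candidate_evidence)
print(result["blocked_at"], result["missing"])
```

Returning the gates already passed together with the blocking gate gives the promotion memo a ready-made status line.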

Financial retail scenarios

1. KYC and onboarding

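Dataset role	Example cases	Architecture decision
Golden set	Standard identity documents, proof of address, name variants, date validity, minor OCR errors	Protect main-flow accuracy and the human review boundary
Challenge set	Non-English documents, joint accounts, proof of address with missing fields, assisted flows for vulnerable customers	Test policy uncertainty and handoff
Adversarial set	Text injection in forged documents, malicious document notes telling the model to ignore rules	Verify that document content cannot override system policy
Regression set	Cases where an expired proof of address was once judged valid	Must be rerun on every OCR, prompt or document taxonomy change
Synthetic set	Rare places of issue, complex name formats, screenshot quality across channels	Extend coverage while keeping fact consistency checks

PM / architecture decisions:

Does the AI recommend pass/review, or does it take part in the final rejection? Different decision boundaries need different dataset gates.
Can document images enter the test environment? If not, field-level synthesis or de-identification is needed.
Do customer resubmissions and human override outcomes feed back into the regression set?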

2. AML alert investigation

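Dataset role	Example cases	Architecture decision
Golden set	Common alert narratives, account relationship summaries, evidence collection suggestions	Protect the analyst copilot's baseline productivity
Challenge set	Multi-account, multi-currency, cash-intensive industries, cross-border patterns, similar but legitimate behavior	Reduce over-alerting and missed escalations
Adversarial set	A case note contains the instruction "do not report this transaction"	Verify that the RAG / agent does not execute malicious instructions found in evidence
Regression set	Cases that once omitted a key transaction chain or cited the wrong SAR rationale	Prevent narrative quality from regressing
Production sample set	Analyst overrides, second-review rejections, quality spot checks and management feedback	Detect typology and process drift

Key control: for high-risk AML cases, AI output should serve as analyst assistant evidence, not as the final investigation conclusion.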

3. Credit and lending

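Dataset role	Example cases	Architecture decision
Golden set	Credit memo summaries, income stability, debt ratio, policy exception notes	Ensure the output covers the facts required for approval
Challenge set	Thin credit files, self-employed income, co-borrowers, recent hardship	Protect complex customer segments and the human judgment boundary
Adversarial set	Inducing the model to generate unapproved reason codes or promise a loan outcome	Prevent misleading customers and unauthorized advice
Regression set	Cases that once cited the wrong denial reason or ignored incomplete documentation	Bind to compliance wording and policy versions
Synthetic set	Combinations of income structure, product, region and channel	Increase coverage without constructing unrealistic customer facts

Key control: a credit AI dataset must distinguish "internal memo assistance" from "customer-visible explanation".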

4. Payments and fraud

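Dataset role	Example cases	Architecture decision
Golden set	Common card-not-present disputes, authorized push payment scams, refund failures	Protect classification and next-step recommendations
Challenge set	Transaction was authorized but the customer claims to have been scammed, conflicting merchant evidence, cross-border time zone issues	Test policy conflict and escalation
Adversarial set	User asks the agent to bypass a freeze, modify transaction status or disclose counterparty data	Verify tool authority and data boundary
Regression set	Cases that once recommended an irreversible action or wrongly closed a dispute	Must be rerun on every tool schema or workflow change
Synthetic set	Rare combinations of amount, channel, merchant category and case age	Support coverage of low-frequency, high-impact risks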

5. Contact center and complaints

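Dataset role	Example cases	Architecture decision
Golden set	High-frequency policy Q&A, account servicing, fee explanations, complaint summaries	Protect the quality of customer-visible answers
Challenge set	Highly emotional customers, multi-product complaints, regulatory-sensitive wording, vulnerable customers	Test tone, escalation and completeness
Adversarial set	Prompt injection, unauthorized refunds, requests to disclose another person's account	Protect customer data and operational permissions
Regression set	Incorrect scripts and policy explanations that once caused complaint escalation	Connect complaint RCA back into the dataset lifecycle
Production sample set	QA spot checks, complaint root causes, customer callbacks, human rewrites	Detect coverage drift and new policy issues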

Metrics/control/evidence model

Metrics

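Metric	Explanation	Decision use
Coverage by slice	Coverage across product, channel, language, region, customer type, risk tier and workflow stage	Judge whether the dataset can support the release scope
Critical scenario coverage	Whether each high-impact failure mode has cases	Set hard gates
Synthetic-real mix	Share of synthetic versus real/historical samples	Judge the balance of representativeness, privacy and rare-scenario coverage
Label dispute rate	Share of labels and expected behaviors that are disputed	Detect unstable policy definitions
Label aging	Whether the policy, process or product version behind a label is out of date	Trigger re-evaluation or retirement
Lineage completeness	Whether case source, processing, labels, version, runs and approvals are complete	Support audit traceability
Coverage drift	Gap between the production distribution and active dataset slices	Trigger case mining and new sampling
Escaped failure capture	Share and timeliness of production defects entering the regression set	Measure closed-loop quality
Gate sensitivity	Whether the dataset can detect high-risk regressions	Prevent average scores from masking critical failures
Retention compliance	Whether active / archived / purged states comply with policy	Control privacy, audit and legal risk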
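
Coverage drift can be quantified in several ways. The sketch below uses total variation distance between production slice shares and dataset slice shares, which is one reasonable choice rather than a method prescribed by this note; the function names, the 5-point gap threshold and the example counts are assumptions.

```python
from typing import Dict, List, Tuple


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw slice counts into proportions."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    return {slice_name: n / total for slice_name, n in counts.items()}


def coverage_drift(production: Dict[str, int], dataset: Dict[str, int],
                   gap_threshold: float = 0.05) -> Tuple[float, List[str]]:
    """Return (total variation distance, under-covered slices by descending gap)."""
    p, d = shares(production), shares(dataset)
    # Positive gap: the slice is more common in production than in the dataset.
    gaps = {k: p.get(k, 0.0) - d.get(k, 0.0) for k in set(p) | set(d)}
    tvd = 0.5 * sum(abs(g) for g in gaps.values())
    under_covered = sorted(
        (k for k, g in gaps.items() if g > gap_threshold),
        key=lambda k: gaps[k],
        reverse=True,
    )
    return tvd, under_covered


# Hypothetical complaint volumes by product slice.
production_counts = {"cards": 50, "loans": 30, "fraud": 20}
golden_counts = {"cards": 80, "loans": 20}

drift, needs_cases = coverage_drift(production_counts, golden_counts)
print(round(drift, 2), needs_cases)  # drift is about 0.30; fraud and loans need cases
```

The under-covered list is what would feed the case mining backlog; the single drift number is more useful as a trend line than as an absolute threshold.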

Controls

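Control	Control intent	Evidence
Dataset owner and steward assignment	Every dataset version has a business owner and a technical owner	owner registry, approval ledger
Source lineage capture	Prevent unknown sources, irreproducibility and unauditability	source manifest, hash, sampling window
Privacy gate	Control the purpose, access and retention of production data entering eval	privacy review, de-identification report, access log
Label authority	Prevent label definitions from being decided by a single engineering team	reviewer role, decision rationale, conflict record
Promotion gate	Confirm whether a case qualifies to influence a release	gate memo, coverage diff, severity mapping
Dataset immutability for release runs	Prevent tampering with evidence after release	dataset version hash, run artifact, approval timestamp
Contamination prevention	Prevent model training or prompt tuning from using gate sets and contaminating comparisons	access boundary, training exclusion record
Retirement workflow	Prevent expired cases from continuing to block releases or support wrong decisions	deprecation log, replacement mapping, archive record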
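
As one possible input to a training exclusion record, the sketch below checks whether any gate-set input also appears in tuning data. It only catches exact matches after whitespace and case normalization; paraphrased or near-duplicate leakage would need a different technique. The function names and sample texts are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated_cases(gate_cases: Dict[str, str], tuning_texts: Iterable[str]) -> List[str]:
    """Return the ids of gate cases whose input also appears in tuning data."""
    tuning_prints = {fingerprint(t) for t in tuning_texts}
    return sorted(cid for cid, text in gate_cases.items() if fingerprint(text) in tuning_prints)


# Hypothetical gate-set inputs and prompt-tuning examples.
gate_inputs = {
    "CC-GOLD-001": "Why was I charged an overdraft fee?",
    "CC-GOLD-002": "Can you refund the late payment fee on my card?",
}
tuning_examples = [
    "why was I charged an   overdraft fee?",   # same question, different spacing and case
    "How do I update my mailing address?",
]
leaked = contaminated_cases(gate_inputs, tuning_examples)
print(leaked)
```

Comparing fingerprints rather than raw text also means the check can run without exposing gate-set content to the team that owns the tuning data.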

Evidence packet

A high-risk AI release should be able to archive at least:

dataset inventory
coverage matrix
dataset version manifests and hashes
case lineage report
label authority and conflict log
privacy and retention decision
eval run report
slice failure report
promotion gate memo
open exceptions and accepted residual risk
release approval ledger
monitoring plan and failure-mining trigger
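
The list above can double as a machine-checkable completeness test in the release pipeline. This is a minimal sketch assuming the packet is a mapping from artifact name to a stored location; the snake_case keys are derived from the list, while `missing_artifacts` and the example paths are hypothetical.

```python
from typing import Dict, List

# Artifact names derived from the evidence packet list above.
REQUIRED_ARTIFACTS: List[str] = [
    "dataset_inventory",
    "coverage_matrix",
    "dataset_version_manifests_and_hashes",
    "case_lineage_report",
    "label_authority_and_conflict_log",
    "privacy_and_retention_decision",
    "eval_run_report",
    "slice_failure_report",
    "promotion_gate_memo",
    "open_exceptions_and_accepted_residual_risk",
    "release_approval_ledger",
    "monitoring_plan_and_failure_mining_trigger",
]


def missing_artifacts(packet: Dict[str, str]) -> List[str]:
    """Return required artifacts that are absent or have an empty location."""
    return [name for name in REQUIRED_ARTIFACTS if not packet.get(name)]


# Hypothetical packet: everything present except two artifacts.
packet = {name: f"evidence/release-41/{name}.json" for name in REQUIRED_ARTIFACTS}
del packet["slice_failure_report"]
packet["release_approval_ledger"] = ""

gaps = missing_artifacts(packet)
print("complete" if not gaps else f"incomplete: {gaps}")
```

Running this check before approval, rather than after release, is what keeps the packet from becoming evidence after the fact.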

Anti-patterns and failure modes

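Anti-pattern	Symptom	Consequence	Remedy
One big golden file	All samples mixed in one table, used for both development and release gating	Scores are uninterpretable, easily contaminated and untraceable	Build a portfolio: golden / challenge / adversarial / regression / monitor-only
Average-score release	Release as soon as the average score improves	Regressions in critical high-risk scenarios are masked	Set hard stops for critical failure, unauthorized action and PII leakage
Production dump testing	Production logs dumped straight into the test environment	Privacy, access, retention and customer trust risks	Use a privacy gate, de-identification, minimization, purpose approval and access isolation
Synthetic-only optimism	Large volumes of synthetic samples appear to give broad coverage	Does not represent real operational noise and customer behavior	Balance with real sample spot checks, production trace mining and realism review
Frozen forever golden set	Golden set unchanged for multiple quarters	Becomes invalid as product, policy, channel and customer distributions drift	Establish coverage drift monitoring and a retirement cadence
Label drift without versioning	Policy definitions change but historical labels carry no version	Old and new scores are not comparable	Manage label version, policy version and impact analysis together
Test set contamination	Development team repeatedly tunes prompts to pass the gate set	Generalization in production degrades	Restrict access, keep a holdout, record training exclusions
Deleting hard cases	Boundary samples deleted because they are hard to pass	Risk is understated and audits cannot explain the change	Handle through exception / risk acceptance; never delete silently
No retirement path	Cases based on expired policy keep blocking releases	Teams bypass eval or lose trust in it	Establish deprecation, replacement, archive and purge workflows
Evidence after the fact	Screenshots and spreadsheets added after release	Weak evidence that cannot prove controls actually ran	Generate versioned evidence automatically in the pipeline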
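
The "Average-score release" row is easiest to see with numbers. In the sketch below, a run with a 95% pass rate is still blocked because its single failure is a PII leak. The hard-stop categories come from the table; the result record shape, the `release_decision` function, the 0.90 threshold and the PASS / REVIEW / BLOCK labels are assumptions.

```python
from typing import Dict, List, Optional

# Hard-stop failure modes named in the remedy column above.
HARD_STOP_MODES = {"critical_failure", "unauthorized_action", "pii_leakage"}


def release_decision(results: List[Dict[str, Optional[object]]],
                     min_pass_rate: float = 0.90) -> Dict[str, object]:
    """Hard stops override the aggregate pass rate."""
    if not results:
        raise ValueError("no eval results to decide on")
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    hard_stops = [
        r["case_id"] for r in results
        if not r["passed"] and r.get("failure_mode") in HARD_STOP_MODES
    ]
    if hard_stops:
        decision = "BLOCK"
    elif pass_rate < min_pass_rate:
        decision = "REVIEW"
    else:
        decision = "PASS"
    return {"decision": decision, "pass_rate": pass_rate, "hard_stops": hard_stops}


# Hypothetical run: 19 passing cases and one PII leak.
run = [{"case_id": f"CC-GOLD-{i:03d}", "passed": True, "failure_mode": None} for i in range(19)]
run.append({"case_id": "CC-ADV-007", "passed": False, "failure_mode": "pii_leakage"})

outcome = release_decision(run)
print(outcome["decision"], outcome["pass_rate"], outcome["hard_stops"])
```

An average-only gate would have passed this run, which is exactly the masking effect the table warns about.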

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

Architecture domain	Dataset lifecycle implication	Financial retail example
RAG	Cases need query, retrieved sources, expected citation, policy conflict, answerability and source freshness	Customer service RAG must recognize exception clauses in fee policy and must not cite expired clauses
Agent	Cases need tool authority, approval requirement, tool argument oracle, side-effect boundary and rollback expectation	A payments agent may draft a dispute note but may not close a case without approval
Copilot	Cases need human role, editable draft, override reason, handoff trigger and analyst feedback	The AML copilot generates the narrative; the analyst retains decision authority
Eval platform	Dataset version must be bound to evaluator, run config, component version, threshold and result artifact	The release gate for a prompt / model / retriever / tool combination is reproducible
Governance	Dataset lifecycle must connect to risk tier, privacy, retention, access, approval, exception and evidence binder	Every release of a high-risk credit use case can answer "what data proves no critical regression was introduced"
Observability	Production trace mining feeds live failures, complaints, overrides, latency and fallbacks into the candidate backlog	A customer service complaint RCA finds a new failure mode, which enters the regression set
Change management	Changes to model, prompt, RAG, tool, policy or workflow can each trigger dataset impact analysis	After a RAG index rebuild, rerun the citation and policy conflict datasets
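
Dataset impact analysis in the change management row can start as a plain lookup from change type to datasets that must be rerun. Only the `rag_index` rule below comes from the table; the other rules, the dataset names and the fallback of rerunning everything for an unknown change type are assumptions chosen to be conservative.

```python
from typing import Dict, Iterable, List, Set

# Change type -> datasets to rerun. Only "rag_index" is taken from the table;
# the remaining rules are illustrative assumptions.
IMPACT_RULES: Dict[str, Set[str]] = {
    "rag_index": {"citation", "policy_conflict"},
    "tool_schema": {"regression", "adversarial"},
    "prompt": {"golden", "regression"},
    "model": {"golden", "challenge", "adversarial", "regression"},
}
ALL_DATASETS: Set[str] = set().union(*IMPACT_RULES.values())


def datasets_to_rerun(change_types: Iterable[str]) -> List[str]:
    """Union the impacted datasets; an unknown change type reruns everything."""
    selected: Set[str] = set()
    for change in change_types:
        selected |= IMPACT_RULES.get(change, ALL_DATASETS)
    return sorted(selected)


print(datasets_to_rerun(["rag_index"]))
print(datasets_to_rerun(["rag_index", "prompt"]))
print(datasets_to_rerun(["workflow"]))  # not in the rules, so every dataset is selected
```

Defaulting to a full rerun for unmapped change types trades compute for safety, which seems the right bias for a release gate.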

ADR draft

ADR: Establish Governed Eval Dataset Lifecycle and Test Data Factory

Status: proposed
Date: 2026-06-30
Decision owners: AI Product Owner, AI Platform Architect, Data Governance, Model Risk, Privacy, Business Operations

Context

AI systems in KYC, AML, credit, payments, contact center and complaints depend on eval datasets for release decisions, regression protection, monitoring and audit evidence. Current practice often treats eval samples as ad hoc spreadsheets. This creates weak lineage, unclear label authority, poor coverage visibility, privacy risk and unreliable release comparisons.

Decision

We will establish a governed eval dataset lifecycle and test data factory with the following capabilities:

Dataset portfolio: golden, challenge, adversarial, regression, synthetic and production sample sets.
Case registry: each eval case has source lineage, privacy state, labels, expected behavior, severity, dataset membership and retention policy.
Promotion gates: intake, privacy, label authority, coverage, promotion, change impact and retirement.
Test data factory: controlled real-case mining, synthetic authoring, mutation, privacy transformation, oracle construction and version packaging.
Evidence automation: every release run stores dataset manifest, version hash, run result, slice failures, approval, exception and retention decision.

Decision drivers

Release decisions must be defensible by slice, not only by aggregate score.
High-impact financial retail use cases require traceable evidence for customers, operations, risk, compliance and audit.
Production failures and complaints must become regression assets.
Synthetic data is useful but must be governed for realism, oracle quality and lineage.
Privacy and retention decisions must be made before data enters AI evaluation workflows.

Consequences

Positive:

Stronger release gates and regression control.
Better portfolio visibility across KYC, AML, credit, payments and customer operations.
Faster response to incidents through reusable regression cases.
Clearer evidence for audit, model risk and management review.

Trade-offs:

Upfront effort to define case schema, owners, gates and access boundaries.
Slower promotion of new cases until label authority and privacy checks are complete.
Need for tooling to avoid manual evidence assembly.

Alternatives considered

Alternative	Why not selected
Keep eval cases in team spreadsheets	Low overhead but weak lineage, access control, evidence quality and version comparability
Use only vendor benchmark results	Useful for model selection, insufficient for use-case-specific financial retail risk
Use only production samples	High realism but privacy-heavy, biased toward observed traffic and weak for rare/adversarial scenarios
Use only synthetic samples	Good for coverage expansion, insufficient for production distribution and operational noise

Implementation notes

Start with one high-value use case such as contact center RAG or AML copilot.
Define the case object schema and dataset inventory before building automation.
Make promotion gate output part of release evidence, not a separate governance attachment.
Treat dataset retirement as a first-class workflow to prevent stale gates.

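Interview answer: 30-second, 2-minute and CTO versions

30-second version

An AI eval dataset cannot be just a test spreadsheet. I would design it as a lifecycle asset: the golden set protects core capabilities, the challenge set covers complex business boundaries, the adversarial set tests security and privilege escalation, the regression set prevents known defects from returning, the synthetic set extends rare scenarios, and the production sample set detects distribution drift. Every case needs a source, labels, expected behavior, coverage slices, privacy and retention terms, a version and an evidence chain. Only then can a release decision answer "what data proves this version can ship, and which risks are still uncovered".

2-minute version

In financial retail AI, I would not treat the eval dataset as a QA attachment; I would treat it as a quality control product. Architecturally, first define the dataset portfolio: golden, challenge, adversarial, regression, synthetic and production sample. Then define the case object: source lineage, workflow context, expected behavior, unacceptable behavior, severity, label authority, dataset membership, retention policy and evidence links.

The core governance mechanism is the lifecycle gate. A candidate case first passes the intake gate, which confirms it falls within an approved purpose; then the privacy and retention gate, which confirms whether production data may be used for eval; then the label authority gate, which ensures an SME, risk or policy owner has adjudicated the expected behavior; and finally the coverage and promotion gates, which decide whether it enters the golden, challenge, adversarial or regression set. At release time, the eval run must be bound to dataset version, component version, threshold, slice result and release approval.

The product value of this architecture is that it shifts the release discussion from "did the score go up" to "are key customers, key workflows and key failure modes covered; do high-risk scenarios have hard stops; have production complaints and incidents entered the regression set; do data and label changes have impact analysis". For KYC, AML, credit, payments, contact center and complaints, this is closer to real product and risk management than model tuning alone.

CTO version

I would place the eval dataset lifecycle in the AI platform control plane. The foundation is the case registry and lineage graph. Above it sits the test data factory, responsible for real-case mining, synthetic case generation, variant construction, de-identification, oracle management and version packaging. Above that is a dataset promotion engine that pushes cases into the golden, challenge, adversarial, regression or monitor-only sets and records the privacy, label, coverage, severity and owner of each promotion as an auditable event.

For the release pipeline, I would require every model / prompt / RAG / tool / workflow / policy change to trigger dataset impact analysis and automatically select the datasets that must be rerun. Failures in high-risk slices, unauthorized tool actions, PII leakage and wrong denial reasons are hard stops; ordinary quality changes are assessed with trends and confidence intervals. The release evidence packet is not assembled by hand but generated by the platform: dataset manifest, hash, run result, slice diff, exception, approval, retention decision and monitoring trigger.

Strategically, this lets the organization avoid two extremes: ad hoc eval that cannot be audited on one side, and governance processes that slow delivery on the other. Once this is a platform capability, teams can move faster on model migration, RAG corpus refresh and agent permission changes while keeping an evidence chain that management, internal audit and regulators can understand.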

7-day practice plan

Day	Practice focus	Concrete output
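Day 1	Pick one financial retail AI use case, such as contact center RAG or AML copilot	Write a one-page use case boundary: users, workflow stage, AI role, prohibited uses, risk tier
Day 2	Design the dataset portfolio	List the purpose and sample sources of the golden / challenge / adversarial / regression / synthetic / production sample sets
Day 3	Define the case object schema	Write 20 fields and the governance meaning of each
Day 4	Design the promotion gates	Diagram the owners and evidence for the intake, privacy, label authority, coverage, promotion and retirement gates
Day 5	Build a coverage matrix	Fill in a coverage matrix by product, channel, language, customer type, failure mode and risk tier
Day 6	Write a release evidence packet	Simulate one prompt change and produce the dataset impact, run result, slice failure, exception and approval notes
Day 7	Prepare the interview narrative	Tell one dataset lifecycle decision story each for KYC, AML, payments and complaints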

Source anchors with links

These sources are used to calibrate the language of risk management, management systems, requirements and architecture description, security threats, observability and evidence. This note turns them into a product architecture for the AI eval dataset lifecycle; it does not constitute a legal, compliance, audit or certification conclusion.

Source	Link	How it is used
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses the Govern / Map / Measure / Manage approach to organize dataset risk, coverage, measurement and lifecycle management
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Turns generative AI risks into adversarial, privacy, content provenance, human oversight and monitoring cases
ISO/IEC 42001 AI management system	https://www.iso.org/standard/81230.html	Uses the AI management system perspective to organize owner, operational control, performance evaluation, improvement and management evidence
ISO/IEC/IEEE 29148 Requirements Engineering	https://www.iso.org/standard/72089.html	Turns business requirements, constraints, quality attributes and acceptance conditions into eval cases and expected behavior
ISO/IEC/IEEE 42010 Architecture Description	https://www.iso.org/standard/74393.html	Uses architecture viewpoint / concern / stakeholder thinking to describe the different views of the dataset lifecycle
OWASP LLM Top 10	https://owasp.org/www-project-top-10-for-large-language-model-applications/	Turns risks such as prompt injection, sensitive information disclosure, excessive agency and supply chain into the adversarial set
OpenTelemetry docs	https://opentelemetry.io/docs/	Uses traces, metrics, logs and observability to connect production monitoring, failure mining and evidence capture