
AI Eval Dataset Lifecycle: Golden Set and Test Data Factory Architecture

Date: 2026-06-30


AI Evaluation Dataset Lifecycle Architecture: Eval Dataset Lifecycle / Golden Set / Test Data Factory

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead / model risk partner
Output: advanced architecture note for portfolio, interview discussion, release governance design and AI platform roadmap


Why it matters for AI product/architecture

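In financial retail AI projects, an eval dataset is not a test spreadsheet, nor an experiment file private to the data science team. It is a production control asset: it determines how the system proves it is fit for use, how regressions are detected, how release decisions are explained, and how the team responds to regulatory and internal audit questions.

Mature teams manage AI eval datasets as lifecycle objects: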

business risk
  -> dataset portfolio
  -> golden / challenge / adversarial / regression sets
  -> real and synthetic case factory
  -> label governance
  -> coverage and drift control
  -> release promotion gates
  -> evidence packet
  -> production failure mining
  -> dataset retirement and retention

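This matters to PMs, BAs and architects because the core question for an AI product release is not whether the model score is high, but:

| Advanced question | Why it is an architecture question |
| --- | --- |
| Which samples qualify for the golden set | Must be defined jointly by business, risk, data, model and operations, not hand-picked by engineers |
| Which scenarios must be hard gates | Involves customer impact, regulatory obligations, human review, rollback and management risk acceptance |
| How synthetic and real cases are combined | Involves privacy, rare risks, coverage, representativeness and reproducibility |
| How label definition changes affect historical scores | Requires lineage, versioning, impact analysis and a consistent comparison basis |
| How coverage drift triggers new sampling | Shifts in the production distribution gradually invalidate an old golden set |
| How audit re-examines a release decision | Must be traceable to dataset version, case source, label authority, run result, approval and exception |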

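In one sentence:

AI Eval Dataset Lifecycle Architecture = upgrading test data from one-off samples into a quality asset supply chain that is versioned, governed, measurable, auditable and retirable.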


Concept diagram

flowchart TB
  A[Business use case and risk tier] --> B[Dataset intake]
  B --> C{Source type}
  C --> D[Real production or historical cases]
  C --> E[Synthetic and mutated cases]
  C --> F[Policy and scenario authored cases]

  D --> G[Privacy, consent, retention and de-identification gate]
  E --> H[Realism and oracle gate]
  F --> I[Business policy and SME gate]

  G --> J[Candidate case registry]
  H --> J
  I --> J

  J --> K[Label and expected-behavior governance]
  K --> L[Coverage model]
  L --> M{Promotion decision}

  M --> N[Golden set]
  M --> O[Challenge set]
  M --> P[Adversarial set]
  M --> Q[Regression set]
  M --> R[Monitor-only sample]

  N --> S[Eval runner and release gates]
  O --> S
  P --> S
  Q --> S
  S --> T[Evidence packet and approval ledger]
  T --> U[Production monitoring and failure mining]
  U --> B

  V[Dataset lineage graph] --- J
  V --- K
  V --- S
  W[Retention and legal hold controls] --- D
  W --- T

Core architecture model

1. Dataset portfolio, not one dataset

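Different datasets serve different decisions and must not be merged into a single "test set".

| Dataset type | Purpose | Entry criteria | Typical financial retail cases |
| --- | --- | --- | --- |
| Golden set | Stable comparison of core capabilities; protects critical business behavior | Labels adjudicated, key workflows covered, expected behavior clear, version frozen | KYC document review, credit memo summarization, customer service policy answers |
| Challenge set | Stress-tests boundaries, complex exceptions and low-frequency high-risk situations | Scenario difficulty is explicit, failure causes are explainable, production share is not a goal | Complex AML typologies, multi-product complaint liability attribution, payment dispute exceptions |
| Adversarial set | Tests bypass, injection, privilege escalation, data leakage and excessive agency | Security objective is explicit, severity is clearly defined, results must not be diluted by average scores | RAG prompt injection, agent updating CRM without authority, leaking customer PII |
| Regression set | Must be re-run on every change; protects fixed defects and critical paths | Derived from historical incidents, defects, complaints, model regressions or risk findings | A past incorrect explanation of a loan denial reason; a past missed escalation of a high-risk AML case |
| Synthetic set | Expands coverage, constructs rare scenarios, protects privacy and tests boundary combinations | Generation logic traceable, facts consistent, expected output verifiable | Rare KYC document combinations, extreme payment anomalies, bilingual complaints |
| Production sample set | Monitors the live distribution, discovers new failures, measures coverage drift | Sampling strategy, privacy controls, retention period and purpose boundary approved | Sampled real customer service Q&A, analyst copilot traces, complaint response quality spot checks |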

Architecture principles:

Golden protects stable core behavior.
Challenge exposes business edge cases.
Adversarial protects trust and security boundaries.
Regression prevents known failures from returning.
Synthetic expands controlled coverage.
Production sample detects distribution change.

2. Case object as the atomic unit

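Each eval case should be designed as a traceable object, not a single prompt row.

| Field | Description |
| --- | --- |
| case_id | Globally unique identifier, for example KYC-DOC-GOLD-2026Q3-00142 |
| use_case_id | Binds the case to a business use case and risk tier |
| source_type | real / synthetic / policy-authored / mutated / incident-derived |
| source_lineage | Original system, event, sampling window, generation rule, de-identification version, legal hold status |
| customer_context | Usable slices such as channel, product, region, language, customer type and vulnerable-customer flag |
| workflow_state | Workflow stage at which AI intervenes, human role, available evidence, prohibited actions |
| input_payload | User question, document summary, transaction facts, case notes, retrieved evidence or tool observation |
| expected_behavior | What the AI is expected to do: answer, refuse, escalate, cite, draft, or request approval before calling a tool |
| unacceptable_behavior | Explicit failure behaviors: unsupported assertions, wrongful refusal, unauthorized commitments, data leakage, missed escalation |
| labels | Business labels, risk labels, policy labels, failure mode, severity, segment |
| label_authority | Adjudication record from SME / risk / compliance / model validation / product owner |
| dataset_membership | Which version of golden, challenge, adversarial, regression or monitor-only the case belongs to |
| retention_policy | Retention period, deletion conditions, anonymization status, audit retention and access control |
| evidence_links | Reviews, approvals, run results, defects, release gates, change impact records |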
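
The field list above could be expressed as a typed record. The following is a minimal sketch, not a reference schema: the Python types chosen for each field and the `missing_governance_fields` helper (including which fields it treats as mandatory) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    REAL = "real"
    SYNTHETIC = "synthetic"
    POLICY_AUTHORED = "policy-authored"
    MUTATED = "mutated"
    INCIDENT_DERIVED = "incident-derived"


@dataclass(frozen=True)
class EvalCase:
    """One eval case as a traceable object; field names follow the table above."""
    case_id: str
    use_case_id: str
    source_type: SourceType
    source_lineage: dict
    customer_context: dict
    workflow_state: dict
    input_payload: dict
    expected_behavior: str
    unacceptable_behavior: tuple
    labels: dict
    label_authority: tuple      # adjudication records, one entry per reviewer decision
    dataset_membership: dict    # e.g. {"golden": "v2026.06.30"}
    retention_policy: dict
    evidence_links: tuple = ()


def missing_governance_fields(case: EvalCase) -> list:
    """Return governance fields that are empty.

    Hypothetical completeness rule: which fields are mandatory is an assumption.
    """
    required = ("source_lineage", "expected_behavior", "label_authority", "retention_policy")
    return [name for name in required if not getattr(case, name)]
```

A registry could call a check like `missing_governance_fields` before a case is allowed to leave the candidate backlog, so that a case without an adjudication record never reaches a promotion decision.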

3. Test data factory

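A Test Data Factory is not a script that "generates more test samples"; it is a controlled case supply chain.

| Factory capability | Architecture responsibility | Example |
| --- | --- | --- |
| Case mining | Discover candidate cases from production traces, defects, complaints, human overrides, incidents and monitoring alerts | An incorrect policy explanation appears in real customer service copilot Q&A |
| Case mutation | Vary language, channel, customer type, amount, date and evidence completeness while keeping business facts consistent | Turn one payment dispute case into mobile / branch / Spanish / missing receipt variants |
| Synthetic authoring | Construct cases that are rare in the real world, based on policy, process and risk scenarios | Build a control group pairing a high-risk AML typology with harmless look-alike behavior |
| Privacy transformation | De-identification, tokenization, data minimization, PII masking, synthetic substitution, access isolation | KYC address proof keeps the field structure but removes the real name and address |
| Oracle construction | Maintain reference answer, expected action, expected refusal, tool argument, citation target | Credit copilot must cite the correct reason code and must not generate a new denial reason |
| Coverage balancing | Adjust the mix by product, channel, language, risk, failure mode and customer impact | Complaint data must not cover only high-frequency credit card complaints; it must also cover loans, fraud and vulnerable customers |
| Version packaging | Produce a reproducible dataset release bound to manifest, hash, lineage and approval | AML-COPILOT-REGRESSION-v2026.06.30 |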
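
Version packaging in the last row can be illustrated with content hashing. This is a rough sketch under stated assumptions: cases are plain dictionaries with a `case_id` key, and the manifest layout and the function names `case_digest`, `build_manifest` and `verify_manifest` are hypothetical, not an existing tool's API.

```python
import hashlib
import json


def case_digest(case: dict) -> str:
    """Hash a case from a canonical JSON form so key order does not matter."""
    canonical = json.dumps(case, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(release_name: str, cases: list, approved_by: str) -> dict:
    """Build a manifest whose dataset hash changes if any case is edited, added or removed."""
    entries = sorted(
        ({"case_id": c["case_id"], "sha256": case_digest(c)} for c in cases),
        key=lambda entry: entry["case_id"],
    )
    rollup = hashlib.sha256(
        "".join(entry["sha256"] for entry in entries).encode("ascii")
    ).hexdigest()
    return {
        "release": release_name,
        "case_count": len(entries),
        "cases": entries,
        "dataset_hash": rollup,
        "approved_by": approved_by,
    }


def verify_manifest(manifest: dict, cases: list) -> bool:
    """Recompute the dataset hash and compare it with the stored manifest."""
    rebuilt = build_manifest(manifest["release"], cases, manifest["approved_by"])
    return rebuilt["dataset_hash"] == manifest["dataset_hash"]
```

Because entries are sorted by `case_id` before the roll-up hash is computed, the same set of cases yields the same `dataset_hash` regardless of load order, which is what makes a release such as AML-COPILOT-REGRESSION-v2026.06.30 reproducible.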

4. Lineage graph

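Dataset lineage must answer:

Where did this case come from?
Which privacy and quality processing steps did it go through?
Who assigned which labels and adjudications?
When did it enter which dataset version?
Which releases have used it?
Which results and approvals depend on it?
When does it need to be retired or relabeled?

Treat lineage as a graph rather than a folder: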

source event
  -> candidate case
  -> de-identified case
  -> label decision
  -> dataset version
  -> eval run
  -> release gate
  -> evidence packet
  -> production outcome
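
The chain above can be held as a small directed graph so that both provenance and downstream impact are queryable. This is a minimal in-memory sketch; the `LineageGraph` class and the node naming convention in the usage note are assumptions, and a production system would more likely use a metadata store or graph database.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of lineage nodes, e.g. 'case:123' -> 'dataset:golden-v3'."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def link(self, parent: str, child: str) -> None:
        """Record that `child` was derived from, or depends on, `parent`."""
        self._downstream[parent].add(child)
        self._upstream[child].add(parent)

    @staticmethod
    def _walk(start: str, edges: dict) -> set:
        # Breadth-first traversal; `seen` guards against revisiting shared nodes.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impacted_by(self, node: str) -> set:
        """Everything that depends on `node`: versions, runs, gates, evidence."""
        return self._walk(node, self._downstream)

    def provenance_of(self, node: str) -> set:
        """Everything `node` was derived from, back to the source event."""
        return self._walk(node, self._upstream)
```

With this shape, a relabeling decision can ask `impacted_by("label:L-77")` to find every eval run and release gate that relied on the old label, while an auditor can ask `provenance_of("evidence:EP-9")` to trace an evidence packet back to its source events.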

Lifecycle states and gates

Lifecycle states

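| State | Meaning | Allowed actions | Prohibited actions |
| --- | --- | --- | --- |
| Discovered | Candidate found via production, incidents, policy, SME workshop or synthetic factory | Record source, initial classification, risk flag | Enter a release gate directly |
| Quarantined | Contains sensitive data, legal hold, quality doubts or unclear purpose | Privacy assessment, access restriction, source system confirmation | Spread to development environments or vendors |
| Candidate | Passed basic purpose and privacy checks; awaiting label and coverage review | Labeling, adjudication, slice assignment, fact checking | Be publicized as a golden score |
| Reviewed | Labels, expected behavior and risk severity confirmed by authorized roles | Enter coverage review and promotion gate | Repeatedly modify historical results without versioning |
| Promoted | Approved into golden / challenge / adversarial / regression or similar sets | Participate in release gates, evidence packets, trend analysis | Silently modify case content or labels |
| Active | Current version used for release, monitoring or periodic review | Execution, comparison, sampled re-review, impact analysis | Be selectively deleted by the model team to pass tests |
| Watchlisted | Coverage drift, label dispute, policy change or quality risk found | Restrict use, trigger re-evaluation, keep historical results | Serve as the sole evidence for a high-risk release |
| Deprecated | Superseded by new policy, process, product or higher-quality cases | Keep historical lineage, stop new dependencies | Be physically deleted from historical evidence |
| Archived | Past active use but still retained for audit or retrospectives | Archive per retention policy with controlled access | Be used for a new release gate |
| Purged | Retention period expired with no legal or audit hold | Record proof of deletion | Be restored to use or redistributed |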
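
The states above lend themselves to an explicit state machine. The table defines what each state means but does not spell out which transitions are legal, so the `ALLOWED_TRANSITIONS` map below is an assumption inferred from the table, as is the rule that only Promoted and Active cases may gate a release. Treat it as a starting point for discussion rather than the intended design.

```python
from enum import Enum


class CaseState(Enum):
    DISCOVERED = "discovered"
    QUARANTINED = "quarantined"
    CANDIDATE = "candidate"
    REVIEWED = "reviewed"
    PROMOTED = "promoted"
    ACTIVE = "active"
    WATCHLISTED = "watchlisted"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"
    PURGED = "purged"


S = CaseState
# Assumed transition map; the source table does not define it explicitly.
ALLOWED_TRANSITIONS = {
    S.DISCOVERED: {S.QUARANTINED, S.CANDIDATE},
    S.QUARANTINED: {S.CANDIDATE, S.PURGED},
    S.CANDIDATE: {S.REVIEWED, S.QUARANTINED},
    S.REVIEWED: {S.PROMOTED, S.CANDIDATE},
    S.PROMOTED: {S.ACTIVE},
    S.ACTIVE: {S.WATCHLISTED, S.DEPRECATED},
    S.WATCHLISTED: {S.ACTIVE, S.DEPRECATED},
    S.DEPRECATED: {S.ARCHIVED},
    S.ARCHIVED: {S.PURGED},
    S.PURGED: set(),
}

# Assumption: only these states may contribute to a release gate.
GATE_ELIGIBLE = {S.PROMOTED, S.ACTIVE}


def transition(current: CaseState, target: CaseState, reason: str) -> dict:
    """Validate a state change and return an audit record for the lineage log."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {target.value} is not an allowed transition")
    if not reason.strip():
        raise ValueError("a transition needs a recorded reason")
    return {"from": current.value, "to": target.value, "reason": reason}


def can_gate_release(state: CaseState) -> bool:
    return state in GATE_ELIGIBLE
```

Requiring a non-empty reason on every transition is one way to rule out silent deletion or silent modification: a case cannot move toward Deprecated or Purged without leaving a governance record.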

Promotion gates

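| Gate | Key decision | Required evidence | Typical blocking condition |
| --- | --- | --- | --- |
| Intake gate | Whether the candidate case falls under an approved purpose | use case mapping, source lineage, business rationale | Unknown source; purpose outside the approved scope |
| Privacy and retention gate | Whether use in eval is allowed, how long it is retained, who may access it | data classification, PII handling, retention rule, access group | Production PII not de-identified; customer consent boundary unclear; legal hold conflict |
| Label authority gate | Whether labels and expected behavior were adjudicated by authorized roles | reviewer role, decision timestamp, rationale, conflict resolution | SME disagreement unresolved; policy owner has not confirmed |
| Coverage gate | Whether the case fills a key slice or risk gap | coverage matrix, slice target, failure mode mapping | Only adds duplicate high-frequency samples without improving coverage |
| Promotion gate | Which set it should enter and whether it can act as a release blocker | severity, dataset membership, gate threshold, owner approval | Synthetic case has no oracle; adversarial case severity unclear |
| Change impact gate | Whether a model / prompt / RAG / tool / policy change requires re-running or relabeling | impacted datasets, version diff, required reruns | Old scores still used after a label definition or policy change |
| Retirement gate | Whether the case is expired, retained, replaced or deleted | deprecation reason, replacement case, archive location, purge record | Deleted because results were poor, with no governance record |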
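
The first five gates can be sketched as an ordered evidence check that stops at the first gate with missing evidence. The evidence keys are taken from the "Required evidence" column; the snake_case key names, the `evaluate_gates` function and the decision to model only presence of evidence (not its quality) are simplifying assumptions. The change impact and retirement gates are left out here because they act on already promoted cases.

```python
# Ordered gates and the evidence each one requires (keys derived from the table above).
GATE_EVIDENCE = [
    ("intake", ("use_case_mapping", "source_lineage", "business_rationale")),
    ("privacy_and_retention", ("data_classification", "pii_handling", "retention_rule", "access_group")),
    ("label_authority", ("reviewer_role", "decision_timestamp", "rationale", "conflict_resolution")),
    ("coverage", ("coverage_matrix", "slice_target", "failure_mode_mapping")),
    ("promotion", ("severity", "dataset_membership", "gate_threshold", "owner_approval")),
]


def evaluate_gates(evidence: dict) -> dict:
    """Walk the gates in order and stop at the first one with missing evidence."""
    passed = []
    for gate, required in GATE_EVIDENCE:
        missing = [key for key in required if not evidence.get(key)]
        if missing:
            return {"promoted": False, "blocked_at": gate, "missing": missing, "passed": passed}
        passed.append(gate)
    return {"promoted": True, "blocked_at": None, "missing": [], "passed": passed}
```

The returned dictionary doubles as a gate memo: it names the blocking gate and the specific evidence gap, which can be stored alongside the case instead of being reconstructed later.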

Financial retail scenarios

1. KYC and onboarding

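| Dataset role | Example cases | Architecture decision |
| --- | --- | --- |
| Golden set | Standard identity proof, address proof, name variants, date validity, minor OCR errors | Protect main-flow accuracy and the human review boundary |
| Challenge set | Non-English documents, joint accounts, address proof with missing fields, assisted flows for vulnerable customers | Test policy uncertainty and handoff |
| Adversarial set | Text injection in forged documents; malicious document notes telling the model to ignore rules | Verify that document content cannot override system policy |
| Regression set | Cases where an expired address proof was once misjudged as valid | Must be re-run on every OCR, prompt or document taxonomy change |
| Synthetic set | Rare places of issuance, complex name formats, screenshot quality across channels | Expand coverage while keeping fact-consistency checks |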

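PM / architecture decisions:

  • Does the AI recommend pass/review, or does it take part in the final rejection? Different decision boundaries need different dataset gates.
  • Can document images enter the test environment? If not, field-level synthesis or de-identification is needed.
  • Do customer resubmissions and human override outcomes feed back into the regression set?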

2. AML alert investigation

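| Dataset role | Example cases | Architecture decision |
| --- | --- | --- |
| Golden set | Common alert narratives, account relationship summaries, evidence collection suggestions | Protect the baseline productivity of the analyst copilot |
| Challenge set | Multi-account, multi-currency, cash-intensive industries, cross-border patterns, similar but legitimate behavior | Reduce over-alerting and missed escalations |
| Adversarial set | Case note contains the instruction "do not report this transaction" | Verify that the RAG / agent does not execute malicious instructions found in evidence |
| Regression set | Cases that once omitted a key transaction chain or cited an incorrect SAR rationale | Prevent narrative quality from regressing |
| Production sample set | Analyst overrides, second-review returns, quality spot checks and management feedback | Detect typology and process drift |

Key control: for high-risk AML cases, AI output should be analyst assistant evidence, not the final investigation conclusion.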

3. Credit and lending

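| Dataset role | Example cases | Architecture decision |
| --- | --- | --- |
| Golden set | Credit memo summary, income stability, debt ratio, policy exception notes | Ensure output covers the facts required for approval |
| Challenge set | Thin credit files, self-employed income, co-borrowers, recent hardship | Protect complex customer segments and the human judgment boundary |
| Adversarial set | Inducing the model to generate unapproved reason codes or promise a loan outcome | Prevent misleading customers and unauthorized advice |
| Regression set | Cases that once cited the wrong denial reason or ignored incomplete documentation | Bind to compliance wording and policy version |
| Synthetic set | Combinations of income structure, product, region and channel | Increase coverage without constructing unrealistic customer facts |

Key control: a credit AI dataset must distinguish "internal memo assistance" from "customer-visible explanation".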

4. Payments and fraud

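| Dataset role | Example cases | Architecture decision |
| --- | --- | --- |
| Golden set | Common card-not-present disputes, authorized push payment scams, refund failures | Protect classification and next-step recommendations |
| Challenge set | Transaction was authorized but the customer claims to have been scammed, conflicting merchant evidence, cross-border time zone issues | Test policy conflict and escalation |
| Adversarial set | User asks the agent to bypass a freeze, modify transaction status or disclose counterparty data | Verify tool authority and data boundary |
| Regression set | Cases that once wrongly recommended an irreversible action or wrongly closed a dispute | Must be re-run on every tool schema or workflow change |
| Synthetic set | Rare combinations of amount, channel, merchant category and case age | Support coverage of low-frequency, high-impact risks |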

5. Contact center and complaints

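| Dataset role | Example cases | Architecture decision |
| --- | --- | --- |
| Golden set | High-frequency policy Q&A, account servicing, fee explanations, complaint summaries | Protect customer-visible answer quality |
| Challenge set | Highly emotional customers, multi-product complaints, regulatory-sensitive wording, vulnerable customers | Test tone, escalation and completeness |
| Adversarial set | Prompt injection, unauthorized refunds, requests to disclose another person's account | Protect customer data and operating permissions |
| Regression set | Incorrect scripts and policy explanations that once caused complaint escalation | Feed complaint RCA back into the dataset lifecycle |
| Production sample set | QA spot checks, complaint root causes, customer callbacks, human rewrites | Detect coverage drift and new policy issues |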

Metrics/control/evidence model

Metrics

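| Metric | Explanation | Decision use |
| --- | --- | --- |
| Coverage by slice | Coverage across product, channel, language, region, customer type, risk tier and workflow stage | Judge whether the dataset can support the release scope |
| Critical scenario coverage | Whether each high-impact failure mode has cases | Set hard gates |
| Synthetic-real mix | Ratio of synthetic to real/historical samples | Judge the balance among representativeness, privacy and rare-scenario coverage |
| Label dispute rate | Share of labels and expected behaviors that are disputed | Detect unstable policy interpretation |
| Label aging | Whether the policy, process or product version behind a label is out of date | Trigger re-evaluation or retirement |
| Lineage completeness | Whether case source, processing, labels, version, runs and approvals are complete | Support audit traceability |
| Coverage drift | Gap between the production distribution and active dataset slices | Trigger case mining and new sampling |
| Escaped failure capture | Share and timeliness of production defects entering the regression set | Measure closed-loop quality |
| Gate sensitivity | Whether the dataset can detect high-risk regressions | Prevent average scores from masking critical failures |
| Retention compliance | Whether active / archived / purged states comply with policy | Control privacy, audit and legal risk |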
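
Coverage drift can be made concrete by comparing slice shares between production traffic and the active dataset. The sketch below is one possible formulation, not a prescribed metric: it uses total variation distance as the summary number, and the `gap_threshold` default of 0.10 is an arbitrary illustrative value.

```python
from collections import Counter


def slice_shares(slices: list) -> dict:
    """Turn a list of slice labels (one per case) into share-of-total per slice."""
    if not slices:
        raise ValueError("cannot compute shares for an empty sample")
    counts = Counter(slices)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def coverage_drift(production: list, dataset: list, gap_threshold: float = 0.10) -> dict:
    """Compare production and dataset slice shares.

    Returns the total variation distance (0 = identical mix, 1 = disjoint) and
    the slices whose production share exceeds the dataset share by more than
    `gap_threshold` (an assumed, tunable cut-off).
    """
    prod, data = slice_shares(production), slice_shares(dataset)
    gaps = {name: prod.get(name, 0.0) - data.get(name, 0.0) for name in set(prod) | set(data)}
    total_variation = 0.5 * sum(abs(gap) for gap in gaps.values())
    under_covered = sorted(name for name, gap in gaps.items() if gap > gap_threshold)
    return {"total_variation": total_variation, "under_covered": under_covered}
```

For example, if fraud complaints make up 20% of sampled production traffic but have no cases in the active dataset, `under_covered` would list that slice, which is the signal the table associates with triggering case mining. For rare high-risk slices a share-based threshold is likely too coarse, and a minimum case count per critical scenario would be a sensible complement.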

Controls

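| Control | Control intent | Evidence |
| --- | --- | --- |
| Dataset owner and steward assignment | Each dataset version has a business owner and a technical owner | owner registry, approval ledger |
| Source lineage capture | Prevent unknown origin, irreproducibility and unauditability | source manifest, hash, sampling window |
| Privacy gate | Control purpose, access and retention for production data entering eval | privacy review, de-identification report, access log |
| Label authority | Prevent label definitions from being decided by a single engineering team | reviewer role, decision rationale, conflict record |
| Promotion gate | Confirm whether a case is qualified to influence a release | gate memo, coverage diff, severity mapping |
| Dataset immutability for release runs | Prevent evidence tampering after release | dataset version hash, run artifact, approval timestamp |
| Contamination prevention | Prevent model training or prompt tuning from using gate sets and contaminating comparisons | access boundary, training exclusion record |
| Retirement workflow | Prevent expired cases from continuing to block releases or support wrong decisions | deprecation log, replacement mapping, archive record |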
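
The contamination prevention control could be backed by an automated overlap check between gate-set inputs and any tuning or few-shot material. The following is a deliberately simple sketch: `fingerprint` and `find_contamination` are hypothetical helpers, and the check only catches copies that are identical after lowercasing and whitespace normalization, not paraphrases.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_contamination(gate_inputs: dict, tuning_examples: list) -> list:
    """Return case_ids of gate cases whose input also appears in tuning material.

    gate_inputs maps case_id -> input text; tuning_examples is a list of texts
    used for training, prompt tuning or few-shot examples.
    """
    tuning_prints = {fingerprint(text) for text in tuning_examples}
    return sorted(
        case_id for case_id, text in gate_inputs.items()
        if fingerprint(text) in tuning_prints
    )
```

A non-empty result is a candidate entry for the training exclusion record. Access boundaries remain the primary control; a check like this only detects leakage after the fact.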

Evidence packet

A high-risk AI release should be able to archive at least:

dataset inventory
coverage matrix
dataset version manifests and hashes
case lineage report
label authority and conflict log
privacy and retention decision
eval run report
slice failure report
promotion gate memo
open exceptions and accepted residual risk
release approval ledger
monitoring plan and failure-mining trigger

Anti-patterns and failure modes

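| Anti-pattern | Symptom | Consequence | Remedy |
| --- | --- | --- | --- |
| One big golden file | All samples sit in one table, used for both development and release gating | Scores are unexplainable, easily contaminated and untraceable | Build a portfolio: golden / challenge / adversarial / regression / monitor-only |
| Average-score release | Release as soon as the average score improves | Regressions in critical high-risk scenarios are masked | Set hard stops for critical failure, unauthorized action and PII leakage |
| Production dump testing | Production logs are dumped directly into the test environment | Privacy, access, retention and customer trust risks | Use a privacy gate, de-identification, minimization, purpose approval and access isolation |
| Synthetic-only optimism | A large volume of synthetic samples appears to give broad coverage | Does not represent real operational noise and customer behavior | Balance with real-sample spot checks, production trace mining and realism review |
| Frozen forever golden set | Golden set unchanged for multiple quarters | Becomes invalid after product, policy, channel and customer distribution drift | Establish coverage drift monitoring and a retirement cadence |
| Label drift without versioning | Policy interpretation changes but historical labels carry no version | Old and new scores are not comparable | Manage label version, policy version and impact analysis together |
| Test set contamination | Development team repeatedly tunes prompts to pass the gate set | Generalization in production declines | Restrict access, keep a holdout, record training exclusions |
| Deleting hard cases | Boundary samples are deleted because they are hard to pass | Risk is understated and audits cannot be explained | Handle through exception / risk acceptance; no silent deletion |
| No retirement path | Cases based on expired policy keep blocking releases | Teams bypass eval or lose trust in it | Establish deprecation, replacement, archive and purge workflows |
| Evidence after the fact | Screenshots and spreadsheets are added after release | Weak evidence; cannot prove the control actually operated | Generate versioned evidence automatically in the pipeline |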
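
The remedy for average-score release can be shown with a small decision function in which hard-stop failure modes block a release regardless of the aggregate pass rate. This is a sketch under assumptions: the result record shape (`case_id`, `passed`, `failure_mode`) and the function name `release_decision` are hypothetical; the three hard-stop categories come from the table above.

```python
# Failure modes that block a release no matter how good the average looks.
HARD_STOP_FAILURE_MODES = {"critical_failure", "unauthorized_action", "pii_leakage"}


def release_decision(results: list, baseline_pass_rate: float) -> dict:
    """Decide on a release from per-case results.

    Each result is a dict with 'case_id', 'passed' (bool) and 'failure_mode'
    (str or None). Hard stops are checked before the aggregate comparison.
    """
    if not results:
        raise ValueError("no eval results supplied")
    hard_stops = sorted(
        r["case_id"] for r in results
        if not r["passed"] and r["failure_mode"] in HARD_STOP_FAILURE_MODES
    )
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if hard_stops:
        decision = "block"
    elif pass_rate < baseline_pass_rate:
        decision = "review"  # ordinary quality drop: human judgment, not auto-block
    else:
        decision = "pass"
    return {"decision": decision, "pass_rate": pass_rate, "hard_stops": hard_stops}
```

The point of the ordering is that a run with a 90% pass rate against an 80% baseline is still blocked if the single failing case leaked PII, which is exactly the case an average-only rule would wave through.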

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

| Architecture domain | Dataset lifecycle implication | Financial retail example |
| --- | --- | --- |
| RAG | Cases need query, retrieved sources, expected citation, policy conflict, answerability and source freshness | A customer service RAG must recognize exception clauses in fee policy and must not cite expired clauses |
| Agent | Cases need tool authority, approval requirement, tool argument oracle, side-effect boundary and rollback expectation | A payments agent may draft a dispute note but may not close a case without approval |
| Copilot | Cases need human role, editable draft, override reason, handoff trigger and analyst feedback | The AML copilot generates the narrative; the analyst keeps decision authority |
| Eval platform | Dataset version must be bound to evaluator, run config, component version, threshold and result artifact | The release gate for a prompt / model / retriever / tool combination is reproducible |
| Governance | Dataset lifecycle must connect risk tier, privacy, retention, access, approval, exception and evidence binder | Every release of a high-risk credit use case can answer "what data proves no critical regression was introduced" |
| Observability | Production trace mining feeds live failures, complaints, overrides, latency and fallbacks into the candidate backlog | Complaint RCA in customer service finds a new failure mode, which enters the regression set |
| Change management | Model, prompt, RAG, tool, policy and workflow changes can all trigger dataset impact analysis | Re-run citation and policy conflict datasets after a RAG index rebuild |
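
Dataset impact analysis in the change management row can start as a simple lookup from change type to the datasets that must be re-run. In the sketch below the trigger pairs mirror examples given in this note (OCR / prompt / document taxonomy changes for the KYC regression set, tool schema / workflow changes for the payments regression set, RAG index rebuilds for citation and policy conflict datasets); the dataset names and the `datasets_to_rerun` function are hypothetical.

```python
# Each dataset declares which kinds of change force a re-run (names are illustrative).
RERUN_TRIGGERS = {
    "kyc-regression": {"ocr", "prompt", "document_taxonomy"},
    "payments-regression": {"tool_schema", "workflow"},
    "rag-citation": {"rag_index"},
    "rag-policy-conflict": {"rag_index"},
}


def datasets_to_rerun(changes: list) -> list:
    """Return the datasets whose declared triggers overlap the change set."""
    changed = set(changes)
    return sorted(
        name for name, triggers in RERUN_TRIGGERS.items()
        if triggers & changed
    )
```

Keeping the trigger declaration next to each dataset, rather than in the release pipeline, lets dataset owners decide what invalidates their evidence; a release that includes a `rag_index` change would then pick up both RAG datasets automatically.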

ADR draft

ADR: Establish Governed Eval Dataset Lifecycle and Test Data Factory

Status: proposed
Date: 2026-06-30
Decision owners: AI Product Owner, AI Platform Architect, Data Governance, Model Risk, Privacy, Business Operations

Context

AI systems in KYC, AML, credit, payments, contact center and complaints depend on eval datasets for release decisions, regression protection, monitoring and audit evidence. Current practice often treats eval samples as ad hoc spreadsheets. This creates weak lineage, unclear label authority, poor coverage visibility, privacy risk and unreliable release comparisons.

Decision

We will establish a governed eval dataset lifecycle and test data factory with the following capabilities:

  1. Dataset portfolio: golden, challenge, adversarial, regression, synthetic and production sample sets.
  2. Case registry: each eval case has source lineage, privacy state, labels, expected behavior, severity, dataset membership and retention policy.
  3. Promotion gates: intake, privacy, label authority, coverage, promotion, change impact and retirement.
  4. Test data factory: controlled real-case mining, synthetic authoring, mutation, privacy transformation, oracle construction and version packaging.
  5. Evidence automation: every release run stores dataset manifest, version hash, run result, slice failures, approval, exception and retention decision.

Decision drivers

  • Release decisions must be defensible by slice, not only by aggregate score.
  • High-impact financial retail use cases require traceable evidence for customers, operations, risk, compliance and audit.
  • Production failures and complaints must become regression assets.
  • Synthetic data is useful but must be governed for realism, oracle quality and lineage.
  • Privacy and retention decisions must be made before data enters AI evaluation workflows.

Consequences

Positive:

  • Stronger release gates and regression control.
  • Better portfolio visibility across KYC, AML, credit, payments and customer operations.
  • Faster response to incidents through reusable regression cases.
  • Clearer evidence for audit, model risk and management review.

Trade-offs:

  • Upfront effort to define case schema, owners, gates and access boundaries.
  • Slower promotion of new cases until label authority and privacy checks are complete.
  • Need for tooling to avoid manual evidence assembly.

Alternatives considered

| Alternative | Why not selected |
| --- | --- |
| Keep eval cases in team spreadsheets | Low overhead but weak lineage, access control, evidence quality and version comparability |
| Use only vendor benchmark results | Useful for model selection, insufficient for use-case-specific financial retail risk |
| Use only production samples | High realism but privacy-heavy, biased toward observed traffic and weak for rare/adversarial scenarios |
| Use only synthetic samples | Good for coverage expansion, insufficient for production distribution and operational noise |

Implementation notes

  • Start with one high-value use case such as contact center RAG or AML copilot.
  • Define the case object schema and dataset inventory before building automation.
  • Make promotion gate output part of release evidence, not a separate governance attachment.
  • Treat dataset retirement as a first-class workflow to prevent stale gates.

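Interview answer: 30-second, 2-minute and CTO versions

30-second version

An AI eval dataset cannot be just a test spreadsheet. I would design it as a lifecycle asset: the golden set protects core capabilities, the challenge set covers complex business boundaries, the adversarial set tests security and privilege escalation, the regression set keeps known defects from returning, the synthetic set expands rare scenarios, and the production sample set detects distribution drift. Every case needs a source, labels, expected behavior, coverage slices, privacy and retention terms, a version and an evidence chain. Only then can a release decision answer "what data proves this version can ship, and which risks are still uncovered".

2-minute version

In financial retail AI, I would not treat the eval dataset as a QA attachment; I would treat it as a quality control product. Architecturally, first define the dataset portfolio: golden, challenge, adversarial, regression, synthetic and production sample. Then define the case object: source lineage, workflow context, expected behavior, unacceptable behavior, severity, label authority, dataset membership, retention policy and evidence links.

The core governance is the lifecycle gate. A candidate case first passes the intake gate, confirming it falls under an approved purpose; then the privacy and retention gate, confirming whether production data may be used for eval; then the label authority gate, ensuring an SME, risk or policy owner has adjudicated the expected behavior; and finally the coverage and promotion gates, which decide whether it enters the golden, challenge, adversarial or regression set. At release time, the eval run must be bound to dataset version, component version, threshold, slice result and release approval.

The product value of this architecture is that it moves the release discussion from "did the score go up" to "are key customers, key workflows and key failure modes covered; do high-risk scenarios have hard stops; have production complaints and incidents entered the regression set; have data and label changes gone through impact analysis". For KYC, AML, credit, payments, contact center and complaints, this is closer to real product and risk management than model tuning alone.

CTO version

I would place the eval dataset lifecycle in the AI platform control plane. The foundation is the case registry and lineage graph. Above it sits the test data factory, responsible for real-case mining, synthetic case generation, variant construction, de-identification, oracle management and version packaging. Above that is the dataset promotion engine, which pushes cases into the golden, challenge, adversarial, regression or monitor-only sets and records the privacy, label, coverage, severity and owner of every promotion as an auditable event.

For the release pipeline, I would require every model / prompt / RAG / tool / workflow / policy change to trigger dataset impact analysis and automatically select the datasets that must be re-run. High-risk slice failures, unauthorized tool actions, PII leakage and incorrect denial reasons get hard stops; ordinary quality changes are assessed with trends and confidence intervals as decision support. The release evidence packet is not assembled by hand but generated by the platform: dataset manifest, hash, run result, slice diff, exception, approval, retention decision and monitoring trigger.

Strategically, this keeps the organization away from two extremes: ad hoc eval that cannot be audited on one side, and governance processes that slow delivery on the other. Once this is a platform capability, teams can move faster on model migration, RAG corpus refresh and agent permission changes while keeping an evidence chain that management, internal audit and regulators can understand.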


7-day practice plan

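| Day | Practice focus | Concrete output |
| --- | --- | --- |
| Day 1 | Pick one financial retail AI use case, such as contact center RAG or AML copilot | Write a one-page use case boundary: users, workflow stage, AI role, prohibited uses, risk tier |
| Day 2 | Design the dataset portfolio | List the purpose and sample sources of the golden / challenge / adversarial / regression / synthetic / production sample sets |
| Day 3 | Define the case object schema | Write 20 fields and the governance meaning of each |
| Day 4 | Design promotion gates | Map the responsibilities and evidence for the intake, privacy, label authority, coverage, promotion and retirement gates |
| Day 5 | Build a coverage matrix | Fill in a coverage matrix by product, channel, language, customer type, failure mode and risk tier |
| Day 6 | Write a release evidence packet | Simulate a prompt change and produce the dataset impact, run result, slice failure, exception and approval notes |
| Day 7 | Prepare the interview narrative | Tell one dataset lifecycle decision story each for KYC, AML, payments and complaints |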

These sources are used to calibrate the language of risk management, management systems, requirements and architecture description, security threats, observability and evidence. This article turns them into a product architecture for the AI eval dataset lifecycle; it does not constitute a legal, compliance, audit or certification conclusion.

| Source | Link | How it is used |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses the Govern / Map / Measure / Manage approach to organize dataset risk, coverage, measurement and lifecycle management |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Turns generative AI risks into adversarial, privacy, content provenance, human oversight and monitoring cases |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/81230.html | Uses the AI management system perspective to organize owner, operational control, performance evaluation, improvement and management evidence |
| ISO/IEC/IEEE 29148 Requirements Engineering | https://www.iso.org/standard/72089.html | Turns business requirements, constraints, quality attributes and acceptance conditions into eval cases and expected behavior |
| ISO/IEC/IEEE 42010 Architecture Description | https://www.iso.org/standard/74393.html | Uses the architecture viewpoint / concern / stakeholder approach to describe the different views of the dataset lifecycle |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Turns risks such as prompt injection, sensitive information disclosure, excessive agency and supply chain into the adversarial set |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Uses trace, metric, log and observability to connect production monitoring, failure mining and evidence capture |