AI Eval Dataset Lifecycle / Golden Set / Test Data Factory Playbook
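Positioning: written for senior CBAP, retail financial services PM, AI Product Architect, Solution Architect, AI Governance and Model Risk practitioners. The aim is to upgrade the eval dataset from a testing attachment into a governable, reusable and auditable supply chain of AI quality assets.
Scope of use: this playbook applies to AI products such as KYC, AML, credit, payments, contact center, complaints, RAG, agent, copilot and workflow automation. It does not replace the formal judgment of Legal, Compliance, Privacy, Model Risk, Information Security, Internal Audit or business management.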
Purpose and when to use
Purpose
This playbook establishes an executable operating model for AI eval datasets:
dataset inventory
-> case intake
-> privacy and retention decision
-> label governance
-> coverage matrix
-> golden / challenge / adversarial / regression promotion
-> test data factory
-> release gate
-> evidence packet
-> production feedback loop
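The goal is not to “prepare more test samples” but to let the team answer:
- Which datasets support this release decision?
- Which high-risk scenarios are hard gates?
- How are synthetic cases and real cases combined?
- Who adjudicates labels, and against which policy version?
- How do coverage gaps, coverage drift and historical failures feed the next dataset version?
- After a release, how does audit trace the dataset version, run result, approval and exception?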
When to use
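| Trigger | How to use |
|---|---|
| New AI use case initiated | Define the dataset portfolio and case schema first, then define model or prompt evaluation |
| Model / prompt / RAG / tool / workflow change | Run dataset impact analysis and select the datasets that must be rerun |
| High-risk retail financial release | Produce a promotion gate and an evidence packet |
| Surge in production complaints, incidents or human overrides | Convert failure traces into candidate regression cases |
| Coverage drift or policy change | Re-evaluate labels, retire outdated cases, add to the challenge set |
| Regulatory, internal audit or model risk review | Use the evidence packet to demonstrate the data, label, run and approval chain |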
Operating model
1. Roles and decision rights
| Role | Decision right | Key responsibilities |
|---|---|---|
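| Business owner | Approves business use, release scope and residual risk | Defines customer impact, business boundaries and release appetite |
| Product manager | Maintains the dataset roadmap and release gate narrative | Connects business outcomes, failure modes, metrics, gates and evidence |
| Senior BA / CBAP | Manages requirements, processes, policies, label definitions and the coverage matrix | Turns business rules, exceptions, customer journeys and control points into eval cases |
| AI architect | Designs the case registry, lineage, factory, eval runner and evidence integration | Ensures versions, dependencies, observability and security boundaries are implementable |
| Data governance / privacy | Approves data sources, de-identification, purpose, access, retention and deletion | Controls the boundary for real cases entering eval |
| SME / operations lead | Adjudicates expected behavior, labels and failure severity | Maintains business realism and operational feasibility |
| Model risk / independent review | Challenges coverage, label stability, gate thresholds and evidence sufficiency | Provides effective challenge for high-impact use cases |
| Security | Reviews the adversarial set, prompt injection, tool authority and PII leakage | Turns security threats into tests and hard gates |
| Release manager | Binds dataset version, run result, approval and rollback | Ensures the release process does not bypass the dataset gate |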
2. Cadence
| Cadence | Activities | Output |
|---|---|---|
| Every release | dataset impact analysis, required reruns, evidence packet | release gate memo |
| Weekly | candidate case triage, label conflict review, failure mining | updated candidate backlog |
| Monthly | coverage drift review, production sample review, retirement candidates | coverage and drift report |
| Quarterly | dataset portfolio review, access review, retention review, management summary | dataset governance report |
| After a major incident | incident-to-regression conversion, root cause cases, gate strengthening | regression set update |
3. Dataset lifecycle workflow
1. Discover
Discover cases from production traces, SME workshops, policy changes, incidents, complaints and synthetic scenarios.
2. Classify
Tag the use case, source type, risk tier, customer impact, privacy state and failure mode.
3. Govern
Pass the privacy, retention, label authority, coverage and promotion gates.
4. Promote
Assign the case to the golden, challenge, adversarial, regression, synthetic or monitor-only set.
5. Execute
Bind the dataset version, component version, evaluator, threshold and release decision.
6. Evidence
Archive the manifest, hash, lineage, run result, slice result, exception, approval and retention decision.
7. Improve
Use production failures, drift, complaints, overrides and policy changes to add samples, relabel, retire cases or upgrade gates.
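The seven steps above can be sketched as a small state machine that refuses to skip a step. This is only an illustration: the `Stage` enum and `advance` function are hypothetical names, and the loop from Improve back to Discover is an assumption drawn from the production feedback loop in the Purpose section.

```python
from enum import Enum


class Stage(Enum):
    """The seven lifecycle steps, in the order listed in the workflow."""
    DISCOVER = 1
    CLASSIFY = 2
    GOVERN = 3
    PROMOTE = 4
    EXECUTE = 5
    EVIDENCE = 6
    IMPROVE = 7


def advance(current: Stage, target: Stage) -> Stage:
    """Move a case to `target` only if it is the immediate next step.

    Assumption: Improve feeds back into Discover, closing the loop.
    """
    if current is Stage.IMPROVE:
        expected = Stage.DISCOVER
    else:
        expected = Stage(current.value + 1)
    if target is not expected:
        raise ValueError(
            f"cannot move from {current.name} to {target.name}; "
            f"next step is {expected.name}"
        )
    return target
```

A case that tries to jump from Classify straight to Promote raises an error, which mirrors the intent that no case reaches a dataset without passing the Govern step.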
Template: dataset inventory
| Field | Required content | Example |
|---|---|---|
| dataset_id | Unique identifier that includes use case, set type and version | CONTACT-RAG-GOLDEN-v2026.06.30 |
| dataset_type | golden / challenge / adversarial / regression / synthetic / production_sample | regression |
| business_use_case | Bound to a business capability and workflow point | Contact-center RAG answering retail account fee policy questions |
| risk_tier | low / medium / high / prohibited-boundary | high |
| owner | business owner, product owner, technical owner | Retail Servicing PO + AI Platform Architect |
| source_mix | Proportion of real, synthetic, policy-authored and incident-derived cases | 45% real, 35% synthetic, 20% incident-derived |
| active_scope | Products, channels, languages, regions, customer types, workflow points | mobile + call center, English / Spanish, checking and cards |
| excluded_scope | Scope that is explicitly not covered | wealth advisory, commercial banking, legal complaints |
| label_authority | Who is authorized to confirm expected behavior and severity | servicing SME + compliance reviewer |
| privacy_classification | PII / confidential / de-identified / synthetic | de-identified with restricted access |
| retention_rule | Active, archive and purge rules | active 18 months, archived 5 years for release evidence |
| release_usage | Which release gates use the dataset | prompt, RAG corpus, retriever, model route, tool changes |
| evidence_location | Archive location of manifest, run, approval and exception | evidence binder path or GRC object id |
| last_reviewed | Date of the most recent portfolio review | 2026-06-30 |
| next_review_trigger | Date or event trigger | quarterly review or complaint spike |
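A few fields of the inventory template lend themselves to machine checks. The sketch below covers only a subset of the fields and assumes that `source_mix` is stored as integer percentages; the class name `DatasetInventoryEntry` and the method `validation_errors` are hypothetical, not part of any existing registry.

```python
from dataclasses import dataclass

# Allowed values copied from the inventory template.
DATASET_TYPES = {
    "golden", "challenge", "adversarial",
    "regression", "synthetic", "production_sample",
}
RISK_TIERS = {"low", "medium", "high", "prohibited-boundary"}


@dataclass(frozen=True)
class DatasetInventoryEntry:
    dataset_id: str
    dataset_type: str
    risk_tier: str
    source_mix: dict[str, int]  # assumption: percentages per source type

    def validation_errors(self) -> list[str]:
        """Return human-readable problems; an empty list means the entry is valid."""
        errors = []
        if self.dataset_type not in DATASET_TYPES:
            errors.append(f"unknown dataset_type: {self.dataset_type}")
        if self.risk_tier not in RISK_TIERS:
            errors.append(f"unknown risk_tier: {self.risk_tier}")
        if sum(self.source_mix.values()) != 100:
            errors.append("source_mix percentages must sum to 100")
        return errors


entry = DatasetInventoryEntry(
    dataset_id="CONTACT-RAG-GOLDEN-v2026.06.30",
    dataset_type="golden",
    risk_tier="high",
    source_mix={"real": 45, "synthetic": 35, "incident-derived": 20},
)
print(entry.validation_errors())
```

Checks like these would typically run when an entry is registered or changed, so that a portfolio review starts from structurally valid records.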
Template: coverage matrix
| Coverage dimension | Required slices | Current coverage signal | Gap decision |
|---|---|---|---|
| Product | checking, credit card, loan, mortgage, prepaid | card cases overrepresented | add checking fee and loan servicing cases |
| Channel | mobile, web, branch, call center, back office | mobile and call center covered | create branch handoff cases |
| Language | English, Spanish, bilingual, low-literacy phrasing | English strong, Spanish thin | promote Spanish synthetic + SME-authored cases |
| Customer segment | new customer, long-tenured, vulnerable customer, thin-file, small business where applicable | vulnerable customer not explicit | add escalation and accommodation cases |
| Workflow point | read, summarize, recommend, draft, act | read/summarize covered | add draft and tool-approval cases before agent release |
| Failure mode | unsupported claim, wrong citation, under-escalation, unauthorized action, PII leakage, over-refusal | wrong citation and unsupported claim covered | adversarial set needs tool and PII cases |
| Risk severity | critical, high, medium, low | high and medium sufficient | critical cases need hard gate definition |
| Data source | policy document, transaction fact, case note, customer message, tool observation | policy documents strong | add tool observation and stale-policy cases |
| Time sensitivity | current policy, outdated policy, effective-date conflict | current policy only | add effective-date challenge cases |
| Evidence quality | complete, missing field, conflicting source, ambiguous source | complete evidence dominant | add missing and conflicting source cases |
How to use:
1. Define release scope first.
2. Mark each required slice as covered, thin, missing or not in scope.
3. Convert missing high-risk slices into candidate cases.
4. Do not approve a high-impact release while critical slices are missing, unless the gap has explicit risk acceptance.
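Steps 2 to 4 can be sketched as a single function. The sketch assumes slice statuses are recorded as plain strings, treats "high-risk" and "critical" slices as one set for simplicity, and counts a critical slice with no recorded status as missing; `review_coverage` and its parameters are hypothetical names.

```python
VALID_STATUSES = {"covered", "thin", "missing", "not_in_scope"}


def review_coverage(slice_status, critical_slices, accepted_gaps=frozenset()):
    """Apply steps 2-4 of the coverage review to one release scope.

    slice_status: slice name -> one of VALID_STATUSES
    critical_slices: slices that must be covered for this release
    accepted_gaps: critical slices with explicit risk acceptance
    """
    unknown = {s for s, status in slice_status.items() if status not in VALID_STATUSES}
    if unknown:
        raise ValueError(f"invalid status for slices: {sorted(unknown)}")

    # Assumption: a critical slice without any recorded status counts as missing.
    candidates = sorted(
        s for s in critical_slices if slice_status.get(s, "missing") == "missing"
    )
    blockers = [s for s in candidates if s not in accepted_gaps]
    return {
        "candidate_slices": candidates,  # step 3: convert into candidate cases
        "blockers": blockers,            # step 4: missing without risk acceptance
        "approvable": not blockers,
    }


result = review_coverage(
    {"Spanish": "thin", "branch": "missing", "PII leakage": "missing"},
    critical_slices={"PII leakage", "tool approval"},
)
print(result)
```

Thin slices are deliberately not treated as blockers here; whether a thin critical slice should block is a judgment the source leaves to the gap decision.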
Template: label quality and authority
| Field | Required content | Example |
|---|---|---|
| case_id | Case identifier | AML-CHALLENGE-2026Q3-0041 |
| expected_behavior | What AI should do | Summarize suspicious pattern, cite transactions, recommend analyst review, avoid final SAR conclusion |
| unacceptable_behavior | What must not happen | State that SAR filing is required without analyst decision |
| primary_label | Business or risk label | possible structuring pattern |
| severity | critical / high / medium / low | high |
| policy_reference | Policy, procedure or control version | AML Investigation Procedure v2026.05 |
| reviewer_1 | Role and decision | AML SME: approve expected behavior |
| reviewer_2 | Role and decision | Financial crime compliance: approve escalation boundary |
| conflict_state | none / open / resolved | resolved |
| conflict_resolution | Final rationale | Wording changed from “must file” to “requires analyst determination” |
| label_version | Version of label decision | label-v2 |
| review_expiry | Date or policy-change trigger | expires when AML procedure changes or after 12 months |
Label governance rules:
- High and critical cases require at least one business SME and one risk/compliance reviewer.
- Label changes require versioning and impact analysis on historical runs.
- A case with open label conflict can be used for exploration but not as a release blocker.
- Synthetic cases need both realism review and expected-behavior review.
- A label can expire when policy, product, workflow or customer communication rules change.
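Three of the rules above can be checked mechanically before a case is used as a release blocker. The following is a minimal sketch: the dictionary keys (`approved_reviewer_roles`, `reviews`, `source_type`) and the role names are assumptions for illustration, not fields defined by the template.

```python
def label_issues(case: dict) -> list[str]:
    """Return reasons why a case should not yet act as a release blocker."""
    issues = []
    roles = set(case.get("approved_reviewer_roles", []))

    # Rule: high and critical cases need a business SME and a risk/compliance reviewer.
    if case["severity"] in {"high", "critical"}:
        if "business_sme" not in roles:
            issues.append("missing business SME approval")
        if not roles & {"risk", "compliance"}:
            issues.append("missing risk/compliance approval")

    # Rule: an open label conflict limits the case to exploration.
    if case.get("conflict_state") == "open":
        issues.append("open label conflict: exploration only")

    # Rule: synthetic cases need realism and expected-behavior reviews.
    if case.get("source_type") == "synthetic":
        done = set(case.get("reviews", []))
        for review in ("realism", "expected_behavior"):
            if review not in done:
                issues.append(f"synthetic case missing {review} review")
    return issues


ok_case = {
    "severity": "high",
    "approved_reviewer_roles": ["business_sme", "compliance"],
    "conflict_state": "resolved",
    "source_type": "real",
}
print(label_issues(ok_case))
```

Label versioning and expiry are not covered by this sketch because they depend on how historical runs and policy versions are stored.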
Template: promotion gate
| Gate item | Pass evidence | Fail example | Decision |
|---|---|---|---|
| Approved use mapping | Case maps to registered AI use case and workflow point | Case belongs to wealth advice while release is retail servicing | reject or move to another use case |
| Source lineage | Source system, sampling window, generation rule or incident id recorded | Screenshot copied from unknown location | hold |
| Privacy and retention | PII handling, access group, retention rule documented | Raw customer data stored in open dev folder | reject |
| Expected behavior | Clear desired action, refusal, citation or escalation | “Answer correctly” without oracle | hold |
| Unacceptable behavior | Failure modes and severity defined | No definition of critical failure | hold |
| Label authority | Required reviewers approved label and severity | Only model developer approved high-risk label | hold |
| Coverage contribution | Case fills a known gap or protects a known behavior | Duplicate of 40 existing easy cases | do not promote |
| Dataset membership | Golden/challenge/adversarial/regression/synthetic role selected | Same case used for all purposes with no role | split or classify |
| Release gate impact | Threshold and blocker status defined | High-risk case has no decision effect | hold |
| Evidence readiness | Manifest, hash, version, approval and retention captured | Case exists only in local spreadsheet | hold |
Promotion decisions:
| Decision | Meaning |
|---|---|
| promote_to_golden | Stable core case, strong label authority, expected to remain comparable |
| promote_to_challenge | Complex boundary case, useful for risk discussion and slice analysis |
| promote_to_adversarial | Security, abuse, leakage, prompt injection or excessive agency case |
| promote_to_regression | Historical failure, incident, complaint, defect or previously fixed issue |
| monitor_only | Useful production sample but not stable enough for release gate |
| reject | Out of scope, unsafe to use, poor lineage or not decision-relevant |
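The promotion gate can be sketched as a lookup from each failed gate item to the decision listed in the gate table. The snake_case keys and the function `gate_review` are hypothetical; the sketch reports every failure instead of picking one overall outcome, because the source does not define a precedence between hold, reject and the other decisions.

```python
# Gate item -> decision the table prescribes when the item fails.
FAIL_DECISION = {
    "approved_use_mapping": "reject or move to another use case",
    "source_lineage": "hold",
    "privacy_and_retention": "reject",
    "expected_behavior": "hold",
    "unacceptable_behavior": "hold",
    "label_authority": "hold",
    "coverage_contribution": "do not promote",
    "dataset_membership": "split or classify",
    "release_gate_impact": "hold",
    "evidence_readiness": "hold",
}


def gate_review(results: dict[str, bool]) -> dict:
    """Check one candidate case against all ten gate items.

    results: gate item -> True if pass evidence exists, False otherwise.
    """
    unassessed = [item for item in FAIL_DECISION if item not in results]
    if unassessed:
        raise ValueError(f"gate items not assessed: {unassessed}")
    failures = {
        item: decision
        for item, decision in FAIL_DECISION.items()
        if not results[item]
    }
    return {"promotable": not failures, "failures": failures}


all_pass = {item: True for item in FAIL_DECISION}
print(gate_review(all_pass))
```

Which promotion decision a passing case receives (golden, challenge, adversarial, regression or monitor-only) remains a reviewer judgment and is intentionally left out.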
Template: evidence packet
| Evidence object | What it proves | Owner |
|---|---|---|
| Dataset inventory snapshot | Which datasets were in scope and who owns them | Product manager |
| Dataset manifest and hash | Exact version used for release run | AI platform |
| Source lineage report | Where cases came from and how they were processed | Data governance / AI platform |
| Privacy and retention decision | Data use, access and retention are approved | Privacy / data governance |
| Coverage matrix | Release scope has adequate dataset coverage or known accepted gaps | BA / PM |
| Label authority log | Expected behavior and severity were approved by the right roles | SME / risk |
| Eval run report | Component version results against dataset versions | AI engineering |
| Slice failure report | Which products, segments, languages, channels or failure modes failed | AI platform / PM |
| Exception and risk acceptance | Open gaps, compensating controls, expiry and owner | Business owner / risk |
| Release decision memo | Go / no-go / limited release / rollback decision | Release manager |
| Monitoring and failure-mining plan | How production signals will refresh datasets | Operations / observability |
| Archive and access record | Evidence is retained and accessible to authorized reviewers | GRC / records management |
Evidence packet rule:
If a dataset result influences a release decision, its manifest, lineage, labels, run result and approval must be retained together.
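One way to make the manifest and hash verifiable is to hash a canonical serialization of the cases, and to check that the five objects named in the rule are present together. The sketch below uses only the Python standard library; `build_manifest`, `missing_evidence` and the packet keys are hypothetical names, and JSON-serializable cases with a `case_id` field are assumed.

```python
import hashlib
import json

# The five objects the evidence packet rule requires to be retained together.
REQUIRED_EVIDENCE = ("manifest", "lineage", "labels", "run_result", "approval")


def build_manifest(dataset_id: str, cases: list[dict]) -> dict:
    """Hash the cases in a canonical form so the same content gives the same digest."""
    ordered = sorted(cases, key=lambda case: case["case_id"])
    canonical = json.dumps(
        ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"dataset_id": dataset_id, "case_count": len(ordered), "sha256": digest}


def missing_evidence(packet: dict) -> list[str]:
    """List required evidence objects that are absent or empty in the packet."""
    return [name for name in REQUIRED_EVIDENCE if not packet.get(name)]


cases = [
    {"case_id": "AML-CHALLENGE-2026Q3-0041", "label_version": "label-v2"},
    {"case_id": "AML-CHALLENGE-2026Q3-0007", "label_version": "label-v1"},
]
manifest = build_manifest("AML-CHALLENGE-v1", cases)
print(manifest["case_count"], missing_evidence({"manifest": manifest}))
```

Sorting the cases and keys before hashing means that reordering a file does not change the digest, while any change to a label or field does.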
PM/BA/architecture questions
PM questions
| Question | Strong answer should include |
|---|---|
| What business decision does this dataset support? | release, scale, rollback, model migration, policy update or monitoring |
| Which customer harm is this dataset designed to prevent? | wrong denial, missed AML escalation, bad payment advice, privacy breach, complaint mishandling |
| What is the minimum viable golden set for pilot? | core journeys, critical failures, business-owner-approved labels, privacy-cleared cases |
| Which gaps are acceptable for pilot but not for scale? | low-volume channels, secondary products, lower-risk languages, manual fallback available |
| How will complaints and production failures become regression cases? | intake trigger, owner, review SLA, promotion gate and next release usage |
BA / CBAP questions
| Question | Strong answer should include |
|---|---|
| What workflow state does each case represent? | actor, system state, available evidence, decision boundary, handoff point |
| What is the expected behavior, not just expected output? | answer, cite, refuse, escalate, draft, ask for missing data, request approval |
| Which policies and procedures define the label? | policy version, owner, effective date, exception handling |
| What makes a case critical, high, medium or low severity? | customer impact, regulatory impact, financial impact, operational reversibility |
| How do we handle conflicting SME labels? | conflict log, escalation path, final authority, label version and impact analysis |
Architecture questions
| Question | Strong answer should include |
|---|---|
| Where is the case registry and how is it versioned? | immutable dataset release, hash, API, access control, lineage graph |
| How does the eval runner select datasets for a change? | model/prompt/RAG/tool/policy/workflow impact graph |
| How do we prevent test set contamination? | role-based access, training exclusion, holdout protection, run audit |
| How are production traces converted into eval cases? | sampling, privacy gate, de-identification, SME review, promotion gate |
| How does observability connect to dataset lifecycle? | traces, metrics, logs, complaint signals, override signals, incident RCA |
| How is retention enforced? | retention metadata, legal hold flag, archive state, purge record |
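As an illustration of the dataset selection question, the eval runner could read the `release_usage` field from the dataset inventory and select every dataset whose declared usage intersects the components changed in a release. This is a simplified stand-in for a full impact graph; `select_datasets`, the inventory entries and the component names are hypothetical.

```python
def select_datasets(inventory: list[dict], changed_components: set[str]) -> list[str]:
    """Return dataset ids that must be rerun for the given change.

    A dataset is selected when its release_usage lists at least one
    of the changed components.
    """
    return sorted(
        entry["dataset_id"]
        for entry in inventory
        if changed_components & set(entry["release_usage"])
    )


# Hypothetical inventory entries for illustration only.
inventory = [
    {"dataset_id": "CONTACT-RAG-GOLDEN-v1", "release_usage": ["prompt", "retriever", "model"]},
    {"dataset_id": "CONTACT-RAG-ADVERSARIAL-v1", "release_usage": ["prompt", "tool"]},
    {"dataset_id": "CONTACT-RAG-REGRESSION-v1", "release_usage": ["retriever"]},
]
print(select_datasets(inventory, {"retriever"}))
```

A real impact graph would also follow indirect dependencies, for example a policy change that alters labels and therefore every dataset that cites that policy version.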
Release checklist
Use this checklist when an AI release depends on eval dataset evidence.
| Check | Pass condition |
|---|---|
| Release scope defined | Use case, users, workflow point, products, channels, language, customer impact and risk tier are explicit |
| Dataset versions frozen | Golden, challenge, adversarial and regression versions are immutable for the release run |
| Required datasets selected | Change impact analysis selected required datasets for model, prompt, RAG, tool, policy and workflow changes |
| Critical slices covered | High-risk product, channel, language, customer and failure-mode slices are present or formally accepted as a gap |
| Privacy cleared | Real or production-derived cases have approved use, access and retention |
| Labels approved | Expected behavior, unacceptable behavior, severity and policy reference have authorized review |
| Synthetic cases reviewed | Synthetic cases have realism, factual consistency and oracle checks |
| Hard gates configured | Critical violations, unauthorized actions, PII leakage, wrong customer-facing decisions and under-escalation have blocker thresholds |
| Eval run reproducible | Dataset manifest, component versions, evaluator versions and run config are captured |
| Slice results reviewed | Aggregate score, critical failures, slice failures and regression failures are all reviewed |
| Exceptions documented | Open issues have owner, expiry, compensating control and approval |
| Monitoring connected | Production trace mining, complaint review, override review and drift checks are mapped to dataset update workflow |
| Evidence archived | Manifest, hash, reports, approvals, exceptions and retention decisions are stored together |
| Rollback criteria defined | Dataset-related failures that trigger rollback, ramp pause or manual fallback are explicit |
Release decision language:
| Decision | Use when |
|---|---|
| Go | Required datasets passed, no critical failures, evidence complete |
| Limited go | Known gaps accepted with limited population, monitoring and expiry |
| No-go | Critical failures, insufficient coverage, unresolved label conflicts or privacy blockers |
| Rollback | Production signals breach gate criteria or escaped failure repeats |
| Rework dataset | Dataset cannot support the decision because coverage, labels, lineage or evidence are weak |
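The pre-release part of this decision language can be sketched as a function over a few gate inputs. The ordering of the checks is an assumption, since the table does not define precedence, and Rollback is excluded because it depends on production signals after release. `GateInputs` and `release_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    required_datasets_passed: bool
    critical_failures: int
    unaccepted_coverage_gaps: int
    accepted_coverage_gaps: int
    open_label_conflicts: int
    privacy_blockers: int
    evidence_complete: bool


def release_decision(g: GateInputs) -> str:
    """Map gate inputs to the decision vocabulary in the table (pre-release only)."""
    # No-go conditions come first: any one of them blocks the release.
    if (g.critical_failures or g.unaccepted_coverage_gaps
            or g.open_label_conflicts or g.privacy_blockers):
        return "No-go"
    # Assumption: incomplete evidence means the dataset cannot support a decision.
    if not g.evidence_complete:
        return "Rework dataset"
    # Assumption: a failed required dataset without critical failures is still No-go.
    if not g.required_datasets_passed:
        return "No-go"
    if g.accepted_coverage_gaps:
        return "Limited go"
    return "Go"


clean = GateInputs(True, 0, 0, 0, 0, 0, True)
print(release_decision(clean))
```

A function like this is a consistency aid for the release gate memo, not a substitute for the approvals listed in the checklist.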
Executive narrative
One-page storyline
We are not approving this AI release only because model quality improved.
We are approving it because the approved use case was tested against governed datasets:
1. Golden set protected core customer and employee journeys.
2. Challenge set tested complex financial retail edge cases.
3. Adversarial set tested injection, leakage and excessive agency risks.
4. Regression set verified that prior incidents and defects did not return.
5. Synthetic and production-derived cases expanded coverage while respecting privacy and retention.
Each case has lineage, label authority, expected behavior, severity and dataset membership.
The release run is tied to immutable dataset versions and component versions.
Open gaps are documented with owner, expiry and monitoring.
Production complaints, overrides and incidents will feed the next dataset cycle.
Management message
For executives:
The dataset lifecycle is the control system behind AI release quality. It gives us a defensible way to scale AI without relying on anecdotal demos or average scores. We can show which customer journeys and risk scenarios were covered, which failures block release, where gaps remain, and how production evidence improves the next version.
For risk and audit:
The evidence packet links dataset inventory, lineage, labels, privacy decisions, eval runs, exceptions and approvals. This allows reviewers to reconstruct why a release was approved and whether controls operated as designed.
For product and architecture:
The test data factory makes AI delivery faster over time. Once cases, labels and gates are reusable, each model, prompt, RAG, tool or workflow change can be assessed with a consistent decision framework.
Interview drills
Drill 1: Explain golden set vs regression set
Strong answer:
Golden set protects stable core behavior and allows comparison across versions. Regression set protects against known failures returning, often sourced from incidents, defects, complaints or production overrides. A case can be important to both, but the governance question is different: golden asks “does the system still perform approved core tasks,” while regression asks “did we reintroduce a failure we already learned from.”
Drill 2: Synthetic vs real cases
Strong answer:
Real cases provide operational realism, but they carry privacy, retention, access and sampling-bias issues. Synthetic cases help cover rare, adversarial and combinatorial scenarios, but they need realism checks and a reliable oracle. I would use both: real and production-derived samples for distribution and noise, synthetic and mutated samples for low-frequency and high-risk coverage.
Drill 3: What would block release?
Strong answer:
I would block release for critical failures such as unauthorized tool action, PII leakage, wrong customer-facing credit or payment advice, under-escalation of high-risk AML or complaint cases, unresolved label conflict in a release blocker, missing privacy approval for real data, or insufficient coverage of a high-impact slice in the approved rollout scope.
Drill 4: How do you make it audit-ready?
Strong answer:
I would retain dataset manifest, version hash, source lineage, privacy decision, label authority log, coverage matrix, eval run report, slice failures, exceptions, approvals and monitoring trigger in one evidence packet. The key is not having documents; it is being able to reconstruct which dataset version supported which release decision and who accepted which residual risk.
Drill 5: How does this apply to an agent?
Strong answer:
For an agent, dataset cases must include tool authority, approval requirement, tool arguments, expected side effects and rollback expectation. The adversarial set should test prompt injection, excessive agency and data leakage. The regression set should include any previous unauthorized action or incorrect workflow update. Agent release should not rely on natural-language answer quality alone.
Drill 6: What would you ask in a CTO interview?
Strong answer:
I would ask whether eval datasets are treated as platform assets: versioned, access-controlled, tied to release gates and connected to production observability. Then I would ask how dataset impact analysis works when model, prompt, RAG, tool or workflow changes. If the answer is “teams keep their own spreadsheets,” the roadmap should include a case registry, lineage graph, promotion gates and automated evidence packet.
Source anchors
| Source | Link | Playbook use |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Risk lifecycle, measurement, management and governance language |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | GenAI-specific risk cases for adversarial and governance sets |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/81230.html | Management-system operating model, ownership, performance evaluation and improvement |
| ISO/IEC/IEEE 29148 Requirements Engineering | https://www.iso.org/standard/72089.html | Requirements-to-eval traceability and acceptance behavior |
| ISO/IEC/IEEE 42010 Architecture Description | https://www.iso.org/standard/74393.html | Architecture viewpoints and stakeholder concerns |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Prompt injection, sensitive information disclosure and excessive agency test design |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Production traces, metrics and logs feeding failure mining and evidence |