AI Eval Dataset Lifecycle / Golden Set / Test Data Factory Playbook

386AI_EVAL_DATASET_LIFECYCLE_GOLDEN_SET_TEST_DATA_FACTORY_PLAYBOOK.md

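Positioning: Written for senior CBAP practitioners, retail financial services PMs, AI Product Architects, Solution Architects, AI Governance and Model Risk roles. It upgrades the eval dataset from a testing attachment to a governable, reusable and auditable supply chain of AI quality assets.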

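Scope of use: This playbook applies to AI products such as KYC, AML, credit, payments, contact center, complaints, RAG, agent, copilot and workflow automation. It does not replace the formal judgment of Legal, Compliance, Privacy, Model Risk, Information Security, Internal Audit or business management.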


Purpose and when to use

Purpose

This playbook establishes an executable operating model for AI eval datasets:

dataset inventory
  -> case intake
  -> privacy and retention decision
  -> label governance
  -> coverage matrix
  -> golden / challenge / adversarial / regression promotion
  -> test data factory
  -> release gate
  -> evidence packet
  -> production feedback loop

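The goal is not to "prepare a few more test samples" but to let the team answer:

  • Which datasets support this release decision?
  • Which high-risk scenarios are hard gates?
  • How are synthetic cases and real cases combined?
  • Who adjudicates labels, and against which policy version?
  • How do coverage gaps, coverage drift and historical failures feed into the next dataset version?
  • After a release, how can audit trace the dataset version, run result, approval and exception?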

When to use

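| Trigger | How to use |
| --- | --- |
| New AI use case kickoff | Define the dataset portfolio and case schema first, then define the model or prompt evaluation |
| Model / prompt / RAG / tool / workflow change | Run a dataset impact analysis and select the datasets that must be rerun |
| High-risk retail financial services release | Produce the promotion gate and evidence packet |
| Spike in production complaints, incidents or human overrides | Convert failure traces into candidate regression cases |
| Coverage drift or policy change | Re-review labels, retire old cases and add to the challenge set |
| Regulatory, internal audit or model risk review | Use the evidence packet to demonstrate the data, label, run and approval chain |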

Operating model

1. Roles and decision rights

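| Role | Decision right | Key responsibilities |
| --- | --- | --- |
| Business owner | Approve business use, rollout scope and residual risk | Define customer impact, business boundaries and release appetite |
| Product manager | Maintain the dataset roadmap and release gate narrative | Connect business outcomes, failure modes, metrics, gates and evidence |
| Senior BA / CBAP | Manage requirements, processes, policies, label definitions and the coverage matrix | Turn business rules, exceptions, customer journeys and control points into eval cases |
| AI architect | Design the case registry, lineage, factory, eval runner and evidence integration | Ensure versioning, dependencies, observability and security boundaries are implementable |
| Data governance / privacy | Approve data sources, de-identification, purpose, access, retention and deletion | Control the boundary for real cases entering eval |
| SME / operations lead | Adjudicate expected behavior, labels and failure severity | Maintain business realism and operational feasibility |
| Model risk / independent review | Challenge coverage, label stability, gate thresholds and evidence sufficiency | Provide effective challenge on high-impact use cases |
| Security | Review the adversarial set, prompt injection, tool authority and PII leakage | Turn security threats into tests and hard gates |
| Release manager | Bind dataset version, run result, approval and rollback | Ensure the release process does not bypass the dataset gate |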

2. Cadence

| Cadence | Activities | Output |
| --- | --- | --- |
| Every release | dataset impact analysis, required reruns, evidence packet | release gate memo |
| Weekly | candidate case triage, label conflict review, failure mining | updated candidate backlog |
| Monthly | coverage drift review, production sample review, retirement candidates | coverage and drift report |
| Quarterly | dataset portfolio review, access review, retention review, management summary | dataset governance report |
| After a major incident | incident-to-regression conversion, root cause cases, gate strengthening | regression set update |

3. Dataset lifecycle workflow

1. Discover
   Find cases from production traces, SME workshops, policy changes, incidents, complaints and synthetic scenarios.

2. Classify
   Tag the use case, source type, risk tier, customer impact, privacy state and failure mode.

3. Govern
   Pass the privacy, retention, label authority, coverage and promotion gates.

4. Promote
   Assign the case to the golden, challenge, adversarial, regression, synthetic or monitor-only set.

5. Execute
   Bind the dataset version, component version, evaluator, threshold and release decision.

6. Evidence
   Archive the manifest, hash, lineage, run result, slice result, exception, approval and retention decision.

7. Improve
   Use production failures, drift, complaints, overrides and policy changes to add cases, relabel, retire cases or strengthen gates.

Template: dataset inventory

| Field | Required content | Example |
| --- | --- | --- |
| dataset_id | Unique ID that includes use case, set type and version | CONTACT-RAG-GOLDEN-v2026.06.30 |
| dataset_type | golden / challenge / adversarial / regression / synthetic / production_sample | regression |
| business_use_case | Business capability and workflow point the dataset is bound to | Contact center RAG answering retail account fee policy questions |
| risk_tier | low / medium / high / prohibited-boundary | high |
| owner | business owner, product owner, technical owner | Retail Servicing PO + AI Platform Architect |
| source_mix | Proportion of real, synthetic, policy-authored and incident-derived cases | 45% real, 35% synthetic, 20% incident-derived |
| active_scope | Product, channel, language, region, customer type, workflow point | mobile + call center, English / Spanish, checking and cards |
| excluded_scope | Scope that is explicitly not covered | wealth advisory, commercial banking, legal complaints |
| label_authority | Who has authority to confirm expected behavior and severity | servicing SME + compliance reviewer |
| privacy_classification | PII / confidential / de-identified / synthetic | de-identified with restricted access |
| retention_rule | Active, archive and purge rules | active 18 months, archived 5 years for release evidence |
| release_usage | Which release gates use the dataset | prompt, RAG corpus, retriever, model route, tool changes |
| evidence_location | Archive location for manifest, run, approval and exception | evidence binder path or GRC object id |
| last_reviewed | Date of the most recent portfolio review | 2026-06-30 |
| next_review_trigger | Date or event trigger | quarterly review or complaint spike |
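
A minimal sketch of how one inventory row could be held and checked in code. The field names come from the template; the `DatasetInventoryEntry` class, the `validate` function, the specific checks and the `grc://` location are illustrative assumptions rather than a prescribed schema, and several template fields are omitted for brevity.

```python
from dataclasses import dataclass, replace
from datetime import date

# Allowed values copied from the dataset_type and risk_tier rows of the template.
ALLOWED_TYPES = {"golden", "challenge", "adversarial", "regression",
                 "synthetic", "production_sample"}
ALLOWED_TIERS = {"low", "medium", "high", "prohibited-boundary"}


@dataclass(frozen=True)
class DatasetInventoryEntry:
    dataset_id: str
    dataset_type: str
    risk_tier: str
    owner: str
    source_mix: dict          # source kind -> percent of cases
    label_authority: str
    evidence_location: str
    last_reviewed: date


def validate(entry: DatasetInventoryEntry) -> list:
    """Return a list of problems; an empty list means the row passed these checks."""
    problems = []
    if entry.dataset_type not in ALLOWED_TYPES:
        problems.append(f"unknown dataset_type: {entry.dataset_type}")
    if entry.risk_tier not in ALLOWED_TIERS:
        problems.append(f"unknown risk_tier: {entry.risk_tier}")
    if sum(entry.source_mix.values()) != 100:
        problems.append("source_mix percentages do not sum to 100")
    for name in ("owner", "label_authority", "evidence_location"):
        if not getattr(entry, name).strip():
            problems.append(f"{name} is empty")
    return problems


entry = DatasetInventoryEntry(
    dataset_id="CONTACT-RAG-GOLDEN-v2026.06.30",
    dataset_type="golden",
    risk_tier="high",
    owner="Retail Servicing PO + AI Platform Architect",
    source_mix={"real": 45, "synthetic": 35, "incident-derived": 20},
    label_authority="servicing SME + compliance reviewer",
    evidence_location="grc://example/evidence-binder",  # hypothetical location
    last_reviewed=date(2026, 6, 30),
)
print(validate(entry))
print(validate(replace(entry, dataset_type="misc", source_mix={"real": 50})))
```

Returning a list of problems instead of raising on the first one lets a portfolio review see every defect in a row at once.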

Template: coverage matrix

| Coverage dimension | Required slices | Current coverage signal | Gap decision |
| --- | --- | --- | --- |
| Product | checking, credit card, loan, mortgage, prepaid | card cases overrepresented | add checking fee and loan servicing cases |
| Channel | mobile, web, branch, call center, back office | mobile and call center covered | create branch handoff cases |
| Language | English, Spanish, bilingual, low-literacy phrasing | English strong, Spanish thin | promote Spanish synthetic + SME-authored cases |
| Customer segment | new customer, long-tenured, vulnerable customer, thin-file, small business where applicable | vulnerable customer not explicit | add escalation and accommodation cases |
| Workflow point | read, summarize, recommend, draft, act | read/summarize covered | add draft and tool-approval cases before agent release |
| Failure mode | unsupported claim, wrong citation, under-escalation, unauthorized action, PII leakage, over-refusal | wrong citation and unsupported claim covered | adversarial set needs tool and PII cases |
| Risk severity | critical, high, medium, low | high and medium sufficient | critical cases need hard gate definition |
| Data source | policy document, transaction fact, case note, customer message, tool observation | policy documents strong | add tool observation and stale-policy cases |
| Time sensitivity | current policy, outdated policy, effective-date conflict | current policy only | add effective-date challenge cases |
| Evidence quality | complete, missing field, conflicting source, ambiguous source | complete evidence dominant | add missing and conflicting source cases |

How to use:

1. Define release scope first.
2. Mark each required slice as covered, thin, missing or not in scope.
3. Convert missing high-risk slices into candidate cases.
4. Do not approve high-impact releases when critical slices are missing without explicit risk acceptance.
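
Steps 2 to 4 can be sketched as a small check over the marked slices. This is an illustration under stated assumptions: the `CoverageSlice` class, the two helper functions and the severity assigned to each sample slice are hypothetical, and "high-risk" is read here as severity `critical` or `high`.

```python
from dataclasses import dataclass


@dataclass
class CoverageSlice:
    dimension: str
    name: str
    status: str                # "covered", "thin", "missing" or "not_in_scope"
    severity: str = "medium"   # "critical", "high", "medium" or "low"
    risk_accepted: bool = False


def candidate_slices(slices):
    """Step 3: missing high-risk slices that should become candidate cases."""
    return [s for s in slices
            if s.status == "missing" and s.severity in {"critical", "high"}]


def approval_blockers(slices):
    """Step 4: missing critical slices without explicit risk acceptance."""
    return [s for s in slices
            if s.status == "missing" and s.severity == "critical"
            and not s.risk_accepted]


# Sample marks; the severities are made up for the example.
matrix = [
    CoverageSlice("Language", "Spanish", "thin", "high"),
    CoverageSlice("Customer segment", "vulnerable customer", "missing", "critical"),
    CoverageSlice("Channel", "branch", "missing", "medium"),
    CoverageSlice("Failure mode", "PII leakage", "missing", "critical",
                  risk_accepted=True),
    CoverageSlice("Product", "mortgage", "not_in_scope"),
]
print([s.name for s in candidate_slices(matrix)])
print([s.name for s in approval_blockers(matrix)])
```

With these sample marks, both missing critical slices become candidates, but only the one without a recorded risk acceptance blocks approval.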

Template: label quality and authority

| Field | Required content | Example |
| --- | --- | --- |
| case_id | Case identifier | AML-CHALLENGE-2026Q3-0041 |
| expected_behavior | What AI should do | Summarize suspicious pattern, cite transactions, recommend analyst review, avoid final SAR conclusion |
| unacceptable_behavior | What must not happen | State that SAR filing is required without analyst decision |
| primary_label | Business or risk label | possible structuring pattern |
| severity | critical / high / medium / low | high |
| policy_reference | Policy, procedure or control version | AML Investigation Procedure v2026.05 |
| reviewer_1 | Role and decision | AML SME: approve expected behavior |
| reviewer_2 | Role and decision | Financial crime compliance: approve escalation boundary |
| conflict_state | none / open / resolved | resolved |
| conflict_resolution | Final rationale | Wording changed from “must file” to “requires analyst determination” |
| label_version | Version of label decision | label-v2 |
| review_expiry | Date or policy-change trigger | expires when AML procedure changes or after 12 months |

Label governance rules:

  • High and critical cases require at least one business SME and one risk/compliance reviewer.
  • Label changes require versioning and impact analysis on historical runs.
  • A case with open label conflict can be used for exploration but not as a release blocker.
  • Synthetic cases need both realism review and expected-behavior review.
  • A label can expire when policy, product, workflow or customer communication rules change.
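
Three of the rules above (reviewer mix, open conflicts, synthetic reviews) lend themselves to an automated check. The sketch below is one possible shape, not a reference implementation: `LabelRecord`, `release_blocker_issues`, the role names `business_sme` and `risk_compliance`, and the case id `SYN-0001` are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed role and review names; map them to your own reviewer taxonomy.
REQUIRED_ROLES = {"business_sme", "risk_compliance"}
REQUIRED_SYNTHETIC_REVIEWS = {"realism", "expected_behavior"}


@dataclass
class LabelRecord:
    case_id: str
    severity: str                                      # critical / high / medium / low
    reviewer_roles: set = field(default_factory=set)
    conflict_state: str = "none"                       # none / open / resolved
    synthetic: bool = False
    reviews_done: set = field(default_factory=set)


def release_blocker_issues(label: LabelRecord) -> list:
    """Reasons the case may not act as a release blocker; empty means eligible."""
    issues = []
    if label.severity in {"high", "critical"}:
        missing = REQUIRED_ROLES - label.reviewer_roles
        if missing:
            issues.append("missing reviewers: " + ", ".join(sorted(missing)))
    if label.conflict_state == "open":
        issues.append("open label conflict: exploration only")
    if label.synthetic:
        missing = REQUIRED_SYNTHETIC_REVIEWS - label.reviews_done
        if missing:
            issues.append("missing synthetic reviews: " + ", ".join(sorted(missing)))
    return issues


ok = LabelRecord("AML-CHALLENGE-2026Q3-0041", "high",
                 {"business_sme", "risk_compliance"}, "resolved")
bad = LabelRecord("SYN-0001", "critical", {"business_sme"}, "open",
                  synthetic=True, reviews_done={"realism"})
print(release_blocker_issues(ok))
print(release_blocker_issues(bad))
```

Versioning and expiry are left out of the sketch; they would need the label history and policy calendar, which are outside a single record.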

Template: promotion gate

| Gate item | Pass evidence | Fail example | Decision |
| --- | --- | --- | --- |
| Approved use mapping | Case maps to registered AI use case and workflow point | Case belongs to wealth advice while release is retail servicing | reject or move to another use case |
| Source lineage | Source system, sampling window, generation rule or incident id recorded | Screenshot copied from unknown location | hold |
| Privacy and retention | PII handling, access group, retention rule documented | Raw customer data stored in open dev folder | reject |
| Expected behavior | Clear desired action, refusal, citation or escalation | “Answer correctly” without oracle | hold |
| Unacceptable behavior | Failure modes and severity defined | No definition of critical failure | hold |
| Label authority | Required reviewers approved label and severity | Only model developer approved high-risk label | hold |
| Coverage contribution | Case fills a known gap or protects a known behavior | Duplicate of 40 existing easy cases | do not promote |
| Dataset membership | Golden/challenge/adversarial/regression/synthetic role selected | Same case used for all purposes with no role | split or classify |
| Release gate impact | Threshold and blocker status defined | High-risk case has no decision effect | hold |
| Evidence readiness | Manifest, hash, version, approval and retention captured | Case exists only in local spreadsheet | hold |

Promotion decisions:

| Decision | Meaning |
| --- | --- |
| promote_to_golden | Stable core case, strong label authority, expected to remain comparable |
| promote_to_challenge | Complex boundary case, useful for risk discussion and slice analysis |
| promote_to_adversarial | Security, abuse, leakage, prompt injection or excessive agency case |
| promote_to_regression | Historical failure, incident, complaint, defect or previously fixed issue |
| monitor_only | Useful production sample but not stable enough for release gate |
| reject | Out of scope, unsafe to use, poor lineage or not decision-relevant |
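
The promotion gate can be sketched as a lookup from failed gate items to an outcome. The outcomes per gate item follow the Decision column of the gate table, with "reject or move to another use case" simplified to `reject`; the precedence applied when several items fail, the snake_case gate keys and the `evaluate_promotion` function are assumptions made for this example.

```python
# Outcome when a gate item fails, simplified from the promotion gate table.
FAIL_OUTCOME = {
    "approved_use_mapping": "reject",
    "source_lineage": "hold",
    "privacy_and_retention": "reject",
    "expected_behavior": "hold",
    "unacceptable_behavior": "hold",
    "label_authority": "hold",
    "coverage_contribution": "do_not_promote",
    "dataset_membership": "split_or_classify",
    "release_gate_impact": "hold",
    "evidence_readiness": "hold",
}

# Assumed precedence when several gate items fail: the most severe outcome wins.
PRECEDENCE = ["reject", "do_not_promote", "hold", "split_or_classify"]
TARGETS = {"golden", "challenge", "adversarial", "regression"}


def evaluate_promotion(gate_results: dict, target: str) -> str:
    """Return promote_to_<target> if every gate passed, else the failure outcome."""
    if target not in TARGETS:
        raise ValueError(f"unknown target set: {target}")
    # A gate item that is absent from the results is treated as failed.
    failed = [gate for gate in FAIL_OUTCOME if not gate_results.get(gate, False)]
    if not failed:
        return f"promote_to_{target}"
    outcomes = {FAIL_OUTCOME[gate] for gate in failed}
    return next(outcome for outcome in PRECEDENCE if outcome in outcomes)


all_pass = dict.fromkeys(FAIL_OUTCOME, True)
print(evaluate_promotion(all_pass, "regression"))
print(evaluate_promotion({**all_pass, "source_lineage": False}, "golden"))
print(evaluate_promotion({**all_pass, "source_lineage": False,
                          "privacy_and_retention": False}, "golden"))
```

Treating a missing gate result as a failure keeps the default conservative: a case cannot be promoted because someone forgot to record a check.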

Template: evidence packet

| Evidence object | What it proves | Owner |
| --- | --- | --- |
| Dataset inventory snapshot | Which datasets were in scope and who owns them | Product manager |
| Dataset manifest and hash | Exact version used for release run | AI platform |
| Source lineage report | Where cases came from and how they were processed | Data governance / AI platform |
| Privacy and retention decision | Data use, access and retention are approved | Privacy / data governance |
| Coverage matrix | Release scope has adequate dataset coverage or known accepted gaps | BA / PM |
| Label authority log | Expected behavior and severity were approved by the right roles | SME / risk |
| Eval run report | Component version results against dataset versions | AI engineering |
| Slice failure report | Which products, segments, languages, channels or failure modes failed | AI platform / PM |
| Exception and risk acceptance | Open gaps, compensating controls, expiry and owner | Business owner / risk |
| Release decision memo | Go / no-go / limited release / rollback decision | Release manager |
| Monitoring and failure-mining plan | How production signals will refresh datasets | Operations / observability |
| Archive and access record | Evidence is retained and accessible to authorized reviewers | GRC / records management |

Evidence packet rule:

If a dataset result influences a release decision, its manifest, lineage, labels, run result and approval must be retained together.
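
One way to make this rule checkable is to hash a canonical form of the dataset and refuse to archive a packet with missing parts. The sketch below assumes cases are plain dictionaries with a `case_id` key and uses SHA-256 over sorted JSON; the function names, packet keys and sample case ids are hypothetical.

```python
import hashlib
import json

# The five items the evidence packet rule requires to be retained together.
REQUIRED_EVIDENCE = ("manifest", "lineage", "labels", "run_result", "approval")


def manifest_hash(cases: list) -> str:
    """SHA-256 over canonical JSON, so case order does not change the hash."""
    ordered = sorted(cases, key=lambda case: case["case_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def missing_evidence(packet: dict) -> list:
    """Names of required evidence items that are absent or empty."""
    return [name for name in REQUIRED_EVIDENCE if not packet.get(name)]


cases = [
    {"case_id": "C-002", "label_version": "label-v1", "severity": "medium"},
    {"case_id": "C-001", "label_version": "label-v2", "severity": "high"},
]
packet = {
    "manifest": {"dataset_id": "CONTACT-RAG-GOLDEN-v2026.06.30",
                 "sha256": manifest_hash(cases)},
    "labels": "label-authority-log",
    "run_result": "eval-run-report",
}
print(missing_evidence(packet))  # lineage and approval are absent here
```

Because the hash covers label versions as well as case content, relabeling a case yields a new dataset version, which is what lets a reviewer tie a run result to the exact labels in force at the time.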

PM/BA/architecture questions

PM questions

| Question | Strong answer should include |
| --- | --- |
| What business decision does this dataset support? | release, scale, rollback, model migration, policy update or monitoring |
| Which customer harm is this dataset designed to prevent? | wrong denial, missed AML escalation, bad payment advice, privacy breach, complaint mishandling |
| What is the minimum viable golden set for pilot? | core journeys, critical failures, business-owner-approved labels, privacy-cleared cases |
| Which gaps are acceptable for pilot but not for scale? | low-volume channels, secondary products, lower-risk languages, manual fallback available |
| How will complaints and production failures become regression cases? | intake trigger, owner, review SLA, promotion gate and next release usage |

BA / CBAP questions

| Question | Strong answer should include |
| --- | --- |
| What workflow state does each case represent? | actor, system state, available evidence, decision boundary, handoff point |
| What is the expected behavior, not just expected output? | answer, cite, refuse, escalate, draft, ask for missing data, request approval |
| Which policies and procedures define the label? | policy version, owner, effective date, exception handling |
| What makes a case critical, high, medium or low severity? | customer impact, regulatory impact, financial impact, operational reversibility |
| How do we handle conflicting SME labels? | conflict log, escalation path, final authority, label version and impact analysis |

Architecture questions

| Question | Strong answer should include |
| --- | --- |
| Where is the case registry and how is it versioned? | immutable dataset release, hash, API, access control, lineage graph |
| How does the eval runner select datasets for a change? | model/prompt/RAG/tool/policy/workflow impact graph |
| How do we prevent test set contamination? | role-based access, training exclusion, holdout protection, run audit |
| How are production traces converted into eval cases? | sampling, privacy gate, de-identification, SME review, promotion gate |
| How does observability connect to dataset lifecycle? | traces, metrics, logs, complaint signals, override signals, incident RCA |
| How is retention enforced? | retention metadata, legal hold flag, archive state, purge record |
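
The impact graph named in the dataset-selection answer can be as simple as a mapping from change type to the dataset types that must be rerun. The mapping below is entirely hypothetical and only illustrates the mechanism; a real graph would come from the team's own dataset impact analysis. `required_datasets` is likewise an assumed name.

```python
ALL_SETS = {"golden", "challenge", "adversarial", "regression"}

# Hypothetical impact graph: change type -> dataset types that must be rerun.
IMPACT_GRAPH = {
    "model": ALL_SETS,
    "prompt": {"golden", "adversarial", "regression"},
    "rag": {"golden", "challenge", "regression"},
    "tool": {"adversarial", "regression"},
    "policy": {"golden", "challenge"},
    "workflow": {"golden", "regression"},
}


def required_datasets(changes: list) -> list:
    """Union of dataset types affected by the given change types, sorted."""
    selected = set()
    for change in changes:
        # Unknown change types conservatively select every dataset type.
        selected |= IMPACT_GRAPH.get(change, ALL_SETS)
    return sorted(selected)


print(required_datasets(["tool"]))
print(required_datasets(["prompt", "policy"]))
print(required_datasets(["embedding"]))  # not in the graph, so rerun everything
```

Falling back to a full rerun for unrecognized change types is a design choice that favors safety over run cost.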

Release checklist

Use this checklist when an AI release depends on eval dataset evidence.

| Check | Pass condition |
| --- | --- |
| Release scope defined | Use case, users, workflow point, products, channels, language, customer impact and risk tier are explicit |
| Dataset versions frozen | Golden, challenge, adversarial and regression versions are immutable for the release run |
| Required datasets selected | Change impact analysis selected required datasets for model, prompt, RAG, tool, policy and workflow changes |
| Critical slices covered | High-risk product, channel, language, customer and failure-mode slices are present or formally accepted as gap |
| Privacy cleared | Real or production-derived cases have approved use, access and retention |
| Labels approved | Expected behavior, unacceptable behavior, severity and policy reference have authorized review |
| Synthetic cases reviewed | Synthetic cases have realism, factual consistency and oracle checks |
| Hard gates configured | Critical violations, unauthorized actions, PII leakage, wrong customer-facing decisions and under-escalation have blocker thresholds |
| Eval run reproducible | Dataset manifest, component versions, evaluator versions and run config are captured |
| Slice results reviewed | Aggregate score, critical failures, slice failures and regression failures are all reviewed |
| Exceptions documented | Open issues have owner, expiry, compensating control and approval |
| Monitoring connected | Production trace mining, complaint review, override review and drift checks are mapped to dataset update workflow |
| Evidence archived | Manifest, hash, reports, approvals, exceptions and retention decisions are stored together |
| Rollback criteria defined | Dataset-related failures that trigger rollback, ramp pause or manual fallback are explicit |

Release decision language:

| Decision | Use when |
| --- | --- |
| Go | Required datasets passed, no critical failures, evidence complete |
| Limited go | Known gaps accepted with limited population, monitoring and expiry |
| No-go | Critical failures, insufficient coverage, unresolved label conflicts or privacy blockers |
| Rollback | Production signals breach gate criteria or escaped failure repeats |
| Rework dataset | Dataset cannot support the decision because coverage, labels, lineage or evidence are weak |
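
The pre-release decisions above can be sketched as a small function. This is a hedged illustration: the `ReleaseEvidence` fields and the order in which conditions are checked are assumptions, and Rollback is left out because it depends on production signals after release.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:
    required_datasets_passed: bool
    critical_failures: int
    evidence_complete: bool
    unresolved_label_conflicts: int = 0
    privacy_blockers: int = 0
    coverage_gaps: int = 0
    # True when gaps are accepted with limited population, monitoring and expiry.
    gaps_accepted: bool = False


def release_decision(e: ReleaseEvidence) -> str:
    """Map release evidence to a decision; the check order is an assumption."""
    if e.critical_failures or e.unresolved_label_conflicts or e.privacy_blockers:
        return "no-go"
    if e.coverage_gaps and not e.gaps_accepted:
        return "no-go"
    if not e.evidence_complete:
        return "rework dataset"
    if not e.required_datasets_passed:
        return "no-go"
    if e.coverage_gaps:
        return "limited go"
    return "go"


print(release_decision(ReleaseEvidence(True, 0, True)))
print(release_decision(ReleaseEvidence(True, 0, True, coverage_gaps=2,
                                       gaps_accepted=True)))
print(release_decision(ReleaseEvidence(True, 1, True)))
print(release_decision(ReleaseEvidence(True, 0, False)))
```

Encoding the decision this way does not replace the release decision memo; it only makes the inputs to that memo explicit and repeatable.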

Executive narrative

One-page storyline

We are not approving this AI release only because model quality improved.
We are approving it because the approved use case was tested against governed datasets:

1. Golden set protected core customer and employee journeys.
2. Challenge set tested complex financial retail edge cases.
3. Adversarial set tested injection, leakage and excessive agency risks.
4. Regression set verified that prior incidents and defects did not return.
5. Synthetic and production-derived cases expanded coverage while respecting privacy and retention.

Each case has lineage, label authority, expected behavior, severity and dataset membership.
The release run is tied to immutable dataset versions and component versions.
Open gaps are documented with owner, expiry and monitoring.
Production complaints, overrides and incidents will feed the next dataset cycle.

Management message

For executives:

The dataset lifecycle is the control system behind AI release quality. It gives us a defensible way to scale AI without relying on anecdotal demos or average scores. We can show which customer journeys and risk scenarios were covered, which failures block release, where gaps remain, and how production evidence improves the next version.

For risk and audit:

The evidence packet links dataset inventory, lineage, labels, privacy decisions, eval runs, exceptions and approvals. This allows reviewers to reconstruct why a release was approved and whether controls operated as designed.

For product and architecture:

The test data factory makes AI delivery faster over time. Once cases, labels and gates are reusable, each model, prompt, RAG, tool or workflow change can be assessed with a consistent decision framework.


Interview drills

Drill 1: Explain golden set vs regression set

Strong answer:

Golden set protects stable core behavior and allows comparison across versions. Regression set protects against known failures returning, often sourced from incidents, defects, complaints or production overrides. A case can be important to both, but the governance question is different: golden asks “does the system still perform approved core tasks,” while regression asks “did we reintroduce a failure we already learned from.”

Drill 2: Synthetic vs real cases

Strong answer:

Real cases provide operational realism, but they carry privacy, retention, access and sampling-bias issues. Synthetic cases help cover rare, adversarial and combinatorial scenarios, but they need realism checks and a reliable oracle. I would use both: real and production-derived samples for distribution and noise, synthetic and mutated samples for low-frequency and high-risk coverage.

Drill 3: What would block release?

Strong answer:

I would block release for critical failures such as unauthorized tool action, PII leakage, wrong customer-facing credit or payment advice, under-escalation of high-risk AML or complaint cases, unresolved label conflict in a release blocker, missing privacy approval for real data, or insufficient coverage of a high-impact slice in the approved rollout scope.

Drill 4: How do you make it audit-ready?

Strong answer:

I would retain dataset manifest, version hash, source lineage, privacy decision, label authority log, coverage matrix, eval run report, slice failures, exceptions, approvals and monitoring trigger in one evidence packet. The key is not having documents; it is being able to reconstruct which dataset version supported which release decision and who accepted which residual risk.

Drill 5: How does this apply to an agent?

Strong answer:

For an agent, dataset cases must include tool authority, approval requirement, tool arguments, expected side effects and rollback expectation. The adversarial set should test prompt injection, excessive agency and data leakage. The regression set should include any previous unauthorized action or incorrect workflow update. Agent release should not rely on natural-language answer quality alone.

Drill 6: What would you ask in a CTO interview?

Strong answer:

I would ask whether eval datasets are treated as platform assets: versioned, access-controlled, tied to release gates and connected to production observability. Then I would ask how dataset impact analysis works when model, prompt, RAG, tool or workflow changes. If the answer is “teams keep their own spreadsheets,” the roadmap should include a case registry, lineage graph, promotion gates and automated evidence packet.


Source anchors

| Source | Link | Playbook use |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Risk lifecycle, measurement, management and governance language |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | GenAI-specific risk cases for adversarial and governance sets |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/81230.html | Management-system operating model, ownership, performance evaluation and improvement |
| ISO/IEC/IEEE 29148 Requirements Engineering | https://www.iso.org/standard/72089.html | Requirements-to-eval traceability and acceptance behavior |
| ISO/IEC/IEEE 42010 Architecture Description | https://www.iso.org/standard/74393.html | Architecture viewpoints and stakeholder concerns |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Prompt injection, sensitive information disclosure and excessive agency test design |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Production traces, metrics and logs feeding failure mining and evidence |