AI Eval Dataset Lifecycle / Golden Set / Test Data Factory Playbook

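Positioning: written for senior CBAP, retail financial services PM, AI Product Architect, Solution Architect, AI Governance and Model Risk practitioners. It upgrades eval datasets from test attachments into a governable, reusable and auditable supply chain of AI quality assets.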

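Scope of use: this playbook applies to AI products such as KYC, AML, credit, payments, contact center, complaints, RAG, agent, copilot and workflow automation. It does not replace the formal judgment of Legal, Compliance, Privacy, Model Risk, Information Security, Internal Audit or business management.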

Purpose and when to use

Purpose

This playbook establishes an executable operating model for AI eval datasets:

dataset inventory
  -> case intake
  -> privacy and retention decision
  -> label governance
  -> coverage matrix
  -> golden / challenge / adversarial / regression promotion
  -> test data factory
  -> release gate
  -> evidence packet
  -> production feedback loop

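The goal is not to "prepare more test samples" but to let the team answer:

Which datasets support this release decision?
Which high-risk scenarios are hard gates?
How are synthetic cases and real cases combined?
Who adjudicates labels, and against which policy version?
How do coverage gaps, coverage drift and historical failures enter the next dataset version?
After a release, how does audit trace the dataset version, run result, approval and exception?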

When to use

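Trigger	How to use
New AI use case kickoff	Define the dataset portfolio and case schema first, then the model or prompt evaluation
Model / prompt / RAG / tool / workflow change	Run dataset impact analysis and select the datasets that must be rerun
High-risk retail financial services release	Produce the promotion gate and evidence packet
Spike in production complaints, incidents or human overrides	Convert failure traces into candidate regression cases
Coverage drift or policy change	Re-review labels, retire old cases and extend the challenge set
Regulatory, internal audit or model risk review	Use the evidence packet to demonstrate the data, label, run and approval chain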

Operating model

1. Roles and decision rights

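Role	Decision right	Key responsibilities
Business owner	Approves business use, rollout scope and residual risk	Defines customer impact, business boundaries and release appetite
Product manager	Maintains the dataset roadmap and release gate narrative	Connects business outcomes, failure modes, metrics, gates and evidence
Senior BA / CBAP	Manages requirements, processes, policies, label definitions and the coverage matrix	Turns business rules, exceptions, customer journeys and control points into eval cases
AI architect	Designs the case registry, lineage, factory, eval runner and evidence integration	Ensures versions, dependencies, observability and security boundaries are implementable
Data governance / privacy	Approves data sources, de-identification, purpose, access, retention and deletion	Controls the boundary for real cases entering eval
SME / operations lead	Adjudicates expected behavior, labels and failure severity	Maintains business realism and operational feasibility
Model risk / independent review	Challenges coverage, label stability, gate thresholds and evidence sufficiency	Provides effective challenge on high-impact use cases
Security	Reviews the adversarial set, prompt injection, tool authority and PII leakage	Turns security threats into tests and hard gates
Release manager	Binds dataset version, run result, approval and rollback	Ensures the release process does not bypass the dataset gate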

2. Cadence

Cadence	Activities	Output
Every release	dataset impact analysis, required reruns, evidence packet	release gate memo
Weekly	candidate case triage, label conflict review, failure mining	updated candidate backlog
Monthly	coverage drift review, production sample review, retirement candidates	coverage and drift report
Quarterly	dataset portfolio review, access review, retention review, management summary	dataset governance report
After a major incident	incident-to-regression conversion, root cause cases, gate strengthening	regression set update

3. Dataset lifecycle workflow

1. Discover
   Find cases from production traces, SME workshops, policy changes, incidents, complaints and synthetic scenarios.

2. Classify
   Tag use case, source type, risk tier, customer impact, privacy state and failure mode.

3. Govern
   Pass the privacy, retention, label authority, coverage and promotion gates.

4. Promote
   Assign the case to the golden, challenge, adversarial, regression, synthetic or monitor-only set.

5. Execute
   Bind dataset version, component version, evaluator, threshold and release decision.

6. Evidence
   Archive manifest, hash, lineage, run result, slice result, exception, approval and retention decision.

7. Improve
   Use production failures, drift, complaints, overrides and policy changes to add cases, relabel, retire cases or strengthen gates.
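
The seven stages above can be read as a linear state machine with one loop from Improve back to Discover. The following is a minimal sketch under that reading; the names `Stage` and `next_stage` are hypothetical, and the rule that a case waits in Govern until its gates pass is an assumption, not something the playbook prescribes.

```python
from enum import Enum


class Stage(Enum):
    DISCOVER = 1
    CLASSIFY = 2
    GOVERN = 3
    PROMOTE = 4
    EXECUTE = 5
    EVIDENCE = 6
    IMPROVE = 7


def next_stage(current: Stage, gate_passed: bool = True) -> Stage:
    """Return the stage a case moves to next.

    Assumptions: a case that fails the Govern gates stays in Govern,
    and Improve feeds back into Discover.
    """
    if current is Stage.GOVERN and not gate_passed:
        return Stage.GOVERN
    if current is Stage.IMPROVE:
        return Stage.DISCOVER
    return Stage(current.value + 1)
```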

Template: dataset inventory

Field	Required content	Example
dataset_id	Unique identifier that encodes use case, set type and version	`CONTACT-RAG-GOLDEN-v2026.06.30`
dataset_type	golden / challenge / adversarial / regression / synthetic / production_sample	`regression`
business_use_case	Bound business capability and process step	Contact-center RAG answering retail account fee policy questions
risk_tier	low / medium / high / prohibited-boundary	high
owner	business owner, product owner, technical owner	Retail Servicing PO + AI Platform Architect
source_mix	Share of real, synthetic, policy-authored and incident-derived cases	45% real, 35% synthetic, 20% incident-derived
active_scope	Product, channel, language, region, customer type, process step	mobile + call center, English / Spanish, checking and cards
excluded_scope	Scope that is explicitly not covered	wealth advisory, commercial banking, legal complaints
label_authority	Who is authorized to confirm expected behavior and severity	servicing SME + compliance reviewer
privacy_classification	PII / confidential / de-identified / synthetic	de-identified with restricted access
retention_rule	active, archive and purge rules	active 18 months, archived 5 years for release evidence
release_usage	Which release gates use the dataset	prompt, RAG corpus, retriever, model route, tool changes
evidence_location	Archive location of manifest, run, approval and exception	evidence binder path or GRC object id
last_reviewed	Date of the most recent portfolio review	2026-06-30
next_review_trigger	Date or event trigger	quarterly review or complaint spike
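
As an illustration, an inventory row can be held as a typed record and checked before it enters the registry. This is a sketch only: the class name `DatasetInventoryEntry`, the subset of fields, and the `dataset_id` pattern (inferred from the single example id in the template) are assumptions.

```python
import re
from dataclasses import dataclass

DATASET_TYPES = {"golden", "challenge", "adversarial", "regression",
                 "synthetic", "production_sample"}
RISK_TIERS = {"low", "medium", "high", "prohibited-boundary"}
# Assumed id shape, inferred from the template example:
# <USE-CASE>-<SET-TYPE>-v<YYYY.MM.DD>
DATASET_ID_PATTERN = re.compile(r"^[A-Z0-9]+(?:-[A-Z0-9]+)*-v\d{4}\.\d{2}\.\d{2}$")


@dataclass(frozen=True)
class DatasetInventoryEntry:
    dataset_id: str
    dataset_type: str
    business_use_case: str
    risk_tier: str
    owner: str
    source_mix: dict  # source type -> percentage of cases
    label_authority: str
    privacy_classification: str
    retention_rule: str

    def validation_errors(self) -> list:
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if not DATASET_ID_PATTERN.match(self.dataset_id):
            errors.append("dataset_id does not match the assumed pattern")
        if self.dataset_type not in DATASET_TYPES:
            errors.append(f"unknown dataset_type: {self.dataset_type}")
        if self.risk_tier not in RISK_TIERS:
            errors.append(f"unknown risk_tier: {self.risk_tier}")
        if sum(self.source_mix.values()) != 100:
            errors.append("source_mix percentages must sum to 100")
        return errors
```

Returning a list of errors rather than raising on the first one lets a portfolio review see every problem with a row at once.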

Template: coverage matrix

Coverage dimension	Required slices	Current coverage signal	Gap decision
Product	checking, credit card, loan, mortgage, prepaid	card cases overrepresented	add checking fee and loan servicing cases
Channel	mobile, web, branch, call center, back office	mobile and call center covered	create branch handoff cases
Language	English, Spanish, bilingual, low-literacy phrasing	English strong, Spanish thin	promote Spanish synthetic + SME-authored cases
Customer segment	new customer, long-tenured, vulnerable customer, thin-file, small business where applicable	vulnerable customer not explicit	add escalation and accommodation cases
Workflow point	read, summarize, recommend, draft, act	read/summarize covered	add draft and tool-approval cases before agent release
Failure mode	unsupported claim, wrong citation, under-escalation, unauthorized action, PII leakage, over-refusal	wrong citation and unsupported claim covered	adversarial set needs tool and PII cases
Risk severity	critical, high, medium, low	high and medium sufficient	critical cases need hard gate definition
Data source	policy document, transaction fact, case note, customer message, tool observation	policy documents strong	add tool observation and stale-policy cases
Time sensitivity	current policy, outdated policy, effective-date conflict	current policy only	add effective-date challenge cases
Evidence quality	complete, missing field, conflicting source, ambiguous source	complete evidence dominant	add missing and conflicting source cases

How to use:

1. Define release scope first.
2. Mark each required slice as covered, thin, missing or not in scope.
3. Convert missing high-risk slices into candidate cases.
4. Do not approve a high-impact release while critical slices are missing, unless the gap has explicit risk acceptance.
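
Steps 2 to 4 can be sketched as a small check over slice statuses. The `Slice` record, the `critical` flag and the `risk_accepted` flag are hypothetical names chosen for this example; how a team decides that a slice is critical is outside the sketch.

```python
from dataclasses import dataclass

STATUSES = {"covered", "thin", "missing", "not_in_scope"}


@dataclass(frozen=True)
class Slice:
    dimension: str
    name: str
    status: str                  # one of STATUSES (step 2)
    critical: bool               # high-risk for the release scope
    risk_accepted: bool = False  # explicit risk acceptance recorded

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def candidate_case_requests(slices):
    """Step 3: missing high-risk slices become candidate-case requests."""
    return [f"{s.dimension}/{s.name}"
            for s in slices if s.critical and s.status == "missing"]


def coverage_blocks_release(slices):
    """Step 4: block when a critical slice is missing without risk acceptance."""
    return any(s.critical and s.status == "missing" and not s.risk_accepted
               for s in slices)
```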

Template: label quality and authority

Field	Required content	Example
case_id	Case identifier	`AML-CHALLENGE-2026Q3-0041`
expected_behavior	What AI should do	Summarize suspicious pattern, cite transactions, recommend analyst review, avoid final SAR conclusion
unacceptable_behavior	What must not happen	State that SAR filing is required without analyst decision
primary_label	Business or risk label	possible structuring pattern
severity	critical / high / medium / low	high
policy_reference	Policy, procedure or control version	AML Investigation Procedure v2026.05
reviewer_1	Role and decision	AML SME: approve expected behavior
reviewer_2	Role and decision	Financial crime compliance: approve escalation boundary
conflict_state	none / open / resolved	resolved
conflict_resolution	Final rationale	Wording changed from “must file” to “requires analyst determination”
label_version	Version of label decision	`label-v2`
review_expiry	Date or policy-change trigger	expires when AML procedure changes or after 12 months

Label governance rules:

High and critical cases require at least one business SME and one risk/compliance reviewer.
Label changes require versioning and impact analysis on historical runs.
A case with open label conflict can be used for exploration but not as a release blocker.
Synthetic cases need both realism review and expected-behavior review.
A label can expire when policy, product, workflow or customer communication rules change.
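
The rules above lend themselves to an automated eligibility check before a case is allowed to act as a release blocker. A minimal sketch, assuming a hypothetical `LabelRecord` shape and hypothetical role names (`business_sme`, `risk_compliance`); versioning and historical impact analysis are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class LabelRecord:
    case_id: str
    severity: str                          # critical / high / medium / low
    reviewer_roles: set = field(default_factory=set)
    conflict_state: str = "none"           # none / open / resolved
    is_synthetic: bool = False
    realism_reviewed: bool = False
    expected_behavior_reviewed: bool = False
    expired: bool = False


def can_block_release(label: LabelRecord) -> tuple:
    """Return (eligible, reasons) for using a case as a release blocker."""
    reasons = []
    if label.severity in {"high", "critical"}:
        if "business_sme" not in label.reviewer_roles:
            reasons.append("missing business SME review")
        if "risk_compliance" not in label.reviewer_roles:
            reasons.append("missing risk/compliance review")
    if label.conflict_state == "open":
        reasons.append("open label conflict: exploration only")
    if label.is_synthetic and not (
        label.realism_reviewed and label.expected_behavior_reviewed
    ):
        reasons.append("synthetic case lacks realism or expected-behavior review")
    if label.expired:
        reasons.append("label expired")
    return (not reasons, reasons)
```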

Template: promotion gate

Gate item	Pass evidence	Fail example	Decision
Approved use mapping	Case maps to registered AI use case and workflow point	Case belongs to wealth advice while release is retail servicing	reject or move to another use case
Source lineage	Source system, sampling window, generation rule or incident id recorded	Screenshot copied from unknown location	hold
Privacy and retention	PII handling, access group, retention rule documented	Raw customer data stored in open dev folder	reject
Expected behavior	Clear desired action, refusal, citation or escalation	“Answer correctly” without oracle	hold
Unacceptable behavior	Failure modes and severity defined	No definition of critical failure	hold
Label authority	Required reviewers approved label and severity	Only model developer approved high-risk label	hold
Coverage contribution	Case fills a known gap or protects a known behavior	Duplicate of 40 existing easy cases	do not promote
Dataset membership	Golden/challenge/adversarial/regression/synthetic role selected	Same case used for all purposes with no role	split or classify
Release gate impact	Threshold and blocker status defined	High-risk case has no decision effect	hold
Evidence readiness	Manifest, hash, version, approval and retention captured	Case exists only in local spreadsheet	hold

Promotion decisions:

Decision	Meaning
promote_to_golden	Stable core case, strong label authority, expected to remain comparable
promote_to_challenge	Complex boundary case, useful for risk discussion and slice analysis
promote_to_adversarial	Security, abuse, leakage, prompt injection or excessive agency case
promote_to_regression	Historical failure, incident, complaint, defect or previously fixed issue
monitor_only	Useful production sample but not stable enough for release gate
reject	Out of scope, unsafe to use, poor lineage or not decision-relevant
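
One way to turn the gate table into a decision is to map each failed gate item to its listed outcome and let the most restrictive outcome win. The sketch below makes two assumptions that are not in the tables: the precedence order in `PRECEDENCE`, and the simplification of "reject or move to another use case" to `reject`. The key names are hypothetical.

```python
# Gate item -> outcome when the item fails, following the promotion gate table.
FAIL_DECISIONS = {
    "approved_use_mapping": "reject",
    "source_lineage": "hold",
    "privacy_and_retention": "reject",
    "expected_behavior": "hold",
    "unacceptable_behavior": "hold",
    "label_authority": "hold",
    "coverage_contribution": "do_not_promote",
    "dataset_membership": "split_or_classify",
    "release_gate_impact": "hold",
    "evidence_readiness": "hold",
}

# Assumed precedence when several items fail: most restrictive first.
PRECEDENCE = ["reject", "do_not_promote", "hold", "split_or_classify"]


def evaluate_promotion(results: dict, target: str) -> str:
    """Return `target` (e.g. promote_to_golden) only if every gate item passed."""
    missing = sorted(set(FAIL_DECISIONS) - set(results))
    if missing:
        raise ValueError(f"gate items not evaluated: {missing}")
    failed = {FAIL_DECISIONS[item] for item in FAIL_DECISIONS if not results[item]}
    for decision in PRECEDENCE:
        if decision in failed:
            return decision
    return target
```

Raising on unevaluated gate items keeps a case from being promoted simply because a check was skipped.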

Template: evidence packet

Evidence object	What it proves	Owner
Dataset inventory snapshot	Which datasets were in scope and who owns them	Product manager
Dataset manifest and hash	Exact version used for release run	AI platform
Source lineage report	Where cases came from and how they were processed	Data governance / AI platform
Privacy and retention decision	Data use, access and retention are approved	Privacy / data governance
Coverage matrix	Release scope has adequate dataset coverage or known accepted gaps	BA / PM
Label authority log	Expected behavior and severity were approved by the right roles	SME / risk
Eval run report	Component version results against dataset versions	AI engineering
Slice failure report	Which products, segments, languages, channels or failure modes failed	AI platform / PM
Exception and risk acceptance	Open gaps, compensating controls, expiry and owner	Business owner / risk
Release decision memo	Go / no-go / limited release / rollback decision	Release manager
Monitoring and failure-mining plan	How production signals will refresh datasets	Operations / observability
Archive and access record	Evidence is retained and accessible to authorized reviewers	GRC / records management

Evidence packet rule:

If a dataset result influences a release decision, its manifest, lineage, labels, run result and approval must be retained together.
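
A sketch of how the manifest hash and the "retained together" rule might be checked, using only the Python standard library. The function names and the packet keys are hypothetical; the hash is computed over canonical JSON so that reordering cases does not change the dataset version identity, which is a design assumption rather than a requirement stated above.

```python
import hashlib
import json

# Objects the evidence packet rule says must be retained together.
REQUIRED_EVIDENCE = ("manifest", "lineage", "labels", "run_result", "approval")


def manifest_hash(cases: list) -> str:
    """Deterministic SHA-256 over the cases, independent of list order."""
    canonical = sorted(
        json.dumps(case, sort_keys=True, ensure_ascii=False) for case in cases
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def missing_evidence(packet: dict) -> list:
    """Return the required evidence objects that are absent or empty."""
    return [key for key in REQUIRED_EVIDENCE if not packet.get(key)]
```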

PM/BA/architecture questions

PM questions

Question	Strong answer should include
What business decision does this dataset support?	release, scale, rollback, model migration, policy update or monitoring
Which customer harm is this dataset designed to prevent?	wrong denial, missed AML escalation, bad payment advice, privacy breach, complaint mishandling
What is the minimum viable golden set for pilot?	core journeys, critical failures, business-owner-approved labels, privacy-cleared cases
Which gaps are acceptable for pilot but not for scale?	low-volume channels, secondary products, lower-risk languages, manual fallback available
How will complaints and production failures become regression cases?	intake trigger, owner, review SLA, promotion gate and next release usage

BA / CBAP questions

Question	Strong answer should include
What workflow state does each case represent?	actor, system state, available evidence, decision boundary, handoff point
What is the expected behavior, not just expected output?	answer, cite, refuse, escalate, draft, ask for missing data, request approval
Which policies and procedures define the label?	policy version, owner, effective date, exception handling
What makes a case critical, high, medium or low severity?	customer impact, regulatory impact, financial impact, operational reversibility
How do we handle conflicting SME labels?	conflict log, escalation path, final authority, label version and impact analysis

Architecture questions

Question	Strong answer should include
Where is the case registry and how is it versioned?	immutable dataset release, hash, API, access control, lineage graph
How does the eval runner select datasets for a change?	model/prompt/RAG/tool/policy/workflow impact graph
How do we prevent test set contamination?	role-based access, training exclusion, holdout protection, run audit
How are production traces converted into eval cases?	sampling, privacy gate, de-identification, SME review, promotion gate
How does observability connect to dataset lifecycle?	traces, metrics, logs, complaint signals, override signals, incident RCA
How is retention enforced?	retention metadata, legal hold flag, archive state, purge record
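
For the dataset-selection question, an impact graph can be as simple as a mapping from changed component type to the datasets that must be rerun. The sketch below is illustrative: `required_reruns` is a hypothetical name and the graph contents are supplied by the caller, because the playbook does not define which component maps to which dataset.

```python
def required_reruns(changed_components, impact_graph):
    """Union of datasets that must be rerun for a set of changed components.

    impact_graph: dict mapping a component type (for example "prompt")
    to an iterable of dataset names. Unknown components raise an error
    so that an unmapped change cannot silently skip evaluation.
    """
    unknown = sorted(set(changed_components) - set(impact_graph))
    if unknown:
        raise ValueError(f"no impact mapping for: {unknown}")
    selected = set()
    for component in changed_components:
        selected |= set(impact_graph[component])
    return sorted(selected)


# Illustrative graph only; real mappings are a team decision.
example_graph = {
    "prompt": ["golden", "regression"],
    "tool": ["adversarial", "regression"],
}
reruns = required_reruns(["prompt", "tool"], example_graph)
```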

Release checklist

Use this checklist when an AI release depends on eval dataset evidence.

Check	Pass condition
Release scope defined	Use case, users, workflow point, products, channels, language, customer impact and risk tier are explicit
Dataset versions frozen	Golden, challenge, adversarial and regression versions are immutable for the release run
Required datasets selected	Change impact analysis selected required datasets for model, prompt, RAG, tool, policy and workflow changes
Critical slices covered	High-risk product, channel, language, customer and failure-mode slices are present or formally accepted as gaps
Privacy cleared	Real or production-derived cases have approved use, access and retention
Labels approved	Expected behavior, unacceptable behavior, severity and policy reference have authorized review
Synthetic cases reviewed	Synthetic cases have realism, factual consistency and oracle checks
Hard gates configured	Critical violations, unauthorized actions, PII leakage, wrong customer-facing decisions and under-escalation have blocker thresholds
Eval run reproducible	Dataset manifest, component versions, evaluator versions and run config are captured
Slice results reviewed	Aggregate score, critical failures, slice failures and regression failures are all reviewed
Exceptions documented	Open issues have owner, expiry, compensating control and approval
Monitoring connected	Production trace mining, complaint review, override review and drift checks are mapped to dataset update workflow
Evidence archived	Manifest, hash, reports, approvals, exceptions and retention decisions are stored together
Rollback criteria defined	Dataset-related failures that trigger rollback, ramp pause or manual fallback are explicit

Release decision language:

Decision	Use when
Go	Required datasets passed, no critical failures, evidence complete
Limited go	Known gaps accepted with limited population, monitoring and expiry
No-go	Critical failures, insufficient coverage, unresolved label conflicts or privacy blockers
Rollback	Production signals breach gate criteria or escaped failure repeats
Rework dataset	Dataset cannot support the decision because coverage, labels, lineage or evidence are weak
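
The pre-release decisions in the table can be expressed as a small function over summarized evidence. This is a sketch under stated assumptions: the `ReleaseEvidence` fields are hypothetical, the evaluation order is one reasonable reading of the table, and Rollback is left out because it is driven by production signals after release.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:
    required_datasets_passed: bool
    critical_failures: int
    evidence_complete: bool
    open_label_conflicts: int
    privacy_blockers: int
    coverage_gaps: int                  # gaps in critical slices
    gaps_risk_accepted: bool            # limited population, monitoring, expiry
    dataset_fit_for_decision: bool = True


def release_decision(e: ReleaseEvidence) -> str:
    # A dataset too weak to support any decision is reworked first.
    if not e.dataset_fit_for_decision:
        return "rework_dataset"
    if e.critical_failures or e.open_label_conflicts or e.privacy_blockers:
        return "no_go"
    if e.coverage_gaps:
        return "limited_go" if e.gaps_risk_accepted else "no_go"
    if e.required_datasets_passed and e.evidence_complete:
        return "go"
    return "no_go"
```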

Executive narrative

One-page storyline

We are not approving this AI release only because model quality improved.
We are approving it because the approved use case was tested against governed datasets:

1. Golden set protected core customer and employee journeys.
2. Challenge set tested complex financial retail edge cases.
3. Adversarial set tested injection, leakage and excessive agency risks.
4. Regression set verified that prior incidents and defects did not return.
5. Synthetic and production-derived cases expanded coverage while respecting privacy and retention.

Each case has lineage, label authority, expected behavior, severity and dataset membership.
The release run is tied to immutable dataset versions and component versions.
Open gaps are documented with owner, expiry and monitoring.
Production complaints, overrides and incidents will feed the next dataset cycle.

Management message

For executives:

The dataset lifecycle is the control system behind AI release quality. It gives us a defensible way to scale AI without relying on anecdotal demos or average scores. We can show which customer journeys and risk scenarios were covered, which failures block release, where gaps remain, and how production evidence improves the next version.

For risk and audit:

The evidence packet links dataset inventory, lineage, labels, privacy decisions, eval runs, exceptions and approvals. This allows reviewers to reconstruct why a release was approved and whether controls operated as designed.

For product and architecture:

The test data factory makes AI delivery faster over time. Once cases, labels and gates are reusable, each model, prompt, RAG, tool or workflow change can be assessed with a consistent decision framework.

Interview drills

Drill 1: Explain golden set vs regression set

Strong answer:

Golden set protects stable core behavior and allows comparison across versions. Regression set protects against known failures returning, often sourced from incidents, defects, complaints or production overrides. A case can be important to both, but the governance question is different: golden asks “does the system still perform approved core tasks,” while regression asks “did we reintroduce a failure we already learned from.”

Drill 2: Synthetic vs real cases

Strong answer:

Real cases provide operational realism, but they carry privacy, retention, access and sampling-bias issues. Synthetic cases help cover rare, adversarial and combinatorial scenarios, but they need realism checks and a reliable oracle. I would use both: real and production-derived samples for distribution and noise, synthetic and mutated samples for low-frequency and high-risk coverage.

Drill 3: What would block release?

Strong answer:

I would block release for critical failures such as unauthorized tool action, PII leakage, wrong customer-facing credit or payment advice, under-escalation of high-risk AML or complaint cases, unresolved label conflict in a release blocker, missing privacy approval for real data, or insufficient coverage of a high-impact slice in the approved rollout scope.

Drill 4: How do you make it audit-ready?

Strong answer:

I would retain the dataset manifest, version hash, source lineage, privacy decision, label authority log, coverage matrix, eval run report, slice failures, exceptions, approvals and monitoring trigger in one evidence packet. The key is not merely having documents; it is being able to reconstruct which dataset version supported which release decision and who accepted which residual risk.

Drill 5: How does this apply to an agent?

Strong answer:

For an agent, dataset cases must include tool authority, approval requirement, tool arguments, expected side effects and rollback expectation. The adversarial set should test prompt injection, excessive agency and data leakage. The regression set should include any previous unauthorized action or incorrect workflow update. Agent release should not rely on natural-language answer quality alone.
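
An agent case along these lines could carry the expected tool call next to the expected answer, and the evaluator could compare it with what the agent did. The record shapes and violation labels below are hypothetical, and the tool name used in any example is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedToolCall:
    tool_name: str
    arguments: dict
    requires_approval: bool
    expected_side_effects: tuple = ()
    rollback_expectation: str = ""


@dataclass(frozen=True)
class ObservedToolCall:
    tool_name: str
    arguments: dict
    approval_obtained: bool


def tool_call_violations(expected, observed, authorized_tools):
    """Compare an observed tool call with the case's expectation."""
    violations = []
    if observed.tool_name not in authorized_tools:
        violations.append("unauthorized_tool")
    if observed.tool_name != expected.tool_name:
        violations.append("wrong_tool")
    elif observed.arguments != expected.arguments:
        violations.append("wrong_arguments")
    if expected.requires_approval and not observed.approval_obtained:
        violations.append("missing_approval")
    return violations
```

A non-empty result for `unauthorized_tool` or `missing_approval` would be a natural candidate for a hard gate, independent of how good the natural-language answer looks.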

Drill 6: What would you ask in a CTO interview?

Strong answer:

I would ask whether eval datasets are treated as platform assets: versioned, access-controlled, tied to release gates and connected to production observability. Then I would ask how dataset impact analysis works when the model, prompt, RAG, tool or workflow changes. If the answer is “teams keep their own spreadsheets,” the roadmap should include a case registry, a lineage graph, promotion gates and an automated evidence packet.

Source anchors

Source	Link	Playbook use
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Risk lifecycle, measurement, management and governance language
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	GenAI-specific risk cases for adversarial and governance sets
ISO/IEC 42001 AI management system	https://www.iso.org/standard/81230.html	Management-system operating model, ownership, performance evaluation and improvement
ISO/IEC/IEEE 29148 Requirements Engineering	https://www.iso.org/standard/72089.html	Requirements-to-eval traceability and acceptance behavior
ISO/IEC/IEEE 42010 Architecture Description	https://www.iso.org/standard/74393.html	Architecture viewpoints and stakeholder concerns
OWASP LLM Top 10	https://owasp.org/www-project-top-10-for-large-language-model-applications/	Prompt injection, sensitive information disclosure and excessive agency test design
OpenTelemetry docs	https://opentelemetry.io/docs/	Production traces, metrics and logs feeding failure mining and evidence