
AI Synthetic Data Governance: Privacy-Utility-Fidelity Architecture

Date: 2026-06-30


Synthetic Data Governance Architecture for AI: Privacy, Utility and Fidelity

Status: evergreen
Audience: experienced CBAP / financial retail PM / data product architect / AI architect / privacy-risk partner
Output: advanced architecture note, governance model, release gate design, ADR draft, interview-ready narrative

Why synthetic data governance matters

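Synthetic data is not "fake data", and it is not a safe asset that appears automatically once real customer data is fed to a generator. For financial retail AI, synthetic data is a governed data product: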

Synthetic data product = approved purpose + source permission + generation method + privacy attack evidence + utility/fidelity score + allowed-use license + release gate.

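Its value is that teams can support AI development, process testing, edge-case expansion, staff training and vendor PoCs without endlessly copying sensitive data. It also introduces new risks:

Risk	Why senior teams must govern it
Residual privacy leakage	The generator may memorize rare customers, transactions, complaints, document images or call fragments, creating membership inference, model inversion or record linkage risk
False safety belief	Teams assume "synthetic" means anonymous by default, then move the data into a vendor sandbox, training corpus or low-control environment
Utility theater	The data looks realistic but cannot support the business task, model training, process testing or control validation
Fidelity overfit	Chasing resemblance to real data ends up replicating the privacy exposure, bias, historical errors and outlier cases in the original distribution
Bias amplification	The generator amplifies skewed minority-group distributions, complaint language patterns, fraud label bias or credit history bias
Purpose creep	Data generated for KYC testing is used to train a marketing model, or AML typology data is reused for employee performance monitoring
Provenance collapse	Downstream users do not know who generated the data, which sources it was based on, what it may be used for, how long it is retained or whether it can be redistributed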

The key judgment for a senior PM / BA / Architect is not "can we generate a batch of data", but:

Which business decision can this synthetic dataset support,
what real data permissions authorize its creation,
what privacy attacks has it survived,
what utility and fidelity evidence proves it is fit for use,
and what license prevents misuse after release?

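This note deliberately does not repeat four adjacent topics:

Synthetic user simulation focuses on the persona / scenario / journey lab; this note focuses on governing and releasing a synthetic dataset as a data product.
Privacy clean room focuses on multi-party data collaboration and controlled computation; this note focuses on synthetic data assets used inside the enterprise or with controlled vendors.
Differential privacy focuses on the mathematical privacy budget; this note treats DP as one optional control, not the whole solution.
Eval dataset lifecycle focuses on versioning, labeling and regression governance of eval sets; this note focuses on whether synthetic data can be released, for what purpose, and whether the evidence is sufficient.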

Concept diagram

flowchart TB
  A[Use-case intake<br/>purpose, users, risk tier,<br/>allowed decision] --> B[Source permission gate]
  B --> C[Source data registry<br/>classification, consent, contract,<br/>lineage, retention, owner]
  C --> D[Generation plan<br/>method, prompts, rules, model,<br/>sampling, constraints]
  D --> E[Generator risk review<br/>memorization, rare cases,<br/>prompt leakage, vendor boundary]
  E --> F[Synthetic dataset build]
  F --> G[Privacy attack workbench<br/>membership inference,<br/>model inversion, linkage,<br/>nearest-neighbor, canary tests]
  F --> H[Utility and fidelity lab<br/>task performance, distribution,<br/>business rules, SME review]
  F --> I[Representativeness and bias review<br/>coverage, subgroup error,<br/>amplification, missingness]
  G --> J[Data card and evidence packet]
  H --> J
  I --> J
  J --> K{Release gate}
  K -->|Reject| D
  K -->|Limited release| L[Allowed-use license<br/>scope, users, prohibited uses,<br/>expiry, retention, watermark]
  K -->|Promote| M[Dataset catalog<br/>provenance, version,<br/>access control, monitoring]
  M --> N[Downstream use<br/>RAG, Agent, Copilot,<br/>Eval, training, QA, sandbox]
  N --> O[Usage telemetry and incident feedback]
  O --> A

Core loop:

business purpose
  -> source permission
  -> governed generation
  -> privacy / utility / fidelity / bias evidence
  -> allowed-use release
  -> monitored downstream use
  -> renewal, restriction or retirement

Core architecture model

1. Use-case intake as the first control

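Synthetic data governance starts at intake, not at the generator. A mature intake states clearly why synthetic data is needed.

Field	Design requirement	Financial retail example
use_case_id	Stable identifier bound to the PRD, ADR, data card and release gate	`syn-aml-typology-training-v1`
business purpose	States the decision or test to be supported; must not be generalized to "AI R&D"	Train AML analysts to recognize mule network typologies
target users	Who may access the data, and whether that includes vendors, contractors or offshore teams	Financial crime training team, model validation
risk tier	Tiered by data sensitivity, customer impact, external distribution scope and degree of automation	KYC document testing is high; UI demo mock data is low
requested output	tabular, transcript, image, document, transaction graph, mixed dataset	contact center transcript + intent label
downstream use	testing, training, eval, demonstration, sandbox, analytics, model training	payment fraud simulation for rule stress test
prohibited use	States explicitly what must not be done	Must not be used for real customer decisions, marketing targeting or employee appraisal
evidence needed	Which privacy, utility, fidelity and bias evidence is required	membership inference test, SME utility review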

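Product principles for intake:

One synthetic dataset must not serve an unlimited number of purposes.
For high-risk uses, allowed use must be written into a license, not into an email.
If the real source data carries no permission for a given use, synthesis does not create that permission.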
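
The intake principles can be turned into a simple pre-check. The following is a minimal sketch, not a prescribed implementation: the `UseCaseIntake` class, the `VAGUE_PURPOSES` set, the `MAX_DOWNSTREAM_USES` limit and the `license_id` field are all illustrative assumptions, and a real policy would set its own values.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical policy values, chosen only for illustration.
VAGUE_PURPOSES = {"ai r&d", "ai research", "ai development"}
MAX_DOWNSTREAM_USES = 2


@dataclass
class UseCaseIntake:
    use_case_id: str
    business_purpose: str
    risk_tier: str                      # "low" or "high" in this sketch
    downstream_use: list
    prohibited_use: list = field(default_factory=list)
    license_id: Optional[str] = None    # set once allowed use is written into a license


def intake_issues(intake: UseCaseIntake) -> list:
    """Return the reasons an intake record should go back to the requester."""
    issues = []
    if intake.business_purpose.strip().lower() in VAGUE_PURPOSES:
        issues.append("business purpose is too generic")
    if len(intake.downstream_use) > MAX_DOWNSTREAM_USES:
        issues.append("too many downstream uses for one dataset")
    if not intake.prohibited_use:
        issues.append("prohibited use is not stated")
    if intake.risk_tier == "high" and intake.license_id is None:
        issues.append("high-risk use needs an allowed-use license")
    return issues
```

An empty result means the record is complete enough to move to the source permission gate; it says nothing about whether the request should be approved.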

2. Source permission and lineage layer

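The legality, compliance and explainability of synthetic data depend on source data permission. Architecturally, build a source permission matrix.

Source dimension	Key question	Control evidence
data owner	Who approves this source for synthetic data generation	data owner approval, system of record
data class	Does it contain PII, SPI, financial crime data, credit data, complaint data or call recordings	data classification record
purpose basis	Does the original collection purpose allow use for synthetic data generation	privacy / legal / compliance review
consent / notice	Do customer notices, internal policy or contracts support this processing	notice mapping, consent basis
contractual boundary	Does vendor or partner data allow derived data, synthetic data or redistribution	contract clause summary
retention	How long the source data, intermediate artifacts and synthetic output are retained	retention schedule
jurisdiction	Regions where data is processed and stored	regional control record
exclusion rules	Which fields, groups, samples or events are barred from the generation process	source filter log

Key point about source permission: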

Synthetic output inherits constraints from source data unless a reviewed governance decision narrows, transforms and re-licenses the output.
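
A minimal sketch of this inheritance rule, under stated assumptions: each source lists the uses it permits, a dataset built from several sources inherits only the uses every source permits, and a reviewed governance decision replaces that inherited baseline with its own list. The function name `permitted_uses` and the `relicensed_uses` parameter are hypothetical.

```python
def permitted_uses(requested, source_permissions, relicensed_uses=None):
    """Split requested uses into (allowed, blocked).

    source_permissions maps a source name to the set of uses it permits.
    relicensed_uses, when given, stands for a reviewed governance decision
    that re-licenses the output; it replaces the inherited baseline.
    """
    if relicensed_uses is not None:
        baseline = set(relicensed_uses)
    elif source_permissions:
        # Inherit only what every contributing source permits.
        baseline = set.intersection(*(set(p) for p in source_permissions.values()))
    else:
        baseline = set()  # no documented source permission, nothing is allowed
    requested = set(requested)
    return requested & baseline, requested - baseline
```

For example, a dataset calibrated on KYC documents (testing, training) and call transcripts (testing only) would inherit "testing" alone until a reviewed decision says otherwise.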

3. Generation architecture patterns

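Different generation methods carry different risks and evidence needs.

Pattern	Use case	Main risks	Evidence focus
Rule-based synthesis	KYC document field combinations, account states, process path testing	Rules too narrow; real distribution missing	business rule coverage, SME approval
Statistical / tabular generator	Credit scenario augmentation, customer segment samples	Distribution distortion, minority subgroup leakage	distribution fidelity, nearest-neighbor privacy
LLM text generation	contact center transcript, complaint narrative, case notes	Memorized source text, reproduced sensitive fragments, tone bias	prompt/source control, plagiarism scan, PII leakage test
Document/image synthesis	KYC document testing, statement extraction	Copied real images, identity information leakage, missing watermark	visual similarity, OCR PII scan, watermark/provenance
Graph / sequence synthesis	payment fraud simulation, AML mule network	Rare pattern leakage; graph structure can be re-identified	graph similarity, rare motif suppression
Hybrid mutation	Rewriting real cases into edge scenarios	The original case can be inferred back	edit distance, source separation, canary cases

The generator itself also needs governance:

Record the generator model, version, prompt, seed, rules, parameters and source sample window.
For LLM generation, do not place sensitive source text directly into an uncontrolled vendor prompt.
For high-risk data, use an isolated generation environment, short-lived workspace, access logging and artifact cleanup.
For recurring generation tasks, maintain a prompt/rule registry so that manual scripts do not become an invisible data processing system.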
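
One way to make the recording requirement checkable is a run manifest with a stable fingerprint. This is a sketch under assumptions: the key names in `REQUIRED_KEYS` are illustrative labels for the items listed above, and the manifest is assumed to hold only JSON-serializable values.

```python
import hashlib
import json

# Illustrative key names for the items the text asks to record.
REQUIRED_KEYS = {
    "generator_model", "generator_version", "prompt_id", "seed",
    "rules_version", "parameters", "source_window",
}


def missing_manifest_keys(manifest: dict) -> list:
    """List the required keys that the run manifest does not contain."""
    return sorted(REQUIRED_KEYS - manifest.keys())


def manifest_fingerprint(manifest: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint on the data card lets a reviewer confirm that a dataset version was produced by the registered configuration and not by an edited local script.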

4. Data card and license layer

Every releasable synthetic dataset should have a data card with an allowed-use license attached.

Data card field	Content
dataset_id / version	Stable ID, version, generation date
purpose	Approved business purpose
source summary	Source systems, time window and field categories, without exposing sensitive detail
generation method	rule/statistical/LLM/document/graph/hybrid, generator version
privacy tests	attack type, result, threshold, exceptions
utility tests	task performance, SME review, business rule pass rate
fidelity tests	distribution similarity, sequence realism, document realism
representativeness	Coverage, gaps, populations for which inference is prohibited
known limitations	Which populations, channels, products or events the data cannot represent
allowed uses	Explicit scope such as testing/training/eval/demo/model training
prohibited uses	Real customer decisions, external sharing, retraining, re-identification, linkage, etc.
retention / expiry	Expiry date, review date, deletion requirements
provenance	W3C PROV-style entity/activity/agent record
watermark / marker	visible/invisible watermark, metadata tag, synthetic flag
owner / approvers	PM, Data Owner, Privacy, Risk, Architecture
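
A data card is easier to enforce when completeness is measured. The sketch below assumes the card is stored as a dictionary; the snake_case field names are hypothetical renderings of the table rows above, with `dataset_id` and `version` tracked separately.

```python
# Hypothetical field names mirroring the data card table.
REQUIRED_FIELDS = [
    "dataset_id", "version", "purpose", "source_summary", "generation_method",
    "privacy_tests", "utility_tests", "fidelity_tests", "representativeness",
    "known_limitations", "allowed_uses", "prohibited_uses", "retention_expiry",
    "provenance", "watermark", "owner_approvers",
]


def data_card_completeness(card: dict):
    """Return (share of required fields filled, list of missing fields).

    A field counts as missing when it is absent or empty.
    """
    missing = [name for name in REQUIRED_FIELDS if not card.get(name)]
    return 1 - len(missing) / len(REQUIRED_FIELDS), missing
```

A release gate could refuse any card with a non-empty missing list, and a catalog report could track the completeness share over time.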

Privacy-utility-fidelity measurement model

A synthetic data release must measure privacy, utility and fidelity together. Measuring only one axis leads to wrong decisions.

1. Three-axis model

Privacy: can an attacker infer a real person, record, document, call or transaction?
Utility: does the dataset support the approved task?
Fidelity: does it preserve the relevant structure of the target domain without copying protected facts?

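Axis	Goal	Common tests	Failure signals
Privacy	Limit inference about real individuals or real records	membership inference, model inversion, nearest-neighbor distance, record linkage, PII scan, canary extraction	Synthetic samples sit too close to rare real samples; customer fields can be reconstructed; an attacker can tell whether a person is in the source data
Utility	Support the approved use	downstream task score, rule coverage, SME rating, workflow pass rate, model validation comparison	No lift after model training; tests fail to find defects; SMEs find the data not credible
Fidelity	Preserve the necessary business structure	distribution similarity, correlation preservation, temporal pattern check, graph motif check, document layout realism	Transaction amounts, complaint tone, KYC document combinations or fraud chains contradict business experience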

2. Privacy attack testing

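High-risk synthetic data should cover at least the following tests.

Attack / test	What it measures	Financial retail example	Example pass criterion
Membership inference	Whether an attacker can tell if a customer or case was in the source data	Whether a rare AML typology case was memorized by the generator	Attack AUC does not exceed the approved threshold; high-risk samples pass manual review
Model inversion	Whether sensitive attributes can be inferred back from the synthetic data or generator output	Inferring customer identity, address or special hardship from a complaint narrative	Sensitive attribute recovery rate is below threshold; no identifiable narrative
Nearest-neighbor similarity	Whether synthetic records are too close to real records	Near-copies of KYC document images, transaction sequences or call transcript sentences	High-similarity samples are removed or down-weighted
Linkage attack	Whether records can be re-identified by joining with external or internal data	A combination of transaction time, merchant and amount identifies a customer	Small groups and rare combinations are generalized, perturbed or barred from release
Canary extraction	Insert rare tokens/records deliberately and test whether the generator reproduces them	Insert a test case marker into the training samples	Generated output must not reproduce the canary
PII / SPI scan	Whether names, addresses, account numbers, phone numbers, identity documents or recording fragments appear	contact center transcript generation	Automated scan and manual sampling both find no unauthorized identifiers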
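
Three of these tests reduce to small computations once the attack scores or records are available. The sketch below is illustrative only: it assumes an attack model has already produced a membership score per record, that tabular records are numeric tuples on comparable scales, and that canaries are plain strings. The function names are hypothetical.

```python
import math


def attack_auc(member_scores, nonmember_scores):
    """Membership inference AUC: chance that a random source record gets a
    higher attack score than a random non-source record (ties count half).
    0.5 means the attacker does no better than guessing."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def too_close(synthetic, real, min_distance):
    """Indexes of synthetic rows whose nearest real row is closer than
    min_distance (Euclidean; features are assumed to be scaled already)."""
    flagged = []
    for i, row in enumerate(synthetic):
        nearest = min(math.dist(row, r) for r in real)
        if nearest < min_distance:
            flagged.append(i)
    return flagged


def leaked_canaries(outputs, canaries):
    """Canary strings that appear verbatim in any generated output."""
    return [c for c in canaries if any(c in text for text in outputs)]
```

The thresholds (approved AUC ceiling, `min_distance`) are governance decisions recorded on the data card; the code only makes the comparison repeatable.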

3. Utility/fidelity scoring

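Use a multi-dimensional scorecard; do not collapse the scores into a single average.

Dimension	Metric	Notes
Task utility	model lift, test defect discovery, workflow success	Whether the data actually helps the approved use
Business rule validity	rule pass rate, impossible-combination rate	For example, KYC document country / document type / validity period combinations must not violate business rules
Distribution fidelity	KS distance, PSI, KL/JS divergence, correlation delta	Suited to tabular/transaction data; needs business interpretation
Temporal fidelity	seasonality, burst pattern, event order validity	Especially important for payment fraud, AML behavior chains and complaint escalation
Text/document fidelity	SME realism score, policy terminology accuracy, layout validity	Contact center, complaint and KYC document testing
Coverage	segment coverage, scenario coverage, edge-case coverage	Look beyond the aggregate to key groups and process paths
Bias / fairness	subgroup utility, label skew, error amplification	Credit, complaint, KYC and customer service scenarios must each be reviewed separately
Stability	re-generation variance, version-to-version drift	Keeps test results comparable across regenerations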
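
Two of the distribution fidelity metrics named above, the KS distance and PSI, can be sketched for a single numeric column in plain Python. This is a simplified illustration: bin edges for PSI are supplied by the caller, and the `eps` floor for empty bins is an assumption, not a standard value.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov distance: the largest gap between
    the empirical CDFs of the real and synthetic columns."""
    r, s = sorted(real), sorted(synth)
    points = sorted(set(r) | set(s))
    return max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in points
    )


def psi(real, synth, edges, eps=1e-6):
    """Population Stability Index over bins defined by sorted edges.
    Bin shares are floored at eps so empty bins do not divide by zero."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(real), shares(synth)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Both values are 0 for identical columns and grow as the synthetic column drifts from the real one. What counts as acceptable drift still needs business interpretation, as the table notes.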

4. Representativeness without copying reality

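Representativeness does not mean copying the real distribution in full. Financial retail synthetic data has to handle three tensions:

Tension	Wrong approach	More mature approach
Rare risk vs privacy	Copy the few real fraud / AML cases directly	Abstract the typology, recombine behavior patterns, remove identifiable combinations
Real distribution vs test coverage	Generate strictly by production share, leaving the long tail thin	Release in layers: core distribution + challenge slice
Group fairness vs sensitive attributes	Generate personas mechanically from sensitive attributes	Use approved fairness review fields and proxy risk controls; explicitly prohibit inference uses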

Allowed-use and release gate design

1. Allowed-use license

A synthetic dataset release should carry a license, much like an API scope.

License element	Example
Allowed purpose	Non-production functional testing and OCR regression for the KYC document ingestion pipeline
Allowed users	KYC platform QA team, data quality team, named vendor test engineers
Allowed environments	enterprise test environment, approved vendor sandbox with retention capped at 30 days
Allowed operations	read, query, transform for test cases, run automated tests
Prohibited operations	train foundation model, enrich customer profiles, external sharing, re-identification, linkage to production customer table
Expiry	Must be deleted or re-approved after 90 days
Derivative rules	Derived data inherits the original license; the synthetic marker must not be removed
Evidence requirements	usage records, test results, deletion proof, exception log
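
Treating the license like an API scope suggests an access check that returns explicit denial reasons. The sketch below is an assumption-laden illustration: the `AllowedUseLicense` class, its field names and the `denial_reasons` function are hypothetical, and the 90-day default only mirrors the expiry example in the table.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AllowedUseLicense:
    allowed_teams: frozenset
    allowed_environments: frozenset
    allowed_operations: frozenset
    issued_on: date
    valid_days: int = 90  # mirrors the 90-day expiry example

    @property
    def expires_on(self) -> date:
        return self.issued_on + timedelta(days=self.valid_days)


def denial_reasons(lic, team, environment, operation, today):
    """Return every reason the request falls outside the license.
    An empty list means the request is within scope."""
    reasons = []
    if today > lic.expires_on:
        reasons.append("license expired")
    if team not in lic.allowed_teams:
        reasons.append("team not licensed")
    if environment not in lic.allowed_environments:
        reasons.append("environment not licensed")
    if operation not in lic.allowed_operations:
        reasons.append("operation not licensed")
    return reasons
```

Anything not explicitly allowed is denied, so prohibited operations such as foundation model training never need to be enumerated exhaustively in code. The denial reasons can feed the exception log listed under evidence requirements.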

2. Release gate levels

Gate	Applicable data	Approval	Required evidence
G0 Draft	Local design only, no sensitive source	Team lead	use-case intake, no sensitive source statement
G1 Internal low-risk	UI demo, mock operational data	PM + Data Owner	data card, basic PII scan, allowed-use license
G2 Controlled test	KYC / complaints / contact center synthetic cases	PM + Architect + Privacy	source permission, privacy tests, utility/fidelity scorecard, retention
G3 Model-impacting	Training, fine-tuning, model validation, credit/fraud augmentation	Data Owner + Model Risk + Privacy + Risk	attack report, bias review, benchmark comparison, risk acceptance
G4 External / vendor release	vendor PoC, offshore test, partner review	Legal + Procurement + Security + Privacy + Business Owner	contract boundary, export controls, watermark/provenance, deletion evidence

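Key gate disciplines:

A low-risk release cannot be upgraded automatically to a high-risk use.
G3/G4 must have a residual risk owner, an expiry and reopen triggers.
If utility is high but a privacy attack test fails, "high business value" cannot bypass the gate; the only options are to narrow the use, regenerate, add controls or reject.
If privacy is strong but utility is insufficient, the data cannot serve as model training or validation data; it can only be used as demo/mock or process testing data.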
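
The last two disciplines describe per-axis hard gates and not a blended score. A minimal sketch of that decision logic follows, assuming each axis has been normalized so that higher is better and that thresholds come from the approved release policy. The function names and outcome labels are illustrative; the labels follow the Reject / Limited release / Promote branches of the concept diagram.

```python
def axis_results(scores: dict, thresholds: dict) -> dict:
    """Compare each axis score with its own threshold (higher is better)."""
    return {axis: scores[axis] >= thresholds[axis] for axis in thresholds}


def release_decision(results: dict) -> str:
    """Apply hard gates per axis; no axis can compensate for another."""
    if not results["privacy"]:
        # Business value cannot override a failed privacy attack test.
        return "reject"
    if not (results["utility"] and results["fidelity"]):
        # Safe but not fit for model training or validation:
        # demo/mock or process testing only.
        return "limited_release"
    return "promote"


# Example: strong utility and fidelity, failed privacy.
scores = {"privacy": 0.40, "utility": 0.95, "fidelity": 0.90}
thresholds = {"privacy": 0.80, "utility": 0.70, "fidelity": 0.70}
average = sum(scores.values()) / len(scores)        # 0.75, looks acceptable
decision = release_decision(axis_results(scores, thresholds))  # "reject"
```

The example shows why averaging is dangerous: the mean of 0.75 would clear a 0.70 bar while the privacy axis alone fails its threshold. In this sketch "reject" sends the dataset back to the generation plan, where the team can narrow the use, regenerate or add controls.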

Financial retail scenarios

1. AML typology training

The goal is to train analysts to recognize typologies, not to replicate real SAR cases.

Design point	Requirement
Source permission	Financial crime case data is highly sensitive; an explicit training/simulation basis is required
Generation method	typology abstraction + graph/sequence synthesis
Privacy tests	rare case nearest-neighbor, canary extraction, linkage on amount/time/counterparty pattern
Utility tests	analyst training score, typology coverage, false narrative review
Allowed use	staff training, tabletop exercise, model challenge set
Prohibited use	Customer risk scoring, real alert disposition, external sharing

2. KYC document testing

The goal is to test document ingestion, OCR, field extraction and exception routing.

Design point	Requirement
Source permission	Reuse of real identity document images is prohibited; use templates, field rules and synthetic images
Generation method	rule-based document layout + synthetic image + OCR noise injection
Privacy tests	OCR PII scan, visual similarity to real docs, metadata scrub
Utility tests	extraction accuracy regression, exception path coverage
Fidelity tests	Document layout, language, expiry date and document type combinations conform to policy
Release gate	Vendor sandbox requires G4; internal QA is typically G2

3. Credit model scenario augmentation

The goal is to extend edge scenarios, not to replace real model validation.

Design point	Requirement
Source permission	Credit data, bureau-like attributes and protected-class proxy risk require strict review
Generation method	constrained tabular synthesis + policy scenario authoring
Privacy tests	nearest-neighbor, linkage, membership inference
Utility tests	model sensitivity, challenger model behavior, adverse action reason sanity
Bias review	Subgroup coverage, proxy amplification, stability of decline reasons
Prohibited use	Must not be used directly as primary production training data without model risk approval

4. Payment fraud simulation

The goal is to stress-test fraud rules, agent alert triage and payment warning UX.

Design point	Requirement
Source permission	Fraud cases and payment events require minimal fields and typology abstraction
Generation method	transaction sequence / graph synthesis + adversarial pattern mutation
Privacy tests	unique sequence leakage, counterparty linkage, rare merchant pattern
Utility tests	rule trigger coverage, investigator realism score, alert fatigue estimate
Fidelity tests	Plausibility of timing, amount, merchant category and device signal
Allowed use	sandbox rule stress test, analyst training, warning copy testing

5. Contact center transcript generation

The goal is to generate controlled transcripts for copilot QA, intent classification and agent training.

Design point	Requirement
Source permission	Call recordings/transcripts typically contain highly sensitive voice data, identity verification and vulnerable customer information
Generation method	intent-conditioned LLM generation, redaction-first summaries, no raw transcript prompt for uncontrolled LLM
Privacy tests	PII/SPI scan, phrase similarity, membership inference sample test
Utility tests	intent label quality, policy answer coverage, escalation trigger coverage
Bias review	Language clarity, accent proxy, vulnerability phrase handling
Prohibited use	Reconstructing real customer conversations, inferring employee performance, unapproved sentiment profiling

6. Complaint analytics

The goal is to extend complaint themes and root-cause analytics coverage, not to manufacture false complaint metrics.

Design point	Requirement
Source permission	complaints may include legal privilege, regulator-sensitive narratives, vulnerable customer details
Generation method	theme-level synthesis + policy taxonomy + controlled narrative variation
Privacy tests	narrative uniqueness, linkage to public complaint, PII scan
Utility tests	taxonomy classifier performance, root-cause label quality, QA reviewer score
Fidelity tests	product, channel, issue, harm, resolution structure
Allowed use	taxonomy test, complaint copilot challenge set, QA calibration
Prohibited use	Management reporting of real complaint rates, regulatory reporting, customer profiling

Metrics/control/evidence model

1. Metrics

Category	Metric	Evidence
Privacy	membership inference AUC, nearest-neighbor distance, PII leakage rate, linkage success rate	attack report, scan logs, reviewer sign-off
Utility	downstream task lift, workflow pass rate, defect discovery rate, SME usefulness score	benchmark run, QA test result, SME review
Fidelity	distribution delta, rule validity rate, temporal pattern score, transcript realism	statistical report, rule engine output, review rubric
Coverage	scenario coverage, subgroup coverage, edge-case count, missing segment list	coverage matrix, gap register
Bias	subgroup utility delta, label skew, stereotype phrase rate, adverse reason stability	fairness review, bias checklist
Governance	data card completeness, license coverage, expired dataset count, unauthorized use attempts	catalog report, access logs, exception log
Operations	regeneration reproducibility, version drift, deletion proof timeliness	pipeline log, provenance graph, retention evidence

2. Control model

Control	Purpose	Owner
Intake approval	Prevent vague or unauthorized synthetic data requests	PM / BA / Business Owner
Source permission gate	Ensure source data can support the proposed transformation	Data Owner / Privacy / Legal
Generation environment control	Prevent uncontrolled exposure during generation	Security / Platform
Privacy attack suite	Test leakage and re-identification risk	Privacy Engineering / Data Science
Utility/fidelity scorecard	Confirm fit for approved use	Product / SME / Model Risk
Bias and coverage review	Detect representational harm and amplified historical bias	Risk / Fairness / Business SME
Data card	Make limitations and provenance inspectable	Data Product Owner
Allowed-use license	Prevent downstream misuse and purpose creep	Data Governance / PM
Watermark/provenance	Preserve synthetic identity across copies and derivatives	Data Platform
Release gate	Make go/no-go decision auditable	Architecture / Risk / Data Owner
Retention and retirement	Avoid long-lived uncontrolled synthetic assets	Records / Data Governance

3. Evidence packet

At go-live or release, retain at least:

use-case intake and risk tier
source permission matrix and data classification
generation plan, generator version, prompts/rules, source window
privacy attack report and exceptions
utility/fidelity scorecard
bias/coverage review
data card and allowed-use license
watermark/provenance evidence
approver list, residual risk owner, expiry date
retention/deletion requirements
downstream access logs and monitoring plan

Anti-patterns and failure modes

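Anti-pattern	Symptom	Consequence	Fix
"Synthetic means safe"	No source permission check or privacy attack test	Data leakage, compliance failure, vendor misuse	Establish release gates for G2 and above
One dataset for everything	The same data is used for demo, eval, training and vendor PoC	Purpose creep, mismatched evidence	Split license and version by use
Fidelity worship	Chasing near-identity with real data	Replicates privacy risk and historical bias	Set a privacy floor and a bias review
Privacy-only release	Data is very safe but has no business utility	Misleading tests, degraded model quality	Require utility/fidelity scores as well
Demo-to-production drift	Demo mock data is left in product training or test pipelines	Fake distributions contaminate real systems	catalog + expiry + pipeline marker
Unmarked derivatives	Downstream processing drops the synthetic marker	Untraceable; mistaken for real data	watermark/provenance inheritance
Vendor sandbox sprawl	Synthetic data is sent to multiple vendors with no deletion proof	Third-party risk expands	G4 gate + contract + deletion evidence
Rare case copying	High-value rare cases are renamed and reused directly	Membership inference and customer identification	typology abstraction + nearest-neighbor suppression
Bias laundering	Historical discriminatory labels are replicated under the name of "synthetic"	Fairness risk expands	subgroup review + label governance
Metric averaging	Privacy, utility and fidelity are combined into one total score	Masks fatal failures	Three-axis thresholds; failing any hard gate rejects the requested use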

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

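AI pattern	Synthetic data value	Key governance points
RAG	Generate policy Q&A, complaint narratives and KYC exception cases to test retrieval grounding	Synthetic documents must not be mixed into the real knowledge base unmarked; citations must state the synthetic source
Agent	Simulate payment fraud workflow, case triage and tool-call boundaries	Synthetic transactions/customers must not reach production tools; agent action logs mark synthetic runs
Copilot	contact center draft, AML narrative, complaint response QA	Calibrate output quality against a real SME rubric; synthetic transcripts must not be treated as real customer evidence
Eval	Extend edge scenarios, privacy-preserving test sets and regression edge cases	Requires an oracle / expected behavior; does not repeat eval dataset lifecycle; the focus is the synthetic source license
Governance	Bring synthetic data into the catalog, risk register and release gate as an AI data product	Data card, allowed use, retention, provenance and attack evidence are auditable

Key architecture principle: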

Synthetic data may enter AI workflows only through typed, versioned, licensed datasets.
It should never be an invisible folder of files copied into prompts, notebooks or vendor sandboxes.

ADR draft

Field	Content
ADR ID	ADR-AI-DATA-173
Title	Adopt governed synthetic data release gates for financial retail AI development and testing
Status	Proposed
Context	AI teams need synthetic AML, KYC, credit, fraud, contact center and complaint datasets for testing, training, eval, sandbox and vendor PoC. Current ad hoc generation creates privacy, utility, bias, provenance and purpose-creep risk.
Decision	Treat synthetic data as a governed data product. Every G2+ synthetic dataset must pass use-case intake, source permission review, generator risk review, privacy attack testing, utility/fidelity scorecard, bias/coverage review, data card, allowed-use license, watermark/provenance tagging, retention rule and release gate approval.
Scope	Applies to synthetic datasets derived from, calibrated by or intended to resemble financial retail customer, transaction, case, document, complaint, call, credit, fraud or AML data.
Options considered	1. Team-level ad hoc generation. 2. Central synthetic data platform without release gates. 3. Governed synthetic data product lifecycle with risk-tiered gates.
Decision rationale	Option 3 balances speed and control. It allows low-risk mock data to move quickly while requiring evidence for high-risk model-impacting, customer-like or externally released datasets.
Consequences	Teams must plan evidence work earlier. Synthetic datasets become cataloged assets with owners, expiry and allowed use. Some datasets will be rejected or limited despite high utility if privacy or bias evidence fails.
Controls	Source permission matrix, privacy attack suite, utility/fidelity scorecard, bias review, data card, license, watermark/provenance, access logging, retention/deletion evidence.
Residual risk	Privacy attacks are not exhaustive; utility evidence may not transfer to production; downstream teams may misuse derivatives without strong catalog and access controls.
Reopen triggers	New source data class, external release, use for model training, privacy incident, bias finding, attack threshold breach, regulatory/legal interpretation change, dataset expiry, generator model change.

Interview answer

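30-second version

I treat synthetic data as a governed data product, not as "fake data". The core architecture is use-case intake, source permission, generation risk review, privacy attack testing, utility/fidelity measurement, bias/coverage review, data card, allowed-use license, watermark/provenance, retention and release gate. That supports AML, KYC, credit, fraud, contact center and complaint AI scenarios without letting synthetic data be misused as an unrestricted training or externally shared asset.

2-minute version

In financial retail, the value of synthetic data is to reduce exposure of real customer data in development, testing, training and vendor PoCs while extending long-tail scenarios. The risk is that teams easily assume "synthetic means safe". I start from the approved purpose and state whether the data is for KYC OCR testing, AML typology training, payment fraud simulation or contact center copilot QA. Then I check source data permission, because synthetic data does not automatically escape the purpose limitations and contractual boundaries of the original data.

Architecturally, I set up a three-axis evaluation: privacy, utility and fidelity. Privacy covers membership inference, model inversion, nearest-neighbor, linkage, canary and PII scan; utility asks whether the data supports the approved task; fidelity asks whether the business structure is realistic without over-copying real individuals. I add a representativeness and bias amplification review so that credit, complaint, KYC and contact center scenarios do not replicate historical bias.

Finally, every dataset needs a data card and an allowed-use license stating what it may be used for, what is prohibited, who can access it, when it expires, whether it can be shared externally and how derived data is handled. The release gate is tiered by risk; high-risk training, model validation or vendor release requires evidence-based sign-off from Privacy, Model Risk, Legal, Security and the Business Owner.

CTO version

I would not position a synthetic data platform as a pure generation tool; I would position it as an AI data product control plane. A CTO cares about reuse speed, platform consistency, audit pressure and blast radius. My recommendation is a lightweight but strongly constrained lifecycle: catalog + generator registry + attack workbench + scorecard + license + provenance + retention.

Engineering teams can then request and consume synthetic data through a standard interface instead of reinventing scripts per project; risk teams can see source permission, attack evidence and residual risk; the platform can use watermark, metadata, access policy and expiry to keep synthetic data from running out of control in notebooks, RAG indexes, agent sandboxes and vendor environments. The business value is not "we have more fake data", but "we can test high-risk AI scenarios faster while moving control of privacy, bias, purpose drift and audit untraceability risk earlier in the lifecycle".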

7-day practice plan

Day	Focus	Practice output
1	Synthetic data intake	Write the use-case intake, risk tier and allowed/prohibited use for AML typology training
2	Source permission	Build the source permission matrix and exclusion rules for KYC document testing
3	Generation design	Design the graph/sequence synthesis plan and generator risk review for payment fraud simulation
4	Privacy attacks	Design membership inference, PII scan, phrase similarity and canary tests for contact center transcript generation
5	Utility/fidelity	Build a scorecard for credit model scenario augmentation, covering task utility, distribution fidelity and bias review
6	Data card/license	Write the data card and allowed-use license for the complaint analytics synthetic dataset
7	Release gate and ADR	Complete a G2/G3 release evidence packet, write an ADR and prepare the 2-minute interview narrative

Source anchors with links

Source	Link	Purpose
UK ICO synthetic data guidance	https://ico.org.uk/about-the-ico/research-reports-impact-and-evaluation/research-and-reports/technology-and-innovation/synthetic-data/	Anchor for synthetic data privacy risk, utility, governance and misuse risk
NIST Privacy Framework	https://www.nist.gov/privacy-framework	Organize source permission, privacy attack testing, controls and evidence with privacy risk management thinking
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Organize AI synthetic data risk governance and release gates by Govern / Map / Measure / Manage
ISO/IEC 42001	https://www.iso.org/standard/81230.html	Design roles, operation, performance evaluation, internal audit and continual improvement with AI management system thinking
ISO/IEC 23894	https://www.iso.org/standard/77304.html	Organize risk identification, analysis, treatment and monitoring with AI risk management lifecycle thinking
W3C PROV	https://www.w3.org/TR/prov-overview/	Describe synthetic data provenance, generation activities and responsible parties with the entity / activity / agent model