AI Synthetic Data Governance: Privacy-Utility-Fidelity Architecture

Date: 2026-06-30

502ai-foundations/papers/173-ai-synthetic-data-governance-privacy-utility-fidelity-architecture.md

AI synthetic data governance architecture: Synthetic Data Governance / Privacy-Utility-Fidelity Architecture

Status: evergreen
Audience: experienced CBAP / financial retail PM / data product architect / AI architect / privacy-risk partner
Output: advanced architecture note, governance model, release gate design, ADR draft, interview-ready narrative


Why synthetic data governance matters

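Synthetic data is not "fake data", and it is not a safe asset that appears automatically once real customer data is fed into a generator. For financial retail AI, synthetic data is a governed data product: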

Synthetic data product = approved purpose + source permission + generation method + privacy attack evidence + utility/fidelity score + allowed-use license + release gate.

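Its value is that teams can support AI development, process testing, edge-case expansion, staff training and vendor PoCs without copying sensitive data without limit. It also introduces new risks:

| Risk | Why senior teams must govern it |
| --- | --- |
| Residual privacy leakage | The generator may memorize rare customers, transactions, complaints, document images or call fragments, creating membership inference, model inversion or record linkage risk |
| False safety belief | Teams assume "synthetic" means anonymous by default, then move the data into a vendor sandbox, training corpus or low-control environment |
| Utility theater | The data looks realistic but cannot support the business task, model training, process testing or control validation |
| Fidelity overfit | Chasing realism too hard reproduces the privacy exposure, bias, historical errors and outlier cases of the original distribution |
| Bias amplification | The generator amplifies distribution errors for minority groups, complaint language patterns, fraud label bias or credit history bias |
| Purpose creep | Data generated for KYC testing is used to train a marketing model, or AML typology data is reused for employee performance monitoring |
| Provenance collapse | Downstream users do not know who generated the data, which sources it is based on, what it may be used for, how long it is retained or whether it may be redistributed |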

The key judgment for a senior PM / BA / Architect is not "can we generate a batch of data", but:

Which business decision can this synthetic dataset support,
what real data permissions authorize its creation,
what privacy attacks has it survived,
what utility and fidelity evidence proves it is fit for use,
and what license prevents misuse after release?

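This note deliberately does not repeat four adjacent topics:

  • Synthetic user simulation focuses on the persona / scenario / journey lab; this note focuses on governing and releasing a synthetic dataset as a data product.
  • Privacy clean room focuses on multi-party data collaboration and controlled computation; this note focuses on synthetic data assets inside the enterprise or in controlled vendor settings.
  • Differential privacy focuses on the mathematical privacy budget; this note treats DP as one optional control, not as the whole solution.
  • Eval dataset lifecycle focuses on versioning, labeling and regression governance of eval sets; this note focuses on whether synthetic data can be released, for what purpose, and whether the evidence is sufficient.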

Concept diagram

flowchart TB
  A[Use-case intake<br/>purpose, users, risk tier,<br/>allowed decision] --> B[Source permission gate]
  B --> C[Source data registry<br/>classification, consent, contract,<br/>lineage, retention, owner]
  C --> D[Generation plan<br/>method, prompts, rules, model,<br/>sampling, constraints]
  D --> E[Generator risk review<br/>memorization, rare cases,<br/>prompt leakage, vendor boundary]
  E --> F[Synthetic dataset build]
  F --> G[Privacy attack workbench<br/>membership inference,<br/>model inversion, linkage,<br/>nearest-neighbor, canary tests]
  F --> H[Utility and fidelity lab<br/>task performance, distribution,<br/>business rules, SME review]
  F --> I[Representativeness and bias review<br/>coverage, subgroup error,<br/>amplification, missingness]
  G --> J[Data card and evidence packet]
  H --> J
  I --> J
  J --> K{Release gate}
  K -->|Reject| D
  K -->|Limited release| L[Allowed-use license<br/>scope, users, prohibited uses,<br/>expiry, retention, watermark]
  K -->|Promote| M[Dataset catalog<br/>provenance, version,<br/>access control, monitoring]
  M --> N[Downstream use<br/>RAG, Agent, Copilot,<br/>Eval, training, QA, sandbox]
  N --> O[Usage telemetry and incident feedback]
  O --> A

Core closed loop:

business purpose
  -> source permission
  -> governed generation
  -> privacy / utility / fidelity / bias evidence
  -> allowed-use release
  -> monitored downstream use
  -> renewal, restriction or retirement

Core architecture model

1. Use-case intake as the first control

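Synthetic data governance starts at intake, not at the generator. A mature intake states clearly why synthetic data is needed.

| Field | Design requirement | Financial retail example |
| --- | --- | --- |
| use_case_id | Stable identifier bound to the PRD, ADR, data card and release gate | syn-aml-typology-training-v1 |
| business purpose | States the decision or test to be supported; must not be generalized to "AI R&D" | Train AML analysts to recognize mule network typology |
| target users | Who may access the data, and whether that includes vendors, contractors or offshore teams | Financial crime training team, model validation |
| risk tier | Tiered by data sensitivity, customer impact, external release scope and degree of automation | KYC document testing is high; UI demo mock data is low |
| requested output | tabular, transcript, image, document, transaction graph, mixed dataset | contact center transcript + intent label |
| downstream use | testing, training, eval, demonstration, sandbox, analytics, model training | payment fraud simulation for rule stress test |
| prohibited use | States explicitly what must not be done | Not for real customer decisions, marketing targeting or employee appraisal |
| evidence needed | Which privacy, utility, fidelity and bias evidence is required | membership inference test, SME utility review |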

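Product principles for intake:

  • One synthetic dataset cannot serve an unlimited number of purposes.
  • High-risk uses must have their allowed use written as a license, not in an email.
  • If the real source data carries no permission for a given use, synthesis does not grant that permission automatically.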
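
The intake fields and principles above can be sketched as a small validation step. This is a minimal sketch, not a definitive implementation: the class `UseCaseIntake`, the function `intake_issues`, the set of overly generic purposes and the cap of two downstream uses are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical examples of purposes that are too generic to approve.
GENERIC_PURPOSES = {"ai r&d", "ai research", "ai development"}
MAX_DOWNSTREAM_USES = 2  # assumed cap: one dataset should not serve unlimited purposes


@dataclass
class UseCaseIntake:
    use_case_id: str
    business_purpose: str
    target_users: list
    risk_tier: str            # "low" or "high", as in the intake table
    requested_output: str
    downstream_use: list
    prohibited_use: list
    evidence_needed: list


def intake_issues(intake: UseCaseIntake) -> list:
    """Return the reasons why an intake should be sent back to the requester."""
    issues = []
    if intake.business_purpose.strip().lower() in GENERIC_PURPOSES:
        issues.append("business purpose is too generic")
    if len(intake.downstream_use) > MAX_DOWNSTREAM_USES:
        issues.append("too many downstream uses for one dataset")
    if intake.risk_tier == "high" and not intake.prohibited_use:
        issues.append("high-risk intake must list prohibited uses")
    if intake.risk_tier == "high" and not intake.evidence_needed:
        issues.append("high-risk intake must list required evidence")
    return issues


aml_training = UseCaseIntake(
    use_case_id="syn-aml-typology-training-v1",
    business_purpose="Train AML analysts to recognize mule network typology",
    target_users=["Financial crime training team", "model validation"],
    risk_tier="high",
    requested_output="transaction graph",
    downstream_use=["training"],
    prohibited_use=["real customer decisions", "marketing targeting", "employee appraisal"],
    evidence_needed=["membership inference test", "SME utility review"],
)
print(intake_issues(aml_training))
```

An empty list means the intake can move on to the source permission gate; any entry sends it back before a generator is touched.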

2. Source permission and lineage layer

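The legality, compliance and explainability of synthetic data depend on source data permission. Architecturally, establish a source permission matrix.

| Source dimension | Key question | Control evidence |
| --- | --- | --- |
| data owner | Who approved this source for synthetic data generation | data owner approval, system of record |
| data class | Whether it contains PII, SPI, financial crime data, credit data, complaint data or call recordings | data classification record |
| purpose basis | Whether the original collection purpose allows synthetic data generation | privacy / legal / compliance review |
| consent / notice | Whether customer notices, internal policy or contracts support this processing | notice mapping, consent basis |
| contractual boundary | Whether vendor or partner data may be used for derived data, synthetic data or redistribution | contract clause summary |
| retention | How long the original data, intermediate artifacts and synthetic output are retained | retention schedule |
| jurisdiction | Regions where data is processed and stored | regional control record |
| exclusion rules | Which fields, groups, samples or events are barred from the generation pipeline | source filter log |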

The key insight about source permission:

Synthetic output inherits constraints from source data unless a reviewed governance decision narrows, transforms and re-licenses the output.
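
The inheritance rule can be illustrated with set operations. The sketch below assumes each source exposes a set of permitted uses and that a reviewed governance decision is represented as an explicit set of re-licensed uses; the function names `inherited_uses` and `effective_uses` and the source names are hypothetical.

```python
def inherited_uses(source_permissions: dict) -> set:
    """Uses permitted by every source that feeds the generator."""
    permitted = None
    for uses in source_permissions.values():
        permitted = set(uses) if permitted is None else permitted & set(uses)
    return permitted or set()


def effective_uses(requested: set, source_permissions: dict,
                   reviewed_relicense: frozenset = frozenset()) -> set:
    """Requested uses that survive inheritance, plus any use that a reviewed
    governance decision has explicitly re-licensed (assumed representation)."""
    return requested & (inherited_uses(source_permissions) | reviewed_relicense)


# Hypothetical sources and their permitted uses.
sources = {
    "kyc_documents": {"testing", "training"},
    "vendor_feed": {"testing"},
}
print(effective_uses({"testing", "marketing"}, sources))  # {'testing'}
```

The most restrictive source wins by default, and widening the scope requires an explicit, recorded decision rather than the act of synthesis itself.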

3. Generation architecture patterns

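Different generation methods carry different risks and evidence needs.

| Pattern | Applicable scenarios | Main risks | Evidence focus |
| --- | --- | --- | --- |
| Rule-based synthesis | KYC document field combinations, account states, process path testing | Rules too narrow, real distribution missing | business rule coverage, SME approval |
| Statistical / tabular generator | Credit scenario augmentation, customer segment samples | Distribution distortion, minority subgroup leakage | distribution fidelity, nearest-neighbor privacy |
| LLM text generation | contact center transcript, complaint narrative, case notes | Memorized source text, reproduced sensitive fragments, tone bias | prompt/source control, plagiarism scan, PII leakage test |
| Document/image synthesis | KYC document testing, statement extraction | Copied real images, identity information leakage, missing watermark | visual similarity, OCR PII scan, watermark/provenance |
| Graph / sequence synthesis | payment fraud simulation, AML mule network | Rare pattern leakage, re-identifiable graph structure | graph similarity, rare motif suppression |
| Hybrid mutation | Rewriting real cases into edge-case scenarios | The original case can be inferred back | edit distance, source separation, canary cases |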

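The generator itself must also be governed:

  • Record the generator model, version, prompt, seed, rules, parameters and source sample window.
  • For LLM generation, never place sensitive source text directly into an uncontrolled vendor prompt.
  • For high-risk data, use an isolated generation environment, a short-lived workspace, access logging and artifact cleanup.
  • For recurring generation tasks, maintain a prompt/rule registry so that hand-written scripts do not become an invisible data processing system.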
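
One way to make the first bullet concrete is a run manifest with a stable fingerprint. This is a sketch under stated assumptions: `GenerationRun` and `run_fingerprint` are hypothetical names, and storing only a hash of the prompt in the manifest (with the prompt text kept in the registry) is an illustrative design choice, not something the note prescribes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class GenerationRun:
    generator_model: str
    generator_version: str
    prompt_sha256: str    # hash only; the prompt text lives in the prompt/rule registry
    seed: int
    rules_version: str
    parameters: tuple     # sorted (name, value) pairs
    source_window: str    # e.g. "2025-01-01/2025-06-30"


def run_fingerprint(run: GenerationRun) -> str:
    """Stable identifier: identical inputs always give the same fingerprint."""
    return sha256_text(json.dumps(asdict(run), sort_keys=True))


# All values below are made up for illustration.
run = GenerationRun(
    generator_model="tabular-generator",
    generator_version="0.4.1",
    prompt_sha256=sha256_text("intent-conditioned transcript prompt v3"),
    seed=42,
    rules_version="kyc-rules-2026-05",
    parameters=(("rows", 10000), ("temperature", 0.7)),
    source_window="2025-01-01/2025-06-30",
)
print(run_fingerprint(run)[:12])
```

Recording the fingerprint on the data card lets a reviewer confirm that two dataset versions really came from different generation inputs.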

4. Data card and license layer

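Every releasable synthetic dataset should have a data card with an allowed-use license attached.

| Data card field | Content |
| --- | --- |
| dataset_id / version | Stable ID, version, generation date |
| purpose | Approved business purpose |
| source summary | Source systems, time window and field categories, without exposing sensitive detail |
| generation method | rule/statistical/LLM/document/graph/hybrid, generator version |
| privacy tests | attack type, result, threshold, exceptions |
| utility tests | task performance, SME review, business rule pass rate |
| fidelity tests | distribution similarity, sequence realism, document realism |
| representativeness | Coverage, gaps, populations about which inference is prohibited |
| known limitations | Populations, channels, products and events the data cannot represent |
| allowed uses | Explicit scope such as testing/training/eval/demo/model training |
| prohibited uses | Real customer decisions, external release, retraining, re-identification, linkage and similar |
| retention / expiry | Expiry date, review date, deletion requirements |
| provenance | W3C PROV-style entity/activity/agent record |
| watermark / marker | visible/invisible watermark, metadata tag, synthetic flag |
| owner / approvers | PM, Data Owner, Privacy, Risk, Architecture |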
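
The provenance field can be illustrated with a plain dictionary that borrows the W3C PROV vocabulary (entity, activity, agent, and the relations wasGeneratedBy, used, wasDerivedFrom, wasAssociatedWith). This is a loose sketch, not a conformant PROV serialization; the function name `provenance_record` and every identifier in the example are hypothetical.

```python
def provenance_record(dataset_id, source_ids, run_id, generator, approver):
    """Build a PROV-style record linking a synthetic dataset to its sources,
    the generation run that produced it and the accountable agent."""
    entities = {dataset_id: {"type": "synthetic-dataset"}}
    for source_id in source_ids:
        entities[source_id] = {"type": "source-summary"}
    return {
        "entity": entities,
        "activity": {run_id: {"generator": generator}},
        "agent": {approver: {"role": "approver"}},
        "wasGeneratedBy": [(dataset_id, run_id)],
        "used": [(run_id, s) for s in source_ids],
        "wasDerivedFrom": [(dataset_id, s) for s in source_ids],
        "wasAssociatedWith": [(run_id, approver)],
    }


record = provenance_record(
    dataset_id="syn-kyc-docs-v1",
    source_ids=["kyc-template-library", "field-rule-set"],
    run_id="run-2026-06-30-001",
    generator="rule-based document layout",
    approver="data-owner-kyc",
)
print(record["wasDerivedFrom"])
```

Keeping source entries as summaries rather than raw references matches the data card rule that the source summary must not expose sensitive detail.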

Privacy-utility-fidelity measurement model

A synthetic data release must measure privacy, utility and fidelity together. Measuring only one dimension leads to wrong decisions.

1. Three-axis model

Privacy: can an attacker infer a real person, record, document, call or transaction?
Utility: does the dataset support the approved task?
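Fidelity: does it preserve the relevant structure of the target domain without copying protected facts?

| Axis | Goal | Common tests | Failure signals |
| --- | --- | --- | --- |
| Privacy | Limit inference about real individuals or real records | membership inference, model inversion, nearest-neighbor distance, record linkage, PII scan, canary extraction | Synthetic samples sit too close to rare real samples; customer fields can be reconstructed; an attacker can tell whether a person was in the source data |
| Utility | Support the approved use | downstream task score, rule coverage, SME rating, workflow pass rate, model validation comparison | No lift after model training; tests fail to surface defects; SMEs find the data not credible |
| Fidelity | Preserve the necessary business structure | distribution similarity, correlation preservation, temporal pattern check, graph motif check, document layout realism | Transaction amounts, complaint tone, KYC document combinations or fraud chains contradict business experience |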

2. Privacy attack testing

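High-risk synthetic data should cover at least the following tests.

| Attack / test | What it measures | Financial retail example | Example pass criterion |
| --- | --- | --- | --- |
| Membership inference | Whether an attacker can tell that a given customer or case was in the source data | Whether the generator memorized a rare AML typology case | Attack AUC no higher than the approved threshold; high-risk samples pass manual review |
| Model inversion | Whether sensitive attributes can be inferred back from the synthetic data or generator output | Inferring customer identity, address or special hardship from a complaint narrative | Sensitive-attribute recovery rate below threshold; no identifiable narrative |
| Nearest-neighbor similarity | Whether synthetic records are too close to real records | Near-copies of KYC document images, transaction sequences or call transcript sentences | Highly similar samples are removed or down-weighted |
| Linkage attack | Whether records can be re-identified by joining with external or internal data | Identifying a customer from a combination of transaction time, merchant and amount | Small groups / rare combinations are generalized, perturbed or barred from release |
| Canary extraction | Deliberately insert rare tokens/records and test whether the generator reproduces them | Insert a test case marker into the training sample | Generated output must not reproduce the canary |
| PII / SPI scan | Whether names, addresses, account numbers, phone numbers, identity documents or recording fragments appear | contact center transcript generation | Automated scan and manual spot check both find no unauthorized identifiers |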
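
Three of the tests in the table reduce to short computations. The sketch below is illustrative only: it assumes the attack scores come from some separate membership inference attack, that records are numeric tuples already on a common scale, and that canaries are plain text markers. The function names and all sample values are hypothetical.

```python
import math


def attack_auc(member_scores, nonmember_scores):
    """AUC of a membership inference attack: the probability that a real
    source member receives a higher attack score than a non-member (ties
    count half). 0.5 means the attacker does no better than guessing."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))


def too_close(synthetic, real, min_distance):
    """Indices of synthetic records closer than min_distance (Euclidean)
    to any real record."""
    return [
        i for i, s in enumerate(synthetic)
        if min(math.dist(s, r) for r in real) < min_distance
    ]


def leaked_canaries(generated_texts, canaries):
    """Canary markers that reappear verbatim in generated output."""
    return sorted(c for c in canaries if any(c in t for t in generated_texts))


real = [(0.0, 0.0), (10.0, 10.0)]
synthetic = [(0.1, 0.0), (5.0, 5.0)]
print(too_close(synthetic, real, min_distance=0.5))       # [0]
print(attack_auc([0.9, 0.8], [0.1, 0.2]))                 # 1.0
print(leaked_canaries(["call about CANARY-7f3a"], ["CANARY-7f3a", "CANARY-0000"]))
```

An AUC of 1.0 as in the toy example would be a clear failure; the approved threshold and the distance floor are governance decisions, not constants that this sketch can supply.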

3. Utility/fidelity scoring

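Use a multi-dimensional scorecard; do not compress the scores into a single average.

| Dimension | Metrics | Notes |
| --- | --- | --- |
| Task utility | model lift, test defect discovery, workflow success | Whether the data really helps the approved use |
| Business rule validity | rule pass rate, impossible-combination rate | For example, KYC document country / document type / validity period combinations must not violate business rules |
| Distribution fidelity | KS distance, PSI, KL/JS divergence, correlation delta | Suited to tabular/transaction data; needs business interpretation |
| Temporal fidelity | seasonality, burst pattern, event order validity | Especially important for payment fraud, AML behavior chains and complaint escalation |
| Text/document fidelity | SME realism score, policy terminology accuracy, layout validity | contact center, complaint and KYC document testing |
| Coverage | segment coverage, scenario coverage, edge-case coverage | Look at key groups and process paths, not only the aggregate |
| Bias / fairness | subgroup utility, label skew, error amplification | Must be reviewed separately for credit, complaint, KYC and contact center scenarios |
| Stability | re-generation variance, version-to-version drift | Prevents each regeneration from making test results incomparable |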
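
Two of the distribution fidelity metrics named in the scorecard, KS distance and PSI, can be computed with the standard library. This is a minimal sketch assuming a single numeric column, caller-supplied bin edges for PSI and a small floor `eps` to avoid taking the log of zero; the sample amounts are made up.

```python
import math
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: the largest gap between the
    empirical cumulative distributions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index over bins defined by sorted edges:
    sum over bins of (actual share - expected share) * ln(actual / expected)."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


real_amounts = [12.0, 40.0, 55.0, 90.0, 130.0, 400.0]
synthetic_amounts = [10.0, 42.0, 60.0, 85.0, 150.0, 380.0]
print(round(ks_distance(real_amounts, synthetic_amounts), 3))
print(round(psi(real_amounts, synthetic_amounts, edges=[50.0, 100.0]), 3))
```

Both numbers are zero for identical samples and grow as the synthetic column drifts from the real one; as the table notes, what counts as acceptable drift still needs business interpretation.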

4. Representativeness without copying reality

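Representativeness does not mean copying the real distribution in full. Financial retail synthetic data has to handle three tensions:

| Tension | Wrong approach | More mature approach |
| --- | --- | --- |
| Rare risk vs privacy | Copy the few real fraud / AML cases directly | Abstract the typology, recombine behavior patterns, remove identifiable combinations |
| Real distribution vs test coverage | Generate strictly at production proportions, leaving the long tail thin | Layered release: core distribution + challenge slice |
| Group fairness vs sensitive attributes | Generate personas mechanically from sensitive attributes | Use approved fairness review fields and proxy risk controls, and explicitly prohibit inference uses |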

Allowed-use and release gate design

1. Allowed-use license

A synthetic dataset release should carry a license, much like an API scope.

| License element | Example |
| --- | --- |
| Allowed purpose | Non-production functional testing and OCR regression for the KYC document ingestion pipeline |
| Allowed users | KYC platform QA team, data quality team, named vendor test engineers |
| Allowed environments | enterprise test environment, approved vendor sandbox with no retention after 30 days |
| Allowed operations | read, query, transform for test cases, run automated tests |
| Prohibited operations | train foundation model, enrich customer profiles, external sharing, re-identification, linkage to production customer table |
| Expiry | Must be deleted or re-approved after 90 days |
| Derivative rules | Derived data inherits the original license; the synthetic marker must not be removed |
| Evidence requirements | Usage records, test results, deletion proof, exception log |
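
Treating the license like an API scope suggests a machine-checkable form. The sketch below is one possible shape, assuming exact string matching on user groups, environments and operations; `AllowedUseLicense`, `check_access`, `derive_license`, the dataset IDs and the dates are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class AllowedUseLicense:
    dataset_id: str
    allowed_users: frozenset
    allowed_environments: frozenset
    allowed_operations: frozenset
    prohibited_operations: frozenset
    expires_on: date
    synthetic_marker: bool = True


def check_access(lic, user_group, environment, operation, today):
    """Return (allowed, reason). Expiry and prohibitions win over allowances."""
    if today > lic.expires_on:
        return False, "license expired"
    if operation in lic.prohibited_operations:
        return False, "operation is prohibited"
    if user_group not in lic.allowed_users:
        return False, "user group not licensed"
    if environment not in lic.allowed_environments:
        return False, "environment not licensed"
    if operation not in lic.allowed_operations:
        return False, "operation not licensed"
    return True, "ok"


def derive_license(parent, child_dataset_id):
    """Derived data inherits the parent license, synthetic marker included."""
    return replace(parent, dataset_id=child_dataset_id)


kyc_license = AllowedUseLicense(
    dataset_id="syn-kyc-docs-v1",
    allowed_users=frozenset({"KYC platform QA team"}),
    allowed_environments=frozenset({"enterprise test environment"}),
    allowed_operations=frozenset({"read", "query", "run automated tests"}),
    prohibited_operations=frozenset({"train foundation model", "re-identification"}),
    expires_on=date(2026, 9, 28),
)
print(check_access(kyc_license, "KYC platform QA team",
                   "enterprise test environment", "read", date(2026, 7, 1)))
```

Because `derive_license` copies every field except the dataset ID, a derived dataset cannot silently gain operations or lose its synthetic marker.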

2. Release gate levels

| Gate | Applicable data | Approval | Required evidence |
| --- | --- | --- | --- |
| G0 Draft | Local design only, no sensitive source | Team lead | use-case intake, no sensitive source statement |
| G1 Internal low-risk | UI demo, mock operational data | PM + Data Owner | data card, basic PII scan, allowed-use license |
| G2 Controlled test | KYC / complaints / contact center synthetic cases | PM + Architect + Privacy | source permission, privacy tests, utility/fidelity scorecard, retention |
| G3 Model-impacting | Training, fine-tune, model validation, credit/fraud augmentation | Data Owner + Model Risk + Privacy + Risk | attack report, bias review, benchmark comparison, risk acceptance |
| G4 External / vendor release | vendor PoC, offshore test, partner review | Legal + Procurement + Security + Privacy + Business Owner | contract boundary, export controls, watermark/provenance, deletion evidence |

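Key gate disciplines:

  • A low-risk release cannot be upgraded automatically to a high-risk use.
  • G3/G4 must have a residual risk owner, an expiry and reopen triggers.
  • If utility is high but a privacy attack test fails, "high business value" cannot be used to bypass the gate; the only options are to narrow the use, regenerate, add controls or reject.
  • If privacy is strong but utility is insufficient, the dataset cannot serve as model training or validation data; it can serve only as demo/mock or process-test data.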
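
The last two disciplines describe hard floors rather than a weighted score, which can be sketched as a decision function. This is an assumption-laden illustration: the three boolean inputs stand for "met its approved threshold", a fidelity failure is treated the same way as a utility failure, a privacy failure maps straight to rejection (narrowing, regenerating or adding controls would happen before resubmission), and the use labels in `DEMO_ONLY_USES` are hypothetical.

```python
# Assumed labels for the uses that remain open when utility or fidelity is insufficient.
DEMO_ONLY_USES = {"demo", "mock", "process test"}


def release_decision(privacy_pass, utility_pass, fidelity_pass, requested_uses):
    """Return (outcome, permitted_uses) without averaging the three axes."""
    if not privacy_pass:
        # High utility cannot buy back a failed privacy attack test.
        return "reject", set()
    if utility_pass and fidelity_pass:
        return "promote", set(requested_uses)
    limited = set(requested_uses) & DEMO_ONLY_USES
    if limited:
        return "limited release", limited
    return "reject", set()


print(release_decision(True, False, True, {"model training", "demo"}))
# ('limited release', {'demo'})
```

The three outcomes correspond to the Reject, Limited release and Promote branches of the release gate in the concept diagram.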

Financial retail scenarios

1. AML typology training

The goal is to train analysts to recognize typologies, not to copy real SAR cases.

| Design point | Requirement |
| --- | --- |
| Source permission | Financial crime case data is highly sensitive and needs an explicit training/simulation basis |
| Generation method | typology abstraction + graph/sequence synthesis |
| Privacy tests | rare case nearest-neighbor, canary extraction, linkage on amount/time/counterparty pattern |
| Utility tests | analyst training score, typology coverage, false narrative review |
| Allowed use | staff training, tabletop exercise, model challenge set |
| Prohibited use | Customer risk scoring, real alert disposition, external sharing |

2. KYC document testing

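The goal is to test document ingestion, OCR, field extraction and exception routing.

| Design point | Requirement |
| --- | --- |
| Source permission | Reuse of real identity document images is prohibited; use templates, field rules and synthetic images |
| Generation method | rule-based document layout + synthetic image + OCR noise injection |
| Privacy tests | OCR PII scan, visual similarity to real docs, metadata scrub |
| Utility tests | extraction accuracy regression, exception path coverage |
| Fidelity tests | Document layout, language, expiry date and document type combinations conform to policy |
| Release gate | Vendor sandbox requires G4; internal QA is usually G2 |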

3. Credit model scenario augmentation

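The goal is to extend edge-case scenarios, not to replace real model validation.

| Design point | Requirement |
| --- | --- |
| Source permission | credit data, bureau-like attributes and protected-class proxy risk require strict review |
| Generation method | constrained tabular synthesis + policy scenario authoring |
| Privacy tests | nearest-neighbor, linkage, membership inference |
| Utility tests | model sensitivity, challenger model behavior, adverse action reason sanity |
| Bias review | subgroup coverage, proxy amplification, stability of decline reasons |
| Prohibited use | Must not be used directly as primary production training data without model risk approval |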

4. Payment fraud simulation

The goal is to stress-test fraud rules, agent alert triage and payment warning UX.

| Design point | Requirement |
| --- | --- |
| Source permission | fraud cases and payment events require minimal fields and typology abstraction |
| Generation method | transaction sequence / graph synthesis + adversarial pattern mutation |
| Privacy tests | unique sequence leakage, counterparty linkage, rare merchant pattern |
| Utility tests | rule trigger coverage, investigator realism score, alert fatigue estimate |
| Fidelity tests | Plausibility of timing, amount, merchant category and device signal |
| Allowed use | sandbox rule stress test, analyst training, warning copy testing |

5. Contact center transcript generation

The goal is to generate controlled transcripts for copilot QA, intent classification and agent training.

| Design point | Requirement |
| --- | --- |
| Source permission | call recordings/transcripts usually contain highly sensitive voice data, identity verification and vulnerable customer information |
| Generation method | intent-conditioned LLM generation, redaction-first summaries, no raw transcript prompt for uncontrolled LLM |
| Privacy tests | PII/SPI scan, phrase similarity, membership inference sample test |
| Utility tests | intent label quality, policy answer coverage, escalation trigger coverage |
| Bias review | language clarity, accent proxy, vulnerability phrase handling |
| Prohibited use | Reconstructing real customer conversations, inferring employee performance, unapproved sentiment profiling |

6. Complaint analytics

The goal is to extend complaint themes and root-cause analytics coverage, not to manufacture false complaint metrics.

| Design point | Requirement |
| --- | --- |
| Source permission | complaints may include legal privilege, regulator-sensitive narratives, vulnerable customer details |
| Generation method | theme-level synthesis + policy taxonomy + controlled narrative variation |
| Privacy tests | narrative uniqueness, linkage to public complaint, PII scan |
| Utility tests | taxonomy classifier performance, root-cause label quality, QA reviewer score |
| Fidelity tests | product, channel, issue, harm, resolution structure |
| Allowed use | taxonomy test, complaint copilot challenge set, QA calibration |
| Prohibited use | Management reporting of real complaint rates, regulatory reporting, customer profiling |

Metrics/control/evidence model

1. Metrics

| Category | Metric | Evidence |
| --- | --- | --- |
| Privacy | membership inference AUC, nearest-neighbor distance, PII leakage rate, linkage success rate | attack report, scan logs, reviewer sign-off |
| Utility | downstream task lift, workflow pass rate, defect discovery rate, SME usefulness score | benchmark run, QA test result, SME review |
| Fidelity | distribution delta, rule validity rate, temporal pattern score, transcript realism | statistical report, rule engine output, review rubric |
| Coverage | scenario coverage, subgroup coverage, edge-case count, missing segment list | coverage matrix, gap register |
| Bias | subgroup utility delta, label skew, stereotype phrase rate, adverse reason stability | fairness review, bias checklist |
| Governance | data card completeness, license coverage, expired dataset count, unauthorized use attempts | catalog report, access logs, exception log |
| Operations | regeneration reproducibility, version drift, deletion proof timeliness | pipeline log, provenance graph, retention evidence |

2. Control model

| Control | Purpose | Owner |
| --- | --- | --- |
| Intake approval | Prevent vague or unauthorized synthetic data requests | PM / BA / Business Owner |
| Source permission gate | Ensure source data can support the proposed transformation | Data Owner / Privacy / Legal |
| Generation environment control | Prevent uncontrolled exposure during generation | Security / Platform |
| Privacy attack suite | Test leakage and re-identification risk | Privacy Engineering / Data Science |
| Utility/fidelity scorecard | Confirm fit for approved use | Product / SME / Model Risk |
| Bias and coverage review | Detect representational harm and amplified historical bias | Risk / Fairness / Business SME |
| Data card | Make limitations and provenance inspectable | Data Product Owner |
| Allowed-use license | Prevent downstream misuse and purpose creep | Data Governance / PM |
| Watermark/provenance | Preserve synthetic identity across copies and derivatives | Data Platform |
| Release gate | Make go/no-go decision auditable | Architecture / Risk / Data Owner |
| Retention and retirement | Avoid long-lived uncontrolled synthetic assets | Records / Data Governance |

3. Evidence packet

At go-live or release, retain at least:

  • use-case intake and risk tier
  • source permission matrix and data classification
  • generation plan, generator version, prompts/rules, source window
  • privacy attack report and exceptions
  • utility/fidelity scorecard
  • bias/coverage review
  • data card and allowed-use license
  • watermark/provenance evidence
  • approver list, residual risk owner, expiry date
  • retention/deletion requirements
  • downstream access logs and monitoring plan
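
The packet list lends itself to an automated completeness check before a gate review. The sketch below assumes the packet is a dictionary whose keys are the hypothetical slugs in `REQUIRED_EVIDENCE` and whose values are links or identifiers of the stored artifacts; the file names in the example are made up.

```python
# Hypothetical slugs, one per item in the evidence packet list.
REQUIRED_EVIDENCE = [
    "use_case_intake",
    "source_permission_matrix",
    "generation_plan",
    "privacy_attack_report",
    "utility_fidelity_scorecard",
    "bias_coverage_review",
    "data_card_and_license",
    "watermark_provenance",
    "approvers_and_expiry",
    "retention_deletion_requirements",
    "access_logs_and_monitoring_plan",
]


def missing_evidence(packet: dict) -> list:
    """Required items that are absent or empty, in checklist order."""
    return [item for item in REQUIRED_EVIDENCE if not packet.get(item)]


draft_packet = {
    "use_case_intake": "intake/syn-aml-typology-training-v1.md",
    "source_permission_matrix": "permissions/aml-cases.csv",
    "privacy_attack_report": "",  # placeholder exists but the report is not attached
}
print(missing_evidence(draft_packet))
```

An empty placeholder counts as missing, so a packet cannot pass simply by having the right section headings.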

Anti-patterns and failure modes

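| Anti-pattern | Symptom | Consequence | Fix |
| --- | --- | --- | --- |
| "Synthetic means safe" | No source permission check and no privacy attack test | Data leakage, compliance failure, vendor misuse | Establish a G2+ release gate |
| One dataset for everything | The same data is used for demo, eval, training and vendor PoC | Purpose creep, mismatched evidence | Split license and version by use |
| Fidelity worship | Chasing near-identity with real data | Copies privacy risk and historical bias | Set a privacy floor and a bias review |
| Privacy-only release | The data is very safe but has no business utility | Misleading tests, lower model quality | Require a utility/fidelity score as well |
| Demo-to-production drift | demo mock data is left in product training or test pipelines | Fake distributions contaminate real systems | catalog + expiry + pipeline marker |
| Unmarked derivatives | The synthetic marker is lost after downstream processing | Untraceable, mistaken for real data | watermark/provenance inheritance |
| Vendor sandbox sprawl | Synthetic data is sent to multiple vendors with no deletion proof | Third-party risk expands | G4 gate + contract + deletion evidence |
| Rare case copying | High-value rare cases are reused with only the names changed | Membership inference and customer identification | typology abstraction + nearest-neighbor suppression |
| Bias laundering | Historical discriminatory labels are copied under the name of "synthetic" | Fairness risk expands | subgroup review + label governance |
| Metric averaging | privacy, utility and fidelity are merged into one total score | Masks fatal failures | Three-axis thresholds; failing any hard gate means rejection |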

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

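| AI pattern | Synthetic data value | Key governance points |
| --- | --- | --- |
| RAG | Generate policy Q&A, complaint narratives and KYC exception cases to test retrieval grounding | Synthetic documents must not be mixed into the real knowledge base unmarked; citations must state the synthetic source |
| Agent | Simulate payment fraud workflow, case triage and tool-call boundary | Synthetic transactions/customers must not reach production tools; agent action logs mark synthetic runs |
| Copilot | contact center draft, AML narrative, complaint response QA | Calibrate output quality with a real SME rubric; never treat a synthetic transcript as real customer evidence |
| Eval | Extend edge-case scenarios, privacy-preserving test sets, regression edge cases | Needs an oracle/expected behavior; does not repeat eval dataset lifecycle; the focus is the synthetic source license |
| Governance | Enter the catalog, risk register and release gate as an AI data product | data card, allowed use, retention, provenance and attack evidence are auditable |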

Key architecture principle:

Synthetic data may enter AI workflows only through typed, versioned, licensed datasets.
It should never be an invisible folder of files copied into prompts, notebooks or vendor sandboxes.
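
For the RAG row in particular, "typed, versioned, licensed" can be sketched as metadata that travels with every chunk so that a citation always reveals the synthetic source. The names `SyntheticDatasetRef`, `to_chunk` and `citation`, the metadata keys and the `[SYNTHETIC]` label are hypothetical and not tied to any particular RAG framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticDatasetRef:
    dataset_id: str
    version: str
    license_id: str


def to_chunk(text: str, ref: SyntheticDatasetRef) -> dict:
    """Wrap text for indexing with the synthetic marker and dataset identity."""
    return {
        "text": text,
        "metadata": {
            "synthetic": True,
            "dataset_id": ref.dataset_id,
            "version": ref.version,
            "license_id": ref.license_id,
        },
    }


def citation(chunk: dict) -> str:
    """Render a citation that states the synthetic source explicitly."""
    meta = chunk["metadata"]
    prefix = "[SYNTHETIC] " if meta.get("synthetic") else ""
    return f"{prefix}{meta['dataset_id']}@{meta['version']}"


ref = SyntheticDatasetRef("syn-kyc-exceptions", "1.2.0", "lic-001")
chunk = to_chunk("Customer submits an expired passport with a valid visa.", ref)
print(citation(chunk))  # [SYNTHETIC] syn-kyc-exceptions@1.2.0
```

An index that accepts only chunks built through a function like `to_chunk` leaves no path for an unmarked folder of files to enter retrieval.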

ADR draft

| Field | Content |
| --- | --- |
| ADR ID | ADR-AI-DATA-173 |
| Title | Adopt governed synthetic data release gates for financial retail AI development and testing |
| Status | Proposed |
| Context | AI teams need synthetic AML, KYC, credit, fraud, contact center and complaint datasets for testing, training, eval, sandbox and vendor PoC. Current ad hoc generation creates privacy, utility, bias, provenance and purpose-creep risk. |
| Decision | Treat synthetic data as a governed data product. Every G2+ synthetic dataset must pass use-case intake, source permission review, generator risk review, privacy attack testing, utility/fidelity scorecard, bias/coverage review, data card, allowed-use license, watermark/provenance tagging, retention rule and release gate approval. |
| Scope | Applies to synthetic datasets derived from, calibrated by or intended to resemble financial retail customer, transaction, case, document, complaint, call, credit, fraud or AML data. |
| Options considered | 1. Team-level ad hoc generation. 2. Central synthetic data platform without release gates. 3. Governed synthetic data product lifecycle with risk-tiered gates. |
| Decision rationale | Option 3 balances speed and control. It allows low-risk mock data to move quickly while requiring evidence for high-risk model-impacting, customer-like or externally released datasets. |
| Consequences | Teams must plan evidence work earlier. Synthetic datasets become cataloged assets with owners, expiry and allowed use. Some datasets will be rejected or limited despite high utility if privacy or bias evidence fails. |
| Controls | Source permission matrix, privacy attack suite, utility/fidelity scorecard, bias review, data card, license, watermark/provenance, access logging, retention/deletion evidence. |
| Residual risk | Privacy attacks are not exhaustive; utility evidence may not transfer to production; downstream teams may misuse derivatives without strong catalog and access controls. |
| Reopen triggers | New source data class, external release, use for model training, privacy incident, bias finding, attack threshold breach, regulatory/legal interpretation change, dataset expiry, generator model change. |

Interview answer

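30-second version

I would treat synthetic data as a governed data product, not as "fake data". The core architecture is use-case intake, source permission, generation risk review, privacy attack testing, utility/fidelity measurement, bias/coverage review, data card, allowed-use license, watermark/provenance, retention and release gate. That supports AML, KYC, credit, fraud, contact center and complaint AI scenarios without letting synthetic data be misused as an unrestricted training or externally shared asset.

2-minute version

In financial retail, the value of synthetic data is reducing the exposure of real customer data in development, testing, training and vendor PoCs, while extending long-tail scenarios. The risk is that teams easily assume "synthetic means safe". I would start from the approved purpose and make clear whether the data is for KYC OCR testing, AML typology training, payment fraud simulation or contact center copilot QA. Then I would check source data permission, because synthetic data does not automatically escape the purpose limitations and contractual boundaries of the original data.

Architecturally I would set up a three-axis evaluation: privacy, utility, fidelity. Privacy covers membership inference, model inversion, nearest-neighbor, linkage, canary and PII scan; utility asks whether the data supports the approved task; fidelity asks whether the business structure is realistic without over-copying real individuals. On top of that I add representativeness and bias amplification review, so that credit, complaint, KYC and contact center scenarios do not reproduce historical bias.

Finally, every dataset needs a data card and an allowed-use license stating what it may be used for, what is prohibited, who can access it, when it expires, whether it may be shared externally and how derived data is handled. The release gate is tiered by risk; high-risk training, model validation or vendor release requires evidence-based sign-off from Privacy, Model Risk, Legal, Security and the Business Owner.

CTO version

I would not position a synthetic data platform as a pure generation tool; I would position it as an AI data product control plane. A CTO cares about reuse speed, platform consistency, audit pressure and blast radius. My recommendation is a lightweight but strongly constrained lifecycle: catalog + generator registry + attack workbench + scorecard + license + provenance + retention.

That way engineering teams request and consume synthetic data through a standard interface instead of reinventing scripts in every project; risk teams can see source permission, attack evidence and residual risk; and the platform can use watermark, metadata, access policy and expiry to keep synthetic data from getting out of control in notebooks, RAG indexes, agent sandboxes and vendor environments. The business value is not "we have more fake data" but "we can test high-risk AI scenarios faster while moving control of privacy, bias, purpose drift and untraceable audit risk earlier in the lifecycle".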


7-day practice plan

| Day | Focus | Practice output |
| --- | --- | --- |
| 1 | Synthetic data intake | Write the use-case intake, risk tier and allowed/prohibited use for AML typology training |
| 2 | Source permission | Build the source permission matrix and exclusion rules for KYC document testing |
| 3 | Generation design | Design the graph/sequence synthesis plan and generator risk review for payment fraud simulation |
| 4 | Privacy attacks | Design membership inference, PII scan, phrase similarity and canary tests for contact center transcript generation |
| 5 | Utility/fidelity | Build a scorecard for credit model scenario augmentation covering task utility, distribution fidelity and bias review |
| 6 | Data card/license | Write the data card and allowed-use license for the complaint analytics synthetic dataset |
| 7 | Release gate and ADR | Complete a G2/G3 release evidence packet, write an ADR and prepare the 2-minute interview narrative |

| Source | Link | Use |
| --- | --- | --- |
| UK ICO synthetic data guidance | https://ico.org.uk/about-the-ico/research-reports-impact-and-evaluation/research-and-reports/technology-and-innovation/synthetic-data/ | Anchor for synthetic data privacy risk, utility, governance and misuse risk |
| NIST Privacy Framework | https://www.nist.gov/privacy-framework | Organize source permission, privacy attack testing, controls and evidence with a privacy risk management mindset |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Organize AI synthetic data risk governance and release gates with Govern / Map / Measure / Manage |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Design roles, operation, performance evaluation, internal audit and continual improvement with an AI management system mindset |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | Organize risk identification, analysis, treatment and monitoring with an AI risk management lifecycle mindset |
| W3C PROV | https://www.w3.org/TR/prov-overview/ | Describe synthetic data provenance, generation activities and responsible parties with the entity / activity / agent model |