AI Synthetic Data Governance: Privacy-Utility-Fidelity Architecture
Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / data product architect / AI architect / privacy-risk partner
Output: advanced architecture note, governance model, release gate design, ADR draft, interview-ready narrative
Why synthetic data governance matters
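Synthetic data is not "fake data", nor is it a safe asset that appears automatically once real customer data has been fed to a generator. For financial retail AI, synthetic data is a governed data product: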
Synthetic data product = approved purpose + source permission + generation method + privacy attack evidence + utility/fidelity score + allowed-use license + release gate.
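Its value is that teams can support AI development, process testing, edge-case expansion, staff training and vendor PoCs without endlessly copying sensitive data. It also introduces new risks:
| Risk | Why advanced teams must govern it |
|---|---|
| Residual privacy leakage | The generator may memorize rare customers, transactions, complaints, document images or call fragments, creating membership inference, model inversion or record linkage risk |
| False safety belief | Teams wrongly assume "synthetic" means anonymous and then move the data into a vendor sandbox, training corpus or low-control environment |
| Utility theater | The data looks realistic but cannot support the business task, model training, process testing or control validation |
| Fidelity overfit | Chasing realism too hard reproduces the privacy exposure, bias, historical errors and outlier cases in the original distribution |
| Bias amplification | The generator amplifies minority-group misrepresentation, complaint language patterns, fraud label bias or credit history bias |
| Purpose creep | Data generated for KYC testing is reused to train a marketing model, or AML typology data is reused for employee performance monitoring |
| Provenance collapse | Downstream users do not know who generated the data, from which sources, what it may be used for, how long it is retained or whether it may be redistributed |
The key judgment for a senior PM / BA / Architect is not "can we generate a batch of data" but: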
Which business decision can this synthetic dataset support,
what real data permissions authorize its creation,
what privacy attacks has it survived,
what utility and fidelity evidence proves it is fit for use,
and what license prevents misuse after release?
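This note deliberately does not repeat four adjacent topics:
- Synthetic user simulation focuses on persona / scenario / journey labs; this note focuses on governing and releasing the synthetic dataset as a data product.
- Privacy clean room focuses on multi-party data collaboration and controlled computation; this note focuses on synthetic data assets inside the enterprise or in controlled vendor settings.
- Differential privacy focuses on the mathematical privacy budget; this note treats DP as one optional control, not as the whole solution.
- Eval dataset lifecycle focuses on versioning, labeling and regression governance of eval sets; this note focuses on whether synthetic data may be released, for what purpose, and whether the evidence is sufficient.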
Concept diagram
flowchart TB
A[Use-case intake<br/>purpose, users, risk tier,<br/>allowed decision] --> B[Source permission gate]
B --> C[Source data registry<br/>classification, consent, contract,<br/>lineage, retention, owner]
C --> D[Generation plan<br/>method, prompts, rules, model,<br/>sampling, constraints]
D --> E[Generator risk review<br/>memorization, rare cases,<br/>prompt leakage, vendor boundary]
E --> F[Synthetic dataset build]
F --> G[Privacy attack workbench<br/>membership inference,<br/>model inversion, linkage,<br/>nearest-neighbor, canary tests]
F --> H[Utility and fidelity lab<br/>task performance, distribution,<br/>business rules, SME review]
F --> I[Representativeness and bias review<br/>coverage, subgroup error,<br/>amplification, missingness]
G --> J[Data card and evidence packet]
H --> J
I --> J
J --> K{Release gate}
K -->|Reject| D
K -->|Limited release| L[Allowed-use license<br/>scope, users, prohibited uses,<br/>expiry, retention, watermark]
K -->|Promote| M[Dataset catalog<br/>provenance, version,<br/>access control, monitoring]
M --> N[Downstream use<br/>RAG, Agent, Copilot,<br/>Eval, training, QA, sandbox]
N --> O[Usage telemetry and incident feedback]
O --> A
Core loop:
business purpose
-> source permission
-> governed generation
-> privacy / utility / fidelity / bias evidence
-> allowed-use release
-> monitored downstream use
-> renewal, restriction or retirement
Core architecture model
1. Use-case intake as the first control
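Synthetic data governance starts at intake, not at the generator. A mature intake states clearly why synthetic data is needed.
| Field | Design requirement | Financial retail example |
|---|---|---|
| use_case_id | Stable identifier bound to the PRD, ADR, data card and release gate | syn-aml-typology-training-v1 |
| business purpose | Names the decision or test to be supported; must not be generalized to "AI R&D" | Train AML analysts to recognize mule network typologies |
| target users | Who may access it, and whether that includes vendors, contractors or offshore teams | Financial crime training team, model validation |
| risk tier | Tiered by data sensitivity, customer impact, external sharing scope and degree of automation | KYC document testing is high, UI demo mock data is low |
| requested output | tabular, transcript, image, document, transaction graph, mixed dataset | contact center transcript + intent label |
| downstream use | testing, training, eval, demonstration, sandbox, analytics, model training | payment fraud simulation for rule stress test |
| prohibited use | States explicitly what must not be done | Must not be used for real customer decisions, marketing targeting or employee appraisal |
| evidence needed | Which privacy, utility, fidelity and bias evidence is required | membership inference test, SME utility review |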
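Product principles for intake:
- A single synthetic dataset cannot serve an unlimited number of purposes.
- For high-risk uses, the allowed use must be written into a license, not into an email.
- If the real source data carries no permission for a given use, synthesis does not automatically grant that permission.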
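The intake fields and principles above can be sketched as a small validation step. This is a minimal illustration, not a prescribed schema: the deny-list `VAGUE_PURPOSES`, the `medium` tier and the specific checks are assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical deny-list of purposes that are too generic to approve.
VAGUE_PURPOSES = {"ai r&d", "ai research", "general ai development"}
# Assumed tier names; the note itself only gives "high" and "low" as examples.
RISK_TIERS = {"low", "medium", "high"}


@dataclass
class UseCaseIntake:
    use_case_id: str
    business_purpose: str
    target_users: list
    risk_tier: str
    requested_output: str
    downstream_use: list
    prohibited_use: list
    evidence_needed: list


def validate_intake(intake: UseCaseIntake) -> list:
    """Return a list of problems; an empty list means the intake can proceed."""
    problems = []
    if intake.business_purpose.strip().lower() in VAGUE_PURPOSES:
        problems.append("business_purpose is too generic")
    if intake.risk_tier not in RISK_TIERS:
        problems.append("risk_tier must be one of low/medium/high")
    if not intake.prohibited_use:
        problems.append("prohibited_use must be stated explicitly")
    if intake.risk_tier == "high" and not intake.evidence_needed:
        problems.append("high-risk intake must list required evidence")
    return problems
```

An intake whose purpose is just "AI R&D", with no prohibited uses and no evidence list at high risk, would come back with three problems instead of silently entering the generation queue.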
2. Source permission and lineage layer
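The legality, compliance and explainability of synthetic data depend on source data permission. Architecturally, a source permission matrix should be established.
| Source dimension | Key question | Control evidence |
|---|---|---|
| data owner | Who approved this source for synthetic data generation | data owner approval, system of record |
| data class | Does it contain PII, SPI, financial crime data, credit data, complaint data or call recordings | data classification record |
| purpose basis | Does the original collection purpose allow use for synthetic data generation | privacy / legal / compliance review |
| consent / notice | Do customer notices, internal policy or contracts support this processing | notice mapping, consent basis |
| contractual boundary | Does vendor or partner data allow derived data, synthetic data or redistribution | contract clause summary |
| retention | How long are the original data, intermediate artifacts and synthetic output retained | retention schedule |
| jurisdiction | Regions where data is processed and stored | regional control record |
| exclusion rules | Which fields, groups, samples or events are barred from the generation process | source filter log |
Key insight on source permission: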
Synthetic output inherits constraints from source data unless a reviewed governance decision narrows, transforms and re-licenses the output.
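The inheritance rule can be illustrated as a set intersection: a requested use survives only if every contributing source permits it. This is a simplified sketch; the function name and the use labels are hypothetical, and the reviewed governance decision that may re-license an output is deliberately not modeled.

```python
def effective_allowed_uses(source_permitted: list, requested: set) -> set:
    """Return the requested uses that every source permits.

    source_permitted: one set of permitted uses per source dataset.
    requested: the uses asked for in the intake.
    """
    allowed = set(requested)
    for permitted in source_permitted:
        # A use not permitted by any one source is dropped for the output.
        allowed &= permitted
    return allowed
```

For example, if one source permits `{"testing", "training"}` and another permits `{"testing", "eval"}`, a request for `{"testing", "marketing"}` reduces to `{"testing"}`.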
3. Generation architecture patterns
Different generation methods carry different risks and evidence needs.
| Pattern | Suitable scenarios | Main risks | Evidence focus |
|---|---|---|---|
| Rule-based synthesis | KYC document field combinations, account states, process path testing | Rules too narrow, real distribution missing | business rule coverage, SME approval |
| Statistical / tabular generator | Credit scenario augmentation, customer segment samples | Distribution distortion, minority subgroup leakage | distribution fidelity, nearest-neighbor privacy |
| LLM text generation | contact center transcripts, complaint narratives, case notes | Verbatim memorization, reproduction of sensitive fragments, tone bias | prompt/source control, plagiarism scan, PII leakage test |
| Document/image synthesis | KYC document testing, statement extraction | Copying real images, identity information leakage, missing watermark | visual similarity, OCR PII scan, watermark/provenance |
| Graph / sequence synthesis | payment fraud simulation, AML mule network | Rare pattern leakage, re-identifiable graph structure | graph similarity, rare motif suppression |
| Hybrid mutation | Rewriting real cases into edge scenarios | Original case can be reverse-engineered | edit distance, source separation, canary cases |
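The generator itself must also be governed:
- Record the generator model, version, prompt, seed, rules, parameters and source sample window.
- For LLM generation, do not place raw sensitive source text directly into an uncontrolled vendor prompt.
- For high-risk data, use an isolated generation environment, short-lived workspace, access logging and artifact cleanup.
- For recurring generation tasks, maintain a prompt/rule registry so that hand-written scripts do not become an invisible data processing system.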
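One way to make the recorded generator settings auditable is a manifest with a stable fingerprint, so that two runs with identical settings can be recognized as such. The sketch below is an assumption about how such a record might look; the class name, field names and the choice of SHA-256 over canonical JSON are illustrative.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationManifest:
    generator_model: str
    generator_version: str
    prompt_id: str        # reference into a prompt/rule registry (hypothetical)
    seed: int
    rules_version: str
    parameters: dict      # must be JSON-serializable
    source_window: str    # e.g. a date range describing the source sample

    def fingerprint(self) -> str:
        """SHA-256 over a key-sorted JSON dump of all fields."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Storing the fingerprint alongside the dataset version gives the release gate a cheap check that the evidence packet refers to the same generation settings as the data being released.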
4. Data card and license layer
Every releasable synthetic dataset should have a data card with an attached allowed-use license.
| Data card field | Content |
|---|---|
| dataset_id / version | Stable ID, version, generation date |
| purpose | Approved business purpose |
| source summary | Source systems, time window and field categories, without exposing sensitive detail |
| generation method | rule/statistical/LLM/document/graph/hybrid, generator version |
| privacy tests | attack type, result, threshold, exceptions |
| utility tests | task performance, SME review, business rule pass rate |
| fidelity tests | distribution similarity, sequence realism, document realism |
| representativeness | Coverage, gaps, populations for which inference is prohibited |
| known limitations | Populations, channels, products and events it cannot represent |
| allowed uses | Explicit scope such as testing/training/eval/demo/model training |
| prohibited uses | Real customer decisions, external sharing, retraining, re-identification, linkage and similar |
| retention / expiry | Expiry date, review date, deletion requirements |
| provenance | W3C PROV-style entity/activity/agent records |
| watermark / marker | visible/invisible watermark, metadata tag, synthetic flag |
| owner / approvers | PM, Data Owner, Privacy, Risk, Architecture |
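The provenance field can be sketched as a plain dictionary that borrows W3C PROV relation names (`wasGeneratedBy`, `used`, `wasAssociatedWith`, `wasDerivedFrom`). This is only a loose illustration: it is not a compliant PROV-JSON serialization, and the function name and identifiers are hypothetical.

```python
def build_provenance(dataset_id: str, source_ids: list,
                     activity_id: str, agent_id: str) -> dict:
    """Build a PROV-style record: entities, one activity, one agent, relations."""
    entities = {dataset_id: {"synthetic": True}}
    for source_id in source_ids:
        entities[source_id] = {"synthetic": False}
    return {
        "entity": entities,
        "activity": {activity_id: {}},
        "agent": {agent_id: {}},
        # The synthetic dataset was produced by the generation activity.
        "wasGeneratedBy": [(dataset_id, activity_id)],
        # The generation activity consumed each source.
        "used": [(activity_id, s) for s in source_ids],
        # A responsible party is attached to the activity.
        "wasAssociatedWith": [(activity_id, agent_id)],
        "wasDerivedFrom": [(dataset_id, s) for s in source_ids],
    }
```

Keeping the `synthetic` flag on the entity itself is one way to let downstream catalogs filter synthetic assets without parsing the relations.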
Privacy-utility-fidelity measurement model
A synthetic data release must measure privacy, utility and fidelity together. Measuring only one dimension leads to wrong decisions.
1. Three-axis model
Privacy: can an attacker infer a real person, record, document, call or transaction?
Utility: does the dataset support the approved task?
Fidelity: does it preserve the relevant structure of the target domain without copying protected facts?
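| Axis | Goal | Common tests | Failure signals |
|---|---|---|---|
| Privacy | Limit inference about real individuals or real records | membership inference, model inversion, nearest-neighbor distance, record linkage, PII scan, canary extraction | Synthetic samples sit too close to rare real samples, customer fields can be reconstructed, or it can be determined whether a person is in the source data |
| Utility | Support the approved use | downstream task score, rule coverage, SME rating, workflow pass rate, model validation comparison | No improvement after model training, tests fail to surface defects, SMEs find the data not credible |
| Fidelity | Preserve the necessary business structure | distribution similarity, correlation preservation, temporal pattern check, graph motif check, document layout realism | Transaction amounts, complaint tone, KYC document combinations or fraud chains do not match business experience |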
2. Privacy attack testing
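High-risk synthetic data should cover at least the following tests.
| Attack / test | What it measures | Financial retail example | Example pass criterion |
|---|---|---|---|
| Membership inference | Whether an attacker can tell that a given customer or case is in the source data | Whether a rare AML typology case was memorized by the generator | Attack AUC does not exceed the approved threshold, and high-risk samples pass manual review |
| Model inversion | Whether sensitive attributes can be inferred from the synthetic data or generator output | Inferring customer identity, address or special hardship from a complaint narrative | Sensitive attribute recovery rate is below threshold, with no identifiable narrative |
| Nearest-neighbor similarity | Whether synthetic records are too close to real records | Near-copies of KYC document images, transaction sequences or call transcript sentences | Highly similar samples are removed or down-weighted |
| Linkage attack | Whether joining with external or internal data enables re-identification | A combination of transaction time, merchant and amount identifies a customer | Small groups / rare combinations are generalized, perturbed or barred from release |
| Canary extraction | Insert a rare token/record deliberately and test whether the generator reproduces it | Insert a test case marker into the training samples | Generated output must not reproduce the canary |
| PII / SPI scan | Whether names, addresses, account numbers, phone numbers, identity documents or recording fragments appear | contact center transcript generation | Automated scan and manual sampling both find no unauthorized identifiers |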
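The nearest-neighbor similarity test can be sketched as a distance-to-closest-record check. This is a minimal illustration under strong assumptions: records are numeric vectors already scaled to comparable ranges, Euclidean distance is a reasonable similarity measure for them, and the threshold `min_distance` is a hypothetical value that would need approval per dataset.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows) -> float:
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def flag_near_copies(synthetic_rows, real_rows, min_distance: float) -> list:
    """Return indices of synthetic rows closer than min_distance to a real row."""
    return [
        index
        for index, row in enumerate(synthetic_rows)
        if distance_to_closest_record(row, real_rows) < min_distance
    ]
```

Flagged rows are candidates for removal or down-weighting, in line with the pass criterion in the table; a brute-force scan like this is quadratic, so large datasets would need an index structure instead.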
3. Utility/fidelity scoring
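Use a multi-dimensional scorecard; do not collapse the scores into a single average.
| Dimension | Metrics | Notes |
|---|---|---|
| Task utility | model lift, test defect discovery, workflow success | Whether it actually helps the approved use |
| Business rule validity | rule pass rate, impossible-combination rate | For example, KYC document country / document type / validity period combinations must not violate business rules |
| Distribution fidelity | KS distance, PSI, KL/JS divergence, correlation delta | Suited to tabular/transaction data; needs business interpretation |
| Temporal fidelity | seasonality, burst pattern, event order validity | Especially important for payment fraud, AML behavior chains and complaint escalation |
| Text/document fidelity | SME realism score, policy terminology accuracy, layout validity | contact center, complaint and KYC document testing |
| Coverage | segment coverage, scenario coverage, edge-case coverage | Look beyond the aggregate to key groups and process paths |
| Bias / fairness | subgroup utility, label skew, error amplification | Must be reviewed separately for credit, complaint, KYC and customer service scenarios |
| Stability | re-generation variance, version-to-version drift | Prevents each regeneration from making test results non-comparable |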
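As an example of one distribution-fidelity metric from the scorecard, the population stability index (PSI) over pre-binned counts can be computed as the sum over bins of `(actual_share - expected_share) * ln(actual_share / expected_share)`. The sketch assumes both datasets were binned with the same edges; the small floor `eps` for empty bins is an illustrative choice, not a standard value.

```python
import math


def psi(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """Population stability index between two histograms with identical bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each share so an empty bin does not produce log(0).
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        total += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return total
```

Here `expected_counts` would come from the real reference data and `actual_counts` from the synthetic data. Identical histograms give 0, and the value grows as the two distributions diverge; what counts as acceptable still needs the business interpretation the table calls for.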
4. Representativeness without copying reality
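Representativeness does not mean copying the real distribution in full. Financial retail synthetic data has to handle three tensions:
| Tension | Wrong approach | More mature approach |
|---|---|---|
| Rare risk vs privacy | Copy a handful of real fraud / AML cases directly | Abstract the typology, recombine behavior patterns, remove identifiable combinations |
| Real distribution vs test coverage | Generate strictly at production proportions, leaving the long tail thin | Tiered release: core distribution + challenge slice |
| Group fairness vs sensitive attributes | Generate personas mechanically from sensitive attributes | Use approved fairness review fields and proxy risk controls, and explicitly prohibit inference uses |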
Allowed-use and release gate design
1. Allowed-use license
A synthetic dataset release should carry a license, much like an API scope.
| License element | Example |
|---|---|
| Allowed purpose | Non-production functional testing and OCR regression for the KYC document ingestion pipeline |
| Allowed users | KYC platform QA team, data quality team, named vendor test engineers |
| Allowed environments | enterprise test environment, approved vendor sandbox with no retention after 30 days |
| Allowed operations | read, query, transform for test cases, run automated tests |
| Prohibited operations | train foundation model, enrich customer profiles, external sharing, re-identification, linkage to production customer table |
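| Expiry | Must be deleted or re-approved after 90 days |
| Derivative rules | Derived data inherits the original license; the synthetic marker must not be removed |
| Evidence requirements | Usage records, test results, deletion proof, exception log |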
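The license elements above lend themselves to a machine-checkable form. The following is a sketch of how an access request might be evaluated against a license; the class, the field names and the order of checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AllowedUseLicense:
    dataset_id: str
    allowed_users: frozenset
    allowed_environments: frozenset
    allowed_operations: frozenset
    prohibited_operations: frozenset
    expiry: date


def check_access(lic: AllowedUseLicense, user: str, environment: str,
                 operation: str, today: date) -> tuple:
    """Return (allowed, reason). Expiry and prohibitions are checked first."""
    if today > lic.expiry:
        return (False, "license expired")
    if operation in lic.prohibited_operations:
        return (False, "operation is prohibited")
    if operation not in lic.allowed_operations:
        return (False, "operation is not in the allowed list")
    if user not in lic.allowed_users:
        return (False, "user is not licensed")
    if environment not in lic.allowed_environments:
        return (False, "environment is not licensed")
    return (True, "ok")
```

Anything not explicitly allowed is denied, which mirrors the API-scope analogy: the license is an allow-list with an expiry, not a list of exceptions.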
2. Release gate levels
| Gate | Applicable data | Approval | Required evidence |
|---|---|---|---|
| G0 Draft | Local design only, no sensitive source | Team lead | use-case intake, no sensitive source statement |
| G1 Internal low-risk | UI demo, mock operational data | PM + Data Owner | data card, basic PII scan, allowed-use license |
| G2 Controlled test | KYC / complaints / contact center synthetic cases | PM + Architect + Privacy | source permission, privacy tests, utility/fidelity scorecard, retention |
| G3 Model-impacting | Training, fine-tune, model validation, credit/fraud augmentation | Data Owner + Model Risk + Privacy + Risk | attack report, bias review, benchmark comparison, risk acceptance |
| G4 External / vendor release | vendor PoC, offshore test, partner review | Legal + Procurement + Security + Privacy + Business Owner | contract boundary, export controls, watermark/provenance, deletion evidence |
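Key gate discipline:
- A low-risk release cannot be upgraded automatically to a high-risk use.
- G3/G4 must have a residual risk owner, an expiry and a reopen trigger.
- If utility is high but a privacy attack test fails, "high business value" cannot be used to bypass the gate; the only options are to narrow the use, regenerate, add controls or reject.
- If privacy is strong but utility is insufficient, the dataset cannot serve as model training or validation data; it may be used only as demo/mock or process testing data.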
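The last two rules can be read as a per-axis decision rather than an averaged score. The sketch below is one interpretation: privacy is treated as a hard gate, and a utility or fidelity shortfall downgrades the release to limited (demo/mock or process testing) use. The threshold values and the normalization of each axis to a 0..1 score where higher is better are assumptions.

```python
# Hypothetical per-axis thresholds on normalized scores (higher is better).
THRESHOLDS = {"privacy": 0.8, "utility": 0.6, "fidelity": 0.6}


def release_decision(scores: dict) -> str:
    """Return 'reject', 'limited_release' or 'promote' from per-axis scores."""
    # Privacy is a hard gate: high utility cannot compensate for a failure.
    if scores["privacy"] < THRESHOLDS["privacy"]:
        return "reject"
    # Strong privacy but weak utility/fidelity: demo/mock or process testing only.
    if (scores["utility"] < THRESHOLDS["utility"]
            or scores["fidelity"] < THRESHOLDS["fidelity"]):
        return "limited_release"
    return "promote"
```

A dataset scoring 0.5 on privacy and 0.95 on the other two axes has a mean of 0.8, yet it is rejected here, which is exactly the failure that an averaged score would hide.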
Financial retail scenarios
1. AML typology training
The goal is to train analysts to recognize typologies, not to copy real SAR cases.
| Design point | Requirement |
|---|---|
| Source permission | financial crime case data is highly sensitive and needs an explicit training/simulation basis |
| Generation method | typology abstraction + graph/sequence synthesis |
| Privacy tests | rare case nearest-neighbor, canary extraction, linkage on amount/time/counterparty pattern |
| Utility tests | analyst training score, typology coverage, false narrative review |
| Allowed use | staff training, tabletop exercise, model challenge set |
| Prohibited use | Customer risk scoring, real alert disposition, external sharing |
2. KYC document testing
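The goal is to test document ingestion, OCR, field extraction and exception routing.
| Design point | Requirement |
|---|---|
| Source permission | Reuse of real identity document images is prohibited; use templates, field rules and synthetic images |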
| Generation method | rule-based document layout + synthetic image + OCR noise injection |
| Privacy tests | OCR PII scan, visual similarity to real docs, metadata scrub |
| Utility tests | extraction accuracy regression, exception path coverage |
| Fidelity tests | Document layout, language, expiry date and document type combinations conform to policy |
| Release gate | A vendor sandbox requires G4; internal QA is usually G2 |
3. Credit model scenario augmentation
The goal is to extend edge scenarios, not to replace real model validation.
| Design point | Requirement |
|---|---|
| Source permission | credit data, bureau-like attributes and protected-class proxy risk require strict review |
| Generation method | constrained tabular synthesis + policy scenario authoring |
| Privacy tests | nearest-neighbor, linkage, membership inference |
| Utility tests | model sensitivity, challenger model behavior, adverse action reason sanity |
| Bias review | subgroup coverage, proxy amplification, stability of decline reasons |
| Prohibited use | Must not be used directly as primary production training data without model risk approval |
4. Payment fraud simulation
The goal is to stress-test fraud rules, agent alert triage and payment warning UX.
| Design point | Requirement |
|---|---|
| Source permission | fraud cases and payment events require minimal fields and typology abstraction |
| Generation method | transaction sequence / graph synthesis + adversarial pattern mutation |
| Privacy tests | unique sequence leakage, counterparty linkage, rare merchant pattern |
| Utility tests | rule trigger coverage, investigator realism score, alert fatigue estimate |
| Fidelity tests | Plausibility of timing, amount, merchant category and device signal |
| Allowed use | sandbox rule stress test, analyst training, warning copy testing |
5. Contact center transcript generation
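The goal is to generate controlled transcripts for copilot QA, intent classification and agent training.
| Design point | Requirement |
|---|---|
| Source permission | call recordings/transcripts usually contain highly sensitive voice data, identity verification and vulnerable customer information |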
| Generation method | intent-conditioned LLM generation, redaction-first summaries, no raw transcript prompt for uncontrolled LLM |
| Privacy tests | PII/SPI scan, phrase similarity, membership inference sample test |
| Utility tests | intent label quality, policy answer coverage, escalation trigger coverage |
| Bias review | language clarity, accent proxy, vulnerability phrase handling |
| Prohibited use | Reconstructing real customer conversations, inferring employee performance, unapproved sentiment profiling |
6. Complaint analytics
The goal is to extend complaint themes and root-cause analytics coverage, not to manufacture false complaint metrics.
| Design point | Requirement |
|---|---|
| Source permission | complaints may include legal privilege, regulator-sensitive narratives, vulnerable customer details |
| Generation method | theme-level synthesis + policy taxonomy + controlled narrative variation |
| Privacy tests | narrative uniqueness, linkage to public complaint, PII scan |
| Utility tests | taxonomy classifier performance, root-cause label quality, QA reviewer score |
| Fidelity tests | product, channel, issue, harm, resolution structure |
| Allowed use | taxonomy test, complaint copilot challenge set, QA calibration |
| Prohibited use | Management reporting of real complaint rates, regulatory reporting, customer profiling |
Metrics/control/evidence model
1. Metrics
| Category | Metric | Evidence |
|---|---|---|
| Privacy | membership inference AUC, nearest-neighbor distance, PII leakage rate, linkage success rate | attack report, scan logs, reviewer sign-off |
| Utility | downstream task lift, workflow pass rate, defect discovery rate, SME usefulness score | benchmark run, QA test result, SME review |
| Fidelity | distribution delta, rule validity rate, temporal pattern score, transcript realism | statistical report, rule engine output, review rubric |
| Coverage | scenario coverage, subgroup coverage, edge-case count, missing segment list | coverage matrix, gap register |
| Bias | subgroup utility delta, label skew, stereotype phrase rate, adverse reason stability | fairness review, bias checklist |
| Governance | data card completeness, license coverage, expired dataset count, unauthorized use attempts | catalog report, access logs, exception log |
| Operations | regeneration reproducibility, version drift, deletion proof timeliness | pipeline log, provenance graph, retention evidence |
2. Control model
| Control | Purpose | Owner |
|---|---|---|
| Intake approval | Prevent vague or unauthorized synthetic data requests | PM / BA / Business Owner |
| Source permission gate | Ensure source data can support the proposed transformation | Data Owner / Privacy / Legal |
| Generation environment control | Prevent uncontrolled exposure during generation | Security / Platform |
| Privacy attack suite | Test leakage and re-identification risk | Privacy Engineering / Data Science |
| Utility/fidelity scorecard | Confirm fit for approved use | Product / SME / Model Risk |
| Bias and coverage review | Detect representational harm and amplified historical bias | Risk / Fairness / Business SME |
| Data card | Make limitations and provenance inspectable | Data Product Owner |
| Allowed-use license | Prevent downstream misuse and purpose creep | Data Governance / PM |
| Watermark/provenance | Preserve synthetic identity across copies and derivatives | Data Platform |
| Release gate | Make go/no-go decision auditable | Architecture / Risk / Data Owner |
| Retention and retirement | Avoid long-lived uncontrolled synthetic assets | Records / Data Governance |
3. Evidence packet
At go-live or release, retain at least:
- use-case intake and risk tier
- source permission matrix and data classification
- generation plan, generator version, prompts/rules, source window
- privacy attack report and exceptions
- utility/fidelity scorecard
- bias/coverage review
- data card and allowed-use license
- watermark/provenance evidence
- approver list, residual risk owner, expiry date
- retention/deletion requirements
- downstream access logs and monitoring plan
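The evidence packet list can double as a completeness check at the release gate. In this sketch the key names are hypothetical identifiers derived from the list above, and a packet is assumed to be a dictionary mapping each key to a reference (for example a document ID) for the stored evidence.

```python
# Hypothetical keys, one per item in the evidence packet list above.
REQUIRED_EVIDENCE = (
    "use_case_intake_and_risk_tier",
    "source_permission_matrix_and_classification",
    "generation_plan_and_generator_version",
    "privacy_attack_report",
    "utility_fidelity_scorecard",
    "bias_coverage_review",
    "data_card_and_license",
    "watermark_provenance_evidence",
    "approvers_risk_owner_expiry",
    "retention_deletion_requirements",
    "access_logs_and_monitoring_plan",
)


def missing_evidence(packet: dict) -> list:
    """Return required keys that are absent or empty in the packet."""
    return [key for key in REQUIRED_EVIDENCE if not packet.get(key)]
```

A gate could refuse to open a review until `missing_evidence` returns an empty list, which keeps incomplete packets out of the approvers' queue.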
Anti-patterns and failure modes
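| Anti-pattern | Symptom | Consequence | Fix |
|---|---|---|---|
| "Synthetic means safe" | No source permission check and no privacy attack test | Data leakage, compliance failure, vendor misuse | Establish a G2+ release gate |
| One dataset for everything | The same data is used for demo, eval, training and vendor PoC | Purpose creep, mismatched evidence | Split license and version by use |
| Fidelity worship | Chasing near-identity with real data | Reproduces privacy risk and historical bias | Set a privacy floor and bias review |
| Privacy-only release | The data is very safe but has no business utility | Misleading tests, degraded model quality | Require utility/fidelity scores as well |
| Demo-to-production drift | Demo mock data is left in a product training or test pipeline | Fake distributions contaminate real systems | catalog + expiry + pipeline marker |
| Unmarked derivatives | Downstream processing drops the synthetic marker | Untraceable, mistaken for real data | watermark/provenance inheritance |
| Vendor sandbox sprawl | Synthetic data is sent to multiple vendors with no deletion proof | Third-party risk expands | G4 gate + contract + deletion evidence |
| Rare case copying | High-value rare cases are renamed and reused directly | Membership inference and customer identification | typology abstraction + nearest-neighbor suppression |
| Bias laundering | Historical discriminatory labels are reproduced under the name "synthetic" | Fairness risk expands | subgroup review + label governance |
| Metric averaging | Privacy, utility and fidelity are merged into one total score | Masks fatal failures | Three-axis thresholds; reject on any hard-gate failure |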
Architecture mapping to RAG / Agent / Copilot / Eval / Governance
| AI pattern | Synthetic data value | Key governance points |
|---|---|---|
| RAG | Generate policy Q&A, complaint narratives and KYC exception cases to test retrieval grounding | Synthetic documents must not be mixed into the real knowledge base unmarked; citations must state the synthetic source |
| Agent | Simulate payment fraud workflow, case triage and tool-call boundaries | Synthetic transactions/customers must not reach production tools; agent action logs mark synthetic runs |
| Copilot | contact center drafts, AML narratives, complaint response QA | Calibrate output quality with a real SME rubric; never treat a synthetic transcript as real customer evidence |
| Eval | Extend edge scenarios, privacy-preserving test sets, regression edge cases | Needs an oracle/expected behavior; does not repeat eval dataset lifecycle; the focus is the synthetic source license |
| Governance | Enters the catalog, risk register and release gate as an AI data product | data card, allowed use, retention, provenance and attack evidence are auditable |
Key architecture principle:
Synthetic data may enter AI workflows only through typed, versioned, licensed datasets.
It should never be an invisible folder of files copied into prompts, notebooks or vendor sandboxes.
ADR draft
| Field | Content |
|---|---|
| ADR ID | ADR-AI-DATA-173 |
| Title | Adopt governed synthetic data release gates for financial retail AI development and testing |
| Status | Proposed |
| Context | AI teams need synthetic AML, KYC, credit, fraud, contact center and complaint datasets for testing, training, eval, sandbox and vendor PoC. Current ad hoc generation creates privacy, utility, bias, provenance and purpose-creep risk. |
| Decision | Treat synthetic data as a governed data product. Every G2+ synthetic dataset must pass use-case intake, source permission review, generator risk review, privacy attack testing, utility/fidelity scorecard, bias/coverage review, data card, allowed-use license, watermark/provenance tagging, retention rule and release gate approval. |
| Scope | Applies to synthetic datasets derived from, calibrated by or intended to resemble financial retail customer, transaction, case, document, complaint, call, credit, fraud or AML data. |
| Options considered | 1. Team-level ad hoc generation. 2. Central synthetic data platform without release gates. 3. Governed synthetic data product lifecycle with risk-tiered gates. |
| Decision rationale | Option 3 balances speed and control. It allows low-risk mock data to move quickly while requiring evidence for high-risk model-impacting, customer-like or externally released datasets. |
| Consequences | Teams must plan evidence work earlier. Synthetic datasets become cataloged assets with owners, expiry and allowed use. Some datasets will be rejected or limited despite high utility if privacy or bias evidence fails. |
| Controls | Source permission matrix, privacy attack suite, utility/fidelity scorecard, bias review, data card, license, watermark/provenance, access logging, retention/deletion evidence. |
| Residual risk | Privacy attacks are not exhaustive; utility evidence may not transfer to production; downstream teams may misuse derivatives without strong catalog and access controls. |
| Reopen triggers | New source data class, external release, use for model training, privacy incident, bias finding, attack threshold breach, regulatory/legal interpretation change, dataset expiry, generator model change. |
Interview answer
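30-second version
I treat synthetic data as a governed data product, not as "fake data". The core architecture is use-case intake, source permission, generation risk review, privacy attack testing, utility/fidelity measurement, bias/coverage review, data card, allowed-use license, watermark/provenance, retention and release gate. That supports AML, KYC, credit, fraud, contact center and complaint AI scenarios without letting synthetic data be misused as an unrestricted training or externally shared asset.
2-minute version
In financial retail, the value of synthetic data is that it reduces the exposure of real customer data in development, testing, training and vendor PoCs while extending long-tail scenarios. The risk is that teams easily assume "synthetic means safe". I start from the approved purpose and state whether the data is for KYC OCR testing, AML typology training, payment fraud simulation or contact center copilot QA. Then I check source data permission, because synthetic data does not automatically escape the purpose limitations and contractual boundaries of the original data.
Architecturally, I set up a three-axis assessment: privacy, utility and fidelity. Privacy covers membership inference, model inversion, nearest-neighbor, linkage, canary and PII scan; utility asks whether the data supports the approved task; fidelity asks whether the business structure is realistic without over-copying real individuals. On top of that I add representativeness and bias amplification review so that credit, complaint, KYC and contact center scenarios do not reproduce historical bias.
Finally, every dataset needs a data card and an allowed-use license stating what it may be used for, what is prohibited, who may access it, when it expires, whether it may be shared externally and how derived data is handled. The release gate is tiered by risk; high-risk training, model validation or vendor release requires evidence-based sign-off from Privacy, Model Risk, Legal, Security and the Business Owner.
CTO version
I would not position a synthetic data platform as a mere generation tool but as an AI data product control plane. A CTO cares about reuse speed, platform consistency, audit pressure and incident blast radius. My recommendation is a lightweight but strongly constrained lifecycle: catalog + generator registry + attack workbench + scorecard + license + provenance + retention.
Engineering teams can then request and consume synthetic data through a standard interface instead of reinventing scripts on every project; risk teams can see source permission, attack evidence and residual risk; and the platform can use watermark, metadata, access policy and expiry to keep synthetic data from running out of control in notebooks, RAG indexes, agent sandboxes and vendor environments. The business value is not "we have more fake data" but "we can test high-risk AI scenarios faster while shifting control of privacy, bias, purpose drift and audit-untraceability risk to an earlier stage".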
7-day practice plan
| Day | Focus | Practice output |
|---|---|---|
| 1 | Synthetic data intake | Write the use-case intake, risk tier and allowed/prohibited use for AML typology training |
| 2 | Source permission | Build the source permission matrix and exclusion rules for KYC document testing |
| 3 | Generation design | Design the graph/sequence synthesis plan and generator risk review for payment fraud simulation |
| 4 | Privacy attacks | Design membership inference, PII scan, phrase similarity and canary tests for contact center transcript generation |
| 5 | Utility/fidelity | Build a scorecard for credit model scenario augmentation covering task utility, distribution fidelity and bias review |
| 6 | Data card/license | Write the data card and allowed-use license for the complaint analytics synthetic dataset |
| 7 | Release gate and ADR | Complete the G2/G3 release evidence packet, write an ADR and prepare the 2-minute interview narrative |
Source anchors with links
| Source | Link | Use |
|---|---|---|
| UK ICO synthetic data guidance | https://ico.org.uk/about-the-ico/research-reports-impact-and-evaluation/research-and-reports/technology-and-innovation/synthetic-data/ | Anchor for synthetic data privacy risk, utility, governance and misuse risk |
| NIST Privacy Framework | https://www.nist.gov/privacy-framework | Organize source permission, privacy attack testing, controls and evidence with a privacy risk management mindset |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Organize AI synthetic data risk governance and release gates with Govern / Map / Measure / Manage |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Design roles, operation, performance evaluation, internal audit and continual improvement with an AI management system mindset |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | Organize risk identification, analysis, treatment and monitoring with an AI risk management lifecycle mindset |
| W3C PROV | https://www.w3.org/TR/prov-overview/ | Describe synthetic data provenance, generation activities and responsible parties with the entity / activity / agent model |