AI Synthetic Data Governance / Privacy-Utility-Fidelity Playbook
Version: v1.0
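Date: 2026-06-30
Audience: AI product managers, CBAP / Senior BAs, data product architects, AI architects, privacy risk partners, model risk management, retail financial services business owners, data governance, information security, internal audit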
Purpose and when to use
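This playbook turns synthetic data from “ad hoc files generated per project” into a governed, releasable and auditable data product. It applies to the following scenarios: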
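| Scenario | When to use this playbook |
|---|---|
| AML typology training | You need to construct mule network, layering, structuring and trade-based patterns, but must not copy real cases |
| KYC document testing | You need to test OCR, document ingestion, exception routing and vendor sandboxes, but real ID document images are prohibited |
| Credit model scenario augmentation | You need to extend boundary samples, stress-test policy thresholds and verify adverse action reason stability |
| Payment fraud simulation | You need to simulate APP scams, account takeover, merchant fraud, device anomalies and rule triggers |
| Contact center transcript generation | You need to train a copilot, intent classifier or QA rubric, but must not expose real call transcripts |
| Complaint analytics | You need to extend the complaint taxonomy, root-cause classifier or regulatory response drills, but synthetic complaints must not feed real complaint metrics |
| Vendor PoC / sandbox | You need to give vendors data that approximates business data, while restricting use, retention, retraining and redistribution |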
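Usage boundaries:
- This playbook is not legal advice, a privacy impact assessment conclusion, a model validation report or a regulatory interpretation.
- Synthetic data is not automatically anonymized, de-identified or safe for sharing.
- If the source data does not permit a given use, “synthesis” cannot be used to bypass purpose limitation, contractual boundaries or customer expectations.
- High-risk synthetic data may be released only after privacy, utility, fidelity, bias, license and release evidence all meet their thresholds.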
Operating model
1. Lifecycle
Intake
-> source permission review
-> generation design
-> controlled build
-> privacy attack testing
-> utility/fidelity scoring
-> bias/coverage review
-> data card and allowed-use license
-> release gate
-> catalog, access, monitoring
-> renewal, restriction or retirement
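Because the lifecycle above is strictly sequential, one way to make it enforceable is to model it as an ordered list of stages. The following is a minimal sketch, not a description of any existing platform: the `Stage` enum and the `next_stage` helper are hypothetical names, and the rule that a dataset stays in its current stage until that stage passes is an assumption.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Lifecycle stages in the order listed in the playbook."""
    INTAKE = 0
    SOURCE_PERMISSION_REVIEW = 1
    GENERATION_DESIGN = 2
    CONTROLLED_BUILD = 3
    PRIVACY_ATTACK_TESTING = 4
    UTILITY_FIDELITY_SCORING = 5
    BIAS_COVERAGE_REVIEW = 6
    DATA_CARD_AND_LICENSE = 7
    RELEASE_GATE = 8
    CATALOG_ACCESS_MONITORING = 9
    RENEWAL_RESTRICTION_RETIREMENT = 10


def next_stage(current: Stage, passed: bool) -> Optional[Stage]:
    """Return the next stage, the same stage if not passed, or None at the end."""
    if not passed:
        return current  # assumption: no skipping ahead without passing evidence
    if current is Stage.RENEWAL_RESTRICTION_RETIREMENT:
        return None  # lifecycle complete
    return Stage(current + 1)
```

A workflow tool could call `next_stage(Stage.PRIVACY_ATTACK_TESTING, passed=False)` and get the same stage back, which keeps a dataset from reaching the release gate with a failed attack test.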
2. RACI
| Activity | PM | BA | Data Owner | Architect | Data Science | Privacy | Security | Model Risk | Legal/Compliance | Business Owner |
|---|---|---|---|---|---|---|---|---|---|---|
| Define use case and allowed use | A/R | R | C | C | C | C | C | C | C | A |
| Classify source data and permissions | C | C | A/R | C | C | A/R | C | C | A/R | C |
| Design generation approach | C | C | C | A/R | A/R | C | C | C | C | I |
| Control generation environment | I | I | C | A/R | R | C | A/R | I | C | I |
| Run privacy attack tests | I | I | C | C | R | A/R | C | C | C | I |
| Score utility/fidelity | A/R | R | C | C | R | C | I | C | C | C |
| Review bias/coverage | C | C | C | C | R | C | C | A/R | C | A |
| Approve release gate | A | C | A | A/R | C | A/R | A/R | A/R | A/R | A |
| Monitor usage and expiry | R | C | A/R | C | C | C | C | C | C | A |
RACI discipline:
- Business Owner owns why the synthetic data is needed and accepts business residual risk.
- Data Owner owns source permission and release scope.
- Privacy and Security own leakage, re-identification, environment and access controls.
- Model Risk owns model-impacting uses, especially credit, fraud, AML and decision-support augmentation.
- PM / BA / Architect ensure the license is embedded in PRD, architecture review, release checklist and vendor handoff.
3. Gate levels
| Gate | Use type | Required reviewers | Release posture |
|---|---|---|---|
| G0 Draft mock | Design-only mock data, no sensitive source | PM / Tech Lead | Local use, no catalog release |
| G1 Internal low risk | Demo data, generic process testing | PM, Data Owner | Internal release with data card and expiry |
| G2 Controlled operational test | KYC, contact center, complaint, payment test data | PM, Data Owner, Architect, Privacy, Security | Cataloged release with license and attack evidence |
| G3 Model-impacting | Training, fine-tune, validation, scenario augmentation | Data Owner, Model Risk, Privacy, Risk, Architect | Limited release with residual risk owner and monitoring |
| G4 External / vendor | Vendor PoC, offshore test, partner review | Legal, Procurement, Privacy, Security, Business Owner | Contract-bound release with deletion evidence |
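The reviewer column of the gate table can be checked mechanically before a release request is routed. The sketch below is illustrative only: `REQUIRED_REVIEWERS` copies the role labels from the table as written (including the single label "PM / Tech Lead" and the unqualified "Risk" at G3), and `missing_reviewers` is a hypothetical helper name.

```python
# Role labels copied from the gate table; how they map to real teams is an assumption.
REQUIRED_REVIEWERS = {
    "G0": {"PM / Tech Lead"},
    "G1": {"PM", "Data Owner"},
    "G2": {"PM", "Data Owner", "Architect", "Privacy", "Security"},
    "G3": {"Data Owner", "Model Risk", "Privacy", "Risk", "Architect"},
    "G4": {"Legal", "Procurement", "Privacy", "Security", "Business Owner"},
}


def missing_reviewers(gate, approvals):
    """Return the required reviewers for a gate that have not yet approved."""
    return sorted(REQUIRED_REVIEWERS[gate] - set(approvals))


# Example: a G2 request with three approvals recorded so far.
print(missing_reviewers("G2", ["PM", "Privacy", "Data Owner"]))
```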
Template: synthetic data intake
| Field | Required content | Example |
|---|---|---|
| Request ID | Stable ID tied to PRD / ADR / ticket | SYN-KYC-DOC-2026-001 |
| Business objective | The concrete decision, workflow or test supported | Test KYC OCR and exception routing for document quality edge cases |
| Data product type | Tabular, text, transcript, image, document, graph, sequence, mixed | Synthetic document images + field labels |
| Target users | Named teams, roles, vendors and locations | KYC QA, onboarding platform team, approved OCR vendor |
| Downstream system | RAG, Agent, Copilot, model training, analytics, QA, sandbox | OCR regression pipeline and vendor sandbox |
| Approved use | Precise allowed actions | Run extraction tests and compare field-level accuracy |
| Prohibited use | Explicitly blocked actions | No production decisioning, no model pretraining, no customer profiling |
| Source data summary | Systems and field categories, not raw examples | KYC policy rules, document templates, historical exception taxonomy |
| Sensitive data class | PII, financial crime, credit, complaint, call recording, employee data | Identity document-like data, no real document images |
| Risk tier | G0-G4 recommendation with rationale | G4 because vendor sandbox release is requested |
| Utility target | What “useful enough” means | OCR extracts required fields with 95% expected-label agreement in test |
| Fidelity target | What real-world structure must be preserved | Layouts, expiry formats, country-specific field constraints, image noise |
| Privacy target | What leakage must be prevented | No real names, IDs, faces, document numbers, metadata or source image similarity |
| Expiry | Date or event requiring renewal/deletion | 90 days after vendor test completion |
| Business owner | Accountable owner | Head of Digital Onboarding |
Intake decision rule:
If either the approved use or the source permission is missing, there is no release.
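The intake decision rule is simple enough to express as a guard that runs before any generation work starts. This is a minimal sketch under stated assumptions: `IntakeRequest` carries only three of the intake fields, and both the class and the `can_proceed` function are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IntakeRequest:
    request_id: str
    approved_use: str                  # empty string means no approved use recorded
    source_permission_confirmed: bool  # set by the Data Owner review


def can_proceed(req: IntakeRequest) -> tuple[bool, list[str]]:
    """Apply the rule: both an approved use and source permission are needed."""
    reasons = []
    if not req.approved_use.strip():
        reasons.append("no approved use")
    if not req.source_permission_confirmed:
        reasons.append("no source permission")
    return (not reasons, reasons)


req = IntakeRequest("SYN-KYC-DOC-2026-001", "Run extraction tests", False)
print(can_proceed(req))
```

Returning the reasons alongside the boolean keeps the rejection explainable in the request ticket.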
Template: source permission matrix
| Source | Data class | Owner | Proposed use in synthesis | Permission basis | Exclusions | Retention | Evidence |
|---|---|---|---|---|---|---|---|
| AML closed case taxonomy | Financial crime sensitive | Financial Crime Ops | Abstract typology labels and event sequences | Training simulation approved by FC governance | Real names, account numbers, counterparties, exact dates | 1 year synthetic training set, source extracts deleted after build | FC owner approval, privacy review |
| KYC document policy rules | Internal policy | Onboarding Policy | Generate valid/invalid document combinations | Internal testing allowed | Real customer documents excluded | Until next policy version | Policy owner approval |
| Contact center intent labels | Customer interaction metadata | Contact Center Data Owner | Condition transcript generation by intent | QA and copilot testing approved | Raw transcripts, authentication phrases, vulnerability notes unless reviewed | 180 days synthetic set | Data classification, transcript exclusion log |
| Payment fraud typologies | Fraud sensitive | Fraud Risk | Build transaction sequence simulation | Fraud rule stress testing approved | Unique merchant/customer/device combinations | 90 days pilot | Fraud risk sign-off |
| Complaint taxonomy | Complaint / conduct sensitive | Complaints Ops | Generate complaint narratives by theme | QA taxonomy testing approved | Legal privilege, regulator-specific case text, identifiable narratives | 180 days | Compliance review |
Permission checks:
- Does the original collection purpose support synthetic generation for this use?
- Does any contract or policy restrict derivatives, vendor sharing, retention or model training?
- Are there fields that must be excluded before generation rather than redacted after generation?
- Does source data include rare events whose structure could identify a customer even after field removal?
- Will intermediate artifacts be deleted or retained as evidence?
Template: privacy attack checklist
| Test | Required for | Method | Pass evidence | Escalation trigger |
|---|---|---|---|---|
| PII / SPI scan | All G1+ text, document, transcript, tabular data | Automated scan + sampled human review | No unauthorized names, IDs, addresses, accounts, phone, email, auth phrases | Any direct identifier found |
| Nearest-neighbor similarity | G2+ datasets derived from real cases | Compare synthetic samples with approved source embeddings/features | Similarity below threshold or high-similarity samples removed | Rare case or near-copy detected |
| Membership inference | G3/G4 model-impacting or externally released data | Attack model or holdout comparison where feasible | Attack performance not materially above baseline | Attacker can infer source membership |
| Model inversion | G3/G4, especially text/credit/fraud | Attempt reconstruction of sensitive attributes from output or generator | Sensitive reconstruction below threshold, no usable identity clues | Sensitive field reconstructed or inferred |
| Linkage attack | G2+ transaction, graph, complaint, AML, credit | Join-like test using quasi-identifiers or external/internal reference fields | Small cells suppressed, combinations generalized | Unique customer-like path exposed |
| Canary extraction | LLM or neural generator trained/fitted on sensitive source | Insert controlled marker during test build and probe outputs | Canary not reproduced | Canary reproduced or paraphrased closely |
| Metadata / provenance leak | Document/image/transcript files | Inspect EXIF, file metadata, hidden comments, prompt logs | Metadata scrubbed, synthetic marker retained | Real source path, author, customer or vendor metadata present |
| Prompt/source leakage | LLM generation | Review prompts, logs and outputs | No raw sensitive source in uncontrolled prompts/logs | Raw transcript/case/doc copied into vendor prompt |
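The nearest-neighbor similarity test in the checklist can be sketched as follows. The sketch assumes records have already been converted to numeric feature vectors or embeddings in an approved environment. The cosine measure and the 0.98 threshold are placeholders chosen for illustration, not recommended values, and `flag_near_copies` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_near_copies(synthetic, source, threshold=0.98):
    """Return (index, best_similarity) for synthetic rows too close to any source row."""
    flagged = []
    for i, row in enumerate(synthetic):
        best = max((cosine(row, ref) for ref in source), default=0.0)
        if best >= threshold:  # placeholder threshold; the real value is set by reviewers
            flagged.append((i, best))
    return flagged


synthetic_rows = [[1.0, 0.0], [1.0, 1.0]]
source_rows = [[1.0, 0.01], [0.0, 1.0]]
print(flag_near_copies(synthetic_rows, source_rows))
```

Flagged rows would then be removed or regenerated, and the removal recorded as pass evidence. This brute-force comparison is quadratic, so a larger dataset would need an approximate nearest-neighbor index instead.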
Privacy result classification:
| Result | Meaning | Action |
|---|---|---|
| Pass | Meets gate threshold and no material exceptions | Continue to utility/fidelity review |
| Conditional pass | Minor exceptions have been removed; use is limited and controls are strengthened | Release only with restricted license and expiry |
| Fail | Material leakage or attack success | Regenerate, reduce fidelity, narrow source, change method or reject |
Template: utility/fidelity scorecard
| Dimension | Metric | Target | Evidence | Example |
|---|---|---|---|---|
| Task utility | Downstream task score / workflow pass rate | Meets approved use threshold | Test report | KYC OCR extracts required fields at target accuracy |
| Defect discovery | Number and severity of meaningful failures found | Finds known and plausible edge failures | QA report | Payment fraud simulation triggers rule gaps |
| Business rule validity | Impossible combination rate | Below threshold set by SME | Rule engine output | Credit attributes obey policy constraints |
| Distribution fidelity | PSI / KS / correlation delta / domain comparison | Within approved range for relevant fields | Statistical report | Transaction amount bands and timing resemble target segment |
| Temporal fidelity | Sequence order, seasonality, burst behavior | Matches domain pattern needed for test | Sequence analysis | Fraud bursts and device changes occur in plausible order |
| Text realism | SME realism rating, taxonomy alignment | Meets rubric threshold | SME panel review | Complaint narrative includes product, harm, resolution request |
| Document fidelity | Layout validity, OCR readability, image artifact realism | Meets test target | Visual/OCR inspection | Synthetic ID document or statement has valid layout and controlled noise |
| Graph fidelity | Motif, degree, path length, community pattern | Preserves typology without copying real graph | Graph report | AML mule ring has plausible layering structure |
| Stability | Regeneration variance | Within expected band | Re-run comparison | Same generator settings produce comparable metrics |
| Limitation clarity | Known gaps documented | All material gaps captured | Data card | Dataset does not represent rural branch onboarding |
Decision rule:
Utility must be measured against the approved use, not against generic realism.
Fidelity must preserve business structure without copying customer-identifying facts.
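The scorecard lists PSI as one distribution-fidelity metric. As a hedged illustration, the population stability index over pre-binned counts can be computed as below. The binning, the `eps` floor for empty bins and the function name `psi` are assumptions of this sketch, and the acceptable range remains whatever the reviewers approve for the relevant fields.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    expected_counts: counts per bin for the reference (target segment) data.
    actual_counts:   counts per bin for the synthetic data, same bin order.
    """
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / expected_total, eps)  # floor avoids log(0) on empty bins
        pa = max(a / actual_total, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score


# Transaction amount bands: reference vs. synthetic (illustrative numbers only).
print(round(psi([50, 50], [80, 20]), 4))
```

Identical distributions give a PSI of exactly zero, and the value grows as the synthetic bands drift from the reference.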
Template: bias/coverage review
| Review area | Question | Evidence | Example control |
|---|---|---|---|
| Segment coverage | Which customer/product/channel segments are included or absent? | Coverage matrix | Separate online, branch, call center and mobile-wallet paths |
| Edge-case coverage | Which low-frequency high-impact cases are represented? | Scenario list | KYC expired document + address mismatch + manual review |
| Vulnerable customer risk | Does generated language stereotype or mishandle hardship, disability, language barriers or scams? | SME/risk review | Approved vulnerability phrase library and escalation rules |
| Protected/proxy attributes | Are sensitive attributes used directly or inferred through proxies? | Feature review | Remove or restrict proxy-like fields unless approved for fairness testing |
| Label amplification | Does generator reproduce historical biased labels or investigator decisions? | Label distribution and error review | Separate historical label from adjudicated target label |
| Complaint harm framing | Does narrative understate customer harm or overfit company-friendly resolution? | Complaints QA review | Include harm, impact, expectation and remediation fields |
| Credit fairness | Does scenario augmentation distort adverse action reasons or subgroup performance? | Model risk review | Run subgroup utility and reason-code stability checks |
| Fraud false positives | Does simulation over-associate specific behaviors with fraud risk? | Fraud risk review | Balance legitimate lookalike behaviors and require reason codes |
| Language and channel | Does generated data overrepresent polished English/digital confidence? | Linguistic/channel review | Include approved multilingual or low-digital-confidence cases |
Bias decision categories:
| Category | Action |
|---|---|
| Acceptable for limited purpose | Release with documented limitations |
| Needs rebalance | Regenerate or add missing segments before release |
| Needs stronger warning | Release only for narrow test, not model training |
| Reject | Bias amplification creates unacceptable customer, conduct or model risk |
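The coverage matrix named as evidence for segment coverage can start from a simple count per required segment. The sketch below is illustrative: the `channel` field, the segment names and the `min_count` floor are assumptions, and `coverage_gaps` is a hypothetical helper.

```python
from collections import Counter


def coverage_gaps(records, required_segments, key="channel", min_count=1):
    """Return {segment: observed_count} for segments below the minimum count."""
    counts = Counter(r[key] for r in records)
    return {
        seg: counts.get(seg, 0)
        for seg in required_segments
        if counts.get(seg, 0) < min_count
    }


records = [{"channel": "online"}, {"channel": "online"}, {"channel": "branch"}]
required = ["online", "branch", "call center", "mobile-wallet"]
print(coverage_gaps(records, required))
```

A non-empty result maps naturally to the "Needs rebalance" category, or to a documented limitation in the data card if the gap is accepted.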
Template: allowed-use license
Dataset ID: SYN-PAY-FRAUD-SEQUENCE-2026-002
Version: v1.0
Status: Limited release
Approved purpose: Payment fraud rule and warning UX stress testing.
Allowed users: Fraud Risk Analytics, Payment Platform QA, named AI copilot test engineers.
Allowed environments: Enterprise non-production fraud sandbox only.
Allowed operations: Query, aggregate, run fraud rule simulations, run agent workflow tests, create test reports.
Prohibited operations: Production decisioning, customer scoring, external sharing, model pretraining, customer re-identification, linkage to production customer tables, employee performance analytics.
Derivative rule: All derivative samples, embeddings, reports and test fixtures inherit this license and must retain synthetic marker.
Watermark/provenance: Dataset metadata must retain synthetic=true, source_permission_id, generation_run_id and license_id.
Retention: Delete working copies after 90 days. Evidence packet retained under governance retention schedule.
Review trigger: New fraud typology source, vendor release request, model training request, privacy attack finding, incident, expiry.
Approvers: Business Owner, Data Owner, Privacy, Security, Model Risk, Architect.
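A license like the one above only controls use if pipelines can evaluate it. The following sketch shows one possible default-deny check. The dictionary keys, the environment identifier, the operation names, the team label for the named test engineers and the expiry date are all hypothetical stand-ins for the license text, not an existing schema.

```python
from datetime import date

# Hypothetical machine-readable form of the license above.
LICENSE = {
    "dataset_id": "SYN-PAY-FRAUD-SEQUENCE-2026-002",
    "allowed_teams": {"Fraud Risk Analytics", "Payment Platform QA", "AI Copilot Test"},
    "allowed_environments": {"nonprod-fraud-sandbox"},
    "allowed_operations": {
        "query", "aggregate", "fraud_rule_simulation",
        "agent_workflow_test", "test_report",
    },
    "expires_on": date(2026, 12, 31),  # assumed date; the license states no fixed expiry
}


def check_request(lic, team, environment, operation, today):
    """Return a list of denial reasons; an empty list means the request is allowed."""
    reasons = []
    if today > lic["expires_on"]:
        reasons.append("license expired")
    if team not in lic["allowed_teams"]:
        reasons.append(f"team not allowed: {team}")
    if environment not in lic["allowed_environments"]:
        reasons.append(f"environment not allowed: {environment}")
    if operation not in lic["allowed_operations"]:  # default deny
        reasons.append(f"operation not allowed: {operation}")
    return reasons


print(check_request(LICENSE, "Marketing", "production", "customer_scoring", date(2026, 7, 15)))
```

Because anything outside `allowed_operations` is denied, the prohibited operations do not need to be enumerated in code; the prohibited list in the license stays as documentation for human readers.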
License field wording standards:
| License field | Required wording standard |
|---|---|
| Approved purpose | One or two concrete uses, not broad “AI development” |
| Allowed users | Roles and teams; vendors named separately |
| Allowed environments | Named non-production / sandbox / approved external environment |
| Allowed operations | Read, query, transform, test, train, validate, export, embed, index, summarize |
| Prohibited operations | Re-identification, production decisioning, unrelated training, marketing, resale, external sharing |
| Derivative rule | Defines whether embeddings, prompts, labels, reports and transformed files inherit restrictions |
| Watermark/provenance | Required markers and metadata |
| Retention | Expiry, deletion proof, evidence retention |
| Review trigger | Events that force reapproval |
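The derivative rule and the watermark/provenance field work together: every embedding, report or fixture derived from the dataset should carry the same markers. A minimal sketch of that inheritance follows. The four marker names come from the license example, while `derive_metadata`, `derivative_type` and `derived_from` are hypothetical names introduced here.

```python
# Marker names as listed in the license example's watermark/provenance line.
REQUIRED_MARKERS = ("synthetic", "source_permission_id", "generation_run_id", "license_id")


def derive_metadata(parent, derivative_type):
    """Build metadata for a derivative that inherits the parent's markers."""
    missing = [k for k in REQUIRED_MARKERS if k not in parent]
    if missing:
        raise ValueError(f"parent metadata lacks markers: {missing}")
    if parent["synthetic"] is not True:
        raise ValueError("parent is not marked synthetic=true")
    child = {k: parent[k] for k in REQUIRED_MARKERS}
    child["derivative_type"] = derivative_type       # e.g. "embedding", "report"
    child["derived_from"] = parent.get("dataset_id")  # lineage pointer
    return child


parent = {
    "dataset_id": "SYN-PAY-FRAUD-SEQUENCE-2026-002",
    "synthetic": True,
    "source_permission_id": "SP-EXAMPLE",   # placeholder IDs
    "generation_run_id": "RUN-EXAMPLE",
    "license_id": "LIC-EXAMPLE",
}
print(derive_metadata(parent, "embedding"))
```

Refusing to create a derivative when the parent lacks a marker is a design choice of this sketch; it surfaces provenance loss at the point where it would otherwise propagate silently.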
Template: release evidence packet
| Evidence item | G1 | G2 | G3 | G4 |
|---|---|---|---|---|
| Use-case intake | Required | Required | Required | Required |
| Risk tier rationale | Required | Required | Required | Required |
| Source permission matrix | Summary | Required | Required | Required |
| Data classification | Required | Required | Required | Required |
| Generation plan and method | Summary | Required | Required | Required |
| Generator version / prompts / rules | Optional | Required | Required | Required |
| Privacy attack checklist | Basic scan | Required | Required with report | Required with report |
| Utility/fidelity scorecard | Basic SME review | Required | Required | Required |
| Bias/coverage review | Optional | Required for customer-like data | Required | Required |
| Data card | Required | Required | Required | Required |
| Allowed-use license | Required | Required | Required | Required |
| Watermark/provenance evidence | Recommended | Required | Required | Required |
| Security environment evidence | Optional | Required | Required | Required |
| Vendor/contract evidence | Not needed | If vendor involved | If vendor involved | Required |
| Residual risk owner | Optional | Required for exceptions | Required | Required |
| Retention/deletion plan | Required | Required | Required | Required |
| Monitoring plan | Optional | Required | Required | Required |
| Approver record | Required | Required | Required | Required |
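The evidence table can be encoded so that a release request is checked for completeness before it reaches approvers. The sketch below is one possible encoding, with assumptions stated plainly: only cells that read "Required" (including "Required with report") are enforced, while summary, basic, optional, recommended and conditional cells are left to reviewer judgment. The item keys and `missing_evidence` are hypothetical names.

```python
# One character per gate, in the order G1, G2, G3, G4.
# R = required; S = summary, B = basic, O = optional or recommended,
# C = conditional, N = not needed. Only R is enforced in this sketch.
EVIDENCE_MATRIX = {
    "use_case_intake": "RRRR",
    "risk_tier_rationale": "RRRR",
    "source_permission_matrix": "SRRR",
    "data_classification": "RRRR",
    "generation_plan": "SRRR",
    "generator_version_prompts_rules": "ORRR",
    "privacy_attack_checklist": "BRRR",
    "utility_fidelity_scorecard": "BRRR",
    "bias_coverage_review": "OCRR",
    "data_card": "RRRR",
    "allowed_use_license": "RRRR",
    "watermark_provenance_evidence": "ORRR",
    "security_environment_evidence": "ORRR",
    "vendor_contract_evidence": "NCCR",
    "residual_risk_owner": "OCRR",
    "retention_deletion_plan": "RRRR",
    "monitoring_plan": "ORRR",
    "approver_record": "RRRR",
}
GATES = ("G1", "G2", "G3", "G4")


def missing_evidence(gate, provided):
    """Return required evidence items for the gate that are not in `provided`."""
    col = GATES.index(gate)
    return [
        item for item, codes in EVIDENCE_MATRIX.items()
        if codes[col] == "R" and item not in provided
    ]


print(missing_evidence("G4", set(EVIDENCE_MATRIX) - {"vendor_contract_evidence"}))
```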
Release decision log:
| Decision | When to use |
|---|---|
| Approved | All hard gates pass, limitations documented |
| Approved with restrictions | Minor gaps controlled by narrower use, shorter retention, smaller audience or stronger monitoring |
| Rework required | Utility/fidelity/bias/privacy evidence incomplete or fixable |
| Rejected | Privacy leakage, source permission failure, unacceptable bias, misleading utility or prohibited downstream use |
PM/BA/architecture questions
Product questions
| Question | What a strong answer shows |
|---|---|
| What business decision does this synthetic dataset support? | Clear purpose instead of generic AI experimentation |
| Who will consume it and what action will they take? | User, workflow and environment clarity |
| What would be harmed if the synthetic data is wrong? | Customer, operational, model, conduct and compliance impact |
| What is explicitly out of scope? | Prevention of purpose creep |
| What would make this dataset no longer fit for use? | Expiry, drift, policy change, source change |
BA questions
| Question | What a strong answer shows |
|---|---|
| Which process steps, exceptions and controls must be represented? | Real workflow grounding |
| Which business rules must never be violated? | Test oracle and rule validity |
| Which labels or expected outputs are authoritative? | Prevents synthetic label drift |
| What limitations must users see before consuming the dataset? | Data card discipline |
| Which stakeholders must approve changes to scope? | Governance routing |
Architecture questions
| Question | What a strong answer shows |
|---|---|
| Where is synthetic generation executed and logged? | Environment and evidence design |
| How are generator prompts, rules, seeds and versions managed? | Reproducibility and lineage |
| How does watermark/provenance survive transformations and embeddings? | Downstream control |
| Can synthetic data enter RAG indexes, agent tools or model training pipelines? | Typed access and license enforcement |
| How will unauthorized use be detected? | Access logs, catalog policy, pipeline checks |
| What is the deletion and renewal mechanism? | Retention operationalization |
Privacy/risk questions
| Question | What a strong answer shows |
|---|---|
| Can an attacker infer source membership, identity or sensitive attributes? | Privacy attack thinking |
| What quasi-identifiers or rare patterns remain? | Linkage risk awareness |
| Does the dataset amplify historical bias or false-positive patterns? | Fairness and conduct risk |
| Who accepts residual risk and when does it expire? | Accountability |
| What evidence would satisfy internal audit? | Replayable evidence packet |
Release checklist
Use this checklist before any G2+ release.
| # | Check | Pass condition |
|---|---|---|
| 1 | Use case approved | Business purpose, target users, allowed/prohibited uses are documented |
| 2 | Risk tier assigned | G-level is justified by data class, use, audience and environment |
| 3 | Source permission complete | Data Owner, Privacy/Legal/Compliance as needed approve source use |
| 4 | Generation plan approved | Method, generator, prompts/rules, source windows and exclusions are recorded |
| 5 | Environment controlled | Generation and storage environment meets security requirements |
| 6 | Privacy attacks executed | Required tests pass or exceptions are remediated/restricted |
| 7 | Utility measured | Dataset supports the approved task with evidence |
| 8 | Fidelity measured | Relevant domain structure is preserved without unsafe copying |
| 9 | Bias/coverage reviewed | Coverage gaps and amplification risks are accepted or corrected |
| 10 | Data card complete | Purpose, source summary, methods, tests, limitations and owner are visible |
| 11 | License attached | Allowed uses, prohibited uses, derivative rules, retention and expiry are explicit |
| 12 | Watermark/provenance applied | Metadata markers and generation lineage are retained |
| 13 | Access controls configured | Only approved users/environments can access |
| 14 | Retention/deletion scheduled | Working copies, derivatives and evidence retention are defined |
| 15 | Release evidence packet stored | All approval and evidence artifacts are linked |
| 16 | Monitoring enabled | Access, use, exceptions and expiry are tracked |
| 17 | Reopen triggers defined | New use, vendor release, training use, attack finding, incident or expiry triggers review |
Hard-stop conditions:
- Source permission is unclear or denied.
- Direct identifiers or near-copies remain.
- Membership inference, linkage or inversion risk exceeds threshold.
- Utility evidence does not support requested use.
- Bias review identifies unacceptable customer or conduct risk.
- Downstream use includes production decisioning without explicit high-risk approval.
- Vendor release lacks contract, retention and deletion evidence.
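The hard-stop conditions above can be evaluated as a fail-closed check at the release gate. This is an illustrative sketch: the flag keys are hypothetical, and treating a flag that was never evaluated as triggered is an assumption chosen so that missing evidence blocks release rather than passing silently.

```python
# Keys are hypothetical; the texts mirror the hard-stop list.
HARD_STOPS = {
    "source_permission_unclear_or_denied": "Source permission is unclear or denied",
    "identifiers_or_near_copies_remain": "Direct identifiers or near-copies remain",
    "attack_risk_exceeds_threshold": "Membership inference, linkage or inversion risk exceeds threshold",
    "utility_evidence_insufficient": "Utility evidence does not support requested use",
    "unacceptable_bias_risk": "Bias review identifies unacceptable customer or conduct risk",
    "production_decisioning_without_approval": "Production decisioning without explicit high-risk approval",
    "vendor_release_lacks_evidence": "Vendor release lacks contract, retention and deletion evidence",
}


def triggered_hard_stops(flags):
    """Return the hard stops that block release; unevaluated flags count as triggered."""
    return [text for key, text in HARD_STOPS.items() if flags.get(key, True)]


all_clear = {key: False for key in HARD_STOPS}
print(triggered_hard_stops(all_clear))
```

For a release with no vendor involvement, the reviewer would explicitly set `vendor_release_lacks_evidence` to `False` rather than omit it, which leaves a record that the condition was considered.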
Executive narrative
1-minute version
Synthetic data can accelerate AI delivery, but only if it is governed as a data product. The main risk is not that the data is fake; the risk is that teams treat it as automatically safe, then use it for training, testing, vendor sharing or analytics beyond its permission boundary. Our architecture introduces risk-tiered release gates: source permission, generation controls, privacy attack testing, utility/fidelity scoring, bias review, data card, allowed-use license, provenance and retention. This lets us use synthetic AML, KYC, credit, fraud, contact center and complaint datasets safely enough for approved purposes while preserving audit evidence.
CTO / CDO version
The platform decision is to build a synthetic data control plane, not a loose generation toolkit. The control plane standardizes intake, generator registry, source permission, privacy attack workbench, scorecards, catalog, license enforcement, watermark/provenance and expiry. Engineering gets reusable datasets and faster test coverage. Data governance gets a record of what was generated, from which source, for which use, and under which restrictions. Risk teams get evidence that privacy, utility, fidelity and bias were measured before release. This reduces duplicated scripts, vendor sandbox sprawl and hidden training-data contamination.
CRO / Privacy version
Synthetic data does not remove privacy or conduct risk by definition. The release gate must prove that the dataset cannot reasonably expose real customers, rare cases, documents, calls or transaction paths, and that it will not be used outside the approved purpose. The evidence package includes source permission, membership inference and linkage testing, model inversion review where relevant, bias/coverage review, residual risk owner, retention and deletion plan. For high-risk uses, approval is conditional and time-bound.
Interview drills
Drill 1: “Is synthetic data automatically safe to use with vendors?”
30-second answer:
Synthetic data is not automatically safe. I would first check source permission, whether the vendor contract allows derivative data, and whether the dataset passed privacy attack testing. For vendor release I would require a G4 gate: data card, allowed-use license, watermark/provenance, access restrictions, retention period, deletion proof and explicit prohibition on model training or onward sharing unless approved.
2-minute expansion:
The key mistake is assuming that removing direct identifiers or generating records from a model eliminates privacy risk. Synthetic data can preserve rare transaction patterns, complaint narratives, KYC document artifacts or fraud graph structures that support linkage or membership inference. For vendor use, I would narrow the purpose, reduce fidelity where needed, test for nearest-neighbor similarity and PII leakage, confirm contractual restrictions, and require evidence that the vendor cannot reuse the data for unrelated model training. If the vendor needs higher fidelity, that is a risk trade-off requiring stronger controls, not an informal data share.
Drill 2: “How do you balance privacy, utility and fidelity?”
30-second answer:
I treat privacy, utility and fidelity as separate gate dimensions, not one averaged score. Privacy asks whether real people or cases can be inferred. Utility asks whether the dataset supports the approved task. Fidelity asks whether it preserves the relevant business structure without copying protected facts. Any hard failure can block or narrow release.
2-minute expansion:
For example, payment fraud simulation needs realistic transaction timing, device changes and merchant patterns, but copying exact rare fraud sequences can expose source cases. I would abstract typologies, generate plausible variants, suppress unique combinations, and run linkage and nearest-neighbor tests. Then I would measure utility through rule trigger coverage and investigator realism scores, and fidelity through sequence validity. If privacy is strong but utility is weak, the dataset may be suitable for demos but not rule validation. If utility is strong but privacy fails, it cannot be released until regenerated or restricted.
Drill 3: “What makes a synthetic dataset fit for credit model augmentation?”
30-second answer:
It needs source permission, model-risk approval, privacy attack evidence, policy-valid features, subgroup coverage, reason-code stability and clear limits. I would not let synthetic credit data become production training input unless Model Risk, Privacy, Data Owner and Business Owner approve the use and evidence shows it does not amplify bias or distort adverse action explanations.
2-minute expansion:
Credit is high impact because synthetic data can alter model boundaries and fairness behavior. I would define whether the dataset is for sensitivity analysis, challenger testing, rejected-inference exploration or actual training. The scorecard would include distribution fidelity, business rule validity, subgroup performance, proxy-feature review, label provenance, adverse action reason stability and membership/linkage testing. The license would prohibit production scoring and unrelated use unless explicitly approved. This protects the organization from treating synthetic scenarios as ground truth.
Drill 4: “How would you govern contact center transcript generation?”
30-second answer:
I would avoid sending raw transcripts to an uncontrolled LLM. I would use approved intent labels, redacted summaries and policy constraints to generate synthetic conversations, then run PII scans, phrase similarity checks, vulnerability-language review and utility testing against copilot QA or intent classification tasks. The data card must say it is synthetic and cannot be used as real customer evidence.
2-minute expansion:
Contact center transcripts contain authentication phrases, account details, hardship signals, health or vulnerability disclosures and employee behavior. The generator should operate in a controlled environment using minimized source features. Utility should be measured by intent classification accuracy, escalation coverage and SME realism. Bias review should check that the synthetic language does not stereotype customers with low digital confidence, accents, vulnerability markers or complaint intensity. The license should prohibit employee performance analytics, sentiment profiling beyond approved QA, and any attempt to reconstruct real calls.
Drill 5: “What evidence would you show internal audit?”
30-second answer:
I would show the intake, source permission matrix, data classification, generation plan, privacy attack report, utility/fidelity scorecard, bias review, data card, allowed-use license, watermark/provenance evidence, approval record, residual risk owner, retention plan and access logs. The goal is to replay why the dataset was created, why it was safe enough for that use, who approved it and how misuse is controlled.
2-minute expansion:
Audit does not need a story that “synthetic means safe”; it needs evidence. I would connect the dataset ID to PRD, ADR, release gate and downstream systems. The evidence packet should show source owners approved the purpose, privacy tests were run, utility was measured against the actual task, limitations were disclosed, and access was limited to approved environments. If the dataset was shared externally, I would add contract terms, deletion proof and vendor access logs. If it was used for model training or validation, I would include model-risk sign-off and monitoring thresholds.
Source anchors
| Source | Link | How to use this anchor |
|---|---|---|
| UK ICO synthetic data guidance | https://ico.org.uk/about-the-ico/research-reports-impact-and-evaluation/research-and-reports/technology-and-innovation/synthetic-data/ | Ground synthetic data privacy, utility and governance discussion |
| NIST Privacy Framework | https://www.nist.gov/privacy-framework | Structure privacy risk management, controls and evidence |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Map synthetic data risks to AI governance, measurement and management |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Align with AI management system operations, roles and continual improvement |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | Align with AI risk management lifecycle |
| W3C PROV | https://www.w3.org/TR/prov-overview/ | Represent synthetic data provenance through entities, activities and agents |