AI Data Lifecycle Governance: Provenance, Retention, Deletion
Summary:
A guide to AI data lifecycle governance, provenance, and retention.
Audience: AI Data Product Manager / Data Architect / AI Governance Lead / Privacy Architect / Senior BA.
Core problem: AI data does not live only in the training set. Prompts, retrieved context, tool results, eval samples, judge outputs, human feedback, traces, incident records, and memory are all part of the AI data lifecycle. Without lifecycle governance, an AI system accumulates privacy, bias, drift, audit, and deletion risk.
Learning goal: Build source-to-prompt-to-output-to-feedback-to-evidence data lifecycle governance that covers provenance, retention, deletion, minimization, quality, and audit queries.
Source Anchors
| Source | Link | Purpose |
|---|---|---|
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Bring data governance into AI risk management |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Bring data, accountability, and continual improvement into the AI management system |
| W3C PROV | https://www.w3.org/TR/prov-overview/ | Reference for the entity, activity, agent view of provenance |
| NIST Privacy Framework | https://www.nist.gov/privacy-framework | Reference for privacy risk management, data processing, and organizational controls |
| JSON Schema | https://json-schema.org/ | Define structural constraints for data objects, logs, and evidence objects |
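In one sentence:
AI data lifecycle governance means knowing, for every AI data object, where it came from, why it is used, who processes it, how long it is kept, when it is deleted, and how to prove all of that.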
1. AI Data Is More Than Training Data
| Data object | Risk |
|---|---|
| Source documents | stale, unauthorized, low quality |
| Chunks / embeddings | permission leakage, deletion complexity |
| Prompts | PII, confidential data, prompt injection |
| Retrieved context | over-sharing, wrong source |
| Tool results | sensitive operational data |
| Model outputs | hallucinated or regulated content |
| Eval samples | production data reuse, leakage |
| Judge outputs | biased ratings |
| Human feedback | personal opinions, labor data |
| Trace/logs | retention and audit exposure |
| Memory | consent, deletion, purpose creep |
| Incident records | sensitive evidence handling |
2. Lifecycle Stages
source creation
-> ingestion
-> transformation/chunking/embedding
-> retrieval/prompt assembly
-> model/tool processing
-> output
-> feedback/eval
-> monitoring/evidence
-> retention/deletion
Each stage needs:
- owner
- purpose
- data classification
- legal/regulatory basis
- quality control
- access control
- retention rule
- deletion mechanism
- evidence
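
The checklist above can be turned into a simple completeness check per stage. The following is a minimal sketch, not a prescribed schema: the `StageControls` class, its field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class StageControls:
    """Governance controls for one lifecycle stage (hypothetical field names)."""
    owner: str = ""
    purpose: str = ""
    data_classification: str = ""
    legal_basis: str = ""
    quality_control: str = ""
    access_control: str = ""
    retention_rule: str = ""
    deletion_mechanism: str = ""
    evidence: str = ""


def missing_controls(controls: StageControls) -> list:
    """Return the names of controls that are still empty for a stage."""
    return [f.name for f in fields(controls) if not getattr(controls, f.name).strip()]


# Illustrative values for the ingestion stage; four of nine controls are filled in.
ingestion = StageControls(
    owner="knowledge owner",
    purpose="retrieval",
    data_classification="internal",
    retention_rule="until source retired",
)
gaps = missing_controls(ingestion)
print(gaps)
```

A check like this could run whenever a stage is added or changed, so that gaps surface before data starts flowing through the stage.
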
3. Provenance Graph
W3C PROV describes provenance with three core concepts:
| Concept | AI example |
|---|---|
| Entity | source doc, chunk, prompt, output, eval report |
| Activity | ingestion, retrieval, generation, judging, review |
| Agent | user, model, reviewer, service, vendor |
Example:
answer:001 wasGeneratedBy generation:run-777
generation:run-777 used prompt:abc
prompt:abc wasDerivedFrom context:chunk-55
context:chunk-55 wasDerivedFrom source:policy-v12
source:policy-v12 wasAttributedTo knowledgeOwner:retail-policy
answer:001 wasReviewedBy reviewer:agent-supervisor
Provenance lets you answer:
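- Which source influenced this answer.
- Which prompt/model/tool version produced the output.
- Which evals used production data.
- When a data object must be deleted, which indexes, logs, and evidence it affects.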
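
Two of these questions (which source influenced an answer, and what a deletion would affect) are graph traversals in opposite directions. The sketch below is one possible way to express them, assuming provenance is stored as plain (subject, relation, object) triples. The identifiers are modeled loosely on the example above, and `answer:002`, `generation:run-778`, and `prompt:def` are hypothetical additions. Every relation is treated as a generic "depends on" edge, and agent relations are left out.

```python
from collections import defaultdict

# Each triple reads: subject depends on object. Illustrative data only.
TRIPLES = [
    ("answer:001", "wasGeneratedBy", "generation:run-777"),
    ("generation:run-777", "used", "prompt:abc"),
    ("prompt:abc", "wasDerivedFrom", "context:chunk-55"),
    ("context:chunk-55", "wasDerivedFrom", "source:policy-v12"),
    ("answer:002", "wasGeneratedBy", "generation:run-778"),
    ("generation:run-778", "used", "prompt:def"),
    ("prompt:def", "wasDerivedFrom", "context:chunk-55"),
]


def build_indexes(triples):
    """Build adjacency maps for both traversal directions."""
    upstream = defaultdict(set)    # node -> what it depends on
    downstream = defaultdict(set)  # node -> what depends on it
    for subject, _relation, obj in triples:
        upstream[subject].add(obj)
        downstream[obj].add(subject)
    return upstream, downstream


def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


upstream, downstream = build_indexes(TRIPLES)

# Which sources influenced answer:001?
sources_for_answer = sorted(
    n for n in reachable("answer:001", upstream) if n.startswith("source:")
)

# Which objects are affected if source:policy-v12 must be deleted?
impacted_by_source = sorted(reachable("source:policy-v12", downstream))
```

Keeping both indexes makes the audit query (walk upstream) and the deletion impact query (walk downstream) equally cheap.
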
4. Retention and Deletion Architecture
| Data object | Retention question |
|---|---|
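| prompt | Does it contain PII? Must it be retained for audit? |
| retrieved context | Store the full text or only source ids? |
| output | Is it customer-facing? Is it a regulated communication? |
| trace | How long is it kept? Is it masked? |
| feedback | May it be used for model improvement? |
| memory | Did the user consent? How is it deleted? |
| eval dataset | May it be kept long term? Should it be replaced with synthetic data? |
| evidence bundle | How long do regulation and audit require it to be kept? |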
Architecture patterns:
- Store source ids instead of raw sensitive context when possible.
- Separate audit evidence from product analytics.
- Use retention tags by data classification and risk tier.
- Build deletion jobs for indexes, caches, memory, and logs where feasible.
- Record deletion evidence.
- Define exceptions for legal hold and audit retention.
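
Three of these patterns (retention tags, the legal hold exception, and deletion evidence) fit together in a few lines. This is a minimal sketch under stated assumptions: the retention periods, the `Record` fields, and the shape of the evidence entry are all hypothetical, and the actual delete call against a store is out of scope.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention periods in days, keyed by (classification, risk tier).
RETENTION_DAYS = {
    ("confidential", "high"): 30,
    ("confidential", "low"): 90,
    ("internal", "high"): 90,
    ("internal", "low"): 365,
}


@dataclass
class Record:
    record_id: str
    classification: str
    risk_tier: str
    created: date
    legal_hold: bool = False


def deletion_due(record: Record) -> Optional[date]:
    """Return the date the record becomes eligible for deletion, or None while on legal hold."""
    if record.legal_hold:
        return None
    days = RETENTION_DAYS[(record.classification, record.risk_tier)]
    return record.created + timedelta(days=days)


def evidence_if_due(record: Record, today: date) -> Optional[dict]:
    """Return the evidence entry a deletion job would record, or None if nothing is due."""
    due = deletion_due(record)
    if due is None or today < due:
        return None
    return {
        "record_id": record.record_id,
        "action": "delete",
        "due": due.isoformat(),
        "executed": today.isoformat(),
    }


trace = Record("trace-1", "confidential", "high", date(2024, 1, 1))
held = Record("trace-2", "confidential", "high", date(2024, 1, 1), legal_hold=True)
today = date(2024, 2, 1)

trace_entry = evidence_if_due(trace, today)  # due, so an evidence entry is produced
held_entry = evidence_if_due(held, today)    # on legal hold, so nothing is deleted
```

Returning the evidence entry from the same function that makes the decision keeps the audit record and the deletion rule from drifting apart.
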
5. Data Minimization for AI
AI data minimization is an architectural concern, with a different tactic at each layer:
| Layer | Minimization tactic |
|---|---|
| Intake | collect only needed user state |
| Retrieval | metadata filter and top-k control |
| Prompt | redact or summarize sensitive fields |
| Tool | least-privilege tool scope |
| Logging | mask sensitive fields |
| Eval | synthetic or de-identified data |
| Feedback | separate rating from personal identity |
| Memory | opt-in and purpose-bound |
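
The Logging and Prompt rows can be illustrated with a trace record that keeps source ids instead of chunk text and masks sensitive values before anything is written. This is a sketch only: the email regex is a simplistic stand-in for a real PII detector, and `SENSITIVE_KEYS`, `minimal_trace`, and the field names are hypothetical.

```python
import re

# Simplistic email pattern used as a stand-in for a real PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical names of structured fields that must never reach the log.
SENSITIVE_KEYS = {"account_number", "national_id"}


def minimal_trace(prompt: str, chunks: list, attributes: dict) -> dict:
    """Build a trace record that stores references and masked values only."""
    return {
        # Redact email addresses in the free-text prompt.
        "prompt": EMAIL.sub("[EMAIL]", prompt),
        # Keep source ids only; the chunk text is not copied into the trace.
        "source_ids": [chunk["source_id"] for chunk in chunks],
        # Mask structured fields that are known to be sensitive.
        "attributes": {
            key: ("[MASKED]" if key in SENSITIVE_KEYS else value)
            for key, value in attributes.items()
        },
    }


trace = minimal_trace(
    prompt="Contact me at jane@example.com about my card",
    chunks=[{"source_id": "policy-v12", "text": "Card replacement takes 5 days."}],
    attributes={"account_number": "12345678", "channel": "chat"},
)
```

Because the trace holds only `source_ids`, retiring or deleting a source later does not require scrubbing copies of its text out of the logs.
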
6. Financial Retail Case: Customer Service RAG
| Lifecycle object | Governance |
|---|---|
| policy document | owner, effective date, source authority |
| chunk | source id, permission, version |
| embedding | rebuild when source changes |
| prompt | PII redaction, context budget |
| model output | citation required, stored if customer-facing |
| feedback | taxonomy, reviewer calibration |
| trace | retention by risk tier |
| incident | evidence binder and legal hold rules |
Deletion example:
source doc retired
-> mark source inactive
-> remove chunks from retrieval
-> rebuild index
-> preserve release evidence if required
-> log deletion/rebuild evidence
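
The retirement flow above can be sketched against an in-memory knowledge base. Everything here is an assumption for illustration: the `KnowledgeBase` class, its fields, and the evidence entry format are hypothetical, and a real system would call its own vector index and evidence store. Existing entries in the evidence log are left untouched, which is how this sketch preserves earlier release evidence.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    sources: dict                                   # source_id -> {"active": bool}
    chunks: dict                                    # chunk_id -> source_id
    index: set = field(default_factory=set)         # chunk ids currently retrievable
    evidence_log: list = field(default_factory=list)

    def rebuild_index(self) -> None:
        """Rebuild the retrievable set from chunks whose source is still active."""
        self.index = {
            chunk_id
            for chunk_id, source_id in self.chunks.items()
            if self.sources[source_id]["active"]
        }

    def retire_source(self, source_id: str) -> list:
        """Mark a source inactive, remove its chunks, rebuild, and log evidence."""
        self.sources[source_id]["active"] = False
        removed = sorted(c for c, s in self.chunks.items() if s == source_id)
        for chunk_id in removed:
            del self.chunks[chunk_id]
        self.rebuild_index()
        # Append only: earlier evidence entries are preserved.
        self.evidence_log.append(
            {"event": "source_retired", "source_id": source_id, "removed_chunks": removed}
        )
        return removed


kb = KnowledgeBase(
    sources={"policy-v11": {"active": True}, "policy-v12": {"active": True}},
    chunks={"chunk-54": "policy-v11", "chunk-55": "policy-v12"},
)
kb.rebuild_index()
removed = kb.retire_source("policy-v11")
```

After the call, only chunks from active sources remain retrievable, and the evidence log shows which chunks were removed and why.
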
7. Templates
Data Lifecycle Inventory
| Data object | Owner | Purpose | Classification | Retention | Deletion | Evidence |
|---|---|---|---|---|---|---|
| prompt | app owner | model input | confidential | 30 days / audit exception | log deletion job | trace policy |
| chunk | knowledge owner | retrieval | internal/confidential | until source retired | index rebuild | source lineage |
| eval sample | EvalOps | regression | de-identified | controlled | dataset version | eval report |
Provenance Table
| Output | Used prompt | Used source | Model | Tool | Reviewer | Evidence |
|---|---|---|---|---|---|---|
| answer id | prompt version | source ids | model route | tool calls | reviewer id | trace id |
Retention/Deletion Matrix
| Object | Default retention | Trigger | Action |
|---|---|---|---|
| memory | purpose-bound | user deletion | delete + proof |
| embeddings | source lifecycle | source retired | rebuild index |
| trace | risk-tiered | retention expiry | mask/delete |
| evidence | audit/legal | retention expiry | archive/delete |
8. Common Failure Modes
| Failure mode | Fix |
|---|---|
| Only training data governed | include prompt/context/output/feedback/logs |
| Embeddings forgotten | include index rebuild and deletion strategy |
| Feedback reused without purpose | feedback governance and consent |
| Logs over-collect | data minimization and masking |
| Evidence conflicts with deletion | retention exception and legal hold policy |
| No provenance | source-to-output trace |
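9. Interview Talking Points
30-second version:
AI data lifecycle governance is more than training data governance. I would put source docs, chunks, embeddings, prompts, retrieved context, tool results, outputs, eval samples, feedback, logs, memory, and evidence into a lifecycle inventory, and define owner, purpose, classification, retention, deletion, and provenance for each type of data.
2-minute version:
Take a customer service RAG as the example. Policy documents have an owner, an effective date, and a source authority; chunks carry source id, permission, and version; prompts go through PII redaction; outputs must carry citations; traces record prompt/model/source/tool versions; feedback has purpose limits; once a source is retired, it is marked inactive, its chunks are removed, the index is rebuilt, and deletion evidence is saved. This supports audits and also handles privacy, deletion, and knowledge updates.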