
AI Data Lifecycle Governance: Provenance, Retention, Deletion

218ai-foundations/papers/98-ai-data-lifecycle-governance-provenance-retention.md

AI Data Lifecycle Governance / Provenance / Retention: A Reading Guide

Audience: AI Data Product Manager / Data Architect / AI Governance Lead / Privacy Architect / Senior BA.

Core problem: AI data does not live only in the training set. Prompts, retrieved context, tool results, eval samples, judge outputs, human feedback, traces, incident records, and memory are all part of the AI data lifecycle. Without lifecycle governance, an AI system accumulates privacy, bias, drift, audit, and deletion risk.

Learning objective: Build source-to-prompt-to-output-to-feedback-to-evidence data lifecycle governance that covers provenance, retention, deletion, minimization, quality, and audit queries.


Source Anchors

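| Source | Link | Purpose |
| --- | --- | --- |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Bring data governance into AI risk management |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Bring data, accountability, and continual improvement into the AI management system |
| W3C PROV | https://www.w3.org/TR/prov-overview/ | Reference for thinking about provenance as entity, activity, and agent |
| NIST Privacy Framework | https://www.nist.gov/privacy-framework | Reference for privacy risk management, data processing, and organizational controls |
| JSON Schema | https://json-schema.org/ | Define structural constraints for data objects, logs, and evidence objects |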

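In one sentence:

AI data lifecycle governance means knowing, for every AI data object, where it came from, why it is used, who processes it, how long it is retained, when it is deleted, and how all of that can be proven.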


1. AI Data Is More Than Training Data

| Data object | Risk |
| --- | --- |
| Source documents | stale, unauthorized, low quality |
| Chunks / embeddings | permission leakage, deletion complexity |
| Prompts | PII, confidential data, prompt injection |
| Retrieved context | over-sharing, wrong source |
| Tool results | sensitive operational data |
| Model outputs | hallucinated or regulated content |
| Eval samples | production data reuse, leakage |
| Judge outputs | biased ratings |
| Human feedback | personal opinions, labor data |
| Trace/logs | retention and audit exposure |
| Memory | consent, deletion, purpose creep |
| Incident records | sensitive evidence handling |

2. Lifecycle Stages

source creation
  -> ingestion
  -> transformation/chunking/embedding
  -> retrieval/prompt assembly
  -> model/tool processing
  -> output
  -> feedback/eval
  -> monitoring/evidence
  -> retention/deletion

Each stage needs:

  • owner
  • purpose
  • data classification
  • legal/regulatory basis
  • quality control
  • access control
  • retention rule
  • deletion mechanism
  • evidence

3. Provenance Graph

W3C PROV thinking:

| Concept | AI example |
| --- | --- |
| Entity | source doc, chunk, prompt, output, eval report |
| Activity | ingestion, retrieval, generation, judging, review |
| Agent | user, model, reviewer, service, vendor |

Example:

answer:001 wasGeneratedBy generation:run-777
generation:run-777 used prompt:abc
prompt:abc used context:chunk-55
context:chunk-55 wasDerivedFrom source:policy-v12
source:policy-v12 wasAttributedTo knowledgeOwner:retail-policy
answer:001 wasReviewedBy reviewer:agent-supervisor

Provenance lets you answer:

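  • Which source influenced this answer.
  • Which prompt/model/tool version produced the output.
  • Which evals used production data.
  • When a data object must be deleted, which indexes, logs, and evidence are affected.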
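
The first and last questions can be sketched as graph traversals. This is a minimal illustration, assuming the example statements are stored as plain (subject, relation, object) triples with the chunk written as `context:chunk-55` throughout; `build_index` and `reachable` are hypothetical helper names, and `wasReviewedBy` is the example's own relation rather than a W3C PROV term.

```python
from collections import defaultdict, deque

# Triples taken from the example above (assumed storage format).
EDGES = [
    ("answer:001", "wasGeneratedBy", "generation:run-777"),
    ("generation:run-777", "used", "prompt:abc"),
    ("prompt:abc", "used", "context:chunk-55"),
    ("context:chunk-55", "wasDerivedFrom", "source:policy-v12"),
    ("source:policy-v12", "wasAttributedTo", "knowledgeOwner:retail-policy"),
    ("answer:001", "wasReviewedBy", "reviewer:agent-supervisor"),
]


def build_index(edges):
    """Return (upstream, downstream) adjacency maps for the triples."""
    upstream = defaultdict(set)    # node -> things it depends on
    downstream = defaultdict(set)  # node -> things that depend on it
    for subj, _relation, obj in edges:
        upstream[subj].add(obj)
        downstream[obj].add(subj)
    return upstream, downstream


def reachable(start, index):
    """Breadth-first search: every node reachable from start in the given map."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in index.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_index(EDGES)

# Which source influenced this answer?
sources_for_answer = sorted(
    n for n in reachable("answer:001", upstream) if n.startswith("source:")
)

# If this source must be deleted, which derived objects are affected?
impacted_by_source = sorted(reachable("source:policy-v12", downstream))

print(sources_for_answer)
print(impacted_by_source)
```

Keeping both directions indexed means an audit query (upstream) and a deletion impact query (downstream) use the same traversal.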

4. Retention and Deletion Architecture

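| Data object | Retention question |
| --- | --- |
| prompt | Does it contain PII? Must it be retained for audit? |
| retrieved context | Store the full text or only source ids? |
| output | Is it customer-facing? Is it a regulated communication? |
| trace | How long is it kept? Is it masked? |
| feedback | May it be used for model improvement? |
| memory | Did the user consent? How is it deleted? |
| eval dataset | Can it be retained long term? Should it be synthesized? |
| evidence bundle | How long do regulation and audit require it to be kept? |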

Architecture patterns:

  • Store source ids instead of raw sensitive context when possible.
  • Separate audit evidence from product analytics.
  • Use retention tags by data classification and risk tier.
  • Build deletion jobs for index, cache, memory, and logs where feasible.
  • Record deletion evidence.
  • Define exceptions for legal hold and audit retention.
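
The retention-tag and exception patterns above can be sketched as a small decision function. This is only an illustration: the retention periods, the `StoredObject` fields, and the decision labels are hypothetical placeholders, not values taken from any standard or policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention periods in days, keyed by (classification, risk tier).
RETENTION_DAYS = {
    ("confidential", "high"): 30,
    ("confidential", "low"): 90,
    ("internal", "high"): 90,
    ("internal", "low"): 365,
}


@dataclass
class StoredObject:
    object_id: str
    classification: str
    risk_tier: str
    created_on: date
    legal_hold: bool = False                   # exception: legal hold
    audit_retain_until: Optional[date] = None  # exception: audit retention


def deletion_decision(obj: StoredObject, today: date) -> str:
    """Return 'delete' or a 'retain:<reason>' label for one stored object."""
    # Exceptions are checked first so they always override the default rule.
    if obj.legal_hold:
        return "retain:legal_hold"
    if obj.audit_retain_until is not None and today <= obj.audit_retain_until:
        return "retain:audit"
    days = RETENTION_DAYS[(obj.classification, obj.risk_tier)]
    expires_on = obj.created_on + timedelta(days=days)
    return "delete" if today >= expires_on else "retain:within_retention"


trace = StoredObject("trace-1", "confidential", "high", date(2024, 1, 1))
print(deletion_decision(trace, date(2024, 2, 15)))
```

A deletion job would call this per object and write the returned label, the object id, and the run date into a deletion evidence record.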

5. Data Minimization for AI

AI data minimization is architectural:

| Layer | Minimization tactic |
| --- | --- |
| Intake | collect only needed user state |
| Retrieval | metadata filter and top-k control |
| Prompt | redact or summarize sensitive fields |
| Tool | least-privilege tool scope |
| Logging | mask sensitive fields |
| Eval | synthetic or de-identified data |
| Feedback | separate rating from personal identity |
| Memory | opt-in and purpose-bound |
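
As one illustration of the Prompt and Logging rows, the sketch below masks free text and log records before they are stored. The two regular expressions and the `SENSITIVE_KEYS` set are simplifying assumptions; pattern matching of this kind misses many forms of PII and is no substitute for a reviewed detection service.

```python
import re

# Illustrative patterns only: an email shape and long digit runs (card-like numbers).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{12,19}\b")

# Hypothetical field names that are always masked regardless of content.
SENSITIVE_KEYS = {"customer_name", "account_number"}


def mask_text(text: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def mask_log_record(record: dict) -> dict:
    """Return a copy of a log record that is safe to persist."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            masked[key] = "[REDACTED]"
        elif isinstance(value, str):
            masked[key] = mask_text(value)
        else:
            masked[key] = value  # non-string values pass through unchanged
    return masked


print(mask_text("Contact jane@example.com about card 4111111111111111"))
```

Masking at write time, rather than at read time, keeps raw values out of the log store so retention and deletion jobs have less sensitive data to track.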

6. Financial Retail Case: Customer Service RAG

| Lifecycle object | Governance |
| --- | --- |
| policy document | owner, effective date, source authority |
| chunk | source id, permission, version |
| embedding | rebuild when source changes |
| prompt | PII redaction, context budget |
| model output | citation required, stored if customer-facing |
| feedback | taxonomy, reviewer calibration |
| trace | retention by risk tier |
| incident | evidence binder and legal hold rules |

Deletion example:

source doc retired
  -> mark source inactive
  -> remove chunks from retrieval
  -> rebuild index
  -> preserve release evidence if required
  -> log deletion/rebuild evidence
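
The retirement flow above can be sketched against an in-memory store. `KnowledgeStore` and its fields are hypothetical stand-ins for a real document registry and vector index; the point is the order of steps and the evidence record written at the end.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class KnowledgeStore:
    sources: dict  # source_id -> {"active": bool}
    chunks: dict   # chunk_id -> source_id
    index: set = field(default_factory=set)          # chunk ids currently retrievable
    evidence_log: list = field(default_factory=list)

    def rebuild_index(self) -> None:
        """Index only chunks whose source is still active."""
        self.index = {
            cid for cid, sid in self.chunks.items() if self.sources[sid]["active"]
        }

    def retire_source(self, source_id: str, preserve_release_evidence: bool = True) -> dict:
        # 1. Mark the source inactive.
        self.sources[source_id]["active"] = False
        # 2. Remove its chunks from retrieval.
        removed = sorted(cid for cid, sid in self.chunks.items() if sid == source_id)
        for cid in removed:
            del self.chunks[cid]
        # 3. Rebuild the index from what remains.
        self.rebuild_index()
        # 4-5. Note whether release evidence is preserved and log the deletion.
        record = {
            "source_id": source_id,
            "removed_chunks": removed,
            "release_evidence_preserved": preserve_release_evidence,
            "index_size_after": len(self.index),
            "logged_at": datetime.now(timezone.utc).isoformat(),
        }
        self.evidence_log.append(record)
        return record


store = KnowledgeStore(
    sources={"policy-v12": {"active": True}, "policy-v13": {"active": True}},
    chunks={"chunk-55": "policy-v12", "chunk-56": "policy-v13"},
)
store.rebuild_index()
print(store.retire_source("policy-v12"))
```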

7. Templates

Data Lifecycle Inventory

| Data object | Owner | Purpose | Classification | Retention | Deletion | Evidence |
| --- | --- | --- | --- | --- | --- | --- |
| prompt | app owner | model input | confidential | 30 days / audit exception | log deletion job | trace policy |
| chunk | knowledge owner | retrieval | internal/confidential | until source retired | index rebuild | source lineage |
| eval sample | EvalOps | regression | de-identified | controlled | dataset version | eval report |
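
An inventory like this is only useful if every row is complete. The sketch below checks rows for empty governance fields; the `memory` row is a hypothetical incomplete entry added to show what a gap looks like, and the field names simply mirror the template columns.

```python
REQUIRED_FIELDS = ("owner", "purpose", "classification", "retention", "deletion", "evidence")

INVENTORY = [
    {"data_object": "prompt", "owner": "app owner", "purpose": "model input",
     "classification": "confidential", "retention": "30 days / audit exception",
     "deletion": "log deletion job", "evidence": "trace policy"},
    {"data_object": "chunk", "owner": "knowledge owner", "purpose": "retrieval",
     "classification": "internal/confidential", "retention": "until source retired",
     "deletion": "index rebuild", "evidence": "source lineage"},
    # Hypothetical incomplete row, not part of the template above.
    {"data_object": "memory", "owner": "", "purpose": "personalization",
     "classification": "confidential", "retention": "",
     "deletion": "user deletion request", "evidence": ""},
]


def missing_fields(row: dict) -> list:
    """List the required governance fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(row.get(f, "")).strip()]


def audit_inventory(rows: list) -> dict:
    """Map each incomplete data object to its missing fields."""
    gaps = {}
    for row in rows:
        missing = missing_fields(row)
        if missing:
            gaps[row["data_object"]] = missing
    return gaps


print(audit_inventory(INVENTORY))
```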

Provenance Table

| Output | Used prompt | Used source | Model | Tool | Reviewer | Evidence |
| --- | --- | --- | --- | --- | --- | --- |
| answer id | prompt version | source ids | model route | tool calls | reviewer id | trace id |

Retention/Deletion Matrix

| Object | Default retention | Trigger | Action |
| --- | --- | --- | --- |
| memory | purpose-bound | user deletion | delete + proof |
| embeddings | source lifecycle | source retired | rebuild index |
| trace | risk-tiered | retention expiry | mask/delete |
| evidence | audit/legal | retention expiry | archive/delete |

8. Common Failure Modes

| Failure mode | Fix |
| --- | --- |
| Only training data governed | include prompt/context/output/feedback/logs |
| Embeddings forgotten | include index rebuild and deletion strategy |
| Feedback reused without purpose | feedback governance and consent |
| Logs over-collect | data minimization and masking |
| Evidence conflicts with deletion | retention exception and legal hold policy |
| No provenance | source-to-output trace |

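9. Interview Talking Points

30-second version:

The AI data lifecycle is more than training data governance. I put source docs, chunks, embeddings, prompts, retrieved context, tool results, outputs, eval samples, feedback, logs, memory, and evidence into a lifecycle inventory, and define owner, purpose, classification, retention, deletion, and provenance for each type of data.

2-minute version:

Take customer service RAG as an example. Policy documents have an owner, effective date, and source authority; chunks carry source id, permission, and version; prompts go through PII redaction; outputs must carry citations; traces record prompt/model/source/tool versions; feedback has purpose limits. When a source is retired, it is marked inactive, its chunks are removed, the index is rebuilt, and deletion evidence is kept. This supports audit while also handling privacy, deletion, and knowledge updates.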