AI Document Intelligence: Unstructured Data and Evidence Quality Architecture

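This article is learning, architecture-training and portfolio material. It does not constitute legal advice, regulatory advice, a records retention determination, e-discovery advice, a KYC/KYB sufficiency judgment, a loan or insurance underwriting conclusion, a consumer dispute handling opinion, a fraud handling instruction, a model validation report or a vendor recommendation.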

AI Document Intelligence / Unstructured Data / Evidence Quality Architecture: An Explainer

Audience: CBAP+ Senior BA / Advanced AI PM / Product Architect / Enterprise Architect / Operations Architect / Model Risk / Records Management / Fraud Risk / KYC-KYB Operations / Claims and Disputes Lead / Loan and Insurance Servicing Product Owner.

Core question: How does a retail financial AI system turn bank statements, paystubs, claim packages, dispute evidence, KYC/KYB files, insurance / loan servicing documents and operational correspondence from unstructured documents into evidence-grade, auditable, reviewable, workflow-ready facts, while controlling OCR/layout/multimodal extraction, classification, entity extraction, summarization, confidence scoring, human review, document provenance, records retention, legal hold, fraud/tamper checks and model risk?

Learning goals: Build a document intelligence reference architecture, an evidence quality model, document provenance and chain of custody, confidence and review design, records/legal hold integration, workflow automation controls, fraud/tamper detection, model risk governance and a senior PM/architect decision framework.

0. Disclaimer

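A formal project must be assessed jointly by Legal, Compliance, Privacy, Records Management, Information Governance, Model Risk, Fraud Risk, Financial Crime, Operations, Product, Architecture, Information Security, Data Governance, Vendor Management, Internal Audit and the relevant business owners. How records, evidence, legal hold, customer notification, KYC/KYB, credit, insurance, complaint, dispute, claim, cross-border data and e-discovery requirements apply in a specific case depends on product, record type, jurisdiction, retention schedule, legal hold status, customer segment, channel, policy, contract and Legal / Compliance / Records interpretation.

This article does not reduce document intelligence to an OCR tutorial. OCR is only the capability of turning images into text. What retail financial scenarios actually need is an evidence-grade extraction architecture: one that can state where a document came from, whether it is complete, which page and region each field was extracted from, how confidence is calibrated, when human review is required, how results enter the workflow, how records are retained, how legal hold is handled, how tampering and fraud are detected, and how decision evidence can be replayed afterward.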

Source Anchors

Source	Link	Purpose
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize document AI risk governance, evals, monitoring, human oversight, and incident and evidence controls
NIST Privacy Framework	https://www.nist.gov/privacy-framework	Uses privacy risk management, data minimization, purpose, processing, access and monitoring to design the boundaries for document data collection, extraction, use and retention
NARA Records Management	https://www.archives.gov/records-mgmt	Uses records lifecycle, disposition, records program and accountability as the official anchor for records retention / evidence management
NARA Electronic Records Management	https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html	Uses electronic record formats, metadata and transfer/readiness guidance as the anchor for the electronic records architecture and preservation discussion
CFPB Consumer Complaint Database	https://www.consumerfinance.gov/data-research/consumer-complaints/	Uses the consumer complaint and complaint operations perspective to check document evidence trace, dispute handling, case explanation and the operational learning loop
FFIEC Authentication and Access to Financial Institution Services and Systems	https://www.ffiec.gov/press/pr081121.htm	Uses financial institution authentication, access control, risk assessment and layered security thinking to design document intake, reviewer access, workflow action and privileged operation controls
ISO/IEC 42001 overview	https://www.iso.org/standard/42001	Uses AI management system, roles, operation, performance evaluation, internal audit and continual improvement to establish the document AI operating model

In one sentence:

Document intelligence is not "OCR + LLM summary". In financial operations, it is an evidence system that converts unstructured documents into policy-bound, source-linked, confidence-calibrated, human-reviewable and records-aware decision inputs.

1. Thesis

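The core goal of document intelligence in retail financial services is not "reading documents faster" but turning documents into dependable operational evidence. A mature architecture should achieve the following transformation: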

from: uploaded PDF / scanned image / email attachment / photo / fax
to: evidence envelope
    + document provenance
    + page/layout map
    + extracted entities with source coordinates
    + normalized facts
    + confidence and validation results
    + fraud/tamper signals
    + human review decisions
    + workflow action
    + records retention / legal hold metadata
    + replayable audit trail

Core judgments:

Text recognized does not mean fact established.
A model summary does not mean evidence accepted.
Confidence score does not mean business risk resolved.
Human review does not mean control effectiveness unless review is designed, sampled and evidenced.
Document storage does not mean records compliance.

A senior PM / architect should design document AI as a four-layer system:

Layer	Goal	Key questions
Capture and provenance	Prove where the document came from, when it entered, whether it is complete and whether it has been processed	source channel, hash, version, customer/session/case binding, chain of custody
Extraction and understanding	Turn layout, text, tables, images, signatures, entities and relationships into structured evidence	OCR/layout/multimodal model, field mapping, normalization, source coordinates
Evidence quality and controls	Decide whether a field is usable for business, when human review is needed and how conflicts are handled	confidence calibration, validation rules, human review, fraud/tamper checks, policy gates
Workflow and records	Feed evidence into claims/disputes/KYC/servicing workflows and keep a replayable record	case management, decision logs, records retention, legal hold, complaint linkage

The most important architectural boundary is:

extraction result
  != verified fact
  != policy-accepted evidence
  != legal sufficiency
  != final business decision
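
The boundary above can be sketched as an explicit state progression, so that an extraction result cannot jump straight to a business decision. This is a minimal illustration only: the `EvidenceState` names and the `advance` helper are hypothetical, and legal sufficiency is deliberately left out of the state set because the article assigns that judgment to Legal rather than to the pipeline.

```python
from enum import Enum


class EvidenceState(Enum):
    # Hypothetical state names mirroring the boundary listed above.
    EXTRACTED = "extraction_result"
    VERIFIED = "verified_fact"
    POLICY_ACCEPTED = "policy_accepted_evidence"
    DECIDED = "final_business_decision"


# Each state may only move to the next one; no step can be skipped.
ALLOWED_TRANSITIONS = {
    EvidenceState.EXTRACTED: {EvidenceState.VERIFIED},
    EvidenceState.VERIFIED: {EvidenceState.POLICY_ACCEPTED},
    EvidenceState.POLICY_ACCEPTED: {EvidenceState.DECIDED},
    EvidenceState.DECIDED: set(),
}


def advance(current: EvidenceState, target: EvidenceState) -> EvidenceState:
    """Return the target state, or raise if the transition skips a control step."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{current.name} cannot move directly to {target.name}")
    return target
```

In this sketch, an attempt to move from `EXTRACTED` to `DECIDED` raises an error, which is the behavior the boundary asks for.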

2. Why It Matters

Retail financial operations depend heavily on unstructured documents:

Journey	Document examples	Business decision risk
Loan origination and servicing	bank statements、paystubs、tax forms、hardship letters、income proof、servicing correspondence	affordability, income verification, repayment plan, adverse action, servicing treatment
Insurance claims	claim forms、photos、repair invoices、medical bills、police reports、adjuster notes	claim eligibility, payout amount, fraud triage, escalation, customer communication
Payment disputes and chargebacks	receipts、merchant correspondence、tracking proof、screenshots、cardholder statements	dispute reason code, evidence package, representment, regulatory timelines
KYC/KYB onboarding and refresh	ID images、business registration、ownership docs、licenses、utility bills、board resolutions	identity/entity evidence, authority, beneficial ownership, sanctions/financial crime review
Operations and complaints	complaint letters、email threads、call transcripts、agent notes、documents attached to cases	issue classification, remediation, response evidence, root cause analysis
Account maintenance	name/address change proof、death certificate、power of attorney、court order、consent forms	entitlement, authority, privacy, account access, legal/ops escalation

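AI amplifies three types of risk:

Scale risk: a single extraction defect can affect thousands of cases in bulk.
Semantic risk: the model turns "looks like a paystub" into "income verified".
Evidence risk: after a business action has been taken, nobody can prove which document, model, version or reviewer a field came from.

The goal of a senior PM / architect is not "the highest automation rate" but: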

Use automation where evidence quality is sufficient,
route ambiguity to the right human queue,
preserve proof of what was seen and decided,
and keep records, privacy, fraud and model risk controls in the same workflow.

3. Evidence Object Taxonomy

Document AI needs to distinguish at least eight kinds of object. Mixing them up leads to automation that cannot be audited.

Object	Definition	Example	Control implication
Raw document	The original file submitted by a customer, merchant, employee, third party or system	PDF statement, photo paystub, email attachment	Keep the original hash, source channel, received time and case binding
Rendered page	The page image / normalized PDF page rendered by the system	page 3 of bank statement	Record renderer version, page count and image quality
Layout element	Table, paragraph, checkbox, signature block, stamp, logo or field region	paystub earnings table	Needs coordinates, reading order and table structure
Extracted field	A field and value extracted from the document	gross pay = 4,850.00	Needs source coordinates, confidence and parser/model version
Normalized entity	An entity after standardization and mapping to the business dictionary	employer_name, account_holder, claimant, policy_number	Needs normalization rule and entity resolution evidence
Derived fact	A fact computed from multiple fields or rules	average monthly deposit, income variance, coverage period	Needs formula, input fields and calculation version
Decision evidence	Evidence accepted by business policy or confirmed by a human	accepted income evidence for case X	Needs policy decision, reason code and reviewer/action trace
Summary	A source-linked summary for a reviewer or customer	claim package summary	Must cite its sources; cannot replace the original evidence

Recommended field-level evidence metadata:

field_name
document_id
document_version
page_number
bounding_box_or_anchor
raw_text
normalized_value
extraction_method
model_or_rule_version
confidence_score
calibration_bucket
validation_results
cross_document_match
fraud_or_tamper_signals
human_review_status
policy_acceptance_status
evidence_retention_rule
legal_hold_flag

4. Reference Architecture Model

Reference architecture:

intake channels
  -> document capture and provenance service
  -> file normalization / rendering / virus and content safety scan
  -> document classification and package splitting
  -> OCR + layout understanding + table extraction
  -> multimodal extraction / entity extraction / relationship mapping
  -> normalization and business validation
  -> confidence calibration and quality scoring
  -> fraud / tamper / duplicate / synthetic-document checks
  -> evidence policy engine
  -> human review and exception queues
  -> workflow integration: KYC, claims, disputes, servicing, complaints
  -> records retention / legal hold / disposition integration
  -> evidence ledger, monitoring, QA and model governance

Key components:

Component	Responsibility	Senior design question
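Intake gateway	Receives upload, email, fax, branch scan, mobile capture, API and vendor feed	Is the document bound to a customer/case/session, and are channel risk and consent context recorded?
Provenance service	Generates document id, hash, timestamp, source, custody events and version	Can we prove afterward that the document was not replaced or silently rewritten?
Classification service	Determines document type, subtype, issuer/source, language, quality and package structure	Would a misclassification send the document into the wrong workflow or the wrong retention rule?
Layout and OCR service	Recognizes text, reading order, tables, checkboxes and signature/stamp blocks	Are tables and multi-column reading order validated, and are coordinates preserved?
Multimodal extraction	Combines text, layout, image and tables to extract fields and relationships	Is model output constrained by a schema, and can it explain its source?
Entity normalization	Unifies names, addresses, amounts, dates, last four digits of account numbers, business names and policy numbers	Is normalization replayable, and is the original text preserved?
Validation and reconciliation	Runs consistency checks across pages, documents, system records and third-party data	How do conflicts reach review so that a summary cannot mask them?
Confidence engine	Confidence and calibration at field, document and case level	Are thresholds defined by field criticality and journey risk?
Fraud/tamper service	Checks edit traces, metadata anomaly, duplicate, template abuse and image manipulation	Is the output used only as a signal rather than as a substitute for fraud investigation?
Evidence policy engine	Decides whether an extraction is acceptable for the current business process	Are extraction, validation, review and policy acceptance kept separate?
Human review workbench	Reviewer sees source-linked fields, conflicts, model rationale and history	Can the reviewer locate the evidence quickly and leave a structured decision?
Records and hold connector	Assigns record class, retention schedule, legal hold flag and disposition controls	Are retention and hold inherited from case/workflow state, and are they auditable?
Evidence ledger	Stores the document, model, rule, review, workflow action and final communication trace	Can the whole chain be replayed during a complaint or audit?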

5. Document Classes and Evidence Risk

Not all documents should use the same automation threshold. Tier them by business impact, field criticality, fraud exposure and records sensitivity.

Document class	High-value fields	Special risks	Architecture control
Bank statements	account holder、institution、statement period、balances、deposits、NSF、account number mask	altered PDF、missing pages、fake bank template、income misclassification	page completeness, transaction table validation, institution/logo metadata checks, human review for high-impact use
Paystubs	employer、employee、pay period、gross/net pay、YTD income、deductions	generated fake paystub, mismatched employer, inconsistent YTD	arithmetic checks, pay period consistency, employer/entity validation, duplicate template detection
Claims documents	claim number、loss date、policy number、coverage, invoices, photos, police/medical records	inflated invoices, reused photos, inconsistent event timeline	package timeline, image metadata, duplicate media search, adjuster review
Dispute packages	transaction details、merchant evidence、shipping proof、customer assertion、reason codes	weak evidence, missing required proof, model over-summary	reason-code-specific evidence checklist, source-linked package summary, SLA controls
KYC/KYB documents	identity data、business registration、ownership、license、authorized signer	stale documents, entity mismatch, authority ambiguity	freshness policy, entity resolution, legal/compliance review boundary, beneficial ownership evidence routing
Insurance/loan servicing docs	hardship reason、income/expense、death/POA/court order、address/name change	authority and entitlement errors, sensitive data leakage	privileged workflow controls, dual review for authority, records/hold metadata
Operational correspondence	complaint letters、emails、agent notes、attachments	missed complaint, wrong product classification, incomplete response evidence	complaint taxonomy, case linkage, final response capture, CFPB-style complaint learning loop

Senior design principles:

Let document class determine the extraction schema, review threshold, retention class, fraud checks and workflow route.
Apply field-level controls to high-impact fields rather than relying on document-level confidence alone.
Give summaries source-linked citations inside the case tool; a summary must never be the only evidence.
Govern customer-provided documents and institution-generated records separately.

6. Evidence-Grade Extraction Pipeline

6.1 Intake and Capture

Control question	Strong pattern
Where did the document come from?	source channel, user/session, case id, upload event, IP/device/risk context where permitted
Is the original file preserved?	raw artifact immutable storage + hash + version pointer
Is it complete?	page count, file size, render success, missing/blank page check
Is it safe?	malware scan, file type validation, macro/script blocking, content safety routing
Is it accessible?	mobile capture quality feedback, supported formats, assisted channel
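
The "raw artifact + hash" pattern in the table above can be sketched with the standard library alone. The function names, the event fields and the choice of SHA-256 are assumptions for illustration; a real intake service would also persist the bytes to immutable storage, which this sketch does not do.

```python
import hashlib
from datetime import datetime, timezone


def capture_intake_event(raw_bytes: bytes, source_channel: str, case_id: str) -> dict:
    """Build a hypothetical intake record for the exact bytes that were received."""
    return {
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "size_bytes": len(raw_bytes),
        "source_channel": source_channel,
        "case_id": case_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


def matches_original(raw_bytes: bytes, intake_event: dict) -> bool:
    """Check whether a file presented later is byte-identical to the one received."""
    return hashlib.sha256(raw_bytes).hexdigest() == intake_event["sha256"]


event = capture_intake_event(b"%PDF-1.7 sample statement bytes", "mobile_upload", "CASE-001")
```

Hashing the raw bytes before any rendering or normalization means every later derived artifact can be tied back to one fixed original.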

6.2 Classification and Package Splitting

Classification is not a UI label; it is the entry point for workflow, risk and records decisions.

Classification output	Why it matters
document_type and subtype	Determines the extraction schema and workflow queue
issuer/source class	Determines trust and fraud checks
language and locale	Determines OCR/model and reviewer routing
package boundaries	Splits a multi-document PDF into statement, paystub, invoice and letter
confidence and ambiguity	Low-confidence classifications go to intake review so they do not enter the wrong process
record category candidate	Input to later records retention / legal hold integration

6.3 Layout, OCR and Multimodal Understanding

The architectural concern is not OCR algorithm detail but evidence recoverability:

Capability	Evidence requirement
Text recognition	raw OCR text, confidence, language, page, text anchors
Layout detection	paragraphs, tables, cells, checkboxes, signature areas, stamps, reading order
Table extraction	row/column coordinates, header mapping, merged cell handling, totals validation
Image understanding	photo/document boundary, logo/stamp/signature presence, damage or blur
Multimodal extraction	field value must link back to visual/text source, not only generated answer
Summarization	source-linked, scoped to reviewer need, not used as record replacement

6.4 Entity Extraction and Normalization

Entity extraction must be tiered by field criticality:

Field type	Examples	Control
Identity/entity	name, DOB, business name, beneficial owner, authorized signer	source coordinate + normalization + conflict check + high-risk review
Monetary	gross pay, net pay, deposit amount, invoice amount, claim amount	arithmetic validation, currency, period, outlier checks
Temporal	statement period, pay period, loss date, coverage dates, received date	date normalization, timeline consistency
Authority/relationship	POA, signer role, officer, policyholder, claimant	human/legal/ops review triggers based on product policy
Operational	reason code, claim type, complaint category, servicing request	workflow route and SLA impact, QA sampling

7. Confidence Architecture

Confidence is not a decorative score. It is a control input for routing, review, policy acceptance and monitoring.

Confidence level	Definition	Example
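Character/text confidence	How reliably OCR recognized a specific character or token	`8` vs `B` in account mask
Field confidence	The probability or score the model assigns to a field value being correct	pay period end date
Layout confidence	Whether table structure, reading order and checkbox state can be trusted	deductions table
Document classification confidence	Whether document type and subtype are correct	paystub vs payroll summary
Cross-validation confidence	Whether a field agrees with internal systems, other documents and rules	YTD income vs pay period
Fraud/tamper confidence	Strength of document tampering or forgery signals	PDF metadata anomaly
Case evidence confidence	Whether the whole case package is sufficient to move to the next step	income evidence accepted for review

Weak pattern: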

if model_confidence > 0.85 then auto-approve

Strong pattern:

if field is low impact and confidence calibrated and validations pass
  then auto-populate with audit trace
if field is high impact or conflicts with another source
  then route to human review
if document class has fraud/tamper signal or legal/authority implication
  then require specialized queue

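Confidence design points:

Define thresholds by field criticality, journey risk, customer harm and fraud exposure.
Do not let a document-level average mask errors in critical fields.
Calibrate confidence and monitor drift against reviewer outcomes.
For high-impact decisions use validation + review + policy, not the score alone.
Reviewer overrides feed the feedback loop but do not automatically train the model unless data governance and model governance have approved it.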
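
Calibration against reviewer outcomes can be measured in several ways; one common choice is expected calibration error (ECE), sketched below. The function name, the equal-width binning and the use of reviewer agreement as the correctness label are assumptions for illustration, not a metric the article prescribes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin.

    confidences: model scores in [0, 1]
    correct: booleans, e.g. whether a reviewer confirmed the extracted value
    """
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for score, ok in zip(confidences, correct):
        # A score of exactly 1.0 belongs in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(score for score, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Four fields scored 0.9, of which reviewers confirmed three: accuracy 0.75, gap about 0.15.
sample_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Computing this per field type rather than across all fields keeps a well-calibrated low-impact field from hiding a poorly calibrated critical one.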

8. Human Review Design

Human review is not a manual fallback for failed automation; it is a component of the evidence architecture.

Review pattern	Use case	Control requirement
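Intake review	Uncertain classification, poor file quality, missing pages, abnormal format	Confirm document type and package completeness, then reroute
Field review	High-impact field with low confidence or a conflict	Reviewer sees the original text, coordinates, candidate values and the reason a rule failed
Specialist review	authority, legal document, KYB ownership, fraud signal	Assign to a qualified/authorized queue and record the rationale
QA sampling	Sampling of cases that passed automation	Estimate false accept / false extraction and trigger model/control tuning
Dual control	High-amount claim, sensitive servicing action, authority change	second reviewer or approver, separation of duties
Complaint review	Customer challenges the document handling or the AI conclusion	Link evidence ledger, AI run, reviewer action and final response

The reviewer UI should provide:

Original document/page/image on the left and structured fields on the right; clicking a field jumps to its source.
Display of model output, confidence, validation failures and previous reviewer actions.
No "AI says approve" prompt that would invite rubber-stamping.
Minimum necessary explanation of fraud/risk signals, so that sensitive rules are not leaked.
A requirement that the reviewer select a structured reason code, plus free-text rationale where needed.
A record of old value, new value, reason, reviewer, timestamp and policy version for every override.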

9. Document Provenance and Chain of Custody

Evidence-grade document AI must be able to answer:

Who or what submitted the document?
When was it received?
Which exact file and page were processed?
Which model/rule/version extracted the field?
Was the document changed, re-rendered, split, redacted or reprocessed?
Who reviewed or overrode the result?
Which workflow decision used the evidence?
Which final customer or counterparty communication referenced it?

Provenance controls:

Control	Evidence
Immutable raw artifact	file hash, storage location, write-once policy where applicable
Versioned derived artifacts	rendered pages, OCR JSON, layout graph, extracted field set
Processing lineage	model id/version, prompt/template id, parser version, ruleset version
Source coordinate binding	page number, bounding box, table cell id, paragraph anchor
Custody event log	uploaded, scanned, normalized, classified, reviewed, redacted, exported
Access trace	reviewer, system, service account, vendor access, download/export events
Workflow linkage	case id, task id, decision id, communication id
Records metadata	record class, retention rule, hold flag, disposition state
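
A custody event log is only useful as evidence if later edits to it are detectable. One possible technique, which the article does not itself specify, is to chain each entry to the previous one by hash. The helper names and event fields below are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append a custody event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    log.append({
        "event": event,
        "prev_hash": prev_hash,
        "entry_hash": _entry_hash(prev_hash, event),
    })


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["entry_hash"]
    return True


custody_log = []
append_event(custody_log, {"document_id": "D123", "action": "uploaded"})
append_event(custody_log, {"document_id": "D123", "action": "classified"})
append_event(custody_log, {"document_id": "D123", "action": "reviewed"})
```

A hash chain detects tampering but does not prevent it; write-once storage and access controls from the table above would still be needed alongside it.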

LLM output must be grounded:

summary_claim:
  text: "The claimant submitted two repair invoices totaling 3,420.00."
  sources:
    - document_id: D123
      page: 2
      field: invoice_total
      value: 1,850.00
    - document_id: D124
      page: 1
      field: invoice_total
      value: 1,570.00
  model_run_id: MR789
  reviewer_status: reviewed

10. Records Retention and Legal Hold Architecture

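Document intelligence often generates many kinds of derived artifacts: OCR text, layout JSON, extracted fields, summaries, review notes, decision logs, redacted copies and exports. Whether they constitute records, how long they are retained and whether they fall under legal hold cannot be decided by the AI team on its own.

The architecture should provide configurable mechanisms: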

Question	Architecture response
Which artifacts are records?	record classification service + Records/Legal interpretation
Are the original and the extraction results retained on equal terms?	retention rule can differ by artifact type and case type
How does legal hold propagate?	hold flag propagates to raw doc, derived artifacts, case decisions, exports
How is disposition executed?	scheduled disposition workflow with approval, audit and exception handling
What happens to old results after reprocessing?	version lineage retained per policy; no silent overwrite
Does the vendor hold a copy?	vendor data inventory, deletion/return evidence, contract controls
How does records search locate documents and AI artifacts?	metadata index with access controls and preservation status

Boundary principles:

Do not assert the statutory retention period for any class of document in product copy or architecture documents.
Retention schedule, legal hold status, e-discovery and regulatory response are confirmed by Legal / Compliance / Records.
AI summaries cannot replace original records unless Records/Legal has explicitly approved the purpose and retention treatment of that artifact.
Under legal hold, automatic deletion, model training cleanup, vendor purge and data minimization jobs must all check hold state.
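
Hold propagation across derived artifacts can be pictured as a walk over a lineage graph. The artifact identifiers, the `DERIVED_FROM` map and both helper functions are hypothetical; which artifacts actually fall under a hold remains a Legal / Records decision, and this sketch only shows the mechanical propagation once that decision exists.

```python
from collections import deque

# Hypothetical lineage: each artifact maps to the artifacts derived from it.
DERIVED_FROM = {
    "raw:D123": ["render:D123:p1", "ocr:D123"],
    "ocr:D123": ["fields:D123"],
    "fields:D123": ["summary:C1", "export:C1"],
}


def propagate_hold(root: str, children: dict) -> set:
    """Return the root artifact plus everything derived from it, directly or indirectly."""
    held = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in held:
            continue
        held.add(node)
        queue.extend(children.get(node, []))
    return held


def may_delete(artifact_id: str, held: set) -> bool:
    """Deletion, purge and cleanup jobs would call this check before acting."""
    return artifact_id not in held


held_artifacts = propagate_hold("raw:D123", DERIVED_FROM)
```

Routing every deletion path through a single check such as `may_delete` is what lets retention jobs, vendor purges and training cleanup share one hold state instead of each keeping its own.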

11. Fraud, Tamper and Authenticity Checks

Document AI must assume that inputs may be manipulated, especially bank statements, paystubs, invoices, receipts, screenshots, photos and identity/KYB documents.

Threat	Pattern	Controls
Altered PDF	Fields edited, metadata anomalies, inconsistent fonts/layers	PDF object analysis, metadata checks, visual inconsistency, reviewer alert
Fake template	Use of a forged bank/employer/merchant template	institution/template registry, logo/layout similarity, issuer validation where available
Missing pages	Statement lacks key pages or terms/context	page count completeness, period continuity, expected section checks
Reused document	The same paystub / invoice / photo reused across multiple customers	perceptual hash, duplicate detection, cross-case risk signal
Synthetic paystub	Inconsistent YTD/pay period/tax/deduction values	arithmetic and chronology validation
Screenshot manipulation	Cropping, splicing, overlays, low quality used to evade checks	image metadata, edge artifacts, quality gate, source channel policy
Deepfake / generated image	AI-generated accident photos, receipts or signatures	media provenance, duplicate search, anomaly model, human/fraud review
Insider manipulation	Employee replaces, exports or rewrites evidence	access control, separation of duties, immutable logs, FFIEC-aligned layered controls
Prompt injection in documents	Instructions embedded in a document to induce the LLM to ignore rules	tool isolation, deterministic extraction schema, prompt injection filters, output validation

Important boundaries:

Fraud/tamper model output is a risk signal, not a final fraud conclusion.
Customer communication should explain the business reason for requesting more information or a review, without exposing internal detection rules.
For high-risk documents, a combination of controls usually matters more than any single model: metadata + layout + arithmetic + external validation + human review + monitoring.
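
As one small piece of such a control combination, the sketch below flags byte-identical files submitted under more than one case. It uses an exact SHA-256 match, which is a simplification: it will not catch re-scanned or re-compressed copies, where the perceptual hashing named in the table would be needed. The function name and the input shape are assumptions, and the result is a risk signal for routing, not a fraud finding.

```python
import hashlib
from collections import defaultdict


def find_cross_case_duplicates(submissions):
    """Map each file hash seen in more than one case to the sorted case ids.

    submissions: iterable of (case_id, raw_bytes) pairs
    """
    cases_by_hash = defaultdict(set)
    for case_id, raw_bytes in submissions:
        cases_by_hash[hashlib.sha256(raw_bytes).hexdigest()].add(case_id)
    return {
        digest: sorted(case_ids)
        for digest, case_ids in cases_by_hash.items()
        if len(case_ids) > 1
    }


duplicates = find_cross_case_duplicates([
    ("CASE-001", b"paystub bytes A"),
    ("CASE-002", b"paystub bytes B"),
    ("CASE-003", b"paystub bytes A"),
    ("CASE-001", b"paystub bytes A"),  # a resubmission within the same case is not flagged
])
```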

12. Workflow Integration

The value of document intelligence comes from entering business processes, not from sitting in an extraction dashboard.

Workflow	Integration pattern	Evidence control
KYC/KYB onboarding	prefill application, identify missing evidence, route ownership/authority ambiguity	source-linked fields, policy reason codes, compliance review boundary
Loan underwriting/servicing	income/expense extraction, hardship package completeness, servicing task creation	field confidence, calculation trace, reviewer rationale
Insurance claims	claim package classification, invoice/photo extraction, timeline, fraud triage	media provenance, duplicate checks, adjuster summary
Payment disputes	reason-code evidence checklist, merchant/customer package summary, SLA management	required evidence flags, final package trace
Complaints	identify complaint theme, product, customer harm, attached evidence, response deadlines	complaint-to-document linkage, final response capture
Back office operations	mailroom automation, form processing, correspondence routing	queue routing evidence, SLA, error sampling

Workflow contract should include:

input artifact type
required extracted fields
confidence thresholds by field
validation rules
human review triggers
fraud/tamper triggers
policy decision states
records metadata
case update payload
customer communication constraints
fallback and exception path
monitoring metrics

13. Model Risk and AI Governance

Document AI may use an OCR engine, layout model, classification model, multimodal LLM, entity extraction model, rules engine, fraud model and summarizer at the same time. Each model or rule carries a different risk.

Capability	Model risk focus
Classification	wrong workflow route, wrong retention category, SLA miss
OCR/layout	field distortion, table errors, missed signatures or checkboxes
Entity extraction	wrong identity, amount, date, account, authority
Summarization	unsupported conclusion, omission of conflicting evidence, tone risk
Fraud/tamper detection	false positives, false negatives, sensitive rule leakage
Confidence scoring	poor calibration, threshold gaming, automation beyond evidence
Human review recommendation	automation bias, rubber-stamping, unequal treatment

Governance design:

Map use cases and harms against the NIST AI RMF trustworthiness characteristics: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed.
Maintain model inventory with purpose, owner, vendor, version, data classes, decision impact and allowed uses.
Define eval sets by document class, language, channel, quality, customer segment, fraud pattern and workflow outcome.
Monitor extraction accuracy by field criticality, not only aggregate F1.
Track reviewer overturns, complaint defects, downstream rework and customer harm indicators.
Use ISO/IEC 42001-style AI management system controls: role accountability, operational procedures, performance evaluation, internal audit and continual improvement.
Treat prompt/template/ruleset changes as governed artifacts when they affect extraction or workflow decisions.

14. Product / Architecture Decisions

Decision	Weak answer	Strong architecture answer
What are we automating?	“Read all PDFs with AI”	Define document classes, field schemas, decision impact, review thresholds and workflow contracts
How to use OCR?	“OCR everything and send to LLM”	Preserve page/layout/source coordinates; use OCR/layout only as one stage of evidence pipeline
How to use multimodal models?	“Ask model what the document says”	Use schema-constrained extraction, grounded outputs, validation, confidence and review
What counts as evidence?	“Model output in JSON”	Raw document + source-linked extracted fields + validations + policy acceptance + review trace
When to auto-process?	“High confidence”	Field criticality + calibrated confidence + validations + fraud signals + policy threshold
How to handle summaries?	“Summarize the file for ops”	Source-linked, scoped summary that cannot override field evidence or policy rules
How to handle records?	“Store documents in S3”	Record class, retention schedule mapping, legal hold propagation, disposition audit
How to handle legal hold?	“Pause deletion manually”	Hold-aware storage, derived artifact propagation, vendor and downstream system checks
How to measure quality?	“OCR accuracy”	Field-level accuracy, calibration, review overturn, evidence completeness, downstream defect
How to govern vendors?	“Use best API”	Data use, retention, access, audit logs, model versioning, outage, exit and evidence obligations

15. Control Matrix

Control objective	Control activity	Evidence
Preserve original evidence	Immutable raw artifact, hash, received timestamp, source channel	document hash, intake event, storage policy
Classify correctly	Document type/subtype model with ambiguity routing	classification result, confidence, review decision
Bind fields to source	Every extracted field includes document/page/coordinate or anchor	extraction JSON, UI source link
Validate critical fields	Rule and cross-document checks for amounts, dates, identity, authority	validation log, failed rule reason
Calibrate confidence	Compare confidence with reviewer outcomes and sampling	calibration report, threshold change record
Prevent unsupported automation	Field criticality thresholds and policy gates before workflow action	policy decision id, reason code
Control summaries	Source-linked summaries with prohibited conclusion rules	model run id, citations, eval result
Route human review	Queue by ambiguity, high impact, authority, fraud, legal/records sensitivity	task id, reviewer, rationale
Detect tamper/fraud	Metadata, visual, duplicate, arithmetic and behavioral checks	risk signals, fraud case link
Protect privacy	Data minimization, access controls, redaction, purpose-bound use	privacy review, access logs
Manage records	Record class and retention metadata assigned to raw and derived artifacts	record metadata, retention rule
Honor legal hold	Hold flag propagates to raw docs, derived artifacts, exports and vendor purge flows	hold event, propagation log
Govern models	Model inventory, evals, drift monitoring, change control	model card, eval report, approval
Support complaints/audit	Link documents, AI runs, review actions, workflow decisions and final messages	evidence bundle, complaint id

16. Metrics

Metric family	Examples
Extraction quality	field-level precision/recall, table extraction accuracy, date/amount/entity error rate
Confidence quality	calibration error, high-confidence wrong field rate, threshold breach rate
Workflow outcome	straight-through processing rate by document class, review queue SLA, downstream rework
Human review	reviewer overturn rate, agreement rate, average handle time, QA defect rate
Evidence completeness	% fields with source coordinates, % decisions with policy reason, replay success rate
Records and hold	retention metadata completeness, hold propagation success, disposition exception count
Fraud/tamper	duplicate document rate, altered document detection, false positive review rate, confirmed fraud yield
Privacy/security	over-collection defects, unauthorized access attempts, redaction defects, vendor retention exceptions
Model governance	eval pass rate, drift alerts, prompt/ruleset changes, incident count
Customer impact	complaint rate related to document handling, dispute re-open rate, request-for-more-info rate, accessibility defects
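
To make one of these metrics concrete, the "high-confidence wrong field rate" could be computed from review records roughly as follows. The record keys (`confidence`, `reviewer_agreed`) and the 0.9 cut-off are assumptions for this sketch only.

```python
def high_confidence_wrong_rate(records, threshold=0.9):
    """Share of fields scored at or above the threshold that a reviewer overturned."""
    high = [r for r in records if r["confidence"] >= threshold]
    if not high:
        return 0.0
    wrong = sum(1 for r in high if not r["reviewer_agreed"])
    return wrong / len(high)


review_records = [
    {"confidence": 0.95, "reviewer_agreed": True},
    {"confidence": 0.97, "reviewer_agreed": False},
    {"confidence": 0.50, "reviewer_agreed": False},  # below threshold, excluded
    {"confidence": 0.92, "reviewer_agreed": True},
    {"confidence": 0.99, "reviewer_agreed": True},
]
rate = high_confidence_wrong_rate(review_records)  # 1 of 4 high-confidence fields overturned
```

This metric only means something if the sampled records include cases that passed automation, which is why it depends on the QA sampling described under human review.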

Balanced executive dashboard:

Speed: cycle time and review productivity improve.
Quality: critical fields are accurate and calibrated.
Risk: fraud, tamper, legal hold and records controls work.
Fairness: errors are monitored across document quality, language and channel.
Trust: every automated or reviewed decision is replayable.

17. Failure Modes

Failure mode	Why dangerous	Better control
OCR text treated as truth	OCR may misread critical amounts, dates, names	source-linked fields, validation, human review
Document-level confidence used for all fields	High average hides one critical field error	field criticality thresholds
LLM summary becomes decision record	Summary may omit conflicts or invent conclusions	source-linked summary plus structured evidence
Wrong document classification	Routes to wrong workflow, SLA, retention class	ambiguity queue and QA sampling
Silent reprocessing overwrites evidence	Audit cannot explain historical decision	versioned artifacts and lineage
No legal hold propagation	Derived OCR/extractions may be deleted while raw doc held	hold-aware artifact graph
Reviewer rubber-stamping	Automation bias turns human review into weak control	source-first UI, reason codes, QA
Fraud model blocks customers without review	False positives can cause harm and complaints	risk signal routing and human/fraud review
Vendor retains documents unexpectedly	Privacy, records and legal hold exposure	contract controls, data inventory, deletion evidence
Model trained on records under hold or restricted use	Governance and discovery risk	data use controls and hold-aware training exclusion
Prompt injection from document text	Model follows malicious embedded instructions	tool isolation and output validation
Complaint cannot link to evidence	Root cause and remediation become speculative	complaint-to-evidence trace

18. Interview-Ready Takeaways

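Q1: Why is document intelligence not an OCR project?

OCR only solves "seeing the text". What retail finance actually needs is evidence-grade extraction: document provenance, layout/source coordinates, field confidence, business validation, fraud/tamper checks, human review, workflow action, records retention, legal hold and audit replay. Otherwise you have only turned humans misreading documents into machines misreading documents at scale.

Q2: How do you decide whether an extracted field can enter a business process automatically?

Model confidence alone is not enough. Look at field criticality, document class, confidence calibration, source linkage, validation result, cross-document consistency, fraud/tamper signals, policy gate and human review threshold. High-impact fields such as income, authority, claim amount and beneficial owner usually need stronger validation or review.

Q3: How can AI summaries be used safely in claims/disputes/KYC?

A summary should be a reviewer productivity tool, not an evidence replacement. Every key statement must be source-linked, and the summary must not state an unsupported eligibility, KYC/KYB, fraud or legal conclusion. The final business decision should cite structured evidence, policy reason and reviewer action.

Q4: Why must records retention and legal hold be part of the document AI architecture?

Because document AI handles raw docs and produces derived artifacts such as OCR text, layout JSON, extracted fields, summaries, review notes and exports. Which of these are records, how long they are kept and whether they are affected by legal hold depends on product, record type, jurisdiction, retention schedule, hold status and Legal/Compliance/Records interpretation. The architecture must be able to propagate metadata and hold state rather than rely on manual lookup after the fact.

Q5: How does a senior PM measure document AI success?

Not by automation rate or processing time alone. Look at field-level accuracy, confidence calibration, review overturn, evidence completeness, records/hold metadata completeness, fraud/tamper yield, complaint defects, downstream rework, customer harm and audit replay success. Speed must be read together with evidence quality.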

19. Practical Templates

19.1 Document Evidence Envelope

Document ID:
Case ID:
Customer / business reference:
Source channel:
Received timestamp:
Submitter / system reference:
Raw file hash:
File type and size:
Page count:
Document class / subtype:
Classification confidence:
Language / locale:
Quality score:
Processing lineage:
  renderer version:
  OCR/layout version:
  extraction model version:
  ruleset version:
Fraud/tamper signals:
Record class:
Retention rule:
Legal hold flag:
Access restrictions:
Derived artifacts:
Workflow decisions:
Complaint / audit links:

19.2 Field Extraction Spec

Field	Definition
field_name	`gross_pay_amount`
document_classes	paystub, payroll statement
source_requirement	page + bounding box + raw text
normalization	currency amount with locale and period
validations	gross >= net, pay period exists, YTD consistency
confidence_threshold	higher threshold for auto-populate, lower threshold for review suggestion
review_trigger	low confidence, arithmetic mismatch, employer mismatch, tamper signal
allowed_workflow_use	income package preparation, not final credit decision by itself
prohibited_use	unsupported affordability conclusion
retention_metadata	derived field linked to paystub record class
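
The `validations` row of this spec can be turned into a small rule check. The function name, the failure codes and the exact reading of "YTD consistency" (year-to-date gross should not be lower than the current period's gross) are assumptions for illustration; a production rule set would be owned and versioned by the business.

```python
from datetime import date
from decimal import Decimal
from typing import Optional


def validate_paystub(
    gross: Decimal,
    net: Decimal,
    ytd_gross: Decimal,
    period_start: Optional[date],
    period_end: Optional[date],
) -> list:
    """Return the failed rule codes; an empty list means all three checks passed."""
    failures = []
    if gross < net:
        failures.append("gross_less_than_net")
    if period_start is None or period_end is None or period_end < period_start:
        failures.append("invalid_pay_period")
    if ytd_gross < gross:
        failures.append("ytd_below_current_gross")
    return failures


clean = validate_paystub(
    Decimal("4850.00"), Decimal("3610.25"), Decimal("29100.00"),
    date(2024, 6, 1), date(2024, 6, 15),
)
suspect = validate_paystub(
    Decimal("4850.00"), Decimal("5100.00"), Decimal("4000.00"),
    None, date(2024, 6, 15),
)
```

Returning the list of failed rules, rather than a single pass/fail flag, gives the reviewer the "reason a rule failed" that the field review pattern calls for.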

19.3 Confidence and Review Policy

Document class:
Workflow:
Field criticality:
Customer impact:
Fraud exposure:
Auto-populate allowed when:
  field confidence:
  classification confidence:
  validations:
  tamper signals:
  cross-document consistency:
Human review required when:
  low confidence:
  conflict:
  authority/legal implication:
  high amount:
  vulnerable customer / complaint sensitivity:
Sampling rule:
Reviewer queue:
QA metric:

19.4 Human Review Record

Review task ID:
Reviewer role:
Document ID:
Field(s) reviewed:
Model suggestion:
Source location:
Validation failures:
Fraud/tamper signals:
Reviewer decision:
Corrected value:
Reason code:
Free-text rationale:
Second approval:
Workflow action:
Customer communication reference:
Timestamp:

19.5 Records / Legal Hold Integration Card

Artifact type:
Raw document:
Rendered page:
OCR text:
Layout JSON:
Extracted fields:
AI summary:
Review notes:
Workflow decision:
Exported package:
Record class owner:
Retention schedule reference:
Legal hold propagation rule:
Disposition approval:
Vendor copy:
Search / retrieval metadata:
Access restrictions:
Audit evidence:

20. Final Operating Principle

A mature AI document intelligence architecture can be tested with one question:

Can the institution prove that every automated or human-assisted document decision
was based on the right document,
the right source-linked fields,
the right confidence and validation controls,
the right human review boundary,
the right fraud and records treatment,
and the right workflow policy at that point in time?

If the answer is unclear, what the team lacks is not stronger OCR. It lacks an evidence operating architecture made up of document provenance, evidence quality, confidence calibration, human review design, records/legal hold integration, fraud controls, workflow contracts and AI governance.