
AI Human Review Operations / Capacity Playbook



AI Human Review Operations / Capacity Architecture Playbook

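Positioning: For senior AI PMs, AI BAs, AI Architects, Enterprise Architects, Operations Leads, Workforce Planning, Model Risk, Compliance and Internal Audit. The goal is to upgrade AI human review from "someone reviews it" into a production control system that can be operated, scaled, calibrated, audited and recovered.

Scope: AML Copilot, KYC Review Assistant, Credit Underwriting Copilot, Payment Dispute Assistant, Complaint Response Agent, Customer Service RAG, Fraud Alert Triage, Collections Copilot, internal knowledge assistants for retail financial services, and agentic workflows.

Important note: This document is learning, portfolio and internal governance training material. It is not legal advice, a compliance conclusion, an audit opinion, a model validation report or a regulatory interpretation. Formal projects must be confirmed by Legal, Compliance, Risk, Model Risk, Internal Audit, Security, Privacy, the Business Owner, Operations, Workforce Planning and management, taking into account institution type, jurisdiction, business purpose, customer impact and internal policy.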

1. Executive Framing

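Human review is not a decorative seat belt for AI projects. It is part of the production system and directly determines whether AI can run safely under real business pressure. Many organizations already know they need HITL, but the failures happen at a more operational level:

Cases that need review all land in one large, undifferentiated queue.
Reviewers lack the necessary skills or authority.
Review volume exceeds capacity, so people can only skim.
Training covers system operation but not boundary judgment.
Overrides carry no reason code or evidence reference.
Escalation has no recipient, time limit or pause rule.
Management watches throughput but not quality or fatigue.
Audit cannot replay the AI output, evidence, human judgment and downstream action.

The core judgment of this playbook: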

Human review is production capacity.
It must be routed, staffed, calibrated, measured, governed and evidenced.
Otherwise it becomes a bottleneck or control theater.

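1.1 How This Differs From Generic HITL

Generic HITL	Human Review Operations / Capacity Architecture
Asks whether a human is involved	Asks who makes which judgment, under what conditions and on what evidence
Emphasizes approval points	Emphasizes queues, SLA, skills, capacity, calibration and escalation
Usually a node in a process diagram	An operating model, workflow engine and evidence system
Assumes humans can absorb risk	Treats human judgment explicitly as a scarce resource
Keeps only accept / reject	Keeps edit, override, escalate, stop, adjudicate and feedback
Measures review count	Measures review quality, agreement, miss rate, fatigue and audit replay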

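1.2 Current Nuance

Human review can reduce AI risk, but it can also create new risk:

When capacity is insufficient, it becomes a bottleneck.
When evidence is insufficient, it becomes a rubber stamp.
When reviewers are not independent, it amplifies automation bias.
When escalation authority is unclear, it cannot stop a wrong action.
When audit evidence is incomplete, it cannot prove that the control works.
When calibration is missing, it dresses up individual subjective judgment as governance.

A senior design treats human review as controlled production capacity, not as the optimistic assumption that "a human will catch it."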

2. Source Anchors

The following sources establish the language for governance, human-centered AI, management systems and business continuity. This document translates them into product, process, architecture, operations and evidence design requirements.

Anchor	Official link	How this document uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize the governance, context, measurement, response and continuous improvement of human review
NIST Human-Centered AI	https://www.nist.gov/programs-projects/human-centered-ai	Uses human-centered AI, AI user trust, workplace GenAI and the human task perspective to design the reviewer work system
NIST AI Use Taxonomy PDF	https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.200-1.pdf	Uses the taxonomy of AI contribution to human task to break down review unit, human activity and measurement need
ISO/IEC 42001	https://www.iso.org/standard/42001	Uses AI management system language to connect responsibility, competence, operational control, performance evaluation, management review and improvement
FFIEC Business Continuity Management booklet	https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx	Uses critical operations, dependencies, training, testing, exercise, and board / senior management oversight to design review surge, manual fallback and queue resilience

2.1 Governance Mapping

Framework	Human review operations question	Evidence
NIST Govern	Who owns review policy, capacity, quality, training, exceptions and risk acceptance	RACI, review standard, management report
NIST Map	Which AI use cases, tasks, customer impacts and human judgment points need review	use case map, risk tier, queue taxonomy
NIST Measure	How to measure whether reviewers catch errors, reduce harm and stay consistent	calibration report, QA sample, agreement metric
NIST Manage	How to respond when review fails, backlog builds, quality drops or a high-risk issue is found	escalation runbook, surge mode, route stop
ISO/IEC 42001	How to bring review into the AI management system	policy, competence, operational control, performance review
FFIEC BCM	Whether the review queue can support critical operations under stress	BIA, dependency map, exercise report, management oversight

3. Operating Model Overview

3.1 End-To-End Flow

AI output or action candidate
  -> risk and impact classification
  -> review policy decision
  -> queue creation
  -> skill / risk / independence routing
  -> reviewer workspace
  -> human action
  -> second review or adjudication if needed
  -> downstream workflow
  -> evidence ledger
  -> QA, calibration and dashboard
  -> model / policy / process improvement

3.2 Review Policy Inputs

The review policy engine should be based on:

risk tier
customer impact
financial impact
regulatory impact
reversibility
model confidence and evidence quality
source freshness
tool side effect
reviewer capacity
incident state
customer vulnerability or complaint signal

3.3 Operating Principles

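Principle	Explanation
Risk first	High-risk, high-impact, irreversible and regulatory-sensitive cases come first
Capacity aware	Queue strategy must know the available reviewer hours
Skill routed	Different businesses and risks need different reviewer pools
Evidence based	Reviewers must see sufficient evidence, not just the AI conclusion
Independently challengeable	Critical reviews must not be held hostage by efficiency metrics or AI defaults
Calibrated	Reviewer judgment on boundary cases must be trainable and measurable
Auditable	After an incident, inputs, outputs, evidence, human actions and downstream impact can be replayed
Recoverable	When a review queue fails, surge, manual-first, safe-stop and a recovery gate are available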

4. Queue Taxonomy

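4.1 By Review Purpose

Queue type	Purpose	Examples
Pre-decision review	Review AI suggestions before they enter a business decision	credit memo, KYC risk recommendation
Pre-action review	Review before AI calls a tool or changes state	provisional credit, refund, account freeze
Customer-visible draft review	Review before customer communication is sent	complaint response, fee explanation, dispute update
Exception review	Review triggered by low confidence, conflicting evidence or policy exception	RAG unsupported claim, policy conflict
Post-decision QA	After-the-fact sampling to verify automated or human-assisted results	customer service answer, document classification
Appeal / complaint review	Review triggered by a customer objection or complaint	wrong denial, misleading answer
Calibration review	Review of gold cases or challenge cases	reviewer training and drift
Incident review	Expanded review during a risk event	model drift, RAG stale, policy outage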

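4.2 By Coverage Strategy

Coverage	When to use	Risk
100% review	High impact, irreversible, regulatory-sensitive, material customer rights	High cost, fatigue, backlog
Risk-based review	A reliable risk score, confidence or evidence signal exists	Misses in the risk signal understate risk
Stratified sampling	Quality must be monitored by segment, channel, product and language	Poor sample design can mask problems in minority groups
Exception-only review	Low confidence, missing evidence, policy conflict, customer complaint	Systematic bias in normal samples can go undetected
Shadow review	pilot or model comparison	Cannot replace formal production controls
Random sentinel	Low-rate random sentinel sample	Cost is controllable; used to detect blind spots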

4.3 Queue Priority Rules

Priority	Criteria	Action
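P0	Customer funds, account status, legal threat, regulatory deadline, privacy exposure, irreversible action	stop downstream action until reviewed
P1	High-risk customer, complaint, credit, AML/KYC, fraud, payment dispute	priority SLA and skilled reviewer
P2	Customer-visible but correctable answer or draft	standard review or risk-based sample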
P3	内部 productivity output, low customer impact	periodic QA and feedback
P4	training, taxonomy and improvement samples	scheduled calibration cycle
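
The priority rules above can be sketched as a small classifier. This is only an illustration: the `CaseSignals` flags and the `classify_priority` function are hypothetical names, and the sketch assumes that when several criteria apply, the highest band wins.

```python
from dataclasses import dataclass


@dataclass
class CaseSignals:
    """Hypothetical flags that an upstream rules engine would set."""
    irreversible_action: bool = False
    customer_funds_or_account_state: bool = False
    legal_threat_or_regulatory_deadline: bool = False
    privacy_exposure: bool = False
    high_risk_domain: bool = False   # complaint, credit, AML/KYC, fraud, payment dispute
    customer_visible: bool = False   # visible to the customer but correctable
    internal_productivity: bool = False


def classify_priority(s: CaseSignals) -> str:
    """Return the highest applicable priority band, checking P0 first."""
    if (s.irreversible_action or s.customer_funds_or_account_state
            or s.legal_threat_or_regulatory_deadline or s.privacy_exposure):
        return "P0"  # downstream action stays blocked until reviewed
    if s.high_risk_domain:
        return "P1"
    if s.customer_visible:
        return "P2"
    if s.internal_productivity:
        return "P3"
    return "P4"  # training, taxonomy and improvement samples


print(classify_priority(CaseSignals(privacy_exposure=True, customer_visible=True)))  # P0
```

Checking bands from P0 downward keeps the rule order explicit, so a customer-visible draft that also exposes private data is never downgraded to P2.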

4.4 Time Sensitivity

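Time band	Examples	Queue rule
Immediate	Tool action, customer in a live session, fraud blocking	synchronous or near-real-time review
Same day	Complaint acknowledgment, payment dispute next step, KYC document follow-up	same-day SLA and supervisor watch
Regulatory deadline	Complaints, AML / SAR, dispute rules	deadline priority and breach escalation
Batch QA	Sample review, training data, calibration	periodic processing
Governance review	Trends, audit, management reporting	monthly / quarterly cadence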

5. Skill And Risk Routing

5.1 Reviewer Skill Dimensions

Skill dimension	Examples
Domain	AML, KYC, credit, fraud, complaints, payments, wealth, servicing
Product	credit card, mortgage, deposit, small business, investment, BNPL
Risk	high-risk customer, vulnerable customer, PEP, sanctions, fair lending
Channel	branch, contact center, mobile app, chat, email, back office
Language	English, Spanish, Chinese, bilingual customer communication
Authority	frontline review, senior approval, compliance escalation, route stop
AI literacy	hallucination, RAG evidence, automation bias, tool misuse, prompt injection
Evidence handling	PII, restricted documents, legal hold, audit packet

5.2 Routing Matrix

Trigger	Reviewer pool	Review mode
Payment dispute amount above threshold	Payment dispute specialist	pre-action review
Fraud signal and customer complaint	Fraud senior + complaint specialist	dual review
Credit adverse action draft	Underwriter + compliance-trained reviewer	pre-decision review
AML high-risk typology	Senior AML investigator	mandatory review
KYC PEP or sanctions near match	EDD specialist	100% review
RAG policy conflict	Policy SME	exception review
Customer vulnerability	Vulnerability-trained servicing lead	escalation review
Tool action request with side effect	Authorized approver	approval with trace
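
A minimal sketch of the routing matrix as a lookup table follows. The trigger keys, pool names, `roster` structure and `route` function are hypothetical. The sketch also applies one independence rule as an assumption: the original case handler is never assigned as reviewer.

```python
# Trigger -> (reviewer pools, review mode), transcribed from the matrix above.
ROUTING_MATRIX = {
    "payment_dispute_above_threshold": (("payment_dispute_specialist",), "pre-action review"),
    "fraud_signal_and_complaint": (("fraud_senior", "complaint_specialist"), "dual review"),
    "credit_adverse_action_draft": (("underwriter", "compliance_trained_reviewer"), "pre-decision review"),
    "aml_high_risk_typology": (("senior_aml_investigator",), "mandatory review"),
    "kyc_pep_or_sanctions_near_match": (("edd_specialist",), "100% review"),
    "rag_policy_conflict": (("policy_sme",), "exception review"),
    "customer_vulnerability": (("vulnerability_trained_servicing_lead",), "escalation review"),
    "tool_action_with_side_effect": (("authorized_approver",), "approval with trace"),
}


def route(trigger: str, case_handler_id: str, roster: dict) -> dict:
    """Pick one independent reviewer per required pool and record why."""
    if trigger not in ROUTING_MATRIX:
        raise ValueError(f"no routing rule for trigger: {trigger}")
    pools, mode = ROUTING_MATRIX[trigger]
    assignments = {}
    for pool in pools:
        # Independence rule (assumed): skip the person who handled the case.
        eligible = [r for r in roster.get(pool, []) if r != case_handler_id]
        if not eligible:
            raise LookupError(f"no independent reviewer available in pool: {pool}")
        assignments[pool] = eligible[0]
    # Returning the trigger lets the caller log the routing rationale.
    return {"trigger": trigger, "review_mode": mode, "assignments": assignments}


roster = {"fraud_senior": ["u1", "u2"], "complaint_specialist": ["u3"]}
print(route("fraud_signal_and_complaint", "u1", roster))
```

A production routing engine would also weigh language, certification level and current load; this sketch only shows the trigger-to-pool mapping and the independence filter.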

5.3 Conflict-Of-Interest Controls

Conflict	Control
Same employee created AI prompt and approves output	separate reviewer role
Reviewer measured only on throughput	balanced quality and risk metrics
Sales owner reviews credit exception	independent credit approval
Agent reviews own customer communication QA	second-line QA sample
Model team adjudicates labels used to prove model quality	independent model risk or QA challenge
Vendor reviews its own production failures	internal acceptance and audit sample

5.4 Independence Modes

Mode	Usage
Standard review	reviewer sees AI output and evidence
Blind initial review	reviewer first judges evidence without seeing AI recommendation
Delayed reveal	reviewer submits initial label, then compares AI suggestion
Double review	two reviewers independently review high-risk or ambiguous cases
Adjudication	senior SME resolves disagreement
Second-line QA	independent QA samples first-line review
Audit replay	internal audit or model risk reconstructs case from evidence

6. Capacity Model

6.1 Basic Formula

reviewed_cases = total_cases * review_rate
adjusted_minutes =
  reviewed_cases
  * average_handling_minutes
  * complexity_multiplier
  * (1 + double_review_rate)
  * (1 + rework_rate)
required_fte =
  adjusted_minutes / 60 / productive_hours_per_reviewer

6.2 Required Inputs

Input	Why it matters
total case volume	case arrival baseline
review rate	percentage routed to human review
average handling time	core workload driver
risk mix	high-risk case takes longer
double review rate	senior capacity requirement
rework rate	quality issue and unclear rubric cost
escalation rate	downstream specialist load
productive hours	meetings, breaks, training and fatigue reduce usable time
SLA	determines concurrency and queue size tolerance
arrival pattern	peaks require staffing more than daily average
seasonality	fraud, tax, shopping, disaster and policy events
incident reserve	capacity for model drift or outage surge

6.3 Worked Example

Daily AI-assisted cases: 8,000
Risk-based review rate: 16%
Average handling time: 5.5 minutes
Complexity multiplier: 1.20
Double review rate: 10%
Rework rate: 7%
Productive reviewer hours: 5.75

reviewed_cases = 8000 * 0.16 = 1280
adjusted_minutes = 1280 * 5.5 * 1.20 * 1.10 * 1.07 = 9943.30
required_fte = 9943.30 / 60 / 5.75 = 28.82

Operational reading:

A team of 20 reviewers cannot sustain this control without backlog or quality decay.
If incident mode raises review rate to 25%, required FTE becomes roughly 45.
PM ROI must include reviewers, senior adjudicators, QA, training and workforce planning.
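
The formula in 6.1 and the scenario above can be reproduced with a short script. The `CapacityInputs` dataclass and `required_fte` function are illustrative names, not part of any existing tool. The inputs are the ones listed in this worked example.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CapacityInputs:
    total_cases: float
    review_rate: float
    average_handling_minutes: float
    complexity_multiplier: float
    double_review_rate: float
    rework_rate: float
    productive_hours_per_reviewer: float


def required_fte(c: CapacityInputs) -> float:
    """Apply the 6.1 formula: reviewed cases -> adjusted minutes -> FTE."""
    reviewed_cases = c.total_cases * c.review_rate
    adjusted_minutes = (
        reviewed_cases
        * c.average_handling_minutes
        * c.complexity_multiplier
        * (1 + c.double_review_rate)
        * (1 + c.rework_rate)
    )
    return adjusted_minutes / 60 / c.productive_hours_per_reviewer


baseline = CapacityInputs(8000, 0.16, 5.5, 1.20, 0.10, 0.07, 5.75)
incident = replace(baseline, review_rate=0.25)  # incident mode raises the review rate

print(round(required_fte(baseline), 2))   # 28.82
print(math.ceil(required_fte(baseline)))  # 29 whole reviewers
print(round(required_fte(incident), 2))   # 45.03
```

Rounding up with `math.ceil` gives a headcount floor only. It does not include senior adjudicators, QA, training time or an incident reserve, which 6.2 lists as separate inputs.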

6.4 SLA / OLA Model

Queue	SLA	OLA dependency
P0 tool action	15 minutes	approver on duty, evidence complete
P1 complaint / dispute	4 business hours	specialist availability
P1 credit / AML	same day or regulated deadline	senior reviewer and policy SME
P2 customer-visible draft	1 business day	frontline QA pool
P3 batch QA	5 business days	QA analyst capacity
Calibration	monthly cycle	gold set and trainer availability

SLA is customer-facing or business-facing. OLA is the internal promise between queue owner, reviewer pool, policy SME, platform and escalation team.

6.5 Fatigue And Quality Decay

Fatigue signal	Detection	Control
review time drops sharply	duration analytics	micro-break and supervisor check
override rate collapses	trend dashboard	calibration and blind QA
reason text becomes generic	reason quality score	structured reasons and coaching
disagreement increases	inter-rater metric	rubric refresh
after-hours backlog work rises	workforce report	staffing adjustment
gold case failures rise	gold injection	pause high-risk assignment

6.6 Capacity Thresholds

Signal	Yellow action	Red action
P1 backlog > 80% of daily capacity	activate reserve reviewers	stop low-value review inflow
SLA breach forecast within 4 hours	supervisor triage	senior management risk acceptance
reviewer utilization > 85% for 3 days	reduce discretionary QA	surge staffing or route throttling
gold accuracy below threshold	coaching	remove reviewer from high-risk queue
escalation queue aged beyond OLA	backup owner	incident bridge
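
Two of these signals can be evaluated mechanically, as in the sketch below. The table gives a single trigger per signal, so the red cut-offs here are assumptions: `red_backlog_ratio` is a hypothetical parameter, and the 95% utilization level is borrowed from the KRI table in 13.3 and read as three consecutive days.

```python
def capacity_signals(p1_backlog, daily_capacity, utilization_history,
                     red_backlog_ratio=1.0):
    """Return green / yellow / red for P1 backlog and reviewer utilization.

    utilization_history: daily utilization values, oldest first (0.0 to 1.0).
    red_backlog_ratio: assumed red threshold; not specified in the table.
    """
    signals = {}

    ratio = p1_backlog / daily_capacity
    if ratio > red_backlog_ratio:
        signals["p1_backlog"] = "red"      # stop low-value review inflow
    elif ratio > 0.80:
        signals["p1_backlog"] = "yellow"   # activate reserve reviewers
    else:
        signals["p1_backlog"] = "green"

    last3 = utilization_history[-3:]
    sustained = len(last3) == 3
    if sustained and all(u > 0.95 for u in last3):
        signals["reviewer_utilization"] = "red"     # surge staffing or route throttling
    elif sustained and all(u > 0.85 for u in last3):
        signals["reviewer_utilization"] = "yellow"  # reduce discretionary QA
    else:
        signals["reviewer_utilization"] = "green"
    return signals


print(capacity_signals(900, 1000, [0.86, 0.90, 0.88]))
```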

7. Reviewer Workspace Requirements

7.1 Evidence Panel

Reviewer must not judge only from AI text. The workspace should show:

AI output or action request
source documents and cited spans
source owner, version, effective date
model / prompt / retriever version
customer / case context allowed for the role
missing evidence and contradictory evidence
policy and rubric excerpt
risk flags and downstream impact
previous human actions
allowed decisions and authority limits

7.2 UI Actions

Action	Meaning	Required evidence
Accept	AI output meets rubric and evidence standard	reason optional for low risk, required for high risk
Edit	reviewer corrects content before downstream use	edit diff and reason code
Reject	AI output should not be used	reason code and evidence gap
Override	reviewer changes AI recommendation or route	authority and rationale
Escalate	specialist or senior authority required	escalation trigger and target queue
Request evidence	case cannot be reviewed with current packet	missing evidence type
Stop route	AI route or tool should pause	severity and incident link
Adjudicate	resolve disagreement	final rationale and rubric interpretation
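
One way to enforce the "Required evidence" column is to validate each submitted action before it reaches the downstream workflow. The field names and the `missing_evidence` function below are hypothetical. The mapping is a direct, simplified reading of the table.

```python
# Action -> fields that must be present, transcribed from the table above.
REQUIRED_EVIDENCE = {
    "accept": [],  # reason is optional for low risk, required for high risk
    "edit": ["edit_diff", "reason_code"],
    "reject": ["reason_code", "evidence_gap"],
    "override": ["authority", "rationale"],
    "escalate": ["escalation_trigger", "target_queue"],
    "request_evidence": ["missing_evidence_type"],
    "stop_route": ["severity", "incident_link"],
    "adjudicate": ["final_rationale", "rubric_interpretation"],
}


def missing_evidence(action: str, payload: dict, high_risk: bool = False) -> list:
    """Return the required fields that are absent or empty for this action."""
    if action not in REQUIRED_EVIDENCE:
        raise ValueError(f"unknown reviewer action: {action}")
    required = list(REQUIRED_EVIDENCE[action])
    if action == "accept" and high_risk:
        required.append("reason_code")
    return [field for field in required if not payload.get(field)]


print(missing_evidence("edit", {"edit_diff": "-old +new"}))  # ['reason_code']
```

A workspace could refuse to submit the action while the returned list is non-empty, which keeps an accept on a high-risk case from going through without a reason.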

7.3 Workspace Anti-Patterns

Anti-pattern	Risk
Approve button is dominant and reject hidden	default acceptance
AI confidence shown without evidence	false precision
No policy effective date	stale policy risk
Free-text only reasons	analytics and audit failure
Reviewer cannot see downstream action	weak control over impact
Escalation creates email, not workflow case	lost handoff
Missing source access due entitlement	reviewer guesses

8. Calibration Program

8.1 Calibration Lifecycle

define rubric
  -> create gold and boundary cases
  -> train reviewers
  -> run blind calibration
  -> score agreement and rationale
  -> adjudicate disagreements
  -> update rubric
  -> certify reviewer skill
  -> monitor production drift

8.2 Calibration Case Types

Case type	Purpose
Clear pass	reviewer recognizes valid AI output
Clear fail	reviewer catches unsupported or prohibited output
Boundary case	tests policy nuance
Missing evidence	tests refusal and request-evidence behavior
Conflict evidence	tests escalation judgment
Tool side effect	tests pre-action control
Customer vulnerability	tests harm-sensitive routing
Prompt injection	tests security-aware review
Stale policy	tests source freshness
Automation bias challenge	AI sounds confident but is wrong

8.3 Calibration Metrics

Metric	Meaning
Gold accuracy	reviewer correctness on known cases
Inter-rater agreement	consistency across reviewers
Adjudication overturn rate	how often senior SME reverses initial review
Rationale completeness	reason references evidence and rubric
Escalation correctness	reviewer escalates cases that need specialist judgment
False accept rate	unsafe AI output accepted
False reject rate	valid AI output rejected
Boundary performance	performance on ambiguous but common cases
Drift by reviewer	quality trend over time
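
Several of these metrics reduce to simple counts over labeled cases. The sketch below assumes binary `"accept"` / `"reject"` labels and uses Cohen's kappa as one possible inter-rater agreement measure. The playbook does not prescribe a specific statistic, so that choice is an assumption. The rate functions expect at least one case of the relevant kind.

```python
from collections import Counter


def gold_accuracy(gold, reviewer):
    """Share of gold cases where the reviewer matched the known label."""
    return sum(g == r for g, r in zip(gold, reviewer)) / len(gold)


def false_accept_rate(gold, reviewer):
    """Among gold cases that should be rejected, the share the reviewer accepted."""
    unsafe = [r for g, r in zip(gold, reviewer) if g == "reject"]
    return sum(r == "accept" for r in unsafe) / len(unsafe)


def false_reject_rate(gold, reviewer):
    """Among gold cases that should be accepted, the share the reviewer rejected."""
    valid = [r for g, r in zip(gold, reviewer) if g == "accept"]
    return sum(r == "reject" for r in valid) / len(valid)


def cohens_kappa(a, b):
    """Agreement between two reviewers, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


gold = ["accept", "reject", "reject", "accept"]
reviewer_1 = ["accept", "accept", "reject", "reject"]
print(gold_accuracy(gold, reviewer_1))      # 0.5
print(false_accept_rate(gold, reviewer_1))  # 0.5
```

Kappa is useful here because raw agreement looks high whenever almost every case is an accept, which is exactly the situation in which automation bias hides.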

8.4 Certification Levels

Level	Scope	Requirements
L1 reviewer	low / medium risk drafts and QA	training, gold accuracy, supervisor signoff
L2 specialist	domain-specific high-risk cases	domain certification, calibration pass, case experience
L3 senior adjudicator	disagreement and policy exception	SME authority, rubric ownership, governance reporting
Stop authority	route or tool pause	incident training, management delegation

9. Reviewer Quality Metrics

9.1 Quality Dashboard

Metric	Good use	Bad interpretation
Review throughput	capacity and planning	equating speed with quality
Average handling time	complexity and fatigue signal	forcing shorter review for high-risk work
Override rate	trust calibration signal	assuming lower is always better
Meaningful edit rate	AI quality and reviewer value	punishing reviewers for edits
Reason quality score	audit readiness	treating generic reason as enough
Escalation precision	routing quality	ignoring escalation miss
Escalation miss rate	high-risk control effectiveness	measuring only escalated cases
Inter-rater agreement	rubric clarity	demanding perfect agreement on true ambiguity
Appeal upheld rate	customer harm signal	blaming reviewer without root cause
Audit replay success	evidence completeness	checking only sample presence

9.2 Override Quality

Override type	Quality question
Evidence override	Did reviewer identify missing or conflicting evidence?
Policy override	Did reviewer apply current policy correctly?
Risk override	Did reviewer detect high-risk condition AI missed?
Customer context override	Did reviewer account for vulnerability, hardship, complaint or exception?
Tool boundary override	Did reviewer prevent unauthorized or unsafe action?
Language override	Did reviewer reduce misleading, unfair or non-compliant wording?

9.3 Reason Codes

Code	Meaning
EVIDENCE_MISSING	Required evidence absent
EVIDENCE_CONFLICT	Sources conflict or do not support conclusion
STALE_SOURCE	Policy or source appears outdated
POLICY_MISMATCH	AI output conflicts with policy or procedure
RISK_ESCALATION	Case requires higher authority
CUSTOMER_CONTEXT	AI missed relevant customer context
TOOL_BOUNDARY	Requested action exceeds allowed tool boundary
LANGUAGE_RISK	Customer-visible wording creates compliance or experience risk
DATA_QUALITY	Input data is incomplete, noisy or inconsistent
SECURITY_SIGNAL	Prompt injection, access issue or suspicious content
RUBRIC_AMBIGUITY	Review standard needs clarification
CONTROLLED_ACCEPT	AI accepted with documented high-risk rationale

10. Escalation And Override Governance

10.1 Authority Matrix

Decision	Frontline reviewer	Specialist	Senior adjudicator	Risk / Compliance	AI platform owner
Edit customer draft	yes	yes	yes	consult	no
Reject AI draft	yes	yes	yes	consult	no
Approve money movement	within limit	within higher limit	exception	consult	no
Approve credit exception	no	underwriter authority	senior credit	consult	no
Close AML alert	no	analyst within policy	L2 / compliance	oversight	no
Stop AI route	raise	recommend	approve if delegated	approve	execute
Stop tool action	raise	recommend	approve if delegated	approve	execute
Change rubric	feedback	propose	approve draft	approve high-risk	implement workflow

10.2 Escalation Rules

Trigger	Escalation target	Downstream state
Evidence missing for high-risk decision	specialist queue	decision paused
Policy conflict	policy owner	output blocked
Customer complaint / legal threat	complaint specialist	AI response stopped
High amount or irreversible action	senior approver	tool action blocked
Reviewer disagreement	adjudicator	case held
Prompt injection or data leakage	security incident path	route paused
Backlog threatens SLA	operations lead	capacity surge or route throttle
Quality KRI red	governance owner	release freeze or safe-stop

10.3 Override Governance Controls

High-risk override requires reason code, evidence reference and authority check.
Override can change workflow outcome only within reviewer authority.
Certain overrides require second approval or adjudication.
Override trends feed model, prompt, RAG, policy and training backlog.
Override cannot silently retrain model without usage tag and governance review.
Long-term override pattern may indicate AI quality issue, policy ambiguity or role mismatch.

11. Sampling Design

11.1 Sampling Purposes

Purpose	Sampling design
Production quality	random sentinel and risk-weighted sample
High-risk protection	exception and threshold-based sample
Model improvement	uncertainty, disagreement and drift sample
Fairness / segment coverage	stratified sample across protected or proxy-sensitive segments
Reviewer calibration	gold and boundary sample
Audit evidence	reproducible sample with trace retention
Incident response	expanded sample during exposure window

11.2 Sample Policy Example

Bucket	Share	Rationale
P0 / P1 mandatory review	as needed	customer harm and regulatory exposure
Uncertainty and evidence weakness	25%	catch model and RAG boundary failures
Random sentinel	20%	detect blind spots
Segment / channel coverage	20%	prevent quality blind spots
Complaint / appeal signals	15%	customer harm feedback
Reviewer calibration	10%	quality and drift
Drift / new cluster	10%	taxonomy and policy learning
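
The percentage shares above sum to 100%, so they can be read as a split of the discretionary sample budget that remains after P0 / P1 mandatory review. That reading is an assumption, as are the bucket keys and the `allocate_sample` function. The sketch uses integer arithmetic and largest-remainder rounding so the counts always add up to the budget.

```python
# Discretionary sample shares in percent, transcribed from the table above.
SAMPLE_SHARES_PCT = {
    "uncertainty_and_evidence_weakness": 25,
    "random_sentinel": 20,
    "segment_channel_coverage": 20,
    "complaint_appeal_signals": 15,
    "reviewer_calibration": 10,
    "drift_new_cluster": 10,
}


def allocate_sample(budget: int) -> dict:
    """Split a sample budget across buckets using largest-remainder rounding."""
    if sum(SAMPLE_SHARES_PCT.values()) != 100:
        raise ValueError("sample shares must sum to 100%")
    counts = {b: budget * pct // 100 for b, pct in SAMPLE_SHARES_PCT.items()}
    remainders = {b: budget * pct % 100 for b, pct in SAMPLE_SHARES_PCT.items()}
    leftover = budget - sum(counts.values())
    # Hand the leftover cases to the buckets with the largest fractional part.
    for bucket in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[bucket] += 1
    return counts


print(allocate_sample(200))
```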

11.3 Sampling Evidence

Each sampled case should record the sampling policy version, sampling reason, sampling probability, risk tier, segment tags, model / prompt / source version, review mode and final usage tag: QA, train, eval, audit, incident or exclude.

12. Surge Mode And Degraded Operations

12.1 Surge Triggers

Trigger	Surge action
model drift increases review rate	activate reserve reviewers
RAG stale policy incident	stop free-form answers and review exposure window
fraud spike	route P0 / P1 to fraud surge pool
complaint volume surge	prioritize regulatory deadline queue
reviewer outage	reduce low-risk QA and preserve high-risk queue
evidence export failure	local evidence ledger and draft-only for Tier 1
policy change	temporary double review for affected intents

12.2 Surge Mode Playbook

1. Declare surge condition and scope.
2. Freeze low-value or discretionary review inflow.
3. Reclassify queues by customer harm and deadline.
4. Activate certified reserve reviewers.
5. Move senior reviewers to adjudication and P0/P1 only.
6. Switch low-risk customer-visible outputs to template or delayed response.
7. Increase random sentinel sample for affected workflow.
8. Track backlog, SLA, fatigue and quality every 2-4 hours.
9. Record residual risk and management decisions.
10. Exit surge only after backlog, quality and evidence gates pass.

12.3 Manual-First Mode

Manual-first is appropriate when AI can still retrieve or summarize but cannot safely recommend, decide or act.

Allowed: evidence retrieval, extractive summary, queue prioritization with human confirmation, approved templates.

Blocked: final recommendation framing, auto-send, state-changing tool calls, case closure, customer-specific promises.

12.4 Recovery Gate

Normal review operations resume when dependency health is stable, backlog is within tolerance, P0/P1 SLA breach risk is controlled, reviewer quality metrics are above threshold, local evidence is reconciled, sampled surge-period decisions pass QA, and business owner plus risk owner approve exit.
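
The recovery gate is a conjunction of conditions, so it can be sketched as an all-or-nothing check that also reports which gates are still open. The gate keys and function names are hypothetical. How each flag is computed is left to the owning team.

```python
# One entry per condition in the recovery gate paragraph above.
RECOVERY_GATES = (
    "dependency_health_stable",
    "backlog_within_tolerance",
    "p0_p1_sla_breach_risk_controlled",
    "reviewer_quality_above_threshold",
    "local_evidence_reconciled",
    "surge_period_sample_qa_passed",
    "business_owner_approved",
    "risk_owner_approved",
)


def failed_gates(status: dict) -> list:
    """Return every gate that is missing or not yet True."""
    return [gate for gate in RECOVERY_GATES if status.get(gate) is not True]


def can_exit_surge(status: dict) -> bool:
    """Surge mode ends only when all gates pass."""
    return not failed_gates(status)


status = {gate: True for gate in RECOVERY_GATES}
status["local_evidence_reconciled"] = False
print(can_exit_surge(status))  # False
print(failed_gates(status))    # ['local_evidence_reconciled']
```

Treating a missing key as a failed gate means an unreported condition blocks the exit instead of silently passing.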

13. Dashboards And KRIs

13.1 Operations Dashboard

Metric	Purpose
queue depth by priority	workload and backlog
queue age by SLA	breach forecast
arrival rate vs completion rate	capacity balance
review rate by workflow	policy and model behavior
reviewer utilization	fatigue and staffing
escalation volume	risk pressure
adjudication backlog	senior bottleneck
surge reserve availability	resilience readiness

13.2 Quality Dashboard

Metric	Purpose
gold accuracy	reviewer correctness
inter-rater agreement	rubric consistency
false accept rate	unsafe AI accepted
false reject rate	valid AI blocked
reason quality	audit readiness
meaningful edit rate	reviewer value and AI quality
appeal upheld rate	customer harm
audit replay success	evidence completeness

13.3 Risk KRIs

KRI	Yellow	Red
P1 SLA breach forecast	breach likely within day	breach active or unavoidable
Override rate collapse	below historical band	near zero with short review time
Escalation miss	increasing in QA	confirmed high-risk miss
Gold accuracy	below target	high-risk reviewer fails
Evidence completeness	minor field gaps	trace cannot be replayed
Reviewer utilization	>85% for 3 days	sustained >95% or after-hours spike
Appeal upheld	trend up	severe customer harm case
Adjudication overturn	trend up	rubric or training failure

13.4 Executive View

Executive reporting should answer whether high-risk queues are within tolerance, review quality is stable, reviewers are overloaded, AI outputs require more override, escalation and stop authorities work, audit evidence is replayable, surge mode is tested, and residual risk has been accepted.

14. RACI

Activity	Accountable	Responsible	Consulted	Informed
Review strategy	Business Owner	AI PM	Risk, Compliance, Operations	Senior management
Queue taxonomy	Operations Owner	AI BA + Workflow Lead	Product, Architecture, QA	Reviewer teams
Review policy engine	AI Product Owner	Architect + Platform	Risk, Legal, Security	Operations
Capacity model	Operations Executive	Workforce Planning	PM, Finance, Risk	Queue owners
Skill routing	Operations Owner	Training Lead + Queue Manager	HR, Compliance	Reviewers
Calibration program	QA Lead	SME + Training Lead	Model Risk, Product	Governance forum
Evidence schema	Governance Evidence Owner	Platform Engineering	Audit, Privacy, Security	Business owner
Override governance	Risk Owner	Product + Operations	Compliance, Legal	Internal Audit
Surge mode	Operations Executive	Queue Managers	BCP, Platform, Risk	Senior management
Route stop	Business + Risk Owner	AI Platform Owner	Security, Compliance	All affected teams
Dashboard reporting	AI Governance Owner	Analytics + AI Ops	Operations, QA	Executive committee
Audit replay	Internal Audit	Evidence Owner	Product, Platform, Risk	Senior management

15. Financial Retail Examples

15.1 AML Copilot

Dimension	Design
AI role	summarize alert, transactions, KYC profile, typology and narrative draft
Review queue	high-risk typology, SAR draft, close-alert recommendation
Skill routing	AML analyst, senior investigator, compliance reviewer
Evidence	transaction pattern, source systems, customer profile, typology reference
Calibration	red flag gold cases and suspicious narrative boundary cases
Override	analyst edits narrative; senior reviewer blocks closure
KRI	missed red flag, L2 disagreement, SAR narrative edit rate
Red line	AI cannot be sole basis for closing high-risk alert

15.2 Credit Underwriting Copilot

Dimension	Design
AI role	prepare credit memo draft, policy citations, missing information list
Review queue	adverse action reason, policy exception, borderline case
Skill routing	underwriter, senior credit approver, fair lending reviewer
Evidence	application data, income docs, credit report references, policy version
Calibration	adverse reason specificity and exception boundary cases
Override	underwriter can reject AI memo and require source correction
KRI	policy citation accuracy, adverse reason correction, appeal upheld
Red line	AI cannot independently approve, decline or generate final adverse action reason

15.3 Payment Dispute Assistant

Dimension	Design
AI role	summarize dispute, recommend next step, draft customer update
Review queue	provisional credit, denial draft, high amount, fraud signal
Skill routing	dispute specialist, fraud senior, complaint lead
Evidence	transaction, merchant, customer claim, rule deadline, history
Calibration	wrong denial, deadline pressure and evidence sufficiency cases
Override	supervisor changes amount, blocks denial, escalates complaint
KRI	provisional credit error, wrong denial, SLA breach, complaint escalation
Red line	AI tool call cannot bypass amount limit or dual approval

15.4 Customer Service RAG

Dimension	Design
AI role	answer policy questions and draft agent response
Review queue	high-risk intent, unsupported claim, stale source, customer complaint
Skill routing	servicing SME, policy owner, complaint specialist
Evidence	source article, effective date, citation span, customer entitlement
Calibration	claim-level support and refusal boundary cases
Override	agent edits answer; policy owner updates source issue
KRI	unsupported claim, agent edit distance, escalation miss, source freshness
Red line	AI cannot make customer-specific fee, credit or legal commitments without approved process

16. Templates

16.1 Human Review Design Brief

Field	Filled example
Use case	Payment dispute assistant
AI behavior	Summarizes evidence and recommends next action
Review unit	Provisional credit recommendation
Risk tier	High customer and financial impact
Queue	Pre-action payment review
Reviewer skill	Payment dispute specialist
SLA	4 business hours or sooner when deadline risk exists
Evidence	transaction, customer claim, rules deadline, AI rationale, policy version
Allowed actions	approve, reject, edit amount, request evidence, escalate
Override authority	supervisor for amount above frontline threshold
Audit fields	trace ID, reviewer, reason code, evidence references, timestamp
KRI	wrong credit, wrong denial, SLA breach, complaint escalation

16.2 Review Requirement Pattern

For payment dispute provisional credit,
when AI recommends an amount above the frontline threshold
or detects fraud, complaint or deadline risk,
the system must route the case to a payment dispute specialist
before any payment action is submitted,
showing transaction evidence, customer claim, rule deadline, policy version and AI rationale,
allowing approve, reject, edit amount, request evidence and escalate,
capturing reason code, reviewer identity, evidence references, timestamp and tool trace,
and enforcing a four-business-hour SLA with supervisor escalation on breach forecast.

16.3 Queue Configuration Card

Field	Example
queue_id	`payment_dispute_pre_action_p1`
owner	Payment Dispute Operations Lead
priority	P1
review policy	100% for amount above threshold, risk-based below threshold
skill requirement	dispute specialist certification
independence	reviewer cannot be original case handler for high amount
SLA	4 business hours
OLA	evidence builder under 5 minutes, approver assignment under 15 minutes
escalation	supervisor queue after 75% SLA consumption
surge rule	activate reserve pool when backlog exceeds 80% daily capacity
evidence retention	governed case record and immutable trace
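
A queue card like this maps naturally onto a small configuration object. In the sketch below, `QueueConfig` and its methods are hypothetical names. It makes two simplifying assumptions: the SLA clock starts at assignment, and business hours are treated as plain elapsed hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class QueueConfig:
    queue_id: str
    owner: str
    priority: str
    sla: timedelta
    escalate_at_sla_fraction: float = 0.75  # supervisor queue after 75% SLA consumption
    surge_backlog_ratio: float = 0.80       # reserve pool above 80% of daily capacity

    def escalation_due_at(self, assigned_at: datetime) -> datetime:
        return assigned_at + self.sla * self.escalate_at_sla_fraction

    def should_escalate(self, assigned_at: datetime, now: datetime) -> bool:
        return now >= self.escalation_due_at(assigned_at)

    def should_activate_reserve(self, backlog: int, daily_capacity: int) -> bool:
        return backlog > self.surge_backlog_ratio * daily_capacity


cfg = QueueConfig(
    queue_id="payment_dispute_pre_action_p1",
    owner="Payment Dispute Operations Lead",
    priority="P1",
    sla=timedelta(hours=4),
)
assigned = datetime(2024, 1, 8, 9, 0)
print(cfg.escalation_due_at(assigned))  # 2024-01-08 12:00:00
```

A real implementation would need a business-calendar helper for the four-business-hour SLA; that part is deliberately left out here.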

16.4 Evidence Ledger Schema

Field group	Fields
Identity	trace_id, case_id, customer_hash, workflow, risk_tier
AI config	model_id, model_version, prompt_version, retriever_version
Source	document_ids, source_owner, effective_date, citation_spans
Review	queue_id, reviewer_id, reviewer_role, review_mode, action_type
Decision	reason_code, rationale, edit_diff, override_flag, escalation_id
Timing	created_at, assigned_at, due_at, completed_at, SLA_status
Downstream	tool_action, approval_id, customer_visible_message_id
Governance	rubric_version, sampling_policy, incident_id, retention_class
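
To support audit replay, the schema can be checked for completeness per record. The field groups below are transcribed from the table. The `OPTIONAL_FIELDS` set is an assumption about which fields may legitimately be empty (for example, a case with no edit or no escalation), and `replay_gaps` is a hypothetical helper.

```python
LEDGER_SCHEMA = {
    "identity": ["trace_id", "case_id", "customer_hash", "workflow", "risk_tier"],
    "ai_config": ["model_id", "model_version", "prompt_version", "retriever_version"],
    "source": ["document_ids", "source_owner", "effective_date", "citation_spans"],
    "review": ["queue_id", "reviewer_id", "reviewer_role", "review_mode", "action_type"],
    "decision": ["reason_code", "rationale", "edit_diff", "override_flag", "escalation_id"],
    "timing": ["created_at", "assigned_at", "due_at", "completed_at", "SLA_status"],
    "downstream": ["tool_action", "approval_id", "customer_visible_message_id"],
    "governance": ["rubric_version", "sampling_policy", "incident_id", "retention_class"],
}

# Assumed optional: these depend on what happened in the case.
OPTIONAL_FIELDS = {
    "edit_diff", "escalation_id", "tool_action", "approval_id",
    "customer_visible_message_id", "sampling_policy", "incident_id",
}


def replay_gaps(record: dict) -> dict:
    """Return {field_group: [missing required fields]} for one ledger record."""
    gaps = {}
    for group, fields in LEDGER_SCHEMA.items():
        # 'is None' keeps legitimate falsy values such as override_flag=False.
        missing = [f for f in fields
                   if f not in OPTIONAL_FIELDS and record.get(f) is None]
        if missing:
            gaps[group] = missing
    return gaps


record = {f: "x" for fields in LEDGER_SCHEMA.values() for f in fields}
record["override_flag"] = False
del record["model_version"]
print(replay_gaps(record))  # {'ai_config': ['model_version']}
```

The share of records with an empty result is one way to compute the audit replay success metric from 13.2.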

16.5 Surge Decision Card

Field	Filled example
Trigger	Customer service RAG stale policy incident
Scope	Fee, dispute and complaint intents from 08:00 CT onward
Mode	Template-only for customer-facing answers, manual-first for exceptions
Queues prioritized	P0 complaint, P1 dispute, P1 fee correction
Capacity action	Activate servicing SME reserve pool
Evidence action	Preserve source IDs and customer-visible message versions
Executive update	Every 4 hours until backlog and evidence gate pass
Exit condition	index validated, sample QA passed, backlog below threshold

17. 30-Day Lab

Goal: within 30 days, complete a presentable AI Human Review Operations / Capacity Architecture portfolio pack. Recommended choices are Payment Dispute Assistant, Customer Service RAG, AML Copilot or Credit Underwriting Copilot.

Day	Task	Artifact
1	Choose one retail financial AI workflow	use case boundary card
2	Define AI behavior: retrieve, summarize, recommend, draft, act	AI behavior map
3	Break down the review unit: claim, draft, recommendation, tool action, sample	review unit taxonomy
4	Do risk tiering and a customer impact map	risk and impact matrix
5	Design the queue taxonomy	queue catalog
6	Define priority rules: P0-P4	priority rulebook
7	Design skill routing	reviewer skill matrix
8	Design conflict-of-interest controls	independence control table
9	Write reviewer workspace requirements	evidence and action spec
10	Build the capacity model baseline	capacity worksheet
11	Run peak and incident sensitivity	capacity scenario memo
12	Define SLA / OLA	SLA and OLA table
13	Design fatigue controls	fatigue monitoring plan
14	Design calibration cases	12-case calibration pack
15	Define certification levels	reviewer certification model
16	Design reason codes	override reason taxonomy
17	Design escalation rules	escalation matrix
18	Design override governance	authority matrix
19	Design the sampling policy	sampling strategy memo
20	Design dashboard and KRIs	operations and quality dashboard spec
21	Design the evidence ledger	trace schema
22	Design surge mode	surge playbook
23	Design manual-first fallback	degraded review runbook
24	Write the RACI	responsibility matrix
25	Walk one business case end to end	case walkthrough
26	Run a tabletop: reviewer capacity shortage	exercise script and decision log
27	Run a tabletop: model drift increases review rate	exercise script and decision log
28	Write the executive memo	capacity risk and ROI memo
29	Write the architecture review pack	workflow, queue, trace and dashboard diagrams
30	Prepare the interview story	STAR-T story and 8 Q&A

18. Interview Answers

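Q1: What is the difference between human review and HITL?

30 seconds:

HITL only states that a human participates in the process. Human review operations must prove that this person has the skill, time, evidence, independence, authority and escalation path, and that the organization can measure whether review actually reduces risk.

2 minutes:

I would design human review as a production control system. Step one: define the review unit, such as a claim, draft, recommendation or tool action. Step two: design the queue taxonomy by risk and customer impact. Step three: build skill routing, the capacity model, SLA/OLA and the reviewer workspace. Step four: establish calibration, reason codes, override quality, adjudication and audit evidence. Human review is then not just an approval button but a control capability that can be operated, audited and scaled.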

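Q2: How do you tell whether human review has become control theater?

30 seconds:

Look for five signals: extremely short review time, an override rate abnormally close to zero, vague reason codes, reviewers without evidence or authority, and chronically insufficient queue capacity.

2 minutes:

The essence of control theater is that the control exists in form but cannot change outcomes. I would check whether reviewers can see the evidence, can reject and escalate, have a stop path, have enough capacity, and have completed calibration. On the metrics side I look at meaningful edit, override quality, escalation miss, gold accuracy, inter-rater agreement, audit replay success and fatigue signal. A sustained override rate of zero is not automatically a good result; it may reflect automation bias or management pressure.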
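The five signals in the 30-second answer can be turned into a simple queue health check. The sketch below is illustrative only: the `QueueSnapshot` fields, the `THRESHOLDS` values and the signal names are hypothetical and would need to be set by the operations and risk owners for each queue.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical weekly summary for one review queue."""
    median_review_seconds: float
    override_rate: float           # overrides / reviewed cases
    generic_reason_share: float    # share of reason codes that are blank or "other"
    evidence_access_share: float   # share of reviewers who can open the evidence
    utilization: float             # demand hours / productive hours


# Assumed thresholds for illustration; not taken from any standard.
THRESHOLDS = {
    "min_review_seconds": 30.0,
    "min_override_rate": 0.005,
    "max_generic_reason_share": 0.40,
    "min_evidence_access_share": 1.0,
    "max_utilization": 1.0,
}


def control_theater_signals(s: QueueSnapshot, t: dict = THRESHOLDS) -> list:
    """Return the names of the warning signals that fire for this snapshot."""
    signals = []
    if s.median_review_seconds < t["min_review_seconds"]:
        signals.append("review_time_too_short")
    if s.override_rate < t["min_override_rate"]:
        signals.append("override_rate_near_zero")
    if s.generic_reason_share > t["max_generic_reason_share"]:
        signals.append("reason_codes_vague")
    if s.evidence_access_share < t["min_evidence_access_share"]:
        signals.append("reviewer_lacks_evidence_or_authority")
    if s.utilization > t["max_utilization"]:
        signals.append("capacity_insufficient")
    return signals
```

A single firing signal is a prompt for investigation rather than proof of control theater; several firing together on the same queue is the pattern the answer above describes.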

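Q3: How do you do capacity planning?

30 seconds:

Model it with case volume, review rate, average handling time, complexity, dual-review ratio, rework rate, productive hours, SLA and surge reserve, then run peak and incident sensitivity.

2 minutes:

I would not look only at average daily volume. After AI goes live, the review rate fluctuates with model drift, policy changes and incidents. The model should separate P0/P1/P2 queues and estimate average handling time and dual-review demand. Productive hours are then derived from scheduled time by deducting training, meetings, breaks and a fatigue factor. Finally, simulate peaks and incidents: if the review rate goes from 15% to 30%, can the SLA still be held; if senior adjudicators run short, which cases get blocked.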
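The arithmetic behind this answer can be sketched as a small staffing function. This is a minimal single-queue model under stated assumptions: the field names, the multiplicative treatment of dual review and rework, and all example numbers are hypothetical, and a real model would run per priority queue and add SLA-driven queueing effects.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CapacityInputs:
    daily_cases: int          # AI-handled cases per day
    review_rate: float        # share of cases routed to human review
    aht_minutes: float        # average handling time per review touch
    dual_review_share: float  # share of reviews needing a second reviewer
    rework_rate: float        # share of touches that must be redone
    scheduled_hours: float    # rostered hours per reviewer per day
    shrinkage: float          # share lost to training, meetings, breaks, fatigue
    surge_reserve: float      # extra headroom held for peaks and incidents


def reviewers_needed(inp: CapacityInputs) -> int:
    """Headcount needed to clear one day's review demand within the day."""
    reviews = inp.daily_cases * inp.review_rate
    touches = reviews * (1 + inp.dual_review_share) * (1 + inp.rework_rate)
    demand_hours = touches * inp.aht_minutes / 60
    productive_hours = inp.scheduled_hours * (1 - inp.shrinkage)
    return math.ceil(demand_hours / productive_hours * (1 + inp.surge_reserve))


# Hypothetical baseline, then the 15% -> 30% review rate sensitivity.
baseline = CapacityInputs(10_000, 0.15, 12.0, 0.10, 0.05, 8.0, 0.25, 0.0)
incident = replace(baseline, review_rate=0.30)
print(reviewers_needed(baseline), reviewers_needed(incident))
```

Under these assumed inputs, doubling the review rate roughly doubles the required headcount, which is why the surge reserve and the senior adjudicator pool need to be sized before an incident rather than during one.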

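Q4: How do you design skill routing?

30 seconds:

Routing cannot assign by queue idleness alone; it must assign by business domain, product, risk, language, authority, AI literacy and evidence access entitlement.

2 minutes:

For example, a payment dispute case involving a high-value provisional credit and fraud signals cannot go to a general customer service reviewer. It needs a payment dispute specialist, and the fraud signal requires a fraud senior to participate. A credit adverse action draft needs an underwriter and a fair-lending-aware reviewer. The routing engine should combine case risk, product, customer context, language, authority and conflict-of-interest rules, and log why the case was assigned to that reviewer.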
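A routing rule of this kind can be sketched as an eligibility filter followed by a load-based tie-break, with every exclusion written to the log. The `Reviewer` and `Case` shapes and the `route` function below are hypothetical; the sketch assigns a single reviewer and does not model cases that need several specialists to participate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    reviewer_id: str
    skills: frozenset
    languages: frozenset
    authority_level: int
    open_cases: int = 0


@dataclass(frozen=True)
class Case:
    case_id: str
    required_skills: frozenset
    language: str
    min_authority: int
    conflicted_reviewers: frozenset = frozenset()


def route(case: Case, reviewers: list) -> tuple:
    """Return (chosen reviewer or None, log lines explaining the decision)."""
    log, eligible = [], []
    for r in reviewers:
        if r.reviewer_id in case.conflicted_reviewers:
            log.append(f"{r.reviewer_id}: excluded (conflict of interest)")
        elif not case.required_skills <= r.skills:
            missing = sorted(case.required_skills - r.skills)
            log.append(f"{r.reviewer_id}: excluded (missing skills {missing})")
        elif case.language not in r.languages:
            log.append(f"{r.reviewer_id}: excluded (language {case.language})")
        elif r.authority_level < case.min_authority:
            log.append(f"{r.reviewer_id}: excluded (authority below {case.min_authority})")
        else:
            eligible.append(r)
    if not eligible:
        log.append(f"{case.case_id}: no eligible reviewer, escalate")
        return None, log
    # Idle capacity is only a tie-break among reviewers who already qualify.
    chosen = min(eligible, key=lambda r: r.open_cases)
    log.append(f"{chosen.reviewer_id}: assigned ({chosen.open_cases} open cases)")
    return chosen, log
```

The design choice worth noting is the order: skill, language, authority and independence are hard filters, while queue idleness only decides between reviewers who passed all of them.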

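Q5: How do you do reviewer calibration?

30 seconds:

Use gold cases, boundary cases, blind review, double review and adjudication to train reviewers to reach consistent judgments on the same evidence.

2 minutes:

I would first define a rubric covering evidence sufficiency, policy interpretation, risk escalation, customer impact and permitted actions. Then design clear pass, clear fail, boundary, missing evidence, stale policy, tool side effect and automation bias challenge cases. Reviewers do a blind review first, then compare against the adjudicator standard. Metrics include gold accuracy, inter-rater agreement, adjudication overturn, reason completeness and escalation correctness. Calibration results determine certification and the queues a reviewer may handle.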
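Two of the metrics named above can be computed directly from blind review results. The sketch below assumes one categorical decision per case and uses Cohen's kappa as the agreement measure; kappa is one common choice for two raters and is an assumption here, since the playbook does not prescribe a specific statistic.

```python
from collections import Counter


def gold_accuracy(decisions: list, gold: list) -> float:
    """Share of gold cases where the reviewer matched the adjudicated answer."""
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two reviewers, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical blind review of four calibration cases by two reviewers.
reviewer_a = ["accept", "accept", "reject", "reject"]
reviewer_b = ["accept", "reject", "reject", "reject"]
gold = ["accept", "reject", "reject", "reject"]
print(gold_accuracy(reviewer_a, gold), cohens_kappa(reviewer_a, reviewer_b))
```

Raw percent agreement overstates consistency when most cases are easy accepts, which is why a chance-corrected measure is usually reported alongside gold accuracy.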

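Q6: Is a lower override rate always better?

30 seconds:

Not necessarily. A very low rate may mean the AI is good, or it may mean reviewers accept by default. Judge it together with review time, reason quality, gold case performance, appeals upheld and audit samples.

2 minutes:

Override is a trust calibration metric, not a standalone success metric. As AI quality improves, override should decline within a reasonable range, but if it suddenly approaches zero while review time shortens, reasons become vague and gold cases fail, that is a control failure signal. I care more about override quality: whether human overrides correct missing evidence, policy errors, customer context, tool overreach or language risk.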

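Q7: How do you degrade when human review fails?

30 seconds:

When queues back up or reviewer quality declines, enter surge or manual-first mode: stop low-value review, prioritize P0/P1, activate the reserve pool, downgrade customer-visible AI, and record management risk acceptance.

2 minutes:

First identify scope: which workflows, intents, customer segments and actions are affected. Then freeze low-risk discretionary QA and give the capacity to P0/P1. For customer-visible high-risk AI, switch to template-only or draft-only. For tool actions, stop automated actions and wait for authorized review. Activate certified reserve reviewers, and check backlog, SLA, fatigue and quality every 2-4 hours. Before returning to normal, pass the backlog, quality, evidence and business signoff gates.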
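The mode switch and the exit gate described above can be sketched as two small functions. The trigger thresholds in `LIMITS` and the mapping of triggers to modes are assumptions made for illustration; the four exit gates are the ones named in the answer.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SURGE = "surge"
    MANUAL_FIRST = "manual_first"


# Assumed trigger levels; real values belong in the surge playbook.
LIMITS = {
    "surge_backlog_hours": 8.0,
    "manual_backlog_hours": 24.0,
    "min_gold_accuracy": 0.90,
}

# Gates that must all pass before returning to normal operation.
EXIT_GATES = ("backlog", "quality", "evidence", "business_signoff")


def select_mode(backlog_hours: float, gold_accuracy: float, limits: dict = LIMITS) -> Mode:
    """Pick an operating mode from backlog age and reviewer quality."""
    if (backlog_hours >= limits["manual_backlog_hours"]
            or gold_accuracy < limits["min_gold_accuracy"]):
        return Mode.MANUAL_FIRST
    if backlog_hours >= limits["surge_backlog_hours"]:
        return Mode.SURGE
    return Mode.NORMAL


def can_return_to_normal(gates: dict) -> bool:
    """A gate that is missing from the report counts as not passed."""
    return all(gates.get(name, False) for name in EXIT_GATES)
```

Treating a missing gate as a failure keeps the default conservative: the system stays degraded until someone positively confirms each exit condition.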

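Q8: How do you explain human review cost to executives?

30 seconds:

Review cost is an AI control cost, not an extra administrative burden. Without it, the ROI of high-risk AI is illusory, because the risk is shifted to employees, customers and audit.

2 minutes:

I would use the capacity model to show three things: how many reviewers normal volume requires, how much surge reserve peaks and incidents require, and how review quality reduces customer harm and regulatory risk. I would also compare the cost and residual risk of different strategies: 100% review, risk-based review, sampling and exception review. What executives need to see is controlled automation, not automation gains built on unsustainable workforce pressure.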

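Q9: What audit evidence should be retained?

30 seconds:

At minimum retain trace ID, AI version, prompt / retriever / source version, evidence, reviewer, action, reason code, timestamps, downstream action and escalation.

2 minutes:

Audit must be able to replay what happened at the time: what input the AI saw, which model and prompt were used, which sources were retrieved, whether the evidence was valid, why the review policy routed the case to that queue, what the reviewer saw, what action was taken, what the reason was, whether there was an override, whether it was escalated, and whether the downstream tool executed. Without these fields, human review can hardly be shown to be effective, and the organization cannot learn from incidents.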
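The minimum field list above can be enforced with a completeness check before a review record is written to the evidence ledger. The field names below are hypothetical renderings of the items in the answer, and the choice of which fields may legitimately be empty is an assumption.

```python
# Hypothetical field names mirroring the minimum evidence list above.
REQUIRED_FIELDS = (
    "trace_id",
    "model_version",
    "prompt_version",
    "retriever_version",
    "source_versions",
    "evidence_refs",
    "routing_reason",
    "reviewer_id",
    "action",
    "reason_code",
    "reviewed_at",
    "downstream_action",
    "escalation",
)

# Must be present as keys, but may be None when nothing happened
# (for example, the case was not escalated).
NULLABLE_FIELDS = frozenset({"downstream_action", "escalation"})


def replay_gaps(record: dict) -> list:
    """List the fields that would prevent an auditor from replaying the review."""
    gaps = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            gaps.append(f"{name}: missing")
        elif name not in NULLABLE_FIELDS and record[name] in (None, "", [], ()):
            gaps.append(f"{name}: empty")
    return gaps
```

Requiring nullable fields to be present as explicit keys distinguishes "no escalation occurred" from "escalation was never recorded", which matters when a trace is replayed months later.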

19. Final Operating Principle

At minimum, an AI human review system must answer:

Question	Mature answer
What is reviewed?	claim, draft, recommendation, tool action, sampled outcome
Who reviews?	certified reviewer with domain skill, authority and independence
How is it routed?	risk, skill, SLA, capacity and conflict-of-interest rules
What evidence is shown?	sources, versions, missing evidence, AI trace, policy, downstream impact
What can the human do?	accept, edit, reject, override, escalate, request evidence, stop
How is quality known?	calibration, QA, agreement, gold cases, appeal outcomes
How is capacity controlled?	model, dashboard, surge reserve, fatigue controls
How is it audited?	immutable trace, reason code, evidence reference and replay

Final principle:

Human review is effective only when the organization can prove
that the right human reviewed the right case,
with the right evidence and authority,
within the right time,
and that the review measurably improved control quality.