AI Human Review Operations / Capacity Architecture Playbook
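Positioning: For senior AI PMs, AI BAs, AI Architects, Enterprise Architects, Operations Leads, Workforce Planning, Model Risk, Compliance and Internal Audit. The goal is to upgrade AI human review from "someone checks it" into a production control system that can be operated, scaled, calibrated, audited and recovered.
Scope: AML Copilot, KYC Review Assistant, Credit Underwriting Copilot, Payment Dispute Assistant, Complaint Response Agent, Customer Service RAG, Fraud Alert Triage, Collections Copilot, internal knowledge assistants for retail finance, and agentic workflows.
Important note: This document is learning, portfolio and internal governance training material. It is not legal advice, a compliance conclusion, an audit opinion, a model validation report or a regulatory interpretation. Formal projects must be confirmed by Legal, Compliance, Risk, Model Risk, Internal Audit, Security, Privacy, Business Owner, Operations, Workforce Planning and management, taking into account institution type, jurisdiction, business use, customer impact and internal policy.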
1. Executive Framing
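Human review is not a decorative seat belt on an AI project. It is part of the production system and directly determines whether the AI can run safely under real business pressure.
Many organizations already know they need HITL, but the failures occur at a more operational level:
Cases that need review all land in one large, undifferentiated queue.
Reviewers lack the necessary skills or authority.
Review volume exceeds capacity, so reviewers can only skim.
Training covers system operation only, not boundary judgment.
Overrides carry no reason code or evidence reference.
Escalations have no named recipient, time limit or pause rule.
Management tracks only throughput, not quality or fatigue.
Audit cannot replay the AI output, evidence, human judgment and downstream action.
The core judgment of this playbook: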
Human review is production capacity.
It must be routed, staffed, calibrated, measured, governed and evidenced.
Otherwise it becomes a bottleneck or control theater.
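1.1 How This Differs from Generic HITL
| Generic HITL | Human Review Operations / Capacity Architecture |
| --- | --- |
| Asks whether a human is involved | Asks who makes which judgment, under what conditions and on what evidence |
| Emphasizes approval points | Emphasizes queues, SLA, skills, capacity, calibration and escalation |
| Usually a node in a process diagram | An operating model, workflow engine and evidence system |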
A team of 20 reviewers cannot sustain this control without backlog or quality decay.
If incident mode raises review rate to 25%, required FTE becomes roughly 45.
PM ROI must include reviewers, senior adjudicators, QA, training and workforce planning.
6.4 SLA / OLA Model
| Queue | SLA | OLA dependency |
| --- | --- | --- |
| P0 tool action | 15 minutes | approver on duty, evidence complete |
| P1 complaint / dispute | 4 business hours | specialist availability |
| P1 credit / AML | same day or regulated deadline | senior reviewer and policy SME |
| P2 customer-visible draft | 1 business day | frontline QA pool |
| P3 batch QA | 5 business days | QA analyst capacity |
| Calibration | monthly cycle | gold set and trainer availability |
SLA is customer or business-facing. OLA is the internal promise between queue owner, reviewer pool, policy SME, platform and escalation team.
6.5 Fatigue And Quality Decay
| Fatigue signal | Detection | Control |
| --- | --- | --- |
| review time drops sharply | duration analytics | micro-break and supervisor check |
| override rate collapses | trend dashboard | calibration and blind QA |
| reason text becomes generic | reason quality score | structured reasons and coaching |
| disagreement increases | inter-rater metric | rubric refresh |
| after-hours backlog work rises | workforce report | staffing adjustment |
| gold case failures rise | gold injection | pause high-risk assignment |
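The first fatigue signal, a sharp drop in review time, could be detected with a simple median comparison. This is a minimal sketch, not a prescribed method: the function name, the input shape (review durations in seconds) and the 0.5 ratio are assumptions for illustration and would need tuning per queue.

```python
from statistics import median


def review_time_dropped(baseline_secs, recent_secs, ratio=0.5):
    """Return True when the recent median review time falls below
    `ratio` times the baseline median.

    The 0.5 default is an illustrative assumption, not a value from the playbook.
    """
    if not baseline_secs or not recent_secs:
        return False  # not enough data to judge
    return median(recent_secs) < ratio * median(baseline_secs)


# A reviewer who used to take about two minutes now takes under one.
print(review_time_dropped([120, 130, 110], [40, 50, 45]))
```

A flag from this check should trigger the control in the table (micro-break and supervisor check), not an automatic performance judgment.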
6.6 Capacity Thresholds
| Signal | Yellow action | Red action |
| --- | --- | --- |
| P1 backlog > 80% of daily capacity | activate reserve reviewers | stop low-value review inflow |
| SLA breach forecast within 4 hours | supervisor triage | senior management risk acceptance |
| reviewer utilization > 85% for 3 days | reduce discretionary QA | surge staffing or route throttling |
| gold accuracy below threshold | coaching | remove reviewer from high-risk queue |
| escalation queue aged beyond OLA | backup owner | incident bridge |
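Two of the signals above are directly computable from queue data. The sketch below only detects whether a signal has fired; choosing between the yellow and red action remains an operations decision. The function name, signal labels and parameters are hypothetical.

```python
def fired_capacity_signals(p1_backlog, daily_capacity, hours_to_forecast_breach=None):
    """Return the capacity signals that have fired.

    p1_backlog and daily_capacity are case counts; hours_to_forecast_breach
    is None when no SLA breach is forecast. All names are illustrative.
    """
    signals = []
    # Integer comparison avoids float rounding at the 80% boundary.
    if daily_capacity > 0 and p1_backlog * 100 > 80 * daily_capacity:
        signals.append("p1_backlog_over_80pct_daily_capacity")
    if hours_to_forecast_breach is not None and hours_to_forecast_breach <= 4:
        signals.append("sla_breach_forecast_within_4h")
    return signals


print(fired_capacity_signals(p1_backlog=90, daily_capacity=100, hours_to_forecast_breach=3))
```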
7. Reviewer Workspace Requirements
7.1 Evidence Panel
Reviewer must not judge only from AI text. The workspace should show:
AI output or action request.
source documents and cited spans.
source owner, version, effective date.
model / prompt / retriever version.
customer / case context allowed for the role.
missing evidence and contradictory evidence.
policy and rubric excerpt.
risk flags and downstream impact.
previous human actions.
allowed decisions and authority limits.
7.2 UI Actions
| Action | Meaning | Required evidence |
| --- | --- | --- |
| Accept | AI output meets rubric and evidence standard | reason optional for low risk, required for high risk |
| Edit | reviewer corrects content before downstream use | edit diff and reason code |
| Reject | AI output should not be used | reason code and evidence gap |
| Override | reviewer changes AI recommendation or route | authority and rationale |
| Escalate | specialist or senior authority required | escalation trigger and target queue |
| Request evidence | case cannot be reviewed with current packet | missing evidence type |
| Stop route | AI route or tool should pause | severity and incident link |
| Adjudicate | resolve disagreement | final rationale and rubric interpretation |
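The "Required evidence" column can be enforced at submit time instead of being left to reviewer discipline. The following is a sketch under assumptions: the action keys, field names and payload shape are hypothetical, and a real workspace would map them to its own schema.

```python
# Required fields per UI action, transcribed from the table above.
# Field names are hypothetical identifiers chosen for this sketch.
REQUIRED_FIELDS = {
    "accept": set(),  # reason is added below for high-risk cases
    "edit": {"edit_diff", "reason_code"},
    "reject": {"reason_code", "evidence_gap"},
    "override": {"authority", "rationale"},
    "escalate": {"escalation_trigger", "target_queue"},
    "request_evidence": {"missing_evidence_type"},
    "stop_route": {"severity", "incident_link"},
    "adjudicate": {"final_rationale", "rubric_interpretation"},
}


def missing_fields(action, payload, high_risk=False):
    """Return the sorted list of required fields absent or empty in payload."""
    required = set(REQUIRED_FIELDS[action])
    if action == "accept" and high_risk:
        required.add("reason")
    return sorted(f for f in required if not payload.get(f))


print(missing_fields("reject", {"reason_code": "EVIDENCE_MISSING"}))
```

Blocking submission while this list is non-empty keeps free-text-only or evidence-free decisions out of the audit trail.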
7.3 Workspace Anti-Patterns
| Anti-pattern | Risk |
| --- | --- |
| Approve button is dominant and reject hidden | default acceptance |
| AI confidence shown without evidence | false precision |
| No policy effective date | stale policy risk |
| Free-text only reasons | analytics and audit failure |
| Reviewer cannot see downstream action | weak control over impact |
| Escalation creates email, not workflow case | lost handoff |
| Missing source access due to entitlement gaps | reviewer guesses |
8. Calibration Program
8.1 Calibration Lifecycle
define rubric
-> create gold and boundary cases
-> train reviewers
-> run blind calibration
-> score agreement and rationale
-> adjudicate disagreements
-> update rubric
-> certify reviewer skill
-> monitor production drift
8.2 Calibration Case Types
| Case type | Purpose |
| --- | --- |
| Clear pass | reviewer recognizes valid AI output |
| Clear fail | reviewer catches unsupported or prohibited output |
| Boundary case | tests policy nuance |
| Missing evidence | tests refusal and request-evidence behavior |
| Conflicting evidence | tests escalation judgment |
| Tool side effect | tests pre-action control |
| Customer vulnerability | tests harm-sensitive routing |
| Prompt injection | tests security-aware review |
| Stale policy | tests source freshness |
| Automation bias challenge | tests whether the reviewer catches AI output that sounds confident but is wrong |
8.3 Calibration Metrics
| Metric | Meaning |
| --- | --- |
| Gold accuracy | reviewer correctness on known cases |
| Inter-rater agreement | consistency across reviewers |
| Adjudication overturn rate | how often senior SME reverses initial review |
| Rationale completeness | reason references evidence and rubric |
| Escalation correctness | reviewer escalates cases that need specialist judgment |
| False accept rate | unsafe AI output accepted |
| False reject rate | valid AI output rejected |
| Boundary performance | performance on ambiguous but common cases |
| Drift by reviewer | quality trend over time |
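Several of these metrics reduce to simple counts over gold cases. The sketch below assumes binary "accept" / "reject" labels and assumes the false accept rate is measured over gold cases whose correct answer is reject (and the reverse for false reject rate); the playbook does not fix the denominators. Cohen's kappa is used here as one common inter-rater agreement statistic for two reviewers, which is a choice made for this example.

```python
from collections import Counter


def gold_metrics(gold, reviewed):
    """gold and reviewed are parallel lists of 'accept' / 'reject' labels."""
    pairs = list(zip(gold, reviewed))
    if not pairs:
        raise ValueError("need at least one gold case")
    should_reject = [r for g, r in pairs if g == "reject"]
    should_accept = [r for g, r in pairs if g == "accept"]
    return {
        "gold_accuracy": sum(g == r for g, r in pairs) / len(pairs),
        # Unsafe output accepted, over cases that should have been rejected.
        "false_accept_rate": (should_reject.count("accept") / len(should_reject)
                              if should_reject else 0.0),
        # Valid output rejected, over cases that should have been accepted.
        "false_reject_rate": (should_accept.count("reject") / len(should_accept)
                              if should_accept else 0.0),
    }


def cohens_kappa(a, b):
    """Chance-corrected agreement between two reviewers on the same cases."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label
    return (observed - expected) / (1 - expected)


gold = ["accept", "accept", "reject", "reject"]
reviewed = ["accept", "reject", "accept", "reject"]
print(gold_metrics(gold, reviewed))
print(cohens_kappa(gold, reviewed))
```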
8.4 Certification Levels
| Level | Scope | Requirements |
| --- | --- | --- |
| L1 reviewer | low / medium risk drafts and QA | training, gold accuracy, supervisor signoff |
| L2 specialist | domain-specific high-risk cases | domain certification, calibration pass, case experience |
| L3 senior adjudicator | disagreement and policy exception | SME authority, rubric ownership, governance reporting |
| Stop authority | route or tool pause | incident training, management delegation |
9. Reviewer Quality Metrics
9.1 Quality Dashboard
| Metric | Good use | Bad interpretation |
| --- | --- | --- |
| Review throughput | capacity and planning | equating speed with quality |
| Average handling time | complexity and fatigue signal | forcing shorter review for high-risk work |
| Override rate | trust calibration signal | assuming lower is always better |
| Meaningful edit rate | AI quality and reviewer value | punishing reviewers for edits |
| Reason quality score | audit readiness | treating generic reason as enough |
| Escalation precision | routing quality | ignoring escalation miss |
| Escalation miss rate | high-risk control effectiveness | measuring only escalated cases |
| Inter-rater agreement | rubric clarity | demanding perfect agreement on true ambiguity |
| Appeal upheld rate | customer harm signal | blaming reviewer without root cause |
| Audit replay success | evidence completeness | checking only sample presence |
9.2 Override Quality
| Override type | Quality question |
| --- | --- |
| Evidence override | Did reviewer identify missing or conflicting evidence? |
| Policy override | Did reviewer apply current policy correctly? |
| Risk override | Did reviewer detect high-risk condition AI missed? |
| Customer context override | Did reviewer account for vulnerability, hardship, complaint or exception? |
| Tool boundary override | Did reviewer prevent unauthorized or unsafe action? |
| Language override | Did reviewer reduce misleading, unfair or non-compliant wording? |
9.3 Reason Codes
| Code | Meaning |
| --- | --- |
| EVIDENCE_MISSING | Required evidence absent |
| EVIDENCE_CONFLICT | Sources conflict or do not support conclusion |
| STALE_SOURCE | Policy or source appears outdated |
| POLICY_MISMATCH | AI output conflicts with policy or procedure |
| RISK_ESCALATION | Case requires higher authority |
| CUSTOMER_CONTEXT | AI missed relevant customer context |
| TOOL_BOUNDARY | Requested action exceeds allowed tool boundary |
| LANGUAGE_RISK | Customer-visible wording creates compliance or experience risk |
| DATA_QUALITY | Input data is incomplete, noisy or inconsistent |
| SECURITY_SIGNAL | Prompt injection, access issue or suspicious content |
| RUBRIC_AMBIGUITY | Review standard needs clarification |
| CONTROLLED_ACCEPT | AI accepted with documented high-risk rationale |
10. Escalation And Override Governance
10.1 Authority Matrix
| Decision | Frontline reviewer | Specialist | Senior adjudicator | Risk / Compliance | AI platform owner |
| --- | --- | --- | --- | --- | --- |
| Edit customer draft | yes | yes | yes | consult | no |
| Reject AI draft | yes | yes | yes | consult | no |
| Approve money movement | within limit | within higher limit | exception | consult | no |
| Approve credit exception | no | underwriter authority | senior credit | consult | no |
| Close AML alert | no | analyst within policy | L2 / compliance | oversight | no |
| Stop AI route | raise | recommend | approve if delegated | approve | execute |
| Stop tool action | raise | recommend | approve if delegated | approve | execute |
| Change rubric | feedback | propose | approve draft | approve high-risk | implement workflow |
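The authority matrix is most useful when the workflow engine can query it, so that the UI only offers decisions a role may take. Below is a minimal lookup sketch; the role and decision keys are hypothetical identifiers, and the cell values are copied verbatim from the matrix. Interpreting conditional cells such as "within limit" still needs case data and is deliberately left out.

```python
ROLES = ("frontline", "specialist", "senior_adjudicator",
         "risk_compliance", "ai_platform_owner")

# One tuple per decision, in ROLES order.
STOP = ("raise", "recommend", "approve if delegated", "approve", "execute")
MATRIX = {
    "edit_customer_draft": ("yes", "yes", "yes", "consult", "no"),
    "reject_ai_draft": ("yes", "yes", "yes", "consult", "no"),
    "approve_money_movement": ("within limit", "within higher limit",
                               "exception", "consult", "no"),
    "approve_credit_exception": ("no", "underwriter authority",
                                 "senior credit", "consult", "no"),
    "close_aml_alert": ("no", "analyst within policy",
                        "L2 / compliance", "oversight", "no"),
    "stop_ai_route": STOP,
    "stop_tool_action": STOP,
    "change_rubric": ("feedback", "propose", "approve draft",
                      "approve high-risk", "implement workflow"),
}


def authority(decision, role):
    """Return the matrix cell for a decision and role."""
    return MATRIX[decision][ROLES.index(role)]


def may_decide_unconditionally(decision, role):
    """True only for an unqualified 'yes'; every other cell needs more checks."""
    return authority(decision, role) == "yes"


print(authority("close_aml_alert", "frontline"))
```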
10.2 Escalation Rules
| Trigger | Escalation target | Downstream state |
| --- | --- | --- |
| Evidence missing for high-risk decision | specialist queue | decision paused |
| Policy conflict | policy owner | output blocked |
| Customer complaint / legal threat | complaint specialist | AI response stopped |
| High amount or irreversible action | senior approver | tool action blocked |
| Reviewer disagreement | adjudicator | case held |
| Prompt injection or data leakage | security incident path | route paused |
| Backlog threatens SLA | operations lead | capacity surge or route throttle |
| Quality KRI red | governance owner | release freeze or safe-stop |
10.3 Override Governance Controls
High-risk override requires reason code, evidence reference and authority check.
Override can change workflow outcome only within reviewer authority.
Certain overrides require second approval or adjudication.
Override trends feed model, prompt, RAG, policy and training backlog.
Override cannot silently retrain model without usage tag and governance review.
Long-term override pattern may indicate AI quality issue, policy ambiguity or role mismatch.
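The first two controls can be checked mechanically before an override changes any workflow state. The sketch below reuses the reason codes from section 9.3; the `Override` record, the numeric reviewer levels and the violation messages are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ReasonCode(str, Enum):
    EVIDENCE_MISSING = "EVIDENCE_MISSING"
    EVIDENCE_CONFLICT = "EVIDENCE_CONFLICT"
    STALE_SOURCE = "STALE_SOURCE"
    POLICY_MISMATCH = "POLICY_MISMATCH"
    RISK_ESCALATION = "RISK_ESCALATION"
    CUSTOMER_CONTEXT = "CUSTOMER_CONTEXT"
    TOOL_BOUNDARY = "TOOL_BOUNDARY"
    LANGUAGE_RISK = "LANGUAGE_RISK"
    DATA_QUALITY = "DATA_QUALITY"
    SECURITY_SIGNAL = "SECURITY_SIGNAL"
    RUBRIC_AMBIGUITY = "RUBRIC_AMBIGUITY"
    CONTROLLED_ACCEPT = "CONTROLLED_ACCEPT"


@dataclass(frozen=True)
class Override:
    reviewer_level: int   # hypothetical encoding: 1 = L1, 2 = L2, 3 = L3
    required_level: int   # minimum level allowed to make this change
    high_risk: bool
    reason_code: Optional[ReasonCode] = None
    evidence_refs: Tuple[str, ...] = ()


def override_violations(o: Override) -> List[str]:
    """Return the governance problems that should block this override."""
    problems = []
    if o.reviewer_level < o.required_level:
        problems.append("outside reviewer authority")
    if o.high_risk and o.reason_code is None:
        problems.append("missing reason code")
    if o.high_risk and not o.evidence_refs:
        problems.append("missing evidence reference")
    return problems


print(override_violations(Override(reviewer_level=1, required_level=2, high_risk=True)))
```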
11. Sampling Design
11.1 Sampling Purposes
| Purpose | Sampling design |
| --- | --- |
| Production quality | random sentinel and risk-weighted sample |
| High-risk protection | exception and threshold-based sample |
| Model improvement | uncertainty, disagreement and drift sample |
| Fairness / segment coverage | stratified sample across protected or proxy-sensitive segments |
| Reviewer calibration | gold and boundary sample |
| Audit evidence | reproducible sample with trace retention |
| Incident response | expanded sample during exposure window |
11.2 Sample Policy Example
| Bucket | Share | Rationale |
| --- | --- | --- |
| P0 / P1 mandatory review | as needed | customer harm and regulatory exposure |
| Uncertainty and evidence weakness | 25% | catch model and RAG boundary failures |
| Random sentinel | 20% | detect blind spots |
| Segment / channel coverage | 20% | prevent quality blind spots |
| Complaint / appeal signals | 15% | customer harm feedback |
| Reviewer calibration | 10% | quality and drift |
| Drift / new cluster | 10% | taxonomy and policy learning |
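The percentage buckets sum to 100%, so a discretionary sampling budget can be split across them directly, with P0 / P1 mandatory review handled outside the budget. The sketch below assumes the budget is a case count per period and uses largest-remainder rounding so the bucket counts always add up to the budget; the bucket keys and the rounding rule are choices made for this example.

```python
from typing import Dict

# Shares in percent, from the table above (mandatory P0 / P1 review excluded).
SHARES = {
    "uncertainty_evidence_weakness": 25,
    "random_sentinel": 20,
    "segment_channel_coverage": 20,
    "complaint_appeal_signals": 15,
    "reviewer_calibration": 10,
    "drift_new_cluster": 10,
}


def allocate(budget: int) -> Dict[str, int]:
    """Split a discretionary sample budget across buckets by share."""
    if sum(SHARES.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {k: budget * v // 100 for k, v in SHARES.items()}
    remainders = {k: budget * v % 100 for k, v in SHARES.items()}
    leftover = budget - sum(counts.values())
    # Hand leftover cases to the buckets with the largest rounding remainder.
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts


print(allocate(200))
```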
11.3 Sampling Evidence
Each sampled case should record the sampling policy version, sampling reason, sampling probability, risk tier, segment tags, model / prompt / source version, review mode and final usage tag: QA, train, eval, audit, incident or exclude.
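One way to keep this evidence consistent is a typed record that rejects unknown usage tags at creation time. This is a sketch only: the field names follow the sentence above, while `case_id`, the class name and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

USAGE_TAGS = {"QA", "train", "eval", "audit", "incident", "exclude"}


@dataclass(frozen=True)
class SampledCase:
    case_id: str  # hypothetical identifier, not named in the text
    sampling_policy_version: str
    sampling_reason: str
    sampling_probability: float
    risk_tier: str
    segment_tags: Tuple[str, ...]
    model_version: str
    prompt_version: str
    source_version: str
    review_mode: str
    final_usage_tag: str

    def __post_init__(self):
        # A sampled case must have had a non-zero chance of selection.
        if not 0.0 < self.sampling_probability <= 1.0:
            raise ValueError("sampling_probability must be in (0, 1]")
        if self.final_usage_tag not in USAGE_TAGS:
            raise ValueError(f"unknown usage tag: {self.final_usage_tag}")
```

Recording the sampling probability alongside the reason is what later allows sampled quality rates to be reweighted to the full population.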
12. Surge Mode And Degraded Operations
12.1 Surge Triggers
| Trigger | Surge action |
| --- | --- |
| model drift increases review rate | activate reserve reviewers |
| RAG stale policy incident | stop free-form answers and review exposure window |
| fraud spike | route P0 / P1 to fraud surge pool |
| complaint volume surge | prioritize regulatory deadline queue |
| reviewer outage | reduce low-risk QA and preserve high-risk queue |
| evidence export failure | local evidence ledger and draft-only for Tier 1 |
| policy change | temporary double review for affected intents |
12.2 Surge Mode Playbook
1. Declare surge condition and scope.
2. Freeze low-value or discretionary review inflow.
3. Reclassify queues by customer harm and deadline.
4. Activate certified reserve reviewers.
5. Move senior reviewers to adjudication and P0/P1 only.
6. Switch low-risk customer-visible outputs to template or delayed response.
7. Increase random sentinel sample for affected workflow.
8. Track backlog, SLA, fatigue and quality every 2-4 hours.
9. Record residual risk and management decisions.
10. Exit surge only after backlog, quality and evidence gates pass.
12.3 Manual-First Mode
Manual-first is appropriate when AI can still retrieve or summarize but cannot safely recommend, decide or act.
Allowed: evidence retrieval, extractive summary, queue prioritization with human confirmation, approved templates.
Blocked: final recommendation framing, auto-send, state-changing tool calls, case closure, customer-specific promises.
12.4 Recovery Gate
Normal review operations resume when dependency health is stable, backlog is within tolerance, P0/P1 SLA breach risk is controlled, reviewer quality metrics are above threshold, local evidence is reconciled, sampled surge-period decisions pass QA, and business owner plus risk owner approve exit.
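Because the recovery gate is a conjunction of conditions, it can be expressed as a checklist that reports what is still blocking exit. The condition keys below are hypothetical names for the clauses in the sentence above; how each one is measured is left to the owning team.

```python
RECOVERY_CONDITIONS = (
    "dependency_health_stable",
    "backlog_within_tolerance",
    "p0_p1_sla_breach_risk_controlled",
    "reviewer_quality_above_threshold",
    "local_evidence_reconciled",
    "surge_period_sample_passed_qa",
    "business_owner_approved_exit",
    "risk_owner_approved_exit",
)


def blocking_conditions(status):
    """Return the conditions that are missing or false; empty means exit is allowed."""
    return [c for c in RECOVERY_CONDITIONS if not status.get(c, False)]


status = {c: True for c in RECOVERY_CONDITIONS}
status["local_evidence_reconciled"] = False
print(blocking_conditions(status))
```

Treating a missing key as a failed condition keeps the gate closed by default when a status feed is unavailable.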
13. Dashboards And KRIs
13.1 Operations Dashboard
| Metric | Purpose |
| --- | --- |
| queue depth by priority | workload and backlog |
| queue age by SLA | breach forecast |
| arrival rate vs completion rate | capacity balance |
| review rate by workflow | policy and model behavior |
| reviewer utilization | fatigue and staffing |
| escalation volume | risk pressure |
| adjudication backlog | senior bottleneck |
| surge reserve availability | resilience readiness |
13.2 Quality Dashboard
| Metric | Purpose |
| --- | --- |
| gold accuracy | reviewer correctness |
| inter-rater agreement | rubric consistency |
| false accept rate | unsafe AI accepted |
| false reject rate | valid AI blocked |
| reason quality | audit readiness |
| meaningful edit rate | reviewer value and AI quality |
| appeal upheld rate | customer harm |
| audit replay success | evidence completeness |
13.3 Risk KRIs
| KRI | Yellow | Red |
| --- | --- | --- |
| P1 SLA breach forecast | breach likely within day | breach active or unavoidable |
| Override rate collapse | below historical band | near zero with short review time |
| Escalation miss | increasing in QA | confirmed high-risk miss |
| Gold accuracy | below target | high-risk reviewer fails |
| Evidence completeness | minor field gaps | trace cannot be replayed |
| Reviewer utilization | >85% for 3 days | sustained >95% or after-hours spike |
| Appeal upheld | trend up | severe customer harm case |
| Adjudication overturn | trend up | rubric or training failure |
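The reviewer utilization KRI is the easiest row to automate. The sketch below makes two assumptions that the table does not state: "sustained" is read as the same 3-day window used for yellow, and the after-hours spike condition is not modeled. Utilization is expressed as a fraction between 0 and 1.

```python
def utilization_kri(daily_utilization, days=3):
    """Classify reviewer utilization as 'green', 'yellow' or 'red'.

    daily_utilization is ordered oldest to newest. Reading 'sustained' as
    the same window as yellow is an assumption of this sketch.
    """
    recent = daily_utilization[-days:]
    if len(recent) < days:
        return "green"  # not enough history to raise a KRI
    if all(u > 0.95 for u in recent):
        return "red"
    if all(u > 0.85 for u in recent):
        return "yellow"
    return "green"


print(utilization_kri([0.80, 0.90, 0.88, 0.91]))
```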
13.4 Executive View
Executive reporting should answer whether high-risk queues are within tolerance, review quality is stable, reviewers are overloaded, AI outputs require more override, escalation and stop authorities work, audit evidence is replayable, surge mode is tested, and residual risk has been accepted.
14. RACI
| Activity | Accountable | Responsible | Consulted | Informed |
| --- | --- | --- | --- | --- |
| Review strategy | Business Owner | AI PM | Risk, Compliance, Operations | Senior management |
| Queue taxonomy | Operations Owner | AI BA + Workflow Lead | Product, Architecture, QA | Reviewer teams |
| Review policy engine | AI Product Owner | Architect + Platform | Risk, Legal, Security | Operations |
| Capacity model | Operations Executive | Workforce Planning | PM, Finance, Risk | Queue owners |
| Skill routing | Operations Owner | Training Lead + Queue Manager | HR, Compliance | Reviewers |
| Calibration program | QA Lead | SME + Training Lead | Model Risk, Product | Governance forum |
| Evidence schema | Governance Evidence Owner | Platform Engineering | Audit, Privacy, Security | Business owner |
| Override governance | Risk Owner | Product + Operations | Compliance, Legal | Internal Audit |
| Surge mode | Operations Executive | Queue Managers | BCP, Platform, Risk | Senior management |
| Route stop | Business + Risk Owner | AI Platform Owner | Security, Compliance | All affected teams |
| Dashboard reporting | AI Governance Owner | Analytics + AI Ops | Operations, QA | Executive committee |
| Audit replay | Internal Audit | Evidence Owner | Product, Platform, Risk | Senior management |
15. Financial Retail Examples
15.1 AML Copilot
| Dimension | Design |
| --- | --- |
| AI role | summarize alert, transactions, KYC profile, typology and narrative draft |
| Review queue | high-risk typology, SAR draft, close-alert recommendation |
| Skill routing | AML analyst, senior investigator, compliance reviewer |
For payment dispute provisional credit,
when AI recommends an amount above the frontline threshold
or detects fraud, complaint or deadline risk,
the system must route the case to a payment dispute specialist
before any payment action is submitted,
showing transaction evidence, customer claim, rule deadline, policy version and AI rationale,
allowing approve, reject, edit amount, request evidence and escalate,
capturing reason code, reviewer identity, evidence references, timestamp and tool trace,
and enforcing a four-business-hour SLA with supervisor escalation on breach forecast.
16.3 Queue Configuration Card
| Field | Example |
| --- | --- |
| queue_id | payment_dispute_pre_action_p1 |
| owner | Payment Dispute Operations Lead |
| priority | P1 |
| review policy | 100% for amount above threshold, risk-based below threshold |
| skill requirement | dispute specialist certification |
| independence | reviewer cannot be original case handler for high amount |
| SLA | 4 business hours |
| OLA | evidence builder under 5 minutes, approver assignment under 15 minutes |
| escalation | supervisor queue after 75% SLA consumption |
| surge rule | activate reserve pool when backlog exceeds 80% daily capacity |
immutable trace, reason code, evidence reference and replay
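The escalation and surge rules on the configuration card are simple threshold checks once the card is machine-readable. The sketch below encodes only the numeric fields; the dictionary keys and function names are hypothetical, elapsed time is assumed to be already measured in business minutes, and "after 75%" is read as "at or beyond 75%".

```python
QUEUE_CARD = {
    "queue_id": "payment_dispute_pre_action_p1",
    "owner": "Payment Dispute Operations Lead",
    "priority": "P1",
    "sla_business_minutes": 240,                # 4 business hours
    "escalate_at_sla_pct": 75,                  # supervisor queue threshold
    "surge_backlog_pct_of_daily_capacity": 80,  # reserve pool threshold
}


def needs_supervisor_escalation(elapsed_business_minutes, card=QUEUE_CARD):
    """True once the case has consumed the configured share of its SLA."""
    consumed = elapsed_business_minutes * 100
    return consumed >= card["escalate_at_sla_pct"] * card["sla_business_minutes"]


def surge_reserve_needed(backlog, daily_capacity, card=QUEUE_CARD):
    """True when backlog exceeds the configured share of daily capacity."""
    return backlog * 100 > card["surge_backlog_pct_of_daily_capacity"] * daily_capacity


print(needs_supervisor_escalation(180))  # 180 of 240 minutes is exactly 75%
```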
Final principle:
Human review is effective only when the organization can prove
that the right human reviewed the right case,
with the right evidence and authority,
within the right time,
and that the review measurably improved control quality.