
AI Human Review Operations / Capacity Playbook

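Human review is not a decorative seat belt for an AI project. It is part of the production system and directly determines whether AI can operate safely under real business pressure.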


AI Human Review Operations / Capacity Architecture Playbook

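Positioning: For senior AI PMs, AI BAs, AI Architects, Enterprise Architects, Operations Leads, Workforce Planning, Model Risk, Compliance and Internal Audit. The goal is to upgrade AI human review from "someone reviews it" into a production control system that is operable, scalable, calibratable, auditable and recoverable.

Scope: AML Copilot, KYC Review Assistant, Credit Underwriting Copilot, Payment Dispute Assistant, Complaint Response Agent, Customer Service RAG, Fraud Alert Triage, Collections Copilot, internal knowledge assistants for retail financial services, and agentic workflows.

Important note: This document is learning, portfolio and internal governance training material. It is not legal advice, a compliance conclusion, an audit opinion, a model validation report or a regulatory interpretation. Formal projects must be confirmed by Legal, Compliance, Risk, Model Risk, Internal Audit, Security, Privacy, Business Owner, Operations, Workforce Planning and senior management, in light of institution type, jurisdiction, business purpose, customer impact and internal policy.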


1. Executive Framing

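Human review is not a decorative seat belt for an AI project. It is part of the production system and directly determines whether AI can operate safely under real business pressure. Many organizations already know they need HITL, but the failures happen at a more operational level:

  • Cases that need review all land in the same large queue.
  • Reviewers lack the necessary skills or authority.
  • Review volume exceeds capacity, so reviewers can only skim.
  • Training covers only system operation, not boundary judgment.
  • Overrides have no reason code or evidence reference.
  • Escalations have no recipient, time limit or pause rule.
  • Management looks only at throughput, not at quality or fatigue.
  • Audit cannot replay the AI output, evidence, human judgment and downstream action.

The core judgment of this playbook: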
Human review is production capacity.
It must be routed, staffed, calibrated, measured, governed and evidenced.
Otherwise it becomes a bottleneck or control theater.

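1.1 How This Differs From Generic HITL

| Generic HITL | Human Review Operations / Capacity Architecture |
| --- | --- |
| Asks whether a human is involved | Asks who makes which judgment, under what conditions and with what evidence |
| Emphasizes approval points | Emphasizes queues, SLA, skills, capacity, calibration and escalation |
| Usually a node in a flowchart | An operating model, workflow engine and evidence system |
| Assumes humans can absorb risk | States explicitly that human judgment is a scarce resource |
| Keeps only accept / reject | Keeps edit, override, escalate, stop, adjudicate and feedback |
| Measures review count | Measures review quality, agreement, miss rate, fatigue and audit replay |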

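1.2 Current Nuance

Human review can reduce AI risk, but it can also create new risk:

  • When capacity is insufficient, it becomes a bottleneck.
  • When evidence is insufficient, it becomes a rubber stamp.
  • When reviewers are not independent, it amplifies automation bias.
  • When escalation authority is unclear, it cannot stop a wrong action.
  • When audit evidence is incomplete, it cannot prove the control is effective.
  • When calibration is missing, it dresses up individual subjective judgment as governance.

A senior design treats human review as a controlled production capability, not as the optimistic assumption that "a human will catch it".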

2. Source Anchors

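The sources below establish the language for governance, human-centered AI, management systems and business continuity. This document turns them into product, process, architecture, operations and evidence design requirements.

| Anchor | Official link | How this document uses it |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize human review governance, context, measurement, response and continuous improvement |
| NIST Human-Centered AI | https://www.nist.gov/programs-projects/human-centered-ai | Uses the human-centered AI, AI user trust, workplace GenAI and human-task perspectives to design the reviewer work system |
| NIST AI Use Taxonomy PDF | https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.200-1.pdf | Uses the taxonomy of AI contribution to human tasks to break down the review unit, human activity and measurement need |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Uses AI management system language to connect responsibility, competence, operational control, performance evaluation, management review and improvement |
| FFIEC Business Continuity Management booklet | https://ithandbook.ffiec.gov/it-booklets/business-continuity-management.aspx | Uses critical operations, dependencies, training, testing, exercises and board / senior management oversight to design review surge, manual fallback and queue resilience |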

2.1 Governance Mapping

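| Framework | Human review operations question | Evidence |
| --- | --- | --- |
| NIST Govern | Who owns review policy, capacity, quality, training, exceptions and risk acceptance | RACI, review standard, management report |
| NIST Map | Which AI use cases, tasks, customer impacts and human judgment points require review | use case map, risk tier, queue taxonomy |
| NIST Measure | How to measure whether reviewers find errors, reduce harm and stay consistent | calibration report, QA sample, agreement metric |
| NIST Manage | How to respond when review fails, backlog builds, quality declines or a high-risk issue is found | escalation runbook, surge mode, route stop |
| ISO/IEC 42001 | How to bring review into the AI management system | policy, competence, operational control, performance review |
| FFIEC BCM | Whether the review queue can support critical operations under stress | BIA, dependency map, exercise report, management oversight |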

3. Operating Model Overview

3.1 End-To-End Flow

AI output or action candidate
  -> risk and impact classification
  -> review policy decision
  -> queue creation
  -> skill / risk / independence routing
  -> reviewer workspace
  -> human action
  -> second review or adjudication if needed
  -> downstream workflow
  -> evidence ledger
  -> QA, calibration and dashboard
  -> model / policy / process improvement

3.2 Review Policy Inputs

The review policy engine should be based on:

  • risk tier.
  • customer impact.
  • financial impact.
  • regulatory impact.
  • reversibility.
  • model confidence and evidence quality.
  • source freshness.
  • tool side effect.
  • reviewer capacity.
  • incident state.
  • customer vulnerability or complaint signal.

3.3 Operating Principles

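| Principle | Explanation |
| --- | --- |
| Risk first | High-risk, high-impact, irreversible and regulatory-sensitive cases come first |
| Capacity aware | Queue strategy must know the available reviewer hours |
| Skill routed | Different businesses and risks need different reviewer pools |
| Evidence based | Reviewers must see sufficient evidence, not just the AI conclusion |
| Independently challengeable | Critical reviews must not be held hostage by efficiency metrics and AI defaults |
| Calibrated | Reviewer judgment on boundary cases must be trainable and measurable |
| Auditable | After an incident, inputs, outputs, evidence, human actions and downstream impact can be replayed |
| Recoverable | When a review queue fails, surge, manual-first, safe-stop and a recovery gate are available |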

4. Queue Taxonomy

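4.1 By Review Purpose

| Queue type | Purpose | Example |
| --- | --- | --- |
| Pre-decision review | Review before an AI recommendation enters a business decision | credit memo, KYC risk recommendation |
| Pre-action review | Review before AI calls a tool or changes state | provisional credit, refund, account freeze |
| Customer-visible draft review | Review before customer communication is sent | complaint response, fee explanation, dispute update |
| Exception review | Low confidence, conflicting evidence or policy exceptions go to review | RAG unsupported claim, policy conflict |
| Post-decision QA | After-the-fact sampling to verify automated or human-assisted results | customer service answer, document classification |
| Appeal / complaint review | Review triggered by a customer objection or complaint | wrong denial, misleading answer |
| Calibration review | Review of gold cases or challenge cases | reviewer training and drift |
| Incident review | Expanded review during a risk event | model drift, RAG stale, policy outage |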

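4.2 By Coverage Strategy

| Coverage | Applicable scenario | Risk |
| --- | --- | --- |
| 100% review | High impact, irreversible, regulatory-sensitive, major customer rights | High cost, fatigue, backlog |
| Risk-based review | A reliable risk score, confidence and evidence signal exist | A missed risk signal leads to underestimated risk |
| Stratified sampling | Quality must be monitored by segment, channel, product and language | Poor sample design can mask problems in minority groups |
| Exception-only review | Low confidence, missing evidence, policy conflict, customer complaint | Systematic bias in normal samples may be missed |
| Shadow review | pilot or model comparison | Cannot replace formal production control |
| Random sentinel | Low-rate random sentinel sample | Cost is controllable; used to detect blind spots |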

4.3 Queue Priority Rules

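| Priority | Criteria | Action |
| --- | --- | --- |
| P0 | Customer funds, account status, legal threat, regulatory deadline, privacy exposure, irreversible action | stop downstream action until reviewed |
| P1 | High-risk customer, complaint, credit, AML/KYC, fraud, payment dispute | priority SLA and skilled reviewer |
| P2 | Customer-visible but correctable answer or draft | standard review or risk-based sample |
| P3 | internal productivity output, low customer impact | periodic QA and feedback |
| P4 | training, taxonomy and improvement samples | scheduled calibration cycle |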
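
The priority rules above can be sketched as a small classifier. This is a minimal illustration, not a reference implementation: the `CaseSignals` field names are hypothetical, and treating an unflagged case as P4 is an assumption. A real policy engine would derive these flags from the inputs listed in 3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseSignals:
    """Hypothetical signal flags; field names are assumptions for illustration."""
    customer_funds_or_account_state: bool = False
    legal_threat: bool = False
    regulatory_deadline: bool = False
    privacy_exposure: bool = False
    irreversible_action: bool = False
    high_risk_domain: bool = False       # high-risk customer, complaint, credit, AML/KYC, fraud, dispute
    customer_visible: bool = False       # customer-visible but correctable
    internal_productivity: bool = False  # internal output, low customer impact


# Actions copied from the priority table.
ACTIONS = {
    "P0": "stop downstream action until reviewed",
    "P1": "priority SLA and skilled reviewer",
    "P2": "standard review or risk-based sample",
    "P3": "periodic QA and feedback",
    "P4": "scheduled calibration cycle",
}


def classify_priority(s: CaseSignals) -> str:
    """Return the highest priority whose criteria match; P0 is checked first."""
    if (s.customer_funds_or_account_state or s.legal_threat or s.regulatory_deadline
            or s.privacy_exposure or s.irreversible_action):
        return "P0"
    if s.high_risk_domain:
        return "P1"
    if s.customer_visible:
        return "P2"
    if s.internal_productivity:
        return "P3"
    # Assumption: anything unflagged is treated as a training / improvement sample.
    return "P4"


case = CaseSignals(legal_threat=True, customer_visible=True)
priority = classify_priority(case)
print(priority, "->", ACTIONS[priority])
```

Checking P0 first means a customer-visible draft that also carries a legal threat is never downgraded to P2.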

4.4 Time Sensitivity

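| Time band | Example | Queue rule |
| --- | --- | --- |
| Immediate | Tool action, customer in a live session, fraud blocking | synchronous or near-real-time review |
| Same day | Complaint acknowledgement, payment dispute next step, KYC document follow-up | same-day SLA and supervisor watch |
| Regulatory deadline | complaint, AML / SAR, dispute rules | deadline priority and breach escalation |
| Batch QA | sampled review, training data, calibration | periodic processing |
| Governance review | trends, audit, management reporting | monthly / quarterly cadence |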

5. Skill And Risk Routing

5.1 Reviewer Skill Dimensions

| Skill dimension | Example |
| --- | --- |
| Domain | AML, KYC, credit, fraud, complaints, payments, wealth, servicing |
| Product | credit card, mortgage, deposit, small business, investment, BNPL |
| Risk | high-risk customer, vulnerable customer, PEP, sanctions, fair lending |
| Channel | branch, contact center, mobile app, chat, email, back office |
| Language | English, Spanish, Chinese, bilingual customer communication |
| Authority | frontline review, senior approval, compliance escalation, route stop |
| AI literacy | hallucination, RAG evidence, automation bias, tool misuse, prompt injection |
| Evidence handling | PII, restricted documents, legal hold, audit packet |

5.2 Routing Matrix

| Trigger | Reviewer pool | Review mode |
| --- | --- | --- |
| Payment dispute amount above threshold | Payment dispute specialist | pre-action review |
| Fraud signal and customer complaint | Fraud senior + complaint specialist | dual review |
| Credit adverse action draft | Underwriter + compliance-trained reviewer | pre-decision review |
| AML high-risk typology | Senior AML investigator | mandatory review |
| KYC PEP or sanctions near match | EDD specialist | 100% review |
| RAG policy conflict | Policy SME | exception review |
| Customer vulnerability | Vulnerability-trained servicing lead | escalation review |
| Tool action request with side effect | Authorized approver | approval with trace |
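
A routing matrix like this can be held as data rather than scattered through workflow code. The sketch below is illustrative only: the trigger keys, pool identifiers and the `route` function are hypothetical names, and raising an error for an unknown trigger is an assumed design choice, so that an unmapped case is never silently sent to a default pool.

```python
# Trigger -> (reviewer pools, review mode), transcribed from the routing matrix.
# Keys and pool identifiers are hypothetical names chosen for this sketch.
ROUTING_MATRIX = {
    "payment_dispute_above_threshold": (["payment_dispute_specialist"], "pre-action review"),
    "fraud_signal_and_complaint": (["fraud_senior", "complaint_specialist"], "dual review"),
    "credit_adverse_action_draft": (["underwriter", "compliance_trained_reviewer"], "pre-decision review"),
    "aml_high_risk_typology": (["senior_aml_investigator"], "mandatory review"),
    "kyc_pep_or_sanctions_near_match": (["edd_specialist"], "100% review"),
    "rag_policy_conflict": (["policy_sme"], "exception review"),
    "customer_vulnerability": (["vulnerability_trained_servicing_lead"], "escalation review"),
    "tool_action_with_side_effect": (["authorized_approver"], "approval with trace"),
}


def route(case_id: str, trigger: str) -> dict:
    """Return a routing decision that also records why the case was routed."""
    if trigger not in ROUTING_MATRIX:
        # Assumption: unknown triggers fail loudly instead of falling into a default queue.
        raise ValueError(f"no routing rule for trigger {trigger!r}")
    pools, mode = ROUTING_MATRIX[trigger]
    return {
        "case_id": case_id,
        "reviewer_pools": list(pools),
        "review_mode": mode,
        "routing_reason": f"trigger={trigger}",
    }


decision = route("case-001", "fraud_signal_and_complaint")
print(decision)
```

Keeping `routing_reason` on the decision gives audit replay a record of why a case reached a given pool.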

5.3 Conflict-Of-Interest Controls

| Conflict | Control |
| --- | --- |
| Same employee created AI prompt and approves output | separate reviewer role |
| Reviewer measured only on throughput | balanced quality and risk metrics |
| Sales owner reviews credit exception | independent credit approval |
| Agent reviews own customer communication QA | second-line QA sample |
| Model team adjudicates labels used to prove model quality | independent model risk or QA challenge |
| Vendor reviews its own production failures | internal acceptance and audit sample |

5.4 Independence Modes

| Mode | Usage |
| --- | --- |
| Standard review | reviewer sees AI output and evidence |
| Blind initial review | reviewer first judges evidence without seeing AI recommendation |
| Delayed reveal | reviewer submits initial label, then compares AI suggestion |
| Double review | two reviewers independently review high-risk or ambiguous cases |
| Adjudication | senior SME resolves disagreement |
| Second-line QA | independent QA samples first-line review |
| Audit replay | internal audit or model risk reconstructs case from evidence |

6. Capacity Model

6.1 Basic Formula

reviewed_cases = total_cases * review_rate
adjusted_minutes =
  reviewed_cases
  * average_handling_minutes
  * complexity_multiplier
  * (1 + double_review_rate)
  * (1 + rework_rate)
required_fte =
  adjusted_minutes / 60 / productive_hours_per_reviewer

6.2 Required Inputs

| Input | Why it matters |
| --- | --- |
| total case volume | case arrival baseline |
| review rate | percentage routed to human review |
| average handling time | core workload driver |
| risk mix | high-risk case takes longer |
| double review rate | senior capacity requirement |
| rework rate | quality issue and unclear rubric cost |
| escalation rate | downstream specialist load |
| productive hours | meetings, breaks, training and fatigue reduce usable time |
| SLA | determines concurrency and queue size tolerance |
| arrival pattern | peaks require staffing more than daily average |
| seasonality | fraud, tax, shopping, disaster and policy events |
| incident reserve | capacity for model drift or outage surge |

6.3 Worked Example

Daily AI-assisted cases: 8,000
Risk-based review rate: 16%
Average handling time: 5.5 minutes
Complexity multiplier: 1.20
Double review rate: 10%
Rework rate: 7%
Productive reviewer hours: 5.75

reviewed_cases = 8000 * 0.16 = 1280
adjusted_minutes = 1280 * 5.5 * 1.20 * 1.10 * 1.07 = 9943.30
required_fte = 9943.30 / 60 / 5.75 = 28.82

Operational reading:

  • A team of 20 reviewers cannot sustain this control without backlog or quality decay.
  • If incident mode raises review rate to 25%, required FTE becomes roughly 45.
  • PM ROI must include reviewers, senior adjudicators, QA, training and workforce planning.
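
The formula in 6.1 and the scenario in 6.3 can be reproduced with a short function, which also makes sensitivity runs easy. This is a sketch under the stated inputs: the function and parameter names are illustrative, and FTE is counted per day of productive reviewer hours.

```python
def required_fte(
    total_cases: float,
    review_rate: float,
    average_handling_minutes: float,
    complexity_multiplier: float,
    double_review_rate: float,
    rework_rate: float,
    productive_hours_per_reviewer: float,
) -> float:
    """Apply the basic capacity formula from section 6.1."""
    reviewed_cases = total_cases * review_rate
    adjusted_minutes = (
        reviewed_cases
        * average_handling_minutes
        * complexity_multiplier
        * (1 + double_review_rate)
        * (1 + rework_rate)
    )
    return adjusted_minutes / 60 / productive_hours_per_reviewer


# Inputs from the worked example in 6.3.
baseline = dict(
    total_cases=8000,
    review_rate=0.16,
    average_handling_minutes=5.5,
    complexity_multiplier=1.20,
    double_review_rate=0.10,
    rework_rate=0.07,
    productive_hours_per_reviewer=5.75,
)

baseline_fte = required_fte(**baseline)
# Incident scenario: only the review rate changes, to 25%.
incident_fte = required_fte(**{**baseline, "review_rate": 0.25})

print(f"baseline FTE: {baseline_fte:.1f}")
print(f"incident FTE: {incident_fte:.1f}")
```

With these inputs the baseline comes out at about 28.8 FTE and the 25% incident scenario at about 45 FTE. Because FTE scales linearly with review rate, any incident that raises the rate is an immediate staffing question.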

6.4 SLA / OLA Model

| Queue | SLA | OLA dependency |
| --- | --- | --- |
| P0 tool action | 15 minutes | approver on duty, evidence complete |
| P1 complaint / dispute | 4 business hours | specialist availability |
| P1 credit / AML | same day or regulated deadline | senior reviewer and policy SME |
| P2 customer-visible draft | 1 business day | frontline QA pool |
| P3 batch QA | 5 business days | QA analyst capacity |
| Calibration | monthly cycle | gold set and trainer availability |

SLA is customer or business-facing. OLA is the internal promise between queue owner, reviewer pool, policy SME, platform and escalation team.

6.5 Fatigue And Quality Decay

| Fatigue signal | Detection | Control |
| --- | --- | --- |
| review time drops sharply | duration analytics | micro-break and supervisor check |
| override rate collapses | trend dashboard | calibration and blind QA |
| reason text becomes generic | reason quality score | structured reasons and coaching |
| disagreement increases | inter-rater metric | rubric refresh |
| after-hours backlog work rises | workforce report | staffing adjustment |
| gold case failures rise | gold injection | pause high-risk assignment |

6.6 Capacity Thresholds

| Signal | Yellow action | Red action |
| --- | --- | --- |
| P1 backlog > 80% of daily capacity | activate reserve reviewers | stop low-value review inflow |
| SLA breach forecast within 4 hours | supervisor triage | senior management risk acceptance |
| reviewer utilization > 85% for 3 days | reduce discretionary QA | surge staffing or route throttling |
| gold accuracy below threshold | coaching | remove reviewer from high-risk queue |
| escalation queue aged beyond OLA | backup owner | incident bridge |

7. Reviewer Workspace Requirements

7.1 Evidence Panel

Reviewer must not judge only from AI text. The workspace should show:

  • AI output or action request.
  • source documents and cited spans.
  • source owner, version, effective date.
  • model / prompt / retriever version.
  • customer / case context allowed for the role.
  • missing evidence and contradictory evidence.
  • policy and rubric excerpt.
  • risk flags and downstream impact.
  • previous human actions.
  • allowed decisions and authority limits.

7.2 UI Actions

| Action | Meaning | Required evidence |
| --- | --- | --- |
| Accept | AI output meets rubric and evidence standard | reason optional for low risk, required for high risk |
| Edit | reviewer corrects content before downstream use | edit diff and reason code |
| Reject | AI output should not be used | reason code and evidence gap |
| Override | reviewer changes AI recommendation or route | authority and rationale |
| Escalate | specialist or senior authority required | escalation trigger and target queue |
| Request evidence | case cannot be reviewed with current packet | missing evidence type |
| Stop route | AI route or tool should pause | severity and incident link |
| Adjudicate | resolve disagreement | final rationale and rubric interpretation |

7.3 Workspace Anti-Patterns

| Anti-pattern | Risk |
| --- | --- |
| Approve button is dominant and reject hidden | default acceptance |
| AI confidence shown without evidence | false precision |
| No policy effective date | stale policy risk |
| Free-text only reasons | analytics and audit failure |
| Reviewer cannot see downstream action | weak control over impact |
| Escalation creates email, not workflow case | lost handoff |
| Missing source access due to entitlement | reviewer guesses |

8. Calibration Program

8.1 Calibration Lifecycle

define rubric
  -> create gold and boundary cases
  -> train reviewers
  -> run blind calibration
  -> score agreement and rationale
  -> adjudicate disagreements
  -> update rubric
  -> certify reviewer skill
  -> monitor production drift

8.2 Calibration Case Types

| Case type | Purpose |
| --- | --- |
| Clear pass | reviewer recognizes valid AI output |
| Clear fail | reviewer catches unsupported or prohibited output |
| Boundary case | tests policy nuance |
| Missing evidence | tests refusal and request-evidence behavior |
| Conflict evidence | tests escalation judgment |
| Tool side effect | tests pre-action control |
| Customer vulnerability | tests harm-sensitive routing |
| Prompt injection | tests security-aware review |
| Stale policy | tests source freshness |
| Automation bias challenge | AI sounds confident but is wrong |

8.3 Calibration Metrics

| Metric | Meaning |
| --- | --- |
| Gold accuracy | reviewer correctness on known cases |
| Inter-rater agreement | consistency across reviewers |
| Adjudication overturn rate | how often senior SME reverses initial review |
| Rationale completeness | reason references evidence and rubric |
| Escalation correctness | reviewer escalates cases that need specialist judgment |
| False accept rate | unsafe AI output accepted |
| False reject rate | valid AI output rejected |
| Boundary performance | performance on ambiguous but common cases |
| Drift by reviewer | quality trend over time |
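
Several of these metrics can be computed directly from labeled calibration results. The sketch below makes some assumptions not stated in the playbook: labels are reduced to `"accept"` / `"reject"`, false accept rate is measured over gold cases whose correct label is `"reject"`, false reject rate over gold `"accept"` cases, and inter-rater agreement is measured with Cohen's kappa for two reviewers, which is one common choice among several.

```python
from collections import Counter


def gold_metrics(gold: list[str], reviewed: list[str]) -> dict[str, float]:
    """Gold accuracy plus false accept / false reject rates for one reviewer."""
    pairs = list(zip(gold, reviewed))
    on_unsafe = [r for g, r in pairs if g == "reject"]  # reviewer labels on unsafe gold cases
    on_valid = [r for g, r in pairs if g == "accept"]   # reviewer labels on valid gold cases
    return {
        "gold_accuracy": sum(g == r for g, r in pairs) / len(pairs),
        "false_accept_rate": on_unsafe.count("accept") / len(on_unsafe) if on_unsafe else 0.0,
        "false_reject_rate": on_valid.count("reject") / len(on_valid) if on_valid else 0.0,
    }


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two reviewers on the same cases."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


gold = ["accept", "accept", "reject", "reject"]
reviewer_1 = ["accept", "reject", "accept", "reject"]
reviewer_2 = ["accept", "accept", "reject", "reject"]

print(gold_metrics(gold, reviewer_1))
print(cohens_kappa(reviewer_1, reviewer_2))
```

Kappa is used here instead of raw agreement because two reviewers who accept almost everything will agree often by chance, which is the automation-bias pattern calibration is meant to surface.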

8.4 Certification Levels

| Level | Scope | Requirements |
| --- | --- | --- |
| L1 reviewer | low / medium risk drafts and QA | training, gold accuracy, supervisor signoff |
| L2 specialist | domain-specific high-risk cases | domain certification, calibration pass, case experience |
| L3 senior adjudicator | disagreement and policy exception | SME authority, rubric ownership, governance reporting |
| Stop authority | route or tool pause | incident training, management delegation |

9. Reviewer Quality Metrics

9.1 Quality Dashboard

| Metric | Good use | Bad interpretation |
| --- | --- | --- |
| Review throughput | capacity and planning | equating speed with quality |
| Average handling time | complexity and fatigue signal | forcing shorter review for high-risk work |
| Override rate | trust calibration signal | assuming lower is always better |
| Meaningful edit rate | AI quality and reviewer value | punishing reviewers for edits |
| Reason quality score | audit readiness | treating generic reason as enough |
| Escalation precision | routing quality | ignoring escalation miss |
| Escalation miss rate | high-risk control effectiveness | measuring only escalated cases |
| Inter-rater agreement | rubric clarity | demanding perfect agreement on true ambiguity |
| Appeal upheld rate | customer harm signal | blaming reviewer without root cause |
| Audit replay success | evidence completeness | checking only sample presence |

9.2 Override Quality

| Override type | Quality question |
| --- | --- |
| Evidence override | Did reviewer identify missing or conflicting evidence? |
| Policy override | Did reviewer apply current policy correctly? |
| Risk override | Did reviewer detect high-risk condition AI missed? |
| Customer context override | Did reviewer account for vulnerability, hardship, complaint or exception? |
| Tool boundary override | Did reviewer prevent unauthorized or unsafe action? |
| Language override | Did reviewer reduce misleading, unfair or non-compliant wording? |

9.3 Reason Codes

| Code | Meaning |
| --- | --- |
| EVIDENCE_MISSING | Required evidence absent |
| EVIDENCE_CONFLICT | Sources conflict or do not support conclusion |
| STALE_SOURCE | Policy or source appears outdated |
| POLICY_MISMATCH | AI output conflicts with policy or procedure |
| RISK_ESCALATION | Case requires higher authority |
| CUSTOMER_CONTEXT | AI missed relevant customer context |
| TOOL_BOUNDARY | Requested action exceeds allowed tool boundary |
| LANGUAGE_RISK | Customer-visible wording creates compliance or experience risk |
| DATA_QUALITY | Input data is incomplete, noisy or inconsistent |
| SECURITY_SIGNAL | Prompt injection, access issue or suspicious content |
| RUBRIC_AMBIGUITY | Review standard needs clarification |
| CONTROLLED_ACCEPT | AI accepted with documented high-risk rationale |

10. Escalation And Override Governance

10.1 Authority Matrix

| Decision | Frontline reviewer | Specialist | Senior adjudicator | Risk / Compliance | AI platform owner |
| --- | --- | --- | --- | --- | --- |
| Edit customer draft | yes | yes | yes | consult | no |
| Reject AI draft | yes | yes | yes | consult | no |
| Approve money movement | within limit | within higher limit | exception | consult | no |
| Approve credit exception | no | underwriter authority | senior credit | consult | no |
| Close AML alert | no | analyst within policy | L2 / compliance | oversight | no |
| Stop AI route | raise | recommend | approve if delegated | approve | execute |
| Stop tool action | raise | recommend | approve if delegated | approve | execute |
| Change rubric | feedback | propose | approve draft | approve high-risk | implement workflow |

10.2 Escalation Rules

| Trigger | Escalation target | Downstream state |
| --- | --- | --- |
| Evidence missing for high-risk decision | specialist queue | decision paused |
| Policy conflict | policy owner | output blocked |
| Customer complaint / legal threat | complaint specialist | AI response stopped |
| High amount or irreversible action | senior approver | tool action blocked |
| Reviewer disagreement | adjudicator | case held |
| Prompt injection or data leakage | security incident path | route paused |
| Backlog threatens SLA | operations lead | capacity surge or route throttle |
| Quality KRI red | governance owner | release freeze or safe-stop |

10.3 Override Governance Controls

  • High-risk override requires reason code, evidence reference and authority check.
  • Override can change workflow outcome only within reviewer authority.
  • Certain overrides require second approval or adjudication.
  • Override trends feed model, prompt, RAG, policy and training backlog.
  • Override cannot silently retrain model without usage tag and governance review.
  • Long-term override pattern may indicate AI quality issue, policy ambiguity or role mismatch.
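
The first two controls can be enforced at submit time in the reviewer workspace. The sketch below is a hedged illustration: the override record shape, the `amount_limit_by_role` mapping and its figures are hypothetical, and a real authority check would read limits from the authority matrix in 10.1, not from a literal dictionary.

```python
# Reason codes from section 9.3.
REASON_CODES = {
    "EVIDENCE_MISSING", "EVIDENCE_CONFLICT", "STALE_SOURCE", "POLICY_MISMATCH",
    "RISK_ESCALATION", "CUSTOMER_CONTEXT", "TOOL_BOUNDARY", "LANGUAGE_RISK",
    "DATA_QUALITY", "SECURITY_SIGNAL", "RUBRIC_AMBIGUITY", "CONTROLLED_ACCEPT",
}


def validate_high_risk_override(override: dict, amount_limit_by_role: dict[str, float]) -> list[str]:
    """Return the list of problems that should block a high-risk override."""
    problems = []
    if override.get("reason_code") not in REASON_CODES:
        problems.append("missing or unknown reason code")
    if not override.get("evidence_refs"):
        problems.append("missing evidence reference")
    # Assumption: a role without a configured limit has no authority (limit 0).
    limit = amount_limit_by_role.get(override.get("reviewer_role"), 0)
    if override.get("amount", 0) > limit:
        problems.append("outside reviewer authority: escalate or request second approval")
    return problems


# Hypothetical limits, for illustration only.
limits = {"frontline": 500.0, "specialist": 5000.0}

ok = validate_high_risk_override(
    {"reason_code": "EVIDENCE_CONFLICT", "evidence_refs": ["doc-12#p3"],
     "reviewer_role": "specialist", "amount": 1200.0},
    limits,
)
blocked = validate_high_risk_override(
    {"reason_code": "", "evidence_refs": [], "reviewer_role": "frontline", "amount": 1200.0},
    limits,
)
print(ok)
print(blocked)
```

Returning every problem at once, instead of failing on the first, lets the workspace show the reviewer all missing pieces in a single pass.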

11. Sampling Design

11.1 Sampling Purposes

| Purpose | Sampling design |
| --- | --- |
| Production quality | random sentinel and risk-weighted sample |
| High-risk protection | exception and threshold-based sample |
| Model improvement | uncertainty, disagreement and drift sample |
| Fairness / segment coverage | stratified sample across protected or proxy-sensitive segments |
| Reviewer calibration | gold and boundary sample |
| Audit evidence | reproducible sample with trace retention |
| Incident response | expanded sample during exposure window |

11.2 Sample Policy Example

| Bucket | Share | Rationale |
| --- | --- | --- |
| P0 / P1 mandatory review | as needed | customer harm and regulatory exposure |
| Uncertainty and evidence weakness | 25% | catch model and RAG boundary failures |
| Random sentinel | 20% | detect blind spots |
| Segment / channel coverage | 20% | prevent quality blind spots |
| Complaint / appeal signals | 15% | customer harm feedback |
| Reviewer calibration | 10% | quality and drift |
| Drift / new cluster | 10% | taxonomy and policy learning |
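
The percentage buckets above sum to 100%, so they can be read as a split of the discretionary sampling budget that remains after P0 / P1 mandatory review is served. That reading is an assumption. The sketch below allocates an integer budget with the largest-remainder method so the per-bucket counts always add up to the budget; the bucket keys and function name are illustrative.

```python
# Discretionary shares in whole percent, from the sample policy table.
SAMPLE_SHARES_PCT = {
    "uncertainty_and_evidence_weakness": 25,
    "random_sentinel": 20,
    "segment_channel_coverage": 20,
    "complaint_appeal_signals": 15,
    "reviewer_calibration": 10,
    "drift_new_cluster": 10,
}


def allocate_sample(budget: int) -> dict[str, int]:
    """Split a discretionary review budget across buckets (largest-remainder rounding)."""
    if sum(SAMPLE_SHARES_PCT.values()) != 100:
        raise ValueError("bucket shares must sum to 100%")
    # Integer arithmetic avoids floating-point rounding surprises.
    allocation = {b: budget * pct // 100 for b, pct in SAMPLE_SHARES_PCT.items()}
    leftover = budget - sum(allocation.values())
    by_remainder = sorted(
        SAMPLE_SHARES_PCT,
        key=lambda b: (budget * SAMPLE_SHARES_PCT[b]) % 100,
        reverse=True,
    )
    for bucket in by_remainder[:leftover]:
        allocation[bucket] += 1
    return allocation


print(allocate_sample(200))
```

Each sampled case would still need to record the bucket it came from, since the sampling reason and probability are part of the evidence described in 11.3.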

11.3 Sampling Evidence

Each sampled case should record the sampling policy version, sampling reason, sampling probability, risk tier, segment tags, model / prompt / source version, review mode and final usage tag: QA, train, eval, audit, incident or exclude.


12. Surge Mode And Degraded Operations

12.1 Surge Triggers

| Trigger | Surge action |
| --- | --- |
| model drift increases review rate | activate reserve reviewers |
| RAG stale policy incident | stop free-form answers and review exposure window |
| fraud spike | route P0 / P1 to fraud surge pool |
| complaint volume surge | prioritize regulatory deadline queue |
| reviewer outage | reduce low-risk QA and preserve high-risk queue |
| evidence export failure | local evidence ledger and draft-only for Tier 1 |
| policy change | temporary double review for affected intents |

12.2 Surge Mode Playbook

1. Declare surge condition and scope.
2. Freeze low-value or discretionary review inflow.
3. Reclassify queues by customer harm and deadline.
4. Activate certified reserve reviewers.
5. Move senior reviewers to adjudication and P0/P1 only.
6. Switch low-risk customer-visible outputs to template or delayed response.
7. Increase random sentinel sample for affected workflow.
8. Track backlog, SLA, fatigue and quality every 2-4 hours.
9. Record residual risk and management decisions.
10. Exit surge only after backlog, quality and evidence gates pass.

12.3 Manual-First Mode

Manual-first is appropriate when AI can still retrieve or summarize but cannot safely recommend, decide or act.

Allowed: evidence retrieval, extractive summary, queue prioritization with human confirmation, approved templates.

Blocked: final recommendation framing, auto-send, state-changing tool calls, case closure, customer-specific promises.

12.4 Recovery Gate

Normal review operations resume when dependency health is stable, backlog is within tolerance, P0/P1 SLA breach risk is controlled, reviewer quality metrics are above threshold, local evidence is reconciled, sampled surge-period decisions pass QA, and business owner plus risk owner approve exit.


13. Dashboards And KRIs

13.1 Operations Dashboard

| Metric | Purpose |
| --- | --- |
| queue depth by priority | workload and backlog |
| queue age by SLA | breach forecast |
| arrival rate vs completion rate | capacity balance |
| review rate by workflow | policy and model behavior |
| reviewer utilization | fatigue and staffing |
| escalation volume | risk pressure |
| adjudication backlog | senior bottleneck |
| surge reserve availability | resilience readiness |

13.2 Quality Dashboard

| Metric | Purpose |
| --- | --- |
| gold accuracy | reviewer correctness |
| inter-rater agreement | rubric consistency |
| false accept rate | unsafe AI accepted |
| false reject rate | valid AI blocked |
| reason quality | audit readiness |
| meaningful edit rate | reviewer value and AI quality |
| appeal upheld rate | customer harm |
| audit replay success | evidence completeness |

13.3 Risk KRIs

| KRI | Yellow | Red |
| --- | --- | --- |
| P1 SLA breach forecast | breach likely within day | breach active or unavoidable |
| Override rate collapse | below historical band | near zero with short review time |
| Escalation miss | increasing in QA | confirmed high-risk miss |
| Gold accuracy | below target | high-risk reviewer fails |
| Evidence completeness | minor field gaps | trace cannot be replayed |
| Reviewer utilization | >85% for 3 days | sustained >95% or after-hours spike |
| Appeal upheld | trend up | severe customer harm case |
| Adjudication overturn | trend up | rubric or training failure |
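
Two of these KRIs are simple enough to evaluate mechanically. The sketch below is one possible reading of the table, not a definitive rule set: "sustained" is interpreted as all of the last three days, and the `near_zero` and `short_review_seconds` cutoffs are placeholder assumptions that each organization would need to set from its own history.

```python
def utilization_kri(daily_utilization: list[float]) -> str:
    """Classify reviewer utilization from the most recent three daily values."""
    last_three = daily_utilization[-3:]
    if len(last_three) < 3:
        return "green"  # not enough history to call a 3-day condition
    if all(u > 0.95 for u in last_three):
        return "red"     # assumption: "sustained >95%" means 3 consecutive days
    if all(u > 0.85 for u in last_three):
        return "yellow"  # >85% for 3 days
    return "green"


def override_collapse_kri(
    override_rate: float,
    historical_band_low: float,
    median_review_seconds: float,
    near_zero: float = 0.01,             # assumed cutoff for "near zero"
    short_review_seconds: float = 30.0,  # assumed cutoff for "short review time"
) -> str:
    """Classify the override rate collapse KRI."""
    if override_rate <= near_zero and median_review_seconds < short_review_seconds:
        return "red"     # near zero together with short review time
    if override_rate < historical_band_low:
        return "yellow"  # below historical band
    return "green"


print(utilization_kri([0.80, 0.88, 0.90, 0.91]))
print(override_collapse_kri(0.005, historical_band_low=0.04, median_review_seconds=12))
```

The red condition for override collapse deliberately requires both signals together, matching the table: a low override rate alone may simply mean the AI has improved.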

13.4 Executive View

Executive reporting should answer whether high-risk queues are within tolerance, review quality is stable, reviewers are overloaded, AI outputs require more override, escalation and stop authorities work, audit evidence is replayable, surge mode is tested, and residual risk has been accepted.


14. RACI

| Activity | Accountable | Responsible | Consulted | Informed |
| --- | --- | --- | --- | --- |
| Review strategy | Business Owner | AI PM | Risk, Compliance, Operations | Senior management |
| Queue taxonomy | Operations Owner | AI BA + Workflow Lead | Product, Architecture, QA | Reviewer teams |
| Review policy engine | AI Product Owner | Architect + Platform | Risk, Legal, Security | Operations |
| Capacity model | Operations Executive | Workforce Planning | PM, Finance, Risk | Queue owners |
| Skill routing | Operations Owner | Training Lead + Queue Manager | HR, Compliance | Reviewers |
| Calibration program | QA Lead | SME + Training Lead | Model Risk, Product | Governance forum |
| Evidence schema | Governance Evidence Owner | Platform Engineering | Audit, Privacy, Security | Business owner |
| Override governance | Risk Owner | Product + Operations | Compliance, Legal | Internal Audit |
| Surge mode | Operations Executive | Queue Managers | BCP, Platform, Risk | Senior management |
| Route stop | Business + Risk Owner | AI Platform Owner | Security, Compliance | All affected teams |
| Dashboard reporting | AI Governance Owner | Analytics + AI Ops | Operations, QA | Executive committee |
| Audit replay | Internal Audit | Evidence Owner | Product, Platform, Risk | Senior management |

15. Financial Retail Examples

15.1 AML Copilot

| Dimension | Design |
| --- | --- |
| AI role | summarize alert, transactions, KYC profile, typology and narrative draft |
| Review queue | high-risk typology, SAR draft, close-alert recommendation |
| Skill routing | AML analyst, senior investigator, compliance reviewer |
| Evidence | transaction pattern, source systems, customer profile, typology reference |
| Calibration | red flag gold cases and suspicious narrative boundary cases |
| Override | analyst edits narrative; senior reviewer blocks closure |
| KRI | missed red flag, L2 disagreement, SAR narrative edit rate |
| Red line | AI cannot be sole basis for closing high-risk alert |

15.2 Credit Underwriting Copilot

| Dimension | Design |
| --- | --- |
| AI role | prepare credit memo draft, policy citations, missing information list |
| Review queue | adverse action reason, policy exception, borderline case |
| Skill routing | underwriter, senior credit approver, fair lending reviewer |
| Evidence | application data, income docs, credit report references, policy version |
| Calibration | adverse reason specificity and exception boundary cases |
| Override | underwriter can reject AI memo and require source correction |
| KRI | policy citation accuracy, adverse reason correction, appeal upheld |
| Red line | AI cannot independently approve, decline or generate final adverse action reason |

15.3 Payment Dispute Assistant

| Dimension | Design |
| --- | --- |
| AI role | summarize dispute, recommend next step, draft customer update |
| Review queue | provisional credit, denial draft, high amount, fraud signal |
| Skill routing | dispute specialist, fraud senior, complaint lead |
| Evidence | transaction, merchant, customer claim, rule deadline, history |
| Calibration | wrong denial, deadline pressure and evidence sufficiency cases |
| Override | supervisor changes amount, blocks denial, escalates complaint |
| KRI | provisional credit error, wrong denial, SLA breach, complaint escalation |
| Red line | AI tool call cannot bypass amount limit or dual approval |

15.4 Customer Service RAG

| Dimension | Design |
| --- | --- |
| AI role | answer policy questions and draft agent response |
| Review queue | high-risk intent, unsupported claim, stale source, customer complaint |
| Skill routing | servicing SME, policy owner, complaint specialist |
| Evidence | source article, effective date, citation span, customer entitlement |
| Calibration | claim-level support and refusal boundary cases |
| Override | agent edits answer; policy owner updates source issue |
| KRI | unsupported claim, agent edit distance, escalation miss, source freshness |
| Red line | AI cannot make customer-specific fee, credit or legal commitments without approved process |

16. Templates

16.1 Human Review Design Brief

| Field | Filled example |
| --- | --- |
| Use case | Payment dispute assistant |
| AI behavior | Summarizes evidence and recommends next action |
| Review unit | Provisional credit recommendation |
| Risk tier | High customer and financial impact |
| Queue | Pre-action payment review |
| Reviewer skill | Payment dispute specialist |
| SLA | 4 business hours or sooner when deadline risk exists |
| Evidence | transaction, customer claim, rules deadline, AI rationale, policy version |
| Allowed actions | approve, reject, edit amount, request evidence, escalate |
| Override authority | supervisor for amount above frontline threshold |
| Audit fields | trace ID, reviewer, reason code, evidence references, timestamp |
| KRI | wrong credit, wrong denial, SLA breach, complaint escalation |

16.2 Review Requirement Pattern

For payment dispute provisional credit,
when AI recommends an amount above the frontline threshold
or detects fraud, complaint or deadline risk,
the system must route the case to a payment dispute specialist
before any payment action is submitted,
showing transaction evidence, customer claim, rule deadline, policy version and AI rationale,
allowing approve, reject, edit amount, request evidence and escalate,
capturing reason code, reviewer identity, evidence references, timestamp and tool trace,
and enforcing a four-business-hour SLA with supervisor escalation on breach forecast.

16.3 Queue Configuration Card

| Field | Example |
| --- | --- |
| queue_id | payment_dispute_pre_action_p1 |
| owner | Payment Dispute Operations Lead |
| priority | P1 |
| review policy | 100% for amount above threshold, risk-based below threshold |
| skill requirement | dispute specialist certification |
| independence | reviewer cannot be original case handler for high amount |
| SLA | 4 business hours |
| OLA | evidence builder under 5 minutes, approver assignment under 15 minutes |
| escalation | supervisor queue after 75% SLA consumption |
| surge rule | activate reserve pool when backlog exceeds 80% daily capacity |
| evidence retention | governed case record and immutable trace |

16.4 Evidence Ledger Schema

| Field group | Fields |
| --- | --- |
| Identity | trace_id, case_id, customer_hash, workflow, risk_tier |
| AI config | model_id, model_version, prompt_version, retriever_version |
| Source | document_ids, source_owner, effective_date, citation_spans |
| Review | queue_id, reviewer_id, reviewer_role, review_mode, action_type |
| Decision | reason_code, rationale, edit_diff, override_flag, escalation_id |
| Timing | created_at, assigned_at, due_at, completed_at, SLA_status |
| Downstream | tool_action, approval_id, customer_visible_message_id |
| Governance | rubric_version, sampling_policy, incident_id, retention_class |
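
One way to make "audit replay success" measurable is to check each ledger record against this schema before a case closes. The sketch below is an assumption-laden illustration: the schema is transcribed from the table, but which fields may legitimately be empty (`CONDITIONAL_FIELDS`) is a guess made for this example and would need to be agreed with the evidence owner.

```python
# Field groups transcribed from the evidence ledger schema table.
LEDGER_SCHEMA = {
    "identity": ["trace_id", "case_id", "customer_hash", "workflow", "risk_tier"],
    "ai_config": ["model_id", "model_version", "prompt_version", "retriever_version"],
    "source": ["document_ids", "source_owner", "effective_date", "citation_spans"],
    "review": ["queue_id", "reviewer_id", "reviewer_role", "review_mode", "action_type"],
    "decision": ["reason_code", "rationale", "edit_diff", "override_flag", "escalation_id"],
    "timing": ["created_at", "assigned_at", "due_at", "completed_at", "SLA_status"],
    "downstream": ["tool_action", "approval_id", "customer_visible_message_id"],
    "governance": ["rubric_version", "sampling_policy", "incident_id", "retention_class"],
}

# Assumption: these fields only apply to some cases, so None is acceptable for them.
CONDITIONAL_FIELDS = {
    "edit_diff", "escalation_id", "tool_action", "approval_id",
    "customer_visible_message_id", "incident_id",
}


def replay_gaps(record: dict) -> dict[str, list[str]]:
    """Return the fields, grouped by schema section, that would block audit replay."""
    gaps = {}
    for group, fields in LEDGER_SCHEMA.items():
        missing = [
            f for f in fields
            if f not in record or (record[f] is None and f not in CONDITIONAL_FIELDS)
        ]
        if missing:
            gaps[group] = missing
    return gaps


# Build a complete placeholder record, then remove one field to show a gap.
record = {f: "x" for fields in LEDGER_SCHEMA.values() for f in fields}
for f in CONDITIONAL_FIELDS:
    record[f] = None
del record["model_version"]
print(replay_gaps(record))
```

A record with an empty result is structurally complete; whether the referenced evidence can actually be retrieved is a separate check.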

16.5 Surge Decision Card

| Field | Filled example |
| --- | --- |
| Trigger | Customer service RAG stale policy incident |
| Scope | Fee, dispute and complaint intents from 08:00 CT onward |
| Mode | Template-only for customer-facing answers, manual-first for exceptions |
| Queues prioritized | P0 complaint, P1 dispute, P1 fee correction |
| Capacity action | Activate servicing SME reserve pool |
| Evidence action | Preserve source IDs and customer-visible message versions |
| Executive update | Every 4 hours until backlog and evidence gate pass |
| Exit condition | index validated, sample QA passed, backlog below threshold |

17. 30-Day Lab

Goal: Within 30 days, complete a presentable AI Human Review Operations / Capacity Architecture portfolio pack. Recommended choices are Payment Dispute Assistant, Customer Service RAG, AML Copilot or Credit Underwriting Copilot.

| Day | Task | Artifact |
| --- | --- | --- |
| 1 | Choose one retail financial AI workflow | use case boundary card |
| 2 | Define AI behavior: retrieve, summarize, recommend, draft, act | AI behavior map |
| 3 | Break down the review unit: claim, draft, recommendation, tool action, sample | review unit taxonomy |
| 4 | Do risk tiering and a customer impact map | risk and impact matrix |
| 5 | Design the queue taxonomy | queue catalog |
| 6 | Define priority rules: P0-P4 | priority rulebook |
| 7 | Design skill routing | reviewer skill matrix |
| 8 | Design conflict-of-interest controls | independence control table |
| 9 | Write reviewer workspace requirements | evidence and action spec |
| 10 | Build the capacity model baseline | capacity worksheet |
| 11 | Run peak and incident sensitivity | capacity scenario memo |
| 12 | Define SLA / OLA | SLA and OLA table |
| 13 | Design fatigue controls | fatigue monitoring plan |
| 14 | Design calibration cases | 12-case calibration pack |
| 15 | Define certification levels | reviewer certification model |
| 16 | Design reason codes | override reason taxonomy |
| 17 | Design escalation rules | escalation matrix |
| 18 | Design override governance | authority matrix |
| 19 | Design sampling policy | sampling strategy memo |
| 20 | Design dashboard and KRIs | operations and quality dashboard spec |
| 21 | Design evidence ledger | trace schema |
| 22 | Design surge mode | surge playbook |
| 23 | Design manual-first fallback | degraded review runbook |
| 24 | Write RACI | responsibility matrix |
| 25 | Walk one business case end to end | case walkthrough |
| 26 | Run tabletop: reviewer capacity shortage | exercise script and decision log |
| 27 | Run tabletop: model drift increases review rate | exercise script and decision log |
| 28 | Write executive memo | capacity risk and ROI memo |
| 29 | Write architecture review pack | workflow, queue, trace and dashboard diagrams |
| 30 | Prepare interview story | STAR-T story and 8 Q&A |

18. Interview Answers

Q1: Human review 和 HITL 有什么区别?

30 秒:

HITL 只说明人参与了流程。Human review operations 要证明这个人有技能、时间、证据、独立性、权限和升级路径, 并且组织能度量 review 是否真的降低风险。 2 分钟: 我会把 human review 设计成生产控制系统。第一步定义 review unit, 例如 claim、draft、recommendation 或 tool action。第二步按风险和客户影响设计 queue taxonomy。第三步做 skill routing、capacity model、SLA/OLA 和 reviewer workspace。第四步建立 calibration、reason code、override quality、adjudication 和 audit evidence。这样 human review 不只是审批按钮, 而是可运营、可审计、可扩容的控制能力。

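Q2: How do you tell whether human review has become control theater?

30 seconds:

Look for five signals: extremely short review times, an override rate abnormally close to zero, vague reason codes, reviewers without evidence or authority, and queue capacity that is chronically insufficient.

2 minutes:

The essence of control theater is a control that exists in form but cannot change outcomes. I would check whether reviewers can see the evidence, can reject and escalate, have a stop path, have enough capacity, and have completed calibration. On the metrics side I look at meaningful edits, override quality, escalation misses, gold accuracy, inter-rater agreement, audit replay success and fatigue signals. A sustained zero override rate is not automatically a good result; it may reflect automation bias or management pressure.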
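
The five signals can be turned into a simple per-queue check. The following is a minimal sketch, not a reference implementation: the `QueueStats` fields, the use of evidence-open rate as a proxy for "reviewer has no evidence", and every threshold value are hypothetical and would need to be set from the organization's own baselines.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    median_review_seconds: float  # median handling time per reviewed case
    override_rate: float          # share of reviews where the human overrode the AI
    vague_reason_rate: float      # share of decisions with blank or generic reason codes
    evidence_open_rate: float     # share of reviews where evidence was actually opened
    utilization: float            # demanded review hours / available productive hours


def control_theater_signals(
    stats: QueueStats,
    min_review_seconds: float = 30.0,      # hypothetical threshold
    min_override_rate: float = 0.01,       # hypothetical threshold
    max_vague_reason_rate: float = 0.20,   # hypothetical threshold
    min_evidence_open_rate: float = 0.90,  # hypothetical threshold
    max_utilization: float = 1.0,          # above 1.0, demand exceeds capacity
) -> list[str]:
    """Return the names of the warning signals that fire for one queue."""
    checks = [
        ("review_time_too_short", stats.median_review_seconds < min_review_seconds),
        ("override_rate_near_zero", stats.override_rate < min_override_rate),
        ("reason_codes_vague", stats.vague_reason_rate > max_vague_reason_rate),
        ("evidence_not_used", stats.evidence_open_rate < min_evidence_open_rate),
        ("capacity_insufficient", stats.utilization > max_utilization),
    ]
    return [name for name, fired in checks if fired]
```

A single fired signal is a prompt for investigation rather than proof of failure; several firing together on the same queue is the pattern worth escalating.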

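Q3: How do you do capacity planning?

30 seconds:

Model it from case volume, review rate, average handling time, complexity, dual-review share, rework rate, productive hours, SLA and surge reserve, then run peak and incident sensitivity.

2 minutes:

I would not look only at average daily volume. After AI goes live, the review rate moves with model drift, policy changes and incidents. The model should separate the P0/P1/P2 queues and estimate average handling time and dual-review demand for each. Productive hours are then derived from scheduled time by deducting training, meetings, breaks and a fatigue allowance. Finally, simulate peaks and incidents: if the review rate rises from 15% to 30%, can the SLA still be met; if senior adjudicators run short, which cases are blocked.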
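
The inputs above can be combined into a headcount estimate. This is a minimal single-queue sketch: the multiplicative form of the formula and all input values (10,000 daily cases, 6-minute handling time, 6 productive hours per reviewer and so on) are illustrative assumptions, not figures from this playbook. Only the 15% and 30% review rates come from the text.

```python
import math


def required_reviewers(
    daily_cases: int,
    review_rate: float,                    # share of AI cases sent to human review
    avg_handle_minutes: float,             # average handling time per review
    dual_review_share: float,              # share of reviews needing a second reviewer
    rework_rate: float,                    # share of reviews that must be redone
    productive_hours_per_reviewer: float,  # scheduled hours minus training, meetings, breaks
    surge_reserve: float,                  # extra capacity held back for peaks
) -> int:
    """Estimate daily reviewer headcount for one queue (assumed formula)."""
    cases_to_review = daily_cases * review_rate
    base_hours = cases_to_review * avg_handle_minutes / 60.0
    demand_hours = base_hours * (1.0 + dual_review_share) * (1.0 + rework_rate)
    staffed_hours = demand_hours * (1.0 + surge_reserve)
    return math.ceil(staffed_hours / productive_hours_per_reviewer)


# Hypothetical baseline inputs for one queue.
baseline = dict(
    daily_cases=10_000,
    avg_handle_minutes=6.0,
    dual_review_share=0.10,
    rework_rate=0.05,
    productive_hours_per_reviewer=6.0,
    surge_reserve=0.20,
)

normal = required_reviewers(review_rate=0.15, **baseline)
incident = required_reviewers(review_rate=0.30, **baseline)
print(normal, incident)  # 35 70
```

Under these assumptions, doubling the review rate doubles the required headcount from 35 to 70, which is far beyond a 20% surge reserve. A fuller model would run this per priority queue and per skill tier, since senior adjudicator capacity is usually the first constraint to bind.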

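Q4: How do you design skill routing?

30 seconds:

Routing cannot assign work purely on queue idleness. It must assign by business domain, product, risk, language, authority, AI literacy and evidence-access eligibility.

2 minutes:

For example, a payment dispute case involving a high-value provisional credit and fraud signals should not go to a general customer service reviewer. It needs a payment dispute specialist, and the fraud signal needs a fraud senior involved. A credit adverse action draft needs an underwriter and a fair-lending-aware reviewer. The routing engine should combine case risk, product, customer context, language, authority and conflict-of-interest rules, and log why the case was assigned to that reviewer.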
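
A routing rule of this kind can be sketched as an eligibility filter followed by a load-based choice, with the reasons recorded for the log. The sketch below simplifies the example to a single reviewer per case; the `Reviewer` and `Case` types, the skill names and the tie-breaking rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    reviewer_id: str
    skills: frozenset
    languages: frozenset
    authority_level: int
    open_cases: int = 0


@dataclass(frozen=True)
class Case:
    case_id: str
    required_skills: frozenset
    language: str
    min_authority: int
    conflicted_reviewers: frozenset = frozenset()  # reviewers barred by conflict-of-interest rules


def route(case: Case, reviewers: list):
    """Return (reviewer or None, routing log explaining the decision)."""
    eligible, rejected = [], {}
    for r in reviewers:
        if r.reviewer_id in case.conflicted_reviewers:
            rejected[r.reviewer_id] = "conflict_of_interest"
        elif not case.required_skills <= r.skills:
            rejected[r.reviewer_id] = "missing_skill"
        elif case.language not in r.languages:
            rejected[r.reviewer_id] = "language"
        elif r.authority_level < case.min_authority:
            rejected[r.reviewer_id] = "insufficient_authority"
        else:
            eligible.append(r)
    if not eligible:
        # No qualified reviewer: escalate instead of assigning to whoever is idle.
        return None, {"case_id": case.case_id, "decision": "escalate_no_eligible_reviewer",
                      "rejected": rejected}
    # Among eligible reviewers, pick the least loaded; the ID breaks ties deterministically.
    chosen = min(eligible, key=lambda r: (r.open_cases, r.reviewer_id))
    return chosen, {"case_id": case.case_id, "decision": "assigned",
                    "assigned_to": chosen.reviewer_id, "rejected": rejected}
```

The design choice worth noting is that idleness only matters after eligibility: an idle reviewer without the skill or authority is never selected, and the `rejected` map gives audit the reason each candidate was passed over.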

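Q5: How do you run reviewer calibration?

30 seconds:

Use gold cases, boundary cases, blind review, double review and adjudication to train reviewers to reach consistent judgments on the same evidence.

2 minutes:

I would first define a rubric covering evidence sufficiency, policy interpretation, risk escalation, customer impact and permitted actions. Then I would design cases of these types: clear pass, clear fail, boundary, missing evidence, stale policy, tool side effect and automation bias challenge. Reviewers do a blind review first, then compare against the adjudicator standard. Metrics include gold accuracy, inter-rater agreement, adjudication overturn, reason completeness and escalation correctness. Calibration results determine certification and which queues a reviewer may handle.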
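
Two of these metrics are easy to compute from a calibration pack. The sketch below uses plain accuracy against adjudicated gold labels and Cohen's kappa as one common choice of inter-rater agreement statistic; the playbook does not prescribe a specific statistic, and the label values are hypothetical.

```python
from collections import Counter


def gold_accuracy(reviewer_labels, gold_labels):
    """Share of gold cases where the reviewer matched the adjudicated label."""
    if len(reviewer_labels) != len(gold_labels) or not gold_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(r == g for r, g in zip(reviewer_labels, gold_labels))
    return hits / len(gold_labels)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two reviewers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


reviewer_a = ["accept", "reject", "escalate", "accept", "reject", "accept"]
reviewer_b = ["accept", "reject", "accept", "accept", "escalate", "accept"]
kappa = cohens_kappa(reviewer_a, reviewer_b)  # 3/7, about 0.43
```

In this example the two reviewers agree on 4 of 6 cases, yet kappa is only about 0.43 because much of that agreement is expected by chance when "accept" dominates. That is why raw agreement alone tends to overstate calibration on skewed queues.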

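Q6: Should the override rate be as low as possible?

30 seconds:

Not necessarily. A very low rate may mean the AI is good, or it may mean reviewers accept by default. Judge it together with review time, reason quality, gold case performance, appeals upheld and audit samples.

2 minutes:

Override is a trust calibration metric, not a standalone success metric. As AI quality improves, overrides should decline within a reasonable range. But if the rate suddenly approaches zero while review times shorten, reasons become vague and gold cases fail, that signals control failure. I care more about override quality: whether human overrides correct missing evidence, policy errors, customer context, tool overreach or language risk.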

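Q7: How do you degrade when human review fails?

30 seconds:

When the queue backs up or reviewer quality drops, enter surge or manual-first mode: stop low-value review, prioritize P0/P1, activate the reserve pool, downgrade customer-visible AI, and record management's risk acceptance.

2 minutes:

First identify scope: which workflows, intents, customer segments and actions are affected. Then freeze low-risk discretionary QA and give that capacity to P0/P1. For customer-visible high-risk AI, switch to template-only or draft-only. For tool actions, stop automatic actions and wait for authorized review. Activate certified reserve reviewers and check backlog, SLA, fatigue and quality every 2-4 hours. Before returning to normal, pass the backlog, quality, evidence and business signoff gates.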
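
The degradation logic can be sketched as a mode selector plus an explicit recovery gate. The text names the modes, the actions and the four gates; which trigger leads to which mode, the split of actions between modes, and every threshold below are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SURGE = "surge"
    MANUAL_FIRST = "manual_first"


SURGE_ACTIONS = ["freeze_discretionary_qa", "prioritize_p0_p1", "activate_reserve_pool"]

# Assumed mapping: manual-first adds the AI downgrade steps on top of surge.
MODE_ACTIONS = {
    Mode.NORMAL: [],
    Mode.SURGE: SURGE_ACTIONS,
    Mode.MANUAL_FIRST: SURGE_ACTIONS + [
        "customer_visible_ai_draft_only",
        "halt_automatic_tool_actions",
        "record_management_risk_acceptance",
    ],
}

RECOVERY_GATES = ("backlog", "quality", "evidence", "business_signoff")


def select_mode(
    p0_p1_sla_breach_rate: float,
    backlog_hours: float,
    gold_accuracy: float,
    surge_backlog_hours: float = 8.0,    # hypothetical threshold
    manual_backlog_hours: float = 24.0,  # hypothetical threshold
    max_breach_rate: float = 0.05,       # hypothetical threshold
    min_gold_accuracy: float = 0.85,     # hypothetical threshold
) -> Mode:
    """Pick the operating mode from current queue and quality signals."""
    if gold_accuracy < min_gold_accuracy or backlog_hours >= manual_backlog_hours:
        return Mode.MANUAL_FIRST
    if p0_p1_sla_breach_rate > max_breach_rate or backlog_hours >= surge_backlog_hours:
        return Mode.SURGE
    return Mode.NORMAL


def can_return_to_normal(gates: dict) -> bool:
    """All four recovery gates must be explicitly passed; a missing gate counts as failed."""
    return all(gates.get(name, False) for name in RECOVERY_GATES)
```

Treating a missing gate as failed keeps recovery conservative: the system stays degraded until someone records each signoff, rather than drifting back to normal when the backlog alone clears.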

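Q8: How do you explain the cost of human review to executives?

30 seconds:

Review cost is an AI control cost, not extra administrative overhead. Without it, the ROI of high-risk AI is illusory, because the risk is shifted to employees, customers and audit.

2 minutes:

I would use the capacity model to show three things: how many reviewers normal volume requires, how much surge reserve peaks and incidents require, and how review quality reduces customer harm and regulatory risk. I would also compare the cost and residual risk of different strategies: 100% review, risk-based review, sampling and exception review. What executives need to see is controlled automation, not automation gains built on unsustainable pressure on people.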

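Q9: What audit evidence must be retained?

30 seconds:

At minimum: trace ID, AI version, prompt / retriever / source version, evidence, reviewer, action, reason code, timestamp, downstream action and escalation.

2 minutes:

Audit must be able to replay what happened at the time: what input the AI saw, which model and prompt were used, which sources were retrieved, whether the evidence was valid, why the review policy routed the case to that queue, what the reviewer saw, what action they took and why, whether there was an override, whether it was escalated, and whether the downstream tool executed. Without these fields, human review is hard to prove effective, and the organization cannot learn from incidents.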
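
The minimum field list maps naturally onto a trace record with a completeness check. This is a minimal sketch: the `ReviewTrace` field names, the rule that `escalated_to` is only required for escalations, and the example values are hypothetical, and a production ledger would also need immutability and retention controls that are not shown here.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReviewTrace:
    trace_id: str
    model_version: str
    prompt_version: str
    retriever_version: str
    source_versions: list[str]  # versions of the retrieved sources
    evidence_refs: list[str]    # evidence the reviewer was shown
    routing_rule: str           # why review policy sent the case to this queue
    reviewer_id: str
    action: str                 # e.g. accept, edit, reject, override, escalate, stop
    reason_code: str
    reviewed_at: str            # ISO 8601 timestamp
    downstream_action: str      # what the downstream tool or workflow executed
    escalated_to: Optional[str] = None


OPTIONAL_FIELDS = {"escalated_to"}


def replay_gaps(trace: ReviewTrace) -> list[str]:
    """List the fields whose absence would prevent an audit replay."""
    gaps = [
        name
        for name, value in asdict(trace).items()
        if name not in OPTIONAL_FIELDS and not value
    ]
    # Assumed rule: an escalation without a recorded recipient is a gap.
    if trace.action == "escalate" and not trace.escalated_to:
        gaps.append("escalated_to")
    return gaps
```

Running `replay_gaps` at write time, rather than at audit time, lets the workflow reject or flag an override that arrives without a reason code or evidence reference while the reviewer can still supply it.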


19. Final Operating Principle

At minimum, an AI human review system must answer:

| Question | Mature answer |
| --- | --- |
| What is reviewed? | claim, draft, recommendation, tool action, sampled outcome |
| Who reviews? | certified reviewer with domain skill, authority and independence |
| How is it routed? | risk, skill, SLA, capacity and conflict-of-interest rules |
| What evidence is shown? | sources, versions, missing evidence, AI trace, policy, downstream impact |
| What can the human do? | accept, edit, reject, override, escalate, request evidence, stop |
| How is quality known? | calibration, QA, agreement, gold cases, appeal outcomes |
| How is capacity controlled? | model, dashboard, surge reserve, fatigue controls |
| How is it audited? | immutable trace, reason code, evidence reference and replay |

Final principle:
Human review is effective only when the organization can prove
that the right human reviewed the right case,
with the right evidence and authority,
within the right time,
and that the review measurably improved control quality.