AI Human Factors Operations: Cognitive Load and Automation Bias Architecture

AI Human Factors Operations Architecture: Cognitive Load / Automation Bias / Calibrated Trust

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / AI product architect / enterprise architect / operations lead / model risk partner
Output: advanced architecture note, operating model, control design, ADR draft, interview-ready narrative


1. Why Human Factors Are Architecture, Not Just UX

AI human factors often get reduced to "make the interface clearer" or "add a human review step." That framing is too shallow for financial retail operations. In AML, credit, fraud, complaints, collections and contact centers, the human is not a decorative fallback. The human is a scarce decision resource inside a production control system.

Human factors become architecture because they shape:

| Architecture concern | Human factor | Why it matters in production |
| --- | --- | --- |
| Throughput | case volume, handling time, fatigue, task switching | A review queue that exceeds cognitive capacity becomes a rubber stamp or backlog. |
| Risk control | automation bias, alert fatigue, anchoring, over-trust | A reviewer who accepts AI output without challenge is not an effective control. |
| Decision rights | who may accept, edit, override, escalate, stop or approve | Regulated decisions require accountable authority, not just a button in the UI. |
| Evidence | what the operator saw, changed, ignored and relied on | Audit needs reconstructable control evidence, not only model logs. |
| Quality | reviewer calibration, second-line QA, disagreement handling | Human judgment quality drifts just like model quality. |
| Workload routing | skill, risk, language, customer vulnerability, deadline | The right work must reach the right human at the right time. |
| Trust | calibrated reliance, skepticism, confidence and recoverability | Trust must match evidence strength, task risk and action reversibility. |

In a financial retail AI system, the architecture question is not:

Can we put a human in the loop?

The senior question is:

Can the operating architecture preserve independent human judgment under real workload,
while proving that the human control reduced risk instead of becoming control theater?

This note deliberately avoids repeating generic Human-AI Interaction principles or Team Topologies cognitive load language. The focus is operational architecture: operator burden, review fatigue, automation bias, calibrated trust, escalation design, second-line QA, sampling, error cost, workload routing, skill matrix, training loops, decision rights and control evidence.


2. Concept Diagram

flowchart TB
  Intake[Case intake<br/>customer, alert, application, complaint, call] --> Classifier[Task, risk, impact and reversibility classifier]
  Classifier --> Workload[Operator load estimator<br/>volume, AHT, skill, fatigue, SLA]
  Classifier --> Assist[AI assistance layer<br/>RAG, copilot, agent, model score]
  Assist --> Evidence[Evidence bundle<br/>sources, tool trace, policy, confidence limits]
  Workload --> Route[Workload and skill router]
  Evidence --> BiasCtl[Automation bias controls<br/>blind pass, reason codes, friction, challenge prompts]
  BiasCtl --> Route
  Route --> Workspace[Operator workspace<br/>task, evidence, AI output, controls]
  Workspace --> Decision{Human decision}
  Decision -->|Accept or edit| Action[Downstream action<br/>reply, approve, freeze, close, escalate]
  Decision -->|Override| Override[Override governance<br/>reason, authority, evidence]
  Decision -->|Escalate| Escalation[Escalation path<br/>SME, compliance, second line, supervisor]
  Decision -->|Stop route| SafeStop[Safe stop<br/>pause automation or queue]
  Action --> Ledger[Evidence ledger<br/>trace, version, operator action, timing]
  Override --> Ledger
  Escalation --> Ledger
  SafeStop --> Ledger
  Ledger --> QA[Second-line QA and calibration]
  QA --> Metrics[Metrics and control dashboard]
  Metrics --> Improve[Training, prompt, RAG, workflow and policy improvement]
  Improve --> Assist
  Improve --> Route

Architecture interpretation:

  • The AI layer is only one part of the operating architecture.
  • Bias control sits before and inside the reviewer workspace, not only in training material.
  • Queue routing must consider cognitive load and skill, not just FIFO order.
  • QA and calibration are production feedback loops, not one-time launch activities.
  • Evidence ledger is the connective tissue across product, model risk, compliance, audit and operations.

3. Operating Architecture Model

3.1 Architecture Layers

| Layer | Core responsibility | Design decision |
| --- | --- | --- |
| Work intake | Convert events into reviewable work units | Define whether the unit is a claim, draft, recommendation, tool action, case, alert or sampled outcome. |
| Risk and impact classifier | Determine customer impact, financial impact, regulatory sensitivity and reversibility | Use risk tier and error cost to drive routing, review depth and escalation. |
| Cognitive load manager | Estimate operator burden and fatigue risk | Track queue size, average handling time, interruption rate, context switching, active hours and case complexity. |
| AI assistance layer | Generate summaries, recommendations, drafts, evidence retrieval, next-best action or tool proposals | Separate AI evidence, AI reasoning summary, model score, system-of-record facts and policy constraints. |
| Bias and trust control plane | Reduce over-reliance and under-reliance | Design blind review, challenge prompts, no default accept, confidence explanation and required evidence checks. |
| Workload router | Match work to skill, authority, independence and capacity | Route by domain, product, language, risk, customer vulnerability, deadline and conflict-of-interest rules. |
| Operator workspace | Give the human enough context and control to make a defensible decision | Show evidence first, expose missing data, structure actions and capture reason codes. |
| Decision rights layer | Decide who may accept, edit, override, approve, escalate or stop | Tie permissions to role, skill certification, risk tier and policy authority. |
| QA and calibration | Detect drift in human judgment and AI reliance | Use gold cases, blind samples, second review, adjudication and reviewer coaching. |
| Evidence and observability | Record what happened and why | Capture input, output, evidence version, model/prompt/version, action, reason, trace, timings and reviewer identity. |
| Governance loop | Convert operational signals into control improvements | Review trends, incidents, audit findings, training gaps, policy changes and release gates. |

3.2 Review Unit Taxonomy

| Review unit | Financial retail example | Human factor risk | Architecture implication |
| --- | --- | --- | --- |
| Claim | "This fee can be waived under policy X." | Operator may trust fluent unsupported claim. | Require source-linked claim verification. |
| Draft | Complaint response letter or hardship script | Reviewer skims language and misses commitment risk. | Highlight obligations, promises, policy citations and prohibited phrases. |
| Recommendation | AML alert close, credit approve, fraud block | AI recommendation anchors the human. | Use blind first-pass or evidence-first design for high-impact cases. |
| Tool action | Refund, freeze, close case, send notice | One click changes customer state. | Require authority, preview, confirmation, trace and reversible design where possible. |
| Sampled outcome | Auto-handled contact center answer | Sample misses minority error segments. | Use stratified and sentinel sampling, not only random QA. |
| Escalation | PEP match, legal threat, vulnerable customer | Human may route late due to workload pressure. | Define escalation triggers, SLA, stop rules and destination ownership. |

3.3 Decision Rights Matrix

| Risk tier | AI role | Human role | Required control |
| --- | --- | --- | --- |
| P0: irreversible, legal, regulatory, material customer harm | Summarize evidence and propose options only | Authorized human decides; second-line or specialist may approve | Mandatory review, no default accept, explicit reason, escalation trace |
| P1: high impact but controllable | Recommend with evidence and uncertainty | Skilled operator accepts, edits, rejects or escalates | Evidence checklist, reason code, QA sample and override monitoring |
| P2: customer-visible but reversible | Draft or answer with source support | Frontline reviews or risk-based samples | Citation support, feedback, random QA and recovery path |
| P3: internal productivity | Assist task completion | Operator accountable for use | Periodic QA, training loop and telemetry |
| P4: learning and experimentation | Shadow output | No production decision | Offline eval, calibration cases and model comparison evidence |
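
As a minimal sketch, the matrix above could be enforced at runtime with a lookup from risk tier to authorized roles. The role names (`authorized_decider`, `skilled_operator`, `frontline`) and the `may_decide` function are hypothetical; only the tier labels and the reason requirement for P0 and P1 come from the matrix.

```python
from typing import Optional, Tuple

# Hypothetical role names; tiers follow the decision rights matrix.
ALLOWED_ROLES = {
    "P0": {"authorized_decider"},
    "P1": {"authorized_decider", "skilled_operator"},
    "P2": {"authorized_decider", "skilled_operator", "frontline"},
    "P3": {"authorized_decider", "skilled_operator", "frontline"},
    "P4": set(),  # shadow output only: no production decision is allowed
}
# P0 requires an explicit reason and P1 a reason code in the matrix.
REASON_REQUIRED = {"P0", "P1"}


def may_decide(tier: str, role: str, reason_code: Optional[str] = None) -> Tuple[bool, str]:
    """Return (allowed, explanation) for a proposed human decision."""
    if tier not in ALLOWED_ROLES:
        return False, "unknown risk tier"
    if role not in ALLOWED_ROLES[tier]:
        return False, "role lacks authority for this tier"
    if tier in REASON_REQUIRED and not reason_code:
        return False, "reason code required"
    return True, "ok"
```

A real deployment would likely source these entitlements from IAM and the authority matrix instead of a hard-coded dictionary.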

3.4 Skill Matrix

| Skill dimension | Why it matters | Evidence of readiness |
| --- | --- | --- |
| Domain knowledge | AML, credit, fraud, complaints and collections have different risk rules. | Certification record, gold case score, supervisor sign-off |
| Policy interpretation | Operators must identify policy conflict and exception boundaries. | Scenario test, policy quiz, audit sample |
| AI literacy | Operators must recognize hallucination, retrieval gaps, false confidence and tool misuse. | Training completion plus challenge-case performance |
| Evidence handling | Reviewers need to know which sources are authoritative and which are weak. | Evidence rubric score and citation sufficiency rate |
| Customer impact judgment | Same error has different cost for hardship, vulnerable customer or adverse action. | Escalation accuracy and severity calibration |
| Decision authority | Some actions need senior approval or independent review. | Role-based entitlement and authority matrix |
| Communication quality | Customer-visible outputs require clear, compliant, empathetic language. | QA language sample and complaint trend |
4. Cognitive Load And Automation Bias Controls

4.1 Cognitive Load Control Model

Operator load is not only the number of cases. It is the mental work required to understand evidence, challenge AI, decide under uncertainty, document the decision and recover from exceptions.

operator_load =
  case_complexity
+ evidence_volume
+ evidence_conflict
+ policy_ambiguity
+ customer_impact
+ interruption_rate
+ context_switching
+ time_pressure
+ UI_navigation_cost
+ documentation_burden
- task_chunking
- evidence_prioritization
- skill_match
- workflow_automation
- escalation_clarity

| Load driver | Architecture control | Financial retail example |
| --- | --- | --- |
| Evidence sprawl | Evidence bundle with ranked sources, freshness, authority and missing fields | AML investigator sees transaction cluster, SAR history, customer profile and typology in one bundle. |
| Policy ambiguity | Policy conflict detector and escalation rule | Collections hardship script flags conflict between temporary forbearance policy and state-specific rule. |
| High context switching | Queue batching by domain, product and case type | Contact center QA reviewers handle complaint drafts in blocks instead of mixing fraud, credit and servicing. |
| Time pressure | SLA-aware routing with surge mode | Fraud interventions route real-time blocks separately from batch post-event QA. |
| Documentation burden | Structured reason codes plus short free-text rationale | Credit underwriter records adverse action rationale without retyping the whole memo. |
| Fatigue | shift limits, complexity caps, break triggers and reviewer rotation | High-risk AML queue limits consecutive complex cases and rotates to calibration work. |
| Alert overload | risk-based triage and sampled low-risk verification | Fraud false-positive stream gets stratified QA while high-risk account takeover gets immediate review. |
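
The additive `operator_load` expression in 4.1 can be sketched as a weighted sum. This version assumes each factor has already been normalized to a 0..1 scale and that weights default to 1.0; both are illustrative assumptions, since the note does not define units or weights.

```python
from typing import Dict, Optional

# Terms that add load in the operator_load expression.
LOAD_DRIVERS = [
    "case_complexity", "evidence_volume", "evidence_conflict", "policy_ambiguity",
    "customer_impact", "interruption_rate", "context_switching", "time_pressure",
    "ui_navigation_cost", "documentation_burden",
]
# Terms that are subtracted because they relieve load.
LOAD_RELIEFS = [
    "task_chunking", "evidence_prioritization", "skill_match",
    "workflow_automation", "escalation_clarity",
]


def operator_load(factors: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Sum weighted drivers and subtract weighted reliefs; missing factors count as 0."""
    weights = weights or {}
    drivers = sum(weights.get(k, 1.0) * factors.get(k, 0.0) for k in LOAD_DRIVERS)
    reliefs = sum(weights.get(k, 1.0) * factors.get(k, 0.0) for k in LOAD_RELIEFS)
    return drivers - reliefs
```

The score is only meaningful relative to other cases scored the same way, for example to cap consecutive complex cases in a queue.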

4.2 Automation Bias Controls

Automation bias means operators give excessive weight to AI output because it is fluent, confident, convenient, faster or socially endorsed by management. In production it shows up as one-click accept, low override rates, shallow evidence review, declining escalation and reduced error detection.

| Bias pattern | Control | Architecture implementation | Evidence |
| --- | --- | --- | --- |
| Anchoring on AI recommendation | Evidence-first or blind first-pass review for P0/P1 | Hide AI recommendation until reviewer marks evidence sufficiency or preliminary risk tier. | UI event sequence, preliminary decision, final decision delta |
| Default acceptance | No preselected accept action | Require active accept, edit, reject or escalate selection with reason code. | Action log and reason-code distribution |
| Confidence theater | Explain confidence source and limit | Separate model confidence, retrieval support, policy certainty and data completeness. | Confidence component telemetry and QA findings |
| Speed pressure | Balanced metrics | Score throughput together with quality, override validity, escalation accuracy and missed-risk rate. | Operations dashboard and performance scorecard |
| Reviewer fatigue | Load-aware routing | Throttle queue, cap complex cases, route surge and flag fatigue risk. | Workload trace and shift-level quality trend |
| AI social proof | Independent challenge prompt | Ask "What evidence would make this recommendation wrong?" before final accept. | Challenge response captured in high-risk cases |
| Shallow review | Mandatory evidence checklist | Require source opening, key field confirmation or missing-evidence acknowledgement for high-impact tasks. | Evidence interaction log |
| Over-correction or under-trust | Calibration and gold cases | Train reviewers on cases where AI is right, wrong, partially right and unsupported. | Calibration score and drift trend |
| Blind spots in sampling | Sentinel and stratified QA | Include known tricky cases, edge segments, languages, products and customer vulnerability markers. | QA sample design and hit rate |
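
The blind first-pass control in the first row can be sketched as a small state holder that refuses to reveal the AI recommendation until a preliminary decision exists, then records the delta. The class and field names are hypothetical; a production workspace would enforce the same ordering in the UI and the workflow engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlindReview:
    ai_recommendation: str
    preliminary_decision: Optional[str] = None
    final_decision: Optional[str] = None

    def record_preliminary(self, decision: str) -> None:
        self.preliminary_decision = decision

    def reveal_ai(self) -> str:
        # The recommendation stays hidden until the reviewer has committed a first view.
        if self.preliminary_decision is None:
            raise PermissionError("record a preliminary decision before viewing the AI output")
        return self.ai_recommendation

    def record_final(self, decision: str, reason_code: str) -> dict:
        if self.preliminary_decision is None:
            raise PermissionError("preliminary decision missing")
        if not reason_code:
            raise ValueError("reason code required")
        self.final_decision = decision
        changed = decision != self.preliminary_decision
        return {
            "preliminary": self.preliminary_decision,
            "ai": self.ai_recommendation,
            "final": decision,
            "reason_code": reason_code,
            "changed_after_reveal": changed,
            # A change that lands on the AI answer is a candidate anchoring signal.
            "moved_toward_ai": changed and decision == self.ai_recommendation,
        }
```

The returned record corresponds to the "preliminary decision, final decision delta" evidence named in the table; a high share of `moved_toward_ai` is a prompt for QA, not proof of bias.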

4.3 Calibrated Trust Design

Trust calibration is the match between reliance and actual capability under the current task, evidence and risk condition.

| Trust state | Symptom | Control |
| --- | --- | --- |
| Over-trust | Accept rate rises while evidence-open rate drops. | Evidence-first review, reason-code friction, second-line QA and management metric reset. |
| Under-trust | Operators ignore useful AI and duplicate all work manually. | Training with model strengths, clear source support, workflow integration and feedback response. |
| Mis-trust | Operators trust AI for the wrong tasks, such as policy exceptions or adverse action language. | Scope boundaries, task-specific affordances and prohibited-use controls. |
| Calibrated trust | Reliance varies by evidence strength, risk tier and reversibility. | Confidence decomposition, risk-tiered workflow, QA sampling and continuous calibration. |
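
Confidence decomposition can be illustrated by keeping the four components named in this note as separate fields and surfacing the weakest one, instead of collapsing them into a single badge. The 0..1 scale and the `limiting_component` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class ConfidenceComponents:
    model_confidence: float    # the model's own score
    retrieval_support: float   # how well retrieved sources back the output
    policy_certainty: float    # how unambiguous the applicable policy is
    data_completeness: float   # share of required fields that are present

    def limiting_component(self) -> Tuple[str, float]:
        """Return the name and value of the weakest component."""
        parts = asdict(self)
        name = min(parts, key=parts.get)
        return name, parts[name]
```

Showing the reviewer that, say, retrieval support is the limiting factor directs attention to the evidence rather than to a single fluent-looking score.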

5. Financial Retail Scenarios

5.1 AML Investigator Copilot

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Summarizes alerts, clusters transactions, retrieves prior SAR narratives, identifies typology matches and drafts investigation notes. |
| Operator burden | High evidence volume, fragmented systems, deadline pressure and repetitive narrative writing. |
| Automation bias risk | Investigator accepts AI "close as false positive" recommendation because the summary looks complete. |
| Controls | Evidence-first review, mandatory typology evidence, reason code for close, senior review for high-risk customer, sentinel QA for false-negative patterns. |
| Metrics | evidence-open rate, close override rate, SAR escalation precision, QA miss rate, backlog age, alert fatigue index. |
| Control evidence | alert trace, sources used, model version, investigator action, close rationale, second-line sample and calibration outcome. |

5.2 Credit Underwriter Assist

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Builds credit memo, flags missing documents, summarizes income, debt, collateral and policy exceptions. |
| Operator burden | Policy interpretation, fair lending sensitivity, adverse action reason quality and exception handling. |
| Automation bias risk | Underwriter over-relies on model score and misses contradictory evidence or prohibited variable proxy. |
| Controls | Independent policy checklist, adverse action reason validation, feature/proxy warning, second review for exceptions, fair lending sample. |
| Metrics | policy exception accuracy, adverse action defect rate, override validity, protected-class proxy investigation rate, QA disagreement. |
| Control evidence | memo version, evidence references, reason code, reviewer authority, override rationale, second-line QA and model risk sign-off sample. |

5.3 Contact Center Agent Assist

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Real-time answer suggestions, account status lookup, call summary, next best action and knowledge retrieval. |
| Operator burden | Simultaneous listening, reading, compliance scripting, empathy and system navigation. |
| Automation bias risk | Agent reads a suggested answer without checking source or customer context. |
| Controls | concise evidence cards, prohibited phrase detection, customer vulnerability escalation, real-time fallback, sampled call QA. |
| Metrics | suggestion acceptance with source-open rate, handle time, transfer rate, complaint after contact, script compliance, correction rate. |
| Control evidence | call segment, suggestion, source link, agent edit, customer-facing text, transcript marker and QA result. |

5.4 Complaints Copilot

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Classifies complaint type, extracts allegations, drafts acknowledgement and response, tracks deadlines. |
| Operator burden | Regulatory deadlines, legal language, emotional context, root-cause analysis and remediation tracking. |
| Automation bias risk | Reviewer accepts a polished draft that under-admits issue severity or misses required rights language. |
| Controls | deadline-first queue, complaint severity checklist, legal/compliance escalation trigger, evidence sufficiency gate, final response QA. |
| Metrics | response defect rate, missed allegation rate, deadline breach, escalation accuracy, customer reopen rate. |
| Control evidence | complaint taxonomy, allegation map, evidence bundle, draft edits, approval chain, customer communication version. |

5.5 Fraud Intervention

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Scores account takeover risk, recommends block or step-up, drafts customer outreach, explains signals. |
| Operator burden | Time-critical decision, false-positive customer friction, fraud loss exposure and live-channel pressure. |
| Automation bias risk | Operator accepts high fraud score without considering customer travel or recent verified behavior. |
| Controls | reversible action preference, signal decomposition, customer contact path, real-time supervisor escalation for high-loss cases. |
| Metrics | false positive rate, fraud loss prevented, customer friction rate, block reversal rate, decision latency. |
| Control evidence | model score components, tool action request, human approval, customer verification status, downstream account action. |

5.6 Collections Hardship

| Dimension | Architecture design |
| --- | --- |
| AI assistance | Identifies hardship indicators, suggests available programs, drafts empathetic scripts and repayment options. |
| Operator burden | Emotional labor, policy exceptions, vulnerability signals and jurisdictional constraints. |
| Automation bias risk | Agent follows a repayment recommendation that is inappropriate for the customer's hardship status. |
| Controls | vulnerability-first routing, affordability evidence checklist, prohibited pressure language detection, supervisor escalation for complex hardship. |
| Metrics | hardship identification rate, complaint rate, script compliance, repayment plan suitability, customer repeat contact. |
| Control evidence | hardship signal, program eligibility facts, script version, agent edits, customer consent and supervisor review where applicable. |

6. Metrics, Control And Evidence Model

6.1 Metric Families

| Metric family | Example metrics | What it detects |
| --- | --- | --- |
| Operator load | cases per hour, average handling time, queue age, interruption rate, complex-case streak, after-hours work | overload, fatigue, unsustainable control design |
| Evidence behavior | source-open rate, missing-evidence acknowledgement, citation support, policy conflict review | shallow review and weak evidence use |
| Automation reliance | accept rate, edit depth, override rate, blind-pass delta, AI-human disagreement | over-trust, under-trust and anchoring |
| Quality | QA defect rate, second-review disagreement, missed escalation, gold-case score, calibration drift | human judgment drift and training gaps |
| Customer impact | complaint reopen, adverse outcome defect, fraud false positive, collections complaint, contact center repeat contact | harm and recovery quality |
| Control operation | mandatory review completion, SLA breach, escalation timeliness, authority violations, safe-stop activation | whether controls actually operated |
| Improvement loop | defect closure time, knowledge update cycle time, eval case creation, retraining trigger response | whether learning loops are real |

6.2 Control Evidence Packet

| Evidence object | Required content | Audit question it answers |
| --- | --- | --- |
| Work item record | case id, risk tier, source channel, customer impact, SLA | Why did this item enter this workflow? |
| AI trace | model, prompt, RAG query, retrieved sources, tool calls, confidence components | What did the AI use and produce? |
| Evidence bundle | authoritative sources, timestamps, policy versions, missing evidence flags | What evidence was available to the human? |
| Operator interaction log | evidence opened, AI output viewed, edits, time on task, action selected | Did the human perform meaningful review? |
| Decision record | final action, reason code, authority, override or escalation rationale | Who decided what and why? |
| QA sample | sample frame, reviewer independence, result, defect class, adjudication | How was control quality tested? |
| Training record | role, skill certification, calibration score, retraining completion | Was the human qualified for this task? |
| Governance record | issue owner, remediation, release gate, residual risk acceptance | Was the operating risk managed? |
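
A completeness check over the packet might look like the sketch below. The snake_case keys are hypothetical renderings of the table's "Required content" column, and the packet is assumed to be a plain nested dictionary; the real ledger schema is not specified in this note.

```python
from typing import Dict, List

# Hypothetical field names derived from the evidence packet table.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "work_item_record": ["case_id", "risk_tier", "source_channel", "customer_impact", "sla"],
    "ai_trace": ["model", "prompt", "rag_query", "retrieved_sources", "tool_calls",
                 "confidence_components"],
    "evidence_bundle": ["authoritative_sources", "timestamps", "policy_versions",
                        "missing_evidence_flags"],
    "operator_interaction_log": ["evidence_opened", "ai_output_viewed", "edits",
                                 "time_on_task", "action_selected"],
    "decision_record": ["final_action", "reason_code", "authority", "rationale"],
    "qa_sample": ["sample_frame", "reviewer_independence", "result", "defect_class",
                  "adjudication"],
    "training_record": ["role", "skill_certification", "calibration_score",
                        "retraining_completion"],
    "governance_record": ["issue_owner", "remediation", "release_gate",
                          "residual_risk_acceptance"],
}


def missing_evidence(packet: Dict[str, dict]) -> List[str]:
    """List absent evidence objects and absent fields within present objects."""
    gaps: List[str] = []
    for obj, fields in REQUIRED_FIELDS.items():
        record = packet.get(obj)
        if record is None:
            gaps.append(obj)  # the whole evidence object is missing
            continue
        gaps.extend(f"{obj}.{name}" for name in fields if name not in record)
    return gaps
```

An empty result means the packet is structurally complete; it says nothing about whether the recorded review was meaningful, which still needs QA.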

6.3 Sampling Model

Sampling must align with error cost, not just volume.

| Sampling type | Use | Example |
| --- | --- | --- |
| 100 percent review | irreversible or high-impact actions | credit adverse action reason, account freeze, formal complaint final response |
| Risk-based sample | model score or workflow signal is reliable but not complete | fraud interventions below threshold but with unusual geography |
| Stratified sample | risk varies by product, channel, language, region or customer segment | contact center agent assist answers across English, Spanish and vulnerable-customer signals |
| Sentinel sample | known difficult or high-risk cases | AML typology edge cases, policy conflicts, tricky hardship conversations |
| Blind second review | detect anchoring and groupthink | credit memo recommendations and AML alert closure decisions |
| Incident surge sample | production defect, policy change or model drift | stale RAG policy discovered in complaints response drafting |
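
To show how sentinel and stratified sampling differ from plain random QA, the sketch below always includes sentinel cases and then draws a fixed number per stratum. The case fields (`sentinel`, `product`, `language`), the stratum key and the fixed per-stratum quota are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def draw_qa_sample(cases: List[dict], per_stratum: int, seed: int = 0) -> List[dict]:
    """Return all sentinel cases plus up to per_stratum random cases from each stratum."""
    rng = random.Random(seed)  # seeded so the sample can be reproduced for audit
    selected = [c for c in cases if c.get("sentinel")]

    # Group the remaining cases by (product, language) so small segments are not drowned out.
    strata: Dict[tuple, List[dict]] = defaultdict(list)
    for c in cases:
        if not c.get("sentinel"):
            strata[(c["product"], c["language"])].append(c)

    for members in strata.values():
        k = min(per_stratum, len(members))
        selected.extend(rng.sample(members, k))
    return selected
```

With a purely random draw, a segment that is 2 percent of volume would rarely appear in a small sample; the per-stratum quota guarantees it is reviewed.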

6.4 Control Threshold Examples

| Signal | Threshold | Action |
| --- | --- | --- |
| Evidence-open rate for P1 cases below 90 percent | two consecutive business days | supervisor review, targeted coaching and UI friction increase |
| AI accept rate above 95 percent with low edit depth | weekly trend | automation bias investigation and blind sample expansion |
| QA defect rate above 3 percent for customer-visible drafts | rolling two-week sample | release rollback review and knowledge/prompt correction |
| Escalation rate drops by 50 percent after AI launch | monthly comparison | check for suppressed escalations and revise incentives |
| Queue age breaches SLA for high-risk work | same day | surge staffing, intake throttling or safe-stop rule |
| Gold-case calibration below 85 percent | per reviewer or team | restrict high-risk queue access until recalibrated |
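
Four of these example thresholds can be expressed as simple rules over a metrics snapshot, as sketched here. The dictionary keys are hypothetical, and "low edit depth" is passed in as a boolean because the note does not quantify it.

```python
from typing import List


def triggered_actions(m: dict) -> List[str]:
    """Map a metrics snapshot to the control actions from the threshold table."""
    actions: List[str] = []

    # Evidence-open rate for P1 below 90 percent on the two most recent business days.
    last_two = m.get("p1_evidence_open_rate_by_day", [])[-2:]
    if len(last_two) == 2 and all(rate < 0.90 for rate in last_two):
        actions.append("supervisor review, targeted coaching and UI friction increase")

    # Accept rate above 95 percent combined with low edit depth (flag supplied by caller).
    if m.get("ai_accept_rate_weekly", 0.0) > 0.95 and m.get("low_edit_depth", False):
        actions.append("automation bias investigation and blind sample expansion")

    # QA defect rate above 3 percent on the rolling two-week sample.
    if m.get("qa_defect_rate_two_week", 0.0) > 0.03:
        actions.append("release rollback review and knowledge/prompt correction")

    # Gold-case calibration below 85 percent for the reviewer or team.
    if m.get("gold_case_calibration", 1.0) < 0.85:
        actions.append("restrict high-risk queue access until recalibrated")

    return actions
```

Missing metrics default to values that do not fire a rule, which is a deliberate simplification; a stricter design could treat a missing metric as a control failure in its own right.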

7. Anti-Patterns And Failure Modes

| Anti-pattern | Why it looks attractive | Failure mode |
| --- | --- | --- |
| "Human in the loop" as a single approval step | Easy to explain to executives and auditors | Human lacks time, skill, evidence or authority to challenge AI. |
| All high-risk cases to one queue | Appears conservative | Queue overload causes delay, missed deadlines and superficial review. |
| Default accept button | Improves handling time | Creates anchoring and rubber-stamping. |
| One confidence badge | Simple UI | Hides whether confidence comes from model score, retrieval support, policy certainty or data completeness. |
| Throughput-only productivity target | Shows ROI quickly | Encourages shallow review and suppressed escalation. |
| Low override rate celebrated | Looks like AI quality | May indicate automation bias or fear of challenging the system. |
| Training sign-off only | Easy compliance artifact | Does not prove operators can detect difficult failures. |
| Random QA only | Statistically neat | Misses rare high-cost cases and minority segment harms. |
| AI output above evidence | Feels efficient | Reviewers read conclusion first and search for confirming evidence. |
| No safe-stop authority | Keeps automation running | Operators cannot stop a harmful route during incident conditions. |
| Hidden AI assist | Avoids customer concern | Makes responsibility, disclosure, audit and root cause unclear. |
| Control evidence scattered across tools | Avoids integration cost | Audit cannot reconstruct why a decision happened. |

8. Architecture Mapping To RAG / Agent / Copilot / Eval / Governance

| Architecture pattern | Human factors risk | Required architecture move | Evidence |
| --- | --- | --- | --- |
| RAG | Unsupported or stale retrieved content becomes fluent advice. | Source authority ranking, evidence sufficiency gate, citation support, policy freshness monitoring. | retrieval trace, source version, citation QA, stale-source incident log |
| Agent | Tool actions bypass careful human decision or increase time pressure. | Action policy engine, pre-action review, reversible action preference, authority check and safe-stop. | tool call proposal, approval trace, action result, rollback record |
| Copilot | Operator accepts drafts without reading or editing. | Evidence-first layout, no default accept, edit tracking, challenge prompt and QA sampling. | accept/edit metrics, source-open rate, draft diff, QA defect class |
| Eval | Offline accuracy hides workload and automation bias. | Add human factors evals: review time, disagreement, evidence use, missed escalation and calibration. | eval set, reviewer protocol, inter-rater agreement, production comparison |
| Governance | Policies exist but do not operate at runtime. | Connect risk tier, decision rights, training, QA, sampling and observability to release gates. | RACI, control matrix, release approval, dashboard, management review |
| Observability | Model logs do not show human burden or review quality. | Instrument workflow traces, operator events, queue metrics and evidence ledger. | OpenTelemetry trace ids, queue dashboard, audit packet |
| Data product | Human corrections do not improve knowledge or model behavior. | Structured feedback taxonomy, defect owner, knowledge update SLA and eval case creation. | feedback log, correction ticket, updated corpus, regression eval |

9. ADR Draft

ADR-165: Adopt A Human Factors Operations Control Plane For High-Impact AI Workflows

| Field | Decision |
| --- | --- |
| Status | Proposed for portfolio architecture review |
| Context | Financial retail AI use cases rely on human operators to review AML narratives, credit memos, fraud interventions, complaint responses, contact center suggestions and hardship scripts. Existing HITL patterns do not adequately manage cognitive load, automation bias, calibrated trust, decision rights, QA sampling and audit evidence. |
| Decision | Implement a human factors operations control plane across high-impact AI workflows. The control plane includes risk-tiered work intake, cognitive load estimation, skill-based routing, evidence-first reviewer workspace, automation bias controls, decision-right enforcement, second-line QA, sampling, calibration, training loops and evidence ledger. |
| Drivers | Customer harm prevention, regulatory defensibility, operational resilience, reviewer capacity, model risk management, audit replay, production incident response and executive accountability. |
| Selected option | Central control-plane pattern integrated with workflow orchestration, AI gateway, reviewer workspace, QA tooling and observability. |
| Alternatives considered | Local UI-only warnings; generic human approval step; post-hoc QA without runtime routing; full automation with exception review only. |
| Why selected | The selected option treats human judgment as a managed production capacity and provides runtime controls plus evidence. It reduces the risk that human review becomes a bottleneck or rubber stamp. |
| Consequences | Requires instrumentation, reviewer training, authority matrix, workflow changes, QA operations and management reporting. It may reduce short-term automation ROI but improves sustainable adoption and defensibility. |
| Scope | AML, credit, fraud, complaints, collections hardship, contact center agent assist and any AI workflow with customer-visible or customer-impacting output. |
| Non-goals | This ADR does not approve a specific model, vendor, regulatory interpretation or legal position. It defines the architecture pattern for human factors operations. |

Acceptance Criteria

| Criterion | Evidence |
| --- | --- |
| Every high-impact AI workflow has a review unit definition and risk tier. | workflow catalog and risk-tier map |
| Review routing uses skill, authority, capacity and independence rules. | routing configuration and role matrix |
| Reviewer workspace exposes evidence, uncertainty, missing data and allowed actions. | UI review checklist and trace sample |
| Automation bias controls are implemented for P0/P1 tasks. | blind pass logs, no-default-accept proof, reason-code logs |
| QA sampling covers risk, volume, segments and sentinel cases. | sampling plan and QA report |
| Training and calibration affect queue eligibility. | certification records and access control linkage |
| Control evidence can be replayed for a sample decision. | evidence packet and audit replay script |

10. Interview Answer

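30-Second Version

AI human factors are not a UI problem; they are a production control architecture problem. In financial retail, human review carries risk absorption and final judgment, yet human attention and skill are finite, and fatigue and automation bias wear them down. I would design this as a human factors operations control plane: tier by risk, estimate load, route by skill, put evidence first, remove default acceptance, define escalation and second-line QA, and record complete evidence. Only then can we show that the human is not a rubber stamp but a control that actually reduces customer harm and model risk.

2-Minute Version

I would first define the review unit, such as an AML alert close, credit memo, complaint response draft, fraud block request or contact center answer. Then I would tier by customer impact, regulatory sensitivity, reversibility and error cost. Each tier maps to a different AI role and human decision right: low-risk work can use sampled QA, while high-risk work needs evidence-first review, a mandatory reason code, no default accept and, where necessary, blind review or second review.

Architecturally I would build four capabilities. First, workload routing: route by skill, capacity, language, product, risk and deadline, and avoid piling every high-risk case into one queue. Second, automation bias control: for example, show evidence before the AI recommendation, require the reviewer to mark whether the evidence is sufficient, and record overrides and edit depth. Third, calibrated trust: show model confidence, retrieval support, policy certainty and data completeness separately instead of one misleading confidence badge. Fourth, an evidence ledger that links model version, prompt, retrieved sources, tool calls, human actions, reason codes, QA results and downstream impact.

In an interview I would stress that human-in-the-loop does not mean the control is effective. Whether a control is effective depends on whether the reviewer has the time, skill, evidence, independence, escalation authority and auditable record to exercise it. Otherwise it is only a compliance illusion.

CTO Version

I would treat human factors as part of the AI platform control plane, not as warning text that each product team adds on its own. The platform layer provides a risk-tier classifier, review policy engine, skill/capacity router, evidence bundle service, decision-right enforcement, QA sampling service and trace/evidence ledger. Business lines configure tasks, risk, permissions and SLAs.

Technically, the AI gateway, RAG provenance, agent tool policy, workflow engine, IAM, observability and QA data model need to be connected. An OpenTelemetry-style trace id runs through the case, retrieval, model invocation, tool proposal, human action and downstream system update. For governance, I would use the NIST AI RMF functions Govern / Map / Measure / Manage to organize the risk loop, and the management system language of ISO/IEC 42001 to define responsibility, competence, operational control, performance evaluation and continual improvement.

I would not only promise "we have human review." I would require answers to three CTO-level questions: Does the review control still operate at production peak? Can the evidence chain be replayed when customer harm occurs? Does the AI efficiency gain come at the cost of weakened human judgment? If these three questions cannot be answered, the AI system is not ready to expand its automation scope.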


11. 7-Day Practice Plan

| Day | Practice | Output |
| --- | --- | --- |
| 1 | Pick one workflow: AML copilot, credit assist, contact center assist, complaints, fraud or collections hardship. Define review unit, risk tiers and error-cost ladder. | one-page review unit map |
| 2 | Build an operator load map with volume, average handling time, evidence volume, policy ambiguity, interruption rate and fatigue triggers. | workload and capacity table |
| 3 | Design automation bias controls for P0/P1/P2 tasks, including blind pass, no default accept, reason codes and challenge prompts. | automation bias control matrix |
| 4 | Create a skill and decision-right matrix for frontline, specialist, supervisor, compliance and second-line QA roles. | authority and routing matrix |
| 5 | Draft a QA sampling plan with 100 percent review, risk-based sample, stratified sample, sentinel cases and incident surge sampling. | QA sampling plan |
| 6 | Define evidence packet fields and observability trace across AI output, retrieved sources, human action and downstream result. | evidence ledger schema |
| 7 | Prepare interview narrative and ADR summary. Practice answering as PM, architect and CTO. | 30-second, 2-minute and CTO answer |

12. Source Anchors

These anchors are used as architecture and operating model references. They are not legal, compliance, audit or model validation advice. Access date: 2026-06-30.

| Anchor | Link | How this note uses it |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern, Map, Measure and Manage as the lifecycle for human factors risk identification, monitoring, treatment and improvement. |
| NIST bias publication | https://www.nist.gov/blogs/taking-measure/powerful-ai-already-here-use-it-responsibly-we-need-mitigate-bias | Anchors the need to mitigate bias beyond the model, including use context, human decision processes and deployment controls. |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | Provides human-AI interaction principles that this note translates into operational controls for review, trust, escalation and recovery. |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/81230.html | Connects human factors controls to management system concepts: responsibility, operation, performance evaluation and continual improvement. |
| ISO/IEC/IEEE 42010 architecture description | https://www.iso.org/standard/74393.html | Supports treating human factors as architecture views, stakeholders, concerns, decisions and evidence, not isolated UI guidance. |
| OpenTelemetry Documentation | https://opentelemetry.io/docs/ | Supports runtime traces, metrics and logs that connect AI output, human action, queue state and downstream impact. |