AI Human Factors Operations / Cognitive Load / Automation Bias Playbook
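Positioning: For senior AI PM / AI BA / Product Architect / Enterprise Architect / Operations Lead / Model Risk / Compliance / Internal Audit roles. It upgrades AI human factors risk from "training and reminders" to an operating architecture that can be designed, routed, measured, audited and continuously improved.
Scope: AML investigator copilot, credit underwriter assist, contact center agent assist, complaints copilot, fraud intervention, collections hardship, KYC review, payment dispute, financial retail internal knowledge assistant.
Important note: This document is learning, portfolio and internal governance training material. It is not legal advice, a compliance conclusion, an audit opinion, a model validation report, a regulatory interpretation or a production approval. Formal projects must be confirmed by Legal, Compliance, Risk, Model Risk, Internal Audit, Security, Privacy, Business Owner, Operations, Workforce Planning and management, taking into account institution type, jurisdiction, customer impact and internal policy.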
1. Purpose And When To Use
1.1 Purpose
This playbook helps teams design AI operations where human attention, judgment and authority are treated as production architecture. It gives practical templates for operator load, automation bias, trust calibration, QA sampling, escalation and evidence packets.
Use it when the AI system:
- assists regulated or customer-impacting decisions;
- drafts customer-visible messages;
- recommends case closure, escalation, approval, denial, freeze, refund or outreach;
- changes workload shape for frontline or specialist operators;
- relies on human review as a risk control;
- creates risk of over-trust, under-trust, alert fatigue, review fatigue or decision anchoring;
- must produce control evidence for audit, model risk, compliance or management review.
1.2 When To Use In Delivery
| Delivery point | How to use the playbook | Required output |
|---|---|---|
| Discovery | Identify human work, risk tier, decision rights and cognitive load drivers. | operator load map and PM/BA/architecture question log |
| Solution design | Design bias controls, routing, escalation, QA and evidence ledger. | control matrix, escalation design and evidence packet |
| Pilot readiness | Validate workload assumptions, reviewer training and calibration. | pilot runbook, QA sample and calibration report |
| Release gate | Confirm controls operate under production volume and incident conditions. | release checklist and management sign-off packet |
| Post-release | Monitor load, reliance, quality, customer impact and control drift. | dashboard, issue log and improvement backlog |
1.3 Core Principle
Do not ask humans to be the control unless the system gives them time, skill,
evidence, independence, authority, escalation paths and proof that their review
changed risk outcomes.
2. Operating Model
2.1 End-To-End Flow
AI-assisted work item
-> risk, impact and reversibility classification
-> operator load estimate
-> review policy decision
-> skill, authority and capacity routing
-> evidence-first workspace
-> automation bias controls
-> human decision
-> escalation, override or downstream action
-> evidence packet
-> QA sampling and calibration
-> training, workflow, RAG, prompt, model and policy improvement
2.2 Operating Roles
| Role | Responsibility | Decision rights |
|---|---|---|
| Business owner | Owns customer outcome, workflow scope, risk appetite and value case. | Approves use case scope and residual operational risk. |
| Product manager | Defines AI assistance boundaries, user workflow, adoption goals and success metrics. | Prioritizes controls, tradeoffs and release criteria. |
| CBAP / BA | Converts tasks, exceptions, decisions and evidence needs into requirements. | Accepts workflow requirements and scenario coverage. |
| AI architect | Designs AI gateway, RAG, agent policy, workflow integration, trace and evidence model. | Approves architecture pattern and integration controls. |
| Operations lead | Owns staffing, queue design, SLA, training, surge, fatigue and handoff. | Approves operating readiness and capacity plan. |
| Risk / compliance | Reviews regulatory sensitivity, control design, customer harm and escalation. | Approves control adequacy or requires additional mitigations. |
| Model risk / eval owner | Defines model, prompt, retrieval and human factors eval. | Approves eval coverage and residual model risk view. |
| Second-line QA | Tests review quality, independence, sampling and calibration. | Opens defects, escalates control failure and validates closure. |
| Internal audit | Reconstructs evidence and assesses control operation. | Challenges evidence sufficiency and governance effectiveness. |
2.3 Operating Cadence
| Cadence | Meeting or process | Inputs | Decisions |
|---|---|---|---|
| Daily during pilot | Queue and defect standup | backlog, SLA, evidence-open rate, accept/edit/escalate, defects | throttle, surge, coaching, safe-stop |
| Weekly | Human factors quality review | QA samples, gold cases, override validity, customer impact | adjust routing, update training, add eval cases |
| Biweekly | Product and architecture control review | workflow telemetry, trace gaps, RAG failures, agent tool issues | backlog priority and release fixes |
| Monthly | Governance review | trend dashboard, incidents, policy changes, audit samples | residual risk, expansion approval, management action |
| Quarterly | Calibration and role certification | gold-case performance, drift, new policy scenarios | queue eligibility and refresher training |
3. Template: Operator Load Map
Use this template before sizing AI benefits. It prevents teams from transferring effort from one role to a more expensive or fragile human control without recognizing it.
| Workflow step | Operator role | AI assistance | Volume signal | Load drivers | Fatigue triggers | Risk if overloaded | Controls | Evidence |
|---|---|---|---|---|---|---|---|---|
| AML alert narrative | AML investigator | transaction summary, typology retrieval, draft rationale | daily alerts by risk tier and alert type | high evidence volume, SAR history, entity links, policy ambiguity | complex-case streak, end-of-day deadlines, repeated false positives | false close, weak SAR rationale, missed escalation | evidence bundle, close reason, senior review for high-risk, sentinel QA | alert trace, sources opened, reason code, QA sample |
| Credit memo review | underwriter | income summary, exception detection, memo draft | applications by product and channel | policy interpretation, fair lending sensitivity, adverse action language | high queue age, repeated exceptions, policy changes | unsupported approval, unfair denial, weak adverse action reason | evidence-first memo, policy checklist, second review for exceptions | memo diff, policy references, reviewer authority |
| Live contact center answer | agent | real-time suggested answer and summary | contacts per hour and handle-time band | listening while reading, customer emotion, account navigation | long call streak, escalation pressure, chat concurrency | wrong customer answer, missed complaint, unapproved commitment | concise source cards, prohibited phrase guard, transfer trigger | transcript marker, suggestion, source-open log |
| Complaint final response | complaint specialist | allegation extraction, deadline tracker, draft response | complaint age and regulatory deadline | legal wording, emotional content, root-cause analysis | deadline clusters, repeated severe cases | missed allegation, deadline breach, poor remediation | severity checklist, compliance escalation, final QA sample | allegation map, approval chain, response version |
| Fraud block review | fraud analyst | risk signal explanation, action recommendation | real-time alerts by loss exposure | time pressure, false positive cost, customer friction | alert bursts, high-value cases, tool latency | wrongful block, missed fraud, customer harm | signal decomposition, reversible action preference, supervisor escalation | model signal, approval trace, downstream action |
| Collections hardship conversation | collections specialist | hardship detection, program match, script draft | hardship signals and delinquency stage | emotional labor, vulnerability, affordability evidence | difficult-call streak, aggressive target pressure | unsuitable plan, prohibited language, complaint | vulnerability routing, affordability checklist, language QA | hardship signal, program eligibility, script edits |
Load calculation for a release gate:
required_review_hours =
AI_assisted_items
* review_rate
* average_handling_minutes
* complexity_multiplier
* double_review_multiplier
* rework_multiplier
/ 60
Capacity sanity check:
available_review_hours =
reviewers
* productive_hours_per_shift
* skill_match_percentage
* attendance_factor
Release rule:
available_review_hours must exceed required_review_hours
plus a surge reserve for incidents, policy changes and volume spikes.
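The load calculation, capacity check and release rule above can be sketched in Python. This is a minimal illustration, not a sizing tool. Several things here are assumptions that do not come from the playbook: the class and function names, the treatment of `skill_match_percentage` as a fraction between 0 and 1, the 20 percent default surge reserve, and all example figures.

```python
from dataclasses import dataclass


@dataclass
class ReviewDemand:
    ai_assisted_items: int
    review_rate: float                 # fraction of items routed to human review (0..1)
    average_handling_minutes: float
    complexity_multiplier: float = 1.0
    double_review_multiplier: float = 1.0
    rework_multiplier: float = 1.0


@dataclass
class ReviewCapacity:
    reviewers: int
    productive_hours_per_shift: float
    skill_match_fraction: float        # assumption: skill_match_percentage as 0..1
    attendance_factor: float           # 0..1


def required_review_hours(d: ReviewDemand) -> float:
    """Product of the demand factors, converted from minutes to hours."""
    minutes = (
        d.ai_assisted_items
        * d.review_rate
        * d.average_handling_minutes
        * d.complexity_multiplier
        * d.double_review_multiplier
        * d.rework_multiplier
    )
    return minutes / 60


def available_review_hours(c: ReviewCapacity) -> float:
    """Reviewer hours after skill match and attendance are applied."""
    return (
        c.reviewers
        * c.productive_hours_per_shift
        * c.skill_match_fraction
        * c.attendance_factor
    )


def passes_release_rule(d: ReviewDemand, c: ReviewCapacity,
                        surge_reserve_fraction: float = 0.2) -> bool:
    """True when capacity exceeds demand plus the (hypothetical) surge reserve."""
    return available_review_hours(c) > required_review_hours(d) * (1 + surge_reserve_fraction)


# Illustrative figures only.
demand = ReviewDemand(ai_assisted_items=1200, review_rate=0.25,
                      average_handling_minutes=8, complexity_multiplier=1.5)
small_team = ReviewCapacity(reviewers=12, productive_hours_per_shift=6.0,
                            skill_match_fraction=0.75, attendance_factor=0.9)
large_team = ReviewCapacity(reviewers=20, productive_hours_per_shift=6.0,
                            skill_match_fraction=0.75, attendance_factor=0.9)

print(required_review_hours(demand))            # 60.0 hours of review demand
print(passes_release_rule(demand, small_team))  # False: about 48.6 hours available
print(passes_release_rule(demand, large_team))  # True: 81 hours against 72 required
```

Expressing the reserve as a fraction of required hours is one possible choice. A team could equally hold a fixed number of reserve hours per shift.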
4. Template: Automation Bias Control Matrix
| Bias risk | Control | Product implementation | Architecture implementation | Owner | Telemetry | Evidence |
|---|---|---|---|---|---|---|
| AI recommendation anchors reviewer | Evidence-first review | Show evidence and required fields before recommendation for P0/P1 tasks. | UI state machine blocks recommendation until preliminary review step is complete. | PM + Architect | preliminary decision delta, evidence-open rate | UI event sequence |
| Reviewer accepts by habit | No default accept | Accept, edit, reject and escalate are neutral actions with no preselected button. | Action API requires explicit action and reason code. | Product + Engineering | accept rate, edit depth, reason-code distribution | decision record |
| Fluency hides uncertainty | Confidence decomposition | Display model score, retrieval support, policy certainty and data completeness separately. | AI gateway returns confidence components and missing-evidence flags. | AI Architect | low-support answer rate, conflict flag count | AI trace |
| Speed metric suppresses challenge | Balanced scorecard | Performance view includes quality, valid overrides, escalation accuracy and missed-risk defects. | Dashboard joins queue, QA and customer impact data. | Ops Lead | throughput plus QA defect rate | management report |
| Reviewer fatigue reduces challenge | Fatigue-aware routing | Cap consecutive complex cases, then route the reviewer to a break or simpler work. | Queue engine tracks complexity streak and shift load. | Ops Lead | complex-case streak, defect by hour | queue log |
| AI output appears institutionally endorsed | Challenge prompt | High-risk accept requires answer to "What evidence would make this wrong?" | Reviewer action schema includes challenge response. | Risk + PM | challenge completion and defect correlation | decision packet |
| Exceptions treated as normal cases | Escalation trigger | Complaint, vulnerability, legal threat, PEP, sanctions, adverse action and hardship signals route differently. | Risk classifier emits escalation tags and blocks normal closure. | BA + Architect | escalation rate, missed escalation defects | escalation record |
| Team stops detecting model drift | Sentinel cases | Add known hard cases and model-weakness cases to QA and calibration. | QA service manages sentinel sample labels and adjudication. | Model Risk + QA | sentinel miss rate | calibration report |
| Human corrections disappear | Structured feedback loop | Operator selects defect type and desired correction. | Feedback creates knowledge, prompt, model or workflow ticket with owner and SLA. | Product + Data Owner | correction closure time | improvement ticket |
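Three rows of the matrix (no default accept, required reason code, challenge prompt on high-risk accept) describe checks that an action API could enforce server-side. The sketch below shows one way such a check might look. The `ReviewerAction` shape, the error strings, and the choice to treat P0 and P1 as the "high-risk" tiers are assumptions for illustration, not a specification from this playbook.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"accept", "edit", "reject", "escalate"}
HIGH_RISK_TIERS = {"P0", "P1"}  # assumption: which tiers count as high-risk


@dataclass
class ReviewerAction:
    risk_tier: str
    action: Optional[str]               # None means no action was explicitly chosen
    reason_code: Optional[str]
    challenge_response: Optional[str] = None


def validate_action(a: ReviewerAction) -> list[str]:
    """Return a list of validation errors; an empty list means the action may proceed."""
    errors = []
    # No default accept: the action must be explicitly chosen by the reviewer.
    if a.action not in ALLOWED_ACTIONS:
        errors.append("explicit action required")
    # Every action carries a reason code so QA can analyse the distribution.
    if not a.reason_code:
        errors.append("reason code required")
    # High-risk accept must answer "What evidence would make this wrong?"
    if (
        a.action == "accept"
        and a.risk_tier in HIGH_RISK_TIERS
        and not (a.challenge_response or "").strip()
    ):
        errors.append("challenge response required for high-risk accept")
    return errors


print(validate_action(ReviewerAction("P0", None, None)))
print(validate_action(ReviewerAction("P0", "accept", "EVIDENCE_SUPPORTS")))
print(validate_action(ReviewerAction("P2", "accept", "EVIDENCE_SUPPORTS")))
```

Enforcing these rules in the API rather than only in the UI keeps the control in place when another client or an integration submits the action.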
5. Template: Trust Calibration Script
Use this script in training, supervisor coaching and in-product microcopy. It teaches operators how to rely on AI neither too much nor too little.
| Moment | Approved script | Purpose | Avoid saying | Required evidence |
|---|---|---|---|---|
| AI appears in workflow | "This AI output is an assistant-generated recommendation. You remain accountable for the final action within your authority." | Clarify responsibility. | "The AI has reviewed this case." | user role, authority matrix |
| Evidence is strong | "The answer is supported by current policy source, system-of-record data and no detected conflict. Confirm the evidence before sending or acting." | Encourage efficient but verified reliance. | "High confidence means safe to accept." | source version, data freshness, conflict check |
| Evidence is incomplete | "The AI found partial support but missing information affects the decision. Resolve missing fields or escalate before final action." | Prevent unsupported action. | "Use your judgment" without a path. | missing field list, escalation rule |
| Recommendation is high impact | "For this action, first decide whether the evidence supports the action, then review the AI recommendation." | Reduce anchoring. | "AI recommends approval." as first visible item. | preliminary review step, recommendation reveal event |
| Operator disagrees | "Override is expected when evidence, policy or customer context contradicts the AI. Select a reason so QA can improve the system." | Normalize valid challenge. | "Overrides reduce AI adoption." | override reason, supporting source |
| Operator is unsure | "Escalate when evidence conflicts, authority is insufficient, customer vulnerability is present or the consequence is not reversible." | Convert uncertainty into governed escalation. | "Try to resolve it yourself." | escalation trigger and destination |
| Post-error coaching | "The goal is calibrated trust: use AI where evidence supports it, challenge it where evidence is weak, and escalate where authority or risk requires it." | Build durable mental model. | "Do not trust AI" or "trust the model." | QA defect, corrected example |
6. Template: QA Sampling Plan
6.1 Sample Design
| Sample layer | Coverage | Sampling unit | Minimum production use | Independence | Escalation trigger | Evidence |
|---|---|---|---|---|---|---|
| Mandatory QA | 100 percent for P0 actions | tool action, final response, adverse action reason | account freeze, formal complaint final response, high-risk credit exception | second-line or authorized senior reviewer | any critical defect | full evidence packet |
| Risk-based QA | elevated rate for high-risk P1 | case or recommendation | AML high-risk closure, fraud high-value alert, hardship plan | independent reviewer from same domain | defect rate above risk appetite | QA result and adjudication |
| Stratified QA | representative across segments | customer-visible answer or draft | language, channel, product, region, vulnerability, age of policy | QA team | segment defect spike | sample frame |
| Sentinel QA | fixed hard cases | known edge case | policy conflict, false confidence, vulnerable customer, PEP near match | model risk or QA owner | sentinel miss | gold label and coaching record |
| Blind second review | selected decisions | recommendation or memo | credit memo, AML close, complaints severity | reviewer cannot see first decision or AI recommendation initially | high disagreement or weak rationale | decision comparison |
| Incident surge QA | temporary expanded sample | affected workflow | stale knowledge, model release issue, prompt defect, vendor outage | incident QA cell | critical defect or unknown scope | incident evidence log |
6.2 Sampling Formula
daily_QA_sample =
mandatory_P0_items
+ max(risk_based_minimum, ceil(P1_volume * P1_sample_rate))
+ stratified_segment_minimums
+ sentinel_case_count
+ incident_surge_addon
Example baseline for pilot:
| Risk tier | QA treatment |
|---|---|
| P0 | 100 percent QA or dual approval before downstream action |
| P1 | 10 percent risk-based QA plus all overrides and all escalations |
| P2 | 3 percent stratified sample across channel, product, language and customer vulnerability |
| P3 | 1 percent random sample plus defect-triggered targeted sample |
| Sentinel | 20 to 50 gold cases per week depending on workflow complexity |
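As a worked instance of the sampling formula in 6.2, the Python sketch below computes a daily sample size using the 10 percent P1 rate from the pilot baseline. The function name, the dictionary of segment minimums, the daily sentinel count and all volumes are hypothetical values chosen for illustration.

```python
import math


def daily_qa_sample(
    mandatory_p0_items: int,
    p1_volume: int,
    p1_sample_rate: float,
    risk_based_minimum: int,
    stratified_segment_minimums: dict[str, int],
    sentinel_case_count: int,
    incident_surge_addon: int = 0,
) -> int:
    """Sum the sample layers; the P1 layer never drops below its floor."""
    p1_sample = max(risk_based_minimum, math.ceil(p1_volume * p1_sample_rate))
    return (
        mandatory_p0_items
        + p1_sample
        + sum(stratified_segment_minimums.values())
        + sentinel_case_count
        + incident_surge_addon
    )


segments = {"channel": 5, "language": 5, "vulnerability": 10}  # hypothetical minimums

# Normal day: ceil(425 * 0.10) = 43, which is above the floor of 30.
print(daily_qa_sample(12, 425, 0.10, 30, segments, sentinel_case_count=6))  # 81

# Low-volume day: ceil(95 * 0.10) = 10, so the floor of 30 applies.
print(daily_qa_sample(12, 95, 0.10, 30, segments, sentinel_case_count=6))   # 68
```

The baseline table also samples all P1 overrides and escalations. The formula does not model that, so those items would be added on top of the figure returned here.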
6.3 QA Defect Taxonomy
| Defect class | Description | Example |
|---|---|---|
| Unsupported claim | Output or human decision lacks authoritative evidence. | Contact center answer cites a retired fee policy. |
| Missed escalation | Case met escalation criteria but stayed in normal flow. | Complaint includes legal threat and is treated as servicing inquiry. |
| Automation bias | Human accepted AI despite visible contradiction or weak support. | AML alert closed while transaction cluster matched typology. |
| Under-reliance | Human ignored correct AI assistance and increased error or delay. | Underwriter manually rewrites accurate income summary and introduces discrepancy. |
| Authority breach | Reviewer approved action outside role or certification. | Agent approves fee waiver above limit. |
| Communication defect | Customer-visible language is misleading, non-compliant or harmful. | Collections script pressures a hardship customer. |
| Evidence packet defect | Audit cannot reconstruct decision. | Missing retrieved source version or reviewer reason code. |
7. Template: Escalation Design
| Trigger | Stop condition | Destination | SLA | Decision rights | Communication rule | Evidence |
|---|---|---|---|---|---|---|
| AI evidence conflicts with system-of-record data | customer-visible response or downstream action blocked | domain SME or supervisor | same business day for standard cases, immediate for live fraud or complaints risk | SME can approve, reject, request more evidence or safe-stop | tell frontline "evidence conflict requires specialist review" | conflict trace and source versions |
| Customer vulnerability or hardship signal | collections recommendation cannot be finalized | vulnerability-trained lead | same day | lead approves hardship path or escalation | use approved empathetic script, no pressure language | vulnerability signal, script, decision |
| Legal threat or regulator mention | complaint response blocked | complaints lead and compliance | same day or regulatory deadline-driven | compliance-trained role approves final response | no legal admission without approved review | allegation map, draft, approval |
| High-value fraud action | account block or release requires approval | fraud supervisor | real time or defined fraud SLA | supervisor approves reversible action or enhanced verification | customer contact follows fraud script | risk signals, action preview, approval |
| Credit adverse action uncertainty | adverse action reason blocked | senior underwriter or fair lending review | before decision notice | authorized underwriter approves reason codes | customer notice uses approved reason language | memo, policy, reason code |
| AML high-risk close | alert closure blocked | senior AML investigator | before case close | senior investigator approves close or escalation | internal narrative only, unless the governing workflow requires otherwise | alert evidence, typology, close rationale |
| Control failure trend | automation route paused | AI incident owner and business owner | immediate triage | business owner can pause route, risk can require additional controls | management notification follows incident protocol | dashboard signal, defect samples, action log |
Escalation design rules:
- Escalation is not a button unless it has a destination, SLA, receiving role, decision right and evidence requirement.
- High-risk uncertainty should stop or narrow automation, not simply add a note.
- Escalation volume is a signal. A sudden drop can be as concerning as a spike.
- Escalation outcomes must feed training, eval cases, knowledge updates and release gates.
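The first rule above lends itself to a configuration check: an escalation route is rejected at design time if any of its required parts is blank. The sketch below is one hypothetical way to express that. The `EscalationRoute` class and `missing_fields` helper are illustrative names, and the example values are taken from the vulnerability row of the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EscalationRoute:
    trigger: str
    destination: str
    sla: str
    receiving_role: str
    decision_right: str
    evidence_requirement: str


def missing_fields(route: EscalationRoute) -> list[str]:
    """Names of fields that are empty or whitespace, in declaration order."""
    return [f.name for f in fields(route) if not getattr(route, f.name).strip()]


complete = EscalationRoute(
    trigger="Customer vulnerability or hardship signal",
    destination="vulnerability-trained lead",
    sla="same day",
    receiving_role="vulnerability-trained lead",
    decision_right="lead approves hardship path or escalation",
    evidence_requirement="vulnerability signal, script, decision",
)

# A route that is "just a button": no SLA and no decision right defined.
button_only = EscalationRoute(
    trigger="Operator is unsure",
    destination="supervisor",
    sla="",
    receiving_role="supervisor",
    decision_right="",
    evidence_requirement="operator note",
)

print(missing_fields(complete))     # []
print(missing_fields(button_only))  # ['sla', 'decision_right']
```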
8. Template: Evidence Packet
| Artifact | Required fields | Generated by | Reviewer use | Audit question |
|---|---|---|---|---|
| Work item header | case id, workflow, risk tier, customer impact, SLA, jurisdiction, channel | workflow engine | understand priority and constraints | Why was this case handled this way? |
| AI output record | output text, recommendation, draft, model id, prompt id, timestamp, confidence components | AI gateway | compare output to evidence and scope | What did AI produce and under what version? |
| RAG evidence record | retrieved source ids, source authority, version, chunk ids, freshness, citation support | retrieval service | validate support and detect stale content | Which sources supported the output? |
| Tool observation record | system queried, parameters, result summary, latency, errors, permissions | tool gateway | verify system-of-record facts | What external facts or actions were used? |
| Human action record | view sequence, evidence opened, action, edit diff, reason code, challenge answer, time on task | reviewer workspace | demonstrate meaningful review | Did the human review or rubber-stamp? |
| Decision rights record | role, certification, authority limit, independence check, conflict-of-interest result | IAM and workflow policy | confirm reviewer was allowed to decide | Was the decision made by the right person? |
| Escalation record | trigger, destination, SLA, receiving role, outcome, final approver | workflow engine | track unresolved or high-risk work | Was escalation timely and effective? |
| QA record | sample frame, QA reviewer, defect class, severity, adjudication, remediation | QA system | assess control quality | Did second-line testing validate the control? |
| Improvement record | ticket id, owner, fix type, release id, regression eval, closure evidence | product backlog and governance tool | prove learning loop closure | Did the organization improve after defects? |
Evidence packet acceptance standard:
A qualified reviewer or auditor can reconstruct the decision without interviewing
the original operator, reading private chat, or relying on memory.
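A first automated check toward this standard is whether a packet contains every expected artifact and whether all artifacts share one trace id. The sketch below assumes each artifact is stored as a dictionary with a `trace_id` key. It also assumes the first six artifacts in the table are always required, while the escalation and QA records apply only when the case was escalated or sampled. Both assumptions are illustrative and the names are hypothetical.

```python
ALWAYS_REQUIRED = (
    "work_item_header",
    "ai_output_record",
    "rag_evidence_record",
    "tool_observation_record",
    "human_action_record",
    "decision_rights_record",
)


def packet_gaps(packet: dict, escalated: bool = False, qa_sampled: bool = False) -> list[str]:
    """List what is missing from an evidence packet; an empty list means no gaps found."""
    expected = list(ALWAYS_REQUIRED)
    if escalated:
        expected.append("escalation_record")
    if qa_sampled:
        expected.append("qa_record")

    gaps = [f"missing artifact: {name}" for name in expected if name not in packet]

    # Every artifact present must carry the same, non-empty trace id.
    trace_ids = {record.get("trace_id") for record in packet.values()}
    if len(trace_ids) > 1 or None in trace_ids:
        gaps.append("artifacts do not share one trace id")
    return gaps


packet = {name: {"trace_id": "trace-001"} for name in ALWAYS_REQUIRED}
print(packet_gaps(packet))                  # []
print(packet_gaps(packet, escalated=True))  # ['missing artifact: escalation_record']
```

Passing this check is necessary but not sufficient. It shows that the records exist and are linked, not that a reviewer could reconstruct the decision from their contents.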
9. PM / BA / Architecture Questions
9.1 PM Questions
| Question | Strong answer should include |
|---|---|
| Which human task is AI changing? | specific review unit, before/after workflow, expected time and quality impact |
| Where can AI reduce burden and where can it add burden? | distinction between summarization benefit, verification cost, escalation cost and documentation cost |
| What level of trust should users have? | trust by risk tier, evidence strength, reversibility and user skill |
| What metrics would prove adoption is healthy? | not only usage and handle time, but also evidence-open rate, valid override rate, escalation accuracy and customer impact |
| What happens if review volume exceeds capacity? | surge staffing, throttle, safe-stop, deferral rules and management notification |
| Which customer harms are unacceptable? | concrete harms such as wrongful block, unfair denial, missed complaint, coercive collections language |
9.2 BA Questions
| Question | Strong answer should include |
|---|---|
| What are the review units? | claim, draft, recommendation, tool action, case, sampled outcome |
| What evidence must the operator see? | source of truth, policy version, missing fields, conflicts, system facts and AI trace |
| What decisions can each role make? | accept, edit, override, escalate, approve, safe-stop and limits |
| What are the exception triggers? | legal threat, vulnerability, PEP, sanctions, adverse action, high-value fraud, policy conflict |
| What data must be captured? | reason code, evidence references, authority, edit diff, timing, escalation and QA outcome |
| What acceptance criteria prove meaningful review? | evidence interaction, correct decision, reason quality, escalation accuracy and audit replay |
9.3 Architecture Questions
| Question | Strong answer should include |
|---|---|
| Where is review policy enforced? | workflow engine or policy service, not only UI text |
| How is automation bias reduced at runtime? | evidence-first flow, no default accept, challenge prompt, confidence decomposition and QA |
| How do traces connect AI and human action? | shared trace id across model, retrieval, tool, reviewer action and downstream system |
| How are skill and authority enforced? | IAM, role certification, queue eligibility and action policy |
| How does the system respond to incident conditions? | route pause, sampling surge, rollback, communication and governance review |
| How does feedback improve the system? | defect taxonomy mapped to knowledge update, prompt change, model eval, workflow fix or training |
10. Release Checklist
10.1 Product And Workflow
- Review unit is defined for every AI-assisted workflow.
- Risk tiers include customer impact, financial impact, regulatory sensitivity and reversibility.
- AI role is explicit: summarize, retrieve, draft, recommend, classify or propose action.
- Human role is explicit: accept, edit, reject, override, approve, escalate or safe-stop.
- User-facing and employee-facing trust messages are approved for each risk tier.
- Recovery path exists for wrong answer, wrong action, missing evidence and customer complaint.
10.2 Operations And Capacity
- Operator load map includes volume, AHT, skill, complexity, fatigue and surge assumptions.
- Capacity exceeds required review hours with incident reserve.
- Skill routing covers domain, product, language, risk, authority and independence.
- Reviewer training includes AI failure modes, not only screen usage.
- Calibration uses gold cases and affects eligibility for high-risk queues.
- Queue dashboard includes backlog, SLA, complexity, fatigue and quality metrics.
10.3 Bias And Trust Controls
- P0/P1 workflows remove default accept.
- P0/P1 workflows use evidence-first or blind first-pass where anchoring risk is high.
- Confidence is decomposed into model, retrieval, policy and data-completeness signals.
- Reason code and evidence reference are required for high-impact accept, edit, reject, override or escalation.
- Management scorecards balance productivity with quality, escalation and customer impact.
- Sentinel cases and blind second reviews are active before expansion.
10.4 Architecture And Evidence
- AI gateway captures model, prompt, input, output, confidence components and policy decision.
- RAG layer captures source id, version, freshness, authority and citation support.
- Tool gateway captures parameters, permissions, result and downstream side effect.
- Reviewer workspace captures evidence opened, action, edit diff, reason and time on task.
- Shared trace id connects AI output, human action and downstream system.
- Evidence packet can be replayed by QA or audit without relying on memory.
10.5 Governance
- Risk, compliance, model risk, operations and business owner reviewed the control design.
- QA sampling plan covers risk-based, stratified, sentinel and incident surge samples.
- Escalation paths have destination, SLA, authority and communication rule.
- Safe-stop criteria are documented and tested.
- Residual risk and expansion criteria are approved by accountable owners.
- Post-release review date and dashboard owner are assigned.
11. Executive Narrative
11.1 One-Minute Executive Version
We should not describe this release as simply adding a human in the loop. The real control is whether qualified people can challenge AI under production workload. This playbook designs the human side as an operating architecture: risk-based routing, workload capacity, evidence-first review, automation bias controls, second-line QA, escalation rights and audit evidence.
The business value is sustainable automation. We can reduce manual effort where AI is reliable, but we avoid shifting hidden work to specialists or creating a rubber-stamp review queue. The release gate should ask three questions:
- Can operators handle the expected volume without fatigue-driven quality loss?
- Can they see enough evidence and authority boundaries to challenge AI?
- Can we prove through trace and QA that human review actually reduced risk?
11.2 Board / Audit Committee Version
The AI system relies on human oversight for customer-impacting workflows. Management has designed oversight as a measurable production control, not a general assurance statement. Controls include risk-tiered review, skill and authority routing, evidence-first workspaces, explicit override and escalation rights, QA sampling, calibration, training and traceable evidence packets.
Management will monitor operator load, automation reliance, evidence use, QA defects, customer impact and escalation performance. Expansion decisions will depend on control performance, not only efficiency or adoption metrics.
11.3 Product Portfolio Version
For the portfolio, this pattern becomes reusable across AML, credit, fraud, complaints, collections and contact center AI. Each use case configures its own risk tiers, skill matrix, evidence packet and QA sample, while the platform provides common trace, routing, reviewer actions, feedback taxonomy and governance dashboards.
12. Interview Drills
Drill 1: "Isn't human review enough to control AI risk?"
Strong answer:
No. Human review is only effective if the human has capacity, skill, evidence,
independence, authority and escalation rights. Otherwise it becomes a bottleneck
or rubber stamp. I would design review as an operating architecture with risk-tiered
routing, evidence-first workspace, automation bias controls, QA sampling and audit trace.
Drill 2: "How would you detect automation bias in production?"
Strong answer:
I would monitor accept rate, edit depth, evidence-open rate, override validity,
blind-pass delta, escalation trend and QA defects. A very high accept rate with low
evidence interaction is not automatically good. It may indicate anchoring or pressure
to accept AI. I would add sentinel cases and blind second reviews to validate.
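The monitoring idea in Drill 2 can be made concrete with a small telemetry check that flags a queue where acceptance is very high and evidence interaction is low. The sketch below is a simplification. The `ReviewEvent` shape and both thresholds (0.9 and 0.5) are hypothetical, and a flag is a prompt for sentinel cases and blind second review rather than proof of bias.

```python
from dataclasses import dataclass


@dataclass
class ReviewEvent:
    action: str            # "accept", "edit", "reject" or "escalate"
    evidence_opened: bool  # reviewer opened at least one source before acting


def reliance_signal(events: list[ReviewEvent],
                    accept_threshold: float = 0.9,
                    evidence_threshold: float = 0.5) -> dict:
    """Compute accept rate and evidence-open rate, and flag the risky combination."""
    total = len(events)
    if total == 0:
        return {"accept_rate": 0.0, "evidence_open_rate": 0.0, "investigate": False}
    accept_rate = sum(e.action == "accept" for e in events) / total
    evidence_open_rate = sum(e.evidence_opened for e in events) / total
    return {
        "accept_rate": accept_rate,
        "evidence_open_rate": evidence_open_rate,
        # High acceptance with little evidence interaction may indicate anchoring.
        "investigate": accept_rate >= accept_threshold
        and evidence_open_rate < evidence_threshold,
    }


rubber_stamp = [ReviewEvent("accept", False)] * 19 + [ReviewEvent("edit", True)]
engaged = [ReviewEvent("accept", True)] * 10

print(reliance_signal(rubber_stamp)["investigate"])  # True
print(reliance_signal(engaged)["investigate"])       # False
```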
Drill 3: "How do you reduce cognitive load for an AML investigator copilot?"
Strong answer:
I would not just shorten the summary. I would define the review unit, rank evidence
by authority, expose missing and conflicting evidence, group related transactions,
show typology support, require close reason codes and route high-risk cases to
senior investigators. The goal is to reduce navigation and narrative burden while
preserving independent investigation.
Drill 4: "What is the difference between confidence and calibrated trust?"
Strong answer:
Confidence is a system signal. Calibrated trust is a human behavior outcome.
In financial retail I would separate model confidence, retrieval support, policy
certainty and data completeness, then observe whether operators rely more when
evidence is strong and escalate when evidence is weak or risk is high.
Drill 5: "What would you tell a CTO before scaling a copilot?"
Strong answer:
I would ask for proof that controls work under load: queue capacity, skill routing,
no-default-accept for high-risk work, evidence trace, second-line QA, safe-stop,
and production telemetry connecting AI output to human action and downstream impact.
Scaling without that proof turns human review into control theater.
13. Reference Anchors
| Anchor | Link | Playbook use |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Organizes human factors governance, mapping, measurement and management. |
| NIST bias publication | https://www.nist.gov/blogs/taking-measure/powerful-ai-already-here-use-it-responsibly-we-need-mitigate-bias | Supports treating bias as a socio-technical deployment issue, not only a model metric. |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | Provides interaction principles translated here into operational review controls. |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Anchors AI management system thinking for responsibility, competence, operation and improvement. |
| ISO/IEC/IEEE 42010 | https://www.iso.org/standard/74393.html | Supports architecture description through stakeholder concerns, views, decisions and evidence. |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Anchors trace, metric and log design for AI-human workflow observability. |