
AI Uncertainty UX: Uncertainty Experience and Escalation Architecture

Source file:

586ai-foundations/papers/169-ai-uncertainty-ux-abstention-confidence-escalation-architecture.md

AI Uncertainty Experience Architecture: Abstention / Confidence / Escalation

Date: 2026-06-30
Status: evergreen
Audience: CBAP+ Senior BA / Financial Retail PM / Product Architect / Solution Architect / AI Governance Lead / Model Risk Partner / Contact Center Transformation Lead.
Output: A set of product architecture notes that connects uncertainty UX, abstention, confidence language, safe refusal, partial answer, request-more-info, human escalation, evidence and monitoring.


Source Anchors

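| Source | Link | How this article uses it |
| --- | --- | --- |
| A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification | https://arxiv.org/abs/2107.07511 | Technical anchor for uncertainty quantification. This article focuses on product, process, handoff and evidence architecture, and does not equate confidence with experience design. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize uncertainty risk, impact scenarios, monitoring and governance evidence. |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Uses the GenAI-specific risk perspective to design controls for hallucination, confabulation, misuse, human oversight and incident learning. |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | Draws on behavioral principles for human-AI interaction at initial use, during regular interaction, when errors occur and over long-term use, then translates them into a financial retail control architecture. |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | Brings uncertainty UX into the AI management system's policy, roles, operation, performance evaluation, audit and improvement. |
| ISO/IEC 23894 AI risk management | https://www.iso.org/standard/77304.html | Brings uncertainty, impact, controls, residual risk and retrospectives into the AI risk management lifecycle. |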

Source nuance:

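  • Conformal prediction, calibration, confidence intervals and model scores only describe the uncertainty signal at the technical layer. They cannot automatically decide what the customer sees, how an employee takes over, or how regulatory evidence is retained.
  • Human-AI interaction guidelines remind systems to express capability boundaries and support error recovery, but financial retail needs these principles implemented in a product state machine, policy engine, case workflow, approved language and audit trail.
  • NIST and ISO provide the risk management and management system structure. This article converts them into scenarios, rules, metrics and evidence packets that a PM / BA / Architect can execute.

In one sentence: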

AI uncertainty UX is not a tooltip. It is a decision-control architecture that decides when AI answers, qualifies, refuses, asks, escalates, records evidence and learns.


1. Why Uncertainty UX Is Architecture

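Many teams reduce the uncertainty experience to three things:

  • Show a percentage.
  • Add a line saying "AI may be wrong".
  • Hand off to a human when confidence is low.

In financial retail this is far from enough. Whether a customer is harmed is not decided by the model score alone, but by the whole service system:

  • Whether the user can understand the AI's capability boundaries.
  • Whether the AI knows which questions it can partially answer, which it must refuse, and which need more information.
  • Whether the system routes low-evidence, low-authorization, high-impact and highly regulation-sensitive interactions to the correct owner.
  • Whether the employee sees enough evidence, rather than only a confident summary.
  • Whether audit, complaints, model risk and business owners can reconstruct why the system chose to answer / abstain / escalate at the time.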

1.1 Architecture Question

Given a user intent, context, evidence state, model uncertainty, policy boundary and customer impact:
what should the system do next, who should own the next action,
what should the user see, and what evidence must be retained?

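This is not a UX copy problem. It is an architecture decision:

| Layer | Key design question | Consequence of failure |
| --- | --- | --- |
| Product policy | Which scenarios allow an answer, partial answer, refusal, document request or escalation | AI makes out-of-bounds commitments; customers are misled |
| Experience state | When the user sees the uncertainty, its reason, the next step and the path to a human | Users over-trust the AI or get stuck in an AI loop |
| Decision control | How confidence, evidence, policy and impact jointly determine the action | Low-evidence, high-impact scenarios are handled automatically |
| Workflow | Who takes over, SLA, handoff payload, case type | Failed escalation, repeated questioning, broken service |
| Evidence | How inputs, retrieval, model, rules, copy and approvals are recorded | Complaints, audits or model incidents cannot be explained |
| Monitoring | How abstention, override, appeal and harm signals feed back | Controls fail and nobody knows |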

1.2 Product Principle

confidence is an internal signal
+ uncertainty language is a user contract
+ abstention is a product action
+ escalation is an operating model
+ evidence is the accountability layer
= uncertainty architecture

2. Concept Diagram

flowchart TD
  A[User / Employee Intent] --> B[Context and Impact Assessment]
  B --> C[Evidence State]
  B --> D[Policy Boundary]
  B --> E[Model / Retrieval / Tool Confidence]
  C --> F[Uncertainty Decision Policy]
  D --> F
  E --> F
  F --> G{Action Class}
  G -->|answer| H[Answer with evidence and calibrated language]
  G -->|partial answer| I[Scope answer + missing facts + next step]
  G -->|ask more| J[Structured clarification or document request]
  G -->|safe refusal| K[Boundary explanation + allowed alternative]
  G -->|escalate| L[Case workflow + owner + SLA]
  G -->|block| M[Stop unsafe action + incident/control signal]
  H --> N[Telemetry and Evidence Packet]
  I --> N
  J --> N
  K --> N
  L --> N
  M --> N
  N --> O[Monitoring / Eval / Governance / CAPA]
  O --> F

Key idea: uncertainty architecture is a closed-loop system. It does not end when AI says "I am not sure"; it ends when the case reaches the right outcome, evidence is complete, and control owners can improve the policy.


3. Core Architecture Model

3.1 Runtime Components

| Component | Responsibility | Example output |
| --- | --- | --- |
| Intent and domain classifier | Identifies the question type, the business domain and whether it is customer-impacting | credit_eligibility_explanation, wealth_advice_boundary, aml_case_summary |
| Impact and sensitivity classifier | Assesses customer rights, funds, compliance, complaints, vulnerable customers and regulatory sensitivity | high_customer_impact, regulated_advice, complaint_signal |
| Evidence state service | Determines whether evidence is sufficient, sources are fresh and conflicts exist | complete, missing_income_doc, conflicting_policy_versions |
| Confidence signal aggregator | Aggregates retrieval score, model self-check, tool validation, schema validation and historical error rate | answer_confidence=medium, evidence_confidence=low |
| Policy and boundary engine | Decides allow / ask / refuse / escalate by product, role, channel, region and customer status | licensed_handoff_required |
| Response planner | Decides answer class, copy template, disclosure, citation and handoff payload | partial_answer_with_missing_facts |
| Approved language service | Provides auditable copy fragments and forbidden phrase scanning | safe_credit_language_v3 |
| Workflow router | Creates the case, queue, owner, SLA and handoff reason | payments_dispute_queue, wealth_advisor_queue |
| Evidence ledger | Stores prompt, retrieval refs, policy ids, tool calls, decision reason and output hash | uncertainty_event_id |
| Monitoring and eval loop | Observes abstention, appeal, override, harm, complaint and drift | threshold_review_needed |

3.2 Decision Inputs

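Do not rely on a single confidence_score. Financial retail scenarios need at least seven kinds of signal:

| Signal | What it means | Why it matters |
| --- | --- | --- |
| Model confidence | How stable the model's generation or classification result is | Low stability calls for caution, but high stability can still be a confident hallucination |
| Evidence confidence | Whether retrieval, documents, databases and tool responses are sufficient and consistent | When evidence is insufficient, ask more or give a partial answer |
| Policy confidence | Whether the current policy, approved copy and regulatory boundary are clear | Policy uncertainty should escalate to the owner; the LLM must not improvise |
| Identity / authorization confidence | Whether the user or employee is authorized to see the information or initiate the action | Prevents unauthorized disclosure or execution |
| Impact severity | Whether an error affects funds, eligibility, rights or compliance, or causes customer harm | High-impact scenarios need more conservative thresholds |
| Reversibility | Whether an error can be corrected easily | Irreversible actions need a human or an approval |
| User vulnerability / context | Whether there are signals of complaint, anxiety, hardship, old age, language barrier or vulnerability | The same answer carries more risk in a vulnerable context |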

3.3 Action Classes

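| Action class | Product behavior | Architecture requirement |
| --- | --- | --- |
| Direct answer | Answers and gives the basis, scope and next step | evidence refs, approved copy, output QA |
| Qualified answer | States conditions, assumptions and scope of applicability | condition tags, policy version, scenario bounds |
| Partial answer | Answers the provable part and marks the undetermined part | segment-level evidence and missing-info list |
| Ask more | Collects missing facts, documents or consent in a structured way | question policy, field schema, drop-off tracking |
| Safe refusal | Explains the boundary and offers a permitted alternative path | refusal template, allowed alternative, policy id |
| Human escalation | Warm handoff with evidence packet and reason | case workflow, SLA, queue owner, handoff payload |
| Block and incident | Stops the dangerous action and triggers control / security / compliance review | incident type, retention, owner notification |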
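
The table above can be read as a contract: each action class is only valid when its supporting artifacts exist. A minimal Python sketch of that check follows. The artifact key names and the `missing_artifacts` helper are hypothetical, chosen only to mirror the "Architecture requirement" column; they are not an existing API.

```python
from enum import Enum


class ActionClass(Enum):
    DIRECT_ANSWER = "direct_answer"
    QUALIFIED_ANSWER = "qualified_answer"
    PARTIAL_ANSWER = "partial_answer"
    ASK_MORE = "ask_more"
    SAFE_REFUSAL = "safe_refusal"
    HUMAN_ESCALATION = "human_escalation"
    BLOCK_AND_INCIDENT = "block_and_incident"


# Hypothetical artifact keys that loosely follow the table's last column.
REQUIRED_ARTIFACTS = {
    ActionClass.DIRECT_ANSWER: {"evidence_refs", "approved_copy_id", "output_qa"},
    ActionClass.QUALIFIED_ANSWER: {"condition_tags", "policy_version", "scenario_bounds"},
    ActionClass.PARTIAL_ANSWER: {"segment_evidence", "missing_info_list"},
    ActionClass.ASK_MORE: {"question_policy", "field_schema", "drop_off_tracking"},
    ActionClass.SAFE_REFUSAL: {"refusal_template", "allowed_alternative", "policy_id"},
    ActionClass.HUMAN_ESCALATION: {"case_workflow", "sla", "queue_owner", "handoff_payload"},
    ActionClass.BLOCK_AND_INCIDENT: {"incident_type", "retention", "owner_notification"},
}


def missing_artifacts(action, plan):
    """Return the sorted artifact keys that the response plan still lacks."""
    present = {key for key, value in plan.items() if value}
    return sorted(REQUIRED_ARTIFACTS[action] - present)


# A refusal plan that forgot to offer an allowed alternative.
plan = {"refusal_template": "wealth_boundary_refusal", "policy_id": "advice_boundary_policy"}
print(missing_artifacts(ActionClass.SAFE_REFUSAL, plan))  # ['allowed_alternative']
```

A response planner could refuse to emit an action whose list is non-empty, which turns "refusal as dead end" into a detectable defect rather than a copy review finding.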

4. Abstention Taxonomy And Escalation Policy

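Abstention is not a single kind of refusal. A mature system distinguishes why it cannot continue, because different reasons call for different UX, owners and evidence.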

4.1 Abstention Taxonomy

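| Abstention class | Trigger | User-facing behavior | Owner |
| --- | --- | --- | --- |
| Evidence insufficient | Key facts, documents, transactions, policy versions or citations are missing | State what is missing, request it, or offer the answerable scope | Product / Ops |
| Evidence conflict | Multiple sources disagree, such as conflicting policy versions or system states | Draw no conclusion, explain that verification is needed, create a review task | Knowledge Owner / Ops |
| Domain boundary | The user asks for an out-of-bounds judgment on law, tax, investment, credit approval or medicine | Refuse safely; provide educational information or an authorized channel | Legal / Compliance / Licensed Owner |
| Authorization boundary | The user or employee lacks permission, or consent is insufficient | Do not disclose sensitive information; guide to the authentication or authorization flow | Security / Privacy |
| High-impact low-certainty | Affects funds, eligibility, account restrictions, complaints or compliance, but signals are insufficient | Pause automation and escalate to a human | Risk / Ops |
| Harm / vulnerability signal | Detects hardship, stress, complaint, old age, language barrier or a potential fraud victim | Reduce automation and sales intensity; provide a support path | Customer Care / Conduct Risk |
| Tool failure | Core system, payments, KYC, RAG or case API fails | Explain that confirmation is temporarily unavailable; provide a safe next step | Technology / Ops |
| Policy ambiguity | Internal policy does not cover the case or leaves room for interpretation | Do not improvise an interpretation; escalate to the policy owner | Policy / Compliance |
| Unsafe instruction | The user asks to bypass controls, commit fraud, launder money, hide information or generate misleading copy | Refuse, and possibly trigger a security incident | Fraud / Financial Crime / Security |
| Monitoring hold | A rule, model or data source is in a degraded / disabled state | Limit capability, hand off to a human or defer processing | Platform Owner |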

4.2 Escalation Policy

Escalate when:
  impact severity is high
  OR action is irreversible
  OR customer-facing statement could create regulated obligation
  OR evidence is conflicting
  OR policy boundary is unclear
  OR customer harm / complaint / vulnerability signal exists
  OR AI would need to infer facts that should come from system of record.
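
The rule above is a plain disjunction, so it can be sketched as a set of boolean signals where any single true value triggers escalation. The following Python sketch assumes each condition has already been computed upstream; the `EscalationSignals` class and its field names are hypothetical labels for the seven conditions listed.

```python
from dataclasses import dataclass


@dataclass
class EscalationSignals:
    """One boolean per condition in the escalation rule (names are illustrative)."""
    impact_severity_high: bool = False
    action_irreversible: bool = False
    creates_regulated_obligation: bool = False
    evidence_conflicting: bool = False
    policy_boundary_unclear: bool = False
    harm_complaint_or_vulnerability: bool = False
    needs_inferred_system_of_record_fact: bool = False


def escalation_reasons(signals):
    """Return the names of all triggered conditions, in declaration order."""
    return [name for name, value in vars(signals).items() if value]


def should_escalate(signals):
    """Any single triggered condition is enough (logical OR)."""
    return bool(escalation_reasons(signals))


signals = EscalationSignals(evidence_conflicting=True, action_irreversible=True)
print(should_escalate(signals))     # True
print(escalation_reasons(signals))  # ['action_irreversible', 'evidence_conflicting']
```

Returning the reasons rather than only a boolean keeps the handoff reason available for the case workflow and the evidence record.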

4.3 Escalation Matrix

| Scenario | AI can do | AI must not do | Escalation route | Handoff payload |
| --- | --- | --- | --- | --- |
| Credit eligibility | Explain general eligibility factors and list missing application data | Say the customer is approved, likely approved, or should apply based on unsupported inference | Credit ops or lending specialist | application id, reason codes, missing evidence, source docs, customer question |
| Wealth advice boundary | Provide neutral education and explain that personal recommendation requires authorized channel | Recommend buy/sell/hold, asset allocation, tax move or guaranteed outcome | Licensed advisor / wealth compliance | user intent, product mentioned, risk profile status, boundary reason |
| Payment dispute claim | Collect facts, explain process, list documents, state timelines | Promise chargeback success or assign fault without investigation | Dispute operations | transaction refs, customer narrative, merchant info, evidence gaps |
| AML analyst assistant | Summarize alerts, extract entities, cite transaction evidence | Decide SAR filing, close case solely, reveal SAR-sensitive reasoning to front office | AML investigator / BSA officer | alert id, entity graph, evidence refs, AI uncertainty flags |
| KYC document extraction | Extract fields, flag unreadable/contradictory docs, request re-upload | Declare identity verified when evidence fails validation | KYC operations / identity platform | doc refs, extraction confidence by field, validation failures |
| Complaint handling | Classify issue, preserve narrative, summarize facts, route case | Dismiss complaint, make legal conclusion, promise compensation | Complaint ops / legal-compliance review | complaint id, original text, harm signal, AI touchpoint refs |
| Contact center | Draft response, suggest next best question, summarize call | Hide uncertainty from agent, push sales during hardship or complaint | Supervisor / specialist queue | transcript, suggested script, confidence notes, prohibited content flags |

5. Confidence And Explanation UX Rules

5.1 Do Not Expose Raw Confidence By Default

Raw percentages often create false precision. A customer who sees "82% confident" may misunderstand:

  • 82% of what: factual accuracy, policy fit, retrieval match, approval chance, or model probability?
  • Is 82% safe enough for a dispute claim, wealth advice boundary or KYC decision?
  • Is the number calibrated for this customer segment and distribution?

Recommended pattern:

| Audience | Preferred expression | Avoid |
| --- | --- | --- |
| Customer | "I can confirm this from your transaction record" / "I need one more detail before I can explain the next step" | "82% confident" |
| Frontline employee | evidence status + confidence band + required review action | single model score without reason |
| Analyst / reviewer | field-level confidence, evidence refs, contradictions, uncertainty reason | green/yellow/red without inspectable evidence |
| Governance / model risk | metric distributions, error bands, override rates, harm outcomes | cherry-picked demos |

5.2 Confidence Language Guide

| Situation | Better language | Why |
| --- | --- | --- |
| Evidence is strong | "Based on the posted transaction and current dispute policy, the next step is..." | Names evidence and policy scope |
| Evidence is missing | "I can explain the process, but I need the merchant response date to assess the next step." | Separates process education from case-specific conclusion |
| Policy boundary exists | "I can provide general information. A personal investment recommendation requires a licensed advisor." | Refuses overreach while preserving help |
| Conflict exists | "Your upload and the system record do not match, so this needs review before we rely on it." | Explains why automation paused |
| Tool failed | "I cannot verify the account status right now. I will not guess; I can create a follow-up case." | Prevents hallucinated system state |
| Customer harm signal | "This may affect your account access, so I am routing it to a specialist and preserving the details you shared." | Signals seriousness and action |

5.3 Explanation Rules

  1. Explain the decision class, not the model internals.
  2. Tie confidence language to evidence quality, not AI personality.
  3. Show what is known, unknown, assumed and next.
  4. For customer-impacting decisions, provide reason codes and recourse path where policy allows.
  5. For employee copilots, show conflicting evidence before recommended action.
  6. For regulated advice boundaries, use approved copy and avoid personalized recommendations.
  7. Never use confidence language to soften a hard compliance boundary.
  8. Never use uncertainty as a way to avoid a complaint, dispute, appeal or accessibility path.

6. Financial Retail Scenarios

6.1 Credit Eligibility Assistant

Bad pattern:

"You are likely eligible for this credit card."

Why it fails:

  • It may imply pre-approval.
  • It may hide missing income, credit bureau, identity or product eligibility checks.
  • It may create unfair treatment if the confidence threshold behaves differently across segments.

Better architecture:

| Step | Design |
| --- | --- |
| Intent | credit_eligibility_question |
| Boundary | AI can explain criteria, not approve or predict approval unless sourced from official prequalification engine |
| Evidence (inputs) | application status, product rules, adverse action / reason code policy, customer consent |
| Action | qualified answer or ask more; escalate for exceptions |
| UX | "I can explain the factors used. A final decision comes from the credit decisioning process." |
| Evidence (retained) | policy id, product version, customer-safe copy id, model trace, handoff reason |

6.2 Wealth Advice Boundary

User asks: "Should I sell my bond fund and buy the high-yield product?"

Architecture response:

  • Classify as personalized investment advice intent.
  • Check channel and role permissions.
  • Provide neutral education about risk, fees, liquidity and suitability concepts.
  • Refuse buy/sell recommendation.
  • Offer licensed advisor handoff.

The uncertainty is not only model uncertainty. It is authorization, suitability, profile completeness, licensing and conduct risk uncertainty.

6.3 Payment Dispute Claims

User asks: "Will I win this dispute?"

AI may:

  • Collect transaction details, merchant interaction, delivery evidence, cancellation date.
  • Explain dispute process and timelines.
  • Identify missing evidence.
  • Create case and route high-value or vulnerable customer cases.

AI must not:

  • Promise outcome.
  • Invent chargeback reason.
  • Tell customer to misstate facts.
  • Hide uncertainty about network rules or merchant evidence.

6.4 AML Analyst Assistant

Analyst asks: "Is this suspicious?"

AI may:

  • Summarize alert facts.
  • Extract counterparties, typologies, transaction sequence and anomalies.
  • Cite source transactions and unresolved evidence gaps.
  • Suggest investigation questions.

AI must not:

  • Make SAR filing decision.
  • Close case without reviewer.
  • Generate unsupported suspicion narrative.
  • Reveal SAR-sensitive reasoning to customer-facing teams.

Uncertainty UX here is employee-facing: field-level confidence, contradictory facts, missing KYC context and reviewer confirmation workflow.

6.5 KYC Document Extraction

AI extracts name, date of birth, address and document number.

Advanced design:

| Field | UX / Ops behavior |
| --- | --- |
| High confidence and validation match | Auto-fill, show source crop to reviewer if sampled |
| Medium confidence | Highlight field for review, preserve extraction alternative |
| Low confidence or unreadable | Request re-upload or manual review |
| Document / application mismatch | Pause verification, route discrepancy queue |
| Expired or unsupported document | Explain acceptable document types, do not infer exception |

The key is field-level abstention. The system can accept name while abstaining on address, rather than failing the whole journey or hallucinating a value.
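
Field-level abstention can be sketched as a per-field decision function. The Python example below is illustrative only: the function name, the band labels and the action strings are hypothetical, and the order in which the checks are applied (document problems first, then mismatch, then confidence) is an assumption, since the table does not state a precedence. Treating "high confidence but failed validation" as a review case is also an assumption.

```python
def kyc_field_action(band, validation_passed, matches_application=True, document_supported=True):
    """Map one extracted field to a UX / Ops action (assumed precedence, see lead-in)."""
    if not document_supported:
        return "explain_acceptable_documents"
    if not matches_application:
        return "route_discrepancy_queue"
    if band == "low":
        return "request_reupload_or_manual_review"
    if band == "high" and validation_passed:
        return "auto_fill"
    # Medium confidence, or high confidence that failed validation.
    return "highlight_for_review"


# Each field is decided on its own, so one weak field does not fail the whole journey.
fields = {
    "name": ("high", True),
    "date_of_birth": ("medium", True),
    "address": ("low", False),
}
decisions = {field: kyc_field_action(*signals) for field, signals in fields.items()}
print(decisions["name"])     # auto_fill
print(decisions["address"])  # request_reupload_or_manual_review
```

Here the name is accepted while the address is sent back for re-upload, which is the behavior the paragraph above describes.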

6.6 Complaint Intelligence

Complaint AI should treat uncertainty as a preservation problem:

  • Preserve original narrative.
  • Summarize with span citations.
  • Separate allegation, verified fact, policy issue, harm signal and root cause hypothesis.
  • Escalate legal-sensitive, vulnerable customer, repeated harm and regulatory-source cases.

Never let "low confidence" become a reason to not create a complaint case. In complaints, ambiguity often increases the need to preserve and route.

6.7 Contact Center Copilot

Agent sees AI suggestion:

Recommended response: Explain dispute process and ask for merchant cancellation evidence.
Confidence: Medium
Why: Transaction is posted and policy found, but customer cancellation date is missing.
Do not say: "You will get your money back."
Escalate if: customer says hardship, fraud, legal threat, regulator, or prior unresolved complaint.

The agent does not need model math. The agent needs action-safe guidance, forbidden language, evidence and escalation triggers.


7. Metrics / Control / Evidence Model

7.1 Metrics

| Metric | What it detects | Owner |
| --- | --- | --- |
| Abstention rate by class | Whether refusals / asks / escalations are balanced by reason | Product / Governance |
| Wrong-answer rate by impact tier | Whether high-impact answers have unacceptable error | Model Risk / Product |
| Unsupported claim rate | Whether outputs exceed evidence or approved language | Compliance / QA |
| Escalation precision and recall | Whether the right cases go to humans | Ops / Risk |
| Handoff completion SLA | Whether escalation is real service, not deflection | Operations |
| Human override rate | Whether AI action policy is miscalibrated | Product / Ops |
| Appeal / complaint rate after AI interaction | Whether AI is creating customer harm | Complaint / Conduct Risk |
| Evidence completeness score | Whether each decision can be replayed | Audit / Governance |
| Customer trust calibration | Whether users understand AI boundaries without abandoning valid journeys | CX / Research |
| Segment disparity in abstention | Whether certain groups get more friction or fewer helpful answers | Fairness / Risk |
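
As one worked example of these metrics, escalation precision and recall can be computed from reviewed cases. The Python sketch below assumes each case carries two booleans: whether the system escalated, and whether a human reviewer later judged that escalation was warranted. That labeling process and the function name are assumptions, not something the metric table specifies.

```python
def escalation_precision_recall(cases):
    """cases: list of (was_escalated, should_have_escalated) boolean pairs.

    Precision: of the cases sent to humans, how many needed a human.
    Recall: of the cases that needed a human, how many were sent.
    Returns None for a ratio whose denominator is zero.
    """
    true_pos = sum(1 for escalated, needed in cases if escalated and needed)
    false_pos = sum(1 for escalated, needed in cases if escalated and not needed)
    false_neg = sum(1 for escalated, needed in cases if not escalated and needed)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else None
    return precision, recall


reviewed = [
    (True, True),    # escalated, and the reviewer agreed
    (True, False),   # escalated unnecessarily (queue noise)
    (False, True),   # missed escalation (potential harm)
    (False, False),  # correctly automated
    (True, True),
]
precision, recall = escalation_precision_recall(reviewed)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Low precision points toward escalation dumping, while low recall points toward high-impact cases being automated, so the two numbers are worth tracking separately per impact tier.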

7.2 Control Model

| Control | Design |
| --- | --- |
| Intended-use inventory | Each use case records allowed answer classes, prohibited actions and impact tier |
| Policy decision table | Maps confidence/evidence/impact/boundary to answer / ask / refuse / escalate |
| Approved language library | Versioned copy for boundary, refusal, disclosure, escalation and uncertainty |
| Evidence ledger | Captures input refs, retrieved sources, tool responses, policy ids, model versions and output hashes |
| Human handoff protocol | Defines queue, SLA, owner, required payload, customer message and closure event |
| QA and red-team eval | Tests low-evidence, conflicting-evidence, high-impact and adversarial cases |
| Monitoring thresholds | Alerts on abnormal abstention, complaint, override, error or segment friction |
| Incident and CAPA loop | Connects harmful AI interactions to root cause, fix, validation and policy update |
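
The policy decision table control can be sketched as an ordered rule list with wildcard matching, where the first matching row wins and anything unmatched falls back to a human. The rows, state names and precedence in this Python sketch are illustrative assumptions, not an approved policy; a real table would also carry a confidence band column and policy ids.

```python
# Each row: (boundary, evidence_state, impact_tier, action). "*" matches any value.
# Rows are hypothetical and evaluated top to bottom; the first match wins.
POLICY_TABLE = [
    ("unsafe_instruction", "*", "*", "block_and_incident"),
    ("regulated_advice", "*", "*", "safe_refusal"),
    ("*", "conflicting", "*", "human_escalation"),
    ("*", "missing", "high", "partial_answer_ask_more"),
    ("*", "missing", "*", "ask_more"),
    ("*", "complete", "high", "qualified_answer"),
    ("*", "complete", "*", "direct_answer"),
]
DEFAULT_ACTION = "human_escalation"  # fail safe when no rule covers the situation


def decide(boundary, evidence_state, impact_tier):
    """Return the action of the first matching row, or the safe default."""
    for row_boundary, row_evidence, row_impact, action in POLICY_TABLE:
        if (
            row_boundary in ("*", boundary)
            and row_evidence in ("*", evidence_state)
            and row_impact in ("*", impact_tier)
        ):
            return action
    return DEFAULT_ACTION


print(decide("none", "missing", "high"))              # partial_answer_ask_more
print(decide("regulated_advice", "complete", "low"))  # safe_refusal
print(decide("none", "tool_unavailable", "low"))      # human_escalation
```

Keeping the table as data rather than prompt text means it can be versioned, reviewed by Risk and Compliance, and referenced by id in each logged decision.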

7.3 Evidence Packet

A replayable uncertainty decision needs:

uncertainty_event_id: ux-2026-06-30-000169
use_case: payment_dispute_assistant
intent: dispute_outcome_question
customer_impact_tier: high
evidence_state: missing_customer_cancellation_date
confidence_summary:
  retrieval: high
  tool_validation: high
  case_specific_conclusion: low
policy_decision: partial_answer_ask_more
policy_ids:
  - dispute_uncertainty_policy_v4
  - prohibited_outcome_promise_v2
approved_language_ids:
  - dispute_process_explanation_v7
output_class: partial_answer
escalation:
  required: false
  trigger_checked: hardship_or_fraud_signal
sources:
  - transaction_record_ref
  - dispute_policy_ref
human_feedback:
  override: false
monitoring_tags:
  - high_impact
  - missing_evidence
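
A rough way to score whether a packet like the one above is replayable is to count which required fields are present and non-empty. The Python sketch below treats every top-level key of the example packet as required, which is an assumption; the `evidence_completeness` function and the equal weighting of fields are also hypothetical.

```python
REQUIRED_FIELDS = [
    "uncertainty_event_id", "use_case", "intent", "customer_impact_tier",
    "evidence_state", "confidence_summary", "policy_decision", "policy_ids",
    "approved_language_ids", "output_class", "escalation", "sources",
    "human_feedback", "monitoring_tags",
]
EMPTY_VALUES = (None, "", [], {})


def evidence_completeness(packet):
    """Return (score between 0 and 1, list of missing or empty fields)."""
    missing = [name for name in REQUIRED_FIELDS if packet.get(name) in EMPTY_VALUES]
    score = 1 - len(missing) / len(REQUIRED_FIELDS)
    return score, missing


packet = {
    "uncertainty_event_id": "ux-2026-06-30-000169",
    "use_case": "payment_dispute_assistant",
    "intent": "dispute_outcome_question",
    "customer_impact_tier": "high",
    "evidence_state": "missing_customer_cancellation_date",
    "confidence_summary": {"retrieval": "high", "tool_validation": "high",
                           "case_specific_conclusion": "low"},
    "policy_decision": "partial_answer_ask_more",
    "policy_ids": ["dispute_uncertainty_policy_v4", "prohibited_outcome_promise_v2"],
    "approved_language_ids": ["dispute_process_explanation_v7"],
    "output_class": "partial_answer",
    "escalation": {"required": False, "trigger_checked": "hardship_or_fraud_signal"},
    "sources": [],  # deliberately emptied here to show how a gap is reported
    "human_feedback": {"override": False},
    "monitoring_tags": ["high_impact", "missing_evidence"],
}
score, missing = evidence_completeness(packet)
print(missing)  # ['sources']
```

A presence check like this says nothing about whether the referenced sources still resolve, so it is a lower bound on replayability rather than proof of it.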

8. Anti-Patterns And Failure Modes

| Anti-pattern | Failure mode | Corrective design |
| --- | --- | --- |
| Confidence theater | UI shows precise score but no evidence or action rule | Replace with evidence state, action class and next step |
| One global threshold | Same confidence cutoff used for FAQ, credit, complaints and AML | Thresholds by impact, reversibility, policy boundary and evidence |
| Refusal as dead end | AI says cannot help and ends the journey | Safe refusal with allowed alternative, handoff or information request |
| Escalation dumping | Low confidence floods human queues | Escalation taxonomy, triage, SLA and capacity model |
| Hidden uncertainty from employees | Copilot gives polished answer without showing weak evidence | Reviewer UI with missing facts, contradictions and confidence by field |
| Over-disclosure to customers | Reveals internal AML/fraud rules or sensitive thresholds | Customer-safe explanation layer and role-based evidence |
| Under-disclosure to customers | "AI may be wrong" without concrete limitation or recourse | Boundary-specific language and next step |
| Partial answer missing boundary | AI answers generic part then implies case-specific decision | Segment answer by known / unknown / cannot determine |
| Human-in-the-loop myth | Human receives case but no evidence, no time, no authority | Handoff packet, owner, SLA, decision rights |
| Monitoring blind spot | Team tracks accuracy but not harm, complaints or overrides | Connect telemetry to complaint, appeal, QA and CAPA |
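
To make the "one global threshold" row concrete, the Python sketch below contrasts it with a tiered gate in which the confidence cutoff depends on impact tier, and irreversibility or incomplete evidence overrides the score entirely. The numeric cutoffs are placeholders invented for illustration, not recommended values, and the function name is hypothetical.

```python
# Placeholder cutoffs per impact tier; real values would come from calibration and risk appetite.
MIN_CONFIDENCE = {"low": 0.60, "medium": 0.80, "high": 0.95}


def may_automate(confidence, impact_tier, reversible, evidence_complete):
    """Allow automation only if hard gates pass and the tier-specific cutoff is met."""
    if not evidence_complete or not reversible:
        return False  # no score is high enough to bypass these gates
    return confidence >= MIN_CONFIDENCE[impact_tier]


# The same score leads to different outcomes depending on context.
print(may_automate(0.85, "low", reversible=True, evidence_complete=True))    # True
print(may_automate(0.85, "high", reversible=True, evidence_complete=True))   # False
print(may_automate(0.99, "high", reversible=False, evidence_complete=True))  # False
```

The point of the sketch is the shape of the rule, in which confidence is one input among several, rather than the particular numbers.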

9. Architecture Mapping To RAG / Agent / Copilot / Eval / Governance

| Architecture style | Uncertainty problem | Required control |
| --- | --- | --- |
| RAG assistant | Retrieval may be stale, missing, conflicting or unauthorized | source registry, freshness, permission filtering, citation quality, conflict detector |
| Tool-using agent | Tool may fail, return partial data or execute high-impact action | tool confidence, action approval, reversible/irreversible classification, execution guard |
| Employee copilot | Employee may over-trust fluent drafts | evidence panel, required review fields, forbidden language, adoption monitoring |
| Customer chatbot | Customer may treat response as promise or advice | capability framing, boundary copy, escalation path, complaint/dispute/appeal routing |
| KYC / document AI | Field extraction varies by field and document quality | field-level confidence, validation, exception queue, source image evidence |
| AML / fraud assistant | Sensitive reasoning and case decision boundaries | analyst-only workspace, SAR-sensitive controls, case owner approval |
| Eval platform | Offline accuracy misses uncertainty behavior | scenario eval for abstention, partial answer, ask-more, refusal and handoff |
| Governance system | Policies are not connected to runtime behavior | policy ids in logs, evidence packets, review cadence, residual risk acceptance |

9.1 Eval Design

Uncertainty eval must include:

  • Strong-evidence answer cases.
  • Missing-evidence cases.
  • Conflicting-source cases.
  • Regulated-advice boundary cases.
  • High-impact irreversible action cases.
  • Sensitive information / authorization cases.
  • Harm, complaint and vulnerable-customer cases.
  • Tool outage and degraded RAG cases.
  • Multi-turn cases where user pressure attempts to force a conclusion.

Score dimensions:

| Dimension | Pass condition |
| --- | --- |
| Action selection | correct answer / partial / ask / refusal / escalation |
| Language safety | no unsupported promise, advice or disclosure |
| Evidence use | cited source supports the claim |
| Boundary clarity | user understands what AI can and cannot do |
| Handoff quality | correct queue, payload and customer message |
| Recovery | user gets a useful next step |
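
Two of these dimensions, action selection and language safety, are mechanical enough to sketch in code. The Python example below is a simplified illustration: the forbidden phrase list is hypothetical (one phrase echoes the contact center example earlier in this document), substring matching is a crude stand-in for a real approved-language scanner, and the other four dimensions would typically need human or rubric-based grading.

```python
# Hypothetical forbidden phrases; a real list would come from the approved language library.
FORBIDDEN_PHRASES = ["you will get your money back", "you are approved", "guaranteed"]


def score_case(expected_action, actual_action, output_text):
    """Score one eval case on the two mechanically checkable dimensions."""
    text = output_text.lower()
    return {
        "action_selection": expected_action == actual_action,
        "language_safety": not any(phrase in text for phrase in FORBIDDEN_PHRASES),
    }


def pass_rates(results):
    """Fraction of cases passing each dimension."""
    dimensions = results[0].keys()
    return {dim: sum(r[dim] for r in results) / len(results) for dim in dimensions}


results = [
    # Missing-evidence case handled correctly.
    score_case("ask_more", "ask_more",
               "I can explain the process, but I need the merchant response date."),
    # Pressure case where the assistant wrongly promised an outcome.
    score_case("safe_refusal", "direct_answer", "You will get your money back."),
]
print(pass_rates(results))  # {'action_selection': 0.5, 'language_safety': 0.5}
```

Reporting pass rates per dimension and per scenario type keeps a strong average from hiding a failure concentrated in, for example, multi-turn pressure cases.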

10. ADR Draft

# ADR: Adopt Uncertainty Decision Policy And Escalation Architecture

Date: 2026-06-30
Status: Proposed

## Context

AI assistants in credit, wealth, disputes, AML, KYC, complaints and contact center workflows produce outputs with varying evidence quality, policy certainty and customer impact. A single confidence score is not sufficient to decide what users should see or when humans should intervene.

## Decision

We will implement an uncertainty decision architecture that combines intent, impact tier, evidence state, policy boundary, authorization, model/retrieval/tool confidence and reversibility. Runtime outputs must be classified as direct answer, qualified answer, partial answer, ask more, safe refusal, human escalation or block. Each customer-impacting uncertainty decision must create an evidence packet with policy ids, source refs, approved language ids and monitoring tags.

## Consequences

Benefits:

- Reduces unsupported promises, advice-boundary breaches and hallucinated system states.
- Improves customer trust by giving specific next steps instead of vague disclaimers.
- Gives operations usable handoff payloads and clear ownership.
- Enables model risk, audit, complaint and governance teams to replay decisions.

Trade-offs:

- Requires policy decision tables, approved language assets and workflow integration.
- May initially increase ask-more and escalation volume while thresholds are tuned.
- Requires cross-functional ownership across Product, Risk, Compliance, Ops, CX and Technology.

## Alternatives Considered

1. Show raw confidence score to users.
   - Rejected because it creates false precision and does not encode policy or impact.
2. Use a single confidence threshold for human handoff.
   - Rejected because high confidence can still violate advice, authorization or evidence boundaries.
3. Let the LLM decide when to refuse or escalate.
   - Rejected because abstention and escalation are regulated product controls, not prompt-only behavior.

11. Interview Answer

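30-second version

AI uncertainty UX is not about displaying a confidence score in the interface. It is a product and architecture control system. In financial retail, intent, evidence, policy boundary, customer impact, authorization and reversibility determine whether the AI answers, gives a qualified answer, gives a partial answer, asks more, refuses, escalates or blocks. The key is to connect uncertainty language, abstention taxonomy, human handoff, approved copy, monitoring and audit evidence, so that customers do not mistake the AI's uncertain output for a commitment, advice or a ruling.

2-minute version

I would design uncertainty UX as a runtime decision policy. Step one identifies the user intent and impact tier, such as credit eligibility, wealth advice, payment dispute, AML review, KYC extraction or complaint. Step two assesses the evidence state: whether sources are sufficient, fresh, consistent and visible under the user's authorization. Step three combines model, retrieval and tool confidence, without relying on the score alone. Step four uses the policy engine to determine the action class: direct answer, qualified answer, partial answer, ask more, safe refusal, human escalation or block.

For example, when a customer asks "Will I win this dispute?", the AI should not give a probability of success or promise an outcome. It can explain the process, collect evidence, state which materials are missing, and escalate on high amounts, suspected fraud, vulnerable customers or complaint signals. In the wealth scenario, when a user asks whether to sell a fund, the uncertainty is not that the model does not know. Licensing, suitability and the advice boundary determine that the AI can only educate and hand off to a licensed advisor.

On the architecture side, I would implement an intent classifier, impact classifier, evidence state service, policy boundary engine, approved language library, workflow router, evidence ledger and monitoring. Metrics go beyond accuracy to cover unsupported claims, abstention by class, escalation precision/recall, override, appeal, complaint and segment friction. That is how the AI balances being useful, honest, auditable and operable.

CTO version

I would not treat uncertainty as a calibration project for the model team. I would place it in the platform control plane. The core object is the uncertainty_decision_event: intent, impact, evidence state, policy boundary, authorization, model/retrieval/tool confidence, action class, approved language id, handoff id and output hash.

At runtime, the LLM cannot decide on its own to give out-of-bounds advice, make customer commitments or trigger human escalation. It calls a policy decision service whose inputs come from the RAG source registry, tool validation, customer/account context, role permission, risk tier and use-case policy. The output is the permitted action class and copy constraints. For agentic workflows, irreversible tool actions need a stronger gate. For copilots, the reviewer UI must show missing evidence and contradictions. For customer-facing flows, the raw score is hidden by default, and only the evidence scope, the boundary and the next step are shown.

On the governance side, every uncertainty decision can be replayed: which model was used, which sources were retrieved, which policy version allowed a partial answer, why there was no escalation, what the customer saw, whether an employee overrode it, and whether a complaint or appeal followed. This lets us connect NIST RMF, ISO 42001 and model risk from document-level governance to runtime evidence.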


12. 7-Day Practice Plan

| Day | Practice | Output |
| --- | --- | --- |
| Day 1 | Pick a financial retail AI scenario, such as a payment dispute assistant, and draw the answer / ask / refuse / escalate state machine | one-page state machine |
| Day 2 | Write an abstention taxonomy covering at least evidence, policy, authorization, harm, tool failure and unsafe instruction | taxonomy table |
| Day 3 | Write a confidence language guide for credit eligibility and wealth advice boundary | approved / forbidden language table |
| Day 4 | Design KYC document extraction field-level confidence and review UI requirements | field-level decision matrix |
| Day 5 | Write a handoff evidence packet schema for the AML analyst assistant | evidence YAML and owner map |
| Day 6 | Design an eval set: strong evidence, missing evidence, conflicting evidence, boundary breach, tool outage, complaint signal | 30-case eval matrix |
| Day 7 | Write a one-page ADR explaining why raw confidence scores and a single threshold are not used | portfolio ADR |

13. Portfolio Takeaway

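A mature AI product manager or architect does not stop at the simple question "How confident is the model?". They ask:

  • What counts as sufficient evidence in this scenario?
  • Which answers will customers read as a commitment, advice or a ruling?
  • Which uncertainties should be exposed to customers, and which only to employees or reviewers?
  • When is a partial answer better than a refusal?
  • When does asking more add friction but reduce harm?
  • Which escalations are a service commitment rather than a way to offload risk?
  • Six months later, when a complaint, audit or regulatory inquiry arrives, can we reconstruct the facts, rules, model and human judgment of that moment?

The advanced skill in uncertainty UX is turning "the AI is not sure" into a service action that is clear, honest, executable and auditable.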