AI Uncertainty UX: Uncertainty Experience and Escalation Architecture
AI Uncertainty Experience Architecture: Abstention / Confidence / Escalation
Date: 2026-06-30 Status: evergreen Audience: CBAP+ Senior BA / Financial Retail PM / Product Architect / Solution Architect / AI Governance Lead / Model Risk Partner / Contact Center Transformation Lead. Output: a set of product architecture notes connecting uncertainty UX, abstention, confidence language, safe refusal, partial answer, request-more-info, human escalation, evidence and monitoring.
Source Anchors
| Source | Link | How this document uses it |
|---|---|---|
| A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification | https://arxiv.org/abs/2107.07511 | Technical anchor for uncertainty quantification; this document focuses on product, process, handoff and evidence architecture and does not equate confidence with experience design |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Organizes uncertainty risk, impact scenarios, monitoring and governance evidence under Govern / Map / Measure / Manage |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Uses the GenAI-specific risk lens to design controls for hallucination, confabulation, misuse, human oversight and incident learning |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | References behavioral principles for human-AI interaction at initial use, during regular interaction, when wrong and over time, then translates them into a financial retail control architecture |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | Places uncertainty UX inside the AI management system's policy, role, operation, performance evaluation, audit and improvement |
| ISO/IEC 23894 AI risk management | https://www.iso.org/standard/77304.html | Brings uncertainty, impact, controls, residual risk and retrospectives into the AI risk management lifecycle |
Source nuance:
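- Conformal prediction, calibration, confidence intervals and model scores describe only the technical uncertainty signal. They do not by themselves decide what the customer sees, how an employee takes over, or how regulatory evidence is retained.
- Human-AI interaction guidelines remind systems to express capability limits and support error recovery, but financial retail has to land those principles in a product state machine, policy engine, case workflow, approved language and audit trail.
- NIST and ISO provide the risk management and management system structure; this document converts them into scenarios, rules, metrics and evidence packets that PMs, BAs and architects can execute.
In one sentence: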
AI uncertainty UX is not a tooltip. It is a decision-control architecture that decides when AI answers, qualifies, refuses, asks, escalates, records evidence and learns.
1. Why Uncertainty UX Is Architecture
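Many teams reduce the uncertainty experience to three things:
- Show a percentage.
- Add the line "AI may be wrong".
- Hand off to a human when confidence is low.
That is far from enough in financial retail. Whether a customer is harmed is determined not by the model score alone but by the whole service system:
- Whether the user can understand the AI's capability limits.
- Whether the AI knows which questions it can partially answer, which it must refuse, and which need more information.
- Whether the system routes low-evidence, low-authorization, high-impact and regulation-sensitive interactions to the right owner.
- Whether employees see enough evidence rather than only a confident summary.
- Whether audit, complaints, model risk and business owners can reconstruct why the system chose to answer / abstain / escalate at the time.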
1.1 Architecture Question
Given a user intent, context, evidence state, model uncertainty, policy boundary and customer impact:
what should the system do next, who should own the next action,
what should the user see, and what evidence must be retained?
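This is not a UX copy problem; it is an architecture decision:
| Layer | Key design question | Failure consequence |
|---|---|---|
| Product policy | Which scenarios allow an answer, a partial answer, a refusal, a request for more documents, or an escalation | AI over-promises beyond its remit; customers are misled |
| Experience state | When the user sees the uncertainty, its reason, the next step and the human path | Users over-trust or get trapped in an AI loop |
| Decision control | How confidence, evidence, policy and impact jointly determine the action | Low-evidence, high-impact scenarios are handled automatically |
| Workflow | Who takes over, SLA, handoff payload, case type | Escalation fails, questions are repeated, service breaks |
| Evidence | How inputs, retrieval, model, rules, copy and approvals are recorded | Complaints, audits or model incidents cannot be explained |
| Monitoring | How abstention, override, appeal and harm signals feed back | Controls fail and nobody knows |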
1.2 Product Principle
confidence is an internal signal
+ uncertainty language is a user contract
+ abstention is a product action
+ escalation is an operating model
+ evidence is the accountability layer
= uncertainty architecture
2. Concept Diagram
flowchart TD
A[User / Employee Intent] --> B[Context and Impact Assessment]
B --> C[Evidence State]
B --> D[Policy Boundary]
B --> E[Model / Retrieval / Tool Confidence]
C --> F[Uncertainty Decision Policy]
D --> F
E --> F
F --> G{Action Class}
G -->|answer| H[Answer with evidence and calibrated language]
G -->|partial answer| I[Scope answer + missing facts + next step]
G -->|ask more| J[Structured clarification or document request]
G -->|safe refusal| K[Boundary explanation + allowed alternative]
G -->|escalate| L[Case workflow + owner + SLA]
G -->|block| M[Stop unsafe action + incident/control signal]
H --> N[Telemetry and Evidence Packet]
I --> N
J --> N
K --> N
L --> N
M --> N
N --> O[Monitoring / Eval / Governance / CAPA]
O --> F
Key idea: uncertainty architecture is a closed-loop system. It does not end when AI says "I am not sure"; it ends when the case reaches the right outcome, evidence is complete, and control owners can improve the policy.
3. Core Architecture Model
3.1 Runtime Components
| Component | Responsibility | Example output |
|---|---|---|
| Intent and domain classifier | Identifies question type, business domain and whether the interaction is customer-impacting | credit_eligibility_explanation, wealth_advice_boundary, aml_case_summary |
| Impact and sensitivity classifier | Assesses customer rights, funds, compliance, complaints, vulnerable customers and regulatory sensitivity | high_customer_impact, regulated_advice, complaint_signal |
| Evidence state service | Determines whether evidence is sufficient, sources are fresh, and conflicts exist | complete, missing_income_doc, conflicting_policy_versions |
| Confidence signal aggregator | Aggregates retrieval score, model self-check, tool validation, schema validation and historical error rate | answer_confidence=medium, evidence_confidence=low |
| Policy and boundary engine | Decides allow / ask / refuse / escalate by product, role, channel, region and customer status | licensed_handoff_required |
| Response planner | Decides answer class, copy template, disclosure, citation and handoff payload | partial_answer_with_missing_facts |
| Approved language service | Provides auditable copy fragments and forbidden phrase scanning | safe_credit_language_v3 |
| Workflow router | Creates the case, queue, owner, SLA and handoff reason | payments_dispute_queue, wealth_advisor_queue |
| Evidence ledger | Stores prompt, retrieval refs, policy ids, tool calls, decision reason and output hash | uncertainty_event_id |
| Monitoring and eval loop | Monitors abstention, appeal, override, harm, complaint and drift | threshold_review_needed |
3.2 Decision Inputs
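Do not look at a single confidence_score. Financial retail scenarios need at least seven kinds of signal:
| Signal | What it means | Why it matters |
|---|---|---|
| Model confidence | How stable the model's generation or classification result is | Low stability calls for caution, but high stability can still be a confident hallucination |
| Evidence confidence | Whether retrieval, documents, databases and tool responses are sufficient and consistent | When evidence is insufficient, ask more or give a partial answer |
| Policy confidence | Whether the current policy, approved copy and regulatory boundary are clear | Policy uncertainty should escalate to the owner; the LLM must not improvise |
| Identity / authorization confidence | Whether the user or employee is authorized to see the information or initiate the action | Prevents unauthorized disclosure or execution |
| Impact severity | Whether an error affects funds, eligibility, rights or compliance, or causes customer harm | High-impact scenarios need more conservative thresholds |
| Reversibility | Whether an error can be corrected easily | Irreversible actions need a human or an approval |
| User vulnerability / context | Whether there are signals of complaint, anxiety, hardship, old age, language barrier or vulnerability | The same answer carries more risk in a vulnerable context |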
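
The seven signal families can be carried as one structured object rather than a single score. The following is a minimal sketch, assuming hypothetical field names, string bands ("low" / "medium" / "high") and illustrative blocking rules that are not taken from any real policy; it shows why high model confidence alone does not clear an answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionSignals:
    """One field per signal family in the table above (names are illustrative)."""
    model_confidence: str       # "low" | "medium" | "high"
    evidence_confidence: str    # "low" | "medium" | "high"
    policy_clear: bool          # policy, approved copy and boundary are unambiguous
    authorized: bool            # identity and authorization checks passed
    impact_severity: str        # "low" | "medium" | "high"
    reversible: bool            # an error could be corrected easily
    vulnerability_signal: bool  # complaint, hardship, language barrier, etc.


def blocking_reasons(s: DecisionSignals) -> list:
    """Return every reason a direct automated answer should be held back."""
    reasons = []
    if s.model_confidence == "low":
        reasons.append("model_confidence_low")
    if s.evidence_confidence != "high":
        reasons.append("evidence_not_sufficient")
    if not s.policy_clear:
        reasons.append("policy_unclear")
    if not s.authorized:
        reasons.append("authorization_missing")
    if s.impact_severity == "high" and not s.reversible:
        reasons.append("high_impact_irreversible")
    if s.vulnerability_signal:
        reasons.append("vulnerability_signal")
    return reasons


# A fluent, confident model output that the evidence does not support.
fluent_but_unsupported = DecisionSignals(
    model_confidence="high", evidence_confidence="low", policy_clear=True,
    authorized=True, impact_severity="low", reversible=True,
    vulnerability_signal=False,
)
```

Calling `blocking_reasons(fluent_but_unsupported)` still returns a reason, because the evidence signal is evaluated independently of the model signal.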
3.3 Action Classes
| Action class | Product behavior | Architecture requirement |
|---|---|---|
| Direct answer | Answers and gives the basis, scope and next step | evidence refs, approved copy, output QA |
| Qualified answer | States conditions, assumptions and scope of applicability | condition tags, policy version, scenario bounds |
| Partial answer | Answers the provable part and marks the undetermined part | segment-level evidence and missing-info list |
| Ask more | Collects missing facts, documents or consent in a structured way | question policy, field schema, drop-off tracking |
| Safe refusal | Explains the boundary and offers a permitted alternative path | refusal template, allowed alternative, policy id |
| Human escalation | Warm handoff with evidence packet and reason | case workflow, SLA, queue owner, handoff payload |
| Block and incident | Blocks the dangerous action and triggers control / security / compliance review | incident type, retention, owner notification |
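
One way to keep the action classes and their architecture requirements from drifting apart is to encode both in one place. This is a sketch under the assumption that each requirement in the table maps to a named artifact; the identifiers below are hypothetical spellings of the table entries, not an existing schema.

```python
from enum import Enum


class ActionClass(Enum):
    DIRECT_ANSWER = "direct_answer"
    QUALIFIED_ANSWER = "qualified_answer"
    PARTIAL_ANSWER = "partial_answer"
    ASK_MORE = "ask_more"
    SAFE_REFUSAL = "safe_refusal"
    HUMAN_ESCALATION = "human_escalation"
    BLOCK_AND_INCIDENT = "block_and_incident"


# Artifacts each action class must carry before the response is released.
REQUIRED_ARTIFACTS = {
    ActionClass.DIRECT_ANSWER: {"evidence_refs", "approved_copy", "output_qa"},
    ActionClass.QUALIFIED_ANSWER: {"condition_tags", "policy_version", "scenario_bounds"},
    ActionClass.PARTIAL_ANSWER: {"segment_evidence", "missing_info_list"},
    ActionClass.ASK_MORE: {"question_policy", "field_schema", "drop_off_tracking"},
    ActionClass.SAFE_REFUSAL: {"refusal_template", "allowed_alternative", "policy_id"},
    ActionClass.HUMAN_ESCALATION: {"case_workflow", "sla", "queue_owner", "handoff_payload"},
    ActionClass.BLOCK_AND_INCIDENT: {"incident_type", "retention", "owner_notification"},
}


def missing_artifacts(action: ActionClass, provided) -> list:
    """List the required artifacts that the planned response does not yet have."""
    return sorted(REQUIRED_ARTIFACTS[action] - set(provided))
```

A refusal planned with only `refusal_template` and `policy_id` would be flagged as missing `allowed_alternative`, which is the "refusal as dead end" case.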
4. Abstention Taxonomy And Escalation Policy
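Abstention is not a single kind of refusal. A mature system distinguishes why it cannot continue, because different reasons map to different UX, owners and evidence.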
4.1 Abstention Taxonomy
| Abstention class | Trigger | User-facing behavior | Owner |
|---|---|---|---|
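| Evidence insufficient | Key facts, documents, transactions, policy versions or citations are missing | States what is missing, requests it, or offers the answerable scope | Product / Ops |
| Evidence conflict | Multiple sources disagree, e.g. conflicting policy versions or system states | Draws no conclusion, explains that verification is needed, creates a review task | Knowledge Owner / Ops |
| Domain boundary | User asks for an out-of-bounds judgment such as legal, tax, investment, credit approval or medical | Safe refusal; offers educational information or an authorized channel | Legal / Compliance / Licensed Owner |
| Authorization boundary | User or employee lacks permission, or consent is insufficient | Discloses no sensitive information; guides to the authentication or authorization flow | Security / Privacy |
| High-impact low-certainty | Affects funds, eligibility, account restrictions, complaints or compliance while signals are insufficient | Pauses automation and escalates to a human | Risk / Ops |
| Harm / vulnerability signal | Detects hardship, stress, complaint, old age, language barrier or a potential fraud victim | Reduces automation and sales intensity; offers a support path | Customer Care / Conduct Risk |
| Tool failure | Core system, payments, KYC, RAG or case API fails | Explains that it cannot confirm right now and offers a safe next step | Technology / Ops |
| Policy ambiguity | Internal policy does not cover the case or leaves room for interpretation | Does not improvise an interpretation; escalates to the policy owner | Policy / Compliance |
| Unsafe instruction | User asks to bypass controls, commit fraud, launder money, hide information or generate misleading copy | Refuses and may trigger a security incident | Fraud / Financial Crime / Security |
| Monitoring hold | A rule, model or data source is in a degraded / disabled state | Limits capability; hands off to a human or defers processing | Platform Owner |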
4.2 Escalation Policy
Escalate when:
impact severity is high
OR action is irreversible
OR customer-facing statement could create regulated obligation
OR evidence is conflicting
OR policy boundary is unclear
OR customer harm / complaint / vulnerability signal exists
OR AI would need to infer facts that should come from system of record.
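The OR-policy above can be sketched as a function that returns which triggers fired, so the handoff reason is recorded rather than reduced to a bare boolean. The flag names are hypothetical paraphrases of the seven conditions; how each flag gets set upstream is out of scope here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EscalationContext:
    """One boolean per escalation condition, in the order listed above."""
    impact_severity_high: bool = False
    action_irreversible: bool = False
    may_create_regulated_obligation: bool = False
    evidence_conflicting: bool = False
    policy_boundary_unclear: bool = False
    harm_complaint_or_vulnerability: bool = False
    needs_inferred_system_facts: bool = False


def escalation_triggers(ctx: EscalationContext) -> list:
    """Names of every condition that fired (field order is preserved)."""
    return [name for name, fired in asdict(ctx).items() if fired]


def should_escalate(ctx: EscalationContext) -> bool:
    """Escalate when at least one condition holds (logical OR)."""
    return bool(escalation_triggers(ctx))
```

Returning the trigger list lets the workflow router populate the handoff reason directly from the policy evaluation.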
4.3 Escalation Matrix
| Scenario | AI can do | AI must not do | Escalation route | Handoff payload |
|---|---|---|---|---|
| Credit eligibility | Explain general eligibility factors and list missing application data | Say the customer is approved, likely approved, or should apply based on unsupported inference | Credit ops or lending specialist | application id, reason codes, missing evidence, source docs, customer question |
| Wealth advice boundary | Provide neutral education and explain that personal recommendation requires authorized channel | Recommend buy/sell/hold, asset allocation, tax move or guaranteed outcome | Licensed advisor / wealth compliance | user intent, product mentioned, risk profile status, boundary reason |
| Payment dispute claim | Collect facts, explain process, list documents, state timelines | Promise chargeback success or assign fault without investigation | Dispute operations | transaction refs, customer narrative, merchant info, evidence gaps |
| AML analyst assistant | Summarize alerts, extract entities, cite transaction evidence | Decide SAR filing, close a case on its own, reveal SAR-sensitive reasoning to front office | AML investigator / BSA officer | alert id, entity graph, evidence refs, AI uncertainty flags |
| KYC document extraction | Extract fields, flag unreadable/contradictory docs, request re-upload | Declare identity verified when evidence fails validation | KYC operations / identity platform | doc refs, extraction confidence by field, validation failures |
| Complaint handling | Classify issue, preserve narrative, summarize facts, route case | Dismiss complaint, make legal conclusion, promise compensation | Complaint ops / legal-compliance review | complaint id, original text, harm signal, AI touchpoint refs |
| Contact center | Draft response, suggest next best question, summarize call | Hide uncertainty from agent, push sales during hardship or complaint | Supervisor / specialist queue | transcript, suggested script, confidence notes, prohibited content flags |
5. Confidence And Explanation UX Rules
5.1 Do Not Expose Raw Confidence By Default
Raw percentages often create false precision. A customer who sees "82% confident" may misunderstand:
- 82% of what: factual accuracy, policy fit, retrieval match, approval chance, or model probability?
- Is 82% safe enough for a dispute claim, wealth advice boundary or KYC decision?
- Is the number calibrated for this customer segment and distribution?
Recommended pattern:
| Audience | Preferred expression | Avoid |
|---|---|---|
| Customer | "I can confirm this from your transaction record" / "I need one more detail before I can explain the next step" | "82% confident" |
| Frontline employee | evidence status + confidence band + required review action | single model score without reason |
| Analyst / reviewer | field-level confidence, evidence refs, contradictions, uncertainty reason | green/yellow/red without inspectable evidence |
| Governance / model risk | metric distributions, error bands, override rates, harm outcomes | cherry-picked demos |
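
The audience split above can be sketched as a rendering function in which the customer branch never receives the number. The 0.8 and 0.5 band cut-offs and the `evidence_status` values are assumptions for illustration only; the two customer sentences are the ones suggested in the table.

```python
def render_confidence(audience: str, score: float, evidence_status: str) -> str:
    """Render an uncertainty signal for one audience (illustrative thresholds)."""
    band = "high" if score >= 0.8 else "medium" if score >= 0.5 else "low"

    if audience == "customer":
        # Customers see evidence-based language, never a raw score or band.
        if evidence_status == "complete":
            return "I can confirm this from your transaction record."
        return "I need one more detail before I can explain the next step."

    if audience == "frontline":
        # Employees see evidence status, a band and the required review action.
        review = "no" if band == "high" and evidence_status == "complete" else "yes"
        return f"evidence={evidence_status}; band={band}; review_required={review}"

    if audience == "reviewer":
        # Analysts may inspect the raw score alongside the evidence status.
        return f"evidence={evidence_status}; band={band}; score={score:.2f}"

    raise ValueError(f"unknown audience: {audience}")
```

Raising on an unknown audience is a deliberate choice in this sketch: a new channel should not silently inherit the most detailed view.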
5.2 Confidence Language Guide
| Situation | Better language | Why |
|---|---|---|
| Evidence is strong | "Based on the posted transaction and current dispute policy, the next step is..." | Names evidence and policy scope |
| Evidence is missing | "I can explain the process, but I need the merchant response date to assess the next step." | Separates process education from case-specific conclusion |
| Policy boundary exists | "I can provide general information. A personal investment recommendation requires a licensed advisor." | Refuses overreach while preserving help |
| Conflict exists | "Your upload and the system record do not match, so this needs review before we rely on it." | Explains why automation paused |
| Tool failed | "I cannot verify the account status right now. I will not guess; I can create a follow-up case." | Prevents hallucinated system state |
| Customer harm signal | "This may affect your account access, so I am routing it to a specialist and preserving the details you shared." | Signals seriousness and action |
5.3 Explanation Rules
- Explain the decision class, not the model internals.
- Tie confidence language to evidence quality, not AI personality.
- Show what is known, unknown, assumed and next.
- For customer-impacting decisions, provide reason codes and recourse path where policy allows.
- For employee copilots, show conflicting evidence before recommended action.
- For regulated advice boundaries, use approved copy and avoid personalized recommendations.
- Never use confidence language to soften a hard compliance boundary.
- Never use uncertainty as a way to avoid a complaint, dispute, appeal or accessibility path.
6. Financial Retail Scenarios
6.1 Credit Eligibility Assistant
Bad pattern:
"You are likely eligible for this credit card."
Why it fails:
- It may imply pre-approval.
- It may hide missing income, credit bureau, identity or product eligibility checks.
- It may create unfair treatment if the confidence threshold behaves differently across segments.
Better architecture:
| Step | Design |
|---|---|
| Intent | credit_eligibility_question |
| Boundary | AI can explain criteria, not approve or predict approval unless sourced from official prequalification engine |
| Evidence inputs | application status, product rules, adverse action / reason code policy, customer consent |
| Action | qualified answer or ask more; escalate for exceptions |
| UX | "I can explain the factors used. A final decision comes from the credit decisioning process." |
| Evidence retained | policy id, product version, customer-safe copy id, model trace, handoff reason |
6.2 Wealth Advice Boundary
User asks: "Should I sell my bond fund and buy the high-yield product?"
Architecture response:
- Classify as personalized investment advice intent.
- Check channel and role permissions.
- Provide neutral education about risk, fees, liquidity and suitability concepts.
- Refuse buy/sell recommendation.
- Offer licensed advisor handoff.
The uncertainty is not only model uncertainty. It is authorization, suitability, profile completeness, licensing and conduct risk uncertainty.
6.3 Payment Dispute Claims
User asks: "Will I win this dispute?"
AI may:
- Collect transaction details, merchant interaction, delivery evidence, cancellation date.
- Explain dispute process and timelines.
- Identify missing evidence.
- Create case and route high-value or vulnerable customer cases.
AI must not:
- Promise outcome.
- Invent chargeback reason.
- Tell customer to misstate facts.
- Hide uncertainty about network rules or merchant evidence.
6.4 AML Analyst Assistant
Analyst asks: "Is this suspicious?"
AI may:
- Summarize alert facts.
- Extract counterparties, typologies, transaction sequence and anomalies.
- Cite source transactions and unresolved evidence gaps.
- Suggest investigation questions.
AI must not:
- Make SAR filing decision.
- Close case without reviewer.
- Generate unsupported suspicion narrative.
- Reveal SAR-sensitive reasoning to customer-facing teams.
Uncertainty UX here is employee-facing: field-level confidence, contradictory facts, missing KYC context and reviewer confirmation workflow.
6.5 KYC Document Extraction
AI extracts name, date of birth, address and document number.
Advanced design:
| Field | UX / Ops behavior |
|---|---|
| High confidence and validation match | Auto-fill, show source crop to reviewer if sampled |
| Medium confidence | Highlight field for review, preserve extraction alternative |
| Low confidence or unreadable | Request re-upload or manual review |
| Document / application mismatch | Pause verification, route discrepancy queue |
| Expired or unsupported document | Explain acceptable document types, do not infer exception |
The key is field-level abstention. The system can accept name while abstaining on address, rather than failing the whole journey or hallucinating a value.
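Field-level abstention can be sketched as a decision made per extracted field rather than per document. The confidence bands, the action names and the rule order below are assumptions that mirror the table rows; the expired or unsupported document row is a document-level check and is left out of this sketch.

```python
def kyc_field_action(confidence: str, validation_match: bool, readable: bool = True) -> str:
    """Decide what to do with one extracted field (illustrative rule order)."""
    if not readable or confidence == "low":
        return "request_reupload_or_manual_review"
    if not validation_match:
        # Extracted value disagrees with the application record.
        return "pause_and_route_discrepancy_queue"
    if confidence == "high":
        return "auto_fill"
    return "highlight_for_review"  # medium confidence


# Each field maps to (confidence band, validation match); values are made up.
extracted = {
    "name": ("high", True),
    "date_of_birth": ("medium", True),
    "address": ("low", True),
    "document_number": ("high", False),
}

decisions = {
    field: kyc_field_action(confidence, match)
    for field, (confidence, match) in extracted.items()
}
```

In this example the name is accepted while the address is sent back for re-upload, so one weak field does not fail the whole journey.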
6.6 Complaint Intelligence
Complaint AI should treat uncertainty as a preservation problem:
- Preserve original narrative.
- Summarize with span citations.
- Separate allegation, verified fact, policy issue, harm signal and root cause hypothesis.
- Escalate legal-sensitive, vulnerable customer, repeated harm and regulatory-source cases.
Never let "low confidence" become a reason to not create a complaint case. In complaints, ambiguity often increases the need to preserve and route.
6.7 Contact Center Copilot
Agent sees AI suggestion:
Recommended response: Explain dispute process and ask for merchant cancellation evidence.
Confidence: Medium
Why: Transaction is posted and policy found, but customer cancellation date is missing.
Do not say: "You will get your money back."
Escalate if: customer says hardship, fraud, legal threat, regulator, or prior unresolved complaint.
The agent does not need model math. The agent needs action-safe guidance, forbidden language, evidence and escalation triggers.
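Forbidden language can be checked mechanically before a draft reaches the agent or the customer. This is a minimal sketch using two hypothetical patterns built from the bad examples quoted in this document; a real approved language service would hold a reviewed, versioned pattern set.

```python
import re

# Pattern names and regexes are illustrative, derived from the document's bad examples.
FORBIDDEN_PATTERNS = {
    "outcome_promise": re.compile(r"\byou will get your money back\b", re.IGNORECASE),
    "approval_promise": re.compile(r"\byou are (likely )?(approved|eligible)\b", re.IGNORECASE),
}


def scan_draft(text: str) -> list:
    """Return the names of every forbidden pattern found in a draft response."""
    return sorted(name for name, pattern in FORBIDDEN_PATTERNS.items() if pattern.search(text))
```

A non-empty result would block the draft or route it for rewrite against approved copy; pattern matching catches known phrasings only, so it complements human QA rather than replacing it.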
7. Metrics / Control / Evidence Model
7.1 Metrics
| Metric | What it detects | Owner |
|---|---|---|
| Abstention rate by class | Whether refusals / asks / escalations are balanced by reason | Product / Governance |
| Wrong-answer rate by impact tier | Whether high-impact answers have unacceptable error | Model Risk / Product |
| Unsupported claim rate | Whether outputs exceed evidence or approved language | Compliance / QA |
| Escalation precision and recall | Whether the right cases go to humans | Ops / Risk |
| Handoff completion SLA | Whether escalation is real service, not deflection | Operations |
| Human override rate | Whether AI action policy is miscalibrated | Product / Ops |
| Appeal / complaint rate after AI interaction | Whether AI is creating customer harm | Complaint / Conduct Risk |
| Evidence completeness score | Whether each decision can be replayed | Audit / Governance |
| Customer trust calibration | Whether users understand AI boundaries without abandoning valid journeys | CX / Research |
| Segment disparity in abstention | Whether certain groups get more friction or fewer helpful answers | Fairness / Risk |
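
Escalation precision and recall can be computed from reviewed cases once each case carries a label for whether it should have gone to a human. The sketch below assumes such labels exist, for example from QA sampling; the input shape is hypothetical.

```python
def escalation_precision_recall(cases: list) -> tuple:
    """cases: list of (escalated, should_have_escalated) boolean pairs.

    Returns (precision, recall); a value is None when its denominator is zero.
    """
    tp = sum(1 for escalated, needed in cases if escalated and needed)
    fp = sum(1 for escalated, needed in cases if escalated and not needed)
    fn = sum(1 for escalated, needed in cases if not escalated and needed)

    precision = tp / (tp + fp) if (tp + fp) else None  # escalations that were warranted
    recall = tp / (tp + fn) if (tp + fn) else None     # warranted cases that were escalated
    return precision, recall
```

Low precision points to escalation dumping on human queues; low recall points to high-impact cases being handled automatically.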
7.2 Control Model
| Control | Design |
|---|---|
| Intended-use inventory | Each use case records allowed answer classes, prohibited actions and impact tier |
| Policy decision table | Maps confidence/evidence/impact/boundary to answer / ask / refuse / escalate |
| Approved language library | Versioned copy for boundary, refusal, disclosure, escalation and uncertainty |
| Evidence ledger | Captures input refs, retrieved sources, tool responses, policy ids, model versions and output hashes |
| Human handoff protocol | Defines queue, SLA, owner, required payload, customer message and closure event |
| QA and red-team eval | Tests low-evidence, conflicting-evidence, high-impact and adversarial cases |
| Monitoring thresholds | Alerts on abnormal abstention, complaint, override, error or segment friction |
| Incident and CAPA loop | Connects harmful AI interactions to root cause, fix, validation and policy update |
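
The policy decision table control can be sketched as data evaluated by first match, which keeps the rules reviewable outside the prompt. The rows, fact names and precedence below are assumptions for illustration, not a recommended policy; the default fails closed to escalation.

```python
# Each row: (conditions that must all match, resulting action). First match wins.
POLICY_TABLE = [
    ({"unsafe_instruction": True}, "block"),
    ({"boundary": "regulated_advice"}, "safe_refusal"),
    ({"evidence": "conflicting"}, "escalate"),
    ({"evidence": "missing"}, "ask_more"),
    ({"evidence": "complete"}, "answer"),
]


def decide(facts: dict) -> str:
    """Return the action of the first row whose conditions all hold."""
    for conditions, action in POLICY_TABLE:
        if all(facts.get(key) == value for key, value in conditions.items()):
            return action
    # No row matched: fail closed and hand the case to a human.
    return "escalate"
```

Because the table is plain data, each row can carry a policy id and version in a fuller implementation, which is what lets logs point back to the rule that fired.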
7.3 Evidence Packet
A replayable uncertainty decision needs:
uncertainty_event_id: ux-2026-06-30-000169
use_case: payment_dispute_assistant
intent: dispute_outcome_question
customer_impact_tier: high
evidence_state: missing_customer_cancellation_date
confidence_summary:
  retrieval: high
  tool_validation: high
  case_specific_conclusion: low
policy_decision: partial_answer_ask_more
policy_ids:
  - dispute_uncertainty_policy_v4
  - prohibited_outcome_promise_v2
approved_language_ids:
  - dispute_process_explanation_v7
output_class: partial_answer
escalation:
  required: false
  trigger_checked: hardship_or_fraud_signal
sources:
  - transaction_record_ref
  - dispute_policy_ref
human_feedback:
  override: false
monitoring_tags:
  - high_impact
  - missing_evidence
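A packet like the one above is only replayable if it is complete and tied to the exact output the user saw. The sketch below assumes a hypothetical required-key set drawn from the example fields, and uses a SHA-256 digest as one possible form of the output hash that the evidence ledger records.

```python
import hashlib

# Keys treated as mandatory in this sketch; a real schema would be owned by governance.
REQUIRED_KEYS = {
    "uncertainty_event_id", "use_case", "intent", "customer_impact_tier",
    "evidence_state", "policy_decision", "policy_ids",
    "approved_language_ids", "output_class", "sources",
}


def output_hash(rendered_output: str) -> str:
    """Digest of the exact text shown to the user, for later replay checks."""
    return hashlib.sha256(rendered_output.encode("utf-8")).hexdigest()


def missing_keys(packet: dict) -> list:
    """Required keys that are absent or empty in the packet."""
    return sorted(key for key in REQUIRED_KEYS if not packet.get(key))


packet = {
    "uncertainty_event_id": "ux-2026-06-30-000169",
    "use_case": "payment_dispute_assistant",
    "intent": "dispute_outcome_question",
    "output_hash": output_hash("I can explain the process, but I need the cancellation date."),
}
```

Here `missing_keys(packet)` would list the fields still to be filled, which is one way to feed the evidence completeness score in 7.1.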
8. Anti-Patterns And Failure Modes
| Anti-pattern | Failure mode | Corrective design |
|---|---|---|
| Confidence theater | UI shows precise score but no evidence or action rule | Replace with evidence state, action class and next step |
| One global threshold | Same confidence cutoff used for FAQ, credit, complaints and AML | Thresholds by impact, reversibility, policy boundary and evidence |
| Refusal as dead end | AI says cannot help and ends the journey | Safe refusal with allowed alternative, handoff or information request |
| Escalation dumping | Low confidence floods human queues | Escalation taxonomy, triage, SLA and capacity model |
| Hidden uncertainty from employees | Copilot gives polished answer without showing weak evidence | Reviewer UI with missing facts, contradictions and confidence by field |
| Over-disclosure to customers | Reveals internal AML/fraud rules or sensitive thresholds | Customer-safe explanation layer and role-based evidence |
| Under-disclosure to customers | "AI may be wrong" without concrete limitation or recourse | Boundary-specific language and next step |
| Partial answer missing boundary | AI answers generic part then implies case-specific decision | Segment answer by known / unknown / cannot determine |
| Human-in-the-loop myth | Human receives case but no evidence, no time, no authority | Handoff packet, owner, SLA, decision rights |
| Monitoring blind spot | Team tracks accuracy but not harm, complaints or overrides | Connect telemetry to complaint, appeal, QA and CAPA |
9. Architecture Mapping To RAG / Agent / Copilot / Eval / Governance
| Architecture style | Uncertainty problem | Required control |
|---|---|---|
| RAG assistant | Retrieval may be stale, missing, conflicting or unauthorized | source registry, freshness, permission filtering, citation quality, conflict detector |
| Tool-using agent | Tool may fail, return partial data or execute high-impact action | tool confidence, action approval, reversible/irreversible classification, execution guard |
| Employee copilot | Employee may over-trust fluent drafts | evidence panel, required review fields, forbidden language, adoption monitoring |
| Customer chatbot | Customer may treat response as promise or advice | capability framing, boundary copy, escalation path, complaint/dispute/appeal routing |
| KYC / document AI | Field extraction varies by field and document quality | field-level confidence, validation, exception queue, source image evidence |
| AML / fraud assistant | Sensitive reasoning and case decision boundaries | analyst-only workspace, SAR-sensitive controls, case owner approval |
| Eval platform | Offline accuracy misses uncertainty behavior | scenario eval for abstention, partial answer, ask-more, refusal and handoff |
| Governance system | Policies are not connected to runtime behavior | policy ids in logs, evidence packets, review cadence, residual risk acceptance |
9.1 Eval Design
Uncertainty eval must include:
- Strong-evidence answer cases.
- Missing-evidence cases.
- Conflicting-source cases.
- Regulated-advice boundary cases.
- High-impact irreversible action cases.
- Sensitive information / authorization cases.
- Harm, complaint and vulnerable-customer cases.
- Tool outage and degraded RAG cases.
- Multi-turn cases where user pressure attempts to force a conclusion.
Score dimensions:
| Dimension | Pass condition |
|---|---|
| Action selection | correct answer / partial / ask / refusal / escalation |
| Language safety | no unsupported promise, advice or disclosure |
| Evidence use | cited source supports the claim |
| Boundary clarity | user understands what AI can and cannot do |
| Handoff quality | correct queue, payload and customer message |
| Recovery | user gets a useful next step |
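
The score dimensions can be applied as an all-must-pass check per eval case. This sketch assumes each dimension is judged upstream (by a rubric or reviewer) as True, False, or None when it does not apply, for example handoff quality on a case with no handoff; an unscored dimension counts as a failure.

```python
DIMENSIONS = (
    "action_selection", "language_safety", "evidence_use",
    "boundary_clarity", "handoff_quality", "recovery",
)


def score_case(results: dict) -> dict:
    """Pass only if no dimension is False or missing; None means not applicable."""
    failed = [d for d in DIMENSIONS if results.get(d, False) is False]
    return {"passed": not failed, "failed_dimensions": failed}
```

Reporting the failed dimensions, not just pass or fail, shows whether a regression is in action selection, language or handoff.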
10. ADR Draft
# ADR: Adopt Uncertainty Decision Policy And Escalation Architecture
Date: 2026-06-30
Status: Proposed
## Context
AI assistants in credit, wealth, disputes, AML, KYC, complaints and contact center workflows produce outputs with varying evidence quality, policy certainty and customer impact. A single confidence score is not sufficient to decide what users should see or when humans should intervene.
## Decision
We will implement an uncertainty decision architecture that combines intent, impact tier, evidence state, policy boundary, authorization, model/retrieval/tool confidence and reversibility. Runtime outputs must be classified as direct answer, qualified answer, partial answer, ask more, safe refusal, human escalation or block. Each customer-impacting uncertainty decision must create an evidence packet with policy ids, source refs, approved language ids and monitoring tags.
## Consequences
Benefits:
- Reduces unsupported promises, advice-boundary breaches and hallucinated system states.
- Improves customer trust by giving specific next steps instead of vague disclaimers.
- Gives operations usable handoff payloads and clear ownership.
- Enables model risk, audit, complaint and governance teams to replay decisions.
Trade-offs:
- Requires policy decision tables, approved language assets and workflow integration.
- May initially increase ask-more and escalation volume while thresholds are tuned.
- Requires cross-functional ownership across Product, Risk, Compliance, Ops, CX and Technology.
## Alternatives Considered
1. Show raw confidence score to users.
- Rejected because it creates false precision and does not encode policy or impact.
2. Use a single confidence threshold for human handoff.
- Rejected because high confidence can still violate advice, authorization or evidence boundaries.
3. Let the LLM decide when to refuse or escalate.
- Rejected because abstention and escalation are regulated product controls, not prompt-only behavior.
11. Interview Answer
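30-second version
AI uncertainty UX is not a confidence score shown on the interface; it is a product and architecture control system. In financial retail, intent, evidence, policy boundary, customer impact, authorization and reversibility together decide whether the AI answers, gives a qualified answer, gives a partial answer, asks for more, refuses, escalates or blocks. The key is to connect uncertainty language, abstention taxonomy, human handoff, approved copy, monitoring and audit evidence, so that customers do not take the AI's uncertain output as a promise, advice or a ruling.
2-minute version
I would design uncertainty UX as a runtime decision policy. Step one identifies user intent and impact tier, such as credit eligibility, wealth advice, payment dispute, AML review, KYC extraction or complaint. Step two assesses evidence state: whether sources are sufficient, fresh, consistent and visible under the user's authorization. Step three brings in model, retrieval and tool confidence without relying on the scores alone. Step four uses the policy engine to choose the action class: direct answer, qualified answer, partial answer, ask more, safe refusal, human escalation or block.
For example, when a customer asks "Will I win this dispute?", the AI should not give a success probability or promise an outcome. It can explain the process, collect evidence, state what material is missing, and escalate on high amounts, suspected fraud, vulnerable customers or complaint signals. In the wealth scenario, when a user asks whether to sell a fund, the uncertainty is not that the model does not know; licensing, suitability and the advice boundary mean the AI can only educate and hand off to a licensed advisor.
Architecturally, I would implement an intent classifier, impact classifier, evidence state service, policy boundary engine, approved language library, workflow router, evidence ledger and monitoring. Metrics go beyond accuracy to cover unsupported claims, abstention by class, escalation precision/recall, override, appeal, complaint and segment friction. That is how the AI balances being useful, honest, auditable and operable.
CTO version
I would not treat uncertainty as the model team's calibration project; I would put it in the platform control plane. The core object is the uncertainty_decision_event: intent, impact, evidence state, policy boundary, authorization, model/retrieval/tool confidence, action class, approved language id, handoff id and output hash.
At runtime, the LLM cannot decide on its own to give out-of-bounds advice, make customer commitments or escalate to a human. It calls a policy decision service whose inputs come from the RAG source registry, tool validation, customer/account context, role permission, risk tier and use-case policy. The output is the permitted action class and copy constraints. For agentic workflows, irreversible tool actions need a stronger gate; for copilots, the reviewer UI must show missing evidence and contradictions; for customer-facing flows, the raw score is hidden by default and only the evidence scope, boundary and next step are shown.
On governance, every uncertainty decision can be replayed: which model was used, which sources were retrieved, which policy version allowed a partial answer, why there was no escalation, what the customer saw, whether an employee overrode it, and whether a complaint or appeal followed. This lets us connect NIST RMF, ISO 42001 and model risk from document governance to runtime evidence.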
12. 7-Day Practice Plan
| Day | Practice | Output |
|---|---|---|
| Day 1 | Pick one financial retail AI scenario, such as a payment dispute assistant, and draw the answer / ask / refuse / escalate state machine | one-page state machine |
| Day 2 | Write an abstention taxonomy covering at least evidence, policy, authorization, harm, tool failure and unsafe instruction | taxonomy table |
| Day 3 | Write a confidence language guide for credit eligibility and the wealth advice boundary | approved / forbidden language table |
| Day 4 | Design field-level confidence and review UI requirements for KYC document extraction | field-level decision matrix |
| Day 5 | Write a handoff evidence packet schema for the AML analyst assistant | evidence YAML and owner map |
| Day 6 | Design an eval set: strong evidence, missing evidence, conflicting evidence, boundary breach, tool outage, complaint signal | 30-case eval matrix |
| Day 7 | Write a one-page ADR explaining why not to use a raw confidence score or a single threshold | portfolio ADR |
13. Portfolio Takeaway
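Mature AI product managers and architects do not stop at a question as simple as "how confident is the model". They ask:
- What counts as sufficient evidence in this scenario?
- Which answers will customers read as a promise, advice or a ruling?
- Which uncertainties should be exposed to customers, and which only to employees or reviewers?
- When is a partial answer better than a refusal?
- When does asking for more add friction but reduce harm?
- Which escalations are a service commitment rather than risk dumping?
- Six months later, when a complaint, audit or regulatory inquiry arrives, can we reconstruct the facts, rules, model and human judgment of that moment?
The advanced capability in uncertainty UX is turning "the AI is uncertain" into a clear, honest, executable and auditable service action.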