AI Reasoning Budget / Test-Time Compute / Verifier Cascade Playbook
Version: v1.0
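Date: 2026-06-30
Audience: AI product managers, CBAP / BA, enterprise architects, solution architects, AI Governance Leads, model risk management, retail financial services business owners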
Purpose and when to use
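This playbook turns AI reasoning capability from a “model capability” into a “product and architecture operating capability”. It helps teams answer:
- Which tasks can be answered quickly, and which must spend more test-time compute?
- Which tasks need RAG, tools, planner/solver/checker, a verifier cascade, or human review?
- How do we make explainable trade-offs among cost, latency, quality, customer impact and regulatory evidence?
- How do we demonstrate to audit, model risk, compliance and business owners that the system does not generate arbitrarily?
- How do we give users an understandable rationale without leaking hidden chain-of-thought?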
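Applicable scenarios:
| Scenario | Why use it |
|---|---|
| AML investigation copilot | Must synthesize evidence from transactions, customer profile, typology and sanctions/PEP tools, while the final judgment stays with the analyst |
| Credit policy reasoning | Must explain policy, call calculation tools, and control fair lending and adverse action evidence |
| Payment dispute reasoning | Must balance transaction evidence, reason codes, deadlines, customer notification and human review |
| Complaints root-cause analysis | Needs cross-channel evidence, classification rules, root-cause hypotheses and regulatory response deadlines |
| Contact center policy QA | Needs low-latency answers while guaranteeing policy sources, prohibited phrasing and escalation paths |
Non-applicable scenarios:
| Scenario | Reason |
|---|---|
| Pure marketing copy ideation | Lightweight content moderation is enough; a full verifier cascade is not needed |
| Internal drafts with no customer impact | Can use the low-budget path; focus on controlling data leakage |
| Decisions fully determined by a rules engine | AI may explain or summarize, but should not replace deterministic rules |
| Final legal or compliance opinions | AI may assist with retrieval and organization; the final opinion comes from authorized personnel |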
Operating model
1. Governance loop
flowchart LR
A[Use case intake] --> B[Risk and complexity rubric]
B --> C[Budget policy]
C --> D[Workflow orchestration]
D --> E[Verifier cascade]
E --> F[Output, abstention, escalation]
F --> G[Evidence packet]
G --> H[Monitoring and review]
H --> C
2. Roles and ownership
| Role | Accountabilities |
|---|---|
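| Product owner | Defines the user journey, business value, SLOs, acceptable wait time and failure experience |
| BA / CBAP | Breaks down business rules, exception paths, input fields, acceptance examples and evidence requirements |
| Solution architect | Designs the workflow, RAG, tool integration, verifier cascade and fallback |
| AI platform owner | Provides the budget classifier, orchestrator, verifier registry and trace/evidence store |
| Business control owner | Defines human review, approvals, sampling, thresholds and stop conditions |
| Model risk / governance | Approves risk tiering, the eval plan, release gates and ongoing monitoring |
| Operations lead | Manages human review queues, SLAs, quality sampling and the feedback loop |
| Audit evidence owner | Maintains evidence packet standards, retention periods and access controls |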
3. Lifecycle
| Phase | Decisions |
|---|---|
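| Intake | Whether the use case involves customer impact, regulatory obligations, sensitive data or automated actions |
| Design | Which budget tier, which tools, which verifiers, and when to abstain |
| Build | Make the budget policy configurable; wire in trace/span and the evidence packet |
| Eval | Test quality, cost, latency, unsupported claims and human override by tier |
| Release | Approve the release evidence packet; define rollback and degraded mode |
| Operate | Monitor tier mix, SLOs, cost, escalation rate, QA findings and complaint feedback |
| Improve | Update the rubric, knowledge base, verifiers and training examples based on failure cases |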
Template: reasoning budget policy
| Policy field | Tier 0 Fast path | Tier 1 Evidence path | Tier 2 Deliberation path | Tier 3 Controlled decision path |
|---|---|---|---|---|
| Task examples | 低风险 FAQ、内部改写 | 政策 QA、客服话术 | 支付争议初判、投诉 RCA | AML、信贷边界、客户不利影响支持 |
| Max model calls | 1 | 1-2 | 2-4 | 4-8 plus human gate |
| Retrieval | optional | required for factual/policy claims | required, targeted re-query allowed | required, multi-source evidence required |
| Tools | none or read-only | read-only policy/source lookup | deterministic calculators, case lookup | approved calculators, risk tools, workflow tools |
| Verifiers | schema, safety | schema, citation, freshness | citation, calculation, policy, contradiction | full cascade plus expert review |
| Latency target | under 2s if channel requires | under 5s | under 30s or async | async or queue-based |
| Customer impact | none | informational | operational recommendation | high-impact recommendation only |
| Human review | none | sample-based | exception-based or sample-based | mandatory before final action |
| Evidence | minimal logs | source refs and output hash | trace, sources, verifier results | full evidence packet |
| Allowed output | answer | answer with citation and limits | recommendation with confidence and next step | draft/recommendation, not autonomous decision |
| Stop condition | safety fail | no source support | verifier fail, missing evidence | human gate, policy conflict, evidence conflict |
Policy statement:
The AI system must assign every request to a reasoning budget tier before generation. The tier determines retrieval, tool access, verifier checks, latency target, human review requirement, and evidence retention. Budget escalation must be justified by customer impact, uncertainty, policy edge, evidence gap, or user challenge. Budget caps cannot override authority, policy, or regulatory controls.
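The policy statement can be expressed as configuration rather than prompt text. The following is a minimal sketch, assuming the upper bound of each range in the table is used as the cap; `TierPolicy`, `BUDGET_POLICY` and `plan_run` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    name: str
    max_model_calls: int                # upper bound of the range in the policy table
    retrieval: str                      # "optional" or "required"
    human_review: str
    latency_target_s: Optional[float]   # None means async or queue-based


# Values taken from the reasoning budget policy table (illustrative encoding).
BUDGET_POLICY = {
    0: TierPolicy("Fast path", 1, "optional", "none", 2.0),
    1: TierPolicy("Evidence path", 2, "required", "sample-based", 5.0),
    2: TierPolicy("Deliberation path", 4, "required",
                  "exception-based or sample-based", 30.0),
    3: TierPolicy("Controlled decision path", 8, "required",
                  "mandatory before final action", None),
}


def plan_run(tier: int, requested_model_calls: int) -> dict:
    """Resolve the controls for one request before any generation happens."""
    policy = BUDGET_POLICY[tier]
    return {
        "tier": tier,
        # The cap limits spend only; it never removes a control.
        "model_calls": min(requested_model_calls, policy.max_model_calls),
        "retrieval_required": policy.retrieval == "required",
        # The human gate depends on the tier, not on the remaining budget.
        "human_gate": tier == 3,
        "latency_target_s": policy.latency_target_s,
    }
```

For example, `plan_run(3, 20)` caps the run at 8 model calls but still sets `human_gate` to `True`, which reflects the rule that budget caps cannot override authority or regulatory controls.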
Template: task complexity rubric
Score each dimension from 0 to 3. Use the total and any hard trigger to choose a budget tier.
| Dimension | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Customer impact | no impact | informational | operational next step | adverse, financial, legal, regulatory impact |
| Evidence dependency | answer from approved static content | one policy source | multiple systems or versions | conflicting, missing, or regulated evidence |
| Rule complexity | no rule | simple policy clause | multiple conditions | exception, cross-jurisdiction, time-sensitive rule |
| Data sensitivity | public/internal | customer PII | account/transaction/credit data | suspicious activity, hardship, protected class risk |
| Action authority | no action | advice only | workflow recommendation | decision support for controlled action |
| Latency pressure | batch | standard UI | live agent assist | real-time customer conversation with escalation risk |
| Uncertainty | deterministic | low ambiguity | multiple plausible outcomes | high ambiguity or disagreement |
Tier mapping:
| Score / trigger | Budget tier |
|---|---|
| 0-4 and no hard trigger | Tier 0 |
| 5-8 or factual/policy claim | Tier 1 |
| 9-13 or multi-step operational reasoning | Tier 2 |
| 14+ or hard trigger | Tier 3 |
Hard triggers for Tier 3:
| Trigger | Examples |
|---|---|
| Customer adverse impact | credit decline reason, fee reversal denial, account restriction recommendation |
| Regulatory reporting | AML SAR support, complaints regulatory classification |
| Protected-class or fair lending exposure | credit policy exception, affordability assessment |
| Material financial loss | dispute liability, scam reimbursement recommendation |
| Evidence conflict | transaction system conflicts with notes or customer statement |
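The rubric and tier mapping can be sketched as a small function. This is one possible reading, assuming that when several rows of the mapping apply the highest tier wins; the dimension keys and the function name `assign_tier` are hypothetical.

```python
DIMENSIONS = (
    "customer_impact", "evidence_dependency", "rule_complexity",
    "data_sensitivity", "action_authority", "latency_pressure", "uncertainty",
)


def assign_tier(scores, hard_trigger=False,
                factual_or_policy_claim=False, multi_step=False):
    """Map rubric scores (0-3 per dimension) and triggers to a budget tier."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= 3:
            raise ValueError(f"{dim} must be scored from 0 to 3")

    total = sum(scores[dim] for dim in DIMENSIONS)
    # Check the most restrictive tier first so a hard trigger always wins.
    if hard_trigger or total >= 14:
        return 3
    if multi_step or total >= 9:
        return 2
    if factual_or_policy_claim or total >= 5:
        return 1
    return 0
```

Under this reading, a request scored 1 on every dimension totals 7 and lands in Tier 1, while the same request with a hard trigger such as customer adverse impact goes to Tier 3 regardless of the total.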
Template: verifier cascade
| Stage | Verifier | Input | Pass criteria | Fail action | Evidence captured |
|---|---|---|---|---|---|
| V0 | Permission verifier | user, role, case id, account scope | user can access requested data and action | deny, redact, or route to authorized user | policy decision, user role, resource id hash |
| V1 | Completeness verifier | required fields by use case | all required fields present or explicitly unavailable | ask for missing field or abstain | missing field list |
| V2 | Retrieval verifier | query, source ids, policy versions | current approved sources retrieved | re-query, narrow scope, or escalate | source ids, index version, freshness |
| V3 | Claim support verifier | output claims, retrieved sources | each material claim has source support | rewrite or abstain | claim-source map |
| V4 | Tool verifier | calculations, timelines, risk scores | deterministic tool results match output | block output and rerun with tool result | tool name, input hash, result summary |
| V5 | Policy verifier | output, policy rules, prohibited statements | no prohibited promise, no unsupported decision, correct template | rewrite, escalate, or deny | policy id, rule id, failure tag |
| V6 | Consistency verifier | similar cases, prior disposition, current answer | no unexplained deviation from expected handling | reviewer queue or supervisor review | similarity ids, deviation reason |
| V7 | Human verifier | reviewer packet | approve, edit, reject, or request more evidence | hold final action | reviewer id, timestamp, decision |
Implementation guidance:
- Put deterministic and permission checks before expensive model checks.
- Do not let a later model verifier override a hard policy denial.
- Treat verifier disagreement as an escalation signal, not as a reason to keep retrying indefinitely.
- Version each cascade by use case, because AML, credit, payments, complaints and contact center need different controls.
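The ordering and stop rules in this guidance can be sketched as a short runner. This is a simplified illustration, not a full cascade: it covers only two deterministic stages, and the names `VerifierResult`, `run_cascade` and the case dictionary keys are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VerifierResult:
    stage: str
    passed: bool
    hard_denial: bool = False
    failure_tag: str = ""


Verifier = Callable[[Dict], VerifierResult]


def permission_verifier(case: Dict) -> VerifierResult:
    """V0: deterministic role check; a failure is a hard denial."""
    ok = case.get("role") in case.get("allowed_roles", [])
    return VerifierResult("V0", ok, hard_denial=not ok,
                          failure_tag="" if ok else "permission_denied")


def completeness_verifier(case: Dict) -> VerifierResult:
    """V1: every required field must be present in the case."""
    missing = [f for f in case.get("required_fields", [])
               if f not in case.get("fields", {})]
    return VerifierResult("V1", not missing, failure_tag=",".join(missing))


def run_cascade(verifiers: List[Verifier], case: Dict) -> Dict:
    """Run stages in list order and stop at the first failure.

    Stopping early means a later (possibly model-based) stage never runs
    after a hard denial, so it cannot override it. There is no retry loop.
    """
    results = []
    for verify in verifiers:
        result = verify(case)
        results.append(result)
        if not result.passed:
            outcome = "denied" if result.hard_denial else "needs_fail_action"
            return {"outcome": outcome, "stopped_at": result.stage,
                    "results": results}
    return {"outcome": "passed", "stopped_at": None, "results": results}
```

Placing cheap deterministic verifiers at the front of the list is what keeps expensive model checks off the path for requests that would be denied anyway. The caller would map `needs_fail_action` to the fail action defined for that stage in the cascade table.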
Template: rationale exposure rule
| Audience | Show | Do not show | Example |
|---|---|---|---|
| Customer | concise explanation, verified facts, next step, required documents | hidden reasoning, internal risk score, security rules, unverified suspicion | “We need merchant documentation before we can complete the dispute review.” |
| Frontline agent | approved answer, source title, policy snippet, escalation path | full scratchpad, unsupported alternatives, restricted compliance logic | “Use policy CARD-DISP-12. If customer says fraud, collect affidavit and escalate.” |
| Analyst | evidence summary, source refs, tool results, model recommendation, uncertainty | raw hidden chain-of-thought unless explicitly approved by policy | “Three transaction clusters match typology indicators T1 and T4; analyst review required.” |
| Supervisor | verifier failures, reviewer packet, exception reason | irrelevant generation drafts | “Policy verifier failed because proposed refund denial lacked deadline calculation.” |
| Audit / model risk | run metadata, source ids, verifier outcomes, human approvals, output hash | customer-facing chat only, hidden reasoning as proof | “Tier 3 case had V0-V6 results and human approval before final disposition.” |
Rule statement:
Expose concise rationale and evidence appropriate to the audience. Do not expose or rely on hidden model reasoning as customer explanation, regulatory evidence, or audit proof. The evidence packet must prove what the system used, checked, blocked, escalated, and delivered.
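One way to enforce the exposure rule is an allowlist per audience, so that anything not explicitly permitted is dropped. The sketch below assumes hypothetical field names derived from the “Show” column; `ALLOWED_FIELDS` and `build_view` are illustrative, not an existing API.

```python
# Allowlist per audience, derived from the "Show" column of the exposure table.
# Field names are hypothetical; hidden reasoning appears in no allowlist.
ALLOWED_FIELDS = {
    "customer": {"explanation", "verified_facts", "next_step",
                 "required_documents"},
    "frontline_agent": {"approved_answer", "source_title", "policy_snippet",
                        "escalation_path"},
    "analyst": {"evidence_summary", "source_refs", "tool_results",
                "model_recommendation", "uncertainty"},
    "supervisor": {"verifier_failures", "reviewer_packet", "exception_reason"},
    "audit": {"run_metadata", "source_ids", "verifier_outcomes",
              "human_approvals", "output_hash"},
}


def build_view(audience: str, record: dict) -> dict:
    """Return only the fields the audience is allowed to see."""
    if audience not in ALLOWED_FIELDS:
        raise ValueError(f"unknown audience: {audience}")
    allowed = ALLOWED_FIELDS[audience]
    return {key: value for key, value in record.items() if key in allowed}
```

An allowlist fails closed: a new field added to the run record, such as an internal risk score, stays hidden from every audience until someone deliberately adds it to a view.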
Template: cost-latency-risk scorecard
| Metric | Definition | Target / threshold | Owner | Action when breached |
|---|---|---|---|---|
| Tier mix | percent of requests by budget tier | agreed baseline by use case | product + platform | investigate drift or misclassification |
| p95 latency by tier | end-to-end workflow latency | Tier 1 under 5s, Tier 2 under 30s, Tier 3 per queue SLA | platform | optimize retrieval, tools, model, or async path |
| Cost per completed case | model + tool + human review cost | below approved business case threshold | product finance | cap budget or redesign workflow |
| Unsupported claim rate | material claims without source support | 0 for customer-facing policy claims | governance | block release or update RAG/verifier |
| Verifier hard-fail rate | percent blocked by policy/tool/schema | monitored by stage | control owner | fix upstream workflow or knowledge source |
| Abstention rate | percent insufficient evidence / cannot answer | use-case specific acceptable band | product + operations | tune evidence intake or escalation |
| Human override rate | percent edited/rejected by reviewer | threshold by risk tier | model risk | review prompt, model, policy, data, training |
| Repeat contact / rework | customer or operations repeat due to poor answer | below baseline | business owner | RCA and corrective action |
| Complaint or harm signal | complaints linked to AI-assisted handling | zero tolerance for severe cases | risk + compliance | incident process and control review |
| Audit evidence completeness | required fields present in packet | 100 percent for Tier 3 | evidence owner | block production release or case closure |
Decision rule:
A higher reasoning budget is justified only when it measurably improves evidence support, reduces human rework, lowers customer harm, improves compliance quality, or protects material financial value. It is not justified by model fluency alone.
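The fixed thresholds in the scorecard can be checked mechanically. The sketch below covers only the metrics with numeric targets stated in the table (p95 latency for Tier 1 and Tier 2, unsupported claim rate, Tier 3 evidence completeness); the metric keys and `scorecard_breaches` are hypothetical names.

```python
# Thresholds stated in the scorecard; Tier 3 latency follows the queue SLA
# and is therefore not checked here.
P95_LATENCY_LIMIT_S = {1: 5.0, 2: 30.0}


def scorecard_breaches(metrics: dict) -> list:
    """Return the names of scorecard metrics that breach a fixed threshold."""
    breaches = []
    latencies = metrics.get("p95_latency_s", {})
    for tier, limit in sorted(P95_LATENCY_LIMIT_S.items()):
        observed = latencies.get(tier)
        # "under 5s" means a value equal to the limit already breaches it.
        if observed is not None and observed >= limit:
            breaches.append(f"p95_latency_tier_{tier}")
    if metrics.get("unsupported_claim_rate_customer_facing", 0.0) > 0.0:
        breaches.append("unsupported_claim_rate")
    if metrics.get("tier3_evidence_completeness", 1.0) < 1.0:
        breaches.append("tier3_evidence_completeness")
    return breaches
```

Use-case-specific bands such as abstention rate or human override rate are left out because the scorecard does not fix their values; they would need thresholds agreed per use case.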
Template: release evidence packet
| Section | Required content |
|---|---|
| Use case summary | business objective, users, channels, customer impact, prohibited uses |
| Budget policy | tier definitions, assignment logic, hard triggers, budget caps |
| Workflow design | planner/solver/checker steps, RAG sources, tools, human gates, fallback |
| Verifier cascade | stages, pass/fail criteria, fail actions, owner, version |
| Data and knowledge sources | source systems, policy versions, index version, freshness controls |
| Eval results | golden set, challenge set, by-tier quality, unsupported claim, verifier fail, human override |
| Cost and latency | p50/p95 latency, cost per run, expected volume, capacity plan |
| Risk assessment | customer harm, regulatory exposure, privacy, security, fairness, operational resilience |
| Rationale exposure | customer, agent, analyst, supervisor, audit views |
| Evidence schema | run id, source refs, tool result summaries, verifier outcomes, output hash, reviewer actions |
| Operating model | RACI, review cadence, monitoring dashboard, incident workflow |
| Release decision | approvers, open exceptions, rollback/degraded mode, next review date |
Minimum Tier 3 evidence fields:
| Field | Required rule |
|---|---|
| run_id | unique and traceable to case/work item |
| use_case | matches approved AI use case registry |
| budget_tier | must equal Tier 3 for high-impact trigger |
| trigger_reason | explicit customer/regulatory/evidence reason |
| source_refs | source ids or hashes, not just free text |
| policy_versions | approved policy and rule versions |
| tool_results | deterministic result summaries and input hashes |
| verifier_results | pass/fail, rule ids, failure tags |
| output_hash | hash of final output or recommendation |
| human_action | reviewer decision before final action |
| retention_policy | mapped to business and regulatory retention |
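A completeness check over these fields could gate case closure. The sketch below assumes the packet is a flat dictionary and that `budget_tier` is stored as the integer 3; `validate_tier3_packet` is a hypothetical helper, and it checks presence only, not the content rules in the right-hand column.

```python
REQUIRED_TIER3_FIELDS = (
    "run_id", "use_case", "budget_tier", "trigger_reason", "source_refs",
    "policy_versions", "tool_results", "verifier_results", "output_hash",
    "human_action", "retention_policy",
)


def validate_tier3_packet(packet: dict) -> list:
    """Return a list of problems; an empty list means the packet is complete."""
    problems = []
    for field in REQUIRED_TIER3_FIELDS:
        value = packet.get(field)
        # Treat absent, None and empty containers or strings as missing.
        if value is None or value == "" or value == [] or value == {}:
            problems.append(f"missing:{field}")
    tier = packet.get("budget_tier")
    if tier is not None and tier != 3:
        problems.append("budget_tier must be 3 for a high-impact trigger")
    return problems
```

A non-empty result would block case closure or production release, in line with the 100 percent completeness target for Tier 3 in the scorecard.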
PM/BA/architecture questions
PM questions
| Question | Why it matters |
|---|---|
| Which user moments require speed, and which require correctness over speed? | Prevents one-size-fits-all SLO |
| What is the cost of a wrong answer, not just the cost of model tokens? | Connects budget to business value |
| When should the product say “I need more evidence”? | Makes abstention a designed experience |
| What should the user see when the system escalates? | Reduces confusion and trust loss |
| Which metrics prove high-budget reasoning is worth it? | Links architecture to adoption and ROI |
BA / CBAP questions
| Question | Why it matters |
|---|---|
| What fields are required before the AI can reason safely? | Prevents unsupported conclusions |
| Which policy clauses are deterministic rules vs interpretive guidance? | Decides tools vs model reasoning |
| What are the exception paths and who approves them? | Defines escalation gates |
| What evidence must be retained for audit or dispute? | Shapes evidence packet |
| What examples represent edge cases, not just happy paths? | Builds useful eval and UAT sets |
Architecture questions
| Question | Why it matters |
|---|---|
| Is budget tiering implemented as code/config or buried in prompts? | Determines governability |
| Which verifier stages are deterministic, model-based, or human? | Controls reliability and cost |
| How are tool permissions scoped by role and use case? | Prevents agent overreach |
| What trace/span attributes are mandatory? | Enables monitoring and incident replay |
| How does the system degrade when latency or cost budget is reached? | Protects SLO and user experience |
| How is hidden reasoning prevented from entering logs or user output? | Reduces privacy, security, and audit risk |
Release checklist
| Check | Pass criteria |
|---|---|
| Use case registered | AI use case has owner, scope, risk tier, prohibited uses |
| Budget policy approved | Tier mapping, hard triggers, caps and stop conditions approved |
| Complexity rubric tested | Sample cases mapped consistently by product, BA, risk and operations |
| RAG sources approved | Current policy versions, source owners and freshness checks documented |
| Tools approved | Each tool has permission scope, input validation, output contract and audit log |
| Verifier cascade implemented | V0-V7 stages configured or explicitly waived with approved rationale |
| Human gate working | Tier 3 actions cannot bypass reviewer approval |
| Rationale exposure tested | Customer/agent/audit views show allowed content only |
| Evidence packet complete | Required fields generated in test and production-like runs |
| Eval complete | Golden and challenge sets pass quality, safety, citation and policy gates |
| Cost/latency reviewed | p50/p95, max cost and capacity plan accepted |
| Monitoring dashboard ready | Tier mix, SLO, cost, verifier fail, override, abstention visible |
| Incident and rollback ready | Disable path, degraded mode and escalation contacts confirmed |
| Legal/compliance/model risk signoff | Required approvers recorded |
Release decision language:
The release is approved only for the registered use case, approved channels, configured budget tiers, listed tools, specified verifier cascade, and documented human gates. Any expansion to new actions, channels, data classes, or customer-impacting decisions requires a budget policy and evidence packet review.
Executive narrative
One-slide version
AI reasoning budget lets us spend intelligence where risk and value justify it. Instead of treating every request as a single model call, we classify each task by customer impact, evidence need, policy complexity and latency. Low-risk requests get fast answers. Policy questions get grounded answers with citations. Complex operational cases get decomposition, tools and verifier checks. High-impact financial retail decisions get controlled recommendations, human approval and audit evidence.
This approach improves quality without uncontrolled cost growth. It also makes governance practical: every high-risk AI run records what evidence was used, which tools were called, which verifiers passed or failed, who approved the final action and what was delivered. We do not expose hidden chain-of-thought or rely on it as audit evidence. We expose concise rationale and retain verifiable evidence.
Board / risk committee version
The control objective is not to make the model appear more confident. The control objective is to ensure that AI-supported work is risk-proportional, evidence-based, reviewable and interruptible. Reasoning budget and verifier cascade provide that operating model. They define when AI can answer, when it must retrieve sources, when it must call deterministic tools, when it must abstain, and when a human must decide.
For AML, credit, disputes, complaints and contact center use cases, this gives management a clear view of risk: tier mix, unsupported claim rate, verifier failures, human overrides, cost, latency, complaints and audit evidence completeness. It turns AI governance from a policy document into a measurable runtime control.
Technology leadership version
We should implement reasoning budget as a shared platform capability: budget classifier, orchestration templates, verifier registry, tool permission model, OpenTelemetry instrumentation and evidence store. Application teams configure use-case policies instead of hand-coding retry logic. This keeps cost, latency, model risk, security and audit evidence consistent across the AI portfolio.
Interview drills
Drill 1: Explain to a PM
Question: Why do we need a reasoning budget instead of just choosing the best model?
Answer: Because model quality is only one part of production quality. A contact center policy question, a payment dispute recommendation and an AML narrative have different risk, latency and evidence needs. Reasoning budget lets us route each request to the right level of retrieval, tool use, verification and human review. It prevents overspending on simple tasks and under-controlling high-impact tasks.
Drill 2: Explain to an architect
Question: What are the main components of verifier cascade architecture?
Answer: I would design a budget classifier, workflow orchestrator, RAG/source layer, deterministic tool layer, verifier registry, human review queue and evidence store. The cascade starts with permission and completeness, then source support, deterministic calculations, policy rules, consistency checks and finally expert review for high-impact cases. Each stage has pass/fail criteria and fail actions, and each run emits trace spans and evidence fields.
Drill 3: Explain to model risk
Question: How do you audit reasoning without storing chain-of-thought?
Answer: I would not use hidden chain-of-thought as audit evidence. I would store the approved sources, source ids, policy versions, input field hashes, tool result summaries, model/prompt config hash, verifier outcomes, output hash, abstention or escalation reason and human approval. That proves what the system used and checked without exposing internal scratchpad text.
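The answer can be made concrete with a small record builder that stores hashes and outcomes and has no parameter for scratchpad text. This is a sketch using Python's standard `hashlib` and `json`; the function and field names are hypothetical, and SHA-256 over canonical JSON is an assumption rather than something the playbook prescribes.

```python
import hashlib
import json
from typing import Any, Dict, List


def sha256_hex(value: Any) -> str:
    """Hash a JSON-serializable value in a key-order-independent way."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(run_id: str,
                       input_fields: Dict[str, Any],
                       source_ids: List[str],
                       policy_versions: List[str],
                       model_prompt_config: Dict[str, Any],
                       verifier_outcomes: List[Dict[str, Any]],
                       final_output: str,
                       human_approval: Dict[str, Any]) -> Dict[str, Any]:
    """Build an audit record from evidence only; no reasoning text is accepted."""
    return {
        "run_id": run_id,
        "source_ids": list(source_ids),
        "policy_versions": list(policy_versions),
        # Store one hash per input field instead of the raw values.
        "input_field_hashes": {name: sha256_hex(val)
                               for name, val in sorted(input_fields.items())},
        "config_hash": sha256_hex(model_prompt_config),
        "verifier_outcomes": list(verifier_outcomes),
        "output_hash": sha256_hex(final_output),
        "human_approval": dict(human_approval),
    }
```

Plain hashes of low-entropy fields can be recovered by enumeration, so a keyed hash may be preferable for sensitive inputs; that choice is outside what this playbook specifies.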
Drill 4: AML case
Question: The AML copilot generated a strong suspicious narrative, but the analyst says the evidence is weak. What should happen?
Answer: The verifier cascade should catch unsupported claims before delivery. If it fails after generation, the case should move to escalation or rewrite with only observed facts. The evidence packet should show which claims lacked transaction or typology support. The analyst can approve, edit or reject, and the failure should feed the unsupported-claim metric and knowledge/prompt improvement backlog.
Drill 5: Credit policy case
Question: Can a high reasoning budget justify an automated credit decline?
Answer: No. Budget can improve evidence gathering and recommendation quality, but it cannot grant decision authority. Credit decline or adverse action needs approved policy, rule or human decision process, fair lending controls, adverse reason mapping and audit evidence. AI can assist with explanation and consistency checks, not bypass the control framework.
Drill 6: Payment dispute case
Question: What should the AI do if dispute evidence is incomplete but the customer is waiting in a live channel?
Answer: It should return a bounded response: confirm what is known, state what evidence is missing, give the approved next step and create or route the case. It should not guess liability. If latency budget is exceeded, use degraded mode: frontline guidance now, back-office reasoning asynchronously.
Drill 7: CTO case
Question: How do you know the extra test-time compute is worth the cost?
Answer: Measure by tier: unsupported claim reduction, human override reduction, rework reduction, complaint reduction, faster resolution, loss avoidance, analyst time saved and audit evidence completeness. If Tier 3 costs more but materially reduces rework and risk in AML or disputes, it may be justified. If a high-budget path only produces longer fluent answers, it should be capped or redesigned.
Source anchors
| Source | Link | Use |
|---|---|---|
| Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | https://arxiv.org/abs/2408.03314 | Understanding test-time compute as a configurable resource |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | https://arxiv.org/abs/2203.11171 | Background on multi-path reasoning; not equivalent to an enterprise production architecture |
| Training Verifiers to Solve Math Word Problems | https://arxiv.org/abs/2110.14168 | Supports generator/checker separation and the verifier approach |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Supports the Govern, Map, Measure, Manage risk loop |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Supports generative-AI-specific risks, controls and evidence |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Supports AI management system and continual improvement requirements |
| OpenTelemetry Documentation | https://opentelemetry.io/docs/ | Supports runtime evidence design using traces, spans, metrics and logs |