AI Reasoning Budget / Test-Time Compute / Verifier Cascade Playbook

Version: v1.0

379AI_REASONING_BUDGET_TEST_TIME_COMPUTE_VERIFIER_CASCADE_PLAYBOOK.md

Date: 2026-06-30
Audience: AI product managers, CBAP / BA, enterprise architects, solution architects, AI Governance Lead, model risk management, financial retail business owners


Purpose and when to use

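This playbook turns AI reasoning capability from a “model capability” into a “product and architecture operating capability”. It helps teams answer:

  • Which tasks can be answered quickly, and which must spend more test-time compute?
  • Which tasks need RAG, tools, planner/solver/checker, a verifier cascade, or human review?
  • How do we make explainable trade-offs among cost, latency, quality, customer impact and regulatory evidence?
  • How do we demonstrate to audit, model risk, compliance and business owners that the system does not generate arbitrarily?
  • How do we give users a rationale they can understand without leaking hidden chain-of-thought?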

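Applicable scenarios:

| Scenario | Why use it |
|---|---|
| AML investigation copilot | Must synthesize evidence from transactions, customer profiles, typologies and sanctions/PEP tools, but the final judgment must rest with the analyst |
| Credit policy reasoning | Must interpret policy, call calculation tools, and control fair lending and adverse action evidence |
| Payment dispute reasoning | Must balance transaction evidence, reason codes, deadlines, customer notification and human review |
| Complaints root-cause analysis | Requires cross-channel evidence, classification rules, root-cause hypotheses and regulatory response deadlines |
| Contact center policy QA | Requires low-latency answers while safeguarding policy sources, prohibited phrasing and escalation paths |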

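Scenarios where it does not apply:

| Scenario | Reason |
|---|---|
| Pure marketing copy ideation | Lightweight content moderation is sufficient; a full verifier cascade is not needed |
| Internal drafts with no customer impact | A low-budget path is acceptable; focus on controlling data leakage |
| Decisions fully determined by a rules engine | AI may explain or summarize, but should not replace deterministic rules |
| Final legal or compliance opinions | AI may assist with retrieval and organization; the final opinion is given by authorized personnel |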

Operating model

1. Governance loop

flowchart LR
  A[Use case intake] --> B[Risk and complexity rubric]
  B --> C[Budget policy]
  C --> D[Workflow orchestration]
  D --> E[Verifier cascade]
  E --> F[Output, abstention, escalation]
  F --> G[Evidence packet]
  G --> H[Monitoring and review]
  H --> C

2. Roles and ownership

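| Role | Accountabilities |
|---|---|
| Product owner | Defines the user journey, business value, SLOs, acceptable wait time and failure experience |
| BA / CBAP | Breaks down business rules, exception paths, input fields, acceptance examples and evidence requirements |
| Solution architect | Designs the workflow, RAG, tool integration, verifier cascade and fallback |
| AI platform owner | Provides the budget classifier, orchestrator, verifier registry and trace/evidence store |
| Business control owner | Defines human review, approvals, sampling, thresholds and stop conditions |
| Model risk / governance | Approves risk tiering, the eval plan, release gates and ongoing monitoring |
| Operations lead | Manages human queues, SLAs, quality spot checks and the feedback loop |
| Audit evidence owner | Maintains evidence packet standards, retention periods and access controls |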

3. Lifecycle

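| Phase | Decisions |
|---|---|
| Intake | Whether the use case involves customer impact, regulatory obligations, sensitive data or automated actions |
| Design | Which budget tier, which tools, which verifiers, and when to abstain |
| Build | Make the budget policy configurable; wire in trace/span and the evidence packet |
| Eval | Test quality, cost, latency, unsupported claims and human override by tier |
| Release | Approve the release evidence packet; define rollback and degraded mode |
| Operate | Monitor tier mix, SLOs, cost, escalation rate, QA findings and complaint feedback |
| Improve | Update the rubric, knowledge base, verifiers and training examples based on failure cases |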

Template: reasoning budget policy

| Policy field | Tier 0 Fast path | Tier 1 Evidence path | Tier 2 Deliberation path | Tier 3 Controlled decision path |
|---|---|---|---|---|
| Task examples | low-risk FAQ, internal rewriting | policy QA, contact center scripts | payment dispute initial assessment, complaints RCA | AML, credit edge cases, support for adverse customer impact |
| Max model calls | 1 | 1-2 | 2-4 | 4-8 plus human gate |
| Retrieval | optional | required for factual/policy claims | required, targeted re-query allowed | required, multi-source evidence required |
| Tools | none or read-only | read-only policy/source lookup | deterministic calculators, case lookup | approved calculators, risk tools, workflow tools |
| Verifiers | schema, safety | schema, citation, freshness | citation, calculation, policy, contradiction | full cascade plus expert review |
| Latency target | under 2s if channel requires | under 5s | under 30s or async | async or queue-based |
| Customer impact | none | informational | operational recommendation | high-impact recommendation only |
| Human review | none | sample-based | exception-based or sample-based | mandatory before final action |
| Evidence | minimal logs | source refs and output hash | trace, sources, verifier results | full evidence packet |
| Allowed output | answer | answer with citation and limits | recommendation with confidence and next step | draft/recommendation, not autonomous decision |
| Stop condition | safety fail | no source support | verifier fail, missing evidence | human gate, policy conflict, evidence conflict |

Policy statement:

The AI system must assign every request to a reasoning budget tier before generation. The tier determines retrieval, tool access, verifier checks, latency target, human review requirement, and evidence retention. Budget escalation must be justified by customer impact, uncertainty, policy edge, evidence gap, or user challenge. Budget caps cannot override authority, policy, or regulatory controls.
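As an illustration of keeping the budget policy in configuration rather than in prompts, the following is a minimal sketch that transcribes a few rows of the policy table into a lookup. The class name, field names and helper functions are hypothetical and not part of the playbook; the call caps use the upper bound of each "Max model calls" range, and the Tier 0 latency target only applies where the channel requires it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetPolicy:
    tier: int
    max_model_calls: int                # upper bound of the "Max model calls" row
    retrieval_required: bool            # Tier 0 retrieval is optional
    human_gate: bool                    # True when review is mandatory before final action
    latency_target_s: Optional[float]   # None means async or queue-based


# Values transcribed from the policy table; names are illustrative assumptions.
POLICIES = {
    0: BudgetPolicy(0, 1, False, False, 2.0),
    1: BudgetPolicy(1, 2, True, False, 5.0),
    2: BudgetPolicy(2, 4, True, False, 30.0),
    3: BudgetPolicy(3, 8, True, True, None),
}


def may_call_model(tier: int, calls_made: int) -> bool:
    """Return True while the run is still under its model-call cap."""
    return calls_made < POLICIES[tier].max_model_calls


def may_finalize(tier: int, human_approved: bool) -> bool:
    """A budget cap never substitutes for the human gate on gated tiers."""
    return human_approved or not POLICIES[tier].human_gate
```

Under these assumptions, a Tier 3 run that has exhausted its eight calls still cannot finalize without reviewer approval, which mirrors the rule that budget caps cannot override authority or regulatory controls.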

Template: task complexity rubric

Score each dimension from 0 to 3. Use the total and any hard trigger to choose a budget tier.

| Dimension | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Customer impact | no impact | informational | operational next step | adverse, financial, legal, regulatory impact |
| Evidence dependency | answer from approved static content | one policy source | multiple systems or versions | conflicting, missing, or regulated evidence |
| Rule complexity | no rule | simple policy clause | multiple conditions | exception, cross-jurisdiction, time-sensitive rule |
| Data sensitivity | public/internal | customer PII | account/transaction/credit data | suspicious activity, hardship, protected class risk |
| Action authority | no action | advice only | workflow recommendation | decision support for controlled action |
| Latency pressure | batch | standard UI | live agent assist | real-time customer conversation with escalation risk |
| Uncertainty | deterministic | low ambiguity | multiple plausible outcomes | high ambiguity or disagreement |

Tier mapping:

| Score / trigger | Budget tier |
|---|---|
| 0-4 and no hard trigger | Tier 0 |
| 5-8 or factual/policy claim | Tier 1 |
| 9-13 or multi-step operational reasoning | Tier 2 |
| 14+ or hard trigger | Tier 3 |

Hard triggers for Tier 3:

| Trigger | Examples |
|---|---|
| Customer adverse impact | credit decline reason, fee reversal denial, account restriction recommendation |
| Regulatory reporting | AML SAR support, complaints regulatory classification |
| Protected-class or fair lending exposure | credit policy exception, affordability assessment |
| Material financial loss | dispute liability, scam reimbursement recommendation |
| Evidence conflict | transaction system conflicts with notes or customer statement |
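The rubric, tier mapping and hard triggers can be combined into a single assignment function. The sketch below is one possible reading, not a definitive implementation: it assumes the seven rubric dimensions are each scored 0 to 3 (total 0 to 21), and that when the score band and a trigger point to different tiers the higher tier wins. The function name, dimension keys and flag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

DIMENSIONS = (
    "customer_impact", "evidence_dependency", "rule_complexity",
    "data_sensitivity", "action_authority", "latency_pressure", "uncertainty",
)


@dataclass(frozen=True)
class TierDecision:
    tier: int
    total: int
    reasons: Tuple[str, ...]   # retained so the assignment is explainable


def assign_tier(
    scores: Dict[str, int],
    hard_triggers: Sequence[str] = (),
    factual_or_policy_claim: bool = False,
    multi_step_reasoning: bool = False,
) -> TierDecision:
    for name in DIMENSIONS:
        if scores.get(name) not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be scored 0-3")
    total = sum(scores[name] for name in DIMENSIONS)

    # Score bands from the tier mapping table.
    if total >= 14:
        tier = 3
    elif total >= 9:
        tier = 2
    elif total >= 5:
        tier = 1
    else:
        tier = 0
    reasons = [f"score={total}"]

    # Assumption: triggers can only raise the tier, never lower it.
    if factual_or_policy_claim and tier < 1:
        tier = 1
        reasons.append("factual/policy claim")
    if multi_step_reasoning and tier < 2:
        tier = 2
        reasons.append("multi-step operational reasoning")
    if hard_triggers:
        tier = 3
        reasons.extend(f"hard trigger: {t}" for t in hard_triggers)
    return TierDecision(tier, total, tuple(reasons))
```

For example, a request scoring 0 on every dimension but flagged with the hard trigger "Evidence conflict" would be routed to Tier 3, with the trigger recorded in `reasons` for the evidence packet.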

Template: verifier cascade

| Stage | Verifier | Input | Pass criteria | Fail action | Evidence captured |
|---|---|---|---|---|---|
| V0 | Permission verifier | user, role, case id, account scope | user can access requested data and action | deny, redact, or route to authorized user | policy decision, user role, resource id hash |
| V1 | Completeness verifier | required fields by use case | all required fields present or explicitly unavailable | ask for missing field or abstain | missing field list |
| V2 | Retrieval verifier | query, source ids, policy versions | current approved sources retrieved | re-query, narrow scope, or escalate | source ids, index version, freshness |
| V3 | Claim support verifier | output claims, retrieved sources | each material claim has source support | rewrite or abstain | claim-source map |
| V4 | Tool verifier | calculations, timelines, risk scores | deterministic tool results match output | block output and rerun with tool result | tool name, input hash, result summary |
| V5 | Policy verifier | output, policy rules, prohibited statements | no prohibited promise, no unsupported decision, correct template | rewrite, escalate, or deny | policy id, rule id, failure tag |
| V6 | Consistency verifier | similar cases, prior disposition, current answer | no unexplained deviation from expected handling | reviewer queue or supervisor review | similarity ids, deviation reason |
| V7 | Human verifier | reviewer packet | approve, edit, reject, or request more evidence | hold final action | reviewer id, timestamp, decision |

Implementation guidance:

  • Put deterministic and permission checks before expensive model checks.
  • Do not let a later model verifier override a hard policy denial.
  • Treat verifier disagreement as an escalation signal, not as a reason to keep retrying indefinitely.
  • Version each cascade by use case, because AML, credit, payments, complaints and contact center need different controls.
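The guidance above can be sketched as a small cascade runner. This is an illustrative reading under stated assumptions: each verifier is tagged with a hypothetical `kind` (deterministic, model or human), cheaper kinds run first, a hard failure ends the run with a denial that no later stage can revisit, and any other failure escalates instead of retrying. The `Verifier` type and `run_cascade` function are not defined by the playbook.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Verifier:
    stage: str                      # e.g. "V0"
    kind: str                       # "deterministic", "model", or "human"
    check: Callable[[Dict], bool]   # returns True on pass
    hard: bool = False              # hard failures end the run with a denial


# Deterministic and permission checks run before expensive model checks.
KIND_ORDER = {"deterministic": 0, "model": 1, "human": 2}


def run_cascade(
    verifiers: Sequence[Verifier], request: Dict
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Return (outcome, trail); outcome is 'deliver', 'deny' or 'escalate'."""
    # sorted() is stable, so stages keep their relative order within a kind.
    ordered = sorted(verifiers, key=lambda v: KIND_ORDER[v.kind])
    trail: List[Tuple[str, bool]] = []
    for verifier in ordered:
        passed = verifier.check(request)
        trail.append((verifier.stage, passed))
        if passed:
            continue
        # Stop at the first failure: later stages never run, so a model
        # verifier cannot override an earlier hard policy denial.
        return ("deny" if verifier.hard else "escalate"), trail
    return "deliver", trail
```

The returned trail records which stages ran and their results, which is the kind of per-stage outcome the evidence packet expects. Note that sorting by kind may reorder stages relative to the V0-V7 numbering; a per-use-case cascade version could instead fix the order explicitly.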

Template: rationale exposure rule

| Audience | Show | Do not show | Example |
|---|---|---|---|
| Customer | concise explanation, verified facts, next step, required documents | hidden reasoning, internal risk score, security rules, unverified suspicion | “We need merchant documentation before we can complete the dispute review.” |
| Frontline agent | approved answer, source title, policy snippet, escalation path | full scratchpad, unsupported alternatives, restricted compliance logic | “Use policy CARD-DISP-12. If customer says fraud, collect affidavit and escalate.” |
| Analyst | evidence summary, source refs, tool results, model recommendation, uncertainty | raw hidden chain-of-thought unless explicitly approved by policy | “Three transaction clusters match typology indicators T1 and T4; analyst review required.” |
| Supervisor | verifier failures, reviewer packet, exception reason | irrelevant generation drafts | “Policy verifier failed because proposed refund denial lacked deadline calculation.” |
| Audit / model risk | run metadata, source ids, verifier outcomes, human approvals, output hash | customer-facing chat only, hidden reasoning as proof | “Tier 3 case had V0-V6 results and human approval before final disposition.” |

Rule statement:

Expose concise rationale and evidence appropriate to the audience. Do not expose or rely on hidden model reasoning as customer explanation, regulatory evidence, or audit proof. The evidence packet must prove what the system used, checked, blocked, escalated, and delivered.
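One way to enforce this rule is an allowlist per audience, so that anything not explicitly approved (including hidden reasoning) is dropped by default. The sketch below assumes a flat run record and uses hypothetical field names derived from the "Show" column of the table; it is not a prescribed schema.

```python
from typing import Any, Dict

# Allowlists mirror the "Show" column; keys are illustrative assumptions.
ALLOWED_FIELDS = {
    "customer": {"explanation", "verified_facts", "next_step", "required_documents"},
    "frontline_agent": {"approved_answer", "source_title", "policy_snippet", "escalation_path"},
    "analyst": {"evidence_summary", "source_refs", "tool_results", "recommendation", "uncertainty"},
    "supervisor": {"verifier_failures", "reviewer_packet", "exception_reason"},
    "audit": {"run_metadata", "source_ids", "verifier_outcomes", "human_approvals", "output_hash"},
}


def view_for(audience: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the fields the audience is allowed to see."""
    if audience not in ALLOWED_FIELDS:
        raise ValueError(f"unknown audience: {audience!r}")
    allowed = ALLOWED_FIELDS[audience]
    # Allowlist, not denylist: unknown or new fields are hidden by default.
    return {key: value for key, value in record.items() if key in allowed}
```

Because no allowlist contains a hidden-reasoning field, such content cannot reach any view even if it is accidentally present in the record.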

Template: cost-latency-risk scorecard

| Metric | Definition | Target / threshold | Owner | Action when breached |
|---|---|---|---|---|
| Tier mix | percent of requests by budget tier | agreed baseline by use case | product + platform | investigate drift or misclassification |
| p95 latency by tier | end-to-end workflow latency | Tier 1 under 5s, Tier 2 under 30s, Tier 3 per queue SLA | platform | optimize retrieval, tools, model, or async path |
| Cost per completed case | model + tool + human review cost | below approved business case threshold | product finance | cap budget or redesign workflow |
| Unsupported claim rate | material claims without source support | 0 for customer-facing policy claims | governance | block release or update RAG/verifier |
| Verifier hard-fail rate | percent blocked by policy/tool/schema | monitored by stage | control owner | fix upstream workflow or knowledge source |
| Abstention rate | percent insufficient evidence / cannot answer | use-case specific acceptable band | product + operations | tune evidence intake or escalation |
| Human override rate | percent edited/rejected by reviewer | threshold by risk tier | model risk | review prompt, model, policy, data, training |
| Repeat contact / rework | customer or operations repeat due to poor answer | below baseline | business owner | RCA and corrective action |
| Complaint or harm signal | complaints linked to AI-assisted handling | zero tolerance for severe cases | risk + compliance | incident process and control review |
| Audit evidence completeness | required fields present in packet | 100 percent for Tier 3 | evidence owner | block production release or case closure |

Decision rule:

A higher reasoning budget is justified only when it measurably improves evidence support, reduces human rework, lowers customer harm, improves compliance quality, or protects material financial value. It is not justified by model fluency alone.
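To make the scorecard measurable at runtime, a few of its fixed thresholds can be checked mechanically. The following is a minimal sketch covering only three metrics with explicit numeric targets (p95 latency for Tier 1 and Tier 2, unsupported customer-facing policy claims, and Tier 3 evidence completeness). The nearest-rank percentile method, the function names and the parameter names are assumptions; Tier 3 latency is governed by queue SLA and is not modeled here.

```python
from typing import List, Sequence

# "Under 5s" and "under 30s" from the scorecard.
P95_LATENCY_TARGET_S = {1: 5.0, 2: 30.0}


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breaches(
    tier: int,
    latencies_s: Sequence[float],
    unsupported_policy_claims: int,
    evidence_completeness: float,
) -> List[str]:
    """Return the names of scorecard metrics that breached their threshold."""
    found: List[str] = []
    target = P95_LATENCY_TARGET_S.get(tier)
    if target is not None and p95(latencies_s) >= target:
        found.append("p95 latency by tier")
    if unsupported_policy_claims > 0:
        found.append("Unsupported claim rate")
    if tier == 3 and evidence_completeness < 1.0:
        found.append("Audit evidence completeness")
    return found
```

Each returned metric name maps to an owner and an action in the scorecard, so the output can drive routing to the responsible team.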

Template: release evidence packet

| Section | Required content |
|---|---|
| Use case summary | business objective, users, channels, customer impact, prohibited uses |
| Budget policy | tier definitions, assignment logic, hard triggers, budget caps |
| Workflow design | planner/solver/checker steps, RAG sources, tools, human gates, fallback |
| Verifier cascade | stages, pass/fail criteria, fail actions, owner, version |
| Data and knowledge sources | source systems, policy versions, index version, freshness controls |
| Eval results | golden set, challenge set, by-tier quality, unsupported claim, verifier fail, human override |
| Cost and latency | p50/p95 latency, cost per run, expected volume, capacity plan |
| Risk assessment | customer harm, regulatory exposure, privacy, security, fairness, operational resilience |
| Rationale exposure | customer, agent, analyst, supervisor, audit views |
| Evidence schema | run id, source refs, tool result summaries, verifier outcomes, output hash, reviewer actions |
| Operating model | RACI, review cadence, monitoring dashboard, incident workflow |
| Release decision | approvers, open exceptions, rollback/degraded mode, next review date |

Minimum Tier 3 evidence fields:

| Field | Required rule |
|---|---|
| run_id | unique and traceable to case/work item |
| use_case | matches approved AI use case registry |
| budget_tier | must equal Tier 3 for high-impact trigger |
| trigger_reason | explicit customer/regulatory/evidence reason |
| source_refs | source ids or hashes, not just free text |
| policy_versions | approved policy and rule versions |
| tool_results | deterministic result summaries and input hashes |
| verifier_results | pass/fail, rule ids, failure tags |
| output_hash | hash of final output or recommendation |
| human_action | reviewer decision before final action |
| retention_policy | mapped to business and regulatory retention |
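The minimum field list lends itself to an automated completeness check before case closure. The sketch below validates presence of each field, the Tier 3 requirement and the output hash. The choice of SHA-256 and the function names are assumptions; the playbook only requires "a hash of the final output", and semantic rules such as registry membership are out of scope here.

```python
import hashlib
from typing import Any, Dict, List

REQUIRED_TIER3_FIELDS = (
    "run_id", "use_case", "budget_tier", "trigger_reason", "source_refs",
    "policy_versions", "tool_results", "verifier_results", "output_hash",
    "human_action", "retention_policy",
)


def hash_output(text: str) -> str:
    """SHA-256 hex digest of the final output (hash algorithm is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate_tier3_packet(packet: Dict[str, Any], final_output: str) -> List[str]:
    """Return a list of problems; an empty list means the packet is complete."""
    problems: List[str] = []
    for name in REQUIRED_TIER3_FIELDS:
        # Empty strings and empty containers count as missing.
        if packet.get(name) in (None, "", [], {}):
            problems.append(f"missing: {name}")
    if packet.get("budget_tier") != 3:
        problems.append("budget_tier must equal 3")
    stored = packet.get("output_hash")
    if stored and stored != hash_output(final_output):
        problems.append("output_hash does not match final output")
    return problems
```

A non-empty result would correspond to the scorecard action for "Audit evidence completeness": block production release or case closure until the packet is repaired.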

PM/BA/architecture questions

PM questions

| Question | Why it matters |
|---|---|
| Which user moments require speed, and which require correctness over speed? | Prevents one-size-fits-all SLO |
| What is the cost of a wrong answer, not just the cost of model tokens? | Connects budget to business value |
| When should the product say “I need more evidence”? | Makes abstention a designed experience |
| What should the user see when the system escalates? | Reduces confusion and trust loss |
| Which metrics prove high-budget reasoning is worth it? | Links architecture to adoption and ROI |

BA / CBAP questions

| Question | Why it matters |
|---|---|
| What fields are required before the AI can reason safely? | Prevents unsupported conclusions |
| Which policy clauses are deterministic rules vs interpretive guidance? | Decides tools vs model reasoning |
| What are the exception paths and who approves them? | Defines escalation gates |
| What evidence must be retained for audit or dispute? | Shapes evidence packet |
| What examples represent edge cases, not just happy paths? | Builds useful eval and UAT sets |

Architecture questions

| Question | Why it matters |
|---|---|
| Is budget tiering implemented as code/config or buried in prompts? | Determines governability |
| Which verifier stages are deterministic, model-based, or human? | Controls reliability and cost |
| How are tool permissions scoped by role and use case? | Prevents agent overreach |
| What trace/span attributes are mandatory? | Enables monitoring and incident replay |
| How does the system degrade when latency or cost budget is reached? | Protects SLO and user experience |
| How is hidden reasoning prevented from entering logs or user output? | Reduces privacy, security, and audit risk |

Release checklist

| Check | Pass criteria |
|---|---|
| Use case registered | AI use case has owner, scope, risk tier, prohibited uses |
| Budget policy approved | Tier mapping, hard triggers, caps and stop conditions approved |
| Complexity rubric tested | Sample cases mapped consistently by product, BA, risk and operations |
| RAG sources approved | Current policy versions, source owners and freshness checks documented |
| Tools approved | Each tool has permission scope, input validation, output contract and audit log |
| Verifier cascade implemented | V0-V7 stages configured or explicitly waived with approved rationale |
| Human gate working | Tier 3 actions cannot bypass reviewer approval |
| Rationale exposure tested | Customer/agent/audit views show allowed content only |
| Evidence packet complete | Required fields generated in test and production-like runs |
| Eval complete | Golden and challenge sets pass quality, safety, citation and policy gates |
| Cost/latency reviewed | p50/p95, max cost and capacity plan accepted |
| Monitoring dashboard ready | Tier mix, SLO, cost, verifier fail, override, abstention visible |
| Incident and rollback ready | Disable path, degraded mode and escalation contacts confirmed |
| Legal/compliance/model risk signoff | Required approvers recorded |

Release decision language:

The release is approved only for the registered use case, approved channels, configured budget tiers, listed tools, specified verifier cascade, and documented human gates. Any expansion to new actions, channels, data classes, or customer-impacting decisions requires a budget policy and evidence packet review.

Executive narrative

One-slide version

AI reasoning budget lets us spend intelligence where risk and value justify it. Instead of treating every request as a single model call, we classify each task by customer impact, evidence need, policy complexity and latency. Low-risk requests get fast answers. Policy questions get grounded answers with citations. Complex operational cases get decomposition, tools and verifier checks. High-impact financial retail decisions get controlled recommendations, human approval and audit evidence.

This approach improves quality without uncontrolled cost growth. It also makes governance practical: every high-risk AI run records what evidence was used, which tools were called, which verifiers passed or failed, who approved the final action and what was delivered. We do not expose hidden chain-of-thought or rely on it as audit evidence. We expose concise rationale and retain verifiable evidence.

Board / risk committee version

The control objective is not to make the model appear more confident. The control objective is to ensure that AI-supported work is risk-proportional, evidence-based, reviewable and interruptible. Reasoning budget and verifier cascade provide that operating model. They define when AI can answer, when it must retrieve sources, when it must call deterministic tools, when it must abstain, and when a human must decide.

For AML, credit, disputes, complaints and contact center use cases, this gives management a clear view of risk: tier mix, unsupported claim rate, verifier failures, human overrides, cost, latency, complaints and audit evidence completeness. It turns AI governance from a policy document into a measurable runtime control.

Technology leadership version

We should implement reasoning budget as a shared platform capability: budget classifier, orchestration templates, verifier registry, tool permission model, OpenTelemetry instrumentation and evidence store. Application teams configure use-case policies instead of hand-coding retry logic. This keeps cost, latency, model risk, security and audit evidence consistent across the AI portfolio.


Interview drills

Drill 1: Explain to a PM

Question: Why do we need a reasoning budget instead of just choosing the best model?

Answer: Because model quality is only one part of production quality. A contact center policy question, a payment dispute recommendation and an AML narrative have different risk, latency and evidence needs. Reasoning budget lets us route each request to the right level of retrieval, tool use, verification and human review. It prevents overspending on simple tasks and under-controlling high-impact tasks.

Drill 2: Explain to an architect

Question: What are the main components of verifier cascade architecture?

Answer: I would design a budget classifier, workflow orchestrator, RAG/source layer, deterministic tool layer, verifier registry, human review queue and evidence store. The cascade starts with permission and completeness, then source support, deterministic calculations, policy rules, consistency checks and finally expert review for high-impact cases. Each stage has pass/fail criteria and fail actions, and each run emits trace spans and evidence fields.

Drill 3: Explain to model risk

Question: How do you audit reasoning without storing chain-of-thought?

Answer: I would not use hidden chain-of-thought as audit evidence. I would store the approved sources, source ids, policy versions, input field hashes, tool result summaries, model/prompt config hash, verifier outcomes, output hash, abstention or escalation reason and human approval. That proves what the system used and checked without exposing internal scratchpad text.

Drill 4: AML case

Question: The AML copilot generated a strong suspicious narrative, but the analyst says the evidence is weak. What should happen?

Answer: The verifier cascade should catch unsupported claims before delivery. If it fails after generation, the case should move to escalation or rewrite with only observed facts. The evidence packet should show which claims lacked transaction or typology support. The analyst can approve, edit or reject, and the failure should feed the unsupported-claim metric and knowledge/prompt improvement backlog.

Drill 5: Credit policy case

Question: Can a high reasoning budget justify an automated credit decline?

Answer: No. Budget can improve evidence gathering and recommendation quality, but it cannot grant decision authority. A credit decline or other adverse action needs an approved policy, rule or human decision process, plus fair lending controls, adverse reason mapping and audit evidence. AI can assist with explanation and consistency checks, but it cannot bypass the control framework.

Drill 6: Payment dispute case

Question: What should the AI do if dispute evidence is incomplete but the customer is waiting in a live channel?

Answer: It should return a bounded response: confirm what is known, state what evidence is missing, give the approved next step and create or route the case. It should not guess liability. If latency budget is exceeded, use degraded mode: frontline guidance now, back-office reasoning asynchronously.

Drill 7: CTO case

Question: How do you know the extra test-time compute is worth the cost?

Answer: Measure by tier: unsupported claim reduction, human override reduction, rework reduction, complaint reduction, faster resolution, loss avoidance, analyst time saved and audit evidence completeness. If Tier 3 costs more but materially reduces rework and risk in AML or disputes, it may be justified. If a high-budget path only produces longer fluent answers, it should be capped or redesigned.


Source anchors

| Source | Link | Purpose |
|---|---|---|
| Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | https://arxiv.org/abs/2408.03314 | Understanding test-time compute as a configurable resource |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | https://arxiv.org/abs/2203.11171 | Background on multi-path reasoning; not equivalent to an enterprise production architecture |
| Training Verifiers to Solve Math Word Problems | https://arxiv.org/abs/2110.14168 | Supports generator/checker separation and the verifier approach |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Supports the Govern, Map, Measure, Manage risk loop |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Supports generative-AI-specific risks, controls and evidence |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Supports AI management system and continual improvement requirements |
| OpenTelemetry Documentation | https://opentelemetry.io/docs/ | Supports runtime evidence design for trace/span/metrics/logs |