AI Reasoning Budget / Test-Time Compute / Verifier Cascade Playbook

Version: v1.0

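Date: 2026-06-30
Audience: AI product managers, CBAP / BA, enterprise architects, solution architects, AI Governance Leads, model risk management, retail financial services business owners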

Purpose and when to use

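This playbook turns AI reasoning capability from a “model capability” into a “product and architecture operating capability”. It helps teams answer:

Which tasks can be answered quickly, and which must spend more test-time compute?
Which tasks need RAG, tools, planner/solver/checker, a verifier cascade, or human review?
How do we make explainable trade-offs among cost, latency, quality, customer impact, and regulatory evidence?
How do we demonstrate to audit, model risk, compliance, and business owners that the system does not generate arbitrarily?
How do we give users a rationale they can understand without leaking hidden chain-of-thought?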

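Applicable scenarios:

Scenario	Why use it
AML investigation copilot	Must synthesize evidence from transactions, customer profiles, typologies, and sanctions/PEP tools, but the final judgment remains the analyst's responsibility
Credit policy reasoning	Must interpret policy, call calculation tools, and control fair lending and adverse action evidence
Payment dispute reasoning	Must balance transaction evidence, reason codes, deadlines, customer notification, and human review
Complaints root-cause analysis	Requires cross-channel evidence, classification rules, root-cause hypotheses, and regulatory response deadlines
Contact center policy QA	Requires low-latency answers while enforcing policy sources, prohibited phrasing, and escalation paths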

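Scenarios where it does not apply:

Scenario	Reason
Pure marketing copy ideation	Lightweight content moderation is enough; a full verifier cascade is not needed
Internal drafts with no customer impact	Can use a low-budget path; focus on controlling data leakage
Decisions already fully determined by a rules engine	AI may explain or summarize, but should not replace deterministic rules
Final legal or compliance opinions	AI may assist with retrieval and organization; the final opinion comes from authorized personnel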

Operating model

1. Governance loop

flowchart LR
  A[Use case intake] --> B[Risk and complexity rubric]
  B --> C[Budget policy]
  C --> D[Workflow orchestration]
  D --> E[Verifier cascade]
  E --> F[Output, abstention, escalation]
  F --> G[Evidence packet]
  G --> H[Monitoring and review]
  H --> C

2. Roles and ownership

Role	Accountabilities
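Product owner	Defines the user journey, business value, SLOs, acceptable wait time, and failure experience
BA / CBAP	Breaks down business rules, exception paths, input fields, acceptance examples, and evidence requirements
Solution architect	Designs the workflow, RAG, tool integration, verifier cascade, and fallback
AI platform owner	Provides the budget classifier, orchestrator, verifier registry, and trace/evidence store
Business control owner	Defines human review, approvals, sampling, thresholds, and stop conditions
Model risk / governance	Approves risk tiering, the eval plan, release gates, and ongoing monitoring
Operations lead	Manages human review queues, SLAs, quality spot checks, and the feedback loop
Audit evidence owner	Maintains the evidence packet standard, retention periods, and access controls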

3. Lifecycle

Phase	Decisions
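Intake	Whether the use case involves customer impact, regulatory obligations, sensitive data, or automated actions
Design	Which budget tier, which tools, and which verifiers to use, and when to abstain
Build	Make the budget policy configurable; wire in trace/span and the evidence packet
Eval	Test quality, cost, latency, unsupported claims, and human overrides by tier
Release	Approve the release evidence packet; define rollback and degraded mode
Operate	Monitor tier mix, SLOs, cost, escalation rate, QA findings, and complaint feedback
Improve	Update the rubric, knowledge base, verifiers, and training examples based on failure cases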

Template: reasoning budget policy

Policy field	Tier 0 Fast path	Tier 1 Evidence path	Tier 2 Deliberation path	Tier 3 Controlled decision path
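Task examples	Low-risk FAQ, internal rewriting	Policy QA, contact center scripts	Initial payment dispute assessment, complaints RCA	AML, borderline credit cases, support for decisions with adverse customer impact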
Max model calls	1	1-2	2-4	4-8 plus human gate
Retrieval	optional	required for factual/policy claims	required, targeted re-query allowed	required, multi-source evidence required
Tools	none or read-only	read-only policy/source lookup	deterministic calculators, case lookup	approved calculators, risk tools, workflow tools
Verifiers	schema, safety	schema, citation, freshness	citation, calculation, policy, contradiction	full cascade plus expert review
Latency target	under 2s if channel requires	under 5s	under 30s or async	async or queue-based
Customer impact	none	informational	operational recommendation	high-impact recommendation only
Human review	none	sample-based	exception-based or sample-based	mandatory before final action
Evidence	minimal logs	source refs and output hash	trace, sources, verifier results	full evidence packet
Allowed output	answer	answer with citation and limits	recommendation with confidence and next step	draft/recommendation, not autonomous decision
Stop condition	safety fail	no source support	verifier fail, missing evidence	human gate, policy conflict, evidence conflict

Policy statement:

The AI system must assign every request to a reasoning budget tier before generation. The tier determines retrieval, tool access, verifier checks, latency target, human review requirement, and evidence retention. Budget escalation must be justified by customer impact, uncertainty, policy edge, evidence gap, or user challenge. Budget caps cannot override authority, policy, or regulatory controls.
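
The policy table can be held as configuration instead of prompt text, which keeps tier limits reviewable and versionable. The following is a minimal sketch under stated assumptions: `TierPolicy`, `BUDGET_POLICY`, and `may_call_model` are hypothetical names, only a subset of the policy fields is shown, and the upper bound of each "Max model calls" range is used as the cap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    """One column of the reasoning budget policy table (subset of fields)."""
    name: str
    max_model_calls: int
    retrieval_required: bool
    verifiers: tuple
    latency_target_s: Optional[float]  # None means async or queue-based
    human_review: str


# Values copied from the policy table; field names are illustrative assumptions.
BUDGET_POLICY = {
    0: TierPolicy("Fast path", 1, False,
                  ("schema", "safety"), 2.0, "none"),
    1: TierPolicy("Evidence path", 2, True,
                  ("schema", "citation", "freshness"), 5.0, "sample-based"),
    2: TierPolicy("Deliberation path", 4, True,
                  ("citation", "calculation", "policy", "contradiction"),
                  30.0, "exception-based or sample-based"),
    3: TierPolicy("Controlled decision path", 8, True,
                  ("full cascade", "expert review"), None,
                  "mandatory before final action"),
}


def may_call_model(tier: int, calls_so_far: int) -> bool:
    """Return True while the run is still inside its model-call cap."""
    return calls_so_far < BUDGET_POLICY[tier].max_model_calls
```

An orchestrator could call `may_call_model` before each generation step and route to the tier's stop condition once the cap is reached. The cap limits spend only; it does not grant or remove decision authority.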

Template: task complexity rubric

Score each dimension from 0 to 3. Use the total and any hard trigger to choose a budget tier.

Dimension	0	1	2	3
Customer impact	no impact	informational	operational next step	adverse, financial, legal, regulatory impact
Evidence dependency	answer from approved static content	one policy source	multiple systems or versions	conflicting, missing, or regulated evidence
Rule complexity	no rule	simple policy clause	multiple conditions	exception, cross-jurisdiction, time-sensitive rule
Data sensitivity	public/internal	customer PII	account/transaction/credit data	suspicious activity, hardship, protected class risk
Action authority	no action	advice only	workflow recommendation	decision support for controlled action
Latency pressure	batch	standard UI	live agent assist	real-time customer conversation with escalation risk
Uncertainty	deterministic	low ambiguity	multiple plausible outcomes	high ambiguity or disagreement

Tier mapping:

Score / trigger	Budget tier
0-4 and no hard trigger	Tier 0
5-8 or factual/policy claim	Tier 1
9-13 or multi-step operational reasoning	Tier 2
14+ or hard trigger	Tier 3

Hard triggers for Tier 3:

Trigger	Examples
Customer adverse impact	credit decline reason, fee reversal denial, account restriction recommendation
Regulatory reporting	AML SAR support, complaints regulatory classification
Protected-class or fair lending exposure	credit policy exception, affordability assessment
Material financial loss	dispute liability, scam reimbursement recommendation
Evidence conflict	transaction system conflicts with notes or customer statement
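
The rubric and tier mapping can be sketched as a small scoring function. This is an illustration, not the approved assignment logic: the dimension keys and the `assign_tier` name are hypothetical, and the sketch assumes that the "or" conditions in the mapping table can only raise a tier, never lower it.

```python
DIMENSIONS = (
    "customer_impact", "evidence_dependency", "rule_complexity",
    "data_sensitivity", "action_authority", "latency_pressure", "uncertainty",
)


def assign_tier(scores, hard_trigger=False, policy_claim=False, multi_step=False):
    """Map rubric scores (0-3 per dimension) and flags to a budget tier."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if any(not 0 <= scores[d] <= 3 for d in DIMENSIONS):
        raise ValueError("each score must be between 0 and 3")

    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 14:
        tier = 3
    elif total >= 9:
        tier = 2
    elif total >= 5:
        tier = 1
    else:
        tier = 0

    # Assumption of this sketch: flags only ever raise the tier.
    if policy_claim:
        tier = max(tier, 1)   # factual/policy claim
    if multi_step:
        tier = max(tier, 2)   # multi-step operational reasoning
    if hard_trigger:
        tier = 3              # any hard trigger forces Tier 3
    return tier
```

For example, a request scoring 0 on every dimension stays in Tier 0, moves to Tier 1 if it makes a policy claim, and moves to Tier 3 if any hard trigger applies.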

Template: verifier cascade

Stage	Verifier	Input	Pass criteria	Fail action	Evidence captured
V0	Permission verifier	user, role, case id, account scope	user can access requested data and action	deny, redact, or route to authorized user	policy decision, user role, resource id hash
V1	Completeness verifier	required fields by use case	all required fields present or explicitly unavailable	ask for missing field or abstain	missing field list
V2	Retrieval verifier	query, source ids, policy versions	current approved sources retrieved	re-query, narrow scope, or escalate	source ids, index version, freshness
V3	Claim support verifier	output claims, retrieved sources	each material claim has source support	rewrite or abstain	claim-source map
V4	Tool verifier	calculations, timelines, risk scores	deterministic tool results match output	block output and rerun with tool result	tool name, input hash, result summary
V5	Policy verifier	output, policy rules, prohibited statements	no prohibited promise, no unsupported decision, correct template	rewrite, escalate, or deny	policy id, rule id, failure tag
V6	Consistency verifier	similar cases, prior disposition, current answer	no unexplained deviation from expected handling	reviewer queue or supervisor review	similarity ids, deviation reason
V7	Human verifier	reviewer packet	approve, edit, reject, or request more evidence	hold final action	reviewer id, timestamp, decision

Implementation guidance:

Put deterministic and permission checks before expensive model checks.
Do not let a later model verifier override a hard policy denial.
Treat verifier disagreement as an escalation signal, not as a reason to keep retrying indefinitely.
Version each cascade by use case, because AML, credit, payments, complaints and contact center need different controls.
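
The ordering and no-override guidance above can be sketched as a short-circuiting runner. All names here (`StageResult`, `run_cascade`, the two example stages, and the request keys) are hypothetical. As a simplification, the sketch treats every non-hard failure as an escalation; a real cascade would apply the stage-specific fail action from the table.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    stage: str
    passed: bool
    hard_fail: bool = False   # e.g. a permission or policy denial
    fail_action: str = ""


def run_cascade(stages, request):
    """Run verifier stages in order and stop at the first failure.

    Returns (outcome, results); outcome is "deliver", "denied" or "escalate".
    """
    results = []
    for check in stages:
        result = check(request)
        results.append(result)
        if result.passed:
            continue
        # A hard denial ends the run, so no later stage can override it.
        outcome = "denied" if result.hard_fail else "escalate"
        return outcome, results
    return "deliver", results


# Hypothetical stages, cheapest first: deterministic checks precede model checks.
def permission(req):
    ok = req["role"] in req["allowed_roles"]
    return StageResult("V0", ok, hard_fail=not ok, fail_action="deny")


def completeness(req):
    missing = [f for f in req["required"] if f not in req["fields"]]
    return StageResult("V1", not missing,
                       fail_action="ask for missing field or abstain")
```

Because the loop returns on the first failure instead of retrying, a failed check becomes an escalation signal, and the `results` list doubles as the verifier section of the evidence packet.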

Template: rationale exposure rule

Audience	Show	Do not show	Example
Customer	concise explanation, verified facts, next step, required documents	hidden reasoning, internal risk score, security rules, unverified suspicion	“We need merchant documentation before we can complete the dispute review.”
Frontline agent	approved answer, source title, policy snippet, escalation path	full scratchpad, unsupported alternatives, restricted compliance logic	“Use policy CARD-DISP-12. If customer says fraud, collect affidavit and escalate.”
Analyst	evidence summary, source refs, tool results, model recommendation, uncertainty	raw hidden chain-of-thought unless explicitly approved by policy	“Three transaction clusters match typology indicators T1 and T4; analyst review required.”
Supervisor	verifier failures, reviewer packet, exception reason	irrelevant generation drafts	“Policy verifier failed because proposed refund denial lacked deadline calculation.”
Audit / model risk	run metadata, source ids, verifier outcomes, human approvals, output hash	customer-facing chat only, hidden reasoning as proof	“Tier 3 case had V0-V6 results and human approval before final disposition.”

Rule statement:

Expose concise rationale and evidence appropriate to the audience. Do not expose or rely on hidden model reasoning as customer explanation, regulatory evidence, or audit proof. The evidence packet must prove what the system used, checked, blocked, escalated, and delivered.
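
One way to enforce the exposure rule is an allowlist per audience, so that anything not explicitly permitted is dropped. The sketch below assumes hypothetical field names derived from the "Show" column; it is not a defined schema.

```python
# Hypothetical allowlist per audience; field names are illustrative only.
VISIBLE_FIELDS = {
    "customer": {"explanation", "verified_facts", "next_step",
                 "required_documents"},
    "agent": {"approved_answer", "source_title", "policy_snippet",
              "escalation_path"},
    "analyst": {"evidence_summary", "source_refs", "tool_results",
                "recommendation", "uncertainty"},
    "supervisor": {"verifier_failures", "reviewer_packet", "exception_reason"},
    "audit": {"run_metadata", "source_ids", "verifier_outcomes",
              "human_approvals", "output_hash"},
}


def render_view(record: dict, audience: str) -> dict:
    """Return only the fields the audience may see; everything else is dropped."""
    allowed = VISIBLE_FIELDS[audience]  # an unknown audience raises KeyError
    return {key: value for key, value in record.items() if key in allowed}
```

An allowlist fails closed: a new internal field, such as a risk score or scratchpad text, stays hidden from every audience until someone deliberately adds it to a view.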

Template: cost-latency-risk scorecard

Metric	Definition	Target / threshold	Owner	Action when breached
Tier mix	percent of requests by budget tier	agreed baseline by use case	product + platform	investigate drift or misclassification
p95 latency by tier	end-to-end workflow latency	Tier 1 under 5s, Tier 2 under 30s, Tier 3 per queue SLA	platform	optimize retrieval, tools, model, or async path
Cost per completed case	model + tool + human review cost	below approved business case threshold	product finance	cap budget or redesign workflow
Unsupported claim rate	material claims without source support	0 for customer-facing policy claims	governance	block release or update RAG/verifier
Verifier hard-fail rate	percent blocked by policy/tool/schema	monitored by stage	control owner	fix upstream workflow or knowledge source
Abstention rate	percent insufficient evidence / cannot answer	use-case specific acceptable band	product + operations	tune evidence intake or escalation
Human override rate	percent edited/rejected by reviewer	threshold by risk tier	model risk	review prompt, model, policy, data, training
Repeat contact / rework	customer or operations repeat due to poor answer	below baseline	business owner	RCA and corrective action
Complaint or harm signal	complaints linked to AI-assisted handling	zero tolerance for severe cases	risk + compliance	incident process and control review
Audit evidence completeness	required fields present in packet	100 percent for Tier 3	evidence owner	block production release or case closure

Decision rule:

A higher reasoning budget is justified only when it measurably improves evidence support, reduces human rework, lowers customer harm, improves compliance quality, or protects material financial value. It is not justified by model fluency alone.
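
A few of the scorecard metrics can be computed directly from run records. This is a sketch that assumes each run is a dict with hypothetical keys `tier` and, where a reviewer acted, `human_action`; the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from collections import Counter


def tier_mix(runs):
    """Percent of requests per budget tier."""
    counts = Counter(run["tier"] for run in runs)
    return {tier: 100.0 * n / len(runs) for tier, n in counts.items()}


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def override_rate(runs):
    """Percent of human-reviewed runs that the reviewer edited or rejected."""
    reviewed = [run for run in runs if run.get("human_action")]
    if not reviewed:
        return 0.0
    overridden = sum(run["human_action"] in ("edit", "reject") for run in reviewed)
    return 100.0 * overridden / len(reviewed)
```

Computing these per tier, not across the whole portfolio, is what lets the decision rule be tested: a high-budget tier should show lower override and rework rates than the cheaper path it replaces.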

Template: release evidence packet

Section	Required content
Use case summary	business objective, users, channels, customer impact, prohibited uses
Budget policy	tier definitions, assignment logic, hard triggers, budget caps
Workflow design	planner/solver/checker steps, RAG sources, tools, human gates, fallback
Verifier cascade	stages, pass/fail criteria, fail actions, owner, version
Data and knowledge sources	source systems, policy versions, index version, freshness controls
Eval results	golden set, challenge set, by-tier quality, unsupported claim, verifier fail, human override
Cost and latency	p50/p95 latency, cost per run, expected volume, capacity plan
Risk assessment	customer harm, regulatory exposure, privacy, security, fairness, operational resilience
Rationale exposure	customer, agent, analyst, supervisor, audit views
Evidence schema	run id, source refs, tool result summaries, verifier outcomes, output hash, reviewer actions
Operating model	RACI, review cadence, monitoring dashboard, incident workflow
Release decision	approvers, open exceptions, rollback/degraded mode, next review date

Minimum Tier 3 evidence fields:

Field	Required rule
`run_id`	unique and traceable to case/work item
`use_case`	matches approved AI use case registry
`budget_tier`	must equal Tier 3 for high-impact trigger
`trigger_reason`	explicit customer/regulatory/evidence reason
`source_refs`	source ids or hashes, not just free text
`policy_versions`	approved policy and rule versions
`tool_results`	deterministic result summaries and input hashes
`verifier_results`	pass/fail, rule ids, failure tags
`output_hash`	hash of final output or recommendation
`human_action`	reviewer decision before final action
`retention_policy`	mapped to business and regulatory retention
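
A completeness check over these fields could gate case closure. The sketch below is illustrative: `packet_gaps` and `output_hash` are hypothetical helper names, and SHA-256 is an assumed choice, since the table only requires "a hash".

```python
import hashlib

REQUIRED_TIER3_FIELDS = (
    "run_id", "use_case", "budget_tier", "trigger_reason", "source_refs",
    "policy_versions", "tool_results", "verifier_results", "output_hash",
    "human_action", "retention_policy",
)


def output_hash(text: str) -> str:
    """SHA-256 hex digest of the final output or recommendation."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def packet_gaps(packet: dict) -> list:
    """List the problems that make a Tier 3 evidence packet incomplete."""
    # A field counts as missing when it is absent or empty.
    gaps = [field for field in REQUIRED_TIER3_FIELDS if not packet.get(field)]
    if packet.get("budget_tier") not in (None, 3):
        gaps.append("budget_tier must equal 3")
    return gaps
```

An empty list from `packet_gaps` means the packet meets the minimum field rule. It does not validate the content of each field, such as whether `source_refs` point to approved sources.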

PM/BA/architecture questions

PM questions

Question	Why it matters
Which user moments require speed, and which require correctness over speed?	Prevents one-size-fits-all SLO
What is the cost of a wrong answer, not just the cost of model tokens?	Connects budget to business value
When should the product say “I need more evidence”?	Makes abstention a designed experience
What should the user see when the system escalates?	Reduces confusion and trust loss
Which metrics prove high-budget reasoning is worth it?	Links architecture to adoption and ROI

BA / CBAP questions

Question	Why it matters
What fields are required before the AI can reason safely?	Prevents unsupported conclusions
Which policy clauses are deterministic rules vs interpretive guidance?	Decides tools vs model reasoning
What are the exception paths and who approves them?	Defines escalation gates
What evidence must be retained for audit or dispute?	Shapes evidence packet
What examples represent edge cases, not just happy paths?	Builds useful eval and UAT sets

Architecture questions

Question	Why it matters
Is budget tiering implemented as code/config or buried in prompts?	Determines governability
Which verifier stages are deterministic, model-based, or human?	Controls reliability and cost
How are tool permissions scoped by role and use case?	Prevents agent overreach
What trace/span attributes are mandatory?	Enables monitoring and incident replay
How does the system degrade when latency or cost budget is reached?	Protects SLO and user experience
How is hidden reasoning prevented from entering logs or user output?	Reduces privacy, security, and audit risk

Release checklist

Check	Pass criteria
Use case registered	AI use case has owner, scope, risk tier, prohibited uses
Budget policy approved	Tier mapping, hard triggers, caps and stop conditions approved
Complexity rubric tested	Sample cases mapped consistently by product, BA, risk and operations
RAG sources approved	Current policy versions, source owners and freshness checks documented
Tools approved	Each tool has permission scope, input validation, output contract and audit log
Verifier cascade implemented	V0-V7 stages configured or explicitly waived with approved rationale
Human gate working	Tier 3 actions cannot bypass reviewer approval
Rationale exposure tested	Customer/agent/audit views show allowed content only
Evidence packet complete	Required fields generated in test and production-like runs
Eval complete	Golden and challenge sets pass quality, safety, citation and policy gates
Cost/latency reviewed	p50/p95, max cost and capacity plan accepted
Monitoring dashboard ready	Tier mix, SLO, cost, verifier fail, override, abstention visible
Incident and rollback ready	Disable path, degraded mode and escalation contacts confirmed
Legal/compliance/model risk signoff	Required approvers recorded

Release decision language:

The release is approved only for the registered use case, approved channels, configured budget tiers, listed tools, specified verifier cascade, and documented human gates. Any expansion to new actions, channels, data classes, or customer-impacting decisions requires a budget policy and evidence packet review.

Executive narrative

One-slide version

AI reasoning budget lets us spend intelligence where risk and value justify it. Instead of treating every request as a single model call, we classify each task by customer impact, evidence need, policy complexity and latency. Low-risk requests get fast answers. Policy questions get grounded answers with citations. Complex operational cases get decomposition, tools and verifier checks. High-impact financial retail decisions get controlled recommendations, human approval and audit evidence.

This approach improves quality without uncontrolled cost growth. It also makes governance practical: every high-risk AI run records what evidence was used, which tools were called, which verifiers passed or failed, who approved the final action and what was delivered. We do not expose hidden chain-of-thought or rely on it as audit evidence. We expose concise rationale and retain verifiable evidence.

Board / risk committee version

The control objective is not to make the model appear more confident. The control objective is to ensure that AI-supported work is risk-proportional, evidence-based, reviewable and interruptible. Reasoning budget and verifier cascade provide that operating model. They define when AI can answer, when it must retrieve sources, when it must call deterministic tools, when it must abstain, and when a human must decide.

For AML, credit, disputes, complaints and contact center use cases, this gives management a clear view of risk: tier mix, unsupported claim rate, verifier failures, human overrides, cost, latency, complaints and audit evidence completeness. It turns AI governance from a policy document into a measurable runtime control.

Technology leadership version

We should implement reasoning budget as a shared platform capability: budget classifier, orchestration templates, verifier registry, tool permission model, OpenTelemetry instrumentation and evidence store. Application teams configure use-case policies instead of hand-coding retry logic. This keeps cost, latency, model risk, security and audit evidence consistent across the AI portfolio.

Interview drills

Drill 1: Explain to a PM

Question: Why do we need a reasoning budget instead of just choosing the best model?

Answer: Because model quality is only one part of production quality. A contact center policy question, a payment dispute recommendation and an AML narrative have different risk, latency and evidence needs. Reasoning budget lets us route each request to the right level of retrieval, tool use, verification and human review. It prevents overspending on simple tasks and under-controlling high-impact tasks.

Drill 2: Explain to an architect

Question: What are the main components of verifier cascade architecture?

Answer: I would design a budget classifier, workflow orchestrator, RAG/source layer, deterministic tool layer, verifier registry, human review queue and evidence store. The cascade starts with permission and completeness, then source support, deterministic calculations, policy rules, consistency checks and finally expert review for high-impact cases. Each stage has pass/fail criteria and fail actions, and each run emits trace spans and evidence fields.

Drill 3: Explain to model risk

Question: How do you audit reasoning without storing chain-of-thought?

Answer: I would not use hidden chain-of-thought as audit evidence. I would store the approved sources, source ids, policy versions, input field hashes, tool result summaries, model/prompt config hash, verifier outcomes, output hash, abstention or escalation reason and human approval. That proves what the system used and checked without exposing internal scratchpad text.

Drill 4: AML case

Question: The AML copilot generated a strong suspicious narrative, but the analyst says the evidence is weak. What should happen?

Answer: The verifier cascade should catch unsupported claims before delivery. If a claim fails verification after generation, the case should be escalated, or the narrative rewritten using only observed facts. The evidence packet should show which claims lacked transaction or typology support. The analyst can approve, edit or reject, and the failure should feed the unsupported-claim metric and the knowledge/prompt improvement backlog.

Drill 5: Credit policy case

Question: Can a high reasoning budget justify an automated credit decline?

Answer: No. Budget can improve evidence gathering and recommendation quality, but it cannot grant decision authority. A credit decline or adverse action needs an approved policy, rule or human decision process, plus fair lending controls, adverse reason mapping and audit evidence. AI can assist with explanation and consistency checks, not bypass the control framework.

Drill 6: Payment dispute case

Question: What should the AI do if dispute evidence is incomplete but the customer is waiting in a live channel?

Answer: It should return a bounded response: confirm what is known, state what evidence is missing, give the approved next step and create or route the case. It should not guess liability. If latency budget is exceeded, use degraded mode: frontline guidance now, back-office reasoning asynchronously.
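
The degraded mode in this answer can be sketched with a timeout around the high-budget path. The sketch rests on several assumptions: `respond`, `reasoner`, `backlog`, and the case keys are hypothetical, and a real system would hand the unfinished work to a durable queue, not an in-memory list.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def respond(case, reasoner, latency_budget_s, backlog):
    """Return the full answer if it arrives in time, else a bounded response.

    `reasoner` is a hypothetical callable for the high-budget path; `backlog`
    collects unfinished work for asynchronous back-office handling.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(reasoner, case)
    try:
        return "full", future.result(timeout=latency_budget_s)
    except FutureTimeout:
        backlog.append((case["case_id"], future))
        # Bounded response: known facts, missing evidence, approved next step.
        # No liability guess is made here.
        return "degraded", {
            "known": case["known_facts"],
            "missing": case["missing_evidence"],
            "next_step": "case routed for back-office review",
        }
    finally:
        pool.shutdown(wait=False)  # do not block the live channel
```

The reasoning keeps running in the background after the timeout, so the frontline agent gets guidance immediately while the full recommendation arrives later through the case workflow.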

Drill 7: CTO case

Question: How do you know the extra test-time compute is worth the cost?

Answer: Measure by tier: unsupported claim reduction, human override reduction, rework reduction, complaint reduction, faster resolution, loss avoidance, analyst time saved and audit evidence completeness. If Tier 3 costs more but materially reduces rework and risk in AML or disputes, it may be justified. If a high-budget path only produces longer fluent answers, it should be capped or redesigned.

Source anchors

Source	Link	Purpose
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters	https://arxiv.org/abs/2408.03314	Understand test-time compute as a configurable resource
Self-Consistency Improves Chain of Thought Reasoning in Language Models	https://arxiv.org/abs/2203.11171	Background on multi-path reasoning; not equivalent to an enterprise production architecture
Training Verifiers to Solve Math Word Problems	https://arxiv.org/abs/2110.14168	Supports generator/checker separation and the verifier approach
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Supports the Govern, Map, Measure, Manage risk loop
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Supports generative-AI-specific risks, controls, and evidence
ISO/IEC 42001	https://www.iso.org/standard/81230.html	Supports the AI management system and continual improvement requirements
OpenTelemetry Documentation	https://opentelemetry.io/docs/	Supports runtime evidence design across trace/span/metrics/logs