AI Reasoning Budget: Reasoning Budget and Verifier Cascade Architecture
Date: 2026-06-30
AI Reasoning Budget Architecture: Test-Time Compute / Verifier Cascade / Reasoning Budget
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note, ADR draft, interview answer, 7-day practice plan
Why reasoning budget matters for AI product/architecture
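After an AI product goes live, the most expensive thing is not any single model call but treating every request as the same kind of problem. A simple FAQ, a low-risk summary, a high-risk credit decision explanation, an AML case narrative, a payment dispute judgment, and a regulatory complaint root-cause analysis should not share one reasoning path, one latency budget, one set of audit evidence, or one human review policy.
Reasoning budget is the amount of deliberation resource the system is willing to spend on a task at runtime: tokens, samples, retrieval passes, tool calls, verifier checks, human review time, queue priority, and audit evidence capture. Test-time compute is not simply "letting the model think a little longer"; it is a dynamic resource-allocation mechanism at the product and architecture level.
For financial retail AI, the reasoning budget delivers value at four layers:
| Layer | Core question | Architectural implication |
|---|---|---|
| Product value | Which scenarios are worth being slower, costlier, and more reliable | Set the budget tier from business value, customer impact, and regulatory risk |
| Risk control | Which conclusions must be constrained by evidence, rules, review, or approval | Turn the verifier cascade and escalation into workflow gates |
| SLO economics | Which requests must return in real time and which can be processed asynchronously | Put cost, latency, quality, and escalation rate on one scorecard |
| Governance evidence | How to show the system does not treat "internal reasoning text" as audit evidence | Retain inputs, evidence, versions, verification results, and approval actions; do not leak hidden chain-of-thought |
This note does not repeat introductions to CoT, self-consistency, process supervision, LLM-as-judge, or prompt optimization. The focus here is how an enterprise designs "reasoning capability" as a system capability that is operable, auditable, and cost-controlled.
In one sentence: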
Reasoning budget is the runtime policy that decides when an AI system should answer fast, deliberate deeper, ask for evidence, call tools, run verifiers, abstain, or escalate to a human.
Concept diagram
flowchart TB
A[Business request] --> B[Risk and complexity classifier]
B --> C{Budget tier}
C -->|Tier 0 Fast path| F0[Direct response with schema checks]
C -->|Tier 1 Evidence path| F1[RAG + grounded answer]
C -->|Tier 2 Deliberation path| F2[Plan -> solve -> check loop]
C -->|Tier 3 Controlled decision path| F3[Decompose + tools + verifier cascade + human gate]
F0 --> V0[Safety and format check]
F1 --> V1[Citation support verifier]
F2 --> V2[Consistency, policy and calculation verifier]
F3 --> V3[Independent verifier + rule engine + SME review]
V0 --> D{Deliver, abstain, or escalate}
V1 --> D
V2 --> D
V3 --> D
D --> E[Customer or analyst output]
D --> G[Runtime evidence packet]
G --> H[Trace spans, source ids, policy versions]
G --> I[Verifier results, risk score, abstention reason]
G --> J[Human review action, final disposition]
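The diagram shows control paths, not the model's internal implementation. An enterprise does not need to know how the model reasons internally, and should not require the model to expose hidden chain-of-thought. What the enterprise can control is the external runtime: whether to retrieve evidence, decompose the task, call tools, generate multiple times, run verifiers, require human approval, and record evidence.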
Core architecture model
1. Request intake and risk classification
Every AI run first passes through request intake, which identifies:
| Signal | Example | Purpose |
|---|---|---|
| Use case | AML investigation, credit policy QA, complaint RCA | Match the permitted workflow and control checklist |
| User role | frontline agent, analyst, supervisor, customer | Determine visible information, action permissions, and explanation depth |
| Customer impact | informational, operational, adverse decision support | Determine whether human review and an evidence packet are required |
| Data sensitivity | PII, account data, transaction data, credit data | Determine redaction, retention, access control |
| Time criticality | real-time call center vs back-office investigation | Determine the latency budget |
| Regulatory exposure | AML, fair lending, UDAAP, dispute rights, complaints | Determine verifier and escalation gates |
The risk classifier does not need to be complex at the start. A mature approach combines rules, the use-case inventory, and model-based classification:
budget_tier = f(use_case, action_type, customer_impact, uncertainty, data_sensitivity, user_role, channel_slo)
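The function signature above can be sketched as a small rules-first classifier. This is a minimal illustration, not a reference implementation: the `Request` fields mirror the signals listed in this section, while the `USE_CASE_FLOOR` inventory, the string values, and the 0.5 uncertainty threshold are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    FAST = 0          # Tier 0 Fast path
    EVIDENCE = 1      # Tier 1 Evidence path
    DELIBERATION = 2  # Tier 2 Deliberation path
    CONTROLLED = 3    # Tier 3 Controlled decision path


@dataclass(frozen=True)
class Request:
    use_case: str
    action_type: str       # hypothetical values: "inform", "recommend", "adverse_action_support"
    customer_impact: str   # "informational", "operational" or "adverse"
    uncertainty: float     # assumed scale: 0.0 (certain) to 1.0 (very uncertain)
    data_sensitivity: str  # drives redaction and retention, not the tier, in this sketch
    user_role: str         # drives visibility and permissions, not the tier, in this sketch
    channel_slo_ms: int    # drives the latency budget, not the tier, in this sketch


# Hypothetical use-case inventory: the lowest tier each registered use case may run at.
USE_CASE_FLOOR = {
    "faq": Tier.FAST,
    "contact_center_policy_qa": Tier.EVIDENCE,
    "payment_dispute_triage": Tier.DELIBERATION,
    "aml_narrative_support": Tier.CONTROLLED,
}

UNCERTAINTY_THRESHOLD = 0.5  # illustrative value; a real threshold needs calibration


def budget_tier(req: Request) -> Tier:
    """Return the highest tier demanded by any rule; rules only ever raise the tier."""
    # Unregistered use cases start on the evidence path (an assumption of this sketch).
    tier = USE_CASE_FLOOR.get(req.use_case, Tier.EVIDENCE)
    if req.customer_impact == "adverse" or req.action_type == "adverse_action_support":
        tier = max(tier, Tier.CONTROLLED)
    elif req.customer_impact == "operational":
        tier = max(tier, Tier.DELIBERATION)
    if req.uncertainty >= UNCERTAINTY_THRESHOLD:
        tier = max(tier, Tier.DELIBERATION)
    return tier
```

Because every rule can only raise the tier, a model-based classifier added later can propose a higher tier without being able to lower the floor set by the use-case inventory.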
2. Budget tiering
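| Tier | Applicable tasks | Typical budget | Output strategy | Gates |
|---|---|---|---|---|
| Tier 0 Fast path | Low-risk FAQ, format rewriting, internal drafts | 1 model call, low max tokens, no extra retrieval unless required | Answer directly, optionally with brief sources | schema, safety, policy allowlist |
| Tier 1 Evidence path | Policy QA, knowledge-base Q&A, standard customer-service scripts | RAG top-k, one grounded generation, citation verifier | Answer + citations + limitation | source support, freshness, no unsupported claim |
| Tier 2 Deliberation path | Complaint cause classification, initial payment dispute assessment, operational anomaly diagnosis | plan/solve/check, limited tools, targeted verifier | Recommendation + confidence + evidence summary | rule check, contradiction check, reviewer sampling |
| Tier 3 Controlled decision path | AML SAR support, credit policy edge cases, explanations of adverse customer impact | decomposition, multiple evidence passes, deterministic tools, verifier cascade, human gate | recommendation only, no autonomous adverse action | independent verification, mandatory human decision, audit packet |
More budget is not always better. A high budget can bring longer latency, higher cost, more plausible-sounding rationalization, a larger privacy exposure surface, and more complex evidence management. The key is to keep budget proportionate to risk.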
3. Decomposition and planner/solver/checker loop
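For complex tasks, do not cram every requirement into one prompt. Split the work into three roles; these do not have to be three separate models:
| Role | Responsibility | Must not |
|---|---|---|
| Planner | Break the task into verifiable sub-questions; select the evidence and tools needed | Issue the final business conclusion directly |
| Solver | Generate candidate conclusions from evidence, tool results, and rules | Exceed its permissions to execute customer-impacting actions |
| Checker | Check evidence support, rule consistency, calculation correctness, and output compliance | Treat the model's internal drafts as fact |
Example: payment dispute reasoning.
Planner:
1. Identify the dispute reason code.
2. Check transaction status, authorization method, 3DS/AVS/CVV results, and merchant evidence.
3. Check the customer's dispute window and the regulatory/card-network time limits.
4. Determine whether provisional credit, additional documentation, or human review is needed.
Solver:
Generate a recommended disposition from the retrieved transaction data and policy version.
Checker:
Verify time limits, amounts, reason code, customer notice template, evidence citations, and prohibited phrasing.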
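The payment dispute example can be sketched as a bounded planner/solver/checker loop. This is a toy under stated assumptions: `plan`, `solve`, and `check` are plain functions standing in for model or tool calls, and the field names (`reason_code`, `transaction_status`, `days_left_in_window`) and the two dispositions are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical dispositions the solver is allowed to recommend.
ALLOWED = {"provisional_credit_review", "request_documents"}


@dataclass
class Candidate:
    disposition: str
    cited_evidence: list = field(default_factory=list)


def plan(case: dict) -> list:
    """Planner: name the verifiable sub-questions; it draws no business conclusion."""
    return ["reason_code", "transaction_status", "days_left_in_window"]


def solve(case: dict, steps: list) -> Candidate:
    """Solver: stand-in for a model call that drafts a candidate disposition."""
    cited = [s for s in steps if s in case]
    complete = len(cited) == len(steps)
    return Candidate("provisional_credit_review" if complete else "request_documents", cited)


def check(case: dict, steps: list, cand: Candidate) -> list:
    """Checker: verify the candidate against the case data, not against the solver's own text."""
    problems = [f"unsupported:{k}" for k in cand.cited_evidence if k not in case]
    if cand.disposition not in ALLOWED:
        problems.append("disposition_not_allowed")
    if cand.disposition == "provisional_credit_review":
        problems += [f"missing:{s}" for s in steps if s not in case]
        if case.get("days_left_in_window", 0) <= 0:
            problems.append("outside_dispute_window")
    return problems


def run(case: dict, max_rounds: int = 2) -> dict:
    steps = plan(case)
    problems = ["no_rounds_run"]
    for _ in range(max_rounds):
        # A real solver would be re-prompted with the checker's findings each round.
        cand = solve(case, steps)
        problems = check(case, steps, cand)
        if not problems:
            return {"status": "recommend", "disposition": cand.disposition,
                    "evidence": cand.cited_evidence}
    return {"status": "escalate", "problems": problems}
```

With this stub, a case whose dispute window has closed is drafted by the solver as a provisional credit review, caught by the checker, and escalated after the round limit. The output is a recommendation or an escalation; nothing here executes a customer-impacting action.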
4. Abstention and escalation
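A reasoning budget system must accept that "cannot answer" and "cannot decide automatically" are normal outputs.
| Trigger | System behavior | User-visible message |
|---|---|---|
| Evidence missing | abstain or request missing field | "The available information is insufficient; X must be provided before a determination can be made." |
| Policy conflict | escalate to supervisor/compliance | "This question involves a policy conflict and has been flagged for human review." |
| High customer impact | recommendation only | "The following is a recommendation for review by authorized staff, not an automated decision." |
| Verifier disagreement | hold output, generate reviewer packet | "The system found that the evidence and the conclusion are inconsistent; human confirmation is required." |
| Latency breach | degraded mode | "Confirmed facts are returned first; complex judgments are moved to asynchronous processing." |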
5. Hidden vs exposed rationale
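An enterprise system should distinguish three kinds of material:
| Material | Can be shown to customers | Can serve as audit evidence | Management principle |
|---|---|---|---|
| Hidden model reasoning / scratchpad | No | No | Do not store it, or isolate it strictly; never use it as a source of fact |
| Concise rationale | Yes, subject to template and policy controls | Can be kept as an output record | Explains "why" to the user without exposing internal drafts |
| Evidence packet | Usually not shown in full; viewable by permission | Yes | Retain source ids, policy versions, tool results, verifier outcomes, reviewer actions |
Audit needs reproducible evidence, not the model's complete thought process. Reproducible evidence includes: input summary, retrieval sources, tool results, rule versions, model configuration, output hash, verifier scores, human approval records, and final business disposition.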
Budgeting policies and gates
Policy 1: Risk-proportional compute
| Risk level | Rule |
|---|---|
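| Low | Do not allow unbounded growth in samples or verifiers for low-value requests. Control cost through the fast path and caching. |
| Medium | Requires evidence grounding, schema validation, and a basic contradiction check. |
| High | Requires decomposition, an independent verifier, human review or sampling, and a complete evidence packet. |
| Critical | AI may only produce recommendations or materials; it must not automatically execute adverse customer actions or regulatory filings. |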
Policy 2: Budget cannot override authority
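More reasoning tokens do not mean more authority. Even when the system is highly "confident", it cannot bypass:
- Authorization workflows for customer-impacting actions such as credit denial, limit reduction, and account freezes.
- Statutory processes and compliance review for AML/SAR.
- Time limits, notices, and customer-rights requirements for payment disputes.
- Regulatory classification, root-cause coding, and remediation processes for complaint handling.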
Policy 3: Budget escalation requires evidence
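Moving from Tier 1 to Tier 2 or Tier 3 requires recording the trigger reason:
| Trigger | Example |
|---|---|
| Uncertainty | Candidate answers disagree; citation support is low |
| High impact | May affect credit decisions, transactions, complaint redress, or regulatory reporting |
| Policy edge | Rule exceptions, conflicting product terms, cross-jurisdiction requirements |
| Evidence gap | Key fields missing or sources out of date |
| User challenge | A customer or employee disputes the answer |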
Policy 4: Stop conditions
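The system needs explicit rules for when to stop spending test-time compute:
| Stop condition | Handling |
|---|---|
| Evidence exhausted | Output an insufficiency statement; do not keep guessing |
| Verifier hard fail | Block the output or route to a human |
| Max latency reached | Return a partial factual answer or open an asynchronous ticket |
| Budget cap reached | Save the current evidence packet and mark the case unresolved |
| Policy denial | Refuse or escalate directly; prompt retries must not be allowed to bypass the denial |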
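The stop conditions can be sketched as a guard object that the orchestrator consults before each additional model call. This is a minimal sketch under assumptions: the class and method names are hypothetical, the precedence order (policy denial first, budget cap last) is one reasonable choice rather than something this note prescribes, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Stop(Enum):
    EVIDENCE_EXHAUSTED = "evidence_exhausted"
    VERIFIER_HARD_FAIL = "verifier_hard_fail"
    MAX_LATENCY = "max_latency_reached"
    BUDGET_CAP = "budget_cap_reached"
    POLICY_DENIAL = "policy_denial"


# Handling per stop condition, taken from the table in this section.
HANDLING = {
    Stop.EVIDENCE_EXHAUSTED: "output insufficiency statement",
    Stop.VERIFIER_HARD_FAIL: "block output or route to human",
    Stop.MAX_LATENCY: "return partial factual answer or open async ticket",
    Stop.BUDGET_CAP: "save evidence packet and mark unresolved",
    Stop.POLICY_DENIAL: "refuse or escalate; no prompt retry",
}


@dataclass
class BudgetGuard:
    max_model_calls: int
    max_latency_ms: float
    max_cost_usd: float
    clock: Callable[[], float] = time.monotonic  # returns seconds; injectable for tests
    calls: int = 0
    cost_usd: float = 0.0

    def __post_init__(self) -> None:
        self._t0 = self.clock()

    def record_call(self, cost_usd: float) -> None:
        self.calls += 1
        self.cost_usd += cost_usd

    def should_stop(self, *, policy_denied: bool = False, verifier_hard_fail: bool = False,
                    evidence_exhausted: bool = False) -> Optional[Stop]:
        """Return the first stop condition that applies, or None to continue."""
        if policy_denied:
            return Stop.POLICY_DENIAL
        if verifier_hard_fail:
            return Stop.VERIFIER_HARD_FAIL
        if evidence_exhausted:
            return Stop.EVIDENCE_EXHAUSTED
        if (self.clock() - self._t0) * 1000 >= self.max_latency_ms:
            return Stop.MAX_LATENCY
        if self.calls >= self.max_model_calls or self.cost_usd >= self.max_cost_usd:
            return Stop.BUDGET_CAP
        return None
```

Checking policy denial before anything else reflects the rule that retries must not bypass a denial: no remaining budget can turn a denied request into an allowed one.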
Policy 5: Budget ownership
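| Owner | Responsibility |
|---|---|
| Product owner | Define use-case value, user journey, and acceptable wait time |
| Business control owner | Define which tasks require human review or an audit trail |
| AI architect | Design tiering, orchestration, verifier cascade, observability |
| Model risk / governance | Approve evaluation, evidence, and ongoing monitoring for high-risk use cases |
| Operations owner | Manage escalation queues, human review SLAs, and the feedback loop |
| Finance / platform owner | Manage token cost, capacity, vendor spend, rate limits |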
Verifier cascade and evidence design
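A verifier cascade orders multiple loosely coupled checkers by cost, determinism, and risk impact. The goal is not to have one judge "rule on everything", but to use cheap, deterministic, explainable checks to intercept obvious problems first, then hand complex judgments to more expensive models, rules, or humans.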
Cascade pattern
| Stage | Verifier | Example | Fail action |
|---|---|---|---|
| V0 | Input and permission verifier | Is the user authorized to access the account, case, or policy | deny or redact |
| V1 | Schema and completeness verifier | Are reason code, policy id, amount, and date present | ask for missing fields |
| V2 | Retrieval support verifier | Is each key claim supported by a source id | revise or abstain |
| V3 | Deterministic tool verifier | Are APR/DTI/term/amount/deadline calculations correct | block and rerun with corrected tool result |
| V4 | Policy rule verifier | Does the output violate product policy, regulatory interpretation, or prohibited commitments | escalate or rewrite |
| V5 | Cross-case consistency verifier | Does the conclusion clearly deviate from historical handling of similar cases | reviewer sampling or supervisor gate |
| V6 | Human expert verifier | AML investigator, credit policy officer, complaints QA | approve, edit, reject, create finding |
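The cascade can be sketched as an ordered list of checks that short-circuits on the first failure and returns that stage's fail action. This is a minimal sketch covering only V0 to V2. The `Verifier` type, the shape of the `run` dict, and the lambda checks are assumptions for illustration, and real verifiers would call retrieval, rule, or calculation services.

```python
from dataclasses import dataclass
from typing import Callable

REQUIRED_FIELDS = {"reason_code", "policy_id", "amount", "date"}


@dataclass(frozen=True)
class Verifier:
    stage: str
    name: str
    check: Callable[[dict], bool]  # True means pass
    fail_action: str


# Ordered cheapest and most deterministic first, as in the table above.
CASCADE = [
    Verifier("V0", "permission", lambda r: r["user_role"] in r["allowed_roles"],
             "deny or redact"),
    Verifier("V1", "schema", lambda r: REQUIRED_FIELDS.issubset(r["fields"]),
             "ask for missing fields"),
    Verifier("V2", "retrieval support",
             lambda r: all(c.get("source_id") for c in r["claims"]),
             "revise or abstain"),
]


def run_cascade(run: dict, cascade: list) -> dict:
    """Run verifiers in order and stop at the first failure."""
    results = []
    for v in cascade:
        passed = v.check(run)
        results.append((v.stage, "pass" if passed else "fail"))
        if not passed:
            # Later, more expensive stages are never invoked for this run.
            return {"delivered": False, "action": v.fail_action, "results": results}
    return {"delivered": True, "action": "deliver", "results": results}
```

The `results` list gives the per-stage outcomes that an evidence packet would record, without capturing any model reasoning text.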
Evidence object design
For each high-impact run, generate a reasoning_budget_evidence object:
| Field | Example |
|---|---|
| run_id | ai-run-20260630-aml-00031 |
| use_case | AML alert narrative support |
| budget_tier | Tier 3 |
| trigger_reason | high customer/regulatory impact, evidence conflict |
| model_config_hash | hash of model id, prompt version, temperature, max tokens |
| source_refs | transaction ids, policy ids, knowledge chunk ids |
| tool_results | sanctions screen result summary, DTI calculator result |
| verifier_results | V1 pass, V2 fail then pass, V4 pass, V6 approved |
| abstention_or_escalation | escalated to level-2 investigator |
| output_hash | hash of delivered narrative |
| human_action | approved with edits, reviewer id, timestamp |
| retention_policy | 7 years for AML case record, or local policy |
Do not put hidden chain-of-thought in this object. What needs recording is control outcomes and evidence links.
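The evidence object can be sketched as a dataclass whose hash fields are derived rather than supplied by the caller. This is a minimal sketch under assumptions: the field names follow the table in this section, while `build_evidence`, the `FORBIDDEN_KEYS` set, and the choice of SHA-256 over canonical JSON are illustrative and not prescribed by this note.

```python
import hashlib
import json
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class ReasoningBudgetEvidence:
    run_id: str
    use_case: str
    budget_tier: str
    trigger_reason: str
    model_config_hash: str
    source_refs: list
    tool_results: dict
    verifier_results: dict
    abstention_or_escalation: str
    output_hash: str
    human_action: str
    retention_policy: str


# Hypothetical key names that would indicate hidden reasoning leaking into evidence.
FORBIDDEN_KEYS = {"chain_of_thought", "scratchpad", "hidden_reasoning"}


def build_evidence(*, model_config: dict, delivered_text: str, **fields) -> ReasoningBudgetEvidence:
    """Build the evidence record; hashes are computed here so callers cannot supply them."""
    leaked = FORBIDDEN_KEYS.intersection(fields)
    if leaked:
        raise ValueError(f"hidden reasoning must not be stored: {sorted(leaked)}")
    # sort_keys gives a canonical serialization, so key order does not change the hash.
    config_hash = sha256_hex(json.dumps(model_config, sort_keys=True))
    return ReasoningBudgetEvidence(
        model_config_hash=config_hash,
        output_hash=sha256_hex(delivered_text),
        **fields,
    )
```

Storing a hash of the delivered narrative instead of its text lets a reviewer confirm what was delivered while the narrative itself stays in the case system under its own access controls.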
Financial retail scenarios
1. AML investigations
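The high value of an AML copilot is not "automatically determining suspicion" but reducing the time analysts spend gathering material and drafting narratives.
| Step | Reasoning budget design |
|---|---|
| Triage | Tier 2: summarize the alert, aggregate transaction patterns, check deviation from the customer profile |
| Typology match | Tier 3: retrieve from the typology library, run sanctions/PEP/geographic-risk tools, verifier checks for unsupported claims |
| Narrative draft | Tier 3: generate an analyst-facing draft only; human approval is mandatory |
| Evidence | Retain transaction refs, typology ids, tool results, analyst edits, final disposition |
Key control: AI must not turn "looks suspicious" language into factual assertions. The narrative must distinguish observed facts, policy indicators, and analyst judgment.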
2. Credit policy reasoning
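In credit scenarios, the reasoning budget must protect fairness, explainability, and authorization boundaries.
| Step | Reasoning budget design |
|---|---|
| Eligibility QA | Tier 1: answer policy questions with RAG, citing the current version |
| Borderline assessment | Tier 3: call DTI/affordability tools, check exception policies |
| Adverse action support | Tier 3: AI generates reason candidates; a human or the rule system decides the final reasons |
| Evidence | Retain policy version, input fields, calculation tool result, adverse reason mapping |
Key control: AI must not generate unverified adverse action reasons, must not use prohibited variables or proxy reasoning, and must not present internal reasoning as the customer explanation.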
3. Payment dispute reasoning
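Payment disputes require balancing customer experience, card-network rules, regulatory time limits, and loss control.
| Step | Reasoning budget design |
|---|---|
| Frontline intake | Tier 1: guide which facts to collect; do not issue a final denial |
| Case classification | Tier 2: propose candidate paths from reason code, transaction status, and timeline |
| Liability analysis | Tier 3: call transaction tools, the rule library, and the deadline calculator; verifier checks for conflicts |
| Evidence | Retain transaction data refs, rule version, deadline calculation, customer notice template |
Key control: if evidence is insufficient, the system should request additional documentation or escalate rather than guess in order to produce an answer.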
4. Complaints root-cause analysis
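Complaints RCA is a good fit for test-time compute because it usually requires stitching together evidence across channels, systems, and policies.
| Step | Reasoning budget design |
|---|---|
| Complaint summary | Tier 1: summarize the customer's claims, timeline, and products involved |
| Root cause hypothesis | Tier 2: generate multiple candidate root causes; each candidate must cite evidence |
| Regulatory classification | Tier 3: check complaint classification, response deadlines, and redress requirements |
| Evidence | Retain case notes, call transcript ids, policy refs, RCA code, QA review |
Key control: distinguish customer allegation, confirmed fact, business error, systemic root cause, and remediation action.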
5. Contact center policy QA
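The contact center needs low latency but cannot sacrifice policy accuracy.
| Step | Reasoning budget design |
|---|---|
| Live answer | Tier 1: RAG + citation, enforce short answers |
| Complex exception | Tier 2: prompt the agent to transfer to a supervisor or create a back-office task |
| Customer-facing language | Tier 1/Tier 2: no commitments, no legal conclusions, use approved templates |
| Evidence | Retain policy source ids, agent accepted/edited, call outcome |
Key control: do not show internal reasoning in the conversation. Agents need readable scripts, sources, limitations, and next steps.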
Metrics/control/evidence model
Product and operations metrics
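| Metric | Meaning | Intended use |
|---|---|---|
| budget tier mix | Share of requests in each tier | Detect overuse of high budget or underestimation of high-risk requests |
| cost per resolved case | Model and tool cost per completed case | Read alongside labor savings, loss reduction, and SLA improvement |
| p50/p95 latency by tier | Latency for each budget tier | Show whether the SLO design is realistic |
| abstention rate | Share of requests where the system declines to answer or asks for more evidence | Too low may signal overconfidence; too high may signal poor experience |
| escalation rate | Share of requests routed to humans | Core indicator of operational capacity and control strength |
| verifier failure rate | Interception rate per verifier | Locate gaps in the knowledge base, prompts, tools, or policy |
| unsupported claim rate | Share of key claims without evidence support | Key indicator for RAG and output governance |
| human override rate | Share of outputs modified or overturned by humans | Trigger a model/process review when above threshold |
| repeat complaint / rework rate | Subsequent rework or complaints | Measure real business quality |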
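Several of these metrics can be computed directly from per-run records. The sketch below is illustrative only: the record fields (`tier`, `disposition`, `cost_usd`, `key_claims`, `unsupported_claims`, `human_reviewed`, `overridden`) are hypothetical, and the denominators (for example, total cost divided by resolved cases) are one possible definition that each team would need to agree on.

```python
from collections import Counter


def scorecard(runs: list) -> dict:
    """Aggregate per-run records into a few of the product and operations metrics."""
    n = len(runs)
    if n == 0:
        return {}
    tier_counts = Counter(r["tier"] for r in runs)
    resolved = [r for r in runs if r["disposition"] == "resolved"]
    reviewed = [r for r in runs if r.get("human_reviewed")]
    key_claims = sum(r.get("key_claims", 0) for r in runs)
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "tier_mix": {tier: count / n for tier, count in tier_counts.items()},
        "abstention_rate": sum(r["disposition"] == "abstained" for r in runs) / n,
        "escalation_rate": sum(r["disposition"] == "escalated" for r in runs) / n,
        # Assumed definition: all spend, including unresolved runs, over resolved cases.
        "cost_per_resolved_case": total_cost / len(resolved) if resolved else None,
        "unsupported_claim_rate": (
            sum(r.get("unsupported_claims", 0) for r in runs) / key_claims
            if key_claims else None
        ),
        # Only human-reviewed runs can be overridden, so they form the denominator.
        "human_override_rate": (
            sum(bool(r.get("overridden")) for r in reviewed) / len(reviewed)
            if reviewed else None
        ),
    }
```

Returning `None` when a denominator is zero keeps "no data" distinct from a genuine rate of 0, which matters when a dashboard alerts on thresholds.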
Control metrics
| Control | Evidence | Review cadence |
|---|---|---|
| Budget tier assignment accuracy | Sampled review of risk classifier decisions | monthly |
| High-impact human gate | Tier 3 case approval logs | weekly |
| Citation support | Claim-source support report | daily dashboard |
| Policy version freshness | Knowledge base index version and policy release log | each release |
| Cost cap enforcement | per use case budget dashboard | weekly |
| Latency SLO breach | trace metrics and degraded mode events | daily |
| Hidden rationale protection | log inspection, redaction tests, output safety eval | each release |
| Reviewer calibration | inter-reviewer agreement and QA findings | monthly |
Evidence model aligned to observability
The OpenTelemetry trace/span model can be mapped onto the AI reasoning workflow:
root span: ai.reasoning.run
  span: ai.budget.classify
  span: ai.context.retrieve
  span: ai.plan.create
  span: ai.solve.generate
  span: ai.tool.calculate
  span: ai.verify.citation
  span: ai.verify.policy
  span: ai.escalate.human
  span: ai.output.deliver
Key attributes:
| Attribute | Example |
|---|---|
| ai.use_case | credit_policy_reasoning |
| ai.budget_tier | tier_3 |
| ai.budget_reason | adverse_action_support |
| ai.max_model_calls | 4 |
| ai.max_latency_ms | 30000 |
| ai.max_cost_usd | 0.35 |
| ai.verifier_policy | credit_v12_cascade |
| ai.abstention_reason | missing_income_evidence |
| ai.human_gate | required |
| ai.final_disposition | human_approved_with_edits |
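The span layout and attributes can be illustrated with a small standard-library recorder. This is a stand-in to show the parent/child structure and where attributes attach; it is not the OpenTelemetry API, and `TraceRecorder` and its `span` method are hypothetical names. In a real deployment an OpenTelemetry tracer would play this role.

```python
import time
from contextlib import contextmanager


class TraceRecorder:
    """Toy recorder: keeps spans in start order with their parent and attributes."""

    def __init__(self):
        self.spans = []
        self._stack = []  # names of currently open spans, innermost last

    @contextmanager
    def span(self, name, attributes=None):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attributes": dict(attributes or {}),
        }
        self.spans.append(record)
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Duration is recorded even if the wrapped step raises.
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()


trace = TraceRecorder()
root_attrs = {
    "ai.use_case": "credit_policy_reasoning",
    "ai.budget_tier": "tier_3",
    "ai.max_model_calls": 4,
}
with trace.span("ai.reasoning.run", root_attrs) as root:
    with trace.span("ai.budget.classify"):
        pass  # classifier call would go here
    with trace.span("ai.verify.policy", {"ai.verifier_policy": "credit_v12_cascade"}):
        pass  # policy verifier call would go here
    # Attributes known only at the end are set on the root span before it closes.
    root["attributes"]["ai.final_disposition"] = "human_approved_with_edits"
```

Attributes carry control outcomes and configuration only. Prompt text and model reasoning are deliberately kept out of span attributes, consistent with the hidden rationale protection control.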
Anti-patterns and failure modes
| Anti-pattern | Why it fails | Better design |
|---|---|---|
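| One prompt for every risk tier | Wastes cost on low-risk requests and leaves high-risk requests without controls | Tier by use case, impact, and uncertainty |
| More tokens as universal fix | Raises cost and latency without guaranteeing factual correctness | Add evidence, tools, and deterministic checks first |
| Saving full hidden reasoning for audit | Exposes sensitive drafts and blurs fact with speculation | Retain source refs, tool results, verifier outcomes |
| Letting confidence bypass controls | Confidence is not authorization | High-impact actions must pass permission and human gates |
| Verifier only at final output | Errors have already contaminated the context and the conclusion | Layer checks across input, retrieval, tool, policy, and output |
| No stop condition | The system retries until cost runs away or it hallucinates | Set latency and cost caps, plus stops for evidence exhaustion and policy denial |
| Treating abstention as failure | Forces the model to fabricate when evidence is insufficient | Treat abstention/escalation as an acceptable outcome |
| Unobserved reasoning workflows | Cost, latency, and failure causes cannot be reviewed | Record every budget and verifier decision with trace/span |
| User-facing rationale copied from internal scratchpad | May leak security policy, wrong branches, or sensitive information | Generate a separate concise rationale and evidence summary |
| Business owners not involved in tiering | The technical team cannot judge customer impact alone | PM/BA/risk/control owner jointly maintain the policy |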
Architecture mapping to RAG / Agent / Copilot / Eval / Governance
| Architecture area | Reasoning budget role | Design question |
|---|---|---|
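| RAG | Decides whether to retrieve, how many rounds, and whether a citation verifier is needed | Which source type must support this claim? |
| Agent | Decides tool calls, planning depth, action permissions, and stop conditions | Which tools may be called automatically, and which actions only produce recommendations? |
| Copilot | Decides how much rationale to show employees and when to hand off to a human | Does the user need an answer, a recommendation, sources, or a review packet? |
| Eval | Builds golden sets, challenge sets, and cost/latency evals per budget tier | Does the high-budget path actually improve quality? |
| Governance | Brings tiering, abstention, human gate, and evidence retention into the control library | Who approves a use case entering the Tier 3 controlled decision path? |
| Observability | Records each budget choice, verifier result, cost, and latency | After an incident, can we reconstruct why the system gave that recommendation? |
| Model risk | Validates how model changes affect tier mix, override, and unsupported claim | Does switching models cause fewer high-risk requests to be escalated? |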
ADR draft
Title
Adopt risk-proportional reasoning budget and verifier cascade for high-impact financial retail AI workflows.
Status
Proposed.
Context
Current AI workflows often use a uniform generation path. This creates three problems: low-risk tasks overconsume compute, high-risk tasks lack formal verification and escalation, and audit evidence is inconsistent across use cases. Financial retail workflows such as AML investigation support, credit policy reasoning, payment disputes, complaints RCA, and contact center policy QA require different latency, cost, evidence, and governance profiles.
Decision
We will introduce a reasoning budget policy with four tiers: Fast path, Evidence path, Deliberation path, and Controlled decision path. Budget tier assignment will be based on use case, customer impact, uncertainty, data sensitivity, user role, and channel SLO. High-impact workflows will use decomposition, deterministic tools where available, verifier cascade, abstention/escalation rules, and human approval gates. Runtime evidence will record inputs, source references, tool outputs, policy versions, verifier results, output hashes, and human actions. Hidden model reasoning will not be used as audit evidence or exposed to customers.
Consequences
Positive consequences:
- Better alignment between cost, latency, risk, and business value.
- Clearer control design for high-impact AI use cases.
- Stronger auditability without leaking chain-of-thought.
- Better operational capacity planning through tier mix and escalation metrics.
Tradeoffs:
- More orchestration complexity than a single-call model design.
- Requires ownership of budget policy, verifier maintenance, and evidence retention.
- High-risk workflows may have higher p95 latency and require human queue capacity.
Alternatives considered
| Alternative | Rejection reason |
|---|---|
| Single prompt with larger token budget | Does not solve evidence, authority, or audit requirements |
| LLM judge as final arbiter | Judge bias and domain limits make it insufficient for regulated decisions |
| Human review for every case | Too slow and expensive, and still lacks machine-readable runtime evidence |
| Fully autonomous agent | Not acceptable for adverse, regulated, or customer-impacting actions |
Decision criteria
| Criterion | Required outcome |
|---|---|
| Customer impact | High-impact actions require human gate |
| Evidence quality | Key claims require source or tool support |
| Cost control | Each tier has max calls, max latency, max spend |
| Governance | Tiering policy is approved and versioned |
| Auditability | Evidence packet supports review without hidden chain-of-thought |
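Interview answer: 30-second, 2-minute, and CTO versions
30-second version
An AI reasoning budget manages test-time compute as a product and architecture policy. Low-risk questions take the fast path; questions that need policy evidence go through RAG and citation verification; high-impact scenarios go through decomposition, tools, a verifier cascade, and human review. The point is not to display the model's chain-of-thought but to retain auditable evidence: inputs, sources, tool results, policy versions, verification results, and human decisions. In financial retail, this controls cost, latency, customer impact, and regulatory risk at the same time.
2-minute version
I would design the reasoning budget as a four-tier runtime policy. The first tier is the low-risk fast path, suited to simple FAQ and text rewriting. The second is the evidence path, used for tasks such as contact-center policy QA where sources must be cited. The third is the deliberation path, used for tasks that need decomposition and checking, such as initial payment dispute assessment and complaint root-cause analysis. The fourth is the controlled decision path, used for high-risk scenarios such as AML investigation, credit policy edge cases, and explanations of adverse customer impact.
Architecturally, I run risk and complexity classification first, then select the budget tier. A high budget does not simply mean more tokens; it means more evidence retrieval, deterministic tools, a planner/solver/checker loop, a verifier cascade, abstention, and human review. Verifiers should start with cheap, deterministic checks (permissions, field completeness, citation support, calculation tools, policy rules) and only then reach model-based evaluation or human experts.
On governance, I would not retain hidden chain-of-thought as audit evidence. Audit evidence should be source ids, policy versions, tool outputs, verifier results, model config hash, output hash, human approval, and final disposition. This shows that the system applied reasonable controls without leaking internal drafts or mistaking model speculation for fact.
CTO version
I would make the reasoning budget a runtime policy layer of the AI platform, rather than having each application team write its own prompt retries. The platform provides a unified budget classifier, orchestration contract, verifier registry, tool permission model, OpenTelemetry instrumentation, and evidence store. Each use case registers its risk tier, SLO, allowed tools, verifier cascade, human gate, and retention policy.
On technical decisions, I would put deterministic checks and business rules up front and place the LLM where it is strong: decomposition, summarization, evidence synthesis, and language generation. High-impact tasks use a planner/solver/checker loop, but the checker does not rely on the same context to vouch for itself; it combines a citation verifier, rule engine, calculation tool, policy version check, and human review. On cost, I would set up a per-use-case compute budget, a tier mix dashboard, p95 latency, override rate, unsupported claim rate, and cost per resolved case, and use real business outcomes to verify whether the high-budget path is worth it.
On governance, I would ground the NIST AI RMF Govern/Map/Measure/Manage approach in the evidence plane and align with the AI management system responsibilities of ISO/IEC 42001. The launch gate does not ask "how smart is the model"; it asks "which requests will enter the high-budget path, which conclusions will be blocked or escalated, can the evidence be reconstructed, and will a model or policy change break these controls".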
7-day practice plan
| Day | Practice | Output |
|---|---|---|
| Day 1 | Pick one financial retail AI use case and draw its fast path, evidence path, deliberation path, and controlled path | One-page budget tier map |
| Day 2 | Design a task complexity rubric for that use case covering customer impact, evidence gap, policy edge, and latency | complexity scoring table |
| Day 3 | Design the planner/solver/checker loop, marking each step's inputs, outputs, tools, and stop conditions | workflow diagram |
| Day 4 | Design the verifier cascade, from permission, schema, citation, calculation, and policy through to human review | verifier table |
| Day 5 | Define the evidence object, specifying which fields are stored, hashed, or redacted, and what retention applies | evidence schema |
| Day 6 | Design a cost-latency-risk scorecard covering tier mix, p95 latency, override, and unsupported claim | dashboard mock table |
| Day 7 | Write an ADR explaining why to adopt a risk-proportional reasoning budget and how to launch and govern it | ADR draft + interview answer |
Source anchors
| Source | Link | Idea adopted in this note |
|---|---|---|
| Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | https://arxiv.org/abs/2408.03314 | Test-time compute is an optimizable resource, not a fixed inference cost |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | https://arxiv.org/abs/2203.11171 | Multi-path reasoning serves as historical background, but enterprises should abstract it into a controlled budget policy |
| Training Verifiers to Solve Math Word Problems | https://arxiv.org/abs/2110.14168 | The verifier idea motivates separating "generating the answer" from "checking the answer" |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Bring AI risk into the closed loop of Govern, Map, Measure, Manage |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Risks specific to generative AI need dedicated controls, evaluation, and evidence |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Management-system requirements for responsibility, operational control, and continual improvement in an AI management system |
| OpenTelemetry Documentation | https://opentelemetry.io/docs/ | Use trace, span, metrics, and logs to design AI runtime evidence and SLO monitoring |