
AI Reasoning Budget: Reasoning Budgets and Verifier Cascade Architecture

Date: 2026-06-30

513ai-foundations/papers/167-ai-reasoning-budget-test-time-compute-verifier-cascade-architecture.md

AI Reasoning Budget Architecture: Test-Time Compute / Verifier Cascade / Reasoning Budget

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note, ADR draft, interview answer, 7-day practice plan


Why reasoning budget matters for AI product/architecture

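After an AI product goes live, the largest cost is not any single model call; it is treating every question as the same kind of question. A simple FAQ, a low-risk summary, a high-risk credit explanation, an AML case narrative, a payment dispute judgment, and a regulatory complaint root-cause analysis should not share the same reasoning path, latency budget, audit evidence, or human review policy.

Reasoning budget is the amount of deliberation resource the system is willing to spend on a task at runtime: tokens, samples, retrieval passes, tool calls, verifier checks, human review time, queue priority, and audit evidence capture. Test-time compute is not simply "letting the model think a little longer"; it is a dynamic resource-allocation mechanism at the product and architecture level.

For financial retail AI, a reasoning budget delivers value at four layers:

| Layer | Core question | Architectural implication |
|---|---|---|
| Product value | Which scenarios deserve to be slower, more expensive, and more reliable | Let business value, customer impact, and regulatory risk determine the budget tier |
| Risk control | Which conclusions must be constrained by evidence, rules, review, or approval | Turn the verifier cascade and escalation into workflow gates |
| SLO economics | Which requests must return in real time and which can be processed asynchronously | Put cost, latency, quality, and escalation rate on one scorecard |
| Governance evidence | How to prove the system did not treat "internal reasoning text" as audit evidence | Save inputs, evidence, versions, verification results, and approval actions; do not leak hidden chain-of-thought |

This note does not repeat the basics of CoT, self-consistency, process supervision, LLM-as-judge, or prompt optimization. The focus here is how an enterprise designs "reasoning ability" as a system capability that is operable, auditable, and cost-controlled.

In one sentence: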

Reasoning budget is the runtime policy that decides when an AI system should answer fast, deliberate deeper, ask for evidence, call tools, run verifiers, abstain, or escalate to a human.


Concept diagram

flowchart TB
  A[Business request] --> B[Risk and complexity classifier]
  B --> C{Budget tier}

  C -->|Tier 0 Fast path| F0[Direct response with schema checks]
  C -->|Tier 1 Evidence path| F1[RAG + grounded answer]
  C -->|Tier 2 Deliberation path| F2[Plan -> solve -> check loop]
  C -->|Tier 3 Controlled decision path| F3[Decompose + tools + verifier cascade + human gate]

  F0 --> V0[Safety and format check]
  F1 --> V1[Citation support verifier]
  F2 --> V2[Consistency, policy and calculation verifier]
  F3 --> V3[Independent verifier + rule engine + SME review]

  V0 --> D{Deliver, abstain, or escalate}
  V1 --> D
  V2 --> D
  V3 --> D

  D --> E[Customer or analyst output]
  D --> G[Runtime evidence packet]

  G --> H[Trace spans, source ids, policy versions]
  G --> I[Verifier results, risk score, abstention reason]
  G --> J[Human review action, final disposition]

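The diagram shows control paths, not the model's internal implementation. The enterprise does not need to know how the model reasons internally, and should not require the model to expose hidden chain-of-thought. What the enterprise can control is the external runtime: whether to retrieve evidence, decompose the task, call tools, generate multiple times, run verifiers, require human approval, and record evidence.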


Core architecture model

1. Request intake and risk classification

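Every AI run first passes through request intake, which identifies:

| Signal | Examples | Purpose |
|---|---|---|
| Use case | AML investigation, credit policy QA, complaint RCA | Match the allowed workflow and control checklist |
| User role | frontline agent, analyst, supervisor, customer | Determine visible information, action permissions, and explanation depth |
| Customer impact | informational, operational, adverse decision support | Determine whether human review and an evidence packet are required |
| Data sensitivity | PII, account data, transaction data, credit data | Determine redaction, retention, access control |
| Time criticality | real-time call center vs back-office investigation | Determine the latency budget |
| Regulatory exposure | AML, fair lending, UDAAP, dispute rights, complaints | Determine verifier and escalation gates |

The risk classifier does not need to be complex at the start. A mature approach combines rules, the use case register, and model-based classification: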

budget_tier = f(use_case, action_type, customer_impact, uncertainty, data_sensitivity, user_role, channel_slo)
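
The signature above can be made concrete with a small rule-based classifier. The sketch below is a hypothetical illustration: the `HIGH_IMPACT_USE_CASES` set, the string labels, the 0.5 uncertainty threshold, and the rule ordering are all assumptions rather than values from this note. A real classifier would combine such rules with the use case register and a model-based classifier.

```python
from dataclasses import dataclass

# Hypothetical register entries; a real register is maintained by control owners.
HIGH_IMPACT_USE_CASES = {"aml_sar_support", "credit_adverse_action"}


@dataclass(frozen=True)
class RequestSignals:
    use_case: str
    action_type: str        # assumed labels: "inform", "recommend", "execute"
    customer_impact: str    # "informational", "operational", "adverse"
    uncertainty: float      # assumed scale: 0.0 (certain) to 1.0 (very uncertain)
    data_sensitivity: str   # assumed labels: "public", "internal", "pii"
    user_role: str
    channel_slo_ms: int     # carried along for latency budgeting downstream


def assign_budget_tier(s: RequestSignals) -> int:
    """Return a tier from 0 to 3; the most restrictive matching rule wins."""
    if s.use_case in HIGH_IMPACT_USE_CASES or s.customer_impact == "adverse":
        return 3
    if s.action_type == "execute" or s.uncertainty >= 0.5:
        return 2
    if s.data_sensitivity != "public" or s.user_role == "customer":
        return 1
    return 0


faq = RequestSignals("faq", "inform", "informational", 0.1, "public", "frontline_agent", 2000)
sar = RequestSignals("aml_sar_support", "recommend", "adverse", 0.2, "pii", "analyst", 60000)
print(assign_budget_tier(faq), assign_budget_tier(sar))  # prints: 0 3
```

Keeping the rules ordered from most to least restrictive means a request can never be assigned a lower tier because a cheaper rule happened to match first.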

2. Budget tiering

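| Tier | Applicable tasks | Typical budget | Output policy | Gates |
|---|---|---|---|---|
| Tier 0 Fast path | Low-risk FAQ, format rewriting, internal drafts | 1 model call, low max tokens, no extra retrieval unless required | Direct answer, optionally with brief sources | schema, safety, policy allowlist |
| Tier 1 Evidence path | Policy QA, knowledge base Q&A, standard contact center scripts | RAG top-k, one grounded generation, citation verifier | Answer + citations + limitation | source support, freshness, no unsupported claim |
| Tier 2 Deliberation path | Complaint cause classification, initial payment dispute assessment, operational anomaly diagnosis | plan/solve/check, limited tools, targeted verifier | Recommendation + confidence + evidence summary | rule check, contradiction check, reviewer sampling |
| Tier 3 Controlled decision path | AML SAR support, credit policy boundaries, explanations of adverse customer impact | decomposition, multiple evidence passes, deterministic tools, verifier cascade, human gate | recommendation only, no autonomous adverse action | independent verification, mandatory human decision, audit packet |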

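A budget is not "the more the better". A high budget can bring longer latency, higher cost, more plausible-sounding rationalization, a larger privacy exposure surface, and more complex evidence management. The key is to keep the budget proportionate to the risk.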
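
One way to keep budgets proportionate is to declare explicit caps per tier and check every run against them. The sketch below is a minimal illustration: `TierBudget`, `TIER_BUDGETS`, and `within_budget` are hypothetical names, and the numeric caps are placeholders. The Tier 3 row reuses the example attribute values listed in the observability section of this note (4 calls, 30000 ms, 0.35 USD); the other rows are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierBudget:
    max_model_calls: int
    max_latency_ms: int
    max_cost_usd: float
    human_gate: bool


# Illustrative caps only; real values come from SLOs and cost targets per use case.
TIER_BUDGETS = {
    0: TierBudget(1, 2_000, 0.01, False),
    1: TierBudget(2, 5_000, 0.05, False),
    2: TierBudget(3, 15_000, 0.15, False),
    3: TierBudget(4, 30_000, 0.35, True),
}


def within_budget(tier: int, model_calls: int, latency_ms: int, cost_usd: float) -> bool:
    """True while a run has stayed inside every cap of its tier."""
    b = TIER_BUDGETS[tier]
    return (model_calls <= b.max_model_calls
            and latency_ms <= b.max_latency_ms
            and cost_usd <= b.max_cost_usd)


print(within_budget(0, model_calls=1, latency_ms=1500, cost_usd=0.005))
print(within_budget(0, model_calls=2, latency_ms=1500, cost_usd=0.005))
```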

3. Decomposition and planner/solver/checker loop

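Do not pack every requirement of a complex task into one prompt. Split the work into three roles, which need not be three separate models:

| Role | Responsibility | Must not |
|---|---|---|
| Planner | Break the task into verifiable sub-questions and select the evidence and tools needed | Issue the final business conclusion directly |
| Solver | Generate candidate conclusions from evidence, tool results, and rules | Exceed its permissions to execute customer-impacting actions |
| Checker | Check evidence support, rule consistency, calculation correctness, and output compliance | Treat the model's internal drafts as facts |

Example: payment dispute reasoning.

Planner:
1. Identify the dispute reason code.
2. Check transaction status, authorization method, 3DS/AVS/CVV results, and merchant evidence.
3. Check the customer's dispute window and the regulatory/card scheme time limits.
4. Decide whether provisional credit, additional documents, or manual review is needed.

Solver:
Generate a suggested disposition from the retrieved transaction data and policy version.

Checker:
Verify time limits, amounts, reason codes, the customer notice template, evidence citations, and prohibited wording.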
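
The payment dispute example can be sketched as a bounded loop. This is a minimal illustration under stated assumptions: `plan`, `solve`, `check`, and `run_loop` are hypothetical stand-ins with deterministic stub logic instead of model calls, the task fields are invented for the example, and the dispute window is passed in as data rather than hard-coded because real limits depend on the scheme and jurisdiction.

```python
def plan(task):
    # Planner: verifiable sub-questions only, no business conclusion.
    return ["identify_reason_code", "check_transaction_evidence",
            "check_dispute_window", "decide_next_step"]


def solve(task, steps, feedback):
    # Solver: candidate disposition from supplied facts.
    # A real solver would use `feedback` to revise; this stub is deterministic.
    within_window = task["days_since_transaction"] <= task["dispute_window_days"]
    return {
        "reason_code": task.get("reason_code"),
        "within_window": within_window,
        "suggestion": "provisional_credit_review" if within_window else "manual_review",
        "source_refs": list(task.get("source_refs", [])),
    }


def check(candidate):
    # Checker: deterministic checks on the candidate; returns a list of failures.
    failures = []
    if not candidate["reason_code"]:
        failures.append("missing_reason_code")
    if not candidate["source_refs"]:
        failures.append("missing_evidence_reference")
    return failures


def run_loop(task, max_rounds=2):
    steps = plan(task)
    feedback = []
    for round_no in range(1, max_rounds + 1):
        candidate = solve(task, steps, feedback)
        feedback = check(candidate)
        if not feedback:
            return {"status": "deliver", "candidate": candidate, "rounds": round_no}
    # Rounds exhausted: hand over to a human instead of guessing.
    return {"status": "escalate", "failures": feedback, "rounds": max_rounds}


task = {"reason_code": "fraud_card_absent", "days_since_transaction": 20,
        "dispute_window_days": 60, "source_refs": ["txn-001", "policy-v7"]}
print(run_loop(task)["status"])
```

The loop is capped by `max_rounds` so that a failing checker leads to escalation rather than unbounded retries.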

4. Abstention and escalation

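A reasoning budget system must accept that "cannot answer" and "cannot decide automatically" are normal outputs.

| Trigger | System behavior | User-visible message |
|---|---|---|
| Evidence missing | abstain or request missing field | "The available information is insufficient; X is needed before a judgment can be made." |
| Policy conflict | escalate to supervisor/compliance | "This question involves a policy conflict and has been flagged for human review." |
| High customer impact | recommendation only | "The following is a recommendation for review by authorized staff, not an automated decision." |
| Verifier disagreement | hold output, generate reviewer packet | "The system found the evidence inconsistent with the conclusion; human confirmation is required." |
| Latency breach | degraded mode | "Confirmed facts are returned first; complex judgments move to asynchronous processing." |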
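
The trigger table can be encoded as a lookup so that abstention and escalation are first-class outcomes rather than error paths. In this sketch the `Trigger` enum, the `RESPONSES` mapping, and `decide` are hypothetical names, and the precedence used when several triggers fire at once (most restrictive first) is an assumption that the table itself does not specify.

```python
from enum import Enum


class Trigger(Enum):
    EVIDENCE_MISSING = "evidence_missing"
    POLICY_CONFLICT = "policy_conflict"
    HIGH_CUSTOMER_IMPACT = "high_customer_impact"
    VERIFIER_DISAGREEMENT = "verifier_disagreement"
    LATENCY_BREACH = "latency_breach"


# (system behavior, user-visible message). Insertion order is the assumed precedence.
RESPONSES = {
    Trigger.VERIFIER_DISAGREEMENT: (
        "hold_output",
        "The system found the evidence inconsistent with the conclusion; human confirmation is required."),
    Trigger.POLICY_CONFLICT: (
        "escalate",
        "This question involves a policy conflict and has been flagged for human review."),
    Trigger.EVIDENCE_MISSING: (
        "abstain",
        "The available information is insufficient; more input is needed before a judgment can be made."),
    Trigger.HIGH_CUSTOMER_IMPACT: (
        "recommendation_only",
        "The following is a recommendation for review by authorized staff, not an automated decision."),
    Trigger.LATENCY_BREACH: (
        "degraded_mode",
        "Confirmed facts are returned first; complex judgments move to asynchronous processing."),
}


def decide(triggers):
    """Return the response for the highest-precedence trigger that fired."""
    for trigger, response in RESPONSES.items():  # dicts preserve insertion order
        if trigger in triggers:
            return response
    return ("deliver", "")


behavior, message = decide({Trigger.LATENCY_BREACH, Trigger.POLICY_CONFLICT})
print(behavior)  # prints: escalate
```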

5. Hidden vs exposed rationale

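Enterprise systems should distinguish three kinds of material:

| Material | Can be shown to customers | Can serve as audit evidence | Management principle |
|---|---|---|---|
| Hidden model reasoning / scratchpad | No | No | Do not save, or isolate strictly; never use as a source of facts |
| Concise rationale | Yes, subject to template and policy controls | Can serve as an output record | Explains the "why" to the user without exposing internal drafts |
| Evidence packet | Usually not shown in full; viewable by permission | Yes | Save source ids, policy versions, tool results, verifier outcomes, reviewer actions |

Audit needs reproducible evidence, not the model's complete thought process. Reproducible evidence includes: input summary, retrieval sources, tool results, rule versions, model configuration, output hash, verifier scores, human approval records, and final business disposition.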


Budgeting policies and gates

Policy 1: Risk-proportional compute

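| Risk level | Rule |
|---|---|
| Low | Do not allow unbounded increases in samples or verifiers for low-value requests. Control cost through the fast path and caching. |
| Medium | Must have evidence grounding, schema validation, basic contradiction check. |
| High | Must have decomposition, an independent verifier, human review or sampling, and a complete evidence packet. |
| Critical | AI may only generate recommendations or materials; it must not automatically execute customer-adverse actions or regulatory filings. |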

Policy 2: Budget cannot override authority

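More reasoning tokens do not mean more authority. However "confident" the system is, it cannot bypass:

  • The authorization process for customer-impacting actions such as credit denial, limit reduction, and account freezes.
  • The statutory process and compliance review for AML/SAR.
  • The time limit, notice, and customer rights requirements for payment disputes.
  • The regulatory classification, root-cause coding, and remediation process for complaint handling.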
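
The separation between budget and authority can be made explicit in code by keeping model confidence out of the authorization decision entirely. The sketch below is hypothetical: the action names, approver roles, and `may_execute` function are assumptions for illustration, and unknown actions are denied by default.

```python
# Hypothetical action names and approver roles, for illustration only.
REQUIRED_APPROVER = {
    "credit_denial": "credit_officer",
    "account_freeze": "compliance_officer",
    "sar_filing": "compliance_officer",
}
UNGATED_ACTIONS = {"draft_summary"}


def may_execute(action: str, approver_roles: set, model_confidence: float) -> bool:
    """Authorization depends only on approvals; model_confidence is deliberately ignored."""
    if action in UNGATED_ACTIONS:
        return True
    required = REQUIRED_APPROVER.get(action)
    if required is None:
        return False  # unknown actions fail closed
    return required in approver_roles


print(may_execute("credit_denial", set(), model_confidence=0.99))               # prints: False
print(may_execute("credit_denial", {"credit_officer"}, model_confidence=0.10))  # prints: True
```

The `model_confidence` parameter is accepted only to show that it has no effect on the result.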

Policy 3: Budget escalation requires evidence

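Moving from Tier 1 up to Tier 2 or Tier 3 must record the trigger reason:

| Trigger | Example |
|---|---|
| Uncertainty | Candidate answers disagree, citation support is low |
| High impact | May affect credit, transactions, complaint remediation, regulatory reporting |
| Policy edge | Rule exceptions, conflicting product terms, cross-jurisdiction requirements |
| Evidence gap | Key fields missing or sources out of date |
| User challenge | Customer or employee disputes the answer |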
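
A tier escalation can be forced to carry its trigger by refusing to create the record without one. The sketch below is a minimal illustration; `TierEscalation`, `escalate_tier`, and the snake_case trigger labels are hypothetical names derived from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED_TRIGGERS = {"uncertainty", "high_impact", "policy_edge", "evidence_gap", "user_challenge"}


@dataclass(frozen=True)
class TierEscalation:
    run_id: str
    from_tier: int
    to_tier: int
    trigger: str
    detail: str
    recorded_at: str


def escalate_tier(run_id: str, from_tier: int, to_tier: int, trigger: str, detail: str) -> TierEscalation:
    """Create an escalation record, rejecting escalations without a documented reason."""
    if to_tier <= from_tier:
        raise ValueError("escalation must move to a higher tier")
    if trigger not in ALLOWED_TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    if not detail.strip():
        raise ValueError("a trigger detail is required")
    return TierEscalation(run_id, from_tier, to_tier, trigger, detail,
                          datetime.now(timezone.utc).isoformat())


record = escalate_tier("run-001", 1, 3, "high_impact", "answer may affect a credit decision")
print(record.trigger, record.to_tier)
```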

Policy 4: Stop conditions

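The system needs explicit rules for when to stop spending test-time compute:

| Stop condition | Handling |
|---|---|
| Evidence exhausted | Output an insufficiency statement; do not keep guessing |
| Verifier hard fail | Block the output or hand over to a human |
| Max latency reached | Return a partial factual answer or open an asynchronous ticket |
| Budget cap reached | Save the current evidence packet and mark the run unresolved |
| Policy denial | Refuse or escalate directly; prompt retries must not bypass it |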
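
The stop conditions can be evaluated by the orchestrator before each additional step. This is a sketch under assumptions: `RunState`, `stop_reason`, and `HANDLING` are hypothetical names, and the evaluation order (policy denial first, cost cap last) is a design choice made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunState:
    policy_denied: bool = False
    verifier_hard_fail: bool = False
    evidence_exhausted: bool = False
    elapsed_ms: int = 0
    spent_usd: float = 0.0


HANDLING = {
    "policy_denial": "refuse_or_escalate",
    "verifier_hard_fail": "block_or_hand_to_human",
    "evidence_exhausted": "insufficiency_statement",
    "max_latency_reached": "partial_answer_or_async_ticket",
    "budget_cap_reached": "save_evidence_mark_unresolved",
}


def stop_reason(state: RunState, max_latency_ms: int, max_cost_usd: float) -> Optional[str]:
    """Return the first stop condition that applies, or None to continue."""
    if state.policy_denied:
        return "policy_denial"
    if state.verifier_hard_fail:
        return "verifier_hard_fail"
    if state.evidence_exhausted:
        return "evidence_exhausted"
    if state.elapsed_ms >= max_latency_ms:
        return "max_latency_reached"
    if state.spent_usd >= max_cost_usd:
        return "budget_cap_reached"
    return None


state = RunState(elapsed_ms=31_000, spent_usd=0.20)
reason = stop_reason(state, max_latency_ms=30_000, max_cost_usd=0.35)
print(reason, HANDLING[reason])
```

Checking policy denial first ensures that a retry cannot proceed merely because latency and cost headroom remain.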

Policy 5: Budget ownership

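| Owner | Responsibility |
|---|---|
| Product owner | Define use case value, user journey, acceptable wait time |
| Business control owner | Define which tasks require human review or record-keeping |
| AI architect | Design tiering, orchestration, verifier cascade, observability |
| Model risk / governance | Approve evaluation, evidence, and ongoing monitoring for high-risk use cases |
| Operations owner | Manage escalation queues, human review SLA, feedback loop |
| Finance / platform owner | Manage token cost, capacity, vendor spend, rate limits |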

Verifier cascade and evidence design

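A verifier cascade orders multiple loosely coupled verifiers by cost, determinism, and risk impact. The goal is not to let one judge "rule on everything", but to use cheap, deterministic, explainable checks to catch obvious problems first, and then hand complex judgments to more expensive models, rules, or humans.

Cascade pattern

| Stage | Verifier | Example | Fail action |
|---|---|---|---|
| V0 | Input and permission verifier | Is the user authorized to access the account, case, policy | deny or redact |
| V1 | Schema and completeness verifier | Does the input include reason code, policy id, amount, date | ask for missing fields |
| V2 | Retrieval support verifier | Is each key claim supported by a source id | revise or abstain |
| V3 | Deterministic tool verifier | Are APR/DTI/term/amount/time-limit calculations correct | block and rerun with corrected tool result |
| V4 | Policy rule verifier | Does the output violate product policy, regulatory interpretation, or prohibited commitments | escalate or rewrite |
| V5 | Cross-case consistency verifier | Does the conclusion deviate markedly from historical handling of similar cases | reviewer sampling or supervisor gate |
| V6 | Human expert verifier | AML investigator, credit policy officer, complaints QA | approve, edit, reject, create finding |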
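
The cascade can be sketched as an ordered list of checks that stops at the first failure, so expensive stages never run on inputs that cheap stages already rejected. This is a minimal sketch: `Verifier`, `run_cascade`, and the run dictionary fields are hypothetical, and only stages V0 to V2 are stubbed, with lambdas standing in for real verifiers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verifier:
    stage: str
    check: Callable[[dict], bool]
    fail_action: str


def run_cascade(run: dict, verifiers: list) -> dict:
    """Run verifiers in order and stop at the first failure."""
    results = {}
    for v in verifiers:
        passed = v.check(run)
        results[v.stage] = "pass" if passed else "fail"
        if not passed:
            return {"passed": False, "failed_stage": v.stage,
                    "action": v.fail_action, "results": results}
    return {"passed": True, "failed_stage": None, "action": "deliver", "results": results}


REQUIRED_FIELDS = ("reason_code", "policy_id", "amount", "date")

# Ordered from cheapest and most deterministic to most expensive.
CASCADE = [
    Verifier("V0", lambda r: r.get("user_authorized", False), "deny or redact"),
    Verifier("V1", lambda r: all(k in r for k in REQUIRED_FIELDS), "ask for missing fields"),
    Verifier("V2", lambda r: all(c.get("source_id") for c in r.get("claims", [])), "revise or abstain"),
]

run = {"user_authorized": True, "reason_code": "R1", "policy_id": "P-12",
       "amount": 120.0, "date": "2026-06-30",
       "claims": [{"text": "Transaction settled", "source_id": None}]}
outcome = run_cascade(run, CASCADE)
print(outcome["failed_stage"], outcome["action"])  # prints: V2 revise or abstain
```

The `results` dictionary records every stage that ran, which is the shape needed for the `verifier_results` field of an evidence packet.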

Evidence object design

It is advisable to generate a reasoning_budget_evidence object for every high-impact run:

| Field | Example |
|---|---|
| run_id | ai-run-20260630-aml-00031 |
| use_case | AML alert narrative support |
| budget_tier | Tier 3 |
| trigger_reason | high customer/regulatory impact, evidence conflict |
| model_config_hash | hash of model id, prompt version, temperature, max tokens |
| source_refs | transaction ids, policy ids, knowledge chunk ids |
| tool_results | sanctions screen result summary, DTI calculator result |
| verifier_results | V1 pass, V2 fail then pass, V4 pass, V6 approved |
| abstention_or_escalation | escalated to level-2 investigator |
| output_hash | hash of delivered narrative |
| human_action | approved with edits, reviewer id, timestamp |
| retention_policy | 7 years for AML case record, or local policy |

Do not put hidden chain-of-thought into this object. What needs recording is control results and evidence links.
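
The field list maps naturally onto a typed record. The sketch below is one possible shape, not a prescribed schema: the `ReasoningBudgetEvidence` class, the `stable_hash` helper, the `FORBIDDEN_KEYS` guard, and all sample values other than the `run_id` are assumptions. Hashing uses SHA-256 over a canonical JSON encoding.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Assumed key names that would indicate leaked internal reasoning.
FORBIDDEN_KEYS = {"chain_of_thought", "scratchpad", "hidden_reasoning"}


def stable_hash(value) -> str:
    """SHA-256 over canonical JSON, so equal content yields an equal hash."""
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class ReasoningBudgetEvidence:
    run_id: str
    use_case: str
    budget_tier: str
    trigger_reason: str
    model_config_hash: str
    source_refs: list
    tool_results: dict
    verifier_results: dict
    abstention_or_escalation: str
    output_hash: str
    human_action: str
    retention_policy: str

    def to_record(self) -> dict:
        record = asdict(self)
        leaked = FORBIDDEN_KEYS & (set(record["tool_results"]) | set(record["verifier_results"]))
        if leaked:
            raise ValueError(f"hidden reasoning must not be stored: {sorted(leaked)}")
        return record


evidence = ReasoningBudgetEvidence(
    run_id="ai-run-20260630-aml-00031",
    use_case="AML alert narrative support",
    budget_tier="Tier 3",
    trigger_reason="high customer/regulatory impact, evidence conflict",
    model_config_hash=stable_hash({"model_id": "example-model", "prompt_version": "v3",
                                   "temperature": 0.0, "max_tokens": 1024}),
    source_refs=["txn-001", "policy-aml-v4"],
    tool_results={"sanctions_screen": "no match"},
    verifier_results={"V1": "pass", "V2": "pass", "V4": "pass", "V6": "approved"},
    abstention_or_escalation="escalated to level-2 investigator",
    output_hash=stable_hash("delivered narrative text"),
    human_action="approved with edits",
    retention_policy="per local policy",
)
print(len(evidence.to_record()["output_hash"]))  # prints: 64
```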


Financial retail scenarios

1. AML investigations

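The high value of an AML copilot is not "automatically determining suspiciousness" but reducing the time analysts spend gathering material and drafting the narrative.

| Step | Reasoning budget design |
|---|---|
| Triage | Tier 2: summarize the alert, aggregate transaction patterns, check deviations from the customer profile |
| Typology match | Tier 3: retrieve from the typology library, run sanctions/PEP/geographic risk tools, verifier checks for unsupported claims |
| Narrative draft | Tier 3: generate an analyst-facing draft only; human approval is mandatory |
| Evidence | Save transaction refs, typology ids, tool results, analyst edits, final disposition |

Key control: AI should not turn "looks suspicious" language into factual assertions. The narrative must distinguish observed facts, policy indicators, and analyst judgment.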

2. Credit policy reasoning

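In credit scenarios, the reasoning budget must protect fairness, explainability, and authorization boundaries.

| Step | Reasoning budget design |
|---|---|
| Eligibility QA | Tier 1: RAG answers policy questions, citing the current version |
| Borderline assessment | Tier 3: call DTI/affordability tools, check exception policies |
| Adverse action support | Tier 3: AI generates reason candidates; a human or rule system decides the final reason |
| Evidence | Save policy version, input fields, calculation tool result, adverse reason mapping |

Key control: AI must not generate unverified adverse action reasons, must not use prohibited variables or proxy reasoning, and must not present internal reasoning as the customer explanation.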

3. Payment dispute reasoning

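Payment disputes require balancing customer experience, card scheme rules, regulatory time limits, and loss control.

| Step | Reasoning budget design |
|---|---|
| Frontline intake | Tier 1: guide which facts to collect; make no final denial |
| Case classification | Tier 2: propose candidate paths from the reason code, transaction status, and timeline |
| Liability analysis | Tier 3: call transaction tools, the rule base, and the deadline calculator; verifier checks for conflicts |
| Evidence | Save transaction data refs, rule version, deadline calculation, customer notice template |

Key control: if evidence is insufficient, the system should request additional documents or escalate rather than guess in order to produce an answer.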

4. Complaints root-cause analysis

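Complaints RCA suits test-time compute because it usually requires stitching together evidence across channels, systems, and policies.

| Step | Reasoning budget design |
|---|---|
| Complaint summary | Tier 1: summarize the customer's claim, timeline, and products involved |
| Root cause hypothesis | Tier 2: generate multiple candidate root causes; each candidate must cite evidence |
| Regulatory classification | Tier 3: check complaint classification, response time limits, remediation requirements |
| Evidence | Save case notes, call transcript ids, policy refs, RCA code, QA review |

Key control: distinguish customer allegation, confirmed fact, business error, systemic root cause, and remediation action.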

5. Contact center policy QA

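A contact center needs low latency but cannot sacrifice policy accuracy.

| Step | Reasoning budget design |
|---|---|
| Live answer | Tier 1: RAG + citation, enforced short answers |
| Complex exception | Tier 2: prompt a handoff to a supervisor or create a back-office task |
| Customer-facing language | Tier 1/Tier 2: no commitments, no legal conclusions, use approved templates |
| Evidence | Save policy source ids, agent accepted/edited, call outcome |

Key control: do not show internal reasoning in the conversation. What the agent needs is readable wording, sources, limitations, and next steps.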
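
The step-to-tier assignments across the five scenarios can be kept in one registry so that orchestration code looks tiers up instead of hard-coding them. In the sketch below the tier numbers are copied from the scenario tables, while the key names, the `tier_for` and `requires_human_gate` helpers, and the fail-closed default for unknown steps are assumptions. The "Customer-facing language" step is omitted because the table lists it as Tier 1/Tier 2.

```python
# Tier numbers copied from the scenario tables; keys are hypothetical identifiers.
STEP_TIERS = {
    ("aml", "triage"): 2,
    ("aml", "typology_match"): 3,
    ("aml", "narrative_draft"): 3,
    ("credit", "eligibility_qa"): 1,
    ("credit", "borderline_assessment"): 3,
    ("credit", "adverse_action_support"): 3,
    ("dispute", "frontline_intake"): 1,
    ("dispute", "case_classification"): 2,
    ("dispute", "liability_analysis"): 3,
    ("complaint", "summary"): 1,
    ("complaint", "root_cause_hypothesis"): 2,
    ("complaint", "regulatory_classification"): 3,
    ("contact_center", "live_answer"): 1,
    ("contact_center", "complex_exception"): 2,
}


def tier_for(scenario: str, step: str) -> int:
    # Assumption: unknown steps fall back to the most controlled tier (fail closed).
    return STEP_TIERS.get((scenario, step), 3)


def requires_human_gate(scenario: str, step: str) -> bool:
    # Tier 3 is the controlled decision path, which carries a human gate.
    return tier_for(scenario, step) == 3


print(tier_for("dispute", "case_classification"), requires_human_gate("aml", "narrative_draft"))
```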


Metrics/control/evidence model

Product and operations metrics

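| Metric | Meaning | Intended use |
|---|---|---|
| budget tier mix | Share of requests in each tier | Detect overuse of high budgets or underestimation of high-risk requests |
| cost per resolved case | Model and tool cost per completed case | Read alongside labor savings, loss reduction, and SLA improvement |
| p50/p95 latency by tier | Latency for each budget tier | Show whether the SLO design is realistic |
| abstention rate | Share of runs where the system declines to answer or requests more evidence | Too low may signal overconfidence; too high may mean a poor experience |
| escalation rate | Share of runs handed to humans | Core indicator of operational capacity and control strength |
| verifier failure rate | Interception rate of each verifier | Find gaps in the knowledge base, prompts, tools, or policy |
| unsupported claim rate | Share of key claims without evidence support | Key indicator for RAG and output governance |
| human override rate | Share of outputs edited or overturned by humans | Trigger a model/process review when above threshold |
| repeat complaint / rework rate | Subsequent rework or complaints | Measure real business quality |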
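
Several of these metrics can be derived directly from per-run records. The sketch below assumes a hypothetical record schema (`tier`, `outcome`, `cost_usd`, `overridden`) and defines cost per resolved case as total spend divided by delivered runs, which is one possible definition rather than the only one.

```python
from collections import Counter


def scorecard(runs):
    """Compute tier mix and rate metrics from a list of run dicts (assumed schema)."""
    total = len(runs)
    if total == 0:
        return {}
    outcomes = Counter(r["outcome"] for r in runs)  # missing keys count as 0
    tiers = Counter(r["tier"] for r in runs)
    resolved = outcomes["delivered"]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "tier_mix": {t: n / total for t, n in sorted(tiers.items())},
        "abstention_rate": outcomes["abstained"] / total,
        "escalation_rate": outcomes["escalated"] / total,
        "human_override_rate": sum(1 for r in runs if r["overridden"]) / total,
        "cost_per_resolved_case": total_cost / resolved if resolved else None,
    }


runs = [
    {"tier": 0, "outcome": "delivered", "cost_usd": 0.01, "overridden": False},
    {"tier": 1, "outcome": "delivered", "cost_usd": 0.04, "overridden": True},
    {"tier": 3, "outcome": "escalated", "cost_usd": 0.30, "overridden": False},
    {"tier": 3, "outcome": "abstained", "cost_usd": 0.25, "overridden": False},
]
print(scorecard(runs)["tier_mix"])  # prints: {0: 0.25, 1: 0.25, 3: 0.5}
```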

Control metrics

| Control | Evidence | Review cadence |
|---|---|---|
| Budget tier assignment accuracy | Sampled review of risk classifier decisions | monthly |
| High-impact human gate | Tier 3 case approval logs | weekly |
| Citation support | Claim-source support report | daily dashboard |
| Policy version freshness | Knowledge base index version and policy release log | each release |
| Cost cap enforcement | per use case budget dashboard | weekly |
| Latency SLO breach | trace metrics and degraded mode events | daily |
| Hidden rationale protection | log inspection, redaction tests, output safety eval | each release |
| Reviewer calibration | inter-reviewer agreement and QA findings | monthly |

Evidence model aligned to observability

The OpenTelemetry trace/span model can be mapped onto an AI reasoning workflow:

root span: ai.reasoning.run
  span: ai.budget.classify
  span: ai.context.retrieve
  span: ai.plan.create
  span: ai.solve.generate
  span: ai.tool.calculate
  span: ai.verify.citation
  span: ai.verify.policy
  span: ai.escalate.human
  span: ai.output.deliver

Key attributes:

| Attribute | Example |
|---|---|
| ai.use_case | credit_policy_reasoning |
| ai.budget_tier | tier_3 |
| ai.budget_reason | adverse_action_support |
| ai.max_model_calls | 4 |
| ai.max_latency_ms | 30000 |
| ai.max_cost_usd | 0.35 |
| ai.verifier_policy | credit_v12_cascade |
| ai.abstention_reason | missing_income_evidence |
| ai.human_gate | required |
| ai.final_disposition | human_approved_with_edits |
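
The nesting of a root span with child spans can be illustrated without any tracing dependency. The sketch below uses a hypothetical `TraceRecorder` built on the standard library; it is a stand-in that mimics the parent/child and attribute structure and is not the OpenTelemetry API. A production system would use the OpenTelemetry SDK for the same purpose.

```python
import time
from contextlib import contextmanager


class TraceRecorder:
    """Stand-in recorder that captures span names, parents, attributes, and durations."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, attributes=None):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attributes": dict(attributes or {}),
        }
        started = time.monotonic()
        self._stack.append(name)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration_ms"] = (time.monotonic() - started) * 1000
            self.spans.append(record)  # spans are appended when they finish


recorder = TraceRecorder()
root_attributes = {"ai.use_case": "credit_policy_reasoning", "ai.budget_tier": "tier_3"}
with recorder.span("ai.reasoning.run", root_attributes) as root:
    with recorder.span("ai.budget.classify"):
        pass
    with recorder.span("ai.verify.citation"):
        pass
    # Attributes known only at the end can be set before the span closes.
    root["attributes"]["ai.final_disposition"] = "human_approved_with_edits"

for s in recorder.spans:
    print(s["name"], "parent:", s["parent"])
```

Because spans are appended on completion, children appear in the list before the root span.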

Anti-patterns and failure modes

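| Anti-pattern | Why it fails | Better design |
|---|---|---|
| One prompt for every risk tier | Low risk wastes cost; high risk lacks controls | Tier by use case, impact, uncertainty |
| More tokens as universal fix | Adds cost and latency without guaranteeing factual correctness | First add evidence, tools, deterministic checks |
| Saving full hidden reasoning for audit | Exposes sensitive drafts and confuses fact with speculation | Save source refs, tool results, verifier outcomes |
| Letting confidence bypass controls | Confidence is not authorization | High-impact actions must pass permission and human gates |
| Verifier only at final output | Errors have already contaminated the context and the conclusion | Verify in layers: input, retrieval, tool, policy, output |
| No stop condition | The system retries until cost runs out of control or it hallucinates | Set latency, cost, evidence exhaustion, policy denial caps |
| Treating abstention as failure | Forces the model to fabricate when evidence is insufficient | Treat abstention/escalation as acceptable results |
| Unobserved reasoning workflows | Cost, latency, and failure causes cannot be reviewed | Record every budget and verifier decision with trace/span |
| User-facing rationale copied from internal scratchpad | May leak safety policies, wrong branches, sensitive information | Generate a separate concise rationale and evidence summary |
| Business owners not involved in tiering | The technical team cannot judge customer impact alone | PM/BA/risk/control owner jointly maintain the policy |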

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

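| Architecture area | Reasoning budget role | Design question |
|---|---|---|
| RAG | Decide whether to retrieve, how many rounds, whether a citation verifier is needed | Which source type must support this claim? |
| Agent | Decide tool calls, planning depth, action permissions, stop conditions | Which tools can be called automatically, and which actions only produce recommendations? |
| Copilot | Decide how much rationale to show employees and when to hand over to a human | Does the user need an answer, a recommendation, sources, or a review packet? |
| Eval | Build golden sets, challenge sets, cost/latency evals per budget tier | Does the high-budget path actually improve quality? |
| Governance | Bring tiering, abstention, human gate, evidence retention into the control library | Who approves a use case entering the Tier 3 controlled decision path? |
| Observability | Record every budget choice, verifier result, cost, and latency | After an incident, can we reconstruct why the system made that recommendation? |
| Model risk | Validate the effect of model changes on tier mix, override, unsupported claim | Does switching models cause fewer high-risk requests to be escalated? |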

ADR draft

Title

Adopt risk-proportional reasoning budget and verifier cascade for high-impact financial retail AI workflows.

Status

Proposed.

Context

Current AI workflows often use a uniform generation path. This creates three problems: low-risk tasks overconsume compute, high-risk tasks lack formal verification and escalation, and audit evidence is inconsistent across use cases. Financial retail workflows such as AML investigation support, credit policy reasoning, payment disputes, complaints RCA, and contact center policy QA require different latency, cost, evidence, and governance profiles.

Decision

We will introduce a reasoning budget policy with four tiers: Fast path, Evidence path, Deliberation path, and Controlled decision path. Budget tier assignment will be based on use case, action type, customer impact, uncertainty, data sensitivity, user role, and channel SLO. High-impact workflows will use decomposition, deterministic tools where available, verifier cascade, abstention/escalation rules, and human approval gates. Runtime evidence will record inputs, source references, tool outputs, policy versions, verifier results, output hashes, and human actions. Hidden model reasoning will not be used as audit evidence or exposed to customers.

Consequences

Positive consequences:

  • Better alignment between cost, latency, risk, and business value.
  • Clearer control design for high-impact AI use cases.
  • Stronger auditability without leaking chain-of-thought.
  • Better operational capacity planning through tier mix and escalation metrics.

Tradeoffs:

  • More orchestration complexity than a single-call model design.
  • Requires ownership of budget policy, verifier maintenance, and evidence retention.
  • High-risk workflows may have higher p95 latency and require human queue capacity.

Alternatives considered

| Alternative | Rejection reason |
|---|---|
| Single prompt with larger token budget | Does not solve evidence, authority, or audit requirements |
| LLM judge as final arbiter | Judge bias and domain limits make it insufficient for regulated decisions |
| Human review for every case | Too slow and expensive, and still lacks machine-readable runtime evidence |
| Fully autonomous agent | Not acceptable for adverse, regulated, or customer-impacting actions |

Decision criteria

| Criterion | Required outcome |
|---|---|
| Customer impact | High-impact actions require human gate |
| Evidence quality | Key claims require source or tool support |
| Cost control | Each tier has max calls, max latency, max spend |
| Governance | Tiering policy is approved and versioned |
| Auditability | Evidence packet supports review without hidden chain-of-thought |

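Interview answer: 30-second, 2-minute, and CTO versions

30-second version

AI reasoning budget means managing test-time compute as a product and architecture policy. Low-risk questions take the fast path; questions that need policy evidence take RAG and citation verification; high-impact scenarios take decomposition, tools, a verifier cascade, and human review. The point is not to show the model's chain-of-thought but to keep auditable evidence: inputs, sources, tool results, policy versions, verification results, and human decisions. In financial retail, this controls cost, latency, customer impact, and regulatory risk at the same time.

2-minute version

I would design the reasoning budget as a four-tier runtime policy. The first tier is the low-risk fast path, suited to simple FAQ and text rewriting. The second is the evidence path, used for tasks such as contact center policy QA that must cite sources. The third is the deliberation path, used for tasks such as initial payment dispute assessment and complaint root-cause analysis that need decomposition and checking. The fourth is the controlled decision path, used for high-risk scenarios such as AML investigation, credit policy boundaries, and explanations of adverse customer impact.

Architecturally, risk and complexity classification comes first, then budget tier selection. A high budget is not simply more tokens; it adds evidence retrieval, deterministic tools, a planner/solver/checker loop, a verifier cascade, abstention, and human review. Verifiers should start with cheap, deterministic checks, covering permissions, field completeness, citation support, calculation tools, and policy rules, and only then reach model evaluation or human experts.

On governance, I would not save hidden chain-of-thought as audit evidence. Audit evidence should be source ids, policy versions, tool outputs, verifier results, model config hash, output hash, human approval and final disposition. This proves the system applied reasonable controls without leaking internal drafts or mistaking model speculation for fact.

CTO version

I would make the reasoning budget a runtime policy layer of the AI platform rather than have every application team write its own prompt retry. The platform provides a unified budget classifier, orchestration contract, verifier registry, tool permission model, OpenTelemetry instrumentation, and evidence store. Each use case registers its risk tier, SLO, allowed tools, verifier cascade, human gate, and retention policy.

On technical decisions, I would put deterministic checks and business rules up front and place the LLM where it is strong: decomposition, summarization, evidence synthesis, and language generation. High-impact tasks use a planner/solver/checker loop, but the checker does not rely on the same context to prove itself correct; it combines a citation verifier, rule engine, calculation tool, policy version check, and human review. On cost, I would establish a per-use-case compute budget, tier mix dashboard, p95 latency, override rate, unsupported claim rate, and cost per resolved case, and use real business outcomes to verify whether the high-budget path is worth it.

On governance, I would map the NIST AI RMF Govern/Map/Measure/Manage approach onto the evidence plane and align with the AI management system responsibilities of ISO/IEC 42001. The launch gate does not ask "how smart is the model" but "which requests enter the high-budget path, which conclusions are blocked or escalated, whether the evidence supports reconstruction, and whether model or policy changes would break these controls".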


7-day practice plan

| Day | Practice | Output |
|---|---|---|
| Day 1 | Pick a financial retail AI use case and draw the fast path, evidence path, deliberation path, controlled path | One-page budget tier map |
| Day 2 | Design a task complexity rubric for the use case, covering customer impact, evidence gap, policy edge, latency | complexity scoring table |
| Day 3 | Design the planner/solver/checker loop, marking the inputs, outputs, tools, and stop conditions of each step | workflow diagram |
| Day 4 | Design the verifier cascade, from permissions, schema, citation, calculation, policy to human review | verifier table |
| Day 5 | Define the evidence object, specifying which fields are saved, hashed, redacted, and retained | evidence schema |
| Day 6 | Design a cost-latency-risk scorecard, covering tier mix, p95 latency, override, unsupported claim | dashboard mock table |
| Day 7 | Write an ADR explaining why to adopt a risk-proportional reasoning budget and how to launch and govern it | ADR draft + interview answer |

Source anchors

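| Source | Link | Idea adopted in this note |
|---|---|---|
| Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | https://arxiv.org/abs/2408.03314 | Test-time compute is an optimizable resource, not a fixed inference cost |
| Self-Consistency Improves Chain of Thought Reasoning in Language Models | https://arxiv.org/abs/2203.11171 | Multi-path reasoning serves as historical background, but enterprises should abstract it into a controlled budget policy |
| Training Verifiers to Solve Math Word Problems | https://arxiv.org/abs/2110.14168 | The verifier idea motivates separating "generating answers" from "checking answers" |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Bring AI risk into the closed loop of Govern, Map, Measure, Manage |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Generative-AI-specific risks need dedicated controls, evaluation, and evidence |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Management system requirements for AI management system responsibility, operational control, and continual improvement |
| OpenTelemetry Documentation | https://opentelemetry.io/docs/ | Use trace, span, metrics, logs to design AI runtime evidence and SLO monitoring |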