返回 Papers
AI 扩展计划 / Playbooks

AI Requirements-to-Eval Cookbook

这些来源作为学习锚点, 不构成法律或合规意见。

540AI_REQUIREMENTS_TO_EVAL_COOKBOOK.md

AI Requirements-to-Eval Cookbook

定位: 面向 AI BA / AI PM / AI Solutions Architect 的可测需求手册。 目标: 把“AI 应该准确、安全、专业、有用”改写成 eval contract、release gate、monitoring 和 incident loop。 使用方式: 每个 AI use case 至少填一张 Requirements-to-Eval Matrix, 并把关键失败模式放入 release gate。


Source Anchors

这些来源作为学习锚点, 不构成法律或合规意见。

AnchorLink用法
NIST AI RMFhttps://www.nist.gov/itl/ai-risk-management-framework将风险管理转成 Govern / Map / Measure / Manage 的 eval 和控制
NIST AI 600-1 GenAI Profilehttps://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence为 GenAI 风险设计测试样本和监控
EU AI Acthttps://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng用 risk-based lens 识别高风险场景、透明度、人类监督和文档需求
ISO/IEC 42001https://www.iso.org/standard/42001用 AI management system 思路管理 eval 生命周期
OWASP LLM Top 10https://owasp.org/www-project-top-10-for-large-language-model-applications/将 prompt injection、data leakage、excessive agency 等转成 red-team cases
G-Evalhttps://arxiv.org/abs/2303.16634参考 LLM-as-Judge 的 rubric-based evaluation 思路
MT-Bench / LLM-as-Judgehttps://arxiv.org/abs/2306.05685理解 LLM judge 的价值和偏差

1. 为什么 AI 需求不能只写 Acceptance Criteria

传统软件需求可以写:

  • 点击提交后创建工单。
  • 金额必须大于 0。
  • 查询结果按时间倒序。

AI 需求常写:

  • 回答要准确。
  • 摘要要完整。
  • 建议要安全。
  • 语气要专业。
  • 不能幻觉。
  • 要遵守政策。

这些需求如果不转成 eval, 实际上不可验收。

AI requirement 的合格形式应该是:

Business requirement
-> expected AI behavior
-> unacceptable behavior
-> test data
-> evaluation method
-> threshold
-> severity
-> owner
-> release gate
-> monitoring signal
-> incident response

一句话:

AI requirements are not done until they are testable, risk-tiered, monitored and owned.


2. Requirements-to-Eval 主流程

flowchart TB
  B[Business outcome] --> W[Workflow point]
  W --> A[AI behavior]
  A --> F[Failure modes]
  F --> D[Data and evidence]
  D --> R[Rubric]
  R --> T[Test cases]
  T --> M[Metrics and threshold]
  M --> G[Release gate]
  G --> O[Production monitoring]
  O --> I[Incident and improvement loop]

Step 1: Business outcome

不要从模型能力开始, 从业务结果开始。

例子:

  • AML: 降低 evidence gathering 时间, 提升 case narrative 完整性。
  • KYC: 缩短 remediation cycle time, 降低重复联系客户。
  • 客服: 提升 first-contact resolution, 降低错误政策回答。
  • 支付: 缩短 exception resolution time, 避免错误修复动作。
  • 信贷: 提升 memo 一致性, 不越过人工信贷决策边界。

Step 2: Workflow point

AI 插入流程的哪个点?

Insert pointExampleRisk
Read查政策、查交易、查历史 case低到中
Summarize摘要证据、客户资料、投诉
Recommend推荐下一步、风险标记中到高
Draft草拟回复、memo、narrative中到高
Decide做最终决定高, 多数金融场景不应交给 LLM
Act调用工具执行动作高, 必须 bounded + approval

Step 3: AI behavior

把“好回答”拆成可评估行为:

  • 是否引用正确证据。
  • 是否覆盖必要字段。
  • 是否说出缺失信息。
  • 是否遵守输出格式。
  • 是否触发人工复核。
  • 是否避免越权承诺。
  • 是否区分事实和推断。

Step 4: Failure modes

常见 AI failure modes:

FailureDescription
Unsupported claim没有证据支持的事实性陈述
Wrong citation引用不支持结论
Missing evidence没指出关键缺失信息
Policy violation违反政策或监管边界
Unauthorized action建议或执行未授权动作
Hallucinated rationale编造看似合理的解释
Over-refusal应该回答但过度拒答
Under-escalation高风险没有升级
Bad tone不适合客户或监管沟通的语气
Data leakage泄露不该展示的数据
Prompt injection follow遵循了文档或用户中的恶意指令

Step 5: Data and evidence

每个 eval case 要包含:

  • user input。
  • workflow state。
  • retrieved evidence。
  • tool observations。
  • expected behavior。
  • unacceptable behavior。
  • severity。
  • reviewer role。

Step 6: Rubric

Rubric 要短、明确、可校准。

示例:

Dimension135
Grounding多处无证据大体有证据但有缺口每个关键事实均有证据
Completeness漏掉关键步骤覆盖主要内容覆盖主线、例外和下一步
Safety有高风险违规有轻微边界问题无违规且正确升级
Usefulness不可执行可部分使用清晰、可执行、适合用户

Step 7: Test cases

按场景类型设计:

  • common cases。
  • edge cases。
  • missing-data cases。
  • policy conflict cases。
  • adversarial cases。
  • high-risk cases。
  • historical failure cases。

Step 8: Threshold

阈值不能只有平均分。高风险场景要设置 hard stop:

  • critical violation = 0。
  • unauthorized action = 0。
  • unsupported high-risk claim = 0。
  • regression critical failures = 0。
  • expert review pass >= defined threshold。

Step 9: Release gate

Release gate 应输出:

  • pass / conditional / fail。
  • failed cases。
  • severity。
  • owner。
  • mitigation。
  • next review。

Step 10: Monitoring and incident loop

上线后继续看:

  • user override。
  • user complaint。
  • expert QA defects。
  • citation failure。
  • unsafe output。
  • latency/cost。
  • adoption。
  • drift。

3. Evaluation Methods

Deterministic checks

适合:

  • JSON schema。
  • forbidden phrase。
  • citation exists。
  • tool action allowed。
  • required fields present。
  • policy version current。

优点: 稳定、便宜、可重复。

限制: 不理解开放式质量。

LLM-as-Judge

适合:

  • completeness。
  • tone。
  • explanation quality。
  • policy compliance 初筛。
  • groundedness 初筛。

必须控制:

  • position bias。
  • verbosity bias。
  • judge version drift。
  • human calibration。

Expert review

适合:

  • AML typology。
  • lending / fair lending。
  • wealth suitability。
  • regulatory impact。
  • high severity failures。

限制:

  • 成本高。
  • 速度慢。
  • 需要 calibration。

Shadow mode

适合:

  • 高风险系统上线前。
  • 与人工决策对照。
  • 观察 false positive / false negative。

Production monitoring

适合:

  • 持续质量管理。
  • prompt/model/index 变更后观察。
  • adoption and trust。

4. Severity Levels

SeverityMeaningExampleGate
S0 Critical可能造成严重客户/合规/财务风险LLM 给出最终拒贷决定release blocked
S1 High高风险错误, 需立即修复AML narrative 无证据指控客户release blocked or limited
S2 Medium影响质量或效率漏掉一个非关键字段fix before scale
S3 Low轻微表达或格式问题语气不够简洁backlog

5. Reusable Requirements-to-Eval Matrix

RequirementExpected behaviorFailure modeEval methodTest dataThresholdSeverityOwnerGate
deterministic / judge / expertS0-S3discovery/pilot/release

Minimum fields:

  • requirement id。
  • business owner。
  • risk owner。
  • data source。
  • eval owner。
  • release threshold。
  • monitoring signal。

6. Bad Requirements -> Eval-Ready Requirements

Bad requirementWhy weakEval-ready rewrite
AI should answer accurately无法测试For policy questions, answer must cite current approved policy section; unsupported factual claims in high-risk answers must be 0
AI should summarize AML cases未定义完整性Summary must include customer profile, transaction timeline, red flags, missing evidence and evidence IDs
AI should recommend next best action风险边界不明For payment exceptions, AI may recommend allowed next actions but cannot execute repair without approval
AI should be compliant太泛Output must not provide personalized investment advice unless advisor review path is triggered
AI should understand KYC不可验收Detect missing required KYC fields by jurisdiction/product with >= target recall and no unauthorized outreach
AI should reduce manual work价值不可测Reduce average evidence gathering time by target % in pilot without increasing QA defect rate

7. Financial Retail Examples

AML Copilot

RequirementEval
Narrative must cite evidencecitation precision check + expert sample
Must not decide SAR filingforbidden final decision check
Must cover red flagschecklist recall
Must show missing evidencemissing-data cases
Must resist injected adverse-media instructionsred-team prompt injection

Release gate:

  • critical unsupported claim = 0。
  • final SAR decision suggestion = 0。
  • evidence citation threshold met。
  • expert QA accepts pilot sample。

KYC Remediation

RequirementEval
Detect missing fieldshistorical remediation cases
Draft approved outreachpolicy and tone judge
Respect jurisdictionjurisdiction-specific gold cases
No unauthorized document requestdeterministic + expert review
No golden source update without approvalworkflow state check

Customer Service RAG

RequirementEval
Answer with current policypolicy version check
Cite sourcecitation coverage
No unsupported fee waiverforbidden commitment check
Escalate complaint/riskescalation test cases
Say unknown when evidence missingmissing evidence cases

Payments Exception Agent

RequirementEval
Interpret return code correctlydeterministic code set
Explain root causeexpert review
Recommend allowed actionsaction allowlist check
Require approval for write actionsworkflow gate
Handle tool failuretool failure tests

Lending Assistant

RequirementEval
Separate calculations from prosedeterministic calculation check
Cite policycitation check
Suggest reason codes safelycompliance expert review
Trigger human decisionworkflow gate
Avoid protected/proxy factorsfairness review

Wealth Compliance Guardrail

RequirementEval
Detect personalized adviceclassifier + expert review
Escalate to licensed advisorworkflow gate
Use approved product factsRAG citation
Block prohibited languagedeterministic check
Provide compliant rewriteLLM judge + compliance sample

Regulatory Change Impact

RequirementEval
Extract obligationslegal/compliance review
Map to capability/process/systemBIAN/TOGAF review
Cite regulation sectioncitation check
Generate owner backlogreviewer acceptance
Flag uncertaintymissing/conflict cases

8. Risk-Tiered Release Gates

Risk tierExampleRequired eval
Lowinternal product knowledge RAGdeterministic + LLM judge sample
Mediumcustomer service agent assistdeterministic + LLM judge + QA sample
HighAML/lending/wealth decision supportexpert review + HITL + audit + strict gate
Criticalautonomous customer-impacting decisionsusually no-go for LLM-only systems

9. Red-Team Cases

Include at least:

  • Prompt injection in retrieved docs。
  • Conflicting policy versions。
  • Missing evidence。
  • Unauthorized user。
  • Sensitive data request。
  • High-risk advice request。
  • Tool unavailable。
  • User asks to bypass approval。
  • Old case with wrong historical label。
  • Ambiguous customer complaint。

10. Ownership and RACI

ActivityPMBAArchitectEvalOpsRisk/ComplianceOps
Define business outcomeARCCCC
Map workflowCA/RCICR
Define requirementsARCCCC
Build eval datasetCRCA/RCC
Set release thresholdACCRA/RC
Approve high-risk casesCCCCA/RC
Monitor productionACCRCR
Incident reviewARCRA/RR

11. Interview Talking Points

How do you make AI requirements testable?

30-second answer:

I convert each AI requirement into an eval contract: expected behavior, failure modes, test cases, rubric, threshold, severity, owner, release gate and monitoring signal. For financial services I separate deterministic checks, LLM judge and expert review, and I require zero critical failures for high-risk outputs.

What is the difference between acceptance criteria and eval?

Answer:

Acceptance criteria often describe what should happen. Eval defines how we will measure whether probabilistic behavior is good enough across common, edge, adversarial and high-risk cases, and how failures affect release.

How do you evaluate an AML Copilot?

Answer:

I would measure evidence recall, citation precision, red-flag coverage, missing evidence detection, unsupported claim rate, and whether it avoids final SAR decisions. High-risk samples require expert review and human approval.


12. Exercises

Exercise 1: Rewrite bad requirements

Rewrite:

  • AI should be accurate。
  • AI should help AML analysts。
  • AI should answer customer questions。
  • AI should recommend payment repair。
  • AI should support lending decisions。

For each, define:

  • expected behavior。
  • unacceptable behavior。
  • eval method。
  • threshold。
  • severity。

Exercise 2: Build 20-case golden set

For customer service RAG:

  • 10 common cases。
  • 3 missing evidence cases。
  • 3 policy conflict cases。
  • 2 prompt injection cases。
  • 2 escalation cases。

Exercise 3: Release memo

Write:

Use case:
Eval result:
Critical failures:
Risk acceptance:
Release decision:
Conditions:
Owner:
Next review:

13. Connections

Existing assetUse
docs/abpa/templates/04-requirements-to-eval-matrix.mdFill matrix
docs/ai-foundations/papers/08-llm-as-judge-evaluation.mdDesign judge/rubric
docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.mdUse gates
docs/AI_CONTEXT_ENGINEERING_PLAYBOOK.mdTurn context requirements into eval cases
docs/AI_GOVERNANCE_EVALOPS_RISK_90_PLAN.mdDeep governance practice
docs/FINANCIAL_RETAIL_AI_CASE_PORTFOLIO.mdCase examples

14. Final Rule

An AI requirement is not ready if it cannot answer:

What should happen?
What must never happen?
How will we test it?
What data proves it?
What threshold gates release?
Who owns failures?
How will production drift be detected?