AI Requirements-to-Eval Cookbook
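Positioning: a handbook of testable requirements for AI BAs, AI PMs, and AI Solutions Architects. Goal: rewrite "AI should be accurate, safe, professional, and useful" as an eval contract, release gate, monitoring, and incident loop. How to use: fill in at least one Requirements-to-Eval Matrix for each AI use case, and put the key failure modes into the release gate.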
Source Anchors
These sources serve as learning anchors; they do not constitute legal or compliance advice.
| Anchor | Link | Use |
|---|---|---|
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Turn risk management into evals and controls across Govern / Map / Measure / Manage |
| NIST AI 600-1 GenAI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Design test samples and monitoring for GenAI risks |
| EU AI Act | https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng | Use a risk-based lens to identify high-risk scenarios, transparency, human oversight, and documentation needs |
| ISO/IEC 42001 | https://www.iso.org/standard/42001 | Manage the eval lifecycle with an AI management system mindset |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Turn prompt injection, data leakage, excessive agency, and similar risks into red-team cases |
| G-Eval | https://arxiv.org/abs/2303.16634 | Reference for rubric-based evaluation with LLM-as-Judge |
| MT-Bench / LLM-as-Judge | https://arxiv.org/abs/2306.05685 | Understand the value and biases of LLM judges |
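1. Why AI Requirements Cannot Stop at Acceptance Criteria
Traditional software requirements can be written as:
- Clicking Submit creates a ticket.
- The amount must be greater than 0.
- Query results are sorted newest first.
AI requirements are often written as:
- Answers must be accurate.
- Summaries must be complete.
- Recommendations must be safe.
- The tone must be professional.
- No hallucinations.
- Must comply with policy.
Unless these requirements are converted into evals, they cannot actually be verified at acceptance.
A well-formed AI requirement should take this form: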
Business requirement
-> expected AI behavior
-> unacceptable behavior
-> test data
-> evaluation method
-> threshold
-> severity
-> owner
-> release gate
-> monitoring signal
-> incident response
In one sentence:
AI requirements are not done until they are testable, risk-tiered, monitored and owned.
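The chain above can be captured as a record with one field per link. The following is a minimal sketch, not a prescribed schema: the class name `EvalContract`, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class EvalContract:
    """One AI requirement as a testable contract (hypothetical schema)."""
    business_requirement: str = ""
    expected_behavior: str = ""
    unacceptable_behavior: str = ""
    test_data: str = ""
    evaluation_method: str = ""
    threshold: str = ""
    severity: str = ""
    owner: str = ""
    release_gate: str = ""
    monitoring_signal: str = ""
    incident_response: str = ""

    def missing_fields(self) -> list[str]:
        """Return the links of the chain that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_ready(self) -> bool:
        """The requirement is ready only when every link is filled in."""
        return not self.missing_fields()


# Illustrative draft: only the first three links are filled in.
draft = EvalContract(
    business_requirement="Answer policy questions for service agents",
    expected_behavior="Cite the current approved policy section",
    unacceptable_behavior="Unsupported factual claim in a high-risk answer",
)
print(draft.is_ready())
print(draft.missing_fields())
```

A draft like this one is not ready for a release gate until the remaining links, from test data through incident response, are filled in.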
2. Requirements-to-Eval Main Flow
flowchart TB
B[Business outcome] --> W[Workflow point]
W --> A[AI behavior]
A --> F[Failure modes]
F --> D[Data and evidence]
D --> R[Rubric]
R --> T[Test cases]
T --> M[Metrics and threshold]
M --> G[Release gate]
G --> O[Production monitoring]
O --> I[Incident and improvement loop]
Step 1: Business outcome
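Start from the business outcome, not from model capability.
Examples:
- AML: reduce evidence-gathering time and improve case narrative completeness.
- KYC: shorten remediation cycle time and reduce repeat customer contact.
- Customer service: improve first-contact resolution and reduce incorrect policy answers.
- Payments: shorten exception resolution time and avoid incorrect repair actions.
- Lending: improve memo consistency without crossing the human credit-decision boundary.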
Step 2: Workflow point
At which point in the workflow is the AI inserted?
| Insert point | Example | Risk |
|---|---|---|
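| Read | Look up policies, transactions, historical cases | Low to medium |
| Summarize | Summarize evidence, customer profiles, complaints | Medium |
| Recommend | Recommend next steps, flag risks | Medium to high |
| Draft | Draft replies, memos, narratives | Medium to high |
| Decide | Make the final decision | High; in most financial scenarios this should not be handed to an LLM |
| Act | Call tools to execute actions | High; must be bounded and require approval |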
Step 3: AI behavior
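Break "a good answer" into evaluable behaviors:
- Does it cite the correct evidence?
- Does it cover the required fields?
- Does it state what information is missing?
- Does it follow the output format?
- Does it trigger human review?
- Does it avoid commitments beyond its authority?
- Does it distinguish facts from inferences?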
Step 4: Failure modes
Common AI failure modes:
| Failure | Description |
|---|---|
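| Unsupported claim | A factual statement with no supporting evidence |
| Wrong citation | The cited source does not support the conclusion |
| Missing evidence | Fails to point out key missing information |
| Policy violation | Violates policy or regulatory boundaries |
| Unauthorized action | Recommends or executes an unauthorized action |
| Hallucinated rationale | Fabricates a plausible-sounding explanation |
| Over-refusal | Refuses when it should have answered |
| Under-escalation | Fails to escalate a high-risk case |
| Bad tone | Tone unsuitable for customer or regulator communication |
| Data leakage | Exposes data that should not be shown |
| Prompt injection follow | Follows malicious instructions embedded in documents or user input |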
Step 5: Data and evidence
Each eval case should include:
- user input.
- workflow state.
- retrieved evidence.
- tool observations.
- expected behavior.
- unacceptable behavior.
- severity.
- reviewer role.
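The field list above can be enforced before a case enters the eval dataset. This is a minimal sketch that assumes cases are stored as JSON objects; the snake_case key names and the sample case are invented for illustration.

```python
import json

# Hypothetical key names mirroring the eight fields listed above.
REQUIRED_KEYS = {
    "user_input", "workflow_state", "retrieved_evidence", "tool_observations",
    "expected_behavior", "unacceptable_behavior", "severity", "reviewer_role",
}


def validate_case(case: dict) -> list[str]:
    """Return the sorted names of required keys absent from the case.

    Only presence is checked: an empty list of tool observations is valid.
    """
    return sorted(REQUIRED_KEYS - case.keys())


# Illustrative case with no reviewer role assigned yet.
raw_case = json.dumps({
    "user_input": "Can you waive my late fee?",
    "workflow_state": "complaint_open",
    "retrieved_evidence": ["fee-policy-v3#s2"],
    "tool_observations": [],
    "expected_behavior": "Explain the policy and escalate the waiver request",
    "unacceptable_behavior": "Promise a fee waiver",
    "severity": "S1",
})
case = json.loads(raw_case)
print(validate_case(case))
```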
Step 6: Rubric
A rubric should be short, explicit, and calibratable.
Example:
| Dimension | 1 | 3 | 5 |
|---|---|---|---|
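| Grounding | Multiple claims lack evidence | Mostly supported, with gaps | Every key fact is supported by evidence |
| Completeness | Misses key steps | Covers the main content | Covers the main path, exceptions, and next steps |
| Safety | High-risk violation present | Minor boundary issue | No violations, and escalates correctly |
| Usefulness | Not actionable | Partially usable | Clear, actionable, and suited to the user |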
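Rubric scores are easier to calibrate when they are summarized per dimension rather than as a single number. The sketch below assumes each case is scored with an integer from 1 to 5 on the four dimensions in the table; the function name and sample scores are illustrative.

```python
from statistics import mean

DIMENSIONS = ("grounding", "completeness", "safety", "usefulness")


def summarize(scored_cases: list[dict[str, int]]) -> dict[str, dict[str, float]]:
    """Report the mean and the worst score for each rubric dimension."""
    summary = {}
    for dim in DIMENSIONS:
        values = [case[dim] for case in scored_cases]
        if any(v not in range(1, 6) for v in values):
            raise ValueError(f"{dim}: scores must be integers from 1 to 5")
        summary[dim] = {"mean": mean(values), "min": min(values)}
    return summary


# Illustrative scores: the second case contains a safety failure.
scores = [
    {"grounding": 5, "completeness": 3, "safety": 5, "usefulness": 5},
    {"grounding": 5, "completeness": 5, "safety": 1, "usefulness": 3},
]
summary = summarize(scores)
print(summary["safety"])
```

Reporting the minimum next to the mean keeps a single safety score of 1 visible even when the average looks acceptable.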
Step 7: Test cases
Design test cases by scenario type:
- common cases.
- edge cases.
- missing-data cases.
- policy conflict cases.
- adversarial cases.
- high-risk cases.
- historical failure cases.
Step 8: Threshold
A threshold cannot be just an average score. High-risk scenarios need hard stops:
- critical violation = 0.
- unauthorized action = 0.
- unsupported high-risk claim = 0.
- regression critical failures = 0.
- expert review pass >= defined threshold.
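The hard stops above can be checked independently of any average score. This is a minimal sketch: the counter names, the function signature, and the 0.90 pass-rate floor in the example are assumptions, not recommended values.

```python
# Hypothetical counter names for the four zero-tolerance conditions above.
HARD_STOPS = (
    "critical_violation",
    "unauthorized_action",
    "unsupported_high_risk_claim",
    "regression_critical_failure",
)


def check_thresholds(counts: dict[str, int], expert_pass_rate: float,
                     min_expert_pass_rate: float) -> list[str]:
    """Return the reasons a run fails; an empty list means all thresholds hold."""
    reasons = []
    for name in HARD_STOPS:
        if name not in counts:
            # A counter that was never reported cannot be assumed to be zero.
            reasons.append(f"{name}: not reported")
        elif counts[name] > 0:
            reasons.append(f"{name} = {counts[name]} (must be 0)")
    if expert_pass_rate < min_expert_pass_rate:
        reasons.append(
            f"expert pass rate {expert_pass_rate:.2f} < {min_expert_pass_rate:.2f}"
        )
    return reasons


# Illustrative run: expert review passes, but one unauthorized action occurred.
run_counts = {
    "critical_violation": 0,
    "unauthorized_action": 1,
    "unsupported_high_risk_claim": 0,
    "regression_critical_failure": 0,
}
reasons = check_thresholds(run_counts, expert_pass_rate=0.95, min_expert_pass_rate=0.90)
print(reasons)
```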
Step 9: Release gate
A release gate should output:
- pass / conditional / fail.
- failed cases.
- severity.
- owner.
- mitigation.
- next review.
Step 10: Monitoring and incident loop
After launch, keep watching:
- user override.
- user complaint.
- expert QA defects.
- citation failure.
- unsafe output.
- latency/cost.
- adoption.
- drift.
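One simple way to turn these signals into alerts is to compare current rates against a baseline recorded at release. The sketch below covers only signals where lower is better, so adoption would need the opposite comparison. The signal names, the rates, and the 25% tolerance are assumptions.

```python
def flag_signals(baseline: dict[str, float], current: dict[str, float],
                 max_relative_increase: float = 0.25) -> list[str]:
    """Flag rates that rose more than the allowed ratio above baseline.

    A signal missing from the current window is flagged too, because a
    silent monitor is itself a problem.
    """
    flagged = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or now > base * (1 + max_relative_increase):
            flagged.append(name)
    return flagged


# Illustrative rates: only the override rate moved.
baseline = {"user_override_rate": 0.10, "citation_failure_rate": 0.02,
            "unsafe_output_rate": 0.0}
current = {"user_override_rate": 0.18, "citation_failure_rate": 0.02,
           "unsafe_output_rate": 0.0}
flagged = flag_signals(baseline, current)
print(flagged)
```

With a baseline of zero, as for unsafe output here, any non-zero rate is flagged, which matches the zero-tolerance treatment of critical failures.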
3. Evaluation Methods
Deterministic checks
Suitable for:
- JSON schema.
- forbidden phrase.
- citation exists.
- tool action allowed.
- required fields present.
- policy version current.
Strengths: stable, cheap, repeatable.
Limitation: cannot assess open-ended quality.
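Several of these checks fit in a few lines of standard-library code. The sketch below assumes the model returns a JSON object with `answer` and `citations` keys and that "citation exists" means every cited ID appears in the retrieved evidence. Both are assumptions about the output format, and all names are illustrative.

```python
import json


def run_checks(raw_output: str, required_fields: set[str],
               forbidden_phrases: list[str], known_evidence_ids: set[str]) -> list[str]:
    """Run deterministic checks on one model output and return the failures."""
    try:
        output = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(output, dict):
        return ["output is not a JSON object"]

    failures = []
    missing = sorted(required_fields - output.keys())
    if missing:
        failures.append(f"missing fields: {missing}")

    answer = str(output.get("answer", "")).lower()
    for phrase in forbidden_phrases:
        if phrase.lower() in answer:
            failures.append(f"forbidden phrase: {phrase!r}")

    # Assumes citations is a list of evidence ID strings.
    unknown = sorted(set(output.get("citations", [])) - known_evidence_ids)
    if unknown:
        failures.append(f"unknown citations: {unknown}")
    return failures


# Illustrative output that fails all three checks.
sample = '{"answer": "We will waive the fee.", "citations": ["POL-9"]}'
failures = run_checks(
    sample,
    required_fields={"answer", "citations", "policy_version"},
    forbidden_phrases=["we will waive"],
    known_evidence_ids={"POL-1", "POL-2"},
)
print(failures)
```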
LLM-as-Judge
Suitable for:
- completeness.
- tone.
- explanation quality.
- policy compliance pre-screening.
- groundedness pre-screening.
Must manage:
- position bias.
- verbosity bias.
- judge version drift.
- human calibration.
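Position bias in pairwise judging is commonly handled by asking the judge twice with the answers swapped and keeping the verdict only when both runs agree. The sketch below uses stub judges in place of a real model call; the `Judge` signature and return labels are assumptions.

```python
from typing import Callable

# Hypothetical judge interface: (question, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge both orderings; return "a", "b", "tie", or "inconsistent"."""
    forward = judge(question, a, b)
    backward = judge(question, b, a)
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")
    return forward_pick if forward_pick == backward_pick else "inconsistent"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Stub judge with pure verbosity bias."""
    return "first" if len(first) > len(second) else "second"


short_answer = "Fee waivers need supervisor approval (policy s2)."
long_answer = "Thank you so much for asking. " * 3 + "Fees can sometimes be waived."
print(debiased_preference(always_first, "Can the fee be waived?", short_answer, long_answer))
print(debiased_preference(longer_wins, "Can the fee be waived?", short_answer, long_answer))
```

The swap exposes the position-biased stub as inconsistent, but the verbosity-biased stub still picks the longer answer consistently, which is why verbosity bias and human calibration need their own controls.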
Expert review
Suitable for:
- AML typology.
- lending / fair lending.
- wealth suitability.
- regulatory impact.
- high severity failures.
Limitations:
- High cost.
- Slow.
- Requires calibration.
Shadow mode
Suitable for:
- High-risk systems before launch.
- Comparison against human decisions.
- Observing false positives / false negatives.
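In shadow mode the AI output is recorded but not acted on, so it can be compared with the human decision on the same case. This minimal sketch treats the human decision as the reference, which is an assumption since human labels can also be wrong. The sample pairs are invented.

```python
def shadow_confusion(pairs: list[tuple[bool, bool]]) -> dict[str, int]:
    """Count agreement between AI and human on (ai_flagged, human_flagged) pairs."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for ai_flagged, human_flagged in pairs:
        if ai_flagged and human_flagged:
            counts["tp"] += 1
        elif ai_flagged:
            counts["fp"] += 1  # AI flagged, human did not
        elif human_flagged:
            counts["fn"] += 1  # human flagged, AI missed it
        else:
            counts["tn"] += 1
    return counts


# Illustrative shadow log of five cases.
log = [(True, True), (True, False), (False, True), (False, False), (True, True)]
confusion = shadow_confusion(log)
print(confusion)
```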
Production monitoring
Suitable for:
- Continuous quality management.
- Observation after prompt/model/index changes.
- adoption and trust.
4. Severity Levels
| Severity | Meaning | Example | Gate |
|---|---|---|---|
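| S0 Critical | May cause severe customer, compliance, or financial risk | LLM issues a final loan denial | release blocked |
| S1 High | High-risk error that needs an immediate fix | AML narrative accuses a customer without evidence | release blocked or limited |
| S2 Medium | Affects quality or efficiency | Misses one non-critical field | fix before scale |
| S3 Low | Minor wording or formatting issue | Tone is not concise enough | backlog |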
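The severity table maps directly to a lookup in which the most severe open failure determines the gate outcome. This is a sketch of one possible encoding; the enum, the function name, and the "no open failures" label are assumptions.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower value means more severe, matching S0 to S3 in the table."""
    S0 = 0
    S1 = 1
    S2 = 2
    S3 = 3


GATE_ACTION = {
    Severity.S0: "release blocked",
    Severity.S1: "release blocked or limited",
    Severity.S2: "fix before scale",
    Severity.S3: "backlog",
}


def gate_action(open_failures: list[Severity]) -> str:
    """Return the gate action implied by the most severe open failure."""
    if not open_failures:
        return "no open failures"
    return GATE_ACTION[min(open_failures)]


print(gate_action([Severity.S3, Severity.S1, Severity.S2]))
```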
5. Reusable Requirements-to-Eval Matrix
| Requirement | Expected behavior | Failure mode | Eval method | Test data | Threshold | Severity | Owner | Gate |
|---|---|---|---|---|---|---|---|---|
| | | | deterministic / judge / expert | | | S0-S3 | | discovery/pilot/release |
Minimum fields:
- requirement id.
- business owner.
- risk owner.
- data source.
- eval owner.
- release threshold.
- monitoring signal.
6. Bad Requirements -> Eval-Ready Requirements
| Bad requirement | Why weak | Eval-ready rewrite |
|---|---|---|
| AI should answer accurately | Not testable | For policy questions, answer must cite current approved policy section; unsupported factual claims in high-risk answers must be 0 |
| AI should summarize AML cases | Completeness is undefined | Summary must include customer profile, transaction timeline, red flags, missing evidence and evidence IDs |
| AI should recommend next best action | Risk boundary is unclear | For payment exceptions, AI may recommend allowed next actions but cannot execute repair without approval |
| AI should be compliant | Too broad | Output must not provide personalized investment advice unless advisor review path is triggered |
| AI should understand KYC | Not verifiable at acceptance | Detect missing required KYC fields by jurisdiction/product with >= target recall and no unauthorized outreach |
| AI should reduce manual work | Value is not measurable | Reduce average evidence gathering time by target % in pilot without increasing QA defect rate |
7. Financial Retail Examples
AML Copilot
| Requirement | Eval |
|---|---|
| Narrative must cite evidence | citation precision check + expert sample |
| Must not decide SAR filing | forbidden final decision check |
| Must cover red flags | checklist recall |
| Must show missing evidence | missing-data cases |
| Must resist injected adverse-media instructions | red-team prompt injection |
Release gate:
- critical unsupported claim = 0.
- final SAR decision suggestion = 0.
- evidence citation threshold met.
- expert QA accepts pilot sample.
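Two of the AML evals above, citation precision and checklist recall, reduce to set arithmetic once reviewers have labeled the case. The sketch below assumes reviewers supply the set of evidence IDs that support the narrative and the set of red flags expected for the case; all IDs and flag names are invented.

```python
def citation_precision(cited: list[str], supporting: set[str]) -> float:
    """Share of cited evidence IDs that reviewers marked as supporting."""
    if not cited:
        return 0.0  # a narrative with no citations earns no precision credit
    return sum(1 for evidence_id in cited if evidence_id in supporting) / len(cited)


def checklist_recall(expected_flags: set[str], mentioned_flags: set[str]) -> float:
    """Share of expected red flags that the narrative mentions."""
    if not expected_flags:
        return 1.0  # nothing was expected, so nothing was missed
    return len(expected_flags & mentioned_flags) / len(expected_flags)


# Illustrative reviewer labels for one case.
precision = citation_precision(
    cited=["TXN-101", "TXN-102", "NEWS-7", "TXN-109"],
    supporting={"TXN-101", "TXN-102", "TXN-109"},
)
recall = checklist_recall(
    expected_flags={"structuring", "rapid_movement", "high_risk_geo"},
    mentioned_flags={"structuring", "high_risk_geo", "pep_link"},
)
print(precision, round(recall, 2))
```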
KYC Remediation
| Requirement | Eval |
|---|---|
| Detect missing fields | historical remediation cases |
| Draft approved outreach | policy and tone judge |
| Respect jurisdiction | jurisdiction-specific gold cases |
| No unauthorized document request | deterministic + expert review |
| No golden source update without approval | workflow state check |
Customer Service RAG
| Requirement | Eval |
|---|---|
| Answer with current policy | policy version check |
| Cite source | citation coverage |
| No unsupported fee waiver | forbidden commitment check |
| Escalate complaint/risk | escalation test cases |
| Say unknown when evidence missing | missing evidence cases |
Payments Exception Agent
| Requirement | Eval |
|---|---|
| Interpret return code correctly | deterministic code set |
| Explain root cause | expert review |
| Recommend allowed actions | action allowlist check |
| Require approval for write actions | workflow gate |
| Handle tool failure | tool failure tests |
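The allowlist check and the approval gate for write actions can be combined in one authorization step that runs before any tool call. This is a minimal sketch; the action names and the read/write labels are hypothetical and not taken from any payment system.

```python
from typing import Optional

# Hypothetical allowlist: action name -> whether it only reads or also writes.
ALLOWED_ACTIONS = {
    "lookup_return_code": "read",
    "fetch_payment_history": "read",
    "resubmit_payment": "write",
}


def authorize(action: str, approved_by: Optional[str] = None) -> str:
    """Decide whether the agent may run the action right now."""
    kind = ALLOWED_ACTIONS.get(action)
    if kind is None:
        return "rejected: not on allowlist"
    if kind == "write" and not approved_by:
        return "pending: approval required"
    return "allowed"


print(authorize("lookup_return_code"))
print(authorize("resubmit_payment"))
print(authorize("resubmit_payment", approved_by="ops-lead"))
print(authorize("delete_ledger_entry"))
```

Eval cases for this table can then assert the decision for each action, including the case where a user asks the agent to skip approval.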
Lending Assistant
| Requirement | Eval |
|---|---|
| Separate calculations from prose | deterministic calculation check |
| Cite policy | citation check |
| Suggest reason codes safely | compliance expert review |
| Trigger human decision | workflow gate |
| Avoid protected/proxy factors | fairness review |
Wealth Compliance Guardrail
| Requirement | Eval |
|---|---|
| Detect personalized advice | classifier + expert review |
| Escalate to licensed advisor | workflow gate |
| Use approved product facts | RAG citation |
| Block prohibited language | deterministic check |
| Provide compliant rewrite | LLM judge + compliance sample |
Regulatory Change Impact
| Requirement | Eval |
|---|---|
| Extract obligations | legal/compliance review |
| Map to capability/process/system | BIAN/TOGAF review |
| Cite regulation section | citation check |
| Generate owner backlog | reviewer acceptance |
| Flag uncertainty | missing/conflict cases |
8. Risk-Tiered Release Gates
| Risk tier | Example | Required eval |
|---|---|---|
| Low | internal product knowledge RAG | deterministic + LLM judge sample |
| Medium | customer service agent assist | deterministic + LLM judge + QA sample |
| High | AML/lending/wealth decision support | expert review + HITL + audit + strict gate |
| Critical | autonomous customer-impacting decisions | usually no-go for LLM-only systems |
9. Red-Team Cases
Include at least:
- Prompt injection in retrieved docs.
- Conflicting policy versions.
- Missing evidence.
- Unauthorized user.
- Sensitive data request.
- High-risk advice request.
- Tool unavailable.
- User asks to bypass approval.
- Old case with wrong historical label.
- Ambiguous customer complaint.
10. Ownership and RACI
| Activity | PM | BA | Architect | EvalOps | Risk/Compliance | Ops |
|---|---|---|---|---|---|---|
| Define business outcome | A | R | C | C | C | C |
| Map workflow | C | A/R | C | I | C | R |
| Define requirements | A | R | C | C | C | C |
| Build eval dataset | C | R | C | A/R | C | C |
| Set release threshold | A | C | C | R | A/R | C |
| Approve high-risk cases | C | C | C | C | A/R | C |
| Monitor production | A | C | C | R | C | R |
| Incident review | A | R | C | R | A/R | R |
11. Interview Talking Points
How do you make AI requirements testable?
30-second answer:
I convert each AI requirement into an eval contract: expected behavior, failure modes, test cases, rubric, threshold, severity, owner, release gate and monitoring signal. For financial services I separate deterministic checks, LLM judge and expert review, and I require zero critical failures for high-risk outputs.
What is the difference between acceptance criteria and eval?
Answer:
Acceptance criteria often describe what should happen. Eval defines how we will measure whether probabilistic behavior is good enough across common, edge, adversarial and high-risk cases, and how failures affect release.
How do you evaluate an AML Copilot?
Answer:
I would measure evidence recall, citation precision, red-flag coverage, missing evidence detection, unsupported claim rate, and whether it avoids final SAR decisions. High-risk samples require expert review and human approval.
12. Exercises
Exercise 1: Rewrite bad requirements
Rewrite:
- AI should be accurate.
- AI should help AML analysts.
- AI should answer customer questions.
- AI should recommend payment repair.
- AI should support lending decisions.
For each, define:
- expected behavior.
- unacceptable behavior.
- eval method.
- threshold.
- severity.
Exercise 2: Build 20-case golden set
For customer service RAG:
- 10 common cases.
- 3 missing evidence cases.
- 3 policy conflict cases.
- 2 prompt injection cases.
- 2 escalation cases.
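A quick way to check a draft golden set against this mix is to count cases by category. The sketch assumes each case carries a category label; the label names are illustrative.

```python
from collections import Counter

# Target mix from the exercise: 20 cases in total.
TARGET_MIX = {
    "common": 10,
    "missing_evidence": 3,
    "policy_conflict": 3,
    "prompt_injection": 2,
    "escalation": 2,
}


def composition_gaps(case_types: list[str]) -> dict[str, int]:
    """Return target minus actual for every category that is off target.

    Positive means cases are missing; negative means surplus or an
    unexpected category.
    """
    actual = Counter(case_types)
    gaps = {}
    for name in sorted(TARGET_MIX.keys() | actual.keys()):
        gap = TARGET_MIX.get(name, 0) - actual[name]
        if gap != 0:
            gaps[name] = gap
    return gaps


# Illustrative draft set that is one policy conflict case short.
draft_set = (["common"] * 10 + ["missing_evidence"] * 3 + ["policy_conflict"] * 2
             + ["prompt_injection"] * 2 + ["escalation"] * 2)
print(composition_gaps(draft_set))
```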
Exercise 3: Release memo
Write:
Use case:
Eval result:
Critical failures:
Risk acceptance:
Release decision:
Conditions:
Owner:
Next review:
13. Connections
| Existing asset | Use |
|---|---|
| docs/abpa/templates/04-requirements-to-eval-matrix.md | Fill matrix |
| docs/ai-foundations/papers/08-llm-as-judge-evaluation.md | Design judge/rubric |
| docs/AI_ARCHITECTURE_REVIEW_GATE_CHECKLISTS.md | Use gates |
| docs/AI_CONTEXT_ENGINEERING_PLAYBOOK.md | Turn context requirements into eval cases |
| docs/AI_GOVERNANCE_EVALOPS_RISK_90_PLAN.md | Deep governance practice |
| docs/FINANCIAL_RETAIL_AI_CASE_PORTFOLIO.md | Case examples |
14. Final Rule
An AI requirement is not ready if it cannot answer:
What should happen?
What must never happen?
How will we test it?
What data proves it?
What threshold gates release?
Who owns failures?
How will production drift be detected?