AI Model Validation / Independent Challenge Playbook
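Positioning: A hands-on AI system validation playbook for senior practitioners beyond CBAP level: AI PM, BA, Product Architect, Solutions Architect, Model Risk, Validation and AI Governance.
Goal: Upgrade traditional model validation to GenAI system validation, covering model, prompt, RAG, tool, workflow, HITL, eval, monitoring, business outcome and revalidation trigger.
Core view: For a high-impact AI system, the object of validation is not a model benchmark but a socio-technical system operating within a specific business use, version boundary, set of data dependencies, human controls, operational monitoring and risk appetite.
Important note: This document is learning, portfolio and governance design material. It is not legal advice, a compliance opinion, an audit opinion, a model validation conclusion or a regulatory interpretation. Formal projects must be confirmed jointly by Legal, Compliance, Model Risk, Internal Audit, Security, Privacy, the Data Owner, business management and the applicable regulatory relationships.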
1. Purpose / Audience / Core Views
1.1 Purpose
This playbook addresses one advanced question:
How do we validate an AI system, not just benchmark a model?
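In financial retail scenarios, AI is no longer just an offline model. A customer service copilot, credit policy RAG, AML case narrative assistant, complaint handling agent or wealth advisor assistant typically includes:
- A foundation model or fine-tuned model.
- System prompt, developer prompt, policy prompt and output schema.
- RAG retriever, embedding model, reranker, chunking, source repository and index refresh.
- Tool / API / workflow actions, such as querying an account, creating a case, drafting a notice or updating a CRM field.
- Human-in-the-loop or human-on-the-loop controls.
- Eval contract, red-team cases, production monitoring, incident response and change management.
- Business outcomes, such as handling time, review quality, customer impact, complaint rate, compliance defects and operating cost.
This document aims to train you to turn these objects into: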
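| Artifact | Purpose |
|---|---|
| AI system inventory | Identifies the validation boundary, owner, version, use, risk tier and dependencies |
| Validation plan | Defines validation objectives, scope, methods, samples, thresholds, evidence and gate decision |
| Independent challenge memo | Records how the validator challenged design assumptions, implementation evidence, result interpretation and residual risk |
| Validation report | Supports the formal judgment on pilot / release / scale / restriction / retirement |
| Portfolio evidence pack | Supports Model Risk, AI Governance, Internal Audit and regulatory inquiries |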
1.2 Audience
| Role | Capability to master | Training in this document |
|---|---|---|
| AI PM / Product Owner | Turn an AI idea into a product capability that can be validated, released, monitored and held accountable | use case boundary, business outcome, release gate, revalidation trigger |
| AI BA / CBAP | Connect requirements, process, controls, data, evidence and stakeholder concerns | requirement-to-eval-to-control traceability |
| Product / Solution Architect | Design prompt, RAG, tool, workflow, HITL, logging and fallback as a verifiable architecture | system inventory, process verification, architecture challenge |
| Model Risk / Validation | Perform independent validation and effective challenge, not just review model scores | conceptual soundness, process verification, outcome analysis |
| Internal Audit / Compliance | Check whether controls actually operate and whether the evidence supports management assertions | evidence sufficiency, control operating effectiveness |
| Risk Executive | Decide whether to accept residual risk, restrict scope, delay launch or require remediation | severity, remediation, risk acceptance |
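1.3 Core views
- Model validation must be upgraded to AI system validation.
- A benchmark is only one piece of evidence and cannot replace use-case-specific validation.
- GenAI validation must cover the model, prompt, RAG, tools, workflow, people, monitoring and outcomes.
- The value of independent challenge is not fault-finding; it is making the release judgment able to withstand challenge from the business, audit, regulators and incident post-mortems.
- For high-impact financial retail AI, the validation conclusion must be bound to the version boundary, use boundary, restrictions, issue remediation and revalidation triggers.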
2. Source Anchors
The sources below serve as anchors for framework language and governance. The access date is recorded as 2026-06-30. Formal projects should re-check the latest regulation, institutional policy and jurisdictional requirements.
| Anchor | Official link | How this document uses it |
|---|---|---|
| Federal Reserve SR 11-7 / OCC 2011-12 model risk management guidance | https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm | The classic anchor for traditional MRM: model risk, effective challenge, conceptual soundness, ongoing monitoring, outcomes analysis, independence, governance |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI risk identification, measurement, control, monitoring and continuous improvement |
| ISO/IEC 42001 AI management system | https://www.iso.org/standard/81230.html | Uses a management-system view to organize AI policy, owner, risk process, operation control, performance evaluation, internal audit and management review |
Usage discipline:
- Do not mechanically convert any source anchor into a checklist.
- Do not use "we follow SR 11-7 / NIST / ISO" as a conclusion.
- For every high-impact use case, translate the source anchors into verifiable artifacts: inventory, validation plan, evidence map, finding log, approval record, monitoring spec and revalidation record.
- For GenAI / RAG / Agent systems, traditional model validation language must be extended to system components, workflow controls and the production learning loop.
3. One-Sentence Positioning
AI system validation = independent, risk-based evaluation of whether a specific AI system is conceptually sound, correctly implemented, fit for approved use, controlled in operation, delivering intended outcomes, and constrained by evidence-backed limits.
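Put another way:
AI system validation is an independent risk judgment on whether a specific AI system is fit for use within its approved use, version boundary, business process, data dependencies, human controls and production monitoring.
Minimal closed loop: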
Business use case
-> model / use / system inventory
-> risk tiering
-> validation plan
-> conceptual soundness review
-> process verification
-> outcome analysis
-> independent challenge
-> findings and remediation
-> release decision
-> monitoring
-> revalidation trigger
One sentence for interviews:
I would not validate a GenAI product by asking whether the base model has a strong benchmark. I would validate the whole deployed system against its approved use, evidence chain, human controls, tool authority, production monitoring, and business-risk outcomes.
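The minimal closed loop in this section can be read as an ordered gate sequence: a later step should not be signed off while an earlier one is still open. The sketch below illustrates that reading in Python. The `STAGES` list and the two helper functions are hypothetical names introduced for illustration, and strict sequential ordering is an assumption that a real program may relax (for example, by running conceptual soundness review and process verification in parallel).

```python
from typing import Optional

# Stage names mirror the minimal closed loop; the identifiers are illustrative only.
STAGES = [
    "business_use_case",
    "inventory",
    "risk_tiering",
    "validation_plan",
    "conceptual_soundness_review",
    "process_verification",
    "outcome_analysis",
    "independent_challenge",
    "findings_and_remediation",
    "release_decision",
    "monitoring",
    "revalidation_trigger",
]


def next_open_stage(completed: set) -> Optional[str]:
    """Return the first stage that is not yet completed, or None if all are done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None


def can_sign_off(stage: str, completed: set) -> bool:
    """Assumption: a stage can be signed off only when every earlier stage is completed."""
    idx = STAGES.index(stage)
    return all(earlier in completed for earlier in STAGES[:idx])


done = {"business_use_case", "inventory"}
print(next_open_stage(done))                   # first stage still open
print(can_sign_off("release_decision", done))  # blocked by open earlier stages
```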
4. GenAI System Validation != Model Benchmark
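4.1 What a benchmark answers
A benchmark typically answers:
| Benchmark question | Typical evidence |
|---|---|
| How capable is this model on general tasks | public benchmark, vendor report, internal comparison |
| How does this model perform on a static test set | offline eval score, accuracy, F1, pass rate, rubric score |
| Is model A better than model B | head-to-head experiment, win rate, cost-latency-quality tradeoff |
| Does the model meet the baseline capability threshold | summarization, classification, reasoning, language coverage |
This evidence is valuable, but it is not enough to support releasing AI in financial retail, because a benchmark usually does not fully cover:
- The input distribution and exception scenarios of the actual business process.
- RAG source freshness, permission filtering, citation correctness and policy conflict.
- Prompt versions, tool schemas, workflow routing and HITL operation.
- User over-trust, failed human review, handoff breakpoints and operational risk.
- Latency, fallback, logging, audit, monitoring and incident response in production.
- Customer impact, compliance defects, complaints, business KPIs and residual risk.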
4.2 What validation answers
Validation answers:
| Validation question | Focus of judgment |
|---|---|
| Is this AI system fit for this approved use | use case boundary, risk tier, prohibited use, human role |
| Is the design logic sound | conceptual soundness, assumptions, limitations, architecture rationale |
| Is the implementation consistent with the approved design | prompt registry, model route, index version, tool permission, workflow config |
| Are the results acceptable | eval results, slice performance, critical failure, business outcome, risk outcome |
| Do the controls actually operate | HITL logs, approval records, monitoring alerts, incident drills, access review |
| Is the evidence sufficient to support release | evidence quality, sample coverage, version boundary, owner attestation |
| What changes would invalidate the validation | revalidation trigger, change impact, regression gate, rollback path |
4.3 Key differences
| Dimension | Model benchmark | AI system validation |
|---|---|---|
| Object | Model capability | Business system and control environment |
| Scope | General tasks or a static test set | approved use, workflow, data, prompt, RAG, tool, human, monitoring |
| Result | Scores and rankings | release decision, use restriction, findings, residual risk |
| Evidence | eval score, benchmark report | validation plan, trace, config, sample review, control test, issue remediation |
| Accountability | Data science / platform | Business owner, Model Risk, AI governance, architecture, operations |
| Timing | Model selection or pre-release | Pre-release, at change, in production, post-incident, periodic review |
Senior-level framing:
A benchmark can inform model selection. It cannot prove that a deployed AI system is fit for a regulated workflow.
5. Model / Use / System Inventory
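The first step of GenAI validation is not testing but registration and boundary definition. Without an inventory, validation has no object, version, owner or chain of accountability.
5.1 Three inventory layers
| Inventory layer | Core question | Typical fields |
|---|---|---|
| Model inventory | Which models or model-like components are used | foundation model, embedding, reranker, judge, classifier, rules, provider, version, owner |
| Use inventory | For which business use and at which process step is the model used | use case, approved use, prohibited use, risk tier, customer impact, decision boundary |
| System inventory | Which system components jointly produce the AI behavior | prompt, RAG, tool, workflow, HITL, eval, monitoring, fallback, evidence |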
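5.2 Model inventory fields
| Field | Definition | Validation significance |
|---|---|---|
| Component ID | Unique component identifier | Supports tracing and change impact analysis |
| Component type | LLM, embedding, reranker, judge, classifier, rules | Different components need different validation methods |
| Provider and hosting | Internal, third party, managed cloud, self-hosted open source | Affects vendor risk, data handling, update control |
| Version boundary | Model name, version, deployment id, release family | Binds validation evidence and regression tests |
| Intended component role | Generation, retrieval, ranking, scoring, refusal, routing, classification | Prevents component misuse |
| Known limitations | Language, domain, context length, stability, bias, lack of explainability | Feeds residual risk and user guidance |
| Change notice mechanism | vendor notice, internal release note, API change log | Determines revalidation triggers |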
5.3 Use inventory fields
| Field | Definition | Financial retail example |
|---|---|---|
| Business workflow | Process into which AI is inserted | credit memo drafting, card dispute handling, AML alert review |
| User role | Direct user | underwriter, branch banker, call center agent, AML analyst |
| Affected party | Who is affected | customer, applicant, employee, regulator, business line |
| Approved use | Use that has been approved | Generate internal policy answers and drafts with citations |
| Prohibited use | Use that is forbidden | Must not automatically approve or decline loans; must not promise fee waivers |
| Decision boundary | Which of read / summarize / recommend / draft / decide / act the AI performs | draft and recommend, no final decision |
| Risk tier | Risk level and rationale | High, because it may affect customer rights and regulated processes |
| Human role | How human review works | human final decision, mandatory review for high-impact outputs |
| Fallback | Path when AI is unavailable or low-confidence | policy portal search, SME escalation, manual workflow |
5.4 System inventory fields
| Field | Definition | Validation focus |
|---|---|---|
| Prompt registry | system / developer / task prompt versions and owners | prompt drift, rule conflicts, approval records |
| RAG stack | source repository, chunking, embedding, retriever, reranker, index | stale source, wrong citation, permission leakage |
| Tool registry | tool name, scope, permission, side effect, approval rule | excessive agency, unauthorized action, audit trail |
| Workflow state | routing, handoff, exception, approval, rollback | Whether HITL is executable and the escalation path works |
| Eval contract | dataset, rubric, threshold, critical failures, slices | Whether the release gate is repeatable |
| Monitoring spec | quality, risk, cost, latency, adoption, complaint, override | Whether failures can be detected in production |
| Logging and trace | input reference, retrieved source, prompt version, model version, tool call, human action | Whether outputs are reproducible and auditable |
| Revalidation triggers | model/prompt/index/tool/use/process/regulatory changes | When to revalidate |
5.5 Weak vs senior-level inventory
| Weak inventory | Senior-level inventory |
|---|---|
| "We use GPT-4 for customer service." | "Customer Service Fee Policy Copilot uses LLM deployment X, prompt v2.3, fee-policy index 2026Q2, approved FAQ source set, read-only account tools, human approval before customer response, and monitoring for unauthorized commitment." |
| Registers only the model name | Registers the model, use, system components, business process and control environment |
| Looks only at the current version | Records the version boundary, evidence boundary and change triggers |
| Names only an IT owner | Names the business owner, model risk owner, data owner, control owner and evidence owner |
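One way to make the gap between a weak and a senior-level inventory entry visible is to treat the entry as a typed record and list what is still blank. The sketch below assumes a hypothetical `SystemInventoryRecord` whose field names are illustrative, not a standard schema. It checks only text fields, so an empty tool list is not flagged.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SystemInventoryRecord:
    """Illustrative inventory entry; field names are assumptions, not a standard."""
    system_id: str
    approved_use: str
    prohibited_use: str
    risk_tier: str
    model_deployment: str
    prompt_version: str
    index_version: str
    business_owner: str = ""
    model_risk_owner: str = ""
    data_owner: str = ""
    control_owner: str = ""
    evidence_owner: str = ""
    tool_scopes: list = field(default_factory=list)


def missing_fields(record: SystemInventoryRecord) -> list:
    """Return the names of text fields that are empty or whitespace only."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
    return missing


# A weak entry in the style of "We use GPT-4 for customer service."
weak = SystemInventoryRecord(
    system_id="CS-COPILOT", approved_use="", prohibited_use="", risk_tier="",
    model_deployment="GPT-4", prompt_version="", index_version="",
)
print(missing_fields(weak))
```

A non-empty result is a prompt for the inventory owner, not a validation finding in itself.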
6. Validation Domains
Traditional model validation commonly uses conceptual soundness, ongoing monitoring and outcomes analysis. For GenAI systems, this document recasts them as three actionable validation domains:
Conceptual soundness
Process verification
Outcome analysis
Here, process verification covers the traditional "implementation and use" checks plus production control checks, and outcome analysis covers offline eval, online results and business/risk outcomes.
6.1 Conceptual soundness
Core question:
Is the AI system design logically sound for the approved business use and risk tier?
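| Review area | Challenge question | Evidence |
|---|---|---|
| Business problem | Does AI solve a real process bottleneck, and is there a no-AI alternative | opportunity memo, baseline measurement, workflow map |
| Approved use | Is the use boundary narrow enough, and are prohibited uses explicit | use inventory, policy decision table |
| Risk tiering | Does the risk tier reflect customer impact, degree of automation, data sensitivity and regulatory sensitivity | risk tier worksheet, customer impact analysis |
| Model choice | Are model capability, limitations, vendor controls and cost fit for the use | model selection memo, vendor due diligence |
| Prompt design | Does the prompt express policy boundaries, refusal rules, output schema and escalation conditions | prompt design note, prompt review record |
| RAG design | Are the source of truth, permissions, effective dates, chunking and retrieval method sound | RAG architecture, source inventory, retrieval eval |
| Tool design | Are tool permissions minimized and side effects controlled | tool registry, permission matrix, approval flow |
| HITL design | Do human reviewers have the competence, time, authority and evidence | SOP, training record, review queue design |
| Eval design | Does the eval cover real failure modes, severity, slices and the gate decision | eval contract, critical failure taxonomy |
| Monitoring design | Does production monitoring cover quality, safety, process, cost, customer impact and drift | monitoring spec, KRI/KPI definitions |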
A high-quality conceptual soundness conclusion should state:
- Design is fit for approved use, with stated limits.
- Key assumptions are explicit and testable.
- Critical risks have controls and evidence.
- Known limitations are communicated to users and management.
- Residual risks are either remediated, restricted, or accepted by the right owner.
6.2 Process verification
Core question:
Was the approved AI system implemented and operated as designed?
| Verification area | Test method | Evidence |
|---|---|---|
| Model routing | Check that production calls use the approved model deployment | config snapshot, deployment log, API route record |
| Prompt version | Check that the prompt registry matches the production version | prompt hash, release record, change approval |
| RAG source | Check source repository, effective date, access filter and index version | source manifest, index build log, permission test |
| Retrieval behavior | Sample and inspect query, retrieved chunks, citations and final answer | trace sample, citation audit |
| Tool authority | Check allowlist, RBAC, rate limit, approval and rollback | tool permission matrix, negative test |
| Workflow routing | Check handoff, exception, fallback and review queue | workflow test evidence, case sample |
| HITL operation | Sample whether reviewers actually review and can override | review logs, edit diff, approval record |
| Logging | Check whether the output path and human actions can be reconstructed | traceability sample, log retention policy |
| Release gate | Check whether go / limited go / no-go decisions follow the evidence | release gate memo, sign-off record |
| Monitoring | Check whether dashboards, alerts, sampling review and issue escalation are running | monitoring dashboard, alert history, issue log |
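Common process verification findings:
| Finding | Risk |
|---|---|
| Prompt registry shows v2.1, but production actually calls v2.3 | Validation evidence and production behavior do not match |
| RAG index does not record the policy effective date | Outdated policy may be cited |
| Tool allowlist includes write APIs, but validation tested only read APIs | Validation scope missed excessive agency |
| Human review log has only a click-to-confirm, with no diff between AI draft and human edit | Cannot prove the human control is effective |
| Monitoring tracks only latency and cost, not wrong citation or policy violation | Production risk is invisible |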
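The first finding in the table, a registry version that does not match what production serves, is the kind of gap a prompt hash comparison can surface. The sketch below is a minimal illustration using SHA-256 from the Python standard library. The registry is modelled as a plain dictionary and the prompt texts are invented; a real prompt registry would have its own API.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """SHA-256 of the prompt text, with surrounding whitespace stripped."""
    return hashlib.sha256(prompt_text.strip().encode("utf-8")).hexdigest()


def verify_prompt(registry: dict, approved_version: str, production_prompt: str) -> dict:
    """Compare the approved registry entry with the prompt actually served in production."""
    approved_text = registry.get(approved_version)
    if approved_text is None:
        return {"match": False, "reason": "approved version not in registry"}
    if prompt_hash(approved_text) != prompt_hash(production_prompt):
        return {"match": False, "reason": "production prompt differs from approved version"}
    return {"match": True, "reason": "production prompt matches approved version"}


# Hypothetical registry content for illustration only.
registry = {
    "v2.1": "Answer only from cited policy.",
    "v2.3": "Answer only from cited policy. Suggest fee options.",
}
print(verify_prompt(registry, "v2.1", registry["v2.3"]))
```

A mismatch does not say which side is wrong. It only shows that the validation evidence and production behavior are bound to different prompt text.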
6.3 Outcome analysis
Core question:
Are the AI system outputs and business outcomes acceptable within the approved risk appetite?
Outcome analysis is more than accuracy. It covers at least four layers of outcome:
| Outcome layer | Example metrics | How to interpret |
|---|---|---|
| AI output quality | groundedness, citation correctness, completeness, policy compliance, format validity | Slice by scenario, product, language, customer type and risk tier |
| Risk outcome | critical failure, unauthorized action, PII leakage, under-escalation, complaint defect | High-impact scenarios usually apply a hard stop or near-zero tolerance |
| Workflow outcome | handling time, rework rate, escalation quality, review backlog, fallback rate | Look at both efficiency and control burden |
| Business outcome | first contact resolution, case quality, loss avoidance, customer harm, regulatory issue | Business improvement must not be used to mask high-severity risk |
Evidence types for outcome analysis:
- Golden set eval and expert review.
- Adversarial and red-team tests.
- Historical incident replay.
- Slice analysis by product, channel, language, customer segment and policy domain.
- Human review defect analysis.
- Production trace sampling.
- Complaint, QA, audit and incident linkage.
- Baseline vs pilot vs scaled rollout comparison.
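Rules for interpreting results:
| Result pattern | Senior-level interpretation |
|---|---|
| Average score high, critical failures present | Should not pass a high-impact release, because the average masks severe failures |
| Offline eval strong, production override high | May indicate a shift in the real input distribution, user distrust or workflow mismatch |
| Efficiency improved, review backlog increased | The control design may not be operable; capacity and triage need redesign |
| Citation correctness strong, answer completeness weak | RAG finds the evidence but the generation layer does not use it fully |
| Model benchmark improved, business defect unchanged | The problem likely lies in process, data, prompt, tool or user behavior, not in the foundation model |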
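The first interpretation rule, that a high average must not hide a critical failure, can be written as a gate that checks critical failures before it looks at any rate. The sketch below is illustrative: the result record shape, the 0.9 slice threshold and the mapping of a weak slice to "limited go" are assumptions, not recommended values.

```python
from collections import defaultdict


def slice_pass_rates(results: list) -> dict:
    """Pass rate per slice; each result is a dict with 'slice', 'passed' and 'critical'."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        passes[r["slice"]] += (1 if r["passed"] else 0)
    return {s: passes[s] / totals[s] for s in totals}


def outcome_gate(results: list, min_slice_pass_rate: float = 0.9) -> str:
    """Return 'no-go', 'limited go' or 'go'. The threshold is a placeholder."""
    if any(r["critical"] for r in results):
        return "no-go"  # a critical failure blocks release regardless of the average
    rates = slice_pass_rates(results)
    if any(rate < min_slice_pass_rate for rate in rates.values()):
        return "limited go"  # assumption: a weak slice restricts scope
    return "go"


# 19 passing fee cases and one critical failure in disputes: 95% overall pass rate.
results = [{"slice": "fees", "passed": True, "critical": False}] * 19
results += [{"slice": "disputes", "passed": False, "critical": True}]
overall = sum(r["passed"] for r in results) / len(results)
print(overall, outcome_gate(results))
```

The ordering is the point of the design: the average is never consulted once a critical failure is present.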
7. Independent Challenge Operating Model
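7.1 Definition of independent challenge
Independent challenge is critical review of an AI system's design, implementation, results and risk acceptance by people with competence, independence and influence, who can drive restriction, remediation, delayed release or revalidation.
In GenAI system validation, independent challenge must be able to question:
- Whether the business use is overstated or mis-defined.
- Whether the risk tier is underestimated.
- Whether the prompt and RAG design have a real basis.
- Whether tool permissions are too broad.
- Whether HITL is a control in form only.
- Whether the eval dataset covers real failure modes.
- Whether the interpretation of results ignores high-severity slices.
- Whether monitoring can detect the problems that would actually cause harm in production.
- Whether management acceptance reflects an understanding of residual risk.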
7.2 Operating model
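| Element | Design requirement |
|---|---|
| Mandate | The validation team has the authority to request evidence, raise findings, restrict release and require revalidation |
| Independence | Validators should not own development, release or business benefit targets |
| Competence | The team must understand business process, AI architecture, data, model risk, control testing and financial retail risk |
| Influence | High-severity findings feed the release gate, and management must explicitly disposition them |
| Traceability | Every challenge question maps to evidence, a finding, remediation or risk acceptance |
| Escalation | Major disagreements have a decision path to the model risk committee / AI governance committee |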
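7.3 Three lines of defense view
| Line | Main responsibility | What it cannot do |
|---|---|---|
| First line - Business / Product / Engineering | Design, build, test and run the AI system; maintain evidence | Cannot replace independent validation |
| Second line - Model Risk / Risk / Compliance | Set standards, validate independently, challenge findings, provide risk oversight | Cannot carry product benefit targets |
| Third line - Internal Audit | Review whether governance and controls are designed and operating effectively | Cannot act as the pre-release quality team |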
7.4 RACI
| Activity | Business Owner | Product / BA | Architect / Engineering | Data Owner | Model Risk / Validation | Compliance / Legal | Internal Audit |
|---|---|---|---|---|---|---|---|
| Use case boundary | A | R | C | C | C | C | I |
| Risk tiering | A | R | C | C | C | C | I |
| System inventory | A | R | R | R | C | C | I |
| Eval contract | A | R | R | C | C | C | I |
| Validation plan | C | C | C | C | A/R | C | I |
| Process verification evidence | C | R | R | R | A/R | C | I |
| Findings severity | C | C | C | C | A/R | C | I |
| Remediation | A | R | R | R | C | C | I |
| Risk acceptance | A | C | C | C | C | C | I |
| Audit review | I | I | I | I | C | C | A/R |
Legend:
- R = Responsible.
- A = Accountable.
- C = Consulted.
- I = Informed.
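A RACI matrix is easier to challenge when the single-accountability rule is checked mechanically. The sketch below assumes the common convention of exactly one Accountable role per activity, which this playbook does not state explicitly. The first two rows are copied from the table above, and the third is a hypothetical bad row added to show a failure.

```python
def raci_issues(matrix: dict) -> dict:
    """Return activities whose row does not contain exactly one Accountable role.

    Each row lists one cell per role, e.g. 'A', 'R', 'C', 'I' or a combination like 'A/R'.
    """
    issues = {}
    for activity, cells in matrix.items():
        accountable = sum(1 for cell in cells if "A" in cell.split("/"))
        if accountable != 1:
            issues[activity] = accountable
    return issues


matrix = {
    "Validation plan": ["C", "C", "C", "C", "A/R", "C", "I"],
    "Remediation": ["A", "R", "R", "R", "C", "C", "I"],
    # Hypothetical row with two accountable roles, included to show a failure.
    "Hypothetical shared sign-off": ["A", "C", "C", "C", "A/R", "C", "I"],
}
print(raci_issues(matrix))
```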
7.5 Challenge forum cadence
| Cadence | Forum | Decision |
|---|---|---|
| Intake | AI use case triage | risk tier, validation depth, pilot boundary |
| Pre-pilot | Design challenge | approve pilot, restrict scope, require redesign |
| Pre-release | Validation challenge | go, limited go, no-go, remediation before release |
| Post-release | Monitoring review | continue, scale, restrict, rollback, revalidate |
| Triggered | Change / incident challenge | regression eval, emergency restriction, full revalidation |
| Quarterly | Portfolio challenge | concentration risk, overdue findings, evidence gaps |
8. Validation Evidence
8.1 Evidence principle
Good evidence is not "a document exists". Good evidence proves:
The right control operated
on the right AI system version
for the right use case boundary
over the right sample or time window
with the right owner review
and with failures handled through a tracked decision.
8.2 Evidence stack
| Evidence domain | Evidence examples | Validation use |
|---|---|---|
| Business and use | use case memo, workflow map, approved/prohibited use, risk tier worksheet | boundary and materiality |
| Architecture | system diagram, data flow, RAG design, tool registry, HITL workflow | conceptual soundness |
| Data and RAG | source inventory, lineage, access control, index build log, citation audit | retrieval and evidence grounding |
| Prompt and policy | prompt registry, prompt diff, review record, policy mapping | behavior control |
| Model and vendor | model card, selection memo, vendor review, update notice process | component risk |
| Eval | eval contract, golden set, rubric, run results, slice analysis, expert review | release readiness |
| Red-team | adversarial cases, prompt injection test, sensitive data test, tool abuse test | stress and misuse |
| Process verification | config snapshot, deployment log, negative test, trace sample, workflow walkthrough | implementation fidelity |
| HITL | reviewer training, approval log, override log, human edit diff, QA sample | human control effectiveness |
| Monitoring | dashboard, alert rules, sample review, complaint linkage, incident log | ongoing performance |
| Findings | issue log, severity, owner, remediation evidence, retest result | risk reduction |
| Decision | release memo, management sign-off, risk acceptance, restrictions, expiry | governance and accountability |
8.3 Evidence quality standards
| Standard | What it means |
|---|---|
| Versioned | Evidence states model, prompt, index, tool and workflow versions |
| Traceable | Evidence links to use case, requirement, risk, control and gate decision |
| Reproducible | A reviewer can reconstruct test method and sample selection |
| Risk-based | High-risk claims have stronger evidence and more independent review |
| Time-bounded | Evidence has date, covered period and expiry or review cadence |
| Owner-backed | A named owner attests to evidence accuracy and operating status |
| Failure-aware | Evidence includes failures, exceptions and remediation, not only pass results |
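Several of these standards can be checked mechanically before a reviewer spends time on an evidence item. The sketch below covers only the Versioned, Traceable, Time-bounded, Owner-backed and Failure-aware standards. Reproducible and Risk-based still need reviewer judgment and are not modelled. `EvidenceItem` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Version keys follow the "Versioned" standard: model, prompt, index, tool and workflow.
REQUIRED_VERSION_KEYS = ("model", "prompt", "index", "tool", "workflow")


@dataclass
class EvidenceItem:
    """Illustrative evidence record; the schema is an assumption."""
    evidence_id: str
    versions: dict
    linked_control: str
    owner: str
    collected_on: date
    expires_on: Optional[date]
    includes_failures: bool


def evidence_gaps(item: EvidenceItem, today: date) -> list:
    """List which mechanically checkable quality standards the item does not meet."""
    gaps = []
    missing = [k for k in REQUIRED_VERSION_KEYS if not item.versions.get(k)]
    if missing:
        gaps.append("not versioned: missing " + ", ".join(missing))
    if not item.linked_control:
        gaps.append("not traceable: no linked control")
    if not item.owner:
        gaps.append("not owner-backed")
    if item.expires_on is None:
        gaps.append("not time-bounded: no expiry or review date")
    elif item.expires_on < today:
        gaps.append("expired")
    if not item.includes_failures:
        gaps.append("not failure-aware")
    return gaps


item = EvidenceItem("EV-001", {"model": "deploy-x", "prompt": "v2.3"}, "",
                    "J. Doe", date(2026, 1, 1), None, True)
for gap in evidence_gaps(item, today=date(2026, 6, 30)):
    print(gap)
```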
8.4 Evidence graph view
Claim: AI does not make final credit decision
-> Risk: automated adverse customer impact
-> Control objective: human remains final decision maker
-> Control activity: no approve/decline tool permission, mandatory underwriter review
-> Test: negative API test, workflow walkthrough, case sample
-> Evidence: permission matrix, trace logs, human approval records
-> Decision: release allowed with no direct decision automation
-> Monitoring: monthly sample of AI-influenced decisions and override rate
8.5 Evidence anti-patterns
| Anti-pattern | Why it fails independent challenge |
|---|---|
| Only vendor benchmark | Does not prove approved-use fitness |
| Screenshot of working UI | Does not prove control design or operation |
| Average score only | Hides severe failures and weak slices |
| Unversioned eval report | Cannot bind evidence to released system |
| Manual review claim without logs | Cannot prove HITL operated |
| Monitoring dashboard without thresholds | No decision rule for action |
| Risk acceptance without expiry | Residual risk becomes unmanaged |
9. Revalidation Triggers
Validation is not a one-time release activity. A GenAI system must define triggers that invalidate or weaken prior evidence.
| Trigger category | Trigger examples | Required response |
|---|---|---|
| Model change | foundation model upgrade, new deployment, context length change, temperature change | regression eval, limitation review, release gate update |
| Prompt change | system instruction, policy prompt, output schema, refusal rule | prompt diff review, targeted eval, approval record |
| RAG change | new source repository, index rebuild, embedding model change, chunking change, permission filter update | retrieval eval, citation audit, access test |
| Tool change | new API, expanded permission, write action, batch action, approval bypass | tool risk review, negative tests, HITL verification |
| Workflow change | new user role, new channel, automation step, exception path | process walkthrough, control test, training update |
| Use expansion | customer-facing expansion, new product, new geography, new customer segment | risk tier reassessment, validation scope expansion |
| Data shift | policy changes, new product terms, seasonal volume, language mix change | sample refresh, outcome analysis, monitoring threshold review |
| Monitoring signal | critical failures, complaint spike, wrong citation trend, override anomaly | incident triage, issue remediation, targeted revalidation |
| Regulatory or policy change | new internal AI policy, supervisory request, legal interpretation change | control mapping refresh, management review |
| Vendor event | model behavior incident, SLA breach, data handling change, subprocessor change | vendor risk review, contingency test |
Rule of thumb:
If a change can alter the AI output, evidence source, tool authority, human control, customer impact or risk interpretation, it should trigger at least targeted revalidation.
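The rule of thumb can be approximated by comparing the snapshot that was validated with the snapshot now running. The sketch below covers only the change-type triggers (model, prompt, RAG index, tools, workflow, approved use). Signal-driven triggers such as complaint spikes or vendor events need a monitoring feed and are out of scope. The snapshot keys and values are hypothetical; the response strings are taken from the trigger table.

```python
# Assumed mapping from a changed snapshot key to the response in the trigger table.
RESPONSES = {
    "model": "regression eval, limitation review, release gate update",
    "prompt": "prompt diff review, targeted eval, approval record",
    "index": "retrieval eval, citation audit, access test",
    "tools": "tool risk review, negative tests, HITL verification",
    "workflow": "process walkthrough, control test, training update",
    "approved_use": "risk tier reassessment, validation scope expansion",
}


def revalidation_actions(validated: dict, current: dict) -> dict:
    """Return the required response for every component that differs between snapshots."""
    changed = {}
    for component, response in RESPONSES.items():
        if validated.get(component) != current.get(component):
            changed[component] = response
    return changed


validated = {
    "model": "llm-deployment-x", "prompt": "v2.3", "index": "fee-policy-2026Q2",
    "tools": ["get_account_context"], "workflow": "wf-1", "approved_use": "internal draft",
}
# A prompt update and a new write-capable tool since validation (illustrative values).
current = dict(validated, prompt="v2.4", tools=["get_account_context", "update_case_field"])
for component, response in revalidation_actions(validated, current).items():
    print(component, "->", response)
```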
10. Financial Retail Case: Credit Card Dispute and Complaint AI Copilot
10.1 Use case
Use case:
Credit Card Dispute and Complaint AI Copilot
Users:
Call center agents, complaint operations analysts, QA reviewers
Approved use:
Summarize customer interaction, retrieve approved dispute policy, draft internal case note, suggest escalation reason with citations.
Prohibited use:
Do not make final dispute decision.
Do not promise fee reversal, provisional credit or compensation.
Do not send customer response without human approval.
Do not override complaint classification or regulatory clock.
10.2 System architecture
| Component | Design |
|---|---|
| Foundation model | Approved enterprise LLM deployment with logging and no training on customer data |
| Prompt | Role, policy boundary, citation requirement, refusal behavior, escalation rules |
| RAG | Approved policy repository, dispute procedure, complaint taxonomy, product terms, effective-date metadata |
| Tools | Read-only account context, case lookup, draft note creation, no direct customer communication |
| Workflow | Agent asks question -> RAG answer -> draft note -> human review -> QA sampling |
| HITL | Human must approve customer-facing language and final disposition |
| Eval | Dispute policy golden set, complaint escalation cases, red-team misuse cases, historical defect replay |
| Monitoring | wrong citation rate, unauthorized commitment, under-escalation, human edit rate, complaint QA defect |
10.3 Risk profile
| Risk | Why material | Control |
|---|---|---|
| Unauthorized commitment | AI may imply fee waiver, provisional credit or compensation | forbidden commitment eval, response policy, human approval |
| Wrong policy citation | Agent may rely on outdated or irrelevant policy | active source filter, citation audit, source effective date |
| Under-escalation | Complaint or regulatory-sensitive case may not be escalated | escalation classifier, high-risk case eval, QA sampling |
| Privacy leakage | Customer data may appear in prompt, log or generated note beyond need | data minimization, log policy, access control |
| Over-reliance | Agent may accept AI draft without reading evidence | UI evidence display, reviewer training, edit diff monitoring |
| Tool misuse | AI may update case fields incorrectly if tool authority expands | read-only tools, approval workflow, negative tool tests |
10.4 Validation plan summary
| Validation domain | Example test |
|---|---|
| Conceptual soundness | Review whether AI role is limited to summarize / retrieve / draft / suggest, not decide / act |
| Prompt review | Test whether prompt blocks commitments and requires evidence-backed caveats |
| RAG review | Test active policy retrieval, stale policy rejection and citation correctness |
| Tool review | Confirm no write authority for final disposition or customer communication |
| HITL review | Sample draft-to-final diff and verify human approval before sending |
| Outcome analysis | Compare pre-pilot and pilot QA defects, handling time and complaint escalation quality |
| Monitoring review | Confirm alert thresholds for unauthorized commitment and under-escalation |
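The tool review row lends itself to a negative test: call tools that must not be available and confirm each call is denied. The sketch below models the copilot's tool registry as a dictionary. The tool names, the `side_effect` field and the `authorize` function are all hypothetical. A real test would exercise the deployed authorization layer, not a local stand-in.

```python
# Hypothetical tool registry for the copilot; names and fields are illustrative.
TOOL_REGISTRY = {
    "get_account_context": {"side_effect": "read"},
    "lookup_case": {"side_effect": "read"},
    "create_draft_note": {"side_effect": "draft"},
}
ALLOWED_SIDE_EFFECTS = {"read", "draft"}


class ToolDenied(Exception):
    """Raised when a tool call is outside the approved authority."""


def authorize(tool_name: str, registry: dict = TOOL_REGISTRY) -> str:
    """Allow a call only for allowlisted tools with an approved side effect."""
    tool = registry.get(tool_name)
    if tool is None:
        raise ToolDenied(f"{tool_name} is not on the allowlist")
    if tool["side_effect"] not in ALLOWED_SIDE_EFFECTS:
        raise ToolDenied(f"{tool_name} has side effect {tool['side_effect']}")
    return tool["side_effect"]


def negative_test(tool_names: list, registry: dict = TOOL_REGISTRY) -> dict:
    """Map each prohibited tool to True if the call was correctly denied."""
    outcome = {}
    for name in tool_names:
        try:
            authorize(name, registry)
            outcome[name] = False  # the call was allowed: this is a finding
        except ToolDenied:
            outcome[name] = True
    return outcome


print(negative_test(["send_customer_response", "update_final_disposition"]))
```

Any `False` in the result means a prohibited action was reachable, which the severity template in section 11.3 would treat as Critical.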
10.5 Independent challenge examples
| Challenge | Evidence requested | Possible finding |
|---|---|---|
| Does the system know when not to answer? | unanswered / policy-conflict eval cases | Medium if refusal behavior is inconsistent |
| Can RAG retrieve stale policy? | index manifest and effective-date test | High if outdated policy can be cited |
| Is human review real or ceremonial? | human edit diff, approval time, QA sample | High if approval is one-click without evidence review |
| Does the system change complaint classification? | tool permission matrix, workflow trace | Critical if AI can alter regulatory clock fields |
| Are high-risk complaints monitored post-release? | monitoring dashboard and alert history | High if no under-escalation metric exists |
10.6 Release decision example
| Decision element | Example |
|---|---|
| Decision | Limited release to 50 trained agents in one card product line |
| Conditions | No direct customer send, no disposition update, mandatory human approval, weekly QA sample |
| Blockers closed | Stale policy retrieval fixed with active-source filter and retested |
| Open medium issue | Refusal wording inconsistent for policy-conflict cases, mitigated with escalation banner and targeted monitoring |
| Revalidation trigger | Any tool write permission, new product line, policy repository migration, or model deployment upgrade |
11. Templates
11.1 Validation Plan Template
# AI System Validation Plan
## 1. System and Use Boundary
- System ID:
- System name:
- Business owner:
- Product / BA owner:
- Technical owner:
- Model risk owner:
- Approved use:
- Prohibited use:
- Risk tier and rationale:
- Customer / employee / regulatory impact:
## 2. System Components
- Foundation model:
- Prompt versions:
- RAG sources and index versions:
- Embedding / reranker / judge components:
- Tools and permissions:
- Workflow states:
- HITL checkpoints:
- Monitoring scope:
## 3. Validation Objectives
- Conceptual soundness objective:
- Process verification objective:
- Outcome analysis objective:
- Independent challenge objective:
- Release decision supported:
## 4. Validation Scope
- In scope:
- Out of scope with rationale:
- Version boundary:
- User population:
- Channel / product / geography boundary:
- Data and knowledge boundary:
## 5. Test and Evidence Plan
| Domain | Test method | Sample / data | Metric / threshold | Evidence | Owner |
|---|---|---|---|---|---|
| Conceptual soundness | Design review | Architecture and requirements | Fit for approved use | Review memo | Model Risk |
| RAG | Retrieval eval | Golden policy questions | Wrong citation <= defined limit, critical stale citation = 0 | Retrieval report | Data Owner |
| Tool | Negative permission test | Tool abuse cases | Unauthorized action = 0 | Test log | Engineering |
| HITL | Case sample review | Production pilot sample | Required approvals present | Approval sample | Operations |
| Outcome | Expert review | Golden set and pilot traces | Critical failure = 0 | Eval report | Validation |
## 6. Findings and Gate Rules
- Critical finding rule:
- High finding rule:
- Medium finding rule:
- Low finding rule:
- Risk acceptance authority:
- Release options: go, limited go, no-go, rollback, retire
## 7. Monitoring and Revalidation
- Production metrics:
- Alert thresholds:
- Sampling cadence:
- Revalidation triggers:
- Evidence retention:
11.2 Independent Challenge Questions Template
| Challenge domain | Questions |
|---|---|
| Business use | What decision or workflow does AI influence? Is the approved use narrow enough? What is explicitly prohibited? |
| Risk tier | Could the system affect customer rights, pricing, credit, complaints, AML, fraud, privacy or regulatory reporting? |
| Model | Why was this model selected? What limitations matter for this use case? How are vendor changes detected? |
| Prompt | Are policy boundaries, refusal rules, evidence requirements and escalation conditions encoded and tested? |
| RAG | Are sources approved, current, permission-filtered and traceable? Can the system cite stale or irrelevant evidence? |
| Tool | What can the AI execute? Which actions are read-only, draft-only, approval-required or prohibited? |
| Workflow | Where can handoff fail? What happens when AI is uncertain, unavailable or conflicts with policy? |
| HITL | Does the human have enough context, time, authority and training to challenge AI output? |
| Eval | Does the eval set include edge cases, historical incidents, adversarial cases and high-risk slices? |
| Monitoring | Which production signals would reveal hallucination, wrong citation, over-reliance, under-escalation or tool misuse? |
| Evidence | Can a reviewer reconstruct a critical output from input to source to prompt to model to tool to human action? |
| Decision | Who can accept residual risk, for how long, under what conditions and with what revalidation triggers? |
11.3 Finding Severity and Remediation Template
| Severity | Definition | Release impact | Remediation expectation | Example |
|---|---|---|---|---|
| Critical | Failure can cause unauthorized customer-impacting action, regulatory-sensitive error, data leakage or unbounded automation | No-go or immediate rollback | Fix before release, retest, management review | AI can submit final dispute decision through tool |
| High | Material control gap or quality failure in approved use, with plausible customer or compliance impact | No-go for full release, limited pilot only with strong compensating controls | Fix or restrict scope before scale | RAG can cite outdated complaint policy |
| Medium | Weakness that can degrade quality, auditability or operational control but has containment | Release may proceed with conditions | Remediate by agreed date, monitor with owner | Refusal wording inconsistent for policy-conflict cases |
| Low | Documentation, minor usability or evidence clarity issue with limited risk impact | Does not block release | Address in normal backlog with evidence update | Evidence index missing reviewer title |
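The Release impact column can be read as a rule that maps the worst open finding to a gate outcome. The sketch below is one possible encoding in Python, not a prescribed policy: the function `gate_decision`, the decision labels and the `compensating_controls` flag are assumptions made for illustration, and a real gate still requires management judgement.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def gate_decision(open_findings, compensating_controls=False):
    """Map the worst open finding to an illustrative gate decision.

    open_findings: iterable of Severity values for unresolved findings.
    compensating_controls: hypothetical flag meaning strong compensating
    controls were reviewed and accepted for a limited pilot.
    """
    worst = max(open_findings, default=None)
    if worst is None or worst == Severity.LOW:
        return "go"
    if worst == Severity.MEDIUM:
        return "go with conditions"
    if worst == Severity.HIGH:
        return "limited pilot only" if compensating_controls else "no-go"
    return "no-go or rollback"
```

Taking the maximum severity keeps the rule conservative: a single Critical finding decides the outcome regardless of how many Low findings sit beside it.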
Remediation record:
| Field | Description |
|---|---|
| Finding ID | Unique issue ID linked to validation report |
| Severity | Critical / High / Medium / Low |
| Root cause | Design, data, prompt, RAG, tool, workflow, monitoring, governance |
| Required action | Specific change needed |
| Owner | Named accountable owner |
| Due date | Date tied to release or monitoring cadence |
| Evidence required | Test report, config diff, trace sample, approval, monitoring update |
| Retest result | Passed, partially passed, failed, restricted |
| Residual risk | Remaining limitation after remediation |
| Risk acceptance | Approver, condition, expiry, review trigger |
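The remediation record can also be held as a typed structure so that gaps are detected automatically. The Python sketch below mirrors the fields in the table; the class names, the function `record_gaps` and the four completeness rules are assumptions chosen for illustration, and each organization would set its own rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RiskAcceptance:
    approver: str
    condition: str
    expiry: date
    review_trigger: str


@dataclass
class RemediationRecord:
    finding_id: str
    severity: str  # "Critical" / "High" / "Medium" / "Low"
    root_cause: str
    required_action: str
    owner: str
    due_date: date
    evidence_required: str
    retest_result: Optional[str] = None  # "passed", "partially passed", "failed", "restricted"
    residual_risk: str = ""
    risk_acceptance: Optional[RiskAcceptance] = None


def record_gaps(rec, today):
    """Return a list of illustrative completeness gaps for one record."""
    gaps = []
    if not rec.owner.strip():
        gaps.append("owner missing")
    if rec.retest_result is None and rec.due_date < today:
        gaps.append("overdue without retest")
    if rec.retest_result not in (None, "passed") and rec.risk_acceptance is None:
        gaps.append("residual risk not accepted")
    if rec.risk_acceptance is not None and rec.risk_acceptance.expiry < today:
        gaps.append("risk acceptance expired")
    return gaps
```

Running `record_gaps` across the finding register gives a quick view of which records cannot yet support a release decision.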
11.4 Validation Report Template
# AI System Validation Report
## 1. Executive Conclusion
- System:
- Use case:
- Risk tier:
- Validation period:
- Overall decision:
- Key restrictions:
- Material residual risks:
## 2. Scope and Version Boundary
- Model version:
- Prompt version:
- RAG source and index version:
- Tool permission version:
- Workflow version:
- User and channel boundary:
## 3. Validation Work Performed
- Conceptual soundness review:
- Process verification:
- Outcome analysis:
- Red-team / adversarial testing:
- HITL and workflow review:
- Monitoring readiness review:
- Evidence review:
## 4. Results
| Domain | Result | Key evidence | Finding IDs |
|---|---|---|---|
| Conceptual soundness | | | |
| RAG | | | |
| Tool authority | | | |
| HITL | | | |
| Outcome analysis | | | |
| Monitoring | | | |
## 5. Findings
| Finding ID | Severity | Description | Required action | Owner | Due date | Release impact |
|---|---|---|---|---|---|---|
## 6. Limitations
- Validation limitations:
- Known system limitations:
- Evidence limitations:
- Use restrictions:
## 7. Decision and Conditions
- Gate decision:
- Required restrictions:
- Required monitoring:
- Revalidation triggers:
- Risk acceptance:
## 8. Appendices
- Inventory record:
- Architecture diagram:
- Eval contract:
- Eval results:
- Trace samples:
- Control test evidence:
- Monitoring spec:
- Approval record:
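The Scope and Version Boundary section of the report is only useful if the deployed system still matches it. As a rough Python sketch, the approved and deployed versions can each be captured as a manifest and compared; the manifest keys and the helper names `prompt_hash` and `version_drift` are hypothetical, and SHA-256 is just one reasonable way to fingerprint prompt text.

```python
import hashlib


def prompt_hash(prompt_text):
    """Fingerprint a prompt so that any edit changes the recorded version."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def version_drift(approved, deployed):
    """Return {component: (approved_value, deployed_value)} for each mismatch.

    A component present in only one manifest is reported with None on the
    other side, so silently added or removed components are also visible.
    """
    keys = sorted(set(approved) | set(deployed))
    return {
        key: (approved.get(key), deployed.get(key))
        for key in keys
        if approved.get(key) != deployed.get(key)
    }
```

An empty result supports process verification for that release; any non-empty result is a candidate revalidation trigger.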
11.5 Portfolio Evidence Template
| Portfolio question | Evidence view | Example metrics |
|---|---|---|
| How many high-risk AI systems are in production? | AI system inventory by risk tier and business line | count by tier, release stage, owner |
| Which systems have overdue validation? | validation schedule and expiry | overdue count, days overdue |
| Which components create concentration risk? | shared model / vendor / RAG source / tool dependency map | systems per provider, shared source count |
| What findings remain open? | finding register | open critical/high, aging, owner |
| Which systems rely on HITL as key control? | HITL control inventory | review volume, backlog, defect rate |
| Which changes triggered revalidation? | change and revalidation log | model changes, prompt changes, index rebuilds, tool changes |
| Which evidence supports management attestation? | evidence graph and release memo index | evidence freshness, missing evidence, owner attestation |
| What production signals indicate drift or misuse? | monitoring dashboard | wrong citation, policy violation, override, complaint linkage |
Portfolio evidence should support these executive questions:
- Are high-risk AI systems known, owned and validated?
- Are independent challenge findings being remediated on time?
- Are release decisions tied to evidence, not optimism?
- Are revalidation triggers actually firing?
- Are business benefits being achieved without unacceptable risk?
- Are shared vendors, models, data sources or tools creating aggregate risk?
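Two of the example metrics in the table, days overdue and systems per provider, can be computed directly from the inventory. The Python sketch below assumes each inventory row is a dictionary with the hypothetical keys `system`, `model_provider` and `validation_expiry`; a real inventory would carry many more fields.

```python
from collections import Counter
from datetime import date


def overdue_validations(inventory, today):
    """Return (system, days_overdue) pairs, most overdue first."""
    rows = [
        (row["system"], (today - row["validation_expiry"]).days)
        for row in inventory
        if row["validation_expiry"] < today
    ]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)


def systems_per_provider(inventory):
    """Count systems by model provider as a simple concentration signal."""
    return Counter(row["model_provider"] for row in inventory)
```

Both functions read from the same inventory, which is the design point: portfolio views are only as reliable as the inventory that feeds them.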
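12. Interview Talking Points
12.1 30-Second Version
I do not equate GenAI validation with a model benchmark. For high-impact retail financial services use cases, I first define approved use, prohibited use, risk tier and the system inventory, and then validate conceptual soundness, process implementation, outcome results and monitoring readiness. The validation object includes the model, prompt, RAG, tool, workflow, HITL and business outcome. Finally, I use independent challenge to lock findings, remediation, release decision and revalidation triggers into an evidence chain.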
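12.2 2-Minute Version
My approach is to upgrade traditional SR 11-7 style model risk management into AI system validation. The first step is not running benchmarks but building a model / use / system inventory that makes clear which business process the AI system sits in, whom it affects, what it is allowed to do, what it is prohibited from doing, and what its risk tier is.
The second step is the validation plan. Conceptual soundness asks whether the design fits the use, covering model selection, prompt, RAG source, tool permission, HITL and monitoring design. Process verification asks whether the production implementation matches the approved version, covering prompt hash, index version, tool permission, workflow routing and log traceability. Outcome analysis looks at eval, red-team, expert review, pilot trace, customer impact and business results, and cannot rely on average accuracy alone.
The third step is independent challenge. The validation function needs independence, competence and influence, and can challenge risk tier, data evidence, RAG citations, tool permissions, the effectiveness of human review, monitoring blind spots and residual risk. The final deliverable is not a score sheet but a validation report, finding log, release restriction, risk acceptance and revalidation trigger.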
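12.3 Interview Follow-up: "The model benchmark is high, so why do we still need validation?"
Answer:
A benchmark only shows the model's capability on a particular test set or general task. It cannot prove that the model is safe and fit for use within my business process, data sources, prompt, RAG, tools and human controls. The key risks of AI in retail financial services often arise at the system level, for example an outdated policy cited by RAG, overly broad tool permissions, human review that cannot be operated in practice, customer complaints that are not escalated, or production monitoring that cannot see wrong citations. Validation tests the model's capability inside the real use case and control environment.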
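12.4 Interview Follow-up: "How do you design independent challenge?"
Answer:
I design independent challenge as an operating model with a mandate, not as meeting commentary. The validation team is independent of development and of business benefit targets, but must understand the business process, AI architecture, data and risk. Every challenge question must map to evidence or a finding. A High severity finding can block release or restrict scope. Any issue that cannot be fixed immediately must have a residual risk, owner, expiry and revalidation trigger. Only then does challenge have competence, independence and influence.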
12.5 Interview Follow-up: "What does a GenAI validation evidence pack include?"
Answer:
I prepare five categories of evidence. First, boundary evidence: use case, risk tier, approved/prohibited use. Second, architecture evidence: model, prompt, RAG, tool, workflow, HITL and logging. Third, testing evidence: eval contract, golden set, red-team, slice analysis and expert review. Fourth, operational evidence: production traces, human approval logs, monitoring dashboard, incident and complaint linkage. Fifth, governance evidence: validation report, findings, remediation, release decision, risk acceptance and revalidation triggers.
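12.6 Interview Follow-up: "With a CBAP background, what differentiates you?"
Answer:
CBAP gives me strengths in requirements, process, stakeholders, traceability and solution evaluation. In AI projects I upgrade these capabilities into a closed loop of requirement-to-eval-to-control-to-evidence. I do not only write user stories; I define the AI's decision boundary within the process, human control, failure modes, eval contract, release gate and monitoring trigger. For retail financial services use cases, this connects product value, model risk, architecture implementation and audit evidence.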
12.7 Portfolio Presentation
This document can be turned into a portfolio artifact:
Case:
Credit Card Dispute and Complaint AI Copilot
Artifacts:
1. AI system inventory
2. Validation plan
3. Eval contract and red-team case set
4. RAG citation audit
5. Tool permission matrix
6. HITL workflow evidence
7. Independent challenge memo
8. Finding and remediation log
9. Validation report
10. Portfolio evidence dashboard
Positioning:
This project demonstrates that I can move beyond AI feature design into AI product architecture, model risk, independent validation and governance-grade evidence.
13. Minimum Practical Checklist
For any high-impact AI use case, produce at least the following materials:
| Artifact | Minimum quality bar |
|---|---|
| AI system inventory | Covers model, prompt, RAG, tool, workflow, HITL, eval, monitoring and revalidation trigger |
| Approved / prohibited use | Clear enough to block misuse and scope expansion |
| Validation plan | Defines conceptual soundness, process verification and outcome analysis |
| Eval contract | Includes critical failures, slices, thresholds, evaluator and gate decision |
| Process verification checklist | Confirms production implementation matches approved design |
| Independent challenge questions | Challenges assumptions, evidence, results and residual risk |
| Finding log | Includes severity, release impact, owner, remediation, retest and risk acceptance |
| Validation report | Supports go / limited go / no-go / rollback / retire |
| Monitoring spec | Connects production signals to issue management and revalidation |
| Portfolio evidence view | Supports executive, audit and regulator questions |
Final mindset:
In GenAI, the model is only one component. The validation object is the AI system in use.