
AI Model Validation / Independent Challenge Playbook


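Positioning: A hands-on AI system validation handbook for senior, post-CBAP practitioners: AI PM, BA, Product Architect, Solutions Architect, Model Risk, Validation, and AI Governance.

Goal: Upgrade traditional model validation into GenAI system validation covering model, prompt, RAG, tool, workflow, HITL, eval, monitoring, business outcome, and revalidation triggers.

Core view: For a high-impact AI system, the object of validation is not a model benchmark but a socio-technical system operating within a specific business use, version boundary, set of data dependencies, human controls, operational monitoring, and risk appetite.

Important note: This document is learning, portfolio, and governance-design material. It is not legal advice, compliance advice, an audit opinion, a model validation conclusion, or a regulatory interpretation. Formal projects must be confirmed jointly by Legal, Compliance, Model Risk, Internal Audit, Security, Privacy, the Data Owner, business management, and the applicable regulatory relationships.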

1. Purpose / Audience / Core Views

1.1 Purpose

This playbook addresses one advanced question:

How do we validate an AI system, not just benchmark a model?

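In retail financial services, AI is no longer just an offline model. A customer service copilot, a credit policy RAG, an AML case narrative assistant, a complaint handling agent, or a wealth advisor assistant typically includes:

Foundation model or fine-tuned model.
System prompt, developer prompt, policy prompt, and output schema.
RAG retriever, embedding model, reranker, chunking, source repository, and index refresh.
Tool / API / workflow actions, such as querying accounts, creating cases, drafting notices, and updating CRM fields.
Human-in-the-loop or human-on-the-loop controls.
Eval contract, red-team cases, production monitoring, incident response, and change management.
Business outcomes, such as handling time, review quality, customer impact, complaint rate, compliance defects, and operating cost.

This document trains you to turn these objects into the following artifacts:

Artifact	Purpose
AI system inventory	Identifies the validation boundary, owner, version, use, risk tier, and dependencies
Validation plan	Defines validation objectives, scope, methods, samples, thresholds, evidence, and gate decision
Independent challenge memo	Records how the validator challenged design assumptions, implementation evidence, interpretation of results, and residual risk
Validation report	Supports the formal judgment for pilot / release / scale / restriction / retirement
Portfolio evidence pack	Supports Model Risk, AI Governance, Internal Audit, and regulatory inquiries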

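1.2 Audience

Role	Capability required	Training in this document
AI PM / Product Owner	Turn an AI idea into a product capability that can be validated, released, monitored, and held accountable	use case boundary, business outcome, release gate, revalidation trigger
AI BA / CBAP	Connect requirements, processes, controls, data, evidence, and stakeholder concerns	requirement-to-eval-to-control traceability
Product / Solution Architect	Design prompt, RAG, tool, workflow, HITL, logging, and fallback as a verifiable architecture	system inventory, process verification, architecture challenge
Model Risk / Validation	Perform independent validation and effective challenge, not just review model scores	conceptual soundness, process verification, outcome analysis
Internal Audit / Compliance	Check whether controls actually operate and whether evidence supports management assertions	evidence sufficiency, control operating effectiveness
Risk Executive	Decide whether to accept residual risk, restrict scope, delay launch, or require remediation	severity, remediation, risk acceptance

1.3 Core Views

Model validation must be upgraded to AI system validation.
A benchmark is only one piece of evidence; it cannot replace use-case-specific validation.
GenAI validation must cover the model, prompt, RAG, tools, workflow, people, monitoring, and outcomes.
The value of independent challenge is not fault-finding; it is making the release decision able to withstand scrutiny from the business, audit, regulators, and incident post-mortems.
For high-impact retail financial AI, validation conclusions must be bound to a version boundary, use boundary, restrictions, issue remediation, and revalidation triggers.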

2. Source Anchors

The sources below serve as anchors for framework language and governance. The access date is recorded as 2026-06-30. Formal projects should re-check the latest regulatory, institutional policy, and jurisdictional requirements.

Anchor	Official link	How this document uses it
Federal Reserve SR 11-7 / OCC 2011-12 model risk management guidance	https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm	Classic anchor for traditional MRM: model risk, effective challenge, conceptual soundness, ongoing monitoring, outcomes analysis, independence, governance
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize AI risk identification, measurement, control, monitoring, and continuous improvement
ISO/IEC 42001 AI management system	https://www.iso.org/standard/81230.html	Uses a management-system lens to organize AI policy, owner, risk process, operational control, performance evaluation, internal audit, and management review

Usage discipline:

Do not mechanically convert any source anchor into a checklist.
Do not use "we follow SR 11-7 / NIST / ISO" as a conclusion.
For every high-impact use case, turn the source anchors into verifiable artifacts: inventory, validation plan, evidence map, finding log, approval record, monitoring spec, and revalidation record.
For GenAI / RAG / Agent systems, traditional model validation language must be extended to system components, workflow controls, and the production learning loop.

3. One-Sentence Positioning

AI system validation = independent, risk-based evaluation of whether a specific AI system is conceptually sound, correctly implemented, fit for approved use, controlled in operation, delivering intended outcomes, and constrained by evidence-backed limits.

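Put plainly:

AI system validation is an independent risk judgment on whether a specific AI system is fit for use within its approved use, version boundary, business process, data dependencies, human controls, and production monitoring.

Minimal closed loop: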

Business use case
-> model / use / system inventory
-> risk tiering
-> validation plan
-> conceptual soundness review
-> process verification
-> outcome analysis
-> independent challenge
-> findings and remediation
-> release decision
-> monitoring
-> revalidation trigger

One-liner for interviews:

I would not validate a GenAI product by asking whether the base model has a strong benchmark. I would validate the whole deployed system against its approved use, evidence chain, human controls, tool authority, production monitoring, and business-risk outcomes.

4. GenAI System Validation != Model Benchmark

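4.1 What a Benchmark Answers

A benchmark typically answers:

Benchmark question	Typical evidence
How capable is this model on general tasks?	public benchmark, vendor report, internal comparison
How does this model perform on a given static test set?	offline eval score, accuracy, F1, pass rate, rubric score
Is model A better than model B?	head-to-head experiment, win rate, cost-latency-quality tradeoff
Does the model meet the baseline capability threshold?	summarization, classification, reasoning, language coverage

This evidence is valuable but not sufficient to support a retail financial AI launch, because benchmarks usually do not fully cover:

Input distributions and exception scenarios in the actual business process.
RAG source freshness, permission filtering, citation correctness, and policy conflict.
Prompt versions, tool schema, workflow routing, and HITL operation.
User over-trust, human review failure, handoff breakpoints, and operational risk.
Latency, fallback, logging, audit, monitoring, and incident response in production.
Customer impact, compliance defects, complaints, business KPIs, and residual risk.

4.2 What Validation Answers

Validation answers:

Validation question	Focus of judgment
Is this AI system fit for this approved use?	use case boundary, risk tier, prohibited use, human role
Is the design logic sound?	conceptual soundness, assumptions, limitations, architecture rationale
Does the implementation match the approved design?	prompt registry, model route, index version, tool permission, workflow config
Are the results acceptable?	eval results, slice performance, critical failure, business outcome, risk outcome
Do the controls actually operate?	HITL logs, approval records, monitoring alerts, incident drills, access review
Is the evidence sufficient to support release?	evidence quality, sample coverage, version boundary, owner attestation
What changes would invalidate the validation?	revalidation trigger, change impact, regression gate, rollback path

4.3 Key Differences

Dimension	Model benchmark	AI system validation
Object	Model capability	Business system and control environment
Scope	General tasks or a static test set	approved use, workflow, data, prompt, RAG, tool, human, monitoring
Output	Scores and rankings	release decision, use restriction, findings, residual risk
Evidence	eval score, benchmark report	validation plan, trace, config, sample review, control test, issue remediation
Accountability	Data science / platform	Business owner, Model Risk, AI governance, architecture, operations
Timing	Model selection or pre-release	Pre-release, at change, in production, post-incident, periodic review

Senior-level framing: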

A benchmark can inform model selection. It cannot prove that a deployed AI system is fit for a regulated workflow.

5. Model / Use / System Inventory

The first step in GenAI validation is not testing but registration and boundary definition. Without an inventory, validation has no object, version, owner, or chain of accountability.

5.1 Three-Layer Inventory

Inventory layer	Core question	Typical fields
Model inventory	Which models or model-like components are used	foundation model, embedding, reranker, judge, classifier, rules, provider, version, owner
Use inventory	In which business use and process step the model is applied	use case, approved use, prohibited use, risk tier, customer impact, decision boundary
System inventory	Which system components jointly produce the AI behavior	prompt, RAG, tool, workflow, HITL, eval, monitoring, fallback, evidence

5.2 Model Inventory Fields

Field	Definition	Validation significance
Component ID	Unique component identifier	Supports tracing and change impact analysis
Component type	LLM, embedding, reranker, judge, classifier, rules	Different components require different validation methods
Provider and hosting	Internal, third-party, managed cloud, self-hosted open source	Affects vendor risk, data handling, update control
Version boundary	Model name, version, deployment id, release family	Binds validation evidence and regression tests
Intended component role	Generation, retrieval, ranking, scoring, refusal, routing, classification	Prevents component misuse
Known limitations	Language, domain, context length, stability, bias, lack of explainability	Feeds into residual risk and user guidance
Change notice mechanism	vendor notice, internal release note, API change log	Determines revalidation triggers

5.3 Use Inventory Fields

Field	Definition	Retail financial example
Business workflow	Process into which AI is inserted	credit memo drafting, card dispute handling, AML alert review
User role	Direct user	underwriter, branch banker, call center agent, AML analyst
Affected party	Who is affected	customer, applicant, employee, regulator, business line
Approved use	The approved use	Generate cited internal policy answers and drafts
Prohibited use	Prohibited uses	Must not automatically approve or decline loans; must not promise fee waivers
Decision boundary	Which of read / summarize / recommend / draft / decide / act the AI performs	draft and recommend, no final decision
Risk tier	Risk tier and rationale	High, because it may affect customer rights and regulated processes
Human role	How human review works	human final decision, mandatory review for high-impact outputs
Fallback	Path when AI is unavailable or low-confidence	policy portal search, SME escalation, manual workflow

5.4 System Inventory Fields

Field	Definition	Validation focus
Prompt registry	system / developer / task prompt versions and owner	prompt drift, rule conflicts, approval records
RAG stack	source repository, chunking, embedding, retriever, reranker, index	stale source, wrong citation, permission leakage
Tool registry	tool name, scope, permission, side effect, approval rule	excessive agency, unauthorized action, audit trail
Workflow state	routing, handoff, exception, approval, rollback	Whether HITL is executable and the escalation path works
Eval contract	dataset, rubric, threshold, critical failures, slices	Whether the release gate is repeatable
Monitoring spec	quality, risk, cost, latency, adoption, complaint, override	Whether failure can be detected in production
Logging and trace	input reference, retrieved source, prompt version, model version, tool call, human action	Whether outputs are reproducible and auditable
Revalidation triggers	model/prompt/index/tool/use/process/regulatory changes	When to revalidate

5.5 Judging Inventory Maturity

Weak inventory	Mature inventory
"We use GPT-4 for customer service."	"Customer Service Fee Policy Copilot uses LLM deployment X, prompt v2.3, fee-policy index 2026Q2, approved FAQ source set, read-only account tools, human approval before customer response, and monitoring for unauthorized commitment."
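Registers only the model name	Registers the model, use, system components, business workflow, and control environment
Looks only at the current version	Records the version boundary, evidence boundary, and change triggers
Names only an IT owner	Names the business owner, model risk owner, data owner, control owner, and evidence owner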
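
The weak-versus-mature contrast can be made checkable. The sketch below is a minimal illustration, not a prescribed schema: the class `SystemInventoryRecord`, its field names, the owner-role keys, and the placeholder owner values are all assumptions introduced here. It reports which fields would leave a record at the "weak" end of the comparison.

```python
from dataclasses import dataclass

# Owner roles taken from the mature-inventory row; the key names are illustrative.
REQUIRED_OWNER_ROLES = ("business", "model_risk", "data", "control", "evidence")


@dataclass(frozen=True)
class SystemInventoryRecord:
    """Hypothetical minimal inventory record; field names are assumptions."""
    system_name: str
    model_deployment: str
    prompt_version: str
    index_version: str
    approved_use: str
    prohibited_use: tuple
    tools: tuple
    owners: dict  # role -> named owner


def inventory_gaps(record: SystemInventoryRecord) -> list:
    """Return human-readable gaps that would make the record a weak inventory."""
    gaps = []
    for name in ("model_deployment", "prompt_version", "index_version", "approved_use"):
        if not getattr(record, name).strip():
            gaps.append(f"missing {name}")
    if not record.prohibited_use:
        gaps.append("no prohibited use declared")
    for role in REQUIRED_OWNER_ROLES:
        if not record.owners.get(role):
            gaps.append(f"missing {role} owner")
    return gaps


# "We use GPT-4 for customer service": model name and an IT owner only.
weak = SystemInventoryRecord(
    system_name="Customer service bot", model_deployment="GPT-4",
    prompt_version="", index_version="", approved_use="",
    prohibited_use=(), tools=(), owners={"it": "Platform team"},
)

# Values echo the mature example in the table; owner names are placeholders.
mature = SystemInventoryRecord(
    system_name="Customer Service Fee Policy Copilot",
    model_deployment="LLM deployment X", prompt_version="v2.3",
    index_version="fee-policy index 2026Q2",
    approved_use="Answer fee policy questions from the approved FAQ source set",
    prohibited_use=("unauthorized commitment",),
    tools=("read-only account tools",),
    owners={role: f"{role} owner (placeholder)" for role in REQUIRED_OWNER_ROLES},
)

print(inventory_gaps(weak))    # lists every missing field and owner
print(inventory_gaps(mature))  # empty list
```

A check like this only proves that fields are populated. Whether the approved use is narrow enough or the owners are the right people remains a matter for independent challenge.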

6. Validation Domains

Traditional model validation commonly relies on conceptual soundness, ongoing monitoring, and outcomes analysis. For GenAI systems, this document recasts them as three actionable validation domains:

Conceptual soundness
Process verification
Outcome analysis

Here, process verification covers the traditional "implementation and use" review plus production control checks, and outcome analysis covers offline eval, online results, and business/risk outcomes.

6.1 Conceptual soundness

Core question:

Is the AI system design logically sound for the approved business use and risk tier?

Review area	Challenge question	Evidence
Business problem	Does AI solve a real process bottleneck? Is there a no-AI alternative?	opportunity memo, baseline measurement, workflow map
Approved use	Is the use boundary narrow enough? Are prohibited uses explicit?	use inventory, policy decision table
Risk tiering	Does the risk tier reflect customer impact, degree of automation, data sensitivity, and regulatory sensitivity?	risk tier worksheet, customer impact analysis
Model choice	Are model capability, limitations, vendor controls, and cost suitable for the use?	model selection memo, vendor due diligence
Prompt design	Does the prompt express policy boundaries, refusal rules, output schema, and escalation conditions?	prompt design note, prompt review record
RAG design	Are the source of truth, permissions, effective dates, chunking, and retrieval method sound?	RAG architecture, source inventory, retrieval eval
Tool design	Are tool permissions minimized and side effects controlled?	tool registry, permission matrix, approval flow
HITL design	Do human reviewers have the capability, time, authority, and evidence?	SOP, training record, review queue design
Eval design	Does the eval cover real failure modes, severity, slices, and gate decisions?	eval contract, critical failure taxonomy
Monitoring design	Does production monitoring cover quality, safety, process, cost, customer impact, and drift?	monitoring spec, KRI/KPI definitions

A high-quality conceptual soundness conclusion should state:

Design is fit for approved use, with stated limits.
Key assumptions are explicit and testable.
Critical risks have controls and evidence.
Known limitations are communicated to users and management.
Residual risks are either remediated, restricted, or accepted by the right owner.

6.2 Process verification

Core question:

Was the approved AI system implemented and operated as designed?

Verification area	Test method	Evidence
Model routing	Check that production calls use the approved model deployment	config snapshot, deployment log, API route record
Prompt version	Check that the prompt registry matches the production version	prompt hash, release record, change approval
RAG source	Check the source repository, effective date, access filter, and index version	source manifest, index build log, permission test
Retrieval behavior	Sample queries, retrieved chunks, citations, and final answers	trace sample, citation audit
Tool authority	Check allowlist, RBAC, rate limit, approval, and rollback	tool permission matrix, negative test
Workflow routing	Check handoff, exception, fallback, and review queue	workflow test evidence, case sample
HITL operation	Sample whether reviewers actually review and can override	review logs, edit diff, approval record
Logging	Check whether the output path and human actions can be reconstructed	traceability sample, log retention policy
Release gate	Check whether go / limited go / no-go decisions follow the evidence	release gate memo, sign-off record
Monitoring	Check whether dashboards, alerts, sampling review, and issue escalation are running	monitoring dashboard, alert history, issue log

Common process verification findings:

Finding	Risk
Prompt registry shows v2.1, but production calls actually use v2.3	Validation evidence and production behavior are inconsistent
RAG index does not record the policy effective date	Outdated policy may be cited
Tool allowlist includes write APIs, but validation tested only read APIs	Validation scope missed excessive agency
Human review log records only a confirmation click, with no diff between AI draft and human edit	Cannot prove that human control is effective
Monitoring tracks only latency and cost, not wrong citation or policy violation	Production risk is invisible
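
The first finding in the table (registry v2.1 versus production v2.3) is the kind of gap a prompt-hash comparison can catch. The sketch below is a minimal illustration under assumed data shapes: the registry dictionary layout, the function names, and the sample prompt text are hypothetical, and only the standard-library `hashlib.sha256` is used.

```python
import hashlib


def prompt_hash(text: str) -> str:
    """SHA-256 of the exact prompt text, used as a fingerprint of the approved wording."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_prompt(registry: dict, prod_version: str, prod_text: str) -> list:
    """Compare what production runs against what the registry says was approved."""
    findings = []
    if prod_version != registry["approved_version"]:
        findings.append(
            f"registry approves {registry['approved_version']}, production runs {prod_version}"
        )
    expected = registry["hashes"].get(prod_version)
    if expected is None:
        findings.append(f"{prod_version} has no registered hash")
    elif expected != prompt_hash(prod_text):
        findings.append(f"{prod_version} text differs from registered text")
    return findings


# Hypothetical registry entry and production snapshots.
approved_text = "Answer only from cited policy. Never promise fee reversal."
registry = {"approved_version": "v2.1", "hashes": {"v2.1": prompt_hash(approved_text)}}

drifted = verify_prompt(registry, "v2.3", approved_text + " Be concise.")
clean = verify_prompt(registry, "v2.1", approved_text)

print(drifted)  # two findings: version mismatch and unregistered version
print(clean)    # empty list
```

The same pattern extends to model deployment IDs and index versions: the validator compares a production config snapshot with the approved record rather than relying on the release note alone.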

6.3 Outcome analysis

Core question:

Are the AI system outputs and business outcomes acceptable within the approved risk appetite?

Outcome analysis is more than accuracy. It covers at least four layers of outcomes:

Outcome layer	Example metrics	How to interpret
AI output quality	groundedness, citation correctness, completeness, policy compliance, format validity	Slice by scenario, product, language, customer type, and risk tier
Risk outcome	critical failure, unauthorized action, PII leakage, under-escalation, complaint defect	High-impact scenarios usually apply a hard stop or near-zero tolerance
Workflow outcome	handling time, rework rate, escalation quality, review backlog, fallback rate	Look at efficiency and control burden together
Business outcome	first contact resolution, case quality, loss avoidance, customer harm, regulatory issue	Business improvement must not mask high-severity risk

Evidence types for outcome analysis:

Golden set eval and expert review.
Adversarial and red-team tests.
Historical incident replay.
Slice analysis by product, channel, language, customer segment and policy domain.
Human review defect analysis.
Production trace sampling.
Complaint, QA, audit and incident linkage.
Baseline vs pilot vs scaled rollout comparison.

Rules for interpreting results:

Result pattern	Senior-level interpretation
Average score high, critical failures present	Should not pass a high-impact release, because the average masks severe failures
Offline eval strong, production override high	May indicate real input distribution shift, user distrust, or workflow mismatch
Efficiency improved, review backlog increased	The control design may not be operable; capacity and triage need redesign
Citation correctness strong, answer completeness weak	RAG found the evidence but the generation layer did not use it fully
Model benchmark improved, business defect unchanged	The problem may lie in process, data, prompt, tool, or user behavior rather than the foundation model
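
The first interpretation rule (a high average must not override critical failures) can be sketched as a gate function. This is an illustrative sketch only: the result-record shape, the function name `evaluate_gate`, and the two thresholds are assumptions, and real thresholds would come from the eval contract.

```python
from collections import defaultdict
from statistics import mean


def evaluate_gate(results, min_avg=0.90, min_slice_avg=0.80):
    """Return (decision, reasons). Thresholds are placeholders, not recommendations."""
    reasons = []
    critical = [r for r in results if r["critical"]]
    if critical:
        # Hard stop: critical failures block release regardless of the average.
        reasons.append(f"{len(critical)} critical failure(s)")
    if mean(r["score"] for r in results) < min_avg:
        reasons.append("average score below threshold")
    by_slice = defaultdict(list)
    for r in results:
        by_slice[r["slice"]].append(r["score"])
    for name in sorted(by_slice):
        # A weak slice is reported even when the overall average looks healthy.
        if mean(by_slice[name]) < min_slice_avg:
            reasons.append(f"weak slice: {name}")
    return ("no-go" if reasons else "go", reasons)


# Hypothetical eval run: high average, one critical failure.
results = [{"slice": "fees", "score": 1.0, "critical": False} for _ in range(9)]
results.append({"slice": "disputes", "score": 0.95, "critical": True})

decision, reasons = evaluate_gate(results)
print(decision, reasons)
```

Here the average is well above the placeholder threshold, yet the decision is no-go because the critical-failure check is evaluated independently of the score.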

7. Independent Challenge Operating Model

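7.1 Definition of Independent Challenge

Independent challenge is a critical review of AI system design, implementation, results, and risk acceptance by people with the competence, independence, and influence to drive restriction, remediation, delayed launch, or revalidation.

In GenAI system validation, independent challenge must be able to question:

Whether the business use is overstated or mis-defined.
Whether the risk tier is underestimated.
Whether the prompt and RAG design have a real basis.
Whether tool permissions are too broad.
Whether HITL is only a formal control.
Whether the eval dataset covers real failure modes.
Whether the interpretation of results ignores high-severity slices.
Whether monitoring can detect the problems that would actually cause harm in production.
Whether management acceptance reflects an understanding of residual risk.

7.2 Operating model

Element	Design requirement
Mandate	The validation team has the authority to request evidence, raise findings, restrict release, and require revalidation
Independence	Validators should not be responsible for development, launch, or business benefit targets
Competence	The team must understand the business process, AI architecture, data, model risk, control testing, and retail financial risk
Influence	High-severity findings enter the release gate, and management must explicitly disposition them
Traceability	Every challenge question maps to evidence, a finding, remediation, or risk acceptance
Escalation	Material disagreements have a decision path through the model risk committee / AI governance committee

7.3 Three Lines of Defense View

Line	Primary responsibility	What it cannot do or replace
First line - Business / Product / Engineering	Design, build, test, and run the AI system; maintain evidence	Cannot replace independent validation
Second line - Model Risk / Risk / Compliance	Set standards, validate independently, challenge findings, provide risk oversight	Cannot own product benefit targets
Third line - Internal Audit	Review whether governance and controls are designed and operating effectively	Cannot act as the pre-release quality team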

7.4 RACI

Activity	Business Owner	Product / BA	Architect / Engineering	Data Owner	Model Risk / Validation	Compliance / Legal	Internal Audit
Use case boundary	A	R	C	C	C	C	I
Risk tiering	A	R	C	C	C	C	I
System inventory	A	R	R	R	C	C	I
Eval contract	A	R	R	C	C	C	I
Validation plan	C	C	C	C	A/R	C	I
Process verification evidence	C	R	R	R	A/R	C	I
Findings severity	C	C	C	C	A/R	C	I
Remediation	A	R	R	R	C	C	I
Risk acceptance	A	C	C	C	C	C	I
Audit review	I	I	I	I	C	C	A/R

Legend:

R = Responsible.
A = Accountable.
C = Consulted.
I = Informed.

7.5 Challenge forum cadence

Cadence	Forum	Decision
Intake	AI use case triage	risk tier, validation depth, pilot boundary
Pre-pilot	Design challenge	approve pilot, restrict scope, require redesign
Pre-release	Validation challenge	go, limited go, no-go, remediation before release
Post-release	Monitoring review	continue, scale, restrict, rollback, revalidate
Triggered	Change / incident challenge	regression eval, emergency restriction, full revalidation
Quarterly	Portfolio challenge	concentration risk, overdue findings, evidence gaps

8. Validation Evidence

8.1 Evidence principle

Good evidence is not "a document exists". Good evidence proves:

The right control operated
on the right AI system version
for the right use case boundary
over the right sample or time window
with the right owner review
and with failures handled through a tracked decision.

8.2 Evidence stack

Evidence domain	Evidence examples	Validation use
Business and use	use case memo, workflow map, approved/prohibited use, risk tier worksheet	boundary and materiality
Architecture	system diagram, data flow, RAG design, tool registry, HITL workflow	conceptual soundness
Data and RAG	source inventory, lineage, access control, index build log, citation audit	retrieval and evidence grounding
Prompt and policy	prompt registry, prompt diff, review record, policy mapping	behavior control
Model and vendor	model card, selection memo, vendor review, update notice process	component risk
Eval	eval contract, golden set, rubric, run results, slice analysis, expert review	release readiness
Red-team	adversarial cases, prompt injection test, sensitive data test, tool abuse test	stress and misuse
Process verification	config snapshot, deployment log, negative test, trace sample, workflow walkthrough	implementation fidelity
HITL	reviewer training, approval log, override log, human edit diff, QA sample	human control effectiveness
Monitoring	dashboard, alert rules, sample review, complaint linkage, incident log	ongoing performance
Findings	issue log, severity, owner, remediation evidence, retest result	risk reduction
Decision	release memo, management sign-off, risk acceptance, restrictions, expiry	governance and accountability

8.3 Evidence quality standards

Standard	What it means
Versioned	Evidence states model, prompt, index, tool and workflow versions
Traceable	Evidence links to use case, requirement, risk, control and gate decision
Reproducible	A reviewer can reconstruct test method and sample selection
Risk-based	High-risk claims have stronger evidence and more independent review
Time-bounded	Evidence has date, covered period and expiry or review cadence
Owner-backed	A named owner attests to evidence accuracy and operating status
Failure-aware	Evidence includes failures, exceptions and remediation, not only pass results
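
Several of these standards can be screened mechanically before a reviewer spends time on substance. The sketch below is a hedged illustration: the evidence-record keys (`versions`, `linked_control`, `expiry`, `owner`, `failures_recorded`), the function name, and the sample record are assumptions introduced here. It covers only the standards that reduce to field checks; Reproducible and Risk-based still need human judgment.

```python
from datetime import date

REQUIRED_VERSIONS = ("model", "prompt", "index", "tool", "workflow")


def evidence_gaps(evidence: dict, today: date) -> list:
    """Screen one evidence record against the field-checkable quality standards."""
    gaps = []
    versions = evidence.get("versions", {})
    missing = [k for k in REQUIRED_VERSIONS if not versions.get(k)]
    if missing:
        gaps.append("not versioned: " + ", ".join(missing))
    if not evidence.get("linked_control"):
        gaps.append("not traceable: no linked control")
    expiry = evidence.get("expiry")
    if expiry is None:
        gaps.append("not time-bounded: no expiry")
    elif expiry < today:
        gaps.append("not time-bounded: expired")
    if not evidence.get("owner"):
        gaps.append("not owner-backed: no named owner")
    if "failures_recorded" not in evidence:
        gaps.append("not failure-aware: failure count not recorded")
    return gaps


# Hypothetical eval report: no index version, no expiry, no failure count.
eval_report = {
    "versions": {"model": "deployment-X", "prompt": "v2.3", "tool": "t1", "workflow": "w4"},
    "linked_control": "CTRL-017",
    "owner": "Eval lead (placeholder)",
}

gaps = evidence_gaps(eval_report, today=date(2026, 6, 30))
print(gaps)
```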

8.4 Evidence graph view

Claim: AI does not make final credit decision
-> Risk: automated adverse customer impact
-> Control objective: human remains final decision maker
-> Control activity: no approve/decline tool permission, mandatory underwriter review
-> Test: negative API test, workflow walkthrough, case sample
-> Evidence: permission matrix, trace logs, human approval records
-> Decision: release allowed with no direct decision automation
-> Monitoring: monthly sample of AI-influenced decisions and override rate

8.5 Evidence anti-patterns

Anti-pattern	Why it fails independent challenge
Only vendor benchmark	Does not prove approved-use fitness
Screenshot of working UI	Does not prove control design or operation
Average score only	Hides severe failures and weak slices
Unversioned eval report	Cannot bind evidence to released system
Manual review claim without logs	Cannot prove HITL operated
Monitoring dashboard without thresholds	No decision rule for action
Risk acceptance without expiry	Residual risk becomes unmanaged

9. Revalidation Triggers

Validation is not a one-time release activity. A GenAI system must define triggers that invalidate or weaken prior evidence.

Trigger category	Trigger examples	Required response
Model change	foundation model upgrade, new deployment, context length change, temperature change	regression eval, limitation review, release gate update
Prompt change	system instruction, policy prompt, output schema, refusal rule	prompt diff review, targeted eval, approval record
RAG change	new source repository, index rebuild, embedding model change, chunking change, permission filter update	retrieval eval, citation audit, access test
Tool change	new API, expanded permission, write action, batch action, approval bypass	tool risk review, negative tests, HITL verification
Workflow change	new user role, new channel, automation step, exception path	process walkthrough, control test, training update
Use expansion	customer-facing expansion, new product, new geography, new customer segment	risk tier reassessment, validation scope expansion
Data shift	policy changes, new product terms, seasonal volume, language mix change	sample refresh, outcome analysis, monitoring threshold review
Monitoring signal	critical failures, complaint spike, wrong citation trend, override anomaly	incident triage, issue remediation, targeted revalidation
Regulatory or policy change	new internal AI policy, supervisory request, legal interpretation change	control mapping refresh, management review
Vendor event	model behavior incident, SLA breach, data handling change, subprocessor change	vendor risk review, contingency test

Rule of thumb:

If a change can alter the AI output, evidence source, tool authority, human control, customer impact or risk interpretation, it should trigger at least targeted revalidation.
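
One way to operationalize the rule of thumb is to diff the validated version snapshot against the current production snapshot and look up the required response. The sketch below is illustrative only: the snapshot keys, the function names, and the sample versions are assumptions, the response lists are copied from the first four table rows, and defaulting unknown components to "targeted revalidation" is a conservative assumption made here.

```python
# Responses copied from the model, prompt, RAG and tool rows; key names are illustrative.
TRIGGER_RESPONSES = {
    "model": ["regression eval", "limitation review", "release gate update"],
    "prompt": ["prompt diff review", "targeted eval", "approval record"],
    "index": ["retrieval eval", "citation audit", "access test"],
    "tools": ["tool risk review", "negative tests", "HITL verification"],
}


def changed_components(validated: dict, current: dict) -> list:
    """Components whose version differs from, or is absent in, the validated snapshot."""
    keys = set(validated) | set(current)
    return sorted(k for k in keys if validated.get(k) != current.get(k))


def required_responses(validated: dict, current: dict) -> dict:
    """Map each changed component to its revalidation response."""
    plan = {}
    for component in changed_components(validated, current):
        # Assumption: unmapped components still get at least targeted revalidation.
        plan[component] = TRIGGER_RESPONSES.get(component, ["targeted revalidation"])
    return plan


# Hypothetical snapshots: the prompt and the index changed after validation.
validated = {"model": "deployment-X", "prompt": "v2.3", "index": "2026Q2", "tools": "read-only"}
current = {"model": "deployment-X", "prompt": "v2.4", "index": "2026Q3", "tools": "read-only"}

plan = required_responses(validated, current)
print(plan)
```

A diff like this only catches changes that are versioned. Use expansion, data shift, and regulatory change still need a human-owned intake process.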

10. Financial Retail Case: Credit Card Dispute and Complaint AI Copilot

10.1 Use case

Use case:
Credit Card Dispute and Complaint AI Copilot

Users:
Call center agents, complaint operations analysts, QA reviewers

Approved use:
Summarize customer interaction, retrieve approved dispute policy, draft internal case note, suggest escalation reason with citations.

Prohibited use:
Do not make final dispute decision.
Do not promise fee reversal, provisional credit or compensation.
Do not send customer response without human approval.
Do not override complaint classification or regulatory clock.

10.2 System architecture

Component	Design
Foundation model	Approved enterprise LLM deployment with logging and no training on customer data
Prompt	Role, policy boundary, citation requirement, refusal behavior, escalation rules
RAG	Approved policy repository, dispute procedure, complaint taxonomy, product terms, effective-date metadata
Tools	Read-only account context, case lookup, draft note creation, no direct customer communication
Workflow	Agent asks question -> RAG answer -> draft note -> human review -> QA sampling
HITL	Human must approve customer-facing language and final disposition
Eval	Dispute policy golden set, complaint escalation cases, red-team misuse cases, historical defect replay
Monitoring	wrong citation rate, unauthorized commitment, under-escalation, human edit rate, complaint QA defect

10.3 Risk profile

Risk	Why material	Control
Unauthorized commitment	AI may imply fee waiver, provisional credit or compensation	forbidden commitment eval, response policy, human approval
Wrong policy citation	Agent may rely on outdated or irrelevant policy	active source filter, citation audit, source effective date
Under-escalation	Complaint or regulatory-sensitive case may not be escalated	escalation classifier, high-risk case eval, QA sampling
Privacy leakage	Customer data may appear in prompt, log or generated note beyond need	data minimization, log policy, access control
Over-reliance	Agent may accept AI draft without reading evidence	UI evidence display, reviewer training, edit diff monitoring
Tool misuse	AI may update case fields incorrectly if tool authority expands	read-only tools, approval workflow, negative tool tests

10.4 Validation plan summary

Validation domain	Example test
Conceptual soundness	Review whether AI role is limited to summarize / retrieve / draft / suggest, not decide / act
Prompt review	Test whether prompt blocks commitments and requires evidence-backed caveats
RAG review	Test active policy retrieval, stale policy rejection and citation correctness
Tool review	Confirm no write authority for final disposition or customer communication
HITL review	Sample draft-to-final diff and verify human approval before sending
Outcome analysis	Compare pre-pilot and pilot QA defects, handling time and complaint escalation quality
Monitoring review	Confirm alert thresholds for unauthorized commitment and under-escalation
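
The tool review row can be backed by a negative permission test: every prohibited action is attempted and must be refused. The sketch below is a minimal illustration, assuming a simple allowlist model. All tool names, the `ToolNotPermitted` exception, and the `authorize` function are hypothetical stand-ins for whatever the real tool gateway exposes.

```python
class ToolNotPermitted(Exception):
    """Raised when the agent requests a tool outside its allowlist."""


# Hypothetical tool names mapped to their side-effect class.
ALLOWED_TOOLS = {
    "get_account_context": "read",
    "lookup_case": "read",
    "create_draft_note": "draft",
}

# Hypothetical names for actions the case prohibits.
PROHIBITED_TOOLS = (
    "send_customer_message",
    "set_final_disposition",
    "update_complaint_classification",
)


def authorize(tool_name: str, allowlist: dict) -> str:
    """Return the side-effect class of an allowed tool, or refuse."""
    if tool_name not in allowlist:
        raise ToolNotPermitted(tool_name)
    return allowlist[tool_name]


def run_negative_tests(prohibited, allowlist: dict) -> list:
    """Return prohibited tools that were NOT blocked; an empty list means pass."""
    leaked = []
    for name in prohibited:
        try:
            authorize(name, allowlist)
        except ToolNotPermitted:
            continue
        leaked.append(name)
    return leaked


leaked = run_negative_tests(PROHIBITED_TOOLS, ALLOWED_TOOLS)

# Simulate permission creep: a write tool is added to the allowlist after validation.
expanded = dict(ALLOWED_TOOLS, set_final_disposition="write")
leaked_after_change = run_negative_tests(PROHIBITED_TOOLS, expanded)

print(leaked)               # empty list
print(leaked_after_change)  # the tool that slipped through
```

Running the same test after every tool registry change turns "Any tool write permission" from a documented trigger into an enforced one.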

10.5 Independent challenge examples

Challenge	Evidence requested	Possible finding
Does the system know when not to answer?	unanswerable / policy-conflict eval cases	Medium if refusal behavior is inconsistent
Can RAG retrieve stale policy?	index manifest and effective-date test	High if outdated policy can be cited
Is human review real or ceremonial?	human edit diff, approval time, QA sample	High if approval is one-click without evidence review
Does the system change complaint classification?	tool permission matrix, workflow trace	Critical if AI can alter regulatory clock fields
Are high-risk complaints monitored post-release?	monitoring dashboard and alert history	High if no under-escalation metric exists
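
For the "real or ceremonial" challenge, the human edit diff and approval time can be turned into a screening signal. The sketch below is a rough heuristic, not a validated method: the review-record shape, the function name, and the 20-second threshold are assumptions. An unchanged, fast approval is only a prompt for QA sampling, since a correct draft legitimately needs no edits. It uses the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def review_flags(ai_draft: str, final_text: str, approval_seconds: float,
                 min_seconds: float = 20.0) -> dict:
    """Flag approvals that were both unedited and faster than a placeholder threshold."""
    similarity = SequenceMatcher(None, ai_draft, final_text).ratio()
    unchanged = ai_draft == final_text
    fast = approval_seconds < min_seconds
    return {
        "similarity": round(similarity, 3),
        "unchanged": unchanged,
        "fast": fast,
        "needs_qa_sample": unchanged and fast,
    }


# Hypothetical review log entries: (AI draft, final note, seconds to approve).
reviews = [
    ("Customer disputes a fee.", "Customer disputes a fee.", 3.0),
    ("Customer disputes a fee.", "Customer disputes a late fee; see policy 4.2.", 95.0),
    ("Escalate to complaints.", "Escalate to complaints.", 5.0),
]

flags = [review_flags(draft, final, secs) for draft, final, secs in reviews]
flagged = [f for f in flags if f["needs_qa_sample"]]
print(len(flagged), "of", len(flags), "approvals need QA sampling")
```

A high share of flagged approvals would support the High finding in the table; a low share does not by itself prove that reviewers read the evidence.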

10.6 Release decision example

Decision element	Example
Decision	Limited release to 50 trained agents in one card product line
Conditions	No direct customer send, no disposition update, mandatory human approval, weekly QA sample
Blockers closed	Stale policy retrieval fixed with active-source filter and retested
Open medium issue	Refusal wording inconsistent for policy-conflict cases, mitigated with escalation banner and targeted monitoring
Revalidation trigger	Any tool write permission, new product line, policy repository migration, or model deployment upgrade

11. Templates

11.1 Validation Plan Template

# AI System Validation Plan

## 1. System and Use Boundary
- System ID:
- System name:
- Business owner:
- Product / BA owner:
- Technical owner:
- Model risk owner:
- Approved use:
- Prohibited use:
- Risk tier and rationale:
- Customer / employee / regulatory impact:

## 2. System Components
- Foundation model:
- Prompt versions:
- RAG sources and index versions:
- Embedding / reranker / judge components:
- Tools and permissions:
- Workflow states:
- HITL checkpoints:
- Monitoring scope:

## 3. Validation Objectives
- Conceptual soundness objective:
- Process verification objective:
- Outcome analysis objective:
- Independent challenge objective:
- Release decision supported:

## 4. Validation Scope
- In scope:
- Out of scope with rationale:
- Version boundary:
- User population:
- Channel / product / geography boundary:
- Data and knowledge boundary:

## 5. Test and Evidence Plan
| Domain | Test method | Sample / data | Metric / threshold | Evidence | Owner |
|---|---|---|---|---|---|
| Conceptual soundness | Design review | Architecture and requirements | Fit for approved use | Review memo | Model Risk |
| RAG | Retrieval eval | Golden policy questions | Wrong citation <= defined limit, critical stale citation = 0 | Retrieval report | Data Owner |
| Tool | Negative permission test | Tool abuse cases | Unauthorized action = 0 | Test log | Engineering |
| HITL | Case sample review | Production pilot sample | Required approvals present | Approval sample | Operations |
| Outcome | Expert review | Golden set and pilot traces | Critical failure = 0 | Eval report | Validation |

## 6. Findings and Gate Rules
- Critical finding rule:
- High finding rule:
- Medium finding rule:
- Low finding rule:
- Risk acceptance authority:
- Release options: go, limited go, no-go, rollback, retire

## 7. Monitoring and Revalidation
- Production metrics:
- Alert thresholds:
- Sampling cadence:
- Revalidation triggers:
- Evidence retention:

11.2 Independent Challenge Questions Template

Challenge domain	Questions
Business use	What decision or workflow does AI influence? Is the approved use narrow enough? What is explicitly prohibited?
Risk tier	Could the system affect customer rights, pricing, credit, complaints, AML, fraud, privacy or regulatory reporting?
Model	Why was this model selected? What limitations matter for this use case? How are vendor changes detected?
Prompt	Are policy boundaries, refusal rules, evidence requirements and escalation conditions encoded and tested?
RAG	Are sources approved, current, permission-filtered and traceable? Can the system cite stale or irrelevant evidence?
Tool	What can the AI execute? Which actions are read-only, draft-only, approval-required or prohibited?
Workflow	Where can handoff fail? What happens when AI is uncertain, unavailable or conflicts with policy?
HITL	Does the human have enough context, time, authority and training to challenge AI output?
Eval	Does the eval set include edge cases, historical incidents, adversarial cases and high-risk slices?
Monitoring	Which production signals would reveal hallucination, wrong citation, over-reliance, under-escalation or tool misuse?
Evidence	Can a reviewer reconstruct a critical output from input to source to prompt to model to tool to human action?
Decision	Who can accept residual risk, for how long, under what conditions and with what revalidation triggers?

11.3 Finding Severity and Remediation Template

Severity	Definition	Release impact	Remediation expectation	Example
Critical	Failure can cause unauthorized customer-impacting action, regulatory-sensitive error, data leakage or unbounded automation	No-go or immediate rollback	Fix before release, retest, management review	AI can submit final dispute decision through tool
High	Material control gap or quality failure in approved use, with plausible customer or compliance impact	No-go for full release, limited pilot only with strong compensating controls	Fix or restrict scope before scale	RAG can cite outdated complaint policy
Medium	Weakness that can degrade quality, auditability or operational control but has containment	Release may proceed with conditions	Remediate by agreed date, monitor with owner	Refusal wording inconsistent for policy-conflict cases
Low	Documentation, minor usability or evidence clarity issue with limited risk impact	Does not block release	Address in normal backlog with evidence update	Evidence index missing reviewer title
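
The release-impact column can be read as a decision rule over open findings. The sketch below is one possible encoding under stated assumptions: the finding-record keys (`severity`, `status`, `compensating_controls`), the function name, and the sample findings are hypothetical, and the condition strings paraphrase the table rather than quote a policy.

```python
def gate_decision(findings: list) -> tuple:
    """Map open findings to (decision, conditions) following the severity table."""
    open_findings = [f for f in findings if f["status"] == "open"]

    def with_severity(level: str) -> list:
        return [f for f in open_findings if f["severity"] == level]

    if with_severity("Critical"):
        return "no-go", ["fix before release", "retest", "management review"]
    highs = with_severity("High")
    if highs:
        # A limited pilot is possible only if every High has compensating controls.
        if all(f.get("compensating_controls") for f in highs):
            return "limited go", ["limited pilot only", "fix or restrict scope before scale"]
        return "no-go", ["fix or restrict scope before scale"]
    if with_severity("Medium"):
        return "go", ["remediate by agreed date", "monitor with owner"]
    return "go", []


# Hypothetical finding register.
findings = [
    {"id": "F-01", "severity": "Critical", "status": "closed"},
    {"id": "F-02", "severity": "High", "status": "open",
     "compensating_controls": ["active-source filter", "weekly QA sample"]},
    {"id": "F-03", "severity": "Low", "status": "open"},
]

decision, conditions = gate_decision(findings)
print(decision, conditions)
```

Encoding the rule does not replace the risk acceptance authority. It makes the default outcome explicit, so that any deviation has to be recorded as a decision.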

Remediation record:

Field	Description
Finding ID	Unique issue ID linked to validation report
Severity	Critical / High / Medium / Low
Root cause	Design, data, prompt, RAG, tool, workflow, monitoring, governance
Required action	Specific change needed
Owner	Named accountable owner
Due date	Date tied to release or monitoring cadence
Evidence required	Test report, config diff, trace sample, approval, monitoring update
Retest result	Passed, partially passed, failed, restricted
Residual risk	Remaining limitation after remediation
Risk acceptance	Approver, condition, expiry, review trigger

11.4 Validation Report Template

# AI System Validation Report

## 1. Executive Conclusion
- System:
- Use case:
- Risk tier:
- Validation period:
- Overall decision:
- Key restrictions:
- Material residual risks:

## 2. Scope and Version Boundary
- Model version:
- Prompt version:
- RAG source and index version:
- Tool permission version:
- Workflow version:
- User and channel boundary:

## 3. Validation Work Performed
- Conceptual soundness review:
- Process verification:
- Outcome analysis:
- Red-team / adversarial testing:
- HITL and workflow review:
- Monitoring readiness review:
- Evidence review:

## 4. Results
| Domain | Result | Key evidence | Finding IDs |
|---|---|---|---|
| Conceptual soundness |  |  |  |
| RAG |  |  |  |
| Tool authority |  |  |  |
| HITL |  |  |  |
| Outcome analysis |  |  |  |
| Monitoring |  |  |  |

## 5. Findings
| Finding ID | Severity | Description | Required action | Owner | Due date | Release impact |
|---|---|---|---|---|---|---|

## 6. Limitations
- Validation limitations:
- Known system limitations:
- Evidence limitations:
- Use restrictions:

## 7. Decision and Conditions
- Gate decision:
- Required restrictions:
- Required monitoring:
- Revalidation triggers:
- Risk acceptance:

## 8. Appendices
- Inventory record:
- Architecture diagram:
- Eval contract:
- Eval results:
- Trace samples:
- Control test evidence:
- Monitoring spec:
- Approval record:

11.5 Portfolio Evidence Template

Portfolio question	Evidence view	Example metrics
How many high-risk AI systems are in production?	AI system inventory by risk tier and business line	count by tier, release stage, owner
Which systems have overdue validation?	validation schedule and expiry	overdue count, days overdue
Which components create concentration risk?	shared model / vendor / RAG source / tool dependency map	systems per provider, shared source count
What findings remain open?	finding register	open critical/high, aging, owner
Which systems rely on HITL as key control?	HITL control inventory	review volume, backlog, defect rate
Which changes triggered revalidation?	change and revalidation log	model changes, prompt changes, index rebuilds, tool changes
Which evidence supports management attestation?	evidence graph and release memo index	evidence freshness, missing evidence, owner attestation
What production signals indicate drift or misuse?	monitoring dashboard	wrong citation, policy violation, override, complaint linkage
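
Several of the example metrics in the table above can be derived from two simple registers. The sketch below assumes a hypothetical inventory and finding register held as lists of dictionaries; the field names, system names and the `portfolio_view` function are illustrative only, and a real implementation would read from the inventory and issue-management systems of record.

```python
from collections import Counter
from datetime import date

# Hypothetical inventory rows (illustrative data, not real systems)
systems = [
    {"system": "dispute-copilot", "risk_tier": "High",
     "validation_expiry": date(2025, 1, 31), "model_provider": "vendor-a"},
    {"system": "policy-rag", "risk_tier": "High",
     "validation_expiry": date(2025, 9, 30), "model_provider": "vendor-a"},
    {"system": "faq-bot", "risk_tier": "Low",
     "validation_expiry": date(2025, 6, 30), "model_provider": "vendor-b"},
]

# Hypothetical finding register
findings = [
    {"finding_id": "F-001", "system": "dispute-copilot", "severity": "High",
     "opened": date(2025, 1, 10), "closed": None},
    {"finding_id": "F-002", "system": "policy-rag", "severity": "Low",
     "opened": date(2025, 2, 1), "closed": None},
    {"finding_id": "F-003", "system": "policy-rag", "severity": "Critical",
     "opened": date(2024, 12, 1), "closed": date(2025, 1, 15)},
]


def portfolio_view(systems, findings, as_of):
    """Compute a few portfolio metrics as of a given date."""
    # Days overdue for every system whose validation has expired
    overdue = {
        s["system"]: (as_of - s["validation_expiry"]).days
        for s in systems
        if s["validation_expiry"] < as_of
    }
    # Open findings at Critical or High severity
    open_major = [
        f for f in findings
        if f["closed"] is None and f["severity"] in {"Critical", "High"}
    ]
    return {
        "count_by_tier": dict(Counter(s["risk_tier"] for s in systems)),
        "overdue_days": overdue,
        "open_critical_high": [f["finding_id"] for f in open_major],
        "max_open_age_days": max(
            ((as_of - f["opened"]).days for f in open_major), default=0
        ),
        # Concentration view: how many systems depend on each provider
        "systems_per_provider": dict(Counter(s["model_provider"] for s in systems)),
    }


view = portfolio_view(systems, findings, as_of=date(2025, 3, 1))
for metric, value in view.items():
    print(metric, value)
```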

Portfolio evidence should support these executive questions:

Are high-risk AI systems known, owned and validated?
Are independent challenge findings being remediated on time?
Are release decisions tied to evidence, not optimism?
Are revalidation triggers actually firing?
Are business benefits being achieved without unacceptable risk?
Are shared vendors, models, data sources or tools creating aggregate risk?

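12. Interview Talking Points

12.1 30-Second Version

I do not equate GenAI validation with model benchmarking. For high-impact retail financial services use cases, I first define approved use, prohibited use, risk tier and the system inventory, and then validate conceptual soundness, process implementation, outcome results and monitoring readiness. The validation object includes model, prompt, RAG, tool, workflow, HITL and business outcome. Finally, I use independent challenge to lock findings, remediation, release decisions and revalidation triggers into an evidence chain.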

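12.2 2-Minute Version

My approach is to extend traditional SR 11-7 style model risk management into AI system validation. The first step is not scoring. It is building a model / use / system inventory that makes clear which business process the AI system sits in, whom it affects, what it is allowed to do, what it is prohibited from doing, and what its risk tier is.

The second step is the validation plan. Conceptual soundness asks whether the design fits the intended use, covering model selection, prompt, RAG source, tool permissions, HITL and monitoring design. Process verification asks whether the production implementation matches the approved version, covering prompt hash, index version, tool permission, workflow routing and log traceability. Outcome analysis looks at eval, red-team, expert review, pilot traces, customer impact and business results; average accuracy alone is not enough.

The third step is independent challenge. The validation function needs independence, competence and influence, and it can challenge the risk tier, data evidence, RAG citations, tool permissions, the effectiveness of human review, monitoring blind spots and residual risk. The final deliverable is not a score sheet but a validation report, finding log, release restrictions, risk acceptance and revalidation triggers.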
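
The process verification step in this answer (prompt hash, index version, tool permission) can be sketched as a comparison between an approved baseline and the deployed configuration. This is a minimal illustration under stated assumptions: the manifest keys, version strings, permission names and the `verify_deployment` function are hypothetical, and only `hashlib.sha256` is a real library call.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """SHA-256 of the prompt text, so any edit to the prompt changes the hash."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


# Hypothetical approved prompt and version boundary
APPROVED_PROMPT = (
    "You are a dispute-handling assistant. "
    "Cite the policy source for every answer."
)

approved = {
    "model_version": "model-2025-01",
    "prompt_sha256": prompt_hash(APPROVED_PROMPT),
    "index_version": "policy-index-v12",
    "workflow_version": "wf-3",
    "tool_permissions": ["case.create", "crm.read"],
}

# Hypothetical production state: the prompt was edited and a tool was added
deployed = {
    "model_version": "model-2025-01",
    "prompt_sha256": prompt_hash(APPROVED_PROMPT + " Be concise."),
    "index_version": "policy-index-v12",
    "workflow_version": "wf-3",
    "tool_permissions": ["case.create", "crm.read", "crm.write"],
}


def verify_deployment(approved: dict, deployed: dict) -> list[str]:
    """List every way the deployed configuration departs from the baseline."""
    issues = []
    for key in ("model_version", "prompt_sha256", "index_version", "workflow_version"):
        if approved[key] != deployed.get(key):
            issues.append(f"{key} differs from approved baseline")
    # Permissions present in production but never approved
    extra = sorted(
        set(deployed.get("tool_permissions", [])) - set(approved["tool_permissions"])
    )
    if extra:
        issues.append("unapproved tool permissions: " + ", ".join(extra))
    return issues


issues = verify_deployment(approved, deployed)
for issue in issues:
    print(issue)
```

Each reported mismatch is a candidate finding: either the change went through change management and the baseline is stale, or production has drifted from what was validated.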

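12.3 Follow-up Question: "The model benchmark score is high, so why do we still need validation?"

Answer:

A benchmark only shows what the model can do on a particular test set or general task. It does not prove that the model is safe and fit for use within my business process, data sources, prompt, RAG, tools and human controls. The key risks of AI in retail financial services often arise at the system level: RAG citing an expired policy, tool permissions that are too broad, human review that cannot be operated in practice, customer complaints that are not escalated, or production monitoring that cannot see wrong citations. Validation tests the model's capability inside the real use case and control environment.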

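12.4 Follow-up Question: "How would you design independent challenge?"

Answer:

I would design independent challenge as an operating model with a mandate, not as commentary in a meeting. The validation team is independent of development and of business benefit targets, but it must understand the business process, AI architecture, data and risk. Every challenge question must map to evidence or to a finding. A high-severity finding can block release or restrict scope. Any issue that cannot be fixed immediately must have a residual risk, owner, expiry and revalidation trigger. That is what gives challenge its competence, independence and influence.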

12.5 Follow-up Question: "What does a GenAI validation evidence pack include?"

Answer:

I would prepare five types of evidence. The first is boundary evidence: use case, risk tier and approved/prohibited use. The second is architecture evidence: model, prompt, RAG, tool, workflow, HITL and logging. The third is testing evidence: eval contract, golden set, red-team, slice analysis and expert review. The fourth is operational evidence: production traces, human approval logs, monitoring dashboard, and incident and complaint linkage. The fifth is governance evidence: validation report, findings, remediation, release decision, risk acceptance and revalidation triggers.

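12.6 Follow-up Question: "With a CBAP background, what differentiates you?"

Answer:

CBAP gives me strengths in requirements, process, stakeholders, traceability and solution evaluation. In AI projects I extend these into a closed loop of requirement-to-eval-to-control-to-evidence. I do not just write user stories; I define the AI's decision boundary within the process, the human control, failure modes, eval contract, release gate and monitoring trigger. For retail financial services use cases, this connects product value, model risk, architecture implementation and audit evidence.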

12.7 Portfolio Presentation

This document can be turned into a portfolio artifact:

Case:
Credit Card Dispute and Complaint AI Copilot

Artifacts:
1. AI system inventory
2. Validation plan
3. Eval contract and red-team case set
4. RAG citation audit
5. Tool permission matrix
6. HITL workflow evidence
7. Independent challenge memo
8. Finding and remediation log
9. Validation report
10. Portfolio evidence dashboard

Positioning:
This project demonstrates that I can move beyond AI feature design into AI product architecture, model risk, independent validation and governance-grade evidence.

13. Minimum Practical Checklist

For any high-impact AI use case, produce at least the following materials:

Artifact	Minimum quality bar
AI system inventory	Covers model, prompt, RAG, tool, workflow, HITL, eval, monitoring and revalidation trigger
Approved / prohibited use	Clear enough to block misuse and scope expansion
Validation plan	Defines conceptual soundness, process verification and outcome analysis
Eval contract	Includes critical failures, slices, thresholds, evaluator and gate decision
Process verification checklist	Confirms production implementation matches approved design
Independent challenge questions	Challenges assumptions, evidence, results and residual risk
Finding log	Includes severity, release impact, owner, remediation, retest and risk acceptance
Validation report	Supports go / limited go / no-go / rollback / retire
Monitoring spec	Connects production signals to issue management and revalidation
Portfolio evidence view	Answers executive, audit and regulator questions with current evidence
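
The eval contract row above (critical failures, slices, thresholds, gate decision) can be illustrated with a small gate function. This is a sketch under assumptions, not a prescribed rule: the `EvalContract` fields, the threshold values, the slice names and the mapping from failures to "no-go" or "limited go" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalContract:
    min_overall_pass_rate: float
    min_slice_pass_rate: float
    max_critical_failures: int


def gate_decision(contract: EvalContract, results: list[dict]) -> tuple[str, list[str]]:
    """Each result needs 'slice' (str), 'passed' (bool), 'critical_failure' (bool)."""
    if not results:
        raise ValueError("no eval results supplied")

    overall = sum(r["passed"] for r in results) / len(results)
    by_slice: dict[str, list[bool]] = {}
    for r in results:
        by_slice.setdefault(r["slice"], []).append(r["passed"])
    critical = sum(r["critical_failure"] for r in results)

    blocking, restricting = [], []
    if critical > contract.max_critical_failures:
        blocking.append(
            f"{critical} critical failure(s), allowed {contract.max_critical_failures}"
        )
    if overall < contract.min_overall_pass_rate:
        blocking.append(
            f"overall pass rate {overall:.2f} below {contract.min_overall_pass_rate:.2f}"
        )
    for name in sorted(by_slice):
        rate = sum(by_slice[name]) / len(by_slice[name])
        if rate < contract.min_slice_pass_rate:
            restricting.append(
                f"slice '{name}' pass rate {rate:.2f} below {contract.min_slice_pass_rate:.2f}"
            )

    # Assumed mapping: blocking issues give no-go, slice-only issues give limited go
    if blocking:
        return "no-go", blocking + restricting
    if restricting:
        return "limited go", restricting
    return "go", []


# Hypothetical contract and results: 18/20 pass in one slice, 3/5 in the other
contract = EvalContract(
    min_overall_pass_rate=0.80, min_slice_pass_rate=0.75, max_critical_failures=0
)
results = (
    [{"slice": "standard_dispute", "passed": i < 18, "critical_failure": False}
     for i in range(20)]
    + [{"slice": "vulnerable_customer", "passed": i < 3, "critical_failure": False}
       for i in range(5)]
)

decision, reasons = gate_decision(contract, results)
print(decision, reasons)
```

In this example the overall pass rate clears its threshold while one slice does not, which is why the contract checks slices separately instead of relying on the average.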

Final mindset:

In GenAI, the model is only one component. The validation object is the AI system in use.