AI UAT / Regression Certification / Business Acceptance Playbook

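Scope note: This playbook is not legal advice, a compliance conclusion, an audit conclusion, a model validation report, an information security certification, or a regulatory commitment. In a real project, the applicable scope, control requirements, testing depth, risk acceptance authority, release approval, customer communication, rollback, and incident handling are decided by the institution's authorized roles. This document only turns official anchors and architecture practice into artifacts that PMs, BAs, and Architects can put to work. Access date: 2026-06-30.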

Positioning: An advanced implementation playbook for AI PM / Senior BA / Product Architecture / Solution Architecture / QA Governance / AI Governance / Operational Risk / Release Management. Goal: Turn AI UAT, business acceptance criteria, golden journeys, synthetic transactions, persona/segment coverage, risk/control coverage, AI regression, workflow replay, shadow testing, parallel run, defect triage, release certification, exception/risk acceptance, test data governance, operational readiness, post-release monitoring, and rollback into an executable, auditable, reusable business acceptance system. Core view: The deliverable of UAT is not a sign-off but a release acceptance evidence bundle.

1. Source Anchors

Anchor	Official link	How this playbook uses it
FFIEC Development, Acquisition, and Maintenance IT Handbook	https://ithandbook.ffiec.gov/it-booklets/development-acquisition-and-maintenance/	Serves as the financial-institution IT governance anchor for SDLC, testing, implementation, maintenance, change management, documentation, and rollback/back-out.
FFIEC DA&M - V.B Testing	https://ithandbook.ffiec.gov/it-booklets/development-acquisition-and-maintenance/v-development/vb-testing/	Uses UAT, regression testing, stress testing, testing scope, test results, corrective action, and testing data controls to design UAT evidence.
FFIEC Management IT Handbook	https://ithandbook.ffiec.gov/it-booklets/management	Uses governance, risk management, enterprise architecture, project management, and reporting to organize owners, RACI, and management readiness MI.
FFIEC Business Continuity Management IT Handbook	https://ithandbook.ffiec.gov/it-booklets/business-continuity-management	Uses BIA, interdependency, resilience, continuity/recovery, exercises/tests, and maintenance/improvement to support operational readiness, fallback, and rollback.
NIST SP 800-218 SSDF	https://csrc.nist.gov/pubs/sp/800/218/final	Uses secure software development practices to organize secure release, change evidence, and vulnerability/defect response.
NIST SP 800-53 Rev. 5	https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final	Uses control/evidence vocabulary to support access, audit, configuration, risk, system integrity, contingency, and privacy controls.
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize the AI risk-to-acceptance lifecycle.
NIST AI RMF Core	https://airc.nist.gov/airmf-resources/airmf/	Uses Core functions/categories to design AI acceptance coverage and release decision language.
ISO/IEC 42001 AI management systems	https://www.iso.org/standard/81230.html	Uses the AIMS mindset of establish / implement / maintain / continually improve to organize the AI UAT operating model.

2. Capability Map

Business objective and risk context
  -> acceptance criteria registry
  -> golden journey library
  -> synthetic transaction and persona packs
  -> risk/control coverage matrix
  -> AI object regression matrix
  -> workflow replay / shadow / parallel run
  -> defect triage and retest evidence
  -> release certification packet
  -> exception and risk acceptance workflow
  -> operational readiness and rollback controls
  -> post-release monitoring and recertification loop

Capability	Why it matters	Core artifacts
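Acceptance criteria registry	Turns business acceptance from verbal judgment into testable, signable, traceable objects	acceptance criteria contract, owner map
Golden journey library	Proves the capability is acceptable through end-to-end business scenarios	golden journey card, journey coverage matrix
Synthetic transaction packs	Covers rare, boundary, privacy-sensitive, and high-risk situations	synthetic pack card, expected output review
Persona / segment coverage	Proves the capability works for more than the average user	segment matrix, inclusive/accessibility cases
Risk/control coverage	Moves business risk and control proof forward into UAT	risk-control-test-evidence matrix
AI regression	Covers model, prompt, RAG, tool, policy, and workflow changes	AI object change matrix, regression certificate
Workflow replay / shadow / parallel	Obtains production-like evidence at different levels of risk exposure	replay trace, shadow report, parallel comparison
Defect governance	Prevents defects from being settled in meetings, handled verbally, or underestimated	defect taxonomy, severity decision table
Release certification	Consolidates business, risk, technical, and operational evidence to support the release decision	certification memo, evidence index
Exception acceptance	Accepts residual risk in a structured way	risk acceptance record, expiry, compensating control
Operational readiness	Ensures people, process, monitoring, and rollback exist after release	runbook, training, support, rollback drill
Post-release loop	Turns production signals into regression assets	monitoring dashboard, eval expansion, CAPA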

3. Operating Model

3.1 Productized UAT Team Topology

Team / role	Primary responsibility	Key output
Product owner	Defines business objective, scope, release decision, and customer/business value	release scope, acceptance claim
Senior BA	Writes policy, process, exceptions, journeys, and business judgment into testable acceptance contracts	acceptance criteria, journey scripts, coverage matrix
QA / Test architect	Designs automated regression, test asset library, and defect workflow	test plan, regression suite, defect summary
AI / ML owner	Provides model/prompt/RAG/tool versions, eval results, and known limitations	AI object inventory, eval report
Solution architect	Ensures evidence capture, logging, versioning, rollback, and monitoring are built into the architecture	release architecture, rollback design
Operations owner	Confirms manual queues, training, runbook, support, and degraded mode are ready	ops readiness packet
Risk / Compliance / Control owner	Reviews high-risk scenarios, controls, exceptions, and monitoring	control evidence, risk acceptance
Release manager	Organizes deployment window, go/no-go, feature flag, and rollback drill	release checklist, deployment record
Internal audit liaison	Observes evidence quality and traceability; does not own release approval	evidence quality feedback

3.2 RACI

Activity	PM	Senior BA	QA	Architect	AI Owner	Ops	Risk/Compliance	Release
Acceptance claim	A/R	R	C	C	C	C	C	I
Acceptance criteria	A	R	C	C	C	C	C	I
Golden journeys	A	R	R	C	C	C	C	I
Synthetic packs	C	R	R	C	C	C	C	I
AI regression scope	C	C	R	A/R	A/R	I	C	C
Workflow replay	C	R	A/R	C	C	C	I	I
Shadow / parallel run	A	R	R	C	C	A/R	C	C
Defect severity	A	R	R	C	C	C	C	I
Risk acceptance	C	R	C	C	C	C	A/R	I
Release certification	A/R	R	R	R	R	R	C	A/R
Operational readiness	C	C	C	C	I	A/R	C	C
Rollback decision	A	C	C	R	C	R	C	A/R
Post-release monitoring	A	C	C	C	R	R	C	I

Legend: A = accountable, R = responsible, C = consulted, I = informed.

4. End-to-End Workflow

4.1 Acceptance Planning

release candidate identified
  -> acceptance claim drafted
  -> scope and population confirmed
  -> acceptance criteria created
  -> risk/control coverage mapped
  -> AI object changes identified
  -> test assets selected or created
  -> evidence capture requirements locked

Control questions:

Which business capability is being certified?
Which model/prompt/RAG/tool/workflow versions are in scope?
Which customers, employees, products, channels and regions are exposed?
Which high-risk journeys and segments require evidence?
Which criteria block release and which can be accepted by exception?
What production monitoring and rollback criteria must exist before release?

4.2 Test Asset Preparation

Asset	Preparation rule
Golden journeys	Each journey has objective, persona, steps, expected AI behavior, controls and evidence fields
Synthetic packs	Each pack has generation method, privacy classification, expected output and reviewer
Replay datasets	Historical sample is frozen, de-identified where required, and linked to production version
Prompt / RAG eval set	Known-answer, no-answer, stale-source, citation and adversarial cases included
Tool trajectory set	Correct tool, wrong tool, unauthorized tool, failed tool and retry cases included
Accessibility pack	Screen reader, keyboard, contrast, plain-language and alternate-channel cases included

4.3 Execution

automated regression
  -> AI eval and tool trajectory tests
  -> business UAT session
  -> workflow replay
  -> shadow or parallel run if required
  -> operational readiness drill
  -> defect triage and retest
  -> evidence bundle generation

4.4 Certification Decision

criteria results complete
  -> evidence index reviewed
  -> open defects dispositioned
  -> exceptions accepted or release held
  -> monitoring and rollback confirmed
  -> certification memo approved
  -> release / pilot / hold / rollback decision recorded

5. Artifact Templates

5.1 Acceptance Criteria Contract

Field	Required content
acceptance_id	Stable ID, e.g. `ACC-FRAUD-COPILOT-023`
capability	Business capability, not component name
release scope	Product, channel, segment, region, user role, feature flag
AI object versions	model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow
business outcome	measurable business result
decision impact	recommendation, ranking, summary, automation, tool action, customer communication
success criteria	metric, threshold, sample, owner
risk criteria	prohibited outputs, escalation rules, customer harm prevention
control criteria	HITL, logging, access, policy gate, approval, retention
evidence required	eval report, journey result, replay trace, shadow/parallel comparison, approval
release blocker	yes/no and condition
recertification trigger	model/prompt/RAG/tool/workflow/policy/data change, monitoring breach, incident
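
One way to make the contract machine-checkable is to hold each criterion as a typed record and reject blocking criteria that lack an owner, metric, threshold, or evidence. The sketch below is illustrative only: the class name, field names, and the example capability text are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceCriterion:
    """One registry row; field names are illustrative, not a fixed schema."""
    acceptance_id: str
    capability: str
    metric: str = ""
    threshold: str = ""
    owner: str = ""
    evidence_required: list[str] = field(default_factory=list)
    release_blocker: bool = False


def missing_fields(criterion: AcceptanceCriterion) -> list[str]:
    """Return required fields that a blocking criterion leaves empty."""
    if not criterion.release_blocker:
        return []
    required = {
        "metric": criterion.metric,
        "threshold": criterion.threshold,
        "owner": criterion.owner,
        "evidence_required": criterion.evidence_required,
    }
    return [name for name, value in required.items() if not value]


# Hypothetical draft: a blocking criterion that is not yet complete.
draft = AcceptanceCriterion(
    acceptance_id="ACC-FRAUD-COPILOT-023",
    capability="Fraud alert summarization",
    metric="unsupported-claim rate",
    release_blocker=True,
)
print(missing_fields(draft))  # ['threshold', 'owner', 'evidence_required']
```

A registry built this way can refuse to lock evidence capture requirements until every blocking criterion returns an empty list.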

5.2 Golden Journey Card

Field	Required content
journey_id	Stable journey ID
name	business-readable journey name
persona / segment	customer or employee profile and risk segment
channel	mobile, web, branch, contact center, back office, API
preconditions	data state, account status, permissions, feature flag
journey steps	end-to-end business actions
AI touchpoints	model output, retrieval, tool call, summary, recommendation
expected behavior	desired action, refusal, escalation, human review
controls	policy checks, HITL, logging, masking, dual control
evidence	trace IDs, test run IDs, screenshots only if necessary, data proof, reviewer
linked criteria	acceptance_ids and risk/control IDs

5.3 Synthetic Transaction Pack Card

Field	Required content
pack_id	Stable ID
scenario family	boundary, rare event, protected segment, privacy, adversarial, operational
generation method	rule-based, sampled, transformed, expert-authored
source pattern	production pattern, policy scenario, risk typology
privacy classification	no sensitive data, sanitized, tokenized, restricted
expected output	exact or rubric-based expected behavior
prohibited output	what must never happen
reviewer	business/risk/control owner
coverage	linked journeys, criteria, controls
version	effective date, changes, retired cases

5.4 AI Regression Certificate

Field	Required content
certificate_id	Stable release certificate section ID
changed objects	model, prompt, retriever, corpus, tool schema, workflow, policy, data, UI/API
unchanged impacted objects	dependencies that may be affected
regression assets run	eval sets, golden journeys, synthetic packs, tool tests, workflow replay
pass/fail	threshold results and failures
defect links	open and closed defects
comparison baseline	previous release, production baseline, control group
risk owner review	owner, date, decision
recertification scope	what must rerun for next change

5.5 Release Certification Memo

Section	Required content
Executive summary	release decision and residual risk in business language
Scope	population, channels, products, roles, feature flags
Version identity	model/prompt/RAG/tool/workflow/config/build versions
Acceptance results	criteria, threshold, evidence, owner decision
Coverage	journey, segment, risk/control, AI regression coverage
Defects	severity, root cause, retest, disposition
Exceptions	risk accepted, owner, expiry, compensating control
Operational readiness	runbook, training, support, capacity, BCM/degraded mode
Monitoring	post-release metrics, thresholds, review cadence
Rollback	trigger, owner, technical path, business communication
Sign-off	business, technology, operations, risk/control, release

5.6 Risk Acceptance Record

Field	Required content
exception_id	Stable ID
linked release / criteria / defect	IDs, not free text
residual risk	clear risk statement
impacted population	segment, channel, product, employee role
compensating control	manual review, rate limit, monitoring, fallback, customer recourse
expiry or scale gate	date, batch size, exposure cap, trigger
approver	role with authority
monitoring	metric, threshold, owner, cadence
closure	fix, retest, recertification evidence
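
Because every accepted risk carries an expiry, the record lends itself to a simple status check. The following is a minimal sketch under assumptions: the field names, the 14-day warning window, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RiskAcceptance:
    exception_id: str
    approver_role: str
    compensating_control: str
    expiry: date
    closed: bool = False


def acceptance_status(record: RiskAcceptance, today: date, warn_days: int = 14) -> str:
    """Classify an accepted risk as 'closed', 'expired', 'expiring', or 'active'."""
    if record.closed:
        return "closed"
    days_left = (record.expiry - today).days
    if days_left < 0:
        return "expired"
    if days_left <= warn_days:
        return "expiring"  # assumed warning window; tune per institution
    return "active"


# Hypothetical record used only to exercise the function.
record = RiskAcceptance("EXC-0007", "Head of Fraud Ops", "manual review of all AI summaries",
                        expiry=date(2026, 9, 30))
print(acceptance_status(record, today=date(2026, 9, 20)))  # expiring
print(acceptance_status(record, today=date(2026, 10, 1)))  # expired
```

An "expired" result would normally force either closure evidence or a fresh acceptance by an authorized approver.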

5.7 Test Data Governance Card

Field	Required content
dataset_id	UAT/replay/synthetic dataset ID
data source	production extract, synthetic generation, vendor sample, expert-authored
data classification	public, internal, confidential, restricted, regulated
sanitization	masking, tokenization, NULLing, substitution, aggregation
allowed environment	UAT, secure sandbox, local prohibited, vendor prohibited
retention	retention period and deletion evidence
access	roles, approvals, logs
AI model routing	approved models/environments only
evidence use	which tests and criteria rely on this data

5.8 Evidence Index

Field	Required content
evidence_id	Stable ID
evidence_type	test_run, eval_report, UAT_session, replay_trace, shadow_report, approval
release_id	release and feature flag
linked criteria/control	IDs
object versions	model, prompt, corpus, tool, workflow
produced_by	system, tester, business user, AI assistant
reviewer	accountable human reviewer
timestamp	creation and approval time
integrity	checksum, immutable store, access log
retention	retention class
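
The integrity field can be backed by a content digest recorded when the evidence is created. Below is a small sketch using SHA-256 from the Python standard library; the function names, dictionary keys, and example IDs are assumptions, and an immutable store and access log would still be needed around it.

```python
import hashlib
from datetime import datetime, timezone


def build_evidence_entry(evidence_id: str, evidence_type: str, release_id: str,
                         criteria_ids: list[str], reviewer: str, payload: bytes) -> dict:
    """Create an index entry whose digest binds it to the stored evidence bytes."""
    return {
        "evidence_id": evidence_id,
        "evidence_type": evidence_type,
        "release_id": release_id,
        "linked_criteria": list(criteria_ids),
        "reviewer": reviewer,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def is_intact(entry: dict, payload: bytes) -> bool:
    """Recompute the digest and compare it with the one recorded at creation."""
    return hashlib.sha256(payload).hexdigest() == entry["sha256"]


report = b'{"eval": "citation", "pass_rate": 0.97}'  # hypothetical eval report bytes
entry = build_evidence_entry("EVD-0001", "eval_report", "REL-2026-07",
                             ["ACC-FRAUD-COPILOT-023"], "risk.owner", report)
print(is_intact(entry, report))         # True
print(is_intact(entry, report + b" "))  # False
```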

6. Coverage Matrices

6.1 Acceptance-to-Evidence Matrix

Acceptance criterion	Golden journey	Synthetic pack	AI regression	Control evidence	Status
AI must escalate uncertain KYC document	GJ-KYC-004	SYN-KYC-BOUNDARY-002	prompt + tool + workflow	HITL log, reviewer queue	covered
RAG answer must cite current fee policy	GJ-CS-011	SYN-RAG-STALE-001	corpus + retriever	citation eval, source version	covered
Fraud alert summary must not close case	GJ-FRAUD-021	SYN-TOOL-DENY-003	tool gateway	tool audit trace	covered
Spanish mobile onboarding must preserve recourse	GJ-ONB-ESP-007	SYN-ACCESS-005	UI + prompt	language QA, appeal path	covered
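
The Status column can be derived rather than typed by hand. The sketch below marks a criterion "covered" only when every evidence column is filled; the first two rows come from the matrix above, while the third row, the column keys, and the "gap" label are hypothetical.

```python
REQUIRED_COLUMNS = ("golden_journey", "synthetic_pack", "ai_regression", "control_evidence")

matrix = [
    {"criterion": "AI must escalate uncertain KYC document",
     "golden_journey": "GJ-KYC-004", "synthetic_pack": "SYN-KYC-BOUNDARY-002",
     "ai_regression": "prompt + tool + workflow",
     "control_evidence": "HITL log, reviewer queue"},
    {"criterion": "RAG answer must cite current fee policy",
     "golden_journey": "GJ-CS-011", "synthetic_pack": "SYN-RAG-STALE-001",
     "ai_regression": "corpus + retriever",
     "control_evidence": "citation eval, source version"},
    # Hypothetical incomplete row, included only to show a gap being reported.
    {"criterion": "Example criterion without full evidence",
     "golden_journey": "GJ-EXAMPLE-001", "synthetic_pack": "",
     "ai_regression": "prompt", "control_evidence": ""},
]


def coverage_status(row: dict) -> tuple[str, list[str]]:
    """Return ('covered', []) or ('gap', [missing columns]) for one matrix row."""
    gaps = [col for col in REQUIRED_COLUMNS if not row.get(col)]
    return ("covered", []) if not gaps else ("gap", gaps)


for row in matrix:
    print(row["criterion"], coverage_status(row))
```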

6.2 Segment Coverage Matrix

Segment axis	Required examples	Evidence
Customer value/risk	low balance, high value, high-risk AML, fraud watch	journey run and segment eval
Language	English, Spanish, high-volume non-English path	translated source and output QA
Accessibility	screen reader, keyboard only, plain-language error	accessibility evidence
Channel	mobile, web, branch, contact center, back office	channel-specific trace
Product	deposit, credit card, loan, payment, dispute	product journey coverage
Employee role	analyst, supervisor, QA reviewer, admin	permission and workflow tests

6.3 Risk-Control-Test-Evidence Matrix

Risk	Control	Test asset	Evidence	Release implication
Unsupported claim in customer answer	RAG source grounding and citation required	known-answer and no-answer pack	citation correctness report	blocker if high-risk topic fails
Unauthorized tool action	tool gateway policy and HITL	tool trajectory negative cases	denied action logs	blocker for agent release
Segment harm	segment-specific thresholds and recourse	vulnerable customer pack	segment result and appeal path	pilot cap if amber
Queue overload	exposure cap and capacity monitor	parallel run workload comparison	queue forecast and ops sign-off	cap rollout if workload high
Missing audit trail	trace ID and version capture	evidence completeness query	log sample and evidence index	no release if critical fields missing

7. AI Regression Decision Tables

7.1 Regression Scope by Change

Change	Minimum regression	Additional tests for high-risk AI
Prompt wording	prompt golden set, refusal, citation, tone	adversarial prompts, high-risk journeys
Model version	full eval, segment eval, latency/cost	shadow or parallel comparison
RAG corpus	retrieval eval, source freshness, stale-source tests	regulatory/customer-facing narrative QA
Retriever / embedding	recall/ranking benchmark	no-answer and edge source cases
Tool schema/API	contract test, positive/negative tool trajectory	side-effect sandbox, HITL approval
Policy rule	decision table regression	segment and jurisdiction matrix
Workflow state	state transition and exception replay	queue/capacity and fallback
Data pipeline	DQ, distribution shift, backfill/replay	model input drift, bias/segment impact
UI/API	functional and accessibility regression	operator error, explanation comprehension
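
The table above can be encoded as a lookup so that a change record produces its regression scope automatically. This sketch covers three of the nine change types for brevity; the dictionary keys and function name are assumptions, and the asset names are copied from the table.

```python
MINIMUM = {
    "prompt": ["prompt golden set", "refusal", "citation", "tone"],
    "model": ["full eval", "segment eval", "latency/cost"],
    "rag_corpus": ["retrieval eval", "source freshness", "stale-source tests"],
}
HIGH_RISK_EXTRA = {
    "prompt": ["adversarial prompts", "high-risk journeys"],
    "model": ["shadow or parallel comparison"],
    "rag_corpus": ["regulatory/customer-facing narrative QA"],
}


def regression_scope(changed_objects: list[str], high_risk: bool) -> list[str]:
    """Union of required regression assets, in table order, without duplicates."""
    scope: list[str] = []
    for obj in changed_objects:
        if obj not in MINIMUM:
            raise KeyError(f"no regression rule defined for change type: {obj}")
        assets = MINIMUM[obj] + (HIGH_RISK_EXTRA[obj] if high_risk else [])
        for asset in assets:
            if asset not in scope:
                scope.append(asset)
    return scope


print(regression_scope(["prompt"], high_risk=False))
# ['prompt golden set', 'refusal', 'citation', 'tone']
```

Raising on an unknown change type is a deliberate choice in this sketch: an unmapped change should stop certification rather than silently run no tests.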

7.2 Certification Decision

Condition	Decision
All blocking criteria met, no critical/high open defect, ops ready	full release
Blocking criteria met, amber workload or segment signal within cap	limited pilot with exposure cap
Non-blocking defect accepted by authorized owner with expiry	exception release
critical defect, missing key control evidence, or no rollback path	hold
production early signal breaches stop rule	rollback or disable feature flag
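
The pre-release rows of this table can be sketched as a small decision function. The precedence is an assumption: hold conditions are evaluated first, unmet blocking criteria or unresolved high defects are also treated as hold, and an exception release takes priority over a pilot. The post-release rollback row is left out because it depends on production signals. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReleaseState:
    blocking_criteria_met: bool
    ops_ready: bool
    critical_open: int = 0           # open critical defects
    high_open: int = 0               # high defects neither fixed nor formally accepted
    control_evidence_complete: bool = True
    rollback_path_ready: bool = True
    amber_within_cap: bool = False   # amber workload/segment signal inside exposure cap
    accepted_exceptions: int = 0     # non-blocking defects accepted with owner and expiry


def certification_decision(s: ReleaseState) -> str:
    # Explicit hold conditions from the table come first.
    if s.critical_open or not s.control_evidence_complete or not s.rollback_path_ready:
        return "hold"
    # Assumption: these cases are not listed in the table but are treated as hold here.
    if not s.blocking_criteria_met or s.high_open or not s.ops_ready:
        return "hold"
    if s.accepted_exceptions:
        return "exception release"
    if s.amber_within_cap:
        return "limited pilot with exposure cap"
    return "full release"


print(certification_decision(ReleaseState(True, True)))                         # full release
print(certification_decision(ReleaseState(True, True, amber_within_cap=True)))  # limited pilot with exposure cap
print(certification_decision(ReleaseState(True, True, critical_open=1)))        # hold
```

The function only proposes a decision from recorded facts; approval remains with the authorized owners.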

7.3 Defect Severity

Severity	Signal	Required action
Critical	prohibited output, privacy breach, unauthorized action, critical control failure	block release, freeze evidence, executive escalation
High	high-risk journey failure, repeated segment harm, monitoring gap	fix or formal exception committee
Medium	contained failure with compensating control	fix before scale, monitor in pilot
Low	low-risk UI/text issue without customer/control impact	backlog with owner and target release
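
A severity table like this can be applied consistently by taking the highest severity among the signals observed on a defect. The sketch below assumes hypothetical signal keys derived from the Signal column; the required actions are quoted from the table.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Signal keys are illustrative names for the entries in the Signal column.
SIGNAL_SEVERITY = {
    "prohibited_output": Severity.CRITICAL,
    "privacy_breach": Severity.CRITICAL,
    "unauthorized_action": Severity.CRITICAL,
    "critical_control_failure": Severity.CRITICAL,
    "high_risk_journey_failure": Severity.HIGH,
    "repeated_segment_harm": Severity.HIGH,
    "monitoring_gap": Severity.HIGH,
    "contained_failure_with_compensating_control": Severity.MEDIUM,
    "low_risk_ui_text_issue": Severity.LOW,
}

REQUIRED_ACTION = {
    Severity.CRITICAL: "block release, freeze evidence, executive escalation",
    Severity.HIGH: "fix or formal exception committee",
    Severity.MEDIUM: "fix before scale, monitor in pilot",
    Severity.LOW: "backlog with owner and target release",
}


def triage(signals: list[str]) -> tuple[Severity, str]:
    """Return the highest severity among the signals and its required action."""
    if not signals:
        raise ValueError("a defect needs at least one signal to be triaged")
    severity = max(SIGNAL_SEVERITY[s] for s in signals)
    return severity, REQUIRED_ACTION[severity]


print(triage(["monitoring_gap", "low_risk_ui_text_issue"]))
```

The output is a triage suggestion; the accountable human still confirms severity and disposition.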

7.4 Rollback Triggers

Trigger	Rollback action
prohibited output in production	disable AI output path, preserve logs, incident workflow
tool gateway unauthorized action	disable affected tool integration, revert policy/tool schema
citation correctness below threshold	revert corpus/retriever or disable grounded answer feature
manual queue exceeds capacity limit	reduce cohort, switch to manual triage
segment complaint spike	pause affected segment, expand review, notify authorized owners
evidence capture failure	pause rollout, restore prior version, assess audit gap
vendor model degradation	route to fallback model or manual process
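
Several of these triggers are metric thresholds, so they can be evaluated as stop rules against a monitoring snapshot. In this sketch the metric names and the numeric limits (0, 0.95, 500) are hypothetical placeholders; the actions are taken from the table.

```python
# (metric name, "min" or "max" bound, hypothetical limit, rollback action from the table)
STOP_RULES = [
    ("prohibited_outputs", "max", 0,
     "disable AI output path, preserve logs, incident workflow"),
    ("citation_correctness", "min", 0.95,
     "revert corpus/retriever or disable grounded answer feature"),
    ("manual_queue_depth", "max", 500,
     "reduce cohort, switch to manual triage"),
]


def triggered_actions(metrics: dict[str, float]) -> list[str]:
    """Return the rollback actions whose stop rule is breached by the snapshot."""
    actions = []
    for name, bound, limit, action in STOP_RULES:
        value = metrics[name]  # a missing metric raises KeyError instead of passing silently
        breached = value < limit if bound == "min" else value > limit
        if breached:
            actions.append(action)
    return actions


snapshot = {"prohibited_outputs": 0, "citation_correctness": 0.91, "manual_queue_depth": 120}
print(triggered_actions(snapshot))
# ['revert corpus/retriever or disable grounded answer feature']
```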

8. Workflow Replay, Shadow Testing, Parallel Run

8.1 Workflow Replay Pack

Field	Required content
replay_id	Stable run ID
event source	historical, synthetic, mixed
frozen input version	dataset and business event version
expected state transitions	states, owners, SLAs
AI actions	recommendation, retrieval, tool call, refusal, escalation
control checks	policy, authorization, logging, HITL
comparison	expected vs actual
defect links	deviations and retest
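
The comparison field amounts to diffing expected and actual state transitions for each replayed event. A minimal sketch follows, assuming states are recorded as ordered lists of names; the state names are hypothetical.

```python
from itertools import zip_longest


def compare_transitions(expected: list[str], actual: list[str]) -> list[tuple]:
    """Return (step_index, expected_state, actual_state) for every deviation."""
    deviations = []
    for index, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp != act:
            deviations.append((index, exp, act))  # None marks a missing step
    return deviations


# Hypothetical replay: the AI path skipped the human review state.
expected = ["received", "ai_triage", "human_review", "closed"]
actual = ["received", "ai_triage", "closed"]
print(compare_transitions(expected, actual))
# [(2, 'human_review', 'closed'), (3, 'closed', None)]
```

Each deviation would then be linked to a defect and retested, as the defect links field requires.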

8.2 Shadow Test Report

Section	Required content
Traffic scope	cohort, channel, product, duration
Non-impact proof	how shadow output was prevented from affecting production decision
Output quality	accuracy, citation, refusal, unsupported claim
Risk signals	policy violations, privacy events, segment issues
Operations	latency, cost, error, queue projection
Decision	continue shadow, move to pilot, expand tests, hold

8.3 Parallel Run Report

Section	Required content
Baseline	current process/model/manual decision
New process	AI-assisted or automated path
Comparison	decision delta, reason delta, workload delta, segment delta
Human adjudication	where deltas were reviewed and how final truth was assigned
Customer impact	potential harm, recourse, communication
Control impact	HITL, logging, override, approval
Release decision	full, pilot, hold, exception
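
The decision delta and segment delta in the Comparison section can be computed from paired outcomes. The sketch below reports, per segment, the share of cases where the new path disagreed with the baseline; the segment names and decisions are hypothetical, and the thresholds applied to the result would need to be defined before the run.

```python
from collections import Counter


def decision_delta(pairs: list[tuple[str, str, str]]) -> dict[str, float]:
    """pairs holds (segment, baseline_decision, new_decision); returns disagreement rate per segment."""
    total: Counter = Counter()
    changed: Counter = Counter()
    for segment, baseline, new in pairs:
        total[segment] += 1
        if baseline != new:
            changed[segment] += 1
    return {segment: changed[segment] / total[segment] for segment in total}


# Hypothetical parallel-run sample.
sample = [
    ("high_risk_aml", "escalate", "escalate"),
    ("high_risk_aml", "escalate", "approve"),
    ("low_balance", "approve", "approve"),
]
print(decision_delta(sample))  # {'high_risk_aml': 0.5, 'low_balance': 0.0}
```

Disagreements are inputs to human adjudication, not verdicts: the rate says where the two paths differ, not which one was right.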

9. Evidence / Control Checklist

9.1 Pre-Release Checklist

Check	Pass standard
Acceptance criteria defined	all blocking criteria have owner, metric, threshold and evidence
Version identity locked	model, prompt, RAG, tool, policy, workflow and feature flag recorded
Golden journeys run	high-risk and material journeys executed with trace evidence
Synthetic packs reviewed	expected outputs approved by business/risk owner
Segment coverage complete	material and vulnerable segments have evidence
AI regression complete	changed and impacted objects tested
Defects dispositioned	no critical open; high defects fixed or formally accepted
Test data controlled	sensitive data protected, synthetic data governed, access logged
Accessibility covered	relevant digital and language accessibility tests passed
Operational readiness confirmed	runbook, training, support, capacity and fallback in place
Monitoring configured	metrics, thresholds, owners and cadence active
Rollback rehearsed	technical and business rollback path verified
Certification memo approved	release decision and residual risks signed by owners

9.2 Evidence Quality Rubric

Quality dimension	Good evidence
Complete	covers scope, criteria, version, result, owner
Accurate	produced from system logs or controlled test outputs
Timely	generated during workflow, not reconstructed later
Traceable	linked to release, criteria, control and defect IDs
Reproducible	input pack and versions allow rerun
Reviewed	accountable human reviewer recorded
Protected	sensitive data minimized and access-controlled
Durable	retained with integrity and retention metadata

9.3 Management Readiness MI

Dashboard tile	RAG signal
Release certification status	criteria passed, blocked, exceptioned
High-risk journey coverage	executed, failed, not run
AI regression status	changed objects covered or missing
Segment and accessibility coverage	material gaps and amber signals
Defect severity	critical/high open, aging, repeat root cause
Exception inventory	accepted risk, expiry, compensating control
Operational readiness	training, runbook, support, capacity
Monitoring readiness	metrics live, thresholds approved
Rollback readiness	drill completed, owner assigned

10. AI Assist Guardrails

AI use in UAT	Allowed output	Required guardrail
Test case generation	candidate cases and edge cases	human approves expected behavior
Coverage gap analysis	unmapped criteria/controls/tests	traceability graph remains source of truth
Defect clustering	likely root cause and duplicate groups	severity and closure by accountable human
Requirement-test mapping	suggested matrix links	BA/QA confirms links
Evidence summarization	draft certification summary	cites evidence IDs, human approval
Synthetic scenario generation	candidate synthetic records/scenarios	privacy review, reviewer approval
UAT transcript analysis	pain points, confusion, training signals	access control and PII minimization

Non-negotiable boundaries:

AI does not sign UAT.
AI does not approve release.
AI does not accept residual risk.
AI does not close defects.
AI does not decide legal/compliance applicability.
AI does not fabricate missing evidence.
AI does not send sensitive test data to unapproved tools.
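
Some of these boundaries can be enforced in tooling rather than left to convention. As a sketch, a workflow system could reject human-only actions when the actor is an AI assistant; the action names and actor types below are hypothetical labels for the boundaries listed above.

```python
# Hypothetical action identifiers for decisions reserved for accountable humans.
HUMAN_ONLY_ACTIONS = frozenset({
    "sign_uat",
    "approve_release",
    "accept_residual_risk",
    "close_defect",
    "decide_compliance_applicability",
})


def is_permitted(actor_type: str, action: str) -> bool:
    """Deny human-only actions to AI actors; everything else is left to other controls."""
    return not (actor_type == "ai_assistant" and action in HUMAN_ONLY_ACTIONS)


print(is_permitted("ai_assistant", "draft_certification_summary"))  # True
print(is_permitted("ai_assistant", "approve_release"))              # False
print(is_permitted("release_manager", "approve_release"))           # True
```

This check covers only the actor boundary; fabricated evidence and data routing need separate controls such as evidence integrity checks and approved-model lists.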

11. Implementation Guardrails

Guardrail	Practical rule
Criteria before cases	no test asset accepted without linked acceptance criteria
Risk before coverage	high-risk journey and segment coverage outranks raw test count
Version every AI object	model, prompt, corpus, retriever, tool, policy and workflow must be recorded
Evidence by design	trace IDs, logs and approvals are generated during execution
Synthetic with governance	generated data must have business rationale and expected output
No uncontrolled production data	production data in UAT requires need, controls, access and retention
Defect decisions structured	severity, impact, root cause, retest and disposition required
Exceptions expire	every accepted risk has owner, expiry and compensating control
Operational readiness is blocking	no release without support, training, monitoring and rollback
Production signals update tests	incidents and complaints expand golden journeys or synthetic packs

12. 30-60-90 Day Roadmap

Days 1-30: Stabilize Acceptance Foundations

Workstream	Outcome
Inventory AI releases	top AI use cases, release types, model/prompt/RAG/tool owners
Define acceptance contract	template for business, risk, control, ops and rollback criteria
Build initial golden journeys	top 15-25 material and high-risk journeys
Create defect taxonomy	severity, root cause, customer/control impact
Stand up evidence index	release ID, criteria ID, evidence ID conventions

Days 31-60: Engineer Regression and Evidence

Workstream	Outcome
Build synthetic packs	boundary, rare event, segment, privacy, adversarial, operational packs
Add AI object regression	model/prompt/RAG/tool/workflow change matrix
Implement workflow replay	traceable replay for key journeys
Structure certification memo	criteria, coverage, defects, exceptions, ops readiness
Define monitoring and rollback	metrics, thresholds, owners, kill switch path

Days 61-90: Prove and Scale

Workstream	Outcome
Run shadow or parallel pilots	comparison evidence for selected high-risk use case
Launch readiness MI	release certification, defects, exceptions, monitoring, rollback dashboard
Automate evidence capture	test run, eval, logs, approvals, version metadata
Close repeat root causes	production signals update regression packs
Formalize recertification policy	triggers for model/prompt/RAG/tool/workflow/data changes

13. Interview-Ready Answers

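Q1: How would you redesign AI UAT?

30-second version: I would turn AI UAT from a business sign-off into an evidence architecture. First define the acceptance claim and criteria, then build the golden journeys, synthetic packs, segment coverage, risk/control coverage, and AI regression matrix. After running workflow replay and a shadow or parallel run, use the release certification memo to record evidence, defects, exceptions, operational readiness, monitoring, and rollback.

2-minute version: AI UAT has to prove more than that the feature works: it has to prove that the business capability is acceptable under specific model, prompt, RAG, tool, workflow, and control versions. My process starts from the acceptance criteria registry, where every criterion has an owner, threshold, evidence, and a release blocker flag. Test assets include end-to-end golden journeys, synthetic transaction packs, the persona/segment matrix, and adversarial and accessibility cases. Regression covers model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow state, and data pipeline. Defects are classified by customer impact and control impact. Before release, a certification memo is produced that contains residual risk, exception approval, operational readiness, monitoring, and rollback criteria.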

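Q2: Why can't UAT be completed just by business users clicking through pages?

30-second version: Because AI risk often sits in model output, retrieval sources, tool calls, control logs, human review, segment performance, and operational load, and does not necessarily surface in page clicks. Business user participation matters, but it must be embedded in the evidence architecture.

2-minute version: For example, a customer service AI assistant's page can display an answer, but that does not mean the answer is based on current policy, cites accurately, and stays within advice boundaries, nor that Spanish-speaking customers, vulnerable customers, complaint escalation, and system timeout paths are covered. UAT should place business users inside golden journeys while capturing trace ID, source citation, model/prompt version, policy decision, tool call, HITL approval, and defect evidence. Business judgment remains a human responsibility, but that judgment must be bound to auditable evidence.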

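Q3: How do you design AI regression certification?

30-second version: First identify the changed objects: model, prompt, RAG, retriever, tool, policy, workflow, data, UI/API. Map each type of change to its minimum regression assets and its additional high-risk tests. Once they pass, generate an AI regression certificate that records versions, tests, defects, exceptions, and the triggers for the next recertification.

2-minute version: A prompt change runs refusal, citation, tone, and high-risk journeys; a RAG corpus change runs retrieval eval, stale-source, and citation correctness; a tool schema change runs positive/negative trajectory, authorization, and side-effect sandbox; a model version change runs full eval, segment eval, and latency/cost, plus a shadow or parallel run for high-risk scenarios. Certification is not a single total score but a chain: changed object -> impacted behavior -> regression asset -> evidence -> owner decision.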

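Q4: How is evidence from shadow testing and parallel run used in the release decision?

30-second version: Shadow testing demonstrates the AI's output quality and risk signals on real traffic in a bypass path; parallel run compares the old and new processes on decision differences, manual workload, segment impact, and control effectiveness. Both must have thresholds defined in advance.

2-minute version: Shadow suits low-interference observation, such as customer service suggestions, RAG answers, or analyst summaries, but you must prove that the shadow output did not influence real decisions. Parallel run suits KYC, fraud, credit, or operational workflows, because the baseline and the new process need to be compared. The release decision looks at whether the decision delta is adequately explained, whether workload is within capacity, whether high-risk segments are stable, whether controls are fully recorded, and whether the customer harm signal is acceptable. A shadow or parallel run without comparison criteria cannot support certification.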

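Q5: How far can AI help with UAT?

30-second version: AI can generate candidate tests, find coverage gaps, cluster defects, map requirements to tests, and summarize evidence. AI cannot give final acceptance, accept risk, close defects, approve a release, or replace audit or model validation.

2-minute version: I would use AI to improve the analytical efficiency of UAT, but set hard boundaries. AI-generated test cases must have their expected outcome confirmed by the BA, business, or risk owner. AI coverage gap suggestions must be reconciled against the traceability matrix. Defect clustering serves only as triage input; severity and disposition are decided by humans. The release memo can be drafted by AI, but it must cite evidence IDs and be approved by an authorized owner. This way AI helps the team see problems faster without taking over the chain of accountability.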

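Q6: How do you handle a release request that carries defects?

30-second version: First determine whether the defect is blocking. Critical does not ship; High requires a fix or a formal exception committee; Medium can go to a limited rollout but needs a compensating control; Low goes to the backlog. Every exception must have an owner, expiry, monitoring, and closure criteria.

2-minute version: I will not use the phrase "business accepted" to paper over a defect. The defect record must link to acceptance criteria, journey, segment, AI object version, customer/control impact, and root cause. If the decision is an exception release, a risk acceptance record is required: residual risk, impacted population, compensating control, exposure cap, expiry, approver, monitoring, and closure. Post-release monitoring must be able to verify that the risk has not grown, and before expiry it must be fixed or re-accepted.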

14. Quality Bar

The minimum bar for an AI UAT / Regression Certification system:

For any AI release,
the team can reconstruct:
  what business capability was accepted,
  which criteria and risks were in scope,
  which journeys, segments and synthetic scenarios were tested,
  which model/prompt/RAG/tool/workflow versions were certified,
  which defects and exceptions remained,
  who accepted the residual risk,
  what monitoring was active,
  and what conditions would trigger rollback or recertification.

If this chain breaks, UAT is not a business acceptance architecture, only a release ceremony.