
AI UAT / Regression Certification / Business Acceptance Playbook

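Boundary note: This playbook is not legal advice, a compliance conclusion, an audit conclusion, a model validation report, an information security certification or a regulatory commitment. In a real project, the applicable scope, control requirements, test depth, risk acceptance authority, release approval, customer communication, rollback and incident handling are decided by the institution's authorized roles. This document only turns official anchors and architecture practice into artifacts that PMs, BAs and Architects can put to use. Access date: 2026-06-30.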

603AI_UAT_REGRESSION_CERTIFICATION_BUSINESS_ACCEPTANCE_PLAYBOOK.md


Positioning: An advanced implementation playbook for AI PM / Senior BA / Product Architecture / Solution Architecture / QA Governance / AI Governance / Operational Risk / Release Management.

Goal: Turn AI UAT, business acceptance criteria, golden journey, synthetic transaction, persona/segment coverage, risk/control coverage, AI regression, workflow replay, shadow testing, parallel run, defect triage, release certification, exception/risk acceptance, test data governance, operational readiness, post-release monitoring and rollback into an executable, auditable and reusable business acceptance system.

Core view: The deliverable of UAT is not a sign-off but a release acceptance evidence bundle.



1. Source Anchors

| Anchor | Official link | How this playbook uses it |
| --- | --- | --- |
| FFIEC Development, Acquisition, and Maintenance IT Handbook | https://ithandbook.ffiec.gov/it-booklets/development-acquisition-and-maintenance/ | Financial-institution IT governance anchor for SDLC, testing, implementation, maintenance, change management, documentation and rollback/back-out. |
| FFIEC DA&M - V.B Testing | https://ithandbook.ffiec.gov/it-booklets/development-acquisition-and-maintenance/v-development/vb-testing/ | Use UAT, regression testing, stress testing, testing scope, test results, corrective action and testing data controls to design UAT evidence. |
| FFIEC Management IT Handbook | https://ithandbook.ffiec.gov/it-booklets/management | Use governance, risk management, enterprise architecture, project management and reporting to organize owners, RACI and management readiness MI. |
| FFIEC Business Continuity Management IT Handbook | https://ithandbook.ffiec.gov/it-booklets/business-continuity-management | Use BIA, interdependency, resilience, continuity/recovery, exercises/tests and maintenance/improvement to support operational readiness, fallback and rollback. |
| NIST SP 800-218 SSDF | https://csrc.nist.gov/pubs/sp/800/218/final | Use secure software development practices to organize secure release, change evidence and vulnerability/defect response. |
| NIST SP 800-53 Rev. 5 | https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final | Use the control/evidence vocabulary to support access, audit, configuration, risk, system integrity, contingency and privacy controls. |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Use Govern / Map / Measure / Manage to organize the AI risk-to-acceptance lifecycle. |
| NIST AI RMF Core | https://airc.nist.gov/airmf-resources/airmf/ | Use Core functions/categories to design AI acceptance coverage and release decision language. |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | Use the AIMS establish / implement / maintain / continually improve mindset to organize the AI UAT operating model. |

2. Capability Map

Business objective and risk context
  -> acceptance criteria registry
  -> golden journey library
  -> synthetic transaction and persona packs
  -> risk/control coverage matrix
  -> AI object regression matrix
  -> workflow replay / shadow / parallel run
  -> defect triage and retest evidence
  -> release certification packet
  -> exception and risk acceptance workflow
  -> operational readiness and rollback controls
  -> post-release monitoring and recertification loop
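
| Capability | Why it matters | Core artifacts |
| --- | --- | --- |
| Acceptance criteria registry | Turns business acceptance from verbal judgment into a testable, signable, traceable object | acceptance criteria contract, owner map |
| Golden journey library | Proves the capability is acceptable through end-to-end business scenarios | golden journey card, journey coverage matrix |
| Synthetic transaction packs | Covers rare, boundary, privacy-sensitive and high-risk situations | synthetic pack card, expected output review |
| Persona / segment coverage | Proves the capability works for more than the average user | segment matrix, inclusive/accessibility cases |
| Risk/control coverage | Brings business risk and control proof forward into UAT | risk-control-test-evidence matrix |
| AI regression | Covers changes to model, prompt, RAG, tool, policy and workflow | AI object change matrix, regression certificate |
| Workflow replay / shadow / parallel | Obtains production-like evidence at different levels of risk exposure | replay trace, shadow report, parallel comparison |
| Defect governance | Prevents defects from being settled in meetings, handled verbally or underestimated | defect taxonomy, severity decision table |
| Release certification | Consolidates business, risk, technical and operational evidence to support the release decision | certification memo, evidence index |
| Exception acceptance | Accepts residual risk in a structured way | risk acceptance record, expiry, compensating control |
| Operational readiness | Ensures people, process, monitoring and fallback are in place after release | runbook, training, support, rollback drill |
| Post-release loop | Turns production signals into regression assets | monitoring dashboard, eval expansion, CAPA |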

3. Operating Model

3.1 Productized UAT Team Topology

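| Team / role | Primary responsibility | Key output |
| --- | --- | --- |
| Product owner | Defines business objectives, scope, release decision and customer/business value | release scope, acceptance claim |
| Senior BA | Writes policies, processes, exceptions, journeys and business judgment as acceptance contracts | acceptance criteria, journey scripts, coverage matrix |
| QA / Test architect | Designs automated regression, the test asset library and the defect workflow | test plan, regression suite, defect summary |
| AI / ML owner | Provides model/prompt/RAG/tool versions, eval results and known limitations | AI object inventory, eval report |
| Solution architect | Ensures evidence capture, logging, versioning, rollback and monitoring are built into the architecture | release architecture, rollback design |
| Operations owner | Confirms manual queues, training, runbook, support and degraded mode are ready | ops readiness packet |
| Risk / Compliance / Control owner | Reviews high-risk scenarios, controls, exceptions and monitoring | control evidence, risk acceptance |
| Release manager | Organizes the deployment window, go/no-go, feature flags and rollback drills | release checklist, deployment record |
| Internal audit liaison | Observes evidence quality and traceability; does not own release approval | evidence quality feedback |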

3.2 RACI

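| Activity | PM | Senior BA | QA | Architect | AI Owner | Ops | Risk/Compliance | Release |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acceptance claim | A/R | R | C | C | C | C | C | I |
| Acceptance criteria | A | R | C | C | C | C | C | I |
| Golden journeys | A | R | R | C | C | C | C | I |
| Synthetic packs | C | R | R | C | C | C | C | I |
| AI regression scope | C | C | R | A/R | A/R | I | C | C |
| Workflow replay | C | R | A/R | C | C | C | I | I |
| Shadow / parallel run | A | R | R | C | C | A/R | C | C |
| Defect severity | A | R | R | C | C | C | C | I |
| Risk acceptance | C | R | C | C | C | C | A/R | I |
| Release certification | A/R | R | R | R | R | R | C | A/R |
| Operational readiness | C | C | C | C | I | A/R | C | C |
| Rollback decision | A | C | C | R | C | R | C | A/R |
| Post-release monitoring | A | C | C | C | R | R | C | I |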

Legend: A = accountable, R = responsible, C = consulted, I = informed.


4. End-to-End Workflow

4.1 Acceptance Planning

release candidate identified
  -> acceptance claim drafted
  -> scope and population confirmed
  -> acceptance criteria created
  -> risk/control coverage mapped
  -> AI object changes identified
  -> test assets selected or created
  -> evidence capture requirements locked

Control questions:

  • Which business capability is being certified?
  • Which model/prompt/RAG/tool/workflow versions are in scope?
  • Which customers, employees, products, channels and regions are exposed?
  • Which high-risk journeys and segments require evidence?
  • Which criteria block release and which can be accepted by exception?
  • What production monitoring and rollback criteria must exist before release?

4.2 Test Asset Preparation

| Asset | Preparation rule |
| --- | --- |
| Golden journeys | Each journey has objective, persona, steps, expected AI behavior, controls and evidence fields |
| Synthetic packs | Each pack has generation method, privacy classification, expected output and reviewer |
| Replay datasets | Historical sample is frozen, de-identified where required, and linked to production version |
| Prompt / RAG eval set | Known-answer, no-answer, stale-source, citation and adversarial cases included |
| Tool trajectory set | Correct tool, wrong tool, unauthorized tool, failed tool and retry cases included |
| Accessibility pack | Screen reader, keyboard, contrast, plain-language and alternate-channel cases included |

4.3 Execution

automated regression
  -> AI eval and tool trajectory tests
  -> business UAT session
  -> workflow replay
  -> shadow or parallel run if required
  -> operational readiness drill
  -> defect triage and retest
  -> evidence bundle generation

4.4 Certification Decision

criteria results complete
  -> evidence index reviewed
  -> open defects dispositioned
  -> exceptions accepted or release held
  -> monitoring and rollback confirmed
  -> certification memo approved
  -> release / pilot / hold / rollback decision recorded

5. Artifact Templates

5.1 Acceptance Criteria Contract

| Field | Required content |
| --- | --- |
| acceptance_id | Stable ID, e.g. ACC-FRAUD-COPILOT-023 |
| capability | Business capability, not component name |
| release scope | Product, channel, segment, region, user role, feature flag |
| AI object versions | model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow |
| business outcome | measurable business result |
| decision impact | recommendation, ranking, summary, automation, tool action, customer communication |
| success criteria | metric, threshold, sample, owner |
| risk criteria | prohibited outputs, escalation rules, customer harm prevention |
| control criteria | HITL, logging, access, policy gate, approval, retention |
| evidence required | eval report, journey result, replay trace, shadow/parallel comparison, approval |
| release blocker | yes/no and condition |
| recertification trigger | model/prompt/RAG/tool/workflow/policy/data change, monitoring breach, incident |
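
As a minimal sketch of how the contract above could be held as a structured record rather than free text, the Python below models a subset of the fields and flags blocking criteria that still lack an owner, metric, threshold or evidence. The class name `AcceptanceCriterion`, the helper `unready_blockers` and the flat field layout are illustrative assumptions, not a schema this playbook prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceCriterion:
    """Hypothetical record covering a subset of the contract fields."""
    acceptance_id: str
    capability: str
    owner: str = ""
    metric: str = ""
    threshold: str = ""
    evidence_required: list[str] = field(default_factory=list)
    release_blocker: bool = False

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "owner": self.owner,
            "metric": self.metric,
            "threshold": self.threshold,
            "evidence_required": self.evidence_required,
        }
        return [name for name, value in required.items() if not value]


def unready_blockers(criteria: list[AcceptanceCriterion]) -> dict[str, list[str]]:
    """Map each blocking criterion that has gaps to its missing fields."""
    return {
        c.acceptance_id: c.missing_fields()
        for c in criteria
        if c.release_blocker and c.missing_fields()
    }
```

A registry built this way can refuse to lock the test plan while `unready_blockers` returns anything, which is one way to enforce the "criteria before cases" guardrail.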

5.2 Golden Journey Card

| Field | Required content |
| --- | --- |
| journey_id | Stable journey ID |
| name | business-readable journey name |
| persona / segment | customer or employee profile and risk segment |
| channel | mobile, web, branch, contact center, back office, API |
| preconditions | data state, account status, permissions, feature flag |
| journey steps | end-to-end business actions |
| AI touchpoints | model output, retrieval, tool call, summary, recommendation |
| expected behavior | desired action, refusal, escalation, human review |
| controls | policy checks, HITL, logging, masking, dual control |
| evidence | trace IDs, test run IDs, screenshots only if necessary, data proof, reviewer |
| linked criteria | acceptance_ids and risk/control IDs |

5.3 Synthetic Transaction Pack Card

| Field | Required content |
| --- | --- |
| pack_id | Stable ID |
| scenario family | boundary, rare event, protected segment, privacy, adversarial, operational |
| generation method | rule-based, sampled, transformed, expert-authored |
| source pattern | production pattern, policy scenario, risk typology |
| privacy classification | no sensitive data, sanitized, tokenized, restricted |
| expected output | exact or rubric-based expected behavior |
| prohibited output | what must never happen |
| reviewer | business/risk/control owner |
| coverage | linked journeys, criteria, controls |
| version | effective date, changes, retired cases |

5.4 AI Regression Certificate

| Field | Required content |
| --- | --- |
| certificate_id | Stable release certificate section ID |
| changed objects | model, prompt, retriever, corpus, tool schema, workflow, policy, data, UI/API |
| unchanged impacted objects | dependencies that may be affected |
| regression assets run | eval sets, golden journeys, synthetic packs, tool tests, workflow replay |
| pass/fail | threshold results and failures |
| defect links | open and closed defects |
| comparison baseline | previous release, production baseline, control group |
| risk owner review | owner, date, decision |
| recertification scope | what must rerun for next change |

5.5 Release Certification Memo

| Section | Required content |
| --- | --- |
| Executive summary | release decision and residual risk in business language |
| Scope | population, channels, products, roles, feature flags |
| Version identity | model/prompt/RAG/tool/workflow/config/build versions |
| Acceptance results | criteria, threshold, evidence, owner decision |
| Coverage | journey, segment, risk/control, AI regression coverage |
| Defects | severity, root cause, retest, disposition |
| Exceptions | risk accepted, owner, expiry, compensating control |
| Operational readiness | runbook, training, support, capacity, BCM/degraded mode |
| Monitoring | post-release metrics, thresholds, review cadence |
| Rollback | trigger, owner, technical path, business communication |
| Sign-off | business, technology, operations, risk/control, release |

5.6 Risk Acceptance Record

| Field | Required content |
| --- | --- |
| exception_id | Stable ID |
| linked release / criteria / defect | IDs, not free text |
| residual risk | clear risk statement |
| impacted population | segment, channel, product, employee role |
| compensating control | manual review, rate limit, monitoring, fallback, customer recourse |
| expiry or scale gate | date, batch size, exposure cap, trigger |
| approver | role with authority |
| monitoring | metric, threshold, owner, cadence |
| closure | fix, retest, recertification evidence |
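
Because every accepted risk carries an expiry, the record lends itself to a simple aging check. The sketch below is one hypothetical way to list open exceptions that have expired or are about to; the class `RiskAcceptance`, the function `exceptions_needing_action` and the 14-day warning window are assumptions made for illustration, and only date-based expiry is modeled (not batch-size or exposure-cap gates).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RiskAcceptance:
    """Hypothetical subset of the risk acceptance record."""
    exception_id: str
    approver: str
    expiry: date
    compensating_control: str
    closed: bool = False


def exceptions_needing_action(
    records: list[RiskAcceptance], today: date, warn_days: int = 14
) -> dict[str, list[str]]:
    """Split open exceptions into expired and expiring-soon buckets."""
    expired: list[str] = []
    expiring_soon: list[str] = []
    for record in records:
        if record.closed:
            continue  # closed exceptions no longer need a decision
        days_left = (record.expiry - today).days
        if days_left < 0:
            expired.append(record.exception_id)
        elif days_left <= warn_days:
            expiring_soon.append(record.exception_id)
    return {"expired": expired, "expiring_soon": expiring_soon}
```

The output could feed the exception inventory tile of a readiness dashboard, with the actual fix-or-re-accept decision left to the authorized approver.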

5.7 Test Data Governance Card

| Field | Required content |
| --- | --- |
| dataset_id | UAT/replay/synthetic dataset ID |
| data source | production extract, synthetic generation, vendor sample, expert-authored |
| data classification | public, internal, confidential, restricted, regulated |
| sanitization | masking, tokenization, NULLing, substitution, aggregation |
| allowed environment | UAT, secure sandbox, local prohibited, vendor prohibited |
| retention | retention period and deletion evidence |
| access | roles, approvals, logs |
| AI model routing | approved models/environments only |
| evidence use | which tests and criteria rely on this data |

5.8 Evidence Index

| Field | Required content |
| --- | --- |
| evidence_id | Stable ID |
| evidence_type | test_run, eval_report, UAT_session, replay_trace, shadow_report, approval |
| release_id | release and feature flag |
| linked criteria/control | IDs |
| object versions | model, prompt, corpus, tool, workflow |
| produced_by | system, tester, business user, AI assistant |
| reviewer | accountable human reviewer |
| timestamp | creation and approval time |
| integrity | checksum, immutable store, access log |
| retention | retention class |
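
The integrity field can be illustrated with a content checksum. The following is a rough sketch, assuming evidence payloads are available as bytes and that SHA-256 is an acceptable digest for the institution; `EvidenceRecord`, `register_evidence` and `verify_integrity` are hypothetical names, and an immutable store and access log would still be needed alongside it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical subset of the evidence index fields."""
    evidence_id: str
    evidence_type: str
    release_id: str
    linked_ids: tuple[str, ...]
    reviewer: str
    sha256: str


def register_evidence(
    evidence_id: str,
    evidence_type: str,
    release_id: str,
    linked_ids: tuple[str, ...],
    reviewer: str,
    payload: bytes,
) -> EvidenceRecord:
    """Create an index entry whose checksum is derived from the payload."""
    digest = hashlib.sha256(payload).hexdigest()
    return EvidenceRecord(
        evidence_id, evidence_type, release_id, linked_ids, reviewer, digest
    )


def verify_integrity(record: EvidenceRecord, payload: bytes) -> bool:
    """Return True only if the payload still matches the stored checksum."""
    return hashlib.sha256(payload).hexdigest() == record.sha256
```

Recomputing the digest at certification time gives reviewers a cheap way to confirm that an eval report or replay trace has not changed since it was indexed.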

6. Coverage Matrices

6.1 Acceptance-to-Evidence Matrix

| Acceptance criterion | Golden journey | Synthetic pack | AI regression | Control evidence | Status |
| --- | --- | --- | --- | --- | --- |
| AI must escalate uncertain KYC document | GJ-KYC-004 | SYN-KYC-BOUNDARY-002 | prompt + tool + workflow | HITL log, reviewer queue | covered |
| RAG answer must cite current fee policy | GJ-CS-011 | SYN-RAG-STALE-001 | corpus + retriever | citation eval, source version | covered |
| Fraud alert summary must not close case | GJ-FRAUD-021 | SYN-TOOL-DENY-003 | tool gateway | tool audit trace | covered |
| Spanish mobile onboarding must preserve recourse | GJ-ONB-ESP-007 | SYN-ACCESS-005 | UI + prompt | language QA, appeal path | covered |
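
A matrix like this can be checked mechanically so that the status column is derived instead of typed by hand. The sketch below assumes the matrix is held as a dictionary keyed by criterion, with column names chosen here for illustration; `coverage_gaps` is a hypothetical helper, not part of any existing tool.

```python
REQUIRED_COLUMNS = (
    "golden_journey",
    "synthetic_pack",
    "ai_regression",
    "control_evidence",
)


def coverage_gaps(matrix: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return, per criterion, the coverage columns that are empty or absent."""
    gaps: dict[str, list[str]] = {}
    for criterion, row in matrix.items():
        missing = [
            column for column in REQUIRED_COLUMNS
            if not row.get(column, "").strip()
        ]
        if missing:
            gaps[criterion] = missing
    return gaps


matrix = {
    "RAG answer must cite current fee policy": {
        "golden_journey": "GJ-CS-011",
        "synthetic_pack": "SYN-RAG-STALE-001",
        "ai_regression": "corpus + retriever",
        "control_evidence": "citation eval, source version",
    },
    # Illustrative incomplete row: no synthetic pack or control evidence yet.
    "Fraud alert summary must not close case": {
        "golden_journey": "GJ-FRAUD-021",
        "ai_regression": "tool gateway",
    },
}
print(coverage_gaps(matrix))
```

A criterion counts as covered only when `coverage_gaps` returns nothing for it; any AI-suggested links would still need BA/QA confirmation before they enter the matrix.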

6.2 Segment Coverage Matrix

| Segment axis | Required examples | Evidence |
| --- | --- | --- |
| Customer value/risk | low balance, high value, high-risk AML, fraud watch | journey run and segment eval |
| Language | English, Spanish, high-volume non-English path | translated source and output QA |
| Accessibility | screen reader, keyboard only, plain-language error | accessibility evidence |
| Channel | mobile, web, branch, contact center, back office | channel-specific trace |
| Product | deposit, credit card, loan, payment, dispute | product journey coverage |
| Employee role | analyst, supervisor, QA reviewer, admin | permission and workflow tests |

6.3 Risk-Control-Test-Evidence Matrix

| Risk | Control | Test asset | Evidence | Release implication |
| --- | --- | --- | --- | --- |
| Unsupported claim in customer answer | RAG source grounding and citation required | known-answer and no-answer pack | citation correctness report | blocker if high-risk topic fails |
| Unauthorized tool action | tool gateway policy and HITL | tool trajectory negative cases | denied action logs | blocker for agent release |
| Segment harm | segment-specific thresholds and recourse | vulnerable customer pack | segment result and appeal path | pilot cap if amber |
| Queue overload | exposure cap and capacity monitor | parallel run workload comparison | queue forecast and ops sign-off | cap rollout if workload high |
| Missing audit trail | trace ID and version capture | evidence completeness query | log sample and evidence index | no release if critical fields missing |

7. AI Regression Decision Tables

7.1 Regression Scope by Change

| Change | Minimum regression | Additional tests for high-risk AI |
| --- | --- | --- |
| Prompt wording | prompt golden set, refusal, citation, tone | adversarial prompts, high-risk journeys |
| Model version | full eval, segment eval, latency/cost | shadow or parallel comparison |
| RAG corpus | retrieval eval, source freshness, stale-source tests | regulatory/customer-facing narrative QA |
| Retriever / embedding | recall/ranking benchmark | no-answer and edge source cases |
| Tool schema/API | contract test, positive/negative tool trajectory | side-effect sandbox, HITL approval |
| Policy rule | decision table regression | segment and jurisdiction matrix |
| Workflow state | state transition and exception replay | queue/capacity and fallback |
| Data pipeline | DQ, distribution shift, backfill/replay | model input drift, bias/segment impact |
| UI/API | functional and accessibility regression | operator error, explanation comprehension |
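
The change-to-regression mapping can be encoded as a lookup so that a release's changed-object list produces its required test scope. This is a sketch under assumptions: only four of the nine change types are included for brevity, the dictionary keys (`prompt`, `model`, `rag_corpus`, `tool_schema`) are invented identifiers, and `regression_scope` is a hypothetical helper.

```python
MINIMUM_REGRESSION = {
    "prompt": ["prompt golden set", "refusal", "citation", "tone"],
    "model": ["full eval", "segment eval", "latency/cost"],
    "rag_corpus": ["retrieval eval", "source freshness", "stale-source tests"],
    "tool_schema": ["contract test", "positive/negative tool trajectory"],
}

HIGH_RISK_EXTRA = {
    "prompt": ["adversarial prompts", "high-risk journeys"],
    "model": ["shadow or parallel comparison"],
    "rag_corpus": ["regulatory/customer-facing narrative QA"],
    "tool_schema": ["side-effect sandbox", "HITL approval"],
}


def regression_scope(changed_objects: list[str], high_risk: bool = False) -> list[str]:
    """Return the de-duplicated regression assets required for a change set."""
    scope: list[str] = []
    for obj in changed_objects:
        if obj not in MINIMUM_REGRESSION:
            # Unknown change types fail loudly rather than silently skipping tests.
            raise KeyError(f"no regression rule for changed object: {obj}")
        assets = MINIMUM_REGRESSION[obj] + (HIGH_RISK_EXTRA[obj] if high_risk else [])
        for asset in assets:
            if asset not in scope:
                scope.append(asset)
    return scope
```

Raising on an unmapped change type is a deliberate choice in this sketch: a new kind of AI object should force someone to define its regression rule before the release can be scoped.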

7.2 Certification Decision

| Condition | Decision |
| --- | --- |
| All blocking criteria met, no critical/high open defect, ops ready | full release |
| Blocking criteria met, amber workload or segment signal within cap | limited pilot with exposure cap |
| Non-blocking defect accepted by authorized owner with expiry | exception release |
| critical defect, missing key control evidence, or no rollback path | hold |
| production early signal breaches stop rule | rollback or disable feature flag |
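
The pre-release rows of this table can be read as an ordered rule set. The sketch below is one possible encoding, not the authoritative logic: the evaluation order (hold first, then pilot, then exception, then full release), the treatment of any open high defect as a hold, and the names `ReleaseState` and `certification_decision` are all assumptions. The last table row is a post-release condition and is left out.

```python
from dataclasses import dataclass, replace


@dataclass
class ReleaseState:
    """Hypothetical summary of the inputs to a certification decision."""
    blocking_criteria_met: bool
    open_critical_defects: int
    open_high_defects: int  # assumed to mean high defects not fixed or accepted
    key_control_evidence_complete: bool
    rollback_path_exists: bool
    ops_ready: bool
    amber_signal_within_cap: bool = False
    accepted_exceptions_with_expiry: int = 0


def certification_decision(state: ReleaseState) -> str:
    """Apply the decision table in an assumed most-restrictive-first order."""
    hold_conditions = (
        state.open_critical_defects > 0,
        state.open_high_defects > 0,
        not state.blocking_criteria_met,
        not state.key_control_evidence_complete,
        not state.rollback_path_exists,
        not state.ops_ready,
    )
    if any(hold_conditions):
        return "hold"
    if state.amber_signal_within_cap:
        return "limited pilot with exposure cap"
    if state.accepted_exceptions_with_expiry > 0:
        return "exception release"
    return "full release"


baseline = ReleaseState(True, 0, 0, True, True, True)
pilot = replace(baseline, amber_signal_within_cap=True)
print(certification_decision(baseline))  # full release
print(certification_decision(pilot))     # limited pilot with exposure cap
```

The function only proposes a decision label; recording and approving that decision stays with the authorized owners named in the certification memo.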

7.3 Defect Severity

| Severity | Signal | Required action |
| --- | --- | --- |
| Critical | prohibited output, privacy breach, unauthorized action, critical control failure | block release, freeze evidence, executive escalation |
| High | high-risk journey failure, repeated segment harm, monitoring gap | fix or formal exception committee |
| Medium | contained failure with compensating control | fix before scale, monitor in pilot |
| Low | low-risk UI/text issue without customer/control impact | backlog with owner and target release |

7.4 Rollback Triggers

| Trigger | Rollback action |
| --- | --- |
| prohibited output in production | disable AI output path, preserve logs, incident workflow |
| tool gateway unauthorized action | disable affected tool integration, revert policy/tool schema |
| citation correctness below threshold | revert corpus/retriever or disable grounded answer feature |
| manual queue exceeds capacity limit | reduce cohort, switch to manual triage |
| segment complaint spike | pause affected segment, expand review, notify authorized owners |
| evidence capture failure | pause rollout, restore prior version, assess audit gap |
| vendor model degradation | route to fallback model or manual process |
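
Rollback triggers are easier to rehearse when each one is expressed as a signal plus a breach test. The sketch below covers three of the triggers; the signal names, the 0.95 citation threshold and the 1.0 queue utilization limit are placeholder assumptions, since the real thresholds are set by the authorized owners.

```python
# Each rule: (signal name, breach test, rollback action from the table above).
# Thresholds are illustrative placeholders, not recommended values.
ROLLBACK_RULES = [
    (
        "prohibited_outputs",
        lambda value: value > 0,
        "disable AI output path, preserve logs, incident workflow",
    ),
    (
        "citation_correctness",
        lambda value: value < 0.95,
        "revert corpus/retriever or disable grounded answer feature",
    ),
    (
        "manual_queue_utilization",
        lambda value: value > 1.0,
        "reduce cohort, switch to manual triage",
    ),
]


def triggered_actions(signals: dict[str, float]) -> list[str]:
    """Return the rollback actions whose trigger is breached by the signals."""
    return [
        action
        for name, breached, action in ROLLBACK_RULES
        if name in signals and breached(signals[name])
    ]
```

For example, `triggered_actions({"prohibited_outputs": 0, "citation_correctness": 0.9})` would return only the corpus/retriever action, which can then be routed to the rollback owner.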

8. Workflow Replay, Shadow Testing, Parallel Run

8.1 Workflow Replay Pack

| Field | Required content |
| --- | --- |
| replay_id | Stable run ID |
| event source | historical, synthetic, mixed |
| frozen input version | dataset and business event version |
| expected state transitions | states, owners, SLAs |
| AI actions | recommendation, retrieval, tool call, refusal, escalation |
| control checks | policy, authorization, logging, HITL |
| comparison | expected vs actual |
| defect links | deviations and retest |

8.2 Shadow Test Report

| Section | Required content |
| --- | --- |
| Traffic scope | cohort, channel, product, duration |
| Non-impact proof | how shadow output was prevented from affecting production decision |
| Output quality | accuracy, citation, refusal, unsupported claim |
| Risk signals | policy violations, privacy events, segment issues |
| Operations | latency, cost, error, queue projection |
| Decision | continue shadow, move to pilot, expand tests, hold |

8.3 Parallel Run Report

| Section | Required content |
| --- | --- |
| Baseline | current process/model/manual decision |
| New process | AI-assisted or automated path |
| Comparison | decision delta, reason delta, workload delta, segment delta |
| Human adjudication | where deltas were reviewed and how final truth was assigned |
| Customer impact | potential harm, recourse, communication |
| Control impact | HITL, logging, override, approval |
| Release decision | full, pilot, hold, exception |
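
The decision delta and segment delta in the comparison section reduce to a per-segment disagreement rate between the baseline and the new process. A minimal sketch, assuming each parallel-run case is available as a `(segment, baseline_decision, new_decision)` tuple; the function name `decision_delta` and the tuple layout are illustrative.

```python
from collections import defaultdict
from typing import Iterable


def decision_delta(cases: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """Return, per segment, the share of cases where the two paths disagree."""
    totals: dict[str, int] = defaultdict(int)
    deltas: dict[str, int] = defaultdict(int)
    for segment, baseline, new in cases:
        totals[segment] += 1
        if baseline != new:
            deltas[segment] += 1
    return {segment: deltas[segment] / totals[segment] for segment in totals}


cases = [
    ("retail", "approve", "approve"),
    ("retail", "approve", "refer"),
    ("sme", "refer", "refer"),
    ("sme", "decline", "decline"),
]
print(decision_delta(cases))  # {'retail': 0.5, 'sme': 0.0}
```

A disagreement rate says nothing about which path was right, so the flagged cases still go to human adjudication before the delta is used in a release decision.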

9. Evidence / Control Checklist

9.1 Pre-Release Checklist

| Check | Pass standard |
| --- | --- |
| Acceptance criteria defined | all blocking criteria have owner, metric, threshold and evidence |
| Version identity locked | model, prompt, RAG, tool, policy, workflow and feature flag recorded |
| Golden journeys run | high-risk and material journeys executed with trace evidence |
| Synthetic packs reviewed | expected outputs approved by business/risk owner |
| Segment coverage complete | material and vulnerable segments have evidence |
| AI regression complete | changed and impacted objects tested |
| Defects dispositioned | no critical open; high defects fixed or formally accepted |
| Test data controlled | sensitive data protected, synthetic data governed, access logged |
| Accessibility covered | relevant digital and language accessibility tests passed |
| Operational readiness ready | runbook, training, support, capacity, fallback |
| Monitoring configured | metrics, thresholds, owners and cadence active |
| Rollback rehearsed | technical and business rollback path verified |
| Certification memo approved | release decision and residual risks signed by owners |

9.2 Evidence Quality Rubric

| Quality dimension | Good evidence |
| --- | --- |
| Complete | covers scope, criteria, version, result, owner |
| Accurate | produced from system logs or controlled test outputs |
| Timely | generated during workflow, not reconstructed later |
| Traceable | linked to release, criteria, control and defect IDs |
| Reproducible | input pack and versions allow rerun |
| Reviewed | accountable human reviewer recorded |
| Protected | sensitive data minimized and access-controlled |
| Durable | retained with integrity and retention metadata |

9.3 Management Readiness MI

| Dashboard tile | RAG (red/amber/green) signal |
| --- | --- |
| Release certification status | criteria passed, blocked, exceptioned |
| High-risk journey coverage | executed, failed, not run |
| AI regression status | changed objects covered or missing |
| Segment and accessibility coverage | material gaps and amber signals |
| Defect severity | critical/high open, aging, repeat root cause |
| Exception inventory | accepted risk, expiry, compensating control |
| Operational readiness | training, runbook, support, capacity |
| Monitoring readiness | metrics live, thresholds approved |
| Rollback readiness | drill completed, owner assigned |

10. AI Assist Guardrails

| AI use in UAT | Allowed output | Required guardrail |
| --- | --- | --- |
| Test case generation | candidate cases and edge cases | human approves expected behavior |
| Coverage gap analysis | unmapped criteria/controls/tests | traceability graph remains source of truth |
| Defect clustering | likely root cause and duplicate groups | severity and closure by accountable human |
| Requirement-test mapping | suggested matrix links | BA/QA confirms links |
| Evidence summarization | draft certification summary | cites evidence IDs, human approval |
| Synthetic scenario generation | candidate synthetic records/scenarios | privacy review, reviewer approval |
| UAT transcript analysis | pain points, confusion, training signals | access control and PII minimization |

Non-negotiable boundaries:

  • AI does not sign UAT.
  • AI does not approve release.
  • AI does not accept residual risk.
  • AI does not close defects.
  • AI does not decide legal/compliance applicability.
  • AI does not fabricate missing evidence.
  • AI does not send sensitive test data to unapproved tools.

11. Implementation Guardrails

| Guardrail | Practical rule |
| --- | --- |
| Criteria before cases | no test asset accepted without linked acceptance criteria |
| Risk before coverage | high-risk journey and segment coverage outranks raw test count |
| Version every AI object | model, prompt, corpus, retriever, tool, policy and workflow must be recorded |
| Evidence by design | trace IDs, logs and approvals are generated during execution |
| Synthetic with governance | generated data must have business rationale and expected output |
| No uncontrolled production data | production data in UAT requires need, controls, access and retention |
| Defect decisions structured | severity, impact, root cause, retest and disposition required |
| Exceptions expire | every accepted risk has owner, expiry and compensating control |
| Operational readiness is blocking | no release without support, training, monitoring and rollback |
| Production signals update tests | incidents and complaints expand golden journeys or synthetic packs |

12. 30-60-90 Day Roadmap

Days 1-30: Stabilize Acceptance Foundations

| Workstream | Outcome |
| --- | --- |
| Inventory AI releases | top AI use cases, release types, model/prompt/RAG/tool owners |
| Define acceptance contract | template for business, risk, control, ops and rollback criteria |
| Build initial golden journeys | top 15-25 material and high-risk journeys |
| Create defect taxonomy | severity, root cause, customer/control impact |
| Stand up evidence index | release ID, criteria ID, evidence ID conventions |

Days 31-60: Engineer Regression and Evidence

| Workstream | Outcome |
| --- | --- |
| Build synthetic packs | boundary, rare event, segment, privacy, adversarial, operational packs |
| Add AI object regression | model/prompt/RAG/tool/workflow change matrix |
| Implement workflow replay | traceable replay for key journeys |
| Structure certification memo | criteria, coverage, defects, exceptions, ops readiness |
| Define monitoring and rollback | metrics, thresholds, owners, kill switch path |

Days 61-90: Prove and Scale

| Workstream | Outcome |
| --- | --- |
| Run shadow or parallel pilots | comparison evidence for selected high-risk use case |
| Launch readiness MI | release certification, defects, exceptions, monitoring, rollback dashboard |
| Automate evidence capture | test run, eval, logs, approvals, version metadata |
| Close repeat root causes | production signals update regression packs |
| Formalize recertification policy | triggers for model/prompt/RAG/tool/workflow/data changes |

13. Interview-Ready Answers

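Q1: How would you redesign AI UAT?

30-second version: I would turn AI UAT from a business sign-off into an evidence architecture. First define the acceptance claim and criteria, then build the golden journeys, synthetic packs, segment coverage, risk/control coverage and AI regression matrix. After running workflow replay and a shadow or parallel run, use the release certification memo to record evidence, defects, exceptions, operational readiness, monitoring and rollback.

2-minute version: AI UAT has to prove more than that the feature works; it has to prove that the business capability is acceptable under specific model, prompt, RAG, tool, workflow and control versions. My process starts from the acceptance criteria registry, where every criterion has an owner, a threshold, evidence and a release blocker flag. Test assets include end-to-end golden journeys, synthetic transaction packs, a persona/segment matrix, and adversarial and accessibility cases. Regression covers model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow state and data pipeline. Defects are classified by customer impact and control impact. Before release, a certification memo is produced that covers residual risk, exception approval, operational readiness, monitoring and rollback criteria.

Q2: Why can't UAT be completed just by business users clicking through pages?

30-second version: Because AI risk often sits in model outputs, retrieval sources, tool calls, control logs, human review, segment performance and operational load, and does not necessarily surface in page clicks. Business user participation matters, but it must be embedded in the evidence architecture.

2-minute version: For example, a customer-service AI assistant's page can display an answer, but that does not mean the answer is based on current policy, cites accurately or stays within the boundary of advice. Nor does it mean that Spanish-speaking customers, vulnerable customers, complaint escalation and system timeout paths are covered. UAT should place business users inside golden journeys while capturing trace ID, source citation, model/prompt version, policy decision, tool call, HITL approval and defect evidence. Business judgment remains a human responsibility, but that judgment must be bound to auditable evidence.

Q3: How do you design AI regression certification?

30-second version: First identify the changed objects: model, prompt, RAG, retriever, tool, policy, workflow, data, UI/API. Map each type of change to its minimum regression assets and its additional high-risk tests. On passing, generate an AI regression certificate that records versions, tests, defects, exceptions and the next recertification triggers.

2-minute version: A prompt change must run refusal, citation, tone and high-risk journeys; a RAG corpus change must run retrieval eval, stale-source and citation correctness; a tool schema change must run positive/negative trajectory, authorization and side-effect sandbox; a model version change must run full eval, segment eval and latency/cost, plus a shadow or parallel run for high-risk scenarios. Certification is not a single aggregate score but a chain: changed object -> impacted behavior -> regression asset -> evidence -> owner decision.

Q4: How is evidence from shadow testing and parallel run used in the release decision?

30-second version: Shadow testing shows the AI's output quality and risk signals when it runs alongside real traffic without acting on it; parallel run compares the old and new processes on decision differences, manual workload, segment impact and control effectiveness. Both must have thresholds defined in advance.

2-minute version: Shadow suits low-interference observation, such as customer-service suggestions, RAG answers or analyst summaries, but you must prove that the shadow output did not influence real decisions. Parallel run suits KYC, fraud, credit or operational workflows, because the baseline and the new process need to be compared. The release decision looks at whether the decision delta is adequately explained, whether workload is within capacity, whether high-risk segments are stable, whether controls are fully recorded, and whether customer harm signals are acceptable. A shadow or parallel run without comparison criteria cannot support certification.

Q5: How far can AI help with UAT?

30-second version: AI can generate candidate tests, find coverage gaps, cluster defects, map requirements to tests and summarize evidence. AI cannot give final acceptance, accept risk, close defects, approve a release, or replace audit or model validation.

2-minute version: I would use AI to make UAT analysis more efficient, but with hard boundaries. AI-generated test cases must have their expected outcome confirmed by the BA, business or risk owner. AI coverage gap suggestions must be reconciled against the traceability matrix. Defect clustering can only be a triage input; severity and disposition are decided by humans. The release memo can be drafted by AI, but it must cite evidence IDs and be approved by an authorized owner. That way AI helps the team see problems faster without taking over the chain of accountability.

Q6: How do you handle a request to release with defects?

30-second version: First decide whether the defect is blocking. Critical does not ship; High needs a fix or a formal exception committee; Medium can go to a limited rollout with compensating controls; Low goes to the backlog. Every exception must have an owner, expiry, monitoring and closure criteria.

2-minute version: I will not use the words "business accepted" to cover up a defect. A defect record must link to acceptance criteria, journey, segment, AI object version, customer/control impact and root cause. If the decision is an exception release, a risk acceptance record is required: residual risk, impacted population, compensating control, exposure cap, expiry, approver, monitoring and closure. Post-release monitoring must be able to verify that the risk has not grown, and before expiry the defect must be fixed or the risk re-accepted.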


14. Quality Bar

The minimum bar for an AI UAT / Regression Certification system:

For any AI release,
the team can reconstruct:
  what business capability was accepted,
  which criteria and risks were in scope,
  which journeys, segments and synthetic scenarios were tested,
  which model/prompt/RAG/tool/workflow versions were certified,
  which defects and exceptions remained,
  who accepted the residual risk,
  what monitoring was active,
  and what conditions would trigger rollback or recertification.

If this chain breaks, UAT is not a business acceptance architecture; it is just a release ceremony.