AI UAT / Regression Certification / Business Acceptance Playbook
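Positioning: An advanced implementation playbook for AI PM / Senior BA / Product Architecture / Solution Architecture / QA Governance / AI Governance / Operational Risk / Release Management.
Goal: Turn AI UAT, business acceptance criteria, golden journeys, synthetic transactions, persona/segment coverage, risk/control coverage, AI regression, workflow replay, shadow testing, parallel run, defect triage, release certification, exception/risk acceptance, test data governance, operational readiness, post-release monitoring and rollback into an executable, auditable and reusable business acceptance system.
Core thesis: The deliverable of UAT is not a sign-off but a release acceptance evidence bundle.
Boundary statement: This playbook is not legal advice, a compliance conclusion, an audit conclusion, a model validation report, an information security certification or a regulatory commitment. In a real project, applicability, control requirements, test depth, risk acceptance authority, release approval, customer communication, rollback and incident handling are decided by the institution's authorized roles. This document only turns official anchors and architecture practice into artifacts that PMs, BAs and architects can implement. Access date: 2026-06-30.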
1. Source Anchors
2. Capability Map
Business objective and risk context
-> acceptance criteria registry
-> golden journey library
-> synthetic transaction and persona packs
-> risk/control coverage matrix
-> AI object regression matrix
-> workflow replay / shadow / parallel run
-> defect triage and retest evidence
-> release certification packet
-> exception and risk acceptance workflow
-> operational readiness and rollback controls
-> post-release monitoring and recertification loop
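| Capability | Why it matters | Core artifacts |
|---|---|---|
| Acceptance criteria registry | Turns business acceptance from verbal judgment into a testable, signable, traceable object | acceptance criteria contract, owner map |
| Golden journey library | Proves the capability is acceptable through end-to-end business scenarios | golden journey card, journey coverage matrix |
| Synthetic transaction packs | Covers rare, boundary, privacy-sensitive and high-risk situations | synthetic pack card, expected output review |
| Persona / segment coverage | Proves the capability works for more than the average user | segment matrix, inclusive/accessibility cases |
| Risk/control coverage | Moves business-risk and control proof forward into UAT | risk-control-test-evidence matrix |
| AI regression | Covers model, prompt, RAG, tool, policy and workflow changes | AI object change matrix, regression certificate |
| Workflow replay / shadow / parallel | Obtains production-like evidence at different levels of risk exposure | replay trace, shadow report, parallel comparison |
| Defect governance | Prevents defects from being settled in meetings, handled verbally or underestimated | defect taxonomy, severity decision table |
| Release certification | Consolidates business, risk, technical and operational evidence to support the release decision | certification memo, evidence index |
| Exception acceptance | Accepts residual risk in a structured way | risk acceptance record, expiry, compensating control |
| Operational readiness | Ensures people, process, monitoring and rollback are in place after release | runbook, training, support, rollback drill |
| Post-release loop | Turns production signals into regression assets | monitoring dashboard, eval expansion, CAPA |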
3. Operating Model
3.1 Productized UAT Team Topology
| Team / role | Primary responsibility | Key output |
|---|---|---|
| Product owner | Defines business objectives, scope, release decision and customer/business value | release scope, acceptance claim |
| Senior BA | Writes policies, processes, exceptions, journeys and business judgment into acceptance contracts | acceptance criteria, journey scripts, coverage matrix |
| QA / Test architect | Designs automated regression, the test asset library and the defect workflow | test plan, regression suite, defect summary |
| AI / ML owner | Provides model/prompt/RAG/tool versions, eval results and known limitations | AI object inventory, eval report |
| Solution architect | Ensures evidence capture, logging, versioning, rollback and monitoring are built into the architecture | release architecture, rollback design |
| Operations owner | Confirms manual queues, training, runbook, support and degraded mode are ready | ops readiness packet |
| Risk / Compliance / Control owner | Reviews high-risk scenarios, controls, exceptions and monitoring | control evidence, risk acceptance |
| Release manager | Organizes the deployment window, go/no-go, feature flags and rollback drills | release checklist, deployment record |
| Internal audit liaison | Observes evidence quality and traceability; does not own release approval | evidence quality feedback |
3.2 RACI
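| Activity | PM | Senior BA | QA | Architect | AI Owner | Ops | Risk/Compliance | Release |
|---|---|---|---|---|---|---|---|---|
| Acceptance claim | A/R | R | C | C | C | C | C | I |
| Acceptance criteria | A | R | C | C | C | C | C | I |
| Golden journeys | A | R | R | C | C | C | C | I |
| Synthetic packs | C | R | R | C | C | C | C | I |
| AI regression scope | C | C | R | A/R | A/R | I | C | C |
| Workflow replay | C | R | A/R | C | C | C | I | I |
| Shadow / parallel run | A | R | R | C | C | A/R | C | C |
| Defect severity | A | R | R | C | C | C | C | I |
| Risk acceptance | C | R | C | C | C | C | A/R | I |
| Release certification | A/R | R | R | R | R | R | C | A/R |
| Operational readiness | C | C | C | C | I | A/R | C | C |
| Rollback decision | A | C | C | R | C | R | C | A/R |
| Post-release monitoring | A | C | C | C | R | R | C | I |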
Legend: A = accountable, R = responsible, C = consulted, I = informed.
4. End-to-End Workflow
4.1 Acceptance Planning
release candidate identified
-> acceptance claim drafted
-> scope and population confirmed
-> acceptance criteria created
-> risk/control coverage mapped
-> AI object changes identified
-> test assets selected or created
-> evidence capture requirements locked
Control questions:
Which business capability is being certified?
Which model/prompt/RAG/tool/workflow versions are in scope?
Which customers, employees, products, channels and regions are exposed?
Which high-risk journeys and segments require evidence?
Which criteria block release and which can be accepted by exception?
What production monitoring and rollback criteria must exist before release?
4.2 Test Asset Preparation
| Asset | Preparation rule |
|---|---|
| Golden journeys | Each journey has objective, persona, steps, expected AI behavior, controls and evidence fields |
| Synthetic packs | Each pack has generation method, privacy classification, expected output and reviewer |
| Replay datasets | Historical sample is frozen, de-identified where required, and linked to production version |
| Prompt / RAG eval set | Known-answer, no-answer, stale-source, citation and adversarial cases included |
| Tool trajectory set | Correct tool, wrong tool, unauthorized tool, failed tool and retry cases included |
| Accessibility pack | Screen reader, keyboard, contrast, plain-language and alternate-channel cases included |
4.3 Execution
automated regression
-> AI eval and tool trajectory tests
-> business UAT session
-> workflow replay
-> shadow or parallel run if required
-> operational readiness drill
-> defect triage and retest
-> evidence bundle generation
4.4 Certification Decision
criteria results complete
-> evidence index reviewed
-> open defects dispositioned
-> exceptions accepted or release held
-> monitoring and rollback confirmed
-> certification memo approved
-> release / pilot / hold / rollback decision recorded
5. Artifact Templates
5.1 Acceptance Criteria Contract
| Field | Required content |
|---|---|
| acceptance_id | Stable ID, e.g. ACC-FRAUD-COPILOT-023 |
| capability | Business capability, not component name |
| release scope | Product, channel, segment, region, user role, feature flag |
| AI object versions | model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow |
| business outcome | measurable business result |
| decision impact | recommendation, ranking, summary, automation, tool action, customer communication |
| success criteria | metric, threshold, sample, owner |
| risk criteria | prohibited outputs, escalation rules, customer harm prevention |
| control criteria | HITL, logging, access, policy gate, approval, retention |
| evidence required | eval report, journey result, replay trace, shadow/parallel comparison, approval |
| release blocker | yes/no and condition |
| recertification trigger | model/prompt/RAG/tool/workflow/policy/data change, monitoring breach, incident |
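The contract fields above can be held as a structured record so that gaps are found by a check rather than in a meeting. The following is a minimal sketch, not a reference schema: the class name, the helper `missing_for_blocker`, the capability text and the threshold wording are all hypothetical, and only a subset of the contract fields is modeled. The rule it applies (a blocking criterion needs owner, metric, threshold and evidence) mirrors the pre-release checklist in section 9.1.

```python
from dataclasses import dataclass, field


@dataclass
class AcceptanceCriterion:
    """One row of the acceptance criteria registry (subset of the contract fields)."""
    acceptance_id: str
    capability: str
    metric: str = ""
    threshold: str = ""
    owner: str = ""
    evidence_required: list = field(default_factory=list)
    release_blocker: bool = False


def missing_for_blocker(criterion):
    """Return the contract fields a release-blocking criterion still lacks."""
    if not criterion.release_blocker:
        return []  # non-blocking criteria are not gated by this check
    required = {
        "owner": criterion.owner,
        "metric": criterion.metric,
        "threshold": criterion.threshold,
        "evidence_required": criterion.evidence_required,
    }
    return [name for name, value in required.items() if not value]


# Hypothetical draft criterion: owner and evidence have not been assigned yet.
draft = AcceptanceCriterion(
    acceptance_id="ACC-FRAUD-COPILOT-023",
    capability="Fraud alert summarization",          # illustrative capability name
    metric="unsupported-claim rate",
    threshold="hypothetical: <= 1% on the golden set",
    release_blocker=True,
)
print(missing_for_blocker(draft))  # ['owner', 'evidence_required']
```

A check like this could run whenever the registry changes, so an incomplete blocking criterion is visible before test assets are built against it.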
5.2 Golden Journey Card
| Field | Required content |
|---|---|
| journey_id | Stable journey ID |
| name | business-readable journey name |
| persona / segment | customer or employee profile and risk segment |
| channel | mobile, web, branch, contact center, back office, API |
| preconditions | data state, account status, permissions, feature flag |
| journey steps | end-to-end business actions |
| AI touchpoints | model output, retrieval, tool call, summary, recommendation |
| expected behavior | desired action, refusal, escalation, human review |
| controls | policy checks, HITL, logging, masking, dual control |
| evidence | trace IDs, test run IDs, screenshots only if necessary, data proof, reviewer |
| linked criteria | acceptance_ids and risk/control IDs |
5.3 Synthetic Transaction Pack Card
| Field | Required content |
|---|---|
| pack_id | Stable ID |
| scenario family | boundary, rare event, protected segment, privacy, adversarial, operational |
| generation method | rule-based, sampled, transformed, expert-authored |
| source pattern | production pattern, policy scenario, risk typology |
| privacy classification | no sensitive data, sanitized, tokenized, restricted |
| expected output | exact or rubric-based expected behavior |
| prohibited output | what must never happen |
| reviewer | business/risk/control owner |
| coverage | linked journeys, criteria, controls |
| version | effective date, changes, retired cases |
5.4 AI Regression Certificate
| Field | Required content |
|---|---|
| certificate_id | Stable release certificate section ID |
| changed objects | model, prompt, retriever, corpus, tool schema, workflow, policy, data, UI/API |
| unchanged impacted objects | dependencies that may be affected |
| regression assets run | eval sets, golden journeys, synthetic packs, tool tests, workflow replay |
| pass/fail | threshold results and failures |
| defect links | open and closed defects |
| comparison baseline | previous release, production baseline, control group |
| risk owner review | owner, date, decision |
| recertification scope | what must rerun for next change |
5.5 Release Certification Memo
| Section | Required content |
|---|---|
| Executive summary | release decision and residual risk in business language |
| Scope | population, channels, products, roles, feature flags |
| Version identity | model/prompt/RAG/tool/workflow/config/build versions |
| Acceptance results | criteria, threshold, evidence, owner decision |
| Coverage | journey, segment, risk/control, AI regression coverage |
| Defects | severity, root cause, retest, disposition |
| Exceptions | risk accepted, owner, expiry, compensating control |
| Operational readiness | runbook, training, support, capacity, BCM/degraded mode |
| Monitoring | post-release metrics, thresholds, review cadence |
| Rollback | trigger, owner, technical path, business communication |
| Sign-off | business, technology, operations, risk/control, release |
5.6 Risk Acceptance Record
| Field | Required content |
|---|---|
| exception_id | Stable ID |
| linked release / criteria / defect | IDs, not free text |
| residual risk | clear risk statement |
| impacted population | segment, channel, product, employee role |
| compensating control | manual review, rate limit, monitoring, fallback, customer recourse |
| expiry or scale gate | date, batch size, exposure cap, trigger |
| approver | role with authority |
| monitoring | metric, threshold, owner, cadence |
| closure | fix, retest, recertification evidence |
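Because every accepted risk carries an expiry, the exception inventory can be scanned for records that have reached it. The sketch below assumes a date-based expiry only (the record also allows batch-size or exposure-cap gates, which are not modeled here); the class, the function name and the example IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RiskAcceptance:
    """Subset of the risk acceptance record, with a date-based expiry."""
    exception_id: str
    approver_role: str
    compensating_control: str
    expiry: date
    closed: bool = False  # True once fix, retest and recertification evidence exist


def needs_action(records, today):
    """Return IDs of open exceptions whose expiry date has been reached."""
    return [r.exception_id for r in records if not r.closed and r.expiry <= today]


# Hypothetical inventory for illustration.
inventory = [
    RiskAcceptance("EXC-001", "Head of Fraud Ops", "manual review", date(2026, 3, 31)),
    RiskAcceptance("EXC-002", "Product owner", "exposure cap", date(2026, 9, 30)),
    RiskAcceptance("EXC-003", "Control owner", "rate limit", date(2026, 1, 15), closed=True),
]
print(needs_action(inventory, today=date(2026, 6, 30)))  # ['EXC-001']
```

An expired open exception should lead to a fix or a fresh, authorized re-acceptance; the script only surfaces the record, it does not decide.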
5.7 Test Data Governance Card
| Field | Required content |
|---|---|
| dataset_id | UAT/replay/synthetic dataset ID |
| data source | production extract, synthetic generation, vendor sample, expert-authored |
| data classification | public, internal, confidential, restricted, regulated |
| sanitization | masking, tokenization, NULLing, substitution, aggregation |
| allowed environment | UAT, secure sandbox, local prohibited, vendor prohibited |
| retention | retention period and deletion evidence |
| access | roles, approvals, logs |
| AI model routing | approved models/environments only |
| evidence use | which tests and criteria rely on this data |
5.8 Evidence Index
| Field | Required content |
|---|---|
| evidence_id | Stable ID |
| evidence_type | test_run, eval_report, UAT_session, replay_trace, shadow_report, approval |
| release_id | release and feature flag |
| linked criteria/control | IDs |
| object versions | model, prompt, corpus, tool, workflow |
| produced_by | system, tester, business user, AI assistant |
| reviewer | accountable human reviewer |
| timestamp | creation and approval time |
| integrity | checksum, immutable store, access log |
| retention | retention class |
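The integrity field can be illustrated with a content checksum recorded when the evidence is produced. This is a sketch under assumptions: SHA-256 via Python's standard `hashlib` is one reasonable choice, not something the playbook prescribes, and the function names, IDs and payload are hypothetical. An immutable store and access log would still be needed alongside it.

```python
import hashlib
import json


def make_entry(evidence_id, evidence_type, release_id, payload, reviewer):
    """Build an evidence index entry with a SHA-256 checksum of the raw payload bytes."""
    return {
        "evidence_id": evidence_id,
        "evidence_type": evidence_type,
        "release_id": release_id,
        "reviewer": reviewer,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def is_intact(entry, payload):
    """Recompute the checksum and compare it with the one stored in the index."""
    return hashlib.sha256(payload).hexdigest() == entry["sha256"]


# Hypothetical eval result, serialized deterministically before hashing.
result = {"criterion": "ACC-FRAUD-COPILOT-023", "status": "pass"}
payload = json.dumps(result, sort_keys=True).encode("utf-8")
entry = make_entry("EV-0001", "eval_report", "REL-EXAMPLE", payload, "risk.owner")

print(is_intact(entry, payload))                 # True
print(is_intact(entry, payload + b" tampered"))  # False
```

Serializing with `sort_keys=True` keeps the byte representation stable, so the same result always yields the same checksum.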
6. Coverage Matrices
6.1 Acceptance-to-Evidence Matrix
| Acceptance criterion | Golden journey | Synthetic pack | AI regression | Control evidence | Status |
|---|---|---|---|---|---|
| AI must escalate uncertain KYC document | GJ-KYC-004 | SYN-KYC-BOUNDARY-002 | prompt + tool + workflow | HITL log, reviewer queue | covered |
| RAG answer must cite current fee policy | GJ-CS-011 | SYN-RAG-STALE-001 | corpus + retriever | citation eval, source version | covered |
| Fraud alert summary must not close case | GJ-FRAUD-021 | SYN-TOOL-DENY-003 | tool gateway | tool audit trace | covered |
| Spanish mobile onboarding must preserve recourse | GJ-ONB-ESP-007 | SYN-ACCESS-005 | UI + prompt | language QA, appeal path | covered |
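A matrix like this only earns a "covered" status if every column is filled. The sketch below derives gaps from the rows instead of trusting a hand-typed status; the dictionary keys and the function name are hypothetical, and the second row is an invented incomplete example, not taken from the matrix.

```python
# Columns every criterion must have evidence in before it counts as covered.
REQUIRED_COLUMNS = ("golden_journey", "synthetic_pack", "ai_regression", "control_evidence")


def coverage_gaps(matrix):
    """Map each criterion to the required columns that are empty or absent."""
    gaps = {}
    for row in matrix:
        missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
        if missing:
            gaps[row["criterion"]] = missing
    return gaps


matrix = [
    {   # first row of the matrix above
        "criterion": "AI must escalate uncertain KYC document",
        "golden_journey": "GJ-KYC-004",
        "synthetic_pack": "SYN-KYC-BOUNDARY-002",
        "ai_regression": "prompt + tool + workflow",
        "control_evidence": "HITL log, reviewer queue",
    },
    {   # hypothetical incomplete row, for illustration only
        "criterion": "Hypothetical criterion without a synthetic pack",
        "golden_journey": "GJ-EXAMPLE-001",
        "ai_regression": "prompt",
        "control_evidence": "",
    },
]
print(coverage_gaps(matrix))
# {'Hypothetical criterion without a synthetic pack': ['synthetic_pack', 'control_evidence']}
```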
6.2 Segment Coverage Matrix
| Segment axis | Required examples | Evidence |
|---|---|---|
| Customer value/risk | low balance, high value, high-risk AML, fraud watch | journey run and segment eval |
| Language | English, Spanish, high-volume non-English path | translated source and output QA |
| Accessibility | screen reader, keyboard only, plain-language error | accessibility evidence |
| Channel | mobile, web, branch, contact center, back office | channel-specific trace |
| Product | deposit, credit card, loan, payment, dispute | product journey coverage |
| Employee role | analyst, supervisor, QA reviewer, admin | permission and workflow tests |
6.3 Risk-Control-Test-Evidence Matrix
| Risk | Control | Test asset | Evidence | Release implication |
|---|---|---|---|---|
| Unsupported claim in customer answer | RAG source grounding and citation required | known-answer and no-answer pack | citation correctness report | blocker if high-risk topic fails |
| Unauthorized tool action | tool gateway policy and HITL | tool trajectory negative cases | denied action logs | blocker for agent release |
| Segment harm | segment-specific thresholds and recourse | vulnerable customer pack | segment result and appeal path | pilot cap if amber |
| Queue overload | exposure cap and capacity monitor | parallel run workload comparison | queue forecast and ops sign-off | cap rollout if workload high |
| Missing audit trail | trace ID and version capture | evidence completeness query | log sample and evidence index | no release if critical fields missing |
7. AI Regression Decision Tables
7.1 Regression Scope by Change
| Change | Minimum regression | Additional tests for high-risk AI |
|---|---|---|
| Prompt wording | prompt golden set, refusal, citation, tone | adversarial prompts, high-risk journeys |
| Model version | full eval, segment eval, latency/cost | shadow or parallel comparison |
| RAG corpus | retrieval eval, source freshness, stale-source tests | regulatory/customer-facing narrative QA |
| Retriever / embedding | recall/ranking benchmark | no-answer and edge source cases |
| Tool schema/API | contract test, positive/negative tool trajectory | side-effect sandbox, HITL approval |
| Policy rule | decision table regression | segment and jurisdiction matrix |
| Workflow state | state transition and exception replay | queue/capacity and fallback |
| Data pipeline | DQ, distribution shift, backfill/replay | model input drift, bias/segment impact |
| UI/API | functional and accessibility regression | operator error, explanation comprehension |
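The change-to-regression mapping lends itself to a lookup, so the regression plan for a release follows from its list of changed objects. The sketch below encodes four of the nine rows as an illustration; the dictionary keys and the function name are hypothetical, and an unknown change type deliberately raises `KeyError` rather than silently producing an empty plan.

```python
# change type -> (minimum regression, additional tests for high-risk AI)
REGRESSION_SCOPE = {
    "prompt": (["prompt golden set", "refusal", "citation", "tone"],
               ["adversarial prompts", "high-risk journeys"]),
    "model": (["full eval", "segment eval", "latency/cost"],
              ["shadow or parallel comparison"]),
    "rag_corpus": (["retrieval eval", "source freshness", "stale-source tests"],
                   ["regulatory/customer-facing narrative QA"]),
    "tool_schema": (["contract test", "positive/negative tool trajectory"],
                    ["side-effect sandbox", "HITL approval"]),
}


def regression_plan(changed_objects, high_risk):
    """Union of required regression assets for the changed objects, in first-seen order."""
    plan = []
    for obj in changed_objects:
        minimum, additional = REGRESSION_SCOPE[obj]  # KeyError if the change type is unmapped
        for asset in minimum + (additional if high_risk else []):
            if asset not in plan:
                plan.append(asset)
    return plan


print(regression_plan(["prompt"], high_risk=False))
# ['prompt golden set', 'refusal', 'citation', 'tone']
print(regression_plan(["prompt", "tool_schema"], high_risk=True))
```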
7.2 Certification Decision
| Condition | Decision |
|---|---|
| All blocking criteria met, no critical/high open defect, ops ready | full release |
| Blocking criteria met, amber workload or segment signal within cap | limited pilot with exposure cap |
| Non-blocking defect accepted by authorized owner with expiry | exception release |
| Critical defect, missing key control evidence, or no rollback path | hold |
| Production early signal breaches stop rule | rollback or disable feature flag |
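The pre-release rows of this decision table can be sketched as a function that recommends an outcome for the authorized owners to confirm. The table does not state a precedence between its rows, so the order below (hold first, then exception release, then pilot, then full release) is an assumption, as are the field names. Treating unmet blocking criteria and missing operational readiness as hold follows the "Operational readiness is blocking" guardrail in section 11. The production rollback row is out of scope here.

```python
from dataclasses import dataclass, replace


@dataclass
class ReleaseState:
    blocking_criteria_met: bool
    open_critical: int
    open_high_unaccepted: int       # high defects neither fixed nor formally accepted
    ops_ready: bool
    control_evidence_complete: bool
    rollback_path_verified: bool
    amber_signal_within_cap: bool = False
    accepted_exceptions: int = 0    # non-blocking defects accepted with owner and expiry


def certification_decision(s):
    """Recommend a decision; the accountable humans still make and record it."""
    if s.open_critical > 0 or not s.control_evidence_complete or not s.rollback_path_verified:
        return "hold"
    if not s.blocking_criteria_met or not s.ops_ready or s.open_high_unaccepted > 0:
        return "hold"  # assumed precedence, see lead-in
    if s.accepted_exceptions > 0:
        return "exception release"
    if s.amber_signal_within_cap:
        return "limited pilot with exposure cap"
    return "full release"


clean = ReleaseState(True, 0, 0, True, True, True)
print(certification_decision(clean))                                         # full release
print(certification_decision(replace(clean, amber_signal_within_cap=True)))  # limited pilot with exposure cap
print(certification_decision(replace(clean, rollback_path_verified=False)))  # hold
```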
7.3 Defect Severity
| Severity | Signal | Required action |
|---|---|---|
| Critical | prohibited output, privacy breach, unauthorized action, critical control failure | block release, freeze evidence, executive escalation |
| High | high-risk journey failure, repeated segment harm, monitoring gap | fix or formal exception committee |
| Medium | contained failure with compensating control | fix before scale, monitor in pilot |
| Low | low-risk UI/text issue without customer/control impact | backlog with owner and target release |
7.4 Rollback Triggers
| Trigger | Rollback action |
|---|---|
| prohibited output in production | disable AI output path, preserve logs, incident workflow |
| tool gateway unauthorized action | disable affected tool integration, revert policy/tool schema |
| citation correctness below threshold | revert corpus/retriever or disable grounded answer feature |
| manual queue exceeds capacity limit | reduce cohort, switch to manual triage |
| segment complaint spike | pause affected segment, expand review, notify authorized owners |
| evidence capture failure | pause rollout, restore prior version, assess audit gap |
| vendor model degradation | route to fallback model or manual process |
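Rollback triggers are easier to rehearse when they are written as explicit rules over monitoring signals. The sketch below covers four of the triggers; the signal names and the numeric limits (0.95 citation correctness, 1.0 queue utilization) are hypothetical placeholders that the accountable owners would set, not values from this playbook.

```python
import operator

# (signal name, comparison, limit, rollback action) -- limits are illustrative assumptions
ROLLBACK_RULES = [
    ("prohibited_outputs", operator.gt, 0,
     "disable AI output path, preserve logs, incident workflow"),
    ("unauthorized_tool_actions", operator.gt, 0,
     "disable affected tool integration, revert policy/tool schema"),
    ("citation_correctness", operator.lt, 0.95,
     "revert corpus/retriever or disable grounded answer feature"),
    ("manual_queue_utilization", operator.gt, 1.0,
     "reduce cohort, switch to manual triage"),
]


def rollback_actions(signals):
    """Return the actions whose trigger condition is breached by the observed signals."""
    return [
        action
        for name, breached, limit, action in ROLLBACK_RULES
        if name in signals and breached(signals[name], limit)
    ]


observed = {"prohibited_outputs": 0, "citation_correctness": 0.91, "manual_queue_utilization": 0.8}
print(rollback_actions(observed))
# ['revert corpus/retriever or disable grounded answer feature']
```

Signals missing from the input are skipped rather than treated as healthy or breached; in practice a missing signal would itself fall under the evidence capture failure trigger.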
8. Workflow Replay, Shadow Testing, Parallel Run
8.1 Workflow Replay Pack
| Field | Required content |
|---|---|
| replay_id | Stable run ID |
| event source | historical, synthetic, mixed |
| frozen input version | dataset and business event version |
| expected state transitions | states, owners, SLAs |
| AI actions | recommendation, retrieval, tool call, refusal, escalation |
| control checks | policy, authorization, logging, HITL |
| comparison | expected vs actual |
| defect links | deviations and retest |
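The "expected vs actual" comparison can be as simple as walking both state sequences and reporting the first point where they diverge. The sketch below assumes a workflow run can be reduced to an ordered list of state names; the state names and the function name are hypothetical.

```python
def first_deviation(expected, actual):
    """Return (step index, expected state, actual state) at the first mismatch, else None.

    A missing step on either side is reported with None for that side.
    """
    for i in range(max(len(expected), len(actual))):
        exp = expected[i] if i < len(expected) else None
        act = actual[i] if i < len(actual) else None
        if exp != act:
            return (i, exp, act)
    return None


# Hypothetical replay: the actual run skipped the human review step.
expected = ["received", "ai_summary", "hitl_review", "closed"]
actual = ["received", "ai_summary", "closed"]
print(first_deviation(expected, actual))  # (2, 'hitl_review', 'closed')
```

Each reported deviation would become a defect link on the replay pack, with the retest recorded against the same frozen input version.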
8.2 Shadow Test Report
| Section | Required content |
|---|---|
| Traffic scope | cohort, channel, product, duration |
| Non-impact proof | how shadow output was prevented from affecting production decision |
| Output quality | accuracy, citation, refusal, unsupported claim |
| Risk signals | policy violations, privacy events, segment issues |
| Operations | latency, cost, error, queue projection |
| Decision | continue shadow, move to pilot, expand tests, hold |
8.3 Parallel Run Report
| Section | Required content |
|---|---|
| Baseline | current process/model/manual decision |
| New process | AI-assisted or automated path |
| Comparison | decision delta, reason delta, workload delta, segment delta |
| Human adjudication | where deltas were reviewed and how final truth was assigned |
| Customer impact | potential harm, recourse, communication |
| Control impact | HITL, logging, override, approval |
| Release decision | full, pilot, hold, exception |
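The decision delta and segment delta in the comparison section can be computed from paired decisions on the same cases. This sketch defines delta as the share of cases where the new process decided differently from the baseline, which is one possible definition and an assumption here; the segment labels and decisions are invented for illustration.

```python
from collections import defaultdict


def decision_delta_by_segment(pairs):
    """pairs: iterable of (segment, baseline_decision, new_decision) for the same case.

    Returns the fraction of cases per segment where the two decisions differ.
    """
    totals = defaultdict(int)
    changed = defaultdict(int)
    for segment, baseline, new in pairs:
        totals[segment] += 1
        if baseline != new:
            changed[segment] += 1
    return {segment: changed[segment] / totals[segment] for segment in totals}


# Hypothetical parallel-run sample.
pairs = [
    ("retail", "approve", "approve"),
    ("retail", "approve", "escalate"),
    ("retail", "decline", "decline"),
    ("retail", "approve", "approve"),
    ("high-risk AML", "escalate", "approve"),
    ("high-risk AML", "escalate", "escalate"),
]
print(decision_delta_by_segment(pairs))  # {'retail': 0.25, 'high-risk AML': 0.5}
```

A delta on its own says nothing about which side was right; the deltas still go to human adjudication, where the final truth is assigned.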
9. Evidence / Control Checklist
9.1 Pre-Release Checklist
| Check | Pass standard |
|---|---|
| Acceptance criteria defined | all blocking criteria have owner, metric, threshold and evidence |
| Version identity locked | model, prompt, RAG, tool, policy, workflow and feature flag recorded |
| Golden journeys run | high-risk and material journeys executed with trace evidence |
| Synthetic packs reviewed | expected outputs approved by business/risk owner |
| Segment coverage complete | material and vulnerable segments have evidence |
| AI regression complete | changed and impacted objects tested |
| Defects dispositioned | no critical open; high defects fixed or formally accepted |
| Test data controlled | sensitive data protected, synthetic data governed, access logged |
| Accessibility covered | relevant digital and language accessibility tests passed |
| Operational readiness ready | runbook, training, support, capacity, fallback |
| Monitoring configured | metrics, thresholds, owners and cadence active |
| Rollback rehearsed | technical and business rollback path verified |
| Certification memo approved | release decision and residual risks signed by owners |
9.2 Evidence Quality Rubric
| Quality dimension | Good evidence |
|---|---|
| Complete | covers scope, criteria, version, result, owner |
| Accurate | produced from system logs or controlled test outputs |
| Timely | generated during workflow, not reconstructed later |
| Traceable | linked to release, criteria, control and defect IDs |
| Reproducible | input pack and versions allow rerun |
| Reviewed | accountable human reviewer recorded |
| Protected | sensitive data minimized and access-controlled |
| Durable | retained with integrity and retention metadata |
9.3 Management Readiness MI
| Dashboard tile | RAG (red/amber/green) signal |
|---|---|
| Release certification status | criteria passed, blocked, exceptioned |
| High-risk journey coverage | executed, failed, not run |
| AI regression status | changed objects covered or missing |
| Segment and accessibility coverage | material gaps and amber signals |
| Defect severity | critical/high open, aging, repeat root cause |
| Exception inventory | accepted risk, expiry, compensating control |
| Operational readiness | training, runbook, support, capacity |
| Monitoring readiness | metrics live, thresholds approved |
| Rollback readiness | drill completed, owner assigned |
10. AI Assist Guardrails
| AI use in UAT | Allowed output | Required guardrail |
|---|---|---|
| Test case generation | candidate cases and edge cases | human approves expected behavior |
| Coverage gap analysis | unmapped criteria/controls/tests | traceability graph remains source of truth |
| Defect clustering | likely root cause and duplicate groups | severity and closure by accountable human |
| Requirement-test mapping | suggested matrix links | BA/QA confirms links |
| Evidence summarization | draft certification summary | cites evidence IDs, human approval |
| Synthetic scenario generation | candidate synthetic records/scenarios | privacy review, reviewer approval |
| UAT transcript analysis | pain points, confusion, training signals | access control and PII minimization |
Non-negotiable boundaries:
AI does not sign UAT.
AI does not approve release.
AI does not accept residual risk.
AI does not close defects.
AI does not decide legal/compliance applicability.
AI does not fabricate missing evidence.
AI does not send sensitive test data to unapproved tools.
11. Implementation Guardrails
| Guardrail | Practical rule |
|---|---|
| Criteria before cases | no test asset accepted without linked acceptance criteria |
| Risk before coverage | high-risk journey and segment coverage outranks raw test count |
| Version every AI object | model, prompt, corpus, retriever, tool, policy and workflow must be recorded |
| Evidence by design | trace IDs, logs and approvals are generated during execution |
| Synthetic with governance | generated data must have business rationale and expected output |
| No uncontrolled production data | production data in UAT requires need, controls, access and retention |
| Defect decisions structured | severity, impact, root cause, retest and disposition required |
| Exceptions expire | every accepted risk has owner, expiry and compensating control |
| Operational readiness is blocking | no release without support, training, monitoring and rollback |
| Production signals update tests | incidents and complaints expand golden journeys or synthetic packs |
12. 30-60-90 Day Roadmap
Days 1-30: Stabilize Acceptance Foundations
| Workstream | Outcome |
|---|---|
| Inventory AI releases | top AI use cases, release types, model/prompt/RAG/tool owners |
| Define acceptance contract | template for business, risk, control, ops and rollback criteria |
| Build initial golden journeys | top 15-25 material and high-risk journeys |
| Create defect taxonomy | severity, root cause, customer/control impact |
| Stand up evidence index | release ID, criteria ID, evidence ID conventions |
Days 31-60: Engineer Regression and Evidence
| Workstream | Outcome |
|---|---|
| Build synthetic packs | boundary, rare event, segment, privacy, adversarial, operational packs |
| Add AI object regression | model/prompt/RAG/tool/workflow change matrix |
| Implement workflow replay | traceable replay for key journeys |
| Structure certification memo | criteria, coverage, defects, exceptions, ops readiness |
| Define monitoring and rollback | metrics, thresholds, owners, kill switch path |
Days 61-90: Prove and Scale
| Workstream | Outcome |
|---|---|
| Run shadow or parallel pilots | comparison evidence for selected high-risk use case |
| Launch readiness MI | release certification, defects, exceptions, monitoring, rollback dashboard |
| Automate evidence capture | test run, eval, logs, approvals, version metadata |
| Close repeat root causes | production signals update regression packs |
| Formalize recertification policy | triggers for model/prompt/RAG/tool/workflow/data changes |
13. Interview-Ready Answers
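Q1: How would you redesign AI UAT?
30-second version:
I would turn AI UAT from a business sign-off into an evidence architecture. First define the acceptance claim and criteria, then build the golden journeys, synthetic packs, segment coverage, risk/control coverage and AI regression matrix. After running workflow replay and a shadow or parallel run, use the release certification memo to record evidence, defects, exceptions, operational readiness, monitoring and rollback.
2-minute version:
AI UAT has to prove more than that the feature works: it has to prove that the business capability is acceptable under a specific model, prompt, RAG, tool, workflow and control version. My process starts from the acceptance criteria registry, where every criterion has an owner, threshold, evidence and a release-blocker flag. Test assets include end-to-end golden journeys, synthetic transaction packs, a persona/segment matrix, and adversarial and accessibility cases. Regression covers model, prompt, RAG corpus, retriever, tool schema, policy engine, workflow state and data pipeline. Defects are classified by customer impact and control impact. Before release, a certification memo is produced that contains residual risk, exception approval, operational readiness, monitoring and rollback criteria.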
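Q2: Why can't UAT be completed just by business users clicking through pages?
30-second version:
Because AI risk often sits in model outputs, retrieval sources, tool calls, control logs, human review, segment performance and operational load, and does not necessarily surface through page clicks. Business user participation matters, but it must be embedded in the evidence architecture.
2-minute version:
For example, the fact that a customer service AI assistant's page can display an answer does not mean the answer is based on current policy, cites accurately and stays within the advice boundary. Nor does it mean that Spanish-speaking customers, vulnerable customers, complaint escalation and system timeout paths are covered. UAT should place business users inside golden journeys while capturing trace ID, source citation, model/prompt version, policy decision, tool call, HITL approval and defect evidence. Business judgment remains a human responsibility, but that judgment must be bound to auditable evidence.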
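Q3: How do you design AI regression certification?
30-second version:
First identify the changed objects: model, prompt, RAG, retriever, tool, policy, workflow, data, UI/API. Map each type of change to its minimum regression assets and its additional high-risk tests. On passing, generate an AI regression certificate that records versions, tests, defects, exceptions and the triggers for the next recertification.
2-minute version:
A prompt change must run refusal, citation, tone and high-risk journeys; a RAG corpus change must run retrieval eval, stale-source and citation correctness; a tool schema change must run positive/negative trajectory, authorization and side-effect sandbox; a model version change must run full eval, segment eval and latency/cost, plus a shadow or parallel run for high-risk scenarios. Certification is not a single aggregate score but a chain: changed object -> impacted behavior -> regression asset -> evidence -> owner decision.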
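Q4: How is evidence from shadow testing and parallel run used in the release decision?
30-second version:
Shadow testing demonstrates the AI's output quality and risk signals when it runs alongside real traffic without affecting it; parallel run compares the old and new processes on decision differences, manual workload, segment impact and control effectiveness. Both must have thresholds defined in advance.
2-minute version:
Shadow suits low-interference observation, such as customer service suggestions, RAG answers or analyst summaries, but it must be proven that shadow output did not influence real decisions. Parallel run suits KYC, fraud, credit or operational workflows, because the baseline and the new process need to be compared. The release decision looks at whether the decision delta is adequately explained, whether workload is within capacity, whether high-risk segments are stable, whether controls are fully recorded, and whether customer harm signals are acceptable. A shadow or parallel run without comparison criteria cannot support certification.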
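Q5: How far can AI help with UAT?
30-second version:
AI can generate candidate tests, find coverage gaps, cluster defects, map requirements to tests and summarize evidence. AI cannot give final acceptance, accept risk, close defects, approve a release, or replace audit or model validation.
2-minute version:
I would use AI to improve the analytical efficiency of UAT, but with hard boundaries. AI-generated test cases must have their expected outcome confirmed by the BA, business or risk owner. AI coverage-gap suggestions must be reconciled against the traceability matrix. Defect clustering can only serve as triage input; severity and disposition are decided by humans. The release memo can be drafted by AI, but it must cite evidence IDs and be approved by an authorized owner. That way AI helps the team see problems faster without taking over the chain of accountability.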
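Q6: How do you handle a release request that carries defects?
30-second version:
First decide whether the defect is blocking. Critical: no release. High: fix it, or take it to a formal exception committee. Medium: a limited gradual rollout is possible, but only with compensating controls. Low: goes to the backlog. Every exception must have an owner, expiry, monitoring and closure criteria.
2-minute version:
I would not use the words "business accepted" to cover up a defect. A defect record must link to the acceptance criteria, journey, segment, AI object version, customer/control impact and root cause. If the decision is an exception release, a risk acceptance record is required: residual risk, impacted population, compensating control, exposure cap, expiry, approver, monitoring and closure. Post-release monitoring must be able to verify that the risk has not grown, and before expiry the defect must be fixed or the risk re-accepted.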
14. Quality Bar
The minimum bar for an AI UAT / Regression Certification system to qualify:
For any AI release,
the team can reconstruct:
what business capability was accepted,
which criteria and risks were in scope,
which journeys, segments and synthetic scenarios were tested,
which model/prompt/RAG/tool/workflow versions were certified,
which defects and exceptions remained,
who accepted the residual risk,
what monitoring was active,
and what conditions would trigger rollback or recertification.
If this chain breaks, UAT is not a business acceptance architecture, only a release ritual.