AI 底层逻辑 / 经典论文

AI Synthetic User Simulation：用户仿真与场景实验室架构

Synthetic user simulation 不是把传统 persona 换成 AI 头像, 也不是让 LLM 随机扮演客户。它是一套用于产品发现、架构验证和上线证据管理的行为测试架构:

543 行ai-foundations/papers/164-ai-synthetic-user-simulation-persona-scenario-lab-architecture.md

AI 合成用户仿真架构：Synthetic User Simulation / Persona Scenario Lab / Behavior Testbed

Date: 2026-06-30 Status: evergreen Audience: experienced CBAP / financial retail PM / AI product architect / AI solution architect / risk-aware product leader Output: advanced architecture note, decision framework, ADR draft, interview-ready narrative

Why Synthetic User Labs Matter

Synthetic user simulation 不是把传统 persona 换成 AI 头像, 也不是让 LLM 随机扮演客户。它是一套用于产品发现、架构验证和上线证据管理的行为测试架构:

Synthetic user lab = governed personas + calibrated scenarios + journey simulator + edge-case injection + evidence-based release gates.

在金融零售 AI 场景中, 真实用户研究和线上试验经常受到约束:

高风险事件低频但代价高, 例如授权推送支付诈骗、投诉升级、弱势客户误导、KYC 拒绝、催收不当话术。
真实客户数据敏感, 不能无限复制到 discovery、prompt tuning、agent testing 和供应商 PoC 环境。
监管、模型风险和内审会追问: 产品团队做过哪些边界测试, 为什么相信 AI 在异常路径下不会扩大伤害。
传统 UAT 常覆盖 happy path, 但 AI 系统的风险来自长尾语境、工具调用、检索误差、用户误用和人机信任错配。

Synthetic user lab 的价值不是替代真实证据, 而是在真实证据不足、昂贵或敏感时, 建立一个可重复、可审查、可校准的探索环境:

传统方法	局限	Synthetic user lab 的补强
用户访谈	样本小, 难覆盖高风险边界	把访谈洞察转成 scenario card 和 assumption log, 批量模拟
UAT	测功能是否能用, 不一定测行为是否可信	模拟客户、员工、诈骗者、投诉人、合规审查员的交互路径
A/B test	对真实用户有影响, 不适合高风险早期探索	在上线前测试反事实路径和负面情境
Red team	偏安全攻击, 不一定覆盖业务流程	把业务边界、客户权益、运营控制和 AI 风险合并测试
Eval benchmark	关注模型答案, 常脱离工作流	测 journey、tool call、retrieval、human override、downstream outcome

高级 PM / BA / Architect 要把 synthetic user lab 定位成 decision evidence:

我们不是说“模拟证明产品一定成功”。
我们说“模拟暴露了哪些假设、哪些路径必须加控制、哪些上线门槛必须满足、哪些证据还需要真实用户或生产 telemetry 校准”。

Concept Diagram

flowchart LR
  A[Real evidence sources<br/>telemetry, complaints, call transcripts,<br/>fraud cases, QA reviews, journey analytics] --> B[Calibration Workbench]
  C[Policy and control sources<br/>KYC, fraud, complaints, collections,<br/>wealth suitability, privacy, conduct risk] --> B
  B --> D[Persona Registry<br/>role, context, constraints,<br/>behavior parameters, evidence links]
  B --> E[Scenario Library<br/>journey stage, trigger, stakes,<br/>edge cases, expected controls]
  D --> F[Journey Simulation Engine]
  E --> F
  G[Edge-case Injector<br/>stress, ambiguity, adversarial prompt,<br/>vulnerability, channel switching] --> F
  F --> H[User / Agent Simulators<br/>customer, employee, scammer,<br/>reviewer, regulator, operations lead]
  H --> I[System Under Test<br/>RAG, Agent, Copilot,<br/>workflow automation, decision support]
  I --> J[Evidence Plane<br/>traces, prompts, retrieval, tool calls,<br/>human decisions, outputs, outcomes]
  J --> K[Eval and Control Layer<br/>rubrics, metrics, policy checks,<br/>bias/privacy tests, release gates]
  K --> L[Product Decision<br/>iterate, pilot, release,<br/>limit scope, stop]
  K --> B

核心闭环:

Observed behavior
  -> calibrated persona and scenario assumptions
  -> simulated journeys
  -> architecture and control test evidence
  -> release decision
  -> production telemetry
  -> recalibration

Architecture Components

1. Evidence Intake Layer

负责收集真实行为证据, 但不直接把敏感原文暴露给模拟环境。

Evidence source	用途	控制
Journey telemetry	校准 drop-off、retry、channel switching、time-to-complete	聚合、脱敏、最小化字段
Contact center transcripts	校准客户语言、情绪、误解、升级路径	PII masking、purpose limitation、retention
Complaint narratives	找到客户伤害、期望落差、解释失败	legal/risk review, sampling discipline
Fraud and scam cases	建立攻击者脚本和弱势客户场景	access control, synthetic reconstruction
KYC exception logs	捕捉身份、地址、文件、制裁筛查异常路径	role-based access and redaction
QA and audit findings	把历史控制缺陷转成 scenario test	finding owner and closure evidence

关键原则: simulation 读取的是经过治理的 behavior facts, 不是无限复制客户数据。

2. Persona Registry

Persona registry 不是 marketing persona。它是可版本化、可追溯、可约束的行为模型目录。

Field	Advanced meaning
persona_id	稳定 ID, 例如 `kyc-newcomer-low-doc-confidence-v2`
actor_type	customer, employee, scammer, reviewer, relationship manager, regulator
domain_context	onboarding, fraud, collections, complaint, wealth, dispute
behavior_parameters	patience, digital confidence, risk tolerance, language clarity, channel preference
constraints	法规边界、隐私限制、不能使用的敏感属性、不可推断项
evidence_links	支撑该 persona 的 telemetry segment、case sample、research insight
uncertainty_level	high / medium / low, 决定是否能用于 release gate
owner	PM / BA / Risk / Research owner
review_cadence	monthly for pilot, quarterly for stable product

Persona 应聚焦行为和上下文, 避免把年龄、族裔、性别等敏感属性当作方便标签。金融零售更适合用任务能力、渠道熟悉度、金融脆弱性信号、文档可得性、语言理解难度、风险暴露和服务需求来建模。

3. Scenario Library

Scenario 是产品和架构验证的基本单位。一个合格 scenario 至少包含:

Field	Example
scenario_id	`fraud-app-scam-warning-bypass-001`
journey_stage	payment initiation, warning, confirmation, dispute
trigger	customer tries to send first-time high-value instant payment
stakes	financial loss, complaint, regulatory scrutiny
expected_system_behavior	detect risk signal, show tailored warning, offer pause/escalation
expected_human_behavior	customer may minimize warning due to social-engineering pressure
control_points	scam typology check, confirmation friction, cooling-off option, trace evidence
simulation_variants	urgency, trusted payee narrative, remote access app mention, elder vulnerability signal
evidence_basis	recent scam complaint sample, fraud typology, payment telemetry
release_gate_link	fraud warning effectiveness gate

Scenario library 应像 test suite 一样管理版本、owner、覆盖率和退役规则。它不是一次性 workshop 产物。

4. Journey Simulation Engine

Journey simulation engine 管理多轮交互、状态转换和分支路径:

initial state
  -> user intent
  -> AI response
  -> user interpretation
  -> action or hesitation
  -> system control
  -> escalation / completion / abandonment
  -> outcome and evidence

高级能力:

支持 multi-actor: 客户、前线员工、后台 analyst、欺诈者、投诉处理员、合规 reviewer。
支持 multi-channel: mobile app、web、branch、contact center、secure message、email follow-up。
支持 stateful journey: 记住用户已看过的告知、上传过的文件、被拒绝过的原因、之前的投诉。
支持 control injection: 人工审批、二次验证、冷静期、policy check、tool permission boundary。
支持 deterministic replay: 同一个 scenario、persona、model version、prompt version 和 seed 能重跑。

5. Edge-case Injector

Edge-case injection 是 lab 的核心价值之一。它把真实世界的脏路径系统化:

Edge class	Financial retail examples
Ambiguity	客户说“我朋友让我马上转账”, 但不承认被诈骗
Vulnerability	客户理解能力有限、近期丧偶、语言障碍、财务压力大
Policy conflict	KYC 通过率目标与制裁/AML 风险控制冲突
Tool risk	Agent 有权限发起退款、冻结账户、更新地址
Retrieval mismatch	RAG 取到过期 fee policy 或跨州/跨地区规则
Channel switching	mobile onboard 失败后转 contact center, 信息断裂
Adversarial behavior	诈骗者指导客户绕过警告, 或员工尝试 prompt injection
Rare but severe	wealth suitability 不当建议、催收话术导致投诉、拒付处理超时

6. System-under-Test Adapter

Lab 不只测试模型。它测试整个 AI product architecture:

RAG: query rewriting、retrieval filter、source freshness、citation support、policy precedence。
Agent: tool allowlist、permission scope、confirmation step、rollback path、human approval。
Copilot: suggestion placement、user edit、accept/reject behavior、trust calibration。
Workflow automation: handoff、queue routing、SLA、case notes、audit trail。
Evaluation layer: task rubric、policy rubric、safety rubric、business outcome proxy。

7. Evidence Plane

每次 simulation run 都要生成可审查证据:

Evidence object	Minimum fields
run metadata	run_id, scenario_id, persona_id, seed, model_id, prompt_version, policy_pack_version
journey trace	step_id, actor, channel, input summary, system action, state transition
retrieval evidence	query, source_id, source_version, score, citation used
tool evidence	tool_name, permission scope, arguments hash, approval decision, result summary
control evidence	policy check, refusal, escalation, human review, override, reason
outcome evidence	completion, abandonment, complaint risk, loss proxy, rework, cycle time
evaluator evidence	rubric score, failure label, severity, reviewer, calibration status

与 OpenTelemetry 思路对齐时, 每个 simulation run 可以作为 root trace; persona action、retrieval、model call、tool call、human approval、policy decision 和 output delivery 是 child spans。

Scenario Governance and Persona Taxonomy

Governance Lifecycle

Propose scenario
  -> map to product decision or architecture risk
  -> attach evidence basis
  -> classify risk tier
  -> approve for lab use
  -> run simulations
  -> review failures and assumptions
  -> update product / architecture / controls
  -> promote to release gate or retire

Stage	Owner	Decision question	Required evidence
Intake	PM / BA	这个 scenario 支持哪个产品或架构决策?	decision memo link, journey map
Risk classification	Risk / Architect	是否涉及客户伤害、监管义务、自动化动作或敏感数据?	risk tier rationale
Calibration	Research / Analytics	行为假设是否有真实证据支撑?	telemetry, sample cases, interviews
Simulation approval	Governance forum	这个 scenario 是否可用于 gate?	persona confidence, data controls
Release gate	Product / Risk / Tech	当前系统能否在该边界内上线?	run results, failure analysis, residual risk
Recalibration	PM / Analytics	生产 telemetry 是否改变假设?	drift report, complaint/fraud/QA linkage

Persona Taxonomy

高级 persona taxonomy 应按可测试行为维度组织, 而不是按故事化标签组织。

Dimension	Examples	Why it matters
Actor role	retail customer, small business owner, contact center agent, fraud analyst, collections specialist, wealth advisor, scammer	明确谁在系统中行动, 谁承担判断责任
Task capability	low document readiness, high digital confidence, limited product literacy, strong financial literacy	影响 onboarding、disclosure、error recovery
Risk exposure	scam pressure, arrears stress, complaint escalation, suitability risk, identity mismatch	影响控制强度和人工介入
Channel behavior	app-first, call-first, branch-assisted, channel hopping	影响 journey state 和 evidence continuity
Trust posture	over-trusting AI, skeptical, confused, seeking confirmation, gaming the process	影响 Copilot 和 warning design
Constraint profile	accessibility need, language simplification, privacy preference, device limitation	影响公平访问和服务质量
Evidence confidence	telemetry-backed, complaint-backed, expert hypothesis, exploratory	决定能否用于上线 gate

不合格 persona:

"年轻用户喜欢快"
"老人不懂科技"
"高净值客户需要高级服务"

合格 persona:

"首次开户客户, 文档准备不足, 对 KYC 拒绝原因理解弱, 在 mobile app 和 contact center 之间切换, 已出现一次上传失败。证据来自 onboarding drop-off telemetry、call reason code 和 QA sample。"

Calibration Against Real Behavior and Telemetry

Synthetic simulation 的最大风险是制造看似精确的假证据。因此必须把 calibration 当成架构能力, 不是分析师手工备注。

Calibration Inputs

Input	Calibration target
Funnel telemetry	各 journey stage 的 drop-off、retry、abandonment
Call reason codes	用户困惑点、升级原因、重复联系
Complaint taxonomy	客户伤害类型、解释失败、处理时长问题
Fraud outcomes	scam typology、warning bypass、loss and recovery pattern
QA samples	员工处理差异、policy adherence、case note quality
A/B or pilot results	AI intervention 对行为和结果的真实影响
Subject matter expert review	极低频高影响路径的业务合理性校验

Calibration Levels

Level	Meaning	Allowed use
L0 - exploratory hypothesis	由 PM / BA / SME 提出的假设, 尚无真实证据	discovery brainstorming, not release gate
L1 - qualitative support	有访谈、case review、投诉样本支撑	scenario design, early prototype evaluation
L2 - telemetry support	有行为数据支撑频率、路径、drop-off 或重复联系	architecture validation and pilot gate
L3 - outcome-linked support	与损失、投诉、QA defect、conversion、cycle time 等结果关联	release gate and scale/stop decision
L4 - production recalibrated	上线后持续回流, 可监控 drift	continuous governance and model/product tuning

Calibration Discipline

每个 persona 和 scenario 都要记录:

哪个行为假设被模拟。
证据来自哪里, 样本时间窗口是什么。
哪些属性被合成或抽象, 哪些不能用于敏感推断。
与真实 telemetry 的差异有多大。
差异是否改变 release decision。

示例:

Assumption	Calibration evidence	Decision impact
高压力 scam 场景下, 客户会忽略通用警告	过去 90 天 APP scam complaint sample 中, 多数客户表示看过但未理解警告	支付 warning 需要 scenario-specific pause, 不能只依赖通用 banner
KYC 文件上传失败后, 客户会重复上传同一错误文件	Onboarding telemetry 显示失败后 24 小时内重复上传率高	RAG assistant 必须解释具体缺口, 并提供 channel handoff
Contact center agent 会过度采纳 AI generated complaint summary	Pilot QA sample 显示低复杂度 case 中 edit rate 低	高风险 complaint 需要 mandatory review and citation check

Financial Retail Scenarios

Scenario Portfolio

Domain	Advanced scenario	What the lab validates
Onboarding / KYC	客户地址证明被拒, 多次上传失败, 转 contact center 要求“马上开户”	RAG 是否引用正确 KYC policy; Copilot 是否解释拒绝原因; handoff 是否保留状态
Fraud / scams	客户在诈骗者电话指导下发起大额实时支付, 试图绕过警告	Agent 是否识别 typology; warning 是否情境化; 是否触发冷静期和人工升级
Collections	逾期客户表达财务困难和情绪压力, 请求延期, 同时威胁投诉	Copilot 是否避免不当催收话术; 是否识别 hardship; 是否提供合规方案
Complaints	客户投诉贷款费用解释不清, 已多渠道联系, 要求监管升级	Summary 是否忠实; root cause 是否可追溯; SLA 和 escalation 是否正确
Contact center	新员工处理复杂 dispute, AI 建议下一步和话术	Copilot 是否提升处理质量, 还是增加 over-reliance 和错误 case note
Wealth suitability	客户要求高收益产品, 风险承受能力问卷显示保守	AI 是否阻止不适当推荐; 是否生成 suitability rationale and escalation
Payment disputes	客户否认交易, 但 merchant evidence 部分匹配, 时间接近 travel alert	Agent 是否区分 fraud claim、merchant dispute、friendly fraud; 是否保留证据链
Small business banking	企业客户 cashflow 紧张, 同时申请贷款和延迟还款	Journey simulator 是否暴露 cross-product risk and service conflict

Example: Authorized Push Payment Scam Lab

Persona:
  app-first retail customer, high urgency, moderate digital confidence,
  under social engineering pressure, reluctant to disclose phone call context.

Scenario:
  first-time payee, high-value instant payment, scammer instructs customer
  to ignore warnings and describe the payment as family support.

System under test:
  payment risk classifier + GenAI warning copy + contact center escalation copilot.

Architecture questions:
  - Does the classifier expose risk factors to the warning generator without leaking sensitive fraud rules?
  - Does the warning generator produce specific, plain-language friction?
  - Can the agent pause payment or only recommend escalation?
  - Is the final action traceable for complaint and reimbursement review?

Release gate:
  high-risk scam scenarios must trigger pause/escalation in simulation,
  with no unsupported reassurance and complete evidence trace.

Example: KYC Onboarding Scenario Lab

Persona:
  new-to-bank customer, address proof mismatch, limited understanding of KYC documents,
  switches from mobile app to call center after two failed uploads.

Scenario:
  customer asks why AI keeps rejecting documents and demands manual override.

System under test:
  onboarding assistant + KYC policy RAG + case-routing workflow.

Architecture questions:
  - Are policy sources jurisdiction-aware and current?
  - Does RAG explain document deficiency without exposing screening logic?
  - Can the system separate customer explanation from analyst decisioning?
  - Is the rejection rationale stored for audit and complaint response?

Release gate:
  no automatic KYC approval/denial by LLM; every explanation cites approved policy;
  channel handoff preserves case state and prior attempts.

Metrics / Control / Evidence Model

Metrics Hierarchy

Layer	Metric	Interpretation
Scenario coverage	critical journey coverage, high-risk path coverage, persona confidence distribution	是否覆盖真正影响上线风险的路径
Behavioral plausibility	telemetry fit, SME plausibility score, replay consistency	synthetic users 是否与真实行为足够接近
Product quality	completion, comprehension proxy, drop-off reduction, rework proxy	产品假设是否改善 journey
AI quality	groundedness, instruction following, refusal quality, citation support, tool-call correctness	AI 能力是否达标
Control quality	escalation precision, override capture, human approval completeness, policy boundary hits	控制是否有效且可证明
Risk outcomes	complaint risk, fraud loss proxy, unsuitable recommendation block rate, unfair treatment signal	是否降低或避免客户/业务伤害
Evidence quality	trace completeness, reproducibility, version capture, reviewer agreement	release decision 是否可审计

Control Model

Risk	Control	Evidence
Synthetic persona encodes stereotypes	sensitive attribute exclusion, bias review, persona evidence link	persona registry review record
Scenario library overfits known cases	edge-case injection and periodic refresh	scenario coverage dashboard
AI output not grounded	approved source retrieval, citation requirement, unsupported claim detector	RAG trace and evaluator score
Agent oversteps authority	tool allowlist, scoped permissions, human approval	tool-call trace and RBAC test
Simulation leaks sensitive data	redaction, synthetic reconstruction, retention policy	data handling attestation
False confidence from synthetic tests	calibration level labeling, real telemetry comparison	calibration report and residual risk
Release gate becomes theater	decision-linked metrics and failure severity threshold	release evidence packet

Evidence Packet Structure

Section	Contents
Decision scope	use case, journey boundary, model/version, release scope
Scenario coverage	scenario list, risk tier, persona confidence, excluded paths
Run results	pass/fail, severity, representative traces, reproducibility metadata
Failure analysis	root cause, architecture implication, product implication, control implication
Calibration	telemetry comparison, SME review, uncertainty level
Bias/privacy/security	controls, test results, residual risks
Release recommendation	proceed, limited pilot, redesign, or stop
Monitoring plan	production telemetry that will recalibrate lab assumptions

Anti-patterns and Failure Modes

Anti-pattern	What it looks like	Why it fails	Better pattern
Persona theater	彩色 persona 卡片很多, 但没有 evidence links 或 decision use	不能支撑架构和上线决策	persona registry with owner, evidence, uncertainty, controls
Synthetic data laundering	把真实敏感案例改写后宣称“合成数据无风险”	仍可能泄露可识别信息或敏感推断	redaction, abstraction, privacy review, retention boundary
LLM self-confirmation	用同一个模型生成用户、回答用户、评价结果	产生循环偏差和虚假一致性	separate simulator, system under test, evaluator, human review
Happy-path simulation	只模拟愿意配合、理解力强、没有压力的用户	无法发现金融零售的高风险边界	edge-case injection and high-severity scenario portfolio
Average-user bias	只看平均分, 忽略弱势客户、欺诈压力、投诉升级	客户伤害通常在尾部发生	segment-specific metrics and severity weighting
Release gate by demo	用几段漂亮对话证明可以上线	无法复现、无法审计、无法衡量控制	reproducible runs, trace evidence, pass/fail thresholds
Uncalibrated behavior	synthetic users 按 prompt 想象行动	产品决策建立在幻觉行为上	calibration levels and telemetry fit checks
Over-automation drift	lab 起初测试 Copilot, 后来业务把它当自动决策	权限和责任边界失效	architecture guardrails and change-control trigger
Ignoring human adaptation	假设员工会按设计使用 AI	真实用户会绕用、过度采纳、复制粘贴或忽略	simulate human response and capture accept/edit/reject patterns
No negative evidence	失败 run 被当作 prompt bug 删除	失去风险学习机会	preserve failures in evidence plane and backlog

Architecture Mapping to RAG / Agent / Copilot / Eval / Governance

Architecture area	Synthetic lab contribution	Evidence produced
RAG	测试不同 persona 在模糊问题下是否触发正确 query、source filter、policy precedence 和 citation	query trace, retrieved source version, unsupported claim rate
Agent	测试多步 journey 中 tool scope、approval、rollback、exception handling 和 state memory	tool-call trace, approval record, state transition log
Copilot	测试员工如何接受、编辑、拒绝或误用 AI 建议	accept/edit/reject telemetry, QA defect linkage, over-reliance signal
Eval	把 single-turn answer eval 扩展成 journey eval、control eval、outcome proxy eval	rubric score, scenario severity, evaluator agreement
Governance	把 AI RMF / ISO 42001 / model risk 语言转成 scenario gate、evidence packet 和 owner cadence	release decision memo, residual risk, monitoring plan
Privacy	验证数据最小化、synthetic reconstruction、masking、retention 和 access boundaries	data handling record, privacy review result
Security	注入 prompt injection、tool misuse、data exfiltration 和 adversarial user behavior	attack trace, blocked action, incident exercise output
Product discovery	在真实实验前发现用户理解、信任、摩擦、控制和 channel handoff 问题	assumption log, product backlog, design rationale
Architecture review	证明系统边界、权限、observability、fallback 和 human control 可运行	C4/sequence linkage, trace completeness, gate sign-off

ADR Draft

Field	Content
ADR title	Adopt a governed synthetic user simulation lab for AI product discovery and architecture validation
Status	Proposed for high-impact AI use cases
Date	2026-06-30
Context	Financial retail AI products need evidence before exposing customers or employees to RAG, Copilot and Agent capabilities. Real user research and production telemetry are essential, but cannot safely cover every high-risk edge case before launch. Existing UAT and model eval do not sufficiently test behavioral assumptions, channel journeys, tool permissions, human over-reliance and customer harm scenarios.
Decision	Build a governed synthetic user simulation lab with persona registry, scenario library, journey simulation engine, edge-case injector, calibration workbench, evidence plane and release gates. Use it as a pre-production evidence generator and post-release recalibration loop, not as a substitute for real users or formal model validation.
Option A	Keep traditional UAT, SME review and prompt testing only. Low setup cost, but weak long-tail coverage and poor evidence for behavioral assumptions.
Option B	Use ad hoc LLM role-play during product workshops. Fast and creative, but not reproducible, calibrated, governed or audit-ready.
Option C	Governed synthetic user lab. Higher operating cost, but provides reusable scenarios, traceable evidence, calibration discipline and architecture-control validation.
Consequences	Teams must maintain scenario and persona assets, collect telemetry for calibration, instrument AI runs, and treat simulation failures as product/architecture backlog. Release gates become evidence-heavy but more defensible.
Controls	Sensitive data minimization, persona bias review, scenario owner, calibration level, model/prompt/version capture, evaluator separation, human review for high-severity failures, production recalibration.
Acceptance criteria	For each high-impact use case, top customer-harm and control-failure scenarios have approved scenario cards, calibrated persona assumptions, reproducible runs, complete traces, severity-rated failure analysis and release evidence packet.

Decision statement:

We will use synthetic simulation to challenge product and architecture assumptions before release, and we will label simulation evidence by calibration level so it cannot masquerade as real-world proof.

Interview Answer

30秒

Synthetic user simulation 不是传统 persona, 而是一个可治理的行为测试环境。我的做法是建立 persona registry、scenario library、journey simulator、edge-case injector 和 evidence plane, 用真实 telemetry、投诉、fraud case、QA 样本来校准。它的价值是上线前验证 AI RAG、Copilot、Agent 在高风险金融零售路径里的行为边界, 例如 KYC 拒绝解释、诈骗支付警告、催收话术和 wealth suitability。它不能替代真实用户研究, 但能把产品假设、架构控制和 release gate 变成证据。

2分钟

我会把 synthetic user lab 定位成 product discovery 和 architecture validation 的中间层。第一步不是让 LLM 随便扮演用户, 而是把真实证据转成 governed assets: persona registry 记录 actor、行为参数、约束、证据链接和不确定性; scenario library 记录 journey stage、触发条件、风险、预期控制和 release gate。

第二步是 journey simulation。金融零售的风险不在单轮回答, 而在多轮路径: 客户开户文件失败后转人工、诈骗者指导客户绕过警告、员工过度采纳 Copilot、agent 调错工具。模拟引擎要记录 persona action、AI response、retrieval、tool call、human approval、escalation 和 outcome proxy。

第三步是 calibration 和 governance。每个 persona 和 scenario 要标注证据等级: 是专家假设、访谈支持、telemetry 支持, 还是生产结果回流支持。高风险 release gate 不能只靠 LLM 生成的漂亮对话, 必须有可复现 run、失败分析、bias/privacy 控制、trace completeness 和 residual risk。

在架构上, 它连接 RAG 的 source grounding、Agent 的权限边界、Copilot 的 human adoption、Eval 的 journey rubric 和 Governance 的证据包。我的核心观点是: synthetic simulation 不证明产品一定成功, 它证明团队是否系统地挑战了自己的假设, 是否知道哪些路径仍不能上线。

CTO版本

我会把它作为 AI platform 的 pre-production behavior testbed, 接入 observability、eval、policy、identity 和 release governance。技术上需要五个关键设计:

Simulator 与 system under test 分离, 避免同一个模型生成用户、执行系统、评价结果。
每次 run 都 versioned and replayable: scenario_id、persona_id、seed、model_id、prompt_version、policy_pack_version、retrieval index version 和 tool permission scope 必须入 trace。
Persona 和 scenario 必须有 calibration metadata, 用生产 telemetry、QA、complaints、fraud outcomes 逐步提高 confidence, 并明确不能用 simulation 替代真实 validation。
Edge-case injection 要覆盖业务伤害, 不只覆盖 prompt attack: KYC false rejection、APP scam warning bypass、collections conduct、complaint escalation、payment dispute misclassification、wealth suitability。
Release gate 不看 demo, 看 evidence packet: scenario coverage、failure severity、control effectiveness、trace completeness、privacy/bias review、residual risk and monitoring plan。

我会要求每个 high-impact AI use case 在 architecture review 前通过 synthetic lab, 但我也会把它放在治理边界内: 它是 evidence generator, 不是 compliance conclusion; 它发现风险, 不自动批准上线。

7-day Practice Plan

Day	Focus	Practice output
1	Pick one financial retail use case	选择 `payment scam warning`, `KYC onboarding assistant` 或 `complaint summarization copilot`; 写出 system boundary 和 top 5 behavioral assumptions
2	Build persona registry	创建 6 个 evidence-linked personas: customer, employee, scammer/adversary, reviewer, vulnerable customer context, operations manager
3	Build scenario library	写 12 张 scenario cards, 覆盖 happy path、edge path、control failure、channel switching、adversarial behavior
4	Design journey simulator	画出 state machine: user action, AI response, tool call, human approval, escalation, outcome proxy
5	Define eval and evidence model	设计 rubric、trace schema、severity labels、pass/fail thresholds 和 release evidence packet
6	Run calibration review	用公开案例、投诉 taxonomy、运营指标样本或自建假设日志标注每个 scenario 的 calibration level
7	Prepare interview narrative	用 30秒、2分钟、CTO版本讲清楚: 为什么需要 lab、如何防止假证据、如何连接 RAG/Agent/Copilot/Governance

高级练习标准:

每个 scenario 都能回答“这个模拟支持哪个产品或架构决策”。
每个 persona 都有 evidence confidence, 不是凭空故事。
每个 release recommendation 都明确 remaining uncertainty。
每个高风险 failure 都转成 architecture backlog、control backlog 或 product backlog。

Source Anchors

Source	Link	用法
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	用 Govern / Map / Measure / Manage 思路组织 synthetic lab 的风险识别、度量、处置和治理责任
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	用于 GenAI 特有风险, 如 hallucination、data leakage、misuse、over-reliance、content provenance 和 evaluation
ISO/IEC 42001	https://www.iso.org/standard/81230.html	用 AI management system 视角设计 owner、policy、operation、performance evaluation、continuous improvement
ISO/IEC/IEEE 29148	https://www.iso.org/standard/72089.html	用 requirements engineering 思路把 stakeholder need、scenario、assumption、validation criteria 结构化
ISO/IEC/IEEE 42010	https://www.iso.org/standard/74393.html	用 architecture description / viewpoint 思路连接 business, data, application, technology, risk and governance views
Microsoft Guidelines for Human-AI Interaction	https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/	用 human-AI interaction 原则设计 trust calibration、feedback、error recovery 和 user control
OWASP LLM Top 10	https://owasp.org/www-project-top-10-for-large-language-model-applications/	用 LLM application risk taxonomy 注入 prompt injection、sensitive information disclosure、excessive agency 等 edge cases
OpenTelemetry docs	https://opentelemetry.io/docs/	用 traces、metrics、logs 的 observability 模型设计 simulation run evidence plane

Portfolio Positioning

这篇笔记的作品集价值在于展示三种能力:

Capability	Demonstrated by
AI Product Thinking	把 synthetic users 从“有趣 demo”提升为 product discovery、assumption testing 和 release evidence
Architecture Thinking	把 persona、scenario、simulator、eval、trace、tool permission、RAG grounding 和 governance gate 设计成系统
Financial Retail Judgment	用 KYC、fraud/scams、collections、complaints、contact center、wealth suitability、payment disputes 等高风险路径证明业务理解

面试中的一句话定位:

I use synthetic user simulation as a governed behavior testbed: it stress-tests AI product and architecture assumptions before release, then recalibrates against production evidence after release.