AI Synthetic User Simulation: User Simulation and Scenario Lab Architecture
AI Synthetic User Simulation Architecture: Synthetic User Simulation / Persona Scenario Lab / Behavior Testbed
Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / AI product architect / AI solution architect / risk-aware product leader
Output: advanced architecture note, decision framework, ADR draft, interview-ready narrative
Why Synthetic User Labs Matter
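Synthetic user simulation does not mean swapping traditional personas for AI avatars, nor letting an LLM role-play customers at random. It is a behavior-testing architecture for product discovery, architecture validation, and release evidence management: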
Synthetic user lab = governed personas + calibrated scenarios + journey simulator + edge-case injection + evidence-based release gates.
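In financial retail AI, real user research and live experiments are often constrained:
- High-risk events are rare but costly, for example authorized push payment (APP) scams, complaint escalation, misleading vulnerable customers, KYC rejection, and improper collections language.
- Real customer data is sensitive and cannot be copied without limit into discovery, prompt tuning, agent testing, and vendor PoC environments.
- Regulators, model risk teams, and internal audit will ask which boundary tests the product team ran and why it believes the AI will not amplify harm on abnormal paths.
- Traditional UAT mostly covers the happy path, but AI system risk comes from long-tail context, tool calls, retrieval errors, user misuse, and miscalibrated human-AI trust.
The value of a synthetic user lab is not to replace real evidence, but to provide a repeatable, reviewable, calibratable exploration environment when real evidence is scarce, expensive, or sensitive:
| Traditional method | Limitation | What the synthetic user lab adds |
|---|---|---|
| User interviews | Small samples; hard to cover high-risk boundaries | Turn interview insights into scenario cards and an assumption log, then simulate in batch |
| UAT | Tests whether features work, not necessarily whether behavior is trustworthy | Simulate interaction paths for customers, employees, scammers, complainants, and compliance reviewers |
| A/B test | Affects real users; unsuitable for early high-risk exploration | Test counterfactual paths and adverse situations before release |
| Red team | Focused on security attacks; may not cover business processes | Test business boundaries, customer rights, operational controls, and AI risk together |
| Eval benchmark | Focuses on model answers; often detached from the workflow | Test journey, tool call, retrieval, human override, downstream outcome |
Senior PMs, BAs, and architects should position the synthetic user lab as decision evidence:
We do not say "the simulation proves the product will succeed."
We say "the simulation exposed which assumptions are in play, which paths need added controls, which release thresholds must be met, and which evidence still needs calibration against real users or production telemetry."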
Concept Diagram
flowchart LR
A[Real evidence sources<br/>telemetry, complaints, call transcripts,<br/>fraud cases, QA reviews, journey analytics] --> B[Calibration Workbench]
C[Policy and control sources<br/>KYC, fraud, complaints, collections,<br/>wealth suitability, privacy, conduct risk] --> B
B --> D[Persona Registry<br/>role, context, constraints,<br/>behavior parameters, evidence links]
B --> E[Scenario Library<br/>journey stage, trigger, stakes,<br/>edge cases, expected controls]
D --> F[Journey Simulation Engine]
E --> F
G[Edge-case Injector<br/>stress, ambiguity, adversarial prompt,<br/>vulnerability, channel switching] --> F
F --> H[User / Agent Simulators<br/>customer, employee, scammer,<br/>reviewer, regulator, operations lead]
H --> I[System Under Test<br/>RAG, Agent, Copilot,<br/>workflow automation, decision support]
I --> J[Evidence Plane<br/>traces, prompts, retrieval, tool calls,<br/>human decisions, outputs, outcomes]
J --> K[Eval and Control Layer<br/>rubrics, metrics, policy checks,<br/>bias/privacy tests, release gates]
K --> L[Product Decision<br/>iterate, pilot, release,<br/>limit scope, stop]
K --> B
Core loop:
Observed behavior
-> calibrated persona and scenario assumptions
-> simulated journeys
-> architecture and control test evidence
-> release decision
-> production telemetry
-> recalibration
Architecture Components
1. Evidence Intake Layer
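Collects evidence of real behavior without exposing sensitive raw text directly to the simulation environment.
| Evidence source | Use | Control |
|---|---|---|
| Journey telemetry | Calibrate drop-off, retry, channel switching, time-to-complete | Aggregation, de-identification, field minimization |
| Contact center transcripts | Calibrate customer language, emotion, misunderstanding, escalation paths | PII masking, purpose limitation, retention |
| Complaint narratives | Identify customer harm, expectation gaps, explanation failures | legal/risk review, sampling discipline |
| Fraud and scam cases | Build attacker scripts and vulnerable-customer scenarios | access control, synthetic reconstruction |
| KYC exception logs | Capture exception paths for identity, address, documents, sanctions screening | role-based access and redaction |
| QA and audit findings | Turn historical control defects into scenario tests | finding owner and closure evidence |
Key principle: the simulation reads governed behavior facts, not unlimited copies of customer data.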
2. Persona Registry
A persona registry is not a set of marketing personas. It is a versioned, traceable, constrained catalog of behavior models.
| Field | Advanced meaning |
|---|---|
| persona_id | Stable ID, for example kyc-newcomer-low-doc-confidence-v2 |
| actor_type | customer, employee, scammer, reviewer, relationship manager, regulator |
| domain_context | onboarding, fraud, collections, complaint, wealth, dispute |
| behavior_parameters | patience, digital confidence, risk tolerance, language clarity, channel preference |
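| constraints | Regulatory boundaries, privacy limits, sensitive attributes that must not be used, attributes that must not be inferred |
| evidence_links | Telemetry segments, case samples, and research insights that support the persona |
| uncertainty_level | high / medium / low; determines whether the persona can be used for a release gate |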
| owner | PM / BA / Risk / Research owner |
| review_cadence | monthly for pilot, quarterly for stable product |
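Personas should focus on behavior and context, and avoid using sensitive attributes such as age, ethnicity, or gender as convenient labels. In financial retail it is more appropriate to model on task capability, channel familiarity, financial vulnerability signals, document availability, language comprehension difficulty, risk exposure, and service needs.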
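A registry entry can be sketched as a small typed record with a validation pass. This is a minimal illustration, not a reference schema: the field names follow the table above, while `SENSITIVE_ATTRIBUTES`, `validate_persona`, and the evidence reference are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: illustrative deny-list; a real list would come from policy and legal review.
SENSITIVE_ATTRIBUTES = {"age", "ethnicity", "gender"}
UNCERTAINTY_LEVELS = {"high", "medium", "low"}


@dataclass(frozen=True)
class PersonaRecord:
    persona_id: str
    actor_type: str
    domain_context: str
    behavior_parameters: dict
    constraints: tuple
    evidence_links: tuple
    uncertainty_level: str
    owner: str
    review_cadence: str


def validate_persona(persona: PersonaRecord) -> list:
    """Return a list of problems; an empty list means the record passes these checks."""
    problems = []
    used = {key.lower() for key in persona.behavior_parameters}
    for attr in sorted(used & SENSITIVE_ATTRIBUTES):
        problems.append(f"sensitive attribute used as behavior parameter: {attr}")
    if persona.uncertainty_level not in UNCERTAINTY_LEVELS:
        problems.append(f"unknown uncertainty_level: {persona.uncertainty_level}")
    if not persona.evidence_links:
        problems.append("no evidence_links: persona is not traceable to observed behavior")
    return problems


persona = PersonaRecord(
    persona_id="kyc-newcomer-low-doc-confidence-v2",
    actor_type="customer",
    domain_context="onboarding",
    behavior_parameters={"patience": "low", "digital_confidence": "medium"},
    constraints=("no inference of protected attributes",),
    evidence_links=("onboarding-dropoff-segment",),  # hypothetical evidence reference
    uncertainty_level="medium",
    owner="BA",
    review_cadence="monthly",
)
print(validate_persona(persona))
```

A key-name check like this only catches the obvious case. Proxies for sensitive attributes hidden inside other parameters still need the human bias review listed in the control model.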
3. Scenario Library
The scenario is the basic unit of product and architecture validation. A valid scenario contains at least:
| Field | Example |
|---|---|
| scenario_id | fraud-app-scam-warning-bypass-001 |
| journey_stage | payment initiation, warning, confirmation, dispute |
| trigger | customer tries to send first-time high-value instant payment |
| stakes | financial loss, complaint, regulatory scrutiny |
| expected_system_behavior | detect risk signal, show tailored warning, offer pause/escalation |
| expected_human_behavior | customer may minimize warning due to social-engineering pressure |
| control_points | scam typology check, confirmation friction, cooling-off option, trace evidence |
| simulation_variants | urgency, trusted payee narrative, remote access app mention, elder vulnerability signal |
| evidence_basis | recent scam complaint sample, fraud typology, payment telemetry |
| release_gate_link | fraud warning effectiveness gate |
The scenario library should be managed like a test suite, with versions, owners, coverage, and retirement rules. It is not a one-off workshop artifact.
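One way to make "coverage" concrete is to map required journey stages to the active scenario cards that exercise them. The sketch below assumes a `status` field (`active` / `retired`) and a second, hypothetical scenario ID; neither appears in the field table above.

```python
from collections import defaultdict


def stage_coverage(scenarios, required_stages):
    """Map each required journey stage to the active scenario IDs that cover it."""
    covered = defaultdict(list)
    for card in scenarios:
        if card.get("status") != "active":  # retired cards do not count toward coverage
            continue
        for stage in card["journey_stage"]:
            covered[stage].append(card["scenario_id"])
    return {stage: sorted(covered.get(stage, [])) for stage in required_stages}


scenarios = [
    {"scenario_id": "fraud-app-scam-warning-bypass-001", "owner": "PM", "status": "active",
     "journey_stage": ["payment initiation", "warning", "confirmation"]},
    # Hypothetical retired card, included to show the retirement rule.
    {"scenario_id": "fraud-dispute-legacy-000", "owner": "BA", "status": "retired",
     "journey_stage": ["dispute"]},
]
required = ["payment initiation", "warning", "confirmation", "dispute"]

coverage = stage_coverage(scenarios, required)
gaps = [stage for stage, ids in coverage.items() if not ids]
print(gaps)
```

Here the `dispute` stage shows up as a gap because its only card is retired, which is the kind of signal a coverage dashboard would surface.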
4. Journey Simulation Engine
The journey simulation engine manages multi-turn interaction, state transitions, and branching paths:
initial state
-> user intent
-> AI response
-> user interpretation
-> action or hesitation
-> system control
-> escalation / completion / abandonment
-> outcome and evidence
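Advanced capabilities:
- Multi-actor: customer, frontline employee, back-office analyst, fraudster, complaint handler, compliance reviewer.
- Multi-channel: mobile app, web, branch, contact center, secure message, email follow-up.
- Stateful journey: remembers disclosures the user has already seen, documents uploaded, earlier rejection reasons, and prior complaints.
- Control injection: human approval, step-up verification, cooling-off period, policy check, tool permission boundary.
- Deterministic replay: the same scenario, persona, model version, prompt version, and seed can be rerun.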
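The deterministic replay capability can be illustrated with a toy journey that draws all of its randomness from a seeded, private generator. This is a deliberately simplified sketch: `simulate_journey`, `heed_warning_prob`, and `max_warnings` are hypothetical names, and a single probability stands in for a full persona model.

```python
import random


def simulate_journey(seed, heed_warning_prob, max_warnings=3):
    """Walk one simplified journey and return the ordered list of steps."""
    rng = random.Random(seed)  # private RNG so replay does not depend on global state
    trace = ["initial_state", "user_intent"]
    for _ in range(max_warnings):
        trace += ["ai_response", "user_interpretation"]
        if rng.random() < heed_warning_prob:
            # The persona acts on the warning and the control path takes over.
            trace += ["action", "system_control", "escalation"]
            break
        trace.append("hesitation")
    else:
        # The persona never acted on a warning within the allowed turns.
        trace.append("abandonment")
    trace.append("outcome_and_evidence")
    return trace


first = simulate_journey(seed=42, heed_warning_prob=0.4)
replay = simulate_journey(seed=42, heed_warning_prob=0.4)
print(first == replay)
```

Seeding makes the simulator's own choices repeatable. The system under test may still vary between runs if it calls a model whose output is not fully deterministic, so replay evidence should record which parts were actually pinned.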
5. Edge-case Injector
Edge-case injection is one of the lab's core values. It systematizes the messy paths of the real world:
| Edge class | Financial retail examples |
|---|---|
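| Ambiguity | Customer says "my friend told me to transfer the money right now" but does not acknowledge being scammed |
| Vulnerability | Customer has limited comprehension, was recently widowed, faces a language barrier, or is under heavy financial stress |
| Policy conflict | KYC pass-rate targets conflict with sanctions/AML risk controls |
| Tool risk | Agent has permission to issue refunds, freeze accounts, update addresses |
| Retrieval mismatch | RAG retrieves an outdated fee policy or rules from another state or region |
| Channel switching | Mobile onboarding fails, the customer moves to the contact center, and context is lost |
| Adversarial behavior | Scammer coaches the customer to bypass warnings, or an employee attempts prompt injection |
| Rare but severe | Unsuitable wealth advice, collections language that triggers complaints, chargeback handling past deadline |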
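A simple injector can expand one base scenario into variants by taking the cross product of edge classes. The sketch below is one possible approach, not a prescribed design: `inject_edge_cases`, the option values, and the `-vNNN` ID suffix are all illustrative assumptions.

```python
from itertools import product


def inject_edge_cases(base_scenario_id, edge_options):
    """Expand a base scenario into one variant per combination of edge-class values."""
    keys = sorted(edge_options)  # fixed key order keeps variant IDs stable across runs
    variants = []
    combos = product(*(edge_options[k] for k in keys))
    for index, combo in enumerate(combos, start=1):
        variant = {"variant_id": f"{base_scenario_id}-v{index:03d}", "base": base_scenario_id}
        variant.update(zip(keys, combo))
        variants.append(variant)
    return variants


# Hypothetical option values for three of the edge classes in the table above.
edge_options = {
    "ambiguity": ["none", "denies_being_coached"],
    "vulnerability": ["none", "financial_stress"],
    "channel_switching": ["none", "app_to_contact_center"],
}
variants = inject_edge_cases("fraud-app-scam-warning-bypass-001", edge_options)
print(len(variants))
```

A full cross product grows quickly, so a real lab would likely prune combinations by severity or sample them, and record which combinations were excluded.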
6. System-under-Test Adapter
The lab does not test only the model. It tests the whole AI product architecture:
- RAG: query rewriting, retrieval filter, source freshness, citation support, policy precedence.
- Agent: tool allowlist, permission scope, confirmation step, rollback path, human approval.
- Copilot: suggestion placement, user edit, accept/reject behavior, trust calibration.
- Workflow automation: handoff, queue routing, SLA, case notes, audit trail.
- Evaluation layer: task rubric, policy rubric, safety rubric, business outcome proxy.
7. Evidence Plane
Every simulation run must produce reviewable evidence:
| Evidence object | Minimum fields |
|---|---|
| run metadata | run_id, scenario_id, persona_id, seed, model_id, prompt_version, policy_pack_version |
| journey trace | step_id, actor, channel, input summary, system action, state transition |
| retrieval evidence | query, source_id, source_version, score, citation used |
| tool evidence | tool_name, permission scope, arguments hash, approval decision, result summary |
| control evidence | policy check, refusal, escalation, human review, override, reason |
| outcome evidence | completion, abandonment, complaint risk, loss proxy, rework, cycle time |
| evaluator evidence | rubric score, failure label, severity, reviewer, calibration status |
When aligning with the OpenTelemetry model, each simulation run can map to one trace with a single root span; persona action, retrieval, model call, tool call, human approval, policy decision, and output delivery are child spans.
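The root-and-child structure can be sketched in plain Python before committing to any tracing SDK. The code below does not use the OpenTelemetry API; `Span`, `missing_run_metadata`, `orphan_spans`, and the sample IDs are assumptions for illustration, with the required fields taken from the run metadata row above.

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_RUN_FIELDS = ("run_id", "scenario_id", "persona_id", "seed",
                       "model_id", "prompt_version", "policy_pack_version")


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def missing_run_metadata(spans):
    """Return required run fields absent from the root span (the span with no parent)."""
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root span per simulation run")
    return [name for name in REQUIRED_RUN_FIELDS if name not in roots[0].attributes]


def orphan_spans(spans):
    """Return IDs of spans whose parent is not part of this run's trace."""
    known = {s.span_id for s in spans}
    return [s.span_id for s in spans if s.parent_id is not None and s.parent_id not in known]


run = [
    Span("s0", "simulation_run", attributes={
        "run_id": "run-001", "scenario_id": "fraud-app-scam-warning-bypass-001",
        "persona_id": "app-first-high-urgency-v1",  # hypothetical persona ID
        "seed": 42, "model_id": "model-x", "prompt_version": "p3",
    }),
    Span("s1", "persona_action", parent_id="s0"),
    Span("s2", "retrieval", parent_id="s0"),
    Span("s3", "tool_call", parent_id="s9"),  # deliberately broken parent link
]
print(missing_run_metadata(run), orphan_spans(run))
```

Checks like these are one way to turn "trace completeness" into something a gate can evaluate mechanically.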
Scenario Governance and Persona Taxonomy
Governance Lifecycle
Propose scenario
-> map to product decision or architecture risk
-> attach evidence basis
-> classify risk tier
-> approve for lab use
-> run simulations
-> review failures and assumptions
-> update product / architecture / controls
-> promote to release gate or retire
| Stage | Owner | Decision question | Required evidence |
|---|---|---|---|
| Intake | PM / BA | Which product or architecture decision does this scenario support? | decision memo link, journey map |
| Risk classification | Risk / Architect | Does it involve customer harm, regulatory obligations, automated actions, or sensitive data? | risk tier rationale |
| Calibration | Research / Analytics | Are the behavioral assumptions backed by real evidence? | telemetry, sample cases, interviews |
| Simulation approval | Governance forum | Can this scenario be used for a gate? | persona confidence, data controls |
| Release gate | Product / Risk / Tech | Can the current system go live within this boundary? | run results, failure analysis, residual risk |
| Recalibration | PM / Analytics | Has production telemetry changed the assumptions? | drift report, complaint/fraud/QA linkage |
Persona Taxonomy
An advanced persona taxonomy should be organized by testable behavior dimensions rather than by narrative labels.
| Dimension | Examples | Why it matters |
|---|---|---|
| Actor role | retail customer, small business owner, contact center agent, fraud analyst, collections specialist, wealth advisor, scammer | Clarifies who acts in the system and who bears responsibility for judgment |
| Task capability | low document readiness, high digital confidence, limited product literacy, strong financial literacy | Affects onboarding, disclosure, error recovery |
| Risk exposure | scam pressure, arrears stress, complaint escalation, suitability risk, identity mismatch | Affects control strength and human intervention |
| Channel behavior | app-first, call-first, branch-assisted, channel hopping | Affects journey state and evidence continuity |
| Trust posture | over-trusting AI, skeptical, confused, seeking confirmation, gaming the process | Affects Copilot and warning design |
| Constraint profile | accessibility need, language simplification, privacy preference, device limitation | Affects fair access and service quality |
| Evidence confidence | telemetry-backed, complaint-backed, expert hypothesis, exploratory | Determines whether the persona can be used for a release gate |
Unacceptable personas:
"Young users like speed"
"Older people do not understand technology"
"High-net-worth customers need premium service"
Acceptable persona:
"First-time account opener with incomplete documents and weak understanding of KYC rejection reasons, switching between the mobile app and the contact center, with one upload failure already recorded. Evidence comes from onboarding drop-off telemetry, call reason codes, and QA samples."
Calibration Against Real Behavior and Telemetry
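The biggest risk of synthetic simulation is manufacturing false evidence that looks precise. Calibration must therefore be treated as an architecture capability, not as an analyst's manual footnote.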
Calibration Inputs
| Input | Calibration target |
|---|---|
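| Funnel telemetry | Drop-off, retry, and abandonment at each journey stage |
| Call reason codes | User confusion points, escalation reasons, repeat contacts |
| Complaint taxonomy | Customer harm types, explanation failures, handling-time issues |
| Fraud outcomes | scam typology, warning bypass, loss and recovery pattern |
| QA samples | Variation in employee handling, policy adherence, case note quality |
| A/B or pilot results | Real effect of the AI intervention on behavior and outcomes |
| Subject matter expert review | Business plausibility check for very rare, high-impact paths |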
Calibration Levels
| Level | Meaning | Allowed use |
|---|---|---|
| L0 - exploratory hypothesis | Hypothesis proposed by PM / BA / SME, with no real evidence yet | discovery brainstorming, not release gate |
| L1 - qualitative support | Supported by interviews, case reviews, complaint samples | scenario design, early prototype evaluation |
| L2 - telemetry support | Behavioral data supports frequency, path, drop-off, or repeat contact | architecture validation and pilot gate |
| L3 - outcome-linked support | Linked to outcomes such as loss, complaints, QA defects, conversion, cycle time | release gate and scale/stop decision |
| L4 - production recalibrated | Continuous post-release feedback; drift can be monitored | continuous governance and model/product tuning |
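The levels can be encoded so that a gate refuses evidence that is not calibrated enough for its intended use. The sketch below reads the "Allowed use" column as a minimum-level rule, which is an interpretation rather than something the table states; `MIN_LEVEL_FOR_USE` and `can_use_for` are hypothetical names.

```python
from enum import IntEnum


class CalibrationLevel(IntEnum):
    L0_EXPLORATORY = 0
    L1_QUALITATIVE = 1
    L2_TELEMETRY = 2
    L3_OUTCOME_LINKED = 3
    L4_PRODUCTION_RECALIBRATED = 4


# Assumption: each use needs at least this level, and higher levels also qualify.
MIN_LEVEL_FOR_USE = {
    "discovery": CalibrationLevel.L0_EXPLORATORY,
    "prototype_evaluation": CalibrationLevel.L1_QUALITATIVE,
    "pilot_gate": CalibrationLevel.L2_TELEMETRY,
    "release_gate": CalibrationLevel.L3_OUTCOME_LINKED,
}


def can_use_for(level: CalibrationLevel, use: str) -> bool:
    """True if evidence at this calibration level may support the named use."""
    return level >= MIN_LEVEL_FOR_USE[use]


print(can_use_for(CalibrationLevel.L1_QUALITATIVE, "release_gate"))
print(can_use_for(CalibrationLevel.L3_OUTCOME_LINKED, "release_gate"))
```

Keeping the rule in data rather than in prose makes it harder for exploratory evidence to drift into a release decision unnoticed.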
Calibration Discipline
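Every persona and scenario must record:
- Which behavioral assumption is being simulated.
- Where the evidence comes from and what the sample time window is.
- Which attributes are synthesized or abstracted, and which must not be used for sensitive inference.
- How large the gap is against real telemetry.
- Whether the gap changes the release decision.
Example:
| Assumption | Calibration evidence | Decision impact |
|---|---|---|
| Under a high-pressure scam scenario, customers ignore generic warnings | In the APP (authorized push payment) scam complaint sample from the past 90 days, most customers said they saw the warning but did not understand it | The payment warning needs a scenario-specific pause and cannot rely on a generic banner alone |
| After a KYC document upload fails, customers re-upload the same incorrect document | Onboarding telemetry shows a high re-upload rate within 24 hours of failure | The RAG assistant must explain the specific gap and offer a channel handoff |
| Contact center agents over-adopt AI-generated complaint summaries | Pilot QA sample shows a low edit rate on low-complexity cases | High-risk complaints need mandatory review and citation check |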
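The "gap against real telemetry" item can be made measurable with a per-stage comparison. This is a minimal sketch using absolute differences in drop-off rate; the function name, the 0.05 tolerance, and all rates are invented for illustration, and a real fit check would likely need sample sizes and a proper statistical test.

```python
def telemetry_fit(simulated, observed, tolerance=0.05):
    """Return stages whose simulated drop-off rate differs from observed by more than tolerance.

    A value of None marks a stage that exists in telemetry but was never simulated.
    """
    gaps = {}
    for stage, observed_rate in observed.items():
        if stage not in simulated:
            gaps[stage] = None  # coverage gap rather than a fit result
            continue
        gap = abs(simulated[stage] - observed_rate)
        if gap > tolerance:
            gaps[stage] = round(gap, 4)
    return gaps


# Hypothetical drop-off rates per onboarding stage.
observed = {"document_upload": 0.30, "identity_check": 0.10, "funding": 0.05}
simulated = {"document_upload": 0.12, "identity_check": 0.11}

print(telemetry_fit(simulated, observed))
```

In this made-up data the synthetic personas abandon document upload far less often than real customers do, which would argue against using them for a gate on that stage until recalibrated.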
Financial Retail Scenarios
Scenario Portfolio
| Domain | Advanced scenario | What the lab validates |
|---|---|---|
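| Onboarding / KYC | Customer's proof of address is rejected, several uploads fail, and the customer moves to the contact center demanding to "open the account right now" | Whether RAG cites the correct KYC policy; whether the Copilot explains the rejection reason; whether the handoff preserves state |
| Fraud / scams | Customer initiates a high-value real-time payment while coached by a scammer on the phone, and tries to bypass the warning | Whether the Agent recognizes the typology; whether the warning is contextual; whether the cooling-off period and human escalation are triggered |
| Collections | Customer in arrears expresses financial hardship and emotional stress, asks for an extension, and threatens to complain | Whether the Copilot avoids improper collections language; whether it recognizes hardship; whether it offers a compliant arrangement |
| Complaints | Customer complains that loan fees were poorly explained, has already contacted several channels, and demands escalation to the regulator | Whether the summary is faithful; whether the root cause is traceable; whether SLA and escalation are correct |
| Contact center | New employee handles a complex dispute while the AI suggests next steps and wording | Whether the Copilot improves handling quality or instead increases over-reliance and incorrect case notes |
| Wealth suitability | Customer asks for a high-yield product while the risk tolerance questionnaire shows a conservative profile | Whether the AI blocks unsuitable recommendations; whether it generates a suitability rationale and escalation |
| Payment disputes | Customer denies a transaction, but merchant evidence partially matches and the timing is close to a travel alert | Whether the Agent distinguishes fraud claim, merchant dispute, and friendly fraud; whether it preserves the evidence chain |
| Small business banking | Business customer under cashflow pressure applies for a loan and a payment deferral at the same time | Whether the journey simulator exposes cross-product risk and service conflict |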
Example: Authorized Push Payment Scam Lab
Persona:
app-first retail customer, high urgency, moderate digital confidence,
under social engineering pressure, reluctant to disclose phone call context.
Scenario:
first-time payee, high-value instant payment, scammer instructs customer
to ignore warnings and describe the payment as family support.
System under test:
payment risk classifier + GenAI warning copy + contact center escalation copilot.
Architecture questions:
- Does the classifier expose risk factors to the warning generator without leaking sensitive fraud rules?
- Does the warning generator produce specific, plain-language friction?
- Can the agent pause payment or only recommend escalation?
- Is the final action traceable for complaint and reimbursement review?
Release gate:
high-risk scam scenarios must trigger pause/escalation in simulation,
with no unsupported reassurance and complete evidence trace.
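The release gate above can be expressed as a check over run results. The sketch assumes each run is a dict with the hypothetical keys `risk_tier`, `paused`, `escalated`, `unsupported_reassurance`, and `trace_complete`; how those flags get populated (evaluator, policy check, trace validator) is outside this example.

```python
def evaluate_scam_gate(runs):
    """Return (passed, failures) for the high-risk runs in a batch of simulation results."""
    failures = []
    high_risk = [r for r in runs if r["risk_tier"] == "high"]
    for run in high_risk:
        reasons = []
        if not (run["paused"] or run["escalated"]):
            reasons.append("no pause or escalation")
        if run["unsupported_reassurance"]:
            reasons.append("unsupported reassurance")
        if not run["trace_complete"]:
            reasons.append("incomplete evidence trace")
        if reasons:
            failures.append((run["run_id"], reasons))
    # An empty high-risk set is treated as a failure: nothing was actually tested.
    passed = bool(high_risk) and not failures
    return passed, failures


runs = [
    {"run_id": "run-001", "risk_tier": "high", "paused": True, "escalated": False,
     "unsupported_reassurance": False, "trace_complete": True},
    {"run_id": "run-002", "risk_tier": "high", "paused": False, "escalated": False,
     "unsupported_reassurance": True, "trace_complete": True},
    {"run_id": "run-003", "risk_tier": "low", "paused": False, "escalated": False,
     "unsupported_reassurance": False, "trace_complete": True},
]
passed, failures = evaluate_scam_gate(runs)
print(passed, failures)
```

Returning the failing run IDs with reasons, instead of a bare boolean, keeps the gate result usable as input to failure analysis.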
Example: KYC Onboarding Scenario Lab
Persona:
new-to-bank customer, address proof mismatch, limited understanding of KYC documents,
switches from mobile app to call center after two failed uploads.
Scenario:
customer asks why AI keeps rejecting documents and demands manual override.
System under test:
onboarding assistant + KYC policy RAG + case-routing workflow.
Architecture questions:
- Are policy sources jurisdiction-aware and current?
- Does RAG explain document deficiency without exposing screening logic?
- Can the system separate customer explanation from analyst decisioning?
- Is the rejection rationale stored for audit and complaint response?
Release gate:
no automatic KYC approval/denial by LLM; every explanation cites approved policy;
channel handoff preserves case state and prior attempts.
Metrics / Control / Evidence Model
Metrics Hierarchy
| Layer | Metric | Interpretation |
|---|---|---|
| Scenario coverage | critical journey coverage, high-risk path coverage, persona confidence distribution | Whether the paths that actually drive release risk are covered |
| Behavioral plausibility | telemetry fit, SME plausibility score, replay consistency | Whether synthetic users are close enough to real behavior |
| Product quality | completion, comprehension proxy, drop-off reduction, rework proxy | Whether the product hypothesis improves the journey |
| AI quality | groundedness, instruction following, refusal quality, citation support, tool-call correctness | Whether AI capability meets the bar |
| Control quality | escalation precision, override capture, human approval completeness, policy boundary hits | Whether controls are effective and provable |
| Risk outcomes | complaint risk, fraud loss proxy, unsuitable recommendation block rate, unfair treatment signal | Whether customer or business harm is reduced or avoided |
| Evidence quality | trace completeness, reproducibility, version capture, reviewer agreement | Whether the release decision is auditable |
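As one worked example from the control quality layer, escalation precision can be computed from labeled runs. The sketch uses the standard definition of precision (correct escalations divided by all escalations) and adds recall alongside it as an extra, since a system that never escalates would otherwise look clean; the `should_escalate` label is an assumed reviewer or scenario-card annotation.

```python
def escalation_precision_recall(runs):
    """Compute (precision, recall) for escalation decisions; None when undefined."""
    tp = sum(1 for r in runs if r["escalated"] and r["should_escalate"])
    fp = sum(1 for r in runs if r["escalated"] and not r["should_escalate"])
    fn = sum(1 for r in runs if not r["escalated"] and r["should_escalate"])
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical labeled runs: did the system escalate, and should it have?
runs = [
    {"escalated": True, "should_escalate": True},
    {"escalated": True, "should_escalate": False},
    {"escalated": False, "should_escalate": True},
    {"escalated": True, "should_escalate": True},
]
precision, recall = escalation_precision_recall(runs)
print(precision, recall)
```

For high-severity scenarios a team would probably weight missed escalations more heavily than unnecessary ones, which this unweighted version does not capture.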
Control Model
| Risk | Control | Evidence |
|---|---|---|
| Synthetic persona encodes stereotypes | sensitive attribute exclusion, bias review, persona evidence link | persona registry review record |
| Scenario library overfits known cases | edge-case injection and periodic refresh | scenario coverage dashboard |
| AI output not grounded | approved source retrieval, citation requirement, unsupported claim detector | RAG trace and evaluator score |
| Agent oversteps authority | tool allowlist, scoped permissions, human approval | tool-call trace and RBAC test |
| Simulation leaks sensitive data | redaction, synthetic reconstruction, retention policy | data handling attestation |
| False confidence from synthetic tests | calibration level labeling, real telemetry comparison | calibration report and residual risk |
| Release gate becomes theater | decision-linked metrics and failure severity threshold | release evidence packet |
Evidence Packet Structure
| Section | Contents |
|---|---|
| Decision scope | use case, journey boundary, model/version, release scope |
| Scenario coverage | scenario list, risk tier, persona confidence, excluded paths |
| Run results | pass/fail, severity, representative traces, reproducibility metadata |
| Failure analysis | root cause, architecture implication, product implication, control implication |
| Calibration | telemetry comparison, SME review, uncertainty level |
| Bias/privacy/security | controls, test results, residual risks |
| Release recommendation | proceed, limited pilot, redesign, or stop |
| Monitoring plan | production telemetry that will recalibrate lab assumptions |
Anti-patterns and Failure Modes
| Anti-pattern | What it looks like | Why it fails | Better pattern |
|---|---|---|---|
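| Persona theater | Many colorful persona cards, but no evidence links or decision use | Cannot support architecture or release decisions | persona registry with owner, evidence, uncertainty, controls |
| Synthetic data laundering | Real sensitive cases are rewritten and then declared "risk-free synthetic data" | May still leak identifiable information or sensitive inferences | redaction, abstraction, privacy review, retention boundary |
| LLM self-confirmation | The same model generates the user, answers the user, and grades the result | Produces circular bias and false agreement | separate simulator, system under test, evaluator, human review |
| Happy-path simulation | Only cooperative, highly capable, unpressured users are simulated | Cannot surface the high-risk boundaries of financial retail | edge-case injection and high-severity scenario portfolio |
| Average-user bias | Looks only at average scores and ignores vulnerable customers, fraud pressure, and complaint escalation | Customer harm usually happens in the tail | segment-specific metrics and severity weighting |
| Release gate by demo | A few polished conversations are used to justify release | Not reproducible, not auditable, and controls cannot be measured | reproducible runs, trace evidence, pass/fail thresholds |
| Uncalibrated behavior | Synthetic users act however the prompt imagines | Product decisions rest on hallucinated behavior | calibration levels and telemetry fit checks |
| Over-automation drift | The lab first tested a Copilot, and the business later treats it as automated decisioning | Permission and accountability boundaries break down | architecture guardrails and change-control trigger |
| Ignoring human adaptation | Assumes employees will use the AI as designed | Real users work around it, over-adopt, copy and paste, or ignore it | simulate human response and capture accept/edit/reject patterns |
| No negative evidence | Failed runs are deleted as prompt bugs | Risk-learning opportunities are lost | preserve failures in evidence plane and backlog |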
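The LLM self-confirmation anti-pattern is one of the few that can be caught by a configuration check before any run starts. The sketch below assumes a run config with the hypothetical keys `simulator_model`, `system_under_test_model`, `evaluator_model`, and `human_review_for_high_severity`.

```python
def role_separation_issues(run_config):
    """Flag configs where simulator, system under test, and evaluator are not separated."""
    roles = ("simulator_model", "system_under_test_model", "evaluator_model")
    issues = []
    for i, first in enumerate(roles):
        for second in roles[i + 1:]:
            if run_config[first] == run_config[second]:
                issues.append(f"{first} and {second} both use {run_config[first]}")
    if not run_config.get("human_review_for_high_severity", False):
        issues.append("no human review configured for high-severity failures")
    return issues


config = {
    "simulator_model": "model-a",          # hypothetical model identifiers
    "system_under_test_model": "model-b",
    "evaluator_model": "model-b",
    "human_review_for_high_severity": True,
}
print(role_separation_issues(config))
```

Distinct model identifiers are a necessary signal rather than proof of independence, since different models can share training data and biases, which is why the better pattern in the table also lists human review.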
Architecture Mapping to RAG / Agent / Copilot / Eval / Governance
| Architecture area | Synthetic lab contribution | Evidence produced |
|---|---|---|
| RAG | Tests whether different personas asking ambiguous questions trigger the correct query, source filter, policy precedence, and citation | query trace, retrieved source version, unsupported claim rate |
| Agent | Tests tool scope, approval, rollback, exception handling, and state memory across multi-step journeys | tool-call trace, approval record, state transition log |
| Copilot | Tests how employees accept, edit, reject, or misuse AI suggestions | accept/edit/reject telemetry, QA defect linkage, over-reliance signal |
| Eval | Extends single-turn answer eval into journey eval, control eval, and outcome proxy eval | rubric score, scenario severity, evaluator agreement |
| Governance | Translates AI RMF / ISO 42001 / model risk language into scenario gates, evidence packets, and owner cadence | release decision memo, residual risk, monitoring plan |
| Privacy | Verifies data minimization, synthetic reconstruction, masking, retention, and access boundaries | data handling record, privacy review result |
| Security | Injects prompt injection, tool misuse, data exfiltration, and adversarial user behavior | attack trace, blocked action, incident exercise output |
| Product discovery | Finds comprehension, trust, friction, control, and channel handoff problems before real experiments | assumption log, product backlog, design rationale |
| Architecture review | Demonstrates that system boundaries, permissions, observability, fallback, and human control are operable | C4/sequence linkage, trace completeness, gate sign-off |
ADR Draft
| Field | Content |
|---|---|
| ADR title | Adopt a governed synthetic user simulation lab for AI product discovery and architecture validation |
| Status | Proposed for high-impact AI use cases |
| Date | 2026-06-30 |
| Context | Financial retail AI products need evidence before exposing customers or employees to RAG, Copilot and Agent capabilities. Real user research and production telemetry are essential, but cannot safely cover every high-risk edge case before launch. Existing UAT and model eval do not sufficiently test behavioral assumptions, channel journeys, tool permissions, human over-reliance and customer harm scenarios. |
| Decision | Build a governed synthetic user simulation lab with persona registry, scenario library, journey simulation engine, edge-case injector, calibration workbench, evidence plane and release gates. Use it as a pre-production evidence generator and post-release recalibration loop, not as a substitute for real users or formal model validation. |
| Option A | Keep traditional UAT, SME review and prompt testing only. Low setup cost, but weak long-tail coverage and poor evidence for behavioral assumptions. |
| Option B | Use ad hoc LLM role-play during product workshops. Fast and creative, but not reproducible, calibrated, governed or audit-ready. |
| Option C | Governed synthetic user lab. Higher operating cost, but provides reusable scenarios, traceable evidence, calibration discipline and architecture-control validation. |
| Consequences | Teams must maintain scenario and persona assets, collect telemetry for calibration, instrument AI runs, and treat simulation failures as product/architecture backlog. Release gates become evidence-heavy but more defensible. |
| Controls | Sensitive data minimization, persona bias review, scenario owner, calibration level, model/prompt/version capture, evaluator separation, human review for high-severity failures, production recalibration. |
| Acceptance criteria | For each high-impact use case, top customer-harm and control-failure scenarios have approved scenario cards, calibrated persona assumptions, reproducible runs, complete traces, severity-rated failure analysis and release evidence packet. |
Decision statement:
We will use synthetic simulation to challenge product and architecture assumptions before release, and we will label simulation evidence by calibration level so it cannot masquerade as real-world proof.
Interview Answer
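30 seconds
Synthetic user simulation is not a traditional persona exercise; it is a governable behavior-testing environment. My approach is to build a persona registry, scenario library, journey simulator, edge-case injector, and evidence plane, calibrated with real telemetry, complaints, fraud cases, and QA samples. Its value is validating, before release, the behavioral boundaries of AI RAG, Copilot, and Agent on high-risk financial retail paths such as KYC rejection explanations, scam payment warnings, collections language, and wealth suitability. It cannot replace real user research, but it turns product assumptions, architecture controls, and release gates into evidence.
2 minutes
I position the synthetic user lab as the middle layer between product discovery and architecture validation. The first step is not letting an LLM role-play users freely, but turning real evidence into governed assets: the persona registry records actor, behavior parameters, constraints, evidence links, and uncertainty; the scenario library records journey stage, trigger, risk, expected controls, and release gate.
The second step is journey simulation. Financial retail risk lies not in a single-turn answer but in multi-turn paths: a customer moving to a human agent after failed account-opening uploads, a scammer coaching a customer past a warning, an employee over-adopting the Copilot, an agent calling the wrong tool. The simulation engine must record persona action, AI response, retrieval, tool call, human approval, escalation, and outcome proxy.
The third step is calibration and governance. Every persona and scenario carries an evidence level: expert hypothesis, interview-backed, telemetry-backed, or backed by production outcome feedback. A high-risk release gate cannot rest on polished LLM-generated conversations; it needs reproducible runs, failure analysis, bias/privacy controls, trace completeness, and residual risk.
Architecturally, it connects RAG source grounding, Agent permission boundaries, Copilot human adoption, Eval journey rubrics, and the Governance evidence packet. My core point: synthetic simulation does not prove the product will succeed; it shows whether the team has systematically challenged its own assumptions and knows which paths still cannot go live.
CTO version
I would run it as the AI platform's pre-production behavior testbed, integrated with observability, eval, policy, identity, and release governance. Technically it needs five key design decisions:
- Keep the simulator separate from the system under test, so the same model does not generate the user, run the system, and grade the result.
- Every run is versioned and replayable: scenario_id, persona_id, seed, model_id, prompt_version, policy_pack_version, retrieval index version, and tool permission scope must be recorded in the trace.
- Personas and scenarios must carry calibration metadata, raising confidence step by step with production telemetry, QA, complaints, and fraud outcomes, and stating explicitly that simulation cannot replace real validation.
- Edge-case injection must cover business harm, not only prompt attacks: KYC false rejection, APP scam warning bypass, collections conduct, complaint escalation, payment dispute misclassification, wealth suitability.
- The release gate looks at the evidence packet, not the demo: scenario coverage, failure severity, control effectiveness, trace completeness, privacy/bias review, residual risk, and monitoring plan.
I would require every high-impact AI use case to pass the synthetic lab before architecture review, while keeping the lab inside its governance boundary: it is an evidence generator, not a compliance conclusion; it finds risk, it does not auto-approve release.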
7-day Practice Plan
| Day | Focus | Practice output |
|---|---|---|
| 1 | Pick one financial retail use case | Choose payment scam warning, KYC onboarding assistant, or complaint summarization copilot; write the system boundary and top 5 behavioral assumptions |
| 2 | Build persona registry | Create 6 evidence-linked personas: customer, employee, scammer/adversary, reviewer, vulnerable customer context, operations manager |
| 3 | Build scenario library | Write 12 scenario cards covering happy path, edge path, control failure, channel switching, adversarial behavior |
| 4 | Design journey simulator | Draw the state machine: user action, AI response, tool call, human approval, escalation, outcome proxy |
| 5 | Define eval and evidence model | Design the rubric, trace schema, severity labels, pass/fail thresholds, and release evidence packet |
| 6 | Run calibration review | Label each scenario's calibration level using public cases, complaint taxonomy, operational metric samples, or a self-built assumption log |
| 7 | Prepare interview narrative | Use the 30-second, 2-minute, and CTO versions to explain why the lab is needed, how to prevent false evidence, and how it connects to RAG/Agent/Copilot/Governance |
Advanced practice criteria:
- Every scenario can answer "which product or architecture decision does this simulation support?"
- Every persona has an evidence confidence level and is not an invented story.
- Every release recommendation states the remaining uncertainty.
- Every high-risk failure becomes an architecture, control, or product backlog item.
Source Anchors
| Source | Link | Use |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Use the Govern / Map / Measure / Manage functions to organize the lab's risk identification, measurement, treatment, and governance accountability |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | For GenAI-specific risks such as hallucination, data leakage, misuse, over-reliance, content provenance, and evaluation |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Use the AI management system view to design owner, policy, operation, performance evaluation, continuous improvement |
| ISO/IEC/IEEE 29148 | https://www.iso.org/standard/72089.html | Use requirements engineering practice to structure stakeholder need, scenario, assumption, validation criteria |
| ISO/IEC/IEEE 42010 | https://www.iso.org/standard/74393.html | Use architecture description / viewpoint practice to connect business, data, application, technology, risk and governance views |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | Use human-AI interaction guidelines to design trust calibration, feedback, error recovery, and user control |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Use the LLM application risk taxonomy to inject edge cases such as prompt injection, sensitive information disclosure, and excessive agency |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Use the traces, metrics, and logs observability model to design the simulation run evidence plane |
Portfolio Positioning
The portfolio value of this note is that it demonstrates three capabilities:
| Capability | Demonstrated by |
|---|---|
| AI Product Thinking | Elevates synthetic users from an "interesting demo" to product discovery, assumption testing, and release evidence |
| Architecture Thinking | Designs persona, scenario, simulator, eval, trace, tool permission, RAG grounding, and governance gate as one system |
| Financial Retail Judgment | Demonstrates business understanding through high-risk paths such as KYC, fraud/scams, collections, complaints, contact center, wealth suitability, and payment disputes |
One-line positioning for interviews:
I use synthetic user simulation as a governed behavior testbed: it stress-tests AI product and architecture assumptions before release, then recalibrates against production evidence after release.