
AI Synthetic User Simulation: User Simulation and Scenario Lab Architecture


543ai-foundations/papers/164-ai-synthetic-user-simulation-persona-scenario-lab-architecture.md

AI Synthetic User Simulation Architecture: Synthetic User Simulation / Persona Scenario Lab / Behavior Testbed

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / AI product architect / AI solution architect / risk-aware product leader
Output: advanced architecture note, decision framework, ADR draft, interview-ready narrative


Why Synthetic User Labs Matter

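Synthetic user simulation is not a traditional persona with an AI avatar, and it is not an LLM role-playing customers at random. It is a behavior-testing architecture for product discovery, architecture validation, and release evidence management: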

Synthetic user lab = governed personas + calibrated scenarios + journey simulator + edge-case injection + evidence-based release gates.

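In financial retail AI, real user research and live experiments are often constrained:

  • High-risk events are infrequent but costly, for example authorized push payment scams, complaint escalation, misleading vulnerable customers, KYC rejection, and improper collections language.
  • Real customer data is sensitive and cannot be copied without limit into discovery, prompt tuning, agent testing, and vendor PoC environments.
  • Regulators, model risk, and internal audit will ask which boundary tests the product team ran and why it believes the AI will not amplify harm on abnormal paths.
  • Traditional UAT usually covers the happy path, but AI system risk comes from long-tail context, tool calls, retrieval errors, user misuse, and mismatched human-AI trust.

The value of a synthetic user lab is not to replace real evidence. When real evidence is scarce, expensive, or sensitive, it provides a repeatable, reviewable, and calibratable environment for exploration: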

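| Traditional method | Limitation | How the synthetic user lab reinforces it |
| --- | --- | --- |
| User interviews | Small samples; hard to cover high-risk boundaries | Turns interview insights into scenario cards and an assumption log, then simulates in batch |
| UAT | Tests whether functions work, not necessarily whether behavior is trustworthy | Simulates the interaction paths of customers, employees, scammers, complainants, and compliance reviewers |
| A/B test | Affects real users; unsuitable for early exploration of high-risk paths | Tests counterfactual paths and negative situations before release |
| Red team | Leans toward security attacks; does not necessarily cover business processes | Tests business boundaries, customer rights, operational controls, and AI risk together |
| Eval benchmark | Focuses on model answers; often detached from the workflow | Tests journey, tool call, retrieval, human override, downstream outcome |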

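Senior PMs, BAs, and architects should position the synthetic user lab as decision evidence:

We do not say "the simulation proves the product will succeed."
We say "the simulation exposed which assumptions are at stake, which paths need added controls, which release thresholds must be met, and which evidence still needs calibration against real users or production telemetry."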

Concept Diagram

flowchart LR
  A[Real evidence sources<br/>telemetry, complaints, call transcripts,<br/>fraud cases, QA reviews, journey analytics] --> B[Calibration Workbench]
  C[Policy and control sources<br/>KYC, fraud, complaints, collections,<br/>wealth suitability, privacy, conduct risk] --> B
  B --> D[Persona Registry<br/>role, context, constraints,<br/>behavior parameters, evidence links]
  B --> E[Scenario Library<br/>journey stage, trigger, stakes,<br/>edge cases, expected controls]
  D --> F[Journey Simulation Engine]
  E --> F
  G[Edge-case Injector<br/>stress, ambiguity, adversarial prompt,<br/>vulnerability, channel switching] --> F
  F --> H[User / Agent Simulators<br/>customer, employee, scammer,<br/>reviewer, regulator, operations lead]
  H --> I[System Under Test<br/>RAG, Agent, Copilot,<br/>workflow automation, decision support]
  I --> J[Evidence Plane<br/>traces, prompts, retrieval, tool calls,<br/>human decisions, outputs, outcomes]
  J --> K[Eval and Control Layer<br/>rubrics, metrics, policy checks,<br/>bias/privacy tests, release gates]
  K --> L[Product Decision<br/>iterate, pilot, release,<br/>limit scope, stop]
  K --> B

Core loop:

Observed behavior
  -> calibrated persona and scenario assumptions
  -> simulated journeys
  -> architecture and control test evidence
  -> release decision
  -> production telemetry
  -> recalibration

Architecture Components

1. Evidence Intake Layer

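This layer collects evidence of real behavior without exposing sensitive source text directly to the simulation environment.

| Evidence source | Use | Controls |
| --- | --- | --- |
| Journey telemetry | Calibrate drop-off, retry, channel switching, time-to-complete | Aggregation, de-identification, field minimization |
| Contact center transcripts | Calibrate customer language, emotion, misunderstanding, escalation paths | PII masking, purpose limitation, retention |
| Complaint narratives | Find customer harm, expectation gaps, explanation failures | legal/risk review, sampling discipline |
| Fraud and scam cases | Build attacker scripts and vulnerable-customer scenarios | access control, synthetic reconstruction |
| KYC exception logs | Capture abnormal paths in identity, address, documents, sanctions screening | role-based access and redaction |
| QA and audit findings | Turn historical control defects into scenario tests | finding owner and closure evidence |

Key principle: the simulation reads governed behavior facts, not unlimited copies of customer data.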

2. Persona Registry

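A persona registry is not a set of marketing personas. It is a versioned, traceable, and constrained catalog of behavior models.

| Field | Advanced meaning |
| --- | --- |
| persona_id | Stable ID, for example kyc-newcomer-low-doc-confidence-v2 |
| actor_type | customer, employee, scammer, reviewer, relationship manager, regulator |
| domain_context | onboarding, fraud, collections, complaint, wealth, dispute |
| behavior_parameters | patience, digital confidence, risk tolerance, language clarity, channel preference |
| constraints | Regulatory boundaries, privacy limits, sensitive attributes that must not be used, items that must not be inferred |
| evidence_links | Telemetry segment, case sample, or research insight that supports the persona |
| uncertainty_level | high / medium / low; determines whether the persona can be used for a release gate |
| owner | PM / BA / Risk / Research owner |
| review_cadence | monthly for pilot, quarterly for stable product |

Personas should focus on behavior and context, and avoid using sensitive attributes such as age, ethnicity, or gender as convenient labels. Financial retail is better modeled through task capability, channel familiarity, financial vulnerability signals, document availability, language comprehension difficulty, risk exposure, and service needs.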
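
A registry entry with these fields can be sketched as a small data class. This is a minimal illustration, not a reference schema: the `Persona` class, the `SENSITIVE_ATTRIBUTES` set, and the gate rule (valid entry, at least one evidence link, uncertainty below "high") are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical deny-list: attributes the registry refuses as behavior parameters.
SENSITIVE_ATTRIBUTES = {"age", "ethnicity", "gender"}
UNCERTAINTY_LEVELS = {"high", "medium", "low"}


@dataclass
class Persona:
    persona_id: str
    actor_type: str
    domain_context: str
    behavior_parameters: dict
    evidence_links: tuple = ()
    uncertainty_level: str = "high"
    owner: str = ""

    def validation_errors(self) -> list:
        """Return registry rule violations; an empty list means the entry is acceptable."""
        errors = []
        used = SENSITIVE_ATTRIBUTES & {key.lower() for key in self.behavior_parameters}
        if used:
            errors.append(f"sensitive attributes used as parameters: {sorted(used)}")
        if self.uncertainty_level not in UNCERTAINTY_LEVELS:
            errors.append(f"unknown uncertainty_level: {self.uncertainty_level}")
        if not self.owner:
            errors.append("missing owner")
        return errors

    def usable_for_release_gate(self) -> bool:
        # Assumed rule: a gate needs a valid entry, evidence, and non-high uncertainty.
        return (not self.validation_errors()
                and len(self.evidence_links) > 0
                and self.uncertainty_level != "high")


newcomer = Persona(
    persona_id="kyc-newcomer-low-doc-confidence-v2",
    actor_type="customer",
    domain_context="onboarding",
    behavior_parameters={"patience": 0.3, "digital_confidence": 0.6},
    evidence_links=("telemetry:onboarding-dropoff-segment",),  # placeholder link
    uncertainty_level="medium",
    owner="PM",
)
stereotype = Persona(
    persona_id="elderly-tech-averse",
    actor_type="customer",
    domain_context="onboarding",
    behavior_parameters={"age": 75},
)
```

Here `newcomer` passes validation, while `stereotype` is rejected both for using a sensitive attribute and for having no owner.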

3. Scenario Library

A scenario is the basic unit of product and architecture validation. A valid scenario contains at least:

| Field | Example |
| --- | --- |
| scenario_id | fraud-app-scam-warning-bypass-001 |
| journey_stage | payment initiation, warning, confirmation, dispute |
| trigger | customer tries to send first-time high-value instant payment |
| stakes | financial loss, complaint, regulatory scrutiny |
| expected_system_behavior | detect risk signal, show tailored warning, offer pause/escalation |
| expected_human_behavior | customer may minimize warning due to social-engineering pressure |
| control_points | scam typology check, confirmation friction, cooling-off option, trace evidence |
| simulation_variants | urgency, trusted payee narrative, remote access app mention, elder vulnerability signal |
| evidence_basis | recent scam complaint sample, fraud typology, payment telemetry |
| release_gate_link | fraud warning effectiveness gate |

The scenario library should be managed like a test suite, with versions, owners, coverage, and retirement rules. It is not a one-off workshop artifact.
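
Treating the library like a test suite implies that completeness and coverage can be checked mechanically. The sketch below assumes scenario cards are plain dictionaries, that `journey_stage` holds a list of stages, and that a `retired` flag marks retired cards; these representations are illustrative choices, not part of the note.

```python
# Field names follow the scenario table above.
REQUIRED_FIELDS = (
    "scenario_id", "journey_stage", "trigger", "stakes",
    "expected_system_behavior", "expected_human_behavior", "control_points",
    "simulation_variants", "evidence_basis", "release_gate_link",
)


def missing_fields(card: dict) -> list:
    """List required fields that are absent or empty on a scenario card."""
    return [name for name in REQUIRED_FIELDS if not card.get(name)]


def uncovered_stages(cards: list, required_stages: list) -> list:
    """Return required journey stages not covered by any complete, active card."""
    covered = set()
    for card in cards:
        if not missing_fields(card) and not card.get("retired", False):
            covered.update(card["journey_stage"])
    return [stage for stage in required_stages if stage not in covered]


card = {
    "scenario_id": "fraud-app-scam-warning-bypass-001",
    "journey_stage": ["payment initiation", "warning", "confirmation", "dispute"],
    "trigger": "customer tries to send first-time high-value instant payment",
    "stakes": "financial loss, complaint, regulatory scrutiny",
    "expected_system_behavior": "detect risk signal, show tailored warning",
    "expected_human_behavior": "customer may minimize warning",
    "control_points": ["scam typology check", "cooling-off option"],
    "simulation_variants": ["urgency", "trusted payee narrative"],
    "evidence_basis": "recent scam complaint sample",
    "release_gate_link": "fraud warning effectiveness gate",
}
# An incomplete draft card does not count toward coverage.
draft = {"scenario_id": "kyc-address-proof-draft", "journey_stage": ["document upload"]}
gaps = uncovered_stages([card, draft], ["warning", "document upload"])
```

In this example `gaps` contains only "document upload", because the draft card is missing most required fields.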

4. Journey Simulation Engine

The journey simulation engine manages multi-turn interaction, state transitions, and branching paths:

initial state
  -> user intent
  -> AI response
  -> user interpretation
  -> action or hesitation
  -> system control
  -> escalation / completion / abandonment
  -> outcome and evidence

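Advanced capabilities:

  • Multi-actor support: customer, frontline employee, back-office analyst, fraudster, complaint handler, compliance reviewer.
  • Multi-channel support: mobile app, web, branch, contact center, secure message, email follow-up.
  • Stateful journeys: remember the disclosures the user has seen, the files uploaded, the reasons for earlier rejections, and prior complaints.
  • Control injection: human approval, step-up verification, cooling-off period, policy check, tool permission boundary.
  • Deterministic replay: the same scenario, persona, model version, prompt version, and seed can be rerun.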
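
Deterministic replay on the simulator side can be sketched with a seeded random generator driving a state machine. The transition table and weights below are invented for illustration. Note that the seed only fixes the simulator's own sampling; making calls to the model under test reproducible is a separate concern.

```python
import random

# Hypothetical transition table: state -> list of (next_state, weight).
TRANSITIONS = {
    "user_intent": [("ai_response", 1.0)],
    "ai_response": [("action", 0.6), ("hesitation", 0.4)],
    "hesitation": [("escalation", 0.5), ("abandonment", 0.5)],
    "action": [("system_control", 1.0)],
    "system_control": [("completion", 0.7), ("escalation", 0.3)],
}
TERMINAL = {"completion", "escalation", "abandonment"}


def simulate_journey(seed: int, start: str = "user_intent", max_steps: int = 20) -> list:
    """Walk the state machine; the same seed always yields the same path."""
    rng = random.Random(seed)  # private generator, no shared global state
    state, path = start, [start]
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


first_run = simulate_journey(seed=42)
replay = simulate_journey(seed=42)  # identical to first_run
```

Storing the seed in the run metadata is what makes a failed journey replayable during failure analysis.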

5. Edge-case Injector

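Edge-case injection is one of the lab's core values. It systematizes the messy paths of the real world:

| Edge class | Financial retail examples |
| --- | --- |
| Ambiguity | The customer says "my friend told me to transfer right away" but does not admit to being scammed |
| Vulnerability | Limited comprehension, recent loss of a spouse, language barrier, heavy financial pressure |
| Policy conflict | KYC pass-rate targets conflict with sanctions/AML risk controls |
| Tool risk | The agent has permission to issue refunds, freeze accounts, or update addresses |
| Retrieval mismatch | RAG retrieves an outdated fee policy or rules from another state or region |
| Channel switching | Mobile onboarding fails, the customer moves to the contact center, and context is lost |
| Adversarial behavior | A scammer coaches the customer to bypass warnings, or an employee attempts prompt injection |
| Rare but severe | Unsuitable wealth recommendations, collections language that triggers a complaint, chargeback handling that exceeds its time limit |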
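
One way to systematize these edge classes is to expand a base scenario into variants, one per combination of edge values. The sketch below uses a full Cartesian product, which is an assumption: a real lab would likely prune or sample combinations. The dimension names and values are illustrative.

```python
from itertools import product


def inject_edge_cases(base: dict, edge_dimensions: dict) -> list:
    """Expand a base scenario into one variant per combination of edge values."""
    names = sorted(edge_dimensions)  # fixed order keeps variant IDs stable
    variants = []
    for combo in product(*(edge_dimensions[name] for name in names)):
        variant = dict(base)  # shallow copy; the base card is left untouched
        variant["edge_cases"] = dict(zip(names, combo))
        variant["scenario_id"] = base["scenario_id"] + "+" + "+".join(combo)
        variants.append(variant)
    return variants


base = {"scenario_id": "fraud-app-scam-warning-bypass-001"}
dimensions = {
    "ambiguity": ["none", "denies-scam"],
    "channel": ["app-only", "app-to-call"],
    "vulnerability": ["none", "language-barrier", "financial-stress"],
}
variants = inject_edge_cases(base, dimensions)  # 2 * 2 * 3 combinations
```

Each variant keeps the base scenario's fields and records which edge values were injected, so failures can be traced back to a specific combination.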

6. System-under-Test Adapter

The lab does not test only the model. It tests the whole AI product architecture:

  • RAG: query rewriting, retrieval filter, source freshness, citation support, policy precedence.
  • Agent: tool allowlist, permission scope, confirmation step, rollback path, human approval.
  • Copilot: suggestion placement, user edit, accept/reject behavior, trust calibration.
  • Workflow automation: handoff, queue routing, SLA, case notes, audit trail.
  • Evaluation layer: task rubric, policy rubric, safety rubric, business outcome proxy.

7. Evidence Plane

Every simulation run must produce reviewable evidence:

| Evidence object | Minimum fields |
| --- | --- |
| run metadata | run_id, scenario_id, persona_id, seed, model_id, prompt_version, policy_pack_version |
| journey trace | step_id, actor, channel, input summary, system action, state transition |
| retrieval evidence | query, source_id, source_version, score, citation used |
| tool evidence | tool_name, permission scope, arguments hash, approval decision, result summary |
| control evidence | policy check, refusal, escalation, human review, override, reason |
| outcome evidence | completion, abandonment, complaint risk, loss proxy, rework, cycle time |
| evaluator evidence | rubric score, failure label, severity, reviewer, calibration status |

When aligned with the OpenTelemetry model, each simulation run can map to one trace, with persona action, retrieval, model call, tool call, human approval, policy decision, and output delivery recorded as child spans under the run's root span.
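
The root-and-child structure can be sketched in plain Python without the OpenTelemetry SDK. The `Span` class and helper functions below are hypothetical stand-ins for illustration only; a real implementation would use the SDK's own tracer and span objects.

```python
from dataclasses import dataclass, field

# Minimum run metadata, taken from the evidence table above.
REQUIRED_RUN_METADATA = (
    "run_id", "scenario_id", "persona_id", "seed",
    "model_id", "prompt_version", "policy_pack_version",
)


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name: str, **attributes) -> "Span":
        """Create a child span and attach it to this span."""
        span = Span(name, attributes)
        self.children.append(span)
        return span


def missing_run_metadata(root: Span) -> list:
    """Return required metadata keys absent from the root span (a seed of 0 is valid)."""
    return [key for key in REQUIRED_RUN_METADATA if root.attributes.get(key) is None]


def span_names(root: Span) -> list:
    """Flatten the span tree into a depth-first list of span names."""
    names = [root.name]
    for span in root.children:
        names.extend(span_names(span))
    return names


root = Span("simulation_run", {
    "run_id": "run-0001", "scenario_id": "fraud-app-scam-warning-bypass-001",
    "persona_id": "kyc-newcomer-low-doc-confidence-v2", "seed": 0,
    "model_id": "model-a", "prompt_version": "p3", "policy_pack_version": "2026.06",
})
root.child("persona_action", actor="customer", channel="mobile app")
model_call = root.child("model_call")
model_call.child("retrieval", source_id="fee-policy", source_version="v7")
root.child("policy_decision", outcome="escalate")
```

The values in the example (run ID, model ID, versions) are placeholders.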


Scenario Governance and Persona Taxonomy

Governance Lifecycle

Propose scenario
  -> map to product decision or architecture risk
  -> attach evidence basis
  -> classify risk tier
  -> approve for lab use
  -> run simulations
  -> review failures and assumptions
  -> update product / architecture / controls
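  -> promote to release gate or retire

| Stage | Owner | Decision question | Required evidence |
| --- | --- | --- | --- |
| Intake | PM / BA | Which product or architecture decision does this scenario support? | decision memo link, journey map |
| Risk classification | Risk / Architect | Does it involve customer harm, regulatory obligations, automated actions, or sensitive data? | risk tier rationale |
| Calibration | Research / Analytics | Are the behavioral assumptions supported by real evidence? | telemetry, sample cases, interviews |
| Simulation approval | Governance forum | Can this scenario be used for a gate? | persona confidence, data controls |
| Release gate | Product / Risk / Tech | Can the current system go live within this boundary? | run results, failure analysis, residual risk |
| Recalibration | PM / Analytics | Does production telemetry change the assumptions? | drift report, complaint/fraud/QA linkage |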

Persona Taxonomy

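An advanced persona taxonomy should be organized by testable behavior dimensions, not by narrative labels.

| Dimension | Examples | Why it matters |
| --- | --- | --- |
| Actor role | retail customer, small business owner, contact center agent, fraud analyst, collections specialist, wealth advisor, scammer | Makes explicit who acts in the system and who is accountable for judgment |
| Task capability | low document readiness, high digital confidence, limited product literacy, strong financial literacy | Affects onboarding, disclosure, error recovery |
| Risk exposure | scam pressure, arrears stress, complaint escalation, suitability risk, identity mismatch | Affects control strength and human intervention |
| Channel behavior | app-first, call-first, branch-assisted, channel hopping | Affects journey state and evidence continuity |
| Trust posture | over-trusting AI, skeptical, confused, seeking confirmation, gaming the process | Affects Copilot and warning design |
| Constraint profile | accessibility need, language simplification, privacy preference, device limitation | Affects fair access and service quality |
| Evidence confidence | telemetry-backed, complaint-backed, expert hypothesis, exploratory | Determines whether the persona can be used for a release gate |

Unacceptable personas:

"Young users like speed"
"Older people don't understand technology"
"High-net-worth customers need premium service"

Acceptable persona:

"First-time account opener with incomplete documents and a weak understanding of KYC rejection reasons, switching between the mobile app and the contact center, with one failed upload already. Evidence comes from onboarding drop-off telemetry, call reason codes, and QA samples."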

Calibration Against Real Behavior and Telemetry

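The biggest risk of synthetic simulation is manufacturing false evidence that looks precise. Calibration must therefore be treated as an architecture capability, not as an analyst's handwritten note.

Calibration Inputs

| Input | Calibration target |
| --- | --- |
| Funnel telemetry | Drop-off, retry, and abandonment at each journey stage |
| Call reason codes | Points of user confusion, escalation reasons, repeat contact |
| Complaint taxonomy | Types of customer harm, explanation failures, handling-time problems |
| Fraud outcomes | scam typology, warning bypass, loss and recovery pattern |
| QA samples | Variation in employee handling, policy adherence, case note quality |
| A/B or pilot results | Actual effect of the AI intervention on behavior and outcomes |
| Subject matter expert review | Business plausibility check for very low-frequency, high-impact paths |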

Calibration Levels

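| Level | Meaning | Allowed use |
| --- | --- | --- |
| L0 - exploratory hypothesis | Hypothesis proposed by PM / BA / SME, with no real evidence yet | discovery brainstorming, not release gate |
| L1 - qualitative support | Supported by interviews, case review, or complaint samples | scenario design, early prototype evaluation |
| L2 - telemetry support | Behavioral data supports frequency, path, drop-off, or repeat contact | architecture validation and pilot gate |
| L3 - outcome-linked support | Linked to outcomes such as loss, complaints, QA defects, conversion, and cycle time | release gate and scale/stop decision |
| L4 - production recalibrated | Continuously fed back after launch; drift can be monitored | continuous governance and model/product tuning |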
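
The level-to-use mapping can be enforced in code so that low-calibration evidence cannot reach a release gate. The sketch below assumes the levels are cumulative (a higher level permits every use allowed at a lower level), which the table implies but does not state; the use names are illustrative keys.

```python
from enum import IntEnum


class CalibrationLevel(IntEnum):
    L0_EXPLORATORY = 0
    L1_QUALITATIVE = 1
    L2_TELEMETRY = 2
    L3_OUTCOME_LINKED = 3
    L4_PRODUCTION_RECALIBRATED = 4


# Assumed minimum level per use, read off the "Allowed use" column.
MINIMUM_LEVEL = {
    "discovery": CalibrationLevel.L0_EXPLORATORY,
    "prototype_evaluation": CalibrationLevel.L1_QUALITATIVE,
    "pilot_gate": CalibrationLevel.L2_TELEMETRY,
    "release_gate": CalibrationLevel.L3_OUTCOME_LINKED,
    "continuous_governance": CalibrationLevel.L4_PRODUCTION_RECALIBRATED,
}


def allowed(use: str, level: CalibrationLevel) -> bool:
    """True when evidence at this calibration level may support the given use."""
    return level >= MINIMUM_LEVEL[use]


can_brainstorm = allowed("discovery", CalibrationLevel.L0_EXPLORATORY)
can_gate_on_telemetry_only = allowed("release_gate", CalibrationLevel.L2_TELEMETRY)
```

With these assumptions, an L0 hypothesis is fine for discovery, while L2 evidence alone is not enough for a release gate.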

Calibration Discipline

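Every persona and scenario must record:

  • Which behavioral assumption is being simulated.
  • Where the evidence comes from and what the sample time window is.
  • Which attributes were synthesized or abstracted, and which must not be used for sensitive inference.
  • How large the gap is against real telemetry.
  • Whether that gap changes the release decision.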
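
The gap against real telemetry can be made concrete with a per-stage comparison. The following is a minimal sketch, assuming simulated and observed drop-off rates are available as dictionaries keyed by the same journey stages; the stage names, rates, and the 0.05 tolerance are invented for illustration.

```python
def telemetry_gap(simulated: dict, observed: dict) -> dict:
    """Absolute difference between simulated and observed rates, per journey stage."""
    return {stage: abs(simulated[stage] - observed[stage]) for stage in observed}


def stages_out_of_tolerance(simulated: dict, observed: dict, tolerance: float = 0.05) -> list:
    """Stages where the simulation deviates from telemetry by more than the tolerance."""
    gaps = telemetry_gap(simulated, observed)
    return sorted(stage for stage, gap in gaps.items() if gap > tolerance)


# Hypothetical drop-off rates per stage.
simulated_dropoff = {"document_upload": 0.30, "identity_check": 0.10}
observed_dropoff = {"document_upload": 0.42, "identity_check": 0.12}
needs_recalibration = stages_out_of_tolerance(simulated_dropoff, observed_dropoff)
```

A stage that falls out of tolerance is a prompt to revisit the persona's behavior parameters before its runs are used as gate evidence.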

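Example:

| Assumption | Calibration evidence | Decision impact |
| --- | --- | --- |
| In high-pressure scam scenarios, customers ignore generic warnings | In the APP scam complaint sample from the past 90 days, most customers said they saw the warning but did not understand it | Payment warnings need a scenario-specific pause and cannot rely on a generic banner alone |
| After a KYC document upload fails, customers re-upload the same incorrect file | Onboarding telemetry shows a high repeat-upload rate within 24 hours of a failure | The RAG assistant must explain the specific gap and offer a channel handoff |
| Contact center agents over-accept AI-generated complaint summaries | Pilot QA sample shows a low edit rate on low-complexity cases | High-risk complaints need mandatory review and citation check |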

Financial Retail Scenarios

Scenario Portfolio

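| Domain | Advanced scenario | What the lab validates |
| --- | --- | --- |
| Onboarding / KYC | The customer's proof of address is rejected, several uploads fail, and the customer moves to the contact center demanding to "open the account now" | Whether RAG cites the correct KYC policy; whether Copilot explains the rejection reason; whether the handoff preserves state |
| Fraud / scams | The customer initiates a high-value real-time payment while coached by a scammer on the phone, and tries to bypass warnings | Whether the agent recognizes the typology; whether the warning is contextual; whether cooling-off and human escalation are triggered |
| Collections | An overdue customer expresses financial hardship and emotional stress, requests an extension, and threatens to complain | Whether Copilot avoids improper collections language; whether it recognizes hardship; whether it offers compliant options |
| Complaints | The customer complains that loan fees were explained unclearly, has already made contact through several channels, and demands escalation to the regulator | Whether the summary is faithful; whether root cause is traceable; whether SLA and escalation are correct |
| Contact center | A new employee handles a complex dispute while AI suggests next steps and wording | Whether Copilot improves handling quality or increases over-reliance and erroneous case notes |
| Wealth suitability | The customer asks for a high-yield product while the risk tolerance questionnaire indicates a conservative profile | Whether AI blocks the unsuitable recommendation; whether it generates a suitability rationale and escalation |
| Payment disputes | The customer denies a transaction, but merchant evidence partly matches and the timing is close to a travel alert | Whether the agent distinguishes fraud claim, merchant dispute, and friendly fraud; whether it preserves the evidence chain |
| Small business banking | A business customer with tight cashflow applies for a loan and requests deferred repayment at the same time | Whether the journey simulator exposes cross-product risk and service conflict |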

Example: Authorized Push Payment Scam Lab

Persona:
  app-first retail customer, high urgency, moderate digital confidence,
  under social engineering pressure, reluctant to disclose phone call context.

Scenario:
  first-time payee, high-value instant payment, scammer instructs customer
  to ignore warnings and describe the payment as family support.

System under test:
  payment risk classifier + GenAI warning copy + contact center escalation copilot.

Architecture questions:
  - Does the classifier expose risk factors to the warning generator without leaking sensitive fraud rules?
  - Does the warning generator produce specific, plain-language friction?
  - Can the agent pause payment or only recommend escalation?
  - Is the final action traceable for complaint and reimbursement review?

Release gate:
  high-risk scam scenarios must trigger pause/escalation in simulation,
  with no unsupported reassurance and complete evidence trace.
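
The release gate above can be expressed as a check over run results. This is a sketch under assumed field names (`risk_tier`, `final_action`, `unsupported_reassurance`, `trace_complete`), which are hypothetical and would come from the evidence plane in practice. The gate deliberately fails when no high-risk run was evaluated, so that an empty result set cannot pass by default.

```python
def scam_release_gate(runs: list) -> dict:
    """Evaluate high-risk scam runs against the release gate conditions."""
    failures, evaluated = [], 0
    for run in runs:
        if run["risk_tier"] != "high":
            continue  # the gate only constrains high-risk scenarios
        evaluated += 1
        reasons = []
        if run["final_action"] not in {"pause", "escalate"}:
            reasons.append("no pause/escalation")
        if run["unsupported_reassurance"]:
            reasons.append("unsupported reassurance")
        if not run["trace_complete"]:
            reasons.append("incomplete trace")
        if reasons:
            failures.append((run["run_id"], reasons))
    return {"passed": evaluated > 0 and not failures,
            "evaluated": evaluated,
            "failures": failures}


runs = [
    {"run_id": "r1", "risk_tier": "high", "final_action": "pause",
     "unsupported_reassurance": False, "trace_complete": True},
    {"run_id": "r2", "risk_tier": "high", "final_action": "proceed",
     "unsupported_reassurance": True, "trace_complete": True},
    {"run_id": "r3", "risk_tier": "low", "final_action": "proceed",
     "unsupported_reassurance": False, "trace_complete": True},
]
gate_result = scam_release_gate(runs)
```

In this example the gate fails because run r2 let the payment proceed and included unsupported reassurance.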

Example: KYC Onboarding Scenario Lab

Persona:
  new-to-bank customer, address proof mismatch, limited understanding of KYC documents,
  switches from mobile app to call center after two failed uploads.

Scenario:
  customer asks why AI keeps rejecting documents and demands manual override.

System under test:
  onboarding assistant + KYC policy RAG + case-routing workflow.

Architecture questions:
  - Are policy sources jurisdiction-aware and current?
  - Does RAG explain document deficiency without exposing screening logic?
  - Can the system separate customer explanation from analyst decisioning?
  - Is the rejection rationale stored for audit and complaint response?

Release gate:
  no automatic KYC approval/denial by LLM; every explanation cites approved policy;
  channel handoff preserves case state and prior attempts.

Metrics / Control / Evidence Model

Metrics Hierarchy

| Layer | Metric | Interpretation |
| --- | --- | --- |
| Scenario coverage | critical journey coverage, high-risk path coverage, persona confidence distribution | Whether the paths that actually drive release risk are covered |
| Behavioral plausibility | telemetry fit, SME plausibility score, replay consistency | Whether synthetic users are close enough to real behavior |
| Product quality | completion, comprehension proxy, drop-off reduction, rework proxy | Whether the product hypothesis improves the journey |
| AI quality | groundedness, instruction following, refusal quality, citation support, tool-call correctness | Whether AI capability meets the bar |
| Control quality | escalation precision, override capture, human approval completeness, policy boundary hits | Whether controls are effective and provable |
| Risk outcomes | complaint risk, fraud loss proxy, unsuitable recommendation block rate, unfair treatment signal | Whether customer or business harm is reduced or avoided |
| Evidence quality | trace completeness, reproducibility, version capture, reviewer agreement | Whether the release decision is auditable |
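
Two of these metrics, escalation precision and trace completeness, can be computed directly from run records. The sketch below assumes each run carries an `escalated` flag, a reviewer-labelled `should_escalate` flag, and a list of recorded span names; these field names are illustrative.

```python
def escalation_precision(runs: list):
    """Share of escalated runs where escalation was warranted; None if nothing escalated."""
    escalated = [run for run in runs if run["escalated"]]
    if not escalated:
        return None
    return sum(run["should_escalate"] for run in escalated) / len(escalated)


def trace_completeness(runs: list, required_spans: list):
    """Share of runs whose trace contains every required span; None for no runs."""
    if not runs:
        return None
    complete = sum(set(required_spans) <= set(run["spans"]) for run in runs)
    return complete / len(runs)


required = ["retrieval", "model_call", "policy_decision"]
runs = [
    {"escalated": True, "should_escalate": True, "spans": required},
    {"escalated": True, "should_escalate": False, "spans": required},
    {"escalated": False, "should_escalate": False, "spans": ["model_call"]},
]
precision = escalation_precision(runs)          # 1 warranted out of 2 escalations
completeness = trace_completeness(runs, required)  # 2 complete traces out of 3 runs
```

Returning `None` rather than zero when there is nothing to measure keeps an empty sample from being read as a bad (or good) score.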

Control Model

| Risk | Control | Evidence |
| --- | --- | --- |
| Synthetic persona encodes stereotypes | sensitive attribute exclusion, bias review, persona evidence link | persona registry review record |
| Scenario library overfits known cases | edge-case injection and periodic refresh | scenario coverage dashboard |
| AI output not grounded | approved source retrieval, citation requirement, unsupported claim detector | RAG trace and evaluator score |
| Agent oversteps authority | tool allowlist, scoped permissions, human approval | tool-call trace and RBAC test |
| Simulation leaks sensitive data | redaction, synthetic reconstruction, retention policy | data handling attestation |
| False confidence from synthetic tests | calibration level labeling, real telemetry comparison | calibration report and residual risk |
| Release gate becomes theater | decision-linked metrics and failure severity threshold | release evidence packet |

Evidence Packet Structure

| Section | Contents |
| --- | --- |
| Decision scope | use case, journey boundary, model/version, release scope |
| Scenario coverage | scenario list, risk tier, persona confidence, excluded paths |
| Run results | pass/fail, severity, representative traces, reproducibility metadata |
| Failure analysis | root cause, architecture implication, product implication, control implication |
| Calibration | telemetry comparison, SME review, uncertainty level |
| Bias/privacy/security | controls, test results, residual risks |
| Release recommendation | proceed, limited pilot, redesign, or stop |
| Monitoring plan | production telemetry that will recalibrate lab assumptions |

Anti-patterns and Failure Modes

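| Anti-pattern | What it looks like | Why it fails | Better pattern |
| --- | --- | --- | --- |
| Persona theater | Many colorful persona cards, but no evidence links or decision use | Cannot support architecture or release decisions | persona registry with owner, evidence, uncertainty, controls |
| Synthetic data laundering | Rewriting real sensitive cases and claiming "synthetic data carries no risk" | May still leak identifiable information or sensitive inferences | redaction, abstraction, privacy review, retention boundary |
| LLM self-confirmation | The same model generates the user, answers the user, and evaluates the result | Produces circular bias and false consistency | separate simulator, system under test, evaluator, human review |
| Happy-path simulation | Only simulates cooperative, high-comprehension, unstressed users | Cannot surface the high-risk boundaries of financial retail | edge-case injection and high-severity scenario portfolio |
| Average-user bias | Looks only at the average score and ignores vulnerable customers, fraud pressure, and complaint escalation | Customer harm usually occurs in the tail | segment-specific metrics and severity weighting |
| Release gate by demo | A few polished conversations are used to justify release | Not reproducible, not auditable, and unable to measure controls | reproducible runs, trace evidence, pass/fail thresholds |
| Uncalibrated behavior | Synthetic users act however the prompt imagines | Product decisions rest on hallucinated behavior | calibration levels and telemetry fit checks |
| Over-automation drift | The lab starts by testing a Copilot, then the business treats it as automated decisioning | Permission and accountability boundaries break down | architecture guardrails and change-control trigger |
| Ignoring human adaptation | Assumes employees will use AI as designed | Real users work around it, over-accept, copy-paste, or ignore it | simulate human response and capture accept/edit/reject patterns |
| No negative evidence | Failed runs are deleted as prompt bugs | Loses the opportunity to learn about risk | preserve failures in evidence plane and backlog |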
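
The "LLM self-confirmation" row suggests a simple pre-run check: refuse to start a run when two roles share a model. The sketch below assumes a run configuration dictionary with hypothetical keys `simulator_model`, `system_model`, and `evaluator_model`. Comparing identifiers is only a first line of defense and does not replace human review.

```python
# Hypothetical configuration keys for the three roles that should stay separate.
ROLES = ("simulator_model", "system_model", "evaluator_model")


def role_separation_violations(run_config: dict) -> list:
    """Return pairs of roles that are configured with the same model identifier."""
    violations = []
    for index, first in enumerate(ROLES):
        for second in ROLES[index + 1:]:
            if run_config[first] == run_config[second]:
                violations.append((first, second))
    return violations


separated = {"simulator_model": "model-a", "system_model": "model-b",
             "evaluator_model": "model-c"}
self_confirming = {"simulator_model": "model-a", "system_model": "model-b",
                   "evaluator_model": "model-b"}
violations = role_separation_violations(self_confirming)
```

Here `violations` flags that the system under test is also acting as its own evaluator.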

Architecture Mapping to RAG / Agent / Copilot / Eval / Governance

| Architecture area | Synthetic lab contribution | Evidence produced |
| --- | --- | --- |
| RAG | Tests whether different personas asking ambiguous questions trigger the correct query, source filter, policy precedence, and citation | query trace, retrieved source version, unsupported claim rate |
| Agent | Tests tool scope, approval, rollback, exception handling, and state memory across multi-step journeys | tool-call trace, approval record, state transition log |
| Copilot | Tests how employees accept, edit, reject, or misuse AI suggestions | accept/edit/reject telemetry, QA defect linkage, over-reliance signal |
| Eval | Extends single-turn answer eval into journey eval, control eval, and outcome proxy eval | rubric score, scenario severity, evaluator agreement |
| Governance | Translates AI RMF / ISO 42001 / model risk language into scenario gates, evidence packets, and owner cadence | release decision memo, residual risk, monitoring plan |
| Privacy | Verifies data minimization, synthetic reconstruction, masking, retention, and access boundaries | data handling record, privacy review result |
| Security | Injects prompt injection, tool misuse, data exfiltration, and adversarial user behavior | attack trace, blocked action, incident exercise output |
| Product discovery | Finds problems with user comprehension, trust, friction, controls, and channel handoff before real experiments | assumption log, product backlog, design rationale |
| Architecture review | Demonstrates that system boundaries, permissions, observability, fallback, and human control actually work | C4/sequence linkage, trace completeness, gate sign-off |

ADR Draft

| Field | Content |
| --- | --- |
| ADR title | Adopt a governed synthetic user simulation lab for AI product discovery and architecture validation |
| Status | Proposed for high-impact AI use cases |
| Date | 2026-06-30 |
| Context | Financial retail AI products need evidence before exposing customers or employees to RAG, Copilot and Agent capabilities. Real user research and production telemetry are essential, but cannot safely cover every high-risk edge case before launch. Existing UAT and model eval do not sufficiently test behavioral assumptions, channel journeys, tool permissions, human over-reliance and customer harm scenarios. |
| Decision | Build a governed synthetic user simulation lab with persona registry, scenario library, journey simulation engine, edge-case injector, calibration workbench, evidence plane and release gates. Use it as a pre-production evidence generator and post-release recalibration loop, not as a substitute for real users or formal model validation. |
| Option A | Keep traditional UAT, SME review and prompt testing only. Low setup cost, but weak long-tail coverage and poor evidence for behavioral assumptions. |
| Option B | Use ad hoc LLM role-play during product workshops. Fast and creative, but not reproducible, calibrated, governed or audit-ready. |
| Option C | Governed synthetic user lab. Higher operating cost, but provides reusable scenarios, traceable evidence, calibration discipline and architecture-control validation. |
| Consequences | Teams must maintain scenario and persona assets, collect telemetry for calibration, instrument AI runs, and treat simulation failures as product/architecture backlog. Release gates become evidence-heavy but more defensible. |
| Controls | Sensitive data minimization, persona bias review, scenario owner, calibration level, model/prompt/version capture, evaluator separation, human review for high-severity failures, production recalibration. |
| Acceptance criteria | For each high-impact use case, top customer-harm and control-failure scenarios have approved scenario cards, calibrated persona assumptions, reproducible runs, complete traces, severity-rated failure analysis and release evidence packet. |

Decision statement:

We will use synthetic simulation to challenge product and architecture assumptions before release, and we will label simulation evidence by calibration level so it cannot masquerade as real-world proof.

Interview Answer

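30 seconds

Synthetic user simulation is not a traditional persona; it is a governable behavior-testing environment. My approach is to build a persona registry, scenario library, journey simulator, edge-case injector, and evidence plane, calibrated with real telemetry, complaints, fraud cases, and QA samples. Its value is validating, before release, the behavioral boundaries of AI RAG, Copilot, and Agent capabilities on high-risk financial retail paths, such as KYC rejection explanations, scam payment warnings, collections language, and wealth suitability. It cannot replace real user research, but it can turn product assumptions, architecture controls, and release gates into evidence.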

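2 minutes

I position the synthetic user lab as a middle layer between product discovery and architecture validation. The first step is not to let an LLM play users freely, but to turn real evidence into governed assets: the persona registry records actor, behavior parameters, constraints, evidence links, and uncertainty; the scenario library records journey stage, trigger, risk, expected controls, and release gate.

The second step is journey simulation. Risk in financial retail lies not in a single-turn answer but in multi-turn paths: a customer whose onboarding documents fail and who then moves to a human agent, a scammer coaching a customer past warnings, an employee over-accepting Copilot output, an agent calling the wrong tool. The simulation engine must record persona action, AI response, retrieval, tool call, human approval, escalation, and outcome proxy.

The third step is calibration and governance. Every persona and scenario must carry an evidence level: expert hypothesis, interview support, telemetry support, or support from production outcomes fed back into the lab. A high-risk release gate cannot rely on polished LLM-generated conversations; it needs reproducible runs, failure analysis, bias/privacy controls, trace completeness, and residual risk.

Architecturally, it connects RAG source grounding, Agent permission boundaries, Copilot human adoption, Eval journey rubrics, and the Governance evidence packet. My core point is that synthetic simulation does not prove the product will succeed. It shows whether the team has systematically challenged its own assumptions and knows which paths still cannot go live.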

CTO version

I would make it the AI platform's pre-production behavior testbed, integrated with observability, eval, policy, identity, and release governance. Technically it needs five key design choices:

  1. Separate the simulator from the system under test, so that one model does not generate the user, run the system, and evaluate the result.
  2. Make every run versioned and replayable: scenario_id, persona_id, seed, model_id, prompt_version, policy_pack_version, retrieval index version, and tool permission scope must be recorded in the trace.
  3. Give personas and scenarios calibration metadata, raise confidence step by step using production telemetry, QA, complaints, and fraud outcomes, and state explicitly that simulation cannot replace real validation.
  4. Make edge-case injection cover business harm, not only prompt attacks: KYC false rejection, APP scam warning bypass, collections conduct, complaint escalation, payment dispute misclassification, wealth suitability.
  5. Base the release gate on the evidence packet, not on a demo: scenario coverage, failure severity, control effectiveness, trace completeness, privacy/bias review, residual risk, and monitoring plan.

I would require every high-impact AI use case to pass the synthetic lab before architecture review, but I would also keep the lab within governance boundaries: it is an evidence generator, not a compliance conclusion; it finds risk and does not automatically approve a release.


7-day Practice Plan

| Day | Focus | Practice output |
| --- | --- | --- |
| 1 | Pick one financial retail use case | Choose payment scam warning, KYC onboarding assistant, or complaint summarization copilot; write down the system boundary and top 5 behavioral assumptions |
| 2 | Build persona registry | Create 6 evidence-linked personas: customer, employee, scammer/adversary, reviewer, vulnerable customer context, operations manager |
| 3 | Build scenario library | Write 12 scenario cards covering happy path, edge path, control failure, channel switching, adversarial behavior |
| 4 | Design journey simulator | Draw the state machine: user action, AI response, tool call, human approval, escalation, outcome proxy |
| 5 | Define eval and evidence model | Design the rubric, trace schema, severity labels, pass/fail thresholds, and release evidence packet |
| 6 | Run calibration review | Use public cases, a complaint taxonomy, operational metric samples, or a self-built assumption log to label each scenario's calibration level |
| 7 | Prepare interview narrative | Use the 30-second, 2-minute, and CTO versions to explain why the lab is needed, how it prevents false evidence, and how it connects RAG/Agent/Copilot/Governance |

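Advanced practice criteria:

  • Every scenario can answer "which product or architecture decision does this simulation support?"
  • Every persona has an evidence confidence level rather than an invented story.
  • Every release recommendation states the remaining uncertainty.
  • Every high-risk failure becomes an item in the architecture backlog, control backlog, or product backlog.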

Source Anchors

| Source | Link | Use |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Use the Govern / Map / Measure / Manage structure to organize the synthetic lab's risk identification, measurement, treatment, and governance responsibility |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | For GenAI-specific risks such as hallucination, data leakage, misuse, over-reliance, content provenance, and evaluation |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | Use the AI management system view to design owner, policy, operation, performance evaluation, continuous improvement |
| ISO/IEC/IEEE 29148 | https://www.iso.org/standard/72089.html | Use requirements engineering to structure stakeholder need, scenario, assumption, validation criteria |
| ISO/IEC/IEEE 42010 | https://www.iso.org/standard/74393.html | Use architecture description / viewpoint thinking to connect business, data, application, technology, risk and governance views |
| Microsoft Guidelines for Human-AI Interaction | https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/ | Use human-AI interaction principles to design trust calibration, feedback, error recovery, and user control |
| OWASP LLM Top 10 | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Use the LLM application risk taxonomy to inject edge cases such as prompt injection, sensitive information disclosure, and excessive agency |
| OpenTelemetry docs | https://opentelemetry.io/docs/ | Use the traces, metrics, and logs observability model to design the simulation run evidence plane |

Portfolio Positioning

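The portfolio value of this note is that it demonstrates three capabilities:

| Capability | Demonstrated by |
| --- | --- |
| AI Product Thinking | Elevating synthetic users from an "interesting demo" to product discovery, assumption testing, and release evidence |
| Architecture Thinking | Designing persona, scenario, simulator, eval, trace, tool permission, RAG grounding, and governance gate as one system |
| Financial Retail Judgment | Demonstrating business understanding through high-risk paths such as KYC, fraud/scams, collections, complaints, contact center, wealth suitability, and payment disputes |

One-sentence positioning for interviews: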

I use synthetic user simulation as a governed behavior testbed: it stress-tests AI product and architecture assumptions before release, then recalibrates against production evidence after release.