proceed / limited pilot / redesign / stop recommendation
| Monthly after pilot | Recalibration review | telemetry drift, new scenario cards, retired assumptions |
| After incident or major complaint | Scenario backfill review | incident-derived scenarios and control regression tests |
Lifecycle
1. Define decision scope
2. Map journey and control points
3. Create persona registry entries
4. Create scenario cards
5. Label assumptions and calibration level
6. Run simulations against system under test
7. Capture traces, outputs, tool calls and human decisions
8. Evaluate against product, architecture, risk and evidence rubrics
9. Convert failures into backlog and control changes
10. Build release evidence packet
11. Recalibrate with pilot and production telemetry
Stage Gates
| Gate | Pass condition | Stop or limit condition |
| --- | --- | --- |
| Discovery gate | Top behavioral assumptions documented and linked to scenarios | Major assumptions are undocumented or unsupported |
| Architecture gate | System boundary, permissions, trace schema and human control paths are testable | Tool actions, retrieval sources or approval paths are unclear |
| Pilot gate | High-severity scenarios have acceptable controls and complete evidence | Critical failures remain unresolved or untraceable |
| Release gate | Evidence packet shows coverage, calibration, residual risk and monitoring plan | Simulation is uncalibrated, biased, privacy-risky or demo-only |
| Scale gate | Production telemetry confirms assumptions or shows managed drift | Complaints, overrides, losses, QA defects or user misuse exceed threshold |
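The gate logic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `GateReview`, `GateDecision` and `decide` are hypothetical names, and the rule that any observed stop condition overrides the pass conditions is an assumption about how the two columns interact.

```python
from dataclasses import dataclass
from enum import Enum


class GateDecision(Enum):
    PASS = "pass"
    STOP_OR_LIMIT = "stop_or_limit"


@dataclass(frozen=True)
class GateReview:
    """Reviewer findings for one stage gate (hypothetical structure)."""
    gate: str
    pass_conditions_met: dict[str, bool]       # condition text -> met?
    stop_conditions_observed: dict[str, bool]  # condition text -> observed?


def decide(review: GateReview) -> GateDecision:
    # Assumption: any observed stop condition wins over pass conditions.
    if any(review.stop_conditions_observed.values()):
        return GateDecision.STOP_OR_LIMIT
    # An empty review is not evidence, so it cannot pass.
    if review.pass_conditions_met and all(review.pass_conditions_met.values()):
        return GateDecision.PASS
    return GateDecision.STOP_OR_LIMIT


pilot = GateReview(
    gate="Pilot gate",
    pass_conditions_met={"high-severity scenarios have acceptable controls": True},
    stop_conditions_observed={"critical failures remain unresolved": True},
)
pilot_decision = decide(pilot)
```

Here the pilot gate returns a stop-or-limit decision even though its pass condition is met, because a critical failure is still open.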
Template: Persona Registry
Persona entries should be versioned assets. They represent behavior constraints and evidence, not decorative user stories.
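As a rough sketch of what "versioned asset" could mean in practice, the Python below models a persona entry as an immutable record whose revisions produce a new version rather than overwriting the old one. The field names (`persona_id`, `behavior_constraints`, `evidence_refs`) and the example values are illustrative assumptions, not the registry template itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PersonaEntry:
    """One persona registry entry (field names are assumptions)."""
    persona_id: str
    version: int
    behavior_constraints: tuple[str, ...]
    evidence_refs: tuple[str, ...]  # links to telemetry, complaints, QA, etc.

    def revise(self, **changes) -> "PersonaEntry":
        """Return a new entry with the version bumped; the old one is untouched."""
        return replace(self, version=self.version + 1, **changes)


v1 = PersonaEntry(
    persona_id="app-first-under-scam-pressure",  # hypothetical identifier
    version=1,
    behavior_constraints=("reluctant to reveal phone-call context",),
    evidence_refs=(),
)
# Attaching evidence creates version 2 and leaves version 1 available for replay.
v2 = v1.revise(evidence_refs=("complaint-review-placeholder",))
```

Keeping old versions intact matters because a replayed simulation should run against the persona as it was defined at the time of the original run.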
Which user segments or conditions are excluded from release?
Release constraints and monitoring triggers
BA Questions
| Question | Strong answer evidence |
| --- | --- |
| What is the work-as-done journey, not just the target process? | Journey map includes exceptions, channel switches, manual workarounds and control points |
| What business rules and policies must the AI respect? | Policy source inventory and scenario pass criteria |
| Which handoffs are stateful? | Trace includes prior attempts, case IDs, user disclosures and escalation status |
| What assumptions would change requirements if proven wrong? | Assumption log with decision impact |
| How are complaints, QA findings and incidents converted into regression scenarios? | Scenario backfill review and scenario lifecycle |
Architecture Questions
| Question | Strong answer evidence |
| --- | --- |
| Is the simulator separated from the system under test and evaluator? | Architecture diagram and model/service separation |
| Can every run be replayed? | run_id, seed, model, prompt, policy, retrieval index and tool version captured |
| What authority does the AI have? | tool scope matrix, approval policy and blocked action trace |
| How is RAG source freshness and jurisdiction controlled? | source registry, retrieval filters, index version and citation audit |
| How are privacy and sensitive data protected? | data minimization, masking, retention, access control and review records |
| How does simulation evidence flow into observability? | OpenTelemetry-style traces, metrics, logs and evidence store |
| What happens when production telemetry contradicts simulation? | recalibration workflow, release condition review and backlog trigger |
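The replay question can be made checkable with a small completeness test on each run trace. The sketch below assumes a trace is a flat dictionary; the key names are hypothetical renderings of the items listed in the replay row (run_id, seed, model, prompt, policy, retrieval index, tool version).

```python
# Keys a trace needs before a run can be replayed (names are assumptions).
REPLAY_FIELDS = (
    "run_id",
    "seed",
    "model_version",
    "prompt_version",
    "policy_version",
    "retrieval_index_version",
    "tool_version",
)


def missing_replay_fields(trace: dict) -> list[str]:
    """Return the replay fields that are absent or empty in a trace."""
    # A seed of 0 is valid, so only None and "" count as missing.
    return [name for name in REPLAY_FIELDS if trace.get(name) in (None, "")]


trace = {
    "run_id": "run-0001",  # illustrative values only
    "seed": 0,
    "model_version": "m-1",
    "prompt_version": "p-3",
    "policy_version": "",
    "retrieval_index_version": "idx-7",
}
gaps = missing_replay_fields(trace)
```

A run with a non-empty `gaps` list would be flagged as not replayable; in this example the policy version is blank and the tool version was never recorded.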
Release Checklist
Discovery Readiness
Use case boundary states what AI drafts, retrieves, recommends, routes or executes.
Work-as-done journey includes exception paths and channel switching.
Top behavioral assumptions are logged with owner and evidence level.
Persona registry entries are behavior-based and avoid demographic stereotypes.
Scenario cards cover high-risk financial retail paths, not only the happy path.
Architecture Readiness
System under test is separated from simulator and evaluator.
RAG source registry, index version and citation requirements are defined.
Agent tool scope, approval, rollback and blocked-action behavior are defined.
Copilot human actions capture accept, edit, reject, ignore, override and escalate.
Simulation run trace captures model, prompt, policy, retrieval and tool versions.
Evidence plane supports replay, reviewer drilldown and retention controls.
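One way to make agent tool scope and blocked-action behavior testable is a default-deny scope matrix that records every decision in the trace. The sketch below is an assumption-laden illustration: the tool names, the two scope levels and the trace shape are all hypothetical.

```python
# Tool scope matrix: tool name -> scope level. Anything absent is blocked.
# Tool names and scope levels are hypothetical examples.
TOOL_SCOPE = {
    "lookup_balance": "allow",
    "issue_refund": "needs_approval",
}


def authorize(tool: str, approved: bool, trace: list[dict]) -> bool:
    """Decide whether a tool call may run and append the decision to the trace."""
    rule = TOOL_SCOPE.get(tool)
    if rule == "allow" or (rule == "needs_approval" and approved):
        trace.append({"tool": tool, "outcome": "executed"})
        return True
    reason = "approval_missing" if rule == "needs_approval" else "out_of_scope"
    trace.append({"tool": tool, "outcome": "blocked", "reason": reason})
    return False


trace: list[dict] = []
authorize("lookup_balance", approved=False, trace=trace)
authorize("issue_refund", approved=False, trace=trace)
authorize("close_account", approved=True, trace=trace)  # not in the matrix
```

Blocked calls are logged with a reason rather than silently dropped, so a reviewer can tell an out-of-scope attempt from a missing approval.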
Risk and Control Readiness
Critical customer-harm scenarios have pass/fail thresholds.
Bias, privacy and sensitive-data controls are reviewed.
Prompt injection, data leakage and excessive agency scenarios are included where relevant.
Human control is meaningful, not merely a UI label.
Residual risks have owner, expiry and monitoring trigger.
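The last checklist item, that residual risks carry an owner, an expiry and a monitoring trigger, lends itself to an automated register check. The sketch below uses hypothetical field names and treats a risk as needing review when any of the three is missing or the expiry date has been reached.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResidualRisk:
    """One accepted residual risk (field names are assumptions)."""
    risk_id: str
    owner: str
    expiry: date
    monitoring_trigger: str


def needs_review(risk: ResidualRisk, today: date) -> bool:
    """True if the acceptance is incomplete or has reached its expiry date."""
    incomplete = not risk.owner or not risk.monitoring_trigger
    return incomplete or risk.expiry <= today


risk = ResidualRisk(
    risk_id="RR-1",  # illustrative values only
    owner="Head of Payments",
    expiry=date(2025, 6, 30),
    monitoring_trigger="override rate above agreed threshold",
)
before_expiry = needs_review(risk, today=date(2025, 1, 15))
after_expiry = needs_review(risk, today=date(2025, 7, 1))
```

Passing `today` explicitly keeps the check deterministic, which is useful when the register itself is part of replayable release evidence.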
Release Evidence Readiness
Release packet includes scenario coverage, failures, mitigations and remaining uncertainty.
High-severity failures are resolved, scoped out or accepted by accountable owner.
Simulation evidence is labeled by calibration level.
Production monitoring will measure assumptions, outcomes, overrides, complaints and incidents.
Governance sign-off covers business, architecture, risk, compliance, privacy and security.
Executive Narrative
One-page Narrative
We are using synthetic user simulation because the AI product will operate in financial retail journeys where customer behavior, employee behavior and adversarial behavior materially affect risk. Traditional UAT proves that screens and APIs function. Model eval proves that a model can answer selected prompts. Neither is enough to prove that a customer under scam pressure, a frustrated KYC applicant, a collections customer in hardship, a new contact center agent or a wealth advisor near a suitability boundary will interact with AI safely.
The lab gives us a governed behavior testbed. Personas are evidence-linked and versioned. Scenarios are tied to release decisions. Simulations are replayable and produce traces for prompts, retrieval, tool calls, policy decisions, human approvals and outcomes. Results are calibrated against real telemetry, complaints, QA reviews and fraud or dispute outcomes. Failures become product, architecture or control backlog, not discarded demo artifacts.
The executive decision is not “synthetic users say the product is safe.” The decision is:
Within this release scope, we have tested the most material behavior and control assumptions,
we know which evidence is strong or weak, we have mitigated critical failures,
and we have a production telemetry plan to recalibrate assumptions after launch.
CTO / CRO / COO Translation
| Stakeholder | Message |
| --- | --- |
| CTO | The lab validates architecture boundaries before release: RAG grounding, tool permission, observability, replayability and rollback. |
| CRO | The lab exposes customer harm and control failures early, with residual risk ownership and monitoring triggers. |
| COO | The lab tests work-as-done: employee adoption, handoff, queue impact, QA defects and exception handling. |
| CPO / Product Head | The lab turns product assumptions into testable scenarios and gives a disciplined way to decide pilot scope. |
| Internal Audit | The lab produces reviewable evidence: versioned scenarios, run traces, evaluator decisions, sign-offs and recalibration records. |
Interview Drills
Drill 1: Explain the Lab in 60 Seconds
Strong answer:
I treat synthetic user simulation as a governed behavior testbed, not as decorative personas.
For a financial retail AI use case, I create evidence-linked personas, scenario cards,
a journey simulator, edge-case injection, eval rubrics and trace evidence.
The goal is to test product and architecture assumptions before exposing real customers:
will RAG cite the right policy, will an agent stay within tool authority,
will a Copilot create over-reliance, will high-risk cases escalate?
Every simulation is labeled by calibration level and must be recalibrated with pilot telemetry.
Drill 2: Defend Against “Synthetic Users Are Fake”
Strong answer:
They are fake if used as proof of real-world behavior. They are useful if used as controlled assumption tests.
I would never claim synthetic users prove adoption or loss reduction. I use them to find failure modes,
stress architecture boundaries and create release evidence before real exposure.
The discipline is calibration: each persona and scenario must link to telemetry, complaints, QA, case reviews
or be labeled exploratory. Release gates distinguish simulation evidence from production evidence.
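The calibration rule in this answer can be expressed as a simple labeling function. This is a sketch under stated assumptions: the two labels and the source names are illustrative, and a real calibration scheme would likely have more than two levels.

```python
# Evidence types named in the answer above; the string keys are assumptions.
CALIBRATION_SOURCES = {"telemetry", "complaints", "qa", "case_review"}


def calibration_label(evidence_sources: list[str]) -> str:
    """Label a persona or scenario by whether it links to recognized evidence."""
    linked = CALIBRATION_SOURCES & set(evidence_sources)
    return "evidence-linked" if linked else "exploratory"


scam_scenario = calibration_label(["complaints", "workshop_notes"])
new_idea = calibration_label(["workshop_notes"])
```

Anything that cannot point to at least one recognized evidence source defaults to exploratory, so the burden of proof sits with the persona or scenario author.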
Drill 3: Apply to Payment Scam Warning
Strong answer structure:
| Part | Answer |
| --- | --- |
| Persona | App-first customer under social-engineering pressure, reluctant to reveal phone-call context |
| Scenario | First-time high-value instant payment, scammer coaches customer to ignore warnings |
Use abstraction, masking, synthetic reconstruction, access control, retention policy and privacy review before scenario promotion.
How do you know when a simulation is good enough for release?
It is never enough alone. It must meet scenario coverage, trace completeness, high-severity pass criteria, calibration level and production monitoring readiness.
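The five conditions in this answer can be checked together so that a release decision always lists what is still missing. The sketch below models each condition as a boolean set by reviewers; the class and field names are hypothetical, and the thresholds behind each boolean are deliberately left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseEvidence:
    """Reviewer verdicts on the five conditions (field names are assumptions)."""
    scenario_coverage_met: bool
    traces_complete: bool
    high_severity_criteria_passed: bool
    calibration_level_labeled: bool
    production_monitoring_ready: bool


def unmet_criteria(evidence: ReleaseEvidence) -> list[str]:
    """Return the names of conditions that are not yet satisfied."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


evidence = ReleaseEvidence(
    scenario_coverage_met=True,
    traces_complete=True,
    high_severity_criteria_passed=True,
    calibration_level_labeled=False,
    production_monitoring_ready=False,
)
blockers = unmet_criteria(evidence)
```

An empty `blockers` list means the simulation evidence is sufficient to enter the release discussion, not that the release is approved.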