Source file: abpa/capstone-aml/AML_ABPA_10_DAY_STARTER.md (326 lines)

AML Copilot ABPA 10-Day Starter

This is the first executable ABPA capstone package. It converts existing AML Copilot work into BA / PM / Architect decision assets.

0. Evidence Already Available

Evidence	Repository source	What it proves
One-page PRD	`docs/AML_COPILOT_PRD.md`	Existing product framing, JTBD, MVP scope, eval-first success metrics
Domain type contract	`src/aml/types.ts`	Stable AML case, party, account, transaction, typology, SAR, HITL data model
Synthetic golden dataset v1	`src/aml/__tests__/aml.test.ts`	66 cases with labels: structuring 18, layering 15, mule_network 15, normal 18
Rule baseline	`src/aml/evalBaseline.ts`	Deterministic recall/FPR measurement against synthetic labels
Failure taxonomy	`src/aml/failureTaxonomy.ts`	Six stable failure classes for eval, triage, and monitoring
Code checks	`src/aml/__tests__/p1evals.test.ts`	SAR structure, citation existence, formatting, top typology consistency checks
External-ground-truth evaluator	`src/aml/groundTruthEval.ts`	Future path for non-circular eval with public labeled datasets
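
The recall/FPR measurement attributed to the rule baseline can be sketched as follows. This is a minimal illustration, not the code in `src/aml/evalBaseline.ts`: the `LabeledPrediction` shape, the `flaggedSuspicious` field, and the function name are assumptions made for the example. Only the four label names come from the dataset description above.

```typescript
// Hypothetical shapes for illustration; not taken from src/aml/types.ts.
type Typology = "structuring" | "layering" | "mule_network" | "normal";

interface LabeledPrediction {
  caseId: string;
  label: Typology;            // ground-truth label from the dataset
  flaggedSuspicious: boolean; // decision made by the rule baseline
}

interface BinaryMetrics {
  recall: number;            // flagged suspicious cases / all truly suspicious cases
  falsePositiveRate: number; // flagged normal cases / all truly normal cases
}

// Collapses the three suspicious typologies into one positive class and
// treats "normal" as the negative class.
function suspiciousVsNormalMetrics(rows: LabeledPrediction[]): BinaryMetrics {
  let tp = 0;
  let fn = 0;
  let fp = 0;
  let tn = 0;
  for (const row of rows) {
    const trulySuspicious = row.label !== "normal";
    if (trulySuspicious && row.flaggedSuspicious) tp++;
    else if (trulySuspicious) fn++;
    else if (row.flaggedSuspicious) fp++;
    else tn++;
  }
  // An empty class yields 0 rather than NaN; callers should treat that as "no data".
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const falsePositiveRate = fp + tn === 0 ? 0 : fp / (fp + tn);
  return { recall, falsePositiveRate };
}
```

Because the labels are synthetic, numbers produced this way describe the rule baseline against its own dataset only; the external-ground-truth path is what would make the comparison non-circular.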

1. 10-Day Output Map

Day	ABPA asset	AML-specific output
1	Scenario choice	AML Investigation Copilot confirmed as capstone
2	AI Opportunity Canvas	Filled v0.1 below
3	Stakeholder list	Frontline analyst, compliance officer, AML ops lead, data owner, model risk, security / privacy, audit, IT, sponsor
4	Interview cards	Evidence-oriented questions for each group
5	AS-IS workflow	Current alert-to-SAR process
6	Pain metrics	Baseline metrics to collect before pilot
7	Requirement draft	12 candidate requirements
8	Requirements-to-Eval	Requirements mapped to code checks, human review, future LLM judge
9	AI Control Pack	Top controls for hallucination, privacy, over-reliance, stale knowledge, audit
10	Executive memo	First go/no-go recommendation

2. AI Opportunity Canvas v0.1

Scenario

Field	Answer
Business domain	Financial crime compliance / AML investigation
Process	Alert triage -> evidence gathering -> typology assessment -> SAR draft -> HITL review -> audit
Primary user	AML investigator
Secondary user	Compliance officer / AML quality reviewer
Sponsor	Head of Financial Crime Operations
Current systems	Transaction monitoring, core banking, KYC, case management, sanctions screening
Current decision owner	AML investigator for case recommendation; compliance officer for SAR submission approval

Business Problem

Question	Answer
What is painful today?	Investigators manually gather evidence across systems, compare activity against typologies, and write SAR narratives from scratch.
Who feels it?	Frontline AML investigators, AML operations leads, compliance quality reviewers, and ultimately regulatory reporting owners.
How often does it happen?	Every alert that survives first-level monitoring; exact production volume must be collected in discovery.
Current workaround	Manual cross-system search, spreadsheet notes, copied transaction evidence, rule-of-thumb typology judgment, manual narrative writing.
Consequence of doing nothing	Backlog, inconsistent investigation quality, slow SAR preparation, weak evidence traceability, staff fatigue, and regulatory exposure.

Baseline Metrics to Collect

Metric	Current repository evidence	Production evidence needed
Case volume	Not in repo	Alerts/day, investigations/day, SAR drafts/month
Average handling time	PRD assumes manual work can be hours per case; no internal log yet	Time from alert assignment to disposition
Evidence retrieval time	Not measured	Time spent collecting account, party, transaction, counterparty, prior SAR, sanctions evidence
Typology accuracy	Synthetic rule baseline exists	Reviewer-confirmed typology labels on historical cases
SAR quality	SAR template + code checks exist	QA rubric, returned SAR rate, reviewer edits
False positive burden	Synthetic normal FPR is measured	Production false positive rate and closure reasons
Cost per case	Not measured	Analyst cost, system cost, model cost, reviewer cost
Audit completeness	Audit trail code exists	Missing evidence / missing reviewer action rate

AI Fit

Dimension	Assessment	Reason
Language-heavy work	High	SAR narratives, investigation notes, policy explanations
Knowledge retrieval	High	Evidence and typology matching depend on multiple sources
Judgment complexity	Medium-high	Classification and escalation need business judgment and controls
Structured data availability	Medium	Repo has transaction-like structure; production data readiness unknown
Labeled data availability	Medium-low	Synthetic labels exist; external/public labels or internal reviewed cases are needed
Auditability requirement	High	Every claim in SAR must cite real evidence
Error downside	High	Hallucinated evidence or missed suspicious activity is unacceptable
Human review feasibility	High	AML already has review / sign-off workflows

Candidate AI Roles

Role	Fit	Notes
Summarize case evidence	Yes	Strong fit if source-bound and citation-required
Classify typology	Yes, controlled	Must compare against deterministic baseline and human labels
Retrieve evidence	Yes	Needs freshness, identity, permissions, source-of-truth mapping
Draft SAR narrative	Yes, with HITL	Never auto-submit
Recommend next action	Yes, advisory only	Must show reason and evidence
Execute regulatory submission	No for MVP	Human approval and external filing controls required
Monitor model drift	Yes	Production quality monitoring required after pilot

No-AI Boundaries

Boundary	Reason	Safer alternative
Automatically filing SARs	Regulatory accountability and high downside of false evidence	HITL approval + submission outside MVP
Overriding investigator judgment	Human accountability remains required	AI recommendation with override reason
Using uncited model-generated evidence	Hallucination risk	Source-bound evidence retrieval + citation existence checks
Deciding customer account closure	High-impact decision with legal/customer risk	Decision support only; separate governance process

Initial Success Metrics

Category	Metric	Starter target	Guardrail
Efficiency	Evidence gathering time	-30% in pilot	No quality metric degradation
Quality	SAR citation validity	100% valid cited transaction IDs	Block if invalid citation appears
Typology	Suspicious vs normal recall	Must beat rule baseline on external labels before replacement	FPR within agreed threshold
Risk	Hallucination critical failures	0 in reviewed sample	Any critical hallucination stops rollout
Adoption	Repeat usage by analysts	>=60% weekly active among pilot users	Override reasons reviewed
Cost	Model/tool cost per case	Below the value of analyst time saved per case	Budget alerts

Recommendation

Field	Answer
Decision	Fund discovery + controlled prototype, not production rollout
First 10 days	Complete ABPA artifact pack and align stakeholders
First 30 days	Add a production-like sample and reviewer rubric, and expand the requirements-to-eval matrix
Stop rule	Stop if evidence cannot be cited, labels cannot be obtained, or compliance refuses HITL operating model
Open questions	Production data availability, regulatory reporting ownership, historical case label access, model/provider policy, audit retention

3. Stakeholder Evidence Map v0.1

Stakeholder	Success definition	Main concern	Evidence needed
AML investigator	Less manual evidence work, faster high-quality drafts	AI creates more review work or misses nuance	Prototype task test, time-on-task, override reasons
AML operations lead	Higher throughput without quality regression	Backlog moves from analysts to reviewers	Throughput, review load, escalation rate
Compliance officer	SAR narratives are accurate, complete, traceable	Unsupported statements and regulatory criticism	Citation validity, audit trail, reviewer sign-off
Model risk / AI governance	Model behavior is evaluated and controlled	Unvalidated model in high-risk process	Eval suite, model card, change control
Data owner	Data use is authorized and understood	PII leakage, stale data, source ambiguity	Data inventory, access controls, retention policy
IT / platform	Stable integration and supportable runtime	Unowned scripts, key leakage, unreliable tooling	Architecture ADR, runbook, observability
Security / privacy	Least privilege and no uncontrolled data export	Sensitive data in prompts or external vendor	Data classification, redaction, provider review
Internal audit	Evidence and decisions are reconstructable	Missing logs or undocumented overrides	Immutable audit trail, RACI, control register
Executive sponsor	Material efficiency or risk improvement	Demo without ROI	Business case, baseline, pilot decision memo

Interview Questions by Stakeholder

AML Investigator

Which part of evidence gathering takes the most time?
Which data sources do you trust most and least?
What makes a SAR draft unusable?
What kind of AI suggestion would you override immediately?
What evidence must be visible before you trust a typology recommendation?
What shortcuts do analysts use today when backlog is high?
What is the most common reason a case gets returned for rework?
What should never be hidden behind an AI summary?

Compliance Officer

What minimum evidence is required before a SAR narrative is acceptable?
Which statement types are most dangerous if hallucinated?
What audit fields must exist for AI-assisted work?
What should trigger mandatory senior review?
How should AI-generated content be labeled?
What would make you block the pilot?
What sample size would you need for quality review?
Which regulatory expectations are non-negotiable?

Data / IT / Security

Which systems are source of truth for party, account, transaction, KYC, sanctions, and case status?
Which fields are PII or restricted?
What data can leave the bank boundary?
What identity and authorization model is required?
What logs can be retained, and for how long?
What availability and incident response expectations apply?
How will model/provider outages be handled?
What integration path is realistic for a pilot?

4. AS-IS Workflow v0.1

```mermaid
flowchart TD
  A[Transaction monitoring alert created] --> B[Alert assigned to AML investigator]
  B --> C[Open case and review alert reason]
  C --> D[Gather customer/KYC/account evidence]
  D --> E[Gather transaction and counterparty evidence]
  E --> F[Compare activity with typologies]
  F --> G{Suspicious enough?}
  G -->|No| H[Close as false positive with rationale]
  G -->|Yes| I[Draft SAR narrative]
  I --> J[Compliance / senior review]
  J --> K{Approved?}
  K -->|No| L[Return for rework]
  L --> D
  K -->|Yes| M[Submit / archive evidence]
```

Pain Metrics to Collect

Step	Pain hypothesis	Metric
Evidence gathering	Cross-system switching drives most effort	minutes spent per source
Typology comparison	Analyst judgment varies by experience	reviewer disagreement rate
SAR drafting	Narrative quality is inconsistent	returned draft rate, edit distance
Review	Reviewers spend time checking evidence existence	missing / invalid citation rate
Closure	False positives still require evidence	closure time for normal cases
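
The "edit distance" metric for SAR drafting can be made concrete as a word-level Levenshtein distance between a draft and the reviewer-approved final text. The sketch below is one possible definition, assuming whitespace tokenization and normalization by the longer text; the function names are illustrative and do not come from the repository.

```typescript
// Splits on whitespace; an empty or blank string yields no tokens.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((token) => token.length > 0);
}

// Classic Levenshtein distance over tokens, keeping only two rows of the table.
function editDistance(a: string[], b: string[]): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,                    // delete a token from the draft
        curr[j - 1] + 1,                // insert a token into the draft
        prev[j - 1] + substitutionCost, // keep or replace a token
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// 0 means the reviewer changed nothing; 1 means the draft was fully rewritten.
function normalizedEditDistance(draft: string, finalText: string): number {
  const a = tokenize(draft);
  const b = tokenize(finalText);
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : editDistance(a, b) / longest;
}
```

A word-level measure is a deliberate simplification: it counts a changed amount or date as one edit, which may understate the severity of a factual correction, so it is best read alongside the returned draft rate rather than on its own.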

TO-BE Hypothesis

```mermaid
flowchart TD
  A[Alert assigned] --> B[Copilot gathers evidence]
  B --> C[Copilot scores typology with cited evidence]
  C --> D{Confidence and risk gate}
  D -->|Low confidence/high risk| E[Mandatory human review]
  D -->|Normal advisory| F[Analyst reviews closure rationale]
  D -->|Suspicious advisory| G[Copilot drafts source-bound SAR]
  F --> H[Analyst disposition]
  G --> I[Analyst edits and submits for compliance review]
  E --> I
  I --> J[Audit trail export]
```
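
The "Confidence and risk gate" in the TO-BE hypothesis can be sketched as a small routing function. This is an illustration under stated assumptions: the `TypologyAssessment` shape, the `GateRoute` names, and the 0.8 default threshold are hypothetical, and the real threshold would need to be agreed with compliance.

```typescript
// Hypothetical route names mirroring the three branches of the gate.
type GateRoute =
  | "mandatory_human_review"
  | "analyst_closure_review"
  | "draft_source_bound_sar";

// Hypothetical assessment shape; not taken from src/aml/types.ts.
interface TypologyAssessment {
  suspicious: boolean; // copilot's advisory classification
  confidence: number;  // expected range 0..1
  highRisk: boolean;   // set by a separate risk rule
}

// Placeholder value for the example only.
const DEFAULT_MIN_CONFIDENCE = 0.8;

function routeAssessment(
  assessment: TypologyAssessment,
  minConfidence: number = DEFAULT_MIN_CONFIDENCE,
): GateRoute {
  // Fail closed: a missing or NaN confidence also goes to mandatory review.
  const confidentEnough = assessment.confidence >= minConfidence;
  if (assessment.highRisk || !confidentEnough) {
    return "mandatory_human_review";
  }
  return assessment.suspicious
    ? "draft_source_bound_sar"
    : "analyst_closure_review";
}
```

Checking risk and confidence before the advisory label means a high-risk case can never reach the closure path on the strength of a confident "normal" classification.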

5. Requirements-to-Eval Matrix v0.1

Req ID	Requirement	Eval data	Grader	Threshold	Production signal	Owner
R-001	SAR draft must cite only transaction IDs that exist in the case	Synthetic v1/v1.1 + future historical samples	code check	100%	invalid citation rate	Compliance Ops
R-002	SAR draft must include all 5W1H narrative sections (who, what, when, where, why, how)	Synthetic cases	code check	>=99%	incomplete draft rate	Product / Compliance
R-003	Typology recommendation must expose evidence and rule hits	Synthetic labeled cases	code check + human review	100% evidence visible	hidden reasoning complaints	Product
R-004	Suspicious-vs-normal classification must beat rule baseline before replacing it	External/public labeled set + internal reviewed cases	metric eval	statistically significant improvement over rule baseline	recall / precision / FPR	Model risk
R-005	Normal case false positives must remain within approved threshold	Historical normal cases	metric eval	threshold TBD by compliance	false positive rate	AML Ops
R-006	AI-generated content must be labeled as AI-assisted	UI / content inspection	code / UX review	100%	unlabeled AI content count	Compliance
R-007	Analyst must be able to override recommendation with reason	Prototype / workflow test	usability test	100% task success	override rate + reason distribution	Product
R-008	High-risk or low-confidence outputs must require human review	Scenario set	code check + human review	100% routed	count of high-risk outputs completed without human review	Risk
R-009	Every material action must be audit logged	Audit trail tests	code check	100%	missing audit event count	IT / Audit
R-010	Prompt injection in source text must not cause unsupported SAR claims	Red-team set	adversarial eval	0 critical failures	injection block / miss rate	Security
R-011	Model/provider outage must degrade to workflow baseline	Failure injection	integration test	no data loss	fallback activation rate	Platform
R-012	Cost per case must stay below approved pilot budget	Trace/cost logs	metric	budget TBD	$/case	Sponsor
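
As an example of how a requirement becomes a code check, R-001 can be graded with a set-membership test. The sketch below is an assumption about how such a grader might look, not the check in `src/aml/__tests__/p1evals.test.ts`. The name `citedTxIds` follows the control pack, while `caseTxIds`, `CitationCheckResult`, and the function name are hypothetical.

```typescript
interface CitationCheckResult {
  valid: boolean;          // true when every cited ID exists in the case
  invalidTxIds: string[];  // distinct cited IDs that the case does not contain
}

// Grades one SAR draft: each cited transaction ID must belong to the case.
function checkCitedTxIds(
  citedTxIds: string[],
  caseTxIds: string[],
): CitationCheckResult {
  const known = new Set(caseTxIds);
  const unknown = citedTxIds.filter((id) => !known.has(id));
  // De-duplicate so a repeated bad citation is reported once.
  const invalidTxIds = Array.from(new Set(unknown));
  return { valid: invalidTxIds.length === 0, invalidTxIds };
}
```

A draft with no citations passes this check vacuously, so it only covers "cites only existing IDs"; whether a draft cites enough evidence would need a separate requirement and grader.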

6. AI Control Pack v0.1

Control ID	Risk	Preventive control	Detective control	Corrective control	Evidence	Owner
C-001	Hallucinated transaction evidence	Source-bound citation requirement; no free-form evidence IDs	citedTxIds existence check	Block draft and route to analyst	invalid citation report	Compliance Ops
C-002	Unsupported typology conclusion	Show rule hits / evidence; require confidence and rationale	reviewer disagreement tracking	Recalibrate threshold / labels	typology eval report	Model risk
C-003	Context pollution	Restrict evidence to case-linked accounts and relevant channels	benign noise citation check	Remove noisy evidence and review prompt/tool design	failure taxonomy label	Product
C-004	Prompt injection	Treat source text as untrusted; tool output isolation	red-team injection suite	patch prompts/tools and rerun suite	attack transcript	Security
C-005	Privacy leakage	Classify and minimize fields before model call	prompt/trace PII scan	redact and incident review	PII scan report	Data owner
C-006	Over-reliance	UI labels AI as advisory; require override and review controls	override / blind acceptance rate	retrain users and adjust UX	adoption dashboard	AML Ops
C-007	Stale knowledge	Version policies and typology library	freshness check	refresh knowledge base	knowledge version log	Compliance
C-008	Audit failure	Append-only audit event chain	missing event check	reconstruct case and file incident	audit export	Internal audit
C-009	Provider outage	fallback to rules/workflow baseline	timeout/error monitoring	switch provider or manual queue	runtime trace	IT
C-010	Cost runaway	budget cap / model routing policy	$/case monitoring	throttle or downgrade model	cost dashboard	Sponsor
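
The append-only audit event chain in C-008 can be illustrated with a hash chain, where each event stores the hash of its predecessor so that later tampering is detectable. This is a minimal sketch, not the repository's audit trail code: the event fields and function names are hypothetical, and the FNV-1a hash is a non-cryptographic stand-in used only to keep the example dependency-free. A real control would need a cryptographic hash such as SHA-256.

```typescript
// Hypothetical event shape for illustration.
interface AuditEvent {
  caseId: string;
  actor: string;  // user or system component
  action: string; // e.g. "draft_generated", "override", "approved"
  at: string;     // ISO 8601 timestamp
}

interface ChainedEvent extends AuditEvent {
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "00000000";

// FNV-1a 32-bit over UTF-16 code units. NOT cryptographic; illustration only.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

function hashEvent(e: AuditEvent, prevHash: string): string {
  return fnv1a([prevHash, e.caseId, e.actor, e.action, e.at].join("|"));
}

// Returns a new chain; the input chain is never mutated.
function appendEvent(chain: readonly ChainedEvent[], e: AuditEvent): ChainedEvent[] {
  const prevHash = chain.length === 0 ? GENESIS_HASH : chain[chain.length - 1].hash;
  return [...chain, { ...e, prevHash, hash: hashEvent(e, prevHash) }];
}

// Returns the index of the first event that fails verification, or -1 if intact.
function firstBrokenIndex(chain: readonly ChainedEvent[]): number {
  let prev = GENESIS_HASH;
  for (let i = 0; i < chain.length; i++) {
    const ev = chain[i];
    if (ev.prevHash !== prev || ev.hash !== hashEvent(ev, prev)) return i;
    prev = ev.hash;
  }
  return -1;
}
```

The detective control "missing event check" maps to `firstBrokenIndex`: an edited or removed event changes a hash that later events depend on, so verification fails at the point of tampering.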

7. Executive Decision Memo v0.1

Decision Requested

Fund a 30-day discovery and controlled prototype hardening phase for AML Investigation Copilot. Do not approve production rollout yet.

Why Now

The repository already has a functioning AML domain model, synthetic labeled dataset, rule baseline, failure taxonomy, SAR draft generator, and eval scaffolding.
The business workflow is high-value: AML investigations are evidence-heavy, language-heavy, regulated, and audit-sensitive.
The current prototype is honest about its limits: rule-template SAR, synthetic data, no production PII, no automatic filing.

Options

Option	Pros	Cons	Recommendation
Do nothing	No implementation risk	Backlog and manual evidence burden remain	No
Workflow-only improvements	Lower AI risk, easier audit	Does not address evidence synthesis / narrative drafting fully	Use as fallback baseline
AI-assisted copilot	Strong fit for evidence summarization, typology support, SAR draft	Needs controls, labels, and adoption design	Yes: discovery + controlled prototype
Fully agentic automation	Potential high efficiency	Too risky for SAR filing and high-impact decisions	No for MVP

First 30 Days

Validate production baseline metrics.
Confirm data source and access model.
Build stakeholder evidence map and workflow baseline.
Convert top requirements into evals.
Add human review and audit control pack.
Design pilot scope and stop rules.

Success Gate

Proceed to pilot only if:

cited evidence validity is 100% on test set,
compliance approves HITL operating model,
historical or external labels are available for non-circular eval,
analyst usability test shows task success without trust confusion,
control owners accept RACI.
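
The five gate conditions above can be recorded as a checklist that reports which ones are still unmet. The sketch below is illustrative only: the `PilotGateEvidence` field names and both function names are hypothetical, and each boolean stands for a sign-off that people, not code, would provide.

```typescript
// Hypothetical evidence record; field names are illustrative.
interface PilotGateEvidence {
  citedEvidenceValidity: number;        // fraction of valid citations on the test set, 0..1
  hitlOperatingModelApproved: boolean;  // compliance sign-off
  nonCircularLabelsAvailable: boolean;  // historical or external labels exist
  usabilityTestPassed: boolean;         // task success without trust confusion
  raciAcceptedByControlOwners: boolean;
}

// Lists every unmet condition so the decision memo can name the blockers.
function unmetGateConditions(e: PilotGateEvidence): string[] {
  const unmet: string[] = [];
  if (e.citedEvidenceValidity !== 1) unmet.push("cited evidence validity is not 100%");
  if (!e.hitlOperatingModelApproved) unmet.push("HITL operating model not approved");
  if (!e.nonCircularLabelsAvailable) unmet.push("no historical or external labels");
  if (!e.usabilityTestPassed) unmet.push("usability test not passed");
  if (!e.raciAcceptedByControlOwners) unmet.push("RACI not accepted");
  return unmet;
}

function canProceedToPilot(e: PilotGateEvidence): boolean {
  return unmetGateConditions(e).length === 0;
}
```

Returning the list of blockers rather than a single boolean keeps the go/no-go recommendation explainable to the sponsor.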

Stop Rule

Stop or revert to workflow-only if:

the system cannot cite evidence reliably,
labeled data cannot be obtained,
reviewers cannot explain AI-assisted recommendations,
critical hallucination appears in controlled tests,
data/privacy review blocks model use.

8. Next Work Items

Priority	Work item	Output
P0	Complete ABPA stakeholder interviews	Interview evidence notes
P0	Convert AS-IS workflow to formal BPMN	BPMN + pain metrics
P0	Expand requirements-to-eval matrix	20+ requirements
P1	Add data readiness pack	source inventory + PII classification
P1	Add usability prototype test plan	5-person think-aloud script
P1	Add build-vs-buy matrix	self-built vs vendor vs hybrid
P2	Link public/external labeled AML dataset path	external-ground-truth eval plan