AML ABPA 10-Day Kickoff

326abpa/capstone-aml/AML_ABPA_10_DAY_STARTER.md

AML Copilot ABPA 10-Day Starter

This is the first executable ABPA capstone package. It converts existing AML Copilot work into BA / PM / Architect decision assets.

0. Evidence Already Available

| Evidence | Repository source | What it proves |
| --- | --- | --- |
| One-page PRD | `docs/AML_COPILOT_PRD.md` | Existing product framing, JTBD, MVP scope, eval-first success metrics |
| Domain type contract | `src/aml/types.ts` | Stable AML case, party, account, transaction, typology, SAR, HITL data model |
| Synthetic golden dataset v1 | `src/aml/__tests__/aml.test.ts` | 66 cases with labels: structuring 18, layering 15, mule_network 15, normal 18 |
| Rule baseline | `src/aml/evalBaseline.ts` | Deterministic recall/FPR measurement against synthetic labels |
| Failure taxonomy | `src/aml/failureTaxonomy.ts` | Six stable failure classes for eval, triage, and monitoring |
| Code checks | `src/aml/__tests__/p1evals.test.ts` | SAR structure, citation existence, formatting, top typology consistency checks |
| External-ground-truth evaluator | `src/aml/groundTruthEval.ts` | Future path for non-circular eval with public labeled datasets |

1. 10-Day Output Map

| Day | ABPA asset | AML-specific output |
| --- | --- | --- |
| 1 | Scenario choice | AML Investigation Copilot confirmed as capstone |
| 2 | AI Opportunity Canvas | Filled v0.1 below |
| 3 | Stakeholder list | Frontline analyst, compliance officer, AML ops lead, data owner, model risk, audit, IT, sponsor |
| 4 | Interview cards | Evidence-oriented questions for each group |
| 5 | AS-IS workflow | Current alert-to-SAR process |
| 6 | Pain metrics | Baseline metrics to collect before pilot |
| 7 | Requirement draft | 12 candidate requirements |
| 8 | Requirements-to-Eval | Requirements mapped to code checks, human review, future LLM judge |
| 9 | AI Control Pack | Top controls for hallucination, privacy, over-reliance, stale knowledge, audit |
| 10 | Executive memo | First go/no-go recommendation |

2. AI Opportunity Canvas v0.1

Scenario

| Field | Answer |
| --- | --- |
| Business domain | Financial crime compliance / AML investigation |
| Process | Alert triage -> evidence gathering -> typology assessment -> SAR draft -> HITL review -> audit |
| Primary user | AML investigator |
| Secondary user | Compliance officer / AML quality reviewer |
| Sponsor | Head of Financial Crime Operations |
| Current systems | Transaction monitoring, core banking, KYC, case management, sanctions screening |
| Current decision owner | AML investigator for case recommendation; compliance officer for SAR submission approval |

Business Problem

| Question | Answer |
| --- | --- |
| What is painful today? | Investigators manually gather evidence across systems, compare activity against typologies, and write SAR narratives from scratch. |
| Who feels it? | Frontline AML investigators, AML operations leads, compliance quality reviewers, and ultimately regulatory reporting owners. |
| How often does it happen? | Every alert that survives first-level monitoring; exact production volume must be collected in discovery. |
| Current workaround | Manual cross-system search, spreadsheet notes, copied transaction evidence, rule-of-thumb typology judgment, manual narrative writing. |
| Consequence of doing nothing | Backlog, inconsistent investigation quality, slow SAR preparation, weak evidence traceability, staff fatigue, and regulatory exposure. |

Baseline Metrics to Collect

| Metric | Current repository evidence | Production evidence needed |
| --- | --- | --- |
| Case volume | Not in repo | Alerts/day, investigations/day, SAR drafts/month |
| Average handling time | PRD assumes manual work can be hours per case; no internal log yet | Time from alert assignment to disposition |
| Evidence retrieval time | Not measured | Time spent collecting account, party, transaction, counterparty, prior SAR, sanctions evidence |
| Typology accuracy | Synthetic rule baseline exists | Reviewer-confirmed typology labels on historical cases |
| SAR quality | SAR template + code checks exist | QA rubric, returned SAR rate, reviewer edits |
| False positive burden | Synthetic normal FPR is measured | Production false positive rate and closure reasons |
| Cost per case | Not measured | Analyst cost, system cost, model cost, reviewer cost |
| Audit completeness | Audit trail code exists | Missing evidence / missing reviewer action rate |

AI Fit

| Dimension | Assessment | Reason |
| --- | --- | --- |
| Language-heavy work | High | SAR narratives, investigation notes, policy explanations |
| Knowledge retrieval | High | Evidence and typology matching depend on multiple sources |
| Judgment complexity | Medium-high | Classification and escalation need business judgment and controls |
| Structured data availability | Medium | Repo has transaction-like structure; production data readiness unknown |
| Labeled data availability | Medium-low | Synthetic labels exist; external/public labels or internal reviewed cases are needed |
| Auditability requirement | High | Every claim in SAR must cite real evidence |
| Error downside | High | Hallucinated evidence or missed suspicious activity is unacceptable |
| Human review feasibility | High | AML already has review / sign-off workflows |

Candidate AI Roles

| Role | Fit | Notes |
| --- | --- | --- |
| Summarize case evidence | Yes | Strong fit if source-bound and citation-required |
| Classify typology | Yes, controlled | Must compare against deterministic baseline and human labels |
| Retrieve evidence | Yes | Needs freshness, identity, permissions, source-of-truth mapping |
| Draft SAR narrative | Yes, with HITL | Never auto-submit |
| Recommend next action | Yes, advisory only | Must show reason and evidence |
| Execute regulatory submission | No for MVP | Human approval and external filing controls required |
| Monitor model drift | Yes | Production quality monitoring required after pilot |

No-AI Boundaries

| Boundary | Reason | Safer alternative |
| --- | --- | --- |
| Automatically filing SARs | Regulatory accountability and high downside of false evidence | HITL approval + submission outside MVP |
| Overriding investigator judgment | Human accountability remains required | AI recommendation with override reason |
| Using uncited model-generated evidence | Hallucination risk | Source-bound evidence retrieval + citation existence checks |
| Deciding customer account closure | High-impact decision with legal/customer risk | Decision support only; separate governance process |

Initial Success Metrics

| Category | Metric | Starter target | Guardrail |
| --- | --- | --- | --- |
| Efficiency | Evidence gathering time | -30% in pilot | No quality metric degradation |
| Quality | SAR citation validity | 100% valid cited transaction IDs | Block if invalid citation appears |
| Typology | Suspicious vs normal recall | Must beat rule baseline on external labels before replacement | FPR within agreed threshold |
| Risk | Hallucination critical failures | 0 in reviewed sample | Any critical hallucination stops rollout |
| Adoption | Repeat usage by analysts | >=60% weekly active among pilot users | Override reasons reviewed |
| Cost | Model/tool cost per case | Below analyst time saving threshold | Budget alerts |
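
The typology target and its guardrail are both stated in terms of recall and false positive rate. As a minimal sketch of how those two numbers could be computed from labeled cases: the `LabeledPrediction` shape and the `recallAndFpr` helper below are hypothetical illustrations, not the actual contents of `src/aml/evalBaseline.ts` or `src/aml/types.ts`.

```typescript
// Hypothetical shapes for illustration only; the real contract in
// src/aml/types.ts may differ.
type CaseLabel = "structuring" | "layering" | "mule_network" | "normal";

interface LabeledPrediction {
  caseId: string;
  label: CaseLabel;           // ground-truth label
  flaggedSuspicious: boolean; // output of the rule baseline or a candidate model
}

interface BinaryMetrics {
  recall: number; // flagged suspicious cases / all truly suspicious cases
  fpr: number;    // flagged normal cases / all truly normal cases
}

function recallAndFpr(rows: LabeledPrediction[]): BinaryMetrics {
  let tp = 0;
  let fn = 0;
  let fp = 0;
  let tn = 0;
  for (const row of rows) {
    const suspicious = row.label !== "normal";
    if (suspicious && row.flaggedSuspicious) tp++;
    else if (suspicious) fn++;
    else if (row.flaggedSuspicious) fp++;
    else tn++;
  }
  // Guard against empty classes so the result is never NaN.
  return {
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    fpr: fp + tn === 0 ? 0 : fp / (fp + tn),
  };
}

const demoRows: LabeledPrediction[] = [
  { caseId: "c1", label: "structuring", flaggedSuspicious: true },
  { caseId: "c2", label: "layering", flaggedSuspicious: false },
  { caseId: "c3", label: "normal", flaggedSuspicious: true },
  { caseId: "c4", label: "normal", flaggedSuspicious: false },
];
const demoMetrics = recallAndFpr(demoRows);
console.log(demoMetrics); // recall 0.5, fpr 0.5
```

Running the same function over the rule baseline and a candidate classifier on one shared labeled set gives a like-for-like comparison; whether a difference is large enough to count as "beating" the baseline is a separate decision for model risk.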

Recommendation

| Field | Answer |
| --- | --- |
| Decision | Fund discovery + controlled prototype, not production rollout |
| First 10 days | Complete ABPA artifact pack and align stakeholders |
| First 30 days | Add production-like sample, reviewer rubric, and requirements-to-eval matrix |
| Stop rule | Stop if evidence cannot be cited, labels cannot be obtained, or compliance refuses HITL operating model |
| Open questions | Production data availability, regulatory reporting ownership, historical case label access, model/provider policy, audit retention |

3. Stakeholder Evidence Map v0.1

| Stakeholder | Success definition | Main concern | Evidence needed |
| --- | --- | --- | --- |
| AML investigator | Less manual evidence work, faster high-quality drafts | AI creates more review work or misses nuance | Prototype task test, time-on-task, override reasons |
| AML operations lead | Higher throughput without quality regression | Backlog moves from analysts to reviewers | Throughput, review load, escalation rate |
| Compliance officer | SAR narratives are accurate, complete, traceable | Unsupported statements and regulatory criticism | Citation validity, audit trail, reviewer sign-off |
| Model risk / AI governance | Model behavior is evaluated and controlled | Unvalidated model in high-risk process | Eval suite, model card, change control |
| Data owner | Data use is authorized and understood | PII leakage, stale data, source ambiguity | Data inventory, access controls, retention policy |
| IT / platform | Stable integration and supportable runtime | Unowned scripts, key leakage, unreliable tooling | Architecture ADR, runbook, observability |
| Security / privacy | Least privilege and no uncontrolled data export | Sensitive data in prompts or external vendor | Data classification, redaction, provider review |
| Internal audit | Evidence and decisions are reconstructable | Missing logs or undocumented overrides | Immutable audit trail, RACI, control register |
| Executive sponsor | Material efficiency or risk improvement | Demo without ROI | Business case, baseline, pilot decision memo |

Interview Questions by Stakeholder

AML Investigator

  1. Which part of evidence gathering takes the most time?
  2. Which data sources do you trust most and least?
  3. What makes a SAR draft unusable?
  4. What kind of AI suggestion would you override immediately?
  5. What evidence must be visible before you trust a typology recommendation?
  6. What shortcuts do analysts use today when backlog is high?
  7. What is the most common reason a case gets returned for rework?
  8. What should never be hidden behind an AI summary?

Compliance Officer

  1. What minimum evidence is required before a SAR narrative is acceptable?
  2. Which statement types are most dangerous if hallucinated?
  3. What audit fields must exist for AI-assisted work?
  4. What should trigger mandatory senior review?
  5. How should AI-generated content be labeled?
  6. What would make you block the pilot?
  7. What sample size would you need for quality review?
  8. Which regulatory expectations are non-negotiable?

Data / IT / Security

  1. Which systems are source of truth for party, account, transaction, KYC, sanctions, and case status?
  2. Which fields are PII or restricted?
  3. What data can leave the bank boundary?
  4. What identity and authorization model is required?
  5. What logs can be retained, and for how long?
  6. What availability and incident response expectations apply?
  7. How will model/provider outages be handled?
  8. What integration path is realistic for a pilot?

4. AS-IS Workflow v0.1

```mermaid
flowchart TD
  A[Transaction monitoring alert created] --> B[Alert assigned to AML investigator]
  B --> C[Open case and review alert reason]
  C --> D[Gather customer/KYC/account evidence]
  D --> E[Gather transaction and counterparty evidence]
  E --> F[Compare activity with typologies]
  F --> G{Suspicious enough?}
  G -->|No| H[Close as false positive with rationale]
  G -->|Yes| I[Draft SAR narrative]
  I --> J[Compliance / senior review]
  J --> K{Approved?}
  K -->|No| L[Return for rework]
  L --> D
  K -->|Yes| M[Submit / archive evidence]
```

Pain Metrics to Collect

| Step | Pain hypothesis | Metric |
| --- | --- | --- |
| Evidence gathering | Cross-system switching drives most effort | minutes spent per source |
| Typology comparison | Analyst judgment varies by experience | reviewer disagreement rate |
| SAR drafting | Narrative quality is inconsistent | returned draft rate, edit distance |
| Review | Reviewers spend time checking evidence existence | missing / invalid citation rate |
| Closure | False positives still require evidence | closure time for normal cases |
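
The SAR drafting row names edit distance as a metric without defining it. One plausible reading, offered here as an assumption rather than the repository's definition, is a word-level Levenshtein distance between the draft and the version the reviewer finally accepts. The helper names `tokenize`, `wordEditDistance`, and `editRatio` are hypothetical.

```typescript
// Split on whitespace; an empty or blank string yields no tokens.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((t) => t.length > 0);
}

// Word-level Levenshtein distance: the minimum number of word insertions,
// deletions, and substitutions needed to turn the draft into the reviewed text.
function wordEditDistance(draft: string, reviewed: string): number {
  const a = tokenize(draft);
  const b = tokenize(reviewed);
  // prev[j] = distance between the words of `a` processed so far and b[0..j).
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, substitution));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalized to [0, 1]; 0 means the reviewer changed nothing.
function editRatio(draft: string, reviewed: string): number {
  const longest = Math.max(tokenize(draft).length, tokenize(reviewed).length);
  return longest === 0 ? 0 : wordEditDistance(draft, reviewed) / longest;
}

const draftText = "Customer made five cash deposits below threshold";
const reviewedText = "Customer made six cash deposits below the threshold";
console.log(wordEditDistance(draftText, reviewedText)); // 2
console.log(editRatio(draftText, reviewedText));        // 0.25
```

A word-level distance is less sensitive to punctuation and spelling fixes than a character-level one, which may matter if the metric is meant to capture substantive reviewer rework.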

TO-BE Hypothesis

```mermaid
flowchart TD
  A[Alert assigned] --> B[Copilot gathers evidence]
  B --> C[Copilot scores typology with cited evidence]
  C --> D{Confidence and risk gate}
  D -->|Low confidence/high risk| E[Mandatory human review]
  D -->|Normal advisory| F[Analyst reviews closure rationale]
  D -->|Suspicious advisory| G[Copilot drafts source-bound SAR]
  F --> H[Analyst disposition]
  G --> I[Analyst edits and submits for compliance review]
  E --> I
  I --> J[Audit trail export]
```
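
The confidence and risk gate in the TO-BE flow can be sketched as a small routing function. Everything here is an assumption for illustration: the `TypologyScore` shape, the `routeCase` name, the 0 to 1 confidence scale, and the `MIN_CONFIDENCE` value, which would need to be agreed with compliance.

```typescript
type Advisory = "normal" | "suspicious";

type Route =
  | "mandatory_human_review"
  | "analyst_closure_review"
  | "draft_source_bound_sar";

// Hypothetical score shape; not taken from src/aml/types.ts.
interface TypologyScore {
  advisory: Advisory;
  confidence: number; // assumed to lie in [0, 1]
  highRisk: boolean;  // what counts as high risk is left to policy
}

// Placeholder threshold, not an approved value.
const MIN_CONFIDENCE = 0.8;

function routeCase(score: TypologyScore): Route {
  // Written as a negated >= so that NaN confidence also fails safe
  // into mandatory human review.
  if (score.highRisk || !(score.confidence >= MIN_CONFIDENCE)) {
    return "mandatory_human_review";
  }
  return score.advisory === "suspicious"
    ? "draft_source_bound_sar"
    : "analyst_closure_review";
}

console.log(routeCase({ advisory: "suspicious", confidence: 0.95, highRisk: false }));
// draft_source_bound_sar
console.log(routeCase({ advisory: "normal", confidence: 0.5, highRisk: false }));
// mandatory_human_review
```

Checking risk and low confidence first means no advisory label can bypass human review, which matches the ordering of the branches in the flowchart.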

5. Requirements-to-Eval Matrix v0.1

| Req ID | Requirement | Eval data | Grader | Threshold | Production signal | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| R-001 | SAR draft must cite only transaction IDs that exist in the case | Synthetic v1/v1.1 + future historical samples | code check | 100% | invalid citation rate | Compliance Ops |
| R-002 | SAR draft must include at least 5W1H narrative sections | Synthetic cases | code check | >=99% | incomplete draft rate | Product / Compliance |
| R-003 | Typology recommendation must expose evidence and rule hits | Synthetic labeled cases | code check + human review | 100% evidence visible | hidden reasoning complaints | Product |
| R-004 | Suspicious-vs-normal classification must beat rule baseline before replacing it | External/public labeled set + internal reviewed cases | metric eval | statistically better than baseline | recall / precision / FPR | Model risk |
| R-005 | Normal case false positives must remain within approved threshold | Historical normal cases | metric eval | threshold TBD by compliance | false positive rate | AML Ops |
| R-006 | AI-generated content must be labeled as AI-assisted | UI / content inspection | code / UX review | 100% | unlabeled AI content count | Compliance |
| R-007 | Analyst must be able to override recommendation with reason | Prototype / workflow test | usability test | 100% task success | override rate + reason distribution | Product |
| R-008 | High-risk or low-confidence outputs must require human review | Scenario set | code check + human review | 100% routed | auto-complete high-risk count | Risk |
| R-009 | Every material action must be audit logged | Audit trail tests | code check | 100% | missing audit event count | IT / Audit |
| R-010 | Prompt injection in source text must not cause unsupported SAR claims | Red-team set | adversarial eval | 0 critical failures | injection block / miss rate | Security |
| R-011 | Model/provider outage must degrade to workflow baseline | Failure injection | integration test | no data loss | fallback activation rate | Platform |
| R-012 | Cost per case must stay below approved pilot budget | Trace/cost logs | metric | budget TBD | $/case | Sponsor |
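
R-001 and R-002 are both graded by code checks. A minimal sketch of what such checks might look like follows. The `SarDraft` shape, the section keys in `REQUIRED_SECTIONS`, and both function names are assumptions; only the `citedTxIds` field name is taken from the control pack, and the actual checks in `src/aml/__tests__/p1evals.test.ts` may differ.

```typescript
// Hypothetical draft shape for illustration.
interface SarDraft {
  narrative: Record<string, string>; // section key -> section text
  citedTxIds: string[];
}

// 5W1H sections; the exact keys used by the real template are assumed.
const REQUIRED_SECTIONS = ["who", "what", "when", "where", "why", "how"];

// R-001: every cited transaction ID must exist in the case.
function invalidCitations(draft: SarDraft, caseTxIds: string[]): string[] {
  const known = new Set(caseTxIds);
  return draft.citedTxIds.filter((id) => !known.has(id));
}

// R-002: every 5W1H section must be present and non-blank.
function missingSections(draft: SarDraft): string[] {
  return REQUIRED_SECTIONS.filter(
    (section) => (draft.narrative[section] ?? "").trim() === ""
  );
}

const demoDraft: SarDraft = {
  narrative: {
    who: "Account holder A",
    what: "Repeated cash deposits",
    when: "March 2024",
    where: "Branch 12",
    why: "Pattern consistent with structuring",
    how: "",
  },
  citedTxIds: ["tx-1", "tx-9"],
};

console.log(invalidCitations(demoDraft, ["tx-1", "tx-2"])); // [ 'tx-9' ]
console.log(missingSections(demoDraft));                    // [ 'how' ]
```

Returning the offending IDs and sections, rather than a bare pass/fail flag, gives the production signals in the matrix (invalid citation rate, incomplete draft rate) something concrete to count.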

6. AI Control Pack v0.1

| Control ID | Risk | Preventive control | Detective control | Corrective control | Evidence | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| C-001 | Hallucinated transaction evidence | Source-bound citation requirement; no free-form evidence IDs | `citedTxIds` existence check | Block draft and route to analyst | invalid citation report | Compliance Ops |
| C-002 | Unsupported typology conclusion | Show rule hits / evidence; require confidence and rationale | reviewer disagreement tracking | Recalibrate threshold / labels | typology eval report | Model risk |
| C-003 | Context pollution | Restrict evidence to case-linked accounts and relevant channels | benign noise citation check | Remove noisy evidence and review prompt/tool design | failure taxonomy label | Product |
| C-004 | Prompt injection | Treat source text as untrusted; tool output isolation | red-team injection suite | patch prompts/tools and rerun suite | attack transcript | Security |
| C-005 | Privacy leakage | Classify and minimize fields before model call | prompt/trace PII scan | redact and incident review | PII scan report | Data owner |
| C-006 | Over-reliance | UI labels AI as advisory; require override and review controls | override / blind acceptance rate | retrain users and adjust UX | adoption dashboard | AML Ops |
| C-007 | Stale knowledge | Version policies and typology library | freshness check | refresh knowledge base | knowledge version log | Compliance |
| C-008 | Audit failure | Append-only audit event chain | missing event check | reconstruct case and file incident | audit export | Internal audit |
| C-009 | Provider outage | fallback to rules/workflow baseline | timeout/error monitoring | switch provider or manual queue | runtime trace | IT |
| C-010 | Cost runaway | budget cap / model routing policy | $/case monitoring | throttle or downgrade model | cost dashboard | Sponsor |
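
C-008 calls for an append-only audit event chain. One common way to make such a chain tamper-evident is to store in each event the hash of its predecessor. The sketch below assumes that approach and a Node.js runtime; the `AuditEvent` shape and the function names are hypothetical and do not describe the repository's audit trail code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event shape for illustration.
interface AuditEvent {
  caseId: string;
  actor: string;
  action: string;
  at: string;       // ISO 8601 timestamp
  prevHash: string; // hash of the previous event, "" for the first event
  hash: string;     // hash over this event's fields plus prevHash
}

type AuditInput = Omit<AuditEvent, "prevHash" | "hash">;

function hashEvent(input: AuditInput, prevHash: string): string {
  // A fixed-order array keeps the serialized payload deterministic.
  const payload = JSON.stringify([
    input.caseId, input.actor, input.action, input.at, prevHash,
  ]);
  return createHash("sha256").update(payload).digest("hex");
}

// Returns a new chain; the existing events are never mutated.
function appendEvent(chain: readonly AuditEvent[], input: AuditInput): AuditEvent[] {
  const prevHash = chain.length === 0 ? "" : chain[chain.length - 1].hash;
  return [...chain, { ...input, prevHash, hash: hashEvent(input, prevHash) }];
}

// Index of the first event whose link or hash does not verify, or -1 if intact.
function firstBrokenIndex(chain: readonly AuditEvent[]): number {
  let prevHash = "";
  for (let i = 0; i < chain.length; i++) {
    const current = chain[i];
    if (current.prevHash !== prevHash || current.hash !== hashEvent(current, prevHash)) {
      return i;
    }
    prevHash = current.hash;
  }
  return -1;
}

let auditChain: AuditEvent[] = [];
auditChain = appendEvent(auditChain, {
  caseId: "case-1", actor: "copilot", action: "draft_sar", at: "2025-01-01T10:00:00Z",
});
auditChain = appendEvent(auditChain, {
  caseId: "case-1", actor: "analyst-7", action: "override_typology", at: "2025-01-01T10:05:00Z",
});
console.log(firstBrokenIndex(auditChain)); // -1
```

A hash chain detects edits and deletions inside the log but not truncation of its tail, so the "missing event check" in the detective column would still be needed alongside it.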

7. Executive Decision Memo v0.1

Decision Requested

Fund a 30-day discovery and controlled prototype hardening phase for AML Investigation Copilot. Do not approve production rollout yet.

Why Now

  • The repository already has a functioning AML domain model, synthetic labeled dataset, rule baseline, failure taxonomy, SAR draft generator, and eval scaffolding.
  • The business workflow is high-value: AML investigations are evidence-heavy, language-heavy, regulated, and audit-sensitive.
  • The current prototype is honest about its limits: rule-template SAR, synthetic data, no production PII, no automatic filing.

Options

| Option | Pros | Cons | Recommendation |
| --- | --- | --- | --- |
| Do nothing | No implementation risk | Backlog and manual evidence burden remain | No |
| Workflow-only improvements | Lower AI risk, easier audit | Does not address evidence synthesis / narrative drafting fully | Use as fallback baseline |
| AI-assisted copilot | Strong fit for evidence summarization, typology support, SAR draft | Needs controls, labels, and adoption design | Yes: discovery + controlled prototype |
| Fully agentic automation | Potential high efficiency | Too risky for SAR filing and high-impact decisions | No for MVP |

First 30 Days

  1. Validate production baseline metrics.
  2. Confirm data source and access model.
  3. Build stakeholder evidence map and workflow baseline.
  4. Convert top requirements into evals.
  5. Add human review and audit control pack.
  6. Design pilot scope and stop rules.

Success Gate

Proceed to pilot only if:

  • cited evidence validity is 100% on test set,
  • compliance approves HITL operating model,
  • historical or external labels are available for non-circular eval,
  • analyst usability test shows task success without trust confusion,
  • control owners accept RACI.

Stop Rule

Stop or revert to workflow-only if:

  • the system cannot cite evidence reliably,
  • labeled data cannot be obtained,
  • reviewers cannot explain AI-assisted recommendations,
  • critical hallucination appears in controlled tests,
  • data/privacy review blocks model use.
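
The success gate and stop rule above can be read as a single decision function, sketched here under stated assumptions. The field names, the `decidePilot` function, and the intermediate `continue_discovery` outcome (for the case where nothing triggers a stop but the gate is not yet met) are hypothetical; the memo itself only defines the proceed and stop conditions.

```typescript
// One flag or measure per success-gate bullet.
interface SuccessGate {
  citationValidityRate: number; // fraction of valid citations on the test set
  hitlApprovedByCompliance: boolean;
  nonCircularLabelsAvailable: boolean;
  usabilityTestPassed: boolean;
  raciAccepted: boolean;
}

// One flag per stop-rule bullet.
interface StopSignals {
  citationUnreliable: boolean;
  labelsUnobtainable: boolean;
  recommendationsUnexplainable: boolean;
  criticalHallucinationSeen: boolean;
  privacyReviewBlocked: boolean;
}

type PilotDecision = "proceed_to_pilot" | "continue_discovery" | "stop_or_workflow_only";

function decidePilot(gate: SuccessGate, stop: StopSignals): PilotDecision {
  // Any stop signal takes precedence over a passing gate.
  const stopTriggered =
    stop.citationUnreliable ||
    stop.labelsUnobtainable ||
    stop.recommendationsUnexplainable ||
    stop.criticalHallucinationSeen ||
    stop.privacyReviewBlocked;
  if (stopTriggered) return "stop_or_workflow_only";

  const gatePassed =
    gate.citationValidityRate === 1 &&
    gate.hitlApprovedByCompliance &&
    gate.nonCircularLabelsAvailable &&
    gate.usabilityTestPassed &&
    gate.raciAccepted;
  return gatePassed ? "proceed_to_pilot" : "continue_discovery";
}

const passingGate: SuccessGate = {
  citationValidityRate: 1,
  hitlApprovedByCompliance: true,
  nonCircularLabelsAvailable: true,
  usabilityTestPassed: true,
  raciAccepted: true,
};
const noStopSignals: StopSignals = {
  citationUnreliable: false,
  labelsUnobtainable: false,
  recommendationsUnexplainable: false,
  criticalHallucinationSeen: false,
  privacyReviewBlocked: false,
};
console.log(decidePilot(passingGate, noStopSignals)); // proceed_to_pilot
```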

8. Next Work Items

| Priority | Work item | Output |
| --- | --- | --- |
| P0 | Complete ABPA stakeholder interviews | Interview evidence notes |
| P0 | Convert AS-IS workflow to formal BPMN | BPMN + pain metrics |
| P0 | Expand requirements-to-eval matrix | 20+ requirements |
| P1 | Add data readiness pack | source inventory + PII classification |
| P1 | Add usability prototype test plan | 5-person think-aloud script |
| P1 | Add build-vs-buy matrix | self-built vs vendor vs hybrid |
| P2 | Link public/external labeled AML dataset path | external-ground-truth eval plan |