abpa/capstone-aml/AML_ABPA_10_DAY_STARTER.md (326 lines)
AML Copilot ABPA 10-Day Starter
This is the first executable ABPA capstone package. It converts existing AML Copilot work into BA / PM / Architect decision assets.
0. Evidence Already Available
| Evidence | Repository source | What it proves |
|---|---|---|
| One-page PRD | docs/AML_COPILOT_PRD.md | Existing product framing, JTBD, MVP scope, eval-first success metrics |
| Domain type contract | src/aml/types.ts | Stable AML case, party, account, transaction, typology, SAR, HITL data model |
| Synthetic golden dataset v1 | src/aml/__tests__/aml.test.ts | 66 cases with labels: structuring 18, layering 15, mule_network 15, normal 18 |
| Rule baseline | src/aml/evalBaseline.ts | Deterministic recall/FPR measurement against synthetic labels |
| Failure taxonomy | src/aml/failureTaxonomy.ts | Six stable failure classes for eval, triage, and monitoring |
| Code checks | src/aml/__tests__/p1evals.test.ts | SAR structure, citation existence, formatting, top typology consistency checks |
| External-ground-truth evaluator | src/aml/groundTruthEval.ts | Future path for non-circular eval with public labeled datasets |
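The rule-baseline row above refers to recall and false positive rate (FPR) measured against synthetic labels. As a rough sketch of what that measurement involves, the binary suspicious-vs-normal version could look like the following. The `LabeledCase` shape and the `measureBaseline` name are illustrative assumptions, not the contents of src/aml/evalBaseline.ts.

```typescript
// Hypothetical types for illustration; the real contract lives in src/aml/types.ts.
type Typology = "structuring" | "layering" | "mule_network" | "normal";

interface LabeledCase {
  caseId: string;
  label: Typology;     // golden label from the synthetic dataset
  predicted: Typology; // output of the rule baseline
}

interface BaselineMetrics {
  recall: number; // suspicious cases flagged / all suspicious cases
  fpr: number;    // normal cases flagged / all normal cases
}

// Binary view: any non-"normal" prediction counts as "flagged",
// regardless of whether the specific typology matches the label.
function measureBaseline(cases: LabeledCase[]): BaselineMetrics {
  let truePos = 0;
  let falseNeg = 0;
  let falsePos = 0;
  let trueNeg = 0;
  for (const c of cases) {
    const isSuspicious = c.label !== "normal";
    const flagged = c.predicted !== "normal";
    if (isSuspicious && flagged) truePos++;
    else if (isSuspicious && !flagged) falseNeg++;
    else if (!isSuspicious && flagged) falsePos++;
    else trueNeg++;
  }
  const suspicious = truePos + falseNeg;
  const normal = falsePos + trueNeg;
  return {
    // Returning 0 for an empty class is a simplification chosen for this sketch.
    recall: suspicious === 0 ? 0 : truePos / suspicious,
    fpr: normal === 0 ? 0 : falsePos / normal,
  };
}
```

Because both the labels and the rules come from the same synthetic generator, numbers produced this way are circular, which is why the external-ground-truth evaluator is listed as a separate path.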
1. 10-Day Output Map
| Day | ABPA asset | AML-specific output |
|---|---|---|
| 1 | Scenario choice | AML Investigation Copilot confirmed as capstone |
| 2 | AI Opportunity Canvas | Filled v0.1 below |
| 3 | Stakeholder list | Frontline analyst, compliance officer, AML ops lead, data owner, model risk, security / privacy, audit, IT, sponsor |
| 4 | Interview cards | Evidence-oriented questions for each group |
| 5 | AS-IS workflow | Current alert-to-SAR process |
| 6 | Pain metrics | Baseline metrics to collect before pilot |
| 7 | Requirement draft | 12 candidate requirements |
| 8 | Requirements-to-Eval | Requirements mapped to code checks, human review, future LLM judge |
| 9 | AI Control Pack | Top controls for hallucination, privacy, over-reliance, stale knowledge, audit |
| 10 | Executive memo | First go/no-go recommendation |
2. AI Opportunity Canvas v0.1
Scenario
| Field | Answer |
|---|---|
| Business domain | Financial crime compliance / AML investigation |
| Process | Alert triage -> evidence gathering -> typology assessment -> SAR draft -> HITL review -> audit |
| Primary user | AML investigator |
| Secondary user | Compliance officer / AML quality reviewer |
| Sponsor | Head of Financial Crime Operations |
| Current systems | Transaction monitoring, core banking, KYC, case management, sanctions screening |
| Current decision owner | AML investigator for case recommendation; compliance officer for SAR submission approval |
Business Problem
| Question | Answer |
|---|---|
| What is painful today? | Investigators manually gather evidence across systems, compare activity against typologies, and write SAR narratives from scratch. |
| Who feels it? | Frontline AML investigators, AML operations leads, compliance quality reviewers, and ultimately regulatory reporting owners. |
| How often does it happen? | Every alert that survives first-level monitoring; exact production volume must be collected in discovery. |
| Current workaround | Manual cross-system search, spreadsheet notes, copied transaction evidence, rule-of-thumb typology judgment, manual narrative writing. |
| Consequence of doing nothing | Backlog, inconsistent investigation quality, slow SAR preparation, weak evidence traceability, staff fatigue, and regulatory exposure. |
Baseline Metrics to Collect
| Metric | Current repository evidence | Production evidence needed |
|---|---|---|
| Case volume | Not in repo | Alerts/day, investigations/day, SAR drafts/month |
| Average handling time | PRD assumes manual work can be hours per case; no internal log yet | Time from alert assignment to disposition |
| Evidence retrieval time | Not measured | Time spent collecting account, party, transaction, counterparty, prior SAR, sanctions evidence |
| Typology accuracy | Synthetic rule baseline exists | Reviewer-confirmed typology labels on historical cases |
| SAR quality | SAR template + code checks exist | QA rubric, returned SAR rate, reviewer edits |
| False positive burden | Synthetic normal FPR is measured | Production false positive rate and closure reasons |
| Cost per case | Not measured | Analyst cost, system cost, model cost, reviewer cost |
| Audit completeness | Audit trail code exists | Missing evidence / missing reviewer action rate |
AI Fit
| Dimension | Assessment | Reason |
|---|---|---|
| Language-heavy work | High | SAR narratives, investigation notes, policy explanations |
| Knowledge retrieval | High | Evidence and typology matching depend on multiple sources |
| Judgment complexity | Medium-high | Classification and escalation need business judgment and controls |
| Structured data availability | Medium | Repo has transaction-like structure; production data readiness unknown |
| Labeled data availability | Medium-low | Synthetic labels exist; external/public labels or internal reviewed cases are needed |
| Auditability requirement | High | Every claim in SAR must cite real evidence |
| Error downside | High | Hallucinated evidence or missed suspicious activity is unacceptable |
| Human review feasibility | High | AML already has review / sign-off workflows |
Candidate AI Roles
| Role | Fit | Notes |
|---|---|---|
| Summarize case evidence | Yes | Strong fit if source-bound and citation-required |
| Classify typology | Yes, controlled | Must compare against deterministic baseline and human labels |
| Retrieve evidence | Yes | Needs freshness, identity, permissions, source-of-truth mapping |
| Draft SAR narrative | Yes, with HITL | Never auto-submit |
| Recommend next action | Yes, advisory only | Must show reason and evidence |
| Execute regulatory submission | No for MVP | Human approval and external filing controls required |
| Monitor model drift | Yes | Production quality monitoring required after pilot |
No-AI Boundaries
| Boundary | Reason | Safer alternative |
|---|---|---|
| Automatically filing SARs | Regulatory accountability and high downside of false evidence | HITL approval + submission outside MVP |
| Overriding investigator judgment | Human accountability remains required | AI recommendation with override reason |
| Using uncited model-generated evidence | Hallucination risk | Source-bound evidence retrieval + citation existence checks |
| Deciding customer account closure | High-impact decision with legal/customer risk | Decision support only; separate governance process |
Initial Success Metrics
| Category | Metric | Starter target | Guardrail |
|---|---|---|---|
| Efficiency | Evidence gathering time | 30% reduction vs. measured baseline in pilot | No quality metric degradation |
| Quality | SAR citation validity | 100% valid cited transaction IDs | Block if invalid citation appears |
| Typology | Suspicious vs normal recall | Must beat rule baseline on external labels before replacement | FPR within agreed threshold |
| Risk | Hallucination critical failures | 0 in reviewed sample | Any critical hallucination stops rollout |
| Adoption | Repeat usage by analysts | >=60% weekly active among pilot users | Override reasons reviewed |
| Cost | Model/tool cost per case | Below analyst time saving threshold | Budget alerts |
Recommendation
| Field | Answer |
|---|---|
| Decision | Fund discovery + controlled prototype, not production rollout |
| First 10 days | Complete ABPA artifact pack and align stakeholders |
| First 30 days | Add production-like sample, reviewer rubric, and requirements-to-eval matrix |
| Stop rule | Stop if evidence cannot be cited, labels cannot be obtained, or compliance refuses HITL operating model |
| Open questions | Production data availability, regulatory reporting ownership, historical case label access, model/provider policy, audit retention |
3. Stakeholder Evidence Map v0.1
| Stakeholder | Success definition | Main concern | Evidence needed |
|---|---|---|---|
| AML investigator | Less manual evidence work, faster high-quality drafts | AI creates more review work or misses nuance | Prototype task test, time-on-task, override reasons |
| AML operations lead | Higher throughput without quality regression | Backlog moves from analysts to reviewers | Throughput, review load, escalation rate |
| Compliance officer | SAR narratives are accurate, complete, traceable | Unsupported statements and regulatory criticism | Citation validity, audit trail, reviewer sign-off |
| Model risk / AI governance | Model behavior is evaluated and controlled | Unvalidated model in high-risk process | Eval suite, model card, change control |
| Data owner | Data use is authorized and understood | PII leakage, stale data, source ambiguity | Data inventory, access controls, retention policy |
| IT / platform | Stable integration and supportable runtime | Unowned scripts, key leakage, unreliable tooling | Architecture ADR, runbook, observability |
| Security / privacy | Least privilege and no uncontrolled data export | Sensitive data in prompts or external vendor | Data classification, redaction, provider review |
| Internal audit | Evidence and decisions are reconstructable | Missing logs or undocumented overrides | Immutable audit trail, RACI, control register |
| Executive sponsor | Material efficiency or risk improvement | Demo without ROI | Business case, baseline, pilot decision memo |
Interview Questions by Stakeholder
AML Investigator
- Which part of evidence gathering takes the most time?
- Which data sources do you trust most and least?
- What makes a SAR draft unusable?
- What kind of AI suggestion would you override immediately?
- What evidence must be visible before you trust a typology recommendation?
- What shortcuts do analysts use today when backlog is high?
- What is the most common reason a case gets returned for rework?
- What should never be hidden behind an AI summary?
Compliance Officer
- What minimum evidence is required before a SAR narrative is acceptable?
- Which statement types are most dangerous if hallucinated?
- What audit fields must exist for AI-assisted work?
- What should trigger mandatory senior review?
- How should AI-generated content be labeled?
- What would make you block the pilot?
- What sample size would you need for quality review?
- Which regulatory expectations are non-negotiable?
Data / IT / Security
- Which systems are source of truth for party, account, transaction, KYC, sanctions, and case status?
- Which fields are PII or restricted?
- What data can leave the bank boundary?
- What identity and authorization model is required?
- What logs can be retained, and for how long?
- What availability and incident response expectations apply?
- How will model/provider outages be handled?
- What integration path is realistic for a pilot?
4. AS-IS Workflow v0.1
```mermaid
flowchart TD
    A[Transaction monitoring alert created] --> B[Alert assigned to AML investigator]
    B --> C[Open case and review alert reason]
    C --> D[Gather customer/KYC/account evidence]
    D --> E[Gather transaction and counterparty evidence]
    E --> F[Compare activity with typologies]
    F --> G{Suspicious enough?}
    G -->|No| H[Close as false positive with rationale]
    G -->|Yes| I[Draft SAR narrative]
    I --> J[Compliance / senior review]
    J --> K{Approved?}
    K -->|No| L[Return for rework]
    L --> D
    K -->|Yes| M[Submit / archive evidence]
```
Pain Metrics to Collect
| Step | Pain hypothesis | Metric |
|---|---|---|
| Evidence gathering | Cross-system switching drives most effort | minutes spent per source |
| Typology comparison | Analyst judgment varies by experience | reviewer disagreement rate |
| SAR drafting | Narrative quality is inconsistent | returned draft rate, edit distance |
| Review | Reviewers spend time checking evidence existence | missing / invalid citation rate |
| Closure | False positives still require evidence | closure time for normal cases |
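The "edit distance" metric for SAR drafting can be made concrete in several ways, and this document does not fix one. A minimal sketch, assuming word-level Levenshtein distance between the draft and the reviewer-approved narrative (both function names are hypothetical):

```typescript
// Split a narrative into whitespace-separated tokens, dropping empties.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Word-level Levenshtein distance: the minimum number of token insertions,
// deletions, and substitutions needed to turn `draft` into `approved`.
function tokenEditDistance(draft: string, approved: string): number {
  const a = tokenize(draft);
  const b = tokenize(approved);
  // prev[j] holds the distance between the first i-1 tokens of a and the first j tokens of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // delete a token from the draft
        curr[j - 1] + 1,   // insert a token from the approved text
        prev[j - 1] + cost // substitute, or keep if the tokens match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalized to 0..1 so narratives of different lengths are comparable.
function normalizedEditDistance(draft: string, approved: string): number {
  const longest = Math.max(tokenize(draft).length, tokenize(approved).length);
  return longest === 0 ? 0 : tokenEditDistance(draft, approved) / longest;
}
```

A word-level distance is less sensitive to trivial spelling edits than a character-level one, but it still treats a reordered sentence as many edits, so it is best read alongside the returned draft rate rather than on its own.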
TO-BE Hypothesis
```mermaid
flowchart TD
    A[Alert assigned] --> B[Copilot gathers evidence]
    B --> C[Copilot scores typology with cited evidence]
    C --> D{Confidence and risk gate}
    D -->|Low confidence/high risk| E[Mandatory human review]
    D -->|Normal advisory| F[Analyst reviews closure rationale]
    D -->|Suspicious advisory| G[Copilot drafts source-bound SAR]
    F --> H[Analyst disposition]
    G --> I[Analyst edits and submits for compliance review]
    E --> I
    I --> J[Audit trail export]
```
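The confidence and risk gate in the TO-BE flowchart can be sketched as a small routing function. Everything here is an assumption made for illustration: the `GateInput` fields, the route names, and especially the 0.8 threshold, which would need to be set by compliance and model risk.

```typescript
type Route = "mandatory_human_review" | "analyst_closure_review" | "draft_sar";

interface GateInput {
  suspicious: boolean; // Copilot's advisory suspicious-vs-normal result
  confidence: number;  // hypothetical score in the range 0..1
  highRisk: boolean;   // hypothetical flag set by policy rules
}

// Placeholder threshold; not a value taken from this document.
const DEFAULT_MIN_CONFIDENCE = 0.8;

function routeCase(input: GateInput, minConfidence: number = DEFAULT_MIN_CONFIDENCE): Route {
  // Written as a negated >= so that a NaN confidence fails closed into human review.
  const confident = input.confidence >= minConfidence;
  if (input.highRisk || !confident) return "mandatory_human_review";
  return input.suspicious ? "draft_sar" : "analyst_closure_review";
}
```

The gate is checked before the advisory branch so that a confident "normal" result on a high-risk case still goes to a human, which matches the intent of routing 100% of high-risk or low-confidence outputs to review.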
5. Requirements-to-Eval Matrix v0.1
| Req ID | Requirement | Eval data | Grader | Threshold | Production signal | Owner |
|---|---|---|---|---|---|---|
| R-001 | SAR draft must cite only transaction IDs that exist in the case | Synthetic v1/v1.1 + future historical samples | code check | 100% | invalid citation rate | Compliance Ops |
| R-002 | SAR draft must include all 5W1H narrative sections (who, what, when, where, why, how) | Synthetic cases | code check | >=99% | incomplete draft rate | Product / Compliance |
| R-003 | Typology recommendation must expose evidence and rule hits | Synthetic labeled cases | code check + human review | 100% evidence visible | hidden reasoning complaints | Product |
| R-004 | Suspicious-vs-normal classification must beat rule baseline before replacing it | External/public labeled set + internal reviewed cases | metric eval | statistically better than baseline | recall / precision / FPR | Model risk |
| R-005 | Normal case false positives must remain within approved threshold | Historical normal cases | metric eval | threshold TBD by compliance | false positive rate | AML Ops |
| R-006 | AI-generated content must be labeled as AI-assisted | UI / content inspection | code / UX review | 100% | unlabeled AI content count | Compliance |
| R-007 | Analyst must be able to override recommendation with reason | Prototype / workflow test | usability test | 100% task success | override rate + reason distribution | Product |
| R-008 | High-risk or low-confidence outputs must require human review | Scenario set | code check + human review | 100% routed | count of high-risk cases completed without human review | Risk |
| R-009 | Every material action must be audit logged | Audit trail tests | code check | 100% | missing audit event count | IT / Audit |
| R-010 | Prompt injection in source text must not cause unsupported SAR claims | Red-team set | adversarial eval | 0 critical failures | injection block / miss rate | Security |
| R-011 | Model/provider outage must degrade to workflow baseline | Failure injection | integration test | no data loss | fallback activation rate | Platform |
| R-012 | Cost per case must stay below approved pilot budget | Trace/cost logs | metric | budget TBD | $/case | Sponsor |
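R-001 is the most mechanical requirement in the matrix, so it makes a useful first code check. The sketch below is an assumption about how such a grader might look, not the check in src/aml/__tests__/p1evals.test.ts. The `citedTxIds` field name is borrowed from this document's control pack, while `SarDraftLike` and both function names are hypothetical.

```typescript
// Hypothetical minimal shape; the real SAR type is defined in src/aml/types.ts.
interface SarDraftLike {
  caseId: string;
  citedTxIds: string[];
}

// Returns every cited transaction ID that does not exist in the case.
function findInvalidCitations(draft: SarDraftLike, caseTxIds: Iterable<string>): string[] {
  const known = new Set(caseTxIds);
  return draft.citedTxIds.filter((id) => !known.has(id));
}

// R-001 passes only when no cited ID is missing from the case.
// A draft with zero citations passes vacuously here; whether that is
// acceptable is a separate completeness question, not part of R-001.
function passesCitationCheck(draft: SarDraftLike, caseTxIds: Iterable<string>): boolean {
  return findInvalidCitations(draft, caseTxIds).length === 0;
}
```

Returning the offending IDs rather than only a boolean gives the invalid citation report something concrete to show the analyst when a draft is blocked.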
6. AI Control Pack v0.1
| Control ID | Risk | Preventive control | Detective control | Corrective control | Evidence | Owner |
|---|---|---|---|---|---|---|
| C-001 | Hallucinated transaction evidence | Source-bound citation requirement; no free-form evidence IDs | citedTxIds existence check | Block draft and route to analyst | invalid citation report | Compliance Ops |
| C-002 | Unsupported typology conclusion | Show rule hits / evidence; require confidence and rationale | reviewer disagreement tracking | Recalibrate threshold / labels | typology eval report | Model risk |
| C-003 | Context pollution | Restrict evidence to case-linked accounts and relevant channels | benign noise citation check | Remove noisy evidence and review prompt/tool design | failure taxonomy label | Product |
| C-004 | Prompt injection | Treat source text as untrusted; tool output isolation | red-team injection suite | patch prompts/tools and rerun suite | attack transcript | Security |
| C-005 | Privacy leakage | Classify and minimize fields before model call | prompt/trace PII scan | redact and incident review | PII scan report | Data owner |
| C-006 | Over-reliance | UI labels AI as advisory; require override and review controls | override / blind acceptance rate | retrain users and adjust UX | adoption dashboard | AML Ops |
| C-007 | Stale knowledge | Version policies and typology library | freshness check | refresh knowledge base | knowledge version log | Compliance |
| C-008 | Audit failure | Append-only audit event chain | missing event check | reconstruct case and file incident | audit export | Internal audit |
| C-009 | Provider outage | fallback to rules/workflow baseline | timeout/error monitoring | switch provider or manual queue | runtime trace | IT |
| C-010 | Cost runaway | budget cap / model routing policy | $/case monitoring | throttle or downgrade model | cost dashboard | Sponsor |
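The detective control for C-008 (missing event check) can be illustrated with a small completeness check over a case's audit events. The event type names and the `AuditEventLike` shape below are invented for the example; the real list of material actions would come from the audit and compliance owners.

```typescript
// Hypothetical event shape for illustration only.
interface AuditEventLike {
  caseId: string;
  type: string;
  at: string; // ISO 8601 timestamp
}

// Example set of material actions; an assumption, not a list from this document.
const REQUIRED_EVENT_TYPES: readonly string[] = [
  "evidence_retrieved",
  "typology_scored",
  "analyst_decision",
];

// Returns the required event types that were never logged for the given case.
function missingAuditEvents(
  events: AuditEventLike[],
  caseId: string,
  required: readonly string[] = REQUIRED_EVENT_TYPES
): string[] {
  const seen = new Set(events.filter((e) => e.caseId === caseId).map((e) => e.type));
  return required.filter((type) => !seen.has(type));
}
```

Summing the length of the returned list across cases gives the missing audit event count used as the production signal for R-009. This check only detects absent events; tamper evidence for the append-only chain would need a separate mechanism.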
7. Executive Decision Memo v0.1
Decision Requested
Fund a 30-day discovery and controlled prototype hardening phase for AML Investigation Copilot. Do not approve production rollout yet.
Why Now
- The repository already has a functioning AML domain model, synthetic labeled dataset, rule baseline, failure taxonomy, SAR draft generator, and eval scaffolding.
- The business workflow is high-value: AML investigations are evidence-heavy, language-heavy, regulated, and audit-sensitive.
- The current prototype is honest about its limits: rule-template SAR, synthetic data, no production PII, no automatic filing.
Options
| Option | Pros | Cons | Recommendation |
|---|---|---|---|
| Do nothing | No implementation risk | Backlog and manual evidence burden remain | No |
| Workflow-only improvements | Lower AI risk, easier audit | Does not address evidence synthesis / narrative drafting fully | Use as fallback baseline |
| AI-assisted copilot | Strong fit for evidence summarization, typology support, SAR draft | Needs controls, labels, and adoption design | Yes: discovery + controlled prototype |
| Fully agentic automation | Potential high efficiency | Too risky for SAR filing and high-impact decisions | No for MVP |
First 30 Days
- Validate production baseline metrics.
- Confirm data source and access model.
- Build stakeholder evidence map and workflow baseline.
- Convert top requirements into evals.
- Add human review and audit control pack.
- Design pilot scope and stop rules.
Success Gate
Proceed to pilot only if:
- cited evidence validity is 100% on test set,
- compliance approves HITL operating model,
- historical or external labels are available for non-circular eval,
- analyst usability test shows task success without trust confusion,
- control owners accept RACI.
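The five gate conditions above are all-or-nothing, which makes them straightforward to encode as a checklist that reports what is still unmet. This is only a sketch: the `PilotGateEvidence` field names are hypothetical, and most inputs are human sign-offs recorded as booleans rather than anything computed.

```typescript
// Hypothetical record of gate evidence; field names are illustrative.
interface PilotGateEvidence {
  citationValidityRate: number;        // fraction of valid citations on the test set, 0..1
  hitlModelApproved: boolean;          // compliance approves HITL operating model
  nonCircularLabelsAvailable: boolean; // historical or external labels obtained
  usabilityTestPassed: boolean;        // task success without trust confusion
  raciAccepted: boolean;               // control owners accept RACI
}

// Returns a human-readable list of unmet conditions; empty means "proceed to pilot".
function unmetGateConditions(e: PilotGateEvidence): string[] {
  const unmet: string[] = [];
  if (e.citationValidityRate !== 1) unmet.push("cited evidence validity is below 100%");
  if (!e.hitlModelApproved) unmet.push("HITL operating model not approved by compliance");
  if (!e.nonCircularLabelsAvailable) unmet.push("no historical or external labels for non-circular eval");
  if (!e.usabilityTestPassed) unmet.push("analyst usability test not passed");
  if (!e.raciAccepted) unmet.push("control owners have not accepted RACI");
  return unmet;
}

function canProceedToPilot(e: PilotGateEvidence): boolean {
  return unmetGateConditions(e).length === 0;
}
```

Listing the unmet conditions, instead of returning a bare yes/no, keeps the decision memo honest about which gate is blocking the pilot.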
Stop Rule
Stop or revert to workflow-only if:
- the system cannot cite evidence reliably,
- labeled data cannot be obtained,
- reviewers cannot explain AI-assisted recommendations,
- critical hallucination appears in controlled tests,
- data/privacy review blocks model use.
8. Next Work Items
| Priority | Work item | Output |
|---|---|---|
| P0 | Complete ABPA stakeholder interviews | Interview evidence notes |
| P0 | Convert AS-IS workflow to formal BPMN | BPMN + pain metrics |
| P0 | Expand requirements-to-eval matrix | 20+ requirements |
| P1 | Add data readiness pack | source inventory + PII classification |
| P1 | Add usability prototype test plan | 5-person think-aloud script |
| P1 | Add build-vs-buy matrix | self-built vs vendor vs hybrid |
| P2 | Link public/external labeled AML dataset path | external-ground-truth eval plan |