AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Playbook

AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Decision Architecture Playbook

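Positioning: A practical handbook for experienced CBAP / AI BA / AI PM / Solution Architect / Enterprise Architect / financial retail transformation lead practitioners.
Goal: Turn AI ideas, vendor pitches and business pain into procurement intake, sandbox benchmark and build-buy-partner evidence packs that can be evaluated, gated, audited and decided on.
Boundary: This handbook sits upstream of contract negotiation, vendor exit strategy and the board investment narrative. It provides upstream evidence and does not replace formal review by legal, compliance, procurement, privacy, security, model risk, finance or internal audit.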

Purpose And When To Use

Purpose

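Use this playbook to address five kinds of problems:

A business stakeholder or executive proposes an AI idea, and you need to decide whether it enters the AI portfolio.
A vendor pitches unsolicited, and the team needs to avoid demo-led procurement.
The team is wavering among build, buy, partner and hybrid options.
A PoC already exists but lacks evidence, risk boundaries, a cost model and production gates.
You need to turn AI vendor evaluation into a portfolio asset that demonstrates senior PM / BA / architecture capability.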

When To Use

Trigger	Use this playbook to produce
"We want to build an AI assistant"	AI procurement intake and use-case triage
"This vendor demo looks great"	vendor sandbox charter and benchmark plan
"Should we buy or build in-house?"	build-buy-partner decision record
"The PoC results look good; can we go live?"	production gate evidence packet
"The risk team is asking where the evidence is"	metrics-control-evidence model and risk acceptance
"The CTO is asking whether this creates lock-in"	architecture fit, concentration risk and exit constraint review

What Good Looks Like

AI request
  -> intake card
  -> risk-tiered triage
  -> sandbox charter
  -> benchmark plan
  -> vendor/internal/no-AI comparison
  -> build-buy-partner decision record
  -> risk acceptance if needed
  -> production gate evidence packet
  -> downstream contract, TPRM, model validation and release process

Operating Model

Stage Gates

Gate	Decision	Required evidence	Stop signal
Gate 0: Intake completeness	Is this a valid AI candidate?	intake card, owner, outcome, workflow, data, no-AI option	no measurable outcome or no accountable owner
Gate 1: Use-case triage	What risk tier and path?	impact assessment, AI role, customer/regulatory effect	unacceptable risk or unclear decision authority
Gate 2: Sandbox approval	Can we test safely?	sandbox charter, data approval, benchmark plan, cost cap	production data or write access without approval
Gate 3: Evidence review	Which option is strongest?	scorecard, benchmark report, failure taxonomy, architecture fit	demo-only evidence or critical failure unresolved
Gate 4: Build-buy decision	build, buy, partner, hybrid, stop?	decision record, TCO, control ownership, concentration review	preference-only decision
Gate 5: Production promotion	Can this enter pilot or production?	evidence packet, risk acceptance, release checklist	missing trace, eval, data, security or owner evidence
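
The gate table above can be read as an ordered evidence checklist. The following is a minimal sketch, assuming each gate is represented only by its "Required evidence" column; the names `GATES` and `first_blocked_gate` and the use of plain strings as evidence keys are hypothetical, and stop signals are not modeled.

```python
from typing import Optional

# Hypothetical encoding of the stage-gate table: (gate name, required evidence).
# Evidence labels are copied from the table; they are not a fixed schema.
GATES = [
    ("Gate 0: Intake completeness",
     {"intake card", "owner", "outcome", "workflow", "data", "no-AI option"}),
    ("Gate 1: Use-case triage",
     {"impact assessment", "AI role", "customer/regulatory effect"}),
    ("Gate 2: Sandbox approval",
     {"sandbox charter", "data approval", "benchmark plan", "cost cap"}),
    ("Gate 3: Evidence review",
     {"scorecard", "benchmark report", "failure taxonomy", "architecture fit"}),
    ("Gate 4: Build-buy decision",
     {"decision record", "TCO", "control ownership", "concentration review"}),
    ("Gate 5: Production promotion",
     {"evidence packet", "risk acceptance", "release checklist"}),
]


def first_blocked_gate(evidence: set[str]) -> Optional[tuple[str, set[str]]]:
    """Return the first gate with incomplete evidence and the missing items.

    Returns None when every gate has its required evidence.
    """
    for name, required in GATES:
        missing = required - evidence
        if missing:
            return name, missing
    return None


if __name__ == "__main__":
    collected = {"intake card", "owner", "outcome", "workflow", "data",
                 "no-AI option", "impact assessment", "AI role"}
    print(first_blocked_gate(collected))
```

Walking the gates in order keeps a request from skipping ahead: a complete sandbox charter does not help if triage evidence is still missing.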

RACI

Activity	Business owner	AI PM	CBAP / BA	Architect	Security / Privacy	Risk / Compliance / MRM	Procurement / TPRM	Operations
Intake card	A	R	R	C	C	C	I	C
Triage	A	R	R	R	C	A/R	C	C
Sandbox charter	C	A/R	R	R	R	C	C	C
Benchmark design	C	R	R	R	C	A/R	I	C
Vendor scorecard	C	A/R	R	R	R	R	A/R	C
Build-buy decision	A	R	C	A/R	C	C	C	C
Risk acceptance	A	R	C	C	C	A/R	C	C
Production gate	A	R	R	R	R	R	C	A/R

A = accountable, R = responsible, C = consulted, I = informed.
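
Several rows in the matrix above assign A to more than one role. The sketch below lists those rows so that shared accountability is a deliberate choice rather than an oversight. It is illustrative only: the source does not say whether a single accountable role per activity is required, and the names `ROLES`, `RACI`, `accountable_roles` and `shared_accountability` are hypothetical.

```python
# Role columns in the same order as the RACI table.
ROLES = [
    "Business owner", "AI PM", "CBAP / BA", "Architect", "Security / Privacy",
    "Risk / Compliance / MRM", "Procurement / TPRM", "Operations",
]

# Each row is transcribed from the table, one cell per role, space-separated.
RACI = {
    "Intake card":        "A R R C C C I C",
    "Triage":             "A R R R C A/R C C",
    "Sandbox charter":    "C A/R R R R C C C",
    "Benchmark design":   "C R R R C A/R I C",
    "Vendor scorecard":   "C A/R R R R R A/R C",
    "Build-buy decision": "A R C A/R C C C C",
    "Risk acceptance":    "A R C C C A/R C C",
    "Production gate":    "A R R R R R C A/R",
}


def accountable_roles(row: str) -> list[str]:
    """Return the roles whose cell contains A (including combined A/R cells)."""
    cells = row.split()
    if len(cells) != len(ROLES):
        raise ValueError("row must have one cell per role")
    return [role for role, cell in zip(ROLES, cells) if "A" in cell.split("/")]


def shared_accountability(matrix: dict[str, str]) -> dict[str, list[str]]:
    """Map each activity with more than one accountable role to those roles."""
    shared = {}
    for activity, row in matrix.items():
        roles = accountable_roles(row)
        if len(roles) > 1:
            shared[activity] = roles
    return shared


if __name__ == "__main__":
    for activity, roles in shared_accountability(RACI).items():
        print(activity, "->", roles)
```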

Work Products

Work product	Minimum content	Reuse
AI procurement intake	outcome, workflow, AI role, data, risk, no-AI option	portfolio funnel and prioritization
Vendor sandbox scorecard	quality, risk, cost, latency, architecture fit, evidence	vendor comparison and procurement input
Benchmark plan	dataset, tasks, rubric, threshold, sample size, measurement	eval contract and release gate
Build-buy decision record	options, component boundary, decision, rationale, reversal trigger	ADR and executive memo input
Risk acceptance	residual risk, controls, evidence, owner, expiry, stop rule	risk governance and audit
Production gate packet	traceable evidence needed for pilot/production	release, model validation and operating readiness

Template: AI Procurement Intake

Field	Required answer	Example: AML investigation workbench
Request name	Specific workflow and AI role	AML investigation narrative copilot
Business owner	Named accountable owner	Head of AML Operations
Outcome	Baseline and target movement	Reduce evidence collection and narrative drafting time by 25 percent without increasing missed red flags
Workflow	Current process and insertion point	Analyst reviews alert, retrieves transactions, checks customer profile, drafts case narrative
AI role	read, retrieve, summarize, recommend, draft, decide, act	retrieve, summarize, draft; no final disposition
Users	Primary and secondary users	AML analyst, QA reviewer, team lead, model risk reviewer
Customer/regulatory impact	Direct, indirect or none	Indirect regulatory impact through case evidence and SAR support
Data sources	Systems, documents, logs, retention	transaction history, customer profile, KYC docs, prior cases, typology library
Sensitive data	PII, PCI, account, transaction, protected class, SAR-sensitive	PII, account, transaction, KYC and investigation-sensitive content
No-AI alternative	Process, rules, search, RPA, reporting	improve search, standard narrative templates, rules-based evidence checklist
Risk tier	Low, medium, high, critical	High
Sandbox data plan	synthetic, masked, historical, red-team	masked historical cases plus synthetic typology cases
Evaluation hypothesis	What must be proven	AI improves evidence completeness and drafting time while critical omissions remain zero
Architecture concern	Main integration/control issue	source entitlement, trace, case-system integration, human approval
Initial recommendation	reject, defer, sandbox, direct purchase, internal build	controlled sandbox only

Intake Triage Rules

If this is true	Route
No owner, no baseline, no decision boundary	reject or defer
Data cannot be used safely even in sandbox	defer until data plan exists
AI would make high-impact decisions without human authority	redesign scope
Value is mostly search or process discipline	compare AI against no-AI improvement
Workflow is common and low risk	buy candidate with sandbox
Workflow is regulated, high impact or evidence-heavy	hybrid candidate with stronger gate
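
The routing rules above can be sketched as a first-match function. This is a hedged illustration: the field names on `IntakeFacts`, the rule ordering, the reading of the first rule as "any one of owner, baseline or decision boundary is missing", and the final fallback route are all assumptions that the table does not specify.

```python
from dataclasses import dataclass


@dataclass
class IntakeFacts:
    """Hypothetical yes/no facts captured on the intake card."""
    has_owner: bool
    has_baseline: bool
    has_decision_boundary: bool
    sandbox_data_safe: bool
    ai_decides_high_impact_without_human: bool
    value_mostly_search_or_process: bool
    regulated_or_high_impact: bool
    common_and_low_risk: bool


def triage(facts: IntakeFacts) -> str:
    """Return the route for an intake request; the first matching rule wins."""
    if not (facts.has_owner and facts.has_baseline and facts.has_decision_boundary):
        return "reject or defer"
    if not facts.sandbox_data_safe:
        return "defer until data plan exists"
    if facts.ai_decides_high_impact_without_human:
        return "redesign scope"
    if facts.value_mostly_search_or_process:
        return "compare AI against no-AI improvement"
    # Assumption: regulated work takes the stronger gate even if it is also common.
    if facts.regulated_or_high_impact:
        return "hybrid candidate with stronger gate"
    if facts.common_and_low_risk:
        return "buy candidate with sandbox"
    return "needs manual triage"  # Assumed fallback; not in the table.


if __name__ == "__main__":
    aml_copilot = IntakeFacts(True, True, True, True, False, False, True, False)
    print(triage(aml_copilot))
```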

Template: Vendor Sandbox Scorecard

Score each dimension from 1 to 5, then apply the weight. The default weights below sum to 100. Adjust weights by use-case risk. For AML, KYC, credit and payments, increase governance, audit, data and control weights.

Dimension	Weight	1	3	5	Evidence required
Business workflow fit	10	generic demo	partial workflow	fits real roles, queues and exceptions	workflow demo using sandbox tasks
Output quality	12	claims only	acceptable average score	strong score plus slice analysis	eval report and samples
Critical failure control	12	severe failures	mitigations proposed	no critical failures in approved set	failure taxonomy
Data privacy	10	vague processing	basic data terms	clear prompt, embedding, log, support and retention controls	data flow and settings
Security	10	weak IAM/logging	basic SSO/RBAC	strong IAM, tenant isolation, prompt/tool threat testing	security evidence
Eval maturity	10	vendor benchmark only	supports custom eval	supports customer gold set, red-team and export	eval methodology
Architecture fit	10	black box	API integration	gateway, trace, SIEM, IAM, RAG and tool boundary fit	architecture review
Audit evidence	8	transcript only	partial logs	full trace and admin audit export	trace sample
Cost predictability	6	unclear	rate card	cost per case, caps, variance and scale model	cost workbook
Latency/reliability	5	best effort	basic SLA	p95 measured, rate limits known, fallback path	latency report
Concentration/exit constraint	5	heavy lock-in	partial export	portable data, eval, prompt, logs and config	export review
Vendor operating maturity	2	sales-led	some process	support, incident, change and roadmap discipline	vendor evidence

Decision bands:

Weighted score	Interpretation
0-55	Do not pilot; use only for learning
56-70	Limited sandbox extension or low-risk pilot only
71-84	Candidate for controlled pilot if blockers closed
85-100	Strong candidate, still subject to production gate

Hard blockers override score:

Critical customer harm, regulatory, privacy or security failure unresolved.
Vendor cannot explain data route, retention or support access.
No evidence export for high-risk use case.
Tool actions cannot be constrained or audited.
Cost or latency cannot be modeled for production volume.
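
A minimal sketch of the scorecard arithmetic follows. It makes one assumption the text does not state: because scores run from 1 to 5 and the default weights sum to 100, the raw weighted sum (100 to 500) is divided by 5 so that it lands on the 0-100 scale used by the decision bands. Under that normalization the lowest reachable score is 20. The names `WEIGHTS`, `weighted_score` and `decision_band` are hypothetical.

```python
# Default weights copied from the scorecard table.
WEIGHTS = {
    "Business workflow fit": 10,
    "Output quality": 12,
    "Critical failure control": 12,
    "Data privacy": 10,
    "Security": 10,
    "Eval maturity": 10,
    "Architecture fit": 10,
    "Audit evidence": 8,
    "Cost predictability": 6,
    "Latency/reliability": 5,
    "Concentration/exit constraint": 5,
    "Vendor operating maturity": 2,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted score on a 20-100 scale (assumed normalization)."""
    if scores.keys() != WEIGHTS.keys():
        raise ValueError("score every dimension exactly once")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    return sum(WEIGHTS[dim] * s for dim, s in scores.items()) / 5


def decision_band(score: float, hard_blockers: list[str]) -> str:
    """Map a score to its decision band; any hard blocker overrides the score."""
    if hard_blockers:
        return "Blocked: " + "; ".join(hard_blockers)
    if score < 56:
        return "Do not pilot; use only for learning"
    if score < 71:
        return "Limited sandbox extension or low-risk pilot only"
    if score < 85:
        return "Candidate for controlled pilot if blockers closed"
    return "Strong candidate, still subject to production gate"


if __name__ == "__main__":
    middling = {dim: 3 for dim in WEIGHTS}
    print(decision_band(weighted_score(middling), hard_blockers=[]))
```

Checking blockers before the band lookup mirrors the rule that hard blockers override score: a vendor scoring in the top band with no evidence export is still not a pilot candidate.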

Template: Benchmark Plan

Section	Required content	Financial retail example
Use case	workflow and AI role	credit policy support copilot for underwriter memo drafting
Options compared	vendor A, vendor B, internal baseline, no-AI baseline	two vendor copilots, internal RAG prototype, current policy search
Dataset	source, size, masking, lineage, approval	300 historical policy questions, 80 edge cases, 40 red-team prompts
Task set	representative tasks	retrieve policy, summarize exception, draft memo section, refuse unsupported decision
Rubric	scoring dimensions and severity	groundedness, completeness, policy compliance, adverse-action boundary
Critical failures	zero-tolerance failures	protected-class inference, unsupported decline reason, stale policy citation
Metrics	quality, cost, latency, adoption proxy, risk	pass rate, critical failure rate, p95 latency, cost per memo, reviewer agreement
Slice analysis	product, customer type, geography, risk tier	small business loan, credit card, secured loan, state-specific policy
Controls tested	privacy, security, prompt injection, DLP, entitlement	restricted policy, prompt injection, PII masking, role-based source access
Run protocol	repeat count, temperature, model version, evaluator	fixed model route, three repeated runs for unstable tasks, SME review
Acceptance threshold	go, limited go, no-go	no critical failures, groundedness >= 4/5, p95 < 6s, cost <$0.40 per memo section
Evidence package	files and logs retained	prompts, outputs, traces, citations, evaluator notes, cost/latency export
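
The example acceptance threshold in the plan above (no critical failures, groundedness >= 4/5, p95 < 6s, cost < $0.40 per memo section) can be checked mechanically once a run is summarized. The sketch below is illustrative: `RunSummary`, `p95_nearest_rank` and `failed_thresholds` are hypothetical names, the nearest-rank percentile is one of several valid definitions, and the sample numbers are invented for the example rather than taken from any benchmark.

```python
from dataclasses import dataclass


def p95_nearest_rank(samples: list[float]) -> float:
    """Return the 95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


@dataclass
class RunSummary:
    critical_failures: int
    mean_groundedness: float      # rubric score on a 1-5 scale
    p95_latency_s: float
    cost_per_section_usd: float


def failed_thresholds(run: RunSummary) -> list[str]:
    """Return the names of thresholds the run failed; empty means all passed."""
    failures = []
    if run.critical_failures > 0:
        failures.append("critical failures")
    if run.mean_groundedness < 4.0:
        failures.append("groundedness")
    if run.p95_latency_s >= 6.0:
        failures.append("p95 latency")
    if run.cost_per_section_usd >= 0.40:
        failures.append("cost per memo section")
    return failures


if __name__ == "__main__":
    latencies = [3.1, 4.0, 4.4, 5.2, 5.9, 4.8, 3.7, 4.1, 5.0, 4.6]  # invented
    run = RunSummary(0, 4.3, p95_nearest_rank(latencies), 0.35)
    print(failed_thresholds(run))
```

How failed thresholds map to go, limited go or no-go is left to the gate owners; the function only reports which thresholds were missed.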

Benchmark Result Table

Option	Quality	Critical failures	Cost/case	p95 latency	Architecture fit	Evidence maturity	Recommendation
Vendor A	86	0	$0.52	5.8s	strong API, weak log export	medium	pilot only if log export gap closed
Vendor B	79	2	$0.31	4.2s	black-box RAG	low	reject for regulated workflow
Internal RAG	81	0	$0.44	6.5s	strong control, moderate UX	high	hybrid candidate
No-AI baseline	62	0	staff time only	11 min	existing tools	high	keep as fallback and comparator

Template: Build-Buy Decision Record

Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:

Decision:
Choose build / buy / partner / hybrid / stop.

Context:
Describe business outcome, workflow, AI role, data boundary, risk tier and production constraints.

Options considered:
1. Buy vendor capability.
2. Build internal capability.
3. Partner/co-build.
4. Hybrid component architecture.
5. No-AI/process option.

Evidence reviewed:
- Intake card:
- Sandbox benchmark:
- Vendor scorecard:
- Cost and latency:
- Security/privacy review:
- Architecture fit:
- Concentration and exit constraints:
- User feedback:

Rationale:
Explain why this option best balances value, control, speed, risk, cost, architecture fit and reversibility.

Component boundary:
List what is bought, built, partnered and retained internally.

Conditions:
List blockers that must close before pilot or production.

Residual risk:
List risk accepted, owner and expiry.

Reversal triggers:
List events that require redesign, re-benchmark, exit or stop.

Component Boundary Table

Component	Decision	Owner	Reason
Base model	buy	AI platform	commodity capability, managed scale
Model gateway	build	platform architecture	routing, logging, policy and cost control
Domain knowledge ingestion	hybrid	data and operations	vendor extraction plus internal source registry
Eval datasets and rubrics	build	EvalOps and domain SMEs	institution-specific behavior contract
Case workflow	existing platform	operations technology	avoid parallel case system
Tool action layer	build	architecture and security	permissions, approval, idempotency, audit
Observability	hybrid	platform SRE	vendor telemetry exported to internal dashboard

Template: Risk Acceptance

Field	Required content	Example
Use case	workflow and AI role	payments scam intervention script assistant
Decision	proceed, limited pilot, pause, reject	limited pilot
Residual risk statement	what remains after controls	AI may under-escalate novel scam language not represented in eval set
Impact	customer, financial, compliance, operational, reputation	customer loss, complaint, regulatory scrutiny
Preventive controls	controls before output/action	typology source registry, mandatory human approval, prohibited payment action
Detective controls	monitoring and review	daily sample review, under-escalation QA, complaint monitoring
Corrective controls	what happens when risk materializes	disable route, update red-team set, retrain staff, customer remediation workflow
Evidence	basis for acceptance	sandbox report, red-team result, p95 latency, reviewer calibration
Owner	named accountable person	Head of Payments Fraud Operations
Expiry	time or trigger	60 days or first high-severity incident
Stop rule	hard rollback condition	any unauthorized payment action or confirmed data leakage
Approval	business, risk, security, privacy, architecture	named approvals with date

Risk acceptance rules:

No permanent acceptance for unresolved critical AI risks.
No acceptance without named owner and expiry.
No acceptance for missing evidence; missing evidence is a blocker.
No use of "human in the loop" unless review behavior is logged and measured.
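
The four rules above lend themselves to a simple record check. This is a sketch under stated assumptions: `RiskAcceptance` and `rule_violations` are hypothetical names, expiry is simplified to a calendar date (the template also allows a trigger such as a first high-severity incident), and "logged and measured" is reduced to a single boolean.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RiskAcceptance:
    owner: str
    expiry: Optional[date]               # None models an open-ended acceptance
    evidence: list[str]
    unresolved_critical_risk: bool
    relies_on_human_in_the_loop: bool
    review_behavior_logged: bool


def rule_violations(ra: RiskAcceptance) -> list[str]:
    """Return the risk acceptance rules this record breaks; empty means none."""
    found = []
    if ra.unresolved_critical_risk and ra.expiry is None:
        found.append("permanent acceptance of an unresolved critical risk")
    if not ra.owner.strip() or ra.expiry is None:
        found.append("missing named owner or expiry")
    if not ra.evidence:
        found.append("missing evidence (blocker)")
    if ra.relies_on_human_in_the_loop and not ra.review_behavior_logged:
        found.append("human in the loop claimed without logged review")
    return found


if __name__ == "__main__":
    pilot = RiskAcceptance(
        owner="Head of Payments Fraud Operations",
        expiry=date(2030, 1, 1),         # placeholder date for the example
        evidence=["sandbox report", "red-team result"],
        unresolved_critical_risk=False,
        relies_on_human_in_the_loop=True,
        review_behavior_logged=True,
    )
    print(rule_violations(pilot))
```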

Template: Production Gate Evidence Packet

Evidence area	Required artifact	Gate question
Business and workflow	intake card, workflow map, decision boundary	Is the AI role clear and limited?
Evaluation	eval contract, benchmark report, failure taxonomy	Did it meet quality and risk thresholds?
Data and privacy	data flow, data classification, retention, DLP test	Is data use approved and minimized?
Security	threat model, IAM, RBAC, prompt injection test, tool control	Can it be attacked or misused beyond appetite?
Architecture	C4/context diagram, component decision, ADR, integration design	Does it fit target architecture and operations?
Vendor and third party	scorecard, due diligence gaps, concentration review	Are vendor dependencies understood?
Cost and latency	unit cost, p95 latency, budget cap, capacity model	Can it run economically and within SLO?
Human oversight	review policy, training, override log, escalation path	Does human control actually operate?
Observability	trace schema, log export, dashboard, alert rules	Can one case be reconstructed and monitored?
Incident and rollback	kill switch, fallback, manual queue, severity matrix	Can we stop, degrade or recover?
Risk acceptance	residual risk memo and approvals	Are remaining risks named, owned and time-bound?
Operating model	RACI, runbook, support model, cadence	Who owns performance after release?

Gate outcomes:

Outcome	Meaning
Go	Evidence complete, thresholds met, controls operational
Limited go	Narrow user group, time box, volume cap, explicit residual risk
No-go	Critical evidence missing or threshold failed
Rework	Design or control issue can be fixed and re-tested
Stop	Value too weak or risk too high

PM / BA / Architecture Questions

PM Questions

Question	Strong answer
What user behavior changes?	Names the exact task, user, queue, decision and expected adoption signal
What is the baseline?	Current time, quality, cost, risk or customer experience measured or measurable
Why AI rather than process or rules?	AI handles unstructured, variable, evidence-heavy work that simpler options cannot solve
What is the MVP boundary?	Bounded role, limited users, prohibited actions and pilot duration
What would make you stop?	Predefined quality, risk, cost, latency or adoption failure

BA Questions

Question	Strong answer
What requirement becomes an eval contract?	Requirement maps to dataset, rubric, threshold and evidence
What are exceptions and edge cases?	Includes complaint, vulnerable customer, stale policy, missing document, suspicious pattern
Who owns final authority?	Human/system authority is explicit, logged and not hidden in AI output
What data can be used?	Field-level data boundary, masking, retention and purpose are defined
How is traceability maintained?	Requirement -> eval -> prompt/model/RAG/tool version -> gate decision

Architecture Questions

Question	Strong answer
Where is the control point?	model gateway, data boundary, RAG source registry, tool gateway, eval gate or workflow approval
What is vendor-specific?	UI, model, prompt, index, workflow, logs, evals, connectors and admin config are separated
Can one output be reconstructed?	Trace includes user, prompt, model, source, tool, output, review and final action
What fails safe?	timeout, vendor outage, model regression, cost spike and unsafe output route to fallback
What is the exit constraint?	Data, prompts, evals, logs, indexes, configs and workflows have export or rebuild path
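
The "Can one output be reconstructed?" answer above names eight trace elements. A minimal sketch of a trace record and a completeness check follows; the `TraceRecord` field names and types are assumptions chosen for illustration, not a schema from any vendor or standard.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TraceRecord:
    """One AI output with the context needed to reconstruct it later."""
    user: Optional[str] = None
    prompt: Optional[str] = None
    model: Optional[str] = None              # model name plus version or route
    sources: Optional[list[str]] = None      # retrieved document identifiers
    tool_calls: Optional[list[str]] = None   # empty list means no tools were used
    output: Optional[str] = None
    review: Optional[str] = None             # reviewer decision or override
    final_action: Optional[str] = None


def missing_fields(trace: TraceRecord) -> list[str]:
    """Return the names of trace elements that were never recorded."""
    return [f.name for f in fields(trace) if getattr(trace, f.name) is None]


if __name__ == "__main__":
    partial = TraceRecord(
        user="analyst-1",
        prompt="Summarize alert evidence",
        model="example-model-v1",            # hypothetical identifier
        sources=["case-123"],
        tool_calls=[],
        output="Draft narrative text",
    )
    print(missing_fields(partial))
```

Distinguishing `None` (not recorded) from an empty list (recorded, nothing used) matters for audit: a trace with no tool calls is complete, while a trace with unknown tool calls is not.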

Release Checklist

Use this checklist before moving from sandbox to pilot or production.

Check	Pass condition
Intake complete	owner, outcome, workflow, AI role, data, risk tier and no-AI option recorded
Sandbox approved	data plan, task set, cost cap and access boundary approved
Benchmark executed	vendor, internal and baseline options compared fairly
Critical failures closed	zero unresolved critical failures or documented no-go
Architecture ADR approved	build-buy boundary, integration and control points recorded
Data/privacy review complete	prompt, document, embedding, log, retention and support access covered
Security review complete	IAM, RBAC, injection, tool misuse, DLP and logging tested
Eval contract approved	metrics, thresholds, slices and critical failures accepted by owners
Cost/latency acceptable	cost per case and p95 latency inside pilot envelope
Human oversight ready	reviewers trained, override/escalation logged
Observability ready	trace, dashboard, alert and evidence export tested
Incident response ready	kill switch, fallback, manual queue and severity routing tested
Risk acceptance signed	residual risks named, owned, expiring and linked to evidence
Production support ready	runbook, support owner, review cadence and vendor contact path active

Executive Narrative

One-Page Narrative

Decision requested:
Approve a controlled pilot / reject / continue sandbox / proceed to procurement diligence for [use case].

Why now:
[Business workflow] has [measured pain]. AI is plausible because the work is [unstructured, evidence-heavy, repetitive, time-sensitive], but production use requires controls over data, model behavior, human authority and audit evidence.

Options evaluated:
We compared [vendor A], [vendor B], [internal option] and [no-AI baseline] using the same sandbox dataset, tasks, rubric, cost and latency measurements.

Evidence:
The leading option achieved [quality result], [critical failure result], [cost], [latency] and [architecture fit]. Key gaps are [gap list].

Recommendation:
Proceed with [build/buy/partner/hybrid/stop] because it best balances value, risk, speed, control, cost and reversibility.

Conditions:
Before production, close [security/privacy/eval/logging/vendor] gaps, approve risk acceptance, and pass the production gate.

Stop rule:
Stop or roll back if [critical failure, data leakage, unauthorized action, cost breach, latency breach, adoption failure] occurs.

CFO / COO / CTO / CRO Lens

Executive	What they need to hear
CFO	Cost per case, cost-to-learn, scale economics, avoided waste from no-go decisions
COO	Workflow impact, adoption, queue capacity, fallback and operational ownership
CTO	Architecture fit, integration, abstraction, observability, resilience and concentration risk
CRO / Compliance	Risk tier, controls, evidence, human authority, monitoring and stop rules
CPO / Business lead	User value, customer impact, MVP scope, learning plan and rollout conditions

Interview Drills

Drill 1: Contact Center Vendor Pitch

Prompt:

A vendor shows a strong GenAI contact center demo. The business wants to buy quickly. How do you respond?

Strong answer:

I would not block learning, but I would move the vendor into a controlled intake and sandbox path. First I define the target workflows, such as credit card fee questions and dispute status. Then I set the AI role as agent assist, not direct customer commitment. The sandbox uses approved policy documents, synthetic customer cases, complaint and vulnerable customer edge cases, and prompt-injection tests. I compare the vendor against current knowledge search and any internal RAG option. The scorecard includes groundedness, policy compliance, escalation, p95 latency, cost per interaction, trace export and architecture fit. If it passes, I recommend a limited pilot with human review and production gate conditions.

Drill 2: AML Workbench Build vs Buy

Prompt:

Should the bank buy an AML investigation copilot or build one?

Strong answer:

I would avoid a single answer at product level. AML is high-risk and evidence-heavy, so I would likely choose hybrid. We can buy commodity document summarization or UX components, but keep internal control over source registry, entitlement, case-system integration, eval datasets, red-flag taxonomy, model gateway, audit trace and final disposition boundary. The sandbox must prove no critical red-flag omission, grounded narrative, reviewer agreement, trace reconstruction and acceptable cost/latency. If a vendor cannot export evidence or respect source permissions, it cannot be a production candidate even if the demo is impressive.

Drill 3: Credit Decision Support Risk Challenge

Prompt:

The pilot improves underwriter productivity by 30 percent but has two unsupported adverse-action draft examples. What do you do?

Strong answer:

I would treat unsupported adverse-action language as a critical failure, not an average-quality issue. The gate outcome is no-go or limited rework, depending on whether the feature can be constrained. I would remove adverse-action drafting from scope, strengthen policy retrieval and citation checks, add fair-lending and protected-class proxy cases to the eval set, require human approval logging, and rerun regression. Productivity improvement does not override customer harm and regulatory evidence requirements.

Drill 4: Payments Fraud Real-Time Constraint

Prompt:

A fraud intervention AI is accurate but p95 latency exceeds the intervention window. Is it still acceptable?

Strong answer:

Not for real-time intervention. The benchmark must be interpreted against operational SLO, not just model quality. I would either constrain the AI to post-event case enrichment, precompute certain risk narratives, use a smaller model route, or keep deterministic real-time controls while AI supports analyst review. Architecture fit includes latency, fallback and queue impact; a high-quality answer that arrives too late fails the workflow.

Drill 5: CTO Challenge On Lock-In

Prompt:

The CTO asks how you prevent AI vendor lock-in during procurement intake.

Strong answer:

I separate commodity capability from control points. During intake and sandbox, I require component mapping for model, RAG, prompt, eval, tool, workflow and logs. I prefer a model gateway, internal eval datasets, source registry, tool gateway and telemetry export so we can compare or replace vendors. I also score exportability of prompts, logs, eval results, indexes, configs and data. I do not claim perfect portability; I make exit constraints explicit before pilot and price them into the build-buy decision.

Source Anchors

Anchor	Link	Usage
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	risk map, measure, manage and governance evidence
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	GenAI-specific sandbox risks and red-team cases
ISO/IEC 42001 AI management systems	https://www.iso.org/standard/81230.html	AI lifecycle accountability and management system checks
ISO/IEC/IEEE 29148 Requirements engineering	https://www.iso.org/standard/72089.html	requirements quality and validation discipline
ISO/IEC/IEEE 42010 Architecture description	https://www.iso.org/standard/74393.html	stakeholder concerns, architecture viewpoints and ADR rationale
Interagency Third-Party Risk Guidance, FDIC FIL-29-2023	https://www.fdic.gov/news/financial-institution-letters/2023/fil23029.html	third-party planning, due diligence, selection and monitoring lifecycle
FFIEC AIO booklet summary, OCC Bulletin 2021-30	https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-30.html	architecture, infrastructure, operations and resilience review
OWASP Top 10 for Large Language Model Applications	https://owasp.org/www-project-top-10-for-large-language-model-applications/	LLM security test cases for injection, data disclosure and excessive agency