
AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Playbook

469AI_PROCUREMENT_INTAKE_VENDOR_EVALUATION_SANDBOX_BUILD_BUY_PLAYBOOK.md

AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Decision Architecture Playbook

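Positioning: A hands-on handbook for experienced CBAP / AI BA / AI PM / Solution Architect / Enterprise Architect / financial retail transformation leads.
Goal: Turn AI ideas, vendor pitches and business pain points into an evaluable, gated, auditable and decision-ready procurement intake, sandbox benchmark and build-buy-partner evidence pack.
Boundary: This handbook sits upstream of contract negotiation, vendor exit strategy and the board investment narrative. It supplies upstream evidence and does not replace formal review by legal, compliance, procurement, privacy, security, model risk, finance or internal audit.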


Purpose And When To Use

Purpose

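Use this playbook to address five kinds of problems:

  1. A business sponsor or executive proposes an AI idea, and you need to decide whether it enters the AI portfolio.
  2. A vendor pitches unprompted, and the team needs to avoid demo-led procurement.
  3. Several options are wavering between build, buy, partner and hybrid.
  4. A PoC already exists but lacks evidence, risk boundaries, a cost model and production gates.
  5. AI vendor evaluation needs to become a demonstrable senior PM / BA / architecture portfolio asset.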

When To Use

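| Trigger | Use this playbook to produce |
| --- | --- |
| "We want to build an AI assistant" | AI procurement intake and use-case triage |
| "This vendor demo looks great" | vendor sandbox charter and benchmark plan |
| "Should we buy or build in-house?" | build-buy-partner decision record |
| "The PoC results look good, can we go live?" | production gate evidence packet |
| "The risk function asks where the evidence is" | metrics-control-evidence model and risk acceptance |
| "The CTO asks whether this creates lock-in" | architecture fit, concentration risk and exit constraint review |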

What Good Looks Like

AI request
  -> intake card
  -> risk-tiered triage
  -> sandbox charter
  -> benchmark plan
  -> vendor/internal/no-AI comparison
  -> build-buy-partner decision record
  -> risk acceptance if needed
  -> production gate evidence packet
  -> downstream contract, TPRM, model validation and release process

Operating Model

Stage Gates

| Gate | Decision | Required evidence | Stop signal |
| --- | --- | --- | --- |
| Gate 0: Intake completeness | Is this a valid AI candidate? | intake card, owner, outcome, workflow, data, no-AI option | no measurable outcome or no accountable owner |
| Gate 1: Use-case triage | What risk tier and path? | impact assessment, AI role, customer/regulatory effect | unacceptable risk or unclear decision authority |
| Gate 2: Sandbox approval | Can we test safely? | sandbox charter, data approval, benchmark plan, cost cap | production data or write access without approval |
| Gate 3: Evidence review | Which option is strongest? | scorecard, benchmark report, failure taxonomy, architecture fit | demo-only evidence or critical failure unresolved |
| Gate 4: Build-buy decision | build, buy, partner, hybrid, stop? | decision record, TCO, control ownership, concentration review | preference-only decision |
| Gate 5: Production promotion | Can this enter pilot or production? | evidence packet, risk acceptance, release checklist | missing trace, eval, data, security or owner evidence |
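
The stage gates can be read as an ordered evidence checklist. The sketch below is one hypothetical way to encode that reading in Python: the evidence names are transcribed from the table, while `GATES` and `first_blocked_gate` are illustrative names that are not part of the playbook. It checks only whether evidence is on file; stop signals still need human judgment.

```python
# Required evidence per gate, transcribed from the Stage Gates table.
# Dicts keep insertion order, so gates are checked from Gate 0 upward.
GATES = {
    "Gate 0: Intake completeness": {
        "intake card", "owner", "outcome", "workflow", "data", "no-AI option"},
    "Gate 1: Use-case triage": {
        "impact assessment", "AI role", "customer/regulatory effect"},
    "Gate 2: Sandbox approval": {
        "sandbox charter", "data approval", "benchmark plan", "cost cap"},
    "Gate 3: Evidence review": {
        "scorecard", "benchmark report", "failure taxonomy", "architecture fit"},
    "Gate 4: Build-buy decision": {
        "decision record", "TCO", "control ownership", "concentration review"},
    "Gate 5: Production promotion": {
        "evidence packet", "risk acceptance", "release checklist"},
}


def first_blocked_gate(evidence_on_file):
    """Return (gate name, sorted missing evidence) for the first gate
    that cannot pass, or None when every gate has its evidence."""
    for name, required in GATES.items():
        missing = required - set(evidence_on_file)
        if missing:
            return name, sorted(missing)
    return None
```

A request with a complete intake card and triage but no sandbox paperwork would be reported as blocked at Gate 2, with the four missing sandbox artifacts listed.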

RACI

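| Activity | Business owner | AI PM | CBAP / BA | Architect | Security / Privacy | Risk / Compliance / MRM | Procurement / TPRM | Operations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intake card | A | R | R | C | C | C | I | C |
| Triage | A | R | R | R | C | A/R | C | C |
| Sandbox charter | C | A/R | R | R | R | C | C | C |
| Benchmark design | C | R | R | R | C | A/R | I | C |
| Vendor scorecard | C | A/R | R | R | R | R | A/R | C |
| Build-buy decision | A | R | C | A/R | C | C | C | C |
| Risk acceptance | A | R | C | C | C | A/R | C | C |
| Production gate | A | R | R | R | R | R | C | A/R |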

A = accountable, R = responsible, C = consulted, I = informed.

Work Products

| Work product | Minimum content | Reuse |
| --- | --- | --- |
| AI procurement intake | outcome, workflow, AI role, data, risk, no-AI option | portfolio funnel and prioritization |
| Vendor sandbox scorecard | quality, risk, cost, latency, architecture fit, evidence | vendor comparison and procurement input |
| Benchmark plan | dataset, tasks, rubric, threshold, sample size, measurement | eval contract and release gate |
| Build-buy decision record | options, component boundary, decision, rationale, reversal trigger | ADR and executive memo input |
| Risk acceptance | residual risk, controls, evidence, owner, expiry, stop rule | risk governance and audit |
| Production gate packet | traceable evidence needed for pilot/production | release, model validation and operating readiness |

Template: AI Procurement Intake

| Field | Required answer | Example: AML investigation workbench |
| --- | --- | --- |
| Request name | Specific workflow and AI role | AML investigation narrative copilot |
| Business owner | Named accountable owner | Head of AML Operations |
| Outcome | Baseline and target movement | Reduce evidence collection and narrative drafting time by 25 percent without increasing missed red flags |
| Workflow | Current process and insertion point | Analyst reviews alert, retrieves transactions, checks customer profile, drafts case narrative |
| AI role | read, retrieve, summarize, recommend, draft, decide, act | retrieve, summarize, draft; no final disposition |
| Users | Primary and secondary users | AML analyst, QA reviewer, team lead, model risk reviewer |
| Customer/regulatory impact | Direct, indirect or none | Indirect regulatory impact through case evidence and SAR support |
| Data sources | Systems, documents, logs, retention | transaction history, customer profile, KYC docs, prior cases, typology library |
| Sensitive data | PII, PCI, account, transaction, protected class, SAR-sensitive | PII, account, transaction, KYC and investigation-sensitive content |
| No-AI alternative | Process, rules, search, RPA, reporting | improve search, standard narrative templates, rules-based evidence checklist |
| Risk tier | Low, medium, high, critical | High |
| Sandbox data plan | synthetic, masked, historical, red-team | masked historical cases plus synthetic typology cases |
| Evaluation hypothesis | What must be proven | AI improves evidence completeness and drafting time while critical omissions remain zero |
| Architecture concern | Main integration/control issue | source entitlement, trace, case-system integration, human approval |
| Initial recommendation | reject, defer, sandbox, direct purchase, internal build | controlled sandbox only |

Intake Triage Rules

| If this is true | Route |
| --- | --- |
| No owner, no baseline, no decision boundary | reject or defer |
| Data cannot be used safely even in sandbox | defer until data plan exists |
| AI would make high-impact decisions without human authority | redesign scope |
| Value is mostly search or process discipline | compare AI against no-AI improvement |
| Workflow is common and low risk | buy candidate with sandbox |
| Workflow is regulated, high impact or evidence-heavy | hybrid candidate with stronger gate |
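
The triage rules can be sketched as a small routing function. Two things below are assumptions rather than playbook content: the precedence (rules are checked top to bottom in table order, with the regulated check ahead of the low-risk default) and the field names on the hypothetical `IntakeFacts` record.

```python
from dataclasses import dataclass


@dataclass
class IntakeFacts:
    """Hypothetical yes/no facts captured from the intake card."""
    has_owner: bool
    has_baseline: bool
    has_decision_boundary: bool
    sandbox_data_safe: bool
    ai_decides_without_human: bool
    value_mostly_search_or_process: bool
    regulated_or_high_impact: bool


def triage_route(facts):
    """Return the route string from the Intake Triage Rules table."""
    if not (facts.has_owner and facts.has_baseline and facts.has_decision_boundary):
        return "reject or defer"
    if not facts.sandbox_data_safe:
        return "defer until data plan exists"
    if facts.ai_decides_without_human:
        return "redesign scope"
    if facts.value_mostly_search_or_process:
        return "compare AI against no-AI improvement"
    if facts.regulated_or_high_impact:
        return "hybrid candidate with stronger gate"
    # Assumption: anything left is treated as a common, low-risk workflow.
    return "buy candidate with sandbox"
```

Under this reading, the AML investigation example from the intake template (named owner, safe sandbox data plan, no final disposition by AI, regulated and evidence-heavy) routes to "hybrid candidate with stronger gate".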

Template: Vendor Sandbox Scorecard

Score each dimension from 1 to 5, then apply the weight. Adjust weights by use case risk. For AML, KYC, credit and payments, increase governance, audit, data and control weights.

| Dimension | Weight | Score 1 | Score 3 | Score 5 | Evidence required |
| --- | --- | --- | --- | --- | --- |
| Business workflow fit | 10 | generic demo | partial workflow | fits real roles, queues and exceptions | workflow demo using sandbox tasks |
| Output quality | 12 | claims only | acceptable average score | strong score plus slice analysis | eval report and samples |
| Critical failure control | 12 | severe failures | mitigations proposed | no critical failures in approved set | failure taxonomy |
| Data privacy | 10 | vague processing | basic data terms | clear prompt, embedding, log, support and retention controls | data flow and settings |
| Security | 10 | weak IAM/logging | basic SSO/RBAC | strong IAM, tenant isolation, prompt/tool threat testing | security evidence |
| Eval maturity | 10 | vendor benchmark only | supports custom eval | supports customer gold set, red-team and export | eval methodology |
| Architecture fit | 10 | black box | API integration | gateway, trace, SIEM, IAM, RAG and tool boundary fit | architecture review |
| Audit evidence | 8 | transcript only | partial logs | full trace and admin audit export | trace sample |
| Cost predictability | 6 | unclear | rate card | cost per case, caps, variance and scale model | cost workbook |
| Latency/reliability | 5 | best effort | basic SLA | p95 measured, rate limits known, fallback path | latency report |
| Concentration/exit constraint | 5 | heavy lock-in | partial export | portable data, eval, prompt, logs and config | export review |
| Vendor operating maturity | 2 | sales-led | some process | support, incident, change and roadmap discipline | vendor evidence |

Decision bands:

| Weighted score | Interpretation |
| --- | --- |
| 0-55 | Do not pilot; use only for learning |
| 56-70 | Limited sandbox extension or low-risk pilot only |
| 71-84 | Candidate for controlled pilot if blockers closed |
| 85-100 | Strong candidate, still subject to production gate |

Hard blockers override score:

  • Critical customer harm, regulatory, privacy or security failure unresolved.
  • Vendor cannot explain data route, retention or support access.
  • No evidence export for high-risk use case.
  • Tool actions cannot be constrained or audited.
  • Cost or latency cannot be modeled for production volume.
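
A minimal sketch of how the scorecard, decision bands and hard blockers could fit together. The weights (which sum to 100) and band labels come from the tables above. The normalization is an assumption: each 1-5 score is multiplied by its weight and the total divided by 5, so a perfect scorecard lands on 100 and the lowest possible total is 20. Fractional totals are assigned to the band whose lower bound they reach, which is also an assumption. `weighted_score` and `interpret` are illustrative names.

```python
WEIGHTS = {
    "Business workflow fit": 10,
    "Output quality": 12,
    "Critical failure control": 12,
    "Data privacy": 10,
    "Security": 10,
    "Eval maturity": 10,
    "Architecture fit": 10,
    "Audit evidence": 8,
    "Cost predictability": 6,
    "Latency/reliability": 5,
    "Concentration/exit constraint": 5,
    "Vendor operating maturity": 2,
}

# (lower bound, interpretation), checked from the highest band down.
BANDS = [
    (85, "Strong candidate, still subject to production gate"),
    (71, "Candidate for controlled pilot if blockers closed"),
    (56, "Limited sandbox extension or low-risk pilot only"),
    (0, "Do not pilot; use only for learning"),
]


def weighted_score(scores):
    """Normalize 1-5 scores to a 0-100 scale (assumed: weight * score / 5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard dimensions")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each score must be between 1 and 5")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / 5


def interpret(scores, hard_blockers=()):
    """Return (total, interpretation); any hard blocker overrides the band."""
    total = weighted_score(scores)
    if hard_blockers:
        return total, "Blocked: " + "; ".join(hard_blockers)
    for lower_bound, label in BANDS:
        if total >= lower_bound:
            return total, label
```

For a high-risk use case, the weights in `WEIGHTS` would be rebalanced toward governance, audit, data and control before scoring, as the scorecard instructions describe.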

Template: Benchmark Plan

| Section | Required content | Financial retail example |
| --- | --- | --- |
| Use case | workflow and AI role | credit policy support copilot for underwriter memo drafting |
| Options compared | vendor A, vendor B, internal baseline, no-AI baseline | two vendor copilots, internal RAG prototype, current policy search |
| Dataset | source, size, masking, lineage, approval | 300 historical policy questions, 80 edge cases, 40 red-team prompts |
| Task set | representative tasks | retrieve policy, summarize exception, draft memo section, refuse unsupported decision |
| Rubric | scoring dimensions and severity | groundedness, completeness, policy compliance, adverse-action boundary |
| Critical failures | zero-tolerance failures | protected-class inference, unsupported decline reason, stale policy citation |
| Metrics | quality, cost, latency, adoption proxy, risk | pass rate, critical failure rate, p95 latency, cost per memo, reviewer agreement |
| Slice analysis | product, customer type, geography, risk tier | small business loan, credit card, secured loan, state-specific policy |
| Controls tested | privacy, security, prompt injection, DLP, entitlement | restricted policy, prompt injection, PII masking, role-based source access |
| Run protocol | repeat count, temperature, model version, evaluator | fixed model route, three repeated runs for unstable tasks, SME review |
| Acceptance threshold | go, limited go, no-go | no critical failures, groundedness >= 4/5, p95 < 6s, cost < $0.40 per memo section |
| Evidence package | files and logs retained | prompts, outputs, traces, citations, evaluator notes, cost/latency export |
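
The example acceptance threshold can be checked mechanically once sandbox runs are recorded. The sketch below is illustrative only: `RunResult` and its fields are hypothetical, groundedness and cost are averaged across runs (the plan does not say how to aggregate), p95 uses the nearest-rank method, and a "limited go" outcome is left to human judgment, so the function returns only "go" or "no-go".

```python
import math
from dataclasses import dataclass


@dataclass
class RunResult:
    """One benchmark run for one task (hypothetical record shape)."""
    groundedness: float      # SME rubric score on a 1-5 scale
    critical_failure: bool   # any zero-tolerance failure observed
    latency_s: float         # end-to-end latency in seconds
    cost_usd: float          # cost of producing one memo section


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def acceptance(results):
    """Return ("go" | "no-go", names of failed checks)."""
    if not results:
        raise ValueError("no benchmark results to evaluate")
    n = len(results)
    checks = {
        "no critical failures": not any(r.critical_failure for r in results),
        "groundedness >= 4/5": sum(r.groundedness for r in results) / n >= 4.0,
        "p95 latency < 6s": p95([r.latency_s for r in results]) < 6.0,
        "cost < $0.40 per memo section": sum(r.cost_usd for r in results) / n < 0.40,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("go" if not failed else "no-go"), failed
```

Returning the list of failed checks, and not only the verdict, keeps the result usable as an entry in the evidence package.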

Benchmark Result Table

| Option | Quality | Critical failures | Cost/case | p95 latency | Architecture fit | Evidence maturity | Recommendation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vendor A | 86 | 0 | $0.52 | 5.8s | strong API, weak log export | medium | pilot only if log export gap closed |
| Vendor B | 79 | 2 | $0.31 | 4.2s | black-box RAG | low | reject for regulated workflow |
| Internal RAG | 81 | 0 | $0.44 | 6.5s | strong control, moderate UX | high | hybrid candidate |
| No-AI baseline | 62 | 0 | staff time only | 11 min | existing tools | high | keep as fallback and comparator |

Template: Build-Buy Decision Record

Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:

Decision:
Choose build / buy / partner / hybrid / stop.

Context:
Describe business outcome, workflow, AI role, data boundary, risk tier and production constraints.

Options considered:
1. Buy vendor capability.
2. Build internal capability.
3. Partner/co-build.
4. Hybrid component architecture.
5. No-AI/process option.

Evidence reviewed:
- Intake card:
- Sandbox benchmark:
- Vendor scorecard:
- Cost and latency:
- Security/privacy review:
- Architecture fit:
- Concentration and exit constraints:
- User feedback:

Rationale:
Explain why this option best balances value, control, speed, risk, cost, architecture fit and reversibility.

Component boundary:
List what is bought, built, partnered and retained internally.

Conditions:
List blockers that must close before pilot or production.

Residual risk:
List risk accepted, owner and expiry.

Reversal triggers:
List events that require redesign, re-benchmark, exit or stop.

Component Boundary Table

| Component | Decision | Owner | Reason |
| --- | --- | --- | --- |
| Base model | buy | AI platform | commodity capability, managed scale |
| Model gateway | build | platform architecture | routing, logging, policy and cost control |
| Domain knowledge ingestion | hybrid | data and operations | vendor extraction plus internal source registry |
| Eval datasets and rubrics | build | EvalOps and domain SMEs | institution-specific behavior contract |
| Case workflow | existing platform | operations technology | avoid parallel case system |
| Tool action layer | build | architecture and security | permissions, approval, idempotency, audit |
| Observability | hybrid | platform SRE | vendor telemetry exported to internal dashboard |

Template: Risk Acceptance

| Field | Required content | Example |
| --- | --- | --- |
| Use case | workflow and AI role | payments scam intervention script assistant |
| Decision | proceed, limited pilot, pause, reject | limited pilot |
| Residual risk statement | what remains after controls | AI may under-escalate novel scam language not represented in eval set |
| Impact | customer, financial, compliance, operational, reputation | customer loss, complaint, regulatory scrutiny |
| Preventive controls | controls before output/action | typology source registry, mandatory human approval, prohibited payment action |
| Detective controls | monitoring and review | daily sample review, under-escalation QA, complaint monitoring |
| Corrective controls | what happens when risk materializes | disable route, update red-team set, retrain staff, customer remediation workflow |
| Evidence | basis for acceptance | sandbox report, red-team result, p95 latency, reviewer calibration |
| Owner | named accountable person | Head of Payments Fraud Operations |
| Expiry | time or trigger | 60 days or first high-severity incident |
| Stop rule | hard rollback condition | any unauthorized payment action or confirmed data leakage |
| Approval | business, risk, security, privacy, architecture | named approvals with date |

Risk acceptance rules:

  • No permanent acceptance for unresolved critical AI risks.
  • No acceptance without named owner and expiry.
  • No acceptance for missing evidence; missing evidence is a blocker.
  • No use of "human in the loop" unless review behavior is logged and measured.
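
The four rules above lend themselves to an automated pre-check before a risk acceptance goes for signature. This is a sketch under stated assumptions: the `RiskAcceptance` record and its field names are hypothetical, and expiry is modeled as a calendar date only, whereas the template also allows a trigger-based expiry such as a first high-severity incident.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RiskAcceptance:
    """Hypothetical, simplified risk acceptance record."""
    residual_risk: str
    owner: Optional[str] = None
    expiry: Optional[date] = None            # date-based expiry only (assumption)
    evidence: List[str] = field(default_factory=list)
    critical_unresolved: bool = False
    claims_human_in_loop: bool = False
    review_logged_and_measured: bool = False


def violations(ra):
    """Return the risk acceptance rules that this record breaks."""
    found = []
    if ra.critical_unresolved and ra.expiry is None:
        found.append("permanent acceptance of an unresolved critical risk")
    if not ra.owner or ra.expiry is None:
        found.append("named owner and expiry are both required")
    if not ra.evidence:
        found.append("missing evidence is a blocker")
    if ra.claims_human_in_loop and not ra.review_logged_and_measured:
        found.append("human-in-the-loop claimed without logged, measured review")
    return found
```

An empty list means the record passes these four checks only; it does not mean the residual risk itself is acceptable, which remains a decision for the named approvers.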

Template: Production Gate Evidence Packet

| Evidence area | Required artifact | Gate question |
| --- | --- | --- |
| Business and workflow | intake card, workflow map, decision boundary | Is the AI role clear and limited? |
| Evaluation | eval contract, benchmark report, failure taxonomy | Did it meet quality and risk thresholds? |
| Data and privacy | data flow, data classification, retention, DLP test | Is data use approved and minimized? |
| Security | threat model, IAM, RBAC, prompt injection test, tool control | Can it be attacked or misused beyond appetite? |
| Architecture | C4/context diagram, component decision, ADR, integration design | Does it fit target architecture and operations? |
| Vendor and third party | scorecard, due diligence gaps, concentration review | Are vendor dependencies understood? |
| Cost and latency | unit cost, p95 latency, budget cap, capacity model | Can it run economically and within SLO? |
| Human oversight | review policy, training, override log, escalation path | Does human control actually operate? |
| Observability | trace schema, log export, dashboard, alert rules | Can one case be reconstructed and monitored? |
| Incident and rollback | kill switch, fallback, manual queue, severity matrix | Can we stop, degrade or recover? |
| Risk acceptance | residual risk memo and approvals | Are remaining risks named, owned and time-bound? |
| Operating model | RACI, runbook, support model, cadence | Who owns performance after release? |

Gate outcomes:

| Outcome | Meaning |
| --- | --- |
| Go | Evidence complete, thresholds met, controls operational |
| Limited go | Narrow user group, time box, volume cap, explicit residual risk |
| No-go | Critical evidence missing or threshold failed |
| Rework | Design or control issue can be fixed and re-tested |
| Stop | Value too weak or risk too high |

PM / BA / Architecture Questions

PM Questions

| Question | Strong answer |
| --- | --- |
| What user behavior changes? | Names the exact task, user, queue, decision and expected adoption signal |
| What is the baseline? | Current time, quality, cost, risk or customer experience measured or measurable |
| Why AI rather than process or rules? | AI handles unstructured, variable, evidence-heavy work that simpler options cannot solve |
| What is the MVP boundary? | Bounded role, limited users, prohibited actions and pilot duration |
| What would make you stop? | Predefined quality, risk, cost, latency or adoption failure |

BA Questions

| Question | Strong answer |
| --- | --- |
| What requirement becomes an eval contract? | Requirement maps to dataset, rubric, threshold and evidence |
| What are exceptions and edge cases? | Includes complaint, vulnerable customer, stale policy, missing document, suspicious pattern |
| Who owns final authority? | Human/system authority is explicit, logged and not hidden in AI output |
| What data can be used? | Field-level data boundary, masking, retention and purpose are defined |
| How is traceability maintained? | Requirement -> eval -> prompt/model/RAG/tool version -> gate decision |

Architecture Questions

| Question | Strong answer |
| --- | --- |
| Where is the control point? | model gateway, data boundary, RAG source registry, tool gateway, eval gate or workflow approval |
| What is vendor-specific? | UI, model, prompt, index, workflow, logs, evals, connectors and admin config are separated |
| Can one output be reconstructed? | Trace includes user, prompt, model, source, tool, output, review and final action |
| What fails safe? | timeout, vendor outage, model regression, cost spike and unsafe output route to fallback |
| What is the exit constraint? | Data, prompts, evals, logs, indexes, configs and workflows have export or rebuild path |
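
The reconstruction question can be tested against exported trace samples during the sandbox. A minimal sketch, assuming each trace arrives as a dictionary keyed by the eight elements named in the table; the key names and the `missing_trace_fields` helper are illustrative and do not reflect any vendor's export schema.

```python
REQUIRED_TRACE_FIELDS = (
    "user", "prompt", "model", "source",
    "tool", "output", "review", "final_action",
)


def missing_trace_fields(trace):
    """Return the required fields that are absent or empty in one trace.

    A step that legitimately did not happen (for example, no tool call)
    should be recorded explicitly, such as "none", and not left blank.
    """
    return [name for name in REQUIRED_TRACE_FIELDS
            if trace.get(name) in (None, "", [])]
```

A vendor export that returns an empty list for every sampled case gives evidence for the "Audit evidence" scorecard dimension; recurring gaps in `review` or `final_action` suggest that human oversight is not actually being captured.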

Release Checklist

Use this checklist before moving from sandbox to pilot or production.

| Check | Pass condition |
| --- | --- |
| Intake complete | owner, outcome, workflow, AI role, data, risk tier and no-AI option recorded |
| Sandbox approved | data plan, task set, cost cap and access boundary approved |
| Benchmark executed | vendor, internal and baseline options compared fairly |
| Critical failures closed | zero unresolved critical failures or documented no-go |
| Architecture ADR approved | build-buy boundary, integration and control points recorded |
| Data/privacy review complete | prompt, document, embedding, log, retention and support access covered |
| Security review complete | IAM, RBAC, injection, tool misuse, DLP and logging tested |
| Eval contract approved | metrics, thresholds, slices and critical failures accepted by owners |
| Cost/latency acceptable | cost per case and p95 latency inside pilot envelope |
| Human oversight ready | reviewers trained, override/escalation logged |
| Observability ready | trace, dashboard, alert and evidence export tested |
| Incident response ready | kill switch, fallback, manual queue and severity routing tested |
| Risk acceptance signed | residual risks named, owned, expiring and linked to evidence |
| Production support ready | runbook, support owner, review cadence and vendor contact path active |

Executive Narrative

One-Page Narrative

Decision requested:
Approve a controlled pilot / reject / continue sandbox / proceed to procurement diligence for [use case].

Why now:
[Business workflow] has [measured pain]. AI is plausible because the work is [unstructured, evidence-heavy, repetitive, time-sensitive], but production use requires controls over data, model behavior, human authority and audit evidence.

Options evaluated:
We compared [vendor A], [vendor B], [internal option] and [no-AI baseline] using the same sandbox dataset, tasks, rubric, cost and latency measurements.

Evidence:
The leading option achieved [quality result], [critical failure result], [cost], [latency] and [architecture fit]. Key gaps are [gap list].

Recommendation:
Proceed with [build/buy/partner/hybrid/stop] because it best balances value, risk, speed, control, cost and reversibility.

Conditions:
Before production, close [security/privacy/eval/logging/vendor] gaps, approve risk acceptance, and pass the production gate.

Stop rule:
Stop or roll back if [critical failure, data leakage, unauthorized action, cost breach, latency breach, adoption failure] occurs.

CFO / COO / CTO / CRO Lens

| Executive | What they need to hear |
| --- | --- |
| CFO | Cost per case, cost-to-learn, scale economics, avoided waste from no-go decisions |
| COO | Workflow impact, adoption, queue capacity, fallback and operational ownership |
| CTO | Architecture fit, integration, abstraction, observability, resilience and concentration risk |
| CRO / Compliance | Risk tier, controls, evidence, human authority, monitoring and stop rules |
| CPO / Business lead | User value, customer impact, MVP scope, learning plan and rollout conditions |

Interview Drills

Drill 1: Contact Center Vendor Pitch

Prompt:

A vendor shows a strong GenAI contact center demo. The business wants to buy quickly. How do you respond?

Strong answer:

I would not block learning, but I would move the vendor into a controlled intake and sandbox path. First I define the target workflows, such as credit card fee questions and dispute status. Then I set the AI role as agent assist, not direct customer commitment. The sandbox uses approved policy documents, synthetic customer cases, complaint and vulnerable customer edge cases, and prompt-injection tests. I compare the vendor against current knowledge search and any internal RAG option. The scorecard includes groundedness, policy compliance, escalation, p95 latency, cost per interaction, trace export and architecture fit. If it passes, I recommend a limited pilot with human review and production gate conditions.

Drill 2: AML Workbench Build vs Buy

Prompt:

Should the bank buy an AML investigation copilot or build one?

Strong answer:

I would avoid a single answer at product level. AML is high-risk and evidence-heavy, so I would likely choose hybrid. We can buy commodity document summarization or UX components, but keep internal control over source registry, entitlement, case-system integration, eval datasets, red-flag taxonomy, model gateway, audit trace and final disposition boundary. The sandbox must prove no critical red-flag omission, grounded narrative, reviewer agreement, trace reconstruction and acceptable cost/latency. If a vendor cannot export evidence or respect source permissions, it cannot be a production candidate even if the demo is impressive.

Drill 3: Credit Decision Support Risk Challenge

Prompt:

The pilot improves underwriter productivity by 30 percent but has two unsupported adverse-action draft examples. What do you do?

Strong answer:

I would treat unsupported adverse-action language as a critical failure, not an average-quality issue. The gate outcome is no-go or limited rework, depending on whether the feature can be constrained. I would remove adverse-action drafting from scope, strengthen policy retrieval and citation checks, add fair-lending and protected-class proxy cases to the eval set, require human approval logging, and rerun regression. Productivity improvement does not override customer harm and regulatory evidence requirements.

Drill 4: Payments Fraud Real-Time Constraint

Prompt:

A fraud intervention AI is accurate but p95 latency exceeds the intervention window. Is it still acceptable?

Strong answer:

Not for real-time intervention. The benchmark must be interpreted against operational SLO, not just model quality. I would either constrain the AI to post-event case enrichment, precompute certain risk narratives, use a smaller model route, or keep deterministic real-time controls while AI supports analyst review. Architecture fit includes latency, fallback and queue impact; a high-quality answer that arrives too late fails the workflow.

Drill 5: CTO Challenge On Lock-In

Prompt:

The CTO asks how you prevent AI vendor lock-in during procurement intake.

Strong answer:

I separate commodity capability from control points. During intake and sandbox, I require component mapping for model, RAG, prompt, eval, tool, workflow and logs. I prefer a model gateway, internal eval datasets, source registry, tool gateway and telemetry export so we can compare or replace vendors. I also score exportability of prompts, logs, eval results, indexes, configs and data. I do not claim perfect portability; I make exit constraints explicit before pilot and price them into the build-buy decision.


Source Anchors

| Anchor | Link | Usage |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | risk map, measure, manage and governance evidence |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | GenAI-specific sandbox risks and red-team cases |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | AI lifecycle accountability and management system checks |
| ISO/IEC/IEEE 29148 Requirements engineering | https://www.iso.org/standard/72089.html | requirements quality and validation discipline |
| ISO/IEC/IEEE 42010 Architecture description | https://www.iso.org/standard/74393.html | stakeholder concerns, architecture viewpoints and ADR rationale |
| Interagency Third-Party Risk Guidance, FDIC FIL-29-2023 | https://www.fdic.gov/news/financial-institution-letters/2023/fil23029.html | third-party planning, due diligence, selection and monitoring lifecycle |
| FFIEC AIO booklet summary, OCC Bulletin 2021-30 | https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-30.html | architecture, infrastructure, operations and resilience review |
| OWASP Top 10 for Large Language Model Applications | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | LLM security test cases for injection, data disclosure and excessive agency |