AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Decision Architecture Playbook
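Positioning: A practitioner's handbook for experienced CBAP, AI BA, AI PM, Solution Architect, Enterprise Architect and financial retail transformation lead roles.
Goal: Turn AI ideas, vendor pitches and business pain points into a procurement intake, sandbox benchmark and build-buy-partner evidence pack that can be evaluated, gated, audited and decided on.
Scope: This handbook sits upstream of contract negotiation, vendor exit strategy and the board-level investment narrative. It supplies upstream evidence and does not replace formal review by legal, compliance, procurement, privacy, security, model risk, finance or internal audit.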
Purpose And When To Use
Purpose
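Use this playbook to address five kinds of problem:
- A business stakeholder or executive proposes an AI idea, and you need to decide whether it enters the AI portfolio.
- A vendor pitches unprompted, and the team needs to avoid demo-led procurement.
- Several options are wavering between build, buy, partner and hybrid.
- A PoC already exists but lacks evidence, risk boundaries, a cost model and production gates.
- AI vendor evaluation needs to become a presentable portfolio asset for senior PM, BA or architecture work.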
When To Use
| Trigger | Use this playbook to produce |
|---|---|
| "We want to build an AI assistant" | AI procurement intake and use-case triage |
| "This vendor demo looks great" | vendor sandbox charter and benchmark plan |
| "Should we buy or build in-house?" | build-buy-partner decision record |
| "The PoC results look good; can we go live?" | production gate evidence packet |
| "The risk function is asking where the evidence is" | metrics-control-evidence model and risk acceptance |
| "The CTO is asking whether we will be locked in" | architecture fit, concentration risk and exit constraint review |
What Good Looks Like
AI request
-> intake card
-> risk-tiered triage
-> sandbox charter
-> benchmark plan
-> vendor/internal/no-AI comparison
-> build-buy-partner decision record
-> risk acceptance if needed
-> production gate evidence packet
-> downstream contract, TPRM, model validation and release process
Operating Model
Stage Gates
| Gate | Decision | Required evidence | Stop signal |
|---|---|---|---|
| Gate 0: Intake completeness | Is this a valid AI candidate? | intake card, owner, outcome, workflow, data, no-AI option | no measurable outcome or no accountable owner |
| Gate 1: Use-case triage | What risk tier and path? | impact assessment, AI role, customer/regulatory effect | unacceptable risk or unclear decision authority |
| Gate 2: Sandbox approval | Can we test safely? | sandbox charter, data approval, benchmark plan, cost cap | production data or write access without approval |
| Gate 3: Evidence review | Which option is strongest? | scorecard, benchmark report, failure taxonomy, architecture fit | demo-only evidence or critical failure unresolved |
| Gate 4: Build-buy decision | build, buy, partner, hybrid, stop? | decision record, TCO, control ownership, concentration review | preference-only decision |
| Gate 5: Production promotion | Can this enter pilot or production? | evidence packet, risk acceptance, release checklist | missing trace, eval, data, security or owner evidence |
RACI
| Activity | Business owner | AI PM | CBAP / BA | Architect | Security / Privacy | Risk / Compliance / MRM | Procurement / TPRM | Operations |
|---|---|---|---|---|---|---|---|---|
| Intake card | A | R | R | C | C | C | I | C |
| Triage | A | R | R | R | C | A/R | C | C |
| Sandbox charter | C | A/R | R | R | R | C | C | C |
| Benchmark design | C | R | R | R | C | A/R | I | C |
| Vendor scorecard | C | A/R | R | R | R | R | A/R | C |
| Build-buy decision | A | R | C | A/R | C | C | C | C |
| Risk acceptance | A | R | C | C | C | A/R | C | C |
| Production gate | A | R | R | R | R | R | C | A/R |
A = accountable, R = responsible, C = consulted, I = informed.
Work Products
| Work product | Minimum content | Reuse |
|---|---|---|
| AI procurement intake | outcome, workflow, AI role, data, risk, no-AI option | portfolio funnel and prioritization |
| Vendor sandbox scorecard | quality, risk, cost, latency, architecture fit, evidence | vendor comparison and procurement input |
| Benchmark plan | dataset, tasks, rubric, threshold, sample size, measurement | eval contract and release gate |
| Build-buy decision record | options, component boundary, decision, rationale, reversal trigger | ADR and executive memo input |
| Risk acceptance | residual risk, controls, evidence, owner, expiry, stop rule | risk governance and audit |
| Production gate packet | traceable evidence needed for pilot/production | release, model validation and operating readiness |
Template: AI Procurement Intake
| Field | Required answer | Example: AML investigation workbench |
|---|---|---|
| Request name | Specific workflow and AI role | AML investigation narrative copilot |
| Business owner | Named accountable owner | Head of AML Operations |
| Outcome | Baseline and target movement | Reduce evidence collection and narrative drafting time by 25 percent without increasing missed red flags |
| Workflow | Current process and insertion point | Analyst reviews alert, retrieves transactions, checks customer profile, drafts case narrative |
| AI role | read, retrieve, summarize, recommend, draft, decide, act | retrieve, summarize, draft; no final disposition |
| Users | Primary and secondary users | AML analyst, QA reviewer, team lead, model risk reviewer |
| Customer/regulatory impact | Direct, indirect or none | Indirect regulatory impact through case evidence and SAR support |
| Data sources | Systems, documents, logs, retention | transaction history, customer profile, KYC docs, prior cases, typology library |
| Sensitive data | PII, PCI, account, transaction, protected class, SAR-sensitive | PII, account, transaction, KYC and investigation-sensitive content |
| No-AI alternative | Process, rules, search, RPA, reporting | improve search, standard narrative templates, rules-based evidence checklist |
| Risk tier | Low, medium, high, critical | High |
| Sandbox data plan | synthetic, masked, historical, red-team | masked historical cases plus synthetic typology cases |
| Evaluation hypothesis | What must be proven | AI improves evidence completeness and drafting time while critical omissions remain zero |
| Architecture concern | Main integration/control issue | source entitlement, trace, case-system integration, human approval |
| Initial recommendation | reject, defer, sandbox, direct purchase, internal build | controlled sandbox only |
Intake Triage Rules
| If this is true | Route |
|---|---|
| No owner, no baseline, no decision boundary | reject or defer |
| Data cannot be used safely even in sandbox | defer until data plan exists |
| AI would make high-impact decisions without human authority | redesign scope |
| Value is mostly search or process discipline | compare AI against no-AI improvement |
| Workflow is common and low risk | buy candidate with sandbox |
| Workflow is regulated, high impact or evidence-heavy | hybrid candidate with stronger gate |
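
The triage rules above can be sketched as a small routing function. This is only an illustration: the `IntakeSignals` fields are hypothetical yes/no flags read off an intake card, the rules are assumed to apply in table order with the first match winning, and the final "manual triage" fallback is not part of the table.

```python
from dataclasses import dataclass


@dataclass
class IntakeSignals:
    """Hypothetical yes/no signals taken from a completed intake card."""
    has_owner: bool
    has_baseline: bool
    has_decision_boundary: bool
    sandbox_data_safe: bool
    ai_decides_without_human_authority: bool
    value_mostly_search_or_process: bool
    common_and_low_risk: bool
    regulated_high_impact_or_evidence_heavy: bool


def triage_route(s: IntakeSignals) -> str:
    # Assumption: rules are checked in table order and the first match wins.
    # Assumption: any single missing item is enough to trigger the first rule.
    if not (s.has_owner and s.has_baseline and s.has_decision_boundary):
        return "reject or defer"
    if not s.sandbox_data_safe:
        return "defer until data plan exists"
    if s.ai_decides_without_human_authority:
        return "redesign scope"
    if s.value_mostly_search_or_process:
        return "compare AI against no-AI improvement"
    if s.common_and_low_risk:
        return "buy candidate with sandbox"
    if s.regulated_high_impact_or_evidence_heavy:
        return "hybrid candidate with stronger gate"
    # Not in the table: fallback label is an assumption for unmatched cases.
    return "manual triage"


# Signals loosely modelled on the AML investigation workbench example.
aml = IntakeSignals(True, True, True, True, False, False, False, True)
print(triage_route(aml))  # hybrid candidate with stronger gate
```

Encoding the rules this way mainly forces the team to state which rule takes precedence when several apply; the order used here is one possible choice, not a prescribed one.
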
Template: Vendor Sandbox Scorecard
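Score each dimension from 1 to 5, then apply the weight. Because the weights sum to 100, dividing the weighted total by 5 gives a score out of 100, which is the scale the decision bands below use. Adjust weights by use-case risk. For AML, KYC, credit and payments use cases, increase the governance, audit, data and control weights.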
| Dimension | Weight | 1 | 3 | 5 | Evidence required |
|---|---|---|---|---|---|
| Business workflow fit | 10 | generic demo | partial workflow | fits real roles, queues and exceptions | workflow demo using sandbox tasks |
| Output quality | 12 | claims only | acceptable average score | strong score plus slice analysis | eval report and samples |
| Critical failure control | 12 | severe failures | mitigations proposed | no critical failures in approved set | failure taxonomy |
| Data privacy | 10 | vague processing | basic data terms | clear prompt, embedding, log, support and retention controls | data flow and settings |
| Security | 10 | weak IAM/logging | basic SSO/RBAC | strong IAM, tenant isolation, prompt/tool threat testing | security evidence |
| Eval maturity | 10 | vendor benchmark only | supports custom eval | supports customer gold set, red-team and export | eval methodology |
| Architecture fit | 10 | black box | API integration | gateway, trace, SIEM, IAM, RAG and tool boundary fit | architecture review |
| Audit evidence | 8 | transcript only | partial logs | full trace and admin audit export | trace sample |
| Cost predictability | 6 | unclear | rate card | cost per case, caps, variance and scale model | cost workbook |
| Latency/reliability | 5 | best effort | basic SLA | p95 measured, rate limits known, fallback path | latency report |
| Concentration/exit constraint | 5 | heavy lock-in | partial export | portable data, eval, prompt, logs and config | export review |
| Vendor operating maturity | 2 | sales-led | some process | support, incident, change and roadmap discipline | vendor evidence |
Decision bands:
| Weighted score | Interpretation |
|---|---|
| 0-55 | Do not pilot; use only for learning |
| 56-70 | Limited sandbox extension or low-risk pilot only |
| 71-84 | Candidate for controlled pilot if blockers closed |
| 85-100 | Strong candidate, still subject to production gate |
Hard blockers override score:
- Critical customer harm, regulatory, privacy or security failure unresolved.
- Vendor cannot explain data route, retention or support access.
- No evidence export for high-risk use case.
- Tool actions cannot be constrained or audited.
- Cost or latency cannot be modeled for production volume.
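
A minimal sketch of the scorecard arithmetic, assuming the weighted total is normalized by dividing by 5 so that it lands on the 0-100 scale of the decision bands. The function names, the "Blocked" label and the handling of fractional scores (a score stays in the lower band until it reaches the next band's lower bound) are assumptions made for illustration.

```python
WEIGHTS = {
    "Business workflow fit": 10,
    "Output quality": 12,
    "Critical failure control": 12,
    "Data privacy": 10,
    "Security": 10,
    "Eval maturity": 10,
    "Architecture fit": 10,
    "Audit evidence": 8,
    "Cost predictability": 6,
    "Latency/reliability": 5,
    "Concentration/exit constraint": 5,
    "Vendor operating maturity": 2,
}
assert sum(WEIGHTS.values()) == 100  # weights from the scorecard table


def weighted_score(scores: dict[str, int]) -> float:
    """Return a 0-100 score from per-dimension scores of 1 to 5."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every dimension exactly once")
    if any(not 1 <= v <= 5 for v in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    # Integer sum first, then one division, to avoid float drift.
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / 5


def decision_band(score: float, hard_blockers: list[str]) -> str:
    # Hard blockers override the score, as stated in the playbook.
    if hard_blockers:
        return "Blocked: " + "; ".join(hard_blockers)
    if score < 56:
        return "Do not pilot; use only for learning"
    if score < 71:
        return "Limited sandbox extension or low-risk pilot only"
    if score < 85:
        return "Candidate for controlled pilot if blockers closed"
    return "Strong candidate, still subject to production gate"


all_threes = {d: 3 for d in WEIGHTS}
print(weighted_score(all_threes))                     # 60.0
print(decision_band(weighted_score(all_threes), []))
```

Keeping the blocker check ahead of the band lookup makes it impossible for a high average to mask an unresolved critical failure.
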
Template: Benchmark Plan
| Section | Required content | Financial retail example |
|---|---|---|
| Use case | workflow and AI role | credit policy support copilot for underwriter memo drafting |
| Options compared | vendor A, vendor B, internal baseline, no-AI baseline | two vendor copilots, internal RAG prototype, current policy search |
| Dataset | source, size, masking, lineage, approval | 300 historical policy questions, 80 edge cases, 40 red-team prompts |
| Task set | representative tasks | retrieve policy, summarize exception, draft memo section, refuse unsupported decision |
| Rubric | scoring dimensions and severity | groundedness, completeness, policy compliance, adverse-action boundary |
| Critical failures | zero-tolerance failures | protected-class inference, unsupported decline reason, stale policy citation |
| Metrics | quality, cost, latency, adoption proxy, risk | pass rate, critical failure rate, p95 latency, cost per memo, reviewer agreement |
| Slice analysis | product, customer type, geography, risk tier | small business loan, credit card, secured loan, state-specific policy |
| Controls tested | privacy, security, prompt injection, DLP, entitlement | restricted policy, prompt injection, PII masking, role-based source access |
| Run protocol | repeat count, temperature, model version, evaluator | fixed model route, three repeated runs for unstable tasks, SME review |
| Acceptance threshold | go, limited go, no-go | no critical failures, groundedness >= 4/5, p95 < 6s, cost < $0.40 per memo section |
| Evidence package | files and logs retained | prompts, outputs, traces, citations, evaluator notes, cost/latency export |
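
The example acceptance threshold in the plan can be checked mechanically once a run is recorded. The sketch below assumes a hypothetical `BenchmarkRun` record and uses the nearest-rank definition of p95, which is one of several common percentile definitions; the sample values are made up to exercise the check and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Hypothetical summary of one option's sandbox run."""
    critical_failures: int
    groundedness: float            # mean rubric score on a 1-5 scale
    latencies_s: list[float]       # one latency sample per task, in seconds
    cost_per_memo_section: float   # US dollars


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def failed_thresholds(run: BenchmarkRun) -> list[str]:
    """List every example threshold from the benchmark plan that the run misses."""
    failures = []
    if run.critical_failures > 0:
        failures.append("critical failures present")
    if run.groundedness < 4.0:
        failures.append("groundedness below 4/5")
    if p95(run.latencies_s) >= 6.0:
        failures.append("p95 latency not under 6s")
    if run.cost_per_memo_section >= 0.40:
        failures.append("cost not under $0.40 per memo section")
    return failures


# Made-up illustrative run: 20 latency samples, two of them slow.
sample = BenchmarkRun(0, 4.2, [3.0] * 18 + [5.5, 7.0], 0.35)
print(p95(sample.latencies_s))    # 5.5
print(failed_thresholds(sample))  # []
```

Returning the list of missed thresholds, instead of a single pass/fail flag, keeps the evidence package explicit about why an option did not qualify.
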
Benchmark Result Table
| Option | Quality | Critical failures | Cost/case | p95 latency | Architecture fit | Evidence maturity | Recommendation |
|---|---|---|---|---|---|---|---|
| Vendor A | 86 | 0 | $0.52 | 5.8s | strong API, weak log export | medium | pilot only if log export gap closed |
| Vendor B | 79 | 2 | $0.31 | 4.2s | black-box RAG | low | reject for regulated workflow |
| Internal RAG | 81 | 0 | $0.44 | 6.5s | strong control, moderate UX | high | hybrid candidate |
| No-AI baseline | 62 | 0 | staff time only | 11 min | existing tools | high | keep as fallback and comparator |
Template: Build-Buy Decision Record
Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:
Decision:
Choose build / buy / partner / hybrid / stop.
Context:
Describe business outcome, workflow, AI role, data boundary, risk tier and production constraints.
Options considered:
1. Buy vendor capability.
2. Build internal capability.
3. Partner/co-build.
4. Hybrid component architecture.
5. No-AI/process option.
Evidence reviewed:
- Intake card:
- Sandbox benchmark:
- Vendor scorecard:
- Cost and latency:
- Security/privacy review:
- Architecture fit:
- Concentration and exit constraints:
- User feedback:
Rationale:
Explain why this option best balances value, control, speed, risk, cost, architecture fit and reversibility.
Component boundary:
List what is bought, built, partnered and retained internally.
Conditions:
List blockers that must close before pilot or production.
Residual risk:
List risk accepted, owner and expiry.
Reversal triggers:
List events that require redesign, re-benchmark, exit or stop.
Component Boundary Table
| Component | Decision | Owner | Reason |
|---|---|---|---|
| Base model | buy | AI platform | commodity capability, managed scale |
| Model gateway | build | platform architecture | routing, logging, policy and cost control |
| Domain knowledge ingestion | hybrid | data and operations | vendor extraction plus internal source registry |
| Eval datasets and rubrics | build | EvalOps and domain SMEs | institution-specific behavior contract |
| Case workflow | existing platform | operations technology | avoid parallel case system |
| Tool action layer | build | architecture and security | permissions, approval, idempotency, audit |
| Observability | hybrid | platform SRE | vendor telemetry exported to internal dashboard |
Template: Risk Acceptance
| Field | Required content | Example |
|---|---|---|
| Use case | workflow and AI role | payments scam intervention script assistant |
| Decision | proceed, limited pilot, pause, reject | limited pilot |
| Residual risk statement | what remains after controls | AI may under-escalate novel scam language not represented in eval set |
| Impact | customer, financial, compliance, operational, reputation | customer loss, complaint, regulatory scrutiny |
| Preventive controls | controls before output/action | typology source registry, mandatory human approval, prohibited payment action |
| Detective controls | monitoring and review | daily sample review, under-escalation QA, complaint monitoring |
| Corrective controls | what happens when risk materializes | disable route, update red-team set, retrain staff, customer remediation workflow |
| Evidence | basis for acceptance | sandbox report, red-team result, p95 latency, reviewer calibration |
| Owner | named accountable person | Head of Payments Fraud Operations |
| Expiry | time or trigger | 60 days or first high-severity incident |
| Stop rule | hard rollback condition | any unauthorized payment action or confirmed data leakage |
| Approval | business, risk, security, privacy, architecture | named approvals with date |
Risk acceptance rules:
- No permanent acceptance for unresolved critical AI risks.
- No acceptance without named owner and expiry.
- No acceptance for missing evidence; missing evidence is a blocker.
- No use of "human in the loop" unless review behavior is logged and measured.
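
The four rules above lend themselves to an automated pre-check before a risk acceptance goes for signature. This is a sketch under stated assumptions: the `RiskAcceptance` fields are hypothetical, a missing expiry is treated as "permanent", and the example date is invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RiskAcceptance:
    """Hypothetical record mirroring the risk acceptance template."""
    owner: str
    expiry: Optional[date]
    evidence: list[str] = field(default_factory=list)
    unresolved_critical_risk: bool = False
    claims_human_in_the_loop: bool = False
    review_logged_and_measured: bool = False


def violations(ra: RiskAcceptance) -> list[str]:
    """Return the risk acceptance rules that this record breaks."""
    found = []
    # Assumption: no expiry date means the acceptance is effectively permanent.
    if ra.unresolved_critical_risk and ra.expiry is None:
        found.append("permanent acceptance of unresolved critical risk")
    if not ra.owner.strip() or ra.expiry is None:
        found.append("missing named owner or expiry")
    if not ra.evidence:
        found.append("missing evidence (blocker)")
    if ra.claims_human_in_the_loop and not ra.review_logged_and_measured:
        found.append("human in the loop claimed without logged, measured review")
    return found


pilot = RiskAcceptance(
    owner="Head of Payments Fraud Operations",
    expiry=date(2030, 1, 31),  # invented date for illustration
    evidence=["sandbox report", "red-team result"],
    claims_human_in_the_loop=True,
    review_logged_and_measured=True,
)
print(violations(pilot))  # []
```

A check like this does not replace the named approvals; it only stops incomplete records from reaching the approvers.
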
Template: Production Gate Evidence Packet
| Evidence area | Required artifact | Gate question |
|---|---|---|
| Business and workflow | intake card, workflow map, decision boundary | Is the AI role clear and limited? |
| Evaluation | eval contract, benchmark report, failure taxonomy | Did it meet quality and risk thresholds? |
| Data and privacy | data flow, data classification, retention, DLP test | Is data use approved and minimized? |
| Security | threat model, IAM, RBAC, prompt injection test, tool control | Can it be attacked or misused beyond appetite? |
| Architecture | C4/context diagram, component decision, ADR, integration design | Does it fit target architecture and operations? |
| Vendor and third party | scorecard, due diligence gaps, concentration review | Are vendor dependencies understood? |
| Cost and latency | unit cost, p95 latency, budget cap, capacity model | Can it run economically and within SLO? |
| Human oversight | review policy, training, override log, escalation path | Does human control actually operate? |
| Observability | trace schema, log export, dashboard, alert rules | Can one case be reconstructed and monitored? |
| Incident and rollback | kill switch, fallback, manual queue, severity matrix | Can we stop, degrade or recover? |
| Risk acceptance | residual risk memo and approvals | Are remaining risks named, owned and time-bound? |
| Operating model | RACI, runbook, support model, cadence | Who owns performance after release? |
Gate outcomes:
| Outcome | Meaning |
|---|---|
| Go | Evidence complete, thresholds met, controls operational |
| Limited go | Narrow user group, time box, volume cap, explicit residual risk |
| No-go | Critical evidence missing or threshold failed |
| Rework | Design or control issue can be fixed and re-tested |
| Stop | Value too weak or risk too high |
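
As a rough illustration, the evidence packet can be checked for completeness before the gate meeting. The sketch assumes the packet is a mapping from evidence area to a list of artifact names, treats any empty area as critical missing evidence, and maps non-operational controls to "Rework"; "Limited go" and "Stop" are left to human judgment and are not modelled.

```python
REQUIRED_AREAS = [
    "Business and workflow", "Evaluation", "Data and privacy", "Security",
    "Architecture", "Vendor and third party", "Cost and latency",
    "Human oversight", "Observability", "Incident and rollback",
    "Risk acceptance", "Operating model",
]


def missing_areas(packet: dict[str, list[str]]) -> list[str]:
    """Evidence areas with no artifact on file, in table order."""
    return [area for area in REQUIRED_AREAS if not packet.get(area)]


def gate_outcome(packet: dict[str, list[str]],
                 thresholds_met: bool,
                 controls_operational: bool) -> str:
    # Assumption: every missing area counts as critical missing evidence.
    if missing_areas(packet) or not thresholds_met:
        return "No-go"
    # Assumption: a control that is not yet operational is a fixable issue.
    if not controls_operational:
        return "Rework"
    return "Go"


packet = {area: ["artifact on file"] for area in REQUIRED_AREAS}
print(gate_outcome(packet, True, True))  # Go

del packet["Observability"]
print(missing_areas(packet))             # ['Observability']
print(gate_outcome(packet, True, True))  # No-go
```
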
PM / BA / Architecture Questions
PM Questions
| Question | Strong answer |
|---|---|
| What user behavior changes? | Names the exact task, user, queue, decision and expected adoption signal |
| What is the baseline? | Current time, quality, cost, risk or customer experience measured or measurable |
| Why AI rather than process or rules? | AI handles unstructured, variable, evidence-heavy work that simpler options cannot solve |
| What is the MVP boundary? | Bounded role, limited users, prohibited actions and pilot duration |
| What would make you stop? | Predefined quality, risk, cost, latency or adoption failure |
BA Questions
| Question | Strong answer |
|---|---|
| What requirement becomes an eval contract? | Requirement maps to dataset, rubric, threshold and evidence |
| What are exceptions and edge cases? | Includes complaint, vulnerable customer, stale policy, missing document, suspicious pattern |
| Who owns final authority? | Human/system authority is explicit, logged and not hidden in AI output |
| What data can be used? | Field-level data boundary, masking, retention and purpose are defined |
| How is traceability maintained? | Requirement -> eval -> prompt/model/RAG/tool version -> gate decision |
Architecture Questions
| Question | Strong answer |
|---|---|
| Where is the control point? | model gateway, data boundary, RAG source registry, tool gateway, eval gate or workflow approval |
| What is vendor-specific? | UI, model, prompt, index, workflow, logs, evals, connectors and admin config are separated |
| Can one output be reconstructed? | Trace includes user, prompt, model, source, tool, output, review and final action |
| What fails safe? | timeout, vendor outage, model regression, cost spike and unsafe output route to fallback |
| What is the exit constraint? | Data, prompts, evals, logs, indexes, configs and workflows have export or rebuild path |
Release Checklist
Use this checklist before moving from sandbox to pilot or production.
| Check | Pass condition |
|---|---|
| Intake complete | owner, outcome, workflow, AI role, data, risk tier and no-AI option recorded |
| Sandbox approved | data plan, task set, cost cap and access boundary approved |
| Benchmark executed | vendor, internal and baseline options compared fairly |
| Critical failures closed | zero unresolved critical failures or documented no-go |
| Architecture ADR approved | build-buy boundary, integration and control points recorded |
| Data/privacy review complete | prompt, document, embedding, log, retention and support access covered |
| Security review complete | IAM, RBAC, injection, tool misuse, DLP and logging tested |
| Eval contract approved | metrics, thresholds, slices and critical failures accepted by owners |
| Cost/latency acceptable | cost per case and p95 latency inside pilot envelope |
| Human oversight ready | reviewers trained, override/escalation logged |
| Observability ready | trace, dashboard, alert and evidence export tested |
| Incident response ready | kill switch, fallback, manual queue and severity routing tested |
| Risk acceptance signed | residual risks named, owned, expiring and linked to evidence |
| Production support ready | runbook, support owner, review cadence and vendor contact path active |
Executive Narrative
One-Page Narrative
Decision requested:
Approve a controlled pilot / reject / continue sandbox / proceed to procurement diligence for [use case].
Why now:
[Business workflow] has [measured pain]. AI is plausible because the work is [unstructured, evidence-heavy, repetitive, time-sensitive], but production use requires controls over data, model behavior, human authority and audit evidence.
Options evaluated:
We compared [vendor A], [vendor B], [internal option] and [no-AI baseline] using the same sandbox dataset, tasks, rubric, cost and latency measurements.
Evidence:
The leading option achieved [quality result], [critical failure result], [cost], [latency] and [architecture fit]. Key gaps are [gap list].
Recommendation:
Proceed with [build/buy/partner/hybrid/stop] because it best balances value, risk, speed, control, cost and reversibility.
Conditions:
Before production, close [security/privacy/eval/logging/vendor] gaps, approve risk acceptance, and pass the production gate.
Stop rule:
Stop or roll back if [critical failure, data leakage, unauthorized action, cost breach, latency breach, adoption failure] occurs.
CFO / COO / CTO / CRO Lens
| Executive | What they need to hear |
|---|---|
| CFO | Cost per case, cost-to-learn, scale economics, avoided waste from no-go decisions |
| COO | Workflow impact, adoption, queue capacity, fallback and operational ownership |
| CTO | Architecture fit, integration, abstraction, observability, resilience and concentration risk |
| CRO / Compliance | Risk tier, controls, evidence, human authority, monitoring and stop rules |
| CPO / Business lead | User value, customer impact, MVP scope, learning plan and rollout conditions |
Interview Drills
Drill 1: Contact Center Vendor Pitch
Prompt:
A vendor shows a strong GenAI contact center demo. The business wants to buy quickly. How do you respond?
Strong answer:
I would not block learning, but I would move the vendor into a controlled intake and sandbox path. First I define the target workflows, such as credit card fee questions and dispute status. Then I set the AI role as agent assist, not direct customer commitment. The sandbox uses approved policy documents, synthetic customer cases, complaint and vulnerable customer edge cases, and prompt-injection tests. I compare the vendor against current knowledge search and any internal RAG option. The scorecard includes groundedness, policy compliance, escalation, p95 latency, cost per interaction, trace export and architecture fit. If it passes, I recommend a limited pilot with human review and production gate conditions.
Drill 2: AML Workbench Build vs Buy
Prompt:
Should the bank buy an AML investigation copilot or build one?
Strong answer:
I would avoid a single answer at product level. AML is high-risk and evidence-heavy, so I would likely choose hybrid. We can buy commodity document summarization or UX components, but keep internal control over source registry, entitlement, case-system integration, eval datasets, red-flag taxonomy, model gateway, audit trace and final disposition boundary. The sandbox must prove no critical red-flag omission, grounded narrative, reviewer agreement, trace reconstruction and acceptable cost/latency. If a vendor cannot export evidence or respect source permissions, it cannot be a production candidate even if the demo is impressive.
Drill 3: Credit Decision Support Risk Challenge
Prompt:
The pilot improves underwriter productivity by 30 percent but has two unsupported adverse-action draft examples. What do you do?
Strong answer:
I would treat unsupported adverse-action language as a critical failure, not an average-quality issue. The gate outcome is no-go or limited rework, depending on whether the feature can be constrained. I would remove adverse-action drafting from scope, strengthen policy retrieval and citation checks, add fair-lending and protected-class proxy cases to the eval set, require human approval logging, and rerun regression. Productivity improvement does not override customer harm and regulatory evidence requirements.
Drill 4: Payments Fraud Real-Time Constraint
Prompt:
A fraud intervention AI is accurate but p95 latency exceeds the intervention window. Is it still acceptable?
Strong answer:
Not for real-time intervention. The benchmark must be interpreted against operational SLO, not just model quality. I would either constrain the AI to post-event case enrichment, precompute certain risk narratives, use a smaller model route, or keep deterministic real-time controls while AI supports analyst review. Architecture fit includes latency, fallback and queue impact; a high-quality answer that arrives too late fails the workflow.
Drill 5: CTO Challenge On Lock-In
Prompt:
The CTO asks how you prevent AI vendor lock-in during procurement intake.
Strong answer:
I separate commodity capability from control points. During intake and sandbox, I require component mapping for model, RAG, prompt, eval, tool, workflow and logs. I prefer a model gateway, internal eval datasets, source registry, tool gateway and telemetry export so we can compare or replace vendors. I also score exportability of prompts, logs, eval results, indexes, configs and data. I do not claim perfect portability; I make exit constraints explicit before pilot and price them into the build-buy decision.
Source Anchors
| Anchor | Link | Usage |
|---|---|---|
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | risk map, measure, manage and governance evidence |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | GenAI-specific sandbox risks and red-team cases |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | AI lifecycle accountability and management system checks |
| ISO/IEC/IEEE 29148 Requirements engineering | https://www.iso.org/standard/72089.html | requirements quality and validation discipline |
| ISO/IEC/IEEE 42010 Architecture description | https://www.iso.org/standard/74393.html | stakeholder concerns, architecture viewpoints and ADR rationale |
| Interagency Third-Party Risk Guidance, FDIC FIL-29-2023 | https://www.fdic.gov/news/financial-institution-letters/2023/fil23029.html | third-party planning, due diligence, selection and monitoring lifecycle |
| FFIEC AIO booklet summary, OCC Bulletin 2021-30 | https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-30.html | architecture, infrastructure, operations and resilience review |
| OWASP Top 10 for Large Language Model Applications | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | LLM security test cases for injection, data disclosure and excessive agency |