
AI Procurement Intake: Vendor Evaluation Sandbox and Build-Buy Architecture

424ai-foundations/papers/166-ai-procurement-intake-vendor-evaluation-sandbox-build-buy-architecture.md

AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Decision Architecture: An Explainer

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / solution architect / enterprise architect moving into AI product and AI architecture
Output: a portfolio-ready architecture note covering AI procurement intake, vendor sandbox, benchmark, build-buy-partner decision, and production promotion gate


Why Procurement Intake Is An Architecture Control Point

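AI procurement intake is not an administrative process; it is the first control point in the enterprise AI architecture. It determines whether an AI idea gets wrongly pushed into a vendor demo, PoC, procurement negotiation, or production integration. Mature institutions do not start by asking "which vendor is best"; they start with these questions:

| Control question | Architectural implication | Financial retail consequence |
| --- | --- | --- |
| Is this use case worth entering the AI funnel? | Validate outcome, workflow, data, risk tier, and the no-AI option first | Avoids dressing up an ordinary process problem as a GenAI project |
| What role does AI play in the workflow? | Tier the role: read, summarize, recommend, draft, decide, act | Avoids decisions beyond authority in customer service, credit, AML, and payments scenarios |
| Should we build, buy, partner, go hybrid, or stop? | Put capability differentiation, control, time, cost, and risk into one decision | Avoids buying an ill-fitting black box because the demo looked good |
| What does the sandbox evaluate? | Test vendors with real but controlled data, tasks, rubric, and gates | Avoids relying only on sales demos and generic benchmarks |
| How does evidence reach production release? | Evaluation results must link to ADR, risk acceptance, and release gate | Avoids bypassing security, privacy, model risk, and operational controls after a successful PoC |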

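The core value of upstream procurement intake:

  1. Turn an AI idea into measurable business and risk hypotheses.
  2. Turn vendor comparison into architecture comparison.
  3. Turn the PoC into a controlled sandbox, so the pilot does not drift into shadow production.
  4. Move the build-buy choice from a debate over preferences to an evidence-based decision.
  5. Ground downstream contracts, exit, the investment narrative, and production launch in evidence.

Scope note: this article focuses on the upstream part of the procurement lifecycle: intake, triage, sandbox, benchmark, and the build-buy decision. Contract terms, exit and migration, and the board investment narrative are downstream material; here we only define the input evidence they require.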


Concept Diagram

flowchart LR
  A[AI idea / vendor pitch / business pain] --> B[Intake funnel]
  B --> C{Use-case triage}
  C -->|No AI fit| C1[Process / rules / data fix]
  C -->|Low value or high risk| C2[Stop or defer]
  C -->|Candidate| D[Decision architecture]
  D --> D1[Build]
  D --> D2[Buy]
  D --> D3[Partner]
  D --> D4[Hybrid]
  D --> E[Sandbox charter]
  E --> F[Vendor and internal option benchmark]
  F --> G[Evidence pack]
  G --> H{Architecture review board}
  H -->|No-go| H1[Reject / redesign]
  H -->|Limited pilot| H2[Controlled pilot with constraints]
  H -->|Production candidate| I[Production promotion gate]
  I --> J[Contract, security, privacy, risk, model validation, operating model]

A practical rule of thumb:

Intake owns the question "should this enter the AI option space"; sandbox owns "which option works under controlled evidence"; architecture gate owns "can this be operated safely at production scale".
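
The stage ownership above can be read as a small state machine over the concept diagram. The following is a minimal sketch, assuming hypothetical stage names derived from the diagram's nodes (the build/buy/partner/hybrid branches are collapsed into one `decision_architecture` stage); `ALLOWED`, `can_promote`, and `first_illegal_step` are illustrative names, not part of any existing tool.

```python
from typing import Optional

# Hypothetical stage names taken from the concept diagram; each key maps to
# the stages it may hand over to. Anything not listed is a gate bypass.
ALLOWED = {
    "intake": {"triage"},
    "triage": {"process_fix", "stop_or_defer", "decision_architecture"},
    "decision_architecture": {"sandbox_charter"},
    "sandbox_charter": {"benchmark"},
    "benchmark": {"evidence_pack"},
    "evidence_pack": {"review_board"},
    "review_board": {"reject_or_redesign", "controlled_pilot", "production_gate"},
}


def can_promote(current: str, target: str) -> bool:
    """Return True only if the diagram has a direct edge current -> target."""
    return target in ALLOWED.get(current, set())


def first_illegal_step(path: list[str]) -> Optional[tuple[str, str]]:
    """Return the first (current, target) pair that skips a gate, or None."""
    for current, target in zip(path, path[1:]):
        if not can_promote(current, target):
            return (current, target)
    return None
```

Checking a proposed path with `first_illegal_step` makes a "PoC straight to production" shortcut visible as an explicit illegal transition rather than a judgment call.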


Intake-To-Sandbox-To-Decision Architecture

1. Intake Funnel

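AI intake requires every idea to submit minimum evidence first, rather than booking a vendor demo straight away.

| Intake field | Senior-level requirement | Disqualifying signal |
| --- | --- | --- |
| Business outcome | Explicit baseline, target movement, impacted workflow, owner | "Improve efficiency", "make it intelligent", "understand customers better" |
| AI role | read / retrieve / summarize / draft / recommend / decide / act | Simply saying "AI handles it automatically" |
| Customer or regulatory impact | Whether it affects customer commitments, credit granting, KYC, AML, payments, complaints, fees, or suitability | Claiming low risk only because it is an internal tool |
| Data boundary | Data sources, PII, PCI, accounts, transactions, documents, voice, logs, cross-border transfer, retention | Not knowing what will be sent to the vendor |
| Workflow insertion point | AS-IS / TO-BE nodes, human review, exception path, fallback | No process diagram, only a feature list |
| No-AI alternative | Process optimization, rules engine, search, RPA, knowledge governance, reporting | Assuming AI is the only solution |
| Evidence plan | Sandbox data, rubric, benchmark, control evidence | Planning only to watch the vendor demo |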
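
The intake fields can be captured as a structured card with a completeness check. This is a minimal sketch, assuming a hypothetical `IntakeCard` dataclass whose attributes mirror the table rows; the slogan list and the `intake_gaps` helper are illustrative, and a real gate would need richer validation than string checks.

```python
from dataclasses import dataclass, fields

# Role ladder from the intake table.
AI_ROLES = ("read", "retrieve", "summarize", "draft", "recommend", "decide", "act")
# Slogans the table flags as disqualifying (assumed to be matched literally here).
VAGUE_OUTCOMES = ("improve efficiency", "make it intelligent", "understand customers better")


@dataclass
class IntakeCard:
    business_outcome: str = ""
    ai_role: str = ""
    customer_or_regulatory_impact: str = ""
    data_boundary: str = ""
    workflow_insertion_point: str = ""
    no_ai_alternative: str = ""
    evidence_plan: str = ""


def intake_gaps(card: IntakeCard) -> list[str]:
    """List empty fields plus the two disqualifying signals we can detect cheaply."""
    gaps = [f.name for f in fields(card) if not getattr(card, f.name).strip()]
    if card.ai_role.strip() and card.ai_role not in AI_ROLES:
        gaps.append("ai_role: not a value on the role ladder")
    if card.business_outcome.strip().lower() in VAGUE_OUTCOMES:
        gaps.append("business_outcome: slogan without baseline or target")
    return gaps
```

An empty result from `intake_gaps` only means the card is complete enough to triage; it says nothing about whether the use case is worth pursuing.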

2. Triage Gate

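Classify use cases into four tiers:

| Tier | Typical scenario | Recommended action |
| --- | --- | --- |
| T0: Reject / defer | No owner, no baseline, data unavailable, or unacceptable risk | Stop; fix the process or data first |
| T1: Learn-only sandbox | Early value hypothesis, data can be masked, no connection to production | Controlled demo and architecture learning |
| T2: Controlled pilot candidate | Clear workflow, evaluation set available, humans retain final authority and accountability | sandbox -> limited pilot |
| T3: High-impact candidate | Credit, AML, KYC, payments, customer commitments, complaints, or regulatory evidence | Strengthened sandbox, independent challenge, production gate |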
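
The four tiers can be expressed as an ordered set of checks. The sketch below is one possible reading, assuming a hypothetical `UseCase` record and assuming that T0 blockers are checked first and that high-impact domains escalate to T3 before the T2 criteria are considered; the source table does not prescribe this precedence.

```python
from dataclasses import dataclass

# Domains the T3 row names as high impact (identifiers are illustrative).
HIGH_IMPACT_DOMAINS = {
    "credit", "aml", "kyc", "payments",
    "customer_commitment", "complaints", "regulatory_evidence",
}


@dataclass
class UseCase:
    has_owner: bool
    has_baseline: bool
    data_available: bool
    risk_acceptable: bool
    domains: set[str]
    has_defined_workflow: bool
    has_eval_set: bool
    human_final_authority: bool


def triage(uc: UseCase) -> str:
    """Return the tier label; precedence T0 > T3 > T2 > T1 is an assumption."""
    if not (uc.has_owner and uc.has_baseline and uc.data_available and uc.risk_acceptable):
        return "T0"
    if uc.domains & HIGH_IMPACT_DOMAINS:
        return "T3"
    if uc.has_defined_workflow and uc.has_eval_set and uc.human_final_authority:
        return "T2"
    return "T1"
```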

3. Sandbox Charter

The sandbox charter must be frozen before vendor testing begins:

| Section | Content |
| --- | --- |
| Scope | Specific workflow, user roles, allowed tasks, prohibited tasks |
| Data | synthetic, masked, historical, gold set, red-team set, access policy |
| Architecture | vendor route, internal baseline, model gateway, logging, retrieval, tool boundary |
| Evaluation | benchmark task, rubric, thresholds, critical failures, slice analysis |
| Controls | privacy, security, HITL, DLP, prompt injection, cost cap, kill switch |
| Evidence | trace, logs, output samples, evaluator notes, cost, latency, defect taxonomy |
| Decision | Decision rules for build / buy / partner / hybrid / stop |
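
One lightweight way to make "frozen" verifiable is to fingerprint the charter when it is approved and compare the fingerprint before each vendor run. This is a sketch under the assumption that the charter is stored as a JSON-serializable dictionary; `charter_fingerprint` and `changed_since_freeze` are hypothetical helper names.

```python
import hashlib
import json


def charter_fingerprint(charter: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(charter, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_since_freeze(charter: dict, frozen_fingerprint: str) -> bool:
    """True if any section (scope, rubric, thresholds, ...) was edited after freeze."""
    return charter_fingerprint(charter) != frozen_fingerprint
```

Recording the fingerprint in the decision minutes gives reviewers a simple way to confirm that thresholds or rubrics were not adjusted after vendor results came in.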

4. Decision Board

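The decision should not rest with procurement or product alone. Minimum board composition:

| Role | Question this role challenges |
| --- | --- |
| Business owner | Is the business value real, and is the owner willing to carry adoption and residual risk? |
| AI PM / Product owner | Are the user scenario, MVP, experience, adoption, and benefit hypothesis clear? |
| CBAP / BA | Are process, rules, requirements, exceptions, acceptance criteria, and evidence complete? |
| Solution architect | Are integration, data flow, RAG, agent, logging, observability, and degradation feasible? |
| Enterprise architect | Platform reuse, capability map, vendor concentration, target architecture fit |
| Security / privacy | Do data, identity, permissions, logging, DLP, and the threat model meet the bar? |
| Risk / compliance / model risk | Are risk tier, regulatory impact, model validation, and human oversight sufficient? |
| Procurement / TPRM | Can vendor risk, commercial terms, and follow-on contract due diligence proceed to the next stage? |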

Build / Buy / Partner Decision Model

Build-buy-partner is not a pick-one-of-three slogan; it is a set of architecture boundary decisions.

Decision Axes

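| Axis | Leans Build | Leans Buy | Leans Partner | Leans Hybrid |
| --- | --- | --- | --- | --- |
| Differentiation | Process or data is a competitive advantage | Capability is generic and the market is mature | Industry experience needs to be transferred in | Control layer differentiates; capability layer is generic |
| Control need | Data, model, policy, audit, and tool permissions must be controlled internally | Vendor can provide sufficient control evidence | Institution lacks the capability in the short term | Keep policy, eval, gateway, and audit internal |
| Time to value | Can wait for internal capability build-out | Needs fast validation and launch | Needs accelerated delivery while retaining learning | Buy first and abstract later, or buy components and build the control plane |
| Scale economics | High volume; unit cost can be amortized by an internal platform | Uncertain or small-to-medium volume | Early exploration | Internalize high-risk parts; externalize commodity capability |
| Talent readiness | Has AI platform, data, eval, security, and SRE capability | Internal team is insufficient | Needs co-build and knowledge transfer | Internal team can operate the control layer |
| Regulatory evidence | Stronger evidence can be generated internally | Vendor evidence is mature and exportable | Needs advisors to complete the control design | Unified institutional evidence layer; vendor supplies component evidence |
| Exit optionality | Internal architecture is replaceable | Vendor lock-in is acceptable | Dependency transfer needs a plan | Abstraction interfaces lower exit cost |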

Component-Level Decision

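Do not make one blanket decision for the whole use case. Break the AI system down to the component level:

| Component | Common choice | Decision logic |
| --- | --- | --- |
| Base model | buy or use managed model | The base model is usually not a source of differentiation for a financial retail institution |
| Model gateway | build or platform buy | Needs unified routing, logging, policy, cost, versioning |
| RAG ingestion | hybrid | Document processing can be bought; source registry and entitlement should be controlled internally |
| Vector store / search | buy managed or internal platform | Depends on data classification, latency, cost, and regional requirements |
| Prompt / policy registry | build lightweight | Policies, prompts, and release evidence should be auditable by the institution |
| Eval harness | hybrid | Tooling can be bought; golden set, rubric, and gate thresholds must be owned internally |
| Agent tool gateway | build | High-risk actions, permissions, approvals, idempotency, and audit should not be handed to a black box |
| Human review workbench | buy, build, or existing workflow | Depends on whether it is embedded in the AML/KYC/credit/customer service case system |
| Observability | hybrid | Vendor traces must flow into the internal SIEM, audit, cost, and quality dashboards |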

Practical Decision Rule

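| Condition | Recommendation |
| --- | --- |
| Vendor product is strong but cannot export trace/eval/log | Fine for sandbox learning; does not become a production candidate |
| Vendor quality is good but tool action permissions cannot be controlled | buy UI/model layer only, build tool gateway |
| Internal model quality is average but evidence and controls are strong | Can run a high-risk pilot, because in financial scenarios safety evidence matters more than the average score |
| Multiple vendors perform similarly | Choose the option with better architecture fit, evidence export, cost predictability, and exit constraints |
| Use case is not differentiating but needs fast adoption | buy with strong sandbox and production gate |
| Use case is core risk control or customer commitment | hybrid by default; keep the decision boundary and evidence under internal control |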
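
Several rows of the decision rule table can be encoded as ordered checks. The sketch below assumes a hypothetical `OptionFacts` record and an assumed precedence (evidence export, then risk criticality, then tool control, then speed); the table itself does not rank its rows, and cases it does not cover are escalated rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class OptionFacts:
    exports_trace_eval_logs: bool
    tool_permissions_controllable: bool
    core_risk_or_customer_commitment: bool
    differentiating: bool
    needs_fast_adoption: bool


def recommend(facts: OptionFacts) -> str:
    """Apply the rule table top-down; the ordering is an illustrative assumption."""
    if not facts.exports_trace_eval_logs:
        return "sandbox learning only; not a production candidate"
    if facts.core_risk_or_customer_commitment:
        return "hybrid by default; keep decision boundary and evidence internal"
    if not facts.tool_permissions_controllable:
        return "buy UI/model layer only; build tool gateway"
    if not facts.differentiating and facts.needs_fast_adoption:
        return "buy with strong sandbox and production gate"
    return "no rule matched; escalate to decision board"
```

Returning an explicit "escalate" outcome keeps the function from implying that the table is exhaustive.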

Vendor Sandbox And Benchmark Design

Sandbox Design Principles

  1. Same task, same data, same rubric, same cost and latency measurement.
  2. Compare at least a vendor option, an internal baseline, and a no-AI baseline.
  3. Test positive cases, negative cases, edge cases, abuse cases, stale-source cases, and restricted-data cases.
  4. Outputs must be traceable to prompt, model, source, tool call, reviewer, and decision.
  5. The sandbox may use only approved data and must not connect to production write actions.
  6. Benchmark conclusions must cover architecture fit, not just accuracy.

Benchmark Plan

| Dimension | Measurement | Release implication |
| --- | --- | --- |
| Task quality | groundedness, completeness, policy compliance, extraction accuracy, narrative quality | Determines whether the workflow outcome is met |
| Critical failure | hallucinated commitment, missed red flag, unauthorized advice, PII leakage, wrong adverse action | High-risk use cases usually require zero |
| Retrieval quality | source recall, citation correctness, freshness, entitlement respect | Determines whether RAG can be used in controlled production |
| Tool safety | allowed tool choice, argument correctness, approval compliance, idempotency | Determines whether the agent may enable tools |
| Human oversight | reviewer agreement, override rate, review time, escalation quality | Determines whether HITL is genuinely effective |
| Cost | cost per case, token variance, document cost, eval cost, monitoring cost | Determines TCO and scale feasibility |
| Latency | p50, p95, timeout, retry, end-to-end workflow time | Determines customer experience and operational queue viability |
| Security / privacy | prompt injection result, DLP pass, data retention proof, access control | Determines whether the option may enter pilot |
| Architecture fit | API, IAM, audit export, SIEM, model gateway, data residency, change control | Determines the build-buy-partner boundary |
| Evidence maturity | eval export, trace completeness, admin audit, versioning, incident evidence | Determines whether financial audit and model risk requirements are met |
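
The benchmark plan implies a scorecard in which critical failures act as a hard stop rather than as one more weighted dimension. The following is a minimal sketch of that idea; the `BenchmarkResult` shape, the 0-to-1 score scale, the weights, and the threshold are all illustrative assumptions that a sandbox charter would have to define.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    option: str
    scores: dict[str, float]  # dimension name -> normalized score in [0, 1]
    critical_failures: int


def weighted_score(result: BenchmarkResult, weights: dict[str, float]) -> float:
    """Weighted mean over the charter's dimensions; unmeasured dimensions score 0."""
    total = sum(weights.values())
    return sum(w * result.scores.get(dim, 0.0) for dim, w in weights.items()) / total


def passes_gate(result: BenchmarkResult, weights: dict[str, float],
                threshold: float, high_risk: bool) -> bool:
    """For high-risk use cases any critical failure blocks, whatever the average."""
    if high_risk and result.critical_failures > 0:
        return False
    return weighted_score(result, weights) >= threshold
```

Scoring an unmeasured dimension as zero is a deliberate choice in this sketch: an option that cannot produce evidence for a dimension should not benefit from the gap.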

Sandbox Evidence Pack

Every vendor or internal option must produce:

| Evidence object | Content |
| --- | --- |
| Option card | vendor/internal option, deployment model, model/provider, data route, components |
| Data map | Input fields, documents, logs, embedding, retention, masking, region |
| Benchmark report | dataset, tasks, rubric, score, confidence, slice failures |
| Failure taxonomy | critical, high, medium, low failure examples and root cause |
| Trace sample | prompt version, retrieval results, model version, output, tool calls, reviewer action |
| Cost and latency sheet | unit economics, p50/p95 latency, rate limit, stress result |
| Architecture fit review | integration, IAM, observability, RAG, agent boundary, platform fit |
| Risk review | privacy, security, third-party, model risk, compliance, operational resilience |
| Recommendation | build / buy / partner / hybrid / stop, with conditions and reversal triggers |
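
A completeness check over the evidence pack can run before the review board meets. This sketch assumes the pack is a dictionary mapping evidence object names to a document reference; the key names and the `missing_evidence` helper are hypothetical.

```python
# Evidence objects from the table above, as illustrative dictionary keys.
REQUIRED_EVIDENCE = (
    "option_card",
    "data_map",
    "benchmark_report",
    "failure_taxonomy",
    "trace_sample",
    "cost_and_latency_sheet",
    "architecture_fit_review",
    "risk_review",
    "recommendation",
)


def missing_evidence(pack: dict[str, str]) -> list[str]:
    """Return required evidence objects that are absent or blank, in table order."""
    return [name for name in REQUIRED_EVIDENCE if not pack.get(name, "").strip()]
```

The check only proves that each artifact exists; whether its content is adequate remains a review board judgment.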

Financial Retail Scenarios

1. GenAI Contact Center Copilot

| Intake decision | Sandbox focus | Architecture decision |
| --- | --- | --- |
| AI drafts agent guidance, not customer commitments | policy answer, citation, escalation, vulnerable customer handling | buy copilot UI if strong; build knowledge governance, eval, telemetry export |

Hard failures:

  • AI promises fee reversal outside policy.
  • AI misses complaint language or vulnerable customer marker.
  • AI cites stale product terms.
  • AI outputs account data to unauthorized role.

2. KYC Document Intelligence

| Intake decision | Sandbox focus | Architecture decision |
| --- | --- | --- |
| AI extracts and reconciles document facts; final KYC disposition stays human/system controlled | field extraction, document fraud signals, missing document checklist, data lineage | buy OCR/extraction, partner for policy tuning, build case workflow and evidence layer |

Hard failures:

  • Wrong identity attribute without confidence flag.
  • Document retention exceeds approved period.
  • Evidence cannot be exported for audit or regulator inquiry.
  • Model cannot handle jurisdiction-specific document rules.

3. AML Investigation Workbench

| Intake decision | Sandbox focus | Architecture decision |
| --- | --- | --- |
| AI summarizes evidence and drafts narrative; no final SAR/no-SAR decision | red flag coverage, source-grounded narrative, analyst override, missed-risk rate | hybrid; internal control over data, RAG, audit, case action boundary |

Hard failures:

  • AI omits material suspicious activity.
  • AI invents transaction rationale.
  • AI suggests final SAR decision as authoritative.
  • Case trace cannot reconstruct evidence used.

4. Credit Decision Support

| Intake decision | Sandbox focus | Architecture decision |
| --- | --- | --- |
| AI supports memo drafting and policy retrieval; credit decision remains governed by approved decisioning process | policy retrieval, adverse-action boundary, fair lending slice, explanation evidence | hybrid; build decision boundary and model risk evidence, buy retrieval or document summarization components only |

Hard failures:

  • AI uses protected-class proxy or unsupported inference.
  • AI drafts adverse action reason not supported by system of record.
  • Human reviewers over-rely without challenge.
  • Vendor cannot provide versioned evidence for model/prompt changes.

5. Payments Fraud Intervention

| Intake decision | Sandbox focus | Architecture decision |
| --- | --- | --- |
| AI recommends intervention scripts and case prioritization; payment block/release requires deterministic policy and approval | false-positive customer harm, scam typology coverage, latency, tool permissions | hybrid; build tool gateway, approval and audit; consider buy for scam narrative intelligence |

Hard failures:

  • AI triggers payment action without authorization.
  • AI misses urgent scam indicators.
  • Latency breaks real-time intervention window.
  • Tool call cannot be replayed or reversed.

Metrics / Control / Evidence Model

Use a three-layer evidence model: metric proves behavior, control constrains risk, evidence proves the control operated.

| Layer | Examples | Owner |
| --- | --- | --- |
| Outcome metrics | handle time, document review cycle time, AML case completeness, fraud intervention conversion, complaint escalation accuracy | business owner and PM |
| Quality metrics | groundedness, extraction accuracy, source coverage, critical failure rate, reviewer agreement, override reason | EvalOps and domain SMEs |
| Risk metrics | PII leakage, unauthorized tool call, policy violation, under-escalation, biased slice regression, stale-source answer | risk, compliance, model risk |
| Operational metrics | p50/p95 latency, timeout, fallback, cost per case, rate-limit hit, support ticket volume | platform and operations |
| Adoption metrics | active users, task completion, accepted suggestions, edit distance, trust survey, manual fallback rate | PM and operations |

Control mapping:

| Risk | Preventive control | Detective control | Corrective control | Evidence |
| --- | --- | --- | --- | --- |
| Vendor selected before problem clarity | intake completeness gate | funnel review log | reject/defer decision | intake card, decision minutes |
| Demo bias | common benchmark plan | score normalization | re-run benchmark | benchmark report |
| Sensitive data leakage | masking, DLP, approved sandbox data | payload sampling, DLP alert | delete, notify, retrain reviewers | data map, DLP test |
| Prompt injection | red-team cases, tool isolation | injection failure monitor | disable route, update guardrail | red-team report |
| Cost runaway | budget cap, token limits | cost dashboard | throttle, switch model, revise scope | cost sheet |
| Over-reliance | UI uncertainty, mandatory review for high risk | override and review audit | retraining, stricter HITL | reviewer logs |
| Architecture lock-in | gateway, export requirement, component boundaries | dependency review | redesign or reject vendor | ADR, architecture map |
| Production promotion without evidence | release gate checklist | evidence binder completeness check | limited pilot or no-go | gate memo |
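
The control mapping lends itself to a simple check that each risk's evidence has actually been collected, in line with the idea that evidence proves the control operated. This is a minimal sketch with a hypothetical `ControlRow` record and `unproven_controls` helper; the artifact names are free-text labels, not identifiers from any real system.

```python
from dataclasses import dataclass


@dataclass
class ControlRow:
    risk: str
    preventive: str
    detective: str
    corrective: str
    evidence: list[str]  # artifact labels expected for this risk


def unproven_controls(rows: list[ControlRow], collected: set[str]) -> dict[str, list[str]]:
    """Map each risk to the evidence artifacts that have not been collected yet."""
    gaps: dict[str, list[str]] = {}
    for row in rows:
        missing = [artifact for artifact in row.evidence if artifact not in collected]
        if missing:
            gaps[row.risk] = missing
    return gaps
```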

Anti-Patterns And Failure Modes

| Anti-pattern | Why it fails | Better pattern |
| --- | --- | --- |
| Vendor-first discovery | Demo defines problem and success criteria | Intake starts from outcome, workflow, risk and baseline |
| Accuracy-only scorecard | Ignores audit, latency, cost, data, tool safety and architecture fit | Multi-dimensional sandbox scorecard |
| PoC using production data without control | Creates privacy and shadow-production risk | Approved sandbox data boundary and DLP |
| One build-buy decision for the whole system | Hides component-level control needs | Component decision matrix |
| Vendor black-box RAG | Cannot prove source, freshness, entitlement or citation | Source registry and retrieval trace |
| Contract promises without technical enforcement | Rights cannot be exercised in operations | Contract-control-evidence mapping |
| Pilot becomes production by adoption pressure | Controls arrive after risk exposure | Production promotion gate with hard stop criteria |
| No no-AI baseline | AI value cannot be defended | Compare process/rules/search baseline |
| Weak cost measurement | Token and eval cost surprise at scale | Cost per case and capacity model |
| Missing exit constraints at intake | Lock-in discovered after integration | Exit constraints and concentration risk before pilot |

Architecture Mapping To RAG / Agent / Copilot / Eval / Governance

| Architecture pattern | Intake question | Sandbox test | Production gate evidence |
| --- | --- | --- | --- |
| RAG | Which sources are authoritative, current and permissioned? | citation correctness, stale-source failure, ACL filtering, retrieval recall | source registry, index version, retrieval eval, access review |
| Agent | What tools can be called, with what authority and side effects? | tool choice accuracy, argument validation, approval path, idempotency | tool policy, audit trace, kill switch, rollback test |
| Copilot | What does the human see, edit, approve, reject and own? | reviewer agreement, override rate, UX trust calibration, escalation | HITL log, training, adoption and quality dashboard |
| Eval | What behavior contract must be proven before release? | golden set, red-team set, slice metrics, critical failures | eval report, threshold decision, exception memo |
| Governance | Who owns AI risk, change, evidence, incident and lifecycle? | RACI simulation, gate dry run, evidence completeness | AI inventory, ADR, risk acceptance, operating cadence |
| Model gateway | Which providers and versions are allowed? | route comparison, fallback, cost/latency benchmark | routing policy, model registry, telemetry export |
| Observability | Can one case be reconstructed end to end? | trace completeness, log redaction, SIEM export | evidence binder, retention setting, audit export |

ADR Draft

ADR: AI Procurement Intake And Vendor Sandbox Decision Architecture

Status: Proposed
Date: 2026-06-30

Context:
Financial retail AI initiatives are entering the portfolio through business ideas, vendor pitches,
executive pressure and local productivity experiments. Without a common intake and sandbox
architecture, teams may select vendors before defining outcomes, data boundaries, eval contracts,
architecture fit, risk tier and production evidence gates.

Decision:
Create an upstream AI procurement intake and vendor evaluation sandbox architecture. Every AI
candidate must pass intake completeness, use-case triage, sandbox charter, controlled benchmark,
build-buy-partner decision record and production promotion gate before procurement contracting
or production integration.

Options Considered:
1. Let procurement run standard vendor questionnaires first.
2. Let product teams run ad hoc PoCs and bring winners to architecture review.
3. Establish a common intake-to-sandbox-to-decision architecture across product, risk, procurement and architecture.

Decision Rationale:
Option 3 keeps AI vendor selection evidence-based and architecture-aware. It prevents demo bias,
shadow production, uncontrolled data exposure and premature lock-in. It also gives downstream
contracting, third-party risk, model validation and executive funding a stronger evidence base.

Consequences:
- Teams must define outcome, workflow, AI role, data boundary, no-AI baseline and eval contract before vendor testing.
- Vendors must be compared against the same sandbox tasks, rubric, cost, latency, risk and architecture-fit criteria.
- Production promotion requires traceable evidence, not product enthusiasm.
- The organization must maintain reusable templates, datasets, scorecards and gate records.

Reversal Triggers:
- Intake gate blocks too many low-risk experiments without learning value.
- Sandbox cycle time becomes disproportionate to risk tier.
- Platform-level approved patterns make some repeated sandbox steps redundant.
- Regulatory, legal or internal policy changes require a stronger or different gate.

Interview Answer

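30-second version

I treat AI procurement intake as an architecture control point, not a procurement form. I first triage on outcome, workflow, AI role, data boundary, risk tier, and no-AI baseline, then use a controlled sandbox to benchmark build, buy, partner, and hybrid options on the same data, the same rubric, and the same cost and latency measures. Finally, an evidence pack supports the ADR, risk acceptance, and production gate, so a vendor demo cannot turn directly into a production system.

2-minute version

In financial retail, for example with an AML investigation workbench or KYC document intelligence, I would not start by asking which vendor has the best demo. I would first set up an intake funnel: what the business outcome is; whether AI retrieves, summarizes, recommends, or acts in the workflow; which data will enter prompts, embeddings, logs, and the vendor's environment; which actions require human approval; and what the no-AI baseline is. After triage, I would design a sandbox charter and compare the vendor option, the internal option, and the process-only option on the same masked or synthetic data, golden set, red-team cases, and business rubric. The evaluation looks beyond accuracy to critical failures, grounding, permission filtering, cost, p95 latency, trace completeness, audit export, model changes, tool permissions, and architecture fit. The decision is not a simple buy or no-buy but a component-level build-buy-partner choice: for example, buy OCR or the copilot UI, but keep the model gateway, RAG source registry, eval contract, tool gateway, and audit evidence internal. Only when the evidence pack can support risk, privacy, security, model validation, and the production promotion gate does the option move to a controlled pilot or the procurement contracting stage.

CTO version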

I would institutionalize AI procurement intake as a control-plane pattern. The control plane has five artifacts: intake card, sandbox charter, benchmark evidence, build-buy-partner ADR and production promotion packet. It prevents premature vendor coupling by forcing every option through the same workflow, data, risk, evaluation, cost, latency and architecture-fit tests. At component level, I would usually buy commodity capability, build policy and evidence control points, and partner only where domain transfer is required. For regulated financial use cases, the winning option is not the one with the best demo; it is the option that can be integrated through our identity, data, RAG, tool, eval, observability, incident and audit architecture while keeping concentration risk and exit constraints explicit.


7-Day Practice Plan

| Day | Practice | Output |
| --- | --- | --- |
| 1 | Choose one use case: GenAI contact center, KYC document intelligence, AML workbench, credit support, or payments fraud | One-page intake card with outcome, workflow, AI role, data and no-AI option |
| 2 | Draw AS-IS / TO-BE workflow and decision authority boundary | BPMN-lite flow plus allowed, restricted and prohibited AI actions |
| 3 | Build sandbox charter | Data plan, task list, rubric, critical failures, cost and latency measures |
| 4 | Create vendor scorecard and internal baseline comparison | Weighted scorecard with architecture-fit criteria |
| 5 | Write component-level build-buy-partner decision | Component matrix for model, RAG, eval, gateway, workflow, observability |
| 6 | Assemble evidence model | Metrics, controls, evidence objects, risk acceptance and gate thresholds |
| 7 | Practice interview narrative | 30-second, 2-minute, and CTO answers plus one financial scenario deep dive |

Source Anchors

| Anchor | Link | How this article uses it |
| --- | --- | --- |
| NIST AI Risk Management Framework | https://www.nist.gov/itl/ai-risk-management-framework | Uses Govern / Map / Measure / Manage to organize AI risk, evidence, monitoring and management action |
| NIST AI RMF Generative AI Profile | https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence | Uses the GenAI risk lens to design sandbox red-team, content provenance, data leakage and misuse cases |
| ISO/IEC 42001 AI management systems | https://www.iso.org/standard/81230.html | Uses the AI management system approach to define accountability, lifecycle, operation, performance evaluation and improvement |
| ISO/IEC/IEEE 29148 Requirements engineering | https://www.iso.org/standard/72089.html | Uses requirements quality, stakeholder concern and validation thinking to support intake and eval contract |
| ISO/IEC/IEEE 42010 Architecture description | https://www.iso.org/standard/74393.html | Uses stakeholder concern, viewpoint and architecture rationale to organize ADR and architecture fit review |
| Interagency Third-Party Risk Guidance, FDIC FIL-29-2023 | https://www.fdic.gov/news/financial-institution-letters/2023/fil23029.html | Uses third-party lifecycle thinking to connect planning, due diligence, selection, monitoring and termination inputs |
| FFIEC AIO booklet summary, OCC Bulletin 2021-30 | https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-30.html | Uses the architecture, infrastructure and operations lens to check resilience, integration, operations and evidence |
| OWASP Top 10 for Large Language Model Applications | https://owasp.org/www-project-top-10-for-large-language-model-applications/ | Uses prompt injection, sensitive information disclosure, supply chain and excessive agency to design security tests |