
AI Procurement Intake: Vendor Evaluation Sandbox and Build-Buy Architecture

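AI procurement intake is not an administrative process; it is the first control point in an enterprise AI architecture. It determines whether an AI idea gets wrongly pushed into a vendor demo, a PoC, a procurement negotiation or a production integration. Mature institutions do not start by asking "which vendor is best"; they first ask whether the use case belongs in the AI funnel at all, what role AI plays in the workflow, and what evidence a sandbox has to produce.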


AI Procurement Intake / Vendor Evaluation Sandbox / Build-Buy Decision Architecture: A Walkthrough

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / solution architect / enterprise architect moving into AI product and AI architecture
Output: a portfolio-ready architecture note covering AI procurement intake, vendor sandbox, benchmark, build-buy-partner decision and production promotion gate

Why Procurement Intake Is An Architecture Control Point

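Control question	Architectural meaning	Financial retail consequence
Is this use case worth entering the AI funnel?	Validate outcome, workflow, data, risk tier and the no-AI option first	Avoids dressing up an ordinary process problem as a GenAI project
What role does AI play in the workflow?	Layer the role: read, summarize, recommend, draft, decide, act	Avoids out-of-authority decisions in customer service, credit, AML and payments scenarios
Should we build, buy, partner, go hybrid or stop?	Put capability differentiation, control, time, cost and risk into one decision	Avoids buying an ill-fitting black box because the demo looked good
What does the sandbox evaluate?	Test vendors with real but controlled data, tasks, rubrics and gates	Avoids relying only on sales demos and generic benchmarks
How does evidence reach production release?	Evaluation results must link to the ADR, risk acceptance and release gate	Prevents a successful PoC from bypassing security, privacy, model risk and operational controls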

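The core value of upstream procurement intake:

Turn an AI idea into measurable business and risk hypotheses.
Turn vendor comparison into architecture comparison.
Turn the PoC into a controlled sandbox, so that a pilot cannot drift into shadow production.
Elevate the build-buy choice from a debate over preferences to an evidence-based decision.
Ground downstream contracting, exit, investment narrative and production launch in evidence.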

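Scope note: this note focuses on the upstream part of the procurement lifecycle: intake, triage, sandbox, benchmark and build-buy decision. Contract terms, exit migration and the board investment narrative are downstream material; this note only defines the input evidence they need.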

Concept Diagram

flowchart LR
  A[AI idea / vendor pitch / business pain] --> B[Intake funnel]
  B --> C{Use-case triage}
  C -->|No AI fit| C1[Process / rules / data fix]
  C -->|Low value or high risk| C2[Stop or defer]
  C -->|Candidate| D[Decision architecture]
  D --> D1[Build]
  D --> D2[Buy]
  D --> D3[Partner]
  D --> D4[Hybrid]
  D --> E[Sandbox charter]
  E --> F[Vendor and internal option benchmark]
  F --> G[Evidence pack]
  G --> H{Architecture review board}
  H -->|No-go| H1[Reject / redesign]
  H -->|Limited pilot| H2[Controlled pilot with constraints]
  H -->|Production candidate| I[Production promotion gate]
  I --> J[Contract, security, privacy, risk, model validation, operating model]

A practical principle:

Intake owns the question "should this enter the AI option space"; sandbox owns "which option works under controlled evidence"; architecture gate owns "can this be operated safely at production scale".

Intake-To-Sandbox-To-Decision Architecture

1. Intake Funnel

AI intake requires every idea to submit a minimum set of evidence first, rather than booking a vendor demo straight away.

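Intake field	Senior-level requirement	Disqualifying signal
Business outcome	Explicit baseline, target movement, impacted workflow, owner	"Improve efficiency", "make it intelligent", "understand customers better"
AI role	read / retrieve / summarize / draft / recommend / decide / act	Simply saying "AI handles it automatically"
Customer or regulatory impact	Whether it affects customer commitments, credit granting, KYC, AML, payments, complaints, fees or suitability	Claiming low risk only because it is an internal tool
Data boundary	Data sources, PII, PCI, accounts, transactions, documents, voice, logs, cross-border transfer, retention	Not knowing what will be sent to the vendor
Workflow insertion point	AS-IS / TO-BE nodes, human review, exception path, fallback	No process diagram, only a feature list
No-AI alternative	Process optimization, rules engine, search, RPA, knowledge governance, reporting	Assuming AI is the only option
Evidence plan	Sandbox data, rubric, benchmark, control evidence	Planning only to watch vendor demos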
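
The intake fields above lend themselves to an automated completeness check before any demo is booked. The following is a minimal sketch, not a prescribed implementation: the `IntakeCard` class, the `intake_gaps` function and the list of slogan phrases are hypothetical names introduced here for illustration, and a real intake form would need richer validation than exact string matching.

```python
from dataclasses import dataclass, fields

# Hypothetical slogan list, modelled on the "disqualifying signal" column.
VAGUE_OUTCOMES = ("improve efficiency", "make it intelligent", "understand customers better")
ALLOWED_AI_ROLES = {"read", "retrieve", "summarize", "draft", "recommend", "decide", "act"}


@dataclass
class IntakeCard:
    """One field per row of the intake table; free text for simplicity."""
    business_outcome: str = ""
    ai_role: str = ""
    customer_or_regulatory_impact: str = ""
    data_boundary: str = ""
    workflow_insertion_point: str = ""
    no_ai_alternative: str = ""
    evidence_plan: str = ""


def intake_gaps(card: IntakeCard) -> list[str]:
    """Return the reasons why this card is not ready for triage (empty list = ready)."""
    gaps = [f"missing: {f.name}" for f in fields(card) if not getattr(card, f.name).strip()]
    role = card.ai_role.strip().lower()
    if role and role not in ALLOWED_AI_ROLES:
        gaps.append("ai_role must be one of: " + ", ".join(sorted(ALLOWED_AI_ROLES)))
    if card.business_outcome.strip().lower() in VAGUE_OUTCOMES:
        gaps.append("business_outcome is a slogan, not a baseline and target")
    return gaps
```

A card whose outcome is "Improve efficiency" and whose role is "automate" would be returned with a slogan warning, a role warning and one `missing:` entry per empty field, which gives the submitter a concrete list to fix instead of a rejected demo request.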

2. Triage Gate

Sort use cases into four tiers:

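Tier	Typical scenario	Recommended action
T0: Reject / defer	No owner, no baseline, data unavailable, or unacceptable risk	Stop; fix the process or data first
T1: Learn-only sandbox	Early value hypothesis, data can be masked, no production connection	Controlled demo and architecture learning
T2: Controlled pilot candidate	Clear workflow, evaluation set available, humans retain final authority and accountability	sandbox -> limited pilot
T3: High-impact candidate	Credit, AML, KYC, payments, customer commitments, complaints or regulatory evidence	Strengthened sandbox, independent challenge, production gate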
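
The tiering above can be sketched as a small rule function. This is an illustration under stated assumptions: the `UseCase` fields, the domain labels and, in particular, the precedence order (T0 checks first, then high-impact domains, then pilot readiness) are choices made for this sketch; the table itself does not define which tier wins when several conditions hold.

```python
from dataclasses import dataclass

# Hypothetical labels for the high-impact domains listed in the T3 row.
HIGH_IMPACT_DOMAINS = frozenset(
    {"credit", "aml", "kyc", "payments", "customer_commitment", "complaints", "regulatory_evidence"}
)


@dataclass
class UseCase:
    has_owner: bool
    has_baseline: bool
    data_available: bool
    risk_acceptable: bool
    domains: frozenset = frozenset()
    workflow_defined: bool = False
    has_eval_set: bool = False
    human_final_authority: bool = False


def triage(uc: UseCase) -> str:
    """Map a use case to T0..T3. The ordering of checks is an assumption."""
    # T0: any missing precondition stops the idea before it reaches a sandbox.
    if not (uc.has_owner and uc.has_baseline and uc.data_available and uc.risk_acceptable):
        return "T0"
    # T3: touching a high-impact domain overrides pilot readiness.
    if uc.domains & HIGH_IMPACT_DOMAINS:
        return "T3"
    # T2: workflow, evaluation set and human authority are all in place.
    if uc.workflow_defined and uc.has_eval_set and uc.human_final_authority:
        return "T2"
    # T1: everything else is learn-only.
    return "T1"
```

Putting the T3 check ahead of T2 is a conservative design choice: an AML or credit use case is routed to the strengthened sandbox even when its workflow and evaluation set already look pilot-ready.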

3. Sandbox Charter

The sandbox charter must be frozen before vendor testing begins:

Section	Content
Scope	Specific workflow, user roles, allowed tasks, prohibited tasks
Data	synthetic, masked, historical, gold set, red-team set, access policy
Architecture	vendor route, internal baseline, model gateway, logging, retrieval, tool boundary
Evaluation	benchmark task, rubric, thresholds, critical failures, slice analysis
Controls	privacy, security, HITL, DLP, prompt injection, cost cap, kill switch
Evidence	trace, logs, output samples, evaluator notes, cost, latency, defect taxonomy
Decision	Decision rules for build / buy / partner / hybrid / stop
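
One way to make "frozen" verifiable is to fingerprint the charter before testing and re-check the fingerprint when results are reviewed. The sketch below assumes the charter is held as a JSON-serializable dictionary keyed by the seven sections above; the function names and the dictionary layout are hypothetical, and an institution may prefer a document management or signing workflow instead.

```python
import hashlib
import json

# Lower-cased section names from the charter table; the dict layout is an assumption.
REQUIRED_SECTIONS = ("scope", "data", "architecture", "evaluation", "controls", "evidence", "decision")


def charter_fingerprint(charter: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(charter, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def freeze_charter(charter: dict) -> str:
    """Refuse to freeze an incomplete charter; otherwise return its fingerprint."""
    missing = [s for s in REQUIRED_SECTIONS if not charter.get(s)]
    if missing:
        raise ValueError(f"charter incomplete, missing sections: {missing}")
    return charter_fingerprint(charter)


def charter_unchanged(charter: dict, frozen_fingerprint: str) -> bool:
    """True if the charter reviewed after testing is the one frozen before testing."""
    return charter_fingerprint(charter) == frozen_fingerprint
```

Recording the fingerprint in the decision minutes lets the board detect a rubric or threshold that was adjusted after vendor results came in.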

4. Decision Board

The decision board should not be left to procurement or product alone. Minimum composition:

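Role	Question to challenge
Business owner	Is the business value real, and is the owner willing to carry adoption and residual risk?
AI PM / Product owner	Are the user scenario, MVP, experience, adoption and benefit hypotheses clear?
CBAP / BA	Are process, rules, requirements, exceptions, acceptance criteria and evidence complete?
Solution architect	Are integration, data flow, RAG, agent, logging, observability and degradation feasible?
Enterprise architect	Platform reuse, capability map, vendor concentration, fit with the target architecture
Security / privacy	Do data, identity, permissions, logging, DLP and the threat model meet the bar?
Risk / compliance / model risk	Are risk tier, regulatory impact, model validation and human oversight sufficient?
Procurement / TPRM	Can vendor risk, commercial terms and follow-on contract due diligence move to the next stage?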

Build / Buy / Partner Decision Model

Build-buy-partner is not a pick-one-of-three slogan; it is a set of architecture boundary decisions.

Decision Axes

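Axis	Leans Build	Leans Buy	Leans Partner	Leans Hybrid
Differentiation	Process or data is a competitive advantage	Capability is generic and the market is mature	Industry experience needs to be transferred in	Control layer differentiates, capability layer is generic
Control need	Data, model, policy, audit and tool permissions must be controlled internally	Vendor can provide sufficient control evidence	Institution lacks the capability in the short term	Policy, eval, gateway and audit stay internal
Time to value	Can wait for internal capability to be built	Needs fast validation and launch	Needs accelerated delivery while retaining the learning	Buy first then abstract, or buy components and build the control plane
Scale economics	High volume; unit cost can be amortized by an internal platform	Volume uncertain or small to medium	Early exploration	Internalize high-risk parts, externalize commodity capability
Talent readiness	Has AI platform, data, eval, security and SRE capability	Internal team is insufficient	Needs co-build and knowledge transfer	Internal team can operate the control layer
Regulatory evidence	Stronger evidence can be generated internally	Vendor evidence is mature and exportable	Advisors are needed to fill gaps in control design	Institution-level evidence layer is unified; vendor supplies component evidence
Exit optionality	Internal architecture is replaceable	Vendor lock-in is acceptable	Dependency transfer needs a plan	Abstracted interfaces lower exit cost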

Component-Level Decision

Do not make one blanket decision for the whole use case. Break the AI system down to component level:

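Component	Common choice	Decision logic
Base model	buy or use managed model	Base models are usually not a source of differentiation for a financial retail institution
Model gateway	build or platform buy	Needs unified routing, logging, policy, cost and versioning
RAG ingestion	hybrid	Document processing can be bought; source registry and entitlement should be controlled internally
Vector store / search	buy managed or internal platform	Depends on data classification, latency, cost and regional requirements
Prompt / policy registry	build lightweight	Policy, prompts and release evidence should be auditable by the institution
Eval harness	hybrid	Tooling can be bought; golden set, rubric and gate thresholds must be owned internally
Agent tool gateway	build	High-risk actions, permissions, approval, idempotency and audit should not be handed to a black box
Human review workbench	buy, build, or existing workflow	Depends on whether it is embedded in the AML/KYC/credit/customer-service case system
Observability	hybrid	Vendor traces must flow into the internal SIEM, audit, cost and quality dashboards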

Practical Decision Rule

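Condition	Recommendation
Vendor product is strong but cannot export trace/eval/log	Usable for sandbox learning; not a production candidate
Vendor quality is good but tool action permissions cannot be controlled	buy UI/model layer only, build tool gateway
Internal model quality is average but evidence and controls are strong	Can proceed to a high-risk pilot, because in financial scenarios safety evidence matters more than the average score
Multiple vendors perform similarly	Choose the option with better architecture fit, evidence export, cost predictability and exit constraints
Use case is not differentiating but needs fast adoption	buy with strong sandbox and production gate
Use case is core risk control or a customer commitment	hybrid by default; keep the decision boundary and evidence under internal control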
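
Several of the rules above can be written down as an ordered rule list, which forces the board to agree on precedence. The sketch below covers four of the six rows; the `OptionFacts` fields and the order in which rules fire are assumptions made for illustration, since the table does not say which rule wins when more than one condition is true.

```python
from dataclasses import dataclass


@dataclass
class OptionFacts:
    """Hypothetical facts about one vendor option, gathered in the sandbox."""
    can_export_evidence: bool            # trace / eval / log export is available
    tool_permissions_controllable: bool  # tool actions can be scoped and approved
    core_risk_or_commitment: bool        # core risk control or customer commitment
    differentiating: bool                # the capability is a competitive advantage
    needs_fast_adoption: bool


def recommend(facts: OptionFacts) -> str:
    """Apply the rules in a fixed order; the first match wins (assumed precedence)."""
    if not facts.can_export_evidence:
        return "sandbox learning only; not a production candidate"
    if facts.core_risk_or_commitment:
        return "hybrid by default; keep decision boundary and evidence internal"
    if not facts.tool_permissions_controllable:
        return "buy UI/model layer only; build tool gateway"
    if not facts.differentiating and facts.needs_fast_adoption:
        return "buy with strong sandbox and production gate"
    return "no rule matched; escalate to decision board"
```

Checking evidence export first reflects the first row of the table: an option that cannot export trace, eval or log data never reaches the later rules, however strong its other properties are.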

Vendor Sandbox And Benchmark Design

Sandbox Design Principles

Same task, same data, same rubric, same cost and latency measurement.
Compare at least the vendor option, an internal baseline and a no-AI baseline.
Test positive cases, negative cases, edge cases, abuse cases, stale-source cases and restricted-data cases.
Outputs must be traceable to prompt, model, source, tool call, reviewer and decision.
The sandbox may use only approved data and must not connect to production write actions.
Benchmark conclusions must cover architecture fit, not just accuracy.

Benchmark Plan

Dimension	Measurement	Release implication
Task quality	groundedness, completeness, policy compliance, extraction accuracy, narrative quality	Determines whether the workflow outcome is met
Critical failure	hallucinated commitment, missed red flag, unauthorized advice, PII leakage, wrong adverse action	High-risk use cases usually require zero
Retrieval quality	source recall, citation correctness, freshness, entitlement respect	Determines whether RAG is usable in controlled production
Tool safety	allowed tool choice, argument correctness, approval compliance, idempotency	Determines whether the agent may have tools enabled
Human oversight	reviewer agreement, override rate, review time, escalation quality	Determines whether HITL is genuinely effective
Cost	cost per case, token variance, document cost, eval cost, monitoring cost	Determines TCO and scale feasibility
Latency	p50, p95, timeout, retry, end-to-end workflow time	Determines customer experience and operational queue viability
Security / privacy	prompt injection result, DLP pass, data retention proof, access control	Determines whether the option can enter pilot
Architecture fit	API, IAM, audit export, SIEM, model gateway, data residency, change control	Determines the build-buy-partner boundary
Evidence maturity	eval export, trace completeness, admin audit, versioning, incident evidence	Determines whether financial audit and model risk requirements are met
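
A benchmark plan of this kind is easier to defend when critical failures act as a hard gate rather than as one more term in a weighted average. The sketch below shows that idea for four of the dimensions (task quality, critical failure, latency, cost). All names and threshold fields are hypothetical, and the nearest-rank percentile is one simple convention among several; other tools may interpolate and report slightly different p95 values.

```python
import math
from dataclasses import dataclass, field


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class BenchmarkRun:
    option: str
    quality: float            # rubric score in [0, 1], averaged over tasks
    critical_failures: int    # e.g. PII leakage, hallucinated commitment
    cost_per_case: float
    latencies_ms: list = field(default_factory=list)


@dataclass
class Thresholds:
    """Gate thresholds frozen in the sandbox charter (values are assumptions)."""
    min_quality: float
    max_p95_ms: float
    max_cost_per_case: float
    max_critical_failures: int = 0


def gate_failures(run: BenchmarkRun, t: Thresholds) -> list[str]:
    """Return every gate the run fails; a high quality score cannot offset any of them."""
    reasons = []
    if run.critical_failures > t.max_critical_failures:
        reasons.append("critical failures above limit")
    if run.quality < t.min_quality:
        reasons.append("task quality below threshold")
    if percentile(run.latencies_ms, 95) > t.max_p95_ms:
        reasons.append("p95 latency above limit")
    if run.cost_per_case > t.max_cost_per_case:
        reasons.append("cost per case above limit")
    return reasons
```

Under this design an option with an excellent average rubric score and a single critical failure still fails the gate, which matches the "usually require zero" rule for high-risk use cases.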

Sandbox Evidence Pack

Every vendor or internal option must produce:

Evidence object	Content
Option card	vendor/internal option, deployment model, model/provider, data route, components
Data map	Input fields, documents, logs, embeddings, retention, masking, region
Benchmark report	dataset, tasks, rubric, score, confidence, slice failures
Failure taxonomy	critical, high, medium, low failure examples and root cause
Trace sample	prompt version, retrieval results, model version, output, tool calls, reviewer action
Cost and latency sheet	unit economics, p50/p95 latency, rate limit, stress result
Architecture fit review	integration, IAM, observability, RAG, agent boundary, platform fit
Risk review	privacy, security, third-party, model risk, compliance, operational resilience
Recommendation	build / buy / partner / hybrid / stop, with conditions and reversal triggers
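
Before the architecture review board meets, the evidence pack for each option can be checked for completeness. The following sketch assumes each pack is a dictionary keyed by the nine evidence objects above and that the recommendation entry carries a `decision` and a list of `reversal_triggers`; those key names are hypothetical and chosen only for this example.

```python
# Snake-case keys for the nine evidence objects in the table (naming is an assumption).
REQUIRED_EVIDENCE = (
    "option_card", "data_map", "benchmark_report", "failure_taxonomy", "trace_sample",
    "cost_and_latency_sheet", "architecture_fit_review", "risk_review", "recommendation",
)
ALLOWED_DECISIONS = {"build", "buy", "partner", "hybrid", "stop"}


def evidence_pack_gaps(pack: dict) -> list[str]:
    """List missing evidence objects and incomplete recommendation fields."""
    gaps = [name for name in REQUIRED_EVIDENCE if not pack.get(name)]
    rec = pack.get("recommendation") or {}
    if rec:
        if rec.get("decision") not in ALLOWED_DECISIONS:
            gaps.append("recommendation.decision")
        if not rec.get("reversal_triggers"):
            gaps.append("recommendation.reversal_triggers")
    return gaps


def options_ready_for_review(packs: dict) -> dict:
    """Map each option name to its gaps; only options with no gaps go to the board."""
    return {option: evidence_pack_gaps(pack) for option, pack in packs.items()}
```

Running the same check over the vendor option and the internal baseline keeps the comparison symmetric: an option does not look stronger merely because less evidence was collected for it.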

Financial Retail Scenarios

1. GenAI Contact Center Copilot

Intake decision	Sandbox focus	Architecture decision
AI drafts agent guidance, not customer commitments	policy answer, citation, escalation, vulnerable customer handling	buy copilot UI if strong; build knowledge governance, eval, telemetry export

Hard failures:

AI promises fee reversal outside policy.
AI misses complaint language or vulnerable customer marker.
AI cites stale product terms.
AI outputs account data to unauthorized role.

2. KYC Document Intelligence

Intake decision	Sandbox focus	Architecture decision
AI extracts and reconciles document facts; final KYC disposition stays human/system controlled	field extraction, document fraud signals, missing document checklist, data lineage	buy OCR/extraction, partner for policy tuning, build case workflow and evidence layer

Hard failures:

Wrong identity attribute without confidence flag.
Document retention exceeds approved period.
Evidence cannot be exported for audit or regulator inquiry.
Model cannot handle jurisdiction-specific document rules.

3. AML Investigation Workbench

Intake decision	Sandbox focus	Architecture decision
AI summarizes evidence and drafts narrative; no final SAR/no-SAR decision	red flag coverage, source-grounded narrative, analyst override, missed-risk rate	hybrid; internal control over data, RAG, audit, case action boundary

Hard failures:

AI omits material suspicious activity.
AI invents transaction rationale.
AI suggests final SAR decision as authoritative.
Case trace cannot reconstruct evidence used.

4. Credit Decision Support

Intake decision	Sandbox focus	Architecture decision
AI supports memo drafting and policy retrieval; credit decision remains governed by approved decisioning process	policy retrieval, adverse-action boundary, fair lending slice, explanation evidence	hybrid; build decision boundary and model risk evidence, buy retrieval or document summarization components only

Hard failures:

AI uses protected-class proxy or unsupported inference.
AI drafts adverse action reason not supported by system of record.
Human reviewers over-rely without challenge.
Vendor cannot provide versioned evidence for model/prompt changes.

5. Payments Fraud Intervention

Intake decision	Sandbox focus	Architecture decision
AI recommends intervention scripts and case prioritization; payment block/release requires deterministic policy and approval	false-positive customer harm, scam typology coverage, latency, tool permissions	hybrid; build tool gateway, approval and audit; consider buy for scam narrative intelligence

Hard failures:

AI triggers payment action without authorization.
AI misses urgent scam indicators.
Latency breaks real-time intervention window.
Tool call cannot be replayed or reversed.

Metrics / Control / Evidence Model

Use a three-layer evidence model: metric proves behavior, control constrains risk, evidence proves the control operated.

Layer	Examples	Owner
Outcome metrics	handle time, document review cycle time, AML case completeness, fraud intervention conversion, complaint escalation accuracy	business owner and PM
Quality metrics	groundedness, extraction accuracy, source coverage, critical failure rate, reviewer agreement, override reason	EvalOps and domain SMEs
Risk metrics	PII leakage, unauthorized tool call, policy violation, under-escalation, biased slice regression, stale-source answer	risk, compliance, model risk
Operational metrics	p50/p95 latency, timeout, fallback, cost per case, rate-limit hit, support ticket volume	platform and operations
Adoption metrics	active users, task completion, accepted suggestions, edit distance, trust survey, manual fallback rate	PM and operations

Control mapping:

Risk	Preventive control	Detective control	Corrective control	Evidence
Vendor selected before problem clarity	intake completeness gate	funnel review log	reject/defer decision	intake card, decision minutes
Demo bias	common benchmark plan	score normalization	re-run benchmark	benchmark report
Sensitive data leakage	masking, DLP, approved sandbox data	payload sampling, DLP alert	delete, notify, retrain reviewers	data map, DLP test
Prompt injection	red-team cases, tool isolation	injection failure monitor	disable route, update guardrail	red-team report
Cost runaway	budget cap, token limits	cost dashboard	throttle, switch model, revise scope	cost sheet
Over-reliance	UI uncertainty, mandatory review for high risk	override and review audit	retraining, stricter HITL	reviewer logs
Architecture lock-in	gateway, export requirement, component boundaries	dependency review	redesign or reject vendor	ADR, architecture map
Production promotion without evidence	release gate checklist	evidence binder completeness check	limited pilot or no-go	gate memo

Anti-Patterns And Failure Modes

Anti-pattern	Why it fails	Better pattern
Vendor-first discovery	Demo defines problem and success criteria	Intake starts from outcome, workflow, risk and baseline
Accuracy-only scorecard	Ignores audit, latency, cost, data, tool safety and architecture fit	Multi-dimensional sandbox scorecard
PoC using production data without control	Creates privacy and shadow-production risk	Approved sandbox data boundary and DLP
One build-buy decision for the whole system	Hides component-level control needs	Component decision matrix
Vendor black-box RAG	Cannot prove source, freshness, entitlement or citation	Source registry and retrieval trace
Contract promises without technical enforcement	Rights cannot be exercised in operations	Contract-control-evidence mapping
Pilot becomes production by adoption pressure	Controls arrive after risk exposure	Production promotion gate with hard stop criteria
No no-AI baseline	AI value cannot be defended	Compare process/rules/search baseline
Weak cost measurement	Token and eval cost surprise at scale	Cost per case and capacity model
Missing exit constraints at intake	Lock-in discovered after integration	Exit constraints and concentration risk before pilot

Architecture Mapping To RAG / Agent / Copilot / Eval / Governance

Architecture pattern	Intake question	Sandbox test	Production gate evidence
RAG	Which sources are authoritative, current and permissioned?	citation correctness, stale-source failure, ACL filtering, retrieval recall	source registry, index version, retrieval eval, access review
Agent	What tools can be called, with what authority and side effects?	tool choice accuracy, argument validation, approval path, idempotency	tool policy, audit trace, kill switch, rollback test
Copilot	What does the human see, edit, approve, reject and own?	reviewer agreement, override rate, UX trust calibration, escalation	HITL log, training, adoption and quality dashboard
Eval	What behavior contract must be proven before release?	golden set, red-team set, slice metrics, critical failures	eval report, threshold decision, exception memo
Governance	Who owns AI risk, change, evidence, incident and lifecycle?	RACI simulation, gate dry run, evidence completeness	AI inventory, ADR, risk acceptance, operating cadence
Model gateway	Which providers and versions are allowed?	route comparison, fallback, cost/latency benchmark	routing policy, model registry, telemetry export
Observability	Can one case be reconstructed end to end?	trace completeness, log redaction, SIEM export	evidence binder, retention setting, audit export

ADR Draft

ADR: AI Procurement Intake And Vendor Sandbox Decision Architecture

Status: Proposed
Date: 2026-06-30

Context:
Financial retail AI initiatives are entering the portfolio through business ideas, vendor pitches,
executive pressure and local productivity experiments. Without a common intake and sandbox
architecture, teams may select vendors before defining outcomes, data boundaries, eval contracts,
architecture fit, risk tier and production evidence gates.

Decision:
Create an upstream AI procurement intake and vendor evaluation sandbox architecture. Every AI
candidate must pass intake completeness, use-case triage, sandbox charter, controlled benchmark,
build-buy-partner decision record and production promotion gate before procurement contracting
or production integration.

Options Considered:
1. Let procurement run standard vendor questionnaires first.
2. Let product teams run ad hoc PoCs and bring winners to architecture review.
3. Establish a common intake-to-sandbox-to-decision architecture across product, risk, procurement and architecture.

Decision Rationale:
Option 3 keeps AI vendor selection evidence-based and architecture-aware. It prevents demo bias,
shadow production, uncontrolled data exposure and premature lock-in. It also gives downstream
contracting, third-party risk, model validation and executive funding a stronger evidence base.

Consequences:
- Teams must define outcome, workflow, AI role, data boundary, no-AI baseline and eval contract before vendor testing.
- Vendors must be compared against the same sandbox tasks, rubric, cost, latency, risk and architecture-fit criteria.
- Production promotion requires traceable evidence, not product enthusiasm.
- The organization must maintain reusable templates, datasets, scorecards and gate records.

Reversal Triggers:
- Intake gate blocks too many low-risk experiments without learning value.
- Sandbox cycle time becomes disproportionate to risk tier.
- Platform-level approved patterns make some repeated sandbox steps redundant.
- Regulatory, legal or internal policy changes require a stronger or different gate.

Interview Answer

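30-second version

I treat AI procurement intake as an architecture control point, not a procurement form. I first triage on outcome, workflow, AI role, data boundary, risk tier and no-AI baseline, then use a controlled sandbox to benchmark the build, buy, partner and hybrid options on the same data, the same rubric and the same cost and latency measures. Finally, the evidence pack supports the ADR, risk acceptance and production gate, so that a vendor demo cannot turn directly into a production system.

2-minute version

In financial retail, for example an AML investigation workbench or KYC document intelligence, I would not start by asking which vendor has the best demo. I would first set up the intake funnel: what the business outcome is; whether AI retrieves, summarizes, recommends or acts within the workflow; which data enters prompts, embeddings, logs and the vendor; which behaviors require human approval; and what the no-AI baseline is. After triage, I would design a sandbox charter and compare the vendor option, the internal option and the process-only option using the same masked or synthetic data, golden set, red-team cases and business rubric. The evaluation looks beyond accuracy to critical failures, grounding, permission filtering, cost, p95 latency, trace completeness, audit export, model change, tool permissions and architecture fit. The decision is not a simple buy or no-buy but a component-level build-buy-partner call: for example, we can buy OCR or a copilot UI while keeping the model gateway, RAG source registry, eval contract, tool gateway and audit evidence in-house. Only when the evidence pack can support risk, privacy, security, model validation and the production promotion gate does the option move to a controlled pilot or the procurement contract stage.

CTO version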

I would institutionalize AI procurement intake as a control-plane pattern. The control plane has five artifacts: intake card, sandbox charter, benchmark evidence, build-buy-partner ADR and production promotion packet. It prevents premature vendor coupling by forcing every option through the same workflow, data, risk, evaluation, cost, latency and architecture-fit tests. At component level, I would usually buy commodity capability, build policy and evidence control points, and partner only where domain transfer is required. For regulated financial use cases, the winning option is not the one with the best demo; it is the option that can be integrated through our identity, data, RAG, tool, eval, observability, incident and audit architecture while keeping concentration risk and exit constraints explicit.

7-Day Practice Plan

Day	Practice	Output
1	Choose one use case: GenAI contact center, KYC document intelligence, AML workbench, credit support, or payments fraud	One-page intake card with outcome, workflow, AI role, data and no-AI option
2	Draw AS-IS / TO-BE workflow and decision authority boundary	BPMN-lite flow plus allowed, restricted and prohibited AI actions
3	Build sandbox charter	Data plan, task list, rubric, critical failures, cost and latency measures
4	Create vendor scorecard and internal baseline comparison	Weighted scorecard with architecture-fit criteria
5	Write component-level build-buy-partner decision	Component matrix for model, RAG, eval, gateway, workflow, observability
6	Assemble evidence model	Metrics, controls, evidence objects, risk acceptance and gate thresholds
7	Practice interview narrative	30-second, 2-minute and CTO answers, plus one financial scenario deep dive

Source Anchors

Anchor	Link	How this note uses it
NIST AI Risk Management Framework	https://www.nist.gov/itl/ai-risk-management-framework	Uses Govern / Map / Measure / Manage to organize AI risk, evidence, monitoring and management action
NIST AI RMF Generative AI Profile	https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence	Uses the GenAI risk lens to design sandbox red-team, content provenance, data leakage and misuse cases
ISO/IEC 42001 AI management systems	https://www.iso.org/standard/81230.html	Uses AI management system thinking to define accountability, lifecycle, operation, performance evaluation and improvement
ISO/IEC/IEEE 29148 Requirements engineering	https://www.iso.org/standard/72089.html	Uses requirements quality, stakeholder concern and validation thinking to underpin the intake and eval contract
ISO/IEC/IEEE 42010 Architecture description	https://www.iso.org/standard/74393.html	Uses stakeholder concern, viewpoint and architecture rationale to organize the ADR and architecture fit review
Interagency Third-Party Risk Guidance, FDIC FIL-29-2023	https://www.fdic.gov/news/financial-institution-letters/2023/fil23029.html	Uses third-party lifecycle thinking to connect planning, due diligence, selection, monitoring and termination inputs
FFIEC AIO booklet summary, OCC Bulletin 2021-30	https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-30.html	Uses the architecture, infrastructure and operations lens to check resilience, integration, operations and evidence
OWASP Top 10 for Large Language Model Applications	https://owasp.org/www-project-top-10-for-large-language-model-applications/	Uses prompt injection, sensitive information disclosure, supply chain and excessive agency to design security tests