AI Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Playbook

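Positioning: written for experienced CBAPs, financial retail PMs, product architects, solution architects and AI governance leads. It upgrades model selection from a one-off comparison to a continuous model portfolio governance operating model.
Scope boundary: this playbook does not replace model routing strategy, procurement sandboxes, board-level investment narratives or AI FinOps. It defines how a model portfolio is registered, evaluated, scored, approved, challenged, monitored and retired.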

Purpose and when to use

Purpose

Use this playbook to establish an executable model portfolio governance mechanism:

business capability
  -> AI task taxonomy
  -> model portfolio inventory
  -> model card and approved-use boundary
  -> benchmark pack
  -> capability scorecard
  -> model selection record
  -> champion/challenger lifecycle
  -> release evidence packet
  -> retirement trigger

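The goal is not to pick the "strongest model in the company" but to let teams answer:

Which model families are approved for which tasks?
Which models may only produce a draft, summary, classification, retrieval answer or extraction, and may not make a decision?
Which benchmark packs support this selection?
How do the red-team, domain, latency, cost, security and governance scores affect the approval scope?
How does a new model challenge the champion when it enters the portfolio?
Under what conditions must an old model be watchlisted, constrained, replaced or retired?
How can audit, model risk and product decision-makers re-examine the evidence that was available at decision time?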

When to use

Trigger	Use this playbook to produce
A new AI use case enters the portfolio	task taxonomy, candidate model list, first benchmark pack
A product team wants to switch models	challenger benchmark and model selection record
A supplier releases a new model version	change impact review and re-benchmark decision
An open-weight model enters the enterprise platform	open/proprietary comparison, hosting and safety evidence
Before a high-risk release	scorecard, hard blocker review, evidence packet
Complaints, overrides or security incidents occur in production	regression pack update, watchlist and retirement review
Quarterly AI governance review	model portfolio health, concentration, challenger freshness, retirement backlog

Operating model

Governance bodies

Body	Decision rights	Core outputs
AI product council	business priorities, use case scope, user and workflow boundaries	approved use case and product scope
Model selection council	champion/challenger, approved use, restrictions, retirement	model selection record
AI architecture review	deployment mode, integration, trace, fallback, data boundary	architecture fit review
Model risk / independent challenge	high-risk benchmark design, thresholds, evidence sufficiency challenge	challenge memo
Security / privacy review	data route, DLP, IAM, prompt injection, logging, retention	security/privacy sign-off
Operations review	latency, availability, human review, queue impact, fallback readiness	operational readiness record

RACI

Activity	PM	CBAP / BA	AI architect	EvalOps	Security / privacy	Model risk	Business owner	Operations
Task taxonomy	A/R	R	C	C	I	C	C	C
Model inventory	C	C	A/R	R	C	C	I	I
Benchmark pack	A	R	R	R	C	A/R	C	C
Capability scorecard	A/R	R	R	R	C	C	C	C
Selection record	A/R	C	R	C	C	C	A	C
Release evidence	R	R	R	A/R	R	C	A	R
Retirement trigger review	A/R	C	R	R	C	C	A	R

A = accountable, R = responsible, C = consulted, I = informed.

Cadence

Cadence	Meeting / activity	Decision
Per use case intake	task taxonomy and candidate model review	which model families enter benchmark
Per release	scorecard and hard blocker review	go, no-go, limited release or rerun
Monthly	challenger and incident review	promote, watchlist, constrain or remediate
Quarterly	portfolio review	concentration, stale models, retirement backlog, benchmark refresh
Event-driven	supplier/model/data/policy/security change	emergency re-benchmark, freeze expansion or retire

Template: model portfolio inventory

Field	Required content	Example
model_id	Enterprise unique identifier	`KYC-OCR-SPECIALIST-2026-06`
model_name	Provider or internal name	Specialist document extractor
model_family	general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety	OCR / extraction
provider_mode	proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor component	managed cloud
deployment_region	runtime region and data residency	US region only
version_boundary	model snapshot, API date, fine-tune id, guardrail version	model v3.2, OCR layout pack v7
approved_use	permitted tasks and workflows	KYC ID and address proof extraction with human review for exceptions
prohibited_use	decisions or data use not permitted	no final KYC approval, no customer communication
risk_tier	low / medium / high / critical	high
data_boundary	PII, PCI, transaction, document, voice, training use, retention	PII allowed; no training; logs retained 30 days
benchmark_pack	pack id and latest run	`KYC-DOC-EXTRACT-v2026.06`, pass
scorecard_summary	main score and blockers	strong extraction, weak on handwritten docs, no critical blocker
lifecycle_status	candidate / challenger / champion / constrained / watchlisted / retired	constrained champion
owner	product, architecture, operational and risk owners	KYC PO + AI platform architect + onboarding ops
review_expiry	time or event trigger	quarterly or document policy change
fallback_model	fallback or no-AI baseline	manual review queue
evidence_location	GRC/evidence binder id	`AIGOV-MODEL-PORT-2026-168-KYC`
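
As a minimal sketch, the inventory template above could be held as a typed record so that incomplete entries are caught at intake. The class name `ModelInventoryRecord`, the `RISK_TIERS` set and the `validate` helper are illustrative assumptions, not part of the playbook; the field names and example values are taken from the table, and only a subset of fields is modeled.

```python
from dataclasses import dataclass, fields

# Allowed tiers as listed in the risk_tier row of the template.
RISK_TIERS = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class ModelInventoryRecord:
    """Hypothetical record type covering a subset of the inventory fields."""
    model_id: str
    model_family: str
    provider_mode: str
    version_boundary: str
    approved_use: str
    prohibited_use: str
    risk_tier: str
    lifecycle_status: str
    fallback_model: str
    evidence_location: str


def validate(record: ModelInventoryRecord) -> list[str]:
    """Return a list of problems; an empty list means the record looks complete."""
    problems = [
        f"{f.name} is empty"
        for f in fields(record)
        if not getattr(record, f.name).strip()
    ]
    if record.risk_tier not in RISK_TIERS:
        problems.append(
            f"risk_tier {record.risk_tier!r} is not one of {sorted(RISK_TIERS)}"
        )
    return problems


# Example values copied from the template's Example column.
kyc = ModelInventoryRecord(
    model_id="KYC-OCR-SPECIALIST-2026-06",
    model_family="OCR / extraction",
    provider_mode="managed cloud",
    version_boundary="model v3.2, OCR layout pack v7",
    approved_use="KYC ID and address proof extraction with human review for exceptions",
    prohibited_use="no final KYC approval, no customer communication",
    risk_tier="high",
    lifecycle_status="constrained champion",
    fallback_model="manual review queue",
    evidence_location="AIGOV-MODEL-PORT-2026-168-KYC",
)
print(validate(kyc))
```

A record with an empty `prohibited_use` or an unrecognized `risk_tier` would come back with one problem per gap, which gives the intake review a concrete list to resolve before benchmarking starts.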

Template: capability scorecard

Use 1-5 scoring with written rationale. For high-risk workflows, hard blockers override weighted totals.

Dimension	Weight	Evidence required	Notes
Task quality	14	benchmark pass rate, SME sample review	Does it perform the intended task, not a generic task?
Domain score	14	domain edge cases, policy/version tests	Does it understand financial retail rules and exceptions?
Red-team score	14	adversarial prompt, misuse, privacy and unsafe advice tests	Any critical failure blocks high-risk approval.
Grounding/citation	10	citation support, source recall, entitlement tests	Required for RAG and policy support.
Robustness	8	mutation tests, language/channel/noise variation	Weak robustness constrains scope.
Security/data boundary	10	IAM, DLP, retention, logging, training use evidence	Must match data classification.
Latency/reliability	8	p50/p95, timeout, availability, fallback	Must fit workflow queue and customer experience.
Cost fitness	6	cost per case, variance, scale assumptions	Cost informs scope, but does not override safety.
Human oversight fit	6	reviewer workflow, override, escalation logs	Prevents hidden automation.
Governance readiness	10	model card, version lock, trace export, retirement trigger	Required for audit and change control.
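
The weights in the table sum to 100, so one plausible reading is a weighted mean on the 1-5 scale. The sketch below assumes that reading; the dimension keys, the `weighted_score` function and the example scores are hypothetical, while the weights are copied from the table.

```python
# Weights copied from the scorecard table; key names are illustrative.
WEIGHTS = {
    "task_quality": 14,
    "domain_score": 14,
    "red_team_score": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted mean of 1-5 dimension scores (assumed aggregation rule)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in WEIGHTS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())


# Hypothetical candidate: solid everywhere except a weak red-team result.
example = {dim: 4 for dim in WEIGHTS}
example["red_team_score"] = 2
print(round(weighted_score(example), 2))  # 3.72
```

In this made-up case a red-team score of 2 only moves the total from 4.0 to 3.72, which illustrates why a weighted total on its own is a weak gate for high-risk workflows.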

Hard blocker review

Blocker	Pass evidence	Decision if failed
Critical customer harm	no critical harm in approved red-team/domain set	reject or constrain to lab
Unauthorized decision	AI role cannot make final credit/KYC/AML/fraud decision unless explicitly approved	redesign workflow
PII or restricted data leakage	DLP and entitlement tests pass	block release
Unsupported regulated advice	no unsupported adverse action, legal/financial advice or customer commitment	block high-risk use
Missing trace	prompt, model, retrieval, output and human action are reproducible	no regulated release
Silent version drift	version boundary and supplier change notice captured	freeze approval
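
One way to express "hard blockers override weighted totals" is to check blockers before the total is even considered. This is a sketch under assumptions: the blocker keys, the `hard_blocker_review` function and the `pass_mark` of 3.5 are hypothetical, the sketch treats every call as a high-risk review, and the failure decisions are copied from the table.

```python
# Decision-if-failed text copied from the hard blocker table; keys are illustrative.
HARD_BLOCKER_ACTIONS = {
    "critical_customer_harm": "reject or constrain to lab",
    "unauthorized_decision": "redesign workflow",
    "pii_or_restricted_data_leakage": "block release",
    "unsupported_regulated_advice": "block high-risk use",
    "missing_trace": "no regulated release",
    "silent_version_drift": "freeze approval",
}


def hard_blocker_review(
    weighted_total: float,
    failed_blockers: set[str],
    pass_mark: float = 3.5,  # hypothetical threshold, not defined by the playbook
) -> dict:
    """Blockers are evaluated first; the weighted total matters only if none failed."""
    unknown = failed_blockers - set(HARD_BLOCKER_ACTIONS)
    if unknown:
        raise ValueError(f"unknown blockers: {sorted(unknown)}")
    if failed_blockers:
        actions = [a for b, a in HARD_BLOCKER_ACTIONS.items() if b in failed_blockers]
        return {"decision": "blocked", "actions": actions}
    decision = "eligible for council decision" if weighted_total >= pass_mark else "below pass mark"
    return {"decision": decision, "actions": []}


# A high total does not rescue a candidate with a failed blocker.
print(hard_blocker_review(4.6, {"missing_trace"}))
print(hard_blocker_review(4.0, set()))
```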

Template: benchmark pack

Section	Required content	Example: credit policy RAG
pack_id	unique benchmark identifier	`CREDIT-POLICY-RAG-v2026.06`
use_case	workflow and AI role	underwriter policy support, answer with citations
approved scope	products, regions, users, channels	consumer credit card and personal loan policy
out of scope	explicitly excluded work	no final underwriting decision, no adverse action notice generation
model options	champion, challenger, baseline	frontier proprietary, open-weight hosted, existing search
dataset	source, version, sample size, coverage	500 policy Q&A, 80 edge cases, 60 stale/conflict cases
domain slices	product, jurisdiction, effective date, customer segment	state-specific rules, exception process, protected-class boundary
red-team slices	prompt injection, unsafe advice, restricted data, overconfidence	user asks for decline reason not in policy
rubric	scoring method and severity	groundedness, citation, completeness, refusal, escalation
critical failures	zero tolerance failures	unsupported adverse action reason, fabricated policy, restricted source leak
run protocol	model version, prompt, retrieval config, judge model, repeats	prompt v12, retriever v5, 3 repeated runs for unstable cases
operational test	latency, timeout, stress, fallback	p95 under workflow target, fallback to policy search
evidence output	files and records retained	traces, scorecard, SME notes, selection record
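
The zero-tolerance rule for critical failures, combined with repeated runs, can be sketched as follows. The assumption here is that a critical failure in any repeat of any case fails the pack; the failure tags, case ids, run records and the `summarize_runs` function are hypothetical, and the three critical failure types mirror the example column.

```python
# Critical failure types from the credit policy RAG example; tag names are illustrative.
CRITICAL_FAILURES = {
    "unsupported_adverse_action_reason",
    "fabricated_policy",
    "restricted_source_leak",
}


def summarize_runs(runs: list[dict]) -> dict:
    """Each run is one (case, repeat) with a 'case_id' and a list of failure tags."""
    critical_cases = sorted(
        {r["case_id"] for r in runs if CRITICAL_FAILURES & set(r["failures"])}
    )
    clean_runs = sum(1 for r in runs if not r["failures"])
    return {
        "runs": len(runs),
        "clean_run_rate": clean_runs / len(runs),
        "critical_cases": critical_cases,
        # Any critical failure fails the pack, regardless of the clean-run rate.
        "verdict": "fail: critical failure" if critical_cases else "no critical failure",
    }


# Hypothetical results: one unstable case repeated three times, one case run twice.
runs = [
    {"case_id": "Q-017", "failures": []},
    {"case_id": "Q-017", "failures": ["fabricated_policy"]},
    {"case_id": "Q-017", "failures": []},
    {"case_id": "Q-002", "failures": []},
    {"case_id": "Q-002", "failures": ["incomplete_answer"]},
]
print(summarize_runs(runs))
```

Case `Q-017` passes two of three repeats yet still fails the pack, which is the point of repeating unstable cases rather than sampling each once.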

Benchmark result summary

Option	Task score	Domain score	Red-team score	p95 latency	Cost/case	Governance readiness	Recommendation
Current champion	4.2	4.0	4.5	4.8s	$0.38	high	keep, monitor stale-policy slice
Frontier challenger	4.6	4.4	4.2	6.2s	$0.71	medium	constrained pilot for complex cases
Open-weight challenger	3.9	3.6	3.8	3.9s	$0.29	high if hosted internally	continue challenge after red-team tuning
No-AI baseline	3.1	3.3	5.0	human search	staff time	high	keep as fallback
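
To make the trade-offs in the summary explicit, each challenger can be compared to the current champion dimension by dimension. The sketch below copies the numeric columns from the table; the `compare` function and key names are hypothetical, and the no-AI baseline is left out because its latency and cost are not numeric.

```python
# Numeric columns copied from the benchmark result summary.
RESULTS = {
    "current_champion": {"task": 4.2, "domain": 4.0, "red_team": 4.5,
                         "p95_latency_s": 4.8, "cost_per_case_usd": 0.38},
    "frontier_challenger": {"task": 4.6, "domain": 4.4, "red_team": 4.2,
                            "p95_latency_s": 6.2, "cost_per_case_usd": 0.71},
    "open_weight_challenger": {"task": 3.9, "domain": 3.6, "red_team": 3.8,
                               "p95_latency_s": 3.9, "cost_per_case_usd": 0.29},
}

# Scores improve upward; latency and cost improve downward.
HIGHER_IS_BETTER = {"task": True, "domain": True, "red_team": True,
                    "p95_latency_s": False, "cost_per_case_usd": False}


def compare(challenger: str, champion: str = "current_champion"):
    """Return per-dimension deltas and the dimensions where the challenger is worse."""
    deltas, regressions = {}, []
    for dim, higher_is_better in HIGHER_IS_BETTER.items():
        delta = round(RESULTS[challenger][dim] - RESULTS[champion][dim], 2)
        deltas[dim] = delta
        worse = delta < 0 if higher_is_better else delta > 0
        if worse:
            regressions.append(dim)
    return deltas, regressions


for name in ("frontier_challenger", "open_weight_challenger"):
    deltas, regressions = compare(name)
    print(name, deltas, "regressions:", regressions)
```

On these numbers the frontier challenger gains on task and domain scores but regresses on red-team, latency and cost, which is consistent with recommending a constrained pilot rather than promotion.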

Template: model selection record

Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:

Decision:
Approve / reject / constrain / watchlist / retire / keep as challenger.

Selected champion:
Model id, version boundary, model family, provider mode.

Approved use:
Tasks, channels, users, risk tier, workflow steps and human role.

Prohibited use:
Decisions, customer-facing commitments, restricted data or autonomous actions not permitted.

Options considered:
1. Current champion.
2. Challenger A.
3. Challenger B.
4. No-AI or legacy baseline.

Evidence reviewed:
- model cards:
- benchmark pack:
- capability scorecard:
- red-team report:
- domain SME review:
- latency/reliability test:
- security/privacy review:
- model risk challenge:
- production monitoring signals:

Decision rationale:
Explain why this option is fit for the approved use, what tradeoffs are accepted and which scorecard dimensions drove the decision.

Conditions:
List controls, limits, monitoring, human review, fallback and expiry.

Retirement / revalidation triggers:
List exact events that invalidate approval.

Residual risk:
Describe remaining known risks and why the business owner accepts or rejects them.

Approval:
Business owner:
Product owner:
Architecture:
Security/privacy:
Model risk / governance:
Operations:

Template: retirement trigger

Trigger	Signal	Required action	Example
Critical red-team regression	new prompt injection, privacy leak, unsafe advice or unauthorized action	freeze expansion, incident review, rerun safety pack	enterprise knowledge assistant leaks restricted HR document
Domain policy change	product, jurisdiction, regulation or procedure changed	rerun impacted domain benchmark	credit policy effective date changed
Supplier version drift	provider changes model behavior, logs, retention or terms	change impact review and re-approval	API snapshot replaced by provider
Performance drift	production pass rate, override, complaint or QA score degrades	watchlist and benchmark rerun	contact center unsupported answers increase
Operational unfit	latency, timeout, availability or rate limit misses SLO	constrain scope or fallback	fraud intervention p95 exceeds queue window
Security issue	vulnerability, dependency issue, hosting patch gap, license concern	patch or retire	open-weight runtime CVE affects hosted model
Evidence expiry	model card, benchmark, approval or review period expired	re-benchmark or suspend new use	quarterly review missed for high-risk model
Better challenger	challenger materially improves safety/domain/operability with sufficient evidence	promotion decision and migration plan	KYC extractor reduces manual review while preserving exception routing
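
The trigger table can be read as a lookup from observed signal to required action, with evidence expiry derived from the review date. The following is a sketch under that reading; the trigger keys, the `triggered_actions` function and the example dates are hypothetical, and the action text is copied from the table.

```python
from datetime import date

# Required actions copied from the retirement trigger table; keys are illustrative.
TRIGGER_ACTIONS = {
    "critical_red_team_regression": "freeze expansion, incident review, rerun safety pack",
    "domain_policy_change": "rerun impacted domain benchmark",
    "supplier_version_drift": "change impact review and re-approval",
    "performance_drift": "watchlist and benchmark rerun",
    "operational_unfit": "constrain scope or fallback",
    "security_issue": "patch or retire",
    "evidence_expiry": "re-benchmark or suspend new use",
    "better_challenger": "promotion decision and migration plan",
}


def triggered_actions(
    observed: set[str], review_expiry: date, today: date
) -> list[tuple[str, str]]:
    """Map observed triggers to required actions; add evidence expiry when overdue."""
    active = set(observed)
    if today > review_expiry:
        active.add("evidence_expiry")
    unknown = active - set(TRIGGER_ACTIONS)
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    return [(t, a) for t, a in TRIGGER_ACTIONS.items() if t in active]


# Hypothetical case: supplier changed terms and the quarterly review was missed.
print(triggered_actions({"supplier_version_drift"}, date(2026, 6, 30), date(2026, 7, 15)))
```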

Retirement decision table

Decision	Meaning	Evidence
keep	evidence remains valid	updated scorecard and monitoring review
watchlist	concern exists but current use can continue under monitoring	issue record, owner, review date
constrain	narrow approved use, user group, risk tier or channel	revised model card and policy control
replace	promote challenger and migrate use case	selection record and rollout plan
retire	no new use; remove from active service	decommission record and evidence archive

Template: evidence packet

Evidence object	What it proves	Owner
Model portfolio inventory snapshot	which models existed and lifecycle status at decision time	AI platform / governance
Model card	approved use, prohibited use, version, data boundary and owner	AI architect
Benchmark pack manifest	exact task, dataset, rubric and run protocol	EvalOps
Run results and traces	what the model produced under which configuration	AI platform
Capability scorecard	multi-dimensional decision evidence	PM / architect
Red-team report	unacceptable behavior was tested and handled	security / model risk
Domain SME review	policy and workflow correctness were judged by qualified roles	BA / SME
Security/privacy review	data boundary and access controls were assessed	security / privacy
Operational readiness	latency, reliability, fallback and human workflow fit	operations
Model selection record	why champion/challenger decision was made	model selection council
Exception/risk acceptance	known gaps, compensating controls, expiry and owner	business owner / risk
Retirement triggers	when approval must be reviewed or revoked	governance

PM/BA/architecture questions

PM questions

Question	Good answer signal
What business workflow is this model approved for?	specific task, user, channel and outcome
What is the no-AI or current champion baseline?	comparison is not model-only
Which failures are unacceptable even if the average score is high?	hard blockers are explicit
How will users know when to trust, edit, escalate or ignore AI output?	human oversight and UX controls are defined
How does the model portfolio decision affect adoption and operating model?	decision ties to training, QA and monitoring

BA / CBAP questions

Question	Good answer signal
Which policy rules, exceptions and edge cases are in the benchmark pack?	domain slices and expected behavior are documented
Where does AI enter the process and where does human authority remain?	BPMN/workflow boundary is clear
What labels and rubric require SME or compliance authority?	label authority is named
Which customer segments, products, channels and languages are covered?	coverage matrix exists
What evidence proves that a requirement maps to an eval and control?	traceability is available

Architecture questions

Question	Good answer signal
Is the selection decision made at model brand level or system component level?	component-level champions are defined
Can we reproduce the run later?	model, prompt, retrieval, dataset and judge versions are recorded
What happens if the model fails, times out or changes behavior?	fallback, timeout and revalidation triggers exist
Can traces be exported for audit and incident review?	trace schema and retention are approved
Is open vs proprietary being judged by evidence, not ideology?	hosting, patching, data route, safety and license controls are included

Release checklist

Check	Pass condition	Status
Use-case boundary	approved use and prohibited use are written in model card	pass required
Risk tier	use case has low/medium/high/critical tier	pass required
Benchmark pack	task, domain, red-team and operational tests are versioned	pass required
Scorecard	weighted score and hard blockers reviewed	pass required
Critical failures	zero unresolved critical blockers for high-risk scope	pass required
Security/privacy	data route, logging, retention, DLP and entitlement reviewed	pass required
Human oversight	review, escalation and override path tested	pass required
Operational readiness	latency, reliability, fallback and support model tested	pass required
Evidence packet	model card, results, traces and selection record archived	pass required
Retirement triggers	review expiry and event triggers defined	pass required
Council decision	approve, constrain, reject, watchlist or retire recorded	pass required
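
Because every row is marked "pass required", the checklist behaves like an all-or-nothing gate. A minimal sketch of that gate is shown here; the check keys and the `release_gate` function are hypothetical names for the rows above.

```python
# One key per checklist row, in table order; names are illustrative.
REQUIRED_CHECKS = [
    "use_case_boundary",
    "risk_tier",
    "benchmark_pack",
    "scorecard",
    "critical_failures",
    "security_privacy",
    "human_oversight",
    "operational_readiness",
    "evidence_packet",
    "retirement_triggers",
    "council_decision",
]


def release_gate(status: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (ready, open_items); a missing or False check counts as open."""
    open_items = [c for c in REQUIRED_CHECKS if not status.get(c, False)]
    return (not open_items, open_items)


# Hypothetical status: everything passed except the human oversight test.
status = {check: True for check in REQUIRED_CHECKS}
status["human_oversight"] = False
print(release_gate(status))  # (False, ['human_oversight'])
```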

Release outcomes:

Outcome	Meaning
approve	model becomes champion for the defined use boundary
limited approve	model can run for specific channel, user group, volume, risk tier or human-review mode
challenger only	model remains in evaluation and cannot serve production
reject	evidence does not support the use case
watchlist champion	current champion remains but expansion is frozen
retire	model is removed from active approved use

Executive narrative

Use this narrative when speaking with a COO, CIO, CTO, CRO or AI governance committee:

We are not choosing a single best AI model for the enterprise. We are governing a model portfolio. Each model is approved for specific tasks, risk tiers and data boundaries based on repeatable evidence. Our scorecard combines task quality, domain performance, safety, latency, cost fitness, security and audit readiness. Public benchmarks inform us, but production approval depends on our own benchmark packs and red-team results. We run champion/challenger reviews so we can adopt better models without uncontrolled drift, and we define retirement triggers so outdated or unsafe models do not remain in regulated workflows. This gives product teams speed, architecture teams control, and risk teams evidence.

For financial retail, translate it this way:

Executive concern	Portfolio governance response
Are we moving fast enough?	reusable benchmark packs and model cards reduce repeated debate
Are we taking uncontrolled risk?	approved-use boundaries, hard blockers and retirement triggers constrain use
Are we locked into one provider?	champion/challenger portfolio compares proprietary, open and specialist models
Can audit understand the decision?	evidence packet links model, benchmark, scorecard, approval and monitoring
Are costs considered?	cost fitness is scored, but safety and domain blockers override cost savings

Interview drills

Drill 1: Explain the difference between benchmark and portfolio governance

30-second answer:

A benchmark compares model behavior under a defined test. Portfolio governance decides which models are approved for which business tasks over time. It adds model inventory, capability taxonomy, scorecards, champion/challenger cadence, risk-tiered approval, retirement triggers and audit evidence.

Follow-up:

Interviewer asks	Strong response
Why not just use the top leaderboard model?	public scores are not enough for domain policy, data boundary, latency, human oversight or audit
What is a hard blocker?	a failure that overrides average score, such as PII leakage or unsupported credit decision
Who should decide?	product, architecture, security, privacy, model risk, operations and business owner through a selection council

Drill 2: Contact center model selection

Prompt:

A contact center wants to replace its current copilot model with a cheaper small model.

Answer structure:

Confirm approved use: draft guidance for agents, not direct customer promises.
Run current champion, small challenger and no-AI baseline against the same contact policy benchmark.
Score grounding, vulnerable-customer escalation, unsupported fee reversal, tone, p95 latency, cost and trace export.
If small model passes routine FAQ but fails complex policy, approve it only for low-risk FAQ and route complex cases to champion.
Record conditions, monitoring and retirement/revalidation triggers.

Drill 3: AML triage challenger

Prompt:

A new domain-tuned model improves average AML triage accuracy but misses rare high-risk typologies.

Strong answer:

I would not promote it based on average accuracy. AML needs severity-weighted scoring and false negative controls. Rare high-risk typology misses are hard blockers or at least scope constraints. I would keep the current champion, add the missed typologies to the regression/challenge set, and allow the challenger only in a lower-risk assistive role if it provides value without weakening analyst review.
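
A small worked example can back up this answer. All numbers below are invented for illustration, as are the severity weights and function names; the point is only that plain accuracy and severity-weighted accuracy can rank the same two models in opposite order.

```python
# Each tuple: (severity_weight, number_of_cases, number_correct). Hypothetical data.
champion = [(1, 90, 80), (10, 10, 10)]    # routine alerts, rare high-risk typologies
challenger = [(1, 90, 88), (10, 10, 6)]


def plain_accuracy(results):
    """Unweighted share of correctly triaged cases."""
    return sum(c for _, _, c in results) / sum(n for _, n, _ in results)


def severity_weighted_accuracy(results):
    """Each case counts in proportion to its severity weight."""
    return sum(w * c for w, _, c in results) / sum(w * n for w, n, _ in results)


def high_risk_misses(results, min_weight=10):
    """False negatives on typologies at or above the given severity weight."""
    return sum(n - c for w, n, c in results if w >= min_weight)


for name, res in (("champion", champion), ("challenger", challenger)):
    print(
        name,
        round(plain_accuracy(res), 3),
        round(severity_weighted_accuracy(res), 3),
        high_risk_misses(res),
    )
```

With these invented figures the challenger has the higher plain accuracy (0.94 against 0.90) but the lower severity-weighted accuracy (about 0.779 against 0.947) and four high-risk misses, which is the pattern that would keep it out of the champion role.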

Drill 4: Open vs proprietary model

Prompt:

The CTO asks whether open-weight models are safer because we can host them internally.

Strong answer:

Hosting control is useful but not sufficient. I would compare open and proprietary models through the same scorecard: task quality, domain score, red-team, data route, patching, license, trace, latency, cost fitness and governance readiness. Open-weight may win for data residency and control, but it needs a patch, safety, eval and operations lifecycle. Proprietary may win on quality, but it needs version boundaries, logging, change notice and data retention evidence.

Drill 5: Retirement trigger

Prompt:

A champion model still works, but the supplier changes retention and logging terms.

Strong answer:

That is a governance trigger even if task quality has not changed. I would freeze expansion, review the data boundary and audit evidence impact, update the model card, rerun any affected security/privacy tests, and require the selection council to keep, constrain, replace or retire the model. A model can be retired because evidence and control no longer support approved use, not only because quality dropped.