
AI Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Playbook

420AI_MODEL_PORTFOLIO_BENCHMARKING_CAPABILITY_SCORECARD_SELECTION_GOVERNANCE_PLAYBOOK.md

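Positioning: for experienced CBAPs, financial retail PMs, product architects, solution architects and AI governance leads. It upgrades model selection from a one-off comparison to a continuous model portfolio governance operating model.
Boundary: this playbook does not replace model routing strategy, procurement sandboxes, board investment narratives or AI FinOps. It defines how the model portfolio is registered, benchmarked, scored, approved, challenged, monitored and retired.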


Purpose and when to use

Purpose

Use this playbook to establish an actionable model portfolio governance mechanism:

business capability
  -> AI task taxonomy
  -> model portfolio inventory
  -> model card and approved-use boundary
  -> benchmark pack
  -> capability scorecard
  -> model selection record
  -> champion/challenger lifecycle
  -> release evidence packet
  -> retirement trigger

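The goal is not to pick "the strongest model for the whole company" but to let teams answer:

  • Which model families are approved for which tasks?
  • Which models may only do draft, summary, classification, retrieval answer or extraction, and must not make decisions?
  • Which benchmark packs support this selection?
  • How do red-team, domain, latency, cost, security and governance scores affect the approved scope?
  • How does a new model challenge the champion when it enters the portfolio?
  • Under what conditions must an older model be watchlisted, constrained, replaced or retired?
  • How can audit, model risk and product decision-makers review the evidence as it stood at decision time?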

When to use

| Trigger | Use this playbook to produce |
| --- | --- |
| New AI use case enters the portfolio | task taxonomy, candidate model list, first benchmark pack |
| Product team wants to switch models | challenger benchmark and model selection record |
| Supplier releases a new model version | change impact review and re-benchmark decision |
| Open-weight model enters the enterprise platform | open/proprietary comparison, hosting and safety evidence |
| Before a high-risk release | scorecard, hard blocker review, evidence packet |
| Complaints, overrides or security incidents in production | regression pack update, watchlist and retirement review |
| Quarterly AI governance review | model portfolio health, concentration, challenger freshness, retirement backlog |

Operating model

Governance bodies

| Body | Decision rights | Core outputs |
| --- | --- | --- |
| AI product council | business priority, use case scope, user and workflow boundaries | approved use case and product scope |
| Model selection council | champion/challenger, approved use, constraints, retirement | model selection record |
| AI architecture review | deployment mode, integration, trace, fallback, data boundary | architecture fit review |
| Model risk / independent challenge | high-risk benchmark design, threshold, evidence sufficiency challenge | challenge memo |
| Security / privacy review | data route, DLP, IAM, prompt injection, logging, retention | security/privacy sign-off |
| Operations review | latency, availability, human review, queue impact, fallback readiness | operational readiness record |

RACI

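| Activity | PM | CBAP / BA | AI architect | EvalOps | Security / privacy | Model risk | Business owner | Operations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Task taxonomy | A/R | R | C | C | I | C | C | C |
| Model inventory | C | C | A/R | R | C | C | I | I |
| Benchmark pack | A | R | R | R | C | A/R | C | C |
| Capability scorecard | A/R | R | R | R | C | C | C | C |
| Selection record | A/R | C | R | C | C | C | A | C |
| Release evidence | R | R | R | A/R | R | C | A | R |
| Retirement trigger review | A/R | C | R | R | C | C | A | R |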

A = accountable, R = responsible, C = consulted, I = informed.

Cadence

| Cadence | Meeting / activity | Decision |
| --- | --- | --- |
| Per use case intake | task taxonomy and candidate model review | which model families enter benchmark |
| Per release | scorecard and hard blocker review | go, no-go, limited release or rerun |
| Monthly | challenger and incident review | promote, watchlist, constrain or remediate |
| Quarterly | portfolio review | concentration, stale models, retirement backlog, benchmark refresh |
| Event-driven | supplier/model/data/policy/security change | emergency re-benchmark, freeze expansion or retire |

Template: model portfolio inventory

| Field | Required content | Example |
| --- | --- | --- |
| model_id | Enterprise unique identifier | KYC-OCR-SPECIALIST-2026-06 |
| model_name | Provider or internal name | Specialist document extractor |
| model_family | general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety | OCR / extraction |
| provider_mode | proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor component | managed cloud |
| deployment_region | runtime region and data residency | US region only |
| version_boundary | model snapshot, API date, fine-tune id, guardrail version | model v3.2, OCR layout pack v7 |
| approved_use | permitted tasks and workflows | KYC ID and address proof extraction with human review for exceptions |
| prohibited_use | decisions or data use not permitted | no final KYC approval, no customer communication |
| risk_tier | low / medium / high / critical | high |
| data_boundary | PII, PCI, transaction, document, voice, training use, retention | PII allowed; no training; logs retained 30 days |
| benchmark_pack | pack id and latest run | KYC-DOC-EXTRACT-v2026.06, pass |
| scorecard_summary | main score and blockers | strong extraction, weak on handwritten docs, no critical blocker |
| lifecycle_status | candidate / challenger / champion / constrained / watchlisted / retired | constrained champion |
| owner | product, architecture, operational and risk owners | KYC PO + AI platform architect + onboarding ops |
| review_expiry | time or event trigger | quarterly or document policy change |
| fallback_model | fallback or no-AI baseline | manual review queue |
| evidence_location | GRC/evidence binder id | AIGOV-MODEL-PORT-2026-168-KYC |
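
One way to keep inventory rows consistent is to hold each row as a typed record that rejects unknown values at creation time. The sketch below assumes Python dataclasses. The field names and allowed values come from the template, while the class name `ModelInventoryEntry`, the subset of fields chosen and the validation rules are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass

# Allowed values taken from the inventory template.
RISK_TIERS = {"low", "medium", "high", "critical"}
LIFECYCLE_STATUSES = {
    "candidate", "challenger", "champion",
    "constrained", "watchlisted", "retired",
}


@dataclass(frozen=True)
class ModelInventoryEntry:
    """Hypothetical record for one row of the model portfolio inventory."""
    model_id: str
    model_family: str
    provider_mode: str
    version_boundary: str
    approved_use: str
    prohibited_use: str
    risk_tier: str
    lifecycle_status: str
    fallback_model: str
    evidence_location: str

    def __post_init__(self):
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"unknown risk_tier: {self.risk_tier!r}")
        # The template example uses a compound status ("constrained champion"),
        # so each word is checked against the known statuses.
        unknown = set(self.lifecycle_status.split()) - LIFECYCLE_STATUSES
        if unknown:
            raise ValueError(f"unknown lifecycle_status: {sorted(unknown)}")
        if not self.prohibited_use.strip():
            raise ValueError("prohibited_use must be stated explicitly")


# Values copied from the Example column of the template.
kyc_extractor = ModelInventoryEntry(
    model_id="KYC-OCR-SPECIALIST-2026-06",
    model_family="OCR / extraction",
    provider_mode="managed cloud",
    version_boundary="model v3.2, OCR layout pack v7",
    approved_use="KYC ID and address proof extraction with human review for exceptions",
    prohibited_use="no final KYC approval, no customer communication",
    risk_tier="high",
    lifecycle_status="constrained champion",
    fallback_model="manual review queue",
    evidence_location="AIGOV-MODEL-PORT-2026-168-KYC",
)
```

Making the record immutable (`frozen=True`) is a design choice in this sketch: a change to approved use or version boundary then has to produce a new record, which fits the idea of an inventory snapshot at decision time.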

Template: capability scorecard

Use 1-5 scoring with written rationale. For high-risk workflows, hard blockers override weighted totals.

| Dimension | Weight | Score | Evidence required | Notes |
| --- | --- | --- | --- | --- |
| Task quality | 14 | | benchmark pass rate, SME sample review | Does it perform the intended task, not a generic task? |
| Domain score | 14 | | domain edge cases, policy/version tests | Does it understand financial retail rules and exceptions? |
| Red-team score | 14 | | adversarial prompt, misuse, privacy and unsafe advice tests | Any critical failure blocks high-risk approval. |
| Grounding/citation | 10 | | citation support, source recall, entitlement tests | Required for RAG and policy support. |
| Robustness | 8 | | mutation tests, language/channel/noise variation | Weak robustness constrains scope. |
| Security/data boundary | 10 | | IAM, DLP, retention, logging, training use evidence | Must match data classification. |
| Latency/reliability | 8 | | p50/p95, timeout, availability, fallback | Must fit workflow queue and customer experience. |
| Cost fitness | 6 | | cost per case, variance, scale assumptions | Cost informs scope, but does not override safety. |
| Human oversight fit | 6 | | reviewer workflow, override, escalation logs | Prevents hidden automation. |
| Governance readiness | 10 | | model card, version lock, trace export, retirement trigger | Required for audit and change control. |
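
As a worked example of the weighted total, the sketch below averages 1-5 scores using the weights from the table, which sum to 100. The dictionary keys, the function name `weighted_total` and the example scores are hypothetical; only the weights are taken from the template.

```python
# Weights copied from the scorecard template; they sum to 100.
WEIGHTS = {
    "task_quality": 14,
    "domain_score": 14,
    "red_team_score": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def weighted_total(scores: dict[str, int]) -> float:
    """Return the weight-averaged score on the same 1-5 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("every dimension needs exactly one score")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * score for name, score in scores.items()) / total_weight


# Hypothetical scores, for illustration only.
example_scores = {
    "task_quality": 4, "domain_score": 4, "red_team_score": 5,
    "grounding_citation": 4, "robustness": 3, "security_data_boundary": 5,
    "latency_reliability": 3, "cost_fitness": 4, "human_oversight_fit": 4,
    "governance_readiness": 5,
}
print(round(weighted_total(example_scores), 2))  # 4.18
```

The total is a summary for discussion, not a decision by itself: each score still needs its written rationale, and for high-risk workflows a hard blocker overrides the number.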

Hard blocker review

| Blocker | Pass evidence | Decision if failed |
| --- | --- | --- |
| Critical customer harm | no critical harm in approved red-team/domain set | reject or constrain to lab |
| Unauthorized decision | AI role cannot make final credit/KYC/AML/fraud decision unless explicitly approved | redesign workflow |
| PII or restricted data leakage | DLP and entitlement tests pass | block release |
| Unsupported regulated advice | no unsupported adverse action, legal/financial advice or customer commitment | block high-risk use |
| Missing trace | prompt, model, retrieval, output and human action are reproducible | no regulated release |
| Silent version drift | version boundary and supplier change notice captured | freeze approval |
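
The override rule can be sketched as a gate that runs before any weighted total is considered. The decision texts below are copied from the table. The function names, the dictionary keys and the choice to treat an unreviewed blocker as failed are assumptions made for this illustration.

```python
# "Decision if failed" text copied from the hard blocker table.
HARD_BLOCKERS = {
    "critical_customer_harm": "reject or constrain to lab",
    "unauthorized_decision": "redesign workflow",
    "pii_or_restricted_data_leakage": "block release",
    "unsupported_regulated_advice": "block high-risk use",
    "missing_trace": "no regulated release",
    "silent_version_drift": "freeze approval",
}


def blocker_decisions(blocker_passed: dict[str, bool]) -> list[str]:
    """List the decision for every failed or unreviewed blocker."""
    decisions = []
    for name, decision in HARD_BLOCKERS.items():
        # Assumption: a blocker with no recorded result counts as failed.
        if not blocker_passed.get(name, False):
            decisions.append(f"{name}: {decision}")
    return decisions


def review(weighted_total: float, blocker_passed: dict[str, bool]) -> str:
    """Apply hard blockers first; the weighted total matters only if none fail."""
    failed = blocker_decisions(blocker_passed)
    if failed:
        return "blocked -> " + "; ".join(failed)
    return f"no hard blocker; weighted total {weighted_total:.2f} goes to council"


results = {name: True for name in HARD_BLOCKERS}
results["missing_trace"] = False
print(review(4.6, results))  # blocked -> missing_trace: no regulated release
```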

Template: benchmark pack

| Section | Required content | Example: credit policy RAG |
| --- | --- | --- |
| pack_id | unique benchmark identifier | CREDIT-POLICY-RAG-v2026.06 |
| use_case | workflow and AI role | underwriter policy support, answer with citations |
| approved scope | products, regions, users, channels | consumer credit card and personal loan policy |
| out of scope | explicitly excluded work | no final underwriting decision, no adverse action notice generation |
| model options | champion, challenger, baseline | frontier proprietary, open-weight hosted, existing search |
| dataset | source, version, sample size, coverage | 500 policy Q&A, 80 edge cases, 60 stale/conflict cases |
| domain slices | product, jurisdiction, effective date, customer segment | state-specific rules, exception process, protected-class boundary |
| red-team slices | prompt injection, unsafe advice, restricted data, overconfidence | user asks for decline reason not in policy |
| rubric | scoring method and severity | groundedness, citation, completeness, refusal, escalation |
| critical failures | zero tolerance failures | unsupported adverse action reason, fabricated policy, restricted source leak |
| run protocol | model version, prompt, retrieval config, judge model, repeats | prompt v12, retriever v5, 3 repeated runs for unstable cases |
| operational test | latency, timeout, stress, fallback | p95 under workflow target, fallback to policy search |
| evidence output | files and records retained | traces, scorecard, SME notes, selection record |
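
To make the run protocol reproducible in practice, one option is to store a manifest of the versions used and derive a fingerprint from it, so that any configuration change yields a new identifier. This is a minimal sketch using Python's standard `hashlib` and `json` modules. The manifest keys and the function name are hypothetical, and the model and judge identifiers are placeholders because the template does not name them.

```python
import hashlib
import json


def manifest_fingerprint(manifest: dict) -> str:
    """Hash a run manifest so identical configurations get identical ids."""
    # Sorted keys and fixed separators give a canonical serialization.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_manifest = {
    # Values from the Example column of the benchmark pack template.
    "pack_id": "CREDIT-POLICY-RAG-v2026.06",
    "prompt_version": "v12",
    "retriever_version": "v5",
    "repeats_for_unstable_cases": 3,
    # Placeholders: the template does not name these versions.
    "model_version": "<model snapshot id>",
    "judge_model": "<judge model id>",
}

fingerprint = manifest_fingerprint(run_manifest)
changed = manifest_fingerprint({**run_manifest, "prompt_version": "v13"})
print(fingerprint != changed)  # True: a prompt change produces a new id
```

Storing the fingerprint next to traces and scorecards lets a reviewer confirm later that two result sets came from the same configuration.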

Benchmark result summary

| Option | Task score | Domain score | Red-team score | p95 latency | Cost/case | Governance readiness | Recommendation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Current champion | 4.2 | 4.0 | 4.5 | 4.8s | $0.38 | high | keep, monitor stale-policy slice |
| Frontier challenger | 4.6 | 4.4 | 4.2 | 6.2s | $0.71 | medium | constrained pilot for complex cases |
| Open-weight challenger | 3.9 | 3.6 | 3.8 | 3.9s | $0.29 | high if hosted internally | continue challenge after red-team tuning |
| No-AI baseline | 3.1 | 3.3 | 5.0 | human search | staff time | high | keep as fallback |
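
A challenger review is easier to read as differences against the champion. The sketch below computes those deltas from the figures in the summary. The dictionary layout and the function name `delta_vs_champion` are assumptions, and the No-AI baseline is left out because its latency and cost are not numeric.

```python
# Figures copied from the benchmark result summary.
RESULTS = {
    "Current champion": {"task": 4.2, "domain": 4.0, "red_team": 4.5, "p95_s": 4.8, "cost": 0.38},
    "Frontier challenger": {"task": 4.6, "domain": 4.4, "red_team": 4.2, "p95_s": 6.2, "cost": 0.71},
    "Open-weight challenger": {"task": 3.9, "domain": 3.6, "red_team": 3.8, "p95_s": 3.9, "cost": 0.29},
}


def delta_vs_champion(option: str) -> dict[str, float]:
    """Return challenger minus champion for every metric, rounded to 2 places."""
    champion = RESULTS["Current champion"]
    return {k: round(RESULTS[option][k] - champion[k], 2) for k in champion}


for name in ("Frontier challenger", "Open-weight challenger"):
    print(name, delta_vs_champion(name))
# Frontier challenger {'task': 0.4, 'domain': 0.4, 'red_team': -0.3, 'p95_s': 1.4, 'cost': 0.33}
# Open-weight challenger {'task': -0.3, 'domain': -0.4, 'red_team': -0.7, 'p95_s': -0.9, 'cost': -0.09}
```

Read this way, the frontier challenger gains on task and domain quality but gives up red-team score, latency and cost, which is consistent with a constrained pilot rather than a promotion.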

Template: model selection record

Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:

Decision:
Approve / reject / constrain / watchlist / retire / keep as challenger.

Selected champion:
Model id, version boundary, model family, provider mode.

Approved use:
Tasks, channels, users, risk tier, workflow steps and human role.

Prohibited use:
Decisions, customer-facing commitments, restricted data or autonomous actions not permitted.

Options considered:
1. Current champion.
2. Challenger A.
3. Challenger B.
4. No-AI or legacy baseline.

Evidence reviewed:
- model cards:
- benchmark pack:
- capability scorecard:
- red-team report:
- domain SME review:
- latency/reliability test:
- security/privacy review:
- model risk challenge:
- production monitoring signals:

Decision rationale:
Explain why this option is fit for the approved use, what tradeoffs are accepted and which scorecard dimensions drove the decision.

Conditions:
List controls, limits, monitoring, human review, fallback and expiry.

Retirement / revalidation triggers:
List exact events that invalidate approval.

Residual risk:
Describe remaining known risks and why the business owner accepts or rejects them.

Approval:
Business owner:
Product owner:
Architecture:
Security/privacy:
Model risk / governance:
Operations:

Template: retirement trigger

| Trigger | Signal | Required action | Example |
| --- | --- | --- | --- |
| Critical red-team regression | new prompt injection, privacy leak, unsafe advice or unauthorized action | freeze expansion, incident review, rerun safety pack | enterprise knowledge assistant leaks restricted HR document |
| Domain policy change | product, jurisdiction, regulation or procedure changed | rerun impacted domain benchmark | credit policy effective date changed |
| Supplier version drift | provider changes model behavior, logs, retention or terms | change impact review and re-approval | API snapshot replaced by provider |
| Performance drift | production pass rate, override, complaint or QA score degrades | watchlist and benchmark rerun | contact center unsupported answers increase |
| Operational unfit | latency, timeout, availability or rate limit misses SLO | constrain scope or fallback | fraud intervention p95 exceeds queue window |
| Security issue | vulnerability, dependency issue, hosting patch gap, license concern | patch or retire | open-weight runtime CVE affects hosted model |
| Evidence expiry | model card, benchmark, approval or review period expired | re-benchmark or suspend new use | quarterly review missed for high-risk model |
| Better challenger | challenger materially improves safety/domain/operability with sufficient evidence | promotion decision and migration plan | KYC extractor reduces manual review while preserving exception routing |
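
The trigger table can be kept as a lookup so that monitoring events and expired reviews always resolve to the same required action. In this sketch the action texts are copied from the table. The key names, the function `fired_triggers` and the dates are hypothetical.

```python
from datetime import date

# Required actions copied from the retirement trigger table.
REQUIRED_ACTION = {
    "critical_red_team_regression": "freeze expansion, incident review, rerun safety pack",
    "domain_policy_change": "rerun impacted domain benchmark",
    "supplier_version_drift": "change impact review and re-approval",
    "performance_drift": "watchlist and benchmark rerun",
    "operational_unfit": "constrain scope or fallback",
    "security_issue": "patch or retire",
    "evidence_expiry": "re-benchmark or suspend new use",
    "better_challenger": "promotion decision and migration plan",
}


def fired_triggers(events: set[str], review_expiry: date, today: date) -> dict[str, str]:
    """Map each fired trigger to its required action."""
    fired = set(events)
    # Evidence expiry is derived from the review date, not reported as an event.
    if today > review_expiry:
        fired.add("evidence_expiry")
    unknown = fired - set(REQUIRED_ACTION)
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    return {name: REQUIRED_ACTION[name] for name in sorted(fired)}


actions = fired_triggers(
    events={"supplier_version_drift"},
    review_expiry=date(2026, 6, 30),  # hypothetical dates
    today=date(2026, 7, 15),
)
for trigger, action in actions.items():
    print(f"{trigger}: {action}")
# evidence_expiry: re-benchmark or suspend new use
# supplier_version_drift: change impact review and re-approval
```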

Retirement decision table

| Decision | Meaning | Evidence |
| --- | --- | --- |
| keep | evidence remains valid | updated scorecard and monitoring review |
| watchlist | concern exists but current use can continue under monitoring | issue record, owner, review date |
| constrain | narrow approved use, user group, risk tier or channel | revised model card and policy control |
| replace | promote challenger and migrate use case | selection record and rollout plan |
| retire | no new use; remove from active service | decommission record and evidence archive |

Template: evidence packet

| Evidence object | What it proves | Owner |
| --- | --- | --- |
| Model portfolio inventory snapshot | which models existed and lifecycle status at decision time | AI platform / governance |
| Model card | approved use, prohibited use, version, data boundary and owner | AI architect |
| Benchmark pack manifest | exact task, dataset, rubric and run protocol | EvalOps |
| Run results and traces | what the model produced under which configuration | AI platform |
| Capability scorecard | multi-dimensional decision evidence | PM / architect |
| Red-team report | unacceptable behavior was tested and handled | security / model risk |
| Domain SME review | policy and workflow correctness were judged by qualified roles | BA / SME |
| Security/privacy review | data boundary and access controls were assessed | security / privacy |
| Operational readiness | latency, reliability, fallback and human workflow fit | operations |
| Model selection record | why champion/challenger decision was made | model selection council |
| Exception/risk acceptance | known gaps, compensating controls, expiry and owner | business owner / risk |
| Retirement triggers | when approval must be reviewed or revoked | governance |

PM/BA/architecture questions

PM questions

| Question | Good answer signal |
| --- | --- |
| What business workflow is this model approved for? | specific task, user, channel and outcome |
| What is the no-AI or current champion baseline? | comparison is not model-only |
| Which failures are unacceptable even if the average score is high? | hard blockers are explicit |
| How will users know when to trust, edit, escalate or ignore AI output? | human oversight and UX controls are defined |
| How does the model portfolio decision affect adoption and operating model? | decision ties to training, QA and monitoring |

BA / CBAP questions

| Question | Good answer signal |
| --- | --- |
| Which policy rules, exceptions and edge cases are in the benchmark pack? | domain slices and expected behavior are documented |
| Where does AI enter the process and where does human authority remain? | BPMN/workflow boundary is clear |
| What labels and rubric require SME or compliance authority? | label authority is named |
| Which customer segments, products, channels and languages are covered? | coverage matrix exists |
| What evidence proves that a requirement maps to an eval and control? | traceability is available |

Architecture questions

| Question | Good answer signal |
| --- | --- |
| Is the selection decision made at model brand level or system component level? | component-level champions are defined |
| Can we reproduce the run later? | model, prompt, retrieval, dataset and judge versions are recorded |
| What happens if the model fails, times out or changes behavior? | fallback, timeout and revalidation triggers exist |
| Can traces be exported for audit and incident review? | trace schema and retention are approved |
| Is open vs proprietary being judged by evidence, not ideology? | hosting, patching, data route, safety and license controls are included |

Release checklist

| Check | Pass condition | Status |
| --- | --- | --- |
| Use-case boundary | approved use and prohibited use are written in model card | pass required |
| Risk tier | use case has low/medium/high/critical tier | pass required |
| Benchmark pack | task, domain, red-team and operational tests are versioned | pass required |
| Scorecard | weighted score and hard blockers reviewed | pass required |
| Critical failures | zero unresolved critical blockers for high-risk scope | pass required |
| Security/privacy | data route, logging, retention, DLP and entitlement reviewed | pass required |
| Human oversight | review, escalation and override path tested | pass required |
| Operational readiness | latency, reliability, fallback and support model tested | pass required |
| Evidence packet | model card, results, traces and selection record archived | pass required |
| Retirement triggers | review expiry and event triggers defined | pass required |
| Council decision | approve, constrain, reject, watchlist or retire recorded | pass required |
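
Because every row is marked "pass required", the checklist behaves as an all-or-nothing gate. A minimal sketch of that gate follows. The check keys mirror the table rows, while the status strings and the function name `release_gate` are assumptions.

```python
# One key per row of the release checklist.
CHECKS = [
    "use_case_boundary", "risk_tier", "benchmark_pack", "scorecard",
    "critical_failures", "security_privacy", "human_oversight",
    "operational_readiness", "evidence_packet", "retirement_triggers",
    "council_decision",
]


def release_gate(status: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (ready, open_checks); a missing status counts as not passed."""
    open_checks = [check for check in CHECKS if status.get(check) != "pass"]
    return (not open_checks, open_checks)


status = {check: "pass" for check in CHECKS}
status["human_oversight"] = "in progress"  # hypothetical open item
ready, open_checks = release_gate(status)
print(ready, open_checks)  # False ['human_oversight']
```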

Release outcomes:

| Outcome | Meaning |
| --- | --- |
| approve | model becomes champion for the defined use boundary |
| limited approve | model can run for specific channel, user group, volume, risk tier or human-review mode |
| challenger only | model remains in evaluation and cannot serve production |
| reject | evidence does not support the use case |
| watchlist champion | current champion remains but expansion is frozen |
| retire | model is removed from active approved use |

Executive narrative

Use this narrative when speaking with a COO, CIO, CTO, CRO or AI governance committee:

We are not choosing a single best AI model for the enterprise. We are governing a model portfolio. Each model is approved for specific tasks, risk tiers and data boundaries based on repeatable evidence. Our scorecard combines task quality, domain performance, safety, latency, cost fitness, security and audit readiness. Public benchmarks inform us, but production approval depends on our own benchmark packs and red-team results. We run champion/challenger reviews so we can adopt better models without uncontrolled drift, and we define retirement triggers so outdated or unsafe models do not remain in regulated workflows. This gives product teams speed, architecture teams control, and risk teams evidence.

For financial retail, translate it this way:

| Executive concern | Portfolio governance response |
| --- | --- |
| Are we moving fast enough? | reusable benchmark packs and model cards reduce repeated debate |
| Are we taking uncontrolled risk? | approved-use boundaries, hard blockers and retirement triggers constrain use |
| Are we locked into one provider? | champion/challenger portfolio compares proprietary, open and specialist models |
| Can audit understand the decision? | evidence packet links model, benchmark, scorecard, approval and monitoring |
| Are costs considered? | cost fitness is scored, but safety and domain blockers override cost savings |

Interview drills

Drill 1: Explain the difference between benchmark and portfolio governance

30-second answer:

A benchmark compares model behavior under a defined test. Portfolio governance decides which models are approved for which business tasks over time. It adds model inventory, capability taxonomy, scorecards, champion/challenger cadence, risk-tiered approval, retirement triggers and audit evidence.

Follow-up:

| Interviewer asks | Strong response |
| --- | --- |
| Why not just use the top leaderboard model? | public scores are not enough for domain policy, data boundary, latency, human oversight or audit |
| What is a hard blocker? | a failure that overrides average score, such as PII leakage or unsupported credit decision |
| Who should decide? | product, architecture, security, privacy, model risk, operations and business owner through a selection council |

Drill 2: Contact center model selection

Prompt:

A contact center wants to replace its current copilot model with a cheaper small model.

Answer structure:

  1. Confirm approved use: draft guidance for agents, not direct customer promises.
  2. Run current champion, small challenger and no-AI baseline against the same contact policy benchmark.
  3. Score grounding, vulnerable-customer escalation, unsupported fee reversal, tone, p95 latency, cost and trace export.
  4. If small model passes routine FAQ but fails complex policy, approve it only for low-risk FAQ and route complex cases to champion.
  5. Record conditions, monitoring and retirement/revalidation triggers.

Drill 3: AML triage challenger

Prompt:

A new domain-tuned model improves average AML triage accuracy but misses rare high-risk typologies.

Strong answer:

I would not promote it based on average accuracy. AML needs severity-weighted scoring and false negative controls. Rare high-risk typology misses are hard blockers or at least scope constraints. I would keep the current champion, add the missed typologies to the regression/challenge set, and allow the challenger only in a lower-risk assistive role if it provides value without weakening analyst review.
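
A small numeric illustration can back this answer. The counts and severity weights below are entirely hypothetical and chosen only to show the effect: the challenger wins on average accuracy yet loses once misses are weighted by severity.

```python
# Hypothetical labeled cases per typology and correct triage counts per model.
# Severity weights are illustrative assumptions, not regulatory values.
TYPOLOGIES = {
    "routine_structuring": {"severity": 1, "cases": 900, "champion": 810, "challenger": 873},
    "mule_account": {"severity": 3, "cases": 80, "champion": 72, "challenger": 74},
    "rare_high_risk": {"severity": 10, "cases": 20, "champion": 19, "challenger": 11},
}


def average_accuracy(model: str) -> float:
    """Plain accuracy: every case counts the same."""
    correct = sum(t[model] for t in TYPOLOGIES.values())
    cases = sum(t["cases"] for t in TYPOLOGIES.values())
    return correct / cases


def severity_weighted_accuracy(model: str) -> float:
    """Accuracy where each case counts in proportion to its typology severity."""
    correct = sum(t["severity"] * t[model] for t in TYPOLOGIES.values())
    cases = sum(t["severity"] * t["cases"] for t in TYPOLOGIES.values())
    return correct / cases


for model in ("champion", "challenger"):
    print(model,
          round(average_accuracy(model), 3),
          round(severity_weighted_accuracy(model), 3))
# champion 0.901 0.907
# challenger 0.958 0.899
```

In this made-up data the challenger triages only 11 of 20 rare high-risk cases correctly against 19 of 20 for the champion, which is the kind of slice that should be reviewed as a hard blocker regardless of either aggregate.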

Drill 4: Open vs proprietary model

Prompt:

The CTO asks whether open-weight models are safer because we can host them internally.

Strong answer:

Hosting control is useful but not sufficient. I would compare open and proprietary models through the same scorecard: task quality, domain score, red-team, data route, patching, license, trace, latency, cost fitness and governance readiness. Open-weight may win for data residency and control, but it needs a patch, safety, eval and operations lifecycle. Proprietary may win on quality, but it needs version boundaries, logging, change notice and data retention evidence.

Drill 5: Retirement trigger

Prompt:

A champion model still works, but the supplier changes retention and logging terms.

Strong answer:

That is a governance trigger even if task quality has not changed. I would freeze expansion, review the data boundary and audit evidence impact, update the model card, rerun any affected security/privacy tests, and require the selection council to keep, constrain, replace or retire the model. A model can be retired because evidence and control no longer support approved use, not only because quality dropped.