AI Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Playbook
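Positioning: for experienced CBAPs, financial retail PMs, product architects, solution architects and AI governance leads. It moves model selection from a one-off comparison to a continuous model portfolio governance operating model.
Boundary: this playbook does not replace model routing strategy, procurement sandboxes, board-level investment narratives or AI FinOps. It defines how the model portfolio is registered, benchmarked, scored, approved, challenged, monitored and retired.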
Purpose and when to use
Purpose
Use this playbook to establish an executable model portfolio governance mechanism:
business capability
-> AI task taxonomy
-> model portfolio inventory
-> model card and approved-use boundary
-> benchmark pack
-> capability scorecard
-> model selection record
-> champion/challenger lifecycle
-> release evidence packet
-> retirement trigger
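The goal is not to pick the "strongest model for the whole company", but to let teams answer:
- Which model families are approved for which tasks?
- Which models may only do draft, summary, classification, retrieval answer or extraction, and must not make decisions?
- Which benchmark packs support this selection?
- How do red-team, domain, latency, cost, security and governance scores affect the approval scope?
- How does a new model challenge the champion when it enters the portfolio?
- Under what conditions must an existing model be watchlisted, constrained, replaced or retired?
- How can audit, model risk and product decision-makers review the evidence as it stood at decision time?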
When to use
| Trigger | Use this playbook to produce |
|---|---|
| New AI use case enters the portfolio | task taxonomy, candidate model list, first benchmark pack |
| Product team wants to switch models | challenger benchmark and model selection record |
| Supplier releases a new model version | change impact review and re-benchmark decision |
| Open-weight model enters the enterprise platform | open/proprietary comparison, hosting and safety evidence |
| Before a high-risk release | scorecard, hard blocker review, evidence packet |
| Complaints, overrides or security incidents occur in production | regression pack update, watchlist and retirement review |
| Quarterly AI governance review | model portfolio health, concentration, challenger freshness, retirement backlog |
Operating model
Governance bodies
| Body | Decision rights | Core outputs |
|---|---|---|
| AI product council | business priority, use case scope, user and workflow boundaries | approved use case and product scope |
| Model selection council | champion/challenger, approved use, restrictions, retirement | model selection record |
| AI architecture review | deployment mode, integration, trace, fallback, data boundary | architecture fit review |
| Model risk / independent challenge | high-risk benchmark design, thresholds, evidence sufficiency challenge | challenge memo |
| Security / privacy review | data route, DLP, IAM, prompt injection, logging, retention | security/privacy sign-off |
| Operations review | latency, availability, human review, queue impact, fallback readiness | operational readiness record |
RACI
| Activity | PM | CBAP / BA | AI architect | EvalOps | Security / privacy | Model risk | Business owner | Operations |
|---|---|---|---|---|---|---|---|---|
| Task taxonomy | A/R | R | C | C | I | C | C | C |
| Model inventory | C | C | A/R | R | C | C | I | I |
| Benchmark pack | A | R | R | R | C | A/R | C | C |
| Capability scorecard | A/R | R | R | R | C | C | C | C |
| Selection record | A/R | C | R | C | C | C | A | C |
| Release evidence | R | R | R | A/R | R | C | A | R |
| Retirement trigger review | A/R | C | R | R | C | C | A | R |
A = accountable, R = responsible, C = consulted, I = informed.
Cadence
| Cadence | Meeting / activity | Decision |
|---|---|---|
| Per use case intake | task taxonomy and candidate model review | which model families enter benchmark |
| Per release | scorecard and hard blocker review | go, no-go, limited release or rerun |
| Monthly | challenger and incident review | promote, watchlist, constrain or remediate |
| Quarterly | portfolio review | concentration, stale models, retirement backlog, benchmark refresh |
| Event-driven | supplier/model/data/policy/security change | emergency re-benchmark, freeze expansion or retire |
Template: model portfolio inventory
| Field | Required content | Example |
|---|---|---|
| model_id | Enterprise unique identifier | KYC-OCR-SPECIALIST-2026-06 |
| model_name | Provider or internal name | Specialist document extractor |
| model_family | general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety | OCR / extraction |
| provider_mode | proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor component | managed cloud |
| deployment_region | runtime region and data residency | US region only |
| version_boundary | model snapshot, API date, fine-tune id, guardrail version | model v3.2, OCR layout pack v7 |
| approved_use | permitted tasks and workflows | KYC ID and address proof extraction with human review for exceptions |
| prohibited_use | decisions or data use not permitted | no final KYC approval, no customer communication |
| risk_tier | low / medium / high / critical | high |
| data_boundary | PII, PCI, transaction, document, voice, training use, retention | PII allowed; no training; logs retained 30 days |
| benchmark_pack | pack id and latest run | KYC-DOC-EXTRACT-v2026.06, pass |
| scorecard_summary | main score and blockers | strong extraction, weak on handwritten docs, no critical blocker |
| lifecycle_status | candidate / challenger / champion / constrained / watchlisted / retired | constrained champion |
| owner | product, architecture, operational and risk owners | KYC PO + AI platform architect + onboarding ops |
| review_expiry | time or event trigger | quarterly or document policy change |
| fallback_model | fallback or no-AI baseline | manual review queue |
| evidence_location | GRC/evidence binder id | AIGOV-MODEL-PORT-2026-168-KYC |
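One way to keep inventory rows consistent is to hold them as typed records that reject values outside the template's enumerations. The sketch below is illustrative only: the `ModelInventoryEntry` class, the field subset, the `is_review_due` helper and the concrete expiry date are assumptions, not part of the playbook. The lifecycle and risk-tier values are the ones listed in the template.

```python
from dataclasses import dataclass
from datetime import date

# Allowed values copied from the inventory template.
LIFECYCLE_STATUSES = {"candidate", "challenger", "champion",
                      "constrained", "watchlisted", "retired"}
RISK_TIERS = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class ModelInventoryEntry:
    """Hypothetical record holding a subset of the inventory fields."""
    model_id: str
    model_family: str
    provider_mode: str
    risk_tier: str
    lifecycle_status: str
    approved_use: str
    prohibited_use: str
    fallback_model: str
    review_expiry: date  # assumption: the time-based expiry is stored as a date

    def __post_init__(self) -> None:
        # Reject values outside the template's enumerations at registration time.
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"unknown risk_tier: {self.risk_tier!r}")
        if self.lifecycle_status not in LIFECYCLE_STATUSES:
            raise ValueError(f"unknown lifecycle_status: {self.lifecycle_status!r}")

    def is_review_due(self, today: date) -> bool:
        """True once the time-based review expiry has been reached."""
        return today >= self.review_expiry


entry = ModelInventoryEntry(
    model_id="KYC-OCR-SPECIALIST-2026-06",
    model_family="OCR / extraction",
    provider_mode="managed cloud",
    risk_tier="high",
    lifecycle_status="constrained",
    approved_use="KYC ID and address proof extraction with human review for exceptions",
    prohibited_use="no final KYC approval, no customer communication",
    fallback_model="manual review queue",
    review_expiry=date(2026, 9, 30),  # hypothetical quarterly expiry date
)
print(entry.is_review_due(date(2026, 10, 1)))
```

Event-based expiry (for example a document policy change) is not modelled here; it would need a separate trigger record.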
Template: capability scorecard
Use 1-5 scoring with written rationale. For high-risk workflows, hard blockers override weighted totals.
| Dimension | Weight | Score (1-5) | Evidence required | Notes |
|---|---|---|---|---|
| Task quality | 14 | | benchmark pass rate, SME sample review | Does it perform the intended task, not a generic task? |
| Domain score | 14 | | domain edge cases, policy/version tests | Does it understand financial retail rules and exceptions? |
| Red-team score | 14 | | adversarial prompt, misuse, privacy and unsafe advice tests | Any critical failure blocks high-risk approval. |
| Grounding/citation | 10 | | citation support, source recall, entitlement tests | Required for RAG and policy support. |
| Robustness | 8 | | mutation tests, language/channel/noise variation | Weak robustness constrains scope. |
| Security/data boundary | 10 | | IAM, DLP, retention, logging, training use evidence | Must match data classification. |
| Latency/reliability | 8 | | p50/p95, timeout, availability, fallback | Must fit workflow queue and customer experience. |
| Cost fitness | 6 | | cost per case, variance, scale assumptions | Cost informs scope, but does not override safety. |
| Human oversight fit | 6 | | reviewer workflow, override, escalation logs | Prevents hidden automation. |
| Governance readiness | 10 | | model card, version lock, trace export, retirement trigger | Required for audit and change control. |
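The weights in the scorecard sum to 100, so a weighted total on the same 1-5 scale can be computed as the weight-normalised average of the dimension scores. The sketch below assumes that reading; the dictionary keys, the `weighted_total` function and the example scores are illustrative, not values from the playbook.

```python
# Weights copied from the capability scorecard; keys are illustrative names.
WEIGHTS = {
    "task_quality": 14,
    "domain_score": 14,
    "red_team_score": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def weighted_total(scores: dict) -> float:
    """Weight-normalised average of 1-5 dimension scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for dim in WEIGHTS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} score must be between 1 and 5")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())


# Hypothetical scores for one candidate model.
example_scores = {
    "task_quality": 4,
    "domain_score": 4,
    "red_team_score": 5,
    "grounding_citation": 4,
    "robustness": 3,
    "security_data_boundary": 5,
    "latency_reliability": 4,
    "cost_fitness": 3,
    "human_oversight_fit": 4,
    "governance_readiness": 5,
}
print(round(weighted_total(example_scores), 2))
```

The total is only one input: for high-risk workflows a failed hard blocker overrides it regardless of the number.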
Hard blocker review
| Blocker | Pass evidence | Decision if failed |
|---|---|---|
| Critical customer harm | no critical harm in approved red-team/domain set | reject or constrain to lab |
| Unauthorized decision | AI role cannot make final credit/KYC/AML/fraud decision unless explicitly approved | redesign workflow |
| PII or restricted data leakage | DLP and entitlement tests pass | block release |
| Unsupported regulated advice | no unsupported adverse action, legal/financial advice or customer commitment | block high-risk use |
| Missing trace | prompt, model, retrieval, output and human action are reproducible | no regulated release |
| Silent version drift | version boundary and supplier change notice captured | freeze approval |
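The override rule can be sketched as a gate that checks blockers before it looks at any weighted total. This is a minimal illustration: the blocker keys, the `release_decision` function, the outcome labels and the 3.5 threshold are assumptions. Only the blocker names and the "decision if failed" texts come from the table above.

```python
# Blocker -> decision if failed, copied from the hard blocker table.
HARD_BLOCKERS = {
    "critical_customer_harm": "reject or constrain to lab",
    "unauthorized_decision": "redesign workflow",
    "pii_or_restricted_data_leakage": "block release",
    "unsupported_regulated_advice": "block high-risk use",
    "missing_trace": "no regulated release",
    "silent_version_drift": "freeze approval",
}


def failed_blockers(passed: dict) -> list:
    """Return (blocker, decision) pairs for blockers without pass evidence.

    A blocker that is absent from `passed` is treated as failed, so missing
    evidence never counts as a pass.
    """
    return [(name, decision) for name, decision in HARD_BLOCKERS.items()
            if not passed.get(name, False)]


def release_decision(weighted_total: float, passed: dict, threshold: float = 3.5):
    """Blockers are checked first; the weighted total only matters afterwards.

    The threshold value is a hypothetical example, not a playbook setting.
    """
    failures = failed_blockers(passed)
    if failures:
        return "blocked", failures
    if weighted_total < threshold:
        return "below_threshold", []
    return "eligible", []


all_pass = {name: True for name in HARD_BLOCKERS}
leak = dict(all_pass, pii_or_restricted_data_leakage=False)

print(release_decision(4.2, all_pass))
print(release_decision(4.9, leak))
```

In the second call a near-perfect total still yields a blocked outcome, which is the behaviour the scorecard rule describes for high-risk workflows.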
Template: benchmark pack
| Section | Required content | Example: credit policy RAG |
|---|---|---|
| pack_id | unique benchmark identifier | CREDIT-POLICY-RAG-v2026.06 |
| use_case | workflow and AI role | underwriter policy support, answer with citations |
| approved scope | products, regions, users, channels | consumer credit card and personal loan policy |
| out of scope | explicitly excluded work | no final underwriting decision, no adverse action notice generation |
| model options | champion, challenger, baseline | frontier proprietary, open-weight hosted, existing search |
| dataset | source, version, sample size, coverage | 500 policy Q&A, 80 edge cases, 60 stale/conflict cases |
| domain slices | product, jurisdiction, effective date, customer segment | state-specific rules, exception process, protected-class boundary |
| red-team slices | prompt injection, unsafe advice, restricted data, overconfidence | user asks for decline reason not in policy |
| rubric | scoring method and severity | groundedness, citation, completeness, refusal, escalation |
| critical failures | zero tolerance failures | unsupported adverse action reason, fabricated policy, restricted source leak |
| run protocol | model version, prompt, retrieval config, judge model, repeats | prompt v12, retriever v5, 3 repeated runs for unstable cases |
| operational test | latency, timeout, stress, fallback | p95 under workflow target, fallback to policy search |
| evidence output | files and records retained | traces, scorecard, SME notes, selection record |
Benchmark result summary
| Option | Task score | Domain score | Red-team score | p95 latency | Cost/case | Governance readiness | Recommendation |
|---|---|---|---|---|---|---|---|
| Current champion | 4.2 | 4.0 | 4.5 | 4.8s | $0.38 | high | keep, monitor stale-policy slice |
| Frontier challenger | 4.6 | 4.4 | 4.2 | 6.2s | $0.71 | medium | constrained pilot for complex cases |
| Open-weight challenger | 3.9 | 3.6 | 3.8 | 3.9s | $0.29 | high if hosted internally | continue challenge after red-team tuning |
| No-AI baseline | 3.1 | 3.3 | 5.0 | human search | staff time | high | keep as fallback |
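A challenger review can start from a simple per-metric comparison against the champion, so regressions are visible before anyone looks at averages. The sketch below uses the figures from the summary table; the `compare_to_champion` function and the metric keys are illustrative assumptions, and the comparison does not replace the council's judgement on materiality.

```python
# Figures copied from the benchmark result summary table.
RESULTS = {
    "Current champion": {"task": 4.2, "domain": 4.0, "red_team": 4.5,
                         "p95_latency_s": 4.8, "cost_per_case": 0.38},
    "Frontier challenger": {"task": 4.6, "domain": 4.4, "red_team": 4.2,
                            "p95_latency_s": 6.2, "cost_per_case": 0.71},
    "Open-weight challenger": {"task": 3.9, "domain": 3.6, "red_team": 3.8,
                               "p95_latency_s": 3.9, "cost_per_case": 0.29},
}

HIGHER_IS_BETTER = ("task", "domain", "red_team")
LOWER_IS_BETTER = ("p95_latency_s", "cost_per_case")


def compare_to_champion(challenger: dict, champion: dict) -> dict:
    """List the metrics on which the challenger improves or regresses."""
    improved, regressed = [], []
    for metric in HIGHER_IS_BETTER:
        if challenger[metric] > champion[metric]:
            improved.append(metric)
        elif challenger[metric] < champion[metric]:
            regressed.append(metric)
    for metric in LOWER_IS_BETTER:
        if challenger[metric] < champion[metric]:
            improved.append(metric)
        elif challenger[metric] > champion[metric]:
            regressed.append(metric)
    return {"improved": improved, "regressed": regressed}


champion = RESULTS["Current champion"]
for name in ("Frontier challenger", "Open-weight challenger"):
    print(name, compare_to_champion(RESULTS[name], champion))
```

Read this way, the frontier challenger gains on task and domain but regresses on red-team, latency and cost, which is consistent with the table recommending a constrained pilot, not a promotion.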
Template: model selection record
Decision title:
Use case:
Date:
Decision owner:
Architecture owner:
Risk owner:
Decision:
Approve / reject / constrain / watchlist / retire / keep as challenger.
Selected champion:
Model id, version boundary, model family, provider mode.
Approved use:
Tasks, channels, users, risk tier, workflow steps and human role.
Prohibited use:
Decisions, customer-facing commitments, restricted data or autonomous actions not permitted.
Options considered:
1. Current champion.
2. Challenger A.
3. Challenger B.
4. No-AI or legacy baseline.
Evidence reviewed:
- model cards:
- benchmark pack:
- capability scorecard:
- red-team report:
- domain SME review:
- latency/reliability test:
- security/privacy review:
- model risk challenge:
- production monitoring signals:
Decision rationale:
Explain why this option is fit for the approved use, what tradeoffs are accepted and which scorecard dimensions drove the decision.
Conditions:
List controls, limits, monitoring, human review, fallback and expiry.
Retirement / revalidation triggers:
List exact events that invalidate approval.
Residual risk:
Describe remaining known risks and why the business owner accepts or rejects them.
Approval:
Business owner:
Product owner:
Architecture:
Security/privacy:
Model risk / governance:
Operations:
Template: retirement trigger
| Trigger | Signal | Required action | Example |
|---|---|---|---|
| Critical red-team regression | new prompt injection, privacy leak, unsafe advice or unauthorized action | freeze expansion, incident review, rerun safety pack | enterprise knowledge assistant leaks restricted HR document |
| Domain policy change | product, jurisdiction, regulation or procedure changed | rerun impacted domain benchmark | credit policy effective date changed |
| Supplier version drift | provider changes model behavior, logs, retention or terms | change impact review and re-approval | API snapshot replaced by provider |
| Performance drift | production pass rate, override, complaint or QA score degrades | watchlist and benchmark rerun | contact center unsupported answers increase |
| Operational unfit | latency, timeout, availability or rate limit misses SLO | constrain scope or fallback | fraud intervention p95 exceeds queue window |
| Security issue | vulnerability, dependency issue, hosting patch gap, license concern | patch or retire | open-weight runtime CVE affects hosted model |
| Evidence expiry | model card, benchmark, approval or review period expired | re-benchmark or suspend new use | quarterly review missed for high-risk model |
| Better challenger | challenger materially improves safety/domain/operability with sufficient evidence | promotion decision and migration plan | KYC extractor reduces manual review while preserving exception routing |
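If monitoring emits trigger names, the required actions in the table can be looked up mechanically so that no fired trigger is left without an owner action. The sketch below is a hedged illustration: the snake_case trigger keys and the `actions_for` function are assumptions, while the action texts are copied from the table.

```python
# Trigger -> required action, copied from the retirement trigger table.
REQUIRED_ACTION = {
    "critical_red_team_regression": "freeze expansion, incident review, rerun safety pack",
    "domain_policy_change": "rerun impacted domain benchmark",
    "supplier_version_drift": "change impact review and re-approval",
    "performance_drift": "watchlist and benchmark rerun",
    "operational_unfit": "constrain scope or fallback",
    "security_issue": "patch or retire",
    "evidence_expiry": "re-benchmark or suspend new use",
    "better_challenger": "promotion decision and migration plan",
}


def actions_for(fired: set) -> list:
    """Return (trigger, required action) pairs in table order.

    Unknown trigger names raise an error so a typo cannot silently drop a signal.
    """
    unknown = set(fired) - set(REQUIRED_ACTION)
    if unknown:
        raise ValueError(f"unknown triggers: {sorted(unknown)}")
    return [(t, a) for t, a in REQUIRED_ACTION.items() if t in fired]


# Hypothetical signals from one monthly review.
for trigger, action in actions_for({"evidence_expiry", "supplier_version_drift"}):
    print(f"{trigger}: {action}")
```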
Retirement decision table
| Decision | Meaning | Evidence |
|---|---|---|
| keep | evidence remains valid | updated scorecard and monitoring review |
| watchlist | concern exists but current use can continue under monitoring | issue record, owner, review date |
| constrain | narrow approved use, user group, risk tier or channel | revised model card and policy control |
| replace | promote challenger and migrate use case | selection record and rollout plan |
| retire | no new use; remove from active service | decommission record and evidence archive |
Template: evidence packet
| Evidence object | What it proves | Owner |
|---|---|---|
| Model portfolio inventory snapshot | which models existed and lifecycle status at decision time | AI platform / governance |
| Model card | approved use, prohibited use, version, data boundary and owner | AI architect |
| Benchmark pack manifest | exact task, dataset, rubric and run protocol | EvalOps |
| Run results and traces | what the model produced under which configuration | AI platform |
| Capability scorecard | multi-dimensional decision evidence | PM / architect |
| Red-team report | unacceptable behavior was tested and handled | security / model risk |
| Domain SME review | policy and workflow correctness were judged by qualified roles | BA / SME |
| Security/privacy review | data boundary and access controls were assessed | security / privacy |
| Operational readiness | latency, reliability, fallback and human workflow fit | operations |
| Model selection record | why champion/challenger decision was made | model selection council |
| Exception/risk acceptance | known gaps, compensating controls, expiry and owner | business owner / risk |
| Retirement triggers | when approval must be reviewed or revoked | governance |
PM/BA/architecture questions
PM questions
| Question | Good answer signal |
|---|---|
| What business workflow is this model approved for? | specific task, user, channel and outcome |
| What is the no-AI or current champion baseline? | comparison is not model-only |
| Which failures are unacceptable even if the average score is high? | hard blockers are explicit |
| How will users know when to trust, edit, escalate or ignore AI output? | human oversight and UX controls are defined |
| How does the model portfolio decision affect adoption and operating model? | decision ties to training, QA and monitoring |
BA / CBAP questions
| Question | Good answer signal |
|---|---|
| Which policy rules, exceptions and edge cases are in the benchmark pack? | domain slices and expected behavior are documented |
| Where does AI enter the process and where does human authority remain? | BPMN/workflow boundary is clear |
| What labels and rubric require SME or compliance authority? | label authority is named |
| Which customer segments, products, channels and languages are covered? | coverage matrix exists |
| What evidence proves that a requirement maps to an eval and control? | traceability is available |
Architecture questions
| Question | Good answer signal |
|---|---|
| Is the selection decision made at model brand level or system component level? | component-level champions are defined |
| Can we reproduce the run later? | model, prompt, retrieval, dataset and judge versions are recorded |
| What happens if the model fails, times out or changes behavior? | fallback, timeout and revalidation triggers exist |
| Can traces be exported for audit and incident review? | trace schema and retention are approved |
| Is open vs proprietary being judged by evidence, not ideology? | hosting, patching, data route, safety and license controls are included |
Release checklist
| Check | Pass condition | Status |
|---|---|---|
| Use-case boundary | approved use and prohibited use are written in model card | pass required |
| Risk tier | use case has low/medium/high/critical tier | pass required |
| Benchmark pack | task, domain, red-team and operational tests are versioned | pass required |
| Scorecard | weighted score and hard blockers reviewed | pass required |
| Critical failures | zero unresolved critical blockers for high-risk scope | pass required |
| Security/privacy | data route, logging, retention, DLP and entitlement reviewed | pass required |
| Human oversight | review, escalation and override path tested | pass required |
| Operational readiness | latency, reliability, fallback and support model tested | pass required |
| Evidence packet | model card, results, traces and selection record archived | pass required |
| Retirement triggers | review expiry and event triggers defined | pass required |
| Council decision | approve, constrain, reject, watchlist or retire recorded | pass required |
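Because every row is marked "pass required", the checklist behaves as an all-or-nothing gate. A minimal sketch of that gate follows; the check keys, the `release_gate` function and the status strings are assumptions made for illustration.

```python
# Check names follow the release checklist rows; keys are illustrative.
CHECKS = [
    "use_case_boundary",
    "risk_tier",
    "benchmark_pack",
    "scorecard",
    "critical_failures",
    "security_privacy",
    "human_oversight",
    "operational_readiness",
    "evidence_packet",
    "retirement_triggers",
    "council_decision",
]


def release_gate(status: dict):
    """Return (ready, outstanding checks).

    Any check that is missing or not recorded as "pass" stays outstanding,
    so an unrecorded check cannot be mistaken for a pass.
    """
    outstanding = [c for c in CHECKS if status.get(c) != "pass"]
    return (not outstanding, outstanding)


# Hypothetical status sheet with one open item.
status = {c: "pass" for c in CHECKS}
status["human_oversight"] = "open"
print(release_gate(status))
```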
Release outcomes:
| Outcome | Meaning |
|---|---|
| approve | model becomes champion for the defined use boundary |
| limited approve | model can run for specific channel, user group, volume, risk tier or human-review mode |
| challenger only | model remains in evaluation and cannot serve production |
| reject | evidence does not support the use case |
| watchlist champion | current champion remains but expansion is frozen |
| retire | model is removed from active approved use |
Executive narrative
Use this narrative when speaking with a COO, CIO, CTO, CRO or AI governance committee:
We are not choosing a single best AI model for the enterprise. We are governing a model portfolio. Each model is approved for specific tasks, risk tiers and data boundaries based on repeatable evidence. Our scorecard combines task quality, domain performance, safety, latency, cost fitness, security and audit readiness. Public benchmarks inform us, but production approval depends on our own benchmark packs and red-team results. We run champion/challenger reviews so we can adopt better models without uncontrolled drift, and we define retirement triggers so outdated or unsafe models do not remain in regulated workflows. This gives product teams speed, architecture teams control, and risk teams evidence.
For financial retail, translate it this way:
| Executive concern | Portfolio governance response |
|---|---|
| Are we moving fast enough? | reusable benchmark packs and model cards reduce repeated debate |
| Are we taking uncontrolled risk? | approved-use boundaries, hard blockers and retirement triggers constrain use |
| Are we locked into one provider? | champion/challenger portfolio compares proprietary, open and specialist models |
| Can audit understand the decision? | evidence packet links model, benchmark, scorecard, approval and monitoring |
| Are costs considered? | cost fitness is scored, but safety and domain blockers override cost savings |
Interview drills
Drill 1: Explain the difference between benchmark and portfolio governance
30-second answer:
A benchmark compares model behavior under a defined test. Portfolio governance decides which models are approved for which business tasks over time. It adds model inventory, capability taxonomy, scorecards, champion/challenger cadence, risk-tiered approval, retirement triggers and audit evidence.
Follow-up:
| Interviewer asks | Strong response |
|---|---|
| Why not just use the top leaderboard model? | public scores are not enough for domain policy, data boundary, latency, human oversight or audit |
| What is a hard blocker? | a failure that overrides average score, such as PII leakage or unsupported credit decision |
| Who should decide? | product, architecture, security, privacy, model risk, operations and business owner through a selection council |
Drill 2: Contact center model selection
Prompt:
A contact center wants to replace its current copilot model with a cheaper small model.
Answer structure:
- Confirm approved use: draft guidance for agents, not direct customer promises.
- Run current champion, small challenger and no-AI baseline against the same contact policy benchmark.
- Score grounding, vulnerable-customer escalation, unsupported fee reversal, tone, p95 latency, cost and trace export.
- If small model passes routine FAQ but fails complex policy, approve it only for low-risk FAQ and route complex cases to champion.
- Record conditions, monitoring and retirement/revalidation triggers.
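The scoping step in this answer can be illustrated as an approved-use lookup: a model serves a case type only if that case type is inside its approved use, and anything unapproved falls back to the no-AI baseline. The model names, case types and `pick_model` function below are hypothetical, and the sketch is not a routing strategy.

```python
# Hypothetical approved-use boundaries after the benchmark described above.
APPROVED_USE = {
    "small-challenger": {"routine_faq"},
    "current-champion": {"routine_faq", "complex_policy"},
}

# Cheaper model is preferred where its approval covers the case type.
PREFERENCE = ("small-challenger", "current-champion")


def pick_model(case_type: str) -> str:
    """Return the first preferred model approved for this case type."""
    for model in PREFERENCE:
        if case_type in APPROVED_USE[model]:
            return model
    # Nothing is approved for this case type: fall back to the no-AI baseline.
    return "no-ai-baseline"


for case in ("routine_faq", "complex_policy", "fee_reversal_dispute"):
    print(case, "->", pick_model(case))
```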
Drill 3: AML triage challenger
Prompt:
A new domain-tuned model improves average AML triage accuracy but misses rare high-risk typologies.
Strong answer:
I would not promote it based on average accuracy. AML needs severity-weighted scoring and false negative controls. Rare high-risk typology misses are hard blockers or at least scope constraints. I would keep the current champion, add the missed typologies to the regression/challenge set, and allow the challenger only in a lower-risk assistive role if it provides value without weakening analyst review.
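A small worked example shows why the average can mislead. The sketch below compares plain recall with severity-weighted recall on two typology slices. All counts and the 10x severity weight are invented for illustration, and recall on truly suspicious cases is used as a stand-in for the "accuracy" in the prompt because the concern is false negatives.

```python
# Hypothetical detection counts per typology slice (truly suspicious cases only).
champion = {
    "routine": {"cases": 900, "detected": 810, "severity": 1},
    "rare_high_risk": {"cases": 100, "detected": 95, "severity": 10},
}
challenger = {
    "routine": {"cases": 900, "detected": 880, "severity": 1},
    "rare_high_risk": {"cases": 100, "detected": 70, "severity": 10},
}


def plain_recall(slices: dict) -> float:
    """Share of suspicious cases detected, every case counted equally."""
    detected = sum(s["detected"] for s in slices.values())
    cases = sum(s["cases"] for s in slices.values())
    return detected / cases


def severity_weighted_recall(slices: dict) -> float:
    """Same ratio, but each case counts in proportion to its slice severity."""
    detected = sum(s["detected"] * s["severity"] for s in slices.values())
    cases = sum(s["cases"] * s["severity"] for s in slices.values())
    return detected / cases


for name, slices in (("champion", champion), ("challenger", challenger)):
    print(name,
          round(plain_recall(slices), 3),
          round(severity_weighted_recall(slices), 3))
```

With these numbers the challenger leads on plain recall but trails once the rare high-risk slice is weighted, which is the pattern that justifies keeping the champion and adding the missed typologies to the regression set.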
Drill 4: Open vs proprietary model
Prompt:
The CTO asks whether open-weight models are safer because we can host them internally.
Strong answer:
Hosting control is useful but not sufficient. I would compare open and proprietary models through the same scorecard: task quality, domain score, red-team, data route, patching, license, trace, latency, cost fitness and governance readiness. Open-weight may win for data residency and control, but it needs a patch, safety, eval and operations lifecycle. Proprietary may win on quality, but it needs version boundaries, logging, change notice and data retention evidence.
Drill 5: Retirement trigger
Prompt:
A champion model still works, but the supplier changes retention and logging terms.
Strong answer:
That is a governance trigger even if task quality has not changed. I would freeze expansion, review the data boundary and audit evidence impact, update the model card, rerun any affected security/privacy tests, and require the selection council to keep, constrain, replace or retire the model. A model can be retired because evidence and control no longer support approved use, not only because quality dropped.