AI Model Portfolio Benchmarking: Model Portfolio Evaluation and Selection Governance Architecture
Date: 2026-06-30
Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Architecture
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note for website-visible portfolio, model selection council design, audit evidence and interview discussion
Why model portfolio governance matters
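Financial retail enterprises will not run on a single AI model for long. A real environment usually contains frontier models, small models, embedding models, rerankers, domain-tuned classifiers, document extraction models, speech models, judge models, open-weight models, managed proprietary models and legacy ML models side by side. The question is no longer “which model is best” but:
For each business task, risk tier, data boundary, latency constraint, cost constraint, security requirement and audit requirement, which model family is approved for use on current evidence, when is a challenger needed, and when must a model be retired?
This is a portfolio governance problem, not a one-off benchmark or vendor demo.
| Governance question | Architectural implication | Financial retail consequence |
|---|---|---|
| How are model capabilities classified | A unified capability taxonomy is needed so that each team does not use its own definition of “works well” | Without it, contact center, AML, KYC, credit and payments teams cannot compare evidence |
| How do scores enter decisions | The scorecard must be bound to task, threshold, risk tier and evidence, not to an average-score ranking | High-risk scenarios must not be masked by generic leaderboard scores or low-risk FAQ scores |
| How are model families managed | proprietary, open-weight, fine-tuned, small, specialist and judge models each have a different control surface | Avoids supplier concentration, unauditable black boxes and open weights without a patch process |
| How do challengers run | A new model must be compared on the same benchmark pack, the same rubric and the same cost/latency measurement | Avoids letting a “the new model is stronger” slogan bypass the release gate |
| When to retire | Needs triggers such as drift, incidents, policy change, runaway cost, safety regression and supplier risk | Avoids legacy models continuing to serve regulated workflows |
| How does audit re-verify | Every selection must leave a model card, scorecard, benchmark run, decision record and exception record | Supports model risk, internal audit, regulators, product retrospectives and management accountability |
Scope note: this note does not cover model routing strategy design, procurement sandboxes, board-level investment narratives or AI FinOps cost management. It focuses on ongoing model portfolio governance: capability classification, benchmarking, capability scorecards, the selection council, champion/challenger, risk tiering, retirement triggers and the evidence chain.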
Concept diagram
flowchart TB
A[Business capability map] --> B[AI task taxonomy]
B --> C[Model portfolio inventory]
C --> D[Model cards and approved-use boundaries]
B --> E[Benchmark packs]
E --> E1[Quality and task metrics]
E --> E2[Safety and red-team tests]
E --> E3[Domain and policy tests]
E --> E4[Cost, latency and resilience tests]
D --> F[Capability scorecard]
E1 --> F
E2 --> F
E3 --> F
E4 --> F
F --> G{Model selection council}
G -->|Approve| H[Champion model for approved use]
G -->|Constrain| I[Conditional use with controls]
G -->|Reject| J[Do not use or return to lab]
G -->|Challenge| K[Challenger backlog]
H --> L[Release and production monitoring]
I --> L
L --> M[Evidence packet]
L --> N[Drift, incident and policy-change signals]
N --> K
N --> O[Retirement trigger]
O --> P[Replacement / fallback / decommission]
M --> Q[Audit, model risk, product decision record]
Q --> C
Core idea:
model portfolio governance
= inventory + capability taxonomy + benchmark pack + scorecard
+ selection council + champion/challenger + retirement trigger
+ auditable evidence
Core architecture model
1. Portfolio layers
| Layer | Responsibility | Key objects | Typical owner |
|---|---|---|---|
| Business capability layer | States which business capability and process step the model serves | contact servicing, KYC onboarding, AML triage, credit policy support, fraud intervention | business owner / AI PM / BA |
| AI task layer | Breaks business capabilities into evaluable tasks | classify, extract, summarize, retrieve-answer, draft, detect anomaly, recommend escalation | PM / BA / AI architect |
| Model inventory layer | Records all candidate and approved models | model family, provider, version, deployment mode, data route, approved use, restriction | AI platform / governance |
| Benchmark layer | Prepares reproducible evaluation packs for each task | dataset version, rubric, red-team set, domain set, run config, judge config | EvalOps / SME / model risk |
| Scorecard layer | Turns multi-dimensional results into selection evidence | task score, domain score, red-team score, latency/cost/security score, confidence | AI PM / architect / governance |
| Decision layer | Handles approval, constraint, challenge, retirement and exceptions | model selection record, risk acceptance, conditions, expiry, fallback | model selection council |
| Evidence layer | Retains auditable evidence | model card, benchmark manifest, scorecard, traces, approval, exception, retirement record | release / GRC / audit |
2. Model portfolio inventory as a governed asset
The model inventory is not a technical team's spreadsheet. It has to support joint judgment by business, architecture, risk, audit and operations.
| Field | Description | Example |
|---|---|---|
| model_id | Enterprise-internal unique identifier | LLM-GP-PROP-A-2026-06 |
| model_family | general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety model | general LLM |
| provider_mode | proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor application component | proprietary API |
| version_boundary | Model version, snapshot, API date, fine-tune id, guardrail version | 2026-06 snapshot, guardrail v4 |
| approved_use | Permitted tasks, scenarios and risk tiers | contact center agent assist, draft only |
| prohibited_use | Uses barred from direct decisioning or from customer-visible output | no autonomous credit decision, no SAR conclusion |
| data_boundary | PII, PCI, transactions, documents, voice, logs, cross-border transfer, training use, retention | no provider training, EU data route, 30-day logs |
| benchmark_status | last run, benchmark pack, pass/fail, open gaps | passed CONTACT-RAG-v2026.06 with 2 accepted gaps |
| operational_slo | latency, availability, rate limit, fallback | p95 < 4s for agent assist |
| control_status | DLP, prompt injection, access, logging, red-team, human review | DLP and trace export approved |
| lifecycle_status | candidate / challenger / champion / constrained / watchlisted / retired | champion |
| retirement_trigger | triggers that invalidate current approval | critical safety regression, policy citation fail, supplier exit |
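As a minimal sketch, the inventory fields above could be carried as a typed record so that approved-use checks run in code rather than by reading a spreadsheet. The class name `ModelInventoryRecord`, the `allows` helper and the rule that only champion and constrained models may serve a use are illustrative assumptions, and only a subset of the fields is shown.

```python
from dataclasses import dataclass
from enum import Enum


class LifecycleStatus(Enum):
    CANDIDATE = "candidate"
    CHALLENGER = "challenger"
    CHAMPION = "champion"
    CONSTRAINED = "constrained"
    WATCHLISTED = "watchlisted"
    RETIRED = "retired"


@dataclass(frozen=True)
class ModelInventoryRecord:
    """Hypothetical record holding a subset of the inventory fields."""
    model_id: str
    model_family: str
    provider_mode: str
    version_boundary: str
    approved_use: tuple[str, ...]
    prohibited_use: tuple[str, ...]
    lifecycle_status: LifecycleStatus

    def allows(self, use: str) -> bool:
        # Assumption: only champion and constrained models may serve a use,
        # and the use must be explicitly approved and not prohibited.
        live = self.lifecycle_status in {
            LifecycleStatus.CHAMPION,
            LifecycleStatus.CONSTRAINED,
        }
        return live and use in self.approved_use and use not in self.prohibited_use


record = ModelInventoryRecord(
    model_id="LLM-GP-PROP-A-2026-06",
    model_family="general LLM",
    provider_mode="proprietary API",
    version_boundary="2026-06 snapshot, guardrail v4",
    approved_use=("contact center agent assist",),
    prohibited_use=("autonomous credit decision",),
    lifecycle_status=LifecycleStatus.CHAMPION,
)
print(record.allows("contact center agent assist"))
print(record.allows("autonomous credit decision"))
```

Matching on exact strings is deliberately strict here: a use that was never evaluated is denied by default instead of being inferred from a similar approved use.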
3. Capability scorecard as decision interface
The scorecard is not meant to replace judgment. It puts the judgments of different roles into a single evidence table.
public benchmark tells us generic capability.
domain benchmark tells us task fit.
red-team benchmark tells us unacceptable behavior.
architecture score tells us operability.
governance score tells us whether evidence can survive audit.
selection council turns all of that into approved-use boundaries.
Capability taxonomy and scorecard
Capability taxonomy
| Capability family | What to evaluate | Financial retail examples |
|---|---|---|
| Language and reasoning | instruction following, multi-step reasoning, ambiguity handling, calibration | contact center policy explanation, credit memo critique |
| Retrieval-grounded answer | source recall, citation support, stale-source handling, entitlement respect | credit policy RAG, enterprise knowledge assistant |
| Information extraction | field accuracy, table understanding, layout robustness, confidence and exception detection | KYC document extraction, income proof review |
| Summarization and narrative | completeness, material omission, tone, traceability to evidence | AML case narrative draft, complaint root cause summary |
| Classification and triage | precision/recall by severity, false negative control, threshold behavior | AML alert triage, payment fraud queue priority |
| Action recommendation | policy compliance, escalation quality, human approval fit | payments fraud intervention, collections hardship next action |
| Safety and security | prompt injection, PII leakage, unsafe advice, tool misuse, jailbreak resistance | customer-facing chatbot, analyst copilot, internal RAG |
| Domain and policy | product rules, regulatory boundaries, jurisdictional nuance, effective dates | credit policy, KYC policy, AML typology, Reg E dispute rules |
| Operability | latency, availability, trace export, rate limit, fallback, observability | contact center p95 latency, fraud real-time queue |
| Governance readiness | model card quality, version control, audit evidence, change notice, retirement support | all regulated use cases |
Scorecard dimensions
Score each dimension 1-5, but apply hard blockers for high-risk use cases. A strong average cannot compensate for a critical failure.
| Dimension | Weight | 1 | 3 | 5 | Evidence |
|---|---|---|---|---|---|
| Task quality | 14 | fails common cases | acceptable average | strong pass rate with slice stability | benchmark results and SME review |
| Domain score | 14 | generic language only | handles common policy | handles edge cases, effective dates and product nuance | domain benchmark pack |
| Red-team / safety score | 14 | critical failures | mitigations partial | no critical failures in approved set | adversarial run, safety report |
| Grounding and citation | 10 | unsupported claims | citations sometimes weak | claims trace to allowed sources | RAG/citation eval |
| Robustness | 8 | brittle to wording/noise | stable on common variants | stable across language, channel, missing evidence and ambiguity | mutation tests |
| Security and data boundary | 10 | unclear logs/training/access | basic controls | enforceable data route, DLP, IAM, audit export | security review |
| Latency and reliability | 8 | unusable p95 or rate limits | acceptable with fallback | meets workflow SLO under stress | load and resilience test |
| Cost fitness | 6 | unit cost blocks scale | usable for limited scope | cost fits approved use and fallback policy | unit cost sheet |
| Human oversight fit | 6 | encourages overtrust | review possible | supports review, escalation and override evidence | workflow simulation |
| Governance readiness | 10 | no version/evidence | partial records | model card, run manifest, decision record, retirement support | evidence packet |
Hard blockers:
- Any critical customer harm, privacy, security, regulated advice or unauthorized action failure in approved high-risk scope.
- No reproducible benchmark run for the task.
- No model/version boundary or provider configuration boundary.
- No trace export for regulated workflows requiring review.
- Model behavior materially changed without change notice or re-benchmark.
- Open model cannot be patched, scanned, hosted or access-controlled to enterprise policy.
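The interaction between weights and hard blockers can be sketched as follows. The weights are copied from the dimensions table; the dimension keys, the function name `evaluate_scorecard` and the shape of the result are hypothetical choices made for illustration.

```python
# Weights copied from the scorecard dimensions table; they sum to 100.
WEIGHTS = {
    "task_quality": 14,
    "domain_score": 14,
    "red_team_safety": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def evaluate_scorecard(scores: dict[str, int], hard_blockers: list[str]) -> dict:
    """Return the weighted 1-5 score plus an eligibility flag.

    Any hard blocker makes the model ineligible regardless of the average.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be scored 1-5, got {scores[name]}")
    weighted = sum(WEIGHTS[n] * scores[n] for n in WEIGHTS) / sum(WEIGHTS.values())
    return {
        "weighted_score": round(weighted, 2),
        "hard_blockers": list(hard_blockers),
        "eligible": not hard_blockers,
    }


scores = {name: 5 for name in WEIGHTS}
scores["red_team_safety"] = 1
result = evaluate_scorecard(scores, ["critical PII leakage in approved high-risk scope"])
print(result["weighted_score"], result["eligible"])
```

In this example a model scoring 5 everywhere except a 1 on red-team/safety still averages 4.44, which is why eligibility is computed separately from the weighted score.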
Model family comparison
| Model family | Strength | Weakness | Good fit | Governance emphasis |
|---|---|---|---|---|
| Frontier proprietary | strong reasoning, language, tool use | cost, latency, data route, black-box change risk | complex contact center, credit policy RAG, knowledge assistant | version boundary, logs, supplier change notice, red-team |
| Small proprietary | fast and cheaper | weaker long reasoning and edge cases | high-volume FAQ, simple classification, draft suggestions | task boundary, escalation, challenger monitoring |
| Open-weight general | deployment control, inspectable hosting choices | ops burden, safety patching, weaker managed controls | internal knowledge assistant, constrained extraction, sovereign data | hosting, patch cadence, safety layer, license review |
| Domain-tuned model | better terminology and stable task behavior | narrow scope, data/version governance | AML typology classification, KYC extraction, fraud intervention | training data lineage, drift, revalidation |
| Specialist extractor/OCR | document/layout accuracy | less flexible reasoning | KYC document extraction, income proof extraction | field-level accuracy, exception routing, confidence calibration |
| Embedding/reranker | retrieval quality and entitlement | invisible failure if not evaluated | RAG for policy and knowledge assistant | source recall, access filtering, index/version |
| Judge/evaluator model | scalable rubric support | bias, instability, circular evaluation | regression triage, large eval runs | judge calibration, human audit sample, version control |
Benchmark and challenger lifecycle
Lifecycle states
| State | Meaning | Allowed decisions |
|---|---|---|
| Candidate | model has been proposed or discovered but not approved | lab testing only |
| Baseline | current comparator or no-AI process | compare, keep as fallback |
| Challenger | model is tested against champion for a defined use case | no production use unless separately approved |
| Champion | model approved for specific use boundary | release with controls |
| Constrained champion | approved only for limited channel, segment, risk tier or human-review mode | pilot or limited production |
| Watchlisted | production signals, supplier changes or benchmark regressions require review | freeze expansion, run additional benchmark |
| Retired | no new use, replaced or decommissioned | archive evidence, keep historical records |
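One way to keep lifecycle changes auditable is to encode the states as an explicit transition table. The table above defines the states but not which moves between them are legal, so the `ALLOWED` map below is purely an assumption for illustration; Baseline is left out because it describes a comparator role rather than a stage a model moves through.

```python
# Assumed transition map; the source defines the states, not the legal moves.
ALLOWED = {
    "candidate": {"challenger", "retired"},
    "challenger": {"champion", "constrained", "candidate", "retired"},
    "champion": {"constrained", "watchlisted", "retired"},
    "constrained": {"champion", "watchlisted", "retired"},
    "watchlisted": {"champion", "constrained", "retired"},
    "retired": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the assumed map."""
    if current not in ALLOWED:
        raise ValueError(f"unknown state: {current}")
    if target not in ALLOWED[current]:
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target


state = transition("candidate", "challenger")
state = transition(state, "champion")
print(state)
```

Making retired a terminal state means a retired model can only return through a fresh candidate record, which keeps old approvals from being revived silently.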
Benchmark pack design
| Pack element | Required content | Example |
|---|---|---|
| Task definition | AI role, input, expected output, unacceptable output | credit policy RAG answers policy questions with citations, no credit decision |
| Dataset manifest | source, version, hash, slice coverage, privacy class | 420 cases, English/Spanish, policy v2026.05 |
| Rubric | scoring dimensions, severity, thresholds | groundedness 0-5, critical fail if unsupported adverse action reason |
| Domain set | business policy, product nuance, jurisdiction, effective date | KYC address proof exceptions, AML typology ambiguity |
| Red-team set | prompt injection, PII, unsafe advice, tool misuse, jailbreak | customer asks agent to reveal another account |
| Operational test | p50/p95 latency, timeout, rate limit, failover | contact center p95 under 800 concurrent agents |
| Security test | access control, data retention, logging, DLP | restricted HR policy not retrievable by branch user |
| Run protocol | model version, prompt version, temperature, repeats, judge version | fixed prompt v12, 3 repeated runs on unstable cases |
| Evidence output | traces, scorecard, failure taxonomy, decision memo | evidence binder object id |
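The dataset manifest row asks for source, version, hash, slice coverage and privacy class. A minimal sketch of producing one with the standard library follows; the function name `dataset_manifest`, the `slice` key on each case and the sample cases are assumptions.

```python
import hashlib
import json


def dataset_manifest(cases: list[dict], *, source: str, version: str,
                     privacy_class: str) -> dict:
    """Build a manifest whose hash changes whenever the cases change.

    The hash covers the cases in their given order, so reordering the
    dataset also produces a new hash.
    """
    canonical = json.dumps(
        cases, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    slice_coverage: dict[str, int] = {}
    for case in cases:
        key = case.get("slice", "unlabeled")
        slice_coverage[key] = slice_coverage.get(key, 0) + 1
    return {
        "source": source,
        "version": version,
        "privacy_class": privacy_class,
        "case_count": len(cases),
        "slice_coverage": slice_coverage,
        "sha256": hashlib.sha256(canonical).hexdigest(),
    }


cases = [
    {"id": 1, "slice": "English", "question": "Which fee waiver rule applies?"},
    {"id": 2, "slice": "Spanish", "question": "¿Qué regla de exención aplica?"},
]
manifest = dataset_manifest(
    cases, source="credit policy", version="v2026.05", privacy_class="internal"
)
print(manifest["case_count"], manifest["slice_coverage"])
```

Storing the hash alongside the run protocol lets a reviewer confirm that champion and challenger were scored on the same cases.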
Champion/challenger cadence
| Trigger | Action | Decision path |
|---|---|---|
| New model version available | run benchmark pack against challenger | approve, reject, keep challenger, constrain |
| Production complaints or overrides increase | mine failures into regression set and re-run champion | watchlist or remediate |
| Business policy changes | re-run impacted domain and RAG packs | keep, update prompt/RAG, suspend use |
| Red-team failure appears | run safety pack and incident review | freeze expansion, hotfix, retire if unresolved |
| Cost or latency becomes unfit | compare smaller or local challenger | constrained use or replacement |
| Supplier changes data/log/retention terms | architecture and governance review | suspend new use until evidence is updated |
| Open model patch or vulnerability | patch, scan, rerun critical packs | keep or retire |
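A challenger comparison is only meaningful when both runs share the same benchmark pack and rubric. The sketch below enforces that and maps the outcome to the decision path values in the first row of the table; the `RunResult` fields, the 2-point minimum gain and the 4-second p95 SLO are hypothetical thresholds, not values set by this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one benchmark run."""
    model_id: str
    pack_id: str
    rubric_version: str
    pass_rate: float          # fraction of cases passed, 0.0-1.0
    critical_failures: int    # count of hard-blocker failures
    p95_latency_s: float


def compare(champion: RunResult, challenger: RunResult,
            min_gain: float = 0.02, slo_p95_s: float = 4.0) -> str:
    """Return one of: approve, reject, keep challenger, constrain."""
    if (champion.pack_id, champion.rubric_version) != (
        challenger.pack_id, challenger.rubric_version
    ):
        raise ValueError("runs are not comparable: pack or rubric differs")
    if challenger.critical_failures > 0:
        return "reject"
    if challenger.p95_latency_s > slo_p95_s:
        # Quality may be fine, but the workflow SLO limits where it can run.
        return "constrain"
    if challenger.pass_rate >= champion.pass_rate + min_gain:
        return "approve"
    return "keep challenger"


champion = RunResult("champ", "CONTACT-RAG-v2026.06", "r3", 0.88, 0, 3.1)
challenger = RunResult("chall", "CONTACT-RAG-v2026.06", "r3", 0.93, 0, 2.4)
print(compare(champion, challenger))
```

The function only proposes a decision path; under the governance model described here the selection council still owns the final record.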
Financial retail scenarios
1. Contact center agent assist
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| Frontier proprietary champion for complex policy questions; small model challenger for routine FAQ | grounding, latency, tone, vulnerable customer escalation, PII safety | no critical unsupported fee reversal promise; p95 under workflow SLO |
Evidence:
- contact policy RAG pack with current and stale policy conflicts.
- red-team set for customer pressure, prompt injection, and account privacy.
- trace showing retrieved sources, model output, agent edits, escalation and final disposition.
2. AML triage and investigation narrative
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| Domain-tuned classifier for alert prioritization; LLM only drafts narrative after analyst evidence selection | false negative control, typology coverage, no final SAR conclusion | zero critical missed high-risk typology in challenge set |
Evidence:
- AML typology benchmark by structuring, mule activity, funnel account, rapid movement and benign lookalikes.
- narrative rubric for material omission and evidence citation.
- human review log proving analyst retains final decision.
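The example threshold, zero critical missed high-risk typology in the challenge set, can be checked per typology rather than in aggregate. A minimal sketch, assuming each challenge case is a dict with hypothetical keys `typology`, `is_high_risk` (ground truth) and `flagged` (model output):

```python
from collections import defaultdict


def high_risk_recall(cases: list[dict]) -> dict[str, dict]:
    """Per-typology recall and missed count over high-risk cases only."""
    total: dict[str, int] = defaultdict(int)
    missed: dict[str, int] = defaultdict(int)
    for case in cases:
        if case["is_high_risk"]:
            total[case["typology"]] += 1
            if not case["flagged"]:
                missed[case["typology"]] += 1
    return {
        t: {"recall": (total[t] - missed[t]) / total[t], "missed": missed[t]}
        for t in total
    }


def passes_gate(report: dict[str, dict]) -> bool:
    # Gate from the example threshold: no missed high-risk case in any typology.
    return all(row["missed"] == 0 for row in report.values())


challenge_set = [
    {"typology": "structuring", "is_high_risk": True, "flagged": True},
    {"typology": "structuring", "is_high_risk": True, "flagged": True},
    {"typology": "mule activity", "is_high_risk": True, "flagged": False},
    {"typology": "benign lookalike", "is_high_risk": False, "flagged": False},
]
report = high_risk_recall(challenge_set)
print(report, passes_gate(report))
```

Reporting by typology keeps a single missed mule-activity case visible even when overall recall looks high.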
3. KYC document extraction
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| Specialist OCR/extractor champion; LLM challenger for exception explanation and missing-document summary | field accuracy, confidence calibration, layout robustness, exception routing | document type and expiry date accuracy above approved threshold; low-confidence routed to human |
Evidence:
- document pack with passports, IDs, utility bills, bank statements, low-quality scans and non-English layouts.
- field-level confusion matrix.
- exception evidence for missing address, expired document and name mismatch.
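Routing low-confidence fields to a human can be sketched as a simple split on extractor confidence. The 0.90 threshold, the function name `route_fields` and the input shape (field name mapped to a value and a confidence) are assumptions; the approved threshold would come from the benchmark pack and calibration evidence.

```python
def route_fields(extracted: dict[str, tuple[str, float]],
                 threshold: float = 0.90) -> dict[str, dict[str, str]]:
    """Split extracted fields into auto-accepted and human-review buckets."""
    auto_accept: dict[str, str] = {}
    human_review: dict[str, str] = {}
    for name, (value, confidence) in extracted.items():
        if confidence >= threshold:
            auto_accept[name] = value
        else:
            human_review[name] = value
    return {"auto_accept": auto_accept, "human_review": human_review}


routed = route_fields({
    "document_type": ("passport", 0.99),
    "expiry_date": ("2031-04-02", 0.62),
})
print(routed)
```

A split like this is only as good as the confidence calibration behind it, which is why calibration sits in the scorecard emphasis for this scenario.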
4. Credit policy RAG
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| RAG plus large model for underwriter policy support; no autonomous credit approval or adverse action | citation support, effective date, jurisdiction, protected-class boundary | no unsupported decline reason; every policy statement cites approved source |
Evidence:
- policy question pack across product, state, effective date and exception rules.
- stale policy and conflicting source challenge set.
- model selection record that separates advice support from credit decisioning.
5. Payments fraud intervention
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| Real-time fraud model remains champion for scoring; LLM assists intervention script and case summary | latency, false negative severity, customer harm, script compliance | intervention script cannot encourage unsafe action or reveal detection rules |
Evidence:
- fraud typology pack for APP scam, account takeover, mule transfer, false-positive customer friction.
- p95 latency test for operational queue.
- red-team tests for social engineering and disclosure of fraud controls.
6. Enterprise knowledge assistant
| Portfolio decision | Scorecard emphasis | Example threshold |
|---|---|---|
| open-weight or managed model depending on data residency; embedding/reranker benchmarked separately | entitlement, source freshness, hallucination, knowledge coverage | no restricted document leakage across role boundaries |
Evidence:
- knowledge coverage map by HR, operations, product, risk, technology and policy domains.
- entitlement test with users from branch, contact center, risk and engineering.
- benchmark that separates retriever failure from generator failure.
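Separating retriever failure from generator failure needs, per case, the set of sources that should have been retrieved. The sketch below assumes gold source ids are labeled in the benchmark pack; the function name and the four outcome labels are hypothetical.

```python
def attribute_failure(gold_source_ids: set[str], retrieved_ids: list[str],
                      answer_correct: bool) -> str:
    """Classify one RAG benchmark case by where it succeeded or failed."""
    retrieved_gold = bool(gold_source_ids & set(retrieved_ids))
    if answer_correct and retrieved_gold:
        return "pass"
    if answer_correct:
        # Right answer without the supporting source: not grounded evidence.
        return "unsupported_pass"
    if not retrieved_gold:
        return "retriever_failure"
    return "generator_failure"


print(attribute_failure({"HR-12"}, ["OPS-3", "OPS-9"], answer_correct=False))
print(attribute_failure({"HR-12"}, ["HR-12", "OPS-9"], answer_correct=False))
```

Counting these labels across the pack shows whether the embedding/reranker or the generator should get the next challenger run.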
Metrics/control/evidence model
Metrics
| Metric class | Examples | Decision use |
|---|---|---|
| Capability | pass rate, extraction F1, citation support, narrative completeness, triage precision/recall | determine task fitness |
| Domain | policy compliance, typology coverage, effective-date correctness, jurisdictional nuance | approve business scope |
| Safety | critical failure rate, jailbreak violation, PII leakage, unsafe advice, over-refusal | hard gate for risk tiers |
| Operational | p50/p95 latency, timeout, availability, rate limit, fallback success | decide workflow fit |
| Human system | reviewer agreement, override rate, review time, escalation quality, automation bias signal | validate HITL design |
| Governance | model card completeness, trace completeness, version reproducibility, approval freshness | audit readiness |
| Portfolio | model concentration, open/proprietary mix, challenger freshness, retirement backlog age | management oversight |
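The portfolio row names model concentration without fixing a formula. One possible measure, shown here as an assumption rather than a prescribed metric, is each provider's share of champion assignments together with a Herfindahl-Hirschman style sum of squared shares.

```python
from collections import Counter


def provider_concentration(champions: dict[str, str]) -> dict:
    """champions maps use case -> provider of its current champion."""
    if not champions:
        raise ValueError("no champion assignments to measure")
    counts = Counter(champions.values())
    total = sum(counts.values())
    shares = {provider: n / total for provider, n in counts.items()}
    return {
        "shares": shares,
        # Sum of squared shares: 1.0 means a single provider serves everything.
        "hhi": sum(s * s for s in shares.values()),
        "largest": max(shares, key=shares.get),
    }


snapshot = provider_concentration({
    "contact center": "provider_a",
    "credit policy RAG": "provider_a",
    "knowledge assistant": "provider_a",
    "KYC extraction": "provider_b",
})
print(snapshot)
```

In this example three of four champions sit with one provider, giving shares of 0.75 and 0.25 and a concentration value of 0.625.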
Controls
| Control | Purpose | Evidence |
|---|---|---|
| Approved-use boundary | Prevent model reuse beyond evaluated scope | model card, selection record, API policy |
| Benchmark gate | Stop promotion without task evidence | benchmark manifest, run report |
| Red-team gate | Stop release with unacceptable behavior | adversarial run, issue log |
| Domain SME review | Keep scores tied to policy and workflow reality | reviewer log, rubric decisions |
| Version lock and change impact | Prevent silent behavior changes | model version, prompt/version registry, supplier notice |
| Human oversight control | Prevent AI from becoming hidden decision-maker | reviewer action logs, override samples |
| Evidence retention | Support audit and model risk review | evidence packet, GRC record |
| Retirement trigger | Remove models when evidence no longer supports use | watchlist record, decommission decision |
Evidence packet
model card
+ approved-use boundary
+ benchmark pack manifest
+ run configuration
+ scorecard
+ slice and failure analysis
+ red-team report
+ latency/cost/security results
+ model selection record
+ exception or risk acceptance
+ monitoring and retirement triggers
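The packet contents above lend themselves to a completeness check before a decision is recorded. A minimal sketch, where the snake_case keys and the idea of storing a reference string per item are assumptions:

```python
# Item names follow the evidence packet list; key spelling is an assumption.
REQUIRED_EVIDENCE = (
    "model_card",
    "approved_use_boundary",
    "benchmark_pack_manifest",
    "run_configuration",
    "scorecard",
    "slice_and_failure_analysis",
    "red_team_report",
    "latency_cost_security_results",
    "model_selection_record",
    "exception_or_risk_acceptance",
    "monitoring_and_retirement_triggers",
)


def missing_evidence(packet: dict[str, str]) -> list[str]:
    """Return required items that are absent or have an empty reference."""
    return [item for item in REQUIRED_EVIDENCE if not packet.get(item)]


packet = {item: f"evidence://{item}" for item in REQUIRED_EVIDENCE}
packet["red_team_report"] = ""
print(missing_evidence(packet))
```

When no exception applies, recording an explicit value such as "none" for `exception_or_risk_acceptance` keeps the check from treating it as missing.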
Anti-patterns and failure modes
| Anti-pattern | Why it fails | Better architecture |
|---|---|---|
| Leaderboard-driven selection | Public benchmark does not represent workflow, data, risk and controls | use public benchmark as weak prior, then run task benchmark |
| One model for everything | Ignores task/risk differences and creates concentration risk | portfolio by capability, approved use and risk tier |
| Average score hides critical failure | High average can coexist with one unacceptable AML, KYC or credit failure | hard blockers and severity-weighted scoring |
| No challenger cadence | Champion becomes stale while model market and policy change | quarterly and trigger-based champion/challenger review |
| Model card as static document | Model behavior, provider terms and business policy change | model card with version, evidence and review expiry |
| Open model treated as automatically safer | Hosting control does not solve patching, safety, license, eval or ops | open-weight governance pack and patch lifecycle |
| Proprietary model treated as unknowable | Black-box does not excuse missing evidence | require trace, version boundary, change notice and task eval |
| Judge model trusted blindly | Evaluator bias and instability corrupt scorecard | calibrate judge with human audit and versioned rubric |
| Retirement never happens | Legacy models persist after risk, cost, policy or supplier evidence changes | explicit retirement triggers and owner accountability |
| Model selection council becomes ceremony | Decision board approves without evidence or conditions | decision record, conditions, expiry and post-release monitoring |
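For the judge-model row, calibration against a human audit sample can be quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. Choosing kappa is an assumption here; the note asks for calibration but does not name a statistic.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_labels, human_labels))
```

In this sample the raters agree on 3 of 4 cases, but chance agreement is 0.5, so kappa is 0.5; raw agreement alone would overstate how well the judge tracks human review.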
Architecture mapping to RAG / Agent / Copilot / Eval / Governance
| Architecture area | Model portfolio governance question | Example design decision |
|---|---|---|
| RAG | Which embedding model, reranker and generator are approved for this corpus and user entitlement model? | Credit policy RAG uses approved embedding v3, reranker challenger under test, generator constrained to cited answers |
| Agent | Which model family can plan or call tools, and under what human approval boundary? | Payments fraud assistant may draft intervention script but cannot block account without rules engine and human approval |
| Copilot | Which model can draft, summarize or recommend inside a human workflow? | AML copilot drafts narrative only after analyst-selected evidence; final disposition remains human |
| Eval | Which benchmark packs prove task, domain, safety and operational fitness? | KYC extraction pack separates OCR field accuracy from LLM exception summary quality |
| Governance | Who approves model use, exception, challenger promotion and retirement? | Model selection council approves champion scope and review expiry; model risk can require independent challenge |
This architecture also clarifies component-level selection. A single use case can have separate champions for:
- generation model.
- embedding model.
- reranker.
- extractor.
- classifier.
- safety model.
- evaluator/judge model.
- fallback model.
The governance object is not “the chatbot model”; it is the AI system model portfolio that produces behavior in a specific workflow.
ADR draft
Title: Establish AI model portfolio benchmarking and selection governance
Date: 2026-06-30
Status: Proposed
Context:
Financial retail AI systems use multiple model families across contact center, AML, KYC, credit, payments and enterprise knowledge workflows. Current model decisions are fragmented across product teams, vendor evaluations, platform experiments and project-specific benchmarks. Public benchmarks and one-off pilots do not provide sufficient evidence for regulated workflow decisions.
Decision:
Create a governed model portfolio architecture with:
1. model portfolio inventory and model cards;
2. AI task capability taxonomy;
3. benchmark packs by use case and risk tier;
4. weighted capability scorecards with hard blockers;
5. champion/challenger lifecycle;
6. model selection council decision records;
7. retirement triggers and evidence packets.
Approved-use decisions will be made at the task/use-case boundary, not at the generic model brand level. High-risk workflows require domain benchmark, red-team score, trace evidence, human oversight fit and explicit retirement triggers.
Alternatives considered:
1. Let each product team choose models independently. Rejected because evidence is not comparable and auditability is weak.
2. Choose one enterprise-standard model. Rejected because task, risk, latency, cost, data residency and control needs differ.
3. Use public leaderboards as primary selection evidence. Rejected because they do not represent financial retail context of use.
Consequences:
Positive:
- Decisions become comparable, auditable and reusable across use cases.
- Model changes can be assessed through challenger runs rather than preference debates.
- Product, architecture, risk and audit share the same evidence language.
Tradeoffs:
- Benchmark packs and model cards require ongoing ownership.
- Some fast experiments will be slowed by evidence gates.
- Scorecards must be maintained as models, policies and business workflows change.
Review triggers:
- new model family or provider enters portfolio;
- champion shows material regression or critical failure;
- business policy, regulation, data boundary or workflow changes;
- supplier changes model version, logging, retention or service behavior;
- quarterly portfolio review.
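Interview answer: 30-second, 2-minute and CTO versions
30 seconds
I would not let a single leaderboard score decide enterprise model selection. I would set up model portfolio governance: first define a capability taxonomy by business task and risk tier, then build a benchmark pack and scorecard for each use case that evaluate quality, domain fit, red-team results, security, latency, cost, auditability and human oversight fit. Finally, a model selection council decides the champion, challenger, constraints and retirement triggers, and leaves behind a model card, run result and selection record.
2 minutes
In financial retail, contact center, AML, KYC, credit policy RAG, payments fraud and enterprise knowledge assistant use cases place very different demands on a model. KYC may weigh extraction accuracy and low-confidence routing more heavily; AML weighs false negatives and typology coverage; credit policy RAG weighs citation support and the ban on unsupported adverse action; contact center weighs low latency, scripting and escalation boundaries.
So I would run model governance as continuous portfolio management. The first layer is the model portfolio inventory, which records model family, version, provider mode, data boundary, approved use and prohibited use. The second layer is the capability taxonomy, which breaks tasks into extraction, classification, retrieval-grounded answering, summarization, recommendation, safety, domain policy, operability and governance. The third layer is the benchmark pack, which compares champion and challenger on real and synthetic cases, domain edge cases, red-team cases and operational tests. The fourth layer is the scorecard, which uses weights and hard blockers so that an average score cannot mask a high-risk failure. The fifth layer is the decision record, which captures why a model was approved, constrained, rejected or retired.
That turns model selection from “which model is best” into “within this business task and risk boundary, which model has enough evidence to be approved for use”.
CTO version
I would place model portfolio governance between the AI platform control plane and the AI governance operating model. The platform owns the model gateway, configuration versioning, eval runner, traces, policy enforcement, fallback and monitoring; governance owns the capability taxonomy, risk-tiered benchmarks, scorecards, the selection council, exceptions and retirement. The key design choice is that approved use is bound to task, workflow, risk tier, data boundary and benchmark evidence, not to a model brand.
For a CTO, this addresses four architecture risks. First, it avoids every team repeating model comparisons whose evidence cannot be reused. Second, it avoids silent regression when a supplier or model version changes. Third, it avoids a one-size-fits-all model standard that sacrifices latency, cost, security or domain fit. Fourth, it retains traceable evidence for audit, model risk and production incident review. The final output is not a “best model list” but a runnable champion/challenger portfolio that keeps absorbing new models while retiring, in time, those that no longer fit.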
7-day practice plan
| Day | Practice | Output |
|---|---|---|
| Day 1 | Pick 3 financial retail use cases: contact center, KYC, credit policy RAG. Break out the AI task taxonomy. | task taxonomy table |
| Day 2 | Write the model portfolio inventory fields and approved/prohibited use for each use case. | 3 model card sketches |
| Day 3 | Design the capability scorecard, with explicit hard blockers and weights. | scorecard v1 |
| Day 4 | Write a benchmark pack for AML triage or payments fraud: domain set, red-team set, operational test. | benchmark pack outline |
| Day 5 | Simulate a champion/challenger decision across frontier proprietary, small, open-weight and domain-tuned models. | model selection record |
| Day 6 | Write the retirement triggers and evidence packet: policy change, safety regression, supplier change, cost/latency misfit. | retirement and evidence template |
| Day 7 | Rehearse the interview answer in the 30-second, 2-minute and CTO versions, and turn one scenario into a portfolio page. | interview script + portfolio artifact |
Source anchors with links
| Source | Link | How this note uses it |
|---|---|---|
| Stanford HELM latest | https://crfm.stanford.edu/helm/latest/ | Holistic and living benchmark mindset for multi-scenario, multi-metric model evaluation |
| HELM paper | https://arxiv.org/abs/2211.09110 | Multi-metric evaluation idea: accuracy, calibration, robustness, fairness, bias, toxicity and efficiency are decision inputs, not one score |
| MLCommons AI Safety / AILuminate | https://mlcommons.org/benchmarks/ai-safety/ | Safety benchmark and system-under-test framing; useful for red-team score and safety hard blockers |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Govern / Map / Measure / Manage language for risk-based AI governance |
| NIST AI RMF resources and TEVV anchors | https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources | Links model portfolio evidence to testing, evaluation, verification, validation, GenAI profile and AI RMF playbook resources |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | AI management system anchor for policies, objectives, operating controls, performance evaluation and continual improvement |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | AI risk management guidance anchor for integrating risk management into AI-related activities and functions |