AI Model Portfolio Benchmarking: Model Portfolio Evaluation and Selection Governance Architecture

Date: 2026-06-30

486ai-foundations/papers/168-ai-model-portfolio-benchmarking-capability-scorecard-selection-governance-architecture.md

AI Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Architecture

Date: 2026-06-30
Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note for website-visible portfolio, model selection council design, audit evidence and interview discussion


Why model portfolio governance matters

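A financial retail enterprise will not rely on a single AI model for long. In practice, a frontier model, small model, embedding model, reranker, domain-tuned classifier, document extraction model, speech model, judge model, open-weight model, managed proprietary model and legacy ML model usually coexist. The question is no longer "which model is best" but:

For each business task, risk tier, data boundary, latency constraint, cost constraint, security requirement and audit requirement, which model family is approved for use under current evidence, when is a challenger needed, and when must a model be retired?

This is a portfolio governance problem, not a one-off benchmark or vendor demo.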

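| Governance question | Architecture implication | Financial retail consequence |
| --- | --- | --- |
| How model capabilities are classified | A unified capability taxonomy is needed so that each team does not apply its own definition of "works well" | Otherwise customer service, AML, KYC, credit and payments teams cannot compare evidence |
| How scores enter decisions | The scorecard must bind to task, threshold, risk tier and evidence rather than rank by average score | High-risk scenarios must not be masked by a generic leaderboard or by low-risk FAQ scores |
| How model families are managed | Proprietary, open-weight, fine-tuned, small, specialist and judge models each have a different control surface | Avoids supplier concentration, unauditable black boxes and open weights without a patch process |
| How challengers run | A new model must be compared on the same benchmark pack, the same rubric and the same cost/latency measurement | Keeps the slogan "the new model is stronger" from bypassing the release gate |
| When to retire | Triggers are needed for drift, incidents, policy change, runaway cost, safety regression and supplier risk | Keeps old models from continuing to serve regulated workflows |
| How audit reviews decisions | Every selection must leave a model card, scorecard, benchmark run, decision record and exception | Supports model risk, internal audit, regulators, product retrospectives and management accountability |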

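Scope note: this note does not cover model routing strategy design, procurement sandboxes, board-level investment narratives or AI FinOps cost management. It focuses on ongoing model portfolio governance: capability classification, benchmarking, capability scorecards, the selection council, champion/challenger, risk tiering, retirement triggers and the evidence chain.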


Concept diagram

flowchart TB
  A[Business capability map] --> B[AI task taxonomy]
  B --> C[Model portfolio inventory]
  C --> D[Model cards and approved-use boundaries]

  B --> E[Benchmark packs]
  E --> E1[Quality and task metrics]
  E --> E2[Safety and red-team tests]
  E --> E3[Domain and policy tests]
  E --> E4[Cost, latency and resilience tests]

  D --> F[Capability scorecard]
  E1 --> F
  E2 --> F
  E3 --> F
  E4 --> F

  F --> G{Model selection council}
  G -->|Approve| H[Champion model for approved use]
  G -->|Constrain| I[Conditional use with controls]
  G -->|Reject| J[Do not use or return to lab]
  G -->|Challenge| K[Challenger backlog]

  H --> L[Release and production monitoring]
  I --> L
  L --> M[Evidence packet]
  L --> N[Drift, incident and policy-change signals]
  N --> K
  N --> O[Retirement trigger]
  O --> P[Replacement / fallback / decommission]

  M --> Q[Audit, model risk, product decision record]
  Q --> C

Core idea:

model portfolio governance
  = inventory + capability taxonomy + benchmark pack + scorecard
  + selection council + champion/challenger + retirement trigger
  + auditable evidence

Core architecture model

1. Portfolio layers

| Layer | Responsibility | Key objects | Typical owner |
| --- | --- | --- | --- |
| Business capability layer | State which business capability and process step the model serves | contact servicing, KYC onboarding, AML triage, credit policy support, fraud intervention | business owner / AI PM / BA |
| AI task layer | Break business capabilities into evaluable tasks | classify, extract, summarize, retrieve-answer, draft, detect anomaly, recommend escalation | PM / BA / AI architect |
| Model inventory layer | Record all candidate and approved models | model family, provider, version, deployment mode, data route, approved use, restriction | AI platform / governance |
| Benchmark layer | Prepare reproducible evaluation packs for each task | dataset version, rubric, red-team set, domain set, run config, judge config | EvalOps / SME / model risk |
| Scorecard layer | Turn multi-dimensional results into selection evidence | task score, domain score, red-team score, latency/cost/security score, confidence | AI PM / architect / governance |
| Decision layer | Approve, constrain, challenge, retire and handle exceptions | model selection record, risk acceptance, conditions, expiry, fallback | model selection council |
| Evidence layer | Retain auditable evidence | model card, benchmark manifest, scorecard, traces, approval, exception, retirement record | release / GRC / audit |

2. Model portfolio inventory as a governed asset

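The model inventory is not a spreadsheet owned by the technical team. It must support joint judgment by business, architecture, risk, audit and operations.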

| Field | Description | Example |
| --- | --- | --- |
| model_id | Unique enterprise-internal identifier | LLM-GP-PROP-A-2026-06 |
| model_family | general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety model | general LLM |
| provider_mode | proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor application component | proprietary API |
| version_boundary | model version, snapshot, API date, fine-tune id, guardrail version | 2026-06 snapshot, guardrail v4 |
| approved_use | permitted tasks, scenarios and risk tiers | contact center agent assist, draft only |
| prohibited_use | direct decisioning or customer-visible output that is not allowed | no autonomous credit decision, no SAR conclusion |
| data_boundary | PII, PCI, transactions, documents, voice, logs, cross-border transfer, training use, retention | no provider training, EU data route, 30-day logs |
| benchmark_status | last run, benchmark pack, pass/fail, open gaps | passed CONTACT-RAG-v2026.06 with 2 accepted gaps |
| operational_slo | latency, availability, rate limit, fallback | p95 < 4s for agent assist |
| control_status | DLP, prompt injection, access, logging, red-team, human review | DLP and trace export approved |
| lifecycle_status | candidate / challenger / champion / constrained / watchlisted / retired | champion |
| retirement_trigger | triggers that invalidate current approval | critical safety regression, policy citation fail, supplier exit |
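
One way to make the inventory enforceable rather than descriptive is to treat each row as a typed record and check requested uses against it. The following is a minimal Python sketch under stated assumptions: the `ModelRecord` class, the `is_use_allowed` helper, the use-case identifiers and the rule that only `champion` and `constrained` models may serve production are illustrative and are not part of the inventory definition above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRecord:
    """One inventory row, reduced to the fields needed for a use check."""
    model_id: str
    model_family: str
    provider_mode: str
    lifecycle_status: str
    approved_use: frozenset = field(default_factory=frozenset)
    prohibited_use: frozenset = field(default_factory=frozenset)


# Assumption: only these lifecycle states may serve production traffic.
PRODUCTION_STATES = {"champion", "constrained"}


def is_use_allowed(record: ModelRecord, use_case: str) -> bool:
    """Allow a use case only if it is explicitly approved, not prohibited,
    and the model is in a production-eligible lifecycle state."""
    if use_case in record.prohibited_use:
        return False
    if record.lifecycle_status not in PRODUCTION_STATES:
        return False
    return use_case in record.approved_use


# Hypothetical use-case identifiers derived from the example column.
record = ModelRecord(
    model_id="LLM-GP-PROP-A-2026-06",
    model_family="general LLM",
    provider_mode="proprietary API",
    lifecycle_status="champion",
    approved_use=frozenset({"contact_center_agent_assist_draft"}),
    prohibited_use=frozenset({"autonomous_credit_decision", "sar_conclusion"}),
)

print(is_use_allowed(record, "contact_center_agent_assist_draft"))  # True
print(is_use_allowed(record, "autonomous_credit_decision"))         # False
```

The check is deny-by-default: a use case that appears in neither list is rejected, which matches the idea that approval is tied to evaluated scope.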

3. Capability scorecard as decision interface

The scorecard is not meant to replace judgment; it puts the judgments of different roles into one evidence table.

public benchmark tells us generic capability.
domain benchmark tells us task fit.
red-team benchmark tells us unacceptable behavior.
architecture score tells us operability.
governance score tells us whether evidence can survive audit.
selection council turns all of that into approved-use boundaries.

Capability taxonomy and scorecard

Capability taxonomy

| Capability family | What to evaluate | Financial retail examples |
| --- | --- | --- |
| Language and reasoning | instruction following, multi-step reasoning, ambiguity handling, calibration | contact center policy explanation, credit memo critique |
| Retrieval-grounded answer | source recall, citation support, stale-source handling, entitlement respect | credit policy RAG, enterprise knowledge assistant |
| Information extraction | field accuracy, table understanding, layout robustness, confidence and exception detection | KYC document extraction, income proof review |
| Summarization and narrative | completeness, material omission, tone, traceability to evidence | AML case narrative draft, complaint root cause summary |
| Classification and triage | precision/recall by severity, false negative control, threshold behavior | AML alert triage, payment fraud queue priority |
| Action recommendation | policy compliance, escalation quality, human approval fit | payments fraud intervention, collections hardship next action |
| Safety and security | prompt injection, PII leakage, unsafe advice, tool misuse, jailbreak resistance | customer-facing chatbot, analyst copilot, internal RAG |
| Domain and policy | product rules, regulatory boundaries, jurisdictional nuance, effective dates | credit policy, KYC policy, AML typology, Reg E dispute rules |
| Operability | latency, availability, trace export, rate limit, fallback, observability | contact center p95 latency, fraud real-time queue |
| Governance readiness | model card quality, version control, audit evidence, change notice, retirement support | all regulated use cases |

Scorecard dimensions

Score each dimension 1-5, but apply hard blockers for high-risk use cases. A strong average cannot compensate for a critical failure.

| Dimension | Weight | Score 1 | Score 3 | Score 5 | Evidence |
| --- | --- | --- | --- | --- | --- |
| Task quality | 14 | fails common cases | acceptable average | strong pass rate with slice stability | benchmark results and SME review |
| Domain score | 14 | generic language only | handles common policy | handles edge cases, effective dates and product nuance | domain benchmark pack |
| Red-team / safety score | 14 | critical failures | mitigations partial | no critical failures in approved set | adversarial run, safety report |
| Grounding and citation | 10 | unsupported claims | citations sometimes weak | claims trace to allowed sources | RAG/citation eval |
| Robustness | 8 | brittle to wording/noise | stable on common variants | stable across language, channel, missing evidence and ambiguity | mutation tests |
| Security and data boundary | 10 | unclear logs/training/access | basic controls | enforceable data route, DLP, IAM, audit export | security review |
| Latency and reliability | 8 | unusable p95 or rate limits | acceptable with fallback | meets workflow SLO under stress | load and resilience test |
| Cost fitness | 6 | unit cost blocks scale | usable for limited scope | cost fits approved use and fallback policy | unit cost sheet |
| Human oversight fit | 6 | encourages overtrust | review possible | supports review, escalation and override evidence | workflow simulation |
| Governance readiness | 10 | no version/evidence | partial records | model card, run manifest, decision record, retirement support | evidence packet |

Hard blockers:

  • Any critical customer harm, privacy, security, regulated advice or unauthorized action failure in approved high-risk scope.
  • No reproducible benchmark run for the task.
  • No model/version boundary or provider configuration boundary.
  • No trace export for regulated workflows requiring review.
  • Model behavior materially changed without change notice or re-benchmark.
  • Open model cannot be patched, scanned, hosted or access-controlled to enterprise policy.
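
The interaction between weights and hard blockers can be illustrated with a small calculation. The sketch below copies the weights from the scorecard dimensions, computes a weighted mean on the 1-5 scale, and applies a separate gate. The dimension keys, the `evaluate` function and the rule that any open hard blocker fails a high-risk use case are assumptions made for illustration, not a prescribed implementation.

```python
# Weights copied from the scorecard dimensions; they sum to 100.
WEIGHTS = {
    "task_quality": 14,
    "domain": 14,
    "red_team_safety": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def weighted_score(scores: dict) -> float:
    """Weighted mean of 1-5 dimension scores, on the same 1-5 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard dimensions")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight


def evaluate(scores: dict, hard_blockers: list, high_risk: bool) -> dict:
    """Return the weighted score plus a gate result.

    Assumption: any open hard blocker fails a high-risk use case,
    regardless of how high the weighted score is.
    """
    blocked = high_risk and len(hard_blockers) > 0
    return {
        "score": round(weighted_score(scores), 2),
        "blocked": blocked,
        "blockers": list(hard_blockers),
    }


scores = {d: 5 for d in WEIGHTS}
scores["red_team_safety"] = 1  # one critical failure among otherwise top scores
result = evaluate(scores, ["critical privacy failure in approved scope"], high_risk=True)
print(result["score"], result["blocked"])  # 4.44 True
```

A model that scores 5 everywhere except red-team safety still averages 4.44, which is why the gate is evaluated separately from the average.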

Model family comparison

| Model family | Strength | Weakness | Good fit | Governance emphasis |
| --- | --- | --- | --- | --- |
| Frontier proprietary | strong reasoning, language, tool use | cost, latency, data route, black-box change risk | complex contact center, credit policy RAG, knowledge assistant | version boundary, logs, supplier change notice, red-team |
| Small proprietary | fast and cheaper | weaker long reasoning and edge cases | high-volume FAQ, simple classification, draft suggestions | task boundary, escalation, challenger monitoring |
| Open-weight general | deployment control, inspectable hosting choices | ops burden, safety patching, weaker managed controls | internal knowledge assistant, constrained extraction, sovereign data | hosting, patch cadence, safety layer, license review |
| Domain-tuned model | better terminology and stable task behavior | narrow scope, data/version governance | AML typology classification, KYC extraction, fraud intervention | training data lineage, drift, revalidation |
| Specialist extractor/OCR | document/layout accuracy | less flexible reasoning | KYC document extraction, income proof extraction | field-level accuracy, exception routing, confidence calibration |
| Embedding/reranker | retrieval quality and entitlement | invisible failure if not evaluated | RAG for policy and knowledge assistant | source recall, access filtering, index/version |
| Judge/evaluator model | scalable rubric support | bias, instability, circular evaluation | regression triage, large eval runs | judge calibration, human audit sample, version control |

Benchmark and challenger lifecycle

Lifecycle states

| State | Meaning | Allowed decisions |
| --- | --- | --- |
| Candidate | model has been proposed or discovered but not approved | lab testing only |
| Baseline | current comparator or no-AI process | compare, keep as fallback |
| Challenger | model is tested against champion for a defined use case | no production use unless separately approved |
| Champion | model approved for specific use boundary | release with controls |
| Constrained champion | approved only for limited channel, segment, risk tier or human-review mode | pilot or limited production |
| Watchlisted | production signals, supplier changes or benchmark regressions require review | freeze expansion, run additional benchmark |
| Retired | no new use, replaced or decommissioned | archive evidence, keep historical records |
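
The lifecycle states lend themselves to a small state machine in which every move requires a decision record. The sketch below is illustrative only: the note defines the states but not the permitted edges, so the `ALLOWED_TRANSITIONS` map, the `transition` function and the `MSR-...` record ids are assumptions. The state names follow the `lifecycle_status` values in the inventory, and `baseline` is left out because it is a comparator rather than a promotable state.

```python
# Assumed transition map; the note defines the states, not the edges.
ALLOWED_TRANSITIONS = {
    "candidate": {"challenger", "retired"},
    "challenger": {"champion", "constrained", "retired"},
    "champion": {"constrained", "watchlisted", "retired"},
    "constrained": {"champion", "watchlisted", "retired"},
    "watchlisted": {"champion", "constrained", "retired"},
    "retired": set(),  # terminal: no new use
}


def transition(current: str, target: str, decision_record_id: str) -> str:
    """Move a model to a new lifecycle state if the edge is allowed.

    A decision record id is required so every state change stays auditable.
    """
    if not decision_record_id:
        raise ValueError("a selection council decision record is required")
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown lifecycle state: {current}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target


state = transition("candidate", "challenger", "MSR-0001")
state = transition(state, "champion", "MSR-0002")
print(state)  # champion
```

Under this assumed map a candidate cannot jump straight to champion, and a retired model cannot be revived without re-entering as a new candidate record.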

Benchmark pack design

| Pack element | Required content | Example |
| --- | --- | --- |
| Task definition | AI role, input, expected output, unacceptable output | credit policy RAG answers policy questions with citations, no credit decision |
| Dataset manifest | source, version, hash, slice coverage, privacy class | 420 cases, English/Spanish, policy v2026.05 |
| Rubric | scoring dimensions, severity, thresholds | groundedness 0-5, critical fail if unsupported adverse action reason |
| Domain set | business policy, product nuance, jurisdiction, effective date | KYC address proof exceptions, AML typology ambiguity |
| Red-team set | prompt injection, PII, unsafe advice, tool misuse, jailbreak | customer asks agent to reveal another account |
| Operational test | p50/p95 latency, timeout, rate limit, failover | contact center p95 under 800 concurrent agents |
| Security test | access control, data retention, logging, DLP | restricted HR policy not retrievable by branch user |
| Run protocol | model version, prompt version, temperature, repeats, judge version | fixed prompt v12, 3 repeated runs on unstable cases |
| Evidence output | traces, scorecard, failure taxonomy, decision memo | evidence binder object id |
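
The dataset manifest row asks for a version and a hash so that a benchmark run can be reproduced and audited later. A minimal sketch of how such a manifest might be built with the Python standard library follows. The `manifest_hash` and `build_manifest` helpers, the field names and the sample cases are hypothetical; the point is only that the hash is computed over a canonical encoding so that it changes when case content changes and not when key order does.

```python
import hashlib
import json


def manifest_hash(cases: list) -> str:
    """SHA-256 over a canonical JSON encoding of the benchmark cases.

    Keys are sorted so dict key order does not affect the hash;
    the order of cases in the list still does.
    """
    canonical = json.dumps(cases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(pack_id: str, policy_version: str, privacy_class: str, cases: list) -> dict:
    """Assemble the manifest fields named in the pack design above."""
    return {
        "pack_id": pack_id,
        "policy_version": policy_version,
        "privacy_class": privacy_class,
        "case_count": len(cases),
        "slices": sorted({c["slice"] for c in cases}),
        "sha256": manifest_hash(cases),
    }


# Hypothetical sample cases for illustration only.
cases = [
    {"id": "c1", "slice": "en", "question": "Which fee waiver rule applies?"},
    {"id": "c2", "slice": "es", "question": "Que regla de exencion aplica?"},
]
manifest = build_manifest("CREDIT-RAG-DEMO", "v2026.05", "internal", cases)
print(manifest["case_count"], manifest["slices"])  # 2 ['en', 'es']
```

Storing the hash next to the run configuration lets a reviewer confirm that champion and challenger were scored on the same cases.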

Champion/challenger cadence

| Trigger | Action | Decision path |
| --- | --- | --- |
| New model version available | run benchmark pack against challenger | approve, reject, keep challenger, constrain |
| Production complaints or overrides increase | mine failures into regression set and re-run champion | watchlist or remediate |
| Business policy changes | re-run impacted domain and RAG packs | keep, update prompt/RAG, suspend use |
| Red-team failure appears | run safety pack and incident review | freeze expansion, hotfix, retire if unresolved |
| Cost or latency becomes unfit | compare smaller or local challenger | constrained use or replacement |
| Supplier changes data/log/retention terms | architecture and governance review | suspend new use until evidence is updated |
| Open model patch or vulnerability | patch, scan, rerun critical packs | keep or retire |
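
A challenger comparison is only meaningful when both runs used the same benchmark pack and rubric. The sketch below shows one hedged way to encode that precondition before any score is compared. The `BenchmarkRun` fields, the `compare` function, the `min_gain` margin and the recommendation strings are assumptions for illustration; the output is a recommendation for the selection council, not a decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    model_id: str
    pack_id: str
    rubric_version: str
    pass_rate: float          # fraction of cases passed, 0.0 to 1.0
    critical_failures: int    # count of critical-severity failures


def compare(champion: BenchmarkRun, challenger: BenchmarkRun, min_gain: float = 0.02) -> str:
    """Recommend a path for the challenger relative to the champion."""
    same_conditions = (
        champion.pack_id == challenger.pack_id
        and champion.rubric_version == challenger.rubric_version
    )
    if not same_conditions:
        return "not comparable: re-run on the same pack and rubric"
    if challenger.critical_failures > 0:
        return "reject: challenger has critical failures"
    if challenger.pass_rate - champion.pass_rate >= min_gain:
        return "propose promotion"
    return "keep challenger"


champion = BenchmarkRun("champ-a", "CONTACT-RAG-v2026.06", "r3", 0.88, 0)
challenger = BenchmarkRun("chall-b", "CONTACT-RAG-v2026.06", "r3", 0.93, 0)
print(compare(champion, challenger))  # propose promotion
```

Checking critical failures before the pass-rate gain mirrors the hard-blocker idea: a higher average does not offset an unacceptable failure.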

Financial retail scenarios

1. Contact center agent assist

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| Frontier proprietary champion for complex policy questions; small model challenger for routine FAQ | grounding, latency, tone, vulnerable customer escalation, PII safety | no critical unsupported fee reversal promise; p95 under workflow SLO |

Evidence:

  • contact policy RAG pack with current and stale policy conflicts.
  • red-team set for customer pressure, prompt injection, and account privacy.
  • trace showing retrieved sources, model output, agent edits, escalation and final disposition.

2. AML triage and investigation narrative

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| Domain-tuned classifier for alert prioritization; LLM only drafts narrative after analyst evidence selection | false negative control, typology coverage, no final SAR conclusion | zero critical missed high-risk typology in challenge set |

Evidence:

  • AML typology benchmark by structuring, mule activity, funnel account, rapid movement and benign lookalikes.
  • narrative rubric for material omission and evidence citation.
  • human review log proving analyst retains final decision.

3. KYC document extraction

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| Specialist OCR/extractor champion; LLM challenger for exception explanation and missing-document summary | field accuracy, confidence calibration, layout robustness, exception routing | document type and expiry date accuracy above approved threshold; low-confidence routed to human |

Evidence:

  • document pack with passports, IDs, utility bills, bank statements, low-quality scans and non-English layouts.
  • field-level confusion matrix.
  • exception evidence for missing address, expired document and name mismatch.
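
The "low-confidence routed to human" threshold can be sketched as a simple per-field routing rule. This is an illustration under assumptions: the `route_extraction` function, the 0.90 threshold and the sample document are hypothetical, and the rule is only meaningful if the extractor's confidence values are calibrated, which is what the confidence calibration emphasis is meant to verify.

```python
def route_extraction(fields: dict, threshold: float = 0.90) -> dict:
    """Split extracted fields into auto-accepted and human-review queues.

    fields maps field name -> (value, confidence in [0, 1]).
    The threshold is a placeholder; an approved value would come from
    the benchmark pack, not from this sketch.
    """
    accepted, review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else review
        target[name] = value
    return {"accepted": accepted, "human_review": review}


# Hypothetical extraction output for one document.
doc = {
    "document_type": ("passport", 0.99),
    "expiry_date": ("2031-04-12", 0.72),
    "full_name": ("A. Example", 0.95),
}
routed = route_extraction(doc)
print(sorted(routed["human_review"]))  # ['expiry_date']
```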

4. Credit policy RAG

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| RAG plus large model for underwriter policy support; no autonomous credit approval or adverse action | citation support, effective date, jurisdiction, protected-class boundary | no unsupported decline reason; every policy statement cites approved source |

Evidence:

  • policy question pack across product, state, effective date and exception rules.
  • stale policy and conflicting source challenge set.
  • model selection record that separates advice support from credit decisioning.

5. Payments fraud intervention

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| Real-time fraud model remains champion for scoring; LLM assists intervention script and case summary | latency, false negative severity, customer harm, script compliance | intervention script cannot encourage unsafe action or reveal detection rules |

Evidence:

  • fraud typology pack for APP scam, account takeover, mule transfer, false-positive customer friction.
  • p95 latency test for operational queue.
  • red-team tests for social engineering and disclosure of fraud controls.

6. Enterprise knowledge assistant

| Portfolio decision | Scorecard emphasis | Example threshold |
| --- | --- | --- |
| Open-weight or managed model depending on data residency; embedding/reranker benchmarked separately | entitlement, source freshness, hallucination, knowledge coverage | no restricted document leakage across role boundaries |

Evidence:

  • knowledge coverage map by HR, operations, product, risk, technology and policy domains.
  • entitlement test with users from branch, contact center, risk and engineering.
  • benchmark that separates retriever failure from generator failure.
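
Separating retriever failure from generator failure can be approximated with a simple attribution rule per benchmark case. The sketch below is a deliberately coarse illustration: the `attribute_failure` function, the label names and the sample results are assumptions, and a correct answer produced without the gold source being retrieved is still counted as a pass here, which a stricter benchmark would flag separately.

```python
from collections import Counter


def attribute_failure(gold_source_ids, retrieved_ids, answer_correct: bool) -> str:
    """Label one case as pass, retriever_failure or generator_failure.

    If none of the gold sources were retrieved, the generator never had
    the evidence, so the failure is attributed to retrieval.
    """
    if answer_correct:
        return "pass"
    if set(gold_source_ids).isdisjoint(retrieved_ids):
        return "retriever_failure"
    return "generator_failure"


# Hypothetical results: (gold sources, retrieved sources, answer judged correct).
results = [
    (["p1"], ["p1", "p9"], True),
    (["p2"], ["p7"], False),
    (["p3"], ["p3"], False),
]
summary = Counter(attribute_failure(*r) for r in results)
print(dict(summary))  # {'pass': 1, 'retriever_failure': 1, 'generator_failure': 1}
```

Reporting the two failure types separately supports the decision to benchmark the embedding/reranker and the generator as separate portfolio components.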

Metrics/control/evidence model

Metrics

| Metric class | Examples | Decision use |
| --- | --- | --- |
| Capability | pass rate, extraction F1, citation support, narrative completeness, triage precision/recall | determine task fitness |
| Domain | policy compliance, typology coverage, effective-date correctness, jurisdictional nuance | approve business scope |
| Safety | critical failure rate, jailbreak violation, PII leakage, unsafe advice, over-refusal | hard gate for risk tiers |
| Operational | p50/p95 latency, timeout, availability, rate limit, fallback success | decide workflow fit |
| Human system | reviewer agreement, override rate, review time, escalation quality, automation bias signal | validate HITL design |
| Governance | model card completeness, trace completeness, version reproducibility, approval freshness | audit readiness |
| Portfolio | model concentration, open/proprietary mix, challenger freshness, retirement backlog age | management oversight |
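
Two of these metrics, p95 latency and triage recall by severity, are simple enough to compute directly, and spelling them out removes ambiguity about definitions. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions, and the function names and sample data are hypothetical.

```python
def percentile_nearest_rank(values: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def recall_by_severity(cases: list) -> dict:
    """Recall per severity level.

    cases holds (severity, actually_positive, predicted_positive) tuples.
    Only actual positives count, so false negatives lower recall.
    """
    hits, positives = {}, {}
    for severity, actual, predicted in cases:
        if not actual:
            continue
        positives[severity] = positives.get(severity, 0) + 1
        if predicted:
            hits[severity] = hits.get(severity, 0) + 1
    return {s: hits.get(s, 0) / n for s, n in positives.items()}


latencies_ms = list(range(100, 2100, 100))  # 20 samples: 100, 200, ..., 2000
print(percentile_nearest_rank(latencies_ms, 95))  # 1900

alerts = [
    ("high", True, True),
    ("high", True, False),   # missed high-severity alert
    ("low", True, True),
    ("low", False, True),    # false positive, ignored by recall
]
print(recall_by_severity(alerts))  # {'high': 0.5, 'low': 1.0}
```

Reporting recall per severity rather than overall keeps a missed high-severity alert from being averaged away by easy low-severity cases.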

Controls

| Control | Purpose | Evidence |
| --- | --- | --- |
| Approved-use boundary | Prevent model reuse beyond evaluated scope | model card, selection record, API policy |
| Benchmark gate | Stop promotion without task evidence | benchmark manifest, run report |
| Red-team gate | Stop release with unacceptable behavior | adversarial run, issue log |
| Domain SME review | Keep scores tied to policy and workflow reality | reviewer log, rubric decisions |
| Version lock and change impact | Prevent silent behavior changes | model version, prompt/version registry, supplier notice |
| Human oversight control | Prevent AI from becoming hidden decision-maker | reviewer action logs, override samples |
| Evidence retention | Support audit and model risk review | evidence packet, GRC record |
| Retirement trigger | Remove models when evidence no longer supports use | watchlist record, decommission decision |

Evidence packet

model card
  + approved-use boundary
  + benchmark pack manifest
  + run configuration
  + scorecard
  + slice and failure analysis
  + red-team report
  + latency/cost/security results
  + model selection record
  + exception or risk acceptance
  + monitoring and retirement triggers
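
A completeness check over the evidence packet is one way to turn this list into a release gate. The following sketch is illustrative: the artifact keys, the `missing_artifacts` helper and the `evidence://` references are hypothetical, and it assumes that an exception or risk acceptance exists only when one was granted, so that item is not treated as mandatory.

```python
# Keys mirror the evidence packet items above.
# Assumption: "exception or risk acceptance" is present only when granted,
# so it is intentionally not listed as required.
REQUIRED_ARTIFACTS = (
    "model_card",
    "approved_use_boundary",
    "benchmark_pack_manifest",
    "run_configuration",
    "scorecard",
    "slice_and_failure_analysis",
    "red_team_report",
    "latency_cost_security_results",
    "model_selection_record",
    "monitoring_and_retirement_triggers",
)


def missing_artifacts(packet: dict) -> list:
    """Return required artifact names that are absent or empty in the packet."""
    return [name for name in REQUIRED_ARTIFACTS if not packet.get(name)]


# Hypothetical packet with one artifact removed.
packet = {name: f"evidence://demo/{name}" for name in REQUIRED_ARTIFACTS}
del packet["red_team_report"]
print(missing_artifacts(packet))  # ['red_team_report']
```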

Anti-patterns and failure modes

| Anti-pattern | Why it fails | Better architecture |
| --- | --- | --- |
| Leaderboard-driven selection | Public benchmark does not represent workflow, data, risk and controls | use public benchmark as weak prior, then run task benchmark |
| One model for everything | Ignores task/risk differences and creates concentration risk | portfolio by capability, approved use and risk tier |
| Average score hides critical failure | High average can coexist with one unacceptable AML, KYC or credit failure | hard blockers and severity-weighted scoring |
| No challenger cadence | Champion becomes stale while model market and policy change | quarterly and trigger-based champion/challenger review |
| Model card as static document | Model behavior, provider terms and business policy change | model card with version, evidence and review expiry |
| Open model treated as automatically safer | Hosting control does not solve patching, safety, license, eval or ops | open-weight governance pack and patch lifecycle |
| Proprietary model treated as unknowable | Black-box does not excuse missing evidence | require trace, version boundary, change notice and task eval |
| Judge model trusted blindly | Evaluator bias and instability corrupt scorecard | calibrate judge with human audit and versioned rubric |
| Retirement never happens | Legacy models persist after risk, cost, policy or supplier evidence changes | explicit retirement triggers and owner accountability |
| Model selection council becomes ceremony | Decision board approves without evidence or conditions | decision record, conditions, expiry and post-release monitoring |

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

| Architecture area | Model portfolio governance question | Example design decision |
| --- | --- | --- |
| RAG | Which embedding model, reranker and generator are approved for this corpus and user entitlement model? | Credit policy RAG uses approved embedding v3, reranker challenger under test, generator constrained to cited answers |
| Agent | Which model family can plan or call tools, and under what human approval boundary? | Payments fraud assistant may draft intervention script but cannot block account without rules engine and human approval |
| Copilot | Which model can draft, summarize or recommend inside a human workflow? | AML copilot drafts narrative only after analyst-selected evidence; final disposition remains human |
| Eval | Which benchmark packs prove task, domain, safety and operational fitness? | KYC extraction pack separates OCR field accuracy from LLM exception summary quality |
| Governance | Who approves model use, exception, challenger promotion and retirement? | Model selection council approves champion scope and review expiry; model risk can require independent challenge |

This architecture also clarifies component-level selection. A single use case can have separate champions for:

  • generation model.
  • embedding model.
  • reranker.
  • extractor.
  • classifier.
  • safety model.
  • evaluator/judge model.
  • fallback model.

The governance object is not “the chatbot model”; it is the AI system model portfolio that produces behavior in a specific workflow.


ADR draft

Title: Establish AI model portfolio benchmarking and selection governance
Date: 2026-06-30
Status: Proposed

Context:
Financial retail AI systems use multiple model families across contact center, AML, KYC, credit, payments and enterprise knowledge workflows. Current model decisions are fragmented across product teams, vendor evaluations, platform experiments and project-specific benchmarks. Public benchmarks and one-off pilots do not provide sufficient evidence for regulated workflow decisions.

Decision:
Create a governed model portfolio architecture with:
1. model portfolio inventory and model cards;
2. AI task capability taxonomy;
3. benchmark packs by use case and risk tier;
4. weighted capability scorecards with hard blockers;
5. champion/challenger lifecycle;
6. model selection council decision records;
7. retirement triggers and evidence packets.

Approved-use decisions will be made at the task/use-case boundary, not at the generic model brand level. High-risk workflows require domain benchmark, red-team score, trace evidence, human oversight fit and explicit retirement triggers.

Alternatives considered:
1. Let each product team choose models independently. Rejected because evidence is not comparable and auditability is weak.
2. Choose one enterprise-standard model. Rejected because task, risk, latency, cost, data residency and control needs differ.
3. Use public leaderboards as primary selection evidence. Rejected because they do not represent financial retail context of use.

Consequences:
Positive:
- Decisions become comparable, auditable and reusable across use cases.
- Model changes can be assessed through challenger runs rather than preference debates.
- Product, architecture, risk and audit share the same evidence language.

Tradeoffs:
- Benchmark packs and model cards require ongoing ownership.
- Some fast experiments will be slowed by evidence gates.
- Scorecards must be maintained as models, policies and business workflows change.

Review triggers:
- new model family or provider enters portfolio;
- champion shows material regression or critical failure;
- business policy, regulation, data boundary or workflow changes;
- supplier changes model version, logging, retention or service behavior;
- quarterly portfolio review.

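Interview answer: 30 seconds, 2 minutes, CTO version

30 seconds

I would not use a single leaderboard score to decide enterprise model selection. I would set up model portfolio governance: first define a capability taxonomy by business task and risk tier, then build a benchmark pack and scorecard for each use case that evaluate quality, domain fit, red-team results, security, latency, cost, auditability and human oversight fit. Finally, the model selection council decides the champion, challenger, constraints and retirement triggers, and leaves behind a model card, run result and selection record.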

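2 minutes

In financial retail, contact center, AML, KYC, credit policy RAG, payments fraud and the enterprise knowledge assistant place very different demands on a model. KYC tends to weigh extraction accuracy and low-confidence routing; AML weighs false negatives and typology coverage; credit policy RAG weighs citation support and the ban on unsupported adverse action; contact center weighs low latency, scripting and escalation boundaries.

So I would run model governance as continuous portfolio management. The first layer is the model portfolio inventory, which records model family, version, provider mode, data boundary, approved use and prohibited use. The second layer is the capability taxonomy, which breaks tasks into extraction, classification, retrieval-grounded answering, summarization, recommendation, safety, domain policy, operability and governance. The third layer is the benchmark pack, which compares champion and challenger on real and synthetic cases, domain edge cases, red-team cases and operational tests. The fourth layer is the scorecard, which uses weights and hard blockers so that an average score cannot mask a high-risk failure. The fifth layer is the decision record, which captures why a model was approved, constrained, rejected or retired.

Model selection then shifts from "which model is best" to "within this business task and risk boundary, which model has enough evidence to be approved for use".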

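CTO version

I would place model portfolio governance between the AI platform control plane and the AI governance operating model. The platform owns the model gateway, configuration versioning, eval runner, traces, policy enforcement, fallback and monitoring; governance owns the capability taxonomy, risk-tiered benchmarks, scorecards, the selection council, exceptions and retirement. The key design choice: approved use is bound to task, workflow, risk tier, data boundary and benchmark evidence, not to a model brand.

For a CTO, this addresses four architecture risks. First, it avoids every team repeating model comparisons whose evidence cannot be reused. Second, it avoids silent regression caused by supplier or model version changes. Third, it avoids a one-size-fits-all model standard that sacrifices latency, cost, security or domain fit. Fourth, it preserves traceable evidence for audit, model risk and production incident review. The final output is not a "best model list" but a runnable champion/challenger portfolio that keeps absorbing new models while retiring unfit ones in time.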


7-day practice plan

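| Day | Practice | Output |
| --- | --- | --- |
| Day 1 | Choose three financial retail use cases: contact center, KYC and credit policy RAG. Break them down into an AI task taxonomy. | task taxonomy table |
| Day 2 | For each use case, write the model portfolio inventory fields and approved/prohibited use. | 3 model card sketches |
| Day 3 | Design the capability scorecard, making hard blockers and weights explicit. | scorecard v1 |
| Day 4 | Write a benchmark pack for AML triage or payments fraud: domain set, red-team set, operational test. | benchmark pack outline |
| Day 5 | Simulate a champion/challenger decision across frontier proprietary, small model, open-weight and domain-tuned model. | model selection record |
| Day 6 | Write retirement triggers and the evidence packet: policy change, safety regression, supplier change, cost/latency misfit. | retirement and evidence template |
| Day 7 | Practice the interview answer in the 30-second, 2-minute and CTO versions, and turn one scenario into a portfolio page. | interview script + portfolio artifact |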

| Source | Link | How this note uses it |
| --- | --- | --- |
| Stanford HELM latest | https://crfm.stanford.edu/helm/latest/ | Holistic and living benchmark mindset for multi-scenario, multi-metric model evaluation |
| HELM paper | https://arxiv.org/abs/2211.09110 | Multi-metric evaluation idea: accuracy, calibration, robustness, fairness, bias, toxicity and efficiency are decision inputs, not one score |
| MLCommons AI Safety / AILuminate | https://mlcommons.org/benchmarks/ai-safety/ | Safety benchmark and system-under-test framing; useful for red-team score and safety hard blockers |
| NIST AI RMF | https://www.nist.gov/itl/ai-risk-management-framework | Govern / Map / Measure / Manage language for risk-based AI governance |
| NIST AI RMF resources and TEVV anchors | https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources | Links model portfolio evidence to testing, evaluation, verification, validation, GenAI profile and AI RMF playbook resources |
| ISO/IEC 42001 | https://www.iso.org/standard/81230.html | AI management system anchor for policies, objectives, operating controls, performance evaluation and continual improvement |
| ISO/IEC 23894 | https://www.iso.org/standard/77304.html | AI risk management guidance anchor for integrating risk management into AI-related activities and functions |