
AI Model Portfolio Benchmarking: Model Portfolio Evaluation and Selection Governance Architecture

Date: 2026-06-30


AI Model Portfolio Benchmarking / Capability Scorecard / Selection Governance Architecture

Status: evergreen
Audience: experienced CBAP / financial retail PM / product architect / solution architect / AI governance lead
Output: advanced architecture note for website-visible portfolio, model selection council design, audit evidence and interview discussion

Why model portfolio governance matters

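A financial retail enterprise will not rely on a single AI model for long. A real environment usually runs frontier models, small models, embedding models, rerankers, domain-tuned classifiers, document extraction models, speech models, judge models, open-weight models, managed proprietary models and legacy ML models side by side. The question is no longer “which model is best” but:

For each business task, risk tier, data boundary, latency constraint, cost constraint, security requirement and audit requirement, which model family is approved for use under current evidence, when is a challenger required, and when must a model be retired?

This is a portfolio governance problem, not a single benchmark run or a vendor demo.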

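Governance question	Architecture implication	Financial retail consequence
How are model capabilities classified?	A shared capability taxonomy is required so that teams do not each apply their own definition of “works well”	Otherwise customer service, AML, KYC, credit and payments teams cannot compare evidence
How do scores enter decisions?	The scorecard must bind to task, threshold, risk tier and evidence rather than rank models by average score	High-risk scenarios cannot be masked by a generic leaderboard or by low-risk FAQ scores
How are model families managed?	proprietary, open-weight, fine-tuned, small, specialist and judge models each have a different control surface	Avoids supplier concentration, unauditable black boxes and open weights with no patch process
How are challengers run?	A new model must be compared on the same benchmark pack, the same rubric and the same cost/latency measurement	Prevents the slogan “the new model is stronger” from bypassing the release gate
When is a model retired?	Triggers are needed for drift, incidents, policy change, runaway cost, safety regression and supplier risk	Prevents old models from continuing to serve regulated workflows
How does audit review a decision?	Every selection must leave a model card, scorecard, benchmark run, decision record and exception	Supports model risk, internal audit, regulators, product retrospectives and management accountability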

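Scope note: this note does not cover model routing strategy design, procurement sandboxes, board-level investment narratives or AI FinOps cost management. It focuses on continuous model portfolio governance: capability classification, benchmarking, capability scorecards, the selection council, champion/challenger, risk tiering, retirement triggers and the evidence chain.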

Concept diagram

flowchart TB
  A[Business capability map] --> B[AI task taxonomy]
  B --> C[Model portfolio inventory]
  C --> D[Model cards and approved-use boundaries]

  B --> E[Benchmark packs]
  E --> E1[Quality and task metrics]
  E --> E2[Safety and red-team tests]
  E --> E3[Domain and policy tests]
  E --> E4[Cost, latency and resilience tests]

  D --> F[Capability scorecard]
  E1 --> F
  E2 --> F
  E3 --> F
  E4 --> F

  F --> G{Model selection council}
  G -->|Approve| H[Champion model for approved use]
  G -->|Constrain| I[Conditional use with controls]
  G -->|Reject| J[Do not use or return to lab]
  G -->|Challenge| K[Challenger backlog]

  H --> L[Release and production monitoring]
  I --> L
  L --> M[Evidence packet]
  L --> N[Drift, incident and policy-change signals]
  N --> K
  N --> O[Retirement trigger]
  O --> P[Replacement / fallback / decommission]

  M --> Q[Audit, model risk, product decision record]
  Q --> C

Core idea:

model portfolio governance
  = inventory + capability taxonomy + benchmark pack + scorecard
  + selection council + champion/challenger + retirement trigger
  + auditable evidence

Core architecture model

1. Portfolio layers

Layer	Responsibility	Key objects	Typical owner
Business capability layer	States which business capability and process step the model serves	contact servicing, KYC onboarding, AML triage, credit policy support, fraud intervention	business owner / AI PM / BA
AI task layer	Breaks business capabilities into evaluable tasks	classify, extract, summarize, retrieve-answer, draft, detect anomaly, recommend escalation	PM / BA / AI architect
Model inventory layer	Records all candidate and approved models	model family, provider, version, deployment mode, data route, approved use, restriction	AI platform / governance
Benchmark layer	Prepares a reproducible benchmark pack for each task	dataset version, rubric, red-team set, domain set, run config, judge config	EvalOps / SME / model risk
Scorecard layer	Turns multi-dimensional results into selection evidence	task score, domain score, red-team score, latency/cost/security score, confidence	AI PM / architect / governance
Decision layer	Approves, constrains, challenges and retires models, and handles exceptions	model selection record, risk acceptance, conditions, expiry, fallback	model selection council
Evidence layer	Retains auditable evidence	model card, benchmark manifest, scorecard, traces, approval, exception, retirement record	release / GRC / audit

2. Model portfolio inventory as a governed asset

The model inventory is not a technical team's spreadsheet. It has to support joint judgment by business, architecture, risk, audit and operations.

Field	Description	Example
model_id	Unique enterprise-internal identifier	`LLM-GP-PROP-A-2026-06`
model_family	general LLM / reasoning / small LLM / embedding / reranker / classifier / OCR / speech / judge / safety model	general LLM
provider_mode	proprietary API / managed cloud / open-weight self-hosted / internal fine-tune / vendor application component	proprietary API
version_boundary	model version, snapshot, API date, fine-tune id, guardrail version	`2026-06 snapshot, guardrail v4`
approved_use	Permitted tasks, scenarios and risk tiers	contact center agent assist, draft only
prohibited_use	Uses where direct decisioning or customer-visible output is prohibited	no autonomous credit decision, no SAR conclusion
data_boundary	PII, PCI, transactions, documents, voice, logs, cross-border transfer, training use, retention	no provider training, EU data route, 30-day logs
benchmark_status	last run, benchmark pack, pass/fail, open gaps	passed `CONTACT-RAG-v2026.06` with 2 accepted gaps
operational_slo	latency, availability, rate limit, fallback	p95 < 4s for agent assist
control_status	DLP, prompt injection, access, logging, red-team, human review	DLP and trace export approved
lifecycle_status	candidate / challenger / champion / constrained / watchlisted / retired	champion
retirement_trigger	triggers that invalidate current approval	critical safety regression, policy citation fail, supplier exit
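
The inventory fields above can be sketched as a typed record with an approved-use check. This is a minimal illustration, not a reference schema: the class names, the `may_serve` helper and the choice to let watchlisted models keep serving their existing approved uses are assumptions added here, and only a subset of the fields is modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class LifecycleStatus(Enum):
    CANDIDATE = "candidate"
    CHALLENGER = "challenger"
    CHAMPION = "champion"
    CONSTRAINED = "constrained"
    WATCHLISTED = "watchlisted"
    RETIRED = "retired"


# Assumption: watchlisted models keep existing approved uses (expansion is frozen).
PRODUCTION_ELIGIBLE = {
    LifecycleStatus.CHAMPION,
    LifecycleStatus.CONSTRAINED,
    LifecycleStatus.WATCHLISTED,
}


@dataclass
class ModelInventoryRecord:
    model_id: str
    model_family: str
    provider_mode: str
    version_boundary: str
    approved_use: List[str]
    prohibited_use: List[str]
    lifecycle_status: LifecycleStatus
    retirement_trigger: List[str] = field(default_factory=list)

    def may_serve(self, use: str) -> bool:
        """Allow a use only if it is explicitly approved, not prohibited,
        and the model is in a production-eligible lifecycle state."""
        return (
            self.lifecycle_status in PRODUCTION_ELIGIBLE
            and use in self.approved_use
            and use not in self.prohibited_use
        )


record = ModelInventoryRecord(
    model_id="LLM-GP-PROP-A-2026-06",
    model_family="general LLM",
    provider_mode="proprietary API",
    version_boundary="2026-06 snapshot, guardrail v4",
    approved_use=["contact center agent assist, draft only"],
    prohibited_use=["autonomous credit decision", "SAR conclusion"],
    lifecycle_status=LifecycleStatus.CHAMPION,
    retirement_trigger=["critical safety regression", "supplier exit"],
)
```

The check is deny-by-default: a use that was never evaluated is refused even if nobody listed it as prohibited, which matches the approved-use boundary idea in this note.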

3. Capability scorecard as decision interface

The scorecard is not meant to replace judgment. Its purpose is to put the judgments of different roles into one evidence table.

public benchmark tells us generic capability.
domain benchmark tells us task fit.
red-team benchmark tells us unacceptable behavior.
architecture score tells us operability.
governance score tells us whether evidence can survive audit.
selection council turns all of that into approved-use boundaries.

Capability taxonomy and scorecard

Capability taxonomy

Capability family	What to evaluate	Financial retail examples
Language and reasoning	instruction following, multi-step reasoning, ambiguity handling, calibration	contact center policy explanation, credit memo critique
Retrieval-grounded answer	source recall, citation support, stale-source handling, entitlement respect	credit policy RAG, enterprise knowledge assistant
Information extraction	field accuracy, table understanding, layout robustness, confidence and exception detection	KYC document extraction, income proof review
Summarization and narrative	completeness, material omission, tone, traceability to evidence	AML case narrative draft, complaint root cause summary
Classification and triage	precision/recall by severity, false negative control, threshold behavior	AML alert triage, payment fraud queue priority
Action recommendation	policy compliance, escalation quality, human approval fit	payments fraud intervention, collections hardship next action
Safety and security	prompt injection, PII leakage, unsafe advice, tool misuse, jailbreak resistance	customer-facing chatbot, analyst copilot, internal RAG
Domain and policy	product rules, regulatory boundaries, jurisdictional nuance, effective dates	credit policy, KYC policy, AML typology, Reg E dispute rules
Operability	latency, availability, trace export, rate limit, fallback, observability	contact center p95 latency, fraud real-time queue
Governance readiness	model card quality, version control, audit evidence, change notice, retirement support	all regulated use cases

Scorecard dimensions

Score each dimension 1-5, but apply hard blockers for high-risk use cases. A strong average cannot compensate for a critical failure.

Dimension	Weight	1	3	5	Evidence
Task quality	14	fails common cases	acceptable average	strong pass rate with slice stability	benchmark results and SME review
Domain score	14	generic language only	handles common policy	handles edge cases, effective dates and product nuance	domain benchmark pack
Red-team / safety score	14	critical failures	mitigations partial	no critical failures in approved set	adversarial run, safety report
Grounding and citation	10	unsupported claims	citations sometimes weak	claims trace to allowed sources	RAG/citation eval
Robustness	8	brittle to wording/noise	stable on common variants	stable across language, channel, missing evidence and ambiguity	mutation tests
Security and data boundary	10	unclear logs/training/access	basic controls	enforceable data route, DLP, IAM, audit export	security review
Latency and reliability	8	unusable p95 or rate limits	acceptable with fallback	meets workflow SLO under stress	load and resilience test
Cost fitness	6	unit cost blocks scale	usable for limited scope	cost fits approved use and fallback policy	unit cost sheet
Human oversight fit	6	encourages overtrust	review possible	supports review, escalation and override evidence	workflow simulation
Governance readiness	10	no version/evidence	partial records	model card, run manifest, decision record, retirement support	evidence packet

Hard blockers:

Any critical customer harm, privacy, security, regulated advice or unauthorized action failure in approved high-risk scope.
No reproducible benchmark run for the task.
No model/version boundary or provider configuration boundary.
No trace export for regulated workflows requiring review.
Model behavior materially changed without change notice or re-benchmark.
Open model cannot be patched, scanned, hosted or access-controlled to enterprise policy.
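
The weighting and hard-blocker rule can be sketched in a few lines. The weights are taken from the dimension table above (they sum to 100); the dictionary keys, the `evaluate_scorecard` name and the shape of the returned record are assumptions made for illustration only.

```python
# Weights copied from the scorecard dimension table; keys are illustrative names.
WEIGHTS = {
    "task_quality": 14,
    "domain_score": 14,
    "red_team_safety": 14,
    "grounding_citation": 10,
    "robustness": 8,
    "security_data_boundary": 10,
    "latency_reliability": 8,
    "cost_fitness": 6,
    "human_oversight_fit": 6,
    "governance_readiness": 10,
}


def evaluate_scorecard(scores, hard_blockers):
    """Return the weighted 1-5 score plus a blocked flag.

    scores: dict mapping every dimension in WEIGHTS to an integer 1..5.
    hard_blockers: list of blocker descriptions raised during review.
    A non-empty blocker list blocks approval regardless of the average.
    """
    missing = sorted(set(WEIGHTS) - set(scores))
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")

    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight
    return {
        "weighted_score": round(weighted, 2),
        "blocked": bool(hard_blockers),
        "blockers": list(hard_blockers),
    }


# A model that scores 5 everywhere except a critical red-team failure.
scores = {name: 5 for name in WEIGHTS}
scores["red_team_safety"] = 1
result = evaluate_scorecard(
    scores, ["critical PII leakage in approved high-risk scope"]
)
# result["weighted_score"] is 4.44, yet result["blocked"] is True.
```

The example shows why the blocker is evaluated separately from the average: a 4.44 weighted score would look strong on a ranking, but the decision is still blocked.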

Model family comparison

Model family	Strength	Weakness	Good fit	Governance emphasis
Frontier proprietary	strong reasoning, language, tool use	cost, latency, data route, black-box change risk	complex contact center, credit policy RAG, knowledge assistant	version boundary, logs, supplier change notice, red-team
Small proprietary	fast and cheaper	weaker long reasoning and edge cases	high-volume FAQ, simple classification, draft suggestions	task boundary, escalation, challenger monitoring
Open-weight general	deployment control, inspectable hosting choices	ops burden, safety patching, weaker managed controls	internal knowledge assistant, constrained extraction, sovereign data	hosting, patch cadence, safety layer, license review
Domain-tuned model	better terminology and stable task behavior	narrow scope, data/version governance	AML typology classification, KYC extraction, fraud intervention	training data lineage, drift, revalidation
Specialist extractor/OCR	document/layout accuracy	less flexible reasoning	KYC document extraction, income proof extraction	field-level accuracy, exception routing, confidence calibration
Embedding/reranker	retrieval quality and entitlement	invisible failure if not evaluated	RAG for policy and knowledge assistant	source recall, access filtering, index/version
Judge/evaluator model	scalable rubric support	bias, instability, circular evaluation	regression triage, large eval runs	judge calibration, human audit sample, version control

Benchmark and challenger lifecycle

Lifecycle states

State	Meaning	Allowed decisions
Candidate	model has been proposed or discovered but not approved	lab testing only
Baseline	current comparator or no-AI process	compare, keep as fallback
Challenger	model is tested against champion for a defined use case	no production use unless separately approved
Champion	model approved for specific use boundary	release with controls
Constrained champion	approved only for limited channel, segment, risk tier or human-review mode	pilot or limited production
Watchlisted	production signals, supplier changes or benchmark regressions require review	freeze expansion, run additional benchmark
Retired	no new use, replaced or decommissioned	archive evidence, keep historical records
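
One way to keep lifecycle changes honest is to encode them as an explicit transition map that refuses moves without a decision record. The table above defines the states but not the permitted transitions, so the map below is a hypothetical example, as are the `transition` function and the record id format. Baseline is left out because it is a comparator rather than a governed model state.

```python
# Hypothetical transition map; the source table defines states, not transitions.
ALLOWED_TRANSITIONS = {
    "candidate": {"challenger", "retired"},
    "challenger": {"champion", "constrained_champion", "candidate", "retired"},
    "champion": {"constrained_champion", "watchlisted", "retired"},
    "constrained_champion": {"champion", "watchlisted", "retired"},
    "watchlisted": {"champion", "constrained_champion", "retired"},
    "retired": set(),  # terminal: evidence is archived, no new use
}


def transition(current, target, decision_record_id):
    """Return the new state, or raise if the move is not permitted.

    Every state change must reference a model selection record so that
    the change can be traced during audit.
    """
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current}")
    if not decision_record_id:
        raise ValueError("a decision record id is required for any transition")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"transition {current} -> {target} is not permitted")
    return target


new_state = transition("challenger", "champion", "MSR-2026-0042")
```

In this sketch a candidate cannot jump straight to champion; it has to pass through a challenger run first, which mirrors the benchmark gate described later in the note.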

Benchmark pack design

Pack element	Required content	Example
Task definition	AI role, input, expected output, unacceptable output	credit policy RAG answers policy questions with citations, no credit decision
Dataset manifest	source, version, hash, slice coverage, privacy class	420 cases, English/Spanish, policy v2026.05
Rubric	scoring dimensions, severity, thresholds	groundedness 0-5, critical fail if unsupported adverse action reason
Domain set	business policy, product nuance, jurisdiction, effective date	KYC address proof exceptions, AML typology ambiguity
Red-team set	prompt injection, PII, unsafe advice, tool misuse, jailbreak	customer asks agent to reveal another account
Operational test	p50/p95 latency, timeout, rate limit, failover	contact center p95 under 800 concurrent agents
Security test	access control, data retention, logging, DLP	restricted HR policy not retrievable by branch user
Run protocol	model version, prompt version, temperature, repeats, judge version	fixed prompt v12, 3 repeated runs on unstable cases
Evidence output	traces, scorecard, failure taxonomy, decision memo	evidence binder object id
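
The dataset manifest row asks for source, version, hash and slice coverage. A minimal sketch of how such a manifest could be produced is shown below; the case fields (`case_id`, `slice`, `input`), the function names and the manifest keys are assumptions, and a real pack would also carry rubric and run protocol references.

```python
import hashlib
import json


def dataset_hash(cases):
    """Return a SHA-256 digest that is stable across case ordering.

    Cases are sorted by case_id and serialized canonically so that the
    same content always produces the same hash.
    """
    ordered = sorted(cases, key=lambda case: case["case_id"])
    canonical = json.dumps(
        ordered, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(pack_id, source_version, privacy_class, cases):
    """Summarize a benchmark dataset: size, slice coverage and content hash."""
    slice_coverage = {}
    for case in cases:
        slice_coverage[case["slice"]] = slice_coverage.get(case["slice"], 0) + 1
    return {
        "pack_id": pack_id,
        "source_version": source_version,
        "privacy_class": privacy_class,
        "case_count": len(cases),
        "slice_coverage": slice_coverage,
        "sha256": dataset_hash(cases),
    }


cases = [
    {"case_id": "c-001", "slice": "en", "input": "What is the overdraft fee?"},
    {"case_id": "c-002", "slice": "es", "input": "¿Cuál es la comisión por sobregiro?"},
    {"case_id": "c-003", "slice": "en", "input": "Can the fee be reversed?"},
]
manifest = build_manifest("CONTACT-RAG-v2026.06", "policy v2026.05", "synthetic", cases)
```

Storing the hash in the selection record lets a reviewer confirm that a champion and a challenger were scored on exactly the same cases.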

Champion/challenger cadence

Trigger	Action	Decision path
New model version available	run benchmark pack against challenger	approve, reject, keep challenger, constrain
Production complaints or overrides increase	mine failures into regression set and re-run champion	watchlist or remediate
Business policy changes	re-run impacted domain and RAG packs	keep, update prompt/RAG, suspend use
Red-team failure appears	run safety pack and incident review	freeze expansion, hotfix, retire if unresolved
Cost or latency becomes unfit	compare smaller or local challenger	constrained use or replacement
Supplier changes data/log/retention terms	architecture and governance review	suspend new use until evidence is updated
Open model patch or vulnerability	patch, scan, rerun critical packs	keep or retire

Financial retail scenarios

1. Contact center agent assist

Portfolio decision	Scorecard emphasis	Example threshold
Frontier proprietary champion for complex policy questions; small model challenger for routine FAQ	grounding, latency, tone, vulnerable customer escalation, PII safety	no critical unsupported fee reversal promise; p95 under workflow SLO

Evidence:

contact policy RAG pack with current and stale policy conflicts.
red-team set for customer pressure, prompt injection, and account privacy.
trace showing retrieved sources, model output, agent edits, escalation and final disposition.

2. AML triage and investigation narrative

Portfolio decision	Scorecard emphasis	Example threshold
Domain-tuned classifier for alert prioritization; LLM only drafts narrative after analyst evidence selection	false negative control, typology coverage, no final SAR conclusion	zero critical missed high-risk typology in challenge set

Evidence:

AML typology benchmark by structuring, mule activity, funnel account, rapid movement and benign lookalikes.
narrative rubric for material omission and evidence citation.
human review log proving analyst retains final decision.

3. KYC document extraction

Portfolio decision	Scorecard emphasis	Example threshold
Specialist OCR/extractor champion; LLM challenger for exception explanation and missing-document summary	field accuracy, confidence calibration, layout robustness, exception routing	document type and expiry date accuracy above approved threshold; low-confidence routed to human

Evidence:

document pack with passports, IDs, utility bills, bank statements, low-quality scans and non-English layouts.
field-level confusion matrix.
exception evidence for missing address, expired document and name mismatch.
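
The "low-confidence routed to human" threshold can be sketched as a per-field routing rule. The threshold values, field names and route labels below are hypothetical placeholders; in practice they would come from the approved benchmark pack and calibration evidence.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # extractor-reported confidence in [0, 1]


# Hypothetical thresholds; stricter for fields named in the example threshold.
FIELD_THRESHOLDS = {"document_type": 0.95, "expiry_date": 0.98}
DEFAULT_THRESHOLD = 0.90


def route_document(fields):
    """Send the document to human review if any field is below its threshold."""
    low_confidence = [
        f.name
        for f in fields
        if f.confidence < FIELD_THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
    ]
    route = "human_review" if low_confidence else "straight_through"
    return {"route": route, "low_confidence_fields": low_confidence}


decision = route_document(
    [
        ExtractedField("document_type", "passport", 0.99),
        ExtractedField("expiry_date", "2031-04-12", 0.91),
        ExtractedField("full_name", "A. Example", 0.97),
    ]
)
# The expiry date falls below its 0.98 threshold, so the document is routed
# to human review with that field listed as the reason.
```

A rule like this only works if the extractor's confidence is calibrated, which is why the scorecard lists confidence calibration alongside field accuracy.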

4. Credit policy RAG

Portfolio decision	Scorecard emphasis	Example threshold
RAG plus large model for underwriter policy support; no autonomous credit approval or adverse action	citation support, effective date, jurisdiction, protected-class boundary	no unsupported decline reason; every policy statement cites approved source

Evidence:

policy question pack across product, state, effective date and exception rules.
stale policy and conflicting source challenge set.
model selection record that separates advice support from credit decisioning.

5. Payments fraud intervention

Portfolio decision	Scorecard emphasis	Example threshold
Real-time fraud model remains champion for scoring; LLM assists intervention script and case summary	latency, false negative severity, customer harm, script compliance	intervention script cannot encourage unsafe action or reveal detection rules

Evidence:

fraud typology pack for APP scam, account takeover, mule transfer, false-positive customer friction.
p95 latency test for operational queue.
red-team tests for social engineering and disclosure of fraud controls.

6. Enterprise knowledge assistant

Portfolio decision	Scorecard emphasis	Example threshold
Open-weight or managed model depending on data residency; embedding/reranker benchmarked separately	entitlement, source freshness, hallucination, knowledge coverage	no restricted document leakage across role boundaries

Evidence:

knowledge coverage map by HR, operations, product, risk, technology and policy domains.
entitlement test with users from branch, contact center, risk and engineering.
benchmark that separates retriever failure from generator failure.
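
Separating retriever failure from generator failure can be done per case once the benchmark records which sources were needed, retrieved and cited. The sketch below is one possible attribution rule; the function name, the argument names and the four outcome labels are assumptions, and `answer_correct` is assumed to come from the rubric or an SME review.

```python
def attribute_failure(gold_source_ids, retrieved_ids, cited_ids, answer_correct):
    """Classify a RAG benchmark case by where it failed.

    gold_source_ids: sources an SME marked as required for the answer.
    retrieved_ids:   sources the retriever/reranker returned.
    cited_ids:       sources the generator cited in its answer.
    answer_correct:  rubric verdict on the answer content.
    """
    gold = set(gold_source_ids)
    retrieved = set(retrieved_ids)

    if not gold & retrieved:
        # The generator never saw the needed evidence, so even a correct
        # answer would be unsupported.
        return "retriever_failure"
    if not answer_correct:
        return "generator_failure"
    if not gold & set(cited_ids):
        return "citation_failure"
    return "pass"


outcome = attribute_failure(
    gold_source_ids=["policy-17"],
    retrieved_ids=["policy-17", "faq-03"],
    cited_ids=["faq-03"],
    answer_correct=True,
)
# outcome is "citation_failure": the evidence was retrieved and the answer is
# right, but the citation does not point at the required source.
```

Counting outcomes per label tells the council whether a challenger should replace the embedding/reranker or the generator, instead of swapping the whole stack.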

Metrics/control/evidence model

Metrics

Metric class	Examples	Decision use
Capability	pass rate, extraction F1, citation support, narrative completeness, triage precision/recall	determine task fitness
Domain	policy compliance, typology coverage, effective-date correctness, jurisdictional nuance	approve business scope
Safety	critical failure rate, jailbreak violation, PII leakage, unsafe advice, over-refusal	hard gate for risk tiers
Operational	p50/p95 latency, timeout, availability, rate limit, fallback success	decide workflow fit
Human system	reviewer agreement, override rate, review time, escalation quality, automation bias signal	validate HITL design
Governance	model card completeness, trace completeness, version reproducibility, approval freshness	audit readiness
Portfolio	model concentration, open/proprietary mix, challenger freshness, retirement backlog age	management oversight
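
Two of the metrics above are easy to get subtly wrong: p95 latency and recall sliced by severity. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions, and a simple tuple format for labeled cases; both choices and all names are assumptions for illustration.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    observations at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recall_by_severity(cases):
    """Compute recall per severity slice.

    cases: iterable of (severity, actual_positive, predicted_positive).
    Only actual positives count toward recall.
    """
    totals, hits = {}, {}
    for severity, actual_positive, predicted_positive in cases:
        if not actual_positive:
            continue
        totals[severity] = totals.get(severity, 0) + 1
        if predicted_positive:
            hits[severity] = hits.get(severity, 0) + 1
    return {severity: hits.get(severity, 0) / n for severity, n in totals.items()}


latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p95 = percentile_nearest_rank(latencies_ms, 95)  # 1000 with nearest rank

triage_cases = [
    ("high", True, True),
    ("high", True, False),  # missed high-severity alert
    ("low", True, True),
    ("low", False, True),   # false positive, ignored for recall
]
recall = recall_by_severity(triage_cases)  # {"high": 0.5, "low": 1.0}
```

Reporting recall per severity slice rather than overall keeps a missed high-risk typology visible even when the aggregate number looks acceptable.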

Controls

Control	Purpose	Evidence
Approved-use boundary	Prevent model reuse beyond evaluated scope	model card, selection record, API policy
Benchmark gate	Stop promotion without task evidence	benchmark manifest, run report
Red-team gate	Stop release with unacceptable behavior	adversarial run, issue log
Domain SME review	Keep scores tied to policy and workflow reality	reviewer log, rubric decisions
Version lock and change impact	Prevent silent behavior changes	model version, prompt/version registry, supplier notice
Human oversight control	Prevent AI from becoming hidden decision-maker	reviewer action logs, override samples
Evidence retention	Support audit and model risk review	evidence packet, GRC record
Retirement trigger	Remove models when evidence no longer supports use	watchlist record, decommission decision

Evidence packet

model card
  + approved-use boundary
  + benchmark pack manifest
  + run configuration
  + scorecard
  + slice and failure analysis
  + red-team report
  + latency/cost/security results
  + model selection record
  + exception or risk acceptance
  + monitoring and retirement triggers
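
A release gate can check the packet for completeness before a decision is recorded. The sketch below treats every item in the list above as a required key whose value is a non-empty reference; the key names and the convention of recording an explicit "none" for the exception entry are assumptions, not part of the original note.

```python
# Keys mirror the evidence packet list; the names themselves are illustrative.
REQUIRED_ARTIFACTS = (
    "model_card",
    "approved_use_boundary",
    "benchmark_pack_manifest",
    "run_configuration",
    "scorecard",
    "slice_and_failure_analysis",
    "red_team_report",
    "latency_cost_security_results",
    "model_selection_record",
    "exception_or_risk_acceptance",  # assumption: record "none" if not applicable
    "monitoring_and_retirement_triggers",
)


def missing_evidence(packet):
    """Return the required artifacts that are absent or empty in the packet."""
    return [name for name in REQUIRED_ARTIFACTS if not packet.get(name)]


packet = {name: f"evidence://{name}/v1" for name in REQUIRED_ARTIFACTS}
packet["exception_or_risk_acceptance"] = "none"
del packet["red_team_report"]

gaps = missing_evidence(packet)  # ["red_team_report"]
```

An empty result means the packet is structurally complete; it says nothing about whether the evidence inside each artifact is good enough, which remains the council's judgment.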

Anti-patterns and failure modes

Anti-pattern	Why it fails	Better architecture
Leaderboard-driven selection	Public benchmark does not represent workflow, data, risk and controls	use public benchmark as weak prior, then run task benchmark
One model for everything	Ignores task/risk differences and creates concentration risk	portfolio by capability, approved use and risk tier
Average score hides critical failure	High average can coexist with one unacceptable AML, KYC or credit failure	hard blockers and severity-weighted scoring
No challenger cadence	Champion becomes stale while model market and policy change	quarterly and trigger-based champion/challenger review
Model card as static document	Model behavior, provider terms and business policy change	model card with version, evidence and review expiry
Open model treated as automatically safer	Hosting control does not solve patching, safety, license, eval or ops	open-weight governance pack and patch lifecycle
Proprietary model treated as unknowable	Black-box does not excuse missing evidence	require trace, version boundary, change notice and task eval
Judge model trusted blindly	Evaluator bias and instability corrupt scorecard	calibrate judge with human audit and versioned rubric
Retirement never happens	Legacy models persist after risk, cost, policy or supplier evidence changes	explicit retirement triggers and owner accountability
Model selection council becomes ceremony	Decision board approves without evidence or conditions	decision record, conditions, expiry and post-release monitoring

Architecture mapping to RAG / Agent / Copilot / Eval / Governance

Architecture area	Model portfolio governance question	Example design decision
RAG	Which embedding model, reranker and generator are approved for this corpus and user entitlement model?	Credit policy RAG uses approved embedding v3, reranker challenger under test, generator constrained to cited answers
Agent	Which model family can plan or call tools, and under what human approval boundary?	Payments fraud assistant may draft intervention script but cannot block account without rules engine and human approval
Copilot	Which model can draft, summarize or recommend inside a human workflow?	AML copilot drafts narrative only after analyst-selected evidence; final disposition remains human
Eval	Which benchmark packs prove task, domain, safety and operational fitness?	KYC extraction pack separates OCR field accuracy from LLM exception summary quality
Governance	Who approves model use, exception, challenger promotion and retirement?	Model selection council approves champion scope and review expiry; model risk can require independent challenge

This architecture also clarifies component-level selection. A single use case can have separate champions for:

generation model.
embedding model.
reranker.
extractor.
classifier.
safety model.
evaluator/judge model.
fallback model.

The governance object is not “the chatbot model”; it is the AI system model portfolio that produces behavior in a specific workflow.

ADR draft

Title: Establish AI model portfolio benchmarking and selection governance
Date: 2026-06-30
Status: Proposed

Context:
Financial retail AI systems use multiple model families across contact center, AML, KYC, credit, payments and enterprise knowledge workflows. Current model decisions are fragmented across product teams, vendor evaluations, platform experiments and project-specific benchmarks. Public benchmarks and one-off pilots do not provide sufficient evidence for regulated workflow decisions.

Decision:
Create a governed model portfolio architecture with:
1. model portfolio inventory and model cards;
2. AI task capability taxonomy;
3. benchmark packs by use case and risk tier;
4. weighted capability scorecards with hard blockers;
5. champion/challenger lifecycle;
6. model selection council decision records;
7. retirement triggers and evidence packets.

Approved-use decisions will be made at the task/use-case boundary, not at the generic model brand level. High-risk workflows require domain benchmark, red-team score, trace evidence, human oversight fit and explicit retirement triggers.

Alternatives considered:
1. Let each product team choose models independently. Rejected because evidence is not comparable and auditability is weak.
2. Choose one enterprise-standard model. Rejected because task, risk, latency, cost, data residency and control needs differ.
3. Use public leaderboards as primary selection evidence. Rejected because they do not represent financial retail context of use.

Consequences:
Positive:
- Decisions become comparable, auditable and reusable across use cases.
- Model changes can be assessed through challenger runs rather than preference debates.
- Product, architecture, risk and audit share the same evidence language.

Tradeoffs:
- Benchmark packs and model cards require ongoing ownership.
- Some fast experiments will be slowed by evidence gates.
- Scorecards must be maintained as models, policies and business workflows change.

Review triggers:
- new model family or provider enters portfolio;
- champion shows material regression or critical failure;
- business policy, regulation, data boundary or workflow changes;
- supplier changes model version, logging, retention or service behavior;
- quarterly portfolio review.

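Interview answer: 30 seconds, 2 minutes, CTO version

30 seconds

I would not let a single leaderboard score decide enterprise model selection. I would set up model portfolio governance: first define a capability taxonomy by business task and risk tier, then build a benchmark pack and scorecard for each use case that evaluate quality, domain fit, red-team results, security, latency, cost, auditability and human oversight fit. Finally, the model selection council decides champion, challenger, constraints and retirement triggers, and leaves behind a model card, run results and a selection record.

2 minutes

In financial retail, contact center, AML, KYC, credit policy RAG, payments fraud and the enterprise knowledge assistant place very different demands on a model. KYC may weigh extraction accuracy and low-confidence routing more heavily, AML weighs false negatives and typology coverage, credit policy RAG weighs citation support and the ban on unsupported adverse action, and contact center weighs low latency, wording and escalation boundaries.

So I would run model governance as continuous portfolio management. The first layer is the model portfolio inventory, which records model family, version, provider mode, data boundary, approved use and prohibited use. The second layer is the capability taxonomy, which breaks tasks into extraction, classification, retrieval-grounded answering, summarization, recommendation, safety, domain policy, operability and governance. The third layer is the benchmark pack, which compares champion and challenger on real and synthetic cases, domain edge cases, red-team cases and operational tests. The fourth layer is the scorecard, which uses weights and hard blockers to stop an average score from masking a high-risk failure. The fifth layer is the decision record, which documents why a model was approved, constrained, rejected or retired.

Model selection then shifts from “which model is best” to “within this business task and risk boundary, which model has enough evidence to be approved for use”.

CTO version

I would place model portfolio governance between the AI platform control plane and the AI governance operating model. The platform owns the model gateway, configuration versioning, eval runner, traces, policy enforcement, fallback and monitoring; governance owns the capability taxonomy, risk-tiered benchmarks, scorecards, the selection council, exceptions and retirement. The key design choice is that approved use binds to task, workflow, risk tier, data boundary and benchmark evidence, not to a model brand.

For a CTO, this addresses four architecture risks. First, it avoids every team repeating model comparisons whose evidence cannot be reused. Second, it avoids silent regression when a supplier or model version changes. Third, it avoids a one-size-fits-all model standard that sacrifices latency, cost, security or domain fit. Fourth, it retains traceable evidence for audit, model risk and production incident review. The end result is not a “best model list” but an operable champion/challenger portfolio that keeps absorbing new models while retiring, in good time, those that no longer fit.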

7-day practice plan

Day	Practice	Output
Day 1	Pick 3 financial retail use cases: contact center, KYC, credit policy RAG. Break them into an AI task taxonomy.	task taxonomy table
Day 2	Write the model portfolio inventory fields and approved/prohibited use for each use case.	3 model card sketches
Day 3	Design the capability scorecard, making hard blockers and weights explicit.	scorecard v1
Day 4	Write a benchmark pack for AML triage or payments fraud: domain set, red-team set, operational test.	benchmark pack outline
Day 5	Simulate a champion/challenger decision across frontier proprietary, small model, open-weight and domain-tuned model.	model selection record
Day 6	Write retirement triggers and the evidence packet: policy change, safety regression, supplier change, cost/latency misfit.	retirement and evidence template
Day 7	Rehearse the interview answer in the 30-second, 2-minute and CTO versions, and turn one scenario into a portfolio page.	interview script + portfolio artifact

Source anchors with links

Source	Link	How this note uses it
Stanford HELM latest	https://crfm.stanford.edu/helm/latest/	Holistic and living benchmark mindset for multi-scenario, multi-metric model evaluation
HELM paper	https://arxiv.org/abs/2211.09110	Multi-metric evaluation idea: accuracy, calibration, robustness, fairness, bias, toxicity and efficiency are decision inputs, not one score
MLCommons AI Safety / AILuminate	https://mlcommons.org/benchmarks/ai-safety/	Safety benchmark and system-under-test framing; useful for red-team score and safety hard blockers
NIST AI RMF	https://www.nist.gov/itl/ai-risk-management-framework	Govern / Map / Measure / Manage language for risk-based AI governance
NIST AI RMF resources and TEVV anchors	https://www.nist.gov/itl/ai-risk-management-framework/ai-risk-management-framework-resources	Links model portfolio evidence to testing, evaluation, verification, validation, GenAI profile and AI RMF playbook resources
ISO/IEC 42001	https://www.iso.org/standard/81230.html	AI management system anchor for policies, objectives, operating controls, performance evaluation and continual improvement
ISO/IEC 23894	https://www.iso.org/standard/77304.html	AI risk management guidance anchor for integrating risk management into AI-related activities and functions