
Mechanistic Interpretability: Transformer Circuits and SAE

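Mechanistic interpretability does not mean the model is fully transparent. More precisely, it tries to turn part of a neural network's internal computation into human-understandable features and circuits.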


Mechanistic Interpretability / Transformer Circuits / Sparse Autoencoders: A Reading Guide

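Audience: AI PM / AI BA / AI Architect / Model Risk / AI Governance.
Core question: Why do large models produce certain behaviors? Can we find interpretable features, circuits, and risk signals inside the network?
Learning goal: Understand the basic direction of mechanistic interpretability and what it means in practice for model risk, AI assurance, product boundaries, and enterprise governance.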

Source Anchors

Source	Link	Purpose
Transformer Circuits	https://transformer-circuits.pub/	Understand Anthropic's mechanistic interpretability research program
A Mathematical Framework for Transformer Circuits	https://transformer-circuits.pub/2021/framework/index.html	Understand attention heads, MLPs, the residual stream, and circuit analysis
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning	https://transformer-circuits.pub/2023/monosemantic-features/index.html	Understand how sparse autoencoders / dictionary learning are used to interpret internal features
Scaling Monosemanticity	https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html	Understand interpretability research that scales SAEs to larger models
Model Cards for Model Reporting	https://arxiv.org/abs/1810.03993	Turn interpretability evidence into model documentation and boundary statements


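1. Why Learn Mechanistic Interpretability

For an AI PM, BA, or architect, this is not a day-to-day coding skill, but it sharpens your judgment about model risk.

It helps answer:

Question	Value
Are there understandable features inside the model?	Explains where behavior comes from
Are certain outputs driven by a specific circuit?	Supports analysis of risks and failure modes
Has the model learned deception, bias, or proxies for sensitive attributes?	Risk discovery
Can internal activations be monitored as a safety signal?	Future safety controls
Can interpretability evidence support an assurance case?	Governance and audit

But stay clear-eyed:

At this stage, mechanistic interpretability is an important research direction, not a control an enterprise can rely on by itself at launch. It can supplement model risk evidence, but it cannot replace eval, monitoring, human oversight, and policy controls.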

2. A Basic Mental Model of Transformer Circuits

Internally, a Transformer can be viewed as a set of components that read from and write to the residual stream.

Token embedding
  -> residual stream
  -> attention heads read/write
  -> MLP layers read/write
  -> later layers compose features
  -> logits
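
The read/write flow above can be sketched in a few lines. This is a minimal illustration, not a real Transformer: the stream is one small vector at a single token position, and `toy_attention` and `toy_mlp` are hypothetical stand-ins that only show the additive pattern (each component reads the stream, then adds its output back).

```python
from typing import Callable, List

Vector = List[float]


def run_residual_stream(
    embedding: Vector,
    components: List[Callable[[Vector], Vector]],
) -> List[Vector]:
    """Apply components in order; each reads the stream and adds its output back."""
    stream = list(embedding)
    history = [list(stream)]
    for component in components:
        update = component(stream)  # read the current stream
        stream = [s + u for s, u in zip(stream, update)]  # additive write
        history.append(list(stream))
    return history


def toy_attention(stream: Vector) -> Vector:
    # Hypothetical component: writes half of dimension 1 into dimension 0.
    return [0.5 * stream[1], 0.0, 0.0]


def toy_mlp(stream: Vector) -> Vector:
    # Hypothetical component: writes ReLU(dimension 0) into dimension 2.
    return [0.0, 0.0, max(0.0, stream[0])]


history = run_residual_stream([1.0, 2.0, 0.0], [toy_attention, toy_mlp])
print(history[-1])  # [2.0, 2.0, 2.0]
```

Because every write is an addition, a later component (here the toy MLP) can build on what an earlier one wrote, which is the sense in which later layers compose features.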

Attention Head

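An attention head can learn a specific pattern, for example:

Copying a token from earlier in the context.
Linking an entity to its attributes.
Matching brackets or quotation marks.
Attending to the subject of the previous sentence.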
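
To make "a head attends to a particular earlier position" concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. The vectors are made up for illustration, and the sketch omits the causal mask and the learned query/key/value/output projections of a real head.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float],
    keys: List[List[float]],
    values: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of values) for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Illustrative vectors: only the key at position 1 matches the query.
keys = [[0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, output = attend([4.0, 0.0], keys, values)
print(weights, output)
```

Nearly all of the weight lands on position 1, so the output is close to that position's value vector. This is the sense in which a head "copies" information from one position to another.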

MLP

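MLP layers are often understood as one of the places where features are stored and combined.

Circuit

A circuit is a combination of heads, neurons, and features that together implement a specific function.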

Examples:

Induction circuit.
Name mover head.
Factual association circuit.
Refusal / safety-relevant features.
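
As an illustration of the first example, the behavior usually attributed to an induction circuit is: given `... A B ... A`, predict `B`. The sketch below reproduces only that input/output behavior with an explicit search. It does not model the mechanism (in a real model this is implemented by attention heads composing across layers), and the function name is hypothetical.

```python
from typing import List, Optional


def induction_prediction(tokens: List[str]) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_prediction(["Mr", "Dursley", "was", "Mr"]))  # Dursley
print(induction_prediction(["the", "cat", "sat"]))           # None
```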

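3. Monosemanticity and Polysemanticity

Neurons are often polysemantic:

A single neuron may respond to several seemingly unrelated concepts.

This makes interpretation hard.

The sparse autoencoder approach:

Collect activations from a layer inside the model.
Train a sparse autoencoder on them.
Decompose the complex activations into a larger set of sparser, more interpretable features.
Analyze which concept each feature represents.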
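
The steps above can be sketched as a forward pass plus a loss. This is a simplified, commonly used formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 penalty on feature activations), not necessarily the exact setup of any cited paper. The weights are hand-picked rather than trained, and all names are illustrative.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(
    x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, b_dec: Vector
) -> Tuple[Vector, Vector]:
    """Encode an activation into non-negative features, then reconstruct it."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    reconstruction = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(
    x: Vector, features: Vector, reconstruction: Vector, l1_coeff: float
) -> float:
    """Squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Hand-picked toy dictionary: 2-dimensional activation, 4 features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
x = [2.0, -3.0]
features, reconstruction = sae_forward(x, w_enc, [0.0] * 4, w_dec, [0.0] * 2)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
print(features, reconstruction, loss)
```

In this toy case only two of the four features fire and the input is reconstructed exactly, so the loss is just the sparsity penalty. Training adjusts the weights to trade reconstruction quality against the number of active features.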

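Why Sparsity Helps

Concept	Meaning
Dense activation	Many directions carry mixed information at once
Sparse feature	Only a few features are active for a given input
Dictionary learning	Learn a set of features that can be combined to reconstruct activations
Monosemantic feature	Corresponds, as far as possible, to a single concept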

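4. What Enterprise AI Can Learn from This

4.1 A Model Is Not a Rule System

The inside of a model is not a set of hand-written if/else statements.

A PM or BA should not say:

As long as the prompt is clear, the model will always follow the rules.

More accurately:

Prompts, RLHF, system policy, tool constraints, and eval are external controls that only influence behavior. Inside the model there may still be associations and activation patterns that are hard to predict.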

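4.2 Explainability Has Multiple Layers

Explanation layer	Example	Maturity
Product explanation	Why the user was given this recommendation	Feasible today
Evidence grounding	Which sources support the answer	Feasible today
Behavioral eval	Which kinds of input cause failures	Feasible today
Feature/circuit analysis	Why an internal mechanism fires	Under research
Formal guarantee	Proof that the system will not fail	Very hard

Enterprise deployments rely mainly on the first three layers, but teams should understand where the fourth layer is heading.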

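4.3 Interpretability Evidence Can Enter the Safety Case

In high-risk systems, interpretability evidence may in future serve as supporting evidence:

Whether a class of dangerous features is detectable.
Whether a model upgrade changed key feature activations.
Whether a refusal behavior comes from a stable mechanism.
Whether a class of jailbreak activates an abnormal path.

But this kind of evidence must be combined with external eval.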

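5. The Model Risk Perspective

What mechanistic interpretability suggests for model risk:

Model risk question	Interpretability angle
Does the model rely on proxies for sensitive attributes?	Analyze feature / activation correlation
Does the model bypass safety in specific contexts?	Look for triggered features / circuits
Does a model upgrade change the behavioral mechanism?	Compare activation / feature patterns
Is there an internal signal for certain kinds of hallucination?	Activation anomaly / uncertainty proxies
Can it be explained to regulators?	Only as supporting evidence; do not overpromise

SR 11-7-style model risk management emphasizes conceptual soundness, validation, and ongoing monitoring. Mechanistic interpretability may strengthen conceptual understanding, but it still cannot replace empirical validation.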
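
One way to picture the "compare activation / feature patterns" row is a firing-rate comparison over a fixed prompt set. The sketch below assumes, as a strong simplification, that the same named features can be measured on both model versions. In practice features extracted from different models are not automatically comparable. The feature names, data, and threshold are all hypothetical.

```python
from typing import Dict, List


def firing_rate(activations: List[float], threshold: float = 0.0) -> float:
    """Fraction of prompts on which the feature activation exceeds the threshold."""
    return sum(1 for a in activations if a > threshold) / len(activations)


def compare_feature_rates(
    old: Dict[str, List[float]],
    new: Dict[str, List[float]],
    min_shift: float = 0.2,
) -> Dict[str, float]:
    """Return features whose firing rate changed by at least min_shift."""
    shifted = {}
    for name in sorted(set(old) & set(new)):
        delta = firing_rate(new[name]) - firing_rate(old[name])
        if abs(delta) >= min_shift:
            shifted[name] = round(delta, 3)
    return shifted


# Hypothetical activations for the same four prompts on two model versions.
old_version = {"refusal": [0.9, 0.8, 0.0, 0.7], "pii": [0.0, 0.0, 0.1, 0.0]}
new_version = {"refusal": [0.9, 0.0, 0.0, 0.0], "pii": [0.0, 0.0, 0.2, 0.0]}
print(compare_feature_rates(old_version, new_version))  # {'refusal': -0.5}
```

A flagged shift is a prompt for investigation, not a validation result. It would still need to be checked against behavioral eval on the same prompts.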

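6. The PM Perspective: Boundaries of Product Commitments

An AI PM studies this to avoid two mistakes:

Mistake	Description
Over-mystification	Believing the model is completely incomprehensible and therefore cannot be governed
Overpromising	Believing that interpretability research alone can guarantee safety

Reasonable product statements:

We can explain which evidence the system used.
We can measure which scenarios perform well or poorly.
We can define which scenarios require human review.
We cannot claim to fully understand every internal mechanism of the model.
For high-risk automated decisions, we rely on external controls and audit evidence.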

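7. The BA Perspective: How to Write Explainability Requirements

A BA should not write:

The system must be explainable.

Break it down instead:

Requirement	Acceptance
The system shall show key evidence sources	Each material claim has a citation
The system shall state confidence and limitations	Output includes confidence factors and limitations
The system shall record the decision path	Trace includes prompt, sources, tools, policy gates
The system shall support review	Reviewer can inspect evidence and override
The system shall record model versions	Model/prompt/index/eval versions are linked
High-risk recommendations must be challengeable	Human can reject and label a reason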
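
Several of the acceptance criteria above can be turned into automated checks on a response record. The following is a minimal sketch under an assumed record shape. The field names (`claims`, `limitations`, `trace`, `versions`) and the example identifiers are hypothetical and not taken from any specific system.

```python
from typing import Any, Dict, List

REQUIRED_TRACE_FIELDS = ("prompt", "sources", "tools", "policy_gates")
REQUIRED_VERSION_FIELDS = ("model", "prompt", "index", "eval")


def check_explainability(record: Dict[str, Any]) -> List[str]:
    """Return a list of acceptance failures; an empty list means the record passes."""
    failures = []
    for claim in record.get("claims", []):
        if claim.get("material") and not claim.get("citations"):
            failures.append(f"uncited material claim: {claim.get('text', '')}")
    if not record.get("limitations"):
        failures.append("missing limitations")
    trace = record.get("trace", {})
    for name in REQUIRED_TRACE_FIELDS:
        if name not in trace:
            failures.append(f"trace missing {name}")
    versions = record.get("versions", {})
    for name in REQUIRED_VERSION_FIELDS:
        if not versions.get(name):
            failures.append(f"version missing {name}")
    return failures


record_ok = {
    "claims": [{"text": "DTI exceeds the policy limit", "material": True,
                "citations": ["policy-4.2"]}],
    "limitations": "Based on the documents provided; not a credit decision.",
    "trace": {"prompt": "summarize application", "sources": ["policy-4.2"],
              "tools": [], "policy_gates": ["pii"]},
    "versions": {"model": "m-1", "prompt": "p-3", "index": "i-7", "eval": "e-2"},
}
record_bad = {
    "claims": [{"text": "Fee is waived", "material": True, "citations": []}],
    "trace": {"prompt": "answer fee question"},
    "versions": {"model": "m-1"},
}
print(check_explainability(record_ok))   # []
print(check_explainability(record_bad))
```

The review, override, and rejection requirements concern workflow rather than record contents, so they are better verified with process tests than with a field check like this one.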

Mechanistic interpretability can serve as additional research-grade evidence, not as the sole condition for business acceptance.

8. The Architect Perspective: Interpretability Evidence Architecture

An Architecture That Can Be Built Today

User task
  -> RAG / tools / model
  -> Answer
  -> Evidence citations
  -> Eval scores
  -> Policy gate result
  -> Human review
  -> Trace store
  -> Audit evidence binder
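
The flow above could be captured as one record per request plus a routing rule for human review. This is a sketch under assumptions: the `TraceRecord` fields, the `route` rule, the 0.8 score threshold, and the in-memory `trace_store` are illustrative placeholders, not a prescribed design.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class TraceRecord:
    task: str
    answer: str
    citations: List[str]
    eval_score: float
    policy_gate_passed: bool
    human_review_required: bool = False


def route(record: TraceRecord, min_eval_score: float = 0.8) -> TraceRecord:
    """Flag the record for human review if any external control is not satisfied."""
    record.human_review_required = (
        not record.policy_gate_passed
        or record.eval_score < min_eval_score
        or not record.citations
    )
    return record


trace_store: List[Dict[str, Any]] = []  # stand-in for a durable trace store


def persist(record: TraceRecord) -> None:
    trace_store.append(asdict(record))


clean = route(TraceRecord("fee question", "The fee is 2%.", ["fee-schedule"], 0.93, True))
risky = route(TraceRecord("fee question", "The fee is waived.", [], 0.93, True))
persist(clean)
persist(risky)
print([r["human_review_required"] for r in trace_store])  # [False, True]
```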

A Possible Future Extension

Model activation telemetry
  -> feature detector / SAE probe
  -> risk signal
  -> monitoring dashboard
  -> investigation workflow

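Notes:

Do not expose internal activations directly to ordinary users.
Do not treat research-grade features as stable business fields.
Put version management and drift monitoring in place.
Guard against misreading of interpretability results.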
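
A feature detector in this future pipeline could be as simple as a versioned direction in activation space with a threshold. The sketch below is hypothetical throughout: the detector name, direction, threshold, and version strings are made up, and a real probe would need calibration against labeled data before its signals mean anything.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class FeatureDetector:
    name: str
    version: str
    direction: List[float]  # hypothetical probe direction in activation space
    threshold: float

    def score(self, activation: List[float]) -> float:
        return sum(a * d for a, d in zip(activation, self.direction))


def risk_signal(
    detector: FeatureDetector, activation: List[float], model_version: str
) -> Optional[Dict[str, Any]]:
    """Emit a telemetry event when the score reaches the threshold, else None."""
    score = detector.score(activation)
    if score < detector.threshold:
        return None
    return {
        "feature": detector.name,
        "detector_version": detector.version,
        "model_version": model_version,
        "score": score,
    }


detector = FeatureDetector("unsafe_compliance", "0.1.0", [1.0, 0.0, -1.0], 1.5)
fired = risk_signal(detector, [2.0, 5.0, 0.0], model_version="m-1")
quiet = risk_signal(detector, [1.0, 0.0, 0.0], model_version="m-1")
print(fired, quiet)
```

Each event carries both the detector version and the model version, so a signal can be discarded or recalibrated when either one changes. The event feeds a dashboard or investigation queue rather than a production decision.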

9. Retail Finance Cases

9.1 Credit Assistant

Explainability layers:

Layer	What to do
Evidence	Cite income, DTI, policy, documentation
Product	State that it only assists the underwriter
Risk	Record fairness / adverse action eval
Interpretability research	Analyze whether the model responds to proxies for sensitive attributes

9.2 AML Copilot

Explainability layers:

Layer	What to do
Evidence	Transactions, KYC, typology, case history
Workflow	Investigator review and override
Risk	False positive / false negative eval
Interpretability	Study whether the model is unusually sensitive to certain customer languages or regions

9.3 Customer Service Copilot

Explainability layers:

Layer	What to do
Evidence	Product policy / fee schedule
Safety	Prohibited claims and escalation
Monitoring	Complaint / correction / citation failure
Interpretability	Study persuasion / unsafe compliance features

10. Connections to Other Capabilities

Existing asset	Connection
`08-llm-as-judge-evaluation.md`	External behavioral evaluation remains the primary control
`17-helm-holistic-evaluation-models.md`	Holistic eval manages multiple scenarios and metrics
`18-model-cards-datasheets-ai-documentation.md`	Interpretability results can go into the model/system card
`AI_ASSURANCE_SAFETY_CASE_PLAYBOOK.md`	Interpretability is supporting evidence in the safety case
`AI_MODEL_RISK_MANAGEMENT_PLAYBOOK.md`	Strengthens conceptual soundness but does not replace validation
`AI_AUDIT_EVIDENCE_BINDER_PLAYBOOK.md`	Interpretability evidence needs versioning, an owner, and a review cadence

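11. Portfolio Outputs

Artifact	Content
Explainability Layer Map	Distinguishes evidence, behavioral, mechanistic, and formal explanation
Model Risk Memo	States what interpretability can support and what it cannot promise
Safety Case Evidence Note	Places mechanistic evidence under supporting evidence
BA Explainability Requirements	Breaks "explainable" down into acceptance criteria
Interview One-pager	Uses a finance case to explain that the model is not fully transparent but can still be governed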

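12. Interview Talking Points

30-Second Version

Mechanistic interpretability tries to understand the features and circuits inside a model, such as attention heads, MLP features, and the interpretable features that sparse autoencoders extract. It matters, but today it is better suited as supporting evidence for model risk and the safety case; it cannot replace external eval, human oversight, and audit controls.

2-Minute Version

Transformer Circuits research treats the model as a composition of computations by attention heads, MLPs, and the residual stream, and tries to identify the circuit behind a given behavior. Sparse autoencoders go further and try to decompose internal activations into sparser, more interpretable features. For enterprise AI, this is a reminder that a model is not a rule system and that a prompt cannot guarantee behavior. That does not make it ungovernable. In practice, I would distinguish four layers of explanation: product explanation, evidence grounding, behavioral evaluation, and internal mechanism explanation. High-risk financial scenarios rely mainly on the first three, and mechanistic interpretability can serve as supplementary evidence for model risk and the safety case.

CTO Deep Dive

If we introduce activation-level monitoring in future, I would run it as an isolated telemetry pipeline that does not directly affect production decisions, starting with offline analysis and incident investigation. Every feature detector needs versioning, calibration, and drift monitoring.

Risk Deep Dive

For model risk, I would not treat interpretability as a safety guarantee. It can help with conceptual soundness and failure investigation, but going live still depends on independent validation, challenge sets, ongoing monitoring, change control, and human oversight.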

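13. Review Questions

How does mechanistic interpretability differ from ordinary explainability?
What are the residual stream, an attention head, and an MLP feature?
Why might a sparse autoencoder improve interpretability?
Why can't an enterprise treat mechanistic interpretability as its only control?
How do you write explainability requirements as acceptance criteria a BA can verify?
How does interpretability evidence enter the model card, safety case, and audit binder?
In retail finance AI, which explanation layers must be in place today?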