Expert Day 197

哈希应用 — Merkle Tree、HMAC、密码哈希

Merkle Tree 设计/证明/SMT、HMAC 严谨构造、Argon2/scrypt 的 memory-hard 函数原理

2026-11-14

Phase 4 - 经典密码学 (Day 195-208)

密码学MerkleTreeHMACArgon2密码哈希

日期: 2026-11-14 方向: 密码学 / 经典原语阶段: Phase 4 - 经典密码学 (Day 195-208) 标签: #密码学 #MerkleTree #HMAC #Argon2 #密码哈希

今日目标

类型	内容
学习	Merkle Tree 设计/证明/SMT、HMAC 严谨构造、Argon2/scrypt 的 memory-hard 函数原理
实操	完整实现 Merkle Tree（含 second-preimage 防御）+ 证明生成/验证 + 批量证明
产出	`merkle.py` — 通用 Merkle 库（含 Sparse Merkle Tree）

一、Merkle Tree — 设计与证明

1.1 基础结构

二叉哈希树。叶子是数据哈希，内部节点是子节点哈希的拼接哈希。

        Root = H(H1 || H2)
       /                \
   H1=H(L1||L2)      H2=H(L3||L4)
   /        \        /        \
H(d1)    H(d2)    H(d3)    H(d4)

Merkle 证明 (Proof): 从叶子到根，提供"姊妹节点"。验证 = 沿路径重哈希到根。

复杂度:

树高 $\log_2 N$
证明大小 $O(\log N)$ × hash size
验证 $O(\log N)$ 次哈希

1.2 Second Preimage 攻击

问题: 朴素 Merkle 易受second-preimage 攻击。给定 $H(L1|L2|L3|L4)$，可构造 $H(H_1 | H_2)$ 作为"伪叶子" → 伪造证明。

根因: 攻击者可让中间节点伪装成叶子。

防御 (Bitcoin BIP / Certificate Transparency):

叶子哈希: $H(\text{0x00} | \text{leaf})$
内部节点: $H(\text{0x01} | \text{left} | \text{right})$

域分离 (domain separation) 区分叶子 vs 内部。

1.3 不平衡树处理

数据数 $N$ 不是 2 的幂时:

方案 A (Bitcoin): 复制最后叶子凑成 2 的幂 — CVE-2012-2459: 重复叶子可制造碰撞
方案 B (RFC 6962, Certificate Transparency): 不平衡左满，右递归处理
方案 C (Ethereum SSZ): 0-padding 凑成 2 的幂

1.4 Sparse Merkle Tree (SMT)

目标: 支持稀疏键空间（如 256-bit key → tree height 256）。

关键技巧: 空子树有固定哈希 $\text{empty}h = H(\text{empty}{h-1} | \text{empty}_{h-1})$，可预先计算 256 个空哈希。

用途:

Diem/Aptos: 全局状态树
Plasma: exit tree
ZK rollups: nullifier set

1.5 Indexed Merkle Tree (IMT)

存储已排序键值对的 Merkle 树，每个节点带 (key, value, next_key)。

优势: 支持non-membership proof（证明某 key 不存在）— 找两个相邻 key $k_1 < k < k_2$ 即可。

用途: Aztec 隐私协议的 nullifier 集合。

二、HMAC 严谨构造

2.1 定义

$$\text{HMAC}(K, M) = H((K' \oplus \text{opad}) | H((K' \oplus \text{ipad}) | M))$$

其中:

$K'$ = $K$ 长度 ≤ block size 时 0-pad 到 block size，否则 $K' = H(K)$ pad 到 block size
$\text{ipad} = 0x36 \times B$
$\text{opad} = 0x5C \times B$
$B$ = block size (SHA-256 是 64)

2.2 安全性证明 (Bellare 1996/2006)

定理: 若 $H$ 的压缩函数是 PRF（伪随机函数），则 HMAC 是 PRF。

证明思路: 内层 $H((K'\oplus \text{ipad})|M)$ = NMAC 内层。外层 $H((K'\oplus\text{opad})| \cdot)$ 切断长度扩展。

2.3 为何 ipad ≠ opad

若 $\text{ipad} = \text{opad}$，攻击者用相同消息查询两次 → 内外哈希相同 → 区分性破坏。$0x36 / 0x5C$ 在 bit 位上至少差 4 个 bit（汉明距离）。

三、密码哈希 — Memory-Hard Functions

3.1 为什么需要 memory-hard

PBKDF2/MD-Crypt 仅 CPU 密集。GPU 加速比 ~10^4，FPGA 更高。攻击者用 $1000 GPU 集群可每秒尝试 10^9 候选口令。

目标: 让 ASIC 也无法显著加速 → 加内存壁垒。

3.2 scrypt (Percival 2009)

scrypt(P, S, N, r, p, dkLen):
  1. ROMix: 用 P 派生 N 个 r-block 数据，写入 V[0..N-1]
  2. 随机访问 V，构造无规律访问模式
  3. 输出 dkLen 字节

参数:

$N$: 内存使用 $N \cdot r \cdot 128$ bytes (典型 N=2^17, r=8 → 128 MB)
$p$: 并行度

Memory-hard 论证: 任何 ASIC 必须配 N×128 bytes RAM；用半内存版本时间复杂度上升 $O(N)$ → 不划算。

3.3 Argon2 (Biryukov 2015)

PHC (Password Hashing Competition) 2015 winner。

变种:

Argon2d: data-dependent (抗 GPU 但不抗时序侧信道)
Argon2i: data-independent (抗时序但稍弱抗 GPU)
Argon2id: 混合 (推荐)

参数:

$m$: memory (KB), 推荐 65536+ (64 MB)
$t$: iterations (推荐 3+)
$p$: parallelism (推荐 1-4)

RFC 9106 推荐: Argon2id, m=2 GiB, t=1, p=4。

3.4 bcrypt vs scrypt vs Argon2

算法	年份	Memory	抗 ASIC	推荐?
MD5-Crypt	1994	<1 KB	❌	禁用
PBKDF2	2000	<1 KB	❌	仅 FIPS 合规
bcrypt	1999	4 KB	弱	旧系统
scrypt	2009	配置 (1MB-1GB)	中	可用
Argon2id	2015	配置	强	✅ 主推

四、代码实现 — `merkle.py`

"""
merkle.py - Merkle Tree 完整实现
- Domain-separated 防 second-preimage
- 支持 inclusion proof
- 支持批量 proof (Multi-proof)
- Sparse Merkle Tree
"""
from __future__ import annotations
import hashlib
from typing import List, Tuple, Optional
from dataclasses import dataclass

LEAF_PREFIX = b'\x00'
NODE_PREFIX = b'\x01'

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def hash_leaf(data: bytes) -> bytes:
    return H(LEAF_PREFIX + data)

def hash_node(left: bytes, right: bytes) -> bytes:
    return H(NODE_PREFIX + left + right)

# --------------------------- Standard Merkle Tree ---------------------------

@dataclass
class MerkleProof:
    leaf_index: int
    leaf_data: bytes
    siblings: List[bytes]   # 从底到顶
    directions: List[int]   # 0 = sibling on right, 1 = sibling on left

class MerkleTree:
    """Pad with duplicates to nearest power of 2 (Bitcoin-style).
    Caveat: vulnerable to CVE-2012-2459 (duplicate-tx exploit) for raw use.
    For real systems use RFC 6962 style."""

    def __init__(self, leaves: List[bytes]):
        if not leaves:
            raise ValueError("Empty leaves")
        self.leaves = leaves
        self.tree: List[List[bytes]] = []
        # level 0: leaf hashes
        level = [hash_leaf(leaf) for leaf in leaves]
        self.tree.append(level)
        while len(level) > 1:
            next_level = []
            for i in range(0, len(level), 2):
                left = level[i]
                right = level[i+1] if i+1 < len(level) else level[i]
                next_level.append(hash_node(left, right))
            self.tree.append(next_level)
            level = next_level

    @property
    def root(self) -> bytes:
        return self.tree[-1][0]

    def prove(self, index: int) -> MerkleProof:
        if not (0 <= index < len(self.leaves)):
            raise IndexError(index)
        siblings = []
        directions = []
        idx = index
        for level in self.tree[:-1]:
            sib_idx = idx ^ 1  # XOR 1 = pair
            if sib_idx >= len(level):
                sib_idx = idx  # duplicate
            siblings.append(level[sib_idx])
            directions.append(idx & 1)  # 1 means we are right child
            idx //= 2
        return MerkleProof(
            leaf_index=index,
            leaf_data=self.leaves[index],
            siblings=siblings,
            directions=directions,
        )

def verify_proof(root: bytes, proof: MerkleProof) -> bool:
    h = hash_leaf(proof.leaf_data)
    for sib, dir_ in zip(proof.siblings, proof.directions):
        if dir_ == 0:
            # we are left child, sibling on right
            h = hash_node(h, sib)
        else:
            h = hash_node(sib, h)
    return h == root

# --------------------------- Sparse Merkle Tree ---------------------------

class SparseMerkleTree:
    """SMT with key space = {0, 1}^256. Empty subtrees pre-computed."""

    HEIGHT = 256

    def __init__(self):
        # precompute empty subtree hashes
        self.empty: List[bytes] = [b'\x00' * 32]
        for _ in range(self.HEIGHT):
            self.empty.append(hash_node(self.empty[-1], self.empty[-1]))
        # store: dict mapping (level, index) -> hash; missing = empty
        self.nodes: dict = {}
        self.values: dict = {}  # leaf data per key

    def _get_node(self, level: int, index: int) -> bytes:
        if (level, index) in self.nodes:
            return self.nodes[(level, index)]
        return self.empty[level]

    def update(self, key: bytes, value: bytes) -> None:
        assert len(key) == 32
        # leaf level 0, index = int.from_bytes(key)
        idx = int.from_bytes(key, 'big')
        self.values[key] = value
        # set leaf
        leaf = hash_leaf(value) if value else self.empty[0]
        self.nodes[(0, idx)] = leaf
        # bubble up
        for level in range(self.HEIGHT):
            sib_idx = idx ^ 1
            cur = self._get_node(level, idx)
            sib = self._get_node(level, sib_idx)
            if (idx & 1) == 0:
                parent = hash_node(cur, sib)
            else:
                parent = hash_node(sib, cur)
            idx //= 2
            self.nodes[(level + 1, idx)] = parent

    @property
    def root(self) -> bytes:
        return self._get_node(self.HEIGHT, 0)

    def prove(self, key: bytes) -> List[bytes]:
        """Returns list of 256 sibling hashes (bottom-up)."""
        idx = int.from_bytes(key, 'big')
        siblings = []
        for level in range(self.HEIGHT):
            siblings.append(self._get_node(level, idx ^ 1))
            idx //= 2
        return siblings

def verify_smt_proof(root: bytes, key: bytes, value: Optional[bytes], siblings: List[bytes]) -> bool:
    assert len(siblings) == SparseMerkleTree.HEIGHT
    smt_empty = [b'\x00' * 32]
    for _ in range(SparseMerkleTree.HEIGHT):
        smt_empty.append(hash_node(smt_empty[-1], smt_empty[-1]))
    h = hash_leaf(value) if value else smt_empty[0]
    idx = int.from_bytes(key, 'big')
    for level, sib in enumerate(siblings):
        if (idx & 1) == 0:
            h = hash_node(h, sib)
        else:
            h = hash_node(sib, h)
        idx //= 2
    return h == root

# --------------------------- HMAC-SHA256 ---------------------------

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    B = 64  # SHA-256 block size
    if len(key) > B:
        key = hashlib.sha256(key).digest()
    key = key.ljust(B, b'\x00')
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5c for b in key)
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).digest()

# --------------------------- Self-test ---------------------------

if __name__ == "__main__":
    import hmac as stdlib_hmac

    # 1. Standard Merkle
    leaves = [f"tx_{i}".encode() for i in range(8)]
    tree = MerkleTree(leaves)
    print(f"Root: {tree.root.hex()}")
    for i in range(8):
        p = tree.prove(i)
        assert verify_proof(tree.root, p), f"Proof {i} failed"
    print("Standard Merkle: 8/8 proofs verified")

    # Domain-separation 防 second-preimage 测试
    # 攻击者尝试用 internal node 作 leaf
    fake_leaf = tree.tree[1][0]  # 一个 internal node
    fake_tree = MerkleTree([fake_leaf])  # 单元素 tree
    assert fake_tree.root != tree.tree[1][0], "Domain separation works"

    # 2. Sparse Merkle Tree
    smt = SparseMerkleTree()
    k1 = hashlib.sha256(b"alice").digest()
    k2 = hashlib.sha256(b"bob").digest()
    smt.update(k1, b"100")
    smt.update(k2, b"200")
    proof1 = smt.prove(k1)
    assert verify_smt_proof(smt.root, k1, b"100", proof1)
    print("SMT inclusion proof: PASS")

    # Non-membership
    k3 = hashlib.sha256(b"charlie").digest()
    proof3 = smt.prove(k3)
    assert verify_smt_proof(smt.root, k3, None, proof3)
    print("SMT non-membership proof: PASS")

    # 3. HMAC-SHA256
    key = b"secret"
    msg = b"hello world"
    assert hmac_sha256(key, msg) == stdlib_hmac.new(key, msg, hashlib.sha256).digest()
    print("HMAC-SHA256: matches stdlib")

五、真实漏洞 / 事件

年份	事件	根因
2012	Bitcoin CVE-2012-2459	Merkle tree 重复叶子 → 二次哈希碰撞 → DoS
2017	Equifax 数据泄漏	用户密码用 MD5 哈希存储（无盐）
2018	Cosmos IAVL 历史证明问题	不平衡树高度变化导致历史证明失效
2019	Plasma "exit game" 攻击	Sparse Merkle Tree 边缘情况
2020	LinkedIn 2012 泄漏 (后曝光)	SHA-1 无盐密码 → 95% 在 1 周内破解
2022	Optimism Quixotic NFT bug	Merkle proof 验证缺 leaf prefix

Bitcoin CVE-2012-2459 细节: 给定 1 个有效区块，攻击者复制末尾交易得新树根（因为 BTC pad 重复策略），导致同一 tx 被双重计入 → 节点崩溃。Bitcoin Core 修复：检查相邻交易不重复。

六、协议应用 — Web3 中的 Merkle / HMAC

场景	结构	用途
Bitcoin	Standard Merkle	区块 SPV proof
Ethereum	Modified Patricia Trie	State / Receipt / Tx 树
Cosmos	IAVL Tree	App state
Aptos / Diem	SMT	Account state
Aztec Network	Indexed Merkle Tree	Nullifier set
Tornado Cash	Standard Merkle	Commitment set
Plasma	SMT	Exit tree
Airdrops (Hop, Optimism)	Merkle tree	1 root 上链验证亿条记录
JWT / OIDC	HMAC-SHA-256	签名 access token
AWS S3 SigV4	HMAC-SHA-256	请求签名
WalletConnect	HMAC	会话消息认证

为什么 Tornado Cash 用 Merkle Tree (Pedersen) 而非 SHA-256: 在 ZK 电路中验证 SHA-256 太昂贵，用 Pedersen 哈希 (基于 EC) 仅 ~2k 约束/层。

Ethereum Modified Patricia Trie ≠ 标准 Merkle: 路径长度可变（基于 nibble），同时是 Trie + Merkle，节点带 RLP 编码与值。

七、常见陷阱

Second-preimage attack: 不区分 leaf vs internal hash。永远 domain-separate
重复叶子 bug (CVE-2012-2459): 不平衡树用 duplicate padding → 攻击向量
Proof 接受顺序错误: 左/右兄弟节点位置错 → 验证通过假证明
HMAC 不要用 hash(key||msg): 长度扩展攻击
密码哈希无盐 / 短盐: 同一口令产生同 hash → 彩虹表攻击
Argon2 参数太弱: m < 32 MB 时 GPU 仍可加速 100×
Merkle proof index 错误: 不验证 index 是否在 tree size 范围内 → 越界证明
批量 proof 漏洞: multi-proof 中重复 sibling 路径未去重 → 资源耗尽 DoS

八、关键速查

Merkle 构造对比

类型	高度	Inclusion	Non-membership	用例
Standard Merkle	$\log_2 N$	✓	✗	Tx 列表
Sparse Merkle Tree	256 (key bits)	✓	✓	State
Indexed Merkle Tree	$\log_2 N$	✓	✓	Nullifier
Patricia Trie	可变	✓	✓	Eth state
Verkle Tree	$\log_k N$, $k=256$	✓	✓	Eth stateless

密码哈希参数推荐 (2026)

算法	推荐参数	时间
Argon2id	m=64MB, t=3, p=4	~50 ms
Argon2id (强)	m=1GB, t=4, p=4	~1 s
scrypt	N=2^17, r=8, p=1	~100 ms
bcrypt	cost=12	~250 ms

九、面试题

Q1: Solidity 中验证 airdrop Merkle proof 的安全要点？ A: (1) 必须 domain-separate leaf vs node（OpenZeppelin MerkleProof 用 sorted pair 简化）；(2) 验证后立即标记已领取，防止重放；(3) leaf 必须包含 (address, amount) 而非仅 amount；(4) root 不可变更（除非有时间锁）；(5) 注意 Solidity keccak256 和后端生成树用同一 hash + padding 规则。

Q2: Sparse Merkle Tree 在 ZK rollup 中如何用？ A: 维护 nullifier 集合。每次花费 note 时:(1) 验证 nullifier 不在 SMT；(2) 把 nullifier 插入 SMT；(3) 在电路内验证 inclusion + non-membership proof。SMT 高度固定（如 32），方便电路约束数固定。

Q3: HMAC vs Keccak-MAC？ A: HMAC 是为 Merkle-Damgård hash 设计的（SHA-2 必需）。Keccak/SHA-3 是 Sponge，天然抗长度扩展，可直接用 Keccak(key || msg) 作 MAC（即 KMAC）。但实践中常用 HMAC-SHA-256（标准化、广泛实现），少用 KMAC（除非走 NIST 合规路径）。

Q4: Argon2 与 scrypt 哪个更好？为什么？ A: Argon2 是 PHC 2015 winner，胜出原因: (1) 抗 tradeoff cryptanalysis（scrypt 有 1/4 内存仅 1.7× 时间的攻击）；(2) data-independent 模式抗时序侧信道；(3) 参数更灵活。scrypt 仍可用，但新系统首选 Argon2id。

Q5: Ethereum 为什么不用标准 Merkle 而用 Patricia Trie？ A: 标准 Merkle 树要求所有 key 已知，新增 key 需重建。Patricia Trie 是按 nibble 存储的 key-value 树，天然支持 sparse + 增删 + lookup 全在 $O(\log N)$。代价: trie 节点更复杂（branch/extension/leaf 三种），proof 更大。Verkle 是其继任者（更小 proof）。

十、明日预告

Day 198: 数字签名 — 系统对比 RSA-PSS / ECDSA / EdDSA / Schnorr，重点理解每个的安全模型 + 设计 trade-off。手写 secp256k1 上的 Schnorr 签名（BIP-340 兼容），与 EdDSA 进行实测对比。