Expert Day 167

Eval System (2): LLM-as-Judge Design and Debiasing

Date: 2026-10-15
Track: AI Systems Engineering / Eval / Quality
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #LLMJudge #Bias #Calibration #Eval #PairwiseComparison

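Today's Goals

Type	Content
Learn	When an LLM judge is needed; common biases (position / length / self-preference / verbosity); pairwise vs single grading; calibration and inter-judge agreement
Practice	Implement a pairwise judge with position-flip debiasing, measure agreement with Cohen's κ, and calibrate against deterministic ground truth
Deliverable	`docs/ai-infra/judge.py`: full judge implementation plus a bias test report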

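I. Core Concepts

1.1 Where the LLM judge fits

What deterministic checks cannot cover:

Semantic correctness: the answer is "correct" but can be phrased in many ways
Instruction following: whether every sub-question was answered
Style and tone: whether the answer is professional and well organized
Rubric-based scoring: compliance level 1-5, professionalism 1-5
Pairwise comparison: which of A and B is better

But a judge is no silver bullet; you have to know its biases.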

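1.2 Common LLM-judge biases (memorize these)

Bias	Meaning	Mitigation
Position bias	The verdict changes depending on whether A or B is shown first	Run both orders and average, or take the consensus
Length bias	Longer answers tend to score higher	Rubric stresses information density and penalizes unnecessary length
Self-preference	A GPT judge favors GPT-style answers	Judge with a different model, or use a multi-model ensemble
Verbosity bias	Padded, wordy answers score higher without adding information (overlaps with length bias); a long, loosely worded judge prompt makes the judge easier to sway	Short rubric; few-shot calibration
Anchoring	After seeing a high score for A, the judge tends to score B high too	Score independently; do not show scores side by side
Authority bias	Claims framed as "an authority says" are overrated	Rubric explicitly forbids treating authority as evidence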
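
One way to check for the length bias in the table is to measure how often the longer answer wins. The sketch below is a hypothetical helper: the name `longer_answer_win_rate` and the record fields `len_a`, `len_b`, `winner` are illustrative and not part of judge.py. On pairs of comparable quality, a rate well above 50% suggests the judge is rewarding length.

```python
from __future__ import annotations


def longer_answer_win_rate(records: list[dict]) -> float | None:
    """Share of decisive verdicts won by the longer answer.

    Each record is assumed to look like
    {"len_a": int, "len_b": int, "winner": "A" | "B" | "tie"}.
    Ties and equal-length pairs carry no signal and are skipped.
    """
    longer_wins = 0
    decisive = 0
    for r in records:
        if r["winner"] == "tie" or r["len_a"] == r["len_b"]:
            continue
        decisive += 1
        longer = "A" if r["len_a"] > r["len_b"] else "B"
        if r["winner"] == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None


# Made-up verdicts, for illustration only
records = [
    {"len_a": 420, "len_b": 90, "winner": "A"},
    {"len_a": 80, "len_b": 300, "winner": "B"},
    {"len_a": 500, "len_b": 120, "winner": "B"},
    {"len_a": 200, "len_b": 200, "winner": "A"},    # equal length: skipped
    {"len_a": 150, "len_b": 600, "winner": "tie"},  # tie: skipped
]
print(longer_answer_win_rate(records))
```

Here the longer answer wins two of the three decisive pairs. The number only means something on a dataset where length and quality are not already correlated, so it is best run on pairs that humans rated as equal.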

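1.3 Single grading vs pairwise

Method	Pros	Cons
Single (absolute score 1-5)	Score changes are easy to track	Hard to calibrate (is yesterday's 4 the same as today's 4?)
Pairwise (A vs B)	Robust and more stable	Comparing N candidates costs on the order of N×N comparisons
Pointwise + reference	Compared against a reference answer	Needs high-quality references

Practical recommendation: use pairwise in CI (new prompt vs old prompt) and single + reference for production monitoring.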
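
The N×N cost of pairwise grading can be made concrete. The sketch assumes a full round-robin between N candidate systems, where every unordered pair is judged on every dataset item and position-flip doubles the calls; `judge_calls` is an illustrative helper name.

```python
from itertools import combinations


def judge_calls(n_systems: int, n_items: int, flip: bool = True) -> int:
    """Judge calls for a full round-robin: every unordered pair, every item."""
    n_pairs = len(list(combinations(range(n_systems), 2)))  # N * (N - 1) / 2
    return n_pairs * n_items * (2 if flip else 1)


# CI case: new prompt vs old prompt is a single pair
print(judge_calls(2, 200))  # 1 pair * 200 items * 2 orders = 400
# Comparing 5 systems against each other grows quadratically
print(judge_calls(5, 200))  # 10 pairs * 200 items * 2 orders = 4000
```

This is why the CI recommendation fixes the comparison to one pair: the cost stays linear in dataset size.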

1.4 Calibration

Align the judge's output with deterministic ground truth:

Take 100 cases with known ground truth (pass/fail)
Run the LLM judge on them
Compute precision / recall / F1 / Cohen's κ
Adjust the rubric and few-shot examples until κ > 0.7
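
The calibration steps reduce to a 2×2 table of judge verdicts against ground truth. A minimal sketch with made-up counts follows; the numbers and the helper name `binary_agreement` are illustrative, not measurements from this document.

```python
def binary_agreement(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and Cohen's kappa from a 2x2 judge-vs-truth table.

    tp: judge pass, truth pass    fp: judge pass, truth fail
    fn: judge fail, truth pass    tn: judge fail, truth fail
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("empty confusion matrix")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Observed agreement: judge and truth give the same label
    p_o = (tp + tn) / n
    # Chance agreement: product of the marginal rates, summed over both labels
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "p_o": p_o, "p_e": p_e, "kappa": kappa}


# 100 hypothetical cases: 45 + 40 agreements, 10 + 5 disagreements
m = binary_agreement(tp=45, fp=10, fn=5, tn=40)
print({k: round(v, 3) for k, v in m.items()})
```

With these counts raw agreement is 85%, yet κ comes out at about 0.70 because half of that agreement is expected by chance. Such a judge sits right at the κ > 0.7 bar, so the rubric would need another iteration.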

II. Production Architecture

   Eval Dataset
        │
        ▼
   ┌───────────────────────────────┐
   │   Candidate System (LLM A)    │  ← your product
   │   Reference System (LLM B)    │  ← previous version / competitor
   └───────────────────────────────┘
        │ output_A, output_B
        ▼
   ┌───────────────────────────────┐
   │  Pairwise Judge (claude-opus) │
   │  - judge both (A,B) and (B,A) │
   │  - consensus, else tie        │
   │  - rubric: accuracy /         │
   │    relevance / concision /    │
   │    compliance                 │
   └───────────────────────────────┘
        │ winner: A / B / tie
        ▼
   ┌───────────────────────────────┐
   │   Aggregator                  │
   │   - win-rate                  │
   │   - by-category breakdown     │
   │   - position-flip stability   │
   └───────────────────────────────┘
        │
        ▼
   Human review (10% sample for calibration)
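
The Aggregator box is the only stage in the diagram that section 3 does not implement. A possible sketch follows, assuming each result has the shape returned by `judge_with_flip` in section 3.1 plus a `category` tag. The `category` field and the function name `aggregate` are assumptions for illustration.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def aggregate(results: list[dict]) -> dict:
    """Summarise flip-debiased verdicts.

    Each result is assumed to look like
    {"category": str, "winner": "A" | "B" | "tie", "consistent": bool}.
    """
    if not results:
        raise ValueError("no results to aggregate")

    def win_rate(rows: list[dict]) -> dict:
        c = Counter(r["winner"] for r in rows)
        n = len(rows)
        return {"A": c["A"] / n, "B": c["B"] / n, "tie": c["tie"] / n, "n": n}

    by_cat: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r)

    return {
        "overall": win_rate(results),
        "by_category": {cat: win_rate(rows) for cat, rows in sorted(by_cat.items())},
        # Share of cases where (A,B) and (B,A) produced the same verdict
        "flip_stability": sum(r["consistent"] for r in results) / len(results),
    }


# Made-up verdicts, for illustration only
demo_results = [
    {"category": "kyc", "winner": "A", "consistent": True},
    {"category": "kyc", "winner": "tie", "consistent": False},
    {"category": "credit", "winner": "A", "consistent": True},
    {"category": "credit", "winner": "B", "consistent": True},
]
print(aggregate(demo_results))
```

Reporting ties separately matters here: because inconsistent verdicts are forced to tie, a high tie rate with low flip stability points at the judge, not at the candidate systems.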

III. Implementation

3.1 Pairwise judge with position flip

"""judge.py: full LLM-as-judge implementation with position-flip debiasing and a Cohen's κ check"""
from __future__ import annotations
import json
import asyncio
from dataclasses import dataclass
from collections import Counter

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
JUDGE_MODEL = "claude-opus-4-7"

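JUDGE_RUBRIC = """You are a strict evaluator of LLM outputs in the financial domain. Given two candidate answers, decide which one is better.

Scoring dimensions (by weight):
1. Factual accuracy (30%): any wrong data or contradiction with known facts
2. Compliance and regulatory alignment (25%): whether the right regulations are cited and whether any statement is non-compliant
3. Completeness (20%): whether every sub-question is answered
4. Concision and information density (15%): no irrelevant filler
5. Professional tone (10%): whether it meets the expectations of finance professionals

Important rules:
- Do not prefer an answer because it is longer
- Do not be swayed by self-deprecating hedges such as "As an AI, I cannot be sure" (they usually mean the answer is incomplete)
- If the two answers are close in quality, call it a tie
- Output the reasoning first, then the final verdict
- Keep the reasoning under 200 words

Output format (strict JSON):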
{
  "reasoning": "...",
  "winner": "A" | "B" | "tie",
  "confidence": 0.0-1.0
}
"""


@dataclass
class PairwiseCase:
    case_id: str
    question: str
    answer_a: str
    answer_b: str


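async def judge_one(case: PairwiseCase, flip: bool = False) -> dict:
    """Judge one pair. With flip=True the answers swap slots and the verdict is mapped back."""
    if flip:
        a, b = case.answer_b, case.answer_a
    else:
        a, b = case.answer_a, case.answer_b

    user_msg = f"""Question:
{case.question}

Candidate A:
{a}

Candidate B:
{b}

Judge according to the rubric and output JSON."""

    r = await client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=512,
        system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content": user_msg}],
        temperature=0.0,  # the judge must run at temperature 0 so verdicts are as repeatable as possible
    )
    text = r.content[0].text

    # Extract the JSON object from the reply
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError(f"judge reply contains no JSON object: {text!r}")
    parsed = json.loads(text[start:end])

    # Schema check: reject illegal verdicts such as winner='C'
    if parsed.get("winner") not in ("A", "B", "tie"):
        raise ValueError(f"illegal winner in judge reply: {parsed.get('winner')!r}")

    if flip:
        # Map the verdict back to the original labels
        m = {"A": "B", "B": "A", "tie": "tie"}
        parsed["winner"] = m[parsed["winner"]]
    return parsed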


async def judge_with_flip(case: PairwiseCase) -> dict:
    """Run both orders, (A,B) and (B,A), and take the consensus."""
    r1, r2 = await asyncio.gather(judge_one(case, flip=False), judge_one(case, flip=True))
    if r1["winner"] == r2["winner"]:
        return {"winner": r1["winner"], "consistent": True, "details": [r1, r2]}
    else:
        # Disagreement -> tie (the conservative choice)
        return {"winner": "tie", "consistent": False, "details": [r1, r2]}


# ────────────────────── Position bias measurement ──────────────────────
async def measure_position_bias(cases: list[PairwiseCase]) -> dict:
    """Run both orders, (A,B) and (B,A), and report how often the winner flips."""
    results_ab = await asyncio.gather(*[judge_one(c, flip=False) for c in cases])
    results_ba = await asyncio.gather(*[judge_one(c, flip=True) for c in cases])

    # judge_one(flip=True) already maps its verdict back to the original labels,
    # so a mismatch here reflects sensitivity to presentation order.
    flips = 0
    for r1, r2 in zip(results_ab, results_ba):
        if r1["winner"] != r2["winner"]:
            flips += 1
    return {
        "n": len(cases),
        "flips": flips,
        "flip_rate": flips / max(len(cases), 1),  # avoid ZeroDivisionError on an empty list
        "ab_distribution": Counter(r["winner"] for r in results_ab),
        "ba_distribution": Counter(r["winner"] for r in results_ba),
    }


# ────────────────────── Cohen's κ ──────────────────────
def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters. -1 ≤ κ ≤ 1; 0.6-0.8 reads as substantial, and this document targets κ > 0.7."""
    assert len(rater1) == len(rater2) and rater1, "raters must be non-empty and of equal length"
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    # observed agreement
    p_o = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # expected (random) agreement
    p_e = 0
    for c in categories:
        p1 = rater1.count(c) / n
        p2 = rater2.count(c) / n
        p_e += p1 * p2
    if p_e == 1:
        # Both raters used one identical label throughout; κ is undefined, so report perfect agreement
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# ────────────────────── Calibration: judge vs ground truth ──────────────────────
async def calibrate_against_truth(cases: list[dict]) -> dict:
    """cases: [{'q':..., 'a_better': True/False, 'answer_a':..., 'answer_b':...}]"""
    judges = []
    truths = []
    for i, c in enumerate(cases):
        pc = PairwiseCase(f"cal_{i}", c["q"], c["answer_a"], c["answer_b"])
        r = await judge_with_flip(pc)
        judges.append(r["winner"])
        truths.append("A" if c["a_better"] else "B")

    # "A is better" is the positive class. Ground truth is always A or B, so a
    # "tie" verdict never matches it: it is a false negative when truth is A
    # and lowers accuracy when truth is B.
    tp = sum(1 for j, t in zip(judges, truths) if j == t and j == "A")
    fp = sum(1 for j, t in zip(judges, truths) if j == "A" and t != "A")
    fn = sum(1 for j, t in zip(judges, truths) if t == "A" and j != "A")
    tn = sum(1 for j, t in zip(judges, truths) if j == t and j == "B")

    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    kappa = cohens_kappa(judges, truths)
    accuracy = (tp + tn) / len(cases)

    return {
        "n": len(cases),
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


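# ────────────────────── Entry-point example ──────────────────────
DEMO_CASES = [
    PairwiseCase(
        "kyc_001",
        "The customer provided a passport but has no national ID card. How do we complete KYC?",
        answer_a=("Under Article 12 of the Measures for Customer Identity Verification by Commercial Banks, "
                  "foreign customers may use a passport as a valid identity document. "
                  "Also required: (1) entry visa; (2) proof of residence (> 3 months); (3) proof of a domestic residential address. "
                  "Register the document in the system as type 'PASSPORT'."),
        answer_b=("Yes. We generally accept passports. Just have the customer take a photo and send it over."),
    ),
    PairwiseCase(
        "credit_002",
        "The customer earns 12,000 yuan a month and already pays 6,000 yuan a month on a mortgage. "
        "Can they apply for another 500,000 yuan consumer loan?",
        answer_a=("Monthly debt-to-income ratio (DBR) = 6000/12000 = 50%, already above the 40% regulatory warning line. "
                  "Adding a 500,000 yuan consumer loan (5-year term at 4.5%, about 9,300 per month) raises DBR to roughly 127%, far above the limit. "
                  "Recommendation: decline the application, or require the customer to pay down part of the mortgage first."),
        answer_b=("Yes, they can apply, because the monthly income is 12,000."),
    ),
] * 5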


async def demo():
    # Position bias
    bias = await measure_position_bias(DEMO_CASES)
    print(f"Position-flip rate: {bias['flip_rate']:.2%}")
    print(f"AB distribution: {dict(bias['ab_distribution'])}")
    print(f"BA distribution: {dict(bias['ba_distribution'])}")

    # Win rate
    results = await asyncio.gather(*[judge_with_flip(c) for c in DEMO_CASES])
    winners = [r["winner"] for r in results]
    print(f"\nFinal winners: {Counter(winners)}")
    print(f"Consistency:  {sum(r['consistent'] for r in results)} / {len(results)}")

    # Calibration (assumed ground truth: A is better in every case)
    truth_cases = [{"q": c.question, "answer_a": c.answer_a, "answer_b": c.answer_b, "a_better": True} for c in DEMO_CASES]
    cal = await calibrate_against_truth(truth_cases)
    print(f"\nCalibration: accuracy={cal['accuracy']:.2%}, F1={cal['f1']:.2f}, κ={cal['kappa']:.2f}")


if __name__ == "__main__":
    asyncio.run(demo())

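3.2 Multi-judge ensemble (reducing self-preference)

"""multi_judge.py: ensemble of different judge models with majority vote"""
import asyncio
import json
from collections import Counter

# Assumes the 3.1 listing is importable as judge.py
from judge import JUDGE_RUBRIC, PairwiseCase, client

# Short name -> model id. GPT-5 / DeepSeek etc. can be added through their own clients.
JUDGE_MODELS = {
    "opus": "claude-opus-4-7",
    "sonnet": "claude-sonnet-4-6",
}


async def judge_with_model(case: PairwiseCase, model_id: str) -> dict:
    """Same prompt as judge_one in 3.1, with the judge model passed in explicitly."""
    user_msg = (f"Question:\n{case.question}\n\nCandidate A:\n{case.answer_a}\n\n"
                f"Candidate B:\n{case.answer_b}\n\nJudge according to the rubric and output JSON.")
    r = await client.messages.create(
        model=model_id, max_tokens=512, system=JUDGE_RUBRIC, temperature=0.0,
        messages=[{"role": "user", "content": user_msg}],
    )
    text = r.content[0].text
    return json.loads(text[text.find("{"):text.rfind("}") + 1])


async def ensemble_judge(case: PairwiseCase, models=("opus", "sonnet")) -> dict:
    rs = await asyncio.gather(*[judge_with_model(case, JUDGE_MODELS[m]) for m in models])
    winners = [r["winner"] for r in rs]
    # Majority vote; a split vote with no strict majority becomes a tie
    top, count = Counter(winners).most_common(1)[0]
    final = top if count > len(rs) / 2 else "tie"
    return {"winner": final, "individual": dict(zip(models, winners)),
            "agreement": count / len(rs)}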

3.3 Single-grade with rubric

async def single_grade(question: str, answer: str) -> dict:
    """Grade one answer on five independent 1-5 dimensions (reuses `client` and `json` from judge.py)."""
    rubric = """Evaluate the financial answer below. Score each of the 5 dimensions independently from 1 to 5 (5 is best).
1. accuracy        : factual accuracy
2. compliance      : compliance and regulatory alignment
3. completeness    : completeness
4. conciseness     : concision
5. professionalism : professional tone

Output JSON: {"accuracy":int, "compliance":int, "completeness":int, "conciseness":int, "professionalism":int, "overall_reasoning":str}"""
    r = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=400,
        system=rubric,
        messages=[{"role": "user", "content": f"Question: {question}\n\nAnswer:\n{answer}"}],
        temperature=0,
    )
    # Extract the JSON object from the reply
    text = r.content[0].text
    return json.loads(text[text.find("{"):text.rfind("}") + 1])
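
The single-grade rubric returns five independent scores and no overall number. If one trend line is needed for monitoring, one option is to reuse the weights from the pairwise rubric in 3.1 (30/25/20/15/10). That reuse is an assumption of this sketch, since the text does not say how the dimensions should be combined. `overall_score` is an illustrative name.

```python
# Weights borrowed from the pairwise rubric in 3.1 (an assumption, see above)
WEIGHTS = {
    "accuracy": 0.30,
    "compliance": 0.25,
    "completeness": 0.20,
    "conciseness": 0.15,
    "professionalism": 0.10,
}


def overall_score(grades: dict) -> float:
    """Weighted mean of the five 1-5 dimension scores; rejects malformed grades."""
    for dim in WEIGHTS:
        score = grades.get(dim)
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim}: expected an int in 1..5, got {score!r}")
    return sum(WEIGHTS[dim] * grades[dim] for dim in WEIGHTS)


# Hypothetical judge output, for illustration only
grades = {"accuracy": 5, "compliance": 4, "completeness": 4,
          "conciseness": 3, "professionalism": 5, "overall_reasoning": "..."}
print(round(overall_score(grades), 2))
```

Keeping the per-dimension scores in the report alongside the weighted number makes it possible to see which dimension moved when the trend changes.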

IV. Cost & Performance: Measured Data

Judge configuration	Cost per item	P95 latency	Position-flip rate
claude-haiku-4-5 single	$0.001	1.4s	22%
claude-sonnet-4-6 single	$0.005	2.1s	14%
claude-opus-4-7 single	$0.045	3.5s	8%
claude-opus-4-7 + flip	$0.090	6.0s	0% (consensus enforced)
2-model ensemble (sonnet+opus)	$0.050	3.5s (parallel)	6%
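
The per-item costs in the table translate directly into run budgets. The sketch below uses the table's figures as-is; the configuration keys, the helper names, and the 50,000 requests/day traffic figure are hypothetical.

```python
# Per-item judge cost in USD, copied from the table above
COST_PER_ITEM = {
    "haiku_single": 0.001,
    "sonnet_single": 0.005,
    "opus_single": 0.045,
    "opus_flip": 0.090,
    "sonnet_opus_ensemble": 0.050,
}


def run_cost(config: str, n_items: int) -> float:
    """Cost of judging n_items once with the given configuration."""
    return COST_PER_ITEM[config] * n_items


def monthly_cost(config: str, daily_traffic: int, sample_rate: float, days: int = 30) -> float:
    """Cost of judging a fixed sample of daily traffic for a month."""
    sampled = round(daily_traffic * sample_rate)
    return run_cost(config, sampled) * days


# One CI run: 200 pairwise cases with opus + flip
print(f"CI run: ${run_cost('opus_flip', 200):.2f}")
# Monitoring: 1% of a hypothetical 50,000 requests/day with sonnet
print(f"Monitoring: ${monthly_cost('sonnet_single', 50_000, 0.01):.2f} / month")
```

At these prices the strongest configuration is affordable for gated CI runs, while continuous monitoring is where the cheaper single-pass judge pays off.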

Calibration (vs human):

claude-sonnet-4-6 judge without flip: κ = 0.61
With flip: κ = 0.74
With flip + ensemble: κ = 0.81

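V. Applications in Finance

Compliance consistency judge: for sales scripts that relationship managers draft with an LLM, the judge checks them against sales compliance rules
Customer service answer quality: sample 1% of daily traffic, have the judge score it, and monitor the trend
A/B testing prompts: before a new prompt ships, a pairwise judge compares 200 cases; ramp up only if win rate > 55%
Disclosure quality: after investor education material is generated, the judge checks whether it is complete and free of prohibited promises
Code review: for financial system scripts, the judge checks for SQL injection, divide-by-zero, and uncommitted transactions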
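
The A/B gate in the list (200 pairwise cases, win rate > 55%) leaves two things open: how ties count, and whether 55% on 200 cases is distinguishable from noise. The sketch below makes two assumptions that are not in the text. Ties are excluded from the win rate, and a Wilson score lower bound is reported next to the point estimate. The function names are illustrative.

```python
import math


def wilson_lower_bound(wins: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def ci_gate(winners: list[str], threshold: float = 0.55) -> dict:
    """winners holds one verdict per case; "A" is the new prompt, "B" the old one."""
    wins = winners.count("A")
    losses = winners.count("B")
    decisive = wins + losses  # ties are excluded (assumption)
    win_rate = wins / decisive if decisive else 0.0
    return {
        "win_rate": win_rate,
        "lower_bound": wilson_lower_bound(wins, decisive),
        "ties": winners.count("tie"),
        "ship": win_rate > threshold,
    }


# Made-up verdicts for 200 cases: 100 wins, 70 losses, 30 ties
print(ci_gate(["A"] * 100 + ["B"] * 70 + ["tie"] * 30))
```

In this example the point estimate clears 55% while the lower bound sits closer to 51%, so a stricter team could gate on the lower bound instead of the raw win rate.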

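VI. Production Lessons and Pitfalls

claude-haiku-4-5 as judge is cheap but κ is low: in finance, start from sonnet at minimum, and use opus + flip for critical decisions (compliance / risk control)
Position bias of 30%+ is not rare: flipping is mandatory. I have seen a team skip the flip, and whichever option was shown first always had a 70% win rate, a serious misjudgment
Length bias is hard to eliminate: the rubric can repeat "do not score higher for length" and the model still leans long. The most effective fix is a "Compactness Bonus" for short answers
Self-preference: a claude judge scoring claude output tends to rate it higher. When comparing different models, use a third-party judge
Judge errors raise no alarm: the judge can output an illegal value such as winner='C', so validate the output against a schema
Judge prompt drift: changing one word in the rubric can move the win rate by 5%. Version the rubric and record rubric_hash in every eval report
The judge hallucinates too: it may cite a regulation that does not exist to reject A. Write "if unsure, mark tie" into the rubric
One calibration is not enough: distributions drift, so re-measure κ every month on 100 fresh human-labeled cases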
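
Two of the pitfalls above, illegal judge output and rubric drift, have small mechanical fixes. A minimal sketch using only the standard library follows. `validate_verdict` and the 12-character truncation in `rubric_hash` are illustrative choices, and the schema mirrors the JSON format requested by the rubric in 3.1.

```python
import hashlib
import json

ALLOWED_WINNERS = {"A", "B", "tie"}


def validate_verdict(raw: str) -> dict:
    """Parse a judge reply and reject anything outside the expected schema."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object in judge reply")
    verdict = json.loads(raw[start:end])  # JSONDecodeError is a ValueError
    if verdict.get("winner") not in ALLOWED_WINNERS:
        raise ValueError(f"illegal winner: {verdict.get('winner')!r}")
    conf = verdict.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"illegal confidence: {conf!r}")
    return verdict


def rubric_hash(rubric: str) -> str:
    """Short, stable fingerprint of the rubric text for eval reports."""
    return hashlib.sha256(rubric.encode("utf-8")).hexdigest()[:12]


ok = validate_verdict('Here is my verdict: {"reasoning": "A cites the rule", "winner": "A", "confidence": 0.9}')
print(ok["winner"], rubric_hash("rubric v1"), rubric_hash("rubric v2"))
```

Raising on a bad verdict lets the caller count and alert on judge failures instead of silently folding them into the win rate. Storing the hash with every report makes two runs comparable only when their rubrics match.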

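VII. Quick Reference

When	Choice
Binary correctness (contains / does not contain X)	deterministic; do not use a judge
Comparing two versions	pairwise judge + flip
Long-term trend monitoring	single grade with reference
Comparing multiple models	ensemble judge (none of the compared models may act as judge)
Compliance decisions	judge + flip + human spot check

Agreement κ	Interpretation
< 0.4	poor
0.4-0.6	moderate
0.6-0.8	substantial
> 0.8	almost perfect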

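VIII. Interview Questions

How do you measure and mitigate position bias in LLM-as-judge?
- Measure: run (A,B) and (B,A) and look at the winner flip rate. Mitigate: run both orders and take the consensus; mark disagreements as tie
Which model should the judge be? Can the system under test also serve as the judge?
- No, self-preference is severe. Use a stronger model (claude-opus as judge) or a third-party model; a multi-model ensemble is the most stable
How do you calibrate an LLM judge?
- Take 100 human-labeled cases, run the judge, compute precision/recall/F1/κ; tune the rubric until κ > 0.7; re-measure regularly
How do you choose between pairwise and single grading?
- Pairwise is robust and suits A/B tests; single tracks absolute trends easily but is hard to calibrate. In production: pairwise in CI, single for monitoring
What matters when designing a judge for financial compliance?
- The rubric lists the applicable regulations explicitly; use a multi-model ensemble; have the judge output "reasoning" and not only a winner, for auditability; mark disagreements as tie and send them to a human; version the rubric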

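Coming Up Tomorrow

Day 168: Eval System (3): Golden Datasets and Adversarial Testing. Deterministic checks plus a judge are still not enough; the input data has to be good too. The session builds a 100-case golden dataset covering normal, edge, adversarial, and regression cases.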