Expert Day 167

Eval System (Part 2): LLM-as-Judge Design and Debiasing


Date: 2026-10-15
Track: AI Systems Engineering / Eval / Quality
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #LLMJudge #Bias #Calibration #Eval #PairwiseComparison


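Today's Goals

| Type | Content |
|---|---|
| Learn | When an LLM judge is needed; common biases (position / length / self-preference / verbosity); pairwise vs single grading; calibration and inter-judge agreement |
| Hands-on | Implement a pairwise judge with position-flip debiasing, measure agreement with Cohen's κ, and calibrate against deterministic ground truth |
| Deliverable | docs/ai-infra/judge.py: complete judge implementation plus a bias test report |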

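1. Core Concepts

1.1 Where the LLM judge fits

What deterministic checks cannot cover:

  • Semantic correctness: an answer can be correct yet phrased in many different ways
  • Instruction following: whether every sub-question was answered
  • Style / tone: whether the answer is professional and well organized
  • Rubric-based scoring: compliance level 1-5, professionalism 1-5
  • Pairwise comparison: which of A and B is better

A judge is not a silver bullet, though; you have to know its biases.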

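1.2 Common LLM-judge biases (memorize these)

| Bias | Meaning | Mitigation |
|---|---|---|
| Position bias | The verdict changes depending on whether an answer is shown first or second | Flip the two positions and average the two runs |
| Length bias | Longer answers tend to receive higher scores | Rubric stresses information density and penalizes unnecessary length |
| Self-preference | A GPT judge prefers GPT-style answers | Judge with a different model, or use a multi-model ensemble |
| Verbosity bias | A long judge prompt makes the judge easier to sway | Short rubric / few-shot calibration |
| Anchoring | After seeing a high score for A, the judge tends to score B high too | Independence: score each answer separately, not side by side |
| Authority bias | Claims framed as "an authority says so" are over-rated | Rubric explicitly forbids treating authority as evidence |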
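Length bias can be probed offline before changing the rubric. The sketch below is a minimal check, assuming verdicts have already been collected as `(answer_a, answer_b, winner)` tuples; `longer_answer_win_rate` is a hypothetical helper, not part of judge.py.

```python
from typing import Iterable, Tuple


def longer_answer_win_rate(verdicts: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of decisive verdicts in which the longer answer won.

    Each verdict is (answer_a, answer_b, winner), winner in {"A", "B", "tie"}.
    Ties and equal-length pairs are skipped: they carry no length signal.
    """
    longer_wins = 0
    decisive = 0
    for answer_a, answer_b, winner in verdicts:
        if winner == "tie" or len(answer_a) == len(answer_b):
            continue
        decisive += 1
        longer = "A" if len(answer_a) > len(answer_b) else "B"
        if winner == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else 0.0


sample = [
    ("short", "a much longer answer", "B"),    # longer answer wins
    ("a much longer answer", "short", "A"),    # longer answer wins
    ("short", "a much longer answer", "A"),    # shorter answer wins
    ("same", "size", "A"),                     # equal length, skipped
    ("short", "a much longer answer", "tie"),  # tie, skipped
]
print(f"longer answer wins {longer_answer_win_rate(sample):.0%} of decisive verdicts")
```

A high rate alone does not prove bias, since longer answers are sometimes genuinely better; the number becomes meaningful when compared with the same rate computed from human labels on the same cases.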

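1.3 Single grading vs Pairwise

| Method | Pros | Cons |
|---|---|---|
| Single (absolute score 1-5) | Easy to track score changes over time | Hard to calibrate (is yesterday's 4 the same as today's 4?) |
| Pairwise (A vs B) | Robust, more stable | Comparing N candidates against each other costs on the order of N×N judgments |
| Pointwise + reference | Compares against a "gold answer" | Needs high-quality references |

Practical recommendation: use pairwise in CI (new prompt vs old prompt), and single + reference for production monitoring.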
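To make the N×N cost concrete: an all-pairs comparison of N systems needs N(N-1)/2 pairs per case, and position flipping doubles the number of judge calls. A small sketch of that arithmetic (`pairwise_judge_calls` is an illustrative helper, and the 200-case dataset size is an assumed example):

```python
def pairwise_judge_calls(n_systems: int, n_cases: int, flip: bool = True) -> int:
    """Judge calls needed to compare every pair of systems on every case."""
    pairs = n_systems * (n_systems - 1) // 2   # unordered pairs of systems
    calls_per_pair = 2 if flip else 1          # (A,B) and (B,A) when flipping
    return pairs * n_cases * calls_per_pair


for n in (2, 3, 5, 10):
    print(f"{n} systems x 200 cases -> {pairwise_judge_calls(n, 200)} judge calls")
```

With two systems (the new-vs-old CI case) the cost stays linear in the dataset size, which is why pairwise is practical there but quickly gets expensive for broad multi-model comparisons.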

1.4 Calibration

Align the judge's output with deterministic ground truth:

  1. Take 100 cases with known ground truth (pass/fail)
  2. Run the LLM judge
  3. Compute Precision / Recall / F1 / Cohen's κ
  4. Adjust the rubric / few-shot examples until κ > 0.7
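Step 3 can be worked through on a confusion matrix. The sketch below assumes a binary pass/fail setting and made-up counts (50 / 5 / 10 / 35); `binary_calibration` is an illustrative helper. It shows that 85% accuracy can still leave κ just under the 0.7 target, because κ discounts agreement expected by chance.

```python
def binary_calibration(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision / recall / F1 / Cohen's kappa for judge vs ground truth (pass = positive)."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    p_o = (tp + tn) / n                                    # observed agreement
    p_e = ((tp + fp) / n) * ((tp + fn) / n) \
        + ((fn + tn) / n) * ((fp + tn) / n)                # chance agreement
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"accuracy": p_o, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}


# Hypothetical run: 100 cases, 60 truly pass, judge says pass on 55 of them
report = binary_calibration(tp=50, fp=5, fn=10, tn=35)
print({k: round(v, 3) for k, v in report.items()})
print("meets kappa > 0.7 target:", report["kappa"] > 0.7)
```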

2. Production Architecture Diagram

   Eval Dataset
        │
        ▼
   ┌───────────────────────────────┐
   │   Candidate System (LLM A)    │  ← your product
   │   Reference System (LLM B)    │  ← previous version / competitor
   └───────────────────────────────┘
        │ output_A, output_B
        ▼
   ┌───────────────────────────────┐
   │  Pairwise Judge (claude-opus) │
   │  - judges (A,B) and (B,A)     │
   │  - consensus, else tie        │
   │  - rubric: accuracy/relevance/│
   │    concision/compliance       │
   └───────────────────────────────┘
        │ winner: A / B / tie
        ▼
   ┌───────────────────────────────┐
   │   Aggregator                  │
   │   - win-rate                  │
   │   - by-category breakdown     │
   │   - position-flip stability   │
   └───────────────────────────────┘
        │
        ▼
   Human review (10% sample for calibration)
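The Aggregator box can be sketched as a pure function over judge results. This is a minimal version under assumptions: each result carries a `category` field (the `PairwiseCase` used later in these notes has no such field, so it would need to be added), and "position-flip stability" is taken to mean the share of cases where both orders agreed.

```python
from collections import Counter, defaultdict


def aggregate(results: list[dict]) -> dict:
    """results: [{"category": str, "winner": "A"|"B"|"tie", "consistent": bool}, ...]"""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    overall = Counter(r["winner"] for r in results)
    per_category = defaultdict(Counter)
    for r in results:
        per_category[r["category"]][r["winner"]] += 1
    return {
        "n": n,
        "win_rate_a": overall["A"] / n,          # candidate system wins
        "tie_rate": overall["tie"] / n,
        "flip_stability": sum(1 for r in results if r["consistent"]) / n,
        "by_category": {
            cat: {"n": sum(c.values()), "win_rate_a": c["A"] / sum(c.values())}
            for cat, c in per_category.items()
        },
    }


sample = [
    {"category": "kyc", "winner": "A", "consistent": True},
    {"category": "kyc", "winner": "A", "consistent": True},
    {"category": "credit", "winner": "tie", "consistent": False},
    {"category": "credit", "winner": "B", "consistent": True},
]
summary = aggregate(sample)
print(summary["win_rate_a"], summary["flip_stability"], summary["by_category"])
```

The by-category breakdown matters because an overall win rate can hide a regression concentrated in one category.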

3. Code Implementation

3.1 Pairwise judge with position flip

"""judge.py: complete LLM-as-judge implementation with position-flip debiasing and a κ test"""
from __future__ import annotations
import json
import asyncio
from dataclasses import dataclass
from typing import Literal
from collections import Counter

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
JUDGE_MODEL = "claude-opus-4-7"

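JUDGE_RUBRIC = """You are a strict evaluator of LLM outputs in the financial domain. Given two candidate answers, decide which one is better.

Scoring dimensions (by weight):
1. Factual accuracy (30%): any wrong data or contradiction with known facts
2. Compliance and regulatory alignment (25%): correct regulations cited, no non-compliant statements
3. Completeness (20%): all sub-questions answered
4. Conciseness and information density (15%): no irrelevant filler
5. Professional tone (10%): matches what finance professionals expect

Important rules:
- Do not prefer an answer just because it is longer
- Do not be swayed by self-deprecating hedges such as "As an AI, I cannot be sure" (these usually mean the answer is inadequate)
- If the two answers are close in quality, return tie
- Output the reasoning first, then the final verdict
- Keep the reasoning under 200 words

Output format (strict JSON):
{
  "reasoning": "...",
  "winner": "A" | "B" | "tie",
  "confidence": 0.0-1.0
}
"""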


@dataclass
class PairwiseCase:
    case_id: str
    question: str
    answer_a: str
    answer_b: str


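async def judge_one(case: PairwiseCase, flip: bool = False) -> dict:
    """Judge one pair. With flip=True the two answers swap display positions."""
    if flip:
        a, b = case.answer_b, case.answer_a
    else:
        a, b = case.answer_a, case.answer_b

    user_msg = f"""Question:
{case.question}

Candidate A:
{a}

Candidate B:
{b}

Give your verdict according to the rubric. Output JSON."""

    r = await client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=512,
        system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content": user_msg}],
        temperature=0.0,  # keep the judge as repeatable as possible
    )
    text = r.content[0].text

    # Extract the outermost JSON object from the reply
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError(f"no JSON object in judge output: {text[:200]!r}")
    parsed = json.loads(text[start:end])

    # Schema check: reject illegal values such as winner='C'
    if parsed.get("winner") not in ("A", "B", "tie"):
        raise ValueError(f"illegal winner: {parsed.get('winner')!r}")

    if flip:
        # Map the verdict back to the original A/B labels
        m = {"A": "B", "B": "A", "tie": "tie"}
        parsed["winner"] = m[parsed["winner"]]
    return parsed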


async def judge_with_flip(case: PairwiseCase) -> dict:
    """Run both orders, (A,B) and (B,A), and keep the verdict only if they agree."""
    r1, r2 = await asyncio.gather(judge_one(case, flip=False), judge_one(case, flip=True))
    if r1["winner"] == r2["winner"]:
        return {"winner": r1["winner"], "consistent": True, "details": [r1, r2]}
    else:
        # The two orders disagree: fall back to tie (the conservative choice)
        return {"winner": "tie", "consistent": False, "details": [r1, r2]}


# ────────────────────── Position bias detection ──────────────────────
async def measure_position_bias(cases: list[PairwiseCase]) -> dict:
    """Judge every case as (A,B) and as (B,A) and report how often the winner changes."""
    if not cases:
        raise ValueError("cases must not be empty")
    results_ab = await asyncio.gather(*[judge_one(c, flip=False) for c in cases])
    results_ba = await asyncio.gather(*[judge_one(c, flip=True) for c in cases])

    # judge_one already maps flipped verdicts back to the original labels, so a
    # mismatch here comes from presentation order (or residual sampling noise)
    flips = sum(1 for r1, r2 in zip(results_ab, results_ba) if r1["winner"] != r2["winner"])
    return {
        "n": len(cases),
        "flips": flips,
        "flip_rate": flips / len(cases),
        "ab_distribution": Counter(r["winner"] for r in results_ab),
        "ba_distribution": Counter(r["winner"] for r in results_ba),
    }


# ────────────────────── Cohen's κ ──────────────────────
def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters. -1 ≤ κ ≤ 1; these notes target κ > 0.7 (substantial)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    if n == 0:
        raise ValueError("cannot compute κ on empty ratings")
    categories = sorted(set(rater1) | set(rater2))
    # observed agreement
    p_o = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # expected (chance) agreement from each rater's label frequencies
    p_e = 0
    for c in categories:
        p1 = rater1.count(c) / n
        p2 = rater2.count(c) / n
        p_e += p1 * p2
    if p_e == 1:
        # both raters used one identical label throughout; κ is undefined, report 1.0
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# ────────────────────── Calibration: judge vs ground truth ──────────────────────
async def calibrate_against_truth(cases: list[dict]) -> dict:
    """cases: [{'q':..., 'a_better': True/False, 'answer_a':..., 'answer_b':...}]

    "A" is the positive class. A judge verdict of "tie" never matches the
    ground truth, so ties lower accuracy and κ.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    judges = []
    truths = []
    for c in cases:
        pc = PairwiseCase("x", c["q"], c["answer_a"], c["answer_b"])
        r = await judge_with_flip(pc)
        judges.append(r["winner"])
        truths.append("A" if c["a_better"] else "B")

    tp = sum(1 for j, t in zip(judges, truths) if j == t and j == "A")
    fp = sum(1 for j, t in zip(judges, truths) if j == "A" and t != "A")
    fn = sum(1 for j, t in zip(judges, truths) if t == "A" and j != "A")
    tn = sum(1 for j, t in zip(judges, truths) if j == t and j == "B")

    # max(..., 1) guards against division by zero when a class is absent
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    kappa = cohens_kappa(judges, truths)
    accuracy = (tp + tn) / len(cases)

    return {
        "n": len(cases),
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


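# ────────────────────── Main entry example ──────────────────────
DEMO_CASES = [
    PairwiseCase(
        "kyc_001",
        "The customer provided a passport but no national ID card. How do we complete KYC?",
        answer_a=("Under Article 12 of 《商业银行客户身份识别办法》, foreign customers may use a passport as a valid ID document. "
                  "Additionally required: (1) entry visa; (2) proof of residence (>3 months); (3) proof of a regular domestic address. "
                  "Register it in the system under type 'PASSPORT'."),
        answer_b=("Sure. We generally accept passports. Just have the customer snap a photo and send it over."),
    ),
    PairwiseCase(
        "credit_002",
        "The customer earns CNY 12,000 a month and already pays CNY 6,000 a month on a mortgage. Can they apply for a further CNY 500,000 consumer loan?",
        answer_a=("Monthly debt burden ratio (DBR) = 6000/12000 = 50%, already above the 40% regulatory warning line. "
                  "Adding a CNY 500,000 consumer loan (5-year term at 4.5%, about 9,300 a month) would raise DBR to 127%, far above the limit. "
                  "Recommendation: decline the application, or require the customer to pay down part of the mortgage first."),
        answer_b=("Yes, they can apply, because their monthly income is CNY 12,000."),
    ),
] * 5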


async def demo():
    # Position bias
    bias = await measure_position_bias(DEMO_CASES)
    print(f"Position-flip rate: {bias['flip_rate']:.2%}")
    print(f"AB distribution: {dict(bias['ab_distribution'])}")
    print(f"BA distribution: {dict(bias['ba_distribution'])}")

    # Win rate
    results = await asyncio.gather(*[judge_with_flip(c) for c in DEMO_CASES])
    winners = [r["winner"] for r in results]
    print(f"\nFinal winners: {Counter(winners)}")
    print(f"Consistency:  {sum(r['consistent'] for r in results)} / {len(results)}")

    # Calibration (assumes the ground truth is that A is better in every case).
    # With a single-class ground truth κ is degenerate: 1.0 if the judge always
    # picks A, otherwise 0.0. A real calibration set needs both classes.
    truth_cases = [{"q": c.question, "answer_a": c.answer_a, "answer_b": c.answer_b, "a_better": True} for c in DEMO_CASES]
    cal = await calibrate_against_truth(truth_cases)
    print(f"\nCalibration: accuracy={cal['accuracy']:.2%}, F1={cal['f1']:.2f}, κ={cal['kappa']:.2f}")


if __name__ == "__main__":
    asyncio.run(demo())

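3.2 Multi-judge ensemble (reducing model preference)

"""multi_judge.py: ensemble of different judge models with majority vote"""
import asyncio
import json
from collections import Counter

from anthropic import AsyncAnthropic

from judge import JUDGE_RUBRIC, PairwiseCase  # rubric and case type from section 3.1

client = AsyncAnthropic()
MODELS = {
    "opus":   "claude-opus-4-7",
    "sonnet": "claude-sonnet-4-6",
    # GPT-5 / DeepSeek etc. can be added through their own clients
}


async def judge_with_model(case: PairwiseCase, model_id: str) -> dict:
    """Same prompt as judge_one, but with an explicit judge model."""
    user_msg = (f"Question:\n{case.question}\n\nCandidate A:\n{case.answer_a}\n\n"
                f"Candidate B:\n{case.answer_b}\n\nGive your verdict according to the rubric. Output JSON.")
    r = await client.messages.create(
        model=model_id, max_tokens=512, system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content": user_msg}], temperature=0.0,
    )
    text = r.content[0].text
    return json.loads(text[text.find("{"):text.rfind("}") + 1])


async def ensemble_judge(case: PairwiseCase, models=("opus", "sonnet")) -> dict:
    rs = await asyncio.gather(*[judge_with_model(case, MODELS[m]) for m in models])
    winners = [r["winner"] for r in rs]
    # Majority vote; an even split (e.g. 1-1 with two judges) becomes a tie
    # instead of an arbitrary pick
    top = Counter(winners).most_common()
    final = "tie" if len(top) > 1 and top[0][1] == top[1][1] else top[0][0]
    return {"winner": final, "individual": dict(zip(models, winners)),
            "agreement": top[0][1] / len(rs)}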

3.3 Single-grade with rubric

async def single_grade(question: str, answer: str) -> dict:
    rubric = """Grade the following financial answer on 5 dimensions, each scored independently from 1 to 5 (5 is best).
1. accuracy        : factual accuracy
2. compliance      : compliance and regulatory alignment
3. completeness    : completeness
4. conciseness     : conciseness
5. professionalism : professional tone

Output JSON: {"accuracy":int, "compliance":int, "completeness":int, "conciseness":int, "professionalism":int, "overall_reasoning":str}"""
    r = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=400,
        system=rubric,
        messages=[{"role": "user", "content": f"Question: {question}\n\nAnswer:\n{answer}"}],
        temperature=0,
    )
    text = r.content[0].text
    # Parse the outermost JSON object in the reply
    return json.loads(text[text.find("{"):text.rfind("}") + 1])
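For trend tracking it is often convenient to collapse the five dimension scores into one number. The notes do not say how single-grade dimensions are combined, so the sketch below is an assumption: it reuses the 30/25/20/15/10 weights from the pairwise rubric in section 3.1, and `weighted_overall` is a hypothetical helper.

```python
# Assumed weights, borrowed from the pairwise rubric in section 3.1
WEIGHTS = {
    "accuracy": 0.30,
    "compliance": 0.25,
    "completeness": 0.20,
    "conciseness": 0.15,
    "professionalism": 0.10,
}


def weighted_overall(scores: dict) -> float:
    """Weighted mean of the five 1-5 dimension scores; rejects missing or out-of-range values."""
    for dim in WEIGHTS:
        value = scores.get(dim)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"dimension {dim!r} must be an int in 1..5, got {value!r}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


grade = {"accuracy": 5, "compliance": 4, "completeness": 4,
         "conciseness": 3, "professionalism": 5, "overall_reasoning": "..."}
print(f"overall = {weighted_overall(grade):.2f}")
```

Validating the range here also catches a judge that returns 0, 6, or a string for a dimension, which would otherwise silently skew the trend line.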

4. Cost & Performance Measurements

| Judge configuration | Cost per case | P95 latency | Position-flip rate |
|---|---|---|---|
| claude-haiku-4-5 single | $0.001 | 1.4s | 22% |
| claude-sonnet-4-6 single | $0.005 | 2.1s | 14% |
| claude-opus-4-7 single | $0.045 | 3.5s | 8% |
| claude-opus-4-7 + flip | $0.090 | 6.0s | 0% (consensus enforced) |
| 2-model ensemble (sonnet+opus) | $0.050 | 3.5s (parallel) | 6% |

Calibration (vs human)

  • claude-sonnet-4-6 judge without flip: κ = 0.61
  • With flip: κ = 0.74
  • With flip + ensemble: κ = 0.81

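5. Applications in Finance

  1. Compliance consistency judge: for sales scripts that relationship managers draft with an LLM, the judge checks them against sales compliance rules
  2. Customer-service answer quality: sample 1% of traffic every day, score it with the judge, and monitor the trend
  3. A/B testing prompts: before a new prompt goes live, a pairwise judge compares 200 cases; roll out only if win rate > 55%
  4. Disclosure quality: after investor-education material is generated, the judge assesses whether it is complete and whether it contains prohibited promises
  5. Code review: for financial-system scripts, the judge checks for SQL injection, divide-by-zero, and uncommitted transactions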
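The rollout gate in item 3 can be written down explicitly. The notes do not say how ties are counted, so this sketch assumes the win rate is computed over decisive verdicts only. It also reports a Wilson score lower bound, which is an addition for illustration rather than something the notes prescribe: on roughly 200 cases a point estimate in the high 50s can still be statistically close to 50%.

```python
import math


def win_rate_gate(wins_new: int, wins_old: int, ties: int,
                  threshold: float = 0.55, z: float = 1.96) -> dict:
    """Gate a rollout on the new prompt's win rate over decisive verdicts (ties excluded)."""
    decisive = wins_new + wins_old
    if decisive == 0:
        raise ValueError("no decisive verdicts to evaluate")
    p = wins_new / decisive
    # Wilson score interval, lower bound (z = 1.96 is roughly a 95% interval)
    z2 = z * z
    centre = p + z2 / (2 * decisive)
    margin = z * math.sqrt(p * (1 - p) / decisive + z2 / (4 * decisive ** 2))
    lower = (centre - margin) / (1 + z2 / decisive)
    return {
        "n": decisive + ties,
        "win_rate": p,
        "wilson_lower": lower,
        "passes": p > threshold,   # the gate described in the notes
    }


# Hypothetical outcome of a 200-case comparison
result = win_rate_gate(wins_new=104, wins_old=76, ties=20)
print(f"win rate {result['win_rate']:.1%}, lower bound {result['wilson_lower']:.1%}, "
      f"passes: {result['passes']}")
```

A stricter variant would gate on the lower bound instead of the point estimate; which one to use is a policy decision.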

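6. Production Lessons and Pitfalls

  1. claude-haiku-4-5 is cheap as a judge but its κ is low: in finance, start from sonnet at minimum, and use opus + flip for critical decisions (compliance / risk control)
  2. Position bias of 30%+ is not unusual: flipping is mandatory. I have seen a team skip the flip, and whichever option was shown first always ended up with a ~70% win rate, a serious misjudgment
  3. Length bias is hard to eliminate: even when the rubric repeatedly says "do not score higher for length", the model still leans toward longer answers. The most effective fix is giving short answers a "Compactness Bonus"
  4. Self-preference: when claude judges claude, it tends to score claude higher. When comparing different models, use a third-party judge
  5. Judge errors raise no alert: the judge can output an illegal value such as winner='C', so validate its output against a schema
  6. Judge prompt drift: changing a single word of the rubric can move the win rate by 5%. Version the rubric and record rubric_hash in every eval report
  7. Judges hallucinate too: a judge may cite a non-existent regulation to reject A. State in the rubric: "if unsure, mark tie"
  8. One calibration is not enough: distributions drift, so re-measure κ every month on 100 fresh human-labeled cases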
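Pitfalls 5 and 6 both have small mechanical fixes. The sketch below assumes the verdict format used by the pairwise rubric (`reasoning`, `winner`, `confidence`); `validate_verdict` and `rubric_hash` are hypothetical helpers, and truncating the hash to 12 hex characters is an arbitrary choice.

```python
import hashlib
import json

ALLOWED_WINNERS = {"A", "B", "tie"}


def validate_verdict(raw: str) -> dict:
    """Parse a judge reply and raise ValueError on any schema violation."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(raw[start:end + 1])   # JSONDecodeError is a ValueError
    if verdict.get("winner") not in ALLOWED_WINNERS:
        raise ValueError(f"illegal winner: {verdict.get('winner')!r}")
    conf = verdict.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence must be a number in [0, 1], got {conf!r}")
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return verdict


def rubric_hash(rubric: str) -> str:
    """Short, stable fingerprint of the rubric text for eval reports."""
    return hashlib.sha256(rubric.encode("utf-8")).hexdigest()[:12]


ok = validate_verdict('Sure. {"reasoning": "A cites the rule", "winner": "A", "confidence": 0.9}')
print(ok["winner"], rubric_hash("rubric v1"), rubric_hash("rubric v2"))
```

Counting how often `validate_verdict` raises, and alerting when that rate rises, turns silent judge failures into a monitored metric.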

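7. Quick Reference

| When | Choice |
|---|---|
| Binary correctness (contains / does not contain X) | Deterministic check, no judge |
| Comparing two versions | pairwise judge + flip |
| Long-term trend monitoring | single grade with reference |
| Multi-model comparison | ensemble judge (none of the compared models may act as the judge) |
| Compliance decisions | judge + flip + human spot check |

| Agreement κ | Interpretation |
|---|---|
| < 0.4 | poor |
| 0.4-0.6 | moderate |
| 0.6-0.8 | substantial |
| > 0.8 | almost perfect |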

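8. Interview Questions

  1. How do you measure and mitigate position bias in LLM-as-judge?

    • Measure: run (A,B) and (B,A) and look at the winner flip rate. Mitigate: run both orders and take the consensus; mark disagreements as tie
  2. Which model should the judge be? Can the system under test also serve as its own judge?

    • No, self-preference is severe. Use a stronger model (for example claude-opus as the judge) or a third-party model; a multi-model ensemble is the most stable
  3. How do you calibrate an LLM judge?

    • Take 100 human-labeled cases, run the judge, and compute precision / recall / F1 / κ; tune the rubric until κ > 0.7; re-measure periodically
  4. How do you choose between pairwise and single grading?

    • Pairwise is robust and suits A/B comparisons; single grading makes absolute trends easy to track but is hard to calibrate. In production: pairwise in CI, single for monitoring
  5. What are the key points of judge design for financial compliance?

    • The rubric explicitly lists the regulations to check against; use a multi-model ensemble; the judge outputs "reasoning" rather than only a winner, for auditability; disagreements are marked tie and decided by a human; the rubric is versioned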

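Tomorrow's Preview

Day 168: Eval System (Part 3): Golden Datasets and Adversarial Testing. Deterministic checks plus a judge are still not enough; you also need good input data. That day's task: build a 100-case golden dataset covering normal, edge, adversarial, and regression cases.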