Expert Day 167
Eval System (Part 2): LLM-as-Judge Design and Debiasing
2026-10-15
Date: 2026-10-15
Track: AI Systems Engineering / Eval / Quality
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #LLMJudge #Bias #Calibration #Eval #PairwiseComparison
Today's goals
| Type | Content |
|---|---|
| Learn | When an LLM judge is needed; common biases (position / length / self-preference / verbosity); pairwise vs single grading; calibration and inter-judge agreement |
| Practice | Implement a pairwise judge with position-flip debiasing, measure agreement with Cohen's κ, and calibrate against deterministic ground truth |
| Deliverable | docs/ai-infra/judge.py: a complete judge implementation plus a bias test report |
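1. Core concepts
1.1 Where an LLM judge fits
What deterministic checks cannot cover:
- Semantic correctness: an answer can be correct yet phrased in many different ways
- Instruction following: whether every sub-question was answered
- Style and tone: whether the answer is professional and well organized
- Rubric-based scoring: compliance level 1-5, professionalism 1-5
- Pairwise comparison: which of A and B is better
But a judge is no silver bullet; you have to know its biases.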
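1.2 Common LLM-judge biases (memorize these)
| Bias | Meaning | Mitigation |
|---|---|---|
| Position bias | The verdict changes depending on whether an answer is shown in slot A or slot B | Swap the two positions and average the two verdicts |
| Length bias | Longer answers tend to be scored higher | Rubric stresses information density and penalizes unnecessary length |
| Self-preference | A GPT judge prefers GPT-style answers | Judge with a different model, or use a multi-model ensemble |
| Verbosity bias | A long judge prompt makes the judge easier to sway | Short rubric; few-shot calibration |
| Anchoring | After giving A a high score, the judge tends to score B high as well | Independence: score each answer in a separate call instead of side by side |
| Authority bias | Claims framed as "an authority says" get overrated | Rubric explicitly forbids treating authority as evidence |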
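One way to probe the length bias listed above is to check how often the longer answer wins among decided verdicts. The sketch below assumes each record is a hypothetical `(len_a, len_b, winner)` tuple collected from earlier judge runs; the function name and record shape are illustrative and not part of judge.py. A rate well above 50% is only a signal, because longer answers can also be genuinely better, so it is worth comparing against the same rate computed on human labels.

```python
def longer_answer_win_rate(records):
    """Share of decided verdicts that went to the longer answer.

    records: iterable of (len_a, len_b, winner), winner in {"A", "B", "tie"}.
    Ties and equal-length pairs carry no length signal and are skipped.
    Returns None when no record is usable.
    """
    decided = 0
    longer_wins = 0
    for len_a, len_b, winner in records:
        if winner == "tie" or len_a == len_b:
            continue
        decided += 1
        longer_side = "A" if len_a > len_b else "B"
        if winner == longer_side:
            longer_wins += 1
    return longer_wins / decided if decided else None


# Made-up verdicts for illustration only
sample = [(100, 20, "A"), (30, 90, "B"), (50, 80, "A"), (10, 10, "A"), (40, 70, "tie")]
print(longer_answer_win_rate(sample))
```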
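1.3 Single grading vs pairwise
| Method | Pros | Cons |
|---|---|---|
| Single (absolute score 1-5) | Score changes are easy to track | Hard to calibrate (is yesterday's 4 the same as today's 4?) |
| Pairwise (A vs B) | Robust and more stable | Comparing N candidates takes on the order of N×N comparisons, which is costly |
| Pointwise + reference | Compares against a "gold answer" | Needs high-quality references |
Recommended in practice: pairwise in CI (new prompt vs old prompt), single + reference for production monitoring.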
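The cost difference between the two methods can be made concrete by counting judge calls. This is a rough sketch under two assumptions: every unordered pair of systems is compared on every eval case, and position flip doubles the calls. Both helper names are hypothetical.

```python
import math


def pairwise_call_count(n_systems: int, n_cases: int, flip: bool = True) -> int:
    """Judge calls needed to compare every pair of systems on every case."""
    pairs = math.comb(n_systems, 2)  # N * (N - 1) / 2 unordered pairs
    return pairs * n_cases * (2 if flip else 1)


def single_call_count(n_systems: int, n_cases: int) -> int:
    """Judge calls needed when each system is graded on its own."""
    return n_systems * n_cases


for n in (2, 5, 10):
    print(n, pairwise_call_count(n, 200), single_call_count(n, 200))
```

With two systems (new prompt vs old prompt) the pairwise cost stays modest, which fits the CI use case; the quadratic growth only bites when many systems are compared at once.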
1.4 Calibration
Align the judge's output with deterministic ground truth:
- Take 100 cases with known ground truth (pass/fail)
- Run the LLM judge on them
- Compute precision / recall / F1 / Cohen's κ
- Adjust the rubric and few-shot examples until κ > 0.7
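A worked example shows why κ is stricter than raw accuracy. The counts are made up for illustration: out of 100 cases, judge and ground truth both say pass on 40 and both say fail on 45; the judge says pass where the truth is fail on 5, and fail where the truth is pass on 10. The judge therefore says pass 45% of the time and the truth is pass 50% of the time.

```latex
\begin{aligned}
p_o &= \frac{40 + 45}{100} = 0.85 \\
p_e &= \underbrace{0.45 \times 0.50}_{\text{both pass by chance}}
     + \underbrace{0.55 \times 0.50}_{\text{both fail by chance}} = 0.50 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.85 - 0.50}{1 - 0.50} = 0.70
\end{aligned}
```

In this example 85% raw agreement lands exactly on κ = 0.70, which still falls short of the κ > 0.7 target.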
2. Production architecture
Eval Dataset
│
▼
┌───────────────────────────────────┐
│ Candidate System (LLM A)          │ ← your product
│ Reference System (LLM B)          │ ← previous version / competitor
└───────────────────────────────────┘
│ output_A, output_B
▼
┌───────────────────────────────────┐
│ Pairwise Judge (claude-opus)      │
│ - judges both (A,B) and (B,A)     │
│ - takes the consensus of the two  │
│ - rubric: accuracy / relevance /  │
│   concision / compliance          │
└───────────────────────────────────┘
│ winner: A / B / tie
▼
┌───────────────────────────────────┐
│ Aggregator                        │
│ - win-rate                        │
│ - by-category breakdown           │
│ - position-flip stability         │
└───────────────────────────────────┘
│
▼
Human review (10% sample for calibration)
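The Aggregator box can be sketched as a small pure function. This is a minimal illustration, assuming each judged case is a dict with hypothetical keys `category`, `winner` ("A" / "B" / "tie") and `consistent` (whether both orders agreed); win rate is reported for candidate A, and ties count in the denominator.

```python
from collections import Counter, defaultdict


def aggregate(results):
    """Summarize pairwise verdicts: win rate, tie rate, flip stability, per-category win rate."""
    if not results:
        raise ValueError("no results to aggregate")
    by_cat = defaultdict(Counter)
    for r in results:
        by_cat[r["category"]][r["winner"]] += 1
    n = len(results)
    wins_a = sum(c["A"] for c in by_cat.values())
    ties = sum(c["tie"] for c in by_cat.values())
    return {
        "n": n,
        "win_rate_a": wins_a / n,
        "tie_rate": ties / n,
        # share of cases where (A,B) and (B,A) produced the same verdict
        "flip_stability": sum(1 for r in results if r["consistent"]) / n,
        "by_category": {cat: c["A"] / sum(c.values()) for cat, c in sorted(by_cat.items())},
    }


# Made-up verdicts for illustration only
results = [
    {"category": "kyc", "winner": "A", "consistent": True},
    {"category": "kyc", "winner": "tie", "consistent": False},
    {"category": "credit", "winner": "A", "consistent": True},
    {"category": "credit", "winner": "B", "consistent": True},
]
report = aggregate(results)
print(report)
```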
3. Implementation
3.1 Pairwise judge with position flip
"""judge.py: complete LLM-as-judge implementation with position-flip debiasing and a κ test"""
from __future__ import annotations

import asyncio
import json
from collections import Counter
from dataclasses import dataclass

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
JUDGE_MODEL = "claude-opus-4-7"
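JUDGE_RUBRIC = """You are a strict evaluator of LLM outputs in the financial domain. Given two candidate answers, decide which one is better.
Scoring dimensions (by weight):
1. Factual accuracy (30%): any wrong figures or contradictions with known facts
2. Compliance and regulatory alignment (25%): whether the right regulations are cited, any non-compliant statements
3. Completeness (20%): whether every sub-question is answered
4. Concision and information density (15%): no irrelevant filler
5. Professional tone (10%): whether it meets the expectations of finance professionals
Important rules:
- Do not prefer an answer because it is longer
- Do not be swayed by self-deprecating hedges such as "as an AI I cannot be sure" (they usually mean the answer is inadequate)
- If the two answers are close in quality, call it a tie
- Output the reasoning first, then the final verdict
- Keep the reasoning under 200 words
Output format (strict JSON):
{
  "reasoning": "...",
  "winner": "A" | "B" | "tie",
  "confidence": 0.0-1.0
}
"""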
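@dataclass
class PairwiseCase:
    case_id: str
    question: str
    answer_a: str
    answer_b: str


async def judge_one(case: PairwiseCase, flip: bool = False) -> dict:
    """Judge one pair. With flip=True the answers swap slots and the verdict is mapped back."""
    if flip:
        a, b = case.answer_b, case.answer_a
    else:
        a, b = case.answer_a, case.answer_b
    user_msg = f"""Question:
{case.question}
Candidate A:
{a}
Candidate B:
{b}
Judge according to the rubric and output JSON."""
    r = await client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=512,
        system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content": user_msg}],
        temperature=0.0,  # the judge must run at temp=0
    )
    text = r.content[0].text
    # Extract the JSON object from the reply
    start = text.find("{")
    end = text.rfind("}") + 1
    parsed = json.loads(text[start:end])
    # Schema check: reject illegal values such as winner='C'
    if parsed.get("winner") not in ("A", "B", "tie"):
        raise ValueError(f"illegal winner in judge output: {parsed.get('winner')!r}")
    if flip:
        # Map the verdict back to the original A/B labels
        m = {"A": "B", "B": "A", "tie": "tie"}
        parsed["winner"] = m[parsed["winner"]]
    return parsed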
async def judge_with_flip(case: PairwiseCase) -> dict:
    """Run both (A,B) and (B,A) and take the consensus."""
    r1, r2 = await asyncio.gather(judge_one(case, flip=False), judge_one(case, flip=True))
    if r1["winner"] == r2["winner"]:
        return {"winner": r1["winner"], "consistent": True, "details": [r1, r2]}
    else:
        # Disagreement → tie (the conservative choice)
        return {"winner": "tie", "consistent": False, "details": [r1, r2]}


# ────────────────────── Position bias detection ──────────────────────
async def measure_position_bias(cases: list[PairwiseCase]) -> dict:
    """Run both (A,B) and (B,A) and report how often the winner flips."""
    results_ab = await asyncio.gather(*[judge_one(c, flip=False) for c in cases])
    results_ba = await asyncio.gather(*[judge_one(c, flip=True) for c in cases])
    flips = 0
    for r1, r2 in zip(results_ab, results_ba):
        # judge_one already maps flipped verdicts back, so any mismatch is a position effect
        if r1["winner"] != r2["winner"]:
            flips += 1
    return {
        "n": len(cases),
        "flips": flips,
        "flip_rate": flips / len(cases) if cases else 0.0,
        "ab_distribution": Counter(r["winner"] for r in results_ab),
        "ba_distribution": Counter(r["winner"] for r in results_ba),
    }
# ────────────────────── Cohen's κ ──────────────────────
def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters. -1 ≤ κ ≤ 1; 0.6-0.8 counts as substantial."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    if n == 0:
        raise ValueError("cohens_kappa needs at least one rating")
    categories = sorted(set(rater1) | set(rater2))
    # observed agreement
    p_o = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    # expected (chance) agreement from each rater's marginal distribution
    p_e = 0
    for c in categories:
        p1 = rater1.count(c) / n
        p2 = rater2.count(c) / n
        p_e += p1 * p2
    if p_e == 1:
        # both raters used one identical category throughout; κ is undefined, report 1.0
        return 1.0
    return (p_o - p_e) / (1 - p_e)
# ────────────────────── Calibration: judge vs ground truth ──────────────────────
async def calibrate_against_truth(cases: list[dict]) -> dict:
    """cases: [{'q':..., 'a_better': True/False, 'answer_a':..., 'answer_b':...}]"""
    if not cases:
        raise ValueError("calibration needs at least one case")
    judges = []
    truths = []
    for c in cases:
        pc = PairwiseCase("x", c["q"], c["answer_a"], c["answer_b"])
        r = await judge_with_flip(pc)
        judges.append(r["winner"])
        truths.append("A" if c["a_better"] else "B")
    # "A" is the positive class; a judge "tie" never matches the truth, so it counts as a miss
    tp = sum(1 for j, t in zip(judges, truths) if j == t and j == "A")
    fp = sum(1 for j, t in zip(judges, truths) if j == "A" and t != "A")
    fn = sum(1 for j, t in zip(judges, truths) if t == "A" and j != "A")
    tn = sum(1 for j, t in zip(judges, truths) if j == t and j == "B")
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    kappa = cohens_kappa(judges, truths)
    accuracy = (tp + tn) / len(cases)
    return {
        "n": len(cases),
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }
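# ────────────────────── Main entry example ──────────────────────
DEMO_CASES = [
    PairwiseCase(
        "kyc_001",
        "The customer provided a passport but has no national ID card. How do we complete KYC?",
        answer_a=("Under Article 12 of the Measures for Customer Identification by Commercial Banks, "
                  "foreign customers may use a passport as a valid ID document. "
                  "Also required: (1) entry visa; (2) proof of residence (>3 months); "
                  "(3) proof of a permanent domestic address. "
                  "Register it in the system as type 'PASSPORT'."),
        answer_b=("Sure. We generally accept passports. Just have the customer take a photo and send it over."),
    ),
    PairwiseCase(
        "credit_002",
        "The customer earns 12,000 yuan a month and already pays 6,000 yuan a month on a mortgage. "
        "Can they apply for another 500,000 yuan consumer loan?",
        answer_a=("Debt burden ratio (DBR) = 6000/12000 = 50%, already above the 40% regulatory warning line. "
                  "Adding a 500,000 yuan consumer loan (5-year term at 4.5%, about 9,300 a month) "
                  "raises DBR to 127%, far above the limit. "
                  "Recommendation: reject the application, or require the customer to pay down part of the mortgage first."),
        answer_b=("Yes, they can apply, because the monthly income is 12,000."),
    ),
] * 5


async def demo():
    # Position bias
    bias = await measure_position_bias(DEMO_CASES)
    print(f"Position-flip rate: {bias['flip_rate']:.2%}")
    print(f"AB distribution: {dict(bias['ab_distribution'])}")
    print(f"BA distribution: {dict(bias['ba_distribution'])}")
    # Win rate
    results = await asyncio.gather(*[judge_with_flip(c) for c in DEMO_CASES])
    winners = [r["winner"] for r in results]
    print(f"\nFinal winners: {Counter(winners)}")
    print(f"Consistency: {sum(r['consistent'] for r in results)} / {len(results)}")
    # Calibration (assumes the ground truth is that A is better in every case)
    truth_cases = [
        {"q": c.question, "answer_a": c.answer_a, "answer_b": c.answer_b, "a_better": True}
        for c in DEMO_CASES
    ]
    cal = await calibrate_against_truth(truth_cases)
    print(f"\nCalibration: accuracy={cal['accuracy']:.2%}, F1={cal['f1']:.2f}, κ={cal['kappa']:.2f}")


if __name__ == "__main__":
    asyncio.run(demo())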
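The flip-and-consensus logic can be checked offline, without any API call, by swapping the model for stub judges. The sketch below is a hypothetical test harness rather than part of judge.py: `first_slot_judge` simulates pure position bias, `longer_is_better_judge` simulates a judge that looks only at content (length is a stand-in for quality), and `consensus` mirrors what `judge_with_flip` does.

```python
import asyncio

# Maps a verdict given in swapped order back to the original A/B labels
FLIP_BACK = {"A": "B", "B": "A", "tie": "tie"}


async def first_slot_judge(first: str, second: str) -> str:
    """Stub with pure position bias: always votes for whatever is in the first slot."""
    return "A"


async def longer_is_better_judge(first: str, second: str) -> str:
    """Stub that decides on content alone, independent of slot order."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


async def consensus(answer_a: str, answer_b: str, judge) -> str:
    """Run the judge in both orders; keep the verdict only if the two runs agree."""
    v_ab, v_ba = await asyncio.gather(judge(answer_a, answer_b), judge(answer_b, answer_a))
    v_ba = FLIP_BACK[v_ba]
    return v_ab if v_ab == v_ba else "tie"


biased = asyncio.run(consensus("short", "a much longer answer", first_slot_judge))
stable = asyncio.run(consensus("short", "a much longer answer", longer_is_better_judge))
print(biased, stable)
```

The position-biased stub collapses to a tie, while the content-based stub returns the same winner in both orders, which is the behavior the consensus rule is meant to enforce.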
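3.2 Multi-judge ensemble (removing model preference)
"""multi_judge.py: ensemble of different judge models with majority voting"""
import asyncio
import json
from collections import Counter

# Assumes judge.py from 3.1 is importable; reuses its rubric, client, and case type
from judge import JUDGE_RUBRIC, PairwiseCase, client

# alias -> model id; GPT-5 / DeepSeek etc. can be added through their own clients
MODELS = {
    "opus": "claude-opus-4-7",
    "sonnet": "claude-sonnet-4-6",
}


async def judge_with_model(case: PairwiseCase, model: str) -> dict:
    """Same prompt as judge_one, but the judge model is passed in explicitly."""
    user_msg = (f"Question:\n{case.question}\nCandidate A:\n{case.answer_a}\n"
                f"Candidate B:\n{case.answer_b}\nJudge according to the rubric and output JSON.")
    r = await client.messages.create(
        model=model, max_tokens=512, system=JUDGE_RUBRIC,
        messages=[{"role": "user", "content": user_msg}], temperature=0.0,
    )
    text = r.content[0].text
    return json.loads(text[text.find("{"):text.rfind("}") + 1])


async def ensemble_judge(case, models=("opus", "sonnet")):
    rs = await asyncio.gather(*[judge_with_model(case, MODELS[m]) for m in models])
    winners = [r["winner"] for r in rs]
    # Majority vote; without a strict majority (e.g. a 1-1 split) fall back to tie
    c = Counter(winners)
    top, votes = c.most_common(1)[0]
    final = top if votes > len(rs) / 2 else "tie"
    return {"winner": final, "individual": dict(zip(models, winners)),
            "agreement": votes / len(rs)}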
3.3 Single-grade with rubric
async def single_grade(question: str, answer: str) -> dict:
    """Grade one answer on five dimensions; relies on `client` and `json` from judge.py."""
    rubric = """Evaluate the following financial answer. Score each of the 5 dimensions independently from 1 to 5 (5 is best).
1. accuracy : factual accuracy
2. compliance : compliance and regulatory alignment
3. completeness : completeness
4. conciseness : concision
5. professionalism : professional tone
Output JSON: {"accuracy":int, "compliance":int, "completeness":int, "conciseness":int, "professionalism":int, "overall_reasoning":str}"""
    r = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=400,
        system=rubric,
        messages=[{"role": "user", "content": f"Question: {question}\n\nAnswer:\n{answer}"}],
        temperature=0,
    )
    # Extract the JSON object from the reply
    text = r.content[0].text
    return json.loads(text[text.find("{"):text.rfind("}") + 1])
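For trend monitoring it is often convenient to fold the five dimension scores into one number. The sketch below is an assumption layered on top of `single_grade`: it reuses the 30/25/20/15/10 weights from the pairwise rubric, which the notes do not prescribe for single grading, and `overall_score` is a hypothetical helper. It also rejects scores outside 1-5 so that a malformed judge reply does not silently skew the trend.

```python
# Weights in percent, borrowed from the pairwise rubric (an assumption for single grading)
WEIGHTS_PCT = {
    "accuracy": 30,
    "compliance": 25,
    "completeness": 20,
    "conciseness": 15,
    "professionalism": 10,
}


def overall_score(grades: dict) -> float:
    """Weighted mean of the five 1-5 dimension scores; raises on missing or illegal values."""
    for dim in WEIGHTS_PCT:
        score = grades.get(dim)
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim} must be an int in 1..5, got {score!r}")
    return sum(WEIGHTS_PCT[d] * grades[d] for d in WEIGHTS_PCT) / 100


# Made-up grades for illustration only
example = {"accuracy": 2, "compliance": 5, "completeness": 4,
           "conciseness": 3, "professionalism": 5, "overall_reasoning": "..."}
print(overall_score(example))
```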
4. Cost & performance measurements
| Judge configuration | Cost per case | P95 latency | Position-flip rate |
|---|---|---|---|
| claude-haiku-4-5 single | $0.001 | 1.4s | 22% |
| claude-sonnet-4-6 single | $0.005 | 2.1s | 14% |
| claude-opus-4-7 single | $0.045 | 3.5s | 8% |
| claude-opus-4-7 + flip | $0.090 | 6.0s | 0% (consensus enforced) |
| 2-model ensemble (sonnet+opus) | $0.050 | 3.5s (parallel) | 6% |
Calibration (vs human):
- claude-sonnet-4-6 judge without flip: κ = 0.61
- With flip: κ = 0.74
- With flip + ensemble: κ = 0.81
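5. Applications in finance
- Compliance consistency judge: for sales scripts that relationship managers draft with an LLM, the judge checks them against sales-compliance rules
- Customer-service answer quality: sample 1% of traffic every day, have the judge score it, and monitor the trend
- A/B testing prompts: before a new prompt ships, a pairwise judge compares 200 cases; ramp up only if win rate > 55%
- Disclosure quality: after investor-education material is generated, the judge assesses whether it is complete and whether it contains prohibited promises
- Code review: for financial-system scripts, the judge checks for SQL injection, divide-by-zero, and uncommitted transactions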
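The prompt A/B gate in the list above (200 cases, ramp up only if win rate > 55%) can be sketched as follows. Two choices are assumptions rather than something the notes specify: ties are dropped before computing the win rate, and a Wilson lower bound is reported next to the point estimate to show how much a sample of this size can wobble. `ab_gate` and `wilson_lower_bound` are hypothetical names.

```python
import math


def wilson_lower_bound(wins: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def ab_gate(winners, threshold: float = 0.55) -> dict:
    """winners: verdicts where "A" means the new prompt won. Ties are dropped (an assumption)."""
    decided = [w for w in winners if w != "tie"]
    n = len(decided)
    wins = decided.count("A")
    rate = wins / n if n else 0.0
    return {
        "n_decided": n,
        "win_rate": rate,
        "lower_bound": wilson_lower_bound(wins, n),
        "passed": rate > threshold,
    }


# Made-up verdicts for illustration: 120 wins, 60 losses, 20 ties out of 200
verdicts = ["A"] * 120 + ["B"] * 60 + ["tie"] * 20
result = ab_gate(verdicts)
print(result)
```

A stricter gate could require `lower_bound` rather than `win_rate` to clear the threshold; which one to use is a policy decision.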
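6. Production lessons and pitfalls
- claude-haiku-4-5 as judge is cheap but its κ is low: in finance start from sonnet at minimum, and use opus + flip for critical decisions (compliance / risk control)
- Position bias of 30%+ is not rare: flipping is mandatory. I have seen a team skip the flip, and whichever option was shown first won about 70% of the time, a serious misjudgment
- Length bias is hard to eradicate: the rubric can repeat "do not score higher for length" and the model still leans toward long answers. The most effective fix is a "Compactness Bonus" for short answers
- Self-preference: claude judging claude tends to score claude higher. When comparing different models, use a third-party judge
- Judge errors raise no alarm: the judge may output an illegal value such as winner='C', so validate its output against a schema
- Judge prompt drift: changing one word in the rubric can move win rate by 5%. Version the rubric and record rubric_hash in every eval report
- Judges hallucinate too: the judge may cite a nonexistent regulation to reject A. Write "if unsure, mark tie" into the rubric
- One calibration is not enough: the distribution drifts, so re-measure κ every month on 100 fresh human-labeled cases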
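Two of the pitfalls above, illegal judge output and rubric drift, are cheap to guard against in code. The sketch below is a minimal illustration; `parse_verdict` and `rubric_hash` are hypothetical helpers, and the 12-character hash prefix is an arbitrary choice.

```python
import hashlib
import json

VALID_WINNERS = {"A", "B", "tie"}


def parse_verdict(text: str) -> dict:
    """Extract and validate the judge's JSON verdict; raise ValueError on anything malformed."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in judge output")
    data = json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output is not a JSON object")
    if data.get("winner") not in VALID_WINNERS:
        raise ValueError(f"illegal winner: {data.get('winner')!r}")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError(f"illegal confidence: {conf!r}")
    return data


def rubric_hash(rubric: str) -> str:
    """Short, stable fingerprint of the rubric text to store in every eval report."""
    return hashlib.sha256(rubric.encode("utf-8")).hexdigest()[:12]


ok = parse_verdict('Here you go: {"reasoning": "x", "winner": "A", "confidence": 0.9}')
print(ok["winner"], rubric_hash("rubric v1"), rubric_hash("rubric v2"))
```

Raising instead of silently defaulting to a tie makes it possible to count and alert on malformed judge replies.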
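7. Quick reference
| When | Choice |
|---|---|
| Binary correctness (contains / does not contain X) | Deterministic check; no judge |
| Comparing two versions | Pairwise judge + flip |
| Long-term trend monitoring | Single grade with reference |
| Comparing multiple models | Ensemble judge (none of the compared models may act as the judge) |
| Compliance decisions | Judge + flip + human spot check |
| Agreement κ | Interpretation |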
|---|---|
| < 0.4 | poor |
| 0.4-0.6 | moderate |
| 0.6-0.8 | substantial |
| > 0.8 | almost perfect |
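The κ table can be turned into a small lookup for eval reports. The table's ranges share their endpoints, so the sketch assumes that a boundary value belongs to the higher band, except that exactly 0.8 stays "substantial" because the top band is written as "> 0.8". `interpret_kappa` is a hypothetical helper.

```python
def interpret_kappa(kappa: float) -> str:
    """Map κ to the labels of the quick-reference table."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError(f"kappa must be in [-1, 1], got {kappa}")
    if kappa < 0.4:
        return "poor"
    if kappa < 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"


# The three calibration results reported in section 4
for k in (0.61, 0.74, 0.81):
    print(k, interpret_kappa(k))
```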
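8. Interview questions
- How do you measure and mitigate position bias in LLM-as-judge?
  - Measure: run (A,B) and (B,A) and look at the rate at which the winner flips. Mitigate: run both orders and take the consensus; mark disagreements as tie
- Which model should the judge be? Can the same model as the system under test act as judge?
  - No; self-preference is severe. Use a stronger model (claude-opus as judge) or a third-party model; a multi-model ensemble is the most stable
- How do you calibrate an LLM judge?
  - Take 100 human-labeled cases, run the judge, compute precision / recall / F1 / κ; adjust the rubric until κ > 0.7; re-measure regularly
- How do you choose between pairwise and single grading?
  - Pairwise is robust and suits A/B tests; single makes absolute trends easy to track but is hard to calibrate. For production: pairwise in CI, single for monitoring
- What are the key points of judge design for financial compliance?
  - Rubric lists the regulations to check against explicitly; multi-model ensemble; the judge outputs its reasoning, not only the winner, for auditability; disagreements are marked tie and decided by a human; the rubric is versioned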
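Coming up tomorrow
Day 168: Eval System (Part 3): Golden Datasets and Adversarial Testing. Deterministic checks plus a judge are still not enough; you also need good input data. The session builds a 100-case golden dataset covering normal, edge, adversarial, and regression cases.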