Expert Day 166
Eval System (Part 1): Deterministic Evaluation
2026-10-14
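Date: 2026-10-14 Direction: AI systems engineering / Eval / Quality Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176) Tags: #Eval #DeterministicTest #JSONSchema #Pytest #FinanceAI
Today's goals
| Type | Content |
|---|---|
| Learn | The three-layer LLM eval pyramid (deterministic → LLM-judge → human); deterministic check types (regex/structural/numeric/schema/contains/exact); why deterministic checks are the foundation |
| Practice | Write 20 deterministic evals for financial LLM applications: JSON structure, required fields, numeric compliance, sensitive-word checks, tool-call arguments |
| Deliverable | docs/ai-infra/det_evals.py: a complete eval suite that runs under pytest |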
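1. Core concepts
1.1 The eval pyramid (memorize this)
┌──────────────────┐
│ Human review │ expensive and slow, highest fidelity
│ (1% of traffic) │ 1 min/item × $30/h
├──────────────────┤
│ LLM-as-judge │ medium cost, many biases
│ (10% of traffic) │ $0.001-0.01/item
├──────────────────┤
│ Deterministic │ almost free, widest coverage
│ (100% of traffic)│ $0/item, milliseconds
└──────────────────┘
Core principle: anything that can be tested deterministically should never go to an LLM judge. For example, to decide whether an output is valid JSON, do not pay GPT-4 to judge it; call json.loads() directly.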
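As a minimal sketch of that principle, the check below decides JSON validity with `json.loads()` alone. The helper name `is_valid_json` and its `(passed, reason)` return shape are illustrative assumptions, not part of the suite in this note.

```python
import json


def is_valid_json(text: str) -> tuple[bool, str]:
    """Return (passed, reason) for a raw model output string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where parsing failed
        return False, f"invalid JSON at char {exc.pos}: {exc.msg}"
    return True, ""


if __name__ == "__main__":
    print(is_valid_json('{"decision": "APPROVE"}'))  # well-formed object
    print(is_valid_json('{"decision": APPROVE}'))    # unquoted value, rejected
```

The check costs nothing per call and gives the same answer on every run, which is what makes it usable on 100% of traffic.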
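1.2 Deterministic eval types
| Type | What it checks | Example |
|---|---|---|
| Schema | Output conforms to a JSON Schema | pydantic / jsonschema validation |
| Regex | String matches a regular expression | r"^订单号: \d{16}$" |
| Contains/NotContains | Contains / does not contain specific strings | must not contain "我无法回答" ("I cannot answer"); must contain a concrete amount |
| Numeric range | Value lies within a legal range | 0 <= LTV <= 1.0 |
| Length | Token / character count | output < 500 characters |
| Structural | Structure is right (row and column counts) | table has 5 columns |
| Tool call shape | Tool-call arguments are complete | required fields != null |
| Citation | Cited sources actually exist | citation is in the KB index |
| Determinism | Same input yields the same output across runs (temp=0) | hash comparison |
| No PII leak | No phone numbers / ID numbers | regex scan |
| Latency | P95 < threshold | timing assertion |
| Cost cap | Tokens per call stay under the cap | usage check |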
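The Citation row is the one type in the table that the suite in this note does not exercise, so here is a rough sketch of it. The `[KB-1234]` citation syntax, the `KB_INDEX` set, and the function name are all assumptions made for illustration; a real system would load the index from its knowledge base.

```python
import re

# Hypothetical knowledge-base index: the set of document IDs that really exist.
KB_INDEX = {"KB-0012", "KB-0345", "KB-0981"}

# Assumed citation syntax in model output: [KB-1234]
CITATION_RE = re.compile(r"\[(KB-\d{4})\]")


def check_citations(answer: str, kb_index: set[str]) -> tuple[bool, str]:
    """Pass only if the answer cites something and every cited ID is in the index."""
    cited = CITATION_RE.findall(answer)
    if not cited:
        return False, "no citations found"
    missing = sorted(set(cited) - kb_index)
    if missing:
        return False, f"citations not in KB index: {missing}"
    return True, ""


if __name__ == "__main__":
    print(check_citations("Rates rose [KB-0012] and then fell [KB-0345].", KB_INDEX))
    print(check_citations("See [KB-9999].", KB_INDEX))
```

This only proves the cited document exists; whether the document supports the claim is a semantic question that belongs to a higher layer of the pyramid.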
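1.3 Why deterministic checks are the foundation
- Immune to LLM-judge bias: "is this valid JSON?" has no ambiguity
- Runs in CI: integrates with pytest, and every PR must pass
- Fast feedback: the full suite covers 1,000 cases in about a second
- Serves as ground truth for the LLM judge: use it to check whether the judge is well calibrated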
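The last point can be made concrete with a small sketch: on outputs where a deterministic check already gives a verdict, compare the judge's verdicts against it. The function name and the metric names below are assumptions chosen for illustration.

```python
def judge_agreement(det_labels: list[bool], judge_labels: list[bool]) -> dict[str, float]:
    """Compare LLM-judge verdicts against deterministic verdicts on the same outputs."""
    if not det_labels or len(det_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(det_labels)
    pairs = list(zip(det_labels, judge_labels))
    agree = sum(d == j for d, j in pairs)
    # Judge passed an output that the deterministic check failed
    false_pass = sum((not d) and j for d, j in pairs)
    # Judge failed an output that the deterministic check passed
    false_fail = sum(d and (not j) for d, j in pairs)
    return {
        "agreement": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }


if __name__ == "__main__":
    det = [True, True, False, False, True]
    judge = [True, False, True, False, True]
    print(judge_agreement(det, judge))
```

A judge that disagrees with unambiguous deterministic verdicts is unlikely to be trustworthy on the ambiguous cases it was brought in for.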
2. Production architecture
PR submitted / prompt changed
│
▼
┌─────────────────────┐
│ CI Pipeline │
│ │
│ 1. Lint / Type │
│ 2. Det Evals ◀── today's work (must pass)
│ 3. LLM-judge │
│ 4. Latency check │
└─────────────────────┘
│ pass
▼
Merge → canary (1%)
│
▼
Daily sampled runs in production → ClickHouse → alerts
3. Implementation: 20 deterministic evals
"""det_evals.py: deterministic eval suite for financial LLM applications

Dependencies: pip install pytest pydantic jsonschema regex
Run: pytest det_evals.py -v
"""
from __future__ import annotations

import hashlib
import json
import re
import time
from dataclasses import dataclass
from typing import Any, Callable, Literal

import pytest
from pydantic import BaseModel, Field, ValidationError

# ────────────────────── System under test (your LLM application), mocked ──────────────────────
def llm_kyc_extract(doc: str) -> dict:
    """Stub: stands in for your prompt + LLM call and returns the KYC extraction result."""
    # In a real environment, replace with a real model call,
    # e.g. the Anthropic SDK's client.messages.create(...)
    return {
        "name": "张三",
        "id_number": "110101199001011234",
        "phone": "13800138000",
        "annual_income": 150000,
        "risk_level": "M2",
        "occupation": "教师",
    }


def llm_credit_decision(profile: dict) -> dict:
    """Stub credit decision; the reason stays in Chinese because the app answers in Chinese."""
    return {
        "decision": "APPROVE",
        "credit_limit": 50000,
        "interest_rate": 0.085,
        "reason": "客户征信良好,月收入稳定",
    }


def llm_compliance_check(text: str) -> dict:
    return {
        "flagged": False,
        "categories": [],
        "confidence": 0.92,
    }


def llm_research_summary(report: str, max_words: int = 200) -> str:
    # Placeholder summary text, padded with "X" to reach a realistic length
    return "(这里返回研报摘要,约 200 字)" + "X" * 380


def llm_tool_call_for_query(q: str) -> list[dict]:
    return [
        {"name": "get_market_data", "input": {"ticker": "AAPL", "interval": "1d"}},
        {"name": "get_company_news", "input": {"company": "Apple Inc."}},
    ]

# ────────────────────── Eval helpers: the EvalResult dataclass ──────────────────────
@dataclass
class EvalResult:
    name: str
    passed: bool
    details: str = ""


def assert_pass(passed: bool, msg: str):
    assert passed, msg

# ══════════════════════════════════════════════════════════════════
# Schema checks (5)
# ══════════════════════════════════════════════════════════════════
class KYCExtraction(BaseModel):
    name: str = Field(min_length=1, max_length=50)
    id_number: str = Field(pattern=r"^\d{17}[\dX]$")  # 18-character national ID number
    phone: str = Field(pattern=r"^1[3-9]\d{9}$")
    annual_income: int = Field(ge=0, le=100_000_000)
    risk_level: Literal["L1", "L2", "M1", "M2", "H1", "H2"]
    occupation: str


def test_eval_01_kyc_schema_valid():
    """[Schema] KYC extraction result conforms to the pydantic schema"""
    out = llm_kyc_extract("某客户证件...")
    try:
        KYCExtraction(**out)
    except ValidationError as e:
        pytest.fail(f"Schema invalid: {e}")


def test_eval_02_kyc_no_extra_keys():
    """[Schema] KYC must not return fields outside the schema (guards against prompt injection smuggling data)"""
    out = llm_kyc_extract("doc")
    allowed = set(KYCExtraction.model_fields.keys())
    extra = set(out.keys()) - allowed
    assert not extra, f"Unexpected keys: {extra}"


CREDIT_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["APPROVE", "REJECT", "REFER"]},
        "credit_limit": {"type": "integer", "minimum": 0, "maximum": 10_000_000},
        # Loose schema bound (the legacy 36% private-lending line);
        # test_eval_12 applies the stricter 4x LPR cap
        "interest_rate": {"type": "number", "minimum": 0.0, "maximum": 0.36},
        "reason": {"type": "string", "minLength": 5},
    },
    "required": ["decision", "credit_limit", "interest_rate", "reason"],
    "additionalProperties": False,
}


def test_eval_03_credit_decision_jsonschema():
    """[Schema] Credit decision validates against the JSON Schema"""
    import jsonschema

    out = llm_credit_decision({"name": "x"})
    jsonschema.validate(out, CREDIT_SCHEMA)


def test_eval_04_compliance_categories_typed():
    """[Schema] Compliance categories must be a list whose elements come from a predefined enum"""
    out = llm_compliance_check("某交易")
    assert isinstance(out.get("categories", []), list)
    allowed = {"AML", "SANCTIONS", "PEP", "TERROR", "FRAUD"}
    for c in out.get("categories", []):
        assert c in allowed, f"Unknown category: {c}"


def test_eval_05_tool_call_shape():
    """[Schema] Every item in the tool-call list has both a name and an input key"""
    calls = llm_tool_call_for_query("查 AAPL 行情和新闻")
    assert isinstance(calls, list)
    for c in calls:
        assert set(c.keys()) >= {"name", "input"}, f"missing keys in {c}"
        assert isinstance(c["input"], dict)

# ══════════════════════════════════════════════════════════════════
# Regex / Contains checks (5)
# ══════════════════════════════════════════════════════════════════
# Digit lookarounds instead of \b: in Python 3, Chinese characters count as
# word characters, so \b would not match between "电话" and "138..." and the
# scan would miss numbers embedded in Chinese text.
PII_PATTERNS = [
    (r"(?<!\d)1[3-9]\d{9}(?!\d)", "phone"),
    (r"(?<!\d)\d{17}[\dX](?!\d)", "id_number"),
    (r"(?<!\d)\d{16,19}(?!\d)", "card_number"),
]


def test_eval_06_summary_no_phone_leak():
    """[NotContains] Research summary must not contain PII (phone, ID number, card number)"""
    s = llm_research_summary("xxx")
    for pat, label in PII_PATTERNS:
        m = re.search(pat, s)
        assert not m, f"PII leak ({label}): {m.group()}"


def test_eval_07_credit_reason_must_contain_basis():
    """[Contains] The reason of a credit decision must mention at least one quantitative basis"""
    out = llm_credit_decision({"name": "x"})
    # Keywords: credit record, income, liabilities, score, cash flow, assets
    keywords = ["征信", "收入", "负债", "评分", "流水", "资产"]
    assert any(k in out["reason"] for k in keywords), \
        f"reason lacks a quantitative basis: {out['reason']}"


def test_eval_08_no_refusal_in_normal_query():
    """[NotContains] A normal query should not come back with a refusal such as 'I cannot' / '我无法'"""
    s = llm_research_summary("分析苹果公司基本面")
    refusals = ["I cannot", "I'm unable", "我无法", "抱歉,我不能", "对不起,我无"]
    for r in refusals:
        assert r not in s, f"Unexpected refusal: contains '{r}'"


def test_eval_09_id_number_format():
    """[Regex] ID number must be 18 characters"""
    out = llm_kyc_extract("doc")
    assert re.match(r"^\d{17}[\dX]$", out["id_number"])


def test_eval_10_currency_format():
    """[Regex] Amounts are formatted as ¥1,234,567.89"""
    # Hard-coded sample; in a real suite this would be model output
    text = "授信额度 ¥50,000.00, 月利率 0.708%"
    assert re.search(r"¥\d{1,3}(,\d{3})*\.\d{2}", text)

# ══════════════════════════════════════════════════════════════════
# Numeric range checks (4)
# ══════════════════════════════════════════════════════════════════
def test_eval_11_credit_limit_in_range():
    out = llm_credit_decision({"name": "x"})
    assert 0 <= out["credit_limit"] <= 5_000_000, "credit limit out of range"


def test_eval_12_interest_rate_legal():
    """Interest rate must not exceed 4x LPR (the domestic judicial-protection cap)"""
    out = llm_credit_decision({"name": "x"})
    LPR = 0.0385  # assumed LPR of 3.85%; update when the benchmark changes
    LPR_4X = 4 * LPR  # = 0.154
    assert 0 < out["interest_rate"] <= LPR_4X, \
        f"interest rate {out['interest_rate']} exceeds the judicial-protection cap"


def test_eval_13_confidence_score():
    out = llm_compliance_check("xxx")
    assert 0.0 <= out["confidence"] <= 1.0


def test_eval_14_income_reasonable():
    """Annual income < 100 million (guards against hallucinated, absurd figures)"""
    out = llm_kyc_extract("doc")
    assert out["annual_income"] < 100_000_000

# ══════════════════════════════════════════════════════════════════
# Length / structural checks (3)
# ══════════════════════════════════════════════════════════════════
def test_eval_15_summary_length():
    """[Length] Research summary is 100-400 characters"""
    s = llm_research_summary("xxx", max_words=200)
    assert 100 <= len(s) <= 400, f"length={len(s)} out of [100,400]"


def test_eval_16_token_count_cap():
    """[Cost] Input of a single LLM call < 8K tokens (cost control)"""
    text = "x" * 1000  # simulated prompt
    # Crude heuristic of ~3 characters per token; use a real tokenizer
    # such as tiktoken when the budget matters
    rough_tokens = len(text) // 3
    assert rough_tokens < 8000


def test_eval_17_at_most_n_tool_calls():
    calls = llm_tool_call_for_query("xxx")
    assert len(calls) <= 5, f"too many tool calls ({len(calls)}); the agent loop may be out of control"

# ══════════════════════════════════════════════════════════════════
# Determinism / consistency checks (3)
# ══════════════════════════════════════════════════════════════════
def test_eval_18_deterministic_with_temp_zero():
    """[Determinism] Same input at temp=0 called repeatedly yields the same output hash"""
    def call_with_seed():
        # Real environment: call the model with temperature=0
        return llm_kyc_extract("doc")

    h1 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    h2 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    assert h1 == h2, "outputs differ at temp=0; the prompt may contain dynamic fields"


def test_eval_19_idempotent_tool_call():
    """[Determinism] Same query sent to the agent twice yields the same tool-call order and arguments"""
    a = llm_tool_call_for_query("查 AAPL 行情和新闻")
    b = llm_tool_call_for_query("查 AAPL 行情和新闻")
    assert a == b


def test_eval_20_latency_under_budget():
    """[Latency] P95 of a single call < 2 seconds"""
    import math

    times = []
    for _ in range(20):
        t0 = time.perf_counter()  # monotonic clock, suited to measuring intervals
        llm_kyc_extract("doc")
        times.append(time.perf_counter() - t0)
    # Nearest-rank P95: the ceil(0.95 * n)-th smallest sample
    p95 = sorted(times)[math.ceil(0.95 * len(times)) - 1]
    assert p95 < 2.0, f"P95 latency {p95:.2f}s > 2.0s"

# ══════════════════════════════════════════════════════════════════
# Entry point (running python det_evals.py directly also works)
# ══════════════════════════════════════════════════════════════════
if __name__ == "__main__":
    pytest.main([__file__, "-v", "--tb=short"])
3.2 A real eval framework that integrates LLM outputs
"""eval_runner.py: run a deterministic eval suite over a dataset and write a report"""
import json
from collections import defaultdict
from pathlib import Path
from typing import Callable


def run_evals(dataset: list[dict], evals: list[Callable], llm_fn: Callable):
    """
    dataset: [{"id": "case_1", "input": "...", "expected": {...}}, ...]
    evals  : [callable(output, expected) -> (pass: bool, reason: str), ...]
    llm_fn : (input) -> output
    """
    report = defaultdict(lambda: {"pass": 0, "fail": 0, "errors": []})
    for case in dataset:
        try:
            out = llm_fn(case["input"])
        except Exception as e:
            # The system under test itself failed; record it and skip the evals
            report["__system__"]["fail"] += 1
            report["__system__"]["errors"].append(f"{case['id']}: {e}")
            continue
        for ev in evals:
            try:
                passed, reason = ev(out, case.get("expected"))
                key = ev.__name__
                if passed:
                    report[key]["pass"] += 1
                else:
                    report[key]["fail"] += 1
                    report[key]["errors"].append(f"{case['id']}: {reason}")
            except Exception as e:
                # An eval that raises counts as a failure for that case
                report[ev.__name__]["fail"] += 1
                report[ev.__name__]["errors"].append(f"{case['id']}: exception {e}")
    return dict(report)


# Write the report as markdown
def render_report(report: dict, out_path: str = "eval_report.md"):
    lines = ["# Deterministic Eval Report\n"]
    total_p = total_f = 0
    for ev, stat in report.items():
        p = stat["pass"]
        f = stat["fail"]
        total_p += p
        total_f += f
        rate = p / max(p + f, 1) * 100
        lines.append(f"## {ev}\n- Pass: {p}\n- Fail: {f}\n- Rate: {rate:.1f}%\n")
        if stat["errors"][:3]:
            lines.append("Sample failures:\n```\n" + "\n".join(stat["errors"][:3]) + "\n```\n")
    total = total_p + total_f
    # max(total, 1) avoids a ZeroDivisionError on an empty report
    lines.insert(1, f"\n**Overall**: {total_p}/{total} = "
                    f"{total_p / max(total, 1) * 100:.1f}%\n")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
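The evaluator contract documented in `run_evals`, a callable taking `(output, expected)` and returning `(passed, reason)`, can be sketched as below. The two evaluators, the `fake_llm` stub, and the two-case dataset are hypothetical; the loop at the bottom applies them directly so the example runs on its own, whereas in practice they would be passed as the `evals` argument.

```python
import json
from typing import Optional


def eval_is_json_object(output: str, expected: Optional[dict]) -> tuple[bool, str]:
    """Pass if the raw output parses as a JSON object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"not JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return False, f"expected object, got {type(parsed).__name__}"
    return True, ""


def eval_decision_matches(output: str, expected: Optional[dict]) -> tuple[bool, str]:
    """Pass if the decision field equals the expected decision, when one is given."""
    if not expected or "decision" not in expected:
        return True, "no expected decision; skipped"
    try:
        got = json.loads(output).get("decision")
    except (json.JSONDecodeError, AttributeError):
        return False, "output is not a JSON object"
    if got != expected["decision"]:
        return False, f"decision={got!r}, expected {expected['decision']!r}"
    return True, ""


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the real model call
    return '{"decision": "APPROVE", "credit_limit": 50000}'


if __name__ == "__main__":
    dataset = [
        {"id": "case_1", "input": "profile A", "expected": {"decision": "APPROVE"}},
        {"id": "case_2", "input": "profile B", "expected": {"decision": "REJECT"}},
    ]
    for case in dataset:
        out = fake_llm(case["input"])
        for ev in (eval_is_json_object, eval_decision_matches):
            passed, reason = ev(out, case["expected"])
            print(case["id"], ev.__name__, passed, reason)
```

Returning a reason string alongside the boolean is what lets the report show sample failures instead of bare counts.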
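4. Cost & performance: measured data
| Eval type | Time for 1,000 cases | Cost per 1,000 cases |
|---|---|---|
| Full deterministic suite (20 evals) | 1.2 s | $0 |
| LLM judge (claude-haiku-4-5) | ~ 60 s | $0.40 |
| LLM judge (claude-opus-4-7) | ~ 200 s | $4.50 |
| Human review | 16 hr | $480 |
Running deterministic evals on 100% of traffic plus CI is still almost free, which is the fundamental reason they form the foundation.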
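To see how the pyramid's traffic split and the costs in the table combine, here is a small worked calculation. The daily volume of 100,000 outputs is a made-up assumption; the coverage ratios come from the pyramid and the unit costs from the table, read as cost per 1,000 cases.

```python
# Cost per 1,000 cases, taken from the table above
COST_PER_1K = {"deterministic": 0.0, "llm_judge_haiku": 0.40, "human": 480.0}
# Share of traffic each tier reviews, taken from the eval pyramid
COVERAGE = {"deterministic": 1.00, "llm_judge_haiku": 0.10, "human": 0.01}


def daily_cost(daily_outputs: int) -> dict[str, float]:
    """Dollar cost per tier for one day of traffic."""
    return {
        tier: round(daily_outputs * COVERAGE[tier] / 1000 * COST_PER_1K[tier], 2)
        for tier in COST_PER_1K
    }


if __name__ == "__main__":
    costs = daily_cost(100_000)  # hypothetical daily volume
    for tier, dollars in costs.items():
        print(f"{tier:>16}: ${dollars:,.2f}")
    print(f"{'total':>16}: ${sum(costs.values()):,.2f}")
```

Under these assumptions the judge tier costs about $4 a day and human review about $480, while sending all 100,000 outputs to human review would cost $48,000.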
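5. Applications in finance
- KYC extraction: ID number, phone number, and address must pass strict schema + regex checks
- Credit decisions: the interest rate must not exceed the judicial-protection cap (4× LPR), and the limit must not exceed the regulatory cap
- Anti-money-laundering screening: output categories must come from a defined enum (AML/SANCTIONS/PEP) so downstream systems can process them structurally
- Wealth-management recommendations: forbidden-word scan (non-compliant promises such as "保本" (capital guaranteed), "稳赚" (sure profit), "一定" (certainly), "绝对" (absolutely))
- Disclosure reports: length + required sections + no PII (PII regex scan)
- Ticket classification: the tool_call must carry a ticket category and priority, and the category must come from a closed enum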
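The forbidden-word scan for wealth-management copy can be sketched in a few lines. The term list below is only the four examples named above; a real deployment would maintain a longer, compliance-approved list, and the function name is an assumption.

```python
# The four example terms from the list above; a real list would be longer
FORBIDDEN_PROMISES = ("保本", "稳赚", "一定", "绝对")


def find_forbidden_promises(text: str, terms: tuple[str, ...] = FORBIDDEN_PROMISES) -> list[str]:
    """Return the forbidden terms that occur in the text, in list order."""
    return [term for term in terms if term in text]


if __name__ == "__main__":
    print(find_forbidden_promises("本产品保本保息,稳赚不赔"))
    print(find_forbidden_promises("历史业绩不代表未来表现"))
```

Plain substring matching is deliberately blunt: a term like "一定" also appears in harmless phrases such as "有一定风险", so hits from the broader terms may need an allow-list or a review step rather than an automatic block.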
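6. Production lessons and pitfalls
- temp=0 does not guarantee determinism: the Anthropic API can still vary slightly because of different backend batching. For hash assertions, pair temp=0 with a seed parameter where the backend exposes one (vLLM does; not every hosted API does), and in production assert "approximately consistent" rather than a strict hash match
- An overly strict schema rejects reasonable variants: for example, a phone number may or may not contain hyphens. Collect the real output distribution first, then write the schema
- Regex PII scans miss masked values: ID numbers with the last 4 digits replaced by `****` and phone numbers written as `138****0000` are both common. Add the masked formats to PII detection as well
- Eval false positives outnumber true positives: a model output such as "今年我们将于 2025 年..." ("this year, in 2025, we will...") trips a "year mismatch" check. Normalize time zones and dates before asserting
- Running everything in CI is too slow: 1,000 deterministic cases in 1 s is fine, but the LLM-judge part needs tiers (a sample on each PR, the full set after merge)
- Eval drift: business rules change and the evals do not follow. Put evals under product-owner review and clean them up every six months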
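For the masked-PII pitfall, a sketch of two extra patterns is shown below. The exact masked formats (a phone number with its middle four digits starred, an ID number with its last four characters starred) are assumptions based on the examples in the list; real outputs should be sampled to find the formats that actually occur.

```python
import re

# Assumed masked formats: 138****0000 and 14 digits followed by ****
MASKED_PII_PATTERNS = [
    (re.compile(r"(?<!\d)1[3-9]\d\*{4}\d{4}(?!\d)"), "masked_phone"),
    (re.compile(r"(?<![\d*])\d{14}\*{4}(?![\d*])"), "masked_id_number"),
]


def find_masked_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) for every masked PII fragment found."""
    hits: list[tuple[str, str]] = []
    for pattern, label in MASKED_PII_PATTERNS:
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits


if __name__ == "__main__":
    print(find_masked_pii("联系电话138****0000"))
    print(find_masked_pii("证件号11010119900101****。"))
    print(find_masked_pii("授信额度 50000 元"))
```

The patterns use digit lookarounds rather than `\b` because Chinese characters count as word characters in Python's `re`, so a word boundary would not match between Chinese text and an adjacent number.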
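7. Quick reference
| Library | Purpose |
|---|---|
| pydantic | schema validation |
| jsonschema | standard JSON Schema validation |
| regex | regular expressions (a drop-in alternative to re with extra features) |
| pytest | test framework |
| tiktoken | token estimation |
| hypothesis | property-based fuzz testing |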
| Failure | Policy |
|---|---|
| Schema fail | block deploy |
| Refusal-phrase regex hit | block deploy |
| Length over cap | warn |
| Latency P95 over budget | warn + monitor |
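The failure-handling table maps naturally onto a small gate function that a CI step could call. The check-type keys, the `Action` enum, and the choice to block on unknown check types are illustrative assumptions, not a prescribed design.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    WARN = "warn"


# Policy from the table above, keyed by hypothetical check-type names
POLICY = {
    "schema": Action.BLOCK,
    "refusal_regex": Action.BLOCK,
    "length": Action.WARN,
    "latency_p95": Action.WARN,
}


def gate(failed_checks: list[str]) -> tuple[bool, list[str]]:
    """Return (deploy_allowed, warnings). Unknown check types block by default."""
    blocking = [c for c in failed_checks if POLICY.get(c, Action.BLOCK) is Action.BLOCK]
    warnings = [c for c in failed_checks if POLICY.get(c, Action.BLOCK) is Action.WARN]
    return (not blocking, warnings)


if __name__ == "__main__":
    print(gate(["length", "latency_p95"]))  # warnings only, deploy allowed
    print(gate(["schema", "length"]))       # schema failure blocks the deploy
```

Defaulting unknown check types to block is the conservative choice: a newly added eval cannot silently become a no-op because someone forgot to register it in the policy.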
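8. Interview questions
- Why does deterministic eval take priority over LLM-judge?
  - It is almost free, runs in CI, has no subjective bias, gives millisecond-level feedback, and can serve as reference truth for the LLM judge. Only what cannot be tested deterministically (semantics, style) goes to the LLM judge
- How would you design evals for an LLM that does financial KYC extraction?
  - Schema (pydantic field types) + Regex (ID number / phone number formats) + Numeric range (plausible annual income) + NotContains (no PII in the reason field) + Determinism (same input twice gives the same output) + cost/latency assertions
- How do you handle an LLM emitting "today's date", which makes the eval non-repeatable?
  - Strip dynamic fields (dates, order numbers) out of the output before hashing; or pin the date in the prompt with "assume today is 2026-01-01" and mock it during eval
- All 20 evals pass, yet problems still appear after launch. Why?
  - Out-of-distribution inputs the evals do not cover (adversarial inputs, the long tail of real users); deterministic checks only verify format, not semantics; the upper layers of LLM-judge and human review are missing
- What do you do when eval maintenance gets expensive?
  - Ship evals in the same PR as the prompt; assign an "eval owner"; clean up evals every six months; add new evals automatically from production samples; keep two-way traceability with the product rule documents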
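The answer to the "today's date" question, stripping dynamic fields before hashing, can be sketched as follows. The `DYNAMIC_KEYS` set, the `<DATE>` placeholder, and the ISO date pattern are assumptions for illustration; the sketch only handles top-level keys and `YYYY-MM-DD` dates.

```python
import hashlib
import json
import re

# Hypothetical top-level keys whose values legitimately change on every call
DYNAMIC_KEYS = {"order_id", "generated_at"}
# Assumed date format inside free text: YYYY-MM-DD
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def stable_hash(output: dict) -> str:
    """Hash an output after dropping dynamic keys and masking embedded dates."""
    cleaned = {k: v for k, v in output.items() if k not in DYNAMIC_KEYS}
    text = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
    text = DATE_RE.sub("<DATE>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    run_1 = {"summary": "截至 2026-10-14 余额充足", "order_id": "A1", "decision": "APPROVE"}
    run_2 = {"summary": "截至 2026-10-15 余额充足", "order_id": "B2", "decision": "APPROVE"}
    run_3 = {"summary": "截至 2026-10-15 余额充足", "order_id": "C3", "decision": "REJECT"}
    print(stable_hash(run_1) == stable_hash(run_2))  # only dynamic parts differ
    print(stable_hash(run_1) == stable_hash(run_3))  # the decision itself differs
```

Normalizing first keeps the determinism check sensitive to the fields that matter, such as the decision, while ignoring the ones that are expected to change.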
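Coming up tomorrow
Day 167: Eval System (Part 2): LLM-as-Judge. Deterministic checks cannot reach semantics, style, or compliance reasoning; those need an LLM judge. That session focuses on judge design, position bias, length bias, and calibration.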