Expert Day 166

Eval System (Part 1): Deterministic Evals


Date: 2026-10-14
Track: AI Systems Engineering / Eval / Quality
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #Eval #DeterministicTest #JSONSchema #Pytest #FinanceAI


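Today's goals

| Type | Content |
|---|---|
| Study | The three-layer LLM eval pyramid (deterministic → LLM-judge → human); deterministic check types (regex / structural / numeric / schema / contains / exact); why deterministic evals are the foundation |
| Practice | Write 20 deterministic evals for a financial LLM application: JSON structure, required fields, numeric compliance, sensitive-word checks, tool call arguments |
| Deliverable | docs/ai-infra/det_evals.py: a complete eval suite that runs under pytest |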

1. Core concepts

1.1 The eval pyramid (memorize this)

                 ┌──────────────────┐
                 │  Human review    │  expensive, slow, highest fidelity
                 │   (1% traffic)   │  1 min/case × $30/h ≈ $0.50/case
                 ├──────────────────┤
                 │  LLM-as-judge    │  moderate cost, prone to bias
                 │   (10% traffic)  │  $0.001-0.01/case
                 ├──────────────────┤
                 │  Deterministic   │  nearly free, broadest coverage
                 │   (100% traffic) │  $0/case, milliseconds
                 └──────────────────┘

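Core principle: if a property can be checked deterministically, never hand it to an LLM judge. For example, to check whether the output is valid JSON, don't pay GPT-4 to decide; just call json.loads().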
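
A minimal sketch of the check this principle points at: parse the raw output with the standard library and report why it failed. The function name `is_valid_json` and the `(passed, reason)` return shape are illustrative choices, not part of the original suite.

```python
import json


def is_valid_json(text: str) -> tuple[bool, str]:
    """Return (passed, reason) for a raw model output string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where parsing failed
        return False, f"invalid JSON at char {exc.pos}: {exc.msg}"
    return True, ""
```

The check costs nothing per call and has no judgment involved, which is why it belongs in the bottom layer of the pyramid.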

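1.2 Deterministic eval types

| Type | What it checks | Example |
|---|---|---|
| Schema | Output conforms to a JSON Schema | pydantic / jsonschema validation |
| Regex | String matches a regular expression | r"^订单号: \d{16}$" (order number line) |
| Contains/NotContains | Contains / does not contain a given string | No "我无法回答" ("I cannot answer"); includes a concrete amount |
| Numeric range | Value lies in the legal range | 0 <= LTV <= 1.0 |
| Length | Token / character count | Output under 500 characters |
| Structural | Structure is right (row count, column count) | Table has 5 columns |
| Tool call shape | Tool call arguments are complete | Required fields != null |
| Citation | Cited source really exists | Citation is in the KB index |
| Determinism | Same input gives the same output across runs (temp=0) | Hash comparison |
| No PII leak | No phone number / ID number | Regex scan |
| Latency | P95 < threshold | Timing assertion |
| Cost cap | Tokens per call stay under the cap | usage check |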
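
Three rows of the table can be sketched in a few lines each. The helper names below are hypothetical; the order-number pattern and the LTV bounds come from the table's examples.

```python
import re

# Pattern from the Regex row; fullmatch() makes the ^...$ anchors unnecessary.
ORDER_LINE = re.compile(r"订单号: \d{16}")


def matches_order_line(line: str) -> bool:
    """Regex check: the line is exactly an order-number line with 16 digits."""
    return ORDER_LINE.fullmatch(line) is not None


def ltv_in_range(ltv: float) -> bool:
    """Numeric range check: loan-to-value ratio must be within [0, 1]."""
    return 0.0 <= ltv <= 1.0


def contains_none(text: str, banned: list[str]) -> bool:
    """NotContains check: none of the banned phrases appears in the text."""
    return not any(phrase in text for phrase in banned)
```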

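1.3 Why deterministic evals are the foundation

  1. Not contaminated by LLM-judge bias: "is this valid JSON" has no ambiguity
  2. Runs in CI: integrates with pytest, and the PR must pass
  3. Fast feedback: 1,000 cases finish in about a second
  4. Serves as ground truth for the LLM judge: lets you check whether the judge is well calibrated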
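
Point 4 can be made concrete with a small agreement report. This is a sketch under assumptions: both layers have been run on the same cases, each verdict is a boolean, and the function and metric names are illustrative.

```python
def judge_agreement(det: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare LLM-judge verdicts against deterministic verdicts on the same cases."""
    if not det or len(det) != len(judge):
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(det)
    agree = sum(d == j for d, j in zip(det, judge))
    # Judge passed a case that the deterministic check failed
    false_pass = sum((not d) and j for d, j in zip(det, judge))
    # Judge failed a case that the deterministic check passed
    false_fail = sum(d and (not j) for d, j in zip(det, judge))
    return {
        "agreement": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }
```

A judge that disagrees often on cases where the deterministic answer is unambiguous is unlikely to be trustworthy on the cases only it can score.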

2. Production architecture

   PR opened / prompt change
        │
        ▼
   ┌─────────────────────┐
   │   CI Pipeline       │
   │                     │
   │  1. Lint / Type     │
   │  2. Det Evals  ◀── today's work (must pass)
   │  3. LLM-judge       │
   │  4. Latency check   │
   └─────────────────────┘
        │ pass
        ▼
   Merge → canary (1%)
        │
        ▼
   Daily sampled run on production traffic → ClickHouse → alerts
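
One way to pick the daily production sample reproducibly is to hash a request identifier into a bucket. This is a sketch under assumptions: the `request_id` argument and the 1% default rate are illustrative, and writing results to ClickHouse and alerting are left out.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically decide whether a request belongs to the eval sample."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Integer comparison avoids float rounding at the edges (rate=0.0 and rate=1.0)
    return bucket < int(rate * 2**64)
```

Because the decision depends only on the identifier, re-running the job over the same day's traffic selects the same cases, which keeps day-over-day comparisons stable.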

3. Implementation: 20 deterministic evals

"""det_evals.py: deterministic eval suite for a financial LLM application
Dependencies: pip install pytest pydantic jsonschema regex
Run: pytest det_evals.py -v
"""
from __future__ import annotations

import json
import re
import time
import hashlib
from typing import Any, Callable, Literal
from dataclasses import dataclass

import pytest
from pydantic import BaseModel, Field, ValidationError

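# ────────────────────── Stub system under test (your LLM application) ──────────────────────

def llm_kyc_extract(doc: str) -> dict:
    """Stub: pretend to call your prompt + LLM and return the KYC extraction result."""
    # In a real setup, replace this with a real model call,
    # e.g. client.messages.create(...) from the Anthropic SDK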
    return {
        "name": "张三",
        "id_number": "110101199001011234",
        "phone": "13800138000",
        "annual_income": 150000,
        "risk_level": "M2",
        "occupation": "教师",
    }


def llm_credit_decision(profile: dict) -> dict:
    return {
        "decision": "APPROVE",
        "credit_limit": 50000,
        "interest_rate": 0.085,
        "reason": "客户征信良好,月收入稳定",
    }


def llm_compliance_check(text: str) -> dict:
    return {
        "flagged": False,
        "categories": [],
        "confidence": 0.92,
    }


def llm_research_summary(report: str, max_words: int = 200) -> str:
    return "(这里返回研报摘要,约 200 字)" + "X" * 380  # stub summary text, padded to the target length


def llm_tool_call_for_query(q: str) -> list[dict]:
    return [
        {"name": "get_market_data", "input": {"ticker": "AAPL", "interval": "1d"}},
        {"name": "get_company_news", "input": {"company": "Apple Inc."}},
    ]


# ────────────────────── Eval utilities: dataclass-based EvalResult ──────────────────────

@dataclass
class EvalResult:
    name: str
    passed: bool
    details: str = ""


def assert_pass(passed: bool, msg: str):
    assert passed, msg


# ══════════════════════════════════════════════════════════════════
# Schema checks (5)
# ══════════════════════════════════════════════════════════════════

class KYCExtraction(BaseModel):
    name: str = Field(min_length=1, max_length=50)
    id_number: str = Field(pattern=r"^\d{17}[\dX]$")  # 18-character resident ID number
    phone: str = Field(pattern=r"^1[3-9]\d{9}$")
    annual_income: int = Field(ge=0, le=100_000_000)
    risk_level: Literal["L1", "L2", "M1", "M2", "H1", "H2"]
    occupation: str


def test_eval_01_kyc_schema_valid():
    """[Schema] KYC extraction output conforms to the pydantic schema."""
    out = llm_kyc_extract("customer ID document...")
    try:
        KYCExtraction(**out)
    except ValidationError as e:
        pytest.fail(f"Schema invalid: {e}")


def test_eval_02_kyc_no_extra_keys():
    """[Schema] KYC output must not carry fields outside the schema (guards against prompt injection smuggling data in)."""
    out = llm_kyc_extract("doc")
    allowed = set(KYCExtraction.model_fields.keys())
    extra = set(out.keys()) - allowed
    assert not extra, f"Unexpected keys: {extra}"


CREDIT_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["APPROVE", "REJECT", "REFER"]},
        "credit_limit": {"type": "integer", "minimum": 0, "maximum": 10_000_000},
        # Coarse outer bound (the older 36% private-lending line);
        # test_eval_12 enforces the tighter 4x LPR cap.
        "interest_rate": {"type": "number", "minimum": 0.0, "maximum": 0.36},
        "reason": {"type": "string", "minLength": 5},
    },
    "required": ["decision", "credit_limit", "interest_rate", "reason"],
    "additionalProperties": False,
}


def test_eval_03_credit_decision_jsonschema():
    """[Schema] Credit decision validates against the JSON Schema."""
    import jsonschema
    out = llm_credit_decision({"name": "x"})
    jsonschema.validate(out, CREDIT_SCHEMA)


def test_eval_04_compliance_categories_typed():
    """[Schema] Compliance `categories` must be a list whose items come from a predefined enum."""
    out = llm_compliance_check("some transaction")
    cats = out.get("categories")
    # A missing key should fail rather than silently pass as an empty list
    assert isinstance(cats, list), "categories missing or not a list"
    allowed = {"AML", "SANCTIONS", "PEP", "TERROR", "FRAUD"}
    for c in cats:
        assert c in allowed, f"Unknown category: {c}"


def test_eval_05_tool_call_shape():
    """[Schema] Every item in the tool call list has both a `name` and an `input` key."""
    calls = llm_tool_call_for_query("get AAPL quotes and news")
    assert isinstance(calls, list)
    for c in calls:
        assert set(c.keys()) >= {"name", "input"}, f"missing keys in {c}"
        assert isinstance(c["input"], dict)


# ══════════════════════════════════════════════════════════════════
# Regex / Contains checks (5)
# ══════════════════════════════════════════════════════════════════

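# Digit lookarounds instead of \b: in Python 3, CJK characters count as word
# characters, so \b would miss a number glued directly to Chinese text.
PII_PATTERNS = [
    (r"(?<!\d)1[3-9]\d{9}(?!\d)", "phone"),
    (r"(?<!\d)\d{17}[\dX](?!\d)", "id_number"),
    (r"(?<!\d)\d{16,19}(?!\d)", "card_number"),
]


def test_eval_06_summary_no_phone_leak():
    """[NotContains] Research summary must not contain a phone, ID, or card number."""
    s = llm_research_summary("xxx")
    for pat, label in PII_PATTERNS:
        m = re.search(pat, s)
        assert not m, f"PII leak ({label}): {m.group()}"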


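def test_eval_07_credit_reason_must_contain_basis():
    """[Contains] The credit decision `reason` must reference at least one quantitative basis."""
    out = llm_credit_decision({"name": "x"})
    # Keywords stay in Chinese because the model output is Chinese:
    # credit record, income, liabilities, score, cash flow, assets
    keywords = ["征信", "收入", "负债", "评分", "流水", "资产"]
    assert any(k in out["reason"] for k in keywords), \
        f"reason lacks a quantitative basis: {out['reason']}"


def test_eval_08_no_refusal_in_normal_query():
    """[NotContains] A normal query should not come back with a refusal such as 'I cannot' / '我无法'."""
    s = llm_research_summary("Analyze Apple's fundamentals")
    refusals = ["I cannot", "I'm unable", "我无法", "抱歉,我不能", "对不起,我无"]
    for r in refusals:
        assert r not in s, f"Unexpected refusal: contains '{r}'"


def test_eval_09_id_number_format():
    """[Regex] ID number must be exactly 18 characters."""
    out = llm_kyc_extract("doc")
    # fullmatch() also rejects a trailing newline, which ^...$ would accept
    assert re.fullmatch(r"\d{17}[\dX]", out["id_number"])


def test_eval_10_currency_format():
    """[Regex] Amounts are formatted like ¥1,234,567.89."""
    # Stand-in for real model output: "credit limit ¥50,000.00, monthly rate 0.708%"
    text = "授信额度 ¥50,000.00, 月利率 0.708%"
    assert re.search(r"¥\d{1,3}(,\d{3})*\.\d{2}", text)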


# ══════════════════════════════════════════════════════════════════
# Numeric range checks (4)
# ══════════════════════════════════════════════════════════════════

def test_eval_11_credit_limit_in_range():
    out = llm_credit_decision({"name": "x"})
    assert 0 <= out["credit_limit"] <= 5_000_000, "credit limit out of range"


def test_eval_12_interest_rate_legal():
    """Interest rate must not exceed 4x LPR (the judicial protection cap in mainland China)."""
    out = llm_credit_decision({"name": "x"})
    LPR = 0.0385  # assumed LPR of 3.85%; load the current value in production
    LPR_4X = 4 * LPR  # 15.4%
    assert 0 < out["interest_rate"] <= LPR_4X, \
        f"interest rate {out['interest_rate']} exceeds the judicial protection cap"


def test_eval_13_confidence_score():
    out = llm_compliance_check("xxx")
    assert 0.0 <= out["confidence"] <= 1.0


def test_eval_14_income_reasonable():
    """Annual income < 100 million (guards against hallucinated, absurd figures)."""
    out = llm_kyc_extract("doc")
    assert out["annual_income"] < 100_000_000


# ══════════════════════════════════════════════════════════════════
# Length / structural checks (3)
# ══════════════════════════════════════════════════════════════════

def test_eval_15_summary_length():
    """[Length] Research summary is 100-400 characters."""
    s = llm_research_summary("xxx", max_words=200)
    assert 100 <= len(s) <= 400, f"length={len(s)} out of [100,400]"


def test_eval_16_token_count_cap():
    """[Cost] Input to a single LLM call stays under 8K tokens (cost control)."""
    # Crude character-based estimate; swap in a real tokenizer such as tiktoken
    # for accurate counts.
    text = "x" * 1000  # stand-in for the real prompt
    # Assumed ~3 chars/token; the true ratio depends on tokenizer and language
    rough_tokens = len(text) // 3
    assert rough_tokens < 8000


def test_eval_17_at_most_n_tool_calls():
    calls = llm_tool_call_for_query("xxx")
    assert len(calls) <= 5, f"too many tool calls ({len(calls)}); the agent loop may be running away"


# ══════════════════════════════════════════════════════════════════
# Determinism / consistency checks (3)
# ══════════════════════════════════════════════════════════════════

def test_eval_18_deterministic_with_temp_zero():
    """[Determinism] Same input at temp=0, called repeatedly, yields the same output hash."""
    def call_with_seed():
        # In a real setup: call the model with temperature=0
        return llm_kyc_extract("doc")
    h1 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    h2 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    assert h1 == h2, "output differs even at temp=0; check whether the prompt contains dynamic fields"


def test_eval_19_idempotent_tool_call():
    """[Determinism] Calling the agent twice with the same query yields the same tool calls, in the same order with the same arguments."""
    a = llm_tool_call_for_query("get AAPL quotes and news")
    b = llm_tool_call_for_query("get AAPL quotes and news")
    assert a == b


def test_eval_20_latency_under_budget():
    """[Latency] P95 of a single call < 2 seconds."""
    times = []
    for _ in range(20):
        t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        llm_kyc_extract("doc")
        times.append(time.perf_counter() - t0)
    p95 = sorted(times)[int(0.95 * len(times))]  # with 20 samples this is the slowest call
    assert p95 < 2.0, f"P95 latency {p95:.2f}s > 2.0s"


# ══════════════════════════════════════════════════════════════════
# Entry point (running `python det_evals.py` directly also works)
# ══════════════════════════════════════════════════════════════════
if __name__ == "__main__":
    pytest.main([__file__, "-v", "--tb=short"])

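3.2 A working eval framework wired to LLM output

"""eval_runner.py: run a deterministic eval suite over a dataset and produce a report"""
import json
from pathlib import Path
from collections import defaultdict
from typing import Callable  # needed by the run_evals type hints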


def run_evals(dataset: list[dict], evals: list[Callable], llm_fn: Callable):
    """
    dataset: [{"id": "case_1", "input": "...", "expected": {...}}, ...]
    evals  : [callable(output, expected) -> (pass: bool, reason: str), ...]
    llm_fn : (input) -> output
    """
    report = defaultdict(lambda: {"pass": 0, "fail": 0, "errors": []})
    for case in dataset:
        try:
            out = llm_fn(case["input"])
        except Exception as e:
            report["__system__"]["fail"] += 1
            report["__system__"]["errors"].append(f"{case['id']}: {e}")
            continue
        for ev in evals:
            try:
                passed, reason = ev(out, case.get("expected"))
                key = ev.__name__
                if passed:
                    report[key]["pass"] += 1
                else:
                    report[key]["fail"] += 1
                    report[key]["errors"].append(f"{case['id']}: {reason}")
            except Exception as e:
                report[ev.__name__]["fail"] += 1
                report[ev.__name__]["errors"].append(f"{case['id']}: exception {e}")
    return dict(report)


# Render a markdown report
def render_report(report: dict, out_path: str = "eval_report.md"):
    lines = ["# Deterministic Eval Report\n"]
    total_p = total_f = 0
    for ev, stat in report.items():
        p = stat["pass"]; f = stat["fail"]
        total_p += p; total_f += f
        rate = p / max(p + f, 1) * 100
        lines.append(f"## {ev}\n- Pass: {p}\n- Fail: {f}\n- Rate: {rate:.1f}%\n")
        if stat["errors"][:3]:
            lines.append("Sample failures:\n```\n" + "\n".join(stat["errors"][:3]) + "\n```\n")
    total = total_p + total_f
    # max(total, 1) avoids a ZeroDivisionError on an empty report
    lines.insert(1, f"\n**Overall**: {total_p}/{total} = "
                    f"{total_p / max(total, 1) * 100:.1f}%\n")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
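
The runner expects each eval to follow the `(output, expected) -> (passed, reason)` contract. Two illustrative evals that fit it might look like the sketch below; the function names are hypothetical, and the required field set mirrors the credit decision schema from the suite above.

```python
from typing import Optional

REQUIRED_FIELDS = {"decision", "credit_limit", "interest_rate", "reason"}


def has_required_fields(output: dict, expected: Optional[dict]) -> tuple[bool, str]:
    """Pass when every required field is present; `expected` is unused here."""
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, ""


def decision_matches_expected(output: dict, expected: Optional[dict]) -> tuple[bool, str]:
    """Pass when the decision equals the labeled one; skip cases with no label."""
    if not expected or "decision" not in expected:
        return True, "no expectation for this case"
    got, want = output.get("decision"), expected["decision"]
    if got != want:
        return False, f"decision {got!r} != {want!r}"
    return True, ""
```

Because the runner keys its report on `ev.__name__`, plain named functions like these produce readable section titles; lambdas would all collapse into a single `<lambda>` entry.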

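4. Cost & performance: measured data

| Eval type | Time for 1,000 cases | Cost per 1,000 cases |
|---|---|---|
| Deterministic suite (all 20) | 1.2 s | $0 |
| LLM judge (claude-haiku-4-5) | ~60 s | $0.40 |
| LLM judge (claude-opus-4-7) | ~200 s | $4.50 |
| Human review | 16 hr | $480 |

Running deterministic evals on 100% of traffic plus CI is still nearly free, which is the fundamental reason they are the foundation.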
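
To see how the pyramid's traffic shares combine with these costs, here is a rough calculation. It assumes the cost column is the cost of a 1,000-case run, uses the claude-haiku-4-5 row for the judge layer, and takes the 100% / 10% / 1% shares from the pyramid in section 1.1; the names are illustrative.

```python
# Cost of evaluating 1,000 cases at each layer, taken from the table above
COST_PER_1000 = {"deterministic": 0.0, "llm_judge": 0.40, "human": 480.0}
# Share of traffic each layer sees, taken from the eval pyramid
TRAFFIC_SHARE = {"deterministic": 1.0, "llm_judge": 0.10, "human": 0.01}


def blended_cost(n_outputs: int) -> dict[str, float]:
    """Eval cost in dollars per layer for n_outputs production outputs."""
    return {
        layer: n_outputs * TRAFFIC_SHARE[layer] * COST_PER_1000[layer] / 1000
        for layer in COST_PER_1000
    }
```

For 1,000 production outputs this comes to roughly $0.04 for the judge layer and $4.80 for human review, while the deterministic layer covers every output at no cost.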


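5. Applications in finance

  1. KYC extraction: ID number, phone number, and address must pass a strict schema plus regex
  2. Credit decisions: the interest rate must not exceed the judicial protection cap (4× LPR), and the limit must not exceed the regulatory cap
  3. Anti-money-laundering screening: output categories must come from a defined enum (AML/SANCTIONS/PEP) so downstream systems can process them as structured data
  4. Wealth-management recommendations: banned-word scan for prohibited promises such as "保本" (principal guaranteed), "稳赚" (sure profit), "一定" (certainly), "绝对" (absolutely)
  5. Disclosure reports: length + required sections + no PII (PII regex scan)
  6. Ticket classification: the tool_call must carry the correct ticket category and priority, and the category must come from a closed enum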
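
The banned-word scan in item 4 is a plain NotContains check. A minimal sketch, assuming the four phrases listed above are the whole blocklist (a production list would be longer and owned by compliance); the function name is illustrative.

```python
# Prohibited promise phrases from item 4
BANNED_PROMISES = ("保本", "稳赚", "一定", "绝对")


def find_banned_promises(text: str, banned: tuple[str, ...] = BANNED_PROMISES) -> list[str]:
    """Return the banned phrases that appear in the text, in blocklist order."""
    return [phrase for phrase in banned if phrase in text]
```

Returning the matched phrases rather than a bare boolean makes the failure message in a report immediately actionable.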

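6. Production lessons and pitfalls

  1. temp=0 does not guarantee determinism: Anthropic API output can still vary slightly, for example when the backend batches requests differently. For hash assertions, pair temp=0 with a seed parameter where the serving stack exposes one (vLLM does; not every hosted API does), and in production assert "approximately consistent" rather than a strict hash match
  2. An overly strict schema rejects legitimate variants: a phone number, for example, may arrive with or without hyphens. Collect the real output distribution first, then write the schema
  3. Regex PII scans miss masked forms: an ID number with its last 4 digits replaced by ****, or a phone number written as 138****0000, are both common. Add the masked formats to PII detection as well
  4. Eval false positives can outnumber true positives: a model output like "this year, in 2025, we will..." trips a "year mismatch" check. Evals need time zone and date normalization
  5. Running everything in CI is too slow: 1,000 deterministic cases in about 1 s is fine, but the LLM-judge portion needs tiering (a sample on PRs, the full set after merge)
  6. Eval drift: business rules change and the evals don't follow. Put evals under product-owner review and clean them up every six months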
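
Pitfall 3 can be covered by adding patterns for masked values next to the raw ones. The sketch below assumes two specific mask shapes, a phone number with its middle four digits starred and an 18-character ID number with its last four characters starred; real masking conventions vary, so treat both patterns as placeholders.

```python
import re

MASKED_PII_PATTERNS = [
    # Phone such as 138****0000: 3 leading digits, 4 stars, 4 trailing digits
    (re.compile(r"(?<!\d)1[3-9]\d\*{4}\d{4}(?!\d)"), "masked_phone"),
    # ID number with the last 4 characters starred: 14 digits followed by 4 stars
    (re.compile(r"(?<!\d)\d{14}\*{4}(?!\*)"), "masked_id_number"),
]


def find_masked_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) for every masked PII pattern found."""
    return [
        (label, m.group())
        for pattern, label in MASKED_PII_PATTERNS
        for m in pattern.finditer(text)
    ]
```

Whether a masked value counts as a leak is a policy decision; the scan only makes sure the case is surfaced instead of slipping past the raw-digit patterns.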

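7. Quick reference

| Library | Purpose |
|---|---|
| pydantic | schema validation |
| jsonschema | standard JSON Schema validation |
| regex | regular expressions (third-party module with features beyond the standard re) |
| pytest | test framework |
| tiktoken | token estimation |
| hypothesis | property-based fuzz testing |

| Failure | Handling policy |
|---|---|
| Schema fail | block deploy |
| Refusal-phrase regex hit | block deploy |
| Length over cap | warn |
| Latency P95 over budget | warn + monitor |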
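
The failure policy can be encoded as data so the CI gate stays a few lines long. This is a sketch: the check identifiers are hypothetical names for the four rows above, and treating an unknown check as blocking is an assumption, not something the notes specify.

```python
# Policy per check, mirroring the failure handling table
POLICY = {
    "schema": "block",
    "refusal_regex": "block",
    "length_cap": "warn",
    "latency_p95": "warn",
}


def gate(failed_checks: list[str]) -> tuple[bool, list[str]]:
    """Return (deploy_allowed, warnings) for the checks that failed."""
    # Unknown checks default to "block" so a new eval cannot fail silently
    blockers = [c for c in failed_checks if POLICY.get(c, "block") == "block"]
    warnings = [c for c in failed_checks if POLICY.get(c) == "warn"]
    return (not blockers, warnings)
```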

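8. Interview questions

  1. Why do deterministic evals take priority over LLM-judge?

    • Nearly free, runnable in CI, free of subjective bias, millisecond-level feedback, and usable as reference truth for the LLM judge; only what cannot be checked deterministically (semantics, style) goes to an LLM judge
  2. How would you design evals for an LLM that does financial KYC extraction?

    • Schema (pydantic field types) + Regex (ID number / phone format) + Numeric range (plausible annual income) + NotContains (no PII in the reason field) + Determinism (same input twice, same output) + cost/latency assertions
  3. How do you handle LLM output that includes "today's date" and makes the eval non-repeatable?

    • Strip dynamic fields (dates, order numbers) from the output before hashing; or pin the date in the prompt ("assume today is 2026-01-01") and mock it at eval time
  4. All 20 evals pass, yet problems still show up after launch. Why?

    • Out-of-distribution inputs the evals do not cover (adversarial input, the long tail of real users); deterministic checks verify format, not semantics; the upper layers, LLM-judge and human review, are missing
  5. What if eval maintenance costs too much?

    • Ship evals in the same PR as the prompt; set up an "eval owner" role; clean up evals every six months; automatically add new evals sampled from production; keep two-way traceability with the product rules documents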
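
The first answer to question 3 can be sketched as a normalize-then-hash helper. The dynamic key names and the ISO date pattern below are assumptions for illustration; a real application would list its own dynamic fields.

```python
import hashlib
import json
import re

# Hypothetical keys whose values legitimately change between runs
DYNAMIC_KEYS = {"order_id", "generated_at"}
# ISO-style dates embedded in free text, e.g. 2026-01-01
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def stable_hash(output: dict) -> str:
    """Hash an output after dropping dynamic keys and masking embedded dates."""
    cleaned = {k: v for k, v in output.items() if k not in DYNAMIC_KEYS}
    text = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
    text = DATE_RE.sub("[DATE]", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two outputs that differ only in their order number or embedded date then hash identically, while a change in any substantive field still changes the hash.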

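Coming up tomorrow

Day 167: Eval System (Part 2): LLM-as-Judge. The "semantics", "style", and "compliance reasoning" that deterministic checks cannot reach need an LLM judge. That session focuses on judge design, position bias, length bias, and calibration.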