Expert Day 166

Eval System (1): Deterministic Evaluation

Date: 2026-10-14
Direction: AI Systems Engineering / Eval / Quality
Phase: Phase 3 - Production Infrastructure and Evaluation (Day 163-176)
Tags: #Eval #DeterministicTest #JSONSchema #Pytest #FinanceAI

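Today's Goals

Type	Content
Study	The three-layer LLM eval pyramid (deterministic → LLM-judge → human); deterministic check types (regex/structural/numeric/schema/contains/exact); why deterministic evals are the foundation
Practice	Write 20 deterministic evals for financial LLM applications: JSON structure, required fields, numeric compliance, sensitive-word checks, tool call arguments
Deliverable	`docs/ai-infra/det_evals.py`: a complete eval suite that passes under pytest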

1. Core Concepts

1.1 The Eval Pyramid (memorize this)

                 ┌──────────────────┐
                 │  Human review    │  expensive, slow, highest fidelity
                 │  (1% of traffic) │  1 min/case × $30/h
                 ├──────────────────┤
                 │  LLM-as-judge    │  moderate cost, many biases
                 │ (10% of traffic) │  $0.001-0.01/case
                 ├──────────────────┤
                 │  Deterministic   │  nearly free, broadest coverage
                 │ (100% of traffic)│  $0/case, milliseconds
                 └──────────────────┘

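Core principle: anything that can be tested deterministically should never go to an LLM judge. For example, to check whether the output is valid JSON, don't pay GPT-4 to decide; just call json.loads().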
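
As a minimal sketch of that principle, the check below decides JSON validity with the standard library alone. The helper name `is_valid_json` and its `(passed, reason)` return shape are illustrative assumptions, not part of the suite in section 3.

```python
import json


def is_valid_json(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for a raw model response."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where parsing failed
        return False, f"invalid JSON at position {exc.pos}: {exc.msg}"
    return True, ""


# Hypothetical model responses, for illustration only
print(is_valid_json('{"decision": "APPROVE"}'))
print(is_valid_json('{"decision": APPROVE}'))
```

The check costs no API call and gives the same answer every time, which is what makes it suitable for running on every response.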

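1.2 Types of Deterministic Evals

Type	What it checks	Example
Schema	Output conforms to a JSON Schema	pydantic / jsonschema validation
Regex	String matches a regular expression	`r"^订单号: \d{16}$"` (订单号 = "order number")
Contains/NotContains	Contains / does not contain specific strings	Must not contain "我无法回答" ("I cannot answer"); must contain a concrete amount
Numeric range	Value lies within a legal range	0 <= LTV <= 1.0
Length	Token / character count	Output < 500 characters
Structural	Structure is right (row count, column count)	Table has 5 columns
Tool call shape	Tool call arguments are complete	Required fields != null
Citation	Cited sources actually exist	Citation is in the KB index
Determinism	Same input yields the same output across runs (temp=0)	Hash comparison
No PII leak	No phone numbers / ID numbers	Regex scan
Latency	P95 < threshold	Timing assertion
Cost cap	Tokens per call stay under the cap	usage check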
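
Several rows of the table reduce to small reusable predicates. The sketch below shows one possible shape for three of them (Regex, NotContains, Length); the factory names `matches`, `not_contains`, `max_chars` and the runner `run_checks` are hypothetical, chosen only for illustration.

```python
import re
from typing import Callable

# A check takes the raw model output and returns (passed, reason).
Check = Callable[[str], tuple[bool, str]]


def matches(pattern: str) -> Check:
    """Regex check: the whole output must match `pattern`."""
    compiled = re.compile(pattern)

    def check(text: str) -> tuple[bool, str]:
        ok = compiled.fullmatch(text) is not None
        return ok, ("" if ok else f"does not match {pattern!r}")

    return check


def not_contains(*banned: str) -> Check:
    """NotContains check: none of the banned strings may appear."""
    def check(text: str) -> tuple[bool, str]:
        hits = [b for b in banned if b in text]
        return not hits, (f"contains banned strings: {hits}" if hits else "")

    return check


def max_chars(limit: int) -> Check:
    """Length check: output must be shorter than `limit` characters."""
    def check(text: str) -> tuple[bool, str]:
        ok = len(text) < limit
        return ok, ("" if ok else f"length {len(text)} >= {limit}")

    return check


def run_checks(text: str, checks: list[Check]) -> list[str]:
    """Return the failure reasons; an empty list means every check passed."""
    return [reason for passed, reason in (c(text) for c in checks) if not passed]


checks = [matches(r"^订单号: \d{16}$"), not_contains("我无法回答"), max_chars(500)]
print(run_checks("订单号: 1234567890123456", checks))
print(run_checks("我无法回答", checks))
```

Returning a reason string alongside the boolean keeps failures readable in CI logs without a debugger.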

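1.3 Why Deterministic Evals Are the Foundation

Immune to LLM-judge bias: "is this valid JSON" has no ambiguity
Runs in CI: integrates with pytest; every PR must pass
Fast feedback: 1,000 cases run in about a second
Serves as ground truth for the LLM judge: use it to check whether the judge is well calibrated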
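
The last point can be made concrete: on cases where a deterministic check already gives the right answer, the judge's verdicts can be scored against it. This is a rough sketch under that assumption; the function name `judge_agreement` and the sample labels are made up for illustration.

```python
def judge_agreement(det_labels: list[bool], judge_labels: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with deterministic ground truth (True = pass)."""
    if not det_labels or len(det_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(det_labels)
    pairs = list(zip(det_labels, judge_labels))
    agree = sum(d == j for d, j in pairs)
    # Judge passed an output that the deterministic check failed
    false_pass = sum((not d) and j for d, j in pairs)
    # Judge failed an output that the deterministic check passed
    false_fail = sum(d and (not j) for d, j in pairs)
    return {
        "agreement": agree / n,
        "false_pass_rate": false_pass / n,
        "false_fail_rate": false_fail / n,
    }


# Hypothetical labels for five cases
det = [True, True, False, False, True]
judge = [True, False, True, False, True]
print(judge_agreement(det, judge))
```

A high false-pass rate is usually the more worrying number, since it means the judge waves through outputs that a free check would have caught.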

2. Production Architecture

   PR submitted / prompt changed
        │
        ▼
   ┌─────────────────────┐
   │   CI Pipeline       │
   │                     │
   │  1. Lint / Type     │
   │  2. Det Evals  ◀── today's work (must pass)
   │  3. LLM-judge       │
   │  4. Latency check   │
   └─────────────────────┘
        │ pass
        ▼
   Merge → canary (1%)
        │
        ▼
   Daily sampled run on production traffic → ClickHouse → alerts
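
For the daily sampled run in the last step, one common approach is to sample by hashing a stable request ID, so the same request always falls in or out of the sample regardless of which worker sees it. The sketch below assumes such an ID exists; `in_sample` and the `req-…` IDs are hypothetical.

```python
import hashlib


def in_sample(case_id: str, rate: float) -> bool:
    """Deterministically decide whether a case belongs to a `rate` sample."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical request IDs; roughly 10% should be selected
ids = [f"req-{i}" for i in range(10_000)]
picked = [i for i in ids if in_sample(i, 0.10)]
print(len(picked))
```

Because the decision depends only on the ID, re-running the job over the same day's traffic evaluates the same cases, which keeps day-over-day comparisons meaningful.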

3. Code Implementation: 20 Deterministic Evals

"""det_evals.py: deterministic eval suite for financial LLM applications
Dependencies: pip install pytest pydantic jsonschema regex
Run: pytest det_evals.py -v
"""
from __future__ import annotations

import json
import re
import time
import hashlib
from typing import Any, Callable, Literal
from dataclasses import dataclass

import pytest
from pydantic import BaseModel, Field, ValidationError

# ────────────────────── Mock system under test (your LLM application) ──────────────────────

def llm_kyc_extract(doc: str) -> dict:
    """Stub: stands in for calling your prompt + LLM and returning the KYC extraction result"""
    # In a real environment, replace with an actual call such as anthropic.Anthropic().messages.create(...)
    return {
        "name": "张三",
        "id_number": "110101199001011234",
        "phone": "13800138000",
        "annual_income": 150000,
        "risk_level": "M2",
        "occupation": "教师",  # "teacher"
    }


def llm_credit_decision(profile: dict) -> dict:
    return {
        "decision": "APPROVE",
        "credit_limit": 50000,
        "interest_rate": 0.085,
        # Mock output stays in Chinese: "good credit record, stable monthly income"
        "reason": "客户征信良好，月收入稳定",
    }


def llm_compliance_check(text: str) -> dict:
    return {
        "flagged": False,
        "categories": [],
        "confidence": 0.92,
    }


def llm_research_summary(report: str, max_words: int = 200) -> str:
    return "(Research summary goes here, about 200 characters.) " + "X" * 300  # pad to length


def llm_tool_call_for_query(q: str) -> list[dict]:
    return [
        {"name": "get_market_data", "input": {"ticker": "AAPL", "interval": "1d"}},
        {"name": "get_company_news", "input": {"company": "Apple Inc."}},
    ]


# ────────────────────── Eval utilities: dataclass-based EvalResult ──────────────────────

@dataclass
class EvalResult:
    name: str
    passed: bool
    details: str = ""


def assert_pass(passed: bool, msg: str):
    assert passed, msg


# ══════════════════════════════════════════════════════════════════
# Schema checks (5)
# ══════════════════════════════════════════════════════════════════

class KYCExtraction(BaseModel):
    name: str = Field(min_length=1, max_length=50)
    id_number: str = Field(pattern=r"^\d{17}[\dX]$")  # 18-character resident ID number
    phone: str = Field(pattern=r"^1[3-9]\d{9}$")
    annual_income: int = Field(ge=0, le=100_000_000)
    risk_level: Literal["L1", "L2", "M1", "M2", "H1", "H2"]
    occupation: str


def test_eval_01_kyc_schema_valid():
    """[Schema] The KYC extraction result conforms to the pydantic schema"""
    out = llm_kyc_extract("some customer document...")
    try:
        KYCExtraction(**out)
    except ValidationError as e:
        pytest.fail(f"Schema invalid: {e}")


def test_eval_02_kyc_no_extra_keys():
    """[Schema] KYC must not return fields outside the schema (guards against prompt injection smuggling in data)"""
    out = llm_kyc_extract("doc")
    allowed = set(KYCExtraction.model_fields.keys())
    extra = set(out.keys()) - allowed
    assert not extra, f"Unexpected keys: {extra}"


CREDIT_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["APPROVE", "REJECT", "REFER"]},
        "credit_limit": {"type": "integer", "minimum": 0, "maximum": 10_000_000},
        # 36%: ceiling for private lending in China; the tighter 4x LPR bound is asserted in test_eval_12
        "interest_rate": {"type": "number", "minimum": 0.0, "maximum": 0.36},
        "reason": {"type": "string", "minLength": 5},
    },
    "required": ["decision", "credit_limit", "interest_rate", "reason"],
    "additionalProperties": False,
}


def test_eval_03_credit_decision_jsonschema():
    """[Schema] Validate the credit decision against a JSON Schema"""
    import jsonschema
    out = llm_credit_decision({"name": "x"})
    jsonschema.validate(out, CREDIT_SCHEMA)


def test_eval_04_compliance_categories_typed():
    """[Schema] The compliance check's categories must be a list whose items come from a predefined enum"""
    out = llm_compliance_check("some transaction")
    assert isinstance(out.get("categories", []), list)
    allowed = {"AML", "SANCTIONS", "PEP", "TERROR", "FRAUD"}
    for c in out.get("categories", []):
        assert c in allowed, f"Unknown category: {c}"


def test_eval_05_tool_call_shape():
    """[Schema] Every item in the tool call list has both a name and an input key"""
    calls = llm_tool_call_for_query("Get AAPL quotes and news")
    assert isinstance(calls, list)
    for c in calls:
        assert set(c.keys()) >= {"name", "input"}, f"missing keys in {c}"
        assert isinstance(c["input"], dict)


# ══════════════════════════════════════════════════════════════════
# Regex / Contains checks (5)
# ══════════════════════════════════════════════════════════════════

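# Digit lookarounds instead of \b: in Python 3, \b treats CJK characters as word
# characters, so \b would miss a number glued to Chinese text (e.g. "电话13800138000").
PII_PATTERNS = [
    (r"(?<!\d)1[3-9]\d{9}(?!\d)", "phone"),
    (r"(?<!\d)\d{17}[\dX](?!\d)", "id_number"),
    (r"(?<!\d)\d{16,19}(?!\d)", "card_number"),
]


def test_eval_06_summary_no_phone_leak():
    """[NotContains] The research summary must not contain PII (phone, ID number, card number)"""
    s = llm_research_summary("xxx")
    for pat, label in PII_PATTERNS:
        m = re.search(pat, s)
        assert not m, f"PII leak ({label}): {m.group()}"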


def test_eval_07_credit_reason_must_contain_basis():
    """[Contains] The credit decision's reason must mention at least one quantitative basis"""
    out = llm_credit_decision({"name": "x"})
    # Keywords: credit record, income, debt, score, cash flow, assets
    keywords = ["征信", "收入", "负债", "评分", "流水", "资产"]
    assert any(k in out["reason"] for k in keywords), \
        f"reason lacks a quantitative basis: {out['reason']}"


def test_eval_08_no_refusal_in_normal_query():
    """[NotContains] A normal query must not come back with a refusal such as 'I cannot' / '我无法'"""
    s = llm_research_summary("Analyze Apple's fundamentals")
    refusals = ["I cannot", "I'm unable", "我无法", "抱歉，我不能", "对不起，我无"]
    for r in refusals:
        assert r not in s, f"Unexpected refusal: contains '{r}'"


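def test_eval_09_id_number_format():
    """[Regex] The ID number must be exactly 18 characters"""
    out = llm_kyc_extract("doc")
    # fullmatch avoids the trailing-newline leniency of "$" with re.match
    assert re.fullmatch(r"\d{17}[\dX]", out["id_number"])


def test_eval_10_currency_format():
    """[Regex] Amounts are formatted as ¥1,234,567.89"""
    # Fixed sample text; in a real environment this comes from the LLM output
    text = "Credit limit ¥50,000.00, monthly rate 0.708%"
    assert re.search(r"¥\d{1,3}(,\d{3})*\.\d{2}", text)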


# ══════════════════════════════════════════════════════════════════
# Numeric range (4)
# ══════════════════════════════════════════════════════════════════

def test_eval_11_credit_limit_in_range():
    out = llm_credit_decision({"name": "x"})
    assert 0 <= out["credit_limit"] <= 5_000_000, "credit limit out of range"


def test_eval_12_interest_rate_legal():
    """The interest rate must not exceed 4x LPR (the judicial protection ceiling in China)"""
    out = llm_credit_decision({"name": "x"})
    LPR = 0.0385  # assumed LPR of 3.85%
    LPR_4X = 4 * LPR  # 0.154
    assert 0 < out["interest_rate"] <= LPR_4X, \
        f"interest rate {out['interest_rate']} exceeds the judicial protection ceiling"


def test_eval_13_confidence_score():
    out = llm_compliance_check("xxx")
    assert 0.0 <= out["confidence"] <= 1.0


def test_eval_14_income_reasonable():
    """Annual income < 100 million (guards against hallucinated, absurd figures)"""
    out = llm_kyc_extract("doc")
    assert out["annual_income"] < 100_000_000


# ══════════════════════════════════════════════════════════════════
# Length / structural (3)
# ══════════════════════════════════════════════════════════════════

def test_eval_15_summary_length():
    """[Length] The research summary is 100-400 characters long"""
    s = llm_research_summary("xxx", max_words=200)
    assert 100 <= len(s) <= 400, f"length={len(s)} out of [100,400]"


def test_eval_16_token_count_cap():
    """[Cost] A single LLM call's input stays under 8K tokens (cost control)"""
    # Rough heuristic only; use a real tokenizer such as tiktoken for accurate counts
    text = "x" * 1000  # mock prompt
    # Conservative bound: assume up to ~1 token per character, as is common for Chinese text
    rough_tokens = len(text)
    assert rough_tokens < 8000


def test_eval_17_at_most_n_tool_calls():
    calls = llm_tool_call_for_query("xxx")
    assert len(calls) <= 5, f"too many tool calls ({len(calls)}); the agent loop may be running away"


# ══════════════════════════════════════════════════════════════════
# Determinism / consistency (3)
# ══════════════════════════════════════════════════════════════════

def test_eval_18_deterministic_with_temp_zero():
    """[Determinism] Same input at temp=0, called repeatedly, yields the same output hash"""
    def call_with_seed():
        # In a real environment: call the model with temperature=0
        return llm_kyc_extract("doc")
    h1 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    h2 = hashlib.md5(json.dumps(call_with_seed(), sort_keys=True).encode()).hexdigest()
    assert h1 == h2, "output still differs at temp=0; suspect dynamic fields in the prompt"


def test_eval_19_idempotent_tool_call():
    """[Determinism] Calling the agent twice with the same query yields the same tool call order and arguments"""
    a = llm_tool_call_for_query("Get AAPL quotes and news")
    b = llm_tool_call_for_query("Get AAPL quotes and news")
    assert a == b


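def test_eval_20_latency_under_budget():
    """[Latency] P95 of a single call < 2 seconds"""
    import math
    times = []
    for _ in range(20):
        t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        llm_kyc_extract("doc")
        times.append(time.perf_counter() - t0)
    # Nearest-rank P95: the ceil(0.95 * n)-th smallest sample (index 18 of 20, not the max)
    p95 = sorted(times)[math.ceil(0.95 * len(times)) - 1]
    assert p95 < 2.0, f"P95 latency {p95:.2f}s > 2.0s"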


# ══════════════════════════════════════════════════════════════════
# Entry point (running `python det_evals.py` directly also works)
# ══════════════════════════════════════════════════════════════════
if __name__ == "__main__":
    pytest.main([__file__, "-v", "--tb=short"])

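3.2 A Real Eval Framework Integrated with LLM Output

"""eval_runner.py: run a deterministic eval suite over a dataset and write a report"""
import json
from pathlib import Path
from collections import defaultdict
from typing import Callable  # needed by the run_evals signature below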


def run_evals(dataset: list[dict], evals: list[Callable], llm_fn: Callable):
    """
    dataset: [{"id": "case_1", "input": "...", "expected": {...}}, ...]
    evals  : [callable(output, expected) -> (pass: bool, reason: str), ...]
    llm_fn : (input) -> output
    """
    report = defaultdict(lambda: {"pass": 0, "fail": 0, "errors": []})
    for case in dataset:
        try:
            out = llm_fn(case["input"])
        except Exception as e:
            report["__system__"]["fail"] += 1
            report["__system__"]["errors"].append(f"{case['id']}: {e}")
            continue
        for ev in evals:
            try:
                passed, reason = ev(out, case.get("expected"))
                key = ev.__name__
                if passed:
                    report[key]["pass"] += 1
                else:
                    report[key]["fail"] += 1
                    report[key]["errors"].append(f"{case['id']}: {reason}")
            except Exception as e:
                report[ev.__name__]["fail"] += 1
                report[ev.__name__]["errors"].append(f"{case['id']}: exception {e}")
    return dict(report)


# Write a markdown report
def render_report(report: dict, out_path: str = "eval_report.md"):
    lines = ["# Deterministic Eval Report\n"]
    total_p = total_f = 0
    for ev, stat in report.items():
        passed, failed = stat["pass"], stat["fail"]
        total_p += passed
        total_f += failed
        rate = passed / max(passed + failed, 1) * 100
        lines.append(f"## {ev}\n- Pass: {passed}\n- Fail: {failed}\n- Rate: {rate:.1f}%\n")
        if stat["errors"][:3]:
            lines.append("Sample failures:\n```\n" + "\n".join(stat["errors"][:3]) + "\n```\n")
    total = total_p + total_f
    overall = total_p / max(total, 1) * 100  # guard against an empty report
    lines.insert(1, f"\n**Overall**: {total_p}/{total} = {overall:.1f}%\n")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
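
run_evals expects each eval to be a callable that takes (output, expected) and returns (passed, reason). The two evals below are a hedged sketch of that contract; their names and the sample case are hypothetical, and they are exercised directly here rather than through run_evals so the snippet stays self-contained.

```python
from __future__ import annotations

REQUIRED_FIELDS = {"decision", "credit_limit", "interest_rate", "reason"}


def eval_required_fields(output: dict, expected: dict | None) -> tuple[bool, str]:
    """Pass when every required field is present in the output."""
    missing = sorted(REQUIRED_FIELDS - set(output))
    if missing:
        return False, f"missing fields: {missing}"
    return True, ""


def eval_decision_matches(output: dict, expected: dict | None) -> tuple[bool, str]:
    """Pass when the decision equals the labeled one; unlabeled cases pass."""
    if not expected or "decision" not in expected:
        return True, "no expected decision"
    if output.get("decision") != expected["decision"]:
        return False, f"got {output.get('decision')!r}, expected {expected['decision']!r}"
    return True, ""


# Hypothetical dataset row and model output, for illustration only
case = {"id": "case_1", "input": "...", "expected": {"decision": "REJECT"}}
output = {"decision": "APPROVE", "credit_limit": 50000, "interest_rate": 0.085}

for ev in (eval_required_fields, eval_decision_matches):
    passed, reason = ev(output, case.get("expected"))
    print(f"{ev.__name__}: {'PASS' if passed else 'FAIL'} {reason}")
```

Keeping each eval a plain function with a descriptive `__name__` matters because the runner uses that name as the report key.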

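4. Cost & Performance: Measured Data

Eval type	Time for 1,000 cases	Cost per 1,000 cases
Full deterministic suite (20 evals)	1.2 s	$0
LLM judge (claude-haiku-4-5)	~ 60 s	$0.40
LLM judge (claude-opus-4-7)	~ 200 s	$4.50
Human review	16 hr	$480

Running deterministic evals on 100% of traffic plus CI is still nearly free, which is the fundamental reason they are the foundation.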
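
To see how the tiers add up on a given day, the back-of-the-envelope sketch below applies the unit figures from the pyramid in section 1.1 (10% of traffic to the judge at $0.001-0.01 per case, 1% to human review at 1 minute per case and $30/h). The function name `daily_eval_cost` and the traffic volume are assumptions for illustration, not measured data.

```python
def daily_eval_cost(daily_cases: int) -> dict[str, float]:
    """Estimate daily eval spend per tier from the pyramid's unit figures."""
    judge_cases = daily_cases // 10    # 10% of traffic
    human_cases = daily_cases // 100   # 1% of traffic
    human_cost_per_case = 30 / 60      # 1 minute at $30/h = $0.50
    return {
        "deterministic": 0.0,                 # 100% of traffic, no API cost
        "judge_low": judge_cases * 0.001,
        "judge_high": judge_cases * 0.01,
        "human": human_cases * human_cost_per_case,
    }


# Hypothetical volume: 100,000 cases per day
print(daily_eval_cost(100_000))
```

Even at this rough level the ordering is clear: the human tier dominates spend despite covering the least traffic, and the deterministic tier covers everything for nothing.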

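5. Applications in Finance

KYC extraction: ID number, phone number, and address must pass strict schema + regex checks
Credit decisions: the interest rate must not exceed the judicial protection ceiling (4× LPR), and the limit must not exceed the regulatory cap
AML screening: output categories must come from the defined enum (AML/SANCTIONS/PEP) so downstream systems can process them structurally
Wealth-management recommendations: banned-word scan for non-compliant promises such as "保本" (principal guaranteed), "稳赚" (sure profit), "一定" (certainly), "绝对" (absolutely)
Disclosure reports: length + required sections + no PII (PII regex scan)
Ticket classification: the tool_call must carry the correct ticket category and priority, and the category must come from a closed enum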
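
The banned-word scan for wealth-management recommendations is the simplest of these to sketch. The example below uses the four phrases listed above; the function name `find_banned_promises` and the sample sentences are hypothetical.

```python
# Non-compliant promise phrases: principal guaranteed, sure profit, certainly, absolutely
BANNED_PROMISES = ("保本", "稳赚", "一定", "绝对")


def find_banned_promises(text: str, banned: tuple[str, ...] = BANNED_PROMISES) -> list[str]:
    """Return the banned phrases found in `text`, in list order."""
    return [phrase for phrase in banned if phrase in text]


# Hypothetical model outputs
risky = "本产品保本保息，稳赚不赔"      # "principal and interest guaranteed, sure profit"
safe = "过往业绩不代表未来表现"          # "past performance does not indicate future results"

print(find_banned_promises(risky))
print(find_banned_promises(safe))
```

Plain substring matching will also flag benign uses (for instance "一定" inside "一定程度", "to some extent"), so a production list would likely need allow-listed contexts or review of flagged samples.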

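6. Production Lessons and Pitfalls

temp=0 does not guarantee determinism: the Anthropic API can still vary slightly, for example when backend batching differs. For hash assertions, pair it with a seed parameter where the serving stack exposes one (vLLM does, for example); in production, assert "approximately consistent" rather than a strict hash match
Overly strict schemas reject reasonable variants: for example, phone numbers with or without hyphens. Collect the real output distribution first, then write the schema
Regex PII scans miss masked values: ID numbers with the last 4 digits replaced by `****` and phone numbers written as `138****0000` are both common. Add the masked formats to PII detection as well
Eval false positives outnumber true positives: a model output such as "This year we will, in 2025, ..." triggers a "year mismatch" check. Evals need timezone/date normalization
Running everything in CI is too slow: 1,000 deterministic cases in about 1 s is fine, but the LLM-judge part should be tiered (a sample on PR, the full set after merge)
Eval drift: business rules change but the evals do not follow. Bring evals into product-owner review and clean them up every six months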
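
For the masked-PII pitfall, a rough sketch of extra patterns is shown below. It assumes two mask formats, a phone number with its middle four digits starred and an ID number with its last four characters starred; real systems may mask differently, and the names `MASKED_PII_PATTERNS` and `find_masked_pii` are hypothetical.

```python
import re

MASKED_PII_PATTERNS = [
    # Phone: 3 leading digits, 4 asterisks, 4 trailing digits (e.g. 138****0000)
    (re.compile(r"(?<![\d*])1[3-9]\d\*{4}\d{4}(?!\d)"), "masked_phone"),
    # ID number: 14 leading digits followed by 4 asterisks
    (re.compile(r"(?<!\d)\d{14}\*{4}(?![\d*])"), "masked_id_number"),
]


def find_masked_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) for every masked-PII hit in `text`."""
    return [
        (label, m.group())
        for pattern, label in MASKED_PII_PATTERNS
        for m in pattern.finditer(text)
    ]


# Hypothetical outputs containing masked values
print(find_masked_pii("联系电话138****0000"))
print(find_masked_pii("证件号11010119900101****"))
```

Digit lookarounds are used instead of `\b` so that numbers directly adjacent to Chinese characters are still caught.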

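7. Quick Reference

Library	Purpose
`pydantic`	Schema validation
`jsonschema`	Standard JSON Schema validation
`regex`	Regular expressions (an alternative to `re` with extra features such as Unicode property classes and fuzzy matching)
`pytest`	Test framework
`tiktoken`	Token estimation
`hypothesis`	Property-based fuzz testing

Failure	Policy
Schema fail	block deploy
Refusal phrase matched by regex	block deploy
Length over cap	warn
Latency P95 over budget	warn + monitor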
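
The failure-handling table can be turned into a small gate that CI calls after the evals finish. This is a sketch only: the category keys, the `gate` function, and the choice to block on unknown categories are assumptions for illustration.

```python
from __future__ import annotations

# Failure category -> action, mirroring the policy table above
POLICY = {
    "schema": "block",
    "refusal": "block",
    "length": "warn",
    "latency": "warn",
}


def gate(failed_categories: list[str]) -> tuple[str, list[str]]:
    """Return (decision, warnings); decision is "block", "warn", or "pass"."""
    # Assumption: a category missing from POLICY is treated as blocking
    actions = {cat: POLICY.get(cat, "block") for cat in failed_categories}
    warnings = sorted(cat for cat, action in actions.items() if action == "warn")
    if any(action == "block" for action in actions.values()):
        return "block", warnings
    return ("warn" if warnings else "pass"), warnings


print(gate([]))
print(gate(["length", "latency"]))
print(gate(["schema", "length"]))
```

Defaulting unknown categories to "block" errs on the safe side: a newly added eval cannot silently become advisory just because nobody registered a policy for it.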

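8. Interview Questions

Why do deterministic evals take priority over LLM-judge?
- Nearly free, runnable in CI, free of subjective bias, millisecond-level feedback, and usable as reference ground truth for the LLM judge; only what cannot be tested deterministically (semantics, style) goes to the LLM judge
How would you design evals for a financial KYC-extraction LLM?
- Schema (pydantic field types) + Regex (ID number / phone format) + Numeric range (plausible annual income) + NotContains (PII must not appear in the reason field) + Determinism (same input twice gives the same output) + cost/latency assertions
How do you handle an LLM emitting "today's date", which makes evals non-repeatable?
- Strip dynamic fields (dates, order numbers) from the output before hashing; or pin the date in the prompt ("assume today is 2026-01-01") and mock it during evals
All 20 evals pass, yet production still breaks. Why?
- Out-of-distribution inputs the evals do not cover (adversarial input, the long tail of real users); deterministic checks verify format, not semantics; the upper layers, LLM-judge and human review, are missing
What do you do when eval maintenance cost is high?
- Ship evals in the same PR as the prompt; appoint an "eval owner"; clean up evals every six months; automatically add new evals sampled from production; keep two-way traceability with the product-rules documentation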
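
The answer to the "today's date" question, stripping dynamic fields before hashing, can be sketched as follows. The key names `generated_at` and `order_id`, the ISO date pattern, and the function `stable_hash` are all assumptions made for this example.

```python
import hashlib
import json
import re

# Assumption: dates embedded in free text are ISO formatted (YYYY-MM-DD)
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def stable_hash(output: dict, dynamic_keys: tuple[str, ...] = ("generated_at", "order_id")) -> str:
    """Hash an output after dropping dynamic keys and masking embedded dates."""
    stable = {k: v for k, v in output.items() if k not in dynamic_keys}
    payload = json.dumps(stable, sort_keys=True, ensure_ascii=False)
    payload = DATE_RE.sub("<DATE>", payload)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical outputs from two runs on different days
run_a = {"decision": "APPROVE", "reason": "Reviewed on 2026-10-14", "generated_at": "2026-10-14T08:00:00Z"}
run_b = {"decision": "APPROVE", "reason": "Reviewed on 2026-10-15", "generated_at": "2026-10-15T09:30:00Z"}
run_c = {"decision": "REJECT", "reason": "Reviewed on 2026-10-15", "generated_at": "2026-10-15T09:30:00Z"}

print(stable_hash(run_a) == stable_hash(run_b))
print(stable_hash(run_a) == stable_hash(run_c))
```

Masking rather than deleting the embedded date keeps the sentence structure in the hash, so a run that drops the date entirely still shows up as a difference.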

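Preview of Tomorrow

Day 167: Eval System (2): LLM-as-Judge. What deterministic evals cannot reach, such as semantics, style, and compliance reasoning, needs an LLM judge. That session focuses on judge design, position bias, length bias, and calibration.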