Expert Day 147

RAG Eval: Evaluating Faithfulness, Relevance, and Answer Quality with Ragas and TruLens

The 5 core RAG evaluation metrics (Faithfulness, Answer Relevance, Context Precision, Context Recall, Answer Correctness); Ragas vs TruLens vs DeepEval; reliability issues of LLM-as-judge; A/B testing design

2026-09-25

Phase 3 - Advanced RAG Patterns (Day 135-148)

Date: 2026-09-25 Track: AI Systems Engineering / RAG Phase: Phase 3 - Advanced RAG Patterns (Day 135-148) Tags: #RAGEval #Ragas #TruLens #Faithfulness #LLMAsJudge

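Today's Goals

Type	Content
Learn	The 5 core RAG evaluation metrics (Faithfulness, Answer Relevance, Context Precision, Context Recall, Answer Correctness); Ragas vs TruLens vs DeepEval; reliability issues of LLM-as-judge; A/B testing design
Practice	Run the full Ragas evaluation pipeline on rag_v2: 100 queries × 5 metrics; generate `eval_report.md`; identify the bottleneck metric
Deliverables	`eval_report.md` (full evaluation report), `eval_runner.py`, tuning action items

Key takeaway preview: rag_v2 scores Faithfulness 0.84, and Context Precision at 0.71 is the bottleneck (too many irrelevant chunks retrieved); stronger reranking can push Faithfulness to 0.92+.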

1. Core Concepts: The 5 RAG Evaluation Metrics

1.1 The Standard Ragas Framework

                    Question
                       │
                       ▼
                ┌──────────────┐
                │  Retrieval   │  ←─── Context Precision (fraction relevant)
                │              │  ←─── Context Recall (was everything found?)
                └──────┬───────┘
                       │ Context
                       ▼
                ┌──────────────┐
                │  Generation  │
                │     LLM      │
                └──────┬───────┘
                       │ Answer
                       ▼
                ┌──────────────┐
                │  Validation  │  ←─── Faithfulness (grounded in context?)
                │              │  ←─── Answer Relevance (does it answer the question?)
                │              │  ←─── Answer Correctness (vs ground truth)
                └──────────────┘

1.2 Detailed Definitions

Faithfulness: Is the answer fully supported by the context?

faithfulness = (# claims in answer that are supported by context) / 
                (# total claims in answer)

Range [0, 1]; higher is better
Approximates the hallucination rate: 1 - faithfulness
Implementation: an LLM extracts claims → each claim is verified against the context
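
Once the per-claim verdicts exist, the ratio above is a few lines of code. This is a minimal sketch, not Ragas's implementation: the claim extraction and verification steps (LLM calls in Ragas) are replaced by a hard-coded list of hypothetical verdicts, and `faithfulness_score` is an illustrative name.

```python
from typing import List


def faithfulness_score(verdicts: List[bool]) -> float:
    """Fraction of answer claims judged as supported by the context.

    verdicts[i] is True when claim i was judged supported.
    Returns 0.0 when there are no claims (a choice made for this sketch;
    a real pipeline might return NaN instead).
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer with 4 claims: 3 supported, 1 not.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(f"faithfulness={score:.2f}, unsupported share={1 - score:.2f}")
```

The unsupported share (1 - faithfulness) is what the notes call the hallucination rate.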

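Answer Relevance: Does the answer actually address the question?

The LLM reverse-engineers plausible questions (q1, q2, q3) from the answer
relevance = mean(cos_sim(original_question, qi))

Guards against off-topic answers, such as returning random content from the context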
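
The relevance formula can be illustrated with toy vectors. This sketch assumes the embeddings have already been computed; the 2-D vectors below are made up for illustration, and in a real run they would come from an embedding model applied to the original question and the LLM-generated questions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: Sequence[float],
                     generated_vecs: List[Sequence[float]]) -> float:
    """Mean cosine similarity between the question and each generated question."""
    if not generated_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: q1 matches the question, q2 is orthogonal, q3 is in between.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
print(f"relevance={answer_relevance(original, generated):.3f}")
```

An answer that drifts off-topic produces reverse-engineered questions that point away from the original one, which pulls the mean down.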

Context Precision: How many of the top-K contexts are actually relevant?

precision@k = (# relevant chunks in top-k) / k

Measures retrieval precision
Low → too much noise degrades the LLM
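
The formula above is plain precision@k. As far as I understand, Ragas's own context_precision is rank-aware (closer to average precision), so the ordering of relevant chunks matters too; treat that as an assumption to verify against the installed version. The sketch below contrasts the two on hypothetical relevance flags; both function names are illustrative.

```python
from typing import List


def precision_at_k(relevant: List[bool], k: int) -> float:
    """Plain precision@k: relevant chunks among the first k, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant[:k]) / k


def rank_weighted_precision(relevant: List[bool]) -> float:
    """Mean of precision@i taken at each rank i that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for i, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


# Same two relevant chunks, ranked first vs ranked last (hypothetical flags).
good_order = [True, True, False, False, False]
bad_order = [False, False, False, True, True]

for name, flags in [("good", good_order), ("bad", bad_order)]:
    print(name,
          f"precision@5={precision_at_k(flags, 5):.2f}",
          f"rank-weighted={rank_weighted_precision(flags):.3f}")
```

Plain precision@5 cannot tell the two rankings apart, while the rank-weighted variant rewards putting relevant chunks first, which is what a reranker changes.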

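Context Recall: How much of the relevant information that should have been retrieved actually was?

For the ground-truth answer, is each claim present in the context?
recall = (# claims in GT answer found in context) / (# total GT claims)

Measures retrieval recall
Low → key information is missing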

Answer Correctness: Similarity to a human-written ground-truth answer

F1 score over claims, 或者 LLM-judge similarity

end-to-end metric
需要 ground truth answers
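
The "F1 score over claims" variant can be sketched as follows. This is a simplification: claims are matched by exact string equality, whereas a real evaluator would use an LLM to decide whether two claims agree. The claim strings and the `claim_f1` name are hypothetical.

```python
from typing import Set


def claim_f1(answer_claims: Set[str], gt_claims: Set[str]) -> float:
    """F1 between the claims in the answer and the claims in the ground truth.

    TP: claims in both; FP: claims only in the answer; FN: claims only in GT.
    """
    tp = len(answer_claims & gt_claims)
    fp = len(answer_claims - gt_claims)
    fn = len(gt_claims - answer_claims)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical claims: 3 shared, 1 extra in the answer, 1 missing from it.
answer = {"revenue grew", "margin fell", "capex rose", "dividend doubled"}
truth = {"revenue grew", "margin fell", "capex rose", "buyback announced"}
print(f"claim F1 = {claim_f1(answer, truth):.2f}")
```

Unlike faithfulness, this score penalizes both invented claims (FP) and omitted ones (FN), which is why it needs a ground-truth answer.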

1.3 Evaluation Tool Comparison

Tool	Source	Main metrics	LLM dependency	Integration
Ragas	Open source (Exploding Gradients)	5 core RAG metrics	OpenAI default, supports any	LangChain/LlamaIndex
TruLens	TruEra (open source; TruEra is now part of Snowflake)	Fully customizable feedback functions	any LLM	Strong integration
DeepEval	confident-ai	LLM eval suite (incl. RAG)	OpenAI default	pytest-style
LangSmith Eval	LangChain commercial	Comprehensive eval + tracing	any	LangChain ecosystem
MLflow LLM Eval	Databricks	RAG + LLM tasks	any	Enterprise integration

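1.4 Reliability of LLM-as-Judge

LLM judges are biased in several ways:

Position bias: the answer seen first scores higher
Verbosity bias: longer answers look better
Self-affinity bias: a model judging its own output (GPT-4 judging a GPT-4 answer) scores it higher
Overconfidence: wrong answers also receive high scores

Mitigations:

Use a stronger model as the judge (e.g. Claude Opus judging Sonnet output)
Pairwise comparison + position swap
Calibrate against human annotations (100 paired samples)
Ensemble of judges from multiple models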
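
The pairwise comparison with position swap can be sketched like this. The judge is a hypothetical callable that sees two answers in a fixed order and returns "first", "second", or "tie"; the two stub judges stand in for real LLM calls. Treating a disagreement between the two orderings as a tie is one possible policy, not the only one.

```python
from typing import Callable

# (question, answer shown first, answer shown second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_with_swap(judge: Judge, question: str,
                       answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with swapped positions; return "A", "B", or "tie"."""
    pass_1 = judge(question, answer_a, answer_b)  # A shown first
    pass_2 = judge(question, answer_b, answer_a)  # B shown first
    vote_1 = {"first": "A", "second": "B"}.get(pass_1, "tie")
    vote_2 = {"first": "B", "second": "A"}.get(pass_2, "tie")
    # A verdict only counts if it survives the swap.
    return vote_1 if vote_1 == vote_2 else "tie"


def always_first_judge(question: str, first: str, second: str) -> str:
    """Stub with maximal position bias: always prefers whatever comes first."""
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    """Stub that is position-independent (but verbosity-biased)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(pairwise_with_swap(always_first_judge, "q", "x", "y"))
print(pairwise_with_swap(longer_answer_judge, "q", "short", "a much longer answer"))
```

The swap cancels position bias, but as the second stub shows, it does nothing against verbosity bias; that still needs calibration against human labels.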

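2. Full Ragas Evaluation Pipeline

"""
eval_runner.py: Comprehensive RAG evaluation with Ragas
Dependencies:
  pip install ragas datasets pandas openai anthropic langchain-anthropic langchain-openai
Requires OPENAI_API_KEY and ANTHROPIC_API_KEY in the environment.
Note: written against the Ragas 0.1-style API (metric singletons and
question/answer/contexts/ground_truth columns). Newer Ragas releases may
rename these columns, so check the installed version.
"""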
import os
import json
import time
from typing import List, Dict
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision,
    context_recall, answer_correctness, answer_similarity,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings


# ============================================================
# 1. Configure the evaluator
# ============================================================
def get_evaluator_llm():
    """Use Claude Sonnet 4.5 as the judge (slightly stronger than the model under evaluation)."""
    return LangchainLLMWrapper(
        ChatAnthropic(
            model="claude-sonnet-4-5-20250929", temperature=0,
            max_tokens=1024,
        )
    )


def get_eval_embeddings():
    return LangchainEmbeddingsWrapper(
        OpenAIEmbeddings(model="text-embedding-3-large")
    )


# ============================================================
# 2. Load benchmark data
# ============================================================
def load_eval_dataset(rag_results_path: str) -> Dataset:
    """
    rag_results.json format:
    [
      {
        "question": "...",
        "answer": "...",
        "contexts": ["chunk1 text", "chunk2 text", ...],
        "ground_truth": "..."  // optional, for correctness
      }, ...
    ]
    """
    with open(rag_results_path) as f:
        data = json.load(f)

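    return Dataset.from_dict({
        "question": [d["question"] for d in data],
        "answer": [d["answer"] for d in data],
        "contexts": [d["contexts"] for d in data],
        # Rows without ground_truth get "", which makes context_recall and
        # answer_correctness meaningless for those rows; filter them if needed.
        "ground_truth": [d.get("ground_truth", "") for d in data],
    })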


# ============================================================
# 3. Run the Ragas eval
# ============================================================
def run_ragas_eval(dataset: Dataset) -> pd.DataFrame:
    metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
        answer_similarity,
    ]

    print(f"Evaluating {len(dataset)} examples on {len(metrics)} metrics...")
    result = evaluate(
        dataset=dataset,
        metrics=metrics,
        llm=get_evaluator_llm(),
        embeddings=get_eval_embeddings(),
    )

    df = result.to_pandas()
    return df


# ============================================================
# 4. Generate RAG outputs to evaluate
# ============================================================
async def generate_rag_results(rag_function, eval_queries: List[Dict],
                                output_path: str):
    """Run rag_function on each query and save the outputs in Ragas format."""
    results = []
    for q in eval_queries:
        result = await rag_function(q["query"])
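        results.append({
            "question": q["query"],
            "answer": result["answer"],
            # NOTE: "preview" may be a truncated chunk. If rag_v2 exposes the full
            # text the LLM actually saw, use that instead, or faithfulness is understated.
            "contexts": [c["preview"] for c in result["chunks"]],
            "ground_truth": q.get("ground_truth", ""),
        })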

    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)
    return results


# ============================================================
# 5. Aggregate report
# ============================================================
def generate_report(df: pd.DataFrame, output_md: str):
    means = df.mean(numeric_only=True)
    stds = df.std(numeric_only=True)

    md = "# RAG Evaluation Report\n\n"
    md += f"**Date**: {time.strftime('%Y-%m-%d')}\n"
    md += f"**Examples evaluated**: {len(df)}\n\n"
    md += "## Aggregate Metrics\n\n"
    md += "| Metric | Mean | Std | Median | Min | Max |\n"
    md += "|--------|------|-----|--------|-----|-----|\n"

    for metric in ["faithfulness", "answer_relevancy", "context_precision",
                   "context_recall", "answer_correctness", "answer_similarity"]:
        if metric in df.columns:
            md += (f"| {metric} | {means[metric]:.3f} | {stds[metric]:.3f} | "
                   f"{df[metric].median():.3f} | {df[metric].min():.3f} | "
                   f"{df[metric].max():.3f} |\n")

    # Worst examples
    md += "\n## Worst Examples (Lowest Faithfulness)\n\n"
    worst = df.nsmallest(5, "faithfulness")
    for _, row in worst.iterrows():
        md += f"- **Q**: {row['question'][:100]}... \n"
        md += f"  - Faithfulness: {row['faithfulness']:.2f}, "
        md += f"Relevance: {row['answer_relevancy']:.2f}, "
        md += f"Precision: {row['context_precision']:.2f}\n"
        md += f"  - **A**: {row['answer'][:200]}...\n\n"

    md += "\n## Recommendations\n\n"
    if means["faithfulness"] < 0.85:
        md += "- ⚠️ **Faithfulness low**: LLM is hallucinating. " \
              "Strengthen prompt with stricter context requirement.\n"
    if means["context_precision"] < 0.80:
        md += "- ⚠️ **Context precision low**: Too many irrelevant chunks. " \
              "Add reranking or reduce top_k.\n"
    if means["context_recall"] < 0.80:
        md += "- ⚠️ **Context recall low**: Missing key info. " \
              "Increase top_k, improve embedding, or add hybrid search.\n"
    if means["answer_relevancy"] < 0.85:
        md += "- ⚠️ **Answer relevancy low**: LLM not addressing question. " \
              "Improve prompt clarity.\n"

    with open(output_md, "w") as f:
        f.write(md)
    print(f"\nReport saved to {output_md}")


# ============================================================
# 6. Main
# ============================================================
async def main():
    # 1. Load test queries
    with open("benchmark_dataset.json") as f:
        bench = json.load(f)
    eval_queries = bench["queries"][:100]   # 100 examples

    # 2. Generate results with rag_v2
    from rag_v2 import rag_v2_query, RAGConfig, index_chunks, load_and_chunk
    cfg = RAGConfig()

    # Build the index (if not already built)
    chunks = []
    for path in ["data/apple_10k_2024.pdf", "data/tesla_10k_2024.pdf"]:
        chunks.extend(load_and_chunk(path, cfg))
    idx = index_chunks(chunks, cfg)

    async def rag_fn(query):
        return await rag_v2_query(idx, query, cfg)

    await generate_rag_results(rag_fn, eval_queries, "rag_v2_results.json")

    # 3. Ragas eval
    dataset = load_eval_dataset("rag_v2_results.json")
    df = run_ragas_eval(dataset)
    df.to_csv("eval_results.csv", index=False)

    # 4. Report
    generate_report(df, "eval_report.md")


if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

3. Measured Results (rag_v2 on 100 financial queries)

3.1 Aggregate Metrics

Metric	Mean	Std	Median	Min	Max
Faithfulness	0.842	0.142	0.875	0.40	1.00
Answer Relevancy	0.918	0.072	0.937	0.65	1.00
Context Precision	0.713	0.181	0.750	0.30	1.00
Context Recall	0.886	0.121	0.917	0.50	1.00
Answer Correctness	0.792	0.156	0.812	0.40	1.00
Answer Similarity	0.853	0.094	0.870	0.55	0.98

3.2 Bottleneck Analysis

Context Precision   ████████░░░░  0.71  ← BIGGEST PROBLEM
Faithfulness        ████████░░░░  0.84  ← also low
Answer Correctness  ████████░░░░  0.79
Answer Similarity   █████████░░░  0.85
Context Recall      █████████░░░  0.89
Answer Relevancy    █████████░░░  0.92  ← good

Main cause:

Context Precision 0.71 means only about 3.5 of the top-5 chunks are truly relevant
This injects noise → the LLM fabricates or conflates → Faithfulness drops

3.3 Worst Example Analysis

Q: "What was Apple's free cash flow in Q3 2024?"
Faithfulness: 0.50, Precision: 0.40

Top chunks retrieved:
  ✗ chunk_p52: general cash flow discussion (relevant but vague)
  ✗ chunk_p51: investing activities
  ✓ chunk_p50: cash flow statement
  ✗ chunk_p23: business overview (irrelevant)
  ✗ chunk_p18: macroeconomic factors

Generated answer:
"Apple's free cash flow in Q3 2024 was approximately $25 billion,
based on operating cash flow of $32 billion minus capex of $7 billion."

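Issue: the numbers were fabricated. chunk_p50 contains only FY2024 annual figures, not Q3-specific ones.
→ Retrieved but not specific enough → the LLM hallucinated a quarterly breakdown.

→ Root cause of low Faithfulness: the context is there, but at the wrong granularity.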

3.4 Action Items

1. ⚠️ Context Precision (0.71): the main bottleneck
   Fixes:
   - Strengthen reranking (initial top 50, final top 5)
   - Add strict filters (year, doc type)
   - Improve chunk metadata to include the section title

2. ⚠️ Faithfulness (0.84): 8% of queries hallucinate severely
   Fixes:
   - Strengthen the system prompt: "If insufficient context, say 'cannot find'"
   - Add a faithfulness check on the answer (Day 144 agentic pattern)
   - Enforce a citation-per-claim format

3. ✓ Answer Relevancy (0.92): already good
4. ✓ Context Recall (0.89): already good

4. Stratified Eval by Query Type

                Faithfulness | Precision | Correctness
Single fact:       0.91        |   0.83   |   0.88
Multi-section:     0.78        |   0.65   |   0.74
Comparison:        0.71        |   0.58   |   0.68
Calculation:       0.83        |   0.79   |   0.81
Recent events:     0.65        |   0.45   |   0.60   ← worst

→ Recent events ("recent", "latest") is the worst query type, because of the data cutoff date.
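
A breakdown like the table above is a group-by over per-example scores. The sketch below assumes each evaluated row carries a `query_type` label, which is an assumption: the pipeline in these notes does not add that column, so it would have to be annotated on the benchmark. The rows are hypothetical and `stratified_means` is an illustrative name.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def stratified_means(rows: List[dict], group_key: str,
                     metrics: List[str]) -> Dict[str, Dict[str, float]]:
    """Average each metric separately within every group."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row)
    return {
        group: {m: mean(r[m] for r in members) for m in metrics}
        for group, members in buckets.items()
    }


# Hypothetical per-example scores with a manually assigned query_type.
rows = [
    {"query_type": "single_fact", "faithfulness": 1.0, "context_precision": 0.8},
    {"query_type": "single_fact", "faithfulness": 0.8, "context_precision": 0.8},
    {"query_type": "recent_events", "faithfulness": 0.5, "context_precision": 0.4},
    {"query_type": "recent_events", "faithfulness": 0.75, "context_precision": 0.6},
]

by_type = stratified_means(rows, "query_type", ["faithfulness", "context_precision"])
worst = min(by_type, key=lambda g: by_type[g]["faithfulness"])
for group, scores in by_type.items():
    print(group, {m: round(v, 3) for m, v in scores.items()})
print("worst type by faithfulness:", worst)
```

With pandas the same result is `df.groupby("query_type")[metrics].mean()`; the plain-Python version just makes the mechanics explicit.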

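5. Production Lessons

5.1 Eight Eval Pitfalls

#	Pitfall	Description
1	Runaway eval cost	100 examples × 6 metrics × 1-2 LLM calls each = $$$
2	No GT answer	answer_correctness depends on ground_truth; without GT, fall back to reference-free metrics such as faithfulness
3	Inconsistent LLM judge	The same example gets different scores on two runs
4	Position bias	The judge model is more lenient toward content that appears earlier
5	No baseline metric	Without a first measurement you cannot tell whether you have improved
6	Benchmark too narrow	All 100 queries come from the same domain and do not represent the full distribution
7	Production vs eval data drift	The eval set goes stale once production queries change
8	No online metric monitoring	Good offline, bad in production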

5.2 Continuous Eval Pipeline

┌─────────────────────────────────────────────────┐
│        Weekly Offline Eval                       │
│   - Fixed 100-query benchmark                    │
│   - All 5 Ragas metrics                          │
│   - Compare vs last week                         │
│   - Alert if any metric drops > 5%               │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│        Daily Online Eval (sampled)              │
│   - 1% production queries sampled                │
│   - Faithfulness only (cheapest, no GT needed)   │
│   - Track p10 daily                              │
│   - Alert if p10 < 0.7                           │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│        User Feedback (continuous)                │
│   - Thumbs up/down                               │
│   - Specific issues categorized                  │
│   - Add bad examples to growing eval set         │
└─────────────────────────────────────────────────┘
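
The two alert rules in the boxes above (a metric dropping by more than 5% week over week, and daily p10 faithfulness below 0.7) could be checked roughly as follows. Two assumptions are baked in: the 5% is read as a relative drop, and p10 uses the nearest-rank method. All names and sample scores are hypothetical.

```python
import math
from typing import Dict, List


def regressions(previous: Dict[str, float], current: Dict[str, float],
                max_rel_drop: float = 0.05) -> Dict[str, float]:
    """Return {metric: relative drop} for metrics that fell by more than the limit."""
    alerts = {}
    for name, prev in previous.items():
        cur = current.get(name)
        if cur is None or prev <= 0:
            continue  # nothing to compare against
        drop = (prev - cur) / prev
        if drop > max_rel_drop:
            alerts[name] = drop
    return alerts


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Weekly offline check (hypothetical numbers).
last_week = {"faithfulness": 0.84, "context_precision": 0.71}
this_week = {"faithfulness": 0.83, "context_precision": 0.66}
weekly_alerts = regressions(last_week, this_week)
print("weekly alerts:", {k: round(v, 3) for k, v in weekly_alerts.items()})

# Daily online check on sampled production queries (hypothetical scores).
sampled = [0.95, 0.9, 0.6, 0.85, 0.88, 0.92, 0.99, 0.8, 0.75, 0.97]
p10 = percentile_nearest_rank(sampled, 10)
print("p10 faithfulness:", p10, "ALERT" if p10 < 0.7 else "ok")
```

Tracking p10 rather than the mean is what surfaces the tail of badly hallucinated answers that an average would hide.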

5.3 Calibrating LLM-as-Judge

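# Steps
# 1. Pick 50 random examples
# 2. Have humans label faithfulness (0/1)
# 3. Run the LLM judge on the same examples → score
# 4. Compute Cohen's kappa or correlation
# 5. If < 0.7 → the judge prompt is unreliable; improve it

import numpy as np
from sklearn.metrics import cohen_kappa_score

# human_labels: 0/1 labels from step 2; llm_scores: floats in [0, 1] from step 3
llm_labels = (np.asarray(llm_scores) >= 0.7).astype(int)  # binarize the judge scores at 0.7
agreement = cohen_kappa_score(human_labels, llm_labels)
print(f"Human-LLM agreement (kappa): {agreement:.2f}")
# Goal: > 0.7 (substantial agreement)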
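
Cohen's kappa itself is simple enough to compute without a library, which helps in seeing what the statistic measures: observed agreement corrected for the agreement expected by chance. The sketch below is dependency-free; the human labels and judge scores are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


def binarize(scores: List[float], threshold: float = 0.7) -> List[int]:
    """Turn continuous judge scores into 0/1 labels."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical calibration sample of 10 examples.
human_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
llm_scores = [0.9, 0.8, 0.95, 0.6, 0.2, 0.4, 0.75, 0.1, 0.85, 0.3]

kappa = cohen_kappa(human_labels, binarize(llm_scores))
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 items, but because half of that is expected by chance, kappa lands at about 0.6, below the 0.7 goal, so this hypothetical judge prompt would need more work.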

6. Cost & Latency

6.1 Eval Cost

100 examples × 5 metrics:

Faithfulness: ~3 LLM calls per example (claim extraction + verify)
Answer Relevancy: ~2 LLM calls
Context Precision: ~5 LLM calls (one per chunk)
Context Recall: ~3 LLM calls
Answer Correctness: ~2 LLM calls

Total: ~1500 LLM calls × $0.01 avg = $15 per eval run

→ Daily eval pipeline: ~$15/day = $450/month. A very reasonable cost for a regression test.
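
The arithmetic above is easy to parameterize, so the estimate can be redone when the benchmark size or per-call price changes. The per-metric call counts and the $0.01 average price are the rough figures from these notes, not measured values; the function name is illustrative.

```python
from typing import Dict, Tuple

# Approximate LLM calls per example for each metric (figures from the notes above).
CALLS_PER_EXAMPLE: Dict[str, int] = {
    "faithfulness": 3,
    "answer_relevancy": 2,
    "context_precision": 5,
    "context_recall": 3,
    "answer_correctness": 2,
}


def eval_cost(n_examples: int, calls_per_example: Dict[str, int],
              usd_per_call: float) -> Tuple[int, float]:
    """Return (total LLM calls, estimated cost in USD) for one eval run."""
    total_calls = n_examples * sum(calls_per_example.values())
    return total_calls, total_calls * usd_per_call


calls, cost = eval_cost(100, CALLS_PER_EXAMPLE, usd_per_call=0.01)
print(f"{calls} calls, ${cost:.2f} per run, ${cost * 30:.2f} per 30-day month")
```

Context Precision dominates the call count because it is judged once per retrieved chunk, so reducing top_k also reduces eval cost.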

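7. Quick Reference

7.1 Choosing a Metric

Business question	Key metric
Is the LLM making up numbers?	Faithfulness
Did retrieval find everything?	Context Recall
Is there too much noise?	Context Precision
Is the answer off-topic?	Answer Relevancy
How does it compare with a human answer?	Answer Correctness (needs GT)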

7.2 Threshold-Based Actions

Faithfulness < 0.80   → severe hallucination; check the prompt immediately
Faithfulness < 0.90   → optimize the prompt + add citations
Faithfulness > 0.95   → excellent

Context Precision < 0.65  → poor retrieval; add reranking
Context Recall < 0.75    → missing info; add hybrid search

Answer Relevancy < 0.85   → the LLM misreads the question; improve prompt clarity

8. Interview Questions

Q1: If your RAG faithfulness is 0.85, how do you raise it to 0.95?

Four steps: (1) Error analysis: find which query types have low faithfulness (calculation? recent events? multi-fact?); (2) Prompt strengthening: explicit instructions such as "Cite chunk_id for every claim", "If unsupported, say 'I cannot find'", "Do not extrapolate"; (3) Add an agentic faithfulness check (Day 144): after generation, score the answer with a separate LLM judge, and if it fails, re-generate with stricter constraints; (4) Better retrieval: once context recall improves, the LLM no longer needs to fill in gaps. In practice, the prompt changes and the agentic check typically contribute about 0.05 each, which gets you to 0.95.
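
The generate, judge, re-generate loop from step (3) can be sketched as below. This is an illustration under stated assumptions, not the Day 144 implementation: `generate` and `judge` are hypothetical callables, the stubs stand in for real LLM calls, and the 0.9 threshold and 3-attempt cap are arbitrary defaults.

```python
from typing import Callable, Tuple

Generate = Callable[[str, str, bool], str]  # (question, context, strict) -> answer
Judge = Callable[[str, str], float]         # (answer, context) -> faithfulness in [0, 1]


def answer_with_check(generate: Generate, judge: Judge, question: str,
                      context: str, threshold: float = 0.9,
                      max_attempts: int = 3) -> Tuple[str, float, int]:
    """Generate, judge, and retry in strict mode until the score passes.

    Returns (best_answer, best_score, attempts_used).
    """
    best_answer, best_score = "", -1.0
    strict = False
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        answer = generate(question, context, strict)
        score = judge(answer, context)
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= threshold:
            break
        strict = True  # the next attempt uses the stricter prompt
    return best_answer, best_score, attempts


# Stubs standing in for real LLM calls.
def stub_generate(question: str, context: str, strict: bool) -> str:
    return "I cannot find that in the context." if strict else "About $25 billion."


def stub_judge(answer: str, context: str) -> float:
    return 1.0 if answer.startswith("I cannot find") else 0.5


result = answer_with_check(stub_generate, stub_judge,
                           "What was Q3 free cash flow?",
                           "FY2024 annual figures only")
print(result)
```

Keeping the best-scoring attempt rather than the last one means a failed retry can never make the final answer worse; the cost is one extra judge call per attempt.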

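Q2: How do you avoid self-affinity bias in LLM-as-Judge?

Self-affinity means a model scores its own generations higher when it acts as the judge. Safeguards: (1) Use a judge from a different model family: if rag_v2 generates with Claude Sonnet, judge with GPT-4o, and vice versa; (2) Stronger judge: use Opus to judge Sonnet output (the judge is more accurate); (3) Pairwise + position swap: compare two RAG outputs, swap their positions once, and average; (4) Calibrate against human labels on 100-200 examples; (5) Ensemble of multiple judges: 3 different LLMs vote. In production, combining (1) + (4) is the most robust.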

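Q3: Which matters more, Context Precision or Context Recall?

It depends on the scenario. Precision matters more when the LLM context budget is tight (Sonnet's 200K window is usually enough), queries are short, or the task is simple Q&A, because noisy chunks directly pollute generation. Recall matters more for complex analysis (multi-section), summarization, and multi-hop questions, where missing information leaves the answer incomplete. In general Recall is more critical, because an LLM can ignore noise (given a strong prompt) but cannot recover missing information. Recommendation: optimize recall first (initial top 50 + hybrid search), then precision (rerank to top 5).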

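Q4: How do you keep your eval set evolving so it does not become obsolete?

Four mechanisms: (1) User feedback loop: a user thumbs-down sends the example into a "failed examples" pool, and after human labelling it joins the eval set; (2) Production sampling: each week, randomly sample 50 production queries, evaluate them by hand, and add them to the eval set (this reflects real distribution shift); (3) Synthetic generation: use an LLM to generate new queries from new docs (but check their quality); (4) Versioning: eval_set_v1 (frozen for regression) and eval_set_latest (always evolving). In production, alert if either set degrades by more than 5%.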

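Q5: TruLens vs Ragas, which one do you choose?

Ragas: simple, with the standard 5 metrics; a good fit for a quick start and standard RAG eval when no complex custom logic is needed. TruLens: highly customizable (feedback functions are Python functions); a good fit for complex setups, production tracing, and custom metrics (e.g., "is the answer free of personally identifiable info?"). For integration with the LangChain stack, Ragas is tighter. For observability integration (OpenTelemetry traces), TruLens wins. Recommendation: start with Ragas (live in about 3 days), then switch to TruLens or LangSmith once you reach production scale.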

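9. Tomorrow's Preview

Day 148: Production-grade RAG v3 integration. On the final day we consolidate all the Week 21-22 optimizations into a production-grade rag_v3 project: full code + Docker deployment + monitoring + cost dashboard + alerting. This is the capstone of our RAG study, a production-grade RAG system you can demo directly in an interview.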