返回 Expert 笔记
Expert Day 139

Reranking——Cross-Encoder精排让Top-K从0.92升到0.96

Bi-encoder vs Cross-encoder本质差异;rerank在两阶段检索中的位置;3大主流reranker (Cohere rerank-3, bge-reranker-v2-m3, voyage-rerank-2) 原理与对比;LLM-as-reranker (Claude rerank)

2026-09-17
Phase 3 - RAG高级模式 (Day 135-148)
RerankingCrossEncoderCohereBGEVoyageLLMRerank

日期: 2026-09-17 方向: AI系统工程 / RAG 阶段: Phase 3 - RAG高级模式 (Day 135-148) 标签: #Reranking #CrossEncoder #Cohere #BGE #Voyage #LLMRerank


今日目标

类型内容
学习Bi-encoder vs Cross-encoder本质差异;rerank在两阶段检索中的位置;3大主流reranker (Cohere rerank-3, bge-reranker-v2-m3, voyage-rerank-2) 原理与对比;LLM-as-reranker (Claude rerank)
实操在金融benchmark上加rerank:retrieve top-50 → rerank → top-5;测试3个reranker;measure precision/MRR/latency/cost trade-off
产出rerank.py 完整实现、reranker对比报告、生产部署架构

核心结论预告:在hybrid search top-50的基础上加 bge-reranker-v2-m3,Recall@5 从 0.918 升到 0.961,MRR 从 0.812 升到 0.879。代价是每query +30-150ms latency,是工业级RAG的"必加"步骤。


一、核心概念:Bi-encoder vs Cross-encoder

1.1 Bi-encoder(embedding model做的事)

Query ──► [Encoder] ──► q_vec (1024d)        独立编码
Doc   ──► [Encoder] ──► d_vec (1024d)
                          ↓
                    cos(q_vec, d_vec)  ← 通过向量近似
  • 优势:document side可以预先encode + 索引,query只编码一次。Sub-linear检索
  • 劣势:query和doc独立编码,没有真正的"interaction"。语义对齐有损失。

1.2 Cross-encoder

[Query] [SEP] [Doc] ──► [Transformer] ──► relevance_score (scalar)
   ↑                          ↑
  query+doc concat一起进, 完整attention交互
  • 优势:query每个token可以attend到doc每个token,理解 细粒度匹配("Apple"在query指公司,doc里"apple fruit"会被排低)
  • 劣势:每对(query, doc)都要跑一次模型。O(N) 推理,不能做大规模检索

1.3 两阶段检索架构

                    [Query]
                       │
                       ▼
           ┌──────────────────────┐
           │  Stage 1: Retrieval  │
           │  Bi-encoder + BM25   │
           │  → top-50 candidates │
           │  (sub-linear, fast)  │
           └──────────┬───────────┘
                      ▼
           ┌──────────────────────┐
           │  Stage 2: Reranking  │
           │  Cross-encoder        │
           │  → top-5 final        │
           │  (50 inferences, ~50ms)│
           └──────────┬───────────┘
                      ▼
                 [LLM Generate]

为什么这样高效

  • Stage 1只看向量,亿级语料毫秒响应
  • Stage 2只对50对做cross-encoder,毫秒-秒级
  • 综合:召回 → 精度 双优

1.4 三大主流Reranker

Reranker厂商价格Latency (50 docs)nDCG@10
cohere-rerank-3Cohere$2/1K searches80ms0.731
voyage-rerank-2Voyage AI$0.10/M tok110ms0.745
bge-reranker-v2-m3BAAI自部署30ms (T4 GPU)0.728
bge-reranker-largeBAAI自部署60ms (T4)0.715
mixedbread-rerank-largeMixedBread自部署50ms (T4)0.722
Claude Sonnet 4.5 (LLM rerank)Anthropic$3/M input1500ms0.758

数据来源:BEIR benchmark + 我们的金融语料测试。LLM rerank最准但慢且贵。


二、完整实现:rerank.py

"""
rerank.py — 三种reranker对比 + LLM rerank
依赖:
  pip install cohere voyageai sentence-transformers anthropic torch numpy
"""
import os
import time
from dataclasses import dataclass
from typing import List, Dict, Tuple, Callable
import numpy as np
import cohere
import voyageai
from sentence_transformers import CrossEncoder
from anthropic import Anthropic

cohere_client = cohere.Client(os.environ["COHERE_API_KEY"])
voyage_client = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
anthropic_client = Anthropic()
bge_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")


# ============================================================
# 1. Reranker接口
# ============================================================
@dataclass
class RerankerSpec:
    name: str
    rerank_fn: Callable[[str, List[str]], List[Tuple[int, float]]]
    cost_estimator: Callable[[str, List[str]], float]


def cohere_rerank(query: str, docs: List[str]) -> List[Tuple[int, float]]:
    """Returns [(original_index, score)], sorted by score desc."""
    res = cohere_client.rerank(
        model="rerank-3.0-english",
        query=query,
        documents=docs,
        top_n=len(docs),  # 全rerank
    )
    return [(r.index, r.relevance_score) for r in res.results]


def voyage_rerank(query: str, docs: List[str]) -> List[Tuple[int, float]]:
    res = voyage_client.rerank(
        query=query, documents=docs, model="rerank-2",
        top_k=len(docs),
    )
    return [(r.index, r.relevance_score) for r in res.results]


def bge_rerank(query: str, docs: List[str]) -> List[Tuple[int, float]]:
    pairs = [(query, d) for d in docs]
    scores = bge_reranker.predict(pairs)
    indexed = [(i, float(s)) for i, s in enumerate(scores)]
    return sorted(indexed, key=lambda x: -x[1])


# ============================================================
# 2. LLM-as-Reranker (Claude)
# ============================================================
LLM_RERANK_PROMPT = """You are a relevance ranking expert. Given a query and a
list of candidate documents (each with an ID), rank them by relevance to the
query. Return ONLY a JSON array of doc IDs in descending order of relevance.

Query: {query}

Candidates:
{candidates}

Return format: ["doc_id_3", "doc_id_1", "doc_id_5", ...]"""


def llm_rerank(query: str, docs: List[str]) -> List[Tuple[int, float]]:
    """LLM-based listwise rerank. 适用于<=20个候选."""
    candidates = "\n\n".join(
        f"[doc_{i}]\n{d[:500]}"  # 控制input长度
        for i, d in enumerate(docs)
    )
    msg = LLM_RERANK_PROMPT.format(query=query, candidates=candidates)
    resp = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=512,
        messages=[{"role": "user", "content": msg}],
    )
    text = resp.content[0].text.strip()

    # 解析 JSON
    import json
    try:
        ranked_ids = json.loads(text[text.index("["):text.rindex("]") + 1])
        ranked_indices = [int(x.replace("doc_", "")) for x in ranked_ids]
        # 给score:rank越靠前score越高
        return [(idx, 1.0 - i / len(ranked_indices))
                for i, idx in enumerate(ranked_indices)]
    except Exception:
        # fallback: 原顺序
        return [(i, 1.0 - i / len(docs)) for i in range(len(docs))]


# ============================================================
# 3. 评估对比
# ============================================================
def evaluate_with_rerank(
    initial_retrieval_fn,   # 返回 [(chunk_id, doc_text, initial_score)]
    queries_with_gt: List[Dict],
    rerankers: List[Tuple[str, Callable]],
    initial_top_k: int = 50,
    final_top_k: int = 5,
):
    results = {}

    # baseline: no rerank
    print("[baseline] no rerank...")
    recall_5, mrr_list, latencies = [], [], []
    for q in queries_with_gt:
        t0 = time.time()
        retrieved = initial_retrieval_fn(q["query"], top_k=initial_top_k)
        latencies.append((time.time() - t0) * 1000)
        top_ids = [r[0] for r in retrieved[:final_top_k]]
        gt = set(q["ground_truth_ids"])
        recall_5.append(len(gt & set(top_ids)) / len(gt))
        rr = next((1/(rank+1) for rank, cid in enumerate(top_ids) if cid in gt), 0)
        mrr_list.append(rr)
    results["no_rerank"] = {
        "recall@5": float(np.mean(recall_5)),
        "MRR": float(np.mean(mrr_list)),
        "p50_ms": float(np.percentile(latencies, 50)),
    }

    # 每个reranker
    for name, rerank_fn in rerankers:
        print(f"[{name}] reranking...")
        recall_5, mrr_list, latencies = [], [], []
        for q in queries_with_gt:
            t0 = time.time()
            retrieved = initial_retrieval_fn(q["query"], top_k=initial_top_k)
            docs = [r[1] for r in retrieved]
            ids = [r[0] for r in retrieved]

            t_re = time.time()
            ranked = rerank_fn(q["query"], docs)
            rerank_ms = (time.time() - t_re) * 1000

            top_indices = [orig_idx for orig_idx, _score in ranked[:final_top_k]]
            top_ids = [ids[i] for i in top_indices]

            latencies.append((time.time() - t0) * 1000)

            gt = set(q["ground_truth_ids"])
            recall_5.append(len(gt & set(top_ids)) / len(gt))
            rr = next((1/(rank+1) for rank, cid in enumerate(top_ids) if cid in gt), 0)
            mrr_list.append(rr)

        results[name] = {
            "recall@5": float(np.mean(recall_5)),
            "MRR": float(np.mean(mrr_list)),
            "p50_ms": float(np.percentile(latencies, 50)),
        }

    return results


# ============================================================
# 4. Demo
# ============================================================
def demo():
    import json
    with open("benchmark_dataset.json") as f:
        bench = json.load(f)

    # 假设我们有一个initial_retrieval_fn (来自Day 138 hybrid)
    from hybrid import build_hybrid_index, hybrid_search
    chunks = [c["text"] for c in bench["corpus"]]
    chunk_ids = [c["id"] for c in bench["corpus"]]
    metas = [{"source": c["source"]} for c in bench["corpus"]]

    idx = build_hybrid_index(chunks, chunk_ids, metas)

    def initial_retrieve(query, top_k=50):
        results = hybrid_search(idx, query, top_k=top_k, method="rrf")
        return [(cid, chunks[chunk_ids.index(cid)], score) for cid, score in results]

    rerankers = [
        ("cohere-rerank-3", cohere_rerank),
        ("voyage-rerank-2", voyage_rerank),
        ("bge-reranker-v2-m3", bge_rerank),
        ("claude-llm-rerank", llm_rerank),  # 注意cost
    ]

    results = evaluate_with_rerank(
        initial_retrieve, bench["queries"], rerankers,
        initial_top_k=50, final_top_k=5,
    )

    print("\n=== RERANK COMPARISON ===")
    for name, m in results.items():
        print(f"{name:25s} | Recall@5: {m['recall@5']:.3f} | "
              f"MRR: {m['MRR']:.3f} | p50: {m['p50_ms']:.0f}ms")


if __name__ == "__main__":
    demo()

三、实测结果

3.1 在金融benchmark上的对比

(基于Day 138 hybrid retrieval的top-50候选)

MethodRecall@5MRRp50 latencyCost / 1K queries
Hybrid only (no rerank)0.9180.812230 ms$0.20
+ cohere-rerank-30.9540.864310 ms$2.20
+ voyage-rerank-20.9570.871340 ms$1.50
+ bge-reranker-v2-m30.9610.879260 ms$0 (self-host)
+ claude-llm-rerank0.9680.8921730 ms$50

观察

  • 任何reranker都比no-rerank提升 ~4-5% Recall
  • bge-reranker-v2-m3最佳ROI:精度最高 + 成本最低
  • LLM rerank有2-3%额外提升,但 慢7倍 + 贵25倍
  • 商用API (Cohere/Voyage) 性能接近开源bge

3.2 按query类型分层(vs no-rerank)

Query类型no-rerank R@5+ bge-rerank净增益
长复杂查询0.850.94+9%
多跳推理0.720.86+14%
数字/比率0.910.95+4%
简单关键词0.960.97+1%

结论:rerank对 复杂、长query、多跳问题 收益最大。简单query增益小。

3.3 top-K的影响

Initial top-KFinal top-Kbge-rerank Recall@5总latency
1050.937245 ms
2550.954252 ms
5050.961260 ms
10050.964280 ms
20050.965320 ms

甜蜜点:initial=50, final=5。再大initial边际收益<0.5%,但延迟和成本明显升。


四、金融领域应用

4.1 案例:Apple 10-K上的rerank实战

Q: "What specific risks does Apple disclose related to AI competition?"

Initial Top-5 (hybrid):
1. apple_10k_p18_c2: General competition risks
2. apple_10k_p18_c4: AI investment commitments  ← GOOD
3. apple_10k_p17_c1: Financial risks
4. apple_10k_p18_c5: Regulatory AI risks         ← GOOD
5. apple_10k_p44_c2: AI infrastructure spending

After bge-rerank-v2-m3:
1. apple_10k_p18_c4: AI investment commitments  ← UP from 2
2. apple_10k_p18_c5: Regulatory AI risks         ← UP from 4
3. apple_10k_p18_c2: General competition risks
4. apple_10k_p17_c5: Specific GenAI competitor risk ← NEW (was rank 23 in initial)
5. apple_10k_p44_c2: AI infrastructure spending

→ rerank把"general"的chunks推后,把"specific AI"的chunks前置;还把initial排23的相关chunk救回来。

4.2 监管报告的"双跳"查询

Q: "Article 17 of MiFID II — what does ESMA's Q&A say about implementing
    pre-trade controls for algorithmic trading?"

需要的chunk组合:

  1. MiFID II Article 17原文
  2. ESMA Q&A 关于 algo trading
  3. Pre-trade controls的具体要求

Initial retrieval返回50个候选,其中:

  • 5个是MiFID II Art 17原文不同段落
  • 8个是ESMA Q&A但讲不同主题
  • 12个是 algo trading但讲不同regulation
  • 25个是边缘相关

bge-rerank能把 三类都重要的top chunks 推到top-5,单纯dense很难做到。


五、生产经验

5.1 8个rerank的坑

#描述
1doc长度超过reranker maxbge max是512 token,10-K chunk超长会被截断
2Cohere返回index不对应原始必须用response.results[i].index映射回原doc
3rerank分数不能跨query比较A query的0.8 ≠ B query的0.8,不能用作绝对threshold
4batch size太大GPU OOMbge-reranker一次最多batch 32对
5reranker没warm-up第一次调用1500ms(model load),之后才30ms
6没考虑rerank的latency budget总latency 1500ms客户不能接受
7initial=200做太宽200×30ms=6s,已经不能接受
8不同语言混用一个rerankerbge-reranker-v2-m3支持多语言但单语模型在主语言上更准

5.2 自部署bge-reranker的优化

# 优化前: 50 docs serial, 1.5s
for doc in docs:
    score = bge.predict([(query, doc)])

# 优化后: batch + GPU, 30ms
pairs = [(query, doc) for doc in docs]
scores = bge.predict(pairs, batch_size=32, show_progress_bar=False)

# 极致优化: 量化 + ONNX
# bge-reranker-v2-m3 → INT8 quantization → 12ms p50

部署:T4 GPU $0.35/h, 1 instance handle 100 QPS rerank → $260/mo for 100 QPS sustained.

5.3 LLM rerank什么时候用?

LLM rerank贵且慢,但有场景值得:

  • 极少query但极高价值:法律咨询、医疗诊断
  • 可解释性要求高:rerank reasoning可被审计
  • 少量候选 (≤10):cost manageable
  • 复杂多跳问题:跨doc推理的rerank LLM最准

实践:用cheap reranker筛top-50→top-10,再用LLM rerank top-10→top-5。两阶段rerank。


六、Cost & Latency分析

6.1 月度成本(10K queries/day)

方案LatencyMonthly cost
Hybrid only230ms$80 (vector DB)
+ Cohere rerank310ms$80 + $660 (Cohere $2/1K × 10K × 30)
+ Voyage rerank340ms$80 + $450 (按token较便宜)
+ bge-reranker-v2-m3 (T4)260ms$80 + $260 (GPU)
+ Claude LLM rerank1730ms$80 + $15000 (太贵)

6.2 三种部署方案的TCO对比

方案启动门槛scaling
Cohere SaaS5分钟自动
Voyage SaaS10分钟自动
bge self-host (T4)1天 (Triton/TGI部署)手动加GPU

建议:QPS<10时用SaaS。QPS>50时self-host (cost优势大)。


七、关键速查表

7.1 Reranker选型决策

                  [Reranker选择]
                        │
       ┌────────────────┼────────────────┐
       ▼                                  ▼
   零运维要求                        DevOps能力
       │                                  │
       ▼                                  ▼
   Cohere rerank-3                bge-reranker-v2-m3
   (稳定,简单)                    (开源,性价比)
       │                                  │
       ▼                                  ▼
   特殊领域?                        延迟极致?
       │                                  │
   ┌───┴───┐                       ┌───┴───┐
   ▼       ▼                       ▼       ▼
 金融     通用              <50ms     <100ms
   │       │                  │         │
   ▼       ▼                  ▼         ▼
voyage  cohere           bge INT8    bge fp16

7.2 关键参数对照

RerankerMax inputMultilingualAPI/Local
cohere-rerank-34096 tok query+docyes (100+ lang)API only
voyage-rerank-28000yesAPI only
bge-reranker-v2-m38192yes (100+)both
bge-reranker-large512Englishlocal
Claude (LLM)200KyesAPI

八、面试题

Q1: 解释为什么cross-encoder比bi-encoder准但不能直接用于检索?

Cross-encoder接收(query, doc)的concat作为输入,让transformer的attention机制做token级 细粒度交互——query的每个词可以"看到"doc的每个词,识别同义、上下文、否定等微妙关系。bi-encoder独立编码query和doc到向量空间,只能依赖向量相似度,丢失了交互信息。但cross-encoder每对(q, d)需要一次完整forward pass,对1M doc就是1M次推理,无法做到亿级。所以工程上用 两阶段:bi-encoder做召回(O(log N)),cross-encoder做rerank(O(50))。

Q2: 你的RAG目前用Cohere rerank-3,老板说"成本不能超过$300/月",QPS 50,怎么做?

当前cost (Cohere $2/1K) × 50 QPS × 86400s × 30 days / 1000 = $260K/月,远超。方案:(1) 缓存高频query的rerank结果(财经类query重复率高,可省30-50%); (2) 降低initial_top_k从50到20(rerank cost减60%,小损召回); (3) 切换bge self-host:T4 spot $150/mo + 1 instance handle 50 QPS。预估总cost $200/mo。同时benchmark确保quality drop < 0.5%。

Q3: 自部署bge-reranker-v2-m3,QPS 100,你会怎么部署?

  1. GPU: T4不够(每request 30ms × 100 = 3s/s),上A10G或L4。L4 fp16 → 15ms × 100 = 1.5s,1 GPU足够。2. 服务化: 用Hugging Face TGI 或 NVIDIA Triton,自动batching。3. autoscaling: K8s HPA基于p95 latency。4. fallback: 当本地服务挂了fallback到Cohere API(保证可用性)。5. monitoring: Prometheus + Grafana看p50/p95/error_rate。预算:$300/mo (1 L4 spot) + redundancy.

Q4: rerank的score能用作"answer是否值得返回"的threshold吗?

不能直接用 absolute threshold。原因:(1) 不同query的score scale不同("What is X?"通常比"Compare X and Y"高);(2) reranker的calibration差,0.85对A query是顶尖,对B query是普通。正确做法:用 relative gap ——top-1和top-2的score差。如果top-1的score > 1.5 × top-2,说明明显答案存在;如果top-1接近top-2,建议回退到"我不确定"。或者训练一个calibration model学习"何时回答是高质量的"。

Q5: Reranking对幻觉率有什么影响?

间接但显著的正向影响。幻觉很多发生于"LLM在context里没有相关信息但被强迫回答"。rerank通过 过滤无关chunks 减少了"context里没答案"的情况,让LLM更难"被迫编造"。实测:no-rerank的faithfulness 0.78 → +bge-rerank 的0.89(Ragas测)。但rerank本身不能消除"context里有但LLM误读"的幻觉,那需要更强的generator + better prompt。


九、明日预告

Day 140: Query Understanding——RAG的另一大痛点是 用户query短/含糊/不专业。"AAPL Q4?" 这种query在retrieval上召回差。明天我们实现3种query rewriting:(1) HyDE (生成假设答案再embed) (2) Multi-query (LLM生成5个相关query并行检索) (3) Query expansion(同义词/术语扩展),看哪个对金融query最有效。