Expert Day 129

Multi-modal Prompting: Vision, Voice, Documents, and a Deep Dive into Claude Vision

Multimodal LLM architectures (CLIP/encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS)

2026-09-07
Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)
Vision, Multimodal, ClaudeVision, Document, PDF, OCR

Date: 2026-09-07 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #Vision #Multimodal #ClaudeVision #Document #PDF #OCR


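Today's goals

Type | Content
Learn | Multimodal LLM architectures (CLIP/encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS)
Practice | Use Claude Vision to process a financial-statement PDF, a bar chart image, and a scanned invoice
Deliverable | mm_demo.py + a multimodal cost/accuracy comparison table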

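1. Theoretical foundations

1.1 Multimodal LLM architectures

Mainstream patterns:

  1. Vision Encoder + LLM: an encoder such as CLIP ViT produces image embeddings, which are projected into the LLM's token space. Examples: LLaVA and, presumably, Claude 3+.
  2. Native Multimodal: text and images are trained jointly from the start (Gemini 1.5, GPT-4o).
  3. Adapter approach: freeze the LLM and train only the adapter layers (cheap, but quality is lower).

Claude Vision (3+) most likely uses a Vision Encoder + LLM fusion architecture (an inference from its publicly visible capabilities).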

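1.2 Image tokenization

An image has to become tokens before it can be fed to the LLM:

  • Patching: split a 1024×1024 image into 32×32-pixel patches (1024 patches in total)
  • Each patch goes through the vision encoder to produce an embedding
  • The embeddings are projected into the LLM's space, with each patch corresponding to roughly 1 token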
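The patch arithmetic above can be checked with a small sketch. The helper name `patch_grid` is hypothetical, and the "pad partial patches at the edges" rule is an assumption made for illustration; it does not describe any particular model's preprocessing.

```python
import math

def patch_grid(width: int, height: int, patch_size: int = 32) -> tuple[int, int, int]:
    """Return (columns, rows, total_patches) for a ViT-style patch split.

    Assumption: edges that do not divide evenly are padded, hence ceil().
    """
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols, rows, cols * rows

cols, rows, total = patch_grid(1024, 1024)
print(cols, rows, total)  # 32 32 1024
```

If each patch maps to about one token, the patch count is also a first approximation of the image's token count.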

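Claude Vision token cost estimate: $$ \text{tokens} \approx \frac{w \times h}{750} $$ where w and h are the image width and height in pixels (based on the Anthropic docs; the count is actually computed on the downscaled image, but the 750 ratio is a good rule of thumb)

Examples:

  • 1280×1024 → ~1750 tokens
  • 4K image (3840×2160) → ~11000 tokens (the real cost is likely lower, because Anthropic downscales the image first)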
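A minimal sketch of the estimate, including the downscale step. The function name and the `max_long_edge=1568` default are assumptions for illustration (1568px is the long-edge figure used later in these notes); the real server-side resizing rules may differ.

```python
def estimate_image_tokens(width: int, height: int, max_long_edge: int = 1568) -> int:
    """Estimate vision tokens with the w*h/750 rule after an assumed downscale."""
    long_edge = max(width, height)
    scale = min(1.0, max_long_edge / long_edge)  # never upscale
    scaled_w = width * scale
    scaled_h = height * scale
    return round(scaled_w * scaled_h / 750)

print(estimate_image_tokens(1280, 1024))                      # 1748
print(estimate_image_tokens(3840, 2160))                      # roughly 1844 after downscaling
print(estimate_image_tokens(3840, 2160, max_long_edge=3840))  # 11059, i.e. no downscaling
```

Under these assumptions the 4K image costs about as much as the 1280×1024 one, which is why uploading the original resolution buys nothing.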

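1.3 Mainstream multimodal capabilities

Columns: Model | Image | Video | Audio | PDF native | Document layout
Claude 4.7: video ❌ direct (frame extract); ✅ strong
GPT-5: video ✅ (Sora integration); audio ✅ (TTS+STT)
Gemini 2.5 Pro: video ✅ 1h+ video
Llama 3.2 Vision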

1.4 Claude Vision strengths

Capabilities that Anthropic officially highlights:

  • Chart/graph understanding (financial, scientific)
  • Handwriting recognition
  • Multi-image comparison
  • OCR-quality text extraction
  • Diagram analysis (UML/flowchart)
  • Long documents (PDFs up to 100 pages)

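2. Intuition

Why are vision tokens so expensive?

One image costs 1000+ tokens, the equivalent of 500+ English words. But the information density is high (one image of a table can stand in for several hundred lines of text description). The ROI depends on the task: OCR on documents replaces manual data entry, so it is worth it; images in casual chat are expensive for what they add.

Why can an LLM "understand" an image?

It does not really "see". The image encoder compresses the image into vectors, and during training those vectors are aligned with image titles and captions. So the model is essentially "translating": it turns image vectors into the token sequence of a description of the image.

Why is document processing hard?

A PDF can contain:

  • Text streams (vector)
  • Scanned images (raster)
  • Complex layouts (multi-column, tables, footnotes)
  • Embedded charts

Plain text extraction loses the layout. A vision LLM looks at the page image directly and keeps all of that information, but at a higher cost. In production, mix the two: use text for pages with an extractable text layer and vision for scanned pages.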
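The per-page routing described above can be sketched as follows. `route_pages` and the `min_chars` threshold are hypothetical; the input is assumed to be the text each page yields from whatever text extractor you already use, with an empty or near-empty string for scanned pages.

```python
def route_pages(page_texts: list[str], min_chars: int = 200) -> list[str]:
    """Label each page 'text' or 'vision' from the length of its extracted text.

    Assumption: a page whose text layer yields fewer than `min_chars`
    non-whitespace characters is a scan (or mostly figures) and needs vision.
    """
    routes = []
    for text in page_texts:
        visible = "".join(text.split())  # drop all whitespace before counting
        routes.append("text" if len(visible) >= min_chars else "vision")
    return routes

pages = ["Revenue grew 12% year over year. " * 20, "", " \n\t "]
print(route_pages(pages))  # ['text', 'vision', 'vision']
```

A character-count threshold is crude: a page with a full text layer plus an important chart would still be routed to the text path, so the threshold is worth tuning on real documents.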


3. Code implementation

3.1 Basic Claude Vision call

# vision_basics.py
"""
Analyze a financial chart with Claude Vision
"""
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def analyze_chart(image_path, prompt="Describe this financial chart in detail."):
    image_data = encode_image(image_path)
    media_type = f"image/{Path(image_path).suffix[1:].lower()}"
    if media_type == "image/jpg":
        media_type = "image/jpeg"

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    print(f"Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
    return response.content[0].text

# Usage
# print(analyze_chart("aapl_revenue_chart.png"))

3.2 Passing an image by URL (smaller request payload)

# Anthropic also supports fetching the image directly from a URL
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Extract all numeric values."}
        ]
    }]
)

3.3 PDF processing (Claude supports PDFs natively)

# pdf_analysis.py
"""
Feed the PDF directly (Claude does page-by-page vision processing internally)
"""
import anthropic
import base64

client = anthropic.Anthropic()

with open("apple_10k.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {"type": "text", "text": "Extract Q3 income statement key metrics."}
        ]
    }]
)
print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")

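3.4 Files API: upload once, reference many times

PDFs can be large (e.g. 100 pages = 50MB), so sending inline base64 on every request wastes bandwidth. With the Files API you upload once:

# files_api_demo.py
"""
Anthropic Files API: upload a PDF, reference it many times
"""
import anthropic

client = anthropic.Anthropic()

# 1. Upload (the with-block closes the file handle afterwards)
with open("apple_10k.pdf", "rb") as f:
    file_obj = client.beta.files.upload(
        file=("apple_10k.pdf", f, "application/pdf")
    )
file_id = file_obj.id
print(f"Uploaded: {file_id}")

# 2. Ask multiple questions without re-uploading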
for question in [
    "What was Q3 revenue?",
    "What are the main risk factors?",
    "Summarize the management discussion."
]:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document", "source": {"type": "file", "file_id": file_id}},
                {"type": "text", "text": question}
            ]
        }],
        extra_headers={"anthropic-beta": "files-api-2025-04-14"}
    )
    print(f"Q: {question}\nA: {response.content[0].text[:200]}\n")

# 3. Files can be listed and deleted
files = client.beta.files.list()
# client.beta.files.delete(file_id)

3.5 Multi-image comparison

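# multi_image_compare.py
"""
Compare two financial-report charts and find the differences
(reuses `client` and `encode_image` from vision_basics.py)
"""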
def compare_charts(img1_path, img2_path):
    img1 = encode_image(img1_path)
    img2 = encode_image(img2_path)

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Image 1 (Q2):"},
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/png", "data": img1}},
                {"type": "text", "text": "Image 2 (Q3):"},
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/png", "data": img2}},
                {"type": "text", "text": "Compare these two quarterly revenue charts. Highlight the key differences."}
            ]
        }]
    )
    return response.content[0].text

3.6 Citations: have Claude mark "which page the answer came from"

# citations_demo.py
# (reuses `client` and `pdf_data` from pdf_analysis.py)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf",
                           "data": pdf_data},
                "citations": {"enabled": True}  # <-- enable citations
            },
            {"type": "text", "text": "What were operating expenses for FY26?"}
        ]
    }]
)

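# The response contains text blocks, some of which carry citations
for block in response.content:
    if block.type == "text":
        print(block.text)
        # `citations` can be missing or None on blocks with no cited source
        for cit in getattr(block, "citations", None) or []:
            # PDF citations use the page_location type
            if cit.type == "page_location":
                print(f"  [Source: page {cit.start_page_number}-{cit.end_page_number}]")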

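3.7 Voice (Whisper for STT, ElevenLabs for TTS)

Anthropic has no STT/TTS of its own, so pair Claude with external services:

# voice_pipeline.py
"""
Voice → Claude → Voice pipeline
"""
import anthropic
import openai  # for Whisper
import elevenlabs  # for TTS

client = anthropic.Anthropic()

# 1. Speech-to-text (Whisper); the with-block closes the audio file
with open("user_audio.mp3", "rb") as audio_file:
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    ).text

# 2. Claude processes the transcript
resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    messages=[{"role": "user", "content": transcript}]
)
answer = resp.content[0].text

# 3. Text-to-speech
# Assumes an older elevenlabs SDK that exposes a module-level generate() returning bytes;
# newer SDK versions use a client object instead, so check your installed version.
audio = elevenlabs.generate(text=answer, voice="Rachel")
with open("response.mp3", "wb") as f:
    f.write(audio)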

4. Anthropic API best practices

4.1 Three ways to supply an image source

# 1. base64 inline (preferred for small images)
{"source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}}

# 2. URL (Claude fetches it)
{"source": {"type": "url", "url": "https://..."}}

# 3. Files API (large files used many times)
{"source": {"type": "file", "file_id": "file_xxx"}}

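4.2 Image preprocessing tips

  • Resize to a sensible size: a long edge above 2000px adds cost with no benefit. A long edge of 1024-1568px is the sweet spot.
  • Format: PNG gives better quality but larger files; JPEG suits natural images; for OCR use quality 85+
  • Black-and-white document OCR: converting to monochrome cuts file size by about 70% with almost no loss of accuracy
  • PDF: feed it directly; do not convert it to an array of images first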

4.3 Vision token cost monitoring

def estimate_vision_cost(width, height, model="claude-sonnet-4-6"):
    # USD per million input tokens
    PRICES = {
        "claude-opus-4-7": 15.0,
        "claude-sonnet-4-6": 3.0,
        "claude-haiku-4-5": 0.8,
    }
    # Rough model: the image is scaled so its long edge is at most 1568px,
    # then tokens ≈ pixels / 750 (about 0.004 USD per 1000x1000 px on Sonnet)
    scale = min(1.0, 1568 / max(width, height))
    scaled_pixels = (width * scale) * (height * scale)
    tokens = scaled_pixels / 750  # rough
    return tokens * PRICES[model] / 1e6

print(estimate_vision_cost(1280, 1024, "claude-opus-4-7"))
# ~$0.026 per image with Opus

4.4 Cache control with vision

# Cache stable images (e.g. a logo or template)
content = [
    {"type": "image", "source": {...}, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "..."}
]
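A slightly fuller sketch of the same idea: put the stable image first and mark it with `cache_control`, then append the question that changes per request. `build_cached_content` is a hypothetical helper that only builds the content list and does not call the API.

```python
def build_cached_content(image_b64: str, media_type: str, question: str) -> list[dict]:
    """Stable image first (marked for caching), changing question last."""
    return [
        {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": image_b64},
            # Marks the end of the prefix that should be cached
            "cache_control": {"type": "ephemeral"},
        },
        # Varies per request, so it stays after the cache breakpoint
        {"type": "text", "text": question},
    ]

content = build_cached_content("<b64>", "image/png", "Does this invoice match the template?")
print([block["type"] for block in content])  # ['image', 'text']
```

Ordering matters because caching applies to the prompt prefix up to the marked block. Prompt caching also has a minimum cacheable prefix length, so a single small logo on its own may not be cached.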

5. Applications in finance

Case 1: scanned-invoice OCR + structured extraction

INVOICE_TOOL = {
    "name": "submit_invoice",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "amount": {"type": "number"}
                    }
                }
            },
            "subtotal": {"type": "number"},
            "tax": {"type": "number"},
            "total": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["vendor_name", "total", "currency"]
    }
}

def parse_invoice(image_path):
    img_data = encode_image(image_path)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[INVOICE_TOOL],
        tool_choice={"type": "tool", "name": "submit_invoice"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/jpeg",
                                              "data": img_data}},
                {"type": "text", "text": "Extract invoice data."}
            ]
        }]
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
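Because OCR-style extraction can misread digits, a cheap arithmetic check on the returned dict is a useful second line of defence. The sketch below is an assumption about how such a check might look; `check_invoice_totals` and the 0.01 tolerance are hypothetical, and the field names follow the `submit_invoice` schema above.

```python
import math

def check_invoice_totals(invoice: dict, tol: float = 0.01) -> list[str]:
    """Return a list of inconsistencies found (an empty list means consistent)."""
    problems = []
    items = invoice.get("line_items") or []
    for i, item in enumerate(items):
        qty, price, amount = item.get("quantity"), item.get("unit_price"), item.get("amount")
        if qty is None or price is None or amount is None:
            continue  # optional fields: nothing to verify for this line
        if not math.isclose(qty * price, amount, abs_tol=tol):
            problems.append(f"line {i}: quantity * unit_price != amount")

    subtotal, tax, total = invoice.get("subtotal"), invoice.get("tax"), invoice.get("total")
    amounts = [it["amount"] for it in items if it.get("amount") is not None]
    # Only compare against the subtotal when every line has an amount
    if subtotal is not None and items and len(amounts) == len(items):
        if not math.isclose(sum(amounts), subtotal, abs_tol=tol):
            problems.append("sum of line amounts != subtotal")
    if subtotal is not None and tax is not None and total is not None:
        if not math.isclose(subtotal + tax, total, abs_tol=tol):
            problems.append("subtotal + tax != total")
    return problems

good = {"line_items": [{"quantity": 2, "unit_price": 10.0, "amount": 20.0}],
        "subtotal": 20.0, "tax": 2.0, "total": 22.0}
print(check_invoice_totals(good))                    # []
print(check_invoice_totals(dict(good, total=25.0)))  # ['subtotal + tax != total']
```

Invoices that fail the check are natural candidates for a retry on a stronger model or for human review.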

Case 2: financial-report PDF analysis pipeline

def fin_report_pipeline(pdf_path):
    # Upload once
    with open(pdf_path, "rb") as f:
        file_obj = client.beta.files.upload(
            file=(pdf_path, f, "application/pdf")
        )

    questions = [
        "Extract income statement (revenue, COGS, OpEx, Net Income).",
        "List the top 3 risks mentioned in the risk factor section.",
        "What's the management's outlook for next quarter?",
        "Are there any related party transactions disclosed?",
    ]

    results = {}
    for q in questions:
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            messages=[{"role": "user", "content": [
                {"type": "document",
                 "source": {"type": "file", "file_id": file_obj.id},
                 "citations": {"enabled": True}},
                {"type": "text", "text": q}
            ]}],
            extra_headers={"anthropic-beta": "files-api-2025-04-14"}
        )
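        # With citations enabled the answer is split across several text blocks, so join them
        results[q] = "".join(b.text for b in resp.content if b.type == "text")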
    return results

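Case 3: chart reading and verification (chart-to-data)

# Some brokerage research reports contain only a bar chart, with no data table
# Use Claude Vision to "read" the data points off the chart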
def read_bar_chart(chart_image, x_label, y_label):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {...}},
            {"type": "text", "text": f"""Read this bar chart.
For each bar, output (x={x_label}, y={y_label}) as CSV.
Only output the CSV, no explanation."""}
        ]}]
    )
    return resp.content[0].text
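The CSV text that comes back still has to be parsed and sanity-checked before use. A minimal sketch, assuming two-column `label,value` rows; `parse_chart_csv` is a hypothetical helper, and rows whose value is not numeric (a header, or a bar the model could not read) are skipped rather than guessed.

```python
import csv
import io

def parse_chart_csv(csv_text: str) -> list[tuple[str, float]]:
    """Parse 'label,value' rows; skip a header or any row whose value is not numeric."""
    points = []
    for row in csv.reader(io.StringIO(csv_text.strip())):
        if len(row) < 2:
            continue  # blank or malformed line
        label = row[0].strip()
        raw = row[1].strip().replace(",", "")  # tolerate quoted thousands separators
        try:
            points.append((label, float(raw)))
        except ValueError:
            continue  # header row or unreadable value
    return points

sample = "quarter,revenue\nQ1,10\nQ2,12.5\nQ3,n/a\n"
print(parse_chart_csv(sample))  # [('Q1', 10.0), ('Q2', 12.5)]
```

Comparing the number of parsed points with the number of bars you expect is a simple way to detect a partially read chart.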

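6. Common pitfalls

  1. Excessive image resolution burns money for nothing: a 4K original is unnecessary. Claude downscales it internally, but you have already paid for the upload bandwidth. Resize to a 1568px long edge on the client side first.
  2. PDFs with tables but no text layer: scanned (OCR) PDFs and digital PDFs differ a lot. The former must go through vision (more expensive); for the latter, text extraction is cheaper.
  3. Mixed-up multi-image order: "Image 1" gets referred to as "Image 2". Put an explicit text label such as "Image labelled A:" before each image.
  4. Files API limits: 500MB per file, 100GB total per organization. Production systems need cleanup.
  5. Citations only work with citations: enabled: if you forget to turn it on there are no citations, and they cannot be added after the fact.
  6. Handwriting OCR is not perfect: accuracy is 85-95%, so critical applications need a second check.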
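For the multi-image ordering pitfall, one way to keep labels and images in step is to generate both from the same loop. A sketch under that assumption; `build_labelled_images` is a hypothetical helper that only builds the content list.

```python
import string

def build_labelled_images(images_b64: list[str], media_type: str, question: str) -> list[dict]:
    """Interleave 'Image labelled A:' text blocks with image blocks, then add the question."""
    if len(images_b64) > len(string.ascii_uppercase):
        raise ValueError("this sketch only labels up to 26 images")
    content = []
    for letter, data in zip(string.ascii_uppercase, images_b64):
        # The label comes immediately before the image it names
        content.append({"type": "text", "text": f"Image labelled {letter}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    content.append({"type": "text", "text": question})
    return content

blocks = build_labelled_images(["<b64-q2>", "<b64-q3>"], "image/png", "Compare A and B.")
print([b["type"] for b in blocks])  # ['text', 'image', 'text', 'image', 'text']
```

The question can then refer to "A" and "B" without depending on the model counting images in order.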

7. Quick reference

Vision input size limits

Single image: max 5MB
Images per request: up to 20 (Claude 4+)
PDF: 100 pages, 32MB inline / larger via the Files API
Image formats: JPEG, PNG, GIF, WEBP

Vision cost estimate (Sonnet 4.6, by the w×h/750 rule, before any server-side downscaling)

Small image (~256×256):      ~90 tokens   ~$0.0003
Medium image (~1024×1024):   ~1400 tokens ~$0.004
Large image (~1568×1568):    ~3300 tokens ~$0.010

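When to use Vision vs. OCR preprocessing

  • Text-dominant documents: extract with pdfplumber/PyPDF2 first → feed the text to Claude (about 10x cheaper)
  • Images/charts/scans: use Claude Vision directly
  • Mixed: decide page by page (hybrid approach)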

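8. Interview questions

Q1: What does Claude Vision do better than a separate OCR + LLM pipeline?

(a) Layout aware: Claude looks at the page directly, so it does not lose the distinction between a table and a paragraph. (b) End-to-end: a single API call instead of two (OCR, then LLM). (c) Reasoning over visuals: it can understand the trend in a chart's shape, not just read numbers. The price: 2-5x more expensive.

Q2: Design an invoice automation system. How do you control cost?

(1) Route by simple file type first: digital PDFs take the text path, scans take the vision path. (2) Use Haiku for the first pass and retry low-confidence results on Sonnet. (3) Enforce the schema with the Tools API. (4) Cache vendor logos / template images. (5) Use the Files API to avoid re-uploading. (6) Run overnight bulk jobs through the Batch API at 50% off.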
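Point (2) can be sketched as a two-tier cascade. Everything here is hypothetical: the API does not return a confidence score, so `confidence` is assumed to come from your own checks (for example, whether the extracted totals add up), and the extractors are passed in as plain callables so the routing logic can be tested without network calls.

```python
from typing import Callable

# An extractor takes a document id and returns (fields, confidence in [0, 1])
Extractor = Callable[[str], tuple[dict, float]]

def extract_with_cascade(doc_id: str, cheap: Extractor, strong: Extractor,
                         threshold: float = 0.8) -> tuple[dict, str]:
    """Try the cheap extractor first; escalate only when confidence is low."""
    fields, confidence = cheap(doc_id)
    if confidence >= threshold:
        return fields, "cheap"
    fields, _ = strong(doc_id)
    return fields, "strong"

# Stand-ins for a Haiku-backed and a Sonnet-backed extractor
def fake_cheap(doc_id: str) -> tuple[dict, float]:
    confidence = 0.95 if doc_id == "clean_scan" else 0.4
    return {"total": 10.0}, confidence

def fake_strong(doc_id: str) -> tuple[dict, float]:
    return {"total": 12.0}, 0.99

print(extract_with_cascade("clean_scan", fake_cheap, fake_strong))   # ({'total': 10.0}, 'cheap')
print(extract_with_cascade("blurry_scan", fake_cheap, fake_strong))  # ({'total': 12.0}, 'strong')
```

The saving depends on how often the cheap tier clears the threshold, so the escalation rate is worth logging.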

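Q3: If you show Claude a complex financial chart and ask "how much did Q3 revenue rise", how accurate is it?

In my tests, Claude 4.7 reads numeric values from clear charts with ~95% accuracy; on handwritten or unclear ones it is ~80%. Critical applications must be verified by human spot checks.

Q4: Will multi-modal models replace specialized vision models (such as YOLO for detection)?

Not in the short term. Specialized models are faster, cheaper, and fine-tunable on specific tasks. Multi-modal LLMs win on zero-shot generality. The future is hybrid: YOLO for real-time detection, Claude for semantic understanding.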


9. Coming up tomorrow

Day 130: Long-context engineering: 1M context, lost-in-the-middle, and prompt caching measurements.