Expert Day 129

Multi-modal Prompting: Vision, Voice, Documents, and a Claude Vision Deep Dive

Multimodal LLM architectures (CLIP / encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS)

2026-09-07

Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134)


Date: 2026-09-07 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #Vision #Multimodal #ClaudeVision #Document #PDF #OCR

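Today's Goals

Type	Content
Learn	Multimodal LLM architectures (CLIP / encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS)
Practice	Use Claude Vision to process financial-statement PDFs, bar charts, and scanned invoices
Deliverable	`mm_demo.py` + a multimodal cost/accuracy comparison table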

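1. Theory

1.1 Multimodal LLM architectures

Common patterns:

Vision Encoder + LLM: an encoder such as CLIP ViT extracts image embeddings, which are then projected into the LLM's token space. Examples: LLaVA and, presumably, Claude 3+.
Native Multimodal: text and images are trained jointly from the start (Gemini 1.5, GPT-4o)
Adapter approach: freeze the LLM and train only the adapter layers (cheap, but the results are weaker)

Claude Vision (3+) most likely uses a Vision Encoder + LLM fusion architecture (a guess based on its publicly observable capabilities).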

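1.2 Image tokenization

An image has to be turned into tokens before it can be fed to the LLM:

Patchify: cut a 1024×1024 image into patches of 32×32 pixels (a 32×32 grid, 1024 patches in total)
Pass each patch through the vision encoder to get an embedding
Project into the LLM's embedding space, where each patch corresponds to roughly 1 token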
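The patch arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting step, assuming non-overlapping square patches; `patch_grid` is a hypothetical helper name, and real encoders may pad or resize images that do not divide evenly.

```python
def patch_grid(width, height, patch_size):
    """Return (columns, rows, total patches) for non-overlapping square patches."""
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    cols = width // patch_size
    rows = height // patch_size
    return cols, rows, cols * rows


# A 1024x1024 image with 32x32-pixel patches gives a 32x32 grid.
print(patch_grid(1024, 1024, 32))  # (32, 32, 1024)
```

With roughly one token per patch, the grid size is also a first estimate of how many tokens the image contributes.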

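Claude Vision token cost estimate: $$ \text{tokens} \approx \frac{w \times h}{750} $$ where $w$ and $h$ are the image width and height in pixels. (Based on the Anthropic documentation; the count is actually computed on the downscaled image, but the 750 ratio is a good rule of thumb.)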

Examples:

1280×1024 → ~1750 tokens
4K image (3840×2160) → ~11000 tokens (you will not necessarily be billed that much, because Anthropic downscales large images first)
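To make the two examples concrete, here is a minimal sketch of the rule of thumb, with an optional downscale step. `estimate_image_tokens` and the `max_long_edge` parameter are illustrative names, and the 1568px long-edge value is taken from the preprocessing advice later in these notes; the exact resizing the API applies may differ.

```python
from typing import Optional


def estimate_image_tokens(width: int, height: int,
                          max_long_edge: Optional[int] = None) -> int:
    """Estimate tokens as (w * h) / 750, optionally after a long-edge downscale."""
    if max_long_edge is not None and max(width, height) > max_long_edge:
        # Shrink both sides by the same factor so the aspect ratio is kept.
        scale = max_long_edge / max(width, height)
        width = round(width * scale)
        height = round(height * scale)
    return round(width * height / 750)


print(estimate_image_tokens(1280, 1024))        # 1748
print(estimate_image_tokens(3840, 2160))        # 11059
print(estimate_image_tokens(3840, 2160, 1568))  # 1844
```

The last line shows why the raw 4K figure overstates the bill: once the long edge is capped, the same image lands under 2000 tokens.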

1.3 Multimodal capabilities of mainstream models

Model	Image	Video	Audio	Native PDF	Document layout
Claude 4.7	✅	❌ not direct (extract frames)	❌	✅	✅ strong
GPT-5	✅	✅ (Sora integration)	✅ (TTS+STT)	✅	✅
Gemini 2.5 Pro	✅	✅ 1h+ video	✅	✅	✅
Llama 3.2 Vision	✅	❌	❌	❌	weak

1.4 Claude Vision strengths

Capabilities that Anthropic officially highlights:

Chart/graph understanding (financial, scientific)
Handwriting recognition
Multi-image comparison
OCR-quality text extraction
Diagram analysis (UML/flowchart)
Long documents (PDFs of up to 100 pages)

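2. Intuition

Why are vision tokens so expensive?

A single image costs 1000+ tokens, about as much as 750+ words of English text at the usual rate of roughly 0.75 words per token. But the information density is high (one image of a table can stand in for several hundred lines of textual description). The ROI depends on the task: OCR on documents that would otherwise be typed in by hand is worth it; sending images in casual chat is expensive.

Why can an LLM "understand" an image?

It does not really "see". The image encoder compresses the image into vectors, and during training those vectors are aligned with image titles and captions. In essence the model is "translating": it turns image vectors into the token sequence of a description of the image.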

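Why is document processing hard?

A PDF can contain:

Text streams (vector)
Scanned images (raster)
Complex layouts (multi-column, tables, footnotes)
Embedded charts

Plain text extraction loses the layout. A vision LLM looks at the rendered page and keeps all of that information, but at a higher cost. Production systems mix the two: pages with extractable text go through the text path, scanned pages go through vision.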
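The per-page routing described above could look like the sketch below. It assumes the text of each page has already been pulled out by some extractor (for example pdfplumber) and only decides the route; `route_pages` and the 50-character threshold are illustrative assumptions that would need tuning on real documents.

```python
def route_pages(page_texts, min_chars=50):
    """Label each page 'text' if extraction yielded enough characters, else 'vision'."""
    routes = []
    for text in page_texts:
        # Scanned pages typically come back as None or an empty string.
        stripped = (text or "").strip()
        routes.append("text" if len(stripped) >= min_chars else "vision")
    return routes


pages = [
    "Revenue for the quarter increased compared with the prior year period.",
    "",    # scanned page, no text layer
    None,  # extractor returned nothing
]
print(route_pages(pages))  # ['text', 'vision', 'vision']
```

A character count is a crude signal: a page with a text layer plus an important embedded chart would still be routed to the text path, so a production version would likely add further checks.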

3. Code

3.1 Basic Claude Vision call

# vision_basics.py
"""
Analyze a financial chart with Claude Vision
"""
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string"""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def analyze_chart(image_path, prompt="Describe this financial chart in detail."):
    image_data = encode_image(image_path)
    media_type = f"image/{Path(image_path).suffix[1:].lower()}"
    if media_type == "image/jpg":
        media_type = "image/jpeg"

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    print(f"Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
    return response.content[0].text

# Usage
# print(analyze_chart("aapl_revenue_chart.png"))
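`analyze_chart` derives the media type straight from the file suffix, so a `.bmp` or `.tiff` file would only fail once the request reaches the API. A stricter variant, sketched below, checks the suffix against the four formats listed in the quick reference at the end of these notes (JPEG, PNG, GIF, WEBP). `media_type_for` is a hypothetical helper, not part of the Anthropic SDK.

```python
from pathlib import Path

# Suffix -> media type for the image formats named in the quick reference.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def media_type_for(image_path):
    """Return the media type for a supported image file, or raise ValueError."""
    suffix = Path(image_path).suffix.lower()
    try:
        return SUPPORTED_MEDIA_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported image format: {suffix or '(none)'}") from None


print(media_type_for("aapl_revenue_chart.PNG"))  # image/png
print(media_type_for("scan.jpg"))                # image/jpeg
```

Failing early keeps the error close to the file that caused it instead of surfacing as an API error after the upload.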

3.2 Using a URL source (smaller request payload)

# Anthropic also supports fetching the image directly from a URL
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Extract all numeric values."}
        ]
    }]
)

3.3 PDF processing (Claude supports PDFs natively)

# pdf_analysis.py
"""
Feed the PDF directly (Claude processes it page by page internally, using vision)
"""
import anthropic
import base64

client = anthropic.Anthropic()

with open("apple_10k.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {"type": "text", "text": "Extract Q3 income statement key metrics."}
        ]
    }]
)
print(response.content[0].text)
print(f"Cost: {response.usage.input_tokens} input tokens")

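3.4 Files API: upload once, reference many times

PDFs can be large (e.g. 100 pages = 50MB), and sending one as inline base64 on every request wastes bandwidth. With the Files API you upload once:

# files_api_demo.py
"""
Anthropic Files API: upload a PDF, reference it many times
"""
import anthropic

client = anthropic.Anthropic()

# 1. Upload (the context manager closes the file handle afterwards)
with open("apple_10k.pdf", "rb") as f:
    file_obj = client.beta.files.upload(
        file=("apple_10k.pdf", f, "application/pdf")
    )
file_id = file_obj.id
print(f"Uploaded: {file_id}")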

# 2. Query repeatedly without re-uploading
for question in [
    "What was Q3 revenue?",
    "What are the main risk factors?",
    "Summarize the management discussion."
]:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document", "source": {"type": "file", "file_id": file_id}},
                {"type": "text", "text": question}
            ]
        }],
        extra_headers={"anthropic-beta": "files-api-2025-04-14"}
    )
    print(f"Q: {question}\nA: {response.content[0].text[:200]}\n")

# 3. Files can be listed and deleted
files = client.beta.files.list()
# client.beta.files.delete(file_id)

3.5 Multi-image comparison

# multi_image_compare.py
"""
Compare two earnings charts and find the differences
(reuses `client` and `encode_image` from vision_basics.py)
"""
def compare_charts(img1_path, img2_path):
    img1 = encode_image(img1_path)
    img2 = encode_image(img2_path)

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Image 1 (Q2):"},
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/png", "data": img1}},
                {"type": "text", "text": "Image 2 (Q3):"},
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/png", "data": img2}},
                {"type": "text", "text": "Compare these two quarterly revenue charts. Highlight the key differences."}
            ]
        }]
    )
    return response.content[0].text
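The two-image pattern generalizes to any number of images by pairing each image with a text label placed directly before it. The sketch below only builds the `content` list and makes no API call; `build_labeled_content` is a hypothetical helper, and the base64 strings in the usage lines are placeholders.

```python
def build_labeled_content(images, question):
    """Build a content list from (label, media_type, base64_data) tuples.

    Each image is preceded by its own text label so the model can refer
    to it unambiguously; the question goes last.
    """
    content = []
    for label, media_type, data in images:
        content.append({"type": "text", "text": f"Image {label}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    content.append({"type": "text", "text": question})
    return content


content = build_labeled_content(
    [("A (Q2)", "image/png", "<b64-of-q2>"), ("B (Q3)", "image/png", "<b64-of-q3>")],
    "Compare these two quarterly revenue charts.",
)
print([block["type"] for block in content])
# ['text', 'image', 'text', 'image', 'text']
```

The resulting list can be passed as the `content` of a user message in the same way as in `compare_charts`.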

3.6 Citations: have Claude mark which page an answer comes from

# citations_demo.py
# (reuses `client` and `pdf_data` from pdf_analysis.py)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf",
                           "data": pdf_data},
                "citations": {"enabled": True}  # <-- enable citations
            },
            {"type": "text", "text": "What were operating expenses for FY26?"}
        ]
    }]
)

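# The response is split into text blocks; cited blocks carry a citations list.
# For PDFs each citation is a page_location with 1-indexed page numbers,
# where end_page_number is exclusive.
for block in response.content:
    if block.type == "text":
        print(block.text)
        for cit in getattr(block, "citations", None) or []:
            if cit.type == "page_location":
                print(f"  [Source: page {cit.start_page_number}-{cit.end_page_number - 1}]")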

3.7 Voice (Whisper for STT, ElevenLabs for TTS)

Anthropic has no STT/TTS of its own, so pair Claude with external services:

# voice_pipeline.py
"""
Voice → Claude → Voice pipeline
"""
import anthropic
import openai  # for Whisper
import elevenlabs  # for TTS

client = anthropic.Anthropic()

# 1. Speech-to-text (Whisper)
with open("user_audio.mp3", "rb") as audio_file:
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    ).text

# 2. Claude processes the transcript
resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    messages=[{"role": "user", "content": transcript}]
)
answer = resp.content[0].text

# 3. Text-to-speech
# Note: generate() is the helper from older elevenlabs SDK releases;
# newer releases use a client object instead, so check your installed version.
audio = elevenlabs.generate(text=answer, voice="Rachel")
with open("response.mp3", "wb") as f:
    f.write(audio)

4. Anthropic API Best Practices

4.1 Three kinds of image source

# 1. base64 inline (first choice for small images)
{"source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}}

# 2. URL (Claude fetches it)
{"source": {"type": "url", "url": "https://..."}}

# 3. Files API (large files used many times)
{"source": {"type": "file", "file_id": "file_xxx"}}

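4.2 Image preprocessing tips

Resize to a sensible size: a long edge above 2000px brings no benefit and only costs more. A long edge of 1024-1568px is the sweet spot.
Format: PNG preserves quality but is large; JPEG suits natural images, and for OCR use quality 85+
Black-and-white document OCR: converting to monochrome cuts file size by about 70% with almost no loss in accuracy
PDF: feed it directly; do not convert it to an array of images first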

4.3 Vision token cost monitoring

def estimate_vision_cost(width, height, model="claude-sonnet-4-6"):
    # Input price in USD per million tokens
    PRICES = {
        "claude-opus-4-7": 15.0,
        "claude-sonnet-4-6": 3.0,
        "claude-haiku-4-5": 0.8,
    }
    # Rough: treat the image as scaled down to at most 1568 px per side,
    # then apply pixels / 750. On Sonnet that is about 0.004 USD per 1000x1000 px.
    scaled_pixels = min(width * height, 1568 * 1568)
    tokens = scaled_pixels / 750  # rough
    return tokens * PRICES[model] / 1e6

print(estimate_vision_cost(1280, 1024, "claude-opus-4-7"))
# ~$0.026 per image with Opus

4.4 Cache control with vision

# Cache stable images (e.g. logos, templates)
content = [
    {"type": "image", "source": {...}, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "..."}
]

5. Financial Applications

Case 1: scanned invoice OCR + structuring

INVOICE_TOOL = {
    "name": "submit_invoice",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "amount": {"type": "number"}
                    }
                }
            },
            "subtotal": {"type": "number"},
            "tax": {"type": "number"},
            "total": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["vendor_name", "total", "currency"]
    }
}

def parse_invoice(image_path):
    img_data = encode_image(image_path)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[INVOICE_TOOL],
        tool_choice={"type": "tool", "name": "submit_invoice"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                              "media_type": "image/jpeg",
                                              "data": img_data}},
                {"type": "text", "text": "Extract invoice data."}
            ]
        }]
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
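Forcing the tool schema guarantees the shape of the output, not the correctness of the numbers read off a scan. A cheap second check is to verify the invoice arithmetic, as in the sketch below. `check_invoice` and the 0.01 tolerance are assumptions for illustration; the field names follow `INVOICE_TOOL`.

```python
import math


def check_invoice(invoice, tolerance=0.01):
    """Return a list of inconsistencies; an empty list means the arithmetic agrees."""
    problems = []
    items = invoice.get("line_items") or []

    # Each line: quantity * unit_price should match amount.
    for i, item in enumerate(items):
        if all(key in item for key in ("quantity", "unit_price", "amount")):
            expected = item["quantity"] * item["unit_price"]
            if not math.isclose(expected, item["amount"], abs_tol=tolerance):
                problems.append(
                    f"line {i}: expected amount {expected:.2f}, got {item['amount']:.2f}")

    # Line amounts should add up to the subtotal.
    if items and "subtotal" in invoice and all("amount" in item for item in items):
        items_sum = sum(item["amount"] for item in items)
        if not math.isclose(items_sum, invoice["subtotal"], abs_tol=tolerance):
            problems.append(
                f"subtotal: line items sum to {items_sum:.2f}, got {invoice['subtotal']:.2f}")

    # subtotal + tax should match the total.
    if all(key in invoice for key in ("subtotal", "tax", "total")):
        expected_total = invoice["subtotal"] + invoice["tax"]
        if not math.isclose(expected_total, invoice["total"], abs_tol=tolerance):
            problems.append(
                f"total: subtotal + tax = {expected_total:.2f}, got {invoice['total']:.2f}")

    return problems
```

An invoice that comes back with a non-empty list is a natural candidate for manual review or a retry on a stronger model.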

Case 2: financial-report PDF analysis pipeline

def fin_report_pipeline(pdf_path):
    # Upload once
    with open(pdf_path, "rb") as f:
        file_obj = client.beta.files.upload(
            file=(pdf_path, f, "application/pdf")
        )

    questions = [
        "Extract income statement (revenue, COGS, OpEx, Net Income).",
        "List the top 3 risks mentioned in the risk factor section.",
        "What's the management's outlook for next quarter?",
        "Are there any related party transactions disclosed?",
    ]

    results = {}
    for q in questions:
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            messages=[{"role": "user", "content": [
                {"type": "document",
                 "source": {"type": "file", "file_id": file_obj.id},
                 "citations": {"enabled": True}},
                {"type": "text", "text": q}
            ]}],
            extra_headers={"anthropic-beta": "files-api-2025-04-14"}
        )
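        # With citations enabled the answer arrives as several text blocks; join them
        results[q] = "".join(b.text for b in resp.content if b.type == "text")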
    return results

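Case 3: chart reading and verification (chart-to-data)

# Some brokerage research reports contain only a bar chart, with no data table
# Use Claude Vision to "read" the data points off the chart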
def read_bar_chart(chart_image, x_label, y_label):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {...}},
            {"type": "text", "text": f"""Read this bar chart.
For each bar, output (x={x_label}, y={y_label}) as CSV.
Only output the CSV, no explanation."""}
        ]}]
    )
    return resp.content[0].text
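The CSV text returned by `read_bar_chart` still has to be parsed before it can be compared against known figures. A minimal sketch using only the standard library follows; `parse_chart_csv` is a hypothetical helper that assumes two columns per row and silently skips anything else (a header line, stray commentary, or fence markers the model may add despite the instruction).

```python
import csv
import io


def parse_chart_csv(csv_text):
    """Parse 'label,value' rows into (label, float) pairs, skipping non-data rows."""
    points = []
    for row in csv.reader(io.StringIO(csv_text.strip())):
        if len(row) != 2:
            continue  # not a two-column data row
        label, raw_value = row[0].strip(), row[1].strip()
        try:
            # Quoted values such as "1,200" arrive with thousands separators.
            value = float(raw_value.replace(",", ""))
        except ValueError:
            continue  # header line or non-numeric cell
        points.append((label, value))
    return points


sample = 'quarter,revenue\nQ1,10.5\nQ2,12\nQ3,"1,200"'
print(parse_chart_csv(sample))
# [('Q1', 10.5), ('Q2', 12.0), ('Q3', 1200.0)]
```

Once the values are numeric, they can be spot-checked against the source data for the sampled charts.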

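6. Common Pitfalls

Excessive image resolution burns money for nothing: a 4K original is unnecessary. Claude downscales internally, but by then you have already paid for the upload bandwidth. Resize to a 1568px long edge on the client first.
PDFs with tables but no text layer: scanned (OCR) PDFs and digital PDFs differ a lot. The former must go through vision (more expensive); for the latter, text extraction is cheaper.
Multi-image order mix-ups: "image 1" gets mistaken for "image 2". Put an explicit text label such as "Image labelled A:" before each image.
Files API limits: 500MB per file, 100GB total per organization. Production systems need cleanup.
Citations require citations: enabled: if you forget to turn it on there are no citations, and they cannot be added afterwards without re-running the request.
Handwriting OCR is not perfect: accuracy is 85-95%, so critical applications need a second check.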
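For the Files API cleanup mentioned in the list above, the selection logic can be kept separate from the API calls so it is easy to test. The sketch below picks file IDs older than a cutoff; `select_stale_file_ids` and the 7-day default are assumptions, and the commented wiring assumes each listed file exposes `id` and a timezone-aware `created_at`.

```python
from datetime import datetime, timedelta, timezone


def select_stale_file_ids(files, now, max_age_days=7):
    """Return IDs of files created more than max_age_days before `now`.

    files: iterable of (file_id, created_at) pairs with timezone-aware datetimes.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [file_id for file_id, created_at in files if created_at < cutoff]


now = datetime(2026, 9, 7, tzinfo=timezone.utc)
files = [
    ("file_old", now - timedelta(days=10)),
    ("file_new", now - timedelta(days=1)),
]
print(select_stale_file_ids(files, now))  # ['file_old']

# Hypothetical wiring against the Files API (not executed here):
# listed = ((f.id, f.created_at) for f in client.beta.files.list())
# for file_id in select_stale_file_ids(listed, datetime.now(timezone.utc)):
#     client.beta.files.delete(file_id)
```

Running this on a schedule keeps storage below the account-wide limit without tracking uploads by hand.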

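7. Quick Reference

Vision input size limits

Single image: max 5MB
Images per request: up to 100 per API request according to Anthropic's documentation (the 20-image limit applies to claude.ai)
PDF: 100 pages, 32MB inline / larger via the Files API
Image formats: JPEG, PNG, GIF, WEBP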

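Vision cost estimate (Sonnet 4.6, from tokens ≈ w×h/750 at $3 per million input tokens)

Small (~256×256):     ~90 tokens    ~$0.0003
Medium (~1024×1024):  ~1400 tokens  ~$0.004
Large (~1568×1568):   ~3300 tokens  ~$0.010 (upper bound; the API may downscale large images further and bill fewer tokens)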

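When to use Vision vs. OCR preprocessing

Text-dominated documents: extract with pdfplumber/PyPDF2 first → feed the text to Claude (about 10x cheaper)
Images/charts/scans: go straight to Claude Vision
Mixed: decide page by page, hybrid approach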

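8. Interview Questions

Q1: What does Claude Vision do better than a separate OCR + LLM pipeline?

(a) Layout-aware: Claude looks at the page directly, so it does not lose "is this a table or a paragraph". (b) End-to-end: one API call instead of two (OCR, then LLM). (c) Reasoning over visuals: it can understand the trend in a chart's shape, not just read the numbers. The price: 2-5x more expensive.

Q2: Design an invoice automation system. How do you control cost?

(1) Route by file type first: digital PDFs take the text path, scans take vision. (2) Use Haiku for the first pass and retry low-confidence results on Sonnet. (3) Enforce the schema with tool use. (4) Cache vendor logos / template images. (5) Use the Files API to avoid re-uploading. (6) Run overnight batches through the Batch API at 50% off.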
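Point (2) of the answer, a cheap first pass with escalation on low confidence, can be sketched as a small cascade. The extractors are caller-supplied callables here, so the block makes no API calls; `extract_with_cascade` is a hypothetical name, and how a confidence score is obtained (arithmetic checks, a self-reported score, missing required fields) is an assumption left to the caller, since the API does not return one.

```python
def extract_with_cascade(document, cheap_extract, strong_extract, min_confidence=0.8):
    """Try the cheap extractor first; escalate only when its confidence is low.

    Both extractors are callables returning (result, confidence in [0, 1]).
    Returns (result, tier), where tier is "cheap" or "strong".
    """
    result, confidence = cheap_extract(document)
    if confidence >= min_confidence:
        return result, "cheap"
    result, _ = strong_extract(document)
    return result, "strong"


# Stub extractors standing in for Haiku and Sonnet calls.
def cheap_stub(doc):
    return {"total": 100.0}, (0.95 if doc == "clean scan" else 0.4)


def strong_stub(doc):
    return {"total": 101.5}, 0.99


print(extract_with_cascade("clean scan", cheap_stub, strong_stub))
# ({'total': 100.0}, 'cheap')
print(extract_with_cascade("blurry scan", cheap_stub, strong_stub))
# ({'total': 101.5}, 'strong')
```

The saving depends on how often the cheap tier clears the threshold, so the threshold is worth calibrating on a labelled sample.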

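Q3: If you ask Claude to look at a complex earnings chart and say "how much did Q3 revenue rise", how accurate is it?

In our tests, Claude 4.7 reads values off clear charts with ~95% accuracy; on handwritten or unclear ones it is ~80%. Critical applications must be verified by manual sampling.

Q4: Will multimodal models replace specialized vision models (such as YOLO for detection)?

Not in the short term. Specialized models are faster, cheaper, and fine-tunable on their specific task. Multimodal LLMs win on zero-shot generality. The future is hybrid: YOLO for real-time detection, Claude for semantic understanding.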

9. Tomorrow's Preview

Day 130: Long-context engineering: 1M context, lost-in-the-middle, hands-on prompt caching measurements.