Multi-modal Prompting: Vision, Voice, Documents, and a Claude Vision Deep Dive
Multimodal LLM architecture (CLIP / encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS)
Date: 2026-09-07 Track: AI Systems Engineering Phase: Phase 3 - LLM Fundamentals and Prompt Engineering (Day 121-134) Tags: #Vision #Multimodal #ClaudeVision #Document #PDF #OCR
Today's goals
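| Type | Content |
|---|---|
| Study | Multimodal LLM architecture (CLIP / encoder-decoder fusion), Claude Vision, PDF processing, Voice (Whisper/STT/TTS) |
| Practice | Use Claude Vision on a financial-statement PDF, a bar chart image, and a scanned invoice |
| Deliverable | mm_demo.py + a multimodal cost/accuracy comparison table |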
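1. Theoretical foundations
1.1 Multimodal LLM architectures
Mainstream patterns:
- Vision Encoder + LLM: an encoder such as CLIP ViT extracts image embeddings, which are projected into the LLM's token space. Examples: LLaVA and, by the inference below, Claude 3+.
- Native Multimodal: text and images are trained jointly from the start (Gemini 1.5, GPT-4o).
- Adapter approach: freeze the LLM and train only the adapter layers (cheap, but quality is lower).
Claude Vision (3+) most likely uses a fused Vision Encoder + LLM architecture (an inference from its public capabilities, not a documented fact).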
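1.2 Image tokenization
An image has to become tokens before the LLM can consume it:
- Patching: cut a 1024×1024 image into 32×32-pixel patches, which gives a 32×32 grid of 1024 patches
- Each patch goes through the vision encoder to produce an embedding
- The embeddings are projected into the LLM space, roughly one token per patch
Claude Vision token cost estimate:
$$ \text{tokens} \approx \frac{w \times h}{750} $$
(From the Anthropic docs. The count is actually computed on the scaled version of the image, but the 750 ratio is a good rule of thumb.)
Examples:
- 1280×1024 → ~1750 tokens
- 4K image (3840×2160) → ~11000 tokens (the real cost is probably lower, because Anthropic downscales first)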
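As a quick check on the two examples above, the w×h/750 rule can be combined with the downscale step. This is a minimal sketch, not Anthropic's actual accounting: the function name `estimate_image_tokens` is hypothetical, and the 1568px long-edge cap is an assumption borrowed from the resize guidance later in these notes.

```python
from typing import Optional


def estimate_image_tokens(width: int, height: int,
                          max_long_edge: Optional[int] = 1568) -> int:
    """Approximate image tokens with the w*h/750 rule of thumb.

    max_long_edge is an assumed cap on the longer side before counting;
    pass None to skip the downscale step.
    """
    long_edge = max(width, height)
    if max_long_edge is not None and long_edge > max_long_edge:
        # Shrink both sides by the same factor so the aspect ratio is kept.
        width = round(width * max_long_edge / long_edge)
        height = round(height * max_long_edge / long_edge)
    return round(width * height / 750)


print(estimate_image_tokens(1280, 1024))                      # no resize needed
print(estimate_image_tokens(3840, 2160, max_long_edge=None))  # raw 4K
print(estimate_image_tokens(3840, 2160))                      # 4K after downscale
```

Under these assumptions the three calls print 1748, 11059, and 1844, which shows how much of the raw 4K figure disappears once the image is scaled to 1568×882.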
1.3 Multimodal capabilities of mainstream models
| Model | Image | Video | Audio | Native PDF | Document layout |
|---|---|---|---|---|---|
| Claude 4.7 | ✅ | ❌ directly (extract frames) | ❌ | ✅ | ✅ strong |
| GPT-5 | ✅ | ✅ (Sora integration) | ✅ (TTS+STT) | ✅ | ✅ |
| Gemini 2.5 Pro | ✅ | ✅ 1h+ video | ✅ | ✅ | ✅ |
| Llama 3.2 Vision | ✅ | ❌ | ❌ | ❌ | weak |
1.4 Claude Vision strengths
Capabilities Anthropic officially emphasizes:
- Chart/graph understanding (financial, scientific)
- Handwriting recognition
- Multi-image comparison
- OCR-quality text extraction
- Diagram analysis (UML/flowchart)
- Long documents (PDFs up to 100 pages)
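2. Intuition
Why are vision tokens so expensive?
One image costs 1000+ tokens, about as much as 750 words of English. But the information density is high: one image of a table can stand in for hundreds of lines of text description. The ROI depends on the task. Document OCR that replaces manual data entry is worth it; attaching images to casual chat is expensive for what it buys.
Why can an LLM "understand" an image?
It does not really "see". The image encoder compresses the image into vectors, and during training those vectors are aligned with image titles and captions. So the model is essentially translating: it turns image vectors into the token sequence of a description of the image.
Why is document processing hard?
A PDF can contain:
- Text streams (vector)
- Scanned images (raster)
- Complex layouts (multi-column, tables, footnotes)
- Embedded charts
Plain text extraction loses the layout. A vision LLM looks at the page image directly and keeps all of that information, but at a higher cost. Production systems mix the two: text-extractable pages go through as text, scanned pages go through vision.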
3. Code implementation
3.1 Basic Claude Vision call
# vision_basics.py
"""
Analyze a financial chart with Claude Vision.
"""
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()


def encode_image(image_path):
    """Read an image file and return it as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")


def analyze_chart(image_path, prompt="Describe this financial chart in detail."):
    image_data = encode_image(image_path)
    # Derive the media type from the file extension; ".jpg" must map to "image/jpeg"
    media_type = f"image/{Path(image_path).suffix[1:].lower()}"
    if media_type == "image/jpg":
        media_type = "image/jpeg"
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    print(f"Tokens: {response.usage.input_tokens} in, {response.usage.output_tokens} out")
    return response.content[0].text


# Usage
# print(analyze_chart("aapl_revenue_chart.png"))
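The extension-to-media-type step in `analyze_chart` accepts any suffix, so a `.bmp` or an oversized file would only fail at the API. A minimal sketch of a stricter builder follows; `build_image_block` is a hypothetical helper, and the format list and 5MB cap are the values quoted in the quick-reference section of these notes, not limits verified here.

```python
import base64
from pathlib import Path

# Formats and size cap as listed in the quick-reference section of these notes.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 5 * 1024 * 1024


def build_image_block(filename: str, raw: bytes) -> dict:
    """Return a base64 image content block, or raise ValueError for unusable input."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported image type: {suffix or '(none)'}")
    if len(raw) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(raw)} bytes, over the {MAX_IMAGE_BYTES} byte cap")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": SUPPORTED_MEDIA_TYPES[suffix],
            "data": base64.standard_b64encode(raw).decode("utf-8"),
        },
    }


# Three placeholder bytes stand in for real image data.
block = build_image_block("chart.JPG", b"\xff\xd8\xff")
print(block["source"]["media_type"])
```

The returned dict has the same shape as the image block in `analyze_chart`, so it can be dropped into the `content` list unchanged.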
3.2 URL source (smaller request payload)
# Anthropic can also fetch the image from a URL directly.
# This saves upload bandwidth; the image still costs the same tokens.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Extract all numeric values."}
        ]
    }]
)
3.3 PDF processing (Claude supports PDFs natively)
# pdf_analysis.py
"""
Feed the PDF directly (Claude runs page-by-page vision processing internally).
"""
import anthropic
import base64

client = anthropic.Anthropic()

with open("apple_10k.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {"type": "text", "text": "Extract Q3 income statement key metrics."}
        ]
    }]
)
print(response.content[0].text)
print(f"Cost: {response.usage.input_tokens} input tokens")
3.4 Files API: upload once, reference many times
PDFs can be large (a 100-page scan can reach 50MB, which is already above the 32MB inline cap), and re-sending base64 inline on every request wastes bandwidth. With the Files API you upload once:
# files_api_demo.py
"""
Anthropic Files API: upload a PDF once, reference it many times.
"""
import anthropic

client = anthropic.Anthropic()

# 1. Upload (the with-block closes the file handle afterwards)
with open("apple_10k.pdf", "rb") as f:
    file_obj = client.beta.files.upload(
        file=("apple_10k.pdf", f, "application/pdf")
    )
file_id = file_obj.id
print(f"Uploaded: {file_id}")

# 2. Ask several questions without re-uploading
for question in [
    "What was Q3 revenue?",
    "What are the main risk factors?",
    "Summarize the management discussion."
]:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document", "source": {"type": "file", "file_id": file_id}},
                {"type": "text", "text": question}
            ]
        }],
        extra_headers={"anthropic-beta": "files-api-2025-04-14"}
    )
    print(f"Q: {question}\nA: {response.content[0].text[:200]}\n")

# 3. Files can be listed and deleted
files = client.beta.files.list()
# client.beta.files.delete(file_id)
3.5 Multi-image comparison
# multi_image_compare.py
"""
Compare two earnings charts and find the differences.
"""
# Reuses client and encode_image from vision_basics.py (section 3.1)
def compare_charts(img1_path, img2_path):
    img1 = encode_image(img1_path)
    img2 = encode_image(img2_path)
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                # A text label before each image keeps "Image 1" and "Image 2" unambiguous
                {"type": "text", "text": "Image 1 (Q2):"},
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png", "data": img1}},
                {"type": "text", "text": "Image 2 (Q3):"},
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png", "data": img2}},
                {"type": "text", "text": "Compare these two quarterly revenue charts. Highlight the key differences."}
            ]
        }]
    )
    return response.content[0].text
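3.6 Citations: have Claude mark which page an answer came from
# citations_demo.py
# Reuses client and pdf_data from pdf_analysis.py (section 3.3)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf",
                           "data": pdf_data},
                "citations": {"enabled": True}  # <-- turn citations on
            },
            {"type": "text", "text": "What were operating expenses for FY26?"}
        ]
    }]
)
# The text blocks in the response carry the citation entries
for block in response.content:
    if block.type == "text":
        print(block.text)
        # citations is None on text blocks that cite nothing
        for cit in getattr(block, "citations", None) or []:
            # PDF citations are page locations with start_page_number and
            # end_page_number; the end is exclusive per the citations docs
            print(f"  [Source: page {cit.start_page_number}-{cit.end_page_number - 1}]")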
3.7 Voice (Whisper for STT, ElevenLabs for TTS)
Anthropic has no STT/TTS of its own, so pair Claude with external services:
# voice_pipeline.py
"""
Voice → Claude → Voice pipeline
"""
import anthropic
import openai  # for Whisper
import elevenlabs  # for TTS

client = anthropic.Anthropic()

# 1. Speech-to-text (Whisper)
with open("user_audio.mp3", "rb") as audio_file:
    transcript = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    ).text

# 2. Claude processes the transcript
resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    messages=[{"role": "user", "content": transcript}]
)
answer = resp.content[0].text

# 3. Text-to-speech
# generate() is the module-level helper from older elevenlabs SDK releases;
# newer releases expose a client object instead, so check the installed version.
audio = elevenlabs.generate(text=answer, voice="Rachel")
with open("response.mp3", "wb") as f:
    f.write(audio)
4. Anthropic API best practices
4.1 Three image source types
# 1. base64 inline (first choice for small images)
{"source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}}
# 2. URL (fetched on Anthropic's side)
{"source": {"type": "url", "url": "https://..."}}
# 3. Files API (large files reused across requests)
{"source": {"type": "file", "file_id": "file_xxx"}}
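4.2 Image preprocessing tips
- Resize to a sensible size: a long edge above 2000px adds cost with no benefit. A long edge of 1024-1568px is the sweet spot.
- Format: PNG has better quality but larger files; JPEG suits natural images, and for OCR use quality 85+
- Black-and-white document OCR: converting to monochrome cuts file size by about 70% with almost no accuracy loss
- PDF: feed it directly; do not convert it to an array of images first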
4.3 Vision token cost monitoring
def estimate_vision_cost(width, height, model="claude-sonnet-4-6"):
    # Input price in USD per million tokens
    PRICES = {
        "claude-opus-4-7": 15.0,
        "claude-sonnet-4-6": 3.0,
        "claude-haiku-4-5": 0.8,
    }
    # Rough: the image is scaled to at most a 1568px long edge, then costed at
    # w*h/750 tokens, which is about 0.004 USD per 1000x1000 px on Sonnet
    scaled_pixels = min(width * height, 1568 * 1568)
    tokens = scaled_pixels / 750  # rough
    return tokens * PRICES[model] / 1e6


print(estimate_vision_cost(1280, 1024, "claude-opus-4-7"))
# ~$0.026 per image with Opus
4.4 Cache control with vision
# Cache stable images (e.g. a logo or a template)
content = [
    {"type": "image", "source": {...}, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "..."}
]
5. Financial applications
Case 1: scanned invoice OCR + structured extraction
# Tool schema that forces the model to return invoice fields as structured JSON
INVOICE_TOOL = {
    "name": "submit_invoice",
    "description": "Submit the structured fields extracted from an invoice image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "amount": {"type": "number"}
                    }
                }
            },
            "subtotal": {"type": "number"},
            "tax": {"type": "number"},
            "total": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["vendor_name", "total", "currency"]
    }
}


def parse_invoice(image_path):
    # Reuses client and encode_image from section 3.1; assumes a JPEG scan
    img_data = encode_image(image_path)
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[INVOICE_TOOL],
        # Force a call to submit_invoice so the output always follows the schema
        tool_choice={"type": "tool", "name": "submit_invoice"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/jpeg",
                                             "data": img_data}},
                {"type": "text", "text": "Extract invoice data."}
            ]
        }]
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
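A schema guarantees the shape of the output, not that the numbers were read correctly. A cheap arithmetic cross-check on the returned dict catches many misreads before anyone looks at the invoice. This is a minimal sketch: `check_invoice_totals` and the 0.01 tolerance are assumptions, and the sample invoice is made up for illustration.

```python
def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list:
    """Return human-readable problems; an empty list means the sums agree."""
    problems = []
    items = invoice.get("line_items") or []
    for i, item in enumerate(items):
        qty, price, amount = item.get("quantity"), item.get("unit_price"), item.get("amount")
        # Only check lines where all three numbers were extracted.
        if None not in (qty, price, amount) and abs(qty * price - amount) > tolerance:
            problems.append(f"line {i}: {qty} x {price} != {amount}")
    subtotal, tax, total = invoice.get("subtotal"), invoice.get("tax"), invoice.get("total")
    amounts = [item["amount"] for item in items if item.get("amount") is not None]
    # Compare against the subtotal only when every line has an amount.
    if subtotal is not None and items and len(amounts) == len(items):
        if abs(sum(amounts) - subtotal) > tolerance:
            problems.append(f"line items sum to {sum(amounts)}, subtotal says {subtotal}")
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal + tax = {subtotal + tax}, total says {total}")
    return problems


sample = {
    "vendor_name": "Acme", "currency": "USD",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 10.0, "amount": 20.0},
        {"description": "Gadget", "quantity": 1, "unit_price": 5.5, "amount": 5.5},
    ],
    "subtotal": 25.5, "tax": 2.55, "total": 28.05,
}
print(check_invoice_totals(sample))
print(check_invoice_totals({**sample, "total": 30.0}))
```

The first call returns an empty list; the second reports the mismatch between subtotal plus tax and the stated total. Invoices that fail the check are natural candidates for a retry or manual review.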
Case 2: earnings-report PDF analysis pipeline
def fin_report_pipeline(pdf_path):
    # Upload once
    with open(pdf_path, "rb") as f:
        file_obj = client.beta.files.upload(
            file=(pdf_path, f, "application/pdf")
        )
    questions = [
        "Extract income statement (revenue, COGS, OpEx, Net Income).",
        "List the top 3 risks mentioned in the risk factor section.",
        "What's the management's outlook for next quarter?",
        "Are there any related party transactions disclosed?",
    ]
    results = {}
    for q in questions:
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            messages=[{"role": "user", "content": [
                {"type": "document",
                 "source": {"type": "file", "file_id": file_obj.id},
                 "citations": {"enabled": True}},
                {"type": "text", "text": q}
            ]}],
            extra_headers={"anthropic-beta": "files-api-2025-04-14"}
        )
        # With citations enabled the answer arrives as several text blocks; join them all
        results[q] = "".join(b.text for b in resp.content if b.type == "text")
    return results
Case 3: chart reading verification (chart-to-data)
# Some brokerage research reports contain only a bar chart and no data table.
# Use Claude Vision to "read" the data points off the chart.
def read_bar_chart(chart_image, x_label, y_label):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": [
            # The source dict is elided in these notes; build it from chart_image as in 3.1
            {"type": "image", "source": {...}},
            {"type": "text", "text": f"""Read this bar chart.
For each bar, output (x={x_label}, y={y_label}) as CSV.
Only output the CSV, no explanation."""}
        ]}]
    )
    return resp.content[0].text
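`read_bar_chart` returns raw text, and even with "only output the CSV" in the prompt the reply can include a header row or a stray sentence. A tolerant parser makes the result usable for verification. This is a sketch: `parse_chart_csv` is a hypothetical helper, and the sample reply is invented to show the skipped lines.

```python
import csv
import io


def parse_chart_csv(raw: str) -> list:
    """Parse 'label,value' lines, skipping headers and anything non-numeric."""
    rows = []
    for record in csv.reader(io.StringIO(raw.strip())):
        if len(record) != 2:
            continue  # blank line or free text
        label, value = record[0].strip(), record[1].strip()
        try:
            # Drop thousands separators inside quoted values such as "1,234"
            rows.append((label, float(value.replace(",", ""))))
        except ValueError:
            continue  # header row or other non-numeric second column
    return rows


reply = 'Here is the data:\nquarter,revenue\nQ1,81.8\nQ2,89.5\nQ3,"1,094.9"\n'
print(parse_chart_csv(reply))
```

With the rows in numeric form, spot checks against a known data table reduce to comparing floats within a tolerance.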
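6. Common pitfalls
- Burning money on excessive resolution: a 4K original is unnecessary. Claude downscales internally, but you have already paid for the upload bandwidth. Resize on the client to a 1568px long edge first.
- PDFs with tables but no text layer: scanned (OCR) PDFs and digital PDFs differ a lot. The former must go through vision (more expensive); for the latter, text extraction is cheaper.
- Multi-image ordering mix-ups: the model labels "image 1" as "image 2". Use an explicit text label such as "Image labelled A:" and place the image right after it.
- Files API limits: 500MB per file, 100GB total per account. Production needs a cleanup job.
- Citations only with `citations: enabled`: if you forget to turn it on there are no citations, and they cannot be added after the fact.
- Handwriting OCR is not perfect: accuracy is 85-95%, so critical applications need a second check.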
7. Quick reference
Vision input size limits
Per image: max 5MB
Images per request: 20 on claude.ai; the API has allowed more (up to 100), so check the current docs
PDF: 100 pages, 32MB inline / larger via the Files API
Image formats: JPEG, PNG, GIF, WEBP
Vision cost estimates (Sonnet 4.6, by the w×h/750 rule)
Small (~256×256): ~90 tokens ~$0.0003
Medium (~1024×1024): ~1400 tokens ~$0.004
Large (~1568×1568): ~3300 tokens ~$0.010 (an upper bound; server-side downscaling may lower it)
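When to use Vision vs OCR preprocessing
- Text-dominant documents: extract with `pdfplumber`/`PyPDF2` first, then feed the text to Claude (about 10x cheaper)
- Images/charts/scans: send straight to Claude Vision
- Mixed: decide page by page, hybrid approach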
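The page-by-page decision in the last bullet can be as simple as checking how much text the extractor recovered from each page. This is a sketch under assumptions: `route_pages` is a hypothetical helper, the 200-character threshold is arbitrary, and the input list stands in for per-page output of a text extractor such as pdfplumber's `page.extract_text()`, which can come back empty for scanned pages.

```python
from typing import Optional


def route_pages(page_texts: list, min_chars: int = 200) -> dict:
    """Split 1-based page numbers into a text path and a vision path."""
    routes = {"text": [], "vision": []}
    for page_number, text in enumerate(page_texts, start=1):
        extracted: Optional[str] = text
        usable = len((extracted or "").strip())
        # Pages with too little extractable text are treated as scans.
        routes["text" if usable >= min_chars else "vision"].append(page_number)
    return routes


pages = ["Revenue grew 12% year over year. " * 20, None, "   ", "Notes " * 50]
print(route_pages(pages))
```

Here pages 1 and 4 go to the text path and pages 2 and 3 to vision. The threshold deserves tuning on real documents, since a chart-heavy page can carry a short caption and still need vision.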
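8. Interview questions
Q1: What does Claude Vision do better than a separate OCR + LLM pipeline?
(a) Layout aware: Claude looks at the page directly, so it does not lose the distinction between a table and a paragraph. (b) End-to-end: one API call instead of two (OCR, then LLM). (c) Reasoning over visuals: it can read the trend in a chart's shape, not just the digits. The price: 2-5x more expensive.
Q2: Design an invoice automation system. How do you control cost?
(1) Route by file type first: digital PDFs take the text path, scans take vision. (2) Use Haiku for the first pass and retry low-confidence cases on Sonnet. (3) Force the schema through tool use. (4) Cache vendor logo / template images. (5) Use the Files API to avoid re-uploading. (6) Run overnight bulk jobs through the Batch API at 50% off.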
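Step (2) of the answer above is a two-tier cascade. A minimal sketch follows, with the model calls injected as plain callables so the routing logic can be tested without an API key. Everything here is hypothetical: `extract_with_fallback`, the stub extractors, and the use of "all required fields present" as a stand-in for confidence.

```python
from typing import Any, Callable, Tuple

# Same fields as the "required" list in the invoice tool schema.
REQUIRED_FIELDS = ("vendor_name", "total", "currency")


def has_required_fields(result: dict) -> bool:
    """Crude confidence proxy: every required field is present and non-empty."""
    return all(result.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def extract_with_fallback(document: Any,
                          cheap_model: Callable[[Any], dict],
                          strong_model: Callable[[Any], dict],
                          is_confident: Callable[[dict], bool] = has_required_fields,
                          ) -> Tuple[dict, str]:
    """Try the cheap model first; call the strong model only when needed."""
    first = cheap_model(document)
    if is_confident(first):
        return first, "cheap"
    return strong_model(document), "strong"


# Stub extractors standing in for real Haiku / Sonnet calls.
def fake_haiku(document):
    return {"vendor_name": "Acme", "total": None, "currency": "USD"}


def fake_sonnet(document):
    return {"vendor_name": "Acme", "total": 120.5, "currency": "USD"}


result, tier = extract_with_fallback("invoice.jpg", fake_haiku, fake_sonnet)
print(tier, result["total"])
```

The stub Haiku result is missing `total`, so the call falls through to the strong tier. In a real system the fraction of documents reaching the strong tier is the number to watch, since it determines the blended cost.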
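Q3: If Claude looks at a complex earnings chart and has to say how much Q3 revenue rose, how accurate is it?
In practice, Claude 4.7 reads values off clean charts at ~95% accuracy; on handwritten or unclear ones it is ~80%. Critical applications must add manual spot checks.
Q4: Will multimodal models replace specialized vision models (such as YOLO for detection)?
Not in the short term. Specialized models are faster, cheaper, and fine-tunable on their specific task. Multimodal LLMs win on zero-shot generality. The future is hybrid: YOLO for real-time detection, Claude for semantic understanding.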
9. Coming up tomorrow
Day 130: Long-context engineering: 1M context, lost-in-the-middle, prompt caching benchmarks.