Skills / api-ai-claude-vision
Image understanding and document analysis with Claude's multimodal capabilities -- image input formats, PDF processing, multi-image patterns, structured extraction, and token cost estimation
```bash
git clone https://github.com/agents-inc/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/agents-inc/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/dist/plugins/api-ai-claude-vision/skills/api-ai-claude-vision" ~/.claude/skills/agents-inc-skills-api-ai-claude-vision && rm -rf "$T"
```
dist/plugins/api-ai-claude-vision/skills/api-ai-claude-vision/SKILL.md

Claude Vision Patterns
Quick Guide: Use `type: "image"` content blocks for images (base64, URL, or file_id) and `type: "document"` content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Token cost formula: `tokens = (width * height) / 750`. Images are auto-resized if the long edge exceeds 1568px or the image exceeds ~1.15 megapixels (~1,600 tokens). PDFs use `type: "document"` with `media_type: "application/pdf"`. No OCR library needed -- Claude reads text directly from images and PDFs.
<critical_requirements>
CRITICAL: Before Using This Skill
- All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
- You MUST use `type: "image"` for images and `type: "document"` for PDFs -- they are different content block types
- You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first
- You MUST always provide `max_tokens` in every request -- it is required and has no default
- You MUST iterate over `response.content` blocks -- never assume a single text block in the response
- You MUST use named constants for `max_tokens`, token budgets, and pixel limits -- no magic numbers
</critical_requirements>
Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API
When to use:
- Sending images to Claude for analysis, description, or data extraction
- Processing PDF documents for text extraction, chart analysis, or summarization
- Comparing multiple images in a single request
- Extracting structured data from screenshots, receipts, charts, or forms
- Building document processing pipelines with Claude
- Estimating token costs for image-heavy workloads
Key patterns covered:
- Image input via base64, URL, and Files API
- PDF document input and processing
- Multi-image requests and comparison patterns
- Image + text prompting best practices
- Token cost estimation and image sizing
- Structured data extraction from visual content
- Multi-turn vision conversations
- Prompt caching with images and PDFs
When NOT to use:
- General Claude API usage without images or documents -- use the Anthropic SDK skill instead
- Image generation or editing -- Claude is understanding-only, it cannot create or modify images
- Identifying specific people in images -- Claude refuses to name people (Anthropic policy)
- Medical diagnostic imaging (CTs, MRIs) -- not designed for clinical diagnosis
Examples Index
- Core: Image & PDF Input -- Base64, URL, file_id, PDF input, multi-image, token estimation
- Extraction & Prompting -- Structured extraction, comparison, prompting best practices, caching
- Quick API Reference -- Content block types, supported formats, size limits, token formula
<philosophy>
Philosophy
Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.
Core principles:
- Images are content blocks, not attachments -- Images and PDFs are content blocks in the `messages` array, interleaved with text. They are not uploaded separately or referenced by URL-only.
- Image-first ordering -- Place images before text in the content array. This mirrors how "documents first, query last" improves text prompts. Claude processes visual content better when it sees the image before the question.
- No OCR needed -- Claude reads text directly from images and PDFs. You do not need to pre-extract text with an OCR library. For PDFs, Claude processes both the extracted text and a rendered image of each page.
- Token costs scale with pixels -- Image tokens are proportional to resolution: `tokens = (width * height) / 750`. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
- PDFs are dual-processed -- Each PDF page is converted to an image AND has its text extracted. Claude sees both, giving it access to visual layout and textual content.
When to use vision:
- Analyzing screenshots, photos, charts, diagrams, or infographics
- Extracting data from forms, receipts, invoices, or tables
- Processing PDF documents for summarization, extraction, or analysis
- Comparing multiple images (before/after, A/B testing, design review)
- Understanding visual context that text alone cannot capture
When NOT to use:
- Pure text tasks with no visual component -- vision adds unnecessary token cost
- Tasks requiring pixel-perfect spatial precision -- Claude's spatial reasoning is approximate
- Identifying specific people -- Claude refuses to name individuals (Anthropic policy)
- Replacing professional medical imaging analysis (CTs, MRIs, X-rays)
</philosophy>

<patterns>
Core Patterns
Pattern 1: Base64 Image Input
Read a local file, encode it to base64, and send it as a `type: "image"` content block. Image block before text block.
```typescript
// Image block first, text prompt second, iterate response content blocks
content: [
  {
    type: "image",
    source: { type: "base64", media_type: "image/png", data: imageData },
  },
  { type: "text", text: "Describe what you see in this image." },
];
```
Why good: Image before text improves results, explicit media_type, structured content blocks
```typescript
// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;
```
Why bad: Passing base64 as text string instead of image content block, Claude cannot interpret raw base64 text as an image
See: examples/core.md for full runnable examples with base64, URL, and Files API
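A minimal end-to-end sketch of this pattern, shown here to tie the content blocks to a full request and to the rule about iterating `response.content`. The file name `chart.png` and the `MAX_TOKENS` value are illustrative assumptions; the model id comes from the extraction example later in this skill.

```typescript
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const MAX_TOKENS = 1024;
const client = new Anthropic();

// Read a local PNG and encode it as base64
const imageData = readFileSync("chart.png").toString("base64");

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: [
    {
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
        { type: "text", text: "Describe what you see in this image." },
      ],
    },
  ],
});

// Iterate all content blocks -- never assume a single text block
for (const block of response.content) {
  if (block.type === "text") {
    console.log(block.text);
  }
}
```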
Pattern 2: URL vs Base64 vs Files API
Three source types for images. Choose based on where your image lives.
```typescript
// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }

// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }

// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }
```
When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use
See: examples/core.md for full examples of each source type
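As a sketch of that decision in code, a small helper can pick the source shape from whatever you have on hand. The `ImageSource` type and `imageSourceFor` helper are illustrative, not part of the SDK.

```typescript
import { readFileSync } from "node:fs";

type ImageSource =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: string; data: string }
  | { type: "file"; file_id: string };

// Prefer a hosted URL, then an uploaded file_id, then base64-encode a local file
function imageSourceFor(opts: { url?: string; fileId?: string; path?: string; mediaType?: string }): ImageSource {
  if (opts.url) return { type: "url", url: opts.url };
  if (opts.fileId) return { type: "file", file_id: opts.fileId };
  if (opts.path) {
    return {
      type: "base64",
      media_type: opts.mediaType ?? "image/jpeg",
      data: readFileSync(opts.path).toString("base64"),
    };
  }
  throw new Error("No image source provided");
}
```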
Pattern 3: PDF Document Input
PDFs use `type: "document"` -- different from `type: "image"`. This is the most common mistake.
```typescript
// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
```
Why good: `type: "document"` enables dual processing (text extraction + page rendering)
Why bad: Using `type: "image"` for PDFs causes API errors. PDFs require `type: "document"`.
See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
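A minimal sketch of a complete PDF request, reusing the `client` and `MAX_TOKENS` conventions from the earlier sketch. The file name and question are illustrative.

```typescript
import { readFileSync } from "node:fs";

const pdfData = readFileSync("report.pdf").toString("base64");

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: [
    {
      role: "user",
      content: [
        // Document block first, question second
        { type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } },
        { type: "text", text: "Summarize the key findings in this report." },
      ],
    },
  ],
});
```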
Pattern 4: Multiple Images with Labels
Label images with text blocks so Claude can reference them clearly.
```typescript
content: [
  { type: "text", text: "Image 1:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image1 },
  },
  { type: "text", text: "Image 2:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image2 },
  },
  {
    type: "text",
    text: "Compare these two images and describe the differences.",
  },
];
```
Why good: Labels let Claude reference specific images unambiguously
Why bad (without labels): Claude may confuse which image is which when no labels are provided
See: examples/core.md for full multi-image example
Pattern 5: Token Cost Estimation
Token formula: `tokens = (width * height) / 750`. Auto-resize triggers at 1568px long edge or ~1.15 megapixels.
```typescript
const TOKENS_PER_PIXEL_DIVISOR = 750;
const MAX_LONG_EDGE_PX = 1568;
const MAX_MEGAPIXELS = 1.15;

function estimateImageTokens(width: number, height: number): number {
  let w = width,
    h = height;
  const longEdge = Math.max(w, h);
  const mp = (w * h) / 1_000_000;
  if (longEdge > MAX_LONG_EDGE_PX || mp > MAX_MEGAPIXELS) {
    const scale = Math.min(
      MAX_LONG_EDGE_PX / longEdge,
      Math.sqrt(MAX_MEGAPIXELS / mp),
    );
    w = Math.round(width * scale);
    h = Math.round(height * scale);
  }
  return Math.ceil((w * h) / TOKENS_PER_PIXEL_DIVISOR);
}

// 200x200: ~54 tokens | 1000x1000: ~1334 | 4000x3000: ~1590 (auto-resized)
```
Why good: Named constants, accounts for auto-resize, documents the formula
See: reference.md for the complete size/token/cost table, examples/core.md for `countTokens()` usage
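For an exact number rather than an estimate, the SDK's token counting endpoint accepts the same messages payload. A short sketch, reusing the `client` and `imageData` from the Pattern 1 sketch:

```typescript
// Count tokens for an image + text request before sending it
const count = await client.messages.countTokens({
  model: "claude-sonnet-4-6",
  messages: [
    {
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
        { type: "text", text: "Describe what you see in this image." },
      ],
    },
  ],
});

console.log(count.input_tokens);
```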
Pattern 6: Structured Data Extraction
Combine vision with `messages.parse()` and Zod schemas for typed extraction.
```typescript
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { z } from "zod";

const ReceiptData = z.object({
  merchant: z.string(),
  date: z.string(),
  items: z.array(
    z.object({ name: z.string(), quantity: z.number(), price: z.number() }),
  ),
  total: z.number(),
  currency: z.string(),
});

const response = await client.messages.parse({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: receiptImage,
          },
        },
        {
          type: "text",
          text: "Extract all receipt information from this image.",
        },
      ],
    },
  ],
  output_config: { format: zodOutputFormat(ReceiptData) },
});

const receipt = response.parsed_output; // fully typed
```
Why good: Zod schema for type-safe extraction, `messages.parse()` for auto-validation, image before text
See: examples/extraction.md for chart extraction, form extraction, multi-document extraction, PDF caching
</patterns>

<performance>
Performance Optimization
Image Sizing Strategy
Image resolution vs token cost:
- 200x200 -> ~54 tokens ($0.00016/image at Sonnet 4.6 pricing)
- 1000x1000 -> ~1334 tokens ($0.004/image)
- 1092x1092 -> ~1590 tokens ($0.0048/image) -- max 1:1 without auto-resize
- 4000x3000 -> ~1590 tokens (auto-resized to fit 1568px long edge)
- Pre-resize images to no more than 1568px on the long edge and 1.15 megapixels to avoid auto-resize latency (see the resize sketch after this list)
- Small images under 200px on any edge may degrade output quality
- Images over 8000x8000px are rejected outright
- 20+ images in one request limits each image to 2000x2000px max
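A sketch of pre-resizing before encoding, assuming the `sharp` package is installed (it is not part of this skill). This only enforces the long-edge limit; the API's own auto-resize still covers the megapixel cap.

```typescript
import sharp from "sharp";

const MAX_LONG_EDGE_PX = 1568;

// Resize so the long edge is at most 1568px, never enlarging small images,
// then re-encode as JPEG and base64-encode for the image content block
async function prepareImage(path: string): Promise<string> {
  const buffer = await sharp(path)
    .resize(MAX_LONG_EDGE_PX, MAX_LONG_EDGE_PX, { fit: "inside", withoutEnlargement: true })
    .jpeg()
    .toBuffer();
  return buffer.toString("base64");
}
```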
Cost Reduction Techniques
- Resize before sending -- A 4000x3000 image is auto-resized to the same tokens as 1092x1092, but adds latency. Pre-resize to save time.
- Use URL sources when images are already hosted -- avoids encoding overhead and reduces request payload size
- Use the Files API for images used across multiple requests -- upload once, reference by `file_id`
- Cache PDFs with `cache_control: { type: "ephemeral" }` when asking multiple questions about the same document (see the sketch after this list)
- Use token counting (`client.messages.countTokens()`) before expensive requests to estimate costs
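A sketch of the PDF caching pattern, reusing the `client` and `MAX_TOKENS` from earlier sketches. The file name and questions are illustrative; the cache breakpoint goes on the document block.

```typescript
import { readFileSync } from "node:fs";

const pdfData = readFileSync("annual-report.pdf").toString("base64");

const askAboutPdf = (question: string) =>
  client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: MAX_TOKENS,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "document",
            source: { type: "base64", media_type: "application/pdf", data: pdfData },
            // Cache breakpoint: follow-up questions reuse the processed PDF
            cache_control: { type: "ephemeral" },
          },
          { type: "text", text: question },
        ],
      },
    ],
  });

const revenue = await askAboutPdf("What is the total revenue?");
const risks = await askAboutPdf("Summarize the risk factors.");
```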
PDF Token Costs
- Text extraction: ~1,500-3,000 tokens per page depending on density
- Image rendering: Each page also incurs image token costs (same formula)
- Total per page: text tokens + image tokens (dual processing)
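As a rough back-of-the-envelope sketch combining both costs (the per-page figures are the ranges above, not exact values; the image figure assumes a full page rendered near the auto-resize cap):

```typescript
const TEXT_TOKENS_PER_PAGE_LOW = 1_500;
const TEXT_TOKENS_PER_PAGE_HIGH = 3_000;
const IMAGE_TOKENS_PER_PAGE = 1_590;

// e.g. a 20-page PDF: roughly 62,000-92,000 input tokens total
function estimatePdfTokens(pages: number): { low: number; high: number } {
  return {
    low: pages * (TEXT_TOKENS_PER_PAGE_LOW + IMAGE_TOKENS_PER_PAGE),
    high: pages * (TEXT_TOKENS_PER_PAGE_HIGH + IMAGE_TOKENS_PER_PAGE),
  };
}
```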
</performance>

<decision_framework>
Decision Framework
Image Source Type
```
Where is your image?
+-- Local file        -> Base64 encode with readFileSync().toString("base64")
+-- Public URL        -> Use type: "url" source (simplest, smallest payload)
+-- Already uploaded  -> Use type: "file" source with file_id (Files API, beta)
+-- Multiple requests -> Upload once via Files API, reuse file_id
```
Image vs Document Block
```
What type of file?
+-- JPEG, PNG, GIF, WebP -> type: "image"
+-- PDF                  -> type: "document" with media_type: "application/pdf"
+-- Other formats        -> Convert to a supported format first
```
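A small sketch of that mapping as a helper; the function name and extension list are illustrative.

```typescript
const IMAGE_MEDIA_TYPES: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Map a file extension to the content block type and media_type it needs
function blockTypeFor(filename: string): { type: "image" | "document"; mediaType: string } {
  const ext = filename.slice(filename.lastIndexOf(".")).toLowerCase();
  if (ext === ".pdf") return { type: "document", mediaType: "application/pdf" };
  const mediaType = IMAGE_MEDIA_TYPES[ext];
  if (!mediaType) throw new Error(`Unsupported format: ${ext} -- convert to JPEG, PNG, GIF, WebP, or PDF first`);
  return { type: "image", mediaType };
}
```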
Token Budget for max_tokens
```
What kind of analysis?
+-- Brief description      -> 256-512 max_tokens
+-- Detailed analysis      -> 1024-2048 max_tokens
+-- Document summarization -> 2048-4096 max_tokens
+-- Structured extraction  -> 1024 max_tokens (JSON output is compact)
```
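Per the named-constants rule above, these budgets can live as constants rather than inline numbers. A sketch with illustrative names and the values from the table:

```typescript
// Token budgets per analysis type -- tune for your workload
const MAX_TOKENS_BRIEF_DESCRIPTION = 512;
const MAX_TOKENS_DETAILED_ANALYSIS = 2048;
const MAX_TOKENS_DOCUMENT_SUMMARY = 4096;
const MAX_TOKENS_STRUCTURED_EXTRACTION = 1024;
```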
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Using `type: "image"` for PDFs -- PDFs require `type: "document"` with `media_type: "application/pdf"`
- Passing base64 data as a text string instead of an image content block -- Claude cannot interpret raw base64 text
- Not providing `max_tokens` -- required on every request, no default
- Images larger than 8000x8000px -- rejected by the API
- API file size limit is 5MB per image (10MB on claude.ai)
Medium Priority Issues:
- Placing text before images in the content array -- Claude performs better with images first
- Not labeling multiple images -- Claude may confuse which image is which without "Image 1:", "Image 2:" labels
- Sending full-resolution images when a smaller version would suffice -- wastes tokens and adds latency from auto-resizing
- Using base64 for publicly available images -- URL source is simpler and reduces payload
- Not using `cache_control` when asking multiple questions about the same PDF -- each request re-processes the full document
Common Mistakes:
- Expecting Claude to generate or edit images -- it is understanding-only
- Using vision for tasks requiring precise spatial reasoning (exact pixel coordinates, analog clock reading) -- Claude's spatial abilities are approximate
- Relying on Claude to identify specific people -- it refuses to name individuals per Anthropic policy
- Assuming exact object counts -- Claude gives approximate counts, especially for many small objects
- Forgetting that PDF pages are dual-processed (text + image) -- token costs are higher than text-only
Gotchas & Edge Cases:
- Images under 200px on any edge may produce lower quality analysis
- When sending 20+ images in a single request, each image is limited to 2000x2000px max
- API supports up to 600 images per request (100 for 200k context window models), but request size limits (32MB) are often reached first
- Claude does not read image EXIF metadata -- orientation, camera info, GPS data are not accessible
- PDFs with passwords or encryption are not supported -- only standard PDFs
- The Files API for images and documents is currently in beta (`betas: ["files-api-2025-04-14"]`)
- Multi-turn vision conversations do not require re-sending the image -- it persists in conversation history (see the sketch after this list)
- For PDFs, dense pages with complex tables or heavy graphics can fill the context window before reaching the 600-page limit
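A sketch of a multi-turn follow-up where the image appears once in the history and later questions are text-only, reusing the `client`, `imageData`, and `MAX_TOKENS` from earlier sketches. The exact `MessageParam` type path may differ by SDK version.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const history: Anthropic.MessageParam[] = [
  {
    role: "user",
    content: [
      { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
      { type: "text", text: "What does this chart show?" },
    ],
  },
];

const first = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: history,
});

// Append the assistant turn, then ask a follow-up without re-sending the image
history.push({ role: "assistant", content: first.content });
history.push({ role: "user", content: "Which quarter had the largest drop?" });

const followUp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: history,
});
```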
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
- All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
- You MUST use `type: "image"` for images and `type: "document"` for PDFs -- they are different content block types
- You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first
- You MUST always provide `max_tokens` in every request -- it is required and has no default
- You MUST iterate over `response.content` blocks -- never assume a single text block in the response
- You MUST use named constants for `max_tokens`, token budgets, and pixel limits -- no magic numbers
Failure to follow these rules will produce API errors, degraded vision quality, unexpected token costs, or runtime crashes from untyped content blocks.
</critical_reminders>