Skillsbench gpt-multimodal
Analyze images and multi-frame sequences using OpenAI GPT series
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/pedestrian-traffic-counting/environment/skills/gpt-multimodal" ~/.claude/skills/benchflow-ai-skillsbench-gpt-multimodal && rm -rf "$T"
manifest:
tasks/pedestrian-traffic-counting/environment/skills/gpt-multimodal/SKILL.md
OpenAI Vision Analysis Skill
Purpose
This skill enables image analysis, scene understanding, text extraction, and multi-frame comparison using OpenAI's vision-capable GPT models (e.g., gpt-4o, gpt-5). It supports single-image and multi-image analysis, as well as sequential frames for temporal analysis.
When to Use
- Analyzing image content (objects, scenes, colors, spatial relationships)
- Extracting and reading text from images (OCR via vision models)
- Comparing multiple images to detect differences or changes
- Processing video frames to understand temporal progression
- Generating detailed image descriptions or captions
- Answering questions about visual content
Required Libraries
The following Python libraries are required:
```python
from openai import OpenAI
import base64
import json
import os
from pathlib import Path
```
Input Requirements
- File formats: JPG, JPEG, PNG, WEBP, non-animated GIF
- Image quality: Clear and legible; minimum 512×512px recommended
- File size: Under 20MB per image recommended
- Maximum per request: Up to 500 images, 50MB total payload
- URL or Base64: Images can be provided as URLs or base64-encoded data
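The requirements above can be pre-checked locally before spending tokens. A minimal sketch of such a check (the `validate_image` helper and its limits are illustrative, not part of the skill itself):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # 20MB recommended per-image limit

def validate_image(path):
    """Return a list of warnings; an empty list means the file looks OK."""
    p = Path(path)
    warnings = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        warnings.append(f"unsupported format: {p.suffix}")
    if p.exists() and p.stat().st_size > MAX_FILE_BYTES:
        warnings.append("file exceeds 20MB recommended limit")
    if not p.exists():
        warnings.append("file not found")
    return warnings
```

Collected warnings can be passed straight into the `warnings` array of the output schema below.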
Output Schema
All analysis results should be returned as valid JSON conforming to this schema:
```json
{
  "success": true,
  "model": "gpt-5",
  "analysis": "Detailed description or analysis of the image content...",
  "metadata": {
    "image_count": 1,
    "detail_level": "high",
    "tokens_used": 850,
    "processing_time_ms": 1234
  },
  "extracted_data": {
    "objects": ["car", "person", "building"],
    "text_found": "Sample text from image",
    "colors": ["blue", "white", "gray"],
    "scene_type": "urban street"
  },
  "warnings": []
}
```
Field Descriptions
- `success`: Boolean indicating whether the API call succeeded
- `model`: The GPT model used for analysis (e.g., "gpt-4o", "gpt-5")
- `analysis`: Complete textual analysis or description from the model
- `metadata.image_count`: Number of images analyzed in this request
- `metadata.detail_level`: Detail parameter used ("low", "high", or "auto")
- `metadata.tokens_used`: Approximate token count for the request
- `metadata.processing_time_ms`: Time taken to process the request, in milliseconds
- `extracted_data`: Structured information extracted from the image(s)
- `warnings`: Array of issues or limitations encountered
Code Examples
Basic Image Analysis
```python
from openai import OpenAI
import base64
import os

def analyze_image(image_path, prompt="What's in this image?"):
    """Analyze a single image using GPT-5 Vision."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    # Read and encode image
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content
```
Using Image URLs
```python
from openai import OpenAI
import os

def analyze_image_url(image_url, prompt="Describe this image"):
    """Analyze an image from a URL."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ]
    )
    return response.choices[0].message.content
```
Multiple Images Analysis
```python
from openai import OpenAI
import base64
import os

def analyze_multiple_images(image_paths, prompt="Compare these images"):
    """Analyze multiple images in a single request."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    # Build content array with text and all images
    content = [{"type": "text", "text": prompt}]
    for image_path in image_paths:
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        })

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=500
    )
    return response.choices[0].message.content
```
Full Analysis with JSON Output
```python
from openai import OpenAI
import base64
import json
import os
import time

def analyze_image_to_json(image_path, prompt="Analyze this image"):
    """Analyze image and return structured JSON output."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    start_time = time.time()
    warnings = []

    try:
        # Read and encode image
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode('utf-8')

        # Make API call
        response = client.chat.completions.create(
            model="gpt-5",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}",
                                "detail": "high"
                            }
                        }
                    ]
                }
            ],
            max_tokens=500
        )

        analysis = response.choices[0].message.content
        tokens_used = response.usage.total_tokens
        processing_time = int((time.time() - start_time) * 1000)

        result = {
            "success": True,
            "model": "gpt-5",
            "analysis": analysis,
            "metadata": {
                "image_count": 1,
                "detail_level": "high",
                "tokens_used": tokens_used,
                "processing_time_ms": processing_time
            },
            "extracted_data": {},
            "warnings": warnings
        }
    except Exception as e:
        result = {
            "success": False,
            "model": "gpt-5",
            "analysis": "",
            "metadata": {
                "image_count": 0,
                "detail_level": "high",
                "tokens_used": 0,
                "processing_time_ms": 0
            },
            "extracted_data": {},
            "warnings": [f"API call failed: {str(e)}"]
        }

    return result

# Usage
result = analyze_image_to_json("photo.jpg", "Describe what you see in detail")
print(json.dumps(result, indent=2))
```
Batch Processing with Sequential Frames
```python
from openai import OpenAI
import base64
import os
from pathlib import Path

def process_video_frames(frames_directory, analysis_prompt):
    """Process sequential video frames for temporal analysis."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    image_extensions = {'.jpg', '.jpeg', '.png', '.webp'}
    frame_paths = sorted([
        f for f in Path(frames_directory).iterdir()
        if f.suffix.lower() in image_extensions
    ])

    # Analyze frames in groups (e.g., 5 frames at a time)
    batch_size = 5
    results = []

    for i in range(0, len(frame_paths), batch_size):
        batch = frame_paths[i:i+batch_size]

        # Build content with all frames in batch
        content = [{"type": "text", "text": analysis_prompt}]
        for frame_path in batch:
            with open(frame_path, "rb") as f:
                base64_image = base64.b64encode(f.read()).decode('utf-8')
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": "low"  # Use low detail for video frames to save tokens
                }
            })

        response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": content}],
            max_tokens=800
        )

        results.append({
            "batch_index": i // batch_size,
            "frame_range": f"{batch[0].name} to {batch[-1].name}",
            "analysis": response.choices[0].message.content
        })

    return results
```
Text Extraction from Images (OCR Alternative)
```python
from openai import OpenAI
import base64
import os

def extract_text_with_gpt(image_path):
    """Extract text from image using GPT Vision as OCR alternative."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode('utf-8')

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all text from this image. Return only the text content, preserving the layout and structure."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content
```
Model Selection and Configuration
Available Models
```python
# GPT-4o - Best for general vision tasks, fast and cost-effective
model = "gpt-4o"

# GPT-5-nano - Faster and cheaper for simple vision tasks
model = "gpt-5-nano"

# GPT-5 - More capable for complex reasoning
model = "gpt-5"
```
Detail Level Configuration
Control how much visual detail the model processes:
```python
# Low detail - 512×512px resolution, fewer tokens, faster
"image_url": {"url": image_url, "detail": "low"}

# High detail - Full resolution with tiling, more tokens, better accuracy
"image_url": {"url": image_url, "detail": "high"}

# Auto - Model chooses appropriate detail level
"image_url": {"url": image_url, "detail": "auto"}
```
When to use each detail level:
- Low: Video frames, simple scene classification, color/shape detection
- High: Text extraction, detailed object detection, fine-grained analysis
- Auto: General purpose when unsure; model optimizes cost vs. quality
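This guidance can be encoded as a simple lookup when automating batch jobs. A sketch only; the task names and `choose_detail` helper are assumptions for illustration:

```python
# Map common task types to the detail level suggested above.
DETAIL_BY_TASK = {
    "video_frames": "low",
    "scene_classification": "low",
    "text_extraction": "high",
    "object_detection": "high",
}

def choose_detail(task):
    """Fall back to 'auto' when the task type is unknown."""
    return DETAIL_BY_TASK.get(task, "auto")
```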
Token Cost Management
Understanding Image Tokens
Image tokens count toward your request limits and costs:
- Low detail: Fixed ~85 tokens per image (gpt-5)
- High detail: Base tokens + tile tokens based on image dimensions
- Images are scaled to fit within 2048×2048px
- Divided into 512×512px tiles
- Each tile costs additional tokens
Cost Calculation Examples
```python
# For gpt-5 with high detail:
# - Base: 85 tokens
# - Per tile: 170 tokens
# - Example: 1024×1024 image = 85 + (2×2 tiles × 170) = 765 tokens

def estimate_tokens_high_detail(width, height):
    """Estimate token cost for high-detail image (gpt-5)."""
    # Scale to fit within 2048×2048
    scale = min(2048 / width, 2048 / height, 1.0)
    scaled_w = int(width * scale)
    scaled_h = int(height * scale)

    # Calculate tiles (512×512)
    tiles_x = (scaled_w + 511) // 512
    tiles_y = (scaled_h + 511) // 512
    total_tiles = tiles_x * tiles_y

    # Token calculation
    base_tokens = 85
    tile_tokens = total_tiles * 170
    return base_tokens + tile_tokens
```
Cost Optimization Strategies
- Use low detail for video frames - Temporal analysis doesn't need high resolution
- Resize large images before uploading - Reduce dimensions to 1024×1024 if high detail not needed
- Batch related questions - Analyze multiple aspects in one API call
- Cache analysis results - Store results for repeated processing
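The caching strategy above could be sketched as follows, keying results by a hash of the image bytes plus the prompt so repeated runs skip the API call entirely. The `cached_analysis` helper and `.vision_cache` directory are hypothetical, not part of the skill:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".vision_cache")  # assumed location; adjust as needed

def cached_analysis(image_path, prompt, analyze_fn):
    """Return a cached result when available, else call analyze_fn and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(
        Path(image_path).read_bytes() + prompt.encode()
    ).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["analysis"]
    analysis = analyze_fn(image_path, prompt)  # e.g. analyze_image from above
    cache_file.write_text(json.dumps({"analysis": analysis}))
    return analysis
```

Hashing file contents (rather than the path) means renamed or copied images still hit the cache.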
Advanced Use Cases
Image Comparison
```python
from openai import OpenAI
import base64
import os

def compare_images(image1_path, image2_path):
    """Compare two images and identify differences."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    images = []
    for path in [image1_path, image2_path]:
        with open(path, "rb") as f:
            base64_image = base64.b64encode(f.read()).decode('utf-8')
        images.append(base64_image)

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Compare these two images. List all differences you observe, including changes in objects, colors, positions, or any other visual elements."
                    },
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{images[0]}"}},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{images[1]}"}}
                ]
            }
        ]
    )
    return response.choices[0].message.content
```
Structured Data Extraction
```python
from openai import OpenAI
import base64
import json
import os

def extract_structured_data(image_path, schema_description):
    """Extract structured information from image based on schema."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    with open(image_path, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode('utf-8')

    prompt = f"""Analyze this image and extract information in JSON format following this schema:

{schema_description}

Return only valid JSON, no additional text."""

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Usage example
schema = """
{
  "products": [{"name": string, "price": number, "quantity": number}],
  "total": number,
  "date": string
}
"""
data = extract_structured_data("receipt.jpg", schema)
```
Error Handling
Common Issues and Solutions
Issue: API authentication failed
```python
# Verify API key is set
import os

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")
```
Issue: Image too large
```python
from PIL import Image
from pathlib import Path

def resize_if_needed(image_path, max_size=2048):
    """Resize image if dimensions exceed maximum."""
    img = Image.open(image_path)
    if max(img.size) > max_size:
        img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)
        # Append "_resized" before the extension (a bare str.replace('.', ...)
        # would mangle paths containing extra dots)
        p = Path(image_path)
        resized_path = str(p.with_name(f"{p.stem}_resized{p.suffix}"))
        img.save(resized_path, quality=95)
        return resized_path
    return image_path
```
Issue: Token limit exceeded
```python
# Reduce max_tokens or use low detail mode
response = client.chat.completions.create(
    model="gpt-5",
    messages=[...],
    max_tokens=300  # Reduce from default
)
```
Issue: Rate limit errors
```python
import time
from openai import RateLimitError

def analyze_with_retry(image_path, max_retries=3):
    """Analyze image with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return analyze_image(image_path)
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limit hit, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
```
Best Practices
Prompt Engineering for Vision
- Be specific: "Count the number of people wearing red shirts" vs "Analyze this image"
- Request structured output: Ask for JSON, lists, or tables when appropriate
- Provide context: "This is a medical diagram showing..." helps the model understand
- Use examples: Show the format you want in your prompt
Image Quality Guidelines
- Use clear, well-lit images
- Ensure text is readable at original size
- Avoid extreme angles or distortions
- Crop to relevant content to save tokens
- Use standard orientations (avoid rotated images)
Multi-Image Analysis
- Order matters: Present images in logical sequence
- Reference images explicitly: "In the first image..."
- Limit to 10-20 images per request for best results
- Use low detail for large batches of similar images
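The "reference images explicitly" advice above pairs naturally with labeling each image in the content array. One possible way to do this (the `build_labeled_content` helper is illustrative, not part of the skill):

```python
def build_labeled_content(prompt, image_urls):
    """Interleave 'Image N:' text labels with image entries so the prompt
    can reference images by number ("In the first image...")."""
    content = [{"type": "text", "text": prompt}]
    for i, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({
            "type": "image_url",
            # Low detail keeps token cost down for large batches
            "image_url": {"url": url, "detail": "low"}
        })
    return content
```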
Quality Self-Check
Before returning results, verify:
- Output is valid JSON (use `json.loads()` to validate)
- All required fields are present and properly typed
- API errors are caught and handled gracefully
- Token usage is tracked and within limits
- Image formats are supported (JPG, PNG, WEBP, GIF)
- Base64 encoding is correct (no corruption)
- Model name is valid and available
- Results are consistent with the prompt request
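The first two checklist items can be partially automated. A minimal sketch assuming the output schema defined earlier in this document (`self_check` is a hypothetical helper, not part of the skill):

```python
import json

# Top-level fields and expected types, per the output schema above
REQUIRED_FIELDS = {
    "success": bool,
    "model": str,
    "analysis": str,
    "metadata": dict,
    "extracted_data": dict,
    "warnings": list,
}

def self_check(result_json):
    """Return a list of problems in a serialized result; empty list = passes."""
    try:
        result = json.loads(result_json)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```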
Limitations
Vision Model Limitations
- Not for medical diagnosis: Cannot interpret CT scans, X-rays, or provide medical advice
- Poor text recognition: Struggles with rotated, upside-down, or very small text (< 10pt)
- Non-Latin scripts: Reduced accuracy for non-Latin alphabets and special characters
- Spatial reasoning: Weak at precise localization tasks (e.g., chess positions, exact coordinates)
- Graph interpretation: Cannot reliably distinguish line styles (solid vs dashed) or precise color gradients
- Object counting: Provides approximate counts; may miss or double-count objects
- Panoramic/fisheye: Distorted perspectives reduce accuracy
- Metadata loss: Original EXIF data and exact dimensions are not preserved
Performance Considerations
- Latency: High-detail mode takes longer to process
- Token costs: Multiple images can quickly consume token budgets
- Rate limits: Vision requests count toward TPM (tokens per minute) limits
- File size: 20MB per image practical limit; 50MB total per request
- Batch size: Over 20 images may degrade quality or hit timeouts