Skillsbench openai-vision
Analyze images and multi-frame sequences using OpenAI GPT vision models
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/jpg-ocr-stat/environment/skills/openai-vision" ~/.claude/skills/benchflow-ai-skillsbench-openai-vision && rm -rf "$T"
manifest:
tasks/jpg-ocr-stat/environment/skills/openai-vision/SKILL.mdsource content
OpenAI Vision Analysis Skill
Purpose
This skill enables image analysis, scene understanding, text extraction, and multi-frame comparison using OpenAI's vision-capable GPT models (e.g.,
gpt-4o, gpt-4o-mini). It supports single images, multiple images for comparison, and sequential frames for temporal analysis.
When to Use
- Analyzing image content (objects, scenes, colors, spatial relationships)
- Extracting and reading text from images (OCR via vision models)
- Comparing multiple images to detect differences or changes
- Processing video frames to understand temporal progression
- Generating detailed image descriptions or captions
- Answering questions about visual content
Required Libraries
The following Python libraries are required:
from openai import OpenAI import base64 import json import os from pathlib import Path
Input Requirements
- File formats: JPG, JPEG, PNG, WEBP, non-animated GIF
- Image sources: URL, Base64-encoded data, or local file paths
- Size limits: Up to 20MB per image; total request payload under 50MB
- Maximum images: Up to 500 images per request
- Image quality: Clear, legible content; avoid watermarks or heavy distortions
Output Schema
Analysis results should be returned as valid JSON conforming to this schema:
{ "success": true, "images_analyzed": 1, "analysis": { "description": "A detailed scene description...", "objects": [ {"name": "car", "color": "red", "position": "foreground center"}, {"name": "tree", "count": 3, "position": "background"} ], "text_content": "Any text visible in the image...", "colors": ["blue", "green", "white"], "scene_type": "outdoor/urban" }, "comparison": { "differences": ["Object X appeared", "Color changed from A to B"], "similarities": ["Background unchanged", "Layout consistent"] }, "metadata": { "model_used": "gpt-4o", "detail_level": "high", "token_usage": {"prompt": 1500, "completion": 200} }, "warnings": [] }
Field Descriptions
: Boolean indicating whether analysis completedsuccess
: Number of images processed in the requestimages_analyzed
: Natural language description of the image contentanalysis.description
: Array of detected objects with attributesanalysis.objects
: Any text extracted from the imageanalysis.text_content
: Dominant colors identifiedanalysis.colors
: Classification of the sceneanalysis.scene_type
: Present when multiple images are analyzed; describes differences and similaritiescomparison
: The GPT model used for analysismetadata.model_used
: Resolution level used (metadata.detail_level
,low
, orhigh
)auto
: Token consumption for cost trackingmetadata.token_usage
: Array of any issues or limitations encounteredwarnings
Code Examples
Basic Image Analysis from URL
from openai import OpenAI client = OpenAI() def analyze_image_url(image_url, prompt="Describe this image in detail."): """Analyze an image from a URL using GPT-4o vision.""" response = client.chat.completions.create( model="gpt-4o", messages=[ { "role": "user", "content": [ {"type": "text", "text": prompt}, { "type": "image_url", "image_url": { "url": image_url, "detail": "high" } } ] } ], max_tokens=1000 ) return response.choices[0].message.content
Image Analysis from Local File (Base64)
from openai import OpenAI import base64 client = OpenAI() def encode_image_to_base64(image_path): """Encode a local image file to base64.""" with open(image_path, "rb") as image_file: return base64.standard_b64encode(image_file.read()).decode("utf-8") def get_image_media_type(image_path): """Determine the media type based on file extension.""" ext = image_path.lower().split('.')[-1] media_types = { 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'png': 'image/png', 'gif': 'image/gif', 'webp': 'image/webp' } return media_types.get(ext, 'image/jpeg') def analyze_local_image(image_path, prompt="Describe this image in detail."): """Analyze a local image file using GPT-4o vision.""" base64_image = encode_image_to_base64(image_path) media_type = get_image_media_type(image_path) response = client.chat.completions.create( model="gpt-4o", messages=[ { "role": "user", "content": [ {"type": "text", "text": prompt}, { "type": "image_url", "image_url": { "url": f"data:{media_type};base64,{base64_image}", "detail": "high" } } ] } ], max_tokens=1000 ) return response.choices[0].message.content
Multi-Image Comparison
from openai import OpenAI import base64 client = OpenAI() def compare_images(image_paths, comparison_prompt=None): """Compare multiple images and identify differences.""" if comparison_prompt is None: comparison_prompt = ( "Compare these images carefully. " "List all differences and similarities you observe. " "Describe any changes in objects, colors, positions, or text." ) content = [{"type": "text", "text": comparison_prompt}] for i, image_path in enumerate(image_paths): base64_image = encode_image_to_base64(image_path) media_type = get_image_media_type(image_path) # Add label for each image content.append({ "type": "text", "text": f"Image {i + 1}:" }) content.append({ "type": "image_url", "image_url": { "url": f"data:{media_type};base64,{base64_image}", "detail": "high" } }) response = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": content}], max_tokens=2000 ) return response.choices[0].message.content
Multi-Frame Video Analysis
from openai import OpenAI import base64 from pathlib import Path client = OpenAI() def analyze_video_frames(frame_paths, analysis_prompt=None): """Analyze a sequence of video frames for temporal understanding.""" if analysis_prompt is None: analysis_prompt = ( "These are sequential frames from a video. " "Describe what is happening over time. " "Identify any motion, changes, or events that occur across the frames." ) content = [{"type": "text", "text": analysis_prompt}] for i, frame_path in enumerate(frame_paths): base64_image = encode_image_to_base64(frame_path) media_type = get_image_media_type(frame_path) content.append({ "type": "text", "text": f"Frame {i + 1}:" }) content.append({ "type": "image_url", "image_url": { "url": f"data:{media_type};base64,{base64_image}", "detail": "auto" # Use auto for frames to balance cost } }) response = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": content}], max_tokens=2000 ) return response.choices[0].message.content
Full Analysis with JSON Output
from openai import OpenAI import base64 import json import os client = OpenAI() def analyze_image_to_json(image_path, extract_text=True): """Perform comprehensive image analysis and return structured JSON.""" filename = os.path.basename(image_path) prompt = """Analyze this image and return a JSON object with the following structure: { "description": "detailed scene description", "objects": [{"name": "object name", "attributes": "color, size, position"}], "text_content": "any visible text or null if none", "colors": ["dominant", "colors"], "scene_type": "indoor/outdoor/abstract/etc", "people_count": 0, "notable_features": ["list of notable visual elements"] } Return ONLY valid JSON, no other text.""" try: base64_image = encode_image_to_base64(image_path) media_type = get_image_media_type(image_path) response = client.chat.completions.create( model="gpt-4o", messages=[ { "role": "user", "content": [ {"type": "text", "text": prompt}, { "type": "image_url", "image_url": { "url": f"data:{media_type};base64,{base64_image}", "detail": "high" } } ] } ], max_tokens=1500 ) # Parse the response as JSON analysis_text = response.choices[0].message.content # Remove markdown code blocks if present if analysis_text.startswith("```"): analysis_text = analysis_text.split("```")[1] if analysis_text.startswith("json"): analysis_text = analysis_text[4:] analysis = json.loads(analysis_text.strip()) result = { "success": True, "filename": filename, "analysis": analysis, "metadata": { "model_used": "gpt-4o", "detail_level": "high", "token_usage": { "prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens } }, "warnings": [] } except json.JSONDecodeError as e: result = { "success": False, "filename": filename, "analysis": {"raw_response": response.choices[0].message.content}, "metadata": {"model_used": "gpt-4o"}, "warnings": [f"Failed to parse JSON: {str(e)}"] } except Exception as e: result = { "success": False, "filename": filename, "analysis": {}, "metadata": {}, "warnings": [f"Analysis failed: {str(e)}"] } return result # Usage result = analyze_image_to_json("photo.jpg") print(json.dumps(result, indent=2))
Batch Processing Directory
from openai import OpenAI import base64 import json from pathlib import Path client = OpenAI() def process_image_directory(directory_path, output_file, prompt=None): """Process all images in a directory and save results.""" if prompt is None: prompt = "Describe this image briefly, including any visible text." image_extensions = {'.jpg', '.jpeg', '.png', '.webp', '.gif'} results = [] for file_path in sorted(Path(directory_path).iterdir()): if file_path.suffix.lower() in image_extensions: print(f"Processing: {file_path.name}") try: analysis = analyze_local_image(str(file_path), prompt) results.append({ "filename": file_path.name, "success": True, "analysis": analysis }) except Exception as e: results.append({ "filename": file_path.name, "success": False, "error": str(e) }) # Save results with open(output_file, 'w') as f: json.dump(results, f, indent=2) return results
Detail Level Configuration
The
detail parameter controls image resolution and token usage:
# Low detail: 512x512 fixed, ~85 tokens per image # Best for: Quick summaries, dominant colors, general scene type {"detail": "low"} # High detail: Full resolution processing # Best for: Reading text, detecting small objects, detailed analysis {"detail": "high"} # Auto: Model decides based on image size # Best for: General use when cost vs quality tradeoff is acceptable {"detail": "auto"}
Choosing Detail Level
def get_recommended_detail(task_type): """Recommend detail level based on task type.""" high_detail_tasks = { 'ocr', 'text_extraction', 'document_analysis', 'small_object_detection', 'detailed_comparison', 'fine_grained_analysis' } low_detail_tasks = { 'scene_classification', 'dominant_colors', 'general_description', 'thumbnail_preview' } if task_type.lower() in high_detail_tasks: return "high" elif task_type.lower() in low_detail_tasks: return "low" else: return "auto"
Text Extraction (Vision-based OCR)
For extracting text from images using vision models:
def extract_text_from_image(image_path, preserve_layout=False): """Extract text from an image using GPT-4o vision.""" if preserve_layout: prompt = ( "Extract ALL text visible in this image. " "Preserve the original layout and formatting as much as possible. " "Include headers, paragraphs, captions, and any other text. " "Return only the extracted text, nothing else." ) else: prompt = ( "Extract all text visible in this image. " "Return the text in reading order (top to bottom, left to right). " "Return only the extracted text, nothing else." ) return analyze_local_image(image_path, prompt) def extract_structured_text(image_path): """Extract text with structure information as JSON.""" prompt = """Extract all text from this image and return as JSON: { "headers": ["list of headers/titles"], "paragraphs": ["list of paragraph texts"], "labels": ["list of labels or captions"], "other_text": ["any other text elements"], "reading_order": ["all text in reading order"] } Return ONLY valid JSON.""" response = analyze_local_image(image_path, prompt) try: # Clean and parse JSON if response.startswith("```"): response = response.split("```")[1] if response.startswith("json"): response = response[4:] return json.loads(response.strip()) except json.JSONDecodeError: return {"raw_text": response, "parse_error": True}
Error Handling
Common Issues and Solutions
Issue: API rate limits exceeded
import time from openai import RateLimitError def analyze_with_retry(image_path, prompt, max_retries=3): """Analyze image with exponential backoff retry.""" for attempt in range(max_retries): try: return analyze_local_image(image_path, prompt) except RateLimitError: if attempt < max_retries - 1: wait_time = 2 ** attempt # Exponential backoff print(f"Rate limited, waiting {wait_time}s...") time.sleep(wait_time) else: raise
Issue: Image too large
from PIL import Image import io def resize_image_if_needed(image_path, max_size_mb=15): """Resize image if it exceeds size limit.""" file_size_mb = os.path.getsize(image_path) / (1024 * 1024) if file_size_mb <= max_size_mb: return encode_image_to_base64(image_path) # Resize the image img = Image.open(image_path) # Calculate new dimensions (reduce by 50% iteratively) while file_size_mb > max_size_mb: new_width = img.width // 2 new_height = img.height // 2 img = img.resize((new_width, new_height), Image.Resampling.LANCZOS) # Check new size buffer = io.BytesIO() img.save(buffer, format='JPEG', quality=85) file_size_mb = len(buffer.getvalue()) / (1024 * 1024) # Encode resized image buffer = io.BytesIO() img.save(buffer, format='JPEG', quality=85) return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
Issue: Invalid image format
def validate_image(image_path): """Validate image before processing.""" valid_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.webp'} path = Path(image_path) if not path.exists(): return False, "File does not exist" if path.suffix.lower() not in valid_extensions: return False, f"Unsupported format: {path.suffix}" try: with Image.open(image_path) as img: img.verify() return True, "Valid image" except Exception as e: return False, f"Invalid image: {str(e)}"
Quality Self-Check
Before returning results, verify:
- API response was received successfully
- Output is valid JSON (if structured output requested)
- All requested analysis fields are present
- Token usage is within expected bounds
- No error messages in the response
- For multi-image: all images were processed
- Confidence/warnings are included when analysis is uncertain
Limitations
- Medical imagery: Not suitable for diagnostic analysis of CT scans, X-rays, or MRIs
- Small text: Text smaller than ~12pt may be misread; use
detail: high - Rotated/skewed text: Accuracy decreases with text rotation; pre-process if needed
- Non-Latin scripts: Lower accuracy for complex scripts (CJK, Arabic, etc.)
- Object counting: Approximate counts only; may miss or double-count similar objects
- Spatial precision: Cannot provide pixel-accurate bounding boxes or measurements
- Panoramic/fisheye: Distorted images reduce analysis accuracy
- Graphs and charts: May misinterpret line styles, legends, or data points
- Metadata: Cannot access EXIF data, camera info, or GPS coordinates from images
- Cost: High-detail analysis of many images can be expensive; monitor token usage
Token Cost Estimation
Approximate token costs for image inputs:
| Detail Level | Tokens per Image | Best For |
|---|---|---|
| low | ~85 tokens (fixed) | Quick classification, color detection |
| high | 85 + 170 per 512x512 tile | OCR, detailed analysis, small objects |
| auto | Variable | General use |
def estimate_image_tokens(image_path, detail="high"): """Estimate token usage for an image.""" if detail == "low": return 85 with Image.open(image_path) as img: width, height = img.size # High detail: image is scaled to fit in 2048x2048, then tiled at 512x512 scale = min(2048 / max(width, height), 1.0) scaled_width = int(width * scale) scaled_height = int(height * scale) # Ensure minimum 768 on shortest side if min(scaled_width, scaled_height) < 768: scale = 768 / min(scaled_width, scaled_height) scaled_width = int(scaled_width * scale) scaled_height = int(scaled_height * scale) # Calculate tiles tiles_x = (scaled_width + 511) // 512 tiles_y = (scaled_height + 511) // 512 total_tiles = tiles_x * tiles_y return 85 + (170 * total_tiles)
Version History
- 1.0.0 (2026-01-21): Initial release with GPT-4o vision support