Image OCR Skill
Purpose
This skill enables accurate text extraction from image files (JPG, PNG, etc.) using Tesseract OCR via the pytesseract Python library. It is suitable for scanned documents, screenshots, photos of text, receipts, forms, and other visual content containing text.
When to Use
- Extracting text from scanned documents or photos
- Reading text from screenshots or image captures
- Processing batch image files that contain textual information
- Converting visual documents to machine-readable text
- Extracting structured data from forms, receipts, or tables in images
Required Libraries
The following Python libraries are required:
    import pytesseract
    from PIL import Image
    import json
    import os
Input Requirements
- File formats: JPG, JPEG, PNG, WEBP
- Image quality: Minimum 300 DPI recommended for printed text; clear and legible text
- File size: Under 5MB per image (resize if necessary)
- Text language: Specify if non-English to improve accuracy
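The format and size requirements above can be checked before any OCR call. The following is a minimal sketch using only the standard library; the helper name `check_input_file` is hypothetical, and reading "under 5MB" as 5 MiB is an assumption.

```python
from pathlib import Path

# Extensions taken from the "File formats" requirement above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
# Assumption: "under 5MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024


def check_input_file(image_path):
    """Return a list of problems with the input file; an empty list means it looks acceptable."""
    path = Path(image_path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported file format: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("File does not exist")
    elif path.stat().st_size >= MAX_FILE_SIZE_BYTES:
        problems.append("File is 5MB or larger; resize before OCR")
    return problems
```

The returned strings could be appended to the `warnings` array of the output, or used to skip a file entirely.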
Output Schema
All extracted content must be returned as valid JSON conforming to this schema:
    {
      "success": true,
      "filename": "example.jpg",
      "extracted_text": "Full raw text extracted from the image...",
      "confidence": "high|medium|low",
      "metadata": {
        "language_detected": "en",
        "text_regions": 3,
        "has_tables": false,
        "has_handwriting": false
      },
      "warnings": [
        "Text partially obscured in bottom-right corner",
        "Low contrast detected in header section"
      ]
    }
Field Descriptions
- success: Boolean indicating whether text extraction completed
- filename: Original image filename
- extracted_text: Complete text content in reading order (top-to-bottom, left-to-right)
- confidence: Overall OCR confidence level based on image quality and text clarity
- metadata.language_detected: ISO 639-1 language code
- metadata.text_regions: Number of distinct text blocks identified
- metadata.has_tables: Whether tabular data structures were detected
- metadata.has_handwriting: Whether handwritten text was detected
- warnings: Array of quality issues or potential errors
Code Examples
Basic OCR Extraction
    import pytesseract
    from PIL import Image

    def extract_text_from_image(image_path):
        """Extract text from a single image using Tesseract OCR."""
        img = Image.open(image_path)
        text = pytesseract.image_to_string(img)
        return text.strip()
OCR with Confidence Data
    import pytesseract
    from PIL import Image

    def extract_with_confidence(image_path):
        """Extract text with per-word confidence scores."""
        img = Image.open(image_path)

        # Get detailed OCR data including confidence
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

        words = []
        confidences = []
        for i, word in enumerate(data['text']):
            if word.strip():  # Skip empty strings
                words.append(word)
                confidences.append(float(data['conf'][i]))

        # Average only the positive confidences; fall back to 0 when there are
        # none, which also avoids dividing by zero
        positive = [c for c in confidences if c > 0]
        avg_confidence = sum(positive) / len(positive) if positive else 0

        return {
            'text': ' '.join(words),
            'average_confidence': avg_confidence,
            'word_count': len(words)
        }
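Per-word confidences can also point at the specific words that are likely misread. The sketch below works on a dict shaped like the `image_to_data` output used above; the helper name `low_confidence_words` and the default threshold of 60 are assumptions for illustration, not values from this skill.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below the threshold.

    `data` is a dict with parallel 'text' and 'conf' lists, as returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # defensive: tolerate confidences delivered as strings
        # Negative confidences mark non-word entries (blocks, lines), so skip them.
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged
```

Flagged words are natural candidates for entries in the `warnings` array.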
Full OCR with JSON Output
    import pytesseract
    from PIL import Image
    import json
    import os

    def ocr_to_json(image_path):
        """Perform OCR and return results as JSON."""
        filename = os.path.basename(image_path)
        warnings = []

        try:
            img = Image.open(image_path)

            # Get detailed OCR data
            data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

            # Extract text preserving structure
            text = pytesseract.image_to_string(img)

            # Calculate confidence
            confidences = [c for c in data['conf'] if c > 0]
            avg_conf = sum(confidences) / len(confidences) if confidences else 0

            # Determine confidence level
            if avg_conf >= 80:
                confidence = "high"
            elif avg_conf >= 50:
                confidence = "medium"
            else:
                confidence = "low"
                warnings.append(f"Low OCR confidence: {avg_conf:.1f}%")

            # Count text regions (blocks)
            block_nums = set(data['block_num'])
            text_regions = len([b for b in block_nums if b > 0])

            result = {
                "success": True,
                "filename": filename,
                "extracted_text": text.strip(),
                "confidence": confidence,
                "metadata": {
                    "language_detected": "en",
                    "text_regions": text_regions,
                    "has_tables": False,
                    "has_handwriting": False
                },
                "warnings": warnings
            }

        except Exception as e:
            result = {
                "success": False,
                "filename": filename,
                "extracted_text": "",
                "confidence": "low",
                "metadata": {
                    "language_detected": "unknown",
                    "text_regions": 0,
                    "has_tables": False,
                    "has_handwriting": False
                },
                "warnings": [f"OCR failed: {str(e)}"]
            }

        return result

    # Usage
    result = ocr_to_json("document.jpg")
    print(json.dumps(result, indent=2))
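The example above counts text regions from `block_num`. The same `image_to_data` fields can also rebuild lines in reading order, which may help when the structure of `image_to_string` output is not enough. This is a sketch under the assumption that the dict carries the usual `block_num`, `par_num`, and `line_num` lists; `words_to_lines` is a hypothetical helper name.

```python
from collections import defaultdict


def words_to_lines(data):
    """Group recognized words into text lines using Tesseract's layout indices."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if word.strip():
            # Words sharing block, paragraph, and line numbers belong to one line.
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            lines[key].append(word)
    # Sorting the keys orders lines by block, then paragraph, then line.
    return [" ".join(words) for _, words in sorted(lines.items())]
```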
Batch Processing Multiple Images
    import json
    from pathlib import Path

    # Relies on ocr_to_json() from the "Full OCR with JSON Output" example

    def process_image_directory(directory_path, output_file):
        """Process all images in a directory and save results."""
        image_extensions = {'.jpg', '.jpeg', '.png', '.webp'}
        results = []

        for file_path in sorted(Path(directory_path).iterdir()):
            if file_path.suffix.lower() in image_extensions:
                result = ocr_to_json(str(file_path))
                results.append(result)
                print(f"Processed: {file_path.name}")

        # Save results
        with open(output_file, 'w') as f:
            json.dump(results, f, indent=2)

        return results
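After a batch run it can be useful to see at a glance how many files failed or came back with low confidence. A minimal sketch, assuming each entry follows the output schema of this skill; `summarize_results` is a hypothetical helper.

```python
from collections import Counter


def summarize_results(results):
    """Summarize a list of OCR result dicts that follow the output schema."""
    failed = [r["filename"] for r in results if not r["success"]]
    # Confidence is only meaningful for results that succeeded.
    by_confidence = Counter(r["confidence"] for r in results if r["success"])
    return {
        "total": len(results),
        "succeeded": len(results) - len(failed),
        "failed_files": failed,
        "by_confidence": dict(by_confidence),
    }
```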
Tesseract Configuration Options
Language Selection
    # Specify language (default is English)
    text = pytesseract.image_to_string(img, lang='eng')

    # Multiple languages
    text = pytesseract.image_to_string(img, lang='eng+fra+deu')
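Tesseract expects its own three-letter language codes, while the output schema reports ISO 639-1 codes. A small mapping can bridge the two. The sketch below covers only a handful of languages, and `tesseract_lang` is a hypothetical helper; extend the table to match the language packs actually installed.

```python
# Partial mapping from ISO 639-1 codes to Tesseract language codes.
ISO_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang(iso_codes):
    """Build a Tesseract `lang` string such as 'eng+fra' from ISO 639-1 codes."""
    try:
        return "+".join(ISO_TO_TESSERACT[code.lower()] for code in iso_codes)
    except KeyError as exc:
        raise ValueError(f"No Tesseract code known for {exc.args[0]!r}") from None
```

The result can be passed directly as the `lang` argument, for example `lang=tesseract_lang(["en", "fr"])`.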
Page Segmentation Modes (PSM)
Use --psm to control how Tesseract segments the image:

    # PSM 3: Fully automatic page segmentation (default)
    text = pytesseract.image_to_string(img, config='--psm 3')

    # PSM 4: Assume single column of text
    text = pytesseract.image_to_string(img, config='--psm 4')

    # PSM 6: Assume uniform block of text
    text = pytesseract.image_to_string(img, config='--psm 6')

    # PSM 11: Sparse text - find as much text as possible
    text = pytesseract.image_to_string(img, config='--psm 11')
Common PSM values:
- 0: Orientation and script detection (OSD) only
- 3: Fully automatic page segmentation (default)
- 4: Single column of text of variable sizes
- 6: Uniform block of text
- 7: Single text line
- 11: Sparse text
- 13: Raw line
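To keep PSM numbers out of calling code, the values above can be wrapped in a small lookup. This is only a sketch: the layout names and the `psm_config` helper are invented for illustration, while the PSM numbers come from the list above.

```python
# Hypothetical layout names mapped to the PSM values listed above.
PSM_BY_LAYOUT = {
    "page": 3,     # fully automatic page segmentation
    "column": 4,   # single column of text
    "block": 6,    # uniform block of text
    "line": 7,     # single text line
    "sparse": 11,  # sparse text
}


def psm_config(layout="page"):
    """Return a Tesseract config string such as '--psm 6' for a named layout."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"Unknown layout: {layout!r}")
    return f"--psm {PSM_BY_LAYOUT[layout]}"
```

The returned string is meant for the `config` argument, for example `pytesseract.image_to_string(img, config=psm_config("block"))`.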
Image Preprocessing
For better OCR accuracy, preprocess images:
    import pytesseract
    from PIL import Image, ImageFilter, ImageOps

    def preprocess_image(image_path):
        """Preprocess image for better OCR results."""
        img = Image.open(image_path)

        # Convert to grayscale
        img = img.convert('L')

        # Increase contrast
        img = ImageOps.autocontrast(img)

        # Apply slight sharpening
        img = img.filter(ImageFilter.SHARPEN)

        return img

    # Use preprocessed image for OCR
    img = preprocess_image("document.jpg")
    text = pytesseract.image_to_string(img)
Advanced Preprocessing Strategies
For difficult images (low contrast, faded text, dark backgrounds), try multiple preprocessing approaches:
- Grayscale + Autocontrast - Basic enhancement for most images
- Inverted - Use ImageOps.invert() for dark backgrounds with light text
- Scaling - Upscale small images (e.g., 2x) before OCR to improve character recognition
- Thresholding - Convert to binary using img.point(lambda p: 255 if p > threshold else 0) with different threshold values (e.g., 100, 128)
- Sharpening - Apply ImageFilter.SHARPEN to improve edge clarity
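The strategies listed above can be generated in one place so that each variant is easy to feed to OCR. The following is a sketch, not a tuned pipeline: the function names are hypothetical, the thresholds and 2x scale are the example values from the list, and thresholding uses a 256-entry lookup table (which `Image.point` also accepts) instead of a lambda.

```python
def threshold_table(threshold):
    """256-entry lookup table: pixels above the threshold become white, the rest black."""
    return [255 if p > threshold else 0 for p in range(256)]


def preprocessing_variants(image_path, thresholds=(100, 128), scale=2):
    """Return a dict mapping variant names to grayscale images to try with OCR."""
    # Pillow is imported here so threshold_table stays usable without it.
    from PIL import Image, ImageFilter, ImageOps

    gray = ImageOps.autocontrast(Image.open(image_path).convert("L"))
    variants = {
        "autocontrast": gray,
        "inverted": ImageOps.invert(gray),
        "upscaled": gray.resize(
            (gray.width * scale, gray.height * scale), Image.LANCZOS
        ),
        "sharpened": gray.filter(ImageFilter.SHARPEN),
    }
    for t in thresholds:
        # Binarize with the lookup table built for this threshold.
        variants[f"threshold_{t}"] = gray.point(threshold_table(t))
    return variants
```

Each value in the returned dict can be passed to `pytesseract.image_to_string` in turn, keeping whichever variant yields the most usable text.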
Multi-Pass OCR Strategy
For challenging images, a single OCR pass may miss text. Use multiple passes with different configurations:
- Try multiple PSM modes - Different page segmentation modes work better for different layouts (e.g., --psm 6 for blocks, --psm 4 for columns, --psm 11 for sparse text)
- Try multiple preprocessing variants - Run OCR on several preprocessed versions of the same image
- Combine results - Aggregate text from all passes to maximize extraction coverage
    import pytesseract
    from PIL import Image, ImageFilter, ImageOps

    def multi_pass_ocr(image_path):
        """Run OCR with multiple strategies and combine results."""
        img = Image.open(image_path)
        gray = ImageOps.grayscale(img)

        # Generate preprocessing variants
        variants = [
            ImageOps.autocontrast(gray),
            ImageOps.invert(ImageOps.autocontrast(gray)),
            gray.filter(ImageFilter.SHARPEN),
        ]

        # PSM modes to try
        psm_modes = ['--psm 6', '--psm 4', '--psm 11']

        all_text = []
        for variant in variants:
            for psm in psm_modes:
                try:
                    text = pytesseract.image_to_string(variant, config=psm)
                    if text.strip():
                        all_text.append(text)
                except Exception:
                    # Skip a failed pass and continue with the remaining ones
                    pass

        # Combine all extracted text (passes are concatenated, so lines may repeat)
        return "\n".join(all_text)
This approach improves extraction for receipts, faded documents, and images with varying quality.
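Because every pass tends to recover much of the same text, plainly concatenating the passes repeats lines. One possible way to aggregate is to keep each distinct line once, in first-seen order. This is a sketch of that idea; `merge_passes` is a hypothetical helper, and treating lines as equal when they match after case and whitespace normalization is an assumption.

```python
def merge_passes(texts):
    """Merge OCR outputs from several passes, keeping each distinct line once."""
    seen = set()
    merged = []
    for text in texts:
        for line in text.splitlines():
            # Normalize whitespace and case so trivial differences do not count as new lines.
            normalized = " ".join(line.split()).lower()
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(line.strip())
    return "\n".join(merged)
```

Near-duplicates that differ by a misread character still survive this merge, so downstream parsing should tolerate some repetition.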
Error Handling
Common Issues and Solutions
Issue: Tesseract not found
    import pytesseract

    # Verify Tesseract is installed
    try:
        pytesseract.get_tesseract_version()
    except pytesseract.TesseractNotFoundError:
        print("Tesseract is not installed or not in PATH")
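The binary can also be located without calling pytesseract at all, which helps distinguish a missing install from a PATH problem. A minimal standard-library sketch; `locate_tesseract` is a hypothetical helper and the candidate names are assumptions.

```python
import shutil


def locate_tesseract(candidates=("tesseract",)):
    """Return the full path of the first Tesseract executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If the binary lives outside PATH, its location can be assigned to `pytesseract.pytesseract.tesseract_cmd` before running OCR.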
Issue: Poor OCR quality
- Preprocess image (grayscale, contrast, sharpen)
- Use appropriate PSM mode for the document type
- Ensure image resolution is sufficient (300+ DPI)
Issue: Empty or garbage output
- Check if image contains actual text
- Try different PSM modes
- Verify image is not corrupted
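Empty or garbage output can be flagged automatically before it reaches the result. The heuristic below is only a rough sketch: it treats text as suspect when it is very short or mostly non-alphanumeric. The function name and both default thresholds are assumptions and would need tuning on real documents.

```python
def looks_like_garbage(text, min_chars=5, min_alnum_ratio=0.5):
    """Heuristically flag OCR output that is empty or dominated by symbols."""
    visible = [ch for ch in text if not ch.isspace()]
    if len(visible) < min_chars:
        return True  # effectively empty
    alnum = sum(ch.isalnum() for ch in visible)
    return alnum / len(visible) < min_alnum_ratio
```

When this returns True, retrying with a different PSM mode or preprocessing variant is a reasonable next step.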
Quality Self-Check
Before returning results, verify:
- Output is valid JSON (use json.loads() to validate)
- All required fields are present (success, filename, extracted_text, confidence, metadata)
- Text preserves logical reading order
- Confidence level reflects actual OCR quality
- Warnings array includes all detected issues
- Special characters are properly escaped in JSON
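The mechanical parts of this checklist can be automated. The sketch below checks required fields, the confidence value, the type of `warnings`, and a JSON round trip; `check_result` is a hypothetical helper, and reading order or confidence accuracy still need human or task-specific review.

```python
import json

REQUIRED_FIELDS = ("success", "filename", "extracted_text", "confidence", "metadata")
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def check_result(result):
    """Return a list of schema problems for an OCR result dict; empty means it passed."""
    problems = [f"Missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    if result.get("confidence") not in CONFIDENCE_LEVELS:
        problems.append("confidence must be one of high, medium, low")
    if not isinstance(result.get("warnings", []), list):
        problems.append("warnings must be an array")
    try:
        # json.dumps escapes special characters; loads confirms the output parses.
        json.loads(json.dumps(result))
    except (TypeError, ValueError) as exc:
        problems.append(f"Not JSON-serializable: {exc}")
    return problems
```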
Limitations
- Tesseract works best with printed text; handwriting recognition is limited
- Accuracy decreases with decorative fonts, artistic text, or extreme stylization
- Mathematical equations and special notation may not extract accurately
- Redacted or watermarked text cannot be recovered
- Severe image degradation (blur, noise, low resolution) reduces accuracy
- Complex multi-column layouts may require custom PSM configuration
Version History
- 1.0.0 (2026-01-13): Initial release with Tesseract/pytesseract OCR