Claude-skill-registry document-ocr-processing
Process scanned documents and images containing Chuukese text using OCR with specialized post-processing for accent characters and traditional formatting. Use when working with scanned books, documents, or images that contain Chuukese text that needs to be digitized.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-ocr-processing" ~/.claude/skills/majiayu000-claude-skill-registry-document-ocr-processing && rm -rf "$T"
manifest:
skills/data/document-ocr-processing/SKILL.mdsource content
Document OCR Processing
Overview
Specialized OCR processing for documents containing Chuukese text, with enhanced accuracy for accented characters, traditional formatting patterns, and multilingual content. Designed to handle the unique challenges of digitizing historical and contemporary Chuukese documents.
Capabilities
- Chuukese-Aware OCR: Enhanced recognition of accented characters (á, é, í, ó, ú, ā, ē, ī, ō, ū)
- Traditional Format Recognition: Handle traditional document layouts and formatting
- Multilingual Processing: Process documents with both Chuukese and English text
- Quality Enhancement: Post-processing to improve OCR accuracy
- Batch Processing: Efficiently process multiple documents
- Format Preservation: Maintain original document structure and layout
Core Components
1. OCR Engine Setup
import pytesseract from PIL import Image import cv2 import numpy as np class ChuukeseOCRProcessor: def __init__(self): # Configure Tesseract for multi-language support self.tesseract_config = { 'chuukese_optimized': '--oem 3 --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzáéíóúāēīōū0123456789.,!?;:()-"\' ', 'multilingual': '--oem 3 --psm 6', 'preserve_structure': '--oem 3 --psm 1' } # Chuukese character mappings for OCR corrections self.ocr_corrections = { # Common OCR mistakes for accented characters 'a´': 'á', 'a`': 'à', 'a¯': 'ā', 'e´': 'é', 'e`': 'è', 'e¯': 'ē', 'i´': 'í', 'i`': 'ì', 'i¯': 'ī', 'o´': 'ó', 'o`': 'ò', 'o¯': 'ō', 'u´': 'ú', 'u`': 'ù', 'u¯': 'ū', # Common character confusions '0': 'o', '1': 'l', '5': 's', 'rn': 'm', 'cl': 'd', 'ck': 'ch' } def preprocess_image(self, image_path): """Preprocess image for better OCR accuracy""" # Load image image = cv2.imread(image_path) # Convert to grayscale gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # Noise removal denoised = cv2.medianBlur(gray, 3) # Contrast enhancement clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8)) enhanced = clahe.apply(denoised) # Binarization _, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU) return binary
2. Post-Processing for Chuukese Text
class ChuukeseOCRPostProcessor: def __init__(self, dictionary_path=None): self.dictionary = {} if dictionary_path: self.load_chuukese_dictionary(dictionary_path) # Common OCR error patterns for Chuukese self.error_patterns = { # Accent corrections r'a[\'\`\´]': 'á', r'e[\'\`\´]': 'é', r'i[\'\`\´]': 'í', r'o[\'\`\´]': 'ó', r'u[\'\`\´]': 'ú', # Common character substitutions r'\b0(?=[aeiou])': 'o', # 0 at start of word -> o r'(?<=[aeiou])0\b': 'o', # 0 at end after vowel -> o r'\brn(?=[aeiou])': 'm', # rn -> m } def correct_ocr_errors(self, text): """Apply OCR error corrections specific to Chuukese""" corrected = text # Apply pattern-based corrections for pattern, replacement in self.error_patterns.items(): corrected = re.sub(pattern, replacement, corrected) return corrected
Usage Examples
Process Single Document
# Initialize processor processor = BatchOCRProcessor("output/ocr_results") # Process single document result = processor.process_document("scanned_chuukese_dictionary.jpg") # Access extracted text extracted_text = result['extracted_text'] dictionary_entries = result['document_structure']['dictionary_entries']
Batch Process Directory
# Process all images in a directory batch_results = processor.process_batch( "scanned_documents/", file_patterns=['*.jpg', '*.png'] ) print(f"Processed {batch_results['successfully_processed']} documents")
Best Practices
Image Preprocessing
- Quality assessment: Check image quality before processing
- Resolution optimization: Ensure minimum 300 DPI for OCR
- Noise reduction: Apply appropriate filtering for cleaner text
- Orientation correction: Detect and correct page rotation
OCR Accuracy
- Language-specific tuning: Optimize for Chuukese character set
- Confidence thresholds: Filter low-confidence results
- Multiple engine comparison: Use different OCR engines for comparison
- Human validation: Sample-based quality checking
Dependencies
: OCR engine interfacepytesseract
: Image preprocessingopencv-python
: Image handling and manipulationPillow
: Numerical operations for image processingnumpy