Skills image-to-text
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/image-to-text" ~/.claude/skills/terminalskills-skills-image-to-text && rm -rf "$T"
manifest:
skills/image-to-text/SKILL.mdsource content
Image to Text
Overview
Extract all readable text from an image using OCR (Tesseract). Returns the full text content along with word-level bounding boxes and confidence scores.
- Reading text content from a screenshot or design mockup
- Extracting UI copy (labels, buttons, headings) so you don't have to retype it
- Getting text positions and bounding boxes from a design image
Instructions
- The image is passed to Tesseract.js for optical character recognition
- Tesseract segments the image into lines and words
- Returns the full text plus word-level details (position, confidence)
Run the extraction script:
bash <skill-path>/scripts/image-to-text.sh <image-path> [language]
Arguments:
— Path to the image file (required)image-path
— OCR language code (optional, defaults tolanguage
). Common:eng
,eng
,fra
,deu
,spa
,chi_simjpn
The script outputs JSON with extracted text and metadata:
{ "text": "Request work\nSuggestions\nPlumbing\nHVAC\nCleaning\nElectrical", "confidence": 87.4, "words": [ { "text": "Request", "confidence": 94.2, "bbox": { "x0": 142, "y0": 180, "x1": 268, "y1": 204 } } ], "lines": [ { "text": "Request work", "confidence": 95.1, "bbox": { "x0": 142, "y0": 180, "x1": 332, "y1": 204 } } ] }
After extracting text, present the content grouped by lines and use the extracted text directly when implementing UI copy from a design.
Examples
Example 1: Extract text from a mobile app screenshot
bash <skill-path>/scripts/image-to-text.sh ./screenshot.png
Output:
Extracted text (87.4% confidence): Request work Suggestions Plumbing HVAC Cleaning Electrical Found 6 lines, 6 words.
Example 2: Extract French text from a scanned invoice
bash <skill-path>/scripts/image-to-text.sh ./invoice-scan.png fra
Tesseract uses the French language model to correctly recognize accented characters and French-specific formatting. The extracted text can then be parsed for invoice fields like total, date, and line items.
Guidelines
- Tesseract works best with clean, high-contrast text. Screenshots of rendered UI work well. Photos of text at angles or with noise may produce poor results.
- Pass the correct language code as the second argument when processing non-English text. Tesseract needs the right language model to recognize characters.
- First run is slow because Tesseract downloads language data (~4MB for English). Subsequent runs are faster.
- For structured documents (tables, forms), post-process the extracted text to parse it into JSON or CSV format.