Learn-skills.dev paddleocr-vl-1.5

install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/aidenwu0209/paddleocr-skills/paddleocr-vl-1.5" ~/.claude/skills/neversight-learn-skills-dev-paddleocr-vl-1-5 && rm -rf "$T"
manifest: data/skills-md/aidenwu0209/paddleocr-skills/paddleocr-vl-1.5/SKILL.md
source content

PaddleOCR-VL 1.5 Document Parsing Skill

When to Use This Skill

Use PaddleOCR-VL 1.5 for:

  • Documents with tables (invoices, financial reports, spreadsheets)
  • Documents with mathematical formulas (academic papers, scientific documents)
  • Documents with charts and diagrams
  • Multi-column layouts (newspapers, magazines, brochures)
  • Complex document structures requiring layout analysis
  • Any document requiring structured understanding

Use PP-OCRv5 instead for:

  • Simple text-only extraction
  • Quick OCR tasks where speed is critical
  • Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

  1. ONLY use PaddleOCR-VL 1.5 API - Execute the script
    python scripts/paddleocr-vl-1.5/vl_caller.py
  2. NEVER use Claude's built-in vision - Do NOT parse documents yourself
  3. NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
  4. IF API fails - Display the error message and STOP immediately
  5. NO fallback methods - Do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):

  • Show the error message to the user
  • Do NOT offer to help using your vision capabilities
  • Do NOT ask "Would you like me to try parsing it?"
  • Simply stop and wait for user to fix the configuration

Basic Workflow

  1. Execute document parsing:

    python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "URL provided by user"
    

    Or for local files:

    python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "file path"
    

    Save result to file (recommended):

    python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "URL" --output result.json --pretty
    
    • The script will display:
      Result saved to: /absolute/path/to/result.json
    • This message appears on stderr, the JSON is saved to the file
    • Tell the user the file path shown in the message
  2. The script returns COMPLETE JSON with all document content:

    • Headers, footers, page numbers
    • Main text content
    • Tables with structure
    • Formulas (with LaTeX)
    • Figures and charts
    • Footnotes and references
    • Layout and reading order
  3. Extract what the user needs from the complete data based on their request.

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

  • The script returns ALL document content in a structured format
  • Display the full content requested by the user, do NOT truncate or summarize
  • If user asks for "all text", show the entire
    result.full_text
  • If user asks for "tables", show ALL tables in the document
  • If user asks for "main content", filter out headers/footers but show ALL body text

What this means:

  • DO: Display complete text, all tables, all formulas as requested
  • DO: Present content in reading order using
    reading_order
    array
  • DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
  • DON'T: Summarize or provide excerpts when user asks for full content
  • DON'T: Say "Here's a preview" when user expects complete output

Example - Correct:

User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:

[Display entire result.full_text or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)

Example - Incorrect ❌:

User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The script returns a complete JSON structure:

{
  "ok": true,
  "result": {
    "full_text": "Complete text with all content including headers, footers, etc.",
    "layout": {
      "regions": [
        {
          "id": 0,
          "type": "header",
          "content": "Chapter 3: Methods",
          "bbox": [100, 50, 500, 100]
        },
        {
          "id": 1,
          "type": "text",
          "content": "Main body text content...",
          "bbox": [100, 150, 500, 300]
        },
        {
          "id": 2,
          "type": "table",
          "content": {
            "rows": 3,
            "cols": 2,
            "cells": [["Header1", "Header2"], ["Data1", "Data2"]]
          },
          "bbox": [100, 350, 500, 550]
        },
        {
          "id": 3,
          "type": "formula",
          "content": "E = mc^2",
          "latex": "$E = mc^2$",
          "bbox": [200, 600, 400, 630]
        },
        {
          "id": 4,
          "type": "footnote",
          "content": "[1] Reference citation",
          "bbox": [100, 650, 500, 680]
        },
        {
          "id": 5,
          "type": "footer",
          "content": "University Name 2024",
          "bbox": [100, 750, 500, 800]
        },
        {
          "id": 6,
          "type": "page_number",
          "content": "25",
          "bbox": [250, 770, 280, 790]
        }
      ],
      "reading_order": [0, 1, 2, 3, 4, 5, 6]
    }
  },
  "metadata": {
    "processing_time_ms": 3500,
    "total_pages": 1,
    "model_version": "paddleocr-vl-1.5"
  }
}

Region Types

The

type
field indicates the element category:

TypeDescriptionTypically Include?
header
Page headers (chapter/section titles)Exclude (usually repetitive)
text
Main body textInclude
table
Tables with structured dataInclude
formula
Mathematical formulasInclude
figure
Images, charts, diagramsInclude
footnote
Footnotes and referencesInclude (often important)
footer
Page footers (author/institution)Exclude (usually repetitive)
page_number
Page numbersExclude (not content)
margin_note
Margin annotationsContext-dependent

Content Extraction Guidelines

Based on user intent, filter the regions:

User SaysWhat to ExtractHow
"Extract main content"text, table, formula, figure, footnoteSkip header, footer, page_number
"Get all tables"table onlyFilter by type="table"
"Extract formulas"formula onlyFilter by type="formula"
"Complete document"EverythingUse all regions or full_text
"Without headers/footers"Core contentSkip header, footer types
"Include page numbers"Core + page_numberKeep page_number type

Usage Examples

Example 1: Extract Main Content (default behavior)

python scripts/paddleocr-vl-1.5/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty

Then filter JSON to extract core content:

  • Include: text, table, formula, figure, footnote
  • Exclude: header, footer, page_number

Example 2: Extract Tables Only

python scripts/paddleocr-vl-1.5/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty

Then filter JSON:

  • Only keep regions where type="table"
  • Present table content in markdown format

Example 3: Complete Document with Everything

python scripts/paddleocr-vl-1.5/vl_caller.py \
  --file-url "URL" \
  --pretty

Then use

result.full_text
or present all regions in reading_order.

First-Time Configuration

When API is not configured:

The error will show:

Configuration error: API not configured. Get your API at: https://aistudio.baidu.com/paddleocr/task

Auto-configuration workflow:

  1. Show the exact error message to user (including the URL)

  2. Tell user to provide credentials:

    Please visit the URL above to get your VL_API_URL and VL_TOKEN.
    Once you have them, send them to me and I'll configure it automatically.
    
  3. When user provides credentials (accept any format):

    • VL_API_URL=https://xxx.com/v1, VL_TOKEN=abc123...
    • Here's my API: https://xxx and token: abc123
    • Copy-pasted code format
    • Any other reasonable format
  4. Parse credentials from user's message:

    • Extract VL_API_URL value (look for URLs)
    • Extract VL_TOKEN value (long alphanumeric string, usually 40+ chars)
  5. Configure automatically:

    python scripts/paddleocr-vl-1.5/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
    
  6. If configuration succeeds:

    • Inform user: "Configuration complete! Parsing document now..."
    • Retry the original parsing task
  7. If configuration fails:

    • Show the error
    • Ask user to verify the credentials

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.

Handling Large Files (>20MB)

Problem: Local files larger than 20MB are rejected by default.

Solutions (Choose based on your situation):

Solution 1: Use URL Upload (Recommended) ⭐

Upload your file to a web server and use

--file-url
:

# Instead of local file
python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "large_file.pdf"  # ❌ May fail

# Use URL instead
python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "https://your-server.com/large_file.pdf"  # ✅ No size limit

Benefits:

  • No local file size limit
  • Faster upload (direct from API server)
  • Suitable for very large files (>100MB)

Solution 2: Increase Size Limit

Adjust the limit in

.env
file:

# .env
VL_MAX_FILE_SIZE_MB=50  # Increase from 20MB to 50MB

Then process as usual:

python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "large_file.pdf"

Note: Your API provider may still have upload limits. Check with your VL API service.

Solution 3: Compress/Optimize File

Use the built-in optimizer to reduce file size:

# Install optimization dependencies first
pip install -r scripts/paddleocr-vl-1.5/requirements-optimize.txt

# Optimize image (reduce quality)
python scripts/paddleocr-vl-1.5/optimize_file.py input.png output.png --quality 70

# Optimize PDF (compress images within)
python scripts/paddleocr-vl-1.5/optimize_file.py input.pdf output.pdf --target-size 15

# Then process optimized file
python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "output.pdf" --pretty

The optimizer will:

  • Compress images (adjust JPEG quality)
  • Resize images if needed (maintain aspect ratio)
  • Compress PDF images (reduce DPI to 150)
  • Show before/after size comparison

Solution 4: Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

# Using PyMuPDF (requires: pip install PyMuPDF)
python -c "
import fitz
doc = fitz.open('large.pdf')
writer = fitz.open()
writer.insert_pdf(doc, from_page=0, to_page=4)  # Pages 1-5
writer.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/paddleocr-vl-1.5/vl_caller.py --file-path "pages_1_5.pdf"

Solution 5: Upload to Cloud Storage

For extremely large files (>100MB), use cloud storage with public URLs:

# Example: Upload to AWS S3, Google Drive, or similar
# Get public URL: https://storage.example.com/my-document.pdf

# Process via URL
python scripts/paddleocr-vl-1.5/vl_caller.py --file-url "https://storage.example.com/my-document.pdf"

Comparison Table:

SolutionMax SizeSpeedComplexityBest For
URL UploadUnlimitedFastLowAny large file
Increase LimitConfigurableMediumVery LowSlightly over limit
Compress~70% reductionSlowMediumImages/PDFs with images
Extract PagesAs neededFastMediumMulti-page PDFs
Cloud StorageUnlimitedFastHighVery large files (>100MB)

Error Handling

Authentication failed (401/403):

error: Authentication failed

→ Token is invalid, reconfigure with correct credentials

API quota exceeded (429):

error: API quota exceeded

→ Daily API quota exhausted, inform user to wait or upgrade

Unsupported format:

error: Unsupported file format

→ File format not supported, convert to PDF/PNG/JPG

Pseudo-Code: Content Extraction

Extract main content (most common):

def extract_main_content(json_response):
    regions = json_response['result']['layout']['regions']

    # Keep only core content types
    core_types = ['text', 'table', 'formula', 'figure', 'footnote']
    main_regions = [r for r in regions if r['type'] in core_types]

    # Sort by reading order
    reading_order = json_response['result']['layout']['reading_order']
    sorted_regions = sort_by_order(main_regions, reading_order)

    # Present to user
    for region in sorted_regions:
        if region['type'] == 'text':
            print(region['content'])
        elif region['type'] == 'table':
            print_as_markdown_table(region['content'])
        elif region['type'] == 'formula':
            print(f"Formula: {region['latex']}")

Extract tables only:

def extract_tables(json_response):
    regions = json_response['result']['layout']['regions']
    tables = [r for r in regions if r['type'] == 'table']

    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        print_as_markdown_table(table['content'])

Extract everything:

def extract_complete(json_response):
    # Simply use the full_text which includes everything
    print(json_response['result']['full_text'])

    # Or present all regions in order
    regions = json_response['result']['layout']['regions']
    reading_order = json_response['result']['layout']['reading_order']

    for idx in reading_order:
        region = regions[idx]
        print(f"[{region['type']}] {region['content']}")

Important Notes

  • The script NEVER filters content - It always returns complete data
  • Claude decides what to present - Based on user's specific request
  • All data is always available - Can be re-interpreted for different needs
  • No information is lost - Complete document structure preserved

Reference Documentation

For in-depth understanding of the PaddleOCR-VL 1.5 system, refer to:

  • references/paddleocr-vl-1.5/vl_model_spec.md
    - VL model specifications
  • references/paddleocr-vl-1.5/layout_schema.md
    - Layout detection schema
  • references/paddleocr-vl-1.5/output_format.md
    - Complete output format specification

Load these reference documents into context when:

  • Debugging complex parsing issues
  • Understanding layout detection algorithm
  • Working with special document types
  • Customizing content extraction logic

Testing the Skill

To verify the skill is working properly:

python scripts/paddleocr-vl-1.5/smoke_test.py

This tests configuration and optionally API connectivity.