Claude-skill-registry document-rag-pipeline
Build complete document knowledge bases with PDF text extraction, OCR for scanned documents, vector embeddings, and semantic search. Use this for creating searchable document libraries from folders of PDFs, technical standards, or any document collection.
Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-rag-pipeline" ~/.claude/skills/majiayu000-claude-skill-registry-document-rag-pipeline && rm -rf "$T"
```

Manifest: skills/data/document-rag-pipeline/SKILL.md
Document RAG Pipeline Skill
Overview
This skill creates a complete Retrieval-Augmented Generation (RAG) system from a folder of documents. It handles:
- Regular PDF text extraction
- OCR for scanned/image-based PDFs
- DRM-protected file detection
- Text chunking with overlap
- Vector embedding generation
- SQLite storage with full-text search
- Semantic similarity search
Quick Start
```bash
# Install dependencies
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm

# Build knowledge base
python build_knowledge_base.py /path/to/documents --embed

# Search documents
python build_knowledge_base.py /path/to/documents --search "your query"
```
When to Use
- Building searchable knowledge bases from document folders
- Processing technical standards libraries (API, ISO, ASME, etc.)
- Creating semantic search over engineering documents
- OCR processing of scanned historical documents
- Any collection of PDFs needing intelligent search
Architecture
```
Document Folder
       │
       ▼
┌─────────────────────┐
│ 1. Build Inventory  │  SQLite catalog of all files
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ 2. Extract Text     │  PyMuPDF for regular PDFs
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ 3. OCR Scanned PDFs │  Tesseract + pytesseract
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ 4. Chunk Text       │  1000 chars, 200 overlap
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ 5. Generate Embeds  │  sentence-transformers
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│ 6. Semantic Search  │  Cosine similarity
└─────────────────────┘
```
Prerequisites
System Dependencies
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils

# macOS
brew install tesseract poppler

# Verify Tesseract
tesseract --version  # Should show 5.x
```
Python Dependencies
```bash
pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
```
Or with UV:
```bash
uv pip install PyMuPDF pytesseract Pillow sentence-transformers numpy tqdm
```
Implementation
Step 1: Database Schema
```python
import sqlite3
from pathlib import Path
from datetime import datetime

def create_database(db_path):
    """Create SQLite database with full schema."""
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Documents table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT NOT NULL,
            filepath TEXT UNIQUE NOT NULL,
            file_size INTEGER,
            file_type TEXT,
            page_count INTEGER,
            extraction_method TEXT,  -- 'text', 'ocr', 'failed', 'drm_protected'
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    # Text chunks table
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS text_chunks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id INTEGER NOT NULL,
            chunk_num INTEGER NOT NULL,
            chunk_text TEXT NOT NULL,
            char_count INTEGER,
            embedding BLOB,
            embedding_model TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (document_id) REFERENCES documents(id),
            UNIQUE(document_id, chunk_num)
        )
    ''')

    # Create indexes
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_chunks_doc_id ON text_chunks(document_id)')
    cursor.execute('CREATE INDEX IF NOT EXISTS idx_docs_filepath ON documents(filepath)')

    conn.commit()
    return conn
```
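The overview lists "SQLite storage with full-text search", but the schema above only stores raw chunk text. A minimal, optional sketch of wiring that up with SQLite's FTS5 extension (assumes your SQLite build includes FTS5; the `chunks_fts` table name and helper are illustrative, not part of the skill):

```python
import sqlite3

def add_fts_index(db_path):
    """One-shot sketch: mirror chunk text into an FTS5 table for keyword search."""
    conn = sqlite3.connect(db_path, timeout=30)
    conn.execute("DROP TABLE IF EXISTS chunks_fts")  # rebuild from scratch each run
    conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_text)")
    # Reuse text_chunks.id as the FTS rowid so matches map back to chunks
    conn.execute(
        "INSERT INTO chunks_fts (rowid, chunk_text) "
        "SELECT id, chunk_text FROM text_chunks"
    )
    conn.commit()
    conn.close()
```

A query such as `SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'wellhead design'` then returns matching `text_chunks.id` values.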
Step 2: PDF Text Extraction
```python
import fitz  # PyMuPDF

def extract_pdf_text(pdf_path):
    """Extract text from PDF using PyMuPDF."""
    try:
        doc = fitz.open(pdf_path)
        text_parts = []
        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()
            if text.strip():
                text_parts.append(text)
        doc.close()

        full_text = "\n".join(text_parts)

        # Check if meaningful text extracted
        if len(full_text.strip()) < 100:
            return None, "no_text"

        return full_text, "text"
    except Exception as e:
        if "encrypted" in str(e).lower() or "drm" in str(e).lower():
            return None, "drm_protected"
        return None, f"error: {str(e)}"
```
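A quick usage sketch showing how the `(text, method)` status tuple routes a file; the file name is illustrative:

```python
# Route one file based on the extraction status returned above.
text, method = extract_pdf_text("api_spec_17d.pdf")
if method == "no_text":
    print("Probably scanned; fall back to OCR (Step 3)")
elif method == "drm_protected":
    print("Skip and record as DRM-protected")
elif text:
    print(f"Extracted {len(text):,} characters")
else:
    print(f"Extraction failed: {method}")
```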
Step 3: OCR for Scanned PDFs
```python
import fitz
import pytesseract
from PIL import Image
import io

def ocr_pdf(pdf_path, dpi=200):
    """OCR scanned PDF using Tesseract."""
    try:
        doc = fitz.open(pdf_path)
        text_parts = []
        for page_num in range(len(doc)):
            page = doc[page_num]

            # Convert page to image
            mat = fitz.Matrix(dpi/72, dpi/72)
            pix = page.get_pixmap(matrix=mat)

            # Convert to PIL Image
            img_data = pix.tobytes("png")
            img = Image.open(io.BytesIO(img_data))

            # OCR with Tesseract
            text = pytesseract.image_to_string(img, lang='eng')
            if text.strip():
                text_parts.append(text)
        doc.close()

        full_text = "\n".join(text_parts)

        if len(full_text.strip()) < 100:
            return None, "ocr_failed"

        return full_text, "ocr"
    except Exception as e:
        return None, f"ocr_error: {str(e)}"
```
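If a pass at the default 200 DPI comes back empty, rendering the pages at a higher resolution often helps, at the cost of speed. An illustrative retry (the file name and retry policy are assumptions, not part of the skill):

```python
# Retry failed OCR at a higher rendering resolution.
text, method = ocr_pdf("scanned_standard_1978.pdf", dpi=200)
if method == "ocr_failed":
    text, method = ocr_pdf("scanned_standard_1978.pdf", dpi=300)  # sharper, slower
```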
Step 4: Text Chunking
```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    text_len = len(text)

    while start < text_len:
        end = start + chunk_size
        chunk = text[start:end]

        # Try to break at sentence boundary
        if end < text_len:
            last_period = chunk.rfind('.')
            last_newline = chunk.rfind('\n')
            break_point = max(last_period, last_newline)
            if break_point > chunk_size * 0.7:
                chunk = text[start:start + break_point + 1]
                end = start + break_point + 1

        chunks.append(chunk.strip())
        start = end - overlap
        if start >= text_len:
            break

    return chunks
```
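A quick sanity check of the boundary logic on dummy text (exact chunk counts depend on where the sentence breaks land):

```python
# Repeated sentences guarantee rfind('.') finds a late break point.
sample = "This is a sentence about subsea equipment. " * 100  # ~4,300 chars
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks[:3]])
assert all(len(c) <= 1000 for c in chunks)  # never longer than chunk_size
```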
Step 5: Embedding Generation
```python
import sqlite3  # needed for the connection below
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle
import os

# Force CPU mode (for CUDA compatibility issues)
os.environ["CUDA_VISIBLE_DEVICES"] = ""

def create_embeddings(db_path, model_name='all-MiniLM-L6-v2', batch_size=100):
    """Generate embeddings for all chunks without embeddings."""
    model = SentenceTransformer(model_name)
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get chunks needing embeddings
    cursor.execute('''
        SELECT id, chunk_text FROM text_chunks
        WHERE embedding IS NULL
    ''')
    chunks = cursor.fetchall()
    print(f"Generating embeddings for {len(chunks)} chunks...")

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        ids = [c[0] for c in batch]
        texts = [c[1] for c in batch]

        # Generate embeddings
        embeddings = model.encode(texts, normalize_embeddings=True)

        # Store as pickled numpy arrays
        for chunk_id, emb in zip(ids, embeddings):
            emb_blob = pickle.dumps(emb.astype(np.float32))
            cursor.execute('''
                UPDATE text_chunks
                SET embedding = ?, embedding_model = ?
                WHERE id = ?
            ''', (emb_blob, model_name, chunk_id))
        conn.commit()
        print(f"  Embedded {min(i+batch_size, len(chunks))}/{len(chunks)}")

    conn.close()
    print("Embedding complete!")
```
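Embeddings land in the `embedding` BLOB column as pickled float32 NumPy arrays, so reading one back is a `pickle.loads` away (sketch assumes a built `_inventory.db` in the working directory):

```python
import sqlite3
import pickle

conn = sqlite3.connect("_inventory.db", timeout=30)
row = conn.execute(
    "SELECT embedding FROM text_chunks WHERE embedding IS NOT NULL LIMIT 1"
).fetchone()
emb = pickle.loads(row[0])
print(emb.shape, emb.dtype)  # (384,) float32 for all-MiniLM-L6-v2
conn.close()
```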
Step 6: Semantic Search
```python
def semantic_search(db_path, query, top_k=10, sample_size=50000):
    """Search for similar chunks using cosine similarity."""
    # Force CPU mode
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_emb = model.encode(query, normalize_embeddings=True)

    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get chunks with embeddings (sample if large)
    cursor.execute('SELECT COUNT(*) FROM text_chunks WHERE embedding IS NOT NULL')
    total = cursor.fetchone()[0]

    if total > sample_size:
        # Random sample for large databases
        cursor.execute(f'''
            SELECT tc.id, tc.chunk_text, tc.embedding, d.filename
            FROM text_chunks tc
            JOIN documents d ON tc.document_id = d.id
            WHERE tc.embedding IS NOT NULL
            ORDER BY RANDOM() LIMIT {sample_size}
        ''')
    else:
        cursor.execute('''
            SELECT tc.id, tc.chunk_text, tc.embedding, d.filename
            FROM text_chunks tc
            JOIN documents d ON tc.document_id = d.id
            WHERE tc.embedding IS NOT NULL
        ''')

    results = []
    for chunk_id, text, emb_blob, filename in cursor.fetchall():
        emb = pickle.loads(emb_blob)
        # Cosine similarity (embeddings are normalized)
        similarity = np.dot(query_emb, emb)
        results.append({
            'id': chunk_id,
            'text': text[:500],  # Truncate for display
            'filename': filename,
            'score': float(similarity)
        })
    conn.close()

    # Sort by similarity
    results.sort(key=lambda x: x['score'], reverse=True)
    return results[:top_k]
```
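Because the query and chunk vectors are both normalized, the dot product in the loop above is exactly cosine similarity. Calling the function directly (DB path and query are illustrative):

```python
results = semantic_search("_inventory.db", "subsea wellhead fatigue design", top_k=5)
for r in results:
    print(f"[{r['score']:.3f}] {r['filename']}")
    print(f"  {r['text'][:120]}...")
```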
Complete Pipeline Script
```python
#!/usr/bin/env python3
"""
Document RAG Pipeline - Build searchable knowledge base from PDF folder.

Usage:
    python build_knowledge_base.py /path/to/documents --db inventory.db
    python build_knowledge_base.py /path/to/documents --search "query text"
"""
import argparse
import os
import sqlite3
from pathlib import Path
from tqdm import tqdm

# create_database, extract_pdf_text, ocr_pdf, chunk_text,
# create_embeddings, and semantic_search are defined in Steps 1-6 above.

def build_inventory(folder_path, db_path):
    """Build document inventory from folder."""
    conn = create_database(db_path)
    cursor = conn.cursor()

    pdf_files = list(Path(folder_path).rglob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files")

    for pdf_path in tqdm(pdf_files, desc="Building inventory"):
        # Check if already processed
        cursor.execute('SELECT id FROM documents WHERE filepath = ?', (str(pdf_path),))
        if cursor.fetchone():
            continue

        file_size = pdf_path.stat().st_size
        cursor.execute('''
            INSERT INTO documents (filename, filepath, file_size, file_type)
            VALUES (?, ?, ?, 'pdf')
        ''', (pdf_path.name, str(pdf_path), file_size))

    conn.commit()
    conn.close()

def process_documents(db_path, use_ocr=True):
    """Extract text from all unprocessed documents."""
    conn = sqlite3.connect(db_path, timeout=30)
    cursor = conn.cursor()

    # Get unprocessed documents
    cursor.execute('''
        SELECT id, filepath FROM documents
        WHERE extraction_method IS NULL
    ''')
    docs = cursor.fetchall()

    stats = {'text': 0, 'ocr': 0, 'failed': 0, 'drm': 0}

    for doc_id, filepath in tqdm(docs, desc="Extracting text"):
        # Try regular extraction first
        text, method = extract_pdf_text(filepath)

        # Try OCR if no text and OCR enabled
        if text is None and use_ocr and method == "no_text":
            text, method = ocr_pdf(filepath)

        if text:
            # Chunk and store
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                cursor.execute('''
                    INSERT OR IGNORE INTO text_chunks
                    (document_id, chunk_num, chunk_text, char_count)
                    VALUES (?, ?, ?, ?)
                ''', (doc_id, i, chunk, len(chunk)))
            stats['text' if method == 'text' else 'ocr'] += 1
        else:
            if 'drm' in method:
                stats['drm'] += 1
            else:
                stats['failed'] += 1

        # Update document status
        cursor.execute('''
            UPDATE documents SET extraction_method = ? WHERE id = ?
        ''', (method, doc_id))
        conn.commit()

    conn.close()
    return stats

def main():
    parser = argparse.ArgumentParser(description='Document RAG Pipeline')
    parser.add_argument('folder', help='Folder containing documents')
    parser.add_argument('--db', default='_inventory.db', help='Database path')
    parser.add_argument('--no-ocr', action='store_true', help='Skip OCR')
    parser.add_argument('--embed', action='store_true', help='Generate embeddings')
    parser.add_argument('--search', help='Search query')
    parser.add_argument('--top-k', type=int, default=10, help='Number of results')
    args = parser.parse_args()

    db_path = Path(args.folder) / args.db

    if args.search:
        # Search mode
        results = semantic_search(str(db_path), args.search, args.top_k)
        print(f"\nTop {len(results)} results for: '{args.search}'\n")
        for i, r in enumerate(results, 1):
            print(f"{i}. [{r['score']:.3f}] {r['filename']}")
            print(f"   {r['text'][:200]}...\n")
    else:
        # Build mode
        print("Step 1: Building inventory...")
        build_inventory(args.folder, str(db_path))

        print("\nStep 2: Extracting text...")
        stats = process_documents(str(db_path), use_ocr=not args.no_ocr)
        print(f"Results: {stats}")

        if args.embed:
            print("\nStep 3: Generating embeddings...")
            create_embeddings(str(db_path))

if __name__ == '__main__':
    main()
```
Usage Examples
Build Knowledge Base
```bash
# Full pipeline with OCR and embeddings
python build_knowledge_base.py /path/to/documents --embed

# Skip OCR (faster, text PDFs only)
python build_knowledge_base.py /path/to/documents --no-ocr --embed

# Inventory and text extraction only (no embeddings)
python build_knowledge_base.py /path/to/documents
```
Search Documents
```bash
# Semantic search
python build_knowledge_base.py /path/to/documents --search "subsea wellhead design"

# More results
python build_knowledge_base.py /path/to/documents --search "fatigue analysis" --top-k 20
```
Quick Search Script
```bash
#!/bin/bash
# search_docs.sh - Quick semantic search
DB_PATH="${1:-/path/to/_inventory.db}"
QUERY="$2"

CUDA_VISIBLE_DEVICES="" python3 -c "
import sqlite3, pickle, numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = model.encode('$QUERY', normalize_embeddings=True)

conn = sqlite3.connect('$DB_PATH')
cursor = conn.cursor()
cursor.execute('''
    SELECT tc.chunk_text, tc.embedding, d.filename
    FROM text_chunks tc
    JOIN documents d ON tc.document_id = d.id
    WHERE tc.embedding IS NOT NULL
    ORDER BY RANDOM() LIMIT 50000
''')

results = []
for text, emb_blob, filename in cursor.fetchall():
    emb = pickle.loads(emb_blob)
    sim = float(np.dot(query_emb, emb))
    results.append((sim, filename, text[:200]))

for score, fname, text in sorted(results, reverse=True)[:10]:
    print(f'[{score:.3f}] {fname}')
    print(f'  {text}...\n')
"
```
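An example invocation, assuming the knowledge base was built with `--embed` (paths and query are illustrative):

```bash
chmod +x search_docs.sh
./search_docs.sh /data/standards/_inventory.db "riser fatigue analysis"
```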
Execution Checklist
- Install system dependencies (Tesseract, Poppler)
- Install Python dependencies
- Verify document folder exists
- Run inventory to catalog documents
- Extract text (with or without OCR)
- Generate embeddings
- Test semantic search
- Monitor for DRM-protected files
Error Handling
Common Errors
Error: CUDA not available
- Cause: CUDA driver issues or incompatible GPU
- Solution: Force CPU mode with `CUDA_VISIBLE_DEVICES=""`

Error: Tesseract not found
- Cause: Tesseract OCR not installed
- Solution: Install with `apt-get install tesseract-ocr` (Linux) or `brew install tesseract` (macOS)

Error: DRM-protected files
- Cause: FileOpen or other DRM encryption
- Solution: Skip these files; list them by querying `extraction_method = 'drm_protected'`

Error: SQLite database locked
- Cause: Concurrent access without a timeout
- Solution: Pass `timeout=30` to `sqlite3.connect()`

Error: Out of memory
- Cause: Large batch sizes or too many embeddings held in memory at once
- Solution: Reduce `batch_size` when embedding (sketch below); use sampling for search
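A smaller batch lowers the peak memory used by `model.encode()` and by the pending row updates per commit. A minimal sketch, assuming the `create_embeddings` from Step 5 and a built `_inventory.db` (32 is an arbitrary starting point):

```python
# Trade embedding throughput for a lower peak memory footprint.
create_embeddings("_inventory.db", batch_size=32)
```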
Metrics
| Metric | Typical Value |
|---|---|
| Text extraction | ~50 pages/second |
| OCR processing | ~2-5 pages/minute |
| Embedding generation | ~100 chunks/second (CPU) |
| Search latency | <2 seconds (50K chunks) |
| Memory usage | ~2GB for embeddings |
Performance Metrics (Real-World)
From O&G Standards processing (957 documents):
| Metric | Value |
|---|---|
| Total documents | 957 |
| Text extraction | 811 PDFs |
| OCR processed | 96 PDFs |
| DRM protected | 50 PDFs |
| Total chunks | 1,043,616 |
| Embedding time | ~4 hours (CPU) |
| Search latency | <2 seconds |
Related Skills
- `pdf-text-extractor` - Just text extraction
- `semantic-search-setup` - Just embeddings/search
- `rag-system-builder` - Add LLM Q&A layer
- `knowledge-base-builder` - Simpler document catalog
Version History
- 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
- 1.0.0 (2024-10-15): Initial release with OCR support, chunking, vector embeddings, semantic search