Claude-skill-registry document-indexing
Extract structured metadata from documents using AI. Classify content types, extract topics and tools. Supports async batch processing.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-indexing" ~/.claude/skills/majiayu000-claude-skill-registry-document-indexing && rm -rf "$T"
manifest: skills/data/document-indexing/SKILL.md
Document Indexing
Overview
Extract structured metadata from fetched documents using an LLM:
- Content type: blog, tutorial, guide, reference, etc.
- Topics & Tools: Main subjects and technologies
- Structure: Code examples, procedures, narrative
Creates DocumentMetadata records for search and clustering.
Quick Start
    # Index single document
    kurt index 5494cc13

    # Batch index (async, 5-10x faster)
    kurt index --url-prefix https://example.com/

    # Re-index with custom concurrency
    kurt index --url-prefix https://example.com/ --force --max-concurrent 10
Prerequisites: Documents must be FETCHED (kurt content fetch).
Commands
    # Single
    kurt index <doc-id>
    kurt index <doc-id> --force

    # Batch (async parallel)
    kurt index --url-prefix <url>
    kurt index --url-contains <string>
    kurt index --max-concurrent 10  # Default: 5

    # Filters
    kurt index --status FETCHED --url-prefix <url>
Content Types
BLOG | TUTORIAL | GUIDE | REFERENCE | WHITEPAPER | CASE_STUDY | FAQ | CHANGELOG | MARKETING | OTHER
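A model may return a label in a slightly different shape than the list above (lowercase, spaces instead of underscores). The sketch below shows one way to normalize such labels against the ten content types. `ContentType` and `parse_content_type` are hypothetical names used for illustration only; they are not taken from the kurt codebase.

```python
from enum import Enum


class ContentType(str, Enum):
    """Hypothetical enum mirroring the content types listed in this document."""

    BLOG = "BLOG"
    TUTORIAL = "TUTORIAL"
    GUIDE = "GUIDE"
    REFERENCE = "REFERENCE"
    WHITEPAPER = "WHITEPAPER"
    CASE_STUDY = "CASE_STUDY"
    FAQ = "FAQ"
    CHANGELOG = "CHANGELOG"
    MARKETING = "MARKETING"
    OTHER = "OTHER"


def parse_content_type(raw: str) -> ContentType:
    """Normalize a raw label and fall back to OTHER when it is unknown."""
    label = raw.strip().upper().replace(" ", "_").replace("-", "_")
    try:
        return ContentType(label)
    except ValueError:
        # Assumption: unknown labels are bucketed as OTHER rather than rejected.
        return ContentType.OTHER
```

Falling back to `OTHER` keeps a batch run from failing on a single unexpected label; a stricter pipeline could raise instead.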
Extracted Metadata
    {
      "content_type": "TUTORIAL",
      "extracted_title": "Machine Learning Guide",
      "primary_topics": ["Machine Learning", "Python"],
      "tools_technologies": ["TensorFlow", "Pandas"],
      "has_code_examples": true,
      "has_step_by_step_procedures": true,
      "has_narrative_structure": false
    }
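To work with this payload in Python, it can help to load it into a typed record. The following is a minimal sketch, assuming the JSON shape shown above. `DocumentMetadataSketch` and `parse_metadata` are illustrative names, not kurt's actual `DocumentMetadata` model, and treating `content_type` and `extracted_title` as required is an assumption.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DocumentMetadataSketch:
    """Illustrative record with the fields from the example payload."""

    content_type: str
    extracted_title: str
    primary_topics: List[str] = field(default_factory=list)
    tools_technologies: List[str] = field(default_factory=list)
    has_code_examples: bool = False
    has_step_by_step_procedures: bool = False
    has_narrative_structure: bool = False


def parse_metadata(raw_json: str) -> DocumentMetadataSketch:
    """Parse a JSON payload, ignoring keys the sketch does not know about."""
    data = json.loads(raw_json)
    missing = {"content_type", "extracted_title"} - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    known = {f.name for f in fields(DocumentMetadataSketch)}
    return DocumentMetadataSketch(**{k: v for k, v in data.items() if k in known})


EXAMPLE = """
{
  "content_type": "TUTORIAL",
  "extracted_title": "Machine Learning Guide",
  "primary_topics": ["Machine Learning", "Python"],
  "tools_technologies": ["TensorFlow", "Pandas"],
  "has_code_examples": true,
  "has_step_by_step_procedures": true,
  "has_narrative_structure": false
}
"""

record = parse_metadata(EXAMPLE)
```

The three boolean flags default to `False`, so a payload that omits them still parses.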
Performance
- Sequential: ~3-5s per document
- Parallel (5 concurrent): ~1s per document avg
- Example: 92 docs in 30s (parallel) vs 5 mins (sequential)
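The parallel figures above come from running several extractions at once under a concurrency cap. The sketch below illustrates that general pattern with `asyncio.Semaphore`. It is not kurt's implementation: `index_one` is a hypothetical stand-in that sleeps instead of calling an LLM, and `index_batch` is an illustrative name.

```python
import asyncio


async def index_one(doc_id: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for one extraction call (simulated with a sleep)."""
    await asyncio.sleep(delay)
    return doc_id


async def index_batch(doc_ids, max_concurrent=5):
    """Run index_one over doc_ids with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(doc_id):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)  # track the observed concurrency
            try:
                return await index_one(doc_id)
            finally:
                in_flight -= 1

    # gather preserves input order in its results
    results = await asyncio.gather(*(guarded(d) for d in doc_ids))
    return list(results), peak


doc_ids = [f"doc-{i}" for i in range(20)]
results, peak = asyncio.run(index_batch(doc_ids, max_concurrent=5))
```

Raising the cap shortens wall-clock time until the provider's rate limit becomes the bottleneck, which is the trade-off behind the `--max-concurrent` flag.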
Python API
    import asyncio

    from kurt.indexing import (
        batch_extract_document_metadata,
        extract_document_metadata,
    )

    # Single
    result = extract_document_metadata("abc-123")

    # Batch
    results = asyncio.run(batch_extract_document_metadata(
        ["abc-123", "def-456"],
        max_concurrent=5
    ))
Troubleshooting
| Issue | Solution |
|---|---|
| "Document not FETCHED" | Run kurt content fetch first |
| "Content file not found" | Re-fetch document |
| Slow batch | Increase --max-concurrent |
| Rate limits | Reduce --max-concurrent |
Next Steps
- ingest-content-skill - Fetch documents first
- document-management-skill - Query and manage documents