Claude-skill-registry bookstrap-ingest
Load research corpus into the database by processing files, directories, or URLs through semantic chunking, embedding generation, entity extraction, and relationship building
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bookstrap-ingest" ~/.claude/skills/majiayu000-claude-skill-registry-bookstrap-ingest && rm -rf "$T"
/bookstrap-ingest - Load Research Corpus
Load initial research materials into the Bookstrap database for use during writing.
Purpose
Ingest source documents into the database to build the research corpus. This command processes multiple file types (PDF, markdown, HTML, plain text), generates embeddings for semantic search, extracts entities (characters, locations, events, concepts), builds relationships between entities, and constructs the knowledge timeline.
Input Arguments
Accept one or more of the following:
- File paths: Individual files to ingest (e.g., ./research/soe-training.pdf)
- Directories: Recursively process all files in a directory (e.g., ./research/)
- URLs: Web pages to fetch and ingest (e.g., https://example.com/article.html)
Multiple sources can be provided in a single invocation:
/bookstrap-ingest ./research/documents/ https://example.com/article.html ./notes.md
Supported File Types
| File Type | Extension | Processing Method |
|---|---|---|
| PDF | .pdf | Extract text |
| Markdown | .md | Read directly |
| HTML | .html | Parse and extract text content |
| Plain Text | | Read directly |
Processing Workflow
For each source provided:
1. Source Collection
- Parse arguments to identify file paths, directories, and URLs
- If directory: recursively find all supported files
- If URL: fetch content using WebFetch
- Track all sources for batch processing
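The collection step above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual code: `collect_sources`, `is_url`, and the `SUPPORTED_EXTENSIONS` set are hypothetical names, and the extension list is an assumption drawn from the file types this command mentions.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed extension set; adjust to whatever the ingestion script supports.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".html", ".txt"}


def is_url(arg: str) -> bool:
    """Treat http(s) arguments as URLs; everything else is a filesystem path."""
    parsed = urlparse(arg)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def collect_sources(args: list[str]) -> dict[str, list[str]]:
    """Split raw arguments into files, URLs, and skipped entries."""
    sources: dict[str, list[str]] = {"files": [], "urls": [], "skipped": []}
    for arg in args:
        if is_url(arg):
            sources["urls"].append(arg)
            continue
        path = Path(arg)
        if path.is_dir():
            # Recurse into the directory, keeping only supported file types.
            for child in sorted(path.rglob("*")):
                if child.is_file() and child.suffix.lower() in SUPPORTED_EXTENSIONS:
                    sources["files"].append(str(child))
        elif path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            sources["files"].append(str(path))
        else:
            # Missing or unsupported: record it so it can be reported later.
            sources["skipped"].append(arg)
    return sources
```

Keeping skipped entries in the result, rather than raising, matches the recovery behaviour described under Error Handling: report the problem and continue with the remaining sources.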
2. File Ingestion
For each file, execute the ingestion pipeline:
python scripts/ingest-file.py <file-path>
The ingest-file.py script handles:
- File reading: Load content from supported formats
- Semantic chunking: Use LLM to identify natural breakpoints (paragraph boundaries, topic shifts, scene changes) rather than fixed token windows
- Embedding generation: Call configured embedding provider (Gemini, OpenAI, Ollama, LM Studio) via scripts/generate-embedding.py
- Entity extraction: Use LLM to extract characters, locations, events, concepts with context via scripts/extract-entities.py
- Database storage: Store source, chunks, embeddings, and entities in SurrealDB
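The pipeline is described as using an LLM to find breakpoints. As a rough stand-in that needs no model, the sketch below packs whole paragraphs into chunks under a token budget, so a chunk never splits mid-paragraph. It is not the project's chunker: `chunk_by_paragraph` is a hypothetical name, tokens are approximated by whitespace-separated words, and overlap between chunks is left out for brevity.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 1024) -> list[str]:
    """Greedily group paragraphs into chunks of at most max_tokens words.

    A single paragraph longer than max_tokens is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate: word count
        if current and current_len + n > max_tokens:
            # Budget exceeded: close the current chunk at a paragraph boundary.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

An LLM-based chunker would replace the paragraph split with model-suggested breakpoints, but the budget check would likely look similar.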
3. Relationship Building
After entity extraction, automatically create graph relationships:
- Link sources to extracted concepts (source->supports->concept)
- Link events in chronological order (event->precedes->event, event->follows->event)
- Link entities mentioned together (character->knows->character, concept->related_to->concept)
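To make the chronological edges concrete, the sketch below turns an ordered list of event record IDs into SurrealQL RELATE statements, one precedes edge and one follows edge per adjacent pair. The function name `chronology_edges` and the record IDs are hypothetical, and the IDs are assumed to be trusted values, since they are interpolated directly into the statement text.

```python
def chronology_edges(event_ids: list[str]) -> list[str]:
    """Return RELATE statements linking each event to its successor."""
    statements: list[str] = []
    # Pair every event with the one that comes right after it.
    for earlier, later in zip(event_ids, event_ids[1:]):
        statements.append(f"RELATE {earlier}->precedes->{later};")
        statements.append(f"RELATE {later}->follows->{earlier};")
    return statements


if __name__ == "__main__":
    # Hypothetical record IDs in table:id form.
    for stmt in chronology_edges(["event:recruitment", "event:training", "event:drop"]):
        print(stmt)
```

Emitting the two edge types in pairs is consistent with the sample report, where the precedes and follows counts are equal.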
4. Timeline Construction
Order events by:
- Extracted dates (if available)
- Sequence numbers from document structure
- Contextual ordering from content analysis
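One simple way to read the three criteria is as a priority order: dated events first, then events with a sequence number, then everything else by contextual rank. The sketch below encodes that reading as a sort key. The `Event` fields and the tiered interpretation are assumptions; a real timeline would probably need to interleave dated and undated events rather than group them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Event:
    name: str
    when: Optional[date] = None     # extracted date, if available
    sequence: Optional[int] = None  # position from document structure
    context_rank: int = 0           # fallback order from content analysis


def timeline_key(event: Event) -> tuple[int, int, int]:
    """Sort key: tier 0 = dated, tier 1 = sequenced, tier 2 = contextual only."""
    if event.when is not None:
        return (0, event.when.toordinal(), event.context_rank)
    if event.sequence is not None:
        return (1, event.sequence, event.context_rank)
    return (2, event.context_rank, 0)


def build_timeline(events: list[Event]) -> list[Event]:
    """Return events ordered by the tiered key above."""
    return sorted(events, key=timeline_key)
```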
5. Metadata Recording
Store ingestion metadata for each source:
    CREATE source SET
        title = $title,
        content = $content,
        embedding = $embedding_vector,
        url = $url,
        source_type = $source_type,  -- 'primary', 'secondary', 'web'
        reliability = $reliability,  -- 'high', 'medium', 'low'
        ingested_at = time::now(),
        ingested_during = 'bootstrap'
    ;
Statistics Reporting
After ingestion completes, report:
    INGESTION COMPLETE
    ==================

    SOURCES PROCESSED: 15 files, 3 URLs
    - PDF: 8 files
    - Markdown: 5 files
    - HTML (web): 3 URLs
    - Plain Text: 2 files

    ENTITIES EXTRACTED: 247 total
    - Characters: 34
    - Locations: 52
    - Events: 123
    - Concepts: 38

    RELATIONSHIPS CREATED: 412 edges
    - source->supports->concept: 156
    - event->precedes->event: 98
    - event->follows->event: 98
    - character->knows->character: 24
    - concept->related_to->concept: 36

    EMBEDDINGS GENERATED: 347 vectors
    - Sources: 15
    - Chunks: 285
    - Entities: 47

    TIMELINE ENTRIES: 123 events ordered chronologically

    STORAGE
    -------
    Database: bookstrap.my_book
    Namespace: bookstrap
    Total size: 12.4 MB
Error Handling
Handle common ingestion errors gracefully:
| Error | Recovery |
|---|---|
| File not found | Skip and report, continue with remaining files |
| Unsupported format | Warn user, skip file |
| URL fetch timeout | Retry once, then skip if it still fails |
| Embedding API error | Retry with exponential backoff, fail if persistent |
| Database connection error | Abort ingestion, report last successful file |
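The "retry with exponential backoff" recovery for embedding API errors can be sketched as a small wrapper. This is an illustration under assumptions: `with_backoff`, the attempt count, and the base delay are hypothetical, and a real implementation would catch only the provider's transient error types rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), doubling the wait after each failure; re-raise if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                # Persistent failure: give up and surface the last error.
                raise
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the wrapper easy to exercise without real delays, for example by handing it a list's `append` method to record the waits.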
Configuration
Ingestion behavior is controlled by bookstrap.config.json:

    {
      "embeddings": {
        "provider": "gemini",
        "model": "text-embedding-004",
        "dimensions": 768
      },
      "extraction": {
        "provider": "llm",
        "chunking": {
          "strategy": "semantic",
          "max_tokens": 1024,
          "overlap": 128
        }
      },
      "surrealdb": {
        "host": "localhost",
        "port": 2665,
        "namespace": "bookstrap",
        "database": "my_book"
      }
    }
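A configuration like the one above is worth sanity-checking before a long ingestion run. The sketch below parses the JSON text and applies two checks; `load_config` and the specific checks are assumptions for illustration, not documented behaviour of the skill.

```python
import json


def load_config(text: str) -> dict:
    """Parse bookstrap.config.json content and apply basic sanity checks."""
    config = json.loads(text)
    chunking = config["extraction"]["chunking"]
    # An overlap as large as the chunk itself would never advance through the text.
    if chunking["overlap"] >= chunking["max_tokens"]:
        raise ValueError("chunking.overlap must be smaller than chunking.max_tokens")
    if config["embeddings"]["dimensions"] <= 0:
        raise ValueError("embeddings.dimensions must be positive")
    return config
```

In practice the text would come from reading bookstrap.config.json from disk, for example with `Path("bookstrap.config.json").read_text()`.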
Implementation Notes
- Batch processing: Process files sequentially to avoid overwhelming the embedding API
- Rate limiting: Respect embedding provider rate limits (configured in bookstrap.config.json)
- Idempotency: Re-ingesting the same file updates existing records rather than creating duplicates (match on title + content hash)
- Progress tracking: Log each file as it's processed for visibility during long ingestions
- Database connection: Verify SurrealDB is running before starting ingestion
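The idempotency note above (match on title + content hash) can be illustrated with a small in-memory sketch. The key format, the SHA-256 choice, and the dictionary standing in for the database are all assumptions; the actual matching is done against SurrealDB records.

```python
import hashlib


def source_key(title: str, content: str) -> str:
    """Build a stable identity for a source from its title and a content hash."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{title}:{content_hash}"


def upsert_source(store: dict[str, dict], title: str, content: str, record: dict) -> bool:
    """Insert or overwrite a record; return True only when a new record was created."""
    key = source_key(title, content)
    created = key not in store
    store[key] = record  # same key means update in place, never a duplicate
    return created
```

Ingesting the same title and content twice leaves one record in the store: the second call overwrites the first and returns False.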
Example Usage
    # Ingest a single file
    /bookstrap-ingest ./research/soe-training-manual.pdf

    # Ingest an entire directory
    /bookstrap-ingest ./research/primary-sources/

    # Ingest multiple sources at once
    /bookstrap-ingest ./research/ https://en.wikipedia.org/wiki/SOE https://example.com/lyon-resistance.html

    # Ingest web content only
    /bookstrap-ingest https://archive.org/details/soe-field-manual
Pre-requisites
Before running /bookstrap-ingest:
- BRD created: /bookstrap-init must have been run to create the Book Requirements Document
- SurrealDB running: Database must be accessible (started via docker-compose up -d or ./scripts/start-surreal.sh)
- Schema initialized: Database schema must be loaded via ./scripts/init-schema.sh
- API keys configured: Embedding provider API key must be set in environment variables (e.g., GEMINI_API_KEY)
Related Commands
- /bookstrap-init: Create BRD and initialize database
- /bookstrap-plan-research: Identify knowledge gaps after ingestion
- /bookstrap-research: Fill gaps with targeted web research
- /bookstrap-status: View corpus statistics and coverage
Supporting Scripts
| Script | Purpose |
|---|---|
| scripts/ingest-file.py | Main ingestion pipeline for file processing |
| scripts/generate-embedding.py | Multi-provider embedding generation |
| scripts/extract-entities.py | LLM-based entity extraction with context |
| | Semantic chunking using LLM |
See individual script documentation for configuration options and advanced usage.