Crucible distill
Convert heavy document formats (PDF, Word, Excel, PowerPoint, and 10+ others) to token-efficient Markdown/CSV with structurally-aware digest compression. Use when Claude needs to read documents without burning excessive context budget. Triggers on /distill, 'distill this', 'convert to markdown', 'make this readable'.
git clone https://github.com/raddue/crucible
T=$(mktemp -d) && git clone --depth=1 https://github.com/raddue/crucible "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/distill" ~/.claude/skills/raddue-crucible-distill && rm -rf "$T"
skills/distill/SKILL.md
Distill
Overview
<!-- CANONICAL: shared/dispatch-convention.md -->
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md file, a structurally aware compression at 20-30% of the converted file's token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
- PDF structuring agent: Sonnet
- Digest agent: Sonnet
- Orchestrator: runs on whatever model the session uses
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
Invocation API
/distill <path> [path2 ...]
/distill <directory>
Examples:
- /distill docs/report.pdf: convert one file
- /distill docs/report.pdf data/sheet.xlsx slides/deck.pptx: convert multiple files
- /distill docs/: convert all supported files in directory (single-level, not recursive)
Mixed mode is supported:
/distill docs/ extra/report.pdf
The Process
Execute phases in this order. Each phase completes for all files before the next begins.
Phase 0: Tool Availability Check
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | | "pandoc not found. Install: (Debian/Ubuntu) or (macOS). Tier 1 formats will be skipped." |
| Tier 2 | | "pdftotext not found. Install: (Debian/Ubuntu) or (macOS). PDF conversion will be skipped." |
| Tier 3 | | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
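The tier set can be sketched as follows. This is a minimal illustration that assumes `command -v` as the detection method; the skill's actual check commands are not shown in the table above, and the helper name `available_tiers` is hypothetical.

```shell
# Hypothetical helper: prints the conversion tiers whose required tool is on PATH.
# Assumes `command -v` for detection; the skill's real check commands may differ.
available_tiers() {
  tiers=""
  command -v pandoc    >/dev/null 2>&1 && tiers="$tiers 1"
  command -v pdftotext >/dev/null 2>&1 && tiers="$tiers 2"
  command -v python3   >/dev/null 2>&1 && tiers="$tiers 3"
  echo "${tiers# }"
}

# Optional pre-flight tools: a missing one disables a check, never a conversion.
for tool in unzip pdfdetach; do
  command -v "$tool" >/dev/null 2>&1 || echo "note: $tool not found, skipping its pre-flight check"
done

echo "Available tiers: $(available_tiers)"
```

Files whose tier is absent from the printed set would then be reported as skipped rather than attempted.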
Phase 1: Input Resolution
1a: Build File List
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob:
.pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
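One possible shape for input resolution is sketched below. It assumes `realpath` is available for computing absolute paths and matches extensions case-sensitively; both are assumptions, and `resolve_inputs` is a hypothetical name.

```shell
# Hypothetical helper: expands file and directory arguments into a sorted,
# deduplicated list of absolute paths, one per line.
resolve_inputs() {
  for arg in "$@"; do
    if [ -d "$arg" ]; then
      # Single-level glob: direct children only, no recursion.
      for f in "$arg"/*; do
        case "$f" in
          *.pdf|*.docx|*.rtf|*.html|*.htm|*.odt|*.epub|*.rst|*.org|*.tex|*.ipynb|*.pptx|*.xlsx)
            [ -f "$f" ] && realpath "$f" ;;
        esac
      done
    elif [ -f "$arg" ]; then
      realpath "$arg"
    else
      echo "File not found: $arg" >&2
    fi
  done | sort -u   # sort -u both orders the list and removes duplicates
}
```

Calling `resolve_inputs docs/ docs/report.pdf` would list `report.pdf` once, because both arguments resolve to the same absolute path.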
1b: Route Files to Tiers
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
| .docx | 1 | |
| .rtf | 1 | |
| .html | 1 | |
| .htm | 1 | |
| .odt | 1 | |
| .epub | 1 | |
| .rst | 1 | |
| .org | 1 | |
| .tex | 1 | |
| .ipynb | 1 | |
| .pdf | 2 | — |
| .pptx | 3 | — |
| .xlsx | 3 | — |
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
| .xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
| .ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
| .odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
| .key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
| .numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
| .pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
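Routing by extension could look like the sketch below. The pandoc reader names (`html` for both .html and .htm, `latex` for .tex) are an assumption based on pandoc's documented input formats, not values taken from the Format Flag column. Lowercasing the extension is also an assumption, and `route_file` is a hypothetical name.

```shell
# Hypothetical helper: prints "<tier> <pandoc-format>" for a path, or "unsupported".
# Reader names are assumed from pandoc's input-format list.
route_file() {
  # Lowercase the extension so REPORT.PDF routes like report.pdf (an assumption).
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    docx|rtf|odt|epub|rst|org|ipynb) echo "1 $ext" ;;
    html|htm)                        echo "1 html" ;;
    tex)                             echo "1 latex" ;;
    pdf)                             echo "2 -" ;;
    pptx|xlsx)                       echo "3 -" ;;
    *)                               echo "unsupported" ;;
  esac
}

route_file "docs/report.pdf"
route_file "notes/paper.tex"
```

An "unsupported" result would then be looked up in the guidance table, falling back to the unknown-extension message.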
Phase 2: Pre-Flight Checks
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Zip Bomb Detection (docx, pptx, xlsx)
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
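The size comparison can be wrapped as in this sketch. The helper names `exceeds_limit` and `check_zip_size` are hypothetical, and treating an unreadable archive as "no verdict" is an assumption about how the check should degrade.

```shell
LIMIT_BYTES=524288000  # 500MB (500 * 1024 * 1024), as stated above

# Hypothetical helper: succeeds when a byte count exceeds the limit.
exceeds_limit() {
  [ "$1" -gt "$LIMIT_BYTES" ]
}

# Hypothetical wrapper around the unzip command shown above.
check_zip_size() {
  input_path="$1"
  uncompressed=$(unzip -l "$input_path" 2>/dev/null | tail -1 | awk '{print $1}')
  # A missing or non-ZIP file yields no number; leave that failure to the converter.
  case "$uncompressed" in ''|*[!0-9]*) return 0 ;; esac
  if exceeds_limit "$uncompressed"; then
    echo "File uncompressed size ($uncompressed) exceeds 500MB safety limit. Skipping."
    return 1
  fi
}
```

A non-zero return from `check_zip_size` would mark only that file as skipped; the rest of the batch continues.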
PDF Attachment Detection
For PDF files, if pdfdetach is available:
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -c "^[0-9][0-9]*:")
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
Encoding Validation
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8 (us-ascii also counts as valid, since ASCII is a subset of UTF-8), attempt re-encoding:
iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH"
If re-encoding fails, report and skip.
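A possible end-to-end version of the encoding check is sketched here. It assumes the charset reported by `file` is a name `iconv` accepts, which holds for common cases such as iso-8859-1 but not for results like `binary`. The name `ensure_utf8` is hypothetical, and the sketch uses shell redirection instead of `-o` so that it does not depend on GNU iconv.

```shell
# Hypothetical helper: leaves UTF-8/ASCII files alone and re-encodes anything else.
ensure_utf8() {
  output_path="$1"
  charset=$(file -b --mime-encoding "$output_path")
  case "$charset" in
    utf-8|us-ascii) return 0 ;;   # ASCII is a subset of UTF-8
  esac
  if iconv -f "$charset" -t UTF-8 "$output_path" > "$output_path.tmp"; then
    mv "$output_path.tmp" "$output_path"
  else
    rm -f "$output_path.tmp"
    echo "Re-encoding failed for $output_path (detected: $charset). Skipping."
    return 1
  fi
}
```

Writing to a temporary file first means a failed `iconv` run never truncates the converted output.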
Phase 3: Conversion
Process files sequentially. For each file:
Tier 1: Pandoc-Native
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2"  # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
Error handling:
- Non-zero exit code: report "pandoc conversion failed for {file}: {error}" and continue
- Empty output: report "pandoc produced empty output for {file}" and continue
Idempotency: Overwrites existing output files without warning.
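The two error cases above can be folded into a small wrapper. This is a sketch only; `convert_tier1` is a hypothetical name, and capturing pandoc's stderr as the `{error}` text is an assumption about what the report should contain.

```shell
# Hypothetical wrapper: runs pandoc and reports the two failure modes listed above.
convert_tier1() {
  input_path="$1"
  format="$2"
  output_path="${input_path%.*}.md"

  # Capture stderr so the failure report can include pandoc's own message.
  if ! err=$(pandoc -f "$format" -t markdown --wrap=none "$input_path" -o "$output_path" 2>&1); then
    echo "pandoc conversion failed for $input_path: $err"
    return 1
  fi

  # -s is true only when the file exists and is non-empty.
  if [ ! -s "$output_path" ]; then
    echo "pandoc produced empty output for $input_path"
    return 1
  fi
}
```

A caller would treat a non-zero return as "skip this file" and move on to the next one.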
Tier 2: PDF (pdftotext + Claude structuring)
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters).
If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
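The fallback page count and the threshold test might be written as below. Both helper names are hypothetical. The sketch relies on pdftotext ending each page with a form-feed character, which is its default behaviour, and uses integer division for the average.

```shell
# Hypothetical helper: estimates pages by counting form feeds in pdftotext output.
count_pages_from_text() {
  tr -cd '\f' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: succeeds when average characters per page is below 50.
looks_scanned() {
  chars="$1"
  pages="$2"
  [ "$pages" -gt 0 ] || pages=1   # avoid division by zero on empty output
  [ $((chars / pages)) -lt 50 ]
}

# Worked example: 120 characters over 10 pages averages 12 per page.
if looks_scanned 120 10; then
  echo "This PDF appears to be scanned/image-based."
fi
```

A 10-page PDF yielding 5,000 characters averages 500 per page and would proceed to the structuring pass.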
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Tier 3: Python Venv
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"

setup_venv() {
  # Health check: rebuild the venv if the interpreter or either package is missing
  if [ -d "$VENV" ]; then
    "$VENV/bin/python3" -c "import pptx, openpyxl" 2>/dev/null || rm -rf "$VENV"
  fi

  # Create if missing
  if [ ! -d "$VENV" ]; then
    echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
    python3 -m venv "$VENV"
    if ! "$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5; then
      echo "Failed to install Python dependencies."
      echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
      echo "PPTX and XLSX conversion will be skipped."
      # Route remaining Tier 3 files to unsupported
      return 1
    fi
  fi
}

setup_venv  # a non-zero exit status means Tier 3 is unavailable
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheet names are sanitized (spaces → hyphens, special chars stripped).
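As an illustration of the naming rule, the sketch below sanitizes a sheet name in shell. The real logic lives in convert_xlsx.py and is not shown here, so the definition of "special chars" (anything other than ASCII letters, digits, underscore, and hyphen) is an assumption, as is the helper name.

```shell
# Hypothetical helper: spaces become hyphens, then every character outside
# A-Z, a-z, 0-9, underscore, and hyphen is dropped (assumed rule set).
sanitize_sheet_name() {
  printf '%s' "$1" | tr ' ' '-' | tr -cd 'A-Za-z0-9_-'
}

basename="budget"
sheet=$(sanitize_sheet_name "Q1 Sales (2024)!")
echo "${basename}-${sheet}.csv"   # prints budget-Q1-Sales-2024.csv
```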
Phase 4: Digest Pass
After all conversions complete, run the digest pass on eligible files.
Eligibility:
- File is .md (not .csv)
- Word count > 500 words
- Word count ≤ 50,000 words (hard cap — report "File exceeds 50K word limit for digest pass. Consider splitting the document." for larger files)
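The eligibility rules translate into a short predicate. This sketch assumes `wc -w` as the word counter; `digest_eligible` is a hypothetical name.

```shell
# Hypothetical helper: succeeds when a converted file qualifies for the digest pass.
digest_eligible() {
  path="$1"
  # Only Markdown output is digested; CSV files are left as-is.
  case "$path" in *.md) ;; *) return 1 ;; esac
  words=$(wc -w < "$path" | tr -d ' ')
  if [ "$words" -gt 50000 ]; then
    echo "File exceeds 50K word limit for digest pass. Consider splitting the document."
    return 1
  fi
  # Files of 500 words or fewer are already small enough to read directly.
  [ "$words" -gt 500 ]
}
```

Under these rules a 501-word .md file qualifies, while a 500-word one does not.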
Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
Quality check: After the digest agent returns, count words in the digest:
- If digest is 15-35% of input word count: accept
- If digest exceeds 35%: re-dispatch with "Compress more aggressively. Target 20-25% of the original word count."
- If digest is below 15%: re-dispatch with "Preserve more detail. Target 25-30% of the original word count."
- One retry only. Second result accepted regardless.
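The target calculation and the acceptance band can be checked with integer arithmetic, as in this sketch. The function names and the verdict strings (`accept`, `compress`, `expand`) are hypothetical. The bounds are compared by cross-multiplication so that integer division never rounds a borderline ratio into the accepted range.

```shell
# Hypothetical helper: 25% of the converted word count, for {{TARGET_WORDS}}.
target_words() {
  echo $(( $1 / 4 ))
}

# Hypothetical helper: classifies a digest against the 15-35% band.
digest_verdict() {
  original="$1"
  digest="$2"
  if [ $((digest * 100)) -gt $((original * 35)) ]; then
    echo "compress"   # above 35%: retry with "Compress more aggressively..."
  elif [ $((digest * 100)) -lt $((original * 15)) ]; then
    echo "expand"     # below 15%: retry with "Preserve more detail..."
  else
    echo "accept"
  fi
}

target_words 10000          # prints 2500
digest_verdict 10000 2500   # prints accept
digest_verdict 10000 4000   # prints compress
```

For a 10,000-word conversion the accepted band is therefore 1,500 to 3,500 words inclusive.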
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
Phase 5: Summary
After all conversions and digests complete, output:
## Distill Summary

| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |

**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.

Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words), expressed as a percentage.
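As a worked instance of the formula, the sketch below computes the percentage with shell integer arithmetic. The helper name `savings_pct` is hypothetical, and truncating to a whole percent is an assumption that fits the "~{pct}%" display.

```shell
# Hypothetical helper: 1 - (digest words / converted words), as a whole percent.
savings_pct() {
  converted="$1"
  digest="$2"
  echo $(( 100 - (digest * 100 / converted) ))
}

savings_pct 12000 3000   # prints 75
```

Here a 3,000-word digest of a 12,000-word conversion is 25% of the original, so the reported savings are about 75%. Only digested files reach this calculation, so the divisor is always above 500.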
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
Shell Safety (Non-Negotiable)
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"

# WRONG: never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
- All paths passed as "$VAR", never bare $VAR
- No unquoted $() or backtick interpolation of paths
- Python scripts receive paths via argparse, not shell interpolation
- Source files are NEVER modified or deleted
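A small demonstration of why the quoting rule matters, assuming a POSIX-style shell such as bash or sh (zsh does not split unquoted variables by default). The path and the helper name `count_args` are made up for illustration.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() {
  echo "$#"
}

INPUT_PATH="docs/annual report 2024.pdf"   # illustrative path containing spaces

quoted=$(count_args "$INPUT_PATH")    # the path arrives as one argument
unquoted=$(count_args $INPUT_PATH)    # word splitting breaks it into three

echo "quoted: $quoted, unquoted: $unquoted"   # prints quoted: 1, unquoted: 3
```

With the unquoted form, a converter would receive three non-existent file names instead of one real path.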
Error Handling
| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
Integration
Standalone usage:
- /distill <path>: convert one or more files
- /distill <directory>: convert all supported files in directory
Called by:
- Any skill that needs to read heavy document formats
- User directly when preparing documents for LLM consumption
Dispatches:
- PDF structuring agent (Sonnet) via skills/distill/pdf-structurer-prompt.md
- Digest agent (Sonnet) via skills/distill/digest-prompt.md
Does not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill — it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).