OpenSpace document-gen-resilient-multiformat
Resilient multi-format document generation with environment checks, auto-sanitization, and fallback engines
git clone https://github.com/HKUDS/OpenSpace
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b" ~/.claude/skills/hkuds-openspace-document-gen-resilient-multiformat && rm -rf "$T"
gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b/SKILL.md
Resilient Document Generation Workflow (Multi-Format + Unicode-Safe)
When to Use
Use this skill when generating documents in multiple formats (.docx, .pdf, .html) and you need:
- Reliable execution with pre-flight environment verification
- Automatic Unicode handling based on target format
- Fallback options when primary tools (pandoc/LaTeX) are unavailable
- Clear error isolation through discrete, observable steps
Trigger this workflow when:
- shell_agent returns unknown/unclear errors on document generation
- You need multiple output formats from one source
- Documents contain special characters, symbols, or non-ASCII text
- Previous PDF generation attempts failed due to encoding or missing dependencies
Pre-Flight Environment Check
Before starting, verify your environment has the required tools:
run_shell command: which pandoc && pandoc --version | head -1
run_shell command: which pdflatex || which xelatex || which wkhtmltopdf || echo "No PDF engine found"
run_shell command: python3 -c "import fpdf; print('fpdf2 available')" 2>/dev/null || echo "fpdf2 not available"
run_shell command: python3 -c "import reportlab; print('reportlab available')" 2>/dev/null || echo "reportlab not available"
If pandoc is missing: Install via apt-get install pandoc or brew install pandoc
If no PDF engine found: Choose a fallback approach:
- Install LaTeX: apt-get install texlive-latex-recommended texlive-fonts-recommended (on Debian/Ubuntu, add texlive-xetex if you want the xelatex engine)
- Or use wkhtmltopdf: apt-get install wkhtmltopdf
- Or use Python fallback (fpdf2/reportlab) - see Alternative PDF Generation section
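The same pre-flight check can be run from Python when a single report is more convenient than four separate shell calls. This is a minimal sketch; the function name `check_environment` and the two tuples are illustrative names, not part of the skill.

```python
import importlib.util
import shutil

# Tool and module names taken from the shell checks above
CLI_TOOLS = ("pandoc", "xelatex", "pdflatex", "wkhtmltopdf")
PY_MODULES = ("fpdf", "reportlab")


def check_environment():
    """Return {name: bool} for each CLI tool and Python module."""
    report = {tool: shutil.which(tool) is not None for tool in CLI_TOOLS}
    for module in PY_MODULES:
        # find_spec returns None when a top-level module is not installed
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12} {'OK' if ok else 'MISSING'}")
```

Unlike the chained `which ... || which ...` command, this reports every engine individually, which helps when choosing a fallback later.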
Core Technique
Split the workflow into discrete, observable steps with automatic format detection:
- Environment check → Verify tools are available
- Content creation → Use write_file to create source Markdown
- Auto-sanitization → Conditionally sanitize based on target format (PDF needs it, DOCX/HTML don't)
- Format conversion → Use appropriate tool for each format with fallback options
- Verification → Check output files exist and validate content
⚠️ Unicode & Format Compatibility Matrix
| Format | Unicode Support | Sanitization Needed | Recommended Engine |
|---|---|---|---|
| .docx | Excellent | No | pandoc (default) |
| .html | Excellent | No | pandoc (default) |
| .pdf | Limited (LaTeX) | Yes | xelatex > pdflatex > wkhtmltopdf > fpdf2 |
| .pptx | Good | No | pandoc (if available) or python-pptx |
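One way to apply the matrix is to key the sanitization decision on the output file extension. The sketch below assumes the four rows correspond to .docx, .html, .pdf and .pptx; `NEEDS_SANITIZATION` and `source_for` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Assumed mapping of the matrix rows to file extensions:
# only PDF output is routed through the sanitized source.
NEEDS_SANITIZATION = {".docx": False, ".html": False, ".pdf": True, ".pptx": False}


def source_for(output_path, original, sanitized):
    """Pick the Markdown source to feed the converter for a given output file."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in NEEDS_SANITIZATION:
        raise ValueError(f"unsupported output format: {suffix!r}")
    return sanitized if NEEDS_SANITIZATION[suffix] else original
```

For example, `source_for("report.pdf", "/tmp/report.md", "/tmp/report_sanitized.md")` returns the sanitized path, while a `.docx` target gets the original.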
Critical Unicode Characters for PDF
| Character | Issue | Safe Replacement |
|---|---|---|
| — (em dash) | May not render | -- or - |
| – (en dash) | May not render | - |
| “ ” (curly quotes) | Encoding errors | " (straight quotes) |
| ’ (curly apostrophe) | Encoding errors | ' (straight apostrophe) |
| … (ellipsis) | May not render | ... |
| → ← (arrows) | LaTeX incompatibility | -> <- |
| ✓ ✗ (checkmarks) | May not render | [x] [ ] |
| © ® ™ (symbols) | May not render | (c) (r) (tm) |
| | May require packages | |
| Non-ASCII letters (é, ñ, ü) | Font-dependent | Use xeLaTeX or replace |
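The replacements can also be applied in Python with a translation table. This is a sketch that follows the substitutions used by the sanitization script in this document. The code points are written as escapes so the snippet itself stays ASCII, and `sanitize_for_pdf` is an illustrative name.

```python
# Replacement map following the document's sanitization script
REPLACEMENTS = {
    "\u2014": "--",                    # em dash
    "\u2013": "-",                     # en dash
    "\u201c": '"', "\u201d": '"',      # curly double quotes
    "\u2018": "'", "\u2019": "'",      # curly single quotes / apostrophe
    "\u2026": "...",                   # ellipsis
    "\u2192": "->", "\u2190": "<-",    # right and left arrows
    "\u2713": "[x]", "\u2717": "[ ]",  # check mark, ballot x
    "\u00a9": "(c)", "\u00ae": "(r)", "\u2122": "(tm)",
}

# str.maketrans accepts single-character keys mapped to strings of any length
_TABLE = str.maketrans(REPLACEMENTS)


def sanitize_for_pdf(text):
    """Replace characters that commonly break LaTeX-based PDF engines."""
    return text.translate(_TABLE)
```

Accented letters such as é or ñ are deliberately left alone here; as the last table row notes, those are better handled by choosing xelatex.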
Step-by-Step Workflow
Step 0: Pre-Flight Check
Verify environment before proceeding:
run_shell command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"
If PANDOC_MISSING: Either install pandoc or use Alternative PDF Generation (Python-based)
Step 1: Create Source Content with write_file
Write your document content as Markdown to a source file:
write_file
path: /tmp/document_source.md
content: |
  # Document Title

  ## Section 1
  Content with original unicode characters...

  ## Section 2
  More content...
Step 2: Conditional Unicode Sanitization
Only sanitize if generating PDF. Create sanitized version for PDF conversion:
write_file
path: /tmp/document_source_sanitized.md
content: |
  # Document Title

  ## Section 1
  Content with unicode replaced (em-dash -> --, curly quotes -> straight, etc.)

  ## Section 2
  More content...
Optional: Use sanitization script (see Unicode Sanitization Script section below):
run_shell command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md
Note: Keep the original unsanitized file for DOCX/HTML conversion (these formats handle Unicode better).
Step 3: Convert to Target Formats with run_shell
Use appropriate commands for each format. Use sanitized source for PDF, original for others.
DOCX (from original):
run_shell command: pandoc /tmp/document_source.md -o output.docx
PDF (from sanitized, with engine priority):
run_shell command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=xelatex
If xelatex fails, try pdflatex:
run_shell command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=pdflatex
If LaTeX engines fail, try wkhtmltopdf:
run_shell command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=wkhtmltopdf
HTML (from original):
run_shell command: pandoc /tmp/document_source.md -o output.html
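The engine-by-engine retry for PDF can be scripted so that each failure is captured rather than lost. The sketch below is one possible wrapper around the pandoc commands shown above. `convert_to_pdf` and its injectable `run` parameter are hypothetical, added so the loop can be exercised without pandoc installed.

```python
import subprocess

# Engine priority used throughout this workflow
ENGINES = ("xelatex", "pdflatex", "wkhtmltopdf")


def convert_to_pdf(source, output, engines=ENGINES, run=subprocess.run):
    """Try each engine in order; return the name of the first that succeeds."""
    errors = {}
    for engine in engines:
        cmd = ["pandoc", source, "-o", output, f"--pdf-engine={engine}"]
        try:
            result = run(cmd, capture_output=True, text=True)
        except FileNotFoundError as exc:
            # Raised by subprocess when the pandoc binary itself is missing
            raise RuntimeError("pandoc not found; use the Python fallback") from exc
        if result.returncode == 0:
            return engine
        # Keep stderr per engine so the failure can be diagnosed later
        errors[engine] = result.stderr.strip()
    raise RuntimeError(f"all PDF engines failed: {errors}")
```

Returning the engine name makes it easy to record which approach worked, and the collected stderr keeps each step observable.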
Step 4: Verify Outputs
Check that files were created and have content:
run_shell command: ls -lh output.* && file output.*
run_shell command: test -s output.pdf && echo "PDF has content" || echo "PDF is empty"
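A slightly stricter check than `test -s` is to confirm that a PDF actually starts with the `%PDF-` header, since a failed conversion can leave a non-empty but invalid file. This is a minimal sketch; `verify_output` is an illustrative helper, not part of the skill.

```python
from pathlib import Path


def verify_output(path):
    """Return 'ok', 'missing', 'empty', or 'not a PDF' for an output file."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    if p.suffix.lower() == ".pdf":
        # Every PDF file begins with the bytes "%PDF-"
        with p.open("rb") as fh:
            if fh.read(5) != b"%PDF-":
                return "not a PDF"
    return "ok"
```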
PDF Engine Decision Tree
Is pandoc available?
├─ NO → Use Python fallback (fpdf2 or reportlab)
└─ YES → Is a PDF engine available?
    ├─ YES (xelatex) → Use: pandoc --pdf-engine=xelatex (best Unicode)
    ├─ YES (pdflatex) → Use: pandoc --pdf-engine=pdflatex + sanitization
    ├─ YES (wkhtmltopdf) → Use: pandoc --pdf-engine=wkhtmltopdf
    └─ NO → Install LaTeX or use Python fallback
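The decision tree maps naturally onto a small function that takes the set of available tools and returns a strategy. This is a sketch under the assumption that availability has already been detected; `choose_pdf_strategy` and the returned strings are illustrative.

```python
def choose_pdf_strategy(available):
    """Pick a PDF strategy from a set of tool/module names that were found."""
    if "pandoc" in available:
        # Same priority order as the tree: best Unicode support first
        for engine in ("xelatex", "pdflatex", "wkhtmltopdf"):
            if engine in available:
                return f"pandoc --pdf-engine={engine}"
    # No pandoc, or pandoc without any engine: fall back to Python libraries
    for module in ("fpdf", "reportlab"):
        if module in available:
            return f"python:{module}"
    return "install a PDF engine or a Python PDF library"
```

For instance, `choose_pdf_strategy({"pandoc", "pdflatex"})` selects the pdflatex command, which according to the tree should be paired with the sanitized source.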
Alternative PDF Generation (Python Fallback)
When pandoc or LaTeX is unavailable, use Python libraries directly:
Option A: fpdf2 (Simple, fast)
run_shell command: python3 << 'EOF'
from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
# Helvetica is a built-in core font (fpdf2 substitutes it for "Arial" with a warning)
pdf.set_font("Helvetica", size=12)
# Note: core fonts cover Latin-1 only - register a TTF with pdf.add_font() for wider Unicode
pdf.cell(200, 10, "Document Title", ln=True, align="C")
pdf.output("output.pdf")
EOF
Option B: reportlab (More control, better Unicode)
run_shell command: python3 << 'EOF'
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# Register a Unicode font if needed
# pdfmetrics.registerFont(TTFont('UnicodeFont', 'path/to/font.ttf'))

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Document Title")
c.save()
EOF
Option C: Markdown to PDF with Markdown + ReportLab
run_shell command: python3 << 'EOF'
import re
import markdown
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

# Read markdown
with open('/tmp/document_source.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

# Convert to HTML
html_content = markdown.markdown(md_content)

# Create PDF
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Simplified HTML handling: map headings, paragraphs and list items to
# styles and strip nested tags (use a real HTML parser for production)
for tag, body in re.findall(r'<(h[1-6]|p|li)>(.*?)</\1>', html_content, flags=re.S):
    style = styles['Heading' + tag[1]] if tag.startswith('h') else styles['Normal']
    story.append(Paragraph(re.sub(r'<[^>]+>', '', body), style))
    story.append(Spacer(1, 6))

doc.build(story)
print("PDF created successfully")
EOF
Complete Example
# Generate Multi-Format Report

## Step 0: Check environment
run_shell command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

## Step 1: Write Markdown source
write_file
path: /tmp/report.md
content: |
  # Quarterly Report

  ## Executive Summary
  Performance metrics and analysis...

  ## Key Findings
  — Major finding with em-dash
  “Quote” with curly quotes
  ✓ Completed items

## Step 2: Create sanitized version (for PDF only)
write_file
path: /tmp/report_sanitized.md
content: |
  # Quarterly Report

  ## Executive Summary
  Performance metrics and analysis...

  ## Key Findings
  -- Major finding with em-dash
  "Quote" with straight quotes
  [x] Completed items

## Step 3: Convert to DOCX (from original)
run_shell command: pandoc /tmp/report.md -o report.docx

## Step 4: Convert to PDF (from sanitized, with xelatex)
run_shell command: pandoc /tmp/report_sanitized.md -o report.pdf --pdf-engine=xelatex

## Step 5: Convert to HTML (from original)
run_shell command: pandoc /tmp/report.md -o report.html

## Step 6: Verify all outputs
run_shell command: ls -lh report.* && echo "All files created"
Error Handling & Recovery
| Error | Likely Cause | Recovery Action |
|---|---|---|
| | Pandoc not installed | Install pandoc or use Python fallback |
| | LaTeX missing | Use wkhtmltopdf or the Python fallback |
| | Missing LaTeX package | Install package or use xelatex/wkhtmltopdf |
| | Unicode in PDF | Use sanitized file + xelatex engine |
| | wkhtmltopdf missing | Install or use xelatex/pdflatex |
| | Conversion silently failed | Check pandoc stderr, try alternative engine |
| | Non-Latin characters | Use reportlab with Unicode font or sanitize |
Recovery Workflow
- Check error message for specific tool/engine failure
- Try next engine in priority: xelatex → pdflatex → wkhtmltopdf → fpdf2/reportlab
- If all pandoc attempts fail: Switch to Python fallback (fpdf2/reportlab)
- If Unicode issues persist: Apply stricter sanitization or use xelatex with Unicode font
- Document which approach succeeded for future reference
Unicode Sanitization Script (Reusable)
Save as sanitize_for_pdf.sh:
#!/bin/bash
# sanitize_for_pdf.sh - Replace problematic unicode chars for LaTeX/PDF

if [ -z "$1" ]; then
  echo "Usage: $0 <input.md> [output.md]"
  exit 1
fi

INPUT="$1"
OUTPUT="${2:-${1%.md}_sanitized.md}"

# Each curly quote is replaced on its own, so no capture groups are needed
sed -e 's/—/--/g' \
    -e 's/–/-/g' \
    -e 's/“/"/g' \
    -e 's/”/"/g' \
    -e "s/‘/'/g" \
    -e "s/’/'/g" \
    -e 's/…/.../g' \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/✓/[x]/g' \
    -e 's/✗/[ ]/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(r)/g' \
    -e 's/™/(tm)/g' \
    "$INPUT" > "$OUTPUT"

echo "Sanitized: $INPUT -> $OUTPUT"
Make executable:
chmod +x sanitize_for_pdf.sh
Usage:
run_shell command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md
Common pandoc Commands Reference
# Markdown to Word
pandoc input.md -o output.docx

# Markdown to PDF (LaTeX - default pdflatex)
pandoc input.md -o output.pdf

# Markdown to PDF with xelatex (best Unicode support)
pandoc input.md -o output.pdf --pdf-engine=xelatex

# Markdown to PDF with wkhtmltopdf (HTML-based, no LaTeX)
pandoc input.md -o output.pdf --pdf-engine=wkhtmltopdf

# Markdown to HTML
pandoc input.md -o output.html

# With metadata
pandoc input.md -o output.pdf --metadata title="Document Title"

# With custom reference doc (DOCX)
pandoc input.md -o output.docx --reference-doc=template.docx

# With custom template (HTML/PDF)
pandoc input.md --template=template.html -o output.html

# Pandoc assumes UTF-8 input; convert other encodings first (ISO-8859-1 is only an example)
iconv -f ISO-8859-1 -t UTF-8 input.md | pandoc -f markdown -o output.pdf
Troubleshooting Quick Reference
PDF generation fails with encoding error:
- Use sanitized markdown file
- Try --pdf-engine=xelatex for better Unicode support
- Convert the source to UTF-8 before running pandoc (pandoc assumes UTF-8 input)
PDF generation fails: LaTeX not found:
- Install: apt-get install texlive-latex-recommended texlive-fonts-recommended
- Or use: --pdf-engine=wkhtmltopdf
- Or use: Python fallback (fpdf2/reportlab)
DOCX formatting issues:
- Add --reference-doc=template.docx for custom styles
- Check markdown structure (headings, lists)
Unicode/encoding errors in any format:
- Ensure source file is UTF-8: file -i source.md
- If it is not, convert it (for example with iconv) before running pandoc, which assumes UTF-8 input
- Use xelatex for PDF
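The UTF-8 check can also be done in Python, re-encoding the file in place when it is not valid UTF-8. This is a sketch only: the fallback encoding `cp1252` is an assumption about where the file came from, and `ensure_utf8` is a hypothetical helper.

```python
def ensure_utf8(path, fallback_encoding="cp1252"):
    """Return True if the file was already UTF-8, False if it was re-encoded."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        # Assumed legacy encoding; undecodable bytes become U+FFFD
        text = raw.decode(fallback_encoding, errors="replace")
        # newline="" writes line endings exactly as they were read
        with open(path, "w", encoding="utf-8", newline="") as fh:
            fh.write(text)
        return False
```

Because the file is rewritten in place, it is worth keeping a copy of the original if the source encoding is uncertain.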
Special characters not rendering in PDF:
- Use the character replacement table
- Create sanitized version before PDF conversion
- Try xelatex with Unicode font
Missing pandoc:
- Ubuntu/Debian: apt-get install pandoc
- macOS: brew install pandoc
- Fallback: Use Python libraries directly
When to Use shell_agent vs Manual Workflow
| Scenario | Recommended Approach |
|---|---|
| Simple DOCX/HTML, no Unicode | shell_agent is fine |
| PDF generation required | Manual workflow (this skill) |
| Multiple formats from one source | Manual workflow (this skill) |
| Heavy Unicode/special characters | Manual workflow with sanitization |
| shell_agent failed with unknown error | Manual workflow (this skill) |
| Need explicit error visibility | Manual workflow (this skill) |
| Environment constraints (no pandoc) | Python fallback (this skill) |
After successful manual workflow: You can attempt shell_agent for similar future tasks, but keep this workflow as your known-working fallback.
Related Skills
- document-gen-fallback: Original fallback workflow (less Unicode guidance)
- write-file-fallback-report: For when multiple tools fail simultaneously
- spreadsheet-direct-python: For Excel/CSV generation with Python
Skill Health & Metrics
- Expected success rate: 85%+ with proper environment setup
- Fallback activation: Use Python fallback if pandoc/LaTeX unavailable
- Verification: Always verify output files exist AND have content (>0 bytes)