OpenSpace document-gen-resilient-multiformat

Resilient multi-format document generation with environment checks, auto-sanitization, and fallback engines

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b" ~/.claude/skills/hkuds-openspace-document-gen-resilient-multiformat && rm -rf "$T"

manifest: gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b/SKILL.md

source content

Resilient Document Generation Workflow (Multi-Format + Unicode-Safe)

When to Use

Use this skill when generating documents in multiple formats (

.docx

.pdf

.html

) and you need:

Reliable execution with pre-flight environment verification
Automatic Unicode handling based on target format
Fallback options when primary tools (pandoc/LaTeX) are unavailable
Clear error isolation through discrete, observable steps

Trigger this workflow when:

```
shell_agent
```
returns unknown/unclear errors on document generation
You need multiple output formats from one source
Documents contain special characters, symbols, or non-ASCII text
Previous PDF generation attempts failed due to encoding or missing dependencies

Pre-Flight Environment Check

Before starting, verify your environment has the required tools:

run_shell
command: which pandoc && pandoc --version | head -1

run_shell
command: which pdflatex || which xelatex || which wkhtmltopdf || echo "No PDF engine found"

run_shell
command: python3 -c "import fpdf; print('fpdf2 available')" 2>/dev/null || echo "fpdf2 not available"

run_shell
command: python3 -c "import reportlab; print('reportlab available')" 2>/dev/null || echo "reportlab not available"

If pandoc is missing: Install via

apt-get install pandoc

brew install pandoc

If no PDF engine found: Choose fallback approach:

Install LaTeX:

apt-get install texlive-latex-recommended texlive-fonts-recommended

Or use wkhtmltopdf:
```
apt-get install wkhtmltopdf
```
Or use Python fallback (fpdf2/reportlab) - see Alternative PDF Generation section

Core Technique

Split the workflow into discrete, observable steps with automatic format detection:

Environment check → Verify tools are available
Content creation → Use
```
write_file
```
to create source Markdown
Auto-sanitization → Conditionally sanitize based on target format (PDF needs it, DOCX/HTML don't)
Format conversion → Use appropriate tool for each format with fallback options
Verification → Check output files exist and validate content

⚠️ Unicode & Format Compatibility Matrix

Format	Unicode Support	Sanitization Needed	Recommended Engine
`.docx`	Excellent	No	pandoc (default)
`.html`	Excellent	No	pandoc (default)
`.pdf`	Limited (LaTeX)	Yes	xelatex > pdflatex > wkhtmltopdf > fpdf2
`.pptx`	Good	No	pandoc (if available) or python-pptx

Critical Unicode Characters for PDF

Character	Issue	Safe Replacement
`—` (em dash)	May not render	`--` or `-`
`–` (en dash)	May not render	`-`
`" "` (curly quotes)	Encoding errors	`" "` (straight quotes)
`' '` (curly apostrophe)	Encoding errors	`'` (straight apostrophe)
`…` (ellipsis)	May not render	`...`
`→` `←` `↑` `↓` (arrows)	LaTeX incompatibility	`->` `<-` `^` `v`
`✓` `✗` (checkmarks)	May not render	`[x]` `[ ]`
`★` `●` (symbols)	May not render	`*` `-`
`©` `®` `™`	May require packages	`(c)` `(r)` `(tm)`
Non-ASCII letters (é, ñ, ü)	Font-dependent	Use xeLaTeX or replace

Step-by-Step Workflow

Step 0: Pre-Flight Check

Verify environment before proceeding:

run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

If PANDOC_MISSING: Either install pandoc or use Alternative PDF Generation (Python-based)

Step 1: Create Source Content with write_file

Write your document content as Markdown to a source file:

write_file
path: /tmp/document_source.md
content: |
  # Document Title
  
  ## Section 1
  Content with original unicode characters...
  
  ## Section 2
  More content...

Step 2: Conditional Unicode Sanitization

Only sanitize if generating PDF. Create sanitized version for PDF conversion:

write_file
path: /tmp/document_source_sanitized.md
content: |
  # Document Title
  
  ## Section 1
  Content with unicode replaced (em-dash -> --, curly quotes -> straight, etc.)
  
  ## Section 2
  More content...

Optional: Use sanitization script (see Unicode Sanitization Script section below):

run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md

Note: Keep the original unsanitized file for DOCX/HTML conversion (these formats handle Unicode better).

Step 3: Convert to Target Formats with run_shell

Use appropriate commands for each format. Use sanitized source for PDF, original for others.

DOCX (from original):

run_shell
command: pandoc /tmp/document_source.md -o output.docx

PDF (from sanitized, with engine priority):

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=xelatex

If xelatex fails, try pdflatex:

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=pdflatex

If LaTeX engines fail, try wkhtmltopdf:

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=wkhtmltopdf

HTML (from original):

run_shell
command: pandoc /tmp/document_source.md -o output.html

Step 4: Verify Outputs

Check that files were created and have content:

run_shell
command: ls -lh output.* && file output.*

run_shell
command: test -s output.pdf && echo "PDF has content" || echo "PDF is empty"

PDF Engine Decision Tree

Is pandoc available?
├─ NO → Use Python fallback (fpdf2 or reportlab)
└─ YES → Is LaTeX available?
    ├─ YES (xelatex) → Use: pandoc --pdf-engine=xelatex (best Unicode)
    ├─ YES (pdflatex) → Use: pandoc --pdf-engine=pdflatex + sanitization
    ├─ YES (wkhtmltopdf) → Use: pandoc --pdf-engine=wkhtmltopdf
    └─ NO → Install LaTeX or use Python fallback

Alternative PDF Generation (Python Fallback)

When pandoc or LaTeX is unavailable, use Python libraries directly:

Option A: fpdf2 (Simple, fast)

run_shell
command: python3 << 'EOF'
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
# Note: fpdf2 has limited Unicode support - use Latin-1 or embed fonts
pdf.cell(200, 10, txt="Document Title", ln=True, align='C')
pdf.output("output.pdf")
EOF

Option B: reportlab (More control, better Unicode)

run_shell
command: python3 << 'EOF'
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# Register a Unicode font if needed
# pdfmetrics.registerFont(TTFont('UnicodeFont', 'path/to/font.ttf'))

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Document Title")
c.save()
EOF

Option C: Markdown to PDF with Markdown2 + ReportLab

run_shell
command: python3 << 'EOF'
import markdown
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

# Read markdown
with open('/tmp/document_source.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

# Convert to HTML
html_content = markdown.markdown(md_content)

# Create PDF
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Parse HTML and add to story (simplified - use html2text for production)
story.append(Paragraph("Document Content", styles['Normal']))

doc.build(story)
print("PDF created successfully")
EOF

Complete Example

# Generate Multi-Format Report

## Step 0: Check environment
run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

## Step 1: Write Markdown source
write_file
path: /tmp/report.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  — Major finding with em-dash
  "Quote" with curly quotes
  ✓ Completed items

## Step 2: Create sanitized version (for PDF only)
write_file
path: /tmp/report_sanitized.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  -- Major finding with em-dash
  "Quote" with straight quotes
  [x] Completed items

## Step 3: Convert to DOCX (from original)
run_shell
command: pandoc /tmp/report.md -o report.docx

## Step 4: Convert to PDF (from sanitized, with xelatex)
run_shell
command: pandoc /tmp/report_sanitized.md -o report.pdf --pdf-engine=xelatex

## Step 5: Convert to HTML (from original)
run_shell
command: pandoc /tmp/report.md -o report.html

## Step 6: Verify all outputs
run_shell
command: ls -lh report.* && echo "All files created"

Error Handling & Recovery

Error	Likely Cause	Recovery Action
`pandoc: command not found`	Pandoc not installed	Install pandoc or use Python fallback
`pdflatex not found`	LaTeX missing	Use `--pdf-engine=xelatex` or `wkhtmltopdf`
`! LaTeX Error: File X.sty not found`	Missing LaTeX package	Install package or use xelatex/wkhtmltopdf
`Encoding error`	Unicode in PDF	Use sanitized file + xelatex engine
`wkhtmltopdf: command not found`	wkhtmltopdf missing	Install or use xelatex/pdflatex
`PDF is empty (0 bytes)`	Conversion silently failed	Check pandoc stderr, try alternative engine
`fpdf2 Unicode error`	Non-Latin characters	Use reportlab with Unicode font or sanitize

Recovery Workflow

Check error message for specific tool/engine failure
Try next engine in priority: xelatex → pdflatex → wkhtmltopdf → fpdf2/reportlab
If all pandoc attempts fail: Switch to Python fallback (fpdf2/reportlab)
If Unicode issues persist: Apply stricter sanitization or use xelatex with Unicode font
Document which approach succeeded for future reference

Unicode Sanitization Script (Reusable)

Save as

sanitize_for_pdf.sh

#!/bin/bash
# sanitize_for_pdf.sh - Replace problematic unicode chars for LaTeX/PDF
if [ -z "$1" ]; then
  echo "Usage: $0 <input.md> [output.md]"
  exit 1
fi
INPUT="$1"
OUTPUT="${2:-${1%.md}_sanitized.md}"

sed -e 's/—/--/g' \
    -e 's/–/-/g' \
    -e 's/"([^"]*)"/"\1"/g' \
    -e "s/'([^']*)/'\1'/g" \
    -e 's/…/.../g' \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/✓/[x]/g' \
    -e 's/✗/[ ]/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(r)/g' \
    -e 's/™/(tm)/g' \
    "$INPUT" > "$OUTPUT"

echo "Sanitized: $INPUT -> $OUTPUT"

Make executable:

chmod +x sanitize_for_pdf.sh

Usage:

run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md

Common pandoc Commands Reference

# Markdown to Word
pandoc input.md -o output.docx

# Markdown to PDF (LaTeX - default pdflatex)
pandoc input.md -o output.pdf

# Markdown to PDF with xelatex (best Unicode support)
pandoc input.md -o output.pdf --pdf-engine=xelatex

# Markdown to PDF with wkhtmltopdf (HTML-based, no LaTeX)
pandoc input.md -o output.pdf --pdf-engine=wkhtmltopdf

# Markdown to HTML
pandoc input.md -o output.html

# With metadata
pandoc input.md -o output.pdf --metadata title="Document Title"

# With custom reference doc (DOCX)
pandoc input.md -o output.docx --reference-doc=template.docx

# With custom template (HTML/PDF)
pandoc input.md --template=template.html -o output.html

# Force UTF-8 input
pandoc -f markdown+utf8 input.md -o output.pdf

Troubleshooting Quick Reference

PDF generation fails with encoding error:

Use sanitized markdown file
Try
```
--pdf-engine=xelatex
```
for better Unicode support
Add
```
-f markdown+utf8
```
to pandoc command

PDF generation fails: LaTeX not found:

Install:

apt-get install texlive-latex-recommended texlive-fonts-recommended

Or use:
```
--pdf-engine=wkhtmltopdf
```
Or use: Python fallback (fpdf2/reportlab)

DOCX formatting issues:

Add
```
--reference-doc=template.docx
```
for custom styles
Check markdown structure (headings, lists)

Unicode/encoding errors in any format:

Ensure source file is UTF-8:
```
file -i source.md
```
Add
```
-f markdown+utf8
```
to pandoc command
Use xelatex for PDF

Special characters not rendering in PDF:

Use the character replacement table
Create sanitized version before PDF conversion
Try xelatex with Unicode font

Missing pandoc:

Ubuntu/Debian:
```
apt-get install pandoc
```
macOS:
```
brew install pandoc
```
Fallback: Use Python libraries directly

When to Use shell_agent vs Manual Workflow

Scenario	Recommended Approach
Simple DOCX/HTML, no Unicode	`shell_agent` is fine
PDF generation required	Manual workflow (this skill)
Multiple formats from one source	Manual workflow (this skill)
Heavy Unicode/special characters	Manual workflow with sanitization
`shell_agent` failed with unknown error	Manual workflow (this skill)
Need explicit error visibility	Manual workflow (this skill)
Environment constraints (no pandoc)	Python fallback (this skill)

After successful manual workflow: You can attempt

shell_agent

for similar future tasks, but keep this workflow as your known-working fallback.

Related Skills

```
document-gen-fallback
```
: Original fallback workflow (less Unicode guidance)
```
write-file-fallback-report
```
: For when multiple tools fail simultaneously
```
spreadsheet-direct-python
```
: For Excel/CSV generation with Python

Skill Health & Metrics

Expected success rate: 85%+ with proper environment setup Fallback activation: Use Python fallback if pandoc/LaTeX unavailable Verification: Always verify output files exist AND have content (>0 bytes) *** End Files *** Begin Files