OpenSpace document-gen-resilient-multiformat

Resilient multi-format document generation with environment checks, auto-sanitization, and fallback engines

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b" ~/.claude/skills/hkuds-openspace-document-gen-resilient-multiformat && rm -rf "$T"
manifest: gdpval_bench/skills/document-gen-fallback-enhanced-enhanced-618c9b/SKILL.md

Resilient Document Generation Workflow (Multi-Format + Unicode-Safe)

When to Use

Use this skill when generating documents in multiple formats (.docx, .pdf, .html) and you need:

  • Reliable execution with pre-flight environment verification
  • Automatic Unicode handling based on target format
  • Fallback options when primary tools (pandoc/LaTeX) are unavailable
  • Clear error isolation through discrete, observable steps

Trigger this workflow when:

  • shell_agent returns unknown or unclear errors on document generation
  • You need multiple output formats from one source
  • Documents contain special characters, symbols, or non-ASCII text
  • Previous PDF generation attempts failed due to encoding or missing dependencies

Pre-Flight Environment Check

Before starting, verify your environment has the required tools:

run_shell
command: which pandoc && pandoc --version | head -1
run_shell
command: which pdflatex || which xelatex || which wkhtmltopdf || echo "No PDF engine found"
run_shell
command: python3 -c "import fpdf; print('fpdf2 available')" 2>/dev/null || echo "fpdf2 not available"
run_shell
command: python3 -c "import reportlab; print('reportlab available')" 2>/dev/null || echo "reportlab not available"

If pandoc is missing: install it with apt-get install pandoc (Ubuntu/Debian) or brew install pandoc (macOS).

If no PDF engine is found, choose a fallback:

  • Install LaTeX: apt-get install texlive-latex-recommended texlive-fonts-recommended (on Debian/Ubuntu, xelatex ships in the separate texlive-xetex package)
  • Or use wkhtmltopdf: apt-get install wkhtmltopdf
  • Or use Python fallback (fpdf2/reportlab) - see Alternative PDF Generation section
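The four shell checks above can also be gathered into one report. The following is a minimal sketch using only the standard library; the function name preflight and the shape of the returned dictionary are illustrative assumptions, not part of the skill.

```python
import importlib.util
import shutil


def preflight() -> dict:
    """Report which document-generation tools are present on this machine."""
    # Same engine names the shell checks above look for, in the same order.
    pdf_engines = [e for e in ("xelatex", "pdflatex", "wkhtmltopdf") if shutil.which(e)]
    # fpdf2 is imported as "fpdf"; find_spec returns None when a module is absent.
    python_fallbacks = [m for m in ("fpdf", "reportlab") if importlib.util.find_spec(m)]
    return {
        "pandoc": shutil.which("pandoc") is not None,
        "pdf_engines": pdf_engines,
        "python_fallbacks": python_fallbacks,
    }


if __name__ == "__main__":
    print(preflight())
```

An empty pdf_engines list together with an empty python_fallbacks list means PDF output is not possible until something is installed.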

Core Technique

Split the workflow into discrete, observable steps with automatic format detection:

  1. Environment check → Verify tools are available
  2. Content creation → Use write_file to create source Markdown
  3. Auto-sanitization → Conditionally sanitize based on target format (PDF needs it, DOCX/HTML don't)
  4. Format conversion → Use appropriate tool for each format with fallback options
  5. Verification → Check output files exist and validate content

⚠️ Unicode & Format Compatibility Matrix

| Format | Unicode Support | Sanitization Needed | Recommended Engine |
|---|---|---|---|
| .docx | Excellent | No | pandoc (default) |
| .html | Excellent | No | pandoc (default) |
| .pdf | Limited (LaTeX) | Yes | xelatex > pdflatex > wkhtmltopdf > fpdf2 |
| .pptx | Good | No | pandoc (if available) or python-pptx |

Critical Unicode Characters for PDF

| Character | Issue | Safe Replacement |
|---|---|---|
| — (em dash) | May not render | -- or - |
| – (en dash) | May not render | - |
| “ ” (curly quotes) | Encoding errors | " " (straight quotes) |
| ‘ ’ (curly apostrophe) | Encoding errors | ' (straight apostrophe) |
| … (ellipsis) | May not render | ... |
| → ← ↑ ↓ (arrows) | LaTeX incompatibility | -> <- ^ v |
| ✓ ✗ (checkmarks) | May not render | [x] [ ] |
| (symbols) | May not render | * - |
| © ® ™ | May require packages | (c) (r) (tm) |
| Non-ASCII letters (é, ñ, ü) | Font-dependent | Use xeLaTeX or replace |
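The replacement table above maps directly onto a character translation table. This is a minimal sketch; the names REPLACEMENTS and sanitize_for_pdf are illustrative. It covers only the characters the table names explicitly and leaves accented letters untouched, on the assumption that xelatex handles those.

```python
# Characters from the table above, written as escapes so the mapping is unambiguous.
REPLACEMENTS = {
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
    "\u201c": '"',    # left curly double quote
    "\u201d": '"',    # right curly double quote
    "\u2018": "'",    # left curly single quote
    "\u2019": "'",    # right curly single quote / apostrophe
    "\u2026": "...",  # ellipsis
    "\u2192": "->",   # right arrow
    "\u2190": "<-",   # left arrow
    "\u2191": "^",    # up arrow
    "\u2193": "v",    # down arrow
    "\u2713": "[x]",  # check mark
    "\u2717": "[ ]",  # ballot x
    "\u00a9": "(c)",  # copyright sign
    "\u00ae": "(r)",  # registered sign
    "\u2122": "(tm)", # trade mark sign
}
_TABLE = str.maketrans(REPLACEMENTS)


def sanitize_for_pdf(text: str) -> str:
    """Replace the characters listed above; everything else passes through."""
    return text.translate(_TABLE)
```

For example, sanitize_for_pdf("A \u2014 B") returns "A -- B", while "caf\u00e9" is returned unchanged.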

Step-by-Step Workflow

Step 0: Pre-Flight Check

Verify environment before proceeding:

run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

If PANDOC_MISSING: Either install pandoc or use Alternative PDF Generation (Python-based)

Step 1: Create Source Content with write_file

Write your document content as Markdown to a source file:

write_file
path: /tmp/document_source.md
content: |
  # Document Title
  
  ## Section 1
  Content with original unicode characters...
  
  ## Section 2
  More content...

Step 2: Conditional Unicode Sanitization

Only sanitize if generating PDF. Create sanitized version for PDF conversion:

write_file
path: /tmp/document_source_sanitized.md
content: |
  # Document Title
  
  ## Section 1
  Content with unicode replaced (em-dash -> --, curly quotes -> straight, etc.)
  
  ## Section 2
  More content...

Optional: Use sanitization script (see Unicode Sanitization Script section below):

run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md

Note: Keep the original unsanitized file for DOCX/HTML conversion (these formats handle Unicode better).
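Before writing a sanitized copy, it can help to see which non-ASCII characters the source actually contains. The sketch below is one way to list them with line numbers; find_non_ascii is a hypothetical helper, not something the skill ships.

```python
import unicodedata


def find_non_ascii(text: str) -> list:
    """Return (line_number, character, unicode_name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:
                # Characters without an assigned name are reported as "UNKNOWN".
                hits.append((line_no, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    with open("/tmp/document_source.md", encoding="utf-8") as f:
        for line_no, ch, name in find_non_ascii(f.read()):
            print(f"line {line_no}: {ch!r} ({name})")
```

An empty result suggests the original file can be used for PDF conversion as-is.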

Step 3: Convert to Target Formats with run_shell

Use appropriate commands for each format. Use sanitized source for PDF, original for others.

DOCX (from original):

run_shell
command: pandoc /tmp/document_source.md -o output.docx

PDF (from sanitized, with engine priority):

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=xelatex

If xelatex fails, try pdflatex:

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=pdflatex

If LaTeX engines fail, try wkhtmltopdf:

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf --pdf-engine=wkhtmltopdf

HTML (from original):

run_shell
command: pandoc /tmp/document_source.md -o output.html
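The PDF engine order in this step (xelatex, then pdflatex, then wkhtmltopdf) can be automated. The following is a sketch under stated assumptions: convert_pdf and its run parameter are illustrative names, and the run parameter exists only so the subprocess call can be swapped out in tests.

```python
import subprocess
from pathlib import Path

ENGINE_PRIORITY = ("xelatex", "pdflatex", "wkhtmltopdf")


def convert_pdf(source, output, engines=ENGINE_PRIORITY, run=subprocess.run):
    """Try each engine in order; return the first engine that works, else None."""
    for engine in engines:
        cmd = ["pandoc", str(source), "-o", str(output), f"--pdf-engine={engine}"]
        try:
            result = run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            return None  # pandoc itself is missing, so no engine can help
        out = Path(output)
        # Treat a zero-byte file as a failure even if pandoc exited cleanly.
        if result.returncode == 0 and out.exists() and out.stat().st_size > 0:
            return engine
        print(f"{engine} failed: {result.stderr.strip()[:200]}")
    return None
```

A call such as convert_pdf("/tmp/document_source_sanitized.md", "output.pdf") returns the engine name that succeeded, which is worth recording for later runs.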

Step 4: Verify Outputs

Check that files were created and have content:

run_shell
command: ls -lh output.* && file output.*
run_shell
command: test -s output.pdf && echo "PDF has content" || echo "PDF is empty"
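A size check alone does not catch a file that exists but holds the wrong content. The sketch below adds a signature check: PDF files start with the bytes %PDF- and DOCX files are ZIP containers that start with PK. The helper name verify_output and its return strings are assumptions made for illustration.

```python
from pathlib import Path

# Leading bytes expected for each format; HTML has no fixed signature, so it is skipped.
MAGIC = {".pdf": b"%PDF-", ".docx": b"PK"}


def verify_output(path) -> str:
    """Return 'ok', 'missing', 'empty', or 'wrong signature' for one output file."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    expected = MAGIC.get(p.suffix.lower())
    if expected:
        with p.open("rb") as f:
            if f.read(len(expected)) != expected:
                return "wrong signature"
    return "ok"


if __name__ == "__main__":
    for name in ("output.docx", "output.pdf", "output.html"):
        print(name, verify_output(name))
```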

PDF Engine Decision Tree

Is pandoc available?
├─ NO → Use Python fallback (fpdf2 or reportlab)
└─ YES → Which PDF engine is available?
    ├─ xelatex → Use: pandoc --pdf-engine=xelatex (best Unicode)
    ├─ pdflatex → Use: pandoc --pdf-engine=pdflatex + sanitization
    ├─ wkhtmltopdf → Use: pandoc --pdf-engine=wkhtmltopdf
    └─ none → Install LaTeX or use Python fallback
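The decision tree can be written as a small pure function, which makes the choice easy to log. This is only a sketch: choose_pdf_route and its return strings are hypothetical, and when pandoc is present but no engine is found it prefers an installed Python library over installing LaTeX.

```python
def choose_pdf_route(pandoc: bool, engines: set, python_libs: set) -> str:
    """Pick a PDF route from what is installed, following the tree above."""
    if pandoc:
        # Engine order matches the priority used throughout this workflow.
        for engine in ("xelatex", "pdflatex", "wkhtmltopdf"):
            if engine in engines:
                return f"pandoc --pdf-engine={engine}"
    for lib in ("fpdf2", "reportlab"):
        if lib in python_libs:
            return f"python:{lib}"
    return "install pandoc plus a PDF engine, or a Python PDF library"
```

For instance, choose_pdf_route(True, {"pdflatex", "wkhtmltopdf"}, set()) returns "pandoc --pdf-engine=pdflatex".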

Alternative PDF Generation (Python Fallback)

When pandoc or LaTeX is unavailable, use Python libraries directly:

Option A: fpdf2 (Simple, fast)

run_shell
command: python3 << 'EOF'
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
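pdf.set_font("Helvetica", size=12)  # built-in core font; fpdf2 maps "Arial" to it anyway
# Note: core fonts cover Latin-1 only - for other text, embed a TTF with pdf.add_font()
# Width 0 spans to the right margin; new_x/new_y replace the deprecated ln=True
pdf.cell(0, 10, "Document Title", align="C", new_x="LMARGIN", new_y="NEXT")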
pdf.output("output.pdf")
EOF

Option B: reportlab (More control, better Unicode)

run_shell
command: python3 << 'EOF'
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# Register a Unicode font if needed
# pdfmetrics.registerFont(TTFont('UnicodeFont', 'path/to/font.ttf'))

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Document Title")
c.save()
EOF

Option C: Markdown to PDF with Python-Markdown + ReportLab

run_shell
command: python3 << 'EOF'
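import html
import re
from xml.sax.saxutils import escape

import markdown
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

# Read markdown
with open('/tmp/document_source.md', 'r', encoding='utf-8') as f:
    md_content = f.read()

# Convert to HTML
html_content = markdown.markdown(md_content)

# Create PDF
doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Simplified HTML walk: headings, paragraphs, and flat list items only.
# Inline markup and code blocks are dropped - use a real HTML parser for production.
for tag, inner in re.findall(r"<(h[1-6]|p|li)>(.*?)</\1>", html_content, re.S):
    # Strip nested tags, then re-escape so Paragraph's XML parser accepts the text
    text = escape(html.unescape(re.sub(r"<[^>]+>", "", inner))).strip()
    if tag.startswith("h"):
        style = styles[f"Heading{min(int(tag[1]), 3)}"]
    else:
        style = styles["Normal"]
    story.append(Paragraph(("- " if tag == "li" else "") + text, style))
    story.append(Spacer(1, 6))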

doc.build(story)
print("PDF created successfully")
EOF

Complete Example

# Generate Multi-Format Report

## Step 0: Check environment
run_shell
command: pandoc --version >/dev/null 2>&1 && echo "PANDOC_OK" || echo "PANDOC_MISSING"

## Step 1: Write Markdown source
write_file
path: /tmp/report.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  — Major finding with em-dash
  “Quote” with curly quotes
  ✓ Completed items

## Step 2: Create sanitized version (for PDF only)
write_file
path: /tmp/report_sanitized.md
content: |
  # Quarterly Report
  
  ## Executive Summary
  Performance metrics and analysis...
  
  ## Key Findings
  -- Major finding with em-dash
  "Quote" with straight quotes
  [x] Completed items

## Step 3: Convert to DOCX (from original)
run_shell
command: pandoc /tmp/report.md -o report.docx

## Step 4: Convert to PDF (from sanitized, with xelatex)
run_shell
command: pandoc /tmp/report_sanitized.md -o report.pdf --pdf-engine=xelatex

## Step 5: Convert to HTML (from original)
run_shell
command: pandoc /tmp/report.md -o report.html

## Step 6: Verify all outputs
run_shell
command: for f in report.docx report.pdf report.html; do test -s "$f" && echo "$f OK" || echo "$f missing or empty"; done

Error Handling & Recovery

| Error | Likely Cause | Recovery Action |
|---|---|---|
| pandoc: command not found | Pandoc not installed | Install pandoc or use Python fallback |
| pdflatex not found | LaTeX missing | Use --pdf-engine=xelatex or wkhtmltopdf |
| ! LaTeX Error: File X.sty not found | Missing LaTeX package | Install package or use xelatex/wkhtmltopdf |
| Encoding error | Unicode in PDF | Use sanitized file + xelatex engine |
| wkhtmltopdf: command not found | wkhtmltopdf missing | Install or use xelatex/pdflatex |
| PDF is empty (0 bytes) | Conversion silently failed | Check pandoc stderr, try alternative engine |
| fpdf2 Unicode error | Non-Latin characters | Use reportlab with Unicode font or sanitize |

Recovery Workflow

  1. Check error message for specific tool/engine failure
  2. Try next engine in priority: xelatex → pdflatex → wkhtmltopdf → fpdf2/reportlab
  3. If all pandoc attempts fail: Switch to Python fallback (fpdf2/reportlab)
  4. If Unicode issues persist: Apply stricter sanitization or use xelatex with Unicode font
  5. Document which approach succeeded for future reference
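Step 1 of the recovery workflow, reading the error message for the failing tool, can be partly mechanised by matching stderr against the error table. The sketch below is illustrative only: suggest_recovery is a hypothetical helper, and the substrings it looks for are assumptions about typical messages, which vary between tool versions.

```python
# Each rule: substrings that must all appear (case-insensitive), then the action
# taken from the error table. More specific rules come first.
RECOVERY_RULES = [
    (("pandoc: command not found",), "Install pandoc or use Python fallback"),
    (("wkhtmltopdf: command not found",), "Install wkhtmltopdf or use xelatex/pdflatex"),
    (("pdflatex not found",), "Use --pdf-engine=xelatex or wkhtmltopdf"),
    ((".sty", "not found"), "Install the LaTeX package or use xelatex/wkhtmltopdf"),
    (("unicode character",), "Use sanitized file + xelatex engine"),
    (("encoding",), "Use sanitized file + xelatex engine"),
]


def suggest_recovery(stderr: str) -> str:
    """Map an error message to the recovery action from the table above."""
    lowered = stderr.lower()
    for needles, action in RECOVERY_RULES:
        if all(n in lowered for n in needles):
            return action
    return "Check stderr manually and try the next engine in priority order"
```

Anything the rules do not recognise falls through to the generic advice, so the function never hides an unfamiliar error.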

Unicode Sanitization Script (Reusable)

Save as sanitize_for_pdf.sh:

#!/bin/bash
# sanitize_for_pdf.sh - Replace problematic unicode chars for LaTeX/PDF
if [ -z "$1" ]; then
  echo "Usage: $0 <input.md> [output.md]"
  exit 1
fi
INPUT="$1"
OUTPUT="${2:-${1%.md}_sanitized.md}"

sed -e 's/—/--/g' \
    -e 's/–/-/g' \
    -e 's/“/"/g' \
    -e 's/”/"/g' \
    -e "s/‘/'/g" \
    -e "s/’/'/g" \
    -e 's/…/.../g' \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/✓/[x]/g' \
    -e 's/✗/[ ]/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(r)/g' \
    -e 's/™/(tm)/g' \
    "$INPUT" > "$OUTPUT"

echo "Sanitized: $INPUT -> $OUTPUT"

Make executable:

chmod +x sanitize_for_pdf.sh

Usage:

run_shell
command: ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md

Common pandoc Commands Reference

# Markdown to Word
pandoc input.md -o output.docx

# Markdown to PDF (LaTeX - default pdflatex)
pandoc input.md -o output.pdf

# Markdown to PDF with xelatex (best Unicode support)
pandoc input.md -o output.pdf --pdf-engine=xelatex

# Markdown to PDF with wkhtmltopdf (HTML-based, no LaTeX)
pandoc input.md -o output.pdf --pdf-engine=wkhtmltopdf

# Markdown to HTML
pandoc input.md -o output.html

# With metadata
pandoc input.md -o output.pdf --metadata title="Document Title"

# With custom reference doc (DOCX)
pandoc input.md -o output.docx --reference-doc=template.docx

# With custom template (HTML/PDF)
pandoc input.md --template=template.html -o output.html

# Non-UTF-8 input: pandoc always reads UTF-8 (there is no +utf8 extension), so convert first
iconv -t utf-8 input.md | pandoc -f markdown -o output.pdf

Troubleshooting Quick Reference

PDF generation fails with encoding error:

  • Use sanitized markdown file
  • Try --pdf-engine=xelatex for better Unicode support
  • Confirm the input file is UTF-8; pandoc reads only UTF-8, so convert other encodings with iconv first

PDF generation fails: LaTeX not found:

  • Install: apt-get install texlive-latex-recommended texlive-fonts-recommended
  • Or use: --pdf-engine=wkhtmltopdf
  • Or use: Python fallback (fpdf2/reportlab)

DOCX formatting issues:

  • Add --reference-doc=template.docx for custom styles
  • Check markdown structure (headings, lists)

Unicode/encoding errors in any format:

  • Ensure source file is UTF-8: file -i source.md
  • If it is not, convert it before running pandoc (which reads only UTF-8): iconv -t utf-8 source.md > source_utf8.md
  • Use xelatex for PDF
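Where the file command is unavailable, the UTF-8 check can be done by attempting a strict decode. This is a minimal sketch; is_valid_utf8 is an illustrative name.

```python
def is_valid_utf8(path) -> bool:
    """Return True if the whole file decodes as UTF-8 without errors."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid byte sequence
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8("source.md"))
```

Pure ASCII files also pass, since ASCII is a subset of UTF-8.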

Special characters not rendering in PDF:

  • Use the character replacement table
  • Create sanitized version before PDF conversion
  • Try xelatex with Unicode font

Missing pandoc:

  • Ubuntu/Debian: apt-get install pandoc
  • macOS: brew install pandoc
  • Fallback: Use Python libraries directly

When to Use shell_agent vs Manual Workflow

| Scenario | Recommended Approach |
|---|---|
| Simple DOCX/HTML, no Unicode | shell_agent is fine |
| PDF generation required | Manual workflow (this skill) |
| Multiple formats from one source | Manual workflow (this skill) |
| Heavy Unicode/special characters | Manual workflow with sanitization |
| shell_agent failed with unknown error | Manual workflow (this skill) |
| Need explicit error visibility | Manual workflow (this skill) |
| Environment constraints (no pandoc) | Python fallback (this skill) |

After a successful manual workflow: you can attempt shell_agent for similar future tasks, but keep this workflow as your known-working fallback.

Related Skills

  • document-gen-fallback: Original fallback workflow (less Unicode guidance)
  • write-file-fallback-report: For when multiple tools fail simultaneously
  • spreadsheet-direct-python: For Excel/CSV generation with Python

Skill Health & Metrics

  • Expected success rate: 85%+ with proper environment setup
  • Fallback activation: Use Python fallback if pandoc/LaTeX unavailable
  • Verification: Always verify output files exist AND have content (>0 bytes)