OpenSpace document-gen-resilient-workflow

Robust document generation with engine fallback chain, error diagnostics, and Python alternatives

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/document-gen-fallback-enhanced-enhanced" ~/.claude/skills/hkuds-openspace-document-gen-resilient-workflow-cadbfe && rm -rf "$T"

manifest: gdpval_bench/skills/document-gen-fallback-enhanced-enhanced/SKILL.md

source content

Resilient Document Generation Workflow

When to Use

Use this skill when document generation tasks encounter tool errors or failures, especially:

```
run_shell
```
commands return 'unknown error' without diagnostic details
PDF generation fails with LaTeX encoding or missing engine errors
You need fallback options when primary conversion methods fail
Workspace or path issues cause file access problems
Unicode/special characters cause rendering issues in PDF output
```
shell_agent
```
has failed multiple times on similar tasks

Core Technique

Split document generation into observable, verifiable steps with explicit error capture and progressive fallback:

Validate workspace → Confirm current directory and write permissions
Pre-flight checks → Verify required tools (pandoc, LaTeX engines) are available
Content creation → Use
```
write_file
```
for source Markdown with full visibility
Unicode sanitization → Preprocess for PDF conversion when needed
Progressive conversion → Try multiple engines/methods until one succeeds
Explicit error capture → Redirect stderr to diagnose failures
Verification → Confirm output files exist and are readable

⚠️ Critical: Tool Error Patterns

'unknown error' from run_shell: This typically indicates:

Sandbox execution context issues (not the command itself)
Missing stderr capture preventing diagnosis
Transient file access problems

Solution: Use explicit error capture with

2>&1

and verify tool availability first:

run_shell
command: pandoc --version 2>&1 || echo "PANDOC NOT AVAILABLE"

Pre-Flight Checklist

Before starting conversion, verify your environment:

## Pre-Flight Checks

### Check 1: Verify working directory
run_shell
command: pwd && ls -la

### Check 2: Verify pandoc availability
run_shell
command: pandoc --version 2>&1 | head -3 || echo "PANDOC MISSING"

### Check 3: Check LaTeX engines
run_shell
command: which pdflatex xelatex wkhtmltopdf 2>&1 || echo "Some engines missing"

### Check 4: Check Python libraries (fallback option)
run_shell
command: python3 -c "import fpdf; print('fpdf OK')" 2>&1 || echo "fpdf missing"
run_shell
command: python3 -c "import docx; print('python-docx OK')" 2>&1 || echo "python-docx missing"

Complete Step-by-Step Workflow

Step 1: Validate Workspace

Confirm you're in the correct directory and can write files:

run_shell
command: pwd && echo "Test write" > /tmp/workspace_test_$$.txt && ls -la /tmp/workspace_test_$$.txt && rm /tmp/workspace_test_$$.txt

If this fails, you may have directory permission issues. Use absolute paths for all files.

Step 2: Create Source Content with write_file

Write document content as Markdown to a source file:

write_file
path: /tmp/document_source.md
content: |
  # Document Title
  
  ## Section 1
  Content here...

Step 3: Unicode Sanitization (PDF Only)

Only needed for PDF conversion via LaTeX engines. DOCX and HTML handle Unicode natively.

Create a sanitized version replacing LaTeX-problematic characters:

write_file
path: /tmp/document_source_sanitized.md
content: |
  # Document Title
  
  [Same content with: em-dash → --, curly quotes → straight, → arrows → text, etc.]

Common replacements (see full table below):

Original	Replace With
`—` (em dash)	`--`
`" "` (curly quotes)	`" "` (straight)
`…` (ellipsis)	`...`
`→` `←` (arrows)	`->` `<-`
`✓` `✗`	`[x]` `[ ]`
`©` `®` `™`	`(c)` `(r)` `(tm)`

Alternative: Use sanitization script (see Scripts section below).

Step 4: Progressive Format Conversion

Try multiple methods in order of preference. Capture stderr explicitly for diagnosis.

Method A: DOCX Conversion (Most Reliable)

run_shell
command: pandoc /tmp/document_source.md -o output.docx 2>&1 && echo "SUCCESS" || echo "FAILED: $?"

If this fails, try with explicit UTF-8 flag:

run_shell
command: pandoc /tmp/document_source.md -f markdown+utf8 -o output.docx 2>&1

Method B: PDF Conversion (Progressive Fallback)

Try engines in this order until one succeeds:

run_shell
command: pandoc /tmp/document_source_sanitized.md -o output.pdf 2>&1 | tee /tmp/pdf_error.log && echo "PDF SUCCESS"

If Method B fails, check error log and try next engine:

run_shell
command: pandoc /tmp/document_source_sanitized.md --pdf-engine=xelatex -o output.pdf 2>&1 | tee /tmp/pdf_error.log && echo "PDF SUCCESS"

If xelatex fails:

run_shell
command: pandoc /tmp/document_source_sanitized.md --pdf-engine=wkhtmltopdf -o output.pdf 2>&1 | tee /tmp/pdf_error.log && echo "PDF SUCCESS"

Method C: Python Library Fallback (When pandoc Unavailable)

If all pandoc methods fail, use Python libraries:

Option C1: FPDF2 for PDF

execute_code_sandbox
language: python
code: |
  from fpdf import FPDF
  
  pdf = FPDF()
  pdf.add_page()
  pdf.set_font('Arial', 'B', 16)
  pdf.cell(40, 10, 'Document Title')
  pdf.ln(20)
  pdf.set_font('Arial', '', 12)
  pdf.multi_cell(0, 10, 'Your content here...')
  pdf.output('output.pdf')
  print('PDF created successfully')

Option C2: python-docx for DOCX

execute_code_sandbox
language: python
code: |
  from docx import Document
  
  doc = Document()
  doc.add_heading('Document Title', 0)
  doc.add_paragraph('Your content here...')
  doc.save('output.docx')
  print('DOCX created successfully')

Step 5: Verify Output Files

Don't just check existence—verify files are valid:

run_shell
command: ls -lh output.* 2>&1

run_shell
command: file output.docx output.pdf 2>&1

For DOCX, verify content:

run_shell
command: unzip -p output.docx word/document.xml 2>&1 | head -50 || echo "Cannot read DOCX"

For PDF, check page count:

run_shell
command: pdfinfo output.pdf 2>&1 | grep Pages || echo "pdfinfo not available"

Alternative verification via Python:

execute_code_sandbox
code: |
  import os
  for f in ['output.docx', 'output.pdf', 'output.html']:
    if os.path.exists(f):
      size = os.path.getsize(f)
      print(f'{f}: {size} bytes')
    else:
      print(f'{f}: NOT FOUND')

Complete Example: Generating a Report

# Generate Quarterly Report (DOCX + PDF)

## Step 1: Pre-flight checks
run_shell
command: pandoc --version 2>&1 | head -1

## Step 2: Write Markdown source
write_file
path: /tmp/quarterly_report.md
content: |
  # Quarterly Business Report
  
  ## Executive Summary
  Q4 performance exceeded expectations...
  
  ## Key Metrics
  - Revenue: $1.2M
  - Growth: 15%
  
  ## Recommendations
  Continue current strategy...

## Step 3: Create sanitized version for PDF
write_file
path: /tmp/quarterly_report_sanitized.md
content: |
  # Quarterly Business Report
  
  ## Executive Summary
  Q4 performance exceeded expectations...
  
  [Rest of content with special chars replaced]

## Step 4: Generate DOCX
run_shell
command: pandoc /tmp/quarterly_report.md -o quarterly_report.docx 2>&1 && echo "DOCX OK"

## Step 5: Generate PDF (try xelatex for Unicode safety)
run_shell
command: pandoc /tmp/quarterly_report_sanitized.md --pdf-engine=xelatex -o quarterly_report.pdf 2>&1 && echo "PDF OK"

## Step 6: Verify outputs
run_shell
command: ls -lh quarterly_report.* && file quarterly_report.*

Full Unicode Replacement Table

Character	Unicode	Issue	Safe Replacement
`—`	U+2014	LaTeX incompatible	`--`
`–`	U+2013	LaTeX incompatible	`-`
`"` `"`	U+201C/U+201D	Encoding errors	`"` (straight)
`'` `'`	U+2018/U+2019	Encoding errors	`'` (straight)
`…`	U+2026	May not render	`...`
`→`	U+2192	LaTeX incompatible	`->` or `to`
`←`	U+2190	LaTeX incompatible	`<-`
`↑` `↓`	U+2191/U+2193	LaTeX incompatible	`^` `v`
`•`	U+2022	Font-dependent	`-` or `*`
`✓`	U+2713	May not render	`[x]`
`✗`	U+2717	May not render	`[ ]`
`★` `●`	U+2605/U+25CF	May not render	`*` `-`
`©`	U+00A9	May require packages	`(c)`
`®`	U+00AE	May require packages	`(r)`
`™`	U+2122	May require packages	`(tm)`
`é` `ñ` `ü`	Various	Font-dependent	Use xelatex or replace

Reusable Sanitization Script

Save this script for repeated use:

#!/bin/bash
# sanitize_for_pdf.sh - Replace problematic unicode chars for LaTeX/PDF
set -e
if [ -z "$1" ]; then
  echo "Usage: $0 <input.md> [output.md]"
  exit 1
fi
INPUT="$1"
OUTPUT="${2:-${1%.md}_sanitized.md}"

sed -e 's/—/--/g' \
    -e 's/–/-/g' \
    -e 's/"\([^"]*\)"/"\1"/g' \
    -e "s/'\([^']*\)'/\'\1\'/g" \
    -e 's/…/.../g' \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/↑/^/g' \
    -e 's/↓/v/g' \
    -e 's/•/-/g' \
    -e 's/✓/[x]/g' \
    -e 's/✗/[ ]/g' \
    -e 's/★/*/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(r)/g' \
    -e 's/™/(tm)/g' \
    "$INPUT" > "$OUTPUT"

echo "Sanitized: $INPUT -> $OUTPUT"

Usage:

run_shell
command: chmod +x sanitize_for_pdf.sh && ./sanitize_for_pdf.sh /tmp/document_source.md /tmp/document_source_sanitized.md

Error Diagnosis Guide

Error: "unknown error" from run_shell

Diagnosis:

run_shell
command: pandoc --version 2>&1
run_shell
command: echo "test" | pandoc -f markdown -t plain 2>&1

If pandoc works in isolation, the issue is likely:

File path problems (use absolute paths)
Sandbox context issues (retry or use shell_agent)
Transient errors (retry the same command)

Error: "LaTeX not found" or "pdflatex: command not found"

Solutions in order:

Try xelatex:
```
--pdf-engine=xelatex
```
Try wkhtmltopdf:
```
--pdf-engine=wkhtmltopdf
```

Install LaTeX:

apt-get install texlive-latex-recommended texlive-fonts-recommended

Use Python fpdf2 fallback (see Step 4, Method C)

Error: "Encoding error" or "Invalid UTF-8"

Solutions:

Add UTF-8 flag:
```
pandoc -f markdown+utf8
```
Check file encoding:
```
file -i source.md
```

Re-save with explicit UTF-8:

run_shell
command: iconv -f UTF-8 -t UTF-8 source.md -o source_utf8.md

Use sanitization step

Error: PDF generated but empty or garbled

Diagnosis:

run_shell
command: pdfinfo output.pdf 2>&1

execute_code_sandbox
code: |
  import fitz  # PyMuPDF
  doc = fitz.open('output.pdf')
  print(f'Pages: {doc.page_count}')
  if doc.page_count > 0:
      page = doc[0]
      print(f'Text: {page.get_text()[:200]}')

Solutions:

Use xelatex engine with Unicode content
Check sanitization step (if using pdflatex)
Verify source markdown is valid

Error: DOCX file won't open or shows error

Diagnosis:

run_shell
command: unzip -l output.docx 2>&1 | head -10

Solutions:

Add
```
--reference-doc
```
with valid template
Ensure markdown syntax is valid
Try pandoc verbose mode:
```
pandoc -v input.md -o output.docx 2>&1
```

Workspace Troubleshooting

Files appearing in wrong directory

Diagnosis:

run_shell
command: pwd && find /tmp -name "*.docx" -o -name "*.pdf" 2>/dev/null | head -10

Solutions:

Use absolute paths in all commands
Explicitly change to target directory:
```
cd /path/to/workspace && ...
```

After generation, copy files to correct location:

run_shell
command: cp /tmp/output.docx ./output.docx

Permission denied errors

Solutions:

Use /tmp directory for intermediate files
Copy final output to target location
Check disk space:
```
df -h .
```

When to Use shell_agent vs Manual Workflow

Situation	Recommended Approach
Simple DOCX generation, no special chars	`shell_agent` (faster)
PDF with Unicode/symbols	Manual workflow with sanitization
Multiple 'unknown error' failures	Manual workflow with error capture
Need to debug conversion issues	Manual workflow (visible steps)
Time-critical, simple format	`shell_agent`
Critical document, must succeed	Manual workflow with fallbacks

Alternative: Hybrid Approach

For reliability with less manual effort:

## Hybrid Workflow

1. **First attempt**: `shell_agent` with clear task description

2. **On first error**: Switch to manual workflow:
   - `write_file` for source
   - `run_shell` with error capture (2>&1)
   - Verify with explicit checks

3. **On repeated errors**: Use Python library fallback (execute_code_sandbox)

This balances speed with reliability.

Related Skills

```
workspace-path-troubleshooting
```
: For diagnosing file location issues
```
pdf-verification-cli
```
: For validating generated PDFs
```
write-file-fallback-report
```
: For content-first document creation when web sources fail

Quick Reference Card

QUICK START: Generate Document in 3 Formats

1. Write source: write_file → /tmp/doc.md

2. Sanitize (PDF only): 
   Replace: — → --, "" → "", … → ...

3. Convert with error capture:
   DOCX: pandoc /tmp/doc.md -o out.docx 2>&1
   PDF:  pandoc /tmp/doc_sanitized.md --pdf-engine=xelatex -o out.pdf 2>&1
   HTML: pandoc /tmp/doc.md -o out.html 2>&1

4. Verify: ls -lh out.* && file out.*

5. If pandoc fails: Use execute_code_sandbox with fpdf2 or python-docx