OpenSpace pandoc-unicode-sanitize

Pre-process Markdown to replace non-ASCII characters before pandoc PDF conversion

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pandoc-unicode-sanitize" ~/.claude/skills/hkuds-openspace-pandoc-unicode-sanitize && rm -rf "$T"
manifest: gdpval_bench/skills/pandoc-unicode-sanitize/SKILL.md
source content

Pandoc Unicode Sanitization for PDF Conversion

Problem

LaTeX (used by pandoc for PDF output) may fail or produce errors when encountering non-ASCII Unicode characters like currency symbols (₹, €, £), special quotes, or other Unicode characters not supported by the default LaTeX engine.

Solution

Pre-process Markdown files to replace problematic Unicode characters with ASCII equivalents before running pandoc PDF conversion.

Steps

1. Identify Problematic Characters

Common problematic characters include:

  • Currency symbols: ₹, €, £, ¥
  • Special quotes: " " ' '
  • Arrows: →, ←, ↑, ↓
  • Bullets: •, ◦
  • Other symbols: ©, ®, ™, ±, ×, ÷, —, –

2. Create a Sed Substitution Script

Create a shell script for reusable sanitization:

#!/bin/bash
# sanitize-for-pdf.sh

sed -e 's/₹/Rs/g' \
    -e 's/€/EUR/g' \
    -e 's/£/GBP/g' \
    -e 's/¥/JPY/g' \
    -e 's/"/"/g' \
    -e 's/"/"/g' \
    -e "s/'/'/g" \
    -e "s/'/'/g" \
    -e 's/→/->/g' \
    -e 's/←/<-/g' \
    -e 's/•/-/g' \
    -e 's/©/(c)/g' \
    -e 's/®/(R)/g' \
    -e 's/™/(TM)/g' \
    -e 's/±/+/ -/g' \
    -e 's/×/x/g' \
    -e 's/÷/\//g' \
    -e 's/—/--/g' \
    -e 's/–/-/g' \
    "$1"

3. Run Sanitization Before Pandoc

# Using the script
./sanitize-for-pdf.sh input.md > sanitized.md
pandoc sanitized.md -o output.pdf

# Or inline
cat input.md | sed 's/₹/Rs/g; s/€/EUR/g; s/£/GBP/g' | pandoc -o output.pdf

4. Alternative: Use iconv for Automatic Transliteration

For broader Unicode coverage:

iconv -f UTF-8 -t ASCII//TRANSLIT < input.md > sanitized.md
pandoc sanitized.md -o output.pdf

5. Alternative: Use XeLaTeX or LuaLaTeX Engine

If you want to preserve Unicode characters, use a Unicode-capable engine:

pandoc input.md -o output.pdf --pdf-engine=xelatex
# or
pandoc input.md -o output.pdf --pdf-engine=lualatex

Best Practices

  • Sanitize files before conversion rather than debugging LaTeX errors after
  • Keep a mapping of your organization's common symbols and preferred replacements
  • Test with a small sample before processing large documents
  • Consider using XeLaTeX/LuaLaTeX if Unicode preservation is required
  • Document which characters cause issues in your specific environment

Common Error Signs

Look for these LaTeX errors that indicate Unicode issues:

  • Package inputenc Error: Unicode character ... not set up
  • LaTeX Error: Unicode character ... cannot be used with pdflatex
  • Missing or garbled characters in the output PDF