OpenSpace docx-dual-parse
Extract text from DOCX files using shell or Python zipfile, with environment-aware fallback
git clone https://github.com/HKUDS/OpenSpace
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/docx-shell-parse-enhanced-4bba79" ~/.claude/skills/hkuds-openspace-docx-dual-parse && rm -rf "$T"
gdpval_bench/skills/docx-shell-parse-enhanced-4bba79/SKILL.md
DOCX Dual-Method Text Extraction
Extract text from Microsoft Word (.docx) files using either shell commands or Python's zipfile module, automatically selecting the most reliable method for your environment.
When to Use
- Need reliable DOCX text extraction in varying environments (containers, sandboxes, minimal images)
- Python environment may lack python-docx but has standard library access
- Shell utilities (unzip, sed) may be unavailable or restricted
- Want environment-aware fallback without manual intervention
Core Technique
DOCX files are ZIP archives containing XML files. This skill provides two extraction methods:
Method A (Shell): unzip -p + sed for tag stripping
Method B (Python): zipfile module for archive access + string parsing
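To make the "DOCX is a ZIP archive" point concrete, here is a minimal sketch that lists the parts stored inside a DOCX file using only the standard library. The function name list_docx_parts is hypothetical and chosen for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a DOCX (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For a typical document the returned list should include word/document.xml, which is the part both methods read.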
Environment Detection
Before extraction, detect which method will work:
# Quick shell method test
if unzip -v >/dev/null 2>&1; then
    echo "Shell method available"
else
    echo "Shell method unavailable, try Python"
fi

# Quick Python method test
python3 -c "import zipfile; print('Python method available')" 2>/dev/null
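The same detection can be sketched from inside Python, assuming the interpreter itself is already running. The function name available_methods is hypothetical; shutil.which returns None when an executable is not on PATH.

```python
import shutil


def available_methods():
    """Report which extraction methods look usable in this environment."""
    methods = []
    # Method A needs both unzip and sed on PATH
    if shutil.which("unzip") and shutil.which("sed"):
        methods.append("shell")
    # Method B needs only the standard-library zipfile module
    try:
        import zipfile  # noqa: F401
        methods.append("python")
    except ImportError:
        pass
    return methods
```

This only checks that the tools exist, not that a sandbox permits running them, so a real extraction attempt remains the final test.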
Method A: Shell-Based Extraction
Use when unzip and sed are available and the environment allows shell operations.
Step-by-Step Instructions
1. Verify the DOCX file exists
ls -la document.docx
2. Extract raw XML content
unzip -p document.docx word/document.xml
3. Strip XML tags from content
unzip -p document.docx word/document.xml | sed -e 's/<[^>]*>//g'
4. Clean up whitespace (optional)
unzip -p document.docx word/document.xml | \
    sed -e 's/<[^>]*>//g' | \
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
    sed -e '/^$/d'
5. Save extracted text to file
unzip -p document.docx word/document.xml | \
    sed -e 's/<[^>]*>//g' > output.txt
Reusable Shell Function
parse_docx_shell() {
    local file="$1"

    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi

    if ! command -v unzip >/dev/null 2>&1; then
        echo "Error: unzip not available" >&2
        return 1
    fi

    unzip -p "$file" word/document.xml 2>/dev/null | \
        sed -e 's/<[^>]*>//g' | \
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
        sed -e '/^$/d'
}

# Usage:
parse_docx_shell document.docx
Method B: Python Zipfile Extraction
Use when the shell method fails or the Python environment is more reliable than the shell.
Step-by-Step Instructions
1. Verify the DOCX file exists
ls -la document.docx
2. Run Python extraction via run_shell
run_shell 'python3 -c "
import zipfile
import re

with zipfile.ZipFile(\"document.docx\", \"r\") as z:
    content = z.read(\"word/document.xml\").decode(\"utf-8\")

text = re.sub(r\"<[^>]*>\", \"\", content)
lines = [l.strip() for l in text.splitlines() if l.strip()]
for line in lines:
    print(line)
"'
3. Save to a file by writing it from within Python
run_shell 'python3 -c "
import zipfile
import re

with zipfile.ZipFile(\"document.docx\", \"r\") as z:
    content = z.read(\"word/document.xml\").decode(\"utf-8\")

text = re.sub(r\"<[^>]*>\", \"\", content)
lines = [l.strip() for l in text.splitlines() if l.strip()]
# Write UTF-8 explicitly so the result does not depend on the locale
with open(\"output.txt\", \"w\", encoding=\"utf-8\") as f:
    for line in lines:
        f.write(line + \"\\n\")
"'
Reusable Python Function (via run_shell)
parse_docx_python() {
    local file="$1"

    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi

    # Single quotes need no escaping inside the double-quoted command string;
    # only the double quotes that wrap the Python program are escaped.
    # Note: a file name containing a single quote will break this inline program.
    run_shell "python3 -c \"
import zipfile
import re
import sys

try:
    with zipfile.ZipFile('$file', 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for line in lines:
        print(line)
except Exception as e:
    print(f'Error: {e}', file=sys.stderr)
    sys.exit(1)
\""
}

# Usage:
parse_docx_python document.docx
# Or to file:
parse_docx_python document.docx > output.txt
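The regular expression used in Method B removes paragraph boundaries together with every other tag, and Word typically writes word/document.xml with few or no line breaks, so separate paragraphs can end up joined. As a possible alternative (not part of the original skill), the sketch below parses the XML with the standard-library ElementTree and returns one string per paragraph. The function name docx_paragraphs is hypothetical, and nested paragraphs (for example inside text boxes) may be counted more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by w:p (paragraph) and w:t (text run)
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return one string per non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text of every w:t element found inside this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Because the XML parser decodes entity references itself, the returned strings contain plain characters such as & rather than &amp;.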
Unified Dual-Method Function
Automatically tries shell first, falls back to Python if shell fails:
parse_docx() {
    local file="$1"
    local result

    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi

    # Try shell method first
    if command -v unzip >/dev/null 2>&1; then
        result=$(unzip -p "$file" word/document.xml 2>/dev/null | \
            sed -e 's/<[^>]*>//g' | \
            sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
            sed -e '/^$/d')
        if [ -n "$result" ]; then
            echo "$result"
            return 0
        fi
    fi

    # Fallback to Python method; the path is passed as an argument so that
    # quotes in the file name cannot break the inline program
    python3 -c "
import zipfile
import re
import sys

try:
    with zipfile.ZipFile(sys.argv[1], 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for line in lines:
        print(line)
except Exception as e:
    print(f'Error: {e}', file=sys.stderr)
    sys.exit(1)
" "$file"
}

# Usage:
parse_docx document.docx
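The same try-then-fall-back logic can be sketched in pure Python for callers that are already inside an interpreter. This is an illustrative variant under the assumption that Python 3.7 or later is available (for capture_output), and the names read_document_xml and parse_docx_text are hypothetical.

```python
import re
import shutil
import subprocess
import zipfile


def read_document_xml(path):
    """Return word/document.xml as text: unzip first, zipfile as fallback."""
    if shutil.which("unzip"):
        proc = subprocess.run(
            ["unzip", "-p", path, "word/document.xml"],
            capture_output=True,
        )
        # Accept the shell result only if unzip succeeded and printed something
        if proc.returncode == 0 and proc.stdout:
            return proc.stdout.decode("utf-8")
    # Fallback: the standard-library zipfile module
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def parse_docx_text(path):
    """Strip tags and return the non-empty, trimmed lines of text."""
    text = re.sub(r"<[^>]*>", "", read_document_xml(path))
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Both branches read the same archive member, so the result should not depend on which one ran.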
Alternative: Extract to Temporary Directory
For complex parsing needs or debugging:
Shell approach:
tmpdir=$(mktemp -d)
# -q keeps unzip's own progress messages out of the extracted text
unzip -q document.docx -d "$tmpdir"
sed -e 's/<[^>]*>//g' "$tmpdir/word/document.xml"
rm -rf "$tmpdir"
Python approach:
python3 -c "
import os
import tempfile
import zipfile

# TemporaryDirectory removes the extracted files when the block exits
with tempfile.TemporaryDirectory() as tmpdir:
    with zipfile.ZipFile('document.docx', 'r') as z:
        z.extractall(tmpdir)
    with open(os.path.join(tmpdir, 'word', 'document.xml'), encoding='utf-8') as f:
        print(f.read())
"
Verification
Confirm extraction worked:
# Check output has content
parse_docx document.docx | head -20

# Verify file was created (if saving to file)
ls -la output.txt
wc -l output.txt
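When the check needs to run inside a script rather than by eye, a sketch along these lines may help. The function name output_looks_valid and the min_lines parameter are hypothetical, and the output file is assumed to be UTF-8 text.

```python
from pathlib import Path


def output_looks_valid(path, min_lines=1):
    """True if the file exists and holds at least min_lines non-blank lines."""
    out = Path(path)
    if not out.is_file():
        return False
    lines = [
        line
        for line in out.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return len(lines) >= min_lines
```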
Method Selection Guide
| Environment | Recommended Method |
|---|---|
| Standard Linux with unzip | Shell (faster, simpler) |
| Container without unzip | Python zipfile |
| Sandboxed execution | Python via execute_code_sandbox or run_shell |
| Minimal/busybox images | Shell if unzip available |
| Unknown/restricted | Use unified function |
Limitations
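- Does not preserve formatting, images, or table structure; because word/document.xml is typically stored with few or no line breaks, text from separate paragraphs can run together once the tags are stripped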
- May include some residual XML entity references (&amp;, &lt;, &gt;, etc.)
- Works best for simple text extraction needs
- DOCX must be a valid Office Open XML format
- Protected/encrypted DOCX files require additional handling
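The last two limitations can be checked up front. The sketch below assumes that a password-protected DOCX is typically stored in a non-ZIP container and will therefore fail the ZIP check; the function name check_docx and the reason strings are hypothetical.

```python
import zipfile


def check_docx(path):
    """Return None if the file looks like a readable DOCX, else a reason."""
    # is_zipfile returns False for missing files and for non-ZIP content
    if not zipfile.is_zipfile(path):
        return "not a ZIP archive (possibly encrypted, legacy .doc, or corrupt)"
    with zipfile.ZipFile(path) as archive:
        if "word/document.xml" not in archive.namelist():
            return "ZIP archive without word/document.xml"
    return None
```

A None result only means the container looks right; it does not confirm that the XML inside is well-formed.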
Error Handling Tips
- Always check file existence before parsing
- Test method availability in the target environment
- Capture stderr for debugging failed extractions
- Validate output is non-empty before proceeding
- Handle XML entity decoding if needed (sed can expand basic entities)
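For the entity-decoding tip, a minimal Python sketch is shown here as one option alongside sed. It relies on html.unescape from the standard library, which decodes the predefined XML entities as well as numeric character references. The function name strip_tags_and_decode is hypothetical.

```python
import html
import re


def strip_tags_and_decode(xml_text):
    """Remove XML tags, then decode entity references such as &amp;."""
    # Strip tags first: decoding first would turn &lt;...&gt; into text that
    # looks like a tag and would then be removed by the regex
    text = re.sub(r"<[^>]*>", "", xml_text)
    return html.unescape(text)
```

For example, the run `<w:t>Q&amp;A &lt;draft&gt;</w:t>` comes out as `Q&A <draft>`.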