OpenSpace docx-parse-resilient
Extract text from DOCX files with shell-primary approach and Python zipfile fallback for maximum reliability
git clone https://github.com/HKUDS/OpenSpace
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/docx-shell-parse-enhanced" ~/.claude/skills/hkuds-openspace-docx-parse-resilient && rm -rf "$T"
gdpval_bench/skills/docx-shell-parse-enhanced/SKILL.md
Resilient DOCX Text Extraction
Extract text from Microsoft Word (.docx) files using a robust two-tier approach: shell-based extraction as the primary method, with Python zipfile fallback when shell commands fail or return no output.
When to Use
- Python environment may lack python-docx but the zipfile module is available (standard library)
- Working in constrained or inconsistent environments (containers, minimal images, CI/CD)
- Shell unzip command returns errors or no output
- Need reliable extraction with automatic fallback
Core Technique
DOCX files are ZIP archives containing XML files. This skill provides two extraction methods:
- Primary (Shell): unzip -p + sed for fast extraction
- Fallback (Python): zipfile module for reliable extraction when shell fails
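Because a DOCX file is an ordinary ZIP archive, its parts can be listed with Python's standard zipfile module. The following is a minimal sketch and not part of the skill itself: list_docx_parts and make_sample_docx are hypothetical helper names, and the sample archive is a stripped-down stand-in rather than a complete Word document.

```python
import os
import tempfile
import zipfile


def list_docx_parts(path):
    """Return the names of the files stored inside a DOCX archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def make_sample_docx(path):
    """Write a tiny DOCX-like archive (hypothetical, for demonstration only)."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>'
    )
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", body)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.docx")
        make_sample_docx(sample)
        # word/document.xml is the part both extraction methods read.
        for name in list_docx_parts(sample):
            print(name)
```

Running the same function on a real document should show word/document.xml alongside other parts such as styles and relationships.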
Step-by-Step Instructions
1. Verify the DOCX file exists
ls -la document.docx
2. Test shell extraction first (recommended)
Try the shell-based approach:
unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g'
3. Check if shell extraction produced output
Verify the shell method returned content:
content=$(unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g')
if [ -z "$content" ]; then
  echo "Shell extraction returned no output, trying Python fallback..."
fi
4. Use Python zipfile fallback if needed
When shell commands fail or return empty output, use Python's standard zipfile module:
python3 -c "
import zipfile
import sys
import re

try:
    with zipfile.ZipFile('document.docx', 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    # Strip XML tags
    text = re.sub(r'<[^>]*>', '', content)
    # Clean whitespace
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    print('\n'.join(lines))
except Exception as e:
    print(f'Error: {e}', file=sys.stderr)
    sys.exit(1)
"
5. Save extracted text to file
# Try shell first
unzip -p document.docx word/document.xml 2>/dev/null | \
  sed -e 's/<[^>]*>//g' > output.txt

# Verify output has content
if [ ! -s output.txt ]; then
  # Fallback to Python
  python3 -c "
import zipfile, re
with zipfile.ZipFile('document.docx', 'r') as z:
    content = z.read('word/document.xml').decode('utf-8')
text = re.sub(r'<[^>]*>', '', content)
lines = [line.strip() for line in text.split('\n') if line.strip()]
print('\n'.join(lines))
" > output.txt
fi
Complete Shell Function with Fallback
Add this resilient function to your scripts:
parse_docx_resilient() {
  local file="$1"
  local output="$2"

  if [ ! -f "$file" ]; then
    echo "Error: File not found: $file" >&2
    return 1
  fi

  # Primary: shell extraction (strip tags, trim whitespace, drop blank lines)
  local content
  content=$(unzip -p "$file" word/document.xml 2>/dev/null | \
    sed -e 's/<[^>]*>//g' | \
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
    sed -e '/^$/d')

  # Check if shell extraction succeeded
  if [ -n "$content" ]; then
    printf '%s\n' "$content" > "${output:-/dev/stdout}"
    return 0
  fi

  # Fallback: Python zipfile. The path is passed as an argument (sys.argv[1])
  # instead of being interpolated into the code, so quotes in file names are safe.
  echo "Shell extraction failed, using Python fallback..." >&2
  python3 -c "
import zipfile, sys, re
try:
    with zipfile.ZipFile(sys.argv[1], 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    print('\n'.join(lines))
except Exception as e:
    print(f'Python extraction failed: {e}', file=sys.stderr)
    sys.exit(1)
" "$file" > "${output:-/dev/stdout}" || return 1
}

# Usage examples:
# parse_docx_resilient document.docx           # Output to stdout
# parse_docx_resilient document.docx out.txt   # Output to file
Python Script Alternative
For complex workflows, save as a standalone script:
#!/usr/bin/env python3
"""DOCX text extractor with resilient fallback."""
import re
import subprocess
import sys
import zipfile


def extract_with_shell(filepath):
    """Try shell-based extraction first; return None on any failure."""
    try:
        result = subprocess.run(
            ['unzip', '-p', filepath, 'word/document.xml'],
            capture_output=True,
            encoding='utf-8',  # document.xml is UTF-8; do not rely on the locale
            timeout=10
        )
    except (OSError, subprocess.SubprocessError, UnicodeError):
        # unzip is missing, timed out, or produced undecodable output
        return None
    if result.returncode == 0 and result.stdout.strip():
        text = re.sub(r'<[^>]*>', '', result.stdout)
        lines = [l.strip() for l in text.split('\n') if l.strip()]
        return '\n'.join(lines)
    return None


def extract_with_python(filepath):
    """Fallback Python zipfile extraction."""
    with zipfile.ZipFile(filepath, 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [l.strip() for l in text.split('\n') if l.strip()]
    return '\n'.join(lines)


def parse_docx_resilient(filepath):
    """Extract text with automatic fallback."""
    # Try shell first
    content = extract_with_shell(filepath)
    if content:
        return content, 'shell'
    # Fallback to Python
    content = extract_with_python(filepath)
    return content, 'python'


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: parse_docx_resilient.py <file.docx>", file=sys.stderr)
        sys.exit(1)
    try:
        content, method = parse_docx_resilient(sys.argv[1])
    except (OSError, KeyError, UnicodeError, zipfile.BadZipFile) as exc:
        # Missing file, missing word/document.xml, bad encoding, or invalid archive
        print(f"Failed to extract text from DOCX: {exc}", file=sys.stderr)
        sys.exit(1)
    if content:
        print(f"# Extracted using {method} method", file=sys.stderr)
        print(content)
    else:
        print("Failed to extract text from DOCX", file=sys.stderr)
        sys.exit(1)
Limitations
- Does not preserve formatting, images, or table structure
- May leave residual XML entity references (for example &amp; or &lt;) in the output
- Works best for simple text extraction needs
- The DOCX file must be a valid Office Open XML document
- Python fallback requires Python 3 with standard library (no external packages)
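The residual entity references mentioned above can be decoded after the tags are stripped. The sketch below is one possible refinement, not the skill's own method: xml_to_text is a hypothetical helper, html.unescape is used as a convenient decoder for the predefined XML entities, and the paragraph handling assumes that paragraphs end with a literal </w:p> tag, as is typical for word/document.xml.

```python
import html
import re


def xml_to_text(xml):
    """Convert WordprocessingML markup to plain text.

    Assumption: each paragraph ends with a literal </w:p> tag.
    """
    # Turn paragraph ends into newlines so paragraphs do not run together.
    xml = xml.replace("</w:p>", "\n")
    # Strip all remaining tags, as the extraction methods above do.
    text = re.sub(r"<[^>]*>", "", xml)
    # Decode entity references such as &amp; and &lt;.
    text = html.unescape(text)
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<w:p><w:r><w:t>Fish &amp; chips</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>a &lt; b</w:t></w:r></w:p>"
    )
    print(xml_to_text(sample))
```

Decoding happens after tag removal on purpose: decoding first would turn an escaped &lt; into a real angle bracket that the tag-stripping pattern could then misread as markup.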
Verification
Confirm extraction worked by checking output:
# Test shell method
parse_docx_resilient document.docx | head -20

# Test with file output
parse_docx_resilient document.docx extracted.txt
wc -l extracted.txt  # Should show line count > 0

# Verify content
grep -c "[a-zA-Z]" extracted.txt  # Should print a count > 0 (lines containing letters)
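The same check can be scripted in Python when grep and wc are not available. This is a rough sketch under stated assumptions: count_text_lines is a hypothetical helper, the output file is assumed to be UTF-8, and, like the grep pattern, it only recognizes ASCII letters.

```python
import re


def count_text_lines(path):
    """Count lines that contain at least one ASCII letter."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if re.search(r"[A-Za-z]", line))


if __name__ == "__main__":
    import sys

    # Paths are taken from the command line, e.g. extracted.txt
    for name in sys.argv[1:]:
        count = count_text_lines(name)
        status = "ok" if count > 0 else "no text found"
        print(f"{name}: {count} text lines ({status})")
```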
Troubleshooting
Shell returns "unknown error" or no output:
- This is expected in some environments
- The function automatically falls back to Python zipfile
- Run which unzip to verify that unzip is available
Python also fails:
- Verify the file is a valid DOCX: file document.docx
- Check if the file is corrupted: unzip -t document.docx
- Ensure Python 3 is available: python3 --version
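When neither file nor unzip is installed, roughly equivalent checks can be run from Python's standard library. The following is a sketch only: check_docx is a hypothetical helper, and passing these checks means the archive is readable and contains word/document.xml, not that the document is fully valid.

```python
import sys
import zipfile


def check_docx(path):
    """Return a list of problems found; an empty list means the checks passed."""
    # is_zipfile returns False for missing files and non-ZIP content.
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]
    problems = []
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip reads every member and returns the first bad one, if any.
            bad_member = archive.testzip()
            if bad_member is not None:
                problems.append(f"corrupted member: {bad_member}")
            if "word/document.xml" not in archive.namelist():
                problems.append("missing word/document.xml")
    except zipfile.BadZipFile as exc:
        problems.append(f"unreadable archive: {exc}")
    return problems


if __name__ == "__main__":
    for name in sys.argv[1:]:
        found = check_docx(name)
        print(f"{name}: {'; '.join(found) if found else 'ok'}")
```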
File not found errors:
- Use absolute path or verify working directory
- Check file permissions: ls -la document.docx
Environment Detection
To pre-detect which method to use:
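# Check if unzip is available (POSIX-compatible redirection)
if command -v unzip >/dev/null 2>&1; then
  echo "Shell method available"
else
  echo "Shell method unavailable (unzip not found)"
fi

# Check if Python 3 is available
if command -v python3 >/dev/null 2>&1; then
  echo "Python fallback available"
else
  echo "Python fallback unavailable (python3 not found)"
fi

# Warn only when neither method can be used
if ! command -v unzip >/dev/null 2>&1 && ! command -v python3 >/dev/null 2>&1; then
  echo "Warning: No extraction method available!"
fi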
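From inside a Python workflow, the same detection can be done with shutil.which. This is a minimal sketch: available_methods and its which parameter are hypothetical names introduced here (the parameter exists only so the lookup can be swapped out in tests), and it assumes the interpreter is exposed as python3, which may not hold on every platform.

```python
import shutil


def available_methods(which=shutil.which):
    """Return the usable extraction methods, in order of preference."""
    methods = []
    # The shell pipeline needs both unzip and sed on PATH.
    if which("unzip") and which("sed"):
        methods.append("shell")
    # The fallback needs a Python 3 interpreter on PATH.
    if which("python3"):
        methods.append("python")
    return methods


if __name__ == "__main__":
    found = available_methods()
    if found:
        print("Available methods:", ", ".join(found))
    else:
        print("Warning: No extraction method available!")
```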