OpenSpace docx-shell-extract
Extract text from DOCX files using shell commands when python-docx is unavailable
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/docx-shell-extract" ~/.claude/skills/hkuds-openspace-docx-shell-extract && rm -rf "$T"
manifest:
gdpval_bench/skills/docx-shell-extract/SKILL.md
source content
DOCX Shell Extraction
When to Use This Skill
Use this pattern when you need to read or extract text from Microsoft Word (.docx) files in constrained environments where:
- The python-docx library is not available
- You cannot install additional Python packages
- You need a quick, reliable shell-based solution
Core Technique
DOCX files are ZIP archives containing XML files. The main document content is stored in word/document.xml. You can extract and parse this using standard shell tools.
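To make that structure concrete, the sketch below runs against a hand-written, heavily trimmed sample of the markup (an illustration, not taken from a real file). The names sample_xml and list_text_runs are hypothetical. It relies on grep -o, which GNU and BSD grep support but POSIX does not require.

```shell
# Hypothetical, heavily trimmed sample of what word/document.xml contains.
sample_xml='<w:body><w:p><w:r><w:t>Quarterly report</w:t></w:r></w:p><w:p><w:r><w:t xml:space="preserve">Revenue grew </w:t></w:r><w:r><w:t>12%</w:t></w:r></w:p></w:body>'

# Hypothetical helper: print each <w:t> element found on stdin, one per line.
# <w:p> marks a paragraph, <w:r> a run, and <w:t> holds the visible text.
list_text_runs() {
    grep -oE '<w:t( [^>]*)?>[^<]*</w:t>'
}

printf '%s\n' "$sample_xml" | list_text_runs
```

The visible text lives only inside the w:t elements, which is why stripping every tag is enough to recover it.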
Step-by-Step Instructions
Step 1: Extract the document.xml content
unzip -p filename.docx word/document.xml
The -p flag pipes the content to stdout without extracting to disk.
Step 2: Strip XML tags to get plain text
unzip -p filename.docx word/document.xml | sed 's/<[^>]*>//g'
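This removes all XML tags, leaving the text content. Because document.xml usually contains very few line breaks, the paragraphs tend to run together on one long line.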
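If paragraph boundaries matter, one possible variation is to turn each closing w:p tag into a line break before stripping the remaining tags. The sketch below is not part of the original skill. The function name docx_xml_to_lines is hypothetical, and it uses awk instead of sed so the newline replacement behaves the same across implementations.

```shell
# Hypothetical helper: reads document.xml markup on stdin,
# writes roughly one line of text per paragraph.
docx_xml_to_lines() {
    awk '{
        gsub(/<\/w:p>/, "\n")   # end of a paragraph becomes a line break
        gsub(/<[^>]*>/, "")     # then drop every remaining tag
        print
    }'
}

# Usage with a real file (assumes unzip is installed):
#   unzip -p filename.docx word/document.xml | docx_xml_to_lines
printf '%s\n' '<w:p><w:r><w:t>First</w:t></w:r></w:p><w:p><w:r><w:t>Second</w:t></w:r></w:p>' | docx_xml_to_lines
```

The output may contain blank lines (for example where the XML declaration was), which the empty-line filter in Step 3 removes.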
Step 3: Clean up whitespace (optional)
For cleaner output, add additional sed processing:
unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed 's/&[^;]*;//g' | \
  sed 's/^[[:space:]]*//' | \
  sed 's/[[:space:]]*$//' | \
  sed '/^$/d'
This removes:
- XML tags
- XML entities (like &amp; and &lt;), which are deleted rather than decoded
- Leading/trailing whitespace
- Empty lines
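Deleting entities also deletes the characters they stand for, so an ampersand in the source text disappears. As an alternative sketch (not part of the original skill), the five predefined XML entities can be decoded instead. The function name decode_xml_entities is hypothetical, and numeric character references such as &#8217; are not handled here.

```shell
# Hypothetical helper: decodes the five predefined XML entities read from stdin.
# &amp; is handled last so that "&amp;lt;" becomes the literal text "&lt;".
decode_xml_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Usage with a real file (assumes unzip is installed); decode AFTER stripping tags,
# otherwise a decoded "<" could be mistaken for the start of a tag:
#   unzip -p filename.docx word/document.xml | sed 's/<[^>]*>//g' | decode_xml_entities
printf '%s\n' 'R&amp;D budget &lt; 5%' | decode_xml_entities
# prints: R&D budget < 5%
```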
Step 4: Save to a text file (optional)
unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' > output.txt
Complete Example
# Extract text from a Word document
DOCX_FILE="report.docx"
OUTPUT_FILE="report_text.txt"

unzip -p "$DOCX_FILE" word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed 's/&[^;]*;//g' | \
  sed '/^$/d' > "$OUTPUT_FILE"

echo "Extracted text saved to $OUTPUT_FILE"
Verification
After extraction, verify the content was captured:
# Check if output file has content
if [ -s "$OUTPUT_FILE" ]; then
  echo "Successfully extracted $(wc -l < "$OUTPUT_FILE") lines"
  head -5 "$OUTPUT_FILE"
else
  echo "Warning: Output file is empty"
fi
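A non-empty file can still contain leftover markup if tag stripping went wrong. The sketch below extends the check under that assumption. The function name check_extracted_text is hypothetical, and a match on w: tags is only a hint of a problem, since a document could legitimately discuss such markup.

```shell
# Hypothetical helper: succeeds only if the file is non-empty
# and shows no leftover WordprocessingML tags such as <w:t> or </w:p>.
check_extracted_text() {
    [ -s "$1" ] || { echo "empty or missing: $1" >&2; return 1; }
    if grep -qE '</?w:' "$1"; then
        echo "leftover markup in: $1" >&2
        return 1
    fi
    # Report a word count as a rough sanity signal.
    echo "ok: $(wc -w < "$1" | tr -d ' ') words"
}

# Usage (OUTPUT_FILE as set in the Complete Example):
#   check_extracted_text "$OUTPUT_FILE" || echo "extraction needs a closer look"
```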
Limitations
- This method extracts raw text without formatting
- Complex layouts, tables, and images are not preserved
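- Special characters stored as XML entities (such as &amp; for an ampersand) are left encoded by Step 2 and deleted by Step 3, so they may need additional handling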
- Works best for text-heavy documents
Alternatives to Explore
If this approach fails or the DOCX structure differs:
- Check for word/document.xml existence: unzip -l filename.docx | grep document.xml
- Some documents may use word/*.xml with different naming
- Consider pandoc if available: pandoc filename.docx -t plain
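These alternatives can be combined into a single fallback chain. The sketch below is one possible arrangement rather than part of the original skill: it prefers pandoc when it is on the PATH, and otherwise uses unzip and sed after confirming that word/document.xml exists. The function name extract_docx_text is hypothetical.

```shell
# Hypothetical wrapper: prefer pandoc, fall back to unzip + sed.
extract_docx_text() {
    docx=$1
    [ -f "$docx" ] || { echo "no such file: $docx" >&2; return 1; }

    if command -v pandoc >/dev/null 2>&1; then
        pandoc "$docx" -t plain
    elif command -v unzip >/dev/null 2>&1 &&
         unzip -l "$docx" 2>/dev/null | grep -q 'word/document\.xml'; then
        unzip -p "$docx" word/document.xml | sed 's/<[^>]*>//g'
    else
        echo "cannot read $docx: no usable tool, or no word/document.xml" >&2
        return 1
    fi
}

# Usage:
#   extract_docx_text report.docx > report_text.txt
```

Checking for the archive member first avoids writing an empty output file when the archive is not a Word document.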