OpenSpace docx-shell-parse

Extract text from DOCX files using shell commands when python-docx is unavailable

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/docx-shell-parse" ~/.claude/skills/hkuds-openspace-docx-shell-parse && rm -rf "$T"

manifest: gdpval_bench/skills/docx-shell-parse/SKILL.md

source content

DOCX Shell Parsing Workaround

When you need to read content from Microsoft Word (.docx) files but python-docx or similar libraries are unavailable, use this shell-based approach to extract text reliably.

When to Use

The Python environment lacks python-docx or similar libraries
You need quick text extraction without installing dependencies
Working in constrained environments (containers, minimal images, etc.)

Core Technique

DOCX files are ZIP archives containing XML files. Extract and parse the main document XML:

unzip -p filename.docx word/document.xml | sed -e 's/<[^>]*>//g'
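Stripping every tag also removes paragraph boundaries, so adjacent paragraphs run together in the output. As a minimal sketch, the filter below first turns each closing paragraph tag (</w:p>) into a newline and then strips the remaining tags. The function name docx_xml_to_text is hypothetical; it reads XML on stdin so it can sit after unzip -p in the pipeline.

```shell
# Hypothetical helper: convert WordprocessingML read from stdin to plain text,
# keeping one output line per paragraph.
docx_xml_to_text() {
    # The backslash followed by a literal newline is the portable (POSIX)
    # way to put a newline into a sed replacement.
    sed -e 's/<\/w:p>/\
/g' -e 's/<[^>]*>//g'
}

# Usage: unzip -p filename.docx word/document.xml | docx_xml_to_text
```

Empty paragraphs, tabs, and manual line breaks are not handled specially in this sketch.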

Step-by-Step Instructions

1. Verify the DOCX file exists

ls -la document.docx
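Existence alone does not show that the file is a ZIP container. As a rough sketch, the check below reads the first two bytes and compares them with the "PK" signature that ZIP archives start with. The function name looks_like_zip is hypothetical, and head -c is assumed to be available (it is on GNU and BSD systems).

```shell
# Hypothetical helper: succeed only if the file exists and starts with "PK".
looks_like_zip() {
    local file="$1"
    [ -f "$file" ] || return 1
    # ZIP archives begin with the two ASCII bytes "PK".
    [ "$(head -c 2 "$file")" = "PK" ]
}

# Usage: looks_like_zip document.docx || echo "not a ZIP container" >&2
```

This only inspects the signature; it does not prove the archive is intact or that it contains word/document.xml.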

2. Extract raw XML content

Use unzip -p to write the contents of word/document.xml directly to stdout:

unzip -p document.docx word/document.xml

3. Strip XML tags from content

Pipe the output through sed to remove all XML tags:

unzip -p document.docx word/document.xml | sed -e 's/<[^>]*>//g'

4. Clean up whitespace (optional)

For cleaner output, trim leading and trailing whitespace from each line and drop empty lines:

unzip -p document.docx word/document.xml | \
  sed -e 's/<[^>]*>//g' | \
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
  sed -e '/^$/d'

5. Save extracted text to file

unzip -p document.docx word/document.xml | \
  sed -e 's/<[^>]*>//g' > output.txt

Complete Shell Function

Add this reusable function to your scripts:

parse_docx() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi
    unzip -p "$file" word/document.xml 2>/dev/null | \
        sed -e 's/<[^>]*>//g' | \
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
        sed -e '/^$/d'
}

# Usage: parse_docx document.docx
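Because parse_docx discards unzip's error output and returns the exit status of the final sed, a corrupt archive or a missing word/document.xml produces empty output with a zero status. One possible variant, sketched below under the hypothetical name parse_docx_checked, captures the XML first and reports a failure when unzip exits non-zero. It assumes any non-zero unzip status should be treated as an error, which also rejects archives that unzip extracts with warnings.

```shell
# Hypothetical variant of parse_docx that surfaces unzip failures.
parse_docx_checked() {
    local file="$1" xml
    # Capture the XML before printing anything, so a failed unzip
    # is detected instead of being hidden by the pipeline.
    if ! xml=$(unzip -p "$file" word/document.xml 2>/dev/null); then
        echo "Error: cannot read word/document.xml from: $file" >&2
        return 1
    fi
    printf '%s\n' "$xml" | sed -e 's/<[^>]*>//g'
}

# Usage: parse_docx_checked document.docx || echo "extraction failed" >&2
```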

Limitations

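Does not preserve formatting, images, or table structure
Stripping every tag also removes paragraph boundaries, so adjacent paragraphs run together
Reads only word/document.xml, so headers, footers, and footnotes are not extracted
May leave residual XML entity references (for example &amp;) in the output
Works best for simple text extraction needs
The input must be a valid Office Open XML (.docx) file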
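The residual entity references can be decoded with one more sed stage. The sketch below handles only the five predefined XML entities and leaves numeric character references untouched; the function name decode_xml_entities is hypothetical. The &amp; rule runs last so that text such as &amp;lt; decodes to &lt; and not to <.

```shell
# Hypothetical helper: decode the five predefined XML entities from stdin.
decode_xml_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'   # \& is a literal ampersand in the replacement
}

# Usage: parse_docx document.docx | decode_xml_entities
```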

Verification

Confirm extraction worked by checking output:

parse_docx document.docx | head -20
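For scripted use, a visual check can be supplemented with a simple automated one. The sketch below, under the hypothetical name check_extracted_text, reads the extracted text on stdin and fails if it is empty or still contains <w: tags. Tags can survive when they span several lines, because sed works line by line.

```shell
# Hypothetical helper: sanity-check extracted text read from stdin.
check_extracted_text() {
    local text
    text=$(cat)   # read all of stdin
    if [ -z "$text" ]; then
        echo "Error: extraction produced no text" >&2
        return 1
    fi
    if printf '%s\n' "$text" | grep -q '<w:'; then
        echo "Error: WordprocessingML tags remain in the output" >&2
        return 1
    fi
    # tr removes the padding some wc implementations add around the count.
    echo "OK: $(printf '%s\n' "$text" | wc -w | tr -d ' ') words extracted"
}

# Usage: parse_docx document.docx | check_extracted_text
```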

Alternative: Extract to Temporary Directory

For more complex parsing needs:

tmpdir=$(mktemp -d)
# -q keeps unzip's file listing out of stdout so it does not mix with the text
unzip -q document.docx -d "$tmpdir"
sed -e 's/<[^>]*>//g' "$tmpdir/word/document.xml"
rm -rf "$tmpdir"
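Once the archive is unpacked, parts other than the main document can be read as well. The sketch below walks an already extracted directory and strips tags from the body, headers, footers, and footnotes. The function name docx_dir_to_text is hypothetical, and the part names (header*.xml, footer*.xml, footnotes.xml) are an assumption based on the names Word conventionally writes; other producers may use different names.

```shell
# Hypothetical helper: print text from the common parts of an extracted DOCX.
docx_dir_to_text() {
    local dir="$1" part
    for part in "$dir"/word/document.xml "$dir"/word/header*.xml \
                "$dir"/word/footer*.xml "$dir"/word/footnotes.xml; do
        [ -f "$part" ] || continue      # skip unmatched globs and absent parts
        echo "== ${part#"$dir"/} =="    # label each part with its archive path
        sed -e 's/<[^>]*>//g' "$part"
        echo
    done
}

# Usage:
#   tmpdir=$(mktemp -d)
#   unzip -q document.docx -d "$tmpdir"
#   docx_dir_to_text "$tmpdir"
#   rm -rf "$tmpdir"
```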