OpenSpace docx-shell-extract

Extract text from DOCX files using shell commands when python-docx is unavailable

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/docx-shell-extract" ~/.claude/skills/hkuds-openspace-docx-shell-extract && rm -rf "$T"
manifest: gdpval_bench/skills/docx-shell-extract/SKILL.md

DOCX Shell Extraction

When to Use This Skill

Use this pattern when you need to read or extract text from Microsoft Word (.docx) files in constrained environments where:

  • The python-docx library is not available
  • You cannot install additional Python packages
  • You need a quick, reliable shell-based solution

Core Technique

DOCX files are ZIP archives containing XML files. The main document content is stored in word/document.xml. You can extract and parse this using standard shell tools.
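Because a DOCX file is a ZIP container, one quick sanity check before extraction is to look at its first two bytes and, if they match the ZIP signature, list the archive entries. The following is only a sketch: looks_like_zip is a hypothetical helper name, "filename.docx" is a placeholder path, and the check assumes head -c is available (it is on GNU and BSD systems).

```shell
# Hypothetical helper: succeeds if the file starts with the ZIP signature "PK".
looks_like_zip() {
  [ -f "$1" ] || return 1
  [ "$(head -c 2 "$1")" = "PK" ]
}

# List the archive entries only when the file looks like a ZIP container.
# "filename.docx" is a placeholder path.
if looks_like_zip "filename.docx"; then
  unzip -l "filename.docx"
fi
```

A file that passes this check is not guaranteed to be a valid DOCX; it only rules out obviously wrong inputs such as plain text or legacy .doc files.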

Step-by-Step Instructions

Step 1: Extract the document.xml content

unzip -p filename.docx word/document.xml

The -p flag pipes the content to stdout without extracting to disk.

Step 2: Strip XML tags to get plain text

unzip -p filename.docx word/document.xml | sed 's/<[^>]*>//g'

This removes all XML tags, leaving the text content.
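One thing to watch for: Word often writes the body of document.xml as a single long line, so stripping tags alone can run adjacent paragraphs together. A possible variation, sketched here under the assumption that paragraphs end with the WordprocessingML closing tag </w:p>, is to turn each paragraph end into a newline before removing the tags. The function name docx_xml_to_paragraphs is hypothetical.

```shell
# Hypothetical helper: reads document.xml on stdin, writes one paragraph per line.
docx_xml_to_paragraphs() {
  # Replace each paragraph end tag with a newline. Backslash-newline is the
  # portable way to write a newline in a sed replacement. Then strip the
  # remaining tags and drop lines that end up empty.
  sed -e 's/<\/w:p>/\
/g' | sed -e 's/<[^>]*>//g' -e '/^$/d'
}

# Demo on an inline fragment with two paragraphs.
printf '%s\n' '<w:p><w:r><w:t>First</w:t></w:r></w:p><w:p><w:r><w:t>Second</w:t></w:r></w:p>' |
  docx_xml_to_paragraphs

# With a real file (placeholder name):
#   unzip -p filename.docx word/document.xml | docx_xml_to_paragraphs
```

The demo prints "First" and "Second" on separate lines, whereas plain tag stripping would print "FirstSecond".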

Step 3: Clean up whitespace (optional)

For cleaner output, add additional sed processing:

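unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e "s/&apos;/'/g" -e 's/&amp;/\&/g' | \
  sed 's/^[[:space:]]*//' | \
  sed 's/[[:space:]]*$//' | \
  sed '/^$/d'

This pipeline:

  • Removes XML tags
  • Decodes the predefined XML entities (like &amp;, &lt;) back to the characters they represent, with &amp; handled last so that already-decoded text is not decoded twice
  • Trims leading/trailing whitespace
  • Deletes empty lines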
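To see what the cleanup does to entities and whitespace, it can help to run it on a small inline fragment instead of a real file. In this sketch the five predefined XML entities are decoded rather than dropped, so characters such as & survive in the output. The function name clean_docx_text and the sample fragment are made up for illustration.

```shell
# Hypothetical helper: reads document.xml (or a fragment) on stdin and
# writes cleaned text. Entities are decoded, with &amp; handled last.
clean_docx_text() {
  sed -e 's/<[^>]*>//g' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
      -e 's/&amp;/\&/g' \
      -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
      -e '/^$/d'
}

# Demo: tags are stripped, entities decoded, surrounding spaces trimmed.
printf '%s\n' '<w:p><w:r><w:t>  R&amp;D: a &lt; b  </w:t></w:r></w:p>' | clean_docx_text
```

The demo prints "R&D: a < b". Tags are stripped before entities are decoded so that a decoded < in the text is never mistaken for the start of a tag.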

Step 4: Save to a text file (optional)

unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' > output.txt

Complete Example

# Extract text from a Word document
DOCX_FILE="report.docx"
OUTPUT_FILE="report_text.txt"

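# Strip tags, decode XML entities (&amp; last), then drop empty lines
unzip -p "$DOCX_FILE" word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e "s/&apos;/'/g" -e 's/&amp;/\&/g' | \
  sed '/^$/d' > "$OUTPUT_FILE"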

echo "Extracted text saved to $OUTPUT_FILE"

Verification

After extraction, verify the content was captured:

# Check if output file has content
if [ -s "$OUTPUT_FILE" ]; then
    echo "Successfully extracted $(wc -l < "$OUTPUT_FILE") lines"
    head -5 "$OUTPUT_FILE"
else
    echo "Warning: Output file is empty"
fi
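The check above only looks at the output file. Because unzip runs at the start of a pipeline, its failure (missing file, not a ZIP archive, no word/document.xml entry) does not change the pipeline's exit status. One way to surface those errors is sketched below. The function name extract_docx_text and its two-argument interface are assumptions made for this example, and its variables are global because POSIX sh has no local keyword.

```shell
# Hypothetical wrapper: extract_docx_text INPUT.docx OUTPUT.txt
# Returns non-zero with a message on stderr if any step fails.
extract_docx_text() {
  docx=$1
  out=$2
  if [ ! -f "$docx" ]; then
    echo "Error: $docx not found" >&2
    return 1
  fi
  if ! command -v unzip >/dev/null 2>&1; then
    echo "Error: unzip is not installed" >&2
    return 1
  fi
  # Capture the XML first so a failed unzip is not hidden by the pipe.
  if ! xml=$(unzip -p "$docx" word/document.xml 2>/dev/null); then
    echo "Error: could not read word/document.xml from $docx" >&2
    return 1
  fi
  printf '%s\n' "$xml" | sed 's/<[^>]*>//g' | sed '/^$/d' > "$out"
  # Succeed only if some text was actually written.
  [ -s "$out" ]
}

# Usage (placeholder file names):
#   extract_docx_text report.docx report_text.txt && head -5 report_text.txt
```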

Limitations

  • This method extracts raw text without formatting
  • Complex layouts, tables, and images are not preserved
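  • Special characters need care: XML entities such as &amp; must be decoded, and tabs or line breaks stored as empty tags (<w:tab/>, <w:br/>) disappear when tags are stripped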
  • Works best for text-heavy documents

Alternatives to Explore

If this approach fails or the DOCX structure differs:

  • Check that word/document.xml exists: unzip -l filename.docx | grep document.xml
  • Some documents may store the main content under a different name matching word/*.xml; list the archive contents to find it
  • Consider pandoc if available: pandoc filename.docx -t plain
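These alternatives can be combined into a simple fallback: prefer pandoc when it is installed and fall back to the unzip and sed approach otherwise. The sketch below is one possible arrangement, not part of the original skill; pick_docx_tool and docx_to_text are hypothetical names.

```shell
# Hypothetical helper: prints the name of the first available extraction tool.
pick_docx_tool() {
  if command -v pandoc >/dev/null 2>&1; then
    echo pandoc
  elif command -v unzip >/dev/null 2>&1; then
    echo unzip
  else
    echo none
  fi
}

# Hypothetical wrapper: docx_to_text FILE.docx, text goes to stdout.
docx_to_text() {
  case "$(pick_docx_tool)" in
    pandoc) pandoc "$1" -t plain ;;
    unzip)  unzip -p "$1" word/document.xml | sed 's/<[^>]*>//g' ;;
    *)      echo "Error: neither pandoc nor unzip is available" >&2
            return 1 ;;
  esac
}

# Usage (placeholder file name):
#   docx_to_text filename.docx > output.txt
```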