Hacktricks-skills pdf-forensics-analysis
Analyze PDF files for security forensics, CTF challenges, and malicious content detection. Use this skill whenever the user needs to examine a PDF file for hidden data, malicious scripts, embedded files, or suspicious constructs. Trigger on requests involving PDF analysis, PDF forensics, PDF security review, suspicious PDF investigation, CTF PDF challenges, or any task requiring deep inspection of PDF structure and content.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/generic-methodologies-and-resources/basic-forensic-methodology/specific-software-file-type-tricks/pdf-file-analysis/SKILL.MD
PDF Forensics Analysis
A comprehensive skill for analyzing PDF files for security forensics, CTF challenges, and malicious content detection.
When to Use This Skill
Use this skill when:
- Investigating a suspicious PDF file
- Solving CTF challenges involving PDF forensics
- Performing security reviews of PDF documents
- Looking for hidden data, steganography, or malicious content in PDFs
- Analyzing PDF structure for anomalies or obfuscation
- Extracting embedded files or metadata from PDFs
Quick Triage Workflow
Step 1: Initial Assessment
Start with fast keyword statistics to identify potential issues:
    # Install tools if needed
    pip install pdfid pdf-parser peepdf

    # Quick triage - keyword statistics
    pdfid.py <suspicious.pdf>

    # Check file type (may reveal polyglot files)
    file <suspicious.pdf>

    # Look for multiple EOF markers (incremental updates)
    grep -c '%%EOF' <suspicious.pdf>
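Note that `grep -c` counts matching lines, not individual markers, and `file` does not say where the PDF header sits. The sketch below is one way to get both numbers with standard tools. The function names are hypothetical, and byte offsets from `grep -b -o` are assumed to behave as in GNU grep.

```shell
# Hypothetical helpers for illustration; they rely only on grep, head, cut, wc and tr.

# Count every %%EOF marker (occurrences, not matching lines).
count_eof() {
    LC_ALL=C grep -a -o '%%EOF' "$1" | wc -l | tr -d ' '
}

# Print the byte offset of the first %PDF- header; empty output means none was found.
pdf_header_offset() {
    LC_ALL=C grep -a -b -o '%PDF-' "$1" | head -n 1 | cut -d: -f1
}
```

More than one `%%EOF` points to incremental updates, and a non-zero header offset means some other data precedes the PDF, which is one polyglot indicator.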
Step 2: Deep Inspection
Use pdf-parser to walk the object tree, decode streams, and search for specific keywords:
    # Pass stream objects through their filters (e.g. FlateDecode) while dumping the object tree
    pdf-parser.py -f <suspicious.pdf>

    # Print summary statistics for the document
    pdf-parser.py -a <suspicious.pdf>

    # Search for specific objects
    pdf-parser.py --search "/JS" --raw <suspicious.pdf>
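When pdf-parser is not at hand, a rough awk pass can show which indirect objects mention a keyword, so you know which object numbers to inspect first. This is only a sketch: the function name is hypothetical, it sees only uncompressed object bodies (nothing inside object streams), and some awk implementations cut lines at NUL bytes.

```shell
# Print "object-number generation" for every indirect object whose body mentions a keyword.
objects_mentioning() {
    local keyword=$1 file=$2
    LC_ALL=C awk -v kw="$keyword" '
        # Remember which object we are inside.
        /^[0-9]+ [0-9]+ obj/ { current = $1 " " $2 }
        # Report each object once when the keyword appears in its body.
        index($0, kw) && current != "" {
            if (!(current in seen)) { print current; seen[current] = 1 }
        }
        # Leaving the object.
        /endobj/ { current = "" }
    ' "$file"
}
```

For example, `objects_mentioning /JS suspicious.pdf` lists the objects to examine more closely.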
Step 3: Extract Embedded Content
    # List embedded files
    peepdf "open <suspicious.pdf>" "objects embeddedfile"

    # Extract specific embedded objects
    peepdf "open <suspicious.pdf>" "objects embeddedfile" "extract <obj-num>" -o dumps/

    # Dump the decoded stream of a single object to a file (repeat per object of interest)
    pdf-parser.py --object <obj-num> --filter --dump <stream.bin> <suspicious.pdf>
Step 4: Decrypt if Needed
    # Remove password protection
    qpdf --password='<password>' --decrypt <encrypted.pdf> <clean.pdf>

    # Rewrite in QDF mode (normalized, uncompressed, human-readable) and drop unreferenced resources
    qpdf --qdf --remove-unreferenced-resources=yes <suspicious.pdf> <clean.pdf>
Common Malicious Constructs to Hunt
Search for these keywords in PDF content:
| Object | Description | Risk Level |
|---|---|---|
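| `/OpenAction`, `/AA` | Auto-exec on open | High |
| `/JS`, `/JavaScript` | Embedded JavaScript | High |
| `/Launch`, `/SubmitForm` | External process launchers | High |
| `/URI`, `/GoToE` | URL redirects | Medium |
| `/EmbeddedFile`, `/Filespec` | File attachments | High |
| `/RichMedia`, `/Flash`, `/3D` | Multimedia objects | Medium |
| `/ObjStm`, `/XFA`, `/AcroForm` | Object streams/forms | Medium |
| Multiple `%%EOF` | Incremental updates | Medium |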
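If pdfid is unavailable, a plain grep loop gives a first approximation of the same keyword hunt. The sketch below counts a handful of auto-exec and script names such as `/OpenAction`, `/AA`, `/JS`, `/Launch` and `/EmbeddedFile`. The function name is hypothetical, and the counts are approximate: names obfuscated with `#xx` hex escapes (for example `/J#53`) and names inside compressed streams are not matched.

```shell
# Print "<keyword> <count>" for a few high-risk PDF names.
count_pdf_keywords() {
    local file=$1 kw
    for kw in /OpenAction /AA /JS /Launch /EmbeddedFile; do
        # Require a non-alphanumeric character (or end of line) after the name,
        # so that /JS does not also count longer names that merely start with /JS.
        printf '%s %s\n' "$kw" \
            "$(LC_ALL=C grep -a -o -E -e "${kw}([^A-Za-z0-9]|\$)" "$file" | wc -l | tr -d ' ')"
    done
}
```

Any non-zero count for `/OpenAction` or `/AA` together with `/JS` is worth following up with a proper parser.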
Suspicious String Patterns
When combined with malicious keywords, these strings warrant deeper analysis:
- `powershell`, `cmd.exe`, `calc.exe`
- `base64` encoded content
- `http://`, `https://` (especially with suspicious domains)
- `mshta`, `wscript`, `cscript`
- `certutil`, `bitsadmin`
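A single case-insensitive grep can tally these strings. This is a minimal sketch with a hypothetical function name; because most stream content is compressed, it is best run on decoded output (for example a QDF rewrite from qpdf) and not only on the raw file.

```shell
# Print each suspicious string found in the file with its hit count, most frequent first.
scan_suspicious_strings() {
    LC_ALL=C grep -a -o -i -E \
        'powershell|cmd\.exe|mshta|wscript|cscript|certutil|bitsadmin|https?://[^ )>"]+' "$1" \
        | sort | uniq -c | sort -rn
}
```

URLs are printed in full, so unfamiliar domains stand out in the output.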
Hidden Data Locations
Check these areas for concealed content:
- Invisible layers - Objects with zero opacity or outside viewport
- XMP metadata - Adobe's metadata format often contains hidden info
- Incremental generations - Data appended after signing
- Same-color text - Text matching background color
- Text behind images - Overlapping objects
- Non-displayed comments - Annotation objects
- Polyglot files - Valid PDF + another format (e.g., Word with macros)
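Of the locations listed above, XMP metadata is often the quickest to check, because XMP packets are commonly stored uncompressed and are delimited by `<?xpacket begin` and `<?xpacket end` markers. The sketch below prints them with sed. The function name is hypothetical, and packets inside compressed streams will not show up until the file has been decoded.

```shell
# Print every XMP packet that is stored uncompressed in the file.
extract_xmp() {
    LC_ALL=C sed -n '/<?xpacket begin/,/<?xpacket end/p' "$1"
}
```

Reading the output by eye, or piping it through `grep -i flag` in a CTF, is usually enough.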
Recent Attack Techniques (2023-2025)
MalDoc in PDF Polyglot (2023)
- MHT-based Word document appended after `%%EOF`
- File is both valid PDF and DOC
- Contains `<w:WordDocument>` string
- AV engines parsing only PDF layer miss the macro
Detection:
grep -a '<w:WordDocument>' <suspicious.pdf>
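A complementary check, independent of the exact marker string, is to measure how much data follows the final `%%EOF`. The sketch below assumes GNU grep byte offsets (`-b -o`); the function name is hypothetical.

```shell
# Print how many bytes follow the last %%EOF marker.
bytes_after_last_eof() {
    local file=$1 size last
    size=$(wc -c < "$file" | tr -d ' ')
    last=$(LC_ALL=C grep -a -b -o '%%EOF' "$file" | tail -n 1 | cut -d: -f1)
    [ -n "$last" ] || { echo "no %%EOF marker found" >&2; return 1; }
    # The marker itself is 5 bytes long.
    echo $(( size - last - 5 ))
}
```

One or two trailing bytes are usually just a line ending; a large value means something was appended after the PDF and deserves a closer look.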
Shadow-Incremental Updates (2024)
- Second `/Catalog` with malicious `/OpenAction`
- First revision appears benign and signed
- Bypasses tools inspecting only first xref table
Detection:
    # Count catalog objects (-a treats the binary file as text)
    grep -a -c '/Catalog' <suspicious.pdf>

    # Check for multiple Prev offsets
    grep -a '/Prev' <suspicious.pdf>
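Because incremental updates are appended to the file, an earlier revision can be recovered by truncating the file right after the corresponding `%%EOF`. The sketch below lists the revision boundaries and carves one revision so it can be compared with the final document. The function names are hypothetical, and GNU grep byte offsets are assumed.

```shell
# List the byte offset of every %%EOF marker; each one closes a revision of the file.
list_revision_ends() {
    LC_ALL=C grep -a -b -o '%%EOF' "$1" | cut -d: -f1
}

# Carve revision N (1-based): everything up to and including the Nth %%EOF.
carve_revision() {
    local file=$1 n=$2 out=$3 end
    end=$(list_revision_ends "$file" | sed -n "${n}p")
    [ -n "$end" ] || { echo "revision $n not found" >&2; return 1; }
    head -c "$(( end + 5 ))" "$file" > "$out"
}
```

Running the triage tools on `carve_revision suspicious.pdf 1 rev1.pdf` and on the full file shows what each later update added.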
Font Parsing UAF (CVE-2024-30284)
- Vulnerable CoolType.dll function
- Triggered from embedded CIDType2 fonts
- Remote code execution when opened
- Patched in APSB24-29, May 2024
YARA Rule Template
Use this template for custom detection rules:
    rule Suspicious_PDF_AutoExec {
        meta:
            description = "Generic detection of PDFs with auto-exec actions and JS"
            author = "Your Name"
            last_update = "2025-07-20"
        strings:
            $pdf_magic = { 25 50 44 46 } // %PDF
            $aa = "/AA" ascii nocase
            $openact = "/OpenAction" ascii nocase
            $js = "/JS" ascii nocase
            $launch = "/Launch" ascii nocase
            $embedded = "/EmbeddedFile" ascii nocase
        condition:
            $pdf_magic at 0 and (
                ($aa and $js) or
                ($openact and $js) or
                $launch or
                $embedded
            )
    }
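On a host without YARA, the rule's condition can be approximated with grep to sanity-check a sample before writing a real rule. This is a rough equivalent, not a replacement: the function names are hypothetical, and like the rule it only sees strings that are not hidden in compressed streams.

```shell
# Return 0 when the file satisfies the same condition as Suspicious_PDF_AutoExec.
matches_autoexec_rule() {
    local f=$1
    # Case-insensitive fixed-string test, mirroring "ascii nocase".
    has() { LC_ALL=C grep -a -q -i -F -e "$1" "$f"; }
    # Mirrors "$pdf_magic at 0": the file must start with %PDF.
    [ "$(head -c 4 "$f")" = "%PDF" ] || return 1
    { has /AA && has /JS; } || { has /OpenAction && has /JS; } || has /Launch || has /EmbeddedFile
}
```

A typical use is `matches_autoexec_rule suspicious.pdf && echo "auto-exec indicators present"`.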
Defensive Recommendations
For Enterprise Environments
- Patch management - Keep Acrobat/Reader on latest Continuous track
- Gateway sanitization - Use `pdfcpu sanitize` or `qpdf --qdf --remove-unreferenced-resources=yes`
- Content Disarm & Reconstruction (CDR) - Convert to images or PDF/A in sandbox
- Block risky features - Disable JavaScript, multimedia, 3D rendering in Reader
- User education - Train on social engineering (invoice/resume lures)
For Individual Analysis
- Never open suspicious PDFs directly - Use sandboxed environments
- Static analysis first - Always triage before opening
- Check multiple tools - Different tools catch different issues
- Preserve evidence - Work on copies, maintain chain of custody
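Working on copies is easier to enforce with a small intake step that copies the sample, records its hash, and makes the copy read-only. The sketch below is one way to do that. The function name and the `SHA256SUMS` file name are hypothetical, and `sha256sum` from GNU coreutils is assumed (on macOS, `shasum -a 256` plays the same role).

```shell
# Copy a sample into a case directory, make the copy read-only, and log its SHA-256.
preserve_sample() {
    local src=$1 casedir=$2 name
    name=$(basename "$src")
    mkdir -p "$casedir"
    cp -p "$src" "$casedir/$name"
    chmod a-w "$casedir/$name"
    # Record the hash relative to the case directory so it can be re-checked later.
    ( cd "$casedir" && sha256sum "$name" >> SHA256SUMS )
}
```

Running `sha256sum -c SHA256SUMS` inside the case directory later confirms the working copy has not changed.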
Tool Reference
| Tool | Purpose | Install |
|---|---|---|
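| `pdfid.py` | Quick keyword statistics | `pip install pdfid` |
| `pdf-parser.py` | Deep inspection, extraction | `pip install pdf-parser` |
| `peepdf` | Interactive analysis, extraction | `pip install peepdf` |
| `qpdf` | Decrypt, linearize, sanitize | |
| `pdfcpu` | Validate, sanitize, extract | |
| | Visual object graph browser | Browser-based |
| | Safe rendering to images | |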
Analysis Checklist
- Run `pdfid.py` for initial triage
- Check file type with `file` command
- Count `%%EOF` markers
- Search for `/JS`, `/OpenAction`, `/AA`
- Look for embedded files
- Check for polyglot indicators (`<w:WordDocument>`)
- Decrypt if password-protected
- Validate structure with `pdfcpu`
- Extract and analyze suspicious streams
- Review XMP metadata
- Check for multiple `/Catalog` objects