Skills · theharvester
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/theharvester" ~/.claude/skills/terminalskills-skills-theharvester && rm -rf "$T"
manifest: skills/theharvester/SKILL.md
safety · automated scan (high risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
- shell exec via library
- makes HTTP requests (curl)
- references API keys
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
theHarvester
Overview
theHarvester is a passive OSINT tool that aggregates information about a target domain from multiple public sources. It finds email addresses, subdomains, hostnames, and IP ranges without making any direct requests to the target — making it ideal for stealth recon during the pre-engagement phase of penetration tests or OSINT investigations.
Sources include: Google, Bing, DuckDuckGo, LinkedIn, Shodan, Hunter.io, CertSpotter, DNSDumpster, VirusTotal, and more.
Instructions
Step 1: Install theHarvester
```bash
# Option 1: pip (installing in a virtual environment is recommended)
pip install theHarvester

# Option 2: clone from GitHub (most up-to-date)
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements/base.txt

# Option 3: Docker
docker pull ghcr.io/laramies/theharvester
docker run ghcr.io/laramies/theharvester -d example.com -b google
```
Step 2: Basic usage
```bash
# Syntax: theHarvester -d <domain> -b <source> [options]
#   -d  target domain
#   -b  data source(s)
#   -l  limit results (default: 500)
#   -f  output filename (supports XML and JSON)
#   -n  DNS lookup on discovered hosts
#   -v  verify host via DNS resolution

# Search a single source
theHarvester -d example.com -b google

# Search all available sources
theHarvester -d example.com -b all

# Limit results, enable DNS lookup, save output
theHarvester -d example.com -b google,bing,linkedin -l 200 -n -f results_example

# Run from cloned repo
python3 theHarvester.py -d example.com -b all -l 500 -f output
```
Step 3: Choose sources strategically
```bash
# Email harvesting: best sources
theHarvester -d example.com -b google,bing,hunter,linkedin

# Subdomain enumeration: best sources
theHarvester -d example.com -b certspotter,dnsdumpster,virustotal,shodan

# Comprehensive (slower, uses all sources)
theHarvester -d example.com -b all -l 1000 -f full_recon_example

# LinkedIn employee discovery (requires LinkedIn API key in api-keys.yaml)
theHarvester -d example.com -b linkedin -l 200
```
Step 4: Configure API keys
```yaml
# api-keys.yaml (place in the theHarvester directory or specify with the -c flag)
apikeys:
  hunter:
    key: YOUR_HUNTER_IO_KEY
  shodan:
    key: YOUR_SHODAN_KEY
  virustotal:
    key: YOUR_VIRUSTOTAL_KEY
  binaryedge:
    key: YOUR_BINARYEDGE_KEY
  fullhunt:
    key: YOUR_FULLHUNT_KEY
  securityTrails:
    key: YOUR_SECURITYTRAILS_KEY
  github:
    key: YOUR_GITHUB_TOKEN
```
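Before a long run it can be worth checking which keys are actually populated. A minimal sketch, assuming PyYAML is installed and that api-keys.yaml sits in the current directory:

```python
import yaml  # PyYAML: pip install pyyaml

# Path assumption: api-keys.yaml in the current working directory.
with open("api-keys.yaml") as f:
    config = yaml.safe_load(f)

for source, entry in (config.get("apikeys") or {}).items():
    key = (entry or {}).get("key")
    # Treat empty values and obvious placeholders as "not configured".
    configured = bool(key) and not str(key).startswith("YOUR_")
    print(f"{source:16} {'configured' if configured else 'MISSING'}")
```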
Step 5: Parse and process output with Python
```python
import json
import re
import subprocess


def run_harvester(domain, sources="google,bing,certspotter,dnsdumpster", limit=500):
    """Run theHarvester and return parsed results."""
    output_file = f"harvester_{domain.replace('.', '_')}"
    cmd = [
        "theHarvester",
        "-d", domain,
        "-b", sources,
        "-l", str(limit),
        "-f", output_file,
    ]
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    print(result.stdout)

    # Parse JSON output
    json_file = f"{output_file}.json"
    try:
        with open(json_file) as f:
            data = json.load(f)
        return data
    except FileNotFoundError:
        # Fall back to parsing stdout
        return parse_stdout(result.stdout)


def parse_stdout(output):
    """Extract emails, hosts, and IPs from raw stdout."""
    emails = set(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', output))
    # Filter out false positives
    emails = {e for e in emails if not e.endswith(('.png', '.jpg', '.css', '.js'))}
    hosts = set(re.findall(r'[\w\.-]+\.\w{2,}', output))
    ips = set(re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', output))
    return {"emails": list(emails), "hosts": list(hosts), "ips": list(ips)}


def deduplicate_and_report(data, domain):
    """Clean and summarize harvested data."""
    emails = sorted(set(data.get("emails", [])))
    hosts = sorted(set(data.get("hosts", [])))
    ips = sorted(set(data.get("ips", [])))

    # Filter to target domain
    domain_emails = [e for e in emails if domain in e]
    domain_hosts = [h for h in hosts if domain in h]

    print(f"\n=== Harvest Report: {domain} ===")
    print(f"Emails found: {len(domain_emails)}")
    print(f"Subdomains: {len(domain_hosts)}")
    print(f"IP addresses: {len(ips)}")

    if domain_emails:
        print("\nEmails:")
        for e in domain_emails[:20]:
            print(f"  {e}")
    if domain_hosts:
        print("\nSubdomains:")
        for h in domain_hosts[:20]:
            print(f"  {h}")

    return {
        "emails": domain_emails,
        "subdomains": domain_hosts,
        "ips": ips,
    }


# Usage
results = run_harvester("target-company.com", sources="google,bing,certspotter,hunter")
clean = deduplicate_and_report(results, "target-company.com")

# Save cleaned results
with open("clean_results.json", "w") as f:
    json.dump(clean, f, indent=2)
```
Step 6: Combine with other tools
```bash
# Pass discovered subdomains to nmap (only with explicit authorization)
theHarvester -d example.com -b all -f hosts

cat hosts.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for host in data.get('hosts', []):
    print(host)
" > subdomains.txt

# Feed subdomains into amass for deeper DNS enumeration
cat subdomains.txt | amass enum -df - -passive

# Check emails against breach databases
cat emails.txt | while read email; do
  curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/$email" \
    -H "hibp-api-key: YOUR_HIBP_KEY"
done
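The HIBP API enforces per-key rate limits, so the raw curl loop above will start returning 429s on larger lists. A minimal Python sketch that spaces requests out; the 6-second delay is an assumption, so tune it to your key's actual limit:

```python
import time

import requests  # pip install requests

API_KEY = "YOUR_HIBP_KEY"  # placeholder, as in the curl example above


def check_breaches(emails, delay=6.0):
    """Query HIBP for each address, sleeping between calls to respect rate limits."""
    results = {}
    for email in emails:
        resp = requests.get(
            f"https://haveibeenpwned.com/api/v3/breachedaccount/{email}",
            headers={"hibp-api-key": API_KEY, "user-agent": "recon-script"},
        )
        if resp.status_code == 200:
            results[email] = [b["Name"] for b in resp.json()]
        elif resp.status_code == 404:
            results[email] = []    # not found in any breach
        else:
            results[email] = None  # rate-limited or other error
        time.sleep(delay)          # assumed safe spacing between requests
    return results
```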
Available Sources Reference
| Source | Data Type | API Key Required |
|---|---|---|
| Google | Emails, subdomains | No |
| Bing | Emails, subdomains | No |
| DuckDuckGo | Emails, subdomains | No |
| LinkedIn | Employees, emails | Optional |
| Hunter.io | Emails | Yes |
| CertSpotter | Subdomains (SSL certs) | No |
| DNSDumpster | Subdomains, IPs | No |
| VirusTotal | Subdomains | Yes |
| Shodan | IPs, open ports | Yes |
| SecurityTrails | Subdomains, DNS | Yes |
| GitHub | Emails, code | Yes |
| BinaryEdge | IPs, services | Yes |
Guidelines
- Always get authorization before running theHarvester against a target — passive does not mean invisible. Data queries may be logged by third-party services.
- Rate limits: Without API keys, theHarvester relies on scraping search engines, which may throttle or block repeated requests. Add API keys for reliable results.
- Combine sources: No single source is complete. Use multiple sources and deduplicate.
- Email format detection: Once you have a few emails (e.g., john.smith@corp.com, jsmith@corp.com), infer the naming convention and use it to generate a target list (see the sketch after this list).
- DNS verification: Always use -n or -v to verify discovered hosts are live before reporting.
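A minimal sketch of that email-format step. The pattern table and the employee names are assumptions you supply yourself (e.g., from LinkedIn results); nothing here is theHarvester output:

```python
# Generate candidate emails from known names and a detected naming convention.
# Pattern names and the sample employees below are illustrative assumptions.
PATTERNS = {
    "first.last": lambda f, l: f"{f}.{l}",
    "flast":      lambda f, l: f"{f[0]}{l}",
    "firstl":     lambda f, l: f"{f}{l[0]}",
    "first":      lambda f, l: f,
}


def generate_emails(names, domain, pattern="first.last"):
    """names: list of 'First Last' strings; returns candidate addresses."""
    make = PATTERNS[pattern]
    candidates = []
    for name in names:
        parts = name.lower().split()
        if len(parts) >= 2:
            candidates.append(f"{make(parts[0], parts[-1])}@{domain}")
    return candidates


# Example: if harvested emails look like jsmith@corp.com, use the 'flast' pattern.
print(generate_emails(["John Smith", "Ana Lopez"], "corp.com", pattern="flast"))
```

Candidate lists like this are guesses, not findings; verify them against DNS/MX and live sources before including them in a report.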