Skills theharvester

install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/theharvester" ~/.claude/skills/terminalskills-skills-theharvester && rm -rf "$T"
manifest: skills/theharvester/SKILL.md
safety · automated scan (high risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
  • pip install
  • shell exec via library
  • makes HTTP requests (curl)
  • references API keys
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content

theHarvester

Overview

theHarvester is a passive OSINT tool that aggregates information about a target domain from multiple public sources. It finds email addresses, subdomains, hostnames, and IP ranges without sending requests to the target itself (the optional -n/-v DNS verification flags are the exception, since they resolve discovered hosts), making it well suited to stealth recon during the pre-engagement phase of penetration tests or OSINT investigations.

Sources include: Google, Bing, DuckDuckGo, LinkedIn, Shodan, Hunter.io, CertSpotter, DNSDumpster, VirusTotal, and more.

Instructions

Step 1: Install theHarvester

# Option 1: pip (a virtual environment is recommended)
pip install theHarvester

# Option 2: Clone from GitHub (most up-to-date)
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements/base.txt

# Option 3: Docker
docker pull ghcr.io/laramies/theharvester
docker run ghcr.io/laramies/theharvester -d example.com -b google
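Whichever install route you pick, a quick stdlib check confirms the binary actually landed on PATH before you script against it. This helper is a sketch, not part of theHarvester:

```python
import shutil

def harvester_available(binary="theHarvester"):
    """Return the full path to the binary if it is on PATH, else None."""
    return shutil.which(binary)

path = harvester_available()
if path:
    print(f"theHarvester found at {path}")
else:
    print("theHarvester not on PATH; install via pip, git clone, or Docker as above")
```

If you installed from a git clone without pip, the entry point is theHarvester.py in the repo root rather than a PATH binary, so adjust accordingly.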

Step 2: Basic usage

# Syntax: theHarvester -d <domain> -b <source> [options]
# -d  target domain
# -b  data source(s)
# -l  limit results (default: 500)
# -f  output filename (supports XML and JSON)
# -n  DNS lookup on discovered hosts
# -v  verify hostname via DNS resolution and search for virtual hosts

# Search a single source
theHarvester -d example.com -b google

# Search all available sources
theHarvester -d example.com -b all

# Limit results, enable DNS lookup, save output
theHarvester -d example.com -b google,bing,linkedin -l 200 -n -f results_example

# Run from cloned repo
python3 theHarvester.py -d example.com -b all -l 500 -f output

Step 3: Choose sources strategically

# Email harvesting — best sources
theHarvester -d example.com -b google,bing,hunter,linkedin

# Subdomain enumeration — best sources
theHarvester -d example.com -b certspotter,dnsdumpster,virustotal,shodan

# Comprehensive (slower, uses all sources)
theHarvester -d example.com -b all -l 1000 -f full_recon_example

# LinkedIn employee discovery (a LinkedIn API key in api-keys.yaml is optional but improves results)
theHarvester -d example.com -b linkedin -l 200
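The groupings above can be captured in a small helper so scripts don't hard-code -b strings. A sketch only; the preset names and groupings are this document's suggestion, not an official theHarvester taxonomy:

```python
# Source presets mirroring the groupings above (a suggestion, not an
# official theHarvester taxonomy).
SOURCE_PRESETS = {
    "emails": "google,bing,hunter,linkedin",
    "subdomains": "certspotter,dnsdumpster,virustotal,shodan",
    "all": "all",
}

def build_command(domain, goal="subdomains", limit=500, outfile=None):
    """Assemble a theHarvester argv list for a given recon goal.

    Unknown goals fall back to searching all sources.
    """
    sources = SOURCE_PRESETS.get(goal, "all")
    cmd = ["theHarvester", "-d", domain, "-b", sources, "-l", str(limit)]
    if outfile:
        cmd += ["-f", outfile]
    return cmd

print(build_command("example.com", goal="emails", limit=200))
```

Swap in your own presets as you learn which sources pay off for a given engagement.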

Step 4: Configure API keys

# api-keys.yaml (place in theHarvester directory or specify with -c flag)
apikeys:
  hunter:
    key: YOUR_HUNTER_IO_KEY
  shodan:
    key: YOUR_SHODAN_KEY
  virustotal:
    key: YOUR_VIRUSTOTAL_KEY
  binaryedge:
    key: YOUR_BINARYEDGE_KEY
  fullhunt:
    key: YOUR_FULLHUNT_KEY
  securityTrails:
    key: YOUR_SECURITYTRAILS_KEY
  github:
    key: YOUR_GITHUB_TOKEN
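To keep keys out of version control, the file can be rendered from environment variables at run time. A minimal stdlib sketch; the environment-variable names are this example's own convention, and only three sources are mapped here (extend as needed):

```python
import os

# Map api-keys.yaml entry names (as in the layout above) to environment
# variables. The env var names are this sketch's convention, not theHarvester's.
KEY_SOURCES = {
    "hunter": "HUNTER_IO_KEY",
    "shodan": "SHODAN_KEY",
    "virustotal": "VIRUSTOTAL_KEY",
}

def render_api_keys(env=None):
    """Render an api-keys.yaml body, emitting only the keys that are set."""
    env = os.environ if env is None else env
    lines = ["apikeys:"]
    for name, var in KEY_SOURCES.items():
        value = env.get(var)
        if value:
            lines.append(f"  {name}:")
            lines.append(f"    key: {value}")
    return "\n".join(lines) + "\n"

with open("api-keys.yaml", "w") as f:
    f.write(render_api_keys())
```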

Step 5: Parse and process output with Python

import json
import subprocess
import re

def run_harvester(domain, sources="google,bing,certspotter,dnsdumpster", limit=500):
    """Run theHarvester and return parsed results."""
    output_file = f"harvester_{domain.replace('.', '_')}"
    cmd = [
        "theHarvester",
        "-d", domain,
        "-b", sources,
        "-l", str(limit),
        "-f", output_file,
    ]
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    print(result.stdout)

    # Parse JSON output
    json_file = f"{output_file}.json"
    try:
        with open(json_file) as f:
            data = json.load(f)
        return data
    except FileNotFoundError:
        # Fall back to parsing stdout
        return parse_stdout(result.stdout)

def parse_stdout(output):
    """Extract emails, hosts, and IPs from raw stdout."""
    emails = set(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', output))
    # Filter out false positives
    emails = {e for e in emails if not e.endswith(('.png', '.jpg', '.css', '.js'))}

    # Crude FQDN match: also catches email domains and unrelated hostnames,
    # so filter against the target domain downstream
    hosts = set(re.findall(r'[\w\.-]+\.\w{2,}', output))
    ips = set(re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', output))

    return {"emails": list(emails), "hosts": list(hosts), "ips": list(ips)}

def deduplicate_and_report(data, domain):
    """Clean and summarize harvested data."""
    emails = sorted(set(data.get("emails", [])))
    hosts = sorted(set(data.get("hosts", [])))
    ips = sorted(set(data.get("ips", [])))

    # Filter to the target domain (compare the part after '@' for emails)
    domain_emails = [e for e in emails if e.split("@")[-1].endswith(domain)]
    domain_hosts = [h for h in hosts if h.endswith(domain)]

    print(f"\n=== Harvest Report: {domain} ===")
    print(f"Emails found:    {len(domain_emails)}")
    print(f"Subdomains:      {len(domain_hosts)}")
    print(f"IP addresses:    {len(ips)}")

    if domain_emails:
        print("\nEmails:")
        for e in domain_emails[:20]:
            print(f"  {e}")

    if domain_hosts:
        print("\nSubdomains:")
        for h in domain_hosts[:20]:
            print(f"  {h}")

    return {
        "emails": domain_emails,
        "subdomains": domain_hosts,
        "ips": ips,
    }

# Usage
results = run_harvester("target-company.com", sources="google,bing,certspotter,hunter")
clean = deduplicate_and_report(results, "target-company.com")

# Save cleaned results
with open("clean_results.json", "w") as f:
    json.dump(clean, f, indent=2)

Step 6: Combine with other tools

# Pass discovered subdomains to nmap (only with explicit authorization)
theHarvester -d example.com -b all -f hosts
cat hosts.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for host in data.get('hosts', []):
    # entries are often formatted 'host:ip'; keep only the hostname
    print(host.split(':')[0])
" > subdomains.txt

# Feed subdomains into amass for deeper DNS enumeration
amass enum -df subdomains.txt -passive

# Check emails against breach databases (HIBP requires an API key and enforces rate limits)
cat emails.txt | while read email; do
    curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/$email" \
         -H "hibp-api-key: YOUR_HIBP_KEY"
    sleep 2  # stay under the API rate limit
done
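Entries in theHarvester's hosts output are frequently formatted host:ip. If your output matches that shape, a small helper can split them into pairs so hostnames and IPs can feed different tools. A sketch, assuming that formatting:

```python
def split_hosts(entries):
    """Split theHarvester-style 'host:ip' entries into (host, ip) pairs.

    Entries without an IP come back with ip=None. Assumes the common
    'hostname:ip' formatting of the hosts list; verify against your output.
    """
    pairs = []
    for entry in entries:
        host, _, ip = entry.partition(":")
        pairs.append((host, ip or None))
    return pairs

# Example with the host:ip shape described above
for host, ip in split_hosts(["www.example.com:93.184.216.34", "mail.example.com"]):
    print(host, ip)
```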

Available Sources Reference

Source          Data Type                 API Key Required
google          Emails, subdomains        No
bing            Emails, subdomains        No
duckduckgo      Emails, subdomains        No
linkedin        Employees, emails         Optional
hunter          Emails                    Yes
certspotter     Subdomains (SSL certs)    No
dnsdumpster     Subdomains, IPs           No
virustotal      Subdomains                Yes
shodan          IPs, open ports           Yes
securitytrails  Subdomains, DNS           Yes
github          Emails, code              Yes
binaryedge      IPs, services             Yes

Guidelines

  • Always get authorization before running theHarvester against a target — passive does not mean invisible. Data queries may be logged by third-party services.
  • Rate limits: Without API keys, theHarvester relies on scraping search engines which may throttle or block requests. Add API keys for reliable results.
  • Combine sources: No single source is complete. Use multiple sources and deduplicate.
  • Email format detection: Once you have a few emails (e.g., jsmith@corp.com, john.smith@corp.com), infer the naming convention and use it to generate a target list.
  • DNS verification: Always use -n or -v to verify discovered hosts are live before reporting.
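The email-format bullet above can be turned into a small generator. A sketch; the pattern list is illustrative, not exhaustive:

```python
def candidate_emails(first, last, domain):
    """Generate addresses for common corporate naming conventions.

    The pattern list below is illustrative, not exhaustive; add patterns
    that match whatever convention the harvested emails reveal.
    """
    f, l = first.lower(), last.lower()
    patterns = [
        f"{f}.{l}",    # john.smith
        f"{f}{l}",     # johnsmith
        f"{f[0]}{l}",  # jsmith
        f"{f}{l[0]}",  # johns
        f"{f}_{l}",    # john_smith
    ]
    return [f"{p}@{domain}" for p in patterns]

print(candidate_emails("John", "Smith", "corp.com"))
```

Only use generated lists against targets you are authorized to test, per the first guideline above.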