Claude-skill-registry create-semgrep-rule

Create custom Semgrep rules for vulnerability detection. Use when writing new rules for specific vulnerability patterns, creating org-specific detections, or building rules for novel attack vectors discovered during bug bounty hunting.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/create-semgrep-rule" ~/.claude/skills/majiayu000-claude-skill-registry-create-semgrep-rule && rm -rf "$T"
manifest: skills/data/create-semgrep-rule/SKILL.md
source content

Create Custom Semgrep Rules

Expert workflow for creating high-quality, low-false-positive Semgrep rules for security vulnerability detection.

When to Create Custom Rules

Create custom rules when:

  • Novel vulnerability patterns not covered by
    p/default
    or existing custom rules
  • Org-specific code patterns (custom frameworks, internal APIs, coding conventions)
  • Chained vulnerabilities requiring multi-step detection
  • Language/framework-specific bugs (e.g., PHP
    parse_url
    bypass, Go unsafe patterns)
  • High-value targets warranting deeper, targeted analysis
  • CVE variant hunting - Finding the same vulnerable pattern in other codebases

CVE-to-Rule Workflow

When creating rules from CVEs, the goal is to find the underlying vulnerable code pattern in OTHER codebases - NOT to detect the vulnerable library (SCA tools like Dependabot/Snyk do that better).

Anti-Pattern: SCA-Style Detection (DON'T DO THIS)

# WRONG - This is SCA work, not pattern detection
# Dependabot/Snyk already do this, and do it better
patterns:
  - pattern: require("loader-utils").parseQuery(...)
  - pattern: import { parseQuery } from "loader-utils"
  - pattern: require("vulnerable-package")

This approach:

  • Duplicates what SCA tools already do
  • Only finds the specific library, not the pattern
  • Misses the same vulnerability in custom code
  • Provides no value for bug bounty hunting

Correct Approach: Pattern Detection

Step 1: Fetch and analyze the fix commit

# Get the patch diff
curl -s https://github.com/org/repo/commit/abc123.patch

Ask yourself:

  • What was the root cause of the vulnerability?
  • What code pattern made it exploitable?
  • How did the fix address the root cause?
  • What would this pattern look like in custom code?

Step 2: Abstract the pattern

The key question: "If a developer wrote similar functionality from scratch, what would the vulnerable version look like?"

Don't think about the library. Think about the category of code that has this problem.

Step 3: Create a library-agnostic rule

The rule should find the SAME MISTAKE anywhere, not just in the specific library.

Example: CVE-2022-37601 (loader-utils Prototype Pollution)

Fix commit analysis:

// BEFORE (vulnerable)
const result = {};           // Has prototype chain
result[key] = value;         // key could be "__proto__"

// AFTER (fixed)
const result = Object.create(null);  // No prototype chain
result[key] = value;                 // "__proto__" is just a regular key

Root cause: Query string parsing into

{}
with unsanitized dynamic keys.

Abstracted pattern: Any code that:

  1. Creates an object with
    {}
    (not
    Object.create(null)
    )
  2. Assigns properties using dynamic/user-controlled keys
  3. Doesn't validate against
    __proto__
    ,
    constructor
    ,
    prototype

Rule focus: Find custom query parsers, config loaders, merge utilities, or any key-value processing with this antipattern.

What to detect:

// DETECT: Custom query parser with same vulnerability
function parseConfig(input) {
  const config = {};                    // Vulnerable: has prototype
  for (const [key, val] of entries) {
    config[key] = val;                  // Unsanitized key assignment
  }
  return config;
}

// DETECT: Custom merge/extend function
function merge(target, source) {
  for (const key in source) {
    target[key] = source[key];          // Prototype pollution sink
  }
}

What NOT to detect:

// SKIP: Using the library (SCA handles this)
const { parseQuery } = require("loader-utils");

// SKIP: Already using safe pattern
const result = Object.create(null);
result[key] = value;

// SKIP: Has prototype pollution guard
if (key === "__proto__" || key === "constructor") continue;

CVE-to-Rule Checklist

Before writing the rule, verify:

CheckQuestion
Root cause identifiedWhat code pattern caused the vulnerability?
Pattern abstractedWould I find this in custom code, not just the library?
Not SCAAm I detecting a pattern, not a library import?
Realistic matchesWill this find bugs in real-world code?
Low FP rateAre there clear safe patterns to exclude?

Common CVE Pattern Categories

CVE TypeRoot Cause PatternRule Focus
Prototype Pollution
obj[userKey] = val
on
{}
Custom parsers, merge functions
Template InjectionUser input in template optionsCustom template rendering
Command InjectionString concat to shell execCustom exec wrappers
Path TraversalUser input in file pathsCustom file handlers
SSRFUser input in URL constructionCustom HTTP clients
DeserializationUntrusted data to deserializerCustom data loaders

Rule Broadness: When Patterns Are Too Generic

Some vulnerability patterns are too common to detect without drowning in false positives. Before writing a rule, assess whether it will produce signal or noise.

Pattern Frequency Spectrum

Signal LevelPattern TypeExampleApproach
HIGHRare sink + user input
res.render(tpl, req.query)
Direct detection, HIGH confidence
MEDIUMCommon pattern + specific context
obj[key] = val
in loops
Audit rule, MEDIUM confidence
LOWUbiquitous pattern
obj[key] = val
anywhere
Skip or sink-focused only

Example: Prototype Pollution

Too broad (produces noise):

# This matches almost every JS file
pattern: $OBJ[$KEY] = $VALUE

Specific enough (produces signal):

# Recursive descent pattern - characteristic of vulnerable merge functions
patterns:
  - pattern: $SMTH = $SMTH[$A]
  - pattern-inside: |
      for (...) { ... }

Sink-focused (best signal):

# Detect where pollution becomes exploitable
pattern-sinks:
  - pattern: res.render($T, $OPTS)  # Template options = RCE
  - pattern: spawn($CMD, $ARGS, $OPTS)  # child_process options

When to Use Audit vs Vuln Rules

Rule TypeConfidenceUse Case
subcategory: vuln
HIGHRare pattern, clear exploit, few FPs
subcategory: audit
LOW-MEDIUMCommon pattern, needs manual review

If you can't achieve HIGH confidence, mark the rule as

audit
with LOW confidence. The official Semgrep registry does this for prototype pollution:

metadata:
  subcategory: audit
  confidence: LOW
  likelihood: LOW

Sink-Focused vs Pattern-Focused Rules

When a vulnerability pattern is too common to detect directly, focus on the sinks where it becomes exploitable:

VulnerabilityPattern-Focused (noisy)Sink-Focused (high signal)
Prototype Pollution
obj[key] = val
Template options, child_process options
XSSString concatenation
innerHTML
,
document.write
SQLiString + variable
cursor.execute
, ORM raw queries

Rule of thumb: If the source pattern is ubiquitous, detect at the sink instead.

Project Structure

custom-rules/
├── 0xdea-semgrep-rules/     # Third-party: Memory safety, C/C++ vulns
├── open-semgrep-rules/      # Third-party: Multi-language security rules
├── web-vulns/               # Web-specific injection rules
└── custom/                  # YOUR custom rules
    ├── org-specific/        # Rules targeting specific organizations
    │   └── <org-name>/      # Per-org rule directories
    └── novel-vulns/         # Novel vulnerability patterns

CRITICAL: Rule Quality Standards

Custom rules must meet these standards before use:

  • LOW false positive rate - Every FP wastes time; add exclusions aggressively
  • Clear security impact - Rule must detect exploitable vulnerabilities, not code smells
  • Tested against real code - Validate on target repos before adding to pipeline
  • Complete metadata - CWE, severity, confidence, references
  • Path exclusions for performance - Exclude bundled/minified files to prevent timeouts

CRITICAL: Path Exclusions for Performance

Taint mode rules are computationally expensive and will timeout on large bundled/minified files. Always add path exclusions to your rules.

Required Path Exclusions

Add this

paths
block to EVERY rule (especially taint mode):

rules:
  - id: my-taint-rule
    mode: taint
    paths:
      exclude:
        # Package managers
        - "**/node_modules/**"
        - "**/vendor/**"
        # Build output
        - "**/dist/**"
        - "**/build/**"
        # Minified/bundled files (specific patterns only)
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
        # NOTE: Do NOT use broad patterns like "**/js/*.js" or "**/assets/**"
        # as they exclude legitimate source files in some repos
    # ... rest of rule

Why This Matters

File TypeTypical SizeTaint Mode Behavior
Source file1-50 KBFast analysis
Bundled JS100KB-2MBTIMEOUT (30s default)
Minified JS50KB-500KBTIMEOUT or very slow

Real example: A 588KB Vite bundle (

viewer-init.mjs
) caused 3 timeout errors and blocked rule execution until path exclusions were added.

Signs You Need More Exclusions

When running your rule, watch for:

Warning: 3 timeout error(s) in path/to/file.mjs when running rules...
Semgrep stopped running rules on path/to/file.mjs after 3 timeout error(s).

Add the problematic file pattern to your

paths.exclude
list.

Workflow

Step 1: Define the Vulnerability

Before writing any YAML, answer these questions:

Vulnerability Type: [e.g., Command Injection, SSRF, SQLi]
CWE ID: [e.g., CWE-78]
Security Impact: [e.g., Remote code execution as web server user]
Vulnerable Pattern: [e.g., os.system() with user-controlled input]
Exploit Scenario: [e.g., Attacker controls filename parameter, injects shell commands]

Find 2-3 real examples from target codebase to guide pattern creation.

Step 2: Choose Rule Mode

ModeUse WhenExample
Pattern-basedSingle function calls, hardcoded values, dangerous API usage
eval()
, hardcoded secrets, weak crypto
Taint modeData flows from user input to dangerous sinkSQLi, XSS, command injection, SSRF

Decision guide:

  • "Is user input involved?" → Taint mode
  • "Is it a dangerous function regardless of input?" → Pattern mode
  • "Do I need to track data across variables/functions?" → Taint mode

Step 3: Write the Rule

Pattern-Based Rule Template

rules:
  - id: <org>-<vuln-type>-<specific-pattern>
    languages:
      - python
    message: |
      <Clear description of what was detected and why it's dangerous>

      Remediation: <Specific fix recommendation>
    severity: ERROR  # ERROR, WARNING, or INFO
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH  # HIGH, MEDIUM, LOW
      author: "Your Name"
      references:
        - https://cwe.mitre.org/data/definitions/XX.html
    patterns:
      - pattern-either:
          - pattern: dangerous_function($ARG)
          - pattern: other_dangerous_function($ARG)
      - pattern-not: safe_wrapper(...)
      - pattern-not-inside: |
          if $X is None:
              ...

Taint Mode Rule Template

rules:
  - id: <org>-<vuln-type>-taint
    mode: taint
    languages:
      - python  # or javascript, typescript, etc.
    # CRITICAL: Always include path exclusions for taint mode
    paths:
      exclude:
        - "**/node_modules/**"
        - "**/vendor/**"
        - "**/dist/**"
        - "**/build/**"
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
    message: |
      User input flows to <dangerous sink> without proper sanitization.
      This could allow <attack type>.

      Remediation: <Specific fix>
    severity: ERROR
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH
      author: "Your Name"
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json[...]
    pattern-sinks:
      - pattern: cursor.execute($QUERY, ...)
        focus-metavariable: $QUERY
    pattern-sanitizers:
      - pattern: escape(...)
      - pattern: int(...)
      - pattern: parameterized_query(...)

Step 4: Reduce False Positives

This is the most critical step. For every rule, consider:

Exclusion patterns to add:

# Exclude hardcoded/literal strings (not user input)
- pattern-not: $FUNC("...", ...)

# Exclude safe wrappers
- pattern-not: safe_execute(...)

# Exclude already-validated contexts
- pattern-not-inside: |
    if validate($INPUT):
        ...

# Exclude test files (if not already in .semgrepignore)
- pattern-not-inside: |
    def test_...:
        ...

Common FP sources:

  • Hardcoded strings (not user-controlled)
  • Test/example code
  • Already-sanitized inputs
  • Framework auto-escaping
  • Admin-only code paths

Step 5: Test the Rule

Create test file alongside rule:

custom-rules/custom/novel-vulns/
├── command-injection-eval.yml
└── command-injection-eval.py    # Test cases

Test file format:

# ruleid: command-injection-eval
eval(user_input)

# ruleid: command-injection-eval
exec(request.args.get('code'))

# ok: command-injection-eval
eval("2 + 2")  # Hardcoded, safe

# ok: command-injection-eval
safe_eval(user_input)  # Uses sanitizer

Run validation:

# Test rule syntax and test cases
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        --test custom-rules/custom/novel-vulns/

# Test against real target repo
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        repos/<org>/<repo>/

# Count findings
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        repos/<org>/ --json | jq '.results | length'

Step 5b: Test Performance (CRITICAL for Taint Mode)

Taint mode rules can timeout on large files. Always test on repos with bundled JS:

# Test against a repo known to have bundled files
time semgrep --config my-rule.yaml repos/<org>/<repo-with-bundles>/ 2>&1 | grep -E "(timeout|Error|Ran)"

Watch for these warning signs:

Warning: 3 timeout error(s) in path/to/file.mjs when running rules...

If you see timeouts:

  1. Check which files are causing issues:

    ls -la path/to/problematic/file.mjs  # Check file size
    head -c 200 path/to/problematic/file.mjs  # Check if minified
    
  2. Add path exclusions to your rule:

    paths:
      exclude:
        - "**/path/pattern/*.mjs"
    
  3. Re-test until no timeouts:

    # Should complete in seconds, not timeout
    time semgrep --config my-rule.yaml repos/<org>/<repo>/
    

Performance targets:

Repo SizeExpected TimeAction if Slower
Small (<100 files)< 5 secondsCheck for bundled files
Medium (100-1000 files)< 30 secondsAdd path exclusions
Large (1000+ files)< 2 minutesVerify exclusions working

Verify findings still work after exclusions:

# Run on source directory only (where real vulns are)
semgrep --config my-rule.yaml repos/<org>/<repo>/src/

Step 6: Integrate with Pipeline

Rules in

custom-rules/
are automatically included when running:

./scripts/scan-semgrep.sh <org-name>

To use only your custom rule:

semgrep --config custom-rules/custom/novel-vulns/my-rule.yml repos/<org>/

Pattern Operators Reference

Basic Matching

OperatorPurposeExample
pattern
Match exact code
os.system($CMD)
pattern-either
Match any (OR)Multiple dangerous functions
patterns
Match all (AND)Function + constraint

Metavariables

SyntaxMeaning
$VAR
Capture any expression
$_
Match anything (no capture)
$...ARGS
Match multiple arguments
<... $X ...>
Match $X nested at any depth
...
Match any statements between

Exclusions (Critical for FP reduction)

pattern-not: safe_function(...)           # Exclude specific pattern
pattern-not-inside: |                     # Exclude if inside context
  if validated($X):
      ...

Metavariable Constraints

# Regex match on captured variable
metavariable-regex:
  metavariable: $FUNC
  regex: "(system|exec|popen)"

# Pattern match on captured variable
metavariable-pattern:
  metavariable: $ARG
  pattern-either:
    - pattern: request.args[...]
    - pattern: request.form[...]

# Entropy analysis (detect secrets)
metavariable-analysis:
  analyzer: entropy
  metavariable: $VALUE

# Highlight specific variable in output
focus-metavariable: $DANGEROUS_ARG

Taint Mode Operators

mode: taint                    # Enable taint tracking

pattern-sources:               # Where tainted data enters
  - pattern: request.args[...]

pattern-sinks:                 # Where tainted data causes harm
  - pattern: cursor.execute($Q)
    focus-metavariable: $Q

pattern-sanitizers:            # Functions that clean data
  - pattern: escape(...)
  - pattern: int(...)

pattern-propagators:           # Custom taint spread (Pro only)
  - pattern: $TO = transform($FROM)
    from: $FROM
    to: $TO

Common Rule Patterns

Command Injection

patterns:
  - pattern-either:
      - pattern: os.system($CMD)
      - pattern: os.popen($CMD)
      - pattern: subprocess.call($CMD, shell=True, ...)
      - pattern: subprocess.Popen($CMD, shell=True, ...)
  - pattern-not: $FUNC("...", ...)  # Exclude hardcoded strings

SQL Injection (Taint)

mode: taint
pattern-sources:
  - pattern: request.$METHOD[...]
  - pattern: request.$METHOD.get(...)
pattern-sinks:
  - pattern: $CURSOR.execute($QUERY, ...)
  - pattern: $CURSOR.executemany($QUERY, ...)
pattern-sanitizers:
  - pattern: $CURSOR.execute("...", ($PARAM,))  # Parameterized

Hardcoded Secrets

patterns:
  - pattern: $VAR = "..."
  - metavariable-regex:
      metavariable: $VAR
      regex: "(?i)(password|secret|api_key|token|private_key)"
  - metavariable-analysis:
      analyzer: entropy
      metavariable: $VAR
  - pattern-not-inside: |
      # Example: ...

Insecure Cryptography

pattern-either:
  - pattern: hashlib.md5(...)
  - pattern: hashlib.sha1(...)
  - pattern: DES.new(...)
  - pattern: Blowfish.new(...)
  - pattern: ARC4.new(...)

Path Traversal

mode: taint
pattern-sources:
  - pattern: request.args.get("...")
  - pattern: request.form["..."]
pattern-sinks:
  - pattern: open($PATH, ...)
  - pattern: os.path.join(..., $PATH, ...)
pattern-sanitizers:
  - pattern: os.path.basename(...)
  - pattern: secure_filename(...)

Metadata Standards

Every rule MUST include:

metadata:
  # Required
  cwe: "CWE-78"                      # Primary CWE ID
  category: security                  # Always "security" for vulns
  confidence: HIGH                    # HIGH, MEDIUM, LOW

  # Recommended
  owasp:
    - "A03:2021-Injection"           # OWASP Top 10 2021
  likelihood: HIGH                    # Exploitation probability
  impact: HIGH                        # Damage if exploited
  subcategory:
    - vuln                           # vuln, audit, guardrail

  # For custom rules
  author: "Your Name"
  created: "2025-01-15"
  tested_against: "org-name"         # Where you validated it
  references:
    - https://cwe.mitre.org/...
    - https://blog.example.com/...   # Writeups explaining the vuln

Severity Guidelines

SeverityUse ForExamples
ERROR
Exploitable vulns with high impactRCE, SQLi, auth bypass
WARNING
Likely vulns needing verificationPotential XSS, weak crypto
INFO
Code smells, audit pointsMissing headers, debug code

Pro Engine Features

When running with

--pro
(our default), you get:

  • Cross-file taint tracking - Follow data across imports
  • Interprocedural analysis - Track through function calls
  • Field sensitivity - Track object properties

These are automatic; no rule changes needed.

Debugging Rules

Rule not matching expected code?

# Verbose output shows matching attempts
semgrep --config rule.yml target/ --debug

# Test specific pattern interactively
semgrep --pattern 'os.system($X)' target/

Too many false positives?

  • Add
    pattern-not
    for safe patterns
  • Add
    pattern-not-inside
    for safe contexts
  • Use
    metavariable-regex
    to constrain variable names
  • Lower
    confidence
    in metadata if FPs are expected

Output

Save completed rules to:

custom-rules/custom/
├── org-specific/<org-name>/    # Org-targeted rules
└── novel-vulns/                # General novel patterns

Rules are automatically picked up by

./scripts/scan-semgrep.sh
.

References