Claude-skills searching-codebases

install
source · Clone the upstream repo
git clone https://github.com/oaustegard/claude-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/oaustegard/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/searching-codebases" ~/.claude/skills/oaustegard-claude-skills-searching-codebases && rm -rf "$T"
manifest: searching-codebases/SKILL.md

Searching Codebases

Find code in any codebase by pattern or concept. One entry point, two search strategies, automatic routing.

Prerequisites

uv tool install ripgrep

tree-sitting (for structural context expansion) installs automatically when the --expand flag is used.

Primary Command

SKILL_DIR=/mnt/skills/user/searching-codebases

python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]

SOURCE is any of:

  • Local directory path
  • GitHub URL (downloads tarball automatically)
  • uploads (uses /mnt/user-data/uploads/)
  • project (uses /mnt/project/)
  • Path to a .zip or .tar.gz archive
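
The SOURCE forms listed above can be told apart with a few string checks. The following is a minimal sketch under stated assumptions: classify_source and its return values are hypothetical names for illustration, and the real logic in scripts/resolve.py is not shown in this document and may differ.

```python
# Hypothetical helper for illustration; not the actual scripts/resolve.py API.
ARCHIVE_SUFFIXES = (".zip", ".tar.gz")

# The two keyword sources and the directories the text says they map to.
ALIASES = {
    "uploads": "/mnt/user-data/uploads/",
    "project": "/mnt/project/",
}


def classify_source(source):
    """Return a (kind, location) pair for a SOURCE argument."""
    if source in ALIASES:
        return ("alias", ALIASES[source])
    if source.startswith(("https://github.com/", "http://github.com/")):
        return ("github", source)
    if source.lower().endswith(ARCHIVE_SUFFIXES):
        return ("archive", source)
    # Anything else is assumed to be a local directory path.
    return ("directory", source)


if __name__ == "__main__":
    for arg in ["./repo", "uploads", "code.tar.gz",
                "https://github.com/oaustegard/claude-skills"]:
        print(arg, "->", classify_source(arg))
```

Note that in this sketch a local directory literally named uploads or project would be shadowed by the keyword; passing ./uploads avoids the ambiguity.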

Search Modes

Regex mode (patterns, identifiers, literal text):

python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"

Semantic mode (concepts, natural language):

python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"

Auto-detection: short queries and code-like tokens → regex. Multi-word natural language → semantic. Override with --regex or --semantic.
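
To make the routing rule concrete, here is one way such a heuristic could look. This is a sketch under assumptions, not the skill's actual rule: route_query, CODE_HINTS, and the specific characters and keywords checked are all hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Characters and keywords that suggest code or a regex pattern.
# The exact set is an assumption made for this sketch.
CODE_HINTS = re.compile(r"[_(){}\[\]|*\\^$=<>]|\b(def|class|import|return)\b")


def route_query(query: str, force: Optional[str] = None) -> str:
    """Return "regex" or "semantic" for a query string."""
    if force in ("regex", "semantic"):
        return force              # corresponds to --regex / --semantic
    if len(query.split()) <= 1:
        return "regex"            # short, single-token query
    if CODE_HINTS.search(query):
        return "regex"            # looks like code or a pattern
    return "semantic"             # multi-word natural language


if __name__ == "__main__":
    for q in ["def handle_error", "TODO|FIXME|HACK", "retry logic with backoff"]:
        print(q, "->", route_query(q))
```

A keyword-based check like this will misroute prose that happens to contain a word such as "class", which is one reason an explicit override is useful.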

Options

  • --regex / --semantic: Force search mode
  • --expand: Return full function bodies via tree-sitting AST context
  • --benchmark: Compare indexed regex vs brute-force ripgrep
  • --branch NAME: Git branch for GitHub URLs (default: main)
  • --skip DIRS: Comma-separated directories to skip
  • --json: Machine-readable output
  • -v: Show index stats and query routing decisions

How It Works

Regex search builds a sparse n-gram inverted index over all files. Each query is decomposed into literal fragments, which are looked up in the index to identify candidate files (typically a 90-99% reduction in the files that must be scanned); ripgrep then verifies the candidates. Frequency-weighted n-grams make rare character sequences more selective.
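
The index-then-verify idea can be sketched in a few lines. This simplified version makes several assumptions that differ from the description above: it uses plain trigrams rather than sparse, frequency-weighted n-grams, it verifies with Python's re module instead of ripgrep, and it only narrows patterns made of literals joined by ".", ".*", or ".+". Any other metacharacter falls back to checking every file. All function names are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def literal_fragments(pattern):
    """Literal pieces every match must contain, or [] if unsure."""
    parts = re.split(r"\.[*+]?", pattern)
    if any(re.search(r"[\\^$*+?{}\[\]()|]", part) for part in parts):
        return []  # too complex for this sketch: do not narrow
    return [part for part in parts if len(part) >= 3]


def candidate_files(pattern, files, index):
    """Files that contain every trigram of every mandatory fragment."""
    candidates = set(files)
    for fragment in literal_fragments(pattern):
        for gram in trigrams(fragment):
            candidates &= index.get(gram, set())
    return candidates


def search(pattern, files, index):
    """Narrow with the index, then verify candidates with the real regex."""
    regex = re.compile(pattern)
    return sorted(name for name in candidate_files(pattern, files, index)
                  if regex.search(files[name]))


if __name__ == "__main__":
    files = {
        "a.py": "class MyError(Exception):\n    pass\n",
        "b.py": "def retry():\n    return 1\n",
        "c.py": "# TODO: class cleanup\n",
    }
    index = build_index(files)
    print(search("class.*Exception", files, index))
```

The index only ever produces a superset of the true matches, so the verification pass is what guarantees correctness; the index just reduces how many files that pass has to read.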

Semantic search builds a TF-IDF index over code chunks (functions, classes, structural entries). Chunks are ranked by their cosine similarity to the query.
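
The ranking step can be illustrated without any dependencies. The sketch below is a standard-library stand-in, not the code in scripts/code_rag.py (which, per the Dependencies section, relies on scikit-learn). The tokenizer, function names, and sample chunks are assumptions for illustration; the smoothed IDF formula is the one scikit-learn's TfidfVectorizer uses by default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic tokens; splits handle_error into two words."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def tfidf_vectors(chunks):
    """Return (vectors, idf); each vector maps term -> tf * idf."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n = len(chunks)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = [{term: count * idf[term] for term, count in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, chunks):
    """Indices of chunks with non-zero similarity, best match first."""
    vectors, idf = tfidf_vectors(chunks)
    counts = Counter(tokenize(query))
    query_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]


if __name__ == "__main__":
    chunks = [
        "def retry_with_backoff(fn): sleep then retry",
        "def login(user): authentication check password",
        "def parse_config(path): read yaml",
    ]
    print(rank("retry backoff", chunks))
```

Because TF-IDF matches on shared terms, a query only finds chunks that use some of the same words, which is why splitting identifiers into words matters here.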

Context expansion (--expand) uses tree-sitting's AST cache to identify function/class boundaries, returning complete structural units rather than line fragments. On first use, tree-sitting scans the repo (~700ms for 250 files); subsequent expansions are sub-millisecond.
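
The idea of expanding a matched line to its enclosing structural unit can be shown with Python's built-in ast module. This is only an analogy: tree-sitting's API is not documented here, so the sketch uses ast as a stand-in, handles Python source only, and does no caching. The name enclosing_unit is hypothetical.

```python
import ast


def enclosing_unit(source, line):
    """Source text of the innermost function or class containing a 1-based line."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        if node.lineno <= line <= node.end_lineno:
            # Prefer the smallest enclosing span, i.e. the innermost unit.
            span = node.end_lineno - node.lineno
            if best is None or span < best.end_lineno - best.lineno:
                best = node
    return ast.get_source_segment(source, best) if best is not None else None


if __name__ == "__main__":
    code = (
        "class A:\n"
        "    def f(self):\n"
        "        x = 1\n"
        "        return x\n"
        "\n"
        "    def g(self):\n"
        "        return 2\n"
    )
    # A match on line 3 expands to the whole body of f.
    print(enclosing_unit(code, 3))
```

A real implementation would parse each file once and reuse the result, which is consistent with the one-time scan cost described above.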

Small codebases (< 20 files) skip indexing entirely — direct ripgrep is faster when there's nothing to narrow.
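
The small-codebase shortcut amounts to a file count compared against a threshold. A minimal sketch, assuming hypothetical helper names and a default skip list of just .git (the skill's actual defaults are not stated in this document):

```python
import os

INDEX_THRESHOLD = 20  # from the text: fewer than 20 files skips indexing


def list_files(root, skip=(".git",)):
    """All file paths under root, pruning directories named in skip."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in skip]
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return found


def choose_strategy(file_count, threshold=INDEX_THRESHOLD):
    """Pick direct ripgrep for small trees, the n-gram index otherwise."""
    return "direct-ripgrep" if file_count < threshold else "indexed"


if __name__ == "__main__":
    files = list_files(".")
    print(len(files), "files ->", choose_strategy(len(files)))
```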

Mixed Queries

Multiple queries can use different modes in a single invocation. Each query is auto-routed independently, and indexes are built once per mode:

python3 $SKILL_DIR/scripts/search.py ./repo \
  "class.*Error" \
  "error recovery strategy" \
  "def retry"

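Building each index once per mode is a lazy-initialization pattern. The sketch below shows the shape of it under assumptions: plan_queries, toy_route, and the builder callables are hypothetical stand-ins, and the real orchestration in scripts/search.py may be organized differently.

```python
def plan_queries(queries, route, builders):
    """Route each query and build each mode's index at most once.

    route:    callable mapping a query string to a mode name
    builders: dict mapping a mode name to a zero-argument index builder
    """
    indexes = {}
    plan = []
    for query in queries:
        mode = route(query)
        if mode not in indexes:
            indexes[mode] = builders[mode]()  # built on first use only
        plan.append((query, mode))
    return plan, indexes


def toy_route(query):
    """Stand-in router for the demo: pattern-like queries go to regex."""
    looks_like_code = any(ch in query for ch in "*|_") or query.startswith("def ")
    return "regex" if looks_like_code else "semantic"


if __name__ == "__main__":
    builders = {
        "regex": lambda: "n-gram index placeholder",
        "semantic": lambda: "tf-idf index placeholder",
    }
    queries = ["class.*Error", "error recovery strategy", "def retry"]
    plan, indexes = plan_queries(queries, toy_route, builders)
    print(plan)
    print(sorted(indexes))
```

With the three queries from the command above, the regex builder runs once even though two queries use it.
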
Dependencies

  • tree-sitting: Provides AST-based context expansion for --expand. Not required: search works without it, just with less structural context in results.
  • ripgrep: Required for regex verification. Install via uv tool install ripgrep.
  • scikit-learn: Required for semantic mode. Installs automatically.

When to Use

  • Known target: "where is the retry logic?", "find all error handlers"
  • Pattern matching: regex across large codebases with indexed speedup
  • Concept search: "authentication flow", "database connection pooling"
  • Cross-reference: find all callers/users of a specific function

When NOT to Use

  • First encounter: "what does this repo do?" → use exploring-codebases
  • Repos under ~10 files: just read them directly
  • Exact symbol lookup: find_symbol('ClassName') via tree-sitting is simpler
  • Structural overview: use tree-sitting's tree_overview() / dir_overview()

Files

  • scripts/search.py: Entry point, query routing, output formatting
  • scripts/resolve.py: Input source resolution (GitHub, uploads, archives)
  • scripts/context.py: tree-sitting-based AST context expansion
  • scripts/ngram_index.py: Sparse n-gram inverted index, regex decomposition
  • scripts/sparse_ngrams.py: Core n-gram algorithms, frequency weights
  • scripts/code_rag.py: TF-IDF semantic search over code chunks