Claude-skills searching-codebases
git clone https://github.com/oaustegard/claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/oaustegard/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/searching-codebases" ~/.claude/skills/oaustegard-claude-skills-searching-codebases && rm -rf "$T"
searching-codebases/SKILL.md
Searching Codebases
Find code in any codebase by pattern or concept. One entry point, two search strategies, automatic routing.
Prerequisites
uv tool install ripgrep
tree-sitting (for structural context expansion) installs automatically when the
--expand flag is used.
Primary Command
SKILL_DIR=/mnt/skills/user/searching-codebases
python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]
SOURCE is any of:
- Local directory path
- GitHub URL (downloads tarball automatically)
- uploads (uses /mnt/user-data/uploads/)
- project (uses /mnt/project/)
- Path to a .zip or .tar.gz archive
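The SOURCE forms listed above can be told apart with a few cheap checks. The following is a minimal sketch only, assuming that uploads and project are literal keywords; the function name and the returned labels are hypothetical, and the real scripts/resolve.py may work differently.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return a coarse label for a SOURCE argument (hypothetical labels)."""
    if source == "uploads":
        return "uploads"    # documented shortcut for /mnt/user-data/uploads/
    if source == "project":
        return "project"    # documented shortcut for /mnt/project/
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc == "github.com":
        return "github"     # would be downloaded as a tarball
    if source.endswith((".zip", ".tar.gz")):
        return "archive"
    return "directory"      # fall back to treating it as a local path


if __name__ == "__main__":
    for src in ("./repo", "uploads", "code.tar.gz",
                "https://github.com/oaustegard/claude-skills"):
        print(src, "->", classify_source(src))
```

Checking the keywords before the URL and archive tests keeps the shortcuts unambiguous, at the cost of shadowing a local directory that happens to be named uploads or project.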
Search Modes
Regex mode (patterns, identifiers, literal text):
python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"
Semantic mode (concepts, natural language):
python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"
Auto-detection: short queries and code-like tokens → regex. Multi-word natural language → semantic. Override with
--regex or --semantic.
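To make the routing rule concrete, here is one way such a heuristic could look. It is an illustrative sketch, not the logic in scripts/search.py: the character set, the keyword list, and the function name route_query are all assumptions chosen so that the example queries in this document land in the mode shown for them.

```python
import re
from typing import Optional

# Punctuation, underscores, or camelCase suggest code or a regex (assumed set).
CODE_HINTS = re.compile(r"[_(){}\[\].*|\\^$:=<>]|[a-z][A-Z]")
# Leading keywords that suggest a code fragment (assumed list).
CODE_KEYWORDS = {"def", "class", "import", "return", "function", "func", "fn"}


def route_query(query: str, force: Optional[str] = None) -> str:
    """Return "regex" or "semantic" for a query."""
    if force in ("regex", "semantic"):
        return force                  # mirrors the --regex / --semantic override
    words = query.split()
    if len(words) <= 1:
        return "regex"                # a single token is treated as an identifier
    if CODE_HINTS.search(query) or words[0] in CODE_KEYWORDS:
        return "regex"                # code-like tokens
    return "semantic"                 # multi-word natural language


if __name__ == "__main__":
    for q in ("def handle_error", "class.*Exception", "TODO|FIXME|HACK",
              "authentication flow", "retry logic with backoff"):
        print(f"{q!r} -> {route_query(q)}")
```

A heuristic like this will misroute some queries (for example, an English phrase that starts with "class"), which is why an explicit override flag is useful.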
Options
- --regex / --semantic: Force search mode
- --expand: Return full function bodies via tree-sitting AST context
- --benchmark: Compare indexed regex vs brute-force ripgrep
- --branch NAME: Git branch for GitHub URLs (default: main)
- --skip DIRS: Comma-separated directories to skip
- --json: Machine-readable output
- -v: Show index stats and query routing decisions
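When calling the script from other Python code, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of the skill; it uses only the flags listed above and the SKILL_DIR path from the Primary Command section, and it makes no assumption about the shape of the --json output.

```python
from typing import List, Optional, Sequence

SKILL_DIR = "/mnt/skills/user/searching-codebases"  # path from Primary Command


def build_search_command(
    source: str,
    queries: Sequence[str],
    mode: Optional[str] = None,      # "regex", "semantic", or None for auto
    expand: bool = False,
    branch: Optional[str] = None,
    skip: Sequence[str] = (),
    json_output: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build an argv list for scripts/search.py (hypothetical helper)."""
    cmd = ["python3", f"{SKILL_DIR}/scripts/search.py", source, *queries]
    if mode in ("regex", "semantic"):
        cmd.append(f"--{mode}")
    if expand:
        cmd.append("--expand")
    if branch:
        cmd += ["--branch", branch]
    if skip:
        cmd += ["--skip", ",".join(skip)]   # comma-separated, as documented
    if json_output:
        cmd.append("--json")
    if verbose:
        cmd.append("-v")
    return cmd


if __name__ == "__main__":
    print(build_search_command("./repo", ["def retry"], json_output=True))
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems with regex queries such as "class.*Error".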
How It Works
Regex search builds a sparse n-gram inverted index over all files. Queries are decomposed into literal fragments, looked up in the index to identify candidate files (typically 90-99% reduction), then verified with ripgrep. Frequency-weighted n-grams make rare character sequences more selective.
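The index-then-verify idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses fixed-length trigrams instead of the skill's frequency-weighted sparse n-grams, verifies with Python's re module instead of ripgrep, and falls back to scanning every file when the pattern contains alternation or groups. All names here are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def required_literals(pattern):
    """Crude, conservative list of literal runs every match must contain."""
    if re.search(r"[|()]", pattern):
        return []                                       # alternation/groups: no narrowing
    p = re.sub(r"\\.", "\x00", pattern)                 # escapes become separators
    p = re.sub(r"\[[^\]]*\]", "\x00", p)                # so do character classes
    p = re.sub(r".(?:[*?]|\{[^}]*\})\??", "\x00", p)    # and optionally-repeated chars
    return [f for f in re.split(r"[\x00.+^$*?{}]+", p) if len(f) >= 3]


def search(files, index, pattern):
    """Return (matching file names, number of candidate files verified)."""
    candidates = set(files)
    for fragment in required_literals(pattern):
        for gram in trigrams(fragment):
            candidates &= index.get(gram, set())        # keep files containing every trigram
    regex = re.compile(pattern)
    hits = sorted(n for n in candidates if regex.search(files[n]))
    return hits, len(candidates)


FILES = {
    "a.py": "def handle_error(e):\n    raise\n",
    "b.py": "class ParseException(Exception):\n    pass\n",
    "c.py": "print('hello')\n",
}

if __name__ == "__main__":
    idx = build_index(FILES)
    print(search(FILES, idx, "class.*Exception"))
    print(search(FILES, idx, "TODO|FIXME"))
```

The important property is that filtering may only discard files that cannot match: a file missing any trigram of a required literal is safely skipped, and everything else still goes through full regex verification.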
Semantic search builds a TF-IDF index over code chunks (functions, classes, structural entries). Queries are ranked by cosine similarity.
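As a rough illustration of TF-IDF ranking with cosine similarity, here is a dependency-free sketch. It is not the code in scripts/code_rag.py (which, per the Dependencies section, relies on scikit-learn); the tokenizer, the smoothed idf formula, and the sample chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic runs, so retry_request yields retry and request."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(chunks, query):
    """Return (score, chunk name) pairs, best match first."""
    counts = {name: Counter(tokenize(body)) for name, body in chunks.items()}
    n = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in doc_freq.items()}

    def vectorize(term_counts):
        # Terms never seen in the corpus carry no weight.
        return {t: c * idf[t] for t, c in term_counts.items() if t in idf}

    q_vec = vectorize(Counter(tokenize(query)))
    scored = [(cosine(q_vec, vectorize(c)), name) for name, c in counts.items()]
    return sorted(scored, reverse=True)


CHUNKS = {
    "retry_request": (
        "def retry_request(url, attempts):\n"
        "    # exponential backoff between attempts\n"
        "    delay = 1\n"
        "    for _ in range(attempts):\n"
        "        sleep(delay)\n"
        "        delay *= 2\n"
    ),
    "parse_config": "def parse_config(path):\n    return json.load(open(path))\n",
    "login": "def login(user, password):\n    return authenticate(user, password)\n",
}

if __name__ == "__main__":
    for score, name in rank_chunks(CHUNKS, "retry logic with backoff"):
        print(f"{score:.3f}  {name}")
```

Because TF-IDF matches on shared vocabulary, a query only finds chunks whose identifiers or comments use some of the same words; it does not understand synonyms.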
Context expansion (--expand) uses tree-sitting's AST cache to identify function/class boundaries, returning complete structural units rather than line fragments. On first use, tree-sitting scans the repo (~700ms for 250 files); subsequent expansions are sub-millisecond.
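The idea of expanding a matching line to its enclosing structural unit can be shown with Python's standard ast module. This is a stand-in for illustration only: the skill uses tree-sitting rather than ast, and the function names below are hypothetical. The cached parse mimics the pattern of paying the parsing cost once and answering later lookups from memory.

```python
import ast
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_cached(source: str) -> ast.Module:
    """Parse once per distinct source string; later calls hit the cache."""
    return ast.parse(source)


def enclosing_definition(source: str, line: int):
    """Return (name, start, end) of the innermost def/class containing the
    1-based line number, or None if the line is at module level."""
    best = None
    for node in ast.walk(parse_cached(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.lineno <= line <= node.end_lineno:
                # Among containing nodes, the one starting latest is innermost.
                if best is None or node.lineno >= best.lineno:
                    best = node
    return (best.name, best.lineno, best.end_lineno) if best else None


SAMPLE = """class Client:
    def fetch(self, url):
        for attempt in range(3):
            try:
                return get(url)
            except IOError:
                continue

def helper():
    pass
"""

if __name__ == "__main__":
    # A regex hit on line 5 expands to the whole fetch method.
    print(enclosing_definition(SAMPLE, 5))
```

Returning the start and end lines lets the caller print the complete function body instead of the single matching line.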
Small codebases (< 20 files) skip indexing entirely — direct ripgrep is faster when there's nothing to narrow.
Mixed Queries
Multiple queries can use different modes in a single invocation. Each query is auto-routed independently, and indexes are built once per mode:
python3 $SKILL_DIR/scripts/search.py ./repo \
  "class.*Error" \
  "error recovery strategy" \
  "def retry"
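Building each index once per mode amounts to a lazy cache keyed by mode. The sketch below shows that pattern under assumptions: run_queries and toy_route are hypothetical names, and toy_route is a deliberately trivial stand-in for the real auto-detection.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def toy_route(query: str) -> str:
    """Trivial stand-in router: three or more words means semantic."""
    return "semantic" if query.count(" ") >= 2 else "regex"


def run_queries(
    queries: Sequence[str],
    route: Callable[[str], str],
    builders: Dict[str, Callable[[], object]],
) -> Tuple[List[Tuple[str, str]], Dict[str, object]]:
    """Route each query independently; build each mode's index at most once."""
    indexes: Dict[str, object] = {}
    plan: List[Tuple[str, str]] = []
    for query in queries:
        mode = route(query)
        if mode not in indexes:
            indexes[mode] = builders[mode]()   # built the first time the mode is needed
        plan.append((query, mode))
    return plan, indexes


if __name__ == "__main__":
    builders = {
        "regex": lambda: "n-gram index placeholder",
        "semantic": lambda: "TF-IDF index placeholder",
    }
    plan, _ = run_queries(
        ["class.*Error", "error recovery strategy", "def retry"],
        toy_route, builders,
    )
    print(plan)
```

With the three queries from the command above, the regex builder runs once even though two queries use it, and a mode that no query needs is never built.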
Dependencies
- tree-sitting: Provides AST-based context expansion for --expand. Not required: search works without it, just with less structural context in results.
- ripgrep: Required for regex verification. Install via uv tool install ripgrep.
- scikit-learn: Required for semantic mode. Installs automatically.
When to Use
- Known target: "where is the retry logic?", "find all error handlers"
- Pattern matching: regex across large codebases with indexed speedup
- Concept search: "authentication flow", "database connection pooling"
- Cross-reference: find all callers/users of a specific function
When NOT to Use
- First encounter: "what does this repo do?" → use exploring-codebases
- Repos under ~10 files: just read them directly
- Exact symbol lookup: find_symbol('ClassName') via tree-sitting is simpler
- Structural overview: use tree-sitting's tree_overview() / dir_overview()
Files
- scripts/search.py: Entry point, query routing, output formatting
- scripts/resolve.py: Input source resolution (GitHub, uploads, archives)
- scripts/context.py: tree-sitting-based AST context expansion
- scripts/ngram_index.py: Sparse n-gram inverted index, regex decomposition
- scripts/sparse_ngrams.py: Core n-gram algorithms, frequency weights
- scripts/code_rag.py: TF-IDF semantic search over code chunks