# Claude-skills exploring-codebases

```shell
# Clone the whole skills repo:
git clone https://github.com/oaustegard/claude-skills

# Or install just this skill into ~/.claude/skills:
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/oaustegard/claude-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/exploring-codebases" ~/.claude/skills/oaustegard-claude-skills-exploring-codebases \
  && rm -rf "$T"
```
`exploring-codebases/SKILL.md`

# Exploring Codebases
Exploratory code analysis for unfamiliar repositories. Orchestrates tree-sitting (structural) and featuring (semantic) over a local copy.
## Workflow
Five numbered steps, in order. Do not skip step 0.
### 0. Setup (once per session)
```shell
uv venv /home/claude/.venv 2>/dev/null
uv pip install tree-sitter-language-pack --python /home/claude/.venv/bin/python
export PYTHON=/home/claude/.venv/bin/python
export TREESIT=/mnt/skills/user/tree-sitting/scripts/treesit.py
export GATHER=/mnt/skills/user/featuring/scripts/gather.py
```
If step 2's `--stats` later reports `Scanned 0 files ... Errors: 1`, the language pack isn't loaded — come back here and install it. Treesit fails silently on missing deps; it does not raise a useful error.
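As a quick guard against that silent failure, you can check the install directly. A minimal sketch, assuming `PYTHON` was exported in step 0 and that the package's import name is `tree_sitter_language_pack`:

```shell
# Prints "ok" if the language pack is importable from the venv, "MISSING" otherwise.
$PYTHON -c 'import importlib.util; print("ok" if importlib.util.find_spec("tree_sitter_language_pack") else "MISSING")'
```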
### 1. Get the repo (tarball, not per-file)
```shell
OWNER=... REPO=... REF=main   # branch name, tag, or SHA. For a PR: pull/N/head
curl -sL -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/repos/$OWNER/$REPO/tarball/$REF" -o /tmp/$REPO.tar.gz
mkdir -p /tmp/$REPO && tar -xzf /tmp/$REPO.tar.gz -C /tmp/$REPO --strip-components=1
ls /tmp/$REPO | head   # sanity check — did extraction land?
```
One HTTP call gets the whole repo. Do NOT curl the README, cat files, or fetch via `contents/PATH` first — they're all in the tarball. The `Authorization` header is only needed for private repos; public repos work without it.
Ref selection matters. If exploring a feature branch, PR, or tag, set `REF` accordingly. The default `main` will silently give you stale code if the question is about an unmerged branch.
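A few example values — the branch name, tag, and PR number below are placeholders, not refs from any real repo:

```shell
REF=feature/retry-backoff   # a feature branch (placeholder name)
REF=v2.1.0                  # a release tag (placeholder)
REF=pull/123/head           # head commit of PR #123, via GitHub's virtual PR ref
```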
### 2. Structural scan
```shell
$PYTHON $TREESIT /tmp/$REPO --stats
```
Read the output. It gives file counts, symbol counts, languages, and per-directory symbol density. This IS the orienting artifact — treat it as the product of this step, not warm-up.
Drill only if you have a specific question. For pure "what is this repo" exploration, skip drilling and go to step 3 — featuring surfaces the interesting paths for you. Drill when a user asked about a specific subsystem, or when step 3's output raises a question that needs source.
When you do drill, BATCH queries in one call — each extra query adds ~0ms, separate invocations re-scan from scratch:
```shell
$PYTHON $TREESIT /tmp/$REPO --path=SUBDIR --detail=full \
  'find:*Handler*:function' 'source:main' 'refs:Config'
```
### 3. Feature synthesis
```shell
$PYTHON $GATHER /tmp/$REPO \
  --skip tests,.github,node_modules --source-budget 8000
```
Output includes a "Candidate areas for sub-files (by symbol density)" list near the top — that's your drill-target picker, ranked.
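If you want just that list on its own, one way is to slice it out of the output. A sketch only — it assumes the section header reads as quoted above and that the section ends at the next blank line:

```shell
# Print only the candidate-area section from the gather output.
$PYTHON $GATHER /tmp/$REPO --skip tests,.github,node_modules --source-budget 8000 |
  sed -n '/Candidate areas/,/^$/p'
```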
### 4. Reason about the combined output
Synthesize 2+3: capabilities, feature groups, architecture, entry points, anomalies. Produce `_FEATURES.md` when warranted. This is the LLM step; everything before was mechanical.
## When to Use This vs Other Skills
| Situation | Use |
|---|---|
| "I just cloned this, what is it?" | exploring-codebases (this skill) |
| "Where is the retry logic?" | searching-codebases |
| "Find all files matching " | searching-codebases |
| "Show me the symbols in auth.py" | tree-sitting directly |
| "Document what this codebase does" | featuring directly |
Exploring is the divergent skill — you don't know what you're looking for yet. Searching is the convergent skill — you know what you want.
## Notes
- Large repos (>100 files): use `--skip tests,vendored,docs,...` in step 2 to focus the scan.
- Monorepos: treat each package/service as a separate exploration. Generate per-subsystem `_FEATURES.md` files linked from a root index.
- Drill heuristics (if step 2 drilling is warranted): directories with high symbol-to-file ratio (dense logic), entry-point names (`main`, `cli`, `app`, `server`, `routes`), files with many imports (integration points).
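The "many imports" heuristic can be approximated with plain grep, no treesit needed. A rough sketch assuming Python sources; adapt the pattern for other languages:

```shell
# Rank files by count of top-level import lines, highest first.
# With multiple files grep prints "path:count", so sort on the count field.
find /tmp/$REPO -name '*.py' -exec grep -cE '^(import|from) ' {} + |
  sort -t: -k2 -rn | head
```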