Claude-skill-registry-data mechinterp-overview

Quick "first look" overview of SAE features - top tokens, activation stats, weapons, families, sample contexts

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mechinterp-overview" ~/.claude/skills/majiayu000-claude-skill-registry-data-mechinterp-overview && rm -rf "$T"

manifest: data/mechinterp-overview/SKILL.md

source content

MechInterp Overview

Get a comprehensive first-look overview of an SAE feature before deep investigation. This skill provides a fast summary of key characteristics to help you decide what hypotheses to test.

⚠️ CRITICAL: Overview is NOT Findings

The overview shows CORRELATIONS, not CAUSATION. It is a starting point for generating hypotheses, NOT a source of conclusions.

Overview Shows	What It Actually Means
Top tokens (PageRank)	Tokens that CO-OCCUR with high activation (correlation)
Family breakdown	Which ability families appear in high-activation examples
Top weapons	Weapons present in high-activation examples

You CANNOT conclude from overview alone:

That a token "drives" or "causes" activation
That the feature "detects" a specific ability
That correlations are meaningful vs spurious

To make conclusions, you MUST run experiments (see mechinterp-investigator for deep dive basics).

Purpose

The overview skill:

Computes PageRank-weighted top tokens for a feature
Shows activation statistics (mean, std, median, sparsity)
Aggregates tokens by ability family
Lists top weapons associated with the feature
Provides sample high-activation contexts
Checks for existing labels and ReLU floor issues

When to Use

Use this skill when:

Starting to investigate a new feature
You want a quick summary before running experiments
Deciding which feature to label next
Checking if a feature has already been labeled

DO NOT use overview results as final findings. Always follow up with experiments.

Output Information

Section	Description
Activation Stats	Mean, std, median, sparsity percentage, example count
Top Tokens	PageRank-weighted most important tokens (enhancers)
Bottom Tokens	Tokens suppressed in high-activation examples
Family Breakdown	Aggregated scores by ability family (SCU, SSU, etc.)
Top Weapons	Weapons with most examples for this feature
Sample Contexts	3-5 high-activation example builds
Existing Label	Current label if one exists
ReLU Floor	Warning if feature is mostly zeros (>50%)

Sparsity Definition

Sparsity = % of examples where feature activation is ZERO

A high sparsity percentage means the feature fires RARELY (is selective):

Sparsity	Meaning	Interpretation
95%+	Very sparse	Fires on at most 5% of examples - very specific pattern
80-95%	Moderately sparse	Good discriminative feature (fires on 5-20% of examples)
50-80%	Dense	Fires often (20-50% of examples) - broad pattern
<50%	Very dense	Fires on majority of examples - may be baseline feature

Common confusion: "89% sparsity" means "fires on 11% of examples" NOT "fires often."

Think of it as: Sparsity = how empty/silent the feature usually is.
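The definition above can be checked by hand on a toy activation vector. The sketch below is not part of the SplatNLP codebase: `sparsity_pct` and `sparsity_band` are hypothetical helper names, and the band cutoffs are taken from the table above.

```python
def sparsity_pct(activations):
    """Percentage of examples whose activation is exactly zero."""
    if not activations:
        raise ValueError("need at least one activation")
    zeros = sum(1 for a in activations if a == 0)
    return 100.0 * zeros / len(activations)


def sparsity_band(pct):
    """Map a sparsity percentage onto the bands in the table above."""
    if pct >= 95:
        return "very sparse"
    if pct >= 80:
        return "moderately sparse"
    if pct >= 50:
        return "dense"
    return "very dense"


# Toy data: 89 silent examples and 11 firing ones.
acts = [0.0] * 89 + [0.4] * 11
pct = sparsity_pct(acts)
print(f"sparsity={pct:.1f}%  fires_on={100 - pct:.1f}%  band={sparsity_band(pct)}")
```

With 89 zeros out of 100 examples the feature is 89% sparse, which means it fires on 11% of examples.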

CRITICAL: Always check the Bottom Tokens section! Tokens that rarely appear in high-activation examples reveal what the feature avoids, which is often more informative than what it detects.

Usage

Command Line

cd /root/dev/SplatNLP

# Basic overview (markdown output)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra

# JSON output for programmatic use
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --format json

# More top tokens
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --top-k 25

# Full model
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 5432 \
    --model full

# Verbose logging
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --verbose

Extended Analyses

Additional analysis flags provide deeper insights:

# Token enrichment (enhancers/suppressors)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment

# Activation region breakdown (anti-flanderization)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --regions

# Binary ability enrichment (main-only abilities)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --binary

# Sub/special weapon breakdown (kit analysis)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --kit

# All extended analyses at once
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --all

# Customize high-activation threshold (default: 0.90 = top 10%)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment --high-percentile 0.95

Extended Analysis Reference

Flag	Purpose	Output
`--enrichment`	Token enrichment ratios	Suppressors (<0.8x) and enhancers (>1.2x)
`--regions`	Activation regions	Floor/Low/Core/High/Flanderization breakdown
`--binary`	Binary ability presence	Enrichment for main-only abilities (Comeback, Stealth Jump, etc.)
`--kit`	Sub/special breakdown	Which subs/specials appear in core region
`--all`	Enable all above	Combined output
`--kit-region`	Region for kit analysis	`core` (default), `high` , or `all`
`--high-percentile`	Threshold for "high"	Default: 0.90 (top 10%)
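To make the enrichment cutoffs concrete, here is a minimal sketch of one plausible way to compute such a ratio: the token's rate among high-activation examples divided by its rate across all examples. This definition, the function names, and the toy data are assumptions for illustration; the CLI's actual computation may differ.

```python
def enrichment_ratio(examples, token, high_percentile=0.90):
    """Rate of `token` among high-activation examples divided by its overall rate.

    `examples` is a list of (tokens, activation) pairs, where `tokens` is a set.
    Returns None when the token never appears (ratio undefined).
    """
    acts = sorted(act for _, act in examples)
    # Nearest-rank cutoff: with 0.90, roughly the top 10% of examples count as "high".
    cutoff = acts[min(len(acts) - 1, int(high_percentile * len(acts)))]
    high = [tokens for tokens, act in examples if act >= cutoff]

    base_rate = sum(token in tokens for tokens, _ in examples) / len(examples)
    if base_rate == 0:
        return None
    high_rate = sum(token in tokens for tokens in high) / len(high)
    return high_rate / base_rate


def classify(ratio):
    """Apply the <0.8x / >1.2x cutoffs from the table above."""
    if ratio is None:
        return "absent"
    if ratio < 0.8:
        return "suppressor"
    if ratio > 1.2:
        return "enhancer"
    return "neutral"


# Toy data: 20 builds with activations 0.00 .. 0.95; token names are placeholders.
examples = []
for i in range(20):
    tokens = set()
    if i >= 10:
        tokens.add("tok_a")  # concentrated in the higher-activation half
    if i < 10:
        tokens.add("tok_b")  # only in the lower-activation half
    if i % 2 == 0:
        tokens.add("tok_c")  # spread evenly across activations
    examples.append((tokens, i / 20))

for tok in ("tok_a", "tok_b", "tok_c"):
    ratio = enrichment_ratio(examples, tok)
    print(tok, ratio, classify(ratio))
```

A ratio like this is still a co-occurrence statistic, so an "enhancer" here is a hypothesis to test, not a confirmed driver.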

Programmatic

from splatnlp.mechinterp.labeling import FeatureOverview, compute_overview
from splatnlp.mechinterp.skill_helpers import load_context

# Load context
ctx = load_context("ultra")

# Compute overview
overview = compute_overview(
    feature_id=18712,
    ctx=ctx,
    top_k_tokens=15,
    n_sample_contexts=5,
)

# Display markdown
print(overview.to_markdown())

# Access fields directly
print(f"Mean: {overview.activation_mean}")
print(f"Top token: {overview.top_tokens[0]}")
print(f"Main family: {max(overview.family_breakdown.items(), key=lambda x: x[1])}")

Sample Output

## Feature 18712 Overview (ultra)

### Activation Stats
- Mean: 0.5056
- Std: 0.5163
- Median: 0.3835
- Sparsity: 97.1%
- Examples: 108,163

### Top Tokens (PageRank)
1. `special_charge_up` (0.274)
2. `swim_speed_up` (0.099)
3. `ink_saver_sub` (0.084)
4. `stealth_jump` (0.049)
5. `run_speed_up` (0.048)

### Family Breakdown
- special_charge_up: 31.2%
- swim_speed_up: 11.2%
- ink_saver_sub: 9.6%

### Top Weapons
- weapon_id_5021: 28
- weapon_id_220: 28

### Bottom Tokens (Suppressors)
Tokens rarely present in high-activation examples:
1. `respawn_punisher` (high_rate_ratio=0.00) - Never in high activation
2. `special_saver` (high_rate_ratio=0.16) - 6x less common than baseline
3. `quick_respawn` (high_rate_ratio=0.47) - 2x less common than baseline

### Sample Contexts (High Activation)
1. [weapon_id_1111] special_charge_up_6, special_charge_up_57 (act=0.731)
2. [weapon_id_1111] special_charge_up_6, special_charge_up_51... (act=0.724)

FeatureOverview Dataclass

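from __future__ import annotations  # lets the types not defined here appear in annotations

from dataclasses import dataclass

# TokenInfluence and SampleContext are defined elsewhere (not shown here).

@dataclass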
class FeatureOverview:
    feature_id: int
    model_type: str

    # Activation statistics
    activation_mean: float
    activation_std: float
    activation_median: float
    sparsity: float  # Percentage (0-100)
    n_examples: int

    # PageRank-weighted top tokens
    top_tokens: list[tuple[str, float]]

    # Bottom tokens (suppressors) - tokens excluded from high activation
    bottom_tokens: list[tuple[str, float]]  # (token, high_rate_ratio)

    # Detailed token influence statistics
    token_influences: list[TokenInfluence]

    # Aggregated by family
    family_breakdown: dict[str, float]

    # Weapon breakdown
    top_weapons: list[tuple[str, int]]

    # Sample high-activation contexts
    sample_contexts: list[SampleContext]

    # Diagnostic flags
    relu_floor_rate: float
    existing_label: str | None

Performance

Typical runtime: 30-60 seconds (dominated by PageRank computation)
Loads activation data lazily from efficient database
Caches context between calls in the same session

Interpretation Tips

High sparsity (>90%): Most inputs don't activate this feature. Look at what's special about the ones that do.
ReLU floor warning: If >50% of examples hit the ReLU floor, the feature may be hard to interpret or require special handling.
Single dominant family: If one family has >50% of the breakdown, the feature may respond to that ability family. Treat this as a hypothesis to test, not a conclusion.
Multiple families: If breakdown is spread across families, look for interactions or common contexts.
Weapon concentration: If a few weapons dominate, the feature may be weapon-specific rather than ability-specific.

⚠️ CRITICAL: Super-Stimuli Detection

Don't only examine high activations - they may be "super-stimuli"!

High activation examples can be exaggerated, "flanderized" versions of the true concept. The core region (25-75% of effective max) often reveals the actual feature meaning better than the flanderization zone (90%+ of effective max).

Why "effective max"? Activation distributions are heavy-tailed. Use

effective_max = 99.5th percentile of nonzero activations

to prevent single outliers from making your core region nearly empty.
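As a rough illustration of why the percentile is more robust than the raw maximum, the following standard-library sketch computes the 99.5th percentile of nonzero activations. The helper names and the linear-interpolation percentile are assumptions; the project may compute this differently (for example with NumPy).

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted ranks."""
    if not values:
        raise ValueError("need at least one value")
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def effective_max(activations, q=99.5):
    """99.5th percentile of the nonzero activations."""
    nonzero = [a for a in activations if a > 0]
    return percentile(nonzero, q)


# Toy data: many zeros, 999 ordinary activations, and one extreme outlier.
acts = [0.0] * 50 + [0.5] * 999 + [10.0]
print("raw max:      ", max(acts))
print("effective max:", effective_max(acts))
```

In this toy case the raw maximum is set entirely by one outlier, so a core region defined as 25-75% of it would contain none of the ordinary examples; the effective max stays at the level of the bulk of the data.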

Warning Signs of Super-Stimuli

Pattern	What It Means
90%+ activations only on 3-4 niche weapons	Flanderization zone = super-stimuli
Core region (25-75%) has diverse mainstream weapons	TRUE concept is in core region
One weapon spans ALL activation levels continuously	Feature is general, not weapon-specific

Activation Region Bins

Use these standard bins (as % of effective max = 99.5th percentile) to analyze feature behavior:

Region	Range (% of effective max)	Typical Interpretation
Floor	≤1%	Feature not activated
Low	1-10%	Weak signal, early detection
Below Core	10-25%	Emerging pattern
Core	25-75%	TRUE CONCEPT (examine carefully!)
High	75-90%	Strong expression
Flanderization Zone	90%+	Potential super-stimuli
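The bins in the table can be applied with a small lookup. This is a sketch under stated assumptions: the function names are hypothetical, and boundary handling (≤1% is floor, every other bin includes its lower edge, so exactly 90% counts as flanderization) is one reading of the table, not confirmed behavior of the `--regions` flag.

```python
from collections import Counter

# (exclusive upper bound as a fraction of effective max, region name)
UPPER_BOUNDS = [
    (0.10, "low"),
    (0.25, "below_core"),
    (0.75, "core"),
    (0.90, "high"),
]


def region_of(activation, effective_max):
    """Assign one activation to a region from the table above."""
    if effective_max <= 0:
        raise ValueError("effective_max must be positive")
    frac = activation / effective_max
    if frac <= 0.01:
        return "floor"
    for upper, name in UPPER_BOUNDS:
        if frac < upper:
            return name
    return "flanderization"


def region_counts(activations, effective_max):
    """Count how many examples fall into each region."""
    return Counter(region_of(a, effective_max) for a in activations)


# Toy activations; values above the effective max land in the flanderization zone.
toy = [0.0, 0.05, 0.2, 0.3, 0.5, 0.7, 0.8, 0.95, 1.4]
print(region_counts(toy, effective_max=1.0))
```

Running the same count per weapon shows whether a weapon spans every region or appears only in the tail.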

Example: Feature 9971

Initial analysis (looking only at 90%+ activations):

Top weapons: Bloblobber, Glooga Deco, Range Blaster, Octobrush
Conclusion: "SCU stacker on special-dependent weapons"

After region analysis (examining core 25-75%):

Core region: Splattershot (115), Wellstring (65), Sploosh (57)
Splattershot appears in EVERY region (29→125→83→115→61→19)
True concept: "General offensive investment (death-averse)"
Flanderization zone (90%+): "Super-stimuli" version on niche special-dependent weapons

Key insight: Label the core-region concept, not the flanderized extreme!

Coverage Threshold Rule

When overview shows a dominant token or weapon, CHECK CORE-REGION COVERAGE before treating it as the concept.

A token can have high enrichment in the tail but be a tail marker, not the true concept.

Metric	Interpretation
>50% core coverage	Primary concept - safe to use in label
30-50% core coverage	Significant but not universal - note in label, don't headline
<30% core coverage	Tail marker / super-stimulus - NOT the concept

Example (Feature 13934):

Overview showed: respawn_punisher with 8.57x tail enrichment
BUT: RP only present in 12% of core-region examples

⚠️ Flag in overview: "respawn_punisher: high enrichment (8.57x) but <30% core coverage - may be tail marker, not core concept"

When to flag: If any token in top-10 has enrichment >3x but core coverage <30%, add a warning note.
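The flagging rule can be written as a short check. The sketch below assumes you already have, for each top token, its tail enrichment and its core-region coverage as a fraction; the function names and the input tuple layout are hypothetical. The `respawn_punisher` numbers come from the Feature 13934 example above, and the second row is made-up toy data.

```python
def coverage_class(core_coverage):
    """Interpret core-region coverage (0-1) using the table above."""
    if core_coverage > 0.50:
        return "primary concept"
    if core_coverage >= 0.30:
        return "significant but not universal"
    return "tail marker"


def tail_marker_warnings(top_tokens, min_enrichment=3.0, max_coverage=0.30):
    """Return warning notes for top-10 tokens with high enrichment but low core coverage.

    `top_tokens` is a ranked list of (token, enrichment, core_coverage) tuples.
    """
    warnings = []
    for token, enrichment, coverage in top_tokens[:10]:
        if enrichment > min_enrichment and coverage < max_coverage:
            warnings.append(
                f"{token}: high enrichment ({enrichment:.2f}x) but <30% core coverage"
                " - may be tail marker, not core concept"
            )
    return warnings


tokens = [
    ("respawn_punisher", 8.57, 0.12),   # from the Feature 13934 example
    ("toy_token", 4.00, 0.80),          # illustrative only
]
for note in tail_marker_warnings(tokens):
    print(note)
```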

Weapon Outlier Detection: If a single weapon has >2x the examples of the second weapon, this is a weapon-dominated feature:
- Use splatoon3-meta skill to look up the weapon's kit (sub + special)
- Check if other high-activation weapons share the same sub OR special
- If they share kit components, the feature may encode kit behavior, not weapon behavior
- Run kit_sweep experiment to analyze activation by sub/special
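The >2x rule above is easy to apply to the `top_weapons` list from the overview. This is a minimal sketch: `is_weapon_dominated` is a hypothetical helper, and treating a feature with only one weapon as dominated is an assumption.

```python
def is_weapon_dominated(top_weapons, ratio=2.0):
    """True if the top weapon has more than `ratio` times the examples of the runner-up.

    `top_weapons` is a list of (weapon, example_count) tuples in any order.
    """
    counts = sorted((n for _, n in top_weapons), reverse=True)
    if not counts:
        return False
    if len(counts) == 1:
        return True  # assumption: a single weapon counts as dominated
    return counts[0] > ratio * counts[1]


# Counts from the sample output above: a tie, so not weapon-dominated.
print(is_weapon_dominated([("weapon_id_5021", 28), ("weapon_id_220", 28)]))

# Toy counts where one weapon clearly dominates.
print(is_weapon_dominated([("weapon_a", 90), ("weapon_b", 30)]))
```

A positive result is only a prompt to run the kit checks listed above, since shared subs or specials can explain the concentration.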
Check suppressors: Always examine bottom tokens! If death-mitigation abilities (QR, SS, CB) are suppressed, the feature encodes "death-averse" builds. See mechinterp-ability-semantics for semantic groupings.
Enhancers + Suppressors together: The combination tells the full story. A feature with SCU enhanced AND death-perks suppressed isn't just "SCU detector" - it's "death-averse special builds".
"Weak activation" ≠ "unimportant feature": If all scaling effects are weak (max_delta < 0.03), don't immediately label as "weak feature". Check the feature's decoder weights to output tokens. Net influence = activation × decoder weight. A feature with low activation effects but high decoder weights may still strongly influence predictions.

⚠️ WARNING: Correlation ≠ Causation

PageRank scores show correlation, NOT causation. Tokens appearing in the overview may be:

True drivers: Actually cause activation changes
Spurious correlations: Just happen to co-occur with the true driver

How to Distinguish

Run a 1D sweep for the top token (the likely primary driver).
If confirmed, run 2D heatmaps for the other tokens:
- A `PRIMARY × SECONDARY` heatmap reveals whether the secondary has a conditional effect
- If secondary shows effect only at high primary → true interaction
- If secondary shows NO effect at any primary level → spurious
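The decision rule above can be sketched as a function over a 2D grid of mean activations. Everything here is illustrative: `classify_secondary` is a hypothetical helper, the 0.03 threshold is an assumed default, the grids are toy numbers rather than measured results, and comparing only the lowest and highest secondary level would miss non-monotonic effects.

```python
def classify_secondary(grid, eps=0.03):
    """Classify a secondary token from a PRIMARY x SECONDARY activation grid.

    grid[i][j] is the mean activation at primary level i and secondary level j,
    with both axes ordered from low to high. Use at least two primary levels.
    """
    # Signed effect of the secondary token at each primary level
    # (negative values mean the secondary suppresses activation).
    deltas = [row[-1] - row[0] for row in grid]
    if all(abs(d) < eps for d in deltas):
        return "spurious"
    half = len(deltas) // 2
    low_levels, high_levels = deltas[:half], deltas[half:]
    if all(abs(d) < eps for d in low_levels) and any(abs(d) >= eps for d in high_levels):
        return "conditional interaction"
    return "effect not limited to high primary"


# Toy grids: rows are primary levels (low, mid, high), columns are secondary (absent, heavy).
suppressing = [[0.03, 0.03], [0.30, 0.10], [0.58, 0.06]]
flat = [[0.03, 0.03], [0.30, 0.31], [0.58, 0.57]]
everywhere = [[0.03, 0.20], [0.30, 0.50], [0.58, 0.80]]

for name, grid in [("suppressing", suppressing), ("flat", flat), ("everywhere", everywhere)]:
    print(name, "->", classify_secondary(grid))
```

The first grid shows why a 1D sweep can miss an effect: with the primary at its lowest level the secondary changes nothing, and the suppression only appears once the primary is raised.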

Example: Feature 18712

Overview showed: SCU (24%), Opening Gambit (17%), SSU (12%)

1D sweeps:
- SCU: strong effect (0.03→0.58) ✅ PRIMARY
- OG: delta ≈ 0 → appears to have no effect
- SSU: delta ≈ 0 → appears to have no effect

BUT WAIT! 1D sweeps for secondary abilities can be MISLEADING: a delta ≈ 0 in a 1D sweep does not rule out a conditional effect.

2D heatmaps (SCU × OG, SCU × SSU):
- Both show NO conditional effect at any SCU level
- Conclusion: OG and SSU were SPURIOUS correlations

2D heatmaps (SCU × QR, SCU × SS):
- QR_12+ SUPPRESSES activation by 70-99% at high SCU!
- SS_12+ SUPPRESSES activation by 40-60%!
- Conclusion: Feature is DEATH-AVERSE (not visible in 1D)

Always verify top overview tokens with conditional 2D testing!

See mechinterp-investigator for the full Iterative Conditional Testing Protocol.