Claude-skill-registry-data mechinterp-decoder

Analyze SAE decoder weights: output influence, feature importance, and decoder similarity.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mechinterp-decoder" ~/.claude/skills/majiayu000-claude-skill-registry-data-mechinterp-decoder && rm -rf "$T"

manifest: data/mechinterp-decoder/SKILL.md

source content

MechInterp Decoder

Analyze SAE features through their decoder weights. This skill answers: "What does this feature RECOMMEND?" rather than "What activates this feature?"

Purpose

Decoder analysis provides a complementary perspective to activation analysis:

| Analysis Type | Question Answered |
|---------------|-------------------|
| Activation (overview, sweeps) | "What inputs activate this feature?" |
| Decoder (this skill) | "What outputs does this feature promote?" |

For diffuse or heterogeneous features where activation analysis shows multiple modes, decoder analysis often reveals the unifying concept.

When to Use

Use this skill when:

Activation analysis is inconclusive: multiple modes or no clear pattern
The feature appears heterogeneous: different builds activate it for different reasons
You want to know "what does it recommend": shift the focus from inputs to outputs
You are checking AP level preferences: does the feature prefer low-AP (_3, _6) or high-AP (_57) tokens?
You are looking for similar features: cluster features by decoder similarity

Commands

Output Influence

Show what tokens a feature promotes (positive contribution) or suppresses (negative contribution):

cd /root/dev/SplatNLP

# Basic output influence
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra

# JSON output
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --format json

# More tokens
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --top-k 25

Sample Output:

## Feature 13934 Output Influence (ultra)

### Tokens This Feature PROMOTES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| respawn_punisher | +0.232 | respawn_punisher | binary |
| comeback | +0.159 | comeback | binary |
| quick_super_jump_6 | +0.155 | quick_super_jump | 6 |
| intensify_action_3 | +0.140 | intensify_action | 3 |
| ink_saver_main_6 | +0.128 | ink_saver_main | 6 |

### Tokens This Feature SUPPRESSES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| run_speed_up_57 | -0.301 | run_speed_up | 57 |
| quick_respawn_57 | -0.247 | quick_respawn | 57 |
| swim_speed_up_57 | -0.209 | swim_speed_up | 57 |

### Interpretation
- **Top promoted**: respawn_punisher (+0.232)
- **Top suppressed**: run_speed_up_57 (-0.301)
- **Pattern**: Promotes low-AP tokens, suppresses high-AP stacking
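The CLI's internals are not shown in this document. As a rough sketch of how a per-token contribution could be computed, assume each feature has a decoder vector in the model's hidden space and the model has a linear output head that maps hidden states to token logits. Under that assumption, the contribution is the dot product of the decoder vector with each token's output row. The function name `output_influence`, its arguments, and the toy numbers below are hypothetical and are not taken from SplatNLP.

```python
def output_influence(decoder_row, output_head, vocab, top_k=3):
    """Rank tokens by the logit shift one unit of the feature would cause.

    decoder_row: the feature's decoder vector (length d).
    output_head: one row of length d per token (assumed linear head).
    vocab: token names aligned with the rows of output_head.
    """
    contributions = [
        (token, sum(w * d for w, d in zip(row, decoder_row)))
        for token, row in zip(vocab, output_head)
    ]
    ranked = sorted(contributions, key=lambda tc: tc[1], reverse=True)
    # Positive contributions are promoted, negative ones suppressed.
    promoted = [tc for tc in ranked if tc[1] > 0][:top_k]
    suppressed = [tc for tc in reversed(ranked) if tc[1] < 0][:top_k]
    return promoted, suppressed


# Toy example with made-up numbers (not real model weights).
vocab = ["comeback", "run_speed_up_57", "ink_saver_main_6"]
output_head = [[1.0, 0.0], [-1.0, 0.5], [0.5, 0.5]]
decoder_row = [0.2, -0.1]
promoted, suppressed = output_influence(decoder_row, output_head, vocab)
```

In this toy case `promoted` lists `comeback` and then `ink_saver_main_6`, and `suppressed` contains only `run_speed_up_57`, mirroring the two tables in the sample output.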

Weight Percentile

Check how important a feature is by its decoder weight magnitude:

poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
    --feature-id 13934 \
    --model ultra

Sample Output:

## Feature 13934 Decoder Weight (ultra)

- **Magnitude**: 2.3456
- **Percentile**: 78.5%
- **Total features**: 24576

Interpretation:

High percentile (>90%): the feature has strong overall output influence
Low percentile (<10%): the feature has weak overall output influence
Note: a low-magnitude feature may still matter for specific tokens, so check its output influence before dismissing it
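The document does not say how "Magnitude" and "Percentile" are defined. A minimal sketch, assuming magnitude is the L2 norm of the feature's decoder vector and percentile is the share of features whose norm is less than or equal to it, could look like the following. The function `weight_percentile` and the toy decoder are hypothetical.

```python
import math


def weight_percentile(decoder, feature_id):
    """Return (magnitude, percentile) for one feature's decoder vector.

    decoder: list of decoder vectors, one per feature.
    Assumption: magnitude is the L2 norm, and percentile is the
    percentage of features whose norm is <= this feature's norm.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in decoder]
    magnitude = norms[feature_id]
    rank = sum(1 for n in norms if n <= magnitude)
    return magnitude, 100.0 * rank / len(norms)


# Toy decoder with four features (made-up values); norms are 5, 1, 2, 10.
decoder = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
magnitude, percentile = weight_percentile(decoder, 0)
```

Here feature 0 has norm 5.0 and sits at the 75th percentile, because three of the four norms are at or below it.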

Similar Features (by Decoder)

Find features with similar decoder patterns (what they recommend):

poetry run python -m splatnlp.mechinterp.cli.decoder_cli similar \
    --feature-id 13934 \
    --model ultra \
    --top-k 10

Sample Output:

## Features Similar to 13934 (ultra)

| Feature ID | Cosine Similarity |
|------------|-------------------|
| 13892 | 0.9234 |
| 14501 | 0.8876 |
| 12044 | 0.8521 |
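The "Cosine Similarity" column suggests that features are compared by the angle between their decoder vectors. The sketch below shows that comparison in plain Python. The helper names `cosine` and `most_similar` and the toy decoder are hypothetical, and the real command may differ, for example by working in token-logit space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(decoder, feature_id, top_k=10):
    """Return the top_k (feature_id, similarity) pairs, excluding the query."""
    target = decoder[feature_id]
    scores = [
        (other_id, cosine(target, row))
        for other_id, row in enumerate(decoder)
        if other_id != feature_id
    ]
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores[:top_k]


# Toy decoder (made-up values): feature 1 is parallel to feature 0,
# feature 2 is orthogonal, and feature 3 points the opposite way.
decoder = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
neighbors = most_similar(decoder, 0, top_k=2)
```

Because cosine similarity ignores vector length, feature 1 ranks first even though its norm is twice that of feature 0. This is why the similarity command complements the magnitude-based weight percentile.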

Experiment Runner

For programmatic use or integration with runner_cli:

# Create spec file
cat > decoder_spec.json << 'EOF'
{
  "type": "decoder_output_analysis",
  "feature_id": 13934,
  "model_type": "ultra",
  "variables": {
    "top_k_promoted": 15,
    "top_k_suppressed": 15,
    "group_by_family": true,
    "include_ap_level": true
  }
}
EOF

# Run via runner CLI
poetry run python -m splatnlp.mechinterp.cli.runner_cli \
    --spec-path decoder_spec.json
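When several features need the same analysis, the spec file can be generated from Python instead of a heredoc. The sketch below writes a spec with the same fields as the example above. The helper `write_decoder_spec` and the output file name are hypothetical, and the spec format is known here only from that one example.

```python
import json
import tempfile
from pathlib import Path


def write_decoder_spec(path, feature_id, model_type="ultra", top_k=15):
    """Write a decoder_output_analysis spec mirroring the example above."""
    spec = {
        "type": "decoder_output_analysis",
        "feature_id": feature_id,
        "model_type": model_type,
        "variables": {
            "top_k_promoted": top_k,
            "top_k_suppressed": top_k,
            "group_by_family": True,
            "include_ap_level": True,
        },
    }
    Path(path).write_text(json.dumps(spec, indent=2))
    return spec


# Write the spec into a scratch directory; pass the resulting path to
# runner_cli via --spec-path as shown above.
spec_dir = Path(tempfile.mkdtemp())
spec_path = spec_dir / "decoder_spec_13934.json"
spec = write_decoder_spec(spec_path, 13934)
```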

Interpretation Guide

AP Level Patterns

| Pattern | Meaning |
|---------|---------|
| Promotes _3, _6; Suppresses _51, _57 | "Use balanced spread, not stacking" |
| Promotes _57; Suppresses low AP | "Heavy stacking is the goal" |
| Promotes binary (RP, CB, OG) | "These specific abilities are key" |
| Mixed AP levels promoted | "Ability presence matters, not amount" |
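These patterns can be checked quickly by splitting token names into a family and an AP level. The sketch below assumes the naming convention visible in the sample output: a trailing `_<number>` is the AP level, and tokens without one are binary abilities. The helpers `split_token` and `mean_ap` are hypothetical, and comparing mean AP levels is only an illustrative heuristic, not the skill's own logic.

```python
def split_token(token):
    """Split 'quick_super_jump_6' into ('quick_super_jump', 6).

    Tokens without a numeric suffix (binary abilities) return (token, None).
    """
    family, _, suffix = token.rpartition("_")
    if family and suffix.isdigit():
        return family, int(suffix)
    return token, None


def mean_ap(tokens):
    """Average AP level over the tokens that have one, or None if none do."""
    levels = [ap for _, ap in map(split_token, tokens) if ap is not None]
    return sum(levels) / len(levels) if levels else None


# Token lists copied from the Feature 13934 sample output above.
promoted = [
    "respawn_punisher", "comeback", "quick_super_jump_6",
    "intensify_action_3", "ink_saver_main_6",
]
suppressed = ["run_speed_up_57", "quick_respawn_57", "swim_speed_up_57"]

promoted_ap = mean_ap(promoted)      # (6 + 3 + 6) / 3
suppressed_ap = mean_ap(suppressed)  # (57 + 57 + 57) / 3
```

For Feature 13934 the promoted tokens average 5 AP and the suppressed tokens average 57 AP, which matches the first row of the table ("Use balanced spread, not stacking").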

Common Feature Types

| Output Pattern | Feature Type |
|----------------|--------------|
| Single family promoted | Family detector (e.g., SCU detector) |
| Low-AP promoted, high-AP suppressed | "Balanced utility recommendation" |
| Binary abilities promoted | "Build style marker" (aggressive, defensive) |
| Death perks promoted (QR, SS, CB) | "Death-tolerant" archetype |
| Death perks suppressed | "Death-averse" archetype |

Integration with Investigation Workflow

Decoder analysis fits into the investigation workflow as follows:

1. Overview (mechinterp-overview)
   ↓
2. Hypothesis formation
   ↓
3. 1D Sweeps (mechinterp-runner)
   ↓
4. Core Coverage Check ← NEW: Catch tail markers
   ↓
5. If diffuse/heterogeneous:
   → Decoder Output Analysis ← THIS SKILL
   ↓
6. Label formulation

Example: Feature 13934 (from investigation log)

Problem: Activation analysis showed two opposite modes (RP anchor vs Zombie builds).

Solution: Decoder analysis revealed a unifying pattern:

PROMOTES: low-AP utility (_3, _6 tokens)
SUPPRESSES: heavy stacking (_51, _57 tokens)

→ Feature recommends "balanced utility spread" regardless of death strategy

Key Insight: Different builds (RP vs Zombie) activate the feature because they share a NEED (balanced utility), not a BUILD pattern.
