Claude-awesome-stack ml-debug

Debug ML issues systematically -- tensor shapes, NaN propagation, gradient flow, device mismatches, dtype errors. Use when encountering training failures, unexpected outputs, or numerical issues.

install
source · Clone the upstream repo
git clone https://github.com/giacomogaglione/claude-awesome-stack
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/giacomogaglione/claude-awesome-stack "$T" && mkdir -p ~/.claude/skills && cp -r "$T/stacks/python-ml/skills/ml-debug" ~/.claude/skills/giacomogaglione-claude-awesome-stack-ml-debug && rm -rf "$T"
manifest: stacks/python-ml/skills/ml-debug/SKILL.md
source content

ML Debugging Skill

When debugging an ML issue, work through these checks in order. Stop at the first check that reveals the problem.

1. Shape Verification

Trace tensor shapes through the computation:

  • Print shapes at every function boundary:
    print(f"input: {x.shape}, output: {y.shape}")
  • Verify batch dimension is preserved through all operations
  • Check that reshape/view operations don't silently permute data
  • Verify attention mask shapes match query/key dimensions
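A minimal sketch of the shape tracing above, using forward hooks (assuming PyTorch; the `trace_shapes` helper name is illustrative, not part of any library):

```python
import torch
import torch.nn as nn

def trace_shapes(model):
    """Register hooks that print input/output shapes at every leaf module."""
    def hook(module, inputs, output):
        in_shape = tuple(inputs[0].shape) if inputs else None
        out_shape = tuple(output.shape) if torch.is_tensor(output) else None
        print(f"{module.__class__.__name__}: in={in_shape}, out={out_shape}")
    # Hook only leaf modules so each op is reported once
    return [m.register_forward_hook(hook)
            for m in model.modules() if not list(m.children())]

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
handles = trace_shapes(model)

x = torch.randn(3, 8)            # batch of 3
y = model(x)
assert y.shape[0] == x.shape[0]  # batch dimension preserved

for h in handles:                # clean up hooks when done debugging
    h.remove()
```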

2. Dtype and Device Checks

Look for silent type/device mismatches:

  • Print tensor.dtype and tensor.device at suspicious points
  • Check for float32/float64 mixing (common in loss computation)
  • Verify all tensors in an operation are on the same device
  • Check for integer overflow in index operations
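A short sketch of the dtype check above (the `report` helper is illustrative):

```python
import torch

def report(name, t):
    # Illustrative helper: make dtype/device visible at suspicious points
    print(f"{name}: dtype={t.dtype}, device={t.device}")

a = torch.randn(4)                       # float32 by default
b = torch.randn(4, dtype=torch.float64)
report("a", a)
report("b", b)

# float32 + float64 silently promotes to float64 -- a common surprise
# when a loss is accidentally computed in double precision
c = a + b
assert c.dtype == torch.float64
```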

3. NaN/Inf Propagation

Trace where NaN or Inf first appears:

  • Insert assert not torch.isnan(x).any(), f"NaN at {name}" after each operation
  • Enable anomaly detection:
    torch.autograd.set_detect_anomaly(True)
  • Check for division by zero in normalization layers
  • Check for log(0) or log(negative) in loss functions
  • Check for extremely large values before softmax
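The assert-after-each-operation idea can be wrapped in a small helper (a sketch; `check_finite` is an illustrative name, and anomaly detection is left enabled here only because this is debugging code):

```python
import torch

def check_finite(name, x):
    # Fail fast at the first operation that produces NaN or Inf
    assert not torch.isnan(x).any(), f"NaN at {name}"
    assert not torch.isinf(x).any(), f"Inf at {name}"
    return x

torch.autograd.set_detect_anomaly(True)  # backward() will name the offending op

x = torch.tensor([1.0, 2.0, 0.0])
try:
    check_finite("raw log", torch.log(x))   # log(0) -> -inf, caught here
except AssertionError as e:
    print(e)

# Safe variant: clamp before the log
y = check_finite("clamped log", torch.log(x.clamp_min(1e-8)))
```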

4. Gradient Flow

Diagnose vanishing or exploding gradients:

  • Print gradient norms -- torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) clips in place and returns the pre-clip total norm, so it doubles as a probe
  • Check if any parameter has requires_grad=False unintentionally
  • Check for dead ReLU units (all-zero activations)
  • Verify loss is connected to all trainable parameters
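A sketch of the gradient-norm inspection described above, assuming PyTorch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# Per-parameter gradient norms: near-zero suggests vanishing gradients,
# huge values suggest exploding gradients, None means the parameter is
# disconnected from the loss (or has requires_grad=False)
for name, p in model.named_parameters():
    if p.grad is None:
        print(f"{name}: NO GRADIENT")
    else:
        print(f"{name}: grad norm = {p.grad.norm().item():.4g}")

# clip_grad_norm_ returns the total norm before clipping
total = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"total grad norm: {total.item():.4g}")
```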

5. Data Pipeline

Check if the issue is in the data, not the model:

  • Verify labels match inputs (visualize a few samples)
  • Check for data leakage between train/val/test
  • Verify normalization statistics (mean, std) are computed on training set only
  • Check class imbalance
  • Verify data augmentation isn't corrupting labels
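The leakage and normalization checks above can be sketched with plain NumPy (the split sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(100)
train_idx, val_idx = idx[:80], idx[80:]

# Leakage check: train and validation indices must not overlap
assert not set(train_idx) & set(val_idx), "train/val leakage"

X = rng.normal(size=(100, 3))

# Normalization statistics come from the TRAINING split only
mean, std = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
X_train = (X[train_idx] - mean) / std
X_val = (X[val_idx] - mean) / std   # reuse train stats -- never refit on val
```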

6. Common Library Gotchas

PyTorch

  • model.eval() vs model.train() -- affects dropout and batchnorm
  • torch.no_grad() must wrap inference code
  • loss.backward() accumulates gradients -- call optimizer.zero_grad() first
  • DataLoader with num_workers > 0 can hide errors in worker processes
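The zero_grad/backward and eval/no_grad patterns above, in one minimal training-then-inference sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

model.train()
opt.zero_grad()                  # clear gradients accumulated by prior backward()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()

model.eval()                     # fixes dropout/batchnorm to inference behavior
with torch.no_grad():            # don't build an autograd graph for inference
    preds = model(x)
assert not preds.requires_grad
```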

pandas

  • Index alignment: df1 + df2 aligns on index, not position
  • SettingWithCopyWarning means you're modifying a view, not a copy
  • groupby().apply() can call the function twice on the first group
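The alignment and view-vs-copy gotchas above, in a small sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on index labels, not position; unmatched labels become NaN
out = s1 + s2   # a: NaN, b: 12.0, c: 23.0, d: NaN
print(out)

# Avoid SettingWithCopyWarning: take an explicit copy before modifying
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
sub = df[df["x"] > 1].copy()   # explicit copy -> safe to assign into
sub["y"] = 0                   # original df is untouched
```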

numpy

  • Broadcasting: (3,1) * (1,4) gives (3,4) -- verify this is intended
  • np.array copies by default, np.asarray does not
  • Integer arrays silently overflow
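The three NumPy gotchas above, in one sketch:

```python
import numpy as np

# Broadcasting: (3,1) * (1,4) silently produces (3,4), not an error
a = np.arange(3).reshape(3, 1)
b = np.arange(4).reshape(1, 4)
assert (a * b).shape == (3, 4)

# np.array copies by default; np.asarray returns a view of the same memory
base = np.zeros(3)
view = np.asarray(base)
copy = np.array(base)
view[0] = 1.0
assert base[0] == 1.0 and copy[0] == 0.0

# Integer overflow is silent for array arithmetic: int8 wraps around
x = np.array([127], dtype=np.int8)
print(x + np.int8(1))   # wraps to -128 with no error
```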