Claude-awesome-stack ml-debug
Debug ML issues systematically -- tensor shapes, NaN propagation, gradient flow, device mismatches, dtype errors. Use when encountering training failures, unexpected outputs, or numerical issues.
install
source · Clone the upstream repo
git clone https://github.com/giacomogaglione/claude-awesome-stack
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/giacomogaglione/claude-awesome-stack "$T" && mkdir -p ~/.claude/skills && cp -r "$T/stacks/python-ml/skills/ml-debug" ~/.claude/skills/giacomogaglione-claude-awesome-stack-ml-debug && rm -rf "$T"
manifest: stacks/python-ml/skills/ml-debug/SKILL.md
ML Debugging Skill
When debugging an ML issue, work through these checks in order. Stop at the first check that reveals the problem.
1. Shape Verification
Trace tensor shapes through the computation:
- Print shapes at every function boundary: `print(f"input: {x.shape}, output: {y.shape}")`
- Verify the batch dimension is preserved through all operations
- Check that reshape/view operations don't silently permute data
- Verify attention mask shapes match query/key dimensions
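The shape checks above can be sketched as a small tracing helper (`debug_shapes` is a hypothetical name, not part of the skill):

```python
import torch

def debug_shapes(name, *tensors):
    # Hypothetical helper: print shapes at a function boundary.
    print(f"{name}: " + ", ".join(str(tuple(t.shape)) for t in tensors))

x = torch.randn(8, 16)        # batch of 8, 16 features
y = x @ torch.randn(16, 4)
debug_shapes("linear", x, y)  # batch dim (8) preserved: (8, 16) -> (8, 4)

# reshape is NOT a transpose -- it reinterprets memory order
a = torch.arange(6).reshape(2, 3)
assert not torch.equal(a.reshape(3, 2), a.t())
```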
2. Dtype and Device Checks
Look for silent type/device mismatches:
- Print `tensor.dtype` and `tensor.device` at suspicious points
- Check for float32/float64 mixing (common in loss computation)
- Verify all tensors in an operation are on the same device
- Check for integer overflow in index operations
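A minimal sketch of the dtype inspection, showing the silent float32/float64 promotion this check is looking for:

```python
import torch

x = torch.randn(4, dtype=torch.float32)
y = torch.randn(4, dtype=torch.float64)
print(x.dtype, x.device)  # inspect at suspicious points

# Silent promotion: float32 + float64 -> float64, often unintended in losses
z = x + y
print(z.dtype)  # torch.float64
```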
3. NaN/Inf Propagation
Trace where NaN or Inf first appears:
- Insert `assert not torch.isnan(x).any(), f"NaN at {name}"` after each operation
- Enable anomaly detection: `torch.autograd.set_detect_anomaly(True)`
- Check for division by zero in normalization layers
- Check for log(0) or log(negative) in loss functions
- Check for extremely large values before softmax
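Putting these checks together (`check_finite` is a hypothetical helper; the epsilon value is an assumption):

```python
import torch

def check_finite(name, x):
    # Hypothetical helper: fail fast where NaN/Inf first appears.
    assert not torch.isnan(x).any(), f"NaN at {name}"
    assert not torch.isinf(x).any(), f"Inf at {name}"
    return x

torch.autograd.set_detect_anomaly(True)  # reports the op that produced NaN in backward

eps = 1e-8
x = torch.tensor([0.0, 1.0, 2.0])
normed = check_finite("normalize", x / (x.sum() + eps))  # eps guards division by zero
logp = check_finite("log", torch.log(x.clamp_min(eps)))  # clamp guards log(0)
```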
4. Gradient Flow
Diagnose vanishing or exploding gradients:
- Print gradient norms: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` clips and returns the total norm, so its return value can be logged
- Check if any parameter has `requires_grad=False` unintentionally
- Check for dead ReLU units (all-zero activations)
- Verify loss is connected to all trainable parameters
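A sketch of the gradient-flow diagnosis, assuming a toy model (the architecture and sizes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
loss = model(torch.randn(32, 10)).mean()
loss.backward()

# Per-parameter norms: ~0 suggests vanishing gradients, huge values exploding ones
for name, p in model.named_parameters():
    if p.grad is None:
        print(f"{name}: no grad -- not connected to the loss, or requires_grad=False")
    else:
        print(f"{name}: grad norm = {p.grad.norm().item():.3e}")

# clip_grad_norm_ returns the pre-clip total norm, so it doubles as a probe
total = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"total grad norm: {float(total):.3e}")
```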
5. Data Pipeline
Check if the issue is in the data, not the model:
- Verify labels match inputs (visualize a few samples)
- Check for data leakage between train/val/test
- Verify normalization statistics (mean, std) are computed on training set only
- Check class imbalance
- Verify data augmentation isn't corrupting labels
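The normalization-statistics check can be sketched as follows (the data and the 80/20 split are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))
X_train, X_val = X[:80], X[80:]

# Compute normalization stats on the training split ONLY -- using the full
# dataset would leak validation statistics into training.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_n = (X_train - mean) / std
X_val_n = (X_val - mean) / std  # reuse the train stats

# Quick class-imbalance check (labels are synthetic here)
y = rng.integers(0, 2, size=100)
print(np.bincount(y) / len(y))
```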
6. Common Library Gotchas
PyTorch
- `model.train()` vs `model.eval()` -- affects dropout and batchnorm
- `torch.no_grad()` must wrap inference code
- `loss.backward()` accumulates gradients -- call `optimizer.zero_grad()` first
- `DataLoader` with `num_workers > 0` can hide errors in worker processes
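The first three PyTorch gotchas in one sketch (the model and optimizer are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 4)

# Training step: zero_grad() before backward(), or gradients accumulate
opt.zero_grad()
loss = model(x).sum()
loss.backward()
opt.step()

# Inference: eval() disables dropout/batchnorm; no_grad() skips autograd
model.eval()
with torch.no_grad():
    a, b = model(x), model(x)
assert torch.equal(a, b)  # deterministic only because dropout is off
```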
pandas
- Index alignment: `df1 + df2` aligns on index, not position
- `SettingWithCopyWarning` means you're modifying a view, not a copy
- `groupby().apply()` can call the function twice on the first group
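The alignment and view-vs-copy gotchas, sketched with toy data:

```python
import pandas as pd

# Arithmetic aligns on index labels, not position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2
print(total)  # 'a' and 'd' become NaN; 'b' = 12, 'c' = 23

# Avoid SettingWithCopyWarning: take an explicit copy before mutating a slice
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
sub = df[df["x"] > 1].copy()
sub["y"] = 0  # mutates the copy, not df
```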
numpy
- Broadcasting: `(3,1) * (1,4)` gives `(3,4)` -- verify this is intended
- `np.array` copies by default, `np.asarray` does not
- Integer arrays silently overflow
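All three numpy gotchas in a short sketch:

```python
import numpy as np

# Broadcasting: (3,1) * (1,4) -> (3,4), an outer product, not an error
a = np.ones((3, 1))
b = np.ones((1, 4))
print((a * b).shape)  # (3, 4) -- verify this is what you intended

# np.array copies by default; np.asarray returns the same object when it can
x = np.arange(3)
assert np.asarray(x) is x
assert np.array(x) is not x

# Integer arrays overflow silently by wrapping around
i = np.array([127], dtype=np.int8)
print(i + np.int8(1))  # wraps to -128, no exception
```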