Claude-awesome-stack ml-debug
Debug ML issues systematically -- tensor shapes, NaN propagation, gradient flow, device mismatches, dtype errors. Use when encountering training failures, unexpected outputs, or numerical issues.
install
source · Clone the upstream repo
git clone https://github.com/giacomogaglione/claude-awesome-stack
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/giacomogaglione/claude-awesome-stack "$T" && mkdir -p ~/.claude/skills && cp -r "$T/stacks/python-ml/skills/ml-debug" ~/.claude/skills/giacomogaglione-claude-awesome-stack-ml-debug && rm -rf "$T"
manifest: stacks/python-ml/skills/ml-debug/SKILL.md
ML Debugging Skill
When debugging an ML issue, work through these checks in order. Stop at the first check that reveals the problem.
1. Shape Verification
Trace tensor shapes through the computation:
- Print shapes at every function boundary: `print(f"input: {x.shape}, output: {y.shape}")`
- Verify the batch dimension is preserved through all operations
- Check that reshape/view operations don't silently permute data
- Verify attention mask shapes match query/key dimensions
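The shape checks above can be sketched as a small tracing helper (`debug_shapes` is a hypothetical name, not part of the skill):

```python
import torch

def debug_shapes(name, *tensors):
    # Hypothetical helper: print shapes at a function boundary.
    print(f"{name}: " + ", ".join(str(tuple(t.shape)) for t in tensors))

x = torch.randn(8, 16)        # batch of 8, 16 features
y = x @ torch.randn(16, 4)
debug_shapes("linear", x, y)  # batch dim (8) preserved: (8, 16) -> (8, 4)

# reshape is NOT a transpose -- it reinterprets memory order
a = torch.arange(6).reshape(2, 3)
assert not torch.equal(a.reshape(3, 2), a.t())
```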
2. Dtype and Device Checks
Look for silent type/device mismatches:
- Print `tensor.dtype` and `tensor.device` at suspicious points
- Check for float32/float64 mixing (common in loss computation)
- Verify all tensors in an operation are on the same device
- Check for integer overflow in index operations
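A minimal sketch of the dtype inspection, showing the silent float32/float64 promotion this check is looking for:

```python
import torch

x = torch.randn(4, dtype=torch.float32)
y = torch.randn(4, dtype=torch.float64)
print(x.dtype, x.device)  # inspect at suspicious points

# Silent promotion: float32 + float64 -> float64, often unintended in losses
z = x + y
print(z.dtype)  # torch.float64
```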
3. NaN/Inf Propagation
Trace where NaN or Inf first appears:
- Insert `assert not torch.isnan(x).any(), f"NaN at {name}"` after each operation
- Enable anomaly detection: `torch.autograd.set_detect_anomaly(True)`
- Check for division by zero in normalization layers
- Check for log(0) or log(negative) in loss functions
- Check for extremely large values before softmax
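Putting these checks together (`check_finite` is a hypothetical helper; the epsilon value is an assumption):

```python
import torch

def check_finite(name, x):
    # Hypothetical helper: fail fast where NaN/Inf first appears.
    assert not torch.isnan(x).any(), f"NaN at {name}"
    assert not torch.isinf(x).any(), f"Inf at {name}"
    return x

torch.autograd.set_detect_anomaly(True)  # reports the op that produced NaN in backward

eps = 1e-8
x = torch.tensor([0.0, 1.0, 2.0])
normed = check_finite("normalize", x / (x.sum() + eps))  # eps guards division by zero
logp = check_finite("log", torch.log(x.clamp_min(eps)))  # clamp guards log(0)
```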
4. Gradient Flow
Diagnose vanishing or exploding gradients:
- Print gradient norms: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` clips and returns the total norm, so its return value can be logged
- Check if any parameter has `requires_grad=False` unintentionally
- Check for dead ReLU units (all-zero activations)
- Verify loss is connected to all trainable parameters
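A sketch of the gradient-flow diagnosis, assuming a toy model (the architecture and sizes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
loss = model(torch.randn(32, 10)).mean()
loss.backward()

# Per-parameter norms: ~0 suggests vanishing gradients, huge values exploding ones
for name, p in model.named_parameters():
    if p.grad is None:
        print(f"{name}: no grad -- not connected to the loss, or requires_grad=False")
    else:
        print(f"{name}: grad norm = {p.grad.norm().item():.3e}")

# clip_grad_norm_ returns the pre-clip total norm, so it doubles as a probe
total = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"total grad norm: {float(total):.3e}")
```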
5. Data Pipeline
Check if the issue is in the data, not the model:
- Verify labels match inputs (visualize a few samples)
- Check for data leakage between train/val/test
- Verify normalization statistics (mean, std) are computed on training set only
- Check class imbalance
- Verify data augmentation isn't corrupting labels
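The normalization-statistics check can be sketched as follows (the data and the 80/20 split are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))
X_train, X_val = X[:80], X[80:]

# Compute normalization stats on the training split ONLY -- using the full
# dataset would leak validation statistics into training.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_n = (X_train - mean) / std
X_val_n = (X_val - mean) / std  # reuse the train stats

# Quick class-imbalance check (labels are synthetic here)
y = rng.integers(0, 2, size=100)
print(np.bincount(y) / len(y))
```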
6. Common Library Gotchas
PyTorch
- `model.train()` vs `model.eval()` -- affects dropout and batchnorm
- `torch.no_grad()` must wrap inference code
- `loss.backward()` accumulates gradients -- call `optimizer.zero_grad()` first
- `DataLoader` with `num_workers > 0` can hide errors in worker processes
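The first three PyTorch gotchas in one sketch (the model and optimizer are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 4)

# Training step: zero_grad() before backward(), or gradients accumulate
opt.zero_grad()
loss = model(x).sum()
loss.backward()
opt.step()

# Inference: eval() disables dropout/batchnorm; no_grad() skips autograd
model.eval()
with torch.no_grad():
    a, b = model(x), model(x)
assert torch.equal(a, b)  # deterministic only because dropout is off
```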
pandas
- Index alignment: `df1 + df2` aligns on index, not position
- `SettingWithCopyWarning` means you're modifying a view, not a copy
- `groupby().apply()` can call the function twice on the first group
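The alignment and view-vs-copy gotchas, sketched with toy data:

```python
import pandas as pd

# Arithmetic aligns on index labels, not position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2
print(total)  # 'a' and 'd' become NaN; 'b' = 12, 'c' = 23

# Avoid SettingWithCopyWarning: take an explicit copy before mutating a slice
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
sub = df[df["x"] > 1].copy()
sub["y"] = 0  # mutates the copy, not df
```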
numpy
- Broadcasting: `(3,1) * (1,4)` gives `(3,4)` -- verify this is intended
- `np.array` copies by default, `np.asarray` does not
- Integer arrays silently overflow
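All three numpy gotchas in a short sketch:

```python
import numpy as np

# Broadcasting: (3,1) * (1,4) -> (3,4), an outer product, not an error
a = np.ones((3, 1))
b = np.ones((1, 4))
print((a * b).shape)  # (3, 4) -- verify this is what you intended

# np.array copies by default; np.asarray returns the same object when it can
x = np.arange(3)
assert np.asarray(x) is x
assert np.array(x) is not x

# Integer arrays overflow silently by wrapping around
i = np.array([127], dtype=np.int8)
print(i + np.int8(1))  # wraps to -128, no exception
```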