claude-skill-registry-data · ml-training-debugger

Diagnose and stabilize ML training runs, recover from failures, and deliver validated fixes.

Install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry-data
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/ml-training-debugger" ~/.claude/skills/majiayu000-claude-skill-registry-data-ml-training-debugger && rm -rf "$T"
manifest: data/ml-training-debugger/SKILL.md
Source content

STANDARD OPERATING PROCEDURE

Purpose

Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.

Triggers

  • Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
  • Negative: New model development without an incident (route to ml-expert) or lightweight prototyping (route to ml).
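The NaN/exploding-gradient trigger above can be checked mechanically rather than by eyeballing logs. A minimal framework-agnostic sketch; the `(name, flat list of values)` shape for `named_grads` is a hypothetical stand-in for whatever your framework exposes (e.g. iterating `model.named_parameters()` in PyTorch and flattening each `.grad`):

```python
import math

def scan_gradients(named_grads, max_norm=1e3):
    """Flag NaN/Inf values or exploding norms in gradients.

    named_grads: iterable of (name, flat list of gradient values).
    Returns a list of (name, reason) findings; empty means healthy.
    """
    findings = []
    for name, grad in named_grads:
        # Non-finite values dominate any norm check, so test them first.
        if any(math.isnan(g) or math.isinf(g) for g in grad):
            findings.append((name, "nan/inf"))
            continue
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > max_norm:
            findings.append((name, "norm=%.2e" % norm))
    return findings
```

Running this after every backward pass (or on sampled steps) turns "unexplained NaNs" into a named parameter and a timestamp, which is exactly the evidence the intake phase needs.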

Guardrails

  • Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
  • Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
  • Validation discipline: reproduce issue, isolate variables (data/model/optim), run minimal change tests; adversarially probe for leakage and nondeterminism.
  • Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
  • Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
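The "fix seeds; disable randomness" discipline above can be packaged as one helper. A sketch under stated assumptions: `fix_seeds` is an illustrative name, and the numpy/torch branches are optional extras that only run when those libraries are installed:

```python
import os
import random

def fix_seeds(seed: int = 0) -> None:
    """Pin every RNG we can reach so reruns are comparable."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:  # optional: only if numpy is present
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:  # optional: only if torch is present
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade speed for reproducibility on CUDA convolutions.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Call it once at process start and record the seed alongside the incident evidence; full determinism may also require framework-specific flags (e.g. dataloader worker seeding) beyond this sketch.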

Execution Phases

  1. Intake & Evidence Gathering
    • Collect logs, metrics, configs, seeds, hardware info, and recent code changes.
    • Confirm baseline vs expected behavior and incident start time.
  2. Reproduction & Isolation
    • Reproduce on smallest dataset slice; fix seeds; disable randomness.
    • Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
  3. Hypothesis & Experiment Plan
    • Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
    • Plan targeted experiments with success/fail criteria.
  4. Fix & Validation
    • Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats).
    • Validate against performance/latency targets; ensure no regression on baseline metrics.
  5. Handoff & Prevention
    • Document root cause, applied fixes, and remaining risks.
    • Add monitors/tests to prevent recurrence; package rollback instructions.
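The binary search in Phase 2 can be made concrete. A sketch assuming an ordered list of candidate changes (commits, batches, config diffs) with a single good-to-bad transition; `is_bad` is a hypothetical probe that reruns the minimal training slice with a prefix of changes applied:

```python
def bisect_first_bad(items, is_bad):
    """Find the first item whose inclusion makes training misbehave.

    items: ordered candidate changes, oldest first.
    is_bad(prefix): True if training misbehaves with this prefix applied.
    Assumes everything before the culprit is good and everything after is bad.
    """
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[: mid + 1]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid + 1  # culprit is after mid
    return items[lo]
```

Each probe is one cheap rerun on the fixed-seed slice, so isolating a culprit among N changes costs about log2(N) runs instead of N.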

Output Format

  • Incident summary and constraints.
  • Reproduction steps, hypotheses, and experiments run.
  • Fixes applied with before/after metrics.
  • Prevention plan and owners.
  • Confidence statement with ceiling.

Validation Checklist

  • Issue reproduced with fixed seeds and minimal data slice.
  • Hypotheses tested; experiments documented.
  • Metrics reported per split; gradients/loss inspected where relevant.
  • Regression checks executed; rollback path documented.
  • Confidence ceiling stated.
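The regression check in the list above can be automated against recorded baselines. A sketch assuming higher-is-better metrics (invert the comparison for loss-style metrics); the metric names and function name are illustrative:

```python
def regression_check(baseline, candidate, tolerances):
    """Return metrics where the candidate regressed past tolerance.

    baseline, candidate: dicts of metric name -> value.
    tolerances: dict of metric name -> allowed drop (higher is better).
    Empty result means the fix introduced no regression on tracked metrics.
    """
    failures = {}
    for metric, tol in tolerances.items():
        delta = candidate[metric] - baseline[metric]
        if delta < -tol:
            failures[metric] = delta
    return failures
```

Wiring this into CI with the pre-incident baseline gives the "regression checks executed" item a machine-checkable definition.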

VCL COMPLIANCE APPENDIX (Internal)

[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]

[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI<gozlem>]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]

Confidence: 0.74 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.