Claude-skill-registry atft-training
Run and monitor ATFT-GAT-FAN training loops, hyper-parameter sweeps, and safety modes on A100 GPUs.
Install
Source · clone the upstream repo:
`git clone https://github.com/majiayu000/claude-skill-registry`
Claude Code · install into `~/.claude/skills/`:
`T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/atft-training" ~/.claude/skills/majiayu000-claude-skill-registry-atft-training && rm -rf "$T"`
Manifest: `skills/data/atft-training/SKILL.md` (source content below)
ATFT Training Skill
Mission
- Launch production-grade training for the Graph Attention Network forecaster with correct dataset/version parity.
- Tune hyper-parameters (LR, batch size, horizons, latent dims) exploiting 80GB GPU headroom.
- Safely resume, stop, or monitor long-running jobs and record experiment metadata.
Engagement Triggers
- Requests to “train”, “fine-tune”, “HP optimize”, “resume training”, or “monitor training logs”.
- Need to validate new dataset compatibility with model code.
- Investigations into training stalls, divergence, or GPU under-utilization.
Preflight Safety Checks
- Dataset freshness: `ls -lh output/ml_dataset_latest_full.parquet` then `python scripts/utils/dataset_guard.py --assert-recency 72` (see the sketch after this list).
- Environment health: `tools/project-health-check.sh --section training`.
- GPU allocation: `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv` (target >60% util, <76GB used baseline).
- Git hygiene: `git status --short`; ensure the working tree state is understood (avoid accidental overrides during long runs).
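A minimal sketch of what the recency guard can do, assuming `scripts/utils/dataset_guard.py` simply compares the dataset file's mtime against the given threshold (the real script's checks may go further):

```python
# Hypothetical recency guard; the real scripts/utils/dataset_guard.py may differ.
import argparse
import sys
import time
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--assert-recency", type=float, required=True, help="max age, hours")
parser.add_argument("--path", default="output/ml_dataset_latest_full.parquet")
args = parser.parse_args()

dataset = Path(args.path)
age_hours = (time.time() - dataset.stat().st_mtime) / 3600
if age_hours > args.assert_recency:
    sys.exit(f"{dataset} is {age_hours:.1f}h old (limit {args.assert_recency:.0f}h); regenerate it")
print(f"OK: {dataset} is {age_hours:.1f}h old")
```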
Training Playbooks
1. Production Optimized Training (default 120 epochs)
- `make train-optimized DATASET=output/ml_dataset_latest_full.parquet` — compiles with TorchInductor + FlashAttention2.
- `make train-monitor` — tails `_logs/training/train-optimized.log`.
- `make train-status` — polls the background process; ensure ETA < 7h.
- Post-run validation: `python scripts/eval/aggregate_metrics.py runs/latest` — computes Sharpe, RankIC, and hit ratios (see the sketch below).
- Update `results/latest_training_summary.md`.
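A hedged sketch of the post-run aggregation, assuming `runs/latest/metrics.jsonl` holds one JSON object per line; the field names (`daily_return`, `pred`, `actual`) are placeholders for whatever schema `scripts/eval/aggregate_metrics.py` actually consumes:

```python
# Assumed schema: one JSON object per line with daily_return / pred / actual fields.
import json
from pathlib import Path

import numpy as np
from scipy.stats import spearmanr

records = [json.loads(l) for l in Path("runs/latest/metrics.jsonl").read_text().splitlines() if l]
returns = np.array([r["daily_return"] for r in records])  # assumed field name
preds = np.array([r["pred"] for r in records])            # assumed field name
actuals = np.array([r["actual"] for r in records])        # assumed field name

sharpe = np.sqrt(252) * returns.mean() / returns.std(ddof=1)  # annualized, daily returns
rank_ic = spearmanr(preds, actuals).correlation               # RankIC = Spearman correlation
hit = float((np.sign(preds) == np.sign(actuals)).mean())      # directional hit ratio
print(f"Sharpe={sharpe:.2f} RankIC={rank_ic:.3f} hit={hit:.2%}")
```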
2. Quick Validation / Smoke
- `make train-quick EPOCHS=3` — runs in the foreground.
- `python scripts/smoke_test.py --max-epochs 1 --subset 512` for an additional regression guard.
- `pytest tests/integration/test_training_loop.py::test_forward_backward` if gradients look suspicious (sketched below).
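An illustrative version of the kind of check `test_forward_backward` performs; the tiny stand-in model below is an assumption, since the real test exercises the ATFT-GAT-FAN forward/backward pass:

```python
# Stand-in model; the real test runs the ATFT-GAT-FAN forward/backward pass.
import torch

def test_forward_backward():
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
    )
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    for name, p in model.named_parameters():
        assert p.grad is not None, f"{name}: gradient never populated"
        assert torch.isfinite(p.grad).all(), f"{name}: NaN/Inf gradient"
        assert p.grad.abs().max() < 1e3, f"{name}: suspiciously large gradient"
```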
3. Safe Mode / Debug
- `make train-safe` — disables compile, single-worker dataloading.
- `make train-stop` if hung jobs are detected (consult `_logs/training/pids/`).
- `python scripts/integrated_ml_training_pipeline.py --profile --epochs 2 --no-compile` — captures a flamegraph to `benchmark_output/` (see the profiler sketch below).
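The pipeline's `--profile` internals are not shown here; a generic `torch.profiler` capture like the sketch below (model and shapes are placeholders) produces a Chrome trace that serves the same diagnostic purpose as a flamegraph, exposing dataloader vs. compute imbalance and kernel hotspots:

```python
# Generic torch.profiler capture; model and shapes are placeholders.
from pathlib import Path

import torch
from torch.profiler import ProfilerActivity, profile

Path("benchmark_output").mkdir(exist_ok=True)
model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(1024, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x).sum().backward()

prof.export_chrome_trace("benchmark_output/trace.json")  # open in chrome://tracing
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```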
4. Hyper-Parameter Exploration
- Ensure the `mlflow` backend is running if required (`make mlflow-up`).
- `make hpo-run HPO_TRIALS=24 HPO_STUDY=atft_prod_lr_sched` — uses the Optuna integration (sketched below).
- `make hpo-status` — track trial completions.
- Promote the winning config → `configs/training/atft_prod.yaml` and document in `EXPERIMENT_STATUS.md`.
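A sketch of the Optuna loop that `make hpo-run` presumably wraps; the search space and the toy objective are assumptions, since the real objective trains ATFT-GAT-FAN and returns a validation score:

```python
# Placeholder objective; the real one trains ATFT-GAT-FAN and returns a val score.
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024])
    # plug in the real train/eval loop here; this toy surface just rewards lr near 3e-4
    return -((lr - 3e-4) ** 2) - 1e-4 * (batch_size / 1024)

study = optuna.create_study(study_name="atft_prod_lr_sched", direction="maximize")
study.optimize(objective, n_trials=24)
print(study.best_params)
```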
Monitoring & Telemetry
- Training logs: `_logs/training/*.log` (includes gradient norms, learning-rate schedule, GPU temp).
- Metrics JSONL: `runs/<timestamp>/metrics.jsonl`.
- Checkpoint artifacts: `models/checkpoints/<timestamp>/epoch_###.pt`.
- GPU telemetry: `watch -n 30 nvidia-smi` or `python tools/gpu_monitor.py --pid $(cat _logs/training/pids/train.pid)` (see the polling sketch below).
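A minimal polling loop in the spirit of `tools/gpu_monitor.py` (the actual script is not shown); it reuses the same `nvidia-smi` query as the preflight check and assumes a single visible GPU:

```python
# Reuses the preflight nvidia-smi query; assumes a single visible GPU.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

while True:
    util, mem_mib = subprocess.check_output(QUERY, text=True).strip().split(", ")
    print(f"util={util}% mem={int(mem_mib) / 1024:.1f}GiB")
    if int(util) < 60:
        print("warning: below the 60% utilization target; check dataloading")
    time.sleep(30)
```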
Failure Handling
- NaN loss → run `make train-safe` with `FP32=1`, inspect `runs/<ts>/nan_batches.json`.
- Slow dataloading → regenerate the dataset with `make dataset-gpu GRAPH_WINDOW=90` or enable PyTorch compile caching.
- OOM → set `GRADIENT_ACCUMULATION_STEPS=2` (see the sketch after this list) or reduce `BATCH_SIZE`; confirm memory fragmentation via `python tools/gpu_memory_report.py`.
- Divergent metrics → verify `configs/training/schedule.yaml`; run `pytest tests/unit/test_loss_functions.py`.
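For reference, the gradient-accumulation idea behind `GRADIENT_ACCUMULATION_STEPS=2` in sketch form: micro-batches accumulate gradients and the optimizer steps every N batches, trading wall-clock time for peak memory. The model and data below are stand-ins:

```python
# Stand-in model/data; shows optimizer stepping every `accum_steps` micro-batches.
import torch

accum_steps = 2
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(8)]

for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps  # scale to average grads
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```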
Codex Collaboration
- Invoke `./tools/codex.sh --max "Design a new learning rate policy for ATFT-GAT-FAN"` when a novel optimizer or architecture strategy is required.
- Use `codex exec --model gpt-5-codex "Analyze runs/<timestamp>/metrics.jsonl and suggest fixes"` for automated postmortems.
- Share Codex-discovered tuning insights in `results/training_runs/` and update config files/documents accordingly.
Post-Training Handoff
- Persist a summary in `results/training_runs/<timestamp>.md` noting the dataset hash and commit SHA (see the sketch below).
- Push model weights to `models/artifacts/` with the naming `gatfan_<date>_Sharpe<score>.pt`.
- Notify the research team via `docs/research/changelog.md`.
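A sketch of capturing the two pieces of handoff metadata (dataset hash, commit SHA); the output filename is a placeholder, since the real summary is named by run timestamp:

```python
# Hypothetical summary writer; naming by run timestamp is project-specific.
import hashlib
import subprocess
from pathlib import Path

dataset = Path("output/ml_dataset_latest_full.parquet")
digest = hashlib.sha256()
with dataset.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        digest.update(chunk)

commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
out = Path("results/training_runs")
out.mkdir(parents=True, exist_ok=True)
(out / "summary.md").write_text(
    f"- dataset sha256: {digest.hexdigest()[:12]}\n- commit: {commit}\n"
)
```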