Skillsbench nanogpt-training
Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shards), modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/mhc-layer-impl/environment/skills/nanogpt-training" ~/.claude/skills/benchflow-ai-skillsbench-nanogpt-training && rm -rf "$T"
manifest:
tasks/mhc-layer-impl/environment/skills/nanogpt-training/SKILL.md
NanoGPT Training
Overview
This skill covers training GPT-2-scale models (~124M parameters) efficiently on a single GPU. It provides:
- GPT-124M Architecture: Standard transformer with RoPE and modern optimizations
- Tokenized Datasets: Loading pre-tokenized shards from HuggingFace Hub or local files
- Modern Optimizers: Muon optimizer with Newton-Schulz orthogonalization (see the sketch after this list)
- Mixed Precision: bfloat16 training on A100 for 2x speedup
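Muon's core step replaces each 2D weight's momentum update with an approximately orthogonalized version of it. Below is a minimal sketch of that Newton-Schulz step; the quintic coefficients follow the modded-nanogpt reference implementation and should be treated as an assumption of this sketch, not as this skill's verified code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix G, as used by Muon.

    Runs a quintic Newton-Schulz iteration in bfloat16. Coefficients are
    taken from the modded-nanogpt reference implementation (assumption).
    """
    assert G.ndim == 2, "Muon's orthogonalization applies to 2D weight matrices"
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # scale so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                        # iterate in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```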
Training options:
- Baseline GPT: Standard residual connections (see the block sketch after this list)
- Experimental residual variants: Optional alternative residual schemes for stability/efficiency
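The baseline block uses standard pre-norm residual connections. Here is a minimal, self-contained sketch; it omits RoPE and dropout for brevity, and the `CausalSelfAttention`/`Block` names are illustrative rather than the skill's actual module names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    # Illustrative attention without RoPE; the skill's model adds rotary embeddings.
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Baseline transformer block: pre-norm with standard residual connections."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd, bias=False),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x
```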
Quick Reference
| Topic | Reference |
|---|---|
| Model Architecture | GPT Architecture |
| Data Loading | Tokenized Data |
| Optimizers | Optimizers |
| Training Loop | Training Loop |
| Hyperparameters | Hyperparameters |
Installation
```bash
pip install torch einops numpy huggingface_hub
```
Minimal Example
```python
import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
```
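Assuming the example is saved as `train.py` and the Modal CLI is installed and authenticated, it can be launched with `modal run train.py`; the `local_entrypoint` runs locally and dispatches `train` to a remote A100.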
Common Imports
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
```
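As a rough illustration of how these pieces fit together, here is a minimal sketch of one bfloat16 training step; `model`, `optimizer`, and `get_batch` are hypothetical placeholders, and note that bfloat16 (unlike float16) does not require a `GradScaler`.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, get_batch, device="cuda"):
    """One bf16 mixed-precision step. model/optimizer/get_batch are
    hypothetical placeholders standing in for the skill's real objects."""
    x, y = get_batch()                      # (B, T) token ids and next-token targets
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 autocast on A100; unlike float16, no GradScaler is needed.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                   # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item(), float(grad_norm)
```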
When to Use What
| Scenario | Approach |
|---|---|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Try alternative residual variants or extra streams |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify the dataset loader class (see the sketch after this table) |
| Different model size | Adjust GPTConfig parameters |
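For the "Custom data" row, a minimal loader sketch for pre-tokenized shards is shown below. The uint16 `.bin` format, the `repo_type="dataset"` layout, and the helper names are assumptions based on common nanoGPT-style pipelines; adjust them to the actual shard format.

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download

def load_shard(repo_id: str, filename: str) -> torch.Tensor:
    """Download one pre-tokenized shard and return it as a 1D LongTensor.

    Assumes the shard is a raw file of uint16 GPT-2 token ids (a common
    nanoGPT convention; adjust dtype/format for the actual dataset).
    """
    path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    return torch.from_numpy(tokens.astype(np.int64))

def get_batch(tokens: torch.Tensor, batch_size: int, block_size: int):
    """Sample random contiguous windows for next-token prediction."""
    ix = torch.randint(0, tokens.numel() - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([tokens[i : i + block_size] for i in ix])
    y = torch.stack([tokens[i + 1 : i + 1 + block_size] for i in ix])
    return x, y
```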
Metrics to Monitor
| Metric | Typical Signal | Notes |
|---|---|---|
| Validation loss | Steady decrease | Absolute value depends on dataset/tokenizer |
| Grad norm | Moderate, stable range | Large spikes indicate instability |
| Training stability | Smooth curves | Frequent spikes suggest LR/batch issues |
| Throughput | Consistent tokens/sec | Use for comparing configs |
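For the throughput row, a rough tokens-per-second measurement can look like the sketch below; `step_fn` is a hypothetical callable that runs one full training step on a fixed-shape batch.

```python
import time
import torch

def tokens_per_sec(step_fn, batch_size: int, block_size: int, n_steps: int = 20) -> float:
    """Estimate training throughput in tokens/sec for comparing configs."""
    torch.cuda.synchronize()          # make sure prior GPU work has finished
    t0 = time.time()
    for _ in range(n_steps):
        step_fn()                     # one optimizer step on a (batch_size, block_size) batch
    torch.cuda.synchronize()          # wait for the timed steps to complete
    dt = time.time() - t0
    return n_steps * batch_size * block_size / dt
```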
External Resources
- nanoGPT: https://github.com/karpathy/nanoGPT
- build-nanogpt: https://github.com/karpathy/build-nanogpt
- modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
- FineWeb-Edu token shards: https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards