Hacktricks-skills llm-pretraining-helper

How to train LLMs from scratch using PyTorch, including model architecture setup, data preparation, training loops, loss monitoring, and model saving/loading. Use this skill whenever the user wants to train a language model from scratch, understand pre-training workflows, set up GPT architectures, configure training parameters, monitor loss/perplexity, or load/save model checkpoints. Make sure to use this skill when users mention training LLMs, pre-training, model checkpoints, GPT architectures, training loops, or want to build language models from the ground up.

Install

Clone the upstream repo:

git clone https://github.com/abelrguezr/hacktricks-skills

Manifest: skills/AI/AI-llm-architecture/6.-pre-training-and-loading-models/SKILL.MD

Source content

LLM Pre-training Helper

A skill for training language models from scratch using PyTorch, following best practices from the "LLMs from Scratch" methodology.

What this skill does

This skill helps you:

  • Set up GPT model architectures with proper configuration
  • Prepare training data with tokenization and data loaders
  • Configure training loops with loss monitoring and evaluation
  • Implement text generation with sampling strategies
  • Save and load model checkpoints
  • Visualize training progress (loss, perplexity)

When to use this skill

Use this skill when:

  • You want to train an LLM from scratch on your own dataset
  • You need to understand the pre-training workflow
  • You're setting up GPT model configurations
  • You want to monitor training metrics (loss, perplexity)
  • You need to save/load model checkpoints
  • You're implementing text generation with temperature/top-k sampling

Quick Start

1. Set up model configuration

GPT_CONFIG = {
    "vocab_size": 50257,      # GPT-2 vocabulary size
    "context_length": 256,    # Context window (adjust based on data)
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Attention heads
    "n_layers": 12,           # Transformer layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-key-value bias
}

2. Prepare your data

# Load your text data
text_data = "your training text here"

# Split into train/validation (90/10 is common)
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

# Create data loaders
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=True,
    drop_last=True
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=False,
    drop_last=False
)

3. Initialize model and start training

import tiktoken
import torch

# Tokenizer: GPT-2 BPE via tiktoken (matches vocab_size=50257 above; one common choice)
tokenizer = tiktoken.get_encoding("gpt2")

# Set seed for reproducibility
torch.manual_seed(123)

# Initialize model
model = GPTModel(GPT_CONFIG)

# Select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device)

# Setup optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1
)

# Train
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs,
    eval_freq=5,           # Evaluate every 5 steps
    eval_iter=5,           # Use 5 batches for evaluation
    start_context="Your starting phrase",
    tokenizer=tokenizer
)

Core Components

Model Architecture

The GPT model consists of:

  • Token embeddings: Convert token IDs to vectors
  • Positional embeddings: Add position information
  • Transformer blocks: Multi-head attention + feed-forward
  • Output head: Maps embeddings back to vocabulary
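
To make that structure concrete, here is a minimal sketch of how the pieces connect in PyTorch. The class names are illustrative, the config keys match GPT_CONFIG from the Quick Start, and torch.nn.MultiheadAttention stands in for a full custom attention implementation:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Simplified pre-norm block: multi-head self-attention + feed-forward
    def __init__(self, cfg):
        super().__init__()
        self.norm1 = nn.LayerNorm(cfg["emb_dim"])
        self.attn = nn.MultiheadAttention(cfg["emb_dim"], cfg["n_heads"],
                                          dropout=cfg["drop_rate"], batch_first=True)
        self.norm2 = nn.LayerNorm(cfg["emb_dim"])
        self.ff = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            nn.GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: each position may only attend to itself and earlier positions
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

class MinimalGPT(nn.Module):
    # Illustrative skeleton only; the real GPTModel follows the same layout
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])       # token embeddings
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])   # positional embeddings
        self.drop = nn.Dropout(cfg["drop_rate"])
        self.blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = nn.LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)  # output head

    def forward(self, idx):                               # idx: (batch, seq_len) token IDs
        seq_len = idx.size(1)
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(seq_len, device=idx.device))
        x = self.drop(tok + pos)                          # add position information
        x = self.blocks(x)                                # transformer blocks
        return self.out_head(self.final_norm(x))          # logits over the vocabulary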

Training Loop Structure

For each epoch:
  For each batch:
    1. Zero gradients
    2. Forward pass → get logits
    3. Calculate loss (cross-entropy)
    4. Backward pass → compute gradients
    5. Optimizer step → update weights
    6. (Optional) Evaluate and log metrics
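
A minimal version of those steps in PyTorch; this is a sketch, not the full train_model_simple used in the Quick Start, which also handles periodic evaluation and sample generation:

import torch

def train_one_epoch(model, train_loader, optimizer, device):
    """Sketch of the per-batch steps listed above (hypothetical helper)."""
    model.train()
    total_loss, num_batches = 0.0, 0
    for input_batch, target_batch in train_loader:        # loader yields (inputs, targets)
        input_batch = input_batch.to(device)
        target_batch = target_batch.to(device)

        optimizer.zero_grad()                              # 1. zero gradients
        logits = model(input_batch)                        # 2. forward pass -> logits
        loss = torch.nn.functional.cross_entropy(          # 3. cross-entropy loss
            logits.flatten(0, 1), target_batch.flatten()
        )
        loss.backward()                                    # 4. backward pass -> gradients
        optimizer.step()                                   # 5. update weights

        total_loss += loss.item()
        num_batches += 1
    return total_loss / max(num_batches, 1)                # average training loss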

Loss Functions

  • Cross-entropy loss: Measures difference between predicted and actual token distributions
  • Perplexity: exp(loss), which represents model uncertainty (lower is better)
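
For example, with logits of shape (batch, seq_len, vocab_size) and integer targets of shape (batch, seq_len), both values can be computed directly (the dummy tensors below are for illustration only):

import torch

logits = torch.randn(2, 8, 50257)            # dummy predictions: (batch, seq_len, vocab_size)
targets = torch.randint(0, 50257, (2, 8))    # dummy target token IDs: (batch, seq_len)

loss = torch.nn.functional.cross_entropy(
    logits.flatten(0, 1),                    # (batch * seq_len, vocab_size)
    targets.flatten()                        # (batch * seq_len,)
)
perplexity = torch.exp(loss)                 # perplexity = exp(cross-entropy loss)
print(f"loss={loss.item():.3f}, perplexity={perplexity.item():.1f}")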

Text Generation Strategies

Strategy          Description                                        Use Case
Greedy            Always pick the highest-probability token          Deterministic output
Top-k             Sample from the top k tokens                       Balanced diversity
Temperature       Scale logits before softmax                        Control randomness
Top-p (nucleus)   Sample until a cumulative probability threshold    Adaptive diversity
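
A sketch combining the greedy, temperature, and top-k rows of the table; the function name and interface are illustrative, not part of the base code:

import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token from the last-position logits (illustrative sketch)."""
    logits = logits[:, -1, :]                          # (batch, vocab_size): last position only
    if top_k is not None:                              # top-k: keep only the k largest logits
        top_vals, _ = torch.topk(logits, top_k)
        threshold = top_vals[:, -1].unsqueeze(-1)
        logits = torch.where(logits < threshold,
                             torch.full_like(logits, float("-inf")), logits)
    if temperature == 0.0:                             # greedy: highest-probability token
        return torch.argmax(logits, dim=-1, keepdim=True)
    probs = torch.softmax(logits / temperature, dim=-1)    # temperature scaling before softmax
    return torch.multinomial(probs, num_samples=1)          # sample one token per sequence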

Training Parameters Guide

Learning Rate

  • Small (1e-5 to 1e-4): Precise convergence, slower training
  • Large (1e-3 to 1e-2): Faster training, risk of overshooting
  • Recommended: Start with 4e-4 for AdamW

Batch Size

  • Small (1-4): More frequent updates, noisier gradients
  • Large (8-32): Smoother gradients, more memory
  • Recommended: 2-4 for CPU, 8-16 for GPU

Context Length

  • Short (128-256): Faster training, less context
  • Long (512-1024): More context, slower training
  • Recommended: Match your use case, start with 256

Number of Epochs

  • Few (5-10): Quick iteration, may underfit
  • Many (20-50): Better convergence, risk of overfitting
  • Recommended: Monitor validation loss, stop when it plateaus

Monitoring Training

Key Metrics to Track

  1. Training Loss: Should decrease over time
  2. Validation Loss: Should decrease, watch for overfitting
  3. Perplexity: exp(loss), lower is better
  4. Tokens Seen: Track progress through dataset
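
A simple way to visualize these metrics with matplotlib, assuming the train_losses, val_losses, and tokens_seen lists returned by train_model_simple in the Quick Start (recorded once per evaluation step, so all three have the same length):

import matplotlib.pyplot as plt

def plot_losses(tokens_seen, train_losses, val_losses):
    """Plot training and validation loss against the number of tokens seen."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(tokens_seen, train_losses, label="Training loss")
    ax.plot(tokens_seen, val_losses, linestyle="--", label="Validation loss")
    ax.set_xlabel("Tokens seen")
    ax.set_ylabel("Cross-entropy loss")
    ax.legend()
    fig.tight_layout()
    plt.show()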

Signs of Good Training

  • Training loss steadily decreases
  • Validation loss follows training loss
  • Generated text becomes more coherent
  • Perplexity drops significantly

Signs of Problems

  • Overfitting: Training loss ↓, Validation loss ↑
  • Underfitting: Both losses stay high
  • Exploding gradients: Loss becomes NaN or inf
  • Vanishing gradients: Loss stops decreasing

Saving and Loading Models

Save Full Checkpoint (for resuming training)

torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": current_epoch,
    "loss": current_loss
}, "checkpoint.pth")

Load Full Checkpoint

checkpoint = torch.load("checkpoint.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()

Save Model Only (for inference)

torch.save(model.state_dict(), "model.pth")

Load Model Only

model = GPTModel(GPT_CONFIG)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()

Common Issues and Solutions

"Not enough tokens for training"

  • Solution: Reduce context_length or increase training data
  • Check: total_tokens * train_ratio >= context_length

"CUDA out of memory"

  • Solution: Reduce batch size or context length
  • Alternative: Use gradient accumulation
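
A gradient-accumulation sketch, reusing model, optimizer, train_loader, and device from the Quick Start; accumulating over 4 micro-batches gives roughly the effect of a 4x larger batch without the extra memory:

import torch

accumulation_steps = 4
optimizer.zero_grad()
for step, (input_batch, target_batch) in enumerate(train_loader):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    (loss / accumulation_steps).backward()        # scale so the accumulated sum matches a full batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # update weights once per accumulation_steps batches
        optimizer.zero_grad()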

"Loss not decreasing"

  • Check: Learning rate (try 1e-4 to 1e-3)
  • Check: Data quality and tokenization
  • Check: Model is in training mode (model.train())

"Validation loss increasing"

  • Solution: Early stopping, reduce epochs
  • Alternative: Add regularization (dropout, weight decay)

Advanced Techniques (Not in Base Code)

Learning Rate Scheduling

  • Linear Warmup: Start small, increase to max LR
  • Cosine Decay: Gradually reduce LR after warmup
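
One way to implement both with a LambdaLR scheduler; warmup_steps, total_steps, and the peak learning rate below are placeholder values, not recommendations from the base code:

import math
import torch

warmup_steps = 100
total_steps = 1000
peak_lr = 4e-4

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

def warmup_cosine(step):
    """Multiplier on peak_lr: linear warmup, then cosine decay toward 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(total_steps - warmup_steps, 1), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Inside the training loop, call scheduler.step() after each optimizer.step()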

Gradient Clipping

  • Prevents exploding gradients
  • Call torch.nn.utils.clip_grad_norm_ with a max_norm value after the backward pass and before the optimizer step
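
For example (max_norm=1.0 is a common choice, not a value prescribed here):

# After loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)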

Top-p Sampling (Nucleus)

  • More adaptive than top-k
  • Sums probabilities until threshold (e.g., 0.9)
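
A nucleus-sampling sketch; the helper name and default values are illustrative:

import torch

def sample_top_p(logits, top_p=0.9, temperature=1.0):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then sample."""
    logits = logits[:, -1, :] / temperature                      # last position, temperature-scaled
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens once the cumulative probability before them already exceeds top_p
    mask = cumulative - sorted_probs > top_p
    sorted_probs = sorted_probs.masked_fill(mask, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)      # index into the sorted order
    return sorted_idx.gather(-1, choice)                         # map back to vocabulary IDs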

Beam Search

  • Explores multiple sequences simultaneously
  • Better quality than greedy, more expensive
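
A tiny beam-search sketch for a single prompt; the names and fixed beam width are illustrative, and real implementations also handle end-of-sequence tokens and length normalization:

import torch

def beam_search(model, idx, max_new_tokens, beam_width=3):
    """Greedy-ish beam search for one prompt `idx` of shape (1, seq_len)."""
    beams = [(idx, 0.0)]                                   # (token sequence, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq)[:, -1, :]              # logits for the next token
            log_probs = torch.log_softmax(logits, dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width, dim=-1)
            for lp, tok in zip(top_lp[0], top_ids[0]):     # expand this beam beam_width ways
                new_seq = torch.cat([seq, tok.view(1, 1)], dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the beam_width highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                     # best sequence found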

Next Steps

After training:

  1. Evaluate: Test on held-out data
  2. Fine-tune: Adapt to specific tasks
  3. Deploy: Use for inference or as base model
  4. Iterate: Adjust hyperparameters and retrain

References