Hacktricks-skills llm-training-guide
Guide for building and training large language models from scratch. Use this skill whenever the user wants to understand LLM training concepts, implement tokenization, data sampling, embeddings, attention mechanisms, model architecture, pre-training, or fine-tuning workflows. Trigger on mentions of LLM training, building models from scratch, tokenization, embeddings, attention, pre-training, fine-tuning, LoRA, or any LLM development task.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/AI/AI-llm-architecture/AI-llm-architecture/SKILL.MD
LLM Training Guide
A comprehensive guide for building and training large language models from scratch, based on the Manning book "Build a Large Language Model from Scratch".
Overview
This skill covers the complete LLM training pipeline:
- Tokenization - Converting text to token IDs
- Data Sampling - Preparing training data
- Token Embeddings - Vector representations
- Attention Mechanisms - Capturing word relationships
- LLM Architecture - Full model structure
- Pre-training - Training from scratch
- Fine-tuning - Adapting for specific tasks
Phase 1: Tokenization
Goal: Split input text into tokens and map each token to a numeric ID in a meaningful way.
Key Concepts
- Tokens: The basic units the model processes (can be characters, words, or subwords)
- Vocabulary: The set of all unique tokens
- Token IDs: Numeric identifiers for each token in the vocabulary
Implementation Steps
- Build vocabulary from your training corpus
- Create token-to-ID mapping (tokenizer)
- Create ID-to-token mapping (for decoding)
- Encode text → convert to token IDs
- Decode IDs → convert back to text
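The steps above can be sketched in a few lines of Python. This is a minimal word-level illustration, not the book's exact code; the class name `SimpleTokenizer` and the toy corpus are assumptions, and real systems would use a subword tokenizer instead.
```python
# Minimal word-level tokenizer sketch: build a vocabulary, then encode/decode.
# Production systems typically use subword tokenization (BPE/WordPiece) instead.
import re

class SimpleTokenizer:
    def __init__(self, corpus: str):
        # Split on words and punctuation to build the vocabulary
        tokens = re.findall(r"\w+|[^\w\s]", corpus)
        vocab = sorted(set(tokens)) + ["<unk>"]
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text: str) -> list[int]:
        tokens = re.findall(r"\w+|[^\w\s]", text)
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(tok, unk) for tok in tokens]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)

tokenizer = SimpleTokenizer("the quick brown fox jumps over the lazy dog .")
ids = tokenizer.encode("the lazy fox .")
print(ids)                    # token IDs, e.g. [8, 5, 3, 0]
print(tokenizer.decode(ids))  # "the lazy fox ."
```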
Best Practices
- Use subword tokenization (like BPE or WordPiece) for better coverage
- Include special tokens: `<pad>`, `<unk>`, `<bos>`, `<eos>`
- Keep vocabulary size reasonable (typically 50K-100K tokens)
- Consider your domain when building vocabulary
Phase 2: Data Sampling
Goal: Sample the input data and prepare it for training by splitting it into sequences of a fixed length and generating the expected targets (the next tokens).
Key Concepts
- Sequence length: Fixed number of tokens per training example
- Context window: How much history the model sees
- Target generation: What the model should predict (next token)
Implementation Steps
- Load and concatenate all training text
- Tokenize the entire corpus
- Split into sequences of fixed length (e.g., 1024 tokens)
- Create input/target pairs:
- Input: tokens [0, 1, 2, ..., n-1]
- Target: tokens [1, 2, 3, ..., n]
- Batch sequences for efficient training
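The sliding-window sampling described above can be sketched with a PyTorch `Dataset`. This is a minimal illustration under assumed names (`NextTokenDataset`) and toy sizes, not the book's exact implementation.
```python
# Sliding-window data sampling: each input chunk is paired with the same chunk shifted by one token.
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    def __init__(self, token_ids, seq_len=1024, stride=1024):
        self.inputs, self.targets = [], []
        # Slide a window over the tokenized corpus; the target is the input shifted by one
        for start in range(0, len(token_ids) - seq_len, stride):
            chunk = token_ids[start : start + seq_len + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy example with a short sequence length
dataset = NextTokenDataset(list(range(100)), seq_len=8, stride=8)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```
Using a stride smaller than the sequence length gives overlapping windows and therefore more training examples from the same corpus.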
Best Practices
- Use sequence lengths that fit your GPU memory
- Shuffle sequences between epochs
- Consider overlapping sequences for more training data
- Balance dataset across domains if using mixed data
Phase 3: Token Embeddings
Goal: Assign each token a vector representation of the desired dimensionality, so that each token becomes a point in an N-dimensional space.
Key Concepts
- Embedding dimension: Size of the vector (e.g., 512, 1024, 4096)
- Learnable parameters: Embeddings are initialized randomly and trained
- Position embeddings: Additional vectors encoding word position
Implementation Steps
- Initialize token embeddings randomly (vocab_size × embedding_dim)
- Initialize position embeddings randomly (max_seq_len × embedding_dim)
- Combine embeddings: token_embedding + position_embedding
- Train embeddings alongside model parameters
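A minimal PyTorch sketch of these steps, assuming learned absolute position embeddings; the dimensions are illustrative, not prescriptive.
```python
# Combine token embeddings with absolute position embeddings.
import torch
import torch.nn as nn

vocab_size, max_seq_len, embed_dim = 50_000, 1024, 512

token_emb = nn.Embedding(vocab_size, embed_dim)     # vocab_size x embedding_dim
pos_emb = nn.Embedding(max_seq_len, embed_dim)      # max_seq_len x embedding_dim

input_ids = torch.randint(0, vocab_size, (4, 128))  # batch of 4 sequences, length 128
positions = torch.arange(input_ids.shape[1])        # 0, 1, ..., 127

# Position embeddings broadcast across the batch and are added to the token embeddings
x = token_emb(input_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([4, 128, 512])
```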
Position Embedding Types
- Absolute: Fixed position encoding (simple, effective)
- Relative: Encodes distance between tokens
- Rotary: Rotates embeddings based on position (RoPE)
Best Practices
- Embedding dimension should match model hidden size
- Use learned embeddings rather than fixed ones
- Consider sinusoidal position embeddings for extrapolation
Phase 4: Attention Mechanisms
Goal: Apply attention layers to capture relationships between tokens in the sequence.
Key Concepts
- Self-attention: Each token attends to all tokens in the sequence
- Query, Key, Value: Three projections for attention computation
- Multi-head attention: Multiple attention heads in parallel
- Causal masking: Prevents tokens from attending to future positions (required for autoregressive, decoder-only models)
Implementation Steps
- Project embeddings to Q, K, V matrices
- Compute attention scores: Q × K^T / sqrt(d_k)
- Apply causal mask (for decoder-only models)
- Softmax to get attention weights
- Weighted sum: attention_weights × V
- Combine heads and project back
Attention Formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
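The formula and the steps above can be sketched as a multi-head causal self-attention module in PyTorch. This is a minimal, unoptimized illustration (no dropout, no flash attention); the class name `CausalSelfAttention` is an assumption.
```python
# Multi-head causal self-attention: project to Q/K/V, score, mask, softmax, weight V, recombine heads.
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Upper-triangular mask that blocks attention to future positions
        mask = torch.triu(torch.ones(max_seq_len, max_seq_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        # Project to Q, K, V and split into heads: (batch, heads, seq, head_dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # Scaled dot-product scores, masked so each token only sees the past
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                           # (batch, heads, seq, head_dim)
        out = out.transpose(1, 2).reshape(b, t, d)  # recombine heads
        return self.out_proj(out)

attn = CausalSelfAttention(embed_dim=512, num_heads=8, max_seq_len=1024)
print(attn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```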
Best Practices
- Use multi-head attention (8-16 heads typical)
- Apply layer normalization around the attention and MLP sub-layers (pre-norm is standard in modern LLMs)
- Use residual connections around attention blocks
- Consider flash attention for efficiency
Phase 5: LLM Architecture
Goal: Develop the full LLM architecture by combining all components.
Standard Transformer Decoder Architecture
Input → Token Embedding → Position Embedding → [N × (Attention → MLP)] → Output Projection → Logits
Components
- Embedding Layer: Token + Position embeddings
- N Transformer Blocks:
- Multi-head self-attention
- Layer normalization
- Feed-forward MLP (2-4x hidden size)
- Layer normalization
- Output Projection: Hidden size → vocabulary size
- Loss Function: Cross-entropy on next token prediction
Implementation Steps
- Define model class with all layers
- Implement forward pass through all components
- Implement training loop with loss computation
- Implement generation (sampling, beam search, etc.)
- Add saving/loading for model checkpoints
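The components above can be assembled into a small decoder-only model. This is a toy-sized sketch, not a production architecture; it assumes the `CausalSelfAttention` module from Phase 4 and the made-up name `MiniGPT`.
```python
# Pre-norm transformer block and a tiny decoder-only model that outputs next-token logits.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads, max_seq_len)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(               # feed-forward MLP, 4x hidden size
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))        # pre-norm + residual connection
        x = x + self.mlp(self.norm2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.blocks = nn.ModuleList(
            TransformerBlock(embed_dim, num_heads, max_seq_len) for _ in range(num_layers)
        )
        self.final_norm = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.shape[1], device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.final_norm(x))  # logits over the vocabulary

model = MiniGPT(vocab_size=1000, embed_dim=128, num_heads=4, num_layers=2, max_seq_len=256)
print(model(torch.randint(0, 1000, (2, 32))).shape)  # torch.Size([2, 32, 1000])
```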
Best Practices
- Use pre-norm architecture (norm before attention/MLP)
- Initialize weights carefully (e.g., Xavier, He initialization)
- Use gradient clipping to prevent exploding gradients
- Implement mixed precision training for efficiency
Phase 6: Pre-training
Goal: Train the model from scratch using the defined architecture, loss functions, and optimizer.
Training Loop
```python
for epoch in range(num_epochs):
    for input_tokens, target_tokens in dataloader:
        # Forward pass
        logits = model(input_tokens)
        # Compute loss: cross-entropy between predicted logits and target tokens
        loss = cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
        # Backward pass
        loss.backward()
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
```
Key Hyperparameters
- Learning rate: 1e-4 to 3e-4 (with warmup)
- Batch size: Depends on GPU memory (effective batch size 1024-4096)
- Optimizer: AdamW with weight decay (0.01-0.1)
- Learning rate schedule: Cosine decay or linear warmup + decay
- Gradient accumulation: For larger effective batch sizes
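A minimal sketch of an AdamW plus warmup/cosine-decay setup in PyTorch; the stand-in `model`, the step counts, and the hyperparameter values are illustrative assumptions.
```python
# AdamW with weight decay, linear warmup over the first 10% of steps, then cosine decay.
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)        # stand-in; use the actual LLM from Phase 5 in practice
total_steps = 10_000
warmup_steps = total_steps // 10   # warmup over the first 10% of steps

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step):
    # Linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step()
```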
Best Practices
- Use learning rate warmup (first 10% of steps)
- Monitor training loss and perplexity
- Save checkpoints regularly
- Use gradient checkpointing for memory efficiency
- Consider distributed training for large models
Phase 7: Fine-tuning
7.0 LoRA (Low-Rank Adaptation)
Goal: Reduce computation needed for fine-tuning by training only small adapter matrices.
How LoRA Works
- Freeze pre-trained weights
- Add small rank-r matrices to attention layers
- Train only the LoRA parameters
- Merge LoRA weights with base model for inference
Implementation Steps
- Freeze base model parameters
- Add LoRA adapters to attention Q, V, (optionally K, O)
- Train only LoRA parameters
- Merge weights for deployment
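A minimal sketch of a LoRA adapter wrapping a frozen linear layer, to show how the low-rank update attaches to existing weights; the class name `LoRALinear` is an assumption, and real projects usually rely on a library such as PEFT.
```python
# LoRA adapter: freeze the base projection and learn a low-rank update W + (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pre-trained weights
        self.scale = alpha / rank
        self.lora_a = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base_linear.out_features, rank))

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8, alpha=16)
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```
Initializing `lora_b` to zeros means the adapted layer starts out identical to the frozen base layer, so training begins from the pre-trained behavior.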
Best Practices
- Rank r: 8-64 (higher for more capacity)
- Alpha: Scaling factor (typically 2× rank)
- Apply to attention layers primarily
- Use lower learning rate than pre-training
7.1 Fine-tuning for Classification
Goal: Adapt pre-trained model to classify text into categories.
Implementation Steps
- Load pre-trained model (frozen or partially unfrozen)
- Add a classification head on top of the model's final hidden states
- Prepare labeled dataset with categories
- Train with cross-entropy loss on labels
- Evaluate with accuracy, F1, etc.
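A minimal sketch of a classification head on a frozen backbone, assuming the backbone returns per-token hidden states of shape (batch, seq_len, hidden_dim); the names `ClassifierHead` and the embedding stand-in are illustrative assumptions.
```python
# Classification head: pool the backbone's hidden states, then apply a linear layer over classes.
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, backbone, hidden_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)         # freeze the backbone initially
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)   # (batch, seq_len, hidden_dim)
        pooled = hidden[:, -1, :]           # last-token representation (mean pooling also works)
        return self.head(pooled)

# Toy usage with a stand-in backbone (in practice: the pre-trained LLM without its LM head)
dummy_backbone = nn.Embedding(1000, 128)
clf = ClassifierHead(dummy_backbone, hidden_dim=128, num_classes=3)
logits = clf(torch.randint(0, 1000, (4, 32)))                        # (4, 3)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2, 1, 0]))
print(loss.item())
```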
Best Practices
- Pool the sequence for classification: mean pooling, or the last token's hidden state for decoder-only models (which have no [CLS] token)
- Fine-tune last 1-2 layers initially
- Use smaller learning rate than pre-training
- Consider few-shot learning for limited data
7.2 Fine-tuning for Instruction Following
Goal: Adapt pre-trained model to follow instructions (chat, tasks, etc.).
Implementation Steps
- Prepare instruction dataset (instruction, input, output format)
- Format examples with special tokens:
  `<instruction> {instruction} <input> {input} <output> {output}`
- Train on formatted data with next-token prediction
- Evaluate on instruction following benchmarks
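A minimal sketch of a formatting function for instruction data; the exact template, field names, and special tokens vary between projects and are assumptions here.
```python
# Format an (instruction, input, output) record into a single training string.
def format_example(example: dict) -> str:
    text = f"<instruction> {example['instruction']}"
    if example.get("input"):                 # the input field is optional
        text += f" <input> {example['input']}"
    text += f" <output> {example['output']}"
    return text

sample = {
    "instruction": "Summarize the text.",
    "input": "LLMs are trained on large text corpora.",
    "output": "LLMs learn from large amounts of text.",
}
print(format_example(sample))
# <instruction> Summarize the text. <input> LLMs are trained on large text corpora. <output> LLMs learn from large amounts of text.
```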
Best Practices
- Use diverse instruction templates
- Include both simple and complex instructions
- Consider supervised fine-tuning (SFT) before RLHF
- Use quality datasets (e.g., Alpaca, Dolly)
- Monitor for instruction following vs. memorization
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Training loss not decreasing | Check learning rate, batch size, data quality |
| Model generates repetitive text | Adjust temperature, use top-k/top-p sampling |
| Out of memory | Reduce batch size, use gradient checkpointing |
| Slow training | Use mixed precision, flash attention |
| Poor generalization | More data, regularization, better architecture |
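For the repetitive-text issue in the table above, a minimal sketch of temperature and top-k sampling from next-token logits; the function name and default values are illustrative assumptions.
```python
# Sample the next token: scale logits by temperature, keep the top-k candidates, sample from them.
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    logits = logits / temperature                  # <1.0 sharpens, >1.0 flattens the distribution
    top_values, top_indices = torch.topk(logits, top_k)
    probs = torch.softmax(top_values, dim=-1)      # only the k most likely tokens keep probability mass
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice]

next_token = sample_next_token(torch.randn(1000))  # logits over a 1,000-token vocabulary
print(next_token.item())
```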
Next Steps
After completing these phases, you can:
- Deploy your model for inference
- Optimize with quantization, pruning
- Scale to larger datasets and models
- Experiment with different architectures
- Fine-tune for your specific use case
References
- Manning Book: "Build a Large Language Model from Scratch"
- Original Transformer Paper: "Attention Is All You Need"
- LoRA Paper: "LoRA: Low-Rank Adaptation of Large Language Models"
- Various implementation guides and tutorials