Hacktricks-skills llm-training-guide
Guide for building and training large language models from scratch. Use this skill whenever the user wants to understand LLM training concepts, implement tokenization, data sampling, embeddings, attention mechanisms, model architecture, pre-training, or fine-tuning workflows. Trigger on mentions of LLM training, building models from scratch, tokenization, embeddings, attention, pre-training, fine-tuning, LoRA, or any LLM development task.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/AI/AI-llm-architecture/AI-llm-architecture/SKILL.MD
LLM Training Guide
A comprehensive guide for building and training large language models from scratch, based on the Manning book "Build a Large Language Model from Scratch".
Overview
This skill covers the complete LLM training pipeline:
- Tokenization - Converting text to token IDs
- Data Sampling - Preparing training data
- Token Embeddings - Vector representations
- Attention Mechanisms - Capturing word relationships
- LLM Architecture - Full model structure
- Pre-training - Training from scratch
- Fine-tuning - Adapting for specific tasks
Phase 1: Tokenization
Goal: Split input text into tokens and map each token to a numeric ID in a meaningful way.
Key Concepts
- Tokens: The basic units the model processes (can be characters, words, or subwords)
- Vocabulary: The set of all unique tokens
- Token IDs: Numeric identifiers for each token in the vocabulary
Implementation Steps
- Build vocabulary from your training corpus
- Create token-to-ID mapping (tokenizer)
- Create ID-to-token mapping (for decoding)
- Encode text → convert to token IDs
- Decode IDs → convert back to text
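The steps above can be sketched in a few lines of Python. This is a minimal word-level illustration, not the book's exact code; the class name `SimpleTokenizer` and the toy corpus are assumptions, and real systems would use a subword tokenizer instead.
```python
# Minimal word-level tokenizer sketch: build a vocabulary, then encode/decode.
# Production systems typically use subword tokenization (BPE/WordPiece) instead.
import re

class SimpleTokenizer:
    def __init__(self, corpus: str):
        # Split on words and punctuation to build the vocabulary
        tokens = re.findall(r"\w+|[^\w\s]", corpus)
        vocab = sorted(set(tokens)) + ["<unk>"]
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text: str) -> list[int]:
        tokens = re.findall(r"\w+|[^\w\s]", text)
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(tok, unk) for tok in tokens]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)

tokenizer = SimpleTokenizer("the quick brown fox jumps over the lazy dog .")
ids = tokenizer.encode("the lazy fox .")
print(ids)                    # token IDs, e.g. [8, 5, 3, 0]
print(tokenizer.decode(ids))  # "the lazy fox ."
```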
Best Practices
- Use subword tokenization (like BPE or WordPiece) for better coverage
- Include special tokens: `<pad>`, `<unk>`, `<bos>`, `<eos>`
- Keep vocabulary size reasonable (typically 50K-100K tokens)
- Consider your domain when building vocabulary
Phase 2: Data Sampling
Goal: Sample the input data and prepare it for training by splitting it into sequences of a fixed length and generating the expected targets (the next tokens).
Key Concepts
- Sequence length: Fixed number of tokens per training example
- Context window: How much history the model sees
- Target generation: What the model should predict (next token)
Implementation Steps
- Load and concatenate all training text
- Tokenize the entire corpus
- Split into sequences of fixed length (e.g., 1024 tokens)
- Create input/target pairs:
- Input: tokens [0, 1, 2, ..., n-1]
- Target: tokens [1, 2, 3, ..., n]
- Batch sequences for efficient training
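The sliding-window sampling described above can be sketched with a PyTorch `Dataset`. This is a minimal illustration under assumed names (`NextTokenDataset`) and toy sizes, not the book's exact implementation.
```python
# Sliding-window data sampling: each input chunk is paired with the same chunk shifted by one token.
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    def __init__(self, token_ids, seq_len=1024, stride=1024):
        self.inputs, self.targets = [], []
        # Slide a window over the tokenized corpus; the target is the input shifted by one
        for start in range(0, len(token_ids) - seq_len, stride):
            chunk = token_ids[start : start + seq_len + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy example with a short sequence length
dataset = NextTokenDataset(list(range(100)), seq_len=8, stride=8)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```
Using a stride smaller than the sequence length gives overlapping windows and therefore more training examples from the same corpus.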
Best Practices
- Use sequence lengths that fit your GPU memory
- Shuffle sequences between epochs
- Consider overlapping sequences for more training data
- Balance dataset across domains if using mixed data
Phase 3: Token Embeddings
Goal: Assign each token a vector representation of the desired dimensionality, so that each token becomes a point in an N-dimensional space.
Key Concepts
- Embedding dimension: Size of the vector (e.g., 512, 1024, 4096)
- Learnable parameters: Embeddings are initialized randomly and trained
- Position embeddings: Additional vectors encoding word position
Implementation Steps
- Initialize token embeddings randomly (vocab_size × embedding_dim)
- Initialize position embeddings randomly (max_seq_len × embedding_dim)
- Combine embeddings: token_embedding + position_embedding
- Train embeddings alongside model parameters
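A minimal PyTorch sketch of these steps, assuming learned absolute position embeddings; the dimensions are illustrative, not prescriptive.
```python
# Combine token embeddings with absolute position embeddings.
import torch
import torch.nn as nn

vocab_size, max_seq_len, embed_dim = 50_000, 1024, 512

token_emb = nn.Embedding(vocab_size, embed_dim)     # vocab_size x embedding_dim
pos_emb = nn.Embedding(max_seq_len, embed_dim)      # max_seq_len x embedding_dim

input_ids = torch.randint(0, vocab_size, (4, 128))  # batch of 4 sequences, length 128
positions = torch.arange(input_ids.shape[1])        # 0, 1, ..., 127

# Position embeddings broadcast across the batch and are added to the token embeddings
x = token_emb(input_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([4, 128, 512])
```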
Position Embedding Types
- Absolute: Fixed position encoding (simple, effective)
- Relative: Encodes distance between tokens
- Rotary: Rotates embeddings based on position (RoPE)
Best Practices
- Embedding dimension should match model hidden size
- Use learned embeddings rather than fixed ones
- Consider sinusoidal position embeddings for extrapolation
Phase 4: Attention Mechanisms
Goal: Apply attention layers to capture relationships between tokens in the sequence.
Key Concepts
- Self-attention: Each token attends to all tokens in the sequence
- Query, Key, Value: Three projections for attention computation
- Multi-head attention: Multiple attention heads in parallel
- Causal masking: Prevents tokens from attending to future positions (required for autoregressive, decoder-only models)
Implementation Steps
- Project embeddings to Q, K, V matrices
- Compute attention scores: Q × K^T / sqrt(d_k)
- Apply causal mask (for decoder-only models)
- Softmax to get attention weights
- Weighted sum: attention_weights × V
- Combine heads and project back
Attention Formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
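The formula and the steps above can be sketched as a multi-head causal self-attention module in PyTorch. This is a minimal, unoptimized illustration (no dropout, no flash attention); the class name `CausalSelfAttention` is an assumption.
```python
# Multi-head causal self-attention: project to Q/K/V, score, mask, softmax, weight V, recombine heads.
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Upper-triangular mask that blocks attention to future positions
        mask = torch.triu(torch.ones(max_seq_len, max_seq_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        # Project to Q, K, V and split into heads: (batch, heads, seq, head_dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # Scaled dot-product scores, masked so each token only sees the past
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                           # (batch, heads, seq, head_dim)
        out = out.transpose(1, 2).reshape(b, t, d)  # recombine heads
        return self.out_proj(out)

attn = CausalSelfAttention(embed_dim=512, num_heads=8, max_seq_len=1024)
print(attn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```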
Best Practices
- Use multi-head attention (8-16 heads typical)
- Apply layer normalization around the attention and MLP sub-layers (pre-norm is standard in modern LLMs)
- Use residual connections around attention blocks
- Consider flash attention for efficiency
Phase 5: LLM Architecture
Goal: Develop the full LLM architecture by combining all components.
Standard Transformer Decoder Architecture
Input → Token Embedding → Position Embedding → [N × (Attention → MLP)] → Output Projection → Logits
Components
- Embedding Layer: Token + Position embeddings
- N Transformer Blocks:
- Multi-head self-attention
- Layer normalization
- Feed-forward MLP (2-4x hidden size)
- Layer normalization
- Output Projection: Hidden size → vocabulary size
- Loss Function: Cross-entropy on next token prediction
Implementation Steps
- Define model class with all layers
- Implement forward pass through all components
- Implement training loop with loss computation
- Implement generation (sampling, beam search, etc.)
- Add saving/loading for model checkpoints
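The components above can be assembled into a small decoder-only model. This is a toy-sized sketch, not a production architecture; it assumes the `CausalSelfAttention` module from Phase 4 and the made-up name `MiniGPT`.
```python
# Pre-norm transformer block and a tiny decoder-only model that outputs next-token logits.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads, max_seq_len)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(               # feed-forward MLP, 4x hidden size
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))        # pre-norm + residual connection
        x = x + self.mlp(self.norm2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.blocks = nn.ModuleList(
            TransformerBlock(embed_dim, num_heads, max_seq_len) for _ in range(num_layers)
        )
        self.final_norm = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.shape[1], device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.final_norm(x))  # logits over the vocabulary

model = MiniGPT(vocab_size=1000, embed_dim=128, num_heads=4, num_layers=2, max_seq_len=256)
print(model(torch.randint(0, 1000, (2, 32))).shape)  # torch.Size([2, 32, 1000])
```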
Best Practices
- Use pre-norm architecture (norm before attention/MLP)
- Initialize weights carefully (e.g., Xavier, He initialization)
- Use gradient clipping to prevent exploding gradients
- Implement mixed precision training for efficiency
Phase 6: Pre-training
Goal: Train the model from scratch using the defined architecture, loss functions, and optimizer.
Training Loop
```python
for epoch in range(num_epochs):
    for input_tokens, target_tokens in dataloader:
        # Forward pass
        logits = model(input_tokens)
        # Compute loss: cross-entropy between predicted logits and target tokens
        loss = cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
        # Backward pass
        loss.backward()
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
```
Key Hyperparameters
- Learning rate: 1e-4 to 3e-4 (with warmup)
- Batch size: Depends on GPU memory (effective batch size 1024-4096)
- Optimizer: AdamW with weight decay (0.01-0.1)
- Learning rate schedule: Cosine decay or linear warmup + decay
- Gradient accumulation: For larger effective batch sizes
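A minimal sketch of an AdamW plus warmup/cosine-decay setup in PyTorch; the stand-in `model`, the step counts, and the hyperparameter values are illustrative assumptions.
```python
# AdamW with weight decay, linear warmup over the first 10% of steps, then cosine decay.
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)        # stand-in; use the actual LLM from Phase 5 in practice
total_steps = 10_000
warmup_steps = total_steps // 10   # warmup over the first 10% of steps

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step):
    # Linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step()
```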
Best Practices
- Use learning rate warmup (first 10% of steps)
- Monitor training loss and perplexity
- Save checkpoints regularly
- Use gradient checkpointing for memory efficiency
- Consider distributed training for large models
Phase 7: Fine-tuning
7.0 LoRA (Low-Rank Adaptation)
Goal: Reduce computation needed for fine-tuning by training only small adapter matrices.
How LoRA Works
- Freeze pre-trained weights
- Add small rank-r matrices to attention layers
- Train only the LoRA parameters
- Merge LoRA weights with base model for inference
Implementation Steps
- Freeze base model parameters
- Add LoRA adapters to attention Q, V, (optionally K, O)
- Train only LoRA parameters
- Merge weights for deployment
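A minimal sketch of a LoRA adapter wrapping a frozen linear layer, to show how the low-rank update attaches to existing weights; the class name `LoRALinear` is an assumption, and real projects usually rely on a library such as PEFT.
```python
# LoRA adapter: freeze the base projection and learn a low-rank update W + (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pre-trained weights
        self.scale = alpha / rank
        self.lora_a = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base_linear.out_features, rank))

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8, alpha=16)
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```
Initializing `lora_b` to zeros means the adapted layer starts out identical to the frozen base layer, so training begins from the pre-trained behavior.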
Best Practices
- Rank r: 8-64 (higher for more capacity)
- Alpha: Scaling factor (typically 2× rank)
- Apply to attention layers primarily
- Use lower learning rate than pre-training
7.1 Fine-tuning for Classification
Goal: Adapt pre-trained model to classify text into categories.
Implementation Steps
- Load pre-trained model (frozen or partially unfrozen)
- Add a classification head on top of the model's final hidden states
- Prepare labeled dataset with categories
- Train with cross-entropy loss on labels
- Evaluate with accuracy, F1, etc.
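A minimal sketch of a classification head on a frozen backbone, assuming the backbone returns per-token hidden states of shape (batch, seq_len, hidden_dim); the names `ClassifierHead` and the embedding stand-in are illustrative assumptions.
```python
# Classification head: pool the backbone's hidden states, then apply a linear layer over classes.
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, backbone, hidden_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)         # freeze the backbone initially
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)   # (batch, seq_len, hidden_dim)
        pooled = hidden[:, -1, :]           # last-token representation (mean pooling also works)
        return self.head(pooled)

# Toy usage with a stand-in backbone (in practice: the pre-trained LLM without its LM head)
dummy_backbone = nn.Embedding(1000, 128)
clf = ClassifierHead(dummy_backbone, hidden_dim=128, num_classes=3)
logits = clf(torch.randint(0, 1000, (4, 32)))                        # (4, 3)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2, 1, 0]))
print(loss.item())
```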
Best Practices
- Pool the sequence for classification: mean pooling, or the last token's hidden state for decoder-only models (which have no [CLS] token)
- Fine-tune last 1-2 layers initially
- Use smaller learning rate than pre-training
- Consider few-shot learning for limited data
7.2 Fine-tuning for Instruction Following
Goal: Adapt pre-trained model to follow instructions (chat, tasks, etc.).
Implementation Steps
- Prepare instruction dataset (instruction, input, output format)
- Format examples with special tokens:
  `<instruction> {instruction} <input> {input} <output> {output}`
- Train on formatted data with next-token prediction
- Evaluate on instruction following benchmarks
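A minimal sketch of a formatting function for instruction data; the exact template, field names, and special tokens vary between projects and are assumptions here.
```python
# Format an (instruction, input, output) record into a single training string.
def format_example(example: dict) -> str:
    text = f"<instruction> {example['instruction']}"
    if example.get("input"):                 # the input field is optional
        text += f" <input> {example['input']}"
    text += f" <output> {example['output']}"
    return text

sample = {
    "instruction": "Summarize the text.",
    "input": "LLMs are trained on large text corpora.",
    "output": "LLMs learn from large amounts of text.",
}
print(format_example(sample))
# <instruction> Summarize the text. <input> LLMs are trained on large text corpora. <output> LLMs learn from large amounts of text.
```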
Best Practices
- Use diverse instruction templates
- Include both simple and complex instructions
- Consider supervised fine-tuning (SFT) before RLHF
- Use quality datasets (e.g., Alpaca, Dolly)
- Monitor for instruction following vs. memorization
Common Issues and Solutions
| Issue | Solution |
|---|---|
| Training loss not decreasing | Check learning rate, batch size, data quality |
| Model generates repetitive text | Adjust temperature, use top-k/top-p sampling |
| Out of memory | Reduce batch size, use gradient checkpointing |
| Slow training | Use mixed precision, flash attention |
| Poor generalization | More data, regularization, better architecture |
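For the repetitive-text issue in the table above, a minimal sketch of temperature and top-k sampling from next-token logits; the function name and default values are illustrative assumptions.
```python
# Sample the next token: scale logits by temperature, keep the top-k candidates, sample from them.
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    logits = logits / temperature                  # <1.0 sharpens, >1.0 flattens the distribution
    top_values, top_indices = torch.topk(logits, top_k)
    probs = torch.softmax(top_values, dim=-1)      # only the k most likely tokens keep probability mass
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice]

next_token = sample_next_token(torch.randn(1000))  # logits over a 1,000-token vocabulary
print(next_token.item())
```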
Next Steps
After completing these phases, you can:
- Deploy your model for inference
- Optimize with quantization, pruning
- Scale to larger datasets and models
- Experiment with different architectures
- Fine-tune for your specific use case
References
- Manning Book: "Build a Large Language Model from Scratch"
- Original Transformer Paper: "Attention Is All You Need"
- LoRA Paper: "LoRA: Low-Rank Adaptation of Large Language Models"
- Various implementation guides and tutorials