Hacktricks-skills llm-architecture
Build and understand LLM architecture from scratch. Use this skill whenever the user needs to create GPT models, implement transformer components (attention, feedforward, layer norm), calculate model parameters, or generate text with a trained model. Trigger for any request about LLM architecture, transformer blocks, GPT implementation, token embeddings, positional embeddings, or building neural networks for language modeling.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/AI/AI-llm-architecture/5.-llm-architecture/SKILL.md
LLM Architecture Builder
A skill for building and understanding Large Language Model architecture from scratch, following the GPT-style transformer design.
When to Use This Skill
Use this skill when the user needs to:
- Build a GPT model from scratch
- Implement transformer components (attention, feedforward, layer normalization)
- Calculate the number of parameters in an LLM
- Generate text using a trained model
- Understand how token and positional embeddings work
- Create or modify LLM architecture configurations
Core Components
1. GELU Activation Function
GELU (Gaussian Error Linear Unit) introduces non-linearity into the model. Unlike ReLU, which zeroes out negative inputs, GELU maps inputs smoothly, allowing small non-zero outputs for negative inputs.
Use the bundled script:
scripts/gelu.py
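For orientation, here is a minimal sketch of the tanh approximation of GELU commonly used in GPT-2-style models (the class name is illustrative; the bundled script may differ in details):

```python
import torch
import torch.nn as nn

class GELU(nn.Module):
    """Tanh approximation of GELU, as popularized by GPT-2."""
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
```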
2. FeedForward Network
A position-wise feedforward network that applies the same two-layer fully connected transformation to each position independently:
- First linear layer: expands dimensionality from emb_dim to 4 * emb_dim
- GELU activation: applies non-linearity
- Second linear layer: reduces dimensionality back to emb_dim
Use the bundled script:
scripts/feedforward.py
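A minimal sketch of such a feedforward module, assuming a cfg dict with the keys from the Standard Configuration below (the bundled script may differ in details):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward: emb_dim -> 4 * emb_dim -> emb_dim."""
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # expand
            nn.GELU(),                                       # non-linearity
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),   # contract
        )

    def forward(self, x):
        return self.layers(x)
```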
3. Multi-Head Attention
Allows the model to focus on different positions within the input sequence:
- Queries, Keys, Values: Linear projections of the input
- Heads: Multiple attention mechanisms running in parallel
- Causal Mask: Prevents attending to future tokens (autoregressive)
- Dropout: Prevents overfitting
Use the bundled script:
scripts/multihead_attention.py
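A compact sketch of a causal multi-head attention module (parameter names such as d_in, d_out, and qkv_bias are illustrative and may not match the bundled script exactly):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention (illustrative sketch)."""
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask: 1 marks future positions that must be hidden
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(2, 3)                      # (b, heads, tokens, tokens)
        scores = scores.masked_fill(
            self.mask.bool()[:num_tokens, :num_tokens], float("-inf")
        )
        weights = torch.softmax(scores / self.head_dim ** 0.5, dim=-1)
        weights = self.dropout(weights)

        context = (weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)                       # (b, tokens, d_out)
```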
4. Layer Normalization
Normalizes inputs across features for each example in a batch:
- Computes mean and variance across embedding dimension
- Normalizes to mean=0, variance=1
- Applies learnable scale and shift parameters
- Stabilizes training of deep networks
Use the bundled script:
scripts/layernorm.py
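A minimal sketch, functionally equivalent to torch.nn.LayerNorm over the last dimension:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize each position across the embedding dimension, then scale and shift."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)  # mean=0, variance=1
        return self.scale * norm_x + self.shift
```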
5. Transformer Block
Combines all components with residual connections:
- First residual path: LayerNorm → Multi-Head Attention → Dropout → Add residual
- Second residual path: LayerNorm → FeedForward → Dropout → Add residual
Use the bundled script:
scripts/transformer_block.py
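Putting the pieces together, a sketch of a pre-LayerNorm transformer block; it assumes the MultiHeadAttention, FeedForward, and LayerNorm sketches above and the configuration keys listed under Standard Configuration:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm transformer block with two residual paths (sketch)."""
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"], d_out=cfg["emb_dim"],
            context_length=cfg["context_length"], dropout=cfg["drop_rate"],
            num_heads=cfg["n_heads"], qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        x = x + self.drop(self.att(self.norm1(x)))  # LayerNorm -> attention -> dropout -> residual
        x = x + self.drop(self.ff(self.norm2(x)))   # LayerNorm -> feedforward -> dropout -> residual
        return x
```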
6. GPTModel
The complete model that:
- Converts token indices to embeddings
- Adds positional embeddings
- Passes through multiple transformer blocks
- Applies final normalization
- Projects to vocabulary size for token prediction
Use the bundled script:
scripts/gpt_model.py
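A sketch of the full model, again building on the component sketches above (names are illustrative, not the bundled script's exact API):

```python
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    """End-to-end GPT-style model (illustrative sketch)."""
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop = nn.Dropout(cfg["drop_rate"])
        self.blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, idx):
        b, seq_len = idx.shape
        tok = self.tok_emb(idx)                                       # (b, seq_len, emb_dim)
        pos = self.pos_emb(torch.arange(seq_len, device=idx.device))  # (seq_len, emb_dim)
        x = self.drop(tok + pos)        # token + positional embeddings
        x = self.blocks(x)              # stack of transformer blocks
        x = self.final_norm(x)          # final normalization
        return self.out_head(x)         # (b, seq_len, vocab_size) logits
```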
Standard Configuration
The default 124M parameter configuration:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
Parameter Calculation
To calculate the number of parameters in your model:
Use the bundled script:
scripts/calculate_params.py
This script breaks down parameters by component:
- Token embeddings: vocab_size * emb_dim
- Position embeddings: context_length * emb_dim
- Multi-head attention per block: Q, K, V projections + output projection
- Feedforward per block: two linear layers
- Layer normalizations: scale and shift parameters
- Output projection: emb_dim * vocab_size
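As a sanity check, the arithmetic can be done by hand; the sketch below assumes biases on every linear layer except the Q/K/V projections (qkv_bias: False) and the output head, which is the usual GPT-2 layout:

```python
# Hand calculation for the 124M config (illustrative).
V, C, E, L = 50257, 1024, 768, 12       # vocab_size, context_length, emb_dim, n_layers

tok_emb  = V * E                                    # token embedding table
pos_emb  = C * E                                    # position embedding table
attn     = 3 * E * E + (E * E + E)                  # Q/K/V (no bias) + output projection (bias)
ff       = (E * 4 * E + 4 * E) + (4 * E * E + E)    # two linear layers with biases
norms    = 2 * (2 * E)                              # two LayerNorms per block (scale + shift)
out_head = E * V                                    # final projection to vocabulary (no bias)

total = tok_emb + pos_emb + L * (attn + ff + norms) + 2 * E + out_head  # 2*E = final LayerNorm
print(f"{total:,}")  # 163,009,536
```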
Text Generation
To generate text with a trained model:
Use the bundled script:
scripts/generate_text.py
The generation process:
- Encode the starting text to token indices
- Pass through the model to get logits
- Apply softmax to get probabilities
- Select the token with the highest probability (greedy decoding)
- Append to sequence and repeat
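A minimal greedy-decoding loop illustrating these steps (the function name and signature are illustrative, not those of scripts/generate_text.py):

```python
import torch

def generate_greedy(model, idx, max_new_tokens, context_size):
    """Greedy decoding loop; idx has shape (batch, n_tokens) of token indices."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]            # crop to the context window
        with torch.no_grad():
            logits = model(idx_cond)                 # (batch, n_tokens, vocab_size)
        logits = logits[:, -1, :]                    # keep logits of the last position
        probs = torch.softmax(logits, dim=-1)        # convert to probabilities
        next_id = torch.argmax(probs, dim=-1, keepdim=True)  # highest-probability token
        idx = torch.cat((idx, next_id), dim=1)       # append and repeat
    return idx
```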
Workflow
Building a Complete Model
- Define configuration - Set vocab_size, context_length, emb_dim, n_heads, n_layers
- Create model - Use scripts/create_gpt_model.py with your config
- Calculate parameters - Use scripts/calculate_params.py to verify
- Test generation - Use scripts/generate_text.py with sample input
Understanding Components
- Read component documentation - Each script has docstrings explaining its purpose
- Run with sample data - Scripts include example usage
- Inspect shapes - Comments show tensor shapes at each step
Examples
Example 1: Create a Small Model
python scripts/create_gpt_model.py --emb-dim 256 --n-layers 4 --n-heads 4
This creates a smaller model for testing/learning.
Example 2: Calculate Parameters
python scripts/calculate_params.py --config GPT_CONFIG_124M
Output shows breakdown by component and total (163,009,536 for 124M config).
Example 3: Generate Text
python scripts/generate_text.py --model checkpoint.pt --prompt "Hello, I am" --max-tokens 10
Key Concepts
Token Embeddings
- Convert token indices to dense vectors
- Shape:
(vocab_size, emb_dim) - Learnable parameters that represent each token
Positional Embeddings
- Add position information to token embeddings
- Shape:
(context_length, emb_dim) - Critical for understanding word order in sequences
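A quick sketch of how the two embedding tables combine (shapes follow the 124M config; the token ids are arbitrary examples):

```python
import torch
import torch.nn as nn

tok_emb = nn.Embedding(50257, 768)   # token table: (vocab_size, emb_dim)
pos_emb = nn.Embedding(1024, 768)    # position table: (context_length, emb_dim)

token_ids = torch.tensor([[3, 14, 159, 2653]])   # (batch=1, seq_len=4), arbitrary ids
positions = torch.arange(token_ids.shape[1])     # [0, 1, 2, 3]
x = tok_emb(token_ids) + pos_emb(positions)      # (1, 4, 768)
```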
Residual Connections
- Add input to output of each sub-layer
- Prevent vanishing gradients in deep networks
- Enable training of many transformer blocks
Causal Masking
- Masks attention to future tokens (applied at training and inference)
- Ensures autoregressive property (can't see future)
- Applied in multi-head attention
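A small sketch of how such a mask is built and applied to attention scores:

```python
import torch

seq_len = 4
# Upper-triangular mask: True marks future positions to hide
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = torch.randn(seq_len, seq_len)               # raw attention scores
masked = scores.masked_fill(mask, float("-inf"))     # future positions -> -inf
weights = torch.softmax(masked, dim=-1)              # each row attends only to itself and the past
```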
Best Practices
- Start small - Use smaller configs for testing before scaling up
- Check shapes - Verify tensor shapes match expected dimensions
- Use dropout - Essential for preventing overfitting
- LayerNorm before - Apply normalization before attention/feedforward
- Seed for reproducibility - Set random seed for consistent results