Hacktricks-skills llm-architecture

Build and understand LLM architecture from scratch. Use this skill whenever the user needs to create GPT models, implement transformer components (attention, feedforward, layer norm), calculate model parameters, or generate text with a trained model. Trigger for any request about LLM architecture, transformer blocks, GPT implementation, token embeddings, positional embeddings, or building neural networks for language modeling.

Install

Clone the upstream repo:

git clone https://github.com/abelrguezr/hacktricks-skills

Manifest: skills/AI/AI-llm-architecture/5.-llm-architecture/SKILL.MD

LLM Architecture Builder

A skill for building and understanding Large Language Model architecture from scratch, following the GPT-style transformer design.

When to Use This Skill

Use this skill when the user needs to:

  • Build a GPT model from scratch
  • Implement transformer components (attention, feedforward, layer normalization)
  • Calculate the number of parameters in an LLM
  • Generate text using a trained model
  • Understand how token and positional embeddings work
  • Create or modify LLM architecture configurations

Core Components

1. GELU Activation Function

GELU (Gaussian Error Linear Unit) introduces non-linearity into the model. Unlike ReLU, which zeroes out negative inputs, GELU maps inputs to outputs smoothly, producing small non-zero outputs for negative inputs.

Use the bundled script:

scripts/gelu.py
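For orientation, a minimal PyTorch sketch of the tanh approximation of GELU (the bundled script may differ in detail):

import torch
import torch.nn as nn

class GELU(nn.Module):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))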

2. FeedForward Network

A position-wise feedforward network that applies the same two-layer fully connected transformation independently at each position:

  • First linear layer: expands dimensionality from emb_dim to 4 * emb_dim
  • GELU activation: applies non-linearity
  • Second linear layer: reduces dimensionality back to emb_dim

Use the bundled script:

scripts/feedforward.py
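A minimal sketch of such a module, using the emb_dim key from the standard configuration below and PyTorch's built-in nn.GELU for brevity (the bundled script likely uses its own GELU class):

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # expand
            nn.GELU(approximate="tanh"),                    # non-linearity
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # project back
        )

    def forward(self, x):
        return self.layers(x)  # shape preserved: (batch, seq_len, emb_dim)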

3. Multi-Head Attention

Allows the model to focus on different positions within the input sequence:

  • Queries, Keys, Values: Linear projections of the input
  • Heads: Multiple attention mechanisms running in parallel
  • Causal Mask: Prevents attending to future tokens (autoregressive)
  • Dropout: Prevents overfitting

Use the bundled script:

scripts/multihead_attention.py
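A minimal sketch of causal multi-head attention along these lines; the parameter names are illustrative (d_in = d_out = emb_dim, num_heads = n_heads, dropout = drop_rate), and the bundled script may differ in detail:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future positions (causal mask)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(2, 3)  # (b, num_heads, num_tokens, num_tokens)
        scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], float("-inf"))
        weights = torch.softmax(scores / self.head_dim**0.5, dim=-1)
        weights = self.dropout(weights)

        # Recombine heads and apply the output projection
        context = (weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)  # (b, num_tokens, d_out)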

4. Layer Normalization

Normalizes inputs across features for each example in a batch:

  • Computes mean and variance across embedding dimension
  • Normalizes to mean=0, variance=1
  • Applies learnable scale and shift parameters
  • Stabilizes training of deep networks

Use the bundled script:

scripts/layernorm.py
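A minimal sketch of such a layer normalization module (a hand-rolled equivalent of nn.LayerNorm):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        # Normalize across the embedding dimension of each position
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift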

5. Transformer Block

Combines all components with residual connections:

  1. First residual path: LayerNorm → Multi-Head Attention → Dropout → Add residual
  2. Second residual path: LayerNorm → FeedForward → Dropout → Add residual

Use the bundled script:

scripts/transformer_block.py
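A minimal sketch of the block, assuming the MultiHeadAttention, FeedForward, and LayerNorm modules sketched in the earlier sections:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"], d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            dropout=cfg["drop_rate"], num_heads=cfg["n_heads"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # First residual path: LayerNorm -> attention -> dropout -> add
        x = x + self.drop(self.att(self.norm1(x)))
        # Second residual path: LayerNorm -> feedforward -> dropout -> add
        x = x + self.drop(self.ff(self.norm2(x)))
        return x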

6. GPTModel

The complete model that:

  • Converts token indices to embeddings
  • Adds positional embeddings
  • Passes through multiple transformer blocks
  • Applies final normalization
  • Projects to vocabulary size for token prediction

Use the bundled script:

scripts/gpt_model.py
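A minimal sketch of the full model, building on the TransformerBlock and LayerNorm sketches above:

import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                    # (b, seq, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)                              # (b, seq, vocab_size)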

Standard Configuration

The default 124M-parameter configuration:

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

Parameter Calculation

To calculate the number of parameters in your model:

Use the bundled script:

scripts/calculate_params.py

This script breaks down parameters by component (a worked example for the 124M configuration follows the list):

  • Token embeddings: vocab_size * emb_dim
  • Position embeddings: context_length * emb_dim
  • Multi-head attention per block: Q, K, V projections + output projection
  • Feedforward per block: two linear layers
  • Layer normalizations: scale and shift parameters
  • Output projection: emb_dim * vocab_size
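A back-of-the-envelope check for the 124M configuration in plain Python, assuming no Q/K/V biases (qkv_bias=False), biases on the attention output projection and feedforward layers, and an untied, bias-free output head:

GPT_CONFIG_124M = {"vocab_size": 50257, "context_length": 1024,
                   "emb_dim": 768, "n_heads": 12, "n_layers": 12}

d = GPT_CONFIG_124M["emb_dim"]
tok_emb = GPT_CONFIG_124M["vocab_size"] * d          # 38,597,376
pos_emb = GPT_CONFIG_124M["context_length"] * d      #    786,432

attn  = 3 * d * d + (d * d + d)                      # Q/K/V (no bias) + output projection
ff    = (d * 4 * d + 4 * d) + (4 * d * d + d)        # two biased linear layers
norms = 2 * (2 * d)                                  # two LayerNorms (scale + shift)
per_block = attn + ff + norms                        # 7,085,568

final_norm = 2 * d
out_head = d * GPT_CONFIG_124M["vocab_size"]         # untied, bias-free projection

total = tok_emb + pos_emb + GPT_CONFIG_124M["n_layers"] * per_block + final_norm + out_head
print(f"{total:,}")                                  # 163,009,536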

Text Generation

To generate text with a trained model:

Use the bundled script:

scripts/generate_text.py

The generation process (a greedy-decoding sketch follows the list):

  1. Encode the starting text to token indices
  2. Pass through the model to get logits
  3. Apply softmax to get probabilities
  4. Select the token with the highest probability (greedy decoding)
  5. Append to sequence and repeat
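A minimal sketch of that loop; the bundled script may add options such as temperature or top-k sampling:

import torch

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx: (batch, n_tokens) tensor of token indices
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]                     # crop to the context window
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]                             # logits for the last position only
        probs = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probs, dim=-1, keepdim=True)  # greedy: highest-probability token
        idx = torch.cat((idx, idx_next), dim=1)               # append and repeat
    return idx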

Workflow

Building a Complete Model

  1. Define configuration - Set vocab_size, context_length, emb_dim, n_heads, n_layers
  2. Create model - Use scripts/create_gpt_model.py with your config
  3. Calculate parameters - Use scripts/calculate_params.py to verify
  4. Test generation - Use scripts/generate_text.py with sample input

Understanding Components

  1. Read component documentation - Each script has docstrings explaining its purpose
  2. Run with sample data - Scripts include example usage
  3. Inspect shapes - Comments show tensor shapes at each step

Examples

Example 1: Create a Small Model

python scripts/create_gpt_model.py --emb-dim 256 --n-layers 4 --n-heads 4

This creates a smaller model for testing/learning.

Example 2: Calculate Parameters

python scripts/calculate_params.py --config GPT_CONFIG_124M

Output shows the breakdown by component and the total (163,009,536 parameters for the 124M config).

Example 3: Generate Text

python scripts/generate_text.py --model checkpoint.pt --prompt "Hello, I am" --max-tokens 10

Key Concepts

Token Embeddings

  • Convert token indices to dense vectors
  • Shape: (vocab_size, emb_dim)
  • Learnable parameters that represent each token

Positional Embeddings

  • Add position information to token embeddings
  • Shape: (context_length, emb_dim)
  • Critical for understanding word order in sequences
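A minimal sketch of how the two embedding tables combine, using the 124M config values and dummy token indices for illustration:

import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768
tok_emb = nn.Embedding(vocab_size, emb_dim)       # lookup table of shape (vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)   # lookup table of shape (context_length, emb_dim)

token_ids = torch.randint(0, vocab_size, (1, 4))  # dummy batch: 1 sequence of 4 token indices
x = tok_emb(token_ids)                            # (1, 4, emb_dim)
x = x + pos_emb(torch.arange(token_ids.shape[1])) # add position info (broadcast over the batch)
print(x.shape)                                    # torch.Size([1, 4, 768])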

Residual Connections

  • Add input to output of each sub-layer
  • Prevent vanishing gradients in deep networks
  • Enable training of many transformer blocks

Causal Masking

  • Masks out future tokens in the attention scores (at both training and inference time)
  • Ensures the autoregressive property: each position attends only to itself and earlier positions
  • Applied in multi-head attention

Best Practices

  1. Start small - Use smaller configs for testing before scaling up
  2. Check shapes - Verify tensor shapes match expected dimensions
  3. Use dropout - Essential for preventing overfitting
  4. LayerNorm before - Apply normalization before attention/feedforward
  5. Seed for reproducibility - Set random seed for consistent results
