skilllibrary · model-architecture
Design and implement transformer model architectures including attention variants (MHA, GQA, MQA), positional encodings (RoPE, ALiBi), normalization strategies (Pre-LN, RMSNorm), and FFN activations (SwiGLU, GELU). Use when defining or modifying LlamaConfig, GPT-style, or Mistral-style model configs in PyTorch or HuggingFace transformers. Do not use for training loops, data pipelines, or inference serving.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/model-architecture" ~/.claude/skills/merceralex397-collab-skilllibrary-model-architecture && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/model-architecture/SKILL.md
Purpose
Guide the design and implementation of transformer-based language model architectures, covering component selection (attention, FFN, normalization, positional encoding) and scaling decisions using PyTorch and HuggingFace transformers.
When to use this skill
Use this skill when:
- Defining a new model config: `LlamaConfig(hidden_size=4096, num_attention_heads=32, num_key_value_heads=8, intermediate_size=11008)`
- Choosing between attention variants: MHA (all heads unique), GQA (grouped key-value heads), MQA (single KV head shared) (see the sketch after this list)
- Selecting positional encoding: RoPE (`rotary_emb`), ALiBi (linear bias), or learned absolute embeddings
- Deciding normalization placement (Pre-LN vs Post-LN) or type (RMSNorm vs LayerNorm)
- Configuring FFN activation: SwiGLU (`LlamaMLP`), GELU (`GPT2MLP`), or ReLU
- Scaling a model (width vs depth tradeoffs, attention head sizing `d_head = hidden_size / num_heads`)
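For illustration, a minimal sketch (not from the skill itself) of how the three attention variants are expressed through a HuggingFace `LlamaConfig`: only `num_key_value_heads` changes, and the specific values are assumptions for a 7B-class model.

```python
# Minimal sketch: MHA / GQA / MQA differ only in num_key_value_heads.
# Values are illustrative, not prescribed by this skill.
from transformers import LlamaConfig

mha = LlamaConfig(hidden_size=4096, num_attention_heads=32,
                  num_key_value_heads=32)  # MHA: every query head has its own K/V
gqa = LlamaConfig(hidden_size=4096, num_attention_heads=32,
                  num_key_value_heads=8)   # GQA: 4 query heads share each K/V head
mqa = LlamaConfig(hidden_size=4096, num_attention_heads=32,
                  num_key_value_heads=1)   # MQA: a single K/V head shared by all
```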
Do not use this skill when
- The task is about training loops, optimizers, or learning rate schedules — use
pretraining-pipeline - The task is about MoE routing or expert design — use
moe-architecture - The task is about tokenizer vocabulary or encoding — use
tokenizer-design - The task is pure software engineering with no model structure concerns
Operating procedure
- Identify target scale: Determine the parameter count target. Compute `params ≈ 12 * L * d²` for a standard transformer with L layers and hidden dim d (see the estimator sketch after this procedure).
- Select attention variant: Use MHA for small models. Use GQA (`num_key_value_heads < num_attention_heads`) for 7B+ to reduce KV-cache memory. Use MQA only when inference latency is critical.
- Choose positional encoding: Default to RoPE for most modern architectures (LLaMA, Mistral). Use ALiBi for length extrapolation without fine-tuning. Avoid learned absolute embeddings for new designs.
- Set normalization: Use Pre-LN (norm before attention/FFN) with RMSNorm for training stability. Post-LN requires careful LR tuning. Example: `LlamaRMSNorm(hidden_size, eps=1e-6)`.
- Configure FFN block: Use SwiGLU for LLaMA-style models: `silu(gate_proj(x)) * up_proj(x)` with `intermediate_size ≈ 8/3 * hidden_size` rounded to a multiple of 256. Use GELU for GPT-style.
- Validate dimensions: Ensure `hidden_size % num_attention_heads == 0`. Typical `d_head` values: 64 (GPT-2), 128 (LLaMA). GQA requires `num_attention_heads % num_key_value_heads == 0`.
- Document config: Output a complete HuggingFace config class with all architectural parameters specified explicitly.
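A sketch of what such a parameter estimate might look like, assuming a LLaMA-style layout (SwiGLU FFN, RMSNorm, untied LM head). The helper name and exact breakdown are illustrative, not part of the skill:

```python
# Hypothetical estimator for a LLaMA-style transformer's parameter count.
def estimate_params(vocab: int, d: int, n_layers: int, n_heads: int,
                    n_kv_heads: int, d_ff: int) -> dict:
    assert d % n_heads == 0, "hidden_size must divide evenly into heads"
    assert n_heads % n_kv_heads == 0, "GQA needs n_heads % n_kv_heads == 0"
    head_dim = d // n_heads
    # Attention: Q and O projections are d x d; K and V shrink under GQA.
    attn = n_layers * (2 * d * d + 2 * d * n_kv_heads * head_dim)
    # SwiGLU FFN has three projections: gate and up (d -> d_ff), down (d_ff -> d).
    ffn = n_layers * 3 * d * d_ff
    # Two RMSNorm weight vectors per layer plus the final norm.
    norm = n_layers * 2 * d + d
    embed = vocab * d      # token embedding
    lm_head = vocab * d    # output projection (untied)
    return {"embedding": embed, "attention": attn, "ffn": ffn,
            "norm": norm, "lm_head": lm_head,
            "total": attn + ffn + norm + embed + lm_head}

# Sanity check: with MHA, attention is 4d² per layer; with SwiGLU's
# d_ff ≈ 8/3 * d, the FFN is 3 * (8/3) * d² = 8d², giving the 12 * L * d² rule.
```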
Decision rules
- GQA with `num_kv_heads = num_heads / 4` is the default for models ≥ 7B parameters (Mistral, LLaMA-2 70B pattern)
- RoPE `base=10000` is standard; increase the base (e.g., 500000) or apply NTK-aware scaling for longer context
- SwiGLU + RMSNorm + Pre-LN + RoPE is the modern default stack (LLaMA-2/3, Mistral, Qwen2); a sketch of the SwiGLU block follows this list
- Width scaling (larger `hidden_size`) is more compute-efficient than depth scaling (more layers) at a fixed FLOP budget
- An attention head dim of 128 is preferred over 64 for models ≥ 7B (better per-head capacity)
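As a reference point for that default stack, a minimal sketch of a LLaMA-style SwiGLU block. It mirrors the shape of `transformers`' `LlamaMLP` but is simplified, and the sizing helper is a hypothetical illustration of the 8/3 rule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swiglu_intermediate_size(hidden_size: int, multiple_of: int = 256) -> int:
    # 8/3 * hidden_size, rounded up to a multiple of 256 (LLaMA convention).
    raw = int(8 * hidden_size / 3)
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

class SwiGLUMLP(nn.Module):
    """Simplified LLaMA-style FFN block (hypothetical, for illustration)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) gated elementwise by up(x), then projected down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# swiglu_intermediate_size(4096) -> 11008, the LLaMA-7B value quoted earlier.
```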
Output requirements
- Architecture Config — Complete HuggingFace-compatible config with all dimensions, head counts, activation, and norm type
- Component Justification — Why each variant was chosen (attention type, positional encoding, activation)
- Parameter Count Estimate — Breakdown: embedding, attention, FFN, norm, LM head (a worked check follows this list)
- Scaling Notes — How the architecture changes at 2x or 10x parameter count
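As a quick sanity check for the estimate deliverable, a back-of-envelope application of the `12 * L * d²` rule to the example config above (values assumed; embeddings and LM head excluded):

```python
# Illustrative check of the 12 * L * d² rule for d=4096, L=32 (7B-class model).
L, d = 32, 4096
print(f"{12 * L * d**2 / 1e9:.1f}B")  # ≈ 6.4B non-embedding parameters
```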
References
- Vaswani et al., "Attention Is All You Need" (arxiv 1706.03762)
- Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (arxiv 2302.13971)
- Jiang et al., "Mistral 7B" (arxiv 2310.06825)
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arxiv 2104.09864)
- Shazeer, "GLU Variants Improve Transformer" (arxiv 2002.05202)
- HuggingFace `transformers.LlamaConfig`, `transformers.MistralConfig` source code
Related skills
- `moe-architecture` — for Mixture-of-Experts extensions of these base architectures
- `pretraining-pipeline` — for training the defined architecture
- `tokenizer-design` — for vocabulary and embedding layer decisions
- `quantization-research` — for post-training compression of the architecture
Failure handling
- If `hidden_size % num_attention_heads != 0`, reject the config and suggest valid dimension pairs.
- If GQA is requested but `num_attention_heads % num_key_value_heads != 0`, compute the nearest valid divisor (see the sketch after this list).
- If the parameter count exceeds the target by >20%, reduce `num_hidden_layers` first, then `hidden_size`.
- If unsure between architecture styles, default to the LLaMA-3 pattern (GQA + RoPE + SwiGLU + RMSNorm).
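A hypothetical validator sketch applying these rules; the function names and the nearest-divisor strategy are illustrative assumptions, not part of the skill:

```python
# Hypothetical helpers implementing the failure-handling rules above.
def nearest_valid_kv_heads(n_heads: int, requested_kv: int) -> int:
    # GQA requires num_kv_heads to divide num_attention_heads evenly;
    # snap the request to the closest divisor.
    divisors = [k for k in range(1, n_heads + 1) if n_heads % k == 0]
    return min(divisors, key=lambda k: abs(k - requested_kv))

def validate(hidden_size: int, n_heads: int, n_kv_heads: int) -> int:
    if hidden_size % n_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} not divisible by "
            f"num_attention_heads={n_heads}; "
            f"try hidden_size={n_heads * (hidden_size // n_heads)}")
    if n_heads % n_kv_heads != 0:
        n_kv_heads = nearest_valid_kv_heads(n_heads, n_kv_heads)
    return n_kv_heads

# validate(4096, 32, 7) -> 8  (7 does not divide 32; nearest valid divisor is 8)
```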