skilllibrary · model-architecture

Design and implement transformer model architectures including attention variants (MHA, GQA, MQA), positional encodings (RoPE, ALiBi), normalization strategies (Pre-LN, RMSNorm), and FFN activations (SwiGLU, GELU). Use when defining or modifying LlamaConfig, GPT-style, or Mistral-style model configs in PyTorch or HuggingFace transformers. Do not use for training loops, data pipelines, or inference serving.

Install

Source · Clone the upstream repo:
  git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/:
  T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/model-architecture" ~/.claude/skills/merceralex397-collab-skilllibrary-model-architecture && rm -rf "$T"
Manifest: 12-ai-llm-training-architecture-and-research/model-architecture/SKILL.md

Source content

Purpose

Guide the design and implementation of transformer-based language model architectures, covering component selection (attention, FFN, normalization, positional encoding) and scaling decisions using PyTorch and HuggingFace transformers.

When to use this skill

Use this skill when:

  • Defining a new model config:
    LlamaConfig(hidden_size=4096, num_attention_heads=32, num_key_value_heads=8, intermediate_size=11008)
  • Choosing between attention variants: MHA (all heads unique), GQA (grouped key-value heads), MQA (single KV head shared)
  • Selecting positional encoding: RoPE (rotary_emb), ALiBi (linear bias), or learned absolute embeddings
  • Deciding normalization placement (Pre-LN vs Post-LN) or type (RMSNorm vs LayerNorm)
  • Configuring FFN activation: SwiGLU (LlamaMLP), GELU (GPT2MLP), or ReLU
  • Scaling a model (width vs depth tradeoffs, attention head sizing: d_head = hidden_size / num_heads); a config sketch combining these choices follows this list
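
For concreteness, the choices above usually come together in a single config object. A minimal sketch, assuming a recent HuggingFace transformers release; the values are illustrative defaults, not prescriptions:

  from transformers import LlamaConfig

  # LLaMA-style config: GQA (8 KV heads shared across 32 query heads), RoPE,
  # SwiGLU FFN (hidden_act="silu"), and RMSNorm.
  config = LlamaConfig(
      vocab_size=32000,
      hidden_size=4096,               # d_model
      num_hidden_layers=32,
      num_attention_heads=32,         # d_head = 4096 / 32 = 128
      num_key_value_heads=8,          # GQA: 32 % 8 == 0, 4 query heads per KV head
      intermediate_size=11008,        # ~8/3 * 4096, rounded to a multiple of 256
      hidden_act="silu",              # SwiGLU gate activation in LlamaMLP
      max_position_embeddings=4096,
      rms_norm_eps=1e-6,
      rope_theta=10000.0,             # RoPE base
  )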

Do not use this skill when

  • The task is about training loops, optimizers, or learning rate schedules: use pretraining-pipeline
  • The task is about MoE routing or expert design: use moe-architecture
  • The task is about tokenizer vocabulary or encoding: use tokenizer-design
  • The task is pure software engineering with no model structure concerns

Operating procedure

  1. Identify target scale: Determine the parameter count target. Compute params ≈ 12 * L * d² for a standard transformer with L layers and hidden dim d.
  2. Select attention variant: Use MHA for small models. Use GQA (num_key_value_heads < num_attention_heads) for 7B+ to reduce KV-cache memory. Use MQA only when inference latency is critical.
  3. Choose positional encoding: Default to RoPE for most modern architectures (LLaMA, Mistral). Use ALiBi for length extrapolation without fine-tuning. Avoid learned absolute embeddings for new designs.
  4. Set normalization: Use Pre-LN (norm before attention/FFN) with RMSNorm for training stability. Post-LN requires careful LR tuning. Example: LlamaRMSNorm(hidden_size, eps=1e-6).
  5. Configure FFN block: Use SwiGLU for LLaMA-style models: down_proj(silu(gate_proj(x)) * up_proj(x)), with intermediate_size ≈ 8/3 * hidden_size rounded up to a multiple of 256. Use GELU for GPT-style.
  6. Validate dimensions: Ensure hidden_size % num_attention_heads == 0. Typical d_head values: 64 (GPT-2), 128 (LLaMA). GQA requires num_attention_heads % num_key_value_heads == 0.
  7. Document config: Output a complete HuggingFace config class with all architectural parameters specified explicitly. Sketches of a sizing helper and a minimal SwiGLU module follow this list.
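
The arithmetic in steps 1, 5, and 6 can be captured in a few lines. This is a sketch of that arithmetic only; the function names are illustrative and not taken from any library:

  def swiglu_intermediate_size(hidden_size: int, multiple_of: int = 256) -> int:
      """Step 5: intermediate_size ≈ 8/3 * hidden_size, rounded up to a multiple of 256."""
      raw = (8 * hidden_size) // 3
      return ((raw + multiple_of - 1) // multiple_of) * multiple_of

  def validate_heads(hidden_size: int, num_heads: int, num_kv_heads: int) -> int:
      """Step 6: dimension checks; returns d_head if the config is valid."""
      assert hidden_size % num_heads == 0, "hidden_size must be divisible by num_attention_heads"
      assert num_heads % num_kv_heads == 0, "GQA needs num_attention_heads % num_key_value_heads == 0"
      return hidden_size // num_heads

  def estimate_params(num_layers: int, hidden_size: int) -> int:
      """Step 1: rough non-embedding count, params ≈ 12 * L * d^2."""
      return 12 * num_layers * hidden_size ** 2

  # Example at 7B-class shapes:
  d_head = validate_heads(4096, 32, 8)       # 128
  ffn = swiglu_intermediate_size(4096)       # 11008 (the LLaMA-7B value)
  approx = estimate_params(32, 4096)         # ~6.4e9 non-embedding parameters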
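
A minimal PyTorch sketch of the step-5 FFN block, modeled on the LlamaMLP pattern but simplified rather than copied from the HuggingFace implementation:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SwiGLUMLP(nn.Module):
      """FFN block computing down_proj(silu(gate_proj(x)) * up_proj(x))."""

      def __init__(self, hidden_size: int, intermediate_size: int):
          super().__init__()
          self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
          self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
          self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))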

Decision rules

  • GQA with num_kv_heads = num_heads / 4 is a sensible default for models ≥ 7B parameters (the Mistral 7B pattern: 32 query heads, 8 KV heads; LLaMA-2 70B goes further, with 64 query heads and 8 KV heads)
  • RoPE base=10000 is standard; increase the base (e.g., 500000) or apply NTK-aware scaling for longer context (see the sketch after this list)
  • SwiGLU + RMSNorm + Pre-LN + RoPE is the modern default stack (LLaMA-2/3, Mistral, Qwen2)
  • Width scaling (larger hidden_size) is more compute-efficient than depth scaling (more layers) at a fixed FLOP budget
  • Attention head dim of 128 is preferred over 64 for models ≥ 7B (better per-head capacity)
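
As an illustration of the RoPE rule, the base is exposed as rope_theta on transformers.LlamaConfig. A sketch comparing a standard and a long-context setting (the context lengths are examples; NTK-aware and other rope_scaling options differ across transformers versions and are omitted here):

  from transformers import LlamaConfig

  # Standard base for a 4k context
  short_ctx = LlamaConfig(hidden_size=4096, num_attention_heads=32, num_key_value_heads=8,
                          max_position_embeddings=4096, rope_theta=10000.0)

  # Longer context: raise the base (LLaMA-3 style) rather than relearning positions
  long_ctx = LlamaConfig(hidden_size=4096, num_attention_heads=32, num_key_value_heads=8,
                         max_position_embeddings=131072, rope_theta=500000.0)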

Output requirements

  1. Architecture Config: a complete HuggingFace-compatible config with all dimensions, head counts, activation, and norm type
  2. Component Justification: why each variant was chosen (attention type, positional encoding, activation)
  3. Parameter Count Estimate: a breakdown across embedding, attention, FFN, norm, and LM head (a sketch follows this list)
  4. Scaling Notes: how the architecture changes at 2x or 10x parameter count
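
A sketch of the breakdown item 3 asks for, assuming LLaMA-7B-like shapes with full-width q/k/v/o projections (MHA; shrink k/v for GQA), an untied LM head, and no biases. The helper is illustrative, not part of any library:

  def param_breakdown(vocab_size, hidden_size, num_layers, intermediate_size, tie_embeddings=False):
      """Rough per-component parameter counts for a LLaMA-style decoder."""
      embedding = vocab_size * hidden_size
      attention = num_layers * 4 * hidden_size * hidden_size          # q, k, v, o projections
      ffn = num_layers * 3 * hidden_size * intermediate_size          # gate, up, down projections
      norm = num_layers * 2 * hidden_size + hidden_size               # two per layer plus the final norm
      lm_head = 0 if tie_embeddings else vocab_size * hidden_size
      total = embedding + attention + ffn + norm + lm_head
      return {"embedding": embedding, "attention": attention, "ffn": ffn,
              "norm": norm, "lm_head": lm_head, "total": total}

  print(param_breakdown(32000, 4096, 32, 11008))   # total ≈ 6.7e9, the LLaMA-7B ballpark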

References

  • Vaswani et al., "Attention Is All You Need" (arXiv:1706.03762)
  • Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (arXiv:2302.13971)
  • Jiang et al., "Mistral 7B" (arXiv:2310.06825)
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv:2104.09864)
  • Shazeer, "GLU Variants Improve Transformer" (arXiv:2002.05202)
  • HuggingFace transformers.LlamaConfig and transformers.MistralConfig source code

Related skills

  • moe-architecture: for Mixture-of-Experts extensions of these base architectures
  • pretraining-pipeline: for training the defined architecture
  • tokenizer-design: for vocabulary and embedding layer decisions
  • quantization-research: for post-training compression of the architecture

Failure handling

  • If hidden_size % num_attention_heads != 0, reject the config and suggest valid dimension pairs.
  • If GQA is requested but num_attention_heads % num_key_value_heads != 0, compute the nearest valid divisor (a helper sketch follows this section).
  • If the parameter count exceeds the target by >20%, reduce num_hidden_layers first, then hidden_size.
  • If unsure between architecture styles, default to the LLaMA-3 pattern (GQA + RoPE + SwiGLU + RMSNorm).
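
For the second failure case, the nearest valid divisor can be computed directly; an illustrative helper, not part of any library:

  def nearest_valid_kv_heads(num_attention_heads: int, requested_kv_heads: int) -> int:
      """Return the divisor of num_attention_heads closest to the requested KV head count."""
      divisors = [d for d in range(1, num_attention_heads + 1) if num_attention_heads % d == 0]
      return min(divisors, key=lambda d: abs(d - requested_kv_heads))

  nearest_valid_kv_heads(32, 7)   # -> 8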