Skilllibrary moe-architecture
Design Mixture-of-Experts transformer architectures including router/gating design (top-k softmax), expert FFN configuration, load balancing loss, capacity factors, and expert parallelism. Use when implementing MixtralConfig, Switch Transformer, or custom MoE layers. Do not use for dense transformer design or inference optimization.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/moe-architecture" ~/.claude/skills/merceralex397-collab-skilllibrary-moe-architecture && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/moe-architecture/SKILL.md
Source content
Purpose
Design and implement Mixture-of-Experts (MoE) transformer architectures, covering router/gating mechanisms, expert FFN configuration, load balancing strategies, and distributed expert parallelism using HuggingFace transformers and PyTorch.
When to use this skill
Use this skill when:
- Defining MoE config: `MixtralConfig(num_local_experts=8, num_experts_per_tok=2, router_aux_loss_coef=0.01)` (see the config sketch after this list)
- Designing router/gating: top-k selection with softmax, noisy top-k, or expert-choice routing
- Tuning load balancing: `router_aux_loss_coef` (auxiliary loss), capacity factor, token dropping
- Choosing between sparse MoE patterns: Mixtral-style (top-2 of 8), Switch (top-1), or GShard
- Planning expert parallelism for distributed training across GPUs/nodes
- Analyzing total params vs active params tradeoffs (e.g., 47B total, 13B active per token)
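As a minimal sketch of the first bullet, assuming the HuggingFace `transformers` `MixtralConfig` API; the dimensions shown roughly follow Mixtral-8x7B and are spelled out only for clarity:

```python
# Sketch: Mixtral-style MoE config via HuggingFace transformers.
# Assumes a recent transformers release that ships MixtralConfig.
from transformers import MixtralConfig

config = MixtralConfig(
    hidden_size=4096,
    intermediate_size=14336,     # per-expert FFN width
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,       # grouped-query attention
    num_local_experts=8,         # experts per MoE layer
    num_experts_per_tok=2,       # top-2 routing
    router_aux_loss_coef=0.01,   # load-balancing auxiliary loss weight
    output_router_logits=True,   # expose router logits so the aux loss can be applied
)
```

Passing this config to `MixtralForCausalLM` gives a randomly initialized sparse model; with `output_router_logits=True` the auxiliary load-balancing loss should be added to the language modeling loss during training.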
Do not use this skill when
- Designing a dense (non-MoE) transformer — use model-architecture
- Building training loops or data pipelines — use pretraining-pipeline
- Optimizing MoE inference serving (kernel fusion, expert offloading) — use serving-architecture
- The task has no model architecture concerns
Operating procedure
- Define expert structure: Each expert is a standard FFN (MLP) block. In Mixtral: `expert = MistralMLP(hidden_size=4096, intermediate_size=14336, act_fn=silu)`. All experts share the same architecture but have independent weights.
- Design the router: The gating network is a linear layer `nn.Linear(hidden_size, num_experts, bias=False)` producing logits per token. Apply softmax, select top-k experts. Standard: `top_k=2` for Mixtral, `top_k=1` for Switch Transformer.
- Implement routing logic: For each token, compute `router_logits = gate(hidden_states)`. Select top-k expert indices and weights. Combine expert outputs: `output = sum(weight_i * expert_i(input))` for selected experts (see the layer sketch after this list).
- Add load balancing loss: Without auxiliary loss, routers collapse to using 1-2 experts. Add `aux_loss = num_experts * sum(fraction_tokens_i * mean_routing_prob_i)` scaled by `router_aux_loss_coef` (typically 0.01-0.02). This is added to the main language modeling loss.
- Set capacity factor: `capacity = (tokens_per_batch / num_experts) * capacity_factor`. Tokens exceeding expert capacity are dropped or routed to a shared fallback. Typical capacity_factor: 1.0-1.5 for training, 2.0+ for inference.
- Plan expert parallelism: Distribute experts across GPUs. With 8 experts and 8 GPUs, place 1 expert per GPU. All-to-all communication moves tokens to their assigned expert's GPU and back. This is orthogonal to tensor/pipeline parallelism.
- Validate training dynamics: Monitor per-expert token allocation (should be roughly uniform). Track router entropy (higher = more balanced). Watch for expert collapse (one expert receiving >50% of tokens).
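To make the routing and auxiliary-loss steps concrete, here is a minimal single-device sketch in PyTorch. The names (`SparseMoEBlock`, `ExpertMLP`) are illustrative, not the HuggingFace implementation; capacity-factor enforcement and the expert-parallel all-to-all dispatch are deliberately omitted.

```python
# Sketch: sparse MoE block with top-k softmax routing and the auxiliary
# load-balancing loss described above. Assumes SwiGLU-style expert MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """One expert: a gated FFN, in the spirit of MistralMLP."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=4096, intermediate_size=14336,
                 num_experts=8, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            ExpertMLP(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, hidden_states):
        batch, seq, dim = hidden_states.shape
        tokens = hidden_states.view(-1, dim)                           # (T, dim)

        router_logits = self.gate(tokens)                              # (T, E)
        routing_probs = F.softmax(router_logits, dim=-1)
        topk_probs, topk_idx = routing_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)     # renormalize over selected experts

        output = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_pos.numel() == 0:
                continue
            weight = topk_probs[token_pos, slot].unsqueeze(-1)
            output[token_pos] += weight * expert(tokens[token_pos])    # weighted combine

        # Auxiliary load-balancing loss: num_experts * sum_e f_e * p_e, where
        # f_e is the fraction of tokens routed to expert e and p_e the mean
        # routing probability assigned to expert e.
        one_hot = F.one_hot(topk_idx, self.num_experts).float()        # (T, k, E)
        fraction_tokens = one_hot.mean(dim=(0, 1)) * self.top_k        # fraction of tokens per expert
        mean_probs = routing_probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(fraction_tokens * mean_probs)

        return output.view(batch, seq, dim), aux_loss
```

The per-expert Python loop is fine for illustration; a production kernel would batch tokens per expert, enforce the capacity limit, and, under expert parallelism, replace the loop with an all-to-all dispatch.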
Decision rules
- Top-2 routing (Mixtral-style) gives better quality than top-1 at ~2x compute per token — use when quality matters more than speed
- Top-1 routing (Switch) maximizes throughput — use for very large models where compute budget is tight
- Set `num_experts` as a power of 2 (8, 16, 64) for clean distribution across GPUs
- More experts with top-2 routing (e.g., 64 experts, top-2) gives more total parameters with similar active compute (see the parameter-count sketch after this list)
- If any expert consistently receives <1% of tokens, increase `router_aux_loss_coef` or add jitter noise to router logits
- MoE models need ~4x the total parameters of a dense model to match quality at the same active-parameter compute budget
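As a rough back-of-envelope illustration of the total-vs-active tradeoff, the following sketch estimates parameter counts for a Mixtral-8x7B-like shape (gated FFN experts, grouped-query attention). The numbers are approximate and ignore norms and the router's tiny gate matrices.

```python
# Back-of-envelope total vs. active parameter estimate for a Mixtral-8x7B-like model.
hidden, intermediate, layers = 4096, 14336, 32
vocab, n_heads, n_kv_heads = 32000, 32, 8
num_experts, top_k = 8, 2

head_dim = hidden // n_heads
attn = layers * (2 * hidden * hidden                      # q_proj, o_proj
                 + 2 * hidden * n_kv_heads * head_dim)    # k_proj, v_proj (GQA)
expert_ffn = 3 * hidden * intermediate                    # gate/up/down projections per expert
embeddings = 2 * vocab * hidden                           # input embeddings + LM head

total = embeddings + attn + layers * num_experts * expert_ffn
active = embeddings + attn + layers * top_k * expert_ffn  # only top-k experts run per token

print(f"total  ~ {total / 1e9:.1f}B params")   # ~46.7B
print(f"active ~ {active / 1e9:.1f}B params")  # ~12.9B
```

This roughly reproduces the "47B total, 13B active per token" figure quoted earlier: per-token compute stays near that of a ~13B dense model while total capacity is about 3.6x larger.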
Output requirements
- MoE Config — Complete `MixtralConfig` or equivalent with expert count, top-k, aux loss coefficient, capacity factor
- Router Design — Gating mechanism, top-k strategy, noise injection, and load balancing approach
- Compute Analysis — Total params, active params per token, FLOPs comparison vs equivalent dense model
- Distribution Plan — Expert parallelism layout across GPUs, communication pattern (all-to-all)
References
- Jiang et al., "Mixtral of Experts" (arXiv:2401.04088)
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models" (arXiv:2101.03961)
- Lepikhin et al., "GShard: Scaling Giant Models with Conditional Computation" (arXiv:2006.16668)
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (arXiv:1701.06538)
- HuggingFace `transformers.MixtralConfig`, `transformers.MixtralForCausalLM` source code
Related skills
- model-architecture — for the dense transformer base that MoE extends
- pretraining-pipeline — for training MoE models with expert parallelism
- distillation-compression — for distilling MoE into dense student models
- training-infrastructure — for multi-node expert parallelism setup
Failure handling
- If expert utilization is severely imbalanced (>3x variance across experts), increase `router_aux_loss_coef` by 2-5x and add router z-loss.
- If all-to-all communication dominates training time (>30% of step), reduce the number of experts or switch to expert-choice routing.
- If MoE model quality is worse than the dense baseline at the same active params, verify the router is not collapsing — check expert assignment entropy (see the monitoring sketch below).
- If OOM during training, reduce `capacity_factor` to drop overflow tokens rather than buffering them.
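For the imbalance and collapse checks above, a small sketch of per-step router health metrics; `router_logits` is assumed to be one layer's logits of shape (tokens, num_experts), and the function name and thresholds are illustrative.

```python
# Sketch: router health metrics from one layer's router logits (T x E).
import torch
import torch.nn.functional as F

def router_health(router_logits: torch.Tensor, top_k: int = 2):
    probs = F.softmax(router_logits, dim=-1)                  # (T, E)
    num_experts = probs.shape[-1]

    # Fraction of routed assignments each expert receives.
    topk_idx = probs.topk(top_k, dim=-1).indices
    counts = F.one_hot(topk_idx, num_experts).sum(dim=(0, 1)).float()
    load = counts / counts.sum()                              # sums to 1 across experts

    # Mean routing entropy: high = balanced, near 0 = collapsing router.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

    return {
        "max_expert_load": load.max().item(),                 # collapse if >> 1 / num_experts
        "min_expert_load": load.min().item(),                 # starving expert if < 0.01
        "router_entropy": entropy.item(),
    }
```

In the HuggingFace Mixtral implementation, setting `output_router_logits=True` should return per-layer router logits in the model output, which can feed a check like this at each logging step.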