Skilllibrary moe-architecture
Design Mixture-of-Experts transformer architectures including router/gating design (top-k softmax), expert FFN configuration, load balancing loss, capacity factors, and expert parallelism. Use when implementing MixtralConfig, Switch Transformer, or custom MoE layers. Do not use for dense transformer design or inference optimization.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/moe-architecture" ~/.claude/skills/merceralex397-collab-skilllibrary-moe-architecture && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/moe-architecture/SKILL.md
Source content
Purpose
Design and implement Mixture-of-Experts (MoE) transformer architectures, covering router/gating mechanisms, expert FFN configuration, load balancing strategies, and distributed expert parallelism using HuggingFace transformers and PyTorch.
When to use this skill
Use this skill when:
- Defining MoE config: `MixtralConfig(num_local_experts=8, num_experts_per_tok=2, router_aux_loss_coef=0.01)` (see the config sketch after this list)
- Designing router/gating: top-k selection with softmax, noisy top-k, or expert-choice routing
- Tuning load balancing: `router_aux_loss_coef` (auxiliary loss), capacity factor, token dropping
- Choosing between sparse MoE patterns: Mixtral-style (top-2 of 8), Switch (top-1), or GShard
- Planning expert parallelism for distributed training across GPUs/nodes
- Analyzing total params vs active params tradeoffs (e.g., 47B total, 13B active per token)
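As a minimal sketch of the first bullet, assuming the HuggingFace `transformers` `MixtralConfig` API; the dimensions shown roughly follow Mixtral-8x7B and are spelled out only for clarity:

```python
# Sketch: Mixtral-style MoE config via HuggingFace transformers.
# Assumes a recent transformers release that ships MixtralConfig.
from transformers import MixtralConfig

config = MixtralConfig(
    hidden_size=4096,
    intermediate_size=14336,     # per-expert FFN width
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,       # grouped-query attention
    num_local_experts=8,         # experts per MoE layer
    num_experts_per_tok=2,       # top-2 routing
    router_aux_loss_coef=0.01,   # load-balancing auxiliary loss weight
    output_router_logits=True,   # expose router logits so the aux loss can be applied
)
```

Passing this config to `MixtralForCausalLM` gives a randomly initialized sparse model; with `output_router_logits=True` the auxiliary load-balancing loss should be added to the language modeling loss during training.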
Do not use this skill when
- Designing a dense (non-MoE) transformer — use model-architecture
- Building training loops or data pipelines — use pretraining-pipeline
- Optimizing MoE inference serving (kernel fusion, expert offloading) — use serving-architecture
- The task has no model architecture concerns
Operating procedure
- Define expert structure: Each expert is a standard FFN (MLP) block. In Mixtral: `expert = MistralMLP(hidden_size=4096, intermediate_size=14336, act_fn=silu)`. All experts share the same architecture but have independent weights.
- Design the router: The gating network is a linear layer `nn.Linear(hidden_size, num_experts, bias=False)` producing logits per token. Apply softmax, select top-k experts. Standard: `top_k=2` for Mixtral, `top_k=1` for Switch Transformer.
- Implement routing logic: For each token, compute `router_logits = gate(hidden_states)`. Select top-k expert indices and weights. Combine expert outputs: `output = sum(weight_i * expert_i(input))` for selected experts (see the layer sketch after this list).
- Add load balancing loss: Without auxiliary loss, routers collapse to using 1-2 experts. Add `aux_loss = num_experts * sum(fraction_tokens_i * mean_routing_prob_i)` scaled by `router_aux_loss_coef` (typically 0.01-0.02). This is added to the main language modeling loss.
- Set capacity factor: `capacity = (tokens_per_batch / num_experts) * capacity_factor`. Tokens exceeding expert capacity are dropped or routed to a shared fallback. Typical capacity_factor: 1.0-1.5 for training, 2.0+ for inference.
- Plan expert parallelism: Distribute experts across GPUs. With 8 experts and 8 GPUs, place 1 expert per GPU. All-to-all communication moves tokens to their assigned expert's GPU and back. This is orthogonal to tensor/pipeline parallelism.
- Validate training dynamics: Monitor per-expert token allocation (should be roughly uniform). Track router entropy (higher = more balanced). Watch for expert collapse (one expert receiving >50% of tokens).
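To make the routing and auxiliary-loss steps concrete, here is a minimal single-device sketch in PyTorch. The names (`SparseMoEBlock`, `ExpertMLP`) are illustrative, not the HuggingFace implementation; capacity-factor enforcement and the expert-parallel all-to-all dispatch are deliberately omitted.

```python
# Sketch: sparse MoE block with top-k softmax routing and the auxiliary
# load-balancing loss described above. Assumes SwiGLU-style expert MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """One expert: a gated FFN, in the spirit of MistralMLP."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=4096, intermediate_size=14336,
                 num_experts=8, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            ExpertMLP(hidden_size, intermediate_size) for _ in range(num_experts)
        )

    def forward(self, hidden_states):
        batch, seq, dim = hidden_states.shape
        tokens = hidden_states.view(-1, dim)                           # (T, dim)

        router_logits = self.gate(tokens)                              # (T, E)
        routing_probs = F.softmax(router_logits, dim=-1)
        topk_probs, topk_idx = routing_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)     # renormalize over selected experts

        output = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if token_pos.numel() == 0:
                continue
            weight = topk_probs[token_pos, slot].unsqueeze(-1)
            output[token_pos] += weight * expert(tokens[token_pos])    # weighted combine

        # Auxiliary load-balancing loss: num_experts * sum_e f_e * p_e, where
        # f_e is the fraction of tokens routed to expert e and p_e the mean
        # routing probability assigned to expert e.
        one_hot = F.one_hot(topk_idx, self.num_experts).float()        # (T, k, E)
        fraction_tokens = one_hot.mean(dim=(0, 1)) * self.top_k        # fraction of tokens per expert
        mean_probs = routing_probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(fraction_tokens * mean_probs)

        return output.view(batch, seq, dim), aux_loss
```

The per-expert Python loop is fine for illustration; a production kernel would batch tokens per expert, enforce the capacity limit, and, under expert parallelism, replace the loop with an all-to-all dispatch.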
Decision rules
- Top-2 routing (Mixtral-style) gives better quality than top-1 at ~2x compute per token — use when quality matters more than speed
- Top-1 routing (Switch) maximizes throughput — use for very large models where compute budget is tight
- Set `num_experts` as a power of 2 (8, 16, 64) for clean distribution across GPUs
- More experts with top-2 routing (e.g., 64 experts, top-2) gives more total parameters with similar active compute (see the parameter-count sketch after this list)
- If any expert consistently receives <1% of tokens, increase `router_aux_loss_coef` or add jitter noise to router logits
- MoE models need ~4x the total parameters of a dense model to match quality at the same active-parameter compute budget
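As a rough back-of-envelope illustration of the total-vs-active tradeoff, the following sketch estimates parameter counts for a Mixtral-8x7B-like shape (gated FFN experts, grouped-query attention). The numbers are approximate and ignore norms and the router's tiny gate matrices.

```python
# Back-of-envelope total vs. active parameter estimate for a Mixtral-8x7B-like model.
hidden, intermediate, layers = 4096, 14336, 32
vocab, n_heads, n_kv_heads = 32000, 32, 8
num_experts, top_k = 8, 2

head_dim = hidden // n_heads
attn = layers * (2 * hidden * hidden                      # q_proj, o_proj
                 + 2 * hidden * n_kv_heads * head_dim)    # k_proj, v_proj (GQA)
expert_ffn = 3 * hidden * intermediate                    # gate/up/down projections per expert
embeddings = 2 * vocab * hidden                           # input embeddings + LM head

total = embeddings + attn + layers * num_experts * expert_ffn
active = embeddings + attn + layers * top_k * expert_ffn  # only top-k experts run per token

print(f"total  ~ {total / 1e9:.1f}B params")   # ~46.7B
print(f"active ~ {active / 1e9:.1f}B params")  # ~12.9B
```

This roughly reproduces the "47B total, 13B active per token" figure quoted earlier: per-token compute stays near that of a ~13B dense model while total capacity is about 3.6x larger.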
Output requirements
- MoE Config — Complete `MixtralConfig` or equivalent with expert count, top-k, aux loss coefficient, capacity factor
- Router Design — Gating mechanism, top-k strategy, noise injection, and load balancing approach
- Compute Analysis — Total params, active params per token, FLOPs comparison vs equivalent dense model
- Distribution Plan — Expert parallelism layout across GPUs, communication pattern (all-to-all)
References
- Jiang et al., "Mixtral of Experts" (arXiv:2401.04088)
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models" (arXiv:2101.03961)
- Lepikhin et al., "GShard: Scaling Giant Models with Conditional Computation" (arXiv:2006.16668)
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (arXiv:1701.06538)
- HuggingFace `transformers.MixtralConfig`, `transformers.MixtralForCausalLM` source code
Related skills
- model-architecture — for the dense transformer base that MoE extends
- pretraining-pipeline — for training MoE models with expert parallelism
- distillation-compression — for distilling MoE into dense student models
- training-infrastructure — for multi-node expert parallelism setup
Failure handling
- If expert utilization is severely imbalanced (>3x variance across experts), increase `router_aux_loss_coef` by 2-5x and add router z-loss.
- If all-to-all communication dominates training time (>30% of step), reduce the number of experts or switch to expert-choice routing.
- If MoE model quality is worse than the dense baseline at the same active params, verify the router is not collapsing — check expert assignment entropy (see the monitoring sketch below).
- If OOM during training, reduce `capacity_factor` to drop overflow tokens rather than buffering them.
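For the imbalance and collapse checks above, a small sketch of per-step router health metrics; `router_logits` is assumed to be one layer's logits of shape (tokens, num_experts), and the function name and thresholds are illustrative.

```python
# Sketch: router health metrics from one layer's router logits (T x E).
import torch
import torch.nn.functional as F

def router_health(router_logits: torch.Tensor, top_k: int = 2):
    probs = F.softmax(router_logits, dim=-1)                  # (T, E)
    num_experts = probs.shape[-1]

    # Fraction of routed assignments each expert receives.
    topk_idx = probs.topk(top_k, dim=-1).indices
    counts = F.one_hot(topk_idx, num_experts).sum(dim=(0, 1)).float()
    load = counts / counts.sum()                              # sums to 1 across experts

    # Mean routing entropy: high = balanced, near 0 = collapsing router.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

    return {
        "max_expert_load": load.max().item(),                 # collapse if >> 1 / num_experts
        "min_expert_load": load.min().item(),                 # starving expert if < 0.01
        "router_entropy": entropy.item(),
    }
```

In the HuggingFace Mixtral implementation, setting `output_router_logits=True` should return per-layer router logits in the model output, which can feed a check like this at each logging step.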