AutoSkill · Implement MoE-Mamba Text Generation Model
Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model" ~/.claude/skills/ecnu-icalk-autoskill-implement-moe-mamba-text-generation-model && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model/SKILL.md
source content
Implement MoE-Mamba Text Generation Model
Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.
Prompt
Role & Objective
You are a Deep Learning Engineer. Your task is to implement a MoE-Mamba model for text generation based on specific architectural requirements and a defined training pipeline.
Operational Rules & Constraints
Model Architecture
- Expert Module: Define a simple feedforward network with `input_dim` and `hidden_dim`. Structure: Linear(input, hidden) -> ReLU -> Linear(hidden, input).
- MoELayer Module: Define a Mixture of Experts layer.
  - Initialize a `ModuleList` of `Expert` modules.
  - Define a `gate` as a Linear layer mapping `input_dim` to `num_experts`.
  - Forward pass: Calculate gating distribution via Softmax. Stack expert outputs. Compute weighted sum using `torch.einsum`.
- SelectionMechanism Module: Define the input-dependent state update mechanism.
  - Initialize a `selection_layer` as a Linear layer mapping `input_dim + state_dim` to `state_dim`.
  - Forward pass: Concatenate `state` and `u` along dimension 1. Pass through the selection layer.
- StateSpaceMamba Module: Define the main model.
  - Initialize `state` as a Parameter: `torch.zeros(1, state_dim)`.
  - Initialize `input_layer` (Linear), `selection_mechanism`, and `moe_layer`.
  - Forward pass: Iterate through the input sequence. Update state using `selection_mechanism(state, u)`. Project input using `input_layer`. Add state to projected input. Pass through `moe_layer`. Return stacked outputs (see the sketch below).
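A minimal PyTorch sketch of the four modules above. The token embedding and the final vocabulary projection (`embedding`, `output_layer`) are assumptions, since the rules leave both unspecified but training with `CrossEntropyLoss` needs vocabulary-sized logits; the exact `einsum` subscripts and the `input_layer` output size (`state_dim`, so the state can be added) are likewise one reasonable reading.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Feedforward expert: Linear(input, hidden) -> ReLU -> Linear(hidden, input)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Mixture of Experts: softmax gate weighting stacked expert outputs."""

    def __init__(self, input_dim, hidden_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(input_dim, hidden_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                   # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, dim)
        return torch.einsum("be,bed->bd", weights, outputs)         # weighted sum over experts


class SelectionMechanism(nn.Module):
    """Input-dependent state update: Linear(input_dim + state_dim -> state_dim)."""

    def __init__(self, input_dim, state_dim):
        super().__init__()
        self.selection_layer = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, state, u):
        # Dimensionality check before concatenation (see Anti-Patterns below).
        assert state.dim() == 2 and u.dim() == 2, "state and u must be 2-D (batch, dim)"
        return self.selection_layer(torch.cat([state, u], dim=1))


class StateSpaceMamba(nn.Module):
    def __init__(self, vocab_size, input_dim, hidden_dim, state_dim, num_experts):
        super().__init__()
        # `embedding` and `output_layer` are assumptions; the spec only names
        # state, input_layer, selection_mechanism, and moe_layer.
        self.embedding = nn.Embedding(vocab_size, input_dim)
        self.state = nn.Parameter(torch.zeros(1, state_dim))
        self.input_layer = nn.Linear(input_dim, state_dim)  # sized so state can be added
        self.selection_mechanism = SelectionMechanism(input_dim, state_dim)
        self.moe_layer = MoELayer(state_dim, hidden_dim, num_experts)
        self.output_layer = nn.Linear(state_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        state = self.state.expand(tokens.size(0), -1)
        outputs = []
        for t in range(tokens.size(1)):
            u = self.embedding(tokens[:, t])            # (batch, input_dim)
            state = self.selection_mechanism(state, u)  # input-dependent state update
            h = self.input_layer(u) + state             # project input, add state
            outputs.append(self.output_layer(self.moe_layer(h)))
        return torch.stack(outputs, dim=1)              # (batch, seq_len, vocab_size)
```

A forward pass over a `(batch, seq_len)` batch of token ids returns `(batch, seq_len, vocab_size)` logits, which is what the training loop described next consumes.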
Data Processing & Training
- Data Loading: Load text from a file. Tokenize using `basic_english`. Build vocabulary with special tokens (`<unk>`, `<pad>`, `<sos>`, `<eos>`). Numericalize tokens.
- Batching: Calculate `num_batches`. Reshape tokens into `(batch_size, -1)`. Ensure `num_batches` is not zero to avoid division errors.
- Training Loop: Use `CrossEntropyLoss` and the `Adam` optimizer. Iterate over epochs. Calculate loss, backpropagate, and step the optimizer. Track and return `loss_history`.
- Generation: Implement an autoregressive generation function. Use a temperature parameter for sampling. Update the input sequence iteratively.
- Visualization: Plot the training loss history using `matplotlib` (a pipeline sketch covering these steps follows this list).
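A sketch of the pipeline under the rules above, assuming `torchtext` for the `basic_english` tokenizer and vocabulary, whole-corpus single-step batching (the rules fix only the `(batch_size, -1)` reshape), and the `StateSpaceMamba` class from the architecture sketch; the next-token shift, hyperparameters, and file name are illustrative.

```python
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


def load_data(path, batch_size):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = get_tokenizer("basic_english")(text)
    vocab = build_vocab_from_iterator(
        [tokens], specials=["<unk>", "<pad>", "<sos>", "<eos>"]
    )
    vocab.set_default_index(vocab["<unk>"])
    ids = torch.tensor([vocab[t] for t in tokens], dtype=torch.long)
    num_batches = ids.numel() // batch_size
    if num_batches == 0:  # guard from the Anti-Patterns: avoid division errors
        raise ValueError("num_batches is zero; lower batch_size or supply more text")
    return ids[: num_batches * batch_size].reshape(batch_size, -1), vocab


def train(model, data, vocab_size, epochs=20, lr=1e-3):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_history = []
    inputs, targets = data[:, :-1], data[:, 1:]  # next-token prediction shift
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(inputs)  # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        loss_history.append(loss.item())
    return loss_history


@torch.no_grad()
def generate(model, vocab, prompt_ids, max_new_tokens=50, temperature=1.0):
    seq = prompt_ids.clone()  # (1, prompt_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = model(seq)[:, -1, :] / temperature  # temperature-scaled sampling
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        seq = torch.cat([seq, next_id], dim=1)       # iteratively extend the input
    return [vocab.get_itos()[i] for i in seq.squeeze(0).tolist()]
```

Putting it together, with the loss curve plotted via `matplotlib` (the corpus path and dimensions are placeholders):

```python
import matplotlib.pyplot as plt

data, vocab = load_data("corpus.txt", batch_size=16)
model = StateSpaceMamba(len(vocab), input_dim=64, hidden_dim=128, state_dim=64, num_experts=4)
loss_history = train(model, data, vocab_size=len(vocab))
plt.plot(loss_history)
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.show()
```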
Anti-Patterns
- Do not use RNNs or standard Transformers for the core architecture; use the specified StateSpaceMamba structure.
- Do not omit the dimensionality checks for tensor concatenation in the SelectionMechanism.
- Do not forget to handle the case where `num_batches` might be zero.
Triggers
- build a moe-mamba model
- implement mamba text generation
- train mamba on text dataset
- code selection mechanism and moe layer