AutoSkill · Implement MoE-Mamba Text Generation Model
Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model" ~/.claude/skills/ecnu-icalk-autoskill-implement-moe-mamba-text-generation-model && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model/SKILL.md
source content
Implement MoE-Mamba Text Generation Model
Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.
Prompt
Role & Objective
You are a Deep Learning Engineer. Your task is to implement a MoE-Mamba model for text generation based on specific architectural requirements and a defined training pipeline.
Operational Rules & Constraints
Model Architecture
- Expert Module: Define a simple feedforward network with `input_dim` and `hidden_dim`. Structure: Linear(input, hidden) -> ReLU -> Linear(hidden, input).
- MoELayer Module: Define a Mixture of Experts layer.
  - Initialize a `ModuleList` of `Expert` modules.
  - Define a `gate` as a Linear layer mapping `input_dim` to `num_experts`.
  - Forward pass: Calculate gating distribution via Softmax. Stack expert outputs. Compute weighted sum using `torch.einsum`.
- SelectionMechanism Module: Define the input-dependent state update mechanism.
  - Initialize a `selection_layer` as a Linear layer mapping `input_dim + state_dim` to `state_dim`.
  - Forward pass: Concatenate `state` and `u` along dimension 1. Pass through the selection layer.
- StateSpaceMamba Module: Define the main model.
  - Initialize `state` as a Parameter: `torch.zeros(1, state_dim)`.
  - Initialize `input_layer` (Linear), `selection_mechanism`, and `moe_layer`.
  - Forward pass: Iterate through the input sequence. Update state using `selection_mechanism(state, u)`. Project input using `input_layer`. Add state to projected input. Pass through `moe_layer`. Return stacked outputs (see the sketch below).
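A minimal PyTorch sketch of the four modules above. The token embedding and the final vocabulary projection (`embedding`, `output_layer`) are assumptions, since the rules leave both unspecified but training with `CrossEntropyLoss` needs vocabulary-sized logits; the exact `einsum` subscripts and the `input_layer` output size (`state_dim`, so the state can be added) are likewise one reasonable reading.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Feedforward expert: Linear(input, hidden) -> ReLU -> Linear(hidden, input)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Mixture of Experts: softmax gate weighting stacked expert outputs."""

    def __init__(self, input_dim, hidden_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(input_dim, hidden_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                   # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, dim)
        return torch.einsum("be,bed->bd", weights, outputs)         # weighted sum over experts


class SelectionMechanism(nn.Module):
    """Input-dependent state update: Linear(input_dim + state_dim -> state_dim)."""

    def __init__(self, input_dim, state_dim):
        super().__init__()
        self.selection_layer = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, state, u):
        # Dimensionality check before concatenation (see Anti-Patterns below).
        assert state.dim() == 2 and u.dim() == 2, "state and u must be 2-D (batch, dim)"
        return self.selection_layer(torch.cat([state, u], dim=1))


class StateSpaceMamba(nn.Module):
    def __init__(self, vocab_size, input_dim, hidden_dim, state_dim, num_experts):
        super().__init__()
        # `embedding` and `output_layer` are assumptions; the spec only names
        # state, input_layer, selection_mechanism, and moe_layer.
        self.embedding = nn.Embedding(vocab_size, input_dim)
        self.state = nn.Parameter(torch.zeros(1, state_dim))
        self.input_layer = nn.Linear(input_dim, state_dim)  # sized so state can be added
        self.selection_mechanism = SelectionMechanism(input_dim, state_dim)
        self.moe_layer = MoELayer(state_dim, hidden_dim, num_experts)
        self.output_layer = nn.Linear(state_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        state = self.state.expand(tokens.size(0), -1)
        outputs = []
        for t in range(tokens.size(1)):
            u = self.embedding(tokens[:, t])            # (batch, input_dim)
            state = self.selection_mechanism(state, u)  # input-dependent state update
            h = self.input_layer(u) + state             # project input, add state
            outputs.append(self.output_layer(self.moe_layer(h)))
        return torch.stack(outputs, dim=1)              # (batch, seq_len, vocab_size)
```

A forward pass over a `(batch, seq_len)` batch of token ids returns `(batch, seq_len, vocab_size)` logits, which is what the training loop described next consumes.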
Data Processing & Training
- Data Loading: Load text from a file. Tokenize using `basic_english`. Build vocabulary with special tokens (`<unk>`, `<pad>`, `<sos>`, `<eos>`). Numericalize tokens.
- Batching: Calculate `num_batches`. Reshape tokens into `(batch_size, -1)`. Ensure `num_batches` is not zero to avoid division errors.
- Training Loop: Use `CrossEntropyLoss` and the `Adam` optimizer. Iterate over epochs. Calculate loss, backpropagate, and step the optimizer. Track and return `loss_history`.
- Generation: Implement an autoregressive generation function. Use a temperature parameter for sampling. Update the input sequence iteratively.
- Visualization: Plot the training loss history using `matplotlib` (a pipeline sketch covering these steps follows this list).
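A sketch of the pipeline under the rules above, assuming `torchtext` for the `basic_english` tokenizer and vocabulary, whole-corpus single-step batching (the rules fix only the `(batch_size, -1)` reshape), and the `StateSpaceMamba` class from the architecture sketch; the next-token shift, hyperparameters, and file name are illustrative.

```python
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


def load_data(path, batch_size):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = get_tokenizer("basic_english")(text)
    vocab = build_vocab_from_iterator(
        [tokens], specials=["<unk>", "<pad>", "<sos>", "<eos>"]
    )
    vocab.set_default_index(vocab["<unk>"])
    ids = torch.tensor([vocab[t] for t in tokens], dtype=torch.long)
    num_batches = ids.numel() // batch_size
    if num_batches == 0:  # guard from the Anti-Patterns: avoid division errors
        raise ValueError("num_batches is zero; lower batch_size or supply more text")
    return ids[: num_batches * batch_size].reshape(batch_size, -1), vocab


def train(model, data, vocab_size, epochs=20, lr=1e-3):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_history = []
    inputs, targets = data[:, :-1], data[:, 1:]  # next-token prediction shift
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(inputs)  # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        loss_history.append(loss.item())
    return loss_history


@torch.no_grad()
def generate(model, vocab, prompt_ids, max_new_tokens=50, temperature=1.0):
    seq = prompt_ids.clone()  # (1, prompt_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = model(seq)[:, -1, :] / temperature  # temperature-scaled sampling
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        seq = torch.cat([seq, next_id], dim=1)       # iteratively extend the input
    return [vocab.get_itos()[i] for i in seq.squeeze(0).tolist()]
```

Putting it together, with the loss curve plotted via `matplotlib` (the corpus path and dimensions are placeholders):

```python
import matplotlib.pyplot as plt

data, vocab = load_data("corpus.txt", batch_size=16)
model = StateSpaceMamba(len(vocab), input_dim=64, hidden_dim=128, state_dim=64, num_experts=4)
loss_history = train(model, data, vocab_size=len(vocab))
plt.plot(loss_history)
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.show()
```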
Anti-Patterns
- Do not use RNNs or standard Transformers for the core architecture; use the specified StateSpaceMamba structure.
- Do not omit the dimensionality checks for tensor concatenation in the SelectionMechanism.
- Do not forget to handle the case where `num_batches` might be zero.
Triggers
- build a moe-mamba model
- implement mamba text generation
- train mamba on text dataset
- code selection mechanism and moe layer