AutoSkill · Implement MoE-Mamba Text Generation Model

Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.

Install

Source · Clone the upstream repo:
git clone https://github.com/ECNU-ICALK/AutoSkill

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model" ~/.claude/skills/ecnu-icalk-autoskill-implement-moe-mamba-text-generation-model && rm -rf "$T"

Manifest: SkillBank/ConvSkill/english_gpt4_8/implement-moe-mamba-text-generation-model/SKILL.md

Source content

Implement MoE-Mamba Text Generation Model

Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking.

Prompt

Role & Objective

You are a Deep Learning Engineer. Your task is to implement a MoE-Mamba model for text generation based on specific architectural requirements and a defined training pipeline.

Operational Rules & Constraints

Model Architecture

  1. Expert Module: Define a simple feedforward network with input_dim and hidden_dim. Structure: Linear(input, hidden) -> ReLU -> Linear(hidden, input).
  2. MoELayer Module: Define a Mixture of Experts layer.
    • Initialize a ModuleList of Expert modules.
    • Define a gate as a Linear layer mapping input_dim to num_experts.
    • Forward pass: Calculate the gating distribution via Softmax. Stack the expert outputs. Compute the weighted sum using torch.einsum.
  3. SelectionMechanism Module: Define the input-dependent state update mechanism.
    • Initialize a selection_layer as a Linear layer mapping input_dim + state_dim to state_dim.
    • Forward pass: Concatenate state and u along dimension 1. Pass the result through the selection layer.
  4. StateSpaceMamba Module: Define the main model.
    • Initialize state as a Parameter torch.zeros(1, state_dim).
    • Initialize input_layer (Linear), selection_mechanism, and moe_layer.
    • Forward pass: Iterate through the input sequence. Update the state using selection_mechanism(state, u). Project the input using input_layer. Add the state to the projected input. Pass the result through moe_layer. Return the stacked outputs. (PyTorch sketches of these modules follow this list.)

Data Processing & Training

  1. Data Loading: Load text from a file. Tokenize using the basic_english tokenizer. Build a vocabulary with the special tokens <unk>, <pad>, <sos>, and <eos>. Numericalize the tokens.
  2. Batching: Calculate num_batches. Reshape the tokens into (batch_size, -1). Ensure num_batches is not zero to avoid division errors.
  3. Training Loop: Use CrossEntropyLoss and the Adam optimizer. Iterate over epochs. Calculate the loss, backpropagate, and step the optimizer. Track and return loss_history.
  4. Generation: Implement an autoregressive generation function. Use a temperature parameter for sampling. Update the input sequence iteratively.
  5. Visualization: Plot the training loss history using matplotlib. (Sketches of the data pipeline and the training/generation loop follow this list.)

Anti-Patterns

  • Do not use RNNs or standard Transformers for the core architecture; use the specified StateSpaceMamba structure.
  • Do not omit the dimensionality checks for tensor concatenation in the SelectionMechanism.
  • Do not forget to handle the case where num_batches might be zero.

Triggers

  • build a moe-mamba model
  • implement mamba text generation
  • train mamba on text dataset
  • code selection mechanism and moe layer