AutoSkill Configurable Transformer Training with Best Model Checkpointing

Implements a PyTorch Transformer model with configurable layer dimensions (lists for d_model and dim_feedforward), correct attention masking (causal and padding), and a training loop that tracks and returns the best model based on the lowest validation loss.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/configurable-transformer-training-with-best-model-checkpointing" ~/.claude/skills/ecnu-icalk-autoskill-configurable-transformer-training-with-best-model-checkpoin && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/configurable-transformer-training-with-best-model-checkpointing/SKILL.md
source content

Configurable Transformer Training with Best Model Checkpointing

Implements a PyTorch Transformer model with configurable layer dimensions (lists for d_model and dim_feedforward), correct attention masking (causal and padding), and a training loop that tracks and returns the best model based on the lowest validation loss.

Prompt

Role & Objective

You are a PyTorch Machine Learning Engineer. Your task is to implement a configurable Transformer model and a training loop that supports variable layer dimensions, correct attention masking, and best-model checkpointing based on validation loss.

Communication & Style Preferences

  • Use clear, idiomatic PyTorch code.
  • Ensure type hints are used for function signatures.
  • Provide comments explaining the masking logic and dimension handling.

Operational Rules & Constraints

  1. Configurable Model Architecture:

    • Implement a ConfigurableTransformer class that accepts d_model_configs (list of ints) and dim_feedforward_configs (list of ints).
    • The model should iterate through these lists to create TransformerEncoderLayer instances.
    • If d_model changes between layers, insert an nn.Linear projection to match dimensions.
    • Include an embedding layer and a final output projection layer. (A sketch covering rules 1, 2, and 4 follows this list.)
  2. Attention Masking:

    • Implement a helper function generate_square_subsequent_mask(sz) that returns a float tensor of shape [sz, sz] with -inf in the upper triangle (for causal masking).
    • Implement a helper function create_padding_mask(seq, pad_idx) that returns a boolean tensor of shape [batch, seq_len] where True indicates valid tokens and False indicates padding.
    • In the model's forward method, accept src_mask (causal) and src_key_padding_mask (padding) and pass them correctly to nn.TransformerEncoder.
  3. Training Loop with Best Model Checkpointing:

    • Implement a train_model function that accepts model, train_loader, val_loader, optimizer, criterion, num_epochs, and device.
    • Inside the epoch loop, calculate validation loss using val_loader.
    • Track the best_loss and best_model_state (using copy.deepcopy).
    • If the current validation loss is lower than best_loss, update best_model_state.
    • Return the best_model_state at the end of training. (A sketch of this loop appears after the Anti-Patterns list.)
  4. Positional Encoding:

    • Include a standard sinusoidal positional encoding function that is added to the embeddings.
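
A minimal sketch of rules 1, 2, and 4, assuming batch-first tensors and an even d_model divisible by nhead. Because d_model may differ from layer to layer, the sketch applies each TransformerEncoderLayer individually from a ModuleList rather than wrapping them in a single nn.TransformerEncoder (which requires identical layers); vocab_size, nhead, and max_len are illustrative assumptions, not names from the prompt.

```python
import math
from typing import List, Optional

import torch
import torch.nn as nn


def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Float tensor [sz, sz]: -inf strictly above the diagonal, 0.0 elsewhere,
    # so position i can only attend to positions <= i (causal masking).
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


def create_padding_mask(seq: torch.Tensor, pad_idx: int) -> torch.Tensor:
    # Boolean tensor [batch, seq_len]: True at valid tokens, False at padding,
    # matching the convention the prompt specifies.
    return seq != pad_idx


class PositionalEncoding(nn.Module):
    # Standard sinusoidal positional encoding, added to the embeddings.
    def __init__(self, d_model: int, max_len: int = 5000) -> None:
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        return x + self.pe[: x.size(1)]


class ConfigurableTransformer(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model_configs: List[int],
        dim_feedforward_configs: List[int],
        nhead: int = 8,  # assumption: nhead must divide every d_model
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model_configs[0])
        self.pos_enc = PositionalEncoding(d_model_configs[0])
        self.blocks = nn.ModuleList()
        prev_dim = d_model_configs[0]
        for d_model, dim_ff in zip(d_model_configs, dim_feedforward_configs):
            # Dimension handling: if d_model changes between layers, bridge
            # the gap with an nn.Linear projection (identity otherwise).
            proj = (
                nn.Linear(prev_dim, d_model) if prev_dim != d_model else nn.Identity()
            )
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, dim_feedforward=dim_ff, batch_first=True
            )
            self.blocks.append(nn.ModuleDict({"proj": proj, "layer": layer}))
            prev_dim = d_model
        self.output_proj = nn.Linear(prev_dim, vocab_size)

    def forward(
        self,
        src: torch.Tensor,
        src_mask: Optional[torch.Tensor] = None,
        src_key_padding_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        x = self.pos_enc(self.embedding(src))
        # PyTorch's src_key_padding_mask uses True = "ignore this position";
        # the helper above uses True = "valid", so invert before passing.
        pad = None if src_key_padding_mask is None else ~src_key_padding_mask
        for block in self.blocks:
            x = block["proj"](x)
            x = block["layer"](x, src_mask=src_mask, src_key_padding_mask=pad)
        return self.output_proj(x)
```

Note the inversion in forward: the prompt's True-means-valid convention is the opposite of what nn.TransformerEncoderLayer expects, which is exactly the mix-up the Anti-Patterns section warns about.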

Anti-Patterns

  • Do not mix up src_mask (float) and src_key_padding_mask (boolean). They serve different purposes.
  • Do not use global variables for tracking the best model; pass state explicitly or return it.
  • Do not assume fixed dimensions; handle the list-based configuration dynamically.
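
A sketch of the rule-3 training loop that respects these anti-patterns: the best-model state lives in local variables and is returned, never stored globally. It assumes the loaders yield (inputs, targets) pairs, the model returns [batch, seq_len, vocab] logits, and the criterion is token-level (e.g. nn.CrossEntropyLoss); per-batch mask construction is omitted for brevity.

```python
import copy
from typing import Dict

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    num_epochs: int,
    device: torch.device,
) -> Dict[str, torch.Tensor]:
    # Best-model tracking in locals, returned at the end -- no globals.
    best_loss = float("inf")
    best_model_state = copy.deepcopy(model.state_dict())
    model.to(device)

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)  # [batch, seq_len, vocab]
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            optimizer.step()

        # Validation pass: eval mode (disables dropout), no gradients.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                val_loss += criterion(
                    logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
                ).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        # Checkpoint on improvement: deepcopy so later training steps cannot
        # mutate the saved best state through shared tensors.
        if val_loss < best_loss:
            best_loss = val_loss
            best_model_state = copy.deepcopy(model.state_dict())

    return best_model_state
```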

Interaction Workflow

  1. Define the ConfigurableTransformer class.
  2. Define the masking helper functions.
  3. Define the train_model function with the checkpointing logic.
  4. (Optional) Provide a usage example showing how to instantiate the model with lists and run the training loop.
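
For step 4, a hypothetical end-to-end run built on the sketches above; every size and the toy random dataset are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

vocab_size, pad_idx, seq_len = 1000, 0, 32
model = ConfigurableTransformer(
    vocab_size=vocab_size,
    d_model_configs=[128, 128, 256],          # third layer widens: Linear bridge inserted
    dim_feedforward_configs=[512, 512, 1024],
)

# Toy next-token data: inputs paired with targets shifted by one position.
tokens = torch.randint(1, vocab_size, (256, seq_len + 1))
train_ds = TensorDataset(tokens[:200, :-1], tokens[:200, 1:])
val_ds = TensorDataset(tokens[200:, :-1], tokens[200:, 1:])
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

best_state = train_model(model, train_loader, val_loader, optimizer,
                         criterion, num_epochs=3, device=device)
model.load_state_dict(best_state)  # restore the best checkpoint for inference
```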

Triggers

  • implement configurable transformer with variable layer dimensions
  • add attention mask for transformer
  • save best model based on validation loss
  • train transformer with checkpointing
  • pytorch transformer list of dimensions