AutoSkill: Configurable Transformer Training with Best Model Checkpointing
Implements a PyTorch Transformer model with configurable layer dimensions (lists for d_model and dim_feedforward), correct attention masking (causal and padding), and a training loop that tracks and returns the best model based on the lowest validation loss.
```bash
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/configurable-transformer-training-with-best-model-checkpointing" ~/.claude/skills/ecnu-icalk-autoskill-configurable-transformer-training-with-best-model-checkpoin && rm -rf "$T"
```
SkillBank/ConvSkill/english_gpt4_8/configurable-transformer-training-with-best-model-checkpointing/SKILL.md
Configurable Transformer Training with Best Model Checkpointing
Prompt
Role & Objective
You are a PyTorch Machine Learning Engineer. Your task is to implement a configurable Transformer model and a training loop that supports variable layer dimensions, correct attention masking, and best-model checkpointing based on validation loss.
Communication & Style Preferences
- Use clear, idiomatic PyTorch code.
- Ensure type hints are used for function signatures.
- Provide comments explaining the masking logic and dimension handling.
Operational Rules & Constraints
- Configurable Model Architecture:
  - Implement a `ConfigurableTransformer` class that accepts `d_model_configs` (list of ints) and `dim_feedforward_configs` (list of ints).
  - The model should iterate through these lists to create `TransformerEncoderLayer` instances.
  - If `d_model` changes between layers, insert an `nn.Linear` projection to match dimensions.
  - Include an embedding layer and a final output projection layer. (A sketch of this class follows below.)
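A minimal sketch of such a class follows. Because `d_model` may differ between layers, individual `TransformerEncoderLayer` modules are stacked in an `nn.ModuleList` rather than wrapped in a single `nn.TransformerEncoder`. The `vocab_size`, `nhead`, and `dropout` parameters are illustrative choices, not fixed by the spec, and positional encoding is omitted here (see the Positional Encoding item below).

```python
from typing import List, Optional

import torch
import torch.nn as nn


class ConfigurableTransformer(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model_configs: List[int],
        dim_feedforward_configs: List[int],
        nhead: int = 8,  # must divide every entry of d_model_configs
        dropout: float = 0.1,
    ) -> None:
        super().__init__()
        # Embed tokens into the first layer's dimension.
        self.embedding = nn.Embedding(vocab_size, d_model_configs[0])
        self.projections = nn.ModuleList()
        self.layers = nn.ModuleList()
        prev_dim = d_model_configs[0]
        for d_model, dim_ff in zip(d_model_configs, dim_feedforward_configs):
            # Bridge dimension changes between consecutive layers with a
            # linear projection; otherwise pass activations through as-is.
            self.projections.append(
                nn.Linear(prev_dim, d_model) if prev_dim != d_model else nn.Identity()
            )
            self.layers.append(
                nn.TransformerEncoderLayer(
                    d_model=d_model,
                    nhead=nhead,
                    dim_feedforward=dim_ff,
                    dropout=dropout,
                    batch_first=True,
                )
            )
            prev_dim = d_model
        # Map the final hidden states back to vocabulary logits.
        self.output_proj = nn.Linear(d_model_configs[-1], vocab_size)

    def forward(
        self,
        src: torch.Tensor,                                     # [batch, seq_len]
        src_mask: Optional[torch.Tensor] = None,               # [seq_len, seq_len], float
        src_key_padding_mask: Optional[torch.Tensor] = None,   # [batch, seq_len], bool
    ) -> torch.Tensor:
        x = self.embedding(src)  # positional encoding would be added here
        for proj, layer in zip(self.projections, self.layers):
            x = proj(x)
            x = layer(x, src_mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        return self.output_proj(x)
```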
- Attention Masking:
  - Implement a helper function `generate_square_subsequent_mask(sz)` that returns a float tensor of shape `[sz, sz]` with `-inf` in the upper triangle (for causal masking).
  - Implement a helper function `create_padding_mask(seq, pad_idx)` that returns a boolean tensor of shape `[batch, seq_len]` where `True` indicates valid tokens and `False` indicates padding. (Note that PyTorch's `src_key_padding_mask` uses the opposite convention, `True` meaning "ignore this position", so invert this mask before passing it on.)
  - In the model's `forward` method, accept `src_mask` (causal) and `src_key_padding_mask` (padding) and pass them correctly to `nn.TransformerEncoder`. (Both helpers are sketched below.)
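A sketch of the two helpers, following the spec's conventions. The usage comment at the bottom shows the inversion required by PyTorch's padding-mask convention:

```python
import torch


def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    # Float mask of shape [sz, sz]: 0.0 on and below the diagonal,
    # -inf strictly above it, so each position attends only to the past.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


def create_padding_mask(seq: torch.Tensor, pad_idx: int) -> torch.Tensor:
    # Boolean mask of shape [batch, seq_len]: True for real tokens,
    # False where the sequence is padded with pad_idx (spec convention).
    return seq != pad_idx


# Usage inside the model's forward pass. PyTorch expects True = "ignore",
# so the padding mask is inverted with ~ before being passed:
# causal = generate_square_subsequent_mask(seq.size(1))
# out = model(seq, src_mask=causal,
#             src_key_padding_mask=~create_padding_mask(seq, pad_idx))
```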
- Training Loop with Best Model Checkpointing:
  - Implement a `train_model` function that accepts `model`, `train_loader`, `val_loader`, `optimizer`, `criterion`, `num_epochs`, and `device`.
  - Inside the epoch loop, calculate validation loss using `val_loader`.
  - Track the `best_loss` and `best_model_state` (using `copy.deepcopy`).
  - If the current validation loss is lower than `best_loss`, update `best_model_state`.
  - Return the `best_model_state` at the end of training. (See the sketch below.)
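A sketch of the training loop. It assumes each batch is an `(inputs, targets)` pair and a token-level criterion such as `nn.CrossEntropyLoss`; both are assumptions for illustration, not fixed by the spec:

```python
import copy
from typing import Dict

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: nn.Module,
    num_epochs: int,
    device: torch.device,
) -> Dict[str, torch.Tensor]:
    best_loss = float("inf")
    best_model_state = copy.deepcopy(model.state_dict())
    model.to(device)

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)
            # Flatten [batch, seq_len, vocab] -> [batch*seq_len, vocab].
            loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            loss.backward()
            optimizer.step()

        # Validation pass: eval mode, no gradients.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                val_loss += criterion(
                    logits.view(-1, logits.size(-1)), targets.view(-1)
                ).item()
        val_loss /= max(len(val_loader), 1)

        # Checkpoint: deep-copy the weights whenever validation loss improves.
        if val_loss < best_loss:
            best_loss = val_loss
            best_model_state = copy.deepcopy(model.state_dict())

    return best_model_state
```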
- Positional Encoding:
  - Include a standard sinusoidal positional encoding whose output is added to the embeddings, as sketched below.
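A common module-style implementation of the sinusoidal encoding. It assumes an even `d_model` and batch-first inputs; `max_len` is an illustrative default:

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding, added to the embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000) -> None:
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # [max_len, 1]
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)  # saved with the model, not trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]; broadcast-add matching positions.
        return x + self.pe[: x.size(1)]
```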
Anti-Patterns
- Do not mix up `src_mask` (float) and `src_key_padding_mask` (boolean). They serve different purposes.
- Do not use global variables for tracking the best model; pass state explicitly or return it.
- Do not assume fixed dimensions; handle the list-based configuration dynamically.
Interaction Workflow
- Define the `ConfigurableTransformer` class.
- Define the masking helper functions.
- Define the `train_model` function with the checkpointing logic.
- (Optional) Provide a usage example showing how to instantiate the model with lists and run the training loop; one such example follows below.
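A hypothetical end-to-end example tying the sketches above together. The vocabulary size, layer widths, and random toy data are purely illustrative:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

vocab_size, pad_idx, seq_len = 1000, 0, 32
model = ConfigurableTransformer(
    vocab_size=vocab_size,
    d_model_configs=[128, 128, 256],           # third layer widens to 256
    dim_feedforward_configs=[512, 512, 1024],
)

# Toy next-token data: inputs and their one-step-shifted targets.
tokens = torch.randint(1, vocab_size, (256, seq_len + 1))
dataset = TensorDataset(tokens[:, :-1], tokens[:, 1:])
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(dataset, batch_size=32)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

best_state = train_model(
    model, train_loader, val_loader, optimizer, criterion,
    num_epochs=3, device=device,
)
model.load_state_dict(best_state)  # restore the best checkpoint
```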
Triggers
- implement configurable transformer with variable layer dimensions
- add attention mask for transformer
- save best model based on validation loss
- train transformer with checkpointing
- pytorch transformer list of dimensions