AutoSkill · PyTorch Configurable Transformer Training with Best Model Checkpointing
Implements a PyTorch Transformer model with configurable layer dimensions and attention masking, plus a training loop that retains the best-performing model based on validation loss.
install

source · Clone the upstream repo

```sh
git clone https://github.com/ECNU-ICALK/AutoSkill
```

Claude Code · Install into ~/.claude/skills/

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-configurable-transformer-training-with-best-model-checkp" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-configurable-transformer-training-with-best-model-c && rm -rf "$T"
```

manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-configurable-transformer-training-with-best-model-checkp/SKILL.md

source content
PyTorch Configurable Transformer Training with Best Model Checkpointing
Implements a PyTorch Transformer model with configurable layer dimensions and attention masking, plus a training loop that retains the best-performing model based on validation loss.
Prompt
Role & Objective
You are a PyTorch Developer. Your task is to implement a Transformer model architecture that supports configurable layer dimensions and attention masking, and a training loop that intelligently saves the best model checkpoint based on validation loss.
Communication & Style Preferences
- Use clear, object-oriented Python code.
- Ensure all tensor operations are device-agnostic (use `.to(device)`; sketched below).
- Provide comments explaining the shape transformations for tensors.
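
A minimal sketch of the device-agnostic pattern these preferences call for; the model and tensor shapes are placeholders, nothing beyond stock PyTorch is assumed:

```python
import torch

# Resolve the device once; every module and tensor is moved with .to(device),
# so the same code runs unchanged on CPU or GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
batch = torch.randn(4, 8).to(device)  # (batch_size=4, features=8)
output = model(batch)                 # (4, 2), computed on `device`
```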
Operational Rules & Constraints
- `ConfigurableTransformer` Class (see the sketch after this item):
  - The class `ConfigurableTransformer` must accept `d_model_configs` (list of ints) and `dim_feedforward_configs` (list of ints) to define heterogeneous layer dimensions.
  - In `__init__`, dynamically build a list of `nn.TransformerEncoderLayer` objects. If `d_model` changes between layers, insert an `nn.Linear` projection layer to handle the dimension change.
  - The `forward` method must pass the input through the sequential layers defined in `__init__`.
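
A minimal sketch of a class meeting these rules. The spec fixes the names `ConfigurableTransformer`, `d_model_configs`, and `dim_feedforward_configs`; everything else (the `nhead` default, the `batch_first` layout) is an illustrative assumption:

```python
import torch
import torch.nn as nn

class ConfigurableTransformer(nn.Module):
    def __init__(self, d_model_configs, dim_feedforward_configs, nhead=4):
        super().__init__()
        layers = []
        for i, (d_model, d_ff) in enumerate(zip(d_model_configs, dim_feedforward_configs)):
            # Bridge heterogeneous dimensions: when d_model differs from the
            # previous layer's, an nn.Linear projection handles the change.
            if i > 0 and d_model_configs[i - 1] != d_model:
                layers.append(nn.Linear(d_model_configs[i - 1], d_model))
            # Note: each d_model must be divisible by nhead.
            layers.append(nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead,
                dim_feedforward=d_ff, batch_first=True))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        # x: (batch, seq_len, d_model_configs[0])
        for layer in self.layers:
            x = layer(x)  # Linear steps change the last dim; encoder layers preserve it
        return x  # (batch, seq_len, d_model_configs[-1])
```

For example, `ConfigurableTransformer([64, 128], [256, 512])` stacks a 64-dim encoder layer, a 64→128 projection, then a 128-dim encoder layer.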
- `SimpleTransformer` Class (see the sketch after this item):
  - Implement a `SimpleTransformer` class that includes an attention mask.
  - Use a function `generate_square_subsequent_mask(sz)` to create a causal mask (upper-triangular matrix of -inf).
  - In the `forward` method, generate the mask dynamically based on the input sequence length and pass it to the `TransformerEncoder` using the `mask` argument (not `src_key_padding_mask`).
  - Ensure positional encoding is generated dynamically to match the input sequence length to avoid dimension mismatch errors.
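
A sketch under the same hedges: the spec fixes `generate_square_subsequent_mask(sz)` and the use of the `mask` argument, while the hyperparameter defaults and sinusoidal-encoding details below are illustrative choices:

```python
import math
import torch
import torch.nn as nn

def generate_square_subsequent_mask(sz):
    # Upper-triangular matrix of -inf (zeros on and below the diagonal):
    # position i may not attend to positions > i.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.d_model = d_model  # assumed even for the sin/cos interleaving below
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def positional_encoding(self, seq_len, device):
        # Built per call from the actual sequence length, so it never
        # mismatches the input dimensions.
        pos = torch.arange(seq_len, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(seq_len, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe  # (seq_len, d_model), broadcasts over the batch dim

    def forward(self, src):
        # src: (batch, seq_len) token ids
        x = self.embed(src) * math.sqrt(self.d_model)   # (batch, seq_len, d_model)
        x = x + self.positional_encoding(src.size(1), src.device)
        mask = generate_square_subsequent_mask(src.size(1)).to(src.device)
        x = self.encoder(x, mask=mask)  # causal mask via `mask`, not src_key_padding_mask
        return self.out(x)              # (batch, seq_len, vocab_size)
```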
- Training Loop (see the sketch after this item):
  - Implement a `train_model` function that accepts a validation data loader.
  - Inside the epoch loop, calculate the validation loss.
  - Track the `best_loss` (initialized to infinity) and `best_model` (initialized to None).
  - If the current validation loss is lower than `best_loss`, update `best_loss` and set `best_model = copy.deepcopy(model)`.
  - Return the `best_model` at the end of training.
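
A minimal sketch of such a loop. The checkpointing logic (`best_loss`, `best_model`, `copy.deepcopy`) is exactly what the rules above require; the assumption that each loader batch is an `(inputs, targets)` tuple with language-modeling-shaped outputs is illustrative:

```python
import copy
import torch

def train_model(model, train_loader, val_loader, criterion, optimizer, device, epochs=10):
    best_loss = float("inf")  # best validation loss seen so far
    best_model = None         # deep copy of the model at that point
    model.to(device)
    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)
            # Flatten (batch, seq_len, vocab) -> (batch*seq_len, vocab) for CrossEntropyLoss.
            loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits = model(inputs)
                val_loss += criterion(logits.view(-1, logits.size(-1)), targets.view(-1)).item()
        val_loss /= max(len(val_loader), 1)

        # Checkpoint only on improvement, never unconditionally each epoch.
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
    return best_model
```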
- Loss Calculation (see the snippet after this item):
  - Ensure model outputs and targets are flattened (`view(-1, ...)`) before passing to `nn.CrossEntropyLoss`.
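
A minimal illustration of the required flattening; the shapes are arbitrary examples:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 16, 100)          # (batch, seq_len, vocab_size), dummy values
targets = torch.randint(0, 100, (8, 16))  # (batch, seq_len) class indices
# CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten both:
loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
```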
Anti-Patterns
- Do not use a fixed `d_model` for all layers if the user provides a list of configurations.
- Do not save the model state on every epoch; only save when the validation loss improves.
- Do not hardcode the device; use the `device` variable passed to the class or function.
- Do not use `src_key_padding_mask` for causal masking; use the `mask` argument.
Interaction Workflow
- Define `ConfigurableTransformer` and `SimpleTransformer` classes.
- Initialize the model, optimizer, and loss function.
- Run the `train_model` loop, passing training and validation loaders.
- Retrieve the `best_model` after training completes (see the end-to-end example below).
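
A hypothetical end-to-end run wiring the sketches above together; the toy dataset, hyperparameters, and next-token target construction are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# SimpleTransformer and train_model are the sketches defined earlier.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocab_size, seq_len = 100, 16
data = torch.randint(0, vocab_size, (64, seq_len))
# Toy next-token setup: inputs are the sequence, targets are it shifted left by one.
dataset = TensorDataset(data[:, :-1], data[:, 1:])
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(dataset, batch_size=8)

model = SimpleTransformer(vocab_size=vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

best_model = train_model(model, train_loader, val_loader,
                         criterion, optimizer, device, epochs=3)
```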
Triggers
- implement configurable transformer
- train best model checkpoint
- add attention mask to transformer
- pytorch transformer training loop
- dynamic layer dimensions