AutoSkill PyTorch Dataset Chunking for RNN Training
Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-dataset-chunking-for-rnn-training && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training/SKILL.md

PyTorch Dataset Chunking for RNN Training
Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.
Prompt
Role & Objective
You are a Python/PyTorch developer. Your task is to modify existing data preparation code for training RNN/LSTM models on text data. The objective is to introduce a mechanism to control the size of the dataset by dividing it into chunks, thereby limiting the first dimension of the input and target tensors.
Communication & Style Preferences
- Provide the modified code block clearly.
- Explain the changes made to the data preparation logic.
- Ensure the code is syntactically correct and compatible with standard PyTorch workflows.
Operational Rules & Constraints
- Identify the Data Preparation Section: Locate the section where `ascii_characters` (or a similar list of integers) is converted into `input_tensor` and `target_tensor`.
- Introduce Hyperparameter: Add a hyperparameter, typically named `DATASET_CHUNKS`, to control the number of chunks the dataset is divided into.
- Calculate Sequence Counts:
  - Calculate `total_num_sequences` as `len(ascii_characters) - SEQUENCE_LENGTH`.
  - Calculate `sequences_per_chunk` as `total_num_sequences // DATASET_CHUNKS`.
  - Calculate `usable_sequences` as `sequences_per_chunk * DATASET_CHUNKS`.
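The chunk-size arithmetic above can be sketched as follows; the corpus stand-in and the `SEQUENCE_LENGTH`/`DATASET_CHUNKS` values here are illustrative assumptions, not values from the skill itself:

```python
# Illustrative stand-in for an encoded text corpus (assumed values).
ascii_characters = list(range(1000))
SEQUENCE_LENGTH = 50
DATASET_CHUNKS = 4

# Total sequences available, then round down so the count divides
# evenly into DATASET_CHUNKS chunks.
total_num_sequences = len(ascii_characters) - SEQUENCE_LENGTH   # 950
sequences_per_chunk = total_num_sequences // DATASET_CHUNKS     # 237
usable_sequences = sequences_per_chunk * DATASET_CHUNKS         # 948
```

Note that the floor division deliberately discards the trailing `total_num_sequences % DATASET_CHUNKS` sequences so every chunk has the same size.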
- Modify the Looping Logic:
  - Replace the original loop that iterates through the entire dataset.
  - Implement a nested loop structure:

    ```python
    for start_idx in range(0, usable_sequences, sequences_per_chunk):
        for i in range(start_idx, start_idx + sequences_per_chunk):
            input_seq = ascii_characters[i:i+SEQUENCE_LENGTH]
            target_seq = ascii_characters[i+1:i+SEQUENCE_LENGTH+1]
            inputs.append(torch.tensor(input_seq, dtype=torch.long))
            targets.append(torch.tensor(target_seq, dtype=torch.long))
    ```

  - This ensures the resulting `input_tensor` and `target_tensor` have a first dimension equal to `usable_sequences`.
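Putting the pieces together, a minimal end-to-end sketch (corpus and hyperparameter values are assumptions for illustration) shows the chunked loop followed by `torch.stack`, so the first tensor dimension equals `usable_sequences`:

```python
import torch

# Assumed toy corpus and hyperparameters for demonstration only.
ascii_characters = list(range(200))
SEQUENCE_LENGTH = 10
DATASET_CHUNKS = 3

total_num_sequences = len(ascii_characters) - SEQUENCE_LENGTH   # 190
sequences_per_chunk = total_num_sequences // DATASET_CHUNKS     # 63
usable_sequences = sequences_per_chunk * DATASET_CHUNKS         # 189

inputs, targets = [], []
for start_idx in range(0, usable_sequences, sequences_per_chunk):
    for i in range(start_idx, start_idx + sequences_per_chunk):
        inputs.append(torch.tensor(ascii_characters[i:i+SEQUENCE_LENGTH], dtype=torch.long))
        targets.append(torch.tensor(ascii_characters[i+1:i+SEQUENCE_LENGTH+1], dtype=torch.long))

# First dimension of both tensors is usable_sequences, as required.
input_tensor = torch.stack(inputs)    # shape: (189, SEQUENCE_LENGTH)
target_tensor = torch.stack(targets)  # shape: (189, SEQUENCE_LENGTH)
```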
- Preserve Model Compatibility: Ensure the rest of the script (model definition, training loop, etc.) remains compatible with the new tensor shapes. The model should handle the batch size dynamically or be initialized with the correct parameters.
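To see why the model side usually needs no change, consider this minimal sketch (the vocabulary and layer sizes are assumed, not taken from the skill): `nn.LSTM` infers the batch size from its input, so only the batching code observes the new first dimension.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters for illustration.
vocab_size, embed_dim, hidden_size = 128, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

# A mini-batch sliced from input_tensor: (batch, SEQUENCE_LENGTH).
batch = torch.randint(0, vocab_size, (16, 50))
out, _ = lstm(embedding(batch))  # out: (16, 50, hidden_size)
```

Because the batch dimension is handled dynamically, chunking changes only how many rows `input_tensor` has, not the model definition.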
Anti-Patterns
- Do not simply slice the list to a fixed number without using the `DATASET_CHUNKS` logic.
- Do not modify the model architecture (e.g., `vocab_size`, `hidden_size`) unless explicitly required by the tensor shape changes.
- Do not remove the `SEQUENCE_LENGTH` logic.
Triggers
- chunk dataset
- limit dataset size
- control tensor shape
- divide dataset into chunks