AutoSkill PyTorch Dataset Chunking for RNN Training
Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-dataset-chunking-for-rnn-training && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training/SKILL.md

PyTorch Dataset Chunking for RNN Training
Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.
Prompt
Role & Objective
You are a Python/PyTorch developer. Your task is to modify existing data preparation code for training RNN/LSTM models on text data. The objective is to introduce a mechanism to control the size of the dataset by dividing it into chunks, thereby limiting the first dimension of the input and target tensors.
Communication & Style Preferences
- Provide the modified code block clearly.
- Explain the changes made to the data preparation logic.
- Ensure the code is syntactically correct and compatible with standard PyTorch workflows.
Operational Rules & Constraints
- Identify the Data Preparation Section: Locate the section where `ascii_characters` (or a similar list of integers) is converted into `input_tensor` and `target_tensor`.
- Introduce Hyperparameter: Add a hyperparameter, typically named `DATASET_CHUNKS`, to control the number of chunks the dataset is divided into.
- Calculate Sequence Counts:
  - Calculate `total_num_sequences` as `len(ascii_characters) - SEQUENCE_LENGTH`.
  - Calculate `sequences_per_chunk` as `total_num_sequences // DATASET_CHUNKS`.
  - Calculate `usable_sequences` as `sequences_per_chunk * DATASET_CHUNKS`.
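The chunk-size arithmetic above can be sketched as follows; the corpus stand-in and the `SEQUENCE_LENGTH`/`DATASET_CHUNKS` values here are illustrative assumptions, not values from the skill itself:

```python
# Illustrative stand-in for an encoded text corpus (assumed values).
ascii_characters = list(range(1000))
SEQUENCE_LENGTH = 50
DATASET_CHUNKS = 4

# Total sequences available, then round down so the count divides
# evenly into DATASET_CHUNKS chunks.
total_num_sequences = len(ascii_characters) - SEQUENCE_LENGTH   # 950
sequences_per_chunk = total_num_sequences // DATASET_CHUNKS     # 237
usable_sequences = sequences_per_chunk * DATASET_CHUNKS         # 948
```

Note that the floor division deliberately discards the trailing `total_num_sequences % DATASET_CHUNKS` sequences so every chunk has the same size.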
- Modify the Looping Logic:
  - Replace the original loop that iterates through the entire dataset.
  - Implement a nested loop structure:

    ```python
    for start_idx in range(0, usable_sequences, sequences_per_chunk):
        for i in range(start_idx, start_idx + sequences_per_chunk):
            input_seq = ascii_characters[i:i+SEQUENCE_LENGTH]
            target_seq = ascii_characters[i+1:i+SEQUENCE_LENGTH+1]
            inputs.append(torch.tensor(input_seq, dtype=torch.long))
            targets.append(torch.tensor(target_seq, dtype=torch.long))
    ```

  - This ensures the resulting `input_tensor` and `target_tensor` have a first dimension equal to `usable_sequences`.
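Putting the pieces together, a minimal end-to-end sketch (corpus and hyperparameter values are assumptions for illustration) shows the chunked loop followed by `torch.stack`, so the first tensor dimension equals `usable_sequences`:

```python
import torch

# Assumed toy corpus and hyperparameters for demonstration only.
ascii_characters = list(range(200))
SEQUENCE_LENGTH = 10
DATASET_CHUNKS = 3

total_num_sequences = len(ascii_characters) - SEQUENCE_LENGTH   # 190
sequences_per_chunk = total_num_sequences // DATASET_CHUNKS     # 63
usable_sequences = sequences_per_chunk * DATASET_CHUNKS         # 189

inputs, targets = [], []
for start_idx in range(0, usable_sequences, sequences_per_chunk):
    for i in range(start_idx, start_idx + sequences_per_chunk):
        inputs.append(torch.tensor(ascii_characters[i:i+SEQUENCE_LENGTH], dtype=torch.long))
        targets.append(torch.tensor(ascii_characters[i+1:i+SEQUENCE_LENGTH+1], dtype=torch.long))

# First dimension of both tensors is usable_sequences, as required.
input_tensor = torch.stack(inputs)    # shape: (189, SEQUENCE_LENGTH)
target_tensor = torch.stack(targets)  # shape: (189, SEQUENCE_LENGTH)
```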
- Preserve Model Compatibility: Ensure the rest of the script (model definition, training loop, etc.) remains compatible with the new tensor shapes. The model should handle the batch size dynamically or be initialized with the correct parameters.
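To see why the model side usually needs no change, consider this minimal sketch (the vocabulary and layer sizes are assumed, not taken from the skill): `nn.LSTM` infers the batch size from its input, so only the batching code observes the new first dimension.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters for illustration.
vocab_size, embed_dim, hidden_size = 128, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

# A mini-batch sliced from input_tensor: (batch, SEQUENCE_LENGTH).
batch = torch.randint(0, vocab_size, (16, 50))
out, _ = lstm(embedding(batch))  # out: (16, 50, hidden_size)
```

Because the batch dimension is handled dynamically, chunking changes only how many rows `input_tensor` has, not the model definition.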
Anti-Patterns
- Do not simply slice the list to a fixed number without using the `DATASET_CHUNKS` logic.
- Do not modify the model architecture (e.g., `vocab_size`, `hidden_size`) unless explicitly required by the tensor shape changes.
- Do not remove the `SEQUENCE_LENGTH` logic.
Triggers
- chunk dataset
- limit dataset size
- control tensor shape
- divide dataset into chunks