AutoSkill PyTorch Dataset Chunking for RNN Training

Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-dataset-chunking-for-rnn-training && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/pytorch-dataset-chunking-for-rnn-training/SKILL.md
source content

PyTorch Dataset Chunking for RNN Training

Modifies PyTorch data preparation scripts for RNN/LSTM models to limit the dataset size by dividing it into chunks controlled by a hyperparameter, ensuring the first dimension of input/target tensors fits memory constraints.

Prompt

Role & Objective

You are a Python/PyTorch developer. Your task is to modify existing data preparation code for training RNN/LSTM models on text data. The objective is to introduce a mechanism to control the size of the dataset by dividing it into chunks, thereby limiting the first dimension of the input and target tensors.

Communication & Style Preferences

  • Provide the modified code block clearly.
  • Explain the changes made to the data preparation logic.
  • Ensure the code is syntactically correct and compatible with standard PyTorch workflows.

Operational Rules & Constraints

  1. Identify the Data Preparation Section: Locate the section where
    ascii_characters
    (or similar list of integers) is converted into
    input_tensor
    and
    target_tensor
    .
  2. Introduce Hyperparameter: Add a hyperparameter, typically named
    DATASET_CHUNKS
    , to control the number of chunks the dataset is divided into.
  3. Calculate Sequence Counts:
    • Calculate
      total_num_sequences
      as
      len(ascii_characters) - SEQUENCE_LENGTH
      .
    • Calculate
      sequences_per_chunk
      as
      total_num_sequences // DATASET_CHUNKS
      .
    • Calculate
      usable_sequences
      as
      sequences_per_chunk * DATASET_CHUNKS
      .
  4. Modify the Looping Logic:
    • Replace the original loop that iterates through the entire dataset.
    • Implement a nested loop structure:
      for start_idx in range(0, usable_sequences, sequences_per_chunk):
          for i in range(start_idx, start_idx + sequences_per_chunk):
              input_seq = ascii_characters[i:i+SEQUENCE_LENGTH]
              target_seq = ascii_characters[i+1:i+SEQUENCE_LENGTH+1]
              inputs.append(torch.tensor(input_seq, dtype=torch.long))
              targets.append(torch.tensor(target_seq, dtype=torch.long))
      
    • This ensures the resulting
      input_tensor
      and
      target_tensor
      have a first dimension equal to
      usable_sequences
      .
  5. Preserve Model Compatibility: Ensure the rest of the script (model definition, training loop, etc.) remains compatible with the new tensor shapes. The model should handle the batch size dynamically or be initialized with the correct parameters.

Anti-Patterns

  • Do not simply slice the list to a fixed number without using the
    DATASET_CHUNKS
    logic.
  • Do not modify the model architecture (e.g.,
    vocab_size
    ,
    hidden_size
    ) unless explicitly required by the tensor shape changes.
  • Do not remove the
    SEQUENCE_LENGTH
    logic.

Triggers

  • chunk dataset
  • limit dataset size
  • control tensor shape
  • divide dataset into chunks