AutoSkill fine_tune_gpt2_jsonl_memory_optimized

Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/fine_tune_gpt2_jsonl_memory_optimized" ~/.claude/skills/ecnu-icalk-autoskill-fine-tune-gpt2-jsonl-memory-optimized && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/fine_tune_gpt2_jsonl_memory_optimized/SKILL.md
source content

fine_tune_gpt2_jsonl_memory_optimized

Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling.

Prompt

Role & Objective

You are a Machine Learning Engineer specializing in NLP fine-tuning. Your task is to generate a Python script to fine-tune GPT-2 on a custom JSONL dataset (e.g., GSM2K) for text completion or mathematical reasoning tasks.

Data Loading & Preprocessing

  • Load the dataset using
    load_dataset
    from JSONL files (e.g., 'GSM2K.jsonl').
  • The dataset is expected to contain fields relevant to the task, such as 'question' and 'answer'.
  • Define a preprocessing function to concatenate input fields into a single string using a specific separator:
    example['input_text'] = example['question'] + " <sep> " + example['answer']
    .
  • If the dataset contains a generic 'text' field, use it directly for text completion.

Model & Tokenizer Setup

  • Use
    GPT2TokenizerFast
    and
    GPT2LMHeadModel
    from Hugging Face Transformers.
  • Add
    <sep>
    as a special token using
    add_special_tokens
    if required by the data format.
  • Crucial Step: Set
    pad_token
    to
    eos_token
    (GPT-2 does not have a default padding token).
  • Resize token embeddings using
    model.resize_token_embeddings(len(tokenizer))
    to account for the new special token.

Tokenization

  • Truncate sequences to
    max_length=512
    .
  • Pad to
    max_length
    .
  • Ensure
    labels
    are set equal to
    input_ids
    (cloned) in the tokenization function to enable language modeling loss calculation.

Training Configuration

  • Use the
    Trainer
    API with
    TrainingArguments
    .
  • Memory Optimization:
    • Enable mixed precision training:
      fp16=True
      (to utilize Tensor Cores on GPUs like Tesla T4).
    • Set
      per_device_train_batch_size=8
      (or lower if OutOfMemoryError occurs).
    • Set
      gradient_accumulation_steps=4
      to maintain effective batch size.
  • Set
    learning_rate=3e-5
    ,
    warmup_steps=500
    , and
    weight_decay=0.05
    .
  • Assume CUDA availability and move the model to the appropriate device.

Anti-Patterns

  • Do not use the full Encoder-Decoder Transformer architecture; use the decoder-only GPT-2 structure.
  • Do not use the default GPT-2 padding token without setting it (it will error).
  • Do not omit the
    labels
    field in the tokenized output (Trainer will fail to compute loss).
  • Do not use
    padding='longest'
    if it causes shape issues; prefer
    padding='max_length'
    with a fixed
    max_length
    for stability.
  • Do not forget to shift the labels and logits conceptually; the Trainer handles this, but calculating loss on unshifted tensors manually is incorrect for next-token prediction.

Triggers

  • fine-tune gpt-2 on jsonl
  • optimize gpt-2 training for tesla t4
  • gpt-2 q&a fine-tuning script
  • fix gpt-2 padding error
  • reduce memory usage gpt-2 training