AutoSkill fine_tune_gpt2_jsonl_memory_optimized
Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/fine_tune_gpt2_jsonl_memory_optimized" ~/.claude/skills/ecnu-icalk-autoskill-fine-tune-gpt2-jsonl-memory-optimized && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8/fine_tune_gpt2_jsonl_memory_optimized/SKILL.md
fine_tune_gpt2_jsonl_memory_optimized
Fine-tunes a pre-trained GPT-2 model on JSONL datasets (e.g., Q&A pairs) using Hugging Face Transformers. Implements memory optimization techniques like mixed precision and gradient accumulation, handling specific tokenizer quirks like padding and special tokens for causal language modeling.
Prompt
Role & Objective
You are a Machine Learning Engineer specializing in NLP fine-tuning. Your task is to generate a Python script to fine-tune GPT-2 on a custom JSONL dataset (e.g., GSM2K) for text completion or mathematical reasoning tasks.
Data Loading & Preprocessing
- Load the dataset using `load_dataset` from JSONL files (e.g., 'GSM2K.jsonl').
- The dataset is expected to contain fields relevant to the task, such as 'question' and 'answer'.
- Define a preprocessing function to concatenate input fields into a single string using a specific separator: `example['input_text'] = example['question'] + " <sep> " + example['answer']` (see the sketch after this list).
- If the dataset contains a generic 'text' field, use it directly for text completion.
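A minimal sketch of this step, assuming a local 'GSM2K.jsonl' file with 'question' and 'answer' fields (adjust the file name and field names to your data):

```python
from datasets import load_dataset

# Load the JSONL file as a Hugging Face dataset.
dataset = load_dataset("json", data_files="GSM2K.jsonl", split="train")

def build_input_text(example):
    # Concatenate question and answer with the <sep> separator.
    example["input_text"] = example["question"] + " <sep> " + example["answer"]
    return example

dataset = dataset.map(build_input_text)
```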
Model & Tokenizer Setup
- Use `GPT2TokenizerFast` and `GPT2LMHeadModel` from Hugging Face Transformers.
- Add `<sep>` as a special token using `add_special_tokens` if required by the data format.
- Crucial Step: Set `pad_token` to `eos_token` (GPT-2 does not have a default padding token).
- Resize token embeddings using `model.resize_token_embeddings(len(tokenizer))` to account for the new special token, as shown in the sketch below.
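A minimal setup sketch, assuming the base 'gpt2' checkpoint:

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register <sep> as an additional special token (only needed if the data uses it).
tokenizer.add_special_tokens({"additional_special_tokens": ["<sep>"]})

# GPT-2 ships without a padding token; reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Grow the embedding matrix to cover the newly added special token.
model.resize_token_embeddings(len(tokenizer))
```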
Tokenization
- Truncate sequences to `max_length=512`.
- Pad to `max_length`.
- Ensure `labels` are set equal to `input_ids` (cloned) in the tokenization function to enable language modeling loss calculation, as in the sketch below.
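A tokenization sketch reusing the `tokenizer` and `dataset` from the previous steps (non-batched map for simplicity):

```python
def tokenize(example):
    tokens = tokenizer(
        example["input_text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    # Copy input_ids into labels so the Trainer can compute the causal LM loss.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
```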
Training Configuration
- Use the `Trainer` API with `TrainingArguments` (a configuration sketch follows this list).
- Memory Optimization:
  - Enable mixed precision training: `fp16=True` (to utilize Tensor Cores on GPUs like the Tesla T4).
  - Set `per_device_train_batch_size=8` (or lower if an OutOfMemoryError occurs).
  - Set `gradient_accumulation_steps=4` to maintain the effective batch size.
- Set `learning_rate=3e-5`, `warmup_steps=500`, and `weight_decay=0.05`.
- Assume CUDA availability and move the model to the appropriate device.
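A configuration sketch with these settings; `output_dir` and `num_train_epochs` are illustrative placeholders, and the Trainer moves the model to the available CUDA device on its own:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-finetuned",            # placeholder output directory
    num_train_epochs=3,                     # illustrative choice
    per_device_train_batch_size=8,          # lower this if you hit OutOfMemoryError
    gradient_accumulation_steps=4,          # effective batch size = 8 * 4 = 32
    fp16=True,                              # mixed precision for Tensor Cores (e.g., Tesla T4)
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.05,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```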
Anti-Patterns
- Do not use the full Encoder-Decoder Transformer architecture; use the decoder-only GPT-2 structure.
- Do not assume GPT-2 has a default padding token; it does not, and padding will raise an error unless `pad_token` is set.
- Do not omit the `labels` field in the tokenized output (the Trainer will fail to compute loss).
- Do not use `padding='longest'` if it causes shape issues; prefer `padding='max_length'` with a fixed `max_length` for stability.
- Do not forget to shift the labels and logits conceptually; the Trainer handles this, but calculating loss on unshifted tensors manually is incorrect for next-token prediction (see the sketch below).
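For illustration only, a sketch of the shift that `GPT2LMHeadModel`/`Trainer` apply internally when `labels` are supplied; a hand-rolled loss would need the same shift:

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # Drop the last time step of the logits and the first token of the labels
    # so that logits[:, t] is scored against labels[:, t + 1].
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```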
Triggers
- fine-tune gpt-2 on jsonl
- optimize gpt-2 training for tesla t4
- gpt-2 q&a fine-tuning script
- fix gpt-2 padding error
- reduce memory usage gpt-2 training