AutoSkill · Fine-tune DistilBert on JSONL Dataset
Generates a Python script to fine-tune a DistilBert model for sequence classification on a custom JSONL dataset with 'question' and 'answer' columns, using custom label encoding (no sklearn), progress logging, and error handling.
Install

Source · Clone the upstream repo:

```sh
git clone https://github.com/ECNU-ICALK/AutoSkill
```

Claude Code · Install into ~/.claude/skills/:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/fine-tune-distilbert-on-jsonl-dataset" ~/.claude/skills/ecnu-icalk-autoskill-fine-tune-distilbert-on-jsonl-dataset && rm -rf "$T"
```
Manifest: SkillBank/ConvSkill/english_gpt4_8/fine-tune-distilbert-on-jsonl-dataset/SKILL.md

Source content
Fine-tune DistilBert on JSONL Dataset
Generates a Python script to fine-tune a DistilBert model for sequence classification on a custom JSONL dataset with 'question' and 'answer' columns, using custom label encoding (no sklearn), progress logging, and error handling.
Prompt
Role & Objective
You are a Machine Learning Engineer. Write a Python script to fine-tune a DistilBert model on a custom JSONL dataset for a sequence classification task.
Operational Rules & Constraints
- Dataset Format: The input is a JSONL file containing 'question' and 'answer' columns.
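  For illustration, two hypothetical records in this format (the values are invented); each line is a standalone JSON object:

  ```jsonl
  {"question": "What is the capital of France?", "answer": "Paris"}
  {"question": "What is the capital of Japan?", "answer": "Tokyo"}
  ```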
- Libraries: Use `transformers`, `datasets`, and `torch`. Do not use `sklearn`.
- Model: Load `DistilBertForSequenceClassification` from 'distilbert-base-uncased'.
- Label Encoding:
  - Extract all unique answers from the dataset.
  - Create a custom mapping dictionary: `answer_to_id = {answer: idx for idx, answer in enumerate(unique_answers)}`.
  - Map the 'answer' column to integer labels using this dictionary.
  - Remove the original 'answer' column after mapping.
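  A minimal sketch of these steps, assuming the file has been loaded with `datasets.load_dataset` (the file name `data.jsonl` is hypothetical):

  ```python
  # Sketch only; 'data.jsonl' is a hypothetical file name.
  from datasets import load_dataset

  dataset = load_dataset("json", data_files="data.jsonl", split="train")

  # Custom label encoding without sklearn: collect the unique answers
  # and assign each one a stable integer id.
  unique_answers = sorted(set(dataset["answer"]))
  answer_to_id = {answer: idx for idx, answer in enumerate(unique_answers)}

  # Map the 'answer' column to integer labels, then drop the original column.
  dataset = dataset.map(lambda ex: {"label": answer_to_id[ex["answer"]]})
  dataset = dataset.remove_columns(["answer"])
  ```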
- Tokenization: Use `DistilBertTokenizerFast`. Tokenize the 'question' column with `padding='max_length'` and `truncation=True`.
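  A sketch of this step, continuing from the encoded dataset above:

  ```python
  from transformers import DistilBertTokenizerFast

  tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

  def tokenize(batch):
      # Pad to the model's maximum length and truncate longer questions.
      return tokenizer(batch["question"], padding="max_length", truncation=True)

  dataset = dataset.map(tokenize, batched=True)
  ```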
- Training Configuration:
  - Use the `Trainer` API.
  - Set `TrainingArguments` with `output_dir='./results'`, `num_train_epochs=2`, `per_device_train_batch_size=32`, `evaluation_strategy='epoch'`, `save_strategy='epoch'`, `load_best_model_at_end=True`, and `logging_dir='./logs'`.
  - Ensure the model is initialized with `num_labels` equal to the number of unique answers.
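  A sketch of this configuration; the 90/10 train/eval split is an assumption, since the rules above do not say how the evaluation set is produced:

  ```python
  from transformers import (DistilBertForSequenceClassification, Trainer,
                            TrainingArguments)

  model = DistilBertForSequenceClassification.from_pretrained(
      "distilbert-base-uncased", num_labels=len(answer_to_id))

  # Assumption: hold out 10% of the data for per-epoch evaluation.
  splits = dataset.train_test_split(test_size=0.1)

  args = TrainingArguments(
      output_dir="./results",
      num_train_epochs=2,
      per_device_train_batch_size=32,
      evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers
      save_strategy="epoch",
      load_best_model_at_end=True,
      logging_dir="./logs",
  )

  trainer = Trainer(
      model=model,
      args=args,
      train_dataset=splits["train"],
      eval_dataset=splits["test"],
  )
  ```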
- Logging: Add print statements to indicate code progression (e.g., "Dataset loaded successfully", "Labels encoded", "Starting training", "Model saved").
- Error Handling: Wrap the main logic in a `try...except` block to catch and print exceptions.
- Saving: Save both the model and tokenizer to the output directory.
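Putting the last three rules together, a hedged sketch of the script's outer structure (the print messages follow the examples given above):

```python
try:
    print("Starting training")
    trainer.train()

    # Save both the fine-tuned model and the tokenizer.
    model.save_pretrained("./results")
    tokenizer.save_pretrained("./results")
    print("Model saved")
except Exception as exc:
    # Catch and print any exception raised by the main logic.
    print(f"Error: {exc}")
```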
Anti-Patterns
- Do not use `sklearn.preprocessing.LabelEncoder`.
- Do not omit print statements or error handling.
- Do not assume the 'answer' column is already numerical.
Triggers
- finetune distilbert on jsonl
- train distilbert on custom dataset
- code to finetune model on question answer pairs
- distilbert classification script without sklearn