install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/kaggle-finetune" ~/.claude/skills/comeonoliver-skillshub-kaggle-finetune && rm -rf "$T"
manifest:
skills/TerminalSkills/skills/kaggle-finetune/SKILL.md
Kaggle Fine-Tuning Workflow
Overview
Complete pipeline for downloading Kaggle datasets and fine-tuning LLMs. Handles dataset discovery, download via Kaggle CLI, preprocessing into HuggingFace chat format, and training with PEFT/LoRA for memory-efficient fine-tuning.
Prerequisites
pip install kaggle peft transformers accelerate bitsandbytes datasets trl
Set Kaggle API token:
export KAGGLE_API_TOKEN=KGAT_xxxxx
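Before running any downloads it can help to confirm credentials are actually visible to the CLI. A minimal sketch, assuming the CLI accepts either the token exported above or the older `~/.kaggle/kaggle.json` / `KAGGLE_USERNAME` + `KAGGLE_KEY` setups:

```python
import os
from pathlib import Path

# Check for any of the credential sources the Kaggle CLI can read.
has_token = bool(os.environ.get("KAGGLE_API_TOKEN"))
has_json = (Path.home() / ".kaggle" / "kaggle.json").exists()
has_user_key = bool(os.environ.get("KAGGLE_USERNAME")) and bool(os.environ.get("KAGGLE_KEY"))

if not (has_token or has_json or has_user_key):
    raise SystemExit("No Kaggle credentials found; set KAGGLE_API_TOKEN or create ~/.kaggle/kaggle.json.")
print("Kaggle credentials detected.")
```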
Instructions
Step 1: Search and download datasets
```bash
# Search for relevant datasets
kaggle datasets list -s "customer service conversation" --sort-by votes

# Download specific dataset
kaggle datasets download -d bitext/bitext-gen-ai-chatbot-customer-support-dataset -p ./data --unzip
```
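The same two operations can also be done from Python through the `kaggle` package's API client. A minimal sketch (the dataset slug matches the CLI example above; the top-5 listing loop is only illustrative):

```python
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate with the same credentials the CLI uses (env vars or kaggle.json)
api = KaggleApi()
api.authenticate()

# Search for candidate datasets, then download and unzip one into ./data
for ds in api.dataset_list(search="customer service conversation", sort_by="votes")[:5]:
    print(ds.ref)

api.dataset_download_files(
    "bitext/bitext-gen-ai-chatbot-customer-support-dataset",
    path="./data",
    unzip=True,
)
```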
Recommended datasets for chatbots:
| Dataset | Use Case |
|---|---|
| bitext/bitext-gen-ai-chatbot-customer-support-dataset | Customer support |
| | General chat |
| | Call center |
| | FAQ format |
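Whichever dataset you pick, glance at the raw CSV before preprocessing; column names vary between datasets. The file name below is a placeholder for whatever the download actually produced:

```python
import pandas as pd

# Placeholder path: adjust to the file unzipped into ./data
df = pd.read_csv("data/customer_support.csv")

print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # confirm which columns hold the user input and the response
print(df.isna().sum())      # missing values that would become empty training turns
print(df.head(3))           # eyeball a few examples for quality
```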
Step 2: Preprocess into chat format
Convert data to HuggingFace messages format:
```python
import pandas as pd
import json

def convert_to_chat_format(input_path, output_path, user_col, assistant_col, system_prompt=None):
    df = pd.read_csv(input_path)
    records = []
    for _, row in df.iterrows():
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": str(row[user_col])})
        messages.append({"role": "assistant", "content": str(row[assistant_col])})
        records.append({"messages": messages})
    with open(output_path, 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')
    return len(records)

# Example usage
convert_to_chat_format(
    "data/customer_support.csv",
    "data/train.jsonl",
    user_col="instruction",
    assistant_col="response",
    system_prompt="You are a helpful customer service assistant."
)
```
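To confirm the JSONL is well-formed before training, reload it the same way the trainer will. The 5% eval split below is an optional, arbitrary choice rather than part of the original workflow:

```python
from datasets import load_dataset

# Reload the converted file exactly as the trainer does in Step 3
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
print(ds[0]["messages"])        # should be a system/user/assistant message list
print(f"{len(ds)} examples")

# Optional: hold out a small evaluation set (5% is an arbitrary choice)
splits = ds.train_test_split(test_size=0.05, seed=42)
splits["train"].to_json("data/train_split.jsonl")
splits["test"].to_json("data/eval_split.jsonl")
```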
Step 3: Fine-tune with LoRA
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType
from trl import SFTTrainer, SFTConfig
import torch

# Model selection by VRAM: 8GB→1.5B, 16GB→7B (4-bit), 24GB→8B
model_name = "Qwen/Qwen2.5-3B-Instruct"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./model-finetune",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        fp16=True,
        max_seq_length=512,
    ),
    train_dataset=dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./model-lora")
```
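As a sanity check that LoRA is keeping the trainable parameter count small, peft can report the parameter budget directly. A stand-alone sketch using the same LoRA config as above (the base model is loaded without quantization purely for the count, which should come out to a small fraction of the full model):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap the base model with the same LoRA config used in Step 3; SFTTrainer
# performs the equivalent wrapping internally when peft_config is passed.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
peft_model = get_peft_model(
    base,
    LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
)
peft_model.print_trainable_parameters()
```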
Step 4: Test and deploy
```python
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = PeftModel.from_pretrained(model, "./model-lora")

messages = [{"role": "user", "content": "How can I reset my password?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
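If you want a single standalone checkpoint instead of base model plus adapter, the LoRA weights can be folded into the base model with peft's `merge_and_unload()`. A sketch (the `./model-merged` directory name is just an example, not part of the original workflow):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Same base model as in Step 3
model_name = "Qwen/Qwen2.5-3B-Instruct"

# Load the base weights, attach the adapter, then merge the LoRA deltas into
# the base weights so the model can be served without peft installed.
base = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
merged = PeftModel.from_pretrained(base, "./model-lora").merge_and_unload()

# Example output directory
merged.save_pretrained("./model-merged")
AutoTokenizer.from_pretrained(model_name).save_pretrained("./model-merged")
```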
Examples
Example 1: Fine-tune a customer service chatbot from a Kaggle dataset
User prompt: "Download the Bitext customer support dataset from Kaggle and fine-tune Qwen2.5-3B-Instruct on it using LoRA. I have a 16GB GPU."
The agent will:
- Verify the Kaggle CLI is installed and `KAGGLE_API_TOKEN` is set.
- Run `kaggle datasets download -d bitext/bitext-gen-ai-chatbot-customer-support-dataset -p ./data --unzip` to fetch the dataset.
- Inspect the CSV columns to identify the user input and assistant response fields.
- Write and execute a preprocessing script that converts the CSV into JSONL chat format with a system prompt like "You are a helpful customer service assistant."
- Configure a LoRA fine-tune with `r=16`, 4-bit quantization, batch size 2 with gradient accumulation of 8, and train for 3 epochs.
- Save the LoRA adapter to `./model-lora/` and run a test inference with a sample prompt like "How do I reset my password?"
Example 2: Build a medical FAQ chatbot from Kaggle mental health data
User prompt: "Find a mental health FAQ dataset on Kaggle and prepare it for fine-tuning. I only have a CPU, so pick a small model."
The agent will:
- Search Kaggle with `kaggle datasets list -s "mental health FAQ" --sort-by votes` and select an appropriate dataset.
- Download and unzip the dataset to `./data/`.
- Convert the FAQ pairs into JSONL chat format with a system prompt suited to mental health support.
- Select Qwen2.5-1.5B-Instruct as a CPU-friendly model and configure training with `load_in_4bit=True`, batch size 1, gradient accumulation 16, and `max_seq_length=256` to fit in memory.
- Start training and monitor loss, noting it will take several hours on CPU.
Guidelines
- Always verify the Kaggle API token is set as `KAGGLE_API_TOKEN` before attempting downloads; the CLI will fail silently or with cryptic errors without it.
- Choose your base model based on available VRAM: 1.5B parameters for 8GB, 3B-7B (4-bit) for 16GB, and 8B for 24GB.
- If you encounter out-of-memory errors during training, reduce `per_device_train_batch_size` to 1 and increase `gradient_accumulation_steps` to compensate before reducing model size.
- Inspect the raw CSV data before preprocessing to verify column names and data quality; missing values or mismatched columns will silently produce poor training data.
- Start with 3 training epochs and LoRA rank `r=16`; increase epochs to 5 and rank to 32-64 only if evaluation shows the model is underfitting.
- Enable `fp16=True` (or `bf16=True` on Ampere+ GPUs) to halve memory usage and speed up training with minimal accuracy impact; a sketch for picking the flag automatically follows this list.
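A small way to choose the precision flag automatically, assuming a CUDA GPU (`torch.cuda.is_bf16_supported()` reports True on Ampere and newer):

```python
import torch
from trl import SFTConfig

# Prefer bf16 on GPUs that support it (Ampere and newer), otherwise fall back to fp16.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = SFTConfig(
    output_dir="./model-finetune",
    bf16=use_bf16,
    fp16=not use_bf16,
    # ...remaining arguments as in Step 3 (epochs, batch size, learning rate, etc.)
)
```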