Skills peft-fine-tuning

install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/peft-fine-tuning" ~/.claude/skills/terminalskills-skills-peft-fine-tuning && rm -rf "$T"
manifest: skills/peft-fine-tuning/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
  • pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content

PEFT Fine-Tuning

Overview

Fine-tune large language models efficiently using Parameter-Efficient Fine-Tuning (PEFT) methods. Train 7B to 70B parameter models on consumer and workstation GPUs (8-48 GB VRAM) using LoRA, QLoRA, and 25+ adapter methods from the Hugging Face PEFT library. Avoid the cost and hardware requirements of full fine-tuning while achieving comparable results.

Instructions

When a user asks to fine-tune a model, determine the approach:

Task A: Set up the environment

pip install torch transformers datasets peft accelerate bitsandbytes trl
# For Flash Attention 2 (recommended for speed)
pip install flash-attn --no-build-isolation

Verify GPU availability:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Task B: Fine-tune with LoRA (16+ GB VRAM)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer

# 1. Load base model
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank (8-64; higher = more capacity)
    lora_alpha=32,                 # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: ~41.9M || all params: ~8.07B || trainable%: ~0.52

# 3. Load and format dataset
dataset = load_dataset("your-dataset")

def format_prompt(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

# 4. Train
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=format_prompt,
    max_seq_length=2048,  # newer trl versions expect this on SFTConfig instead
)

trainer.train()
trainer.save_model("./lora-adapter")

Task C: Fine-tune with QLoRA (8+ GB VRAM)

QLoRA quantizes the base model to 4-bit, dramatically reducing memory:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = prepare_model_for_kbit_training(model)  # casts norms, enables gradient checkpointing
model = get_peft_model(model, lora_config)
# Now fine-tune with the same SFTTrainer setup from Task B

VRAM requirements with QLoRA:

  • 7B model: ~6 GB
  • 13B model: ~10 GB
  • 70B model: ~36 GB
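These figures can be sanity-checked with rough arithmetic: NF4 weights take about half a byte per parameter, plus overhead for quantization constants, LoRA adapters, optimizer state, and activations. A minimal sketch of that estimate (the overhead term is an assumption, not a measurement, and real usage depends on batch size and sequence length):

```python
def qlora_vram_estimate_gb(n_params_billion: float) -> float:
    """Back-of-envelope 4-bit QLoRA memory estimate (not a measurement).

    NF4 stores ~0.5 bytes per parameter; the flat overhead term stands in
    for quantization constants, LoRA adapters, optimizer state, and
    activations at small batch sizes.
    """
    weights_gb = n_params_billion * 0.5  # 4 bits = 0.5 bytes per parameter
    overhead_gb = 2.5                    # assumed, not measured
    return weights_gb + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B: ~{qlora_vram_estimate_gb(size):.0f} GB")
```

The estimates land in the same ballpark as the table above; treat them as a planning aid, not a guarantee.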

Task D: Merge and export the fine-tuned model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.save_pretrained("./merged-model")

# Convert to GGUF for Ollama/llama.cpp using llama.cpp's conversion script:
# git clone https://github.com/ggerganov/llama.cpp
# pip install -r llama.cpp/requirements.txt
# python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model.gguf

Task E: Prepare a custom dataset

from datasets import Dataset, load_dataset

# Format: instruction-response pairs
data = [
    {"instruction": "Summarize this contract clause.", "input": "...", "output": "..."},
    {"instruction": "Extract the key dates.", "input": "...", "output": "..."},
]

# Create Hugging Face dataset
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)

# Or load from JSONL file
dataset = load_dataset("json", data_files="training_data.jsonl")
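The JSONL format that `load_dataset("json", ...)` expects is one JSON object per line. A minimal sketch of writing the instruction-response pairs above to such a file, using only the standard library (the filename matches the example above):

```python
import json

data = [
    {"instruction": "Summarize this contract clause.", "input": "...", "output": "..."},
    {"instruction": "Extract the key dates.", "input": "...", "output": "..."},
]

# One JSON object per line -- the shape load_dataset("json", ...) expects
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in data:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Round-trip check without the datasets library
with open("training_data.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```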

Examples

Example 1: Fine-tune Llama 3.1 8B for customer support

User request: "Fine-tune Llama 8B on our support ticket data"

# Format support tickets as instruction pairs
def format_support(example):
    return (
        f"### Customer Query:\n{example['question']}\n\n"
        f"### Support Response:\n{example['answer']}"
    )

# Use QLoRA for 8GB VRAM GPUs
# Train for 3 epochs with lr=2e-4, rank=16
# Result: ~2 hours on RTX 4090, adapter size ~30 MB

Example 2: Domain-adapt a model for medical text

User request: "Adapt Mistral 7B to understand medical terminology"

Use continued pre-training with LoRA on a medical corpus, then instruction-tune on medical QA pairs. Set r=32 for higher capacity on specialized domains.

Example 3: Fine-tune a 70B model with QLoRA on 2x A100

User request: "Fine-tune Llama 70B on our internal documents"

Use QLoRA with device_map="auto" to shard across GPUs. Set per_device_train_batch_size=1 with gradient_accumulation_steps=16. Expect ~24 hours for 3 epochs on 10K samples.
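The effective batch size implied by these settings is simple arithmetic (the GPU count comes from this example's 2x A100 setup):

```python
per_device_batch = 1    # per_device_train_batch_size
grad_accum_steps = 16   # gradient_accumulation_steps
num_gpus = 2            # 2x A100 in this example

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = 10_000 // effective_batch  # ~10K training samples

print(effective_batch)   # 32
print(steps_per_epoch)   # 312
```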

Guidelines

  • Start with QLoRA if VRAM is limited; it matches LoRA quality in most benchmarks.
  • Use rank r=16 as a default. Increase to r=32-64 for complex domain adaptation; decrease to r=8 for simple style tuning.
  • Always set lora_alpha = 2 * r as a starting point.
  • Target all linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) for best results.
  • Use a cosine learning rate scheduler with 3% warmup for stable training.
  • Monitor training loss: it should decrease steadily. If it plateaus early, increase rank or learning rate.
  • Evaluate on a held-out test set after each epoch to detect overfitting.
  • Save checkpoints every epoch; adapter files are small (~30-100 MB).
  • Clean, well-formatted training data matters more than quantity. 1,000 high-quality examples often beat 10,000 noisy ones.
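The rank and alpha heuristics above can be captured in a small helper. This is an illustrative sketch, not part of the PEFT API; the task category names are made up, and the returned dict simply holds keyword arguments you could pass to LoraConfig:

```python
def lora_hparams(task: str) -> dict:
    """Pick a LoRA rank by task complexity and set alpha = 2 * r.

    Follows the guidelines above: r=8 for style tuning, r=16 as the
    default, r=32 for domain adaptation. Category names are illustrative.
    """
    ranks = {"style": 8, "general": 16, "domain": 32}
    r = ranks.get(task, 16)  # fall back to the r=16 default
    return {
        "r": r,
        "lora_alpha": 2 * r,  # the 2x-rank starting point from the guidelines
        "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
    }

print(lora_hparams("domain")["lora_alpha"])  # 64
```

Usage would be `LoraConfig(task_type=TaskType.CAUSAL_LM, **lora_hparams("domain"))`, adjusting the ranges as your own evaluations dictate.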