Claude-skill-registry-data lora
Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base.
```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry-data

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/lora" ~/.claude/skills/majiayu000-claude-skill-registry-data-lora && rm -rf "$T"
```
Using LoRA for Fine-tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing pretrained weights and injecting small trainable matrices into transformer layers. This reduces trainable parameters to ~0.1% of the original model while maintaining performance.
Table of Contents
- Core Concepts
- Basic Setup
- Configuration Parameters
- QLoRA (Quantized LoRA)
- Training Patterns
- Saving and Loading
- Merging Adapters
- Best Practices
Core Concepts
How LoRA Works
Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
Where:
- W is the frozen pretrained weight matrix (d × k)
- B is a trainable matrix (d × r)
- A is a trainable matrix (r × k)
- r is the rank, much smaller than d and k
The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
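To make the update concrete, here is a minimal sketch of a LoRA-style linear layer in plain PyTorch. It illustrates the math above and is not the PEFT library's implementation; the class name, initialization, and dimensions are chosen only for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)             # freeze W (d x k)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)    # A: (r x k), small random init
        self.B = nn.Parameter(torch.zeros(d, r))            # B: (d x r), zero init so BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # Equivalent to using W' = W + scaling * (B @ A)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B are trainable: 2 * r * 1024 params vs 1024 * 1024 frozen
layer = LoRALinear(nn.Linear(1024, 1024, bias=False), r=16, alpha=32)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 32768
```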
Why Use LoRA
| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |
Basic Setup
Installation
```bash
pip install peft transformers accelerate
# bitsandbytes is additionally required for QLoRA (4-bit) below
```
Minimal Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 1,238,300,672 || trainable%: 0.28%
```
Configuration Parameters
LoraConfig Options
```python
from peft import LoraConfig, TaskType

config = LoraConfig(
    # Core parameters
    r=16,                                 # Rank of update matrices
    lora_alpha=32,                        # Scaling factor (alpha/r applied to updates)
    target_modules=["q_proj", "v_proj"],  # Layers to adapt

    # Regularization
    lora_dropout=0.05,                    # Dropout on LoRA layers
    bias="none",                          # "none", "all", or "lora_only"

    # Task configuration
    task_type=TaskType.CAUSAL_LM,         # CAUSAL_LM, SEQ_CLS, SEQ_2_SEQ_LM, TOKEN_CLS

    # Advanced
    modules_to_save=None,                 # Additional modules to train (e.g., ["lm_head"])
    layers_to_transform=None,             # Specific layer indices to adapt
    use_rslora=False,                     # Rank-stabilized LoRA scaling
    use_dora=False,                       # Weight-Decomposed LoRA
)
```
Target Modules by Architecture
```python
# Llama, Mistral, Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# GPT-2, GPT-J
target_modules = ["c_attn", "c_proj", "c_fc"]

# BERT, RoBERTa
target_modules = ["query", "key", "value", "dense"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Phi
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
```
Finding Target Modules
```python
import torch

# Print all linear layer names
def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return list(linear_modules)

print(find_target_modules(model))
```
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
Setup
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
```
Memory Requirements
| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
Training Patterns
With Hugging Face Trainer
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    if example["input"]:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training arguments (note higher learning rate)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,              # Higher than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
With SFTTrainer (TRL)
```python
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_seq_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,        # Pass config directly, SFTTrainer applies it
    dataset_text_field="text",
    # Note: in recent TRL releases, dataset_text_field moves into SFTConfig and
    # the tokenizer is passed as processing_class; adjust for your TRL version.
)
trainer.train()
```
Classification Task
```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],  # Train classification head fully
)

model = get_peft_model(model, lora_config)
```
Saving and Loading
Save Adapter
```python
# Save only LoRA weights (small file)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# Push to Hub
model.push_to_hub("username/my-lora-adapter")
```
Load Adapter
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# For inference
model.eval()
```
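For a quick inference check with the loaded adapter, something along these lines works; the prompt format is illustrative, and the tokenizer is assumed to have been saved alongside the adapter (as in the Save Adapter example above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my-lora-adapter")

prompt = "### Instruction:\nExplain what LoRA is in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```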
Switch Between Adapters
```python
# Load multiple adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")

# Switch active adapter
model.set_adapter("task1")
output = model.generate(**inputs)

model.set_adapter("task2")
output = model.generate(**inputs)

# Disable adapter (use base model)
with model.disable_adapter():
    output = model.generate(**inputs)
```
Merging Adapters
Merge LoRA weights into the base model for deployment without adapter overhead.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # Merge on CPU to avoid memory issues
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Push merged model to Hub
merged_model.push_to_hub("username/my-merged-model")
```
Best Practices
- Start with r=16: Scale up to 32 or 64 if the model underfits, down to 8 if overfitting or memory-constrained
- Set lora_alpha = 2 × r: This is a common heuristic; the effective scaling is alpha/r (see the sketch after this list)
- Target all attention and MLP layers: For best results on LLMs, include gate/up/down projections: target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- Use a higher learning rate: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning
- Enable gradient checkpointing: Reduces memory at the cost of ~20% slower training: model.gradient_checkpointing_enable()
- Use QLoRA for large models: Essential for fine-tuning 7B+ models on consumer GPUs
- Keep dropout low: 0.05 is usually sufficient; higher values may hurt performance
- Save checkpoints frequently: LoRA adapters are small, so save often
- Evaluate on the base model too: Ensure the adapter doesn't degrade base capabilities
- Consider modules_to_save for task heads: For classification, train the classifier fully: modules_to_save=["classifier", "score"]
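As a rough starting point that combines several of the recommendations above, a configuration along these lines is reasonable; the target module names assume a Llama/Mistral/Qwen-style architecture, so adjust them per the Target Modules table for other model families:

```python
from peft import LoraConfig, TaskType

r = 16
lora_alpha = 2 * r   # effective update scaling = lora_alpha / r = 2.0

recommended_config = LoraConfig(
    r=r,
    lora_alpha=lora_alpha,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all attention + MLP projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```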
References
See reference/ for detailed documentation:
- advanced-techniques.md: DoRA, rsLoRA, adapter composition, and debugging