Claude-skill-registry HuggingFace Model Trainer
Train and fine-tune LLMs using HuggingFace TRL, Transformers, and cloud GPU infrastructure with SFT, DPO, and GRPO methods
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/huggingface-model-trainer" ~/.claude/skills/majiayu000-claude-skill-registry-huggingface-model-trainer && rm -rf "$T"
manifest:
skills/data/huggingface-model-trainer/SKILL.md
HuggingFace Model Trainer
You are an expert in training and fine-tuning large language models using HuggingFace's TRL (Transformer Reinforcement Learning), Transformers, and PEFT libraries. You help with dataset preparation, training configuration, GPU selection, and deployment.
Training Methods Overview
Method Selection Guide
┌─────────────────────────────────────────────────────────────────┐
│                    TRAINING METHOD SELECTION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  HAVE LABELED DATA?                                              │
│  ├── Yes: Input/Output pairs                                     │
│  │   └── Use SFT (Supervised Fine-Tuning)                        │
│  │                                                               │
│  ├── Yes: Preference pairs (chosen/rejected)                     │
│  │   └── Use DPO (Direct Preference Optimization)                │
│  │                                                               │
│  ├── No: Have a reward function/verifier                         │
│  │   └── Use GRPO (Group Relative Policy Optimization)           │
│  │                                                               │
│  └── No: Just want to continue pretraining                       │
│      └── Use CLM (Causal Language Modeling)                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
1. Supervised Fine-Tuning (SFT)
When to Use
- You have instruction/response pairs
- Adapting a model to your domain
- Teaching specific output formats
Basic SFT Script
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("your-org/your-dataset", split="train")

# Training configuration
config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,  # Use bfloat16 on supported GPUs
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
trainer.save_model("./final-model")
SFT with Chat Template
from trl import SFTTrainer, SFTConfig

# Dataset should have a 'messages' column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
config = SFTConfig(
    output_dir="./chat-sft",
    max_seq_length=4096,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # Automatically applies the chat template
)
2. Direct Preference Optimization (DPO)
When to Use
- You have preference data (chosen vs rejected responses)
- Aligning model with human preferences
- Improving response quality
DPO Script
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs: prompt, chosen, rejected columns
dataset = load_dataset("your-org/preference-data", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,  # Lower LR for DPO
    beta=0.1,            # KL penalty coefficient
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
Preference Data Format
# Required columns: prompt, chosen, rejected
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses quantum bits...",  # Better response
    "rejected": "Computers are fast machines...",        # Worse response
}
3. Group Relative Policy Optimization (GRPO)
When to Use
- You have a reward function or verifier
- Math/code tasks with checkable answers
- RL-based training without paired preferences
GRPO Script
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs a 'prompt' column
dataset = load_dataset("your-org/your-prompts", split="train")

# Define reward function
def reward_fn(completions, prompts, **kwargs):
    """Return a reward for each completion."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Example: reward correct math answers (verify_math_answer is your own checker)
        if verify_math_answer(completion, prompt):
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    per_device_train_batch_size=4,
    num_generations=4,  # Generate 4 samples per prompt
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    reward_funcs=reward_fn,  # One or more reward functions
)

trainer.train()
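The script above leaves verify_math_answer undefined. Below is a minimal, purely illustrative sketch of what a checkable-answer verifier can look like; the REFERENCE_ANSWERS lookup is a hypothetical stand-in for however your dataset stores ground-truth answers (in practice you would carry the reference as a dataset column and read it from kwargs).

import re

# Hypothetical mapping from prompt text to expected numeric answer (illustration only)
REFERENCE_ANSWERS = {
    "What is 12 * 7?": "84",
}

def verify_math_answer(completion: str, prompt: str) -> bool:
    """Check whether the last number in the completion matches the reference answer."""
    reference = REFERENCE_ANSWERS.get(prompt)
    if reference is None:
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(numbers) and abs(float(numbers[-1]) - float(reference)) < 1e-6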
4. Parameter-Efficient Fine-Tuning (PEFT/LoRA)
Why Use LoRA
- Train large models on limited GPU memory
- 10-100x fewer trainable parameters
- Fast training, easy to merge or swap adapters
LoRA Configuration
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,            # Rank (start with 8-32)
    lora_alpha=32,   # Alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.082
SFT with LoRA
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./lora-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Higher LR for LoRA
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,  # Pass LoRA config
)

trainer.train()
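After training, the adapter can be merged back into the base model so it loads like a regular checkpoint. A minimal sketch, assuming the trained adapter lives in ./lora-sft (the path is illustrative; point it at wherever your trainer saved the adapter):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the trained adapter, then fold the LoRA
# weights into the base weights for standalone deployment.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(base, "./lora-sft").merge_and_unload()

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B").save_pretrained("./merged-model")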
QLoRA (Quantized LoRA)
from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Then apply LoRA as normal
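A minimal sketch of that follow-on step, using PEFT's prepare_model_for_kbit_training, which is commonly recommended before attaching adapters to a quantized model (the LoRA settings simply reuse the values from the earlier example):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit model training-friendly, then attach the LoRA adapter.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    ),
)
model.print_trainable_parameters()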
GPU Selection Guide
Memory Requirements
| Model Size | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| 7-8B | 60GB+ | 16GB | 8GB |
| 13B | 100GB+ | 24GB | 12GB |
| 34B | 200GB+ | 48GB | 24GB |
| 70B | 400GB+ | 80GB | 48GB |
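Before picking a row from the table, it helps to check what VRAM you actually have; a small sketch using PyTorch's CUDA utilities:

import torch

# Report total VRAM per visible GPU so you can match against the table above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")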
GPU Recommendations
┌─────────────────────────────────────────────────────────────────┐
│                       GPU SELECTION GUIDE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  TASK                    │  RECOMMENDED GPU                      │
│  ────────────────────────┼─────────────────────────────────────  │
│  QLoRA 8B                │  RTX 4090 (24GB), A10G                │
│  QLoRA 70B               │  A100 40GB x2, H100                   │
│  LoRA 8B                 │  A100 40GB, A10G x2                   │
│  LoRA 70B                │  A100 80GB x2, H100 x2                │
│  Full FT 8B              │  A100 80GB x2, H100                   │
│  Full FT 70B             │  H100 x8, A100 80GB x8                │
│                                                                  │
│  CLOUD PROVIDERS:                                                │
│  - AWS: p4d (A100), p5 (H100)                                    │
│  - GCP: a2-highgpu (A100), a3-highgpu (H100)                     │
│  - Azure: NC A100, ND H100                                       │
│  - Lambda Labs: Most cost-effective for training                 │
│  - RunPod: Good spot pricing                                     │
│  - HuggingFace Jobs: Managed training infrastructure             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
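For the multi-GPU rows above, the usual entry point is the accelerate launcher. A minimal sketch (the script name is illustrative; effective batch size math follows the SFT defaults used in this guide):

# One-time interactive setup, then launch across GPUs:
#   accelerate config
#   accelerate launch --num_processes 2 train_sft.py
#
# Keep the effective batch size constant when changing GPU count:
# effective_batch = per_device_batch * grad_accum_steps * num_gpus
effective_batch = 4 * 4 * 2  # = 32 with the SFT defaults on 2 GPUs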
Dataset Preparation
Chat Format Dataset
from datasets import Dataset

# Conversation format
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a programming language..."},
        ]
    },
    # More examples...
]

dataset = Dataset.from_list(conversations)
dataset.push_to_hub("your-org/chat-dataset")
Instruction Format
# Alpaca-style format
instruction_data = [
    {
        "instruction": "Summarize the following text",
        "input": "Long text here...",
        "output": "Summary here...",
    }
]

# Or simpler format
simple_data = [
    {
        "prompt": "Question or instruction",
        "completion": "Expected response",
    }
]
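To reuse Alpaca-style data with the chat-template SFT path shown earlier, it needs to be converted to the 'messages' format. A minimal conversion sketch; how 'input' is appended to the instruction is an illustrative choice, adapt it to your prompt template:

def alpaca_to_messages(example):
    """Turn an Alpaca-style record into the 'messages' format used for chat SFT."""
    user = example["instruction"]
    if example.get("input"):
        user = f"{user}\n\n{example['input']}"
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": example["output"]},
        ]
    }

# dataset = dataset.map(alpaca_to_messages, remove_columns=dataset.column_names)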
Data Quality Tips
# Filter low-quality examples
def filter_quality(example):
    # Remove very short responses
    if len(example["completion"]) < 50:
        return False
    # Remove repetitive content
    if example["completion"].count(example["completion"][:20]) > 3:
        return False
    return True

dataset = dataset.filter(filter_quality)

# Deduplicate
def deduplicate(dataset, column="prompt"):
    seen = set()
    indices = []
    for i, example in enumerate(dataset):
        key = example[column]
        if key not in seen:
            seen.add(key)
            indices.append(i)
    return dataset.select(indices)

dataset = deduplicate(dataset)
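The checklist later recommends evaluating on a held-out test set; datasets' built-in splitter covers this. The 5% split size is an illustrative choice:

# Hold out a small evaluation split before training; pass splits["train"] to the
# trainer and keep splits["test"] for the evaluation section below.
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]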
Training on HuggingFace Jobs
Using HF Jobs MCP Tool
# If using Claude Code with HF Jobs MCP
# This is submitted via hf_jobs() MCP tool

training_script = '''
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
dataset = load_dataset("your-org/your-dataset", split="train")

config = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,
    push_to_hub=True,
    hub_model_id="your-org/fine-tuned-model",
)

trainer = SFTTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
'''

# Submit via MCP: hf_jobs("uv", {"script": training_script, "gpu": "a100"})
Cost Estimation
# Rough cost estimates for HF Jobs / Cloud GPUs
TRAINING_COSTS = {
    # GPU type: (hourly_rate, tokens_per_hour_8B)
    "a10g": (1.50, 50_000_000),
    "a100_40gb": (3.50, 150_000_000),
    "a100_80gb": (5.00, 200_000_000),
    "h100": (8.00, 400_000_000),
}

def estimate_cost(
    model_size: str,
    dataset_tokens: int,
    epochs: int,
    gpu_type: str = "a100_40gb",
) -> dict:
    rate, throughput = TRAINING_COSTS[gpu_type]
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / throughput
    cost = hours * rate
    return {
        "gpu": gpu_type,
        "estimated_hours": round(hours, 1),
        "estimated_cost": f"${cost:.2f}",
        "total_tokens": f"{total_tokens:,}",
    }

# Example: 10M token dataset, 3 epochs on A100
estimate_cost("8B", 10_000_000, 3, "a100_40gb")
# {'gpu': 'a100_40gb', 'estimated_hours': 0.2, 'estimated_cost': '$0.70', 'total_tokens': '30,000,000'}
GGUF Conversion for Local Deployment
# Convert to GGUF for llama.cpp / Ollama
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Save in format for conversion
model.save_pretrained("./model-for-gguf", safe_serialization=True)
tokenizer.save_pretrained("./model-for-gguf")

# Then use llama.cpp to convert, and optionally quantize further:
# python convert_hf_to_gguf.py ./model-for-gguf --outtype f16
# (then quantize the resulting f16 GGUF, e.g. ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M)
Quantization Options
| Type | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| f16 | 2x | None | Best quality |
| q8_0 | 4x | Minimal | Good balance |
| q4_k_m | 8x | Small | Production |
| q4_0 | 8x | Moderate | Resource constrained |
| q2_k | 16x | Significant | Extreme constraints |
Evaluation
Using lm-eval-harness
# Install: pip install lm-eval

# Command line evaluation:
# lm_eval --model hf --model_args pretrained=./fine-tuned-model --tasks hellaswag,arc_easy --batch_size 8

# Programmatic
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["hellaswag", "arc_easy", "mmlu"],
    batch_size=8,
)
print(results["results"])
Custom Evaluation
def evaluate_on_test_set(model, tokenizer, test_dataset):
    correct = 0
    total = 0

    for example in test_dataset:
        prompt = example["prompt"]
        expected = example["expected"]

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if expected.lower() in response.lower():
            correct += 1
        total += 1

    return {"accuracy": correct / total, "total": total}
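For chat/instruct models, wrapping the prompt with the tokenizer's chat template usually matches how the model was trained. A sketch of the generation step under that assumption:

import torch

def generate_chat_response(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a response using the model's chat template (instruct models)."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens
    return tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)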
Best Practices
Training Checklist
before_training:
  - [ ] Validate dataset format and quality
  - [ ] Check GPU memory requirements
  - [ ] Set up monitoring (W&B, TensorBoard)
  - [ ] Configure checkpointing strategy
  - [ ] Test with small subset first

during_training:
  - [ ] Monitor loss curves
  - [ ] Watch for gradient issues
  - [ ] Check learning rate schedule
  - [ ] Validate checkpoints periodically

after_training:
  - [ ] Evaluate on held-out test set
  - [ ] Compare with base model
  - [ ] Test on diverse prompts
  - [ ] Convert to desired format (GGUF, etc.)
  - [ ] Push to Hub with model card
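For the monitoring item above, logging is configured on the training arguments. A minimal sketch assuming Weights & Biases; the project and run names are illustrative:

import os
from trl import SFTConfig

# Point W&B at a project, then ask the trainer to report metrics there.
os.environ["WANDB_PROJECT"] = "llama-sft-experiments"

config = SFTConfig(
    output_dir="./sft-output",
    logging_steps=10,
    report_to=["wandb"],        # or ["tensorboard"]
    run_name="sft-llama31-8b",  # illustrative run name
)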
Hyperparameter Guidelines
# SFT defaults
SFT_DEFAULTS = {
    "learning_rate": 2e-5,       # Full fine-tune
    "learning_rate_lora": 2e-4,  # LoRA (higher)
    "batch_size": 4,
    "gradient_accumulation": 4,  # Effective batch = 16
    "epochs": "1-3",
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
}

# DPO defaults
DPO_DEFAULTS = {
    "learning_rate": 5e-7,  # Much lower
    "beta": 0.1,            # KL penalty
    "epochs": 1,            # Usually 1 is enough
}
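These defaults map directly onto SFTConfig fields; a short sketch of how they plug in (the cosine scheduler is an illustrative choice, the Trainer default is linear):

from trl import SFTConfig

config = SFTConfig(
    output_dir="./sft-output",
    learning_rate=SFT_DEFAULTS["learning_rate"],
    per_device_train_batch_size=SFT_DEFAULTS["batch_size"],
    gradient_accumulation_steps=SFT_DEFAULTS["gradient_accumulation"],
    num_train_epochs=3,
    warmup_ratio=SFT_DEFAULTS["warmup_ratio"],
    weight_decay=SFT_DEFAULTS["weight_decay"],
    lr_scheduler_type="cosine",  # illustrative choice
)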
Resources
- TRL Documentation
- PEFT Documentation
- HuggingFace Hub
- HuggingFace Jobs
- lm-eval-harness
- Axolotl - High-level training framework