Claude-skill-registry HuggingFace Model Trainer
Train and fine-tune LLMs using HuggingFace TRL, Transformers, and cloud GPU infrastructure with SFT, DPO, and GRPO methods
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/huggingface-model-trainer" ~/.claude/skills/majiayu000-claude-skill-registry-huggingface-model-trainer && rm -rf "$T"
manifest:
skills/data/huggingface-model-trainer/SKILL.md
HuggingFace Model Trainer
You are an expert in training and fine-tuning large language models using HuggingFace's TRL (Transformer Reinforcement Learning), Transformers, and PEFT libraries. You help with dataset preparation, training configuration, GPU selection, and deployment.
Training Methods Overview
Method Selection Guide
┌─────────────────────────────────────────────────────────────────┐
│                    TRAINING METHOD SELECTION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  HAVE LABELED DATA?                                              │
│  ├── Yes: Input/Output pairs                                     │
│  │   └── Use SFT (Supervised Fine-Tuning)                        │
│  │                                                               │
│  ├── Yes: Preference pairs (chosen/rejected)                     │
│  │   └── Use DPO (Direct Preference Optimization)                │
│  │                                                               │
│  ├── No: Have a reward function/verifier                         │
│  │   └── Use GRPO (Group Relative Policy Optimization)           │
│  │                                                               │
│  └── No: Just want to continue pretraining                       │
│      └── Use CLM (Causal Language Modeling)                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
1. Supervised Fine-Tuning (SFT)
When to Use
- You have instruction/response pairs
- Adapting a model to your domain
- Teaching specific output formats
Basic SFT Script
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("your-org/your-dataset", split="train")

# Training configuration
config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,  # Use bfloat16 on supported GPUs
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
trainer.save_model("./final-model")
SFT with Chat Template
from trl import SFTTrainer, SFTConfig

# Dataset should have a 'messages' column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
config = SFTConfig(
    output_dir="./chat-sft",
    max_seq_length=4096,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # Automatically applies the chat template
)
2. Direct Preference Optimization (DPO)
When to Use
- You have preference data (chosen vs rejected responses)
- Aligning model with human preferences
- Improving response quality
DPO Script
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs: prompt, chosen, rejected columns
dataset = load_dataset("your-org/preference-data", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,  # Lower LR for DPO
    beta=0.1,            # KL penalty coefficient
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
Preference Data Format
# Required columns: prompt, chosen, rejected
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses quantum bits...",  # Better response
    "rejected": "Computers are fast machines...",        # Worse response
}
3. Group Relative Policy Optimization (GRPO)
When to Use
- You have a reward function or verifier
- Math/code tasks with checkable answers
- RL-based training without paired preferences
GRPO Script
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs a 'prompt' column
dataset = load_dataset("your-org/your-prompts", split="train")

# Define reward function
def reward_fn(completions, prompts, **kwargs):
    """Return a reward for each completion."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Example: reward correct math answers (verify_math_answer is your own checker)
        if verify_math_answer(completion, prompt):
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    per_device_train_batch_size=4,
    num_generations=4,  # Generate 4 samples per prompt
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    reward_funcs=reward_fn,  # One or more reward functions
)

trainer.train()
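The script above leaves verify_math_answer undefined. Below is a minimal, purely illustrative sketch of what a checkable-answer verifier can look like; the REFERENCE_ANSWERS lookup is a hypothetical stand-in for however your dataset stores ground-truth answers (in practice you would carry the reference as a dataset column and read it from kwargs).

import re

# Hypothetical mapping from prompt text to expected numeric answer (illustration only)
REFERENCE_ANSWERS = {
    "What is 12 * 7?": "84",
}

def verify_math_answer(completion: str, prompt: str) -> bool:
    """Check whether the last number in the completion matches the reference answer."""
    reference = REFERENCE_ANSWERS.get(prompt)
    if reference is None:
        return False
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(numbers) and abs(float(numbers[-1]) - float(reference)) < 1e-6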
4. Parameter-Efficient Fine-Tuning (PEFT/LoRA)
Why Use LoRA
- Train large models on limited GPU memory
- 10-100x fewer trainable parameters
- Fast training, easy to merge or swap adapters
LoRA Configuration
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,            # Rank (start with 8-32)
    lora_alpha=32,   # Alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.082
SFT with LoRA
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./lora-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Higher LR for LoRA
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,  # Pass LoRA config
)

trainer.train()
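After training, the adapter can be merged back into the base model so it loads like a regular checkpoint. A minimal sketch, assuming the trained adapter lives in ./lora-sft (the path is illustrative; point it at wherever your trainer saved the adapter):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the trained adapter, then fold the LoRA
# weights into the base weights for standalone deployment.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
merged = PeftModel.from_pretrained(base, "./lora-sft").merge_and_unload()

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B").save_pretrained("./merged-model")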
QLoRA (Quantized LoRA)
from transformers import BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Then apply LoRA as normal
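A minimal sketch of that follow-on step, using PEFT's prepare_model_for_kbit_training, which is commonly recommended before attaching adapters to a quantized model (the LoRA settings simply reuse the values from the earlier example):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit model training-friendly, then attach the LoRA adapter.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    ),
)
model.print_trainable_parameters()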
GPU Selection Guide
Memory Requirements
| Model Size | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| 7-8B | 60GB+ | 16GB | 8GB |
| 13B | 100GB+ | 24GB | 12GB |
| 34B | 200GB+ | 48GB | 24GB |
| 70B | 400GB+ | 80GB | 48GB |
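Before picking a row from the table, it helps to check what VRAM you actually have; a small sketch using PyTorch's CUDA utilities:

import torch

# Report total VRAM per visible GPU so you can match against the table above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")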
GPU Recommendations
┌─────────────────────────────────────────────────────────────────┐
│                       GPU SELECTION GUIDE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  TASK                    │  RECOMMENDED GPU                      │
│  ────────────────────────┼─────────────────────────────────────  │
│  QLoRA 8B                │  RTX 4090 (24GB), A10G                │
│  QLoRA 70B               │  A100 40GB x2, H100                   │
│  LoRA 8B                 │  A100 40GB, A10G x2                   │
│  LoRA 70B                │  A100 80GB x2, H100 x2                │
│  Full FT 8B              │  A100 80GB x2, H100                   │
│  Full FT 70B             │  H100 x8, A100 80GB x8                │
│                                                                  │
│  CLOUD PROVIDERS:                                                │
│  - AWS: p4d (A100), p5 (H100)                                    │
│  - GCP: a2-highgpu (A100), a3-highgpu (H100)                     │
│  - Azure: NC A100, ND H100                                       │
│  - Lambda Labs: Most cost-effective for training                 │
│  - RunPod: Good spot pricing                                     │
│  - HuggingFace Jobs: Managed training infrastructure             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
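For the multi-GPU rows above, the usual entry point is the accelerate launcher. A minimal sketch (the script name is illustrative; effective batch size math follows the SFT defaults used in this guide):

# One-time interactive setup, then launch across GPUs:
#   accelerate config
#   accelerate launch --num_processes 2 train_sft.py
#
# Keep the effective batch size constant when changing GPU count:
# effective_batch = per_device_batch * grad_accum_steps * num_gpus
effective_batch = 4 * 4 * 2  # = 32 with the SFT defaults on 2 GPUs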
Dataset Preparation
Chat Format Dataset
from datasets import Dataset

# Conversation format
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a programming language..."},
        ]
    },
    # More examples...
]

dataset = Dataset.from_list(conversations)
dataset.push_to_hub("your-org/chat-dataset")
Instruction Format
# Alpaca-style format
instruction_data = [
    {
        "instruction": "Summarize the following text",
        "input": "Long text here...",
        "output": "Summary here...",
    }
]

# Or simpler format
simple_data = [
    {
        "prompt": "Question or instruction",
        "completion": "Expected response",
    }
]
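To reuse Alpaca-style data with the chat-template SFT path shown earlier, it needs to be converted to the 'messages' format. A minimal conversion sketch; how 'input' is appended to the instruction is an illustrative choice, adapt it to your prompt template:

def alpaca_to_messages(example):
    """Turn an Alpaca-style record into the 'messages' format used for chat SFT."""
    user = example["instruction"]
    if example.get("input"):
        user = f"{user}\n\n{example['input']}"
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": example["output"]},
        ]
    }

# dataset = dataset.map(alpaca_to_messages, remove_columns=dataset.column_names)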
Data Quality Tips
# Filter low-quality examples
def filter_quality(example):
    # Remove very short responses
    if len(example["completion"]) < 50:
        return False
    # Remove repetitive content
    if example["completion"].count(example["completion"][:20]) > 3:
        return False
    return True

dataset = dataset.filter(filter_quality)

# Deduplicate
def deduplicate(dataset, column="prompt"):
    seen = set()
    indices = []
    for i, example in enumerate(dataset):
        key = example[column]
        if key not in seen:
            seen.add(key)
            indices.append(i)
    return dataset.select(indices)

dataset = deduplicate(dataset)
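The checklist later recommends evaluating on a held-out test set; datasets' built-in splitter covers this. The 5% split size is an illustrative choice:

# Hold out a small evaluation split before training; pass splits["train"] to the
# trainer and keep splits["test"] for the evaluation section below.
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]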
Training on HuggingFace Jobs
Using HF Jobs MCP Tool
# If using Claude Code with HF Jobs MCP
# This is submitted via hf_jobs() MCP tool

training_script = '''
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
dataset = load_dataset("your-org/your-dataset", split="train")

config = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,
    push_to_hub=True,
    hub_model_id="your-org/fine-tuned-model",
)

trainer = SFTTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
'''

# Submit via MCP: hf_jobs("uv", {"script": training_script, "gpu": "a100"})
Cost Estimation
# Rough cost estimates for HF Jobs / Cloud GPUs
TRAINING_COSTS = {
    # GPU type: (hourly_rate, tokens_per_hour_8B)
    "a10g": (1.50, 50_000_000),
    "a100_40gb": (3.50, 150_000_000),
    "a100_80gb": (5.00, 200_000_000),
    "h100": (8.00, 400_000_000),
}

def estimate_cost(
    model_size: str,
    dataset_tokens: int,
    epochs: int,
    gpu_type: str = "a100_40gb",
) -> dict:
    rate, throughput = TRAINING_COSTS[gpu_type]
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / throughput
    cost = hours * rate
    return {
        "gpu": gpu_type,
        "estimated_hours": round(hours, 1),
        "estimated_cost": f"${cost:.2f}",
        "total_tokens": f"{total_tokens:,}",
    }

# Example: 10M token dataset, 3 epochs on A100
estimate_cost("8B", 10_000_000, 3, "a100_40gb")
# {'gpu': 'a100_40gb', 'estimated_hours': 0.2, 'estimated_cost': '$0.70', 'total_tokens': '30,000,000'}
GGUF Conversion for Local Deployment
# Convert to GGUF for llama.cpp / Ollama
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Save in format for conversion
model.save_pretrained("./model-for-gguf", safe_serialization=True)
tokenizer.save_pretrained("./model-for-gguf")

# Then use llama.cpp to convert, and optionally quantize further:
# python convert_hf_to_gguf.py ./model-for-gguf --outtype f16
# (then quantize the resulting f16 GGUF, e.g. ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M)
Quantization Options
| Type | Size Reduction | Quality Loss | Use Case |
|---|---|---|---|
| f16 | 2x | None | Best quality |
| q8_0 | 4x | Minimal | Good balance |
| q4_k_m | 8x | Small | Production |
| q4_0 | 8x | Moderate | Resource constrained |
| q2_k | 16x | Significant | Extreme constraints |
Evaluation
Using lm-eval-harness
# Install: pip install lm-eval

# Command line evaluation:
# lm_eval --model hf --model_args pretrained=./fine-tuned-model --tasks hellaswag,arc_easy --batch_size 8

# Programmatic
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["hellaswag", "arc_easy", "mmlu"],
    batch_size=8,
)
print(results["results"])
Custom Evaluation
def evaluate_on_test_set(model, tokenizer, test_dataset):
    correct = 0
    total = 0

    for example in test_dataset:
        prompt = example["prompt"]
        expected = example["expected"]

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if expected.lower() in response.lower():
            correct += 1
        total += 1

    return {"accuracy": correct / total, "total": total}
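For chat/instruct models, wrapping the prompt with the tokenizer's chat template usually matches how the model was trained. A sketch of the generation step under that assumption:

import torch

def generate_chat_response(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a response using the model's chat template (instruct models)."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens
    return tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)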
Best Practices
Training Checklist
before_training:
  - [ ] Validate dataset format and quality
  - [ ] Check GPU memory requirements
  - [ ] Set up monitoring (W&B, TensorBoard)
  - [ ] Configure checkpointing strategy
  - [ ] Test with small subset first

during_training:
  - [ ] Monitor loss curves
  - [ ] Watch for gradient issues
  - [ ] Check learning rate schedule
  - [ ] Validate checkpoints periodically

after_training:
  - [ ] Evaluate on held-out test set
  - [ ] Compare with base model
  - [ ] Test on diverse prompts
  - [ ] Convert to desired format (GGUF, etc.)
  - [ ] Push to Hub with model card
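For the monitoring item above, logging is configured on the training arguments. A minimal sketch assuming Weights & Biases; the project and run names are illustrative:

import os
from trl import SFTConfig

# Point W&B at a project, then ask the trainer to report metrics there.
os.environ["WANDB_PROJECT"] = "llama-sft-experiments"

config = SFTConfig(
    output_dir="./sft-output",
    logging_steps=10,
    report_to=["wandb"],        # or ["tensorboard"]
    run_name="sft-llama31-8b",  # illustrative run name
)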
Hyperparameter Guidelines
# SFT defaults
SFT_DEFAULTS = {
    "learning_rate": 2e-5,       # Full fine-tune
    "learning_rate_lora": 2e-4,  # LoRA (higher)
    "batch_size": 4,
    "gradient_accumulation": 4,  # Effective batch = 16
    "epochs": "1-3",
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
}

# DPO defaults
DPO_DEFAULTS = {
    "learning_rate": 5e-7,  # Much lower
    "beta": 0.1,            # KL penalty
    "epochs": 1,            # Usually 1 is enough
}
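These defaults map directly onto SFTConfig fields; a short sketch of how they plug in (the cosine scheduler is an illustrative choice, the Trainer default is linear):

from trl import SFTConfig

config = SFTConfig(
    output_dir="./sft-output",
    learning_rate=SFT_DEFAULTS["learning_rate"],
    per_device_train_batch_size=SFT_DEFAULTS["batch_size"],
    gradient_accumulation_steps=SFT_DEFAULTS["gradient_accumulation"],
    num_train_epochs=3,
    warmup_ratio=SFT_DEFAULTS["warmup_ratio"],
    weight_decay=SFT_DEFAULTS["weight_decay"],
    lr_scheduler_type="cosine",  # illustrative choice
)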
Resources
- TRL Documentation
- PEFT Documentation
- HuggingFace Hub
- HuggingFace Jobs
- lm-eval-harness
- Axolotl - High-level training framework