Claude-skill-registry funsloth-local

Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/funsloth-local" ~/.claude/skills/majiayu000-claude-skill-registry-funsloth-local && rm -rf "$T"
manifest: skills/data/funsloth-local/SKILL.md
source content

Local GPU Training Manager

Run Unsloth training on your local GPU.

Prerequisites Check

1. Verify CUDA

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():  # device queries raise if no CUDA device is present
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

If CUDA not available:

  • Check NVIDIA drivers:
    nvidia-smi
  • Check CUDA:
    nvcc --version
  • Reinstall PyTorch:
    pip install torch --index-url https://download.pytorch.org/whl/cu121

2. Check VRAM

See references/HARDWARE_GUIDE.md for requirements:

| VRAM | Recommended Setup |
|------|-------------------|
| 8GB  | 7B, 4-bit, batch=1, LoRA r=8 |
| 12GB | 7B, 4-bit, batch=2, LoRA r=16 |
| 16GB | 7-13B, 4-bit, batch=2, LoRA r=16-32 |
| 24GB | 7-14B, 4-bit, batch=4, LoRA r=32 |
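The table above can be encoded as a small lookup helper to pick a starting config from detected VRAM. This is a hypothetical sketch; the tiers come from the table, not from the skill's scripts:

```python
def recommended_setup(vram_gb: float) -> dict:
    """Map detected VRAM (GB) to a conservative starting config, per the table above."""
    tiers = [
        (24, {"model": "7-14B", "batch_size": 4, "lora_r": 32}),
        (16, {"model": "7-13B", "batch_size": 2, "lora_r": 32}),
        (12, {"model": "7B", "batch_size": 2, "lora_r": 16}),
        (8,  {"model": "7B", "batch_size": 1, "lora_r": 8}),
    ]
    for min_vram, cfg in tiers:
        if vram_gb >= min_vram:
            return {"load_in_4bit": True, **cfg}
    raise ValueError(f"{vram_gb} GB is below the 8 GB minimum in the hardware guide")
```

Feed it the VRAM value printed by the CUDA check above, e.g. `recommended_setup(11.9)` returns the 8GB tier, since the card does not fully reach the 12GB row.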

3. Check Dependencies

pip install unsloth torch transformers trl peft datasets accelerate bitsandbytes
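A quick way to confirm the packages above are importable without actually loading them (a standard-library sketch; it assumes each pip name matches its import name, which holds for this list):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of import names not findable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

required = ["unsloth", "torch", "transformers", "trl",
            "peft", "datasets", "accelerate", "bitsandbytes"]
print(missing_packages(required) or "all dependencies found")
```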

Docker Option

Use the official Unsloth Docker image for a pre-configured environment (supports all GPUs including Blackwell/50-series):

docker run -d \
  -e JUPYTER_PASSWORD="unsloth" \
  -p 8888:8888 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth

Access Jupyter at http://localhost:8888. Example notebooks are in /workspace/unsloth-notebooks/.

Environment variables:

  • JUPYTER_PASSWORD - Jupyter auth (default: unsloth)
  • JUPYTER_PORT - Port (default: 8888)
  • USER_PASSWORD - User/sudo password (default: unsloth)

Run Training

Option 1: Notebook

jupyter notebook notebooks/sft_template.ipynb

Option 2: Script

# Edit configuration in script, then run
python scripts/train_sft.py

GPU Selection (Multi-GPU)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU; must be set before importing torch/unsloth

Monitor Training

Terminal

# Watch GPU usage
watch -n 1 nvidia-smi

# Or use nvitop (more detailed)
pip install nvitop && nvitop

WandB (Optional)

export WANDB_API_KEY="your-key"
# Add report_to="wandb" in TrainingArguments

Troubleshooting

OOM Error

Try in order:

  1. Reduce per_device_train_batch_size (down to 1)
  2. Increase gradient_accumulation_steps to keep the effective batch size
  3. Reduce max_seq_length
  4. Reduce LoRA rank
  5. Call torch.cuda.empty_cache() between runs
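Steps 1 and 2 trade memory for more optimizer steps while keeping the effective batch size constant. A small sanity-check helper (hypothetical, not part of the skill's scripts):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Effective batch = per-device batch x accumulation steps x GPU count."""
    return per_device_batch * grad_accum_steps * num_gpus

# Halving the per-device batch while doubling accumulation leaves training dynamics unchanged:
assert effective_batch_size(2, 4) == effective_batch_size(1, 8)
```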

Loss Not Decreasing

  1. Check learning rate (try higher or lower)
  2. Verify chat template matches model
  3. Inspect data format

Training Too Slow

  1. Enable bf16 if supported
  2. Use packing=True for short sequences
  3. Reduce logging_steps

See references/TROUBLESHOOTING.md for more solutions.

Resume from Checkpoint

trainer.train(resume_from_checkpoint=True)  # Auto-find latest checkpoint in output_dir
# Or: trainer.train(resume_from_checkpoint="outputs/checkpoint-500")
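Auto-resume works by scanning output_dir for the highest-numbered checkpoint-&lt;step&gt; directory. A minimal stand-alone version of that lookup (a sketch, assuming the standard Hugging Face naming convention):

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir: str):
    """Return the path of the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best = None
    for p in Path(output_dir).iterdir():
        m = pattern.match(p.name)
        if p.is_dir() and m:
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, str(p))
    return best[1] if best else None
```

Useful for confirming which checkpoint a resumed run will pick up before launching it.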

Save Model

Training script automatically saves:

  • outputs/lora_adapter/ - LoRA weights
  • outputs/merged_16bit/ - Merged model (optional)

Test Inference

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("outputs/lora_adapter")
FastLanguageModel.for_inference(model)  # Switch to Unsloth's inference mode

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Handoff

Offer funsloth-upload for Hub upload with model card.

Tips

  1. Close other GPU apps before training
  2. Monitor temps - keep under 85°C
  3. Use a UPS for long runs
  4. Save frequently with save_steps

Bundled Resources