Claude-skill-registry funsloth-local

Training manager for local GPU training - validate CUDA, manage GPU selection, monitor progress, handle checkpoints

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/funsloth-local" ~/.claude/skills/majiayu000-claude-skill-registry-funsloth-local && rm -rf "$T"
manifest: skills/data/funsloth-local/SKILL.md
source content

Local GPU Training Manager

Run Unsloth training on your local GPU.

Prerequisites Check

1. Verify CUDA

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():  # device queries raise if no CUDA device is present
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

If CUDA not available:

  • Check NVIDIA drivers:
    nvidia-smi
  • Check CUDA:
    nvcc --version
  • Reinstall PyTorch:
    pip install torch --index-url https://download.pytorch.org/whl/cu121

2. Check VRAM

See references/HARDWARE_GUIDE.md for requirements:

| VRAM | Recommended Setup |
|------|-------------------|
| 8GB  | 7B, 4-bit, batch=1, LoRA r=8 |
| 12GB | 7B, 4-bit, batch=2, LoRA r=16 |
| 16GB | 7-13B, 4-bit, batch=2, LoRA r=16-32 |
| 24GB | 7-14B, 4-bit, batch=4, LoRA r=32 |
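The table above can be encoded as a small lookup helper to pick a starting config from detected VRAM. This is a hypothetical sketch; the tiers come from the table, not from the skill's scripts:

```python
def recommended_setup(vram_gb: float) -> dict:
    """Map detected VRAM (GB) to a conservative starting config, per the table above."""
    tiers = [
        (24, {"model": "7-14B", "batch_size": 4, "lora_r": 32}),
        (16, {"model": "7-13B", "batch_size": 2, "lora_r": 32}),
        (12, {"model": "7B", "batch_size": 2, "lora_r": 16}),
        (8,  {"model": "7B", "batch_size": 1, "lora_r": 8}),
    ]
    for min_vram, cfg in tiers:
        if vram_gb >= min_vram:
            return {"load_in_4bit": True, **cfg}
    raise ValueError(f"{vram_gb} GB is below the 8 GB minimum in the hardware guide")
```

Feed it the VRAM value printed by the CUDA check above, e.g. `recommended_setup(11.9)` returns the 8GB tier, since the card does not fully reach the 12GB row.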

3. Check Dependencies

pip install unsloth torch transformers trl peft datasets accelerate bitsandbytes
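A quick way to confirm the packages above are importable without actually loading them (a standard-library sketch; it assumes each pip name matches its import name, which holds for this list):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of import names not findable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

required = ["unsloth", "torch", "transformers", "trl",
            "peft", "datasets", "accelerate", "bitsandbytes"]
print(missing_packages(required) or "all dependencies found")
```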

Docker Option

Use the official Unsloth Docker image for a pre-configured environment (supports all GPUs including Blackwell/50-series):

docker run -d \
  -e JUPYTER_PASSWORD="unsloth" \
  -p 8888:8888 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth

Access Jupyter at http://localhost:8888. Example notebooks are in /workspace/unsloth-notebooks/.

Environment variables:

  • JUPYTER_PASSWORD - Jupyter auth (default: unsloth)
  • JUPYTER_PORT - Port (default: 8888)
  • USER_PASSWORD - User/sudo password (default: unsloth)

Run Training

Option 1: Notebook

jupyter notebook notebooks/sft_template.ipynb

Option 2: Script

# Edit configuration in script, then run
python scripts/train_sft.py

GPU Selection (Multi-GPU)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU; must be set before importing torch/unsloth

Monitor Training

Terminal

# Watch GPU usage
watch -n 1 nvidia-smi

# Or use nvitop (more detailed)
pip install nvitop && nvitop

WandB (Optional)

export WANDB_API_KEY="your-key"
# Add report_to="wandb" in TrainingArguments

Troubleshooting

OOM Error

Try in order:

  1. Reduce per_device_train_batch_size (down to 1)
  2. Increase gradient_accumulation_steps to keep the effective batch size
  3. Reduce max_seq_length
  4. Reduce LoRA rank
  5. Call torch.cuda.empty_cache() between runs
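Steps 1 and 2 trade memory for more optimizer steps while keeping the effective batch size constant. A small sanity-check helper (hypothetical, not part of the skill's scripts):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Effective batch = per-device batch x accumulation steps x GPU count."""
    return per_device_batch * grad_accum_steps * num_gpus

# Halving the per-device batch while doubling accumulation leaves training dynamics unchanged:
assert effective_batch_size(2, 4) == effective_batch_size(1, 8)
```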

Loss Not Decreasing

  1. Check learning rate (try higher or lower)
  2. Verify chat template matches model
  3. Inspect data format

Training Too Slow

  1. Enable bf16 if supported
  2. Use packing=True for short sequences
  3. Reduce logging_steps

See references/TROUBLESHOOTING.md for more solutions.

Resume from Checkpoint

trainer.train(resume_from_checkpoint=True)  # Auto-find latest checkpoint in output_dir
# Or: trainer.train(resume_from_checkpoint="outputs/checkpoint-500")
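Auto-resume works by scanning output_dir for the highest-numbered checkpoint-&lt;step&gt; directory. A minimal stand-alone version of that lookup (a sketch, assuming the standard Hugging Face naming convention):

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir: str):
    """Return the path of the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best = None
    for p in Path(output_dir).iterdir():
        m = pattern.match(p.name)
        if p.is_dir() and m:
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, str(p))
    return best[1] if best else None
```

Useful for confirming which checkpoint a resumed run will pick up before launching it.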

Save Model

Training script automatically saves:

  • outputs/lora_adapter/ - LoRA weights
  • outputs/merged_16bit/ - Merged model (optional)

Test Inference

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("outputs/lora_adapter")
FastLanguageModel.for_inference(model)  # Switch to Unsloth's inference mode

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Handoff

Offer funsloth-upload for Hub upload with model card.

Tips

  1. Close other GPU apps before training
  2. Monitor temps - keep under 85°C
  3. Use a UPS for long runs
  4. Save frequently with save_steps

Bundled Resources