Awesome-omni-skill training-hub

Fine-tune LLMs using Red Hat's training-hub library, which supports the SFT, LoRA, and OSFT algorithms. Use it when preparing JSONL datasets, running training jobs, configuring hardware, scaling to clusters, evaluating models, or deploying with vLLM.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/machine-learning/training-hub" ~/.claude/skills/diegosouzapw-awesome-omni-skill-training-hub-c4d6db && rm -rf "$T"
manifest: skills/machine-learning/training-hub/SKILL.md
source content

Training Hub

Red Hat's unified library for LLM post-training: SFT, LoRA, and OSFT (continual learning).

Quick Reference

| Task | Command |
| --- | --- |
| Recommend config | python scripts/recommend_config.py --model <model> --hardware <hw> |
| Estimate memory | python scripts/estimate_memory.py --model <model> --method sft --hardware h100 |
| Validate dataset | python scripts/validate_dataset.py data.jsonl |
| Full fine-tuning | from training_hub import sft |
| LoRA training | from training_hub import lora_sft |
| OSFT (continual) | from training_hub import osft |

Installation

pip install training-hub              # Basic
pip install training-hub[lora]        # LoRA with Unsloth (2x faster)
pip install training-hub[cuda] --no-build-isolation  # CUDA support

Get Started Fast

# Get optimal config for your hardware
python scripts/recommend_config.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --hardware rtx-5090

Data Format

Training data must be JSONL, one conversation per line, using the messages structure:

{"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]}

Validate before training:

python scripts/validate_dataset.py ./training_data.jsonl

For data preparation details, see DATA-FORMATS.md.
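The messages format above can be produced with nothing but the standard library. A minimal sketch (the file name and example conversations are illustrative, not part of the library):

```python
import json

# Two example conversations in the messages format expected by training-hub.
examples = [
    {"messages": [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "A low-rank adaptation method for fine-tuning."},
    ]},
    {"messages": [
        {"role": "user", "content": "What is SFT?"},
        {"role": "assistant", "content": "Supervised fine-tuning on labeled examples."},
    ]},
]

# JSONL = one JSON object per line.
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Quick structural check before handing the file to validate_dataset.py.
with open("training_data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        assert isinstance(record["messages"], list)
        assert all({"role", "content"} <= m.keys() for m in record["messages"])
```

Still run the bundled validator afterwards; it checks more than this structural pass does.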

Training Methods

Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning. Requires significant VRAM.

from training_hub import sft

result = sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./checkpoints",
    num_epochs=3,
    effective_batch_size=8,
    learning_rate=2e-5,
    max_seq_len=2048,
    max_tokens_per_gpu=45000,
)

LoRA Fine-Tuning

Memory-efficient adaptation (up to 2x faster, 70% less VRAM):

from training_hub import lora_sft

result = lora_sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    num_epochs=3,
    learning_rate=2e-4,
)

QLoRA (4-bit): Add load_in_4bit=True for large models on limited VRAM.
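Putting the two together, a QLoRA run might look like the sketch below. The parameter names mirror the lora_sft example above plus load_in_4bit; the model choice is illustrative, and the call itself is commented out so the snippet stands alone without a GPU:

```python
# QLoRA sketch: the LoRA config from above plus 4-bit base-model loading.
qlora_kwargs = dict(
    model_path="meta-llama/Llama-3.1-70B-Instruct",  # illustrative large model
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    num_epochs=3,
    learning_rate=2e-4,
    load_in_4bit=True,  # quantize the frozen base weights to 4-bit
)

# from training_hub import lora_sft
# result = lora_sft(**qlora_kwargs)
```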

OSFT (Continual Learning)

Adapt without catastrophic forgetting:

from training_hub import osft

result = osft(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./domain_data.jsonl",
    ckpt_output_dir="./checkpoints",
    unfreeze_rank_ratio=0.25,
    effective_batch_size=16,
    learning_rate=2e-5,
)

For all parameters, see ALGORITHMS.md.

Hardware Support

| Hardware | VRAM | Best For |
| --- | --- | --- |
| RTX 5090 | 32GB | 8B LoRA, 70B QLoRA |
| DGX Spark | 128GB | 70B SFT |
| H100 | 80GB | 14B SFT, 70B LoRA |
| 8×H100 | 640GB | 70B SFT |

# Check if your config fits
python scripts/estimate_memory.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --method lora \
  --hardware h100 \
  --num-gpus 8
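As a rough sanity check independent of the script, weights-plus-optimizer memory can be approximated from parameter count. This is a back-of-envelope sketch only (it ignores activations, sequence length, sharding, and framework overhead, all of which estimate_memory.py accounts for):

```python
def rough_vram_gb(params_b: float, method: str) -> float:
    """Back-of-envelope GPU memory for weights + optimizer state, in GB.

    Rule of thumb, mixed-precision training:
      - full SFT with Adam: ~16 bytes/param
        (fp16 weights 2 + fp16 grads 2 + fp32 moments 8 + fp32 master copy 4)
      - LoRA: ~2 bytes/param for the frozen fp16 base; adapters are negligible
      - QLoRA: ~0.5 bytes/param for the 4-bit base
    """
    bytes_per_param = {"sft": 16, "lora": 2, "qlora": 0.5}[method]
    # params_b is in billions, so billions * bytes/param ≈ GB directly.
    return params_b * bytes_per_param

# An 8B model: full SFT ~128 GB (multi-GPU), LoRA ~16 GB, QLoRA ~4 GB.
```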

For hardware-specific configs, see HARDWARE.md.

Scaling

Multi-GPU:

result = sft(..., nproc_per_node=8)

Multi-node:

result = sft(..., nnodes=2, node_rank=0, nproc_per_node=8, rdzv_endpoint="0.0.0.0:29500")
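In a multi-node launch the same script runs on every node; only node_rank differs per node, while the rendezvous settings are shared. A sketch, assuming the rank is injected via an environment variable by your scheduler (the NODE_RANK name is an assumption, not a training-hub convention):

```python
import os

# Shared across all nodes; mirrors torchrun-style rendezvous arguments.
common = dict(nnodes=2, nproc_per_node=8, rdzv_endpoint="0.0.0.0:29500")

# Differs per node, e.g. set by Slurm/Kubernetes; defaults to 0 for node 0.
node_rank = int(os.environ.get("NODE_RANK", "0"))

# from training_hub import sft
# result = sft(..., node_rank=node_rank, **common)
```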

For Slurm, Kubernetes, and datacenter deployments, see SCALE.md.

Algorithm Selection

| Scenario | Method |
| --- | --- |
| First-time fine-tuning, large dataset | SFT |
| Memory constrained | LoRA |
| Very large model (70B+), limited VRAM | LoRA + QLoRA |
| Preserve existing capabilities | OSFT |
| Domain adaptation, small dataset | OSFT |
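The selection logic above can be encoded as a small helper. This is purely an illustrative restatement of the table, not a library API:

```python
def pick_method(memory_constrained: bool, model_size_b: float,
                preserve_capabilities: bool) -> str:
    """Pick a training method following the selection table above."""
    if preserve_capabilities:
        return "osft"          # continual learning, avoids forgetting
    if model_size_b >= 70 and memory_constrained:
        return "lora+qlora"    # 4-bit base + low-rank adapters
    if memory_constrained:
        return "lora"
    return "sft"               # full-parameter fine-tuning
```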

Documentation

| Topic | File |
| --- | --- |
| Hardware profiles & configs | HARDWARE.md |
| All algorithm parameters | ALGORITHMS.md |
| Data formats & conversion | DATA-FORMATS.md |
| Datacenter & cluster setup | SCALE.md |
| Model evaluation | EVALUATION.md |
| vLLM inference & serving | INFERENCE.md |
| Advanced techniques | ADVANCED.md |
| Model-specific configs | MODELS.md |
| Troubleshooting | TROUBLESHOOTING.md |
| Distributed training | DISTRIBUTED.md |

Utility Scripts

| Script | Purpose |
| --- | --- |
| recommend_config.py | Generate optimal config for model + hardware |
| estimate_memory.py | Estimate GPU memory requirements |
| validate_dataset.py | Validate JSONL dataset format |
| convert_to_jsonl.py | Convert CSV, Alpaca, ShareGPT to JSONL |
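For Alpaca-style data, the kind of mapping convert_to_jsonl.py performs can be sketched with the standard library. Field names follow the common Alpaca convention (instruction/input/output); the actual script handles more formats and edge cases:

```python
import json

def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca-style record to the messages format."""
    prompt = record["instruction"]
    if record.get("input"):
        # Alpaca's optional input field is appended to the instruction.
        prompt += "\n\n" + record["input"]
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]}

alpaca = {"instruction": "Summarize:", "input": "LoRA adapts low-rank matrices.",
          "output": "LoRA fine-tunes small low-rank matrices."}
print(json.dumps(alpaca_to_messages(alpaca)))
```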

Troubleshooting

CUDA OOM: Reduce max_tokens_per_gpu, use LoRA + QLoRA, or add GPUs.

Dataset errors: Run python scripts/validate_dataset.py first.

LoRA multi-GPU: Requires torchrun --nproc-per-node=N script.py.

Training diverges: Lower learning_rate (try 1e-5 for SFT, 1e-4 for LoRA).

For more, see TROUBLESHOOTING.md.