Skilllibrary llm-creation
Plans and executes end-to-end LLM creation from architecture design through pretraining, instruction tuning, and alignment. Covers scaling laws, compute budgets, tokenizer training, distributed training infrastructure, and evaluation checkpoints. Use when building a new language model from scratch.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/llm-creation" ~/.claude/skills/merceralex397-collab-skilllibrary-llm-creation && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/llm-creation/SKILL.md
Purpose
Plan and execute the full pipeline for creating an LLM from scratch — architecture design, scaling law calculations, tokenizer training, distributed pretraining, instruction tuning, alignment, evaluation checkpoints, and release documentation.
When to use this skill
Use this skill when:
- designing a new LLM architecture (choosing model size, layers, heads, context length)
- estimating compute budgets and data requirements using scaling laws
- planning the full training pipeline: tokenizer → pretraining → SFT → alignment
- setting up distributed training infrastructure (FSDP, DeepSpeed, Megatron-LM)
- defining evaluation checkpoint schedules and release criteria
Do not use this skill when
- the task is fine-tuning an existing model (use fine-tuning or instruction-tuning)
- the task is only about inference optimization (use inference-kernel-optimization)
- the goal is eval dataset creation without training (use eval-dataset-design)
Operating procedure
- Define architecture. Choose a decoder-only transformer configuration: RoPE positional embeddings, SwiGLU activation, RMSNorm, and GQA (grouped-query attention) for models ≥13B. Model size guide (a config sketch follows this list):
  - 1B: num_layers=24, hidden_dim=2048, num_heads=16, context=2048
  - 7B: num_layers=32, hidden_dim=4096, num_heads=32, context=4096
  - 13B: num_layers=40, hidden_dim=5120, num_heads=40, context=4096
  - 70B: num_layers=80, hidden_dim=8192, num_heads=64, context=8192
- Apply scaling laws. Chinchilla-optimal training uses ~20 tokens per parameter (a 7B model needs ~140B tokens). Estimate total FLOPs as C ≈ 6 × N × D, where N = parameters and D = tokens. Budget GPU-hours as C / (GPU_TFLOPS × utilization × 3600), targeting 40–50% MFU (a worked budget sketch follows this list).
- Train tokenizer. Train a BPE tokenizer (SentencePiece or HuggingFace tokenizers) on a representative corpus sample; a tokenizer-training sketch follows this list. Vocab size: 32k–64k. Ensure coverage of code, multilingual text, and special tokens (<|begin_of_text|>, <|end_of_text|>, chat role markers).
- Configure distributed pretraining. Choose a framework:
  - FSDP (PyTorch native): FullyShardedDataParallel with mixed precision (bf16) and activation checkpointing.
  - DeepSpeed ZeRO Stage 3: partitions optimizer states, gradients, and parameters.
  - Megatron-LM: tensor + pipeline parallelism for >70B models.
  Learning rate: peak 3e-4, cosine decay to 3e-5, warmup over the first 2000 steps (a schedule sketch follows this list). Batch size: ramp from 256 to 4M tokens over warmup.
- Monitor training. Log training loss, gradient norm, and learning rate every 10 steps. Evaluate perplexity on a held-out validation set every 1000 steps (a perplexity sketch follows this list). Run downstream benchmarks (MMLU, HellaSwag, HumanEval) at 25%, 50%, 75%, and 100% of training.
- Post-training pipeline. After pretraining: instruction-tune with SFTTrainer (delegate to the instruction-tuning skill), then align with DPO or RLHF (delegate to the preference-optimization skill).
- Document and release. Produce a model card: architecture, training data composition, compute used, benchmark results, known limitations, intended use, and license.
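To make the size guide concrete, here is a minimal sketch of the presets as a plain Python config with a back-of-the-envelope parameter estimate. The class, the default vocab size, and the ~12·L·d² approximation are illustrative assumptions, not a specific framework's API.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Decoder-only transformer preset (illustrative, not tied to a particular library)."""
    num_layers: int
    hidden_dim: int
    num_heads: int
    context: int
    vocab_size: int = 32_000  # assumed vocab size from the tokenizer step

    def approx_params(self) -> int:
        # Rough estimate: ~12 * L * d^2 covers attention + SwiGLU MLP weights,
        # plus input/output embedding matrices. Ballpark only.
        return 12 * self.num_layers * self.hidden_dim ** 2 + 2 * self.vocab_size * self.hidden_dim

PRESETS = {
    "1B":  ModelConfig(24, 2048, 16, 2048),
    "7B":  ModelConfig(32, 4096, 32, 4096),
    "13B": ModelConfig(40, 5120, 40, 4096),
    "70B": ModelConfig(80, 8192, 64, 8192),
}

for name, cfg in PRESETS.items():
    print(f"{name}: ~{cfg.approx_params() / 1e9:.1f}B params")
```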
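A worked example of the C ≈ 6 × N × D budget. The 312 TFLOPS figure assumes an A100-class GPU at bf16 peak and 45% MFU; swap in your own hardware numbers.

```python
def compute_budget(n_params: float, tokens_per_param: float = 20.0,
                   gpu_tflops: float = 312.0, mfu: float = 0.45) -> dict:
    """Chinchilla-style budget: C ≈ 6 * N * D FLOPs, converted to GPU-hours."""
    d_tokens = tokens_per_param * n_params          # D: training tokens
    flops = 6 * n_params * d_tokens                 # C: total training FLOPs
    gpu_seconds = flops / (gpu_tflops * 1e12 * mfu) # effective throughput after MFU
    return {"tokens": d_tokens, "total_flops": flops, "gpu_hours": gpu_seconds / 3600}

budget = compute_budget(7e9)  # 7B model -> ~140B tokens, ~5.9e21 FLOPs
print(f"{budget['tokens'] / 1e9:.0f}B tokens, {budget['total_flops']:.2e} FLOPs, "
      f"{budget['gpu_hours']:,.0f} GPU-hours")
```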
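A minimal BPE-training sketch using the HuggingFace tokenizers library. The file name corpus_sample.txt and the output path are placeholders for your own corpus sample; adjust special tokens to your chat format.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE, trained on a representative sample of the pretraining corpus.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],
)
tokenizer.train(files=["corpus_sample.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```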
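A small sketch of the stated schedule: linear warmup over 2000 steps to a 3e-4 peak, then cosine decay to 3e-5. The total_steps value is an assumed placeholder that depends on your token budget and batch size.

```python
import math

def lr_at_step(step: int, peak_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 2000, total_steps: int = 100_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```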
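For the 1000-step perplexity check, a minimal sketch that assumes an HF-style model which returns a .loss attribute when given labels; adapt it if your training loop computes loss differently.

```python
import math
import torch

@torch.no_grad()
def validation_perplexity(model, val_loader, device: str = "cuda") -> float:
    """Perplexity = exp(mean token-level cross-entropy) over the held-out set."""
    model.eval()
    total_loss, total_batches = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        out = model(input_ids=input_ids, labels=input_ids)
        total_loss += out.loss.item()
        total_batches += 1
    model.train()
    return math.exp(total_loss / max(1, total_batches))
```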
Decision rules
- Follow Chinchilla scaling: if compute is fixed, allocate ~50% to model size and ~50% to data tokens.
- For context length > 8k, use RoPE with NTK-aware scaling or YaRN for length extrapolation.
- Use GQA (num_kv_heads = num_heads / 8) for models ≥ 13B to reduce KV cache memory (a KV-cache estimate sketch follows this list).
- If training loss spikes, reduce learning rate by 50% and resume from last stable checkpoint.
- Checkpoint every 1000 steps minimum; keep at least last 5 checkpoints for rollback.
- Do not release without running safety evaluations (ToxiGen, BBQ bias benchmark, red-teaming).
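To see why the GQA rule matters, a rough KV-cache estimate for the 13B preset at 4k context. The helper and its formula are an illustrative approximation, assuming bf16 (2-byte) caches.

```python
def kv_cache_bytes(num_layers: int, num_heads: int, hidden_dim: int,
                   seq_len: int, batch_size: int = 1,
                   kv_head_divisor: int = 8, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size; GQA uses num_kv_heads = num_heads / kv_head_divisor."""
    head_dim = hidden_dim // num_heads
    num_kv_heads = max(1, num_heads // kv_head_divisor)
    # Factor of 2 accounts for both the key and the value caches.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# 13B preset (40 layers, 40 heads, hidden 5120) at 4k context, per sequence:
mha = kv_cache_bytes(40, 40, 5120, 4096, kv_head_divisor=1)  # full multi-head attention
gqa = kv_cache_bytes(40, 40, 5120, 4096, kv_head_divisor=8)  # grouped-query attention
print(f"MHA: {mha / 2**30:.2f} GiB   GQA: {gqa / 2**30:.2f} GiB")
```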
Output requirements
- Architecture spec — model config (layers, dims, heads, context, vocab), parameter count
- Compute budget — FLOPs estimate, GPU type/count, estimated wall-clock time, cost
- Data plan — corpus composition, token counts, deduplication and filtering pipeline
- Training config — optimizer, LR schedule, batch size ramp, parallelism strategy
- Evaluation schedule — checkpoints, benchmarks, and pass/fail criteria at each stage
- Model card — standard model card with architecture, data, benchmarks, limitations
References
- Chinchilla scaling laws: Hoffmann et al., "Training Compute-Optimal Large Language Models" (arXiv:2203.15556)
- LLaMA architecture: Touvron et al. (arXiv:2302.13971)
- DeepSpeed ZeRO: Rajbhandari et al. (arXiv:1910.02054)
- Megatron-LM: Shoeybi et al. (arXiv:1909.08053)
- RoPE: Su et al., "RoFormer" (arXiv:2104.09864)
- Model cards: Mitchell et al., "Model Cards for Model Reporting" (FAT* 2019)
Related skills
model-architecture · data-cleaning-labeling · instruction-tuning · fine-tuning · preference-optimization
Failure handling
- If training loss diverges, roll back to the last stable checkpoint, halve the learning rate, and resume (a minimal rollback sketch follows this list).
- If downstream benchmarks regress at a checkpoint, investigate data quality for the recent training window — check for duplicates or corrupted shards.
- If GPU utilization drops below 30% MFU, profile for communication bottlenecks and adjust parallelism strategy (increase tensor parallelism, reduce pipeline stages).
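A minimal sketch of the divergence-recovery rule, assuming checkpoints were saved as a dict with "model", "optimizer", and "step" entries; adapt the keys to your training loop's checkpoint format.

```python
import torch

def resume_after_divergence(model, optimizer, ckpt_path: str, lr_scale: float = 0.5) -> int:
    """Reload the last stable checkpoint and halve the learning rate before resuming."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    for group in optimizer.param_groups:
        group["lr"] *= lr_scale  # halve the peak LR per the decision rule
    return ckpt.get("step", 0)   # step to resume the data loader / scheduler from
```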