Skilllibrary quantization-research
Applies post-training quantization (GPTQ, AWQ, GGUF) and quantization-aware training to reduce LLM memory footprint and inference cost. Use when the task involves bit-width selection, calibration, weight quantization, or evaluating perplexity degradation from quantized models. Do not use for general model compression that is not quantization-specific.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/quantization-research" ~/.claude/skills/merceralex397-collab-skilllibrary-quantization-research && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/quantization-research/SKILL.md
Purpose
Reduce model memory footprint and inference latency via weight and activation quantization while controlling quality degradation. Covers post-training quantization (PTQ) methods like GPTQ, AWQ, and GGUF conversion, as well as quantization-aware training (QAT) where forward passes simulate low-precision arithmetic.
When to use this skill
Use this skill when:
- quantizing a pretrained model for deployment (GPTQ, AWQ, or GGUF export)
- selecting bit-width (FP16, BF16, INT8, INT4, NF4) for a serving target
- running calibration passes to collect activation statistics for quantization ranges
- evaluating perplexity or task accuracy before and after quantization
- configuring `BitsAndBytesConfig` for 4-bit or 8-bit inference in HuggingFace
Do not use this skill when
- the task is knowledge distillation or model pruning without quantization — prefer `distillation-compression`
- the task is writing custom CUDA kernels for inference — prefer `inference-kernel-optimization`
- the task is deploying an already-quantized model behind an API — prefer `serving-architecture`
Operating procedure
- Identify the target precision. Choose FP16/BF16 for training-compatible half-precision, INT8 for balanced size/quality, or INT4/NF4 for aggressive compression. BF16 has wider dynamic range than FP16 and avoids overflow in large models.
- Select the quantization method.
  - GPTQ: Layer-wise PTQ using an approximate Hessian inverse to minimize squared error per layer. Use `auto_gptq` with a calibration dataset of 128–256 samples. Produces GPU-optimized models.
  - AWQ: Activation-aware weight quantization that protects salient weight channels (those multiplied by large activations). Use `autoawq`. Better generalization than GPTQ on some benchmarks.
  - GGUF/GGML: CPU-friendly quantization via `llama.cpp`. Convert with `python convert_hf_to_gguf.py`, then quantize with `./quantize model.gguf model-Q4_K_M.gguf Q4_K_M`.
  - BitsAndBytes: On-the-fly NF4/INT8 quantization at load time:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "your-model-id"  # placeholder: any causal LM on the HuggingFace Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, tuned for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 even though weights are stored in 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants for extra savings
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```
- Run calibration. Provide a representative calibration dataset (e.g., 128 samples from C4 or domain data). GPTQ uses this to compute layer-wise Hessians; AWQ uses it to identify salient channels. See the GPTQ sketch after this list.
- Evaluate quality. Measure perplexity on a held-out set (WikiText-2 or domain corpus). Run downstream task benchmarks (MMLU, HumanEval) and compare against the FP16 baseline. Accept ≤0.5 perplexity increase for INT4.
- For QAT, insert `torch.quantization.FakeQuantize` modules into the model, train for 1–5% of the original steps with a reduced learning rate, then export to INT8 via `torch.quantization.convert` (see the eager-mode sketch after this list).
- Package the artifact. Save quantized weights with metadata: original model, method, bit-width, calibration dataset, and perplexity delta.
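To ground the method-selection and calibration steps, here is a minimal GPTQ sketch using the `auto_gptq` API (`AutoGPTQForCausalLM`, `BaseQuantizeConfig`). The model id, calibration slice, and output directory are placeholders; check argument names against your installed `auto_gptq` version.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "your-model-id"  # placeholder: the FP16 model to quantize

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration set: ~128 representative samples (C4 here; swap in domain data).
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
examples = []
for sample in stream:
    enc = tokenizer(sample["text"], return_tensors="pt", truncation=True, max_length=512)
    examples.append({"input_ids": enc.input_ids, "attention_mask": enc.attention_mask})
    if len(examples) == 128:
        break

quantize_config = BaseQuantizeConfig(
    bits=4,          # target bit-width
    group_size=128,  # weights per quantization group; smaller groups track outliers better
    desc_act=False,  # True quantizes columns in activation order; can help accuracy, may slow inference
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                 # layer-wise Hessian-based reconstruction on the calibration set
model.save_quantized("model-gptq-4bit")  # placeholder output directory
```

For the QAT step, a minimal eager-mode PyTorch sketch. It applies to module types with built-in QAT support (e.g., `nn.Linear`, `nn.Conv2d`); `finetune` is a hypothetical caller-supplied training loop.

```python
import torch
from torch.quantization import convert, get_default_qat_qconfig, prepare_qat

def qat_to_int8(model: torch.nn.Module, finetune) -> torch.nn.Module:
    """Eager-mode QAT: simulate low-precision forward passes, then export to INT8."""
    model.train()
    # prepare_qat swaps supported modules for QAT variants that carry
    # FakeQuantize observers on weights and activations.
    model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 server backend; use "qnnpack" for ARM
    qat_model = prepare_qat(model)
    finetune(qat_model)  # hypothetical: 1-5% of original steps at a reduced learning rate
    qat_model.eval()
    return convert(qat_model)  # replace FakeQuantize modules with real INT8 kernels
```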
Decision rules
- Prefer AWQ over GPTQ when the model will be used across diverse tasks (AWQ generalizes better).
- Prefer GPTQ when targeting a single benchmark or domain and maximum compression is needed.
- Use GGUF when the deployment target is CPU or Apple Silicon (llama.cpp ecosystem).
- Use BitsAndBytes NF4 for quick experimentation; switch to GPTQ/AWQ for production.
- Do not skip calibration — uncalibrated quantization causes severe accuracy loss especially at INT4.
- Always compare quantized model against FP16 baseline on at least two evaluation axes (perplexity + one task metric).
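The evaluation step and the last rule above both call for a like-for-like perplexity comparison against the FP16 baseline. Below is a sketch of the standard strided sliding-window recipe; the dataset, window size, and stride are illustrative.

```python
import math
import torch
from datasets import load_dataset

def perplexity(model, tokenizer, text: str, max_length: int = 2048, stride: int = 512) -> float:
    """Strided sliding-window perplexity over a single concatenated text."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    seq_len = input_ids.size(1)
    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end     # tokens newly scored in this window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-trg_len] = -100  # mask the overlap so each token is scored once
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nll_sum += loss.item() * trg_len  # loss is a per-token mean; re-weight by window size
        n_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_tokens)

# Same held-out text for both models, e.g. WikiText-2 test:
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
# ppl_fp16 = perplexity(fp16_model, tokenizer, text)
# ppl_int4 = perplexity(int4_model, tokenizer, text)
```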
Output requirements
- Quantization Config — method, bit-width, calibration details, and BitsAndBytes/GPTQ/AWQ parameters used
- Quality Report — perplexity before/after, task accuracy delta, and failure cases identified
- Artifact Metadata — model card with quantization provenance for reproducibility
- Deployment Notes — compatible serving frameworks and expected memory/latency profile
References
Read these only when relevant:
- GPTQ paper: Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- AWQ paper: Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression"
- `auto-gptq` library: https://github.com/AutoGPTQ/AutoGPTQ
- `autoawq` library: https://github.com/casper-hansen/AutoAWQ
- `bitsandbytes` library: https://github.com/TimDettmers/bitsandbytes
- `llama.cpp` GGUF tooling: https://github.com/ggerganov/llama.cpp
Related skills
- `distillation-compression` — non-quantization model compression
- `inference-kernel-optimization` — custom kernels for quantized ops
- `serving-architecture` — deploying quantized models at scale
- `eval-dataset-design` — building calibration and evaluation datasets
Failure handling
- If perplexity degrades >1.0 points at INT4, fall back to INT8 or use mixed-precision (sensitive layers at higher precision).
- If calibration data is unavailable, use a generic corpus (C4 or RedPajama sample) but flag reduced domain accuracy confidence.
- If GPTQ produces outlier layers with high reconstruction error, try decreasing `group_size` (e.g., 128 → 64) or switch to AWQ.
- If the quantized model shows task-specific regressions not visible in perplexity, add task-specific eval samples to the calibration set and re-quantize.
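For the mixed-precision fallback in the first item, one low-effort option is to exclude sensitive modules at load time. A sketch assuming bitsandbytes NF4 and the `llm_int8_skip_modules` parameter, which in recent `transformers` versions is also respected for 4-bit loading (verify against your installed version):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep quantization-sensitive modules in higher precision while NF4-quantizing the rest.
# Module names are model-specific; inspect model.named_modules() on the FP16 model first.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["lm_head"],  # assumption: honored for 4-bit loading in recent transformers
)
model = AutoModelForCausalLM.from_pretrained("your-model-id", quantization_config=bnb_config)
```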