Skilllibrary tokenizer-design
Design, train, and evaluate tokenizers (BPE, SentencePiece unigram) for LLMs. Use when selecting vocabulary size, defining special tokens, training a tokenizer on a corpus, analyzing fertility/compression, or handling multilingual coverage. Covers the HuggingFace `tokenizers` library and `SentencePieceTrainer`.
Install
- Source · Clone the upstream repo:
  git clone https://github.com/merceralex397-collab/skilllibrary
- Claude Code · Install into ~/.claude/skills/:
  T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/tokenizer-design" ~/.claude/skills/merceralex397-collab-skilllibrary-tokenizer-design && rm -rf "$T"
- Manifest: 12-ai-llm-training-architecture-and-research/tokenizer-design/SKILL.md
Purpose
Guide the design, training, and evaluation of tokenizers for language models—covering algorithm selection (BPE vs unigram), vocabulary sizing, special token definition, pre-tokenization strategy, multilingual script coverage, and compression ratio analysis.
When to use this skill
- Training a new BPE or unigram tokenizer on a custom corpus
- Choosing vocabulary size (32k / 64k / 128k) and analyzing its impact on embedding parameters and sequence length
- Defining special tokens (`<bos>`, `<eos>`, `<pad>`, `<unk>`) or chat template tokens (`<|im_start|>`, `<|im_end|>`)
- Configuring pre-tokenization: whitespace splitting, digit isolation, byte-level fallback
- Measuring tokenizer fertility (tokens per word) and compression ratio (bytes per token)
- Ensuring multilingual coverage with character coverage thresholds (e.g., 0.9999)
Do not use this skill when
- The task is model architecture design (weight layout, attention)—use model-architecture
- The task is end-to-end pretraining orchestration—use pretraining-pipeline
- You only need to load an existing tokenizer for inference with no design decisions
Operating procedure
- Select algorithm. Choose BPE (`tokenizers.models.BPE`) for merge-rule transparency or SentencePiece unigram for probabilistic subword selection. Use `sentencepiece.SentencePieceTrainer.train()` for language-agnostic byte-fallback support.
- Set vocabulary size. Start with 32k for single-language models; use 64k–128k for multilingual. Each vocab entry adds `hidden_dim` parameters to the embedding matrix—quantify the tradeoff (see the sizing sketch after this list).
- Define special tokens. Register `<bos>`, `<eos>`, `<pad>`, `<unk>` at fixed IDs. For chat models add role delimiters: `<|im_start|>system`, `<|im_end|>`. Reserve a contiguous block for future additions (see the registration sketch after this list).
- Configure pre-tokenization. Use `pre_tokenizers.ByteLevel(add_prefix_space=False)` for GPT-style or `pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])` for Llama-style digit splitting.
- Train the tokenizer.

  ```python
  from tokenizers import Tokenizer, models, trainers, pre_tokenizers

  tokenizer = Tokenizer(models.BPE())
  tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
  trainer = trainers.BpeTrainer(vocab_size=64000, special_tokens=["<pad>", "<eos>", "<bos>", "<unk>"])
  tokenizer.train(files=["corpus.txt"], trainer=trainer)
  ```

  For SentencePiece: `spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="tok", vocab_size=64000, character_coverage=0.9999, model_type="bpe")`.
- Analyze fertility. Compute tokens-per-word on held-out samples per language. Target ≤1.5 for English, ≤2.5 for CJK. Compute bytes-per-token; higher is better compression (a measurement sketch follows this list).
- Test round-trip fidelity. Assert `decode(encode(text)) == text` for edge cases: Unicode emoji, mixed-script, code blocks, LaTeX math, zero-width joiners (a test sketch follows this list).
- Export. Save as `tokenizer.json` (HuggingFace fast tokenizer format) or `.model` (SentencePiece). Commit the trained artifact alongside its training config for reproducibility.
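To quantify the vocabulary-size tradeoff from the sizing step, here is a minimal back-of-the-envelope sketch; the `hidden_dim` of 4096 is an illustrative assumption, not a recommendation from this skill.

```python
# Embedding-parameter cost of candidate vocab sizes (hidden_dim=4096 is an assumed example width).
hidden_dim = 4096

for vocab_size in (32_000, 64_000, 128_000):
    embed_params = vocab_size * hidden_dim  # rows in the input embedding matrix
    print(f"vocab={vocab_size:>7,}  embedding={embed_params / 1e6:.0f}M params"
          "  (doubled if the LM head is untied)")
```

At these assumed dimensions the jump from 32k to 128k costs roughly 400M extra embedding parameters, which is the tradeoff the sizing step asks you to weigh against shorter sequences.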
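For the special-token step, a sketch using the `tokenizers` API: base control tokens are passed to the trainer (as in the training snippet above) so they land at the lowest IDs, and chat delimiters can be appended to an already-trained `tokenizer.json` afterwards. The file name and token set mirror the examples above and are placeholders.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # artifact produced by the training step above

# Append chat-template delimiters after training; they take the next free IDs.
tokenizer.add_special_tokens(["<|im_start|>", "<|im_end|>"])

# Record the special-token map so the IDs can be pinned in the model config.
for tok in ["<pad>", "<bos>", "<eos>", "<unk>", "<|im_start|>", "<|im_end|>"]:
    print(f"{tok:>14} -> id {tokenizer.token_to_id(tok)}")

tokenizer.save("tokenizer.json")
```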
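For the fertility step, a minimal measurement sketch; the sample sentences are stand-ins for real held-out data, and whitespace word counts are only a rough proxy for languages without spaces (CJK needs a proper segmenter).

```python
from tokenizers import Tokenizer

def fertility_report(tokenizer_path: str, samples: dict) -> None:
    """Print tokens-per-word and bytes-per-token for each held-out sample."""
    tok = Tokenizer.from_file(tokenizer_path)
    for lang, text in samples.items():
        ids = tok.encode(text).ids
        words = max(len(text.split()), 1)  # whitespace words; rough for CJK
        print(f"{lang}: {len(ids) / words:.2f} tokens/word, "
              f"{len(text.encode('utf-8')) / max(len(ids), 1):.2f} bytes/token")

# Placeholder samples; replace with held-out text per language/domain.
fertility_report("tokenizer.json", {
    "en": "The quick brown fox jumps over the lazy dog.",
    "de": "Der schnelle braune Fuchs springt über den faulen Hund.",
})
```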
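For the round-trip step, a sketch of the fidelity check; it assumes the saved `tokenizer.json` includes a matching decoder (e.g. a byte-level decoder paired with the byte-level pre-tokenizer), otherwise decoding will not reproduce the raw text.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

EDGE_CASES = [
    "naïve café 🌍",                             # accents and emoji
    "価格は1,200円です (about $8).",              # mixed script
    "def f(x):\n    return x ** 2  # comment",   # code with indentation
    r"\int_0^1 x^2 \, dx = \frac{1}{3}",         # LaTeX math
    "👩🏽‍🚀",                                      # zero-width-joiner emoji sequence
]

for text in EDGE_CASES:
    decoded = tok.decode(tok.encode(text).ids)
    status = "pass" if decoded == text else "FAIL"
    print(f"{status}: {text!r}")
```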
Decision rules
- If the corpus is multilingual (>3 scripts), prefer SentencePiece with `byte_fallback=True` to avoid `<unk>` on rare characters (see the sketch after this list).
- If vocabulary budget is constrained, prefer 32k with byte-level fallback over 64k without—smaller embedding matrix, zero OOV.
- If the model will process code, ensure digit splitting is enabled and whitespace sequences are preserved as individual tokens.
- Never silently drop characters—verify the `<unk>` rate is 0% on a diverse validation set.
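A sketch of the multilingual rule above, assuming `sentencepiece` is installed and a corpus.txt is on disk; the output prefix `tok_multi` is a hypothetical name.

```python
import sentencepiece as spm

# Unigram model with byte fallback: characters unseen at training time decompose
# into byte pieces instead of collapsing to <unk>.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder corpus path
    model_prefix="tok_multi",    # hypothetical prefix -> tok_multi.model / tok_multi.vocab
    vocab_size=64000,
    model_type="unigram",
    character_coverage=0.9999,
    byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="tok_multi.model")
print(sp.encode("ᚠᚢᚦᚨᚱᚲ", out_type=str))  # rare script should surface as byte pieces, not <unk>
```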
Output requirements
- Tokenizer config — algorithm, vocab size, pre-tokenizer chain, special token map with fixed IDs
- Training command — reproducible script or snippet with corpus path, coverage, and vocab size
- Fertility report — tokens-per-word and bytes-per-token per language/domain on held-out data
- Round-trip test results — pass/fail on Unicode, code, math, and multilingual edge cases
References
- HuggingFace tokenizers docs: https://huggingface.co/docs/tokenizers
- SentencePiece repo: https://github.com/google/sentencepiece
- Kudo & Richardson, "SentencePiece: A simple and language independent subword tokenizer" (EMNLP 2018)
- Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (ACL 2016)
Related skills
- model-architecture — embedding layer sizing depends on vocab size chosen here
- pretraining-pipeline — tokenizer must be finalized before data preprocessing begins
- llm-creation — end-to-end model creation references tokenizer design decisions
Failure handling
- If fertility exceeds 3.0 tokens/word for a target language, increase `character_coverage` or add that language's data to the training corpus and retrain.
- If round-trip decode fails, check for normalization (NFKC) conflicts in SentencePiece—disable normalization or switch to byte-level BPE (a retraining sketch follows this list).
- If vocab size causes OOM in the embedding layer, reduce vocab and verify compression ratio remains acceptable before proceeding.
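One way to apply the normalization fix above, sketched with assumed paths: retrain with the identity normalization rule so NFKC rewriting no longer interferes with round-trip decoding, then re-check the failing strings.

```python
import sentencepiece as spm

# Retrain without NFKC normalization so encode/decode does not rewrite compatibility characters.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # placeholder corpus path
    model_prefix="tok_raw",              # hypothetical output prefix
    vocab_size=64000,
    model_type="bpe",
    character_coverage=0.9999,
    byte_fallback=True,
    normalization_rule_name="identity",  # disable NFKC
)

sp = spm.SentencePieceProcessor(model_file="tok_raw.model")
text = "ﬁle №5"                          # ligature and numero sign that NFKC would rewrite
print(sp.decode(sp.encode(text)) == text)  # expect True once normalization is out of the path
```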