Claude-code-templates huggingface-tokenizers
Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
```bash
# Clone the full repository
git clone https://github.com/davila7/claude-code-templates

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/davila7/claude-code-templates "$T" && mkdir -p ~/.claude/skills && cp -r "$T/cli-tool/components/skills/ai-research/tokenization-huggingface-tokenizers" ~/.claude/skills/davila7-claude-code-templates-huggingface-tokenizers && rm -rf "$T"
```
cli-tool/components/skills/ai-research/tokenization-huggingface-tokenizers/SKILL.md

HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when:
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI's BPE tokenizer for GPT models
- transformers AutoTokenizer: Loading pretrained only (uses this library internally)
Quick start
Installation
```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```
Load pretrained tokenizer
```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text (the pretrained BERT tokenizer adds [CLS]/[SEP] itself)
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```
Train custom BPE tokenizer
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```
Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
Batch encoding with padding
```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```
Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find most frequent character pair
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)
tokenizer.train(files=["data.txt"], trainer=trainer)
```
Advantages:
- Handles OOV words well by breaking them into subwords (see the sketch below)
- Flexible vocabulary size
- Good for morphologically rich languages
Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
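To see the subword fallback in practice, here is a minimal sketch using the pretrained `gpt2` byte-level BPE tokenizer (the exact split in the comment is illustrative; it depends on the learned merges):

```python
from tokenizers import Tokenizer

# Load GPT-2's byte-level BPE tokenizer from the Hub
tokenizer = Tokenizer.from_pretrained("gpt2")

# A rare word is decomposed into known subword pieces
# instead of being mapped to an unknown token.
output = tokenizer.encode("tokenizational")
print(output.tokens)  # several subword pieces, e.g. ['token', 'izational']
```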
WordPiece
How it works:
- Start with character vocabulary
- Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))` (worked example below)
- Merge the highest-scoring pair
- Repeat until vocabulary size reached
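To make the scoring concrete, here is a tiny worked example with made-up counts (the numbers are illustrative, not from a real corpus):

```python
# Hypothetical corpus counts, for illustration only
freq_pair = 20     # the adjacent pair ("un", "##able") appears 20 times
freq_first = 50    # "un" appears 50 times on its own
freq_second = 40   # "##able" appears 40 times on its own

score = freq_pair / (freq_first * freq_second)
print(score)  # 0.01

# A far more frequent but less exclusive pair scores lower:
# ("th", "##e") with counts 1000, 5000, 4000
print(1000 / (5000 * 4000))  # 5e-05
```

Unlike BPE's raw-frequency criterion, the denominator penalizes pairs whose parts are common on their own, so merges favor pieces that mostly occur together.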
Used by: BERT, DistilBERT, MobileBERT
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```
Advantages:
- Prioritizes informative merges (a high score means the pair occurs together far more often than its parts occur apart)
- Used successfully in BERT (state-of-the-art results)
Trade-offs:
- Unknown words become `[UNK]` if no subword match exists (see the sketch below)
- Saves the vocabulary, not merge rules (larger files)
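A quick way to see the `[UNK]` fallback, using the pretrained `bert-base-uncased` tokenizer (a minimal sketch; the exact output depends on the vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# "☃" has no matching subword in BERT's vocabulary,
# so it should come back as '[UNK]'
output = tokenizer.encode("hello ☃")
print(output.tokens)
```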
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)
tokenizer.train(files=["data.txt"], trainer=trainer)
```
Advantages:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts
Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
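A minimal sketch of all four stages wired together (the special-token IDs here are placeholders; a real tokenizer takes them from its trained vocabulary):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# 1. Model: the core subword algorithm
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# 2. Normalization: clean and standardize raw text
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

# 3. Pre-tokenization: split text into word-like units
tokenizer.pre_tokenizer = Whitespace()

# 4. Post-processing: add special tokens around the encoded output
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # placeholder IDs
)

# The tokenizer still needs training (or a loaded vocabulary) before encoding.
```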
Normalization
Clean and standardize text:
```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),          # Unicode normalization (decompose)
    Lowercase(),    # Convert to lowercase
    StripAccents()  # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
```
Common normalizers:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement
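Normalizers can be tested in isolation with `normalize_str`, which is handy when debugging a pipeline:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

# Apply the normalizer directly to a string, outside any tokenizer
normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllo WORLD"))  # "hello world"
```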
Pre-tokenization
Split text into word-like units:
```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
```
Common pre-tokenizers:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
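Pre-tokenizers can likewise be tested standalone with `pre_tokenize_str`, which returns the pieces together with their character offsets:

```python
from tokenizers.pre_tokenizers import Whitespace

# Apply the pre-tokenizer directly to a string
pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```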
Post-processing
Add special tokens for model input:
```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```
Common patterns:
```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```
Alignment tracking
Track token positions in original text:
```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```
Use cases:
- Named entity recognition (map predictions back to text; see the sketch below)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
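For example, a minimal NER-style sketch (the tagged token indices are hypothetical; in practice they come from your model's predictions):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Ada Lovelace wrote the first program."
output = tokenizer.encode(text)

# Token indices a model tagged as PERSON -- hypothetical values
entity_tokens = [1, 2]

# Map the tagged tokens back to a character span via offsets
start = output.offsets[entity_tokens[0]][0]
end = output.offsets[entity_tokens[-1]][1]
print(text[start:end])  # e.g. "Ada Lovelace", if those tokens cover the name
```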
Integration with transformers
Load with AutoTokenizer
```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using a fast tokenizer
print(tokenizer.is_fast)  # True

# Access the underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```
Convert custom tokenizer to transformers
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```
Common patterns
Train from iterator (large datasets)
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```
Performance: Processes 1GB in ~10-20 minutes
Enable truncation and padding
```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```
Multi-processing
```python
from multiprocessing import Pool

from tokenizers import Tokenizer

# Load tokenizer at module level so worker processes inherit it
tokenizer = Tokenizer.from_file("tokenizer.json")

corpus = [...]  # list of input strings, assumed loaded elsewhere

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```
Speedup: 5-8× with 8 cores
Performance benchmarks
Training speed
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
Memory usage
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Supported models
Pre-trained tokenizers are available via `from_pretrained()`:
BERT family:
- `bert-base-uncased`, `bert-large-cased`, `distilbert-base-uncased`
- `roberta-base`, `roberta-large`
GPT family:
- `gpt2`, `gpt2-medium`, `gpt2-large`, `distilgpt2`
T5 family:
- `t5-small`, `t5-base`, `t5-large`, `google/flan-t5-xxl`
Other:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`
Browse all: https://huggingface.co/models?library=tokenizers
References
- Training Guide - Train custom tokenizers, configure trainers, handle large datasets
- Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
- Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
- Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens
Resources
- Docs: https://huggingface.co/docs/tokenizers
- GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
- Version: 0.20.0+
- Course: https://huggingface.co/learn/nlp-course/chapter6/1
- Paper: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)