Hacktricks-skills text-tokenizer
How to tokenize text for LLMs and NLP models. Use this skill whenever the user needs to convert text into token IDs, understand tokenization methods (BPE, WordPiece, Unigram), work with vocabularies, or implement tokenization for machine learning. Make sure to use this skill when users mention tokenizing, token IDs, vocabulary creation, BPE, WordPiece, or any text preprocessing for ML models.
install
Clone the upstream repo:
git clone https://github.com/abelrguezr/hacktricks-skills
manifest:
skills/AI/AI-llm-architecture/1.-tokenizing/SKILL.MD
Text Tokenizer
A skill for tokenizing text into numerical IDs for machine learning models.
What is Tokenizing?
Tokenizing is the process of breaking down text into smaller pieces called tokens, each assigned a unique numerical ID. This is fundamental for preparing text for ML models, especially in NLP.
Goal: Divide input into tokens (IDs) in a way that makes sense for the model.
Basic Tokenization
1. Splitting Text
- Simple tokenizer splits text into words and punctuation
- Example: "Hello, world!" → ["Hello", ",", "world", "!"] (see the sketch below)
2. Creating a Vocabulary
- Maps each token to a numerical ID
- Special tokens:
- [BOS] (Beginning of Sequence): Marks text start
- [EOS] (End of Sequence): Marks text end
- [PAD] (Padding): Makes sequences the same length in batches
- [UNK] (Unknown): Represents tokens not in vocabulary
- Example: "Hello, world!" → [64, 455, 78, 467] (see the vocabulary sketch below)
3. Handling Unknown Words
- Words not in vocabulary get replaced with [UNK]
- Example: "Bye, world!" → [987, 455, 78, 467] (assuming [UNK] = 987); see the sketch below
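Continuing the toy vocabulary from the previous sketch, the [UNK] fallback during encoding can be as simple as a dictionary lookup with a default:

```python
# Toy vocabulary reused from the sketch above (IDs are illustrative)
vocab = {'[BOS]': 0, '[EOS]': 1, '[PAD]': 2, '[UNK]': 3,
         '!': 4, ',': 5, 'Hello': 6, 'world': 7}

unk_id = vocab["[UNK]"]
tokens = ["Bye", ",", "world", "!"]
print([vocab.get(t, unk_id) for t in tokens])
# [3, 5, 7, 4] -> "Bye" is out-of-vocabulary and maps to [UNK]
```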
Advanced Tokenization Methods
Byte Pair Encoding (BPE)
- Purpose: Reduces vocabulary size, handles rare/unknown words
- How it works:
- Starts with individual characters as tokens
- Iteratively merges most frequent token pairs
- Continues until no more frequent pairs exist
- Benefits:
- Eliminates need for [UNK] token
- More efficient and flexible vocabulary
- Example: "playing" → ["play", "ing"] (see the merge sketch below)
WordPiece
- Used by: BERT and similar models
- Purpose: Similar to BPE, breaks words into subword units
- How it works:
- Begins with base vocabulary of individual characters
- Iteratively adds most frequent subword that maximizes training data likelihood
- Uses probabilistic model for merging decisions
- Benefits:
- Balances vocabulary size with word representation
- Efficiently handles rare and compound words
- Example: "unhappiness" → ["un", "##happi", "##ness"] (continuation pieces carry the "##" prefix; see the greedy-matching sketch below)
Unigram Language Model
- Used by: SentencePiece
- Purpose: Uses probabilistic model for optimal subword selection
- How it works:
- Starts with large set of potential tokens
- Iteratively removes tokens that least improve model probability
- Finalizes vocabulary with most probable subword units
- Benefits:
- Flexible and natural language modeling
- Often results in more efficient tokenizations
- Example: "internationalization" → ["international", "ization"] (see the SentencePiece sketch below)
Implementation with tiktoken
Basic Usage
```python
import tiktoken

# Load GPT-2 tokenizer
encoding = tiktoken.get_encoding("gpt2")

# Encode text to token IDs
token_ids = encoding.encode("Hello, world!")
print(token_ids)  # [15496, 11, 995, 0]

# Decode token IDs back to text
text = encoding.decode(token_ids)
print(text)  # "Hello, world!"
```
With Special Tokens
```python
# GPT-2's tiktoken encoding defines "<|endoftext|>" as its special token
text = "Hello, world! <|endoftext|>"
token_ids = encoding.encode(text, allowed_special={"<|endoftext|>"})

# Check token count
print(len(token_ids))
```
Processing Files
```python
import urllib.request

import tiktoken

# Download text file
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Read and tokenize
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

token_ids = tiktoken.get_encoding("gpt2").encode(raw_text, allowed_special={"<|endoftext|>"})

# Print first 50 tokens
print(token_ids[:50])
```
Common Use Cases
- Preprocessing text for training - Convert training data to token IDs
- Understanding model input requirements - Know what format your model expects
- Debugging tokenization issues - Inspect how text is being tokenized
- Comparing different tokenization methods - Evaluate BPE vs WordPiece vs Unigram
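One practical way to compare methods is to run the same text through different encodings and compare token counts; a small sketch with tiktoken (the sample text is arbitrary):

```python
import tiktoken

# Fewer tokens for the same text generally means a more efficient
# vocabulary for that text.
text = "Tokenization quality differs between vocabularies."
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(name, len(ids), [enc.decode([i]) for i in ids])
```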
Best Practices
- Choose tokenizer based on your model - GPT-2 uses BPE, BERT uses WordPiece
- Handle special tokens appropriately - Include [BOS], [EOS] as needed for your use case
- Consider vocabulary size vs. tokenization quality tradeoff - Larger vocabularies may tokenize more efficiently but use more memory
- Test with edge cases - Rare words, special characters, different languages
- Use the right encoding - Match the tokenizer to your model architecture
Troubleshooting
Unknown tokens appearing
- Check if your vocabulary is large enough
- Consider using BPE or WordPiece instead of basic tokenization
- Verify special tokens are properly configured
Token count seems too high
- Try a different tokenization method (BPE often produces fewer tokens)
- Check if you're including unnecessary whitespace or special characters
Decoding produces unexpected output
- Ensure you're using the same encoding for encode/decode
- Check if special tokens are being handled correctly
- Verify the token IDs are valid for your vocabulary
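A quick round-trip sanity check (a sketch using tiktoken) covers the last two points:

```python
import tiktoken

# Encoding and decoding with the same encoding should reproduce the text
enc = tiktoken.get_encoding("gpt2")
text = "Verify the token IDs are valid."
ids = enc.encode(text)
assert all(0 <= i < enc.n_vocab for i in ids)  # IDs fall inside the vocabulary
assert enc.decode(ids) == text                 # same encoding for encode/decode
print("round trip OK:", ids)
```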