Tokenizer
The component that converts text into numerical tokens that a language model can process.
A tokenizer breaks text into tokens, which are the atomic units a language model processes. Different models use different tokenizers. Common words might be single tokens ("the" = 1 token). Rare words get split into subword pieces ("cryptocurrency" might be "crypt" + "ocurrency"). Each token maps to a number that the model works with.
Tokenizer behavior affects everything from pricing (you pay per token) to model capability (some tokenizers handle code or non-English languages better). When a model seems to struggle with counting characters or spelling, it is often because those tasks are unnatural at the token level. Understanding tokenization helps you write better prompts and predict model behavior.
Hugging Face: Tokenizers library - https://huggingface.co/docs/tokenizers
OpenAI: tiktoken (tokenizer library) - https://github.com/openai/tiktoken