AI-101

Tokenizer

The component that converts text into numerical tokens that a language model can process.

Tags: technical, core-concepts

AI-generated

What It Means

A tokenizer breaks text into tokens, the atomic units a language model processes. Different models use different tokenizers. Common words are often a single token ("the" = 1 token), while rare words get split into subword pieces ("cryptocurrency" might become "crypt" + "ocurrency"). Each token maps to an integer ID, and those IDs are what the model actually works with.
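The splitting step can be illustrated with a toy greedy longest-match tokenizer. This is a minimal sketch, not a real algorithm like BPE or WordPiece (which learn their vocabularies from data); the tiny vocabulary below is invented to reproduce the "the" and "crypt" + "ocurrency" examples from the text.

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# Real tokenizers (BPE, WordPiece) learn their vocabularies from a corpus;
# this hand-picked vocabulary exists only to illustrate the mechanics.
VOCAB = ["the", "crypt", "ocurrency", "token", "izer"]

# Map each vocabulary piece to an integer ID, as a model expects.
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible match first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in TOKEN_TO_ID:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[i:]!r}")
    return pieces

def encode(word: str) -> list[int]:
    """Convert a word into the integer IDs the model actually sees."""
    return [TOKEN_TO_ID[p] for p in tokenize(word)]

print(tokenize("the"))             # common word: a single token
print(tokenize("cryptocurrency"))  # rare word: split into subword pieces
print(encode("cryptocurrency"))    # the model sees only these IDs
```

Note that after encoding, the model never sees individual characters, only the IDs, which is why character-level tasks are awkward for it.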

Why It Matters

Tokenizer behavior affects everything from pricing (you pay per token) to model capability (some tokenizers handle code or non-English languages better). When a model seems to struggle with counting characters or spelling, it is often because those tasks are unnatural at the token level. Understanding tokenization helps you write better prompts and predict model behavior.

Sources & Further Reading

Hugging Face: Tokenizers library - https://huggingface.co/docs/tokenizers

OpenAI: tiktoken (tokenizer library) - https://github.com/openai/tiktoken