Natural Language Processing

What is Tokenization in AI?

Tokenization is the process of breaking text into smaller units called tokens, which are the fundamental building blocks that language models actually read, process, and generate. It is the very first step in how AI understands text.

Why Can't Models Just Read Words?

Neural networks operate on numbers, not letters. Before a language model can process any text, that text must be converted into a numerical representation. Tokenization is the bridge between human language and machine computation. It splits text into discrete pieces (tokens), and each token is assigned a unique numerical ID from the model's vocabulary.

But here is the critical insight: tokens are not the same as words. Depending on the tokenizer, a single word might become one token, or it might be split into multiple sub-word pieces. The word "unbelievable" might be broken into ["un", "believ", "able"]. A common word like "the" is typically a single token.

Key Insight

Tokens are the "atoms" of language models. Everything a model costs, everything it can remember (context window), and everything it generates is measured in tokens, not words. A rough rule of thumb for English: 1 token is approximately 0.75 words, or about 4 characters.
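That rule of thumb can be turned into a quick estimator. This is only a heuristic sketch; real counts vary by tokenizer and language:

```python
def estimate_tokens(text: str) -> int:
    """Rough English token estimate: ~4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough English token estimate: ~0.75 words per token."""
    return round(word_count / 0.75)

print(estimate_tokens("Tokenization is fascinating!"))  # 28 chars -> 7 tokens
print(estimate_tokens_from_words(3000))                 # -> 4000 tokens
```

Useful for budgeting prompts before calling an API, but always verify with the model's actual tokenizer when precision matters.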

Tokenization in Action

Here is how the same sentence might be tokenized by different approaches.

Input: "Tokenization is fascinating!"

Word-level:       Tokenization | is | fascinating | !
BPE (sub-word):   Token | ization | is | fascin | ating | !
Character-level:  T | o | k | e | n | i | z | a | t | i | o | n | ...

Sub-word tokenization (like BPE) strikes the best balance: it keeps common words intact while breaking rare or compound words into recognizable pieces. This means the model can handle words it has never seen before by composing them from familiar parts.
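Composing unseen words from familiar parts can be sketched with a greedy longest-match split over a toy sub-word vocabulary (the vocabulary below is invented for illustration; real tokenizers learn theirs from data):

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match sub-word split (a simplified sketch)."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a vocab hit.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:  # fall back to a single char
                tokens.append(piece)
                start = end
                break
    return tokens

toy_vocab = {"un", "believ", "able", "token", "ization"}
print(subword_tokenize("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
print(subword_tokenize("tokenization", toy_vocab))  # ['token', 'ization']
```

Even a word the "model" has never seen whole is representable, because it always bottoms out at single characters.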

Major Tokenization Algorithms

Three algorithms dominate modern NLP tokenization. Each takes a different approach to building a vocabulary of sub-word units.

Byte Pair Encoding (BPE)

Used by GPT-2, GPT-3, GPT-4, and LLaMA. BPE starts with individual characters and iteratively merges the most frequently co-occurring pairs until reaching a target vocabulary size.

How it works: Start with all characters as tokens. Find the pair that appears most often (e.g., "t" + "h" = "th"). Merge that pair into a new token. Repeat until the vocabulary reaches the desired size (e.g., 50,000 tokens).

Strength: Balances vocabulary size with the ability to represent any text, including rare words and code.
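The merge loop described above fits in a few lines of Python. This is a toy trainer on a tiny corpus, not a production tokenizer:

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Each word starts as a sequence of single-character tokens.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair, e.g. ('t', 'h')
        merges.append(best)
        merged = best[0] + best[1]
        for tokens in corpus:              # replace every occurrence in place
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == best:
                    tokens[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

merges = bpe_train(["the", "the", "this", "that", "then"], num_merges=3)
print(merges)  # first merge is ('t', 'h'): "th" appears in every word
```

Real implementations add frequency-weighted word counts and byte-level fallbacks, but the core loop is exactly this.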

🔬 WordPiece

Used by BERT and its derivatives. Similar to BPE, but instead of merging the most frequent pairs, WordPiece merges the pair that maximizes the likelihood of the training data when added to the vocabulary.

Signature feature: Uses the "##" prefix to denote continuation tokens. For example, "playing" might become ["play", "##ing"]. This clearly marks where a word has been split.

Strength: Produces linguistically meaningful subwords. The probability-based merging tends to create tokens that correspond to morphemes (meaningful word parts).
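The "##" convention makes rejoining tokens into words unambiguous. A minimal detokenization sketch:

```python
def wordpiece_detokenize(tokens: list[str]) -> str:
    """Rejoin WordPiece tokens: '##' marks a continuation of the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the continuation onto the current word
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_detokenize(["play", "##ing", "with", "token", "##izer", "##s"]))
# -> "playing with tokenizers"
```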

🌐 SentencePiece

Used by T5, ALBERT, and many multilingual models. Unlike BPE and WordPiece, SentencePiece treats the input as a raw byte stream rather than pre-tokenized words. It does not require whitespace-separated input.

Why it matters: Many languages (Japanese, Chinese, Thai) do not use spaces between words. SentencePiece handles these languages natively without needing a language-specific word segmenter.

Strength: Truly language-agnostic. The same algorithm works for English, Japanese, Arabic, and code without modification.

The Tokenization Pipeline

Raw text: "Hello world" (string input) → Tokenizer: ["Hello", "world"] (sub-word split) → Token IDs: [15496, 995] (integer lookup) → LLM: embedding + processing (vectors in, text out)
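The same pipeline can be sketched end to end with a toy whitespace tokenizer. The vocabulary and IDs here are invented for illustration; real IDs depend on the model's tokenizer:

```python
# Toy encode/decode pipeline: text -> tokens -> IDs and back.
vocab = {"Hello": 0, "world": 1, "!": 2, "<unk>": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Whitespace split, then look each token up (unknowns map to <unk>)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

def decode(ids: list[int]) -> str:
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Hello world")
print(ids)          # [0, 1]
print(decode(ids))  # "Hello world"
```

A real sub-word tokenizer replaces the whitespace split with learned merges, but the string → tokens → integers → model flow is identical.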

Token Limits and Context Windows

Every language model has a maximum number of tokens it can process at once, known as its context window. This is one of the most important practical constraints when working with LLMs, and it is measured entirely in tokens, not words or characters.

Model            Context Window        Approx. Words           Approx. Pages
GPT-3.5          4,096 tokens          ~3,000 words            ~6 pages
GPT-4            8,192 / 128K tokens   ~6,000 / ~96,000 words  ~12 / ~192 pages
Claude 3         200K tokens           ~150,000 words          ~300 pages
Gemini 1.5 Pro   1M+ tokens            ~750,000 words          ~1,500 pages

Why Token Limits Matter for Cost

API pricing for LLMs is based on tokens consumed (input + output). A verbose prompt that uses 2,000 tokens costs twice as much as a concise 1,000-token prompt producing the same result. Efficient prompt engineering is, at its core, token engineering.
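A back-of-the-envelope cost calculation makes the point concrete. The per-1K prices below are placeholders, not current rates for any provider:

```python
def api_cost(input_tokens: int, output_tokens: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Total cost in dollars: input and output tokens are billed separately."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical pricing: $0.01 / 1K input tokens, $0.03 / 1K output tokens.
verbose = api_cost(2000, 500, 0.01, 0.03)   # $0.035
concise = api_cost(1000, 500, 0.01, 0.03)   # $0.025
print(f"verbose: ${verbose:.3f}, concise: ${concise:.3f}")
```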

How Languages Tokenize Differently

Tokenization is not equal across languages, and this has real consequences for cost, performance, and fairness in multilingual AI systems.

🇬🇧 English

English tokenizes efficiently because most tokenizers are trained primarily on English text. Common English words are single tokens. The sentence "The cat sat on the mat" is typically 6-7 tokens.

🇯🇵 Japanese / Chinese

CJK languages often require more tokens per concept because characters may be split into byte-level pieces. A single Japanese kanji character might consume 2-3 tokens. This means Japanese text can cost 2-3 times more to process than equivalent English text.

🇮🇳 Hindi / Arabic

Languages with complex scripts and rich morphology tend to fragment more. A Hindi word might split into 3-5 tokens where the English equivalent is just 1. This "tokenization tax" means non-English users get less value from the same context window.

The Tokenization Fairness Problem

Because most tokenizers are trained on English-heavy datasets, they develop larger vocabularies for English sub-words. Non-English languages are under-represented, leading to more fragmentation, higher costs, and lower effective context lengths. This is an active area of research, and language-agnostic approaches such as SentencePiece, trained on balanced multilingual corpora, are one step toward narrowing the disparity.

Why Tokenization Matters for Model Performance

Tokenization is not just a preprocessing step. It fundamentally shapes what a model can learn and how well it performs.

📈 Vocabulary Size Trade-off

A larger vocabulary means more words can be single tokens (faster, more efficient), but the embedding matrix grows larger, consuming more memory. A smaller vocabulary is more compact, but text fragments into more tokens, slowing processing and reducing effective context length. Most modern models use vocabularies of 30,000 to 100,000 tokens.
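The memory side of this trade-off is easy to quantify, since the embedding matrix has shape vocab_size × hidden_dim. A quick sketch (hidden_dim=4096 and fp16 storage are assumptions, not any specific model's configuration):

```python
def embedding_memory_mb(vocab_size: int, hidden_dim: int,
                        bytes_per_param: int = 2) -> float:
    """Embedding matrix memory in MB (fp16 by default): vocab_size x hidden_dim."""
    return vocab_size * hidden_dim * bytes_per_param / (1024 ** 2)

# Doubling the vocabulary doubles the embedding memory.
print(f"{embedding_memory_mb(50_000, 4096):.0f} MB")   # ~391 MB
print(f"{embedding_memory_mb(100_000, 4096):.0f} MB")  # ~781 MB
```

Note the output embedding (the final projection back to the vocabulary) typically doubles this figure again unless weights are tied.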

💻 Code and Special Characters

Tokenizers must handle not just natural language but also code, mathematical notation, URLs, and special characters. Modern tokenizers like those in GPT-4 and Code Llama are specifically trained on code corpora to tokenize programming languages efficiently, keeping common code patterns as single tokens.

🔢 Arithmetic and Numbers

Numbers are notoriously tricky for tokenizers. "123456" might be tokenized as ["123", "456"] or ["1", "234", "56"], making it difficult for models to perform arithmetic. This is one reason LLMs struggle with math: they do not see numbers as numbers but as arbitrary sub-word chunks.