Tokenization

LLMs don't process raw text. Before inference, a tokenizer converts the text into a sequence of integers. Each integer identifies a "token": typically a subword fragment, a whole word, or a punctuation character, depending on how frequently that string appeared in the tokenizer's training corpus.

How tokenizers work

Modern LLMs use subword tokenization algorithms such as Byte Pair Encoding (BPE), often through libraries such as SentencePiece, with vocabularies trained on large text corpora. Common words like "the", "is", and "of" become single tokens. Rare words and technical terms split into fragments: "tokenization" might become ["token", "ization"]. Code identifiers, URLs, and non-English text are often tokenized less efficiently, consuming more tokens per character.
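The merge idea behind BPE can be sketched in a few lines. This is a toy illustration, not any production tokenizer: the function names (train_bpe, encode) and the tiny word-frequency corpus are assumptions made up for the example, and real tokenizers work on bytes with far larger corpora and vocabularies.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Start with every word split into single characters.
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


# Hypothetical corpus: "the" is frequent, "token" is rare.
corpus = {"the": 10, "then": 3, "token": 2}
merges = train_bpe(corpus, num_merges=2)
print(encode("the", merges))    # frequent word collapses to one token
print(encode("token", merges))  # rare word stays in fragments
```

With only two merges, the frequent word "the" ends up as a single token while "token" remains split into characters, which mirrors the frequency effect described above.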

Why token count matters

Most LLM APIs charge by token consumption, not by characters or words. Context windows and output length limits are also measured in tokens. A rough rule for English: 1 token is approximately 0.75 words, or about 4 characters. Code and non-English text typically consume more tokens for the same amount of content.
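The rule of thumb above can be turned into a quick estimator. This is a rough sketch for English prose only; the function names and the price used in the example are hypothetical, and actual counts depend on the model's tokenizer.

```python
import math

CHARS_PER_TOKEN = 4       # rule of thumb for English text
WORDS_PER_TOKEN = 0.75    # rule of thumb for English text


def estimate_tokens(text):
    """Estimate token count, taking the larger of the two heuristics."""
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    return math.ceil(max(by_chars, by_words))


def estimate_cost(tokens, price_per_million_tokens):
    """Cost for a token count at a given (hypothetical) per-million price."""
    return tokens * price_per_million_tokens / 1_000_000


sample = "Most LLM APIs charge by token consumption, not by characters or words."
tokens = estimate_tokens(sample)
# The price below is a placeholder, not any provider's real rate.
print(tokens, estimate_cost(tokens, price_per_million_tokens=2.0))
```

Taking the larger of the two estimates is a deliberately conservative choice, so budgets err on the high side.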

Counting tokens

OpenAI provides tiktoken, an open-source tokenizer library that can count tokens for models such as GPT-4 and GPT-3.5. Anthropic offers a token counting API for Claude models. Google publishes token counting tools for Gemini. Most LLM providers include token estimators in their documentation. For reference: a 10-page PDF is roughly 5,000-8,000 tokens; a full novel runs around 100,000-150,000 tokens.
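As a minimal sketch of exact counting, the snippet below uses tiktoken's cl100k_base encoding (the one used by GPT-4 and GPT-3.5 Turbo) when the library is available. The count_tokens helper and its fallback to the 4-characters-per-token heuristic are assumptions added for this example, not part of tiktoken.

```python
import math


def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken if possible, else fall back to a heuristic."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken missing or its vocabulary could not be loaded:
        # approximate with ~4 characters per token (English only).
        return math.ceil(len(text) / 4)


print(count_tokens("Tokenization splits text into subword units."))
```

Counts from this helper apply only to models that use the named encoding; other providers' models need their own tokenizers or counting endpoints.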

Token efficiency and model behavior

Some tasks inherently consume more tokens. Chain-of-thought prompting generates intermediate reasoning tokens before producing the final answer. Models designed for extended reasoning allocate additional tokens to internal processing, trading token consumption for improved accuracy on complex problems.
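The trade-off can be made concrete with simple arithmetic. The sketch below assumes that intermediate reasoning tokens are counted as output tokens alongside the final answer, which may not hold for every provider; the function name and the numbers are hypothetical.

```python
def output_token_breakdown(reasoning_tokens, answer_tokens):
    """Return total output tokens and the share spent on reasoning."""
    total = reasoning_tokens + answer_tokens
    share = reasoning_tokens / total if total else 0.0
    return total, share


# Hypothetical request: 300 reasoning tokens before a 100-token answer.
total, share = output_token_breakdown(300, 100)
print(total, share)  # 400 0.75
```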
