Core Concepts

Tokenization

The process of splitting text into tokens, the smallest units an LLM processes. Tokens are usually subword pieces, not full words.

LLMs don't process raw text. Before any inference, text is converted into a sequence of integers using a tokenizer. Each integer maps to a "token" - typically a subword fragment, full word, or punctuation character, depending on frequency in the training corpus.
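
As a rough illustration of the text-to-integers step, the sketch below encodes a string with a tiny hand-made vocabulary using greedy longest-match lookup. The vocabulary, the IDs, and the `encode`/`decode` helpers are hypothetical and exist only for this example; real tokenizers learn vocabularies of tens of thousands of entries and use more careful matching rules.

```python
# Toy vocabulary: a few multi-character pieces plus single-character fallbacks.
# All pieces and IDs here are invented for illustration, not taken from a real model.
PIECES = ["token", "ization", " is", " fun", "!"]
CHARS = sorted(set("abcdefghijklmnopqrstuvwxyz !"))

VOCAB = {}
for piece in PIECES + CHARS:
    VOCAB.setdefault(piece, len(VOCAB))  # assign the next free integer ID
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}
MAX_PIECE_LEN = max(len(piece) for piece in VOCAB)


def encode(text):
    """Convert text to a list of integer IDs by greedy longest-match lookup."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a piece matches.
        for length in range(min(MAX_PIECE_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token covers character {text[pos]!r}")
    return ids


def decode(ids):
    """Map integer IDs back to their pieces and join them into text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


print(encode("tokenization is fun!"))  # [0, 1, 2, 3, 4]
print(len(encode("a rare word")))      # 11: one token per character
```

The second call shows the effect described above: text that matches no multi-character piece falls back to one token per character, so it costs more tokens for the same amount of text.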

How tokenizers work

Modern LLMs use subword tokenizers, most commonly Byte Pair Encoding (BPE), often built with libraries such as SentencePiece and trained on a large text corpus. Common words ("the", "is", "of") become single tokens. Rare words and technical terms are split into fragments: "tokenization" might become ["token", "ization"], while "GPT" might stay as ["GPT"]. Code identifiers, URLs, and non-English text are often tokenized less efficiently: they use more tokens per character.
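
The core idea of BPE training can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair. This is a minimal, simplified sketch; the function names (`train_bpe`, `merge_pair`, `segment`) and the tiny corpus are assumptions made for illustration, and production tokenizers add byte-level handling, pre-tokenization rules, and much larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Tokenize a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus: "the" is frequent, "tokenization" appears once.
merges = train_bpe(["the"] * 10 + ["then"] * 2 + ["tokenization"], num_merges=2)
print(segment("the", merges))           # ['the']
print(segment("then", merges))          # ['the', 'n']
print(len(segment("tokenization", merges)))  # 12: still one symbol per character
```

With only two merges, the frequent word collapses into a single token while the rare word remains split into characters, which mirrors the frequency effect described above.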

Why token count matters

API pricing is based on tokens, not characters or words. Context windows are measured in tokens, and so are output length limits. A rough rule for English: one token is about 0.75 words, or about 4 characters, so 100 tokens is roughly 75 words. Code and non-English text typically use more tokens for equivalent content.
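
The rule of thumb above translates directly into a quick estimator. The sketch below averages the character-based and word-based estimates and turns the result into a cost; the helper names and the price used in the example are hypothetical placeholders, not real API rates, and the estimate only applies to ordinary English prose.

```python
def estimate_tokens(text):
    """Estimate token count for English text from the ~4 chars / ~0.75 words rule."""
    by_chars = len(text) / 4              # about 4 characters per token
    by_words = len(text.split()) / 0.75   # about 0.75 words per token
    return round((by_chars + by_words) / 2)


def estimate_cost(num_tokens, price_per_million_tokens):
    """Convert a token count into a cost, given a price per one million tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


sample = "Tokenization splits text into subword pieces before inference. " * 100
tokens = estimate_tokens(sample)
# 2.00 per million tokens is an invented price used only for this example.
print(tokens, estimate_cost(tokens, price_per_million_tokens=2.00))
```

For code or non-English text, an estimate like this will usually come out too low, so an exact count from the model's own tokenizer is the safer choice when the budget is tight.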

Practical token estimation

OpenAI's open-source tiktoken library is the usual tool for counting tokens for OpenAI models such as GPT-4. For Claude, Anthropic provides a token counting API. As a rough guide, a 10-page PDF is about 5,000-8,000 tokens depending on content density, and a full novel is around 100,000-150,000 tokens.
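
A minimal sketch of counting tokens with tiktoken follows. It assumes the `tiktoken` package is installed and can load its encoding data; if either step fails, the hypothetical `count_tokens` helper falls back to the rough 4-characters-per-token estimate, so the fallback number is an approximation rather than a real count.

```python
def count_tokens(text, model="gpt-4"):
    """Count tokens with tiktoken if available, else fall back to a rough estimate."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # Assumption: tiktoken is missing or could not load its encoding files.
        # Fall back to ~4 characters per token, rounded up.
        return -(-len(text) // 4)


print(count_tokens("Tokenization splits text into subword pieces."))
```

The count is specific to the tokenizer of the named model, so the same text can produce a different number for a different model family.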

Token efficiency and model performance

Some techniques are token-inefficient by design: chain-of-thought prompting produces many intermediate reasoning tokens before the final answer. Reasoning models such as OpenAI's o1 generate additional reasoning tokens before answering, so they use more tokens per task but tend to achieve better accuracy on hard problems.
