Tokenization
The process of splitting text into tokens - the smallest units an LLM processes - which are usually subword pieces, not full words.
LLMs don't process raw text. Before any inference, text is converted into a sequence of integers using a tokenizer. Each integer maps to a "token" - typically a subword fragment, full word, or punctuation character, depending on frequency in the training corpus.
How tokenizers work
Most modern LLMs use subword tokenizers, most commonly Byte Pair Encoding (BPE), often implemented with a library such as SentencePiece, with a vocabulary learned from a large text corpus. Common words ("the", "is", "of") become single tokens. Rare words and technical terms are split into fragments: "tokenization" might become ["token", "ization"], while a short, frequent string such as "GPT" may stay whole as ["GPT"]. Code identifiers, URLs, and non-English text are often encoded less efficiently - they use more tokens per character.
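To make the merging idea concrete, here is a minimal sketch of character-level BPE training on a toy corpus. It is an illustration only, not the tokenizer of any particular model: production tokenizers work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges. The function names (train_bpe, merge_pair, encode) and the toy corpus are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): "token" is frequent, "tokenization" is rare.
toy_words = ["token"] * 5 + ["tokens"] * 3 + ["tokenization"] * 2
toy_merges = train_bpe(toy_words, num_merges=4)
print(encode("token", toy_merges))
print(encode("tokenization", toy_merges))
```

With only four merges, the frequent string "token" collapses into a single symbol, while the rare suffix of "tokenization" remains split into individual characters. This is the frequency effect described above, at a very small scale.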
Why token count matters
API pricing is based on tokens, not characters or words. Context windows and output length limits are measured in tokens as well. A rough rule for English: 1 token is about 0.75 words, or about 4 characters. Code and non-English text typically need more tokens for the same amount of content, so the rule underestimates their cost.
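The rule of thumb can be turned into a quick estimator. The sketch below averages the character-based and word-based estimates and derives a cost from per-million-token prices. The function names and the prices in the usage example are hypothetical placeholders, not any provider's actual rates, and the estimate is only meaningful for English prose.

```python
import math

CHARS_PER_TOKEN = 4.0   # rule of thumb for English text
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text


def estimate_tokens(text):
    """Rough token estimate: average of the character and word heuristics."""
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    return math.ceil((by_chars + by_words) / 2)


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost of one request, given prices per one million tokens."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


prompt = "Summarize the following report in three bullet points."
prompt_tokens = estimate_tokens(prompt)
# Placeholder prices (assumed for illustration only).
cost = estimate_cost(prompt_tokens, 200,
                     input_price_per_million=2.0,
                     output_price_per_million=8.0)
print(prompt_tokens, cost)
```

For billing or hard context limits, an estimate like this should be replaced by a count from the model's actual tokenizer.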
Practical token estimation
OpenAI's tiktoken library (open-source) is the standard tool for counting GPT-4 tokens. For Claude, Anthropic provides a token counting API. A 10-page PDF is roughly 5,000-8,000 tokens depending on content density. A full novel is around 100,000-150,000 tokens.
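A minimal sketch of exact counting with tiktoken is shown below. It assumes the cl100k_base encoding, which is the one used by GPT-4, and falls back to the 4-characters-per-token heuristic when tiktoken is not installed or the encoding cannot be loaded. The helper name count_tokens is hypothetical, and the fallback is an assumption added for convenience, not part of tiktoken.

```python
import math


def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken if possible, else estimate from characters."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Broad on purpose: tiktoken may be missing, the encoding file may
        # fail to download, or the text may contain special-token strings.
        # Fall back to the rough English heuristic of 4 characters per token.
        return math.ceil(len(text) / 4)


sample = "Tokenization splits text into subword pieces."
print(count_tokens(sample))
```

Counts for other model families differ because each uses its own vocabulary, so a tiktoken count should not be reused as a count for Claude or other non-OpenAI models.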
Token efficiency and model performance
Some techniques are token-inefficient by design: chain-of-thought prompting produces many intermediate reasoning tokens before the final answer. Reasoning models such as o1 generate this kind of intermediate reasoning before answering, so they use more tokens per task, but they tend to achieve better accuracy on hard problems.