Let’s Build It: A Complete Guide to Tokenization in LLMs by Building the GPT Tokenizer

Posted on October 19, 2025 by Mark Harrell

The Hidden Force Behind AI's Language Understanding

Ever asked ChatGPT to count the letters in a word and gotten the wrong answer? Or noticed it struggles with simple spelling tasks? You're not imagining things. The culprit isn't the neural network itself…it's something far more fundamental happening before the AI even “sees” your text.

Welcome to the world of tokenization, the invisible preprocessing step that shapes how every large language model understands language. Think of it as the translation layer between human text and machine understanding, and it's filled with surprising quirks that explain many of AI's strangest behaviors.

Here's what makes this fascinating: while you see the word “strawberry,” GPT might see it as three separate chunks. When you type a number like “127,” the model might process it as a single unit, but “677” gets split into pieces. This isn't random…it's the result of a training process that determines how text gets broken down into tokens, the fundamental units that LLMs actually work with.

In this guide, we'll build a complete tokenizer from scratch, exploring the same Byte Pair Encoding algorithm that powers GPT-4. You'll understand why AI sometimes can't spell, why it's worse at languages other than English, and why certain prompts cause bizarre behaviors. More importantly, you'll learn how recent breakthroughs like Meta's tokenizer-free models in late 2024 might finally solve these problems.

By the end, you'll have built your own GPT-style tokenizer and gained insights that give you an edge in working with AI systems.

From Text to Numbers: Understanding the Basics

Computers don't see letters the way you do. When you read “Hello World,” you perceive individual characters forming words with meaning. A computer sees this as a sequence of numbers…nothing more.

This creates our first challenge: how do we convert text into numbers that a language model can process?

The simplest approach treats each character as a separate token and uses its character code as the number. The letter 'h' becomes 104, 'e' becomes 101, and so on (uppercase 'H' is 72). With this method, “Hello World” transforms into the sequence [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]. Each position in this list represents one character.
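To make the character-level idea concrete, here is a minimal Python sketch. It uses the built-in ord and chr functions, which map between characters and Unicode code points; the variable names are only illustrative.

```python
# Character-level "tokenization": one integer per character.
text = "Hello World"

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in text]
print(code_points)
# [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

# chr() is the inverse, so the mapping is lossless.
restored = "".join(chr(n) for n in code_points)
print(restored == text)  # True
```

Note that the sequence is exactly as long as the text itself, which is the inefficiency the next paragraphs deal with.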

But here's the problem: this is incredibly inefficient. A typical book might contain hundreds of thousands of characters. Processing each one individually means the language model must handle enormous sequences, burning through its limited attention span on individual letters rather than meaningful chunks of text.

We need something smarter…a way to compress these sequences while preserving all the information.

Enter Unicode and UTF-8. Unicode provides a standardized way to represent roughly 150,000 characters across 161 different scripts…everything from English letters to Chinese characters to emoji. Each character gets a unique code point, a number that identifies it universally.

UTF-8 then converts these code points into bytes, the actual binary data computers work with. Simple characters like ‘A' become a single byte, while complex characters like ‘安' might become three bytes. This variable-length encoding keeps things efficient while supporting every character humans use.
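A quick way to see the variable-length encoding is to encode a few characters in Python and look at the resulting bytes. This is just an illustration using the built-in str.encode method; the helper name utf8_bytes is made up for this example.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of byte values (0-255)."""
    return list(ch.encode("utf-8"))

for ch in ["A", "安", "😀"]:
    encoded = utf8_bytes(ch)
    # Show the character, its code point, and how many bytes UTF-8 needs.
    print(ch, hex(ord(ch)), encoded, f"{len(encoded)} byte(s)")
```

The ASCII letter needs one byte, the Chinese character needs three, and the emoji needs four, yet every value in every list stays in the range 0-255.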

But even UTF-8 gives us a vocabulary of only 256 possible values (the bytes 0-255). That's still too small and would create sequences that are far too long. We need a way to build larger, more meaningful chunks.

That's where Byte Pair Encoding comes in.

The Magic of Byte Pair Encoding

Byte Pair Encoding sounds complicated, but the core idea is beautifully simple: find the most common pairs of elements in your sequence, merge them into a new element, and repeat.

Imagine you have a sequence of letters: “aaabdaaabac”. Let's compress it step by step.

First, scan through and count which adjacent pairs appear most frequently. The pair “aa” shows up four times (counting overlapping positions), more than any other combination. So we create a new token…let's call it “Z”…to represent “aa”. Replace every occurrence, scanning left to right:

Original: a-a-a-b-d-a-a-a-b-a-c (11 characters)

After merging “aa” → “Z”: Z-a-b-d-Z-a-b-a-c (9 characters)

Now we've got a shorter sequence and a vocabulary that's grown by one. We repeat the process. This time “ab” appears twice, tied with “Za” for the most frequent pair; we break the tie in favor of “ab” and create token “Y” for it:

After merging “ab” → “Y”: Z-Y-d-Z-Y-a-c (7 characters)

One more round. The pair “ZY” appears twice, so we create token “X”:

After merging “ZY” → “X”: X-d-X-a-c (5 characters)

We've compressed an 11-character sequence down to 5 tokens, building a vocabulary of seven elements along the way. Each new token represents a frequently occurring pattern.
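The walkthrough above can be reproduced in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names count_pairs and merge_pair are invented for illustration, and the three merges are listed explicitly so the output follows the walkthrough even where pair counts tie.

```python
from collections import Counter

def count_pairs(seq):
    """Count every adjacent pair in the sequence (overlapping pairs included)."""
    return Counter(zip(seq, seq[1:]))

def merge_pair(seq, pair, new_symbol):
    """Replace each occurrence of `pair` with `new_symbol`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2  # skip both halves of the merged pair
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")
# The merges from the walkthrough, applied in order.
walkthrough_merges = [(("a", "a"), "Z"), (("a", "b"), "Y"), (("Z", "Y"), "X")]

history = []
for pair, symbol in walkthrough_merges:
    print("merging", pair, "seen", count_pairs(seq)[pair], "times")
    seq = merge_pair(seq, pair, symbol)
    history.append("".join(seq))

print(history)  # ['ZabdZabac', 'ZYdZYac', 'XdXac']
```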

This same algorithm works on bytes. Start with your UTF-8 encoded text as a sequence of bytes (256 possible values). Find the most common byte pair and merge it into a new token (number 256). Find the next most common pair and merge it (token 257). Continue until you've built a vocabulary of your desired size…typically around 50,000 to 100,000 tokens for modern LLMs.
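Here is a rough sketch of what that training loop might look like on bytes. The names train_bpe and merge are assumptions for this example, and ties between equally frequent pairs are broken by whichever pair appears first, which is one possible choice among several.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn (vocab_size - 256) merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, values 0-255
    merges = {}                       # (left_id, right_id) -> new token id
    for new_id in range(256, vocab_size):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:                # fewer than two tokens left: nothing to merge
            break
        pair = max(counts, key=counts.get)  # most frequent pair, first seen wins ties
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)
print(ids)
```

On a toy string like this the learned merges are not very interesting, but the same loop run over a large corpus is what produces word-like tokens.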

The beauty of BPE is that it automatically discovers common patterns in your training data. If you train on English text, common English words and word fragments become single tokens. If you include code, common programming patterns get their own tokens. The algorithm adapts to whatever text you feed it.

In practice, this means frequent words and word fragments like “the” or “ing” become single tokens, while rare words get split into multiple pieces. The model processes “hello” as one token but might break “antidisestablishmentarianism” into five or six chunks.

This compression is crucial. Language models have a fixed context window…they can only attend to a certain number of tokens at once. By compressing text into fewer, more meaningful tokens, we allow the model to process longer passages and capture more context.

GPT's Tokenization Evolution: From GPT-2 to GPT-4

OpenAI's journey from GPT-2 to GPT-4 involved significant refinements to tokenization. The core BPE algorithm remained the same, but the preprocessing and vocabulary grew more sophisticated.

GPT-2 introduced an important innovation: using regular expressions to enforce boundaries before BPE even starts. The naive BPE algorithm has a problem…it might merge “dog” with different punctuation marks to create separate tokens for “dog.”, “dog!”, and “dog?”. This wastes vocabulary space and forces the model to learn that these all represent the same base word.

To prevent this, GPT-2 applies a complex regex pattern that splits text into categories: letters, numbers, punctuation, and whitespace. BPE then operates within each category but never across categories. The word “dog” can merge with other letters, but not with the period that follows it.

The pattern looks intimidating:

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

Let's break down what this does. It first matches common contractions like “'s” or “'t”. Then it matches optional space plus one or more letters, optional space plus numbers, or optional space plus punctuation. The final parts handle whitespace in specific ways.

When you type “Hello world how are you?”, this pattern splits it into chunks: [“Hello”, “ world”, “ how”, “ are”, “ you”, “?”]. Each chunk gets tokenized separately, then the results are concatenated.
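You can try the splitting step yourself. One caveat: Python's standard re module does not support the \p{L} and \p{N} classes, so the sketch below substitutes rough standard-library equivalents ([^\W\d_] for letters, \d for digits). It is an approximation of the GPT-2 pattern for experimenting, not the exact pattern, and the names GPT2_LIKE and pre_split are made up here.

```python
import re

# Approximation of the GPT-2 split pattern using only stdlib character classes.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + letters (stand-in for \p{L}+)
    r"| ?\d+"                    # optional space + digits (stand-in for \p{N}+)
    r"| ?(?:[^\s\w]|_)+"         # optional space + punctuation and symbols
    r"|\s+(?!\S)|\s+"            # whitespace handling
)

def pre_split(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return GPT2_LIKE.findall(text)

print(pre_split("Hello world how are you?"))
# ['Hello', ' world', ' how', ' are', ' you', '?']
print(pre_split("dog. dog! dog?"))
```

Because “dog” and the punctuation after it always land in different chunks, no merge can ever glue them together.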

GPT-4 refined this further with an improved pattern: (?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|…

The key improvements? Case-insensitive matching for contractions (so “DON'T” and “don't” both work correctly), better handling of newlines, and a cap of three digits per number chunk. This last change keeps long digit strings from merging into arbitrary multi-digit tokens: a number like 1234567 is pre-split into 123, 456, and 7 before BPE runs, which makes number tokenization more regular.

The vocabulary also expanded dramatically. GPT-2 used about 50,000 tokens, while GPT-4 jumped to over 100,000. This allows for more efficient compression and better handling of diverse content…code, multiple languages, technical terms, and more.

These improvements show in practice. The same Python code that GPT-2 struggled with gets tokenized much more efficiently in GPT-4, with whitespace properly merged and fewer tokens overall. Better tokenization translates directly to better model performance.

Why Tokenization Causes Strange AI Behaviors

Now we can finally explain those weird AI quirks you've encountered.

The spelling problem: When you ask GPT-4 “How many L's are in 'DefaultCellStyle'?”, it confidently gives the wrong answer. Why? Because “DefaultCellStyle” is a single token in GPT-4's vocabulary. The model never sees the individual letters…it just sees token 98518. Trying to count letters in a single indivisible unit is like asking you to count the individual atoms in a baseball.

When the same question is asked differently…“First, write out each letter separated by spaces, then count the L's”…the model succeeds. Spelling the word out forces it into individual character tokens, so the letters become visible and can be counted accurately.
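If you are building prompts programmatically, you can do the spelling-out step for the model. The sketch below is one possible way to do it; spell_out and the prompt wording are invented for illustration, and the final count is computed in Python only so you have a ground truth to compare the model's answer against.

```python
def spell_out(word):
    """Insert a space between characters so they tend to tokenize separately."""
    return " ".join(word)

word = "DefaultCellStyle"
spelled = spell_out(word)

# Hypothetical prompt text; adjust the wording to your own use case.
prompt = (
    f"Here is a word spelled out letter by letter: {spelled}\n"
    "How many times does the letter L appear (upper or lower case)?"
)
print(prompt)

# Ground truth for checking the model's answer.
expected_count = word.lower().count("l")
print(expected_count)  # 4
```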

The multilingual gap: Type “Hello how are you?” and count the tokens: five. Now try the Korean equivalent “안녕하세요 어떻게 지내세요?”: fifteen tokens. The same meaning takes three times as many tokens in Korean.

This happens because GPT's tokenizer trained primarily on English text. English words merged frequently during training, becoming efficient single tokens. Korean characters appeared less often, so they remained split into multiple pieces. This makes Korean text “bloated” in token space, consuming more of the model's limited context window and forcing it to work harder to extract the same meaning.
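Part of the gap is visible even before any merging happens. The sketch below only counts characters and raw UTF-8 bytes (it does not call a real tokenizer, so it says nothing about final token counts), but it shows that the Korean sentence starts out with far more bytes for BPE to merge. The helper name byte_stats is an assumption.

```python
import unicodedata

english = "Hello how are you?"
# Normalize to NFC so each Hangul syllable is a single code point.
korean = unicodedata.normalize("NFC", "안녕하세요 어떻게 지내세요?")

def byte_stats(text):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

print("English:", byte_stats(english))  # (18, 18)
print("Korean: ", byte_stats(korean))   # (15, 39)
```

Each Hangul syllable takes three bytes, so a byte-level tokenizer needs several well-trained merges per syllable just to get back to one token per character.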

Arithmetic struggles: Addition works at the digit level…you add ones, then tens, then hundreds. But numbers tokenize unpredictably. The number “127” might be one token, while “677” splits into two. Different four-digit numbers might be 1, 2, 3, or 4 tokens depending on arbitrary patterns learned during tokenization training.

This inconsistency makes arithmetic genuinely difficult for LLMs. The model must learn to add numbers that appear in completely different formats…sometimes seeing all four digits at once, sometimes seeing them split across multiple tokens. While models manage this reasonably well through sheer training effort, the underlying tokenization makes the task much harder than it needs to be.

The mysterious SolidGoldMagikarp: This might be the strangest tokenization issue ever discovered. When researchers asked GPT-3 simple questions about certain tokens like “SolidGoldMagikarp” or “petertodd,” the model completely broke down…evading questions, hallucinating, even becoming insulting.

The explanation? These were Reddit usernames that appeared frequently in the tokenization training data but never appeared in the language model's training data. They became tokens (due to high frequency in tokenization data) but those tokens never got activated during model training. They remained completely untrained…essentially unallocated memory in the model. Using them at test time fed random, untrained values into the transformer, producing undefined behavior.

Alternative Approaches: SentencePiece and Beyond

Not all tokenizers work the same way. While GPT models use byte-level BPE, other popular models take different approaches.

SentencePiece, used by models such as Llama 2 and Mistral 7B, works at the Unicode code point level rather than the byte level. Instead of encoding text to UTF-8 bytes first, it directly applies BPE to the Unicode characters themselves. Only rare characters that don't appear in the vocabulary fall back to byte-level encoding.

This has interesting implications for multilingual support. In GPT's approach, all languages get encoded to bytes first, then BPE merges those bytes. Every language is treated equally at the fundamental level. SentencePiece, on the other hand, can keep entire Chinese or Japanese characters as single tokens before BPE even begins, potentially making it more efficient for those languages.

The trade-offs matter for model designers. GPT's byte-level approach is universal and consistent…every character follows the same path from bytes to tokens. SentencePiece is potentially more efficient for character-rich languages but introduces more complexity in how different types of content get handled.

SentencePiece also supports multiple training algorithms beyond BPE, including unigram language models. It's a Swiss Army knife approach…more features and flexibility, but with more complexity and historical baggage from earlier NLP tasks.

Both approaches have proven effective at scale. The choice between them reflects different design philosophies and priorities rather than one being strictly superior.

The Future: Tokenizer-Free Models

What if we could skip tokenization entirely?

In December 2024, Meta's research team unveiled the Byte Latent Transformer (BLT), a breakthrough architecture that processes raw bytes directly without any tokenization step. Instead of fixed token boundaries determined during preprocessing, BLT dynamically groups bytes into patches during inference.

The key innovation is a hierarchical structure that makes processing raw byte sequences computationally tractable. Earlier attempts at tokenizer-free models struggled because attention mechanisms become extremely expensive with very long sequences. BLT addresses this by organizing the transformer in a way that can handle byte-level inputs efficiently.

The results are impressive: up to 50% fewer FLOPs (floating-point operations) at inference time compared to tokenization-based models, while matching their performance on language understanding tasks in the reported experiments. This suggests we can eliminate the tokenization bottleneck without sacrificing quality.

Why does this matter? Tokenizer-free models could address most of the problems we've discussed. No more spelling issues from chunked words. No more multilingual gaps from training data bias in tokenization. No more unallocated token embeddings causing bizarre behaviors. No more edge cases with partial tokens or trailing whitespace.

The model would see text exactly as it appears…byte by byte…and learn its own optimal way to process those bytes. This is conceptually cleaner and eliminates an entire class of potential issues.

However, BLT and similar approaches are still early-stage research. They haven't been validated at the massive scale of GPT-4 or Llama 3, and there may be challenges that only emerge at larger sizes. But the direction is promising, and 2025 may see significant progress toward tokenizer-free language models becoming mainstream.

Building Your Own Tokenizer: A Practical Guide

Ready to build your own GPT-style tokenizer? Here's the practical roadmap.

The minbpe repository provides an excellent learning path. Start with a BasicTokenizer that implements pure BPE without any preprocessing. Train it on a text file and watch what tokens emerge. This helps build intuition for how BPE discovers patterns.

Next, add regex preprocessing to create a RegexTokenizer. Implement GPT-4's pattern that splits text before applying BPE. Compare the vocabularies…you'll see how preventing cross-category merges creates more sensible tokens.

The tricky part is loading GPT-4's actual merges. The published tokenizer applies a byte shuffle before BPE, so you need to reverse-engineer this shuffle to recover the original merge sequence. The minbpe repository includes helper functions for this.

Finally, add special token support. These bypass normal BPE processing, getting directly swapped when encountered in text. Implement the allowed_special parameter that controls which special tokens get processed versus treated as regular text.
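As a rough illustration of how special tokens can bypass normal encoding, here is a simplified sketch. It is not the minbpe or tiktoken implementation: encode_ordinary is a stand-in that just returns raw bytes, the allowed_special handling is reduced to a plain set, and the ID 100257 for <|endoftext|> is used as an example value.

```python
import re

# Example special-token table (illustrative).
SPECIAL_TOKENS = {"<|endoftext|>": 100257}

def encode_ordinary(text):
    """Stand-in for a real BPE encoder: raw UTF-8 bytes, no merges applied."""
    return list(text.encode("utf-8"))

def encode(text, allowed_special=frozenset()):
    """Encode text, swapping in special-token IDs only for explicitly allowed tokens."""
    special = {s: i for s, i in SPECIAL_TOKENS.items() if s in allowed_special}
    if not special:
        # Nothing allowed: special-token syntax is treated as ordinary text.
        return encode_ordinary(text)
    # The capturing group keeps the special tokens in the split result.
    pattern = "(" + "|".join(re.escape(s) for s in special) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in special:
            ids.append(special[chunk])          # direct swap, no BPE
        else:
            ids.extend(encode_ordinary(chunk))  # normal path
    return ids

print(encode("hi<|endoftext|>", allowed_special={"<|endoftext|>"}))
# [104, 105, 100257]
print(len(encode("hi<|endoftext|>")))  # 15: every character encoded as plain bytes
```

Defaulting to an empty allowed set means text from users is never interpreted as a control token unless the caller opts in.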

Common pitfalls? Not handling the case where your text becomes a single token or empty (causes errors when looking for pairs). Forgetting that encoding and decoding are not symmetric (some token sequences aren't valid UTF-8). Mishandling the iteration order of merges (later merges depend on earlier ones being applied first).
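The three pitfalls map onto a few specific lines of code. The sketch below shows one way to handle them, assuming a merges dictionary of the form {(left_id, right_id): new_id}; the function names are illustrative and the tiny merge table is hand-written for the demo.

```python
def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    # Pitfall 1: with zero or one token there are no pairs to look at.
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pitfall 3: apply merges in the order they were learned (lowest id first).
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left that can be merged
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    raw = b"".join(vocab[i] for i in ids)
    # Pitfall 2: arbitrary token sequences may not be valid UTF-8.
    return raw.decode("utf-8", errors="replace")

demo_merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
encoded = encode("aaabdaaabac", demo_merges)
print(encoded)                       # [258, 100, 258, 97, 99]
print(decode(encoded, demo_merges))  # aaabdaaabac
print(decode([128], demo_merges))    # a lone continuation byte becomes U+FFFD
```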

The exercise typically takes a few hours if you work through it systematically. You'll end up with a tokenizer that produces identical output to OpenAI's tiktoken library…a complete, working implementation you built from scratch.

Practical Implications and Best Practices

Understanding tokenization has immediate practical applications.

Token efficiency matters for costs: Every API call to GPT-4 is priced per token…both input and output. A 1000-word document might be 1500 tokens, and you pay for each one. Choosing efficient representations saves money at scale.

YAML is significantly more token-efficient than JSON for structured data. The same information that takes 214 tokens in JSON might only need 99 tokens in YAML. In production systems handling thousands of API calls daily, this adds up quickly.

Prompt optimization: Knowing how tokenization works helps you craft better prompts. If you need the model to process individual characters, explicitly spell them out separated by spaces rather than asking it to “analyze” a single token. If you're working with multilingual content, be aware that non-English text consumes more tokens.

Security considerations: Special tokens create an attack surface. User-provided input should never be parsed for special tokens…that's attacker-controlled text that could inject end-of-sequence markers or other special tokens to confuse the model. Always sanitize or escape special token syntax in user inputs.

Context window planning: When designing systems that maintain conversation history, remember that every previous message consumes tokens from your context window. Long conversations eventually exceed the limit, requiring strategies like summarization or selective history retention.
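Selective history retention can be as simple as keeping the most recent messages that fit a token budget. The sketch below is one possible approach under loud assumptions: count_tokens is a hypothetical stand-in that counts whitespace-separated words, and in a real system you would replace it with your model's actual tokenizer.

```python
def count_tokens(text):
    """Hypothetical stand-in: whitespace word count, NOT a real tokenizer."""
    return len(text.split())

def trim_history(messages, budget):
    """Keep the newest messages whose combined token count fits within `budget`."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                    # older messages no longer fit
        kept.append(msg)
        used += cost
    return list(reversed(kept))      # restore chronological order

messages = ["one two three", "four five", "six"]
print(trim_history(messages, budget=3))  # ['four five', 'six']
```

Dropping the oldest messages is the bluntest strategy; summarizing them into a single short message is a common refinement.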

Understanding these details transforms you from someone who uses AI to someone who uses it effectively.

Mastering the Foundation

Tokenization sits at the foundation of every large language model. It's not the glamorous part…neural architectures get the headlines…but it profoundly shapes how models understand and generate text.

Every quirk you've noticed in ChatGPT's behavior, every surprising failure at simple tasks, every inconsistency across languages…many of these trace back to tokenization. It's the invisible layer that translates between human text and machine understanding, and its design decisions ripple through everything the model does.

The good news? You now understand this foundation deeply. You know why models struggle with spelling, why they favor English, why arithmetic is hard, and why certain prompts cause bizarre behaviors. You understand the trade-offs in vocabulary size, the role of special tokens, and the differences between major tokenization libraries.

More importantly, you've learned that this might all be temporary. Tokenizer-free models like Meta's BLT suggest we might eliminate this preprocessing step entirely within the next few years, moving to architectures that process raw bytes directly.

Until then, tokenization remains essential knowledge for anyone working seriously with LLMs. The developers who understand this layer…who can debug tokenization issues, optimize prompts for token efficiency, and avoid tokenization-related pitfalls…have a genuine advantage.

As we look ahead to 2025 and beyond, tokenization represents both a solved problem (we know how to do it reliably at scale) and an opportunity (we might be able to eliminate it entirely). Either way, understanding how it works today gives you the foundation to understand whatever comes next.

The root of suffering may be tokenization, but knowledge of tokenization brings mastery.