Why Claude Generates Text One Token at a Time: The LEGO Brick Explanation

Language models like Claude don't think the way humans do; they generate text one token at a time, calculating probabilities for the next piece in a sequence. A token is the standardized unit that models work with, roughly equivalent to four characters or three-quarters of a word in English. Before Claude processes anything you write, every word, space, and punctuation mark converts into a sequence of integers that the model can understand.

What Are Tokens and Why Do They Matter?

Tokens are the building blocks of language models, similar to standardized LEGO pieces that snap together one at a time. A token isn't always a full word; it can be an entire word like "hello" (one token), a chunk of a word like part of "tokenization" (several tokens), or even a single character. The rule of thumb for English is that one token equals roughly four characters or about three-quarters of a word.

Most large language models, including Claude, use an algorithm called Byte Pair Encoding (BPE) to build their vocabulary. This process starts with 256 possible byte values, scans billions of training texts, identifies the most frequent byte pairs, merges them into new tokens, and repeats. The result is a vocabulary ranging from approximately 100,000 to 260,000 tokens depending on the model.
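The merge loop at the heart of BPE can be sketched in a few lines. This is a toy illustration of the idea, not any production tokenizer: it starts from raw bytes and repeatedly fuses the most frequent adjacent pair into a new token ID.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from raw bytes (256 possible values), then merge the most
# frequent adjacent pair repeatedly -- the core loop of BPE.
text = b"low lower lowest"
tokens = list(text)
vocab_size = 256
for _ in range(3):  # a real tokenizer runs many thousands of merges
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair, vocab_size)
    vocab_size += 1

print(len(text), "bytes ->", len(tokens), "tokens after 3 merges")
```

After three merges the 16-byte string compresses into 8 tokens; frequent fragments like "lo" and "low" have become single vocabulary entries, which is exactly why common English words end up as one token each.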

Here's where language matters significantly. The training corpus for these models is dominated by English, so common English words like "the," "and," and "great" become single tokens, while words in other languages get fragmented into smaller chunks. A study presented at NeurIPS 2023 measured what researchers called the "tokenization premium" across languages: Portuguese consumes roughly 1.48 times more tokens than English under GPT-4's tokenizer, and even with newer tokenizers like GPT-4o's, Portuguese still requires approximately 1.3 to 1.4 times as many tokens.

How Does the Context Window Limit What Models Can Process?

If tokens are the pieces, the context window is the desk where the model builds its response. It has a fixed size, and everything must fit on it: your instructions, conversation history, reference files, and the response the model is constructing. When the desk fills up, that's it. The model doesn't remember anything left off the surface.

Frontier models have converged on context windows of up to 1 million tokens. For scale, 1 million tokens is roughly 750,000 words in English, equivalent to about 8 to 10 books. For Portuguese, the tokenization tax shrinks that to around 500,000 words, or about 7 books.
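The arithmetic above can be captured in a back-of-envelope estimator. The constants are the rules of thumb from this article, and the 1.35x Portuguese multiplier is an illustrative assumption drawn from the 1.3 to 1.4x range, not a measured value.

```python
# Rules of thumb from the text: ~0.75 English words per token,
# plus a per-language tokenization premium.
PREMIUM = {"english": 1.0, "portuguese": 1.35}  # 1.35 is an assumed midpoint
WORDS_PER_TOKEN_EN = 0.75

def words_that_fit(context_tokens, language="english"):
    """Estimate how many words fit in a context window of a given size."""
    return int(context_tokens * WORDS_PER_TOKEN_EN / PREMIUM[language])

print(words_that_fit(1_000_000))                 # ~750,000 English words
print(words_that_fit(1_000_000, "portuguese"))   # ~555,000 Portuguese words
```

The estimate lands near the article's figures: the same 1-million-token desk holds noticeably fewer Portuguese words simply because each word costs more tokens.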

However, having a large context window doesn't mean models use all of it effectively. Recent research shows that models' ability to pay attention drops as context grows, especially for information positioned in the middle of the text, a phenomenon called "lost in the middle." The NoLiMa benchmark from ICML 2025 showed that most large language models fail more than half the time when they need to locate specific information in contexts beyond 32,000 tokens.

How Do Language Models Actually Generate Text?

The generation process is surprisingly straightforward. The model looks at everything already on the desk, calculates a probability distribution over the entire vocabulary (between approximately 100,000 and 260,000 possible pieces) to decide which one fits best in the sequence, places one, and repeats. One at a time, from beginning to end of the response. There's no master plan. This is called autoregressive generation, and it's the core mechanic of the Transformer architecture published in 2017.
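That loop can be sketched directly. The hypothetical `next_token_distribution` below is a stand-in for the model's forward pass: a real model computes a distribution over 100,000+ tokens on every step, while this toy uses a hand-written lookup table.

```python
def next_token_distribution(context):
    """Stand-in for a model's forward pass: maps a context to a
    probability distribution over a toy vocabulary (an assumption,
    not a real model)."""
    table = {
        (): {"The": 0.9, "A": 0.1},
        ("The",): {"cat": 0.6, "dog": 0.4},
        ("The", "cat"): {"sat": 0.7, "ran": 0.2, "<end>": 0.1},
        ("The", "cat", "sat"): {"<end>": 1.0},
    }
    return table[context]

def generate(prompt=(), max_tokens=10):
    """Autoregressive loop: pick a token, append it, repeat."""
    context = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(tuple(context))
        token = max(dist, key=dist.get)  # greedy decoding: most likely token
        if token == "<end>":
            break
        context.append(token)
    return context

print(generate())  # ['The', 'cat', 'sat']
```

Note that each choice conditions on everything chosen so far, and nothing else: the loop has no representation of where the sentence is ultimately headed.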

Each piece placed depends on all the ones before it: both the original input and what the model has already built. That's why responses sometimes start well and derail halfway through. The model doesn't know where it will end up when it starts generating. The mechanism is the same one behind autocomplete in code editors, just at a vastly larger scale. GPT-2, released in 2019, had 1.5 billion parameters and a tiny context window. Modern models like Claude operate at a completely different scale, with context windows a thousand times larger, enabling them to build far more complex responses.

Steps to Understanding How Claude Processes Your Input

  • Tokenization: Your text converts into a sequence of integers before the model processes anything, with each token representing roughly four characters in English or fewer in other languages.
  • Context Window Placement: Everything you send (your question, files, conversation history) and everything the model replies with must fit together on the same fixed-size surface, with the response already reserving a portion of available space.
  • Attention Mechanism: For each new token the model generates, it "lights up" the preceding tokens that carry the most weight for that decision and "dims" the ones that don't matter, using a process called self-attention.
  • Sequential Generation: The model predicts one token at a time based on probabilities calculated from all previous tokens, building the response piece by piece without knowing the final outcome in advance.
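The attention step in the list above can be made concrete with a minimal sketch of scaled dot-product attention for a single query, in pure Python. This omits the learned projection matrices and multiple heads of a real Transformer; it only shows how scores become the "lighting up" weights.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each score measures how strongly the query "lights up" a
    preceding token; the output is a weighted mix of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Toy vectors: the query aligns with the first key, so the first
# token receives more attention weight than the second.
out, weights = attention([1.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[1.0, 2.0], [3.0, 4.0]])
print(weights)
```

The weights always sum to 1, so attention is a budget: giving more weight to one token necessarily dims the others.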

Why Does Claude Sometimes Get Things Wrong?

Understanding how models generate text explains why they sometimes fail with unsettling confidence. The model places one piece at a time based on probability distributions, but it doesn't have a master plan or the ability to look ahead. If the input is "I need to check the bank by the river," the model must determine whether "bank" means the riverbank or a financial institution. The answer lies in the attention mechanism, introduced in the paper "Attention Is All You Need," which is the heart of the Transformer architecture powering every modern large language model.

The attention mechanism works like a building manual that shows more than just the next step. For each new piece, it highlights which parts of the construction matter for that decision: the foundation lights up bright because it supports everything, nearby towers glow because they define the pattern, and the garden on the other side stays dim because it's irrelevant right now. Attention does exactly this for each token, "lighting up" the preceding ones that carry the most weight and "dimming" the ones that don't matter.
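The "bank" example can be illustrated with a toy calculation. The relevance scores below are hand-picked assumptions standing in for what a trained model learns; the point is only the mechanism: a softmax over scores concentrates weight on the disambiguating word.

```python
import math

# Hand-picked relevance scores of "bank" toward its context words
# (assumptions for illustration, not learned attention weights).
scores = {"I": 0.1, "need": 0.1, "to": 0.0, "check": 0.3,
          "the": 0.0, "by": 0.2, "river": 2.5}

exps = {w: math.exp(s) for w, s in scores.items()}
total = sum(exps.values())
weights = {w: e / total for w, e in exps.items()}

# "river" receives the largest share of attention, steering the
# model toward the riverbank sense of "bank".
top = max(weights, key=weights.get)
print(top, round(weights[top], 2))
```

Because "river" dominates the weight budget, the representation of "bank" at this step is pulled toward the riverbank meaning before the next token is ever chosen.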

This sequential, probabilistic approach means models can confidently generate plausible-sounding but incorrect information. They're optimizing for the next most likely token, not for factual accuracy. Combined with the "lost in the middle" phenomenon where models struggle to locate information in large contexts, these limitations explain why Claude and other models sometimes produce errors despite appearing authoritative.