How LLMs Actually Work

A tokenizer takes a string and produces a sequence of integers, where each integer points to an entry in a fixed vocabulary.

Whole-word vocabularies are too big and don't generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch.

Subword tokenizers split the difference: the most common pieces become single tokens, and rare or novel words get composed from smaller pieces.
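To make that concrete, here is a minimal sketch of subword tokenization. It uses greedy longest-match over a tiny made-up vocabulary, which is a simplification: real tokenizers such as BPE learn their pieces from data. The names VOCAB, tokenize, and detokenize are illustrative and not taken from any library.

```python
# Hypothetical toy vocabulary: a few common pieces, then single letters as a fallback.
PIECES = ["the", "cat", "un", "believ", "able", " "]
VOCAB = PIECES + list("abcdefghijklmnopqrstuvwxyz")
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def tokenize(text):
    """Turn a string into vocabulary indices, always taking the longest matching piece."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):  # try the longest candidate first
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def detokenize(ids):
    """Map indices back to their pieces and join them."""
    return "".join(VOCAB[i] for i in ids)


print(tokenize("the cat"))       # [0, 5, 1]  common words are single tokens
print(tokenize("unbelievable"))  # [2, 3, 4]  built from "un" + "believ" + "able"
print(len(tokenize("zebra")))    # 5          an unseen word falls back to letters
```

The fallback to single letters is what lets a fixed vocabulary cover words it has never seen, at the cost of longer sequences.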

In a transformer, each token becomes a vector so the model can do math with it.

The model stores an embedding table: a matrix with one row per vocabulary entry. When the tokenizer hands the model an integer, the model looks up the matching row and uses that vector instead. That vector is the token's embedding. It's the model's representation of what that token "means," learned during training.
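The lookup itself is just indexing. The sketch below assumes a tiny table with random numbers standing in for learned values; VOCAB_SIZE, EMBED_DIM, and embed are made-up names for illustration.

```python
import random

VOCAB_SIZE = 8  # assumed toy sizes; real models use tens of thousands of rows
EMBED_DIM = 4   # and hundreds or thousands of columns

rng = random.Random(0)
# One row per vocabulary entry. Random values stand in for learned weights.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token id with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 5, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same token id, same vector
```

Note that token 3 gets the identical vector at both positions, since the lookup knows nothing about where a token sits.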

The interesting property of these embeddings is that semantically similar tokens end up with similar vectors.
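"Similar" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a small sketch; the three vectors are hand-picked for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", chosen by hand to show the idea.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "carburetor": [-0.2, 0.1, 0.95],
}

dog_vs_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
dog_vs_carburetor = cosine_similarity(toy_embeddings["dog"], toy_embeddings["carburetor"])
print(dog_vs_puppy > dog_vs_carburetor)  # True
```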

The embedding alone says nothing about where the token sits in the sequence. The vector for "dog" is the same vector whether "dog" is the first word in your prompt or the fifth. That's a problem, because word order carries meaning.

That's the gap positional encoding fills.

A widely used approach is Rotary Position Embeddings (RoPE), introduced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight families. The intuition: instead of adding position info to each token's vector, RoPE rotates the Query and Key vectors by an angle that depends on the token's position.

RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates Query and Key vectors so relative distance shows up during attention.
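A short sketch can show why rotation makes relative distance show up. It rotates consecutive pairs of dimensions by position-dependent angles, with the usual base of 10000 assumed; production implementations lay the pairs out differently and work on whole tensors. The vectors q and k are arbitrary examples.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to position."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # later pairs rotate more slowly
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]  # arbitrary example Query
k = [1.0, 0.4, -0.6, 0.9]  # arbitrary example Key

# Same distance (3 tokens apart) at two very different absolute positions.
score_near_start = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))
score_far_along = dot(rotate_pairs(q, 105), rotate_pairs(k, 102))
print(abs(score_near_start - score_far_along) < 1e-9)  # True
```

The two scores match because rotating both vectors by the same extra angle leaves their dot product unchanged; only the difference in positions survives.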

Inside attention, each token's vector is projected into three vectors:
The Query asks, "what am I looking for from other tokens?"
The Key says, "this is what I offer to tokens looking at me."
The Value carries, "this is what gets passed along when a match happens."

Matching happens through a similarity score. Each token's Query is compared against the Key of each token it is allowed to see, using a scaled dot product. Intuitively, this measures how much the two vectors line up.

Consider the sentence "The cat that I saw yesterday was sleeping." When the model processes "was," it needs to figure out what's doing the sleeping. The Query vector for "was" gets compared against the Key vectors of the tokens it is allowed to see. The dot product with "cat" is high, because the model has learned that verbs like "was" need a subject and that subjects like "cat" produce Key vectors that line up well. The dot product with "yesterday" is low. Softmax turns those scores into weights: "cat" gets a high weight and "yesterday" gets a low one. The model then takes a weighted sum of the corresponding Value vectors, so the Value for "cat" dominates the result. The new representation of "was" is now mostly shaped by the Value of "cat."
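The same steps can be sketched for a single Query. The two-dimensional Key and Value vectors below are invented so that "cat" lines up with the Query and "yesterday" does not; a trained model learns these vectors rather than having them written by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one Query over the visible tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


tokens = ["cat", "yesterday"]
keys = [[2.0, 0.0], [0.0, 2.0]]    # made-up Keys
values = [[1.0, 0.0], [0.0, 1.0]]  # made-up Values
query_was = [3.0, 0.0]             # made-up Query for "was"

weights, output = attend(query_was, keys, values)
print(weights[0] > weights[1])  # True: "cat" gets most of the weight
print(output[0] > output[1])    # True: the result is dominated by the Value of "cat"
```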

An induction head is an attention head that notices repeated patterns in the prompt and helps continue them.

Attention has one big cost. In full attention, each token compares against all the tokens it is allowed to see, so doubling the prompt length roughly quadruples the work. This is why long prompts are expensive to run.
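The arithmetic is easy to check. Assuming causal attention, where each token sees itself and everything before it, the number of Query-Key comparisons per head per layer can be counted directly (the function name is illustrative):

```python
def causal_comparisons(num_tokens):
    """Query-Key comparisons when token i looks at itself and the i - 1 tokens before it."""
    return num_tokens * (num_tokens + 1) // 2


short_prompt = causal_comparisons(1000)
long_prompt = causal_comparisons(2000)
print(short_prompt)                # 500500
print(long_prompt)                 # 2001000
print(long_prompt / short_prompt)  # just under 4
```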

An attention head is one independent attention pass with its own learned projections.

Each head runs its attention pass independently. Then the outputs of all the heads get concatenated and passed through a final linear layer that mixes them back into one full-size vector.
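Here is a rough sketch of that flow in plain Python: each head projects the input with its own matrices, attends causally, and then the head outputs are concatenated and mixed by one final matrix. The sizes and random weights are assumptions for illustration, and real implementations do all of this with batched tensor operations.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def head_output(x_seq, w_q, w_k, w_v):
    """One head: its own projections, then causal attention at every position."""
    queries = [matvec(w_q, x) for x in x_seq]
    keys = [matvec(w_k, x) for x in x_seq]
    values = [matvec(w_v, x) for x in x_seq]
    head_dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)  # only tokens at or before position i
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[j][d] for j, w in enumerate(weights)) for d in range(head_dim)]
        )
    return outputs


def multi_head(x_seq, heads, w_out):
    """Run every head independently, concatenate per position, then mix."""
    per_head = [head_output(x_seq, *head) for head in heads]
    mixed = []
    for pos in range(len(x_seq)):
        concatenated = [v for head in per_head for v in head[pos]]
        mixed.append(matvec(w_out, concatenated))
    return mixed


rng = random.Random(0)
MODEL_DIM, NUM_HEADS = 8, 2  # assumed toy sizes
HEAD_DIM = MODEL_DIM // NUM_HEADS
heads = [
    tuple(random_matrix(HEAD_DIM, MODEL_DIM, rng) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(NUM_HEADS)
]
w_out = random_matrix(MODEL_DIM, NUM_HEADS * HEAD_DIM, rng)
x_seq = [[rng.uniform(-1.0, 1.0) for _ in range(MODEL_DIM)] for _ in range(3)]

mixed = multi_head(x_seq, heads, w_out)
print(len(mixed), len(mixed[0]))  # 3 8: one full-size vector per token
```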

The model is never told what each head should do. Specialization emerges naturally during training. Researchers have found heads that track grammar (linking verbs to their objects, articles to their nouns), heads that figure out which pronoun refers to which name, heads that track positional patterns, induction heads, and many more.

Each head needs to keep its Key and Value vectors in memory for every token processed so far, prompt and generated alike, so that when a new token gets generated the model doesn't have to recompute everything from scratch. This is called the KV cache, and at long context lengths it is often the largest memory cost of running an LLM after the weights themselves.

Grouped-Query Attention lets multiple query heads share fewer key/value heads. That cuts KV-cache memory while keeping many query views.
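A back-of-the-envelope calculation shows the saving. The configuration below (32 layers, 32 query heads, head dimension 128, 16-bit values, 8192 tokens) is a hypothetical example, not the spec of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached Keys and Values of one sequence."""
    # The leading 2 counts one Key vector plus one Value vector
    # per token, per key/value head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2**30

# Every one of the 32 query heads has its own key/value head.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped-query: the same 32 query heads share 8 key/value heads.
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(full / GIB)     # 4.0
print(grouped / GIB)  # 1.0
```

The cache shrinks in proportion to the number of key/value heads, while the number of query heads is untouched.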

A non-linearity is a function that bends its input. The simplest one, ReLU, outputs zero for any negative number and passes positive numbers through unchanged.
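ReLU fits in one line. A minimal version operating on a list of numbers:

```python
def relu(values):
    """Zero out negatives, pass positives through unchanged."""
    return [max(0.0, v) for v in values]


print(relu([-2.0, -0.5, 0.0, 1.5]))  # [0.0, 0.0, 0.0, 1.5]
```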

Researchers have figured out how to directly edit some facts in a trained model without retraining it. Methods like ROME (Rank-One Model Editing) can change "the Eiffel Tower is in Paris" to "the Eiffel Tower is in Rome" by applying a targeted rank-one update to a specific FFN weight matrix.

Mixture of Experts means the model has several feed-forward networks and routes each token through only a few of them.

Mixtral 8x7B has 46.7 billion total parameters but uses about 12.9 billion per token. Sparse designs like this have become a common option for very large models because they let the parameter count keep growing without inference cost growing in proportion.
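Routing can be sketched in a few lines. This version picks the top 2 experts by router score and blends their outputs using a softmax over just those scores, which is one common choice. The "experts" here are trivial scaling functions standing in for feed-forward networks, and the router scores are hard-coded, whereas a real router computes them from the token's vector.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their scores into weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return chosen, weights


def moe_forward(x, experts, router_logits, k=2):
    """Run only the chosen experts and return their weighted sum."""
    chosen, weights = route_top_k(router_logits, k)
    output = [0.0] * len(x)
    for expert_index, weight in zip(chosen, weights):
        expert_out = experts[expert_index](x)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_scaling_expert(scale):
    # Stand-in for a feed-forward network: just multiplies its input.
    return lambda x: [scale * v for v in x]


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.0]  # hard-coded for illustration

chosen, weights = route_top_k(router_logits)
output = moe_forward([1.0, 1.0], experts, router_logits)
print(chosen)  # [1, 3]: experts 0 and 2 never run for this token
```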

Without residual connections, very deep models become much harder to train. Without layer normalization, the running sum can blow up or collapse. With both, models with a hundred or more layers can be trained stably.
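The two ideas combine into a small pattern. The sketch below uses the "normalize first, then add back" ordering found in many recent models and leaves out the learned scale and shift that real layer normalization includes; the sublayer argument stands in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and roughly unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_block(x, sublayer):
    """Add the sublayer's output to its input instead of replacing the input."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(x)
print(abs(sum(normalized)) < 1e-9)  # True: mean is zero

# A sublayer that contributes nothing leaves the running sum untouched,
# which is what lets information pass through many layers unchanged.
unchanged = residual_block(x, lambda h: [0.0] * len(h))
print(unchanged == x)  # True
```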

After all the layers of attention and feed-forward processing finish, the model has a vector for each token in the sequence.

Logits are raw scores for each possible next token. They become probabilities only after softmax.

The model usually does not just pick the highest-probability token every time. Decoding settings control how deterministic or varied the output is.
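Temperature is the most common of those settings. A minimal sketch, assuming three made-up logits: dividing the logits by a temperature before softmax sharpens the distribution when the temperature is below 1 and flattens it when above 1, and a temperature of 0 is treated here as plain greedy selection. The function names are illustrative.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Greedy when temperature is 0, otherwise sample from the softmax distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up raw scores for a 3-token vocabulary

sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0] > plain[0] > flat[0])  # True: lower temperature favors the top token

rng = random.Random(0)
print(pick_token(logits, 0, rng))     # 0: greedy always picks the highest logit
```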

Once a token is picked, it gets added to the input. The model runs the next step on the longer sequence, usually reusing the KV cache so it doesn't recompute the whole prefix from scratch. New attention for the new token. New feed-forward. New final vector. New prediction. The loop continues until the model emits an end-of-sequence token or hits a length limit.
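The loop can be sketched with a stand-in for the model. Here toy_next_token_logits is a made-up function that simply favors "previous id plus one" and then end-of-sequence, so the control flow is visible without a real network. The KV cache is left out: this version passes the whole sequence in on every step.

```python
EOS_ID = 0  # assumed id of the end-of-sequence token in this toy vocabulary


def toy_next_token_logits(token_ids):
    """Stand-in for a real model: favors last id + 1, then EOS once id 4 is reached."""
    vocab_size = 6
    logits = [0.0] * vocab_size
    last = token_ids[-1]
    target = EOS_ID if last >= 4 else last + 1
    logits[target] = 5.0
    return logits


def generate(prompt_ids, max_new_tokens=10):
    """Repeatedly predict a token, append it, and stop at EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids


print(generate([1, 2]))                  # [1, 2, 3, 4, 0]  stopped by EOS
print(generate([1], max_new_tokens=2))   # [1, 2, 3]        stopped by the length limit
```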

The base model isn't trained on factual accuracy, conversational ability, reasoning, or coding directly. It's trained to predict the next token in massive amounts of text. Later post-training can then tune the model for instruction following, human preferences, safety, and conversational behavior.
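"Predict the next token" is typically scored with cross-entropy: the loss is the negative log of the probability the model gave to the token that actually came next. A small sketch with made-up logits over a 4-token vocabulary:

```python
import math


def next_token_loss(logits, target_id):
    """Cross-entropy for one prediction: -log(softmax(logits)[target_id])."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_sum_exp - logits[target_id]


confident_and_right = next_token_loss([5.0, 0.0, 0.0, 0.0], target_id=0)
confident_and_wrong = next_token_loss([5.0, 0.0, 0.0, 0.0], target_id=1)
no_idea = next_token_loss([0.0, 0.0, 0.0, 0.0], target_id=2)

print(confident_and_right < no_idea < confident_and_wrong)  # True
print(abs(no_idea - math.log(4)) < 1e-9)                    # True: uniform guess over 4 tokens
```

Training nudges the weights to lower this number, averaged over enormous numbers of positions in the training text.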

What changes between models is:

The trained weights themselves, learned from different training data at different scales.
The configuration: number of layers, vocabulary size, head count, parameter count, MoE or dense.
The post-training: instruction tuning, learning from human feedback, safety controls applied on top of the base model.

Weights are the learned numbers inside the model. Training changes those numbers until the model predicts text well.