Every reply you’ve ever seen from an LLM was generated one token at a time, by running the same fixed-size model over and over on its own output. Not the cleverest paragraph — every paragraph. The model is a transformer, and the loop that runs it is mostly mechanical bookkeeping. Understanding the model, and that loop, is the whole game. Everything the rest of this series does — splitting the model across GPUs, sharing one forward pass across many users, making long prompts fit, making generation faster — sits downstream of these two pieces.

This article opens the model up at three zoom levels:

  1. The whole model, end to end — what comes in, what comes out, what’s in between.
  2. One layer up close — what’s actually inside the part labelled “transformer block”.
  3. The full loop — how the model is used to produce a long reply, one token at a time.

We’ll keep things abstract — symbols like d, L, h rather than specific numbers — because the structure is what’s portable. Different models pick different sizes, but they all sit in this same shape. Concrete numbers earn their place in later articles, where they’re actually load-bearing.

Along the way, real questions surface naturally — things like “wait, does the model really redo all of that every time it produces one token?” or “what if the model is too big to fit on one GPU?” Those questions are exactly what the rest of the series picks apart. Each one becomes its own article down the line.


Part I — The model end to end

1. Tokens in, next token out

Hand the model a string — say, "the quick brown fox jumps over" — and ask it to keep going. What does it actually do, mechanically? Six steps, top to bottom.

1. Tokenize. First, the string is chopped into pieces called tokens. Each token is a small integer ID — because under the hood, the model can only do arithmetic on numbers. Roughly: common short words become one token each, rarer words get broken into a few pieces. We’ll call the count of tokens N.

2. Embed. Each integer ID gets looked up in a giant table called the embedding table. The table has one row per possible token in the vocabulary, and each row is a vector of d numbers. (d is one of the model’s design choices — its hidden dimension. Real models put d in the thousands.) Looking up N tokens turns a list of N IDs into a tensor of shape [N × d]: N rows, each d numbers wide.

Why a vector and not just keep the integer ID? Because the model only knows how to do linear algebra, and an integer ID has no useful geometry — token 5 isn’t “closer to” token 6 than to token 100, even though they’re consecutive integers. The embedding table gives every token a learned point in a d-dimensional space, where tokens with similar meanings end up nearby and unrelated ones end up far apart. Each row is the model’s initial, context-free feeling about what that token means.

(We’ll write vocab for the number of distinct tokens in the vocabulary, typically somewhere from tens of thousands to a few hundred thousand. So the embedding table itself is [vocab × d].)
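
To make the lookup concrete, here is a minimal NumPy sketch. The sizes, the random table, and the token IDs are all made up for illustration; a real model's table is learned and far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, d = 1000, 16                               # toy sizes, chosen only for illustration
embedding_table = rng.normal(size=(vocab, d))     # stand-in for the learned [vocab x d] table

token_ids = np.array([12, 845, 3, 77, 510, 9])    # hypothetical IDs for a 6-token input
N = len(token_ids)

# Integer-array indexing picks one row of the table per token ID.
x = embedding_table[token_ids]                    # shape [N x d]

print(x.shape)  # (6, 16)
```

Nothing is computed here: the "embed" step is a pure table lookup, one row copied out per token.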

3. A stack of transformer blocks. This [N × d] tensor now flows through L transformer blocks, stacked one on top of the next. Each block reads the whole sequence, mixes information across positions, and writes back a refined version. Crucially, every block’s input and output have the same [N × d] shape — only the contents of the rows change.

After all L of these passes, the rows have been refined far past their starting point. Each one now represents what that token means in this particular sequence — not the generic, context-free meaning we started with. We’ll dig into why blocks stack so well in §2, and open one up in Part II.

4. Final norm. A small normalization step right at the top of the stack — a clean-up pass. Same shape going in, same shape coming out.

5. LM head. A linear layer projects each row from d features back out to one number per token in the vocabulary: vocab numbers per row. Output shape: [N × vocab]. Each row is a long vector of “scores” over the entire vocabulary. These scores are called logits. The logit for token t in row i is the model’s raw, unsquashed answer to “how plausible is t as the token that comes right after position i?”

6. Softmax → sample. The row we actually care about is the last one: it sits at the last input token’s position, and it holds the model’s prediction for “what comes next.” Softmax turns those raw logits into probabilities, all positive, all summing to 1. Sample one token from the resulting distribution. That’s the model’s guess for the next token.
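
Steps 5 and 6 can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes: `hidden` and `lm_head` are random stand-ins for the final-norm output and the learned projection, not anything from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, vocab = 6, 16, 1000                 # toy sizes, for illustration only
hidden = rng.normal(size=(N, d))          # stand-in for the final-norm output, [N x d]
lm_head = rng.normal(size=(d, vocab))     # stand-in for the learned [d x vocab] projection

logits = hidden @ lm_head                 # [N x vocab]: one score per vocabulary entry, per row

def softmax(z):
    """Turn a vector of logits into probabilities (non-negative, summing to 1)."""
    z = z - z.max()                       # subtracting the max avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits[-1])               # only the last row predicts the next token
next_token = int(rng.choice(vocab, p=probs))   # sample one token ID from the distribution
```

Replacing the sampling line with `int(np.argmax(probs))` would give the greedy variant: always take the single most likely token.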

A picture of the whole stack:

[Figure: the model, end to end. token IDs (integers), shape [N] → embedding lookup (table [vocab × d]) → [N × d] → L transformer blocks (block 1, block 2, …, block L−1, block L; [N × d] in, [N × d] out, repeated) → [N × d] → final LayerNorm → LM head [d × vocab] → [N × vocab] logits → softmax(last row) → next-token distribution.]

So the entire model is a function: it reads N tokens and returns a probability distribution over what the (N+1)-th token should be. Everything else — the chatty replies, the long answers, the streaming output you see in a chat UI — comes from running this function in a loop. We’ll get to that loop in Part III.


2. Why blocks stack: the stream-processor pattern

A transformer, in one line: a stack of L identical “stream processors” that read a fixed-shape stream of tokens, refine it, and hand it on. The shape is [N × d]. Same shape in, same shape out, repeated L times.

Why does that property matter? Two reasons, and the rest of the series leans on both:

  1. It makes the design scale by stacking. Want a bigger model? Stack more blocks. A small open-source model and a massive flagship one look almost identical at this zoom level: same six-step pipeline, same block structure, just a larger L and a wider d. Same recipe, scaled.
  2. It frees everything downstream from caring about depth. A block doesn’t know whether it’s the 1st in the stack or the 32nd, so any tool that touches blocks (the GPU splitter, the batcher, the scheduler) doesn’t have to either. The stack is a uniform substrate to operate on.

(You may have seen this idea elsewhere — Unix pipes, audio plugins, image-processing pipelines. Same shape in, same shape out, stack as many as you want.)

To pin the shape down concretely: look back at §1’s six steps. Once we’re past tokenization, every step in the middle reads and writes the same [N × d] tensor.

  • The embedding turns N token IDs into an [N × d] tensor.
  • Each transformer block reads [N × d] and returns [N × d].
  • The final norm reads [N × d] and returns [N × d].
  • Only the LM head, at the very top, changes the width — back out to vocab.

The shape never changes in the middle of the pipeline. The contents do — each block refines the rows, building up a richer, more context-aware representation — but the geometry is fixed at [N × d] from the bottom of the stack to the top.

A few symbols later sections and articles will reach for:

  • N — the length of the current sequence. Varies per request — it’s a property of the input, not the model.
  • d — the hidden dimension. The width of every row in the stream.
  • L — how many transformer blocks are stacked.
  • vocab — how many distinct tokens the model knows about. Sets the width of the embedding table and the LM head.

We’ll meet two more in Part II: h (the number of attention heads inside a block) and d_head (each head’s width).


Part II — Inside one block

3. The block, drawn flat

Now let’s open up one of those L transformer blocks. The good news: they all have the same internal structure — different blocks have different learned numbers, but the wiring is identical. So understanding one block is understanding all of them.

A block has two halves, each wrapped in a residual connection (the little + at the bottom of each half — we’ll explain that in a sec):

[Figure: one block, two halves, each wrapped in a residual.
Attention sub-layer: input [N × d] → LayerNorm 1 (tidy-up) → QKV projection (d → 3d, split into Q, K, V) → multi-head attention (mixes across positions) → output projection (d → d) → + residual → [N × d].
FFN sub-layer: LayerNorm 2 (tidy-up) → FFN-up (d → 4d) → activation (GeLU, pointwise nonlinearity) → FFN-down (4d → d) → + residual → output [N × d].]

The two halves are the two main events: an attention sub-layer and an FFN (feed-forward network) sub-layer. The other parts (LayerNorm, the activation, the +) are smaller pieces of glue.

A quick word on what each piece is doing:

  • LayerNorm is a normalization step: for each row of the tensor, it rescales the numbers to roughly zero mean and unit variance. Cheap, applied to each row on its own, and mostly there to keep the numbers from drifting into bad ranges as they pass through many layers. Think of it as a “tidy-up” pass.
  • The residual + means: take what came into this half, and add it onto what came out. So each half is computing a delta — a refinement to the existing representation, not a replacement. That’s what lets us stack many blocks without the signal getting hopelessly mangled along the way.
  • The QKV projection is just three linear layers fused into one big matmul. It produces three tensors — Q (queries), K (keys), V (values) — each of shape [N × d], by applying three different weight matrices to the input.
  • Multi-head attention is the only step that lets information flow between tokens. It’s the main event — §4 walks through what it actually computes, and §5 explains why it’s “multi-head.”
  • The output projection is a final linear layer that mixes the attention output back into something the residual + can absorb.
  • FFN-up and FFN-down are two linear layers with a nonlinearity in between. Together they widen each token’s d-dim representation to 4d, run a per-element nonlinearity, and pull it back to d. No mixing across tokens — every token is processed on its own.
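
The FFN half is the easier one to sketch, so here it is in NumPy with its LayerNorm and residual. Treat this as a rough illustration: the weights are random stand-ins, biases and LayerNorm's learned scale and shift are left out, and the GeLU uses the common tanh approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 16                                   # toy sizes, for illustration only

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (learned scale/shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of GeLU, applied element by element."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Random stand-ins for the learned FFN weights.
W_up = rng.normal(size=(d, 4 * d)) * 0.1       # d -> 4d
W_down = rng.normal(size=(4 * d, d)) * 0.1     # 4d -> d

def ffn_half(x):
    h = layer_norm(x)                          # [N x d]
    h = gelu(h @ W_up)                         # [N x 4d]
    delta = h @ W_down                         # [N x d]: the refinement this half computes
    return x + delta                           # residual: add the delta onto the input

x = rng.normal(size=(N, d))
y = ffn_half(x)                                # [N x d], same shape as the input

# Each row is processed on its own: running one row alone gives the same answer.
print(np.allclose(ffn_half(x[:1]), y[:1]))     # True
```

The last line is the "no mixing across tokens" claim in executable form: feeding row 0 through by itself gives the same result as feeding it through alongside the other five.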

Same shape in, same shape out — the §2 mantra. Stack many of these and you have the body of the model.


4. What attention actually does

We’ve said “attention mixes across positions” several times now without saying how. Let’s fix that.

At every position, the model produces three vectors from that position’s [d]-wide row:

  • a query Q: “what am I looking for?”
  • a key K: “here’s what I have to offer”
  • a value V: “if you decide you care about me, here’s the actual content I want to pass along”

(That’s exactly what the QKV projection does — three linear layers, one each for Q, K, V, fused into one matmul.)

To update position i’s row, the model does three things:

  1. Compute scores. It compares i’s query against every position’s key, with a dot product. A bigger dot product = “those two vectors point in similar directions” = “this position is interesting to position i.” A smaller (or negative) dot product = “not interesting.” So we end up with a list of N scores — one per position.
  2. Turn scores into weights. Run those scores through a softmax to get attention weights — all positive, all summing to 1. High weight on position j means “i cares a lot about j;” low weight means “i basically ignores j.”
  3. Take a weighted average of values. Compute a weighted sum of every position’s value vector, using those weights. That sum is what gets written back as i’s updated representation.

In one sentence: position i’s new row is a weighted average of every position’s value vector, where the weights are decided by how well i’s query matched each one’s key.
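
Here are those three steps as a NumPy sketch for one position i, followed by the same thing for every position at once. It is a bare-bones illustration: random stand-in weights, a single head, and none of the refinements from later sections (no causal mask, no score scaling).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                                    # toy sizes, for illustration only

x = rng.normal(size=(N, d))                    # the [N x d] input rows
# Random stand-ins for the three learned projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v            # each [N x d]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

i = 2                                          # the position being updated
scores = K @ Q[i]                              # step 1: i's query against every key, [N]
weights = softmax(scores)                      # step 2: positive, summing to 1, [N]
new_row = weights @ V                          # step 3: weighted average of values, [d]

# The same computation for all positions at once, as matrix products.
all_scores = Q @ K.T                           # [N x N]
all_weights = np.exp(all_scores - all_scores.max(axis=-1, keepdims=True))
all_weights /= all_weights.sum(axis=-1, keepdims=True)
out = all_weights @ V                          # [N x d]

print(np.allclose(out[i], new_row))            # True
```

The matrix form is what actually runs in practice; the per-position form is just easier to read against the three steps.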

That’s it — that’s the entire mechanical content of attention. Everything else in the block (the LayerNorms, the FFN, the residuals) is supporting infrastructure for this one operation. It’s also the only step in the entire model that lets information flow between tokens. Take attention away and the model can’t tell that “fox” and “the” are part of the same sentence.

We’ll add two more details to this picture:

  • §5 — Heads. Attention isn’t run once on the full d-wide features — it’s run multiple times in parallel on different slices of the features.
  • §6 — Causal mask. Position i isn’t actually allowed to attend to every position. It can only look at positions j ≤ i. We’ll explain why.

The FFN sub-layer, by comparison, is much simpler: every row gets passed through the same two linear layers and a nonlinearity, independently of every other row. No mixing across positions there — that’s the attention sub-layer’s job.

So the rhythm of every transformer block is: positions mix (attention), then features mix (FFN). Repeated L times.


5. Heads

Here’s a small thing about attention that turns out to matter a lot: it’s not run once on the full d-dim features, it’s run h times in parallel on different slices of the features.

After the QKV projection produces Q, K, V each of shape [N × d], we reshape each one along the feature dimension into h groups of width d_head = d / h. Each group is one head. Each head runs the §4 attention computation on its own slice — its own queries, its own keys, its own values. Their outputs are concatenated back into [N × d] and fed into the output projection.

[Figure: multi-head attention: reshape, per-head, concat. Q, K, V [N × d] → reshape into h heads, each d_head wide [N × h × d_head] → attention (§4) → h attention outputs [N × h × d_head] → concat → [N × d], to output projection. Each head runs §4’s attention algorithm on its own slice, independently of the others.]

Real models pick h and d_head to multiply back to d — typically a few dozen heads, each one a hundred-something wide.
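
The reshape, per-head attention, and concat can be sketched like this in NumPy. It is an illustration under toy assumptions: Q, K, V are random stand-ins for the QKV projection's output, the causal mask is left out, and the scores are divided by √d_head as in §7's picture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 5, 8, 2                              # toy sizes, for illustration only
d_head = d // h

# Random stand-ins for the QKV projection's three outputs, each [N x d].
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

def attention(q, k, v):
    """Single-head attention on one slice: q, k, v are each [N x d_head]."""
    scores = q @ k.T / np.sqrt(q.shape[-1])    # [N x N]
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                               # [N x d_head]

# Reshape along the feature dimension: [N x d] -> [N x h x d_head].
Qh, Kh, Vh = (t.reshape(N, h, d_head) for t in (Q, K, V))

# Each head sees only its own slice of the features.
head_outputs = [attention(Qh[:, j], Kh[:, j], Vh[:, j]) for j in range(h)]

# Concatenate the h outputs back into [N x d] for the output projection.
out = np.concatenate(head_outputs, axis=-1)
print(out.shape)  # (5, 8)
```

The Python loop over heads is only for readability. Since no head reads another head's slice, the same work can run as one batched operation, or on separate devices.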

The model-design intuition: different heads can learn to pay attention to different kinds of things. Some end up tracking short-range syntactic relationships (“which word does this pronoun refer to?”). Others track longer-range patterns. Multiple heads = multiple “perspectives” on what to attend to.

The systems intuition we’ll need later is more brutal: heads are independent. Head 0 doesn’t talk to head 1 during attention. Each one runs its own little attention computation on its own slice of the features and produces its own output.

That independence is just a property of how the model is built — but it’s load-bearing for everything that comes next. Article 02 will use exactly this property to literally cut the model across two GPUs: half the heads go to one card, half go to the other, and during attention they don’t need to talk to each other at all. The picture for “how do we run a model that’s too big for one GPU” turns out to start right here.


6. The causal mask

Inside attention, there’s one more rule we haven’t mentioned, and it’s essential: when a token at position i attends, it can only see positions j ≤ i. Positions j > i are masked out — their attention scores are forced to −∞ before softmax, which makes their post-softmax weights exactly zero, which means they contribute nothing to position i’s output.
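
A small NumPy sketch of the masking step, using an arbitrary made-up score matrix for a 4-token sequence. The point is only to show the mechanics: −∞ before softmax becomes exactly 0 after it.

```python
import numpy as np

N = 4
scores = np.arange(N * N, dtype=float).reshape(N, N)   # arbitrary made-up scores, [N x N]

# True wherever j > i, i.e. where position i would be looking at the future.
future = np.triu(np.ones((N, N), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax. exp(-inf) is exactly 0, so masked positions get zero weight.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights[0])  # [1. 0. 0. 0.]  position 0 can only attend to itself
```

Every row still sums to 1. The mask does not shrink the weights; it redistributes them over the positions that are allowed.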

Why this rule exists comes from training. The model is trained on next-token prediction: feed in a sequence, and ask the model to predict, at every position, the next token from everything that came before it. If position i were allowed to peek at position i+1 during attention, it could cheat by reading the answer. The mask is what enforces “no peeking ahead.”

The mask has two more consequences worth naming, and they come up later.

First, it’s what makes the generation loop in Part III well-defined: the token at position N+1 only depends on tokens 1..N, never the other way around. So we can compute new tokens in order, one at a time, without ever revising an earlier one. That property is what makes “generate a long answer one token at a time” work at all.

Second, and this is the bigger one: the mask means an old token’s work never has to be redone. Position 5’s hidden state is the same whether the sequence is 5 tokens long or 500 tokens long; no future token can reach back and change it. That stable-once-computed property is what makes it even thinkable to save earlier work and reuse it later instead of recomputing it on every forward pass. Without the mask, every new token would force a full revisit of everything that came before. With it, we can imagine processing tokens in order and just remembering what we already computed. That is the question §10 will land on, and the answer is one of the most load-bearing optimizations in the rest of the series.
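
That stability claim is easy to check numerically. The sketch below uses a toy single-head attention with random stand-in weights (the `causal` flag is just a switch for this demo): with the mask on, the first 5 rows come out the same whether the input has 5 tokens or 8; with it off, they do not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # toy width, for illustration only
# Random stand-ins for learned projections.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def attention(x, causal=True):
    """Toy single-head attention over an [n x d] input."""
    n = x.shape[0]
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d)              # [n x n]
    if causal:
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # [n x d]

x_long = rng.normal(size=(8, d))               # an 8-token sequence
x_short = x_long[:5]                           # the same sequence, cut off after 5 tokens

out_long = attention(x_long)
out_short = attention(x_short)

# With the mask, the first 5 rows do not depend on what comes after them.
print(np.allclose(out_long[:5], out_short))    # True
```

Calling `attention(x_long, causal=False)` and comparing the same way shows the rows shifting once later tokens are visible, which is exactly what would make saved results unusable.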


7. The whole block in one picture

We’ve now opened up every piece of a transformer block — the two halves (§3), attention’s Q/K/V mechanism (§4), the split into heads (§5), the causal mask (§6). Here’s the full picture in one trace, with the tensor shape labelled at every step.

Skim it once for the overall flow, then come back to it whenever something later in the series references “the [h × N × N] score matrix” or “the reshape into heads” — this diagram is the shape you’re being asked to picture.

[Figure: inside one block, every operation and every shape.
Attention half (wrapped in a residual): input [N × d] → LayerNorm 1 [N × d] → QKV projection (Q, K, V, each [N × d]) → reshape Q, K, V along the feature dim into h heads (each [N × h × d_head]) → multi-head attention, per head, in parallel: Q · Kᵀ / √d_head ([h × N × N] scores) → + causal mask, future → −∞ ([h × N × N]) → softmax along the last dim ([h × N × N] weights) → weights · V ([N × h × d_head]) → concat heads back into [N × d] → output projection → + residual ([N × d]).
FFN half (wrapped in a residual): LayerNorm 2 [N × d] → FFN-up, d → 4d ([N × 4d]) → activation, GeLU ([N × 4d]) → FFN-down, 4d → d → + residual → output [N × d].]

Three things worth pausing on:

  • Shapes start and end at [N × d] — the §2 mantra. Inside one block, the tensor briefly takes other shapes ([N × 4d] in the middle of the FFN, [h × N × N] for the attention scores) — but those are transient. The block always returns to [N × d] so the next block can consume it.
  • The [h × N × N] score matrix is the one that surprises. Its size scales with the square of the sequence length. Harmless when N is small, awkward when N is large — that’s where the cost of long sequences will eventually bite. Worth noticing now; future articles will come back to it.
  • Each residual + re-injects the input of that half back onto the output. So each half is computing a delta, not a replacement. That’s why we can stack many blocks without the signal collapsing.
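
For readers who prefer code to diagrams, here is the same trace as a NumPy sketch of one block's forward pass. It rests on simplifying assumptions: random stand-in weights, no biases, LayerNorm without its learned scale and shift, and the tanh approximation of GeLU. The shapes in the comments are the ones from the picture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 6, 16, 4                             # toy sizes, for illustration only
d_head = d // h

def layer_norm(x, eps=1e-5):
    m = x.mean(axis=-1, keepdims=True)
    v = x.var(axis=-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-ins for the block's learned weights.
W_qkv = rng.normal(size=(d, 3 * d)) * 0.1      # fused Q, K, V projection, d -> 3d
W_out = rng.normal(size=(d, d)) * 0.1          # output projection, d -> d
W_up = rng.normal(size=(d, 4 * d)) * 0.1       # FFN-up, d -> 4d
W_down = rng.normal(size=(4 * d, d)) * 0.1     # FFN-down, 4d -> d

def block(x):
    n = x.shape[0]
    # Attention half.
    a = layer_norm(x)                                          # [N x d]
    Q, K, V = np.split(a @ W_qkv, 3, axis=-1)                  # each [N x d]
    # [N x d] -> [N x h x d_head], then put the head axis first: [h x N x d_head].
    Q, K, V = (t.reshape(n, h, d_head).transpose(1, 0, 2) for t in (Q, K, V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)        # [h x N x N]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)                 # causal mask
    weights = softmax(scores)                                  # [h x N x N]
    heads = weights @ V                                        # [h x N x d_head]
    concat = heads.transpose(1, 0, 2).reshape(n, d)            # [N x d]
    x = x + concat @ W_out                                     # residual +
    # FFN half.
    f = gelu(layer_norm(x) @ W_up)                             # [N x 4d]
    return x + f @ W_down                                      # residual +, [N x d]

x = rng.normal(size=(N, d))
y = block(block(x))                            # same shape in and out, so blocks stack
print(y.shape)  # (6, 16)
```

In a real model each of the L blocks has its own weights; calling the same `block` twice here only demonstrates that the output of one is a valid input to the next.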

Part III — Using the model to generate

8. One forward gives you one token

The model in §1 takes a sequence of length N and returns a probability distribution over what the next token should be. One token. Not a whole sentence, not even a phrase — a single next-token guess.

But we’re used to LLMs producing long replies. How does a one-token-at-a-time model produce a paragraph? Exactly how you’d guess: by running over and over and feeding its own output back in.

Concretely:

  1. Start with the prompt — a sequence of length N.
  2. Run a forward pass on it. You get a distribution over what token N+1 should be.
  3. Sample from that distribution (or just take the most likely token, “argmax”). You now have a token at position N+1.
  4. Append it to the sequence. The sequence is now length N+1.
  5. Run another forward pass on the full (N+1)-long sequence. You get a distribution over token N+2.
  6. Sample. Append. Sequence is now length N+2.
  7. Repeat until either the model samples a special end-of-sequence token (it has been trained to emit one when it thinks the response is complete) or you hit a length cap you’ve imposed.

A picture of the loop:

[Figure: generation loop: sample, append, repeat. prompt (length N) → forward on length N → sample token N+1 from last-row softmax → sequence (length N+1) → feed the appended sequence back in → forward on length N+1 → sample token N+2 → sequence (length N+2) → … until the model emits end-of-sequence or a length cap is hit.]

That is the whole generation procedure, mathematically. Every output token from any LLM-based system you’ve ever used was produced by a loop that looks like this.
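
The loop itself fits in a dozen lines. In the sketch below, `forward` is a hypothetical stand-in for the real model (it just prefers "last token + 1"), and `VOCAB` and `EOS` are made-up values; only the shape of the loop is the point.

```python
import random

VOCAB = 8          # toy vocabulary size, for illustration only
EOS = 0            # hypothetical end-of-sequence token ID

def forward(tokens):
    """Hypothetical stand-in for the model: returns a next-token distribution.

    A real forward pass would run the whole stack from §1 on `tokens`.
    This toy version just prefers (last token + 1) mod VOCAB.
    """
    preferred = (tokens[-1] + 1) % VOCAB
    probs = [0.02] * VOCAB
    probs[preferred] = 1.0 - 0.02 * (VOCAB - 1)
    return probs

def generate(prompt, max_new_tokens, rng):
    tokens = list(prompt)
    for _ in range(max_new_tokens):            # the length cap
        probs = forward(tokens)                # forward on the full sequence so far
        next_token = rng.choices(range(VOCAB), weights=probs)[0]   # sample
        tokens.append(next_token)              # append
        if next_token == EOS:                  # stop on end-of-sequence
            break
    return tokens

out = generate([3, 4], max_new_tokens=20, rng=random.Random(0))
print(out)
```

Note that `forward` receives the entire sequence on every iteration. That detail is what the next section is about.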


9. The first uncomfortable observation

Walk through the cost of generating K new tokens from a prompt of length N.

  • Forward 1 runs on the prompt: length N.
  • Forward 2 runs on prompt + 1 new token: length N+1.
  • Forward 3: length N+2.
  • …
  • Forward K: length N+K−1.

Every forward repeats almost everything the previous forward already did. The first N tokens of forward 2’s input are identical to forward 1’s input — the model nevertheless runs every block on every position from scratch, as if it had never seen them before.

If you total it up, the loop pushes N + (N+1) + … + (N+K−1) = K·N + K(K−1)/2 token positions through the model. That is bounded by (N + K)² / 2, so the work is quadratic in the eventual sequence length. And most of that work is recomputing things that haven’t changed. A new token added at the end of the sequence doesn’t change any of the earlier tokens’ representations. The earlier tokens are still the same prompt and the same few sampled tokens that came before this one. Nothing about them needs to be redone.
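
A quick way to get a feel for the numbers is to count token positions directly. The sketch below does only that; it is a rough proxy that ignores the extra attention cost of longer sequences, and N = 100, K = 400 are arbitrary example values.

```python
def positions_processed(N, K):
    """Total token positions the naive loop pushes through the model."""
    return sum(N + k for k in range(K))        # forward k+1 runs on length N + k

N, K = 100, 400                                # arbitrary example sizes
total = positions_processed(N, K)
closed_form = K * N + K * (K - 1) // 2         # the same sum, in closed form
estimate = (N + K) ** 2 // 2                   # the (N + K)^2 / 2 rule of thumb
distinct = N + K - 1                           # positions that ever get fed in

print(total)        # 119800
print(closed_form)  # 119800
print(estimate)     # 125000
print(distinct)     # 499
```

With these values the loop processes 119,800 positions to cover a sequence with only 499 distinct ones, so on average each position is recomputed a couple of hundred times.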

So an obvious question hangs in the air: is all that recomputation actually necessary? Clearly not. But avoiding it isn’t free either — it means we’d have to keep some intermediate state around between forwards. Which raises its own questions: what state, exactly? Where do we put it? How big does it get? How does it grow as the conversation grows?

That kind of question is exactly what this series picks at later on.


10. The map of questions

Two themes run through the rest of the series, and most practical questions about running an LLM fall into one or the other.

Theme 1 — Making one forward pass fit. A single forward through this stack can be too big in several ways at once: too big to fit on one GPU, too long to compute in reasonable time, too memory-hungry inside attention. The articles in this theme are about splitting work spatially so a forward can land on the hardware you have.

  • The model itself can be huge. Stack enough blocks (large L) at a wide enough d and the weights alone won’t fit on a single GPU. How do we split one forward across multiple GPUs? (Articles 02 and 03 — leveraging exactly the head-independence we set up in §5.)
  • The prompt itself can be huge. §7’s [h × N × N] score matrix scales with the square of the sequence length. For a long prompt that either runs out of memory or pins the GPU for too long. Can we process the prompt in pieces, or compute attention more cleverly?

Theme 2 — Making the loop fast. Each forward gives one token, and §9 already spotted the biggest cost: the naive loop redoes most of its work. The articles in this theme are about not redoing things, sharing forwards across users, and scheduling who runs when.

  • Don’t redo work. §6 set up the property: an old token’s representation, once computed, never changes. So we should be able to save it and reuse it on the next forward instead of recomputing. That state has to live somewhere — where, how big does it get, and how does it grow as the conversation grows?
  • Many users at once. Real serving engines run many concurrent prompts of different lengths, finishing at different times. How do they share one forward without padding waste, and how does the scheduler keep everyone moving when some are on token 1 and others are on token 1000? (Article 04 begins this thread.)
  • Prompt-processing and one-more-token feel nothing alike. §9’s per-call cost shape is very different depending on whether you’re processing a long input from scratch or appending one extra output token. Their bottlenecks live in different parts of the GPU. Maybe the engine should treat them as different workloads — or even split them across different machines.

The model in §1–§7 is what both themes are about; the loop in §8 is what they’re trying to make work at scale. The rest of the series picks them off, one question at a time.