[{"content":"About this series These are learning notes — me working through how modern LLMs are actually served, mostly by talking to Claude and writing up the parts that finally clicked. The articles themselves are written in a confident \u0026ldquo;discovery journey\u0026rdquo; voice, but the project underneath is just someone learning in public.\nThe list below is alive — articles flip status as they ship, and the roadmap grows whenever a discussion surfaces a hole worth digging.\nArticles # Title Status Link 01 An LLM, end to end — bird\u0026rsquo;s-eye stack, one block, the generation loop, and the questions the rest of the series picks up [next] — 02 Tensor parallelism, built from scratch in your head [done] read → 03 Walking TP through a full block — start column-parallel everywhere, watch the comm explode, pair with row-parallel until two all-reduces per block fall out [done] read → 04 How to batch many requests through one forward pass — varlen attention, prefill only, TP turns out to be untouched [done] read → 05 ORCA and chunked prefill — iteration-level scheduling solves the boundary problems; chunked prefill bounds the iteration so a long prompt can\u0026rsquo;t hijack the engine\u0026rsquo;s heartbeat [done] read → 06 Prefill and decode disaggregation — two phases on opposite sides of the roofline; once you accept the asymmetry, sharing a GPU pool is no longer a compromise but a fight against the formula [done] read → 07 The engineering of disaggregation — KV cache transfer across fabrics (NVLink, NVSwitch, IB, PCIe), tiered memory pools (HBM, DRAM, SSD), overlap with prefill, topology-aware routing [next] — 08 Pipeline parallelism — the cut across blocks instead of within one, and the bubble it creates; why the prefill pool wants it [planned] — 09 MoE and expert parallelism — what changes when FFN becomes routed [planned] — 10 PagedAttention — the KV cache as virtual memory, blocks instead of contiguous slabs, copy-on-write across requests [planned] 
— 11 Sequence and context parallelism — splitting one request across GPUs, ring attention, the long-context move [planned] — 12 FlashAttention — tiled online softmax, why the [L × L] score matrix never has to exist [speculative] — 13 FlashDecoding — making the 1 × L_kv decode-attention call fast under bandwidth pressure [speculative] — 14 GQA and MLA — fewer KV heads, smaller KV cache, faster decode (and what it costs the model) [speculative] — 15 Speculative decoding — a draft model proposes, the big model verifies, two passes for the price of one [speculative] — 16 KV compression — quantization, eviction policies, what we can drop and what we can\u0026rsquo;t [speculative] — Status legend [done] shipped \u0026amp; linked · [next] actively drafting · [planned] on deck, will get there · [speculative] a hole worth digging — may or may not get filled, but the question is interesting\nRecurring threads worth flagging A few observations that keep showing up across articles, worth keeping in the back of your mind as you read:\nTP turns out to be remarkably non-disruptive. Request batching didn\u0026rsquo;t disturb it (Article 04), and continuous batching + chunked prefill didn\u0026rsquo;t either (Article 05). PP and MoE do interact with TP in interesting ways — that\u0026rsquo;s why those come up next. The KV cache is the connective tissue from Article 05 onward. It enters with decode and never really leaves; it\u0026rsquo;s also the thing that makes long contexts hard. Decode flips the bottleneck profile. Articles 02–04 assume prefill, where compute dominates. Once decode is in scope (Article 05 onward), bandwidth on weight reads becomes the binding constraint — and that\u0026rsquo;s what motivates almost every later optimization (FlashDecoding, GQA, prefill/decode disaggregation, speculative decoding). Modelers\u0026rsquo; choices keep turning out to be load-bearing for serving in ways that weren\u0026rsquo;t designed in. 
Multi-head independence made TP comm-free; it also made request batching comm-free; it\u0026rsquo;ll show up again when we look at GQA/MLA. Worth tracking as a recurring theme. ","permalink":"https://wgzesg.github.io/llm_stories/posts/00-roadmap/","summary":"What this series is, and a living map of the articles — shipped, in progress, and the holes we\u0026rsquo;ve dug for our future selves to fill.","title":"Roadmap"},{"content":"Every reply you\u0026rsquo;ve ever seen from an LLM was generated one token at a time, by running the same fixed-size model over and over on its own output. Not the cleverest paragraph — every paragraph. The model is a transformer, and the loop that runs it is mostly mechanical bookkeeping. Understanding the model, and that loop, is the whole game. Everything the rest of this series does — splitting the model across GPUs, sharing one forward pass across many users, making long prompts fit, making generation faster — sits downstream of these two pieces.\nThis article opens the model up at three zoom levels:\nThe whole model, end to end — what comes in, what comes out, what\u0026rsquo;s in between. One layer up close — what\u0026rsquo;s actually inside the part labelled \u0026ldquo;transformer block\u0026rdquo;. The full loop — how the model is used to produce a long reply, one token at a time. We\u0026rsquo;ll keep things abstract — symbols like d, L, h rather than specific numbers — because the structure is what\u0026rsquo;s portable. Different models pick different sizes, but they all sit in this same shape. Concrete numbers earn their place in later articles, where they\u0026rsquo;re actually load-bearing.\nAlong the way, real questions surface naturally — things like \u0026ldquo;wait, does the model really redo all of that every time it produces one token?\u0026rdquo; or \u0026ldquo;what if the model is too big to fit on one GPU?\u0026rdquo; Those questions are exactly what the rest of the series picks apart. 
Each one becomes its own article down the line.\nPart I — The model end to end 1. Tokens in, next token out Hand the model a string — say, \u0026quot;the quick brown fox jumps over\u0026quot; — and ask it to keep going. What does it actually do, mechanically? Six steps, top to bottom.\n1. Tokenize. First, the string is chopped into pieces called tokens. Each token is a small integer ID — because under the hood, the model can only do arithmetic on numbers. Roughly: common short words become one token each, rarer words get broken into a few pieces. We\u0026rsquo;ll call the count of tokens N.\n2. Embed. Each integer ID gets looked up in a giant table called the embedding table. The table has one row per possible token in the vocabulary, and each row is a vector of d numbers. (d is one of the model\u0026rsquo;s design choices — its hidden dimension. Real models put d in the thousands.) Looking up N tokens turns a list of N IDs into a tensor of shape [N × d]: N rows, each d numbers wide.\nWhy a vector and not just keep the integer ID? Because the model only knows how to do linear algebra, and an integer ID has no useful geometry — token 5 isn\u0026rsquo;t \u0026ldquo;closer to\u0026rdquo; token 6 than to token 100, even though they\u0026rsquo;re consecutive integers. The embedding table gives every token a learned point in a d-dimensional space, where tokens with similar meanings end up nearby and unrelated ones end up far apart. Each row is the model\u0026rsquo;s initial, context-free feeling about what that token means.\n(The vocabulary holds some vocab distinct tokens — typically tens of thousands. So the embedding table itself is [vocab × d].)\n3. A stack of transformer blocks. This [N × d] tensor now flows through L transformer blocks, stacked one on top of the next. Each block reads the whole sequence, mixes information across positions, and writes back a refined version. 
Crucially, every block\u0026rsquo;s input and output have the same [N × d] shape — only the contents of the rows change.\nAfter all L of these passes, the rows have been refined far past their starting point. Each one now represents what that token means in this particular sequence — not the generic, context-free meaning we started with. We\u0026rsquo;ll dig into why blocks stack so well in §2, and open one up in Part II.\n4. Final norm. A small normalization step right at the top of the stack — a clean-up pass. Same shape going in, same shape coming out.\n5. LM head. A linear layer projects each row from d features back out to one number per token in the vocabulary — vocab numbers per row. Output shape: [N × vocab]. Each row is a long vector of \u0026ldquo;scores\u0026rdquo; over the entire vocabulary. These scores are called logits. The logit for token t at position i is the model\u0026rsquo;s raw, unsquashed answer to \u0026ldquo;how plausible is t as the token that follows position i?\u0026rdquo;\n6. Softmax → sample. The row we actually care about is the last one: the row at the last input token\u0026rsquo;s position, which holds the model\u0026rsquo;s prediction for \u0026ldquo;what comes next\u0026rdquo;. Softmax turns those raw logits into probabilities — all positive, all summing to 1. Sample one token from the resulting distribution. That\u0026rsquo;s the model\u0026rsquo;s guess for the next token.\nA picture of the whole stack:\nthe model, end to end: token IDs (integers), shape [N] → embedding lookup [vocab × d] → [N × d] → L transformer blocks ([N × d] in, [N × d] out, repeated: block 1, block 2, ⋮, block L−1, block L) → [N × d] → final LayerNorm → LM head [d × vocab] → [N × vocab] logits → softmax(last row) → next-token distribution\nSo the entire model is a function: it reads N tokens and returns a probability distribution over what the (N+1)-th token should be. 
Everything else — the chatty replies, the long answers, the streaming output you see in a chat UI — comes from running this function in a loop. We\u0026rsquo;ll get to that loop in Part III.\n2. Why blocks stack: the stream-processor pattern A transformer, in one line: a stack of L identical \u0026ldquo;stream processors\u0026rdquo; that read a fixed-shape stream of tokens, refine it, and hand it on. The shape is [N × d]. Same shape in, same shape out, repeated L times.\nWhy does that property matter? Two reasons, and the rest of the series leans on both:\nIt makes the design scale by stacking. Want a bigger model? Stack more blocks. A small open-source model and a massive flagship one look almost identical at this zoom level — same six-step pipeline, same block structure, just a different L (and a slightly wider d). Same recipe, scaled. It frees everything downstream from caring about depth. A block doesn\u0026rsquo;t know whether it\u0026rsquo;s the 1st in the stack or the 32nd, so any tool that touches blocks (the GPU splitter, the batcher, the scheduler) doesn\u0026rsquo;t have to either. The stack is a uniform substrate to operate on. (You may have seen this idea elsewhere — Unix pipes, audio plugins, image-processing pipelines. Same shape in, same shape out, stack as many as you want.)\nTo pin the shape down concretely: look back at §1\u0026rsquo;s six steps. Once we\u0026rsquo;re past tokenization, every step in the middle reads and writes the same [N × d] tensor.\nThe embedding turns N token IDs into an [N × d] tensor. Each transformer block reads [N × d] and returns [N × d]. The final norm reads [N × d] and returns [N × d]. Only the LM head, at the very top, changes the width — back out to vocab. The shape never changes in the middle of the pipeline. 
The contents do — each block refines the rows, building up a richer, more context-aware representation — but the geometry is fixed at [N × d] from the bottom of the stack to the top.\nA few symbols later sections and articles will reach for:\nN — the length of the current sequence. Varies per request — it\u0026rsquo;s a property of the input, not the model. d — the hidden dimension. The width of every row in the stream. L — how many transformer blocks are stacked. vocab — how many distinct tokens the model knows about. Sets the width of the embedding table and the LM head. We\u0026rsquo;ll meet two more in Part II: h (the number of attention heads inside a block) and d_head (each head\u0026rsquo;s width).\nPart II — Inside one block 3. The block, drawn flat Now let\u0026rsquo;s open up one of those L transformer blocks. The good news: they all have the same internal structure — different blocks have different learned numbers, but the wiring is identical. So understanding one block is understanding all of them.\nA block has two halves, each wrapped in a residual connection (the little + at the bottom of each half — we\u0026rsquo;ll explain that in a sec):\none block — two halves, each wrapped in a residual input [N × d] residual attention sub-layer LayerNorm 1 tidy-up QKV projection d → 3d, split into Q, K, V multi-head attention mixes across positions output projection d → d + [N × d] residual FFN sub-layer LayerNorm 2 tidy-up FFN-up d → 4d activation (GeLU) pointwise nonlinearity FFN-down 4d → d + output [N × d] The two halves are the two main events: an attention sub-layer and an FFN (feed-forward network) sub-layer. The other parts (LayerNorm, the activation, the +) are smaller pieces of glue.\nA quick word on what each piece is doing:\nLayerNorm is a normalization step — for each row of the tensor, it rescales the numbers so they have a clean mean and variance. 
Cheap, pointwise, and mostly there to keep the numbers from drifting into bad ranges as they pass through many layers. Think of it as a \u0026ldquo;tidy-up\u0026rdquo; pass. The residual + means: take what came into this half, and add it onto what came out. So each half is computing a delta — a refinement to the existing representation, not a replacement. That\u0026rsquo;s what lets us stack many blocks without the signal getting hopelessly mangled along the way. The QKV projection is just three linear layers fused into one big matmul. It produces three tensors — Q (queries), K (keys), V (values) — each of shape [N × d], by applying three different weight matrices to the input. Multi-head attention is the only step that lets information flow between tokens. It\u0026rsquo;s the main event — §4 walks through what it actually computes, and §5 explains why it\u0026rsquo;s \u0026ldquo;multi-head.\u0026rdquo; The output projection is a final linear layer that mixes the attention output back into something the residual + can absorb. FFN-up and FFN-down are two linear layers with a nonlinearity in between. Together they widen each token\u0026rsquo;s d-dim representation to 4d, run a per-element nonlinearity, and pull it back to d. No mixing across tokens — every token is processed on its own. Same shape in, same shape out — the §2 mantra. Stack many of these and you have the body of the model.\n4. What attention actually does We\u0026rsquo;ve said \u0026ldquo;attention mixes across positions\u0026rdquo; several times now without saying how. 
Let\u0026rsquo;s fix that.\nAt every position, the model produces three vectors from that position\u0026rsquo;s [d]-wide row:\na query Q — \u0026ldquo;what am I looking for?\u0026rdquo; a key K — \u0026ldquo;here\u0026rsquo;s what I have to offer\u0026rdquo; a value V — \u0026ldquo;if you decide you care about me, here\u0026rsquo;s the actual content I want to pass along\u0026rdquo; (That\u0026rsquo;s exactly what the QKV projection does — three linear layers, one each for Q, K, V, fused into one matmul.)\nTo update position i\u0026rsquo;s row, the model does three things:\nCompute scores. It compares i\u0026rsquo;s query against every position\u0026rsquo;s key, with a dot product. A bigger dot product = \u0026ldquo;those two vectors point in similar directions\u0026rdquo; = \u0026ldquo;this position is interesting to position i.\u0026rdquo; A smaller (or negative) dot product = \u0026ldquo;not interesting.\u0026rdquo; So we end up with a list of N scores — one per position. Turn scores into weights. Run those scores through a softmax to get attention weights — all positive, all summing to 1. High weight on position j means \u0026ldquo;i cares a lot about j;\u0026rdquo; low weight means \u0026ldquo;i basically ignores j.\u0026rdquo; Take a weighted average of values. Compute a weighted sum of every position\u0026rsquo;s value vector, using those weights. That sum is what gets written back as i\u0026rsquo;s updated representation. In one sentence: position i\u0026rsquo;s new row is a weighted average of every position\u0026rsquo;s value vector, where the weights are decided by how well i\u0026rsquo;s query matched each one\u0026rsquo;s key.\nThat\u0026rsquo;s it — that\u0026rsquo;s the entire mechanical content of attention. Everything else in the block (the LayerNorms, the FFN, the residuals) is supporting infrastructure for this one operation. It\u0026rsquo;s also the only step in the entire model that lets information flow between tokens. 
Take attention away and the model can\u0026rsquo;t tell that \u0026ldquo;fox\u0026rdquo; and \u0026ldquo;the\u0026rdquo; are part of the same sentence.\nWe\u0026rsquo;ll add two more details to this picture:\n§5 — Heads. Attention isn\u0026rsquo;t run once on the full d-wide features — it\u0026rsquo;s run multiple times in parallel on different slices of the features. §6 — Causal mask. Position i isn\u0026rsquo;t actually allowed to attend to every position. It can only look at positions j ≤ i. We\u0026rsquo;ll explain why. The FFN sub-layer, by comparison, is much simpler: every row gets passed through the same two linear layers and a nonlinearity, independently of every other row. No mixing across positions there — that\u0026rsquo;s the attention sub-layer\u0026rsquo;s job.\nSo the rhythm of every transformer block is: positions mix (attention), then features mix (FFN). Repeated L times.\n5. Heads Here\u0026rsquo;s a small thing about attention that turns out to matter a lot: it\u0026rsquo;s not run once on the full d-dim features, it\u0026rsquo;s run h times in parallel on different slices of the features.\nAfter the QKV projection produces Q, K, V each of shape [N × d], we reshape each one along the feature dimension into h groups of width d_head = d / h. Each group is one head. Each head runs the §4 attention computation on its own slice — its own queries, its own keys, its own values. 
Their outputs are concatenated back into [N × d] and fed into the output projection.\nmulti-head attention: reshape, per-head, concat Q, K, V [N × d] reshape h heads, each d_head wide [N × h × d_head] attention (§4) h attention outputs [N × h × d_head] concat to output proj [N × d] each head runs §4\u0026rsquo;s attention algorithm on its own slice — independently of the others Real models pick h and d_head to multiply back to d — typically a few dozen heads, each one a hundred-something wide.\nThe model-design intuition: different heads can learn to pay attention to different kinds of things. Some end up tracking short-range syntactic relationships (\u0026ldquo;which word does this pronoun refer to?\u0026rdquo;). Others track longer-range patterns. Multiple heads = multiple \u0026ldquo;perspectives\u0026rdquo; on what to attend to.\nThe systems intuition we\u0026rsquo;ll need later is more brutal: heads are independent. Head 0 doesn\u0026rsquo;t talk to head 1 during attention. Each one runs its own little attention computation on its own slice of the features and produces its own output.\nThat independence is just a property of how the model is built — but it\u0026rsquo;s load-bearing for everything that comes next. Article 02 will use exactly this property to literally cut the model across two GPUs: half the heads go to one card, half go to the other, and during attention they don\u0026rsquo;t need to talk to each other at all. The picture for \u0026ldquo;how do we run a model that\u0026rsquo;s too big for one GPU\u0026rdquo; turns out to start right here.\n6. The causal mask Inside attention, there\u0026rsquo;s one more rule we haven\u0026rsquo;t mentioned, and it\u0026rsquo;s essential: when a token at position i attends, it can only see positions j ≤ i. 
Positions j \u0026gt; i are masked out — their attention scores are forced to −∞ before softmax, which makes their post-softmax weights exactly zero, which means they contribute nothing to position i\u0026rsquo;s output.\nWhy this rule exists comes from training. The model is trained one next token at a time: feed in a sequence, ask the model to predict each next token from everything that came before it. If position i were allowed to peek at position i+1 during attention, it would be allowed to cheat by reading the answer. The mask is what enforces \u0026ldquo;no peeking ahead.\u0026rdquo;\nThe mask has two more consequences worth naming, and they come up later.\nFirst, it\u0026rsquo;s what makes the generation loop in Part III well-defined: the token at position N+1 only depends on tokens 1..N, never the other way around. So we can compute new tokens in order, one at a time, without ever revising an earlier one. That property is what makes \u0026ldquo;generate a long answer one token at a time\u0026rdquo; work at all.\nSecond — and this is the bigger one — the mask means an old token\u0026rsquo;s work never has to be redone. Position 5\u0026rsquo;s hidden state is the same whether the sequence is 5 tokens long or 500 long; no future token can reach back and change it. That stable-once-computed property is what makes it even thinkable to save earlier work and reuse it later instead of recomputing it on every forward. Without the mask, every new token would force a full revisit of everything that came before. With it, we can imagine processing tokens in order and just remembering what we already computed — the question §10 will land on, and one of the most load-bearing optimizations in the rest of the series.\n7. The whole block in one picture We\u0026rsquo;ve now opened up every piece of a transformer block — the two halves (§3), attention\u0026rsquo;s Q/K/V mechanism (§4), the split into heads (§5), the causal mask (§6). 
Here\u0026rsquo;s the full picture in one trace, with the tensor shape labelled at every step.\nSkim it once for the overall flow, then come back to it whenever something later in the series references \u0026ldquo;the [h × N × N] score matrix\u0026rdquo; or \u0026ldquo;the reshape into heads\u0026rdquo; — this diagram is the shape you\u0026rsquo;re being asked to picture.\ninside one block — every operation, every shape input [N × d] residual LayerNorm 1 [N × d] QKV projection Q K V each [N × d] reshape Q, K, V along feature dim into h heads each [N × h × d_head] multi-head attention (per head, in parallel) Q · Kᵀ / √d_head [h × N × N] scores + causal mask (future → −∞) [h × N × N] softmax (along last dim) [h × N × N] weights weights · V [N × h × d_head] concat heads back into [N × d] [N × d] output projection + [N × d] residual LayerNorm 2 [N × d] FFN-up (d → 4d) [N × 4d] activation (GeLU) [N × 4d] FFN-down (4d → d) + output [N × d] Three things worth pausing on:\nShapes start and end at [N × d] — the §2 mantra. Inside one block, the tensor briefly takes other shapes ([N × 4d] in the middle of the FFN, [h × N × N] for the attention scores) — but those are transient. The block always returns to [N × d] so the next block can consume it. The [h × N × N] score matrix is the one that surprises. Its size scales with the square of the sequence length. Harmless when N is small, awkward when N is large — that\u0026rsquo;s where the cost of long sequences will eventually bite. Worth noticing now; future articles will come back to it. Each residual + re-injects the input of that half back onto the output. So each half is computing a delta, not a replacement. That\u0026rsquo;s why we can stack many blocks without the signal collapsing. Part III — Using the model to generate 8. One forward gives you one token The model in §1 takes a sequence of length N and returns a probability distribution over what the next token should be. One token. 
Not a whole sentence, not even a phrase — a single next-token guess.\nBut we\u0026rsquo;re used to LLMs producing long replies. How does a one-token-at-a-time model produce a paragraph? Exactly how you\u0026rsquo;d guess: by running over and over and feeding its own output back in.\nConcretely:\nStart with the prompt — a sequence of length N. Run a forward pass on it. You get a distribution over what token N+1 should be. Sample from that distribution (or just take the most likely token, \u0026ldquo;argmax\u0026rdquo;). You now have a token at position N+1. Append it to the sequence. The sequence is now length N+1. Run another forward pass on the full (N+1)-long sequence. You get a distribution over token N+2. Sample. Append. Sequence is now length N+2. Repeat until either the model samples a special end-of-sequence token (it has been trained to emit one when it thinks the response is complete) or you hit a length cap you\u0026rsquo;ve imposed. A picture of the loop:\ngeneration loop: sample, append, repeat prompt length N forward on length N sample token N+1 from last-row softmax sequence length N+1 feed the appended sequence back in sequence length N+1 forward on length N+1 sample token N+2 sequence length N+2 ⋮\nuntil model emits end-of-sequence or until a length cap is hit That is the whole generation procedure, mathematically. Every output token from any LLM-based system you\u0026rsquo;ve ever used was produced by a loop that looks like this.\n9. The first uncomfortable observation Walk through the cost of generating K new tokens from a prompt of length N.\nForward 1 runs on the prompt: length N. Forward 2 runs on prompt + 1 new token: length N+1. Forward 3: length N+2. … Forward K: length N+K−1. Every forward repeats almost everything the previous forward already did. 
The first N tokens of forward 2\u0026rsquo;s input are identical to forward 1\u0026rsquo;s input — the model nevertheless runs every block on every position from scratch, as if it had never seen them before.\nIf you total it up, the K forwards process N + (N+1) + … + (N+K−1) = K·N + K(K−1)/2 token positions, which is bounded by (N + K)² / 2 and grows quadratically with the eventual sequence length. And most of that work is recomputing things that haven\u0026rsquo;t changed. A new token added at the end of the sequence doesn\u0026rsquo;t change any of the earlier tokens\u0026rsquo; representations. The earlier tokens are still the same prompt and the same few sampled tokens that came before this one. Nothing about them needs to be redone.\nSo an obvious question hangs in the air: is all that recomputation actually necessary? Clearly not. But avoiding it isn\u0026rsquo;t free either — it means we\u0026rsquo;d have to keep some intermediate state around between forwards. Which raises its own questions: what state, exactly? Where do we put it? How big does it get? How does it grow as the conversation grows?\nThat kind of question is exactly what this series picks at later on.\n10. The map of questions Two themes run through the rest of the series, and most practical questions about running an LLM fall into one or the other.\nTheme 1 — Making one forward pass fit. A single forward through this stack can be too big in several ways at once: too big to fit on one GPU, too long to compute in reasonable time, too memory-hungry inside attention. The articles in this theme are about splitting work spatially so a forward can land on the hardware you have.\nThe model itself can be huge. Stack enough blocks (large L) at a wide enough d and the weights alone won\u0026rsquo;t fit on a single GPU. How do we split one forward across multiple GPUs? (Articles 02 and 03 — leveraging exactly the head-independence we set up in §5.) The prompt itself can be huge. §7\u0026rsquo;s [h × N × N] score matrix scales with the square of the sequence length. 
For a long prompt that either runs out of memory or pins the GPU for too long. Can we process the prompt in pieces, or compute attention more cleverly? Theme 2 — Making the loop fast. Each forward gives one token, and §9 already spotted the biggest cost: the naive loop redoes most of its work. The articles in this theme are about not redoing things, sharing forwards across users, and scheduling who runs when.\nDon\u0026rsquo;t redo work. §6 set up the property: an old token\u0026rsquo;s representation, once computed, never changes. So we should be able to save it and reuse it on the next forward instead of recomputing. That state has to live somewhere — where, how big does it get, and how does it grow as the conversation grows? Many users at once. Real serving engines run many concurrent prompts of different lengths, finishing at different times. How do they share one forward without padding waste, and how does the scheduler keep everyone moving when some are on token 1 and others are on token 1000? (Article 04 begins this thread.) Prompt-processing and one-more-token feel nothing alike. §9\u0026rsquo;s per-call cost shape is very different depending on whether you\u0026rsquo;re processing a long input from scratch or appending one extra output token. Their bottlenecks live in different parts of the GPU. Maybe the engine should treat them as different workloads — or even split them across different machines. The model in §1–§7 is what both themes are about; the loop in §8 is what they\u0026rsquo;re trying to make work at scale. The rest of the series picks them off, one question at a time.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/01-llm-end-to-end/","summary":"Three zoom levels — the model end-to-end, one transformer block opened up, and the loop that turns a prompt into output. Just enough to ask the right questions about everything that comes after.","title":"An LLM, End to End"},{"content":"This isn\u0026rsquo;t a tutorial. 
It\u0026rsquo;s a walk through the mental model — the kind where each section makes you go \u0026ldquo;oh, that\u0026rsquo;s all it is?\u0026rdquo; By the end, tensor parallelism shouldn\u0026rsquo;t feel like an engineering trick. It should feel like the only two reasonable things you could possibly do.\nNo matrix-math notation. Just shapes and stories.\n1. The only picture you need of an input Forget tokens-as-words for a second. To a model, a token is just a row of numbers — d of them. A \u0026ldquo;feature vector\u0026rdquo; if you want to be fancy.\nA whole sentence (or batch) is just a stack of those rows:\nToken 1 → [ f1 f2 f3 ... fd ] Token 2 → [ f1 f2 f3 ... fd ] Token 3 → [ f1 f2 f3 ... fd ] ... Token n → [ f1 f2 f3 ... fd ] That\u0026rsquo;s it. n tokens, each living in d-dimensional space. Hold that picture — everything else builds on it.\n2. Where this matrix actually shows up Before we play with weight matrices abstractly, let\u0026rsquo;s anchor on a concrete moment in real LLM serving — so the shapes feel like something, not nothing.\nWhen an LLM handles your prompt, the first big phase is prefill: shove all n prompt tokens through the network in one shot. (The token-by-token decoding comes after.) And the very first computation inside prefill is the QKV projection in attention — each token (length d) gets turned into a query, key, and value vector (each length k).\nStack the tokens as the n × d table from section 1, and the whole QKV step (just the Q part shown here) is one matrix multiply:\n[ n × d ] @ [ d × k ] = [ n × k ] tokens weight per-token matrix query vectors That\u0026rsquo;s the shape. Now sit with the question for a beat:\nWhat is this matmul actually doing? What does it mean to multiply an n × d table of tokens by a d × k weight matrix?\n\u0026ldquo;We computed the queries\u0026rdquo; is the boring answer. 
The interesting question is what\u0026rsquo;s happening inside that d × k matrix — and there are two very different stories you can tell. Each one quietly hands you a different way to split the work across GPUs.\n3. The same weight matrix, told two different ways A linear layer takes a token (length d) and produces something of length k. The \u0026ldquo;thing\u0026rdquo; doing this is a weight matrix of shape d × k.\nQuick aside that matters a lot. When I say \u0026ldquo;linear layer,\u0026rdquo; I don\u0026rsquo;t mean one specific block in the network. I mean every matmul in a transformer:\nthe Q, K, V projections in attention — each is a d × k matrix turning a token into a query/key/value vector the attention output projection the FFN up-projection (d → 4d) and FFN down-projection (4d → d) even the unembedding at the end They\u0026rsquo;re all the same shape of operation: token in, matrix multiply, token out. So the two-views story below — and the two parallelism strategies that fall out of it — apply to all of them. Once you see it for one, you\u0026rsquo;ve seen it for the whole transformer.\nHere\u0026rsquo;s the fun part: a d × k matrix can be read two ways — column by column or row by row. Same numbers, same multiplication, but two completely different mental scenes. We\u0026rsquo;ll walk through both.\nStory A — read it column by column (a row of fxes) Stop seeing the weight matrix as a grid of numbers. Zoom out. Each column is a self-contained little function — give it a token, it returns one number. We\u0026rsquo;ll call each one fx (short for feature extractor) and just draw the whole weight matrix as a row of k of them:\nweight = [ fx1 fx2 fx3 ... fxk ] That\u0026rsquo;s the whole matrix. Not numbers — fxes. Each one is its own opaque thing.\nHow does each fx turn a token into a number? It happens to be an inner product with that column\u0026rsquo;s d weights. But honestly — for building intuition, you don\u0026rsquo;t care. 
It\u0026rsquo;s just \u0026ldquo;fxi looks at the token and reports a score.\u0026rdquo;\nNow applying this layer to a token is just: send the token down the row, collect what comes out the bottom.\ntoken ⇒ [ fx1 fx2 ... fxk ] ↓ ↓ ↓ [ fx1(token), fx2(token), ..., fxk(token) ] A token walks past k little extractors, each shouts a number, you collect the numbers. Output has length k. Done.\nStory B — read it row by row (a stack of basis vectors) Now lay the same matrix flat. There are d rows, each of length k:\nRow 1 → [ r1 r2 r3 ... rk ] Row 2 → [ r1 r2 r3 ... rk ] ... Row d → [ r1 r2 r3 ... rk ] Each row is a basis vector living in the output space (length k). And the token\u0026rsquo;s d features are the coefficients that say how much of each row to mix in.\noutput = f1 · Row1 + f2 · Row2 + ... + fd · Rowd The layer\u0026rsquo;s job, told this way: take the d features of a token, use them as a recipe, and linearly combine the d row-vectors into one output vector.\nThe \u0026ldquo;wait, what?\u0026rdquo; moment Both stories describe the exact same multiplication. Same numbers in, same numbers out. But your brain holds two different scenes:\nStory A (columns) Story B (rows) many independent fxes one big linear combination \u0026ldquo;extract k features from the token\u0026rdquo; \u0026ldquo;mix d row-vectors into the output\u0026rdquo; output is collected output is summed This duality isn\u0026rsquo;t a curiosity — it\u0026rsquo;s the seed of tensor parallelism. The two ways you can read a matrix are the two ways you can split it across GPUs.\n4. Now there are two GPUs. What\u0026rsquo;s the obvious thing to do? You have one matrix and two GPUs. You stare at the matrix. There are really only two natural lines you could draw on it: a vertical one or a horizontal one.\nSection 3 just told you what each of those lines means.\n5. Strategy A — Split the fxes (Column Parallel) Take Story A seriously. The weight matrix is just a row of k black-box fxes. 
Cutting it across two GPUs is — literally — drawing one vertical line through that row:\nweight = [ fx1 ... fx(k/2) ‖ fx(k/2+1) ... fxk ] ↑ ↑ └──── GPU 1 ───┘ └──── GPU 2 ───┘ Each GPU sees the full token. It just runs its half of the fxes.\nGPU 1 → [ fx1(token), ..., fx(k/2)(token) ] GPU 2 → [ fx(k/2+1)(token), ..., fxk(token) ] To assemble the final output, glue them side by side:\noutput = [ GPU1\u0026#39;s half | GPU2\u0026#39;s half ] That\u0026rsquo;s it. No summing, no synchronizing in the middle. Each GPU runs different fxes on the same input, and the answers just live next to each other.\nCost: cheap. Concatenation is basically free.\n6. Strategy B — Split the rows (Row Parallel) Take Story B seriously. The weight matrix is a stack of d basis-vector rows. Cutting it across two GPUs is — literally — drawing one horizontal line through that stack:\nweight = [ Row 1 ] ┐ [ Row 2 ] │ GPU 1 (paired with features 1..d/2) [ ... ] │ [ Row(d/2) ] ┘ ───────────────────────── [ Row(d/2+1) ] ┐ [ ... ] │ GPU 2 (paired with features d/2+1..d) [ Row d ] ┘ And here\u0026rsquo;s the catch: each row is multiplied by its matching token feature (Row i pairs with f_i). So splitting the rows automatically splits the input too — GPU 1 only ever needs f_1..f_(d/2), GPU 2 only ever needs the rest.\nEach GPU sees only half the token. It produces a partial output — a length-k vector that\u0026rsquo;s only part of the sum.\nGPU 1 → partial output (its rows, weighted by its features) GPU 2 → partial output (its rows, weighted by its features) To assemble the final output, add them up:\noutput = GPU1\u0026#39;s partial + GPU2\u0026#39;s partial This time you can\u0026rsquo;t just concatenate — both GPUs produced length-k vectors that need to be summed element-wise. That sum has to happen across the network. (This is the \u0026ldquo;all-reduce\u0026rdquo; you\u0026rsquo;ll see in TP papers.)\nCost: more expensive. Every forward pass through this layer pays a cross-GPU sum.\n7. 
The two strategies, side by side Split columns (A) Split rows (B) Story it lives in \u0026ldquo;row of fxes\u0026rdquo; \u0026ldquo;weighted combination of rows\u0026rdquo; What each GPU holds some of the fxes some of the rows + matching features Each GPU sees\u0026hellip; the full input part of the input How outputs combine concatenate sum (all-reduce) Communication cheap expensive Same matrix. Two stories. Two ways to cut it. That\u0026rsquo;s the whole game.\nThat\u0026rsquo;s where this article stops. Two strategies, applied to one matrix in isolation. The next article picks them up and walks them through a full attention block — Megatron\u0026rsquo;s recipe for interleaving column-parallel and row-parallel TP across the matmuls of a real layer. Same two cuts, woven on purpose.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/02-tensor-parallelism-mental-model/","summary":"Two ways to read a weight matrix, two ways to split it across GPUs. A mental model for tensor parallelism, derived from one matmul in a transformer\u0026rsquo;s prefill phase.","title":"Tensor Parallelism, Built From Scratch in Your Head"},{"content":"Article 02 left you with two ways to split one matmul across two GPUs. 
They\u0026rsquo;re easier to keep straight by what they do than by what they\u0026rsquo;re called in the literature, so let\u0026rsquo;s lay them out side by side:\nStrategy A — split the fxes Strategy B — split the rows What you slice the matrix\u0026rsquo;s columns (each column is one fx) the matrix\u0026rsquo;s rows (each row is a basis vector) Each GPU\u0026rsquo;s input each token\u0026rsquo;s full input vector half of each token\u0026rsquo;s input features Each GPU\u0026rsquo;s output half of each token\u0026rsquo;s output features a partial sum of each token\u0026rsquo;s full output How outputs combine concatenate (free) all-reduce (one comm step) Also known as column-parallel row-parallel The compact way to read each column: A = \u0026ldquo;full in, half out.\u0026rdquo; B = \u0026ldquo;half in, sum out.\u0026rdquo; That\u0026rsquo;s enough mental model for everything below.\nA real transformer block isn\u0026rsquo;t one matmul — it\u0026rsquo;s four, plus some pointwise glue. So the natural next question is: how do we cut a whole block across two GPUs?\nThere\u0026rsquo;s an obvious first move that almost works. We\u0026rsquo;ll build it, see exactly where it breaks, and let the fix walk us into the canonical Megatron pattern. To keep everything concrete, we\u0026rsquo;ll fix small numbers and watch the shape on each GPU change at every step.\n1. The setup: small numbers you can hold in your head Two GPUs, call them G1 and G2. A tiny batch and a small model:\nvalue batch n 4 tokens model dim d 512 heads h 8 per-head dim d_head 64 attention dim k = h · d_head 512 FFN hidden 4d = 2048 Each token is a row of 512 numbers. 
The batch is [n × d] = [4 × 512].\nA transformer block, drawn flat:\n← input: [4 × 512] │ LayerNorm │ QKV projection d → 3k ← matmul weight [d × 3k] = [512 × 1536] │ attention (mixes Qs with Ks; no new weight matmul) │ output projection k → d ← matmul weight [k × d] = [512 × 512] │ + residual │ LayerNorm │ FFN up-projection d → 4d ← matmul weight [d × 4d] = [512 × 2048] │ activation (GeLU) (pointwise) │ FFN down-projection 4d → d ← matmul weight [4d × d] = [2048 × 512] │ + residual │ Four matmuls and some glue.\nSide note — the pointwise glue and why both GPUs do it. LayerNorms, the activation, and the residual adds are all pointwise. They don\u0026rsquo;t care how data is laid out across GPUs as long as each GPU has whatever it needs locally to compute its piece. In TP we make the simple choice: when data is sitting full on both GPUs, both GPUs just run the pointwise op on their own copy. Same input, same output, redundant compute. Why not have one GPU compute it and broadcast the result? Because comm is the bottleneck, not compute. A pointwise op over a few thousand numbers costs essentially nothing on a GPU; sending data across GPUs costs real latency and bandwidth. Doing the same cheap arithmetic twice is the better trade. Keep this in your back pocket — it\u0026rsquo;s why you\u0026rsquo;ll see \u0026ldquo;redundant\u0026rdquo; appear in the trace tables below for every LN and residual step.\nSo the whole TP story for this block lives at those four matmuls. Two GPUs, four cuts to make. Let\u0026rsquo;s play.\n2. v1 — apply Strategy A (full → half) to every matmul What\u0026rsquo;s the obvious first move? From Article 02, Strategy A was:\nthe cheap cut (concatenate, no all-reduce inside the matmul); on QKV it happens to land exactly on head boundaries — k = 8 · 64 = 512, split into 256 per GPU = 4 heads each; and \u0026ldquo;full input in, half output out\u0026rdquo; is the easier story to picture. So apply A to all four matmuls. 
Walk through the block one step at a time, watching what each GPU holds — its weight shard, input, and output at every step.\nStepGPU 1GPU 2 input [4×512] full [4×512] full LayerNorm (redundant) in [4×512] → out [4×512] in [4×512] → out [4×512] QKV proj (A) W [512×768] (heads 1–4)\nin [4×512] → out [4×768]\n= Q+K+V for heads 1–4, each [4×256] W [512×768] (heads 5–8)\nin [4×512] → out [4×768]\n= Q+K+V for heads 5–8, each [4×256] attention heads 1–4\nin [4×768] → out [4×256] heads 5–8\nin [4×768] → out [4×256] ★ GATHER #1 — output proj needs full k=512, each GPU only holds 256 → [4×512] on both output proj (A) W [512×256]\nin [4×512] → out [4×256] W [512×256]\nin [4×512] → out [4×256] ★ GATHER #2 — residual needs full d=512, output is half d=256 → [4×512] on both + residual [4×512] → [4×512] [4×512] → [4×512] LayerNorm (redundant) in [4×512] → out [4×512] in [4×512] → out [4×512] FFN-up (A) W [512×1024]\nin [4×512] → out [4×1024] W [512×1024]\nin [4×512] → out [4×1024] activation (pointwise) [4×1024] → [4×1024] [4×1024] → [4×1024] ★ GATHER #3 — FFN-down needs full 4d=2048, each GPU only holds 1024 → [4×2048] on both FFN-down (A) W [2048×256]\nin [4×2048] → out [4×256] W [2048×256]\nin [4×2048] → out [4×256] ★ GATHER #4 — residual needs full d=512, output is half d=256 → [4×512] on both + residual [4×512] → [4×512] [4×512] → [4×512] Four cross-GPU gathers per block.\nTwo of them happen because the next A-style matmul demands a full input. The other two happen because the residual add expects a full vector and we just produced a half one. Same root cause: Strategy A produces a half output, and almost everything downstream wants a full input.\n3. The cost of v1 Cross-GPU comm is the slow thing in distributed compute. The whole point of TP design is to do as few of these as possible. v1 has us paying for a gather in front of nearly every operation that needs full features.\nFor a 32-block model that\u0026rsquo;s ~130 cross-GPU comms per forward pass. 
Way too many.\nSo the question becomes:\nCan we avoid the gather?\nEach gather only exists because the next op needed a full vector and Strategy A had just produced a half one. What we actually need is a matmul that\u0026rsquo;s happy consuming the half output directly.\nArticle 02 already handed us one.\n4. v2 — pair Strategy A with Strategy B (half → sum) Look at the two strategies through one specific lens:\nStrategy A outputs a half. Strategy B inputs a half. Same shape. A\u0026rsquo;s output is exactly what B wants as input. They snap together with no comm between them.\nSo replace v1\u0026rsquo;s \u0026ldquo;A → gather → A\u0026rdquo; with \u0026ldquo;A → B.\u0026rdquo; B eats the half output directly. The only comm cost shows up at the end of B — the all-reduce that turns the partial sum into the full output the residual + LN want.\nApply this to the block — pair every A matmul with a B matmul:\nStepGPU 1GPU 2 input [4×512] full [4×512] full LayerNorm (redundant) in [4×512] → out [4×512] in [4×512] → out [4×512] QKV proj (A) W [512×768] (heads 1–4)\nin [4×512] → out [4×768]\n= Q+K+V for heads 1–4, each [4×256] W [512×768] (heads 5–8)\nin [4×512] → out [4×768]\n= Q+K+V for heads 5–8, each [4×256] attention heads 1–4\nin [4×768] → out [4×256] heads 5–8\nin [4×768] → out [4×256] output proj (B) W [256×512]\nin [4×256] → out [4×512] (partial sum) W [256×512]\nin [4×256] → out [4×512] (partial sum) ★ ALL-REDUCE #1 — sum the two partial [4×512] halves into the full [4×512] on both GPUs (residual + LN need it) + residual [4×512] → [4×512] [4×512] → [4×512] LayerNorm (redundant) in [4×512] → out [4×512] in [4×512] → out [4×512] FFN-up (A) W [512×1024]\nin [4×512] → out [4×1024] W [512×1024]\nin [4×512] → out [4×1024] activation (pointwise) [4×1024] → [4×1024] [4×1024] → [4×1024] FFN-down (B) W [1024×512]\nin [4×1024] → out [4×512] (partial sum) W [1024×512]\nin [4×1024] → out [4×512] (partial sum) ★ ALL-REDUCE #2 — sum the two partial [4×512] halves into the full 
[4×512] on both GPUs (residual + LN need it) + residual [4×512] → [4×512] [4×512] → [4×512] Two all-reduces per block.\nThat\u0026rsquo;s the Megatron pattern. We didn\u0026rsquo;t have to be told it — we walked into it.\n5. The duality you didn\u0026rsquo;t see coming Article 02 introduced A and B as if they were two separate strategies — two ways to read one matrix. Put them side by side and look at what flows in and out of each:\nA takes a full input and produces a half output. B takes a half input and produces a full sum as output. They\u0026rsquo;re not two strategies. They\u0026rsquo;re two halves of one round-trip. A\u0026rsquo;s output shape is B\u0026rsquo;s input shape. B\u0026rsquo;s output shape (after the all-reduce) is A\u0026rsquo;s input shape. You couldn\u0026rsquo;t have invented A without secretly inventing B as its return half.\nNow look at what the block actually does:\nAttention has a widen (QKV projection: d → k) followed by a narrow (output projection: k → d). FFN has a widen (d → 4d) followed by a narrow (4d → d). A widening matmul is exactly where A makes sense — there are lots of output features to spread across GPUs. A narrowing matmul is exactly where B makes sense — there are lots of input features to spread across GPUs, and the small output is something you sum back up.\nThe block isn\u0026rsquo;t accidentally A→B-friendly. It\u0026rsquo;s structurally A→B-friendly: two widen-narrow pairs glued together by pointwise things. The \u0026ldquo;Megatron pattern\u0026rdquo; isn\u0026rsquo;t really an algorithm someone designed. It\u0026rsquo;s the only comm pattern that respects what the architecture was already doing. The duality of A and B and the widen-narrow rhythm of the block are the same fact told twice.\nA quick word on cost: a gather and an all-reduce move similar amounts of data per GPU (an all-reduce is roughly a reduce-scatter followed by an all-gather under the hood). 
v1 had 4 gathers per block; v2 has 2 all-reduces — half as many sync points, with no change to the model itself.\n6. Why the cut has to land on a head boundary The v2 trace quietly assumed something: that QKV\u0026rsquo;s column cut splits k = 512 into two slabs of 256 along the head boundary, so each GPU owns 4 whole heads. That assumption is doing more work than it looks like. Try the counterfactual.\nImagine single-head attention — same k = 512, but one head, no head structure. Apply Strategy A on QKV exactly as before: each GPU gets Q, K, V each of shape [4 × 256]. Now run attention.\nThe first step is Q Kᵀ. Each GPU computes Q_half @ K_halfᵀ, producing a [4 × 4] matrix — but that matrix is a partial sum over the 256 features each GPU happens to hold. The true scores are the sum of both GPUs\u0026rsquo; partials.\nHere\u0026rsquo;s the problem: the next step is softmax. Softmax is non-linear, so you can\u0026rsquo;t apply it locally and reconcile after — softmax(a) + softmax(b) ≠ softmax(a + b). The reduction has to happen before softmax. Which means an extra sync sitting right in the middle of attention:\n★ ALL-REDUCE on the [n × n] scores, before softmax.\nThat\u0026rsquo;s a third all-reduce per block, on top of v2\u0026rsquo;s two. The Megatron pattern degrades to three sync points, and the new one is on a tensor that scales with sequence length squared — exactly the comm you most want to avoid.\nThe fix is structural, not algorithmic: don\u0026rsquo;t let the cut cross a head. Each head\u0026rsquo;s Q Kᵀ must live entirely on one GPU, so the partial-sum problem never arises. Multi-head attention gives that to us for free — heads are independent by construction, head boundaries are natural cut points, and the column split on k = h · d_head lands exactly between them whenever h divides evenly across GPUs.\nSo multi-head isn\u0026rsquo;t a happy coincidence the systems people exploited. It\u0026rsquo;s the structural prerequisite for v2 to exist at all. 
Pick any cut that lands inside a head, and softmax forces a sync that ruins everything. Pick a cut that lands between heads, and the non-linearity stays local. The Megatron pattern doesn\u0026rsquo;t just happen to work on multi-head architectures — it requires them.\n7. What this opens You now have one block running on two GPUs with two all-reduces per pass. That earns the next round of \u0026ldquo;wait, but what about\u0026hellip;\u0026rdquo; questions:\nWhat if I have many blocks and many GPUs? TP cuts within a block. The cut across blocks — staging entire blocks on different GPUs and pipelining microbatches through them — is a different beast. Pipeline parallelism, next article. What if FFN is replaced with experts? The column-then-row pattern still applies to each expert\u0026rsquo;s matmuls, but routing tokens to the right expert introduces a new kind of comm. MoE, soon. What if the batch\u0026rsquo;s sequence lengths are wildly different? The comm pattern is unchanged, but the attention math has to deal with variable-length sequences — and that\u0026rsquo;s where continuous batching enters. Same grammar. Each one is its own walk-through.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/03-tp-through-a-full-block/","summary":"Walk article 02\u0026rsquo;s two cuts through a full transformer block, with concrete shapes on each GPU at every step. Apply one cut to every matmul first — comm explodes (four gathers per block). Then pair the two cuts as duals and watch them snap into the architecture\u0026rsquo;s widen-narrow rhythm, landing at two all-reduces per block.","title":"Walking Tensor Parallelism Through a Full Block"},{"content":"Article 03 left us with one transformer block running on two GPUs in two all-reduces per layer. But a real serving system has many users hitting the model concurrently — and their prompts are all different lengths. 
A 50-token \u0026ldquo;what time is it\u0026rdquo; sits next to a 5,000-token essay draft.\nTwo questions to chase through this article:\nHow do we batch variable-length requests through one forward pass efficiently? The naive answer — pad everything to the longest prompt and run it as a fixed batch — wastes a lot of compute on the short ones. There has to be a smarter way. Does TP have to know any of this is happening? Or can the batching trick and the model-splitting story stay independent of each other? We\u0026rsquo;ll answer both by carrying Article 03\u0026rsquo;s setup forward and watching what each layer does when more than one request flows through it.\n1. Setup Same numbers as Article 03:\nvalue GPUs 2 (TP=2) layers 8 d (model dim) 512 h (heads) 8, four per GPU d_head 64 k = h · d_head 512 FFN hidden 4d = 2048 Two example requests for the running discussion: request A of length 10, request B of length 30.\nThree explicit assumptions we\u0026rsquo;ll hold this article to:\nPrefill only. We\u0026rsquo;re computing the forward pass over each request\u0026rsquo;s prompt. No token-by-token decoding yet — that\u0026rsquo;s Article 05. Each request fits in one batch. A batch holds ≥1 whole requests, never a fraction. Article 06 will relax this with chunked prefill. No KV cache yet. The KV cache is what lets a later token attend back to earlier ones during decode. In a prefill-only world we just compute outputs and ship them; there\u0026rsquo;s nothing to cache for later. KV cache enters with Article 05. These keep the spatial story clean. The temporal story (continuous batching across iterations) is its own article.\n2. One request first: N is just a tensor dimension Before two requests, recall what one request looks like under the v2 pattern from Article 03. A single prefill of length N flows through the block as [N × 512]. From the trace there:\n8 layers × 2 all-reduces per layer = 16 all-reduces per forward pass. 
Every all-reduce moves a [N × 512] tensor across GPUs. The thing worth pausing on: N only appears in tensor shapes, never in comm step counts. Whether N=10 or N=10,000, you do exactly 16 all-reduces. They just carry more or fewer bytes per step.\nSo adding more tokens to one request is \u0026ldquo;free\u0026rdquo; comm-wise — the bytes moved scale linearly with token count, but you\u0026rsquo;re not paying for additional sync events.\nThat\u0026rsquo;s a nice property. The next question is whether it survives when the extra tokens come from different requests.\n3. The naive answer and the smarter idea Naive: pad to max length. Stack A and B as a [2 × 30 × 512] batch. Request A gets 20 padding tokens that the model still computes against. Linear-layer waste is mild (60 rows instead of 40, so the matmul is 1.5× bigger). Attention waste is severe — each request\u0026rsquo;s attention is O(L²) work, so A\u0026rsquo;s attention does 30² = 900 operations per head per layer instead of the 10² = 100 it actually needs. 9× too much work for A alone, and the padded tokens contribute nothing to the output you care about.\nSmarter: flatten. Concatenate A and B\u0026rsquo;s tokens into one tensor of shape [(10+30) × 512] = [40 × 512]. No padding, no batch dimension — just a flat stream of tokens.\nThe question this raises: does every step of the forward pass do the right thing on a flattened tensor of mixed-request tokens? Some steps clearly will. Some will need thought. Let\u0026rsquo;s walk the whole block and see.\n4. The whole block, step by step Start at the input [40 × 512] and trace every step of one block. For each one, ask: does it compute the right answer when its input contains tokens from multiple requests?\nStep What it does On [40 × 512]? 
LayerNorm normalizes each row independently ✓ trivially fine QKV proj (linear) matmul against shared W needs analysis Attention sequence-mixing per request needs analysis Output proj (linear) matmul against shared W needs analysis Residual add per-row sum ✓ trivially fine LayerNorm normalizes each row independently ✓ trivially fine FFN-up (linear) matmul against shared W needs analysis Activation (GeLU) per-element non-linearity ✓ trivially fine FFN-down (linear) matmul against shared W needs analysis Residual add per-row sum ✓ trivially fine Half the steps check off immediately. Pointwise operations — LayerNorm, GeLU, residual adds — process each row independently. Whether row i belongs to request A or request B is invisible to them. They\u0026rsquo;re per-token and they don\u0026rsquo;t mix. So they batch for free.\nThat leaves five steps that need closer examination: four linear matmuls (QKV proj, output proj, FFN-up, FFN-down) and one attention block.\nBut the convenience is: all four linear matmuls have the same structure — Y = X @ W where W is shared across all rows of X. So once we understand how one linear behaves under flattened batching, all four follow. And there\u0026rsquo;s only one attention block per layer.\nThe whole batching problem reduces to two questions:\nDoes a linear layer compute the right answer on [40 × 512]? Does attention compute the right answer on [40 × 512]? §5 takes the linear question. §6 takes the attention question. Once those two are settled, the whole block is settled.\n5. Linear layers: the easy half Go back to how Article 02 framed a linear layer. The weight matrix is a row of little feature extractors — each fx is its own opaque function that takes one token\u0026rsquo;s d-wide feature vector and returns one number. A linear layer with k outputs is just k of those extractors running side by side on the same token.\ntoken ⇒ [ fx1 fx2 fx3 ... 
fxk ] ⇒ [ fx1(token), fx2(token), ..., fxk(token) ] The thing worth pausing on: every fx looks at one token and returns one number. It doesn\u0026rsquo;t peek at the next token. It doesn\u0026rsquo;t peek at the previous token. It has no concept of what conversation the token came from. There is no place in the math where request boundaries could enter, because the math only ever sees one token at a time.\nSo when we hand the layer a flat tensor [40 × 512] — 40 tokens stacked — it just runs every fx on every token. 40 tokens, k extractors each, fills out a [40 × k] output. The fact that the first 10 rows are request A and the last 30 are request B is invisible to the operation; we never even had a chance to mix them.\nThat\u0026rsquo;s the entire reason linear layers batch trivially. They\u0026rsquo;re not \u0026ldquo;magically batchable\u0026rdquo; — they were already per-token. We\u0026rsquo;re just running more of them.\nUnder TP=2: unchanged from Article 03. The fxes are still split across GPUs, with each GPU owning half:\nG1 runs heads 1–4\u0026rsquo;s fxes on [40 × 512] → [40 × 768] G2 runs heads 5–8\u0026rsquo;s fxes on [40 × 512] → [40 × 768] The all-reduce shape grew from [N × 512] to [40 × 512], but the count of all-reduces is unchanged. Same comm pattern, more bytes per step.\nAnd since the same argument applies to all four linear matmuls in the block — QKV, output, FFN-up, FFN-down — all the linears are now solved. One step left.\n6. Attention: the hard half Why is attention different? Because attention is sequence-mixing. Each token\u0026rsquo;s output depends on all tokens in its sequence, not just its own row:\nout[i, :] = softmax( Q[i, :] @ K.T / √d_head ) @ V That K.T and V reach across the whole sequence. If K and V come from a tensor that contains tokens from both A and B, then by default token i in A would attend to tokens of B — and vice versa. 
The math would technically run, but the answer would be wrong: A\u0026rsquo;s output would be mixed with B\u0026rsquo;s keys and values, which is not what the model was trained to produce.\nSo we need a way to keep request A\u0026rsquo;s attention strictly within A\u0026rsquo;s tokens, and B\u0026rsquo;s strictly within B\u0026rsquo;s, while still sharing the underlying flat tensor.\n6.1 The naive approach — compute, then mask The most direct fix: compute the full [40 × 40] attention matrix as if all 40 tokens were one sequence, then mask out the cross-request entries (set them to -∞ before softmax so they contribute nothing).\nThe flat token buffer looks like this:\nflat token tensor: [40 tokens × 512] each row is one token of d=512 features request A 10 tokens request B 30 tokens row 0 row 10 row 40 cu_seqlens = [0, 10, 40] And the full attention matrix, with cross-request blocks masked:\nnaive: compute full 40×40, mask cross-request blocks keys 0..9 keys 10..39 queries 0..9 queries 10..39 A → A 10 × 10 masked to −∞ 10 × 30 masked 30 × 10 B → B 30 × 30 computed: 1600 useful: 1000 wasted: 600 This works, but it\u0026rsquo;s wasteful. The off-diagonal blocks — 10 × 30 and 30 × 10, totaling 600 entries — are computed and immediately discarded. With more concurrent requests it gets worse: with R requests of equal length L, you compute (RL)² but only need R · L². Cross-request work scales as R² while useful work scales only as R. Untenable for serving systems where R can easily reach into the hundreds.\n6.2 The varlen idea — skip, don\u0026rsquo;t mask Instead of computing-then-masking, compute only the diagonal blocks. Loop over requests, and for each one run normal attention on its slice of the flat buffer:\nvarlen: compute only the diagonal blocks keys 0..9 keys 10..39 queries 0..9 queries 10..39 A → A 10 × 10 B → B 30 × 30 (not computed) (not computed) computed: 1000 — no waste This is the variable-length attention kernel — varlen for short. 
It takes the flat tensor plus an array of request boundaries (cu_seqlens, the cumulative sequence lengths) and walks request-by-request:\n# cu_seqlens = [0, 10, 40] # request A spans [0,10), B spans [10,40) for i in range(num_requests): s, e = cu_seqlens[i], cu_seqlens[i+1] Q_i = Q[s:e] K_i = K[s:e] V_i = V[s:e] scores_i = (Q_i @ K_i.T) / sqrt(d_head) # L_i × L_i probs_i = softmax(scores_i + causal_mask_i) out[s:e] = probs_i @ V_i # write back into flat buffer Visualizing the loop walking down the flat Q, K, V stacks:\nvarlen walks the flat Q, K, V stacks request-by-request Q K V A [0:10] A [0:10] A [0:10] B [10:40] B [10:40] B [10:40] i = 0 read slice [0:10] i = 1 read slice [10:40] step 1 — compute scores = Q_A @ K_A.T probs = softmax(scores) out[0:10] = probs @ V_A step 2 — compute scores = Q_B @ K_B.T probs = softmax(scores) out[10:40] = probs @ V_B flat Q, K, V tensors in HBM — varlen kernel slices [s:e] for each request, top to bottom Three things to notice:\nThe cross-request blocks aren\u0026rsquo;t masked — they\u0026rsquo;re never computed. The kernel skips them entirely. Each iteration\u0026rsquo;s score matrix is the right size for that request — [L_i × L_i], not [40 × 40]. So memory for scores stays small. The flat output buffer is filled in by writing each request\u0026rsquo;s attention output into its own slice. cu_seqlens is the only piece of metadata the kernel needs to know about requests. Everything else is just slicing the flat tensor.\n(In practice the kernel doesn\u0026rsquo;t run this loop in Python — it runs it inside the GPU in one launch, so we don\u0026rsquo;t pay a kernel-launch overhead per request. The mathematical content is identical to the loop above; the optimized kernel just expresses it more efficiently. We\u0026rsquo;ll come back to high-performance attention kernels in a later article.)\n6.3 Under TP=2 Each GPU still owns its 4 heads from Article 03. 
The varlen kernel runs on each GPU\u0026rsquo;s local Q, K, V — for those heads, for all requests\u0026rsquo; tokens. G1 doesn\u0026rsquo;t need to know what G2 is doing during attention; G2\u0026rsquo;s heads are G2\u0026rsquo;s problem. The per-head independence we relied on in Article 03 still holds inside this loop. No new comm.\nSo with linears (§5) and attention (§6) both handled, every step of the block is now correctly batched.\n7. Stepping back: TP didn\u0026rsquo;t have to change A \u0026ldquo;happy realization\u0026rdquo; worth pausing on. Look at what TP saw during the entire batched forward pass:\nA tensor of shape [tokens × hidden] flowing through the layers. Weights split along heads. All-reduces on [tokens × hidden] partial sums. 16 sync events per forward pass (2 per block × 8 layers), the same two-per-block pattern as in Article 03. TP never saw a request boundary. The flat tensor presented itself the same way to TP whether the 40 tokens came from one request or twenty. Request boundaries entered exactly one place — the cu_seqlens argument inside the varlen attention kernel — and that argument was used entirely on each GPU\u0026rsquo;s local slice. No comm event involved.\nSo request batching and TP turn out to be orthogonal axes that meet only inside the attention kernel:\nTP answers: how is the model split across GPUs? Request batching answers: how are tokens packed into one forward pass? Those questions don\u0026rsquo;t constrain each other. We didn\u0026rsquo;t design for this — it fell out of two facts that were already true:\nLinear layers are per-token (so they don\u0026rsquo;t see request boundaries even on one GPU). Multi-head attention\u0026rsquo;s heads are independent (so each GPU\u0026rsquo;s per-head varlen loop never has to talk to other GPUs). The Article 03 punchline was that multi-head attention was a gift the modelers left for the systems people building TP. 
Here we see the same gift extended one layer further: the same head independence that makes TP comm-free also makes request batching comm-free. Two unrelated tricks compose for free because they were both granted by the same architectural property.\n8. Cost intuitions A few honest words about where time actually goes once you flatten requests like this.\nLinear layers look great. One weight matrix is read from HBM, then amortized across all (N+M) tokens of the flat tensor. The more tokens you pack in, the closer the GPU runs to its compute peak. This is why aggressive prefill batching is a clear throughput win.\nAttention is more nuanced. Each request\u0026rsquo;s Q_i K_i.T is its own matmul, which means we can\u0026rsquo;t fuse one big GEMM across requests the way linears do. Modern varlen kernels run the request loop inside the GPU in one launch, so we don\u0026rsquo;t pay a launch overhead per request. But each request still gets its own attention work proportional to L_i², which means the bottleneck profile depends heavily on the distribution of request lengths.\nImagine two batches with the same total token count:\n1 request × 1,000 tokens — attention work is 1 × 1000² = 10⁶ per head per layer. The whole square is one block. Attention dominates the forward pass. 10 requests × 100 tokens each — attention work is 10 × 100² = 10⁵. Ten times less. Linears dominate. It\u0026rsquo;s the same picture as the varlen square from §6.2, just at the two extremes. Place all 1,000 tokens along one axis of the attention matrix and only the per-request diagonal blocks ever get computed — everything else is cross-request and skipped:\nsame 1,000 total tokens, very different attention work colored = per-request work that's actually computed; hatched = cross-request, skipped by varlen 1 request × 1,000 tokens 1 × 1000² = 10⁶ no cross-request waste — attention dominates 10 requests × 100 tokens 10 × 100² = 10⁵ most of the square is skipped; linears dominate Same outer square. 
Same total tokens. The colored fraction — what actually gets computed — drops by 10× as you split one long request into ten short ones. The L²-scaling of attention means long-context batches are attention-compute-bound while many-short batches are linear-dominated. The flat-tensor trick is the same in both regimes; the bottleneck shifts.\nThis is also the foreshadowing for decode: when each \u0026ldquo;request\u0026rdquo; generates one token at a time, per-request Q_i K_i.T becomes a 1 × d_head vector times a d_head × L_kv matrix. Arithmetic intensity drops to ~1, the per-request matmul stops being meaty, and the entire forward pass becomes bandwidth-bound on weight reads. That\u0026rsquo;s a fundamentally different optimization target — and it\u0026rsquo;s why decode lives in its own article.\n9. What this opens We now have a scheme for running many concurrent prefill requests through a TP-parallelized model: flatten tokens into one tensor, do one big matmul through every linear layer, do varlen attention through every attention block. The model\u0026rsquo;s TP comm pattern doesn\u0026rsquo;t change. The waste from naive padding is gone. Each request gets exactly the work it needs — no more, no less.\nThree follow-up questions earn the next round of articles:\nWhat if a request needs to generate many output tokens? Prefill is one shot per prompt. Decode adds a token-by-token phase with a very different bottleneck profile and a new structure (the KV cache) to remember earlier tokens. Article 05 — decode and continuous batching across iterations. What if one request is so long it doesn\u0026rsquo;t fit in a batch? Sometimes the assumption \u0026ldquo;every request fits\u0026rdquo; breaks. The fix is chunked prefill — process the prompt in slices, building up the KV cache as you go. Also Article 05. How does the varlen attention kernel actually run fast on a GPU? We used naive attention math throughout this article. 
The high-performance version (FlashAttention) avoids materializing the score matrix at all, using a tiled online-softmax recurrence. That\u0026rsquo;s a kernel-level deep-dive worth its own article, later in the series. Same grammar each time: pick one assumption from the current article, relax it, see what falls out.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/04-batching-many-requests/","summary":"Many users hit the model at once with different-length prompts. Walk through one transformer block on a flat multi-request tensor and see which layers batch for free and which need a real fix — and whether TP has to change.","title":"How to Batch Many Requests Through One Forward Pass"},{"content":"Article 04 left us with one forward pass that batches many prefills cleanly. But prefill is just the front half of a request\u0026rsquo;s life. Once the prompt is consumed, the request enters a decode phase — generating one token at a time, sometimes for hundreds of steps, until it lands on an EOS. A real serving engine doesn\u0026rsquo;t see neat prefill batches; it sees a turbulent mix of arriving prompts, ongoing decodes, and finishing requests, all sharing the same GPU at every moment.\nThis article steps into that mess. We\u0026rsquo;ll lean on Article 01 for the basic generation flow and the KV cache mechanics — assume both are familiar.\n(One name to fix in your head before we start, since the rest of the article leans on it: an iteration is one end-to-end forward pass through all L layers of the model. Whatever rows of input we feed in — a chunk of one prompt, decode steps for many requests, or a mix — an iteration runs them through layer 0, layer 1, all the way to layer L−1, once.)\nPicture a few seconds in the engine\u0026rsquo;s life. Dozens of requests in flight: some still munching prompts, some 50 tokens into decoding, some about to terminate at step 1000, some that just walked in. New requests arrive. Old requests finish. 
The scheduler\u0026rsquo;s job is to keep the GPU as full as it can while doing right by every one of them.\nTwo questions fall out:\nRequests arrive and finish at different times. How does the engine keep the GPU packed without stalling anyone at the start or end of their life? Per-iteration cost can swing 1000×. A forward of pure decodes runs in a handful of milliseconds; a forward that includes a 100 k-token prefill takes seconds. How do we keep iterations roughly uniform so the scheduler can plan? Both answers — iteration-level scheduling (ORCA) and chunked prefill — pull on the same insight: what we want to even out is the per-iteration cost, not the per-request cost. ORCA cleans up the arrival/finish boundaries; chunked prefill then bounds what any single iteration can carry.\n1. Where naive batching breaks down The simplest scheduler imaginable: pick B requests when slots open, run them through prefill and decode together, return everyone\u0026rsquo;s outputs when the last one is done, then pick the next batch. This is request-level batching — the batch is the scheduling unit, and a batch\u0026rsquo;s membership is fixed at admission.\nIt fights two facts about real traffic.\nRequests arrive at different times. A request that arrives 200 ms into a 5-second batch can\u0026rsquo;t join — the batch\u0026rsquo;s membership was set when it started. It sits in the queue until the entire current batch finishes. The GPU might have plenty of room for one more decoder, but the engine refuses to admit anyone. The arrival\u0026rsquo;s TTFT (time-to-first-token — the pause between hitting enter and seeing ChatGPT\u0026rsquo;s first word appear) inflates from a few milliseconds into seconds, just from waiting. This is the convoy effect: arrivals get queued behind the slowest member of whatever happens to be running.\nRequests finish at different times. Inside a batch, request A might want 50 output tokens and request B might want 1000. Both are decoded together. 
After step 50, A is done — but its slot can\u0026rsquo;t be reclaimed for someone else, because the batch\u0026rsquo;s shape is frozen until everyone\u0026rsquo;s finished. A\u0026rsquo;s compute slot sits idle for the next ~5 seconds of B\u0026rsquo;s continued decoding. Worse, A\u0026rsquo;s tokens — already produced and ready — can\u0026rsquo;t return to the user until the batch boundary either. This is the frozen batch size problem: short-lived requests pay the longest peer\u0026rsquo;s lifespan twice over, once in stalled return and once in wasted GPU.\nBoth failures come from one root cause: a static batch has one shared lifespan, set by max over its members. Anyone shorter than the max wastes; anyone arriving after the start waits.\nThe scheduler is glued to the wrong granularity. Reality moves at the granularity of iterations — every forward pass produces a token (or a chunk of prefill) for each in-flight request. The scheduler is making decisions at the granularity of batches — once per several thousand iterations. Of course it can\u0026rsquo;t keep up.\n2. ORCA: schedule per iteration, not per batch The fix from the ORCA paper is small to state and large in consequence: treat the iteration — one end-to-end forward through all L layers — as the scheduling unit. The set of in-flight requests becomes a living thing the scheduler curates between every forward, instead of a fixed roster set at admission.\nBetween iterations, the scheduler can:\nDrop any request that produced EOS in the last iteration. Its slot is free immediately. Add a new request from the queue. On its first iteration, it contributes its prompt rows for prefill. Carry mid-decode requests forward, each contributing exactly one Q row this iteration. All three operations are pure scheduler bookkeeping — no GPU work, just updates to per-request metadata. 
They run between iterations on the host, while the GPU is busy with the previous forward.\nWhat this means for one iteration\u0026rsquo;s contents is a step up in flexibility from Article 04. Article 04\u0026rsquo;s iterations were homogeneous — every request was prefilling, every request contributed prompt rows. Under ORCA, an iteration carries requests at different stages at the same time. Concretely:\nRequest State Q rows this iter kv_length A first-iter prefill, 4096-token prompt 4096 4096 B mid-decode, step 51 1 1500 C mid-decode, step 200 1 1700 Total Q rows in this iteration: 4096 + 1 + 1 = 4098. The varlen kernel walks the flat tensor request-by-request and computes three independent score blocks: A\u0026rsquo;s 4096 × 4096 (lower-triangular — A is prefilling its own tokens), B\u0026rsquo;s 1 × 1500, C\u0026rsquo;s 1 × 1700. Each request reads only its own KV cache — no cross-request bleed.\nTo support this mix, Article 04\u0026rsquo;s cu_seqlens (which only tracked Q-row boundaries) generalizes to one tuple per request:\n(q_start, q_end, kv_length) per request q_rows = q_end - q_start is what this request contributes to this iteration\u0026rsquo;s Q. kv_length is the request\u0026rsquo;s full attention context after this iteration\u0026rsquo;s K, V appends — which now includes any prior cache. The number of Q rows and the KV length are no longer forced to match — a decoder has q_rows = 1 and kv_length = 1500, a fresh prefill has both equal at 4096.\nThat\u0026rsquo;s the only kernel-side change. ORCA\u0026rsquo;s contribution wasn\u0026rsquo;t a new attention kernel — it was a scheduling discipline: don\u0026rsquo;t run a batch to completion, choose membership at every iteration. 
The kernel work was already in place from Article 04; what was missing was the policy of using it iteration-by-iteration.\nThis is what modern serving systems mean by continuous batching.\nWhat it gives us:\nConvoy effect dissolved — new arrivals join at the next iteration; the wait is one iteration (a few ms), not one batch (seconds). Frozen batch size dissolved — a slot freed at iteration t is filled at iteration t+1; a finished request returns its output as soon as its EOS is sampled, not at some far-off batch boundary. Both problems gone. A clean win. So clean, in fact, that it\u0026rsquo;s tempting to declare scheduling solved and move on. But there\u0026rsquo;s something we glossed over.\n3. The next problem: iterations themselves vary wildly ORCA fixed the boundary problems — arrival and finish — by making the iteration the scheduling unit. But making the iteration the scheduling unit also makes it the heartbeat of the engine. Every in-flight request, decoder or prefiller, gets one bit of work done per iteration. So if iteration t takes 6 ms and iteration t+1 takes 8 seconds, the gap between consecutive tokens for any in-flight decoder is 8 seconds. An iteration\u0026rsquo;s wall time is no longer a private detail of how the GPU spends its compute; it\u0026rsquo;s the latency floor for everyone in the engine that iteration.\nSo how variable is iteration wall time, really? Anchor on Llama-2-7B running on a single H100 and plug a few realistic iteration mixes through the cost model.\nFLOP and wall-time formulas used (click to expand) Llama-2-7B: multi-head attention, 32 layers, hidden 4096, head dim 128. Forward cost has two structurally different terms.\nLinears: forward cost per processed token-row is roughly 2P FLOPs where P ≈ 7×10⁹ — about 14 GFLOPs per token-row that flows through the model. Attention: per (q,k) pair across the whole network ≈ 4 · d_head · heads · layers = 4·128·32·32 ≈ 0.52 MFLOPs per pair. 
For a length-L prefill with causal mask: ~L²/2 pairs ≈ 2.6×10⁵ · L² FLOPs total. For a single decode step against a cache of size M: M pairs ≈ 0.52·M MFLOPs. H100 effective: ~500 TFLOPs/s fp16 for compute-bound work, ~3.35 TB/s HBM bandwidth for read-bound work. A decode step\u0026rsquo;s main cost is reading the weights once across the whole network (~14 GB at fp16), not the FLOPs themselves — so decode is bandwidth-bound at about 5–7 ms per step, set by bytes ÷ HBM.\nTake a baseline of 8 in-flight decoders at ~1 k context each. Vary what one new request brings into the same iteration:\nIteration mix (8 decodes + …) Linear Attn Total Wall time nothing else (decodes only) ~110 GF ~4 GF ~115 GF ~6 ms (bw-bound on weights) + 1 k-token prefill ~14 TF ~0.3 TF ~14 TF ~30 ms + 4 k-token prefill ~57 TF ~4 TF ~61 TF ~120 ms + 16 k-token prefill ~225 TF ~67 TF ~290 TF ~580 ms + 100 k-token prefill ~1.4 PF ~2.6 PF ~4 PF ~8 s Three patterns to notice:\nLinears scale linearly with the iteration\u0026rsquo;s total token count. Attention scales quadratically in any single request\u0026rsquo;s prefill length — negligible at small sizes, starts dominating around 100 k. Wall-time swing across iterations the scheduler might legitimately assemble is roughly 1300×. That last number is what breaks the engine. To feel what it means: imagine you\u0026rsquo;re using ChatGPT, your next paragraph streaming smoothly at ~150 tokens per second, and then — for no reason visible to you — the model freezes on a half-finished word for eight seconds before resuming. Nothing changed about your conversation. What happened, somewhere upstream, is that a different user pasted a 100 k-token document into their session, and your decode iteration got bundled into the same forward as their prefill. 
ORCA was happy to assemble that iteration — both were valid pieces of work — but the wall time was set by their prefill, and you paid for it.\nTwo flavors of this head-of-line blocking fall out, both inside a single iteration, not across batches.\n3.1 TBT spike for in-flight decodes The scenario above has a name: TBT (time-between-tokens) is the wait between consecutive output tokens for a decoding request — the steady-streaming feel a user expects. The bundled-with-a-100k-prefill iteration spikes TBT ~1300× for every in-flight decoder that happens to share it.\nA static batch wouldn\u0026rsquo;t have done this — but a static batch had its own catastrophes. ORCA didn\u0026rsquo;t break anything; it just made an existing variability visible at the iteration level, where it now hits everyone in the engine simultaneously.\n3.2 TTFT spike for short prefills batched with long ones Two new requests arrive in the same iteration: one with a 100-token prompt, one with a 10 k-token prompt. ORCA happily packs both into one forward — they both want prefill, no in-flight state to mind, and stuffing more into one iteration is exactly what the kernel is built for. But the forward\u0026rsquo;s wall time is set by the long peer:\nForward content Linear Attn Wall time 100-token prefill alone ~1.4 TF ~3 GF ~4 ms 10 k-token prefill alone ~140 TF ~26 TF ~320 ms 100 + 10 k packed together ~141 TF ~26 TF ~330 ms The short request\u0026rsquo;s TTFT degrades from ~4 ms (alone) to ~330 ms (batched with the long peer) — ~80× worse, purely because they shared a forward. From the short request\u0026rsquo;s perspective, the network was operating at full speed for everyone except them, and there\u0026rsquo;s no reason in their request for the latency they\u0026rsquo;re paying. 
It\u0026rsquo;s structural — a side effect of the iteration\u0026rsquo;s wall time being set by its largest member.\n3.3 Same root cause Both 3.1 and 3.2 come from one structural fact: an iteration\u0026rsquo;s wall time is set by its largest piece of work. ORCA can decide whether a piece is in this iteration, but not how big a piece is. Until the largest piece is bounded, the iteration heartbeat skips.\nTo get a stable heartbeat back, we need to bound the largest piece. That\u0026rsquo;s what chunked prefill does — and the KV cache already gives us the tool to do it.\n4. Chunked prefill: cap the largest piece If a long prefill is the problem, what stops us from just splitting it?\nNothing structural, as it turns out — the KV cache makes splitting trivial. Once chunk 0 has run, its K and V at every layer are already stored in the cache. Chunk 1\u0026rsquo;s attention can read them just like a decode step would. The math is identical to running the whole prefill at once, by construction; the only difference is when the work happens.\nSo: split a long prompt into chunks of size C, and carry one chunk per iteration. Walk through what happens for a request prefilling a prompt of length N:\nThe prompt becomes ⌈N/C⌉ iterations for that request. Iteration 0 prefills tokens [0, C). Its attention is exactly Article 04\u0026rsquo;s prefill — [C × C] lower-triangular score block. K and V get stored in the cache at every layer. Iteration 1 prefills tokens [C, 2C). Its attention now has Q rows from the new chunk and K, V rows from both the cached prefix and this chunk\u0026rsquo;s freshly projected K, V. Score block: [C × 2C]. \u0026hellip; Iteration k prefills tokens [kC, (k+1)C). Score block: [C × (k+1)C]. The mask on chunk k has two regions:\nThe block attending to the cached prefix [C × kC] is fully unmasked. Every prefix token was emitted before any token in this chunk, so causality permits attending to all of them. 
The block attending to this chunk\u0026rsquo;s own tokens [C × C] is lower-triangular — causal within the chunk.\n[Figure: chunk k of size C, prefix S = kC, scores [C × (S+C)]. Queries from chunk k (C rows) run against the cached prefix keys (S) and this chunk\u0026rsquo;s keys (C). The block attending to the cached prefix, [C × S], is all visible with no mask; the block over the chunk\u0026rsquo;s own keys, [C × C], is causal (lower-triangular). q_rows = C, kv_length = S + C.]\nWalk the block on a chunk of size C, with prefix size S = kC:\nStep What it does Touches cache? LayerNorm per-row no QKV proj matmul on [C × hidden] → Q, K, V each [C × heads × d_head] no Append K, V to cache concat layer\u0026rsquo;s K, V into the per-request cache yes (write) Attention Q [C × heads × d_head], K and V [(S+C) × heads × d_head]. Scores [C × (S+C)] with mask above. yes (read full prefix) Output proj matmul no Residual + LayerNorm + FFN-up + GeLU + FFN-down + Residual per-row no The only step that changes versus Article 04 is attention, with shapes generalized:\nQ row count is C instead of \u0026ldquo;the request\u0026rsquo;s whole length\u0026rdquo;. K, V row count is S + C instead of equal to Q — the prefix now lives in the cache. Score block is rectangular [C × (S+C)], not square. Linears, residuals, layernorms, and pointwise ops are per-token and don\u0026rsquo;t notice the cache. They process [C × hidden] rows row-by-row, indistinguishable from any other batch of C tokens.\nWorth pausing on the score block\u0026rsquo;s shape: it\u0026rsquo;s a hybrid. The left part — chunk\u0026rsquo;s queries against cached prefix — looks exactly like the score block of C decode steps stacked together: full unmasked attention over all prior tokens. The right part — chunk against itself — is a normal causal prefill [C × C] block. Decode and prefill are just two extremes of the same shape, and chunked prefill is any point along the spectrum.\nWhich makes the punchline obvious in retrospect: decode is just C = 1 chunked prefill. 
Same machinery, different value of one knob.\n5. Piggyback: prefill chunks coexist with decodes Here\u0026rsquo;s where everything composes. The flat-tensor + varlen kernel from Article 04 doesn\u0026rsquo;t care what kind of work each request\u0026rsquo;s slice represents. To the kernel, a request slice is just (q_rows, kv_length) — same shape whether the request is decoding (q_rows = 1), prefilling its first chunk (q_rows = C, kv_length = C), or carrying a middle chunk (q_rows = C, kv_length = S + C).\nSo a single iteration can carry, all packed into one flat tensor:\nIteration content: - Request E: prefill chunk 7 of 50 → 1024 Q rows, kv_length = 8 × 1024 = 8192 - Request A: decode step 51 → 1 Q row, kv_length = 1500 - Request B: decode step 200 → 1 Q row, kv_length = 1700 - Request C: decode step 75 → 1 Q row, kv_length = 1100 Total Q rows in this iteration: 1024 + 3 = 1027 The varlen kernel walks each request\u0026rsquo;s slice independently. TP is still untouched.\nThis is piggyback chunked prefill: long prefills coexist with in-flight decodes inside one forward. The scheduler\u0026rsquo;s job becomes a kind of bin-packing — at every iteration, fill a budget (say \u0026ldquo;no more than 2048 token-rows of Q, no iteration longer than 50 ms\u0026rdquo;) with whatever mix of decode steps and prefill chunks fits. A long prompt becomes a stream of chunk-sized contributions, one per iteration, alongside whatever decodes are running. Short prefills fit in single iterations. Decodes always fit. The 1300× swing from §3 collapses into a stable iteration profile of maybe 2–3× — easily plannable, and the engine\u0026rsquo;s heartbeat is steady again.\nC is the new scheduler knob:\nSmaller C → more uniform iteration time, lower TBT for in-flight decodes; but more cache re-reads per chunk and lower MFU on the linears (small GEMMs run further below peak). 
Larger C → fewer cache re-reads, higher MFU; but iteration wall time creeps back up and TBT degrades for everyone else in the iteration. Real systems pick C in the 256–8192 range, usually tied to a \u0026ldquo;max batched tokens per iteration\u0026rdquo; budget that targets a TBT ceiling. Concretely: under a budget of \u0026ldquo;≤ 50 ms per iteration, up to 2048 Q rows,\u0026rdquo; a 100 k-token prompt prefills in 100 000 / 2048 ≈ 49 iterations, sharing each one with whatever decodes are currently running.\n6. Cost intuitions Three things worth pausing on, since they all bite.\nTotal compute is preserved. Chunk k contributes C · kC unmasked pairs against the cached prefix plus about C²/2 causal pairs within itself; summed over chunks k = 0 … N/C − 1, that comes to N²/2, the same as one unchunked causal prefill. Chunked prefill redistributes attention work across iterations; it doesn\u0026rsquo;t reduce it.\nHBM bandwidth on KV reads grows. Chunk k re-reads kC rows of cache per layer per attention call. Summed over all chunks: ≈ N²/(2C) rows of cumulative cache traffic, vs ~N rows for an unchunked prefill (which streams the cache through tiled attention exactly once). For N = 100 k and C = 2048, that\u0026rsquo;s about 25× more cumulative cache-read traffic spent on the same prompt — the price chunking pays for keeping iterations bounded. It\u0026rsquo;s also why C can\u0026rsquo;t be made arbitrarily small: at some point the bandwidth tax overtakes the schedulability win.\nPer-iteration MFU dips at small C. Small-C iterations run their linear matmuls below peak — fewer rows for the tensor cores to chew on. Real serving engines tune C to a sweet spot where iteration time meets the TBT target without leaving too much MFU on the table.\nThe three together explain the typical C ∈ [256, 8192] band. There\u0026rsquo;s no single right answer; the band depends on the model\u0026rsquo;s compute/bandwidth profile and the engine\u0026rsquo;s TBT/throughput targets.\n7. What this opens A real serving loop now: prefill, decode, mixed iterations, bounded per-iteration cost, no idle slots. 
Some assumptions still leak, each seeding the next round of articles.\nThe KV cache\u0026rsquo;s physical layout. We\u0026rsquo;ve quietly assumed each request\u0026rsquo;s cache is a contiguous slab per layer. As B grows and contexts vary, this gets ugly fast — fragmentation, eviction, allocation overhead. PagedAttention treats the cache as virtual memory; a later article. Two regimes sharing one engine. Decode is bandwidth-bound on weight reads; prefill chunks are compute-bound. Maybe they shouldn\u0026rsquo;t share the same GPUs at all. Prefill/decode disaggregation explores running them on separate replicas; the next article. Heads aren\u0026rsquo;t always independent. GQA, MLA, and the rest of the \u0026ldquo;fewer KV heads\u0026rdquo; family shrink the cache dramatically — bigger batches, longer contexts — but introduce sharing patterns we\u0026rsquo;ve been able to ignore so far. A whole sub-series. One request\u0026rsquo;s cache outgrows one GPU. Once a context gets long enough that its KV cache alone won\u0026rsquo;t fit on a single card, sequence/context parallelism splits one request across GPUs. Its own article, much later. Same grammar each time: relax one assumption, see what falls out.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/05-orca-and-chunked-prefill/","summary":"Many requests, each finishing at a different time, and some carrying prefills 1000× the size of a decode step. Per-iteration cost swings wildly. ORCA-style iteration-level scheduling fixes one half; chunked prefill bounds the largest iteration so short work isn\u0026rsquo;t dragged behind long work.","title":"ORCA and Chunked Prefill: Evening Out the Iteration"},{"content":"Article 05 ended with a smooth heartbeat. ORCA fixed the boundary problems by scheduling per iteration; chunked prefill capped the iteration so a long prompt couldn\u0026rsquo;t hijack the room. 
Every iteration is bounded, every request is roughly fair, the engine breathes evenly.\nBut that article also left a thread dangling, in §7\u0026rsquo;s second bullet:\nDecode is bandwidth-bound on weight reads; prefill chunks are compute-bound. Maybe they shouldn\u0026rsquo;t share the same GPUs at all.\nThis article pulls on that thread. We\u0026rsquo;ll measure the gap with the roofline model, watch it widen as context length grows, and end with the structural fix: stop putting the two phases on the same machine.\nThe starting point is uncomfortable: piggyback chunked prefill from Article 05 isn\u0026rsquo;t a solution to the prefill/decode mismatch — it\u0026rsquo;s a compromise. It flattens the heartbeat, but the underlying truth is that a prefill chunk and a decode token want the GPU to be in two different regimes. Sharing forces both to settle for the wrong one.\n1. The roofline, in one page Every kernel on every GPU is bottlenecked on one of two physical resources:\nCompute — the tensor cores\u0026rsquo; peak FLOPs/s. Memory bandwidth — the rate at which HBM can deliver bytes to the SMs. (This is intra-GPU bandwidth, the wire from HBM to the tensor cores. Inter-GPU bandwidth — NVLink, InfiniBand — is a separate axis we\u0026rsquo;ll meet later when TP and PP enter the story.)\nA concrete picture of where bytes live The \u0026ldquo;memory bandwidth\u0026rdquo; number is opaque without a picture of the chip. A modern GPU has a memory hierarchy — several layers, each smaller and faster than the one below it. 
Tensor cores can only do math on data sitting in registers, so every byte of weight or KV cache has to traverse the hierarchy up before any compute happens.\n[Figure: GPU memory hierarchy (H100-flavored numbers). On the GPU die, SM 0 through SM 131 each hold registers, SRAM (~256 KB), and tensor cores, labeled ~30 TB/s effective. Below them sits a shared L2 cache (~50 MB, ~5 TB/s). At the bottom is HBM (80 GB, 3.35 TB/s HBM bandwidth) holding model weights, KV cache, and activations between kernels: large enough to hold the model and the batch\u0026rsquo;s state, but the slowest tier.]\nNumbers are H100-flavored; other GPUs differ in absolute terms but the shape — three to four orders of magnitude between top and bottom in both capacity and speed — is universal.\nWhat\u0026rsquo;s stored where:\nHBM holds the persistent stuff: model weights (14 GB for Llama-2-7B), every request\u0026rsquo;s KV cache, and activations that survive between kernel launches. Big, but slow relative to every tier above it. L2 cache is a shared scratch — useful when many SMs read overlapping data, but it\u0026rsquo;s only ~50 MB, far too small to hold weights or KV. SRAM (per-SM shared memory) is where a kernel stages the current tile of weights, queries, and keys it\u0026rsquo;s working on. FlashAttention\u0026rsquo;s whole trick is keeping tiles of the attention score matrix in SRAM so the full matrix never spills to HBM. Registers are where tensor cores actually read operands from. A few hundred KB per SM, accessible in a single cycle. So when you read \u0026ldquo;the kernel loaded 14 GB of weights from HBM,\u0026rdquo; the path is: HBM → L2 → SRAM → registers → tensor cores. 
Each layer is smaller and faster than the one below it, and the 3.35 TB/s number is the bottom of that chain — the one bottleneck that can\u0026rsquo;t be cached around for a transformer iteration, because the weights are larger than every layer above HBM.\nWhat \u0026ldquo;compute-bound\u0026rdquo; vs \u0026ldquo;bandwidth-bound\u0026rdquo; actually means physically A matrix multiply works in tiles: load a tile of A and a tile of B from HBM into SRAM, multiply them in registers (many FLOPs per element), accumulate, move on. The same tile of weights is reused across many output rows before being evicted.\nCompute-bound means the tensor cores are saturated. They consume the current tile fast enough that HBM can comfortably deliver the next tile in the background. Bandwidth has slack. Each byte of weight, once loaded, is reused for many FLOPs. Bandwidth-bound means HBM can\u0026rsquo;t deliver the next tile fast enough. The tensor cores have already finished with the current one and sit idle waiting for bytes. Each byte is reused for too few FLOPs to amortize the load. The number that decides which regime you\u0026rsquo;re in is exactly how many FLOPs you do per byte you pulled from HBM — that\u0026rsquo;s the intensity, and that\u0026rsquo;s why the roofline rule is so unforgiving. It isn\u0026rsquo;t an empirical observation; it\u0026rsquo;s a direct consequence of the hierarchy above.\nThe roofline rule Which resource binds is decided by a single number: arithmetic intensity I, the ratio of FLOPs done to bytes loaded from HBM:\nI = FLOPs done / bytes loaded (units: FLOPs/byte) The hardware has a matching number, the ridge point R:\nR = peak FLOPs/s / peak HBM bandwidth (units: FLOPs/byte) For an H100 SXM5: ~500 TFLOPs/s sustained fp16 GEMM, 3.35 TB/s HBM3 → R ≈ 150 FLOPs/byte.\nThe rule:\nI \u0026gt; R → compute-bound. The arithmetic dominates; bandwidth has slack. I \u0026lt; R → bandwidth-bound. The bytes dominate; tensor cores idle waiting for data. 
That\u0026rsquo;s it. The whole rest of this article is two questions:\nWhat\u0026rsquo;s I for a prefill iteration vs. a decode iteration? How does I change as context length grows? 2. Notation and the iteration cost model Before any numbers, fix symbols. We\u0026rsquo;ll assume fp16 throughout (2 bytes per parameter, 2 bytes per cached number). Lower-precision dtypes change the numerics but not the story.\nSymbol Meaning Units Π parameter count dimensionless K_tok KV bytes stored per token (sum over all layers, both K and V) bytes/token T total tokens in this iteration tokens B in-flight requests in this iteration dimensionless L average context length per request tokens C prefill chunk size (new tokens per chunk) tokens R hardware ridge point FLOPs/byte (K_tok is the sum across all layers — what one token of context costs across the full network\u0026rsquo;s KV cache, not per-layer.)\nFor one transformer iteration on a model of size Π, two physical quantities matter, and both are linear in Π:\nBytes pulled from HBM for weights: every parameter is 2 bytes wide and the iteration reads each one once → 2Π bytes. For Llama-2-7B (Π = 7B), 14 GB. Paid once per iteration, no matter how many tokens we packed in. FLOPs to push one token through the network: each token\u0026rsquo;s pass through the model multiplies by every parameter once (2 FLOPs per multiply-accumulate) → 2Π FLOPs/token. For Llama-2-7B, 14 GFLOPs/token. An iteration processing T tokens does 2Π · T FLOPs — tokens don\u0026rsquo;t interact in the matmul layers (only in attention), so each one costs the same 2Π independently. 
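As a sanity check on these two quantities, here is a minimal Python sketch. It assumes the H100-flavored numbers used in this series (~500 TFLOPs/s, ~3.35 TB/s) and counts only the linear layers and the weight read; attention and KV-cache traffic are deliberately left out. The function names are made up for illustration, not taken from any library.

```python
# Back-of-envelope sketch of the two per-iteration costs defined above.
# Assumptions: fp16 weights, Llama-2-7B, H100-flavored peak numbers.
PARAMS = 7e9            # parameter count (Pi in the text)
BYTES_PER_PARAM = 2     # fp16
PEAK_FLOPS = 500e12     # assumed sustained fp16 GEMM rate, FLOPs/s
HBM_BW = 3.35e12        # assumed HBM bandwidth, bytes/s


def weight_bytes(params=PARAMS):
    # Every parameter is read once per iteration.
    return BYTES_PER_PARAM * params


def linear_flops(tokens, params=PARAMS):
    # One multiply-accumulate (2 FLOPs) per parameter per token.
    return 2 * params * tokens


def weight_read_ms(params=PARAMS):
    # Time to stream the weights from HBM once, in milliseconds.
    return weight_bytes(params) / HBM_BW * 1e3


def compute_ms(tokens, params=PARAMS):
    # Time to do the linear-layer FLOPs at peak, in milliseconds.
    return linear_flops(tokens, params) / PEAK_FLOPS * 1e3


for tokens in (1, 32, 2048):
    print(tokens, round(weight_read_ms(), 2), round(compute_ms(tokens), 3))
```

Under these assumptions the weight read costs about 4 ms regardless of token count, while the compute time grows with T: far below the weight read for a single token, far above it at 2048 tokens.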
Add the KV-cache reads to the byte side and write the two together:\nbytes_loaded = 2Π (weights, paid once per iteration) + K_tok · L · B (KV cache, each request reads its own L rows) FLOPs_done = 2Π · T (T = tokens in this iteration) Plug into the intensity definition:\nI = 2Π · T / (2Π + K_tok · L · B) Stare at this formula for a moment — the rest of this section is reading it carefully. The denominator has two terms, the numerator has one, and walking through them in sequence gives us the whole prefill/decode story.\nPart 1: pretend the KV term is zero At very short L, or before any context has accumulated, the denominator is dominated by 2Π and the formula collapses to:\nI ≈ T Intensity is literally the number of tokens sharing one weight load. This is where prefill and decode part ways:\nPrefill iteration: T = C = 2048 tokens → I ≈ 2000 → way above any modern ridge (~150) → compute-bound. Decode iteration: T = B (concurrent decoding requests, typically tens to low hundreds) → I ≈ B → way below ridge → bandwidth-bound. Same hardware, same model, same kernel. The only difference is how many tokens the iteration is carrying. Prefill amortizes the weight load over thousands of tokens; decode amortizes it over B. They land on opposite sides of the ridge from the very first iteration — and not by a small margin: an order of magnitude or more in intensity.\nThe instinct is to fix this by batching decode harder — push B up until intensity clears the ridge. To clear R = 150 you\u0026rsquo;d need B ≥ 150. The next part of the formula explains why that\u0026rsquo;s not feasible.\nPart 2: turn the KV term back on As context grows, K_tok · L · B adds to the denominator. The two denominator terms cross when:\nL · B = 2Π / K_tok For Llama-2-7B (Π = 7B, K_tok ≈ 512 KB), L · B ≈ 27 k. At decode batch B = 32, the crossover is at L ≈ 850 tokens.\nThat number — 850 — is tiny by today\u0026rsquo;s standards, and it\u0026rsquo;s worth pausing on. 
Production prompts routinely run tens of thousands of tokens: large system prompts and tool definitions, RAG-injected documents, accumulated multi-turn conversations, agentic chains where input-to-output ratios commonly run 100:1 or higher. Frontier models ship with 200 k – 2 M context windows precisely because real workloads fill them. So \u0026ldquo;past the crossover\u0026rdquo; isn\u0026rsquo;t a corner case — it\u0026rsquo;s the median request.\nPast the crossover, the formula approximates in the other direction:\nI ≈ 2Π · T / (K_tok · L · B) And here the cancellations matter:\nDecode (T = B): I ≈ 2Π / (K_tok · L). The Bs cancel — increasing the decode batch no longer raises intensity at long context. You just pay proportional KV reads for processing more requests in parallel. And before B can grow much, you run out of KV memory. So the \u0026ldquo;just batch harder\u0026rdquo; instinct from Part 1 fails twice over. Prefill (T = C): I ≈ 2Π · C / (K_tok · L · B). Nothing cancels — C stays in the numerator. Prefill stays compute-bound out to absurd contexts. Two facts from one formula Prefill is compute-bound; decode is bandwidth-bound. This holds even at zero context, set entirely by how many tokens share one weight load. They\u0026rsquo;re on opposite sides of the ridge from the start. Long context widens the gap. A second bandwidth cost — KV reads — emerges in the denominator and dominates past the crossover (which production traffic routinely sits past). It lands disproportionately on decode, while leaving prefill mostly untouched. §3 confirms both with concrete numbers on Llama-2-7B.\n3. 
The numbers, one model, two phases To put the formula on the ground, sweep L for a single model on a single GPU.\nLlama-2-7B (MHA, 32 layers, 32 heads, head_dim 128, fp16) on H100:\nweight bytes 2Π = 14 GB K_tok = 2 (K,V) · 32 layers · 32 heads · 128 head_dim · 2 bytes ≈ 512 KB/token ridge R ≈ 150 FLOPs/byte Decode at B = 32 L weight bytes KV bytes total I = 2Π·B / total regime 1 k 14 GB 16 GB 30 GB ~15 bandwidth-bound (weights ≈ KV) 4 k 14 GB 64 GB 78 GB ~5.7 bandwidth-bound (KV dominates) 16 k 14 GB 256 GB 270 GB ~1.7 catastrophically bandwidth-bound 64 k 14 GB 1.0 TB 1.0 TB ~0.4 the cache doesn\u0026rsquo;t even fit on one H100 (Numerator 2Π · B = 448 GFLOPs — pinned. The denominator is what blows up.)\nNotice:\nIntensity falls fast. From ~15 at L=1k to ~0.4 at L=64k — more than an order of magnitude over a single dimension of context. Memory budget bites before bandwidth does. At L=16k, B=32 the KV alone is 256 GB, way past the H100\u0026rsquo;s 80 GB. PagedAttention exists partly to manage this, and B is forced down at long context, which makes intensity worse. (Llama-2-7B uses MHA; modern GQA/MLA models cut K_tok by 4–8×, mostly to push this wall back.) The dominant byte changes. At small L, weights dominate. At large L, KV dominates. Both are bandwidth-bound, but the fix is different — bigger batch helps with weight pressure; GQA/MLA/FlashDecoding help with KV pressure. 
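The decode table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the article's Llama-2-7B constants and its rough ridge point of 150 FLOPs/byte. The function and variable names are made up for this example and are not taken from any serving library. Because it uses exact byte counts instead of rounded GB figures, the intensities land slightly below the table's.

```python
# Sketch of the section 2 formula, I = 2*Pi*T / (2*Pi + K_tok*L*B),
# evaluated for decode (T = B). All names here are illustrative.

PARAMS = 7e9                      # Pi, parameter count
WEIGHT_BYTES = 2 * PARAMS         # fp16 weights, read once per iteration
K_TOK = 2 * 32 * 32 * 128 * 2     # K and V, 32 layers, 32 heads, head_dim 128, 2 bytes
RIDGE = 150                       # FLOPs/byte, the article's rough ridge point


def intensity(tokens, kv_context_tokens):
    '''FLOPs per byte for one iteration.

    tokens: T, tokens computed in this iteration.
    kv_context_tokens: total cached tokens read (L * B for decode).
    '''
    flops = 2 * PARAMS * tokens
    bytes_loaded = WEIGHT_BYTES + K_TOK * kv_context_tokens
    return flops / bytes_loaded


def decode_intensity(batch, context_len):
    # Decode computes one new token per request, so T = B.
    return intensity(batch, batch * context_len)


if __name__ == '__main__':
    for context_len in (1024, 4096, 16384, 65536):
        i = decode_intensity(32, context_len)
        kv_gb = K_TOK * context_len * 32 / 1e9
        print(f'L={context_len:>6}  KV={kv_gb:7.1f} GB  I={i:5.2f}  '
              f'below ridge: {i < RIDGE}')
```

Calling `decode_intensity(64, 65536)` next to `decode_intensity(32, 65536)` shows the cancellation from Part 2 directly: doubling the batch at long context barely moves the intensity.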
Prefill at C = 2048 Chunked prefill processes C new tokens against a prefix of size S (so T = C tokens of compute, reads S tokens of cached KV):\nI_prefill = 2Π · C / (2Π + K_tok · S) The numerator scales with C — every byte loaded is reused across thousands of tokens of math.\nprefix S weight bytes KV bytes total I regime 4 k 14 GB 2 GB 16 GB ~1800 compute-bound (×12 above ridge) 64 k 14 GB 32 GB 46 GB ~620 compute-bound (×4) 256 k 14 GB 128 GB 142 GB ~200 still compute-bound (×1.3) 1 M 14 GB 512 GB 526 GB ~55 finally below ridge — but we\u0026rsquo;re at a million tokens Prefill stays compute-bound out to extreme contexts. Even where it crosses below the ridge, it\u0026rsquo;s nowhere near as bandwidth-bound as decode is at common contexts.\nThe asymmetry, stated cleanly:\nEach byte of bandwidth is amortized over C ≈ 2000 tokens in prefill, but over 1 token per request in decode. Long context turns the screw on decode and barely touches prefill.\nSame model, same GPU. Two phases. Completely different fates.\n4. Why one engine can\u0026rsquo;t serve both well Take the engine from article 05 — continuous batching, chunked prefill, piggyback iterations — and ask: how do you size it?\nSize for prefill: buy GPUs for FLOPs. Decode then runs on hardware where most of the compute is structurally unreachable, because decode is bandwidth-bound. You\u0026rsquo;re paying for tensor cores decode physically can\u0026rsquo;t use. Size for decode: buy fewer GPUs sized for HBM bandwidth and capacity. Prefill takes longer than it needs to. TTFT (time-to-first-token, the pause before the first word) inflates. Mix: every iteration packs prefill chunks and decode tokens. TBT (time-between-tokens, the cadence between words) is held hostage by however much compute the prefill chunks are eating. Chunked prefill bounds this — that was article 05\u0026rsquo;s whole point — but it can\u0026rsquo;t make the bound free. 
A decode iteration sharing the engine pays for C rows of prefill work that doesn\u0026rsquo;t help any decode at all.\nThe deeper issue: the workload\u0026rsquo;s bottleneck profile is bimodal, but the engine is unimodal. There\u0026rsquo;s no single sizing, no single parallelism strategy, no single batch policy that\u0026rsquo;s right for both phases simultaneously. The two phases stress different physical resources and have different SLOs (TTFT vs TBT), and one scheduler with one knob can\u0026rsquo;t satisfy two SLOs against two regimes.\nSo you stop trying. You build two pools.\n5. The split\n[Figure: prompt → prefill pool (compute-bound, optimizes TTFT, stateless) → KV cache transfer (per request, L_p · K_tok bytes) → decode pool (bandwidth-bound, optimizes TBT, holds long-lived KV) → tokens]\nA request\u0026rsquo;s lifecycle now has a hop in the middle:\nPrefill pool receives the prompt, runs chunked prefill across all L_p tokens, produces the request\u0026rsquo;s full KV cache plus the first generated token.\nKV cache transfer ships those L_p · K_tok bytes from prefill GPU memory to a decode GPU\u0026rsquo;s memory.\nDecode pool receives the KV cache, slots the request into its continuous-batching pool, and runs decode iterations until EOS, streaming tokens back to the user.\nTwo pools, two scheduling regimes, two SLO targets. The compromise is gone. Each pool is now free to pick its own parallelism, batch policy, hardware mix, and scheduling discipline against a single objective. That freedom is most of the win; what each pool actually does with it is the subject of later articles in this series.\nThe handoff is the new cost. We\u0026rsquo;ll price it in §6.\n6. The new cost: KV cache transfer Splitting the engines means KV moves between machines, once per request. That\u0026rsquo;s a real cost, so let\u0026rsquo;s price it.\nFor Llama-2-7B (K_tok ≈ 512 KB) and a 4 k-token prompt:\nKV bytes per request = L_p · K_tok = 4096 · 512 KB ≈ 2 GB That\u0026rsquo;s per request. 
At a few hundred requests per second (modest production load), the aggregate east-west traffic between the two pools can run into hundreds of GB/s. Whichever fabric connects them needs to handle it.\nWhat that fabric looks like, and what one transfer costs:\nFabric | Bandwidth | 2 GB transfer time\nNVLink (intra-node) | ~900 GB/s | ~2 ms\nNVLink-network / NVSwitch fabric (cluster) | ~400 GB/s | ~5 ms\nInfiniBand NDR (cross-node, 400 Gb/s per port) | ~50 GB/s | ~40 ms\nPCIe Gen5 x16 (host-mediated) | ~64 GB/s | ~30 ms\nSo the handoff is cheap if your pools are co-located in the same NVLink domain, and a real tax if they\u0026rsquo;re across an IB hop. A 40 ms hit on TTFT is meaningful; a 5 ms hit is not.\nA few engineering knobs that immediately fall out (each could justify its own article; we\u0026rsquo;re surfacing, not solving):\nLayer-streaming overlap. Don\u0026rsquo;t wait for prefill to finish to start the transfer. Each layer\u0026rsquo;s K, V are produced in order; ship them while later layers are still computing. Done well, the transfer is mostly hidden behind prefill compute.\nGPUDirect RDMA. Move bytes directly between GPU HBMs without bouncing through CPU memory. Saves a copy and a context switch.\nTopology awareness. Schedule prefill and decode for the same request onto pools that are close (same rack, same NVLink domain) so the transfer rides the fastest fabric class available.\nPrefix reuse. If two requests share a long prefix, you only need to compute and transfer the suffix\u0026rsquo;s KV. Production systems (Mooncake at Moonshot is a well-documented example) turn this into a memory-hierarchy problem: hot prefixes in HBM, warm in DRAM, cold on SSD.\nGQA / MLA shrink the bill directly. Cutting K_tok by 4–8× cuts the transfer by 4–8×. This isn\u0026rsquo;t usually framed as a disaggregation optimization, but it is one.\nThere\u0026rsquo;s a real article\u0026rsquo;s worth of detail under each of those. 
For now the takeaway is just that the transfer is the price of disaggregation, and it\u0026rsquo;s payable: bounded, well-engineered, and small relative to the wins on TTFT and TBT.\nWhat the user feels:\nTTFT = prefill time + transfer time + first decode iter. Transfer is a real but small component (a few ms to tens of ms).\nTBT = pure decode, no prefill contention. The decode pool\u0026rsquo;s iterations only ever contain decode work, so TBT is as smooth as the decode hardware alone can make it.\nThe trade is the one you want: a small one-time tax on TTFT in exchange for clean, predictable TBT throughout the generation. Users feel TBT far more than TTFT: TTFT is one pause, TBT is every pause.\n7. What this opens Article 05 ended by capping the iteration. Article 06 ends by splitting it. The formula in §2 forces the why; this article has spent most of its pages on that argument. The how is a different question, and §6 should be read as a doorway, not a destination: the visible tip of a much larger engineering surface.\nStand at that doorway for a second. Two GPUs, possibly in different racks, possibly under different memory tiers, have to move gigabytes of state per request fast enough to disappear behind prefill latency. Every choice in that pipeline has its own real design space:\nWhich fabric carries the bytes (NVLink vs NVSwitch vs InfiniBand vs PCIe) sets a per-transfer cost that ranges across more than an order of magnitude (~2 ms to ~40 ms in §6\u0026rsquo;s table). The cluster topology you build looks completely different depending on the answer.\nWhere the KV cache lives between requests (HBM vs DRAM vs SSD) turns the disaggregated engine into a tiered memory system. Mooncake-style prefix pools are one way; there are others, with different invalidation and locality behaviors.\nHow the transfer overlaps with compute (layer-by-layer streaming, GPUDirect RDMA, double-buffered queues) is what makes the handoff invisible end-to-end vs. dominant in TTFT. 
How requests are routed across pools — fabric-locality-aware scheduling, prefix-cache hits, decode capacity tracking — is its own scheduling problem on top of everything in article 05. Each of those is a real article on its own, and the next piece in this series picks up the thread — the engineering of running a disaggregated serving stack. Then we\u0026rsquo;ll be in a position to ask the optimization questions disaggregation finally lets us ask cleanly: what does each pool want, now that it\u0026rsquo;s free to specialize? Pipeline parallelism for prefill, tensor parallelism for decode, paging, GQA/MLA, FlashDecoding, speculative decoding — each has a clean home once the pools are split, and we\u0026rsquo;ll work through them in turn.\nSame grammar each time: name the bottleneck, factor the workload until each piece sees only the bottleneck that binds it, optimize per piece. Disaggregation was the biggest factoring move available. The next stretch of this series is the engineering and the optimization that the cuts unlock.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/06-prefill-decode-disaggregation/","summary":"Article 05 left two phases politely sharing one engine. This article shows they shouldn\u0026rsquo;t — prefill is compute-bound, decode is bandwidth-bound, and long context drives the gap wider, not smaller. Once we accept the asymmetry, splitting them is the structural fix.","title":"Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline"},{"content":"The FFN we walked in article 03 hits a wall at frontier scale — every token has to read every parameter, and at 700B that bill dominates the cost of serving. Mixture-of-Experts is the move that solves it: replace one big FFN with many small experts and a router that picks a few per token, decoupling capacity from compute.\nThis article rebuilds the FFN as MoE, anchors on DeepSeek-V3 for the concrete numbers, then walks through the parallelism the new shape demands.\n1. 
The FFN, where the parameters live Recap from article 03. Each transformer block is attention then FFN. The FFN takes the residual stream, projects up to a wider intermediate dimension, applies a SwiGLU nonlinearity, projects back down. For Llama-2-7B that\u0026rsquo;s 4096 → 11008 → 4096, three matrices per layer:\ngate_proj: 4096 × 11008 up_proj: 4096 × 11008 down_proj: 11008 × 4096 Multiplying the shapes out, each FFN holds about 135M parameters. Across 32 layers, that\u0026rsquo;s ~4.3B of Llama-2-7B\u0026rsquo;s 7B total — well over half the model lives in the FFN. And in the dense formulation, every one of those parameters gets read on every token\u0026rsquo;s pass through every layer.\nThat\u0026rsquo;s fine at 7B. It gets expensive at 700B. From article 01 we know decode is bandwidth-bound — every generated token has to pull the full weight set through HBM. People do serve dense models at this scale (Llama-3 405B is dense, GPT-3 175B was dense), so it\u0026rsquo;s not impossible — it\u0026rsquo;s just that the per-token bandwidth bill is set by every parameter the model has, and at this size the bill becomes the dominant cost of serving.\nWhat if we could keep the capacity of 700B parameters but only touch a small fraction of them per token? The bandwidth bill would drop by exactly that fraction, and we\u0026rsquo;d get the model quality of the full 700B for the cost of a much smaller one. That\u0026rsquo;s the trade MoE makes.\n2. The MoE move: condition the FFN on the token The classical FFN treats every token identically. Same matrices, same multiplications, whether the token is the or mitochondria. That\u0026rsquo;s wasteful — most parameters in a giant FFN are surely specialized for something, and most tokens don\u0026rsquo;t need most specializations.\nMixture-of-Experts replaces one big FFN with many smaller ones — the experts — plus a tiny router that, per token, picks which k of E experts to actually run. 
The shape of computation is unchanged inside each expert: it\u0026rsquo;s still a SwiGLU FFN on the residual stream. What changes is which experts run for which tokens.\n[Figure: dense FFN (one big FFN; every token, every param; same matmul for every token) vs MoE FFN (a router plus E experts; each token wakes only k of them; different experts for different tokens)]\nTwo consequences fall out:\nParameters scale with E, compute scales with k. Add more experts and the model\u0026rsquo;s total capacity grows; keep k fixed and per-token FLOPs don\u0026rsquo;t budge. Capacity and compute, decoupled.\nThe FFN becomes conditional. Different tokens take different paths through the model. Two tokens in the same sequence at the same layer may hit completely disjoint sets of experts.\nThat\u0026rsquo;s the entire architectural idea. Everything from here is how to make it run on real hardware, starting with the specific shape DeepSeek-V3 picks for E, k, and the expert size.\n3. DeepSeek-V3 in concrete numbers §2 was the architecture in the abstract. To feel what the choice buys, we need to anchor on a real model, and DeepSeek-V3 is the cleanest large-scale instance to look at, both because the design is publicly documented and because it pushes the MoE shape hard.\nThe headline: 671B parameters total, 37B active per token. The simplest way to feel what that buys is to put DeepSeek-V3 next to a contemporary dense model of comparable capability. Qwen2.5-72B fits the role: different lab, different philosophy, same generation, aimed at similar tasks.\nQuantity | Qwen2.5-72B (dense) | DeepSeek-V3 (MoE)\nHidden dim d | 8192 | 7168\nFFN intermediate dim i | 29568 | 2048 (per expert)\nRatio i / d | 3.6× | 0.29×\nExperts per FFN layer | 1 (no router) | 256 routed + 1 shared\nActive per token | the whole FFN | top-8 + shared = 9 of 257\nParams per FFN/MoE layer | ~727M | ~11.3B stored, ~400M active\nRouting combinations per token | 1 | ~4 × 10¹⁴\nTotal model params | ~72B | 671B\nActive params per token | ~72B | 37B\nThe contrast is the whole point. 
DeepSeek-V3 stores 9× more parameters than Qwen2.5-72B but touches half as many per token. Per FFN layer, the active compute is actually smaller in the MoE model (~400M active parameters against ~727M). Dense models pay for every parameter on every token, full stop; MoE pays only for the parameters relevant to this token, plus a small router. And the routing-combinations row says it best: a dense FFN has exactly one \u0026ldquo;specialization\u0026rdquo; per layer (itself), while MoE has ~10¹⁴ possible specializations per layer per token. Expressivity grows combinatorially with the routing choice; compute does not.\nFine-grained MoE is one of the more elegant ideas of the current era: same compute budget as a dense model in the same class, but vastly more combinatorial expressivity and sharper per-expert specialization. It also reshapes the deployment problem: 256 narrow experts spread across a cluster behave very differently from one big FFN, and the systems machinery had to catch up to make it work at scale.\nThe picture below traces what 9-of-257 routing actually looks like for a single token:\n[Figure: one token through one MoE layer in DeepSeek-V3. token → router (scores all 256 routed experts, top 8 scores selected) → 8 routed experts plus the shared expert (9 of 257 experts run) → Σ weighted sum → output]\nThe arithmetic that gets you from 671B stored to 37B active is just the ratio in the picture: per layer, 9 of 257 experts run, so each layer touches ≈ 3.5% of its expert parameters. Stack 58 MoE layers (plus 3 dense FFN layers and attention) and you land at 37B active.\nAn MoE layer is a parameter store with a router on top. Compute touches a few percent of it per token; the rest sits in HBM waiting to be the right expert for some other token.\nThe decode-bandwidth bill is set by what\u0026rsquo;s active, not what\u0026rsquo;s stored. You pay for 37B and get the capacity of 671B. 
Once routing is the thing deciding which weights get used, where the experts live becomes the central design question — and the rest of this article is about answering it.\n4. Why TP doesn\u0026rsquo;t fit experts Each MoE layer holds 11B of expert weights, far too much for a single card. So the natural first attempt is to spread them with TP — the parallelism we already know from articles 02 and 03. It doesn\u0026rsquo;t fit, for two reasons.\nTensor parallelism slices one big matrix across GPUs and stitches the shards with an all-reduce. MoE breaks that premise on two axes:\nSize. Each DeepSeek-V3 expert is just 7168 × 2048 per matmul — already modest. Slice it 8 ways and each shard becomes 7168 × 256, too skinny to keep tensor cores fed. Fine-grained MoE went smaller per expert by design; TP wants the opposite. Structure. You have 257 independent small matrices, not one big one, and any token uses only 9. TP-slicing each would fire an all-reduce per expert per layer for computation that was already factored. The natural cut: put whole experts on different GPUs. Each card holds a subset of the 256 routed experts and runs them locally on whatever tokens land there. The per-matmul all-reduce disappears; what replaces it is the routing itself — moving tokens to the right card and back.\nThat\u0026rsquo;s expert parallelism.\n5. EP, end to end Time to zoom in. Take one transformer block, and inside that block focus on the FFN — which in DeepSeek-V3 is the MoE layer. Suppose attention has already run and produced its output for a batch of T tokens. Those activations are sitting in HBM, ready for the FFN to consume.\nThe FFN isn\u0026rsquo;t one big matmul anymore. It\u0026rsquo;s 256 routed experts plus 1 shared expert, and in this section we\u0026rsquo;ll spread them across 8 cards with EP = 8 — 32 routed experts per card, plus the shared expert replicated everywhere. 
The question this section answers: how does an MoE layer actually compute, with experts living on 8 different cards and every token wanting its own subset?\nTo keep the picture simple, assume each card already holds its own unique slice of the batch — T / 8 tokens per card, all different, no overlap with the other cards. (This is the layout you\u0026rsquo;d get with pure data parallelism across cards, no TP in attention. §6 layers in TP and SP and shows how to arrive at this same layout when those are in play too.)\nWalk one MoE layer end to end.\nStep 1 — route. Each GPU runs the router on its local tokens. The router is a tiny matmul (d × E); its output is, per token, the 8 chosen expert IDs and their weights. The assignment is the entire content of the dispatch plan.\nStep 2 — all-to-all dispatch. Every GPU now knows which of its tokens want experts on which other GPUs. Tokens get packed and sent. A single token may go to up to 8 different destination GPUs (one per chosen expert). After this exchange, every GPU is holding the set of tokens that want its experts.\nStep 3 — local expert compute. Each GPU runs its 32 experts on the tokens it received. Just SwiGLU FFN on whatever subset of tokens picked each expert. Pure local compute, no communication. The shared expert runs in parallel on the local tokens.\nStep 4 — all-to-all combine. Send each token\u0026rsquo;s expert outputs back to its origin GPU. There the outputs are weighted by the router scores and summed (along with the shared expert\u0026rsquo;s output) to form this layer\u0026rsquo;s FFN output.\nTwo all-to-alls per MoE layer, one round of local FFN in the middle. 
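The four steps can be traced in a single-process toy. This is a sketch under loud assumptions: the sizes are shrunk to 8 experts on 4 simulated GPUs with top-2 routing, tokens are plain floats, `route` draws random scores instead of running a learned router matmul, `expert_ffn` is a stand-in for the SwiGLU FFN, and the shared expert is left out. None of these names come from a real framework. The two all-to-alls are modeled as Python lists acting as per-GPU inboxes.

```python
import random

NUM_GPUS = 4
NUM_EXPERTS = 8                       # toy sizes; the article uses 256 routed experts on 8 cards
TOP_K = 2
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS


def expert_ffn(expert_id, x):
    # Stand-in for a SwiGLU FFN: any per-expert function of the token works here.
    return (expert_id + 1) * x


def route(x, rng):
    # Stand-in router: ignores x, draws random scores, keeps the top-k experts
    # and normalises their weights so they sum to 1.
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]


def moe_layer(tokens_per_gpu, seed=0):
    rng = random.Random(seed)

    # Step 1: each GPU routes its own local tokens.
    plans = [[route(x, rng) for x in local] for local in tokens_per_gpu]

    # Step 2: all-to-all dispatch. inbox[dst] collects tokens bound for dst's experts.
    inbox = [[] for _ in range(NUM_GPUS)]
    for src, local in enumerate(tokens_per_gpu):
        for idx, x in enumerate(local):
            for expert_id, weight in plans[src][idx]:
                dst = expert_id // EXPERTS_PER_GPU
                inbox[dst].append((src, idx, expert_id, weight, x))

    # Step 3: local expert compute on whatever arrived, no communication.
    outbox = [[(src, idx, weight, expert_ffn(e, x)) for src, idx, e, weight, x in msgs]
              for msgs in inbox]

    # Step 4: all-to-all combine. Outputs return to the origin GPU, where they
    # are weighted by the router scores and summed.
    out = [[0.0] * len(local) for local in tokens_per_gpu]
    for msgs in outbox:
        for src, idx, weight, y in msgs:
            out[src][idx] += weight * y
    return out, plans, inbox
```

Running `moe_layer([[1.0, 2.0], [3.0], [4.0, 5.0], [6.0]])` and inspecting `inbox` shows the uneven per-GPU load that routing produces even at this scale, which is the utilization concern raised later in this section.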
That\u0026rsquo;s it.\n[Figure: one MoE layer under EP = 8. GPUs 0 to 7 each hold local tokens plus the router decision → all-to-all #1 (dispatch tokens to the GPUs holding their chosen experts) → local expert FFN on received tokens → all-to-all #2 (combine expert outputs back to the origin GPU) → per-GPU output (weighted sum + shared). Tokens reshuffle by routing → compute locally on the GPU that owns the expert → reshuffle back.]\nThree things this picture makes visible.\nEach expert sees only the tokens that picked it. That\u0026rsquo;s the win. GPU 3 doesn\u0026rsquo;t run its 32 experts on all T tokens; it runs them on the fraction routing sent its way. With uniform routing, each routed expert sees T · 8 / 256 = T / 32 tokens, and total FFN compute across the cluster matches a dense 9/257-sized FFN exactly, which is what the \u0026ldquo;37B active\u0026rdquo; number was claiming all along.\nThe all-to-alls move activations, not weights. Each token carries a d = 7168 fp16 vector ≈ 14 KB through dispatch, and the same again through combine. Small per token, real per iteration, and the traffic is fully meshed, every card sending to every other.\nRouting decides everything about utilization. If 90% of tokens route to one GPU\u0026rsquo;s experts, that GPU bottlenecks the whole layer and the others sit idle waiting for the all-to-all combine. Uniform routing is what makes the scheme efficient. DeepSeek and others spend real complexity on the losses, biases, and dispatch constraints that keep routing close to balanced. We\u0026rsquo;ll treat that as its own topic.\n6. One node, three parallelisms: TP × EP × SP §5 was pure EP: no TP, each card with its own batch. Real production layouts compose EP with attention\u0026rsquo;s TP and a third axis, sequence parallelism (SP), all on the same hardware. 
The composition is where the design gets interesting.\nThe deployment unit: one 8-GPU H200 node (8 × 141 GB ≈ 1.1 TB of HBM, enough for DeepSeek-V3\u0026rsquo;s fp8 weights plus KV cache), with TP = 4 for attention, EP = 8 for MoE, and SP = 4 matching TP. The 8 cards split into 2 TP-groups of 4, each running its own batch. The full expert set lives within the node: 256 routed experts spread 32 per card, plus the shared expert replicated everywhere.\n[Figure: one 8-GPU node on NVLink, TP = 4 · EP = 8 · SP = 4. Two TP-groups of 4 cards run independent batches (TP-A · batch 0 on GPUs 0 to 3, TP-B · batch 1 on GPUs 4 to 7); together the 8 cards hold all 256 routed experts (32 per card). Throughput scales by replicating this unit across more nodes, with no cross-node traffic during forward.]\nThis node is the complete inference unit; everything happens inside it. To handle more concurrent batches, replicate the node; replicas are fully independent during forward, with no cross-node coordination beyond request routing at the front. (DeepSeek-V3 at fp16 is ~1.3 TB, which doesn\u0026rsquo;t fit in this node\u0026rsquo;s ~1.1 TB of HBM, let alone the 640 GB of an 8 × H100 node; production deployments use fp8 weights or stretch EP across 2 nodes. The parallelism pattern below is the same either way; we keep the picture at 8 cards for clarity.)\nThe key property: every collective stays inside the node, where cards talk fast. Nothing has to cross between nodes during forward; cross-node communication is much slower, and the layout is designed to avoid it entirely.\nNow we get to the part that\u0026rsquo;s actually fun: how do these three parallelisms compose, and why does the design feel elegant rather than just bolted together?\nThe trick is to forget the names of the collectives for a moment and just track the shape of the data on each card at each stage. 
We\u0026rsquo;ll introduce the names as labels for shape-changes we see in the picture.\nThe activation table, and two ways to split it Inside one transformer block, think of the activations as a 2D table: rows are tokens, columns are hidden features. The whole point of multi-card parallelism is to spread this table across the 4 cards in a TP-group so they can work in parallel.\nThere are two natural ways to split:\n[Figure: two ways to split the activation table across 4 cards; same total data, different shape on each card. Column-sharded: each card holds all tokens · 1/4 features (attention\u0026rsquo;s natural fit, per-head matmuls). Row-sharded: each card holds 1/4 tokens · all features (MoE\u0026rsquo;s natural fit, per-token dispatch).]\nColumn-sharded: each card has all the tokens, but only a slice of features (a quarter, since TP=4).\nRow-sharded: each card has all the features, but only a slice of tokens.\nSame total bytes, same data, just shaped differently on each card. The two layouts are interconvertible by reshuffling bytes between cards. That\u0026rsquo;s all a \u0026ldquo;collective\u0026rdquo; really is: a reshuffle that turns one layout into another.\nAttention naturally wants column-sharded Attention\u0026rsquo;s TP work is column-sharded. Each card holds 1/4 of the attention heads (which corresponds to a slice of output features), runs them on all the tokens, and produces a partial result. To get the full output, the 4 cards have to merge their partials: each contributes its slice, they all sum into one, and every card ends up with the same complete output.\nThat merge has a name: all-reduce. It\u0026rsquo;s a convergent-then-divergent pattern (partials flow in, the sum flows back out), and after it runs, every card holds the same full result.\nSo after attention\u0026rsquo;s all-reduce, the layout is: every card has every token at full features. Fully duplicated. 
That\u0026rsquo;s fine for the FFN that follows in a dense model, which expects the full input on every card.\nMoE doesn\u0026rsquo;t want duplicated data: it wants unique tokens per card Now we hit MoE. Each token needs to go to its experts; experts live on specific cards. The natural unit of work is one token at a time: card X is responsible for sending its own tokens, card Y compiles results for its own tokens, and so on.\nBut after attention\u0026rsquo;s all-reduce, all 4 cards in the TP-group hold the same full sequence. If they all try to dispatch their tokens, we end up sending each token from 4 different cards, a 4× duplication of bandwidth and work. We want each card to own a different slice of tokens. In the table picture: we want row-sharded after attention.\nThe fix: end attention with reduce-scatter Here\u0026rsquo;s the move that makes everything compose. Instead of ending attention with a full all-reduce, end it with a reduce-scatter: same partial-merging step, but the result is scattered across cards by row instead of replicated to all of them.\nReduce-scatter is actually cheaper than all-reduce: about half the fabric bytes. (A ring all-reduce is internally a reduce-scatter followed by an all-gather, so RS alone is half the work.) The thing that changes is where the merged result lands: duplicated on every card (all-reduce), or split row-wise across cards (reduce-scatter). After reduce-scatter, the layout is row-sharded: each card holds a unique 1/4 of tokens at full features. Exactly what MoE wants, and we get there for half the bytes of a full all-reduce. We\u0026rsquo;ll pay the other half later, when we put the sequence back together with an all-gather before the next attention block.\nSending tokens to experts: the all-to-all Each card now has its own tokens. The router decides which experts each token wants, and each card sends each token to the card holding that expert. 
Every card is sending to every other card simultaneously, but each card-to-card transfer carries different data.\nThis \u0026ldquo;different data to every destination, all in parallel\u0026rdquo; pattern has a name: all-to-all. It\u0026rsquo;s not a single mysterious operation — it\u0026rsquo;s just everyone sending different slices to everyone in parallel.\n(Quick name check: an all-reduce sends the same data to everyone — the merged sum. An all-to-all sends different data to each destination. Same hardware, different shapes.)\nAfter the experts run on the tokens they received, the results need to come back. A second all-to-all — the combine — sends each token\u0026rsquo;s expert output back to its origin card. Layout: still row-sharded.\nClosing the loop: all-gather before the next attention The next attention block wants column-sharded again. To get there from row-sharded, we all-gather: every card sends its slice of tokens to every other card, and each card ends up with all tokens.\nTwo things to notice:\nAll-gather is the opposite of reduce-scatter (one scatters; one gathers). Reduce-scatter + all-gather, taken together, move exactly the same total bytes as one all-reduce would have. The trick was just to do half the work (RS) at the end of attention and the other half (AG) at the start of the next attention block — with MoE happening in between, in the row-sharded layout it wanted. 
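The claim that reduce-scatter followed by all-gather does the same job as one all-reduce can be checked on nested lists. This is a minimal single-process sketch, not how NCCL or any real backend implements the collectives: each simulated card holds a full-size table of partial sums (rows are tokens, columns are features), and the function names are chosen for this example. `ring_bytes_per_card` uses the standard ring cost model, in which each of the two phases moves (N-1)/N of the table per card.

```python
NUM_CARDS = 4


def all_reduce(partials):
    # Every card ends with the full element-wise sum (replicated).
    rows, cols = len(partials[0]), len(partials[0][0])
    total = [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]
    return [[row[:] for row in total] for _ in partials]


def reduce_scatter(partials):
    # Same sum, but card i keeps only its contiguous block of rows (tokens).
    rows, cols = len(partials[0]), len(partials[0][0])
    per_card = rows // len(partials)
    shards = []
    for card in range(len(partials)):
        block = range(card * per_card, (card + 1) * per_card)
        shards.append([[sum(p[r][c] for p in partials) for c in range(cols)]
                       for r in block])
    return shards


def all_gather(shards):
    # Every card receives every other card's rows, concatenated in card order.
    full = [row[:] for shard in shards for row in shard]
    return [[row[:] for row in full] for _ in shards]


def ring_bytes_per_card(table_bytes, num_cards):
    # Ring cost model: each phase sends (N-1)/N of the table from every card.
    step = table_bytes * (num_cards - 1) / num_cards
    return {'reduce_scatter': step, 'all_gather': step, 'all_reduce': 2 * step}
```

With 8 tokens and 4 cards, `reduce_scatter` leaves each card holding 2 rows at full width, which is the row-sharded layout the MoE dispatch starts from.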
The whole flow, on one picture\n[Figure: data flow through one layer, dense vs MoE. Dense layer (TP = 4), two all-reduces per block, same data on every card: full input on every card → attention on 1/4 of the heads → all-reduce #1 → full hidden on every card → FFN on 1/4 of the intermediate dim → all-reduce #2 → full output on every card. MoE layer, where attention\u0026rsquo;s AR becomes a reduce-scatter and the FFN\u0026rsquo;s AR becomes an all-to-all pair: cards hold tokens A, B, C, D → attention (TP) + RS → all-to-all dispatch (each card sends different data to each destination) → experts (mix) → all-to-all combine (expert outputs flow back to their origin card) → tokens A, B, C, D out.]\nTop half (dense): every arrow converges to the same sum, then radiates back, so the same data sits on every card. Bottom half (MoE): every arrow carries different data to a different destination, so different tokens sit on every card. Same hardware, very different traffic shape.\nThe per-block rhythm, in plain language\n1. Start: row-sharded (each card has its slice of tokens, all features).\n2. All-gather → column-sharded (every card has all tokens, slice of features).\n3. Attention compute (TP=4 across heads).\n4. Reduce-scatter → row-sharded again (each card has its slice of tokens).\n5. Router decides per-token expert IDs.\n6. All-to-all dispatch → tokens routed to expert-holding cards.\n7. Experts compute locally (each card runs its 32 experts on the tokens it received).\n8. All-to-all combine → results back to origin (row-sharded again).\n9. Exit row-sharded; the next block starts at step 2.\nTwo more details worth pulling out:\nThe MoE all-to-all mingles batches. 
Both TP-groups in the node dispatch into the same EP=8 all-to-all, so the experts see tokens from both batches simultaneously. Different batches happily share expert weights. Four collectives per block, all intra-node. All-gather, reduce-scatter, dispatch, combine — every one stays inside the 8-GPU node, no cross-node hops. The aesthetic to take away: every collective stays inside the node, and the data layout transforms cleanly through MoE\u0026rsquo;s parallelism axis. Twice per MoE layer, row-sharded data opens up into a fully meshed exchange and settles back. The whole dance fits inside one box.\n7. What this opens We\u0026rsquo;ve gone from a dense FFN to a sparse one, from one big matmul to 257 small ones, from per-layer all-reduce to per-layer meshed all-to-all. Compute per token went down; comms per token went up; the engineering attention moved from \u0026ldquo;make HBM fast\u0026rdquo; to \u0026ldquo;make the interconnect not the bottleneck.\u0026rdquo;\nWhat\u0026rsquo;s left for future articles:\nRouting as a load-balancing problem. This article assumed routing is roughly uniform. In practice the router is a tiny matmul whose outputs decide which cards are busy and which sit idle waiting for combine. Aux losses, expert biases, dispatch caps, drop-token behavior, shared-expert design — the levers production MoEs use to keep the workload balanced get their own treatment. Overlapping the all-to-all with compute. A naive implementation stalls on dispatch and combine. Production stacks split the batch into micro-batches and pipeline them through dispatch / compute / combine, or kick off the next layer\u0026rsquo;s local work while the current all-to-all is still in flight. This is where deployment frameworks (TensorRT-LLM, vLLM, SGLang, Megatron) actually compete on MoE numbers. MoE under disaggregation. Article 06 split prefill and decode. 
MoE has its own version of the same question: prefill all-to-alls move many tokens at once (good per-token amortization, fabric stays saturated); decode all-to-alls move tens of tokens (bad amortization, fabric latency dominates). The two phases may want different EP sizes or different placement strategies. The next article picks up the disaggregation-engineering thread from article 06; once that\u0026rsquo;s in place we\u0026rsquo;ll come back to how MoE fits into a disaggregated stack. Same grammar as the rest of the series. Name the bottleneck, factor the workload until each piece sees only the bottleneck that binds it, optimize per piece. MoE factors the FFN along the parameter axis: most parameters sit idle for most tokens, and EP is how \u0026ldquo;sit idle on a different GPU\u0026rdquo; actually works in practice.\n","permalink":"https://wgzesg.github.io/llm_stories/posts/07-moe-and-expert-parallelism/","summary":"Dense FFNs hit a wall around the hundreds-of-billions mark — every token reads every parameter, and bandwidth runs out. Mixture-of-Experts breaks the symmetry: many small FFNs, a router that picks a few per token, capacity decoupled from compute. This article builds MoE from the dense FFN, anchors on DeepSeek-V3\u0026rsquo;s 671B/37B split, and walks through expert parallelism end to end.","title":"MoE and Expert Parallelism: From One Big FFN to 256 Small Ones"}]