About this series

These are learning notes — me working through how modern LLMs are actually served, mostly by talking to Claude and writing up the parts that finally clicked. The articles themselves are written in a confident “discovery journey” voice, but the project underneath is just someone learning in public.

The list below is alive — articles flip status as they ship, and the roadmap grows whenever a discussion surfaces a hole worth digging.


Articles

# | Title | Status | Link
01 | An LLM, end to end — bird’s-eye stack, one block, the generation loop, and the questions the rest of the series picks up | [next] |
02 | Tensor parallelism, built from scratch in your head | [done] | read →
03 | Walking TP through a full block — start column-parallel everywhere, watch the comm explode, pair with row-parallel until two all-reduces per block fall out | [done] | read →
04 | How to batch many requests through one forward pass — varlen attention, prefill only, TP turns out to be untouched | [done] | read →
05 | ORCA and chunked prefill — iteration-level scheduling solves the boundary problems; chunked prefill bounds the iteration so a long prompt can’t hijack the engine’s heartbeat | [done] | read →
06 | Prefill and decode disaggregation — two phases on opposite sides of the roofline; once you accept the asymmetry, sharing a GPU pool is no longer a compromise but a fight against the formula | [done] | read →
07 | The engineering of disaggregation — KV cache transfer across fabrics (NVLink, NVSwitch, IB, PCIe), tiered memory pools (HBM, DRAM, SSD), overlap with prefill, topology-aware routing | [next] |
08 | Pipeline parallelism — the cut across blocks instead of within one, and the bubble it creates; why the prefill pool wants it | [planned] |
09 | MoE and expert parallelism — what changes when FFN becomes routed | [planned] |
10 | PagedAttention — the KV cache as virtual memory, blocks instead of contiguous slabs, copy-on-write across requests | [planned] |
11 | Sequence and context parallelism — splitting one request across GPUs, ring attention, the long-context move | [planned] |
12 | FlashAttention — tiled online softmax, why the [L × L] score matrix never has to exist | [speculative] |
13 | FlashDecoding — making the 1 × L_kv decode-attention call fast under bandwidth pressure | [speculative] |
14 | GQA and MLA — fewer KV heads, smaller KV cache, faster decode (and what it costs the model) | [speculative] |
15 | Speculative decoding — a draft model proposes, the big model verifies, two passes for the price of one | [speculative] |
16 | KV compression — quantization, eviction policies, what we can drop and what we can’t | [speculative] |

Status legend

[done] shipped & linked · [next] actively drafting · [planned] on deck, will get there · [speculative] a hole worth digging — may or may not get filled, but the question is interesting


Recurring threads worth flagging

A few observations that keep showing up across articles, worth keeping in the back of your mind as you read:

  • TP turns out to be remarkably hard to disturb. Request batching didn’t disturb it (Article 04), and continuous batching + chunked prefill didn’t either (Article 05). PP and MoE do interact with TP in interesting ways, which is why those come up next.
  • The KV cache is the connective tissue from Article 05 onward. It enters with decode and never really leaves; it’s also the thing that makes long contexts hard.
  • Decode flips the bottleneck profile. Articles 02–04 assume prefill, where compute dominates. Once decode is in scope (Article 05 onward), memory bandwidth becomes the binding constraint: every step re-reads the weights and the KV cache to produce a single token per request. That’s what motivates almost every later optimization (FlashDecoding, GQA, prefill/decode disaggregation, speculative decoding).
  • Modelers’ choices keep turning out to be load-bearing for serving in ways that weren’t designed in. Multi-head independence let TP shard attention by head with no communication inside the attention computation; it also kept request batching comm-free; it’ll show up again when we look at GQA/MLA. Worth tracking as a recurring theme.
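
The "decode flips the bottleneck" thread can be put in rough numbers. What follows is a back-of-the-envelope sketch, not a measurement: it counts FLOPs and bytes moved for a single linear layer and compares the ratio against the ridge point of a made-up accelerator. The function names, the hidden size of 8192, and the peak FLOP/s and bandwidth figures are all illustrative assumptions, and attention and KV-cache traffic are ignored entirely.

```python
def linear_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one [tokens x d_in] @ [d_in x d_out] matmul."""
    flops = 2 * tokens * d_in * d_out                      # one multiply + one add per MAC
    weight_bytes = d_in * d_out * bytes_per_elem           # weights are read once
    act_bytes = tokens * (d_in + d_out) * bytes_per_elem   # read inputs, write outputs
    return flops / (weight_bytes + act_bytes)


# Hypothetical accelerator with round numbers (assumption, not a real part):
PEAK_FLOPS = 1000e12          # 1000 TFLOP/s
MEM_BANDWIDTH = 3e12          # 3 TB/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH   # FLOPs/byte needed to keep the math units busy


def regime(tokens: int, d_model: int = 8192) -> str:
    """Classify a square linear layer as compute- or bandwidth-bound."""
    intensity = linear_intensity(tokens, d_model, d_model)
    return "compute-bound" if intensity >= RIDGE else "bandwidth-bound"


if __name__ == "__main__":
    for label, tokens in [("prefill, 4096-token prompt", 4096),
                          ("decode, 1 request", 1),
                          ("decode, 64 requests batched", 64)]:
        ai = linear_intensity(tokens, 8192, 8192)
        print(f"{label}: {ai:.1f} FLOPs/byte -> {regime(tokens)}")
```

Under these assumptions the prefill case lands far above the ridge point and the single-request decode case sits below one FLOP per byte, which is the asymmetry Article 06 builds on. Batching more decode requests raises the intensity roughly in proportion to the batch size, but in this toy model it takes a few hundred concurrent requests to cross the ridge.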