Roadmap

About this series

These are learning notes — me working through how modern LLMs are actually served, mostly by talking to Claude and writing up the parts that finally clicked. The articles themselves are written in a confident “discovery journey” voice, but the project underneath is just someone learning in public.

The list below is alive — articles flip status as they ship, and the roadmap grows whenever a discussion surfaces a hole worth digging.

Articles

#	Title	Status	Link
01	An LLM, end to end — bird’s-eye stack, one block, the generation loop, and the questions the rest of the series picks up	`[next]`	—
02	Tensor parallelism, built from scratch in your head	`[done]`	read →
03	Walking TP through a full block — start column-parallel everywhere, watch the comm explode, pair with row-parallel until two all-reduces per block fall out	`[done]`	read →
04	How to batch many requests through one forward pass — varlen attention, prefill only, TP turns out to be untouched	`[done]`	read →
05	ORCA and chunked prefill — iteration-level scheduling solves the boundary problems; chunked prefill bounds the iteration so a long prompt can’t hijack the engine’s heartbeat	`[done]`	read →
06	Prefill and decode disaggregation — two phases on opposite sides of the roofline; once you accept the asymmetry, sharing a GPU pool is no longer a compromise but a fight against the formula	`[done]`	read →
07	The engineering of disaggregation — KV cache transfer across fabrics (NVLink, NVSwitch, IB, PCIe), tiered memory pools (HBM, DRAM, SSD), overlap with prefill, topology-aware routing	`[next]`	—
08	Pipeline parallelism — the cut across blocks instead of within one, and the bubble it creates; why the prefill pool wants it	`[planned]`	—
09	MoE and expert parallelism — what changes when FFN becomes routed	`[planned]`	—
10	PagedAttention — the KV cache as virtual memory, blocks instead of contiguous slabs, copy-on-write across requests	`[planned]`	—
11	Sequence and context parallelism — splitting one request across GPUs, ring attention, the long-context move	`[planned]`	—
12	FlashAttention — tiled online softmax, why the `[L × L]` score matrix never has to exist	`[speculative]`	—
13	FlashDecoding — making the `1 × L_kv` decode-attention call fast under bandwidth pressure	`[speculative]`	—
14	GQA and MLA — fewer KV heads, smaller KV cache, faster decode (and what it costs the model)	`[speculative]`	—
15	Speculative decoding — a draft model proposes several tokens, the big model verifies them in one pass, several tokens for roughly the price of one big-model step	`[speculative]`	—
16	KV compression — quantization, eviction policies, what we can drop and what we can’t	`[speculative]`	—

Status legend

[done] shipped & linked · [next] actively drafting · [planned] on deck, will get there · [speculative] a hole worth digging — may or may not get filled, but the question is interesting

Recurring threads worth flagging

A few observations that keep showing up across articles, worth keeping in the back of your mind as you read:

TP turns out to be remarkably non-disruptive. Request batching didn’t disturb it (Article 04), and continuous batching + chunked prefill didn’t either (Article 05). PP and MoE do interact with TP in interesting ways — that’s why those come up next.
The KV cache is the connective tissue from Article 05 onward. It enters with decode and never really leaves; it’s also the thing that makes long contexts hard.
Decode flips the bottleneck profile. Articles 02–04 assume prefill, where compute dominates. Once decode is in scope (Article 05 onward), memory bandwidth becomes the binding constraint: every step re-reads the weights and the KV cache to produce just one new token per request. That is what motivates almost every later optimization (FlashDecoding, GQA, prefill/decode disaggregation, speculative decoding).
Modelers’ choices keep turning out to be load-bearing for serving in ways that weren’t designed in. Multi-head independence made splitting attention across GPUs comm-free; it also made request batching comm-free; it’ll show up again when we look at GQA/MLA. Worth tracking as a recurring theme.
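
The "decode flips the bottleneck profile" thread can be checked with back-of-envelope arithmetic. What follows is a rough sketch, not a measurement: it computes FLOPs per byte for a single linear layer and compares that against a roofline ridge point. The hidden size, the fp16 assumption, and the accelerator's peak numbers are all made up for illustration, and the helper names `arithmetic_intensity` and `binding_constraint` are hypothetical.

```python
# Back-of-envelope roofline check for one linear layer, y = x @ W.
# All hardware numbers below are illustrative assumptions, not measurements.

BYTES_PER_ELEM = 2  # assuming fp16/bf16 weights and activations


def arithmetic_intensity(tokens: int, d_in: int, d_out: int) -> float:
    """FLOPs per byte moved for a [tokens, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * tokens * d_in * d_out  # one multiply plus one add per MAC
    bytes_moved = BYTES_PER_ELEM * (
        d_in * d_out        # read the weight matrix once
        + tokens * d_in     # read the input activations
        + tokens * d_out    # write the output activations
    )
    return flops / bytes_moved


def binding_constraint(intensity: float, ridge: float) -> str:
    """Below the ridge point a kernel is bandwidth-bound, above it compute-bound."""
    return "compute" if intensity >= ridge else "bandwidth"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
PEAK_BANDWIDTH = 3e12
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # about 333 FLOPs per byte

D_MODEL = 4096  # hypothetical hidden size

# Prefill pushes a whole prompt through the layer; decode pushes one token.
prefill = arithmetic_intensity(tokens=4096, d_in=D_MODEL, d_out=D_MODEL)
decode = arithmetic_intensity(tokens=1, d_in=D_MODEL, d_out=D_MODEL)

print(f"prefill: {prefill:.1f} FLOPs/byte -> {binding_constraint(prefill, RIDGE)}-bound")
print(f"decode:  {decode:.1f} FLOPs/byte -> {binding_constraint(decode, RIDGE)}-bound")
```

Under these assumptions the prefill matmul lands well above the ridge and the single-token decode matmul sits near 1 FLOP per byte, far below it. Batching more decode requests raises `tokens` for the weight matmuls, but each request's KV-cache read in attention is not shared across the batch, so that part stays bandwidth-bound.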
