LLM Stories 👋

A series of essays building up mental models for how modern LLMs are actually served — written in plain language, no math notation, lots of ASCII diagrams.

The goal isn’t to teach equations. It’s to build the intuitions that make every later equation feel inevitable. Each article picks one slice of the LLM serving pipeline and walks through it as a discovery journey.

Roadmap

What this series is, and a living map of the articles — shipped, in progress, and the holes we’ve dug for our future selves to fill.

An LLM, End to End

Three zoom levels — the model end-to-end, one transformer block opened up, and the loop that turns a prompt into output. Just enough to ask the right questions about everything that comes after.

Tensor Parallelism, Built From Scratch in Your Head

Two ways to read a weight matrix, two ways to split it across GPUs. A mental model for tensor parallelism, derived from one matmul in a transformer’s prefill phase.
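
A taste of the two cuts, as a minimal NumPy sketch rather than the article's own code. The shapes, the two-way split, and the variable names are assumptions chosen for illustration; the "GPUs" are just list entries.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # 4 tokens, hidden size 8 (made-up sizes)
w = rng.standard_normal((8, 6))   # one weight matrix: 8 inputs -> 6 outputs
full = x @ w                      # what a single GPU would compute

# Cut 1: split W by columns. Every shard sees all of x and produces
# its own slice of the output columns; the slices are glued side by side.
w_col_shards = np.split(w, 2, axis=1)
col_out = np.concatenate([x @ shard for shard in w_col_shards], axis=1)

# Cut 2: split W by rows. Every shard sees only its slice of x's columns
# and produces a full-shape partial result; the partials are summed.
w_row_shards = np.split(w, 2, axis=0)
x_shards = np.split(x, 2, axis=1)
row_out = sum(xs @ ws for xs, ws in zip(x_shards, w_row_shards))
```

Both cuts reproduce `full`. The difference is the cleanup step: the column cut ends in a gather, the row cut ends in a sum.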

Walking Tensor Parallelism Through a Full Block

Walk article 02’s two cuts through a full transformer block, with concrete shapes on each GPU at every step. Apply one cut to every matmul first, and communication explodes: four gathers per block. Then pair the two cuts as duals and watch them snap into the architecture’s widen-narrow rhythm, landing at two all-reduces per block.
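
The pairing can be sketched on the FFN half of a block. This is a hedged NumPy illustration, not the article's walkthrough: the sizes are invented, ReLU stands in for whatever activation the real model uses, and the final `np.sum` plays the role of an all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 2                               # assumed degree of parallelism
x = rng.standard_normal((4, 8))          # 4 tokens, hidden size 8
w_up = rng.standard_normal((8, 32))      # the "widen" matmul
w_down = rng.standard_normal((32, 8))    # the "narrow" matmul

def relu(a):
    return np.maximum(a, 0.0)

reference = relu(x @ w_up) @ w_down      # single-GPU answer

partials = []
up_shards = np.split(w_up, n_gpus, axis=1)      # column cut on the widen step
down_shards = np.split(w_down, n_gpus, axis=0)  # row cut on the narrow step
for up, down in zip(up_shards, down_shards):
    hidden = relu(x @ up)            # (4, 16): stays local, nothing to gather
    partials.append(hidden @ down)   # (4, 8): a partial sum of the output

ffn_out = np.sum(partials, axis=0)   # the one communication step
```

Because the activation is elementwise, each shard can apply it to its own columns, so the column cut's output feeds the row cut's input with no communication in between.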

How to Batch Many Requests Through One Forward Pass

Many users hit the model at once with different-length prompts. Walk through one transformer block on a flat multi-request tensor and see which layers batch for free and which need a real fix — and whether TP has to change.

ORCA and Chunked Prefill: Evening Out the Iteration

Many requests, each finishing at a different time, and some carrying prefills 1000× the size of a decode step. Per-iteration cost swings wildly. ORCA-style iteration-level scheduling fixes one half; chunked prefill bounds the largest iteration so short work isn’t dragged behind long work.
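
To make the bounded iteration concrete, here is a toy scheduler sketch. Everything in it is hypothetical: the `Request` class, the `plan_iteration` helper, the 256-token budget, and the choice to serve decodes before prefill chunks are assumptions for illustration, not a description of ORCA or of any real engine. It also ignores the detail that the last prefill chunk normally emits the first output token.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    prefill_left: int   # prompt tokens not yet processed
    decode_left: int    # output tokens still to generate

def plan_iteration(requests, token_budget):
    """Pick the work for one iteration, never exceeding token_budget tokens."""
    plan, budget = {}, token_budget
    # Requests already decoding cost one token each, so they go first.
    for r in requests:
        if r.prefill_left == 0 and r.decode_left > 0 and budget > 0:
            plan[r.name] = 1
            r.decode_left -= 1
            budget -= 1
    # Whatever budget remains is handed out as prefill chunks.
    for r in requests:
        if r.prefill_left > 0 and budget > 0:
            chunk = min(r.prefill_left, budget)
            plan[r.name] = chunk
            r.prefill_left -= chunk
            budget -= chunk
    return plan

requests = [Request("long", prefill_left=1000, decode_left=4),
            Request("chat", prefill_left=0, decode_left=3)]
plans = []
while any(r.prefill_left or r.decode_left for r in requests):
    plans.append(plan_iteration(requests, token_budget=256))
```

In this run the 1000-token prompt is spread over four iterations, and the short `chat` request gets a token in each of the first three instead of waiting behind it.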

Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline

Article 05 left two phases politely sharing one engine. This article shows they shouldn’t — prefill is compute-bound, decode is bandwidth-bound, and long context drives the gap wider, not smaller. Once we accept the asymmetry, splitting them is the structural fix.
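
A back-of-the-envelope version of the asymmetry, as a sketch under stated assumptions: one 4096 by 4096 weight matrix, 2-byte values, a 2048-token prefill against a single-token decode step, and no KV-cache traffic at all. The `matmul_intensity` helper is hypothetical.

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for one (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out   # one multiply and one add per term
    values_moved = d_in * d_out + tokens * d_in + tokens * d_out
    return flops / (bytes_per_value * values_moved)

prefill = matmul_intensity(tokens=2048, d_in=4096, d_out=4096)
decode = matmul_intensity(tokens=1, d_in=4096, d_out=4096)
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

With these numbers prefill does about a thousand FLOPs per byte and decode does about one. The same weights are read in both cases; prefill simply amortizes the read over thousands of tokens, which is why the two phases land on opposite sides of the roofline.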

MoE and Expert Parallelism: From One Big FFN to 256 Small Ones

Dense FFNs hit a wall around the hundreds-of-billions-of-parameters mark: every token reads every parameter, and memory bandwidth runs out. Mixture-of-Experts breaks the symmetry with many small FFNs, a router that picks a few per token, and capacity decoupled from compute. This article builds MoE from the dense FFN, anchors on DeepSeek-V3’s split of 671B total parameters versus 37B active per token, and walks through expert parallelism end to end.
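
The "router picks a few per token" idea fits in a few lines of NumPy. This is a toy sketch with assumed sizes (8 experts, top 2), ReLU experts, and a softmax over the chosen scores. It is not DeepSeek-V3's actual router, whose gating and load balancing differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model, d_ff = 8, 2, 16, 32   # made-up sizes
x = rng.standard_normal((5, d_model))            # 5 tokens
router = rng.standard_normal((d_model, n_experts))
w_up = rng.standard_normal((n_experts, d_model, d_ff))
w_down = rng.standard_normal((n_experts, d_ff, d_model))

scores = x @ router                                  # (5, 8): one score per expert
chosen = np.argsort(scores, axis=1)[:, -top_k:]      # (5, 2): best experts per token
picked = np.take_along_axis(scores, chosen, axis=1)
gates = np.exp(picked - picked.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)            # mixing weights sum to 1

out = np.zeros_like(x)
for t in range(x.shape[0]):
    for slot in range(top_k):
        e = chosen[t, slot]
        hidden = np.maximum(x[t] @ w_up[e], 0.0)     # this expert's small FFN
        out[t] += gates[t, slot] * (hidden @ w_down[e])

total_params = w_up.size + w_down.size                    # what must be stored
active_params = top_k * (w_up[0].size + w_down[0].size)   # what one token reads
```

Here each token touches a quarter of the expert parameters (2 of 8 experts). That gap between stored and active parameters is the small-scale version of the total-versus-active split the article anchors on.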