LLM Serving

An LLM, End to End

Three zoom levels — the model end-to-end, one transformer block opened up, and the loop that turns a prompt into output. Just enough to ask the right questions about everything that comes after.

Tensor Parallelism, Built From Scratch in Your Head

Two ways to read a weight matrix, two ways to split it across GPUs. A mental model for tensor parallelism, derived from one matmul in a transformer’s prefill phase.
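As a rough illustration of the two splits, here is a minimal sketch that simulates the GPUs as loop iterations in plain Python. It assumes the two cuts are the usual column-wise and row-wise splits of the weight matrix and that the dimensions divide evenly by the GPU count. The function names are hypothetical and not taken from the article.

```python
def matmul(x, w):
    """Plain-Python product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_cut(x, w, n_gpus):
    """Each simulated GPU holds a slice of W's columns and the full X.

    The per-GPU outputs are concatenated along the feature axis,
    which would be an all-gather on real hardware.
    """
    step = len(w[0]) // n_gpus  # assumes the column count divides evenly
    parts = []
    for g in range(n_gpus):
        w_g = [row[g * step:(g + 1) * step] for row in w]
        parts.append(matmul(x, w_g))
    return [sum((p[i] for p in parts), []) for i in range(len(x))]


def row_cut(x, w, n_gpus):
    """Each simulated GPU holds a slice of W's rows and the matching columns of X.

    The per-GPU outputs are full-shaped partial sums that are added together,
    which would be an all-reduce on real hardware.
    """
    step = len(w) // n_gpus  # assumes the row count divides evenly
    total = [[0] * len(w[0]) for _ in x]
    for g in range(n_gpus):
        x_g = [row[g * step:(g + 1) * step] for row in x]
        w_g = w[g * step:(g + 1) * step]
        partial = matmul(x_g, w_g)
        total = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(total, partial)]
    return total


X = [[1, 2, 3, 4], [5, 6, 7, 8]]                                # 2 tokens, hidden size 4
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]    # 4 x 4 weight
reference = matmul(X, W)
print(column_cut(X, W, 2) == reference, row_cut(X, W, 2) == reference)
```

Both cuts reproduce the unsplit product. They differ in what each GPU must hold and in which collective stitches the result back together.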

Walking Tensor Parallelism Through a Full Block

Walk article 02’s two cuts through a full transformer block, with concrete shapes on each GPU at every step. Apply one cut to every matmul first, and communication explodes (four gathers per block). Then pair the two cuts as duals and watch them snap into the architecture’s widen-narrow rhythm, landing at two all-reduces per block.

How to Batch Many Requests Through One Forward Pass

Many users hit the model at once with different-length prompts. Walk through one transformer block on a flat multi-request tensor and see which layers batch for free and which need a real fix — and whether TP has to change.

ORCA and Chunked Prefill: Evening Out the Iteration

Many requests, each finishing at a different time, and some carrying prefills 1000× the size of a decode step. Per-iteration cost swings wildly. ORCA-style iteration-level scheduling fixes the first half by letting requests join and leave the batch at every iteration; chunked prefill fixes the second by bounding the largest iteration so short work isn’t dragged behind long work.
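To make the bounding idea concrete, here is a minimal sketch of how a long prefill might be cut into chunks that share each iteration with ongoing decodes. It assumes a fixed per-iteration token budget and a constant number of decoding requests. The names `token_budget` and `plan_iterations` are hypothetical, and a real scheduler would track far more state.

```python
def chunk_prefill(prompt_len, chunk_size):
    """Split a prompt into (start, end) spans of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(s, min(s + chunk_size, prompt_len)) for s in range(0, prompt_len, chunk_size)]


def plan_iterations(prompt_len, n_decodes, token_budget):
    """Plan iterations that each carry one token per decode plus a prefill chunk.

    Assumption: the decode count stays constant while the prefill runs.
    """
    prefill_budget = token_budget - n_decodes
    if prefill_budget <= 0:
        raise ValueError("token_budget leaves no room for prefill")
    return [
        {"decode_tokens": n_decodes, "prefill_span": span}
        for span in chunk_prefill(prompt_len, prefill_budget)
    ]


plan = plan_iterations(prompt_len=1000, n_decodes=8, token_budget=264)
for step in plan:
    start, end = step["prefill_span"]
    print(step["decode_tokens"] + (end - start), "tokens this iteration")
```

No iteration exceeds the budget, so the eight decoding requests keep advancing one token per iteration while the 1000-token prompt is absorbed in pieces.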

Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline

Article 05 left two phases politely sharing one engine. This article shows they shouldn’t — prefill is compute-bound, decode is bandwidth-bound, and long context drives the gap wider, not smaller. Once we accept the asymmetry, splitting them is the structural fix.

MoE and Expert Parallelism: From One Big FFN to 256 Small Ones

Dense FFNs hit a wall around the hundreds-of-billions-of-parameters mark: every token reads every parameter, and memory bandwidth runs out. Mixture-of-Experts breaks the symmetry: many small FFNs, a router that picks a few per token, capacity decoupled from compute. This article builds MoE from the dense FFN, anchors on DeepSeek-V3’s split of 671B total and 37B active parameters, and walks through expert parallelism end to end.
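The "router that picks a few per token" can be sketched for a single token as follows. This is a toy under stated assumptions: experts are scalar functions rather than FFNs, and gating is softmax followed by top-k with renormalization, which is one common choice and not necessarily the gating DeepSeek-V3 uses. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route(router_scores, k):
    """Return (expert_index, gate_weight) for the top-k experts of one token."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts only
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    return sum(w * experts[i](x) for i, w in route(router_scores, k))


# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 5)]
scores = [0.1, 2.0, -1.0, 3.0]  # router logits for one token (made-up values)
print(route(scores, k=2))
print(moe_forward(10.0, experts, scores, k=2))
```

Only k of the experts execute per token, which is how parameter count is decoupled from per-token compute. Reading 671B/37B as total versus active parameters, about 5.5% of the weights are touched for any one token.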