LLM Stories

https://wgzesg.github.io/llm_stories/
Recent content on LLM Stories (Wed, 27 May 2026)

Roadmap
https://wgzesg.github.io/llm_stories/posts/00-roadmap/
Wed, 29 Apr 2026
A living roadmap of the LLM Stories series. Tracks shipped articles and the questions each one opens up next.

An LLM, End to End
https://wgzesg.github.io/llm_stories/posts/01-llm-end-to-end/
Wed, 06 May 2026
Fundamentals of a modern decoder-only LLM at three zoom levels: the bird's-eye stack, one transformer block, and the generation loop. The on-ramp for the rest of the LLM Stories series.

Tensor Parallelism, Built From Scratch in Your Head
https://wgzesg.github.io/llm_stories/posts/02-tensor-parallelism-mental-model/
Sun, 26 Apr 2026
A mental model for tensor parallelism, derived from one matmul in a transformer's prefill phase. Two ways to read a weight matrix, two ways to split it across GPUs.

Walking Tensor Parallelism Through a Full Block
https://wgzesg.github.io/llm_stories/posts/03-tp-through-a-full-block/
Wed, 29 Apr 2026
How to split a full transformer block across two GPUs, with concrete shapes traced through every step. Start with column-parallel everywhere, see why it costs four gathers per block, then pair it with row-parallel to land at the Megatron pattern of two all-reduces per block.

How to Batch Many Requests Through One Forward Pass
https://wgzesg.github.io/llm_stories/posts/04-batching-many-requests/
Sun, 03 May 2026
How to batch many concurrent prefill requests through a TP-parallelized transformer. Walk a full block on a flattened multi-request tensor and watch where batching is free vs. where it isn't.

ORCA and Chunked Prefill: Evening Out the Iteration
https://wgzesg.github.io/llm_stories/posts/05-orca-and-chunked-prefill/
Wed, 06 May 2026
How iteration-level scheduling (ORCA) and chunked prefill flatten per-iteration cost. We walk the cost variance, see what ORCA fixes and what it leaves open, and watch chunked prefill bound the longest iteration.

Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline
https://wgzesg.github.io/llm_stories/posts/06-prefill-decode-disaggregation/
Sat, 09 May 2026
A roofline-first argument for prefill/decode disaggregation. Defines arithmetic intensity, derives that intensity ≈ tokens-per-iteration for transformers, sweeps context length to show decode falling further below the ridge as L grows, then walks through the split and the KV-transfer cost it introduces.

MoE and Expert Parallelism: From One Big FFN to 256 Small Ones
https://wgzesg.github.io/llm_stories/posts/07-moe-and-expert-parallelism/
Wed, 27 May 2026
Introduces MoE as a conditional FFN: same shape of computation, but per-token routing across many small experts. Anchors on DeepSeek-V3 (671B total, 37B active, 256 routed + 1 shared experts, top-8). Explains why TP is the wrong cut for experts, then walks through expert parallelism: route, all-to-all dispatch, local FFN, all-to-all combine.