<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>LLM Stories</title><link>https://wgzesg.github.io/llm_stories/</link><description>Recent content on LLM Stories</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 27 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wgzesg.github.io/llm_stories/index.xml" rel="self" type="application/rss+xml"/><item><title>Roadmap</title><link>https://wgzesg.github.io/llm_stories/posts/00-roadmap/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/00-roadmap/</guid><description>A living roadmap of the LLM Stories series. Tracks shipped articles and the questions each one opens up next.</description></item><item><title>An LLM, End to End</title><link>https://wgzesg.github.io/llm_stories/posts/01-llm-end-to-end/</link><pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/01-llm-end-to-end/</guid><description>Fundamentals of a modern decoder-only LLM at three zoom levels: the bird&amp;#39;s-eye stack, one transformer block, and the generation loop. The on-ramp for the rest of the LLM Stories series.</description></item><item><title>Tensor Parallelism, Built From Scratch in Your Head</title><link>https://wgzesg.github.io/llm_stories/posts/02-tensor-parallelism-mental-model/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/02-tensor-parallelism-mental-model/</guid><description>A mental model for tensor parallelism — derived from one matmul in a transformer&amp;#39;s prefill phase. Two ways to read a weight matrix, two ways to split it across GPUs.</description></item><item><title>Walking Tensor Parallelism Through a Full Block</title><link>https://wgzesg.github.io/llm_stories/posts/03-tp-through-a-full-block/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/03-tp-through-a-full-block/</guid><description>How to split a full transformer block across two GPUs, with concrete shapes traced through every step. Start with column-parallel everywhere, see why it costs four gathers per block, then pair it with row-parallel to land at the Megatron pattern of two all-reduces per block.</description></item><item><title>How to Batch Many Requests Through One Forward Pass</title><link>https://wgzesg.github.io/llm_stories/posts/04-batching-many-requests/</link><pubDate>Sun, 03 May 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/04-batching-many-requests/</guid><description>How to batch many concurrent prefill requests through a TP-parallelized transformer. Walk a full block on a flattened multi-request tensor and watch where batching is free vs. where it isn&amp;#39;t.</description></item><item><title>ORCA and Chunked Prefill: Evening Out the Iteration</title><link>https://wgzesg.github.io/llm_stories/posts/05-orca-and-chunked-prefill/</link><pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/05-orca-and-chunked-prefill/</guid><description>How iteration-level scheduling (ORCA) and chunked prefill flatten per-iteration cost. We walk the cost variance, see what ORCA fixes and what it leaves open, and watch chunked prefill bound the longest iteration.</description></item><item><title>Prefill and Decode Disaggregation: Two Phases on Opposite Sides of the Roofline</title><link>https://wgzesg.github.io/llm_stories/posts/06-prefill-decode-disaggregation/</link><pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/06-prefill-decode-disaggregation/</guid><description>A roofline-first argument for prefill/decode disaggregation. Defines arithmetic intensity, derives that intensity ≈ tokens-per-iteration for transformers, sweeps context length to show decode falling further below the ridge as L grows, then walks through the split and the KV-transfer cost it introduces.</description></item><item><title>MoE and Expert Parallelism: From One Big FFN to 256 Small Ones</title><link>https://wgzesg.github.io/llm_stories/posts/07-moe-and-expert-parallelism/</link><pubDate>Wed, 27 May 2026 00:00:00 +0000</pubDate><guid>https://wgzesg.github.io/llm_stories/posts/07-moe-and-expert-parallelism/</guid><description>Introduces MoE as a conditional FFN — same shape of computation, but per-token routing across many small experts. Anchors on DeepSeek-V3 (671B total, 37B active, 256 routed + 1 shared experts, top-8). Explains why TP is the wrong cut for experts, then walks through expert parallelism: route, all-to-all dispatch, local FFN, all-to-all combine.</description></item></channel></rss>