LLM Stories

https://wgzesg.github.io/llm_stories/zh/
Recent content on LLM Stories. Last updated Sat, 09 May 2026.

Roadmap: What This Series Will Cover
https://wgzesg.github.io/llm_stories/zh/posts/00-roadmap/
Wed, 29 Apr 2026
A living map of the LLM Stories series. It tracks the posts published so far and the next batch of questions each one opens up.

An LLM, End to End
https://wgzesg.github.io/llm_stories/zh/posts/01-llm-end-to-end/
Wed, 06 May 2026
The basics of a modern decoder-only LLM, seen at three zoom levels: a bird's-eye view of the whole stack, the inside of one transformer block, and one complete generation run. The entry point to the LLM Stories series.

A Mental Model for Tensor Parallelism, Built from Scratch
https://wgzesg.github.io/llm_stories/zh/posts/02-tensor-parallelism-mental-model/
Sun, 26 Apr 2026
A mental model of tensor parallelism, derived from a single matmul in the transformer prefill phase. Along the way, it explains why multi-head attention comes pre-cut for column-parallel TP.

Walking Tensor Parallelism Through a Full Transformer Block
https://wgzesg.github.io/llm_stories/zh/posts/03-tp-through-a-full-block/
Wed, 29 Apr 2026
How to split a whole transformer block across two GPUs. Start with column-parallel everywhere and see why each block then pays for four gathers; pair it with row-parallel and you arrive naturally at the classic Megatron pattern of two all-reduces per block. The shape at every step is listed in a table.

Fitting Many Requests into One Forward Pass
https://wgzesg.github.io/llm_stories/zh/posts/04-batching-many-requests/
Sun, 03 May 2026
How to batch multiple concurrent prefill requests so they run in a single forward pass. It walks the whole block over the flattened multi-request tensor to show where batching comes for free and where it does not.

ORCA and Chunked Prefill: Evening Out the Cost of Each Iteration
https://wgzesg.github.io/llm_stories/zh/posts/05-orca-and-chunked-prefill/
Wed, 06 May 2026
How iteration-level scheduling (ORCA) and chunked prefill smooth out the per-iteration cost. It first lays out the variance in that cost, looks at what ORCA fixes and what it leaves behind, then shows how chunked prefill caps the longest iteration.

Prefill/Decode Disaggregation: Two Phases on Opposite Sides of the Roofline
https://wgzesg.github.io/llm_stories/zh/posts/06-prefill-decode-disaggregation/
Sat, 09 May 2026
Makes the case for prefill/decode disaggregation from a roofline perspective. It first pins down arithmetic intensity, derives that the intensity of one transformer iteration ≈ the number of tokens sharing a single weight load, then sweeps context length to see how decode drops further below the ridge, and finally walks through the disaggregated setup and the KV transfer cost it brings.