Moe on LLM Stories

https://wgzesg.github.io/llm_stories/tags/moe/
Recent content in Moe on LLM Stories
Generator: Hugo
Language: en
Wed, 27 May 2026 00:00:00 +0000

MoE and Expert Parallelism: From One Big FFN to 256 Small Ones
https://wgzesg.github.io/llm_stories/posts/07-moe-and-expert-parallelism/
Wed, 27 May 2026 00:00:00 +0000

Introduces MoE as a conditional FFN: the same shape of computation, but with per-token routing across many small experts. Anchors on DeepSeek-V3 (671B total parameters, 37B active per token, 256 routed experts plus 1 shared expert, top-8 routing). Explains why tensor parallelism (TP) is the wrong cut for experts, then walks through expert parallelism: route, all-to-all dispatch, local FFN, all-to-all combine.
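
The four steps named in the summary (route, all-to-all dispatch, local FFN, all-to-all combine) can be sketched in plain Python. This is a minimal single-process illustration, not the post's code and not DeepSeek-V3's implementation. The sizes (8 experts, top-2, 4 ranks), the softmax over the selected scores, the toy `expert_ffn`, and every function name here are assumptions made for the example. The shared expert is left out, and the all-to-all exchanges are simulated with per-rank lists.

```python
import math
import random

NUM_EXPERTS = 8   # toy value (assumption); the anchor model has 256 routed experts
TOP_K = 2         # toy value (assumption); the anchor model uses top-8
NUM_RANKS = 4     # hypothetical number of devices
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS


def route(scores, k):
    """Pick the k highest-scoring experts and return [(expert_id, weight)]."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected scores only.
    peak = max(scores[e] for e in top)
    exps = [math.exp(scores[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def expert_ffn(expert_id, x):
    """Stand-in for a small FFN: scales the input by (expert_id + 1)."""
    return [(expert_id + 1) * v for v in x]


def moe_expert_parallel(tokens, all_scores):
    # 1. Route: each token independently chooses its top-k experts.
    routes = [route(s, TOP_K) for s in all_scores]

    # 2. Dispatch (simulated all-to-all): send each (token, expert) pair
    #    to the rank that owns that expert.
    inbox = {r: [] for r in range(NUM_RANKS)}
    for t, picks in enumerate(routes):
        for e, w in picks:
            inbox[e // EXPERTS_PER_RANK].append((t, e, w))

    # 3. Local FFN: each rank runs only the experts it owns.
    outbox = []
    for r in range(NUM_RANKS):
        for t, e, w in inbox[r]:
            outbox.append((t, w, expert_ffn(e, tokens[t])))

    # 4. Combine (simulated all-to-all): return the expert outputs to the
    #    token's position and sum them using the gate weights.
    out = [[0.0] * len(tok) for tok in tokens]
    for t, w, y in outbox:
        for i, v in enumerate(y):
            out[t][i] += w * v
    return out, routes


random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
scores = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(5)]
out, routes = moe_expert_parallel(tokens, scores)
```

Each token touches only `TOP_K` of the `NUM_EXPERTS` experts, which is the sense in which the layer is a conditional FFN. In a real multi-device setup, steps 2 and 4 would be collective communication calls rather than list appends.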