MoE and Expert Parallelism: From One Big FFN to 256 Small Ones
Dense FFNs hit a wall at hundreds of billions of parameters: every token reads every parameter, and memory bandwidth runs out. Mixture-of-Experts (MoE) breaks the symmetry with many small FFNs and a router that picks a few per token, so capacity is decoupled from per-token compute. This article builds MoE up from the dense FFN, anchors on DeepSeek-V3’s split of 671B total parameters and 37B active per token, and walks through expert parallelism end to end.
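To make "a router that picks a few per token" concrete before the full build-up, here is a minimal sketch of generic softmax top-k routing for a single token. It is an illustration under stated assumptions, not DeepSeek-V3's actual gating: the expert count of 256 comes from the title, while `TOP_K = 8`, the random router logits, and the helper name `top_k_route` are hypothetical choices made for this example.

```python
import math
import random


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all router logits, shifted by the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


NUM_EXPERTS = 256  # expert count taken from the article title
TOP_K = 8          # assumption for illustration only

# Stand-in router logits for one token; a real router computes these
# from the token's hidden state with a learned linear layer.
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(router_logits, TOP_K)

# Share of parameters read per token, from the 671B total / 37B active figures.
active_fraction = 37 / 671

print(f"experts used: {len(routes)} of {NUM_EXPERTS}")
print(f"active parameter fraction: {active_fraction:.3f}")
```

The token's output would then be the weighted sum of the chosen experts' FFN outputs. The last line is the point of the 671B/37B split: each token reads only about 5.5% of the model's parameters, which is what decouples capacity from per-token compute.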