Walking Tensor Parallelism Through a Full Block

Walk article 02’s two cuts through a full transformer block, with concrete shapes on each GPU at every step. First apply one cut to every matmul and watch communication explode (four gathers per block). Then pair the two cuts as duals and watch them snap into the architecture’s widen-narrow rhythm, landing at two all-reduces per block.

April 29, 2026 · 11 min · Pino
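
As a taste of the pairing idea, here is a minimal single-process sketch of the MLP half of a block. It assumes the two cuts are a column cut (split the first weight's output columns) and a row cut (split the second weight's input rows), and it fakes the devices with a Python loop. The function name `paired_mlp`, the ReLU activation, and all shapes are illustrative choices, not taken from the article.

```python
import numpy as np

def paired_mlp(x, w1, w2, n_shards):
    """Simulate a column-cut then row-cut MLP across n_shards fake devices."""
    # Column cut: each shard holds a slice of W1's output columns.
    w1_shards = np.split(w1, n_shards, axis=1)
    # Row cut: each shard holds the matching slice of W2's input rows.
    w2_shards = np.split(w2, n_shards, axis=0)
    partials = []
    for w1_k, w2_k in zip(w1_shards, w2_shards):
        h_k = np.maximum(x @ w1_k, 0.0)  # local widen + elementwise ReLU, no comm
        partials.append(h_k @ w2_k)      # local narrow: a full-shape partial sum
    # The only cross-shard step: this sum stands in for one all-reduce.
    return np.sum(partials, axis=0)

# Hypothetical sizes, chosen only so the shapes are easy to follow.
rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 32, 4
x = rng.standard_normal((2, 5, d_model))   # (batch, seq, d_model)
w1 = rng.standard_normal((d_model, d_ff))  # widen
w2 = rng.standard_normal((d_ff, d_model))  # narrow

reference = np.maximum(x @ w1, 0.0) @ w2   # unsharded MLP
sharded = paired_mlp(x, w1, w2, n_shards)
matches = np.allclose(reference, sharded)
```

Because the activation is elementwise, each shard can apply it to its own slice of the hidden state, so nothing has to be gathered between the two matmuls. The sum at the end is the single communication step for this half of the block.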