Tensor Parallelism, Built From Scratch in Your Head
Two ways to read a weight matrix, two ways to split it across GPUs. A mental model for tensor parallelism, derived from one matmul in a transformer’s prefill phase.
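As a rough illustration of the two splits on a single matmul, here is a minimal NumPy sketch. It assumes the two "reads" are a split by columns and a split by rows of the weight matrix, and it simulates the GPUs as list entries on one machine. The shapes, the GPU count, and all variable names are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4                          # assumed GPU count, simulated as list entries
X = rng.standard_normal((8, 16))    # (tokens, d_in): prefill activations
W = rng.standard_normal((16, 32))   # (d_in, d_out): the weight matrix
Y_ref = X @ W                       # unsplit reference result

# Cut 1: split W by columns. Every GPU sees all of X and produces a
# column slice of Y; the slices are concatenated (an all-gather).
W_col_shards = np.split(W, n_gpus, axis=1)            # each (16, 8)
Y_from_cols = np.concatenate([X @ w for w in W_col_shards], axis=1)

# Cut 2: split W by rows. Every GPU sees a column slice of X and produces
# a full-shape partial sum of Y; the partials are added (an all-reduce).
W_row_shards = np.split(W, n_gpus, axis=0)            # each (4, 32)
X_col_shards = np.split(X, n_gpus, axis=1)            # each (8, 4)
Y_from_rows = sum(x @ w for x, w in zip(X_col_shards, W_row_shards))
```

Both results match `Y_ref` up to floating-point rounding. The difference lies in what each GPU holds before the collective: a slice of the output in the first case, a full-shape partial sum in the second.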
Walk article 02’s two cuts through a full transformer block, with concrete shapes on each GPU at every step. First apply one cut to every matmul and watch communication explode to four gathers per block. Then pair the two cuts as duals and watch them snap into the architecture’s widen-narrow rhythm, landing at two all-reduces per block.
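The pairing can be sketched on the MLP half of a block, where the widen-narrow rhythm is easiest to see. This is a hedged NumPy simulation, not the article's code. It assumes the widening matmul is split by columns and the narrowing matmul by rows, uses ReLU as a stand-in elementwise activation, and picks illustrative shapes; all names are hypothetical.

```python
import numpy as np

def relu(x):
    # Stand-in elementwise activation (an assumption; any elementwise op works).
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
n_gpus, tokens, d_model, d_ff = 4, 8, 16, 64           # illustrative sizes
X = rng.standard_normal((tokens, d_model))
W_up = rng.standard_normal((d_model, d_ff))            # widen
W_down = rng.standard_normal((d_ff, d_model))          # narrow
Y_ref = relu(X @ W_up) @ W_down                        # single-device reference

W_up_shards = np.split(W_up, n_gpus, axis=1)           # each (16, 16): column cut
W_down_shards = np.split(W_down, n_gpus, axis=0)       # each (16, 16): row cut

partials = []
for w_up, w_down in zip(W_up_shards, W_down_shards):
    h = relu(X @ w_up)             # (tokens, d_ff / n_gpus), stays on its GPU
    partials.append(h @ w_down)    # (tokens, d_model), a partial sum of Y

# The only communication: summing the partials stands in for one all-reduce.
Y = np.sum(partials, axis=0)
```

Because the activation is elementwise, each GPU's column slice of the hidden state is exactly the input slice that its row shard of `W_down` needs, so no gather is required between the two matmuls. Under the same assumptions, the attention half would contribute the second all-reduce of the block.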