MoE and Expert Parallelism: From One Big FFN to 256 Small Ones

The FFN we walked in article 03 hits a wall at frontier scale — every token has to read every parameter, and at 700B that bill dominates the cost of serving. Mixture-of-Experts is the move that solves it: replace one big FFN with many small experts and a router that picks a few per token, decoupling capacity from compute.

This article rebuilds the FFN as MoE, anchors on DeepSeek-V3 for the concrete numbers, then walks through the parallelism the new shape demands.

1. The FFN, where the parameters live

Recap from article 03. Each transformer block is attention then FFN. The FFN takes the residual stream, projects up to a wider intermediate dimension, applies a SwiGLU nonlinearity, projects back down. For Llama-2-7B that’s 4096 → 11008 → 4096, three matrices per layer:

gate_proj:  4096 × 11008
up_proj:    4096 × 11008
down_proj:  11008 × 4096

Multiplying the shapes out, each FFN holds about 135M parameters. Across 32 layers, that’s ~4.3B of Llama-2-7B’s 7B total — well over half the model lives in the FFN. And in the dense formulation, every one of those parameters gets read on every token’s pass through every layer.
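The multiplication is quick to check. A minimal Python sketch using only the shapes quoted above; the variable names are illustrative, not taken from any library:

```python
# FFN parameter count for Llama-2-7B, from the three matrix shapes above.
d_model, d_ff, n_layers = 4096, 11008, 32

gate_proj = d_model * d_ff
up_proj = d_model * d_ff
down_proj = d_ff * d_model

ffn_params_per_layer = gate_proj + up_proj + down_proj
ffn_params_total = ffn_params_per_layer * n_layers

print(f"per layer:  {ffn_params_per_layer / 1e6:.1f}M")
print(f"all layers: {ffn_params_total / 1e9:.2f}B")
```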

That’s fine at 7B. It gets expensive at 700B. From article 01 we know decode is bandwidth-bound: every generated token has to pull the full weight set through HBM. People do serve dense models in the hundreds of billions of parameters (Llama-3 405B is dense, GPT-3 175B was dense), so it’s not impossible. It’s just that the per-token bandwidth bill is set by every parameter the model has, and at this size that bill becomes the dominant cost of serving.

What if we could keep the capacity of 700B parameters but only touch a small fraction of them per token? The weight-bandwidth bill would shrink to roughly that fraction, and we’d get much of the capacity of the full 700B for the serving cost of a much smaller model. That’s the trade MoE makes.

2. The MoE move: condition the FFN on the token

The classical FFN treats every token identically. Same matrices, same multiplications, whether the token is “the” or “mitochondria”. That’s wasteful: most parameters in a giant FFN are surely specialized for something, and most tokens don’t need most specializations.

Mixture-of-Experts replaces one big FFN with many smaller ones — the experts — plus a tiny router that, per token, picks which k of E experts to actually run. The shape of computation is unchanged inside each expert — it’s still a SwiGLU FFN on the residual stream. What changes is which experts run for which tokens.

Two consequences fall out:

Parameters scale with E, compute scales with k. Add more experts and the model’s total capacity grows; keep k fixed and per-token FLOPs don’t budge. Capacity and compute, decoupled.
The FFN becomes conditional. Different tokens take different paths through the model. Two tokens in the same sequence at the same layer may hit completely disjoint sets of experts.
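To make the two consequences concrete, here is a toy MoE layer in plain Python. It is a sketch under loud assumptions: the expert is a two-matrix ReLU FFN standing in for SwiGLU, the gate is a softmax over the selected scores (one common choice, not necessarily what any particular model uses), and the sizes are tiny and arbitrary.

```python
import math
import random

def make_expert(d, i, rng):
    # Toy expert: up-projection, ReLU, down-projection (stand-in for SwiGLU).
    w_up = [[rng.gauss(0, 0.1) for _ in range(i)] for _ in range(d)]
    w_down = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(i)]
    return w_up, w_down

def matvec(x, w):
    # Row vector x (length len(w)) times matrix w (len(w) x len(w[0])).
    return [sum(x[r] * w[r][c] for r in range(len(w))) for c in range(len(w[0]))]

def moe_forward(x, router, experts, k):
    scores = matvec(x, router)                      # one score per expert
    top = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:k]
    peak = max(scores[e] for e in top)
    z = [math.exp(scores[e] - peak) for e in top]   # softmax over the chosen k
    weights = [v / sum(z) for v in z]
    out = [0.0] * len(x)
    for w, e in zip(weights, top):                  # only k experts ever run
        w_up, w_down = experts[e]
        hidden = [max(0.0, h) for h in matvec(x, w_up)]
        y = matvec(hidden, w_down)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

rng = random.Random(0)
d, i, E, k = 8, 4, 16, 2
router = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(d)]
experts = [make_expert(d, i, rng) for _ in range(E)]
token = [rng.gauss(0, 1) for _ in range(d)]

out, chosen = moe_forward(token, router, experts, k)
stored_params = E * 2 * d * i      # grows with E
active_params = k * 2 * d * i      # grows with k only
print(chosen, stored_params, active_params)
```

Doubling `E` doubles `stored_params` and leaves `active_params` untouched, which is the decoupling described above.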

That’s the entire architectural idea. Everything from here is how to make it run on real hardware — starting with the specific shape DeepSeek-V3 picks for E, k, and the expert size.

3. DeepSeek-V3 in concrete numbers

§2 was the architecture in the abstract. To feel what the choice buys, we need to anchor on a real model — and DeepSeek-V3 is the cleanest large-scale instance to look at, both because the design is publicly documented and because it pushes the MoE shape hard.

The headline: 671B parameters total, 37B active per token. The cleanest way to feel what that buys is to put DeepSeek-V3 next to a contemporary dense model of comparable capability. Qwen2.5-72B fits the role — different lab, different philosophy, same generation, aimed at similar tasks.

	Qwen2.5-72B (dense)	DeepSeek-V3 (MoE)
Hidden dim `d`	8192	7168
FFN intermediate dim `i`	29568	2048 (per expert)
Ratio `i / d`	3.6×	0.29×
Experts per FFN layer	1 (no router)	256 routed + 1 shared
Active per token	the whole FFN	top-8 + shared = 9 of 257
Params per FFN/MoE layer	~727M	~11.3B stored, ~400M active
Routing combinations per token	1	~4 × 10¹⁴
Total model params	~72B	671B
Active params per token	~72B	37B

The contrast is the whole point. DeepSeek-V3 stores about 9× more parameters than Qwen2.5-72B but touches roughly half as many per token. Per FFN layer, the active compute is actually smaller in the MoE model (~400M active parameters against ~727M). Dense models pay for every parameter on every token, full stop; MoE pays only for the parameters routed to this token, plus a small router. And the routing-combinations row says it best: a dense FFN has exactly one “specialization” per layer (itself), while MoE has C(256, 8) ≈ 4 × 10¹⁴ possible expert subsets per layer per token. Expressivity grows combinatorially with the routing choice; compute does not.
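The per-layer rows of the table can be reproduced from the dimensions alone. A short check in Python, assuming three matrices (gate, up, down) per expert and ignoring the router's own small d × E matrix:

```python
import math

d, i_expert = 7168, 2048
routed, shared, top_k = 256, 1, 8

params_per_expert = 3 * d * i_expert                  # gate, up, down
stored_per_layer = (routed + shared) * params_per_expert
active_per_layer = (top_k + shared) * params_per_expert
routing_choices = math.comb(routed, top_k)            # ways to pick 8 of 256

dense_ffn = 3 * 8192 * 29568                          # one Qwen2.5-72B FFN layer

print(f"stored per MoE layer: {stored_per_layer / 1e9:.1f}B")
print(f"active per MoE layer: {active_per_layer / 1e6:.0f}M")
print(f"dense FFN layer:      {dense_ffn / 1e6:.0f}M")
print(f"routing combinations: {routing_choices:.1e}")
```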

Fine-grained MoE is one of the more elegant ideas of the current era — same compute budget as a dense model in the same class, but vastly more combinatorial expressivity and sharper per-expert specialization. It also reshapes the deployment problem: 256 narrow experts spread across a cluster behave very differently from one big FFN, and the systems machinery had to catch up to make it work at scale.

The picture below traces what 9-of-257 routing actually looks like for a single token:

The arithmetic that gets you from 671B stored to 37B active is just the ratio in the picture: per layer, 9 of 257 experts run, so each layer touches ≈ 3.5% of its expert parameters. Stack 58 MoE layers (plus 3 dense FFN layers and attention) and you land at 37B active.

An MoE layer is a parameter store with a router on top. Compute touches a few percent of it per token; the rest sits in HBM waiting to be the right expert for some other token.

The decode-bandwidth bill is set by what’s active, not what’s stored. You pay for 37B and get the capacity of 671B. Once routing is the thing deciding which weights get used, where the experts live becomes the central design question — and the rest of this article is about answering it.
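As a rough illustration of that bill, the sketch below compares the weight bytes read for one decoded token if every parameter were touched against the bytes read when only the active parameters are. It assumes fp8 weights (1 byte per parameter) and a single token in flight; in a real batch, different tokens activate different experts, so the set of weights touched per step is larger than the single-token figure.

```python
def weight_bytes_per_token(params, bytes_per_param):
    # Bytes of weights that must stream through HBM for one token.
    return params * bytes_per_param

total_params, active_params = 671e9, 37e9
bytes_per_param = 1   # assumption: fp8 weights

if_dense = weight_bytes_per_token(total_params, bytes_per_param)
as_moe = weight_bytes_per_token(active_params, bytes_per_param)
fraction = as_moe / if_dense

print(f"{if_dense / 1e9:.0f} GB vs {as_moe / 1e9:.0f} GB per token "
      f"({fraction:.1%} of the stored weights)")
```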

4. Why TP doesn’t fit experts

Each MoE layer holds ~11B of expert weights, and 58 such layers add up to roughly 650B, far too much for a single card. So the natural first attempt is to spread them with TP, the parallelism we already know from articles 02 and 03. It doesn’t fit.

Tensor parallelism slices one big matrix across GPUs and stitches the shards with an all-reduce. MoE breaks that premise on two axes:

Size. Each DeepSeek-V3 expert is just 7168 × 2048 per matmul — already modest. Slice it 8 ways and each shard becomes 7168 × 256, too skinny to keep tensor cores fed. Fine-grained MoE went smaller per expert by design; TP wants the opposite.
Structure. You have 257 independent small FFNs, not one big one, and any token uses only 9. TP-slicing each would fire an all-reduce per expert per layer for computation that was already factored into independent pieces.

The natural cut: put whole experts on different GPUs. Each card holds a subset of the 256 routed experts and runs them locally on whatever tokens land there. The per-matmul all-reduce disappears; what replaces it is the routing itself — moving tokens to the right card and back.

That’s expert parallelism.

5. EP, end to end

Time to zoom in. Take one transformer block, and inside that block focus on the FFN — which in DeepSeek-V3 is the MoE layer. Suppose attention has already run and produced its output for a batch of T tokens. Those activations are sitting in HBM, ready for the FFN to consume.

The FFN isn’t one big matmul anymore. It’s 256 routed experts plus 1 shared expert, and in this section we’ll spread them across 8 cards with EP = 8 — 32 routed experts per card, plus the shared expert replicated everywhere. The question this section answers: how does an MoE layer actually compute, with experts living on 8 different cards and every token wanting its own subset?

To keep the picture simple, assume each card already holds its own unique slice of the batch — T / 8 tokens per card, all different, no overlap with the other cards. (This is the layout you’d get with pure data parallelism across cards, no TP in attention. §6 layers in TP and SP and shows how to arrive at this same layout when those are in play too.)

Walk one MoE layer end to end.

Step 1 — route. Each GPU runs the router on its local tokens. The router is a tiny matmul (d × E); its output is, per token, the 8 chosen expert IDs and their weights. The assignment is the entire content of the dispatch plan.

Step 2 — all-to-all dispatch. Every GPU now knows which of its tokens want experts on which other GPUs. Tokens get packed and sent. A single token may go to up to 8 different destination GPUs (one per chosen expert). After this exchange, every GPU is holding the set of tokens that want its experts.

Step 3 — local expert compute. Each GPU runs its 32 experts on the tokens it received. Just SwiGLU FFN on whatever subset of tokens picked each expert. Pure local compute, no communication. The shared expert runs in parallel on the local tokens.

Step 4 — all-to-all combine. Send each token’s expert outputs back to its origin GPU. There the outputs are weighted by the router scores and summed (along with the shared expert’s output) to form this layer’s FFN output.

Two all-to-alls per MoE layer, one round of local FFN in the middle. That’s it.
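The four steps can be simulated in plain Python. This is a sketch with heavy simplifications, all of them assumptions for illustration: activations are scalars, each expert is a toy scalar function, the router is random with equal weights, the shared expert is left out, and a token with two experts on the same card is simply sent twice. The final lines check the result against the same layer computed with no parallelism.

```python
import random

N_GPUS, N_EXPERTS, TOP_K, TOKENS_PER_GPU = 4, 8, 2, 3
EXPERTS_PER_GPU = N_EXPERTS // N_GPUS

def expert(e, x):
    # Toy stand-in for expert e's SwiGLU FFN: a distinct function per expert.
    return (e + 1) * x

def owner(e):
    # Contiguous placement: experts 0-1 on GPU 0, 2-3 on GPU 1, and so on.
    return e // EXPERTS_PER_GPU

def all_to_all(send):
    # send[src][dst] is what src sends to dst; the result is indexed [dst][src].
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(TOKENS_PER_GPU)] for _ in range(N_GPUS)]

# Step 1: route. A random router stands in for the learned d x E matmul.
routes = [[[(e, 1.0 / TOP_K) for e in rng.sample(range(N_EXPERTS), TOP_K)]
           for _ in range(TOKENS_PER_GPU)] for _ in range(N_GPUS)]

# Step 2: dispatch. Pack (token index, expert id, activation) per destination.
send = [[[] for _ in range(N_GPUS)] for _ in range(N_GPUS)]
for src in range(N_GPUS):
    for t, x in enumerate(tokens[src]):
        for e, _w in routes[src][t]:
            send[src][owner(e)].append((t, e, x))
recv = all_to_all(send)

# Step 3: local expert compute on whatever arrived, no communication.
results = [[[(t, e, expert(e, x)) for t, e, x in recv[gpu][src]]
            for src in range(N_GPUS)] for gpu in range(N_GPUS)]

# Step 4: combine. Send outputs home, then weight and sum at the origin.
returned = all_to_all(results)
outputs = [[0.0] * TOKENS_PER_GPU for _ in range(N_GPUS)]
for src in range(N_GPUS):
    weights = [dict(r) for r in routes[src]]
    for gpu in range(N_GPUS):
        for t, e, y in returned[src][gpu]:
            outputs[src][t] += weights[t][e] * y

# Reference: the same layer computed on one device.
reference = [[sum(w * expert(e, x) for e, w in routes[g][t])
              for t, x in enumerate(tokens[g])] for g in range(N_GPUS)]
print(all(abs(a - b) < 1e-12 for o, r in zip(outputs, reference) for a, b in zip(o, r)))
```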

Three things this picture makes visible.

Each expert sees only the tokens that picked it. That’s the win. GPU 3 doesn’t run its 32 experts on all T tokens; it runs them on the fraction routing sent its way. With uniform routing, each routed expert sees T · 8 / 256 = T / 32 tokens, and total FFN compute across the cluster equals that of a dense FFN holding 9/257 of the layer’s expert parameters, which is what the “37B active” number was claiming all along.

The all-to-alls move activations, not weights. Each token copy carries a d = 7168 fp16 vector ≈ 14 KB through dispatch, and the same again through combine; a token whose 8 experts sit on 8 different cards is sent 8 times. Small per token, real per iteration, and the traffic is fully meshed, every card sending to every other.

Routing decides everything about utilization. If 90% of tokens route to one GPU’s experts, that GPU bottlenecks the whole layer and the others sit idle waiting for the all-to-all combine. Uniform routing is what makes the scheme efficient. DeepSeek and others spend real complexity on the losses, biases, and dispatch constraints that keep routing close to balanced. We’ll treat that as its own topic.
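A small simulation shows how quickly skew turns into a bottleneck. It is only a sketch: the router here is a weighted random draw (a hypothetical stand-in for a learned router), and "load" counts token-expert assignments per card with 32 experts per card as in the layout above.

```python
import random

def gpu_loads(n_tokens, expert_weights, n_gpus=8, top_k=8, seed=0):
    """Token-expert assignments landing on each GPU under a weighted random router."""
    rng = random.Random(seed)
    n_experts = len(expert_weights)
    experts_per_gpu = n_experts // n_gpus
    loads = [0] * n_gpus
    for _ in range(n_tokens):
        chosen = set()
        while len(chosen) < top_k:          # draw until we have k distinct experts
            chosen.add(rng.choices(range(n_experts), weights=expert_weights)[0])
        for e in chosen:
            loads[e // experts_per_gpu] += 1
    return loads

def imbalance(loads):
    # The slowest card sets the pace: max load relative to the mean.
    return max(loads) / (sum(loads) / len(loads))

uniform = gpu_loads(1000, [1.0] * 256)
skewed = gpu_loads(1000, [10.0] * 32 + [1.0] * 224)   # GPU 0's experts are 10x as popular

print("uniform:", uniform, f"imbalance {imbalance(uniform):.2f}")
print("skewed: ", skewed, f"imbalance {imbalance(skewed):.2f}")
```

An imbalance of 1.0 means every card finishes together; anything above that is time the other cards spend waiting on the busiest one.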

6. One node, three parallelisms: TP × EP × SP

§5 was pure EP — no TP, each card with its own batch. Real production layouts compose EP with attention’s TP and a third axis, sequence parallelism (SP), all on the same hardware. The composition is where the design gets interesting.

The deployment unit: one 8-GPU H200 node (8 × 141 GB ≈ 1.1 TB of HBM, enough for DeepSeek-V3’s fp8 weights plus KV cache), with TP = 4 for attention, EP = 8 for MoE, and SP = 4 matching TP. The 8 cards split into 2 TP-groups of 4, each running its own batch. The full expert set lives within the node — 256 routed experts spread 32 per card, plus the shared expert replicated everywhere.

This node is the complete inference unit: everything happens inside it. To handle more concurrent batches, replicate the node; replicas are fully independent during forward, with no cross-node coordination beyond request routing at the front. (DeepSeek-V3 at fp16 is ~1.3 TB and doesn’t fit even in this node’s ~1.1 TB, let alone the 640 GB of an 8 × 80 GB node; production deployments use fp8 weights or stretch EP across 2 nodes. The parallelism pattern below is the same either way; we keep the picture at 8 cards for clarity.)
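The fit question is plain arithmetic. The sketch below counts weight bytes only, for the two node sizes mentioned in this section and for fp8 and fp16; it treats 1 GB as 10⁹ bytes and makes no attempt to size the KV cache or activations, which need whatever is left over.

```python
PARAMS = 671e9
nodes_gb = {"8 x 141 GB (H200)": 8 * 141, "8 x 80 GB": 8 * 80}
formats = {"fp8": 1, "fp16": 2}          # bytes per parameter

fits = {}
for node, hbm_gb in nodes_gb.items():
    for fmt, bytes_per_param in formats.items():
        weights_gb = PARAMS * bytes_per_param / 1e9
        fits[(node, fmt)] = weights_gb < hbm_gb
        print(f"{node:18s} {fmt:5s} weights {weights_gb:6.0f} GB, "
              f"left over: {hbm_gb - weights_gb:6.0f} GB")
```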

The key property: every collective stays inside the node, where cards talk fast. Nothing has to cross between nodes during forward — cross-node communication is much slower, and the layout is designed to avoid it entirely.

Now we get to the part that’s actually fun: how do these three parallelisms compose, and why does the design feel elegant rather than just bolted together?

The trick is to forget the names of the collectives for a moment and just track the shape of the data on each card at each stage. We’ll introduce the names as labels for shape-changes we see in the picture.

The activation table, and two ways to split it

Inside one transformer block, think of the activations as a 2D table: rows are tokens, columns are hidden features. The whole point of multi-card parallelism is to spread this table across the 4 cards in a TP-group so they can work in parallel.

There are two natural ways to split:

Column-sharded: each card has all the tokens, but only a slice of features (a quarter, since TP=4).
Row-sharded: each card has all the features, but only a slice of tokens.

Same total bytes — same data — just shaped differently on each card. The two layouts are interconvertible by reshuffling bytes between cards. That’s all a “collective” really is: a reshuffle that turns one layout into another.
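The two layouts are easy to see on a toy table. In this sketch the table is a list of rows (tokens) of feature values; the helper names `shard_rows` and `shard_cols` are made up for the illustration.

```python
def shard_rows(table, n):
    # Row-sharded: each card gets a contiguous block of tokens, all features.
    step = len(table) // n
    return [table[c * step:(c + 1) * step] for c in range(n)]

def shard_cols(table, n):
    # Column-sharded: each card gets every token, but only a slice of features.
    step = len(table[0]) // n
    return [[row[c * step:(c + 1) * step] for row in table] for c in range(n)]

tokens, features, cards = 8, 4, 4
table = [[t * 10 + f for f in range(features)] for t in range(tokens)]

rows = shard_rows(table, cards)
cols = shard_cols(table, cards)

for c in range(cards):
    print(f"card {c}: row shard {len(rows[c])}x{len(rows[c][0])}, "
          f"column shard {len(cols[c])}x{len(cols[c][0])}")
```

Every card holds 8 values in either layout; only the shape differs.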

Attention naturally wants column-sharded

Attention’s TP work is column-sharded. Each card holds 1/4 of the attention heads (which corresponds to a slice of output features), runs them on all the tokens, and produces a partial result. To get the full output, the 4 cards have to merge their partials: each contributes its slice, they all sum into one, and every card ends up with the same complete output.

That merge has a name: all-reduce. It’s a convergent-then-divergent pattern — partials flow in, the sum flows back out — and after it runs, every card holds the same full result.

So after attention’s all-reduce, the layout is: every card has every token at full features. Fully duplicated. That’s fine in a dense model, where the TP-sharded FFN that follows wants every token on every card.

MoE doesn’t want duplicated data — it wants unique tokens per card

Now we hit MoE. Each token needs to go to its experts; experts live on specific cards. The natural unit of work is one token at a time — card X is responsible for sending its own tokens, card Y compiles results for its own tokens, and so on.

But after attention’s all-reduce, all 4 cards in the TP-group hold the same full sequence. If they all try to dispatch their tokens, we end up sending each token from 4 different cards — a 4× duplication of bandwidth and work. We want each card to own a different slice of tokens. In the table picture: we want row-sharded after attention.

The fix: end attention with reduce-scatter

Here’s the move that makes everything compose. Instead of ending attention with a full all-reduce, end it with a reduce-scatter: same partial-merging step, but the result is scattered across cards by row instead of replicated to all of them.

Reduce-scatter is actually cheaper than all-reduce — about half the fabric bytes. (A ring all-reduce is internally a reduce-scatter followed by an all-gather, so RS alone is half the work.) The thing that changes is where the merged result lands: duplicated on every card (all-reduce), or split row-wise across cards (reduce-scatter). After reduce-scatter, the layout is row-sharded: each card holds a unique 1/4 of tokens at full features. Exactly what MoE wants — and we get there for half the bytes of a full all-reduce. We’ll pay the other half later, when we put the sequence back together with an all-gather before the next attention block.
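The difference in where the result lands can be shown on toy data. The functions below model only the outcome of each collective (what every card holds afterwards), not the ring schedule that produces it; each card starts with a full-size table of partial sums.

```python
def all_reduce(partials):
    # partials[card] is a full-size table of partial sums; every card gets the total.
    n = len(partials)
    total = [[sum(partials[c][r][f] for c in range(n))
              for f in range(len(partials[0][0]))]
             for r in range(len(partials[0]))]
    return [[row[:] for row in total] for _ in range(n)]

def reduce_scatter(partials):
    # Same sum, but card c keeps only its contiguous slice of rows (tokens).
    n = len(partials)
    total = all_reduce(partials)[0]
    step = len(total) // n
    return [total[c * step:(c + 1) * step] for c in range(n)]

cards, tokens, features = 4, 8, 2
# Card c contributes the value c + 1 everywhere, so every summed entry is 10.
partials = [[[c + 1.0] * features for _ in range(tokens)] for c in range(cards)]

replicated = all_reduce(partials)
row_sharded = reduce_scatter(partials)

print("all-reduce:     each card holds", len(replicated[0]), "tokens")
print("reduce-scatter: each card holds", len(row_sharded[0]), "tokens")
```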

Sending tokens to experts: the all-to-all

Each card now has its own tokens. The router decides which experts each token wants, and each card sends each token to the card holding that expert. Every card is sending to every other card simultaneously, but each card-to-card transfer carries different data.

This “different data to every destination, all in parallel” pattern has a name: all-to-all. It’s not a single mysterious operation — it’s just everyone sending different slices to everyone in parallel.

(Quick name check: an all-reduce sends the same data to everyone — the merged sum. An all-to-all sends different data to each destination. Same hardware, different shapes.)

After the experts run on the tokens they received, the results need to come back. A second all-to-all — the combine — sends each token’s expert output back to its origin card. Layout: still row-sharded.

Closing the loop: all-gather before the next attention

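The next attention block needs every token on every card again, so that each card can run its quarter of the heads over the full sequence. To get there from row-sharded, we all-gather: every card sends its slice of tokens to every other card, and each card ends up with all tokens at full features. The column split then happens inside attention, across heads.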

Two things to notice:

All-gather is the opposite of reduce-scatter (one scatters; one gathers).
Reduce-scatter + all-gather, taken together, move exactly the same total bytes as one all-reduce would have. The trick was just to do half the work (RS) at the end of attention and the other half (AG) at the start of the next attention block — with MoE happening in between, in the row-sharded layout it wanted.
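The byte accounting behind that second point, as a sketch. It assumes the textbook ring schedules, in which every step forwards one 1/n chunk of the table, and uses an arbitrary example of 4096 tokens per TP-group with fp16 activations.

```python
def ring_bytes_sent_per_card(table_bytes, n_cards, collective):
    # Standard ring schedules: each step forwards one 1/n chunk of the table.
    steps = {
        "reduce_scatter": n_cards - 1,
        "all_gather": n_cards - 1,
        "all_reduce": 2 * (n_cards - 1),
    }[collective]
    return steps * table_bytes // n_cards

tokens, d, bytes_per_value, tp = 4096, 7168, 2, 4   # 4096 tokens is an arbitrary example
table_bytes = tokens * d * bytes_per_value

rs = ring_bytes_sent_per_card(table_bytes, tp, "reduce_scatter")
ag = ring_bytes_sent_per_card(table_bytes, tp, "all_gather")
ar = ring_bytes_sent_per_card(table_bytes, tp, "all_reduce")

print(f"reduce-scatter: {rs / 1e6:.1f} MB per card")
print(f"all-gather:     {ag / 1e6:.1f} MB per card")
print(f"all-reduce:     {ar / 1e6:.1f} MB per card")
```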

The whole flow, on one picture

Top half (dense): every arrow converges to the same sum, then radiates back — same data on every card. Bottom half (MoE): every arrow carries different data to a different destination — different tokens on every card. Same hardware, very different traffic shape.

The per-block rhythm, in plain language

1. Start: row-sharded         (each card has its slice of tokens, all features).
2. All-gather                 → replicated (every card has all tokens, all features).
3. Attention compute          (TP=4 across heads; this is the column-sharded part).
4. Reduce-scatter             → row-sharded again (each card has its slice of tokens).
5. Router decides per-token expert IDs.
6. All-to-all dispatch        → tokens routed to expert-holding cards.
7. Experts compute locally    (each card runs its 32 experts on the tokens it received).
8. All-to-all combine         → results back to origin (row-sharded again).
9. Exit row-sharded           — next block starts at step 2.
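The rhythm can be traced with token ids alone. The sketch below follows which tokens each of the 8 cards holds through one block and checks that the block exits in the layout it started in. It is an illustration under assumptions: the router is random, no arithmetic is done (so reduce-scatter appears only as "keep your own row slice"), and the token id strings are made up.

```python
import random

TP, EP, N_EXPERTS, TOP_K = 4, 8, 256, 8
TOKENS_PER_CARD = 2

def all_gather(shards):
    # Every card ends up with the concatenation of all shards.
    full = [tok for shard in shards for tok in shard]
    return [list(full) for _ in shards]

def row_slice(full, card, n_cards):
    # The slice of tokens a card keeps after reduce-scatter.
    step = len(full) // n_cards
    return full[card * step:(card + 1) * step]

rng = random.Random(0)
# Two TP-groups of 4 cards; each card starts with its own slice of token ids.
groups = [[[f"g{g}c{c}t{t}" for t in range(TOKENS_PER_CARD)] for c in range(TP)]
          for g in range(2)]
start = [shard for group in groups for shard in group]       # 8 cards, row-sharded

# Steps 2-4: all-gather inside each TP-group, attention, reduce-scatter back to rows.
after_attention = []
for group in groups:
    gathered = all_gather(group)                             # every card: all tokens
    after_attention += [row_slice(gathered[c], c, TP) for c in range(TP)]

# Steps 5-6: route, then dispatch across all 8 cards (both groups share one all-to-all).
inbox = [[] for _ in range(EP)]
for card, shard in enumerate(after_attention):
    for tok in shard:
        for e in rng.sample(range(N_EXPERTS), TOP_K):
            inbox[e // (N_EXPERTS // EP)].append((card, tok))

# Steps 7-8: experts compute, then combine returns each copy to its origin card.
returned = [[] for _ in range(EP)]
for arrivals in inbox:
    for origin, tok in arrivals:
        returned[origin].append(tok)
exit_layout = [sorted(set(toks)) for toks in returned]

print("exit layout matches start:", exit_layout == start)
print("tokens received per card during dispatch:", [len(a) for a in inbox])
```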

Two more details worth pulling out:

The MoE all-to-all mingles batches. Both TP-groups in the node dispatch into the same EP=8 all-to-all, so the experts see tokens from both batches simultaneously. Different batches happily share expert weights.
Four collectives per block, all intra-node. All-gather, reduce-scatter, dispatch, combine — every one stays inside the 8-GPU node, no cross-node hops.

The aesthetic to take away: every collective stays inside the node, and the data layout transforms cleanly through MoE’s parallelism axis. Twice per MoE layer, row-sharded data opens up into a fully meshed exchange and settles back. The whole dance fits inside one box.

7. What this opens

We’ve gone from a dense FFN to a sparse one, from one big matmul to 257 small ones, from per-layer all-reduce to per-layer meshed all-to-all. Compute per token went down; comms per token went up; the engineering attention moved from “make HBM fast” to “make the interconnect not the bottleneck.”

What’s left for future articles:

Routing as a load-balancing problem. This article assumed routing is roughly uniform. In practice the router is a tiny matmul whose outputs decide which cards are busy and which sit idle waiting for combine. Aux losses, expert biases, dispatch caps, drop-token behavior, shared-expert design — the levers production MoEs use to keep the workload balanced get their own treatment.
Overlapping the all-to-all with compute. A naive implementation stalls on dispatch and combine. Production stacks split the batch into micro-batches and pipeline them through dispatch / compute / combine, or kick off the next layer’s local work while the current all-to-all is still in flight. This is where deployment frameworks (TensorRT-LLM, vLLM, SGLang, Megatron) actually compete on MoE numbers.
MoE under disaggregation. Article 06 split prefill and decode. MoE has its own version of the same question: prefill all-to-alls move many tokens at once (good per-token amortization, fabric stays saturated); decode all-to-alls move tens of tokens (bad amortization, fabric latency dominates). The two phases may want different EP sizes or different placement strategies. The next article picks up the disaggregation-engineering thread from article 06; once that’s in place we’ll come back to how MoE fits into a disaggregated stack.

Same grammar as the rest of the series. Name the bottleneck, factor the workload until each piece sees only the bottleneck that binds it, optimize per piece. MoE factors the FFN along the parameter axis: most parameters sit idle for most tokens, and EP is how “sit idle on a different GPU” actually works in practice.
