MoE (Mixture of Experts)

Mixture of Experts (MoE) is a neural network architecture where a model contains many specialized sub-networks called "experts" but only activates a small subset for each input token. A learned router network decides which experts process each token. This allows models to have enormous total parameter counts while keeping the compute per token (and thus inference cost) much lower than dense models of equivalent capacity.
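The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: it assumes softmax gating with top-k selection and renormalized gate weights, and the names `route_top_k`, `moe_layer`, the toy experts, and the router scores are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights.

    router_logits holds one score per expert for a single token. The values
    used below are made up; in a real model a learned layer produces them.
    Returns a list of (expert_index, gate_weight) pairs.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Sum the outputs of the selected experts, weighted by their gates."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # only the k selected experts run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Four toy "experts": each one just scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 2.0]          # hypothetical router scores
selected = route_top_k(logits, k=2)     # experts 1 and 3 tie for the top score
output = moe_layer([1.0, -1.0], logits, experts, k=2)
print(selected, output)
```

The other two experts are never called for this token, which is where the compute saving comes from. Real implementations batch tokens per expert rather than looping token by token.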

Dense vs MoE models

In a standard dense model such as Llama 3.1 405B, every token activates all model parameters on every forward pass. In an MoE model such as DeepSeek V3 (671B total parameters, 37B active per token), Mixtral 8x7B (about 47B total, 12.9B active; less than 8 × 7B because the experts share the attention layers), or Mixtral 8x22B (141B total, 39B active), only a fraction of parameters are active per token. The total parameter count determines the model's knowledge capacity; the active parameter count largely determines per-token inference cost and latency.
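To make the total-versus-active distinction concrete, the sketch below works through the DeepSeek V3 figures quoted above. The estimate of roughly 2 FLOPs per active parameter per token is a common rule of thumb for a forward pass, used here as an assumption; the function names are illustrative.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params

def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active_params

TOTAL = 671e9   # DeepSeek V3 total parameters (figure from the text above)
ACTIVE = 37e9   # DeepSeek V3 active parameters per token (figure from the text above)

fraction = active_fraction(TOTAL, ACTIVE)
moe_flops = forward_flops_per_token(ACTIVE)
# Hypothetical dense model with the same total parameter count, for comparison.
dense_flops = forward_flops_per_token(TOTAL)

print(f"active fraction: {fraction:.1%}")
print(f"compute ratio (dense / MoE): {dense_flops / moe_flops:.1f}x")
```

Under these assumptions only about 5.5% of the parameters take part in each token's forward pass, so a dense model of the same total size would need roughly 18 times the per-token compute.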

Why MoE matters

MoE enables training and deploying models with frontier-level performance at substantially lower computational cost than dense alternatives. DeepSeek V3 demonstrated this by achieving strong performance with a reported cost of around $5.5M for its final training run (a figure that excludes earlier research and experiments), compared with estimates in the hundreds of millions of dollars for comparable dense models. At inference time, MoE models are faster than their total parameter count suggests, since only a fraction of the parameters are used in each token's computation.

MoE challenges

Load balancing: If most tokens route to the same subset of experts, efficiency gains disappear and expert specialization degrades. Training techniques like auxiliary loss functions help distribute tokens evenly across experts.
Communication overhead: In distributed inference, routing tokens to experts on different devices or accelerators adds network latency and synchronization costs.
Memory requirements: All expert weights must be loaded into memory even if only a few experts are active per token, increasing memory footprint compared to dense models of similar active parameter count.
Training complexity: MoE models require careful tuning of routing mechanisms and load-balancing strategies. Some routing approaches (like top-k selection) are not fully differentiable, requiring specialized training techniques.
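As an illustration of the auxiliary loss mentioned under load balancing, here is a minimal sketch of one common formulation, in the style used by Switch Transformer: the number of experts multiplied by the sum, over experts, of the fraction of tokens dispatched to each expert times the mean router probability for that expert. The source does not say which formulation any given model uses, so treat this as an assumption; the function name and the toy probabilities are hypothetical, and top-1 dispatch is assumed for simplicity.

```python
def load_balancing_loss(router_probs, num_experts):
    """Auxiliary loss that is smallest when tokens spread evenly over experts.

    router_probs: one list of per-expert probabilities per token, each
    summing to 1. Assumes top-1 dispatch (each token goes to its argmax).
    """
    num_tokens = len(router_probs)

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i across all tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Each token prefers a different expert: routing is balanced.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0: routing has collapsed.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced, 4)
collapsed_loss = load_balancing_loss(collapsed, 4)
print(balanced_loss, collapsed_loss)
```

The balanced batch scores about 1.0 and the collapsed batch about 2.8, so adding this term (scaled by a small coefficient) to the training loss penalizes routers that send most tokens to the same experts. In a real framework the gradient flows through the mean probabilities, since the dispatch counts themselves are not differentiable.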
