MoE (Mixture of Experts)
A model architecture in which only a subset of the model's parameters is activated per token, enabling very large total capacity at manageable inference cost.
Mixture of Experts (MoE) is a neural network architecture in which a model contains many specialized sub-networks ("experts") but activates only a small subset of them for each input token. A learned router scores the experts for each token and sends the token to the highest-scoring few. This lets a model have an enormous total parameter count while keeping the compute per token (and thus inference cost) far below that of a dense model of the same total size.
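The routing step can be sketched in a few lines of plain Python. This is a minimal illustration of one common variant (softmax scores, keep the top k experts, renormalize their gates), not the implementation used by any particular model. The names `route_token`, `moe_layer`, and `router_weights`, and the toy "experts" that only scale their input, are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for one token vector.

    router_weights holds one weight vector per expert (hypothetical layout);
    an expert's score is the dot product of its vector with the token.
    """
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    # Keep the top_k most probable experts and renormalize their gates to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    gate_sum = sum(probs[i] for i in chosen)
    return [(i, probs[i] / gate_sum) for i in chosen]

def moe_layer(token, router_weights, experts, top_k=2):
    """Gate-weighted sum of the outputs of the selected experts only."""
    output = [0.0] * len(token)
    for index, gate in route_token(token, router_weights, top_k):
        expert_out = experts[index](token)  # unselected experts never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output

# Toy setup: 4 stand-in "experts" that scale the input by different constants.
random.seed(0)
dim, num_experts = 3, 4
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[random.uniform(-1, 1) for _ in range(dim)]
                  for _ in range(num_experts)]

token = [0.5, -1.0, 2.0]
selected = route_token(token, router_weights, top_k=2)
result = moe_layer(token, router_weights, experts, top_k=2)
print(selected)
print(result)
```

Only the two selected experts are ever called, which is where the per-token compute saving comes from. In a real model the experts would be feed-forward networks and the router weights would be learned during training.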
Dense vs MoE models
In a standard "dense" model such as Llama 3 70B, every token activates all model parameters on every forward pass. In an MoE model like DeepSeek V3 (671B total, 37B active) or Mixtral 8x7B (about 47B total, 13B active), only a fraction of the parameters is active per token. The total parameter count governs the model's knowledge capacity and its memory footprint; the active parameter count governs the compute cost of each token.
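To make the total-versus-active distinction concrete, the arithmetic for DeepSeek V3's published figures can be worked through as below. This is a rough sketch: the "2 FLOPs per active parameter per token" rule of thumb and the bytes-per-parameter values are simplifying assumptions, and the function names are made up for the example.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params

def approx_forward_flops_per_token(active_params):
    """Rule-of-thumb estimate (assumption): ~2 FLOPs per active parameter."""
    return 2 * active_params

def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9

# DeepSeek V3 figures from the text: 671B total, 37B active per token.
total, active = 671e9, 37e9

frac = active_fraction(total, active)
moe_flops = approx_forward_flops_per_token(active)
dense_flops = approx_forward_flops_per_token(total)  # hypothetical dense model of equal size
mem_8bit = weight_memory_gb(total, bytes_per_param=1)   # assumes 8-bit weights
mem_16bit = weight_memory_gb(total, bytes_per_param=2)  # assumes 16-bit weights

print(f"active fraction: {frac:.1%}")
print(f"compute ratio (dense / MoE): {dense_flops / moe_flops:.1f}x")
print(f"weights at 8-bit: {mem_8bit:.0f} GB, at 16-bit: {mem_16bit:.0f} GB")
```

Under these assumptions, each token touches only about 5.5% of the parameters, yet the memory needed to hold the weights is set by the full 671B.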
Why MoE matters
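DeepSeek V3's reported training cost of roughly $5.6M (2.788M H800 GPU-hours at an assumed $2 per GPU-hour, covering the final training run only and excluding prior research and experiments) is often cited as evidence that MoE enables frontier performance at dramatically lower cost than the hundreds of millions estimated for comparable dense models. At inference time, MoE models can also be faster than their total parameter count suggests, since fewer parameters are active per token.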
MoE challenges
- Load balancing: If most tokens route to the same experts, you lose the efficiency benefit and expert specialization degrades.
- Communication overhead: In distributed inference, routing tokens to experts on different devices adds latency.
- Memory: All expert weights must be loaded into memory even if only a few are active per token.
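The load-balancing problem is usually addressed with an auxiliary loss added during training. The sketch below follows the Switch Transformer style formulation for top-1 routing: the number of experts times the sum, over experts, of the fraction of tokens sent to that expert multiplied by the mean router probability for it. It is an illustration under that assumption, not the scheme used by any specific model named here, and the function name and toy probabilities are made up.

```python
def load_balance_loss(router_probs):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one list of per-expert probabilities for each token.
    Returns about 1.0 for uniform routing and grows as routing collapses.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        top_expert = max(range(num_experts), key=lambda i: probs[i])
        counts[top_expert] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i across tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Four tokens that all prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The collapsed case scores higher than the balanced one, so minimizing this term (scaled by a small coefficient) pushes the router toward spreading tokens across experts.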
Models relevant to MoE (Mixture of Experts)
- DeepSeek V3: State-of-the-art open-weights model that shocked the industry with frontier performance at minimal cost
- Qwen 3: Alibaba's highly capable open-weights model with top-tier multilingual performance
- Llama 4: Meta's multimodal open-weights model family with a 10M context window variant