Mixture of Experts (MoE)
An architecture where only a subset of model parameters activate for each input, enabling larger models at lower compute cost.
In a Mixture of Experts model, the network contains multiple "expert" sub-networks. For each input, a routing mechanism selects only a few experts to process it. This means the model can have many more total parameters (knowledge capacity) while only using a fraction of them for any given input (compute cost).
MoE is widely believed to be the architecture behind GPT-4 and is confirmed for Mistral AI's Mixtral. It addresses a fundamental trade-off: bigger models know more but cost more to run. MoE decouples knowledge capacity from inference cost. A 1-trillion-parameter MoE model that activates 100B parameters per input costs roughly the same to run as a dense 100B model, yet has far greater total capacity.
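The routing idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the expert and router weights are random, the sizes (`d_model`, `n_experts`, `top_k`) are made up for the example, and real MoE layers add details like load-balancing losses and batched dispatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward sub-network (here, one weight matrix)
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
# The router scores every expert for a given input
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route x to its top-k experts and return their weighted mixture."""
    logits = x @ router                       # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Key point: only top_k of the n_experts run, so compute scales
    # with k while total parameters scale with n_experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(d_model)
y = moe_layer(x)
```

Note how the output is a weighted combination of just `top_k` expert outputs; the other experts' parameters exist (capacity) but consume no compute for this input.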
Mistral AI: Mixtral - https://mistral.ai/
Fedus et al., "Switch Transformers" - https://arxiv.org/abs/2101.03961