Mixture of Experts (MoE)
An architecture where only a subset of model parameters activate for each input, enabling larger models at lower compute cost.
In a Mixture of Experts model, the network contains multiple "expert" sub-networks. For each input, a routing mechanism selects only a few experts to process it. This means the model can have many more total parameters (knowledge capacity) while only using a fraction of them for any given input (compute cost).
MoE is widely believed to be the architecture behind GPT-4 and is confirmed for Mistral AI's Mixtral. It addresses a fundamental trade-off: bigger models know more but cost more to run. MoE decouples knowledge capacity from inference cost. A 1-trillion-parameter MoE model that activates 100B parameters per input costs roughly the same to run as a dense 100B model, yet has far greater total capacity.
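The routing idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the expert and router weights are random, the sizes (`d_model`, `n_experts`, `top_k`) are made up for the example, and real MoE layers add details like load-balancing losses and batched dispatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward sub-network (here, one weight matrix)
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
# The router scores every expert for a given input
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Route x to its top-k experts and return their weighted mixture."""
    logits = x @ router                       # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Key point: only top_k of the n_experts run, so compute scales
    # with k while total parameters scale with n_experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(d_model)
y = moe_layer(x)
```

Note how the output is a weighted combination of just `top_k` expert outputs; the other experts' parameters exist (capacity) but consume no compute for this input.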
Mistral AI: Mixtral - https://mistral.ai/
Fedus et al., "Switch Transformers" - https://arxiv.org/abs/2101.03961