Paper #16
Switch Transformers: Scaling to Trillion Parameter Models (2022)
AI-generated
Switch Transformers introduced a simplified mixture-of-experts (MoE) approach in which each input token is routed to a single expert. This makes models with trillions of parameters computationally feasible, because only a small fraction of the parameters is active for any given input.
In a standard Transformer, every input activates every parameter. In a Switch Transformer, each feed-forward layer is replaced with a set of "expert" feed-forward layers, and a lightweight routing network selects the single expert that processes each token. A model with 1 trillion total parameters might therefore activate only around 100 billion for any given input.
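The routing step can be sketched in a few lines of NumPy. Everything below (layer sizes, initialization, names) is illustrative rather than the paper's implementation; a real Switch layer would batch tokens per expert, enforce expert capacity limits, and run on accelerators.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 8, 32, 4, 10

# Router: a single linear layer producing one logit per expert.
W_router = rng.standard_normal((d_model, n_experts)) * 0.1
# Each expert is an independent two-layer feed-forward network.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.1,
     rng.standard_normal((d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]

def switch_ffn(x):
    """Route each token to its single highest-probability expert."""
    logits = x @ W_router                        # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax over experts
    chosen = probs.argmax(axis=-1)               # top-1 ("switch") routing
    out = np.zeros_like(x)
    for i, e in enumerate(chosen):
        w1, w2 = experts[e]
        h = np.maximum(x[i] @ w1, 0.0)           # expert FFN with ReLU
        # Scale the output by the router probability, which is what keeps
        # the routing decision differentiable in a real implementation.
        out[i] = (h @ w2) * probs[i, e]
    return out, chosen

x = rng.standard_normal((n_tokens, d_model))
y, chosen = switch_ffn(x)
```

Note that each token runs through exactly one expert's weights, so per-token compute stays constant no matter how many experts the layer holds.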
The "switch" refers to routing each token to exactly one expert, as opposed to the top-k (typically top-2) experts of earlier MoE work. This simplification cuts routing computation and cross-device communication costs and improves training stability.
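Top-1 routing only stays stable if tokens spread evenly across experts, so the paper adds an auxiliary load-balancing loss. The sketch below is my reading of that loss (alpha times the number of experts times the dot product of per-expert dispatch fractions and mean router probabilities); treat the exact shapes and the default alpha as assumptions.

```python
import numpy as np

def load_balancing_loss(router_probs, chosen, n_experts, alpha=0.01):
    """Auxiliary loss encouraging a uniform split of tokens across experts.

    router_probs: (n_tokens, n_experts) softmax outputs of the router
    chosen:       (n_tokens,) index of the expert each token was sent to
    """
    n_tokens = router_probs.shape[0]
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(chosen, minlength=n_experts) / n_tokens
    # P_i: mean router probability assigned to expert i.
    P = router_probs.mean(axis=0)
    # Minimized when both f and P are uniform (1 / n_experts each).
    return alpha * n_experts * float(np.dot(f, P))

# Perfectly balanced routing over 2 experts: the loss reduces to alpha.
probs = np.full((4, 2), 0.5)
loss = load_balancing_loss(probs, np.array([0, 1, 0, 1]), n_experts=2)
# -> 0.01
```

Because the dispatch fraction f is non-differentiable, the router-probability term P is what actually carries gradient back to the routing network.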
MoE is the architecture behind many modern frontier models: Mistral's Mixtral 8x7B uses it, and GPT-4 is widely believed to as well. It allows building models with the knowledge capacity of very large networks while maintaining the inference cost of much smaller ones.
This is critical for deployment: serving a dense 1T parameter model would be prohibitively expensive, but a MoE model that activates only 100B parameters per forward pass is feasible.
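A back-of-envelope parameter count makes the total-versus-active gap concrete. The layer sizes below are hypothetical (loosely T5-like), chosen only to illustrate the ratio, not taken from the paper.

```python
def switch_layer_params(d_model, d_ff, n_experts):
    """Total vs. per-token-active parameters of one Switch FFN layer."""
    expert = 2 * d_model * d_ff      # two weight matrices per expert
    router = d_model * n_experts     # routing layer
    total = n_experts * expert + router
    active = expert + router         # one expert + the router per token
    return total, active

# Hypothetical layer sizes with 64 experts.
total, active = switch_layer_params(d_model=1024, d_ff=4096, n_experts=64)
ratio = active / total               # roughly 1/64: compute barely grows
```

Adding experts multiplies total capacity while leaving per-token compute nearly flat, which is exactly the property that makes trillion-parameter models servable.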
The paper showed up to 7x pre-training speedups at the same computational budget while maintaining quality, making it one of the most practical contributions to efficient AI scaling.
Authors: William Fedus, Barret Zoph, Noam Shazeer (Google Brain).
Key result: 1.6 trillion parameter model trained with the same computational budget as a 10B dense model.
Link to paper: https://arxiv.org/abs/2101.03961
Google AI Blog: "Switch Transformers" - https://ai.googleblog.com/2022/01/switch-transformers-scaling-to-trillion.html
Yannic Kilcher video walkthrough - https://www.youtube.com/watch?v=iARWDRCLXgA