AI-101

Paper #16

Switch Transformers: Scaling to Trillion Parameter Models (2022)

AI Confidence: 80%

AI-generated

TL;DR

Switch Transformers introduced a simplified mixture-of-experts (MoE) approach where each input token is routed to a single expert, enabling models with trillions of parameters that are computationally feasible because only a fraction of parameters activate for each input.

What It Does

In a standard Transformer, every input token activates every parameter. In a Switch Transformer, each feed-forward layer is replaced with a set of "expert" feed-forward layers, and a routing network selects which single expert processes each token. A model with 1 trillion total parameters might therefore use only about 100 billion of them for any given token.

The "switch" refers to routing each token to exactly one expert (as opposed to top-k experts in earlier MoE work). This simplification reduces communication costs and improves training stability.
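The top-1 routing described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's distributed TPU implementation; the dimensions and random weights are toy values. It also includes the paper's auxiliary load-balancing loss, which penalizes the router for sending too many tokens to the same expert and is part of what keeps training stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 4, 8, 3

tokens = rng.standard_normal((n_tokens, d_model))
router_w = rng.standard_normal((d_model, n_experts))
# Each "expert" stands in for a feed-forward layer; here just one weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Router: softmax over expert logits, then keep only the top-1 ("switch") expert.
logits = tokens @ router_w                               # (n_tokens, n_experts)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                           # one expert index per token

# Only the chosen expert runs for each token; scaling the output by the router
# probability lets gradients flow back into the router during training.
out = np.empty_like(tokens)
for i, e in enumerate(chosen):
    out[i] = probs[i, e] * (tokens[i] @ experts[e])

# Auxiliary load-balancing loss from the paper: f_i is the fraction of tokens
# routed to expert i, P_i is the mean router probability for expert i, and the
# loss is N * sum_i f_i * P_i (minimized when tokens are spread uniformly).
f = np.bincount(chosen, minlength=n_experts) / n_tokens
P = probs.mean(axis=0)
aux_loss = n_experts * np.sum(f * P)
```

In practice the auxiliary loss is added to the language-modeling loss with a small coefficient, so the router learns to balance load without sacrificing quality.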

Why It Matters

MoE is the architecture behind many modern frontier models: Mistral's Mixtral 8x7B uses it, and GPT-4 is widely believed to as well. It allows building models with the knowledge capacity of very large models while maintaining the inference cost of much smaller ones.

This is critical for deployment: serving a dense 1T parameter model would be prohibitively expensive, but a MoE model that activates only 100B parameters per forward pass is feasible.

The paper showed up to 7x pre-training speedups over a dense baseline at equal compute while maintaining quality, making it one of the most practical contributions to efficient AI scaling.

Key Details

Authors: William Fedus, Barret Zoph, Noam Shazeer (Google Brain).

Key result: a 1.6 trillion parameter model trained with roughly the same per-token computational budget as a ~10B-parameter dense model.

Link to paper: https://arxiv.org/abs/2101.03961

Sources & Further Reading

Full paper: https://arxiv.org/abs/2101.03961

Google AI Blog: "Switch Transformers" - https://ai.googleblog.com/2022/01/switch-transformers-scaling-to-trillion.html

Yannic Kilcher video walkthrough - https://www.youtube.com/watch?v=iARWDRCLXgA