Paper #1
Attention Is All You Need (2017)
This paper introduced the Transformer architecture, which replaced recurrence and convolution with self-attention mechanisms. It is the foundation of virtually every modern large language model, including GPT, Claude, Gemini, and LLaMA.
The authors proposed a new neural network architecture called the Transformer that processes entire sequences in parallel using a mechanism called self-attention. Instead of reading text word by word (like RNNs) or looking at local windows (like CNNs), the Transformer lets every word attend to every other word simultaneously.
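The core operation behind this is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in the paper. Below is a minimal NumPy sketch (function and variable names are my own, not from the paper) showing how every position's output is a weighted mix of all positions' values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, per the paper's Eq. 1."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # mix values for each position

# Toy self-attention: Q, K, and V all come from the same sequence
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                           # 4 tokens, dimension 8
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because the attention weights in each row sum to one, each output position is a convex combination of all the value vectors, which is exactly the "every word attends to every other word" behavior described above.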
The key innovation is multi-head attention, where the model learns multiple different attention patterns in parallel, allowing it to capture different types of relationships (syntactic, semantic, positional) at the same time.
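The multi-head variant can be sketched as follows, again in NumPy with hypothetical names; real implementations learn the projection matrices during training, whereas here they are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads, Wq, Wk, Wv, Wo):
    """Project to Q/K/V, attend separately in each head, concat, project out."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)       # this head's slice of dims
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])       # independent attention pattern
    return np.concatenate(heads, axis=-1) @ Wo        # merge heads

# Toy usage with random (untrained) projection weights
rng = np.random.default_rng(1)
d_model, num_heads = 8, 2
x = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(x, num_heads, Wq, Wk, Wv, Wo)
print(out.shape)  # (4, 8)
```

Each head computes its own softmax over its own slice of the projected dimensions, which is what lets different heads specialize in different relationship types.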
Before this paper, sequence processing was inherently sequential, which made training slow and limited the ability to capture long-range dependencies. The Transformer solved both problems at once: it trains much faster (because of parallelization) and handles long-range dependencies better (because any two positions are directly connected via attention).
Every major AI system today builds on Transformers. GPT-4, Claude, Gemini, LLaMA, Mistral, and Whisper use the architecture directly, and generative image models such as DALL-E and Stable Diffusion rely on Transformer components like text encoders and attention layers. It is arguably the most impactful machine learning paper of the decade.
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain and Google Research).
Original task: Machine translation (English to German and English to French).
Link to paper: https://arxiv.org/abs/1706.03762
The Illustrated Transformer (Jay Alammar) - https://jalammar.github.io/illustrated-transformer/
Yannic Kilcher video walkthrough - https://www.youtube.com/watch?v=iDulhoQ2pro
3Blue1Brown: "Attention in transformers" - https://www.youtube.com/watch?v=eMlx5fFNoYc