Paper #3
Language Models Are Few-Shot Learners - GPT-3 (2020)
AI-generated
GPT-3 demonstrated that scaling a language model to 175 billion parameters enables it to perform tasks it was never explicitly trained on, simply by being shown a few examples in the prompt. This paper established the concept of in-context learning and kickstarted the modern AI era.
GPT-3 is a 175-billion-parameter autoregressive language model trained to predict the next token in a sequence. The breakthrough was that at this scale, the model could perform new tasks without any fine-tuning: you simply describe the task and provide a few examples in the prompt (few-shot), and the model generalizes.
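"Autoregressive" here means the model generates text one token at a time, each prediction conditioned on everything that came before. A minimal sketch of that loop, using a toy hand-written bigram table (illustrative only; GPT-3 conditions on the full context with a 175B-parameter Transformer, not a bigram lookup):

```python
# Toy autoregressive generation: repeatedly pick the most likely
# next token given the previous one. The probabilities below are
# made up for illustration.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1])
        if dist is None:  # no known continuation; stop early
            break
        # Greedy decoding: append the highest-probability next token
        tokens.append(max(dist, key=dist.get))
    return tokens

print(generate(["the"]))  # → ['the', 'cat', 'sat', 'down']
```

The same generate-one-token-then-recondition loop is what produces GPT-3's completions; the model itself just replaces the lookup table.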
For example, you could show it three examples of English-to-French translations, then give it a new English sentence, and it would produce the correct French translation without ever being explicitly trained as a translator.
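Mechanically, few-shot prompting is just string construction: the example pairs and the new query are concatenated into a single prompt, and the model's next-token predictions complete the pattern. A minimal sketch of building such a prompt, using the English-to-French examples from the paper (the exact template wording is an assumption; the paper uses similar plain-text formats):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (source, target) example pairs and a final query
    into one prompt string, few-shot style."""
    lines = ["Translate English to French:"]
    for english, french in examples:
        lines.append(f"{english} => {french}")
    # The model is expected to complete this last, unfinished line.
    lines.append(f"{query} =>")
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]
print(build_few_shot_prompt(examples, "plush giraffe"))
```

No weights are updated anywhere in this process; the "learning" happens entirely inside a single forward pass over the prompt, which is why the paper calls it in-context learning.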
GPT-3 changed the AI paradigm from "train a specialized model for each task" to "prompt a general-purpose model." This is the foundation of how people use AI today. Every time you ask ChatGPT or Claude to do something, you are leveraging the in-context learning capability that GPT-3 first demonstrated at scale.
It also showed that scale matters enormously. GPT-3 was 100x larger than GPT-2, and the qualitative improvement was not just incremental but transformative, enabling entirely new capabilities that did not exist at smaller scales.
Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, and 28 others (OpenAI).
Model size: 175 billion parameters.
Training data: 570GB of filtered text from the internet.
Link to paper: https://arxiv.org/abs/2005.14165
The Illustrated GPT-2 (Jay Alammar) - https://jalammar.github.io/illustrated-gpt2/
Yannic Kilcher video walkthrough - https://www.youtube.com/watch?v=SY5PvZrJhLE