Paper #3
Language Models Are Few-Shot Learners - GPT-3 (2020)
AI-generated
GPT-3 demonstrated that scaling a language model to 175 billion parameters enables it to perform tasks it was never explicitly trained on, simply by being shown a few examples in the prompt. This paper established the concept of in-context learning and kickstarted the modern AI era.
GPT-3 is a 175-billion-parameter autoregressive language model trained to predict the next token in a sequence. The breakthrough was that at this scale, the model could perform new tasks without any fine-tuning: you simply describe the task and provide a few examples in the prompt (few-shot), and the model generalizes.
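"Autoregressive" here means the model generates text one token at a time, each prediction conditioned on everything that came before. A minimal sketch of that loop, using a toy hand-written bigram table (illustrative only; GPT-3 conditions on the full context with a 175B-parameter Transformer, not a bigram lookup):

```python
# Toy autoregressive generation: repeatedly pick the most likely
# next token given the previous one. The probabilities below are
# made up for illustration.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(prompt_tokens, max_new_tokens=3):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1])
        if dist is None:  # no known continuation; stop early
            break
        # Greedy decoding: append the highest-probability next token
        tokens.append(max(dist, key=dist.get))
    return tokens

print(generate(["the"]))  # → ['the', 'cat', 'sat', 'down']
```

The same generate-one-token-then-recondition loop is what produces GPT-3's completions; the model itself just replaces the lookup table.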
For example, you could show it three examples of English-to-French translations, then give it a new English sentence, and it would produce the correct French translation without ever being explicitly trained as a translator.
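Mechanically, few-shot prompting is just string construction: the example pairs and the new query are concatenated into a single prompt, and the model's next-token predictions complete the pattern. A minimal sketch of building such a prompt, using the English-to-French examples from the paper (the exact template wording is an assumption; the paper uses similar plain-text formats):

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (source, target) example pairs and a final query
    into one prompt string, few-shot style."""
    lines = ["Translate English to French:"]
    for english, french in examples:
        lines.append(f"{english} => {french}")
    # The model is expected to complete this last, unfinished line.
    lines.append(f"{query} =>")
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]
print(build_few_shot_prompt(examples, "plush giraffe"))
```

No weights are updated anywhere in this process; the "learning" happens entirely inside a single forward pass over the prompt, which is why the paper calls it in-context learning.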
GPT-3 changed the AI paradigm from "train a specialized model for each task" to "prompt a general-purpose model." This is the foundation of how people use AI today. Every time you ask ChatGPT or Claude to do something, you are leveraging the in-context learning capability that GPT-3 first demonstrated at scale.
It also showed that scale matters enormously. GPT-3 was 100x larger than GPT-2, and the qualitative improvement was not just incremental but transformative, enabling entirely new capabilities that did not exist at smaller scales.
Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, and 28 others (OpenAI).
Model size: 175 billion parameters.
Training data: 570GB of filtered text from the internet.
Link to paper: https://arxiv.org/abs/2005.14165
The Illustrated GPT-2 (Jay Alammar) - https://jalammar.github.io/illustrated-gpt2/
Yannic Kilcher video walkthrough - https://www.youtube.com/watch?v=SY5PvZrJhLE