AI-101

Paper #4

Training Language Models to Follow Instructions with Human Feedback (2022)

AI Confidence: 80%

AI-generated

TL;DR

This paper (often called the InstructGPT paper) established reinforcement learning from human feedback (RLHF) as a practical method for aligning language models with human intent. RLHF had been explored in earlier work, but this paper applied it at scale to instruction following, and it is the technique that made ChatGPT possible.

What It Does

The core problem: GPT-3 was powerful but often produced outputs that were unhelpful, harmful, or untruthful. It was trained to predict text, not to be a useful assistant. InstructGPT solved this with a three-step process:

1. Collect demonstrations of desired behavior from human labelers and fine-tune the model on them (supervised fine-tuning).

2. Collect human rankings of model outputs (which response is better?) and train a reward model from these preferences.

3. Use the reward model to fine-tune the language model via reinforcement learning (specifically, Proximal Policy Optimization).
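In step 2, the human rankings are typically converted into pairwise comparisons and the reward model is trained to score the preferred response higher, using a Bradley-Terry-style loss. A minimal sketch in plain Python (the function name is illustrative, not from the paper):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one comparison: -log sigmoid(r_chosen - r_rejected).

    score_chosen / score_rejected are the reward model's scalar scores
    for the human-preferred and dispreferred responses. The loss is
    minimized when the chosen response scores much higher.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the two scores are equal the loss is log(2); it shrinks as the
# reward model learns to rank the preferred response higher.
```

Averaging this loss over all labeled comparisons, and backpropagating through the model that produced the scores, yields a reward model that predicts which outputs humans prefer.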

The result is a model that is dramatically more helpful, honest, and harmless than the base model, despite being 100x smaller.
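In step 3, the reward that PPO maximizes is not the raw reward-model score: the paper subtracts a per-token KL penalty against the supervised fine-tuned model, so the policy cannot drift into text the reward model was never trained on. A minimal numeric sketch of that combined reward (the coefficient value here is illustrative):

```python
def rlhf_reward(rm_score: float,
                logprob_policy: float,
                logprob_ref: float,
                beta: float = 0.02) -> float:
    """Reward used during PPO fine-tuning.

    rm_score       -- scalar score from the trained reward model
    logprob_policy -- log-probability of the sampled response under the
                      current policy
    logprob_ref    -- log-probability under the frozen supervised
                      fine-tuned reference model
    beta           -- KL penalty coefficient (illustrative value)

    The penalty term beta * (logprob_policy - logprob_ref) is an
    estimate of the KL divergence from the reference model.
    """
    return rm_score - beta * (logprob_policy - logprob_ref)
```

If the policy assigns the same log-probability as the reference model, the penalty vanishes and the reward is just the reward-model score; the more the policy over-concentrates on reward-hacking outputs, the larger the penalty grows.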

Why It Matters

RLHF is the reason modern AI assistants behave like assistants rather than text completion engines. Without it, language models tend to produce plausible-sounding text that may not actually address what the user wants. With RLHF, models learn to follow instructions, admit uncertainty, refuse harmful requests, and produce genuinely useful responses.

This paper is the direct ancestor of ChatGPT, which was essentially InstructGPT scaled up and released to the public. The RLHF approach is now used by virtually every AI lab.

Key Details

Authors: Long Ouyang, Jeff Wu, Xu Jiang, and 20 others (OpenAI).

Key result: A 1.3B parameter InstructGPT model was preferred by human evaluators over the 175B GPT-3 model.

Link to paper: https://arxiv.org/abs/2203.02155

Sources & Further Reading

Full paper: https://arxiv.org/abs/2203.02155

OpenAI blog: "Aligning language models" - https://openai.com/research/instruction-following

Chip Huyen: "RLHF: Reinforcement Learning from Human Feedback" - https://huyenchip.com/2023/05/02/rlhf.html