AI-101

Paper #9

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)

AI Confidence: 80%

AI-generated

TL;DR

Showing a language model a few worked examples of step-by-step reasoning dramatically improves its performance on reasoning tasks. This paper formalized chain-of-thought prompting and showed it is an emergent capability that appears only in sufficiently large models; a follow-up paper found that simply appending "Let's think step by step" captures much of the same benefit.

What It Does

The authors demonstrated that providing a few examples of step-by-step reasoning in the prompt (few-shot chain of thought) causes language models to generate intermediate reasoning steps before arriving at an answer. This contrasts with standard prompting, where the model jumps directly to the final answer.

For example, instead of asking "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" and getting a direct (often wrong) answer, you show the model a worked example where someone writes out "5 + (2 x 3) = 5 + 6 = 11", and the model learns to show its work.
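The prompt construction above can be sketched as follows. This is a minimal illustration, not the paper's exact exemplar set: the function names are hypothetical, and in practice the assembled prompt would be sent to any LLM completion API.

```python
# Few-shot chain-of-thought prompting: prepend a worked exemplar so the
# model imitates step-by-step reasoning before giving its final answer.
# Exemplar adapted from the paper's tennis-ball example.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "2 x 3 = 6 balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Build a few-shot CoT prompt: exemplar first, then the new question."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "A juggler has 16 balls. Half are golf balls, and half of the "
    "golf balls are blue. How many blue golf balls are there?"
)
print(prompt)
```

Because the exemplar ends with explicit intermediate steps, the model's continuation after the trailing "A:" tends to follow the same step-by-step pattern.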

A follow-up finding (zero-shot chain of thought) showed that simply appending "Let's think step by step" to any prompt achieves much of the same benefit without needing examples.
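The zero-shot variant needs no exemplars at all. A minimal sketch, assuming the standard trigger phrase from Kojima et al. (the function name is hypothetical):

```python
# Zero-shot chain-of-thought: append a reasoning trigger instead of
# providing worked exemplars.

def zero_shot_cot(question: str,
                  trigger: str = "Let's think step by step.") -> str:
    """Format a question so the model's answer begins with the trigger."""
    return f"Q: {question}\nA: {trigger}"

prompt = zero_shot_cot(
    "If a train travels 60 miles in 1.5 hours, what is its average speed?"
)
print(prompt)
```

The model continues from the trigger phrase, producing intermediate reasoning; a second call typically extracts the final answer from that reasoning.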

Why It Matters

Chain-of-thought prompting is one of the most practical and widely used prompting techniques. It requires no training or fine-tuning (just add words to your prompt) and significantly improves accuracy on math, logic, coding, and complex reasoning tasks.

It also revealed something profound about large language models: the ability to reason step-by-step is an emergent capability. Models below a certain size do not benefit from chain-of-thought prompting. Above that threshold, the improvement is dramatic. This suggests that scale unlocks qualitatively new capabilities, not just quantitative improvements.

Key Details

Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou (Google Brain).

Key finding: On the GSM8K math benchmark, chain-of-thought prompting with PaLM 540B achieved 57% accuracy, compared to 18% with standard prompting.

Link to paper: https://arxiv.org/abs/2201.11903

Sources & Further Reading

Full paper: https://arxiv.org/abs/2201.11903

Google AI Blog: "Language models perform reasoning" - https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html

Kojima et al., "Large Language Models are Zero-Shot Reasoners" (zero-shot CoT) - https://arxiv.org/abs/2205.11916