Paper #5
Constitutional AI: Harmlessness from AI Feedback (2022)
AI-generated
Constitutional AI (CAI) is Anthropic's approach to making AI systems safer without relying entirely on human labelers. Instead of humans rating every output, the AI critiques and revises its own responses based on a set of written principles (a "constitution").
CAI works in two phases. In the first phase (supervised learning), the model generates a response, then is asked to critique that response according to a principle from the constitution (e.g., "choose the response that is least harmful"). The model then revises its response based on its own critique. These revised responses become training data.
In the second phase (reinforcement learning), instead of using human preference labels, the model itself evaluates pairs of responses according to constitutional principles. These AI-generated preferences are used to train a reward model, which then guides further fine-tuning via RL. The paper calls this reinforcement learning from AI feedback (RLAIF).
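The preference-labeling step can be sketched like this. Again a hedged illustration: `model_prefers` is a hypothetical stub for the feedback model (a real pipeline would present the principle and both responses to the model and compare its probabilities of answering "(A)" versus "(B)"), and the function names are invented.

```python
def model_prefers(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    """Hypothetical stub: which response better satisfies the principle?
    A real implementation would query the feedback model; here we
    deterministically pick "A" so the sketch runs."""
    return "A"

def label_preference(prompt: str, resp_a: str, resp_b: str,
                     principle: str) -> dict:
    """Produce one (chosen, rejected) pair for reward-model training."""
    winner = model_prefers(prompt, resp_a, resp_b, principle)
    chosen, rejected = (resp_a, resp_b) if winner == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "How do I stay safe online?",
    "Use strong, unique passwords and enable two-factor authentication.",
    "Just click whatever links look interesting.",
    "Choose the response that is least harmful.",
)
```

A reward model is then trained on these (chosen, rejected) pairs, in the same way as in standard RLHF, and the policy is optimized against that reward model.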
The approach addresses a fundamental scaling problem with RLHF: it requires enormous amounts of human feedback data, and human labelers can disagree, tire, and introduce biases. CAI reduces this dependence by having the AI do much of its own alignment work, guided by explicit principles.
It also makes the values being taught to the AI transparent and auditable. Instead of values being implicit in thousands of human preference judgments, they are written down as a constitution that anyone can read and critique.
Anthropic uses Constitutional AI as a core part of Claude's training. It is one of the reasons Claude tends to be direct about its limitations and uncertainties.
Authors: Yuntao Bai, Saurav Kadavath, Sandipan Kundu, and 20 others (Anthropic).
Link to paper: https://arxiv.org/abs/2212.08073
Anthropic blog: "Constitutional AI" - https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Anthropic research overview - https://www.anthropic.com/research