Paper #11
DALL-E: Zero-Shot Text-to-Image Generation (2021)
AI-generated
DALL-E showed that a Transformer trained on text-image pairs can generate original images from text descriptions, with no task-specific training. It was the first convincing demonstration that language models could be extended to visual creativity.
DALL-E is a 12-billion-parameter version of GPT-3 trained to generate images from text captions. It treats image generation as a sequence prediction problem: text tokens (up to 256 BPE tokens) are followed by image tokens (a 32x32 grid of tokens from a discrete VAE with an 8192-entry codebook), and the model learns to predict the image tokens autoregressively given the text.
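The sequence layout above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the vocabulary sizes and token counts come from the paper, while the token-offset scheme is one simple way to keep the two vocabularies from colliding.

```python
TEXT_VOCAB = 16384      # BPE text vocabulary size (from the paper)
IMAGE_VOCAB = 8192      # discrete VAE codebook size (from the paper)
MAX_TEXT_LEN = 256      # maximum number of text tokens
IMAGE_TOKENS = 32 * 32  # 1024 image tokens per 256x256 image

def build_sequence(text_tokens, image_tokens):
    """Concatenate text and image tokens into one training sequence.
    Image token IDs are offset past the text vocabulary so the two
    token spaces don't collide (an illustrative choice, not the paper's)."""
    assert len(text_tokens) <= MAX_TEXT_LEN
    assert len(image_tokens) == IMAGE_TOKENS
    return list(text_tokens) + [t + TEXT_VOCAB for t in image_tokens]

def next_token_pairs(sequence):
    """Standard autoregressive objective: predict token i+1 from tokens 0..i."""
    return list(zip(sequence[:-1], sequence[1:]))
```

A Transformer trained on `next_token_pairs` of such sequences learns, in one objective, both language modeling over the caption and image generation conditioned on it.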
Given a prompt like "an armchair in the shape of an avocado," DALL-E generates multiple candidate images matching the description, which are reranked with a separate contrastive model (CLIP) to surface the best matches. It can combine unrelated concepts, apply visual styles, and render text within images.
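At inference time, generation is just autoregressive sampling of the 1024 image tokens given the text prefix. A minimal sketch, where `model` stands in for the trained Transformer (hypothetical callable returning next-token probabilities, not an actual API):

```python
import random

def sample_image_tokens(model, text_tokens, n_image_tokens=1024):
    """Autoregressively sample image tokens conditioned on the text prefix.
    `model` is any callable mapping a token sequence to a probability
    distribution over the image codebook (a placeholder, not real code
    from the paper)."""
    seq = list(text_tokens)
    for _ in range(n_image_tokens):
        probs = model(seq)  # distribution over next image token
        seq.append(random.choices(range(len(probs)), weights=probs)[0])
    return seq[len(text_tokens):]  # return only the image tokens
```

Sampling this loop many times with different random draws yields the multiple candidate images per prompt; the discrete VAE decoder then turns each token grid back into pixels.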
DALL-E proved that the Transformer architecture and language modeling approach could extend beyond text to visual generation. This was a conceptual breakthrough: instead of building specialized image generation architectures, you could use the same approach that powers text generation.
It catalyzed the AI image generation revolution. DALL-E was followed by DALL-E 2, Midjourney, Stable Diffusion, and others, creating an entirely new category of creative tools that millions of people now use daily.
The commercial and cultural impact has been enormous: AI image generation has changed graphic design, marketing, art, and content creation.
Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, and 4 others (OpenAI).
Model size: 12 billion parameters.
Link to paper: https://arxiv.org/abs/2102.12092
OpenAI: DALL-E 3 - https://openai.com/dall-e-3
Two Minute Papers: "DALL-E explained" - https://www.youtube.com/c/K%C3%A1rolyZsolnai