Paper #13
High-Resolution Image Synthesis with Latent Diffusion Models (2022)
AI-generated
Latent Diffusion Models (the paper behind Stable Diffusion) made high-quality image generation computationally accessible by running the diffusion process in a compressed latent space rather than pixel space.
Standard diffusion models operate directly on images (pixel space), which is computationally expensive for high-resolution outputs. Latent Diffusion Models first compress the image into a smaller latent representation using a pre-trained autoencoder, then run the diffusion process in that compressed space.
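The size difference between the two spaces can be sketched with simple arithmetic, assuming the setup described in the paper and the Stable Diffusion release (a downsampling factor f = 8 and a 4-channel latent; these exact numbers are an assumption here):

```python
# Rough sketch of pixel-space vs. latent-space sizes, assuming
# Stable Diffusion's setup: downsampling factor f = 8, 4 latent channels.
H, W, C = 512, 512, 3            # pixel-space image
f, latent_C = 8, 4               # autoencoder downsampling factor, latent channels

pixel_elems = H * W * C                        # values per image in pixel space
latent_elems = (H // f) * (W // f) * latent_C  # values per image in latent space

print(pixel_elems, latent_elems, pixel_elems / latent_elems)
# 786432 16384 48.0
```

The diffusion U-Net therefore processes roughly 48x fewer values per step, which is where most of the savings come from.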
The model also introduced cross-attention conditioning, which allows flexible conditioning on text, layouts, or other modalities. This is what lets you generate images from text descriptions efficiently.
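The mechanism can be illustrated with a minimal single-head cross-attention sketch (illustrative only, not the paper's code; linear projections are omitted for brevity): queries come from the flattened image latents, while keys and values come from the conditioning embeddings, so every spatial position can attend to every text token.

```python
import numpy as np

def cross_attention(latents, cond_emb, d):
    """Single-head cross-attention sketch (projections omitted).

    latents:  (n_positions, d) flattened latent features -> queries
    cond_emb: (n_tokens, d) conditioning embeddings -> keys and values
    """
    Q, K, V = latents, cond_emb, cond_emb
    scores = Q @ K.T / np.sqrt(d)                      # (n_positions, n_tokens)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over tokens
    return weights @ V                                 # (n_positions, d)

rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(64, 8)), rng.normal(size=(5, 8)), d=8)
print(out.shape)  # (64, 8)
```

Because the keys and values can come from any encoder output, the same block conditions on text, layouts, or semantic maps without changing the U-Net itself.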
By moving diffusion to latent space, the authors reduced the computational cost of training and sampling by roughly 10-100x while maintaining image quality. This made it practical to train and run diffusion models on consumer hardware and led directly to the release of Stable Diffusion, which anyone can download and run on a single consumer GPU.
Stable Diffusion democratized AI image generation. Unlike DALL-E and Midjourney, which required API access or subscriptions, Stable Diffusion is open source. This spawned an enormous ecosystem of fine-tuned models, community extensions, ControlNet, LoRA adaptations, and creative applications.
The paper is one of the most practically impactful AI papers ever published, because it made powerful image generation accessible to everyone.
Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer (CompVis, Runway, LMU Munich).
Key innovation: Diffusion in latent space with cross-attention conditioning.
Link to paper: https://arxiv.org/abs/2112.10752
Stability AI: Stable Diffusion - https://stability.ai/
CompVis GitHub: Stable Diffusion - https://github.com/CompVis/stable-diffusion