Paper #20
Gemini: A Family of Highly Capable Multimodal Models (2023)
AI-generated
Google's Gemini is a natively multimodal model family trained from the ground up on interleaved text, images, audio, and video, rather than having vision bolted onto a text model after training.
Unlike GPT-4 and Claude 3, which began as primarily text models with vision capabilities added, Gemini was designed from inception to process and reason across multiple modalities simultaneously, handling images, video, and audio alongside text in a single unified architecture.
The family includes Ultra (most capable), Pro (balanced), and Nano (efficient, for on-device use). Gemini Ultra was the first model to exceed human expert performance on the MMLU benchmark (a massive multitask language understanding test covering 57 subjects).
Native multimodality is a significant architectural choice. When a model is trained on interleaved modalities from the start, it develops more natural cross-modal reasoning. Instead of translating images to text-like representations, Gemini reasons about images and text in a unified way.
The Nano variant is particularly important: it brings capable AI to mobile devices and edge computing, enabling on-device AI that works without internet connectivity.
Gemini's release intensified the competition between Google, OpenAI, and Anthropic, accelerating progress across the entire field.
Authors: Gemini Team, Google (team paper).
Model family: Gemini Ultra, Pro, and Nano.
Key benchmark: First model to exceed human expert performance on MMLU (90.0% vs. 89.8%).
Link to paper: https://arxiv.org/abs/2312.11805
Google DeepMind: Gemini announcement - https://deepmind.google/technologies/gemini/
Google AI: Gemini API documentation - https://ai.google.dev/gemini-api
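As a concrete illustration of the unified multimodal interface described above, a single request mixing text and an image might look like the following sketch. It assumes the `google-generativeai` Python SDK (`pip install google-generativeai`), Pillow for image loading, and an API key from Google AI; the model name and prompt are illustrative, not from the paper.

```python
"""Sketch of a multimodal Gemini API call: text + image in one request.

Assumptions (not from the paper): the `google-generativeai` SDK,
the "gemini-1.5-flash" model name, and the example prompt.
"""
import os


def describe_image(image_path: str, api_key: str) -> str:
    """Ask a Gemini model to describe an image alongside a text prompt."""
    # Imported lazily so this module still loads without the SDK installed.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    # Assumed model name; check the API docs for currently available models.
    model = genai.GenerativeModel("gemini-1.5-flash")
    # Text and image are passed together in one request -- the model
    # reasons over both modalities jointly rather than captioning first.
    response = model.generate_content(
        ["Describe this image in one sentence.", Image.open(image_path)]
    )
    return response.text


if __name__ == "__main__":
    # Only calls the API when a key is present in the environment.
    key = os.environ.get("GOOGLE_API_KEY")
    if key:
        print(describe_image("photo.jpg", key))
```

Because inputs of different modalities go into one content list, the same call shape extends to audio and video parts where the API supports them.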