AI-101

Paper #20

Gemini: A Family of Highly Capable Multimodal Models (2023)

AI Confidence: 80%

AI-generated

TL;DR

Google's Gemini is a natively multimodal model family trained from the ground up on text, images, audio, and video interleaved together, rather than bolting vision onto a text model after training.

What It Does

Unlike contemporaries such as GPT-4, which began as text models and gained vision capabilities later, Gemini was designed from inception to process and reason across multiple modalities simultaneously. It can understand images, video, and audio alongside text in a unified architecture.
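That unified interface is visible in how you call the model: image and text parts are interleaved in a single prompt rather than handled by separate subsystems. A minimal sketch using Google's google-generativeai Python SDK is below; the model name, the inline-image dict format, and the file name are assumptions drawn from the public API docs, not from the paper, and may differ across SDK versions.

```python
# Sketch: sending one interleaved image + text prompt to Gemini.
# Assumes the google-generativeai SDK; model name and part format
# follow the public Gemini API docs, not the paper itself.
import os

def build_parts(question: str, image_bytes: bytes) -> list:
    """Interleave an image and a text question into a single prompt list,
    mirroring Gemini's unified multimodal input format."""
    return [
        {"mime_type": "image/png", "data": image_bytes},  # inline image part
        question,                                         # text part
    ]

if __name__ == "__main__" and os.environ.get("GOOGLE_API_KEY"):
    import google.generativeai as genai  # pip install google-generativeai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-pro-vision")  # assumed model name
    with open("chart.png", "rb") as f:                  # hypothetical file
        parts = build_parts("What trend does this chart show?", f.read())
    print(model.generate_content(parts).text)
```

The point of the sketch is the prompt structure: one list mixing binary image data and plain text, sent through a single generate call, rather than a vision encoder bolted on in front of a text API.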

The family includes Ultra (most capable), Pro (balanced cost and performance), and Nano (efficient, for on-device use). Gemini Ultra was reported as the first model to exceed human expert performance on the MMLU benchmark, a massive multitask language understanding test covering 57 subjects.

Why It Matters

Native multimodality is a significant architectural choice. When a model is trained on interleaved modalities from the start, it develops more natural cross-modal reasoning. Instead of translating images to text-like representations, Gemini reasons about images and text in a unified way.

The Nano variant is particularly important: it brings capable AI to mobile devices and edge computing, enabling on-device AI that works without internet connectivity.

Gemini's release intensified the competition between Google, OpenAI, and Anthropic, accelerating progress across the entire field.

Key Details

Authors: Gemini Team, Google (team paper).

Model family: Gemini Ultra, Pro, and Nano.

Key benchmark: First model reported to exceed human expert performance on MMLU (90.0% vs. an 89.8% human-expert baseline).

Link to paper: https://arxiv.org/abs/2312.11805

Sources & Further Reading

Full paper: https://arxiv.org/abs/2312.11805

Google DeepMind: Gemini announcement - https://deepmind.google/technologies/gemini/

Google AI: Gemini API documentation - https://ai.google.dev/gemini-api