Gemma 4 Technical Deep Dive: Multimodal Models from 2B to 31B Parameters

Source: Hugging Face News
Published: 2 Apr 2026
Added to AI-101: 5 Apr 2026

AI-generated

TLDR

Gemma 4's architecture introduces several innovations: per-layer embeddings providing residual signals to each decoder layer, shared KV cache where later layers reuse key-value states from earlier layers, alternating local sliding-window and global full-context attention, and variable aspect ratio vision encoding supporting 70 to 1120 image tokens.
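The attention and cache ideas above can be illustrated with a small sketch. It is not Gemma 4's implementation: the window size, the local-to-global layer ratio, and the rule for which earlier layer a sharing layer borrows from are all assumptions chosen for illustration, and the function names (causal_mask, layer_kinds, kv_source) are hypothetical.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention. An integer window
    restricts each query to the `window` most recent positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future tokens
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_kinds(num_layers: int, global_every: int) -> List[str]:
    """Assumed schedule: every `global_every`-th layer is global, the rest local."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


def kv_source(kinds: List[str], num_shared: int) -> List[int]:
    """For each layer, the index of the layer whose key-value states it reads.

    Assumed rule: the last `num_shared` layers compute no KV of their own and
    reuse the KV of the latest earlier layer of the same attention kind.
    Raises ValueError if no earlier layer of that kind computes its own KV.
    """
    first_shared = len(kinds) - num_shared
    sources = []
    for i, kind in enumerate(kinds):
        if i < first_shared:
            sources.append(i)  # this layer computes and caches its own KV
        else:
            donor = max(j for j in range(first_shared) if kinds[j] == kind)
            sources.append(donor)
    return sources


kinds = layer_kinds(num_layers=6, global_every=3)
local = causal_mask(seq_len=4, window=2)
full = causal_mask(seq_len=4)
sources = kv_source(kinds, num_shared=2)
```

In this toy setup the last query position sees only two keys under the local mask but all four under the global one, and the two sharing layers add no cache entries of their own. That is the intuition for why both techniques reduce memory use at long context lengths.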

The 31B model achieves 85.2% on MMLU Pro, 89.2% on AIME 2026, and 80.0% on LiveCodeBench. Multimodal capabilities span object detection with native JSON bounding boxes, GUI element detection, video understanding with audio, and function calling. Day-one deployment support includes transformers, llama.cpp, MLX for Apple Silicon, and mistral.rs, with TRL providing full multimodal fine-tuning support.

Key Takeaways

Gemma 4 features per-layer embeddings, shared KV cache, variable aspect ratio vision, and audio support. The 31B model achieves 85.2% on MMLU Pro and 89.2% on AIME 2026.
