Introducing Gemma 4 12B: a unified encoder-free multimodal model

Today we’re introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE) model, Gemma 4 12B offers powerful capabilities with a reduced memory footprint. It’s also our first mid-size model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have already exceeded 150 million downloads. You’ve built everything from wearable robot arms for physical assistance to enterprise-grade AI security. We’re excited to see what you’ll build with this newest addition.

Here’s an overview of what makes Gemma 4 12B unique:

  • Pioneering, unified architecture: No multimodal encoders. Video and audio inputs flow directly into the LLM framework.
  • Advanced reasoning: Benchmark performance is close to that of our 26B model, unlocking advanced multi-step reasoning and agentic workflows.
  • Laptop-ready: Small enough to run locally with just 16GB of VRAM or unified memory.
  • Open and available: Released under the Apache 2.0 license with support across the entire developer ecosystem.
  • Developer-ready: Gemma 4 12B ships with Multi-Token Prediction (MTP) drafters that reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Now let’s take a closer look at how Gemma 4 12B achieves this.

Run state-of-the-art agents locally

Gemma 4 12B delivers performance close to our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic capabilities right on your PC.
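
To make the memory claim concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone might need. Everything in it is an assumption for illustration: the announcement does not say which numeric precision the 16GB figure refers to, the parameter counts are taken as the nominal 12 billion and 26 billion, and the helper names `weight_gib` and `fits` are hypothetical. It also ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
# Back-of-the-envelope estimate of weight memory only.
# Assumptions (not from the announcement): nominal parameter counts of
# 12e9 and 26e9, and three commonly used weight precisions.

GIB = 1024 ** 3

# Bytes needed to store one parameter at each assumed precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, precision: str) -> float:
    """Approximate GiB needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / GIB


def fits(num_params: float, precision: str, budget_gib: float = 16.0) -> bool:
    """True if the weights alone fit in the budget (no KV cache or overhead)."""
    return weight_gib(num_params, precision) <= budget_gib


if __name__ == "__main__":
    for name, params in [("12B", 12e9), ("26B", 26e9)]:
        for precision in BYTES_PER_PARAM:
            size = weight_gib(params, precision)
            ok = fits(params, precision)
            print(f"{name} {precision}: {size:.1f} GiB, fits in 16 GiB: {ok}")
```

Under these assumptions, a 12 billion parameter model would not fit in 16 GiB at 16-bit precision but would at 8-bit or 4-bit, and at any fixed precision its weights take 12/26 (about 46%) of the space of a 26 billion parameter model, which is consistent with the "less than half" figure above.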
