Simon Willison’s Weblog: Introducing Gemma 3n: The developer guide

Source URL: https://simonwillison.net/2025/Jun/26/gemma-3n/
Source: Simon Willison’s Weblog
Title: Introducing Gemma 3n: The developer guide

Feedly Summary: Introducing Gemma 3n: The developer guide
Extremely consequential new open weights model release from Google today:

Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs.

Optimized for on-device: Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on effective parameters: E2B and E4B. While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.

This is very exciting: a 2B and 4B model optimized for end-user devices which accepts text, images and audio as inputs!
Gemma 3n is also the most comprehensive day one launch I’ve seen for any model: Google partnered with “AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM” so there are dozens of ways to try this out right now.
So far I’ve run two variants on my Mac laptop. Ollama offer a 7.5GB version (full tag gemma3n:e4b-it-q4_K_M) of the 4B model, which I ran like this:
ollama pull gemma3n
llm install llm-ollama
llm -m gemma3n:latest "Generate an SVG of a pelican riding a bicycle"

It drew me this:

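Incidentally, llm-ollama talks to Ollama’s local HTTP API under the hood, so you can also hit the model directly with curl. A minimal sketch, assuming Ollama is running on its default port 11434:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3n:latest",
  "prompt": "Generate an SVG of a pelican riding a bicycle",
  "stream": false
}'
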
The Ollama version doesn’t appear to support image or audio input yet.
… but the mlx-vlm version does!
First I tried that on this WAV file like so (using a recipe adapted from Prince Canuma’s video):
uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Transcribe the following speech segment in English:" \
  --audio pelican-joke-request.wav

That downloaded a 15.74 GB bfloat16 version of the model and output the following correct transcription:

Tell me a joke about a pelican.

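If you want to record a similar test clip yourself on macOS, ffmpeg’s avfoundation input can capture from the microphone. A rough sketch — the audio device index after the colon varies by machine, the output filename is just a placeholder, and I haven’t confirmed what sample rate Gemma 3n prefers (16kHz mono is a common choice for speech):

# list available capture devices first
ffmpeg -f avfoundation -list_devices true -i ""
# record ~5 seconds of 16kHz mono audio from audio device 0
ffmpeg -f avfoundation -i ":0" -t 5 -ar 16000 -ac 1 my-joke-request.wav
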
Then I had it draw me a pelican for good measure:
uv run --with mlx-vlm mlx_vlm.generate \
  --model gg-hf-gm/gemma-3n-E4B-it \
  --max-tokens 100 \
  --temperature 0.7 \
  --prompt "Generate an SVG of a pelican riding a bicycle"

I quite like this one:

It’s interesting to see such a striking visual difference between those 7.5GB and 15GB model quantizations.
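If you want to compare the two pelicans side by side, one option is to render each SVG to a PNG with librsvg (brew install librsvg). A quick sketch — the filenames are placeholders, and since the responses may include some text around the markup you might need to trim them down to just the <svg>…</svg> portion first:

# rasterize each saved SVG for easier comparison
rsvg-convert ollama-pelican.svg -o ollama-pelican.png
rsvg-convert mlx-pelican.svg -o mlx-pelican.png
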
Tags: google, ai, generative-ai, local-llms, llms, vision-llms, mlx, ollama, pelican-riding-a-bicycle, gemma, llm-release, prince-canuma

AI Summary and Description: Yes

Summary: The text discusses the release of Gemma 3n, a new open weights model by Google, designed for multimodal input (image, audio, text) and optimized for on-device efficiency. This model demonstrates significant advancements in architecture and usability, allowing end-users to process complex types of input with manageable resource consumption. The collaborative launch with multiple tech partners highlights its importance in the AI and generative AI landscape.

Detailed Description:

– **Product Overview**:
  – Gemma 3n is a multimodal model capable of accepting various input types, including images, audio, and text, while providing text outputs.
  – It is optimized for on-device usage, making it accessible for end-user devices without demanding extensive resources.

– **Technical Specifications**:
  – The model comes in two versions, E2B and E4B, featuring 5 billion and 8 billion parameters respectively but designed to have a memory footprint similar to that of traditional 2 billion and 4 billion parameter models (requiring as little as 2GB and 3GB of memory).
  – This architectural innovation is crucial for improving the performance of multimodal systems on low-resource platforms.

– **Collaboration and Availability**:
  – The launch included partnerships with prominent tech companies, which enhances the model’s usability and demonstrates strong community backing for its deployment across various platforms. Partners include:
    – AMD
    – Docker
    – Hugging Face
    – NVIDIA
    – RedHat
    – among others.

– **Practical Applications**:
  – Users can experiment with handling different types of media, such as generating images (SVG) or transcribing audio. Sample commands illustrate how to run the model using frameworks like Ollama and mlx-vlm.
  – The provided examples showcase the ease of use, from prompt generation to high-quality output, emphasizing the model’s versatility in application.

– **Importance for Professionals**:
  – Security and compliance professionals in AI, cloud, and infrastructure sectors should take note of the advancements in model optimization and multimodal processing, as these are pivotal for developing robust AI applications and services in resource-constrained environments.
  – The collaborative approach to the model’s launch suggests an evolving landscape where interoperability and shared innovation are key in AI development.

The release of Gemma 3n is significant not only for the capabilities it offers but also for its implications in security contexts, where running AI models efficiently on local devices is becoming increasingly critical.