Simon Willison’s Weblog: Gemma 3 QAT Models

Source URL: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Source: Simon Willison’s Weblog
Title: Gemma 3 QAT Models

Feedly Summary: Gemma 3 QAT Models
Interesting release from Google, as a follow-up to Gemma 3 from last month:

To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.

I wasn’t previously aware of Quantization-Aware Training but it turns out to be quite an established pattern now, supported in both TensorFlow and PyTorch.
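
Google haven't published their exact recipe, but as a rough illustration of the general pattern, here is a minimal QAT sketch using PyTorch's eager-mode torch.ao.quantization workflow. The tiny network is purely a stand-in, and the default fbgemm qconfig simulates int8 rather than the int4 used for Gemma 3:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
)

# Stand-in model: QuantStub/DeQuantStub mark where tensors enter and leave
# the quantized region of the network.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")  # int8; int4 would need custom observers
prepare_qat(model, inplace=True)  # inserts fake-quantize ops into the forward pass

# Fine-tune as usual: the simulated quantization noise lets the weights
# adapt to low precision before the real conversion happens.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    batch = torch.randn(32, 128)
    loss = model(batch).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

quantized = convert(model.eval())  # swap in genuinely quantized modules for inference
```
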
Google report model size drops from BF16 to int4 for the following models:

Gemma 3 27B: 54GB to 14.1GB
Gemma 3 12B: 24GB to 6.6GB
Gemma 3 4B: 8GB to 2.6GB
Gemma 3 1B: 2GB to 0.5GB
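
Those numbers are close to what the bit widths alone predict: BF16 stores each parameter in 2 bytes, int4 in half a byte, so roughly a 4x reduction, with the published files a little larger because of embeddings, quantization scales and other overhead. A quick back-of-the-envelope check (parameter counts taken from the model names, so approximate):

```python
# Rough sanity check on the reported BF16 -> int4 size drops.
# Parameter counts are the nominal ones from the model names; real checkpoints
# carry extra overhead (embeddings, per-group scales, metadata).
GB = 1e9

for name, params in [("27B", 27e9), ("12B", 12e9), ("4B", 4e9), ("1B", 1e9)]:
    bf16_gb = params * 2 / GB    # 16 bits = 2 bytes per parameter
    int4_gb = params * 0.5 / GB  # 4 bits = 0.5 bytes per parameter
    print(f"Gemma 3 {name}: ~{bf16_gb:.0f}GB in BF16, ~{int4_gb:.1f}GB in int4")
```
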

They partnered with Ollama, LM Studio, MLX and llama.cpp for this release – I’d love to see more AI labs following their example.
The Ollama model version picker currently hides them behind the “View all” option, so here are the direct links:

gemma3:1b-it-qat – 1GB
gemma3:4b-it-qat – 4GB
gemma3:12b-it-qat – 8.9GB
gemma3:27b-it-qat – 18GB

I fetched that largest model with:
ollama pull gemma3:27b-it-qat

And now I’m trying it out with llm-ollama:
llm -m gemma3:27b-it-qat "impress me with some physics"

I got a pretty great response!
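
The same model can also be called from Python through llm's programmatic API; a minimal sketch, assuming the llm-ollama plugin is installed and the model has already been pulled:

```python
import llm

# Assumes `llm install llm-ollama` and `ollama pull gemma3:27b-it-qat` have been run.
model = llm.get_model("gemma3:27b-it-qat")
response = model.prompt("impress me with some physics")
print(response.text())
```
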
Tags: llm, ai, ollama, llms, gemma, llm-release, google, generative-ai

AI Summary and Description: Yes

Summary: The announcement regarding the new versions of Gemma 3 optimized with Quantization-Aware Training (QAT) is significant for AI professionals, particularly as it lowers memory requirements for powerful models and enhances accessibility for local deployment on consumer-grade hardware.

Detailed Description: The release of the Gemma 3 QAT models from Google represents a notable advance in AI model efficiency and accessibility, particularly relevant to AI and Generative AI security professionals. Here are the critical points:

– **Quantization-Aware Training (QAT)**: This technique reduces the memory footprint of AI models while retaining their performance quality, making it a pivotal method for deploying large-scale AI applications effectively.
– **Model Size Reduction**: Google has reported a substantial drop in the model sizes for various versions of Gemma 3 after applying QAT:
  – Gemma 3 27B: Reduced from 54GB to 14.1GB
  – Gemma 3 12B: Reduced from 24GB to 6.6GB
  – Gemma 3 4B: Reduced from 8GB to 2.6GB
  – Gemma 3 1B: Reduced from 2GB to 0.5GB
– **Accessibility**: These optimizations enable the execution of sophisticated AI models on consumer-grade GPUs like the NVIDIA RTX 3090, broadening access for developers and researchers who lack high-end hardware.
– **Partnerships**: Google partnered with the teams behind tools and platforms such as Ollama, LM Studio, MLX, and llama.cpp, emphasizing collaboration within the AI community to enhance model deployment and usability.
– **Use Case Example**: The author mentions their experience pulling and trying out the largest model (gemma3:27b-it-qat) and receiving favorable responses, demonstrating practical applications and the quality of output from the model.

This development holds significant implications for AI security as teams can now deploy advanced generative models in a more scalable and resource-efficient manner, ensuring that robust AI models can be tested and deployed securely across varied infrastructures. Emphasizing innovation in model efficiency addresses concerns over resource allocation and can lead to more responsible AI development practices.