The release shows that quantization can be built into the training stage itself rather than used only as after-the-fact compression. [Photo: Google]

[Digital Today reporter Jinju Hong (홍진주)] Google has unveiled Gemma 4 QAT, a set of models that sharply reduce memory use so that large artificial intelligence models are easier to run on smartphones and standard laptops. As demand grows to run AI models directly on devices rather than in the cloud, the release is seen as a way to lower the barrier to running models while minimising performance degradation.

On June 8 (local time), online media outlet Gigazine reported that Google had unveiled a Gemma 4 model family trained with quantization-aware training (QAT).

The key to the new models is that quantization was taken into account during training itself. Typically, AI models are quantized after training to reduce memory use. But that method can lower response quality, because the weights lose numerical precision they were never trained to do without.
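To make the precision loss concrete, the following is a minimal sketch of quantizing weights after training. It assumes a generic symmetric round-to-nearest scheme with a single scale for the whole list; it is not Google's method and not the exact Q4_0 layout, and the weight values are made up for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Map each weight to the nearest level on a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for all weights
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest gap between an original weight and its quantized value."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical trained weights, chosen only for illustration.
weights = [0.82, -0.41, 0.07, -0.96, 0.33, 0.005, -0.58, 0.69]

err4 = max_abs_error(weights, bits=4)
err8 = max_abs_error(weights, bits=8)
print(f"max error at 4 bits: {err4:.4f}")
print(f"max error at 8 bits: {err8:.4f}")
```

The 4-bit grid has far fewer levels than the 8-bit grid, so small weights land further from their original values. That rounding error is what can degrade responses when a model is quantized only after training.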

Gemma 4 QAT, by contrast, simulates quantization during the training stage, so the model learns weights that already account for the reduced precision. Google said this can sharply reduce memory use while keeping response quality at the level of the existing models.
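The article does not describe Google's training recipe, so the following is only a toy sketch of the general idea behind quantization-aware training: the forward pass uses a "fake-quantized" weight, while the gradient is applied to a full-precision latent weight (a straight-through estimator). The function names, the fixed scale, and the one-weight model are all assumptions made for illustration.

```python
def fake_quant(w, scale, bits=4):
    """Snap a weight to the nearest grid level, clipping to the 4-bit range."""
    qmax = 2 ** (bits - 1) - 1
    code = max(-qmax, min(qmax, round(w / scale)))
    return code * scale


def train_qat(data, steps=200, lr=0.05, scale=0.25):
    """Fit y = w * x while the forward pass only ever sees a quantized w."""
    w = 0.0                                  # latent full-precision weight
    for _ in range(steps):
        w_q = fake_quant(w, scale)           # forward pass uses the quantized weight
        grad = 0.0
        for x, y in data:
            pred = w_q * x
            grad += 2 * (pred - y) * x       # gradient of squared error w.r.t. w_q
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and update w directly.
        w -= lr * grad / len(data)
    return w, fake_quant(w, scale)


# Toy data generated from a true weight of 0.6, which is not on the 0.25 grid.
data = [(1.0, 0.6), (2.0, 1.2), (3.0, 1.8)]
latent_w, w_q = train_qat(data)
print(f"latent weight: {latent_w:.3f}, quantized weight used at inference: {w_q}")
```

In this toy setup the latent weight hovers near the boundary between the two grid levels closest to the target, so the quantized weight ends at one of them. Real QAT applies the same principle across all layers of a network, with learned or calibrated scales.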

The release ties in with rising demand to run AI models locally. Generally, to run a large language model on a PC, the entire model must be loaded into graphics card memory (VRAM). If the model exceeds VRAM capacity, the overflow spills into system memory (RAM) or storage (SSD), which can sharply slow responses.
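As a rough illustration of the VRAM constraint, the sketch below estimates weight memory as parameter count times bits per weight. The 7-billion-parameter model, the 8 GB card, and the 1 GB overhead allowance are hypothetical values, not figures from Google, and the estimate ignores runtime costs such as the KV cache beyond that flat allowance.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_vram(num_params, bits_per_weight, vram_gb, overhead_gb=1.0):
    """Check whether weights plus a flat overhead allowance fit in VRAM."""
    needed = weight_memory_gb(num_params, bits_per_weight) + overhead_gb
    return needed <= vram_gb


HYPOTHETICAL_PARAMS = 7_000_000_000   # illustrative model size, not a Gemma figure
HYPOTHETICAL_VRAM_GB = 8.0            # illustrative consumer graphics card

for bits in (16, 8, 4):
    size = weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    ok = fits_in_vram(HYPOTHETICAL_PARAMS, bits, HYPOTHETICAL_VRAM_GB)
    print(f"{bits:>2}-bit weights: {size:.1f} GB, fits in 8 GB VRAM: {ok}")
```

Under these assumptions, only the 4-bit version fits entirely in VRAM, which is why lower-precision weights matter for running models on consumer hardware.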

Google plans to ease those constraints with Gemma 4 QAT and support running AI models across a wider range of devices. Gemma 4 QAT applies to the entire Gemma 4 line-up, including E2B, E4B, 12B, 26B A4B and 31B. Optimised versions for mobile devices are also provided for the E2B and E4B models.

The memory-saving effect is more pronounced in smaller models. The existing Gemma 4 E2B model required about 11.4 GB of memory, but the QAT-based 4-bit (Q4_0) version can run with about 2.9 GB. The mobile-optimised version lowered the memory requirement to about 1.1 GB, and a text-only E2B model excluding image and audio processing can run with just 0.84 GB of memory.
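The savings implied by the E2B figures above can be worked out directly. The short sketch below uses only the memory values quoted in this article; the variant labels are shorthand chosen here.

```python
BASELINE_GB = 11.4  # existing Gemma 4 E2B, as reported

# Memory requirements of the QAT-based variants, as reported in the article.
variants_gb = {
    "QAT 4-bit (Q4_0)": 2.9,
    "mobile-optimised": 1.1,
    "text-only": 0.84,
}


def reduction_percent(baseline, reduced):
    """Share of the baseline memory that is saved, in percent."""
    return (1 - reduced / baseline) * 100


def shrink_factor(baseline, reduced):
    """How many times smaller the reduced footprint is."""
    return baseline / reduced


for name, gb in variants_gb.items():
    pct = reduction_percent(BASELINE_GB, gb)
    factor = shrink_factor(BASELINE_GB, gb)
    print(f"{name}: {gb} GB, {pct:.1f}% less memory, {factor:.1f}x smaller")
```

By these figures, the 4-bit version needs roughly a quarter of the original memory, and the text-only version less than a tenth.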

That aligns with a recent push to run generative AI directly on smartphones and lightweight laptops. Previously, many models required tens of gigabytes or more of memory, making them hard to use on consumer devices. Gemma 4 QAT is viewed as sharply lowering the entry barrier for local AI.

Google also took an open approach to distribution. The Gemma 4 QAT models are provided free of charge under the open-source Apache License 2.0. They are also officially supported in major local AI runtimes widely used by developers, including llama.cpp, Ollama and LM Studio. That is intended to allow use across various environments without being tied to a separate closed platform.

The industry sees the announcement as a sign that competition among AI models is expanding beyond raw performance to execution efficiency and accessibility. Some also expect the mobile version, which can run on around 1 GB of memory, and the 0.84 GB text-only model to become a catalyst for AI features to spread rapidly to smartphones, tablets and low-spec laptops.

Google's strategy is to use Gemma 4 QAT to lay the groundwork for expanding AI from data centres and high-performance PCs to consumer devices. As competition over AI model performance intensifies, the ability to run on more devices with fewer resources is emerging as a new point of competition.

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.