Google unveiled its TurboQuant compression algorithm. [Photo: Google]

[Digital Today Kyung-min Hong (홍경민), intern reporter] Google has unveiled TurboQuant, a new compression algorithm that can cut memory use and increase speed for large language models (LLMs).

On March 25 (local time), IT outlet Ars Technica reported that TurboQuant is designed to shrink the key-value (KV) cache, which holds the key and value vectors of previously processed tokens so that an LLM does not have to recompute them. It aims to cut memory use while maintaining performance and accuracy. Google said that in early tests, some benchmarks showed memory use reduced by up to a factor of 6 and performance improved by up to a factor of 8.
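To give a sense of scale, the short sketch below estimates the size of a KV cache and what a sixfold reduction would mean. The model dimensions (32 layers, 8 key-value heads, head size 128, a 32,768-token context, 16-bit values) are hypothetical numbers chosen for illustration, not figures from Google, and the factor of 6 is the upper bound quoted above rather than a guaranteed result.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing one key tensor and one value
    tensor per layer. All dimensions passed in are assumptions.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, used only to illustrate the arithmetic.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768)

# Apply the "up to 6 times" reduction reported in the article.
compressed = baseline / 6

GIB = 1024 ** 3
print(f"baseline:   {baseline / GIB:.2f} GiB")
print(f"compressed: {compressed / GIB:.2f} GiB")
```

Under these assumed dimensions, a cache of about 4 GiB per sequence would shrink to well under 1 GiB, which is why the cache is a natural target for compression on memory-limited hardware.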

The technology works by handling the high-dimensional vector data used by AI models more efficiently. Vectors are conventionally stored as Cartesian coordinates, but Google uses a system called PolarQuant to convert them into polar coordinates, describing each point by a radius and an angle, which simplifies the data representation and improves compression efficiency.
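The idea of converting coordinates to polar form before compressing can be illustrated with a toy example. The sketch below is not Google's PolarQuant implementation; it simply pairs up coordinates, converts each pair to a radius and an angle, and stores the angle with a small number of bits. The function names and the choice to leave the radius unquantized are assumptions made for illustration.

```python
import math


def to_polar(x, y):
    """Convert a Cartesian pair to (radius, angle in [0, 2*pi))."""
    return math.hypot(x, y), math.atan2(y, x) % (2 * math.pi)


def quantize_angle(theta, bits):
    """Snap an angle to one of 2**bits evenly spaced levels."""
    levels = 1 << bits
    return round(theta / (2 * math.pi) * levels) % levels


def dequantize_angle(index, bits):
    """Recover the angle represented by a stored level index."""
    return index * 2 * math.pi / (1 << bits)


def roundtrip(vec, bits=4):
    """Compress consecutive coordinate pairs via quantized angles, then rebuild.

    Assumes len(vec) is even. The radius is kept at full precision here,
    which is a simplification for this toy example.
    """
    out = []
    for i in range(0, len(vec), 2):
        r, theta = to_polar(vec[i], vec[i + 1])
        theta_hat = dequantize_angle(quantize_angle(theta, bits), bits)
        out.extend([r * math.cos(theta_hat), r * math.sin(theta_hat)])
    return out


original = [3.0, 4.0, -1.0, 2.0]
approx = roundtrip(original, bits=8)
```

Because the angle always falls in a fixed, known range, it can be mapped onto a uniform grid without first measuring the spread of the data, and the rebuilt vector keeps the original length of each pair.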

It also includes a correction step that applies a Quantized Johnson-Lindenstrauss (QJL) method to reduce the errors that compression can introduce. The step helps improve the accuracy of attention score calculations, a core operation in AI models, by reducing vector information to a minimal representation (as little as a single sign bit per value) while preserving the relationships between data points.
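A rough sketch of how a one-bit, Johnson-Lindenstrauss-style encoding can still preserve inner products (the quantity behind attention scores) is shown below. This is a simplified illustration under stated assumptions, not the TurboQuant code: the key is projected with a random Gaussian matrix, only the sign of each projection and the key's norm are kept, and the inner product with an unquantized query is estimated from those signs. The helper names, the projection size of 4,000 rows, and the example vectors are all hypothetical.

```python
import math
import random


def make_projection(m, d, seed=0):
    """Random Gaussian projection matrix with m rows and d columns."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_key(S, k):
    """Keep only the sign of each projection, plus the key's norm."""
    signs = [1 if dot(row, k) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(k, k))


def estimate_inner_product(S, q, signs, k_norm):
    """Estimate <q, k> from the stored sign bits and the full-precision query.

    For Gaussian rows s, E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / |k|,
    so rescaling by sqrt(pi/2) * |k| gives an unbiased estimate.
    """
    m = len(S)
    total = sum(dot(row, q) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2) * k_norm * total / m


S = make_projection(m=4000, d=4)
q = [0.6, 0.8, 0.0, 0.0]   # query, kept at full precision
k = [1.0, 0.0, 0.0, 0.0]   # key, stored as sign bits only

signs, k_norm = encode_key(S, k)
estimate = estimate_inner_product(S, q, signs, k_norm)
exact = dot(q, k)
```

The estimate fluctuates around the exact value, and the fluctuation shrinks as more projection rows are used, which is the trade-off between storage and accuracy that such a correction step has to manage.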

Google said it tested the algorithm on open models including Gemma and Mistral, and that it can be applied without additional training. Industry observers see the technology as having the potential to cut AI model operating costs and to enable more efficient AI use on limited hardware such as mobile devices.

Copyright © DigitalToday. All rights reserved. Unauthorized reproduction and redistribution are prohibited.