Google's TurboQuant Cuts AI Memory Usage 6x, Sending Chip Stocks Tumbling

April 02, 2026 · 4 min read

Google Research has unveiled what may be the most consequential advance in AI infrastructure efficiency this year. TurboQuant, a novel vector quantization algorithm, compresses the key-value cache of large language models from 16 bits down to just 3 bits per value — a six-fold reduction in memory footprint — while producing zero measurable accuracy loss on standard benchmarks. The technique also accelerates attention computation by up to eight times on Nvidia H100 GPUs, and Google estimates it could slash enterprise AI infrastructure costs by more than 50 percent.

The algorithm operates through a two-stage pipeline that is elegant in its simplicity. First, a method called PolarQuant randomly rotates data vectors and converts them from Cartesian to polar coordinates before quantizing them. Then, a second stage known as QJL — short for Quantized Johnson-Lindenstrauss — applies 1-bit sign quantization to residual errors, effectively eliminating bias in attention score calculations. What makes TurboQuant particularly significant is that it requires no training, fine-tuning, or access to the original training data. The approach is entirely data-oblivious, meaning it can be deployed immediately on any existing model without modification.
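The shape of that two-stage pipeline can be illustrated with a toy NumPy sketch. This is not the paper's algorithm — the bit widths, the shared residual scale, and the simple per-coordinate quantizer standing in for PolarQuant are all illustrative assumptions — but it shows the structure: randomly rotate the vector, quantize it coarsely, then spend one extra bit per value on the sign of the residual so the reconstruction error shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def two_stage_quantize(x, R, n_bits=3):
    """Rotate, coarsely quantize the direction, then 1-bit
    sign-quantize the residual (a QJL-style correction)."""
    y = R @ x
    norm = np.linalg.norm(y)
    unit = y / norm
    # Stage 1 (stand-in for PolarQuant): uniform n_bits-level quantizer
    # on each coordinate of the unit-norm direction.
    levels = 2 ** n_bits
    code = np.round((unit + 1) / 2 * (levels - 1))
    coarse = code / (levels - 1) * 2 - 1
    # Stage 2 (QJL-style): keep only the sign of the residual error,
    # with one shared scale so the correction is cheap to store.
    residual = unit - coarse
    scale = np.mean(np.abs(residual))
    fine = coarse + scale * np.sign(residual)
    return norm * coarse, norm * fine

d = 64
x = rng.standard_normal(d)
R = random_rotation(d)
coarse_y, fine_y = two_stage_quantize(x, R)
err_coarse = np.linalg.norm(R @ x - coarse_y) / np.linalg.norm(x)
err_fine = np.linalg.norm(R @ x - fine_y) / np.linalg.norm(x)
# The sign-corrected reconstruction lands noticeably closer to R @ x
# than the coarse stage alone.
```

The random rotation is the key design choice: it spreads each vector's energy evenly across coordinates, so a simple per-coordinate quantizer works well regardless of how the original data was distributed — which is what lets the method stay data-oblivious.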

The research team, led by Google Research Scientist Amir Zandieh and Vahab Mirrokni, a Vice President and Google Fellow, demonstrated that TurboQuant achieves perfect downstream results across all LongBench benchmarks while maintaining superior recall ratios compared to established baselines like product quantization (PQ) and RaBitQ on vector search tasks. The paper is set to be formally presented at the International Conference on Learning Representations (ICLR) 2026 in Rio de Janeiro later this month, where it is expected to draw significant attention from both academic researchers and industry practitioners.

Wall Street responded swiftly to the announcement. Memory chip stocks sold off as investors recalculated future demand projections for AI data centers. Micron Technology dropped 3 percent, Western Digital fell 4.7 percent, and SanDisk slid 5.7 percent in the sessions following the news. Wells Fargo analyst Andrew Rocha noted that "TurboQuant directly attacks the cost curve for memory in AI systems," adding that broad adoption would force the industry to reassess actual memory demand — though he cautioned that overall appetite for AI infrastructure remains robust. The market reaction underscored how a single algorithmic breakthrough can reverberate across the semiconductor supply chain.

The technology targets one of the most expensive bottlenecks in AI inference: the KV cache. This data structure stores contextual information generated during a model's operation so that it does not need to be recomputed with every new token produced. As models have grown to hundreds of billions — and increasingly trillions — of parameters, the KV cache has become a primary driver of memory consumption and, by extension, hardware costs. By compressing this cache so aggressively without sacrificing quality, TurboQuant could make trillion-parameter models feasible on hardware configurations that previously could not support them.
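To see why the KV cache dominates inference costs, a back-of-envelope estimate helps. The model dimensions below are illustrative assumptions for a 70B-class model with grouped-query attention (they are not from the article), but the arithmetic shows how a long context at 16 bits per value swells to tens of gigabytes, and what 3-bit storage does to that figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits_per_value):
    # One K and one V entry per layer, per KV head, per head dimension,
    # per token in the context, per sequence in the batch.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return n_values * bits_per_value / 8

# Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128,
# a 131,072-token context, batch size 1.
fp16_bytes = kv_cache_bytes(80, 8, 128, 131072, 1, 16)  # 40.0 GiB
q3_bytes = kv_cache_bytes(80, 8, 128, 131072, 1, 3)     #  7.5 GiB
print(fp16_bytes / 2**30, q3_bytes / 2**30, fp16_bytes / q3_bytes)
```

The raw bit ratio here is 16/3 ≈ 5.3×; reported end-to-end reductions depend on how per-block scales and other metadata are counted against each format.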

The internet was quick to draw comparisons to the fictional "middle-out" compression algorithm from HBO's Silicon Valley, with the Pied Piper references spreading rapidly across social media and developer forums. But unlike its fictional counterpart, TurboQuant arrives with rigorous benchmarking and the backing of one of the world's largest AI research organizations. If adoption proves as straightforward as Google's data-oblivious design promises, the implications extend well beyond cost savings — democratizing access to frontier-scale AI capabilities for organizations that were previously priced out of the infrastructure required to run them.

The broader significance may take months to fully materialize. Chip manufacturers will need to determine whether efficiency gains of this magnitude translate into reduced orders or whether the insatiable growth in AI workloads simply absorbs the savings. For now, TurboQuant stands as a reminder that in the race to scale artificial intelligence, software innovation can be every bit as disruptive as hardware advances — and sometimes more so.