A few days ago I wrote about KV cache as AI consciousness, comparing it to human working memory. Google recently published a paper at ICLR 2026 called TurboQuant (arXiv 2504.19874) that figured out how to compress this “consciousness” to one-sixth its size while keeping it functionally identical to the original.
That turned out to be more interesting than it sounds.
Start with the analogy
Imagine AI reading a very long book. As it gets through each page, it jots notes in a notebook — what happened, which details might matter later. That notebook is the KV cache.
Problem: the longer the book, the more notes, and eventually the notebook is full. The old approach was to move some notes into a backpack (CPU RAM), and when that fills up, into a storage locker (SSD). But fetching old notes means running back to check, which is slow.
TurboQuant takes a different approach entirely. It teaches AI a better note-taking system.
Original notes: “Page 3: Tom wore a blue shirt, rode a red bicycle, left home at 3:42 PM, headed east toward the park.”
Compressed notes: “P3: Tom → park, bike.”
One-sixth the space. But when you ask “where did Tom go?” the answer is exactly right.
How it works technically
TurboQuant has three steps.
Step 1: Roll the vector into a ball
KV cache stores high-dimensional floating-point vectors (FP16/32). Each vector represents the model’s “understanding” of some text. The problem is that dimensions are distributed unevenly — some have huge values, others tiny. Compressing directly means errors pile up in the extreme dimensions.
TurboQuant multiplies each vector by a random orthogonal matrix. The effect is like taking an irregularly shaped object and rolling it into a sphere before compressing — pressure distributes evenly. Mathematically, each dimension converges toward N(0, 1/d) after rotation, making quantization errors manageable.
Step 2: Replace coordinates with polar codes
This is the main compression step. TurboQuant converts the rotated vector from Cartesian to polar coordinates — pair up every two dimensions into (radius, angle), then recursively pair the radii, until you’re left with a single radius and a bunch of angle values.
Each angle gets independently quantized using Max-Lloyd optimal quantization. Think spy codebook — “blue shirt” becomes a code number, look it up to decode. This step uses b-1 bits and captures the vector’s main information.
Step 3: Correct the bias with 1 remaining bit
Step 2 introduces quantization residuals. TurboQuant uses a method called QJL to project these residuals into a {-1, +1} sign space, costing just 1 bit. The math proves this correction is unbiased — it cancels the inner product bias from step 2.
Like spending one minute checking your exam answers at the end. Minimal time, catches the critical errors.
The final attention score becomes: the approximation from step 2 + the residual correction term. Combined error has mathematical guarantees — MSE distortion stays within 2.7x of the information-theoretic lower bound. At 3 bits, MSE is just 0.03.
The actual numbers
Straight to the numbers. At 3.5 bits per dimension, LongBench scores 50.06 — identical to FP16 (no compression at all). Drop to 2.5 bits and you only lose 1.2%. Needle-in-Haystack under 4x compression hits 0.997 recall. Basically zero degradation. Attention computation is 8x faster on H100.
But the killer number is quantization time: 0.0007 seconds. Product Quantization takes 37 to 494 seconds. Five orders of magnitude difference. This means you can compress KV cache in real time as each token is generated, with zero impact on inference latency.
What makes this approach clever
Two words: data-oblivious.
Most compression methods need calibration data to figure out how to quantize. TurboQuant needs none — it uses random matrices and works out of the box in any scenario. No task-specific tuning, no calibration set, no offline preprocessing. Model goes live, flip the switch, compressed. And since it works online, it compresses token by token, fitting right into streaming inference.
What compressing “consciousness” tells us
Back to the framework from my previous post. If KV cache is AI’s working memory — the “comprehension state” left after text has been digested by the model — then what TurboQuant proves is worth sitting with for a moment.
It proves this “comprehension state” carries enormous redundancy. FP16 KV cache is “high-definition consciousness.” Compressed to 3 bits, it becomes “low-resolution consciousness.” But the two are functionally equivalent. Most of the precision the model uses to “understand” text is wasted. The actual semantic signal density is far below what FP16 can represent.
This parallels how human memory works. When you remember something, your brain doesn’t store a high-fidelity copy of raw sensory input. It stores a heavily compressed semantic summary. You remember “had coffee with a friend yesterday, talked about AI” but not every pixel on the coffee cup. TurboQuant reveals something similar about transformers: most precision in the attention mechanism goes to redundancy. The real semantic signal is sparse.
Market reactions and Jevons Paradox
Some people see “KV cache compressed 6x” and immediately assume GPU demand is about to collapse. History says otherwise. Jevons Paradox.
When steam engines got more efficient, coal consumption didn’t drop. It went up, because better efficiency unlocked new use cases. Compress KV cache 6x, and the same hardware can run a 6x longer context window. Developers will immediately fill that space — longer conversations, more documents, more complex agent tasks. Net HBM demand might actually increase.
Make the notebook smaller, and AI just reads longer books.
Paper: TurboQuant (arXiv 2504.19874), ICLR 2026.