

The era of local AI is undergoing a radical transformation. For a long time, the dream of running powerful Large Language Models (LLMs) on your own machine was dampened by a frustrating reality: the "blinking cursor" syndrome. You had the privacy, you had the ownership, but you lacked the speed.
That is officially changing. Google has unveiled a breakthrough optimization for its Gemma 4 family of open models that increases inference speeds by up to 3x, requiring absolutely no new hardware. Through the release of Multi-Token Prediction (MTP) drafters, Google is proving that the secret to the AI revolution isn’t just about bigger GPUs—it’s about smarter software.
To understand why this is a big deal, we have to look at how AI normally "thinks." Standard LLMs generate text through a process called autoregressive generation. Essentially, the model looks at the previous words and predicts exactly one "token" (a word fragment) at a time.
This process is incredibly resource-intensive. Every time the model generates a single token, the hardware has to move billions of parameters from memory to the compute units. On consumer-grade laptops or desktop GPUs, memory bandwidth becomes the bottleneck: the processor spends more time waiting for weights to arrive than doing the actual math. The result is a slow, stuttering output that makes real-time interaction feel clunky and impractical.
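A rough back-of-the-envelope sketch shows why moving the weights dominates. It assumes every generated token streams all of the model's weights from memory once, and the parameter count, weight precision, and bandwidth figure are hypothetical values chosen for round numbers, not measurements of any Gemma model or specific device.

```python
def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token must read all weights once.

    Simplifying assumption: generation is limited purely by memory
    bandwidth, ignoring compute time, caches, and the KV cache.
    """
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical example: 8 billion parameters stored as 16-bit weights
# (2 bytes each) on a device with 400 GB/s of memory bandwidth.
estimate = max_tokens_per_second(n_params=8e9, bytes_per_param=2,
                                 bandwidth_gb_s=400)
print(f"at most about {estimate:.0f} tokens per second")
```

Under these assumptions the ceiling is set by bandwidth alone, which is why a technique that produces several tokens per pass over the weights can help without any new hardware.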
Google’s new solution utilizes a technique known as speculative decoding. While the concept has been discussed in research circles since 2022, Google has finally refined it for the mainstream with the Gemma 4 release.
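Instead of the massive, high-intelligence model doing all the heavy lifting for every single token, it is paired with a tiny, lightweight "drafter" model. The drafter quickly proposes several tokens ahead, and the main model then checks that whole batch in a single pass, keeping the tokens it agrees with and correcting the first one it does not. Think of the drafter as a fast-thinking assistant and the main model as the expert editor.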
The beauty of this system is that output quality is not sacrificed. Because the main model still verifies every single token, the reasoning ability and output quality remain identical to the original model. You get the intelligence of a massive model at a speed much closer to that of a small one.
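The drafter-and-editor idea can be sketched in a few lines of Python. This is a toy illustration of greedy speculative decoding, not Google's MTP implementation: `target_next` and `drafter_next` are hypothetical stand-ins for the large model and the drafter, and the verification loop simulates what a real model would do in one batched forward pass.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (context[-1] * 3 + len(context)) % 11


def drafter_next(context: List[int]) -> int:
    """Toy stand-in for the drafter: agrees with the target most of the time."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 11


def autoregressive(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the large model produces one token per pass."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens cheaply, then let the target verify them together."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter_next(out + draft))
        # 2. The target checks the draft. A real model would score all
        #    positions in one forward pass; this loop only simulates that.
        target_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            accepted.append(expected)  # the target's own token is always kept
            if tok != expected:
                break  # first mismatch: the rest of the draft is discarded
        else:
            # Every drafted token matched, so the same pass also yields
            # one extra token from the target.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast, passes = speculative(prompt, n_new=20)
    print("identical output:", fast == autoregressive(prompt, n_new=20))
    print("target passes used:", passes, "instead of 20")
```

Because every kept token is the one the target itself would have chosen, the result matches plain token-by-token decoding exactly; the saving comes only from needing fewer passes of the large model.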
The performance metrics provided by Google are significant for anyone running AI locally. For those using an Nvidia RTX Pro 6000, a Gemma 4 26B model can see nearly double the tokens per second. On Apple Silicon (MacBooks), users can see speedups of up to 2.2x. In ideal scenarios, this optimization reaches a 3x ceiling.
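One way to see why the gain varies by hardware and workload is a simple model often used when reasoning about speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the drafter. The acceptance rates shown are hypothetical, not figures reported by Google.

```python
def expected_tokens_per_pass(acceptance: float, k: int) -> float:
    """Expected tokens produced per pass of the large model.

    Assumes each of the k drafted tokens is accepted independently with
    probability `acceptance`. The large model always contributes one
    token itself (a correction or a bonus token), hence the i = 0 term.
    """
    return sum(acceptance ** i for i in range(k + 1))


# Hypothetical acceptance rates with 4 drafted tokens per pass.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, k=4)
    print(f"acceptance {rate:.1f}: about {tokens:.2f} tokens per pass")
```

Under this model, predictable text that the drafter guesses well yields more tokens per pass, while harder text pulls the figure back toward one token per pass, the same as ordinary decoding.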
This isn't just a win for desktop users. Google has also optimized these MTP drafters for "edge" devices, such as smartphones and Raspberry Pi units, using efficient clustering techniques to ensure that even the smallest devices feel snappier.
This move by Google follows a growing trend in the industry: efficiency is the new frontier. Earlier this year, models like DeepSeek proved that you could achieve world-class intelligence without spending billions on the latest Nvidia H100 chips.
By releasing these drafters under an Apache 2.0 license, Google is democratizing high-speed AI. They aren't locking these speed boosts behind a subscription or a cloud paywall; they are giving them to the community to run on the hardware people already own.
What does a 3x speedup actually mean for you? It changes the "vibe" of AI from a tool you wait for into a tool you collaborate with.
Google’s MTP drafters are currently available on platforms like Hugging Face, Kaggle, and Ollama. They are compatible with popular tools like vLLM and MLX, making it easier than ever for developers and enthusiasts to "supercharge" their local setups today.
To learn more about the technical details of this release, you can read the full report on Decrypt:
👉 Google Found a Way to Make Local AI Up to 3x Faster—No New Hardware Required
Disclaimer: This article is provided for informational purposes only. It may contain mistakes, and it is not offered or intended to be used as legal, tax, investment, financial, or any other advice.
