🔬 Research · confirmed

Mobile AI is getting faster by sharing its memory

Google Research Blog · 2d · 2 min read

Running large language models on a phone feels like trying to run a marathon while breathing through a straw. You're constantly fighting against tight RAM limits and a ticking battery clock.

To speed things up, the standard move has been "speculative decoding": a tiny, separate model guesses the next few tokens, and the big model checks those guesses in one pass instead of generating each token itself. But these little drafters come with a hidden cost. They compete for precious memory and, because they run separately, they're essentially guessing in the dark, with no view into what the main model is actually processing.
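To make the draft-and-verify loop concrete, here is a minimal sketch of greedy speculative decoding. It is not Google's implementation: the "models" are toy functions that map a token prefix to the next token, and the names `speculative_step`, `generate`, `toy_target`, and `toy_drafter` are made up for illustration. A real system verifies all drafted positions in one batched forward pass; this sketch loops for clarity.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, drafter: Model, prefix: List[int], k: int = 4) -> List[int]:
    """Run one draft-and-verify round and return the tokens accepted in it."""
    # 1. The small drafter proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(prefix + draft))

    # 2. The big model checks each drafted token in order.
    accepted: List[int] = []
    for tok in draft:
        expected = target(prefix + accepted)
        if tok != expected:
            # First miss: keep the big model's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every guess matched, so the verification pass yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: Model, drafter: Model, prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(target, drafter, out, k)
    return out[: len(prompt) + n_tokens]


# Toy models (hypothetical): the target counts upward mod 10; the drafter is wrong after a 3.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10
```

The output always matches what the big model would have produced alone. The drafter only changes how many tokens each round yields, which is why a better-informed drafter translates directly into speed.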

Google is bypassing this "drafter tax" by retrofitting Multi-Token Prediction (MTP) onto frozen production models like Gemini Nano v3.

Instead of bringing in a whole new model to do the drafting, they're attaching a lightweight "head" to the very end of the existing model. This head doesn't need its own separate memory; it uses a "zero-copy" architecture to tap directly into the main model's existing cache.
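The "zero-copy" idea can be illustrated with a small sketch. The post doesn't describe the actual data layout, so everything here is an assumption: `MainModel` and `DraftHead` are hypothetical stand-ins, and the cache is a plain float buffer rather than real attention keys and values. The point is only that the head holds a view of the main model's buffer instead of allocating its own.

```python
from array import array


class MainModel:
    """Stand-in for the frozen model: owns the cache and writes one entry per token."""

    def __init__(self, capacity: int) -> None:
        self.cache = array("f", [0.0] * capacity)  # preallocated cache buffer
        self.length = 0  # number of entries filled so far

    def step(self, token: int) -> None:
        # Fake cache entry; a real model would store attention keys and values.
        self.cache[self.length] = token * 0.5
        self.length += 1


class DraftHead:
    """Hypothetical drafting head that reads the main cache through a view."""

    def __init__(self, model: MainModel) -> None:
        self.model = model
        # memoryview shares the underlying buffer, so no bytes are duplicated.
        self.view = memoryview(model.cache)

    def read_context(self) -> list:
        # Sees whatever the main model has written so far, without copying the cache.
        return self.view[: self.model.length].tolist()
```

A separate drafter model would instead keep its own cache and recompute its own view of the context, which is where the extra memory goes.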

By leaving the main model's weights frozen, Google keeps the original intelligence and safety intact while shaving off about 130 MB of RAM usage. On Pixel 9 and 10 devices, this tweak is delivering speedups of 50% or more for everyday tasks like summarizing notifications and proofreading text.

Takeaway

The work shows how to squeeze large efficiency gains out of models that are already deployed, without the cost or risk of retraining them from scratch.

Sources

Google Research Blog