According to Beating, Google deployed a Multi-Token Prediction (MTP) architecture on Pixel 9 and Pixel 10 devices to accelerate the on-device Gemini Nano v3 model. The new architecture increased inference speed by more than 50% while preserving the model's safety alignment and output quality.
MTP's zero-copy mechanism lets the prediction head reuse the main model's cached features directly through cross-attention, eliminating the separate key-value cache that a traditional draft model requires. This design saved approximately 130 MB of memory and reduced startup latency. In real-world applications such as notification summarization and smart replies, MTP achieved a 55% increase in token acceptance rate, which reduced how often the processor has to wake up and lowered system power consumption.
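To make the idea of a token acceptance rate concrete, here is a minimal toy sketch of a draft-and-verify decoding loop in Python. It is not Google's implementation: the "main model" and "prediction head" are simple arithmetic functions, and every name (`FeatureCache`, `next_token`, `draft_next`, `generate_speculative`) is hypothetical. It assumes greedy verification, which the report does not confirm. The sketch illustrates two points from the text: the draft head reads the main model's cache in place instead of keeping its own, and each accepted draft token saves one main-model pass without changing the output.

```python
from dataclasses import dataclass, field
from itertools import chain


@dataclass
class FeatureCache:
    """Hypothetical stand-in for the main model's cached features."""
    features: list = field(default_factory=list)


def next_token(context):
    """Toy deterministic 'main model': maps a context to its next token."""
    total, length = 0, 0
    for tok in context:
        total += tok
        length += 1
    return (total * 31 + length) % 50


def draft_next(cache, drafted):
    """Toy 'prediction head' that owns no cache of its own.

    It iterates over the shared cache in place (no copy) plus its own
    drafts so far. To simulate an imperfect draft, it deliberately guesses
    wrong whenever the context length is a multiple of 4.
    """
    length = len(cache.features) + len(drafted)
    guess = next_token(chain(cache.features, drafted))
    return guess if length % 4 else (guess + 1) % 50


def generate_baseline(prompt, n_new):
    """Plain decoding: one main-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def generate_speculative(prompt, n_new, k=3):
    """Draft k tokens, then verify them in one (simulated) main-model pass."""
    cache = FeatureCache(list(prompt))
    target_len = len(prompt) + n_new
    passes = 0
    while len(cache.features) < target_len:
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(cache, drafted))
        passes += 1  # a single main-model pass checks all k drafts
        accepted = []
        for tok in drafted:
            expected = next_token(chain(cache.features, accepted))
            accepted.append(expected)  # a match, or the main model's correction
            if expected != tok:
                break  # discard the remaining drafts after the first mismatch
        else:
            # All drafts matched: the same pass also yields one extra token.
            accepted.append(next_token(chain(cache.features, accepted)))
        cache.features.extend(accepted)
    return cache.features[len(prompt):target_len], passes


if __name__ == "__main__":
    tokens, passes = generate_speculative([1, 2, 3], n_new=20)
    print("same output as baseline:", tokens == generate_baseline([1, 2, 3], 20))
    print("main-model passes:", passes, "for", len(tokens), "tokens")
```

Because every emitted token is either confirmed or corrected by the main model, the speculative output matches plain decoding exactly. The higher the share of drafts that are accepted, the fewer main-model passes are needed per generated token.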