Xiaomi Cuts MiMo Cache Costs 80% Via Hierarchical Cache, Prefill Matches 10-Layer GQA Model

According to Beating, Xiaomi has disclosed the core optimization techniques behind its MiMo-V2.5 API, following recent price cuts that brought it in line with DeepSeek. The company's inference engine stays profitable under high load through a hybrid attention architecture and hierarchical KV cache optimization.

Xiaomi's inference framework cut cache costs by 80% by applying hierarchical optimization to the KV cache of its sliding window attention (SWA) layers, which raises token capacity 5x. The 70-layer MiMo-V2.5-Pro model interleaves global attention (GA) and SWA layers at a sparse 1:7 ratio, so its prefill computation is equivalent to that of a traditional 10-layer model using only global GQA, significantly lowering inference costs.
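The two headline figures can be checked with simple arithmetic. The sketch below is illustrative only and is not Xiaomi's accounting. It assumes a fixed cache budget, reads the 1:7 ratio as one GA layer per eight layers, and counts only attention query-key pairs. All function names, the sequence length, and the window size are hypothetical.

```python
from fractions import Fraction


def capacity_multiplier(cost_reduction: Fraction) -> Fraction:
    """Tokens that fit in a fixed cache budget once per-token cost drops."""
    return 1 / (1 - cost_reduction)


def causal_pairs(seq_len: int) -> int:
    """Query-key pairs scored by one causal global-attention layer."""
    return seq_len * (seq_len + 1) // 2


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Pairs scored by one causal SWA layer.

    Assumes each token attends to itself and the previous window - 1 tokens.
    """
    if seq_len <= window:
        return causal_pairs(seq_len)
    return causal_pairs(window) + (seq_len - window) * window


def equivalent_global_layers(
    total_layers: int, ga_share: Fraction, seq_len: int, window: int
) -> Fraction:
    """Attention work expressed in units of one global-attention layer."""
    ga_layers = total_layers * ga_share
    swa_layers = total_layers - ga_layers
    swa_weight = Fraction(
        sliding_window_pairs(seq_len, window), causal_pairs(seq_len)
    )
    return ga_layers + swa_layers * swa_weight


if __name__ == "__main__":
    # An 80% cost reduction leaves 1/5 of the cost per token, hence 5x capacity.
    print(capacity_multiplier(Fraction(80, 100)))

    # Hypothetical sequence length and window size, not from the source.
    estimate = equivalent_global_layers(
        70, Fraction(1, 8), seq_len=131072, window=128
    )
    print(float(estimate))
```

Under these assumptions, 70 layers at a 1:7 ratio give 8.75 GA layers as a floor, and the SWA layers add a small amount that shrinks as the sequence grows relative to the window. The source does not say how it arrives at the 10-layer figure, so this estimate only shows that it is the right order of magnitude.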
