According to Beating, Xiaomi has disclosed the core optimization techniques behind its MiMo-V2.5 API, following recent price cuts that aligned its pricing with DeepSeek's. The company's high-load inference engine stays profitable thanks to a hybrid attention architecture and hierarchical KV cache optimization.
Xiaomi's inference framework cut cache costs by 80% by applying hierarchical optimization to its sliding window attention (SWA) layers, which raises token capacity by 5x. The 70-layer MiMo-V2.5-Pro model uses a 1:7 sparse ratio between global attention (GA) and SWA layers, so its prefill computation is equivalent to that of a traditional 10-layer global GQA (grouped-query attention) model, significantly lowering inference costs.
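The arithmetic behind these figures can be sketched in a few lines of Python. This is a back-of-the-envelope model, not Xiaomi's implementation: the function names, the window size, the sequence length, and the exact GA/SWA layer split are all assumptions for illustration, and only attention score pairs are counted (projection and feed-forward costs are ignored). It shows why an 80% cache reduction corresponds to 5x capacity, and one way to read "equivalent to N global layers" for a hybrid stack.

```python
def capacity_multiplier(cost_reduction: float) -> float:
    """Tokens that fit in the same cache budget after a fractional cost cut."""
    return 1.0 / (1.0 - cost_reduction)


def kv_cache_slots(seq_len: int, ga_layers: int, swa_layers: int, window: int) -> int:
    """Cached token slots summed over layers.

    A GA layer keeps every token; an SWA layer keeps at most `window` tokens.
    """
    return ga_layers * seq_len + swa_layers * min(seq_len, window)


def causal_pairs_global(seq_len: int) -> int:
    """Query-key pairs scored by one causal global-attention layer in prefill."""
    return seq_len * (seq_len + 1) // 2


def causal_pairs_window(seq_len: int, window: int) -> int:
    """Query-key pairs scored by one causal SWA layer in prefill.

    The query at position i attends to min(i + 1, window) keys.
    """
    w = min(seq_len, window)
    return w * (w + 1) // 2 + (seq_len - w) * w


def equivalent_global_layers(
    seq_len: int, ga_layers: int, swa_layers: int, window: int
) -> float:
    """Number of global layers with the same attention-score work in prefill."""
    ratio = causal_pairs_window(seq_len, window) / causal_pairs_global(seq_len)
    return ga_layers + swa_layers * ratio


if __name__ == "__main__":
    # Hypothetical values: the article gives neither the window size nor the
    # exact layer split, so these numbers are placeholders only.
    SEQ_LEN, WINDOW = 32_768, 128
    GA_LAYERS, SWA_LAYERS = 9, 61  # roughly 1:7 across 70 layers (assumed)

    print("capacity multiplier:", capacity_multiplier(0.80))
    print("cache slots (hybrid):", kv_cache_slots(SEQ_LEN, GA_LAYERS, SWA_LAYERS, WINDOW))
    print("cache slots (all GA):", kv_cache_slots(SEQ_LEN, GA_LAYERS + SWA_LAYERS, 0, WINDOW))
    print("equivalent GA layers:", equivalent_global_layers(SEQ_LEN, GA_LAYERS, SWA_LAYERS, WINDOW))
```

Under this model, the SWA layers contribute less and less relative work as the sequence grows, so the equivalent layer count approaches the number of GA layers for long prompts. The specific "10-layer" figure in the report depends on details the article does not state.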