
DeepSeek-R1's MLA architecture: a new breakthrough in large model migration

Author: LoRA | Date: 07 Mar 2025

In the field of artificial intelligence, the launch of DeepSeek-R1 has attracted widespread attention, and the innovation represents a disruptive advance for the AI industry. Its Multi-head Latent Attention (MLA) architecture uses low-rank compression to sharply reduce training and inference costs, reportedly to as little as one-tenth of those of large models with comparable performance. Building on this, Ji Tao, a postdoctoral fellow at Fudan University's NLP Laboratory, and his team carried out the work described here, with the goal of letting any pre-trained large language model migrate quickly to the MLA architecture without retraining from scratch.

Currently, mainstream large models are generally built on the standard multi-head attention mechanism (MHA) and its variants, which incur significantly higher inference costs than MLA. The research team therefore proposed the MHA2MLA framework, which migrates MHA/GQA architectures to MLA through two key steps: partial RoPE retention and a low-rank approximation of the joint key-value representation.
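
To make the inference-cost gap concrete, the sketch below compares the per-token KV-cache footprint of MHA, GQA, and an MLA-style latent cache. The layer count and head sizes are Llama2-7B-like, while the GQA group count and the MLA latent dimension are assumed illustrative values, not figures from the paper.

```python
# Illustrative comparison of per-token KV-cache size for MHA, GQA, and MLA.
# Dimensions are Llama2-7B-like; n_kv_heads_gqa and kv_latent_dim are
# assumptions for illustration only.

n_layers = 32
n_heads = 32
head_dim = 128           # hidden_size / n_heads = 4096 / 32
n_kv_heads_gqa = 8       # typical grouped-query setting (assumed)
kv_latent_dim = 512      # compressed MLA latent dimension (assumed)

# Elements cached per token (keys + values), summed over layers.
# The MLA figure is simplified: it ignores the small decoupled RoPE key dims.
mha_cache = n_layers * 2 * n_heads * head_dim
gqa_cache = n_layers * 2 * n_kv_heads_gqa * head_dim
mla_cache = n_layers * kv_latent_dim

print(f"MHA: {mha_cache} elements/token")
print(f"GQA: {gqa_cache} elements/token")
print(f"MLA: {mla_cache} elements/token "
      f"(~{mha_cache / mla_cache:.0f}x smaller than MHA)")
```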


In implementing MHA2MLA, the team first applied a partial-RoPE fine-tuning strategy to decouple positional encoding from the bulk of the dimensions, retaining only a small number of position-related dimensions and thereby resolving the conflict between MLA and RoPE. Next, they used singular value decomposition (SVD) to build a low-rank approximation of the key-value vectors, preserving as much pre-trained knowledge as possible while substantially shrinking the cache. Experiments show that fine-tuning on only 0.3% to 0.6% of the pre-training data is enough to largely recover the performance lost during migration.
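
A minimal sketch of the SVD-based low-rank step is shown below, assuming a single layer's pre-trained key and value projection weights. The rank `r`, the random stand-in weights, and the joint factorization of the stacked key-value projection are illustrative assumptions; the team's exact procedure may differ in its details.

```python
import torch

# Sketch: jointly factorize a layer's key/value projections into a shared
# down-projection (whose output would be cached) and an up-projection.
d_model, r = 4096, 512
W_k = torch.randn(d_model, d_model) / d_model**0.5   # stand-in for pretrained W_k
W_v = torch.randn(d_model, d_model) / d_model**0.5   # stand-in for pretrained W_v

# Stack key and value projections and take a truncated SVD: [W_k | W_v] ≈ U_r S_r V_r^T.
W_kv = torch.cat([W_k, W_v], dim=1)                  # (d_model, 2*d_model)
U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)

# Down-projection maps hidden states to an r-dim latent that would be cached;
# the up-projection reconstructs keys and values from that latent.
W_down = U[:, :r] * S[:r]                            # (d_model, r)
W_up = Vh[:r, :]                                     # (r, 2*d_model)

# Relative reconstruction error of the rank-r approximation.
err = torch.linalg.norm(W_down @ W_up - W_kv) / torch.linalg.norm(W_kv)
print(f"relative reconstruction error at rank {r}: {err:.3f}")
```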

When combined with other efficient inference techniques, such as 4-bit KV cache quantization, the KV cache of the Llama2-7B model was reduced by 92.19% with a performance loss of only 0.5%. This demonstrates the strong compatibility of the MHA2MLA framework with compression techniques while preserving the model's reasoning and long-context abilities, offering a practical new path for deploying resource-efficient large language models.
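
For context, the sketch below shows one common form of 4-bit KV-cache quantization (per-group symmetric quantization with per-group scales). The group size and scheme are assumptions for illustration, not necessarily the specific method used in the experiments.

```python
import torch

def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    """Quantize the flattened tensor in groups to the int4 range [-8, 7],
    returning int8 codes plus per-group scales (bit-packing omitted for clarity)."""
    groups = x.reshape(-1, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_4bit(q: torch.Tensor, scale: torch.Tensor, shape):
    """Recover an approximate float tensor from codes and scales."""
    return (q.float() * scale).reshape(shape)

# Toy cached latent: (layers, tokens, latent dim) -- shapes are illustrative.
kv_cache = torch.randn(2, 1024, 512)
q, scale = quantize_4bit(kv_cache)
restored = dequantize_4bit(q, scale, kv_cache.shape)
print("mean abs quantization error:", (restored - kv_cache).abs().mean().item())
```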

However, the research team also noted that the experiments were constrained by hardware and have not yet covered models such as Llama3, which require 128K long-context fine-tuning. Future work will focus on extending the approach to more model architectures and combining it with parameter-efficient fine-tuning strategies to further reduce the number of parameters updated during migration.