
DeepSeek open-sources FlashMLA: a major boost to large-model inference performance

Author: LoRA | 24 Feb 2025

On the first day of its Open Source Week, DeepSeek officially released its latest technical achievement, FlashMLA: an efficient Multi-head Latent Attention (MLA) decoding kernel designed for NVIDIA Hopper-architecture GPUs. The kernel is optimized for variable-length sequences and can significantly improve large-model inference performance.


FlashMLA's core technical features include full support for BF16 precision and a paged KV cache with a block size of 64 for finer-grained memory management. In terms of performance, on CUDA 12.6 with an H800 SXM5 GPU, FlashMLA reaches up to 3000 GB/s of memory bandwidth in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations.
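To illustrate what a paged KV cache with a block size of 64 means in practice, here is a minimal, self-contained sketch; this is not FlashMLA code, and all names and sizes other than the block size are hypothetical. Each sequence's key/value history is stored in fixed-size 64-token blocks drawn from a shared pool, and a per-sequence block table maps logical token positions to physical blocks, so variable-length sequences can be batched without padding every request to a common maximum length.

```python
import torch

BLOCK_SIZE = 64          # tokens per KV-cache block, matching FlashMLA's stated block size
NUM_BLOCKS = 1024        # size of the shared block pool (illustrative)
HEAD_DIM = 576           # per-token KV width (illustrative)

# Shared pool of KV blocks: [num_blocks, block_size, head_dim], stored in BF16.
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM, dtype=torch.bfloat16)
free_blocks = list(range(NUM_BLOCKS))

def append_token(block_table: list, seq_len: int, kv: torch.Tensor) -> int:
    """Append one token's KV vector for a sequence, allocating a new block when needed."""
    if seq_len % BLOCK_SIZE == 0:              # current block is full (or sequence is empty)
        block_table.append(free_blocks.pop())
    block = block_table[seq_len // BLOCK_SIZE]
    kv_pool[block, seq_len % BLOCK_SIZE] = kv
    return seq_len + 1

# Two variable-length sequences share the same pool without padding.
table_a, len_a = [], 0
table_b, len_b = [], 0
for _ in range(130):                           # sequence A grows to 130 tokens -> 3 blocks
    len_a = append_token(table_a, len_a, torch.randn(HEAD_DIM, dtype=torch.bfloat16))
for _ in range(70):                            # sequence B grows to 70 tokens -> 2 blocks
    len_b = append_token(table_b, len_b, torch.randn(HEAD_DIM, dtype=torch.bfloat16))
print(len(table_a), len(table_b))              # prints: 3 2
```

Allocating cache memory at this block granularity is what keeps decoding over many variable-length sequences memory-efficient, which is the scenario FlashMLA targets.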

The project has already been validated in production and shows strong stability. The development team says FlashMLA's design draws on prior work such as FlashAttention 2 and 3 and CUTLASS, with further innovations built on that foundation.

Developers can deploy FlashMLA quickly: run "python setup.py install" to install it, then run the test script "python tests/test_flash_mla.py" to benchmark its performance.
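After installation, a quick smoke check like the following can confirm the package imports and that a CUDA GPU is visible. This is only a sketch: the module name flash_mla and the two entry points imported below follow the repository README at the time of writing and should be treated as assumptions, not guaranteed interface.

```python
# Hypothetical post-install smoke check (module and function names assumed from the
# FlashMLA repository README; adjust if the actual interface differs).
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

print("FlashMLA imports OK:", get_mla_metadata.__name__, flash_mla_with_kvcache.__name__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # FlashMLA targets Hopper-architecture GPUs such as the H800 used in the reported benchmarks.
    print("GPU:", torch.cuda.get_device_name(0))
```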

Open-source repository: https://github.com/deepseek-ai/FlashMLA