
DeepSeek open-sources FlashMLA: a major boost to large-model inference performance

Author: LoRA | 24 Feb 2025

On the first day of its Open Source Week, DeepSeek officially released its latest technical achievement, FlashMLA: an efficient Multi-head Latent Attention (MLA) decoding kernel designed for NVIDIA Hopper-architecture GPUs. The kernel is optimized for variable-length sequences and can significantly improve large-model inference performance.


FlashMLA's core technical features include full support for BF16 precision and a paged KV cache with a block size of 64 for finer-grained memory management. In terms of performance, on CUDA 12.6 with an H800 SXM5 GPU, FlashMLA reaches up to 3000 GB/s of memory bandwidth in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations.
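To illustrate what a paged KV cache with a block size of 64 means in practice, here is a minimal, self-contained sketch; this is not FlashMLA code, and all names and sizes other than the block size are hypothetical. Each sequence's key/value history is stored in fixed-size 64-token blocks drawn from a shared pool, and a per-sequence block table maps logical token positions to physical blocks, so variable-length sequences can be batched without padding every request to a common maximum length.

```python
import torch

BLOCK_SIZE = 64          # tokens per KV-cache block, matching FlashMLA's stated block size
NUM_BLOCKS = 1024        # size of the shared block pool (illustrative)
HEAD_DIM = 576           # per-token KV width (illustrative)

# Shared pool of KV blocks: [num_blocks, block_size, head_dim], stored in BF16.
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM, dtype=torch.bfloat16)
free_blocks = list(range(NUM_BLOCKS))

def append_token(block_table: list, seq_len: int, kv: torch.Tensor) -> int:
    """Append one token's KV vector for a sequence, allocating a new block when needed."""
    if seq_len % BLOCK_SIZE == 0:              # current block is full (or sequence is empty)
        block_table.append(free_blocks.pop())
    block = block_table[seq_len // BLOCK_SIZE]
    kv_pool[block, seq_len % BLOCK_SIZE] = kv
    return seq_len + 1

# Two variable-length sequences share the same pool without padding.
table_a, len_a = [], 0
table_b, len_b = [], 0
for _ in range(130):                           # sequence A grows to 130 tokens -> 3 blocks
    len_a = append_token(table_a, len_a, torch.randn(HEAD_DIM, dtype=torch.bfloat16))
for _ in range(70):                            # sequence B grows to 70 tokens -> 2 blocks
    len_b = append_token(table_b, len_b, torch.randn(HEAD_DIM, dtype=torch.bfloat16))
print(len(table_a), len(table_b))              # prints: 3 2
```

Allocating cache memory at this block granularity is what keeps decoding over many variable-length sequences memory-efficient, which is the scenario FlashMLA targets.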

The project has already been validated in production and shows strong stability. The development team says FlashMLA's design draws on prior work such as FlashAttention 2 and 3 and CUTLASS, with further innovations built on that foundation.

Developers can deploy FlashMLA quickly: run "python setup.py install" to install it, then run the test script "python tests/test_flash_mla.py" to benchmark its performance.
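After installation, a quick smoke check like the following can confirm the package imports and that a CUDA GPU is visible. This is only a sketch: the module name flash_mla and the two entry points imported below follow the repository README at the time of writing and should be treated as assumptions, not guaranteed interface.

```python
# Hypothetical post-install smoke check (module and function names assumed from the
# FlashMLA repository README; adjust if the actual interface differs).
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

print("FlashMLA imports OK:", get_mla_metadata.__name__, flash_mla_with_kvcache.__name__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # FlashMLA targets Hopper-architecture GPUs such as the H800 used in the reported benchmarks.
    print("GPU:", torch.cuda.get_device_name(0))
```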

Open-source repository: https://github.com/deepseek-ai/FlashMLA