Nvidia launches FFN fusion technology to significantly improve the inference efficiency of large language models
Author: LoRA
Time: 31 Mar 2025
Recently, a research team from Nvidia, a leading company in the field of artificial intelligence chips, unveiled an innovative technique called "FFN Fusion". The technique targets the sequential-computation bottleneck in the Transformer architecture, aiming to significantly improve the inference efficiency of large language models (LLMs) and to support a broad range of high-performance AI applications.

In recent years, as LLMs have demonstrated powerful capabilities in natural language processing, scientific research, and conversational agents, their scale and complexity have grown rapidly. This growth drives up computational demands and degrades inference efficiency. Although the Transformer architecture is the foundation of LLMs, its alternating structure of attention layers and feed-forward networks (FFNs) forces the input to be processed sequentially. In large models this significantly increases compute and inter-GPU communication overhead, especially in scenarios that require generating many tokens quickly.

To address this problem, Nvidia researchers proposed a new optimization strategy: FFN Fusion. Its core idea is to merge consecutive, weakly interdependent FFN layers in the model into a single wider FFN module. The study found that once certain attention layers are removed, long runs of consecutive FFN layers often appear in an LLM, and the dependencies between these FFN layers are weak enough that they can be computed in parallel. By splicing the weights of multiple sequential FFNs into a single module that is evaluated in one parallel pass, FFN Fusion improves computational efficiency while preserving the representational capacity of the original FFNs.

To verify the practical effect of FFN Fusion, the research team applied it to Meta's Llama-3.1-405B-Instruct model and, through pruning and reconstruction, obtained a new model, Ultra-253B-Base. Experimental results show significant improvements in inference speed and resource utilization: at a batch size of 32, inference latency drops by a factor of 1.71 and per-token compute cost by a factor of 35. Although the parameter count fell from 405 billion to 253 billion, the model still performs strongly on several authoritative benchmarks, scoring 85.17% on MMLU, 72.25% on MMLU-Pro, 86.58% on HumanEval, 84.92% on Arena Hard, and 9.19 on MT-Bench. In addition, the memory footprint of Ultra-253B-Base is roughly halved, thanks to kv-cache optimization.

Further experiments show that FFN Fusion applies across model scales, delivering good results at 49 billion, 70 billion, and 253 billion parameters, which demonstrates the generality of the technique. This breakthrough not only improves LLM inference efficiency but also offers an important reference for designing future LLMs that are more parallel and better matched to hardware characteristics. The study shows that in-depth analysis and innovative restructuring of existing architectures can yield a large leap in efficiency without sacrificing model capability.
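The weight-splicing step can be illustrated with a short PyTorch sketch. The key identity is that concatenating the up-projection weights of several FFNs (and their down-projection weights along the hidden dimension) yields one wider FFN whose output equals the sum of the individual FFN outputs; with residual connections, x + FFN1(x) + FFN2(x) is then computed in a single parallel pass. This is a minimal sketch assuming simple non-gated, bias-free FFNs (Llama-style gated FFNs fuse analogously by also concatenating the gate projections); the FFN class and fuse_ffns helper are illustrative names, not from the paper.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """A plain Transformer feed-forward block: down(act(up(x)))."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def fuse_ffns(ffns):
    """Splice several FFNs into one wider FFN such that
    fused(x) == sum_i ffn_i(x)."""
    d_model = ffns[0].up.in_features
    d_ff_total = sum(f.up.out_features for f in ffns)
    fused = FFN(d_model, d_ff_total)
    with torch.no_grad():
        # Stack up-projection rows -> weight shape (d_ff_total, d_model).
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        # Stack down-projection columns -> weight shape (d_model, d_ff_total).
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused

# Sanity check: the fused module reproduces the sum of the individual FFNs
# exactly. The approximation error relative to the original *sequential*
# computation comes only from the weak inter-layer dependence, not from
# the fusion itself.
x = torch.randn(4, 512)
ffn1, ffn2 = FFN(512, 2048), FFN(512, 2048)
fused = fuse_ffns([ffn1, ffn2])
assert torch.allclose(ffn1(x) + ffn2(x), fused(x), atol=1e-5)
```

Note the design consequence: two sequential FFN layers become one matrix-multiply pipeline over a hidden dimension twice as wide, which GPUs execute far more efficiently than two dependent layers, and which removes one round of inter-GPU synchronization in tensor-parallel deployments.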
Although full parallelization of Transformer blocks still faces many challenges, the success of FFN Fusion undoubtedly opens a new path for future LLM optimization. The related paper has been published on the arXiv platform for researchers worldwide to review.