In deep learning, normalization layers are regarded as indispensable components of modern neural networks. Recently, a study led by Meta FAIR research scientist Zhuang Liu, "Transformers without Normalization", has attracted widespread attention. The work not only proposes a new technique called Dynamic Tanh (DyT), but also shows that the Transformer architecture can still be trained efficiently and run inference effectively without traditional normalization layers.
Normalization layers, especially Layer Normalization (LN), have played a crucial role in optimizing deep learning models over the past decade. An LN layer rescales and squashes its input activations, which accelerates model convergence. However, the researchers argue that the ubiquitous LN layer is not the only option. Their work began by observing the behavior of LN layers and proposes a new alternative, DyT. This element-wise operation mimics the scaling and squashing effect of an LN layer while avoiding the computation of activation statistics such as the mean and variance.
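To make the idea concrete, below is a minimal PyTorch-style sketch of such an element-wise replacement. The module name DyT, the learnable scalar alpha, the per-channel scale and shift, and the initialization values are assumptions made here for illustration, following the description above rather than quoting the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise, tanh-based stand-in for LayerNorm (illustrative sketch).

    Applies tanh(alpha * x) with a learnable scalar alpha, followed by a
    per-channel scale and shift, so no per-token mean/variance is computed.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash the activations element-wise, then apply the affine transform.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```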
In their experiments, the research team replaced the traditional normalization layers in multiple Transformer architectures with DyT, and the results showed that models using DyT train stably and achieve higher final performance. Notably, the new approach usually requires no hyperparameter re-tuning of the original architecture, which reduces the complexity of model training, as the drop-in sketch below suggests.
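As an illustration of what such a drop-in replacement could look like in code, the sketch below walks a model and swaps every nn.LayerNorm for the DyT module defined above. The helper name replace_layernorm_with_dyt is hypothetical, and real architectures may need a more careful traversal than this simple recursion.

```python
import torch.nn as nn

def replace_layernorm_with_dyt(model: nn.Module) -> nn.Module:
    """Recursively replace nn.LayerNorm submodules with DyT of the same width."""
    for name, child in model.named_children():
        if isinstance(child, nn.LayerNorm):
            dim = child.normalized_shape[-1]
            setattr(model, name, DyT(dim))
        else:
            replace_layernorm_with_dyt(child)
    return model

# Example: a small Transformer encoder layer with its two LN layers swapped out.
encoder = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = replace_layernorm_with_dyt(encoder)
```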
By analyzing the forward pass of three different Transformer models, the researchers found that early LN layers exhibit largely linear input-output relationships, whereas in deeper LN layers the mapping between input and output follows an S-shaped curve resembling the tanh function. This finding surprised the research team and provided strong empirical support for the effectiveness of DyT.
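One way to reproduce this kind of observation is to register forward hooks on a model's LayerNorm modules and compare their per-element inputs and outputs. The sketch below uses a randomly initialized encoder layer purely for illustration; in a trained deep model, scatter-plotting the collected pairs is the sort of analysis that would reveal the S-shaped, tanh-like curves described above.

```python
import torch
import torch.nn as nn

def collect_ln_io(model: nn.Module, x: torch.Tensor):
    """Run one forward pass and record (input, output) pairs for each LayerNorm."""
    records = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            records[name] = (inputs[0].detach().flatten(), output.detach().flatten())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return records

# Example: inspect the LN layers inside one (untrained) encoder layer.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
io_pairs = collect_ln_io(layer, torch.randn(4, 16, 256))
for name, (inp, out) in io_pairs.items():
    print(name, inp.shape, out.shape)  # scatter-plot inp vs. out to see the mapping
```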
Zhuang Liu said that this work helped him understand the role of normalization layers more deeply, and he expects DyT to open up new possibilities for reducing the cost of model training and inference. Going forward, DyT is expected to become an important candidate in efficiency-oriented network design and to drive further progress in deep learning.