In deep learning, normalization layers are regarded as indispensable components of modern neural networks. Recently, a study led by Meta FAIR research scientist Zhuang Liu, "Transformers without Normalization", has attracted widespread attention. The work not only proposes a new technique called Dynamic Tanh (DyT), but also shows that the Transformer architecture can still be trained efficiently and run inference effectively without traditional normalization layers.
Normalization layers, especially Layer Normalization (LN), have played a crucial role in optimizing deep learning models over the past decade. An LN layer rescales and squashes its input activations, which accelerates model convergence. However, the researchers argue that the ubiquitous LN layer is not the only option. Their work began by observing the behavior of LN layers and proposes a new alternative, DyT. This element-wise operation mimics the scaling and squashing effect of an LN layer while avoiding the computation of activation statistics such as the mean and variance.
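To make the idea concrete, below is a minimal PyTorch-style sketch of such an element-wise replacement. The module name DyT, the learnable scalar alpha, the per-channel scale and shift, and the initialization values are assumptions made here for illustration, following the description above rather than quoting the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise, tanh-based stand-in for LayerNorm (illustrative sketch).

    Applies tanh(alpha * x) with a learnable scalar alpha, followed by a
    per-channel scale and shift, so no per-token mean/variance is computed.
    """
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash the activations element-wise, then apply the affine transform.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```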
In their experiments, the research team replaced the traditional normalization layers in multiple Transformer architectures with DyT, and the results showed that models using DyT train stably and achieve higher final performance. Notably, the new approach usually requires no hyperparameter re-tuning of the original architecture, which reduces the complexity of model training, as the drop-in sketch below suggests.
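As an illustration of what such a drop-in replacement could look like in code, the sketch below walks a model and swaps every nn.LayerNorm for the DyT module defined above. The helper name replace_layernorm_with_dyt is hypothetical, and real architectures may need a more careful traversal than this simple recursion.

```python
import torch.nn as nn

def replace_layernorm_with_dyt(model: nn.Module) -> nn.Module:
    """Recursively replace nn.LayerNorm submodules with DyT of the same width."""
    for name, child in model.named_children():
        if isinstance(child, nn.LayerNorm):
            dim = child.normalized_shape[-1]
            setattr(model, name, DyT(dim))
        else:
            replace_layernorm_with_dyt(child)
    return model

# Example: a small Transformer encoder layer with its two LN layers swapped out.
encoder = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = replace_layernorm_with_dyt(encoder)
```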
By analyzing the forward pass of three different Transformer models, the researchers found that early LN layers exhibit largely linear input-output relationships, whereas in deeper LN layers the mapping between input and output follows an S-shaped curve resembling the tanh function. This finding surprised the research team and provided strong empirical support for the effectiveness of DyT.
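One way to reproduce this kind of observation is to register forward hooks on a model's LayerNorm modules and compare their per-element inputs and outputs. The sketch below uses a randomly initialized encoder layer purely for illustration; in a trained deep model, scatter-plotting the collected pairs is the sort of analysis that would reveal the S-shaped, tanh-like curves described above.

```python
import torch
import torch.nn as nn

def collect_ln_io(model: nn.Module, x: torch.Tensor):
    """Run one forward pass and record (input, output) pairs for each LayerNorm."""
    records = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            records[name] = (inputs[0].detach().flatten(), output.detach().flatten())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return records

# Example: inspect the LN layers inside one (untrained) encoder layer.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
io_pairs = collect_ln_io(layer, torch.randn(4, 16, 256))
for name, (inp, out) in io_pairs.items():
    print(name, inp.shape, out.shape)  # scatter-plot inp vs. out to see the mapping
```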
Zhuang Liu said that this work helped him understand the role of normalization layers more deeply, and he expects DyT to open up new possibilities for reducing the cost of model training and inference. Going forward, DyT is expected to become an important candidate in efficiency-oriented network design and to drive further progress in deep learning.