On March 6, Mobvoi, together with leading academic institutions including the Hong Kong University of Science and Technology, Shanghai Jiao Tong University, Nanyang Technological University, and Northwestern Polytechnical University, open-sourced the new-generation speech generation model Spark-TTS and launched its commercial high-quality TTS engine, TicVoice 7.0. As Mobvoi's seventh-generation TTS engine, TicVoice 7.0 marks a major advance in speech generation and introduces a new paradigm for it.
The core advantage of TicVoice 7.0 lies in its speech encoding method and modeling structure. The engine uses BiCodec encoding to split speech into two complementary parts: Global Tokens, which have a fixed sequence length, and Semantic Tokens, which have a low bitrate. Global Tokens model global features that do not vary over time, such as timbre, which keeps speech generation controllable at the global level. Semantic Tokens take features extracted by wav2vec 2.0 as input and encode the information most closely tied to the text, which keeps the output strongly aligned with its semantic content. This design addresses shortcomings of traditional speech coding and brings speech token modeling in line with text token modeling, making speech generation more efficient and controllable.
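The two-part encoding described above can be pictured with a small sketch. It is illustrative only and is not the BiCodec or TicVoice implementation: the number of Global Tokens (32), the Semantic Token rate (50 per second), and the names `EncodedSpeech`, `semantic_token_count`, and `build_lm_sequence` are all assumptions chosen for the example, since the article gives no concrete figures or interfaces. It shows only that one part has a fixed length regardless of audio duration, that the other grows with duration, and that both can sit in the same sequence as text tokens.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed values for illustration; the article does not state them.
GLOBAL_TOKEN_COUNT = 32        # fixed length, independent of audio duration
SEMANTIC_TOKENS_PER_SEC = 50   # low, constant token rate over time


@dataclass
class EncodedSpeech:
    """Hypothetical container for the two complementary token streams."""
    global_tokens: List[int]    # time-invariant attributes such as timbre
    semantic_tokens: List[int]  # content closely tied to the text


def semantic_token_count(duration_sec: float) -> int:
    """Number of semantic tokens for a clip of the given duration."""
    return int(round(duration_sec * SEMANTIC_TOKENS_PER_SEC))


def build_lm_sequence(
    text_tokens: List[int], speech: EncodedSpeech
) -> List[Tuple[str, int]]:
    """Lay out text, global, and semantic tokens as one tagged sequence."""
    if len(speech.global_tokens) != GLOBAL_TOKEN_COUNT:
        raise ValueError("global tokens must have the fixed length")
    return (
        [("text", t) for t in text_tokens]
        + [("global", t) for t in speech.global_tokens]
        + [("semantic", t) for t in speech.semantic_tokens]
    )


if __name__ == "__main__":
    # Placeholder token ids; a real codec would produce these from audio.
    clip = EncodedSpeech(
        global_tokens=[0] * GLOBAL_TOKEN_COUNT,
        semantic_tokens=[0] * semantic_token_count(2.0),
    )
    sequence = build_lm_sequence([101, 102, 103], clip)
    print(len(sequence), "tokens in the unified sequence")
```

Under these assumptions, doubling the clip length doubles only the semantic portion of the sequence, while the global portion stays the same size.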
Building on this design, TicVoice 7.0 demonstrates outstanding voice cloning and emotional expression. It can capture voiceprint features within 3 seconds, allowing the AI not only to speak like a human but also to imitate subtle expressive details such as sighs and pauses. Compared with the previous generation of voice models, TicVoice 7.0 significantly improves timbre similarity, emotional expressiveness, and stability. Its score on the internationally used MOS (mean opinion score) scale rose from 3.9 to 4.2, with stronger emotional expression and a more natural, pleasant, and stable listening experience.
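For readers unfamiliar with the metric, a mean opinion score is conventionally the arithmetic mean of listener ratings on a 1-to-5 scale. The sketch below shows that calculation; the function name `mean_opinion_score` and the sample ratings are hypothetical and are not data from Mobvoi's evaluation.

```python
from statistics import mean
from typing import Iterable


def mean_opinion_score(ratings: Iterable[int]) -> float:
    """Average listener ratings on the conventional 1-5 MOS scale."""
    values = list(ratings)
    if not values:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in values):
        raise ValueError("ratings must be between 1 and 5")
    return float(mean(values))


if __name__ == "__main__":
    # Made-up ratings from five hypothetical listeners for one audio sample.
    print(mean_opinion_score([4, 5, 4, 4, 4]))
```

Because the scale tops out at 5, a gain of a few tenths of a point near the top of the range reflects a noticeable shift in how listeners rate the audio.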
TicVoice 7.0 also performs well in personalized customization. Users can shape a distinctive voice style by adjusting attributes such as gender, speaking rate, and fundamental frequency (pitch). For customization of the "Zhizhen Pro-Quality Pronunciator", users only need to provide 20 to 200 corpus samples to obtain broadcast-grade professional dubbing. On the same MOS scale, the score for this offering rose from 4.3 to 4.7, reaching broadcast level and providing a professional voice generation solution for film, television, games, and other scenarios.
Mobvoi has now deployed TicVoice 7.0 in its AI dubbing product "Maoyin Workshop", bringing users better service and a better experience. The engine performs well in application scenarios such as customer service, audiobooks, emotional live streaming, and film and television commentary. It also gives the industry new momentum through close collaboration between the open-source ecosystem and industry, academia, and research.