CosyVoice 2.0 is a large multilingual speech generation model with full-stack capabilities covering inference, training, and deployment. It supports multilingual speech generation and produces natural, fluent, human-like voices suited to multiple locales.
The project was developed by the FunAudioLLM team and is open sourced under the Apache-2.0 license.
Main features
Multilingual support: CosyVoice supports speech synthesis in Chinese, English, Japanese, Korean, and a variety of Chinese dialects (such as Cantonese, Sichuanese, Shanghainese, and the Tianjin and Wuhan dialects).
Ultra-low latency: CosyVoice 2.0 integrates offline and streaming modeling and supports bidirectional streaming speech synthesis, with first-packet synthesis latency as low as 150 milliseconds while maintaining high-quality audio output.
High accuracy: CosyVoice 2.0 reduces pronunciation errors in synthesized audio by 30% to 50% compared to version 1.0, achieving the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
Strong stability: CosyVoice 2.0 ensures excellent timbre consistency in zero-shot and cross-lingual speech synthesis.
Natural experience: The prosody, sound quality, and emotional alignment of the synthesized audio have been significantly improved, with the MOS evaluation score increased from 5.4 to 5.53.
This tutorial walks Windows users through deploying CosyVoice 2.0 locally, from environment configuration to running the model.
Miniconda is a minimal installer for Conda that is easy to set up on Windows. After downloading it, run the installer and click Next, as with any ordinary program, until the installation is complete.
Get the CosyVoice source code from the official repository or another specified channel, then unzip it.
Open Anaconda Prompt or CMD and enter the following command to create and activate the environment:
conda create -n cosyvoice python=3.8 -y
conda activate cosyvoice
On Windows, the pynini module can only be installed with Conda, so run the following command in the activated environment:
conda install -y -c conda-forge pynini==2.1.5 WeTextProcessing==1.0.3
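Before moving on, it can help to confirm that the environment looks the way the steps above intend. The following is a minimal sketch, not part of the CosyVoice project: the helper names python_matches and missing_modules are made up for illustration, and the check assumes that pynini is importable under the module name pynini.

```python
import importlib.util
import sys


def python_matches(expected, actual=None):
    """Return True when the interpreter's (major, minor) equals `expected`."""
    actual = sys.version_info[:2] if actual is None else actual
    return tuple(actual) == tuple(expected)


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Run this inside the activated cosyvoice environment.
    print("Python 3.8 active:", python_matches((3, 8)))
    print("Missing modules:", missing_modules(["pynini"]))
```

If the script reports a different Python version or lists pynini as missing, the environment was probably not activated before the conda install command was run.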
Edit requirements.txt:
Delete WeTextProcessing==1.0.3 from the last line (to avoid an installation failure).
Add the Matcha-TTS dependencies.
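The deletion can be done by hand in any text editor. For anyone who prefers to script it, here is a small sketch of the same edit in Python. The function name drop_requirement is hypothetical, and it assumes the entry is written as a plain name==version line, as it is quoted above.

```python
def drop_requirement(text, package):
    """Return requirements text without the lines that pin `package`."""
    kept = []
    for line in text.splitlines():
        # Compare only the name part that precedes the version specifier.
        name = line.split("==")[0].strip().lower()
        if name != package.lower():
            kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Made-up sample content; the real requirements.txt contains other entries.
    sample = "some-package==0.1\nWeTextProcessing==1.0.3\n"
    print(drop_requirement(sample, "WeTextProcessing"), end="")
```

To apply it to the real file, read requirements.txt with pathlib.Path.read_text, pass the text through drop_requirement, and write the result back with Path.write_text. Keeping a backup copy of the original file first is a sensible precaution.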
Install the dependencies (using the Alibaba Cloud mirror to speed up downloads):
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
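Once pip finishes, one way to spot packages that were skipped or installed at a different version is to compare the pins in requirements.txt with what is actually installed. The sketch below uses the standard importlib.metadata module (available from Python 3.8). The function names are illustrative, the parser deliberately ignores comment lines, pip options, and lines with environment markers, and the comparison is exact, so a build with a local suffix such as +cu121 would be reported as a mismatch.

```python
from importlib import metadata


def parse_pins(text):
    """Collect name==version pins, skipping comments, options, and marker lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#")[0].strip()
        if not line or line.startswith("-") or ";" in line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def unmet_pins(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every pin the environment does not satisfy."""
    return {
        name: (wanted, lookup(name))
        for name, wanted in pins.items()
        if lookup(name) != wanted
    }
```

A typical use would be to read requirements.txt, call parse_pins on its text, and print the result of unmet_pins. An empty dictionary means every exact pin that was checked is satisfied.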
At this point, CosyVoice and all its dependencies have been installed and can be started.