VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems.
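To make the "TTS as conditional language modeling" framing concrete, here is a minimal, illustrative sketch (not the paper's implementation): a decoder-only Transformer predicts discrete audio-codec tokens autoregressively, conditioned on a phoneme prefix. The vocabulary sizes, model dimensions, and the single-codebook simplification are assumptions for readability; VALL-E itself combines an autoregressive stage with a non-autoregressive stage over multiple codec quantizer levels.

```python
# Sketch: TTS as next-token prediction over discrete codec tokens, conditioned on phonemes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodecLM(nn.Module):
    def __init__(self, n_phonemes=100, n_codes=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, d_model)  # text (phoneme) prompt tokens
        self.code_emb = nn.Embedding(n_codes, d_model)       # discrete audio-codec tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # used as a causal decoder
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, phonemes, codes):
        # Concatenate phoneme prefix and codec tokens, then apply a causal mask so each
        # codec position attends only to the phonemes and to previously generated codes.
        x = torch.cat([self.phone_emb(phonemes), self.code_emb(codes)], dim=1)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.decoder(x, mask=causal)
        # Logits for the next codec token at every codec position (teacher forcing).
        return self.head(h[:, phonemes.size(1) - 1:-1, :])

# Toy training step with random data, just to show the conditional-LM objective.
model = CodecLM()
phonemes = torch.randint(0, 100, (2, 20))   # batch of phoneme-id sequences
codes = torch.randint(0, 1024, (2, 150))    # batch of codec-token sequences
logits = model(phonemes, codes)             # (batch, len(codes), n_codes)
loss = F.cross_entropy(logits.reshape(-1, 1024), codes.reshape(-1))
loss.backward()
```

At inference time the same model would be sampled token by token from the phoneme prefix, and the generated codec tokens decoded back to waveform by the neural codec's decoder.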