Training Universal Vocoders with Feature Smoothing-based Augmentation Methods for High-quality TTS Systems


Jeongmin Liu1, Eunwoo Song1

1 NAVER Cloud Corp., Seongnam, Korea
jeongmin.liu@navercorp.com, eunwoo.song@navercorp.com

Abstract

While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
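The sketch below illustrates the idea of the augmentation in code: a linear smoothing filter of random width is applied along the time axis of the input acoustic features. The moving-average kernel family, the width range, and the 80-band/200-frame shapes are illustrative assumptions, not the paper's exact filter design.

```python
# Minimal sketch of feature smoothing-based augmentation.
# Assumptions: a moving-average kernel of random width applied along the
# time axis of an 80-band log-mel spectrogram.
import numpy as np

def random_smoothing_kernel(max_width: int = 5) -> np.ndarray:
    """Draw a random linear (moving-average) smoothing kernel."""
    width = np.random.randint(1, max_width + 1)  # width 1 leaves features unchanged
    return np.full(width, 1.0 / width)

def smooth_features(mel: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve each mel band with the kernel along the time axis.

    mel: (n_mels, n_frames) array of log-mel features.
    """
    return np.stack([np.convolve(band, kernel, mode="same") for band in mel])

# Usage: augment ground-truth features before they are fed to the vocoder.
mel = np.random.randn(80, 200)  # placeholder log-mel features
mel_augmented = smooth_features(mel, random_smoothing_kernel())
```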


Systems

Vocoder Training Methods

Figure 1: Block diagrams of the vocoding process in the TTS framework: (a) the conventional separate training (ST) method, in which the vocoder is trained on ground-truth acoustic features but performs inference on the acoustic model's predictions, and (b) the proposed separate training with smoothing augmentation (ST-SA) method, in which the vocoder is trained on acoustic features augmented by random smoothing filters.
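To make the difference between ST and ST-SA concrete, the sketch below shows how a training batch's vocoder input could be prepared under each scheme. The aug_prob hyperparameter and the moving-average filter are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of ST vs. ST-SA vocoder-input preparation (aug_prob and the
# moving-average filter are illustrative assumptions).
import numpy as np

def random_smooth(mel: np.ndarray, max_width: int = 5) -> np.ndarray:
    """Apply a moving-average filter of random width along the time axis."""
    width = np.random.randint(1, max_width + 1)
    kernel = np.full(width, 1.0 / width)
    return np.stack([np.convolve(band, kernel, mode="same") for band in mel])

def prepare_vocoder_input(mel: np.ndarray, method: str = "ST-SA",
                          aug_prob: float = 0.5) -> np.ndarray:
    """ST: the vocoder always sees ground-truth features during training.
    ST-SA: features are randomly smoothed so the vocoder also learns from
    over-smoothed inputs resembling acoustic-model predictions."""
    if method == "ST-SA" and np.random.rand() < aug_prob:
        return random_smooth(mel)
    return mel

# At inference time both methods feed the acoustic model's predicted
# features to the vocoder; only the training-time input differs.
mel_gt = np.random.randn(80, 200)  # placeholder ground-truth features
vocoder_input = prepare_vocoder_input(mel_gt, method="ST-SA")
```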

Vocoder Models


Figure 2: The UnivNet architectures: (a) the vanilla UnivNet-c32 model and (b) the proposed eUnivNet model. The notations c and k denote the number of channels and the kernel size of the convolution layer, respectively.
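As a reading aid for the c/k notation in Figure 2, the line below instantiates a plain 1-D convolution with c = 32 channels and kernel size k = 3; this is only a notational illustration, not an actual eUnivNet block.

```python
# Notation illustration only: c = 32 channels, k = 3 kernel size.
import torch.nn as nn

conv = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=3, padding=1)
```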


Training Method Comparison

Tacotron 2 + eUnivNet

Seen Speakers | Recording | ST | FT | ST-SA (Proposed)
F1
F2
M1
M2

Tacotron 2 + HiFi-GAN V1

Seen Speakers | Recording | ST (GT features*) | ST | ST-SA (Proposed)
F1
F2
M1
M2
* Cases in which the model synthesizes waveforms from the ground-truth acoustic features. Note that the synthesis quality of the HiFi-GAN V1 ST model degrades significantly when it uses the acoustic model's predictions instead of the ground-truth features.

FastSpeech 2 + eUnivNet

Seen Speakers | Recording | ST | ST-SA (Proposed)
F1
F2
M1
M2

Unseen Speakers (Tacotron 2 + eUnivNet)

Unseen Speakers | Recording | ST | ST-SA (Proposed)
F3
M3



Vocoder Model Comparison

Tacotron 2 + ST-SA vocoders

Speaker | Recording | UnivNet-c32 | eUnivNet (Proposed) | eUnivNet-H/N-G | eUnivNet-M/C-D
F1
F2
M1
M2


[1] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020, pp. 17022–17033.
(We trained the model for 1M steps using the official implementation.)
[2] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. INTERSPEECH, 2021, pp. 2207–2211.
(We trained the model for 1M steps using an open-source implementation.)