HooliGAN Vocoder Demo Samples

ABSTRACT: Recent developments in generative models have shown that deep learning combined with traditional DSP techniques can successfully generate convincing violin samples (DDSP), that source-excitation combined with WaveNet yields high-quality vocoders (NSF), and that GAN-based techniques can improve naturalness (MelGAN).
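To make the source-excitation idea concrete, here is a minimal sketch (not the paper's implementation) of an NSF-style excitation signal: a sine carrier integrated from a frame-level F0 track, with low-level noise covering unvoiced regions. The sample rate, hop size, and noise level below are illustrative assumptions.

import numpy as np

# Illustrative NSF-style source excitation. Sample rate, hop size, and
# noise level are assumptions for this sketch, not values from the paper.
def sine_excitation(f0_frames, sr=22050, hop=256, noise_std=0.003):
    # Upsample per-frame F0 (Hz, 0 where unvoiced) to the sample rate.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)   # integrate instantaneous frequency
    voiced = (f0 > 0).astype(np.float64)
    sine = np.sin(phase) * voiced              # harmonic source in voiced frames
    noise = noise_std * np.random.randn(f0.size)
    return sine + noise                        # excitation fed into the neural filter

# Example: 100 frames of a steady 220 Hz voiced segment.
excitation = sine_excitation(np.full(100, 220.0))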

By combining the ideas in these models we introduce HooliGAN, a robust vocoder that achieves state-of-the-art results, fine-tunes very well to smaller datasets (<30 minutes of speech data), and generates audio at 2.2 MHz on GPU and 35 kHz on CPU. We also show a simple modification to Tacotron-based models that allows seamless integration with HooliGAN.
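For context on those generation speeds, a quick back-of-the-envelope real-time factor, assuming a 22.05 kHz output sample rate (the LJSpeech default; the rate actually used in the paper may differ):

# Real-time factors for the quoted generation speeds.
# The 22.05 kHz output sample rate is an assumption of this sketch.
SAMPLE_RATE = 22_050      # Hz, assumed output rate
GPU_RATE = 2_200_000      # samples/s on GPU (2.2 MHz, from the abstract)
CPU_RATE = 35_000         # samples/s on CPU (35 kHz, from the abstract)

print(f"GPU: {GPU_RATE / SAMPLE_RATE:.0f}x real time")   # ~100x
print(f"CPU: {CPU_RATE / SAMPLE_RATE:.1f}x real time")   # ~1.6x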

Results from our listening tests show the proposed model’s ability to consistently output high-quality audio with a variety of datasets, big and small.
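For reference, the MOS figures below follow the usual convention of a mean opinion score over 1-5 listener ratings reported as "mean ± half-width of a 95% confidence interval". A minimal sketch of that computation, using made-up ratings (the paper's exact CI method is an assumption here):

import numpy as np

# MOS with a 95% normal-approximation confidence interval. The ratings are
# made up; the CI method is an assumption of this sketch.
def mos_with_ci(ratings, z=1.96):
    r = np.asarray(ratings, dtype=np.float64)
    mean = r.mean()
    half_width = z * r.std(ddof=1) / np.sqrt(r.size)   # z * standard error
    return mean, half_width

mean, ci = mos_with_ci([4, 5, 4, 4, 5, 3, 4, 5, 4, 4])
print(f"MOS: {mean:.2f} ± {ci:.2f}")   # MOS: 4.20 ± 0.39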

Paper Preprint Available Here


Analysis/Synthesis

Inverting Ground Truth Acoustic Features

System        MOS
Ground Truth  4.29 ± 0.06
HooliGAN      4.07 ± 0.06
WaveGlow      3.77 ± 0.07
WaveRNN       3.77 ± 0.07
MelGAN        3.02 ± 0.08

Small Dataset Fine-tuning (30 minutes of data)

Inverting Predicted Acoustic Features from Tacotron2

System               MOS
SpeakerA - HooliGAN  4.49 ± 0.05
SpeakerA - WaveRNN   4.10 ± 0.07
SpeakerB - HooliGAN  4.05 ± 0.06
SpeakerB - WaveRNN   3.18 ± 0.09

Large Dataset (24 hours of data)

Inverting Predicted Acoustic Features from Tacotron2

System                   MOS
LJSpeech - Ground Truth  4.28 ± 0.06
LJSpeech - HooliGAN      4.18 ± 0.07
LJSpeech - WaveRNN       3.65 ± 0.08