Generating Training Data for Keyword Spotting given Few Samples
Time: Wed 2019-03-27 13.15
Participating: Pius Friesch
Speech recognition systems generally need large quantities of data covering highly variable voices and recording conditions in order to produce robust results. In the specific case of keyword spotting, where only short commands are recognized instead of large vocabularies, the resource-intensive task of data acquisition has to be repeated for each keyword individually. Over the past few years, neural methods in speech synthesis and voice conversion have made tremendous progress and now generate samples that sound realistic to the human ear. In this work, we explore the feasibility of using such methods to generate training data for keyword spotting. Specifically, we want to evaluate whether the generated samples are indeed realistic or only sound so, and whether a model trained on these generated samples can generalize to real samples. We evaluated three neural network speech synthesis and voice conversion techniques: (1) Speaker Adaptive VoiceLoop, (2) Factorized Hierarchical Variational Autoencoder (FHVAE), (3) Vector Quantised-Variational AutoEncoder (VQVAE).
These three methods are evaluated as data augmentation or data generation techniques on a keyword spotting task. The performance of the models is compared to a baseline of changing the pitch, tempo, and speed of the original sample. The experiments show that the neural network techniques can provide up to a 20% relative accuracy improvement on the validation set. The baseline augmentation technique, however, performs at least twice as well. This seems to indicate that naively using multi-speaker speech synthesis or voice conversion does not yield sufficiently varied or realistic samples.
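To make the baseline more concrete, the following is a minimal sketch of one of its components, a speed change, implemented by resampling the waveform with linear interpolation. It is an illustration under stated assumptions, not the setup used in this work: the function names `change_speed` and `augment`, the factors 0.9 and 1.1, and the 16 kHz test tone are all hypothetical. A pure speed change alters pitch and tempo together; changing them independently would need a time-stretching or pitch-shifting algorithm, which is not shown here.

```python
import numpy as np


def change_speed(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a mono waveform so it plays `factor` times faster.

    factor > 1 shortens the clip (faster, higher pitch),
    factor < 1 lengthens it (slower, lower pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    old_positions = np.arange(len(samples))
    # Spread n_out read positions evenly over the original clip.
    new_positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(new_positions, old_positions, samples)


def augment(samples: np.ndarray, factors=(0.9, 1.1)) -> list:
    """Return the original clip plus one speed-changed copy per factor.

    The default factors are assumed values chosen for illustration.
    """
    return [samples] + [change_speed(samples, f) for f in factors]


if __name__ == "__main__":
    sample_rate = 16000  # assumed sample rate
    t = np.arange(sample_rate) / sample_rate
    clip = np.sin(2 * np.pi * 440.0 * t)  # one-second test tone
    for variant in augment(clip):
        print(len(variant))
```

Each keyword recording passed through `augment` yields several training samples of different durations, which would then be padded or cropped to the input length the keyword spotting model expects.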