Utilizing Self-supervised Representation and Novel Evaluation Methods in Spontaneous Speech Synthesis

Siyang Wang's Annual PhD Progress Seminar

Siyang Wang is a third-year PhD student at the Division of Speech, Music and Hearing at KTH. His research interests include various topics in speech and gesture synthesis, such as spontaneous speech synthesis, integrated speech and gesture synthesis, and evaluation of synthesis.

Time: Wed 2023-10-25 15.00 - 16.00

Location: Fantum (Lindstedtsvägen 24, floor 5, room no. 522)

Language: English

Contact:

Doctoral student: Siyang Wang

Abstract

I will share my recent work on two topics in spontaneous speech synthesis: (1) how and why to utilize self-supervised speech representations in spontaneous speech synthesis, and (2) a set of novel evaluation methods that can be applied to spontaneous speech synthesis to better understand model performance. The dominant paradigm in text-to-speech (TTS) today is a so-called "two-stage" pipeline approach, where the first stage, often called the "acoustic model", predicts a pre-determined intermediate representation, and the second stage, often called the "vocoder", takes the intermediate representation generated by the first stage and outputs speech audio. Our work shows that replacing the conventionally used mel-spectrogram with a self-supervised learning (SSL) speech representation such as wav2vec 2.0 as the intermediate representation yields better spontaneous TTS. I will share how we arrived at this conclusion, as well as which SSL models, and which layer of each, are best suited to this task. But how do we even know which model is better? I will talk about why evaluating spontaneous speech synthesis is difficult and in which aspects current evaluation methods are lacking. I will then present several alternative evaluation methods I have worked on to remedy some of these shortcomings, along with the corresponding studies. Specifically, I will talk about automatically assessing turn-taking cues in speech synthesis and evaluating how appropriate different TTS voices are in a social robotics scenario. This is my second PhD progress seminar.

Project credit: Digital Futures AAIS