Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Time: Wed 2021-11-03 09.00

Location: Zoom

Participating: Dr. Erica Cooper & Dr. Xin Wang

Speech and music audio differ in many aspects but also share
similarities. In this talk, we will discuss how speech technologies can
be applied to tasks in the music domain. We show that text-to-speech
synthesis techniques can be used for piano MIDI-to-audio synthesis
tasks, and that speaker recognition architectures can be adapted to
musical instrument recognition. For the synthesis task, we use Tacotron
and neural source-filter waveform models as the basic components and
build MIDI-to-audio synthesis systems in a manner similar to TTS
frameworks. Subjective evaluation results demonstrate that the
investigated TTS components can be applied to piano MIDI-to-audio
synthesis with minor modifications. For the instrument recognition
task, we show that speaker recognition systems modified for music can
learn a meaningful embedding space that contains information about
instrument identity, as well as pitch, velocity, and playing style.
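
As a rough illustration of how MIDI pitch information could feed a source-filter style waveform model, the sketch below converts a MIDI note number to its fundamental frequency using the standard equal-temperament formula (A4 = note 69 = 440 Hz) and generates a plain sine wave at that frequency. This is not the speakers' implementation: the function names `midi_to_hz` and `sine_excitation`, the 16 kHz sample rate, and the use of a single bare sine as the excitation signal are all assumptions made for illustration.

```python
import math
from typing import List


def midi_to_hz(note: int) -> float:
    """Equal-temperament frequency of a MIDI note (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def sine_excitation(note: int, duration_s: float,
                    sample_rate: int = 16000) -> List[float]:
    """Sine wave at the note's fundamental frequency.

    Hypothetical stand-in for the excitation signal of a source-filter
    waveform model; a real system would be considerably more elaborate.
    """
    f0 = midi_to_hz(note)
    n_samples = round(duration_s * sample_rate)
    return [math.sin(2.0 * math.pi * f0 * n / sample_rate)
            for n in range(n_samples)]


# Middle C (MIDI note 60) for 10 ms at the assumed 16 kHz sample rate.
excitation = sine_excitation(note=60, duration_s=0.01)
```

In a full system, a signal like this would only be the starting point: a learned filter would shape it into a piano-like waveform.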

Speaker bio (Erica Cooper):
Erica Cooper is a post-doc at the National Institute of Informatics, Japan.
She received her Ph.D. from Columbia University, USA, in 2019.
Her research interests include low-resource languages, multi-speaker
synthesis, and other speech-related topics.

Speaker bio (Xin Wang):
Xin Wang is a post-doc at the National Institute of Informatics, Japan. He
received his Ph.D. from the same institute in 2018. His research
interests include speech synthesis, speech anti-spoofing, and other
speech and language processing topics.