Speech Disentanglement
Time: Fri 2021-12-03 15:15
Location: Zoom
Participating: Jennifer Williams
Abstract:
This talk explores the idea of disentanglement in the speech domain. An
end-to-end machine learning task is proposed that deconstructs the
speech signal into abstract representations that can be learned and
later reused in various speech technology tasks. This process of
deconstruction, known as disentanglement, is a form of distributed
representation learning. In some cases, learned speech representations
can be re-assembled in different ways according to the requirements of
downstream applications. For example, in a voice conversion (VC) task,
the speech content is retained while the speaker identity is changed.
The talk then surveys a variety of use cases for disentangled
representations, including phone recognition, speaker diarization,
linguistic code-switching, and content-based privacy masking. Speech
representations can also be utilised for automatically assessing the
quality and authenticity of speech, such as automatic MOS (mean opinion
score) prediction or deepfake detection. The meaning of the term
'disentanglement' is not
well defined in previous work. It has been used to mean different things
depending on the domain (e.g. image vs. speech). Sometimes the term is
used interchangeably with the term 'factorization'. This talk proposes
that the two terms are indeed distinct, and offers a viewpoint of
disentanglement for the audience to consider both theoretically and
practically.
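
To make the recombination idea from the abstract concrete, here is a
minimal, untrained sketch of a factorized pipeline: one encoder keeps a
frame-level "content" representation, another pools an utterance-level
"speaker" embedding, and a decoder re-assembles the two. Swapping the
speaker embedding between utterances illustrates voice conversion. All
module names, dimensions, and the mel-spectrogram framing are
illustrative assumptions, not the architecture from the talk.

```python
import torch
import torch.nn as nn

N_MELS, CONTENT_DIM, SPEAKER_DIM = 80, 128, 64  # hypothetical sizes


class ContentEncoder(nn.Module):
    """Frame-level encoder: keeps the time axis (what is said)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, CONTENT_DIM, batch_first=True)

    def forward(self, mel):                # mel: (batch, frames, N_MELS)
        out, _ = self.rnn(mel)
        return out                         # (batch, frames, CONTENT_DIM)


class SpeakerEncoder(nn.Module):
    """Utterance-level encoder: pools over time (who is speaking)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(N_MELS, SPEAKER_DIM)

    def forward(self, mel):
        return self.proj(mel).mean(dim=1)  # (batch, SPEAKER_DIM)


class Decoder(nn.Module):
    """Re-assembles content frames plus a broadcast speaker embedding."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(CONTENT_DIM + SPEAKER_DIM, N_MELS)

    def forward(self, content, speaker):
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, spk], dim=-1))


content_enc, speaker_enc, dec = ContentEncoder(), SpeakerEncoder(), Decoder()

mel_a = torch.randn(1, 200, N_MELS)  # source utterance (content donor)
mel_b = torch.randn(1, 150, N_MELS)  # target utterance (voice donor)

# Voice conversion by recombination: content of A, speaker identity of B.
converted = dec(content_enc(mel_a), speaker_enc(mel_b))
print(converted.shape)  # torch.Size([1, 200, 80])
```

In a real system, training objectives would be needed to force the two
encoders to actually separate content from speaker identity; the sketch
only shows the shape of the recombination.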
Speaker bio:
Jennifer Williams is finishing her PhD at the Centre for Speech
Technology Research at the University of Edinburgh. Her main research
interest is in speech representation learning with additional interests
in ethical issues surrounding speech technology, including: voice data
privacy, voice spoofing/anti-spoofing, and speech technology security.
She is also a senior speech scientist at MyVoice AI, a London-based
start-up, where she works on TinyML for speech.