Mapping voice quality in normal, pathological and synthetic voices
Time: Fri 2025-03-14 14.00
Location: Room B:218, Q2, Malvinas Väg 10, Campus
Video link: https://kth-se.zoom.us/j/61856204062?pwd=0aLP1ptM9OMUaaXUFuSBxV6bbu74iO.1
Language: English
Subject area: Computer Science
Doctoral student: Huanchen Cai, Tal, musik och hörsel, TMH
Opponent: Professor Zhaoyan Zhang, University of California, Los Angeles, USA
Supervisors: Professor Sten Ternström, Tal-kommunikation; Professor Olov Engwall, Tal-kommunikation
QC 20250224
Abstract
Voice quality evaluation is integral to both clinical and technological applications, including speech therapy, the diagnosis of phonation disorders, and text-to-speech (TTS) synthesis. Traditional methods of assessing voice quality are often subjective, relying on auditory-perceptual rating scales that introduce variability and bias. This thesis explores several novel applications of voice mapping to objective voice quality assessment. Voice mapping is a visualization technique that integrates voice range and voice quality metrics: by plotting acoustic and electroglottographic (EGG) metrics across a plane defined by fundamental frequency (fo) and sound pressure level (SPL), it enables a comprehensive understanding of vocal characteristics.
This thesis is a compilation of five studies, three of which have been published in archival journals and two of which are in revision at the time of writing. Paper I establishes the foundational relationship between voice metrics and fo and SPL, using data from individuals with vocal disorders. Paper II extends the methodology by employing clustering techniques to classify phonation types in a diverse dataset of normophonic adults and children. Paper III applies voice mapping to pre- and post-thyroidectomy recordings, revealing surgery-induced changes in voice quality and range. Paper IV develops a deep learning model for predicting EGG signals from acoustic recordings. Paper V demonstrates the utility of voice mapping in evaluating synthetic TTS voices, indicating its potential for objective, metric-based TTS quality assessment.
This thesis further underscores the importance of integrating acoustic and EGG metrics to achieve an objective assessment of voice quality. These metrics capture aspects of phonation in both the time and frequency domains, enabling detailed characterization of vocal dynamics. The findings demonstrate that voice mapping is not only effective in clinical settings for understanding voice disorders but also offers a robust framework for evaluating synthetic voices, helping to bridge the gap between perceptual evaluation and quantitative analysis. Future directions include refining the clustering methodologies, improving the accuracy of EGG prediction, and extending voice mapping to a broader range of clinical and technological applications.
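As an illustration of the voice mapping idea described in the abstract, the following minimal sketch averages a metric over cells of the fo/SPL plane. It is not the implementation used in the thesis: the cell size (one semitone by one dB), the function names `hz_to_semitone` and `build_voice_map`, and the sample values are all assumptions made for this example.

```python
import math
from collections import defaultdict


def hz_to_semitone(fo_hz):
    """Convert a frequency in Hz to a MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(fo_hz / 440.0)


def build_voice_map(samples):
    """Average a metric within each cell of the fo/SPL plane.

    samples: iterable of (fo_hz, spl_db, metric_value) tuples.
    Returns a dict mapping (semitone, spl_db) -> (mean_metric, count).
    Cell size (1 semitone x 1 dB) is an assumption of this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fo_hz, spl_db, value in samples:
        # Assign the sample to the nearest integer semitone and dB cell.
        cell = (
            math.floor(hz_to_semitone(fo_hz) + 0.5),
            math.floor(spl_db + 0.5),
        )
        sums[cell] += value
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in sums}


# Hypothetical measurements: (fo in Hz, SPL in dB, some voice metric).
samples = [
    (440.0, 70.2, 0.40),
    (440.0, 70.4, 0.60),
    (220.0, 65.0, 0.50),
]
voice_map = build_voice_map(samples)
```

Each key of `voice_map` is one cell of the map, and the stored count shows how many samples support the averaged value, which could be used to hide sparsely populated cells when plotting.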