How to predict a conversation

Erik Ekstedt and Gabriel Skantze from the Division of Speech, Music and Hearing
Published Sep 26, 2022

The SIGDIAL Best Paper Award went to Erik Ekstedt and Gabriel Skantze from Speech, Music and Hearing (TMH). Their model learns to predict what will happen in the next two seconds of a conversation. The research improves the interaction between humans and conversational systems, such as social robots or voice assistants.

Congratulations on winning the Best Paper Award at the SIGDIAL conference in Edinburgh. Please tell us about your paper.

"Thank you! We are improving the interaction between humans and conversational systems, such as social robots or voice assistants. More specifically, we are interested in modelling fluent turn-taking in conversation.

We have recently developed a deep learning model that is trained on large amounts of spoken interaction between humans. The model learns to continuously predict what will happen in the next two seconds of the conversation.

What's a conversational system?

A conversational system is an intelligent machine that can understand language and conduct a written or verbal conversation with a customer. 

Conversational systems research at KTH Speech, Music and Hearing seeks to make interactions with these systems more fluent and the systems more human-like.

From a scientific perspective, a challenge when training deep learning models is that it is hard to know what they learn. In this paper, we present a method for analysing our model and show that it has learned to pick up very subtle prosodic cues, such as the tone of the voice, which are essential for human listeners."

What impact could your research have on society?

"Our model can be directly applied to improve turn-taking in conversational systems of today and allow for applications in health care, education, and entertainment.

The analysis we present can also be a powerful tool for improving our scientific understanding of how humans coordinate turn-taking in conversation. Psycholinguistic experiments with human participants can be very expensive and limited in scale, so it is interesting to see that we can complement such studies with large-scale experiments using our computational models."

What's the most exciting research in your field?

"Research on conversational systems has exploded in recent years in academia and industry; for example, the second Best Paper Award at SIGDIAL 2022 went to Google.

Many people might, for example, have read about the Google LaMDA chatbot or OpenAI's GPT-3, which can sometimes engage in very human-like interactions. However, most of these systems are text-based, and our focus is on how to allow for a more natural spoken interaction."

The paper in short

How much does prosody help turn-taking? Investigations using voice activity projection models

Erik Ekstedt & Gabriel Skantze

Turn-taking is a fundamental aspect of human communication. It can be described as the ability to take turns, project upcoming turn shifts, and supply backchannels at appropriate locations throughout a conversation. In this work, we investigate the role of prosody in turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of the interlocutors in a self-supervised manner without relying on the explicit annotation of turn-taking events or the detailed modelling of prosodic features. We investigate how these models implicitly utilise prosodic information by manipulating the speech signal. We show that these systems learn to use various prosodic aspects of speech on aggregate quantitative metrics of long-form conversations and on single utterances specifically designed to depend on prosody.
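The self-supervised idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `projection_labels`, the frame rate, and the choice to summarise the future window as a fraction of active frames are all assumptions made for illustration. The only detail taken from the article is the two-second horizon. The sketch shows how training targets can be derived from the voice activity of the two speakers alone, with no annotated turn-taking events.

```python
from typing import List, Tuple


def projection_labels(
    activity: List[List[int]],  # activity[s][t] is 1 if speaker s speaks in frame t
    frame_rate: int = 50,       # assumed frames per second (hypothetical value)
    horizon_s: float = 2.0,     # two-second horizon mentioned in the article
) -> List[Tuple[float, ...]]:
    """For each frame, return the fraction of the upcoming window in which
    each speaker is active. Frames without a full future window are skipped."""
    horizon = int(round(frame_rate * horizon_s))
    n_frames = len(activity[0])
    labels = []
    for t in range(n_frames - horizon):
        # Look only at frames strictly after t, so the target is the future.
        labels.append(
            tuple(sum(track[t + 1 : t + 1 + horizon]) / horizon for track in activity)
        )
    return labels


# Toy example at 2 frames per second: speaker A stops, then speaker B starts.
speaker_a = [1, 1, 1, 0, 0, 0, 0, 0]
speaker_b = [0, 0, 0, 0, 1, 1, 1, 1]
labels = projection_labels([speaker_a, speaker_b], frame_rate=2)
# labels[0] is (0.5, 0.25): A still dominates the next two seconds
# labels[3] is (0.0, 1.0): the turn has shifted to B
```

A model trained to predict such targets from the audio up to frame t has to pick up whatever cues signal an upcoming turn shift, which is what makes it possible to probe its use of prosody afterwards.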

The SIGDIAL conference

The authors

Gabriel Skantze, professor
