Adaptive Robot Presenters
Modelling Grounding in Multimodal Interaction
Time: Fri 2023-11-10 14.00
Subject area: Speech and Music Communication
Doctoral student: Agnes Axelsson , Tal, musik och hörsel, TMH
Opponent: Professor Elisabeth André, Universität Augsburg
Supervisor: Professor Gabriel Skantze, Tal, musik och hörsel, TMH; Professor Johan Boye,
This thesis addresses the topic of grounding in human-robot interaction, that is, the process by which the human and robot can ensure mutual understanding. To explore this topic, the scenario of a robot holding a presentation to a human audience is used, where the robot has to process multimodal feedback from the human in order to adapt the presentation to the human's level of understanding.
First, the use of behaviour trees to model real-time interactive processes of the presentation is addressed. A system based on the behaviour tree architecture is used in a semi-automated Wizard-of-oz experiment, showing that audience members prefer an adaptive system to a non-adaptive alternative.
Next, the thesis addresses the use of knowledge graphs to represent the content of the presentation given by the robot. By building a small, local knowledge graph containing properties (edges) that represent facts about the presentation, the system can iterate over that graph and consistently find ways to refer to entities by referring to previously grounded content. A system based on this architecture is implemented, and an evaluation using simulated users is presented. The results show that crowdworkers comparing different adaptation strategies are sensitive to the types of adaptation enabled by the knowledge graph approach.
In a face-to-face presentation setting, feedback from the audience can potentially be expressed through various modalities, including speech, head movements, gaze, facial gestures and body pose. The thesis explores how such feedback can be automatically classified. A corpus of human-robot interactions is annotated, and models are trained to classify human feedback as positive, negative or neutral. A relatively high accuracy is achieved by training simple classifiers with signals found mainly in the speech and head movements.
When knowledge graphs are used as the underlying representation of the system's presentation, some consistent way of generating text, that can be turned into speech, is required. This graph-to-text problem is explored by proposing several methods, both template-based and methods based on zero-shot generation using large language models (LLMs). A novel evaluation method using a combination of factual, counter-factual and fictional graphs is proposed.
Finally, the thesis presents and evaluates a fully automated system using all of the components above. The results show that audience members prefer the adaptive system to a non-adaptive system, matching the results from the beginning of the thesis. However, we note that clear learning results are not found, which means that the entertainment aspects of the presentation are perhaps more prominent than the learning aspects.