Challenges in connecting language and vision for multimodal dialogue systems
Time: Fri 2022-01-28 15.15
Location: Fantum (Lindstedtsvägen 24, floor 5, room no. 522)
Video link: Zoom url
Participating: Bram Willemsen
Abstract:
To have a conversation that involves references to objects and entities
in a shared environment, be it simulated or physical, requires not only
the ability to produce referring expressions but also the ability to
comprehend them. Understanding whether or not an utterance contains
referring language is a start, but what tends to be essential for
effective communication is knowing exactly which words refer to what things.
In this seminar, I will talk about language grounding, and more
specifically the challenges involved in producing and understanding
grounded language for conversational systems. I will discuss in more
detail some of the problems we have come to address over the last two
years and the progress made thus far, including our use of large,
pre-trained multimodal embedding models for downstream tasks, and
difficulties faced in the process of collecting visually grounded
dialogue data via crowdsourcing.