Multilingual Language Models
Studies of Pre-Training Approaches and Hallucination Detection
Time: Mon 2024-12-16 14.00
Location: Kollegiesalen, Brinellvägen 8, Stockholm
Video link: https://kth-se.zoom.us/s/3719008936
Language: English
Doctoral student: Evangelia Gogoulou, Software and Computer Systems, SCS, RISE Research Institutes of Sweden
Opponent: Professor Barbara Plank, Ludwig-Maximilians-Universität München, München, Germany
Supervisor: Professor Magnus Boman, Software and Computer Systems, SCS; Professor Joakim Nivre, RISE Research Institutes of Sweden, Uppsala University; Professor Hedvig Kjellström, Robotics, Perception and Learning, RPL
Abstract
The performance of large language models has been improving steadily but varies considerably across languages. One strategy for addressing this disparity is to train multilingual models that enable cross-lingual transfer, so that knowledge from high-resource languages can be leveraged to improve performance on low-resource languages; however, there are limits to the number of languages a model can effectively support. Understanding the factors that influence cross-lingual transfer is therefore crucial for building models that perform consistently across languages. This thesis investigates how the interaction between languages during pre-training affects model performance under different training schemes, model architectures, and evaluation criteria.

We first investigate the scalability of multilingual joint pre-training in the generative setting. We pre-train the first large-scale autoregressive language model for English and Swedish and find that its performance improves with increasing data volume and number of parameters. We then study forward cross-lingual transfer effects in the case of incremental language pre-training. Our experiments, in which monolingual encoder language models are transferred from four different source languages to English, demonstrate that forward transfer effects, measured in terms of downstream performance, are consistently positive. Building on this, we analyze both forward and backward effects of incrementally pre-training autoregressive language models on sequences of languages presented in varying orders. While forward transfer effects are again always positive, we observe that backward transfer effects depend on the order and characteristics of the languages. Our analysis of possible explanatory factors for backward transfer reveals the potentially important roles of language contamination and syntactic similarity. Lastly, we conduct a comparative study of autoregressive models with varying language coverage on the task of detecting intrinsic hallucinations in paraphrase generation and machine translation, across different languages. Our experimental results show that the models perform consistently across languages and suggest that model-specific factors, such as model size and instruction tuning, have a large impact on performance.

These findings advance the understanding of cross-lingual transfer, providing foundations for multilingual models with enhanced learning capacity and consistent performance across previously learned languages. In addition, our work contributes to the evaluation of autoregressive multilingual language models by providing resources and methods for studying the hallucination phenomenon in machine-generated text.
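To make the notions of forward and backward transfer in incremental pre-training concrete, the following is a minimal Python sketch of the bookkeeping involved: after each language stage, the checkpoint is evaluated on all languages, and backward transfer for an earlier language is the change in its score between the checkpoint taken right after that language and the final checkpoint. The function names, the `evaluate` callback, and the dummy scores are hypothetical illustrations under these assumptions, not the thesis's actual code or results.

```python
# Sketch of transfer bookkeeping for incremental pre-training on a language
# sequence. `evaluate(stage, lang)` is an assumed stand-in for any downstream
# evaluation of the checkpoint obtained after stage `stage` on language `lang`.

from typing import Callable, Dict, List


def transfer_matrix(
    languages: List[str],
    evaluate: Callable[[int, str], float],
) -> Dict[str, Dict[str, float]]:
    """Score every stage checkpoint on every language."""
    return {
        languages[stage]: {lang: evaluate(stage, lang) for lang in languages}
        for stage in range(len(languages))
    }


def backward_transfer(
    scores: Dict[str, Dict[str, float]], languages: List[str]
) -> Dict[str, float]:
    """Per-language change on earlier languages after the final stage."""
    final_stage = languages[-1]
    return {
        lang: scores[final_stage][lang] - scores[lang][lang]
        for lang in languages[:-1]
    }


if __name__ == "__main__":
    langs = ["sv", "en"]
    # Dummy scores for illustration only.
    dummy = {(0, "sv"): 0.70, (0, "en"): 0.40, (1, "sv"): 0.65, (1, "en"): 0.75}
    scores = transfer_matrix(langs, lambda stage, lang: dummy[(stage, lang)])
    print(backward_transfer(scores, langs))  # e.g. {'sv': -0.05}
```

A negative value for an earlier language indicates forgetting after later stages, while a positive value indicates beneficial backward transfer; forward transfer is read analogously from how later-stage checkpoints score on a newly added language compared to a model trained on that language alone.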