
Bittersweet Lessons in Music AI Research

Neural Instrument Synthesis, Multi-modal Representations, Symbolic Music Generation

Time: Fri 2026-03-27 15.00

Location: F3 (Flodis), Lindstedtsvägen 26 & 28, Stockholm, Sweden

Video link: https://kth-se.zoom.us/j/64932870406

Language: English

Subject area: Speech and Music Communication

Doctoral student: Nicolas Jonason, Speech, Music and Hearing (Tal, musik och hörsel)

Opponent: Associate Professor Gus Xia, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates

Supervisor: Associate Professor Bob Sturm, Speech, Music and Hearing (Tal, musik och hörsel)



Abstract

This compilation thesis explores AI techniques in three areas related to music making: neural instrument synthesis, multi-modal representations, and symbolic music generation. In neural instrument synthesis, we explore architectural changes and transfer learning techniques to apply neural synthesis methods to instruments for which little data is available. We then move to zero-shot audio applications of multi-modal representations, including text-guided audio equalization, visualization of instrument sounds, and text-driven synthesizer programming. In the domain of symbolic music, we propose superposed language modelling, a generalisation of masked language modelling that enables controllable generation and editing of music using event-attribute domain constraints. We then experiment with text-driven music generation and editing with LLMs augmented with a retrieval system that fetches relevant few-shot examples, showing early signs that LLMs could challenge domain-specific approaches to symbolic music generation. We then bridge the symbolic and audio domains by using an audio-domain model of human preferences as a reward to tune a symbolic music generation model, producing music that, according to the preference model, is preferred over Mozart. Reflecting on our work, we identify data availability as the key factor determining Music AI capabilities and observe that much of the work in this thesis can be seen as capability arbitrage: redirecting capabilities from data-rich domains towards data-poor domains. We conclude by speculating on the music-making capabilities of future AI, considering the massive iceberg of data that remains unused.
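To give a flavour of the constrained-infilling idea behind superposed language modelling, the toy sketch below fills masked positions in an event sequence while restricting each slot to allowed values of a given attribute. This is an illustrative simplification only, not the thesis method; the event representation, function names, and constraint format here are all hypothetical.

```python
import random

# Toy vocabulary of musical events: (pitch, duration) pairs.
PITCHES = ["C4", "D4", "E4", "F4", "G4"]
DURATIONS = [0.25, 0.5, 1.0]

def infill(sequence, constraints, rng=random.Random(0)):
    """Replace each None slot with an event whose attributes satisfy
    that slot's constraints, mimicking constrained masked infilling.
    In a real model the choice would come from a learned distribution
    rather than a uniform random pick."""
    out = []
    for i, event in enumerate(sequence):
        if event is not None:
            out.append(event)  # keep observed events unchanged
            continue
        allowed_pitches = constraints.get(i, {}).get("pitch", PITCHES)
        allowed_durs = constraints.get(i, {}).get("duration", DURATIONS)
        out.append((rng.choice(allowed_pitches), rng.choice(allowed_durs)))
    return out

# Mask positions 1 and 3; pin the pitch of slot 1 and the duration of slot 3.
melody = [("C4", 0.5), None, ("G4", 1.0), None]
constraints = {1: {"pitch": ["E4"]}, 3: {"duration": [0.25]}}
print(infill(melody, constraints))
```

The point of the sketch is the separation between observed events, masked slots, and per-slot attribute constraints: generation and editing become the same operation, differing only in which positions are masked and which constraints are supplied.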

Link to DiVA