Minding the Gap between Music Audio and Lyrics - A Comparative Analysis of Audio Embedding Techniques with and without Lyrics

Master's Thesis Presentation

Time: Mon 2023-08-14 14.00 - 15.30

Location: TMH, LV24, level 5, room F0

Language: English

Contact:

André Tiago Abelho Pereira, researcher, atap@kth.se

DIVISION OF SPEECH, MUSIC AND HEARING

Subject area: Acoustics

Respondent: Joel Kärn, Intelligent Systems

Opponent: Fehmi Ayberk Uçkun

Supervisor: André Abelho Pereira

Examiner: Sten Ternström

ABSTRACT

This thesis investigates various audio and lyric embedding methods for Music Information Retrieval (MIR) tasks, specifically music genre classification and music tagging. Genre classification refers to the categorization of music into distinct genres, while music tagging is the process of assigning categorical identifiers to music tracks. These tasks are central to the MIR field and have important implications for music streaming businesses like Soundtrack Your Brand, which depend on precise classification and tagging to tailor user experiences.

The study evaluates three audio embedding techniques, namely Codified Audio Language Modeling (CALM), Choi, and L3-Net, and two lyric embedding methods, BERT and OpenAI's Embeddings. A particular focus is placed on the integration of the highest-performing audio embedder with a lyrics text embedder, an approach that remains unexplored but may have substantial future applications, especially with the emergence of advanced transcribing models.

In genre classification, the results show that CALM was more effective at capturing relevant musical features than the other audio models. However, the combination of CALM with OpenAI's Embeddings yielded the highest classification accuracy, emphasizing the important role of lyrical content in genre prediction. The findings were consistent in the task of music tagging: while CALM outperformed the other audio models on almost all music tags, the combined model of CALM and OpenAI's Embeddings proved to be even more effective. In the comparison between lyric embedding methods, OpenAI's Embeddings outperformed BERT across most music tags and genres.

The research supports the concept that lyrics, though not sufficient on their own for comprehensive music tagging or genre classification, provide valuable additional information to audio embeddings. The thesis concludes with a reflection on the benefits of a multimodal approach that combines both audio and lyrical features in MIR tasks, highlighting the potential for improved performance through the integration of these methods.
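
The abstract does not state how the audio and lyric embeddings were combined. As a minimal sketch, assuming simple concatenation of L2-normalized vectors followed by a nearest-centroid classifier (both are illustrative assumptions, not the method used in the thesis), the multimodal idea could look like this in Python. The toy vectors stand in for real audio and lyric embeddings, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so neither modality dominates by magnitude."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def fuse_embeddings(audio_emb, lyric_emb):
    """Concatenate per-track audio and lyric embeddings (assumed fusion strategy)."""
    if audio_emb.shape[0] != lyric_emb.shape[0]:
        raise ValueError("audio and lyric embeddings must cover the same tracks")
    return np.concatenate([l2_normalize(audio_emb), l2_normalize(lyric_emb)], axis=1)


def fit_centroids(features, labels):
    """Compute one mean vector per genre label."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(features, classes, centroids):
    """Assign each track to the genre with the nearest centroid (Euclidean distance)."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]


# Toy stand-ins for real embeddings: 4 tracks, 2-dim audio and 3-dim lyric vectors.
audio = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
lyrics = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 5.0]])
genres = ["jazz", "jazz", "rock", "rock"]

fused = fuse_embeddings(audio, lyrics)
classes, centroids = fit_centroids(fused, genres)
predicted = predict(fused, classes, centroids)
```

Normalizing each modality before concatenation is one way to keep a higher-magnitude embedding from dominating the distance computation. A real evaluation would replace the toy arrays with embeddings from the actual models and score predictions on held-out tracks.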