Machine Learning Models in Proteomics and Phylogenetics
Time: Thu 2025-06-12 14.00
Location: Air&Fire, Tomtebodavägen 23A, Solna
Language: English
Subject area: Biotechnology
Doctoral student: Patrick Truong, Science for Life Laboratory, SciLifeLab, Genteknologi
Opponent: Dr. Thomas Burger, University of Grenoble Alpes
Supervisor: Professor Lukas Käll, Science for Life Laboratory, SciLifeLab, Genteknologi, SeRC - Swedish e-Science Research Centre
QC 2025-05-21
Abstract
The exponential growth of biological data in recent years has necessitated the development of sophisticated computational methods to extract meaningful insights. This thesis explores various aspects of bioinformatics, focusing on benchmarking existing methods and developing novel approaches to address current challenges.
As computational biology and large-scale biological datasets continue to expand, the discipline has undergone a paradigm shift toward data-driven methodologies. This transformation is driven by advances in high-throughput technologies that generate vast amounts of genomic, proteomic, and other omics data. The sheer volume and complexity of these datasets demand innovative computational strategies.
Data-driven methods are increasingly central to biological research due to their ability to uncover hidden patterns, predict outcomes, and generate hypotheses from large-scale data. These approaches enable researchers to address complex biological problems that were previously intractable, leading to breakthroughs in areas such as personalized medicine, drug discovery, and systems biology.
This thesis presents four studies that advance bioinformatic methods and their applications. The first study adapts Triqler, a probabilistic graphical model, to protein quantification in data-independent acquisition (DIA) mass spectrometry and evaluates its performance. By comparing the adapted Triqler with established methods, we demonstrate its superior performance in identifying differentially abundant proteins while maintaining better statistical calibration.
The second study introduces Prosit-transformers, a novel approach to predicting MS2 spectrum intensities. By incorporating a transformer model pre-trained on protein features, we achieve improved prediction accuracy and reduced training time compared to the original Prosit model, which is based on recurrent neural networks.
The third study explores proteome-wide alkylation to enhance peptide sequence coverage and detection sensitivity in proteomic analyses. Through systematic modification of peptides with varying alkyl chain lengths, we demonstrate significant improvements in ionization signals, particularly for hydrophilic peptides. This approach has potential applications in nanoproteomics and single-cell proteomics, where sample material is limited.
Finally, the fourth study presents difFUBAR, a scalable Bayesian method for comparing the selection pressure between different sets of branches in phylogenetic analyses. Implemented in the Julia-based MolecularEvolution.jl framework, difFUBAR offers improved computational efficiency through subtree-likelihood caching and provides a robust alternative to frequentist approaches for characterizing site-wise variation in selection parameters.
Together, these studies benchmark the novel methods against existing ones to establish their advantages, and they expand the arsenal of computational approaches available in bioinformatics. By addressing challenges in proteomics, computational biology, and evolutionary analysis, this thesis contributes to the ongoing advancement of data-driven methods in biology. The work presented here not only improves our understanding of biological systems but also provides researchers with enhanced tools to extract meaningful insights from complex biological data.