Algorithms and machine learning for single-molecule protein sequencing methods
Time: Fri 2025-11-07 13.00
Location: F3, Lindstedtvägen 26
Language: English
Doctoral student: Javier Kipen, Teknisk informationsvetenskap
Opponent: Senior assistant professor Simone Tiberi,
Supervisor: Professor Joakim Jaldén, Teknisk informationsvetenskap
QC 20250930
Abstract
Single-molecule protein sequencing (SMPS) technologies are powerful alternatives to mass spectrometry, offering new opportunities for high-resolution proteomics. These technologies, including nanopores, nanogaps, and fluorosequencing, enable the direct identification of protein molecules at single-molecule resolution. Their potential spans diverse applications, from supporting cutting-edge biological research to developing diagnostics and therapeutics. However, SMPS platforms generate complex and noisy signals in large volumes, making computational analysis a key bottleneck in unlocking their full potential.
This thesis addresses that challenge by developing scalable, model-informed and data-driven algorithms tailored to SMPS data. Drawing on tools from statistical signal processing and machine learning, the work focuses on computational methods that improve signal denoising, inference accuracy, and runtime efficiency across several SMPS technologies.
The contributions span three major sensing platforms. For nanogap tunneling devices, a fast and robust denoising algorithm is introduced to manage the heavy-tailed noise characteristic of electronic tunneling signals. For nanopore DNA sensing, a physics-inspired data augmentation method is proposed to improve the generalization of neural networks without requiring additional experimental data. Alongside this data augmentation, the thesis introduces a novel neural network architecture that leverages the augmentation’s benefits and incorporates modern design principles, such as residual connections and attention mechanisms, to outperform state-of-the-art models on a nanopore classification task.
Finally, for fluorosequencing, this thesis presents two complementary contributions: (i) a fast beam search decoder for peptide inference and (ii) an expectation-maximization framework for protein abundance estimation. The proposed decoder achieves up to a tenfold speedup over existing methods with only minimal loss in accuracy. Building on its output, the EM-based protein inference framework enables efficient estimation of protein abundances from peptide-level posteriors. We demonstrate that this approach not only improves quantification accuracy on small-scale datasets but also scales to the full human proteome with tractable computation times, offering a viable route toward single-molecule proteomics at large scale. Together, these tools contribute to the broader effort of making SMPS computationally tractable at the scale required for full-proteome and single-cell analyses.
All methods in this thesis have been made available as open-source software, reflecting a commitment to reproducibility and to supporting the growing SMPS research community. Through the integration of domain knowledge, algorithmic design, and computational efficiency, this thesis aims to push the boundaries of what is achievable in next-generation proteomics.
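The abstract does not give the details of the expectation-maximization framework, so the following is only a generic sketch of how EM can estimate relative abundances as mixture proportions. It is not the thesis implementation. The function name `em_abundance` and the input matrix `likelihoods` (one row per read, one column per protein, each row assumed to have a positive sum) are hypothetical names chosen for illustration.

```python
import numpy as np


def em_abundance(likelihoods, n_iter=200, tol=1e-10):
    """Estimate mixture proportions by EM (illustrative sketch).

    likelihoods[r, p] is an assumed, precomputed score proportional to
    the probability of observing read r given protein p.
    Returns a vector of relative abundances that sums to 1.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    n_reads, n_proteins = likelihoods.shape

    # Start from a uniform guess over proteins.
    weights = np.full(n_proteins, 1.0 / n_proteins)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each protein for each read.
        joint = likelihoods * weights
        resp = joint / joint.sum(axis=1, keepdims=True)

        # M-step: new abundances are the average responsibilities.
        new_weights = resp.mean(axis=0)

        # Stop once the update no longer changes the estimate.
        converged = np.abs(new_weights - weights).max() < tol
        weights = new_weights
        if converged:
            break

    return weights


# Toy input: three reads that can only come from protein 0,
# one read that can only come from protein 1.
toy = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
estimate = em_abundance(toy)
```

In this toy case the reads are unambiguous, so the estimate reduces to the read fractions per protein. With ambiguous reads, the E-step splits each read across its candidate proteins in proportion to the current abundance estimate.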