Algorithms and Models in Nanopore DNA Sequencing

Advanced Decoding and Modeling with Hierarchical Hidden Markov Models

Time: Fri 2024-05-24 13.00

Location: F3 (Flodis), Lindstedtsvägen 26 & 28, Stockholm

Language: English

Subject area: Electrical Engineering

Doctoral student: Xuechun Xu, Teknisk informationsvetenskap, Division of Information Science and Engineering

Opponent: Professor Broňa Brejová, Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava

Supervisor: Joakim Jaldén, Teknisk informationsvetenskap, ACCESS Linnaeus Centre

QC 20240502


In less than four decades, nanopore sequencing has advanced from an implausible idea scribbled on a notebook page to one of the decisive contributors to the complete sequence of the human genome. Its rapid evolution, particularly in recent years, has been driven not only by innovation in the nanopores themselves but also by synergistic advances in complementary fields, such as GPU acceleration and deep neural networks, and by cross-disciplinary influences from domains such as speech recognition. During this rapid advancement, however, certain methods within nanopore sequencing have remained relatively unexplored, and this oversight could create bottlenecks in the technology's further development.

In this thesis, we explore these uncharted areas, seeking to fill critical gaps and extend the technology into new frontiers. Our objective is to unlock its potential, enabling further breakthroughs in genomic research and beyond.

Through our research, we have developed two novel algorithms and two models that address these under-explored aspects of nanopore sequencing. The two algorithms, the GMBS and the LFBS, both belong to the MBS group and offer solutions to the challenging decoding problems inherent in hierarchical hidden Markov models (HHMMs). They are distinct variations tailored to different scenarios: the GMBS is suited to decoding lengthy sequences, such as those encountered in long-read basecalling, whereas the LFBS is optimized for parallel programming and excels at processing short sequences.

The two models developed in this research each leverage variations of HHMMs and employ an end-to-end approach, but have distinct structures. The first model, a hybrid of an EDHMM and a deep neural network (DNN), showcases the effectiveness of integrating knowledge-driven and data-driven techniques. The second model, a custom-designed Helicase HMM, draws inspiration from pioneering studies of the motor proteins found in sequencing devices. With an elaborate hierarchical state architecture of over five million emission states, this model offers a comprehensive feature space comparable to that of its predecessor.