Chapter 1 lays the foundation for understanding computational pangenomics by highlighting its broader relevance, introducing its core concepts, and reviewing the latest developments in the field. It also situates the results of this thesis within the context of ongoing research, helping readers appreciate both the theoretical underpinnings and the practical implications of this work. Chapter 2 then provides a high-level overview of the key data structures used to represent pangenomes. Starting with the fundamentals of pairwise sequence alignment, it builds up to the construction of the most straightforward pangenome representation---the multiple sequence alignment (MSA). The chapter then explores how to transform the MSA into various pangenome representations such as the Degenerate String (D-string), Elastic Degenerate String (ED-string), and variation graph. The chapter then concludes with a discussion of the strengths, limitations, and current advancements of various pangenomic representations. Following these introductory chapters, the remainder of the thesis describes my original contributions to pangenomics. Chapter 2, associated with the list of publications one through four, describes my contribution to the first draft of the human pangenome, which is the achievement of a large multidisciplinary consortium. After briefly describing the different steps and fields of study involved in constructing the pangenome, the focus switches to cutting-edge methods towards achieving chromosome-scale pairwise sequence alignment. The latter is a fundamental step in constructing real-life complex pangenomes such as the human pangenome. In particular, the chapter describes a strategy to improve a preprocessing step to pairwise alignment, which provides a possible better estimate of finding regions of similarity within segments of a pair of genomes. Chapter 4, associated with conference publication two, details wavefront alignment, which will have been introduced in Chapter~\ref{chpt: reps}. The chapter details an extension of wavefront alignment to the alignment of a linear string to a D-string, detailing both the theoretical foundations of the method and the experimental validation of the tool developed through our implementation. In Chapter 5, associated with journal publication seven and conference publication one, the wavefront alignment paradigm is further extended to the case of the global alignment of a linear string with an ED-string. More specifically, it describes how to extend the commonly used set of edit operations that shape the \emph{edit transcript}-the sequence of operations that details the alignment-by introducing a new edit operation: the \emph{skip} operation. This somewhat fills a gap in the literature in the field of the \emph{progressive} multiple sequence alignment, that is, when an MSA is built with an iterative procedure by progressively aligning a new string onto an existing multiple alignment. This was usually done disregarding existing gaps, whereas the explicit management of skips forces the multiple sequence alignment to account for them. Chapter 6, associated with journal publication six and conference publication three as well as preprint one, transitions from string-to-pangenome comparison onto pangenome-to-pangenome comparison. Focusing on ED-strings through the lens of automata, it describes how to outdo the naive automata-based approach that would require quadratic time complexity to quickly find whether two pangenomes share a genome; this is termed as the \emph{Elastic Degenerate String Intersection} (EDSI) problem. Furthermore, we design a data structure, which we shall call the intersection graph, to enable efficient responses to a broader range of real-world queries beyond the EDSI. Using the intersection graph, we show how to quickly compute two possible notions of local \emph{matching statistics} between pangenomes that we devised, as well as many consequent measures of similarity and distance metrics between pangenomes. Finally, we shall observe a validation of the method by applying the consequent tool to a SARS-CoV-2 dataset. Finally, 7 concludes the thesis by summarizing the work presented and outlining promising directions for future research.

Matching to and between Pangenomes

MWANIKI, MOSES NJAGI
2025

Abstract

Chapter 1 lays the foundation for understanding computational pangenomics by highlighting its broader relevance, introducing its core concepts, and reviewing the latest developments in the field. It also situates the results of this thesis within the context of ongoing research, helping readers appreciate both the theoretical underpinnings and the practical implications of this work. Chapter 2 then provides a high-level overview of the key data structures used to represent pangenomes. Starting with the fundamentals of pairwise sequence alignment, it builds up to the construction of the most straightforward pangenome representation---the multiple sequence alignment (MSA). The chapter then explores how to transform the MSA into various pangenome representations such as the Degenerate String (D-string), Elastic Degenerate String (ED-string), and variation graph. The chapter then concludes with a discussion of the strengths, limitations, and current advancements of various pangenomic representations. Following these introductory chapters, the remainder of the thesis describes my original contributions to pangenomics. Chapter 2, associated with the list of publications one through four, describes my contribution to the first draft of the human pangenome, which is the achievement of a large multidisciplinary consortium. After briefly describing the different steps and fields of study involved in constructing the pangenome, the focus switches to cutting-edge methods towards achieving chromosome-scale pairwise sequence alignment. The latter is a fundamental step in constructing real-life complex pangenomes such as the human pangenome. In particular, the chapter describes a strategy to improve a preprocessing step to pairwise alignment, which provides a possible better estimate of finding regions of similarity within segments of a pair of genomes. Chapter 4, associated with conference publication two, details wavefront alignment, which will have been introduced in Chapter~\ref{chpt: reps}. The chapter details an extension of wavefront alignment to the alignment of a linear string to a D-string, detailing both the theoretical foundations of the method and the experimental validation of the tool developed through our implementation. In Chapter 5, associated with journal publication seven and conference publication one, the wavefront alignment paradigm is further extended to the case of the global alignment of a linear string with an ED-string. More specifically, it describes how to extend the commonly used set of edit operations that shape the \emph{edit transcript}-the sequence of operations that details the alignment-by introducing a new edit operation: the \emph{skip} operation. This somewhat fills a gap in the literature in the field of the \emph{progressive} multiple sequence alignment, that is, when an MSA is built with an iterative procedure by progressively aligning a new string onto an existing multiple alignment. This was usually done disregarding existing gaps, whereas the explicit management of skips forces the multiple sequence alignment to account for them. Chapter 6, associated with journal publication six and conference publication three as well as preprint one, transitions from string-to-pangenome comparison onto pangenome-to-pangenome comparison. Focusing on ED-strings through the lens of automata, it describes how to outdo the naive automata-based approach that would require quadratic time complexity to quickly find whether two pangenomes share a genome; this is termed as the \emph{Elastic Degenerate String Intersection} (EDSI) problem. Furthermore, we design a data structure, which we shall call the intersection graph, to enable efficient responses to a broader range of real-world queries beyond the EDSI. Using the intersection graph, we show how to quickly compute two possible notions of local \emph{matching statistics} between pangenomes that we devised, as well as many consequent measures of similarity and distance metrics between pangenomes. Finally, we shall observe a validation of the method by applying the consequent tool to a SARS-CoV-2 dataset. Finally, 7 concludes the thesis by summarizing the work presented and outlining promising directions for future research.
10-lug-2025
Inglese
pangenomes
graphs
D-strings
ED-strings
sequence
alignment
algorithms
Pisanti, Nadia
File in questo prodotto:
File Dimensione Formato  
PhD_Thesis.pdf

accesso aperto

Licenza: Creative Commons
Dimensione 2.42 MB
Formato Adobe PDF
2.42 MB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/310179
Il codice NBN di questa tesi è URN:NBN:IT:UNIPI-310179