Bioinformatics and Machine Learning tools for personalised medicine: variant calling pipelines and Dynamic Bayesian Networks
HAZIZAJ, ENIDIA
2025
Abstract
This thesis is divided into two main sections: the first focuses on bioinformatic variant calling pipelines, while the second explores machine learning tools. The common thread linking the two sections is the shared goal of developing personalised treatment strategies.
First section
Because many disorders are caused by genetic variants, accurately identifying mutational patterns within the human genome is essential in clinical settings. In the past few years, remarkable progress has been made in developing and optimising bioinformatic pipelines for variant calling. The rapidly decreasing cost of next-generation sequencing (NGS) technologies and the availability of gold-standard sample datasets have enabled both the creation of cutting-edge tools for detecting a wide range of genomic aberrations and the large-scale evaluation of their performance. Despite the current technological limitations of the main sequencing platforms, including sequencing errors and uneven coverage, germline variant calling has reached a level of performance that firmly establishes its use in clinical decision-making for inherited disorders.
Somatic variant calling, by contrast, is more complex than germline variant calling, especially in cancer genetics: cancer genomes exhibit significant heterogeneity, and there is no comprehensive, open-access dataset of gold-standard tumoral samples for benchmarking and optimising somatic variant calling methods. Nevertheless, somatic variant detection has already proved crucial when addressing specific tumours, and personalised medicine approaches have achieved notable successes in preventing and diagnosing tumours, as well as in developing targeted therapies.
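As a rough illustration of what benchmarking a variant caller against a gold-standard sample involves, the sketch below compares a set of called variants with a truth set and reports precision, recall and F1. It is a minimal example under simplifying assumptions, not the method used in the thesis: the `Variant` type, the `benchmark` function and the toy coordinates are hypothetical, and variants are matched only by exact chromosome, position and alleles, whereas real benchmarking tools also normalise variant representation.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record, matched by exact equality."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark(truth: set[Variant], calls: set[Variant]) -> dict[str, float]:
    """Compare called variants with a gold-standard truth set."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
truth = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2000, "C", "T"),
    Variant("chr2", 500, "G", "A"),
    Variant("chr3", 42, "T", "C"),
}
calls = {
    Variant("chr1", 1000, "A", "G"),
    Variant("chr1", 2000, "C", "T"),
    Variant("chr2", 750, "T", "C"),  # false positive
}

metrics = benchmark(truth, calls)
print(metrics)
```

With a synthetic tumour sample, the truth set would be the list of variants inserted by the simulator, which is what makes such a comparison possible when no real gold-standard tumoral sample is available.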
To benchmark somatic variant calling pipelines and optimise their performance across different tumoral scenarios, a reliable dataset of gold-standard samples is therefore needed, one that accounts for both the biological complexity of tumoral genomes and the technological limitations of NGS. The first chapter of this thesis is divided into three main sections. The first provides a brief background on the different sequencing platforms and on the steps of bioinformatic analysis used to process sequencer output files in order to detect the presence of genetic variants. The second presents an example of a bioinformatic pipeline, developed with the Nextflow framework, that can detect both germline and somatic variants. The third details the development of a meta-simulator designed to generate synthetic tumour data, which can be used to validate and optimise somatic variant calling pipelines.
Second section
Clinical medicine and public health are undergoing a significant transformation, driven by the digitisation of medical records and the growth in data collected from health registries and clinical studies, which enables precise risk prediction and tailored intervention selection. Artificial intelligence (AI) and machine learning (ML) methods can describe the disease process and make predictions, and can also support personalised care approaches tailored to individual patient characteristics. In clinical decision support (CDS), ML tools help clinicians make more informed treatment decisions that take into account known treatment plans and patient outcomes. However, medical datasets are often incomplete and noisy, introducing substantial uncertainty into the data processing and analysis phases. Bayesian networks (BNs) are a powerful knowledge representation and machine learning technique for risk prediction that offers a structured approach to managing uncertainty.

File | Size | Format | |
---|---|---|---|
Tesi_Enidia_Hazizaj.pdf (open access) | 5.47 MB | Adobe PDF | |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/196579
URN:NBN:IT:UNIPD-196579