Bioinformatics and Machine Learning tools for personalised medicine: variant calling pipelines and Dynamic Bayesian Networks

Hazizaj, Enidia

This thesis is divided into two main sections: the first focuses on bioinformatic variant calling pipelines, while the second explores machine learning tools. The common thread linking the two sections is the shared goal of developing personalised treatment strategies.

First section

Because many disorders are caused by genetic variants, accurately identifying mutational patterns within the human genome is essential in clinical settings. In the past few years, remarkable progress has been made in developing and optimising bioinformatic pipelines for variant calling. The rapidly decreasing cost of next-generation sequencing (NGS) technologies and the availability of gold-standard sample datasets have enabled the creation of cutting-edge tools for detecting a wide range of genomic aberrations, and the evaluation of their performance at large scale. Despite the technological limitations of the main sequencing platforms, including sequencing errors and uneven coverage, germline variant calling has reached a level of performance that consolidates its use for clinical decision-making in inherited disorders. Somatic variant calling, by contrast, introduces additional complexities, especially in cancer genetics: cancer genomes exhibit significant heterogeneity, and there is no comprehensive, open-access dataset of gold-standard tumour samples for benchmarking and optimising somatic variant calling methods. Nevertheless, somatic variant detection has already proved crucial for specific tumours, and personalised medicine approaches have achieved notable successes in preventing and diagnosing tumours, as well as in developing targeted therapies. Therefore, to benchmark somatic variant calling pipelines and optimise their performance across different tumour scenarios, a reliable dataset of gold-standard samples is needed, one that accounts for both the biological complexity of tumour genomes and the technological limitations of NGS.

The first chapter of this thesis is divided into three main sections. The first section provides a brief biological background on the different sequencing platforms and on the steps of bioinformatic analysis used to process sequencer output files in order to detect the presence of genetic variants. The second section presents an example of a bioinformatic pipeline developed using the Nextflow framework, which can be employed to detect both germline and somatic variants. Finally, the third section details the development of a meta-simulator designed to generate synthetic tumour data, which can be used to validate and optimise somatic variant calling pipelines.

Second section

The fields of clinical medicine and public health are undergoing a significant transformation driven by the digitisation of medical records and the growth in data collected from health registries and clinical studies, which enable more precise risk prediction and tailored intervention selection. AI and ML methods can describe the disease process and make predictions, and can also support personalised care approaches tailored to individual patient characteristics. In clinical decision support (CDS), ML tools help clinicians make more informed treatment decisions on the basis of known treatment plans and patient outcomes. However, medical datasets are often incomplete and noisy, introducing substantial uncertainty in the data processing and analysis phases. Bayesian networks (BNs) are a powerful knowledge representation and machine learning technique for risk prediction that offers a structured approach to managing uncertainty.

2025


Degree programme	SCIENZE FARMACOLOGICHE
Publication date	21 January 2025
Language	English
Supervisor, Advisor or Tutor	FERRI, NICOLA
Publisher	Università degli studi di Padova
Collection	Università degli Studi di Padova

Files in this item:

File	Size	Format
Tesi_Enidia_Hazizaj.pdf (open access; licence: all rights reserved)	5.47 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/196579

The NBN code of this thesis is URN:NBN:IT:UNIPD-196579