In recent years, Machine Learning (ML) has emerged as a powerful tool in omics data analysis, particularly for uncovering complex biological mechanisms in diseases such as Duchenne Muscular Dystrophy (DMD). This thesis presents two distinct research contributions within the broader context of omics. The first contribution benchmarks ML algorithms in Genome-Wide Association Studies (GWAS) for DMD. It employs a time-to-event phenotype to improve the detection of genetic variants associated with age at loss of ambulation (age-at-LoA) in patients with DMD. We applied various ML techniques to GWAS data from a cohort of 500 patients with DMD collected from different academic neuromuscular centers across Italy. Given the nested structure of patients within centers, we performed multilevel analysis to capture the potential effect of the center on age-at-LoA. Classical survival analysis, such as the Cox Proportional Hazards (Cox PH) model, was also used. We evaluated the predictive performance and effectiveness of the different ML and classical statistical methods by comparing our findings with a proven Single Nucleotide Polymorphism (SNP) associated with age-at-LoA in patients with DMD. The Cox PH and multilevel models showed that conventional statistical analyses were acceptably effective. Among the ML algorithms, however, LASSO was able to identify the proven locus associated with disease progression in DMD patients. Our results highlight the weaknesses of ML methods when the data have a complex architecture, as genomic data do, particularly with the small samples typical of rare diseases. The second contribution focuses on transcriptomics, where Pathway Enrichment Analysis (PEA) is employed to investigate the possible overrepresentation of COVID-19-related genes. We performed an in-depth analysis of pathways associated with COVID-19 using MeSH Disease terms from 2018.
Several pathway enrichment analyses were conducted across different databases, identifying significantly enriched COVID-related pathways based on p-values adjusted with multiple correction methods (adjusted p < 0.01). Comparative analyses of the adjusted p-values were carried out within each database. We also assessed the performance of various p-value correction techniques in transcriptomics by comparing their effects across databases. Our results showed that over 20% of MeSH Disease terms in each database were incorrectly associated with COVID-related pathways. Furthermore, the influence of p-value correction methods varied between databases. This study underscores a significant issue in pathway enrichment analysis methods, revealing the risk of false discoveries and incorrect associations between MeSH Disease terms and COVID-related pathways. Our findings challenge the assumption that a single p-value correction method is always optimal for transcriptomics across all databases.
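To illustrate why the choice of p-value correction can change which pathways pass the adjusted p < 0.01 threshold, the following is a minimal sketch and not the thesis's actual pipeline. The abstract does not name the correction methods that were compared, so Bonferroni and Benjamini-Hochberg are assumed here purely as two common examples, and the pathway names and raw p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjustment, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(names, adjusted, alpha=0.01):
    """Names whose adjusted p-value is strictly below alpha."""
    return [n for n, p in zip(names, adjusted) if p < alpha]


# Hypothetical raw enrichment p-values (illustrative only, not thesis data).
names = ["pathway_A", "pathway_B", "pathway_C", "pathway_D", "pathway_E"]
raw = [0.0001, 0.003, 0.004, 0.03, 0.5]

bonf_hits = significant(names, bonferroni(raw))
bh_hits = significant(names, benjamini_hochberg(raw))

print("Bonferroni:", bonf_hits)
print("Benjamini-Hochberg:", bh_hits)
```

On these made-up inputs the stricter Bonferroni adjustment retains fewer pathways than Benjamini-Hochberg, which is the kind of method-dependent difference the comparison across databases examines.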
Machine Learning for Omics Data Analysis in Neuroscience (Duchenne Muscular Dystrophy)
AHSANINASAB, SARA
2025
File | Size | Format
---|---|---
Thesis_Sara_Ahsaninasab_PDF-A.pdf (under embargo until 12/03/2028) | 1.35 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/208369
URN:NBN:IT:UNIPD-208369