Machine Learning Methods for Feature Selection and Rule Extraction in Genome-wide Association Studies (GWASs).

NEMBRINI, STEFANO
2013

Abstract

Tree-based ensembles have recently gained popularity in genome-wide association studies (GWASs) because they are particularly well suited to discovering interactions and non-linear effects. In this context, it is important both to identify a small subset of single-nucleotide polymorphisms (SNPs) associated with the outcome of interest (feature selection phase) and to provide interpretable results (rule extraction phase). Using a dataset of 300K SNPs from a previous study, we propose a two-stage feature selection method based on Random Forests to select SNPs relevant to the prediction of the estimated glomerular filtration rate (eGFR). The work focuses on the application of Random Forest to extremely large datasets, along with three different wrappers around the Random Forest algorithm. The results of this analysis are compared with the findings of the original GWA study and show some overlap. Moreover, additional SNPs have been identified as potentially associated with the outcome. Subsequently, to overcome the limitations of black-box models, we carry out a rule extraction phase to obtain a clear model for interpretation purposes.
21-mar-2013
English
machine learning, genome-wide association studies, feature selection, rule extraction
Università degli Studi di Milano-Bicocca
Files in this item:
File: phd_unimib_734383.pdf
Access: Open Access since 19/03/2016
Size: 1.67 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/172229
The NBN code of this thesis is URN:NBN:IT:UNIMIB-172229