Machine Learning Methods for Feature Selection and Rule Extraction in Genome-wide Association Studies (GWASs).
NEMBRINI, STEFANO
2013
Abstract
Tree-based ensembles have recently gained popularity in genome-wide association studies (GWASs) because they are particularly suited to discovering interactions and non-linear effects. In this context, it is especially important to identify a small subset of single-nucleotide polymorphisms (SNPs) associated with the outcome of interest (feature selection phase), as well as to provide interpretable results (rule extraction phase). Using a dataset of 300K SNPs from a previous study, we propose a two-stage feature selection method based on Random Forests to select SNPs relevant to the prediction of the estimated glomerular filtration rate (eGFR). The work focuses on the application of Random Forest to extremely large datasets, along with three different wrappers around the Random Forest algorithm. The results of this analysis are compared with the findings of the original GWA study and show some overlap. Moreover, additional SNPs have been identified as potentially associated with the outcome. Subsequently, to overcome the limitations of black-box models, we carry out a rule extraction phase so as to obtain a clear model for interpretation purposes.

File | Size | Format
---|---|---
phd_unimib_734383.pdf (Open Access from 19/03/2016) | 1.67 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/172229
URN:NBN:IT:UNIMIB-172229