Machine Learning Methods for Feature Selection and Rule Extraction in Genome-wide Association Studies (GWASs).

Nembrini, Stefano

Tree-based ensembles have recently gained popularity in genome-wide association studies (GWASs) because they are particularly well suited to discovering interactions and non-linear effects. In this context, it is important to identify a small subset of single-nucleotide polymorphisms (SNPs) associated with the outcome of interest (feature selection phase), as well as to provide results that are interpretable (rule extraction phase). Using a dataset of 300K SNPs from a previous study, we propose a two-stage feature selection method based on Random Forests to select SNPs relevant to the prediction of the estimated glomerular filtration rate (eGFR). The work focuses on the application of Random Forests to extremely large datasets, along with three different wrappers around the Random Forest algorithm. The results of this analysis are compared with the findings of the original GWA study and show some overlap. Moreover, additional SNPs have been identified as potentially associated with the outcome. Subsequently, to overcome the limitations of black-box models, we carry out a rule extraction phase so as to obtain a clear model for interpretation purposes.

2013

Degree programme	STATISTICA ED APPLICAZIONI - 62R
Publication date	21 March 2013
Language	English
Keywords	machine learning, genome-wide association studies, feature selection, rule extraction
Publisher	Università degli Studi di Milano-Bicocca
Collection	Università degli Studi di Milano-Bicocca

Files in this item:

File	Size	Format
phd_unimib_734383.pdf (Open Access from 19 March 2016)	1.67 MB	Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/172229

The NBN code of this thesis is URN:NBN:IT:UNIMIB-172229