Enhancing functional insights through robust and interpretable feature selection in multi-omics data integration
ROSSO, ELENA
2025
Abstract
High-throughput sequencing now enables comprehensive multi-omics profiling of disease, capturing genomic, transcriptomic and functional alterations within the same samples. These datasets are rich in information but also high-dimensional, noisy and heterogeneous, which complicates their integration and the extraction of interpretable, clinically meaningful markers. In this context, robust feature selection is essential to reduce complexity, preserve predictive accuracy and highlight biologically relevant signals. This thesis introduces Rank’n’Select, a structured workflow for feature selection in supervised classification with multi-omics data. The workflow combines multiple information-theoretic filter methods with repeated resampling to obtain stable rankings of features, which are then synthesised within a model-based statistical framework. Features are grouped into tiers of decreasing relevance, and progressively larger subsets are used to train classification models. Performance is assessed as the feature set expands, and an automated stopping rule identifies a compact subset that maintains accuracy while limiting model complexity. The properties of Rank’n’Select are first evaluated on simulated data designed to mirror key challenges of multi-omics studies, including limited sample sizes, complex correlation structures and the coexistence of informative and noisy variables. These experiments show that the workflow can recover most truly informative features, limit redundancy and maintain robust performance even in demanding scenarios. Finally, Rank’n’Select is applied to a colorectal cancer case study from the ONCOBIOME project, integrating multiple molecular layers and immune-related information across controls, polyps and tumours. Across several classification tasks, the workflow yields compact panels of features that accurately discriminate between clinical groups and align with current biological knowledge.

| File | Size | Format |
|---|---|---|
| Tesi-Rosso-Elena.pdf (open access; licence: all rights reserved) | 6.64 MB | Adobe PDF |
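The workflow summarised in the abstract (filter scores computed over repeated resamples, an aggregated ranking, tiers of features, and a stopping rule on classification performance) can be illustrated with a minimal sketch. This is not the thesis implementation. Everything here is an assumption made for illustration: the two filters (binned mutual information and symmetric uncertainty), the plain mean-rank aggregation standing in for the model-based framework, the fixed tier size, the tolerance-based stopping rule, the nearest-centroid classifier, the synthetic data, and all function names.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def discretise(column, bins):
    """Equal-frequency binning: map each value to its quantile bin index."""
    order = sorted(range(len(column)), key=column.__getitem__)
    out = [0] * len(column)
    for rank, i in enumerate(order):
        out[i] = rank * bins // len(column)
    return out

def mutual_info(column, y, bins=2):
    """Filter 1 (assumed): mutual information between binned feature and label."""
    x = discretise(column, bins)
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetric_uncertainty(column, y, bins=3):
    """Filter 2 (assumed): mutual information normalised by the two entropies."""
    x = discretise(column, bins)
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

def mean_ranks(X, y, filters, n_resamples, rng):
    """Average rank of each feature over all filters and random 80% subsamples."""
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(n_resamples):
        idx = rng.sample(range(n), n * 4 // 5)
        ys = [y[i] for i in idx]
        for f in filters:
            scores = [f([X[i][j] for i in idx], ys) for j in range(p)]
            order = sorted(range(p), key=lambda j: -scores[j])
            for rank, j in enumerate(order):
                total[j] += rank
    return [t / (n_resamples * len(filters)) for t in total]

def centroid_accuracy(Xtr, ytr, Xval, yval, feats):
    """Validation accuracy of a nearest-centroid classifier on the given features."""
    cents = {}
    for label in set(ytr):
        rows = [x for x, t in zip(Xtr, ytr) if t == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    hits = 0
    for x, t in zip(Xval, yval):
        pred = min(cents, key=lambda c: sum((x[j] - m) ** 2
                                            for j, m in zip(feats, cents[c])))
        hits += pred == t
    return hits / len(yval)

def select(Xtr, ytr, Xval, yval, ranks, tier_size=3, tol=0.01):
    """Add tiers in rank order; stop when a tier improves accuracy by less than tol."""
    order = sorted(range(len(ranks)), key=ranks.__getitem__)
    tiers = [order[i:i + tier_size] for i in range(0, len(order), tier_size)]
    chosen, best = [], 0.0
    for tier in tiers:
        acc = centroid_accuracy(Xtr, ytr, Xval, yval, chosen + tier)
        if chosen and acc < best + tol:
            break
        chosen, best = chosen + tier, acc
    return chosen, best

# Synthetic data: the first k features are shifted in class 1, the rest are noise.
rng = random.Random(0)
n, p, k = 120, 30, 3
y = [i % 2 for i in range(n)]
X = [[rng.gauss(2.5 * y[i] if j < k else 0.0, 1.0) for j in range(p)]
     for i in range(n)]
Xtr, ytr, Xval, yval = X[:80], y[:80], X[80:], y[80:]

ranks = mean_ranks(Xtr, ytr, [mutual_info, symmetric_uncertainty], 20, rng)
selected, acc = select(Xtr, ytr, Xval, yval, ranks)
print(sorted(selected), round(acc, 2))
```

The sketch uses a single validation split for the stopping rule to stay short. A realistic evaluation would nest the ranking and stopping steps inside cross-validation, so that the reported accuracy is not biased by the selection itself.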
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/352713
URN:NBN:IT:UNITO-352713