Feature selection by Information Imbalance optimization: Clinics, molecular modeling and ecology
WILD, ROMINA
2024
Abstract
Current feature selection methods are often limited by the types and dimensions of data they can handle. Supervised methods in particular are rigid regarding their target space, typically requiring it to be one-dimensional and of a specific type (e.g. continuous or categorical). This thesis introduces feature selection methods that mitigate these limitations using a statistic called the Information Imbalance. By comparing nearest-neighbor ranks, the Information Imbalance identifies a low-dimensional subset of input features that best preserves the pairwise distance relations found in the target feature space. First, we derive a weighted Information Imbalance approach to handle class-imbalanced medical data, along with an optimization routine capable of managing missing data. We showcase this approach in a study on COVID-19 severity prediction, where it isolates a 13-feature subset from a pool of roughly 150 features. This subset outperforms those chosen by traditional feature selection methods in subsequent predictions of patient severity. We then introduce an Information Imbalance variant that can handle binary and categorical data, and benchmark it on Amazon Rainforest biodiversity data. By quantifying the relative information content of continuous features, such as average temperature, and categorical features, such as the label of the region in which the data are recorded, this method identifies plausible predictors of species richness and detects asymmetric information even between variables that are not correlated. Finally, we introduce a differentiable variant of the Information Imbalance, implemented in the easy-to-use Python package DADApy. The Differentiable Information Imbalance (DII) optimizes relative feature weights via gradient descent, addressing the combinatorial challenges of high-dimensional data. The weights correct for different units of measure and relative importance, and they allow for feature selection through sparsity-inducing optimization approaches.
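As an illustration of the rank-based idea described above, the following is a minimal NumPy sketch of the plain (unweighted, non-differentiable) Information Imbalance. It assumes the standard definition, in which the imbalance from space A to space B is 2/N times the average rank in B of each point's nearest neighbor in A. The function names `neighbor_ranks` and `information_imbalance` are hypothetical and are not the DADApy API; Euclidean distances and the synthetic data are likewise assumptions made for the example.

```python
import numpy as np


def neighbor_ranks(features):
    """Return ranks[i, j]: rank of point j among the neighbors of point i.

    Rank 0 is the point itself, rank 1 its nearest neighbor, and so on.
    Euclidean distance is an assumption of this sketch.
    """
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features[:, None]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, -1.0)  # force each point to sort first in its own row
    order = np.argsort(dist, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(features.shape[0])[:, None]
    # Invert the sorting permutation row by row to obtain ranks.
    ranks[rows, order] = np.arange(features.shape[0])[None, :]
    return ranks


def information_imbalance(space_a, space_b):
    """Imbalance from A to B: 2/N times the mean rank in B of the nearest neighbor in A.

    Values near 0 suggest that A predicts the neighborhoods of B well;
    values near 1 suggest that A carries little information about B.
    """
    ranks_a = neighbor_ranks(space_a)
    ranks_b = neighbor_ranks(space_b)
    n = ranks_a.shape[0]
    nearest_in_a = np.argmax(ranks_a == 1, axis=1)  # index of the rank-1 neighbor in A
    return 2.0 / n * ranks_b[np.arange(n), nearest_in_a].mean()


# Synthetic example: the full 2D space versus its first coordinate only.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
delta_full_to_partial = information_imbalance(x, x[:, :1])
delta_partial_to_full = information_imbalance(x[:, :1], x)
print(delta_full_to_partial, delta_partial_to_full)
```

In this toy setting the full space is expected to predict the one-coordinate space much better than the reverse, which is the kind of asymmetry the statistic is designed to expose. Feature selection then amounts to searching for the input subset with the lowest imbalance towards the target space.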
In molecular dynamics simulations, DII reduces the feature set to three collective variables that effectively describe a beta-hairpin peptide. In a second application, to machine learning potentials, DII compresses the input feature space, reducing run time while preserving accuracy.

File | Size | Format
---|---|---
PhD_Thesis_Romina_Wild_2024.pdf (open access) | 22.36 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/183923
URN:NBN:IT:SISSA-183923