Advances in selective inference: from hierarchical conformal prediction to high-dimensional post-hoc methodologies with applications to scRNA-seq data
CORBETTA, DANIELA
2026
Abstract
High-dimensional data are at the core of many areas of modern research, with single-cell RNA sequencing (scRNA-seq) providing a prominent example. While these data hold great promise for uncovering biological mechanisms, they also raise major statistical challenges, particularly in uncertainty quantification and multiple testing. This thesis develops new methodological contributions at the intersection of conformal prediction, multiple hypothesis testing, and high-dimensional selective inference. First, we address the problem of cell type annotation in scRNA-seq data. Building on conformal prediction, we introduce a graph-structured approach that incorporates prior knowledge from the Cell Ontology. This method produces valid prediction sets that are more interpretable than those of standard conformal classifiers, and it has been implemented in the open-source R package scConform. Second, we investigate simultaneous inference for high-dimensional variable selection. We provide the first systematic comparison of two recent knockoff-based approaches, unifying them within a common framework and conducting extensive simulations. Our results show that neither method dominates across all scenarios, but that each offers complementary advantages depending on signal sparsity and the size of rejection sets. Finally, we extend the flipscore test, a nonparametric score-based test, to high-dimensional settings by introducing preliminary variable selection strategies. We establish the theoretical validity of our approaches and demonstrate their competitive empirical performance through simulation studies. Overall, the thesis shows that rigorous inference procedures can be designed to balance statistical validity with interpretability and applicability in complex data environments. Although motivated by scRNA-seq, the proposed methodologies are broadly relevant to high-dimensional inference across genomics and beyond.

| File | Size | Format |
|---|---|---|
| tesi_Daniela_Corbetta.pdf (open access; license: all rights reserved) | 2.63 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/354631
URN:NBN:IT:UNIPD-354631