Advances in selective inference: from hierarchical conformal prediction to high-dimensional post-hoc methodologies with applications to scRNA-seq data

CORBETTA, DANIELA
2026

Abstract

High-dimensional data are at the core of many areas of modern research, with single-cell RNA sequencing (scRNA-seq) providing a prominent example. While these data hold great promise for uncovering biological mechanisms, they also raise major statistical challenges, particularly in uncertainty quantification and multiple testing. This thesis develops new methodological contributions at the intersection of conformal prediction, multiple hypothesis testing, and high-dimensional selective inference. First, we address the problem of cell type annotation in scRNA-seq data. Building on conformal prediction, we introduce a graph-structured approach that incorporates prior knowledge from the Cell Ontology. This method produces valid prediction sets that are more interpretable than those of standard conformal classifiers, and it has been implemented in the open-source R package scConform. Second, we investigate simultaneous inference for high-dimensional variable selection. We provide the first systematic comparison of two recent knockoff-based approaches, unifying them within a common framework and conducting extensive simulations. Our results show that neither method dominates across all scenarios, but that each offers complementary advantages depending on signal sparsity and the size of rejection sets. Finally, we extend the flipscore test, a nonparametric score-based test, to high-dimensional settings by introducing preliminary variable selection strategies. We establish the theoretical validity of our approaches and demonstrate their competitive empirical performance through simulation studies. Overall, the thesis shows that rigorous inference procedures can be designed to balance statistical validity with interpretability and applicability in complex data environments. Although motivated by scRNA-seq, the proposed methodologies are broadly relevant to high-dimensional inference across genomics and beyond.
16 Jan 2026
English
RISSO, DAVIDE
Università degli studi di Padova
Files in this item:
File: tesi_Daniela_Corbetta.pdf

open access

License: All rights reserved
Size: 2.63 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/354631
The NBN code of this thesis is URN:NBN:IT:UNIPD-354631