Reproducible Computational Frameworks for Single-Cell and Long-Read Functional Genomics

RATTO, MARIA LUISA
2026

Abstract

Genomic and transcriptomic technologies have advanced rapidly in recent years, becoming essential tools for modern biological and medical research. As data volume and complexity continue to grow, there is an increasing demand for computational pipelines that are reproducible, accessible, and rigorously tested, while remaining aligned with the latest experimental methodologies. In this work, I addressed several open challenges in contemporary bioinformatics by developing reproducible computational frameworks spanning diverse techniques and biological questions. To meet the need for benchmarking in single-cell RNA sequencing, I systematically compared multiple clustering algorithms using a custom dataset specifically designed to model cancer heterogeneity in a controlled environment. Such benchmarking on labeled data is essential to ensure robust and generalizable tool development. Building on this experience, I contributed to the development of iPS2-seq (iPS-optimized inducible Post-transcriptional Silencing in pool deconvoluted by single-cell sequencing), a novel method enabling clonal, single-cell-resolved gene perturbation screens applicable to human pluripotent stem cell-derived lineages. Within this framework, I designed catcheR (clonality and treatment-controlled shRNA effect findeR), a reproducible and user-friendly data analysis pipeline. Implemented as a Dockerized R package, catcheR performs quality control and filtering to achieve reliable perturbation assignment, followed by dimensionality reduction, clustering, and annotation via Monocle3. It quantifies how gene perturbations affect transcriptional modules, pseudotime trajectories, and population shifts, generating publication-ready plots and statistics. Finally, I explored telomere biology in ALT+ sarcomas, exploiting long-read sequencing to characterize telomeric repeats and telomere insertions, repetitive genomic features that have long posed challenges to standard analyses. By assembling extended consensus sequences containing telomeric sequences, I both quantified repeat content and mapped telomeric insertions to specific genomic regions, revealing significant overlap with structural variants and extrachromosomal DNA. Overall, this thesis demonstrates how reproducible computational frameworks can bridge diverse experimental contexts, from single-cell transcriptomics to long-read genomics, advancing both methodological rigor and functional discovery in modern bioinformatics.
3 Feb 2026
English
CALOGERO, Raffaele Adolfo
Università degli Studi di Torino
Files in this item:
File: Tesi-Ratto-MariaLuisa.pdf
Embargoed until 3 February 2027
License: All rights reserved
Size: 17.05 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357245
The NBN code of this thesis is URN:NBN:IT:UNITO-357245