Development of RNAcall, a Nextflow-based pipeline for somatic variant calling from RNA-seq data and its application in a multi-omics study on T-Cell Prolymphocytic Leukemia
ORSI, SILVIA
2025
Abstract
Multi-omics approaches integrate various biological data layers, including genomics, transcriptomics, and proteomics, to provide a comprehensive understanding of cellular functions and disease mechanisms. Transcriptome data play a pivotal role in multimodal data integration because they capture not only the dynamic expression profiles of genes and non-coding RNAs but also somatic variants, disclosing several levels of biological information from a single source. To exploit this, RNAcall was developed: a pipeline for detecting bona fide somatic variants from paired-end Illumina RNA sequencing (RNA-seq) data, which uses Nextflow for workflow management and Docker for containerization. RNAcall aligns reads with STAR in two-pass mode, applies SplitNCigar to split reads that span splice junctions, uses RNAIndel, a tool optimized for detecting InDels in RNA-seq data, and annotates variants that occur at editing sites and in splicing regions. Filtering parameters were systematically benchmarked across three cancer types to optimize the overlap between RNA-seq variants and those identified from Whole Exome Sequencing (WES) data, regarded as ground truth. Removing variants located at editing sites, at splice junctions or within five bases of them, and variants with allele quality issues in the RNA-seq data did not significantly reduce the number of variants detected by both methods, while it improved the performance metrics of the pipeline. These variants were therefore classified as spurious, and their exclusion is suggested as a default filter for RNAcall results. A comparison of variant allele frequencies (VAF) between WES and RNA-seq showed that VAF is overestimated in RNA-seq, likely because of allele-specific expression.
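The benchmarking logic described above can be illustrated with a minimal Python sketch. It is not RNAcall's actual implementation: the `Variant` class, its three flag names, the `default_filter` and `benchmark` helpers, and the toy variants are all hypothetical and were invented for illustration. The sketch only shows how dropping flagged calls can raise precision against a WES truth set without losing shared variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    # Hypothetical annotation flags, assumed to be set upstream of this step.
    at_editing_site: bool = False
    near_splice_junction: bool = False  # at a junction or within five bases
    allele_quality_issue: bool = False

    @property
    def key(self):
        """Identity used to match a call between RNA-seq and WES."""
        return (self.chrom, self.pos, self.ref, self.alt)


def default_filter(variants):
    """Drop calls carrying any of the three 'spurious' annotations."""
    return [
        v for v in variants
        if not (v.at_editing_site or v.near_splice_junction or v.allele_quality_issue)
    ]


def benchmark(rna_variants, wes_keys):
    """Compare RNA-seq calls with a WES truth set of (chrom, pos, ref, alt) keys."""
    rna_keys = {v.key for v in rna_variants}
    tp = len(rna_keys & wes_keys)   # called by both methods
    fp = len(rna_keys - wes_keys)   # RNA-seq only
    fn = len(wes_keys - rna_keys)   # WES only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
rna_calls = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T", at_editing_site=True),
    Variant("chr2", 300, "G", "A", near_splice_junction=True),
    Variant("chr3", 400, "T", "C"),
]
wes_truth = {
    ("chr1", 100, "A", "G"),
    ("chr3", 400, "T", "C"),
    ("chr4", 500, "G", "T"),
}

before = benchmark(rna_calls, wes_truth)
after = benchmark(default_filter(rna_calls), wes_truth)
print("unfiltered:", before)
print("filtered:  ", after)
```

In this toy case the number of shared variants stays at two while precision rises from 0.5 to 1.0, mirroring the qualitative behavior reported for the default filter. Real data would be read from annotated VCF files rather than built by hand.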
Motivated by the need to elucidate the molecular characteristics of T-cell prolymphocytic leukemia (T-PLL), a rare and aggressive malignancy, a multi-omics study was established. In this study, RNA-seq profiling of 10 T-PLL samples and of CD4+ cells from 5 healthy donors allowed us to report gene expression, non-coding, antisense and circular RNA expression, as well as genomic variants. Gene expression dysregulation in T-PLL indicated activation of several oncogenic pathways, together with suppression of healthy T-cell activities and escape from multiple cell death mechanisms. Notably, tumor suppressor lncRNAs (NEAT1, MIAT and LUCAT1) showed reduced expression, whereas several oncogenic, pro-proliferative lncRNAs (FIRRE, TERC, XIST and PVT1) were upregulated. This first account of circRNAome dysregulation in T-PLL, which shows ectopic expression of oncogenic circRNAs (circPVT1, circFKBP5, circFIRRE), could open new lines of investigation. Furthermore, we focused on five genes (STAT5B, JAK3, ATM, KMT2C, and ARID1A) carrying validated oncogenic variants that were recurrent in our cohort and had previously been detected via RNA-seq variant calling. We investigated genotype/phenotype relations in T-PLL using a multiple-predictor linear model to define the link between the five driver variants and alterations of gene and circRNA expression profiles. Overall, our pilot study disclosed three different expression profiles linked to the mutational status of the patients: mutations of STAT5B, JAK3, and KMT2C were linked to similar expression patterns, ATM mutations to mild changes, and ARID1A mutations to a distinct profile. Genes whose expression appears to be altered specifically in association with each of these lesions may be therapeutic targets and deserve further investigation.

File | Size | Format |
---|---|---|
tesi_definitiva_Silvia_Orsi.pdf (embargoed until 27/02/2028) | 5.79 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/219284
URN:NBN:IT:UNIPD-219284