A Distribution-Based Model for Single-Cell Transcriptomics

Lazzardi, Silvia

Single-cell transcriptomics has recently made it possible to uncover the diversity of gene expression among individual cells. However, its ability to identify genuine biological heterogeneity is challenged by the high level of noise affecting these data, which makes it essential to disentangle technical and unwanted biological noise from the biological variability of interest. Tractable statistical models are useful tools for separating this variability from general statistical effects that arise from the sampling process inherent to the experimental technique. The ultimate objective is to discern significant signals from stochastic biological noise. In this thesis, we first exploit the structural similarity between these expression datasets and several other complex systems that can be described through the statistics of their basic components. Transcriptomes of single cells can in fact be seen as collections of messenger RNA abundances transcribed from a common set of genes, just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identified several emergent statistical laws in single-cell transcriptomic data that closely resemble regularities found in linguistics, ecology, and genomics. We demonstrated that a simple mathematical framework can be employed to analyze the relationships between these laws and the potential mechanisms behind their ubiquity. Taking a step further, we introduced a source of biological variation into our model, assuming a gamma distribution to describe stochastic gene expression.
This null model represents a novel approach to selecting putative key genes for downstream analysis without requiring any modification of the original expression values in the count matrix. Identifying genes that deviate significantly from this distribution, particularly those exhibiting multimodal behavior, offers valuable insights into tissue heterogeneity and provides an alternative to highly variable genes in feature selection procedures. Moreover, the number of genes displaying significant deviations indicates the dataset's complexity in terms of heterogeneity within cell populations.
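
As a rough illustration of the idea summarized above, the following minimal sketch shows one possible way a gamma-based null model could flag genes. It is not the procedure used in the thesis. It assumes that counts arise from Poisson sampling of gamma-distributed expression levels (a gamma-Poisson, i.e. negative binomial, mixture), that all cells have the same sequencing depth, and that the parameters are fitted by the method of moments. The function names `nb_zero_fraction` and `zero_excess` and the simulated data are hypothetical.

```python
import numpy as np


def nb_zero_fraction(mean, var):
    """Expected fraction of zero counts under a gamma-Poisson null.

    The gamma shape is fitted by the method of moments from
    var = mean + mean**2 / shape. Where the variance does not exceed
    the mean, the pure Poisson value exp(-mean) is used instead.
    """
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    excess = var - mean
    poisson_p0 = np.exp(-mean)
    with np.errstate(divide="ignore", invalid="ignore"):
        shape = mean**2 / excess
        nb_p0 = (shape / (shape + mean)) ** shape
    return np.where(excess > 0, nb_p0, poisson_p0)


def zero_excess(counts):
    """Observed minus expected zero fraction for each gene.

    counts: array of raw counts with shape (cells, genes).
    Large positive values mark genes that the unimodal null explains
    poorly, for example genes switched off in a subpopulation.
    """
    counts = np.asarray(counts)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    observed = (counts == 0).mean(axis=0)
    return observed - nb_zero_fraction(mean, var)


# Simulated example (hypothetical data, assumptions as stated above).
rng = np.random.default_rng(0)
n_cells = 5000
# Gene 0: unimodal, gamma-distributed expression followed by Poisson sampling.
unimodal = rng.poisson(rng.gamma(shape=2.0, scale=2.5, size=n_cells))
# Gene 1: bimodal, expressed in roughly half of the cells only.
on = rng.random(n_cells) < 0.5
bimodal = rng.poisson(np.where(on, 20.0, 0.0))
excess = zero_excess(np.column_stack([unimodal, bimodal]))
```

In this simulation the unimodal gene should show a zero excess close to zero, while the bimodal gene should show a clearly positive one. The raw counts are used as they are, without normalization or transformation, in keeping with the abstract's remark that the count matrix is left unmodified. A real analysis would also need to account for differences in sequencing depth between cells and to attach a significance level to each deviation.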

2024



Degree programme: COMPLEX SYSTEMS FOR QUANTITATIVE BIOMEDICINE
Publication date: 11 June 2024
Language: English
Supervisor: CASELLE, Michele
Publisher: Università degli Studi di Torino
Collection: Università degli Studi di Torino

Files in this item:

File	Access	License	Size	Format
Lazzardi_PhD-Thesis_final.pdf	Open access	All rights reserved	9.8 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/363453

The NBN code of this thesis is URN:NBN:IT:UNITO-363453