A Distribution-Based Model for Single-Cell Transcriptomics
Lazzardi, Silvia
2024
Abstract
The emergence of single-cell transcriptomics has made it possible to uncover the diversity of gene expression among individual cells. However, its ability to identify genuine biological heterogeneity is limited by the high level of noise in these data, which makes it essential to separate technical and unwanted biological noise from the biological variability of interest. Tractable statistical models are useful tools for separating this variability from the general statistical effects of the sampling process inherent to the experimental technique. The ultimate objective is to distinguish stochastic biological noise from significant signals. In this thesis, we first exploit the structural similarity between these expression datasets and several other complex systems that can be described through the statistics of their basic components. The transcriptomes of single cells can in fact be seen as collections of messenger RNA abundances transcribed from a common set of genes, just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identified several emergent statistical laws in single-cell transcriptomic data that closely resemble regularities found in linguistics, ecology, and genomics. We demonstrated that a simple mathematical framework can be used to analyze the relationships between the different laws and the potential mechanisms behind their ubiquity. Taking a step further, we introduced a source of biological variation into our model, assuming a gamma distribution to describe stochastic gene expression.
This null model represents a novel approach to selecting putative key genes for downstream analysis without requiring any modification of the original expression values in the count matrix. Identifying genes that deviate significantly from this distribution, particularly those exhibiting multimodal behavior, offers valuable insights into tissue heterogeneity and provides an alternative to highly variable genes in feature selection procedures. Moreover, the number of genes displaying significant deviations indicates the dataset's complexity in terms of heterogeneity within cell populations.
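The idea of a gamma-based null model with a sampling step can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the procedure used in the thesis: it assumes Poisson sampling on top of a gamma-distributed expression rate (which gives a negative binomial count distribution), fits the two parameters by the method of moments, and uses the fraction of zero counts as a simple deviation check. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_gene(shape, scale, n_cells, rng):
    """One gene across cells: gamma-distributed rate, then Poisson sampling."""
    return [sample_poisson(rng.gammavariate(shape, scale), rng)
            for _ in range(n_cells)]


def simulate_bimodal_gene(shape, scale, off_fraction, n_cells, rng):
    """A gene that is switched off in a fraction of cells (two populations)."""
    return [0 if rng.random() < off_fraction
            else sample_poisson(rng.gammavariate(shape, scale), rng)
            for _ in range(n_cells)]


def fit_by_moments(counts):
    """Method-of-moments fit; assumes the variance exceeds the mean."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n
    scale = variance / mean - 1.0   # from variance = mean * (1 + scale)
    shape = mean / scale            # from mean = shape * scale
    return shape, scale


def expected_zero_fraction(shape, scale):
    """P(count = 0) for the gamma-Poisson mixture."""
    return (1.0 + scale) ** (-shape)


def zero_fraction_gap(counts):
    """Observed zero fraction minus the value predicted by the fitted model."""
    shape, scale = fit_by_moments(counts)
    observed = counts.count(0) / len(counts)
    return observed - expected_zero_fraction(shape, scale)


rng = random.Random(0)
n_cells = 20000  # hypothetical number of cells

# A gene that follows the null model: the gap should be close to zero.
null_counts = simulate_gene(shape=2.0, scale=1.5, n_cells=n_cells, rng=rng)
null_gap = zero_fraction_gap(null_counts)

# A gene expressed in only half of the cells: the gap should be large.
bimodal_counts = simulate_bimodal_gene(8.0, 1.0, 0.5, n_cells, rng)
bimodal_gap = zero_fraction_gap(bimodal_counts)

print(f"null gene gap:    {null_gap:+.3f}")
print(f"bimodal gene gap: {bimodal_gap:+.3f}")
```

In this sketch the raw counts are never transformed, which mirrors the abstract's point that the selection works on the original count matrix. A gene whose counts come from two cell populations shows far more zeros than a single gamma-Poisson fit predicts, so it would be flagged as deviating.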
| File | Access | License | Size | Format |
|---|---|---|---|---|
| Lazzardi_PhD-Thesis_final.pdf | Open access | All rights reserved | 9.8 MB | Adobe PDF |
https://hdl.handle.net/20.500.14242/363453
URN:NBN:IT:UNITO-363453