SARS-CoV-2 epidemiology, genomics and phylogeny: identification and characterization of circulating viral variants which are clinically relevant for public health.

MAZZOTTI, GIORGIA
2026

Abstract

The project presented in this thesis was conceived during the COVID-19 pandemic, with the initial goal of understanding the evolutionary dynamics of SARS-CoV-2 and, in particular, the events in its evolutionary history that are directly relevant to public health. The urgent need to track the spread of the virus and to decipher its transmission and evolutionary patterns created an unprecedented demand for genomic data, and an extraordinary number of viral sequences were generated and shared in public databases worldwide. Next-Generation Sequencing rapidly became a cornerstone technology for both research and genomic surveillance, enabling, for the first time, real-time monitoring of viral evolution. However, the race to produce data also yielded a considerable number of unreliable records, at both the genomic and the metadata level. To address this issue, we developed CLsquared, a tool that filters large viral genomic datasets and retains only high-quality sequences. CLsquared fills a gap in the field: no dedicated tools existed to assess and ensure the quality of input data, despite its crucial role in enabling reliable downstream analyses. The project then progressed with the development of VirNA, a tool that reconstructs SARS-CoV-2 evolutionary dynamics directly from genomic sequences, without relying on inference models to fill in missing information. The exceptional completeness of SARS-CoV-2 datasets made it possible to use VirNA to uncover viral evolutionary pathways purely from real data, in a setting where traditional phylogenetic methods often struggled because they produced unstable topologies when dealing with highly similar sequences. Despite the global sequencing effort, however, it remains impossible to sample every infected individual, and surveillance strategies vary widely across countries.
This raised a key question: to what extent can insufficient sampling hinder our understanding of viral evolution? To tackle it, we combined CLsquared, VirNA, and a transmission model of a SARS-CoV-2 variant spreading in a population. This integrated approach allowed us to simulate infection chains and apply different sampling strategies to assess how undersampling affects the reconstruction of viral evolutionary history. Starting from the massive viral sequence datasets now at our disposal, this work has a twofold aim: to clarify how SARS-CoV-2 evolutionary history can be reliably reconstructed, and to highlight how genomic surveillance strategies shape our knowledge of pathogen evolution.
6 March 2026
English
TOPPO, STEFANO
Università degli studi di Padova
Files in this item:
tesi_Giorgia_Mazzotti.pdf
Embargoed until 6 March 2027
License: All rights reserved
Size: 7.28 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/363049
The NBN code of this thesis is URN:NBN:IT:UNIPD-363049