Assessment of genetic determinants in Escherichia coli uropathogenic lifestyle and intracellular persistence via optimized k-mer matching of million-genome collections on laptops
Brunetti, Francesca
2026
Abstract
Urinary tract infections (UTIs) are among the most common bacterial infections in humans and are primarily caused by uropathogenic Escherichia coli (UPEC). A key challenge in UTI management is UPEC’s ability to persist intracellularly, which enables it to evade host defenses and antibiotic treatment. Although recent studies suggest that mobile genetic elements may be crucial to UPEC persistence, the genetic basis of this persistence remains poorly understood. Meanwhile, recent advances in sequencing technologies have produced vast bacterial genome collections, such as the AllTheBacteria collection (n = 2,440,377), which hold great potential for applications such as rapid diagnostics and epidemiological surveillance at the point of care (POC). However, the exponential growth of data has outpaced computational performance, limiting our ability to perform real-time searches across million-genome collections, especially on portable devices. Such searches were recently made possible on portable devices by Phylign, a sequence alignment tool that combines phylogenetic compression with k-mer matching and alignment. Yet Phylign remains unsuitable for time-sensitive analyses with long and divergent queries, because no established methodology exists to guide the selection, application, parametrization, and calibration of low-level k-mer indexes under phylogenetic compression for specific biological questions. This study therefore has two primary aims: 1) to develop an end-to-end methodology for rapid k-mer searches across multi-million-genome collections on portable devices using phylogenetic compression; and 2) to investigate the potential genetic determinants of UPEC lifestyle and intracellular persistence by characterizing the plasmid of the persistent prostatic UPEC strain EC73.
Here, we developed and implemented a comprehensive methodology for rapid k-mer searches across million-genome collections on portable devices and applied it to investigate the genetic determinants of UPEC lifestyle and persistence. The methodology has three steps: 1) translating the biological question of interest into a k-mer-based problem; 2) formalizing k-mer matching through a matching strategy, defined as the combination of three elements (a query type, a reference collection type, and a matching mode); and 3) using this strategy to guide the selection of the most suitable k-mer index for the given application. We applied this methodology to the plasmid search problem across million-genome collections using Phylign, evaluating four state-of-the-art k-mer indexes (COBS, Fulgor, Themisto, and Metagraph) and identifying Fulgor as offering the best trade-off between space efficiency and search speed. Finally, we characterized the EC73 plasmid, identifying multiple functionally distinct genes and the widespread prevalence of its genes among all E. coli and UPEC genomes ever sequenced. Furthermore, we identified in silico three candidate genes in the EC73 plasmid that may be involved in UPEC intracellular persistence. Overall, this work provides the first systematic framework for large-scale k-mer search on portable devices leveraging phylogenetic compression. We developed Phylign-Fulgor, an optimized version of Phylign that improves the scalability and speed of genome searches across multi-million-genome collections on standard laptops, paving the way for novel biological applications on datasets that were previously difficult to handle outside high-performance computing platforms. For EC73, Phylign-Fulgor enabled us to investigate the prevalence of the plasmid and its genes across all sequenced E. coli and UPEC genomes in the AllTheBacteria collection; the results suggest no particular association with the UPEC pathotype but rather a hypothetical strain-specific role in host adaptation and fitness.

| File | Size | Format | |
|---|---|---|---|
| Tesi_dottorato_Brunetti.pdf (open access; license: Creative Commons) | 8.92 MB | Adobe PDF | |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/357557
URN:NBN:IT:UNIROMA1-357557