Design and Implementation of a Comprehensive Bioinformatics Tool and Online Database for Exploring Genomic Diversity Across All Domains of Life

Yaddehige, Sachithra Kalhari

Rapid advances in sequencing technologies have revolutionised genomics, allowing entire genomes to be sequenced in less time and generating an unprecedented volume of genomic data. While repositories such as GenBank and RefSeq provide centralised access to these sequences, extracting meaningful biological insights requires standardised, comprehensive, and comparable genomic metrics calculated from the sequences. Effective interpretation of genomic data extends beyond fundamental research: it supports improvements in crop and livestock productivity, guides the identification of genetic variants linked to disease prevention, and enables the reconstruction of evolutionary relationships between species. Yet current resources for comparative genomics often remain limited in scope. This thesis aims to fill that gap in bioinformatic resources for parsing genomic data by introducing a unified platform and a tool capable of producing comprehensive genomic datasets. The novelty of this work lies in GBRAP (Genome-Based Retrieval and Analysis Parser), a software tool and freely accessible online database (https://tacclab.org/gbrap/) offering more than 200 genome-derived statistics per sequence across both coding and non-coding regions. By integrating the entire NCBI RefSeq repository, GBRAP provides consistent, machine-readable, and downloadable datasets that support large-scale comparative studies across viruses, archaea, bacteria, protozoa, fungi, plants, and animals. By parsing GenBank Flat Files (GBFF) and offering a broad range of statistics, GBRAP provides a level of detail and inclusivity not available in other resources.
The thesis also includes two downstream applications of GBRAP data. The first is a machine learning-based classification of Archaea and Bacteria using genomic datasets downloaded from the GBRAP database. With a dataset of 2,655 genomes and 77 selected GBRAP metrics, machine learning models, including logistic regression, random forests, support vector machines, and neural networks, classified genomes with near-perfect accuracy. The analysis identified tRNA entropies, nucleotide compositions of RNAs, and the Chargaff’s scores of tRNA, rRNA and CDS as the most important features. The second project, Prob-AI, is an ongoing study to identify novel probiotics using data generated with the GBRAP tool. Machine learning models trained on an initial dataset revealed distinct codon patterns and compositional features that discriminate between probiotics and non-probiotics. The model predictions will be validated through in vitro assays to confirm the probiotic potential of the selected bacterial strains. By bridging large-scale data curation with machine learning approaches, this thesis demonstrates how unified genomic metrics can drive both fundamental evolutionary research and translational applications, addressing key gaps in current bioinformatics resources and paving the way for new biological discoveries.
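The abstract names nucleotide composition, entropy, and Chargaff’s score among the GBRAP metrics but does not give their formulas. As a rough illustration only, the sketch below computes three textbook quantities of that kind for a single DNA sequence: base frequencies, Shannon entropy of those frequencies, and a simple deviation from Chargaff’s second parity rule. The function names and the exact parity measure are hypothetical choices made for this example; they are not taken from GBRAP and may differ from its definitions.

```python
import math
from collections import Counter


def base_composition(seq):
    """Relative frequency of A, C, G and T, ignoring any other symbols."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: counts[b] / total for b in "ACGT"}


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base frequencies; the maximum is 2.0."""
    freqs = base_composition(seq)
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def parity_deviation(seq):
    """Hypothetical parity measure: |f(A) - f(T)| + |f(C) - f(G)|.

    A value of 0 means the single strand satisfies Chargaff's second
    rule exactly; this is an assumption for illustration, not GBRAP's score.
    """
    f = base_composition(seq)
    return abs(f["A"] - f["T"]) + abs(f["C"] - f["G"])
```

In a comparative setting, values like these would be computed per annotated region (for example CDS, tRNA, or rRNA) and collected into one row of features per genome before any classifier is trained.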



Degree programme: Scienze Veterinarie e Sicurezza Alimentare (Veterinary Sciences and Food Safety)
Publication date: 6 February 2026
Language: English
Supervisor, advisor or tutor: TACCIOLI, CRISTIAN
Publisher: Università degli studi di Padova
Collection: Università degli Studi di Padova

Files in this item:

File	Size	Format
PhD_Thesis_Sachithra_Kalhari_Yaddehige.pdf (embargoed until 5 February 2029; licence: all rights reserved)	5.91 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/360886

The NBN code of this thesis is URN:NBN:IT:UNIPD-360886