The rapid expansion of sequencing technologies has led to an overwhelming increase in protein sequence data, yet the challenge of functional annotation remains a major bottleneck in bioinformatics. Traditional annotation methods rely heavily on sequence similarity searches, which are effective for well-characterized proteins but struggle with novel or highly divergent sequences. This limitation is evident in large-scale metagenomic studies, where most proteins lack homologs in high-quality reference databases. The need for accurate and scalable annotation tools is urgent, as functional characterization is essential for understanding biological systems. Compounding this challenge is the global antibiotic resistance crisis, which threatens public health by reducing the efficacy of existing treatments against bacterial infections. Antimicrobial resistance (AMR) is largely driven by the genetic adaptability of bacteria, enabling them to evade antibiotics through various resistance mechanisms encoded in their genomes. Predicting which proteins contribute to antibiotic resistance is crucial for rapid surveillance and diagnostics. However, existing AMR prediction methods typically depend on curated resistance gene databases, which are inherently limited in scope and fail to identify novel resistance determinants. A more advanced approach is needed to recognize resistance-related patterns beyond known resistance genes. To address the limitations of traditional annotation pipelines, this thesis introduces Argot3.0, a deep learning-based tool that predicts protein functions directly from sequence data without requiring similarity searches. Argot3.0 utilizes Evolutionary Scale Modeling (ESM-2) embeddings to transform protein sequences into high-dimensional feature representations, which are then processed by a hybrid neural network combining Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) layers. 
By learning from the full complexity of protein sequences, rather than relying on homology-based inference, Argot3.0 significantly improves annotation accuracy and coverage, particularly for proteins with no close matches in curated databases. Building on this framework, AmrGOt extends the capabilities of Argot3.0 to predict whether a given protein contributes to antibiotic resistance. Instead of focusing on known resistance genes, AmrGOt leverages a deep learning model trained on bacterial proteins annotated to the Gene Ontology term GO:0046677 (response to antibiotic), allowing it to identify resistance-related proteins even in the absence of prior classification. By integrating sequence embeddings with CNN and LSTM architectures, AmrGOt captures complex sequence patterns associated with antimicrobial resistance, distinguishing resistant from non-resistant proteins with high accuracy. By replacing rigid database-dependent approaches with a scalable, data-driven model, AmrGOt provides a powerful tool for AMR prediction, capable of identifying emerging resistance determinants before they become clinically significant. In a time when antibiotic resistance is outpacing drug development, this work contributes a critical advance in computational approaches to resistance surveillance and functional annotation, offering a more adaptable and predictive model for tackling one of the most pressing challenges in modern medicine.
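The embedding-to-CNN-to-LSTM flow described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the thesis implementation: the weights are random and untrained, the per-residue vectors are random stand-ins for ESM-2 embeddings (real ones are far wider), and every name and dimension here (`conv1d`, `lstm_last_hidden`, `predict`, `init_params`, 8 embedding channels, 4 hidden units) is a hypothetical choice made only to show how the pieces could fit together.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv1d(seq, kernels, bias):
    """Zero-padded ('same') 1D convolution followed by ReLU.

    seq is a list of per-residue vectors; kernels[o][k] is the weight
    vector of output channel o at kernel offset k (odd kernel width).
    """
    width = len(kernels[0])
    pad = width // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        row = []
        for o, kernel in enumerate(kernels):
            s = bias[o] + sum(dot(kernel[k], padded[t + k]) for k in range(width))
            row.append(max(0.0, s))
        out.append(row)
    return out


def lstm_last_hidden(seq, W, U, b, hidden):
    """Run a single LSTM layer over seq and return the final hidden state.

    W[g], U[g], b[g] hold input weights, recurrent weights and biases for
    gate g, ordered as input, forget, candidate, output.
    """
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        pre = [[b[g][j] + dot(W[g][j], x) + dot(U[g][j], h) for j in range(hidden)]
               for g in range(4)]
        i_gate = [sigmoid(v) for v in pre[0]]
        f_gate = [sigmoid(v) for v in pre[1]]
        cand = [math.tanh(v) for v in pre[2]]
        o_gate = [sigmoid(v) for v in pre[3]]
        c = [f_gate[j] * c[j] + i_gate[j] * cand[j] for j in range(hidden)]
        h = [o_gate[j] * math.tanh(c[j]) for j in range(hidden)]
    return h


def init_params(rng, embed_dim, channels, width, hidden):
    # Random, untrained weights: for illustration only.
    return {
        "kernels": [rand_matrix(rng, width, embed_dim) for _ in range(channels)],
        "conv_bias": [0.0] * channels,
        "W": [rand_matrix(rng, hidden, channels) for _ in range(4)],
        "U": [rand_matrix(rng, hidden, hidden) for _ in range(4)],
        "b": [[0.0] * hidden for _ in range(4)],
        "out_w": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
        "out_bias": 0.0,
        "hidden": hidden,
    }


def predict(embeddings, params):
    """Map per-residue embeddings to a score in (0, 1) for one label."""
    feats = conv1d(embeddings, params["kernels"], params["conv_bias"])
    h = lstm_last_hidden(feats, params["W"], params["U"], params["b"], params["hidden"])
    return sigmoid(params["out_bias"] + dot(params["out_w"], h))


rng = random.Random(0)
embeddings = rand_matrix(rng, 12, 8)  # stand-in for a 12-residue protein
params = init_params(rng, embed_dim=8, channels=6, width=3, hidden=4)
score = predict(embeddings, params)
print(f"score for the single label: {score:.3f}")
```

In this sketch the convolution picks up local motifs across neighbouring residues and the LSTM aggregates them along the sequence. A single sigmoid output corresponds to a binary decision such as the resistance label; a multi-label function predictor would presumably use one output per term.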
Increasing Functional Annotation Coverage in Proteomes and Detecting Antibiotic Resistance Genes Using Deep Learning Techniques
ISPANO, EMILIO
2025
File | Size | Format | |
---|---|---|---|
tesi_Emilio_Ispano.pdf (open access) | 7.08 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/220254
URN:NBN:IT:UNIPD-220254