Characterization of repeated protein structures in the AlphaFold revolution

CLEMENTEL, DAMIANO
2025

Abstract

For the past fifty years, one of the greatest challenges in bioinformatics has been answering the question: "How can we predict the three-dimensional structure of a protein from its amino acid sequence?" Experimentally determining the three-dimensional structure of a protein is often a slow and challenging process. By contrast, sequencing its amino acids has become a high-throughput task thanks to technological advances and reduced costs. This disparity has led structural biology and bioinformatics to focus primarily on globular proteins, which reliably fold into a consistent three-dimensional structure and are therefore more accessible to computational and experimental studies. This focus aligns with the sequence-structure-function paradigm that has long guided our understanding of protein function. However, many proteins belong to the lesser-studied category of Non-Globular Proteins (NGPs), which display more diverse structural and functional characteristics, making them harder to observe. The introduction of cutting-edge protein structure prediction algorithms like AlphaFold2 and RoseTTAFold has revolutionized the field. AlphaFold2 in particular demonstrated remarkable accuracy in the 14th edition of the CASP competition, held in 2020, sparking what is often referred to as the AlphaFold revolution. Today, computational models of protein structures are available for nearly every known protein, significantly accelerating research in structural biology. Despite this progress, certain classes of NGPs, including Tandem Repeat (TR) proteins, pose unique challenges in terms of detection and classification. This work focuses on Structured Tandem Repeat Proteins (STRPs), a subset of TR proteins with well-defined structural features. The AlphaFold revolution has prompted major updates to databases that store structural data, such as RepeatsDB, which specializes in STRPs.
This thesis outlines the evolution of RepeatsDB from its 3rd version in 2021 to its 4th version in 2024, showcasing improvements in manual curation, automated prediction, and scalability in response to the surge of available structural data. RepeatsDB 4 improves the manual curation process through the RepeatsDB Bio-curation Tool, which has helped refine the definition and classification of STRPs. In collaboration with Pfam, manually collected STRPs have been compared between the two databases, validating and improving the information held in both. The addition of automated prediction methods, such as the newly developed STRPsearch algorithm, represents another significant step forward. STRPsearch combines curated STRP data with the fast structural search capabilities of Foldseek to improve and speed up predictions, allowing RepeatsDB to scale and eventually cover the entire AlphaFoldDB. Reengineering RepeatsDB 4 has also generated various side products, including the ngx-mol-viewers Angular library, which enhances the visualization of biological molecules and has already been adopted by other databases such as MobiDB. In addition, RepeatsDB 4 uses a specialized pipeline to enrich biological data by integrating external resources and software. The pipeline is designed to run on a High-Performance Computing (HPC) cluster, using the DRMAAtic library for efficient communication with the cluster. The architecture of RepeatsDB 4 is designed to be reusable in other projects, such as the DOME registry, underscoring its broader applications in bioinformatics.
14 Feb 2025
English
TOSATTO, SILVIO
Università degli studi di Padova
Files in this record:
phd-thesis-compatible.pdf (open access, 10.16 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/192507
The NBN code of this thesis is URN:NBN:IT:UNIPD-192507