From Sequences to Structures: Large Scale Protein Domain Classification by Density Peaks Clustering

BARONE, FEDERICO
2025

Abstract

Advances in high-throughput sequencing technology have led to an exponential growth in the number of known protein sequences, with public repositories now accumulating billions of entries. Moreover, the recent breakthrough in sequence-based protein structure prediction, driven by deep learning approaches, has enabled the large-scale assignment of putative structures to most of these sequences. Despite these advances, experimental characterization of protein function remains labor-intensive and time-consuming, creating a widening gap between the proteins known at the sequence or structural level and our understanding of their biological role. One effective strategy to address this challenge is the hierarchical classification of protein space, whereby proteins are grouped according to varying degrees of sequence or structural similarity. The underlying hypothesis is that proteins within a group are likely to perform similar functions, presumably derived from their long-lost common ancestor. Established classification databases such as Pfam, CATH, and SCOP have proven to be invaluable resources, as evidenced by their widespread use across diverse biological applications. However, these databases rely to varying degrees on manual curation, limiting their scalability as data continues to grow. While automated initiatives have emerged to address this issue, most of these methods remain computationally intensive and require supervised training. In this thesis, we focus on the development and application of high-throughput, unsupervised pipelines for identifying and classifying protein domains in large datasets. We first re-engineered the DPCfam pipeline, originally limited to datasets of thousands of proteins, so that it can now process millions of sequences. In the first study, we applied DPCfam to the UniRef50 database (v2017_07), which contains 23 million proteins, and identified approximately 45,000 protein domain clusters.
Our automated classification corresponds closely to the manually curated Pfam resource: 78% of the clusters that carry Pfam annotations exhibit 100% consistency. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. A preliminary analysis performed in collaboration with the Pfam team suggests that many of these unannotated clusters have the potential to be converted into novel families with minimal manual curation. As a follow-up study, we applied DPCfam to classify the Unified Human Gastrointestinal Protein (UHGP) dataset, one of the most relevant metagenomic datasets with applications ranging from medicine to biology. Metagenomic datasets are challenging due to their diverse taxonomic composition compared to standard protein repositories such as UniRef. Our classification improved family coverage by more than 15% at both the protein and residue levels relative to Pfam. Moreover, we identified over 1,200 clusters that do not overlap with existing Pfam families or clusters from DPCfam-UniRef50, indicating the presence of metagenome-specific putative families. Motivated by the release of the AlphaFold protein structure database, we developed DPCstruct, an adaptation of DPCfam tailored for large-scale, structure-based domain clustering. When applied to the Foldseek Cluster database (15 million proteins), DPCstruct recovered the majority of protein folds cataloged in SCOP and CATH. Of the 28,246 clusters identified, 24% appear to represent novel folds, including examples within the well-studied human proteome. Together, DPCfam and DPCstruct highlight the power and flexibility of Density Peaks Clustering for both sequence- and structure-based protein domain classification.
Their high-performance implementation offers scalable, automated solutions that can complement established classification frameworks or serve as standalone tools for specialized applications.
25-mar-2025
English
Protein domains; Unsupervised; Clustering; Protein sequences; Protein structures
CAZZANIGA, ALBERTO
ANSUINI, ALESSIO
Università degli Studi di Trieste
Files in this item:
barone_thesis_reviewed.pdf: open access, 15.15 MB, Adobe PDF
barone_thesis_reviewed_1.pdf: open access, 15.15 MB, Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/208544
The NBN code of this thesis is URN:NBN:IT:UNITS-208544