From Sequences to Structures: Large Scale Protein Domain Classification by Density Peaks Clustering

Barone, Federico

Advances in high-throughput sequencing technology have led to exponential growth in the number of known protein sequences, with public repositories now holding billions of entries. Moreover, the recent breakthrough in sequence-based protein structure prediction, driven by deep learning approaches, has enabled the large-scale assignment of putative structures to most of these sequences. Despite these advances, experimental characterization of protein function remains labor-intensive and time-consuming, creating a widening gap between the proteins known at the sequence or structural level and our understanding of their biological role.

One effective strategy to address this challenge is the hierarchical classification of protein space, whereby proteins are grouped according to varying degrees of sequence or structural similarity. The underlying hypothesis is that proteins within a group are likely to perform similar functions, presumably inherited from a distant common ancestor. Established classification databases such as Pfam, CATH, and SCOP have proven to be invaluable resources, as evidenced by their widespread use across diverse biological applications. However, these databases rely to varying degrees on manual curation, which limits their scalability as data continues to grow. While automated initiatives have emerged to address this issue, most of these methods remain computationally intensive and require supervised training.

In this thesis, we focus on the development and application of high-throughput, unsupervised pipelines for identifying and classifying protein domains in large datasets. We first re-engineered the DPCfam pipeline, originally limited to datasets of thousands of proteins, so that it can now process millions of sequences. In a first study, we applied DPCfam to the UniRef50 database (release 2017_07), which contains 23 million proteins, and identified approximately 45,000 protein domain clusters. Our automated classification corresponds closely to the manually curated Pfam resource: 78% of the clusters with Pfam annotations exhibit 100% consistency. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for novel protein families. A preliminary analysis performed in collaboration with the Pfam team suggests that many of these unannotated clusters could be converted into novel families with minimal manual curation.

As a follow-up study, we applied DPCfam to classify the Unified Human Gastrointestinal Protein (UHGP) dataset, one of the most relevant metagenomic datasets, with applications ranging from medicine to biology. Metagenomic datasets are challenging because their taxonomic composition is more diverse than that of standard protein repositories such as UniRef. Our classification improved family coverage by more than 15% at both the protein and residue levels relative to Pfam. Moreover, we identified over 1,200 clusters that do not overlap with existing Pfam families or with clusters from DPCfam-UniRef50, indicating the presence of metagenome-specific putative families.

Motivated by the release of the AlphaFold protein structure database, we developed DPCstruct, an adaptation of DPCfam tailored for large-scale, structure-based domain clustering. When applied to the Foldseek Cluster database (15 million proteins), DPCstruct recovered the majority of protein folds cataloged in SCOP and CATH. Of the 28,246 clusters identified, 24% appear to represent novel folds, including examples within the well-studied human proteome.

Together, DPCfam and DPCstruct highlight the power and flexibility of Density Peaks Clustering for both sequence- and structure-based protein domain classification. Their high-performance implementation offers scalable, automated solutions that can complement established classification frameworks or serve as standalone tools for specialized applications.

2025

Degree programme: APPLIED DATA SCIENCE AND ARTIFICIAL INTELLIGENCE
Publication date: 25 March 2025
Language: English
Keywords: Protein domains; Unsupervised; Clustering; Protein sequences; Protein structures
Supervisor, Advisor, or Tutor: CAZZANIGA, ALBERTO; ANSUINI, ALESSIO
Publisher: Università degli Studi di Trieste
Collection: Università degli Studi di Trieste

Files in this item:

File	Access	License	Size	Format
barone_thesis_reviewed.pdf	Open access	All rights reserved	15.15 MB	Adobe PDF
barone_thesis_reviewed_1.pdf	Open access	All rights reserved	15.15 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/208544

The NBN code of this thesis is URN:NBN:IT:UNITS-208544