Protein Language Models: Interpretability and Applications
VILLEGAS GARCIA, EDITH NATALIA
2026
Abstract
The advent of large-scale neural networks has revolutionized computational biology, enabling the development of powerful Protein Language Models (PLMs) that learn rich representations from amino acid sequences. This thesis explores the internal workings of these models through the lens of interpretability, with the aim of understanding the biological knowledge they capture and leveraging this understanding for practical applications. We begin by investigating the intrinsic dimensionality (ID) of data representations across the layers of deep neural networks trained on diverse modalities. We find that while ID evolution varies across modalities, PLMs exhibit a remarkably robust and consistent pattern, suggesting that they converge to a universal representation. This consistency makes PLMs a particularly fertile ground for interpretability research, as insights are likely to generalize across architectures. Building on this, we apply Sparse Autoencoders (SAEs), a state-of-the-art mechanistic interpretability method, to dissect the representations of the ESM2-8M model. We successfully disentangle model representations and link them to biologically meaningful annotations from the UniProt database. Furthermore, we demonstrate that these features are actionable: by artificially activating SAE latents associated with zinc finger motifs during inference, we can steer the model to generate novel protein sequences containing these structural elements. We complement this with a classical analysis of model neurons, revealing that although individual neurons are polysemantic and each encoded concept is spread across the whole neuron population, information about specific protein domains is concentrated in a small subset of neurons, with probing performance saturating rapidly as more neurons are added. We also identify the presence of “outlier dimensions” in PLMs, analogous to those previously described in natural language models, and show that one such dimension is strongly correlated with intrinsically disordered regions, potentially acting as a biological analogue of punctuation. The practical utility of PLMs is further demonstrated through their application to predicting protein interaction interfaces: fine-tuning PLMs for this task showcases their transfer-learning capabilities on a problem of direct biological relevance. In conclusion, this thesis provides a multifaceted examination of protein language models, from fundamental geometric properties and mechanistic interpretability to practical applications in protein generation and interaction-interface prediction. Our work underscores that PLMs are not merely black-box predictors but learn structured, biologically grounded representations that we can begin to understand and control.
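To make the per-layer ID analysis concrete, below is a minimal sketch using the TwoNN estimator (Facco et al., 2017), one common choice for this kind of measurement; the thesis abstract does not name its exact estimator, and the `activations` arrays in the usage comment are hypothetical placeholders.

```python
# Minimal TwoNN intrinsic-dimension estimator; a sketch, not the thesis's
# exact method. Assumes embeddings have already been extracted per layer.
import numpy as np
from scipy.spatial.distance import cdist

def twonn_id(X: np.ndarray) -> float:
    """Estimate the intrinsic dimension of X (n_samples, n_features)
    from the ratio of each point's two nearest-neighbour distances."""
    d = cdist(X, X)              # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)  # exclude self-distances
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]       # ratio of 2nd to 1st NN distance (>= 1)
    # Maximum-likelihood TwoNN estimate: ID = n / sum(log mu_i).
    # Assumes no duplicate points, so d[:, 0] > 0.
    return len(mu) / np.log(mu).sum()

# Hypothetical usage: activations[l] holds (n_residues, d_model) embeddings
# extracted at layer l, giving an ID profile across depth:
# id_profile = [twonn_id(activations[l]) for l in range(n_layers)]
```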
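The latent-steering experiment can be sketched as follows, assuming a trained SAE with the common encoder/decoder parameterization z = ReLU((h − b_dec) W_enc + b_enc), ĥ = z W_dec + b_dec; the latent index, clamp value, and layer choice below are hypothetical, not the thesis's settings.

```python
# A minimal sketch of steering generation by clamping one SAE latent on
# while reconstructing activations; `sae` is assumed to expose
# W_enc, W_dec, b_enc, b_dec as tensors.
import torch

def steer(h: torch.Tensor, sae, latent: int, value: float) -> torch.Tensor:
    """Encode activations, clamp one SAE latent, and decode back."""
    z = torch.relu((h - sae.b_dec) @ sae.W_enc + sae.b_enc)  # encode
    z[..., latent] = value          # force the chosen feature active
    return z @ sae.W_dec + sae.b_dec  # decode back to model space

# Hypothetical usage: patch the steered activations back into the model
# with a forward hook on the chosen transformer layer, e.g.
# layer.register_forward_hook(
#     lambda mod, inp, out: (steer(out[0], sae, latent=123, value=5.0),) + out[1:])
```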
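The neuron analysis admits a simple probing sketch: rank neurons by how strongly their mean activation differs between annotated and unannotated residues, then track how probe performance saturates as the top-k neurons are added. The arrays `acts` and `labels` are placeholders, and a held-out split should be used in practice.

```python
# A hedged sketch of neuron-subset probing for a protein-domain annotation;
# the ranking criterion and probe are illustrative choices, not the thesis's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def saturation_curve(acts, labels, ks=(1, 2, 4, 8, 16, 32)):
    """acts: (n_residues, n_neurons) activations; labels: binary annotation.
    Returns in-sample AUC as a function of the number of neurons used."""
    # Rank neurons by the difference of class-conditional mean activations.
    score = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
    order = np.argsort(-np.abs(score))  # most label-aligned neurons first
    curve = {}
    for k in ks:
        cols = acts[:, order[:k]]
        probe = LogisticRegression(max_iter=1000).fit(cols, labels)
        curve[k] = roc_auc_score(labels, probe.predict_proba(cols)[:, 1])
    return curve  # AUC vs. number of neurons; expect rapid saturation
```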
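Finally, the interface-prediction application can be framed as per-residue binary token classification. Below is a hedged sketch using the public ESM2-8M checkpoint via Hugging Face `transformers`; the thesis's actual datasets, labels, and training loop are not shown.

```python
# A minimal sketch of setting up a PLM for per-residue interface prediction;
# a full fine-tuning run would add labeled data and an optimizer loop.
from transformers import AutoTokenizer, EsmForTokenClassification

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForTokenClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=2)  # interface / non-interface

batch = tok(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt")
logits = model(**batch).logits  # (1, seq_len, 2); seq_len includes CLS/EOS tokens
```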
| File | Size | Format |
|---|---|---|
| PhD_Villegas.pdf (open access; license: all rights reserved) | 6.28 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/357308
URN:NBN:IT:UNITS-357308