Protein Language Models: Interpretability and Applications
VILLEGAS GARCIA, EDITH NATALIA
2026
Abstract
The advent of large-scale neural networks has revolutionized computational biology, enabling the development of powerful Protein Language Models (PLMs) that learn rich representations from amino acid sequences. This thesis explores the internal workings of these models through the lens of interpretability, with the aim of understanding the biological knowledge they capture and leveraging this understanding for practical applications. We begin by investigating the intrinsic dimensionality (ID) of data representations across the layers of deep neural networks trained on diverse modalities. We find that while ID evolution varies across modalities, PLMs exhibit a remarkably robust and consistent pattern, suggesting that they converge to a universal representation. This consistency makes PLMs a particularly fertile ground for interpretability research, as insights are likely to generalize across architectures. Building on this, we apply Sparse Autoencoders (SAEs), a state-of-the-art mechanistic interpretability method, to dissect the representations of the ESM2-8M model. We successfully disentangle model representations and link them to biologically meaningful annotations from the UniProt database. Furthermore, we demonstrate that these features are actionable: by artificially activating SAE latents associated with zinc finger motifs during inference, we can steer the model to generate novel protein sequences containing these structural elements. We complement this with a classical analysis of model neurons, revealing that although individual neurons are polysemantic and each encoded concept is spread across the whole neuron population, information about specific protein domains is concentrated in a small subset of neurons, with probing performance saturating rapidly as more neurons are added. We also identify the presence of “outlier dimensions” in PLMs, analogous to those previously described in natural language models, and show that one such dimension is strongly correlated with intrinsically disordered regions, potentially acting as a biological analogue of punctuation. The practical utility of PLMs is further demonstrated through their application to predicting protein interaction interfaces: fine-tuning PLMs for this task showcases their transfer-learning capabilities on a problem of direct biological relevance. In conclusion, this thesis provides a multifaceted examination of protein language models, from fundamental geometric properties and mechanistic interpretability to practical applications in protein generation and interaction-interface prediction. Our work underscores that PLMs are not merely black-box predictors but learn structured, biologically grounded representations that we can begin to understand and control.
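To make the per-layer ID analysis concrete, below is a minimal sketch using the TwoNN estimator (Facco et al., 2017), one common choice for this kind of measurement; the thesis abstract does not name its exact estimator, and the `activations` arrays in the usage comment are hypothetical placeholders.

```python
# Minimal TwoNN intrinsic-dimension estimator; a sketch, not the thesis's
# exact method. Assumes embeddings have already been extracted per layer.
import numpy as np
from scipy.spatial.distance import cdist

def twonn_id(X: np.ndarray) -> float:
    """Estimate the intrinsic dimension of X (n_samples, n_features)
    from the ratio of each point's two nearest-neighbour distances."""
    d = cdist(X, X)              # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)  # exclude self-distances
    d.sort(axis=1)
    mu = d[:, 1] / d[:, 0]       # ratio of 2nd to 1st NN distance (>= 1)
    # Maximum-likelihood TwoNN estimate: ID = n / sum(log mu_i).
    # Assumes no duplicate points, so d[:, 0] > 0.
    return len(mu) / np.log(mu).sum()

# Hypothetical usage: activations[l] holds (n_residues, d_model) embeddings
# extracted at layer l, giving an ID profile across depth:
# id_profile = [twonn_id(activations[l]) for l in range(n_layers)]
```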
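The latent-steering experiment can be sketched as follows, assuming a trained SAE with the common encoder/decoder parameterization z = ReLU((h − b_dec) W_enc + b_enc), ĥ = z W_dec + b_dec; the latent index, clamp value, and layer choice below are hypothetical, not the thesis's settings.

```python
# A minimal sketch of steering generation by clamping one SAE latent on
# while reconstructing activations; `sae` is assumed to expose
# W_enc, W_dec, b_enc, b_dec as tensors.
import torch

def steer(h: torch.Tensor, sae, latent: int, value: float) -> torch.Tensor:
    """Encode activations, clamp one SAE latent, and decode back."""
    z = torch.relu((h - sae.b_dec) @ sae.W_enc + sae.b_enc)  # encode
    z[..., latent] = value          # force the chosen feature active
    return z @ sae.W_dec + sae.b_dec  # decode back to model space

# Hypothetical usage: patch the steered activations back into the model
# with a forward hook on the chosen transformer layer, e.g.
# layer.register_forward_hook(
#     lambda mod, inp, out: (steer(out[0], sae, latent=123, value=5.0),) + out[1:])
```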
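The neuron analysis admits a simple probing sketch: rank neurons by how strongly their mean activation differs between annotated and unannotated residues, then track how probe performance saturates as the top-k neurons are added. The arrays `acts` and `labels` are placeholders, and a held-out split should be used in practice.

```python
# A hedged sketch of neuron-subset probing for a protein-domain annotation;
# the ranking criterion and probe are illustrative choices, not the thesis's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def saturation_curve(acts, labels, ks=(1, 2, 4, 8, 16, 32)):
    """acts: (n_residues, n_neurons) activations; labels: binary annotation.
    Returns in-sample AUC as a function of the number of neurons used."""
    # Rank neurons by the difference of class-conditional mean activations.
    score = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
    order = np.argsort(-np.abs(score))  # most label-aligned neurons first
    curve = {}
    for k in ks:
        cols = acts[:, order[:k]]
        probe = LogisticRegression(max_iter=1000).fit(cols, labels)
        curve[k] = roc_auc_score(labels, probe.predict_proba(cols)[:, 1])
    return curve  # AUC vs. number of neurons; expect rapid saturation
```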
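Finally, the interface-prediction application can be framed as per-residue binary token classification. Below is a hedged sketch using the public ESM2-8M checkpoint via Hugging Face `transformers`; the thesis's actual datasets, labels, and training loop are not shown.

```python
# A minimal sketch of setting up a PLM for per-residue interface prediction;
# a full fine-tuning run would add labeled data and an optimizer loop.
from transformers import AutoTokenizer, EsmForTokenClassification

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForTokenClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=2)  # interface / non-interface

batch = tok(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt")
logits = model(**batch).logits  # (1, seq_len, 2); seq_len includes CLS/EOS tokens
```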
| File | Size | Format |
|---|---|---|
| PhD_Villegas.pdf (open access; license: all rights reserved) | 6.28 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/357308
URN:NBN:IT:UNITS-357308