Representation Learning Via Transformer: From 2D to 3D

REN, BIN
2025

Abstract

In recent years, Vision Transformers (ViTs) have rapidly gained prominence in representation learning, demonstrating remarkable capabilities across vision tasks ranging from low-level image restoration to high-level 3D understanding. This thesis investigates the role of ViTs in unified representation learning, spanning 2D vision to 3D perception, with an emphasis on enhancing effectiveness, efficiency, robustness, safety, and generalizability. The thesis is composed of three parts, each supported by state-of-the-art research contributions.

Part I explores the foundational role of positional encoding in ViTs. We identify that naïve use of position embeddings can introduce vulnerabilities such as privacy leakage and degraded robustness. To address this, we propose the Masked Jigsaw Puzzle (MJP) strategy, which rethinks spatial priors by partially masking and shuffling patches, significantly improving accuracy, consistency, and privacy preservation.

Part II focuses on efficient representation learning for low-level vision tasks, particularly image restoration. We introduce two task-aware ViT frameworks: SemanIR, which improves attention efficiency by sharing semantic dictionaries across Transformer blocks, and AnyIR, which models diverse degradations via a unified spatial-frequency-aware embedding space. These models demonstrate that meaningful structure can be learned under strict efficiency constraints, without relying on foundation models or external prompts.

Part III extends representation learning into the 3D domain. We propose Point-CMAE, a contrastive-enhanced masked autoencoder for point clouds, and introduce ShapeSplat, a large-scale dataset of Gaussian splats, accompanied by Gaussian-MAE for self-supervised pretraining on continuous 3D representations. Together, these contributions establish scalable frameworks for 3D learning beyond discrete point-based settings.

Across all parts, this thesis emphasizes task-aligned Transformer design as a unifying principle. Rather than treating ViTs as generic backbones, we demonstrate how architectural adaptations, whether through positional encoding, attention structuring, or contrastive masking, enable more robust and interpretable representations across modalities and tasks. These insights not only improve performance and efficiency but also reveal new opportunities for integrating compact visual Transformers into larger systems, including vision-language and multimodal foundation models. In sum, this thesis advances the theory and practice of Transformer-based representation learning across 2D and 3D domains, laying the groundwork for scalable, semantically structured, and future-ready visual intelligence.
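To make the Part I idea more concrete, the sketch below illustrates the kind of partial mask-and-shuffle operation over ViT patches that the abstract describes. It is a minimal illustration under stated assumptions only: the function name mjp_shuffle, the 25% shuffle ratio, and the single shared "unknown" position embedding are placeholders for exposition, not the implementation from the thesis.

import torch

def mjp_shuffle(tokens, pos_embed, unk_pos_embed, ratio=0.25):
    """Shuffle a random subset of patch tokens and mask their positions.

    tokens:        (B, N, D) patch embeddings (class token excluded)
    pos_embed:     (1, N, D) learnable position embeddings
    unk_pos_embed: (D,)      shared embedding standing in for an unknown position
    ratio:         fraction of patches whose spatial order is permuted
    """
    B, N, D = tokens.shape
    n_shuffle = max(1, int(N * ratio))

    out_tokens = tokens.clone()
    out_pos = pos_embed.expand(B, -1, -1).clone()

    for b in range(B):
        idx = torch.randperm(N)[:n_shuffle]        # patches selected for the jigsaw
        perm = idx[torch.randperm(n_shuffle)]      # shuffled order of those patches
        out_tokens[b, idx] = tokens[b, perm]       # permute patch content
        out_pos[b, idx] = unk_pos_embed            # hide their true positions
    return out_tokens + out_pos

# usage (illustrative): x = mjp_shuffle(patch_tokens, vit_pos_embed, unk_embed)

The sketch only conveys the mechanism at the level the abstract states it: a subset of patches loses its spatial ordering and its explicit position information, which is what limits how much the position embeddings can reveal while the remaining patches preserve the spatial prior.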
15 October 2025
English
representation learning
self-supervised learning
transformers
low-level vision
3D representations
Sebe, Nicu
Cucchiara, Rita
Files in this item:
File: PhDAIThesis_BinRen_2025.pdf
Access: open access
License: Creative Commons
Size: 149.58 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/308230
The NBN code of this thesis is URN:NBN:IT:UNIPI-308230