Representation Learning Via Transformer: From 2D to 3D

REN, BIN
2025

Abstract

In recent years, Vision Transformers (ViTs) have rapidly gained prominence in representation learning, demonstrating remarkable capabilities across vision tasks ranging from low-level image restoration to high-level 3D understanding. This thesis investigates the role of ViTs in unified representation learning, spanning 2D vision to 3D perception, with an emphasis on enhancing effectiveness, efficiency, robustness, safety, and generalizability. The thesis is composed of three parts, each supported by state-of-the-art research contributions.

Part I explores the foundational role of positional encoding in ViTs. We identify that naïve use of position embeddings can introduce vulnerabilities such as privacy leakage and degraded robustness. To address this, we propose the Masked Jigsaw Puzzle (MJP) strategy, which rethinks spatial priors by partially masking and shuffling patches, significantly improving accuracy, consistency, and privacy preservation.

Part II focuses on efficient representation learning for low-level vision tasks, particularly image restoration. We introduce two task-aware ViT frameworks: SemanIR, which improves attention efficiency by sharing semantic dictionaries across Transformer blocks, and AnyIR, which models diverse degradations via a unified spatial-frequency-aware embedding space. These models demonstrate that meaningful structure can be learned under strict efficiency constraints, without relying on foundation models or external prompts.

Part III extends representation learning into the 3D domain. We propose Point-CMAE, a contrastive-enhanced masked autoencoder for point clouds, and introduce ShapeSplat, a large-scale dataset of Gaussian splats, accompanied by Gaussian-MAE for self-supervised pretraining on continuous 3D representations. Together, these contributions establish scalable frameworks for 3D learning beyond discrete point-based settings.

Across all parts, this thesis emphasizes task-aligned Transformer design as a unifying principle. Rather than treating ViTs as generic backbones, we demonstrate how architectural adaptations, whether through positional encoding, attention structuring, or contrastive masking, enable more robust and interpretable representations across modalities and tasks. These insights not only improve performance and efficiency but also reveal new opportunities for integrating compact visual Transformers into larger systems, including vision-language and multimodal foundation models. In sum, this thesis advances the theory and practice of Transformer-based representation learning across 2D and 3D domains, laying the groundwork for scalable, semantically structured, and future-ready visual intelligence.
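To make the Part I idea more concrete, the sketch below illustrates the kind of partial mask-and-shuffle operation over ViT patches that the abstract describes. It is a minimal illustration under stated assumptions only: the function name mjp_shuffle, the 25% shuffle ratio, and the single shared "unknown" position embedding are placeholders for exposition, not the implementation from the thesis.

import torch

def mjp_shuffle(tokens, pos_embed, unk_pos_embed, ratio=0.25):
    """Shuffle a random subset of patch tokens and mask their positions.

    tokens:        (B, N, D) patch embeddings (class token excluded)
    pos_embed:     (1, N, D) learnable position embeddings
    unk_pos_embed: (D,)      shared embedding standing in for an unknown position
    ratio:         fraction of patches whose spatial order is permuted
    """
    B, N, D = tokens.shape
    n_shuffle = max(1, int(N * ratio))

    out_tokens = tokens.clone()
    out_pos = pos_embed.expand(B, -1, -1).clone()

    for b in range(B):
        idx = torch.randperm(N)[:n_shuffle]        # patches selected for the jigsaw
        perm = idx[torch.randperm(n_shuffle)]      # shuffled order of those patches
        out_tokens[b, idx] = tokens[b, perm]       # permute patch content
        out_pos[b, idx] = unk_pos_embed            # hide their true positions
    return out_tokens + out_pos

# usage (illustrative): x = mjp_shuffle(patch_tokens, vit_pos_embed, unk_embed)

The sketch only conveys the mechanism at the level the abstract states it: a subset of patches loses its spatial ordering and its explicit position information, which is what limits how much the position embeddings can reveal while the remaining patches preserve the spatial prior.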
15 October 2025
English
representation learning
self-supervised learning
transformers
low-level vision
3D representations
Sebe, Nicu
Cucchiara, Rita
Files in this item:
File: PhDAIThesis_BinRen_2025.pdf
Access: open access
License: Creative Commons
Size: 149.58 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/308230
The NBN code of this thesis is URN:NBN:IT:UNIPI-308230