Perceiving in time: see, understand, remember

DIKO, ANXHELO
2026

Abstract

Inspired by the ease with which human visual cognition navigates the real world, this thesis spans perception through memory, examining theoretically and empirically the fundamental concepts needed to realize artificial visual intelligence: a machine able to autonomously perceive the world, understand semantics, reason in time, and remember. The journey begins with perception. Vision Transformers, despite their success, suffer from feature collapse, a loss of spatial structure in deeper layers caused by over-globalized attention. ReViT addresses this through a simple yet effective residual attention mechanism that preserves spatial awareness, the very foundation of visual perception. However, perception alone is insufficient. The real world unfolds through interconnected events defined by motion, interactions, and semantic patterns. This leads to S-GEAR, which addresses semantic understanding and temporal intelligence in the context of action anticipation, a challenge that goes beyond isolated actions to consider sequences, aiming to understand what has happened in order to predict what comes next. S-GEAR does so by explicitly embedding semantic relationships and temporal co-occurrence patterns into visual representations. Where existing anticipation methods treat events in isolation, S-GEAR recognizes that semantics constrain what can happen next, dramatically reducing future uncertainty. The final piece is memory, the mechanism that enables us to recall past experiences when they are needed most. ReWind introduces a query-guided memory system that determines not just how to compress information, but also what to remember and what to forget. By coordinating reading, writing, and selection mechanisms, ReWind enables visual models to process long videos efficiently while maintaining coherent and relevant information.
Together, these contributions advance the path toward machines that not only see, but also understand and remember.
26 Jan 2026
English
CINQUE, LUIGI
Università degli Studi di Roma "La Sapienza"
Files in this item:
File: Tesi_dottorato_Anxhelo.pdf
Access: open access
License: Creative Commons
Size: 19.71 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357135
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-357135