Perceiving in time: see, understand, remember

Diko, Anxhelo

Rooted in the remarkable ability of human visual cognition to navigate the real world effortlessly, this thesis spans perception to memory, examining theoretically and empirically the fundamental concepts needed to realize an artificial visual intelligence: a machine that can autonomously perceive the world, understand semantics, reason in time, and remember. The journey begins with perception. Vision Transformers, despite their success, suffer from feature collapse, a loss of spatial structure in deeper layers caused by over-globalized attention. ReViT addresses this through a simple yet effective residual attention mechanism that preserves spatial awareness, the very foundation of visual perception. However, perception alone is insufficient. The real world unfolds through interconnected events defined by motion, interactions, and semantic patterns. This leads to S-GEAR, which tackles semantic understanding and temporal intelligence in the context of action anticipation, a challenge that goes beyond isolated actions to explore sequences, aiming to understand what has happened in order to predict what comes next. S-GEAR does so by explicitly embedding semantic relationships and temporal co-occurrence patterns into visual representations. Where existing anticipation methods treat events in isolation, S-GEAR recognizes that semantics constrain what can happen next, reducing uncertainty about the future. The final piece is memory, the mechanism that lets us recall past experiences when they are needed most. ReWind introduces a query-guided memory system that determines not just how to compress information, but also what to remember and what to forget. By coordinating reading, writing, and selection mechanisms, ReWind enables visual models to process long videos efficiently while maintaining coherent and relevant information.
Together, these contributions advance the path toward machines that not only see, but also understand and remember.

2026


Degree programme: Computer Science
Publication date: 26 January 2026
Language: English
Supervisor, Advisor, or Tutor: CINQUE, LUIGI
Publisher: Università degli Studi di Roma "La Sapienza"
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Anxhelo.pdf (open access; license: Creative Commons)	19.71 MB	Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357135

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-357135