Video anomaly detection: ensuring the safety of human actions and street scenes

D'ARRIGO, STEFANO
2026

Abstract

Artificial Intelligence, particularly Computer Vision, holds immense potential to enhance human safety and advance society's digital transition. This thesis addresses the challenges of developing robust and efficient AI for complex, human-centered tasks, spanning from behavior monitoring to driving scenes. We analyze the task of Video Anomaly Detection and its related applications in human action monitoring, crowd occupancy estimation, and out-of-distribution (OOD) detection in street scenes. For human action monitoring, we propose two novel methods. COSKAD demonstrates the critical impact of latent space geometry on learning representations of expected human actions, showing that low-dimensional vectors can effectively embed complex spatio-temporal dependencies. MoCoDAD advances this by estimating the latent distribution of human motion, leveraging an action's inherent variability to robustly distinguish normal from abnormal behavior. Shifting from individual to group dynamics, STEERER-V introduces a method to precisely estimate a crowd's space occupancy, and by proxy its weight, directly from 2D RGB images. This approach bypasses computationally expensive intermediate steps and is accompanied by ANTHROPOS-V, a new benchmark to spur further research in this domain. Finally, to enhance the reliability of self-driving systems, CMS-OoD presents a cross-modal steering technique. It efficiently adapts a large Vision-Language Model to condition a semantic segmentation task model, significantly improving OOD detection. As a key benefit, this method also generates grounded textual explanations of the observed scene, fostering safer, more interpretable human-vehicle interaction. Collectively, these contributions demonstrate that through geometric priors, distributional assumptions, or cross-modal conditioning, we can develop AI systems that are more robust, efficient, and better aligned with human needs in complex environments.
28 Jan 2026
English
GALASSO, FABIO
SPINELLI, INDRO
GRISETTI, GIORGIO
Università degli Studi di Roma "La Sapienza"
98
Files in this item:
Tesi_dottorato_DArrigo.pdf

open access

License: Creative Commons
Size: 42.17 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/358530
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-358530