Deep Learning for Human Behaviour Understanding: A Comprehensive Study of Trajectory, Pose, and Gaze in Social and Human-Robot Interaction Scenarios
TOAIARI, ANDREA
2025
Abstract
Human behaviour and communication are deeply rooted in non-verbal cues, including body movements (Kinesics), the spatial relationships between individuals (Proxemics) and the focus of visual attention (where and what a person is looking at). These subtle cues form an essential part of social interactions and of the relationship a person has with their environment, and understanding them is critical for a range of applications, from Augmented Reality (AR) and Human-Robot Interaction (HRI) to surveillance and healthcare-related studies. With recent advances in deep learning and computer vision, methodologies for analysing such behaviours have become increasingly sophisticated, enabling researchers to model trajectories, poses and gaze with unprecedented accuracy. However, these tasks are complicated by multiple factors: the dynamic nature of human behaviour, the need for privacy-preserving methodologies, the challenge of modelling complex 3D environments and the lack of adequate datasets for dedicated use cases.

This thesis explores how deep neural networks can be leveraged to analyse human kinesics and proxemics, focusing on three key aspects: Human Trajectory Prediction (HTP); Human Pose Estimation (HPE) and Human Pose Forecasting (HPF); and Gaze Estimation and Gaze Target Detection (GTD). The thesis is structured around a series of interrelated works that address fundamental challenges in data collection and ethical concerns around privacy, while advancing state-of-the-art methods for understanding human behaviour.

We present two datasets. The first addresses the Human-Environment Interaction (HEI) scenario and is designed to study the Visual Selective Attention (VSA) of single individuals and pairs of people moving in a laboratory, drawing on Social Signal Processing (SSP) insights and mapping attention onto a 3D model of the scene. The second tackles HRI from the perspective of a Spot quadruped robot and is equipped with HPE, HPF and Collision Prediction (CP) benchmarks on the video data acquired by the robot, a setting that poses an interesting challenge because parts of the human body are often not visible.

We also present a novel model for HTP in indoor environments, where, in contrast to the outdoor setting, it is crucial to cope with multiple path options and countless layout configurations. This is achieved by combining equivariant and invariant geometric feature-learning modules with a self-supervised visual representation extracted from the environment map. For the GTD task, i.e., finding the target of a person's gaze, we present a novel Multi-Task Learning (MTL) methodology for 3D GTD that exploits the upper-body pose and the depth map to preserve the privacy of the observed subject, achieving state-of-the-art results on the most comprehensive dataset available in the literature. Finally, we present a real-world application in which a real-time Gaze Estimation algorithm is integrated into a smart shopping mall, tracking users' favourite products and offering each customer a personalized shopping experience.

In summary, this thesis analyses the application of deep learning and computer vision techniques to understanding and interpreting human behaviour observed through the lens of a camera. We derive several insights, from the need for larger-scale datasets to tackle these tasks in a landscape now dominated by foundation models, to the difficulty of solving these challenges in a world where privacy and safety are more critical than ever.
| File | Size | Format |
|---|---|---|
| AT_PhD_Thesis_Final.pdf (open access) | 28.96 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/202922
URN:NBN:IT:UNIVR-202922