Human action recognition in the real world: handling domain shift in open-set, source-free and multi-source scenarios
Zara, Giacomo
2025
Abstract
Human behavior understanding, as an application of artificial intelligence and deep learning, has rapidly gained popularity over the past few years due to the crucial role it plays in emerging fields such as human-robot interaction, autonomous driving, drone footage analysis, sports and video surveillance. The variety of scenarios and conditions in which modern computer vision algorithms are expected to operate, along with the significant cost of collecting and annotating data, has encouraged the community to devise solutions for adapting models to visual and semantic domains potentially very different from those characterizing the training data. Formally, this task goes by the name of Domain Adaptation (DA), and a significant amount of effort has recently been devoted to it. While image-based content has been extensively addressed in this scope, the field of videos remains significantly less explored. This gap can be attributed to video data posing a considerably harder challenge than images at every step involved: it is more expensive to collect, store and annotate, and, even more importantly, harder to model and interpret. This latter aspect is mainly rooted in the additional complexity introduced by the temporal dimension, which poses the challenge of understanding and modeling dynamics that evolve simultaneously through space and time, potentially in a radically different fashion across domains. While the underlying theoretical problem is relevant and sound, it is usually addressed in its base formulation, which is characterized by a set of assumptions that rarely hold in real-world scenarios; these include, for instance, full knowledge of the categories present in the target domain, as well as direct access to the data used to train the models.
Driven by the purpose of dropping these assumptions, this thesis proposes a selection of different perspectives on the specific problem of domain adaptation for the task of video action recognition, contextualizing it in challenging and realistic settings and proposing a dedicated methodological approach to each of them. All proposed methods are carefully described and motivated, and their effectiveness is thoroughly demonstrated through an extensive experimental protocol, obtaining state-of-the-art results on the relevant benchmarks.

File | Size | Format | Access
---|---|---|---
phd_unitn_Zara_Giacomo.pdf | 6.93 MB | Adobe PDF | open access (Visualizza/Apri)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/189795
URN:NBN:IT:UNITN-189795