Remarkable advancements in the field of Computer Vision, Artificial Intelligence and Machine Learning have led to unprecedented breakthroughs in what machines are able to achieve. In many tasks such as in Image Classification in fact, they are now capable of even surpassing human performance. While this is truly outstanding, there are still many tasks in which machines lag far behind. Walking in a room, driving on an highway, grabbing some food for example. These are all actions that feel natural to us but can be quite unfeasible for them. Such actions require to identify and localize objects in the environment, effectively building a robust understanding of the scene. Humans easily gain this understanding thanks to their binocular vision, which provides an high-resolution and continuous stream of information to our brain that efficiently processes it. Unfortunately, things are much different for machines. With cameras instead of eyes and artificial neural networks instead of a brain, gaining this understanding is still an open problem. In this thesis we will not focus on solving this problem as a whole, but instead delve into a very relevant part of it. We will in fact analyze how to make ma- chines be able to identify and precisely localize objects in the 3D space by relying only on visual input i.e. 3D Object Detection from Images. One of the most complex aspects of Image-based 3D Object Detection is that it inherently requires the solution of many different sub-tasks e.g. the estimation of the object’s distance and its rotation. A first contribution of this thesis is an analysis of how these sub-tasks are usually learned, highlighting a destructivebehavior which limits the overall performance and the proposal of an alternative learning method that avoids it. A second contribution is the discovery of a flaw in the computation of the metric which is widely used in the field, affecting the re-computation of the performance of all published methods and the introduction of a novel un-flawed metric which has now become the official one. A third contribution is focused on one particular sub-task, i.e. estimation of the object’s distance, which is demonstrated to be the most challenging. Thanks to the introduction of a novel approach which normalizes the appearance of objects with respect to their distance, detection performances can be greatly improved. A last contribution of the thesis is the critical analysis of the recently proposed Pseudo-LiDAR methods. Two flaws in their training protocol have been identified and analyzed. On top of this, a novel method able to achieve state-of-the-art in Image-based 3D Object Detection has been developed.

3D Object Detection from Images

Simonelli, Andrea
2022

Abstract

Remarkable advancements in the field of Computer Vision, Artificial Intelligence and Machine Learning have led to unprecedented breakthroughs in what machines are able to achieve. In many tasks such as in Image Classification in fact, they are now capable of even surpassing human performance. While this is truly outstanding, there are still many tasks in which machines lag far behind. Walking in a room, driving on an highway, grabbing some food for example. These are all actions that feel natural to us but can be quite unfeasible for them. Such actions require to identify and localize objects in the environment, effectively building a robust understanding of the scene. Humans easily gain this understanding thanks to their binocular vision, which provides an high-resolution and continuous stream of information to our brain that efficiently processes it. Unfortunately, things are much different for machines. With cameras instead of eyes and artificial neural networks instead of a brain, gaining this understanding is still an open problem. In this thesis we will not focus on solving this problem as a whole, but instead delve into a very relevant part of it. We will in fact analyze how to make ma- chines be able to identify and precisely localize objects in the 3D space by relying only on visual input i.e. 3D Object Detection from Images. One of the most complex aspects of Image-based 3D Object Detection is that it inherently requires the solution of many different sub-tasks e.g. the estimation of the object’s distance and its rotation. A first contribution of this thesis is an analysis of how these sub-tasks are usually learned, highlighting a destructivebehavior which limits the overall performance and the proposal of an alternative learning method that avoids it. A second contribution is the discovery of a flaw in the computation of the metric which is widely used in the field, affecting the re-computation of the performance of all published methods and the introduction of a novel un-flawed metric which has now become the official one. A third contribution is focused on one particular sub-task, i.e. estimation of the object’s distance, which is demonstrated to be the most challenging. Thanks to the introduction of a novel approach which normalizes the appearance of objects with respect to their distance, detection performances can be greatly improved. A last contribution of the thesis is the critical analysis of the recently proposed Pseudo-LiDAR methods. Two flaws in their training protocol have been identified and analyzed. On top of this, a novel method able to achieve state-of-the-art in Image-based 3D Object Detection has been developed.
28-set-2022
Inglese
Computer Vision, Object Recognition, Vision and Scene Understanding, Object Detection
Ricci, Elisa
Università degli studi di Trento
122
File in questo prodotto:
File Dimensione Formato  
Andrea_Simonelli_Phd_Thesis_3D_Object_Detection_from_Images.pdf

accesso aperto

Dimensione 4.87 MB
Formato Adobe PDF
4.87 MB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/179819
Il codice NBN di questa tesi è URN:NBN:IT:UNITN-179819