Intuitive, Information-Theoretic, and LLM-Interactive Robot Programming from Video Demonstrations

MERLO, ELENA
2026

Abstract

In recent decades, robotic platforms have become increasingly relevant across industrial, healthcare, and domestic domains. However, the complexity of robot programming still limits their accessibility and widespread adoption. To overcome this barrier, research has focused on developing intuitive programming methods that simplify robot instruction for non-experts. Among these, programming by demonstration enables robots to acquire skills directly from human performances, allowing domain experts to teach robots efficiently without coding expertise. Building on this concept, this thesis investigates a novel framework that combines easily obtainable video demonstrations with natural language instructions, exploiting the same cues humans use to teach one another, to make robot teaching more intuitive and broadly accessible. The proposed framework introduces the first exploration of Shannon's Information Theory (IT) for the representation of manual tasks. Within this framework, IT enables the extraction of active scene elements, quantifies the information shared between hands and objects, and provides insights into the motion patterns during manipulation and the hand-coordination strategy in bimanual activities. This allows robots to achieve a higher-level understanding of human-demonstrated manual tasks recorded in RGB videos. By recognizing the task structure and goals, the algorithm generalizes what was observed to unseen scenarios. Scene graphs are used to encode the extracted interaction features in a compact structure and to segment the demonstration into blocks. By detecting changes in the scene-graph topology, the system automatically generates robot plans in the form of behavior trees, enabling the robotic system to replicate the observed unimanual or bimanual task. By integrating semantic knowledge, the robot execution plan becomes interpretable to the user, who can inspect it and iteratively refine it through natural language requests to a Large Language Model (LLM). The LLM's common-sense reasoning elaborates user-specified goals and critical task aspects, letting the system adjust the vision-based plan to prevent potential failures and adapt it to the received instructions and preferences. The approach was evaluated by processing multiple video demonstrations collected in multi-subject experiments, which are part of HANDSOME, an open-source dataset of HAND Skills demOnstrated by Multi-subjEcts, created to encourage further research and benchmarking in this field. Additional experiments using data from publicly available datasets confirmed the generalization capability of the method. Moreover, experiments involving LLM reasoning showed that the framework is intuitive and usable even by non-experts, and that the system can refine and adapt execution plans interactively, mitigating hallucinations and reducing the need for additional demonstrations. This work shows promising results in reducing the complexity of programming robotic platforms, which still demands expert knowledge, time, and high costs. The proposed intuitive programming framework has demonstrated strong potential to democratize robot use and support the deployment of collaborative robots across various hybrid human-robot environments.
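
To give a concrete sense of the information-theoretic idea mentioned in the abstract, the following is a minimal sketch, not the thesis implementation: it estimates the mutual information shared between a hand trajectory and an object trajectory using simple speed profiles and histogram binning. All function and variable names here are illustrative assumptions.

import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in bits for two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


# Toy example: a grasped object moves with the hand (high shared information),
# while a distractor object stays still (low shared information), as a proxy
# for identifying the "active" scene elements described in the abstract.
rng = np.random.default_rng(0)
hand_speed = np.abs(rng.normal(0.2, 0.05, 500))
grasped_obj_speed = hand_speed + rng.normal(0.0, 0.01, 500)
distractor_speed = np.abs(rng.normal(0.0, 0.005, 500))

print("I(hand; grasped object):", mutual_information(hand_speed, grasped_obj_speed))
print("I(hand; distractor):    ", mutual_information(hand_speed, distractor_speed))

In such a sketch, elements whose motion shares high mutual information with a hand would be flagged as participating in the manipulation, while near-zero values would mark inactive scene elements.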
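The segmentation step can be illustrated similarly. The sketch below, again under illustrative assumptions rather than reproducing the thesis code, reduces each frame's scene graph to its set of hand-object interaction edges and starts a new block whenever that edge set changes, mirroring the topology-change criterion described in the abstract.

def segment_by_topology(frame_graphs):
    """Return (start, end, edges) blocks delimited by edge-set changes."""
    segments = []
    start, current = 0, frame_graphs[0]
    for t, edges in enumerate(frame_graphs[1:], start=1):
        if edges != current:              # topology changed -> close the block
            segments.append((start, t - 1, current))
            start, current = t, edges
    segments.append((start, len(frame_graphs) - 1, current))
    return segments


# Toy demonstration: the right hand grasps a cup, places it on a saucer,
# then releases it; each block would map to a node in the execution plan.
graphs = (
    [frozenset()] * 3
    + [frozenset({("right_hand", "cup")})] * 4
    + [frozenset({("right_hand", "cup"), ("cup", "saucer")})] * 2
    + [frozenset()] * 2
)

for start, end, edges in segment_by_topology(graphs):
    print(f"frames {start}-{end}:", sorted(edges) or "no interaction")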
26-feb-2026
English
Dr. Arash Ajoudani
MASSOBRIO, PAOLO
Università degli studi di Genova
Files in this record:
phdunige_4332934.pdf

Embargo until 26/02/2027

License: All rights reserved
Size: 10.4 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359868
The NBN code of this thesis is URN:NBN:IT:UNIGE-359868