Intuitive, Information-Theoretic, and LLM-Interactive Robot Programming from Video Demonstrations

MERLO, ELENA
2026

Abstract

In recent decades, robotic platforms have become increasingly relevant across industrial, healthcare, and domestic domains. However, the complexity of robot programming still limits their accessibility and widespread adoption. To overcome this barrier, research has focused on developing intuitive programming methods that simplify robot instruction for non-experts. Among these, programming by demonstration enables robots to acquire skills directly from human performances, allowing domain experts to teach robots efficiently without coding expertise. Building on this concept, this thesis investigates a novel framework that combines easily obtainable video demonstrations with natural language instructions, exploiting the same cues humans use to teach one another, to make robot teaching more intuitive and broadly accessible. The proposed framework introduces the first exploration of Shannon's Information Theory (IT) for the representation of manual tasks. Within this framework, IT enables the extraction of active scene elements, quantifies the information shared between hands and objects, and provides insights into the motion patterns during manipulation and the hand-coordination strategy in bimanual activities. This allows robots to achieve a higher-level understanding of human-demonstrated manual tasks recorded in RGB videos. By recognizing the task structure and goals, the algorithm generalizes what was observed to unseen scenarios. Scene graphs are used to encode the extracted interaction features in a compact structure and to segment the demonstration into blocks. By detecting changes in the scene-graph topology, the system automatically generates robot plans in the form of behavior trees, enabling the robotic system to replicate the observed unimanual or bimanual task. By integrating semantic knowledge, the robot execution plan becomes interpretable to the user, who can inspect it and iteratively refine it through natural language requests to a Large Language Model (LLM). The LLM's common-sense reasoning elaborates user-specified goals and critical task aspects, letting the system adjust the vision-based plan to prevent potential failures and adapt it to the received instructions and preferences. The approach was evaluated by processing multiple video demonstrations collected in multi-subject experiments, which are part of HANDSOME, an open-source dataset of HAND Skills demOnstrated by Multi-subjEcts, created to encourage further research and benchmarking in this field. Additional experiments using data from publicly available datasets confirmed the generalization capability of the method. Moreover, experiments involving LLM reasoning showed that the framework is intuitive and usable even by non-experts, and that the system can refine and adapt execution plans interactively, mitigating hallucinations and reducing the need for additional demonstrations. This work shows promising results in reducing the complexity of programming robotic platforms, which still demands expert knowledge, time, and high costs. The proposed intuitive programming framework has demonstrated strong potential to democratize robot use and support the deployment of collaborative robots across various hybrid human-robot environments.
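
To give a concrete sense of the information-theoretic idea mentioned in the abstract, the following is a minimal sketch, not the thesis implementation: it estimates the mutual information shared between a hand trajectory and an object trajectory using simple speed profiles and histogram binning. All function and variable names here are illustrative assumptions.

import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in bits for two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


# Toy example: a grasped object moves with the hand (high shared information),
# while a distractor object stays still (low shared information), as a proxy
# for identifying the "active" scene elements described in the abstract.
rng = np.random.default_rng(0)
hand_speed = np.abs(rng.normal(0.2, 0.05, 500))
grasped_obj_speed = hand_speed + rng.normal(0.0, 0.01, 500)
distractor_speed = np.abs(rng.normal(0.0, 0.005, 500))

print("I(hand; grasped object):", mutual_information(hand_speed, grasped_obj_speed))
print("I(hand; distractor):    ", mutual_information(hand_speed, distractor_speed))

In such a sketch, elements whose motion shares high mutual information with a hand would be flagged as participating in the manipulation, while near-zero values would mark inactive scene elements.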
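The segmentation step can be illustrated similarly. The sketch below, again under illustrative assumptions rather than reproducing the thesis code, reduces each frame's scene graph to its set of hand-object interaction edges and starts a new block whenever that edge set changes, mirroring the topology-change criterion described in the abstract.

def segment_by_topology(frame_graphs):
    """Return (start, end, edges) blocks delimited by edge-set changes."""
    segments = []
    start, current = 0, frame_graphs[0]
    for t, edges in enumerate(frame_graphs[1:], start=1):
        if edges != current:              # topology changed -> close the block
            segments.append((start, t - 1, current))
            start, current = t, edges
    segments.append((start, len(frame_graphs) - 1, current))
    return segments


# Toy demonstration: the right hand grasps a cup, places it on a saucer,
# then releases it; each block would map to a node in the execution plan.
graphs = (
    [frozenset()] * 3
    + [frozenset({("right_hand", "cup")})] * 4
    + [frozenset({("right_hand", "cup"), ("cup", "saucer")})] * 2
    + [frozenset()] * 2
)

for start, end, edges in segment_by_topology(graphs):
    print(f"frames {start}-{end}:", sorted(edges) or "no interaction")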
26-feb-2026
English
Dr. Arash Ajoudani
MASSOBRIO, PAOLO
Università degli studi di Genova
Files in this record:
phdunige_4332934.pdf

Embargo until 26/02/2027

License: All rights reserved
Size: 10.4 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359868
The NBN code of this thesis is URN:NBN:IT:UNIGE-359868