Towards Embodied Intelligence and Autonomy in Robotic Loco-manipulation

WANG, JIN
2026

Abstract

Recent years have witnessed remarkable breakthroughs in the mobility and whole-body motion control of legged robots, such as humanoid and quadruped systems. These advances have enabled their deployment beyond structured environments, allowing them to traverse complex terrains and perform highly dynamic motions, such as parkour and backflips, that were previously infeasible with traditional control methods. Further equipping such robots with manipulation capabilities significantly enhances their functionality and value, enabling physical interaction with the environment. However, although conventional planning and control methods are effective for navigation or simple manipulation in known environments, they lack general physical understanding and task-level reasoning, making it difficult to coordinate robot behaviors for complex, long-horizon tasks. Meanwhile, the emergence of large language models (LLMs) and vision-language models (VLMs) has opened new possibilities for robotic cognition and decision-making, paving the way toward general autonomous and intelligent loco-manipulation in the physical world. This thesis presents multi-stage research toward achieving embodied intelligence and autonomy in robotic loco-manipulation. The first stage focuses on active perception and behavior planning in unstructured environments, introducing a foundation-model-based behavior planning framework for humanoid robots. By integrating LLMs with behavior trees, complex task instructions are decomposed into executable action sequences, and by fusing visual grounding modules such as object pose estimation and visual question answering, the robot can detect and recover from task failures during execution. Building on this foundation, we propose a hybrid learning and whole-body optimization method for learning complex manipulation behaviors, integrated with an online LLM-based hierarchical task graph that bridges high-level planning and low-level execution by decomposing long-horizon tasks into subtasks. The framework combines distilled spatial geometry and 2D observations with a VLM to ground knowledge into a robotic morphology selector, enabling robust loco-manipulation planning for complex wheeled-legged robotic systems. In the second stage, the thesis explores the cognitive dimension of robotic intelligence. An intuitive perceptor is developed by combining VLM-driven spatial understanding with relational reasoning, allowing robots to infer object affordances and context-consistent interactions without repetitive task instructions. Building upon this, a memory-driven cognitive reasoning framework is proposed, inspired by human-like semantic and episodic memory systems. Through a Memory Graph structure, the robot can associate current sensory input with past experiences, supporting cross-agent cognitive transfer and adaptive decision-making. Finally, the thesis explores the on-board deployment of foundation models for real-time task reasoning and policy generation. A Visual-Language-Policy (VLP) framework is introduced, enabling robots to dynamically generate and adjust task policies under disturbances and changing human instructions. Together, these contributions advance the field toward embodied autonomous loco-manipulation, where robots achieve integrated perception, reasoning, and planning in dynamic environments.
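
As a minimal illustrative sketch of the first-stage idea, an LLM-driven decomposition of a task instruction into a behavior-tree action sequence could take roughly the following shape. This Python fragment is not the implementation described in the thesis; all class, function, and skill names are hypothetical stand-ins, and the LLM call is stubbed with a canned plan.

# Hypothetical sketch: instruction -> subtask list -> behavior-tree Sequence.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Action:
    """Leaf node: a primitive robot skill reporting boolean success."""
    name: str
    execute: Callable[[], bool]

    def tick(self) -> bool:
        ok = self.execute()
        print(f"{self.name}: {'success' if ok else 'failure'}")
        return ok

@dataclass
class Sequence:
    """Composite node: succeeds only if every child succeeds, in order."""
    children: List[Action] = field(default_factory=list)

    def tick(self) -> bool:
        return all(child.tick() for child in self.children)

def decompose_instruction(instruction: str, skill_names: List[str]) -> List[str]:
    """Stand-in for an LLM call that splits an instruction into known skills.
    In practice this would prompt a language model with the instruction and
    the available skill list; here it returns a fixed plan for demonstration."""
    return ["walk_to_table", "grasp_cup", "place_cup_on_tray"]

# Hypothetical primitive skills exposed by a low-level whole-body controller.
skills: Dict[str, Callable[[], bool]] = {
    "walk_to_table": lambda: True,
    "grasp_cup": lambda: True,
    "place_cup_on_tray": lambda: True,
}

plan = decompose_instruction("Bring the cup to the tray", list(skills))
tree = Sequence([Action(name, skills[name]) for name in plan if name in skills])
tree.tick()

In a full system, a failed tick of any Action would trigger the visual grounding modules mentioned in the abstract (pose estimation, visual question answering) to diagnose the failure and replan.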
26-feb-2026
Language: English
Tsagarakis, Nikos
MASSOBRIO, PAOLO
Università degli studi di Genova
Files in this item:
phdunige_5543155.pdf (Adobe PDF, 7.65 MB), under embargo until 26/02/2027. License: All rights reserved.

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359749
The NBN code of this thesis is URN:NBN:IT:UNIGE-359749