Towards Embodied Intelligence and Autonomy in Robotic Loco-manipulation

WANG, JIN
2026

Abstract

Recent years have witnessed remarkable breakthroughs in the mobility and whole-body motion control of legged robots, such as humanoid and quadruped systems. These advances have enabled their deployment beyond structured environments, allowing them to traverse complex terrains and perform highly dynamic motions, such as parkour and backflips, that were previously infeasible with traditional control methods. Further equipping such robots with manipulation capabilities significantly enhances their functionality and value, enabling physical interaction with the environment. However, although conventional planning and control methods are effective for navigation or simple manipulation in known environments, they lack general physical understanding and task-level reasoning, making it difficult to coordinate robot behaviors for complex, long-horizon tasks. Meanwhile, the emergence of large language models (LLMs) and vision-language models (VLMs) has opened new possibilities for robotic cognition and decision-making, paving the way toward general autonomous and intelligent loco-manipulation in the physical world. This thesis presents multi-stage research toward achieving embodied intelligence and autonomy in robotic loco-manipulation. The first stage focuses on active perception and behavior planning in unstructured environments, introducing a foundation-model-based behavior planning framework for humanoid robots. By integrating LLMs with behavior trees, complex task instructions are decomposed into executable action sequences, and by fusing visual grounding modules such as object pose estimation and visual question answering, the robot can detect and recover from task failures during execution. Building on this foundation, we propose a hybrid learning and whole-body optimization method for learning complex manipulation behaviors, integrated with an online LLM-based hierarchical task graph that bridges high-level planning and low-level execution by decomposing long-horizon tasks into subtasks. The framework combines distilled spatial geometry and 2D observations with a VLM to ground knowledge into a robotic morphology selector, enabling robust loco-manipulation planning for complex wheeled-legged robotic systems. In the second stage, the thesis explores the cognitive dimension of robotic intelligence. An intuitive perceptor is developed by combining VLM-driven spatial understanding with relational reasoning, allowing robots to infer object affordances and context-consistent interactions without repetitive task instructions. Building upon this, a memory-driven cognitive reasoning framework is proposed, inspired by human-like semantic and episodic memory systems. Through a Memory Graph structure, the robot can associate current sensory input with past experiences, supporting cross-agent cognitive transfer and adaptive decision-making. Finally, the thesis explores the on-board deployment of foundation models for real-time task reasoning and policy generation. A Visual-Language-Policy (VLP) framework is introduced, enabling robots to dynamically generate and adjust task policies under disturbances and changing human instructions. Together, these contributions advance the field toward embodied autonomous loco-manipulation, where robots achieve integrated perception, reasoning, and planning in dynamic environments.
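
As a minimal illustrative sketch of the first-stage idea, an LLM-driven decomposition of a task instruction into a behavior-tree action sequence could take roughly the following shape. This Python fragment is not the implementation described in the thesis; all class, function, and skill names are hypothetical stand-ins, and the LLM call is stubbed with a canned plan.

# Hypothetical sketch: instruction -> subtask list -> behavior-tree Sequence.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Action:
    """Leaf node: a primitive robot skill reporting boolean success."""
    name: str
    execute: Callable[[], bool]

    def tick(self) -> bool:
        ok = self.execute()
        print(f"{self.name}: {'success' if ok else 'failure'}")
        return ok

@dataclass
class Sequence:
    """Composite node: succeeds only if every child succeeds, in order."""
    children: List[Action] = field(default_factory=list)

    def tick(self) -> bool:
        return all(child.tick() for child in self.children)

def decompose_instruction(instruction: str, skill_names: List[str]) -> List[str]:
    """Stand-in for an LLM call that splits an instruction into known skills.
    In practice this would prompt a language model with the instruction and
    the available skill list; here it returns a fixed plan for demonstration."""
    return ["walk_to_table", "grasp_cup", "place_cup_on_tray"]

# Hypothetical primitive skills exposed by a low-level whole-body controller.
skills: Dict[str, Callable[[], bool]] = {
    "walk_to_table": lambda: True,
    "grasp_cup": lambda: True,
    "place_cup_on_tray": lambda: True,
}

plan = decompose_instruction("Bring the cup to the tray", list(skills))
tree = Sequence([Action(name, skills[name]) for name in plan if name in skills])
tree.tick()

In a full system, a failed tick of any Action would trigger the visual grounding modules mentioned in the abstract (pose estimation, visual question answering) to diagnose the failure and replan.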
26-feb-2026
Language: English
Tsagarakis, Nikos
MASSOBRIO, PAOLO
Università degli studi di Genova
Files in this item:
phdunige_5543155.pdf (Adobe PDF, 7.65 MB), under embargo until 26/02/2027. License: All rights reserved.

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359749
The NBN code of this thesis is URN:NBN:IT:UNIGE-359749