
Integrating multi-agent planning and reinforcement learning through reward and exploration machines

TRAPASSO, ALESSANDRO
2025

Abstract

Integrating automated planning with reinforcement learning (RL) is a longstanding goal in artificial intelligence, yet existing solutions struggle when rewards are non-Markovian, when agents must act concurrently, or when the state–action space explodes in multi-agent settings. This dissertation tackles these challenges by unifying symbolic planning techniques with model-based RL and automata-based reward representations. The key idea is to let formal planners supply the high-level temporal and concurrency structure of the task, while data-driven learners refine execution policies online. In doing so, the work bridges the complementary strengths of planning—foresight, structure and explainability—and of RL—adaptation to unknown or stochastic dynamics. Concretely, the thesis contributes: (i) a multi-agent planning formalism with explicit agent representation, implemented in the Unified Planning library to provide clear semantics and seamless compilation to existing MAP solvers; (ii) QR-Max, a PAC-MDP model-based RL algorithm for discrete-action Non-Markovian Reward Decision Processes that exploits reward-machine factorisation; (iii) an extension of QR-Max to cooperative multi-agent domains that shares learned dynamics while decoupling reward models; (iv) MARL-RM, a framework that automatically converts partial-order multi-agent plans into reward machines, thereby injecting concurrency and synchronisation constraints directly into decentralised training; and (v) a hierarchy of state abstractions, heuristic shaping and a Global Exploration Machine that densify sparse rewards and orchestrate safe, coordinated exploration. Across single- and multi-agent benchmarks—including robot logistics, concurrent grid-worlds and multi-UAV missions—the proposed methods require up to two orders of magnitude fewer environment interactions than strong model-free baselines, while guaranteeing interpretability and formal correctness of the learned behaviours. Taken together, these results demonstrate that tight, principled integration of planning and learning is not only feasible but also essential for scalable, cooperative artificial agents.
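
To make the central construct concrete, below is a minimal, illustrative sketch of a reward machine: a finite-state automaton whose transitions are triggered by high-level events and emit rewards, so that a non-Markovian task becomes Markovian over pairs of environment and machine states. The class, the event labels and the two-step delivery task are hypothetical examples for illustration only, not code from the thesis.

class RewardMachine:
    """Minimal reward machine: states, event-labelled transitions, emitted rewards."""

    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: {(machine_state, event_label): (next_machine_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.state = initial_state

    def step(self, event_label):
        """Advance on an observed event; unknown events leave the machine unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event_label), (self.state, 0.0))
        self.state = next_state
        return reward, self.state in self.terminal_states


# Hypothetical two-step delivery task: reach "pickup" first, then "dropoff".
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "pickup"):  ("u1", 0.0),   # sub-goal reached, no reward yet
        ("u1", "dropoff"): ("u2", 1.0),   # task completed, reward emitted
    },
    terminal_states={"u2"},
)

reward, done = rm.step("pickup")    # -> (0.0, False)
reward, done = rm.step("dropoff")   # -> (1.0, True)

An RL agent would learn over the joint space of environment state and machine state, which is the structure the reward-machine factorisation mentioned above exploits.
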
18 Oct 2025
English
PATRIZI, FABIO
IOCCHI, Luca
NAVIGLI, Roberto
Università degli Studi di Roma "La Sapienza"
199
Files in this item:
Tesi_dottorato_Trapasso.pdf — open access; licence: all rights reserved; 8.37 MB; Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/306640
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-306640