
Integrating multi-agent planning and reinforcement learning through reward and exploration machines

TRAPASSO, ALESSANDRO
2025

Abstract

Integrating automated planning with reinforcement learning (RL) is a longstanding goal in artificial intelligence, yet existing solutions struggle when rewards are non-Markovian, when agents must act concurrently, or when the state–action space explodes in multi-agent settings. This dissertation tackles these challenges by unifying symbolic planning techniques with model-based RL and automata-based reward representations. The key idea is to let formal planners supply the high-level temporal and concurrency structure of the task, while data-driven learners refine execution policies online. In doing so, the work bridges the complementary strengths of planning—foresight, structure and explainability—and of RL—adaptation to unknown or stochastic dynamics. Concretely, the thesis contributes: (i) a multi-agent planning formalism with explicit agent representation, implemented in the Unified Planning library to provide clear semantics and seamless compilation to existing MAP solvers; (ii) QR-Max, a PAC-MDP model-based RL algorithm for discrete-action Non-Markovian Reward Decision Processes that exploits reward-machine factorisation; (iii) an extension of QR-Max to cooperative multi-agent domains that shares learned dynamics while decoupling reward models; (iv) MARL-RM, a framework that automatically converts partial-order multi-agent plans into reward machines, thereby injecting concurrency and synchronisation constraints directly into decentralised training; and (v) a hierarchy of state abstractions, heuristic shaping and a Global Exploration Machine that densify sparse rewards and orchestrate safe, coordinated exploration. Across single- and multi-agent benchmarks—including robot logistics, concurrent grid-worlds and multi-UAV missions—the proposed methods require up to two orders of magnitude fewer environment interactions than strong model-free baselines, while guaranteeing interpretability and formal correctness of the learned behaviours. Taken together, these results demonstrate that tight, principled integration of planning and learning is not only feasible but also essential for scalable, cooperative artificial agents.
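
To make the central construct concrete, below is a minimal, illustrative sketch of a reward machine: a finite-state automaton whose transitions are triggered by high-level events and emit rewards, so that a non-Markovian task becomes Markovian over pairs of environment and machine states. The class, the event labels and the two-step delivery task are hypothetical examples for illustration only, not code from the thesis.

class RewardMachine:
    """Minimal reward machine: states, event-labelled transitions, emitted rewards."""

    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: {(machine_state, event_label): (next_machine_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.state = initial_state

    def step(self, event_label):
        """Advance on an observed event; unknown events leave the machine unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event_label), (self.state, 0.0))
        self.state = next_state
        return reward, self.state in self.terminal_states


# Hypothetical two-step delivery task: reach "pickup" first, then "dropoff".
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "pickup"):  ("u1", 0.0),   # sub-goal reached, no reward yet
        ("u1", "dropoff"): ("u2", 1.0),   # task completed, reward emitted
    },
    terminal_states={"u2"},
)

reward, done = rm.step("pickup")    # -> (0.0, False)
reward, done = rm.step("dropoff")   # -> (1.0, True)

An RL agent would learn over the joint space of environment state and machine state, which is the structure the reward-machine factorisation mentioned above exploits.
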
18 Oct 2025
English
PATRIZI, FABIO
IOCCHI, Luca
NAVIGLI, Roberto
Università degli Studi di Roma "La Sapienza"
199
Files in this item:
Tesi_dottorato_Trapasso.pdf — open access; licence: all rights reserved; 8.37 MB; Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/306640
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-306640