Integrating multi-agent planning and reinforcement learning through reward and exploration machines
TRAPASSO, ALESSANDRO
2025
Abstract
Integrating automated planning with reinforcement learning (RL) is a longstanding goal in artificial intelligence, yet existing solutions struggle when rewards are non-Markovian, when agents must act concurrently, or when the state–action space explodes in multi-agent settings. This dissertation tackles these challenges by unifying symbolic planning techniques with model-based RL and automata-based reward representations. The key idea is to let formal planners supply the high-level temporal and concurrency structure of the task, while data-driven learners refine execution policies online. In doing so, the work bridges the complementary strengths of planning—foresight, structure and explainability—and of RL—adaptation to unknown or stochastic dynamics. Concretely, the thesis contributes: (i) a multi-agent planning formalism with explicit agent representation, implemented in the Unified Planning library to provide clear semantics and seamless compilation to existing MAP solvers; (ii) QR-Max, a PAC-MDP model-based RL algorithm for discrete-action Non-Markovian Reward Decision Processes that exploits reward-machine factorisation; (iii) an extension of QR-Max to cooperative multi-agent domains that shares learned dynamics while decoupling reward models; (iv) MARL-RM, a framework that automatically converts partial-order multi-agent plans into reward machines, thereby injecting concurrency and synchronisation constraints directly into decentralised training; and (v) a hierarchy of state abstractions, heuristic shaping and a Global Exploration Machine that densify sparse rewards and orchestrate safe, coordinated exploration. Across single- and multi-agent benchmarks—including robot logistics, concurrent grid-worlds and multi-UAV missions—the proposed methods achieve up to two orders of magnitude fewer environment interactions than strong model-free baselines, while guaranteeing interpretability and formal correctness of the learned behaviours. Taken together, these results demonstrate that tight, principled integration of planning and learning is not only feasible but also essential for scalable, cooperative artificial agents.
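The automaton-based representation behind contributions (ii) to (iv) can be made concrete with a short sketch. The class names, the labelling function and the "visit A, then B" task below are illustrative assumptions, not the thesis's actual code or the QR-Max implementation; the sketch only shows how tracking a reward-machine state alongside the environment state turns a non-Markovian reward into a Markovian one on the product process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple


@dataclass
class RewardMachine:
    """Finite automaton over propositional labels that emits rewards on its edges."""
    initial: str
    # (rm_state, set of true propositions) -> (next rm_state, reward)
    delta: Dict[Tuple[str, FrozenSet[str]], Tuple[str, float]]
    terminal: FrozenSet[str]

    def step(self, u: str, labels: FrozenSet[str]) -> Tuple[str, float]:
        # Stay in the same RM state with zero reward if no edge matches.
        return self.delta.get((u, labels), (u, 0.0))


# Hypothetical task: reward 1 only after visiting A and then B.
# The reward depends on history, so it is non-Markovian on raw environment states.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", frozenset({"at_A"})): ("u1", 0.0),
        ("u1", frozenset({"at_B"})): ("u_acc", 1.0),
    },
    terminal=frozenset({"u_acc"}),
)


def product_step(env_state, u, action,
                 env_step: Callable, labeler: Callable[[object], FrozenSet[str]]):
    """One transition of the product process over (environment state, RM state).

    On the product, the reward is Markovian: a model-based learner in the spirit
    of QR-Max can estimate the unknown environment dynamics from experience and
    read rewards from the known reward machine, rather than relearning them.
    """
    next_env = env_step(env_state, action)          # unknown dynamics, learned
    next_u, reward = rm.step(u, labeler(next_env))  # known, supplied by the RM
    done = next_u in rm.terminal
    return (next_env, next_u), reward, done
```

In the same spirit, contribution (iv) would populate such an automaton automatically from a partial-order multi-agent plan, so that ordering and synchronisation constraints become RM edges rather than hand-written reward logic.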
| File | Size | Format | License |
|---|---|---|---|
| Tesi_dottorato_Trapasso.pdf (open access) | 8.37 MB | Adobe PDF | All rights reserved |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/306640
URN:NBN:IT:UNIROMA1-306640