
Learning in Monte Carlo Tree Search Planning

ZUCCOTTO, MADDALENA
2024

Abstract

Artificial intelligence is playing an increasingly important role in both industry and society, as evidenced by recent applications in areas such as autonomous driving, personalized shopping, and fraud prevention. Reinforcement Learning (RL) is a prominent machine learning paradigm that focuses on learning policy functions, i.e., functions that select sequences of actions enabling agents to optimally achieve their goals in the environment in which they act. RL has recently demonstrated strong potential in scenarios where agents must operate in unknown environments, adapting to unexpected (or partially specified) situations. This thesis covers three topics related to RL.

The first topic addresses the problem of learning state-variable relationships in Partially Observable Markov Decision Processes (POMDPs) to improve planning performance. Specifically, we focus on Partially Observable Monte Carlo Planning (POMCP) and represent the acquired knowledge with a Markov Random Field (MRF). Three methods are proposed to compute the MRF parameters while the agent acts in the environment. Our techniques acquire information from the outcomes of agent actions and from the agent’s belief. We answer a key question: “When can the learned state-variable relationships be trusted?”. Criteria based on confidence intervals and convergence are introduced to determine when the MRF is accurate enough and the learning process can be stopped. We test this technique on two domains: rocksample, a standard rover exploration task, and a velocity regulation problem for industrial mobile robotic platforms. Results show that the proposed approach effectively learns state-variable probabilistic constraints and outperforms standard POMCP with no computational overhead. Finally, a ROS-based architecture is proposed that supports MRF learning, adaptation, and use in POMCP on real robotic platforms.

The second topic tackles the problem of learning the transition model of the environment in Monte Carlo Tree Search (MCTS) using information from state transition traces. The learned transition model is then used to improve the performance of the policy generated by MCTS. Here we focus on fully observable environments represented by Markov Decision Processes (MDPs). We propose an MCTS-based planning approach that starts from a black-box approximate model of the environment, developed by an expert using any kind of modeling framework, and improves this model as new information from the environment is collected. This is crucial in real-world applications, since complete knowledge of complex environments is impractical to obtain. The expert’s model is first translated into a neural network and then updated periodically using data (i.e., state-action-next-state triplets) collected as the agent acts in the environment. We propose three different methods to integrate these data with the prior knowledge provided by the expert, and we evaluate our approach in a domain concerning air quality and thermal comfort control in smart buildings. We compare the performance of MCTS using each of the three proposed model learning techniques with the performance of standard MCTS using the expert’s model (without adaptation), Proximal Policy Optimization (a popular model-free deep reinforcement learning (DRL) approach), and Stochastic Lower Bounds Optimization (a popular model-based DRL approach). Results show that our approach outperforms all the competitors.
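To make the model-learning loop of the second topic more concrete, the following Python sketch shows the general idea of periodically fine-tuning a neural-network transition model on (state, action, next-state) triplets collected while the agent acts, so that a planner such as MCTS can simulate with the updated model. This is only an illustrative sketch under assumed details, not the thesis implementation: the toy environment (toy_env_step), the network architecture (TransitionNet), the update schedule, and the random action selection standing in for MCTS are all hypothetical placeholders.

# Minimal sketch (not the thesis implementation): periodically fine-tune a
# neural-network transition model from (state, action, next_state) triplets
# collected while the agent acts, so a planner can simulate with the updated model.
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3

class TransitionNet(nn.Module):
    """Approximate transition model: (state, one-hot action) -> next state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(),
            nn.Linear(64, STATE_DIM),
        )

    def forward(self, state, action):
        a = nn.functional.one_hot(action, N_ACTIONS).float()
        return self.net(torch.cat([state, a], dim=-1))

def toy_env_step(state, action):
    """Stand-in for the real (unknown) environment dynamics."""
    drift = torch.zeros(STATE_DIM)
    drift[action % STATE_DIM] = 0.1
    return state + drift + 0.01 * torch.randn(STATE_DIM)

def update_model(model, triplets, epochs=5, lr=1e-3):
    """Fine-tune the model on the collected (s, a, s') triplets."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    s, a, s_next = (torch.stack(x) for x in zip(*triplets))
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(s, a), s_next)
        loss.backward()
        opt.step()
    return loss.item()

model = TransitionNet()              # in the thesis, initialized from the expert's model
replay, state = [], torch.zeros(STATE_DIM)
for step in range(200):
    action = torch.tensor(random.randrange(N_ACTIONS))  # placeholder for the MCTS action choice
    next_state = toy_env_step(state, action)
    replay.append((state, action, next_state))
    state = next_state
    if (step + 1) % 50 == 0:         # periodic model update, as described in the abstract
        print(f"step {step + 1}: model loss {update_model(model, replay):.4f}")

In the approach described above, the network would instead be obtained by translating the expert’s black-box model, and the action at each step would come from MCTS simulations run with the current network rather than from a random choice.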
Finally, the third topic of this thesis concerns the recent application of RL to environmental sustainability, an application domain in which uncertainty challenges strategy learning and adaptation. We survey the literature to identify the main applications of RL in this domain and the predominant methods employed to address its main challenges. We analyze 181 papers and answer seven research questions, e.g., “How many academic studies have been published from 2003 to 2023 about RL for environmental sustainability?” and “What were the application domains and the methodologies used?”. Our analysis reveals exponential growth in this field over the past two decades, with a growth rate of 0.42 in the number of publications (from 2 papers in 2007 to 53 in 2022), a strong interest in sustainability issues related to energy-related fields, and a preference for single-agent RL approaches. Finally, the survey provides practitioners with a clear overview of the main challenges and open problems in this area that should be tackled in future research.

In summary, this thesis delves into three aspects of RL and its applications in different scenarios. In all research lines, we observe that explicitly modeling some elements of the environment, together with the information gathered during the agent-environment interaction, as in model-based RL, can improve the policy learning process in terms of sample efficiency and policy performance. Furthermore, MCTS and POMCP have proven well suited to the implementation of model-based RL algorithms, since they natively use a model of the environment in their simulations. Learning this model (or part of it) as the agent acts in the environment is an interesting challenge that we began to tackle in this work and that opens promising future directions for model-based RL. The code of the proposed methodologies is open-source and available at https://github.com/Zucchy/MCTS_planning.
Year: 2024
Language: English
Pages: 201
Files in this item:
Thesis_Zuccotto.pdf (open access, 41.63 MB, Adobe PDF)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/161061
The NBN code of this thesis is URN:NBN:IT:UNIVR-161061