From histories to states: practical automata-based methods for learning and solving regular decision processes

Author: Paludo Licks, Gabriel
Year: 2025

Abstract

Reinforcement Learning (RL) commonly relies on the Markov property, which asserts that the current state contains all relevant information for future decision-making. However, many real-world scenarios violate this property: agents often face observations that do not fully reveal the environment's true state or that exhibit dependencies spanning long histories. Partially Observable Markov Decision Processes (POMDPs) provide a comprehensive theoretical framework for such problems, yet exact solutions are computationally intractable, and existing learning-based methods largely depend on heuristic or approximate techniques. In this thesis, we focus on a more specialised class of partially observable systems known as Regular Decision Processes (RDPs). RDPs capture a wide range of non-Markovian dependencies while remaining more tractable than general POMDPs. Specifically, they assume that any temporal dependencies in the environment can be encoded by a finite-state automaton, thereby transforming the problem into a Markovian one once the automaton's internal states are included in the agent's state representation. This structural assumption, though less general than a full POMDP, is satisfied in many practical cases where the relevant history can be summarised by a finite memory mechanism. A principal contribution of this work is demonstrating how an RL agent can learn such an automaton-based representation autonomously, without explicit domain knowledge. We achieve this by leveraging automata learning, a field dedicated to inferring state machines from observed sequences, in combination with established RL methods. By examining how different histories lead to indistinguishable outcomes, the agent merges states that exhibit statistically equivalent future reward and observation distributions. The result is a compact, state-based model of the environment that effectively restores the Markov property. We then integrate this learned representation with classical RL algorithms. Empirical evaluations on benchmark tasks show that our approaches outperform baseline methods, particularly in domains featuring partial observability and extended temporal dependencies. Beyond improvements in sample efficiency, the explicit automaton structure also provides a degree of interpretability, potentially allowing for transfer to related tasks or examination by domain experts. Finally, we discuss directions for further development, including extensions to larger or noisier observation spaces, integration with function approximation, and potential applications in fields where partial observability and regular temporal dependencies are dominant.
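To make the core idea concrete, the following is a minimal Python sketch (not taken from the thesis) of how a finite-state automaton that tracks the relevant history restores the Markov property: the agent's state becomes the pair (automaton state, current observation), and ordinary tabular Q-learning is run on that product. The toy environment, the hand-written two-state automaton standing in for a learned one, and all hyperparameters are illustrative assumptions.

# Minimal sketch, under assumptions stated above (toy domain, hand-written
# automaton instead of a learned one). Illustrates Q-learning on the product
# of automaton state and current observation.

import random
from collections import defaultdict

# Toy non-Markovian environment: opening only pays off if "key" was observed
# at some earlier step, which the current observation alone does not reveal.
class ToyRDPEnv:
    ACTIONS = ["search", "open"]

    def reset(self):
        self.has_key = False
        self.t = 0
        return "start"

    def step(self, action):
        self.t += 1
        if action == "search" and not self.has_key and random.random() < 0.5:
            self.has_key = True
            obs = "key"          # the key is observed only at the moment it is found
        else:
            obs = "empty"
        reward = 1.0 if (action == "open" and self.has_key) else 0.0
        done = self.t >= 10 or reward > 0
        return obs, reward, done

# Hand-written automaton standing in for a learned one: it remembers whether
# "key" has ever been observed (q0 = "no key yet", q1 = "key seen").
AUTOMATON = {("q0", "start"): "q0", ("q0", "empty"): "q0", ("q0", "key"): "q1",
             ("q1", "start"): "q1", ("q1", "empty"): "q1", ("q1", "key"): "q1"}

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1):
    env = ToyRDPEnv()
    Q = defaultdict(float)                     # Q[((automaton_state, obs), action)]
    for _ in range(episodes):
        obs, q = env.reset(), "q0"
        done = False
        while not done:
            state = (q, obs)
            if random.random() < eps:
                action = random.choice(env.ACTIONS)
            else:
                action = max(env.ACTIONS, key=lambda a: Q[state, a])
            obs2, reward, done = env.step(action)
            q2 = AUTOMATON[(q, obs2)]          # advance the automaton on the new observation
            best_next = max(Q[(q2, obs2), a] for a in env.ACTIONS)
            Q[state, action] += alpha * (reward + gamma * (0.0 if done else best_next)
                                         - Q[state, action])
            obs, q = obs2, q2
    return Q

if __name__ == "__main__":
    Q = q_learning()
    # With the automaton state in the loop, "open" is preferred only after the key was seen.
    print("Q(q1, empty, open)   =", round(Q[("q1", "empty"), "open"], 2))
    print("Q(q1, empty, search) =", round(Q[("q1", "empty"), "search"], 2))

In the thesis setting the automaton is learned from interaction data by merging histories with statistically equivalent futures, rather than written by hand; the sketch only illustrates why the product state makes standard RL applicable.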
Date: 19 May 2025
Language: English
Supervisor: De Giacomo, Giuseppe
Institution: Università degli Studi di Roma "La Sapienza"
File: Tesi_dottorato_PaludoLicks.pdf (Adobe PDF, 1.37 MB, open access)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/212882
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-212882