Design and implementation of machine learning algorithms on SoC based accelerators
MATTA, MARCO
2020
Abstract
Machine Learning (ML) is a field of Artificial Intelligence that uses statistical methods to enhance the performance of algorithms in many application fields (Bishop (2006)), for example financial trading (Cabrera et al. (2018)), big data management (L’heureux et al. (2017)), medicine (Chen et al. (2017)), security (Muhammad et al. (2018); Petrovic et al. (2018)), imaging and image processing (Wang (2016)), mobile apps (Sehgal and Kehtarnavaz (2018)) and more. ML techniques are usually classified into three main categories: Supervised, Unsupervised and Reinforcement Learning. The first two require a training phase to obtain an expert algorithm ready to be deployed in the field (inference phase). Supervised and unsupervised ML approaches rely on massive amounts of data, intensive offline training sessions and large parameter spaces (Kim et al. (2017)). Moreover, inference performance degrades when the statistics of the input data differ from those of the training examples; in these cases, further training sessions are required to update the model (Ray (2018)). In Reinforcement Learning (RL) the training and inference phases are not separated: the learner interacts with the environment to collect information and receives an immediate reward for each decision and action it takes. The reward is a numerical value that quantifies the quality of the action performed by the learner, whose aim is to maximize the reward through an iterative sequence of actions (Mohri et al. (2018)). Reinforcement Learning for multi-agent systems (MARL) is a growing research field as well. It is often bio-inspired (Navarro and Matía (2012)) and the learners are organized in ensembles, i.e. swarms, to improve learning capabilities (Brambilla et al. (2013); Matta et al. (2019b)). The increasing popularity of Machine Learning, both in research and industry, has come at the cost of ever more powerful and computationally complex computers. The growing computational load has pushed research toward better computing methods that improve efficiency, minimize latency and maximize throughput when processing very large amounts of data. In addition to CPU- and GPU-based solutions, many researchers have developed FPGA-based platforms implementing dedicated ML accelerators (Chen et al. (2014); Kara et al. (2017); Wang et al. (2016)).

This doctoral dissertation addresses the design and hardware implementation of Machine Learning algorithms on System-on-Chip (SoC) platforms, i.e. embedded systems that pair an FPGA fabric with a processor. The first chapter briefly presents Machine Learning and its main techniques, divided into Supervised, Unsupervised and Reinforcement Learning, with a mention of the most important models behind recent advances in this research field. The second chapter expands the discussion of Reinforcement Learning and the related Q-Learning model (whose core update is sketched below) and explores its potential and applications.
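For reference, the core of standard tabular Q-Learning is the update Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') − Q(s,a)). A minimal Python sketch follows; the environment interface (env.reset/env.step) and all hyperparameter values are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

# Minimal tabular Q-Learning sketch. The environment interface and the
# hyperparameter values below are illustrative assumptions.
def q_learning(env, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))   # Q-table: expected return per (state, action)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection: explore with probability epsilon
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # reward r quantifies the action's quality
            # Standard Q-Learning update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```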
In this context, two algorithms developed during the PhD course will be presented: an RL-based Timing Recovery Loop technique for the telecommunications field, and the Q-Real Time Swarm Intelligence (Q-RTS) algorithm for so-called "Swarm Reinforcement Learning", i.e. machine learning for groups of agents. Both algorithms feature a high degree of parallelism, which makes FPGA implementation particularly suitable. The results and characteristics of these research works have been published in two international journals (Matta et al. (2019a); Matta et al. (2019b)). The third chapter deals with implementation issues. The most challenging sources of computational load in the execution of Machine Learning algorithms will be identified. Subsequently, a method for implementing a trained LSTM network on FPGA will be presented, which exploits Dynamic Partial Reconfiguration to drastically reduce the memory required by the neural weights. Finally, the implementation of Q-RTS on a System on Chip will be presented, describing the centralized architecture of a highly parallel accelerator for RL training on FPGA. These research works have been the subject of two publications in international conference proceedings (Cardarilli et al. (2018); Cardarilli et al. (2021)).
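The actual Q-RTS combination rule is detailed in Matta et al. (2019b). Purely to illustrate why swarm RL maps well onto parallel hardware, the generic sketch below shows independent agents maintaining local Q-tables that are periodically merged into a shared global table; the element-wise-maximum merge used here is a hypothetical stand-in, not the Q-RTS rule.

```python
import numpy as np

# Generic swarm-RL illustration: agents learn independent Q-tables and
# periodically merge them. The element-wise maximum is a hypothetical
# merging rule chosen for simplicity; it is NOT the Q-RTS combination
# rule (see Matta et al. (2019b)).
def merge_q_tables(local_q_tables):
    # Stack the per-agent tables and keep the best estimate per entry.
    return np.max(np.stack(local_q_tables), axis=0)

# Each agent's local update is independent of the others until the merge,
# which is what lets the workload map naturally onto parallel FPGA logic.
agents = [np.random.rand(16, 4) for _ in range(8)]  # 8 agents, 16 states, 4 actions
global_q = merge_q_tables(agents)
```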
File | Size | Format
---|---|---
Tesi_dottorato_MarcoMatta.pdf (access only from BNCF and BNCR) | 5.55 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/215251
URN:NBN:IT:UNIROMA2-215251