A multi-FPGA high performance computing system for 3D FFT-based numerical simulations
AMMENDOLA, ROBERTO
2018
Abstract
In the field of High Performance Computing (HPC), communication among processes is a typical bottleneck for massively parallel scientific applications. The object of this research is the development of a network interface card with specific offloading capabilities that could help large-scale simulations in terms of communication latency and scalability with the number of computing elements. Until the early 2000s, general-purpose single-core CPU-based systems were the processing systems of choice for HPC applications. They replaced exotic supercomputing architectures because they were inexpensive, and their performance scaled with frequency in line with Moore’s Law. After the mid-2000s, the multi-core era began, as multi-core architectures were the only viable way to keep up with the predicted performance scaling. Around 2010, CPU-based systems augmented with hardware accelerators as co-processors started to emerge as an alternative to CPU-only systems. This opened up opportunities for accelerators, mainly General Purpose Graphics Processing Units (GPGPUs), to advance HPC to previously unattainable performance levels [1]. Since then, programmable device technology (namely Field Programmable Gate Arrays, or FPGAs), while matching the silicon complexity of a GPGPU, has struggled to emerge as a real competitor among accelerators, mainly due to (i) the lack of well-established high-level synthesis tools, (ii) higher costs and longer lead times, and (iii) poor results in terms of time-to-solution [2]. As these drawbacks have been mitigated, FPGA technology has recently become better known and more widespread by leveraging its distinctive features: the reconfigurable computing approach and high power efficiency [3] [4].
Moreover, thanks to the variety of embedded on-chip resources, today's FPGAs allow the offloading of increasingly complex tasks: not only the purely computational part of an algorithm, but also the communication part in the case of distributed parallel systems [5]. This thesis addresses one specific computational task, the three-dimensional Fast Fourier Transform (3D FFT), which places a particularly heavy load on the interconnection network when parallel systems are involved. The main goal of this study is to find a way to move part of the computational load closer to the network, in order to exploit the peculiarities of the communication patterns and ultimately take advantage of data reuse during transmission.

File | Size | Format
---|---|---
tesi_ammendola.pdf (access only from BNCF and BNCR) | 3.62 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/214476
URN:NBN:IT:UNIROMA2-214476