Application of language models on code analysis

CONSOLE, FRANCESCA
2024

Abstract

Code analysis is key to improving software quality and efficiency, and it becomes even more important when securing code against potential cyber-attacks. However, manual code analysis, especially of binary code, is complicated and error-prone. The investigation of new automatic techniques for code analysis is therefore a research topic of great interest. According to the "naturalness hypothesis", code exhibits statistical properties similar to those of natural languages. As a consequence, techniques used for natural language processing can also be applied to analyze source and binary code. For this reason, recent research applies neural language models to code analysis, achieving significant results. In line with this research trend, the two contributions of this thesis focus on the application of deep learning to the analysis of code written in high-level and low-level programming languages. The first contribution introduces a benchmark designed to evaluate models for binary code representation. The benchmark can be used to test and compare the performance of these models on various binary function tasks. The second contribution focuses on the application of neural networks to source code analysis. It investigates the use of neural language models for detecting code smells, which represent poor design choices that can potentially impact code quality.
24-Sep-2024
English
QUERZONI, Leonardo
DI LUNA, GIUSEPPE ANTONIO
NAVIGLI, Roberto
Università degli Studi di Roma "La Sapienza"
Files in this item:
File: Tesi_dottorato_Console.pdf
Access: open access
Size: 3.08 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/164463
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-164463