Enhancing Public Administration with Computational Linguistics: a Language Model for Italian Bureaucratic Language

AURIEMMA, SERENA
2025

Abstract

This thesis addresses the automatic analysis of texts written in bureaucratic Italian through the development of resources and the identification of computational linguistics and NLP approaches applicable to data from the Italian Public Administration (PA), with the goal of supporting its digital transformation. The research focuses on two main areas of intervention: streamlining the processing of administrative documents and improving the readability of PA texts. Sector-specific languages, such as bureaucratic Italian, often pose challenges for general-purpose language models, which lack the linguistic knowledge required to accurately perform domain-specific tasks. To address this issue, the thesis describes the stages leading to the development of BureauBERTo, an encoder-based language model and the first to be specialized in the Italian bureaucratic domain. BureauBERTo’s performance was tested and compared to other models using supervised, unsupervised, and prompt-based learning approaches, demonstrating the effectiveness of specialized models in domain-specific tasks, even with limited annotated data. The research also showed that specialized encoders offer an efficient and more sustainable solution for discriminative tasks compared to current large language models, while ensuring internal data governance for public institutions and fostering AI applications that are accessible even to smaller entities within the public sector.
9 July 2025
English
Italian bureaucratic language
administrative data
public administration
BureauBERTo
encoder
language model
specialized model
further pre-training
fine-tuning
prompting
Lenci, Alessandro
Files in this item:

PhD_Tesi_Auriemma_PDFA_2025.pdf
Availability: embargoed until 11 July 2028
Size: 3.76 MB
Format: Adobe PDF

Report_attivit_svolte_dottorato_etd_pdfa.pdf
Availability: not available
Size: 175.67 kB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/217899
The NBN code of this thesis is URN:NBN:IT:UNIPI-217899