Enhancing Public Administration with Computational Linguistics: a Language Model for Italian Bureaucratic Language
AURIEMMA, SERENA
2025
Abstract
This thesis addresses the automatic analysis of texts written in bureaucratic Italian through the development of resources and the identification of computational linguistics and NLP approaches applicable to data from the Italian Public Administration (PA), with the goal of supporting its digital transformation. The research focuses on two main areas of intervention: streamlining the processing of administrative documents and improving the readability of PA texts. Sector-specific languages, such as bureaucratic Italian, often pose challenges for general-purpose language models, which lack the linguistic knowledge required to accurately perform domain-specific tasks. To address this issue, the thesis describes the stages leading to the development of BureauBERTo, an encoder-based language model and the first to be specialized in the Italian bureaucratic domain. BureauBERTo’s performance was tested and compared to other models using supervised, unsupervised, and prompt-based learning approaches, demonstrating the effectiveness of specialized models in domain-specific tasks, even with limited annotated data. The research also showed that specialized encoders offer an efficient and more sustainable solution for discriminative tasks compared to current large language models, while ensuring internal data governance for public institutions and fostering AI applications that are accessible even to smaller entities within the public sector.

File | Access | Size | Format |
---|---|---|---|
PhD_Tesi_Auriemma_PDFA_2025.pdf | embargoed until 11/07/2028 | 3.76 MB | Adobe PDF |
Report_attivit_svolte_dottorato_etd_pdfa.pdf | not available | 175.67 kB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/217899
URN:NBN:IT:UNIPI-217899