Foundation Models for Automatic Labeling in Software Engineering

Colavito, Giuseppe

This thesis investigates the use of foundation models to automate labeling tasks in software engineering, focusing on issue classification as a primary case study. Issue tracking systems are essential for collaborative software development, yet manual labeling of issue reports is often inconsistent and time-consuming, with approximately 33.8% of reports being incorrectly labeled. Traditional supervised machine learning approaches require substantial labeled training data, creating barriers for new or resource-constrained projects. The research addresses two key questions: the extent to which foundation models can be leveraged for automated issue labeling, and which models offer the best trade-offs among performance, computational cost, and scalability. Through a series of studies, the work evaluates the impact of data quality on classification performance, examines few-shot learning approaches for limited-data scenarios, assesses generative language models in zero-shot and few-shot settings, and benchmarks a range of foundation models across hardware configurations. The approaches are validated through collaboration with NASA Goddard Space Flight Center on mission-critical flight software systems. Key findings demonstrate that BERT-based few-shot learning can outperform larger models on high-quality datasets, zero-shot methods achieve performance comparable to supervised approaches, and open-source models can match proprietary systems while offering transparency advantages. The research provides practical guidelines for model selection and supports progressive deployment strategies: organizations can initially adopt zero-shot generative models for rapid automation and transition to fine-tuned models as labeled data becomes available. This addresses the cold-start problem in automated classification systems.

Publication date: 28 Feb 2026
Language: Italian
Keywords: automated labeling; BERT; few-shot learning; foundation models; issue classification; issue tracking systems; large language models; LLMs; natural language processing; NLP; software engineering; zero-shot learning
Supervisor, Advisor, or Tutor: Novielli, Nicole; Lanubile, Filippo
Collection: Università degli Studi di Pisa

Files in this item:

File	Access	License	Size	Format
PhD_Thesis_Colavito.pdf	Embargoed until 02/03/2029	All rights reserved	9.14 MB	Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/362305

The NBN code of this thesis is URN:NBN:IT:UNIPI-362305