Zero-Shot Hierarchical Short Text Classification

MOIRAGHI MOTTA, Federico

Classifying public tenders is useful both for companies invited to participate and for detecting fraudulent activity. To facilitate the task for participants and public administrations alike, the European Union introduced a common taxonomy, the Common Procurement Vocabulary (CPV), which is mandatory for tenders of a certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administration activity. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. In particular, some fine-grained classes have few or no observations in the training set, while other classes are far more frequent than average, in some cases by a factor of thousands. To overcome these difficulties, we present a zero-shot approach called Hierarchical Cross Encoder (HCE), based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To test the proposed model, we used both state-of-the-art datasets and an industrial dataset from contrattipubblici.org (a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy in the last 25 years).

2025

	Publication date
	
				27-Feb-2025
			
	Language
	
				English
			
	Keywords
	
				nlp; taxonomy; hierarchical; language model; bert
			
	Supervisor, Advisor or Tutor
	
				PALMONARI, MATTEO LUIGI
			
	Collection
	
				Università degli Studi di Milano - Bicocca

Files in this item:

File	Size	Format
phd_unimib_799735.pdf (open access)	2.47 MB	Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/193889

The NBN code of this thesis is URN:NBN:IT:UNIMIB-193889