Zero-Shot Hierarchical Short Text Classification

MOIRAGHI MOTTA, FEDERICO
2025

Abstract

Classifying public tenders is useful both for companies invited to participate and for detecting fraudulent activity. To facilitate the task for participants and public administrations alike, the European Union introduced a common taxonomy, the Common Procurement Vocabulary (CPV), which is mandatory for tenders of a certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administration activity. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. First of all, some fine-grained classes have few or no observations in the training set, while other classes are far more frequent than the average, in some cases by a factor of thousands. To overcome these difficulties, we present a zero-shot approach called Hierarchical Cross Encoder (HCE), based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To test the proposed model, we used both state-of-the-art datasets and an industrial dataset from contrattipubblici.org (a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy in the last 25 years).
27-feb-2025
English
nlp; taxonomy; hierarchical; language model; bert
PALMONARI, MATTEO LUIGI
Files in this item:
phd_unimib_799735.pdf (open access)
Size: 2.47 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/193889
The NBN code of this thesis is URN:NBN:IT:UNIMIB-193889