Zero-Shot Hierarchical Short Text Classification
MOIRAGHI MOTTA, FEDERICO
2025
Abstract
Classifying public tenders is useful both for companies invited to participate and for detecting fraudulent activity. To facilitate the task for participants and public administrations alike, the European Union introduced a common taxonomy, the Common Procurement Vocabulary (CPV), which is mandatory for tenders of a certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administration activity. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. In particular, some fine-grained classes have few or no observations in the training set, while other classes are far more frequent than the average, in some cases by a factor of thousands. To overcome these difficulties, we present a zero-shot approach called Hierarchical Cross Encoder (HCE), based on a pre-trained language model, that relies only on label descriptions and respects the label taxonomy. To test the proposed model, we used both state-of-the-art datasets and an industrial dataset from contrattipubblici.org, a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy over the last 25 years.

File | Size | Format |
---|---|---|
phd_unimib_799735.pdf (open access) | 2.47 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/193889
URN:NBN:IT:UNIMIB-193889