Towards comprehensive and efficient information extraction across languages

TEDESCHI, SIMONE
2025

Abstract

The exponential growth of textual data shared online has created an urgent need for methods that can effectively extract, structure, and interpret information from vast and varied texts. Information Extraction (IE), a key area within Natural Language Processing (NLP), addresses this need by transforming unstructured text into structured formats, enabling automated text analytics and decision-making. However, existing IE systems face substantial challenges in scalability and generalization. These challenges include limited labeled data for low-resource languages, computational demands that restrict accessibility to well-resourced institutions, and a predominant focus on popular entities. Additionally, most IE tasks are entity-centric (e.g., Named Entity Recognition (NER), Entity Disambiguation, and Relation Extraction) and thus overlook the broader contextual richness present in many texts. This thesis aims to advance the field of IE by tackling these critical issues through novel resources, methodologies, and theoretical approaches that foster a multilingual, scalable, and semantically enriched IE framework. To bridge the multilingual gap, we leverage a combination of neural and knowledge-based approaches and create multilingual datasets for NER and Relation Extraction, so that IE systems can operate effectively across diverse linguistic settings. On the computational front, we propose optimizations designed to reduce the resource requirements of IE models, especially in the context of Entity Disambiguation, enabling broader adoption of NLP technologies by reducing dependence on high-performance hardware and extensive labeled datasets. Additionally, this work challenges traditional IE frameworks by expanding the focus beyond named entities to encompass abstract concepts, idiomatic expressions, and tail entities, which are essential for a more nuanced and comprehensive understanding of texts.
Through these contributions, this research aims to establish a robust foundation for multilingual, resource-efficient IE systems that can meet the evolving demands of global text analytics across varied languages, domains, and cultural contexts. Finally, to further encourage the use and development of multilingual IE systems, we publicly release all the artifacts (datasets and models) introduced in this thesis.
24 January 2025
English
NAVIGLI, Roberto
LENZERINI, Maurizio
Università degli Studi di Roma "La Sapienza"
158
Files in this item:
Tesi_dottorato_Tedeschi.pdf (open access)
Size: 4.96 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/189619
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-189619