From text to knowledge: multilingual information extraction for knowledge graph construction

HUGUET CABOT, PERE-LLUIS
2025

Abstract

In the era of Large Language Models (LLMs), Information Extraction (IE) may seem like a “Chronicle of a Death Foretold”. Between 2020 and 2023, it ranked among the top three most popular topics at conferences like ACL, yet by 2024, it had dropped to tenth place. The advent of Transformer Language Models (LMs), emerging just before work on this dissertation began, has transformed the field of Natural Language Processing (NLP), enabling unprecedented performance across a broad range of Natural Language Understanding (NLU) tasks. Surprisingly, scaling these models into LLMs has not led to diminishing returns but has instead further expanded their capabilities. However, there remains a need for efficient methods suitable for real-world applications that require low latency or the ability to process large volumes of real-time data—domains where large models are often impractical. Additionally, tasks reliant on LLMs’ parametric memory face limitations due to neural inference, where accuracy and recency of information cannot always be guaranteed. While LLMs show great promise, they increasingly require grounding in external knowledge sources for reliable results. This is where IE becomes indispensable. Rather than being replaced, IE complements and strengthens LLMs, supporting their reasoning with accurate, grounded information. Knowledge Graphs (KGs) serve as structured frameworks that bridge unstructured text and structured knowledge, enabling scalable, interpretable organization of vast amounts of information. Essential for applications like semantic search, recommendation systems, and question-answering, KGs rely heavily on robust IE techniques. In this thesis, we focus on advancing multilingual IE methods to enhance KG construction and address limitations in existing IE systems.
23 January 2025
English
NAVIGLI, Roberto
LENZERINI, Maurizio
Università degli Studi di Roma "La Sapienza"
Files in this record:
Tesi_dottorato_HuguetCabot.pdf (open access, 7.8 MB, Adobe PDF)

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/189207
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-189207