Extending coreference resolution to long texts: from paragraphs to full books and beyond
MARTINELLI, GIULIANO
2025
Abstract
In recent years, Large Language Models have reshaped the landscape of Natural Language Processing, achieving remarkable performance across a wide range of tasks. However, while their performance is impressive on short texts, their capabilities remain limited when dealing with longer contexts: memory and computational costs grow rapidly, coherence degrades, and outputs can lack robustness or factual grounding. Coreference Resolution, a long-standing task that involves determining when expressions refer to the same entity, directly addresses these weaknesses. It remains a fundamental component of discourse understanding, reasoning, and information extraction, especially in applications involving lengthy documents such as articles, reports, or books. Despite steady progress, most neural approaches remain optimized for current benchmarks that mainly contain short inputs, making them ill-suited to the challenges of real-world deployment. This thesis aims to enhance Coreference Resolution techniques in the era of Large Language Models, pushing the boundaries of efficiency, robustness, and scalability of neural-based methods, particularly when dealing with long texts. We begin by introducing a novel encoder-only neural architecture that achieves state-of-the-art performance across a broad range of Coreference Resolution benchmarks. This model challenges the prevailing reliance on large generative architectures with high computational overhead, instead establishing an optimal balance between performance and efficiency. Next, motivated by the absence of resources for evaluating Coreference capabilities on extended contexts, we introduce a new long-document benchmark of fully annotated narrative books. This dataset extends the length limits of existing corpora by several orders of magnitude, revealing the persistent shortcomings of current neural models in handling very long texts.
Building on these insights, we propose the first unified architecture capable of addressing long- and cross-document Coreference within a single framework. Joint modeling of these two challenging scenarios enables shared learning and leads to new state-of-the-art results across multiple datasets. Finally, to enhance interpretability in Coreference evaluation, we propose a new approach that integrates semantic information into standard practices. Together, these contributions, along with the produced artifacts, advance Coreference Resolution toward robust, efficient, and interpretable systems suited for deployment in realistic, large-scale NLP applications.
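To make the task concrete, a toy sketch of the coreference output format: mentions of the same entity are grouped into clusters, here built from (mention, antecedent) links. This is only an illustration of the task definition, not the thesis's architecture; all names and the data structure are invented for the sketch.

```python
# Toy illustration of Coreference Resolution: group mentions that
# refer to the same entity into clusters, given antecedent links.

def cluster_mentions(mention_links):
    """Build entity clusters from (mention, antecedent) pairs,
    where antecedent is None for a cluster-starting mention."""
    clusters = {}   # representative mention -> list of coreferent mentions
    rep_of = {}     # mention -> its cluster representative
    for mention, antecedent in mention_links:
        if antecedent is None:
            clusters[mention] = [mention]   # new entity cluster
            rep_of[mention] = mention
        else:
            rep = rep_of[antecedent]        # join the antecedent's cluster
            clusters[rep].append(mention)
            rep_of[mention] = rep
    return list(clusters.values())

# "Ada wrote the book. She published it in 1843."
links = [("Ada", None), ("the book", None), ("She", "Ada"), ("it", "the book")]
print(cluster_mentions(links))
# → [['Ada', 'She'], ['the book', 'it']]
```

Real systems must also detect the mention spans themselves and score antecedents over arbitrarily long documents, which is precisely where the length limits discussed above arise.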
| File | Size | Format | License |
|---|---|---|---|
| Tesi_dottorato_Martinelli.pdf | 3.7 MB | Adobe PDF | Creative Commons (open access) |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359085
URN:NBN:IT:UNIROMA1-359085