Enhancing latent alignment methods for NLP: from words to concepts

BEJGU, ANDREI STEFAN
2025

Abstract

A central challenge in artificial intelligence is developing systems that can truly understand human language. Achieving this requires models that capture nuanced meanings, relationships, and contextual subtleties, especially in specialized applications. This thesis explores strategies in Natural Language Processing (NLP) to address this challenge, beginning with foundational word representations, advancing to sentence-level representations, and then integrating both levels to enrich interpretive depth. Finally, we investigate synthetic data generation with Large Language Models (LLMs) as a method to further refine these representations.

Word representations serve as the starting point, positioning words within a continuous semantic space to encode basic lexical relationships. While these embeddings effectively capture individual word associations, they fall short in representing meaning beyond isolated terms, limiting their utility in complex tasks that demand contextual comprehension.

To address this gap, the thesis progresses to sentence-level embeddings, which extend beyond individual words by integrating syntactic and semantic relationships. These context-aware embeddings provide the depth necessary for semantic similarity, sentence alignment, and cross-lingual applications, enabling models to interpret richer and more nuanced meanings within larger contexts.

By integrating word- and sentence-level embeddings, we achieve a multi-layered representation that captures both fine-grained lexical details and broader contextual dependencies, offering a more comprehensive approach to language understanding. This layered integration sets the foundation for the final phase of the thesis: synthetic data generation, which further refines these representations by addressing data scarcity and tailoring embeddings to task-specific needs in specialized applications.

In sum, this thesis shows how each stage, from foundational word embeddings to sentence-level representations, their integration, and ultimately synthetic data generation, contributes to building robust and adaptable NLP models. This structured progression provides a comprehensive pathway for addressing the varied and intricate challenges of language understanding, advancing NLP toward effective language comprehension in AI systems.
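The idea behind the word-level starting point can be made concrete with a minimal, self-contained sketch (not taken from the thesis): toy vectors stand in for learned embeddings such as word2vec or GloVe, and cosine similarity exposes the lexical relationships the abstract refers to. All vector values below are invented for illustration.

```python
import numpy as np

# Toy 4-dimensional word vectors. The values are invented for illustration;
# real learned embeddings (e.g. word2vec, GloVe) typically have 100-300 dims.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.60, 0.15, 0.10]),
    "apple": np.array([0.10, 0.05, 0.90, 0.70]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1 = more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Related words point in similar directions in the semantic space...
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # ~0.997
# ...while unrelated words do not.
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # ~0.20
```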
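At the sentence level, the semantic-similarity use case mentioned in the abstract can be sketched with an off-the-shelf encoder. This is not one of the thesis's own models: the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are stand-ins chosen only to illustrate the technique.

```python
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf sentence encoder works for this sketch; this checkpoint
# is a stand-in, not a model studied in the thesis.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The bank raised its interest rates.",
    "Interest rates were increased by the bank.",
    "She walked along the river bank.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity: the two paraphrases score high with each
# other, while the third sentence (a different sense of "bank") scores low.
print(util.cos_sim(embeddings, embeddings))
```

Because the whole sentence is encoded at once, the model can separate the financial and river senses of "bank", which isolated word vectors cannot do.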
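Finally, the synthetic-data phase can be sketched as prompting an LLM to augment a scarce, specialized dataset. The prompt, helper function, and model choice below are hypothetical and are not the generation pipeline used in the thesis; the sketch assumes an OPENAI_API_KEY in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_paraphrases(sentence: str, n: int = 3) -> list[str]:
    """Hypothetical helper: ask an LLM for n paraphrases of a seed sentence,
    one simple way to augment scarce training data in a specialized domain."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Write {n} paraphrases of the sentence below, "
                       f"one per line, preserving its meaning.\n\n{sentence}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line for line in lines if line.strip()]

# Seed sentence from a hypothetical specialized (biomedical) domain.
print(generate_paraphrases("The drug inhibits bacterial protein synthesis."))
```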
Defense date: 23 January 2025
Language: English
Advisor: NAVIGLI, Roberto
Coordinator: LENZERINI, Maurizio
University: Università degli Studi di Roma "La Sapienza"
Pages: 160
Files in this item:
Tesi_dottorato_Bejgu.pdf (open access, Adobe PDF, 1.3 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/192802
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-192802