Enhancing latent alignment methods for NLP: from words to concepts

BEJGU, ANDREI STEFAN
2025

Abstract

A central challenge in artificial intelligence is developing systems that can truly understand human language. Achieving this requires models that capture nuanced meanings, relationships, and contextual subtleties, especially in specialized applications. This thesis explores strategies in Natural Language Processing (NLP) to address this challenge, beginning with foundational word representations, advancing to sentence-level representations, and then integrating both levels to enrich interpretive depth. Finally, we investigate synthetic data generation with Large Language Models (LLMs) as a method to further refine these representations.

Word representations serve as the starting point, positioning words within a continuous semantic space to encode basic lexical relationships. While these embeddings effectively capture individual word associations, they fall short in representing meaning beyond isolated terms, limiting their utility in complex tasks that demand contextual comprehension.

To address this gap, the thesis progresses to sentence-level embeddings, which extend beyond individual words by integrating syntactic and semantic relationships. These context-aware embeddings provide the depth necessary for semantic similarity, sentence alignment, and cross-lingual applications, enabling models to interpret richer and more nuanced meanings within larger contexts.

By integrating word- and sentence-level embeddings, we achieve a multi-layered representation that captures both fine-grained lexical details and broader contextual dependencies, offering a more comprehensive approach to language understanding. This layered integration sets the foundation for the final phase of the thesis: synthetic data generation, which further refines these representations by addressing data scarcity and tailoring embeddings to task-specific needs in specialized applications.

In sum, this thesis shows how each stage, from foundational word embeddings to sentence-level representations, their integration, and ultimately synthetic data generation, contributes to building robust and adaptable NLP models. This structured progression provides a comprehensive pathway for addressing the varied and intricate challenges of language understanding, advancing NLP toward effective language comprehension in AI systems.
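The idea behind the word-level starting point can be made concrete with a minimal, self-contained sketch (not taken from the thesis): toy vectors stand in for learned embeddings such as word2vec or GloVe, and cosine similarity exposes the lexical relationships the abstract refers to. All vector values below are invented for illustration.

```python
import numpy as np

# Toy 4-dimensional word vectors. The values are invented for illustration;
# real learned embeddings (e.g. word2vec, GloVe) typically have 100-300 dims.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.60, 0.15, 0.10]),
    "apple": np.array([0.10, 0.05, 0.90, 0.70]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1 = more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Related words point in similar directions in the semantic space...
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # ~0.997
# ...while unrelated words do not.
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # ~0.20
```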
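At the sentence level, the semantic-similarity use case mentioned in the abstract can be sketched with an off-the-shelf encoder. This is not one of the thesis's own models: the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint are stand-ins chosen only to illustrate the technique.

```python
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf sentence encoder works for this sketch; this checkpoint
# is a stand-in, not a model studied in the thesis.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The bank raised its interest rates.",
    "Interest rates were increased by the bank.",
    "She walked along the river bank.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity: the two paraphrases score high with each
# other, while the third sentence (a different sense of "bank") scores low.
print(util.cos_sim(embeddings, embeddings))
```

Because the whole sentence is encoded at once, the model can separate the financial and river senses of "bank", which isolated word vectors cannot do.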
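Finally, the synthetic-data phase can be sketched as prompting an LLM to augment a scarce, specialized dataset. The prompt, helper function, and model choice below are hypothetical and are not the generation pipeline used in the thesis; the sketch assumes an OPENAI_API_KEY in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_paraphrases(sentence: str, n: int = 3) -> list[str]:
    """Hypothetical helper: ask an LLM for n paraphrases of a seed sentence,
    one simple way to augment scarce training data in a specialized domain."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Write {n} paraphrases of the sentence below, "
                       f"one per line, preserving its meaning.\n\n{sentence}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line for line in lines if line.strip()]

# Seed sentence from a hypothetical specialized (biomedical) domain.
print(generate_paraphrases("The drug inhibits bacterial protein synthesis."))
```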
Defense date: 23 January 2025
Language: English
Advisor: NAVIGLI, Roberto
Coordinator: LENZERINI, Maurizio
University: Università degli Studi di Roma "La Sapienza"
Pages: 160
Files in this item:
Tesi_dottorato_Bejgu.pdf (open access, Adobe PDF, 1.3 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/192802
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-192802