Three shades of alignment: from distributional semantics to trustworthy and interpretable large language models
PALLUCCHINI, FILIPPO
2026
Abstract
This thesis advances a unified framework for alignment in Natural Language Processing (NLP), explored across three interdependent levels: distributional, behavioural, and epistemic. Together, these levels trace a coherent path from geometric correspondence in embedding spaces to trustworthy and interpretable behaviour in large language models (LLMs). At the distributional level, alignment concerns the mapping of semantic structures between embedding spaces. The thesis introduces SeNSe (Embedding Alignment via Semantic Anchors Selection), an unsupervised method that identifies robust semantic anchors to enhance the stability and interpretability of cross-space mappings. Building on this foundation, MEAL (Multilingual Embeddings Alignment) applies distributional alignment to real-world data, estimating job similarity across multilingual labour market datasets. Complementing these contributions, Lost in Alignment offers the first systematic survey and taxonomy of cross-lingual contextual alignment methods, establishing a unified theoretical and methodological framework for contextual representation alignment. At the behavioural level, alignment shifts from geometric correspondence to model behaviour; that is, to how LLMs integrate, retrieve, and generate knowledge in ways consistent with factual and domain-specific constraints. Two novel Retrieval-Augmented Generation (RAG) systems are introduced: RE-FIN (Retrieval-based Enrichment for Financial Data), which enhances financial sentiment analysis through retrieval-based enrichment, and FLEX (Financial Language Enhancement with Guided LLM Execution), which achieves self-alignment via internal paraphrasing and perplexity-based selection. Together, these architectures embody complementary paradigms of external (knowledge-grounded) and internal (self-consistent) behavioural alignment, improving factual reliability and domain adaptation in financial NLP applications.
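The anchor-based distributional alignment described above can be illustrated with the classical orthogonal Procrustes solution over a set of anchor pairs. The following is a generic sketch of that technique, not the SeNSe method itself; the anchor matrices are synthetic stand-ins for real embedding vectors.

```python
import numpy as np

def procrustes_align(anchors_src, anchors_tgt):
    """Fit an orthogonal map W minimising ||anchors_src @ W - anchors_tgt||_F.

    Closed-form solution: W = U @ Vt, where U, S, Vt is the SVD of
    anchors_src.T @ anchors_tgt (standard orthogonal Procrustes).
    """
    m = anchors_src.T @ anchors_tgt
    u, _, vt = np.linalg.svd(m)
    return u @ vt

# Toy setup: the source space is an exact rotation of the target space.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(50, 8))                  # 50 "anchor" vectors in the target space
rot, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
src = tgt @ rot.T                               # source space = rotated target space

W = procrustes_align(src, tgt)
aligned = src @ W                               # recovers the target anchors
```

Because the fitted map is constrained to be orthogonal, it preserves distances and angles within the source space, which is one reason the quality of the selected anchors, rather than the capacity of the map, tends to dominate alignment stability.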
Finally, at the epistemic level, alignment addresses the correspondence between internal model representations and human-interpretable semantics. The thesis introduces SAFE (Sparse Autoencoder-based Framework for robust query Enrichment), a sparse autoencoder-based framework that detects and corrects factual inconsistencies in LLM outputs through interpretable feature representations. Extending the field of sparse autoencoders, the thesis proposes SFAL (Semantic-Functional Alignment score), a novel metric for quantifying the degree of correspondence between semantic and functional feature spaces, enabling principled evaluation of auto-interpretability in sparse autoencoders. Together, SAFE and SFAL formalise epistemic alignment as the structural coherence between what models know, represent, and explain. Overall, this thesis redefines alignment as a multidimensional principle connecting meaning, behaviour, and reasoning, charting a conceptual trajectory from distributional semantics to trustworthy, interpretable, and epistemically grounded large language models.
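One generic way to make a semantic-functional correspondence score concrete is to compare pairwise feature similarities computed in a "semantic" space (e.g. embeddings of feature explanations) with those computed in a "functional" space (e.g. feature activation profiles), and correlate the two. The sketch below is an illustrative stand-in, not the SFAL metric from the thesis; the correlation-of-cosine-similarities construction and all data are assumptions for demonstration.

```python
import numpy as np

def similarity_agreement(semantic_vecs, functional_vecs):
    """Correlate pairwise cosine similarities across two feature spaces.

    Rows are features; columns are dimensions of each space. Returns the
    Pearson correlation between the off-diagonal entries of the two
    feature-by-feature cosine-similarity matrices.
    """
    def cos_sim(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    n = semantic_vecs.shape[0]
    mask = ~np.eye(n, dtype=bool)              # drop self-similarities
    a = cos_sim(semantic_vecs)[mask]
    b = cos_sim(functional_vecs)[mask]
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(1)
sem = rng.normal(size=(30, 16))                # toy "semantic" feature vectors
q, _ = np.linalg.qr(rng.normal(size=(16, 16))) # a random orthogonal map
fun_aligned = sem @ q                          # preserves cosine structure exactly
fun_random = rng.normal(size=(30, 16))         # unrelated "functional" space

print(similarity_agreement(sem, fun_aligned))  # agreement close to 1.0
print(similarity_agreement(sem, fun_random))   # near-zero agreement
```

A score of this kind is high only when features that mean similar things also behave similarly, which is the structural coherence the epistemic level is after.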
| File | Size | Format |
|---|---|---|
| phd_unimib_883868.pdf (open access; licence: all rights reserved) | 13.58 MB | Adobe PDF |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/361311
URN:NBN:IT:UNIMIB-361311