A multidimensional study of large language models: recommendation behavior, internal representations, and post-training enhancements
Di Palma, Dario
2026
Abstract
Although Large Language Models (LLMs) exhibit strong emergent abilities across a range of general tasks, their behaviour in specialized domains requires further investigation. Their training objective, next-token prediction, differs from the optimization criteria of many domain-specific applications, and understanding how pretraining artifacts influence adaptability to specific tasks is essential for effective specialization. Moreover, limited insight into how task-relevant information is encoded within their internal representations hinders reliable adaptation and evaluation. Consequently, the use of LLMs beyond their original training context raises open questions concerning generalization, interpretability, and robustness. These issues make the integration of LLMs into specialized domains an open research problem.

In this thesis, we study how LLMs behave when repurposed beyond their original training objectives and propose principled methods to adapt, interpret, and extend their capabilities. We examine LLMs from three complementary perspectives: their application in Recommender Systems (RSs), the internal mechanisms that shape their representations, and post-training techniques that enhance their performance in complex reasoning tasks.

In the first part of the dissertation, we move beyond standard NLP benchmarks and evaluate LLMs as standalone recommenders. We assess their ability to infer user preferences and re-rank items without task-specific training data, observing performance comparable to classical recommender systems, strong few-shot re-ranking capabilities, and effectiveness even in cold-start scenarios. We examine whether LLM-generated recommendations align more closely with collaborative filtering or content-based approaches and find a pronounced similarity to collaborative filtering. We also show that LLMs tend to amplify popular but underrepresented items in the dataset, suggesting the presence of broader domain knowledge. In this part, we further investigate why LLMs can perform well on recommendation tasks and, for the first time, provide evidence of data leakage from recommender system datasets into LLM pretraining, raising fundamental concerns about the evaluation of LLM-based recommenders. We extend this analysis by leveraging automatic prompt engineering to systematically identify cases in which LLMs have memorized leaked datasets. Finally, we propose a novel method for integrating LLMs into existing recommendation pipelines through symbolic rules that construct user and item profiles. This approach makes LLMs aware of dataset-specific patterns and generates narrative-style descriptions of users and items, showing that LLM-derived profiles can enhance recommendation quality.

The second part of the dissertation examines the internal representations learned by LLMs from a linguistic perspective. We apply probing techniques to study how abstract concepts are encoded within the model's activation space. We investigate how LLMs represent factual truthfulness in their hidden states, extend prior work, and propose two new, more realistic methods for evaluating factuality. Our results show that detecting hallucinations solely from internal model states is substantially more challenging than previously assumed.
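To make the probing setup concrete, the sketch below shows one common way such probes are built: extract a hidden state from a frozen LLM and fit a small linear classifier on top of it. It is an illustrative reconstruction, not the exact pipeline from the thesis; the model name, layer index, and toy data are placeholders.

```python
# Minimal sketch of linear probing on LLM hidden states.
# Illustrative only: the model name, layer index, and toy data
# are placeholders, not the setup used in the dissertation.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; any LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def hidden_state(text: str, layer: int = 6) -> torch.Tensor:
    """Return the last-token hidden state at a given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors of shape (1, seq_len, dim)
    return outputs.hidden_states[layer][0, -1]

# Toy labeled statements (1 = true, 0 = false); real probes use large datasets.
texts = ["Paris is the capital of France.", "The sun orbits the Earth."]
labels = [1, 0]

X = torch.stack([hidden_state(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # a held-out split is needed for a real evaluation
```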
We also demonstrate that sentiment information is not uniformly distributed across layers but is most concentrated in the middle layers, and that the final token is not always the most informative representation for sentiment extraction, contrary to common practice. Finally, we introduce a layer-cutting strategy that reduces memory requirements by an average of 57% while enabling sentiment extraction directly from internal states, achieving up to 14% higher accuracy than sentiment classification performed through prompting.

The final part of the dissertation shifts from task adaptation and interpretability to post-training techniques aimed at improving performance by scaling test-time computation. We introduce a re-ranking module that extends LLM capabilities in complex reasoning scenarios and present a new model that acts as an effective heuristic, improving accuracy by up to 4.33% on challenging Text-to-SQL benchmarks. Our results show that scaling test-time computation through re-ranking outperforms alternative methods and heuristics, demonstrating that meaningful performance gains can be achieved through post-training strategies rather than by increasing model size.

Overall, this thesis provides a multifaceted perspective on LLMs. We adapt them to new tasks, expose the problem of data leakage, interpret their internal behaviour, optimize their memory requirements, and propose a new strategy for enhancing their capabilities through test-time computation.
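As an illustration of the layer-cutting idea mentioned in the second part, the sketch below truncates a decoder-only model after an intermediate block and reads the concept of interest from that block's output via a probe. It is a simplified reconstruction under assumed names (a GPT-2-style architecture, an illustrative cut point), not the thesis implementation.

```python
# Sketch of layer cutting: drop the upper transformer blocks and read
# a concept (here, sentiment) from an intermediate hidden state.
# Assumptions: GPT-2-style architecture with blocks in `model.h`;
# `probe` is a trained classifier as in the probing sketch above.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

cut_layer = len(model.h) // 2  # keep only the lower half of the blocks
model.h = model.h[:cut_layer]  # upper blocks are discarded, saving memory
model.eval()

def middle_layer_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # With the upper blocks removed, this is (up to the final layer norm)
    # the representation produced by the cut layer.
    return out.last_hidden_state[0, -1]

# A probe on this representation replaces prompting the full model, e.g.:
# sentiment = probe.predict(middle_layer_embedding("Great movie!").numpy()[None, :])
```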
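The test-time scaling recipe in the final part can be summarized as a best-of-N re-ranking pattern: sample several candidate outputs and let a separate scorer pick one. The sketch below is a generic, hedged illustration of that pattern; `generate_candidates` and `score` are hypothetical stand-ins for the thesis's candidate generator (e.g., a Text-to-SQL LLM) and re-ranking model.

```python
# Generic best-of-N re-ranking, the test-time scaling pattern described
# above. `generate_candidates` and `score` are hypothetical stand-ins,
# not the actual components proposed in the dissertation.
from typing import Callable, List

def rerank_best_of_n(
    question: str,
    generate_candidates: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda c: score(question, c))

# Larger n spends more test-time compute; accuracy then depends on the
# quality of the re-ranker rather than on the size of the base model.
```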
| File | Size | Format | |
|---|---|---|---|
| 38 ciclo-DI PALMA Dario.pdf (open access; License: All rights reserved) | 10.24 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/364964
URN:NBN:IT:POLIBA-364964