A multidimensional study of large language models: recommendation behavior, internal representations, and post-training enhancements
Di Palma, Dario
2026
Abstract
Although Large Language Models (LLMs) exhibit strong emergent abilities across a range of general tasks, their behaviour in specialized domains requires further investigation. Their training objective, next-token prediction, differs from the optimization criteria of many domain-specific applications, and understanding how pretraining artifacts influence adaptability to specific tasks is essential for effective specialization. Moreover, limited insight into how task-relevant information is encoded within their internal representations hinders reliable adaptation and evaluation. Consequently, the use of LLMs beyond their original training context raises open questions concerning generalization, interpretability, and robustness. These issues make the integration of LLMs into specialized domains an open research problem.

In this thesis, we study how LLMs behave when repurposed beyond their original training objectives and propose principled methods to adapt, interpret, and extend their capabilities. We examine LLMs from three complementary perspectives: their application in Recommender Systems (RSs), the internal mechanisms that shape their representations, and post-training techniques that enhance their performance in complex reasoning tasks.

In the first part of the dissertation, we move beyond standard NLP benchmarks and evaluate LLMs as standalone recommenders. We assess their ability to infer user preferences and re-rank items without task-specific training data, observing performance comparable to classical recommender systems, strong few-shot re-ranking capabilities, and effectiveness even in cold-start scenarios. We examine whether LLM-generated recommendations align more closely with collaborative filtering or content-based approaches and find a pronounced similarity to collaborative filtering. We also show that LLMs tend to amplify popular but underrepresented items in the dataset, suggesting the presence of broader domain knowledge. In this part, we further investigate why LLMs can perform well on recommendation tasks and, for the first time, provide evidence of data leakage from recommender system datasets into LLM pretraining, raising fundamental concerns about the evaluation of LLM-based recommenders. We extend this analysis by leveraging automatic prompt engineering to systematically identify cases in which LLMs have memorized leaked datasets. Finally, we propose a novel method for integrating LLMs into existing recommendation pipelines through symbolic rules that construct user and item profiles. This approach makes LLMs aware of dataset-specific patterns and generates narrative-style descriptions of users and items, showing that LLM-derived profiles can enhance recommendation quality.

The second part of the dissertation examines the internal representations learned by LLMs from a linguistic perspective. We apply probing techniques to study how abstract concepts are encoded within the model's activation space. We investigate how LLMs represent factual truthfulness in their hidden states, extend prior work, and propose two new, more realistic methods for evaluating factuality. Our results show that detecting hallucinations solely from internal model states is substantially more challenging than previously assumed.
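To make the probing setup concrete, the sketch below shows one common way such probes are built: extract a hidden state from a frozen LLM and fit a small linear classifier on top of it. It is an illustrative reconstruction, not the exact pipeline from the thesis; the model name, layer index, and toy data are placeholders.

```python
# Minimal sketch of linear probing on LLM hidden states.
# Illustrative only: the model name, layer index, and toy data
# are placeholders, not the setup used in the dissertation.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; any LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def hidden_state(text: str, layer: int = 6) -> torch.Tensor:
    """Return the last-token hidden state at a given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors of shape (1, seq_len, dim)
    return outputs.hidden_states[layer][0, -1]

# Toy labeled statements (1 = true, 0 = false); real probes use large datasets.
texts = ["Paris is the capital of France.", "The sun orbits the Earth."]
labels = [1, 0]

X = torch.stack([hidden_state(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))  # a held-out split is needed for a real evaluation
```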
We also demonstrate that sentiment information is not uniformly distributed across layers but is most concentrated in the middle layers, and that the final token is not always the most informative representation for sentiment extraction, contrary to common practice. Finally, we introduce a layer-cutting strategy that reduces memory requirements by an average of 57% while enabling sentiment extraction directly from internal states, achieving up to 14% higher accuracy than sentiment classification performed through prompting.

The final part of the dissertation shifts from task adaptation and interpretability to post-training techniques aimed at improving performance by scaling test-time computation. We introduce a re-ranking module that extends LLM capabilities in complex reasoning scenarios and present a new model that acts as an effective heuristic, improving accuracy by up to 4.33% on challenging Text-to-SQL benchmarks. Our results show that scaling test-time computation through re-ranking outperforms alternative methods and heuristics, demonstrating that meaningful performance gains can be achieved through post-training strategies rather than by increasing model size.

Overall, this thesis provides a multifaceted perspective on LLMs. We adapt them to new tasks, expose the problem of data leakage, interpret their internal behaviour, optimize their memory requirements, and propose a new strategy for enhancing their capabilities through test-time computation.
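As an illustration of the layer-cutting idea mentioned in the second part, the sketch below truncates a decoder-only model after an intermediate block and reads the concept of interest from that block's output via a probe. It is a simplified reconstruction under assumed names (a GPT-2-style architecture, an illustrative cut point), not the thesis implementation.

```python
# Sketch of layer cutting: drop the upper transformer blocks and read
# a concept (here, sentiment) from an intermediate hidden state.
# Assumptions: GPT-2-style architecture with blocks in `model.h`;
# `probe` is a trained classifier as in the probing sketch above.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

cut_layer = len(model.h) // 2  # keep only the lower half of the blocks
model.h = model.h[:cut_layer]  # upper blocks are discarded, saving memory
model.eval()

def middle_layer_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # With the upper blocks removed, this is (up to the final layer norm)
    # the representation produced by the cut layer.
    return out.last_hidden_state[0, -1]

# A probe on this representation replaces prompting the full model, e.g.:
# sentiment = probe.predict(middle_layer_embedding("Great movie!").numpy()[None, :])
```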
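The test-time scaling recipe in the final part can be summarized as a best-of-N re-ranking pattern: sample several candidate outputs and let a separate scorer pick one. The sketch below is a generic, hedged illustration of that pattern; `generate_candidates` and `score` are hypothetical stand-ins for the thesis's candidate generator (e.g., a Text-to-SQL LLM) and re-ranking model.

```python
# Generic best-of-N re-ranking, the test-time scaling pattern described
# above. `generate_candidates` and `score` are hypothetical stand-ins,
# not the actual components proposed in the dissertation.
from typing import Callable, List

def rerank_best_of_n(
    question: str,
    generate_candidates: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda c: score(question, c))

# Larger n spends more test-time compute; accuracy then depends on the
# quality of the re-ranker rather than on the size of the base model.
```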
| File | Size | Format | |
|---|---|---|---|
| 38 ciclo-DI PALMA Dario.pdf (open access; License: All rights reserved) | 10.24 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/364964
URN:NBN:IT:POLIBA-364964