Enabling robust and reliable commonsense reasoning in large language models

MOLFESE, FRANCESCO MARIA
2026

Abstract

In recent years, Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), demonstrating remarkable performance across a wide range of tasks. Despite these advances, fundamental questions remain about their reasoning capabilities, particularly in commonsense reasoning, which refers to the ability to draw inferences from implicit, everyday knowledge that humans typically take for granted. Current approaches often fail to provide models with knowledge in forms they can effectively utilize, while evaluation methodologies focus exclusively on final answers, ignoring the reasoning processes that lead to them. Moreover, the dominant multiple-choice evaluation paradigm can systematically misrepresent model capabilities, especially as models generate increasingly complex reasoning traces. These limitations hinder progress toward language models with robust, human-like commonsense reasoning and understanding. We aim to advance both the performance and reliable evaluation of commonsense reasoning in language models, addressing fundamental challenges in knowledge augmentation and assessment methodologies. To improve performance, instead of augmenting models with retrieved isolated facts, we introduce a retrieval augmentation framework that provides models with complete reasoning examples, demonstrating consistent improvements across multiple commonsense reasoning benchmarks without requiring model retraining. We then analyze the challenges related to the evaluation of language models in commonsense reasoning by exploring two complementary directions. Specifically, we expose systematic inconsistencies in language model evaluation via multiple-choice question answering through comprehensive human annotation studies, revealing substantial disagreement between automated extraction strategies and human judgment, and introducing a dataset that exposes critical vulnerabilities in state-of-the-art LLM-based answer extractors. Subsequently, we introduce the first comprehensive benchmark for evaluating reasoning traces in commonsense domains, demonstrating that a significant proportion of correct answers contain flawed reasoning and that reasoning-aware evaluation reveals substantial performance drops compared to answer-only assessment. Together, these contributions establish a foundation for more robust and reliable commonsense reasoning in language models, bridging the gap between knowledge integration and commonsense reasoning evaluation.
28 Jan 2026
English
NAVIGLI, Roberto
GRISETTI, GIORGIO
Università degli Studi di Roma "La Sapienza"
135
Files in this item:
Tesi_dottorato_Molfese.pdf (open access, Creative Commons license, Adobe PDF, 1.39 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/358427
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-358427