Enabling robust and reliable commonsense reasoning in large language models

MOLFESE, FRANCESCO MARIA
2026

Abstract

In recent years, Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), demonstrating remarkable performance across a wide range of tasks. Despite these advances, fundamental questions remain about their reasoning capabilities, particularly in commonsense reasoning, which refers to the ability to draw inferences from implicit, everyday knowledge that humans typically take for granted. Current approaches often fail to provide models with knowledge in forms they can effectively utilize, while evaluation methodologies focus exclusively on final answers, ignoring the reasoning processes that lead to them. Moreover, the dominant multiple-choice evaluation paradigm can systematically misrepresent model capabilities, especially as models generate increasingly complex reasoning traces. These limitations hinder progress toward language models with robust, human-like commonsense reasoning and understanding. We aim to advance both the performance and reliable evaluation of commonsense reasoning in language models, addressing fundamental challenges in knowledge augmentation and assessment methodologies. To improve performance, instead of augmenting models with retrieved isolated facts, we introduce a retrieval augmentation framework that provides models with complete reasoning examples, demonstrating consistent improvements across multiple commonsense reasoning benchmarks without requiring model retraining. We then analyze the challenges related to the evaluation of language models in commonsense reasoning by exploring two complementary directions. Specifically, we expose systematic inconsistencies in language model evaluation via multiple-choice question answering through comprehensive human annotation studies, revealing substantial disagreement between automated extraction strategies and human judgment, and introducing a dataset that exposes critical vulnerabilities in state-of-the-art LLM-based answer extractors. Subsequently, we introduce the first comprehensive benchmark for evaluating reasoning traces in commonsense domains, demonstrating that a significant proportion of correct answers contain flawed reasoning and that reasoning-aware evaluation reveals substantial performance drops compared to answer-only assessment. Together, these contributions establish a foundation for more robust and reliable commonsense reasoning in language models, bridging the gap between knowledge integration and commonsense reasoning evaluation.
28 Jan 2026
English
NAVIGLI, Roberto
GRISETTI, GIORGIO
Università degli Studi di Roma "La Sapienza"
135
Files in this item:
Tesi_dottorato_Molfese.pdf (open access, Creative Commons license, Adobe PDF, 1.39 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/358427
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-358427