Towards interpretable and factual natural language generation
Sciré, Alessandro
2025
Abstract
This thesis addresses two major challenges in the field of Natural Language Generation (NLG): interpretability and factuality. Despite the significant advancements made by Large Language Models (LLMs) in NLG tasks such as text summarization and machine translation, studies have consistently demonstrated that their outputs often remain opaque and contain factual inaccuracies. These limitations raise concerns regarding the reliability and trustworthiness of current NLG systems, both in academia and industry. Our journey begins by addressing long-form text summarization with an emphasis on interpretability. In this setting, given the length and complexity of texts like books, users must place considerable trust in the summarization system's ability to distill key information accurately. To address this need, we propose an extractive-then-abstractive summarization approach that highlights the relevant portions of the original text used to generate the final summary. This method helps build user trust, as readers can verify the sentences the system deemed relevant. However, our investigation into book summarization systems revealed two critical findings: first, many outputs feature factual errors, and second, standard automatic metrics fail to detect them. To tackle these issues, we introduce a summarization factuality metric that leverages Natural Language Inference (NLI) and claim extraction. By aligning claims extracted from the summary with the corresponding sections of the source document, the metric shows which specific parts of the summary are accurate and which are hallucinated, addressing factuality and interpretability simultaneously. Having addressed summarization consistency with our factuality metric, we recognized that verifying the factual accuracy of summaries against a source document is only one part of a broader challenge.
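The claim-level idea behind such a metric can be caricatured in a few lines: each claim extracted from the summary is aligned with its best-supporting source sentence via an entailment score, and poorly supported claims are flagged as likely hallucinations. This is a minimal sketch, not the thesis's actual implementation; `toy_nli` (a word-overlap stand-in for a real NLI model) and the 0.5 threshold are purely illustrative assumptions.

```python
def score_claims(claims, source_sentences, nli_entail):
    """Align each summary claim with its best-supporting source sentence.

    nli_entail(premise, hypothesis) -> score in [0, 1]; claims whose best
    evidence scores below the (illustrative) 0.5 threshold are flagged as
    likely hallucinations.
    """
    results = []
    for claim in claims:
        # Score the claim against every source sentence and keep the best.
        scored = [(nli_entail(sent, claim), sent) for sent in source_sentences]
        score, evidence = max(scored)
        results.append({"claim": claim, "evidence": evidence,
                        "entailed": score >= 0.5})
    return results


def toy_nli(premise, hypothesis):
    """Toy lexical-overlap stand-in for a real NLI model (assumption)."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)


source = ["Ada Lovelace wrote the first algorithm.", "She worked with Babbage."]
claims = ["Ada Lovelace wrote the first algorithm.",   # supported
          "Ada Lovelace built a rocket."]              # hallucinated
report = score_claims(claims, source, toy_nli)
```

In practice the overlap stub would be replaced by an NLI classifier, but the alignment step already illustrates how the metric localizes errors rather than emitting a single opaque score.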
In real-world applications, texts often need to be validated against multiple external sources, where the relevant information is not known beforehand. This leads to the broader task of end-to-end factuality evaluation, in which verification extends beyond a predefined document to any evidence retrieved from external knowledge bases. To tackle this, we introduce LLM-Oasis, the first large-scale resource designed for training and evaluating models on this more complex verification task. The resource is created by extracting and falsifying claims from Wikipedia pages, and subsequently generating factual and unfactual versions of the original text. We then train and evaluate language models on their ability to distinguish factual texts from their falsified counterparts. Our experiments reveal the challenging nature of this benchmark for current LLMs, even in the Retrieval-Augmented Generation (RAG) setting, while smaller, specialized models fine-tuned on our resource achieve competitive performance. In the last chapter of the thesis, we show that the lack of transparency and interpretability also extends to other areas of NLG, such as machine translation (MT). In this context, leading MT evaluation methods share similar limitations, offering only a general quality score without revealing the precise nature or location of translation errors. As an initial step toward more interpretable evaluation, we propose MaTESe, a novel metric that frames MT evaluation as a sequence tagging task, identifying mistranslated spans and categorizing errors by type and severity. This thesis contributes to the ongoing effort to make NLG systems both interpretable and factually reliable, demonstrating the feasibility and importance of these qualities in practical applications.
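The sequence-tagging framing can be illustrated with a small sketch: a tagger assigns each translation token a label, and runs of consecutive error labels are merged into annotated error spans. The label names (`OK`, `MINOR`, `MAJOR`) and the merging helper below are an illustrative reconstruction under assumed conventions, not MaTESe's actual code.

```python
from typing import List, Tuple


def spans_from_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens sharing the same error tag into
    (span_text, severity) pairs; tokens tagged "OK" are skipped."""
    spans, cur, cur_tag = [], [], "OK"
    for tok, tag in zip(tokens, tags):
        if tag == cur_tag and tag != "OK":
            cur.append(tok)               # extend the current error span
        else:
            if cur:                       # close the previous error span
                spans.append((" ".join(cur), cur_tag))
            cur, cur_tag = (([tok], tag) if tag != "OK" else ([], "OK"))
    if cur:                               # flush a span ending at the last token
        spans.append((" ".join(cur), cur_tag))
    return spans


# Hypothetical tagger output for a mistranslated segment:
tokens = ["The", "cat", "eats", "the", "dog"]
tags   = ["OK", "OK", "MAJOR", "MAJOR", "OK"]
errors = spans_from_tags(tokens, tags)    # → [("eats the", "MAJOR")]
```

Unlike a single scalar quality score, the output pinpoints which span is wrong and how severe the error is, which is the interpretability gain the metric aims for.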
Our hope is that the methodologies, resources, and insights outlined in this research will inspire future work and lay a solid foundation for more transparent and trustworthy NLG systems, ultimately building greater confidence in AI-driven text generation.

File: Tesi_dottorato_Sciré.pdf (open access, 2.07 MB, Adobe PDF)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/192807
URN:NBN:IT:UNIROMA1-192807