
Unstructured data for large language models

PIKTUS, ALEKSANDRA
2026

Abstract

In recent years, we have witnessed an impressive rise in the ubiquity of large language models (LLMs). Although their fundamental objective, predicting the most probable next word in a sequence, has remained unchanged, the models themselves have expanded dramatically in scale and capability, becoming the dominant paradigm in Natural Language Processing (NLP). Progress has been marked by the development of increasingly sophisticated evaluation benchmarks on one hand and by a growing demand for vast amounts of training data on the other. In this thesis, we examine how unstructured, primarily web-based data is utilized in LLM pre-training and fine-tuning. We investigate two principal roles that large textual corpora play within these models: first, as a source of world knowledge through retrieval augmentation, and second, as pre-training data. We begin by demonstrating how retrieval from large, unstructured web corpora can enhance performance on open-domain tasks, paving the way towards assistants capable of supporting humans in solving complex, knowledge-intensive problems. Next, we address the challenge of improving the robustness of pre-training data through the development of tools that enable qualitative analysis of massive text collections. Finally, we explore potential avenues for model scaling under data-constrained conditions, anticipating a future in which the entirety of publicly available web text may no longer suffice to meet the demands of ever-larger language models.
28 Jan 2026
English
SILVESTRI, FABRIZIO
Università degli Studi di Roma "La Sapienza"
125
Files in this record:
Tesi_dottorato_Piktus.pdf — open access — License: Creative Commons — 3.53 MB — Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/360699
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-360699