Unstructured data for large language models
PIKTUS, ALEKSANDRA
2026
Abstract
In recent years, we have witnessed an impressive rise in the ubiquity of large language models (LLMs). Although their fundamental objective, predicting the most probable next word in a sequence, has remained unchanged, the models themselves have expanded dramatically in scale and capability, becoming the dominant paradigm in Natural Language Processing (NLP). Progress has been marked by the development of increasingly sophisticated evaluation benchmarks on one hand and by a growing demand for vast amounts of training data on the other. In this thesis, we examine how unstructured, primarily web-based data is utilized in LLM pre-training and fine-tuning. We investigate two principal roles that large textual corpora play within these models: first, as a source of world knowledge through retrieval augmentation, and second, as pre-training data. We begin by demonstrating how retrieval from large, unstructured web corpora can enhance performance on open-domain tasks, paving the way towards assistants capable of supporting humans in solving complex, knowledge-intensive problems. Next, we address the challenge of improving the robustness of pre-training data through the development of tools that enable qualitative analysis of massive text collections. Finally, we explore potential avenues for model scaling under data-constrained conditions, anticipating a future in which the entirety of publicly available web text may no longer suffice to meet the demands of ever-larger language models.
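For context, the next-word objective mentioned in the abstract is conventionally written as the autoregressive language-modeling loss; the formulation below is the standard one and is supplied here for orientation, not quoted from the thesis. Given a token sequence $w_1, \dots, w_T$, a model with parameters $\theta$ is trained to minimize

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t \mid w_1, \dots, w_{t-1}\right),
$$

and at inference time it predicts $\hat{w}_t = \arg\max_{w} p_\theta(w \mid w_{<t})$. Because pre-training maximizes this likelihood over ever-larger text corpora, the abstract's concern about exhausting publicly available web data follows directly from this objective.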
| File | Access | License | Size | Format |
|---|---|---|---|---|
| Tesi_dottorato_Piktus.pdf | open access | Creative Commons | 3.53 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/360699
URN:NBN:IT:UNIROMA1-360699