
Unstructured data for large language models

PIKTUS, ALEKSANDRA
2026

Abstract

In recent years, we have witnessed an impressive rise in the ubiquity of large language models (LLMs). Although their fundamental objective, predicting the most probable next word in a sequence, has remained unchanged, the models themselves have expanded dramatically in scale and capability, becoming the dominant paradigm in Natural Language Processing (NLP). Progress has been marked by the development of increasingly sophisticated evaluation benchmarks on one hand and by a growing demand for vast amounts of training data on the other. In this thesis, we examine how unstructured, primarily web-based data is utilized in LLM pre-training and fine-tuning. We investigate two principal roles that large textual corpora play within these models: first, as a source of world knowledge through retrieval augmentation, and second, as pre-training data. We begin by demonstrating how retrieval from large, unstructured web corpora can enhance performance on open-domain tasks, paving the way towards assistants capable of supporting humans in solving complex, knowledge-intensive problems. Next, we address the challenge of improving the robustness of pre-training data through the development of tools that enable qualitative analysis of massive text collections. Finally, we explore potential avenues for model scaling under data-constrained conditions, anticipating a future in which the entirety of publicly available web text may no longer suffice to meet the demands of ever-larger language models.
28 Jan 2026
English
SILVESTRI, FABRIZIO
Università degli Studi di Roma "La Sapienza"
125
Files in this record:
Tesi_dottorato_Piktus.pdf — open access — License: Creative Commons — 3.53 MB — Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/360699
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-360699