Beyond traditional search: bridging retrieval, reasoning, and language barriers in intelligent search systems

Bacciu, Andrea

This thesis investigates approaches to address fundamental limitations in modern search systems, with a particular focus on enhancing the synergy between retrieval and reasoning components and improving the ease of information access. A primary research thread explores the optimization of interactions between retrievers and reasoning components. We specifically address the challenge of "hard false relevant" documents: those that appear superficially relevant but lack true semantic alignment with queries. To tackle this issue, we introduce Reinforced Retrieval Augmented Machine Learning (RRAML), a visionary framework that enables retrieval systems to be fine-tuned within a Retrieval-Augmented Generation (RAG) architecture. This approach allows the retriever to adapt through continued training when combined with a reasoner in a RAG scenario, so that it retrieves fewer false relevant documents. Within the same research direction, we explore Neural Semantic Parsing (NSP), which uses Large Language Models (LLMs) to translate natural language queries into a machine-readable format that can be used to retrieve information from knowledge graphs. This approach represents an alternative to traditional RAG systems: instead of retrieving from unstructured text, the LLM facilitates access to structured and verified information stored in knowledge graphs, providing greater control and transparency in the information retrieval process. To enhance this system's reliability, we developed the Hallucination Simulation Framework, which deliberately induces hallucinations in semantic parsers during training. Complementing this, we created the Hallucination Detection Model (HDM), which identifies and mitigates hallucinations stemming from knowledge gaps, improving answer reliability by 20%. This framework enables semantic parsers to recognize their knowledge boundaries and uncertainty levels, resulting in more transparent and trustworthy responses.
To further support the democratization of information, this thesis introduces approaches to bridge linguistic and cultural barriers, enabling users worldwide to access accurate, relevant information regardless of language or technical proficiency. This goal led to the development of a new architecture, x-NDB, for cross-lingual neural databases and the creation of X-WikiNLDB, a dataset containing unstructured text in multiple languages simulating data retrieved online. X-WikiNLDB facilitates robust cross-lingual information retrieval. Our cross-lingual performance is on par with previous work in the English-only scenario. Furthermore, we demonstrated zero-shot performance improvements of 2-5× compared to the multilingual counterpart across several low-resource languages: Catalan, Tagalog, Yoruba, Japanese, and Korean. This result suggests that cross-lingual training encourages models to capture deeper semantic understanding rather than surface-level patterns, enabling better generalization across linguistic boundaries. To advance language accessibility further, we developed Fauno, DanteLLM, and OpenDanteLLM, a series of open-source Italian language models. Starting with Fauno, we created the first open-source conversational Italian LLM along with a novel conversational dataset. Building upon this foundation, DanteLLM achieved a 10% increase over Fauno and 7% over the best competitor across comprehensive Italian benchmarks. OpenDanteLLM, while trained exclusively on open-source data, showed a 5% improvement over existing methods while ensuring unrestricted access through its commercial-friendly Apache 2.0 license. This work, which gained particular relevance during ChatGPT's temporary ban in Italy, has established a new direction in language-specific AI development, inspiring other researchers to create additional Italian language models. Our approach demonstrates the feasibility of building high-performance, privacy-preserving language models that operate entirely offline, providing Italian-speaking users with reliable alternatives to centralized systems.
The thesis also addresses the challenge of ambiguous queries and the dependence on behavioral data in search systems. We introduce Generative Query Recommendation (GQR), a zero-shot approach that reimagines query expansion without relying on user query logs. By leveraging LLMs as the sole component, GQR eliminates the complex pipelines and query log dependencies associated with traditional methods. Our system significantly outperformed industry standards, demonstrating a 10-point improvement in NDCG@10. The generated query reformulations showed reduced ambiguity with respect to the document collection, as measured by a 7-point increase in the Simplified Clarity Score. A blind user study with 12 annotators further validated GQR's effectiveness, with users preferring our recommendations approximately 60% of the time compared to leading industry alternatives. This approach establishes a new paradigm for query recommendations that achieves superior retrieval accuracy and user satisfaction without requiring behavioral data.
Finally, we present the Multi-Relevant Future Items Evaluation (MRFI) protocol for Sequential Recommender Systems (SRS), which improves evaluation by considering multiple relevant items. MRFI, alongside a novel loss function that integrates relevance feedback, enhances recommendation accuracy and reliability, achieving improvements of 2.82 points in NDCG@10 and 0.64% in Hit Rate across several benchmark datasets. This methodological innovation provides a robust basis for evaluating and training sequential recommendation systems across diverse applications.
Together, these contributions address the core challenges of relevance, reliability, and accessibility in information retrieval and language technology, supporting the vision of universally accessible, high-quality information systems.
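For readers unfamiliar with the ranking metrics quoted in the abstract, the following is a minimal sketch of how NDCG@k and Hit Rate@k are commonly computed. It is not the thesis's evaluation code: the function names and the toy relevance list are hypothetical, and the ideal ranking is assumed to be a reordering of the retrieved list only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG@k divided by the DCG@k of the same list sorted best-first.

    Assumption: the ideal ordering is built from the retrieved list itself,
    not from every relevant item in the collection.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if set(ranked_items[:k]) & set(relevant_items) else 0.0


# Hypothetical toy ranking: relevant results at positions 2 and 4.
toy_relevances = [0, 1, 0, 1]
toy_ndcg = ndcg_at_k(toy_relevances, k=10)

# Hypothetical case with several relevant future items, as in a
# multi-relevant evaluation setting.
toy_hit = hit_rate_at_k(["a", "b", "c"], {"c", "z"}, k=3)
```

If scores are reported on a 0-100 scale, which is an assumption here, a "10-point improvement in NDCG@10" corresponds to a difference of 0.10 in the value returned by `ndcg_at_k`.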

2025

Faculty/Department: DIPARTIMENTO DI INGEGNERIA INFORMATICA, AUTOMATICA E GESTIONALE -ANTONIO RUBERTI-
Course of study: Other doctoral program
Publication date: 23 Jan 2025
Language: English
Supervisor, Advisor, or Tutor: SILVESTRI, FABRIZIO
Co-Supervisor, Second Reader, Co-Tutor, or Coordinators: LENZERINI, Maurizio
Publisher: Università degli Studi di Roma "La Sapienza"
Number of pages: 125
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Bacciu.pdf (open access)	2.26 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/189630

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-189630