Advancing Information Extraction with Large Language Models: The Role of Structured Understanding in Knowledge Management and AI Safety

Piano, Leonardo

Nowadays, most of the world’s information is produced in unstructured textual form. This vast amount of text represents an invaluable source of knowledge, yet it remains challenging for machines to interpret and transform into actionable insights. Information Extraction (IE) offers a promising solution by converting raw text into machine-readable representations. However, traditional IE systems are limited by their dependence on predefined schemas and by the scarcity of linguistic resources for non-English languages. This thesis investigates how Large Language Models (LLMs) can overcome these limitations and, conversely, how IE can enhance their reliability and safety. It examines the symbiotic relationship between structured knowledge extraction and language modeling, where language models enhance the expressiveness and adaptability of IE, while IE provides structure, grounding, and interpretability to language models. To advance knowledge discovery, the thesis introduces LLIMONIIE, a novel LLM-based approach to Open Information Extraction that generates flexible, normalized relational structures, addressing long-standing issues of schema rigidity and linguistic coverage. LLIMONIIE achieves up to 10% higher F1 scores than state-of-the-art baselines and enables the first large-scale Italian OIE resource, broadening the applicability of structured extraction to non-English languages. Building on the mutual relationship between LLMs and IE, the research then explores how IE can be applied to AI safety, transforming it into a structured defense mechanism against prompt manipulation and adversarial attacks. Following a systematic assessment of vulnerabilities in small language models, the work introduces Jailbreak Segmentation, a token-level extraction method that reduces attack success rates by up to 97%, outperforming state-of-the-art models that are over an order of magnitude larger. Complementing this, a rationale-aware moderation dataset named REMEDy is proposed, providing fine-grained explanations of model decisions and improving transparency, robustness, and trust. Fine-tuning on REMEDy improves zero-shot F1 scores from 50% to 75%, demonstrating that reasoning over explicit rationales enhances both explainability and resilience to adversarial inputs. Overall, this thesis demonstrates that Information Extraction is not only a tool for uncovering factual knowledge but also a foundational framework for enhancing the reliability, interpretability, and safety of modern AI systems.

Advancing Information Extraction with Large Language Models: The Role of Structured Understanding in Knowledge Management and AI Safety

PIANO, LEONARDO

2026

Abstract

Nowadays, most of the world’s information is produced in unstructured textual form. This vast amount of text represents an invaluable source of knowledge, yet it remains challenging for machines to interpret and transform into actionable insights. Information Extraction (IE) offers a promising solution by converting raw text into machine-readable representations. However, traditional IE systems are limited by their dependence on predefined schemas and by the scarcity of linguistic resources for non-English languages. This thesis investigates how Large Language Models (LLMs) can overcome these limitations and, conversely, how IE can enhance their reliability and safety. It examines the symbiotic relationship between structured knowledge extraction and language modeling, where language models enhance the expressiveness and adaptability of IE, while IE provides structure, grounding, and interpretability to language models. To advance knowledge discovery, the thesis introduces LLIMONIIE, a novel LLM-based approach to Open Information Extraction that generates flexible, normalized relational structures, addressing long-standing issues of schema rigidity and linguistic coverage. LLIMONIIE achieves up to 10% higher F1 scores than state-of-the-art baselines and enables the first large-scale Italian OIE resource, broadening the applicability of structured extraction to non-English languages. Building on the mutual relationship between LLMs and IE, the research then explores how IE can be applied to AI safety, transforming it into a structured defense mechanism against prompt manipulation and adversarial attacks. Following a systematic assessment of vulnerabilities in small language models, the work introduces Jailbreak Segmentation, a token-level extraction method that reduces attack success rates by up to 97%, outperforming state-of-the-art models that are over an order of magnitude larger. Complementing this, a rationale-aware moderation dataset named REMEDy is proposed, providing fine-grained explanations of model decisions and improving transparency, robustness, and trust. Fine-tuning on REMEDy improves zero-shot F1 scores from 50% to 75%, demonstrating that reasoning over explicit rationales enhances both explainability and resilience to adversarial inputs. Overall, this thesis demonstrates that Information Extraction is not only a tool for uncovering factual knowledge but also a foundational framework for enhancing the reliability, interpretability, and safety of modern AI systems.

Scheda breve

Scheda completa

Scheda completa (DC)

	Corso di studio
	
				MATEMATICA E INFORMATICA
			
	Data di pubblicazione
	
				24-feb-2026
			
	Lingua
	
				Inglese
			
	Relatore, Supervisor, Advisor o Tutor
	
				CARTA, SALVATORE MARIO
POMPIANU, LIVIO
			
	Nome Editore
	
				Università degli Studi di Cagliari
			
	Collezione di appartenenza
	
				Università degli Studi di Cagliari

File in questo prodotto:

File	Dimensione	Formato
tesi di dottorato_Leonardo Piano.pdf accesso aperto Licenza: Tutti i diritti riservati Dimensione 1.81 MB Formato Adobe PDF Visualizza/Apri	1.81 MB	Adobe PDF	Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/362924

Il codice NBN di questa tesi è URN:NBN:IT:UNICA-362924