Advancing Information Extraction with Large Language Models: The Role of Structured Understanding in Knowledge Management and AI Safety

PIANO, LEONARDO
2026

Abstract

Nowadays, most of the world’s information is produced in unstructured textual form. This vast amount of text represents an invaluable source of knowledge, yet it remains challenging for machines to interpret and transform into actionable insights. Information Extraction (IE) offers a promising solution by converting raw text into machine-readable representations. However, traditional IE systems are limited by their dependence on predefined schemas and by the scarcity of linguistic resources for non-English languages. This thesis investigates how Large Language Models (LLMs) can overcome these limitations and, conversely, how IE can enhance their reliability and safety. It examines the symbiotic relationship between structured knowledge extraction and language modeling: language models enhance the expressiveness and adaptability of IE, while IE provides structure, grounding, and interpretability to language models.

To advance knowledge discovery, the thesis introduces LLIMONIIE, a novel LLM-based approach to Open Information Extraction (OIE) that generates flexible, normalized relational structures, addressing long-standing issues of schema rigidity and limited linguistic coverage. LLIMONIIE achieves up to 10% higher F1 scores than state-of-the-art baselines and enables the first large-scale Italian OIE resource, broadening the applicability of structured extraction to non-English languages.

Building on this mutual relationship between LLMs and IE, the research then explores how IE can be applied to AI safety, recasting extraction as a structured defense mechanism against prompt manipulation and adversarial attacks. Following a systematic assessment of vulnerabilities in small language models, the work introduces Jailbreak Segmentation, a token-level extraction method that reduces attack success rates by up to 97%, outperforming state-of-the-art models that are over an order of magnitude larger. Complementing this, a rationale-aware moderation dataset named REMEDy is proposed, providing fine-grained explanations of model decisions and improving transparency, robustness, and trust. Fine-tuning on REMEDy raises zero-shot F1 scores from 50% to 75%, demonstrating that reasoning over explicit rationales enhances both explainability and resilience to adversarial inputs.

Overall, this thesis demonstrates that Information Extraction is not only a tool for uncovering factual knowledge but also a foundational framework for enhancing the reliability, interpretability, and safety of modern AI systems.
24 February 2026
English
CARTA, SALVATORE MARIO
POMPIANU, LIVIO
Università degli Studi di Cagliari
Files in this record:
tesi di dottorato_Leonardo Piano.pdf (open access; License: All rights reserved; 1.81 MB, Adobe PDF)

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/362924
The NBN code for this thesis is URN:NBN:IT:UNICA-362924