Nowadays, most of the world’s information is produced in unstructured textual form. This vast amount of text represents an invaluable source of knowledge, yet it remains challenging for machines to interpret and transform into actionable insights. Information Extraction (IE) offers a promising solution by converting raw text into machine-readable representations. However, traditional IE systems are limited by their dependence on predefined schemas and by the scarcity of linguistic resources for non-English languages. This thesis investigates how Large Language Models (LLMs) can overcome these limitations and, conversely, how IE can enhance their reliability and safety. It examines the symbiotic relationship between structured knowledge extraction and language modeling, where language models enhance the expressiveness and adaptability of IE, while IE provides structure, grounding, and interpretability to language models. To advance knowledge discovery, the thesis introduces LLIMONIIE, a novel LLM-based approach to Open Information Extraction that generates flexible, normalized relational structures, addressing long-standing issues of schema rigidity and linguistic coverage. LLIMONIIE achieves up to 10% higher F1 scores than state-of-the-art baselines and enables the first large-scale Italian OIE resource, broadening the applicability of structured extraction to non-English languages. Building on the mutual relationship between LLMs and IE, the research then explores how IE can be applied to AI safety, transforming it into a structured defense mechanism against prompt manipulation and adversarial attacks. Following a systematic assessment of vulnerabilities in small language models, the work introduces Jailbreak Segmentation, a token-level extraction method that reduces attack success rates by up to 97%, outperforming state-of-the-art models that are over an order of magnitude larger. Complementing this, a rationale-aware moderation dataset named REMEDy is proposed, providing fine-grained explanations of model decisions and improving transparency, robustness, and trust. Fine-tuning on REMEDy improves zero-shot F1 scores from 50% to 75%, demonstrating that reasoning over explicit rationales enhances both explainability and resilience to adversarial inputs. Overall, this thesis demonstrates that Information Extraction is not only a tool for uncovering factual knowledge but also a foundational framework for enhancing the reliability, interpretability, and safety of modern AI systems.
Advancing Information Extraction with Large Language Models: The Role of Structured Understanding in Knowledge Management and AI Safety
PIANO, LEONARDO
2026
Abstract
Nowadays, most of the world’s information is produced in unstructured textual form. This vast amount of text represents an invaluable source of knowledge, yet it remains challenging for machines to interpret and transform into actionable insights. Information Extraction (IE) offers a promising solution by converting raw text into machine-readable representations. However, traditional IE systems are limited by their dependence on predefined schemas and by the scarcity of linguistic resources for non-English languages. This thesis investigates how Large Language Models (LLMs) can overcome these limitations and, conversely, how IE can enhance their reliability and safety. It examines the symbiotic relationship between structured knowledge extraction and language modeling, where language models enhance the expressiveness and adaptability of IE, while IE provides structure, grounding, and interpretability to language models. To advance knowledge discovery, the thesis introduces LLIMONIIE, a novel LLM-based approach to Open Information Extraction that generates flexible, normalized relational structures, addressing long-standing issues of schema rigidity and linguistic coverage. LLIMONIIE achieves up to 10% higher F1 scores than state-of-the-art baselines and enables the first large-scale Italian OIE resource, broadening the applicability of structured extraction to non-English languages. Building on the mutual relationship between LLMs and IE, the research then explores how IE can be applied to AI safety, transforming it into a structured defense mechanism against prompt manipulation and adversarial attacks. Following a systematic assessment of vulnerabilities in small language models, the work introduces Jailbreak Segmentation, a token-level extraction method that reduces attack success rates by up to 97%, outperforming state-of-the-art models that are over an order of magnitude larger. Complementing this, a rationale-aware moderation dataset named REMEDy is proposed, providing fine-grained explanations of model decisions and improving transparency, robustness, and trust. Fine-tuning on REMEDy improves zero-shot F1 scores from 50% to 75%, demonstrating that reasoning over explicit rationales enhances both explainability and resilience to adversarial inputs. Overall, this thesis demonstrates that Information Extraction is not only a tool for uncovering factual knowledge but also a foundational framework for enhancing the reliability, interpretability, and safety of modern AI systems.| File | Dimensione | Formato | |
|---|---|---|---|
|
tesi di dottorato_Leonardo Piano.pdf
accesso aperto
Licenza:
Tutti i diritti riservati
Dimensione
1.81 MB
Formato
Adobe PDF
|
1.81 MB | Adobe PDF | Visualizza/Apri |
I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/20.500.14242/362924
URN:NBN:IT:UNICA-362924