Entity-Oriented Strategies for Information Extraction and Access in Knowledge-Intensive Domains

Pozzi, Riccardo

Knowledge-intensive domains such as law require accessing, integrating, and reasoning over large collections of heterogeneous documents, while meeting strict requirements on privacy, traceability, and regulatory compliance. Despite recent advances in large language models, their direct use in these settings is limited by hallucination, lack of grounding, and the legal constraints that complicate the transfer of sensitive data to external APIs. This thesis investigates how to satisfy information access use cases in the legal domain, including precedent retrieval, investigative search on seized data, document navigation, question answering, and statistical monitoring, under these constraints.
The work pursues three objectives. First, it quantifies to what extent general-domain entity extraction pipelines, including entity recognition, entity linking, and NIL prediction, can be applied to Italian legal judgments and investigative chat logs. The results show that incremental entity extraction, where novel (or NIL) entities are identified and added to the knowledge base, suffers from error propagation, and that detecting novel entities (NIL prediction) is a major performance bottleneck, supporting the need for architectures that tolerate imperfect extraction. Second, it designs an entity-centric data integration architecture that integrates heterogeneous legal sources (judgments, investigative chats, attachments) around entities, supports traceability and human oversight via error correction functionalities, remains useful despite extraction errors, and enables the considered use cases. Third, it develops ReFactX, a constrained-generation approach to question answering that injects facts from large knowledge bases into a large language model without retrievers or external calls, producing answers that are traceable and verifiable against grounded evidence while adding only negligible latency, thus remaining efficient and suitable for local deployment.
Together, the contributions represent an integrated approach for information access in knowledge-intensive legal settings. The designed entity-centric data integration architecture integrates the knowledge received from extraction services and can be paired with user interfaces or ReFactX to support downstream use cases, while preserving traceability, verifiability, and error-correction capability in line with the GDPR and AI Act requirements on data control and human oversight.
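
As an illustration of the constrained-generation idea mentioned in the abstract, the following is a minimal sketch, not the ReFactX implementation. It assumes facts are verbalized as token sequences and stored in a prefix tree, so that a generator can only emit continuations that stay inside a known fact; the abstract does not state that ReFactX works exactly this way. The fact list, the `<end>` marker, and the `toy_score` function (a stand-in for a language model's next-token scores) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical toy knowledge base: each fact is verbalized as a token sequence.
FACTS = [
    "Rome capital_of Italy",
    "Rome located_in Lazio",
    "Milan located_in Lombardy",
]

END = "<end>"  # assumed marker for "a complete fact ends here"


def build_trie(facts: List[str]) -> Dict[str, dict]:
    """Index the facts in a prefix tree keyed by whitespace-separated tokens."""
    root: Dict[str, dict] = {}
    for fact in facts:
        node = root
        for token in fact.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_decode(
    trie: Dict[str, dict],
    score: Callable[[List[str], str], float],
) -> List[str]:
    """Greedy decoding restricted to token sequences that exist in the trie."""
    node, output = trie, []
    while True:
        # Only continuations that keep the output inside a known fact are allowed.
        allowed = sorted(node)
        best = max(allowed, key=lambda tok: score(output, tok))
        if best == END:
            return output
        output.append(best)
        node = node[best]


def toy_score(prefix: List[str], token: str) -> float:
    """Stand-in for a language model; it 'prefers' Paris, which is not in the KB."""
    preferences = {"Paris": 9.0, "Milan": 3.0, "located_in": 2.0, "Lombardy": 1.0}
    return preferences.get(token, 0.0)


trie = build_trie(FACTS)
fact = constrained_decode(trie, toy_score)
print(" ".join(fact))
```

In this sketch the scorer's highest-rated token is never emitted because it does not continue any indexed fact, so every generated sequence can be traced back to an entry in the fact list. A real system would operate on model tokenizer IDs and a far larger index.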

	Publication date
	
				19-Feb-2026
			
	Language
	
				English
			
	Keywords
	
				nlp; entity linking; knowledge base; data integration; legal
			
	Supervisor, Advisor, or Tutor
	
				ZANDRON, CLAUDIO
				PALMONARI, MATTEO LUIGI
			
	Collection
	
				Università degli Studi di Milano - Bicocca

Files in this item:

File	Access	License	Size	Format
phd_unimib_807857.pdf	Open access	All rights reserved	4.67 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/368725

The NBN code of this thesis is URN:NBN:IT:UNIMIB-368725