Computational Authorship Analysis: Applications and Issues in the Cultural Heritage Field

Corbara, Silvia

The discipline of Authorship Analysis studies the linguistic style of written documents to determine information about their authorship. Unlike traditional methodologies, it leverages statistical methods and focuses on quantifiable linguistic events rather than the literary content of the text. In recent years, this field has grown significantly thanks to advances in information technology, which have enabled the use of Machine Learning and Natural Language Processing tools, and it has been applied in various domains, from cybersecurity to forensics. This Ph.D. thesis investigates the application of Computational Authorship Analysis methodologies in the cultural heritage domain. Building on the experience gathered through a case study (the debated Dantean authorship of the historic document Epistle to Cangrande), we address what we believe are the four main issues in this application domain: i) the identification of features that allow for accurate classification while being topic-agnostic; ii) the limited size of the datasets usually available in these studies; iii) the challenges that arise when the document under scrutiny may be a forgery; and iv) the need to provide cultural heritage scholars with proper explanations of the computational system's findings. Each of these issues is covered by a dedicated chapter of this dissertation, in which we examine the background of the problem in depth, describe our proposed solutions, and present the related results of our research.
In particular, we: i) introduce the use of rhythmic features; ii) evaluate an alternative vectorial representation of the documents, based on the concept of document pairs; iii) propose augmenting the classifier's training data with automatically generated samples that mimic the work of a forger; and iv) assess the suitability of some modern explainability methods for a cultural heritage audience. With this work, we aim to offer a comprehensive overview of the Authorship Analysis field and to provide guidance on best practices for its application in cultural heritage.

2024

Degree programme: Data Science
Publication date: 25 Nov 2024
Language: English
Supervisors / Advisors / Tutors: Monreale, Anna; Sebastiani, Fabrizio; Moreo, Alejandro
Publisher: Scuola Normale Superiore
Referees: Anonymous experts
Collection: Scuola Normale Superiore

Files in this item:

File	Size	Format
Tesi.pdf (open access; license: Creative Commons)	4.33 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/305907

The NBN code of this thesis is URN:NBN:IT:SNS-305907