Computational Authorship Analysis: Applications and Issues in the Cultural Heritage Field

CORBARA, Silvia
2024

Abstract

The discipline of Authorship Analysis studies the linguistic style of written documents to determine information about their authorship. Unlike traditional methodologies, it leverages statistical methods and focuses on quantifiable linguistic events rather than the literary content of the text. In recent years, this field has grown significantly thanks to advances in information technology, which have enabled the use of Machine Learning and Natural Language Processing tools, and it has been applied in various domains, ranging from cybersecurity to forensics. This Ph.D. thesis investigates the application of Computational Authorship Analysis methodologies in the cultural heritage domain. Building on the experience gathered through a case study (the debated Dantean authorship of the historical document Epistle to Cangrande), we address what we believe are the four main issues in this application domain: i) the identification of features that allow for accurate classification while being topic-agnostic; ii) the limited size of the datasets usually available in these studies; iii) the challenges that arise when the document under scrutiny may be a forgery; and iv) the need to provide cultural heritage scholars with proper explanations of the computational system's findings. Each of these issues is covered by a dedicated chapter of this dissertation, in which we examine the background of the problem in depth, describe our proposed solutions, and present the results of our research on the matter.
In particular, we: i) introduce the use of rhythmic features; ii) evaluate an alternative vectorial representation of the documents, based on the concept of document pairs; iii) propose augmenting the classifier's training data with automatically generated samples that mimic the work of a forger; and iv) assess the suitability of some modern explainability methods for a cultural heritage audience. With this work, we aim to offer a comprehensive overview of the Authorship Analysis field and to provide guidance on best practices for its application in cultural heritage.
25 Nov 2024
English
Scuola Normale Superiore
Anonymous experts
Files in this item:
File: Tesi.pdf
Access: open access
License: All rights reserved
Size: 4.33 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/305907
The NBN code of this thesis is URN:NBN:IT:SNS-305907