This dissertation investigates the use of artificial intelligence (AI) to enhance the automatic extraction of information from archival records, with a dual focus on improving trustworthiness and reducing computational costs. The research is motivated by the increasingly common use of AI in archival contexts, by both archivists and archival users. Conventional archival access relies on metadata (structured information about records produced by the process of archival description) which, despite its crucial importance for retrieval, limits the range of research questions through which an archive can be investigated. From the point of view of archival users, AI can be leveraged to conduct a distant reading of records by automatically extracting information about them. In doing so, the research aims to provide novel access keys that enrich archival metadata, thereby expanding the scope of archival discovery and research. After introducing archival science and the call for integrating AI into archival contexts, the thesis provides a state-of-the-art review of visual-language models (VLMs) and large language models (LLMs). The models are classified by architecture and training objective, and their recent applications in archival contexts are surveyed. The thesis then presents two empirical case studies drawn from healthcare data. The first involves a machine learning algorithm (XGBoost) in a federated learning setting to extract diagnostic information from electronic health records, while the second relies on convolutional neural networks (CNNs) to carry out human pose estimation on videos of preterm infants in order to study their motility and assess potential impairments. In both studies, data records are treated as independent, stand-alone documents, that is, in the same way automatic tools would process archival records.
The dissertation also highlights the economic costs of these tools, which may hinder their deployment, since CNNs, VLMs, and LLMs are compute-demanding models. This matters especially for small and underfunded institutions, such as those in healthcare contexts, where resource constraints are a barrier to the adoption of advanced AI tools. Moreover, to comply with archival theory and the trustworthiness of archives, particular attention is given to the reliability of these models, providing insight into their possible errors and biases (in the case of LLMs) as well as empirical rules on when it is best to use them (for XGBoost). By bridging the gap between the potential of AI-driven information extraction from data that may be records and the practical constraints of cost and sustainability, this dissertation offers a critical perspective on the future of digital archival research and AI. It proposes not only technological enhancements but also guidelines for developing more economically viable and reliable AI systems that meet the rigorous demands of archival trustworthiness.
This thesis investigates the use of artificial intelligence (AI) to enhance the automatic extraction of information from archival records, with a twofold objective: improving trustworthiness and containing computational costs. The work starts from the increasingly widespread use of AI in archival settings, by both archivists and users, and highlights how conventional access relies on metadata, structured information derived from archival description, which, although essential for retrieval, limits the research questions through which archival fonds can be explored. From the users' point of view, AI can be exploited for a "distant reading" of records, enabling the automatic extraction of information with which to enrich existing metadata and thereby broaden the possibilities for discovery and analysis. After an introduction to archival science and to the need to integrate AI into archival contexts, the thesis offers a state-of-the-art review of visual-language models (VLMs) and large language models (LLMs), classifying them by architecture and training objective and illustrating their most recent applications in archival contexts. Two empirical case studies based on healthcare data are then presented: the first uses XGBoost in a federated learning setting to extract diagnostic information from electronic health records; the second applies convolutional neural networks (CNNs) to depth videos of preterm infants to estimate their pose, study their motility, and assess possible motor impairments. In both cases, the data are stand-alone documents not linked by any archival bond, in line with the way automatic tools process archival records.
The thesis then highlights the economic costs associated with these tools, since CNNs, VLMs, and LLMs demand substantial computational resources, and underlines how this represents an obstacle for small or underfunded institutions, a situation typical of the healthcare sector. To remain consistent with archival theory and preserve the trustworthiness of collections, the work pays particular attention to the analysis of possible errors and biases (especially for LLMs) and provides empirical rules on when it is preferable to employ certain models (in particular, XGBoost). By bridging the gap between the potential of AI-based information extraction and the practical constraints of cost and sustainability, this research proposes not only advanced technological solutions but also guidelines for developing cheaper, more sustainable, and more reliable AI systems capable of meeting the rigorous trust requirements of digital archival work.
Artificial Intelligence Algorithms for Distant Reading of Archives
CACCIATORE, ALESSANDRO
2025
File: TESI_Cacciatore_REV.pdf | 2.95 MB | Adobe PDF | open access | License: all rights reserved
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/302554
URN:NBN:IT:UNIMC-302554