Improving Digital Libraries for Non-Latin Languages through Artificial Intelligence-Based Methods

Aftar, Sania

This dissertation addresses the challenges and opportunities in data management, analytics, and AI-driven knowledge extraction for multilingual and multi-alphabetic cultural heritage, with a specific focus on Arabic script in digital libraries. The research is conducted within the \textit{Digital Maktaba} project, part of the ITSERR project (the Italian Research Infrastructure for Digital and Cultural Heritage), which is funded by NextGenerationEU through the National Recovery and Resilience Plan (PNRR) and aligned with the RESILIENCE European research infrastructure for Religious Studies. Its aim is to develop innovative methodologies for cataloging, processing, and semantically enriching non-Latin scripts. The dissertation identifies persistent asymmetries in current digital library and AI-based systems, where non-Latin languages such as Arabic remain underrepresented in both classification standards and computational resources. These disparities create gaps between cultural heritage archives and modern data-driven infrastructures. To address these challenges, the dissertation presents a methodology structured around two key components: metadata standardization and semantic analysis of content. The first component builds on a detailed review of extraction techniques for bibliographic metadata, such as ISBNs, book titles, and author names in Arabic records. It then develops and tests an automated pipeline that combines regex parsing, character identification and correction, checksum validation, and fallback matching against the La Pira Library catalog. The second component focuses on contextual categorization and semantic analysis of short texts. To investigate this aspect, the study uses Hadith as a linguistically and culturally significant corpus.
Its linguistic and theological depth, combined with the diversity of Classical Arabic, makes it an excellent choice for experimenting with and evaluating advanced topic modeling techniques for the Arabic language. The complex morphology and linguistic variability of Arabic present additional challenges for semantic modeling, so a robust preprocessing stage is needed to ensure that the data is clean and consistent. The study presents a customized preprocessing pipeline that fully separates the Matn (main text) from the narrators' names (Isnad). It then applies diacritic removal, lemmatization, and normalization to prepare the data for semantic modeling. Building on this preprocessing pipeline, the study proposes a neural architecture, \textit{RoBERT2VecTM}, which is based on hybrid embeddings. Compared with existing approaches, the model significantly improves topic diversity and coherence, providing more precise representations of topics in Classical Arabic. The study is then extended with the AZIM framework, an Arabic-centric zero-shot topic modeling approach that supports transfer between languages. The model is trained on Arabic and transfers the learned topics to unseen documents in both Latin and non-Latin languages. The findings show that summarized text can be more effective for topic modeling and improves topic coherence, and that Arabic-trained embeddings can benefit languages such as Persian and Urdu. The model is further improved with \textit{CoRefine}, which introduces automated theme categorization to overcome the drawbacks of manual labeling. This refinement aligns the automatically derived Hadith topics with an Islamic taxonomy, which enables hierarchical organization and concept-based access to religious knowledge.
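
As an illustration of the diacritic removal and normalization steps mentioned above, the following is a minimal Python sketch. It is not the dissertation's pipeline: the function names (`strip_diacritics`, `normalize`) and the specific normalization rules (folding alef variants into bare alef, mapping alef maqsura to ya, dropping tatweel) are assumptions based on conventions commonly used in Arabic text processing. Matn/Isnad separation and lemmatization are not covered here.

```python
import re

# Arabic short-vowel marks and related signs (U+064B..U+0652),
# superscript alef (U+0670) and tatweel/kashida (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Remove diacritics and tatweel, leaving base letters untouched."""
    return DIACRITICS.sub("", text)


def normalize(text: str) -> str:
    """Apply a small set of orthographic normalizations (assumed rules)."""
    text = strip_diacritics(text)
    # Fold alef with madda / hamza above / hamza below into bare alef.
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)
    # Map alef maqsura to ya.
    text = text.replace("\u0649", "\u064A")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize("إِنَّمَا الأَعْمَالُ بِالنِّيَّاتِ"))
```

Whether such folding is appropriate depends on the downstream model; aggressive normalization reduces sparsity but can merge distinct word forms.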

This thesis addresses the challenges and opportunities in data management, analytics, and AI-based intelligent knowledge extraction for multilingual and multi-alphabetic cultural heritage, with particular attention to Arabic script in digital libraries. The research is carried out within the Digital Maktaba project, part of the Italian infrastructure ITSERR (Research Infrastructure for Digital and Cultural Heritage), funded by NextGenerationEU through the National Recovery and Resilience Plan (PNRR) and aligned with the European RESILIENCE infrastructure for religious studies. The goal is to develop innovative methodologies for cataloging, processing, and semantically enriching non-Latin scripts. The thesis identifies the existing asymmetries in digital libraries and AI-based systems, where non-Latin languages such as Arabic remain underrepresented in both classification standards and computational resources, creating a gap between cultural heritage archives and modern data-driven infrastructures. The research proposes a methodology built on two main components: metadata standardization and semantic analysis of content. The first component is based on a review of extraction techniques for bibliographic metadata (such as ISBNs or book titles and author names in Arabic records) and on the development of an automated pipeline that integrates regex parsing, character identification and correction, checksum validation, and fallback matching against the catalog of the La Pira Library. The second component concerns contextual categorization and semantic analysis of short texts, using the Ḥadīth as the reference linguistic and cultural corpus.
The morphological complexity and linguistic variability of Classical Arabic require a robust preprocessing stage, which in this thesis separates the Matn (main text) from the narrators (Isnad) and applies diacritic removal, lemmatization, and normalization to obtain consistent data. On this basis the thesis develops the neural architecture RoBERT2VecTM, founded on hybrid embeddings, which improves topic diversity and coherence over previous models and offers more precise representations of content in Classical Arabic. The research then introduces the AZIM framework, an Arabic-centric zero-shot topic modeling approach capable of transferring learned topics to Latin and non-Latin languages. The results show that summarized texts increase topic coherence and that embeddings trained on Arabic can also be useful for related languages such as Persian and Urdu. Finally, the model is further refined through CoRefine, which introduces automatic theme categorization and aligns the topics derived from the Ḥadīth with an Islamic taxonomy. This approach enables hierarchical organization and concept-based access to religious knowledge, contributing to the advancement of the digital representation of multilingual and multi-alphabetic cultural heritage.
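
The regex parsing and checksum validation stages of the metadata pipeline can be illustrated with a short Python sketch. This is a hedged example rather than the thesis implementation: the names `ISBN_RE`, `DIGIT_MAP`, `valid_isbn10`, `valid_isbn13`, and `extract_isbns` are hypothetical, and the mapping of Arabic-Indic digits to ASCII is an assumption about what "character identification and correction" might include. Fallback matching against the library catalog is not shown.

```python
import re

# Map Arabic-Indic (U+0660..U+0669) and Eastern Arabic-Indic
# (U+06F0..U+06F9) digits to ASCII digits (assumed correction step).
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})

# Candidate ISBN-10 or ISBN-13, with optional hyphens or spaces.
ISBN_RE = re.compile(
    r"(?<!\d)(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx](?!\d)", re.ASCII
)


def valid_isbn10(s: str) -> bool:
    """ISBN-10: weights 10..1, total divisible by 11, 'X' means 10."""
    if not re.fullmatch(r"\d{9}[\dX]", s, re.ASCII):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(s))
    return total % 11 == 0


def valid_isbn13(s: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, total divisible by 10."""
    if not re.fullmatch(r"\d{13}", s, re.ASCII):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(s))
    return total % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return checksum-valid ISBNs found in text, without separators."""
    text = text.translate(DIGIT_MAP)
    found = []
    for match in ISBN_RE.finditer(text):
        candidate = re.sub(r"[-\s]", "", match.group()).upper()
        if len(candidate) == 10 and valid_isbn10(candidate):
            found.append(candidate)
        elif len(candidate) == 13 and valid_isbn13(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    print(extract_isbns("ISBN: ٩٧٨-٣-١٦-١٤٨٤١٠-٠"))
```

Candidates that fail the checksum are simply dropped in this sketch; a fuller pipeline would presumably pass them to a correction step or to catalog matching instead.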