Neural Lemmatization for Early Slavic Texts

Nawaz, Usman

This thesis addresses lemmatization for Early Slavic languages, such as Old Church Slavonic (OCS) and Old East Slavic (OES), as a problem of end-to-end robustness under limited supervision, rich morphology, and significant variability in orthography and editorial practice. Unlike modern standard languages, Early Slavic texts are typically limited in size, heterogeneous, and affected by representational errors introduced during digitization, such as inconsistent use of diacritics and font-encoded text, including Unicode Private Use Area (PUA) code points. These factors undermine token consistency, inflate the size of the lexicon, and increase out-of-vocabulary (OOV) rates, which degrades both wordlist-based and neural character-based approaches in ways that can be difficult to diagnose. The primary objective is to develop a reproducible text-processing pipeline that ensures text consistency and improves lemmatization accuracy under conditions typical of Early Slavic texts, with particular emphasis on generalization to novel and orthographically diverse texts. The thesis introduces a newly annotated OCS dataset, which serves as a challenging benchmark for adaptation and robustness testing beyond existing standard resources. The results on this dataset also provide empirical support for the corpus-construction, normalization, and annotation strategies developed in the thesis, showing that newly curated and carefully standardized material is essential for robust lemmatization of orthographically diverse Early Slavic texts. The work also introduces an encoding-aware standardization layer that detects non-standard Unicode usage and deterministically normalizes PUA-dependent characters to stable Unicode equivalents.
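As a rough illustration of what such an encoding-aware standardization step could look like, the sketch below detects Private Use characters through their Unicode general category and replaces them deterministically from a lookup table. The table `PUA_MAP`, the function names, and the choice of NFC as the final normalization form are assumptions made for this example; the abstract does not specify the actual mapping or normalization form used in the thesis.

```python
import unicodedata

# Hypothetical mapping from font-specific PUA code points to standard Unicode.
# A real table depends on the font behind each source; these two entries are
# placeholders for illustration only.
PUA_MAP = {
    "\ue001": "\u0463",  # assumed: PUA glyph standing for CYRILLIC SMALL LETTER YAT
    "\ue002": "\ua657",  # assumed: PUA glyph standing for CYRILLIC SMALL LETTER IOTIFIED A
}


def find_pua(text):
    """Return the sorted set of Private Use characters (category 'Co') in text."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def standardize(text, pua_map=PUA_MAP):
    """Replace mapped PUA characters, then apply NFC so that a base letter plus
    combining diacritic and its precomposed form become the same string."""
    mapped = "".join(pua_map.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", mapped)


def unmapped_pua(text, pua_map=PUA_MAP):
    """List PUA characters that survive standardization and need manual review."""
    return find_pua(standardize(text, pua_map))
```

Because the mapping is a plain lookup followed by a standard normalization form, the same input always yields the same output, and `unmapped_pua` gives a simple way to surface characters the table does not yet cover.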
Standardization is treated as part of the modeling backbone, as it reduces OOV counts, improves the robustness of the frequency statistics used in lexical baselines, and allows comparison across mixed data sources. On top of this representation, the study introduces a transparent dictionary-based baseline that automatically infers a type-to-lemma mapping from the training data and handles OOV words conservatively. This baseline serves as a reference and diagnostic tool for identifying failures caused by limited coverage and ambiguity, while the main comparative evaluation also includes established Universal Dependencies (UD) systems such as Stanza and UDPipe 2.
The key contribution of this study is OLDSLAVICLEMMA, a neural lemmatizer for low-resource historical languages that does not rely on a dictionary. Instead, the model treats lemmatization as a character-level sequence transduction problem, conditioned on a fixed local context window surrounding the word to be lemmatized. The model consists of a stacked Bidirectional Long Short-Term Memory (BiLSTM) encoder and a Long Short-Term Memory (LSTM) decoder with attention. It uses Cross-Layer Multi-Head Attention (CLMA) to integrate low-level orthography and higher-level abstraction in context, and Encoder-Decoder Cross-Attention (EDCA) to generate context-aware lemmas and resolve ambiguity. Evaluation is conducted with gold tokenization and the official splits of the corresponding UD treebanks, in order to isolate lemmatization from tokenization and sentence segmentation. In addition, a supplementary raw-text-to-CoNLL-U evaluation is conducted on the Early Slavic UD v2.12 and v2.15 treebanks, where the Stanza tokenizer is used and OLDSLAVICLEMMA is applied to its tokenized output.
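The dictionary-based baseline mentioned above can be pictured with a minimal sketch: count the lemmas observed for each word type in the training data, keep the most frequent one, and fall back to a safe default for unseen forms. The function names are hypothetical, and returning the surface form unchanged for OOV words is only one plausible reading of "conservative"; the abstract does not state the exact fallback.

```python
from collections import Counter, defaultdict


def train_baseline(pairs):
    """Infer a type-to-lemma table from (form, lemma) training pairs.

    For each word type, the most frequent lemma seen in training is kept.
    """
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def lemmatize(form, table):
    """Look the form up; for OOV forms, copy the form unchanged.

    Copying the input is an assumed fallback chosen for this sketch.
    """
    return table.get(form, form)


# Toy training pairs for illustration only.
train_pairs = [
    ("глагола", "глаголати"),
    ("глагола", "глаголати"),
    ("глагола", "глаголъ"),
    ("рече", "рещи"),
]
table = train_baseline(train_pairs)
```

A table like this makes the two failure sources easy to separate: forms missing from `table` are coverage failures, while forms whose counter holds several lemmas are ambiguity failures.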
Four experimental settings are considered: (i) in-family evaluation on Early Slavic UD treebanks across different UD versions, (ii) evaluation across language families on the SIGTYP 2024 dataset in the constrained regime, (iii) multilingual evaluation on a curated subset of UD treebanks to assess transfer beyond Slavic languages, and (iv) evaluation on the newly annotated OCS benchmark to assess robustness, adaptation, and generalization beyond existing standard resources. In the in-family evaluation of Early Slavic languages, OLDSLAVICLEMMA outperforms strong baseline systems, including UDPipe 2 and the Stanza lemmatizer in its hybrid configuration, with the largest improvements observed in OOV-rich and orthographically complex settings. The results further show that the improvements arise mainly from better handling of OOV words rather than from simple memorization of frequent word–lemma mappings. Ablation experiments also show that encoder-decoder attention and optimization are crucial for robust performance in low-resource settings. In conclusion, the experiments support the claim that reliable historical lemmatization requires simultaneous control of representation stability and character-level contextualization, and that OOV handling and ambiguity are the primary failure modes for historical texts and should be accounted for in evaluation.
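To make the distinction between OOV handling and memorization concrete, the following sketch reports lemmatization accuracy separately for test forms that were seen in training and for those that were not. The function name and the definition of OOV as "surface form absent from the training forms" are assumptions for this example; the thesis may define the split differently.

```python
def accuracy_by_vocab(train_forms, gold, predicted):
    """Split accuracy into in-vocabulary and OOV buckets.

    train_forms: iterable of surface forms seen in training
    gold:        list of (form, gold_lemma) pairs from the test set
    predicted:   list of predicted lemmas, aligned with gold
    Returns a dict with accuracy per bucket, or None for an empty bucket.
    """
    seen = set(train_forms)
    stats = {"in_vocab": [0, 0], "oov": [0, 0]}  # [correct, total]
    for (form, gold_lemma), pred in zip(gold, predicted):
        bucket = "in_vocab" if form in seen else "oov"
        stats[bucket][1] += 1
        stats[bucket][0] += int(pred == gold_lemma)
    return {
        bucket: (correct / total if total else None)
        for bucket, (correct, total) in stats.items()
    }
```

Reporting the two buckets side by side shows whether a gain over a baseline comes from unseen forms or from forms that a lookup table would already cover.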

2026

Publication date: Jul 2026
Language: English
Supervisor, Advisor, or Tutor: LO PRESTI, Liliana
Co-supervisor, Co-tutor, or Coordinator: LA CASCIA, Marco
Publisher: Università degli Studi di Palermo
Publisher city: Palermo
Collection: Università degli Studi di Palermo

Files in this item:

File	Size	Format
Usman Nawaz PhD Thesis.pdf (open access; license: all rights reserved)	1.1 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/374242

The NBN code of this thesis is URN:NBN:IT:UNIPA-374242