Advancing Multilingual DRS-based Semantic Parsing and Generation: A Framework for Data Transformation, Robust Evaluation, and Task Reversibility
AMIN, MUHAMMAD SAAD
2025
Abstract
Recent advancements in semantic parsing and text generation through Discourse Representation Structures (DRS) underscore the critical need for innovative methodologies to enhance neural model performance, particularly in multilingual and resource-constrained environments. This research presents a comprehensive framework addressing these challenges through multiple complementary approaches: data transformation techniques, alternative evaluation methodologies, and task reversibility analysis.

The foundation of this work lies in novel data transformation strategies, encompassing both data augmentation and delexicalization. These techniques employ multilingual and multifaceted approaches, such as manipulating named entities, leveraging WordNet-based lexical substitutions, applying supersenses, and implementing grammatical transformations. The effectiveness of these methods has been demonstrated across typologically diverse languages: English, Italian, and Urdu. For English, the augmentation framework expanded the Parallel Meaning Bank (PMB) dataset ninefold, yielding substantial improvements in model performance. In Italian, the application of cross-lingual resources led to significant enhancements in semantic parsing and generation capabilities. For Urdu, a low-resource language, a novel rule-based alignment method was developed to transform English DRS, complemented by various augmentation strategies.

A key contribution of this research is the introduction of innovative bidirectional evaluation methodologies. The Parse-Generate (Pars-Gen) and Generate-Parse (Gen-Pars) approaches provide a holistic assessment framework that addresses the limitations of traditional metrics. While SMATCH effectively captures structural overlaps, it may miss nuances in linguistic expression. Conversely, generation metrics like BLEU, COMET, and BERTScore often overlook core semantic equivalences. This dual evaluation approach offers a more comprehensive assessment of system performance across languages.

Furthermore, the research explores task reversibility in semantic processing through Parse-Generate-Parse (PGP) and Generate-Parse-Generate (GPG) pipelines. This investigation reveals complex dynamics between error propagation and mitigation across languages, with English demonstrating the highest stability, Italian showing moderate variations, and Urdu exhibiting the most volatility. The analysis spans multiple dimensions, including sentence length, structural complexity, sentence type, polarity, and voice, providing valuable insights into error behavior patterns.

Extensive experiments utilizing state-of-the-art neural language models, including byT5, mT5, T5, and mBART, as well as LSTM-based sequence-to-sequence architectures, have demonstrated significant performance improvements across multiple evaluation metrics. These advancements represent a substantial contribution to computational semantics, introducing novel approaches for improving semantic parsing and text generation across diverse linguistic contexts. Moreover, they establish a foundation for developing more robust, generalizable, and linguistically inclusive natural language processing systems, particularly beneficial for low-resource languages and limited-data scenarios.
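For illustration only, the Parse-Generate-Parse (PGP) and Generate-Parse-Generate (GPG) round-trip idea described in the abstract can be sketched as below. This is a minimal sketch, not code released with the thesis: the `parse`, `generate`, `smatch_score`, and `text_similarity` callables are hypothetical placeholders standing in for the thesis's fine-tuned seq2seq models and evaluation metrics.

```python
# Illustrative sketch of the round-trip (reversibility) pipelines.
# All callables are assumptions supplied by the caller, e.g. wrappers around
# a byT5/mT5 model fine-tuned on the PMB and a SMATCH/BLEU implementation.

def pgp_round_trip(text, parse, generate, smatch_score):
    """Parse-Generate-Parse (PGP) check for one sentence."""
    drs_first = parse(text)            # text -> DRS (semantic parsing)
    text_back = generate(drs_first)    # DRS  -> text (generation)
    drs_second = parse(text_back)      # text -> DRS again
    # A stable system should reproduce (nearly) the same structure.
    return smatch_score(drs_first, drs_second)


def gpg_round_trip(drs, parse, generate, text_similarity):
    """Generate-Parse-Generate (GPG) check for one DRS."""
    text_first = generate(drs)         # DRS  -> text
    drs_back = parse(text_first)       # text -> DRS
    text_second = generate(drs_back)   # DRS  -> text again
    # Compare the two surface realisations, e.g. with BLEU or BERTScore.
    return text_similarity(text_first, text_second)
```

Under this reading, higher round-trip scores indicate lower error propagation through the pipeline, which is how the abstract's stability comparison across English, Italian, and Urdu can be operationalised.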
File | Size | Format |
---|---|---|---
phd_thesis__final_submission.pdf (open access) | 2.66 MB | Adobe PDF | View/Open
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/201082
URN:NBN:IT:UNITO-201082