Addressing Key Challenges in Narrative Understanding
BONOMO, TOMMASO
2026
Abstract
Narrative text is one of the most demanding domains for automatic language understanding. At book scale, characters can be represented through varying names, pronouns, and implicit references, and answering questions often requires evidence distributed across entire chapters. Current NLP systems, developed primarily for short documents, face four interconnected limitations: unreliable entity tracking, evaluation metrics that correlate poorly with human judgments, retrieval pipelines that miss implicitly referenced passages, and insufficient adaptation to longer narrative inputs. In this thesis, we address all four through new resources, evaluation frameworks, and methods specifically designed for book-scale Narrative Understanding. The first challenge, reliable entity tracking, naturally leads to Coreference Resolution at book scale, a problem studied almost exclusively on short documents. We introduce BookCoref, the first book-scale coreference resource, produced through a semi-automatic pipeline combining entity linking, LLM-based cluster validation, and local coreference expansion. Its silver partition spans over 10.8 million tokens across 50 books, with chains averaging more than 73,000 tokens in pairwise distance; a manually annotated test set provides expert-curated annotations for three complete novels. State-of-the-art systems drop below 47 CoNLL-F1 on full novels (versus approximately 80 on much shorter benchmarks), and training on the silver data raises the best system to 67.0, well short of the 82.2 achieved on independent windows. Equipped with these models, we investigate the challenge of coreference-aware retrieval-augmented generation (RAG) for Narrative QA, where dense retrievers miss passages containing implicit entity references. 
Two integration strategies reveal a clear asymmetry: index-time augmentation degrades performance by propagating coreference errors globally, whereas query-time filtering yields consistent, if modest, improvements by confining errors to passages already retrieved for a single query. The second pair of contributions addresses evaluation methodology and model capacity. NarrativeQA, the most widely used Narrative QA benchmark, proves unreliable: 22% of its test documents are mismatched or non-narrative, and noisy annotations degrade n-gram-based metrics. We introduce LiteraryQA, a curated benchmark of 138 documents and 3,785 question-answer pairs derived via manual document filtering and LLM-based QA correction, which also serves as the evaluation framework for the RAG experiments above. A meta-evaluation on 7,000 model predictions shows near-zero system-level correlation between n-gram metrics and human preferences, while an LLM-as-a-judge approach augmented with book summaries reaches 0.69 Kendall's Tau. We then apply this evaluation framework when exploring context length adaptation and its application to less-resourced languages: through continual pretraining, we extend a bilingual English-Italian LLM's context window from 4K to 16K tokens using a literary-enriched data mixture, and introduce INDAQA, the first Italian Narrative QA benchmark (365 books, 13,757 question-answer pairs). Continual pretraining of our model achieves substantial gains over simple post-training adaptations on both languages. We also find that our method provides a clear cross-lingual transfer effect and robust extrapolation to twice the training length, while metric divergences on Italian further reinforce the LiteraryQA findings. This thesis addresses key challenges of narrative text, from long-distance coreference to evidence scattered across chapters, calling for resources and protocols designed specifically for book-scale inputs.
BookCoref lays the groundwork for entity tracking and enables the coreference-aware retrieval experiments; LiteraryQA establishes the evaluation standard shared by both the retrieval and adaptation studies; and INDAQA extends it to Italian, enabling the first cross-lingual study of narrative comprehension. These resources and findings form a concrete foundation for future research in automatic Narrative Understanding.

| File | Size | Format |
|---|---|---|
| Tesi_dottorato_Bonomo.pdf (open access; license: Creative Commons) | 1.86 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/368870
URN:NBN:IT:UNIROMA1-368870