The Long Document Representation and Processing Problem in the Era of the Transformer-Based Large Language Models

ALVA PRINCIPE, RENZO ARTURO

Transformer-based models have become widely popular because they can be pre-trained on vast corpora, driving substantial advances across numerous Natural Language Processing (NLP) tasks. They include encoder-only models such as BERT, often called Pre-Trained Language Models (PLMs), and decoder-only models such as GPT, known as Generative Large Language Models (GLLMs). However, these models struggle with long documents: the computational cost of self-attention grows quadratically with input length, which limits the number of tokens they can process effectively. This constraint poses difficulties for tasks such as document classification and information extraction, where entire documents can exceed typical input length limits. Transformer models have also been applied to real-world problems in industries including Legal, Finance, and Real Estate. These applications have opened up new opportunities while highlighting challenges such as processing long documents and handling specialized domain terms and entities. This thesis examines the industry-driven challenges of processing and extracting valuable information from long documents, with a focus on the Italian real estate and financial sectors. Responding to the high demand for automated processing of extensive textual data, this research addresses two core tasks: Automatic Long Document Classification (ALDC) and Long Document Information Extraction. The first key contribution of this work is a survey of approaches for ALDC, which introduces a taxonomy that categorizes methods into Efficient Transformers, Decomposition-Recomposition, and Summarization-Based approaches. The survey also identifies critical limitations in current evaluation practices for ALDC, including issues with baseline selection, dataset choices, and the insufficiency of document length as a sole evaluation criterion.
Our second contribution to the ALDC challenge is the Latent Concept Frequency-Inverse Document Frequency (LCF-IDF) model. This novel approach combines the full-document coverage of TF-IDF with the semantic awareness of PLMs to represent long documents, and it is evaluated on the classification task. Finally, our third contribution focuses on the Long Document Information Extraction task. We perform a comprehensive comparative analysis of rule-based systems and GLLMs within the real estate sector. This evaluation examines performance, cost-effectiveness, development effort, and maintainability, offering valuable insights for industrial applications that manage long, domain-specific documents.
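
The abstract does not spell out how LCF-IDF is computed, so the following is only a rough illustration of the general idea of replacing terms with latent concepts in a TF-IDF-style weighting. It assumes that every chunk of a document has already been mapped to an integer concept id (for instance by clustering PLM embeddings, which is an assumption made here and not a description of the thesis method). The function name `concept_idf_vectors` and the smoothed IDF formula are likewise illustrative choices.

```python
import math
from collections import Counter


def concept_idf_vectors(docs_concepts, n_concepts):
    """Build one weighted vector per document from per-chunk concept ids.

    docs_concepts: list of documents, each a list of integer concept ids
                   (one id per chunk; how ids are obtained is assumed, not shown).
    n_concepts:    total number of latent concepts (vector dimensionality).
    """
    n_docs = len(docs_concepts)

    # Document frequency: in how many documents each concept appears.
    df = Counter()
    for concepts in docs_concepts:
        df.update(set(concepts))

    # Smoothed inverse document frequency (an illustrative choice).
    idf = [math.log((1 + n_docs) / (1 + df[c])) + 1.0 for c in range(n_concepts)]

    vectors = []
    for concepts in docs_concepts:
        counts = Counter(concepts)
        total = len(concepts) or 1  # avoid division by zero for empty documents
        # Relative concept frequency, reweighted by IDF.
        vectors.append([counts[c] / total * idf[c] for c in range(n_concepts)])
    return vectors


# Two toy documents: the first has three chunks, the second has two.
vectors = concept_idf_vectors([[0, 0, 1], [1, 2]], n_concepts=3)
```

Because every chunk contributes to the counts, the resulting vector covers the whole document regardless of its length, which is the property the abstract attributes to TF-IDF.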
