SAPIENT: Semantic and Automatic Processing of Information about Environment

AMMATURO, ELEONORA
2025

Abstract

The aim of the project, carried out in collaboration with the company Expert.ai, is to implement a new system called Sapient, which stands for ‘Semantic and Automatic Processing of Information about Environment’. The project grew out of the need to process a large number of complex documents, i.e. documents that convey information through multiple kinds of objects such as graphs and tables. For this purpose, it is necessary to segment complex documents into homogeneous areas, i.e. to identify the different parts that make up a document together with their relative locations. The system is part of a broader one that recognises the characters of a document, a task known as Optical Character Recognition (OCR). Analysing the structure of a document by classifying it into components such as title, figures, tables and main text is the main objective of the project; in the literature this topic is known as Document Layout Analysis (DLA). The project belongs to the area of computer vision, and specifically pattern recognition, inasmuch as documents are generally in PDF format and are thus closer to an image than to a text document. The objective is not only to classify and locate the components of a document, but also to segment each component so that it can be extracted in an orderly manner. Semantic segmentation therefore appears to be the most suitable approach: the problem is not just one of object detection, i.e. the identification and localisation of the document components within an image, but also requires classifying the image pixel by pixel. The classification pipeline is initially divided into two consecutive steps: layout analysis and text-only analysis.
For the first step, an end-to-end Convolutional Neural Network (CNN) implementing dilated convolution is used, while for the second step an end-to-end multiscale CNN is used; a heuristic based on mathematical morphology is also defined for the same purpose. Finally, all classes are segmented simultaneously by means of another end-to-end CNN model. The final classification segments both the text and the non-text parts of a document. The layout analysis separates the text as a whole from the non-text components, namely tables and images; the text-only analysis breaks the text down into title, authors, abstract, paragraphs and their titles, header, footer, notes, captions and lists. The same classes are used in the simultaneous segmentation of text and non-text components. A comparison with the extensive literature available shows how this system offers an alternative overall model for DLA.
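The dilated convolution mentioned for the first step can be illustrated with a minimal sketch. It is not the network used in the thesis, whose architecture is not described in this record; the function names `dilated_conv2d` and `receptive_field` are hypothetical and chosen only for this example. The sketch shows how inserting gaps between kernel taps enlarges the area each output pixel sees without adding weights, which is useful when every pixel of a page image has to be classified with some surrounding context.

```python
from typing import List

Matrix = List[List[float]]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Extent (in pixels) covered by one dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv2d(image: Matrix, kernel: Matrix, dilation: int = 1) -> Matrix:
    """Valid-mode 2D cross-correlation with a dilated kernel (illustrative only)."""
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    k_h, k_w = len(kernel), len(kernel[0])
    # Effective kernel extent once gaps are inserted between the taps.
    span_h = receptive_field(k_h, dilation)
    span_w = receptive_field(k_w, dilation)
    out_h = len(image) - span_h + 1
    out_w = len(image[0]) - span_w + 1
    if out_h < 1 or out_w < 1:
        raise ValueError("image is smaller than the dilated kernel")
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    # Taps are spaced `dilation` pixels apart.
                    total += kernel[i][j] * image[y + i * dilation][x + j * dilation]
            output[y][x] = total
    return output


# Toy input: a 7x7 "page" of ones and a 3x3 averaging-style kernel of ones.
page = [[1.0] * 7 for _ in range(7)]
kernel = [[1.0] * 3 for _ in range(3)]
result = dilated_conv2d(page, kernel, dilation=2)
```

With `dilation=2` the 3x3 kernel still has nine weights but spans a 5x5 window, so the 7x7 input yields a 3x3 output.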
25-Sep-2025
English
Vitulano, Domenico
GIACOMELLI, Lorenzo
Università degli Studi di Roma "La Sapienza"
97
Files in this item:
Tesi_dottorato_Ammaturo.pdf
Access: open access
Licence: All rights reserved
Size: 10.45 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/312569
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-312569