Automatic Document Classification: combining image and text information to enhance quality and performances

TOMAZZOLI, Claudio
2014

Abstract

Automatic document classification extracts information through automatic analysis of document content. It is an active research field of growing importance, owing to the large volume of electronic documents produced almost daily and made available worldwide by widespread technologies. Several application areas benefit from automatic document classification, such as document archiving, invoice processing in business environments, press releases, and search engines. Current tools classify or "tag" either text or images so that they can be processed; by linking image and text-based content, a technology can improve fundamental document management tasks, such as retrieving information from a database or automatically routing documents, to achieve more complete searches and streamlined business processes. In this work, we first investigate a possible model for the conceptual space of the joint information from the text and the images that form complex documents. We present a formal definition of the concepts of pertinence and relevance that apply to the document types we call "multimodal", and we develop a computable algorithm. We then present the test dataset that will be used to validate and improve the model. Finally, we explain the experiments performed and the related results.
English
Classification; information retrieval; semantic indexing
120
Files in this item:
File: Tesi_PDH_Tomazzoli.pdf
Access: only from BNCF and BNCR
Size: 10.18 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/115711
The NBN code of this thesis is URN:NBN:IT:UNIVR-115711