Automatic Document Classification: combining image and text information to enhance quality and performances

Tomazzoli, Claudio

Automatic document classification extracts information through automated analysis of document content. It is an active research field of growing importance, owing to the large volume of electronic documents produced almost daily and made available worldwide by widespread technologies. Several application areas benefit from automatic document classification, such as document archiving, invoice processing in business environments, press releases, and search engines. Current tools classify or "tag" either text or images so that they can be processed; by linking image and text-based content, a technology can improve fundamental document management tasks, such as retrieving information from a database or automatically routing documents, to achieve more complete searches and streamlined business processes. In this work, we first investigate a possible model for the conceptual space of the joint information from the text and the images that form complex documents. We present a formal definition of the concepts of pertinence and relevance that apply to the document types we name "multimodal", and we develop a computable algorithm. We then present the test dataset that will be used to validate and improve the model. Finally, we describe the experiments performed and the related results.

2014

Degree programme: Computer Science
Publication date: 2014
Language: English
Keywords: classification; information retrieval; semantic indexing
Supervisor, advisor, or tutor: Cristani Matteo
Number of pages: 120
Collection: Università degli Studi di Verona

Files in this item:

File	Access	License	Size	Format
Tesi_PDH_Tomazzoli.pdf	BNCF and BNCR only	All rights reserved	10.18 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/115711

The NBN code of this thesis is URN:NBN:IT:UNIVR-115711