PROBLEMI DI CLUSTERING CON VINCOLI: ALGORITMI E COMPLESSITÀ

Sacca', Francesco

This thesis introduces and studies the problem of 1-dimensional bounded clustering: for any fixed p ≥ 1, given reals x1, x2…, xn, and integers k1, k2.., km, determine the partition (A1, A2… Am) of {1, 2, ..., n} with |A1| = k1, |A2| = k2 , … , |Am| = km which minimizes Σk Σi Ak |xi - μk |p where μk is the p-centroid of Ak First, we prove that the optimum partition is contiguous (String Property), that is if i,j  Ak, and xi < xs < xj, then s  Ak . As a consequence, we determine an efficient algorithm for bi-clustering (if p is an integer); however, we show that the general problem is NP-complete, while a relaxed version of it admits a polynomial-time algorithm. When p is not an integer, we prove that the problem of deciding if the centroid μ is less than a given integer is in the Counting Hierarchy CH. As an application, the relaxed clustering algorithm used as a step for solving a problem in the field of Bioinformatics: the Localization of promoter regions in genomic sequences. The results are compared with those obtained through another methodology (MADAP).

PROBLEMI DI CLUSTERING CON VINCOLI: ALGORITMI E COMPLESSITÀ

SACCA', FRANCESCO

2010

Abstract

This thesis introduces and studies the problem of 1-dimensional bounded clustering: for any fixed p ≥ 1, given reals x1, x2…, xn, and integers k1, k2.., km, determine the partition (A1, A2… Am) of {1, 2, ..., n} with |A1| = k1, |A2| = k2 , … , |Am| = km which minimizes Σk Σi Ak |xi - μk |p where μk is the p-centroid of Ak First, we prove that the optimum partition is contiguous (String Property), that is if i,j  Ak, and xi < xs < xj, then s  Ak . As a consequence, we determine an efficient algorithm for bi-clustering (if p is an integer); however, we show that the general problem is NP-complete, while a relaxed version of it admits a polynomial-time algorithm. When p is not an integer, we prove that the problem of deciding if the centroid μ is less than a given integer is in the Counting Hierarchy CH. As an application, the relaxed clustering algorithm used as a step for solving a problem in the field of Bioinformatics: the Localization of promoter regions in genomic sequences. The results are compared with those obtained through another methodology (MADAP).

Scheda breve

Scheda completa

Scheda completa (DC)

	Corso di studio
	
				MATEMATICA E STATISTICA PER LE SCIENZE COMPUTAZIONALI
			
	Data di pubblicazione
	
				17-dic-2010
			
	Lingua
	
				Italiano
			
	Parola chiave
	
				clustering ; NP-complete
			
	Relatore, Supervisor, Advisor o Tutor
	
				BERTONI, ALBERTO
			
	Nome Editore
	
				Università degli Studi di Milano
			
	Collezione di appartenenza
	
				Università degli Studi di Milano

File in questo prodotto:

File	Dimensione	Formato
phd_unimi_r07040.pdf accesso aperto Dimensione 631.54 kB Formato Adobe PDF Visualizza/Apri	631.54 kB	Adobe PDF	Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/81202

Il codice NBN di questa tesi è URN:NBN:IT:UNIMI-81202