This thesis introduces and studies the problem of 1-dimensional bounded clustering: for any fixed p ≥ 1, given reals x1, x2…, xn, and integers k1, k2.., km, determine the partition (A1, A2… Am) of {1, 2, ..., n} with |A1| = k1, |A2| = k2 , … , |Am| = km which minimizes Σk Σi Ak |xi - μk |p where μk is the p-centroid of Ak First, we prove that the optimum partition is contiguous (String Property), that is if i,j  Ak, and xi < xs < xj, then s  Ak . As a consequence, we determine an efficient algorithm for bi-clustering (if p is an integer); however, we show that the general problem is NP-complete, while a relaxed version of it admits a polynomial-time algorithm. When p is not an integer, we prove that the problem of deciding if the centroid μ is less than a given integer is in the Counting Hierarchy CH. As an application, the relaxed clustering algorithm used as a step for solving a problem in the field of Bioinformatics: the Localization of promoter regions in genomic sequences. The results are compared with those obtained through another methodology (MADAP).

PROBLEMI DI CLUSTERING CON VINCOLI: ALGORITMI E COMPLESSITÀ

SACCA', FRANCESCO
2010

Abstract

This thesis introduces and studies the problem of 1-dimensional bounded clustering: for any fixed p ≥ 1, given reals x1, x2…, xn, and integers k1, k2.., km, determine the partition (A1, A2… Am) of {1, 2, ..., n} with |A1| = k1, |A2| = k2 , … , |Am| = km which minimizes Σk Σi Ak |xi - μk |p where μk is the p-centroid of Ak First, we prove that the optimum partition is contiguous (String Property), that is if i,j  Ak, and xi < xs < xj, then s  Ak . As a consequence, we determine an efficient algorithm for bi-clustering (if p is an integer); however, we show that the general problem is NP-complete, while a relaxed version of it admits a polynomial-time algorithm. When p is not an integer, we prove that the problem of deciding if the centroid μ is less than a given integer is in the Counting Hierarchy CH. As an application, the relaxed clustering algorithm used as a step for solving a problem in the field of Bioinformatics: the Localization of promoter regions in genomic sequences. The results are compared with those obtained through another methodology (MADAP).
17-dic-2010
Italiano
clustering ; NP-complete
BERTONI, ALBERTO
Università degli Studi di Milano
File in questo prodotto:
File Dimensione Formato  
phd_unimi_r07040.pdf

accesso aperto

Dimensione 631.54 kB
Formato Adobe PDF
631.54 kB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/81202
Il codice NBN di questa tesi è URN:NBN:IT:UNIMI-81202