Bag of Words approaches for Bioinformatics

LOVATO, PIETRO
2015

Abstract

In recent years, several Pattern Recognition problems have been successfully tackled with approaches based on the "bag of words" representation. This representation is particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic constituent elements called words. Assuming that all possible words are stored in a dictionary, the bag of words vector for a particular object is obtained by counting the number of times each element of the dictionary occurs in the object. Although widely applied in several scientific fields (with increasingly sophisticated approaches), techniques based on this representation have not been fully exploited in Bioinformatics, owing to the methodological and application challenges that this particular setting poses. Yet the bag of words paradigm seems particularly well suited to this context: on the one hand, many biological mechanisms inherently involve a counting process; on the other hand, in many Bioinformatics scenarios the objects of the problem are either unstructured or have unknown structure, so that one of the main drawbacks of the bag of words representation (it destroys the object's structure) no longer applies. This makes it possible to derive highly effective and interpretable solutions, a pressing need in current Bioinformatics research. This thesis is set in the scenario described above and promotes the use of the bag of words paradigm to address problems in Bioinformatics. We investigated the various issues and aspects involved in creating bag of words models and representations for some specific Bioinformatics problems, and proposed original solutions and approaches based on this representation. In particular, three scenarios are analyzed in this thesis: gene expression analysis, the modeling of HIV infection, and protein remote homology detection.
For each scenario, the motivations, advantages, and challenges of bag of words representations are addressed, and possible solutions are proposed. The merits of bag of words representations and models are demonstrated in extensive experimental evaluations, using widely used benchmarks as well as datasets obtained through direct interaction with biological and clinical laboratories and research groups. With this thesis, we provide evidence that the bag of words representation can have a significant impact on the Bioinformatics and Computational Biology communities.
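
The counting step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the toy dictionary, and the choice of overlapping 2-letter substrings of a sequence as "words" are all assumptions made purely for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Split a sequence into overlapping substrings of length k.

    Using k-mers as "words" is an illustrative assumption, not the
    thesis's stated choice.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(words, dictionary):
    """Count how often each dictionary entry occurs among `words`.

    The result follows the order of `dictionary`; words that are not in
    the dictionary are ignored, and entries that never occur get 0.
    """
    counts = Counter(words)
    return [counts[entry] for entry in dictionary]


# Hypothetical toy dictionary and object (a short nucleotide string).
dictionary = ["AC", "CG", "GT", "TA"]
vector = bag_of_words(kmers("ACGTAC", 2), dictionary)
print(vector)  # [2, 1, 1, 1]
```

The order in which the words occur is discarded and only their frequencies remain, which is the loss of structure that the abstract names as the main drawback of the representation.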
English
bag of words; bioinformatics; topic model
168
Files in this record:
File: tesiP.pdf
Access: only from BNCF and BNCR
Size: 8.77 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/112269
The NBN code of this thesis is URN:NBN:IT:UNIVR-112269