Computational characterization of apple. Protein function, disorder and variability.

Necci, Marco

Domesticated apple is the most important temperate fruit crop and has been cultivated in Asia and Europe since antiquity. As a consequence of its self-incompatibility, wide variability in phenotypic traits is observed within a single population of domesticated apple. This retained variability is thought to underlie the great diversity in yield among cultivars. However, although the genome of domesticated apple has been available since 2012, its annotation is lacking in the standard resources that gather genome and protein data, despite the crop's economic importance. Ensembl, the repository for nucleic acid sequences, does not include the genome of domesticated apple or of any other species of the genus Malus, and does not even offer a way to encode variability among cultivars/accessions. UniProt annotates only around 10% of domesticated apple genes. I argue that the lack of annotation originates from a recent phenomenon known as Big Data, which in this case is produced by NGS technologies. Because the number of sequenced genomes keeps growing, fast and accurate annotation methods are required to keep pace with the sequencing of new genomes. The resulting annotations must then be stored in biological databases (DBs), where end users can easily access and retrieve the information of interest. However, plant-specific data are less integrated than human data, and as a result plant-related information is fragmented across many different plant-specific DBs. This is why domesticated apple is absent or mostly absent from Ensembl and UniProt. To fill the gap left by existing resources, we developed PhytoTypeDB (http://phytotypedb.bio.unipd.it), a database containing the inter-cultivar variability of functionally annotated plant proteins. PhytoTypeDB is a user-friendly resource developed to help plant scientists retrieve up-to-date information about gene function and variability.
To generate the PhytoTypeDB annotation, the concepts of gene family and domain were extensively exploited. These concepts rest on the premise that conservation of protein sequence drives conservation of function. While this idea undoubtedly holds for globular proteins, it is not enough to cover the whole protein function space. Furthermore, limiting the annotation to conserved domains automatically biases annotation coverage towards highly conserved regions. Many proteins, however, lack a stable three-dimensional structure and are instead intrinsically disordered under native conditions. These proteins play critical roles in the cell and, since they evolve quickly, host the vast majority of variability. To investigate this class of proteins we transferred from the U.S.A. to Italy the central resource of high-quality, manually curated Intrinsic Disorder (ID) annotation, the Database Of Protein Disorder (DisProt). After re-annotating all legacy entries and adding a further two hundred annotations as a community effort of the European COST Action NGP-Net, we compared DisProt annotation with other manually curated ID resources. Furthermore, we used manually curated annotations to evaluate existing automatic detection methods, laying the foundations for a future periodic assessment of ID prediction similar to CASP, called Critical Assessment of Intrinsic protein Disorder (CAID), which we are currently running. In the field of ID prediction we published a novel method, MobiDB-lite, which was included as the first ID predictor in the well-known domain annotation resource InterPro. MobiDB-lite was used to predict ID on the largest possible scale: all protein sequences in the UniProt sequence space. These annotations were collected in the newest version of MobiDB, along with annotations from many different sources.
Finally, I delved into the analysis of the highly nuanced array of phenomena that fall under the name of ID, trying to further classify them and to extract patterns from a large-scale dataset.

Domesticated apple is the most widespread fruit crop in temperate zones and has been cultivated in Asia and Europe since antiquity. Wide phenotypic variability is observed within a single population. This variability is thought to underlie the great diversity in apple fruit yield observed among cultivars. However, although the apple genome has been available since 2012, its annotation is lacking in the standard resources that collect genome and protein data, despite its economic importance. Ensembl does not include the genome of apple or of any other species of the genus Malus, and does not even offer a way to encode information on variability among cultivars/ecotypes. UniProt annotates only around 10% of apple genes. I argue that the lack of annotation originates from a phenomenon known as Big Data, which in this case is produced by NGS technologies. Because the number of sequenced genomes keeps growing, fast and accurate methods are needed to keep pace with the sequencing of new genomes and to produce annotations that are stored in biological databases (DBs), where they can be easily retrieved. However, plant data are poorly integrated, and this has led to plant-related information being fragmented across several specific DBs. This is why domesticated apple is absent or mostly absent from Ensembl and UniProt. To fill the gap left by existing resources, we developed PhytoTypeDB, a database containing the inter-cultivar variability of functionally annotated plant proteins. PhytoTypeDB is an easy-to-use resource developed to help researchers retrieve up-to-date information on gene function and variability. To generate the PhytoTypeDB annotation, the concepts of gene family and domain were extensively exploited.
These concepts rest on the premise that conservation of protein sequence drives conservation of function. Although this idea undoubtedly holds for globular proteins, it is not sufficient to cover the whole protein function space. Moreover, limiting the annotation to conserved domains automatically biases annotation coverage towards highly conserved regions. Many proteins, however, lack a stable three-dimensional structure and are instead intrinsically disordered (ID) under native conditions. These proteins play critical roles in the cell and, since they evolve rapidly, host most of the variability. To investigate this class of proteins we transferred from the United States to Italy the central resource of manually curated disorder annotations, DisProt. After re-annotating the existing entries and adding two hundred new annotations as a European community effort, we compared DisProt with other manually curated ID resources. In addition, we used manual annotations to evaluate existing predictors, laying the foundations for a future periodic assessment of ID prediction similar to CASP, called Critical Assessment of Intrinsic protein Disorder (CAID), which we are currently running. In the field of ID prediction we published a new method, MobiDB-lite, which was included as the first ID predictor in the well-known domain annotation resource InterPro. MobiDB-lite was used to predict ID on the largest possible scale: the entire space of protein sequences contained in UniProt. These annotations were collected in the most recent version of MobiDB, together with annotations from many other sources. Finally, I explored in depth the range of phenomena that fall under the name of ID, trying to further classify them and to extract patterns from a large-scale dataset.