Genome Assembly With 2nd Generation Sequencing Technologies: Definition of Best Experimental Design In Relation To Genomic Features
Minio, Andrea
2015
Abstract
The advent of second-generation sequencing technologies has profoundly changed the process of generating data from DNA molecules, making it cheaper and faster. The multiplicity of available technologies and assembly tools, each with different strengths and weaknesses, makes choosing a proper experimental set-up for the genome of a new species a difficult task. In this work, multiple strategies were adopted to reconstruct the genomes of different species. This made it possible to profile the practices that best optimize costs and results according to the genetic characteristics of the subject of study. For bacterial organisms, the short genome length and low complexity of the underlying sequence make it possible to obtain a high-quality draft even with a single standard Illumina library, regardless of the assembly procedure adopted. Fungal genomes are longer and more complex than those of prokaryotic organisms. Standard Illumina libraries are not sufficient to overcome the fragmentation of the draft sequence, and improving the computational assembly pipeline has only limited power to improve the results. Additional mate-pair sequencing data or PacBio long-read sequencing are adequate alternatives, as both lead to high-quality assemblies at similar expense. Long plant genomes show the highest degree of complexity, with high repetitive content and a high heterozygosity rate. Standard Illumina libraries are not sufficient to overcome the fragmentation problem because of their limited insert size. Mate-pair sequences greatly improve the results, with longer libraries spanning longer repeats and shorter ones improving gap reconstruction. PacBio sequencing proved to be an effective solution to this problem, but its high sequencing cost makes it prohibitive to adopt this technology alone for reconstruction.
Hybrid assembly is a possible alternative, combining a high coverage of short but cheap and reliable Illumina reads with a low coverage of longer but more error-prone PacBio reads. This solution has lower sequencing costs, but the quality of the results is limited by the coverage of long reads; moreover, the computational resources needed for error correction and assembly increase massively. When approaching the reconstruction of a genome, therefore, multiple solutions are available, but it is the available knowledge of its characteristics that indicates the best combination of assembly tools and sequencing technologies to optimise both expense and quality of the results.

| File | Size | Format |
|---|---|---|
| Tesi_CdR.pdf (access only from BNCF and BNCR) | 7.46 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/181195
URN:NBN:IT:UNIVR-181195