Tandem repeats are repeated sequences that occur adjacent to each other in the human genome. Due to their prevalence and their association with a number of genetic diseases, there is rising interest in developing tools for tandem repeat profiling. Genome-wide discovery approaches are needed to fully understand their roles in health and disease, but resolving tandem repeat variation accurately remains very challenging. Traditional mapping-based and assembly-based approaches using short-read data are severely limited in the size and type of tandem repeats they can resolve. Recent third-generation sequencing technologies provide the long reads required to broaden the scope of detectable tandem repeats, but they exhibit substantially higher sequencing error rates that complicate repeat resolution. To overcome the limitations of prior methods, we developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in long-read sequencing data de novo and resolve their motif and multiplicity in a haplotype-specific manner. The tool further includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. Tested on synthetic data harboring tandem repeat contractions and expansions, TRiCoLOR performs well and shows improved precision and recall compared to alternative tools. On real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting that it is suitable for identifying tandem repeat variation in personal genomes. Compared to assembly-based approaches for structural variant detection, TRiCoLOR is able to resolve tandem repeats in difficult-to-assemble regions that are prone to mis-assemblies or incorrect repeat assignments. TRiCoLOR is open-source and implemented in Python 3, with supporting C++ code and Bash scripts.
The tool is released through GitHub (https://github.com/davidebolo1993/TRiCoLOR) and as a Docker image (https://hub.docker.com/r/davidebolo1993/tricolor), with accompanying documentation.
Unraveling tandem repeat variation in personal genomes with long reads
BOLOGNINI, DAVIDE
2021
File: phd_unisi_076260.pdf (open access, 7.35 MB, Adobe PDF)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/165192
URN:NBN:IT:UNISI-165192