Enhancing variant calling and interpretation pipelines using data-driven in-silico simulation and artificial-intelligence
LONGHIN, FRANCESCA
2025
Abstract
DNA is a complex molecule that stores the genetic information needed for the development and functioning of every living organism. Each organism has its own unique DNA sequence, known as the genome, which differs from that of every other individual because of genetic variants. These differences in the genome sequence can be inherited from parental DNA or arise during an individual's lifetime. Some variants are responsible for differences in appearance among individuals of the same species, while others are closely associated with their health status. Since the importance of identifying variants to monitor disease onset or predisposition in an individual was recognised, techniques for studying DNA have improved significantly. Increasingly sophisticated sequencing techniques are employed to determine the sequence of DNA extracted from biological tissues, while bioinformatics pipelines are used both for variant calling, to identify the exact position of variants along the genome, and for variant interpretation, to assess their impact on an individual's health. With the rapidly decreasing cost of sequencing technologies, gold-standard samples, for which the positions of variants in the genome are known, have become available. These samples have since been used to optimise variant calling pipelines, in some cases achieving identification performance good enough for those pipelines to be used in clinical practice. As for variant interpretation pipelines, many tools have been deployed that automatically collect relevant clinical information about variants. Based on this information, it is possible to infer a variant's impact on health and provide the assessment as a suggestion for clinicians. Despite these recent advancements, critical challenges still limit the effective use of bioinformatics pipelines for identifying variants and determining their clinical significance.
On the one hand, variant calling remains difficult because no comprehensive gold-standard sample dataset accounts for the extensive variability in both the biological and technical characteristics of samples. As a result, optimising variant calling for each specific sample scenario is challenging, leading to suboptimal discovery performance. On the other hand, variant interpretation requires collecting clinically relevant information about variants from databases and the literature. However, information is extracted from the literature manually, so the process is highly time-consuming, non-scalable, and error-prone. This limitation affects the ability to correctly infer the effects of variants on health, as the variant information gathered from the literature is often incomplete or biased. Currently, in-silico simulation tools are employed to address the shortage of gold-standard samples. However, these tools cannot control all the characteristics of real samples and, if not configured properly, fail to represent samples realistically. To address the challenges of manual literature curation, artificial intelligence tools have been developed to analyse textual data. While these tools can identify relevant terms and filter scientific papers, they cannot fully automate the complex literature screening process needed to extract all clinically relevant information about variants.

File | Size | Format |
---|---|---|
tesi_definitiva_Francesca_Longhin.pdf (under embargo until 19/03/2028) | 34.07 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/202609
URN:NBN:IT:UNIPD-202609