Forensic DNA phenotyping (FDP) enables the prediction of specific phenotypic traits of an individual, such as physical appearance, biogeographical ancestry, and age, from minimal amounts of DNA obtained from biological samples. FDP is particularly useful in identifying unknown perpetrators of crimes, especially when standard forensic profiling does not provide sufficient information due to the absence of known suspects or profiles in national DNA databases. A validated forensic phenotyping tool is the HIrisPlex-S system, designed for the simultaneous prediction of eye, hair, and skin colour. Based on the analysis of 41 genetic polymorphisms associated with pigmentation, this system allows for the estimation of individual probabilities for three categories of eye colour, four categories of hair colour, and five categories of skin colour, using exclusively genotypic data. Originally developed to support specific criminal investigations, this tool has also been extensively used in recent years to predict the phenotype of ancient human skeletal remains. To date, phenotypic prediction based on ancient, degraded, and low-coverage data has been carried out by assuming knowledge of the allelic and genotypic states of the samples at the 41 loci of interest. However, it is well known that directly calling genotypic variants from such data presents numerous challenges due to genetic material fragmentation, contamination risks, and degradation. Over time, degradation causes the DNA to break into very small fragments, making it difficult to obtain complete and accurate sequences. This results in low sequencing depth, which drastically affects the ability to reliably identify genotypes, thereby contributing to errors in phenotypic inference. In the first part of this thesis, the robustness of the HIrisPlex-S system in phenotypic inference applied to ancient and low-coverage data was evaluated by testing it at various coverage levels. 
The required coverage level for ensuring robust inference was determined, and guidelines were established for when to apply the standard HIrisPlex-S protocol versus when to use alternative methods that account for uncertainty in genotypic calls. The evaluation was conducted by analysing the results of phenotypic predictions for eye, hair, and skin colour obtained through three prediction models: 1) the standard HIrisPlex-S protocol, based on direct variant calling; 2) a model specifically developed in this work, which integrates the classic system with genotype likelihoods to account for the uncertainty associated with low-coverage data in genotype calling; 3) imputation, used to handle potentially missing data in low-coverage genomes. To this end, three high-coverage samples available in the literature were analysed: one from Palaeolithic Siberia, one from Mesolithic Sweden, and one from Bronze Age Germany. Each genome was downsampled to a range of lower coverage levels, and eye, hair, and skin colour were predicted at each level with the three models. The predictions for each approach and coverage level were compared with the "true phenotype" inferred from the original high-coverage genome. The frequency of correct predictions by each method at each coverage level was then evaluated to determine the limits of each approach and the most suitable one for phenotypic prediction from ancient data, thereby minimising errors in the estimations. The conclusions of this first part indicate that, regardless of the historical period of the analysed sample, it is advisable to apply the standard HIrisPlex-S protocol if a coverage of at least 8x is reached at each of the 41 loci of interest. For coverages below 8x, the method developed and tested in this study, which uses genotype likelihoods for genotype estimation, is recommended instead.
The second part of this thesis project involved applying the protocol to a dataset of 348 Eurasian individuals ranging from the Upper Palaeolithic to the Iron Age, with the goal of examining the variation in eye, hair, and skin colour over the past 45,000 years. The resulting picture reveals a complex evolution of these traits: while dark phenotypes predominated for much of the studied period, light phenotypes began to appear from the Mesolithic onward in regions requiring environmental adaptation to high latitudes. Subsequently, these light phenotypes further spread in later periods due to a complex interaction between environmental adaptation and demographic events.
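As an illustration of the first part, the following is a minimal Python sketch of two ideas mentioned above: a genotype likelihood computed from read counts at a single biallelic locus, and the 8x rule for choosing between the standard protocol and the likelihood-based method. It is not the model developed in the thesis. The binomial read model, the 1% error rate, and all function names are assumptions made only for this example.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Return P(reads | genotype) for 0, 1 and 2 copies of the alternate allele.

    Assumed model (not taken from the thesis): each read is drawn
    independently from one of the two chromosomes and is misread
    with probability `error_rate`.
    """
    total_reads = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        alt_fraction = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihoods.append(
            comb(total_reads, alt_reads)
            * p_alt ** alt_reads
            * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def normalise(likelihoods):
    """Scale likelihoods to sum to 1 (posterior under a flat prior)."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def choose_protocol(depth_per_locus, min_depth=8):
    """Apply the 8x rule: standard protocol only if every locus reaches min_depth."""
    if min(depth_per_locus) >= min_depth:
        return "standard"
    return "genotype_likelihoods"


if __name__ == "__main__":
    # One reference read and one alternate read: the heterozygote is favoured,
    # but the two homozygous genotypes keep a non-zero probability.
    print(normalise(genotype_likelihoods(ref_reads=1, alt_reads=1)))
    print(choose_protocol([8, 12, 30]))  # standard
    print(choose_protocol([8, 3, 30]))   # genotype_likelihoods
```

With no reads at a locus, the three genotypes come out equally likely. This is the kind of uncertainty that a hard genotype call discards and that a likelihood-based approach can carry forward into the phenotype prediction.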
Robust inference of phenotypic traits from low-coverage ancient genomes
Silvia, Perretti
2024
File | Size | Format |
---|---|---|
SPerretti_Thesis_10-09-24.pdf (embargoed until 01/10/2025) | 10.79 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/192930
URN:NBN:IT:UNIPR-192930