High-performance algorithms and frameworks for graph-based pangenomics: methods, applications, and technology transfer

Coggi, Mirko

The field of computational genomics is undergoing a profound transformation driven by the emergence of pangenome graphs as a more expressive alternative to linear references. These structures hold the potential to mitigate reference bias and improve the representation of complex genomic variation. However, their adoption remains hindered by significant computational challenges, a lack of standardized evaluation frameworks, and limited interoperability with downstream analytics pipelines. This dissertation contributes a comprehensive and modular framework for sequence-to-graph alignment, designed to adapt dynamically to the topological properties of genome graphs. The framework is organized into three stages: a preprocessing phase that partitions and refines graph topology, introducing the first method for compact and accurate representation of copy number variations; a processing layer that integrates specialized aligners and includes a novel implementation of the Graph Wavefront Alignment algorithm with traceback support, as well as patented techniques for GPU-accelerated alignment on cyclic structures; and a postprocessing module that addresses the limitations of current variant representation formats through a GPU-accelerated library for transforming VCF data into structured, analysis-ready formats. In parallel, this work introduces the first systematic benchmarking methodology for sequence-to-graph aligners, combining qualitative KPIs and quantitative assessments across real and synthetic datasets. This initiative lays the foundation for reproducible and extensible evaluation practices within the field. Recognizing the translational potential of these contributions, the final part of the thesis proposes and validates a methodology for supporting deep tech technology transfer in academia.
Developed in collaboration with NECSTLab (Politecnico di Milano) and tested through the GenoGra case study, this methodology provides structured guidance for transforming scientific results into scalable, research-driven startups. Through the integration of algorithmic innovation, high-performance computing strategies, and entrepreneurial methodology, this thesis advances both the technical foundations and the translational pathways for next-generation genomic analysis.

The field of computational genomics is undergoing a profound transformation thanks to the introduction of pangenome graphs, which offer a more expressive representation than traditional linear reference systems. These structures make it possible to reduce reference bias and to describe accurately even the most complex genomic variations. However, their large-scale adoption is limited by significant computational challenges, by the absence of shared standards for evaluating solutions, and by poor interoperability with classic bioinformatics pipelines. This thesis proposes a modular framework for aligning sequences to genome graphs, designed to adapt dynamically to the topological properties of the graph. The framework is organized into three main phases: a preprocessing phase that divides the graph into homogeneous subsections and optimizes its topological structure, also introducing a new method for representing copy number variations (CNVs) in a compact and interpretable way; a processing phase that integrates specialized aligners, including a new implementation of the Graph Wavefront Alignment algorithm with traceback support, and patented techniques for GPU-accelerated alignment even in the presence of cyclic structures; and finally, a postprocessing phase that addresses the limitations of the VCF format through a high-performance library for creating structured formats optimized for analysis. In parallel, the thesis introduces the first systematic methodology for evaluating sequence-to-graph aligners, based on qualitative indicators and quantitative benchmarks on real and synthetic data. This approach lays the groundwork for reproducible and comparable evaluation practices in pangenomics.
Finally, the thesis presents and validates a methodology to support technology transfer in the academic context, developed in collaboration with NECSTLab (Politecnico di Milano) and tested through the GenoGra case study. This methodology provides a structured path for transforming scientific results into scalable and sustainable deep tech startups. Through the integration of algorithmic innovation, high-performance computing strategies, and an operational model for valorizing research results, this thesis helps broaden both the technical foundations and the transfer pathways toward next-generation applied genomics.