Challenges in Data Science for Complex Systems

SOMAZZI, Andrea
2024

Abstract

Complex systems are ubiquitous in domains such as economics, biology, and human-engineered systems, and understanding their behavior poses significant challenges. Doing so hinges on the effective analysis of the data they produce, for which data science provides the methodologies. A notable challenge, however, is partial information which, if not handled carefully, can lead to biased interpretations or misconceptions.

This thesis is structured into five chapters. The first provides a broad introduction to the main topics of this work.

The second chapter studies opinion dynamics across social media platforms by defining an opinion dynamics model on a multiplex network whose layers correspond to the platforms. It underscores the importance of considering all layers when analyzing how users interact and shape their opinions. I find that empirical studies focusing on a single platform, neglecting interactions on other layers, can result in misleading conclusions. Moreover, in the richer picture given by this multi-platform model, a segregation of extreme from moderate users emerges.

The third chapter concerns the Generalized Maximum Entropy Principle (GMEP), a general, principled technique for treating partial information. I will introduce the uninformativeness axiom which, when applied to the Uffink-Jizba-Korbel or the Hanel-Thurner families of entropies, selects Rényi entropy as the only viable one, establishing consistency between the GMEP and the Maximum Likelihood (ML) principle. I will also showcase the potential of ML in estimating the entropic parameter characterizing Rényi entropy, providing numerical examples supporting my theoretical findings.

The fourth chapter concerns nonlinear data compression. I will introduce a generalized Arithmetic Coding scheme that encodes sequences so as to minimize the exponential average codeword length. Moreover, I will provide a simple yet general justification for using the exponential average instead of the linear one: if the main interest is to reduce the probability of exceeding a given codeword length threshold, I find that the exponential average is the target quantity to minimize. All my theoretical findings will be supported by applications on both simulated i.i.d. and real correlated data.

In the last chapter, I will briefly summarize my results. In essence, this thesis addresses the challenges that complex systems pose to data science, offering insights and methodologies for treating the often fragmentary data these systems generate.
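The link between the exponential average and the probability of exceeding a length threshold can be illustrated with a small numerical sketch. It assumes the standard Campbell-style definition of the exponential average, (1/t) log2 of the sum of p_i 2^(t l_i) with t > 0, which may differ in detail from the one used in the thesis, and it relies only on Markov's inequality applied to 2^(tL). The toy source, the code lengths, and all function names are hypothetical and are not taken from the thesis.

```python
import math

def linear_average_length(probs, lengths):
    """Ordinary expected codeword length: sum_i p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))

def exponential_average_length(probs, lengths, t):
    """Campbell-style exponential average (assumed definition), t > 0:
    (1/t) * log2(sum_i p_i * 2^(t * l_i))."""
    return math.log2(sum(p * 2 ** (t * l) for p, l in zip(probs, lengths))) / t

def tail_probability(probs, lengths, threshold):
    """Exact probability that a codeword has length >= threshold."""
    return sum(p for p, l in zip(probs, lengths) if l >= threshold)

def exponential_tail_bound(probs, lengths, t, threshold):
    """Markov's inequality on 2^(t*L):
    P(L >= threshold) <= 2^(-t * (threshold - L_t)),
    where L_t is the exponential average length."""
    l_t = exponential_average_length(probs, lengths, t)
    return min(1.0, 2 ** (-t * (threshold - l_t)))

# Hypothetical toy source with prefix-code lengths satisfying Kraft's inequality.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
t = 1.0
threshold = 3

linear = linear_average_length(probs, lengths)
exponential = exponential_average_length(probs, lengths, t)
tail = tail_probability(probs, lengths, threshold)
bound = exponential_tail_bound(probs, lengths, t, threshold)

print(f"linear average      = {linear}")
print(f"exponential average = {exponential}")
print(f"P(L >= {threshold}) = {tail}, bound = {bound}")
```

Because the bound depends on the code only through the exponential average, lowering that average tightens the guarantee on long codewords, whereas the linear average alone carries no such guarantee.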
26-Jan-2024
English
Scuola Normale Superiore
Anonymous experts
Files in this item:
File: Tesi.pdf
Access: open access
Size: 4.8 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/167607
The NBN code of this thesis is URN:NBN:IT:SNS-167607