
Methods of statistical physics for explaining neural networks' performance

DEMYANENKO, ELIZAVETA
2025

Abstract

In this thesis, we focus on understanding the impressive ability of neural networks to generalize to unseen data. When and why does this generalization occur, and how can it be predicted? We offer partial answers to these questions by analyzing the loss landscape of neural networks, drawing insights from the study of disordered systems.

The first chapter addresses multiple phenomena in machine learning through a complexity measure of the function realized by a neural network. Specifically, we study how the generalization of a neural network evolves as we vary its size. We demonstrate that the location of a network's test-error double descent can be accurately predicted by studying the phase transitions of a new sensitivity metric for neural networks, which we call the Boolean Mean Dimension (BMD). Focusing on a teacher-student setting for the random feature model, we develop a theoretical analysis based on the replica method that yields an interpretable expression for the BMD in the high-dimensional regime, where the number of data points, the number of features, and the input size grow to infinity.

The second chapter examines the relationship between the generalization capabilities of neural networks and the concept of flatness, evaluated through local entropy measures for different loss functions. Our goal is to deepen the understanding of how properties of the loss landscape connect with generalization. We show that the generalization properties of a loss minimizer can be inferred by examining the empirical risk landscape around it: generalization correlates with the existence of nearby clusters of near-minimal solutions, which we refer to as flat minima, whereas isolated minimizers tend to overfit. Following this idea, we employ variants of gradient descent that specifically target these wide flat minima.
We demonstrate that these algorithms improve generalization across different architectures, datasets, and loss functions.

The third chapter conducts a more precise study of the loss landscape in a simple model of shallow neural networks known as the perceptron problem. We analyze the Stochastic Localization (SL) algorithm, which employs a denoising diffusion process to sample from a known but intractable distribution; in our setting, the score function is provided by an oracle based on Approximate Message Passing. We consider several non-convex variants of the perceptron problem: the negative spherical perceptron, the binary perceptron, and the binary perceptron with modified loss functions. We show that stochastic localization successfully samples solutions in every satisfiable parameter regime of the negative spherical perceptron, while it fails throughout an entire parameter regime of the binary perceptron; the latter issue can, however, be overcome by adapting the loss function to target flat minima. Additionally, we provide numerical evidence for the uniformity of the sampled solutions in the case of the negative spherical perceptron.

Finally, the fourth chapter discusses the performance of the stochastic localization algorithm on sparse constraint satisfaction problems (CSPs). In particular, we focus on sampling solutions to linear equations over finite fields (k-XORSAT), Boolean formulas (k-SAT), and proper graph colorings (q-coloring). Previous analyses of this sampling algorithm on CSPs rely on a contiguity assumption that does not hold throughout the replica-symmetric regime of the k-SAT problem. We instead propose a new population-dynamics framework tailored to the localization process, which allows us to analyze the limitations of stochastic localization beyond the contiguity regime. Lastly, we provide numerical evidence that stochastic localization achieves uniform sampling for the k-XORSAT problem.
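To give a concrete sense of the Boolean Mean Dimension mentioned above: for a function on the Boolean hypercube, the mean dimension admits a standard Fourier-analytic expression as the ratio of total influence to variance, which can be estimated by Monte Carlo. The sketch below illustrates only this generic identity, not the thesis's replica computation for random features; the function `mean_dimension` and all sample sizes are illustrative choices.

```python
import numpy as np

def mean_dimension(f, n, n_samples=2000, rng=None):
    """Monte Carlo estimate of the mean dimension of f: {-1,+1}^n -> R.

    Uses the Fourier identity  D(f) = (sum_i Inf_i(f)) / Var(f),
    where Inf_i(f) = E[((f(x) - f(x with coordinate i flipped)) / 2)^2].
    """
    rng = np.random.default_rng(rng)
    x = rng.choice([-1.0, 1.0], size=(n_samples, n))
    fx = f(x)
    total_influence = 0.0
    for i in range(n):
        x_flip = x.copy()
        x_flip[:, i] *= -1.0  # flip coordinate i in every sample
        total_influence += np.mean(((fx - f(x_flip)) / 2.0) ** 2)
    return total_influence / np.var(fx)

# Sanity check: a parity on two coordinates has mean dimension exactly 2,
# since its only Fourier weight sits on a set of size 2.
f = lambda x: x[:, 0] * x[:, 1]
print(round(mean_dimension(f, n=10, rng=0), 2))
```

A high mean dimension signals that the function couples many input coordinates at once, which is the kind of sensitivity the BMD tracks across model sizes.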
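For intuition on the sampling procedure studied in the later chapters, here is a minimal sketch of stochastic localization in the one tractable case where no oracle is needed: sampling the uniform distribution on the hypercube, for which the posterior mean (the "score") is exactly tanh of the observation. In the thesis the posterior mean is instead approximated by Approximate Message Passing on structured problems; the function name, horizon `T`, and step size `dt` below are illustrative.

```python
import numpy as np

def sample_hypercube_sl(n, T=50.0, dt=0.01, rng=None):
    """Sample uniformly from {-1,+1}^n via stochastic localization.

    Simulates the observation process  dy_t = m(y_t) dt + dB_t,  where
    m(y) = E[x | y_t = y] = tanh(y) is the exact posterior mean for a
    uniform prior on the hypercube; y_t / t localizes on the sample.
    """
    rng = np.random.default_rng(rng)
    y = np.zeros(n)
    t = 0.0
    while t < T:  # Euler-Maruyama discretization of the SDE
        y += np.tanh(y) * dt + np.sqrt(dt) * rng.standard_normal(n)
        t += dt
    return np.sign(y)

x = sample_hypercube_sl(n=2000, rng=0)
print(np.mean(x == 1))  # fraction of +1 spins; fluctuates around 1/2 by symmetry
```

Each coordinate's drift pushes it toward +1 or -1 depending on the noise it accumulates early on, so the process "localizes" on a single corner of the hypercube, chosen uniformly at random.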
23 June 2025
English
MALATESTA, ENRICO MARIA
LUCIBELLO, CARLO
Università Bocconi
Files in this record:

File: DEMYANENKO_Revised Thesis_vf.pdf
Access: open access
Size: 3.66 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/213827
The NBN code of this thesis is URN:NBN:IT:UNIBOCCONI-213827