Learning Beyond the Gaussian Approximation: Algorithmic Hardness, Neural Network Dynamics and Independent Components

BARDONE, LORENZO
2025

Abstract

Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. In this thesis, we study the efficiency of neural networks at extracting features from higher-order cumulants. We first discuss the fundamental statistical and computational limits of recovering a single non-Gaussian direction by computing the number of samples n required to reliably distinguish between inputs from the spiked cumulant model and isotropic Gaussian inputs. Our findings confirm the existence of a wide statistical-to-computational gap: statistical distinguishability requires n > d samples, while distinguishing the two distributions in polynomial time requires n > d^2 samples for the class of algorithms covered by the low-degree conjecture. Numerical experiments show that neural networks do indeed learn to distinguish the two distributions with quadratic sample complexity. Then, to further bridge the gap between theory and practice, we show that correlations between latent variables along the directions encoded in different input cumulants speed up learning from higher-order cumulants, reaching linear sample complexity. We establish this effect analytically by deriving nearly sharp thresholds for the number of samples required by a single neuron to weakly recover these directions using online SGD from a random start in high dimensions. Our analytical results are confirmed in simulations of two-layer neural networks and unveil a new mechanism for hierarchical learning in neural networks. Finally, we connect to feature learning in deep networks via the proxy of independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. The motivation is that ICA finds features that are similar to the first-layer filters learned by deep convolutional networks on classification tasks. This similarity suggests that ICA provides a simple yet principled model for studying feature learning. Here, we prove that FastICA, the most popular ICA algorithm, requires at least n > d^4 samples to recover a single non-Gaussian direction from d-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity n > d^2 can be reached by smoothing the loss, albeit in a data-dependent way. We conclude by discussing how this picture extends to different distributions and architectures, such as generative diffusion models.
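To make the objects in the abstract concrete, the sketch below generates inputs with a single planted non-Gaussian direction and recovers it with the standard single-unit FastICA fixed-point iteration on whitened data. The data model (a Rademacher latent along a planted spike plus isotropic Gaussian noise, with an assumed signal strength beta) is only an illustrative assumption, not necessarily the exact spiked cumulant model analysed in the thesis; the FastICA update itself is the usual kurtosis-based fixed point.

```python
import numpy as np

# Illustrative sketch (not the thesis's exact model): inputs with a single
# planted non-Gaussian direction, recovered by the standard single-unit
# FastICA fixed-point iteration on whitened data.

rng = np.random.default_rng(0)
d, n = 64, 40_000
beta = 5.0                                  # assumed signal strength

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                      # planted non-Gaussian direction

nu = rng.choice([-1.0, 1.0], size=n)        # Rademacher latent (non-Gaussian)
z = rng.standard_normal((n, d))             # isotropic Gaussian background
X = np.sqrt(beta) * nu[:, None] * u[None, :] + z

# Symmetric (ZCA) whitening: FastICA assumes centred, whitened inputs.
X -= X.mean(axis=0)
cov = X.T @ X / n
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
Xw = X @ W

# Single-unit FastICA with the kurtosis contrast g(t) = t^3:
#   w  <-  E[x g(w.x)] - E[g'(w.x)] w,   then normalise.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(200):
    proj = Xw @ w
    w_new = (Xw * (proj ** 3)[:, None]).mean(axis=0) - 3.0 * (proj ** 2).mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(w_new @ w) > 1.0 - 1e-9
    w = w_new
    if converged:
        break

# With symmetric whitening the planted direction is unchanged (W u is
# proportional to u), so the overlap should be close to 1 after convergence.
print(f"overlap |<w, u>| = {abs(w @ u):.3f}")
```

This toy instance only illustrates the setting; the sample-complexity results quoted in the abstract concern how n must scale with d in the thesis's specific model.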
25 Sep 2025
English
Goldt, Sebastian Dominik
SISSA
Trieste
Files in this item:
PhD_thesis-Bardone.pdf (open access, 8.53 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/295811
The NBN code of this thesis is URN:NBN:IT:SISSA-295811