
Towards Flexible and Expressive Generative Models for Tabular and Relational Data

SCASSOLA, DAVIDE
2026

Abstract

Generative modeling has advanced significantly over the past decade, driven by methodological innovation and increased computational resources. While domains such as images, text, and audio have seen widespread adoption of advanced techniques, tabular and relational data present distinct challenges: complex marginal distributions, intricate dependencies, heterogeneous data types, missing values, and hard constraints. These challenges intensify in relational databases, where multiple interconnected tables must be modeled jointly while preserving structural dependencies. Despite recent progress, crucial limitations in flexibility remain. State-of-the-art diffusion models generate high-fidelity synthetic data, but they cannot incorporate user-specified constraints without retraining, answer general probabilistic queries, or handle complex relational structures without restrictive independence assumptions. This thesis addresses these limitations through three main contributions. First, we develop a training-free conditional sampling method for score-based models that enables users to impose logical constraints by combining neuro-symbolic constraint encoding with conditional score approximation. Second, we propose an expressive flow-matching framework for generating multi-table relational databases with arbitrary graph structures that does not assume independence between related records, achieving state-of-the-art fidelity. Third, we analyze overparameterized probabilistic circuits as tractable generative models for tabular data, achieving competitive performance while enabling exact likelihood computation, principled handling of missing values, exact conditional sampling on partial evidence, and faster training and sampling than diffusion models. We also critically evaluate existing metrics and benchmarks, identifying their limitations and proposing more reliable evaluation protocols.
Collectively, this work advances the state of the art in flexible and expressive generative modeling for tabular data.
25 February 2026
Language: English
Keywords: Generative Models; Tabular Data; Probability; Deep Learning; Neuro-symbolic
SACCANI SEBASTIANO
BORTOLUSSI, LUCA
Università degli Studi di Trieste
Files in this record:
Thesis.pdf — open access — License: All rights reserved — 4.31 MB, Adobe PDF
Thesis_1.pdf — open access — License: All rights reserved — 4.31 MB, Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/362755
The NBN code of this thesis is URN:NBN:IT:UNITS-362755