Towards Flexible and Expressive Generative Models for Tabular and Relational Data
SCASSOLA, DAVIDE
2026
Abstract
Generative modeling has advanced significantly over the past decade, driven by methodological innovation and increased computational resources. While domains such as images, text, and audio have seen widespread adoption of advanced techniques, tabular and relational data present distinct challenges: complex marginal distributions, intricate dependencies, heterogeneous data types, missing values, and hard constraints. These challenges intensify in relational databases, where multiple interconnected tables must be modeled jointly while preserving structural dependencies. Despite recent progress, crucial limitations in flexibility remain. State-of-the-art diffusion models generate high-fidelity synthetic data but cannot incorporate user-specified constraints without retraining, answer general probabilistic queries, or handle complex relational structures without restrictive independence assumptions. This thesis addresses these limitations through three main contributions. First, we develop a training-free conditional sampling method for score-based models that lets users impose logical constraints by combining neuro-symbolic constraint encoding with conditional score approximation. Second, we propose an expressive flow-matching framework for generating multi-table relational databases with arbitrary graph structures; it does not assume independence between any related records and achieves state-of-the-art fidelity. Third, we analyze overparameterized probabilistic circuits as tractable generative models for tabular data, achieving competitive performance while enabling exact likelihood computation, principled handling of missing values, exact conditional sampling given partial evidence, and faster training and sampling than diffusion models. We also critically evaluate existing metrics and benchmarks, identifying their limitations and proposing more reliable evaluation protocols.
Collectively, this work advances the state of the art in flexible and expressive generative modeling for tabular data.

| File | Size | Format | Access | License |
|---|---|---|---|---|
| Thesis.pdf | 4.31 MB | Adobe PDF | Open access | All rights reserved |
| Thesis_1.pdf | 4.31 MB | Adobe PDF | Open access | All rights reserved |

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/362755
URN:NBN:IT:UNITS-362755