High-resolution synthesis across domains: a wavelet-driven approach to generative modeling
SIGILLO, LUIGI
2025
Abstract
Deep generative modeling is revolutionizing visual data synthesis, from creative industry applications to tools for scientific discovery in fields such as medical imaging and remote sensing. Deep learning models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DDPMs) enable the creation of novel, realistic, and controllable visual content; in specialized applications, they can augment scarce data for medical analysis or enhance imagery for remote sensing tasks. A crucial challenge in applying deep learning to image synthesis is overcoming the persistent trade-off between resolution, fidelity, and computational efficiency. Despite significant advances, generative models often struggle to generalize across diverse domains and data modalities. Conventional methods tend to scale poorly to ultra-high resolution (UHR), producing artifacts such as repeated structures or blurred textures. Furthermore, they typically apply a uniform refinement process across all spatial regions, disregarding local frequency variations and failing to allocate supervision optimally to areas of differing visual complexity.

Wavelet transforms, particularly the Discrete Wavelet Transform (DWT) and its hypercomplex extension, the Quaternion Wavelet Transform (QWT), have shown promising results in multi-scale signal analysis. These transforms decompose an image into a hierarchy of frequency sub-bands, capturing both global structure and fine-grained detail, which in turn enables sparse representations and dimensionality reduction. However, their potential to reshape feature representation, inform conditioning, and adapt training objectives in a model-agnostic fashion has not been comprehensively exploited.

In this thesis, we exploit the wavelet-driven learning paradigm to overcome these shortcomings of traditional generative models, leveraging multi-scale analysis to make models inherently aware of both frequency and spatial information. We first design a structure-aware GAN, StawGAN, tailored for cross-domain infrared-to-RGB image translation. Building on this foundation, we develop specialized diffusion models for domain-specific tasks, including high-fidelity maritime image super-resolution and efficient EEG-to-image synthesis. Moving beyond these conventional generative approaches, we introduce a series of wavelet-driven architectures that explicitly incorporate multi-scale signal representations. Among them is QUAVE, a novel framework that leverages the QWT to enhance feature extraction and improve generalization in medical imaging. Expanding on these insights, we also pioneer several wavelet-based super-resolution models: a QWT-conditioned diffusion model, a metadata- and wavelet-aware architecture for satellite imagery, and a highly efficient hybrid framework, Wavelet Diffusion GAN, that combines the strengths of GANs and diffusion processes. Finally, extending these works toward high-fidelity synthesis, we culminate our investigation with Latent Wavelet Diffusion (LWD), a general and lightweight framework that enables existing latent diffusion and flow matching models to achieve UHR synthesis (up to 4K) without architectural modifications or additional inference cost.

Through extensive experiments on a variety of generative tasks spanning different domains and data modalities, we thoroughly explore the wavelet-driven paradigm while addressing scenario-specific challenges in fidelity, efficiency, and generalization, advancing research in this field.
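To make the decomposition described above concrete, the following is a minimal, illustrative sketch (assuming the PyWavelets library; it is not code from the thesis) of how a 2D DWT splits an image into one low-frequency approximation sub-band and three high-frequency detail sub-bands, and how repeating the transform yields the multi-scale hierarchy.

```python
# Illustrative sketch only, assuming PyWavelets (pip install PyWavelets);
# this is not the implementation used in the thesis.
import numpy as np
import pywt

# Toy grayscale image standing in for real data.
image = np.random.rand(256, 256).astype(np.float32)

# One level of the 2D Discrete Wavelet Transform with the Haar wavelet:
# cA is the coarse approximation (global structure); cH, cV, cD are the
# horizontal, vertical, and diagonal detail sub-bands (fine-grained,
# high-frequency content), each at half the input resolution.
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
print(cA.shape)  # (128, 128)

# Repeating the transform on the approximation yields the hierarchy of
# frequency sub-bands; here, a 3-level decomposition.
coeffs = pywt.wavedec2(image, "haar", level=3)

# The transform is invertible, so the decomposition loses no information.
reconstructed = pywt.waverec2(coeffs, "haar")
assert np.allclose(image, reconstructed, atol=1e-5)
```

Because the sub-bands expose frequency content explicitly, a generative model can condition on, reweight, or supervise them separately, which is the property the wavelet-driven methods in this thesis build on.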
| File | License | Size | Format |
|---|---|---|---|
| Tesi_dottorato_Sigillo.pdf (open access) | All rights reserved | 167 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/306649
URN:NBN:IT:UNIROMA1-306649