Data-driven Tools for Audio-Visual Forgery Detection
Gohari Moghaddam, Mahyar
2025
Abstract
With the rapid advancement of multimedia technologies, the proliferation of digital content manipulation has emerged as a significant and growing concern. The ease with which digital media can be created, modified, and distributed has opened new avenues for innovation and creativity but has also given rise to malicious practices that undermine the authenticity of information. Among these practices, audio-visual forgery, which involves the intentional tampering or fabrication of audio and visual data, has become a particularly pressing issue. This type of manipulation poses a serious threat to the integrity of information across a wide range of sectors, including media, legal proceedings, and the music industry. In the media sector, forged audio and video can lead to the dissemination of false information, eroding public trust. In the legal realm, manipulated evidence can compromise the fairness of judicial outcomes. Similarly, in the music industry, fraudulent alterations can infringe on intellectual property rights and damage the reputation of artists. Given the sophistication of modern forgery techniques, traditional detection methods and general-purpose detectors have become increasingly inadequate. As such, there is a critical need for advanced, task-specific solutions capable of addressing these challenges. This thesis focuses on the development and application of deep learning techniques to detect forgeries in both audio and visual data, with an emphasis on tasks and perspectives that have been overlooked in the extensive literature of multimedia forensics. In the visual domain, we concentrate on the detection of image forgeries through the analysis of light consistency, a forensic cue that remains challenging to maintain even with advanced manipulation techniques. We propose a physics-guided neural network capable of estimating the global 3D light direction within images, which can then be applied to identify inconsistencies indicative of tampering. 
In the audio domain, this thesis explores the emerging yet underexplored area of singing voice manipulation. We begin by investigating the detection of Auto-Tune, a widely used singing voice enhancement technique that can also be exploited for malicious purposes. Next, we examine deepfake-generated singing voices, evaluating the effectiveness of various audio representations and feature sets for their detection and comparing their impact on the related task of speech deepfake detection. Building on these insights, we present ongoing research focused on detecting and analyzing vocal transformations in music production. By leveraging Large Language Models, we aim to describe the forensic characteristics of manipulated singing voices through the generation of descriptive text, paving the way for more interpretable and robust forensic audio analysis. In conclusion, this thesis demonstrates the critical importance of developing advanced, domain-specific tools for forgery detection in the rapidly evolving landscape of multimedia technologies. By addressing underexplored areas in both audio and visual domains, we contribute novel methodologies and insights to the field of multimedia forensics. We regard this thesis as a foundational exploration of multimedia forensics. Although the results demonstrate significant potential, the ever-changing landscape of digital forgery calls for continuous innovation. We hope these contributions will inform and inspire further advancements in the field.

File | Size | Format
---|---|---
Tesi Gohari.pdf (under embargo until 20/06/2026) | 27.09 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/213032
URN:NBN:IT:UNIBS-213032