
Defending against audio deepfakes: robust detection in the synthetic speech era

XX, TAIBA MAJID
2026

Abstract

Neural speech synthesis has revolutionized voice cloning, enabling hyper-realistic audio creation with minimal training data. While these technologies offer transformative possibilities for accessibility and creativity, they become devastating weapons when misused. Audio deepfakes now fuel sophisticated fraud, social engineering attacks, and misinformation campaigns that threaten democratic discourse, and the ear, once our most trusted ally, has become our weakest sensor for distinguishing authentic from synthetic speech.

This thesis confronts the challenge of audio deepfake detection by systematically addressing three fundamental barriers that limit current methods' real-world effectiveness. First, existing approaches fail to generalize across the datasets, generators, and audio processing conditions encountered in practice. Second, traditional architectures fail to capture the intricate spectro-temporal fingerprints essential for forensic discrimination, and they suffer from catastrophic forgetting when adapting to emerging synthesis techniques. Third, current methods lack the transparency and explainability demanded by forensic applications, where detection decisions must withstand rigorous legal scrutiny.

To address these limitations, we propose a comprehensive framework that tackles each challenge with a specialized methodology. For the generalization problem, we develop hierarchical learning approaches that model complex acoustic structure through novel architectures, moving beyond conventional networks that discard crucial spectro-temporal information. We explore fusion strategies that combine complementary evidence streams, including spectral, temporal, and self-supervised representations, to build more robust detectors, and we investigate ensemble methods that leverage diverse acoustic cues to improve cross-domain performance.

Recognizing that the synthetic speech landscape continuously evolves, we address the adaptation challenge through knowledge transfer methodologies. We propose continual learning strategies that enable detectors to incorporate new synthesis methods without catastrophic forgetting, and a codec-aware framework that explicitly handles the compression artifacts and channel variations encountered in real-world scenarios, ensuring robust performance across different audio processing pipelines.

Finally, we address the need for explainable detection systems suitable for forensic workflows. We develop an interpretable framework that provides transparent explanations for detection decisions, moving beyond simple confidence scores to a detailed analysis of the acoustic features that indicate synthetic content.

The proposed methods offer insights for both audio deepfake detection and the broader field of multimedia forensics: the hierarchical modeling approaches, fusion strategies, and interpretability frameworks can be adapted to challenges in other forensic domains, and our evaluation protocols establish new standards for assessing detection-system reliability in practical deployment scenarios. We consider this thesis a significant contribution to synthetic speech forensics. While the results are promising, the rapidly evolving nature of generative technologies necessitates continuous development of new detection methodologies; this work provides a foundation for future research toward trustworthy systems capable of maintaining information integrity in an era of increasingly sophisticated synthetic media.
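The fusion of spectral, temporal, and self-supervised evidence streams mentioned in the abstract can be pictured with a short sketch. Three illustrative sketches follow; none of them reproduces the thesis's actual methods. The first is a minimal PyTorch example of one possible late-fusion detector: the feature dimensions, the gating mechanism, and the LateFusionDetector name are all hypothetical, and the two input vectors stand in for pre-computed spectral and self-supervised utterance embeddings.

# Minimal late-fusion sketch (illustrative only; not the thesis implementation).
# Assumes two pre-computed utterance-level embeddings per clip: a spectral one
# (e.g. pooled LFCC/log-mel statistics) and a self-supervised one (e.g. a pooled
# wav2vec 2.0 layer). All dimensions below are arbitrary placeholders.
import torch
import torch.nn as nn

class LateFusionDetector(nn.Module):
    def __init__(self, spectral_dim=120, ssl_dim=768, hidden_dim=256):
        super().__init__()
        # Project each evidence stream into a shared space before fusion.
        self.spectral_proj = nn.Sequential(nn.Linear(spectral_dim, hidden_dim), nn.ReLU())
        self.ssl_proj = nn.Sequential(nn.Linear(ssl_dim, hidden_dim), nn.ReLU())
        # Learned gate: how much to trust each stream for a given utterance.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, 2)  # bona fide vs. spoof

    def forward(self, spectral_feat, ssl_feat):
        s = self.spectral_proj(spectral_feat)
        w = self.ssl_proj(ssl_feat)
        g = self.gate(torch.cat([s, w], dim=-1))
        fused = g * s + (1.0 - g) * w   # convex combination of the two streams
        return self.classifier(fused)   # logits over {bona fide, spoof}

# Toy usage with random tensors standing in for real features.
model = LateFusionDetector()
logits = model(torch.randn(4, 120), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])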
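The second sketch hints at the continual-learning idea of adapting to new synthesizers without catastrophic forgetting, using a generic regularization-based formulation in the style of Elastic Weight Consolidation. This is a textbook penalty, not the strategy proposed in the thesis; fisher, old_params, and lam are placeholder names for quantities that would be estimated after training on earlier synthesis methods.

# Regularization-based continual-learning sketch (EWC-style; illustrative only).
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Quadratic penalty that discourages drifting away from parameters that
    were important for previously learned synthesis methods."""
    loss = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam * loss

# When adapting to a new generator, the total loss would become
#   task_loss_on_new_data + ewc_penalty(model, fisher, old_params)
# so the detector learns the new artifacts without overwriting old knowledge.

# Toy usage: a tiny linear model whose weights are all equally "important".
model = torch.nn.Linear(8, 2)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(ewc_penalty(model, fisher, old_params).item())  # 0.0 before any update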
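The third sketch gestures at feature-level explanations: an input-gradient saliency map that highlights which time-frequency bins most influence the "spoof" decision. The tiny CNN is a hypothetical stand-in for a detector; the thesis's interpretable framework is not shown here.

# Input-saliency sketch (illustrative only): gradient of the spoof logit with
# respect to the input spectrogram, giving a time-frequency importance map.
import torch
import torch.nn as nn

detector = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)

spectrogram = torch.randn(1, 1, 80, 200, requires_grad=True)  # (batch, channel, mels, frames)
spoof_logit = detector(spectrogram)[0, 1]
spoof_logit.backward()

saliency = spectrogram.grad.abs().squeeze()  # (80, 200) importance map
print(saliency.shape)  # these bins can be overlaid on the spectrogram for inspection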
29 Jan 2026
English
AMERINI, IRENE
SCHAERF, Marco
Università degli Studi di Roma "La Sapienza"
Files in this record:

Tesi_dottorato_Majid.pdf
Access: open access
License: Creative Commons
Size: 14.22 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359101
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-359101