Advanced Methods for Deepfake Detection in Audio Streams
Bajwa, Muhammad Khurram Zahur
2026
Abstract
Deepfakes and fake news represent a critical challenge in digital media, where new forms of threat continue to appear. These threats can be delivered through video, images, or audio content, each capable of deceiving users and eroding trust in digital interaction. In this research, we specifically address audio-based threats and propose several methods to detect and mitigate such attacks. We first introduce a method for detecting audio deepfakes using convolutional neural networks (CNNs) applied to mel spectrograms derived from audio signals. The spectrograms are generated with librosa using n_fft=2048, hop_length=512, and n_mels=175; these parameters provide a good balance between temporal and spectral resolution, ensuring that the audio features essential for deepfake detection, such as pitch, harmonic patterns, and voice texture, are well captured. To strengthen the models, data augmentation techniques such as Gaussian and white noise injection are applied, and the augmented dataset is then used to train and evaluate several CNN architectures. The experimental results show that deep learning models can effectively distinguish between real and fake audio, achieving accuracies of up to 99%. To make the model's predictions easier to interpret, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) to show which parts of the mel spectrograms most influence the model's predictions.

In a second line of research, we present a near real-time pipeline for detecting audio deepfakes. Unlike traditional methods that rely on simple features, our approach analyzes audio signals in both the time and frequency domains, which helps tell real audio from fake more precisely. We also use data augmentation to enhance signal features and detection accuracy, and the results show that the pipeline is a reliable and effective way to distinguish between real and fake audio. Audio deepfakes are harder to detect than other types of deepfakes, such as visual deepfakes, because of the limited contextual data available. To address this limitation, we propose a layered architecture and, on top of it, build a comprehensive database of audio streams, combined with advanced machine learning methods, models, and data analytics techniques. By converting audio into mel spectrograms and classifying them with CNN models such as ResNet50, InceptionV3, DenseNet121, MobileNetV2, and DenseNet201, we achieve 99% accuracy. In addition, we use Grad-CAM with DenseNet201 to highlight the key spectrogram regions that distinguish human voices from chatbot voices. We then analyze the audio using Hugging Face models with PyTorch on CUDA GPUs, processing short two-second audio segments for speaker identification, gender, language, accent, emotion, and transcription. Using Mozilla Common Voice and Cambridge IELTS recordings, we built a benchmark dataset for evaluating accuracy. The resulting architecture is scalable, reliable, and effective for audio deepfake detection, even in multi-speaker environments.

Another important research issue is the integration of speaker diarization with voice activity detection (VAD) to create joint frameworks that outperform modular systems. In this thesis, we propose a system that enhances speaker diarization accuracy by integrating machine learning models for initial speaker segmentation, precise speech-to-text (STT) transcription, and advanced large language models (LLMs).
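As a concrete illustration of the feature extraction and augmentation steps above, the following is a minimal Python sketch; the input file name and the 20 dB signal-to-noise ratio are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np
import librosa

def mel_spectrogram(y: np.ndarray, sr: int) -> np.ndarray:
    """Log-scaled mel spectrogram with the parameters quoted in the abstract."""
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=175
    )
    return librosa.power_to_db(S, ref=np.max)

def add_white_noise(y: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add Gaussian (white) noise at an assumed target signal-to-noise ratio."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

y, sr = librosa.load("sample.wav", sr=None)        # hypothetical input file
spec_clean = mel_spectrogram(y, sr)
spec_noisy = mel_spectrogram(add_white_noise(y), sr)
```

The Grad-CAM step can be sketched in the same spirit. This is a generic Grad-CAM over a torchvision DenseNet201 with an assumed two-class (real/fake) head and a random tensor standing in for a spectrogram rendered as an image; it is not the thesis's exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet201

model = densenet201(weights=None)                  # stand-in for the trained model
model.classifier = torch.nn.Linear(model.classifier.in_features, 2)  # real vs. fake
model.eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out                      # feature maps of the last block

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]                # gradients w.r.t. those maps

model.features.register_forward_hook(fwd_hook)
model.features.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                    # spectrogram rendered as an image
logits = model(x)
logits[0, logits.argmax()].backward()              # gradient of the predicted class

# Weight each channel by its average gradient, then ReLU and normalize to [0, 1].
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```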
Another important area of research is a simple edge-cloud architecture for real-time environmental audio classification, aimed at improving indoor security and availability. Audio signals are captured at the edge layer using a Raspberry Pi, converted into mel spectrograms with the librosa Python library, and then transmitted to a cloud-hosted convolutional neural network (CNN) trained on the FSD50K dataset.
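A minimal sketch of the edge side of this architecture is shown below, assuming the cloud-hosted CNN sits behind a simple HTTP endpoint; the endpoint URL, sample rate, and clip length are illustrative assumptions, and the sounddevice library stands in for whatever capture stack runs on the Raspberry Pi.

```python
import io
import numpy as np
import librosa
import requests
import sounddevice as sd

SAMPLE_RATE = 16000                                  # assumed capture rate
CLIP_SECONDS = 2                                     # assumed analysis window
ENDPOINT = "http://cloud.example.com/classify"       # hypothetical cloud endpoint

def capture_clip() -> np.ndarray:
    """Record a short mono clip from the default microphone."""
    clip = sd.rec(int(CLIP_SECONDS * SAMPLE_RATE),
                  samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    return clip.squeeze()

def to_mel(y: np.ndarray) -> np.ndarray:
    """Convert the clip to a log-mel spectrogram before transmission."""
    S = librosa.feature.melspectrogram(y=y, sr=SAMPLE_RATE,
                                       n_fft=2048, hop_length=512, n_mels=175)
    return librosa.power_to_db(S, ref=np.max)

spec = to_mel(capture_clip())
buf = io.BytesIO()
np.save(buf, spec)                                   # serialize the spectrogram
resp = requests.post(ENDPOINT, data=buf.getvalue(),
                     headers={"Content-Type": "application/octet-stream"})
print(resp.json())                                   # e.g. a label and a score
```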
| File | Size | Format |
|---|---|---|
| PhDThesisBajwaMuhammadKhurramZahur.pdf (embargo until 23/03/2028; license: all rights reserved) | 3.28 MB | Adobe PDF |
| PhDThesis_Abstract_BajwaMuhammadKhurramZahur.pdf (embargo until 23/03/2028; license: all rights reserved) | 188.18 kB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/362516
URN:NBN:IT:UNISA-362516