DEEPFAKE DETECTION USING MULTIMODAL DEEP LEARNING

ABSTRACT

Deepfakes are synthetic (fake) media generated with artificial intelligence, and they are having a growing impact on society. People often cannot tell which content is fake and which is real. Realistic deepfake videos enable cyber fraud, facilitate the spread of false information, and can be used to manipulate public opinion. This raises concerns about privacy violations and the potential for malicious use of the technology. To address this rising threat, this paper applies multimodal deep learning to analyze both the visual and audio content of videos. The proposed method uses a convolutional neural network for visual data and recurrent neural networks for audio, and integrates the two streams so that complementary information from both improves detection performance. The primary objective is to build models that can effectively distinguish genuine recordings from manipulated deepfakes. By training these models on diverse datasets containing both authentic and altered media, the work aims to support trustworthy media and limit the spread of false information.

INTRODUCTION

Advances in technology have enabled the creation of deepfakes: fake content made to appear real using deep learning. Deepfakes are generated with generative models such as Generative Adversarial Networks (GANs). Although this innovation offers opportunities for accessibility and entertainment, its misuse poses serious problems in areas such as privacy and cybersecurity. The accuracy of current deepfake detection techniques is limited because they frequently examine visual or audio characteristics separately. M. S. Rana et al. highlighted challenges such as dataset diversity and generalization in deepfake detection. However, deepfake content frequently contains inconsistencies between modalities, such as irregular speech patterns or imperfect lip synchronisation. To overcome this limitation, we propose a multimodal deep learning method that integrates audio and visual analysis, using complementary information from both input streams to improve detection performance.

This study proposes a novel multimodal deep learning approach for deepfake identification that exploits the complementary qualities of audio and visual data. The proposed model applies separate preprocessing to video (features from a ResNet-50 model) and audio (spectrograms, MFCCs, and normalisation), and then fuses the two to make detection more robust. It further uses recurrent neural networks (RNNs) for temporal sequence modelling, building on their capacity to process sequential data. To detect deepfakes more effectively than single-modal methods, the proposed method examines the correlations between audio and video features.

Several authors have worked on deepfake detection using different deep learning techniques. J. K. Lewis et al. proposed a multimodal framework combining spatial, spectral, and temporal feature inconsistencies to enhance detection accuracy. Davide Salvi et al. and Kashish Gandhi et al. developed robust multimodal detection approaches that integrate video, audio, and other metadata to tackle challenging deepfake cases. Santosh Kolagati et al. combined MLP and CNN architectures to identify spatial anomalies in deepfake videos. Later work used deep CNNs for visual features and Mel-Frequency Cepstral Coefficients (MFCCs) for audio features, showcasing the complementary strengths of both modalities for deepfake detection. S. Nailwal et al. introduced a multi-algorithmic framework that combines multiple deep learning models and modalities to enhance the reliability and robustness of deepfake detection. S. Sebyakin et al. explored the use of deep neural networks to detect spatio-temporal inconsistencies in deepfake videos, demonstrating the importance of analyzing both spatial and temporal features for more accurate detection. A. Malik et al. emphasized the importance of multimodal approaches in deepfake detection, as they provide a more comprehensive way to identify inconsistencies across different data types. Across these studies, deep learning models (mainly CNNs) account for a significant share of the proposed detectors, and detection accuracy is the most widely used performance metric. Most authors detect deepfakes from either audio or video features, but not from the two combined.

The proposed model uses the FakeAVCeleb dataset, which was created using four distinct deepfake generation and voice synthesis techniques. It is a comprehensive multimodal synthetic-media dataset designed specifically for deepfake detection research, and it contains a wide variety of audio-video pairs covering both deepfake and authentic content. The video clips show celebrities performing or speaking in a variety of settings. To create synthetic material, the videos are paired with audio that has been modified using sophisticated deepfake techniques. By altering lip motion, facial expressions, and audio, these manipulations replicate real-world situations in which deepfakes could be used. Our study aims to build a multimodal deep learning system that integrates audio and video data for more accurate deepfake identification, with the goal of reducing the negative social impact of deepfakes.

DATASET DESCRIPTION

The proposed model uses the FakeAVCeleb dataset, which contains audio-visual material in the following categories:

- RealVideo-RealAudio (Category A): Authentic videos and audio.
- RealVideo-FakeAudio (Category B): Authentic videos paired with deepfake audio.
- FakeVideo-RealAudio (Category C): Deepfake videos paired with authentic audio.
- FakeVideo-FakeAudio (Category D): Fully synthetic content combining both deepfake video and audio.

The metadata file contains comprehensive labels listing the sources, categories, and techniques used to create the deepfake content. To balance authentic and manipulated content, the data was subsampled to 500 samples from Category A and 170 samples each from Categories B, C, and D (510 manipulated samples in total).
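
The per-category sampling described above could look like the following minimal sketch. The metadata layout (a list of dicts with a `category` key) and the function name `sample_by_category` are assumptions made for illustration; the actual metadata file format is not shown here.

```python
import random
from collections import defaultdict

# Samples per category as stated in the text: 500 real, 3 x 170 manipulated.
SAMPLES_PER_CATEGORY = {"A": 500, "B": 170, "C": 170, "D": 170}


def sample_by_category(rows, quotas=SAMPLES_PER_CATEGORY, seed=0):
    """Draw a fixed number of rows from each category without replacement.

    `rows` is assumed to be a list of dicts with a "category" key
    (hypothetical layout; adapt it to the real metadata file).
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    selected = []
    for category, quota in quotas.items():
        pool = by_category[category]
        if len(pool) < quota:
            raise ValueError(f"category {category}: need {quota}, have {len(pool)}")
        selected.extend(rng.sample(pool, quota))
    return selected


# Synthetic stand-in metadata: 600 rows per category.
rows = [{"path": f"{c}_{i}.mp4", "category": c} for c in "ABCD" for i in range(600)]
subset = sample_by_category(rows)
print(len(subset))  # 500 + 3 * 170 = 1010
```

Fixing the seed keeps the subset reproducible across runs, which matters when results are compared between experiments.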

Dataset Preprocessing

Audio Extraction: The moviepy library was used to extract audio tracks from the video files and convert them to WAV format.

Feature Engineering: Video frames were resized to 224 × 224 pixels. Audio features were extracted as Mel-Frequency Cepstral Coefficients (MFCCs) with 30 coefficients per frame, and each sequence was padded or truncated to 30 frames.

The dataset was divided into training (80%) and testing (20%) sets using stratified sampling, so that both sets keep the same class proportions.
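
The fixed-length sequences and the stratified 80/20 split can be sketched in plain Python as follows. Two points are assumptions, since the text does not specify them: short sequences are padded with zeros, and stratification is done over the category label. The helper names `pad_or_truncate` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict

N_FRAMES = 30  # target sequence length from the text
N_MFCC = 30    # MFCC coefficients per frame from the text


def pad_or_truncate(frames, n_frames=N_FRAMES, n_coeffs=N_MFCC):
    """Return exactly `n_frames` frames: cut long sequences, zero-pad short ones."""
    frames = [list(f) for f in frames[:n_frames]]
    while len(frames) < n_frames:
        frames.append([0.0] * n_coeffs)  # zero padding is an assumption
    return frames


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Category sizes from the dataset description: 500 / 170 / 170 / 170.
labels = ["A"] * 500 + ["B"] * 170 + ["C"] * 170 + ["D"] * 170
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 808 202
```

In practice, a library routine such as scikit-learn's `train_test_split` with its `stratify` argument does the same job; the hand-written version only makes the mechanism explicit.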

METHOD

The proposed method leverages a multimodal deep learning architecture combining visual and audio features to detect deepfakes.

Model Architecture:

Video Processing

ResNet-50: The video processing pipeline starts with a ResNet-50 model pretrained on ImageNet, which extracts spatial features from individual video frames.

Removal of Final Layers: To focus on feature extraction, the final layers of ResNet-50 (the pooling and fully connected classification layers) are removed, so that the model outputs high-dimensional feature maps rather than class probabilities.

Adaptive Average Pooling: The feature maps are passed through an Adaptive Average Pooling layer, which reduces them to fixed-length 2048-dimensional embeddings regardless of the input size. This standardises the spatial dimensions for further processing.

Dimensionality Reduction: The 2048-dimensional embeddings are flattened and passed through a fully connected (Linear) layer that reduces them to 512 dimensions. This lowers the computational cost while retaining the most important spatial information.

Temporal Aggregation: Temporal average pooling aggregates the features extracted from each video frame: averaging over all frames of a video produces a single, unified 512-dimensional representation. The final video features thus combine spatial detail with information aggregated across time. The video processing architecture is shown in figure 1.

Audio Processing:

Input Audio Features: First, features such as Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from the audio data. They serve as the input for sequence modelling because they capture the key temporal and spectral characteristics of the audio stream.

Bidirectional GRU Layer: The audio features are passed through a bidirectional GRU layer, which processes the sequence both forward and backward so that the model can capture dependencies on both earlier and later time steps. The GRU maps each 30-dimensional input frame to a 128-dimensional hidden state per direction, giving 256 features once both directions are concatenated.

Bidirectional LSTM Layer: The GRU output is then fed to a bidirectional LSTM layer, which is well suited to capturing long-term dependencies. The LSTM takes the 256-dimensional input (combined from both directions) and again uses 128 hidden units per direction, further refining the representation.

Last Time Step Selection: The model takes the LSTM output at the last time step, which compresses the temporal information of the whole sequence into a single 256-dimensional vector. The audio processing pipeline is shown in figure 2.
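
A minimal PyTorch sketch of this audio branch is given below. It assumes that 128 is the hidden size per direction, which is the reading that reproduces the 256-dimensional output stated in the text; the class name `AudioBranch` is hypothetical.

```python
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """MFCC sequence -> bidirectional GRU -> bidirectional LSTM -> last step."""

    def __init__(self, n_mfcc=30, hidden=128):
        super().__init__()
        # Each bidirectional layer outputs 2 * hidden = 256 features per step.
        self.gru = nn.GRU(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, mfcc):
        # mfcc: (batch, time, n_mfcc)
        x, _ = self.gru(mfcc)   # (batch, time, 256)
        x, _ = self.lstm(x)     # (batch, time, 256)
        return x[:, -1, :]      # output at the last time step -> (batch, 256)


audio_branch = AudioBranch()
audio_features = audio_branch(torch.randn(4, 30, 30))  # 4 clips, 30 frames, 30 MFCCs
print(audio_features.shape)  # torch.Size([4, 256])
```

One design point worth noting: at the last time step, the backward direction of a bidirectional layer has processed only the final frame. Concatenating the final hidden states of both directions is a common alternative when the full sequence should inform both halves of the vector.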

Fusion and Classification

Combination of Features: The 256-dimensional audio features from the bidirectional GRU-LSTM [20] and the 512-dimensional video features extracted with the ResNet-50 model are concatenated into a single 768-dimensional multimodal feature vector. Combining visual and audio data in one representation allows the model to identify relationships between the two modalities.

Fully Connected Layers: Two dense layers with ReLU activations are applied to the combined feature vector, helping the model learn complex relationships between the features of the two modalities. Regularisation techniques such as dropout can be applied at this stage to reduce overfitting.

Final Classification: The output layer uses a Softmax activation function to produce a probability for each class (real or fake). The class with the highest probability is taken as the prediction. The workflow of the proposed system, which captures audio and video features with the multimodal model, is shown in figure 3.
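
The fusion and classification stage might be sketched as below. The hidden layer sizes (256 and 64) and the dropout rate are assumptions, since the text does not state them, and `FusionClassifier` is a hypothetical name.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Concatenate video and audio features, then classify as real or fake."""

    def __init__(self, video_dim=512, audio_dim=256, hidden_dims=(256, 64),
                 n_classes=2, dropout=0.5):
        super().__init__()
        h1, h2 = hidden_dims  # hidden sizes are assumptions, not from the text
        self.net = nn.Sequential(
            nn.Linear(video_dim + audio_dim, h1), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h2, n_classes),
        )

    def forward(self, video_feat, audio_feat):
        fused = torch.cat([video_feat, audio_feat], dim=1)  # (batch, 768)
        return self.net(fused)                              # raw logits

    def predict(self, video_feat, audio_feat):
        """Return softmax probabilities and the index of the most likely class."""
        probs = torch.softmax(self.forward(video_feat, audio_feat), dim=1)
        return probs, probs.argmax(dim=1)


classifier = FusionClassifier().eval()  # eval() disables dropout for inference
with torch.no_grad():
    probs, predicted = classifier.predict(torch.randn(4, 512), torch.randn(4, 256))
print(probs.shape, predicted.shape)  # torch.Size([4, 2]) torch.Size([4])
```

The module returns raw logits from `forward` and applies Softmax only in `predict`, because PyTorch's cross-entropy loss applies log-softmax internally during training.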

Training Parameters

The proposed multimodal deep learning model for deepfake detection combines audio and video modalities, using separate preprocessing and architectures for each stream. Video data is resized and normalised; spatial features are then extracted by a CNN [21]-[22] and temporal relationships are modelled by an LSTM network [22]. Similarly, audio data is preprocessed to extract MFCC features, which are analysed by an audio-based CNN and modelled sequentially with an LSTM. The output vectors from both modalities are merged in a joint representation layer and passed through fully connected layers, and a Softmax layer then classifies the input as genuine or deepfake.

Mixed Precision Training: Mixed precision training is employed using PyTorch's torch.cuda.amp, which speeds up computation while maintaining accuracy. The model was trained for 16 epochs with a batch size of 8, balancing effective learning against memory use.

Optimizer: The Adam optimizer adapts the learning rate for each parameter, which supports smooth convergence.

Loss Function: The model uses categorical cross-entropy, a commonly used loss for multi-class classification tasks. It compares the predicted class probabilities with the true labels and, being logarithmic, heavily penalises confident wrong predictions. Minimising it improves the model's ability to distinguish real from fake content, since the weights are adjusted in response to each error. The loss value drops as the model converges, indicating better performance.
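
A single training step with Adam, cross-entropy, and `torch.cuda.amp` could look like the sketch below. The linear stand-in model, the learning rate, and the synthetic batch are assumptions made so that the example runs on its own; mixed precision is enabled only when a GPU is available. Recent PyTorch releases also expose the same functionality as `torch.amp` and may emit a deprecation warning for the `torch.cuda.amp` names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # mixed precision is only enabled on a GPU

model = nn.Linear(768, 2).to(device)  # stand-in for the full multimodal model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class labels
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(features, labels):
    """Run one optimisation step and return the batch loss."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(features), labels)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscale gradients, then optimizer.step()
    scaler.update()
    return loss.item()


# One synthetic batch of 8 samples (the batch size from the text), reused 16 times.
features = torch.randn(8, 768, device=device)
labels = torch.randint(0, 2, (8,), device=device)
losses = [train_step(features, labels) for _ in range(16)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In a real run, the loop would iterate over a `DataLoader` for each of the 16 epochs instead of reusing one batch.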

RESULTS

Accuracy: Accuracy improved steadily over the epochs, reaching 79.58% on the training set and 78.22% on the test set, as shown in figure 4.

Loss: The training loss decreased consistently, showing that the model learns patterns from the data effectively. The validation loss exhibited modest oscillations but converged overall; it is low in the early epochs, indicating decent generalization initially. The training and validation losses are shown in figure 5.

The proposed model achieves 93% to 95% accuracy on real videos and 65% to 67% on fake videos. Figure 8 shows how well real and fake data are distinguished in the different categories of the custom dataset prepared for this study. Across all categories, detection accuracy is consistently higher for real data (93.0% overall) than for fake data (67.4%).
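
Per-class figures like these are computed separately for each true label, which is why they can differ sharply from the overall accuracy. The sketch below shows the calculation on a tiny made-up example; the labels are illustrative only and are not the study's data.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true label."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Toy labels for illustration only.
y_true = ["real", "real", "real", "real", "fake", "fake", "fake", "fake"]
y_pred = ["real", "real", "real", "fake", "fake", "fake", "real", "real"]
print(per_class_accuracy(y_true, y_pred))  # {'real': 0.75, 'fake': 0.5}
```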
