State-of-the-art neural machine translation (NMT) systems have become increasingly competent at automatically translating natural languages. They are formidable not only at plain text-to-text translation but have also made a considerable leap in speech-to-speech translation. With such systems, we are getting closer to overcoming language barriers. However, there is still one medium these systems need to tackle: video. For videos, we are still stuck with transcripts, subtitles, and manual dubs, and the translation systems that do exist can only translate audiovisual content at the speech-to-speech level. This creates two flaws: the translated voice sounds significantly different from the original speaker, and the generated audio is out of sync with the lip movements.
In their paper “Towards Automatic Face-to-Face Translation”, Prajwal K R et al. tackle both of these issues. They propose a new model, LipGAN, that generates realistic talking-face videos across languages. To work around the issue of personalizing the speaker’s voice, they make use of the CycleGAN architecture.
Pipeline for Face-to-Face Translation
In the first phase of the pipeline, the DeepSpeech 2 automatic speech recognition (ASR) model is used to transcribe the audio. To translate the text from language A to language B, the authors re-implement the Transformer-Base model available in fairseq-py and train it as a multiway model to maximize learning. The trained model has parameters that are shared across seven languages, including Hindi, English, Telugu, Malayalam, Tamil, and Urdu.
DeepVoice 3 is employed for the text-to-speech (TTS) conversion, but this model generates audio in only one voice. A CycleGAN model trained on 10 minutes of the target speaker’s audio is therefore used to personalize the audio so that it matches the target speaker’s voice.
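The speech side of the pipeline is a chain of stages, each consuming the previous stage's output. The sketch below only illustrates that chaining; the stage functions are hypothetical stand-ins that tag a string, not calls into DeepSpeech 2, fairseq, DeepVoice 3, or CycleGAN.

```python
from typing import Callable, Sequence

# A stage is any callable that maps one intermediate result to the next.
Stage = Callable[[object], object]


def run_pipeline(source, stages: Sequence[Stage]):
    """Feed the output of each stage into the next one, in order."""
    result = source
    for stage in stages:
        result = stage(result)
    return result


# Hypothetical stand-ins so the sketch runs end to end.
def transcribe(audio):      # ASR: speech in language A -> text in A
    return f"text({audio})"


def translate(text):        # NMT: text in A -> text in B
    return f"translated({text})"


def synthesize(text):       # TTS: text in B -> speech in a generic voice
    return f"speech({text})"


def personalize(speech):    # voice conversion towards the target speaker
    return f"voiced({speech})"


translated_speech = run_pipeline(
    "audio_A", [transcribe, translate, synthesize, personalize]
)
print(translated_speech)
```

The personalized speech produced at the end of this chain, together with the original video frames, is what the lip-sync model consumes.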
The LipGAN generator network contains three branches:
- Face Encoder
The encoder consists of residual blocks with intermediate down-sampling layers. Instead of passing a face image of a random pose and its corresponding audio segment to the generator, the LipGAN model inputs the target face with the bottom-half masked to act as a pose prior. This allows the generated face crops to be seamlessly pasted back into the original video without further post-processing.
- Audio Encoder
The audio encoder is a standard CNN that takes a Mel-frequency cepstral coefficient (MFCC) heatmap as input.
- Face Decoder
This branch takes the concatenated audio and face embeddings and creates a lip-synchronized face by inpainting the masked region of the input image with an appropriate mouth shape. It contains a series of residual blocks with a few intermediate deconvolutional layers that upsample the feature maps. The output layer of the decoder is a sigmoid-activated 1×1 convolutional layer with 3 filters.
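To make the pose prior concrete, here is a minimal sketch of masking the bottom half of a face crop. It uses nested Python lists to stay dependency-free; filling the masked rows with zeros and splitting exactly at the vertical midpoint are assumptions made for illustration, and `mask_lower_half` is a hypothetical helper name.

```python
def mask_lower_half(face, fill=0):
    """Return a copy of `face` (a list of pixel rows) with the bottom half blanked.

    Rows above the midpoint are copied unchanged; rows from the midpoint
    downwards are replaced by `fill`, hiding the mouth region.
    """
    cutoff = len(face) // 2
    return [
        list(row) if y < cutoff else [fill] * len(row)
        for y, row in enumerate(face)
    ]


face = [[1, 2], [3, 4], [5, 6], [7, 8]]   # toy 4x2 "image"
masked = mask_lower_half(face)
print(masked)
```

The upper half still carries the head pose and identity, so the decoder only has to inpaint the blanked rows.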
The generator is trained to minimize the L1 reconstruction loss between the generated frames and the ground-truth frames.
The discriminator network contains the same audio and face encoders as the generator network. It learns to detect synchronization by minimizing a contrastive loss.
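As an illustration, an L1 reconstruction loss is the mean absolute difference between corresponding pixel values. The sketch below assumes the frames have been flattened into equally sized lists of floats; `l1_reconstruction_loss` is a hypothetical helper, not code from the LipGAN repository.

```python
def l1_reconstruction_loss(generated, ground_truth):
    """Mean absolute difference between two equally sized flat pixel lists."""
    if not generated or len(generated) != len(ground_truth):
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(g - t) for g, t in zip(generated, ground_truth))
    return total / len(generated)


loss = l1_reconstruction_loss([0.2, 0.5, 0.9], [0.0, 0.5, 1.0])
print(round(loss, 4))
```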
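As a rough illustration, assume the discriminator's objective is a standard contrastive loss over the distance d between the face and audio embeddings, with label y = 1 for synchronized pairs and y = 0 for unsynchronized ones. The margin value and the function name below are assumptions made for this sketch.

```python
def contrastive_loss(distances, labels, margin=1.0):
    """Standard contrastive loss averaged over N pairs.

    Synced pairs (y = 1) are penalised by d^2, pulling embeddings together;
    unsynced pairs (y = 0) are penalised by max(0, margin - d)^2, pushing
    them at least `margin` apart.
    """
    if not distances or len(distances) != len(labels):
        raise ValueError("need one label per distance")
    total = 0.0
    for d, y in zip(distances, labels):
        total += y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2
    return total / (2 * len(distances))


loss = contrastive_loss([0.5, 0.2], [1, 0], margin=1.0)
print(round(loss, 4))
```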
Since the “Towards Automatic Face-to-Face Translation” paper, the authors have come up with a better lip-sync model, Wav2Lip. The most significant difference between the two is the discriminator: Wav2Lip uses a pre-trained lip-sync expert combined with a visual quality discriminator.
The expert lip-sync discriminator is a modified, deeper SyncNet with residual connections, trained on color images. It computes a normalized dot product (cosine similarity) between the ReLU-activated video and speech embeddings. Because both embeddings are non-negative, the result lies between 0 and 1 and is treated as the probability that the input audio-video pair is in sync.
Along with the L1 reconstruction loss, the Wav2Lip generator is also trained to minimize the expert sync loss.
The visual quality discriminator consists of a stack of convolutional blocks. Each block consists of a convolutional layer followed by a leaky ReLU activation. It is trained with a standard GAN objective to tell ground-truth frames apart from generated ones.
Combining everything, the generator minimizes the weighted sum of the reconstruction (L1) loss, the synchronization loss (expert sync loss), and the adversarial loss L_gen.
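A minimal sketch of that computation, assuming the dot product is normalized by the embedding norms (cosine similarity) so that the result stays between 0 and 1. The `eps` guard against division by zero and the function names are illustrative choices.

```python
import math


def relu(values):
    """Clamp negative entries to zero."""
    return [max(0.0, x) for x in values]


def sync_probability(video_emb, speech_emb, eps=1e-8):
    """Cosine similarity of the ReLU-activated embeddings, in [0, 1]."""
    v, s = relu(video_emb), relu(speech_emb)
    dot = sum(a * b for a, b in zip(v, s))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in s))
    return dot / max(norm, eps)


p_same = sync_probability([1.0, 2.0, -3.0], [2.0, 4.0, -1.0])  # parallel after ReLU
p_orth = sync_probability([1.0, 0.0], [0.0, 3.0])              # orthogonal
print(round(p_same, 4), p_orth)
```

Embeddings pointing in the same direction score close to 1, while orthogonal ones score 0.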
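One way to turn per-sample sync probabilities into a loss is the mean negative log-likelihood, which is what the sketch below assumes for the expert sync loss. The clamp at `eps` is an illustrative guard against log(0).

```python
import math


def expert_sync_loss(sync_probs, eps=1e-8):
    """Mean of -log(p) over the batch; small when the expert is confident of sync."""
    if not sync_probs:
        raise ValueError("need at least one probability")
    return sum(-math.log(max(p, eps)) for p in sync_probs) / len(sync_probs)


loss = expert_sync_loss([1.0, 0.5])
print(round(loss, 4))
```

A generator whose frames the expert scores near 1 drives this loss towards 0.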
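As a sketch of a standard GAN discriminator objective, the binary cross-entropy below labels ground-truth frames as 1 and generated frames as 0. Minimizing it is equivalent to maximizing log D(real) + log(1 - D(fake)). Whether Wav2Lip states its objective with exactly this sign convention is an assumption here.

```python
import math


def discriminator_loss(real_scores, fake_scores, eps=1e-8):
    """Binary cross-entropy over discriminator outputs in [0, 1].

    real_scores: D(x) for ground-truth frames (target label 1).
    fake_scores: D(x) for generated frames (target label 0).
    """
    real_term = sum(-math.log(max(d, eps)) for d in real_scores) / len(real_scores)
    fake_term = sum(-math.log(max(1.0 - d, eps)) for d in fake_scores) / len(fake_scores)
    return real_term + fake_term


undecided = discriminator_loss([0.5], [0.5])   # discriminator cannot tell them apart
perfect = discriminator_loss([1.0], [0.0])     # discriminator is always right
print(round(undecided, 4), round(perfect, 4))
```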
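The weighted sum can be sketched as follows. The scheme in which the three weights add up to 1, and the default weight values, are illustrative assumptions and are not taken from this article.

```python
def total_generator_loss(l_recon, l_sync, l_gen, sync_weight=0.03, adv_weight=0.07):
    """Weighted sum of reconstruction, sync, and adversarial losses.

    The reconstruction term receives whatever weight is left over, so the
    three weights always sum to 1.
    """
    recon_weight = 1.0 - sync_weight - adv_weight
    return recon_weight * l_recon + sync_weight * l_sync + adv_weight * l_gen


total = total_generator_loss(0.2, 0.5, 0.1, sync_weight=0.1, adv_weight=0.1)
print(round(total, 4))
```

Keeping the reconstruction weight dominant means the sync and adversarial terms act as corrections rather than overriding pixel-level fidelity.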
Speech to Lip Generation using Wav2Lip
- Install ffmpeg
sudo apt-get install ffmpeg
- Create a new environment using either conda or venv
conda create --name myenv
or
python3 -m venv myenv
- Clone the Wav2Lip repository
git clone https://github.com/Rudrabha/Wav2Lip.git
- Move inside the Wav2Lip directory and install the necessary modules from the requirements.txt file
pip install -r requirements.txt
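- Download the pre-trained GAN model (wav2lip_gan.pth, linked from the repository's README) and move it into the “Wav2Lip/checkpoints/” folder
- Download the face detection model and save it as “s3fd.pth” in the “face_detection/detection/sfd/” folder. The command below assumes it is run from the directory that contains the Wav2Lip clone; drop the leading “Wav2Lip/” from the output path if you are already inside it.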
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"
- Speech-to-lip generation needs a video or image of the target face and a video or audio file containing the speech. Run the following from inside the Wav2Lip directory:
python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "input.jpg" --audio "input.mp4"
By default, the output video file, named “result_voice.mp4”, is stored in the results folder. You can change this using the --outfile argument.
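If you want to drive the inference step from Python, a small helper can assemble the same command line. This is a sketch: `build_inference_command` is a hypothetical helper, it uses only the flags shown above, and it builds the command without running it.

```python
import shlex


def build_inference_command(face, audio,
                            checkpoint="checkpoints/wav2lip_gan.pth",
                            outfile=None):
    """Return the argument list for Wav2Lip's inference.py."""
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face,
        "--audio", audio,
    ]
    if outfile is not None:
        # Only override the default results/result_voice.mp4 when asked to.
        cmd += ["--outfile", outfile]
    return cmd


cmd = build_inference_command("input.jpg", "input.mp4",
                              outfile="results/translated.mp4")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="Wav2Lip", check=True)`, assuming the repository was cloned into a folder named Wav2Lip in the current directory.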