
Guide To Real-Time Face-To-Face Translation Using LipSync GANs

Face-to-face translation is plagued by the problem of out-of-sync lips; LipGAN and Wav2Lip aim to solve this lip-sync issue.
Face-to-face translation using LipSync GANs

State-of-the-art Neural Machine Translation systems have become increasingly competent at automatically translating natural languages. These systems have not only become formidable at plain text-to-text translation but have also made a considerable leap in speech-to-speech translation. With the development of such systems, we are getting closer and closer to overcoming language barriers. However, there is still a medium these systems need to tackle: videos. As far as videos are concerned, we are still stuck with transcripts, subtitles, and manual dubs. The translation systems that do exist can only translate audiovisual content at the speech-to-speech level. This creates two flaws: the translated voice sounds significantly different from the original speaker, and the generated audio and the lip movements are unsynchronized.

In their paper “Towards Automatic Face-to-Face Translation”, Prajwal K R et al. tackle both of these issues. They propose a new model, LipGAN, that generates realistic talking-face videos across languages. To work around the issue of personalizing the speaker’s voice, they make use of the CycleGAN architecture.


Pipeline for Face-to-Face Translation


In the first phase of the pipeline, the DeepSpeech 2 Automatic Speech Recognition (ASR) model is used to transcribe the audio. To translate the text from language A to language B, the Transformer-Base model available in fairseq-py is re-implemented and trained as a multiway model to maximize learning. The trained model shares parameters across multiple languages, including Hindi, English, Telugu, Malayalam, Tamil, and Urdu.
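The translation step itself is standard Transformer-based NMT. As a rough illustration only (this is an off-the-shelf English-to-German fairseq model loaded through torch.hub, standing in for the paper’s multiway Indian-language model), the NMT stage looks roughly like this:

    # Illustrative stand-in for the NMT step: a pre-trained fairseq Transformer
    # loaded via torch.hub (requires fairseq, sacremoses, and fastBPE installed).
    import torch

    en2de = torch.hub.load(
        'pytorch/fairseq',
        'transformer.wmt19.en-de.single_model',
        tokenizer='moses',
        bpe='fastbpe',
    )
    print(en2de.translate('The weather is nice today.'))  # prints the German translation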

DeepVoice 3 is employed for text-to-speech (TTS) conversion, but this model generates audio in only one voice. A CycleGAN model trained on about 10 minutes of the target speaker’s audio is then used to personalize the generated speech to match the target speaker’s voice.

This personalized audio is passed to the lip-sync GAN, LipGAN, along with the frames from the original video.

LipGAN 

LipGAN architecture used in Face-to-Face Translation

The LipGAN generator network contains three branches (a rough PyTorch sketch follows the list):

  1. Face Encoder
    The encoder consists of residual blocks with intermediate down-sampling layers. Instead of passing a face image of a random pose and its corresponding audio segment to the generator, the LipGAN model inputs the target face with the bottom-half masked to act as a pose prior. This allows the generated face crops to be seamlessly pasted back into the original video without further post-processing.
  2. Audio Encoder
For the audio encoder, LipGAN uses a standard CNN that takes a Mel-frequency cepstral coefficient (MFCC) heatmap as input.
  3. Face Decoder
    This branch takes the concatenated audio and face embeddings and creates a lip-synchronized face by inpainting the masked region of the input image with an appropriate mouth shape. It contains a series of residual blocks with a few intermediate deconvolutional layers that upsample the feature maps. The output layer of the decoder is a sigmoid activated 1×1 convolutional layer with 3 filters. 
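
Putting the three branches together, a minimal PyTorch sketch of the generator could look like the following. The layer counts, channel widths, and input sizes here are illustrative assumptions, and details of the published architecture (exact filter sizes, skip connections between encoder and decoder, and so on) are omitted:

    # A rough, self-contained PyTorch sketch of the three-branch LipGAN generator.
    # All layer counts, channel widths, and input sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        def forward(self, x):
            return torch.relu(x + self.body(x))

    class LipGANGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            # Face encoder: residual blocks with intermediate down-sampling.
            # Input: the target face with the bottom half masked (pose prior).
            self.face_encoder = nn.Sequential(
                nn.Conv2d(3, 64, 7, stride=2, padding=3), ResBlock(64),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), ResBlock(128),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), ResBlock(256),
                nn.AdaptiveAvgPool2d(1))
            # Audio encoder: a standard CNN over the MFCC heatmap.
            self.audio_encoder = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            # Face decoder: deconvolutions that upsample the joint embedding,
            # ending in a sigmoid-activated 1x1 convolution with 3 filters.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 3, 1), nn.Sigmoid())
        def forward(self, masked_face, mfcc):
            f = self.face_encoder(masked_face)          # (B, 256, 1, 1)
            a = self.audio_encoder(mfcc)                # (B, 256, 1, 1)
            return self.decoder(torch.cat([f, a], 1))   # small lip-synced face crop

    # Smoke test with dummy tensors: a 96x96 face crop and a 12x35 MFCC window.
    out = LipGANGenerator()(torch.randn(2, 3, 96, 96), torch.randn(2, 1, 12, 35))
    print(out.shape)  # torch.Size([2, 3, 16, 16])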

The generator is trained to minimize the L1 reconstruction loss between the generated frames and the ground-truth frames.
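
Written out from the description above (with L_g denoting the generated frames and L_G the ground-truth frames, so the exact notation may differ from the paper), this is simply the mean absolute pixel error:

    L_{recon} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert L_g^{(i)} - L_G^{(i)} \right\rVert_1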

The discriminator network contains the same audio and face encoders as the generator network. It learns to detect synchronization by minimizing a contrastive loss between the audio and face embeddings:
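
The equation itself appears as an image in the original article; a standard margin-based contrastive loss of the kind described would look as follows, where d_i is the distance between the audio and face embeddings of pair i, y_i marks whether the pair is in sync, and m is the margin (the paper’s exact notation may differ):

    L_c = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \, d_i^2 + (1 - y_i) \, \max(m - d_i, 0)^2 \right]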

Wav2Lip 

Since the “Towards Automatic Face-to-Face Translation” paper, the authors have come up with a better lip-sync model, Wav2Lip. The significant difference between the two is the discriminator: Wav2Lip uses a pre-trained lip-sync expert combined with a visual quality discriminator.

The expert lip-sync discriminator is a modified, deeper SyncNet with residual connections, trained on color images. It computes the cosine similarity (a normalized dot product) between the ReLU-activated video and speech embeddings, which yields the probability of the input audio-video pair being in sync:
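
Reconstructed from the Wav2Lip paper’s description, with v the video embedding, s the speech embedding, and ε a small constant to avoid division by zero:

    P_{sync} = \frac{v \cdot s}{\max\left(\lVert v \rVert_2 \cdot \lVert s \rVert_2, \, \epsilon\right)}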

Along with the L1 reconstruction loss, the Wav2Lip generator is also trained to minimize the expert sync-loss.
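
Over a batch of N samples, this sync-loss is the negative log of the sync probability produced by the expert discriminator (again a reconstruction of the paper’s formulation, so symbols may differ):

    E_{sync} = \frac{1}{N} \sum_{i=1}^{N} -\log\left(P_{sync}^{(i)}\right)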

The visual quality discriminator consists of a stack of convolutional blocks. Each block consists of a convolutional layer followed by a leaky ReLU activation. It is trained to minimize the following objective function: 
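
In its minimization form this is the usual binary cross-entropy GAN objective, with L_G the real frames, L_g the generated frames, and D the visual quality discriminator (the paper writes the equivalent maximization form, so signs and notation may differ):

    L_{disc} = \mathbb{E}_{x \sim L_G}\left[-\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[-\log\left(1 - D(x)\right)\right]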

Combining everything, the generator minimizes a weighted sum of the reconstruction (L1) loss, the synchronization loss (expert sync-loss), and the adversarial loss L_gen.
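
As a reconstruction of the paper’s weighted sum, with s_w the sync-loss weight and s_g the adversarial-loss weight:

    L_{total} = (1 - s_w - s_g) \, L_{recon} + s_w \, E_{sync} + s_g \, L_{gen}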

Speech-to-Lip Generation using Wav2Lip

  1. Install ffmpeg
    sudo apt-get install ffmpeg
  2. Create a new environment using either conda or venv
    conda create --name myenv  or  python3 -m venv myenv
  3. Clone the Wav2Lip repository
    git clone https://github.com/Rudrabha/Wav2Lip.git

  4. Move inside the Wav2Lip directory and install the necessary modules from the requirements.txt file
    cd Wav2Lip
    pip install -r requirements.txt

  5. Download the pre-trained GAN model from here and move it into the “Wav2Lip/checkpoints/” folder
  6. Download the face detection model, put it in the “face_detection/detection/sfd/” folder, and rename it to “s3fd.pth”

    wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"


  7. For the speech-to-lip generation to work, it needs a video/image of the target face and a video/audio file containing the raw audio.

    python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "input.jpg" --audio "input.mp4"



    By default, the output video, named “result_voice.mp4”, will be stored in the results folder; you can change this with the --outfile argument. A small Python wrapper around this command is sketched after the example images below.
Input image and audio/video for face-to-face translation
Output video of the face-to-face translation process
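
If you want to drive the same command from Python, for instance to batch-process several clips, a thin wrapper around inference.py is enough. The file paths below are placeholders, and the script is assumed to be run from inside the Wav2Lip directory:

    # A thin wrapper around the inference command above, meant to be run from
    # inside the Wav2Lip directory; all file paths are illustrative placeholders.
    import subprocess

    def lip_sync(face_path: str, audio_path: str, out_path: str = "results/result_voice.mp4") -> None:
        subprocess.run(
            [
                "python", "inference.py",
                "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
                "--face", face_path,
                "--audio", audio_path,
                "--outfile", out_path,
            ],
            check=True,  # raise if inference.py exits with an error
        )

    # Example: lip_sync("input.jpg", "input.mp4")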

Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
