Guide To Real-Time Face-To-Face Translation Using LipSync GANs

Face-to-face translation is plagued by the problem of out-of-sync lips; LipGAN and Wav2Lip aim to solve this lip-sync issue.

State-of-the-art Neural Machine Translation systems have become increasingly competent at automatically translating natural languages. These systems have not only become formidable in plain text-to-text translation tasks but have also made a considerable leap in speech-to-speech translation. With the development of such systems, we are getting closer and closer to overcoming language barriers. However, there is still a medium these systems need to tackle: videos. As far as videos are concerned, we are still stuck with transcripts, subtitles, and manual dubs. The translation systems that do exist can only translate audiovisual content at the speech-to-speech level. This creates two flaws: the translated voice sounds significantly different from the original speaker's, and the generated audio and the lip movements are unsynchronized.

In their paper “Towards Automatic Face-to-Face Translation”, Prajwal K R et al. tackle both these issues. They propose a new model, LipGAN, that generates realistic talking-face videos across languages. And to work around the issue of personalizing the speaker’s voice, they make use of the CycleGAN architecture.

Pipeline for Face-to-Face Translation


In the very first phase of the pipeline, the DeepSpeech 2 Automatic Speech Recognition (ASR) model is used to transcribe the audio. To translate the text from language A to language B, the Transformer-Base model available in fairseq-py is re-implemented, training a multiway model whose parameters are shared across several languages – Hindi, English, Telugu, Malayalam, Tamil, and Urdu – to maximize learning.

DeepVoice 3 is employed for the text-to-speech (TTS) conversion, but this model only generates audio in a single voice. A CycleGAN model trained on about 10 minutes of the target speaker’s audio is used to personalize the output to match the target speaker’s voice.

This personalized audio is passed to the lip-sync GAN, LipGAN, along with the frames from the original video.
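As a rough illustration, the whole pipeline can be sketched as a chain of stages. Every function below is a hypothetical stub standing in for a real model (transcribe → DeepSpeech 2, translate → the multiway Transformer, synthesize → DeepVoice 3, personalize → the CycleGAN voice converter, lip_sync → LipGAN), not the authors' actual code:

```python
# Stub sketch of the face-to-face translation pipeline described above.

def transcribe(audio):
    return "hello"                                 # ASR output (stub)

def translate(text, src, tgt):
    return f"{text} ({src}->{tgt})"                # NMT output (stub)

def synthesize(text):
    return {"voice": "generic", "text": text}      # TTS in a single generic voice

def personalize(speech, speaker):
    return {**speech, "voice": speaker}            # match the target speaker's voice

def lip_sync(frames, speech):
    return [(frame, speech["text"]) for frame in frames]  # lip-synced frames

def face_to_face_translate(frames, audio, src, tgt, speaker):
    text = transcribe(audio)
    speech = personalize(synthesize(translate(text, src, tgt)), speaker)
    return lip_sync(frames, speech)
```

The point of the sketch is only the data flow: text is extracted, translated, re-voiced, and finally used to drive the lip-sync model together with the original frames.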


LipGAN architecture used in Face-to-Face Translation

The LipGAN generator network contains three branches:

  1. Face Encoder
    The encoder consists of residual blocks with intermediate down-sampling layers. Instead of passing a face image of a random pose and its corresponding audio segment to the generator, the LipGAN model inputs the target face with the bottom-half masked to act as a pose prior. This allows the generated face crops to be seamlessly pasted back into the original video without further post-processing.
  2. Audio Encoder
LipGAN uses a standard CNN that takes a Mel-frequency cepstral coefficient (MFCC) heatmap of the audio segment as input.
  3. Face Decoder
    This branch takes the concatenated audio and face embeddings and creates a lip-synchronized face by inpainting the masked region of the input image with an appropriate mouth shape. It contains a series of residual blocks with a few intermediate deconvolutional layers that upsample the feature maps. The output layer of the decoder is a sigmoid activated 1×1 convolutional layer with 3 filters. 
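The pose-prior masking used by the face encoder (branch 1 above) can be illustrated with a short numpy sketch. The channel-wise concatenation with an unmasked reference face of the same identity is a simplifying assumption modeled on this family of models, not code from the paper:

```python
import numpy as np

# The target face has its bottom half zeroed out so the network must
# inpaint the mouth region, while the visible top half fixes the pose.

def build_face_input(target_face, reference_face):
    h = target_face.shape[0]
    masked = target_face.copy()
    masked[h // 2:] = 0.0                                  # mask the lower half
    return np.concatenate([masked, reference_face], axis=-1)  # (H, W, 6)
```

Because only the masked crop is regenerated, the result can be pasted back into the original frame without post-processing, as the article notes.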

The generator is trained to minimize the L1 reconstruction loss between the generated frames and the ground-truth frames.
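In code, the L1 reconstruction term is just the mean absolute pixel difference (a minimal numpy sketch):

```python
import numpy as np

def l1_reconstruction_loss(generated, ground_truth):
    # Mean absolute difference between generated and ground-truth frames
    return float(np.mean(np.abs(generated - ground_truth)))
```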

The discriminator network contains the same audio and face encoders as the generator network. It learns to detect synchronization by minimizing a contrastive loss between the audio and face embeddings.
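That contrastive loss has the standard margin-based form, sketched below. Here d is the L2 distance between the face and audio embeddings, y = 1 for an in-sync pair and 0 otherwise; the margin value is an illustrative assumption:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    # In-sync pairs (y=1) are pulled together; out-of-sync pairs (y=0)
    # are pushed at least `margin` apart
    d, y = np.asarray(d, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2))
```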


Since the “Towards Automatic Face-to-Face Translation” paper, the authors have come up with a better lip-sync model, Wav2Lip. The significant difference between the two is the discriminator: Wav2Lip uses a pre-trained lip-sync expert combined with a visual quality discriminator.

The expert lip-sync discriminator is a modified, deeper SyncNet with residual connections, trained on color images. It computes the dot product between the ReLU-activated video and speech embeddings, which yields the probability of the input audio-video pair being in sync.
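That dot-product computation can be sketched as follows (v and s are the ReLU-activated video and speech embeddings; eps guards against division by zero):

```python
import numpy as np

def sync_probability(v, s, eps=1e-8):
    # Normalized dot product of non-negative (ReLU-activated) embeddings,
    # so the result lies in [0, 1] and can be read as a probability
    v, s = np.asarray(v, dtype=float), np.asarray(s, dtype=float)
    return float(v @ s / max(np.linalg.norm(v) * np.linalg.norm(s), eps))
```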

Along with the L1 reconstruction loss, the Wav2Lip generator is also trained to minimize the expert sync loss.
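The expert sync loss is then the average negative log of the per-sample sync probabilities (a sketch):

```python
import numpy as np

def expert_sync_loss(sync_probs):
    # Higher sync probability -> lower loss; perfectly synced (p=1) -> 0
    return float(-np.mean(np.log(np.asarray(sync_probs, dtype=float))))
```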

The visual quality discriminator consists of a stack of convolutional blocks, each a convolutional layer followed by a leaky ReLU activation. It is trained to minimize a standard adversarial objective that penalizes unrealistic generated faces.
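In conventional GAN notation (a sketch only; the paper's exact notation and sign conventions may differ), the visual quality discriminator pushes its score toward 1 on real frames and toward 0 on generated ones:

```python
import numpy as np

def visual_quality_disc_loss(d_real, d_fake):
    # Binary cross-entropy style objective: real frames labeled 1,
    # generated frames labeled 0
    d_real, d_fake = np.asarray(d_real, dtype=float), np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))
```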

Combining everything, the generator minimizes the weighted sum of the reconstruction (L1) loss, the synchronization loss (expert sync loss), and the adversarial loss L_gen.
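The combined objective is a simple weighted sum, sketched below with sw (sync weight) and sg (adversarial weight) as placeholder names; the default values here are illustrative assumptions, not taken from the article:

```python
def total_generator_loss(l1, sync, adv, sw=0.03, sg=0.07):
    # Weighted sum of the reconstruction, expert-sync, and adversarial terms
    return (1.0 - sw - sg) * l1 + sw * sync + sg * adv
```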

Speech to Lip Generation using Wav2Lip

  1. Install ffmpeg
    sudo apt-get install ffmpeg
  2. Create a new environment using either conda or venv
    conda create --name myenv or python3 -m venv myenv
  3. Clone the Wav2Lip repository
    git clone https://github.com/Rudrabha/Wav2Lip.git

  4. Move inside the Wav2Lip directory and install the necessary modules from the requirements.txt file
    cd Wav2Lip
    pip install -r requirements.txt

  5. Download the pre-trained GAN model (linked in the Wav2Lip repository’s README) and move it into the “Wav2Lip/checkpoints/” folder
  6. Download the face detection model, put it in the “face_detection/detection/sfd/” folder, and rename it to “s3fd.pth”

    wget "" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"

  7. For the speech-to-lip generation to work, it needs a video/image of the target face and a video/audio file containing the raw audio.

    python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "input.jpg" --audio "input.mp4"

    By default, the output video file named “result_voice.mp4” will be stored in the results folder; you can change this using the --outfile argument.
Input image and audio/video for face-to-face translation
Output video of the face-to-face translation process


Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
