Google Used Fréchet Distance To Assess AI-Generated Audio & Video Quality

Ambika Choudhury

From deepfake faces to fake videos, artificial neural networks (ANNs) have reached new heights and opened up new directions in emerging technologies. When building a deep ANN, it is important to pair the right metrics with the right dataset to obtain a robust model.

Generative adversarial networks (GANs) are one of the most popular methods for generating images, and the Fréchet Inception Distance (FID) is the most popular metric for evaluating them. This metric addresses some of the pitfalls of evaluating GANs and is designed specifically for images.

The Fréchet Inception Distance (FID) works by taking a large number of images from both the target distribution and the generative model, then using the Inception object-recognition network to embed each image into a lower-dimensional space that captures its important features. Each set of embeddings is summarised by a multivariate Gaussian, and the Fréchet distance between the two Gaussians gives a quantitative measure of how similar the two distributions are.
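
To make the computation concrete, here is a minimal Python sketch of the Fréchet distance between two sets of embeddings. It assumes, as FID does, that each set is summarised by a Gaussian with mean μ and covariance Σ, so the squared distance is ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^(1/2)). The function name `frechet_distance` and the random arrays standing in for embeddings are illustrative assumptions; a real evaluation would use embeddings from a pre-trained network, and this is not the code Google released.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Squared Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (num_samples, embedding_dim), embedding_dim >= 2.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    # Small negative or imaginary parts from rounding are clipped away.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Hypothetical stand-ins for embeddings of real and generated samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
generated = real + 3.0  # same spread, mean shifted by 3 in each of 4 dimensions

same_score = frechet_distance(real, real)
shifted_score = frechet_distance(real, generated)
print(same_score, shifted_score)
```

Comparing a set with itself gives a score close to zero, while the shifted set scores about 36 (four dimensions times a squared shift of 3²), since only the mean term contributes. Lower scores therefore indicate distributions that are closer together.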

According to the researchers at the tech giant, access to robust metrics for evaluating generative models is crucial for measuring (and making) progress in audio and video understanding, yet until now no such metrics existed. Recently, the researchers at Google proposed two new metrics for audio and video that build on the principles of FID: Fréchet Video Distance (FVD) and Fréchet Audio Distance (FAD), which measure the quality of synthesised video and audio respectively.


Figure: The key component for both metrics is a pre-trained model that converts the video or audio clip into an N-dimensional embedding. (Source)

Fréchet Video Distance (FVD)

Fréchet Video Distance (FVD) is a new metric for generative models of video that correlates well with qualitative human judgment of generated videos. It can be used in settings such as unconditional video generation with generative adversarial networks. Several features make this metric better than existing ones:

  • FVD is sensitive to both temporal and frame-level perturbations.
  • It agrees well with qualitative human judgment of generated videos.
  • It accurately evaluates videos that were modified to include static noise and temporal noise.

FVD thus avoids the drawbacks of the frame-level metrics that are common among existing approaches to video evaluation.
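
The limitation of frame-level metrics can be illustrated with a toy example. The sketch below is only an illustration and not how FVD is computed: a video is a NumPy array of frames, `frame_level_stats` is a hypothetical statistic that looks at each frame in isolation, and `temporal_roughness` is a hypothetical statistic that looks at consecutive frames. Reordering the frames leaves the first unchanged but changes the second.

```python
import numpy as np


def frame_level_stats(video):
    """Mean and variance of per-frame brightness; ignores frame order."""
    per_frame = video.mean(axis=(1, 2))
    return per_frame.mean(), per_frame.var()


def temporal_roughness(video):
    """Mean absolute change between consecutive frames; depends on order."""
    return float(np.abs(np.diff(video, axis=0)).mean())


# Toy clip: 16 frames of 8x8 pixels that brighten smoothly over time.
num_frames = 16
brightness = np.linspace(0.0, 1.0, num_frames)
video = brightness[:, None, None] * np.ones((num_frames, 8, 8))

# Temporal perturbation: play even-indexed frames first, then odd-indexed ones.
order = np.concatenate([np.arange(0, num_frames, 2), np.arange(1, num_frames, 2)])
scrambled = video[order]

print(frame_level_stats(video), frame_level_stats(scrambled))
print(temporal_roughness(video), temporal_roughness(scrambled))
```

Because the scrambled clip contains exactly the same frames, any statistic computed frame by frame cannot tell the two clips apart, whereas a statistic that spans several frames can. This is the kind of temporal perturbation a video metric needs to be sensitive to.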

Image: Examples of videos of a robot arm, judged by the new FVD metric. (Source)

Fréchet Audio Distance (FAD)

Fréchet Audio Distance (FAD) is a reference-free evaluation metric for music enhancement algorithms which is designed to measure how a given audio clip compares to clean, studio-recorded music. It compares statistics computed on a set of reconstructed music clips to background statistics computed on a large set of studio-recorded music. 

FAD differs from existing audio-quality metrics, which either require a time-aligned ground-truth signal, as source-to-distortion ratio (SDR) does, or target only a specific domain such as speech quality. FAD, by contrast, is reference-free and can be used on any type of audio.

To evaluate the two metrics, the researchers at Google performed a large-scale human study to determine how well they align with qualitative human judgment of generated audio and video.


In recent years, Google has made several breakthroughs in intelligent platforms. The tech giant has also released the source code for both Fréchet Video Distance and Fréchet Audio Distance on GitHub. According to its researchers, FAD and FVD will help keep this progress measurable and help improve models for audio and video generation.



Copyright Analytics India Magazine Pvt Ltd