
New AI Algorithm Can Now Guess What You Look Like Based Just On Your Voice


In a bizarre-sounding experiment that walks a tightrope between Orwellian voyeurism and ingenious innovation, researchers at MIT have come up with an algorithm that can listen to a voice and guess the face of the speaker with decent accuracy.

Picking up information such as gender, race or culture from social cues like speech or song is something humans have done subconsciously throughout their evolutionary past. We easily recognise a person's voice, whether over a wireless connection or from behind a wall. If the voice is familiar, we can imagine the speaker's face; if not, we can at least infer from the pitch whether it belongs to a man or a woman.

Now, imagine machines doing the same. It is eerie and exciting at the same time.

The authors of this paper trained a neural network using millions of videos from the internet.

“During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly,” wrote the authors in their paper titled Speech2Face: Learning the Face Behind a Voice.

The picture below contains the speaker’s image in the first column followed by the results of the model.

Results illustrating the accuracy of the experiment (via the Speech2Face paper)

How Does The Speech2Face Model Work

Regressing from input speech directly to image pixels is harder than it sounds: such a model would have to learn to factor out many irrelevant variations in the data and implicitly extract a meaningful internal representation of faces.

To sidestep these challenges, the researchers train their model to regress to a low-dimensional intermediate representation of the face, obtained using the VGG-Face model.

The Speech2Face pipeline (via the paper)

The Speech2Face pipeline consists of two main components (sketched in code below):

  1. a voice encoder, which takes a complex spectrogram of speech as input and predicts a low-dimensional face feature corresponding to the associated face; and
  2. a face decoder, which takes the face feature as input and produces an image of the face in canonical form (frontal-facing, with a neutral expression).
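
At inference time the two parts simply compose. Here is a minimal sketch of that composition, with hypothetical function names standing in for the paper's trained modules, not its released code:

```python
# Minimal sketch of how the two Speech2Face components compose at inference
# time. `voice_encoder` and `face_decoder` are placeholders for the paper's
# trained modules.
def speech2face(spectrogram, voice_encoder, face_decoder):
    face_feature = voice_encoder(spectrogram)  # low-dimensional face feature
    return face_decoder(face_feature)          # canonical frontal, neutral-expression face
```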

During training, the face decoder is fixed, and only the voice encoder, which predicts the face feature, is trained. The face decoder itself is pre-trained separately, using a face normalisation model.

The voice encoder module is a convolutional neural network (CNN) that turns the spectrogram of a short speech segment into a pseudo-face feature, which is subsequently fed into the face decoder to reconstruct the face image.
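
A toy version of such an encoder, written in TensorFlow/Keras since the paper's implementation is in TensorFlow. The input shape, layer sizes and 4096-D output dimension are assumptions for illustration, not the paper's exact architecture:

```python
import tensorflow as tf

# Illustrative voice encoder: a small CNN that maps a complex spectrogram
# (split here into real/imaginary channels) to a pseudo-face feature vector.
# All shapes and layer sizes are assumptions for this sketch.
def build_voice_encoder(spec_shape=(598, 257, 2), feature_dim=4096):
    inputs = tf.keras.Input(shape=spec_shape)
    x = inputs
    for filters in (64, 128, 256, 512):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(feature_dim)(x)  # pseudo-face feature
    return tf.keras.Model(inputs, outputs, name="voice_encoder")
```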

This voice encoder is trained in a self-supervised manner, using the natural co-occurrence of a speaker's speech and facial images in videos.

Up to 6 seconds of audio is taken from the beginning of each video clip in AVSpeech. If a clip is shorter than 6 seconds, the audio is repeated until it is at least 6 seconds long.
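
A small NumPy sketch of that preparation step (the 16 kHz sample rate is an assumption, not stated in the article):

```python
import numpy as np

def prepare_audio(waveform, sample_rate=16000, target_seconds=6.0):
    """Take up to `target_seconds` from the start of a clip; if the clip is
    shorter, repeat it until it reaches that length. The sample rate is an
    assumption for this sketch."""
    target = int(sample_rate * target_seconds)
    if len(waveform) < target:
        repeats = -(-target // len(waveform))  # ceiling division
        waveform = np.tile(waveform, repeats)
    return waveform[:target]
```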

The resulting training and test sets include 1.7 million and 0.15 million spectra–face feature pairs, respectively. The whole network is implemented in TensorFlow and optimised with ADAM at a learning rate of 0.001.
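
A condensed sketch of that optimisation setup, reusing the `build_voice_encoder` sketch above and assuming the target VGG-Face features have been pre-extracted per clip; the plain L1 feature loss here is a simplification of the paper's full training objective:

```python
import tensorflow as tf

# Assumes build_voice_encoder from the earlier sketch and a data pipeline
# yielding (spectrogram, vgg_face_feature) pairs. Only the voice encoder
# is trained; the face decoder stays fixed.
voice_encoder = build_voice_encoder()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # ADAM, lr = 0.001

@tf.function
def train_step(spectrograms, target_features):
    with tf.GradientTape() as tape:
        predicted = voice_encoder(spectrograms, training=True)
        # L1 distance in face-feature space (a stand-in for the paper's loss)
        loss = tf.reduce_mean(tf.abs(predicted - target_features))
    grads = tape.gradient(loss, voice_encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, voice_encoder.trainable_variables))
    return loss
```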

Future Direction

The results show that the classifier outputs for age and gender are highly correlated between the true images and the reconstructions from speech. For gender, male/female labels agree 94% of the time. For ethnicity, there is good correlation on "white" and "Asian", but less agreement on "Indian" and "black".

The authors clearly state in their paper that this research is purely an academic investigation. Its implications nevertheless span a wide range, from eavesdropping and identifying speakers in remote locations to giving a voice to people with speech impairments by reverse-engineering their facial features. However, it may be some time before any of this becomes reality.

Read more about the work here

PS: The story was written using a keyboard.