New AI Algorithm Can Now Guess What You Look Like Based Just On Your Voice

In a bizarre-sounding experiment that walks a tightrope between Orwellian voyeurism and ingenious innovation, researchers at MIT have come up with an algorithm that can listen to a voice and reconstruct the face of the speaker with decent accuracy.

Picking up information such as gender, race or culture from social cues like speech or song is something humans have done subconsciously throughout their evolutionary past. We can easily recognise a person's voice, whether it comes over a wireless connection or from behind a wall. If the voice is familiar, we can imagine the speaker's face; if not, we can at least guess from the pitch whether it belongs to a male or a female.

Now, imagine machines doing the same. It is eerie and exciting at the same time.


The authors of the paper trained a neural network using millions of videos from the internet.

“During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly,” wrote the authors in their paper titled Speech2Face: Learning the Face Behind a Voice.

The picture below shows the speaker's image in the first column, followed by the model's reconstructions.

Results illustrating the accuracy of the experiment via Speech2Face paper

How Does The Speech2Face Model Work

Regressing from input speech directly to image pixels is harder than it sounds, because such a model would have to factor out the many irrelevant variations in the data and implicitly extract a meaningful internal representation of faces.

To sidestep these challenges, the researchers instead train their model to regress to a low-dimensional intermediate representation of the face, obtained from the VGG-Face model.

Speech2Face pipeline via paper

Speech2Face pipeline consists of two main components:

  1. a voice encoder, which takes a complex spectrogram of speech as input, and predicts a low-dimensional face feature that would correspond to the associated face; and
  2. a face decoder, which takes as input the face features and produces an image of the face in a canonical form (frontal-facing and with neutral expression)
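The two-stage pipeline can be sketched in heavily simplified form as follows. This is an illustrative toy, not the paper's architecture: the dimensions are shrunk for readability, and each network is stood in for by a single random linear layer.

```python
import numpy as np

# Reduced dimensions for this sketch; the real model uses far larger ones
FACE_FEATURE_DIM = 128      # stand-in for the VGG-Face feature size
IMAGE_SHAPE = (16, 16, 3)   # stand-in for the decoded face image

def voice_encoder(spectrogram):
    """Stand-in for the CNN voice encoder: maps a (freq, time)
    spectrogram to a low-dimensional face feature vector."""
    rng = np.random.default_rng(0)  # fixed "weights" for reproducibility
    w = rng.standard_normal((spectrogram.size, FACE_FEATURE_DIM)) * 0.01
    return spectrogram.ravel() @ w  # one linear layer as a placeholder

def face_decoder(face_feature):
    """Stand-in for the fixed, pretrained face decoder: maps a face
    feature to a canonical (frontal, neutral-expression) face image."""
    rng = np.random.default_rng(1)
    w = rng.standard_normal((FACE_FEATURE_DIM, int(np.prod(IMAGE_SHAPE)))) * 0.01
    return (face_feature @ w).reshape(IMAGE_SHAPE)

spec = np.random.default_rng(2).standard_normal((32, 64))  # toy spectrogram
image = face_decoder(voice_encoder(spec))
print(image.shape)  # (16, 16, 3)
```

The point of the structure is that only the encoder's output space (the face feature) connects the two halves, which is what lets the decoder stay frozen during training.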

During training, the face decoder is fixed, and only the voice encoder, which predicts the face feature, is trained. The face decoder itself is built from a face normalisation model.

The voice encoder module is a convolutional neural network (CNN) that turns the spectrogram of a short input speech into a pseudo-face feature, which is subsequently fed into the face decoder to reconstruct the face image.

This voice encoder is trained in a self-supervised manner, using the natural co-occurrence of a speaker's speech and facial images in videos.

Up to 6 seconds of audio is taken from the beginning of each video clip in AVSpeech. If the video clip is shorter than 6 seconds, the audio is repeated until it is at least 6 seconds long.
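This preprocessing step can be sketched as below. The function name and the choice to trim every clip to exactly 6 seconds are my assumptions about a plausible implementation, not the paper's actual code.

```python
import numpy as np

def pad_to_min_duration(audio, sample_rate, min_seconds=6.0):
    """Repeat a waveform until it is at least `min_seconds` long,
    then trim it to exactly that length (an assumed reading of the
    paper's preprocessing)."""
    target = int(min_seconds * sample_rate)
    if audio.shape[0] >= target:
        return audio[:target]          # take only the first 6 seconds
    repeats = -(-target // audio.shape[0])  # ceiling division
    return np.tile(audio, repeats)[:target]

clip = np.ones(4 * 16000)              # a 4-second clip at 16 kHz
padded = pad_to_min_duration(clip, 16000)
print(padded.shape[0] / 16000)         # 6.0
```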

The resulting training and test sets include 1.7 million and 0.15 million spectrogram–face feature pairs, respectively. The whole network is implemented in TensorFlow and optimised with Adam at a learning rate of 0.001.
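For reference, the Adam update rule that such an optimiser applies can be written out explicitly. The sketch below uses the paper's stated learning rate of 0.001; the remaining hyperparameters (beta1, beta2, eps) are the standard Adam defaults, which the paper does not specify.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient
    and its square, bias correction, then the parameter step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)         # bias-corrected first moment
    v_hat = v / (1 - beta2**t)         # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(x) = x^2 starting from x = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # close to the minimum at 0
```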

Future Direction

The results show that the age and gender of the reconstructions correlate highly with those of the true images. For gender, the male/female labels of the true images and the reconstructions from speech agree 94% of the time. For ethnicity, there is good agreement on the "white" and "Asian" categories, but less agreement on "Indian" and "black".

The authors clearly state in their paper that this research is purely an academic investigation. Its potential implications nevertheless span a wide range, from eavesdropping and identifying speakers in remote locations to giving a voice to those with speech impediments by reverse engineering their facial features. However, it may be some time before any of this becomes reality.

Read more about the work here


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
