Ever been stuck with a tune you couldn’t name? We’ve all been there. It’s called an ‘earworm’, and it doesn’t go away until we listen to the song again. The frustration of the faint memory drives people to all kinds of tricks. One is to hum the tune to people close to us in the hope that they can name the song. Researchers at Google have been working on this very aspect of search for a while now.
Last month they rolled out a feature on their search engine that lets people hum a tune and find the matching song. Humming is imperfect; even friends familiar with your voice and your music taste can take a while to identify a song. So, how does Google do it? The answer is machine learning.
How It Works
- Hum a tune into Google Search.
- ML models transform this audio into a number-based sequence that represents the song’s melody.
- Models are trained to identify these music tracks based on sources such as humans singing, whistling or humming, as well as studio recordings.
- These number-based sequences are compared to thousands of songs from around the world to identify potential matches in real time.
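The steps above can be sketched in miniature. Everything here is made up for illustration: the pitch contours, the crude mean-centred semitone “embedding”, and the two-song reference database stand in for the learned representations the article describes.

```python
import numpy as np

def melody_embedding(pitches_hz):
    """Turn a pitch contour (Hz values) into a number-based sequence:
    semitone offsets from the contour's mean pitch, L2-normalized.
    Subtracting the mean makes the representation key-invariant."""
    p = np.asarray(pitches_hz, dtype=float)
    semitones = 12 * np.log2(p / p.mean())
    return semitones / (np.linalg.norm(semitones) + 1e-9)

def best_match(query, reference_db):
    """Compare the query embedding against every reference song and
    return the name of the closest match by cosine similarity."""
    scores = {name: float(np.dot(query, emb)) for name, emb in reference_db.items()}
    return max(scores, key=scores.get)

# Toy reference database: two "songs" stored as pitch contours.
db = {
    "song_a": melody_embedding([440, 494, 523, 494, 440, 392, 440, 494]),
    "song_b": melody_embedding([330, 330, 349, 392, 392, 349, 330, 294]),
}

# A hum of song_a, sung two semitones lower than the recording.
hum = melody_embedding([392, 440, 466, 440, 392, 349, 392, 440])
print(best_match(hum, db))  # → song_a
```

Because the embedding is relative to the contour’s own mean, the transposed hum still lands next to the original melody, which is the property the real learned embeddings are trained to have.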
The idea of humming to find a song is not new, but Google claims to have figured it out rather well. So what is the secret sauce behind the algorithm?
ML Behind Hum To Search
In a blog detailing the machine learning behind the new feature, the Google AI team wrote that they had trained a neural network on pairs of hummed audio and recorded audio to produce embeddings for each input. These embeddings are later used to match a hummed melody against recorded songs.
The idea is to generate an embedding for each clip in a pair of humming and reference audio. Within the embedding space, clips containing the same melody should sit close together, while clips containing different melodies should be far apart. The network is trained on such pairs until this property holds.
The trained model should be capable of generating embeddings for a tune that is similar to the embedding of the song’s reference audio. Now finding the right song is just about finding similar embeddings from a database of reference recordings computed from the audio of popular music.
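Once the reference embeddings are precomputed, “finding the right song” reduces to a nearest-neighbour search. A minimal sketch, assuming random vectors stand in for real song embeddings and `song_42` is the track the user is humming:

```python
import numpy as np

# Hypothetical precomputed database: one unit-norm embedding per song.
rng = np.random.default_rng(0)
reference_matrix = rng.normal(size=(1000, 64))
reference_matrix /= np.linalg.norm(reference_matrix, axis=1, keepdims=True)
song_ids = [f"song_{i}" for i in range(1000)]

def top_k_matches(query_embedding, k=3):
    """Rank all reference songs by cosine similarity to the hum's embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = reference_matrix @ q           # one dot product per song
    best = np.argsort(scores)[::-1][:k]     # highest similarity first
    return [(song_ids[i], float(scores[i])) for i in best]

# A hum whose embedding lands near song_42's reference embedding.
hum = reference_matrix[42] + 0.05 * rng.normal(size=64)
print(top_k_matches(hum))
```

A production system would swap the brute-force matrix product for an approximate nearest-neighbour index, but the retrieval logic is the same.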
So far, the process looks good but what really makes a difference is incorporating the ‘triplet loss’ into the ML models.
The job of triplet loss is to selectively ignore parts of the training data. Given a pair of clips corresponding to the same melody (an anchor and a positive), the loss also takes in a clip from a different melody (a negative); once that negative’s embedding is already far enough from the anchor’s, the triplet contributes nothing and is effectively ignored.
The algorithm also gets rid of accompanying audio such as instruments and other recordings, leaving the song’s number-based sequence — its unique identity, so to speak. Adding triplet loss, Google wrote, led to improvements in the model’s precision and recall.
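The “ignoring” behaviour falls straight out of the standard triplet loss formula, max(0, d(anchor, positive) − d(anchor, negative) + margin). A toy illustration with made-up 2-D embeddings (in Hum to Search these would come from the trained network):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor (a hum) at least `margin` closer to the positive
    (same melody) than to the negative (a different melody)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([1.0, 0.0])    # embedding of a hummed clip
positive = np.array([0.9, 0.1])    # studio recording, same melody
easy_neg = np.array([0.0, 1.0])    # different melody, already far away
hard_neg = np.array([0.95, 0.05])  # different melody, confusingly close

print(triplet_loss(anchor, positive, easy_neg))  # 0.0 — triplet is ignored
print(triplet_loss(anchor, positive, hard_neg))  # > 0 — drives training
```

Easy negatives produce zero loss and zero gradient, so training effort concentrates on the confusable pairs — the behaviour the article describes.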
In addition, to improve the model’s performance, the Google team generated extra training data of “hummed” melodies simulated from the existing audio dataset using SPICE, a pitch extraction model developed by Google.
SPICE extracts pitch values from a given piece of audio, which are then used to generate a melody consisting of discrete audio tones. In a later step, this simple tone generator was replaced with a neural network that generates audio resembling an actual hummed tune.
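The tone-generation step can be sketched as follows. This is a simplified stand-in: the pitch estimates are hard-coded here (SPICE would produce them from real audio), the sample rate is an assumption, and the neural network that makes the output sound like actual humming is out of scope.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz; an assumption for this sketch

def quantize_to_semitone(freq_hz, ref_hz=440.0):
    """Snap a raw pitch estimate to the nearest semitone frequency."""
    n = round(12 * np.log2(freq_hz / ref_hz))
    return ref_hz * 2 ** (n / 12)

def synth_melody(pitch_track_hz, tone_sec=0.25):
    """Render each quantized pitch as a short sine tone and concatenate
    the tones into one discrete-tone melody."""
    t = np.arange(int(SAMPLE_RATE * tone_sec)) / SAMPLE_RATE
    tones = [np.sin(2 * np.pi * quantize_to_semitone(f) * t)
             for f in pitch_track_hz]
    return np.concatenate(tones)

# Noisy pitch estimates roughly at A4, B4, C5.
audio = synth_melody([438.0, 495.2, 523.9])
```

Quantizing the noisy estimates to discrete semitones is what turns a wobbly pitch track into a clean sequence of tones suitable for training data.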
Finally, the training data was augmented by mixing and matching the audio samples. When two different singers had recorded the same melody, the researchers used preliminary models to align the two clips in time, so that every pair presented to the network genuinely represents the same melody.
Try the new feature:
- Open the Google app, tap the mic icon, and say “what’s this song?”, or
- Tap the “Search a song” button and hum the tune.
- Google will find your song.
Know more about Google’s new feature here.