In ‘The Devil’s Hands Are Idle Playthings’, an episode of the TV series Futurama, the show’s protagonist Fry plays a musical instrument called the Holophonor. It resembles a clarinet fitted with a holographic lens that projects images based on the mood of the music. The show is set in the 31st century, so it is understandable if the technology behind the instrument seems out of reach.
However, Greek researchers Nikolaos Passalis and Stavros Doropoulos have tried to replicate something similar using machine learning algorithms. In their work, titled Deepsing, they demonstrate the idea behind this quirky application.
Deepsing Overview
Taking inspiration from Futurama’s Holophonor, Deepsing is designed to translate audio into images. It performs attribute-based music-to-image translation and synthesizes visual stories according to the sentiment expressed by songs.
The sentiment-aware generated images aim to induce the same feelings in viewers as the original song does, reinforcing the primary aim of music, i.e., communicating feelings.
This is how Deepsing comes up with visuals:
- First, it classifies music segments based on valence and arousal.
- Then the audio and the associated sentiments are mapped to image categories.
- The sentiment in the images is then enhanced using neural style transfer.
- Finally, the GANs come up with unconventional visual stories.
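The first two steps can be pictured with a toy sketch. It assumes each music segment already has a valence and an arousal score between -1 and 1, and it swaps the trained translation model for a plain lookup table. The `Segment` type, the `CATEGORIES` table and most of its entries are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    valence: float  # assumed range: -1 (negative) to 1 (positive)
    arousal: float  # assumed range: -1 (calm) to 1 (energetic)


# Hypothetical lookup table keyed by (valence >= 0, arousal >= 0).
# Deepsing learns this mapping; a fixed table is only a stand-in.
CATEGORIES = {
    (True, True): ["feather boa", "carousel"],
    (True, False): ["palace", "lakeside"],
    (False, True): ["volcano", "thunderstorm"],
    (False, False): ["prison", "graveyard"],
}


def quadrant(seg: Segment) -> tuple[bool, bool]:
    """Place a segment in one of the four valence/arousal quadrants."""
    return (seg.valence >= 0, seg.arousal >= 0)


def storyboard(segments: list[Segment]) -> list[str]:
    """Pick the first candidate image category for every segment."""
    return [CATEGORIES[quadrant(seg)][0] for seg in segments]


if __name__ == "__main__":
    song = [Segment(0.8, 0.9), Segment(0.4, -0.3), Segment(-0.6, -0.2)]
    print(storyboard(song))
```

In a real system, each chosen category would then be passed to an image generator and a style-transfer step, both of which are left out here.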
In sample frames generated for the song “Chop Suey!” by System of a Down, for example, note the generated “feather boa” during the most arousing riff of the song and the transition to a “prison” as the valence of the song decreases. The keyframes are selected and annotated with the corresponding affective content of the song.
The process of music-to-image translation poses unique challenges, mainly due to the unstable mapping between the different modalities involved in this process.
In their paper, the authors employ a trainable cross-modal translation method to overcome this limitation, leading to what they describe as the first deep learning method for generating sentiment-aware visual stories.
How Deepsing Works
When the model is asked to paint a picture of a song, say “Take Me To Church”, it estimates from the highs and lows in the audio how the emotion varies between positive and negative, and displays pictures to match. In one such case, the model shows a palace that exudes positivity.
The website also has an option that lets the user try different moods for the same song. The same palace can then be rendered in a way that induces negativity instead.
The authors believe that the overall output can be improved by selecting the class used for content generation according to a semantic similarity measure with the rest of the selected classes, instead of cardinality-based sampling.
They also suggest that using audio and video estimators to align the feature spaces, in addition to aligning the attribute spaces, could enrich the generated content and produce more diverse visual stories.
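The first of these suggestions can be illustrated with a small sketch that contrasts the two selection strategies. It rests on two assumptions: “cardinality-based” is read here as simply picking the most frequent class, and every class has an embedding vector to compare. The function names and the two-dimensional embeddings are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def most_frequent(classes: list[str]) -> str:
    """Cardinality-based choice (our reading): the class that occurs most often."""
    return Counter(classes).most_common(1)[0][0]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_central(classes: list[str], embeddings: dict[str, list[float]]) -> str:
    """Similarity-based choice: the class closest, on average, to the others."""
    unique = sorted(set(classes))

    def mean_similarity(candidate: str) -> float:
        others = [c for c in unique if c != candidate]
        if not others:
            return 0.0
        total = sum(cosine(embeddings[candidate], embeddings[c]) for c in others)
        return total / len(others)

    return max(unique, key=mean_similarity)


if __name__ == "__main__":
    # Made-up embeddings: "palace" and "castle" point in nearly the same direction.
    embeddings = {"palace": [1.0, 0.0], "castle": [0.9, 0.1], "prison": [0.0, 1.0]}
    selected = ["prison", "prison", "palace", "castle"]
    print(most_frequent(selected))             # frequency favours "prison"
    print(most_central(selected, embeddings))  # similarity favours "castle"
```

With these made-up vectors the frequency rule returns the outlier that happens to repeat, while the similarity rule returns a class that fits the rest of the selection, which is the kind of coherence the authors hope to gain.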
Future Direction
Music video production has adopted a strange custom of using patterns and other hallucinatory animations to complement the audio. Videos nowadays are abstract and have become art in themselves. Music videos are among the few human-made creations that are weirder than those made by GANs. An application like Deepsing is ingenious, but it shouldn’t shock the world as much as other AI products like GPT and GAN-based deepfakes.