Humans can conjure up smells, sounds, and the feel of a space from a single image, and vice versa. Given a picture of a beach, you know exactly how the waves would sound, how the salty air would smell, and how the heat would feel; if you hear snoring, you can picture a person lying in deep sleep. Meta AI's ImageBind paper, published in May, asks the question: can a model bind many different and unrelated modalities together, the way a human does?
To ‘bind’ multiple modalities beyond just text and images, the researchers kept images as the primary data and tested five others: text, audio, depth, thermal (heat-map data from a thermal camera), and IMU readings (from an inertial measurement unit, the sensor family that includes accelerometers and gyroscopes).
To link two unrelated modalities such as depth and text, the researchers used contrastive learning. Keeping image data as the anchor, the diagram from the paper shows solid bold lines representing the natural pairings with images that are available in any given dataset.
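Contrastive learning of this kind is commonly implemented with a symmetric InfoNCE loss: matched pairs (an image and its depth map, say) are pulled together in the embedding space while mismatched pairs in the batch are pushed apart. A minimal PyTorch sketch, assuming both encoders already produce fixed-size embedding vectors (the 512-dim size and 0.07 temperature are illustrative choices, not values from the paper):

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, other_emb: (batch, dim) tensors from two encoders,
    e.g. an image encoder and a depth/audio/text encoder, where
    row i of each tensor comes from the same underlying scene.
    """
    # Normalise so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ other_emb.t() / temperature

    # Each row's correct "class" is its own index: pull matched pairs
    # together and push mismatched pairs apart, in both directions.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: random "image" and "depth" embeddings for a batch of 4.
loss = infonce_loss(torch.randn(4, 512), torch.randn(4, 512))
```

Because images sit on one side of every such pair, each extra modality is aligned to the same image-centred space without the modalities ever needing to be paired with each other directly.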
Next, the researchers show how emergent linking happens: you can now combine audio and text data points and retrieve the right image or video. This capability didn’t exist before; it is emergent. Using pairs of aligned observations, such as the sound of barking and the text ‘dog’, the model correctly outputs an image of a dog. In another example from the paper, combining an image of a stork with the sound of waves retrieves images of a stork in water.
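Because all modalities land in one shared space, this kind of composition can be done with simple embedding arithmetic: add the two unit-normalised query embeddings and look for the nearest gallery item. A hedged NumPy sketch of that retrieval step (the 2-dim toy vectors and the `compose_and_retrieve` helper are illustrative, not the paper's API):

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def compose_and_retrieve(emb_a, emb_b, gallery):
    """Return the index of the gallery item closest (by cosine
    similarity) to the sum of two modality embeddings, e.g. a
    stork image plus wave audio, all in the same shared space."""
    query = normalize(normalize(emb_a) + normalize(emb_b))
    scores = normalize(gallery) @ query
    return int(np.argmax(scores))

# Toy gallery of three "image" embeddings; the third combines
# both query directions, so it should win.
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
best = compose_and_retrieve(np.array([1.0, 0.0]),
                            np.array([0.0, 1.0]),
                            gallery)
```

The same mechanism underlies the stork-plus-waves example: neither query modality was ever paired with the retrieved image directly, yet their combined embedding lands nearest to it.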
What the paper builds up to is that you don’t actually need a data pair with the image to link a new modality in. For example, with just depth or heat-map information paired with text (which already has a link to images), the user can create an image that binds all three. The paper calls this ‘emergent alignment’.
Why CLIP was the perfect choice
Meta’s Facebook holds one of the largest datasets of paired images and text. Curiously, instead of training on that data, the researchers built on OpenAI’s CLIP, when it might have made more sense to use the datasets they have collected over the last ten years. There is also no sign of a GPT-4-style multimodal architecture.
CLIP is a model that created a shared embedding space for both images and language, making it ridiculously powerful and useful. Adding ImageBind on top of the CLIP embedding space extends the model beyond text to pretty much all the other modalities mentioned in the paper: if a user has audio, IMU, heat-map, depth, and text data, they can retrieve the image that is closest to that data.
Ponte further breaks down the paper and the authors’ reason for selecting CLIP – “I see that it’s a genius move that they didn’t change the CLIP embedding space which now means that you can actually go back to every single paper that uses CLIP that people have released for the last three years, and you can just plug this (ImageBind) in instead.”
Using ImageBind, we can project anything into CLIP. “They extended CLIP, they didn’t replace it, but made it even better because CLIP works on contrastive learning as well where you need paired examples of the image and the text on what the image shows,” added Ponte.
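The practical upshot of extending rather than replacing CLIP is that only a small new encoder per modality needs training, while the CLIP space itself stays frozen. A minimal sketch of that idea, assuming a hypothetical audio encoder projecting pooled audio features into a 512-dim CLIP-like space (the class name, layer sizes, and dimensions are all assumptions for illustration):

```python
import torch
import torch.nn as nn

class AudioToCLIP(nn.Module):
    """Hypothetical encoder mapping pooled audio features into a
    frozen CLIP-style embedding space (512-dim assumed). Only this
    module would be trained; the CLIP image and text encoders stay
    fixed, so any existing system built on CLIP embeddings can
    consume its outputs unchanged."""

    def __init__(self, audio_dim=128, clip_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, clip_dim),
        )

    def forward(self, audio_features):
        emb = self.proj(audio_features)
        # Unit-normalise so cosine similarity against CLIP image/text
        # embeddings is a plain dot product.
        return emb / emb.norm(dim=-1, keepdim=True)

# Toy usage: 4 audio clips' pooled features -> CLIP-space vectors.
audio_emb = AudioToCLIP()(torch.randn(4, 128))
```

This is why, as Ponte notes, prior CLIP-based systems can plug ImageBind in: they only ever see vectors in the same embedding space they were built for.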
Further, the ImageBind authors employed a Vision Transformer (ViT), a common architecture nowadays, to create similar embeddings for related concepts across different modalities, such as associating the word “dog” with an image of a dog.
What’s next for ImageBind
Meta released the code as open source, but, notably, it is licensed for non-commercial use only, which limits the use cases so far. Still, developers have built a clever search-engine demo using ImageBind: it retrieves AI-generated images from text, audio, or even visual inputs.
Yann LeCun, Meta’s chief AI scientist, said the model wasn’t released publicly, probably for legal reasons, or perhaps because this is just the initial paper covering such a wide range of modalities. That has slowed the paper’s adoption, with only a few demos built on it so far. The breadth of modalities, however, looks like a step towards LeCun’s approach to AGI: a model that can learn from different ‘senses’ to produce the right image, mimicking how humans perceive the world.