
What Happened to Meta’s One Image to Bind it All?

With one embedding space, Meta’s ImageBind binds five other modalities to images. Though the paper is brilliant, little has been built on it so far.



From an image, humans can conjure up the smells, sounds, and feel of a place, and the other way around. Given a picture of a beach, you know exactly how the waves would sound, the smell of the salty air, and the heat around you; if you hear snoring, you can picture a person lying down in deep sleep. Meta AI’s ImageBind paper, published in May, addresses the question: can you bind many different and unrelated modalities together, like a human?

To ‘bind’ multiple modalities beyond just text and images, the researchers kept images as the primary data and tested text, audio, depth, heat maps (from a thermal camera), and IMU data (inertial measurement unit readings from accelerometers and gyroscopes).

To link two unrelated modalities, such as depth and text, the researchers used contrastive learning. With image data as the anchor, the diagram in the paper uses solid bold lines to represent the pairings with images that are naturally available in existing data.
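
To make the contrastive idea concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss that pulls paired image and depth embeddings together and pushes mismatched pairs apart. The encoder outputs, batch size, and dimensions are illustrative stand-ins, not values from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, other_emb: (batch, dim) tensors from two encoders,
    where row i of each comes from the same observation
    (e.g. an image and its depth map).
    """
    # Normalise so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ other_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Pull matching pairs together, push mismatched ones apart, both ways.
    loss_i2o = F.cross_entropy(logits, targets)
    loss_o2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2o + loss_o2i) / 2

# Illustrative usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)    # hypothetical image encoder output
depth = torch.randn(8, 512)  # hypothetical depth encoder output
print(infonce_loss(img, depth).item())
```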

Next, the researchers show how emergent linking happens: you can now take audio and text data points and retrieve the right image or video. This capability didn’t exist before; it’s emergent. Using pairs of aligned observations, such as the sound of barking and the text ‘dog’, the model correctly outputs an image of a dog. In another example from the paper, the image of a stork combined with the sound of waves retrieves images of a stork in water.
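
That stork-plus-waves composition can be approximated with simple embedding arithmetic in the shared space, followed by nearest-neighbour retrieval. A rough sketch, with random tensors standing in for real encoder outputs and a hypothetical gallery of candidate image embeddings:

```python
import torch
import torch.nn.functional as F

# Stand-ins for embeddings from the modality encoders (shapes are assumptions).
image_emb = F.normalize(torch.randn(1024), dim=0)       # e.g. a stork photo
audio_emb = F.normalize(torch.randn(1024), dim=0)       # e.g. the sound of waves
gallery = F.normalize(torch.randn(5000, 1024), dim=-1)  # candidate image embeddings

# Combine the two queries by adding their unit vectors, then renormalise.
query = F.normalize(image_emb + audio_emb, dim=0)

# Rank gallery images by cosine similarity to the combined query.
scores = gallery @ query
top_k = scores.topk(5).indices
print(top_k.tolist())
```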

What the paper seems to be building towards is that you don’t actually need a paired image for every modality to link them together. For example, with just depth or heat-map information paired with text (which already has a link to images), the user can create an image that binds all three. The paper calls this ‘emergent alignment’.

Why CLIP was the perfect choice

Meta’s Facebook has one of the largest datasets of paired images and text. Curiously, instead of training on their own data collected over the last ten years, the researchers built on OpenAI’s CLIP. There is also no sign of a GPT-4-style multimodal architecture.

Hugo Ponte, a robotics researcher, however, explains in some detail why using CLIP instead was a genius move by Meta.

CLIP is a model that created a shared embedding space for images and language, which makes it ridiculously powerful and useful. Building ImageBind on top of the CLIP embedding space extends the model beyond text to pretty much all the other modalities mentioned in the paper. If a user has audio, IMU, heat-map, depth, and text data, they can retrieve the image closest to that data.

Ponte further breaks down the paper and the authors’ reason for selecting CLIP – “I see that it’s a genius move that they didn’t change the CLIP embedding space which now means that you can actually go back to every single paper that uses CLIP that people have released for the last three years, and you can just plug this (ImageBind) in instead.”

Using ImageBind, we can project anything into CLIP’s embedding space. “They extended CLIP, they didn’t replace it, but made it even better because CLIP works on contrastive learning as well where you need paired examples of the image and the text on what the image shows,” added Ponte.
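
In practice, the open-sourced ImageBind code exposes this shared space directly. The sketch below follows the usage pattern shown in the project’s README; the import paths depend on how the repository is installed, and the example inputs (‘dog.jpg’, ‘barking.wav’) are placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (weights are released for non-commercial use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Text, image, and audio inputs all map into the same embedding space.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["barking.wav"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Compare the image against the text prompts, exactly as one would with CLIP.
scores = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(scores)
```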

Further, the ImageBind authors employ a Vision Transformer (ViT), a common architecture nowadays, to create similar embeddings for related concepts across different modalities, like associating “dog” with an image of a dog.

What’s next for ImageBind

Meta released the code as open source but, funnily enough, not for commercial use, which limits the use cases so far. Still, developers have built a clever search engine demo using ImageBind. The search engine retrieves AI-generated images using text, audio, or even visual inputs.

Yann LeCun, Meta AI chief, said the model wasn’t released publicly, probably for legal reasons, or perhaps because it is just the initial paper spanning such a wide range of modalities. This has slowed adoption, with only a few demos built on it. The breadth of modalities, however, looks like a step towards LeCun’s approach to AGI: a model that learns from different ‘senses’ to produce the right image, mimicking how humans perceive the world.


K L Krithika

K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies, trying not to confuse them with reality.