What Happened to Meta’s One Image to Bind it All?

With one embedding space, Meta’s ImageBind binds five different modalities to images through their link with image data. Though the paper is brilliant, little work has been built on it so far.

Humans can conjure up the smells, sounds, and feel of a space from an image, and the other way around. Given a picture of a beach, you know exactly how the waves would sound, how the salty air would smell, and how the heat would feel; if you hear snoring, you can picture a person lying in deep sleep. Meta AI’s ImageBind paper, published in May, addresses the question: can a model bind many different and unrelated modalities together, like a human?

To ‘bind’ multiple modalities beyond just text and images, the researchers kept images as the primary data and tested five others: text, audio, depth, thermal (heat-map) data from a thermal camera, and IMU readings (inertial measurement unit data from sensors such as accelerometers and gyroscopes).

To link two unrelated modalities, like depth and text, the researchers used contrastive learning. With image data as the anchor, the diagram in the paper shows solid bold lines representing the pairings with images that occur naturally in any given dataset.
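Contrastive alignment of this kind is typically trained with an InfoNCE-style loss. The sketch below is a minimal NumPy illustration, not the paper’s actual training code; it assumes a batch of L2-normalised image embeddings and paired embeddings from another modality (say, depth maps):

```python
import numpy as np

def info_nce_loss(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired unit embeddings.

    img_emb, mod_emb: (batch, dim) L2-normalised embeddings, where row i
    of each array comes from the same underlying sample (e.g. an image
    and its depth map).
    """
    logits = img_emb @ mod_emb.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))              # row i should match column i

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean() # -log p of the true pairing

    # average the image->modality and modality->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# toy batch of 4 unit vectors standing in for encoder outputs
rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))
e /= np.linalg.norm(e, axis=1, keepdims=True)
print(info_nce_loss(e, e))        # perfectly aligned pairs give a low loss
print(info_nce_loss(e, e[::-1]))  # mismatched pairs give a high loss
```

In the real model each modality has its own encoder; the loss pulls matching image–modality pairs together in the shared space and pushes the other pairs in the batch apart.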


Next, the researchers show how emergent linking happens: you can now take audio and text data points and retrieve the right image or video, a capability that didn’t exist before. Using pairs of aligned observations, such as the sound of barking and the text ‘dog’, the model correctly outputs an image of a dog. In another example from the paper, an image of a stork combined with the sound of waves yields images of a stork in water.

What the paper builds up to is that you don’t actually need a data pair with the image to link modalities together. For example, with just depth or heat-map information paired with text (which does have a natural link to images), the user can create an image that binds all three. The paper calls this ‘emergent alignment’.
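In a shared embedding space, composing modalities at query time can be as simple as adding unit embeddings and re-normalising. The toy NumPy sketch below uses random stand-in vectors (real embeddings would come from the trained encoders) to illustrate the stork-plus-waves style of retrieval:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
dim = 16

# hypothetical unit embeddings in the shared space (stand-ins for encoder outputs)
image_of_stork = normalize(rng.normal(size=dim))
sound_of_waves = normalize(rng.normal(size=dim))

# gallery of candidate images; one sits near the stork + waves combination
distractors = normalize(rng.normal(size=(5, dim)))
stork_in_water = normalize(image_of_stork + sound_of_waves
                           + 0.1 * rng.normal(size=dim))
gallery = np.vstack([distractors, stork_in_water])

# combine the two query modalities by summing unit vectors, then re-normalising
query = normalize(image_of_stork + sound_of_waves)
best = int(np.argmax(gallery @ query))  # cosine similarity; all rows are unit-norm
print(best)                             # index of the stork-in-water candidate
```

The same pattern works for any pair of modalities the model embeds, which is what makes the emergent composition in the paper possible.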

Why CLIP was the perfect choice

Meta’s Facebook has one of the largest datasets of paired images and text. Curiously, instead of training on its own data collected over the last decade, the team built on OpenAI’s CLIP. There is also no sign of a GPT-4-style multimodal architecture.

Hugo Ponte, a robotics researcher, explains in some detail why using CLIP instead was a genius move by Meta.

CLIP is a model with a single embedding space shared by images and language, which makes it ridiculously powerful and useful. Building ImageBind on the CLIP embedding space extends the model beyond text to pretty much all the other modalities in the paper: given audio, IMU, heat-map, depth, or text data, you can retrieve or generate the image closest to that data.

Ponte further breaks down the paper and the authors’ reason for selecting CLIP – “I see that it’s a genius move that they didn’t change the CLIP embedding space which now means that you can actually go back to every single paper that uses CLIP that people have released for the last three years, and you can just plug this (ImageBind) in instead.”

Using ImageBind, we can project anything into CLIP’s embedding space. “They extended CLIP, they didn’t replace it, but made it even better, because CLIP works on contrastive learning as well, where you need paired examples of the image and the text on what the image shows,” added Ponte.
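Because the embedding space is unchanged, retrieval code written against CLIP needs no modification. A minimal cosine-similarity ranker (illustrative, not from the paper) works the same whether the query vector comes from CLIP’s text encoder or from an ImageBind audio, depth, or IMU encoder:

```python
import numpy as np

def rank_gallery(query_vec, gallery_vecs):
    """Rank gallery embeddings by cosine similarity to a query embedding.

    Works unchanged whether query_vec came from CLIP's text encoder or,
    after ImageBind, from an audio/depth/thermal/IMU encoder, since they
    all live in the same embedding space.
    """
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))  # gallery indices, most similar first

# toy 2-D example: the first gallery item is nearest to the query
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
order = rank_gallery(np.array([1.0, 0.1]), gallery)
print(order)
```

This drop-in property is exactly Ponte’s point: downstream systems built on CLIP over the last three years can swap in ImageBind embeddings without rewriting their retrieval logic.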

Further, the ImageBind authors employ a Vision Transformer (ViT), a common architecture nowadays, to create similar embeddings for related concepts across different modalities, like associating “dog” with an image of a dog.
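A ViT consumes its input as a sequence of flattened patches, so each modality only needs its own “patchifier” in front of a Transformer trunk. A minimal sketch of the image case (illustrative only; the real model adds a learned linear projection and positional embeddings):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the token sequence a Vision Transformer consumes."""
    H, W, C = image.shape
    p = patch_size
    x = image.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)   # group pixels by patch position
    return x.reshape(-1, p * p * C)  # (num_patches, patch_dim)

tokens = patchify(np.zeros((32, 32, 3)), 8)
print(tokens.shape)  # 16 patch tokens, each of dimension 8*8*3 = 192
```

Audio spectrograms, depth maps, and thermal images can be patchified the same way, which is one reason a single Transformer architecture can serve every modality.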

What’s next for ImageBind

Meta released the code as open source but, funnily enough, it is not licensed for commercial use, which limits the use cases so far. Still, developers have built a clever search engine demo using ImageBind: it retrieves AI-generated images using text, audio, or even visual inputs.

Yann LeCun, Meta’s AI chief, said the model wasn’t released publicly, probably for legal reasons, or because it is just the initial paper covering such a wide range of modalities. This has slowed adoption, with only a few demos built on it so far. The breadth of modalities, however, looks like a step towards LeCun’s approach to AGI: a model that can learn from different ‘senses’ to produce the right image, mimicking how humans perceive the world.

K L Krithika
K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies while trying not to confuse them with the strides technology makes in real life.
