An Illustrative Guide to Multimodal Recommendation Systems


General recommendation systems learn patterns in how users choose or interact with items and recommend items based on those patterns. As a next step, multimodal recommendation systems capture users’ styles and aesthetic preferences and recommend products based on the themes or context the user is interested in. In this article, we will discuss these multimodal recommendation systems in detail, along with their working principle and applications. The major points to be covered are listed below.

Table of contents

  1. What are Multimodal Recommender Systems?
  2. Why do we need Multimodal Recommender Systems?
  3. How Does it Work?
  4. Applications of Multimodal Recommender Systems

What are Multimodal Recommender Systems?

Multimodal generally means having more than one mode. In recommendation, the modes are different kinds of information about users and items, such as product images and text descriptions. Multimodal recommender systems use this information to capture users’ styles and aesthetic preferences. That means they recommend items based on the user’s input and history, and can even match the colour and pattern of the searched item.


This type of recommendation system saves the user time because it recommends the next item with a similar overall theme, style, or ambience, which can in turn increase revenue for the company.

Why do we need Multimodal Recommender Systems?

A recommendation system generally follows one of two approaches: collaborative or content-based recommendation. A collaborative recommendation system predicts your preferences from the ratings and interests of other users who behave similarly to you. A content-based recommendation system recommends items whose attributes resemble those in your own profile, for example your search history.
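To make the contrast between the two approaches concrete, here is a minimal sketch in plain Python. The users, items, ratings, and tags are all made up for illustration, and cosine similarity and Jaccard overlap are simply common choices assumed here, not something the systems described in this article are known to use.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. All values are made up.
RATINGS = {
    "ana": {"red_shirt": 5, "red_shoes": 4, "blue_jeans": 1},
    "ben": {"red_shirt": 4, "red_shoes": 5, "green_hat": 2},
    "cal": {"blue_jeans": 5, "green_hat": 4},
}

# Hypothetical item attributes used by the content-based approach.
ITEM_TAGS = {
    "red_shirt": {"red", "cotton", "top"},
    "red_shoes": {"red", "leather", "footwear"},
    "blue_jeans": {"blue", "denim", "bottom"},
    "green_hat": {"green", "cotton", "headwear"},
    "red_scarf": {"red", "cotton", "accessory"},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def collaborative(user):
    """Rank unseen items by the ratings of users with similar taste."""
    scores = {}
    for other, ratings in RATINGS.items():
        if other == user:
            continue
        similarity = cosine(RATINGS[user], ratings)
        for item, rating in ratings.items():
            if item not in RATINGS[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


def content_based(user, min_rating=4):
    """Rank unseen items by tag overlap (Jaccard) with items the user liked."""
    liked = [item for item, r in RATINGS[user].items() if r >= min_rating]
    scores = {}
    for item, tags in ITEM_TAGS.items():
        if item in RATINGS[user]:
            continue
        overlaps = [
            len(tags & ITEM_TAGS[liked_item]) / len(tags | ITEM_TAGS[liked_item])
            for liked_item in liked
        ]
        scores[item] = max(overlaps, default=0.0)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that the collaborative function can never suggest `red_scarf`, because nobody has rated it yet, while the content-based function can, because it only needs the item’s tags. Neither function looks at what the items actually look like.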

The problem arises when you are looking for shoes whose colour matches a shirt, or for furniture that matches your home: the above systems fail to deliver. A multimodal recommendation system helps users find matches based on colour, theme, ambience, etc.

In the above image, the input image serves as the seed and the recommended images form the generated assortment.

How Does it Work?

In this multimodal recommender system, transfer learning and topic modelling are used to maximize visual style compatibility, and polylingual topic modelling is used to incorporate text data so that style can be inferred over both modalities. Before explaining multimodal recommender systems, let’s have a brief introduction to transfer learning, topic modelling, and LDA.

Transfer Learning

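Transfer learning means reusing a model trained on one task as the starting point for another. For images, this usually means taking a deep neural network that has been pre-trained on the ImageNet dataset. Some examples of such pre-trained models are ResNet-50, VGG-16, VGG-19, etc.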

Topic Modelling

Topic modelling is a statistical process for identifying, extracting, and analyzing topics from a given collection of documents. Topic modelling techniques figure out which topics are present in the documents of the corpus and estimate how strongly each topic is present.

Latent Dirichlet Allocation (LDA)

LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, which in turn explain why some parts of the data are similar.

Primarily, content data is used to learn the user’s preferences and to pick the seed product around which the bundles are created. LDA topic modelling is then applied to create topic-based recommendations from both the user’s input text and the content data. These systems score individual products against one another so that users see the most similar ones. PolyLDA (polylingual LDA) enables learning two coupled but distinct latent style representations. Given a set of documents and a number of target topics, the model assumes that the documents were created by a generative process in which each document draws a mixture over topics and each word is then drawn from one of those topics.
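The generative process that LDA assumes can be sketched in a few lines of Python. This is only an illustration of the standard single-modality LDA story, not the PolyLDA variant and not a fitting procedure; the function names, the vocabulary in the usage note, and the `alpha` and `beta` values are assumptions made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, beta=0.5, seed=0):
    """Generate documents the way LDA assumes they were written."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document has its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            # Pick a topic for this word slot, then a word from that topic.
            topic = rng.choices(range(n_topics), weights=theta)[0]
            word = rng.choices(vocab, weights=topic_word[topic])[0]
            doc.append(word)
        corpus.append(doc)
    return corpus
```

For example, `generate_corpus(["red", "leaf", "cotton", "oak", "sofa", "lamp"], n_topics=2, n_docs=3, doc_len=5)` returns three five-word documents. Fitting LDA means running this story in reverse: given only the documents, infer the topic mixtures and word distributions that most plausibly produced them.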

To capture the aesthetics of the seed images, a deep learning method is used. Powerful deep neural networks such as ResNet-50, InceptionV3, etc. can capture style-based preferences. Here, ResNet-50 pre-trained on the ImageNet dataset is used. This convolutional neural network extracts features from the images, and its responses to each image are simply indexed to create visual documents.
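One way to picture a visual document is as a list of tokens naming the network channels that respond most strongly to an image. The sketch below assumes the activations have already been extracted from a pre-trained network and are given as a plain list of numbers; the `top_k` cut-off and the token format are assumptions for illustration, since the exact indexing scheme is not spelled out here.

```python
def visual_document(activations, top_k=3, prefix="v"):
    """Turn one activation vector into tokens naming its strongest channels."""
    # Rank channel indices from strongest to weakest response.
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    # Keep the top_k channels, skipping any that did not respond at all.
    return [f"{prefix}{i}" for i in ranked[:top_k] if activations[i] > 0]
```

Calling `visual_document([0.1, 2.0, 0.0, 1.5, 0.7])` gives `["v1", "v3", "v4"]`. Once every image is expressed as such a token list, it can be fed to a topic model in the same way as a text document.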

Since LDA is used for topic modelling of text and transfer learning for images, an extension is needed to interpret both together. That extension is a multimodal topic model, which assumes that words and visual features occur in pairs and should be captured as tuples.

The above figure shows the layers used to create visual documents.

The above graph visualizes the assortment and the performance of a multimodal recommender system.
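The step of scoring products against one another across both modalities can be sketched as follows. The example assumes that a topic mixture has already been inferred for each product’s text and for its image; the product names, the mixture values, and the choice of a weighted average of cosine similarities are all assumptions for illustration, not the scoring function of any particular system.

```python
from math import sqrt

# Hypothetical products: name -> (text topic mixture, visual topic mixture).
PRODUCTS = {
    "red_leaf_tshirt": ([0.8, 0.1, 0.1], [0.7, 0.2, 0.1]),
    "red_sneakers": ([0.6, 0.3, 0.1], [0.8, 0.1, 0.1]),
    "grey_office_chair": ([0.1, 0.1, 0.8], [0.1, 0.2, 0.7]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_against_seed(seed, products, text_weight=0.5):
    """Rank every other product by its combined similarity to the seed."""
    seed_text, seed_visual = products[seed]
    scores = {}
    for name, (text, visual) in products.items():
        if name == seed:
            continue
        scores[name] = (
            text_weight * cosine(seed_text, text)
            + (1 - text_weight) * cosine(seed_visual, visual)
        )
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

With the red leaf t-shirt as the seed, the red sneakers rank above the grey office chair, because their text and visual mixtures both concentrate on the same topic as the seed. Changing `text_weight` shifts how much the description counts relative to the image.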

Applications of Multimodal Recommender Systems

Multimodal recommender systems are used by e-commerce platforms to recommend additional products that are aesthetically similar to the products the user has searched for. This kind of recommendation can boost the sales of those other products as well, generating more revenue.

The above image shows how the search results are similar to each other. Multimodal recommender systems are also used for recommending fashion-related products. Suppose you have just searched for a red t-shirt with a leaf pattern on it: the system automatically recommends red t-shirts with a leaf pattern and, along with them, red shoes.

Multimodal recommender systems are also used in the food and beverage industry. If you search for organic grape juice, the system will show several other products that are organic.

Final Words

In this article, we looked at what multimodal recommender systems are, how they work, and where they are used. Multimodal recommendation also draws on other models, such as transfer learning models, sequential recommendation systems, etc. We also went through some of the interesting applications of multimodal recommendation systems.


Basawanyya Hiremath
Basawanyya sees patterns around him. That's what makes him love machine learning, after all it's all about patterns around us.
