
How Amazon’s Image-Recipe Hierarchical Transformer excels in Cross-modal Recipe Retrieval

Amazon introduced a transformer-based cross-modal recipe retrieval method that is simple and versatile to train and deploy


Though food is a basic need, it is rarely treated as just that: humans consistently show interest in exploring new foods and improving the taste of their traditional dishes. The digital world offers a natural home for recipes, listing ingredients, nutritional information, cooking instructions, supporting images and videos, and reviews and ratings. For over a decade, AI/ML-based recipe retrieval has attempted to help people develop their cooking skills and discover something new and delicious!

Cross-modal recipe retrieval is a digital recipe task in which a machine learning model returns the text recipe that matches a provided food image. The task is challenging because it bridges two entirely different modalities: natural language processing and image processing. A lot of training data is available across recipe websites, which makes modelling possible. Nevertheless, the data is scattered over many sites, loosely structured and offers no assurance of completeness. Research has produced impressive machine learning models, but their performance still falls short of human expectations.
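At inference time, retrieval typically reduces to a nearest-neighbour search in a shared embedding space: encode the query image, then rank all recipe embeddings by similarity. A minimal sketch of that final step (the function name and toy vectors below are illustrative, not from the paper):

```python
import numpy as np

def retrieve_recipes(image_emb, recipe_embs, k=5):
    """Rank recipes for one image embedding and return the top-k indices.
    Assumes both modalities were projected into a shared space and
    L2-normalised, so a dot product equals cosine similarity."""
    scores = recipe_embs @ image_emb     # cosine similarity per recipe
    return np.argsort(-scores)[:k]       # indices of the k best matches

# Toy example: four recipe vectors in a 3-d shared space.
recipes = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.7, 0.7, 0.0]])
recipes /= np.linalg.norm(recipes, axis=1, keepdims=True)
query = np.array([0.9, 0.1, 0.0])
query /= np.linalg.norm(query)
print(retrieve_recipes(query, recipes, k=2))  # [0 3]
```

The same ranking works in the opposite direction (recipe-to-image) by swapping the roles of the two embedding sets.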

Recent cross-modal recipe retrieval approaches use LSTM cells to encode text recipes alongside the corresponding image embeddings. These models mostly require heavily pre-trained text representations, complex multi-stage training strategies and adversarial losses. Amaia Salvador, Erhan Gundogdu, Loris Bazzani and Michael Donoser of Amazon have introduced a transformer-based cross-modal recipe retrieval method that is simple and versatile to train and deploy.

Attention-based transformer networks have recently replaced traditional convolutional and recurrent neural networks in various domains, including text, audio, image, video and structured data, showing computational efficiency and performance improvements over those traditional approaches. To this end, the Amazon scientists have applied a hierarchical transformer model with self-supervised learning to the interesting cross-domain task of cross-modal recipe retrieval, to great success. This hierarchical recipe transformer is an end-to-end machine learning model with attention-based encoders for both text and images.

Hierarchical Recipe Transformer overview during training

This hierarchical model has a two-branch encoder architecture: one encoder for images and one for recipe text. Recipe text is encoded hierarchically, from the recipe's title through its ingredients to its instructions, with an individual encoder for each of these components. The resulting component embeddings are fed to a recipe-level encoder whose output is paired with the image encoder's output. The hierarchical transformer encoder reads the text sentence by sentence, so it effectively captures the ingredient and instruction information without loss or mismatch.
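The hierarchy can be pictured in PyTorch as two stacked `nn.TransformerEncoder`s: one pools the tokens of each sentence into a sentence vector, the other pools the sentence vectors into a single recipe vector. All names, sizes and the mean-pooling choice below are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class HierarchicalTextEncoder(nn.Module):
    """Sketch of hierarchical recipe-text encoding: a sentence-level
    transformer encodes each title / ingredient / instruction line,
    then a recipe-level transformer merges the sentence vectors."""

    def __init__(self, vocab=1000, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        sent_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.sentence_enc = nn.TransformerEncoder(sent_layer, num_layers=2)
        rec_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.recipe_enc = nn.TransformerEncoder(rec_layer, num_layers=2)

    def forward(self, token_ids):
        # token_ids: (num_sentences, tokens_per_sentence)
        tok = self.embed(token_ids)                  # (S, T, dim)
        sent = self.sentence_enc(tok).mean(dim=1)    # one vector per sentence
        recipe = self.recipe_enc(sent.unsqueeze(0))  # (1, S, dim)
        return recipe.mean(dim=1).squeeze(0)         # single recipe vector

enc = HierarchicalTextEncoder()
ids = torch.randint(0, 1000, (6, 12))  # e.g. 6 instruction lines, 12 tokens each
print(enc(ids).shape)  # torch.Size([64])
```

In the actual model, separate encoders of this kind handle the title, ingredients and instructions before the recipe-level merge.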

Difference between the traditional Transformer Encoder (on left) and the Hierarchical Transformer Encoder (on right)

The pair of encoders for image and text is trained jointly with a pair loss. However, some large recipe datasets lack accompanying images, which rules out supervised pair training for those samples. For recipe text with no accompanying image, a self-supervised learning approach is introduced with a dedicated loss function, known as the recipe loss. The model can therefore be trained with recipe-image pairs, recipe-only data, or both, making it robust and powerful.
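Both losses can be pictured as margin-based ranking objectives: the pair loss pulls a matching image-recipe pair together in the shared space, and the recipe loss applies the same idea between components of the same recipe (for example, title versus instructions), so no image is required. A toy numpy sketch of such a margin loss (the margin value, vectors and function name are made up for illustration; the exact formulation is given in the paper):

```python
import numpy as np

def margin_loss(anchor, positive, negatives, margin=0.3):
    """Toy margin-based ranking loss over normalised embeddings:
    the score of the matching pair should beat every mismatched
    score by at least `margin`."""
    pos = anchor @ positive              # similarity of the true pair
    neg = negatives @ anchor             # similarities of mismatched items
    return np.maximum(0.0, margin - pos + neg).mean()

anchor = np.array([1.0, 0.0])            # e.g. an image embedding
positive = np.array([0.8, 0.6])          # its matching recipe embedding
negatives = np.array([[0.0, 1.0],        # embeddings of other recipes
                      [0.6, 0.8]])
print(margin_loss(anchor, positive, negatives))  # ≈ 0.05
```

For the recipe loss, the anchor and positive would instead be two component embeddings of the same recipe, with components of other recipes as negatives.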

Self-supervised training strategy for recipe-only data

Python Implementation

Amazon’s Image-Recipe Transformer requires Git’s LFS (Large File Storage) extension. The following commands install Git LFS on the local machine.

 %%bash
 curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
 sudo apt-get install git-lfs
 git lfs install 


Install timm module using the command,

!pip install timm

Download the source code from the official repository to the local machine.

!git clone https://github.com/amzn/image-to-recipe-transformers.git


Change into the source directory and verify that the files downloaded properly using the following commands.

 %cd /content/image-to-recipe-transformers/
 !ls -p 


If Anaconda is not installed on the machine, the following commands install the Anaconda3 package.

 !wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
 !bash Anaconda3-2020.02-Linux-x86_64.sh 

Start a shell and create the development environment from inside the base conda environment using the following commands. Installing the dependencies takes some time.

!bash

and, from inside the base environment, run

conda env create -f environment.yml


The following command, run from the base environment, activates the project’s conda environment.

conda activate im2recipetransformers


Download the recipe data from the Recipe1M dataset by creating an account. This dataset contains more than one million recipes and around 13 million supporting images. Once the dataset is downloaded, extract it into a directory of your choice (referred to below as DATASET_PATH). The following command preprocesses the data.

!python preprocessing.py --root DATASET_PATH

Start training using the following command. Training time depends on the device configuration and memory availability.

 %cd /content/image-to-recipe-transformers/src/
 !python train.py --model_name model --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints 

Launch TensorBoard logging using the following command.

!tensorboard --logdir "./" --port PORT


Evaluate the trained model on the test split using the following command.

 %cd /content/image-to-recipe-transformers/src/
 !python test.py --model_name model --eval_split test --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints 

Calculate evaluation metrics such as the median rank (MedR) and Recall@K using the following command.

 %cd /content/image-to-recipe-transformers/src/
 !python eval.py --embeddings_file /path/to/saved/model/checkpoints/model/feats_test.pkl --medr_N 10000 
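For reference, MedR is the median rank of the correct item among all candidates, and Recall@K is the fraction of queries whose correct item appears in the top K. A small numpy sketch of how these metrics can be computed from paired embeddings (function and variable names are illustrative, not the repository's own code):

```python
import numpy as np

def retrieval_metrics(img_embs, rec_embs, ks=(1, 5, 10)):
    """Median rank and Recall@K for image-to-recipe retrieval.
    Row i of each matrix is assumed to hold the embedding of pair i,
    so the correct recipe for image i sits at index i."""
    sims = img_embs @ rec_embs.T                 # (N, N) similarity matrix
    order = np.argsort(-sims, axis=1)            # candidates ranked per image
    ranks = np.where(order == np.arange(len(sims))[:, None])[1] + 1
    return float(np.median(ranks)), {k: float(np.mean(ranks <= k)) for k in ks}

# Sanity check with perfectly aligned embeddings: every rank is 1.
embs = np.eye(4)
medr, recall = retrieval_metrics(embs, embs)
print(medr, recall[1])  # 1.0 1.0
```

Lower MedR and higher Recall@K indicate better retrieval.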

Performance of Hierarchical Recipe Transformer

The Hierarchical Recipe Transformer is trained and evaluated on the largest public recipe dataset, Recipe1M+. Competing models are trained and evaluated on the same dataset under identical device configurations. All models are evaluated in both directions, namely image-to-recipe and recipe-to-image retrieval.

Top-5 results for image-to-recipe and recipe-to-image queries. The query is highlighted in blue and the correct result in green.

Amazon’s Hierarchical Recipe Transformer outperforms competing models such as R2GAN (a generative adversarial network), MCEN (a latent-variable model), ACME (adversarial cross-modal embeddings), SCAN (semantic consistency and attention mechanisms) and DaC (dividing and conquering cross-modal recipe retrieval) on the MedR and recall metrics.

Moreover, the Hierarchical Recipe Transformer is evaluated under incremental configurations: the pair loss (supervised recipe-image training), the recipe loss (self-supervised recipe-only training) and a Vision Transformer (ViT) image encoder.

Image-recipe evaluation pair (query highlighted in blue) and the model’s incremental performance with the pair loss, the recipe loss and the ViT encoder. Green highlights show exact retrieval.

Amazon’s Hierarchical Recipe Transformer achieves state-of-the-art performance across all retrieval metrics and scenarios.

Images and illustrations other than code outputs are obtained from this source.

Read more about this architecture here.

Find the source code repository here.



Rajkumar Lakshmanamoorthy

A geek in Machine Learning with a Master's degree in Engineering and a passion for writing and exploring new things. Loves reading novels, cooking, practicing martial arts, and occasionally writing novels and poems.
