
How Amazon’s Image-Recipe Hierarchical Transformer excels in Cross-modal Recipe Retrieval

Rajkumar Lakshmanamoorthy

Though food is essential to everyone, it is more than a basic need: humans consistently show interest in exploring new foods and improving the taste of their traditional dishes. The digital world offers a great way to digitise food recipes by listing ingredients, nutritional information, cooking instructions, supporting images and videos, and reviews and ratings. For over a decade, AI/ML-based recipe retrieval has attempted to satisfy people’s desire to develop their cooking skills and try something new and delicious!

Cross-modal recipe retrieval is a digital recipe approach in which a machine learning model outputs a text recipe when given an image of food. The task is challenging because it spans two entirely different modalities: natural language processing and image processing. Plenty of training data is available across various sites, which makes modelling possible. However, the data is spread across those sites without a consistent structure or any guarantee of completeness. Much research has produced impressive machine learning models, yet they fall short of human expectations on performance.

Recent cross-modal recipe retrieval approaches use LSTMs to encode text recipes alongside the corresponding image embeddings. These models typically rely on heavily pre-trained text representations, complex multi-stage training strategies and adversarial losses. Amaia Salvador, Erhan Gundogdu, Loris Bazzani and Michael Donoser of Amazon have introduced a transformer-based cross-modal recipe retrieval method that is straightforward, simple and versatile to train and deploy.

Attention-based transformer networks have recently replaced traditional convolutional neural networks and recurrent neural networks in various domains, including text, audio, image, video and structured data, showing computational efficiency and performance improvements over those traditional approaches. To this end, the Amazon scientists have applied a hierarchical transformer-based, self-supervised approach to the inter-domain task of cross-modal recipe retrieval with great success. This hierarchical recipe transformer is an end-to-end machine learning model with attention-based encoders for both text and images.

Hierarchical Recipe Transformer overview during training

This hierarchical model has two parallel encoders, one for image encoding and another for recipe text encoding. Recipe text is encoded hierarchically, from the recipe’s title through its ingredients to its instructions, with an individual transformer encoder for each of these components. The resulting embeddings are supplied to a final recipe encoder that is paired with the image encoder. The hierarchical transformer encoder (HTR) reads the text sentence by sentence, so it retrieves the ingredient and instruction information correctly, without data loss or mismatch.
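The two-level design described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: all dimensions, layer counts and the mean-pooling choice are assumptions. A first transformer encodes the words of each sentence (a title line, an ingredient, or an instruction), its pooled outputs form a sequence of sentence embeddings, and a second transformer encodes that sequence into one recipe-component vector.

```python
import torch
import torch.nn as nn

class HierarchicalTextEncoder(nn.Module):
    """Sketch of a hierarchical transformer text encoder: a word-level
    transformer encodes each sentence, then a sentence-level transformer
    encodes the sequence of pooled sentence embeddings. Hyperparameters
    here are illustrative only."""

    def __init__(self, vocab_size=1000, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        word_layer = nn.TransformerEncoderLayer(
            dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(word_layer, n_layers)  # level 1
        sent_layer = nn.TransformerEncoderLayer(
            dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.sentence_encoder = nn.TransformerEncoder(sent_layer, n_layers)  # level 2

    def forward(self, token_ids):
        # token_ids: (batch, n_sentences, n_tokens)
        b, s, t = token_ids.shape
        words = self.embed(token_ids.view(b * s, t))   # (b*s, t, dim)
        sent = self.word_encoder(words).mean(dim=1)    # pool words -> sentence vectors
        sent = sent.view(b, s, -1)                     # (b, s, dim)
        out = self.sentence_encoder(sent).mean(dim=1)  # pool sentences -> one vector
        return out                                     # (b, dim)

enc = HierarchicalTextEncoder()
ids = torch.randint(0, 1000, (2, 3, 5))  # 2 recipes, 3 sentences, 5 tokens each
print(enc(ids).shape)  # torch.Size([2, 64])
```

In the full model, one such encoder would be run per component (title, ingredients, instructions) and a final recipe encoder would merge their outputs into the embedding that is matched against the image encoder.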

Difference between the traditional Transformer Encoder (on left) and the Hierarchical Transformer Encoder (on right)

The pair of encoders for image and text is trained simultaneously with a pair loss. However, some large recipe datasets lack accompanying images, which rules out supervised image-recipe training. For recipe text without an accompanying image, a self-supervised learning approach is introduced with a dedicated loss function, known as the recipe loss. The model can therefore be trained with either recipe-image pairs or recipe-only data.
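A pair loss of this kind is commonly implemented as a bidirectional triplet loss over a batch. The sketch below is a hedged illustration under the assumption of all in-batch negatives and a cosine similarity score; the paper's exact loss and negative-sampling strategy may differ. As the comment notes, the self-supervised recipe loss can reuse the same machinery on recipe components alone.

```python
import torch
import torch.nn.functional as F

def directional_triplet(sim, margin):
    """Triplet loss over one retrieval direction: the diagonal of `sim`
    holds the positive pairs, every off-diagonal entry is a negative."""
    pos = sim.diag().unsqueeze(1)                         # (n, 1) positives
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    return F.relu(margin - pos + sim)[off_diag].mean()

def pair_loss(img_emb, rec_emb, margin=0.3):
    """Sketch of a bidirectional pair loss: a matching image and recipe
    (same batch index) should score higher than any in-batch mismatch
    by at least `margin`, in both retrieval directions."""
    img = F.normalize(img_emb, dim=1)
    rec = F.normalize(rec_emb, dim=1)
    sim = img @ rec.t()                                   # cosine similarities
    return directional_triplet(sim, margin) + directional_triplet(sim.t(), margin)

# The recipe loss can reuse the same function on recipe components, e.g.
# pulling a title embedding towards its own ingredients+instructions
# embedding -- no paired image is needed, so it is self-supervised.
img = torch.eye(4, 8)  # 4 orthonormal "image" embeddings
rec = torch.eye(4, 8)  # perfectly matching "recipe" embeddings
print(pair_loss(img, rec).item())  # 0.0: positives already beat negatives by > margin
```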

Self-supervised training strategy for recipe-only data

Python Implementation

Amazon’s Image-Recipe Transformer requires Git’s LFS (Large File Storage) module. The following commands install git-lfs on the local machine.

 curl -s | sudo bash
 sudo apt-get install git-lfs
 git lfs install 


Install the timm module using the following command.

!pip install timm

Download the source code from the official repository to the local machine.

!git clone


Change into the source directory and verify that the source files downloaded properly using the following commands.

 %cd /content/image-to-recipe-transformers/
 !ls -p 


If Anaconda is not installed on the machine, install the Anaconda-3 package first.


Create the development environment from within the base conda environment using the following command. It takes some time to install the dependencies.

conda env create -f environment.yml


Then activate the newly created environment.

conda activate im2recipetransformers


Download the recipe data from the Recipe1M dataset by creating an account. This dataset contains more than one million recipes and around 13 million supporting images. Once the dataset is downloaded, extract it and move it to a directory named /root/DATASET_PATH. The following command preprocesses the data.

!python --root DATASET_PATH

Start training using the following command. Training may take a long time depending on the device configuration and available memory.


 %cd /content/image-to-recipe-transformers/src/
 !python --model_name model --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints 

Launch tensorboard logging using the following command.

!tensorboard --logdir "./" --port PORT

Test the trained model on the test split using the following command.


 %cd /content/image-to-recipe-transformers/src/
 !python --model_name model --eval_split test --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints 

Calculate evaluation metrics such as MedR (median rank) and Recall@K using the following command.

 %cd /content/image-to-recipe-transformers/src/
 !python --embeddings_file /path/to/saved/model/checkpoints/model/feats_test.pkl --medr_N 10000 
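These retrieval metrics are straightforward to compute from the saved embeddings. The sketch below is an illustrative NumPy implementation, not the repository's exact script: it assumes the matching recipe shares the query's index, ranks candidates by cosine similarity, and reports the median rank of the true match (MedR) plus the fraction of queries whose match lands in the top K (Recall@K).

```python
import numpy as np

def retrieval_metrics(img_emb, rec_emb, ks=(1, 5, 10)):
    """Compute MedR and Recall@K for image->recipe retrieval, assuming
    row i of `img_emb` matches row i of `rec_emb`."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    rec = rec_emb / np.linalg.norm(rec_emb, axis=1, keepdims=True)
    sim = img @ rec.T                        # cosine similarity matrix
    order = np.argsort(-sim, axis=1)         # candidates, best first
    # rank of the true match for each query (1 = retrieved first)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(len(sim))])
    medr = np.median(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return medr, recall

# Toy check: identical embeddings give perfect retrieval.
emb = np.eye(6)
medr, recall = retrieval_metrics(emb, emb)
print(medr, recall[1])  # 1.0 1.0
```

Lower MedR is better (a MedR of 1.0 means the correct recipe is the top result for at least half the queries), while higher Recall@K is better.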

Performance of Hierarchical Recipe Transformer

The Hierarchical Recipe Transformer is trained and evaluated on the largest public recipe dataset, Recipe1M. Competing models are trained and evaluated on the same dataset under identical device configurations. All models are evaluated in both directions, namely, image-to-recipe retrieval and recipe-to-image retrieval.

Top 5 results for image-recipe or recipe-image query-results pair. The query is highlighted in the blue coloured window and the correct result is highlighted in a green coloured window. 

Amazon’s Hierarchical Recipe Transformer outperforms every other competing model, such as R2GAN (Generative Adversarial Network), MCEN (Latent Variable Model), ACME (Adversarial Cross-Modal Embeddings), SCAN (Semantic Consistency and Attention Mechanisms) and DaC (Dividing and Conquering Cross-Modal Recipe Retrieval), on the MedR and recall metrics.

Moreover, the Hierarchical Recipe Transformer is evaluated in incremental stages: with the pair loss (supervised recipe-image training), the recipe loss (self-supervised recipe-only training) and a Vision Transformer (image encoding).

Image-Recipe (highlighted in blue) evaluation pair and the model’s incremental performance with pair loss, recipe loss and ViT encoder. Green coloured highlights show exact retrieval. 

Amazon’s Hierarchical Recipe Transformer achieves state-of-the-art performance on all retrieval metrics and in all retrieval scenarios.

Images and illustrations other than code outputs are obtained from this source.

Read more about this architecture here.

Find the source code repository here.


Copyright Analytics India Magazine Pvt Ltd
