Though food is essential to everyone, it is more than just a basic need. Humans consistently show interest in exploring new foods and improving the taste of their conventional dishes. The present digital world gives a great passage to digitalise food recipes by listing the ingredients, nutritional information, cooking instructions, supporting images and videos, and reviews and ratings. For over a decade, AI/ML-based recipe retrieval has attempted to fulfil this human desire to develop cooking skills and consume something new and delicious!
Cross-modal recipe retrieval is one such digital recipe approach, in which a machine learning model outputs a text recipe when it is provided with an image of food. The task is challenging because it bridges two entirely different modalities: natural language and images. Plenty of training data is available on various sites, which makes modelling possible. Nevertheless, the data are spread across those sites without a proper structure or any assurance of completeness. Research has produced impressive machine learning models, but they still fall short of human expectations on performance.
Recent cross-modal recipe retrieval approaches use LSTM cells to encode the text recipes alongside the corresponding image embeddings. These models mostly require heavily pre-trained text representations, complex multi-stage training strategies and adversarial losses. Amaia Salvador, Erhan Gundogdu, Loris Bazzani and Michael Donoser of Amazon have introduced a transformer-based cross-modal recipe retrieval method that is simple, versatile and straightforward to train and deploy.
Attention-based transformer networks have recently replaced traditional convolutional neural networks and recurrent neural networks in various domains, including text, audio, image, video and structured data. Transformer networks show computational efficiency and performance improvements over those traditional approaches. To this end, the Amazon scientists have applied a hierarchical-transformer-based self-supervised learning algorithm to the interesting inter-domain task of cross-modal recipe retrieval, to great success. This hierarchical recipe transformer is an end-to-end machine learning model with attention-based encoders for both text and images.
This hierarchical model has a two-branch encoder architecture: one branch encodes the image and the other encodes the recipe text. Recipe text is encoded hierarchically, from the recipe's title and ingredients to its instructions, with an individual transformer encoder for each of these components. The resulting component embeddings are supplied to a recipe-level encoder that is paired with the image encoder. Because the hierarchical transformer encoder reads the text sentence by sentence, it retrieves the ingredients and instructions correctly without data loss or mismatch.
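As a rough illustration of the hierarchy (not the authors' implementation), the sketch below uses a toy, weight-free self-attention step: each sentence's token embeddings are pooled into one vector, and the sentence vectors of each component (title, ingredients, instructions) are pooled again. The embedding dimension, the random inputs and the final concatenation are all illustrative assumptions.

```python
import numpy as np

def self_attention(X):
    # Single-head scaled dot-product self-attention (toy version,
    # no learned projection weights).
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def encode(seq):
    # Transformer-style encoder stub: attend, then mean-pool to one vector.
    return self_attention(seq).mean(axis=0)

def hierarchical_encode(sentences):
    # Level 1: each sentence (matrix of token embeddings) -> one vector.
    # Level 2: the sequence of sentence vectors -> one component vector.
    sent_vecs = np.stack([encode(s) for s in sentences])
    return encode(sent_vecs)

rng = np.random.default_rng(0)
d = 8  # illustrative embedding size
title = [rng.normal(size=(3, d))]                      # one "sentence"
ingredients = [rng.normal(size=(4, d)) for _ in range(5)]
instructions = [rng.normal(size=(6, d)) for _ in range(3)]

# One vector per component; a final recipe-level encoder would merge these.
components = [hierarchical_encode(c) for c in (title, ingredients, instructions)]
recipe_embedding = np.concatenate(components)
print(recipe_embedding.shape)  # (24,)
```

Each component is summarised independently before the recipe-level encoder sees it, which is what lets the model read arbitrarily long ingredient and instruction lists sentence by sentence.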
The image and text encoders are trained simultaneously with a pairwise loss. However, some large recipe datasets lack accompanying images, making purely supervised training impossible. For training on recipe text that has no accompanying image, a self-supervised learning approach is introduced with a special loss function known as the recipe loss. The model is therefore robust enough to train with either recipe-image pairs or recipe-only data.
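The two losses can be sketched with a simple cosine-triplet formulation; the exact loss in the paper may differ, and the vectors, margin and negatives below are purely illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.3):
    # Pull the matching pair together and push the hardest negative
    # at least `margin` further away (cosine similarity).
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = cos(anchor, positive)
    neg = max(cos(anchor, n) for n in negatives)
    return max(0.0, neg - pos + margin)

rng = np.random.default_rng(1)
img, txt = rng.normal(size=8), rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(4)]

# Pair loss: image <-> recipe, applied in both retrieval directions.
pair_loss = triplet_loss(img, txt, negatives) + triplet_loss(txt, img, negatives)

# Recipe loss: components of the SAME recipe (e.g. title vs ingredients)
# act as positives for each other -- no image is needed, so recipe-only
# data can still produce a training signal.
title_vec, ingr_vec = rng.normal(size=8), rng.normal(size=8)
recipe_loss = triplet_loss(title_vec, ingr_vec, negatives)
print(pair_loss, recipe_loss)
```

The key design point is that the recipe loss needs only text, so image-less recipes still shape the shared embedding space.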
Amazon’s Image-Recipe Transformer requires Git’s LFS (Large File Storage) extension. The following commands install git-lfs on the local machine.
%%bash
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
Install the timm module using the following command.
!pip install timm
Download the source code from the official repository to the local machine.
!git clone https://github.com/amzn/image-to-recipe-transformers.git
Change the directory to refer to the source contents and check for proper download of source files using the following command.
%cd /content/image-to-recipe-transformers/
!ls -p
If Anaconda is not installed on the machine, the following commands may help install the Anaconda3 package.
!wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
!bash Anaconda3-2020.02-Linux-x86_64.sh
Create the development environment from within the base conda environment using the following command. It takes some time to install the dependencies.

conda env create -f environment.yml

Activate the newly created environment to run the transformer:

conda activate im2recipetransformers
Download the recipe data from the Recipe1M dataset by creating an account. This dataset contains more than one million recipes and around 13 million supporting images. Once the dataset is downloaded, extract it and move it to a directory such as /root/DATASET_PATH (referred to below as DATASET_PATH). The following command preprocesses the data.
!python preprocessing.py --root DATASET_PATH
Start training using the following command. Training may take a long time depending on the device configuration and memory availability.
%cd /content/image-to-recipe-transformers/src/
!python train.py --model_name model --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints
Launch tensorboard logging using the following command.
!tensorboard --logdir "./" --port PORT
Extract the test-split embeddings using the following command.

%cd /content/image-to-recipe-transformers/src/
!python test.py --model_name model --eval_split test --root DATASET_PATH --save_dir /path/to/saved/model/checkpoints
Calculate the evaluation metrics such as MedR and recall using the following command.
%cd /content/image-to-recipe-transformers/src/
!python eval.py --embeddings_file /path/to/saved/model/checkpoints/model/feats_test.pkl --medr_N 10000
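MedR (the median rank of the true match) and recall@k can be computed from saved embeddings roughly as follows. This is a sketch, not the repository's eval.py: the array shapes, the cosine similarity and the synthetic data are assumptions for illustration.

```python
import numpy as np

def retrieval_metrics(img_emb, txt_emb, ks=(1, 5, 10)):
    # L2-normalise so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    # Rank (1-based) of each image's true recipe (the diagonal entry)
    # among all candidate recipes, sorted by decreasing similarity.
    order = (-sims).argsort(axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1
                      for i in range(len(sims))])
    medr = float(np.median(ranks))                        # lower is better
    recall = {k: float((ranks <= k).mean()) for k in ks}  # higher is better
    return medr, recall

# Synthetic sanity check: images are noisy copies of their recipes,
# so the true match should almost always rank first.
rng = np.random.default_rng(2)
txt = rng.normal(size=(100, 16))
img = txt + 0.05 * rng.normal(size=(100, 16))
medr, recall = retrieval_metrics(img, txt)
print(medr, recall)
```

Swapping the two arguments gives the recipe-to-image direction, which is how both retrieval directions reported in the paper can be scored from the same similarity matrix.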
Performance of Hierarchical Recipe Transformer
The Hierarchical Recipe Transformer is trained and evaluated on the largest public recipe dataset, Recipe1M. Competing models are trained and evaluated on the same dataset under identical device configurations. All the models are evaluated in both directions, namely image-to-recipe retrieval and recipe-to-image retrieval.
Amazon’s Hierarchical Recipe Transformer outperforms every other model, such as R2GAN (Generative Adversarial Network), MCEN (Latent Variable Model), ACME (Adversarial Cross-Modal Embeddings), SCAN (Semantic Consistency and Attention Mechanisms) and DaC (Dividing and Conquering Cross-Modal Recipe Retrieval), on the MedR and recall metrics.
Moreover, this Hierarchical Recipe Transformer is evaluated in incremental stages: with the pair loss alone (supervised recipe-image training), with the added recipe loss (self-supervised recipe-only training), and with a Vision Transformer as the image encoder.
Amazon’s Hierarchical Recipe Transformer achieves state-of-the-art performance on all retrieval metrics and in all retrieval scenarios.
Images and illustrations other than code outputs are obtained from this source.
Read more about this architecture here.
Find the source code repository here.