MIT One Ups The Game For #Foodgrammers, How Neural Network Suggests Recipes Based On Photos

Let’s admit it — taking countless photos of well-plated dishes is fun. We millennials enjoy documenting great food before we dive into the calories. Choosing the right filter, using the appropriate hashtag and posting it on social media is almost a ritual now. Now MIT researchers have taken this #foodgram addiction to a whole new level by training a machine learning system to look at a photograph of food, dissect the ingredients and suggest similar recipes.


Researchers from the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a neural network trained on a dataset called Recipe1M (its end product is called Pic2Recipe) that will not only help us learn recipes but also better understand people’s eating habits and dietary preferences.

What Is Recipe1M

Recipe1M is the largest structured dataset of its kind, and the neural network trained on it learns to find patterns and make connections between food images and the corresponding ingredients and recipes. Recipe1M contains one million structured cooking recipes and 800,000 food images. By looking at the photo of a food item, the system can identify ingredients like flour, eggs and butter, and then suggest recipes that it determines to be similar to images from the database.

The model jointly learns to embed images and recipes in a common space, which is semantically regularised by the addition of a high-level classification task.

To create a larger database, the MIT researchers collected images and recipes that already exist on cooking websites. Compiling data from sites such as All Recipes, the CSAIL researchers created Recipe1M.

According to a release on the MIT website, the neural system did particularly well with desserts like cookies or muffins, since those were a main theme in the database. However, it had difficulty determining ingredients for more ambiguous foods, like sushi rolls and smoothies. It was also often stumped when there were similar recipes for the same dish. For example, there are dozens of ways to make lasagna, so the team needed to make sure that the system wouldn’t “penalize” recipes that are similar when trying to separate those that are different.

Dissecting Recipe1M

The contents of the Recipe1M dataset are grouped into two layers. The first layer contains basic information in text format: the title of the recipe, a list of ingredients and a sequence of instructions for preparing the dish.

The second layer is built upon the first and includes the associated RGB images in JPEG format. Additionally, a subset of recipes is annotated with course labels like appetizer, side dish, dessert, etc.
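The two-layer structure described above can be pictured as a single record per recipe. The sketch below is purely illustrative — the field names are hypothetical and do not reflect the dataset’s actual JSON schema:

```python
# Hypothetical sketch of one Recipe1M-style entry, reflecting the
# two-layer structure: layer 1 holds the textual recipe, layer 2 the
# associated images and optional course-label annotation.
recipe_entry = {
    # Layer 1: textual data
    "title": "Classic Blueberry Muffins",
    "ingredients": [
        "2 cups all-purpose flour",
        "2 large eggs",
        "1/2 cup butter, melted",
        "1 cup fresh blueberries",
    ],
    "instructions": [
        "Preheat the oven to 190 C.",
        "Whisk the dry ingredients together.",
        "Fold in the eggs, butter and blueberries.",
        "Bake for 25 minutes.",
    ],
    # Layer 2: associated media and optional annotations
    "images": ["muffins_01.jpg"],   # RGB images in JPEG format
    "course_label": "dessert",      # only a subset is annotated
}

def has_both_layers(entry):
    """Check that an entry carries layer-1 text and layer-2 media."""
    layer1 = {"title", "ingredients", "instructions"} <= entry.keys()
    layer2 = "images" in entry
    return layer1 and layer2
```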

The research paper also states that the average recipe in the dataset consists of nine ingredients that are transformed over the course of 10 instructions. Half of the recipes have images, which, due to the nature of the data sources, depict the fully prepared dish. Recipe1M also includes 0.4 percent duplicate recipes and two percent duplicate images.

However, the researchers explain that, excluding those 0.4 percent duplicates, 20 percent of recipes have non-unique titles but differ symmetrically by a median of 16 ingredients, while 0.2 percent of recipes share the same ingredients; these tend to be relatively simple, with a median of six ingredients.

How Recipe1M Works

The joint embedding model of Recipe1M is built upon two representations — recipe and image.

Recipe Representation

Ingredients and cooking instructions are the two major components of a recipe. Each recipe contains a set of ingredient texts, and for each ingredient the researchers used a word2vec representation. The actual ingredient names are extracted from each ingredient text. For example, in “two tablespoons of olive oil”, olive oil is extracted as the ingredient name and treated as a single word for the word2vec computation. The initial ingredient-name extraction task is solved by a bi-directional LSTM that performs logistic regression on each word in the ingredient text.
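The preprocessing step above — collapsing a multi-word ingredient name into one token so word2vec treats it as a single vocabulary item — can be sketched as follows. The hand-written lookup here is only a stand-in for the paper’s bi-directional LSTM tagger, and the ingredient list is illustrative:

```python
import re

# Illustrative list standing in for names the LSTM tagger would extract.
KNOWN_INGREDIENTS = ["olive oil", "baking soda", "brown sugar"]

def extract_ingredient_token(ingredient_text):
    """Return the ingredient name as one underscore-joined token,
    so word2vec sees e.g. 'olive_oil' as a single word."""
    text = ingredient_text.lower()
    for name in KNOWN_INGREDIENTS:
        if name in text:
            return name.replace(" ", "_")
    # Fallback: take the last word, which is often the head noun
    # (e.g. "3 large eggs" -> "eggs").
    return re.findall(r"[a-z]+", text)[-1]

print(extract_ingredient_token("Two tablespoons of olive oil"))  # olive_oil
```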

Each recipe also has a list of cooking instructions, for which the researchers used a two-stage LSTM model. First, each instruction, or sentence, is represented as a skip-instructions vector; then an LSTM is trained over the sequence of these vectors to obtain a representation of all the instructions. The resulting fixed-length representation is then fed into their joint embedding model.
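The two-stage idea can be illustrated with a toy numpy sketch. Here a mean of random word vectors stands in for the stage-1 skip-instructions embedding, and a mean over sentence vectors stands in for the stage-2 LSTM; both substitutions are simplifications for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
vocab = {}  # lazily assigned random word vectors

def word_vec(w):
    if w not in vocab:
        vocab[w] = rng.standard_normal(DIM)
    return vocab[w]

def encode_instruction(sentence):
    # Stage 1: map one instruction sentence to a fixed-length vector
    # (mean of word vectors stands in for a skip-instructions embedding).
    words = sentence.lower().split()
    return np.mean([word_vec(w) for w in words], axis=0)

def encode_instructions(sentences):
    # Stage 2: reduce the sequence of sentence vectors to one
    # recipe-level vector (mean stands in for the sequence LSTM).
    return np.mean([encode_instruction(s) for s in sentences], axis=0)

v = encode_instructions(["Preheat the oven.", "Bake for 25 minutes."])
print(v.shape)  # (8,)
```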

Food Image Representation

For the image representation, the researchers adopted the state-of-the-art deep convolutional networks VGG-16 and ResNet-50 and connected them to the joint embedding model.

The recipe model also includes two encoders: one for ingredients and the other for instructions, the combination of which is designed to learn a recipe-level representation. The ingredients encoder combines the sequence of ingredient word vectors. The instructions encoder is implemented as a forward LSTM over the skip-instructions vectors. The outputs of both encoders are concatenated and embedded into a joint recipe-image space.
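The concatenate-and-project step can be sketched with toy dimensions and random weights; in the real model these projections are learned, and retrieval in the joint space is typically done by cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(42)
ING_DIM, INS_DIM, IMG_DIM, JOINT_DIM = 6, 8, 10, 4

# Toy projection matrices standing in for the learned embedding layers.
W_recipe = rng.standard_normal((JOINT_DIM, ING_DIM + INS_DIM))
W_image = rng.standard_normal((JOINT_DIM, IMG_DIM))

def embed_recipe(ing_vec, ins_vec):
    # Concatenate the two encoder outputs, then project to joint space.
    return W_recipe @ np.concatenate([ing_vec, ins_vec])

def embed_image(img_feat):
    return W_image @ img_feat

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = embed_recipe(rng.standard_normal(ING_DIM), rng.standard_normal(INS_DIM))
i = embed_image(rng.standard_normal(IMG_DIM))
# Both modalities now live in the same joint space and can be compared.
print(r.shape, i.shape, cosine(r, i))
```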

The team also incorporated semantic regularisation on their recipe and image embeddings by solving the same high-level classification problem in both modalities with shared high-level weights. The key idea is that if high-level discriminative weights are shared, then both modalities (recipe and image embeddings) should utilise these weights in a similar way, which brings another level of alignment based on discrimination.


Test Results Of Recipe1M

The MIT researchers evaluated the performance of the neural network against Canonical Correlation Analysis baselines and humans, which showed a remarkable improvement over the former while faring comparably to the latter.

The researchers also demonstrated the capabilities of their learned embeddings with simple arithmetic operations. In the context of food recipes, this is what they found:

v(chicken_pizza) - v(pizza) + v(lasagna) = v(chicken_lasagna)

Here, v represents the map into the embedding space.
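A toy demonstration of this arithmetic: with hand-built vectors in which the dish and the topping are separate additive directions, the query vector lands nearest to chicken_lasagna. Real embeddings are learned rather than constructed like this, so the example only illustrates the principle:

```python
import numpy as np

# Hand-built 2-D "embeddings": one axis per dish, topping as an offset.
base = {"pizza": np.array([1.0, 0.0]), "lasagna": np.array([0.0, 1.0])}
topping = {"chicken": np.array([0.5, 0.5])}

emb = {
    "pizza": base["pizza"],
    "lasagna": base["lasagna"],
    "chicken_pizza": base["pizza"] + topping["chicken"],
    "chicken_lasagna": base["lasagna"] + topping["chicken"],
}

# v(chicken_pizza) - v(pizza) + v(lasagna)
query = emb["chicken_pizza"] - emb["pizza"] + emb["lasagna"]

# Nearest neighbour in the embedding space answers the analogy.
nearest = min(emb, key=lambda k: np.linalg.norm(emb[k] - query))
print(nearest)  # chicken_lasagna
```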

The paper also shows some results of same- and cross-modality embedding arithmetic.
To sum up, the neural network is still in an early stage of development, but going forward the researchers plan to use food data more extensively. According to MIT News, the team hopes to improve the system so that it can understand food in even more detail. This could mean being able to infer how a food is prepared (for example, stewed vs. diced) or distinguish different variations of foods, like mushrooms or onions. The researchers are also interested in potentially developing the system into a dinner aide that could figure out what to cook given a dietary preference and a list of items in the fridge.

Copyright Analytics India Magazine Pvt Ltd