A year ago, Facebook launched a unified computer vision model, GrokNet. Now, the social media giant is looking to use the model to power new applications on Facebook, like product tagging, product suggestions, visual search and more.
Facebook AI is building the world’s largest shoppable social media platform. The model is currently live on Facebook Marketplace. Soon, it plans to expand GrokNet to new applications on Facebook and Instagram.
How GrokNet works
GrokNet identifies what products are in an image and predicts their categories, according to a Facebook blog post. Unlike previous models, Facebook’s product recommendation system is an all-in-one model that scales to billions of photos across verticals, including fashion, auto, and home decor.
For instance, when a seller posts an image on their Facebook page, the AI helps identify untagged items and suggests tags based on their product catalogue. So, when a user views an untagged post from a seller, the system recommends similar products below the post from the seller’s product catalogue.
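Facebook has not published the retrieval code behind these suggestions, but the mechanism described above can be sketched as embedding-based nearest-neighbour search. The following is a minimal, illustrative sketch assuming cosine similarity over catalogue embeddings; the function name, embedding sizes, and toy data are all hypothetical:

```python
import numpy as np

def suggest_similar(post_embedding, catalogue_embeddings, top_k=3):
    """Rank catalogue items by cosine similarity to the post's image embedding."""
    post = post_embedding / np.linalg.norm(post_embedding)
    cat = catalogue_embeddings / np.linalg.norm(catalogue_embeddings,
                                                axis=1, keepdims=True)
    scores = cat @ post                       # cosine similarity per item
    return np.argsort(scores)[::-1][:top_k]   # indices of the closest items

# Toy catalogue of 4 items in a 3-d embedding space (illustrative only).
catalogue = np.array([[1.0, 0.0, 0.0],
                      [0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
post = np.array([1.0, 0.05, 0.0])
print(suggest_similar(post, catalogue, top_k=2))  # two most similar items
```

In production, the embeddings would come from GrokNet itself and the search would run over an approximate nearest-neighbour index rather than a brute-force dot product.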
The visual below depicts this. “These are visual demonstrations only — exact model experiences may vary,” wrote Tamara Berg, research scientist manager at Facebook AI.
How GrokNet works (Source: Facebook)
Evolution of GrokNet
GrokNet grew out of an AI research project with initial applications geared for the Facebook Marketplace. The model analysed search queries like ‘mid-century modern sofa’, predicted matches from the search index, and surfaced the most relevant results.
“With billions of product images uploaded to Shops on Instagram and Facebook by sellers, predicting just the right product at any given moment is hard,” said Berg. Facebook has now extended the application to other products. For example, Instagram users can now find similar dresses by tapping on an image. “While it’s still early, we think this will enhance mobile shopping by making even more images on Instagram shoppable,” said Sean Bell, a research scientist at Facebook AI.
How GrokNet is different
However, scaling product recognition systems through supervised learning and manual labelling remains a challenge. Furthermore, the recommendations grow more complex over time, as some attribute-object combinations occur far more frequently in the data than others.
“We have built a new model that learns from some attribute-object pairs and generalises to unseen combinations. So, if you train on blue cars, blue skirts, and blue skies, you would still be able to recognise blue pants even if your model never saw them during training,” said Facebook. The new compositional framework was trained on 78 million public Instagram images — built on top of its research that uses hashtags as weak supervision to achieve SOTA image recognition.
Compositional framework architecture (Source: Facebook)
Facebook has incorporated a new compositional framework that takes object classifier weights and attributes and learns how to compose them into ‘attribute-object classifiers.’ “This makes it possible to predict combinations of attributes and objects not seen during training, and it outperforms the standard approach of individual attribute and object predictions,” said Berg. In other words, it can scale to millions of images and hundreds of thousands of fine-grained class labels, and quickly generate predictions for new verticals.
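The composition idea above can be sketched in a few lines. This is not Facebook’s implementation: the attribute and object weight vectors below are random stand-ins for learned classifier weights, and simple addition stands in for the trained composition network. The point is only that a classifier for an unseen pair like ‘blue pants’ can be assembled from parts seen during training:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy image-embedding dimension

# Stand-ins for learned classifier weights (random for illustration).
attr_weights = {"blue": rng.normal(size=DIM), "red": rng.normal(size=DIM)}
obj_weights = {"car": rng.normal(size=DIM), "skirt": rng.normal(size=DIM),
               "pants": rng.normal(size=DIM)}

def compose(attr, obj):
    """Build an 'attribute-object' classifier from the two weight vectors.
    A real system would learn this composition; addition is a placeholder."""
    return attr_weights[attr] + obj_weights[obj]

def score(image_embedding, attr, obj):
    """Higher score = image more likely to show this attribute-object pair."""
    return float(compose(attr, obj) @ image_embedding)

# A 'blue pants' classifier exists even if that exact pair never
# appeared together in the training data.
img = rng.normal(size=DIM)
print(score(img, "blue", "pants"))
```

The generalisation comes from the composition function being shared across all pairs, so it only needs to see enough attribute-object combinations to learn how attributes modify objects in general.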
Facebook sampled objects and attributes worldwide while collecting the data to train these models. “Although the artificial intelligence field is just beginning to understand the challenges of fairness in AI, we are continuously working to understand and improve the way our products work for everyone,” said Bell.
Towards a multimodal model
Further, to improve content understanding across its platform, Facebook is leveraging SOTA multimodal advancements across formats (image, text, etc.). As a result, it has significantly improved the accuracy of product categorisation.
Facebook combined visual signals from the image with the related text description to produce the final model prediction. “We found a great formula for a multimodal model, which includes a slew of artificial intelligence frameworks and tools,” wrote Berg. It includes Facebook AI’s Multimodal Bitransformer, generalised as the MMF Transformer in Facebook AI’s Multimodal Framework, and the Transformer encoder, pre-trained on public Facebook posts. The early-fusion multimodal transformers outperformed late-fusion architectures.
For images without text details, Facebook added a modality dropout trick during training. It randomly removes either the text or the image when both modalities are present, to ensure robustness against missing details. Compared with vision-only models, this approach delivered significant improvements in accuracy. Soon, Facebook plans to expand these multimodal attributes to other verticals.
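The modality dropout trick described above can be sketched as a small training-time augmentation. This is an illustrative version, not Facebook’s code: the function name, the drop probability, and zeroing as the “missing modality” signal are all assumptions; note that at most one modality is ever dropped, never both:

```python
import numpy as np

def modality_dropout(image_feat, text_feat, p_drop=0.3, rng=None):
    """During training, randomly zero out one modality (never both) so the
    model stays robust when a product post has no text or no usable image."""
    if rng is None:
        rng = np.random.default_rng()
    image_feat, text_feat = image_feat.copy(), text_feat.copy()
    if rng.random() < p_drop:          # drop a modality this step?
        if rng.random() < 0.5:
            image_feat[:] = 0.0        # pretend the image is missing
        else:
            text_feat[:] = 0.0         # pretend the text is missing
    return image_feat, text_feat

img = np.ones(4)
txt = np.ones(4)
img_out, txt_out = modality_dropout(img, txt, p_drop=0.5,
                                    rng=np.random.default_rng(0))
```

Because the model regularly sees one-modality inputs during training, it learns not to rely on any single modality being present at inference time.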