Now Reading
Now A Model That Learns Visual Concepts, Words And Semantic Parsing Of Sentences Without Explicit Supervision

Now A Model That Learns Visual Concepts, Words And Semantic Parsing Of Sentences Without Explicit Supervision

Building a computer system that can answer by just looking at the images has been in the works for many years now. Researchers in the field of artificial intelligence are working on this concept to make the current systems more intelligent than ever. Fusing this with natural language processing will not only reduce the time for training a model but will also help in predicting the emotions behind the objects.

How To Start Your Career In Data Science?

In May 2019, researchers from IBM, DeepMind and MIT developed a deep learning model known as the Neuro-Symbolic Concept Learner. It is closely related to the joint learning of vision and natural language processing. This deep learning model learns from natural supervision such as visual perception, words, and semantic language parsing from images and question-answer pairs. 

How It Works

The Neuro-Symbolic Concept Learner uses the techniques of artificial neural networks in order to extract features from images and construct information as symbols. Then a quasi-symbolic program executor is applied to the model to infer the answer of questions which is based on the scene representation.  

For visual perception, the researchers used a pre-trained Mask R-CNN technique in order to generate object proposals for all objects. Then the method of Res-Net is applied to extract the region-based and image-based features. For translating the natural language questions into executable programs basically designed for visual question answering (VQA), a semantic parsing module is applied. 

Behind the Model

The researchers used a dataset known as CLEVR to test the deep learning model. This is a diagnostic dataset that helps in solving visual question answering (VQA). The model learns via curriculum learning such that it starts by learning representations or concepts of the individual objects from short questions on simple scenes. This helps the model to learn object-based concepts such as colours and shapes. The model then learns the relational concepts by leveraging these object-based concepts to interpret object referrals. 

Moreover, the model naturally learns the disentangled visual and language concepts, enabling combinatorial generalisation with respect to both visual scenes and semantic programs. In this case, there are four forms of generalisation, the model first generalises to scenes with more objects and longer semantic programs than those in the training set. Secondly, the model generalises to new visual attribute compositions. Thirdly, it generalises to enable fast adaptation to novel visual concepts and finally, the learned visual concepts transfer to new tasks such as image-caption retrieval.

See Also

NS-CL contains three modules as mentioned below

  • Neural-Based Perception Module: This module works by extracting the object-level representations from the scene.
  • Visually-Grounded Semantic Parser: This module works for translating questions into executable programs. 
  • Symbolic Program Executor: This module reads out the perceptual representation of objects, classifies their attributes or relations, and executes the program to obtain an answer. 

Advantages of NS-CL

  • This deep learning model learns visual concepts with great accuracy rate. The researchers gained a classification accuracy of nearly 99% for all object properties.
  • This model allows data-efficient visual reasoning on the CLEVR dataset.
  • NS_CL generalises well to new attributes, new visual composition as well as new domain-specific languages.
  • This model can be directly applied to visual question answering (VQA).


In order to train deep learning models, it is necessary to have a large amount of data, what we called as big data. It is indeed a complex task to collect big data and work on it. However, organisations are collecting data every day and from this massive number of collected data, organisations are able to utilise only a little chunk from it. This is one of the main reasons why this model is developed. For future work, the researchers would like to extend this framework to other domains such as video understanding and robotic manipulation. 

You can read the full paper here.

Subscribe to our Newsletter

Get the latest updates and relevant offers by sharing your email.
Join our Telegram Group. Be part of an engaging community

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top