Using A Unique Neural Network Framework For Visual Question Answering

Over the last few years, areas of artificial intelligence such as computer vision (CV) and natural language processing (NLP) have seen tremendous growth, driven largely by improvements in the quality of research in these fields. Although AI research draws insights from various disciplines, CV and NLP have had few methods that reason over images and text together, as tasks such as image captioning require. Apart from this, AI needs a standard metric for monitoring progress, and defining one is a tough challenge.

In this article, we will explore Visual Question Answering (VQA), a system that answers natural language questions about images, and how it is improved by a neural network framework known as the End-to-End Module Network.



Information From Images

In a VQA system, images and natural language questions are provided as input, and the system gives a natural language answer as output. Training and testing such a system requires a large amount of data. The questions usually draw on a variety of abilities, such as detecting objects, identifying activities, and applying common sense and background knowledge.

Deriving questions from pictures (Image courtesy: VQA, research paper by Stanislaw Antol)

Lately, research on improving VQA has been garnering a lot of attention. From drawing information out of large datasets to the use of recurrent neural networks (RNN) and convolutional neural networks (CNN), VQA has seen many modifications. Now, researchers at the University of California, Berkeley, in collaboration with Facebook and Boston University, have proposed a novel neural network framework called the End-to-End Module Network, which is intended to improve VQA.


End-to-End Module Networks

According to the researchers, these networks aim to solve VQA tasks with a class of models that predict a modular network architecture directly from the text of the question and then apply that architecture to the image. Unlike earlier neural module networks, which relied on an external parser to turn the question into a network layout, the model learns to predict the layout itself.

Their neural network model has two components. The first one is a set of modules called ‘co-attentive neural modules’ which have parameterised functions for solving sub-tasks. The second component is a layout policy which creates individual neural layouts to provide responses based on the questions encountered in the VQA system.

The ‘co-attentive neural modules’ in the model are assembled into a neural network. Each module takes tensors as input, together with features from the image and the question text, and produces a single tensor as output. In the study, every input tensor is an image attention map over a convolutional feature grid, and the output tensor is either another attention map or a probability distribution over the possible answers. In total, nine modules are studied to extract text and image features.
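To make the module idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the researchers' implementation: the function names `find`, `and_` and `describe`, the dot-product scoring, and the tiny four-cell feature grid are all assumptions chosen for readability. It shows one module that produces an attention map, one that combines two maps, and one that turns a map into a distribution over answers.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def find(image_feats, text_vec):
    """Attention-producing module: score every grid cell against a text vector."""
    scores = [sum(f * t for f, t in zip(cell, text_vec)) for cell in image_feats]
    return softmax(scores)

def and_(att_a, att_b):
    """Attention-combining module: element-wise minimum of two attention maps."""
    return [min(a, b) for a, b in zip(att_a, att_b)]

def describe(att, image_feats, answer_weights):
    """Answer-producing module: pool features under the map, then score answers."""
    dim = len(image_feats[0])
    pooled = [sum(a * cell[d] for a, cell in zip(att, image_feats)) for d in range(dim)]
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in answer_weights]
    return softmax(scores)

# Hypothetical 4-cell feature grid; feature 0 ~ "red", feature 1 ~ "circle".
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
att_red = find(image_feats, [2.0, 0.0])
att_circle = find(image_feats, [0.0, 2.0])
att_both = and_(att_red, att_circle)

# Hypothetical answer weights for the two answers "yes" and "no".
answer_probs = describe(att_both, image_feats, [[1.0, 1.0], [-1.0, -1.0]])
```

In this toy grid only the third cell is both "red" and a "circle", so the combined attention map peaks there, and the answer distribution favours the first answer.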

To choose the most appropriate reasoning steps for each question, a layout policy is implemented as a sequence-to-sequence recurrent neural network. This policy outputs a probability distribution over layouts, from which a layout is selected. Lastly, a neural network is built by arranging the neural modules according to the layout produced by the policy.
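The assembly step can be sketched as follows. This is an illustration under stated assumptions, not the published code: it assumes a layout is written as a sequence of module names in postfix (reverse Polish) order, and the module table `MODULES`, its toy attention values, and the helper names `assemble_and_run` and `layout_log_prob` are hypothetical.

```python
import math

# Hypothetical module table: name -> (number of attention inputs, function).
# The functions work on toy attention maps represented as lists of floats.
MODULES = {
    "find_red":    (0, lambda: [0.9, 0.1, 0.8, 0.0]),
    "find_circle": (0, lambda: [0.1, 0.9, 0.7, 0.0]),
    "and":         (2, lambda a, b: [min(x, y) for x, y in zip(a, b)]),
    "exist":       (1, lambda a: "yes" if max(a) > 0.5 else "no"),
}

def layout_log_prob(step_probs):
    """Log-probability of a layout as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in step_probs)

def assemble_and_run(layout):
    """Evaluate a postfix layout with a stack, wiring module outputs to inputs."""
    stack = []
    for token in layout:
        arity, fn = MODULES[token]
        if len(stack) < arity:
            raise ValueError(f"module {token!r} needs {arity} inputs")
        split = len(stack) - arity
        args = stack[split:]      # the last `arity` results are this module's inputs
        del stack[split:]
        stack.append(fn(*args))
    if len(stack) != 1:
        raise ValueError("layout must reduce to exactly one output")
    return stack[0]

# "Is there a red circle?" as a postfix layout.
layout = ["find_red", "find_circle", "and", "exist"]
answer = assemble_and_run(layout)
log_p = layout_log_prob([0.9, 0.8, 0.95, 0.99])  # made-up per-token probabilities
```

A postfix encoding is convenient here because a flat token sequence, which is what a sequence-to-sequence model emits, maps to exactly one tree of modules.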

However, the assembled networks must be trained before they give meaningful output, so a training step is included in the process. Training minimises a loss function defined as an expectation over the layouts produced by the layout policy. The loss function, as described by the researchers, is given below:

“Let $\theta$ be all the parameters in our model. Suppose we obtain a layout $l$ sampled from $p(l \mid q; \theta)$ and receive a final question answering loss $\tilde{L}(\theta, l; q, I)$ on question $q$ and image $I$ after predicting an answer using the network assembled with $l$. Our training loss function $L(\theta)$ is as follows.

$$L(\theta) = \mathbb{E}_{l \sim p(l \mid q; \theta)}\left[\tilde{L}(\theta, l; q, I)\right]$$

where we use the softmax loss over the output answer scores as $\tilde{L}(\theta, l; q, I)$ in our implementation.”
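The expectation in the quoted loss can be illustrated numerically. The sketch below assumes a question with only three candidate layouts, with made-up probabilities and per-layout losses; the layout names and the helpers `expected_loss` and `sampled_loss` are hypothetical. It compares the exact expectation with a Monte Carlo estimate obtained by sampling layouts, which is the only practical option when the space of layouts is too large to enumerate.

```python
import random

# Made-up layout distribution p(l | q) and per-layout answer losses.
layout_probs = {
    "find->describe": 0.7,
    "find->relocate->describe": 0.2,
    "find->and->describe": 0.1,
}
layout_losses = {
    "find->describe": 0.4,
    "find->relocate->describe": 1.5,
    "find->and->describe": 2.0,
}

def expected_loss(probs, losses):
    """Exact expectation: sum over layouts of p(l | q) times the loss for l."""
    return sum(p * losses[layout] for layout, p in probs.items())

def sampled_loss(probs, losses, num_samples, rng):
    """Monte Carlo estimate: average the loss over layouts drawn from p(l | q)."""
    layouts = list(probs)
    weights = [probs[layout] for layout in layouts]
    draws = rng.choices(layouts, weights=weights, k=num_samples)
    return sum(losses[layout] for layout in draws) / num_samples

exact = expected_loss(layout_probs, layout_losses)
estimate = sampled_loss(layout_probs, layout_losses, 20000, random.Random(0))
```

With enough samples the estimate settles close to the exact value, but any single sample can be far from it, which is why the variance of sampled estimates matters during training.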

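Because the layout is sampled, the gradient of this loss has to be estimated, and the estimate can vary widely from sample to sample. The researchers reduce this variance by introducing a baseline, which is subtracted from the loss wherever the loss weights the gradient of the layout probability. They also point out that optimising the loss is a challenge, since the layout policy and the module parameters have to be learned together.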
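The role of the baseline can be written out as a short derivation. This is a sketch of the standard policy-gradient estimator with a baseline, using the symbols of the quoted loss; the sample count $M$, the sampled layouts $l_m$ and the baseline symbol $b$ are notation introduced here, and the choice of $b$ (for example, a running average of recent losses) is an assumption rather than something stated in this article.

```latex
% Gradient of the expected loss, estimated from M sampled layouts l_1, ..., l_M
\nabla_\theta L(\theta) \approx \frac{1}{M} \sum_{m=1}^{M}
  \Big( \big[\tilde{L}(\theta, l_m; q, I) - b\big]\,
        \nabla_\theta \log p(l_m \mid q; \theta)
      + \nabla_\theta \tilde{L}(\theta, l_m; q, I) \Big)

% Subtracting b does not bias the estimate, because
\mathbb{E}_{l \sim p(l \mid q; \theta)}\big[ b \, \nabla_\theta \log p(l \mid q; \theta) \big]
  = b \sum_{l} \nabla_\theta p(l \mid q; \theta)
  = b \, \nabla_\theta 1
  = 0
```

The baseline therefore leaves the expected gradient unchanged and only shrinks the spread of the individual samples around it.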

The fully built model was tested on three datasets: first on a small dataset known as SHAPES, and then on the larger CLEVR and VQA datasets. Its performance on these datasets was found to be very satisfactory (close to 90 percent of the reasoning was accurate for the questions).


Although the model has achieved considerable success, it has yet to become a standard way of handling visual and text data together in AI systems. Nonetheless, with advancements of this kind, VQA systems may soon be powered by such modular neural networks.


Abhishek Sharma
I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.
