Now Reading
How Good Are Deep Learning Networks At Picking Up Visual Cues

How Good Are Deep Learning Networks At Picking Up Visual Cues

Ram Sagar

The ability to generalise have made neural networks learn tasks at a faster rate.Generalising in this context means to classify data from the same class as the learning data that it has never seen before. In other words, this can be likened to transfer learning. A model or an agent in a reinforcement learning environment is trained on one task and uses this knowledge to perform a task that is new.

Making the networks to learn new strategies to generalise better is usually the aim behind any algorithmic enhancement. But how good are these networks at generalisation immediately after training? 

Exploring this question will lead one to pressing challenges of training deep learning algorithms such as the need for long hours of training, the efficacy of the model and other such questions.

To demonstrate how the network generalises, researchers from Stanford University in collaboration with  DeepMind and University College of London, conduct experiments on agents in a simulation of 3D world. 

The agents are monitored for how they learn actions that contain verbs. For example, an agent can be asked to “lift” an object or “find” a particular thing within that environment.

Overview Of The Process

via paper by Stanford et al.,

The above figure is an illustration of the architecture used in all experiments. The  simplicity of the architecture is intended to emphasize the generality of the findings.

The agent is assessed through tasks that require visual, language and memory related skills. The whole process can be summarised in three steps:

  1. The visual observations are passed on to a convolutional neural network. The output of this network is joined with an embedding of the language observation.
  2. Language instructions are received at every timestep as a string.  The agent splits these with a (word-level) LSTM network. The final hidden state is concatenated with the output of the visual processor to yield a representation at each timestep.
  3. The  multimodal  representation  is passed to a 128-unit LSTM. At each timestep, the state of this LSTM is multiplied by a weight matrix containing and then the output is passed through a softmax function.

The agent receives a positive reward if it finds or lifts the correct object, and the episode ends with no reward if the agent finds or lifts the incorrect object.

Systematic Generalization In Agents

In their experiments, the researchers consider addressing the following three factors:

See Also
DeepMind Vs Google
DeepMind Vs Google: The Inner Feud Between Two Tech Behemoths

  • the number of words and objects experienced during training; 
  • a first-person egocentric perspective; and 
  • the diversity of perceptual input afforded by the perspective of a first-person interactive agent over time.

Executing a simple instruction like ‘find a toothbrush’ (which can be accomplished on average in six actions by a well-trained agent in our corresponding grid world) requires an average of 20 action decisions.

In this environment, the agent observes the world from a first-person perspective, the Unity objects are 3D renderings of everyday objects, the environment has simulated physics enabling objects to be picked-up, moved and stacked, and the agent’s action-space consists of 26 actions that allow the agent to move its location, its field of vision, and to grip, lift, lower and manipulate objects. 

The results showed that the agent therefore learned a notion of what it is to lift an object (and how this binds to the word lift) with sufficient generality that it can, without further training, apply it to novel objects, or familiar objects with novel modes of linguistic reference.

The findings of this work can be summarised as follows:

  • Neural-network-based agent with standard architectural components can learn to execute goal-directed motion in response to instructions.
  • The first-person perspective of an agent acting over time plays an important role in the emergence of this generalization.
  • Language can provide a form of supervision for how to break down the world and/or learned behaviours into meaningful sub-parts, which in turn might stimulate systematicity and generalisation.
  • Agents trained in 3D worlds generalize better. in  the 3D world, the agent experiences a much richer variety of  (highly correlated) visual stimuli in any particular episode.


The significance of these experiments can be better understood when put in the context of artificial general intelligence. A robot listening to the voice based instructions and performing actions such as moving and finding requires a symbiosis of most efficient natural language and reinforcement learning models. Though the results of these experiments only cater to simpler environments, we can safely assume that in due course, with improved models and unlimited resources, systematic generalization might emerge more readily than ever before.

Provide your comments below


If you loved this story, do join our Telegram Community.

Also, you can write for us and be one of the 500+ experts who have contributed stories at AIM. Share your nominations here.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top