The ability to generalise has helped neural networks learn tasks at a faster rate. Generalising, in this context, means correctly classifying previously unseen data that belongs to the same classes as the training data. This can be likened to transfer learning: a model, or an agent in a reinforcement learning environment, is trained on one task and uses that knowledge to perform a new task.

Getting networks to learn strategies that generalise better is usually the aim behind any algorithmic enhancement. But how good are these networks at generalisation immediately after training?

Exploring this question leads to pressing challenges in training deep learning algorithms, such as the need for long hours of training and the efficacy of the resulting model.

To demonstrate how a network generalises, researchers from Stanford University, in collaboration with DeepMind and University College London, conducted experiments on agents in a simulated 3D world.

The agents are monitored for how they learn to carry out instructions that contain verbs. For example, an agent can be asked to “lift” an object or “find” a particular thing within that environment.

Overview Of The Process

Figure: agent architecture, via paper by Stanford et al.

The figure in the paper illustrates the architecture used in all experiments. The simplicity of the architecture is intended to emphasise the generality of the findings.

The agent is assessed through tasks that require visual, language and memory-related skills. The whole process can be summarised in three steps:

	The visual observations are passed to a convolutional neural network. The output of this network is concatenated with an embedding of the language observation.
	Language instructions are received at every timestep as a string. The agent splits each string into words and encodes them with a (word-level) LSTM network. The final hidden state is concatenated with the output of the visual processor to yield a representation at each timestep.
	The multimodal representation is passed to a 128-unit LSTM. At each timestep, the state of this LSTM is multiplied by a weight matrix, and the output is passed through a softmax function.

The agent receives a positive reward if it finds or lifts the correct object, and the episode ends with no reward if the agent finds or lifts the incorrect object.

Systematic Generalization In Agents

In their experiments, the researchers consider the following three factors:

	the number of words and objects experienced during training;
	a first-person egocentric perspective; and
	the diversity of perceptual input afforded by the perspective of a first-person interactive agent over time.

In the 3D environment, executing a simple instruction like ‘find a toothbrush’ requires an average of 20 action decisions, whereas a well-trained agent in the researchers' corresponding grid world can accomplish it in six actions on average.

In this environment, the agent observes the world from a first-person perspective, and the Unity objects are 3D renderings of everyday objects. The environment has simulated physics that enables objects to be picked up, moved and stacked. The agent’s action space consists of 26 actions that allow it to change its location and its field of vision, and to grip, lift, lower and manipulate objects.

The results showed that the agent learned a notion of what it is to lift an object (and how this binds to the word lift) with sufficient generality that it can, without further training, apply it to novel objects, or to familiar objects with novel modes of linguistic reference.

The findings of this work can be summarised as follows:

	A neural-network-based agent with standard architectural components can learn to execute goal-directed motion in response to instructions.
	The first-person perspective of an agent acting over time plays an important role in the emergence of this generalisation.
	Language can provide a form of supervision for how to break down the world and/or learned behaviours into meaningful sub-parts, which in turn might stimulate systematicity and generalisation.
	Agents trained in 3D worlds generalise better. In the 3D world, the agent experiences a much richer variety of (highly correlated) visual stimuli in any particular episode.

Outlook

The significance of these experiments is easier to appreciate in the context of artificial general intelligence. A robot that listens to voice-based instructions and performs actions such as moving and finding requires a combination of the most efficient natural language and reinforcement learning models. Though the results of these experiments only apply to simpler environments, it is reasonable to expect that, in due course and with improved models and greater resources, systematic generalisation will emerge more readily than it does today.
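To make the final step under Overview Of The Process more concrete, the following is a minimal sketch of how a 128-unit LSTM state could be turned into a distribution over the 26 actions, together with the reward rule described there. It is not the researchers' implementation. The function names (`fuse`, `action_distribution`, `episode_reward`), the reward value of 1.0 and the random stand-in weights are all assumptions made for illustration; the convolutional network and the LSTMs themselves are omitted.

```python
import math
import random

HIDDEN_UNITS = 128  # size of the core LSTM, as stated in the article
NUM_ACTIONS = 26    # size of the action space, as stated in the article
POSITIVE_REWARD = 1.0  # assumed value; the article only says "positive"


def softmax(logits):
    """Convert a list of real-valued logits into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual_features, language_embedding):
    """Concatenate visual and language vectors into one representation."""
    return list(visual_features) + list(language_embedding)


def action_distribution(lstm_state, weights):
    """Multiply the LSTM state by a weight matrix, then apply softmax.

    `weights` has one row per action and one column per hidden unit.
    """
    logits = [sum(w * h for w, h in zip(row, lstm_state)) for row in weights]
    return softmax(logits)


def episode_reward(chosen_object, target_object):
    """Positive reward for the correct object, no reward otherwise."""
    return POSITIVE_REWARD if chosen_object == target_object else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Random numbers stand in for a learned LSTM state and learned weights.
    state = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_UNITS)]
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_UNITS)]
               for _ in range(NUM_ACTIONS)]
    probs = action_distribution(state, weights)
    action = rng.choices(range(NUM_ACTIONS), weights=probs, k=1)[0]
    print("sampled action index:", action)
    print("reward for a correct pick:", episode_reward("toothbrush", "toothbrush"))
```

In a trained agent the state and the weight matrix would come from learning rather than from a random number generator; the sketch only shows the shape of the computation from state to action probabilities.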