Google’s DeepMind has made a new advance in reinforcement learning, changing how agents complete tasks to receive maximum reward. In an attempt to enhance the way RL learns from experience, the team tried to mimic humans’ capacity for mental “time travel”. This allows machine learning models to make decisions based on potential future outcomes.
Humans learn about events and make decisions based on past experience. The idea is to teach machines to do the same without resorting to trial and error on every occasion, which is what reinforcement learning does.
When humans do something, they judge whether it was the right or wrong decision based on past experience. For one, if we place a glass too close to the edge of the table, we realise that it might hit the floor a moment later. Predicting such instances before the disaster occurs is what makes humans superior to machines: over the years, we have acquired cognitive intuitions that help us make better decisions. ML models, however, have always had to rely on trial and error to determine the best action.
But what if a machine could predict a long-term consequence before going through the experiences that reveal it? Such foresight could help people make better decisions when choosing careers, lifestyles and even monetary investments. This is where Google’s DeepMind team focused its innovation, within a game.
Temporal Value Transport
DeepMind’s deep learning method is called Temporal Value Transport (TVT). It is a way of sending lessons back from the future: the agent assimilates the long-term consequences of various choices and uses them to make the right decision in the present. In a nutshell, it gamifies memory to take informed actions.
However, this does not mean that they are creating a memory or recreating what happens in the human mind. Instead, they are offering a mechanistic description of behaviour that can inspire models in neuroscience, psychology, and behavioural economics. The memory agent uses several objectives to learn, store, and retrieve a record of past states as a kind of memory.
Long-term credit assignment, or discounted utility, is the ability to judge the fruitfulness of an action from its eventual consequences. This response-and-reward methodology is used in reinforcement learning, but it has a major limitation: it struggles to capture long-term correlations.
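To see why discounting loses long-term correlations, consider the standard discounted return used in RL, where rewards arriving far in the future are exponentially down-weighted. The following is a minimal sketch; the reward sequence and discount factor are illustrative, not from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return G = sum_k gamma**k * rewards[k],
    accumulated backwards from the last reward for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1.0 that arrives 500 steps away contributes only
# gamma**500 (about 0.0066) to the return at the first step, so the
# link between an early action and its distant payoff all but vanishes.
delayed = [0.0] * 500 + [1.0]
print(discounted_return(delayed))
```

This exponential fade is the limitation TVT is designed to work around.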
A lot of learning happens in humans without the need for immediate reward or direct feedback. To replicate such abilities, DeepMind uses TVT to send reward signals backwards from far away in the future, creating a feedback loop through the neural network. The researchers used the Neural Turing Machine (NTM), which DeepMind created in 2014. Back then, it was deployed to let a computer search memory records via gradient descent; in this research, it was repurposed to retrieve memories of past actions. Because the technique uses an NTM to handle memory storage and retrieval, the agent is called the Reconstructive Memory Agent (RMA).
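The two ideas in that paragraph can be sketched in a few lines of Python. This is a hedged illustration of the general mechanism, not DeepMind's implementation: `memory_read` is a content-based attention read in the spirit of an NTM read head, and `transport_value` shows the value-transport idea, where a later step that strongly retrieves the memory written at an earlier step splices its value estimate into that earlier step's reward. The `threshold` parameter is an illustrative stand-in for the paper's gating.

```python
import numpy as np

def memory_read(memories, query):
    """Softmax attention over stored memory vectors (rows of `memories`),
    returning one read weight per memory; higher means a stronger match."""
    scores = memories @ query
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def transport_value(rewards, read_weights, values, threshold=0.5):
    """For each step t, if its read weights attend strongly to an earlier
    step s, add the value estimated at t to the reward credited to s."""
    augmented = rewards.copy()
    for t, weights in enumerate(read_weights):
        for s, w in enumerate(weights):
            if s < t and w > threshold:
                augmented[s] += values[t]
    return augmented

rewards = [0.0, 0.0, 0.0]
values = [0.0, 0.0, 2.0]
# Step 2 attends almost entirely to the memory written at step 0,
# so step 0 is credited with step 2's value.
read_weights = [np.array([1.0]),
                np.array([0.5, 0.5]),
                np.array([0.9, 0.05, 0.05])]
print(transport_value(rewards, read_weights, values))  # [2.0, 0.0, 0.0]
```

The point of the splice is that the early step now carries an undiscounted learning signal from the distant future, which an ordinary discounted return would have erased.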
The researchers also noted that similar techniques have been adopted in the past to enhance the capabilities of reinforcement learning, but this is the first time memories of past events have been encoded in this way. The approach is somewhat similar to encoding in a generative neural network via a variational autoencoder.
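The variational-autoencoder analogy can be made concrete with a toy objective: an observation is compressed into a latent code, and the code is judged both by how well the observation can be reconstructed from it and by how close its distribution stays to a prior. This sketch uses random placeholder weights rather than a trained model, and fixes the encoder variance to one for brevity; none of the dimensions or names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 2
W_enc = rng.normal(size=(latent_dim, obs_dim)) * 0.1  # untrained encoder
W_dec = rng.normal(size=(obs_dim, latent_dim)) * 0.1  # untrained decoder

def encode(x):
    """Map an observation to the mean and log-variance of a latent code."""
    mu = W_enc @ x
    log_var = np.zeros(latent_dim)  # unit variance, fixed for brevity
    return mu, log_var

def reparameterize(mu, log_var):
    """Sample a latent code z = mu + sigma * eps (the reparameterisation trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x):
    """Reconstruction error plus KL divergence from the unit-Gaussian prior."""
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var)
    recon = W_dec @ z
    recon_err = np.mean((x - recon) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon_err + kl

x = rng.normal(size=obs_dim)
print(vae_loss(x))  # scalar objective; lower means a more faithful memory
```

Training against an objective like this is what forces the stored codes to actually retain the content of past events, which is the "reconstructive" part of the RMA's name.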
The result of the research has drawn attention worldwide, as this approach outperformed traditional RL models. However, since all of this was carried out through simulation in a game, one cannot yet expect the model to perform equally well in the messier physics of the real world.