In a paper published by DeepMind, a team of scientists from Alphabet’s UK-based research division claims that Agent57 is the first deep reinforcement learning agent to outperform humans on all 57 Atari 2600 games in the Arcade Learning Environment benchmark.
Agent57 combines an algorithm for efficient exploration with a meta-controller that adapts both the exploration and the long- versus short-term behaviour of the agent. The study’s co-authors wrote, “With Agent57, we have succeeded in building a more generally intelligent agent that has above-human performance on all tasks in the Atari57 benchmark. Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got.”
With this development, Agent57 could be used to build more capable artificial intelligence (AI) decision-making models than those available today. Such models could benefit organizations that use them to increase productivity and streamline operations through workplace automation. AI would then be able not only to perform small repetitive tasks, but also to adapt to its environment.
Arcade Learning Environment
According to the researchers, the Arcade Learning Environment was selected as a platform for evaluating the agent’s design and its competency across a range of games. The Atari 2600 games were chosen because they provide an environment that is believed to be both challenging and engaging for human players.
Earlier, a system from OpenAI and DeepMind showed high performance in games such as Pong and Enduro, and DeepMind’s MuZero surpassed human scores on 51 games. However, this is the first time an algorithm has exceeded the human benchmark on all 57 games in the Arcade Learning Environment.
Reinforcement Learning Challenges
To achieve these results, the researchers ran Agent57 on many computers simultaneously and used reinforcement learning (RL), in which AI-driven software agents choose actions so as to maximize a reward. RL has already delivered strong results in several games. For example, OpenAI’s OpenAI Five and DeepMind’s own AlphaStar RL agents beat 99.4% of Dota 2 players and 99.8% of StarCraft 2 players, respectively, on public servers. However, the researchers did not consider these results perfect.
They observed several problems. One is exploration, where agents fail to make progress while searching for patterns in seemingly random data. Another is catastrophic forgetting, where agents forget previously learned information when new information is fed to them. The researchers also found credit assignment troublesome, that is, deciding which actions deserve credit for positive or negative outcomes.
To overcome these hurdles, the team built on its Never Give Up (NGU) technique, which adds an internally generated reward signal that is sensitive to novelty at two levels: short-term, within a single episode, and long-term, across several episodes. Using an episodic memory, NGU learns a family of policies for exploring and exploiting, with the end goal of obtaining the highest score under the exploitative policy.
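The two-level reward signal can be illustrated with a toy sketch. It assumes hashable states and uses simple visit counts as a stand-in for novelty. The class name `NoveltyReward`, the `max_multiplier` cap and the weight `beta` are hypothetical choices for illustration, not the published method.

```python
import math
from collections import Counter


class NoveltyReward:
    """Toy intrinsic reward with per-episode and lifetime novelty (illustrative only)."""

    def __init__(self, max_multiplier=5.0):
        self.episode_counts = Counter()   # cleared at every episode boundary
        self.lifetime_counts = Counter()  # persists across episodes
        self.max_multiplier = max_multiplier

    def start_episode(self):
        self.episode_counts.clear()

    def reward(self, state):
        self.episode_counts[state] += 1
        self.lifetime_counts[state] += 1
        # Short-term novelty: decays as the state is revisited within this episode.
        short_term = 1.0 / math.sqrt(self.episode_counts[state])
        # Long-term novelty: a multiplier that shrinks towards 1 as the state
        # becomes familiar over the agent's whole lifetime.
        long_term = 1.0 + (self.max_multiplier - 1.0) / math.sqrt(self.lifetime_counts[state])
        return short_term * min(max(long_term, 1.0), self.max_multiplier)


def combined_reward(extrinsic, intrinsic, beta):
    """Mix the game score with the exploration bonus; beta = 0 is pure exploitation."""
    return extrinsic + beta * intrinsic
```

In this sketch, revisiting a state within an episode lowers its bonus quickly, while the lifetime counts lower it slowly across episodes. A policy with `beta = 0` ignores the bonus entirely and corresponds to the exploitative end of the policy family.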
Agent57 is designed so that many actors feed data into a centralized repository (a replay buffer) that a learner can sample. The repository holds sequences of transitions that are regularly pruned and that come directly from the actor processes. Each actor interacts with its own independent copy of the game environment.
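The actor and repository design described above can be sketched in a few lines. This is a single-process toy under stated assumptions: `ToyEnv`, `run_actor` and the fixed sequence length are hypothetical, actors act randomly, and they run one after another rather than in parallel as in a real distributed system.

```python
import random
from collections import deque


class ReplayBuffer:
    """Central repository of transition sequences shared by all actors."""

    def __init__(self, capacity):
        # A bounded deque prunes the oldest sequences once capacity is reached.
        self.sequences = deque(maxlen=capacity)

    def add(self, sequence):
        self.sequences.append(sequence)

    def sample(self, batch_size, rng):
        k = min(batch_size, len(self.sequences))
        return rng.sample(list(self.sequences), k)


class ToyEnv:
    """Hypothetical stand-in for a game: the state is just a step counter."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward


def run_actor(actor_id, env, buffer, rng, seq_len=4):
    """Collect one fixed-length sequence of transitions and push it to the buffer."""
    sequence = []
    state = env.t
    for _ in range(seq_len):
        action = rng.choice([0, 1])  # random behaviour, for illustration only
        next_state, reward = env.step(action)
        sequence.append((actor_id, state, action, reward, next_state))
        state = next_state
    buffer.add(sequence)


rng = random.Random(0)
buffer = ReplayBuffer(capacity=8)
envs = [ToyEnv() for _ in range(3)]  # one independent environment copy per actor
for _ in range(5):
    for actor_id, env in enumerate(envs):
        run_actor(actor_id, env, buffer, rng)

batch = buffer.sample(batch_size=4, rng=rng)  # what a learner would train on
```

The three actors produce 15 sequences in total, but the buffer keeps only the 8 most recent, which is the pruning behaviour in miniature. The learner never touches an environment; it only samples from the buffer.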
The research team used two different AI models to estimate how well an agent performs an action under a given policy (exploit or explore). This allowed the agent to adapt to the scale and variance of the reward associated with each policy. The team also included a meta-controller that runs independently on each actor and adaptively selects which policies to run during training and evaluation.
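One way to picture a meta-controller that adaptively picks among policies is as a multi-armed bandit, where each arm is a policy and the payoff is the episode return. The sketch below uses plain UCB1 as a simplified stand-in; the class name `UCBMetaController` and the `exploration` constant are assumptions for illustration, not the authors' implementation.

```python
import math


class UCBMetaController:
    """Pick which policy to run next, based on the returns observed so far."""

    def __init__(self, num_policies, exploration=1.0):
        self.counts = [0] * num_policies
        self.mean_returns = [0.0] * num_policies
        self.exploration = exploration

    def select(self):
        # Try every policy once before trusting the statistics.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + self.exploration * math.sqrt(math.log(total) / count)
            for mean, count in zip(self.mean_returns, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, episode_return):
        # Incremental running mean of the returns seen for this policy.
        self.counts[arm] += 1
        self.mean_returns[arm] += (episode_return - self.mean_returns[arm]) / self.counts[arm]
```

In use, each actor would hold its own controller, call `select()` at the start of an episode, run the chosen policy, and then call `update()` with the return it obtained. Policies that pay off are chosen more often, while the square-root bonus keeps rarely tried policies from being abandoned entirely.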
To gauge Agent57’s performance, the researchers compared it with leading algorithms, including MuZero and R2D2. According to the team, MuZero achieved the highest mean (5661.84) and median (2381.51) scores across all 57 games, but failed badly in games like Venture. Agent57 showed a better capped mean performance (100) than both R2D2 (96.93) and MuZero (89.92). It took five billion frames to beat human performance on 51 games and 78 billion frames to beat humans in Skiing.
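A high mean score and a failure on individual games can coexist, which is why per-game capping matters. A common convention for Atari results is the human-normalized score, where 0 is random play and 100 is the human baseline; capping each game at 100 before averaging stops a few runaway games from hiding failures elsewhere. The sketch below uses made-up numbers and hypothetical function names purely for illustration.

```python
def human_normalized(agent_score, random_score, human_score):
    """Scale a raw game score so that random play is 0 and the human baseline is 100."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


def mean(values):
    return sum(values) / len(values)


def capped_mean(normalized_scores, cap=100.0):
    """Average after clipping every game at the human baseline."""
    return mean([min(score, cap) for score in normalized_scores])


# Made-up human-normalized scores on three games, for illustration only.
agent_a = [1000.0, 1000.0, 0.0]  # far above human on two games, random-level on one
agent_b = [120.0, 110.0, 105.0]  # modestly above human on every game

print(mean(agent_a), capped_mean(agent_a))
print(mean(agent_b), capped_mean(agent_b))
```

Here `agent_a` has by far the higher plain mean, yet its capped mean falls well below 100 because of the one game it fails. `agent_b` reaches the maximum capped mean of 100 because it clears the human baseline everywhere.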
According to the researchers, the meta-controller on its own improved performance by about 20% compared with R2D2, even in games with long-term credit assignment such as Skiing and Solaris, where agents must gather information over long time scales before they receive the feedback they need to learn.