Recently, researchers from Intel Labs and the University of Southern California introduced an AI system known as Sample Factory, which optimises the efficiency of reinforcement learning algorithms in a single-machine setting. Sample Factory is a high-throughput training system that combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques.
Over the past few years, researchers in industry and academia have achieved several ground-breaking results, both in training sophisticated agents for video games and in sim-to-real transfer for robotics. These results were achieved by increasing the scale of reinforcement learning experiments.
However, such experiments rely on large distributed systems and require expensive hardware setups. Billion-scale experiments with complex environments have become commonplace in this line of research, and the most advanced efforts consume trillions of environment transitions in a single training session. This, as a result, limits broader access to this exciting area of research.
This is where Sample Factory comes into play. According to the researchers, Sample Factory mitigates this issue by optimising the efficiency and resource utilisation of reinforcement learning algorithms instead of relying on distributed computations.
Behind Sample Factory
Sample Factory is an architecture for high-throughput reinforcement learning on a single machine. It is built around an Asynchronous Proximal Policy Optimisation (APPO) algorithm, a reinforcement learning architecture that aggressively parallelises experience collection and achieves throughput as high as 130,000 FPS (environment frames per second) on a single multi-core compute node with only one GPU.
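At its core, APPO builds on PPO's clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. A minimal NumPy sketch of that objective follows; the asynchronous machinery and off-policy corrections that distinguish APPO from plain PPO are omitted here, and the function name is illustrative, not from the Sample Factory codebase:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss at the heart of PPO-style updates.

    The probability ratio between the new and old policies is clipped
    so that a single gradient step cannot push the policy too far from
    the behaviour policy that generated the trajectories.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximises this objective, so the training loss is its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new policy equals the old one, the ratio is exactly 1 and the loss reduces to the negative mean advantage, which is a handy sanity check when implementing the objective.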
A typical reinforcement learning scenario involves three major computational workloads: environment simulation, model inference, and backpropagation. Since the overall throughput of the algorithm is ultimately defined by the workload with the lowest throughput, the key motivation of this research was to build a system in which the slowest of the three workloads never has to wait for any other process to provide the data necessary for its next computation.
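The bottleneck argument can be made concrete with a toy calculation. In a fully serialised loop, every frame pays the cost of all three stages; in a pipeline that overlaps the stages, throughput is limited only by the slowest one. The per-frame timings below are made up purely for illustration:

```python
# Hypothetical per-frame costs (milliseconds) for the three workloads.
simulation_ms = 0.5   # environment step
inference_ms = 0.2    # forward pass to select actions
backprop_ms = 0.3     # learner update, amortised per frame

# Serial loop: every frame pays for all three stages in sequence.
serial_fps = 1000.0 / (simulation_ms + inference_ms + backprop_ms)

# Overlapped pipeline: throughput is set by the slowest stage alone.
pipelined_fps = 1000.0 / max(simulation_ms, inference_ms, backprop_ms)

print(f"serial: {serial_fps:.0f} FPS, pipelined: {pipelined_fps:.0f} FPS")
# → serial: 1000 FPS, pipelined: 2000 FPS
```

The gap widens further when each stage is itself parallelised, which is exactly what assigning workloads to independent components enables.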
To minimise the idle time for all key computations, the researchers associated each computational workload with one of three dedicated types of components. These components communicate with each other using a fast protocol based on FIFO queues and shared memory.
Here, the queueing mechanism provides the basis for continuous and asynchronous execution, where the next computation step can be started immediately as long as there is something in the queue to process. The decision to assign each workload to a dedicated component type also allowed the researchers to parallelise them independently, thereby achieving optimised resource balance.
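The decoupled, queue-driven design described above can be sketched in a few lines. The sketch below uses threads and `queue.Queue` purely to stay self-contained and runnable; Sample Factory itself uses separate processes communicating through shared memory, and the worker names here are illustrative:

```python
import queue
import threading

# Each component type runs in its own worker and pulls work from a FIFO
# queue, so a stage starts its next computation as soon as an item is
# available, never waiting on the full pipeline.
rollout_q = queue.Queue()   # observations -> policy worker
learner_q = queue.Queue()   # trajectories -> learner
results = []

def rollout_worker(n_steps):
    for step in range(n_steps):
        rollout_q.put(f"obs-{step}")   # stand-in for an env transition
    rollout_q.put(None)                # sentinel: no more work

def policy_worker():
    while (obs := rollout_q.get()) is not None:
        learner_q.put(f"traj({obs})")  # stand-in for model inference
    learner_q.put(None)

def learner():
    while (traj := learner_q.get()) is not None:
        results.append(traj)           # stand-in for backpropagation

workers = [
    threading.Thread(target=rollout_worker, args=(4,)),
    threading.Thread(target=policy_worker),
    threading.Thread(target=learner),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(results)  # all four trajectories, processed in FIFO order
```

Because each worker blocks only on its own input queue, the three stages run concurrently and can be scaled independently, mirroring the resource-balancing idea behind Sample Factory's component types.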
The researchers further evaluated the algorithm on a set of challenging environments, using three reinforcement learning domains for benchmarking: Atari, VizDoom, and DeepMind Lab.
Benefits of Sample Factory
Through this research, the researchers aimed to democratise deep reinforcement learning and make it possible to train whole populations of agents on billions of environment transitions using widely available commodity hardware. Sample Factory can benefit any project that leverages model-free reinforcement learning.
Sample Factory can also be used as a single node in a distributed setup, where each machine has a sampler and a learner. The researchers extended Sample Factory to support self-play and population-based training, and stated, “With our system architecture, researchers can iterate on their ideas faster, thus accelerating progress in the field.”
In this research, the researchers presented an efficient, high-throughput reinforcement learning architecture that combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques. This combination achieves throughput higher than 10^5 environment frames per second on non-trivial 3D control problems without sacrificing sample efficiency.
Read the paper here.