
Intel’s New AI System Can Optimise Reinforcement Learning Training On A Single System


Recently, researchers from Intel Labs and the University of Southern California introduced an AI system known as Sample Factory that optimises the efficiency of reinforcement learning algorithms in a single-machine setting. Sample Factory is a high-throughput training system that combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques.

Over the past few years, researchers in industry and academia have achieved several ground-breaking results, both in training sophisticated agents for video games and in sim-to-real transfer for robotics. These results were achieved largely by increasing the scale of reinforcement learning experiments.

However, such experiments rely on large distributed systems and require expensive hardware setups. Billion-scale experiments with complex environments have become commonplace in this line of research, and the most advanced efforts consume trillions of environment transitions in a single training session. This, as a result, limits broader access to this exciting area of research.

This is where Sample Factory comes into play. According to the researchers, Sample Factory mitigates this issue by optimising the efficiency and resource utilisation of reinforcement learning algorithms instead of relying on distributed computations. 

Behind Sample Factory

Sample Factory is an architecture for high-throughput reinforcement learning in a single-machine setting. It is built around an Asynchronous Proximal Policy Optimisation (APPO) algorithm, which aggressively parallelises experience collection and achieves throughput as high as 130,000 environment frames per second (FPS) on a single multi-core compute node with only one GPU.
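For context, the sketch below shows the kind of clipped surrogate objective a PPO-style learner optimises. The importance ratio between the current policy and the slightly stale behaviour policy that collected the data provides a simple form of off-policy correction of the sort mentioned above. This is only an illustrative sketch; the function and variable names are not taken from the Sample Factory codebase.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO-style learners (illustrative).

    The ratio pi_new(a|s) / pi_old(a|s) corrects for the policy lag between
    collection and learning; clipping keeps the update conservative.
    """
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximising the surrogate is equivalent to minimising its negative.
    return -torch.min(unclipped, clipped).mean()
```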

A typical reinforcement learning scenario involves three major computational workloads: environment simulation, model inference, and backpropagation. The key motivation of this research was to build a system in which the slowest of the three workloads never has to wait for any other process to provide the data necessary for the next computation, since the overall throughput of the algorithm is ultimately defined by the workload with the lowest throughput.

To minimise the idle time for all key computations, the researchers associated each computational workload with one of three dedicated types of components. These components communicate with each other using a fast protocol based on FIFO queues and shared memory.

Here, the queueing mechanism provides the basis for continuous and asynchronous execution, where the next computation step can be started immediately as long as there is something in the queue to process. The decision to assign each workload to a dedicated component type also allowed the researchers to parallelise them independently, thereby achieving optimised resource balance.
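To make this concrete, here is a minimal, heavily simplified sketch of three component types exchanging work through FIFO queues, so that each component keeps computing as long as its input queue is non-empty. It is not the actual Sample Factory code: make_env, model.act, and model.update are hypothetical placeholders, and the real system batches inference requests and uses shared memory for large tensors.

```python
import multiprocessing as mp

ROLLOUT_LEN = 32  # illustrative rollout length

def rollout_worker(worker_id, obs_queue, action_queue, trajectory_queue):
    """Simulates the environment; sends observations out, receives actions back."""
    env = make_env()                           # hypothetical environment factory
    obs = env.reset()
    trajectory = []
    while True:
        obs_queue.put((worker_id, obs))        # ask the policy worker for an action
        action = action_queue.get()
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs
        if len(trajectory) == ROLLOUT_LEN:
            trajectory_queue.put(trajectory)   # hand a finished rollout to the learner
            trajectory = []

def policy_worker(obs_queue, action_queues, model):
    """Runs model inference; in practice many requests are batched on the GPU."""
    while True:
        worker_id, obs = obs_queue.get()
        action_queues[worker_id].put(model.act(obs))   # model.act is illustrative

def learner(trajectory_queue, model):
    """Consumes completed rollouts and runs backpropagation."""
    while True:
        batch = trajectory_queue.get()
        model.update(batch)                    # e.g. a PPO-style update as sketched above
```

Because each component type runs in its own processes and only blocks on its own queue, the three workloads can be scaled and parallelised independently.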

The researchers further evaluated the algorithm on a set of challenging environments, using three reinforcement learning domains for benchmarking: Atari, VizDoom, and DeepMind Lab.

Benefits of Sample Factory

Through this research, the researchers aimed to democratise deep reinforcement learning and make it possible to train whole populations of agents on billions of environment transitions using widely available commodity hardware. It can benefit any project that leverages model-free reinforcement learning.

Sample Factory can also be used as a single node in a distributed setup, where each machine has a sampler and a learner. The researchers extended Sample Factory to support self-play and population-based training, and stated, "With our system architecture, researchers can iterate on their ideas faster, thus accelerating progress in the field."
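As a rough illustration of what population-based training involves, the sketch below shows one exploit-and-explore step in which weaker agents copy the weights and hyperparameters of stronger ones and then perturb the hyperparameters. The data layout (a list of dicts with 'score', 'weights', and 'hparams') is an assumption made for this example, not Sample Factory's actual interface.

```python
import copy
import random

def pbt_step(population, perturb=0.2):
    """One illustrative population-based-training step over a list of agents."""
    ranked = sorted(population, key=lambda a: a['score'], reverse=True)
    top = ranked[:len(ranked) // 4]
    bottom = ranked[-(len(ranked) // 4):]
    for agent in bottom:
        donor = random.choice(top)
        agent['weights'] = copy.deepcopy(donor['weights'])         # exploit the donor
        agent['hparams'] = {k: v * random.choice([1 - perturb, 1 + perturb])
                            for k, v in donor['hparams'].items()}  # explore nearby settings
    return population
```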

Wrapping Up

In this research, the team presented an efficient, high-throughput reinforcement learning architecture. The architecture combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques, allowing it to achieve throughput higher than 10^5 environment frames per second on non-trivial 3D control problems without sacrificing sample efficiency.

Read the paper here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.