In 2016, DeepMind’s AlphaGo defeated Lee Sedol, the 18-time world champion of Go, in a watershed moment for AI. DeepMind’s computer programs like AlphaGo, AlphaZero, AlphaStar, Player of Games and MuZero have surpassed human-level performance in Atari games and board games like Go, chess and shogi. Now, DeepMind is taking it up a notch, stepping out of the research domain and into the real world.
MuZero, with its ability to master several games with little predefined knowledge, is a significant step in DeepMind’s pursuit of general-purpose algorithms. The computer program leverages reinforcement learning (RL) to come up with winning strategies in uncharted environments.
MuZero’s RL for sequential decision making
Planning effectively in unknown and complex domains has long been a challenge in artificial intelligence. MuZero addresses this by learning a model that focuses only on the aspects of the environment that matter most for planning. The model-based reinforcement learning algorithm takes advantage of AlphaZero’s powerful lookahead tree search for planning. MuZero set a new state-of-the-art result on the Atari benchmark while matching the performance of AlphaZero in the classic planning challenges of Go, chess and shogi.
What sets MuZero apart is that it models only the aspects important to the agent’s decision-making process, not the entire environment. It models three critical elements through a deep learning framework:
- The value determining how good the current position is
- The policy determining the best action to take
- The reward determining how good the action taken was
Rather than collecting new data from each environment, MuZero can use this learned model repeatedly to improve its planning.
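To make the idea concrete, here is a toy sketch of what planning with a learned model can look like. It is not DeepMind’s implementation: the class `ToyLearnedModel`, its hand-written formulas and the `unroll` helper are all hypothetical stand-ins for the deep networks MuZero actually trains. The sketch only shows that value, policy and reward can be predicted by stepping through the model, without querying the real environment.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    value: float         # how good the current position is
    policy: List[float]  # preference over the actions available next
    reward: float        # how good the action just taken was


class ToyLearnedModel:
    """Hypothetical stand-in for a learned model (real MuZero uses neural networks)."""

    def __init__(self, num_actions: int) -> None:
        self.num_actions = num_actions

    def represent(self, observation: List[float]) -> float:
        # Compress the observation into a hidden state (here, a single number).
        return sum(observation) / len(observation)

    def step(self, hidden: float, action: int) -> Tuple[float, float]:
        # Predict the next hidden state and the reward for taking `action`.
        next_hidden = 0.9 * hidden + 0.1 * action
        reward = -abs(next_hidden - 1.0)  # toy goal: drive the state towards 1.0
        return next_hidden, reward

    def predict(self, hidden: float) -> Tuple[float, List[float]]:
        # Predict the value of a hidden state and a softmax policy over actions.
        value = -abs(hidden - 1.0)
        scores = [self.step(hidden, a)[1] for a in range(self.num_actions)]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        return value, [e / total for e in exps]


def unroll(model: ToyLearnedModel, observation: List[float],
           actions: List[int]) -> List[Prediction]:
    """Imagine a sequence of actions using only the model, not the environment."""
    hidden = model.represent(observation)
    predictions = []
    for action in actions:
        hidden, reward = model.step(hidden, action)
        value, policy = model.predict(hidden)
        predictions.append(Prediction(value, policy, reward))
    return predictions


imagined = unroll(ToyLearnedModel(num_actions=2), [0.0, 0.0], [1, 1])
```

In MuZero proper, a tree search would call such a model many times to compare candidate action sequences before committing to a single real action.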
MuZero for YouTube video optimisation
To test MuZero’s ability to make decisions in real-world scenarios, DeepMind collaborated with YouTube to tackle the compression problem in video streaming. MuZero has improved the state of the art in video compression. “Since launching to production on a portion of YouTube’s live traffic, we’ve demonstrated an average 4% bitrate reduction across a large, diverse set of videos,” the team claimed.
YouTube uses codecs to compress video. A codec makes multiple decisions for each video frame and is responsible for compressing the video at its source, transmitting it to the viewer and decompressing it again during playback. Codecs are hand-engineered and work in the background of video on demand, video calls, video games and VR. DeepMind has explored how RL algorithms can help with the sequential decision-making problems in codecs, specifically in libvpx, the open-source implementation of the VP9 codec. VP9 is an open video coding format developed by Google. MuZero’s ability to learn a model of its environment and plan decisions with it comes in handy here.
VP9’s bitrate ranges from 0.20 Mbit/s to 480 Mbit/s across 14 levels. The bitrate signifies the number of ones and zeros required to send a video frame, and it drives the compute and bandwidth needed to store and serve videos. The bitrate also affects buffering time and data usage. In VP9, the quantisation parameter (QP) determines the level of compression applied to each frame. The QP selection algorithm assigns a QP to each frame to maximise overall video quality, while accounting for how the QP chosen for one frame affects the bitrate left for the rest of the video frames. DeepMind’s RL agent plays the role of a sequential decision-maker in this context. It tackles the problem of learning a rate control policy to select the QPs in the encoding process of libvpx.
In real-world video, the problem is huge in scope. Users upload videos of varying size and quality to YouTube, making it necessary for an AI agent to generalise across the media. MuZero combines the power of search with its ability to learn a model of the environment, and plans accordingly when making sequential decisions for the codec.
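Why QP selection is a sequential problem can be seen in a small toy example. The bit model `frame_bits`, the list of QP choices and the greedy `choose_qps` rule below are invented for illustration and are not how libvpx or MuZero-RC work. They only show that the bits spent on one frame change what is left for the frames that follow.

```python
from typing import List, Sequence, Tuple


def frame_bits(complexity: float, qp: int) -> float:
    # Toy bit model (an assumption): a higher QP means coarser
    # quantisation, so the frame costs fewer bits.
    return complexity / (1 + qp)


def choose_qps(complexities: Sequence[float], bit_budget: float,
               qp_choices: Sequence[int] = (10, 20, 40, 80)) -> Tuple[List[int], float]:
    """Greedy hand-written rate control: one QP per frame, in order."""
    remaining = bit_budget
    qps = []
    for i, complexity in enumerate(complexities):
        frames_left = len(complexities) - i
        fair_share = remaining / frames_left
        # Pick the lowest QP (best quality) that fits this frame's fair share
        # of the remaining bits; fall back to the highest QP if none fits.
        qp = next((q for q in sorted(qp_choices)
                   if frame_bits(complexity, q) <= fair_share),
                  max(qp_choices))
        remaining -= frame_bits(complexity, qp)
        qps.append(qp)
    return qps, remaining


# The middle frame is four times as complex as its neighbours.
qps, leftover = choose_qps([1000.0, 4000.0, 1000.0], bit_budget=150.0)
print(qps, round(leftover, 2))
```

A hand-written rule like this looks only at the current frame. A learned rate controller can instead plan ahead, for example by saving bits on simple frames in anticipation of complex ones.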
MuZero-RC replacing the frames on VP9’s default rate control mechanism
The self-competition mechanism for the MuZero Rate-Controller
The researchers have created a self-competition mechanism for MuZero to strengthen its ability to deal with these challenges. The process converts the complex objectives of video compression into a WIN or LOSS signal by comparing the agent’s current performance against its historical performance. Essentially, the team has converted the complicated codec requirements into a simple signal for the agent to optimise. This allows MuZero to learn the dynamics of video encoding, based on which it allocates bits. The MuZero Rate-Controller (MuZero-RC) is designed to reduce bitrate without degrading video quality.
MuZero-RC was evaluated against libvpx’s two-pass VBR rate control implementation. The evaluation set includes 3062 video clips from the YouTube UGC dataset. MuZero-RC achieved an average 6.28% reduction in bitrate compared with the baseline. It also demonstrated better bitrate constraint satisfaction and can now be readily deployed in libvpx via the SimpleEncode API.
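A rough sketch of such a signal is shown below. The details are assumptions made for illustration: the single `score` per encode, the exponential moving average used as the historical baseline, the `decay` value and the `SelfCompetition` class itself are not taken from DeepMind’s code.

```python
from typing import Dict


class SelfCompetition:
    """Turns a raw encoding score into a WIN (+1) or LOSS (-1) signal."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        # Per-video running average of past scores (hypothetical baseline).
        self.history: Dict[str, float] = {}

    def signal(self, video_id: str, score: float) -> int:
        baseline = self.history.get(video_id)
        if baseline is None:
            # First attempt on this video: nothing to compete against yet.
            self.history[video_id] = score
            return 0
        outcome = 1 if score > baseline else -1
        # Fold the new score into the historical baseline.
        self.history[video_id] = self.decay * baseline + (1 - self.decay) * score
        return outcome


tracker = SelfCompetition(decay=0.5)
first = tracker.signal("clip_a", 40.0)   # no history yet
second = tracker.signal("clip_a", 42.0)  # beats the baseline of 40.0
third = tracker.signal("clip_a", 40.5)   # below the updated baseline of 41.0
```

Because the baseline moves with the agent’s own results, the bar rises as the agent improves, so the same simple signal keeps pushing it forward.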
Video encoded with prior QP heuristics
Video encoded with MuZero-RC
What’s in store
The team envisions using MuZero for applications beyond video compression and in research environments for RL agents to solve real-world problems. QP selection is just one of the major decisions in the encoding process. In the future, the team will explore a single algorithm that can learn to make several encoding decisions for an optimal rate-distortion tradeoff. The team’s broader goal is a single algorithm that can optimise thousands of real-world systems and make computer systems faster, more automated and less resource-intensive.