After Poker And Go, Reinforcement Learning Is Now Beating Mahjong Players

Ambika Choudhury

For the first time, an AI model has outperformed top players in the game of Mahjong. Microsoft Research Asia designed the model, known as Suphx, and evaluated it on Tenhou, the most popular and competitive Mahjong platform, which has more than 350,000 active users. Suphx has outperformed most top players in terms of stable rank: it is rated above 99.99% of all the officially ranked players on the Tenhou platform.

Games have become one of the most popular testbeds for reinforcement learning algorithms. AI researchers have already used deep reinforcement learning to beat human players in two-player and multi-player games such as Go and Texas Hold’em, as well as in Atari titles.

Companies like OpenAI and DeepMind have been active in this area. Last year, OpenAI published benchmark work aimed at detecting overfitting in reinforcement learning models. Why do researchers choose this approach? Because it enables the computer to make a series of decisions that maximizes a reward metric for the task, without human intervention and without being explicitly programmed to achieve the task.



Behind the Model

Suphx – short for Super Phoenix – is an AI system for four-player Japanese Mahjong (Riichi Mahjong). The training of Suphx is based on distributed reinforcement learning. The model adopts deep convolutional neural networks (CNNs) as the model architecture for its policy.

Due to the complex rules of Mahjong, Suphx learns five models to handle different situations. These are the discard model, the Riichi model, the Chow model, the Pong model, and the Kong model. Besides these, Suphx employs another rule-based winning model to decide whether to declare a winning hand and win the round.
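The division of labour between these models can be pictured as a small dispatcher. The sketch below is a simplification and not the Suphx implementation: the event labels, the function name `models_to_consult`, and the order in which the models are consulted are assumptions made for illustration.

```python
from typing import List

# The five learned models named in the article.
LEARNED_MODELS = ("discard", "riichi", "chow", "pong", "kong")


def models_to_consult(event: str, can_win: bool, legal_calls: List[str]) -> List[str]:
    """Return the ordered list of decision models consulted for one event.

    event       -- "self_draw" or "opponent_discard" (hypothetical labels)
    can_win     -- whether the rules allow declaring a winning hand right now
    legal_calls -- optional declarations that are currently legal
    """
    order: List[str] = []
    if can_win:
        # The winning decision is rule-based, not learned.
        order.append("winning (rule-based)")
    if event == "self_draw":
        # Assumed order: optional declarations first, then the mandatory discard.
        for name in ("kong", "riichi"):
            if name in legal_calls:
                order.append(name)
        order.append("discard")
    elif event == "opponent_discard":
        # Only melds that claim the discarded tile are relevant here.
        for name in ("pong", "kong", "chow"):
            if name in legal_calls:
                order.append(name)
    else:
        raise ValueError(f"unknown event: {event}")
    return order


print(models_to_consult("self_draw", False, ["riichi"]))
print(models_to_consult("opponent_discard", True, ["chow", "pong"]))
```

Each learned model only answers one narrow question (which tile to discard, whether to declare Riichi, and so on), which is one plausible reason for training them separately.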

The learning phase of Suphx consists of three major steps:

  • The five models of Suphx are trained by supervised learning, using (state, action) pairs of top players collected from the Tenhou platform.
  • The supervised models are improved through self-play reinforcement learning (RL), with the models serving as the policy. The researchers adopt the popular policy gradient algorithm and introduce global reward prediction and oracle guiding to handle the unique challenges of Mahjong.
  • During online play, the researchers employ run-time policy adaptation, which leverages new observations from the current round to perform even better.
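To give a feel for what a policy gradient update does, here is a minimal REINFORCE sketch on a toy three-action problem. It is not the Suphx training code: the toy rewards, the running-average baseline, the learning rate, and the function names are all assumptions, and it omits self-play, global reward prediction, and oracle guiding entirely.

```python
import math
import random


def softmax(logits):
    """Turn raw preferences into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy with REINFORCE on a noisy toy reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(mean_rewards)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(mean_rewards)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. each logit is 1[a == action] - pi(a).
        for a in range(len(logits)):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad_log_prob
    return softmax(logits)


probs = reinforce([0.1, 0.5, 1.0])
print([round(p, 3) for p in probs])
```

Actions that lead to above-average reward become more likely, which is the same principle that lets a self-play agent improve on its supervised starting point.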

Why Mahjong

According to the researchers, Mahjong is much more complicated than games such as chess and Go, which AI models have already played. It is a multi-round, tile-based game with imperfect information and multiple players. In each round, four players compete to be the first to complete a winning hand.

The researchers chose this game for three main reasons. Firstly, according to them, the game has complicated scoring rules. Each game of Mahjong contains multiple rounds, and the final ranking, as well as the reward of the game, is determined by the accumulated scores of those rounds. Furthermore, it has a vast number of possible winning hands, making the scoring rules more complex than those of previously studied games, including chess and Go.
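The point about accumulated round scores can be made concrete with a small sketch. The player labels, the score changes, and the function name `final_ranking` are invented for illustration, and tie-breaking rules are ignored.

```python
from typing import Dict, List


def final_ranking(round_deltas: List[Dict[str, int]]) -> List[str]:
    """Rank players by the sum of their per-round score changes."""
    totals: Dict[str, int] = {}
    for deltas in round_deltas:
        for player, delta in deltas.items():
            totals[player] = totals.get(player, 0) + delta
    # Highest accumulated score first; ties are not handled specially.
    return sorted(totals, key=lambda p: -totals[p])


# Hypothetical score changes for a three-round game.
rounds = [
    {"A": 8000, "B": -8000, "C": 0, "D": 0},
    {"A": -1000, "B": 3000, "C": -1000, "D": -1000},
    {"A": 0, "B": 0, "C": 12000, "D": -12000},
]
print(final_ranking(rounds))
```

In this made-up game, player A wins the first round comfortably yet finishes second, which shows why the result of a single round is a weak signal of how well a player did over the whole game.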

Secondly, the large amount of hidden information about the tiles makes Mahjong a much more difficult imperfect-information game than previously studied ones, such as Texas Hold’em poker. Thirdly, the playing rules of Mahjong are much more complicated because of the various actions involved.

Wrapping Up

The researchers claim that building a strong Mahjong program poses great challenges to current research on game AI. Furthermore, they claim that Suphx can help in solving complex real-world problems such as financial market prediction and logistics optimization. They stated, “We believe our techniques designed in Suphx for Mahjong, including global reward prediction, oracle guiding and parametric Monte-Carlo policy adaptation, have a great potential to benefit for a wide range of real-world applications.”

Read the paper here.

Copyright Analytics India Magazine Pvt Ltd
